linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	cff2f741b8	Driver core updates for 3.8-rc1 Here's the large driver core updates for 3.8-rc1. The biggest thing here is the various __dev* marking removals. This is going to be a pain for the merge with different subsystem trees, I know, but all of the patches included here have been ACKed by their various subsystem maintainers, as they wanted them to go through here. If this is too much of a pain, I can pull all of them out of this tree and just send you one with the other fixes/updates and then, after 3.8-rc1 is out, do the rest of the removals to ensure we catch them all, it's up to you. The merges should all be trivial, and Stephen has been doing them all in linux-next for a few weeks now quite easily. Other than the __dev* marking removals, there's nothing major here, some firmware loading updates and other minor things in the driver core. All of these have (much to Stephen's annoyance), been in linux-next for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlDHkPkACgkQMUfUDdst+ykaWgCfW7AM30cv0nzoVO08ax6KjlG1 KVYAn3z/KYazvp4B6LMvrW9y0G34Wmad =yvVr -----END PGP SIGNATURE----- Merge tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg Kroah-Hartman: "Here's the large driver core updates for 3.8-rc1. The biggest thing here is the various __dev* marking removals. This is going to be a pain for the merge with different subsystem trees, I know, but all of the patches included here have been ACKed by their various subsystem maintainers, as they wanted them to go through here. If this is too much of a pain, I can pull all of them out of this tree and just send you one with the other fixes/updates and then, after 3.8-rc1 is out, do the rest of the removals to ensure we catch them all, it's up to you. The merges should all be trivial, and Stephen has been doing them all in linux-next for a few weeks now quite easily. Other than the __dev* marking removals, there's nothing major here, some firmware loading updates and other minor things in the driver core. All of these have (much to Stephen's annoyance), been in linux-next for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed up trivial conflicts in drivers/gpio/gpio-{em,stmpe}.c due to gpio update. * tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (93 commits) modpost.c: Stop checking __dev* section mismatches init.h: Remove __dev* sections from the kernel acpi: remove use of __devinit PCI: Remove __dev* markings PCI: Always build setup-bus when PCI is enabled PCI: Move pci_uevent into pci-driver.c PCI: Remove CONFIG_HOTPLUG ifdefs unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs sh/PCI: Remove CONFIG_HOTPLUG ifdefs powerpc/PCI: Remove CONFIG_HOTPLUG ifdefs mips/PCI: Remove CONFIG_HOTPLUG ifdefs microblaze/PCI: Remove CONFIG_HOTPLUG ifdefs dma: remove use of __devinit dma: remove use of __devexit_p firewire: remove use of __devinitdata firewire: remove use of __devinit leds: remove use of __devexit leds: remove use of __devinit leds: remove use of __devexit_p mmc: remove use of __devexit ...	2012-12-11 13:13:55 -08:00
Linus Torvalds	bad73c5aa0	ACPI and power management updates for 3.8-rc1 * Introduction of device PM QoS flags. * ACPI device power management update allowing subsystems other than PCI to use it more easily. * ACPI device enumeration rework allowing additional kinds of devices to be enumerated via ACPI. From Mika Westerberg, Adrian Hunter, Mathias Nyman, Andy Shevchenko, and Rafael J. Wysocki. * ACPICA update to version 20121018 from Bob Moore and Lv Zheng. * ACPI memory hotplug update from Wen Congyang and Yasuaki Ishimatsu. * Introduction of acpi_handle_<level>() messaging macros and ACPI-based CPU hot-remove support from Toshi Kani. * ACPI EC updates from Feng Tang. * cpufreq updates from Viresh Kumar, Fabio Baltieri and others. * cpuidle changes to quickly notice governor prediction failure from Youquan Song. * Support for using multiple cpuidle drivers at the same time and cpuidle cleanups from Daniel Lezcano. * devfreq updates from Nishanth Menon and others. * cpupower update from Thomas Renninger. * Fixes and small cleanups all over the place. -- -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIcBAABAgAGBQJQxxevAAoJEKhOf7ml8uNsHacQAK2xoQozDddPBAaTCf1OWW/G J4E2qMn+gy4AtFmQ0/xeZVvvylOCn9eu/kv3QJ/bFlVoUsqTgfPwQBjT6HjF1Acn 7yVFdr9H/tn2wi9nW2Gv6Tl2Hr/kFLpo0f5Kd40Z+P8SV5sKX5ktJcVpJ/I/P4Vh Qw2nWtj7hQktZDERzgG4ZpMbQToNhbLhbKWB9ad3/XQSSA9JkfgvBFgrTEGHcZD5 bwsggjNdOAWNGeDdzRsQSDn0Alld5ILVdSJ5xKimO1O70WvKc7fqA1IdYRIeKL90 yz4bcoYKzl9iktlw8+x5o1U9mrc8TFV5p4+zV+t5Z6pzS/J3kWvnsW4zu9sCrxFv Wic3SKyiem7s2dxrYyj4ZXAci3GK4ouRTrCLqk7/00tEGdwAQD1ZNfsUJp6jKayz FvtZUgItcOyrlQ6B4nh951OY6dI3AUYJ2NuWWNr5NZkgVAvQGV8zTGOImbeVeL2+ pMiw14zScO3ylYilVcjTKDDMj2sDZ68mw5PIcbmksvWsCLo26jDBVDtLVmtYWyd4 ek3WnOrQZr0R3agvOLLssMKXompvpP+N4Klf4rV+GtqGsWtHryYKys2Laju9FwFj yYLchxYlxhGTzqq8LjF90HDL0TWpPe6cPi+B5ow9g/SXLexbMKNQGhv3Jovm2yR3 j54tKBWy7e9AAYEDPirX =6OEP -----END PGP SIGNATURE----- Merge tag 'pm+acpi-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: - Introduction of device PM QoS flags. - ACPI device power management update allowing subsystems other than PCI to use it more easily. - ACPI device enumeration rework allowing additional kinds of devices to be enumerated via ACPI. From Mika Westerberg, Adrian Hunter, Mathias Nyman, Andy Shevchenko, and Rafael J. Wysocki. - ACPICA update to version 20121018 from Bob Moore and Lv Zheng. - ACPI memory hotplug update from Wen Congyang and Yasuaki Ishimatsu. - Introduction of acpi_handle_<level>() messaging macros and ACPI-based CPU hot-remove support from Toshi Kani. - ACPI EC updates from Feng Tang. - cpufreq updates from Viresh Kumar, Fabio Baltieri and others. - cpuidle changes to quickly notice governor prediction failure from Youquan Song. - Support for using multiple cpuidle drivers at the same time and cpuidle cleanups from Daniel Lezcano. - devfreq updates from Nishanth Menon and others. - cpupower update from Thomas Renninger. - Fixes and small cleanups all over the place. * tag 'pm+acpi-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (196 commits) mmc: sdhci-acpi: enable runtime-pm for device HID INT33C6 ACPI: add Haswell LPSS devices to acpi_platform_device_ids list ACPI: add documentation about ACPI 5 enumeration pnpacpi: fix incorrect TEST_ALPHA() test ACPI / PM: Fix header of acpi_dev_pm_detach() in acpi.h ACPI / video: ignore BIOS initial backlight value for HP Folio 13-2000 ACPI : do not use Lid and Sleep button for S5 wakeup ACPI / PNP: Do not crash due to stale pointer use during system resume ACPI / video: Add "Asus UL30VT" to ACPI video detect blacklist ACPI: do acpisleep dmi check when CONFIG_ACPI_SLEEP is set spi / ACPI: add ACPI enumeration support gpio / ACPI: add ACPI support PM / devfreq: remove compiler error with module governors (2) cpupower: IvyBridge (0x3a and 0x3e models) support cpupower: Provide -c param for cpupower monitor to schedule process on all cores cpupower tools: Fix warning and a bug with the cpu package count cpupower tools: Fix malloc of cpu_info structure cpupower tools: Fix issues with sysfs_topology_read_file cpupower tools: Fix minor warnings cpupower tools: Update .gitignore for files created in the debug directories ...	2012-12-11 12:45:35 -08:00
Linus Torvalds	259cdbee20	irqdomain changes for Linux v3.8 Trivial changes to irqdomain. An update to the documentation and make one of the error paths not quite so obnoxious. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQx25bAAoJEEFnBt12D9kBmBkP/i4TONBT0TVWVKsK7RIdZpVF aQCC1W2gFuo5E8JITB6gfhvSw6LMiho9ThzX0NVP52SN6k5QWm0QzlB1sjQyaKph +rB7pepO9+MxYuz5zPz1RPNlBA7pvu+S2e7eDD0dtWfmJn9gjeubrschZINSU25E g0BYdy/eJTnNwOdsASNaXNN2Bu7/gO7EpV0Z5T8fCfbih9euyKcuGhHasXGdnHQX /PpYu87kq0FPYyLuHBo1+9WXIG++sGtlw8SPZ7kjPLFlnis4spiajoaXrwXWMyQN KJTdDLHh2DSRlxQEr+Ibw2ptajJ/iNIwQdQksFn1DzDzkcCDBgl8e+0sCCUhv0eb sQcf/jQ6clYmwH/7SCHo93kyLt7BpY4jocw8UL6XiITTY156I0u/4YTPJNCm/ai8 NajWsmt0otekZOmLpKQhDEsnWrAfVMfmRiQOBAhACt2KPS78FtNpt3b8jCWYyEV4 MB5aemsTdtRGSLCITe3nZ9W0CyFDS8fhgS310DV+y3QVa7KIT7oFYy2S8FuO2AOn HVRNjKqZJm1jwHdgSN4BXpve/KqxycTA3Kfz/SnaOs/JNP7oWxPY/X6MUrKHmcId eCciJ/hXknslgH6q+1OD/7RCJL7fAt0+9xmqzKStBGcfgswC4uIn3+gHmWUwdwlE 1f8NCvsRERl+79biglT8 =24xd -----END PGP SIGNATURE----- Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6 Pull irqdomain changes from Grant Likely: "Trivial changes to irqdomain. An update to the documentation and make one of the error paths not quite so obnoxious." * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: irqdomain: update documentation irqdomain: stop screaming about preallocated irqdescs	2012-12-11 11:30:08 -08:00
Lino Sanfilippo	e2a29943e9	fsnotify: pass group to fsnotify_destroy_mark() In fsnotify_destroy_mark() dont get the group from the passed mark anymore, but pass the group itself as an additional parameter to the function. Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Signed-off-by: Eric Paris <eparis@redhat.com>	2012-12-11 13:44:36 -05:00
Mel Gorman	5bca230353	mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node. This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place. Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:56 +00:00
Mel Gorman	3105b86a9f	mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG The "mm: sched: numa: Control enabling and disabling of NUMA balancing" depends on scheduling debug being enabled but it's perfectly legimate to disable automatic NUMA balancing even without this option. This should take care of it. Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:56 +00:00
Mel Gorman	1a687c2e9a	mm: sched: numa: Control enabling and disabling of NUMA balancing This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existance of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:55 +00:00
Mel Gorman	b8593bfda1	mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that but we at least know if we migrated or not. This patch scans slower if a page was not migrated as the result of a NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:55 +00:00
Mel Gorman	fb003b80da	sched: numa: Slowly increase the scanning period as NUMA faults are handled Currently the rate of scanning for an address space is controlled by the individual tasks. The next scan is simply determined by 2p->numa_scan_period. The 2p->numa_scan_period is arbitrary and never changes. At this point there is still no proper policy that decides if a task or process is properly placed. It just scans and assumes the next NUMA fault will place it properly. As it is assumed that pages will get properly placed over time, increase the scan window each time a fault is incurred. This is a big assumption as noted in the comments. It should be noted that changing to p->numa_scan_period will increase system CPU usage because now the scanning rate has effectively doubled. If that is a problem then the min_rate should be made 200ms instead of restoring the 2* logic. Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:51 +00:00
Mel Gorman	e14808b49f	mm: numa: Rate limit setting of pte_numa if node is saturated If there are a large number of NUMA hinting faults and all of them are resulting in migrations it may indicate that memory is just bouncing uselessly around. NUMA balancing cost is likely exceeding any benefit from locality. Rate limit the PTE updates if the node is migration rate-limited. As noted in the comments, this distorts the NUMA faulting statistics. Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:51 +00:00
Peter Zijlstra	4b96a29ba8	mm: sched: numa: Implement slow start for working set sampling Add a 1 second delay before starting to scan the working set of a task and starting to balance it amongst nodes. [ note that before the constant per task WSS sampling rate patch the initial scan would happen much later still, in effect that patch caused this regression. ] The theory is that short-run tasks benefit very little from NUMA placement: they come and go, and they better stick to the node they were started on. As tasks mature and rebalance to other CPUs and nodes, so does their NUMA placement have to change and so does it start to matter more and more. In practice this change fixes an observable kbuild regression: # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ] !NUMA: 45.291088843 seconds time elapsed ( +- 0.40% ) 45.154231752 seconds time elapsed ( +- 0.36% ) +NUMA, no slow start: 46.172308123 seconds time elapsed ( +- 0.30% ) 46.343168745 seconds time elapsed ( +- 0.25% ) +NUMA, 1 sec slow start: 45.224189155 seconds time elapsed ( +- 0.25% ) 45.160866532 seconds time elapsed ( +- 0.17% ) and it also fixes an observable perf bench (hackbench) regression: # perf stat --null --repeat 10 perf bench sched messaging -NUMA: -NUMA: 0.246225691 seconds time elapsed ( +- 1.31% ) +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% ) +NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% ) The implementation is simple and straightforward, most of the patch deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable knob. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote the changelog, ran measurements, tuned the default. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>	2012-12-11 14:42:47 +00:00
Mel Gorman	9f40604cda	sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges By accounting against the present PTEs, scanning speed reflects the actual present (mapped) memory. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>	2012-12-11 14:42:46 +00:00
Peter Zijlstra	6e5fb223e8	mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Previously, to probe the working set of a task, we'd use a very simple and crude method: mark all of its address space PROT_NONE. That method has various (obvious) disadvantages: - it samples the working set at dissimilar rates, giving some tasks a sampling quality advantage over others. - creates performance problems for tasks with very large working sets - over-samples processes with large address spaces but which only very rarely execute Improve that method by keeping a rotating offset into the address space that marks the current position of the scan, and advance it by a constant rate (in a CPU cycles execution proportional manner). If the offset reaches the last mapped address of the mm then it then it starts over at the first address. The per-task nature of the working set sampling functionality in this tree allows such constant rate, per task, execution-weight proportional sampling of the working set, with an adaptive sampling interval/frequency that goes from once per 100ms up to just once per 8 seconds. The current sampling volume is 256 MB per interval. As tasks mature and converge their working set, so does the sampling rate slow down to just a trickle, 256 MB per 8 seconds of CPU time executed. This, beyond being adaptive, also rate-limits rarely executing systems and does not over-sample on overloaded systems. [ In AutoNUMA speak, this patch deals with the effective sampling rate of the 'hinting page fault'. AutoNUMA's scanning is currently rate-limited, but it is also fundamentally single-threaded, executing in the knuma_scand kernel thread, so the limit in AutoNUMA is global and does not scale up with the number of CPUs, nor does it scan tasks in an execution proportional manner. So the idea of rate-limiting the scanning was first implemented in the AutoNUMA tree via a global rate limit. This patch goes beyond that by implementing an execution rate proportional working set sampling rate that is not implemented via a single global scanning daemon. ] [ Dan Carpenter pointed out a possible NULL pointer dereference in the first version of this patch. ] Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote changelog and fixed bug. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>	2012-12-11 14:42:46 +00:00
Peter Zijlstra	cbee9f88ec	mm: numa: Add fault driven placement and migration NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com>	2012-12-11 14:42:45 +00:00
Ingo Molnar	c1ad41f1f7	Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled" This reverts commit `5258f386ea`, because the underlying autogroups bug got fixed upstream in a better way, via: `fd8ef11730` Revert "sched, autogroup: Stop going ahead if autogroup is disabled" Cc: Mike Galbraith <efault@gmx.de> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-11 10:23:45 +01:00
Ingo Molnar	cc1b39dbf9	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull ftrace updates from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-08 15:54:35 +01:00
Ingo Molnar	7e0dd574cd	Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core Pull uprobes fixes, cleanups and preparation for the ARM port from Oleg Nesterov. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-08 15:51:10 +01:00
Ingo Molnar	38130ec087	Some more cputime cleanups: * Get rid of underscores polluting the vtime namespace * Consolidate context switch and tick handling * Improve debuggability by detecting irq unsafe callers Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQq52nAAoJEIUkVEdQjox3ZRoP/RuDC59hNGu3rR0ERMM3TqW3 SIaMSHlHQh3h8P8OASpRqBb9s0BoWD0l3xZ68TEACnRdLS50Rre2P0SSxqpkdbnL cj0+I7gmbxKa9c9zpm+mn1TvL2bEhg6hkWCMK9jn2SSBl33cOqKUGUfy8Gx0nryc q+cOZrXSgMvYKCixGubCqsTl8MKs9CrpyrLSYtFUiHFVWREPfndS9M9BB5yfKHL1 t9qmdb5WRq2NpU6apoZBMBdPQcmQr5WswLpbhoTocpvCiEmt5RZGkSDOawPa1DHP 2SPM7fGZIDrXCMW/g9d2mt43j/HxS9LYu9lToZCbbMqehe2Bf5jYqO1Kwi7FhedR NSofoXbW/j589+7I+pN66lo0pctNWxd59YDvLw22SqUFcBEUSmypM6eUwbrbVUg7 /H0a8T/5bPwx2ukNrCW0+Zsd9X3If4K290j4lNOMLki9ikYG6IXfGw1GMwsiyFSo LNSnDs0ekovvWOAg1iRq8DW8j/TWoZuZUSRME2LdCde9SbkMEGgWaNYCwNLMenie 6jZHar7SfpdRDPP6NCY85jMy5MRbyN3mzSFhMfqMKQgmFNd7ay7oRKppIkwT+qkD VozCvdPmCxd+orNMbWINDAhNY5RUlcPj/Em8Mue1U152rpjfNt/WZOfmujmLwNW2 /RPQtHo+F7w7KhbylFpx =cL3t -----END PGP SIGNATURE----- Merge tag 'sched-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core Pull more cputime cleanups from Frederic Weisbecker: * Get rid of underscores polluting the vtime namespace * Consolidate context switch and tick handling * Improve debuggability by detecting irq unsafe callers Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-08 15:44:43 +01:00
Ingo Molnar	e783377e93	Cputime cleanups on reader side: * Improve naming and code location * Consolidate adjustment code * Comment the adjustement code Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQt5oaAAoJEIUkVEdQjox3hNwP/2QP7p9BHCPwGWenIi4aVUWH tlDLwWvQE919YPYL4AUgz4b9f4G7U7dbBozIJRxhB0rjqrbXU6PDvVCIwVyDH2xQ mTp5qdqyysgzqgZ7q0t27zLfHEANRcH8Tnrqj2XustqvdYcIzZKZeNkFsF3QRiDw utIEmE8A9mBnWDP7O4fDmo8onHNUmJc50Y0c/WJW7fbtq5aCh2vn87efV4GYGNjk e1qZuLRWdZYXkDnO6zqD5tUe/kB0ioPzXXyBkYAHXCMhCpkMDu7c18N+IrY80kBb vBQqeAGlpUuXnJ/MDFazqqbmezBYhnTIbnojyWO4ONzi2z6L3K9F1/zukM4WtvLv RNDF4MS7smFjyXXXfliIGOhvI5C5O9bosPOzBtvwHSYrnS5KGL8fv8N8tXixqytW nX5NEcjfCZXpNpm4TELcDyAvOrVMFe2CQwKgLBPSY1zRch34nJi9G55uKKSjg1xd Z1aDbVZFNt9R3ozV1rVaptNzagEa/023bvmnB8IiuA9oh6rNZOHhsc/lo1T2VaeO PhJqD50JPbJyycJ1m0pIW8iVSUxfIvJtICEHgVSCPH5A58PsKFr+8ELs+InTPTDt 11V7dxHAmspar1CO1mqYMMIS4VKgPfwNI6zuaO+JlmU4nMB42y8WAZn/lzMyafQE Uswa6UTBBiU159HNzgDh =FRxY -----END PGP SIGNATURE----- Merge tag 'cputime-adjustment-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core Pull cputime cleanups from Frederic Weisbecker: * Improve naming and code location * Consolidate adjustment code * Comment the adjustement code Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-08 15:31:07 +01:00
Ingo Molnar	f0b9abfb04	Merge branch 'linus' into perf/core Conflicts: tools/perf/Makefile tools/perf/builtin-test.c tools/perf/perf.h tools/perf/tests/parse-events.c tools/perf/util/evsel.h Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-08 15:25:06 +01:00
Ingo Molnar	222e82bef4	Merge branch 'linus' into sched/core Pick up the autogroups fix and other fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-07 12:15:33 +01:00
Gao feng	f33fddc2b9	cgroup_rm_file: don't delete the uncreated files in cgroup_add_file,when creating files for cgroup, some of creation may be skipped. So we need to avoid deleting these uncreated files in cgroup_rm_file, otherwise the warning msg will be triggered. "cgroup_addrm_files: failed to remove memory_pressure_enabled, err=-2" Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@redhat.com> Cc: stable@vger.kernel.org	2012-12-06 08:58:11 -08:00
Linus Torvalds	54d1ae492f	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module signing fixes from Rusty Russell: "David gave me these a month ago, during my git workflow churn :(" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: ASN.1: Fix an indefinite length skip error MODSIGN: Don't use enum-type bitfields in module signature info block	2012-12-06 08:29:08 -08:00
Linus Torvalds	cfd1f032f9	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "Trivial CPU hotplug regression fix for the watchdog code" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: watchdog: Fix CPU hotplug regression	2012-12-06 08:27:11 -08:00
Nadia Yvette Chambers	6d49e352ae	propagate name change to comments in kernel source I've legally changed my name with New York State, the US Social Security Administration, et al. This patch propagates the name change and change in initials and login to comments in the kernel source as well. Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-12-06 10:39:54 +01:00
Shan Wei	f0fcf2002b	padata: use __this_cpu_read per-cpu helper For bottom halves off, __this_cpu_read is better. Signed-off-by: Shan Wei <davidshan@tencent.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2012-12-06 17:16:23 +08:00
David Howells	12e130b045	MODSIGN: Don't use enum-type bitfields in module signature info block Don't use enum-type bitfields in the module signature info block as we can't be certain how the compiler will handle them. As I understand it, it is arch dependent, and it is possible for the compiler to rearrange them based on endianness and to insert a byte of padding to pad the three enums out to four bytes. Instead use u8 fields for these, which the compiler should emit in the right order without padding. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2012-12-05 11:27:24 +10:30
Thomas Gleixner	8d4516904b	watchdog: Fix CPU hotplug regression Norbert reported: "3.7-rc6 booted with nmi_watchdog=0 fails to suspend to RAM or offline CPUs. It's reproducable with a KVM guest and physical system." The reason is that commit bcd951cf(watchdog: Use hotplug thread infrastructure) missed to take this into account. So the cpu offline code gets stuck in the teardown function because it accesses non initialized data structures. Add a check for watchdog_enabled into that path to cure the issue. Reported-and-tested-by: Norbert Warmuth <nwarmuth@t-online.de> Tested-by: Joseph Salisbury <joseph.salisbury@canonical.com> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1211231033230.2701@ionos Link: http://bugs.launchpad.net/bugs/1079534 Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2012-12-04 19:56:59 +01:00
Linus Torvalds	df2fc246c8	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module fixes from Rusty Russell: "Module signing build fixes for blackfin and metag" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: modsign: add symbol prefix to certificate list linux/kernel.h: define SYMBOL_PREFIX	2012-12-04 09:32:12 -08:00
Linus Torvalds	ca50496eb4	Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "So, safe fixes my ass. Commit `8852aac25e` ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay") had the side-effect of performing delayed_work sanity checks even when @delay is 0, which should be fine for any sane use cases. Unfortunately, megaraid was being overly ingenious. It seemingly wanted to use cancel_delayed_work_sync() before cancel_work_sync() was introduced, but didn't want to waste the space for full delayed_work as it was only going to use 0 @delay. So, it only allocated space for struct work_struct and then cast it to struct delayed_work and passed it into delayed_work functions - truly awesome engineering tradeoff to save some bytes. Xiaotian fixed it by making megraid allocate full delayed_work for now. It should be converted to use work_struct and cancel_work_sync() but I think we better do that after 3.7. I added another commit to change BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that the kernel doesn't crash even if there are more such abuses." * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s megaraid: fix BUG_ON() from incorrect use of delayed work	2012-12-04 09:02:45 -08:00
Tejun Heo	fc4b514f27	workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s `8852aac25e` ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in megaraid - it allocated work_struct, casted it to delayed_work and then pass that into queue_delayed_work(). Previously, this was okay because 0 @delay short-circuited to queue_work() before doing anything with delayed_work. `8852aac25e` moved 0 @delay test into __queue_delayed_work() after sanity check on delayed_work making megaraid trigger BUG_ON(). Although megaraid is already fixed by `c1d390d8e6` ("megaraid: fix BUG_ON() from incorrect use of delayed work"), this patch converts BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such abusers, if there are more, trigger warning but don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Xiaotian Feng <xtfeng@gmail.com>	2012-12-04 07:58:47 -08:00
Mike Galbraith	fd8ef11730	Revert "sched, autogroup: Stop going ahead if autogroup is disabled" This reverts commit `800d4d30c8`. Between commits `8323f26ce3` ("sched: Fix race in task_group()") and `800d4d30c8` ("sched, autogroup: Stop going ahead if autogroup is disabled"), autogroup is a wreck. With both applied, all you have to do to crash a box is disable autogroup during boot up, then reboot.. boom, NULL pointer dereference due to commit `800d4d30c8` not allowing autogroup to move things, and commit `8323f26ce3` making that the only way to switch runqueues: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90 Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502 RIP: effective_load.isra.43+0x50/0x90 Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0) Call Trace: select_task_rq_fair+0x255/0x780 try_to_wake_up+0x156/0x2c0 wake_up_state+0xb/0x10 signal_wake_up+0x28/0x40 complete_signal+0x1d6/0x250 __send_signal+0x170/0x310 send_signal+0x40/0x80 do_send_sig_info+0x47/0x90 group_send_sig_info+0x4a/0x70 kill_pid_info+0x3a/0x60 sys_kill+0x97/0x1a0 ? vfs_read+0x120/0x160 ? sys_read+0x45/0x90 system_call_fastpath+0x16/0x1b Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 <48> 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c RIP [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90 RSP <ffff880221ddfbd8> CR2: 0000000000000000 Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@vger.kernel.org # 2.6.39+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-03 11:10:24 -08:00
Gao feng	7083d0378a	cgroup: remove subsystem files when remounting cgroup cgroup_clear_directroy is called by cgroup_d_remove_dir and cgroup_remount. when we call cgroup_remount to remount the cgroup,the subsystem may be unlinked from cgroupfs_root->subsys_list in rebind_subsystem,this subsystem's files will not be removed in cgroup_clear_directroy. And the system will panic when we try to access these files. this patch removes subsystems's files before rebind_subsystems, if rebind_subsystems failed, repopulate these removed files. With help from Tejun. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-03 08:33:11 -08:00
Ingo Molnar	630e1e0bcd	Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Conflicts: arch/x86/kernel/ptrace.c Pull the latest RCU tree from Paul E. McKenney: " The major features of this series are: 1. A first version of no-callbacks CPUs. This version prohibits offlining CPU 0, but only when enabled via CONFIG_RCU_NOCB_CPU=y. Relaxing this constraint is in progress, but not yet ready for prime time. These commits were posted to LKML at https://lkml.org/lkml/2012/10/30/724, and are at branch rcu/nocb. 2. Changes to SRCU that allows statically initialized srcu_struct structures. These commits were posted to LKML at https://lkml.org/lkml/2012/10/30/296, and are at branch rcu/srcu. 3. Restructuring of RCU's debugfs output. These commits were posted to LKML at https://lkml.org/lkml/2012/10/30/341, and are at branch rcu/tracing. 4. Additional CPU-hotplug/RCU improvements, posted to LKML at https://lkml.org/lkml/2012/10/30/327, and are at branch rcu/hotplug. Note that the commit eliminating __stop_machine() was judged to be too-high of risk, so is deferred to 3.9. 5. Changes to RCU's idle interface, most notably a new module parameter that redirects normal grace-period operations to their expedited equivalents. These were posted to LKML at https://lkml.org/lkml/2012/10/30/739, and are at branch rcu/idle. 6. Additional diagnostics for RCU's CPU stall warning facility, posted to LKML at https://lkml.org/lkml/2012/10/30/315, and are at branch rcu/stall. The most notable change reduces the default RCU CPU stall-warning time from 60 seconds to 21 seconds, so that it once again happens sooner than the softlockup timeout. 7. Documentation updates, which were posted to LKML at https://lkml.org/lkml/2012/10/30/280, and are at branch rcu/doc. A couple of late-breaking changes were posted at https://lkml.org/lkml/2012/11/16/634 and https://lkml.org/lkml/2012/11/16/547. 8. Miscellaneous fixes, which were posted to LKML at https://lkml.org/lkml/2012/10/30/309, along with a late-breaking change posted at Fri, 16 Nov 2012 11:26:25 -0800 with message-ID <20121116192625.GA447@linux.vnet.ibm.com>, but which lkml.org seems to have missed. These are at branch rcu/fixes. 9. Finally, a fix for an lockdep-RCU splat was posted to LKML at https://lkml.org/lkml/2012/11/7/486. This is at rcu/next. " Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-12-03 06:27:05 +01:00
James Hogan	84ecfd15f5	modsign: add symbol prefix to certificate list Add the arch symbol prefix (if applicable) to the asm definition of modsign_certificate_list and modsign_certificate_list_end. This uses the recently defined SYMBOL_PREFIX which is derived from CONFIG_SYMBOL_PREFIX. This fixes the build of module signing on the blackfin and metag architectures. Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: David Howells <dhowells@redhat.com> Cc: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2012-12-03 13:06:25 +10:30
Joonsoo Kim	3657600040	workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up() Recently, workqueue code has gone through some changes and we found some bugs related to concurrency management operations happening on the wrong CPU. When a worker is concurrency managed (!WORKER_NOT_RUNNIG), it should be bound to its associated cpu and woken up to that cpu. Add WARN_ON_ONCE() to verify this. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-01 16:45:45 -08:00
Joonsoo Kim	999767beb1	workqueue: trivial fix for return statement in work_busy() Return type of work_busy() is unsigned int. There is return statement returning boolean value, 'false' in work_busy(). It is not problem, because 'false' may be treated '0'. However, fixing it would make code robust. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-01 16:45:40 -08:00
Tejun Heo	8852aac25e	workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay `8376fe22c7` ("workqueue: implement mod_delayed_work[_on]()") implemented mod_delayed_work[_on]() using the improved try_to_grab_pending(). The function is later used, among others, to replace [__]candel_delayed_work() + queue_delayed_work() combinations. Unfortunately, a delayed_work item w/ zero @delay is handled slightly differently by mod_delayed_work_on() compared to queue_delayed_work_on(). The latter skips timer altogether and directly queues it using queue_work_on() while the former schedules timer which will expire on the closest tick. This means, when @delay is zero, that [__]cancel_delayed_work() + queue_delayed_work_on() makes the target item immediately executable while mod_delayed_work_on() may induce delay of upto a full tick. This somewhat subtle difference breaks some of the converted users. e.g. block queue plugging uses delayed_work for deferred processing and uses mod_delayed_work_on() when the queue needs to be immediately unplugged. The above problem manifested as noticeably higher number of context switches under certain circumstances. The difference in behavior was caused by missing special case handling for 0 delay in mod_delayed_work_on() compared to queue_delayed_work_on(). Joonsoo Kim posted a patch to add it - ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1]. The patch was queued for 3.8 but it was described as optimization and I missed that it was a correctness issue. As both queue_delayed_work_on() and mod_delayed_work_on() use __queue_delayed_work() for queueing, it seems that the better approach is to move the 0 delay special handling to the function instead of duplicating it in mod_delayed_work_on(). Fix the problem by moving 0 delay special case handling from queue_delayed_work_on() to __queue_delayed_work(). This replaces Joonsoo's patch. [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU> Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr> LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu> LKML-Reference: <50A78AA9.5040904@iskon.hr> Cc: Joonsoo Kim <js1304@gmail.com>	2012-12-01 16:43:18 -08:00
Mike Galbraith	412d32e6c9	workqueue: exit rescuer_thread() as TASK_RUNNING A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling off, never to be seen again. In the case where this occurred, an exiting thread hit reiserfs homebrew conditional resched while holding a mutex, bringing the box to its knees. PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush" #0 [ffff8808157e7670] schedule at ffffffff8143f489 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs] #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs] #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f [exception RIP: kernel_thread_helper] RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2012-12-01 15:56:42 -08:00
Linus Torvalds	455e987c0c	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This is mostly about unbreaking architectures that took the UAPI changes in the v3.7 cycle, plus misc fixes." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf kvm: Fix building perf kvm on non x86 arches perf kvm: Rename perf_kvm to perf_kvm_stat perf: Make perf build for x86 with UAPI disintegration applied perf powerpc: Use uapi/unistd.h to fix build error tools: Pass the target in descend tools: Honour the O= flag when tool build called from a higher Makefile tools: Define a Makefile function to do subdir processing x86: Export asm/{svm.h,vmx.h,perf_regs.h} perf tools: Fix strbuf_addf() when the buffer needs to grow perf header: Fix numa topology printing perf, powerpc: Fix hw breakpoints returning -ENOSPC	2012-12-01 13:07:48 -08:00
Gao feng	879a3d9dbb	cgroup: use cgroup_addrm_files() in cgroup_clear_directory() cgroup_clear_directory() incorrectly invokes cgroup_rm_file() on each cftset of the target subsystems, which only removes the first file of each set. This leaves dangling files after subsystems are removed from a cgroup root via remount. Use cgroup_addrm_files() to remove all files of target subsystems. tj: Move cgroup_addrm_files() prototype decl upwards next to other global declarations. Commit message updated. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-30 11:44:12 -08:00
Frederic Weisbecker	91d1aa43d3	context_tracking: New context tracking susbsystem Create a new subsystem that probes on kernel boundaries to keep track of the transitions between level contexts with two basic initial contexts: user or kernel. This is an abstraction of some RCU code that use such tracking to implement its userspace extended quiescent state. We need to pull this up from RCU into this new level of indirection because this tracking is also going to be used to implement an "on demand" generic virtual cputime accounting. A necessary step to shutdown the tick while still accounting the cputime. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> [ paulmck: fix whitespace error and email address. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2012-11-30 11:40:07 -08:00
Steven Rostedt	9366c1ba13	ring-buffer: Fix race between integrity check and readers The function rb_check_pages() was added to make sure the ring buffer's pages were sane. This check is done when the ring buffer size is modified as well as when the iterator is released (closing the "trace" file), as that was considered a non fast path and a good place to do a sanity check. The problem is that the check does not have any locks around it. If one process were to read the trace file, and another were to read the raw binary file, the check could happen while the reader is reading the file. The issues with this is that the check requires to clear the HEAD page before doing the full check and it restores it afterward. But readers require the HEAD page to exist before it can read the buffer, otherwise it gives a nasty warning and disables the buffer. By adding the reader lock around the check, this keeps the race from happening. Cc: stable@vger.kernel.org # 3.6 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-30 11:09:57 -05:00
Steven Rostedt	54f7be5b83	ring-buffer: Fix NULL pointer if rb_set_head_page() fails The function rb_set_head_page() searches the list of ring buffer pages for a the page that has the HEAD page flag set. If it does not find it, it will do a WARN_ON(), disable the ring buffer and return NULL, as this should never happen. But if this bug happens to happen, not all callers of this function can handle a NULL pointer being returned from it. That needs to be fixed. Cc: stable@vger.kernel.org # 3.0+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-30 11:09:28 -05:00
Glauber Costa	1f869e8711	cgroup: warn about broken hierarchies only after css_online If everything goes right, it shouldn't really matter if we are spitting this warning after css_alloc or css_online. If we fail between then, there are some ill cases where we would previously see the message and now we won't (like if the files fail to be created). I believe it really shouldn't matter: this message is intended in spirit to be shown when creation succeeds, but with insane settings. Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-30 07:11:07 -08:00
Linus Walleij	d202b7b970	irqdomain: stop screaming about preallocated irqdescs In the simple irqdomain: don't shout warnings to the user, there is no point. An informational print is sufficient. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>	2012-11-30 09:02:35 +00:00
Rafael J. Wysocki	170bb4c800	Merge branch 'pm-sleep' * pm-sleep: PM / Freezer: Fixup compile error of try_to_freeze_nowarn() driver core / PM: move the calling to device_pm_remove behind the calling to bus_remove_device PM / Hibernate: use rb_entry PM / sysfs: replace strict_str* with kstrto*	2012-11-29 21:46:48 +01:00
Rafael J. Wysocki	9ee71f513c	Merge branch 'pm-cpuidle' * pm-cpuidle: cpuidle: Measure idle state durations with monotonic clock cpuidle: fix a suspicious RCU usage in menu governor cpuidle: support multiple drivers cpuidle: prepare the cpuidle core to handle multiple drivers cpuidle: move driver checking within the lock section cpuidle: move driver's refcount to cpuidle cpuidle: fixup device.h header in cpuidle.h cpuidle / sysfs: move structure declaration into the sysfs.c file cpuidle: Get typical recent sleep interval cpuidle: Set residency to 0 if target Cstate not enter cpuidle: Quickly notice prediction failure in general case cpuidle: Quickly notice prediction failure for repeat mode cpuidle / sysfs: move kobj initialization in the syfs file cpuidle / sysfs: change function parameter	2012-11-29 21:46:14 +01:00
Rafael J. Wysocki	d4c091f13d	Merge branch 'acpi-general' * acpi-general: (38 commits) ACPI / thermal: _TMP and _CRT/_HOT/_PSV/_ACx dependency fix ACPI: drop unnecessary local variable from acpi_system_write_wakeup_device() ACPI: Fix logging when no pci_irq is allocated ACPI: Update Dock hotplug error messages ACPI: Update Container hotplug error messages ACPI: Update Memory hotplug error messages ACPI: Update CPU hotplug error messages ACPI: Add acpi_handle_<level>() interfaces ACPI: remove use of __devexit ACPI / PM: Add Sony Vaio VPCEB1S1E to nonvs blacklist. ACPI / battery: Correct battery capacity values on Thinkpads Revert "ACPI / x86: Add quirk for "CheckPoint P-20-00" to not use bridge _CRS_ info" ACPI: create _SUN sysfs file ACPI / memhotplug: bind the memory device when the driver is being loaded ACPI / memhotplug: don't allow to eject the memory device if it is being used ACPI / memhotplug: free memory device if acpi_memory_enable_device() failed ACPI / memhotplug: fix memory leak when memory device is unbound from acpi_memhotplug ACPI / memhotplug: deal with eject request in hotplug queue ACPI / memory-hotplug: add memory offline code to acpi_memory_device_remove() ACPI / memory-hotplug: call acpi_bus_trim() to remove memory device ... Conflicts: include/linux/acpi.h (two additions at the end of the same file)	2012-11-29 21:43:06 +01:00
Rafael J. Wysocki	c8b6817103	Merge branch 'pm-qos' * pm-qos: PM / QoS: Handle device PM QoS flags while removing constraints PM / QoS: Resume device before exposing/hiding PM QoS flags PM / QoS: Document request manipulation requirement for flags PM / QoS: Fix a free error in the dev_pm_qos_constraints_destroy() PM / QoS: Fix the return value of dev_pm_qos_update_request() PM / ACPI: Take device PM QoS flags into account PM / Domains: Check device PM QoS flags in pm_genpd_poweroff() PM / QoS: Make it possible to expose PM QoS device flags to user space PM / QoS: Introduce PM QoS device flags support PM / QoS: Prepare struct dev_pm_qos_request for more request types PM / QoS: Introduce request and constraint data types for PM QoS flags PM / QoS: Prepare device structure for adding more constraint types	2012-11-29 21:40:32 +01:00
David S. Miller	8a2cf062b2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-29 12:51:17 -05:00
Al Viro	541880d9a2	do_coredump(): get rid of pt_regs argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-29 00:01:25 -05:00
Al Viro	4aaefee589	print_fatal_signal(): get rid of pt_regs argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-29 00:01:25 -05:00
Al Viro	94eb22d505	ptrace_signal(): get rid of unused arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-29 00:01:24 -05:00
Al Viro	b7f9591c44	get rid of ptrace_signal_deliver() arguments the first one is equal to signal_pt_regs(), the second is never used (and always NULL, while we are at it). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-29 00:01:24 -05:00
Al Viro	e80d6661c3	flagday: kill pt_regs argument of do_fork() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-29 00:01:08 -05:00
Al Viro	18c26c27ae	death to idle_regs() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-28 23:43:42 -05:00
Al Viro	62e791c1b8	don't pass regs to copy_process() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-28 23:43:42 -05:00
Al Viro	afa86fc426	flagday: don't pass regs to copy_thread() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-28 23:43:42 -05:00
Al Viro	c62d773a37	audit: no nested contexts anymore... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-28 21:53:36 -05:00
Al Viro	d2125043ae	generic sys_fork / sys_vfork / sys_clone ... and get rid of idiotic struct pt_regs * in asm-generic/syscalls.h prototypes of the same, while we are at it. Eventually we want those in linux/syscalls.h, of course, but that'll have to wait a bit. Note that there are three variants of sys_clone() order of arguments. Braindamage galore... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-28 21:49:04 -05:00
Al Viro	c4144670fd	kill daemonize() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-28 21:49:02 -05:00
Greg Thelen	9718ceb343	cgroup: list_del_init() on removed events Use list_del_init() rather than list_del() to remove events from cgrp->event_list. No functional change. This is just defensive coding. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-28 13:52:14 -08:00
Greg Thelen	205a872bd6	cgroup: fix lockdep warning for event_control The cgroup_event_wake() function is called with the wait queue head locked and it takes cgrp->event_list_lock. However, in cgroup_rmdir() remove_wait_queue() was being called after taking cgrp->event_list_lock. Correct the lock ordering by using a temporary list to obtain the event list to remove from the wait queue. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Aaron Durbin <adurbin@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-28 13:51:56 -08:00
Bill Pemberton	e3a1a5ec5c	kernel/ksysfs.c: remove CONFIG_HOTPLUG ifdefs Remove conditional code based on CONFIG_HOTPLUG being false. It's always on now in preparation of it going away as an option. Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2012-11-28 10:33:03 -08:00
Bill Pemberton	3b572b506c	sysctl: remove CONFIG_HOTPLUG ifdefs Remove conditional code based on CONFIG_HOTPLUG being false. It's always on now in preparation of it going away as an option. Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2012-11-28 10:33:03 -08:00
Frederic Weisbecker	fa09205783	cputime: Comment cputime's adjusting code The reason for the scaling and monotonicity correction performed by cputime_adjust() may not be immediately clear to the reviewer. Add some comments to explain what happens there. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-11-28 17:08:20 +01:00
Frederic Weisbecker	d37f761dbd	cputime: Consolidate cputime adjustment code task_cputime_adjusted() and thread_group_cputime_adjusted() essentially share the same code. They just don't use the same source: * The first function uses the cputime in the task struct and the previous adjusted snapshot that ensures monotonicity. * The second adds the cputime of all tasks in the group and the previous adjusted snapshot of the whole group from the signal structure. Just consolidate the common code that does the adjustment. These functions just need to fetch the values from the appropriate source. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-11-28 17:08:10 +01:00
Frederic Weisbecker	e80d0a1ae8	cputime: Rename thread_group_times to thread_group_cputime_adjusted We have thread_group_cputime() and thread_group_times(). The naming doesn't provide enough information about the difference between these two APIs. To lower the confusion, rename thread_group_times() to thread_group_cputime_adjusted(). This name better suggests that it's a version of thread_group_cputime() that does some stabilization on the raw cputime values. ie here: scale on top of CFS runtime stats and bound lower value for monotonicity. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-11-28 17:07:57 +01:00
Frederic Weisbecker	a634f93335	cputime: Move thread_group_cputime() to sched code thread_group_cputime() is a general cputime API that is not only used by posix cpu timer. Let's move this helper to sched code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-11-28 17:07:38 +01:00
Li Zhong	fddfb02ad0	cgroup: move list add after list head initilization `2243076ad1` ("cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping()") initializes cgrp->allcg_node in init_cgroup_housekeeping(). Then in init_cgroup_root(), we should call init_cgroup_housekeeping() before adding it to &root->allcg_list; otherwise, we are initializing an entry already in a list. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-28 06:02:39 -08:00
Marcelo Tosatti	e0b306fef9	time: export time information for KVM pvclock As suggested by John, export time data similarly to how its done by vsyscall support. This allows KVM to retrieve necessary information to implement vsyscall support in KVM guests. Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:12 -02:00
Marcelo Tosatti	582b336ec2	sched: add notifier for cross-cpu migrations Originally from Jeremy Fitzhardinge. Acked-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:09 -02:00
Darren Hart	aa10990e02	futex: avoid wake_futex() for a PI futex_q Dave Jones reported a bug with futex_lock_pi() that his trinity test exposed. Sometime between queue_me() and taking the q.lock_ptr, the lock_ptr became NULL, resulting in a crash. While futex_wake() is careful to not call wake_futex() on futex_q's with a pi_state or an rt_waiter (which are either waiting for a futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and futex_requeue() do not perform the same test. Update futex_wake_op() and futex_requeue() to test for q.pi_state and q.rt_waiter and abort with -EINVAL if detected. To ensure any future breakage is caught, add a WARN() to wake_futex() if the same condition is true. This fix has seen 3 hours of testing with "trinity -c futex" on an x86_64 VM with 4 CPUS. [akpm@linux-foundation.org: tidy up the WARN()] Signed-off-by: Darren Hart <dvhart@linux.intel.com> Reported-by: Dave Jones <davej@redat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-11-26 17:41:24 -08:00
Chuansheng Liu	8ffeb9b0e6	watchdog: using u64 in get_sample_period() In get_sample_period(), unsigned long is not enough: watchdog_thresh * 2 * (NSEC_PER_SEC / 5) case1: watchdog_thresh is 10 by default, the sample value will be: 0xEE6B2800 case2: set watchdog_thresh is 20, the sample value will be: 0x1 DCD6 5000 In case2, we need use u64 to express the sample period. Otherwise, changing the threshold thru proc often can not be successful. Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-11-26 17:41:24 -08:00
Thomas Gleixner	9c3f9e2816	Merge branch 'fortglx/3.8/time' of git://git.linaro.org/people/jstultz/linux into timers/core Fix trivial conflicts in: kernel/time/tick-sched.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2012-11-21 20:31:52 +01:00
Tao Ma	d0b2fdd2a5	cgroup: remove obsolete guarantee from cgroup_task_migrate. 'guarantee' is already removed from cgroup_task_migrate, so remove the corresponding comments. Some other typos in cgroup are also changed. Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-11-20 06:44:58 -08:00
Eric W. Biederman	98f842e675	proc: Usable inode numbers for the namespace file descriptors. Assign a unique proc inode to each namespace, and use that inode number to ensure we only allocate at most one proc inode for every namespace in proc. A single proc inode per namespace allows userspace to test to see if two processes are in the same namespace. This has been a long requested feature and only blocked because a naive implementation would put the id in a global space and would ultimately require having a namespace for the names of namespaces, making migration and certain virtualization tricks impossible. We still don't have per superblock inode numbers for proc, which appears necessary for application unaware checkpoint/restart and migrations (if the application is using namespace file descriptors) but that is now allowd by the design if it becomes important. I have preallocated the ipc and uts initial proc inode numbers so their structures can be statically initialized. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-20 04:19:49 -08:00
Eric W. Biederman	c450f371d4	userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file To keep things sane in the context of file descriptor passing derive the user namespace that uids are mapped into from the opener of the file instead of from current. When writing to the maps file the lower user namespace must always be the parent user namespace, or setting the mapping simply does not make sense. Enforce that the opener of the file was in the parent user namespace or the user namespace whose mapping is being set. Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:18:55 -08:00
Eric W. Biederman	b2e0d98705	userns: Implement unshare of the user namespace - Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected As changing user namespaces is only valid if all there is only a single thread. - Restore the code to add CLONE_VM if CLONE_THREAD is selected and the code to addCLONE_SIGHAND if CLONE_VM is selected. Making the constraints in the code clear. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:18:14 -08:00
Eric W. Biederman	cde1975bc2	userns: Implent proc namespace operations This allows entering a user namespace, and the ability to store a reference to a user namespace with a bind mount. Addition of missing userns_ns_put in userns_install from Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:18:13 -08:00
Eric W. Biederman	4c44aaafa8	userns: Kill task_user_ns The task_user_ns function hides the fact that it is getting the user namespace from struct cred on the task. struct cred may go away as soon as the rcu lock is released. This leads to a race where we can dereference a stale user namespace pointer. To make it obvious a struct cred is involved kill task_user_ns. To kill the race modify the users of task_user_ns to only reference the user namespace while the rcu lock is held. Cc: Kees Cook <keescook@chromium.org> Cc: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:17:44 -08:00
Eric W. Biederman	bcf58e725d	userns: Make create_new_namespaces take a user_ns parameter Modify create_new_namespaces to explicitly take a user namespace parameter, instead of implicitly through the task_struct. This allows an implementation of unshare(CLONE_NEWUSER) where the new user namespace is not stored onto the current task_struct until after all of the namespaces are created. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:17:43 -08:00
Eric W. Biederman	142e1d1d5f	userns: Allow unprivileged use of setns. - Push the permission check from the core setns syscall into the setns install methods where the user namespace of the target namespace can be determined, and used in a ns_capable call. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:17:42 -08:00
Eric W. Biederman	b33c77ef23	userns: Allow unprivileged users to create new namespaces If an unprivileged user has the appropriate capabilities in their current user namespace allow the creation of new namespaces. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:17:41 -08:00
Eric W. Biederman	37657da3c5	userns: Allow setting a userns mapping to your current uid. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:17:40 -08:00
Dave Jones	bf3071f5a0	tracing: Remove unnecessary WARN_ONCE's from tracing_buffers_splice_read WARN shouldn't be used as a means of communicating failure to a userspace programmer. Link: http://lkml.kernel.org/r/20120725153908.GA25203@redhat.com Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-19 15:25:09 -05:00
Anton Vorontsov	717a9ef7f3	tracing: Remove unneeded checks from the stack tracer It seems that 'ftrace_enabled' flag should not be used inside the tracer functions. The ftrace core is using this flag for internal purposes, and the flag wasn't meant to be used in tracers' runtime checks. stack tracer is the only tracer that abusing the flag. So stop it from serving as a bad example. Also, there is a local 'stack_trace_disabled' flag in the stack tracer, which is never updated; so it can be removed as well. Link: http://lkml.kernel.org/r/1342637761-9655-1-git-send-email-anton.vorontsov@linaro.org Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-19 15:07:13 -05:00
Tejun Heo	0a950f65e1	cgroup: add cgroup->id With the introduction of generic cgroup hierarchy iterators, css_id is being phased out. It was unnecessarily complex, id'ing the wrong thing (cgroups need IDs, not CSSes) and has other oddities like not being available at ->css_alloc(). This patch adds cgroup->id, which is a simple per-hierarchy ida-allocated ID which is assigned before ->css_alloc() and released after ->css_free(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com>	2012-11-19 09:02:12 -08:00
Tejun Heo	033fa1c5f5	cgroup, cpuset: remove cgroup_subsys->post_clone() Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now that clone_children is cpuset specific, there's no reason to have this rather odd option activation mechanism in cgroup core. cpuset can check the flag from its ->css_allocate() and take the necessary action. Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and remove cgroup_subsys->post_clone(). Loosely based on Glauber's "generalize post_clone into post_create" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Glauber Costa <glommer@parallels.com> Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-19 08:13:39 -08:00
Tejun Heo	2260e7fc1f	cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/ clone_children is only meaningful for cpuset and will stay that way. Rename the flag to reflect that and update documentation. Also, drop clone_children() wrapper in cgroup.c. The thin wrapper is used only a few times and one of them will go away soon. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	92fb97487a	cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	b1929db42f	cgroup: allow ->post_create() to fail There could be cases where controllers want to do initialization operations which may fail from ->post_create(). This patch makes ->post_create() return -errno to indicate failure and online_css() relay such failures. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	4b8b47eb00	cgroup: update cgroup_create() failure path cgroup_create() was ignoring failure of cgroupfs files. Update it such that, if file creation fails, it rolls back by calling cgroup_destroy_locked() and returns failure. Note that error out goto labels are renamed. The labels are a bit confusing but will become better w/ later cgroup operation renames. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:38 -08:00
Tejun Heo	b8a2df6a5b	cgroup: use mutex_trylock() when grabbing i_mutex of a new cgroup directory All cgroup directory i_mutexes nest outside cgroup_mutex; however, new directory creation is a special case. A new cgroup directory is created while holding cgroup_mutex. Populating the new directory requires both the new directory's i_mutex and cgroup_mutex. Because all directory i_mutexes nest outside cgroup_mutex, grabbing both requires releasing cgroup_mutex first, which isn't a good idea as the new cgroup isn't yet ready to be manipulated by other cgroup opreations. This is worked around by grabbing the new directory's i_mutex while holding cgroup_mutex before making it visible. As there's no other user at that point, grabbing the i_mutex under cgroup_mutex can't lead to deadlock. cgroup_create_file() was using I_MUTEX_CHILD to tell lockdep not to worry about the reverse locking order; however, this creates pseudo locking dependency cgroup_mutex -> I_MUTEX_CHILD, which isn't true - all directory i_mutexes are still nested outside cgroup_mutex. This pseudo locking dependency can lead to spurious lockdep warnings. Use mutex_trylock() instead. This will always succeed and lockdep doesn't create any locking dependency for it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	d19e19de48	cgroup: simplify cgroup_load_subsys() failure path Now that cgroup_unload_subsys() can tell whether the root css is online or not, we can safely call cgroup_unload_subsys() after idr init failure in cgroup_load_subsys(). Replace the manual unrolling and invoke cgroup_unload_subsys() on failure. This drops cgroup_mutex inbetween but should be safe as the subsystem will fail try_module_get() and thus can't be mounted inbetween. As this means that cgroup_unload_subsys() can be called before css_sets are rehashed, remove BUG_ON() on %NULL css_set->subsys[] from cgroup_unload_subsys(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	a31f2d3ff7	cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers New helpers on/offline_css() respectively wrap ->post_create() and ->pre_destroy() invocations. online_css() sets CSS_ONLINE after ->post_create() is complete and offline_css() invokes ->pre_destroy() iff CSS_ONLINE is set and clears it while also handling the temporary dropping of cgroup_mutex. This patch doesn't introduce any behavior change at the moment but will be used to improve cgroup_create() failure path and allow ->post_create() to fail. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	42809dd422	cgroup: separate out cgroup_destroy_locked() Separate out cgroup_destroy_locked() from cgroup_destroy(). This will be later used in cgroup_create() failure path. While at it, add lockdep asserts on i_mutex and cgroup_mutex, and move @d and @parent assignments to their declarations. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	02ae7486d0	cgroup: fix harmless bugs in cgroup_load_subsys() fail path and cgroup_unload_subsys() * If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[] before calilng ->destroy() making CSS inaccessible to the callback, and didn't unlink ss->sibling. As no modular controller uses ->use_id, this doesn't cause any actual problems. * cgroup_unload_subsys() was forgetting to free idr, call ->pre_destroy() and clear ->active. As there currently is no modular controller which uses ->use_id, ->pre_destroy() or ->active, this doesn't cause any actual problems. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:37 -08:00
Tejun Heo	648bb56d07	cgroup: lock cgroup_mutex in cgroup_init_subsys() Make cgroup_init_subsys() grab cgroup_mutex while initializing a subsystem so that all helpers and callbacks are called under the context they expect. This isn't strictly necessary as cgroup_init_subsys() doesn't race with anybody but will allow adding lockdep assertions. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	b48c6a80a0	cgroup: trivial cleanup for cgroup_init/load_subsys() Consistently use @css and @dummytop in these two functions instead of referring to them indirectly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	38b53abaa3	cgroup: make CSS_* flags bit masks instead of bit positions Currently, CSS_* flags are defined as bit positions and manipulated using atomic bitops. There's no reason to use atomic bitops for them and bit positions are clunkier to deal with than bit masks. Make CSS_* bit masks instead and use the usual C bitwise operators to access them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	febfcef60d	cgroup: cgroup->dentry isn't a RCU pointer cgroup->dentry is marked and used as a RCU pointer; however, it isn't one - the final dentry put doesn't go through call_rcu(). cgroup and dentry share the same RCU freeing rule via synchronize_rcu() in cgroup_diput() (kfree_rcu() used on cgrp is unnecessary). If cgrp is accessible under RCU read lock, so is its dentry and dereferencing cgrp->dentry doesn't need any further RCU protection or annotation. While not being accurate, before the previous patch, the RCU accessors served a purpose as memory barriers - cgroup->dentry used to be assigned after the cgroup was made visible to cgroup_path(), so the assignment and dereferencing in cgroup_path() needed the memory barrier pair. Now that list_add_tail_rcu() happens after cgroup->dentry is assigned, this no longer is necessary. Remove the now unnecessary and misleading RCU annotations from cgroup->dentry. To make up for the removal of rcu_dereference_check() in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts the dereference rule of @cgrp, not cgrp->dentry. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	4e139afc22	cgroup: create directory before linking while creating a new cgroup While creating a new cgroup, cgroup_create() links the newly allocated cgroup into various places before trying to create its directory. Because cgroup life-cycle is tied to the vfs objects, this makes it impossible to use cgroup_rmdir() for rolling back creation - the removal logic depends on having full vfs objects. This patch moves directory creation above linking and collect linking operations to one place. This allows directory creation failure to share error exit path with css allocation failures and any failure sites afterwards (to be added later) can use cgroup_rmdir() logic to undo creation. Note that this also makes the memory barriers around cgroup->dentry, which currently is misleadingly using RCU operations, unnecessary. This will be handled in the next patch. While at it, locking BUG_ON() on i_mutex is converted to lockdep_assert_held(). v2: Patch originally removed %NULL dentry check in cgroup_path(); however, Li pointed out that this patch doesn't make it unnecessary as ->create() may call cgroup_path(). Drop the change for now. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	28fd6f30ac	cgroup: open-code cgroup_create_dir() The operation order of cgroup creation is about to change and cgroup_create_dir() is more of a hindrance than a proper abstraction. Open-code it by moving the parent nlink adjustment next to self nlink adjustment in cgroup_create_file() and the rest to cgroup_create(). This patch doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:36 -08:00
Tejun Heo	2243076ad1	cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping() Not strictly necessary but it's annoying to have uninitialized list_head around. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:35 -08:00
Tejun Heo	175431635e	cgroup: remove incorrect dget/dput() pair in cgroup_create_dir() cgroup_create_dir() does weird dancing with dentry refcnt. On success, it gets and then puts it achieving nothing. On failure, it puts but there isn't no matching get anywhere leading to the following oops if cgroup_create_file() fails for whatever reason. ------------[ cut here ]------------ kernel BUG at /work/os/work/fs/dcache.c:552! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU 2 Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff811d9c0c>] [<ffffffff811d9c0c>] dput+0x1dc/0x1e0 RSP: 0018:ffff88001a3ebef8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403 RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58 RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001 R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60 FS: 00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080) Stack: ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000 ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8 ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8 Call Trace: [<ffffffff811cc889>] done_path_create+0x19/0x50 [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80 [<ffffffff811d2009>] sys_mkdir+0x19/0x20 [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41 RIP [<ffffffff811d9c0c>] dput+0x1dc/0x1e0 RSP <ffff88001a3ebef8> ---[ end trace 1277bcfd9561ddb0 ]--- Fix it by dropping the unnecessary dget/dput() pair. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2012-11-19 08:13:35 -08:00
Frederic Weisbecker	1017769bd0	vtime: No need to disable irqs on vtime_account() vtime_account() is only called from irq entry. irqs are always disabled at this point so we can safely remove the irq disabling guards on that function. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>	2012-11-19 16:41:41 +01:00
Frederic Weisbecker	e3942ba040	vtime: Consolidate a bit the ctx switch code On ia64 and powerpc, vtime context switch only consists in flushing system and user pending time, plus a few arch housekeeping. Consolidate that into a generic implementation. s390 is a special case because pending user and system time accounting there is hard to dissociate. So it's keeping its own implementation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>	2012-11-19 16:41:32 +01:00
Frederic Weisbecker	fd25b4c2f2	vtime: Remove the underscore prefix invasion Prepending irq-unsafe vtime APIs with underscores was actually a bad idea as the result is a big mess in the API namespace that is even waiting to be further extended. Also these helpers are always called from irq safe callers except kvm. Just provide a vtime_account_system_irqsafe() for this specific case so that we can remove the underscore prefix on other vtime functions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>	2012-11-19 16:40:16 +01:00
Eric W. Biederman	5eaf563e53	userns: Allow unprivileged users to create user namespaces. Now that we have been through every permission check in the kernel having uid == 0 and gid == 0 in your local user namespace no longer adds any special privileges. Even having a full set of caps in your local user namespace is safe because capabilies are relative to your local user namespace, and do not confer unexpected privileges. Over the long term this should allow much more of the kernels functionality to be safely used by non-root users. Functionality like unsharing the mount namespace that is only unsafe because it can fool applications whose privileges are raised when they are executed. Since those applications have no privileges in a user namespaces it becomes safe to spoof and confuse those applications all you want. Those capabilities will still need to be enabled carefully because we may still need things like rlimits on the number of unprivileged mounts but that is to avoid DOS attacks not to avoid fooling root owned processes. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:24 -08:00
Eric W. Biederman	771b137168	vfs: Add a user namespace reference from struct mnt_namespace This will allow for support for unprivileged mounts in a new user namespace. Acked-by: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-19 05:59:19 -08:00
Eric W. Biederman	50804fe373	pidns: Support unsharing the pid namespace. Unsharing of the pid namespace unlike unsharing of other namespaces does not take affect immediately. Instead it affects the children created with fork and clone. The first of these children becomes the init process of the new pid namespace, the rest become oddball children of pid 0. From the point of view of the new pid namespace the process that created it is pid 0, as it's pid does not map. A couple of different semantics were considered but this one was settled on because it is easy to implement and it is usable from pam modules. The core reasons for the existence of unshare. I took a survey of the callers of pam modules and the following appears to be a representative sample of their logic. { setup stuff include pam child = fork(); if (!child) { setuid() exec /bin/bash } waitpid(child); pam and other cleanup } As you can see there is a fork to create the unprivileged user space process. Which means that the unprivileged user space process will appear as pid 1 in the new pid namespace. Further most login processes do not cope with extraneous children which means shifting the duty of reaping extraneous child process to the creator of those extraneous children makes the system more comprehensible. The practical reason for this set of pid namespace semantics is that it is simple to implement and verify they work correctly. Whereas an implementation that requres changing the struct pid on a process comes with a lot more races and pain. Not the least of which is that glibc caches getpid(). These semantics are implemented by having two notions of the pid namespace of a proces. There is task_active_pid_ns which is the pid namspace the process was created with and the pid namespace that all pids are presented to that process in. The task_active_pid_ns is stored in the struct pid of the task. Then there is the pid namespace that will be used for children that pid namespace is stored in task->nsproxy->pid_ns. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:16 -08:00
Eric W. Biederman	1c4042c29b	pidns: Consolidate initialzation of special init task state Instead of setting child_reaper and SIGNAL_UNKILLABLE one way for the system init process, and another way for pid namespace init processes test pid->nr == 1 and use the same code for both. For the global init this results in SIGNAL_UNKILLABLE being set much earlier in the initialization process. This is a small cleanup and it paves the way for allowing unshare and enter of the pid namespace as that path like our global init also will not set CLONE_NEWPID. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:15 -08:00
Eric W. Biederman	57e8391d32	pidns: Add setns support - Pid namespaces are designed to be inescapable so verify that the passed in pid namespace is a child of the currently active pid namespace or the currently active pid namespace itself. Allowing the currently active pid namespace is important so the effects of an earlier setns can be cancelled. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:14 -08:00
Eric W. Biederman	225778d68d	pidns: Deny strange cases when creating pid namespaces. task_active_pid_ns(current) != current->ns_proxy->pid_ns will soon be allowed to support unshare and setns. The definition of creating a child pid namespace when task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that we create a child pid namespace of current->ns_proxy->pid_ns. However that leads to strange cases like trying to have a single process be init in multiple pid namespaces, which is racy and hard to think about. The definition of creating a child pid namespace when task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that we create a child pid namespace of task_active_pid_ns(current). While that seems less racy it does not provide any utility. Therefore define the semantics of creating a child pid namespace when task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the pid namespace creation fails. That is easy to implement and easy to think about. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-19 05:59:13 -08:00
Eric W. Biederman	af4b8a83ad	pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 Looking at pid_ns->nr_hashed is a bit simpler and it works for disjoint process trees that an unshare or a join of a pid_namespace may create. Acked-by: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-19 05:59:12 -08:00
Eric W. Biederman	5e1182deb8	pidns: Don't allow new processes in a dead pid namespace. Set nr_hashed to -1 just before we schedule the work to cleanup proc. Test nr_hashed just before we hash a new pid and if nr_hashed is < 0 fail. This guaranteees that processes never enter a pid namespaces after we have cleaned up the state to support processes in a pid namespace. Currently sending SIGKILL to all of the process in a pid namespace as init exists gives us this guarantee but we need something a little stronger to support unsharing and joining a pid namespace. Acked-by: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:11 -08:00
Eric W. Biederman	0a01f2cc39	pidns: Make the pidns proc mount/umount logic obvious. Track the number of pids in the proc hash table. When the number of pids goes to 0 schedule work to unmount the kernel mount of proc. Move the mount of proc into alloc_pid when we allocate the pid for init. Remove the surprising calls of pid_ns_release proc in fork and proc_flush_task. Those code paths really shouldn't know about proc namespace implementation details and people have demonstrated several times that finding and understanding those code paths is difficult and non-obvious. Because of the call path detach pid is alwasy called with the rtnl_lock held free_pid is not allowed to sleep, so the work to unmounting proc is moved to a work queue. This has the side benefit of not blocking the entire world waiting for the unnecessary rcu_barrier in deactivate_locked_super. In the process of making the code clear and obvious this fixes a bug reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a mount of proc during clone(CLONE_NEWPID\|CLONE_NEWNET) if copy_pid_ns succeeded and copy_net_ns failed. Acked-by: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-19 05:59:10 -08:00
Eric W. Biederman	17cf22c33e	pidns: Use task_active_pid_ns where appropriate The expressions tsk->nsproxy->pid_ns and task_active_pid_ns aka ns_of_pid(task_pid(tsk)) should have the same number of cache line misses with the practical difference that ns_of_pid(task_pid(tsk)) is released later in a processes life. Furthermore by using task_active_pid_ns it becomes trivial to write an unshare implementation for the the pid namespace. So I have used task_active_pid_ns everywhere I can. In fork since the pid has not yet been attached to the process I use ns_of_pid, to achieve the same effect. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-19 05:59:09 -08:00
Eric W. Biederman	49f4d8b93c	pidns: Capture the user namespace and filter ns_last_pid - Capture the the user namespace that creates the pid namespace - Use that user namespace to test if it is ok to write to /proc/sys/kernel/ns_last_pid. Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns in when destroying a pid_ns. I have foloded his patch into this one so that bisects will work properly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-19 05:57:31 -08:00
Eric W. Biederman	038e7332b8	userns: make each net (net_ns) belong to a user_ns The user namespace which creates a new network namespace owns that namespace and all resources created in it. This way we can target capability checks for privileged operations against network resources to the user_ns which created the network namespace in which the resource lives. Privilege to the user namespace which owns the network namespace, or any parent user namespace thereof, provides the same privilege to the network resource. This patch is reworked from a version originally by Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-18 22:46:23 -08:00
Eric W. Biederman	d328b83682	userns: make each net (net_ns) belong to a user_ns The user namespace which creates a new network namespace owns that namespace and all resources created in it. This way we can target capability checks for privileged operations against network resources to the user_ns which created the network namespace in which the resource lives. Privilege to the user namespace which owns the network namespace, or any parent user namespace thereof, provides the same privilege to the network resource. This patch is reworked from a version originally by Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:30:55 -05:00
Ingo Molnar	ec05a2311c	Merge branch 'sched/urgent' into sched/core Merge in fixes before we queue up dependent bits, to avoid conflicts. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-11-18 09:34:44 +01:00
David S. Miller	67f4efdce7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor line offset auto-merges. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-17 22:00:43 -05:00
Greg Kroah-Hartman	1e619a1bf9	Merge 3.7-rc6 into tty-next	2012-11-16 18:26:00 -08:00
Paul E. McKenney	4e79752c25	sched: Mark RCU reader in sched_show_task() When sched_show_task() is invoked from try_to_freeze_tasks(), there is no RCU read-side critical section, resulting in the following splat: [ 125.780730] =============================== [ 125.780766] [ INFO: suspicious RCU usage. ] [ 125.780804] 3.7.0-rc3+ #988 Not tainted [ 125.780838] ------------------------------- [ 125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage! [ 125.780946] [ 125.780946] other info that might help us debug this: [ 125.780946] [ 125.781031] [ 125.781031] rcu_scheduler_active = 1, debug_locks = 0 [ 125.781087] 4 locks held by s2ram/4211: [ 125.781120] #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160 [ 125.781233] #1: (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160 [ 125.781339] #2: (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230 [ 125.781439] #3: (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0 [ 125.781543] [ 125.781543] stack backtrace: [ 125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988 [ 125.781632] Call Trace: [ 125.781662] [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140 [ 125.781719] [<ffffffff8107cf21>] sched_show_task+0x121/0x180 [ 125.781770] [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0 [ 125.781823] [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80 [ 125.781876] [<ffffffff81090b65>] pm_suspend+0x165/0x230 [ 125.781924] [<ffffffff8108fa29>] state_store+0x99/0x100 [ 125.781975] [<ffffffff812f5867>] kobj_attr_store+0x17/0x20 [ 125.782038] [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160 [ 125.782091] [<ffffffff811667a6>] vfs_write+0xc6/0x180 [ 125.782138] [<ffffffff81166ada>] sys_write+0x5a/0xa0 [ 125.782185] [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 125.782242] [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b This commit therefore adds the needed RCU read-side critical section. Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2012-11-16 10:05:58 -08:00
Paul E. McKenney	c635a4e1c2	rcu: Separate accounting of callbacks from callback-free CPUs Currently, callback invocations from callback-free CPUs are accounted to the CPU that registered the callback, but using the same field that is used for normal callbacks. This makes it impossible to determine from debugfs output whether callbacks are in fact being diverted. This commit therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure in which diverted callback invocations are counted. RCU's debugfs tracing still displays normal callback invocations using ci=, but displayed diverted callbacks with nci=. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2012-11-16 10:05:57 -08:00
Paul E. McKenney	3fbfbf7a3b	rcu: Add callback-free CPUs RCU callback execution can add significant OS jitter and also can degrade both scheduling latency and, in asymmetric multiprocessors, energy efficiency. This commit therefore adds the ability for selected CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded to kthreads. If the "rcu_nocb_poll" boot parameter is also specified, these kthreads will do polling, removing the need for the offloaded CPUs to do wakeups. At least one CPU must be doing normal callback processing: currently CPU 0 cannot be selected as a no-CBs CPU. In addition, attempts to offline the last normal-CBs CPU will fail. This feature was inspired by Jim Houston's and Joe Korty's JRCU, and this commit includes fixes to problems located by Fengguang Wu's kbuild test robot. [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ] Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2012-11-16 10:05:56 -08:00
Paul E. McKenney	aac1cda34b	Merge branches 'urgent.2012.10.27a', 'doc.2012.11.16a', 'fixes.2012.11.13a', 'srcu.2012.10.27a', 'stall.2012.11.13a', 'tracing.2012.11.08a' and 'idle.2012.10.24a' into HEAD urgent.2012.10.27a: Fix for RCU user-mode transition (already in -tip). doc.2012.11.08a: Documentation updates, most notably codifying the memory-barrier guarantees inherent to grace periods. fixes.2012.11.13a: Miscellaneous fixes. srcu.2012.10.27a: Allow statically allocated and initialized srcu_struct structures (courtesy of Lai Jiangshan). stall.2012.11.13a: Add more diagnostic information to RCU CPU stall warnings, also decrease from 60 seconds to 21 seconds. hotplug.2012.11.08a: Minor updates to CPU hotplug handling. tracing.2012.11.08a: Improved debugfs tracing, courtesy of Michael Wang. idle.2012.10.24a: Updates to RCU idle/adaptive-idle handling, including a boot parameter that maps normal grace periods to expedited. Resolved conflict in kernel/rcutree.c due to side-by-side change.	2012-11-16 09:59:58 -08:00
Oleg Nesterov	32cdba1e05	uprobes: Use percpu_rw_semaphore to fix register/unregister vs dup_mmap() race This was always racy, but `268720903f` "uprobes: Rework register_for_each_vma() to make it O(n)" should be blamed anyway, it made everything worse and I didn't notice. register/unregister call build_map_info() and then do install/remove breakpoint for every mm which mmaps inode/offset. This can obviously race with fork()->dup_mmap() in between and we can miss the child. uprobe_register() could be easily fixed but unregister is much worse, the new mm inherits "int3" from parent and there is no way to detect this if uprobe goes away. So this patch simply adds percpu_down_read/up_read around dup_mmap(), and percpu_down_write/up_write into register_for_each_vma(). This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap() and fold it into uprobe_end_dup_mmap(). Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2012-11-16 14:52:51 +01:00
Hiraku Toyooka	d60da506cb	tracing: Add a resize function to make one buffer equivalent to another buffer Trace buffer size is now per-cpu, so that there are the following two patterns in resizing of buffers. (1) resize per-cpu buffers to same given size (2) resize per-cpu buffers to another trace_array's buffer size for each CPU (such as preparing the max_tr which is equivalent to the global_trace's size) __tracing_resize_ring_buffer() can be used for (1), and had implemented (2) inside it for resetting the global_trace to the original size. (2) was also implemented in another place. So this patch assembles them in a new function - resize_buffer_duplicate_size(). Link: http://lkml.kernel.org/r/20121017025616.2627.91226.stgit@falsita Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-15 17:10:21 -05:00
Dan Carpenter	70f77b3f7e	ftrace: Clear bits properly in reset_iter_read() There is a typo here where '&' is used instead of '\|' and it turns the statement into a noop. The original code is equivalent to: iter->flags &= ~((1 << 2) & (1 << 4)); Link: http://lkml.kernel.org/r/20120609161027.GD6488@elgon.mountain Cc: stable@vger.kernel.org # all of them Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-15 16:10:17 -05:00
Davidlohr Bueso	8316bd72c0	PM / Hibernate: use rb_entry Since the software suspend extents are organized in an rbtree, use rb_entry instead of container_of, as it is semantically more appropriate in order to get a node as it is iterated. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2012-11-15 00:37:08 +01:00
Daniel Walter	883ee4f79d	PM / sysfs: replace strict_str* with kstrto* Replace strict_strtoul() with kstrtoul() in pm_async_store() and pm_qos_power_write(). [rjw: Modified subject and changelog.] Signed-off-by: Daniel Walter <sahne@0x90.at> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2012-11-15 00:37:08 +01:00
Youquan Song	69a37beabf	cpuidle: Quickly notice prediction failure for repeat mode The prediction for future is difficult and when the cpuidle governor prediction fails and govenor possibly choose the shallower C-state than it should. How to quickly notice and find the failure becomes important for power saving. cpuidle menu governor has a method to predict the repeat pattern if there are 8 C-states residency which are continuous and the same or very close, so it will predict the next C-states residency will keep same residency time. There is a real case that turbostat utility (tools/power/x86/turbostat) at kernel 3.3 or early. turbostat utility will read 10 registers one by one at Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu governor will predict it is repeat mode and there is another IPI wake up idle CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally idle. However, in the turbostat, following 10 registers reading is sleep 5 seconds by default, so the idle CPU will keep at C1 for a long time though it is idle until break event occurs. In a idle Sandybridge system, run "./turbostat -v", we will notice that deep C-state dangles between "70% ~ 99%". After patched the kernel, we will notice deep C-state stays at >99.98%. In the patch, a timer is added when menu governor detects a repeat mode and choose a shallow C-state. The timer is set to a time out value that greater than predicted time, and we conclude repeat mode prediction failure if timer is triggered. When repeat mode happens as expected, the timer is not triggered and CPU waken up from C-states and it will cancel the timer initiatively. When repeat mode does not happen, the timer will be time out and menu governor will quickly notice that the repeat mode prediction fails and then re-evaluates deeper C-states possibility. Below is another case which will clearly show the patch much benefit: #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/time.h> #include <time.h> #include <pthread.h> volatile int * shutdown; volatile long * count; int delay = 20; int loop = 8; void usage(void) { fprintf(stderr, "Usage: idle_predict [options]\n" " --help -h Print this help\n" " --thread -n Thread number\n" " --loop -l Loop times in shallow Cstate\n" " --delay -t Sleep time (uS)in shallow Cstate\n"); } void simple_loop() { int idle_num = 1; while (!(shutdown)) { count = count + 1; if (idle_num % loop) usleep(delay); else { /* sleep 1 second / usleep(1000000); idle_num = 0; } idle_num++; } } static void sighand(int sig) { shutdown = 1; } int main(int argc, char argv[]) { sigset_t sigset; int signum = SIGALRM; int i, c, er = 0, thread_num = 8; pthread_t pt[1024]; static char optstr[] = "n:l:t:h:"; while ((c = getopt(argc, argv, optstr)) != EOF) switch (c) { case 'n': thread_num = atoi(optarg); break; case 'l': loop = atoi(optarg); break; case 't': delay = atoi(optarg); break; case 'h': default: usage(); exit(1); } printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay); count = malloc(sizeof(long)); shutdown = malloc(sizeof(int)); count = 0; *shutdown = 0; sigemptyset(&sigset); sigaddset(&sigset, signum); sigprocmask (SIG_BLOCK, &sigset, NULL); signal(SIGINT, sighand); signal(SIGTERM, sighand); for(i = 0; i < thread_num ; i++) pthread_create(&pt[i], NULL, simple_loop, NULL); for (i = 0; i < thread_num; i++) pthread_join(pt[i], NULL); exit(0); } Get powertop V2 from git://github.com/fenrus75/powertop, build powertop. After build the above test application, then run it. Test plaform can be Intel Sandybridge or other recent platforms. #./idle_predict -l 10 & #./powertop We will find that deep C-state will dangle between 40%~100% and much time spent on C1 state. It is because menu governor wrongly predict that repeat mode is kept, so it will choose the C1 shallow C-state even though it has chance to sleep 1 second in deep C-state. While after patched the kernel, we find that deep C-state will keep >99.6%. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2012-11-15 00:34:19 +01:00
Yasuaki Ishimatsu	5e5041f352	ACPI / processor: prevent cpu from becoming online Even if acpi_processor_handle_eject() offlines cpu, there is a chance to online the cpu after that. So the patch closes the window by using get/put_online_cpus(). Why does the patch change _cpu_up() logic? The patch cares the race of hot-remove cpu and _cpu_up(). If the patch does not change it, there is the following race. hot-remove cpu \| _cpu_up() ------------------------------------- ------------------------------------ call acpi_processor_handle_eject() \| call cpu_down() \| call get_online_cpus() \| \| call cpu_hotplug_begin() and stop here call arch_unregister_cpu() \| call acpi_unmap_lsapic() \| call put_online_cpus() \| \| start and continue _cpu_up() return acpi_processor_remove() \| continue hot-remove the cpu \| So _cpu_up() can continue to itself. And hot-remove cpu can also continue itself. If the patch changes _cpu_up() logic, the race disappears as below: hot-remove cpu \| _cpu_up() ----------------------------------------------------------------------- call acpi_processor_handle_eject() \| call cpu_down() \| call get_online_cpus() \| \| call cpu_hotplug_begin() and stop here call arch_unregister_cpu() \| call acpi_unmap_lsapic() \| cpu's cpu_present is set \| to false by set_cpu_present()\| call put_online_cpus() \| \| start _cpu_up() \| check cpu_present() and return -EINVAL return acpi_processor_remove() \| continue hot-remove the cpu \| Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Reviewed-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2012-11-15 00:16:00 +01:00
Greg Kroah-Hartman	54d5f88f25	Merge v3.7-rc5 into tty-next This pulls in the 3.7-rc5 fixes into tty-next to make it easier to test.	2012-11-14 12:30:12 -08:00
Fenghua Yu	6e32d479db	kernel/cpu.c: Add comment for priority in cpu_hotplug_pm_callback cpu_hotplug_pm_callback should have higher priority than bsp_pm_callback which depends on cpu_hotplug_pm_callback to disable cpu hotplug to avoid race during bsp online checking. This is to hightlight the priorities between the two callbacks in case people may overlook the order. Ideally the priorities should be defined in macro/enum instead of fixed values. To do that, a seperate patchset may be pushed which will touch serveral other generic files and is out of scope of this patchset. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/1352835171-3958-7-git-send-email-fenghua.yu@intel.com Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2012-11-14 09:39:50 -08:00
Rabin Vincent	65b6ecc038	uprobes: Flush cache after xol write Flush the cache so that the instructions written to the XOL area are visible. Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2012-11-14 18:32:24 +01:00
Ingo Molnar	a7a0aaa17a	Linux 3.7-rc5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJQn526AAoJEHm+PkMAQRiGQJoH/RAk2zmryS6wEpsdp7kl0pez DA7iI0lkBD/2s6gb3xbgqiuLpYoj8FZsNfKia8G9gcMo+BxL7tO46J4seki7UUZc Zt219olN6bfTlBToVtUsPPA758aMQ7SDJqe6ZPm3Znj8Zwcp8KodszJ6h5oeuPkj rkglyZwTxmbkFWu7TddIKVs4Gcb0CGuCx4gx6IWdYomgblwhZCnmFgHKmuiFXz2Z 0PonP2ANBnUeihZYENTrpJFc0iBeA6HUVdTMeHqEeSGg5jyp3kYHhSj8NLKx16Xl CvI8d4kGpH6CPL7VCJo8BUh0DuR1MXdS++YpjPHasC2DkwatXQ7FLjmYs8RIJ3A= =dZwT -----END PGP SIGNATURE----- Merge tag 'v3.7-rc5' into sched/core Merge Linux 3.7-rc5, to pick up fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-11-14 08:49:49 +01:00
Paul E. McKenney	351573a86d	rcu: Fix TINY_RCU rcu_is_cpu_rrupt_from_idle check The rcu_is_cpu_rrupt_from_idle() needs to allow for one interrupt level from the idle loop, but TINY_RCU checks for a call from the idle loop itself. This commit fixes this issue. Reported-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2012-11-13 14:08:34 -08:00
Paul E. McKenney	f0a0e6f282	rcu: Clarify memory-ordering properties of grace-period primitives This commit explicitly states the memory-ordering properties of the RCU grace-period primitives. Although these properties were in some sense implied by the fundmental property of RCU ("a grace period must wait for all pre-existing RCU read-side critical sections to complete"), stating it explicitly will be a great labor-saving device. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com>	2012-11-13 14:08:23 -08:00
Paul E. McKenney	67afeed2ca	rcu: Add new rcutorture module parameters to start/end test messages Several new rcutorture module parameters have been added, but are not printed to the console at the beginning and end of tests, which makes it difficult to reproduce a prior test. This commit therefore adds these new module parameters to the list printed at the beginning and the end of the tests. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2012-11-13 14:08:22 -08:00
Eric Dumazet	878d7439d0	rcu: Fix batch-limit size problem Commit `29c00b4a1d` (rcu: Add event-tracing for RCU callback invocation) added a regression in rcu_do_batch() Under stress, RCU is supposed to allow to process all items in queue, instead of a batch of 10 items (blimit), but an integer overflow makes the effective limit being 1. So, unless there is frequent idle periods (during which RCU ignores batch limits), RCU can be forced into a state where it cannot keep up with the callback-generation rate, eventually resulting in OOM. This commit therefore converts a few variables in rcu_do_batch() from int to long to fix this problem, along with the module parameters controlling the batch limits. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> # 3.2 +	2012-11-13 14:07:57 -08:00
Yoshihiro YUNOMAE	11043d8b12	tracing: Show raw time stamp on stats per cpu using counter or tsc mode for trace_clock Show raw time stamp values for stats per cpu if you choose counter or tsc mode for trace_clock. Although a unit of tracing time stamp is nsec in local or global mode, the units in counter and TSC mode are tracing counter and cycles respectively. Link: http://lkml.kernel.org/r/1352837903-32191-3-git-send-email-dhsharp@google.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Signed-off-by: David Sharp <dhsharp@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-13 15:49:11 -05:00
David Sharp	8be0709f10	tracing: Format non-nanosec times from tsc clock without a decimal point. With the addition of the "tsc" clock, formatting timestamps to look like fractional seconds is misleading. Mark clocks as either in nanoseconds or not, and format non-nanosecond timestamps as decimal integers. Tested: $ cd /sys/kernel/debug/tracing/ $ cat trace_clock [local] global tsc $ echo sched_switch > set_event $ echo 1 > tracing_on ; sleep 0.0005 ; echo 0 > tracing_on $ cat trace <idle>-0 [000] 6330.555552: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=29964 next_prio=120 sleep-29964 [000] 6330.555628: sched_switch: prev_comm=bash prev_pid=29964 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120 ... $ echo 1 > options/latency-format $ cat trace <idle>-0 0 4104553247us+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=29964 next_prio=120 sleep-29964 0 4104553322us+: sched_switch: prev_comm=bash prev_pid=29964 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120 ... $ echo tsc > trace_clock $ cat trace $ echo 1 > tracing_on ; sleep 0.0005 ; echo 0 > tracing_on $ echo 0 > options/latency-format $ cat trace <idle>-0 [000] 16490053398357: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=31128 next_prio=120 sleep-31128 [000] 16490053588518: sched_switch: prev_comm=bash prev_pid=31128 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120 ... echo 1 > options/latency-format $ cat trace <idle>-0 0 91557653238+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=31128 next_prio=120 sleep-31128 0 91557843399+: sched_switch: prev_comm=bash prev_pid=31128 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120 ... v2: Move arch-specific bits out of generic code. v4: Fix x86_32 build due to 64-bit division. Google-Bug-Id: 6980623 Link: http://lkml.kernel.org/r/1352837903-32191-2-git-send-email-dhsharp@google.com Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: David Sharp <dhsharp@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-13 15:48:40 -05:00
David Sharp	8cbd9cc625	tracing,x86: Add a TSC trace_clock In order to promote interoperability between userspace tracers and ftrace, add a trace_clock that reports raw TSC values which will then be recorded in the ring buffer. Userspace tracers that also record TSCs are then on exactly the same time base as the kernel and events can be unambiguously interlaced. Tested: Enabled a tracepoint and the "tsc" trace_clock and saw very large timestamp values. v2: Move arch-specific bits out of generic code. v3: Rename "x86-tsc", cleanups v7: Generic arch bits in Kbuild. Google-Bug-Id: 6980623 Link: http://lkml.kernel.org/r/1352837903-32191-1-git-send-email-dhsharp@google.com Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Signed-off-by: David Sharp <dhsharp@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-11-13 15:48:27 -05:00
John Stultz	d6ad418763	time: Kill xtime_lock, replacing it with jiffies_lock Now that timekeeping is protected by its own locks, rename the xtime_lock to jifffies_lock to better describe what it protects. CC: Thomas Gleixner <tglx@linutronix.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Richard Cochran <richardcochran@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2012-11-13 14:08:23 -05:00
Lars-Peter Clausen	f95a985781	time/jiffies: Make clocksource_jiffies static Commit `f1b8274` ("clocksource: Cleanup clocksource selection") removed all external references to clocksource_jiffies so there is no need to have the symbol globally visible. Fixes the following sparse warning: CHECK kernel/time/jiffies.c kernel/time/jiffies.c:61:20: warning: symbol 'clocksource_jiffies' was not declared. Should it be static? Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2012-11-13 14:04:51 -05:00

1 2 3 4 5 ...

14676 Commits