linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Dhaval Giani	b1a8c172c3	sched: fix !SYSFS build breakage When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build with kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1488): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1490): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1480): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1494): undefined reference to `kernel_subsys' This patch fixes this build error. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Ken Chen	908a7c1b9b	sched: fix improper load balance across sched domain We recently discovered a nasty performance bug in the kernel CPU load balancer where we were hit by a 50% performance regression. When tasks are assigned via cpu affinity to a subset of CPUs that spans sched_domains (either ccNUMA node or the new multi-core domain), the kernel fails to perform proper load balance at these domains, because several pieces of logic in find_busiest_group() misidentify the busiest sched group within a given domain. This leads to inadequate load balance and causes a 50% performance hit. To give you a concrete example, on a dual-core, 2 socket numa system, there are 4 logical cpus, organized as: CPU0 attaching sched-domain: domain 0: span 0003 groups: 0001 0002 domain 1: span 000f groups: 0003 000c CPU1 attaching sched-domain: domain 0: span 0003 groups: 0002 0001 domain 1: span 000f groups: 0003 000c CPU2 attaching sched-domain: domain 0: span 000c groups: 0004 0008 domain 1: span 000f groups: 000c 0003 CPU3 attaching sched-domain: domain 0: span 000c groups: 0008 0004 domain 1: span 000f groups: 000c 0003 If I run 2 tasks with CPU affinity set to 0x5, there are situations where cpu0 has a run queue length of 2 and cpu2 is idle. The kernel load balancer is unable to balance out these two tasks over cpu0 and cpu2 due to at least three pieces of logic in find_busiest_group() that heavily bias load balance towards power saving mode. E.g. while determining the "busiest" variable, the kernel only sets it when "sum_nr_running > group_capacity". This test is flawed in that "sum_nr_running" is not necessarily the same as sum-tasks-allowed-to-run-within-the sched-group. The end result is that the kernel "thinks" everything is balanced, but in reality we have an imbalance, causing one CPU to be over-subscribed and leaving the other idle. Two other pieces of logic in the same function cause a similar effect. The nastiness of this bug is that the kernel is not able to get unstuck from this unfortunate broken state. From what we've seen in our environment, the kernel will be stuck in the imbalanced state for an extended period of time, and it is also very easy for the kernel to get stuck in that state (it's pretty much 100% reproducible for us). So proposing the following fix: add additional logic in find_busiest_group to detect intrinsic imbalance within the busiest group. When such a condition is detected, load balance goes into spread mode instead of the default grouping mode. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Milton Miller	cd79007634	sched: more robust sd-sysctl entry freeing It occurred to me this morning that the procname field was dynamically allocated and needed to be freed. I started to put in break statements when allocation failed but it was approaching 50% error handling code. I came up with this alternative of looping while entry->mode is set and checking proc_handler instead of ->table. Alternatively, the string version of the domain name and cpu number could be stored in the structs. I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation counts after taking a cpuset exclusive and back. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Ingo Molnar	f20bf61256	time: introduce xtime_seconds improve performance of sys_time(). sys_time() returns time in seconds, but it does so by calling do_gettimeofday() and then returning the tv_sec portion of the GTOD time. But the data structure "xtime", which is updated by every timer/scheduler tick, already offers HZ granularity time. the patch improves the sysbench oltp macrobenchmark by 4-5% on an AMD dual-core system: v2.6.23: #threads 1: transactions: 4073 (407.23 per sec.) 2: transactions: 8530 (852.81 per sec.) 3: transactions: 8321 (831.88 per sec.) 4: transactions: 8407 (840.58 per sec.) 5: transactions: 8070 (806.74 per sec.) v2.6.23 + sys_time-speedup.patch: 1: transactions: 4281 (428.09 per sec.) 2: transactions: 8910 (890.85 per sec.) 3: transactions: 8659 (865.79 per sec.) 4: transactions: 8676 (867.34 per sec.) 5: transactions: 8532 (852.91 per sec.) and by 4-5% on an Intel dual-core system too: 2.6.23: 1: transactions: 4560 (455.94 per sec.) 2: transactions: 10094 (1009.30 per sec.) 3: transactions: 9755 (975.36 per sec.) 4: transactions: 9859 (985.78 per sec.) 5: transactions: 9701 (969.72 per sec.) 2.6.23 + sys_time-speedup.patch: 1: transactions: 4779 (477.84 per sec.) 2: transactions: 10103 (1010.14 per sec.) 3: transactions: 10141 (1013.93 per sec.) 4: transactions: 10371 (1036.89 per sec.) 5: transactions: 10178 (1017.50 per sec.) (the more CPUs the system has, the more speedup this patch gives for this particular workload.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 10:01:50 -07:00
Masami Hiramatsu	f438d914b2	kprobes: support kretprobe blacklist Introduce architecture dependent kretprobe blacklists to prohibit users from inserting return probes on the function in which kprobes can be inserted but kretprobes can not. This patch also removes "__kprobes" mark from "__switch_to" on x86_64 and registers "__switch_to" to the blacklist on x86-64, because that mark is to prohibit user from inserting only kretprobe. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:10 -07:00
Paul Jackson	607717a65d	cpuset: remove sched domain hooks from cpusets Remove the cpuset hooks that defined sched domains depending on the setting of the 'cpu_exclusive' flag. The cpu_exclusive flag can only be set on a child if it is set on the parent. This made that flag painfully unsuitable for use as a flag defining a partitioning of a system. It was entirely unobvious to a cpuset user what partitioning of sched domains they would be causing when they set that one cpu_exclusive bit on one cpuset, because it depended on what CPUs were in the remainder of that cpuset's siblings and child cpusets, after subtracting out other cpu_exclusive cpusets. Furthermore, there was no way on production systems to query the result. Using the cpu_exclusive flag for this was simply wrong from the get go. Fortunately, it was sufficiently borked that so far as I know, almost no successful use has been made of this. One real time group did use it to effectively isolate CPUs from any load balancing efforts. They are willing to adapt to alternative mechanisms for this, such as some way to manipulate the list of isolated CPUs on a running system. They can do without this present cpu_exclusive based mechanism while we develop an alternative. There is a real risk, to the best of my understanding, of users accidentally setting up partitioned scheduler domains, inhibiting desired load balancing across all their CPUs, due to the nonobvious (from the cpuset perspective) side effects of the cpu_exclusive flag. Furthermore, since there was no way on a running system to see what one was doing with sched domains, this change will be invisible to any using code. Unless they have real insight to the scheduler load balancing choices, they will be unable to detect that this change has been made in the kernel's behaviour. Initial discussion on lkml of this patch has generated much comment. My (probably controversial) take on that discussion is that it has reached a rough consensus that the current cpuset cpu_exclusive mechanism for defining sched domains is borked. There is no consensus on the replacement. But since we can remove this mechanism, and since its continued presence risks causing unwanted partitioning of the scheduler's load balancing, we should remove it while we can, as we proceed to work the replacement scheduler domain mechanisms. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:09 -07:00
Christoph Hellwig	0ac1555915	m32r: convert to generic sys_ptrace Convert m32r to the generic sys_ptrace. The conversion requires an architecture hook after ptrace_attach which this patch adds. The hook will also be needed for a conversion of ia64 to the generic ptrace code. Thanks to Hirokazu Takata for fixing a bug in the first version of this code. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:04 -07:00
Adam Litke	54f9f80d65	hugetlb: Add hugetlb_dynamic_pool sysctl The maximum size of the huge page pool can be controlled using the overall size of the hugetlb filesystem (via its 'size' mount option). However in the common case this will not be set as the pool is traditionally fixed in size at boot time. In order to maintain the expected semantics, we need to prevent the pool expanding by default. This patch introduces a new sysctl controlling dynamic pool resizing. When this is enabled the pool will expand beyond its base size up to the size of the hugetlb filesystem. It is disabled by default. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Dave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki	75884fb1c6	memory unplug: memory hotplug cleanup A clean up patch for "scanning memory resource [start, end)" operation. Now, find_next_system_ram() function is used in memory hotplug, but this interface is not easy to use and the code is complicated. This patch adds walk_memory_resource(start,len,arg,func) function. The function 'func' is called per valid memory resource range in [start,pfn). [pbadari@us.ibm.com: Error handling in walk_memory_resource()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:01 -07:00
Mel Gorman	e12ba74d8f	Group short-lived and reclaimable kernel allocations This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:00 -07:00
Christoph Lameter	0e1e7c7a73	Memoryless nodes: Use N_HIGH_MEMORY for cpusets cpusets try to ensure that any node added to a cpuset's mems_allowed is on-line and contains memory. The assumption was that online nodes contained memory. Thus, it is possible to add memoryless nodes to a cpuset and then add tasks to this cpuset. This results in continuous series of oom-kill and apparent system hang. Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in place of node_online_map when vetting memories. Return error if admin attempts to write a non-empty mems_allowed node mask containing only memoryless-nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:59 -07:00
Christoph Lameter	4199cfa02b	Memoryless nodes: Allow profiling data to fall back to other nodes Processors on memoryless nodes must be able to fall back to remote nodes in order to get a profiling buffer. This may lead to excessive NUMA traffic but I think we should allow this rather than failing. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:58 -07:00
Christoph Hellwig	74a0b57627	x86: optimize page faults like all other architectures and kill notifier cruft x86(-64) are the last architectures still using the page fault notifier cruft for the kprobes page fault hook. This patch converts them to the proper direct calls, and removes the now unused pagefault notifier bits as well as the cruft in kprobes.c that was related to this mess. I know Andi didn't really like this, but all other architecture maintainers agreed the direct calls are much better and besides the obvious cruft removal a common way of dealing with kprobes across architectures is important as well. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: fix sparc64] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Andi Kleen <ak@suse.de> Cc: <linux-arch@vger.kernel.org> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:50 -07:00
Mike Travis	d5a7430ddc	Convert cpu_sibling_map to be a per cpu variable Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly from startup and CPU HOTPLUG functions. Signed-off-by: Mike Travis <travis@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:50 -07:00
Randy Dunlap	bfe8df3d31	slow down printk during boot Optionally add a boot delay after each kernel printk() call, crudely measured in milliseconds, with a maximum delay of 10 seconds per printk. Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.): "lpj=loops_per_jiffy boot_delay=100" to the kernel command line. It has been useful in cases like "during boot, my machine just reboots or the screen goes black" by slowing down printk, (and adding initcall_debug), we can usually see the last thing that happened before the lights went out which is usually a valuable clue. [akpm@linux-foundation.org: not all architectures implement CONFIG_HZ] [akpm@linux-foundation.org: fix lots of stuff] [bunk@stusta.de: kernel/printk.c: make 2 variables static] [heiko.carstens@de.ibm.com: fix slow down printk on boot compile error] Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:49 -07:00
Alexey Dobriyan	1bcf548293	Consolidate PTRACE_DETACH Identical handlers of PTRACE_DETACH go into ptrace_request(). Not touching compat code. Not touching archs that don't call ptrace_request. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:49 -07:00
Linus Torvalds	f4921aff5b	Merge git://git.linux-nfs.org/pub/linux/nfs-2.6 * git://git.linux-nfs.org/pub/linux/nfs-2.6: (131 commits) NFSv4: Fix a typo in nfs_inode_reclaim_delegation NFS: Add a boot parameter to disable 64 bit inode numbers NFS: nfs_refresh_inode should clear cache_validity flags on success NFS: Fix a connectathon regression in NFSv3 and NFSv4 NFS: Use nfs_refresh_inode() in ops that aren't expected to change the inode SUNRPC: Don't call xprt_release in call refresh SUNRPC: Don't call xprt_release() if call_allocate fails SUNRPC: Fix buggy UDP transmission [23/37] Clean up duplicate includes in [2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static SUNRPC: Use correct type in buffer length calculations SUNRPC: Fix default hostname created in rpc_create() nfs: add server port to rpc_pipe info file NFS: Get rid of some obsolete macros NFS: Simplify filehandle revalidation NFS: Ensure that nfs_link() returns a hashed dentry NFS: Be strict about dentry revalidation when doing exclusive create NFS: Don't zap the readdir caches upon error NFS: Remove the redundant nfs_reval_fsid() NFSv3: Always use directory post-op attributes in nfs3_proc_lookup ... Fix up trivial conflict due to sock_owned_by_user() cleanup manually in net/sunrpc/xprtsock.c	2007-10-15 10:47:35 -07:00
Linus Torvalds	419217cb1d	Merge branch 'v2.6.24-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep * 'v2.6.24-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep: lockdep: annotate dir vs file i_mutex lockdep: per filesystem inode lock class lockdep: annotate kprobes irq fiddling lockdep: annotate rcu_read_{,un}lock{,_bh} lockdep: annotate journal_start() lockdep: s390: connect the sysexit hook lockdep: x86_64: connect the sysexit hook lockdep: i386: connect the sysexit hook lockdep: syscall exit check lockdep: fixup mutex annotations lockdep: fix mismatched lockdep_depth/curr_chain_hash lockdep: Avoid /proc/lockdep & lock_stat infinite output lockdep: maintainers	2007-10-15 10:40:41 -07:00
Ingo Molnar	9c63d9c021	sched: sync wakeups preempt too make sure sync wakeups preempt too - the scheduler will not overschedule as we've got various throttles against that. As a result, sync wakeups can be used more widely in the kernel (to signal wakeup affinity between tasks), and no arbitrary latencies will be introduced either. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:20 +02:00
Ingo Molnar	71e20f1873	sched: affine sync wakeups make sync wakeups affine for cache-cold tasks: if a cache-cold task is woken up by a sync wakeup then use the opportunity to migrate it straight away. (the two tasks are 'related' because they communicate) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Laurent Vivier	94886b84b1	sched: guest CPU accounting: maintain stats in account_system_time() modify account_system_time() to add cputime to cpustat->guest if we are running a VCPU. We add this cputime to cpustat->user instead of cpustat->system because this part of KVM code is in fact user code although it is executed in the kernel. We duplicate VCPU time between guest and user to allow an unmodified "top(1)" to display correct value. A modified "top(1)" is able to display good cpu user time and cpu guest time by subtracting cpu guest time from cpu user time. Update "gtime" in task_struct accordingly. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Laurent Vivier	9ac52315d4	sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields like for cpustat, introduce the "gtime" (guest time of the task) and "cgtime" (guest time of the task children) fields for the tasks. Modify signal_struct and task_struct. Modify /proc/<pid>/stat to display these new fields. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	6323469f9b	sched: domain sysctl fixes: add terminator comment we had an incorrect-terminator bug in sd_alloc_ctl_domain_table() before, so add a comment that documents it. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	ad1cdc1d78	sched: domain sysctl fixes: do not crash on allocation failure Now that we are calling this at runtime, a more relaxed error path is suggested. If an allocation fails, we just register the partial table, which will show empty directories. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	6382bc90f5	sched: domain sysctl fixes: unregister the sysctl table before domains Unregister and free the sysctl table before destroying domains, then rebuild and register after creating the new domains. This prevents the sysctl table from pointing to freed memory for root to write. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	97b6ea7b63	sched: domain sysctl fixes: use for_each_online_cpu() init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu variables. If the cpus_possible mask is not contiguous this will result in a crash referencing unallocated data. If the online mask is not contiguous then we would show offline cpus and miss online ones. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	5cf9f062c8	sched: domain sysctl fixes: use kcalloc() kcalloc checks for n * sizeof(element) overflows and it zeros. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Arjan van de Ven	0dbee3a6b0	Make scheduler debug file operations const In general, struct file_operations are const in the kernel, to not have false cacheline sharing and to catch bugs at compiletime with accidental writes to them. The new scheduler code introduces a new non-const one; fix this up. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Ingo Molnar	6bc1665ba7	sched: allow the immediate migration of cache-cold tasks allow the immediate migration of cache-cold tasks. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	cc367732ff	sched: debug, improve migration statistics add new migration statistics when SCHED_DEBUG and SCHEDSTATS is enabled. Available in /proc/<PID>/sched. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	2d92f22784	sched: debug: increase width of debug line increase width of debug line - in preparation of more debugging info. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Peter Zijlstra	ff56b2f015	sched: activate task_hot() only on fair-scheduled tasks activate task_hot() only for fair-scheduled tasks (i.e. disable it for RT tasks). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	da84d96176	sched: reintroduce cache-hot affinity reintroduce a simplified version of cache-hot/cold scheduling affinity. This improves performance with certain SMP workloads, such as sysbench. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	e5f32a3856	sched: speed up context-switches a bit speed up context-switches a bit by not clearing p->exec_start. (as a side-effect, this also makes p->exec_start a universal timestamp available to cache-hot estimations.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	91c234b4e3	sched: do not wakeup-preempt with SCHED_BATCH tasks do not wakeup-preempt with SCHED_BATCH tasks, their preemption is batched too, driven by the tick. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Srivatsa Vaddagiri	fb7dde37ec	sched: generate uevents for user creation/destruction Generate uevents when a user is being created/destroyed. These events can be used to configure cpu share of a new user. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	178be79348	sched: do not normalize kernel threads via SysRq-N do not normalize kernel threads via SysRq-N: the migration threads, softlockup threads, etc. might be essential for the system to function properly. So only zap user tasks. pointed out by Andi Kleen. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Andi Kleen	1666703af9	sched: remove stale comment from sched_group_set_shares() remove stale comment from sched_group_set_shares(). Function never returns -EINVAL. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	d5036e89dc	sched: clean up is_migration_thread() clean up is_migration_thread() and turn it into an inline function. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:15 +02:00
Andi Kleen	3a5e4dc12f	sched: cleanup: refactor normalize_rt_tasks Replace a particularly ugly ifdef with an inline and a new macro. Also split up the function to be easier to read. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:15 +02:00
Andi Kleen	8cbbe86dfc	sched: cleanup: refactor common code of sleep_on / wait_for_completion Refactor common code of sleep_on / wait_for_completion These functions were largely cut'n'pasted. This moves the common code into single helpers instead. Advantage is about 1k less code on x86-64 and 91 lines of code removed. It adds one function call to the non timeout version of the functions; i don't expect this to be measurable. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Andi Kleen	3a5c359a58	sched: cleanup: remove unnecessary gotos Replace loops implemented with gotos with real loops. Replace err = ...; goto x; x: return err; with return ...; No functional changes. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Ingo Molnar	d274a4cee1	sched: update comment update comment: clarify time-slices and remove obsolete tuning detail. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Mike Galbraith	95938a35c5	sched: prevent wakeup over-scheduling Prevent wakeup over-scheduling. Once a task has been preempted by a task of the same or lower priority, it becomes ineligible for repeated preemption by same until it has been ticked, or slept. Instead, the task is marked for preemption at the next tick. Tasks of higher priority still preempt immediately. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Peter Zijlstra	ce6c131131	sched: disable forced preemption by default Implement feature bit to disable forced preemption. This way it can be checked whether a workload is overscheduling or not. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Dmitry Adamushko	e62dd02ed0	sched: fix group scheduling for SCHED_BATCH The following patch (sched: disable sleeper_fairness on SCHED_BATCH) seems to break GROUP_SCHED. Although, it may be 'oops'-less due to the possibility of 'p' being always a valid address. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Zou Nan hai	ace8b3d633	sched: some proc entries are missed in sched_domain sys_ctl debug code cache_nice_tries and flags entry do not appear in proc fs sched_domain directory, because ctl_table entry is skipped. This patch fixes the issue. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Gautham R Shenoy	638e13ac37	sched: fix rt ptracer monopolizing CPU yield() in wait_task_inactive() can cause a high priority thread to be scheduled back in, and thereby loop forever while it is waiting for some lower priority thread which is unfortunately still on the runqueue. Use schedule_timeout_uninterruptible(1) instead. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Credit: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Dhaval Giani	5cb350baf5	sched: group scheduling, sysfs tunables Add tunables in sysfs to modify a user's cpu share. A directory is created in sysfs for each new user in the system. /sys/kernel/uids/<uid>/cpu_share Reading this file returns the cpu shares granted for the user. Writing into this file modifies the cpu share for the user. Only an administrator is allowed to modify a user's cpu share. Ex: # cd /sys/kernel/uids/ # cat 512/cpu_share 1024 # echo 2048 > 512/cpu_share # cat 512/cpu_share 2048 # Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Peter Zijlstra	8ca0e14ffb	sched: disable sleeper_fairness on SCHED_BATCH disable sleeper fairness for batch tasks - they are about batch processing after all. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Peter Zijlstra	810e95ccd5	sched: another wakeup_granularity fix unit mis-match: wakeup_gran was used against a vruntime Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Paul E. McKenney	a58f6f253d	sched: export cpu_clock() export cpu_clock() - the preferred API instead of sched_clock(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Ingo Molnar	00bf7bfc2e	sched: fix: move the CPU check into ->task_new_fair() noticed by Peter Zijlstra: fix: move the CPU check into ->task_new_fair(), this way we can call place_entity() and get child ->vruntime right at initial wakeup time. (without this there can be large latencies) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:14 +02:00
Ingo Molnar	0702e3ebc1	sched: cleanup: function prototype cleanups noticed by Thomas Gleixner: cleanup: function prototype cleanups - move into single line wherever possible. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Ingo Molnar	4cf86d77f5	sched: cleanup: rename task_grp to task_group cleanup: rename task_grp to task_group. No need to save two characters and 'grp' is annoying to read. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Ingo Molnar	06877c33fe	sched: cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG, to make SCHED_FEAT_ names more consistent. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	a65914b365	sched: kfree(NULL) is valid kfree(NULL) is valid. pointed out by checkpatch.pl. the fix shrinks the code a bit: text data bss dec hex filename 40024 3842 100 43966 abbe sched.o.before 40002 3842 100 43944 aba8 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	8927f49479	sched: style cleanup fix up __setup() style bug - noticed via checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	26797a34a2	sched: break out if printing a warning in sched_domain_debug() checkpatch.pl and Andy Whitcroft noticed the following bug: we did not break out after printing an error. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	3e9830dcab	sched: run sched_domain_debug() if CONFIG_SCHED_DEBUG=y run sched_domain_debug() if CONFIG_SCHED_DEBUG=y, instead of relying on the hand-crafted SCHED_DOMAIN_DEBUG switch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Dmitry Adamushko	a2a2d68073	sched: cleanup, make dequeue_entity() and update_stats_wait_end() similar make dequeue_entity() / enqueue_entity() and update_stats_dequeue() / update_stats_enqueue() look similar, structure-wise. zero effect, functionality-wise: text data bss dec hex filename 34550 3026 100 37676 932c sched.o.before 34550 3026 100 37676 932c sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Dmitry Adamushko	a03c9061d9	sched: cleanup, remove calc_weighted() remove obsolete code -- calc_weighted() Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Dmitry Adamushko	a4ec24b48d	sched: tidy up SCHED_RR - make timeslices of SCHED_RR tasks constant and not dependent on task's static_prio [1] ; - remove obsolete code (timeslice related bits); - make sched_rr_get_interval() return something more meaningful [2] for SCHED_OTHER tasks. [1] according to the following link, it's not compliant with SUSv3 (not sure though, what is the reference for us :-) http://lkml.org/lkml/2007/3/7/656 [2] the interval is dynamic and can be depicted as follows "should a task be one of the runnable tasks at this particular moment, it would expect to run for this interval of time before being re-scheduled by the scheduler tick". (i.e. it's more precise if a task is runnable at the moment) yeah, this seems to require task_rq_lock/unlock() but this is not a hot path. results: (SCHED_FIFO) dimm@earth:~/storage/prog$ sudo chrt -f 10 ./rr_interval time_slice: 0 : 0 (SCHED_RR) dimm@earth:~/storage/prog$ sudo chrt 10 ./rr_interval time_slice: 0 : 99984800 (SCHED_NORMAL) dimm@earth:~/storage/prog$ ./rr_interval time_slice: 0 : 19996960 (SCHED_NORMAL + a cpu_hog of similar 'weight' on the same CPU --- so should be a half of the previous result) dimm@earth:~/storage/prog$ taskset 1 ./rr_interval time_slice: 0 : 9998480 Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Alexey Dobriyan	a9957449b0	sched: uninline scheduler * save ~300 bytes * activate_idle_task() was moved to avoid a warning bloat-o-meter output: add/remove: 6/0 grow/shrink: 0/16 up/down: 438/-733 (-295) <=== function old new delta __enqueue_entity - 165 +165 finish_task_switch - 110 +110 update_curr_rt - 79 +79 __load_balance_iterator - 32 +32 __task_rq_unlock - 28 +28 find_process_by_pid - 24 +24 do_sched_setscheduler 133 123 -10 sys_sched_rr_get_interval 176 165 -11 sys_sched_getparam 156 145 -11 normalize_rt_tasks 482 470 -12 sched_getaffinity 112 99 -13 sys_sched_getscheduler 86 72 -14 sched_setaffinity 226 212 -14 sched_setscheduler 666 642 -24 load_balance_start_fair 33 9 -24 load_balance_next_fair 33 9 -24 dequeue_task_rt 133 67 -66 put_prev_task_rt 97 28 -69 schedule_tail 133 50 -83 schedule 682 594 -88 enqueue_entity 499 366 -133 task_new_fair 317 180 -137 Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	155bb293ae	sched: tweak wakeup granularity tweak wakeup granularity. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	1e81995066	sched: optimize schedule() a bit on SMP optimize schedule() a bit on SMP, by moving the rq-clock update outside the rq lock. code size is the same: text data bss dec hex filename 25725 2666 96 28487 6f47 sched.o.before 25725 2666 96 28487 6f47 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:13 +02:00
Dmitry Adamushko	08ec3df510	sched: fix __pick_next_entity() The thing is that __pick_next_entity() must never be called when first_fair(cfs_rq) == NULL. It wouldn't be a problem, should 'run_node' be the very first field of 'struct sched_entity' (and it's the second). The 'nr_running != 0' check is _not_ enough, due to the fact that 'current' is not within the tree. Generic paths are ok (e.g. schedule() as put_prev_task() is called previously)... I'm more worried about e.g. migration_call() -> CPU_DEAD_FROZEN -> migrate_dead_tasks()... if 'current' == rq->idle, no problems.. if it's one of the SCHED_NORMAL tasks (or imagine, some other use-cases in the future -- i.e. we should not make outer world dependent on internal details of sched_fair class) -- it may be "Houston, we've got a problem" case. it's +16 bytes to the ".text". Another variant is to make 'run_node' the first data member of 'struct sched_entity' but an additional check (se != NULL) is still needed in pick_next_entity(). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:13 +02:00
Ingo Molnar	647e7cac2d	sched: vslice fixups for non-0 nice levels Make vslice accurate wrt nice levels, and add some comments while we're at it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:13 +02:00
Ingo Molnar	3a25201572	sched: whitespace cleanups more whitespace cleanups. No code changed: text data bss dec hex filename 26553 2790 288 29631 73bf sched.o.before 26553 2790 288 29631 73bf sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Ingo Molnar	5522d5d5f7	sched: mark scheduling classes as const mark scheduling classes as const. This speeds up the code a bit and shrinks it: text data bss dec hex filename 40027 4018 292 44337 ad31 sched.o.before 40190 3842 292 44324 ad24 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri	b9fa3df33f	sched: group scheduler, fix latency There is a possibility that, because a task of a group moves from one cpu to another, it may gain more cpu time than desired. See http://marc.info/?l=linux-kernel&m=119073197730334 for details. This is an attempt to fix that problem. Basically it simulates dequeue of higher level entities as if they are going to sleep. Similarly it simulates wakeup of higher level entities as if they are waking up from sleep. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri	fad095a7b9	sched: group scheduler, fix bloat Recent fix to check_preempt_wakeup() to check for preemption at higher levels caused a size bloat for !CONFIG_FAIR_GROUP_SCHED. Fix the problem. 42277 10598 320 53195 cfcb kernel/sched.o-before_this_patch 42216 10598 320 53134 cf8e kernel/sched.o-after_this_patch Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri	fb615581c7	sched: group scheduler, fix coding style issues Fix coding style issues reported by Randy Dunlap and others Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Ingo Molnar	b39c5dd7f9	sched: cleanup, remove stale comment cleanup, remove stale comment. Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Peter Zijlstra	5f6d858ecc	sched: speed up and simplify vslice calculations speed up and simplify vslice calculations. [ From: Mike Galbraith <efault@gmx.de>: build fix ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:12 +02:00
Peter Zijlstra	b0ffd246ea	sched: clean up min_vruntime use clean up min_vruntime use. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri	2830cf8c90	sched: group scheduler SMP migration fix group scheduler SMP migration fix: use task_cfs_rq(p) to get to the relevant fair-scheduling runqueue of a task, rq->cfs is not the right one. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:12 +02:00
Ingo Molnar	2d72376b3a	sched: clean up schedstats, cnt -> count rename all 'cnt' fields and variables to the less yucky 'count' name. yuckage noticed by Andrew Morton. no change in code, other than the /proc/sched_debug bkl_count string got a bit larger: text data bss dec hex filename 38236 3506 24 41766 a326 sched.o.before 38240 3506 24 41770 a32a sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Dmitry Adamushko	2b1e315dd2	sched: yield fix fix yield bugs due to the current-not-in-rbtree changes: the task is not in the rbtree so rbtree-removal is a no-no. [ From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>: build fix. ] also, nice code size reduction: kernel/sched.o: text data bss dec hex filename 38323 3506 24 41853 a37d sched.o.before 38236 3506 24 41766 a326 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri	8651a86c34	sched: group scheduler wakeup latency fix group scheduler wakeup latency fix: when checking for preemption we must check cross-group too, not just intra-group. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:12 +02:00
Ingo Molnar	57cb499df2	sched: remove set_leftmost() Lee Schermerhorn noticed that set_leftmost() contains dead code, remove this. Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:11 +02:00
Hiroshi Shimamoto	2ddbf95250	sched: clean up sched_fork() Adjusting sched_class is a missing part of the already existing "do not leak PI boosting priority to the child" logic in sched_fork(). This patch moves the sched_class adjustment from wake_up_new_task() to sched_fork(). this also shrinks the code a bit: text data bss dec hex filename 40111 4018 292 44421 ad85 sched.o.before 40102 4018 292 44412 ad7c sched.o.after Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:11 +02:00
Peter Zijlstra	368059a977	sched: max_vruntime() simplification max_vruntime() simplification. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Ingo Molnar	02e4bac2a5	sched: fix sched_fork() fix sched_fork(): large latencies at new task creation time because the ->vruntime was not fixed up cross-CPU, if the parent got migrated after the child's CPU got set up. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:11 +02:00
Ingo Molnar	b8487b9241	sched: fix sign check error in place_entity() fix sign check error in place_entity() - we'd get excessive latencies due to negatives being converted to large u64's. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Ingo Molnar	94359f05cb	sched: undo some of the recent changes undo some of the recent changes that are not needed after all, such as last_min_vruntime. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Ingo Molnar	dc1f31c90c	sched: remove last_min_vruntime effect remove last_min_vruntime use - prepare to remove it. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Ingo Molnar	785c29ef95	sched: remove condition from set_task_cpu() remove condition from set_task_cpu(). Now that ->vruntime is not global anymore, it should (and does) work fine without it too. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Ingo Molnar	8465e792e8	sched: entity_key() fix entity_key() fix - we'd occasionally end up with a 0 vruntime in the !initial case. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Peter Zijlstra	ddc9729750	sched debug: check spread debug feature: check how well we schedule within a reasonable vruntime 'spread' range. (note that CPU overload can increase the spread, so this is not a hard condition, but normal loads should be within the spread.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:10 +02:00
Ingo Molnar	d822ceceda	sched debug: more width for parameter printouts more width for parameter printouts in /proc/sched_debug. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Peter Zijlstra	67e9fb2a39	sched: add vslice add vslice: the load-dependent "virtual slice" a task should run ideally, so that the observed latency stays within the sched_latency window. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Ingo Molnar	1aa4731eff	sched debug: print settings print the current value of all tunables in /proc/sched_debug output. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Ingo Molnar	c18b8a7cbc	sched: remove unneeded tunables remove unneeded tunables. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
S.Caglar Onur	fdd71d132b	sched debug: BKL usage statistics, fix build fix for the SCHED_DEBUG && !SCHEDSTATS case. Signed-off-by: S.Ceglar Onur <caglar@pardus.org.tr> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Ingo Molnar	b8efb56172	sched debug: BKL usage statistics add per task and per rq BKL usage statistics. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Srivatsa Vaddagiri	24e377a832	sched: add fair-user scheduler Enable user-id based fair group scheduling. This is useful for anyone who wants to test the group scheduler w/o having to enable CONFIG_CGROUPS. A separate scheduling group (i.e struct task_grp) is automatically created for every new user added to the system. Upon uid change for a task, it is made to move to the corresponding scheduling group. A /proc tunable (/proc/root_user_share) is also provided to tune root user's quota of cpu bandwidth. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri	9b5b77512d	sched: clean up code under CONFIG_FAIR_GROUP_SCHED With the view of supporting user-id based fair scheduling (and not just container-based fair scheduling), this patch renames several functions and makes them independent of whether they are being used for container or user-id based fair scheduling. Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating less-sized array for tg->cfs_rq[] and tg->se[]). Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri	75c28ace9f	sched: print &rq->cfs stats - Print &rq->cfs statistics as well (useful for group scheduling) Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri	545f3b1815	sched: print nr_running and load in /proc/sched_debug - print nr_running and load information for cfs_rq in /proc/sched_debug Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri	72ea22f8fb	sched: fix minor bug in yield - fix a minor bug in yield (seen for CONFIG_FAIR_GROUP_SCHED), group scheduling would skew when yield was called. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Srivatsa Vaddagiri	83b699ed20	sched: revert recent removal of set_curr_task() Revert removal of set_curr_task. Use put_prev_task/set_curr_task when changing groups/policies Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:08 +02:00
Ingo Molnar	edcb60a309	sched: kernel/sched_fair.c whitespace cleanups some trivial whitespace cleanups. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Mike Galbraith	c86da3a3d4	sched: fix formatting of /proc/sched_debug fix formatting of /proc/sched_debug Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Ingo Molnar	ef83a5714d	sched: enhance debug output enhance debug output by changing 12345678 nsecs to 12.345678 output, this is more human-readable. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Ingo Molnar	1a75b94f7b	sched: prettify /proc/sched_debug output print the correct amount of dashes in /proc/sched_debug. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	f6b53205e1	sched: rework enqueue/dequeue_entity() to get rid of set_curr_task() rework enqueue/dequeue_entity() to get rid of sched_class::set_curr_task(). This simplifies sched_setscheduler(), rt_mutex_setprio() and sched_move_tasks(). text data bss dec hex filename 24330 2734 20 27084 69cc sched.o.before 24233 2730 20 26983 6967 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	4530d7ab0f	sched: simplify sched_class::yield_task() the 'p' (task_struct) parameter in the sched_class :: yield_task() is redundant as the caller is always the 'current'. Get rid of it. text data bss dec hex filename 24341 2734 20 27095 69d7 sched.o.before 24330 2734 20 27084 69cc sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	87fefa381e	sched: optimize task_new_fair() due to the fact that we no longer keep the 'current' within the tree, dequeue/enqueue_entity() is useless for the 'current' in task_new_fair(). We are about to reschedule and sched_class->put_prev_task() will put the 'current' back into the tree, based on its new key. text data bss dec hex filename 24388 2734 20 27142 6a06 sched.o.before 24341 2734 20 27095 69d7 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Ingo Molnar	75d4ef16a6	sched: fix delay accounting performance regression fix delay accounting performance regression - those sched_clock() calls are not needed. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	30cfdcfc5f	sched: do not keep current in the tree and get rid of sched_entity::fair_key Get rid of 'sched_entity::fair_key'. As a side effect, 'current' is not kept within the tree for SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code (e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes them (e.g. a single update_curr() now vs. dequeue/enqueue() before in entity_tick()). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Dmitry Adamushko	7074badbcb	sched: add set_curr_task() calls p->sched_class->set_curr_task() has to be called before activate_task()/enqueue_task() in rt_mutex_setprio(), sched_setscheduler() and sched_move_task() in order to set up 'cfs_rq->curr'. The logic of enqueueing depends on whether a task to be inserted is 'current' or not. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Dmitry Adamushko	d02e5ed8d5	sched: sched_setscheduler() fix Fix a problem in the 'sched-group' patch for !CONFIG_FAIR_GROUP_SCHED. description: sched_setscheduler() { ... if (task_running()) p->sched_class->put_prev_entity(); [ this one sets up cfs_rq->curr to NULL ] ... if (task_running) p->sched_class->set_curr_task(); [ and this one is a _NOP_ (empty) for !CONFIG_FAIR_GROUP_SCHED ] As a result, the task continues to run with cfs_rq->curr == NULL... no crashes (due to checks for !NULL in place) but e.g. update_curr() effectively becomes a NOP... i.e. runtime statistics for this task are not accounted until it's rescheduled anew. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Srivatsa Vaddagiri	29f59db3a7	sched: group-scheduler core Add interface to control cpu bandwidth allocation to task-groups. (not yet configurable, due to missing CONFIG_CONTAINERS) Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:07 +02:00
Mike Galbraith	119fe5e068	sched: fix SMP migration latencies fix SMP migration latencies: the vruntimes of different CPUs are at incompatible offsets so they have to be fixed up when migrating a task across CPUs. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Peter Zijlstra	02e0431a3d	sched: better min_vruntime tracking Better min_vruntime tracking: update it every time 'curr' is updated - not just when a task is enqueued into the tree. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Dmitry Adamushko	db36cc7d6d	sched: clean up schedstat block in dequeue_entity() Better placement of #ifdef CONFIG_SCHEDSTAT block in dequeue_entity(). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Ingo Molnar	bbdba7c0e1	sched: remove wait_runtime fields and features remove wait_runtime based fields and features, now that the CFS math has been changed over to the vruntime metric. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Ingo Molnar	e22f5bbf86	sched: remove wait_runtime limit remove the wait_runtime-limit fields and the code depending on it, now that the math has been changed over to rely on the vruntime metric. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Dmitry Adamushko	495eca494a	sched: clean up struct load_stat 'struct load_stat' is redundant now so let's get rid of it. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Ingo Molnar	7a62eabc4d	sched: debug: update exec_clock only when SCHED_DEBUG micro-optimization: update cfs_rq->exec_clock only if CONFIG_SCHED_DEBUG=y. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Ingo Molnar	86d9560cb6	sched: add more vruntime statistics add more vruntime statistics. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Peter Zijlstra	9014623c0e	sched: handle vruntime 64-bit overflow Handle vruntime overflow by centering the key space around min_vruntime. ( otherwise we could overflow 64-bit vruntime in a few days with SCHED_IDLE tasks - or in a few years with nice +19. ) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:05 +02:00
Peter Zijlstra	94dfb5e75e	sched: add tree based averages add support for tree based vruntime averages. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:05 +02:00
Ingo Molnar	28a1f6fa2f	sched: remove SCHED_FEAT_SKIP_INITIAL remove SCHED_FEAT_SKIP_INITIAL - it was off by default and even when enabled it never made any real difference. Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:05 +02:00
Ingo Molnar	67e12eac32	sched: add se->vruntime debugging debug se->vruntime fields. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-10-15 17:00:05 +02:00
Peter Zijlstra	aeb73b0403	sched: clean up new task placement clean up new task placement. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-10-15 17:00:05 +02:00
Ingo Molnar	2e09bf556f	sched: wakeup granularity increase increase wakeup granularity - we were overscheduling a bit. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-10-15 17:00:05 +02:00
Ingo Molnar	5c6b5964a0	sched: simplify check_preempt() methods simplify the check_preempt() methods. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-10-15 17:00:05 +02:00
Peter Zijlstra	6d0f0ebd06	sched: simplify adaptive latency simplify adaptive latency. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:05 +02:00
Peter Zijlstra	4d78e7b656	sched: new task placement for vruntime add proper new task placement for the vruntime based math too. ( note: introduces a swap() macro, but the swap token is too widely used in the kernel namespace for a generic version to be added without changing non-scheduler code - so this cleanup will be done separately. ) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	6cb5819514	sched: optimize vruntime based scheduling optimize vruntime based scheduling. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	bf5c91ba8c	sched: move sched_feat() definitions move sched_feat() definitions so that it can be used sooner by generic code too. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	e9acbff648	sched: introduce se->vruntime introduce se->vruntime as a sum of weighted delta-exec's, and use that as the key into the tree. the idea to use absolute virtual time as the basic metric of scheduling has been first raised by William Lee Irwin, advanced by Tong Li and first prototyped by Roman Zippel in the "Really Fair Scheduler" (RFS) patchset. also see: http://lkml.org/lkml/2007/9/2/76 for a simpler variant of this patch. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
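The idea described in the commit above can be sketched in plain C. This is a minimal illustration, not the kernel's code: it assumes the weighting scales each slice of execution time inversely to the entity's load weight, relative to the nice-0 weight of 1024. The names `sketch_entity`, `weighted_delta`, `account_exec` and `runs_before` are hypothetical.

```c
#include <stdint.h>

/* Load weight of a nice-0 task in the kernel's weight table. */
#define NICE_0_LOAD 1024ULL

/* Hypothetical stand-in for a scheduling entity. */
struct sketch_entity {
    uint64_t weight;    /* load weight; larger means a bigger CPU share */
    uint64_t vruntime;  /* running sum of weighted delta-exec values, in ns */
};

/* Scale a slice of real execution time by the inverse of the weight. */
static uint64_t weighted_delta(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_LOAD / weight;
}

/* Charge delta_exec_ns of real runtime to the entity. */
static void account_exec(struct sketch_entity *se, uint64_t delta_exec_ns)
{
    se->vruntime += weighted_delta(delta_exec_ns, se->weight);
}

/* Ordering used as the tree key: the smaller vruntime sorts first.
 * The signed difference keeps the comparison valid across wraparound. */
static int runs_before(const struct sketch_entity *a,
                       const struct sketch_entity *b)
{
    return (int64_t)(a->vruntime - b->vruntime) < 0;
}
```

With these assumptions, an entity of weight 2048 that runs for 1 ms accumulates only half the virtual time of a nice-0 entity that ran just as long, so it sorts ahead of it and is picked again sooner.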
Ingo Molnar	08e2388aa1	sched: clean up calc_weighted() clean up calc_weighted() - we always use the normalized shift so it's not needed to pass that in. Also, push the non-nice0 branch into the function. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	1091985b48	sched: speed up update_load_add/_sub() speed up update_load_add/_sub() by not delaying the division - this reduces CPU pipeline dependencies. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	19ccd97a03	sched: uninline __enqueue_entity()/__dequeue_entity() suggested by Roman Zippel: uninline __enqueue_entity() and __dequeue_entity(). this reduces code size: text data bss dec hex filename 25385 2386 16 27787 6c8b sched.o.before 25257 2386 16 27659 6c0b sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Peter Zijlstra	e59c80c5bb	sched: simplify SCHED_FEAT_* code Peter Zijlstra suggested to simplify SCHED_FEAT_* checks via the sched_feat(x) macro. No code changed: text data bss dec hex filename 38895 3550 24 42469 a5e5 sched.o.before 38895 3550 24 42469 a5e5 sched.o.after Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	429d43bcc0	sched: cleanup: simplify cfs_rq_curr() methods cleanup: simplify cfs_rq_curr() methods - now that the cfs_rq->curr pointer is unconditionally present, remove the wrappers. kernel/sched.o: text data bss dec hex filename 11784 224 2012 14020 36c4 sched.o.before 11784 224 2012 14020 36c4 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	62160e3f4a	sched: track cfs_rq->curr on !group-scheduling too Noticed by Roman Zippel: use cfs_rq->curr in the !group-scheduling case too. Small micro-optimization and cleanup effect: text data bss dec hex filename 36269 3482 24 39775 9b5f sched.o.before 36177 3486 24 39687 9b07 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	53df556e06	sched: remove precise CPU load calculations #2 continued removal of precise CPU load calculations. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	a25707f3ae	sched: remove precise CPU load CPU load calculations are statistical anyway, and there's little benefit from having it calculated on every scheduling event. So remove this code, it gets rid of a divide from the scheduler wakeup and context-switch fastpath. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	8ebc91d936	sched: remove stat_gran remove the stat_gran code - it was disabled by default and it causes unnecessary overhead. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	2bd8e6d422	sched: use constants if !CONFIG_SCHED_DEBUG use constants if !CONFIG_SCHED_DEBUG. this speeds up the code and reduces code-size: text data bss dec hex filename 27464 3014 16 30494 771e sched.o.before 26929 3010 20 29959 7507 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	38ad464d41	sched: uniform tunings use the same defaults on both UP and SMP. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	eba1ed4b7e	sched: debug: track maximum 'slice' track the maximum amount of time a task has executed while the CPU load was at least 2x. (i.e. at least two nice-0 tasks were runnable) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	a4b29ba2f7	sched: small sched_debug cleanup small kernel/sched_debug.c cleanup - break up multi-variable assignment. no code changed: text data bss dec hex filename 38869 3550 24 42443 a5cb sched.o.before 38869 3550 24 42443 a5cb sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Matthias Kaehlcke	2e45874c5a	sched: use list_for_each_entry_safe() in __wake_up_common() Use list_for_each_entry_safe() instead of list_for_each_safe() in __wake_up_common() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
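The difference between the two iteration macros named in the commit above can be sketched in userspace. The macros below are simplified copies modelled on the kernel's `list.h` (they rely on the GNU `__typeof__` extension), and `struct waiter`, `wake_all_old` and `wake_all_new` are hypothetical stand-ins rather than the real `__wake_up_common()` code.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Older style: iterate over raw list nodes; the caller converts each one. */
#define list_for_each_safe(pos, n, head) \
    for (pos = (head)->next, n = pos->next; pos != (head); \
         pos = n, n = pos->next)

/* Newer style: iterate directly over the containing structures. */
#define list_for_each_entry_safe(pos, n, head, member) \
    for (pos = container_of((head)->next, __typeof__(*pos), member), \
         n = container_of(pos->member.next, __typeof__(*pos), member); \
         &pos->member != (head); \
         pos = n, n = container_of(n->member.next, __typeof__(*n), member))

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

static void list_del(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
}

/* Hypothetical wait-queue entry. */
struct waiter {
    int id;
    struct list_head task_list;
};

/* Before: an explicit container_of() is needed inside the loop body. */
static int wake_all_old(struct list_head *head)
{
    struct list_head *tmp, *next;
    int woken = 0;

    list_for_each_safe(tmp, next, head) {
        struct waiter *curr = container_of(tmp, struct waiter, task_list);
        list_del(&curr->task_list);   /* safe: next was saved beforehand */
        woken++;
    }
    return woken;
}

/* After: the macro hands out struct waiter pointers directly. */
static int wake_all_new(struct list_head *head)
{
    struct waiter *curr, *next;
    int woken = 0;

    list_for_each_entry_safe(curr, next, head, task_list) {
        list_del(&curr->task_list);
        woken++;
    }
    return woken;
}
```

Both loops tolerate removal of the current entry because the successor is fetched before the body runs; the entry variant only removes the conversion boilerplate.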
Ingo Molnar	bb61c21083	sched: resched task in task_new_fair() to get full child-runs-first semantics make sure the parent is rescheduled. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	44142fac34	sched: fix sysctl_sched_child_runs_first flag fix the sched_child_runs_first flag: always call into ->task_new() if we are on the same CPU, as SCHED_OTHER tasks depend on it for correct initial setup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:01 +02:00
Thomas Gleixner	1595f452f3	clockevents: introduce force broadcast notifier The 64bit SMP bootup is slightly different to the 32bit one. It enables the boot CPU local APIC timer before all CPUs are brought up. Some AMD C1E systems have the C1E feature flag only set in the secondary CPU. Due to the early enable of the boot CPU local APIC timer the APIC timer is registered as a fully functional device. When we detect the wreckage during the bringup of the secondary CPU, we need to force the boot CPU into broadcast mode. Add a new notifier reason and implement the force broadcast in the clock events layer. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-14 22:57:45 +02:00
Al Viro	5ba253313d	more low-hanging fruits - kernel, fs, lib signedness Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-14 12:41:52 -07:00
Al Viro	2b8232ce51	minimal build fixes for uml (fallout from x86 merge) a) include/asm-um/arch can't just point to include/asm-$(SUBARCH) now b) arch/{i386,x86_64}/crypto are merged now c) subarch-obj needed changes d) cpufeature_64.h should pull "cpufeature_32.h", not <asm/cpufeature_32.h> since it can be included from asm-um/cpufeature.h e) in case of uml-i386 we need CONFIG_X86_32 for make and gcc, but not for Kconfig f) sysctl.c shouldn't do vdso_enabled for uml-i386 (actually, that one should be registered from corresponding arch//kernel/, with ifdef going away; that's a separate patch, though). With that and with Stephen's patch ("[PATCH net-2.6] uml: hard_header fix") we have uml allmodconfig building both on i386 and amd64. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-13 09:57:15 -07:00
Venki Pallipadi	4a93232dab	clock events: allow replacement of broadcast timer Change the broadcast timer, if a timer with higher rating becomes available. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-12 23:04:23 +02:00
Thomas Gleixner	c8a1d398de	clockevents: fix periodic broadcast for oneshot devices The next_event member of the clock event device is used to keep track of the next periodic event. For one shot only devices it is wrong to clear the variable, as the next event will be based on it. Pointed out by Ralf Baechle Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2007-10-12 23:04:06 +02:00
Thomas Gleixner	de68d9b173	clockevents: Allow build w/o run-tine usage for migration purposes Migration aid to allow preparatory patches which introduce not yet used parts of clock events code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2007-10-12 23:04:05 +02:00
Linus Torvalds	e86908614f	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (408 commits) [POWERPC] Add memchr() to the bootwrapper [POWERPC] Implement logging of unhandled signals [POWERPC] Add legacy serial support for OPB with flattened device tree [POWERPC] Use 1TB segments [POWERPC] XilinxFB: Allow fixed framebuffer base address [POWERPC] XilinxFB: Add support for custom screen resolution [POWERPC] XilinxFB: Use pdata to pass around framebuffer parameters [POWERPC] PCI: Add 64-bit physical address support to setup_indirect_pci [POWERPC] 4xx: Kilauea defconfig file [POWERPC] 4xx: Kilauea DTS [POWERPC] 4xx: Add AMCC Kilauea eval board support to platforms/40x [POWERPC] 4xx: Add AMCC 405EX support to cputable.c [POWERPC] Adjust TASK_SIZE on ppc32 systems to 3GB that are capable [POWERPC] Use PAGE_OFFSET to tell if an address is user/kernel in SW TLB handlers [POWERPC] 85xx: Enable FP emulation in MPC8560 ADS defconfig [POWERPC] 85xx: Killed <asm/mpc85xx.h> [POWERPC] 85xx: Add cpm nodes for 8541/8555 CDS [POWERPC] 85xx: Convert mpc8560ads to the new CPM binding. [POWERPC] mpc8272ads: Remove muram from the CPM reg property. [POWERPC] Make clockevents work on PPC601 processors ... Fixed up conflict in Documentation/powerpc/booting-without-of.txt manually.	2007-10-11 21:55:47 -07:00
Olof Johansson	d0c3d534a4	[POWERPC] Implement logging of unhandled signals Implement show_unhandled_signals sysctl + support to print when a process is killed due to unhandled signals just as i386 and x86_64 do. Default to having it off, unlike x86, which defaults to on. Signed-off-by: Olof Johansson <olof@lixom.net> Signed-off-by: Paul Mackerras <paulus@samba.org>	2007-10-12 14:05:18 +10:00
Linus Torvalds	038a5008b2	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (867 commits) [SKY2]: status polling loop (post merge) [NET]: Fix NAPI completion handling in some drivers. [TCP]: Limit processing lost_retrans loop to work-to-do cases [TCP]: Fix lost_retrans loop vs fastpath problems [TCP]: No need to re-count fackets_out/sacked_out at RTO [TCP]: Extract tcp_match_queue_to_sack from sacktag code [TCP]: Kill almost unused variable pcount from sacktag [TCP]: Fix mark_head_lost to ignore R-bit when trying to mark L [TCP]: Add bytes_acked (ABC) clearing to FRTO too [IPv6]: Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493, try2 [NETFILTER]: x_tables: add missing ip6t_modulename aliases [NETFILTER]: nf_conntrack_tcp: fix connection reopening [QETH]: fix qeth_main.c [NETLINK]: fib_frontend build fixes [IPv6]: Export userland ND options through netlink (RDNSS support) [9P]: build fix with !CONFIG_SYSCTL [NET]: Fix dev_put() and dev_hold() comments [NET]: make netlink user -> kernel interface synchronious [NET]: unify netlink kernel socket recognition [NET]: cleanup 3rd argument in netlink_sendskb ... Fix up conflicts manually in Documentation/feature-removal-schedule.txt and my new least favourite crap, the "mod_devicetable" support in the files include/linux/mod_devicetable.h and scripts/mod/file2alias.c. (The latter files seem to be explicitly _designed_ to get conflicts when different subsystems work with them - that have an absolutely horrid lack of subsystem separation!) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-11 19:40:14 -07:00
Peter Zijlstra	851a67b825	lockdep: annotate rcu_read_{,un}lock{,_bh} lockdep annotate rcu_read_{,un}lock{,_bh} in order to catch imbalanced usage. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2007-10-11 22:11:12 +02:00
Peter Zijlstra	b351d164e8	lockdep: syscall exit check Provide a check to validate that we do not hold any locks when switching back to user-space. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-11 22:11:12 +02:00
Peter Zijlstra	e4564f79d4	lockdep: fixup mutex annotations The fancy mutex_lock fastpath has too many indirections to track the caller hence all contentions are perceived to come from mutex_lock(). Avoid this by explicitly not using the fastpath code (it was disabled already anyway). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-11 22:11:12 +02:00
Gregory Haskins	3aa416b07f	lockdep: fix mismatched lockdep_depth/curr_chain_hash It is possible for the current->curr_chain_key to become inconsistent with the current index if the chain fails to validate. The end result is that future lock_acquire() operations may inadvertently fail to find a hit in the cache resulting in a new node being added to the graph for every acquire. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-11 22:11:11 +02:00
Tim Pepper	94c61c0aef	lockdep: Avoid /proc/lockdep & lock_stat infinite output Both /proc/lockdep and /proc/lock_stat output may loop infinitely. When a read() requests an amount of data smaller than the amount of data that the seq_file's foo_show() outputs, the output starts looping and outputs the "stuck" element's data infinitely. There may be multiple sequential calls to foo_start(), foo_next()/foo_show(), and foo_stop() for a single open with sequential read of the file. The _start() does not have to start with the 0th element and _show() might be called multiple times in a row for the same element for a given open/read of the seq_file. Also header output should not be happening in _start(). All output should be in _show(), which SEQ_START_TOKEN is meant to help. Having output in _start() may also negatively impact seq_file's seq_read() and traverse() accounting. Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ingo Molnar <mingo@elte.hu> Cc: Al Viro <viro@ftp.linux.org.uk>	2007-10-11 22:11:11 +02:00
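The iterator contract described in the commit above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the seq_file API itself: `demo_start`, `demo_next`, `demo_show` and the `items` table are hypothetical, and only the `SEQ_START_TOKEN` value mirrors the kernel's definition. The header is emitted from the show step when it sees the start token, and the start step can resume at any position.

```c
#include <stdio.h>
#include <stddef.h>

/* Same value the kernel uses to mark the "header" position. */
#define SEQ_START_TOKEN ((void *)1)

#define N_ITEMS 3L
static const char *const items[N_ITEMS] = { "lock_a", "lock_b", "lock_c" };

/* May be called with any *pos, not only 0: position 0 is the header,
 * position k (k >= 1) is element k - 1. Produces no output itself. */
static void *demo_start(long *pos)
{
    if (*pos == 0)
        return SEQ_START_TOKEN;
    if (*pos >= 1 && *pos <= N_ITEMS)
        return (void *)&items[*pos - 1];
    return NULL;
}

/* Advance the position and return the element found there, if any. */
static void *demo_next(void *v, long *pos)
{
    (void)v;
    ++*pos;
    return demo_start(pos);
}

/* All output happens here, including the header. Calling it twice for
 * the same element is harmless because it keeps no state. */
static int demo_show(char *buf, size_t len, void *v)
{
    if (v == SEQ_START_TOKEN)
        return snprintf(buf, len, "name\n");
    return snprintf(buf, len, "%s\n", *(const char *const *)v);
}
```

Because the position alone determines what is shown, a reader that restarts mid-stream with a small buffer resumes at the right element instead of repeating one forever.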
Denis V. Lunev	cd40b7d398	[NET]: make netlink user -> kernel interface synchronious This patch makes processing of netlink user -> kernel messages synchronous. This change was inspired by a talk with Alexey Kuznetsov about current netlink message processing. He says that he was badly wrong when he introduced asynchronous user -> kernel communication. The call netlink_unicast is the only path to send a message to the kernel netlink socket. But, unfortunately, it is also used to send data to the user. Before this change the user message was attached to the socket queue and sk->sk_data_ready was called. The process was blocked until all pending messages were processed. The bad thing is that this processing may occur in an arbitrary process context. This patch changes the nlk->data_ready callback to get 1 skb and forces packet processing right in netlink_unicast. The kernel -> user path in netlink_unicast remains untouched. EINTR processing in netlink_run_queue was changed. It forces an rtnl_lock drop, but the process remains in the cycle until the message is fully processed. So, there is no need to use this kludge now. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 21:15:29 -07:00
Eric W. Biederman	9dd776b6d7	[NET]: Add network namespace clone & unshare support. This patch allows you to create a new network namespace using sys_clone, or sys_unshare. As the network namespace is still experimental and under development clone and unshare support is only made available when CONFIG_NET_NS is selected at compile time. As this patch introduces network namespace support into code paths that exist when the CONFIG_NET is not selected there are a few additions made to net_namespace.h to allow a few more functions to be used when the networking stack is not compiled in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:52:46 -07:00
Adrian Bunk	464771fe47	[KERNEL]: Unexport raise_softirq_irqoff raise_softirq_irqoff no longer has any modular user. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:49:18 -07:00
Eric W. Biederman	b4b510290b	[NET]: Support multiple network namespaces with netlink Each netlink socket will live in exactly one network namespace; this includes the controlling kernel sockets. This patch updates all of the existing netlink protocols to only support the initial network namespace. Requests by clients in other namespaces will get -ECONNREFUSED, as they would if the kernel did not have support for that netlink protocol compiled in. As each netlink protocol is updated to be multiple network namespace safe it can register multiple kernel sockets to acquire a presence in the rest of the network namespaces. The implementation in af_netlink is a simple filter implementation at hash table insertion and hash table look up time. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:49:09 -07:00
Robert Olsson	c45248c701	[SOFTIRQ]: Remove do_softirq() symbol export. As noted by Christoph Hellwig, pktgen was the only user so it can now be removed. [ Add missing cases caught by Adrian Bunk. -DaveM ] Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:48:36 -07:00
Arnaldo Carvalho de Melo	a272378d11	[KTIME]: Introduce ktime_sub_ns and ktime_sub_us First user will be the DCCP transport networking protocol. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:48:12 -07:00
Jens Axboe	f5ff8422bb	Fix warnings with !CONFIG_BLOCK Hide everything in blkdev.h when CONFIG_BLOCK isn't set, and fix up the (few) files that fail to build because they were relying on blkdev.h pulling in extra includes for them. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-10 09:25:57 +02:00
Len Brown	4f86d3a8e2	cpuidle: consolidate 2.6.22 cpuidle branch into one patch commit e5a16b1f9eec0af7cfa0830304b41c1c0833cf9f Author: Len Brown <len.brown@intel.com> Date: Tue Oct 2 23:44:44 2007 -0400 cpuidle: shrink diff processor_idle.c | 440 +++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 429 insertions(+), 11 deletions(-) Signed-off-by: Len Brown <len.brown@intel.com> commit dfbb9d5aedfb18848a3e0d6f6e3e4969febb209c Author: Len Brown <len.brown@intel.com> Date: Wed Sep 26 02:17:55 2007 -0400 cpuidle: reduce diff size Reduces the cpuidle processor_idle.c diff vs 2.6.22 from this processor_idle.c | 2006 ++++++++++++++++++++++++++----------------- 1 file changed, 1219 insertions(+), 787 deletions(-) to this: processor_idle.c | 502 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 458 insertions(+), 44 deletions(-) ...for the purpose of making the cpuidle patch less invasive and easier to review. no functional changes. build tested only. Signed-off-by: Len Brown <len.brown@intel.com> commit 889172fc915f5a7fe20f35b133cbd205ce69bf6c Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Thu Sep 13 13:40:05 2007 -0700 cpuidle: Retain old ACPI policy for !CONFIG_CPU_IDLE Retain the old policy in processor_idle, so that when CPU_IDLE is not configured, old C-state policy will still be used. This provides a clean gradual migration path from old ACPI policy to new cpuidle based policy. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 9544a8181edc7ecc33b3bfd69271571f98ed08bc Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Thu Sep 13 13:39:17 2007 -0700 cpuidle: Configure governors by default Quoting Len "Do not give an option to users to shoot themselves in the foot". Remove the configurability of ladder and menu governors as they are needed for default policy of cpuidle.
That way users will not be able to have cpuidle without any policy, losing all C-state power savings. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 8975059a2c1e56cfe83d1bcf031bcf4cb39be743 Author: Adam Belay <abelay@novell.com> Date: Tue Aug 21 18:27:07 2007 -0400 CPUIDLE: load ACPI properly when CPUIDLE is disabled Change the registration return codes for when CPUIDLE support is not compiled into the kernel. As a result, the ACPI processor driver will load properly even if CPUIDLE is unavailable. However, it may be possible to clean up the ACPI processor driver further and eliminate some dead code paths. Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit e0322e2b58dd1b12ec669bf84693efe0dc2414a8 Author: Adam Belay <abelay@novell.com> Date: Tue Aug 21 18:26:06 2007 -0400 CPUIDLE: remove cpuidle_get_bm_activity() Remove cpuidle_get_bm_activity() and update governors accordingly. Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 18a6e770d5c82ba26653e53d240caa617e09e9ab Author: Adam Belay <abelay@novell.com> Date: Tue Aug 21 18:25:58 2007 -0400 CPUIDLE: max_cstate fix Currently max_cstate is limited to 0, resulting in no idle processor power management on ACPI platforms. This patch restores the value to the array size.
Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 1fdc0887286179b40ce24bcdbde663172e205ef0 Author: Adam Belay <abelay@novell.com> Date: Tue Aug 21 18:25:40 2007 -0400 CPUIDLE: handle BM detection inside the ACPI Processor driver Update the ACPI processor driver to detect BM activity and limit state entry depth internally, rather than exposing such requirements to CPUIDLE. As a result, CPUIDLE can drop this ACPI-specific interface and become more platform independent. BM activity is now handled much more aggressively than it was in the original implementation, so some testing coverage may be needed to verify that this doesn't introduce any DMA buffer under-run issues. Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 0ef38840db666f48e3cdd2b769da676c57228dd9 Author: Adam Belay <abelay@novell.com> Date: Tue Aug 21 18:25:14 2007 -0400 CPUIDLE: menu governor updates Tweak the menu governor to more effectively handle non-timer break events. Non-timer break events are detected by comparing the actual sleep time to the expected sleep time. In future revisions, it may be more reliable to use the timer data structures directly. Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit bb4d74fca63fa96cf3ace644b15ae0f12b7df5a1 Author: Adam Belay <abelay@novell.com> Date: Tue Aug 21 18:24:40 2007 -0400 CPUIDLE: fix 'current_governor' sysfs entry Allow the "current_governor" sysfs entry to properly handle input terminated with '\n'. 
Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit df3c71559bb69b125f1a48971bf0d17f78bbdf47 Author: Len Brown <len.brown@intel.com> Date: Sun Aug 12 02:00:45 2007 -0400 cpuidle: fix IA64 build (again) Signed-off-by: Len Brown <len.brown@intel.com> commit a02064579e3f9530fd31baae16b1fc46b5a7bca8 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Sun Aug 12 01:39:27 2007 -0400 cpuidle: Remove support for runtime changing of max_cstate Remove support for runtime changeability of max_cstate. Drivers can use latency APIs. max_cstate can still be used as a boot time option and dmi override. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 0912a44b13adf22f5e3f607d263aed23b4910d7e Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Sun Aug 12 01:39:16 2007 -0400 cpuidle: Remove ACPI cstate_limit calls from ipw2100 ipw2100 already has code to use acceptable_latency interfaces to limit the C-state. Remove the calls to acpi_set_cstate_limit and acpi_get_cstate_limit as they are redundant. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit c649a76e76be6bff1fd770d0a775798813a3f6e0 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Sun Aug 12 01:35:39 2007 -0400 cpuidle: compile fix for pause and resume functions Fix the compilation failure when cpuidle is not compiled in. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Acked-by: Adam Belay <adam.belay@novell.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 2305a5920fb8ee6ccec1c62ade05aa8351091d71 Author: Adam Belay <abelay@novell.com> Date: Thu Jul 19 00:49:00 2007 -0400 cpuidle: re-write Some portions have been rewritten to make the code cleaner and lighter weight. The following is a list of changes: 1.)
the state name is now included in the sysfs interface 2.) detection, hotplug, and available state modifications are handled by CPUIDLE drivers directly 3.) the CPUIDLE idle handler is only ever installed when at least one cpuidle_device is enabled and ready 4.) the menu governor BM code no longer overflows 5.) the sysfs attributes are now printed as unsigned integers, avoiding negative values 6.) a variety of other small cleanups Also, idle drivers are no longer swappable during runtime through the CPUIDLE sysfs interface. On i386 and x86_64 most idle handlers (e.g. poll, mwait, halt, etc.) don't benefit from an infrastructure that supports multiple states, so I think using a more general case idle handler selection mechanism would be cleaner. Signed-off-by: Adam Belay <abelay@novell.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Acked-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit df25b6b56955714e6e24b574d88d1fd11f0c3ee5 Author: Len Brown <len.brown@intel.com> Date: Tue Jul 24 17:08:21 2007 -0400 cpuidle: fix IA64 buid Signed-off-by: Len Brown <len.brown@intel.com> commit fd6ada4c14488755ff7068860078c437431fbccd Author: Adrian Bunk <bunk@stusta.de> Date: Mon Jul 9 11:33:13 2007 -0700 cpuidle: static make cpuidle_replace_governor() static Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit c1d4a2cebcadf2429c0c72e1d29aa2a9684c32e0 Author: Adrian Bunk <bunk@stusta.de> Date: Tue Jul 3 00:54:40 2007 -0400 cpuidle: static This patch makes the needlessly global struct menu_governor static.
Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit dbf8780c6e8d572c2c273da97ed1cca7608fd999 Author: Andrew Morton <akpm@linux-foundation.org> Date: Tue Jul 3 00:49:14 2007 -0400 export symbol tick_nohz_get_sleep_length ERROR: "tick_nohz_get_sleep_length" [drivers/cpuidle/governors/menu.ko] undefined! ERROR: "tick_nohz_get_idle_jiffies" [drivers/cpuidle/governors/menu.ko] undefined! And please be sure to get your changes to core kernel suitably reviewed. Cc: Adam Belay <abelay@novell.com> Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 29f0e248e7017be15f99febf9143a2cef00b2961 Author: Andrew Morton <akpm@linux-foundation.org> Date: Tue Jul 3 00:43:04 2007 -0400 tick.h needs hrtimer.h It uses hrtimers. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit e40cede7d63a029e92712a3fe02faee60cc38fb4 Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:40:34 2007 -0400 cpuidle: first round of documentation updates Documentation changes based on Pavel's feedback. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 83b42be2efece386976507555c29e7773a0dfcd1 Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:39:25 2007 -0400 cpuidle: add rating to the governors and pick the one with highest rating by default Introduce a governor rating scheme to pick the right governor by default. 
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit d2a74b8c5e8f22def4709330d4bfc4a29209b71c Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:38:08 2007 -0400 cpuidle: make cpuidle sysfs driver governor switch off by default Make default cpuidle sysfs to show current_governor and current_driver in read-only mode. More elaborate available_governors and available_drivers with writeable current_governor and current_driver interface only appear with "cpuidle_sysfs_switch" boot parameter. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 1f60a0e80bf83cf6b55c8845bbe5596ed8f6307b Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:37:00 2007 -0400 cpuidle: menu governor: change the early break condition Change the C-state early break out algorithm in menu governor. We only look at early breakouts that result in wakeups shorter than idle state's target_residency. If such a breakout is frequent enough, eliminate the particular idle state upto a timeout period. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 45a42095cf64b003b4a69be3ce7f434f97d7af51 Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:35:38 2007 -0400 cpuidle: fix uninitialized variable in sysfs routine Fix the uninitialized usage of ret. 
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 80dca7cdba3e6ee13eae277660873ab9584eb3be Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:34:16 2007 -0400 cpuidle: reenable /proc/acpi//power interface for the time being Keep /proc/acpi/processor/CPU/power around for a while as powertop depends on it. It will be marked deprecated and removed in future. powertop can use cpuidle interfaces instead. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 589c37c2646c5e3813a51255a5ee1159cb4c33fc Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Jul 3 00:32:37 2007 -0400 cpuidle: menu governor and hrtimer compile fix Compile fix for menu governor. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 0ba80bd9ab3ed304cb4f19b722e4cc6740588b5e Author: Len Brown <len.brown@intel.com> Date: Thu May 31 22:51:43 2007 -0400 cpuidle: build fix - cpuidle vs ipw2100 module ERROR: "acpi_set_cstate_limit" [drivers/net/wireless/ipw2100.ko] undefined! Signed-off-by: Len Brown <len.brown@intel.com> commit d7d8fa7f96a7f7682be7c6cc0cc53fa7a18c3b58 Author: Adam Belay <abelay@novell.com> Date: Sat Mar 24 03:47:07 2007 -0400 cpuidle: add the 'menu' governor Here is my first take at implementing an idle PM governor that takes full advantage of NO_HZ. I call it the 'menu' governor because it considers the full list of idle states before each entry. I've kept the implementation fairly simple. It attempts to guess the next residency time and then chooses a state that would meet at least the break-even point between power savings and entry cost. 
To this end, it selects the deepest idle state that satisfies the following constraints: 1. If the idle time elapsed since bus master activity was detected is below a threshold (currently 20 ms), then limit the selection to C2-type or above. 2. Do not choose a state with a break-even residency that exceeds the expected time remaining until the next timer interrupt. 3. Do not choose a state with a break-even residency that exceeds the elapsed time between the last pair of break events, excluding timer interrupts. This governor has an advantage over "ladder" governor because it proactively checks how much time remains until the next timer interrupt using the tick infrastructure. Also, it handles device interrupt activity more intelligently by not including timer interrupts in break event calculations. Finally, it doesn't make policy decisions using the number of state entries, which can have variable residency times (NO_HZ makes these potentially very large), and instead only considers sleep time deltas. The menu governor can be selected during runtime using the cpuidle sysfs interface like so: "echo "menu" > /sys/devices/system/cpu/cpuidle/current_governor" Signed-off-by: Adam Belay <abelay@novell.com> Signed-off-by: Len Brown <len.brown@intel.com> commit a4bec7e65aa3b7488b879d971651cc99a6c410fe Author: Adam Belay <abelay@novell.com> Date: Sat Mar 24 03:47:03 2007 -0400 cpuidle: export time until next timer interrupt using NO_HZ Expose information about the time remaining until the next timer interrupt expires by utilizing the dynticks infrastructure. Also modify the main idle loop to allow dynticks to handle non-interrupt break events (e.g. DMA). Finally, expose sleep ticks information to external code. Thomas Gleixner is responsible for much of the code in this patch. 
However, I've made some additional changes, so I'm probably responsible if there are any bugs or oversights :) Signed-off-by: Adam Belay <abelay@novell.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 2929d8996fbc77f41a5ff86bb67cdde3ca7d2d72 Author: Adam Belay <abelay@novell.com> Date: Sat Mar 24 03:46:58 2007 -0400 cpuidle: governor API changes This patch prepares cpuidle for the menu governor. It adds an optional stage after idle state entry to give the governor an opportunity to check why the state was exited. Also it makes sure the idle loop returns after each state entry, allowing the appropriate dynticks code to run. Signed-off-by: Adam Belay <abelay@novell.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 3a7fd42f9825c3b03e364ca59baa751bb350775f Author: Venki Pallipadi <venkatesh.pallipadi@intel.com> Date: Thu Apr 26 00:03:59 2007 -0700 cpuidle: hang fix Prevent hang on x86-64, when ACPI processor driver is added as a module on a system that does not support C-states. x86-64 expects all idle handlers to enable interrupts before returning from idle handler. This is due to enter_idle(), exit_idle() races. Make cpuidle_idle_call() conform to this when there is no pm_idle_old. Also, cpuidle now looks at the return values of attch_driver() and sets current_driver to NULL if attach fails on all CPUs. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 4893339a142afbd5b7c01ffadfd53d14746e858e Author: Shaohua Li <shaohua.li@intel.com> Date: Thu Apr 26 10:40:09 2007 +0800 cpuidle: add support for max_cstate limit With CPUIDLE framework, the max_cstate (to limit max cpu c-state) parameter is ignored. Some systems require it to ignore C2/C3 and some drivers like ipw require it too. 
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 43bbbbe1cb998cbd2df656f55bb3bfe30f30e7d1 Author: Shaohua Li <shaohua.li@intel.com> Date: Thu Apr 26 10:40:13 2007 +0800 cpuidle: add cpuidle_force_redetect_devices API Add the cpuidle_force_redetect_devices API, which forces all CPUs to redetect idle states. Next patch will use it. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit d1edadd608f24836def5ec483d2edccfb37b1d19 Author: Shaohua Li <shaohua.li@intel.com> Date: Thu Apr 26 10:40:01 2007 +0800 cpuidle: fix sysfs related issue Fix the cpuidle sysfs issue. a. make kobject dynamically allocated b. fixed sysfs init issue to avoid suspend/resume issue Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 7169a5cc0d67b263978859672e86c13c23a5570d Author: Randy Dunlap <randy.dunlap@oracle.com> Date: Wed Mar 28 22:52:53 2007 -0400 cpuidle: 1-bit field must be unsigned A 1-bit bitfield has no room for a sign bit. drivers/cpuidle/governors/ladder.c:54:16: error: dubious bitfield without explicit `signed' or `unsigned' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 4658620158dc2fbd9e4bcb213c5b6fb5d05ba7d4 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Wed Mar 28 22:52:41 2007 -0400 cpuidle: fix boot hang Patch for cpuidle boot hang reported by Larry Finger here. 
http://www.ussg.iu.edu/hypermail/linux/kernel/0703.2/2025.html Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Larry Finger <larry.finger@lwfinger.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit c17e168aa6e5fe3851baaae8df2fbc1cf11443a9 Author: Len Brown <len.brown@intel.com> Date: Wed Mar 7 04:37:53 2007 -0500 cpuidle: ladder does not depend on ACPI build fix for CONFIG_ACPI=n In file included from drivers/cpuidle/governors/ladder.c:21: include/acpi/processor.h:88: error: expected specifier-qualifier-list before ‘acpi_integer’ include/acpi/processor.h:106: error: expected specifier-qualifier-list before ‘acpi_integer’ include/acpi/processor.h:168: error: expected specifier-qualifier-list before ‘acpi_handle’ Signed-off-by: Len Brown <len.brown@intel.com> commit 8c91d958246bde68db0c3f0c57b535962ce861cb Author: Adrian Bunk <bunk@stusta.de> Date: Tue Mar 6 02:29:40 2007 -0800 cpuidle: make code static This patch makes the following needlessly global code static: - driver.c: __cpuidle_find_driver() - governor.c: __cpuidle_find_governor() - ladder.c: struct ladder_governor Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Adam Belay <abelay@novell.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 0c39dc3187094c72c33ab65a64d2017b21f372d2 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Wed Mar 7 02:38:22 2007 -0500 cpu_idle: fix build break This patch fixes a build breakage with !CONFIG_HOTPLUG_CPU and CONFIG_CPU_IDLE. 
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 8112e3b115659b07df340ef170515799c0105f82 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Tue Mar 6 02:29:39 2007 -0800 cpuidle: build fix for !CPU_IDLE Fix the compile issues when CPU_IDLE is not configured. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Adam Belay <abelay@novell.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> commit 1eb4431e9599cd25e0d9872f3c2c8986821839dd Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Thu Feb 22 13:54:57 2007 -0800 cpuidle take2: Basic documentation for cpuidle Documentation for cpuidle infrastructure Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Adam Belay <abelay@novell.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> commit ef5f15a8b79123a047285ec2e3899108661df779 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Thu Feb 22 13:54:03 2007 -0800 cpuidle take2: Hookup ACPI C-states driver with cpuidle Hookup ACPI C-states onto generic cpuidle infrastructure. drivers/acpi/processor_idle.c is now an ACPI C-states driver that registers as a driver in cpuidle infrastructure and the policy part is removed from drivers/acpi/processor_idle.c. We use governor in cpuidle instead. 
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Adam Belay <abelay@novell.com> Signed-off-by: Len Brown <len.brown@intel.com> commit 987196fa82d4db52c407e8c9d5dec884ba602183 Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Date: Thu Feb 22 13:52:57 2007 -0800 cpuidle take2: Core cpuidle infrastructure Announcing 'cpuidle', a new CPU power management infrastructure to manage idle CPUs in a clean and efficient manner. cpuidle separates out the drivers that can provide support for multiple types of idle states and policy governors that decide on what idle state to use at run time. A cpuidle driver can support multiple idle states based on parameters like varying power consumption, wakeup latency, etc (ACPI C-states for example). A cpuidle governor can be usage model specific (laptop, server, laptop on battery etc). Main advantage of the infrastructure being, it allows independent development of drivers and governors and allows for better CPU power management. A huge thanks to Adam Belay and Shaohua Li who were part of this mini-project since its beginning and are greatly responsible for this patchset. This patch: Core cpuidle infrastructure. Introduces a new abstraction layer for cpuidle: which manages drivers that can support multiple idles states. Drivers can be generic or particular to specific hardware/platform * allows pluging in multiple policy governors that can take idle state policy decision * The core also has a set of sysfs interfaces with which administrato can know about supported drivers and governors and switch them at run time. Signed-off-by: Adam Belay <abelay@novell.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>	2007-10-10 00:12:41 -04:00
Trond Myklebust	50e437d522	SUNRPC: Convert rpc_pipefs to use the generic filesystem notification hooks This will allow rpc.gssd to use inotify instead of dnotify in order to locate new rpc upcall pipes. This also requires the exporting of __audit_inode_child(), which is used by fsnotify_create() and fsnotify_mkdir(). Ccing David Woodhouse. Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-10-09 17:15:26 -04:00
Al Viro	291041e935	fix bogus reporting of signals by audit Async signals should not be reported as sent by current in audit log. As it is, we call audit_signal_info() too early in check_kill_permission(). Note that check_kill_permission() has that test already - it needs to know if it should apply current-based permission checks. So the solution is to move the call of audit_signal_info() between those. Bogosity in question is easily reproduced - add a rule watching for e.g. kill(2) from specific process (so that audit_signal_info() would not short-circuit to nothing), say load_policy, watch the bogus OBJ_PID entry in audit logs claiming that write(2) on selinuxfs file issued by load_policy(8) had somehow managed to send a signal to syslogd... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Steve Grubb <sgrubb@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-07 16:28:43 -07:00
Anton Blanchard	74922be148	Fix timer_stats printout of events/sec When using /proc/timer_stats on ppc64 I noticed the events/sec field wasn't accurate. Sometimes the integer part was incorrect due to rounding (we weren't taking the fractional seconds into consideration). The fraction part is also wrong: we need to pad the printf statement and take the bottom three digits of 1000 times the value. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-07 16:28:43 -07:00
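The fixed-point formatting described in commit 74922be148 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the kernel's code: `format_events_per_sec` and its parameters are hypothetical names, and the sampling period is assumed to be supplied in milliseconds with the fractional seconds already folded in.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel function): formats an event rate as
 * "<whole>.<frac>" with exactly three fractional digits.
 * period_ms is assumed to include the fractional part of the period.
 */
int format_events_per_sec(char *buf, size_t len,
                          unsigned long long events,
                          unsigned long long period_ms)
{
    assert(period_ms > 0); /* guard against division by zero */

    /* Integer part: events per second, truncated. */
    unsigned long long whole = events * 1000ULL / period_ms;

    /* Fraction: the bottom three digits of 1000 times the rate. */
    unsigned long long frac = (events * 1000000ULL / period_ms) % 1000ULL;

    /* %03llu zero-pads the fraction, so 1.001 is not printed as 1.1. */
    return snprintf(buf, len, "%llu.%03llu", whole, frac);
}
```

Calling it with 1001 events over a period of 1000000 ms writes `1.001` into the buffer; without the zero padding the same values would render as `1.1`.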
Ingo Molnar	30084fbd1c	sched: fix profile=sleep fix sleep profiling - we lost this chunk in the CFS merge. Found-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-02 14:13:08 +02:00
Martin Schwidefsky	9f96cb1e8b	robust futex thread exit race Calling handle_futex_death in exit_robust_list for the different robust mutexes of a thread basically frees the mutex. Another thread might grab the lock immediately which updates the next pointer of the mutex. fetch_robust_entry over the next pointer might therefore branch into the robust mutex list of a different thread. This can cause two problems: 1) some mutexes held by the dead thread are not getting freed and 2) some mutexes held by a different thread are freed. The next pointer needs to be read before calling handle_futex_death. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-01 07:52:23 -07:00
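The ordering rule from commit 9f96cb1e8b (read the next pointer before releasing the entry) can be illustrated with a small user-space sketch. Everything here is a simplified assumption: `struct node`, `release_entry` and `release_all` are hypothetical stand-ins, and the "other thread" that grabs the lock is simulated by rewriting `next` inside `release_entry`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one robust-list entry. */
struct node {
    struct node *next;
    int released;
};

/*
 * Stand-in for handle_futex_death(): once the entry is released, another
 * thread may take the lock at once and link it into its own list. That is
 * simulated here by redirecting next to a foreign node.
 */
void release_entry(struct node *n, struct node *foreign)
{
    n->released = 1;
    n->next = foreign;
}

/*
 * Walks a circular list with a sentinel head and releases every entry.
 * The next pointer is fetched BEFORE the entry is released, so the walk
 * never follows a pointer that the new owner has already rewritten.
 */
int release_all(struct node *head, struct node *foreign)
{
    int count = 0;
    struct node *entry;

    assert(head != NULL);
    entry = head->next;
    while (entry != head) {
        struct node *next = entry->next; /* read first */
        release_entry(entry, foreign);   /* release second */
        entry = next;
        count++;
    }
    return count;
}
```

If the two statements in the loop body were swapped, the walk would continue into the foreign list, which corresponds to both failure modes named in the commit message: entries of the dead thread are skipped and entries of another thread are released.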
Mark Lord	4047727e5a	Fix SMP poweroff hangs We need to disable all CPUs other than the boot CPU (usually 0) before attempting to power-off modern SMP machines. This fixes the hang-on-poweroff issue on my MythTV SMP box, and also on Thomas Gleixner's new toybox. Signed-off-by: Mark Lord <mlord@pobox.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-01 07:52:23 -07:00
Al Viro	459685c75b	hibernation doesn't even build on frv - tons of helpers are missing Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-By: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-26 09:22:04 -07:00
Thomas Gleixner	b7e113dc9d	clockevents: remove the suspend/resume workaround^Wthinko In a desperate attempt to fix the suspend/resume problem on Andrew's VAIO I added a workaround which enforced the broadcast of the oneshot timer on resume. This was actually resolving the problem on the VAIO but was just a stupid workaround, which was not tackling the root cause: the assignment of lower idle C-States in the ACPI processor_idle code. The cpuidle patches, which utilize the dynamic tick feature and go faster into deeper C-states exposed the problem again. The correct solution is the previous patch, which prevents lower C-states across the suspend/resume. Remove the enforcement code, including the conditional broadcast timer arming, which helped to paper over the real problem for quite a time. The oneshot broadcast flag for the cpu, which runs the resume code can never be set at the time when this code is executed. It only gets set, when the CPU is entering a lower idle C-State. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Len Brown <lenb@kernel.org> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-22 17:15:34 -07:00
Davide Libenzi	b8fceee17a	signalfd simplification This simplifies signalfd code, by no longer keeping it attached to the sighand during its lifetime. In this way, the signalfd remains attached to the sighand only during poll(2) (and select and epoll) and read(2). This also allows removing all the custom "tsk == current" checks in kernel/signal.c, since dequeue_signal() will only be called by "current". I think this is also what Ben was suggesting some time ago. The external effect of this is that a thread can extract only its own private signals and the group ones. I think this is an acceptable behaviour, in that those are the signals the thread would be able to fetch w/out signalfd. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-20 13:19:59 -07:00
Hiroshi Shimamoto	9c95e7319b	sched: fix invalid sched_class use When using rt_mutex, a NULL pointer dereference occurs in enqueue_task_rt. Here is a scenario: 1) there are two threads, thread A is fair_sched_class and thread B is rt_sched_class. 2) Thread A is boosted up to rt_sched_class, because thread A holds an rt_mutex lock and thread B is waiting for the lock. 3) At this time, when thread A creates a new thread C, thread C gets rt_sched_class. 4) When doing wake_up_new_task() for thread C, the priority of thread C is out of the RT priority range, because the normal priority of thread A is not an RT priority. This corrupts data by overflowing the rt_prio_array. The new thread C should be fair_sched_class. The new thread should have a valid scheduler class before queuing. This patch sets a suitable scheduler class. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-09-19 23:34:46 +02:00
Ingo Molnar	1799e35d5b	sched: add /proc/sys/kernel/sched_compat_yield add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more aggressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-09-19 23:34:46 +02:00
Pavel Emelyanov	28f300d236	Fix user namespace exiting OOPs It turned out that the user namespace is released during the do_exit() in exit_task_namespaces(), but the struct user_struct is released only during the put_task_struct(), i.e. MUCH later. On debug kernels with poisoned slabs this will cause the oops in uid_hash_remove() because the head of the chain, which resides inside the struct user_namespace, will be already freed and poisoned. Since the uid hash itself is required only when someone can search it, i.e. when the namespace is alive, we can safely unhash all the user_struct-s from it during the namespace exiting. The subsequent free_uid() will complete the user_struct destruction. For example, the simple program #include <sched.h> char stack[2 * 1024 * 1024]; int f(void *foo) { return 0; } int main(void) { clone(f, stack + 1 * 1024 * 1024, 0x10000000, 0); return 0; } run on a kernel with CONFIG_USER_NS turned on will oops the kernel immediately. This was spotted during OpenVZ kernel testing. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-19 11:24:18 -07:00
Pavel Emelyanov	735de2230f	Convert uid hash to hlist Surprisingly, but (spotted by Alexey Dobriyan) the uid hash still uses list_heads, thus occupying twice as much place as it could. Convert it to hlist_heads. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-19 11:24:18 -07:00
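The space saving claimed in commit 735de2230f comes from the bucket heads: a `list_head` holds two pointers, an `hlist_head` only one. The sketch below redeclares the three structures in user space with the same layout as the kernel's list types; `HASH_BUCKETS` is a made-up table size used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* User-space copies of the kernel's list layouts. */
struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

#define HASH_BUCKETS 256 /* hypothetical bucket count */

/* Bytes needed for a hash table whose buckets are list_heads. */
size_t list_table_bytes(void)
{
    return sizeof(struct list_head) * HASH_BUCKETS;
}

/* Bytes needed for the same table with hlist_head buckets. */
size_t hlist_table_bytes(void)
{
    return sizeof(struct hlist_head) * HASH_BUCKETS;
}

/* Sanity check: one bucket shrinks from two pointers to one. */
void check_bucket_sizes(void)
{
    assert(sizeof(struct list_head) == 2 * sizeof(struct hlist_head));
}
```

Only the bucket array is halved. Each `hlist_node` embedded in an entry still carries two pointers, so per-entry overhead stays the same.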
Matthias Kaehlcke	d8a4821dca	kernel/user.c: Use list_for_each_entry instead of list_for_each kernel/user.c: Convert list_for_each to list_for_each_entry in uid_hash_find() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-19 11:24:18 -07:00
Alexey Dobriyan	efc63c4fb0	Fix UTS corruption during clone(CLONE_NEWUTS) struct utsname is copied from master one without any exclusion. Here is sample output from one proggie doing sethostname("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); sethostname("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"); and another clone(,, CLONE_NEWUTS, ...) uname() hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaabbbbb' hostname = 'bbbaaaaaaaaaaaaaaaaaaaaaaaaaaa' hostname = 'aaaaaaaabbbbbbbbbbbbbbbbbbbbbb' hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaaabbbb' hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaabb' hostname = 'aaabbbbbbbbbbbbbbbbbbbbbbbbbbb' hostname = 'bbbbbbbbbbbbbbbbaaaaaaaaaaaaaa' Hostname is sometimes corrupted. Yes, even _the_ simplest namespace activity had bug in it. :-( Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-19 11:24:17 -07:00
Thomas Gleixner	5e41d0d60a	clockevents: prevent stale tick update on offline cpu Taking a cpu offline removes the cpu from the online mask before the CPU_DEAD notification is done. The clock events layer does the cleanup of the dead CPU from the CPU_DEAD notifier chain. tick_do_timer_cpu is used to avoid xtime lock contention by assigning the task of jiffies xtime updates to one CPU. If a CPU is taken offline, then this assignment becomes stale. This went unnoticed because most of the time the offline CPU went dead before the online CPU reached __cpu_die(), where the CPU_DEAD state is checked. In the case that the offline CPU did not reach the DEAD state before we reach __cpu_die(), the code in there goes to sleep for 100ms. Due to the stale time update assignment, the system is stuck forever. Take the assignment away when a cpu is no longer in the cpu_online_mask. We do this in the last call to tick_nohz_stop_sched_tick() when the offline CPU is on the way to the final play_dead() idle entry. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-09-16 15:36:43 +02:00
Thomas Gleixner	31d9b3938c	clockevents: do not shutdown the oneshot broadcast device When a cpu goes offline it is removed from the broadcast masks. If the mask becomes empty the code shuts down the broadcast device. This is wrong, because the broadcast device needs to be ready for the online cpu going idle (into a c-state, which stops the local apic timer). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-09-16 15:36:43 +02:00
Thomas Gleixner	07eec6af44	clockevents: Enforce oneshot broadcast when broadcast mask is set on resume The jinxed VAIO refuses to resume without hitting keys on the keyboard when this is not enforced. It is unclear why the cpu ends up in a lower C State without notifying the clock events layer, but enforcing the oneshot broadcast here is safe. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-09-16 15:36:43 +02:00
Thomas Gleixner	6a669ee8a7	timekeeping: Prevent time going backwards on resume Timekeeping resume adjusts xtime by adding the slept time in seconds and resets the reference value of the clock source (clock->cycle_last). clock->cycle last is used to calculate the delta between the last xtime update and the readout of the clock source in __get_nsec_offset(). xtime plus the offset is the current time. The resume code ignores the delta which had already elapsed between the last xtime update and the actual time of suspend. If the suspend time is short, then we can see time going backwards on resume. Suspend: offs_s = clock->read() - clock->cycle_last; now = xtime + offs_s; timekeeping_suspend_time = read_rtc(); Resume: sleep_time = read_rtc() - timekeeping_suspend_time; xtime.tv_sec += sleep_time; clock->cycle_last = clock->read(); offs_r = clock->read() - clock->cycle_last; now = xtime + offs_r; if sleep_time_seconds == 0 and offs_r < offs_s, then time goes backwards. Fix this by storing the offset from the last xtime update and add it to xtime during resume, when we reset clock->cycle_last: sleep_time = read_rtc() - timekeeping_suspend_time; xtime.tv_sec += sleep_time; xtime += offs_s; /* Fixup xtime offset at suspend time */ clock->cycle_last = clock->read(); offs_r = clock->read() - clock->cycle_last; now = xtime + offs_r; Thanks to Marcelo for tracking this down on the OLPC and providing the necessary details to analyze the root cause. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@us.ibm.com> Cc: Tosatti <marcelo@kvack.org>	2007-09-16 15:36:43 +02:00
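The suspend/resume arithmetic in commit 6a669ee8a7 can be replayed with plain integers. The following is a simplified model rather than the kernel's timekeeping code: `struct timekeeper`, `current_time`, `suspend_offset` and `tk_resume` are hypothetical names, one clock cycle is assumed to equal one nanosecond, and the sleep time is passed in nanoseconds instead of RTC seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: one clock cycle is assumed to be one nanosecond. */
struct timekeeper {
    int64_t  xtime_ns;   /* wall time at the last xtime update */
    uint64_t cycle_last; /* clock reading at the last xtime update */
};

/* now = xtime + offset since the last update */
int64_t current_time(const struct timekeeper *tk, uint64_t clock_now)
{
    return tk->xtime_ns + (int64_t)(clock_now - tk->cycle_last);
}

/* offs_s: time elapsed between the last xtime update and suspend */
int64_t suspend_offset(const struct timekeeper *tk, uint64_t clock_now)
{
    return (int64_t)(clock_now - tk->cycle_last);
}

/*
 * Resume path. With apply_fix == 0 the offset accumulated before suspend
 * is dropped when cycle_last is reset; with apply_fix != 0 it is folded
 * into xtime first, as the commit message describes.
 */
void tk_resume(struct timekeeper *tk, int64_t sleep_ns, int64_t offs_s,
               uint64_t clock_now, int apply_fix)
{
    assert(sleep_ns >= 0);
    tk->xtime_ns += sleep_ns;
    if (apply_fix)
        tk->xtime_ns += offs_s; /* fix up the xtime offset at suspend */
    tk->cycle_last = clock_now;
}
```

With `xtime_ns = 1000`, `cycle_last = 50` and a clock reading of 350 at suspend, the time before suspend is 1300. After a zero-length sleep, the unfixed resume reports 1000 (time went backwards), while the fixed resume reports 1300.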
Thomas Gleixner	3be9095063	timekeeping: access rtc outside of xtime lock Lockdep complains about the access of rtc in timekeeping_suspend inside the interrupt disabled region of the write locked xtime lock. Move the access outside. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@us.ibm.com>	2007-09-16 15:36:43 +02:00
Tony Breeds	298a5df45d	Fix "no_sync_cmos_clock" logic inversion in kernel/time/ntp.c Seems to me that this timer will only get started on platforms that say they don't want it? Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Gabriel Paubert <paubert@iram.es> Cc: Zachary Amsden <zach@vmware.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-11 17:21:27 -07:00
Michael Ellerman	3210f0ecdb	Restore call_usermodehelper_pipe() behaviour The semantics of call_usermodehelper_pipe() used to be that it would fork the helper, and wait for the kernel thread to be started. This was implemented by setting sub_info.wait to 0 (implicitly), and doing a wait_for_completion(). As part of the cleanup done in `0ab4dc9227`, call_usermodehelper_pipe() was changed to pass 1 as the value for wait to call_usermodehelper_exec(). This is equivalent to setting sub_info.wait to 1, which is a change from the previous behaviour. Using 1 instead of 0 causes __call_usermodehelper() to start the kernel thread running wait_for_helper(), rather than directly calling ____call_usermodehelper(). The end result is that the calling kernel code blocks until the user mode helper finishes. As the helper is expecting input on stdin, and now no one is writing anything, everything locks up (observed in do_coredump). The fix is to change the 1 to UMH_WAIT_EXEC (aka 0), indicating that we want to wait for the kernel thread to be started, but not for the helper to finish. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-11 17:21:20 -07:00
Arnd Bergmann	179c85ea53	futex_compat: fix list traversal bugs The futex list traversal on the compat side appears to have a bug. Its loop termination condition compares: while (compat_ptr(uentry) != &head->list) But that can't be right because "uentry" has the special "pi" indicator bit still potentially set at bit 0. This is cleared by fetch_robust_entry() into the "entry" return value. What this seems to mean is that the list won't terminate when list iteration gets back to the head. And we'll also process the list head like a normal entry, which could cause all kinds of problems. So we should check for equality with "entry". That pointer is of the non-compat type so we have to do a little casting to keep the compiler and sparse happy. The same problem can in theory occur with the 'pending' variable, although that has not been reported from users so far. Based on the original patch from David Miller. Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Miller <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-11 17:21:20 -07:00
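The termination bug from commit 179c85ea53 comes down to comparing a tagged pointer against an untagged one. A minimal user-space sketch, assuming pointers are modelled as `uintptr_t` values and using the hypothetical helpers `strip_pi`, `reached_head_buggy` and `reached_head_fixed` (none of these are kernel functions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Splits a user-supplied entry value into the real address and the "pi"
 * flag stored in bit 0, in the spirit of fetch_robust_entry().
 */
uintptr_t strip_pi(uintptr_t uentry, int *pi)
{
    assert(pi != NULL);
    *pi = (int)(uentry & 1u);
    return uentry & ~(uintptr_t)1;
}

/* Buggy check: compares the raw value, flag bit included. */
int reached_head_buggy(uintptr_t uentry, uintptr_t head)
{
    return uentry == head;
}

/* Fixed check: compares the stripped address against the head. */
int reached_head_fixed(uintptr_t uentry, uintptr_t head)
{
    int pi;
    return strip_pi(uentry, &pi) == head;
}
```

For a head at address 0x1000 and a final next pointer of 0x1001 (head with the pi bit set), the buggy check never matches, so the loop would treat the head as an ordinary entry. The fixed check matches and the walk stops.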
Roland McGrath	7d94143291	Fix spurious syscall tracing after PTRACE_DETACH + PTRACE_ATTACH When PTRACE_SYSCALL was used and then PTRACE_DETACH is used, the TIF_SYSCALL_TRACE flag is left set on the formerly-traced task. This means that when a new tracer comes along and does PTRACE_ATTACH, it's possible he gets a syscall tracing stop even though he's never used PTRACE_SYSCALL. This happens if the task was in the middle of a system call when the second PTRACE_ATTACH was done. The symptom is an unexpected SIGTRAP when the tracer thinks that only SIGSTOP should have been provoked by his ptrace calls so far. A few machines already fixed this in ptrace_disable (i386, ia64, m68k). But all other machines do not, and still have this bug. On x86_64, this constitutes a regression in IA32 compatibility support. Since all machines now use TIF_SYSCALL_TRACE for this, I put the clearing of TIF_SYSCALL_TRACE in the generic ptrace_detach code rather than adding it to every other machine's ptrace_disable. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-10 18:57:47 -07:00
Peter Zijlstra	1169783085	sched: fix ideal_runtime calculations for reniced tasks fix ideal_runtime: - do not scale it using niced_granularity() it is against sum_exec_delta, so its wall-time, not fair-time. - move the whole check into __check_preempt_curr_fair() so that wakeup preemption can also benefit from the new logic. this also results in code size reduction: text data bss dec hex filename 13391 228 1204 14823 39e7 sched.o.before 13369 228 1204 14801 39d1 sched.o.after Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Peter Zijlstra	4a55b45036	sched: improve prev_sum_exec_runtime setting Second preparatory patch for fix-ideal runtime: Mark prev_sum_exec_runtime at the beginning of our run, the same spot that adds our wait period to wait_runtime. This seems a more natural location to do this, and it also reduces the code a bit: text data bss dec hex filename 13397 228 1204 14829 39ed sched.o.before 13391 228 1204 14823 39e7 sched.o.after Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Peter Zijlstra	7c92e54f6f	sched: simplify __check_preempt_curr_fair() Preparatory patch for fix-ideal-runtime: simplify __check_preempt_curr_fair(): get rid of the integer return. text data bss dec hex filename 13404 228 1204 14836 39f4 sched.o.before 13393 228 1204 14825 39e9 sched.o.after functionality is unchanged. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Ingo Molnar	cf2ab4696e	sched: fix xtensa build warning rename RSR to SRR - 'RSR' is already defined on xtensa. found by Adrian Bunk. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Ingo Molnar	2491b2b89d	sched: debug: fix sum_exec_runtime clearing when cleaning sched-stats also clear prev_sum_exec_runtime. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Ingo Molnar	a206c07213	sched: debug: fix cfs_rq->wait_runtime accounting the cfs_rq->wait_runtime debug/statistics counter was not maintained properly - fix this. this also removes some code: text data bss dec hex filename 13420 228 1204 14852 3a04 sched.o.before 13404 228 1204 14836 39f4 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-09-05 14:32:49 +02:00
Ingo Molnar	a0dc72601d	sched: fix niced_granularity() shift fix niced_granularity(). This resulted in under-scheduling for CPU-bound negative nice level tasks (and this in turn caused higher than necessary latencies in nice-0 tasks). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Suresh Siddha	7fd0d2dde9	sched: fix MC/HT scheduler optimization, without breaking the FUZZ logic. First fix the check if (imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) with this if (imbalance < busiest_load_per_task) As the current check is always false for nice 0 tasks (as SCHED_LOAD_SCALE_FUZZ is same as busiest_load_per_task for nice 0 tasks). With the above change, imbalance was getting reset to 0 in the corner case condition, making the FUZZ logic fail. Fix it by not corrupting the imbalance and change the imbalance, only when it finds that the HT/MC optimization is needed. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:48 +02:00
Linus Torvalds	5e7a39275b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: clean up task_new_fair() sched: small schedstat fix sched: fix wait_start_fair condition in update_stats_wait_end() sched: call update_curr() in task_tick_fair() sched: make the scheduler converge to the ideal latency sched: fix sleeper bonus limit	2007-08-31 10:52:00 -07:00
Oleg Nesterov	60187d2708	sigqueue_free: fix the race with collect_signal() Spotted by taoyue <yue.tao@windriver.com> and Jeremy Katz <jeremy.katz@windriver.com>. collect_signal: sigqueue_free: list_del_init(&first->list); if (!list_empty(&q->list)) { // not taken } q->flags &= ~SIGQUEUE_PREALLOC; __sigqueue_free(first); __sigqueue_free(q); Now, __sigqueue_free() is called twice on the same "struct sigqueue" with the obviously bad implications. In particular, this double free breaks the array_cache->avail logic, so the same sigqueue could be "allocated" twice, and the bug can manifest itself via the "impossible" BUG_ON(!SIGQUEUE_PREALLOC) in sigqueue_free/send_sigqueue. Hopefully this can explain these mysterious bug-reports, see http://marc.info/?t=118766926500003 http://marc.info/?t=118466273000005 Alexey Dobriyan reports this patch makes the difference for the testcase, but nobody has an access to the application which opened the problems originally. Also, this patch removes tasklist lock/unlock, ->siglock is enough. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: taoyue <yue.tao@windriver.com> Cc: Jeremy Katz <jeremy.katz@windriver.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Roland McGrath <roland@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-31 01:42:23 -07:00
Alexey Dobriyan	99db67bc04	userns: don't leak root user Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-31 01:42:23 -07:00
Jarek Poplawski	59845b1ffd	request_irq: fix DEBUG_SHIRQ handling Mariusz Kozlowski reported lockdep's warning: > ================================= > [ INFO: inconsistent lock state ] > 2.6.23-rc2-mm1 #7 > --------------------------------- > inconsistent {in-hardirq-W} -> {hardirq-on-W} usage. > ifconfig/5492 [HC0[0]:SC0[0]:HE1:SE1] takes: > (&tp->lock){+...}, at: [<de8706e0>] rtl8139_interrupt+0x27/0x46b [8139too] > {in-hardirq-W} state was registered at: > [<c0138eeb>] __lock_acquire+0x949/0x11ac > [<c01397e7>] lock_acquire+0x99/0xb2 > [<c0452ff3>] _spin_lock+0x35/0x42 > [<de8706e0>] rtl8139_interrupt+0x27/0x46b [8139too] > [<c0147a5d>] handle_IRQ_event+0x28/0x59 > [<c01493ca>] handle_level_irq+0xad/0x10b > [<c0105a13>] do_IRQ+0x93/0xd0 > [<c010441e>] common_interrupt+0x2e/0x34 ... > other info that might help us debug this: > 1 lock held by ifconfig/5492: > #0: (rtnl_mutex){--..}, at: [<c0451778>] mutex_lock+0x1c/0x1f > > stack backtrace: ... > [<c0452ff3>] _spin_lock+0x35/0x42 > [<de8706e0>] rtl8139_interrupt+0x27/0x46b [8139too] > [<c01480fd>] free_irq+0x11b/0x146 > [<de871d59>] rtl8139_close+0x8a/0x14a [8139too] > [<c03bde63>] dev_close+0x57/0x74 ... This shows that a driver's irq handler was running both in hard interrupt and process contexts with irqs enabled. The latter was done during free_irq() call and was possible only with CONFIG_DEBUG_SHIRQ enabled. This was fixed by another patch. But similar problem is possible with request_irq(): any locks taken from irq handler could be vulnerable - especially with soft interrupts. This patch fixes it by disabling local interrupts during handler's run. (It seems, disabling softirqs should be enough, but it needs more checking on possible races or other special cases). Reported-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-31 01:42:23 -07:00
Rafael J. Wysocki	f3de4be9d5	PM: Fix dependencies of CONFIG_SUSPEND and CONFIG_HIBERNATION Dependencies of CONFIG_SUSPEND and CONFIG_HIBERNATION introduced by commit `296699de6b` "Introduce CONFIG_SUSPEND for suspend-to-Ram and standby" are incorrect, as they don't cover the facts that (1) not all architectures support suspend and (2) SMP hibernation is only possible on X86 and PPC64 (if CONFIG_PPC64_SWSUSP is set). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-31 01:42:22 -07:00
Oleg Nesterov	b07e35f94a	setpgid(child) fails if the child was forked by sub-thread Spotted by Marcin Kowalczyk <qrczak@knm.org.pl>. sys_setpgid(child) fails if the child was forked by sub-thread. Fix the "is it our child" check. The previous commit `ee0acf90d3` was not complete. (this patch asks for the new same_thread_group() helper, but mainline doesn't have it yet). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: <stable@kernel.org> Tested-by: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-31 01:42:22 -07:00
Jonathan Lim	f2ab6d8889	Assign task_struct.exit_code before taskstats_exit() taskstats.ac_exitcode is assigned to task_struct.exit_code in bacct_add_tsk() through the following kernel function calls: do_exit() taskstats_exit() fill_pid() bacct_add_tsk() The problem is that in do_exit(), task_struct.exit_code is set to 'code' only after taskstats_exit() has been called. So we need to move the assignment before taskstats_exit(). Signed-off-by: Jonathan Lim <jlim@sgi.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-31 01:42:22 -07:00
Ingo Molnar	9f508f8258	sched: clean up task_new_fair() cleanup: we have the 'se' and 'curr' entity-pointers already, no need to use p->se and current->se. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-08-28 12:53:24 +02:00
Ingo Molnar	213c8af67f	sched: small schedstat fix small schedstat fix: the cfs_rq->wait_runtime 'sum of all runtimes' statistics counters missed newly forked tasks and thus had a constant negative skew. Fix this. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-08-28 12:53:24 +02:00
Ingo Molnar	b77d69db9f	sched: fix wait_start_fair condition in update_stats_wait_end() Peter Zijlstra noticed the following bug in SCHED_FEAT_SKIP_INITIAL (which is disabled by default at the moment): it relies on se.wait_start_fair being 0 while update_stats_wait_end() did not recognize a 0 value, so instead of 'skipping' the initial interval we gave the new child a maximum boost of +runtime-limit ... (No impact on the default kernel, but nice to fix for completeness.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-08-28 12:53:24 +02:00
Ting Yang	7109c4429a	sched: call update_curr() in task_tick_fair() update the fair-clock before using it for the key value. [ mingo@elte.hu: small cleanups. ] Signed-off-by: Ting Yang <tingy@cs.umass.edu> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-08-28 12:53:24 +02:00
Ingo Molnar	f6cf891c4d	sched: make the scheduler converge to the ideal latency de-HZ-ification of the granularity defaults unearthed a pre-existing property of CFS: while it correctly converges to the granularity goal, it does not prevent run-time fluctuations in the range of [-gran ... 0 ... +gran]. With the increase of the granularity due to the removal of HZ dependencies, this becomes visible in chew-max output (with 5 tasks running): out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40 out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40 out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40 out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40 out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 average slice is the ideal 13 msecs and the period is picture-perfect 40 msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no mechanism in CFS to keep that from happening: it's a perfectly valid solution that CFS finds. to fix this we add a granularity/preemption rule that knows about the "target latency", which makes tasks that run longer than the ideal latency run a bit less. The simplest approach is to simply decrease the preemption granularity when a task overruns its ideal latency. For this we have to track how much the task executed since its last preemption. ( this adds a new field to task_struct, but we can eliminate that overhead in 2.6.24 by putting all the scheduler timestamps into an anonymous union. ) with this change in place, chew-max output is fluctuation-less all around: out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40 this patch has no impact on any fastpath or on any globally observable scheduling property. (unless you have sharp enough eyes to see millisecond-level ruckles in glxgears smoothness :-) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-08-28 12:53:24 +02:00
Mike Galbraith	5f01d519e6	sched: fix sleeper bonus limit There is an Amarok song switch time increase (regression) under hefty load. What is happening is that sleeper_bonus is never consumed, and only rarely goes below runtime_limit, so for the most part, Amarok isn't getting any bonus at all. We're keeping sleeper_bonus right at runtime_limit (sched_latency == sched_runtime_limit == 40ms) forever, ie we don't consume if we're lower than that, and don't add if we're above it. One Amarok thread waking (or anybody else) will push us past the threshold, so the next thread waking gets nada, but will reap pain from the previous thread waking until we drop back to runtime_limit. It looks to me like under load, some random task gets a bonus, and everybody else pays, whether deserving or not. This diff fixed the regression for me at any load rate. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-08-28 12:53:24 +02:00
Hugh Dickins	d243769d3f	fix bogus hotplug cpu warning Fix bogus DEBUG_PREEMPT warning on x86_64, when cpu brought online after bootup: current_is_keventd is right to note its use of smp_processor_id is preempt-safe, but should use raw_smp_processor_id to avoid the warning. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-27 10:27:48 -07:00
Ingo Molnar	50c46637aa	sched: s/sched_latency/sched_min_granularity runtime limit and wakeup granularity used to be a function of granularity, and that was incorrectly changed to sched_latency. Fix this to make wakeup granularity a function of min-granularity, and the runtime limit equal to latency. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-25 22:17:19 +02:00
Ingo Molnar	172ac3dbb7	sched: cleanup, sched_granularity -> sched_min_granularity due to adaptive granularity scheduling the role of sched_granularity has changed to "minimum granularity", so rename the variable (and the tunable) accordingly. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-08-25 18:41:53 +02:00
Peter Zijlstra	218050855e	sched: adaptive scheduler granularity Instead of specifying the preemption granularity, specify the wanted latency. By fixing the granularity to a constant, the wakeup latency is a function of the number of running tasks on the rq. Invert this relation. sysctl_sched_granularity becomes a minimum for the dynamic granularity computed from the new sysctl_sched_latency. Then use this latency to do more intelligent granularity decisions: if there are fewer tasks running then we can schedule coarser. This helps performance while still always keeping the latency target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-25 18:41:53 +02:00
Peter Zijlstra	1fc84aaae3	sched: fix CONFIG_SCHED_DEBUG dependency of lockdep sysctls Make the lockdep sysctls not depend on CONFIG_SCHED_DEBUG. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-25 18:41:52 +02:00
Ingo Molnar	095e56c703	sched: fix startup penalty calculation fix task startup penalty miscalculation: sysctl_sched_granularity is unsigned int and wait_runtime is long so we first have to convert it to long before turning it negative ... Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Peter Zijlstra	ea0aa3b23a	sched: simplify bonus calculation #2 current code: delta = calc_delta_mine(delta_exec, curr->load.weight, lw); delta = min((u64)delta, cfs_rq->sleeper_bonus); Notice that this calc_delta_mine() line is exactly delta_mine, which gives: delta = min((u64)delta_mine, cfs_rq->sleeper_bonus); Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Peter Zijlstra	a6f2994042	sched: simplify bonus calculation #1 current code: delta = min(cfs_rq->sleeper_bonus, (u64)delta_exec); delta = calc_delta_mine(delta, curr->load.weight, lw); delta = min((u64)delta, cfs_rq->sleeper_bonus); drop the first min(), because we clip against sleeper_bonus in the 3rd line again. That gives: delta = calc_delta_mine(delta_exec, curr->load.weight, lw); delta = min((u64)delta, cfs_rq->sleeper_bonus); Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Ingo Molnar	b2133c8b1e	sched: tidy up and simplify the bonus balance make the bonus balance more consistent: do not hand out a bonus if there's too much in flight already, and only deduct as much from a runner as it has the capacity. This makes the bonus engine a zero-sum game (as intended). this also simplifies the code: text data bss dec hex filename 34770 2998 24 37792 93a0 sched.o.before 34749 2998 24 37771 938b sched.o.after and it also avoids overscheduling in sleep-happy workloads like hackbench.c. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Dmitry Adamushko	98fbc79853	sched: optimize task_tick_rt() a bit Mitchell Erblich suggested a quality-of-implementation change to not requeue SCHED_RR tasks if there's only a single task on the runqueue, by checking for rq->nr_running == 1. provide a more efficient implementation of that, to check that particular RT priority-queue only. [ From: mingo@elte.hu ] Also first requeue the task then set need_resched - results in slightly better machine-instruction ordering. Also clean up the code a bit. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Sven-Thorsten Dietrich	deac4ee65a	sched: simplify can_migrate_task() Remove trivial conditional branch in Linux scheduler's can_migrate_task() function. text data bss dec hex filename 34770 2998 24 37792 93a0 sched.o.before 34757 2998 24 37779 9393 sched.o.after Signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Ingo Molnar	71fd371463	sched: remove HZ dependency from the granularity default remove HZ dependency from the granularity default. Use 10 msec for the base granularity, 1 msec for wakeup granularity and 25 msec for batch wakeup granularity. (These defaults are close to the values that the default HZ=250 setting got previously, and thus it's the most common setting.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Bruce Ashfield	7c6c16f354	sched: CONFIG_SCHED_GROUP_FAIR=y fixlet when I built with CONFIG_FAIR_GROUP_SCHED=y, I need the following change to make things right. [ From: mingo@elte.hu ] this config option is not upstream-configurable right now but lets fix this for completeness. Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Linus Torvalds	d0797b39dc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: tweak the sched_runtime_limit tunable sched: skip updating rq's next_balance under null SD sched: fix broken SMT/MC optimizations sched: accounting regression since rc1 sched: fix sysctl directory permissions sched: sched_clock_idle_[sleep|wakeup]_event()	2007-08-23 21:38:39 -07:00
Linus Torvalds	de80af4cc9	Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: sysfs: don't warn on removal of a nonexistent binary file HOWTO: latest lxr url address changed HOWTO: korean translation of Documentation/HOWTO Fix Off-by-one in /sys/module/*/refcnt sysfs: fix locking in sysfs_lookup() and sysfs_rename_dir()	2007-08-23 21:34:43 -07:00
Ingo Molnar	505c0efd58	sched: tweak the sched_runtime_limit tunable Michael Gerdau reported reniced task CPU usage weirdnesses. Such symptoms can be caused by limit underruns so double the sched_runtime_limit. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Suresh Siddha	f549da848e	sched: skip updating rq's next_balance under null SD Was playing with sched_smt_power_savings/sched_mc_power_savings and found out that while the scheduler domains are reconstructed when sysfs settings change, rebalance_domains() can get triggered with null domain on other cpus, which is setting next_balance to jiffies + 60*HZ. Resulting in no idle/busy balancing for 60 seconds. Fix this. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Suresh Siddha	f8700df7c4	sched: fix broken SMT/MC optimizations On a four package system with HT - HT load balancing optimizations were broken. For example, if two tasks end up running on two logical threads of one of the packages, scheduler is not able to pull one of the tasks to a completely idle package. In this scenario, for nice-0 tasks, imbalance calculated by scheduler will be 512 and find_busiest_queue() will return 0 (as each cpu's load is 1024 > imbalance and has only one task running). Similarly MC scheduler optimizations also get fixed with this patch. [ mingo@elte.hu: restored fair balancing by increasing the fuzz and adding it back to the power decision, without the /2 factor. ] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Eric W. Biederman	c57baf1e1e	sched: fix sysctl directory permissions There are two remaining gotchas: - The directories have impossible permissions (writeable). - The ctl_name for the kernel directory is inconsistent with everything else. It should be CTL_KERN. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Ingo Molnar	2aa44d0567	sched: sched_clock_idle_[sleep|wakeup]_event() construct a more or less wall-clock time out of sched_clock(), by using ACPI-idle's existing knowledge about how much time we spent idling. This allows the rq clock to work around TSC-stops-in-C2, TSC-gets-corrupted-in-C3 type of problems. ( Besides the scheduler's statistics this also benefits blktrace and printk-timestamps as well. ) Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup callbacks allow the scheduler to get out the most of the period where the CPU has a reliable TSC. This results in slightly more precise task statistics. the ACPI bits were acked by Len. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Len Brown <len.brown@intel.com>	2007-08-23 15:18:02 +02:00
Oleg Nesterov	834d216e1f	signalfd: fix interaction with posix-timers dequeue_signal: if (__SI_TIMER) { spin_unlock(&tsk->sighand->siglock); do_schedule_next_timer(info); spin_lock(&tsk->sighand->siglock); } Unless tsk == current, this is absolutely unsafe: nothing prevents tsk from exiting. If signalfd was passed to another process, do_schedule_next_timer() is just wrong. Add yet another "tsk == current" check into dequeue_signal(). This patch fixes an oopsable bug, but breaks the scheduling of posix timers if the shared __SI_TIMER signal was fetched via signalfd attached to another sub-thread. Mostly fixed by the next patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Roland McGrath <roland@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:46 -07:00
Oleg Nesterov	d02479bdeb	posix-timers: fix creation race sys_timer_create() sets ->it_process and unlocks ->siglock, then checks tmr->it_sigev_notify to define if get_task_struct() is needed. We already passed ->it_id to the caller, another thread can delete this timer and free its memory in between. As a minimal fix, move this code under ->siglock, sys_timer_delete() takes it too before calling release_posix_timer(). A proper serialization would be to take ->it_lock, we add a partly initialized timer on posix_timers_id, not good. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:46 -07:00
Thomas Gleixner	179394af7a	posix-timers: fix deletion race timer_delete does: lock_timer(); timer->it_process = NULL; unlock_timer(); release_posix_timer(); timer->it_process is checked in lock_timer() to prevent access to a timer, which is on the way to be deleted, but the check happens after idr_lock is dropped. This allows release_posix_timer() to delete the timer before the lock code can check the timer: CPU 0 CPU 1 lock_timer(); timer->it_process = NULL; unlock_timer(); lock_timer() spin_lock(idr_lock); timer = idr_find(); spin_lock(timer->lock); spin_unlock(idr_lock); release_posix_timer(); spin_lock(idr_lock); idr_remove(timer); spin_unlock(idr_lock); free_timer(timer); if (timer->......) Change the locking to prevent this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:45 -07:00
Andrew Morton	8b7f07155f	free_irq(): fix DEBUG_SHIRQ handling If we're going to run the handler from free_irq() then we must do it with local irq's disabled. Otherwise lockdep complains that the handler is taking irq-safe spinlocks in a non-irq-safe fashion. Cc: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:44 -07:00
john stultz	187226f57f	futex_unlock_pi() hurts my brain and may cause application deadlock Avoid futex_unlock_pi returning -EFAULT (which results in deadlock), by clearing uval before jumping to retry_locked. Signed-off-by: John Stultz <johnstul@us.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:44 -07:00
Adrian Bunk	88ae704c2a	kernel/auditsc.c: fix an off-by-one This patch fixes an off-by-one in a BUG_ON() spotted by the Coverity checker. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Amy Griffis <amy.griffis@hp.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:44 -07:00
Alexey Dobriyan	256e2fdf03	Fix Off-by-one in /sys/module/*/refcnt sysfs internals were changed to not pin module in question. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-08-22 14:35:35 -07:00
Robin Getz	cb00e99c0a	fix - ensure we don't use bootconsoles after init has been released Gerd Hoffmann pointed out that my patch from yesterday can lead to a null pointer dereference if the kernel is booted with no console, and no earlyprintk defined. This fixes that issue. Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-21 20:23:53 -07:00
Robin Getz	0c5564bd91	ensure we don't use bootconsoles after init has been released This is a followup to the cleanups for earlyprintk patch from Gerd Hoffmann http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=69331af79cf29e26d1231152a172a1a10c2df511 This ensures that a bootconsole is unregistered if it is not replaced. The current implementation spews garbage out the bootconsole in this case, since the bootconsole structure is normally in the init section, and is freed, but still used. Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Mike Frysinger <vapier.adi@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-20 22:42:01 -07:00
Christian Heim	e598fbaabd	Remove double inclusion of linux/capability.h Remove the second inclusion of linux/capability.h, which has been introduced with "[PATCH] move capable() to capability.h" (commit `c59ede7b78`) Signed-off-by: Christian Heim <phreak@gentoo.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-19 10:12:32 -07:00
Linus Torvalds	738ddd3039	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: run_rebalance_domains: s/SCHED_IDLE/CPU_IDLE/ sched: fix sleeper bonus sched: make global code static	2007-08-12 11:06:45 -07:00
Thomas Gleixner	2464286ace	genirq: suppress resend of level interrupts Level type interrupts are resent by the interrupt hardware when they are still active at irq_enable(). Suppress the resend mechanism for interrupts marked as level. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-12 11:05:45 -07:00
Thomas Gleixner	496634217e	genirq: cleanup mismerge artifact Commit 5a43a066b11ac2fe84cf67307f20b83bea390f83: "genirq: Allow fasteoi handler to retrigger disabled interrupts" was erroneously applied to handle_level_irq(). This added the irq retrigger / resend functionality to the level irq handler. Revert the offending bits. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-12 11:05:45 -07:00

2919 Commits