OpenCloudOS-Kernel/init/main.c

831 lines
20 KiB
C
Raw Normal View History

/*
* linux/init/main.c
*
* Copyright (C) 1991, 1992 Linus Torvalds
*
* GK 2/5/95 - Changed to support mounting root fs via NFS
* Added initrd & change_root: Werner Almesberger & Hans Lermen, Feb '96
* Moan early if gcc is old, avoiding bogus kernels - Paul Gortmaker, May '96
* Simplified starting of init: Michael A. Griffith <grif@acm.org>
*/
#include <linux/types.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/stackprotector.h>
#include <linux/string.h>
#include <linux/ctype.h>
#include <linux/delay.h>
#include <linux/ioport.h>
#include <linux/init.h>
#include <linux/initrd.h>
#include <linux/bootmem.h>
#include <linux/acpi.h>
#include <linux/tty.h>
#include <linux/percpu.h>
#include <linux/kmod.h>
mm: rewrite vmap layer Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing). As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on an 4 core, 2 socket opteron. Results are in nanoseconds per map+touch+unmap. threads vanilla vmap rewrite 1 14700 2900 2 33600 3000 4 49500 2800 8 70631 2900 So with a 8 cores, the rewritten version is already 25x faster. In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles. Before: 1352059 total 0.1401 798784 _write_lock 8320.6667 <- vmlist_lock 529313 default_idle 1181.5022 15242 smp_call_function 15.8771 <- vmap tlb flushing 2472 __get_vm_area_node 1.9312 <- vmap 1762 remove_vm_area 4.5885 <- vunmap 316 map_vm_area 0.2297 <- vmap 312 kfree 0.1950 300 _spin_lock 3.1250 252 sn_send_IPI_phys 0.4375 <- tlb flushing 238 vmap 0.8264 <- vmap 216 find_lock_page 0.5192 196 find_next_bit 0.3603 136 sn2_send_IPI 0.2024 130 pio_phys_write_mmr 2.0312 118 unmap_kernel_range 0.1229 After: 78406 total 0.0081 40053 default_idle 89.4040 33576 ia64_spinlock_contention 349.7500 1650 _spin_lock 17.1875 319 __reg_op 0.5538 281 _atomic_dec_and_lock 1.0977 153 mutex_unlock 1.5938 123 iget_locked 0.1671 117 xfs_dir_lookup 0.1662 117 dput 0.1406 114 xfs_iget_core 0.0268 92 xfs_da_hashname 0.1917 75 d_alloc 0.0670 68 vmap_page_range 0.0462 <- vmap 58 kmem_cache_alloc 0.0604 57 memset 0.0540 52 rb_next 0.1625 50 __copy_user 0.0208 49 bitmap_find_free_region 0.2188 <- vmap 46 ia64_sn_udelay 0.1106 45 find_inode_fast 0.1406 42 memcmp 0.2188 42 finish_task_switch 0.1094 42 __d_lookup 0.0410 40 radix_tree_lookup_slot 0.1250 37 _spin_unlock_irqrestore 0.3854 36 xfs_bmapi 0.0050 36 kmem_cache_free 0.0256 35 xfs_vn_getattr 0.0322 34 radix_tree_lookup 0.1062 33 __link_path_walk 0.0035 31 xfs_da_do_buf 0.0091 30 _xfs_buf_find 0.0204 28 find_get_page 0.0875 27 xfs_iread 0.0241 27 __strncpy_from_user 0.2812 26 _xfs_buf_initialize 0.0406 24 _xfs_buf_lookup_pages 0.0179 24 vunmap_page_range 0.0250 <- vunmap 23 find_lock_page 0.0799 22 vm_map_ram 0.0087 <- vmap 20 kfree 0.0125 19 put_page 0.0330 18 __kmalloc 0.0176 17 xfs_da_node_lookup_int 0.0086 17 _read_lock 0.0885 17 page_waitqueue 0.0664 vmap has gone from being the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time. [akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:27:03 +08:00
#include <linux/vmalloc.h>
#include <linux/kernel_stat.h>
#include <linux/start_kernel.h>
#include <linux/security.h>
#include <linux/smp.h>
#include <linux/profile.h>
#include <linux/rcupdate.h>
#include <linux/moduleparam.h>
#include <linux/kallsyms.h>
#include <linux/writeback.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
Task Control Groups: basic task cgroup framework Generic Process Control Groups -------------------------- There have recently been various proposals floating around for resource management/accounting and other task grouping subsystems in the kernel, including ResGroups, User BeanCounters, NSProxy cgroups, and others. These all need the basic abstraction of being able to group together multiple processes in an aggregate, in order to track/limit the resources permitted to those processes, or control other behaviour of the processes, and all implement this grouping in different ways. This patchset provides a framework for tracking and grouping processes into arbitrary "cgroups" and assigning arbitrary state to those groupings, in order to control the behaviour of the cgroup as an aggregate. The intention is that the various resource management and virtualization/cgroup efforts can also become task cgroup clients, with the result that: - the userspace APIs are (somewhat) normalised - it's easier to test e.g. the ResGroups CPU controller in conjunction with the BeanCounters memory controller, or use either of them as the resource-control portion of a virtual server system. - the additional kernel footprint of any of the competing resource management systems is substantially reduced, since it doesn't need to provide process grouping/containment, hence improving their chances of getting into the kernel This patch: Add the main task cgroups framework - the cgroup filesystem, and the basic structures for tracking membership and associating subsystem state objects to tasks. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
#include <linux/cgroup.h>
#include <linux/efi.h>
#include <linux/tick.h>
#include <linux/interrupt.h>
#include <linux/taskstats_kern.h>
#include <linux/delayacct.h>
#include <linux/unistd.h>
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
#include <linux/buffer_head.h>
#include <linux/page_cgroup.h>
#include <linux/debug_locks.h>
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
#include <linux/debugobjects.h>
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
#include <linux/lockdep.h>
#include <linux/kmemleak.h>
#include <linux/pid_namespace.h>
#include <linux/device.h>
kthread: don't depend on work queues Currently there is a circular reference between work queue initialization and kthread initialization. This prevents the kthread infrastructure from initializing until after work queues have been initialized. We want the properties of tasks created with kthread_create to be as close as possible to the init_task and to not be contaminated by user processes. The later we start our kthreadd that creates these tasks the harder it is to avoid contamination from user processes and the more of a mess we have to clean up because the defaults have changed on us. So this patch modifies the kthread support to not use work queues but to instead use a simple list of structures, and to have kthreadd start from init_task immediately after our kernel thread that execs /sbin/init. By being a true child of init_task we only have to change those process settings that we want to have different from init_task, such as our process name, the cpus that are allowed, blocking all signals and setting SIGCHLD to SIG_IGN so that all of our children are reaped automatically. By being a true child of init_task we also naturally get our ppid set to 0 and do not wind up as a child of PID == 1. Ensuring that tasks generated by kthread_create will not slow down the functioning of the wait family of functions. [akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 17:34:32 +08:00
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/idr.h>
#include <linux/kgdb.h>
#include <linux/ftrace.h>
#include <linux/async.h>
#include <linux/kmemcheck.h>
#include <linux/sfi.h>
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev Devtmpfs lets the kernel create a tmpfs instance called devtmpfs very early at kernel initialization, before any driver-core device is registered. Every device with a major/minor will provide a device node in devtmpfs. Devtmpfs can be changed and altered by userspace at any time, and in any way needed - just like today's udev-mounted tmpfs. Unmodified udev versions will run just fine on top of it, and will recognize an already existing kernel-created device node and use it. The default node permissions are root:root 0600. Proper permissions and user/group ownership, meaningful symlinks, all other policy still needs to be applied by userspace. If a node is created by devtmps, devtmpfs will remove the device node when the device goes away. If the device node was created by userspace, or the devtmpfs created node was replaced by userspace, it will no longer be removed by devtmpfs. If it is requested to auto-mount it, it makes init=/bin/sh work without any further userspace support. /dev will be fully populated and dynamic, and always reflect the current device state of the kernel. With the commonly used dynamic device numbers, it solves the problem where static devices nodes may point to the wrong devices. It is intended to make the initial bootup logic simpler and more robust, by de-coupling the creation of the inital environment, to reliably run userspace processes, from a complex userspace bootstrap logic to provide a working /dev. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Tested-By: Harald Hoyer <harald@redhat.com> Tested-By: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-30 21:23:42 +08:00
#include <linux/shmem_fs.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/slab.h>
#include <linux/perf_event.h>
#include <asm/io.h>
#include <asm/bugs.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/cacheflush.h>
#ifdef CONFIG_X86_LOCAL_APIC
#include <asm/smp.h>
#endif
static int kernel_init(void *);
extern void init_IRQ(void);
extern void fork_init(unsigned long);
extern void mca_init(void);
extern void sbus_init(void);
extern void prio_tree_init(void);
extern void radix_tree_init(void);
extern void free_initmem(void);
#ifndef CONFIG_DEBUG_RODATA
static inline void mark_rodata_ro(void) { }
#endif
#ifdef CONFIG_TC
extern void tc_init(void);
#endif
/*
* Debug helper: via this flag we know that we are in 'early bootup code'
* where only the boot processor is running with IRQ disabled. This means
* two things - IRQ must not be enabled before the flag is cleared and some
* operations which are not allowed with IRQ disabled are allowed while the
* flag is set.
*/
bool early_boot_irqs_disabled __read_mostly;
rcu: Teach RCU that idle task is not quiscent state at boot This patch fixes a bug located by Vegard Nossum with the aid of kmemcheck, updated based on review comments from Nick Piggin, Ingo Molnar, and Andrew Morton. And cleans up the variable-name and function-name language. ;-) The boot CPU runs in the context of its idle thread during boot-up. During this time, idle_cpu(0) will always return nonzero, which will fool Classic and Hierarchical RCU into deciding that a large chunk of the boot-up sequence is a big long quiescent state. This in turn causes RCU to prematurely end grace periods during this time. This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks() function to ignore the idle task as a quiescent state until the system has started up the scheduler in rest_init(), introducing a new non-API function rcu_idle_now_means_idle() to inform RCU of this transition. RCU maintains an internal rcu_idle_cpu_truthful variable to track this state, which is then used by rcu_check_callback() to determine if it should believe idle_cpu(). Because this patch has the effect of disallowing RCU grace periods during long stretches of the boot-up sequence, this patch also introduces Josh Triplett's UP-only optimization that makes synchronize_rcu() be a no-op if num_online_cpus() returns 1. This allows boot-time code that calls synchronize_rcu() to proceed normally. Note, however, that RCU callbacks registered by call_rcu() will likely queue up until later in the boot sequence. Although rcuclassic and rcutree can also use this same optimization after boot completes, rcupreempt must restrict its use of this optimization to the portion of the boot sequence before the scheduler starts up, given that an rcupreempt RCU read-side critical section may be preeempted. In addition, this patch takes Nick Piggin's suggestion to make the system_state global variable be __read_mostly. Changes since v4: o Changes the name of the introduced function and variable to be less emotional. ;-) Changes since v3: o WARN_ON(nr_context_switches() > 0) to verify that RCU switches out of boot-time mode before the first context switch, as suggested by Nick Piggin. Changes since v2: o Created rcu_blocking_is_gp() internal-to-RCU API that determines whether a call to synchronize_rcu() is itself a grace period. o The definition of rcu_blocking_is_gp() for rcuclassic and rcutree checks to see if but a single CPU is online. o The definition of rcu_blocking_is_gp() for rcupreempt checks to see both if but a single CPU is online and if the system is still in early boot. This allows rcupreempt to again work correctly if running on a single CPU after booting is complete. o Added check to rcupreempt's synchronize_sched() for there being but one online CPU. Tested all three variants both SMP and !SMP, booted fine, passed a short rcutorture test on both x86 and Power. Located-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 10:03:42 +08:00
enum system_states system_state __read_mostly;
EXPORT_SYMBOL(system_state);
/*
* Boot command-line arguments
*/
#define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
#define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
extern void time_init(void);
/* Default late time init is NULL. archs can override this later. */
void (*__initdata late_time_init)(void);
extern void softirq_init(void);
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
/* Untouched command line saved by arch-specific code. */
char __initdata boot_command_line[COMMAND_LINE_SIZE];
/* Untouched saved command line (eg. for /proc) */
char *saved_command_line;
/* Command line for parameter parsing */
static char *static_command_line;
static char *execute_command;
static char *ramdisk_execute_command;
/*
* If set, this is an indication to the drivers that reset the underlying
* device before going ahead with the initialization otherwise driver might
* rely on the BIOS and skip the reset operation.
*
* This is useful if kernel is booting in an unreliable environment.
* For ex. kdump situaiton where previous kernel has crashed, BIOS has been
* skipped and devices will be in unknown state.
*/
unsigned int reset_devices;
EXPORT_SYMBOL(reset_devices);
static int __init set_reset_devices(char *str)
{
reset_devices = 1;
return 1;
}
__setup("reset_devices", set_reset_devices);
static const char * argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
const char * envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
static const char *panic_later, *panic_param;
extern const struct obs_kernel_param __setup_start[], __setup_end[];
static int __init obsolete_checksetup(char *line)
{
const struct obs_kernel_param *p;
int had_early_param = 0;
p = __setup_start;
do {
int n = strlen(p->str);
if (!strncmp(line, p->str, n)) {
if (p->early) {
/* Already done in parse_early_param?
* (Needs exact match on param part).
* Keep iterating, as we can have early
* params and __setups of same names 8( */
if (line[n] == '\0' || line[n] == '=')
had_early_param = 1;
} else if (!p->setup_func) {
printk(KERN_WARNING "Parameter %s is obsolete,"
" ignored\n", p->str);
return 1;
} else if (p->setup_func(line + n))
return 1;
}
p++;
} while (p < __setup_end);
return had_early_param;
}
/*
* This should be approx 2 Bo*oMips to start (note initial shift), and will
* still work even if initially too large, it will just take slightly longer
*/
unsigned long loops_per_jiffy = (1<<12);
EXPORT_SYMBOL(loops_per_jiffy);
static int __init debug_kernel(char *str)
{
console_loglevel = 10;
return 0;
}
static int __init quiet_kernel(char *str)
{
console_loglevel = 4;
return 0;
}
early_param("debug", debug_kernel);
early_param("quiet", quiet_kernel);
static int __init loglevel(char *str)
{
get_option(&str, &console_loglevel);
return 0;
}
early_param("loglevel", loglevel);
/*
* Unknown boot options get handed to init, unless they look like
* unused parameters (modprobe will find them in /proc/cmdline).
*/
static int __init unknown_bootoption(char *param, char *val)
{
/* Change NUL term back to "=", to make "param" the whole string. */
if (val) {
/* param=val or param="val"? */
if (val == param+strlen(param)+1)
val[-1] = '=';
else if (val == param+strlen(param)+2) {
val[-2] = '=';
memmove(val-1, val, strlen(val)+1);
val--;
} else
BUG();
}
/* Handle obsolete-style parameters */
if (obsolete_checksetup(param))
return 0;
/* Unused module parameter. */
if (strchr(param, '.') && (!val || strchr(param, '.') < val))
return 0;
if (panic_later)
return 0;
if (val) {
/* Environment option */
unsigned int i;
for (i = 0; envp_init[i]; i++) {
if (i == MAX_INIT_ENVS) {
panic_later = "Too many boot env vars at `%s'";
panic_param = param;
}
if (!strncmp(param, envp_init[i], val - param))
break;
}
envp_init[i] = param;
} else {
/* Command line option */
unsigned int i;
for (i = 0; argv_init[i]; i++) {
if (i == MAX_INIT_ARGS) {
panic_later = "Too many boot init vars at `%s'";
panic_param = param;
}
}
argv_init[i] = param;
}
return 0;
}
#ifdef CONFIG_DEBUG_PAGEALLOC
int __read_mostly debug_pagealloc_enabled = 0;
#endif
static int __init init_setup(char *str)
{
unsigned int i;
execute_command = str;
/*
* In case LILO is going to boot us with default command line,
* it prepends "auto" before the whole cmdline which makes
* the shell think it should execute a script with such name.
* So we ignore all arguments entered _before_ init=... [MJ]
*/
for (i = 1; i < MAX_INIT_ARGS; i++)
argv_init[i] = NULL;
return 1;
}
__setup("init=", init_setup);
static int __init rdinit_setup(char *str)
{
unsigned int i;
ramdisk_execute_command = str;
/* See "auto" comment in init_setup */
for (i = 1; i < MAX_INIT_ARGS; i++)
argv_init[i] = NULL;
return 1;
}
__setup("rdinit=", rdinit_setup);
#ifndef CONFIG_SMP
static const unsigned int setup_max_cpus = NR_CPUS;
#ifdef CONFIG_X86_LOCAL_APIC
static void __init smp_init(void)
{
APIC_init_uniprocessor();
}
#else
#define smp_init() do { } while (0)
#endif
static inline void setup_nr_cpu_ids(void) { }
static inline void smp_prepare_cpus(unsigned int maxcpus) { }
#endif
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
/*
* We need to store the untouched command line for future reference.
* We also need to store the touched command line since the parameter
* parsing is performed in place, and we should allow a component to
* store reference of name/value for future reference.
*/
static void __init setup_command_line(char *command_line)
{
saved_command_line = alloc_bootmem(strlen (boot_command_line)+1);
static_command_line = alloc_bootmem(strlen (command_line)+1);
strcpy (saved_command_line, boot_command_line);
strcpy (static_command_line, command_line);
}
/*
* We need to finalize in a non-__init function or else race conditions
* between the root thread and the init thread may cause start_kernel to
* be reaped by free_initmem before the root thread has proceeded to
* cpu_idle.
*
* gcc-3.4 accidentally inlines this function, so use noinline.
*/
static __initdata DECLARE_COMPLETION(kthreadd_done);
static noinline void __init_refok rest_init(void)
{
kthread: don't depend on work queues Currently there is a circular reference between work queue initialization and kthread initialization. This prevents the kthread infrastructure from initializing until after work queues have been initialized. We want the properties of tasks created with kthread_create to be as close as possible to the init_task and to not be contaminated by user processes. The later we start our kthreadd that creates these tasks the harder it is to avoid contamination from user processes and the more of a mess we have to clean up because the defaults have changed on us. So this patch modifies the kthread support to not use work queues but to instead use a simple list of structures, and to have kthreadd start from init_task immediately after our kernel thread that execs /sbin/init. By being a true child of init_task we only have to change those process settings that we want to have different from init_task, such as our process name, the cpus that are allowed, blocking all signals and setting SIGCHLD to SIG_IGN so that all of our children are reaped automatically. By being a true child of init_task we also naturally get our ppid set to 0 and do not wind up as a child of PID == 1. Ensuring that tasks generated by kthread_create will not slow down the functioning of the wait family of functions. [akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 17:34:32 +08:00
int pid;
rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
numa_default_policy();
kthread: don't depend on work queues Currently there is a circular reference between work queue initialization and kthread initialization. This prevents the kthread infrastructure from initializing until after work queues have been initialized. We want the properties of tasks created with kthread_create to be as close as possible to the init_task and to not be contaminated by user processes. The later we start our kthreadd that creates these tasks the harder it is to avoid contamination from user processes and the more of a mess we have to clean up because the defaults have changed on us. So this patch modifies the kthread support to not use work queues but to instead use a simple list of structures, and to have kthreadd start from init_task immediately after our kernel thread that execs /sbin/init. By being a true child of init_task we only have to change those process settings that we want to have different from init_task, such as our process name, the cpus that are allowed, blocking all signals and setting SIGCHLD to SIG_IGN so that all of our children are reaped automatically. By being a true child of init_task we also naturally get our ppid set to 0 and do not wind up as a child of PID == 1. Ensuring that tasks generated by kthread_create will not slow down the functioning of the wait family of functions. [akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 17:34:32 +08:00
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
complete(&kthreadd_done);
/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
init_idle_bootup_task(current);
preempt_enable_no_resched();
schedule();
preempt_disable();
/* Call into cpu_idle with preempt disabled */
cpu_idle();
}
/* Check for early params. */
static int __init do_early_param(char *param, char *val)
{
const struct obs_kernel_param *p;
for (p = __setup_start; p < __setup_end; p++) {
if ((p->early && strcmp(param, p->str) == 0) ||
(strcmp(param, "console") == 0 &&
strcmp(p->str, "earlycon") == 0)
) {
if (p->setup_func(val) != 0)
printk(KERN_WARNING
"Malformed early option '%s'\n", param);
}
}
/* We accept everything at this stage. */
return 0;
}
Driver Core: early platform driver V3 of the early platform driver implementation. Platform drivers are great for embedded platforms because we can separate driver configuration from the actual driver. So base addresses, interrupts and other configuration can be kept with the processor or board code, and the platform driver can be reused by many different platforms. For early devices we have nothing today. For instance, to configure early timers and early serial ports we cannot use platform devices. This because the setup order during boot. Timers are needed before the platform driver core code is available. The same goes for early printk support. Early in this case means before initcalls. These early drivers today have their configuration either hard coded or they receive it using some special configuration method. This is working quite well, but if we want to support both regular kernel modules and early devices then we need to have two ways of configuring the same driver. A single way would be better. The early platform driver patch is basically a set of functions that allow drivers to register themselves and architecture code to locate them and probe. Registration happens through early_param(). The time for the probe is decided by the architecture code. See Documentation/driver-model/platform.txt for more details. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Brownell <david-b@pacbell.net> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-31 05:37:25 +08:00
void __init parse_early_options(char *cmdline)
{
parse_args("early options", cmdline, NULL, 0, do_early_param);
}
/* Arch code calls this early on, or if not, just before other parsing. */
void __init parse_early_param(void)
{
static __initdata int done = 0;
static __initdata char tmp_cmdline[COMMAND_LINE_SIZE];
if (done)
return;
/* All fall through to do_early_param. */
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
Driver Core: early platform driver V3 of the early platform driver implementation. Platform drivers are great for embedded platforms because we can separate driver configuration from the actual driver. So base addresses, interrupts and other configuration can be kept with the processor or board code, and the platform driver can be reused by many different platforms. For early devices we have nothing today. For instance, to configure early timers and early serial ports we cannot use platform devices. This because the setup order during boot. Timers are needed before the platform driver core code is available. The same goes for early printk support. Early in this case means before initcalls. These early drivers today have their configuration either hard coded or they receive it using some special configuration method. This is working quite well, but if we want to support both regular kernel modules and early devices then we need to have two ways of configuring the same driver. A single way would be better. The early platform driver patch is basically a set of functions that allow drivers to register themselves and architecture code to locate them and probe. Registration happens through early_param(). The time for the probe is decided by the architecture code. See Documentation/driver-model/platform.txt for more details. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Brownell <david-b@pacbell.net> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-31 05:37:25 +08:00
parse_early_options(tmp_cmdline);
done = 1;
}
/*
* Activate the first processor.
*/
static void __init boot_cpu_init(void)
{
int cpu = smp_processor_id();
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
}
void __init __weak smp_setup_processor_id(void)
{
}
void __init __weak thread_info_cache_init(void)
{
}
/*
* Set up kernel memory allocators
*/
static void __init mm_init(void)
{
/*
* page_cgroup requires countinous pages as memmap
* and it's bigger than MAX_ORDER unless SPARSEMEM.
*/
page_cgroup_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
pgtable_cache_init();
vmalloc_init();
}
asmlinkage void __init start_kernel(void)
{
char * command_line;
extern const struct kernel_param __start___param[], __stop___param[];
smp_setup_processor_id();
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
/*
* Need to run as early as possible, to initialize the
* lockdep hash:
*/
lockdep_init();
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
debug_objects_early_init();
/*
* Set up the the initial canary ASAP:
*/
boot_init_stack_canary();
Task Control Groups: basic task cgroup framework Generic Process Control Groups -------------------------- There have recently been various proposals floating around for resource management/accounting and other task grouping subsystems in the kernel, including ResGroups, User BeanCounters, NSProxy cgroups, and others. These all need the basic abstraction of being able to group together multiple processes in an aggregate, in order to track/limit the resources permitted to those processes, or control other behaviour of the processes, and all implement this grouping in different ways. This patchset provides a framework for tracking and grouping processes into arbitrary "cgroups" and assigning arbitrary state to those groupings, in order to control the behaviour of the cgroup as an aggregate. The intention is that the various resource management and virtualization/cgroup efforts can also become task cgroup clients, with the result that: - the userspace APIs are (somewhat) normalised - it's easier to test e.g. the ResGroups CPU controller in conjunction with the BeanCounters memory controller, or use either of them as the resource-control portion of a virtual server system. - the additional kernel footprint of any of the competing resource management systems is substantially reduced, since it doesn't need to provide process grouping/containment, hence improving their chances of getting into the kernel This patch: Add the main task cgroups framework - the cgroup filesystem, and the basic structures for tracking membership and associating subsystem state objects to tasks. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
cgroup_init_early();
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
local_irq_disable();
early_boot_irqs_disabled = true;
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/
tick_init();
boot_cpu_init();
page_address_init();
printk(KERN_NOTICE "%s", linux_banner);
setup_arch(&command_line);
cgroups: add an owner to the mm_struct Remove the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_state by cgroup_exit(), thus all future charges from runnings threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com>, Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 16:00:16 +08:00
mm_init_owner(&init_mm, &init_task);
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
setup_command_line(command_line);
setup_nr_cpu_ids();
setup_per_cpu_areas();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset For each new populated zone of hotadded node, need to update its pagesets with dynamically allocated per_cpu_pageset struct for all possible CPUs: 1) Detach zone->pageset from the shared boot_pageset at end of __build_all_zonelists(). 2) Use mutex to protect zone->pageset when it's still shared in onlined_pages() Otherwises, multiple zones of different nodes would share same boot strapping boot_pageset for same CPU, which will finally cause below kernel panic: ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:1239! invalid opcode: 0000 [#1] SMP ... Call Trace: [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0 [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0 [<ffffffff81128407>] __page_cache_alloc+0x67/0x70 [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260 [<ffffffff81132751>] ra_submit+0x21/0x30 [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0 [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0 [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670 [<ffffffff81266cfa>] nfs_file_read+0xca/0x130 [<ffffffff8117b20a>] do_sync_read+0xfa/0x140 [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0 [<ffffffff8117c151>] sys_read+0x51/0x80 [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900 RSP <ffff88000d1e78a8> ---[ end trace 4bda28328b9990db ] [akpm@linux-foundation.org: merge fix] Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Andi Kleen <andi.kleen@intel.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 05:32:51 +08:00
build_all_zonelists(NULL);
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
page_alloc_init();
printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
parse_early_param();
parse_args("Booting kernel", static_command_line, __start___param,
__stop___param - __start___param,
&unknown_bootoption);
/*
* These use large bootmem allocations and must precede
* kmem_cache_init()
*/
pidhash_init();
vfs_caches_init_early();
sort_main_extable();
trap_init();
mm_init();
/*
* Set up the scheduler prior starting any interrupts (such as the
* timer interrupt). Full topology setup happens at smp_init()
* time - but meanwhile we still have a functioning scheduler.
*/
sched_init();
/*
* Disable preemption - early bootup scheduling is extremely
* fragile until we cpu_idle() for the first time.
*/
preempt_disable();
if (!irqs_disabled()) {
printk(KERN_WARNING "start_kernel(): bug: interrupts were "
"enabled *very* early, fixing it\n");
local_irq_disable();
}
idr_init_cache();
perf_event_init();
rcu_init();
radix_tree_init();
/* init some links before init_ISA_irqs() */
early_irq_init();
init_IRQ();
prio_tree_init();
init_timers();
hrtimers_init();
softirq_init();
timekeeping_init();
time_init();
profile_init();
if (!irqs_disabled())
printk(KERN_CRIT "start_kernel(): bug: interrupts were "
"enabled early\n");
early_boot_irqs_disabled = false;
local_irq_enable();
/* Interrupts are enabled now so all GFP allocations are safe. */
gfp_allowed_mask = __GFP_BITS_MASK;
kmem_cache_init_late();
/*
* HACK ALERT! This is early. We're enabling the console before
* we've done PCI setups etc, and console_init() must be aware of
* this. But we do want output early, in case something goes wrong.
*/
console_init();
if (panic_later)
panic(panic_later, panic_param);
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
lockdep_info();
/*
* Need to run this when irqs are enabled, because it wants
* to self-test [hard/soft]-irqs on/off lock inversion bugs
* too:
*/
locking_selftest();
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
"disabling it.\n",
page_to_pfn(virt_to_page((void *)initrd_start)),
min_low_pfn);
initrd_start = 0;
}
#endif
page_cgroup_init();
enable_debug_pagealloc();
kmemleak_init();
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
debug_objects_mem_init();
[PATCH] node local per-cpu-pages This patch modifies the way pagesets in struct zone are managed. Each zone has a per-cpu array of pagesets. So any particular CPU has some memory in each zone structure which belongs to itself. Even if that CPU is not local to that zone. So the patch relocates the pagesets for each cpu to the node that is nearest to the cpu instead of allocating the pagesets in the (possibly remote) target zone. This means that the operations to manage pages on remote zone can be done with information available locally. We play a macro trick so that non-NUMA pmachines avoid the additional pointer chase on the page allocator fastpath. AIM7 benchmark on a 32 CPU SGI Altix w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.68 100 484.6769 12.01 1.97 Fri Mar 25 11:01:42 2005 100 27140.46 89 271.4046 21.44 148.71 Fri Mar 25 11:02:04 2005 200 30792.02 82 153.9601 37.80 296.72 Fri Mar 25 11:02:42 2005 300 32209.27 81 107.3642 54.21 451.34 Fri Mar 25 11:03:37 2005 400 34962.83 78 87.4071 66.59 588.97 Fri Mar 25 11:04:44 2005 500 31676.92 75 63.3538 91.87 742.71 Fri Mar 25 11:06:16 2005 600 36032.69 73 60.0545 96.91 885.44 Fri Mar 25 11:07:54 2005 700 35540.43 77 50.7720 114.63 1024.28 Fri Mar 25 11:09:49 2005 800 33906.70 74 42.3834 137.32 1181.65 Fri Mar 25 11:12:06 2005 900 34120.67 73 37.9119 153.51 1325.26 Fri Mar 25 11:14:41 2005 1000 34802.37 74 34.8024 167.23 1465.26 Fri Mar 25 11:17:28 2005 with slab API changes and pageset patch: Tasks jobs/min jti jobs/min/task real cpu 1 485.00 100 485.0000 12.00 1.96 Fri Mar 25 11:46:18 2005 100 28000.96 89 280.0096 20.79 150.45 Fri Mar 25 11:46:39 2005 200 32285.80 79 161.4290 36.05 293.37 Fri Mar 25 11:47:16 2005 300 40424.15 84 134.7472 43.19 438.42 Fri Mar 25 11:47:59 2005 400 39155.01 79 97.8875 59.46 590.05 Fri Mar 25 11:48:59 2005 500 37881.25 82 75.7625 76.82 730.19 Fri Mar 25 11:50:16 2005 600 39083.14 78 65.1386 89.35 872.79 Fri Mar 25 11:51:46 2005 700 38627.83 77 55.1826 105.47 1022.46 Fri Mar 25 11:53:32 2005 800 39631.94 78 49.5399 117.48 1169.94 Fri Mar 25 11:55:30 2005 900 36903.70 79 41.0041 141.94 1310.78 Fri Mar 25 11:57:53 2005 1000 36201.23 77 36.2012 160.77 1458.31 Fri Mar 25 12:00:34 2005 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22 08:14:47 +08:00
setup_per_cpu_pageset();
numa_policy_init();
if (late_time_init)
late_time_init();
sched_clock_init();
calibrate_delay();
pidmap_init();
anon_vma_init();
#ifdef CONFIG_X86
if (efi_enabled)
efi_enter_virtual_mode();
#endif
thread_info_cache_init();
CRED: Inaugurate COW credentials Inaugurate copy-on-write credentials management. This uses RCU to manage the credentials pointer in the task_struct with respect to accesses by other tasks. A process may only modify its own credentials, and so does not need locking to access or modify its own credentials. A mutex (cred_replace_mutex) is added to the task_struct to control the effect of PTRACE_ATTACHED on credential calculations, particularly with respect to execve(). With this patch, the contents of an active credentials struct may not be changed directly; rather a new set of credentials must be prepared, modified and committed using something like the following sequence of events: struct cred *new = prepare_creds(); int ret = blah(new); if (ret < 0) { abort_creds(new); return ret; } return commit_creds(new); There are some exceptions to this rule: the keyrings pointed to by the active credentials may be instantiated - keyrings violate the COW rule as managing COW keyrings is tricky, given that it is possible for a task to directly alter the keys in a keyring in use by another task. To help enforce this, various pointers to sets of credentials, such as those in the task_struct, are declared const. The purpose of this is compile-time discouragement of altering credentials through those pointers. Once a set of credentials has been made public through one of these pointers, it may not be modified, except under special circumstances: (1) Its reference count may incremented and decremented. (2) The keyrings to which it points may be modified, but not replaced. The only safe way to modify anything else is to create a replacement and commit using the functions described in Documentation/credentials.txt (which will be added by a later patch). This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). This now prepares and commits credentials in various places in the security code rather than altering the current creds directly. (2) Temporary credential overrides. do_coredump() and sys_faccessat() now prepare their own credentials and temporarily override the ones currently on the acting thread, whilst preventing interference from other threads by holding cred_replace_mutex on the thread being dumped. This will be replaced in a future patch by something that hands down the credentials directly to the functions being called, rather than altering the task's objective credentials. (3) LSM interface. A number of functions have been changed, added or removed: (*) security_capset_check(), ->capset_check() (*) security_capset_set(), ->capset_set() Removed in favour of security_capset(). (*) security_capset(), ->capset() New. This is passed a pointer to the new creds, a pointer to the old creds and the proposed capability sets. It should fill in the new creds or return an error. All pointers, barring the pointer to the new creds, are now const. (*) security_bprm_apply_creds(), ->bprm_apply_creds() Changed; now returns a value, which will cause the process to be killed if it's an error. (*) security_task_alloc(), ->task_alloc_security() Removed in favour of security_prepare_creds(). (*) security_cred_free(), ->cred_free() New. Free security data attached to cred->security. (*) security_prepare_creds(), ->cred_prepare() New. Duplicate any security data attached to cred->security. (*) security_commit_creds(), ->cred_commit() New. Apply any security effects for the upcoming installation of new security by commit_creds(). (*) security_task_post_setuid(), ->task_post_setuid() Removed in favour of security_task_fix_setuid(). (*) security_task_fix_setuid(), ->task_fix_setuid() Fix up the proposed new credentials for setuid(). This is used by cap_set_fix_setuid() to implicitly adjust capabilities in line with setuid() changes. Changes are made to the new credentials, rather than the task itself as in security_task_post_setuid(). (*) security_task_reparent_to_init(), ->task_reparent_to_init() Removed. Instead the task being reparented to init is referred directly to init's credentials. NOTE! This results in the loss of some state: SELinux's osid no longer records the sid of the thread that forked it. (*) security_key_alloc(), ->key_alloc() (*) security_key_permission(), ->key_permission() Changed. These now take cred pointers rather than task pointers to refer to the security context. (4) sys_capset(). This has been simplified and uses less locking. The LSM functions it calls have been merged. (5) reparent_to_kthreadd(). This gives the current thread the same credentials as init by simply using commit_thread() to point that way. (6) __sigqueue_alloc() and switch_uid() __sigqueue_alloc() can't stop the target task from changing its creds beneath it, so this function gets a reference to the currently applicable user_struct which it then passes into the sigqueue struct it returns if successful. switch_uid() is now called from commit_creds(), and possibly should be folded into that. commit_creds() should take care of protecting __sigqueue_alloc(). (7) [sg]et[ug]id() and co and [sg]et_current_groups. The set functions now all use prepare_creds(), commit_creds() and abort_creds() to build and check a new set of credentials before applying it. security_task_set[ug]id() is called inside the prepared section. This guarantees that nothing else will affect the creds until we've finished. The calling of set_dumpable() has been moved into commit_creds(). Much of the functionality of set_user() has been moved into commit_creds(). The get functions all simply access the data directly. (8) security_task_prctl() and cap_task_prctl(). security_task_prctl() has been modified to return -ENOSYS if it doesn't want to handle a function, or otherwise return the return value directly rather than through an argument. Additionally, cap_task_prctl() now prepares a new set of credentials, even if it doesn't end up using it. (9) Keyrings. A number of changes have been made to the keyrings code: (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have all been dropped and built in to the credentials functions directly. They may want separating out again later. (b) key_alloc() and search_process_keyrings() now take a cred pointer rather than a task pointer to specify the security context. (c) copy_creds() gives a new thread within the same thread group a new thread keyring if its parent had one, otherwise it discards the thread keyring. (d) The authorisation key now points directly to the credentials to extend the search into rather pointing to the task that carries them. (e) Installing thread, process or session keyrings causes a new set of credentials to be created, even though it's not strictly necessary for process or session keyrings (they're shared). (10) Usermode helper. The usermode helper code now carries a cred struct pointer in its subprocess_info struct instead of a new session keyring pointer. This set of credentials is derived from init_cred and installed on the new process after it has been cloned. call_usermodehelper_setup() allocates the new credentials and call_usermodehelper_freeinfo() discards them if they haven't been used. A special cred function (prepare_usermodeinfo_creds()) is provided specifically for call_usermodehelper_setup() to call. call_usermodehelper_setkeys() adjusts the credentials to sport the supplied keyring as the new session keyring. (11) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) selinux_setprocattr() no longer does its check for whether the current ptracer can access processes with the new SID inside the lock that covers getting the ptracer's SID. Whilst this lock ensures that the check is done with the ptracer pinned, the result is only valid until the lock is released, so there's no point doing it inside the lock. (12) is_single_threaded(). This function has been extracted from selinux_setprocattr() and put into a file of its own in the lib/ directory as join_session_keyring() now wants to use it too. The code in SELinux just checked to see whether a task shared mm_structs with other tasks (CLONE_VM), but that isn't good enough. We really want to know if they're part of the same thread group (CLONE_THREAD). (13) nfsd. The NFS server daemon now has to use the COW credentials to set the credentials it is going to use. It really needs to pass the credentials down to the functions it calls, but it can't do that until other patches in this series have been applied. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 07:39:23 +08:00
cred_init();
fork_init(totalram_pages);
proc_caches_init();
buffer_init();
key_init();
security_init();
dbg_late_init();
vfs_caches_init(totalram_pages);
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();
#ifdef CONFIG_PROC_FS
proc_root_init();
#endif
Task Control Groups: basic task cgroup framework Generic Process Control Groups -------------------------- There have recently been various proposals floating around for resource management/accounting and other task grouping subsystems in the kernel, including ResGroups, User BeanCounters, NSProxy cgroups, and others. These all need the basic abstraction of being able to group together multiple processes in an aggregate, in order to track/limit the resources permitted to those processes, or control other behaviour of the processes, and all implement this grouping in different ways. This patchset provides a framework for tracking and grouping processes into arbitrary "cgroups" and assigning arbitrary state to those groupings, in order to control the behaviour of the cgroup as an aggregate. The intention is that the various resource management and virtualization/cgroup efforts can also become task cgroup clients, with the result that: - the userspace APIs are (somewhat) normalised - it's easier to test e.g. the ResGroups CPU controller in conjunction with the BeanCounters memory controller, or use either of them as the resource-control portion of a virtual server system. - the additional kernel footprint of any of the competing resource management systems is substantially reduced, since it doesn't need to provide process grouping/containment, hence improving their chances of getting into the kernel This patch: Add the main task cgroups framework - the cgroup filesystem, and the basic structures for tracking membership and associating subsystem state objects to tasks. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
cgroup_init();
cpuset_init();
taskstats_init_early();
delayacct_init();
check_bugs();
acpi_early_init(); /* before LAPIC and SMP init */
sfi_init_late();
ftrace_init();
/* Do the rest non-__init'ed, we're now alive */
rest_init();
}
/* Call all constructor functions linked into the kernel. */
static void __init do_ctors(void)
{
#ifdef CONFIG_CONSTRUCTORS
ctor_fn_t *fn = (ctor_fn_t *) __ctors_start;
for (; fn < (ctor_fn_t *) __ctors_end; fn++)
(*fn)();
#endif
}
int initcall_debug;
core_param(initcall_debug, initcall_debug, bool, 0644);
tracing: Fix too large stack usage in do_one_initcall() One of my testboxes triggered this nasty stack overflow crash during SCSI probing: [ 5.874004] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 5.875004] device: 'sda': device_add [ 5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c [ 5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110 [ 5.878004] *pde = 00000000 [ 5.878004] Thread overran stack, or stack corrupted [ 5.878004] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 5.878004] last sysfs file: [ 5.878004] [ 5.878004] Pid: 1, comm: swapper Not tainted (2.6.31-rc6-tip-01272-g9919e28-dirty #5685) [ 5.878004] EIP: 0060:[<b1008321>] EFLAGS: 00010083 CPU: 0 [ 5.878004] EIP is at print_context_stack+0x81/0x110 [ 5.878004] EAX: cf8a3000 EBX: cf8a3fe4 ECX: 00000049 EDX: 00000000 [ 5.878004] ESI: b1cfce84 EDI: 00000000 EBP: cf8a3018 ESP: cf8a2ff4 [ 5.878004] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 [ 5.878004] Process swapper (pid: 1, ti=cf8a2000 task=cf8a8000 task.ti=cf8a3000) [ 5.878004] Stack: [ 5.878004] b1004867 fffff000 cf8a3ffc [ 5.878004] Call Trace: [ 5.878004] [<b1004867>] ? kernel_thread_helper+0x7/0x10 [ 5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c [ 5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110 [ 5.878004] *pde = 00000000 [ 5.878004] Thread overran stack, or stack corrupted [ 5.878004] Oops: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC The oops did not reveal any more details about the real stack that we have and the system got into an infinite loop of recursive pagefaults. So i booted with CONFIG_STACK_TRACER=y and the 'stacktrace' boot parameter. The box did not crash (timings/conditions probably changed a tiny bit to trigger the catastrophic crash), but the /debug/tracing/stack_trace file was rather revealing: Depth Size Location (72 entries) ----- ---- -------- 0) 3704 52 __change_page_attr+0xb8/0x290 1) 3652 24 __change_page_attr_set_clr+0x43/0x90 2) 3628 60 kernel_map_pages+0x108/0x120 3) 3568 40 prep_new_page+0x7d/0x130 4) 3528 84 get_page_from_freelist+0x106/0x420 5) 3444 116 __alloc_pages_nodemask+0xd7/0x550 6) 3328 36 allocate_slab+0xb1/0x100 7) 3292 36 new_slab+0x1c/0x160 8) 3256 36 __slab_alloc+0x133/0x2b0 9) 3220 4 kmem_cache_alloc+0x1bb/0x1d0 10) 3216 108 create_object+0x28/0x250 11) 3108 40 kmemleak_alloc+0x81/0xc0 12) 3068 24 kmem_cache_alloc+0x162/0x1d0 13) 3044 52 scsi_pool_alloc_command+0x29/0x70 14) 2992 20 scsi_host_alloc_command+0x22/0x70 15) 2972 24 __scsi_get_command+0x1b/0x90 16) 2948 28 scsi_get_command+0x35/0x90 17) 2920 24 scsi_setup_blk_pc_cmnd+0xd4/0x100 18) 2896 128 sd_prep_fn+0x332/0xa70 19) 2768 36 blk_peek_request+0xe7/0x1d0 20) 2732 56 scsi_request_fn+0x54/0x520 21) 2676 12 __generic_unplug_device+0x2b/0x40 22) 2664 24 blk_execute_rq_nowait+0x59/0x80 23) 2640 172 blk_execute_rq+0x6b/0xb0 24) 2468 32 scsi_execute+0xe0/0x140 25) 2436 64 scsi_execute_req+0x152/0x160 26) 2372 60 scsi_vpd_inquiry+0x6c/0x90 27) 2312 44 scsi_get_vpd_page+0x112/0x160 28) 2268 52 sd_revalidate_disk+0x1df/0x320 29) 2216 92 rescan_partitions+0x98/0x330 30) 2124 52 __blkdev_get+0x309/0x350 31) 2072 8 blkdev_get+0xf/0x20 32) 2064 44 register_disk+0xff/0x120 33) 2020 36 add_disk+0x6e/0xb0 34) 1984 44 sd_probe_async+0xfb/0x1d0 35) 1940 44 __async_schedule+0xf4/0x1b0 36) 1896 8 async_schedule+0x12/0x20 37) 1888 60 sd_probe+0x305/0x360 38) 1828 44 really_probe+0x63/0x170 39) 1784 36 driver_probe_device+0x5d/0x60 40) 1748 16 __device_attach+0x49/0x50 41) 1732 32 bus_for_each_drv+0x5b/0x80 42) 1700 24 device_attach+0x6b/0x70 43) 1676 16 bus_attach_device+0x47/0x60 44) 1660 76 device_add+0x33d/0x400 45) 1584 52 scsi_sysfs_add_sdev+0x6a/0x2c0 46) 1532 108 scsi_add_lun+0x44b/0x460 47) 1424 116 scsi_probe_and_add_lun+0x182/0x4e0 48) 1308 36 __scsi_add_device+0xd9/0xe0 49) 1272 44 ata_scsi_scan_host+0x10b/0x190 50) 1228 24 async_port_probe+0x96/0xd0 51) 1204 44 __async_schedule+0xf4/0x1b0 52) 1160 8 async_schedule+0x12/0x20 53) 1152 48 ata_host_register+0x171/0x1d0 54) 1104 60 ata_pci_sff_activate_host+0xf3/0x230 55) 1044 44 ata_pci_sff_init_one+0xea/0x100 56) 1000 48 amd_init_one+0xb2/0x190 57) 952 8 local_pci_probe+0x13/0x20 58) 944 32 pci_device_probe+0x68/0x90 59) 912 44 really_probe+0x63/0x170 60) 868 36 driver_probe_device+0x5d/0x60 61) 832 20 __driver_attach+0x89/0xa0 62) 812 32 bus_for_each_dev+0x5b/0x80 63) 780 12 driver_attach+0x1e/0x20 64) 768 72 bus_add_driver+0x14b/0x2d0 65) 696 36 driver_register+0x6e/0x150 66) 660 20 __pci_register_driver+0x53/0xc0 67) 640 8 amd_init+0x14/0x16 68) 632 572 do_one_initcall+0x2b/0x1d0 69) 60 12 do_basic_setup+0x56/0x6a 70) 48 20 kernel_init+0x84/0xce 71) 28 28 kernel_thread_helper+0x7/0x10 There's a lot of fat functions on that stack trace, but the largest of all is do_one_initcall(). This is due to the boot trace entry variables being on the stack. Fixing this is relatively easy, initcalls are fundamentally serialized, so we can move the local variables to file scope. Note that this large stack footprint was present for a couple of months already - what pushed my system over the edge was the addition of kmemleak to the call-chain: 6) 3328 36 allocate_slab+0xb1/0x100 7) 3292 36 new_slab+0x1c/0x160 8) 3256 36 __slab_alloc+0x133/0x2b0 9) 3220 4 kmem_cache_alloc+0x1bb/0x1d0 10) 3216 108 create_object+0x28/0x250 11) 3108 40 kmemleak_alloc+0x81/0xc0 12) 3068 24 kmem_cache_alloc+0x162/0x1d0 13) 3044 52 scsi_pool_alloc_command+0x29/0x70 This pushes the total to ~3800 bytes, only a tiny bit more was needed to corrupt the on-kernel-stack thread_info. The fix reduces the stack footprint from 572 bytes to 28 bytes. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-21 18:53:36 +08:00
static char msgbuf[64];
static int __init_or_module do_one_initcall_debug(initcall_t fn)
{
tracing/fastboot: Use the ring-buffer timestamp for initcall entries Impact: Split the boot tracer entries in two parts: call and return Now that we are using the sched tracer from the boot tracer, we want to use the same timestamp than the ring-buffer to have consistent time captures between sched events and initcall events. So we get rid of the old time capture by the boot tracer and split the initcall events in two parts: call and return. This way we have the ring buffer timestamp of both. An example trace: [ 27.904149584] calling net_ns_init+0x0/0x1c0 @ 1 [ 27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs [ 27.904575926] calling reboot_init+0x0/0x20 @ 1 [ 27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs [ 27.904800228] calling sysctl_init+0x0/0x30 @ 1 [ 27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs [ 27.905287211] calling ksysfs_init+0x0/0xb0 @ 1 ##### CPU 0 buffer started #### init-1 [000] 27.905395: 1:120:R + [001] 11:115:S ##### CPU 1 buffer started #### <idle>-0 [001] 27.905425: 0:140:R ==> [001] 11:115:R init-1 [000] 27.905426: 1:120:D ==> [000] 0:140:R <idle>-0 [000] 27.905431: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905451: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905456: 4:115:S ==> [000] 0:140:R udevd-11 [001] 27.905458: 11:115:R + [001] 14:115:R <idle>-0 [000] 27.905459: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905462: 0:140:R ==> [000] 4:115:R udevd-11 [001] 27.905462: 11:115:R ==> [001] 14:115:R ksoftirqd/0-4 [000] 27.905467: 4:115:S ==> [000] 0:140:R <idle>-0 [000] 27.905470: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905473: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905476: 4:115:S ==> [000] 0:140:R <idle>-0 [000] 27.905479: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.905482: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905486: 4:115:S ==> [000] 0:140:R udevd-14 [001] 27.905499: 14:120:X ==> [001] 11:115:R udevd-11 [001] 27.905506: 11:115:R + [000] 1:120:D <idle>-0 [000] 27.905515: 0:140:R ==> [000] 1:120:R udevd-11 [001] 27.905517: 11:115:S ==> [001] 0:140:R [ 27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs [ 27.905705736] calling init_jiffies_clocksource+0x0/0x10 @ 1 [ 27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs [ 27.906769814] calling pm_init+0x0/0x30 @ 1 [ 27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs [ 27.906997803] calling pm_disk_init+0x0/0x20 @ 1 [ 27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs [ 27.907222556] calling swsusp_header_init+0x0/0x30 @ 1 [ 27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs [ 27.907439620] calling stop_machine_init+0x0/0x50 @ 1 init-1 [000] 27.907485: 1:120:R + [000] 2:115:S init-1 [000] 27.907490: 1:120:D ==> [000] 2:115:R kthreadd-2 [000] 27.907507: 2:115:R + [001] 15:115:R <idle>-0 [001] 27.907517: 0:140:R ==> [001] 15:115:R kthreadd-2 [000] 27.907517: 2:115:D ==> [000] 0:140:R <idle>-0 [000] 27.907521: 0:140:R + [000] 4:115:S <idle>-0 [000] 27.907524: 0:140:R ==> [000] 4:115:R udevd-15 [001] 27.907527: 15:115:D + [000] 2:115:D ksoftirqd/0-4 [000] 27.907537: 4:115:S ==> [000] 2:115:R udevd-15 [001] 27.907537: 15:115:D ==> [001] 0:140:R kthreadd-2 [000] 27.907546: 2:115:R + [000] 1:120:D kthreadd-2 [000] 27.907550: 2:115:S ==> [000] 1:120:R init-1 [000] 27.907584: 1:120:R + [000] 15: 0:D init-1 [000] 27.907589: 1:120:R + [000] 2:115:S init-1 [000] 27.907593: 1:120:D ==> [000] 15: 0:R udevd-15 [000] 27.907601: 15: 0:S ==> [000] 2:115:R ##### CPU 0 buffer started #### kthreadd-2 [000] 27.907616: 2:115:R + [001] 16:115:R ##### CPU 1 buffer started #### <idle>-0 [001] 27.907620: 0:140:R ==> [001] 16:115:R kthreadd-2 [000] 27.907621: 2:115:D ==> [000] 0:140:R udevd-16 [001] 27.907625: 16:115:D + [000] 2:115:D <idle>-0 [000] 27.907628: 0:140:R + [000] 4:115:S udevd-16 [001] 27.907629: 16:115:D ==> [001] 0:140:R <idle>-0 [000] 27.907631: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.907636: 4:115:S ==> [000] 2:115:R kthreadd-2 [000] 27.907644: 2:115:R + [000] 1:120:D kthreadd-2 [000] 27.907647: 2:115:S ==> [000] 1:120:R init-1 [000] 27.907657: 1:120:R + [001] 16: 0:D <idle>-0 [001] 27.907666: 0:140:R ==> [001] 16: 0:R [ 27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs [ 27.907850704] calling filelock_init+0x0/0x30 @ 1 [ 27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs [ 27.908071327] calling init_script_binfmt+0x0/0x10 @ 1 [ 27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs [ 27.908309461] calling init_elf_binfmt+0x0/0x10 @ 1 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-12 06:24:42 +08:00
ktime_t calltime, delta, rettime;
unsigned long long duration;
int ret;
printk(KERN_DEBUG "calling %pF @ %i\n", fn, task_pid_nr(current));
calltime = ktime_get();
ret = fn();
rettime = ktime_get();
delta = ktime_sub(rettime, calltime);
duration = (unsigned long long) ktime_to_ns(delta) >> 10;
printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n", fn,
ret, duration);
return ret;
}
int __init_or_module do_one_initcall(initcall_t fn)
{
int count = preempt_count();
int ret;
if (initcall_debug)
ret = do_one_initcall_debug(fn);
else
ret = fn();
msgbuf[0] = 0;
if (ret && ret != -ENODEV && initcall_debug)
sprintf(msgbuf, "error code %d ", ret);
if (preempt_count() != count) {
strlcat(msgbuf, "preemption imbalance ", sizeof(msgbuf));
preempt_count() = count;
}
if (irqs_disabled()) {
strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
local_irq_enable();
}
if (msgbuf[0]) {
printk("initcall %pF returned with %s\n", fn, msgbuf);
}
return ret;
}
extern initcall_t __initcall_start[], __initcall_end[], __early_initcall_end[];
static void __init do_initcalls(void)
{
initcall_t *fn;
for (fn = __early_initcall_end; fn < __initcall_end; fn++)
do_one_initcall(*fn);
}
/*
* Ok, the machine is now initialized. None of the devices
* have been touched yet, but the CPU subsystem is up and
* running, and memory and process management works.
*
* Now we can finally start doing some real work..
*/
static void __init do_basic_setup(void)
{
cpuset_init_smp();
usermodehelper_init();
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev Devtmpfs lets the kernel create a tmpfs instance called devtmpfs very early at kernel initialization, before any driver-core device is registered. Every device with a major/minor will provide a device node in devtmpfs. Devtmpfs can be changed and altered by userspace at any time, and in any way needed - just like today's udev-mounted tmpfs. Unmodified udev versions will run just fine on top of it, and will recognize an already existing kernel-created device node and use it. The default node permissions are root:root 0600. Proper permissions and user/group ownership, meaningful symlinks, all other policy still needs to be applied by userspace. If a node is created by devtmps, devtmpfs will remove the device node when the device goes away. If the device node was created by userspace, or the devtmpfs created node was replaced by userspace, it will no longer be removed by devtmpfs. If it is requested to auto-mount it, it makes init=/bin/sh work without any further userspace support. /dev will be fully populated and dynamic, and always reflect the current device state of the kernel. With the commonly used dynamic device numbers, it solves the problem where static devices nodes may point to the wrong devices. It is intended to make the initial bootup logic simpler and more robust, by de-coupling the creation of the inital environment, to reliably run userspace processes, from a complex userspace bootstrap logic to provide a working /dev. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Tested-By: Harald Hoyer <harald@redhat.com> Tested-By: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-30 21:23:42 +08:00
init_tmpfs();
driver_init();
init_irq_proc();
do_ctors();
do_initcalls();
}
static void __init do_pre_smp_initcalls(void)
{
initcall_t *fn;
for (fn = __initcall_start; fn < __early_initcall_end; fn++)
do_one_initcall(*fn);
}
static void run_init_process(const char *init_filename)
{
argv_init[0] = init_filename;
[PATCH] introduce kernel_execve The use of execve() in the kernel is dubious, since it relies on the __KERNEL_SYSCALLS__ mechanism that stores the result in a global errno variable. As a first step of getting rid of this, change all users to a global kernel_execve function that returns a proper error code. This function is a terrible hack, and a later patch removes it again after the kernel syscalls are gone. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Hirokazu Takata <takata.hirokazu@renesas.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 17:18:26 +08:00
kernel_execve(init_filename, argv_init, envp_init);
}
/* This is a non __init function. Force it to be noinline otherwise gcc
* makes it inline to init() and it becomes part of init.text section
*/
static noinline int init_post(void)
{
/* need to finish all async __init code before freeing the memory */
async_synchronize_full();
free_initmem();
mark_rodata_ro();
system_state = SYSTEM_RUNNING;
numa_default_policy();
current->signal->flags |= SIGNAL_UNKILLABLE;
if (ramdisk_execute_command) {
run_init_process(ramdisk_execute_command);
printk(KERN_WARNING "Failed to execute %s\n",
ramdisk_execute_command);
}
/*
* We try each of these until one succeeds.
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine.
*/
if (execute_command) {
run_init_process(execute_command);
printk(KERN_WARNING "Failed to execute %s. Attempting "
"defaults...\n", execute_command);
}
run_init_process("/sbin/init");
run_init_process("/etc/init");
run_init_process("/bin/init");
run_init_process("/bin/sh");
panic("No init found. Try passing init= option to kernel. "
"See Linux Documentation/init.txt for guidance.");
}
static int __init kernel_init(void * unused)
{
/*
* Wait until kthreadd is all set-up.
*/
wait_for_completion(&kthreadd_done);
cpuset,mm: update tasks' mems_allowed in time Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
/*
* init can allocate pages on any node
*/
set_mems_allowed(node_states[N_HIGH_MEMORY]);
/*
* init can run on any cpu.
*/
set_cpus_allowed_ptr(current, cpu_all_mask);
cad_pid = task_pid(current);
smp_prepare_cpus(setup_max_cpus);
do_pre_smp_initcalls();
lockup_detector_init();
smp_init();
sched_init_smp();
do_basic_setup();
/* Open the /dev/console on the rootfs, this should never fail */
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
printk(KERN_WARNING "Warning: unable to open an initial console.\n");
(void) sys_dup(0);
(void) sys_dup(0);
/*
* check if there is an early userspace init. If yes, let it do all
* the work
*/
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
/*
* Ok, we have completed the initial bootup, and
* we're essentially up and running. Get rid of the
* initmem segments and start the user-mode stuff..
*/
init_post();
return 0;
}