OpenCloudOS-Kernel/init/main.c

1573 lines
38 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/init/main.c
*
* Copyright (C) 1991, 1992 Linus Torvalds
*
* GK 2/5/95 - Changed to support mounting root fs via NFS
* Added initrd & change_root: Werner Almesberger & Hans Lermen, Feb '96
* Moan early if gcc is old, avoiding bogus kernels - Paul Gortmaker, May '96
* Simplified starting of init: Michael A. Griffith <grif@acm.org>
*/
#define DEBUG /* Enable initcall_debug */
#include <linux/types.h>
#include <linux/extable.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/binfmts.h>
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/stackprotector.h>
#include <linux/string.h>
#include <linux/ctype.h>
#include <linux/delay.h>
#include <linux/ioport.h>
#include <linux/init.h>
#include <linux/initrd.h>
mm: remove include/linux/bootmem.h Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:09:49 +08:00
#include <linux/memblock.h>
#include <linux/acpi.h>
#include <linux/bootconfig.h>
#include <linux/console.h>
#include <linux/nmi.h>
#include <linux/percpu.h>
#include <linux/kmod.h>
#include <linux/kprobes.h>
init: kmsan: call KMSAN initialization routines kmsan_init_shadow() scans the mappings created at boot time and creates metadata pages for those mappings. When the memblock allocator returns pages to pagealloc, we reserve 2/3 of those pages and use them as metadata for the remaining 1/3. Once KMSAN starts, every page allocated by pagealloc has its associated shadow and origin pages. kmsan_initialize() initializes the bookkeeping for init_task and enables KMSAN. Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-15 23:03:51 +08:00
#include <linux/kmsan.h>
mm: rewrite vmap layer Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing). As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on an 4 core, 2 socket opteron. Results are in nanoseconds per map+touch+unmap. threads vanilla vmap rewrite 1 14700 2900 2 33600 3000 4 49500 2800 8 70631 2900 So with a 8 cores, the rewritten version is already 25x faster. In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles. Before: 1352059 total 0.1401 798784 _write_lock 8320.6667 <- vmlist_lock 529313 default_idle 1181.5022 15242 smp_call_function 15.8771 <- vmap tlb flushing 2472 __get_vm_area_node 1.9312 <- vmap 1762 remove_vm_area 4.5885 <- vunmap 316 map_vm_area 0.2297 <- vmap 312 kfree 0.1950 300 _spin_lock 3.1250 252 sn_send_IPI_phys 0.4375 <- tlb flushing 238 vmap 0.8264 <- vmap 216 find_lock_page 0.5192 196 find_next_bit 0.3603 136 sn2_send_IPI 0.2024 130 pio_phys_write_mmr 2.0312 118 unmap_kernel_range 0.1229 After: 78406 total 0.0081 40053 default_idle 89.4040 33576 ia64_spinlock_contention 349.7500 1650 _spin_lock 17.1875 319 __reg_op 0.5538 281 _atomic_dec_and_lock 1.0977 153 mutex_unlock 1.5938 123 iget_locked 0.1671 117 xfs_dir_lookup 0.1662 117 dput 0.1406 114 xfs_iget_core 0.0268 92 xfs_da_hashname 0.1917 75 d_alloc 0.0670 68 vmap_page_range 0.0462 <- vmap 58 kmem_cache_alloc 0.0604 57 memset 0.0540 52 rb_next 0.1625 50 __copy_user 0.0208 49 bitmap_find_free_region 0.2188 <- vmap 46 ia64_sn_udelay 0.1106 45 find_inode_fast 0.1406 42 memcmp 0.2188 42 finish_task_switch 0.1094 42 __d_lookup 0.0410 40 radix_tree_lookup_slot 0.1250 37 _spin_unlock_irqrestore 0.3854 36 xfs_bmapi 0.0050 36 kmem_cache_free 0.0256 35 xfs_vn_getattr 0.0322 34 radix_tree_lookup 0.1062 33 __link_path_walk 0.0035 31 xfs_da_do_buf 0.0091 30 _xfs_buf_find 0.0204 28 find_get_page 0.0875 27 xfs_iread 0.0241 27 __strncpy_from_user 0.2812 26 _xfs_buf_initialize 0.0406 24 _xfs_buf_lookup_pages 0.0179 24 vunmap_page_range 0.0250 <- vunmap 23 find_lock_page 0.0799 22 vm_map_ram 0.0087 <- vmap 20 kfree 0.0125 19 put_page 0.0330 18 __kmalloc 0.0176 17 xfs_da_node_lookup_int 0.0086 17 _read_lock 0.0885 17 page_waitqueue 0.0664 vmap has gone from being the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time. [akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:27:03 +08:00
#include <linux/vmalloc.h>
#include <linux/kernel_stat.h>
#include <linux/start_kernel.h>
#include <linux/security.h>
#include <linux/smp.h>
#include <linux/profile.h>
mm: add Kernel Electric-Fence infrastructure Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7. This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. This series enables KFENCE for the x86 and arm64 architectures, and adds KFENCE hooks to the SLAB and SLUB allocators. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, the next allocation through the main allocator (SLAB or SLUB) returns a guarded allocation from the KFENCE object pool. At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. The KFENCE memory pool is of fixed size, and if the pool is exhausted no further KFENCE allocations occur. The default config is conservative with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB pages). We have verified by running synthetic benchmarks (sysbench I/O, hackbench) and production server-workload benchmarks that a kernel with KFENCE (using sample intervals 100-500ms) is performance-neutral compared to a non-KFENCE baseline kernel. KFENCE is inspired by GWP-ASan [1], a userspace tool with similar properties. The name "KFENCE" is a homage to the Electric Fence Malloc Debugger [2]. For more details, see Documentation/dev-tools/kfence.rst added in the series -- also viewable here: https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst [1] http://llvm.org/docs/GwpAsan.html [2] https://linux.die.net/man/3/efence This patch (of 9): This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. To detect out-of-bounds writes to memory within the object's page itself, KFENCE also uses pattern-based redzones. The following figure illustrates the page layout: ---+-----------+-----------+-----------+-----------+-----------+--- | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx | | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx | | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x | | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx | | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx | | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx | ---+-----------+-----------+-----------+-----------+-----------+--- Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, a guarded allocation from the KFENCE object pool is returned to the main allocator (SLAB or SLUB). At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. To date, we have verified by running synthetic benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE is performance-neutral compared to the non-KFENCE baseline. For more details, see Documentation/dev-tools/kfence.rst (added later in the series). [elver@google.com: fix parameter description for kfence_object_start()] Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com [elver@google.com: avoid stalling work queue task without allocations] Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com [elver@google.com: fix potential deadlock due to wake_up()] Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com [elver@google.com: add option to use KFENCE without static keys] Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com [elver@google.com: add missing copyright and description headers] Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: SeongJae Park <sjpark@amazon.de> Co-developed-by: Marco Elver <elver@google.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Joern Engel <joern@purestorage.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:18:53 +08:00
#include <linux/kfence.h>
#include <linux/rcupdate.h>
#include <linux/srcu.h>
#include <linux/moduleparam.h>
#include <linux/kallsyms.h>
buildid: stash away kernels build ID on init Parse the kernel's build ID at initialization so that other code can print a hex format string representation of the running kernel's build ID. This will be used in the kdump and dump_stack code so that developers can easily locate the vmlinux debug symbols for a crash/stacktrace. [swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()] Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Evan Green <evgreen@chromium.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sasha Levin <sashal@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 09:09:13 +08:00
#include <linux/buildid.h>
#include <linux/writeback.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
Task Control Groups: basic task cgroup framework Generic Process Control Groups -------------------------- There have recently been various proposals floating around for resource management/accounting and other task grouping subsystems in the kernel, including ResGroups, User BeanCounters, NSProxy cgroups, and others. These all need the basic abstraction of being able to group together multiple processes in an aggregate, in order to track/limit the resources permitted to those processes, or control other behaviour of the processes, and all implement this grouping in different ways. This patchset provides a framework for tracking and grouping processes into arbitrary "cgroups" and assigning arbitrary state to those groupings, in order to control the behaviour of the cgroup as an aggregate. The intention is that the various resource management and virtualization/cgroup efforts can also become task cgroup clients, with the result that: - the userspace APIs are (somewhat) normalised - it's easier to test e.g. the ResGroups CPU controller in conjunction with the BeanCounters memory controller, or use either of them as the resource-control portion of a virtual server system. - the additional kernel footprint of any of the competing resource management systems is substantially reduced, since it doesn't need to provide process grouping/containment, hence improving their chances of getting into the kernel This patch: Add the main task cgroups framework - the cgroup filesystem, and the basic structures for tracking membership and associating subsystem state objects to tasks. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
#include <linux/cgroup.h>
#include <linux/efi.h>
#include <linux/tick.h>
#include <linux/sched/isolation.h>
#include <linux/interrupt.h>
#include <linux/taskstats_kern.h>
#include <linux/delayacct.h>
#include <linux/unistd.h>
#include <linux/utsname.h>
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
#include <linux/debug_locks.h>
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
#include <linux/debugobjects.h>
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
#include <linux/lockdep.h>
#include <linux/kmemleak.h>
#include <linux/padata.h>
#include <linux/pid_namespace.h>
#include <linux/device/driver.h>
kthread: don't depend on work queues Currently there is a circular reference between work queue initialization and kthread initialization. This prevents the kthread infrastructure from initializing until after work queues have been initialized. We want the properties of tasks created with kthread_create to be as close as possible to the init_task and to not be contaminated by user processes. The later we start our kthreadd that creates these tasks the harder it is to avoid contamination from user processes and the more of a mess we have to clean up because the defaults have changed on us. So this patch modifies the kthread support to not use work queues but to instead use a simple list of structures, and to have kthreadd start from init_task immediately after our kernel thread that execs /sbin/init. By being a true child of init_task we only have to change those process settings that we want to have different from init_task, such as our process name, the cpus that are allowed, blocking all signals and setting SIGCHLD to SIG_IGN so that all of our children are reaped automatically. By being a true child of init_task we also naturally get our ppid set to 0 and do not wind up as a child of PID == 1. Ensuring that tasks generated by kthread_create will not slow down the functioning of the wait family of functions. [akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 17:34:32 +08:00
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/sched/init.h>
#include <linux/signal.h>
#include <linux/idr.h>
#include <linux/kgdb.h>
#include <linux/ftrace.h>
#include <linux/async.h>
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev Devtmpfs lets the kernel create a tmpfs instance called devtmpfs very early at kernel initialization, before any driver-core device is registered. Every device with a major/minor will provide a device node in devtmpfs. Devtmpfs can be changed and altered by userspace at any time, and in any way needed - just like today's udev-mounted tmpfs. Unmodified udev versions will run just fine on top of it, and will recognize an already existing kernel-created device node and use it. The default node permissions are root:root 0600. Proper permissions and user/group ownership, meaningful symlinks, all other policy still needs to be applied by userspace. If a node is created by devtmps, devtmpfs will remove the device node when the device goes away. If the device node was created by userspace, or the devtmpfs created node was replaced by userspace, it will no longer be removed by devtmpfs. If it is requested to auto-mount it, it makes init=/bin/sh work without any further userspace support. /dev will be fully populated and dynamic, and always reflect the current device state of the kernel. With the commonly used dynamic device numbers, it solves the problem where static devices nodes may point to the wrong devices. It is intended to make the initial bootup logic simpler and more robust, by de-coupling the creation of the inital environment, to reliably run userspace processes, from a complex userspace bootstrap logic to provide a working /dev. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Tested-By: Harald Hoyer <harald@redhat.com> Tested-By: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-30 21:23:42 +08:00
#include <linux/shmem_fs.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/slab.h>
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/context_tracking.h>
#include <linux/random.h>
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
#include <linux/list.h>
#include <linux/integrity.h>
#include <linux/proc_ns.h>
#include <linux/io.h>
#include <linux/cache.h>
#include <linux/rodata_test.h>
#include <linux/jump_label.h>
#include <linux/kcsan.h>
#include <linux/init_syscalls.h>
#include <linux/stackdepot.h>
#include <linux/randomize_kstack.h>
#include <net/net_namespace.h>
#include <asm/io.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/cacheflush.h>
#define CREATE_TRACE_POINTS
#include <trace/events/initcall.h>
#include <kunit/test.h>
static int kernel_init(void *);
/*
* Debug helper: via this flag we know that we are in 'early bootup code'
* where only the boot processor is running with IRQ disabled. This means
* two things - IRQ must not be enabled before the flag is cleared and some
* operations which are not allowed with IRQ disabled are allowed while the
* flag is set.
*/
bool early_boot_irqs_disabled __read_mostly;
rcu: Teach RCU that idle task is not quiscent state at boot This patch fixes a bug located by Vegard Nossum with the aid of kmemcheck, updated based on review comments from Nick Piggin, Ingo Molnar, and Andrew Morton. And cleans up the variable-name and function-name language. ;-) The boot CPU runs in the context of its idle thread during boot-up. During this time, idle_cpu(0) will always return nonzero, which will fool Classic and Hierarchical RCU into deciding that a large chunk of the boot-up sequence is a big long quiescent state. This in turn causes RCU to prematurely end grace periods during this time. This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks() function to ignore the idle task as a quiescent state until the system has started up the scheduler in rest_init(), introducing a new non-API function rcu_idle_now_means_idle() to inform RCU of this transition. RCU maintains an internal rcu_idle_cpu_truthful variable to track this state, which is then used by rcu_check_callback() to determine if it should believe idle_cpu(). Because this patch has the effect of disallowing RCU grace periods during long stretches of the boot-up sequence, this patch also introduces Josh Triplett's UP-only optimization that makes synchronize_rcu() be a no-op if num_online_cpus() returns 1. This allows boot-time code that calls synchronize_rcu() to proceed normally. Note, however, that RCU callbacks registered by call_rcu() will likely queue up until later in the boot sequence. Although rcuclassic and rcutree can also use this same optimization after boot completes, rcupreempt must restrict its use of this optimization to the portion of the boot sequence before the scheduler starts up, given that an rcupreempt RCU read-side critical section may be preeempted. In addition, this patch takes Nick Piggin's suggestion to make the system_state global variable be __read_mostly. Changes since v4: o Changes the name of the introduced function and variable to be less emotional. ;-) Changes since v3: o WARN_ON(nr_context_switches() > 0) to verify that RCU switches out of boot-time mode before the first context switch, as suggested by Nick Piggin. Changes since v2: o Created rcu_blocking_is_gp() internal-to-RCU API that determines whether a call to synchronize_rcu() is itself a grace period. o The definition of rcu_blocking_is_gp() for rcuclassic and rcutree checks to see if but a single CPU is online. o The definition of rcu_blocking_is_gp() for rcupreempt checks to see both if but a single CPU is online and if the system is still in early boot. This allows rcupreempt to again work correctly if running on a single CPU after booting is complete. o Added check to rcupreempt's synchronize_sched() for there being but one online CPU. Tested all three variants both SMP and !SMP, booted fine, passed a short rcutorture test on both x86 and Power. Located-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 10:03:42 +08:00
enum system_states system_state __read_mostly;
EXPORT_SYMBOL(system_state);
/*
* Boot command-line arguments
*/
#define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
#define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
/* Default late time init is NULL. archs can override this later. */
void (*__initdata late_time_init)(void);
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
/* Untouched command line saved by arch-specific code. */
char __initdata boot_command_line[COMMAND_LINE_SIZE];
/* Untouched saved command line (eg. for /proc) */
proc: give /proc/cmdline size Most /proc files don't have length (in fstat sense). This leads to inefficiencies when reading such files with APIs commonly found in modern programming languages. They open file, then fstat descriptor, get st_size == 0 and either assume file is empty or start reading without knowing target size. cat(1) does OK because it uses large enough buffer by default. But naive programs copy-pasted from SO aren't: let mut f = std::fs::File::open("/proc/cmdline").unwrap(); let mut buf: Vec<u8> = Vec::new(); f.read_to_end(&mut buf).unwrap(); will result in openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3 statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address) statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32 read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32 read(3, "apper/fedora_localhost--live-roo"..., 64) = 64 read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116 read(3, "", 12) open/stat is OK, lseek looks silly but there are 3 unnecessary reads because Rust starts with 32 bytes per Vec<u8> and grows from there. In case of /proc/cmdline, the length is known precisely. Make variables readonly while I'm at it. P.S.: I tried to scp /proc/cpuinfo today and got empty file but this is separate story. Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09 02:21:54 +08:00
char *saved_command_line __ro_after_init;
unsigned int saved_command_line_len __ro_after_init;
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
/* Command line for parameter parsing */
static char *static_command_line;
/* Untouched extra command line */
static char *extra_command_line;
/* Extra init arguments */
static char *extra_init_args;
#ifdef CONFIG_BOOT_CONFIG
/* Is bootconfig on command line? */
static bool bootconfig_found;
static size_t initargs_offs;
#else
# define bootconfig_found false
# define initargs_offs 0
#endif
static char *execute_command;
static char *ramdisk_execute_command = "/init";
/*
* Used to generate warnings if static_key manipulation functions are used
* before jump_label_init is called.
*/
bool static_key_initialized __read_mostly;
EXPORT_SYMBOL_GPL(static_key_initialized);
/*
* If set, this is an indication to the drivers that reset the underlying
* device before going ahead with the initialization otherwise driver might
* rely on the BIOS and skip the reset operation.
*
* This is useful if kernel is booting in an unreliable environment.
* For ex. kdump situation where previous kernel has crashed, BIOS has been
* skipped and devices will be in unknown state.
*/
unsigned int reset_devices;
EXPORT_SYMBOL(reset_devices);
static int __init set_reset_devices(char *str)
{
reset_devices = 1;
return 1;
}
__setup("reset_devices", set_reset_devices);
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
const char *envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
static const char *panic_later, *panic_param;
static bool __init obsolete_checksetup(char *line)
{
const struct obs_kernel_param *p;
bool had_early_param = false;
p = __setup_start;
do {
int n = strlen(p->str);
if (parameqn(line, p->str, n)) {
if (p->early) {
/* Already done in parse_early_param?
* (Needs exact match on param part).
* Keep iterating, as we can have early
* params and __setups of same names 8( */
if (line[n] == '\0' || line[n] == '=')
had_early_param = true;
} else if (!p->setup_func) {
pr_warn("Parameter %s is obsolete, ignored\n",
p->str);
return true;
} else if (p->setup_func(line + n))
return true;
}
p++;
} while (p < __setup_end);
return had_early_param;
}
/*
* This should be approx 2 Bo*oMips to start (note initial shift), and will
* still work even if initially too large, it will just take slightly longer
*/
unsigned long loops_per_jiffy = (1<<12);
EXPORT_SYMBOL(loops_per_jiffy);
static int __init debug_kernel(char *str)
{
console_loglevel = CONSOLE_LOGLEVEL_DEBUG;
return 0;
}
static int __init quiet_kernel(char *str)
{
console_loglevel = CONSOLE_LOGLEVEL_QUIET;
return 0;
}
early_param("debug", debug_kernel);
early_param("quiet", quiet_kernel);
static int __init loglevel(char *str)
{
int newlevel;
/*
* Only update loglevel value when a correct setting was passed,
* to prevent blind crashes (when loglevel being set to 0) that
* are quite hard to debug
*/
if (get_option(&str, &newlevel)) {
console_loglevel = newlevel;
return 0;
}
return -EINVAL;
}
early_param("loglevel", loglevel);
#ifdef CONFIG_BLK_DEV_INITRD
static void * __init get_boot_config_from_initrd(size_t *_size)
{
u32 size, csum;
char *data;
u32 *hdr;
int i;
if (!initrd_end)
return NULL;
data = (char *)initrd_end - BOOTCONFIG_MAGIC_LEN;
/*
* Since Grub may align the size of initrd to 4, we must
* check the preceding 3 bytes as well.
*/
for (i = 0; i < 4; i++) {
if (!memcmp(data, BOOTCONFIG_MAGIC, BOOTCONFIG_MAGIC_LEN))
goto found;
data--;
}
return NULL;
found:
hdr = (u32 *)(data - 8);
size = le32_to_cpu(hdr[0]);
csum = le32_to_cpu(hdr[1]);
data = ((void *)hdr) - size;
if ((unsigned long)data < initrd_start) {
pr_err("bootconfig size %d is greater than initrd size %ld\n",
size, initrd_end - initrd_start);
return NULL;
}
if (xbc_calc_checksum(data, size) != csum) {
pr_err("bootconfig checksum failed\n");
return NULL;
}
/* Remove bootconfig from initramfs/initrd */
initrd_end = (unsigned long)data;
if (_size)
*_size = size;
return data;
}
#else
static void * __init get_boot_config_from_initrd(size_t *_size)
{
return NULL;
}
#endif
#ifdef CONFIG_BOOT_CONFIG
static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
static int __init xbc_snprint_cmdline(char *buf, size_t size,
struct xbc_node *root)
{
struct xbc_node *knode, *vnode;
char *end = buf + size;
const char *val;
int ret;
xbc_node_for_each_key_value(root, knode, val) {
ret = xbc_node_compose_key_after(root, knode,
xbc_namebuf, XBC_KEYLEN_MAX);
if (ret < 0)
return ret;
vnode = xbc_node_get_child(knode);
if (!vnode) {
ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
if (ret < 0)
return ret;
buf += ret;
continue;
}
xbc_array_for_each_value(vnode, val) {
ret = snprintf(buf, rest(buf, end), "%s=\"%s\" ",
xbc_namebuf, val);
if (ret < 0)
return ret;
buf += ret;
}
}
return buf - (end - size);
}
#undef rest
/* Make an extra command line under given key word */
static char * __init xbc_make_cmdline(const char *key)
{
struct xbc_node *root;
char *new_cmdline;
int ret, len = 0;
root = xbc_find_node(key);
if (!root)
return NULL;
/* Count required buffer size */
len = xbc_snprint_cmdline(NULL, 0, root);
if (len <= 0)
return NULL;
new_cmdline = memblock_alloc(len + 1, SMP_CACHE_BYTES);
if (!new_cmdline) {
pr_err("Failed to allocate memory for extra kernel cmdline.\n");
return NULL;
}
ret = xbc_snprint_cmdline(new_cmdline, len + 1, root);
if (ret < 0 || ret > len) {
pr_err("Failed to print extra kernel cmdline.\n");
memblock_free(new_cmdline, len + 1);
return NULL;
}
return new_cmdline;
}
static int __init bootconfig_params(char *param, char *val,
const char *unused, void *arg)
{
if (strcmp(param, "bootconfig") == 0) {
bootconfig_found = true;
}
return 0;
}
static int __init warn_bootconfig(char *str)
{
/* The 'bootconfig' has been handled by bootconfig_params(). */
return 0;
}
static void __init setup_boot_config(void)
{
static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
const char *msg, *data;
int pos, ret;
size_t size;
char *err;
/* Cut out the bootconfig data even if we have no bootconfig option */
data = get_boot_config_from_initrd(&size);
/* If there is no bootconfig in initrd, try embedded one. */
if (!data)
data = xbc_get_embedded_bootconfig(&size);
strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
bootconfig_params);
if (IS_ERR(err) || !(bootconfig_found || IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE)))
return;
/* parse_args() stops at the next param of '--' and returns an address */
if (err)
initargs_offs = err - tmp_cmdline;
if (!data) {
/* If user intended to use bootconfig, show an error level message */
if (bootconfig_found)
pr_err("'bootconfig' found on command line, but no bootconfig found\n");
else
pr_info("No bootconfig data provided, so skipping bootconfig");
return;
}
if (size >= XBC_DATA_MAX) {
pr_err("bootconfig size %ld greater than max size %d\n",
(long)size, XBC_DATA_MAX);
return;
}
ret = xbc_init(data, size, &msg, &pos);
if (ret < 0) {
if (pos < 0)
pr_err("Failed to init bootconfig: %s.\n", msg);
else
pr_err("Failed to parse bootconfig: %s at %d.\n",
msg, pos);
} else {
xbc_get_info(&ret, NULL);
pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
/* keys starting with "kernel." are passed via cmdline */
extra_command_line = xbc_make_cmdline("kernel");
/* Also, "init." keys are init arguments */
extra_init_args = xbc_make_cmdline("init");
}
return;
}
static void __init exit_boot_config(void)
{
xbc_exit();
}
#else /* !CONFIG_BOOT_CONFIG */
static void __init setup_boot_config(void)
{
/* Remove bootconfig data from initrd */
get_boot_config_from_initrd(NULL);
}
static int __init warn_bootconfig(char *str)
{
pr_warn("WARNING: 'bootconfig' found on the kernel command line but CONFIG_BOOT_CONFIG is not set.\n");
return 0;
}
#define exit_boot_config() do {} while (0)
#endif /* CONFIG_BOOT_CONFIG */
early_param("bootconfig", warn_bootconfig);
/* Change NUL term back to "=", to make "param" the whole string. */
static void __init repair_env_string(char *param, char *val)
{
if (val) {
/* param=val or param="val"? */
if (val == param+strlen(param)+1)
val[-1] = '=';
else if (val == param+strlen(param)+2) {
val[-2] = '=';
memmove(val-1, val, strlen(val)+1);
} else
BUG();
}
}
/* Anything after -- gets handed straight to init. */
module: add extra argument for parse_params() callback This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
static int __init set_init_arg(char *param, char *val,
const char *unused, void *arg)
{
unsigned int i;
if (panic_later)
return 0;
repair_env_string(param, val);
for (i = 0; argv_init[i]; i++) {
if (i == MAX_INIT_ARGS) {
panic_later = "init";
panic_param = param;
return 0;
}
}
argv_init[i] = param;
return 0;
}
/*
* Unknown boot options get handed to init, unless they look like
* unused parameters (modprobe will find them in /proc/cmdline).
*/
module: add extra argument for parse_params() callback This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
static int __init unknown_bootoption(char *param, char *val,
const char *unused, void *arg)
{
init/main.c: fix quoted value handling in unknown_bootoption Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2. unknown_bootoption passes unrecognized command line arguments to init as either environment variables or arguments. Some of the logic in the function is broken for quoted command line arguments. When an argument of the form param="value" is processed by parse_args and passed to unknown_bootoption, the command line has param\0"value\0 with val pointing to the beginning of value. The helper function repair_env_string is then used to restore the '=' character that was removed by parse_args, and strip the quotes off fully. This results in param=value\0\0 and val ends up pointing to the 'a' instead of the 'v' in value. This bug was introduced when repair_env_string was refactored into a separate function, and the decrement of val in repair_env_string became dead code. This causes two problems in unknown_bootoption in the two places where the val pointer is used as a substitute for the length of param: 1. An argument of the form param=".value" is misinterpreted as a potential module parameter, with the result that it will not be placed in init's environment. 2. An argument of the form param="value" is checked to see if param is an existing environment variable that should be overwritten, but the comparison is off-by-one and compares 'param=v' instead of 'param=' against the existing environment. So passing, for example, TERM="vt100" on the command line results in init being passed both TERM=linux and TERM=vt100 in its environment. Patch 1 adds logging for the arguments and environment passed to init and is independent of the rest: it can be dropped if this is unnecessarily verbose. Patch 2 removes repair_env_string from initcall parameter parsing in do_initcall_level, as that uses a separate copy of the command line now and the repairing is no longer necessary. Patch 3 fixes the bug in unknown_bootoption by recording the length of param explicitly instead of implying it from val-param. This patch (of 3): Commit a99cd1125189 ("init: fix bug where environment vars can't be passed via boot args") introduced two minor bugs in unknown_bootoption by factoring out the quoted value handling into a separate function. When value is quoted, repair_env_string will move the value up 1 byte to strip the quotes, so val in unknown_bootoption no longer points to the actual location of the value. The result is that an argument of the form param=".value" is mistakenly treated as a potential module parameter and is not placed in init's environment, and an argument of the form param="value" can result in a duplicate environment variable: eg TERM="vt100" on the command line will result in both TERM=linux and TERM=vt100 being placed into init's environment. Fix this by recording the length of the param before calling repair_env_string instead of relying on val. Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Krzysztof Mazur <krzysiek@podlesie.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:17:19 +08:00
size_t len = strlen(param);
repair_env_string(param, val);
/* Handle obsolete-style parameters */
if (obsolete_checksetup(param))
return 0;
/* Unused module parameter. */
init/main.c: fix quoted value handling in unknown_bootoption Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2. unknown_bootoption passes unrecognized command line arguments to init as either environment variables or arguments. Some of the logic in the function is broken for quoted command line arguments. When an argument of the form param="value" is processed by parse_args and passed to unknown_bootoption, the command line has param\0"value\0 with val pointing to the beginning of value. The helper function repair_env_string is then used to restore the '=' character that was removed by parse_args, and strip the quotes off fully. This results in param=value\0\0 and val ends up pointing to the 'a' instead of the 'v' in value. This bug was introduced when repair_env_string was refactored into a separate function, and the decrement of val in repair_env_string became dead code. This causes two problems in unknown_bootoption in the two places where the val pointer is used as a substitute for the length of param: 1. An argument of the form param=".value" is misinterpreted as a potential module parameter, with the result that it will not be placed in init's environment. 2. An argument of the form param="value" is checked to see if param is an existing environment variable that should be overwritten, but the comparison is off-by-one and compares 'param=v' instead of 'param=' against the existing environment. So passing, for example, TERM="vt100" on the command line results in init being passed both TERM=linux and TERM=vt100 in its environment. Patch 1 adds logging for the arguments and environment passed to init and is independent of the rest: it can be dropped if this is unnecessarily verbose. Patch 2 removes repair_env_string from initcall parameter parsing in do_initcall_level, as that uses a separate copy of the command line now and the repairing is no longer necessary. Patch 3 fixes the bug in unknown_bootoption by recording the length of param explicitly instead of implying it from val-param. This patch (of 3): Commit a99cd1125189 ("init: fix bug where environment vars can't be passed via boot args") introduced two minor bugs in unknown_bootoption by factoring out the quoted value handling into a separate function. When value is quoted, repair_env_string will move the value up 1 byte to strip the quotes, so val in unknown_bootoption no longer points to the actual location of the value. The result is that an argument of the form param=".value" is mistakenly treated as a potential module parameter and is not placed in init's environment, and an argument of the form param="value" can result in a duplicate environment variable: eg TERM="vt100" on the command line will result in both TERM=linux and TERM=vt100 being placed into init's environment. Fix this by recording the length of the param before calling repair_env_string instead of relying on val. Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Krzysztof Mazur <krzysiek@podlesie.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:17:19 +08:00
if (strnchr(param, len, '.'))
return 0;
if (panic_later)
return 0;
if (val) {
/* Environment option */
unsigned int i;
for (i = 0; envp_init[i]; i++) {
if (i == MAX_INIT_ENVS) {
panic_later = "env";
panic_param = param;
}
init/main.c: fix quoted value handling in unknown_bootoption Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2. unknown_bootoption passes unrecognized command line arguments to init as either environment variables or arguments. Some of the logic in the function is broken for quoted command line arguments. When an argument of the form param="value" is processed by parse_args and passed to unknown_bootoption, the command line has param\0"value\0 with val pointing to the beginning of value. The helper function repair_env_string is then used to restore the '=' character that was removed by parse_args, and strip the quotes off fully. This results in param=value\0\0 and val ends up pointing to the 'a' instead of the 'v' in value. This bug was introduced when repair_env_string was refactored into a separate function, and the decrement of val in repair_env_string became dead code. This causes two problems in unknown_bootoption in the two places where the val pointer is used as a substitute for the length of param: 1. An argument of the form param=".value" is misinterpreted as a potential module parameter, with the result that it will not be placed in init's environment. 2. An argument of the form param="value" is checked to see if param is an existing environment variable that should be overwritten, but the comparison is off-by-one and compares 'param=v' instead of 'param=' against the existing environment. So passing, for example, TERM="vt100" on the command line results in init being passed both TERM=linux and TERM=vt100 in its environment. Patch 1 adds logging for the arguments and environment passed to init and is independent of the rest: it can be dropped if this is unnecessarily verbose. Patch 2 removes repair_env_string from initcall parameter parsing in do_initcall_level, as that uses a separate copy of the command line now and the repairing is no longer necessary. Patch 3 fixes the bug in unknown_bootoption by recording the length of param explicitly instead of implying it from val-param. This patch (of 3): Commit a99cd1125189 ("init: fix bug where environment vars can't be passed via boot args") introduced two minor bugs in unknown_bootoption by factoring out the quoted value handling into a separate function. When value is quoted, repair_env_string will move the value up 1 byte to strip the quotes, so val in unknown_bootoption no longer points to the actual location of the value. The result is that an argument of the form param=".value" is mistakenly treated as a potential module parameter and is not placed in init's environment, and an argument of the form param="value" can result in a duplicate environment variable: eg TERM="vt100" on the command line will result in both TERM=linux and TERM=vt100 being placed into init's environment. Fix this by recording the length of the param before calling repair_env_string instead of relying on val. Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Krzysztof Mazur <krzysiek@podlesie.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:17:19 +08:00
if (!strncmp(param, envp_init[i], len+1))
break;
}
envp_init[i] = param;
} else {
/* Command line option */
unsigned int i;
for (i = 0; argv_init[i]; i++) {
if (i == MAX_INIT_ARGS) {
panic_later = "init";
panic_param = param;
}
}
argv_init[i] = param;
}
return 0;
}
static int __init init_setup(char *str)
{
unsigned int i;
execute_command = str;
/*
* In case LILO is going to boot us with default command line,
* it prepends "auto" before the whole cmdline which makes
* the shell think it should execute a script with such name.
* So we ignore all arguments entered _before_ init=... [MJ]
*/
for (i = 1; i < MAX_INIT_ARGS; i++)
argv_init[i] = NULL;
return 1;
}
__setup("init=", init_setup);
static int __init rdinit_setup(char *str)
{
unsigned int i;
ramdisk_execute_command = str;
/* See "auto" comment in init_setup */
for (i = 1; i < MAX_INIT_ARGS; i++)
argv_init[i] = NULL;
return 1;
}
__setup("rdinit=", rdinit_setup);
#ifndef CONFIG_SMP
static const unsigned int setup_max_cpus = NR_CPUS;
static inline void setup_nr_cpu_ids(void) { }
static inline void smp_prepare_cpus(unsigned int maxcpus) { }
#endif
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
/*
* We need to store the untouched command line for future reference.
* We also need to store the touched command line since the parameter
* parsing is performed in place, and we should allow a component to
* store reference of name/value for future reference.
*/
static void __init setup_command_line(char *command_line)
{
size_t len, xlen = 0, ilen = 0;
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
if (extra_command_line)
xlen = strlen(extra_command_line);
if (extra_init_args)
ilen = strlen(extra_init_args) + 4; /* for " -- " */
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
len = xlen + strlen(boot_command_line) + 1;
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
saved_command_line = memblock_alloc(len + ilen, SMP_CACHE_BYTES);
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
if (!saved_command_line)
panic("%s: Failed to allocate %zu bytes\n", __func__, len + ilen);
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
static_command_line = memblock_alloc(len, SMP_CACHE_BYTES);
if (!static_command_line)
panic("%s: Failed to allocate %zu bytes\n", __func__, len);
if (xlen) {
/*
* We have to put extra_command_line before boot command
* lines because there could be dashes (separator of init
* command line) in the command lines.
*/
strcpy(saved_command_line, extra_command_line);
strcpy(static_command_line, extra_command_line);
}
strcpy(saved_command_line + xlen, boot_command_line);
strcpy(static_command_line + xlen, command_line);
if (ilen) {
/*
* Append supplemental init boot args to saved_command_line
* so that user can check what command line options passed
* to init.
* The order should always be
* " -- "[bootconfig init-param][cmdline init-param]
*/
if (initargs_offs) {
len = xlen + initargs_offs;
strcpy(saved_command_line + len, extra_init_args);
len += ilen - 4; /* strlen(extra_init_args) */
strcpy(saved_command_line + len,
boot_command_line + initargs_offs - 1);
} else {
len = strlen(saved_command_line);
strcpy(saved_command_line + len, " -- ");
len += 4;
strcpy(saved_command_line + len, extra_init_args);
}
}
proc: give /proc/cmdline size Most /proc files don't have length (in fstat sense). This leads to inefficiencies when reading such files with APIs commonly found in modern programming languages. They open file, then fstat descriptor, get st_size == 0 and either assume file is empty or start reading without knowing target size. cat(1) does OK because it uses large enough buffer by default. But naive programs copy-pasted from SO aren't: let mut f = std::fs::File::open("/proc/cmdline").unwrap(); let mut buf: Vec<u8> = Vec::new(); f.read_to_end(&mut buf).unwrap(); will result in openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3 statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address) statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32 read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32 read(3, "apper/fedora_localhost--live-roo"..., 64) = 64 read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116 read(3, "", 12) open/stat is OK, lseek looks silly but there are 3 unnecessary reads because Rust starts with 32 bytes per Vec<u8> and grows from there. In case of /proc/cmdline, the length is known precisely. Make variables readonly while I'm at it. P.S.: I tried to scp /proc/cpuinfo today and got empty file but this is separate story. Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09 02:21:54 +08:00
saved_command_line_len = strlen(saved_command_line);
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
}
/*
* We need to finalize in a non-__init function or else race conditions
* between the root thread and the init thread may cause start_kernel to
* be reaped by free_initmem before the root thread has proceeded to
* cpu_idle.
*
* gcc-3.4 accidentally inlines this function, so use noinline.
*/
static __initdata DECLARE_COMPLETION(kthreadd_done);
noinline void __ref __noreturn rest_init(void)
{
struct task_struct *tsk;
kthread: don't depend on work queues Currently there is a circular reference between work queue initialization and kthread initialization. This prevents the kthread infrastructure from initializing until after work queues have been initialized. We want the properties of tasks created with kthread_create to be as close as possible to the init_task and to not be contaminated by user processes. The later we start our kthreadd that creates these tasks the harder it is to avoid contamination from user processes and the more of a mess we have to clean up because the defaults have changed on us. So this patch modifies the kthread support to not use work queues but to instead use a simple list of structures, and to have kthreadd start from init_task immediately after our kernel thread that execs /sbin/init. By being a true child of init_task we only have to change those process settings that we want to have different from init_task, such as our process name, the cpus that are allowed, blocking all signals and setting SIGCHLD to SIG_IGN so that all of our children are reaped automatically. By being a true child of init_task we also naturally get our ppid set to 0 and do not wind up as a child of PID == 1. Ensuring that tasks generated by kthread_create will not slow down the functioning of the wait family of functions. [akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 17:34:32 +08:00
int pid;
rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kthread: Don't allocate kthread_struct for init and umh If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from. This bug was introduced by commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads"). When kthread_struct started to be allocated for all tasks that have PF_KTHREAD set. This in turn required the kthread_struct to be freed in kernel_execve and violated the assumption that kthread_struct will have the same lifetime as the task. Looking a bit deeper this only applies to callers of kernel_execve which is just the init process and the user mode helper processes. These processes really don't want to be kernel threads but are for historical reasons. Mostly that copy_thread does not know how to take a kernel mode function to the process with for processes without PF_KTHREAD or PF_IO_WORKER set. Solve this by not allocating kthread_struct for the init process and the user mode helper processes. This is done by adding a kthread member to struct kernel_clone_args. Setting kthread in fork_idle and kernel_thread. Adding user_mode_thread that works like kernel_thread except it does not set kthread. In fork only allocating the kthread_struct if .kthread is set. I have looked at kernel/kthread.c and since commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") there have been no assumptions added that to_kthread or __to_kthread will not return NULL. There are a few callers of to_kthread or __to_kthread that assume a non-NULL struct kthread pointer will be returned. These functions are kthread_data(), kthread_parmme(), kthread_exit(), kthread(), kthread_park(), kthread_unpark(), kthread_stop(). All of those functions can reasonably expected to be called when it is know that a task is a kthread so that assumption seems reasonable. Cc: stable@vger.kernel.org Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин <maximkabox13@gmail.com> Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-04-12 00:40:14 +08:00
pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
/*
* Pin init on the boot CPU. Task migration is not properly working
* until sched_init_smp() has been run. It will set the allowed
* CPUs for init to the non isolated CPUs.
*/
rcu_read_lock();
tsk = find_task_by_pid_ns(pid, &init_pid_ns);
tsk->flags |= PF_NO_SETAFFINITY;
set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
rcu_read_unlock();
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
/*
* Enable might_sleep() and smp_processor_id() checks.
* They cannot be enabled earlier because with CONFIG_PREEMPTION=y
* kernel_thread() would trigger might_sleep() splats. With
* CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
* already, but it's stuck on the kthreadd_done completion.
*/
system_state = SYSTEM_SCHEDULING;
complete(&kthreadd_done);
/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
schedule_preempt_disabled();
/* Call into cpu_idle with preempt disabled */
cpu_startup_entry(CPUHP_ONLINE);
}
/* Check for early params. */
module: add extra argument for parse_params() callback This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
static int __init do_early_param(char *param, char *val,
const char *unused, void *arg)
{
const struct obs_kernel_param *p;
for (p = __setup_start; p < __setup_end; p++) {
if ((p->early && parameq(param, p->str)) ||
(strcmp(param, "console") == 0 &&
strcmp(p->str, "earlycon") == 0)
) {
if (p->setup_func(val) != 0)
pr_warn("Malformed early option '%s'\n", param);
}
}
/* We accept everything at this stage. */
return 0;
}
Driver Core: early platform driver V3 of the early platform driver implementation. Platform drivers are great for embedded platforms because we can separate driver configuration from the actual driver. So base addresses, interrupts and other configuration can be kept with the processor or board code, and the platform driver can be reused by many different platforms. For early devices we have nothing today. For instance, to configure early timers and early serial ports we cannot use platform devices. This because the setup order during boot. Timers are needed before the platform driver core code is available. The same goes for early printk support. Early in this case means before initcalls. These early drivers today have their configuration either hard coded or they receive it using some special configuration method. This is working quite well, but if we want to support both regular kernel modules and early devices then we need to have two ways of configuring the same driver. A single way would be better. The early platform driver patch is basically a set of functions that allow drivers to register themselves and architecture code to locate them and probe. Registration happens through early_param(). The time for the probe is decided by the architecture code. See Documentation/driver-model/platform.txt for more details. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Brownell <david-b@pacbell.net> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-31 05:37:25 +08:00
void __init parse_early_options(char *cmdline)
{
module: add extra argument for parse_params() callback This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
parse_args("early options", cmdline, NULL, 0, 0, 0, NULL,
do_early_param);
Driver Core: early platform driver V3 of the early platform driver implementation. Platform drivers are great for embedded platforms because we can separate driver configuration from the actual driver. So base addresses, interrupts and other configuration can be kept with the processor or board code, and the platform driver can be reused by many different platforms. For early devices we have nothing today. For instance, to configure early timers and early serial ports we cannot use platform devices. This because the setup order during boot. Timers are needed before the platform driver core code is available. The same goes for early printk support. Early in this case means before initcalls. These early drivers today have their configuration either hard coded or they receive it using some special configuration method. This is working quite well, but if we want to support both regular kernel modules and early devices then we need to have two ways of configuring the same driver. A single way would be better. The early platform driver patch is basically a set of functions that allow drivers to register themselves and architecture code to locate them and probe. Registration happens through early_param(). The time for the probe is decided by the architecture code. See Documentation/driver-model/platform.txt for more details. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Brownell <david-b@pacbell.net> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-31 05:37:25 +08:00
}
/* Arch code calls this early on, or if not, just before other parsing. */
void __init parse_early_param(void)
{
static int done __initdata;
static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
if (done)
return;
/* All fall through to do_early_param. */
strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
Driver Core: early platform driver V3 of the early platform driver implementation. Platform drivers are great for embedded platforms because we can separate driver configuration from the actual driver. So base addresses, interrupts and other configuration can be kept with the processor or board code, and the platform driver can be reused by many different platforms. For early devices we have nothing today. For instance, to configure early timers and early serial ports we cannot use platform devices. This because the setup order during boot. Timers are needed before the platform driver core code is available. The same goes for early printk support. Early in this case means before initcalls. These early drivers today have their configuration either hard coded or they receive it using some special configuration method. This is working quite well, but if we want to support both regular kernel modules and early devices then we need to have two ways of configuring the same driver. A single way would be better. The early platform driver patch is basically a set of functions that allow drivers to register themselves and architecture code to locate them and probe. Registration happens through early_param(). The time for the probe is decided by the architecture code. See Documentation/driver-model/platform.txt for more details. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Brownell <david-b@pacbell.net> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-31 05:37:25 +08:00
parse_early_options(tmp_cmdline);
done = 1;
}
x86/amd: Check for the C1E bug post ACPI subsystem init AMD CPUs affected by the E400 erratum suffer from the issue that the local APIC timer stops when the CPU goes into C1E. Unfortunately there is no way to detect the affected CPUs on early boot. It's only possible to determine the range of possibly affected CPUs from the family/model range. The actual decision whether to enter C1E and thus cause the bug is done by the firmware and we need to detect that case late, after ACPI has been initialized. The current solution is to check in the idle routine whether the CPU is affected by reading the MSR_K8_INT_PENDING_MSG MSR and checking for the K8_INTP_C1E_ACTIVE_MASK bits. If one of the bits is set then the CPU is affected and the system is switched into forced broadcast mode. This is ineffective and on non-affected CPUs every entry to idle does the extra RDMSR. After doing some research it turns out that the bits are visible on the boot CPU right after the ACPI subsystem is initialized in the early boot process. So instead of polling for the bits in the idle loop, add a detection function after acpi_subsystem_init() and check for the MSR bits. If set, then the X86_BUG_AMD_APIC_C1E is set on the boot CPU and the TSC is marked unstable when X86_FEATURE_NONSTOP_TSC is not set as it will stop in C1E state as well. The switch to broadcast mode cannot be done at this point because the boot CPU still uses HPET as a clockevent device and the local APIC timer is not yet calibrated and installed. The switch to broadcast mode on the affected CPUs needs to be done when the local APIC timer is actually set up. This allows to cleanup the amd_e400_idle() function in the next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/20161209182912.2726-4-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-10 02:29:10 +08:00
void __init __weak arch_post_acpi_subsys_init(void) { }
void __init __weak smp_setup_processor_id(void)
{
}
# if THREAD_SIZE >= PAGE_SIZE
Clarify naming of thread info/stack allocators We've had the thread info allocated together with the thread stack for most architectures for a long time (since the thread_info was split off from the task struct), but that is about to change. But the patches that move the thread info to be off-stack (and a part of the task struct instead) made it clear how confused the allocator and freeing functions are. Because the common case was that we share an allocation with the thread stack and the thread_info, the two pointers were identical. That identity then meant that we would have things like ti = alloc_thread_info_node(tsk, node); ... tsk->stack = ti; which certainly _worked_ (since stack and thread_info have the same value), but is rather confusing: why are we assigning a thread_info to the stack? And if we move the thread_info away, the "confusing" code just gets to be entirely bogus. So remove all this confusion, and make it clear that we are doing the stack allocation by renaming and clarifying the function names to be about the stack. The fact that the thread_info then shares the allocation is an implementation detail, and not really about the allocation itself. This is a pure renaming and type fix: we pass in the same pointer, it's just that we clarify what the pointer means. The ia64 code that actually only has one single allocation (for all of task_struct, thread_info and kernel thread stack) now looks a bit odd, but since "tsk->stack" is actually not even used there, that oddity doesn't matter. It would be a separate thing to clean that up, I intentionally left the ia64 changes as a pure brute-force renaming and type change. Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-25 06:09:37 +08:00
void __init __weak thread_stack_cache_init(void)
{
}
#endif
2019-04-27 07:22:46 +08:00
void __init __weak poking_init(void) { }
void __init __weak pgtable_cache_init(void) { }
void __init __weak trap_init(void) { }
bool initcall_debug;
core_param(initcall_debug, initcall_debug, bool, 0644);
#ifdef TRACEPOINTS_ENABLED
static void __init initcall_debug_enable(void);
#else
static inline void initcall_debug_enable(void)
{
}
#endif
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
stack: Optionally randomize kernel stack offset each syscall This provides the ability for architectures to enable kernel stack base address offset randomization. This feature is controlled by the boot param "randomize_kstack_offset=on/off", with its default value set by CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT. This feature is based on the original idea from the last public release of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt All the credit for the original idea goes to the PaX team. Note that the design and implementation of this upstream randomize_kstack_offset feature differs greatly from the RANDKSTACK feature (see below). Reasoning for the feature: This feature aims to make harder the various stack-based attacks that rely on deterministic stack structure. We have had many such attacks in past (just to name few): https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html As Linux kernel stack protections have been constantly improving (vmap-based stack allocation with guard pages, removal of thread_info, STACKLEAK), attackers have had to find new ways for their exploits to work. They have done so, continuing to rely on the kernel's stack determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT were not relevant. For example, the following recent attacks would have been hampered if the stack offset was non-deterministic between syscalls: https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf (page 70: targeting the pt_regs copy with linear stack overflow) https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html (leaked stack address from one syscall as a target during next syscall) The main idea is that since the stack offset is randomized on each system call, it is harder for an attack to reliably land in any particular place on the thread stack, even with address exposures, as the stack base will change on the next syscall. Also, since randomization is performed after placing pt_regs, the ptrace-based approach[1] to discover the randomized offset during a long-running syscall should not be possible. Design description: During most of the kernel's execution, it runs on the "thread stack", which is pretty deterministic in its structure: it is fixed in size, and on every entry from userspace to kernel on a syscall the thread stack starts construction from an address fetched from the per-cpu cpu_current_top_of_stack variable. The first element to be pushed to the thread stack is the pt_regs struct that stores all required CPU registers and syscall parameters. Finally the specific syscall function is called, with the stack being used as the kernel executes the resulting request. The goal of randomize_kstack_offset feature is to add a random offset after the pt_regs has been pushed to the stack and before the rest of the thread stack is used during the syscall processing, and to change it every time a process issues a syscall. The source of randomness is currently architecture-defined (but x86 is using the low byte of rdtsc()). Future improvements for different entropy sources is possible, but out of scope for this patch. Further more, to add more unpredictability, new offsets are chosen at the end of syscalls (the timing of which should be less easy to measure from userspace than at syscall entry time), and stored in a per-CPU variable, so that the life of the value does not stay explicitly tied to a single task. As suggested by Andy Lutomirski, the offset is added using alloca() and an empty asm() statement with an output constraint, since it avoids changes to assembly syscall entry code, to the unwinder, and provides correct stack alignment as defined by the compiler. In order to make this available by default with zero performance impact for those that don't want it, it is boot-time selectable with static branches. This way, if the overhead is not wanted, it can just be left turned off with no performance impact. The generated assembly for x86_64 with GCC looks like this: ... ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax # 12380 <kstack_offset> ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax ffffffff81003983: 48 83 c0 0f add $0xf,%rax ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax ffffffff8100398c: 48 29 c4 sub %rax,%rsp ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax ... As a result of the above stack alignment, this patch introduces about 5 bits of randomness after pt_regs is spilled to the thread stack on x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for stack alignment). The amount of entropy could be adjusted based on how much of the stack space we wish to trade for security. My measure of syscall performance overhead (on x86_64): lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null randomize_kstack_offset=y Simple syscall: 0.7082 microseconds randomize_kstack_offset=n Simple syscall: 0.7016 microseconds So, roughly 0.9% overhead growth for a no-op syscall, which is very manageable. And for people that don't want this, it's off by default. There are two gotchas with using the alloca() trick. First, compilers that have Stack Clash protection (-fstack-clash-protection) enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to any dynamic stack allocations. While the randomization offset is always less than a page, the resulting assembly would still contain (unreachable!) probing routines, bloating the resulting assembly. To avoid this, -fno-stack-clash-protection is unconditionally added to the kernel Makefile since this is the only dynamic stack allocation in the kernel (now that VLAs have been removed) and it is provably safe from Stack Clash style attacks. The second gotcha with alloca() is a negative interaction with -fstack-protector*, in that it sees the alloca() as an array allocation, which triggers the unconditional addition of the stack canary function pre/post-amble which slows down syscalls regardless of the static branch. In order to avoid adding this unneeded check and its associated performance impact, architectures need to carefully remove uses of -fstack-protector-strong (or -fstack-protector) in the compilation units that use the add_random_kstack() macro and to audit the resulting stack mitigation coverage (to make sure no desired coverage disappears). No change is visible for this on x86 because the stack protector is already unconditionally disabled for the compilation unit, but the change is required on arm64. There is, unfortunately, no attribute that can be used to disable stack protector for specific functions. Comparison to PaX RANDKSTACK feature: The RANDKSTACK feature randomizes the location of the stack start (cpu_current_top_of_stack), i.e. including the location of pt_regs structure itself on the stack. Initially this patch followed the same approach, but during the recent discussions[2], it has been determined to be of a little value since, if ptrace functionality is available for an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write different offsets in the pt_regs struct, observe the cache behavior of the pt_regs accesses, and figure out the random stack offset. Another difference is that the random offset is stored in a per-cpu variable, rather than having it be per-thread. As a result, these implementations differ a fair bit in their implementation details and results, though obviously the intent is similar. [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/ [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/ [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html Co-developed-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
2021-04-02 07:23:44 +08:00
DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
DEFINE_PER_CPU(u32, kstack_offset);
static int __init early_randomize_kstack_offset(char *buf)
{
int ret;
bool bool_result;
ret = kstrtobool(buf, &bool_result);
if (ret)
return ret;
if (bool_result)
static_branch_enable(&randomize_kstack_offset);
else
static_branch_disable(&randomize_kstack_offset);
return 0;
}
early_param("randomize_kstack_offset", early_randomize_kstack_offset);
#endif
void __init __weak __noreturn arch_call_rest_init(void)
{
rest_init();
}
init: print out unknown kernel parameters It is easy to foobar setting a kernel parameter on the command line without realizing it, there's not much output that you can use to assess what the kernel did with that parameter by default. Make it a little more explicit which parameters on the command line _looked_ like a valid parameter for the kernel, but did not match anything and ultimately got tossed to init. This is very similar to the unknown parameter message received when loading a module. This assumes the parameters are processed in a normal fashion, some parameters (dyndbg= for example) don't register their parameter with the rest of the kernel's parameters, and therefore always show up in this list (and are also given to init - like the rest of this list). Another example is BOOT_IMAGE= is highlighted as an offender, which it technically is, but is passed by LILO and GRUB so most systems will see that complaint. An example output where "foobared" and "unrecognized" are intentionally invalid parameters: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Borislav Petkov <bp@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 09:56:28 +08:00
static void __init print_unknown_bootoptions(void)
{
char *unknown_options;
char *end;
const char *const *p;
size_t len;
if (panic_later || (!argv_init[1] && !envp_init[2]))
return;
/*
* Determine how many options we have to print out, plus a space
* before each
*/
len = 1; /* null terminator */
for (p = &argv_init[1]; *p; p++) {
len++;
len += strlen(*p);
}
for (p = &envp_init[2]; *p; p++) {
len++;
len += strlen(*p);
}
unknown_options = memblock_alloc(len, SMP_CACHE_BYTES);
if (!unknown_options) {
pr_err("%s: Failed to allocate %zu bytes\n",
__func__, len);
return;
}
end = unknown_options;
for (p = &argv_init[1]; *p; p++)
end += sprintf(end, " %s", *p);
for (p = &envp_init[2]; *p; p++)
end += sprintf(end, " %s", *p);
/* Start at unknown_options[1] to skip the initial space */
pr_notice("Unknown kernel command line parameters \"%s\", will be passed to user space.\n",
&unknown_options[1]);
memblock_free(unknown_options, len);
init: print out unknown kernel parameters It is easy to foobar setting a kernel parameter on the command line without realizing it, there's not much output that you can use to assess what the kernel did with that parameter by default. Make it a little more explicit which parameters on the command line _looked_ like a valid parameter for the kernel, but did not match anything and ultimately got tossed to init. This is very similar to the unknown parameter message received when loading a module. This assumes the parameters are processed in a normal fashion, some parameters (dyndbg= for example) don't register their parameter with the rest of the kernel's parameters, and therefore always show up in this list (and are also given to init - like the rest of this list). Another example is BOOT_IMAGE= is highlighted as an offender, which it technically is, but is passed by LILO and GRUB so most systems will see that complaint. An example output where "foobared" and "unrecognized" are intentionally invalid parameters: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Borislav Petkov <bp@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 09:56:28 +08:00
}
asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
void start_kernel(void)
{
char *command_line;
char *after_dashes;
init/main.c: Give init_task a canary Tasks get their end of stack set to STACK_END_MAGIC with the aim to catch stack overruns. Currently this feature does not apply to init_task. This patch removes this restriction. Note that a similar patch was posted by Prarit Bhargava some time ago but was never merged: http://marc.info/?l=linux-kernel&m=127144305403241&w=2 Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dzickus@redhat.com Cc: bmr@redhat.com Cc: jcastillo@redhat.com Cc: jgh@redhat.com Cc: minchan@kernel.org Cc: tglx@linutronix.de Cc: hannes@cmpxchg.org Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-12 21:16:17 +08:00
set_task_stack_end_magic(&init_task);
smp_setup_processor_id();
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 15:55:01 +08:00
debug_objects_early_init();
buildid: stash away kernels build ID on init Parse the kernel's build ID at initialization so that other code can print a hex format string representation of the running kernel's build ID. This will be used in the kdump and dump_stack code so that developers can easily locate the vmlinux debug symbols for a crash/stacktrace. [swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()] Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Evan Green <evgreen@chromium.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sasha Levin <sashal@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 09:09:13 +08:00
init_vmlinux_build_id();
Task Control Groups: basic task cgroup framework Generic Process Control Groups -------------------------- There have recently been various proposals floating around for resource management/accounting and other task grouping subsystems in the kernel, including ResGroups, User BeanCounters, NSProxy cgroups, and others. These all need the basic abstraction of being able to group together multiple processes in an aggregate, in order to track/limit the resources permitted to those processes, or control other behaviour of the processes, and all implement this grouping in different ways. This patchset provides a framework for tracking and grouping processes into arbitrary "cgroups" and assigning arbitrary state to those groupings, in order to control the behaviour of the cgroup as an aggregate. The intention is that the various resource management and virtualization/cgroup efforts can also become task cgroup clients, with the result that: - the userspace APIs are (somewhat) normalised - it's easier to test e.g. the ResGroups CPU controller in conjunction with the BeanCounters memory controller, or use either of them as the resource-control portion of a virtual server system. - the additional kernel footprint of any of the competing resource management systems is substantially reduced, since it doesn't need to provide process grouping/containment, hence improving their chances of getting into the kernel This patch: Add the main task cgroups framework - the cgroup filesystem, and the basic structures for tracking membership and associating subsystem state objects to tasks. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
cgroup_init_early();
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
local_irq_disable();
early_boot_irqs_disabled = true;
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them.
*/
boot_cpu_init();
page_address_init();
pr_notice("%s", linux_banner);
early_security_init();
setup_arch(&command_line);
setup_boot_config();
[PATCH] Dynamic kernel command-line: common Current implementation stores a static command-line buffer allocated to COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer, one for future reference and one for parameter parsing. Current kernel command-line size for most architecture is much too small for module parameters, video settings, initramfs paramters and much more. The problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static buffers. In order to allow a greater command-line size, these buffers should be dynamically allocated or marked as init disposable buffers, so unused memory can be released. This patch renames the static saved_command_line variable into boot_command_line adding __initdata attribute, so that it can be disposed after initialization. This rename is required so applications that use saved_command_line will not be affected by this change. It reintroduces saved_command_line as dynamically allocated buffer to match the data in boot_command_line. It also mark secondary command-line buffer as __initdata, and copies it to dynamically allocated static_command_line buffer components may hold reference to it after initialization. This patch is for linux-2.6.20-rc4-mm1 and is divided to target each architecture. I could not check this in any architecture so please forgive me if I got it wrong. The per-architecture modification is very simple, use boot_command_line in place of saved_command_line. The common code is the change into dynamic command-line. This patch: 1. Rename saved_command_line into boot_command_line, mark as init disposable. 2. Add dynamic allocated saved_command_line. 3. Add dynamic allocated static_command_line. 4. During startup copy: boot_command_line into saved_command_line. arch command_line into static_command_line. 5. Parse static_command_line and not arch command_line, so arch command_line may be freed. Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
setup_command_line(command_line);
setup_nr_cpu_ids();
setup_per_cpu_areas();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
init: rename and re-order boot_cpu_state_init() This is purely a preparatory patch for upcoming changes during the 4.19 merge window. We have a function called "boot_cpu_state_init()" that isn't really about the bootup cpu state: that is done much earlier by the similarly named "boot_cpu_init()" (note lack of "state" in name). This function initializes some hotplug CPU state, and needs to run after the percpu data has been properly initialized. It even has a comment to that effect. Except it _doesn't_ actually run after the percpu data has been properly initialized. On x86 it happens to do that, but on at least arm and arm64, the percpu base pointers are initialized by the arch-specific 'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init(). This had some unexpected results, and in particular we have a patch pending for the merge window that did the obvious cleanup of using 'this_cpu_write()' in the cpu hotplug init code: - per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE; + this_cpu_write(cpuhp_state.state, CPUHP_ONLINE); which is obviously the right thing to do. Except because of the ordering issue, it actually failed miserably and unexpectedly on arm64. So this just fixes the ordering, and changes the name of the function to be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu hotplug state, because the core CPU state was supposed to have already been done earlier. Marked for stable, since the (not yet merged) patch that will show this problem is marked for stable. Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-13 03:19:42 +08:00
boot_cpu_hotplug_init();
pr_notice("Kernel command line: %s\n", saved_command_line);
init: initialize jump labels before command line option parsing When a module option, or core kernel argument, toggles a static-key it requires jump labels to be initialized early. While x86, PowerPC, and ARM64 arrange for jump_label_init() to be called before parse_args(), ARM does not. Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1 console=ttyAMA0,115200 page_alloc.shuffle=1 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303 page_alloc_shuffle+0x12c/0x1ac static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used before call to jump_label_init() Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc4-next-20190410-00003-g3367c36ce744 #1 Hardware name: ARM Integrator/CP (Device Tree) [<c0011c68>] (unwind_backtrace) from [<c000ec48>] (show_stack+0x10/0x18) [<c000ec48>] (show_stack) from [<c07e9710>] (dump_stack+0x18/0x24) [<c07e9710>] (dump_stack) from [<c001bb1c>] (__warn+0xe0/0x108) [<c001bb1c>] (__warn) from [<c001bb88>] (warn_slowpath_fmt+0x44/0x6c) [<c001bb88>] (warn_slowpath_fmt) from [<c0b0c4a8>] (page_alloc_shuffle+0x12c/0x1ac) [<c0b0c4a8>] (page_alloc_shuffle) from [<c0b0c550>] (shuffle_store+0x28/0x48) [<c0b0c550>] (shuffle_store) from [<c003e6a0>] (parse_args+0x1f4/0x350) [<c003e6a0>] (parse_args) from [<c0ac3c00>] (start_kernel+0x1c0/0x488) Move the fallback call to jump_label_init() to occur before parse_args(). The redundant calls to jump_label_init() in other archs are left intact in case they have static key toggling use cases that are even earlier than option parsing. Link: http://lkml.kernel.org/r/155544804466.1032396.13418949511615676665.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Guenter Roeck <groeck@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Russell King <rmk@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19 08:50:44 +08:00
/* parameters may set static keys */
jump_label_init();
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
parse_early_param();
after_dashes = parse_args("Booting kernel",
static_command_line, __start___param,
__stop___param - __start___param,
module: add extra argument for parse_params() callback This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
-1, -1, NULL, &unknown_bootoption);
init: print out unknown kernel parameters It is easy to foobar setting a kernel parameter on the command line without realizing it, there's not much output that you can use to assess what the kernel did with that parameter by default. Make it a little more explicit which parameters on the command line _looked_ like a valid parameter for the kernel, but did not match anything and ultimately got tossed to init. This is very similar to the unknown parameter message received when loading a module. This assumes the parameters are processed in a normal fashion, some parameters (dyndbg= for example) don't register their parameter with the rest of the kernel's parameters, and therefore always show up in this list (and are also given to init - like the rest of this list). Another example is BOOT_IMAGE= is highlighted as an offender, which it technically is, but is passed by LILO and GRUB so most systems will see that complaint. An example output where "foobared" and "unrecognized" are intentionally invalid parameters: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Borislav Petkov <bp@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 09:56:28 +08:00
print_unknown_bootoptions();
if (!IS_ERR_OR_NULL(after_dashes))
parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
module: add extra argument for parse_params() callback This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
NULL, set_init_arg);
if (extra_init_args)
parse_args("Setting extra init args", extra_init_args,
NULL, 0, -1, -1, NULL, set_init_arg);
random: split initialization into early step and later step The full RNG initialization relies on some timestamps, made possible with initialization functions like time_init() and timekeeping_init(). However, these are only available rather late in initialization. Meanwhile, other things, such as memory allocator functions, make use of the RNG much earlier. So split RNG initialization into two phases. We can provide arch randomness very early on, and then later, after timekeeping and such are available, initialize the rest. This ensures that, for example, slabs are properly randomized if RDRAND is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree of its security, because its random seed is potentially deterministic, since it hasn't yet incorporated RDRAND. It also makes it possible to use a better seed in kfence, which currently relies on only the cycle counter. Another positive consequence is that on systems with RDRAND, running with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all. One subtle side effect of this change is that on systems with no RDRAND, RDTSC is now only queried by random_init() once, committing the moment of the function call, instead of multiple times as before. This is intentional, as the multiple RDTSCs in a loop before weren't accomplishing very much, with jitter being better provided by try_to_generate_entropy(). Plus, filling blocks with RDTSC is still being done in extract_entropy(), which is necessarily called before random bytes are served anyway. Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-26 23:43:14 +08:00
/* Architectural and non-timekeeping rng init, before allocator init */
random_init_early(command_line);
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
/*
* These use large bootmem allocations and must precede
* initalization of page allocator
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
*/
setup_log_buf(0);
slab: setup allocators earlier in the boot sequence This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 00:40:04 +08:00
vfs_caches_init_early();
sort_main_extable();
trap_init();
mm_core_init();
poking_init();
ftrace_init();
/* trace_printk can be enabled here */
early_trace_init();
/*
* Set up the scheduler prior starting any interrupts (such as the
* timer interrupt). Full topology setup happens at smp_init()
* time - but meanwhile we still have a functioning scheduler.
*/
sched_init();
sched/core: Initialize the idle task with preemption disabled As pointed out by commit de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12 17:46:36 +08:00
if (WARN(!irqs_disabled(),
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
Maple Tree: add new data structure Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
maple_tree_init();
workqueue: make workqueue available early during boot Workqueue is currently initialized in an early init call; however, there are cases where early boot code has to be split and reordered to come after workqueue initialization or the same code path which makes use of workqueues is used both before workqueue initailization and after. The latter cases have to gate workqueue usages with keventd_up() tests, which is nasty and easy to get wrong. Workqueue usages have become widespread and it'd be a lot more convenient if it can be used very early from boot. This patch splits workqueue initialization into two steps. workqueue_init_early() which sets up the basic data structures so that workqueues can be created and work items queued, and workqueue_init() which actually brings up workqueues online and starts executing queued work items. The former step can be done very early during boot once memory allocation, cpumasks and idr are initialized. The latter right after kthreads become available. This allows work item queueing and canceling from very early boot which is what most of these use cases want. * As systemd_wq being initialized doesn't indicate that workqueue is fully online anymore, update keventd_up() to test wq_online instead. The follow-up patches will get rid of all its usages and the function itself. * Flushing doesn't make sense before workqueue is fully initialized. The flush functions trigger WARN and return immediately before fully online. * Work items are never in-flight before fully online. Canceling can always succeed by skipping the flush step. * Some code paths can no longer assume to be called with irq enabled as irq is disabled during early boot. Use irqsave/restore operations instead. v2: Watchdog init, which requires timer to be running, moved from workqueue_init_early() to workqueue_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
2016-09-17 03:49:32 +08:00
/*
* Set up housekeeping before setting up workqueues to allow the unbound
* workqueue to take non-housekeeping into account.
*/
housekeeping_init();
workqueue: make workqueue available early during boot Workqueue is currently initialized in an early init call; however, there are cases where early boot code has to be split and reordered to come after workqueue initialization or the same code path which makes use of workqueues is used both before workqueue initailization and after. The latter cases have to gate workqueue usages with keventd_up() tests, which is nasty and easy to get wrong. Workqueue usages have become widespread and it'd be a lot more convenient if it can be used very early from boot. This patch splits workqueue initialization into two steps. workqueue_init_early() which sets up the basic data structures so that workqueues can be created and work items queued, and workqueue_init() which actually brings up workqueues online and starts executing queued work items. The former step can be done very early during boot once memory allocation, cpumasks and idr are initialized. The latter right after kthreads become available. This allows work item queueing and canceling from very early boot which is what most of these use cases want. * As systemd_wq being initialized doesn't indicate that workqueue is fully online anymore, update keventd_up() to test wq_online instead. The follow-up patches will get rid of all its usages and the function itself. * Flushing doesn't make sense before workqueue is fully initialized. The flush functions trigger WARN and return immediately before fully online. * Work items are never in-flight before fully online. Canceling can always succeed by skipping the flush step. * Some code paths can no longer assume to be called with irq enabled as irq is disabled during early boot. Use irqsave/restore operations instead. v2: Watchdog init, which requires timer to be running, moved from workqueue_init_early() to workqueue_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
2016-09-17 03:49:32 +08:00
/*
* Allow workqueue creation and work item queueing/cancelling
* early. Work item execution depends on kthreads and starts after
* workqueue_init().
*/
workqueue_init_early();
rcu_init();
/* Trace events are available after this */
trace_init();
if (initcall_debug)
initcall_debug_enable();
context_tracking_init();
/* init some links before init_ISA_irqs() */
early_irq_init();
init_IRQ();
tick_init();
rcu_init_nohz();
init_timers();
srcu_init();
hrtimers_init();
softirq_init();
timekeeping_init();
time_init();
random: split initialization into early step and later step The full RNG initialization relies on some timestamps, made possible with initialization functions like time_init() and timekeeping_init(). However, these are only available rather late in initialization. Meanwhile, other things, such as memory allocator functions, make use of the RNG much earlier. So split RNG initialization into two phases. We can provide arch randomness very early on, and then later, after timekeeping and such are available, initialize the rest. This ensures that, for example, slabs are properly randomized if RDRAND is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree of its security, because its random seed is potentially deterministic, since it hasn't yet incorporated RDRAND. It also makes it possible to use a better seed in kfence, which currently relies on only the cycle counter. Another positive consequence is that on systems with RDRAND, running with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all. One subtle side effect of this change is that on systems with no RDRAND, RDTSC is now only queried by random_init() once, committing the moment of the function call, instead of multiple times as before. This is intentional, as the multiple RDTSCs in a loop before weren't accomplishing very much, with jitter being better provided by try_to_generate_entropy(). Plus, filling blocks with RDTSC is still being done in extract_entropy(), which is necessarily called before random bytes are served anyway. Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-26 23:43:14 +08:00
/* This must be after timekeeping is initialized */
random_init();
/* These make use of the fully initialized rng */
kfence_init();
boot_init_stack_canary();
perf: Use hrtimers for event multiplexing The current scheme of using the timer tick was fine for per-thread events. However, it was causing bias issues in system-wide mode (including for uncore PMUs). Event groups would not get their fair share of runtime on the PMU. With tickless kernels, if a core is idle there is no timer tick, and thus no event rotation (multiplexing). However, there are events (especially uncore events) which do count even though cores are asleep. This patch changes the timer source for multiplexing. It introduces a per-PMU per-cpu hrtimer. The advantage is that even when a core goes idle, it will come back to service the hrtimer, thus multiplexing on system-wide events works much better. The per-PMU implementation (suggested by PeterZ) enables adjusting the multiplexing interval per PMU. The preferred interval is stashed into the struct pmu. If not set, it will be forced to the default interval value. In order to minimize the impact of the hrtimer, it is turned on and off on demand. When the PMU on a CPU is overcommited, the hrtimer is activated. It is stopped when the PMU is not overcommitted. In order for this to work properly, we had to change the order of initialization in start_kernel() such that hrtimer_init() is run before perf_event_init(). The default interval in milliseconds is set to a timer tick just like with the old code. We will provide a sysctl to tune this in another patch. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-03 20:21:33 +08:00
perf_event_init();
profile_init();
generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts There is a problem that kdump(2nd kernel) sometimes hangs up due to a pending IPI from 1st kernel. Kernel panic occurs because IPI comes before call_single_queue is initialized. To fix the crash, rename init_call_single_data() to call_function_init() and call it in start_kernel() so that call_single_queue can be initialized before enabling interrupts. The details of the crash are: (1) 2nd kernel boots up (2) A pending IPI from 1st kernel comes when irqs are first enabled in start_kernel(). (3) Kernel tries to handle the interrupt, but call_single_queue is not initialized yet at this point. As a result, in the generic_smp_call_function_single_interrupt(), NULL pointer dereference occurs when list_replace_init() tries to access &q->list.next. Therefore this patch changes the name of init_call_single_data() to call_function_init() and calls it before local_irq_enable() in start_kernel(). Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Milton Miller <miltonm@bga.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/D6CBEE2F420741indou.takao@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-30 00:35:04 +08:00
call_function_init();
init: scream bloody murder if interrupts are enabled too early As I was testing a lot of my code recently, and having several "successes", I accidentally noticed in the dmesg this little line: start_kernel(): bug: interrupts were enabled *very* early, fixing it Sure enough, one of my patches two commits ago enabled interrupts early. The sad part here is that I never noticed it, and I ran several tests with ktest too, and ktest did not notice this line. What ktest looks for (and so does many other automated testing scripts) is a back trace produced by a WARN_ON() or BUG(). As a back trace was never produced, my buggy patch could have slipped into linux-next, or even worse, mainline. Adding a WARN(!irqs_disabled()) makes this bug a little more obvious: PID hash table entries: 4096 (order: 3, 32768 bytes) __ex_table already sorted, skipping sort Checking aperture... No AGP bridge found Calgary: detecting Calgary via BIOS EBDA area Calgary: Unable to locate Rio Grande table in EBDA - bailing! Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init) ------------[ cut here ]------------ WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415() Hardware name: To Be Filled By O.E.M. Interrupts were enabled *very* early, fixing it Modules linked in: Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286 Call Trace: warn_slowpath_common+0x83/0x9b warn_slowpath_fmt+0x46/0x48 start_kernel+0x21e/0x415 x86_64_start_reservations+0x10e/0x112 x86_64_start_kernel+0x102/0x111 ---[ end trace 007d8b0491b4f5d8 ]--- Preemptible hierarchical RCU implementation. RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4. NR_IRQS:4352 nr_irqs:712 16 Console: colour VGA+ 80x25 console [ttyS0] enabled, bootconsole disabled Do you see it? The original version of this patch just slapped a WARN_ON() in there and kept the printk(). Ard van Breemen suggested using the WARN() interface, which makes the code a bit cleaner. Also, while examining other warnings in init/main.c, I found two other locations that deserve a bloody murder scream if their conditions are hit, and updated them accordingly. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ard van Breemen <ard@telegraafnet.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 07:18:18 +08:00
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
tracing: Centralize preemptirq tracepoints and unify their usage This patch detaches the preemptirq tracepoints from the tracers and keeps it separate. Advantages: * Lockdep and irqsoff event can now run in parallel since they no longer have their own calls. * This unifies the usecase of adding hooks to an irqsoff and irqson event, and a preemptoff and preempton event. 3 users of the events exist: - Lockdep - irqsoff and preemptoff tracers - irqs and preempt trace events The unification cleans up several ifdefs and makes the code in preempt tracer and irqsoff tracers simpler. It gets rid of all the horrific ifdeferry around PROVE_LOCKING and makes configuration of the different users of the tracepoints more easy and understandable. It also gets rid of the time_* function calls from the lockdep hooks used to call into the preemptirq tracer which is not needed anymore. The negative delta in lines of code in this patch is quite large too. In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS as a single point for registering probes onto the tracepoints. With this, the web of config options for preempt/irq toggle tracepoints and its users becomes: PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING | | \ | | \ (selects) / \ \ (selects) / TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS \ / \ (depends on) / PREEMPTIRQ_TRACEPOINTS Other than the performance tests mentioned in the previous patch, I also ran the locking API test suite. I verified that all tests cases are passing. I also injected issues by not registering lockdep probes onto the tracepoints and I see failures to confirm that the probes are indeed working. This series + lockdep probes not registered (just to inject errors): [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok | [ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok | [ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok | [ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok | [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok | With this series + lockdep probes registered, all locking tests pass: [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok | [ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok | [ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok | [ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok | [ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok | [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok | [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok | Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-31 06:24:23 +08:00
early_boot_irqs_disabled = false;
local_irq_enable();
kmem_cache_init_late();
/*
* HACK ALERT! This is early. We're enabling the console before
* we've done PCI setups etc, and console_init() must be aware of
* this. But we do want output early, in case something goes wrong.
*/
console_init();
if (panic_later)
panic("Too many boot %s vars at `%s'", panic_later,
panic_param);
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
tracing: Centralize preemptirq tracepoints and unify their usage This patch detaches the preemptirq tracepoints from the tracers and keeps it separate. Advantages: * Lockdep and irqsoff event can now run in parallel since they no longer have their own calls. * This unifies the usecase of adding hooks to an irqsoff and irqson event, and a preemptoff and preempton event. 3 users of the events exist: - Lockdep - irqsoff and preemptoff tracers - irqs and preempt trace events The unification cleans up several ifdefs and makes the code in preempt tracer and irqsoff tracers simpler. It gets rid of all the horrific ifdeferry around PROVE_LOCKING and makes configuration of the different users of the tracepoints more easy and understandable. It also gets rid of the time_* function calls from the lockdep hooks used to call into the preemptirq tracer which is not needed anymore. The negative delta in lines of code in this patch is quite large too. In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS as a single point for registering probes onto the tracepoints. With this, the web of config options for preempt/irq toggle tracepoints and its users becomes: PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING | | \ | | \ (selects) / \ \ (selects) / TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS \ / \ (depends on) / PREEMPTIRQ_TRACEPOINTS Other than the performance tests mentioned in the previous patch, I also ran the locking API test suite. I verified that all tests cases are passing. I also injected issues by not registering lockdep probes onto the tracepoints and I see failures to confirm that the probes are indeed working. This series + lockdep probes not registered (just to inject errors): [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok | [ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok | [ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok | [ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok | [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok | With this series + lockdep probes registered, all locking tests pass: [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok | [ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok | [ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok | [ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok | [ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok | [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok | [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok | Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-31 06:24:23 +08:00
lockdep_init();
[PATCH] lockdep: core Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
/*
* Need to run this when irqs are enabled, because it wants
* to self-test [hard/soft]-irqs on/off lock inversion bugs
* too:
*/
locking_selftest();
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
page_to_pfn(virt_to_page((void *)initrd_start)),
min_low_pfn);
initrd_start = 0;
}
#endif
[PATCH] node local per-cpu-pages This patch modifies the way pagesets in struct zone are managed. Each zone has a per-cpu array of pagesets. So any particular CPU has some memory in each zone structure which belongs to itself. Even if that CPU is not local to that zone. So the patch relocates the pagesets for each cpu to the node that is nearest to the cpu instead of allocating the pagesets in the (possibly remote) target zone. This means that the operations to manage pages on remote zone can be done with information available locally. We play a macro trick so that non-NUMA pmachines avoid the additional pointer chase on the page allocator fastpath. AIM7 benchmark on a 32 CPU SGI Altix w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.68 100 484.6769 12.01 1.97 Fri Mar 25 11:01:42 2005 100 27140.46 89 271.4046 21.44 148.71 Fri Mar 25 11:02:04 2005 200 30792.02 82 153.9601 37.80 296.72 Fri Mar 25 11:02:42 2005 300 32209.27 81 107.3642 54.21 451.34 Fri Mar 25 11:03:37 2005 400 34962.83 78 87.4071 66.59 588.97 Fri Mar 25 11:04:44 2005 500 31676.92 75 63.3538 91.87 742.71 Fri Mar 25 11:06:16 2005 600 36032.69 73 60.0545 96.91 885.44 Fri Mar 25 11:07:54 2005 700 35540.43 77 50.7720 114.63 1024.28 Fri Mar 25 11:09:49 2005 800 33906.70 74 42.3834 137.32 1181.65 Fri Mar 25 11:12:06 2005 900 34120.67 73 37.9119 153.51 1325.26 Fri Mar 25 11:14:41 2005 1000 34802.37 74 34.8024 167.23 1465.26 Fri Mar 25 11:17:28 2005 with slab API changes and pageset patch: Tasks jobs/min jti jobs/min/task real cpu 1 485.00 100 485.0000 12.00 1.96 Fri Mar 25 11:46:18 2005 100 28000.96 89 280.0096 20.79 150.45 Fri Mar 25 11:46:39 2005 200 32285.80 79 161.4290 36.05 293.37 Fri Mar 25 11:47:16 2005 300 40424.15 84 134.7472 43.19 438.42 Fri Mar 25 11:47:59 2005 400 39155.01 79 97.8875 59.46 590.05 Fri Mar 25 11:48:59 2005 500 37881.25 82 75.7625 76.82 730.19 Fri Mar 25 11:50:16 2005 600 39083.14 78 65.1386 89.35 872.79 Fri Mar 25 11:51:46 2005 700 38627.83 77 55.1826 105.47 1022.46 Fri Mar 25 11:53:32 2005 800 39631.94 78 49.5399 117.48 1169.94 Fri Mar 25 11:55:30 2005 900 36903.70 79 41.0041 141.94 1310.78 Fri Mar 25 11:57:53 2005 1000 36201.23 77 36.2012 160.77 1458.31 Fri Mar 25 12:00:34 2005 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22 08:14:47 +08:00
setup_per_cpu_pageset();
numa_policy_init();
acpi_early_init();
if (late_time_init)
late_time_init();
sched_clock_init();
calibrate_delay();
arch_cpu_finalize_init();
pid: replace pid bitmap implementation with IDR API Patch series "Replacing PID bitmap implementation with IDR API", v4. This series replaces kernel bitmap implementation of PID allocation with IDR API. These patches are written to simplify the kernel by replacing custom code with calls to generic code. The following are the stats for pid and pid_namespace object files before and after the replacement. There is a noteworthy change between the IDR and bitmap implementation. Before text data bss dec hex filename 8447 3894 64 12405 3075 kernel/pid.o After text data bss dec hex filename 3397 304 0 3701 e75 kernel/pid.o Before text data bss dec hex filename 5692 1842 192 7726 1e2e kernel/pid_namespace.o After text data bss dec hex filename 2854 216 16 3086 c0e kernel/pid_namespace.o The following are the stats for ps, pstree and calling readdir on /proc for 10,000 processes. ps: With IDR API With bitmap real 0m1.479s 0m2.319s user 0m0.070s 0m0.060s sys 0m0.289s 0m0.516s pstree: With IDR API With bitmap real 0m1.024s 0m1.794s user 0m0.348s 0m0.612s sys 0m0.184s 0m0.264s proc: With IDR API With bitmap real 0m0.059s 0m0.074s user 0m0.000s 0m0.004s sys 0m0.016s 0m0.016s This patch (of 2): Replace the current bitmap implementation for Process ID allocation. Functions that are no longer required, for example, free_pidmap(), alloc_pidmap(), etc. are removed. The rest of the functions are modified to use the IDR API. The change was made to make the PID allocation less complex by replacing custom code with calls to generic API. [gs051095@gmail.com: v6] Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com [avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl] Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com Signed-off-by: Gargi Sharma <gs051095@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Ingo Molnar <mingo@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-18 07:30:30 +08:00
pid_idr_init();
anon_vma_init();
#ifdef CONFIG_X86
efi: Make 'efi_enabled' a function to query EFI facilities Originally 'efi_enabled' indicated whether a kernel was booted from EFI firmware. Over time its semantics have changed, and it now indicates whether or not we are booted on an EFI machine with bit-native firmware, e.g. 64-bit kernel with 64-bit firmware. The immediate motivation for this patch is the bug report at, https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 which details how running a platform driver on an EFI machine that is designed to run under BIOS can cause the machine to become bricked. Also, the following report, https://bugzilla.kernel.org/show_bug.cgi?id=47121 details how running said driver can also cause Machine Check Exceptions. Drivers need a new means of detecting whether they're running on an EFI machine, as sadly the expression, if (!efi_enabled) hasn't been a sufficient condition for quite some time. Users actually want to query 'efi_enabled' for different reasons - what they really want access to is the list of available EFI facilities. For instance, the x86 reboot code needs to know whether it can invoke the ResetSystem() function provided by the EFI runtime services, while the ACPI OSL code wants to know whether the EFI config tables were mapped successfully. There are also checks in some of the platform driver code to simply see if they're running on an EFI machine (which would make it a bad idea to do BIOS-y things). This patch is a prereq for the samsung-laptop fix patch. Cc: David Airlie <airlied@linux.ie> Cc: Corentin Chary <corentincj@iksaif.net> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Olof Johansson <olof@lixom.net> Cc: Peter Jones <pjones@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Steve Langasek <steve.langasek@canonical.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-14 17:42:35 +08:00
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
#endif
Clarify naming of thread info/stack allocators We've had the thread info allocated together with the thread stack for most architectures for a long time (since the thread_info was split off from the task struct), but that is about to change. But the patches that move the thread info to be off-stack (and a part of the task struct instead) made it clear how confused the allocator and freeing functions are. Because the common case was that we share an allocation with the thread stack and the thread_info, the two pointers were identical. That identity then meant that we would have things like ti = alloc_thread_info_node(tsk, node); ... tsk->stack = ti; which certainly _worked_ (since stack and thread_info have the same value), but is rather confusing: why are we assigning a thread_info to the stack? And if we move the thread_info away, the "confusing" code just gets to be entirely bogus. So remove all this confusion, and make it clear that we are doing the stack allocation by renaming and clarifying the function names to be about the stack. The fact that the thread_info then shares the allocation is an implementation detail, and not really about the allocation itself. This is a pure renaming and type fix: we pass in the same pointer, it's just that we clarify what the pointer means. The ia64 code that actually only has one single allocation (for all of task_struct, thread_info and kernel thread stack) now looks a bit odd, but since "tsk->stack" is actually not even used there, that oddity doesn't matter. It would be a separate thing to clean that up, I intentionally left the ia64 changes as a pure brute-force renaming and type change. Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-25 06:09:37 +08:00
thread_stack_cache_init();
CRED: Inaugurate COW credentials Inaugurate copy-on-write credentials management. This uses RCU to manage the credentials pointer in the task_struct with respect to accesses by other tasks. A process may only modify its own credentials, and so does not need locking to access or modify its own credentials. A mutex (cred_replace_mutex) is added to the task_struct to control the effect of PTRACE_ATTACHED on credential calculations, particularly with respect to execve(). With this patch, the contents of an active credentials struct may not be changed directly; rather a new set of credentials must be prepared, modified and committed using something like the following sequence of events: struct cred *new = prepare_creds(); int ret = blah(new); if (ret < 0) { abort_creds(new); return ret; } return commit_creds(new); There are some exceptions to this rule: the keyrings pointed to by the active credentials may be instantiated - keyrings violate the COW rule as managing COW keyrings is tricky, given that it is possible for a task to directly alter the keys in a keyring in use by another task. To help enforce this, various pointers to sets of credentials, such as those in the task_struct, are declared const. The purpose of this is compile-time discouragement of altering credentials through those pointers. Once a set of credentials has been made public through one of these pointers, it may not be modified, except under special circumstances: (1) Its reference count may incremented and decremented. (2) The keyrings to which it points may be modified, but not replaced. The only safe way to modify anything else is to create a replacement and commit using the functions described in Documentation/credentials.txt (which will be added by a later patch). This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). This now prepares and commits credentials in various places in the security code rather than altering the current creds directly. (2) Temporary credential overrides. do_coredump() and sys_faccessat() now prepare their own credentials and temporarily override the ones currently on the acting thread, whilst preventing interference from other threads by holding cred_replace_mutex on the thread being dumped. This will be replaced in a future patch by something that hands down the credentials directly to the functions being called, rather than altering the task's objective credentials. (3) LSM interface. A number of functions have been changed, added or removed: (*) security_capset_check(), ->capset_check() (*) security_capset_set(), ->capset_set() Removed in favour of security_capset(). (*) security_capset(), ->capset() New. This is passed a pointer to the new creds, a pointer to the old creds and the proposed capability sets. It should fill in the new creds or return an error. All pointers, barring the pointer to the new creds, are now const. (*) security_bprm_apply_creds(), ->bprm_apply_creds() Changed; now returns a value, which will cause the process to be killed if it's an error. (*) security_task_alloc(), ->task_alloc_security() Removed in favour of security_prepare_creds(). (*) security_cred_free(), ->cred_free() New. Free security data attached to cred->security. (*) security_prepare_creds(), ->cred_prepare() New. Duplicate any security data attached to cred->security. (*) security_commit_creds(), ->cred_commit() New. Apply any security effects for the upcoming installation of new security by commit_creds(). (*) security_task_post_setuid(), ->task_post_setuid() Removed in favour of security_task_fix_setuid(). (*) security_task_fix_setuid(), ->task_fix_setuid() Fix up the proposed new credentials for setuid(). This is used by cap_set_fix_setuid() to implicitly adjust capabilities in line with setuid() changes. Changes are made to the new credentials, rather than the task itself as in security_task_post_setuid(). (*) security_task_reparent_to_init(), ->task_reparent_to_init() Removed. Instead the task being reparented to init is referred directly to init's credentials. NOTE! This results in the loss of some state: SELinux's osid no longer records the sid of the thread that forked it. (*) security_key_alloc(), ->key_alloc() (*) security_key_permission(), ->key_permission() Changed. These now take cred pointers rather than task pointers to refer to the security context. (4) sys_capset(). This has been simplified and uses less locking. The LSM functions it calls have been merged. (5) reparent_to_kthreadd(). This gives the current thread the same credentials as init by simply using commit_thread() to point that way. (6) __sigqueue_alloc() and switch_uid() __sigqueue_alloc() can't stop the target task from changing its creds beneath it, so this function gets a reference to the currently applicable user_struct which it then passes into the sigqueue struct it returns if successful. switch_uid() is now called from commit_creds(), and possibly should be folded into that. commit_creds() should take care of protecting __sigqueue_alloc(). (7) [sg]et[ug]id() and co and [sg]et_current_groups. The set functions now all use prepare_creds(), commit_creds() and abort_creds() to build and check a new set of credentials before applying it. security_task_set[ug]id() is called inside the prepared section. This guarantees that nothing else will affect the creds until we've finished. The calling of set_dumpable() has been moved into commit_creds(). Much of the functionality of set_user() has been moved into commit_creds(). The get functions all simply access the data directly. (8) security_task_prctl() and cap_task_prctl(). security_task_prctl() has been modified to return -ENOSYS if it doesn't want to handle a function, or otherwise return the return value directly rather than through an argument. Additionally, cap_task_prctl() now prepares a new set of credentials, even if it doesn't end up using it. (9) Keyrings. A number of changes have been made to the keyrings code: (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have all been dropped and built in to the credentials functions directly. They may want separating out again later. (b) key_alloc() and search_process_keyrings() now take a cred pointer rather than a task pointer to specify the security context. (c) copy_creds() gives a new thread within the same thread group a new thread keyring if its parent had one, otherwise it discards the thread keyring. (d) The authorisation key now points directly to the credentials to extend the search into rather pointing to the task that carries them. (e) Installing thread, process or session keyrings causes a new set of credentials to be created, even though it's not strictly necessary for process or session keyrings (they're shared). (10) Usermode helper. The usermode helper code now carries a cred struct pointer in its subprocess_info struct instead of a new session keyring pointer. This set of credentials is derived from init_cred and installed on the new process after it has been cloned. call_usermodehelper_setup() allocates the new credentials and call_usermodehelper_freeinfo() discards them if they haven't been used. A special cred function (prepare_usermodeinfo_creds()) is provided specifically for call_usermodehelper_setup() to call. call_usermodehelper_setkeys() adjusts the credentials to sport the supplied keyring as the new session keyring. (11) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) selinux_setprocattr() no longer does its check for whether the current ptracer can access processes with the new SID inside the lock that covers getting the ptracer's SID. Whilst this lock ensures that the check is done with the ptracer pinned, the result is only valid until the lock is released, so there's no point doing it inside the lock. (12) is_single_threaded(). This function has been extracted from selinux_setprocattr() and put into a file of its own in the lib/ directory as join_session_keyring() now wants to use it too. The code in SELinux just checked to see whether a task shared mm_structs with other tasks (CLONE_VM), but that isn't good enough. We really want to know if they're part of the same thread group (CLONE_THREAD). (13) nfsd. The NFS server daemon now has to use the COW credentials to set the credentials it is going to use. It really needs to pass the credentials down to the functions it calls, but it can't do that until other patches in this series have been applied. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 07:39:23 +08:00
cred_init();
fork_init();
proc_caches_init();
uts_ns_init();
key_init();
security_init();
dbg_late_init();
net_ns_init();
vfs_caches_init();
2016-12-25 11:00:30 +08:00
pagecache_init();
signals_init();
seq_file_init();
proc_root_init();
nsfs_init();
cpuset_init();
cgroup_init();
taskstats_init_early();
delayacct_init();
ACPI / init: Switch over platform to the ACPI mode later Commit 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()" moved the ACPI subsystem initialization, including the ACPI mode enabling, to an earlier point in the initialization sequence, to allow the timekeeping subsystem use ACPI early. Unfortunately, that resulted in boot regressions on some systems and the early ACPI initialization was moved toward its original position in the kernel initialization code by commit c4e1acbb35e4 "ACPI / init: Invoke early ACPI initialization later". However, that turns out to be insufficient, as boot is still broken on the Tyan S8812 mainboard. To fix that issue, split the ACPI early initialization code into two pieces so the majority of it still located in acpi_early_init() and the part switching over the platform into the ACPI mode goes into a new function, acpi_subsystem_init(), executed at the original early ACPI initialization spot. That fixes the Tyan S8812 boot problem, but still allows ACPI tables to be loaded earlier which is useful to the EFI code in efi_enter_virtual_mode(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141 Fixes: 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()" Reported-and-tested-by: Marius Tolzmann <tolzmann@molgen.mpg.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
2015-06-10 07:33:36 +08:00
acpi_subsystem_init();
x86/amd: Check for the C1E bug post ACPI subsystem init AMD CPUs affected by the E400 erratum suffer from the issue that the local APIC timer stops when the CPU goes into C1E. Unfortunately there is no way to detect the affected CPUs on early boot. It's only possible to determine the range of possibly affected CPUs from the family/model range. The actual decision whether to enter C1E and thus cause the bug is done by the firmware and we need to detect that case late, after ACPI has been initialized. The current solution is to check in the idle routine whether the CPU is affected by reading the MSR_K8_INT_PENDING_MSG MSR and checking for the K8_INTP_C1E_ACTIVE_MASK bits. If one of the bits is set then the CPU is affected and the system is switched into forced broadcast mode. This is ineffective and on non-affected CPUs every entry to idle does the extra RDMSR. After doing some research it turns out that the bits are visible on the boot CPU right after the ACPI subsystem is initialized in the early boot process. So instead of polling for the bits in the idle loop, add a detection function after acpi_subsystem_init() and check for the MSR bits. If set, then the X86_BUG_AMD_APIC_C1E is set on the boot CPU and the TSC is marked unstable when X86_FEATURE_NONSTOP_TSC is not set as it will stop in C1E state as well. The switch to broadcast mode cannot be done at this point because the boot CPU still uses HPET as a clockevent device and the local APIC timer is not yet calibrated and installed. The switch to broadcast mode on the affected CPUs needs to be done when the local APIC timer is actually set up. This allows to cleanup the amd_e400_idle() function in the next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/20161209182912.2726-4-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-10 02:29:10 +08:00
arch_post_acpi_subsys_init();
kcsan_init();
/* Do the rest non-__init'ed, we're now alive */
arch_call_rest_init();
x86: Fix early boot crash on gcc-10, third try ... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-04-23 00:11:30 +08:00
/*
* Avoid stack canaries in callers of boot_init_stack_canary for gcc-10
* and older.
*/
#if !__has_attribute(__no_stack_protector__)
x86: Fix early boot crash on gcc-10, third try ... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-04-23 00:11:30 +08:00
prevent_tail_call_optimization();
#endif
}
/* Call all constructor functions linked into the kernel. */
static void __init do_ctors(void)
{
/*
* For UML, the constructors have already been called by the
* normal setup code as it's just a normal ELF binary, so we
* cannot do it again - but we do need CONFIG_CONSTRUCTORS
* even on UML for modules.
*/
#if defined(CONFIG_CONSTRUCTORS) && !defined(CONFIG_UML)
ctor_fn_t *fn = (ctor_fn_t *) __ctors_start;
for (; fn < (ctor_fn_t *) __ctors_end; fn++)
(*fn)();
#endif
}
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
#ifdef CONFIG_KALLSYMS
struct blacklist_entry {
struct list_head next;
char *buf;
};
static __initdata_or_module LIST_HEAD(blacklisted_initcalls);
static int __init initcall_blacklist(char *str)
{
char *str_entry;
struct blacklist_entry *entry;
/* str argument is a comma-separated list of functions */
do {
str_entry = strsep(&str, ",");
if (str_entry) {
pr_debug("blacklisting initcall %s\n", str_entry);
memblock: stop using implicit alignment to SMP_CACHE_BYTES When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES. Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise. Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions. For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate. The direct memblock APIs users were updated using the semantic patch below: @@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) ) [mhocko@suse.com: changelog update] [akpm@linux-foundation.org: coding-style fixes] [rppt@linux.ibm.com: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:09:57 +08:00
entry = memblock_alloc(sizeof(*entry),
SMP_CACHE_BYTES);
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
if (!entry)
panic("%s: Failed to allocate %zu bytes\n",
__func__, sizeof(*entry));
memblock: stop using implicit alignment to SMP_CACHE_BYTES When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES. Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise. Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions. For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate. The direct memblock APIs users were updated using the semantic patch below: @@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) ) [mhocko@suse.com: changelog update] [akpm@linux-foundation.org: coding-style fixes] [rppt@linux.ibm.com: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:09:57 +08:00
entry->buf = memblock_alloc(strlen(str_entry) + 1,
SMP_CACHE_BYTES);
init/main: add checks for the return value of memblock_alloc*() Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 14:30:20 +08:00
if (!entry->buf)
panic("%s: Failed to allocate %zu bytes\n",
__func__, strlen(str_entry) + 1);
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
strcpy(entry->buf, str_entry);
list_add(&entry->next, &blacklisted_initcalls);
}
} while (str_entry);
return 1;
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
}
static bool __init_or_module initcall_blacklisted(initcall_t fn)
{
struct blacklist_entry *entry;
char fn_name[KSYM_SYMBOL_LEN];
unsigned long addr;
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
if (list_empty(&blacklisted_initcalls))
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
return false;
addr = (unsigned long) dereference_function_descriptor(fn);
sprint_symbol_no_offset(fn_name, addr);
/*
* fn will be "function_name [module_name]" where [module_name] is not
* displayed for built-in init functions. Strip off the [module_name].
*/
strreplace(fn_name, ' ', '\0');
list_for_each_entry(entry, &blacklisted_initcalls, next) {
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
if (!strcmp(fn_name, entry->buf)) {
pr_debug("initcall %s blacklisted\n", fn_name);
return true;
}
}
return false;
}
#else
static int __init initcall_blacklist(char *str)
{
pr_warn("initcall_blacklist requires CONFIG_KALLSYMS\n");
return 0;
}
static bool __init_or_module initcall_blacklisted(initcall_t fn)
{
return false;
}
#endif
__setup("initcall_blacklist=", initcall_blacklist);
static __init_or_module void
trace_initcall_start_cb(void *data, initcall_t fn)
{
ktime_t *calltime = data;
printk(KERN_DEBUG "calling %pS @ %i\n", fn, task_pid_nr(current));
*calltime = ktime_get();
}
static __init_or_module void
trace_initcall_finish_cb(void *data, initcall_t fn, int ret)
{
ktime_t rettime, *calltime = data;
rettime = ktime_get();
printk(KERN_DEBUG "initcall %pS returned %d after %lld usecs\n",
fn, ret, (unsigned long long)ktime_us_delta(rettime, *calltime));
}
static ktime_t initcall_calltime;
#ifdef TRACEPOINTS_ENABLED
static void __init initcall_debug_enable(void)
{
int ret;
ret = register_trace_initcall_start(trace_initcall_start_cb,
&initcall_calltime);
ret |= register_trace_initcall_finish(trace_initcall_finish_cb,
&initcall_calltime);
WARN(ret, "Failed to register initcall tracepoints\n");
}
# define do_trace_initcall_start trace_initcall_start
# define do_trace_initcall_finish trace_initcall_finish
#else
static inline void do_trace_initcall_start(initcall_t fn)
{
if (!initcall_debug)
return;
trace_initcall_start_cb(&initcall_calltime, fn);
}
static inline void do_trace_initcall_finish(initcall_t fn, int ret)
{
if (!initcall_debug)
return;
trace_initcall_finish_cb(&initcall_calltime, fn, ret);
}
#endif /* !TRACEPOINTS_ENABLED */
int __init_or_module do_one_initcall(initcall_t fn)
{
int count = preempt_count();
char msgbuf[64];
int ret;
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:12:17 +08:00
if (initcall_blacklisted(fn))
return -EPERM;
do_trace_initcall_start(fn);
ret = fn();
do_trace_initcall_finish(fn, ret);
msgbuf[0] = 0;
if (preempt_count() != count) {
sprintf(msgbuf, "preemption imbalance ");
preempt_count_set(count);
}
if (irqs_disabled()) {
strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
local_irq_enable();
}
2019-03-26 03:32:28 +08:00
WARN(msgbuf[0], "initcall %pS returned with %s\n", fn, msgbuf);
gcc-plugins: Add latent_entropy plugin This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals. The need for very-early boot entropy tends to be very architecture or system design specific, so this plugin is more suited for those sorts of special cases. The existing kernel RNG already attempts to extract entropy from reliable runtime variation, but this plugin takes the idea to a logical extreme by permuting a global variable based on any variation in code execution (e.g. a different value (and permutation function) is used to permute the global based on loop count, case statement, if/then/else branching, etc). To do this, the plugin starts by inserting a local variable in every marked function. The plugin then adds logic so that the value of this variable is modified by randomly chosen operations (add, xor and rol) and random values (gcc generates separate static values for each location at compile time and also injects the stack pointer at runtime). The resulting value depends on the control flow path (e.g., loops and branches taken). Before the function returns, the plugin mixes this local variable into the latent_entropy global variable. The value of this global variable is added to the kernel entropy pool in do_one_initcall() and _do_fork(), though it does not credit any bytes of entropy to the pool; the contents of the global are just used to mix the pool. Additionally, the plugin can pre-initialize arrays with build-time random contents, so that two different kernel builds running on identical hardware will not have the same starting values. Signed-off-by: Emese Revfy <re.emese@gmail.com> [kees: expanded commit message and code comments] Signed-off-by: Kees Cook <keescook@chromium.org>
2016-06-21 02:41:19 +08:00
add_latent_entropy();
return ret;
}
static initcall_entry_t *initcall_levels[] __initdata = {
__initcall0_start,
__initcall1_start,
__initcall2_start,
__initcall3_start,
__initcall4_start,
__initcall5_start,
__initcall6_start,
__initcall7_start,
__initcall_end,
};
/* Keep these in sync with initcalls in include/linux/init.h */
static const char *initcall_level_names[] __initdata = {
"pure",
2012-04-28 04:30:34 +08:00
"core",
"postcore",
"arch",
"subsys",
"fs",
"device",
"late",
};
static int __init ignore_unknown_bootoption(char *param, char *val,
const char *unused, void *arg)
{
return 0;
}
static void __init do_initcall_level(int level, char *command_line)
{
initcall_entry_t *fn;
parse_args(initcall_level_names[level],
command_line, __start___param,
__stop___param - __start___param,
level, level,
NULL, ignore_unknown_bootoption);
trace_initcall_level(initcall_level_names[level]);
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
do_one_initcall(initcall_from_entry(fn));
}
static void __init do_initcalls(void)
{
int level;
proc: give /proc/cmdline size Most /proc files don't have length (in fstat sense). This leads to inefficiencies when reading such files with APIs commonly found in modern programming languages. They open file, then fstat descriptor, get st_size == 0 and either assume file is empty or start reading without knowing target size. cat(1) does OK because it uses large enough buffer by default. But naive programs copy-pasted from SO aren't: let mut f = std::fs::File::open("/proc/cmdline").unwrap(); let mut buf: Vec<u8> = Vec::new(); f.read_to_end(&mut buf).unwrap(); will result in openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3 statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address) statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32 read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32 read(3, "apper/fedora_localhost--live-roo"..., 64) = 64 read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116 read(3, "", 12) open/stat is OK, lseek looks silly but there are 3 unnecessary reads because Rust starts with 32 bytes per Vec<u8> and grows from there. In case of /proc/cmdline, the length is known precisely. Make variables readonly while I'm at it. P.S.: I tried to scp /proc/cpuinfo today and got empty file but this is separate story. Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09 02:21:54 +08:00
size_t len = saved_command_line_len + 1;
char *command_line;
command_line = kzalloc(len, GFP_KERNEL);
if (!command_line)
panic("%s: Failed to allocate %zu bytes\n", __func__, len);
for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++) {
/* Parser modifies command_line, restore it each time */
strcpy(command_line, saved_command_line);
do_initcall_level(level, command_line);
}
kfree(command_line);
}
/*
* Ok, the machine is now initialized. None of the devices
* have been touched yet, but the CPU subsystem is up and
* running, and memory and process management works.
*
* Now we can finally start doing some real work..
*/
static void __init do_basic_setup(void)
{
cpuset_init_smp();
driver_init();
init_irq_proc();
do_ctors();
do_initcalls();
}
static void __init do_pre_smp_initcalls(void)
{
initcall_entry_t *fn;
trace_initcall_level("early");
for (fn = __initcall_start; fn < __initcall0_start; fn++)
do_one_initcall(initcall_from_entry(fn));
}
static int run_init_process(const char *init_filename)
{
const char *const *p;
argv_init[0] = init_filename;
pr_info("Run %s as init process\n", init_filename);
pr_debug(" with arguments:\n");
for (p = argv_init; *p; p++)
pr_debug(" %s\n", *p);
pr_debug(" with environment:\n");
for (p = envp_init; *p; p++)
pr_debug(" %s\n", *p);
return kernel_execve(init_filename, argv_init, envp_init);
}
static int try_to_run_init_process(const char *init_filename)
{
int ret;
ret = run_init_process(init_filename);
if (ret && ret != -ENOENT) {
pr_err("Starting init: %s exists but couldn't execute it (error %d)\n",
init_filename, ret);
}
return ret;
}
static noinline void __init kernel_init_freeable(void);
#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
bool rodata_enabled __ro_after_init = true;
arm64: fix rodata=full On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-17 23:40:22 +08:00
#ifndef arch_parse_debug_rodata
static inline bool arch_parse_debug_rodata(char *str) { return false; }
#endif
static int __init set_debug_rodata(char *str)
{
arm64: fix rodata=full On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-17 23:40:22 +08:00
if (arch_parse_debug_rodata(str))
return 0;
if (str && !strcmp(str, "on"))
rodata_enabled = true;
else if (str && !strcmp(str, "off"))
rodata_enabled = false;
else
pr_warn("Invalid option string for rodata: '%s'\n", str);
arm64: fix rodata=full On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-17 23:40:22 +08:00
return 0;
}
arm64: fix rodata=full On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-17 23:40:22 +08:00
early_param("rodata", set_debug_rodata);
#endif
#ifdef CONFIG_STRICT_KERNEL_RWX
static void mark_readonly(void)
{
if (rodata_enabled) {
init: fix false positives in W+X checking load_module() creates W+X mappings via __vmalloc_node_range() (from layout_and_allocate()->move_module()->module_alloc()) by using PAGE_KERNEL_EXEC. These mappings are later cleaned up via "call_rcu_sched(&freeinit->rcu, do_free_init)" from do_init_module(). This is a problem because call_rcu_sched() queues work, which can be run after debug_checkwx() is run, resulting in a race condition. If hit, the race results in a nasty splat about insecure W+X mappings, which results in a poor user experience as these are not the mappings that debug_checkwx() is intended to catch. This issue is observed on multiple arm64 platforms, and has been artificially triggered on an x86 platform. Address the race by flushing the queued work before running the arch-defined mark_rodata_ro() which then calls debug_checkwx(). Link: http://lkml.kernel.org/r/1525103946-29526-1-git-send-email-jhugo@codeaurora.org Fixes: e1a58320a38d ("x86/mm: Warn on W^X mappings") Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Reported-by: Timur Tabi <timur@codeaurora.org> Reported-by: Jan Glauber <jan.glauber@caviumnetworks.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-12 07:01:42 +08:00
/*
* load_module() results in W+X mappings, which are cleaned
* up with call_rcu(). Let's make sure that queued work is
init: fix false positives in W+X checking load_module() creates W+X mappings via __vmalloc_node_range() (from layout_and_allocate()->move_module()->module_alloc()) by using PAGE_KERNEL_EXEC. These mappings are later cleaned up via "call_rcu_sched(&freeinit->rcu, do_free_init)" from do_init_module(). This is a problem because call_rcu_sched() queues work, which can be run after debug_checkwx() is run, resulting in a race condition. If hit, the race results in a nasty splat about insecure W+X mappings, which results in a poor user experience as these are not the mappings that debug_checkwx() is intended to catch. This issue is observed on multiple arm64 platforms, and has been artificially triggered on an x86 platform. Address the race by flushing the queued work before running the arch-defined mark_rodata_ro() which then calls debug_checkwx(). Link: http://lkml.kernel.org/r/1525103946-29526-1-git-send-email-jhugo@codeaurora.org Fixes: e1a58320a38d ("x86/mm: Warn on W^X mappings") Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Reported-by: Timur Tabi <timur@codeaurora.org> Reported-by: Jan Glauber <jan.glauber@caviumnetworks.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-12 07:01:42 +08:00
* flushed so that we don't hit false positives looking for
* insecure pages which are W+X.
*/
rcu_barrier();
mark_rodata_ro();
rodata_test();
} else
pr_info("Kernel memory protection disabled.\n");
}
#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
static inline void mark_readonly(void)
{
pr_warn("Kernel memory protection not selected by kernel config.\n");
}
#else
static inline void mark_readonly(void)
{
pr_warn("This architecture does not have kernel memory protection.\n");
}
#endif
void __weak free_initmem(void)
{
free_initmem_default(POISON_FREE_INITMEM);
}
static int __ref kernel_init(void *unused)
{
int ret;
/*
* Wait until kthreadd is all set-up.
*/
wait_for_completion(&kthreadd_done);
kernel_init_freeable();
/* need to finish all async __init code before freeing the memory */
async_synchronize_full();
system_state = SYSTEM_FREEING_INITMEM;
kprobe_free_init_mem();
ftrace_free_init_mem();
kgdb_free_init_mem();
exit_boot_config();
free_initmem();
mark_readonly();
/*
* Kernel mappings are now finalized - update the userspace page-table
* to finalize PTI.
*/
pti_finalize();
system_state = SYSTEM_RUNNING;
numa_default_policy();
rcu_end_inkernel_boot();
kernel/sysctl: support setting sysctl parameters from kernel command line Patch series "support setting sysctl parameters from kernel command line", v3. This series adds support for something that seems like many people always wanted but nobody added it yet, so here's the ability to set sysctl parameters via kernel command line options in the form of sysctl.vm.something=1 The important part is Patch 1. The second, not so important part is an attempt to clean up legacy one-off parameters that do the same thing as a sysctl. I don't want to remove them completely for compatibility reasons, but with generic sysctl support the idea is to remove the one-off param handlers and treat the parameters as aliases for the sysctl variants. I have identified several parameters that mention sysctl counterparts in Documentation/admin-guide/kernel-parameters.txt but there might be more. The conversion also has varying level of success: - numa_zonelist_order is converted in Patch 2 together with adding the necessary infrastructure. It's easy as it doesn't really do anything but warn on deprecated value these days. - hung_task_panic is converted in Patch 3, but there's a downside that now it only accepts 0 and 1, while previously it was any integer value - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic, so there's no straighforward conversion possible - traceoff_on_warning is a flag without value and it would be required to handle that somehow in the conversion infractructure, which seems pointless for a single flag This patch (of 5): A recently proposed patch to add vm_swappiness command line parameter in addition to existing sysctl [1] made me wonder why we don't have a general support for passing sysctl parameters via command line. Googling found only somebody else wondering the same [2], but I haven't found any prior discussion with reasons why not to do this. Settings the vm_swappiness issue aside (the underlying issue might be solved in a different way), quick search of kernel-parameters.txt shows there are already some that exist as both sysctl and kernel parameter - hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning. A general mechanism would remove the need to add more of those one-offs and might be handy in situations where configuration by e.g. /etc/sysctl.d/ is impractical. Hence, this patch adds a new parse_args() pass that looks for parameters prefixed by 'sysctl.' and tries to interpret them as writes to the corresponding sys/ files using an temporary in-kernel procfs mount. This mechanism was suggested by Eric W. Biederman [3], as it handles all dynamically registered sysctl tables, even though we don't handle modular sysctls. Errors due to e.g. invalid parameter name or value are reported in the kernel log. The processing is hooked right before the init process is loaded, as some handlers might be more complicated than simple setters and might need some subsystems to be initialized. At the moment the init process can be started and eventually execute a process writing to /proc/sys/ then it should be also fine to do that from the kernel. Sysctls registered later on module load time are not set by this mechanism - it's expected that in such scenarios, setting sysctl values from userspace is practical enough. [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/ [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 12:40:24 +08:00
do_sysctl_args();
if (ramdisk_execute_command) {
ret = run_init_process(ramdisk_execute_command);
if (!ret)
return 0;
pr_err("Failed to execute %s (error %d)\n",
ramdisk_execute_command, ret);
}
/*
* We try each of these until one succeeds.
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine.
*/
if (execute_command) {
ret = run_init_process(execute_command);
if (!ret)
return 0;
panic("Requested init %s failed (error %d).",
execute_command, ret);
}
init: allow distribution configuration of default init Some init systems (eg. systemd) have init at their own paths, for example, /usr/lib/systemd/systemd. A compatibility symlink to one of the hardcoded init paths is provided by another package, usually named something like systemd-sysvcompat or similar. Currently distro maintainers who are hands-off on the bootloader are more or less required to include those compatibility links as part of their base distribution, because it's hard to migrate away from them since there's a risk some users will not get the message to set init= on the kernel command line appropriately. Moreover, for distributions where the init system is something the distribution itself is opinionated about (eg. Arch, which has systemd in the required `base` package), we could usually reasonably configure this ahead of time when building the distribution kernel. However, we currently simply don't have any way to configure the kernel to do this. Here's an example discussion where removing sysvcompat was discussed by distro maintainers[0]. This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set is tried before the hardcoded fallback list. So the order of precedence is now thus: 1. init= on command line (on failure: panic) 2. CONFIG_DEFAULT_INIT (on failure: try #3) 3. Hardcoded fallback list (on failure: panic) This new config parameter will allow distribution maintainers to move away from these compatibility links safely, without having to worry that their users might not have the right init=. There are also two other benefits of this over having the distribution maintain a symlink: 1. One of the value propositions over simply having distributions maintain a /sbin/init symlink via a package is that it also frees distributions which have a preferred default, but not mandatory, init system from having their package manager fight with their users for control of /{s,}bin/init. Instead, the distribution simply makes their preference known in CONFIG_DEFAULT_INIT, and if the user installs another init system and uninstalls the default one they can still make use of /{s,}bin/init and friends for their own uses. This makes more cases Just Work(tm) without the user having to perform extra configuration via init=. 2. Since before this we don't know which path the distribution actually _intends_ to serve init from, we don't pr_err if it is simply missing, and usually will just silently put the user in a /bin/sh shell. Now that the distribution can make a declaration of intent, we can be more vocal when this init system fails to launch for any reason, even if it's simply because no file exists at that location, speeding up the palaver of init/mount dependency/etc debugging a bit. [0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:50:53 +08:00
if (CONFIG_DEFAULT_INIT[0] != '\0') {
ret = run_init_process(CONFIG_DEFAULT_INIT);
if (ret)
pr_err("Default init %s failed (error %d)\n",
CONFIG_DEFAULT_INIT, ret);
else
return 0;
}
if (!try_to_run_init_process("/sbin/init") ||
!try_to_run_init_process("/etc/init") ||
!try_to_run_init_process("/bin/init") ||
!try_to_run_init_process("/bin/sh"))
return 0;
panic("No working init found. Try passing init= option to kernel. "
"See Linux Documentation/admin-guide/init.rst for guidance.");
}
/* Open /dev/console, for stdin/stdout/stderr, this should never fail */
void __init console_on_rootfs(void)
{
struct file *file = filp_open("/dev/console", O_RDWR, 0);
if (IS_ERR(file)) {
Revert "init/console: Use ttynull as a fallback when there is no console" This reverts commit 757055ae8dedf5333af17b3b5b4b70ba9bc9da4e. The commit caused that ttynull was used as the default console on several systems[1][2][3]. As a result, the console was blank even when a better alternative existed. It happened when there was no console configured on the command line and ttynull_init() was the first initcall calling register_console(). Or it happened when /dev/ did not exist when console_on_rootfs() was called. It was not able to open /dev/console even though a console driver was registered. It tried to add ttynull console but it obviously did not help. But ttynull became the preferred console and was used by /dev/console when it was available later. The commit tried to fix a historical problem that have been there for ages. The primary motivation was the commit 3cffa06aeef7ece30f6 ("printk/console: Allow to disable console output by using console="" or console=null"). It provided a clean solution for a workaround that was widely used and worked only by chance. This revert causes that the console="" or console=null command line options will again work only by chance. These options will cause that a particular console will be preferred and the default (tty) ones will not get enabled. There will be no console registered at all. As a result there won't be stdin, stdout, and stderr for the init process. But it worked exactly this way even before. The proper solution has to fulfill many conditions: + Register ttynull only when explicitly required or as the ultimate fallback. + ttynull should get associated with /dev/console but it must not become preferred console when used as a fallback. Especially, it must still be possible to replace it by a better console later. Such a change requires clean up of the register_console() code. Otherwise, it would be even harder to follow. Especially, the use of has_preferred_console and CON_CONSDEV flag is tricky. The clean up is risky. The ordering of consoles is not well defined. And any changes tend to break existing user settings. Do the revert at the least risky solution for now. [1] https://lore.kernel.org/linux-kselftest/20201221144302.GR4077@smile.fi.intel.com/ [2] https://lore.kernel.org/lkml/d2a3b3c0-e548-7dd1-730f-59bc5c04e191@synopsys.com/ [3] https://patchwork.ozlabs.org/project/linux-um/patch/20210105120128.10854-1-thomas@m3y3r.de/ Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reported-by: Vineet Gupta <vgupta@synopsys.com> Reported-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-08 19:48:47 +08:00
pr_err("Warning: unable to open an initial console.\n");
return;
}
init_dup(file);
init_dup(file);
init_dup(file);
fput(file);
}
static noinline void __init kernel_init_freeable(void)
{
/* Now the scheduler is fully set up and can do blocking allocations */
gfp_allowed_mask = __GFP_BITS_MASK;
cpuset,mm: update tasks' mems_allowed in time Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
/*
* init can allocate pages on any node
*/
set_mems_allowed(node_states[N_MEMORY]);
pid: take a reference when initializing `cad_pid` During boot, kernel_init_freeable() initializes `cad_pid` to the init task's struct pid. Later on, we may change `cad_pid` via a sysctl, and when this happens proc_do_cad_pid() will increment the refcount on the new pid via get_pid(), and will decrement the refcount on the old pid via put_pid(). As we never called get_pid() when we initialized `cad_pid`, we decrement a reference we never incremented, can therefore free the init task's struct pid early. As there can be dangling references to the struct pid, we can later encounter a use-after-free (e.g. when delivering signals). This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to have been around since the conversion of `cad_pid` to struct pid in commit 9ec52099e4b8 ("[PATCH] replace cad_pid by a struct pid") from the pre-KASAN stone age of v2.6.19. Fix this by getting a reference to the init task's struct pid when we assign it to `cad_pid`. Full KASAN splat below. ================================================================== BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline] BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509 Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273 CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1 Hardware name: linux,dummy-virt (DT) Call trace: ns_of_pid include/linux/pid.h:153 [inline] task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509 do_notify_parent+0x308/0xe60 kernel/signal.c:1950 exit_notify kernel/exit.c:682 [inline] do_exit+0x2334/0x2bd0 kernel/exit.c:845 do_group_exit+0x108/0x2c8 kernel/exit.c:922 get_signal+0x4e4/0x2a88 kernel/signal.c:2781 do_signal arch/arm64/kernel/signal.c:882 [inline] do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936 work_pending+0xc/0x2dc Allocated by task 0: slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516 slab_alloc_node mm/slub.c:2907 [inline] slab_alloc mm/slub.c:2915 [inline] kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920 alloc_pid+0xdc/0xc00 kernel/pid.c:180 copy_process+0x2794/0x5e18 kernel/fork.c:2129 kernel_clone+0x194/0x13c8 kernel/fork.c:2500 kernel_thread+0xd4/0x110 kernel/fork.c:2552 rest_init+0x44/0x4a0 init/main.c:687 arch_call_rest_init+0x1c/0x28 start_kernel+0x520/0x554 init/main.c:1064 0x0 Freed by task 270: slab_free_hook mm/slub.c:1562 [inline] slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600 slab_free mm/slub.c:3161 [inline] kmem_cache_free+0x224/0x8e0 mm/slub.c:3177 put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114 put_pid+0x30/0x48 kernel/pid.c:109 proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401 proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591 proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617 call_write_iter include/linux/fs.h:1977 [inline] new_sync_write+0x3ac/0x510 fs/read_write.c:518 vfs_write fs/read_write.c:605 [inline] vfs_write+0x9c4/0x1018 fs/read_write.c:585 ksys_write+0x124/0x240 fs/read_write.c:658 __do_sys_write fs/read_write.c:670 [inline] __se_sys_write fs/read_write.c:667 [inline] __arm64_sys_write+0x78/0xb0 fs/read_write.c:667 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall arch/arm64/kernel/syscall.c:49 [inline] el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129 do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168 el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416 el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432 el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701 The buggy address belongs to the object at ffff23794dda0000 which belongs to the cache pid of size 224 The buggy address is located 4 bytes inside of 224-byte region [ffff23794dda0000, ffff23794dda00e0) The buggy address belongs to the page: page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0 head:(____ptrval____) order:1 compound_mapcount:0 flags: 0x3fffc0000010200(slab|head) raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080 raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00 ================================================================== Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05 11:01:14 +08:00
cad_pid = get_pid(task_pid(current));
smp_prepare_cpus(setup_max_cpus);
workqueue: make workqueue available early during boot Workqueue is currently initialized in an early init call; however, there are cases where early boot code has to be split and reordered to come after workqueue initialization or the same code path which makes use of workqueues is used both before workqueue initailization and after. The latter cases have to gate workqueue usages with keventd_up() tests, which is nasty and easy to get wrong. Workqueue usages have become widespread and it'd be a lot more convenient if it can be used very early from boot. This patch splits workqueue initialization into two steps. workqueue_init_early() which sets up the basic data structures so that workqueues can be created and work items queued, and workqueue_init() which actually brings up workqueues online and starts executing queued work items. The former step can be done very early during boot once memory allocation, cpumasks and idr are initialized. The latter right after kthreads become available. This allows work item queueing and canceling from very early boot which is what most of these use cases want. * As systemd_wq being initialized doesn't indicate that workqueue is fully online anymore, update keventd_up() to test wq_online instead. The follow-up patches will get rid of all its usages and the function itself. * Flushing doesn't make sense before workqueue is fully initialized. The flush functions trigger WARN and return immediately before fully online. * Work items are never in-flight before fully online. Canceling can always succeed by skipping the flush step. * Some code paths can no longer assume to be called with irq enabled as irq is disabled during early boot. Use irqsave/restore operations instead. v2: Watchdog init, which requires timer to be running, moved from workqueue_init_early() to workqueue_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
2016-09-17 03:49:32 +08:00
workqueue_init();
mm: move mm_percpu_wq initialization earlier Yang Li has reported that drain_all_pages triggers a WARN_ON which means that this function is called earlier than the mm_percpu_wq is initialized on arm64 with CMA configured: WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c Modules linked in: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127 Hardware name: Freescale Layerscape 2088A RDB Board (DT) task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000 PC is at drain_all_pages+0x244/0x25c LR is at start_isolate_page_range+0x14c/0x1f0 [...] drain_all_pages+0x244/0x25c start_isolate_page_range+0x14c/0x1f0 alloc_contig_range+0xec/0x354 cma_alloc+0x100/0x1fc dma_alloc_from_contiguous+0x3c/0x44 atomic_pool_init+0x7c/0x208 arm64_dma_init+0x44/0x4c do_one_initcall+0x38/0x128 kernel_init_freeable+0x1a0/0x240 kernel_init+0x10/0xfc ret_from_fork+0x10/0x20 Fix this by moving the whole setup_vmstat which is an initcall right now to init_mm_internals which will be called right after the WQ subsystem is initialized. Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Yang Li <pku.leo@gmail.com> Tested-by: Yang Li <pku.leo@gmail.com> Tested-by: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-01 06:11:47 +08:00
init_mm_internals();
rcu_init_tasks_generic();
do_pre_smp_initcalls();
lockup_detector_init();
smp_init();
sched_init_smp();
padata_init();
page_alloc_init_late();
do_basic_setup();
kunit_run_all_tests();
init/initramfs.c: do unpacking asynchronously Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3. These two patches are independent, but better-together. The second is a rather trivial patch that simply allows the developer to change "/sbin/modprobe" to something else - e.g. the empty string, so that all request_module() during early boot return -ENOENT early, without even spawning a usermode helper, needlessly synchronizing with the initramfs unpacking. The first patch delegates decompressing the initramfs to a worker thread, allowing do_initcalls() in main.c to proceed to the device_ and late_ initcalls without waiting for that decompression (and populating of rootfs) to finish. Obviously, some of those later calls may rely on the initramfs being available, so I've added synchronization points in the firmware loader and usermodehelper paths - there might be other places that would need this, but so far no one has been able to think of any places I have missed. There's not much to win if most of the functionality needed during boot is only available as modules. But systems with a custom-made .config and initramfs can boot faster, partly due to utilizing more than one cpu earlier, partly by avoiding known-futile modprobe calls (which would still trigger synchronization with the initramfs unpacking, thus eliminating most of the first benefit). This patch (of 2): Most of the boot process doesn't actually need anything from the initramfs, until of course PID1 is to be executed. So instead of doing the decompressing and populating of the initramfs synchronously in populate_rootfs() itself, push that off to a worker thread. This is primarily motivated by an embedded ppc target, where unpacking even the rather modest sized initramfs takes 0.6 seconds, which is long enough that the external watchdog becomes unhappy that it doesn't get attention soon enough. By doing the initramfs decompression in a worker thread, we get to do the device_initcalls and hence start petting the watchdog much sooner. Normal desktops might benefit as well. On my mostly stock Ubuntu kernel, my initramfs is a 26M xz-compressed blob, decompressing to around 126M. That takes almost two seconds: [ 0.201454] Trying to unpack rootfs image as initramfs... [ 1.976633] Freeing initrd memory: 29416K Before this patch, these lines occur consecutively in dmesg. With this patch, the timestamps on these two lines is roughly the same as above, but with 172 lines inbetween - so more than one cpu has been kept busy doing work that would otherwise only happen after the populate_rootfs() finished. Should one of the initcalls done after rootfs_initcall time (i.e., device_ and late_ initcalls) need something from the initramfs (say, a kernel module or a firmware blob), it will simply wait for the initramfs unpacking to be done before proceeding, which should in theory make this completely safe. But if some driver pokes around in the filesystem directly and not via one of the official kernel interfaces (i.e. request_firmware*(), call_usermodehelper*) that theory may not hold - also, I certainly might have missed a spot when sprinkling wait_for_initramfs(). So there is an escape hatch in the form of an initramfs_async= command line parameter. Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 09:05:42 +08:00
wait_for_initramfs();
console_on_rootfs();
/*
* check if there is an early userspace init. If yes, let it do all
* the work
*/
if (init_eaccess(ramdisk_execute_command) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
/*
* Ok, we have completed the initial bootup, and
* we're essentially up and running. Get rid of the
* initmem segments and start the user-mode stuff..
*
* rootfs is available now, try loading the public keys
* and default modules
*/
integrity_load_keys();
}