2019-05-19 20:08:55 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* linux/init/main.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 1991, 1992 Linus Torvalds
|
|
|
|
*
|
|
|
|
* GK 2/5/95 - Changed to support mounting root fs via NFS
|
|
|
|
* Added initrd & change_root: Werner Almesberger & Hans Lermen, Feb '96
|
|
|
|
* Moan early if gcc is old, avoiding bogus kernels - Paul Gortmaker, May '96
|
2014-08-09 05:23:44 +08:00
|
|
|
* Simplified starting of init: Michael A. Griffith <grif@acm.org>
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
|
2013-04-30 07:18:20 +08:00
|
|
|
#define DEBUG /* Enable initcall_debug */
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/types.h>
|
2016-07-24 02:01:45 +08:00
|
|
|
#include <linux/extable.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/proc_fs.h>
|
2017-02-05 21:24:31 +08:00
|
|
|
#include <linux/binfmts.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/syscalls.h>
|
2008-02-14 16:41:09 +08:00
|
|
|
#include <linux/stackprotector.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/string.h>
|
|
|
|
#include <linux/ctype.h>
|
|
|
|
#include <linux/delay.h>
|
|
|
|
#include <linux/ioport.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/initrd.h>
|
2018-10-31 06:09:49 +08:00
|
|
|
#include <linux/memblock.h>
|
2009-06-13 08:42:08 +08:00
|
|
|
#include <linux/acpi.h>
|
2020-01-11 00:03:44 +08:00
|
|
|
#include <linux/bootconfig.h>
|
2017-04-13 06:37:14 +08:00
|
|
|
#include <linux/console.h>
|
2017-02-09 01:51:31 +08:00
|
|
|
#include <linux/nmi.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/percpu.h>
|
|
|
|
#include <linux/kmod.h>
|
2020-09-10 16:55:05 +08:00
|
|
|
#include <linux/kprobes.h>
|
2022-09-15 23:03:51 +08:00
|
|
|
#include <linux/kmsan.h>
|
mm: rewrite vmap layer
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
threads vanilla vmap rewrite
1 14700 2900
2 33600 3000
4 49500 2800
8 70631 2900
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:27:03 +08:00
|
|
|
#include <linux/vmalloc.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/kernel_stat.h>
|
2006-12-07 09:14:08 +08:00
|
|
|
#include <linux/start_kernel.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/security.h>
|
2008-06-26 17:21:34 +08:00
|
|
|
#include <linux/smp.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/profile.h>
|
mm: add Kernel Electric-Fence infrastructure
Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.
This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors. This
series enables KFENCE for the x86 and arm64 architectures, and adds
KFENCE hooks to the SLAB and SLUB allocators.
KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design, is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.
KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.
Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.
To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE.
The KFENCE memory pool is of fixed size, and if the pool is exhausted no
further KFENCE allocations occur. The default config is conservative
with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
pages).
We have verified by running synthetic benchmarks (sysbench I/O,
hackbench) and production server-workload benchmarks that a kernel with
KFENCE (using sample intervals 100-500ms) is performance-neutral
compared to a non-KFENCE baseline kernel.
KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
properties. The name "KFENCE" is a homage to the Electric Fence Malloc
Debugger [2].
For more details, see Documentation/dev-tools/kfence.rst added in the
series -- also viewable here:
https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst
[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence
This patch (of 9):
This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.
KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design, is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.
KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error. To detect out-of-bounds
writes to memory within the object's page itself, KFENCE also uses
pattern-based redzones. The following figure illustrates the page
layout:
---+-----------+-----------+-----------+-----------+-----------+---
| xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx |
| xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx |
| x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x |
| xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx |
| xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx |
| xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx |
---+-----------+-----------+-----------+-----------+-----------+---
Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval, a
guarded allocation from the KFENCE object pool is returned to the main
allocator (SLAB or SLUB). At this point, the timer is reset, and the
next allocation is set up after the expiration of the interval.
To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.
For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).
[elver@google.com: fix parameter description for kfence_object_start()]
Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com
[elver@google.com: avoid stalling work queue task without allocations]
Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com
[elver@google.com: fix potential deadlock due to wake_up()]
Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com
Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com
[elver@google.com: add option to use KFENCE without static keys]
Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com
[elver@google.com: add missing copyright and description headers]
Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com
Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Co-developed-by: Marco Elver <elver@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Joern Engel <joern@purestorage.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:18:53 +08:00
|
|
|
#include <linux/kfence.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/rcupdate.h>
|
2021-04-09 06:38:59 +08:00
|
|
|
#include <linux/srcu.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/moduleparam.h>
|
|
|
|
#include <linux/kallsyms.h>
|
2021-07-08 09:09:13 +08:00
|
|
|
#include <linux/buildid.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/writeback.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/cpuset.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
|
|
|
#include <linux/cgroup.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/efi.h>
|
2007-02-16 17:28:01 +08:00
|
|
|
#include <linux/tick.h>
|
2017-10-27 10:42:28 +08:00
|
|
|
#include <linux/sched/isolation.h>
|
2007-02-18 13:22:39 +08:00
|
|
|
#include <linux/interrupt.h>
|
2006-07-14 15:24:40 +08:00
|
|
|
#include <linux/taskstats_kern.h>
|
2006-07-14 15:24:36 +08:00
|
|
|
#include <linux/delayacct.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/unistd.h>
|
2018-04-11 07:32:36 +08:00
|
|
|
#include <linux/utsname.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/rmap.h>
|
|
|
|
#include <linux/mempolicy.h>
|
|
|
|
#include <linux/key.h>
|
2006-07-03 15:24:33 +08:00
|
|
|
#include <linux/debug_locks.h>
|
2008-04-30 15:55:01 +08:00
|
|
|
#include <linux/debugobjects.h>
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
#include <linux/lockdep.h>
|
2009-06-11 20:22:39 +08:00
|
|
|
#include <linux/kmemleak.h>
|
2020-06-04 06:59:35 +08:00
|
|
|
#include <linux/padata.h>
|
2006-12-08 18:38:01 +08:00
|
|
|
#include <linux/pid_namespace.h>
|
2019-12-10 03:33:03 +08:00
|
|
|
#include <linux/device/driver.h>
|
2007-05-09 17:34:32 +08:00
|
|
|
#include <linux/kthread.h>
|
2007-11-10 05:39:39 +08:00
|
|
|
#include <linux/sched.h>
|
2017-02-05 21:47:12 +08:00
|
|
|
#include <linux/sched/init.h>
|
2008-02-06 17:36:44 +08:00
|
|
|
#include <linux/signal.h>
|
2008-04-29 16:03:13 +08:00
|
|
|
#include <linux/idr.h>
|
2010-05-21 10:04:29 +08:00
|
|
|
#include <linux/kgdb.h>
|
2008-08-15 03:45:08 +08:00
|
|
|
#include <linux/ftrace.h>
|
2009-01-08 00:45:46 +08:00
|
|
|
#include <linux/async.h>
|
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
very early at kernel initialization, before any driver-core device
is registered. Every device with a major/minor will provide a
device node in devtmpfs.
Devtmpfs can be changed and altered by userspace at any time,
and in any way needed - just like today's udev-mounted tmpfs.
Unmodified udev versions will run just fine on top of it, and will
recognize an already existing kernel-created device node and use it.
The default node permissions are root:root 0600. Proper permissions
and user/group ownership, meaningful symlinks, all other policy still
needs to be applied by userspace.
If a node is created by devtmps, devtmpfs will remove the device node
when the device goes away. If the device node was created by
userspace, or the devtmpfs created node was replaced by userspace, it
will no longer be removed by devtmpfs.
If it is requested to auto-mount it, it makes init=/bin/sh work
without any further userspace support. /dev will be fully populated
and dynamic, and always reflect the current device state of the kernel.
With the commonly used dynamic device numbers, it solves the problem
where static devices nodes may point to the wrong devices.
It is intended to make the initial bootup logic simpler and more robust,
by de-coupling the creation of the inital environment, to reliably run
userspace processes, from a complex userspace bootstrap logic to provide
a working /dev.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Tested-By: Harald Hoyer <harald@redhat.com>
Tested-By: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-30 21:23:42 +08:00
|
|
|
#include <linux/shmem_fs.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2010-11-18 06:17:33 +08:00
|
|
|
#include <linux/perf_event.h>
|
2012-10-11 09:28:25 +08:00
|
|
|
#include <linux/ptrace.h>
|
2017-12-04 22:07:36 +08:00
|
|
|
#include <linux/pti.h>
|
2013-01-19 06:05:56 +08:00
|
|
|
#include <linux/blkdev.h>
|
2018-07-20 04:55:41 +08:00
|
|
|
#include <linux/sched/clock.h>
|
2017-02-09 01:51:36 +08:00
|
|
|
#include <linux/sched/task.h>
|
2017-02-09 01:51:37 +08:00
|
|
|
#include <linux/sched/task_stack.h>
|
2013-07-12 01:12:32 +08:00
|
|
|
#include <linux/context_tracking.h>
|
2013-09-10 22:52:35 +08:00
|
|
|
#include <linux/random.h>
|
2014-06-05 07:12:17 +08:00
|
|
|
#include <linux/list.h>
|
2014-11-05 23:01:15 +08:00
|
|
|
#include <linux/integrity.h>
|
take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-01 22:57:28 +08:00
|
|
|
#include <linux/proc_ns.h>
|
2015-04-15 06:47:20 +08:00
|
|
|
#include <linux/io.h>
|
2016-11-14 14:15:05 +08:00
|
|
|
#include <linux/cache.h>
|
2017-02-28 06:30:22 +08:00
|
|
|
#include <linux/rodata_test.h>
|
2018-02-21 01:37:51 +08:00
|
|
|
#include <linux/jump_label.h>
|
2018-05-26 05:48:00 +08:00
|
|
|
#include <linux/mem_encrypt.h>
|
2019-11-15 02:02:54 +08:00
|
|
|
#include <linux/kcsan.h>
|
2020-07-22 17:14:02 +08:00
|
|
|
#include <linux/init_syscalls.h>
|
2021-02-26 09:21:27 +08:00
|
|
|
#include <linux/stackdepot.h>
|
2022-06-29 14:04:23 +08:00
|
|
|
#include <linux/randomize_kstack.h>
|
2022-02-06 01:01:25 +08:00
|
|
|
#include <net/net_namespace.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#include <asm/io.h>
|
|
|
|
#include <asm/bugs.h>
|
|
|
|
#include <asm/setup.h>
|
2005-07-29 12:15:30 +08:00
|
|
|
#include <asm/sections.h>
|
2006-01-06 16:12:01 +08:00
|
|
|
#include <asm/cacheflush.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-03-23 22:18:03 +08:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/initcall.h>
|
|
|
|
|
2020-08-05 04:47:43 +08:00
|
|
|
#include <kunit/test.h>
|
|
|
|
|
2007-02-26 23:45:41 +08:00
|
|
|
static int kernel_init(void *);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
extern void init_IRQ(void);
|
|
|
|
extern void radix_tree_init(void);
|
Maple Tree: add new data structure
Patch series "Introducing the Maple Tree"
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Davidlor said
: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage. Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit. This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead. Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.
A similar work has been discovered in the academic press
https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
Sheer coincidence. We designed our tree with the intention of solving the
hardest problem first. Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article. So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.
This patch (of 70):
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
There is additional BUG_ON() calls added within the tree, most of which
are in debug code. These will be replaced with a WARN_ON() call in the
future. There is also additional BUG_ON() calls within the code which
will also be reduced in number at a later date. These exist to catch
things such as out-of-range accesses which would crash anyways.
Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
|
|
|
extern void maple_tree_init(void);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2011-01-20 19:06:35 +08:00
|
|
|
/*
|
|
|
|
* Debug helper: via this flag we know that we are in 'early bootup code'
|
|
|
|
* where only the boot processor is running with IRQ disabled. This means
|
|
|
|
* two things - IRQ must not be enabled before the flag is cleared and some
|
|
|
|
* operations which are not allowed with IRQ disabled are allowed while the
|
|
|
|
* flag is set.
|
|
|
|
*/
|
|
|
|
bool early_boot_irqs_disabled __read_mostly;
|
|
|
|
|
rcu: Teach RCU that idle task is not quiscent state at boot
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton. And cleans up the variable-name
and function-name language. ;-)
The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state. This in turn causes
RCU to prematurely end grace periods during this time.
This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition. RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().
Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1. This allows boot-time code that
calls synchronize_rcu() to proceed normally. Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence. Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preeempted.
In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.
Changes since v4:
o Changes the name of the introduced function and variable to
be less emotional. ;-)
Changes since v3:
o WARN_ON(nr_context_switches() > 0) to verify that RCU
switches out of boot-time mode before the first context
switch, as suggested by Nick Piggin.
Changes since v2:
o Created rcu_blocking_is_gp() internal-to-RCU API that
determines whether a call to synchronize_rcu() is itself
a grace period.
o The definition of rcu_blocking_is_gp() for rcuclassic and
rcutree checks to see if but a single CPU is online.
o The definition of rcu_blocking_is_gp() for rcupreempt
checks to see both if but a single CPU is online and if
the system is still in early boot.
This allows rcupreempt to again work correctly if running
on a single CPU after booting is complete.
o Added check to rcupreempt's synchronize_sched() for there
being but one online CPU.
Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 10:03:42 +08:00
|
|
|
enum system_states system_state __read_mostly;
|
2005-04-17 06:20:36 +08:00
|
|
|
EXPORT_SYMBOL(system_state);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Boot command-line arguments
|
|
|
|
*/
|
|
|
|
#define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
|
|
|
|
#define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
|
|
|
|
|
|
|
|
extern void time_init(void);
|
|
|
|
/* Default late time init is NULL. archs can override this later. */
|
2009-01-07 06:41:10 +08:00
|
|
|
void (*__initdata late_time_init)(void);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
|
|
|
/* Untouched command line saved by arch-specific code. */
|
|
|
|
char __initdata boot_command_line[COMMAND_LINE_SIZE];
|
|
|
|
/* Untouched saved command line (eg. for /proc) */
|
proc: give /proc/cmdline size
Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.
cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:
let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();
will result in
openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)
open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.
In case of /proc/cmdline, the length is known precisely.
Make variables readonly while I'm at it.
P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.
Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09 02:21:54 +08:00
|
|
|
char *saved_command_line __ro_after_init;
|
|
|
|
unsigned int saved_command_line_len __ro_after_init;
|
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
|
|
|
/* Command line for parameter parsing */
|
|
|
|
static char *static_command_line;
|
2020-01-11 00:04:43 +08:00
|
|
|
/* Untouched extra command line */
|
|
|
|
static char *extra_command_line;
|
2020-01-11 00:04:55 +08:00
|
|
|
/* Extra init arguments */
|
|
|
|
static char *extra_init_args;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-02-08 08:07:37 +08:00
|
|
|
#ifdef CONFIG_BOOT_CONFIG
|
|
|
|
/* Is bootconfig on command line? */
|
2023-02-28 09:01:42 +08:00
|
|
|
static bool bootconfig_found;
|
2021-09-04 23:54:16 +08:00
|
|
|
static size_t initargs_offs;
|
2020-02-08 08:07:37 +08:00
|
|
|
#else
|
|
|
|
# define bootconfig_found false
|
2021-09-04 23:54:16 +08:00
|
|
|
# define initargs_offs 0
|
2020-02-08 08:07:37 +08:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static char *execute_command;
|
2020-06-07 15:06:29 +08:00
|
|
|
static char *ramdisk_execute_command = "/init";
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-10-20 03:48:53 +08:00
|
|
|
/*
|
|
|
|
* Used to generate warnings if static_key manipulation functions are used
|
|
|
|
* before jump_label_init is called.
|
|
|
|
*/
|
2014-08-09 05:23:44 +08:00
|
|
|
bool static_key_initialized __read_mostly;
|
2013-10-20 03:48:53 +08:00
|
|
|
EXPORT_SYMBOL_GPL(static_key_initialized);
|
|
|
|
|
2007-07-16 14:41:07 +08:00
|
|
|
/*
|
|
|
|
* If set, this is an indication to the drivers that reset the underlying
|
|
|
|
* device before going ahead with the initialization otherwise driver might
|
|
|
|
* rely on the BIOS and skip the reset operation.
|
|
|
|
*
|
|
|
|
* This is useful if kernel is booting in an unreliable environment.
|
2015-02-24 05:05:56 +08:00
|
|
|
* For ex. kdump situation where previous kernel has crashed, BIOS has been
|
2007-07-16 14:41:07 +08:00
|
|
|
* skipped and devices will be in unknown state.
|
|
|
|
*/
|
|
|
|
unsigned int reset_devices;
|
|
|
|
EXPORT_SYMBOL(reset_devices);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-09-27 16:50:44 +08:00
|
|
|
static int __init set_reset_devices(char *str)
|
|
|
|
{
|
|
|
|
reset_devices = 1;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
__setup("reset_devices", set_reset_devices);
|
|
|
|
|
2014-08-09 05:23:44 +08:00
|
|
|
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
|
|
|
|
const char *envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
|
2005-04-17 06:20:36 +08:00
|
|
|
static const char *panic_later, *panic_param;
|
|
|
|
|
2010-08-12 13:04:18 +08:00
|
|
|
extern const struct obs_kernel_param __setup_start[], __setup_end[];
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-01-21 06:59:27 +08:00
|
|
|
static bool __init obsolete_checksetup(char *line)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2010-08-12 13:04:18 +08:00
|
|
|
const struct obs_kernel_param *p;
|
2016-01-21 06:59:27 +08:00
|
|
|
bool had_early_param = false;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
p = __setup_start;
|
|
|
|
do {
|
|
|
|
int n = strlen(p->str);
|
2011-10-10 06:03:37 +08:00
|
|
|
if (parameqn(line, p->str, n)) {
|
2005-04-17 06:20:36 +08:00
|
|
|
if (p->early) {
|
2006-09-26 16:52:32 +08:00
|
|
|
/* Already done in parse_early_param?
|
|
|
|
* (Needs exact match on param part).
|
|
|
|
* Keep iterating, as we can have early
|
|
|
|
* params and __setups of same names 8( */
|
2005-04-17 06:20:36 +08:00
|
|
|
if (line[n] == '\0' || line[n] == '=')
|
2016-01-21 06:59:27 +08:00
|
|
|
had_early_param = true;
|
2005-04-17 06:20:36 +08:00
|
|
|
} else if (!p->setup_func) {
|
2013-04-30 07:18:20 +08:00
|
|
|
pr_warn("Parameter %s is obsolete, ignored\n",
|
|
|
|
p->str);
|
2016-01-21 06:59:27 +08:00
|
|
|
return true;
|
2005-04-17 06:20:36 +08:00
|
|
|
} else if (p->setup_func(line + n))
|
2016-01-21 06:59:27 +08:00
|
|
|
return true;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
p++;
|
|
|
|
} while (p < __setup_end);
|
2006-09-26 16:52:32 +08:00
|
|
|
|
|
|
|
return had_early_param;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This should be approx 2 Bo*oMips to start (note initial shift), and will
|
|
|
|
* still work even if initially too large, it will just take slightly longer
|
|
|
|
*/
|
|
|
|
unsigned long loops_per_jiffy = (1<<12);
|
|
|
|
EXPORT_SYMBOL(loops_per_jiffy);
|
|
|
|
|
|
|
|
static int __init debug_kernel(char *str)
|
|
|
|
{
|
2014-06-05 07:11:46 +08:00
|
|
|
console_loglevel = CONSOLE_LOGLEVEL_DEBUG;
|
2008-02-08 20:21:58 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int __init quiet_kernel(char *str)
|
|
|
|
{
|
2014-06-05 07:11:46 +08:00
|
|
|
console_loglevel = CONSOLE_LOGLEVEL_QUIET;
|
2008-02-08 20:21:58 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-02-08 20:21:58 +08:00
|
|
|
early_param("debug", debug_kernel);
|
|
|
|
early_param("quiet", quiet_kernel);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
static int __init loglevel(char *str)
|
|
|
|
{
|
2011-09-21 15:51:40 +08:00
|
|
|
int newlevel;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only update loglevel value when a correct setting was passed,
|
|
|
|
* to prevent blind crashes (when loglevel being set to 0) that
|
|
|
|
* are quite hard to debug
|
|
|
|
*/
|
|
|
|
if (get_option(&str, &newlevel)) {
|
|
|
|
console_loglevel = newlevel;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return -EINVAL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-02-08 20:21:58 +08:00
|
|
|
early_param("loglevel", loglevel);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-04-26 14:53:30 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_INITRD
|
2022-04-06 10:31:19 +08:00
|
|
|
static void * __init get_boot_config_from_initrd(size_t *_size)
|
2020-04-26 14:53:30 +08:00
|
|
|
{
|
|
|
|
u32 size, csum;
|
|
|
|
char *data;
|
|
|
|
u32 *hdr;
|
2020-11-13 01:27:31 +08:00
|
|
|
int i;
|
2020-04-26 14:53:30 +08:00
|
|
|
|
|
|
|
if (!initrd_end)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
data = (char *)initrd_end - BOOTCONFIG_MAGIC_LEN;
|
2020-11-13 01:27:31 +08:00
|
|
|
/*
|
|
|
|
* Since Grub may align the size of initrd to 4, we must
|
|
|
|
* check the preceding 3 bytes as well.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 4; i++) {
|
|
|
|
if (!memcmp(data, BOOTCONFIG_MAGIC, BOOTCONFIG_MAGIC_LEN))
|
|
|
|
goto found;
|
|
|
|
data--;
|
|
|
|
}
|
|
|
|
return NULL;
|
2020-04-26 14:53:30 +08:00
|
|
|
|
2020-11-13 01:27:31 +08:00
|
|
|
found:
|
2020-04-26 14:53:30 +08:00
|
|
|
hdr = (u32 *)(data - 8);
|
2020-11-20 10:29:04 +08:00
|
|
|
size = le32_to_cpu(hdr[0]);
|
|
|
|
csum = le32_to_cpu(hdr[1]);
|
2020-04-26 14:53:30 +08:00
|
|
|
|
|
|
|
data = ((void *)hdr) - size;
|
|
|
|
if ((unsigned long)data < initrd_start) {
|
|
|
|
pr_err("bootconfig size %d is greater than initrd size %ld\n",
|
|
|
|
size, initrd_end - initrd_start);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2022-04-06 10:31:09 +08:00
|
|
|
if (xbc_calc_checksum(data, size) != csum) {
|
|
|
|
pr_err("bootconfig checksum failed\n");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-04-26 14:53:30 +08:00
|
|
|
/* Remove bootconfig from initramfs/initrd */
|
|
|
|
initrd_end = (unsigned long)data;
|
|
|
|
if (_size)
|
|
|
|
*_size = size;
|
|
|
|
|
|
|
|
return data;
|
|
|
|
}
|
|
|
|
#else
|
2022-04-06 10:31:19 +08:00
|
|
|
static void * __init get_boot_config_from_initrd(size_t *_size)
|
2020-04-26 14:53:30 +08:00
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-01-11 00:03:44 +08:00
|
|
|
#ifdef CONFIG_BOOT_CONFIG
|
2020-01-11 00:04:43 +08:00
|
|
|
|
2020-09-15 15:03:24 +08:00
|
|
|
static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
|
2020-01-11 00:04:43 +08:00
|
|
|
|
|
|
|
#define rest(dst, end) ((end) > (dst) ? (end) - (dst) : 0)
|
|
|
|
|
|
|
|
static int __init xbc_snprint_cmdline(char *buf, size_t size,
|
|
|
|
struct xbc_node *root)
|
|
|
|
{
|
|
|
|
struct xbc_node *knode, *vnode;
|
|
|
|
char *end = buf + size;
|
|
|
|
const char *val;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
xbc_node_for_each_key_value(root, knode, val) {
|
|
|
|
ret = xbc_node_compose_key_after(root, knode,
|
|
|
|
xbc_namebuf, XBC_KEYLEN_MAX);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
vnode = xbc_node_get_child(knode);
|
2020-02-20 20:19:42 +08:00
|
|
|
if (!vnode) {
|
|
|
|
ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
buf += ret;
|
2020-01-11 00:04:43 +08:00
|
|
|
continue;
|
2020-02-20 20:19:42 +08:00
|
|
|
}
|
2020-01-11 00:04:43 +08:00
|
|
|
xbc_array_for_each_value(vnode, val) {
|
2020-02-20 20:19:42 +08:00
|
|
|
ret = snprintf(buf, rest(buf, end), "%s=\"%s\" ",
|
|
|
|
xbc_namebuf, val);
|
2020-01-11 00:04:43 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
buf += ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return buf - (end - size);
|
|
|
|
}
|
|
|
|
#undef rest
|
|
|
|
|
|
|
|
/* Make an extra command line under given key word */
|
|
|
|
static char * __init xbc_make_cmdline(const char *key)
|
|
|
|
{
|
|
|
|
struct xbc_node *root;
|
|
|
|
char *new_cmdline;
|
|
|
|
int ret, len = 0;
|
|
|
|
|
|
|
|
root = xbc_find_node(key);
|
|
|
|
if (!root)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* Count required buffer size */
|
|
|
|
len = xbc_snprint_cmdline(NULL, 0, root);
|
|
|
|
if (len <= 0)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
new_cmdline = memblock_alloc(len + 1, SMP_CACHE_BYTES);
|
|
|
|
if (!new_cmdline) {
|
|
|
|
pr_err("Failed to allocate memory for extra kernel cmdline.\n");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = xbc_snprint_cmdline(new_cmdline, len + 1, root);
|
|
|
|
if (ret < 0 || ret > len) {
|
|
|
|
pr_err("Failed to print extra kernel cmdline.\n");
|
2021-11-06 04:43:22 +08:00
|
|
|
memblock_free(new_cmdline, len + 1);
|
2020-01-11 00:04:43 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return new_cmdline;
|
|
|
|
}
|
|
|
|
|
2020-02-08 08:07:37 +08:00
|
|
|
static int __init bootconfig_params(char *param, char *val,
|
|
|
|
const char *unused, void *arg)
|
|
|
|
{
|
|
|
|
if (strcmp(param, "bootconfig") == 0) {
|
|
|
|
bootconfig_found = true;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-08-05 10:10:51 +08:00
|
|
|
static int __init warn_bootconfig(char *str)
|
|
|
|
{
|
|
|
|
/* The 'bootconfig' has been handled by bootconfig_params(). */
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-03-11 16:52:13 +08:00
|
|
|
static void __init setup_boot_config(void)
|
2020-01-11 00:03:44 +08:00
|
|
|
{
|
2020-02-08 08:07:37 +08:00
|
|
|
static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
|
2022-04-06 10:31:19 +08:00
|
|
|
const char *msg, *data;
|
|
|
|
int pos, ret;
|
|
|
|
size_t size;
|
|
|
|
char *err;
|
2020-01-11 00:03:44 +08:00
|
|
|
|
2020-05-11 09:39:24 +08:00
|
|
|
/* Cut out the bootconfig data even if we have no bootconfig option */
|
2022-04-06 10:31:09 +08:00
|
|
|
data = get_boot_config_from_initrd(&size);
|
2022-04-06 10:31:19 +08:00
|
|
|
/* If there is no bootconfig in initrd, try embedded one. */
|
|
|
|
if (!data)
|
|
|
|
data = xbc_get_embedded_bootconfig(&size);
|
2020-04-26 14:53:30 +08:00
|
|
|
|
2022-08-19 05:01:59 +08:00
|
|
|
strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
|
2020-08-04 10:52:13 +08:00
|
|
|
err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
|
|
|
|
bootconfig_params);
|
2020-02-08 08:07:37 +08:00
|
|
|
|
2023-02-28 09:01:42 +08:00
|
|
|
if (IS_ERR(err) || !(bootconfig_found || IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE)))
|
2020-01-11 00:03:44 +08:00
|
|
|
return;
|
|
|
|
|
2021-09-04 23:54:16 +08:00
|
|
|
/* parse_args() stops at the next param of '--' and returns an address */
|
2020-08-04 10:52:13 +08:00
|
|
|
if (err)
|
2021-09-04 23:54:16 +08:00
|
|
|
initargs_offs = err - tmp_cmdline;
|
2020-08-04 10:52:13 +08:00
|
|
|
|
2020-05-11 09:39:24 +08:00
|
|
|
if (!data) {
|
2023-02-28 09:01:42 +08:00
|
|
|
/* If user intended to use bootconfig, show an error level message */
|
|
|
|
if (bootconfig_found)
|
|
|
|
pr_err("'bootconfig' found on command line, but no bootconfig found\n");
|
|
|
|
else
|
|
|
|
pr_info("No bootconfig data provided, so skipping bootconfig");
|
2020-05-11 09:39:24 +08:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2020-02-04 20:33:53 +08:00
|
|
|
if (size >= XBC_DATA_MAX) {
|
2022-04-06 10:31:19 +08:00
|
|
|
pr_err("bootconfig size %ld greater than max size %d\n",
|
|
|
|
(long)size, XBC_DATA_MAX);
|
2020-01-11 00:03:44 +08:00
|
|
|
return;
|
2020-02-04 20:33:53 +08:00
|
|
|
}
|
2020-01-11 00:03:44 +08:00
|
|
|
|
2021-09-16 14:23:20 +08:00
|
|
|
ret = xbc_init(data, size, &msg, &pos);
|
2020-03-03 19:24:50 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
if (pos < 0)
|
|
|
|
pr_err("Failed to init bootconfig: %s.\n", msg);
|
|
|
|
else
|
|
|
|
pr_err("Failed to parse bootconfig: %s at %d.\n",
|
|
|
|
msg, pos);
|
|
|
|
} else {
|
2021-09-16 14:23:29 +08:00
|
|
|
xbc_get_info(&ret, NULL);
|
2022-04-06 10:31:19 +08:00
|
|
|
pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
|
2020-01-11 00:04:43 +08:00
|
|
|
/* keys starting with "kernel." are passed via cmdline */
|
|
|
|
extra_command_line = xbc_make_cmdline("kernel");
|
2020-01-11 00:04:55 +08:00
|
|
|
/* Also, "init." keys are init arguments */
|
|
|
|
extra_init_args = xbc_make_cmdline("init");
|
2020-01-11 00:04:43 +08:00
|
|
|
}
|
2020-02-04 20:33:53 +08:00
|
|
|
return;
|
2020-01-11 00:03:44 +08:00
|
|
|
}
|
2020-04-26 14:53:30 +08:00
|
|
|
|
2021-09-04 23:54:09 +08:00
|
|
|
static void __init exit_boot_config(void)
|
|
|
|
{
|
2021-09-17 18:02:39 +08:00
|
|
|
xbc_exit();
|
2021-09-04 23:54:09 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#else /* !CONFIG_BOOT_CONFIG */
|
2020-04-26 14:53:30 +08:00
|
|
|
|
2021-03-11 16:52:13 +08:00
|
|
|
static void __init setup_boot_config(void)
|
2020-04-26 14:53:30 +08:00
|
|
|
{
|
|
|
|
/* Remove bootconfig data from initrd */
|
2022-04-06 10:31:09 +08:00
|
|
|
get_boot_config_from_initrd(NULL);
|
2020-04-26 14:53:30 +08:00
|
|
|
}
|
2020-02-20 20:18:33 +08:00
|
|
|
|
|
|
|
static int __init warn_bootconfig(char *str)
|
|
|
|
{
|
2020-05-09 16:33:55 +08:00
|
|
|
pr_warn("WARNING: 'bootconfig' found on the kernel command line but CONFIG_BOOT_CONFIG is not set.\n");
|
2020-02-20 20:18:33 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2021-09-04 23:54:09 +08:00
|
|
|
|
|
|
|
#define exit_boot_config() do {} while (0)
|
|
|
|
|
|
|
|
#endif /* CONFIG_BOOT_CONFIG */
|
|
|
|
|
2021-08-05 10:10:51 +08:00
|
|
|
early_param("bootconfig", warn_bootconfig);
|
2020-01-11 00:03:44 +08:00
|
|
|
|
2012-04-07 00:53:50 +08:00
|
|
|
/* Change NUL term back to "=", to make "param" the whole string. */
|
2020-01-31 14:17:16 +08:00
|
|
|
static void __init repair_env_string(char *param, char *val)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
if (val) {
|
|
|
|
/* param=val or param="val"? */
|
|
|
|
if (val == param+strlen(param)+1)
|
|
|
|
val[-1] = '=';
|
|
|
|
else if (val == param+strlen(param)+2) {
|
|
|
|
val[-2] = '=';
|
|
|
|
memmove(val-1, val, strlen(val)+1);
|
|
|
|
} else
|
|
|
|
BUG();
|
|
|
|
}
|
2012-04-07 00:53:50 +08:00
|
|
|
}
|
|
|
|
|
2014-04-28 10:04:33 +08:00
|
|
|
/* Anything after -- gets handed straight to init. */
|
module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.
@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
extern char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
));
@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
))
{
...
}
@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@
(
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
)
@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@
int func(char *param, char *val, const char *unused
+ , void *arg
)
{
...
}
@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@
- func(A1, A2, A3);
+ func(A1, A2, A3, NULL);
Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
|
|
|
static int __init set_init_arg(char *param, char *val,
|
|
|
|
const char *unused, void *arg)
|
2014-04-28 10:04:33 +08:00
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
if (panic_later)
|
|
|
|
return 0;
|
|
|
|
|
2020-01-31 14:17:16 +08:00
|
|
|
repair_env_string(param, val);
|
2014-04-28 10:04:33 +08:00
|
|
|
|
|
|
|
for (i = 0; argv_init[i]; i++) {
|
|
|
|
if (i == MAX_INIT_ARGS) {
|
|
|
|
panic_later = "init";
|
|
|
|
panic_param = param;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
argv_init[i] = param;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-04-07 00:53:50 +08:00
|
|
|
/*
|
|
|
|
* Unknown boot options get handed to init, unless they look like
|
|
|
|
* unused parameters (modprobe will find them in /proc/cmdline).
|
|
|
|
*/
|
module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.
@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
extern char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
));
@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
))
{
...
}
@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@
(
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
)
@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@
int func(char *param, char *val, const char *unused
+ , void *arg
)
{
...
}
@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@
- func(A1, A2, A3);
+ func(A1, A2, A3, NULL);
Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
|
|
|
static int __init unknown_bootoption(char *param, char *val,
|
|
|
|
const char *unused, void *arg)
|
2012-04-07 00:53:50 +08:00
|
|
|
{
|
init/main.c: fix quoted value handling in unknown_bootoption
Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2.
unknown_bootoption passes unrecognized command line arguments to init as
either environment variables or arguments. Some of the logic in the
function is broken for quoted command line arguments.
When an argument of the form param="value" is processed by parse_args
and passed to unknown_bootoption, the command line has
param\0"value\0
with val pointing to the beginning of value. The helper function
repair_env_string is then used to restore the '=' character that was
removed by parse_args, and strip the quotes off fully. This results in
param=value\0\0
and val ends up pointing to the 'a' instead of the 'v' in value. This
bug was introduced when repair_env_string was refactored into a separate
function, and the decrement of val in repair_env_string became dead
code.
This causes two problems in unknown_bootoption in the two places where
the val pointer is used as a substitute for the length of param:
1. An argument of the form param=".value" is misinterpreted as a
potential module parameter, with the result that it will not be
placed in init's environment.
2. An argument of the form param="value" is checked to see if param is
an existing environment variable that should be overwritten, but the
comparison is off-by-one and compares 'param=v' instead of 'param='
against the existing environment. So passing, for example,
TERM="vt100" on the command line results in init being passed both
TERM=linux and TERM=vt100 in its environment.
Patch 1 adds logging for the arguments and environment passed to init
and is independent of the rest: it can be dropped if this is
unnecessarily verbose.
Patch 2 removes repair_env_string from initcall parameter parsing in
do_initcall_level, as that uses a separate copy of the command line now
and the repairing is no longer necessary.
Patch 3 fixes the bug in unknown_bootoption by recording the length of
param explicitly instead of implying it from val-param.
This patch (of 3):
Commit a99cd1125189 ("init: fix bug where environment vars can't be
passed via boot args") introduced two minor bugs in unknown_bootoption
by factoring out the quoted value handling into a separate function.
When value is quoted, repair_env_string will move the value up 1 byte to
strip the quotes, so val in unknown_bootoption no longer points to the
actual location of the value.
The result is that an argument of the form param=".value" is mistakenly
treated as a potential module parameter and is not placed in init's
environment, and an argument of the form param="value" can result in a
duplicate environment variable: eg TERM="vt100" on the command line will
result in both TERM=linux and TERM=vt100 being placed into init's
environment.
Fix this by recording the length of the param before calling
repair_env_string instead of relying on val.
Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:17:19 +08:00
|
|
|
size_t len = strlen(param);
|
|
|
|
|
2020-01-31 14:17:16 +08:00
|
|
|
repair_env_string(param, val);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* Handle obsolete-style parameters */
|
|
|
|
if (obsolete_checksetup(param))
|
|
|
|
return 0;
|
|
|
|
|
2009-12-01 12:26:44 +08:00
|
|
|
/* Unused module parameter. */
|
init/main.c: fix quoted value handling in unknown_bootoption
Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2.
unknown_bootoption passes unrecognized command line arguments to init as
either environment variables or arguments. Some of the logic in the
function is broken for quoted command line arguments.
When an argument of the form param="value" is processed by parse_args
and passed to unknown_bootoption, the command line has
param\0"value\0
with val pointing to the beginning of value. The helper function
repair_env_string is then used to restore the '=' character that was
removed by parse_args, and strip the quotes off fully. This results in
param=value\0\0
and val ends up pointing to the 'a' instead of the 'v' in value. This
bug was introduced when repair_env_string was refactored into a separate
function, and the decrement of val in repair_env_string became dead
code.
This causes two problems in unknown_bootoption in the two places where
the val pointer is used as a substitute for the length of param:
1. An argument of the form param=".value" is misinterpreted as a
potential module parameter, with the result that it will not be
placed in init's environment.
2. An argument of the form param="value" is checked to see if param is
an existing environment variable that should be overwritten, but the
comparison is off-by-one and compares 'param=v' instead of 'param='
against the existing environment. So passing, for example,
TERM="vt100" on the command line results in init being passed both
TERM=linux and TERM=vt100 in its environment.
Patch 1 adds logging for the arguments and environment passed to init
and is independent of the rest: it can be dropped if this is
unnecessarily verbose.
Patch 2 removes repair_env_string from initcall parameter parsing in
do_initcall_level, as that uses a separate copy of the command line now
and the repairing is no longer necessary.
Patch 3 fixes the bug in unknown_bootoption by recording the length of
param explicitly instead of implying it from val-param.
This patch (of 3):
Commit a99cd1125189 ("init: fix bug where environment vars can't be
passed via boot args") introduced two minor bugs in unknown_bootoption
by factoring out the quoted value handling into a separate function.
When value is quoted, repair_env_string will move the value up 1 byte to
strip the quotes, so val in unknown_bootoption no longer points to the
actual location of the value.
The result is that an argument of the form param=".value" is mistakenly
treated as a potential module parameter and is not placed in init's
environment, and an argument of the form param="value" can result in a
duplicate environment variable: eg TERM="vt100" on the command line will
result in both TERM=linux and TERM=vt100 being placed into init's
environment.
Fix this by recording the length of the param before calling
repair_env_string instead of relying on val.
Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:17:19 +08:00
|
|
|
if (strnchr(param, len, '.'))
|
2005-04-17 06:20:36 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (panic_later)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (val) {
|
|
|
|
/* Environment option */
|
|
|
|
unsigned int i;
|
|
|
|
for (i = 0; envp_init[i]; i++) {
|
|
|
|
if (i == MAX_INIT_ENVS) {
|
2014-01-24 07:54:56 +08:00
|
|
|
panic_later = "env";
|
2005-04-17 06:20:36 +08:00
|
|
|
panic_param = param;
|
|
|
|
}
|
init/main.c: fix quoted value handling in unknown_bootoption
Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2.
unknown_bootoption passes unrecognized command line arguments to init as
either environment variables or arguments. Some of the logic in the
function is broken for quoted command line arguments.
When an argument of the form param="value" is processed by parse_args
and passed to unknown_bootoption, the command line has
param\0"value\0
with val pointing to the beginning of value. The helper function
repair_env_string is then used to restore the '=' character that was
removed by parse_args, and strip the quotes off fully. This results in
param=value\0\0
and val ends up pointing to the 'a' instead of the 'v' in value. This
bug was introduced when repair_env_string was refactored into a separate
function, and the decrement of val in repair_env_string became dead
code.
This causes two problems in unknown_bootoption in the two places where
the val pointer is used as a substitute for the length of param:
1. An argument of the form param=".value" is misinterpreted as a
potential module parameter, with the result that it will not be
placed in init's environment.
2. An argument of the form param="value" is checked to see if param is
an existing environment variable that should be overwritten, but the
comparison is off-by-one and compares 'param=v' instead of 'param='
against the existing environment. So passing, for example,
TERM="vt100" on the command line results in init being passed both
TERM=linux and TERM=vt100 in its environment.
Patch 1 adds logging for the arguments and environment passed to init
and is independent of the rest: it can be dropped if this is
unnecessarily verbose.
Patch 2 removes repair_env_string from initcall parameter parsing in
do_initcall_level, as that uses a separate copy of the command line now
and the repairing is no longer necessary.
Patch 3 fixes the bug in unknown_bootoption by recording the length of
param explicitly instead of implying it from val-param.
This patch (of 3):
Commit a99cd1125189 ("init: fix bug where environment vars can't be
passed via boot args") introduced two minor bugs in unknown_bootoption
by factoring out the quoted value handling into a separate function.
When value is quoted, repair_env_string will move the value up 1 byte to
strip the quotes, so val in unknown_bootoption no longer points to the
actual location of the value.
The result is that an argument of the form param=".value" is mistakenly
treated as a potential module parameter and is not placed in init's
environment, and an argument of the form param="value" can result in a
duplicate environment variable: eg TERM="vt100" on the command line will
result in both TERM=linux and TERM=vt100 being placed into init's
environment.
Fix this by recording the length of the param before calling
repair_env_string instead of relying on val.
Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:17:19 +08:00
|
|
|
if (!strncmp(param, envp_init[i], len+1))
|
2005-04-17 06:20:36 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
envp_init[i] = param;
|
|
|
|
} else {
|
|
|
|
/* Command line option */
|
|
|
|
unsigned int i;
|
|
|
|
for (i = 0; argv_init[i]; i++) {
|
|
|
|
if (i == MAX_INIT_ARGS) {
|
2014-01-24 07:54:56 +08:00
|
|
|
panic_later = "init";
|
2005-04-17 06:20:36 +08:00
|
|
|
panic_param = param;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
argv_init[i] = param;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __init init_setup(char *str)
|
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
execute_command = str;
|
|
|
|
/*
|
|
|
|
* In case LILO is going to boot us with default command line,
|
|
|
|
* it prepends "auto" before the whole cmdline which makes
|
|
|
|
* the shell think it should execute a script with such name.
|
|
|
|
* So we ignore all arguments entered _before_ init=... [MJ]
|
|
|
|
*/
|
|
|
|
for (i = 1; i < MAX_INIT_ARGS; i++)
|
|
|
|
argv_init[i] = NULL;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
__setup("init=", init_setup);
|
|
|
|
|
2005-09-07 06:17:19 +08:00
|
|
|
static int __init rdinit_setup(char *str)
|
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
ramdisk_execute_command = str;
|
|
|
|
/* See "auto" comment in init_setup */
|
|
|
|
for (i = 1; i < MAX_INIT_ARGS; i++)
|
|
|
|
argv_init[i] = NULL;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
__setup("rdinit=", rdinit_setup);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifndef CONFIG_SMP
|
2011-03-23 07:34:06 +08:00
|
|
|
static const unsigned int setup_max_cpus = NR_CPUS;
|
2008-03-27 05:23:48 +08:00
|
|
|
static inline void setup_nr_cpu_ids(void) { }
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline void smp_prepare_cpus(unsigned int maxcpus) { }
|
|
|
|
#endif
|
|
|
|
|
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
|
|
|
/*
|
|
|
|
* We need to store the untouched command line for future reference.
|
|
|
|
* We also need to store the touched command line since the parameter
|
|
|
|
* parsing is performed in place, and we should allow a component to
|
|
|
|
* store reference of name/value for future reference.
|
|
|
|
*/
|
|
|
|
static void __init setup_command_line(char *command_line)
|
|
|
|
{
|
2020-01-11 00:04:55 +08:00
|
|
|
size_t len, xlen = 0, ilen = 0;
|
2019-03-12 14:30:20 +08:00
|
|
|
|
2020-01-11 00:04:43 +08:00
|
|
|
if (extra_command_line)
|
|
|
|
xlen = strlen(extra_command_line);
|
2020-01-11 00:04:55 +08:00
|
|
|
if (extra_init_args)
|
|
|
|
ilen = strlen(extra_init_args) + 4; /* for " -- " */
|
2019-03-12 14:30:20 +08:00
|
|
|
|
2020-01-11 00:04:43 +08:00
|
|
|
len = xlen + strlen(boot_command_line) + 1;
|
2019-03-12 14:30:20 +08:00
|
|
|
|
2020-01-11 00:04:55 +08:00
|
|
|
saved_command_line = memblock_alloc(len + ilen, SMP_CACHE_BYTES);
|
2019-03-12 14:30:20 +08:00
|
|
|
if (!saved_command_line)
|
2020-01-11 00:04:55 +08:00
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__, len + ilen);
|
2019-03-12 14:30:20 +08:00
|
|
|
|
|
|
|
static_command_line = memblock_alloc(len, SMP_CACHE_BYTES);
|
|
|
|
if (!static_command_line)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__, len);
|
|
|
|
|
2020-01-11 00:04:43 +08:00
|
|
|
if (xlen) {
|
|
|
|
/*
|
|
|
|
* We have to put extra_command_line before boot command
|
|
|
|
* lines because there could be dashes (separator of init
|
|
|
|
* command line) in the command lines.
|
|
|
|
*/
|
|
|
|
strcpy(saved_command_line, extra_command_line);
|
|
|
|
strcpy(static_command_line, extra_command_line);
|
|
|
|
}
|
|
|
|
strcpy(saved_command_line + xlen, boot_command_line);
|
|
|
|
strcpy(static_command_line + xlen, command_line);
|
2020-01-11 00:04:55 +08:00
|
|
|
|
|
|
|
if (ilen) {
|
|
|
|
/*
|
|
|
|
* Append supplemental init boot args to saved_command_line
|
|
|
|
* so that user can check what command line options passed
|
|
|
|
* to init.
|
2021-09-04 23:54:16 +08:00
|
|
|
* The order should always be
|
|
|
|
* " -- "[bootconfig init-param][cmdline init-param]
|
2020-01-11 00:04:55 +08:00
|
|
|
*/
|
2021-09-04 23:54:16 +08:00
|
|
|
if (initargs_offs) {
|
|
|
|
len = xlen + initargs_offs;
|
|
|
|
strcpy(saved_command_line + len, extra_init_args);
|
|
|
|
len += ilen - 4; /* strlen(extra_init_args) */
|
|
|
|
strcpy(saved_command_line + len,
|
|
|
|
boot_command_line + initargs_offs - 1);
|
2020-02-08 08:07:37 +08:00
|
|
|
} else {
|
2021-09-04 23:54:16 +08:00
|
|
|
len = strlen(saved_command_line);
|
2020-01-11 00:04:55 +08:00
|
|
|
strcpy(saved_command_line + len, " -- ");
|
|
|
|
len += 4;
|
2021-09-04 23:54:16 +08:00
|
|
|
strcpy(saved_command_line + len, extra_init_args);
|
2020-02-08 08:07:37 +08:00
|
|
|
}
|
2020-01-11 00:04:55 +08:00
|
|
|
}
|
proc: give /proc/cmdline size
Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.
cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:
let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();
will result in
openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)
open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.
In case of /proc/cmdline, the length is known precisely.
Make variables readonly while I'm at it.
P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.
Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09 02:21:54 +08:00
|
|
|
|
|
|
|
saved_command_line_len = strlen(saved_command_line);
|
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* We need to finalize in a non-__init function or else race conditions
|
|
|
|
* between the root thread and the init thread may cause start_kernel to
|
|
|
|
* be reaped by free_initmem before the root thread has proceeded to
|
|
|
|
* cpu_idle.
|
|
|
|
*
|
|
|
|
* gcc-3.4 accidentally inlines this function, so use noinline.
|
|
|
|
*/
|
|
|
|
|
2010-06-28 22:51:01 +08:00
|
|
|
static __initdata DECLARE_COMPLETION(kthreadd_done);
|
|
|
|
|
2023-04-13 07:49:31 +08:00
|
|
|
noinline void __ref __noreturn rest_init(void)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-05-17 02:42:32 +08:00
|
|
|
struct task_struct *tsk;
|
2007-05-09 17:34:32 +08:00
|
|
|
int pid;
|
|
|
|
|
2009-09-03 05:01:24 +08:00
|
|
|
rcu_scheduler_starting();
|
2010-06-28 22:51:01 +08:00
|
|
|
/*
|
2010-06-30 16:37:11 +08:00
|
|
|
* We need to spawn init first so that it obtains pid 1, however
|
2010-06-28 22:51:01 +08:00
|
|
|
* the init task will end up wanting to create kthreads, which, if
|
|
|
|
* we schedule it before we create kthreadd, will OOPS.
|
|
|
|
*/
|
2022-04-12 00:40:14 +08:00
|
|
|
pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
|
2017-05-17 02:42:32 +08:00
|
|
|
/*
|
|
|
|
* Pin init on the boot CPU. Task migration is not properly working
|
|
|
|
* until sched_init_smp() has been run. It will set the allowed
|
|
|
|
* CPUs for init to the non isolated CPUs.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
tsk = find_task_by_pid_ns(pid, &init_pid_ns);
|
2021-05-31 18:21:13 +08:00
|
|
|
tsk->flags |= PF_NO_SETAFFINITY;
|
2017-05-17 02:42:32 +08:00
|
|
|
set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
numa_default_policy();
|
2023-03-11 06:03:23 +08:00
|
|
|
pid = kernel_thread(kthreadd, NULL, NULL, CLONE_FS | CLONE_FILES);
|
2010-02-23 09:04:50 +08:00
|
|
|
rcu_read_lock();
|
2008-04-30 15:54:24 +08:00
|
|
|
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
|
2010-02-23 09:04:50 +08:00
|
|
|
rcu_read_unlock();
|
2017-05-17 02:42:48 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Enable might_sleep() and smp_processor_id() checks.
|
2019-07-27 05:19:37 +08:00
|
|
|
* They cannot be enabled earlier because with CONFIG_PREEMPTION=y
|
2017-05-17 02:42:48 +08:00
|
|
|
* kernel_thread() would trigger might_sleep() splats. With
|
|
|
|
* CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
|
|
|
|
* already, but it's stuck on the kthreadd_done completion.
|
|
|
|
*/
|
|
|
|
system_state = SYSTEM_SCHEDULING;
|
|
|
|
|
2010-06-28 22:51:01 +08:00
|
|
|
complete(&kthreadd_done);
|
2005-06-28 22:40:42 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The boot idle thread must execute schedule()
|
2007-07-10 00:51:58 +08:00
|
|
|
* at least once to get things moving:
|
2005-06-28 22:40:42 +08:00
|
|
|
*/
|
2011-03-21 19:33:18 +08:00
|
|
|
schedule_preempt_disabled();
|
2005-11-09 13:39:01 +08:00
|
|
|
/* Call into cpu_idle with preempt disabled */
|
2013-03-22 05:49:34 +08:00
|
|
|
cpu_startup_entry(CPUHP_ONLINE);
|
2007-07-10 00:51:58 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* Check for early params. */
|
module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.
@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
extern char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
));
@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
))
{
...
}
@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@
(
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
)
@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@
int func(char *param, char *val, const char *unused
+ , void *arg
)
{
...
}
@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@
- func(A1, A2, A3);
+ func(A1, A2, A3, NULL);
Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
|
|
|
static int __init do_early_param(char *param, char *val,
|
|
|
|
const char *unused, void *arg)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2010-08-12 13:04:18 +08:00
|
|
|
const struct obs_kernel_param *p;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
for (p = __setup_start; p < __setup_end; p++) {
|
2011-10-10 06:03:37 +08:00
|
|
|
if ((p->early && parameq(param, p->str)) ||
|
serial: convert early_uart to earlycon for 8250
Beacuse SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h. the serial8250_ports need to be probed late in
serial initializing stage. the console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup will return -ENDEV, and console
ttyS0 can not be enabled at that time. need to wait till uart_add_one_port in
drivers/serial/serial_core.c to call register_console to get console ttyS0.
that is too late.
Make early_uart to use early_param, so uart console can be used earlier. Make
it to be bootconsole with CON_BOOT flag, so can use console handover feature.
and it will switch to corresponding normal serial console automatically.
new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
it will print in very early stage:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
later for console it will print:
console handover: boot [uart0] -> real [ttyS0]
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 14:37:59 +08:00
|
|
|
(strcmp(param, "console") == 0 &&
|
|
|
|
strcmp(p->str, "earlycon") == 0)
|
|
|
|
) {
|
2005-04-17 06:20:36 +08:00
|
|
|
if (p->setup_func(val) != 0)
|
2013-04-30 07:18:20 +08:00
|
|
|
pr_warn("Malformed early option '%s'\n", param);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
/* We accept everything at this stage. */
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-03-31 05:37:25 +08:00
|
|
|
void __init parse_early_options(char *cmdline)
|
|
|
|
{
|
module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.
@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
extern char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
));
@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
))
{
...
}
@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@
(
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
)
@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@
int func(char *param, char *val, const char *unused
+ , void *arg
)
{
...
}
@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@
- func(A1, A2, A3);
+ func(A1, A2, A3, NULL);
Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
|
|
|
parse_args("early options", cmdline, NULL, 0, 0, 0, NULL,
|
|
|
|
do_early_param);
|
2009-03-31 05:37:25 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Arch code calls this early on, or if not, just before other parsing. */
|
|
|
|
void __init parse_early_param(void)
|
|
|
|
{
|
2014-08-09 05:23:44 +08:00
|
|
|
static int done __initdata;
|
|
|
|
static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
if (done)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* All fall through to do_early_param. */
|
2022-08-19 05:01:59 +08:00
|
|
|
strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
|
2009-03-31 05:37:25 +08:00
|
|
|
parse_early_options(tmp_cmdline);
|
2005-04-17 06:20:36 +08:00
|
|
|
done = 1;
|
|
|
|
}
|
|
|
|
|
2016-12-10 02:29:10 +08:00
|
|
|
void __init __weak arch_post_acpi_subsys_init(void) { }
|
|
|
|
|
2008-04-18 14:56:18 +08:00
|
|
|
void __init __weak smp_setup_processor_id(void)
|
2006-06-30 16:55:50 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2012-11-02 21:20:42 +08:00
|
|
|
# if THREAD_SIZE >= PAGE_SIZE
|
Clarify naming of thread info/stack allocators
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.
But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.
Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical. That
identity then meant that we would have things like
ti = alloc_thread_info_node(tsk, node);
...
tsk->stack = ti;
which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.
So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack. The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.
This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.
The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter. It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-25 06:09:37 +08:00
|
|
|
void __init __weak thread_stack_cache_init(void)
|
2008-04-18 14:56:15 +08:00
|
|
|
{
|
|
|
|
}
|
2012-11-02 21:20:42 +08:00
|
|
|
#endif
|
2008-04-18 14:56:15 +08:00
|
|
|
|
2017-07-18 05:10:21 +08:00
|
|
|
void __init __weak mem_encrypt_init(void) { }
|
|
|
|
|
2019-04-27 07:22:46 +08:00
|
|
|
void __init __weak poking_init(void) { }
|
|
|
|
|
2019-09-24 06:35:31 +08:00
|
|
|
void __init __weak pgtable_cache_init(void) { }
|
2019-05-05 09:11:24 +08:00
|
|
|
|
2021-09-08 11:16:06 +08:00
|
|
|
void __init __weak trap_init(void) { }
|
|
|
|
|
2018-03-27 01:31:07 +08:00
|
|
|
bool initcall_debug;
|
|
|
|
core_param(initcall_debug, initcall_debug, bool, 0644);
|
2018-04-06 21:24:25 +08:00
|
|
|
|
|
|
|
#ifdef TRACEPOINTS_ENABLED
|
2018-03-27 01:31:07 +08:00
|
|
|
static void __init initcall_debug_enable(void);
|
2018-04-06 21:24:25 +08:00
|
|
|
#else
|
|
|
|
static inline void initcall_debug_enable(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
2018-03-27 01:31:07 +08:00
|
|
|
|
2022-01-31 17:05:20 +08:00
|
|
|
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
|
stack: Optionally randomize kernel stack offset each syscall
This provides the ability for architectures to enable kernel stack base
address offset randomization. This feature is controlled by the boot
param "randomize_kstack_offset=on/off", with its default value set by
CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
This feature is based on the original idea from the last public release
of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
All the credit for the original idea goes to the PaX team. Note that
the design and implementation of this upstream randomize_kstack_offset
feature differs greatly from the RANDKSTACK feature (see below).
Reasoning for the feature:
This feature aims to make harder the various stack-based attacks that
rely on deterministic stack structure. We have had many such attacks in
past (just to name few):
https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
As Linux kernel stack protections have been constantly improving
(vmap-based stack allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have had to find new ways for their exploits
to work. They have done so, continuing to rely on the kernel's stack
determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
were not relevant. For example, the following recent attacks would have
been hampered if the stack offset was non-deterministic between syscalls:
https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
(page 70: targeting the pt_regs copy with linear stack overflow)
https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
(leaked stack address from one syscall as a target during next syscall)
The main idea is that since the stack offset is randomized on each system
call, it is harder for an attack to reliably land in any particular place
on the thread stack, even with address exposures, as the stack base will
change on the next syscall. Also, since randomization is performed after
placing pt_regs, the ptrace-based approach[1] to discover the randomized
offset during a long-running syscall should not be possible.
Design description:
During most of the kernel's execution, it runs on the "thread stack",
which is pretty deterministic in its structure: it is fixed in size,
and on every entry from userspace to kernel on a syscall the thread
stack starts construction from an address fetched from the per-cpu
cpu_current_top_of_stack variable. The first element to be pushed to the
thread stack is the pt_regs struct that stores all required CPU registers
and syscall parameters. Finally the specific syscall function is called,
with the stack being used as the kernel executes the resulting request.
The goal of randomize_kstack_offset feature is to add a random offset
after the pt_regs has been pushed to the stack and before the rest of the
thread stack is used during the syscall processing, and to change it every
time a process issues a syscall. The source of randomness is currently
architecture-defined (but x86 is using the low byte of rdtsc()). Future
improvements for different entropy sources is possible, but out of scope
for this patch. Further more, to add more unpredictability, new offsets
are chosen at the end of syscalls (the timing of which should be less
easy to measure from userspace than at syscall entry time), and stored
in a per-CPU variable, so that the life of the value does not stay
explicitly tied to a single task.
As suggested by Andy Lutomirski, the offset is added using alloca()
and an empty asm() statement with an output constraint, since it avoids
changes to assembly syscall entry code, to the unwinder, and provides
correct stack alignment as defined by the compiler.
In order to make this available by default with zero performance impact
for those that don't want it, it is boot-time selectable with static
branches. This way, if the overhead is not wanted, it can just be
left turned off with no performance impact.
The generated assembly for x86_64 with GCC looks like this:
...
ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax
# 12380 <kstack_offset>
ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax
ffffffff81003983: 48 83 c0 0f add $0xf,%rax
ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax
ffffffff8100398c: 48 29 c4 sub %rax,%rsp
ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax
ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax
...
As a result of the above stack alignment, this patch introduces about
5 bits of randomness after pt_regs is spilled to the thread stack on
x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for
stack alignment). The amount of entropy could be adjusted based on how
much of the stack space we wish to trade for security.
My measure of syscall performance overhead (on x86_64):
lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
randomize_kstack_offset=y Simple syscall: 0.7082 microseconds
randomize_kstack_offset=n Simple syscall: 0.7016 microseconds
So, roughly 0.9% overhead growth for a no-op syscall, which is very
manageable. And for people that don't want this, it's off by default.
There are two gotchas with using the alloca() trick. First,
compilers that have Stack Clash protection (-fstack-clash-protection)
enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
any dynamic stack allocations. While the randomization offset is
always less than a page, the resulting assembly would still contain
(unreachable!) probing routines, bloating the resulting assembly. To
avoid this, -fno-stack-clash-protection is unconditionally added to
the kernel Makefile since this is the only dynamic stack allocation in
the kernel (now that VLAs have been removed) and it is provably safe
from Stack Clash style attacks.
The second gotcha with alloca() is a negative interaction with
-fstack-protector*, in that it sees the alloca() as an array allocation,
which triggers the unconditional addition of the stack canary function
pre/post-amble which slows down syscalls regardless of the static
branch. In order to avoid adding this unneeded check and its associated
performance impact, architectures need to carefully remove uses of
-fstack-protector-strong (or -fstack-protector) in the compilation units
that use the add_random_kstack() macro and to audit the resulting stack
mitigation coverage (to make sure no desired coverage disappears). No
change is visible for this on x86 because the stack protector is already
unconditionally disabled for the compilation unit, but the change is
required on arm64. There is, unfortunately, no attribute that can be
used to disable stack protector for specific functions.
Comparison to PaX RANDKSTACK feature:
The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. including the location of pt_regs
structure itself on the stack. Initially this patch followed the same
approach, but during the recent discussions[2], it has been determined
to be of a little value since, if ptrace functionality is available for
an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
different offsets in the pt_regs struct, observe the cache behavior of
the pt_regs accesses, and figure out the random stack offset. Another
difference is that the random offset is stored in a per-cpu variable,
rather than having it be per-thread. As a result, these implementations
differ a fair bit in their implementation details and results, though
obviously the intent is similar.
[1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
[2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
[3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
2021-04-02 07:23:44 +08:00
|
|
|
DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
|
|
|
|
randomize_kstack_offset);
|
|
|
|
DEFINE_PER_CPU(u32, kstack_offset);
|
|
|
|
|
|
|
|
static int __init early_randomize_kstack_offset(char *buf)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
bool bool_result;
|
|
|
|
|
|
|
|
ret = kstrtobool(buf, &bool_result);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (bool_result)
|
|
|
|
static_branch_enable(&randomize_kstack_offset);
|
|
|
|
else
|
|
|
|
static_branch_disable(&randomize_kstack_offset);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
early_param("randomize_kstack_offset", early_randomize_kstack_offset);
|
|
|
|
#endif
|
|
|
|
|
2023-04-13 07:49:31 +08:00
|
|
|
void __init __weak __noreturn arch_call_rest_init(void)
|
2018-08-31 16:42:24 +08:00
|
|
|
{
|
|
|
|
rest_init();
|
|
|
|
}
|
|
|
|
|
2021-07-01 09:56:28 +08:00
|
|
|
static void __init print_unknown_bootoptions(void)
|
|
|
|
{
|
|
|
|
char *unknown_options;
|
|
|
|
char *end;
|
|
|
|
const char *const *p;
|
|
|
|
size_t len;
|
|
|
|
|
|
|
|
if (panic_later || (!argv_init[1] && !envp_init[2]))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine how many options we have to print out, plus a space
|
|
|
|
* before each
|
|
|
|
*/
|
|
|
|
len = 1; /* null terminator */
|
|
|
|
for (p = &argv_init[1]; *p; p++) {
|
|
|
|
len++;
|
|
|
|
len += strlen(*p);
|
|
|
|
}
|
|
|
|
for (p = &envp_init[2]; *p; p++) {
|
|
|
|
len++;
|
|
|
|
len += strlen(*p);
|
|
|
|
}
|
|
|
|
|
|
|
|
unknown_options = memblock_alloc(len, SMP_CACHE_BYTES);
|
|
|
|
if (!unknown_options) {
|
|
|
|
pr_err("%s: Failed to allocate %zu bytes\n",
|
|
|
|
__func__, len);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
end = unknown_options;
|
|
|
|
|
|
|
|
for (p = &argv_init[1]; *p; p++)
|
|
|
|
end += sprintf(end, " %s", *p);
|
|
|
|
for (p = &envp_init[2]; *p; p++)
|
|
|
|
end += sprintf(end, " %s", *p);
|
|
|
|
|
2021-11-09 10:34:27 +08:00
|
|
|
/* Start at unknown_options[1] to skip the initial space */
|
|
|
|
pr_notice("Unknown kernel command line parameters \"%s\", will be passed to user space.\n",
|
|
|
|
&unknown_options[1]);
|
2021-11-06 04:43:22 +08:00
|
|
|
memblock_free(unknown_options, len);
|
2021-07-01 09:56:28 +08:00
|
|
|
}
|
|
|
|
|
2023-04-13 07:49:32 +08:00
|
|
|
asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(void)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2014-08-09 05:23:44 +08:00
|
|
|
char *command_line;
|
|
|
|
char *after_dashes;
|
2006-06-30 16:55:50 +08:00
|
|
|
|
2014-09-12 21:16:17 +08:00
|
|
|
set_task_stack_end_magic(&init_task);
|
2011-11-17 13:34:31 +08:00
|
|
|
smp_setup_processor_id();
|
2008-04-30 15:55:01 +08:00
|
|
|
debug_objects_early_init();
|
2021-07-08 09:09:13 +08:00
|
|
|
init_vmlinux_build_id();
|
2008-02-14 16:44:08 +08:00
|
|
|
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 14:39:30 +08:00
|
|
|
cgroup_init_early();
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
|
|
|
|
local_irq_disable();
|
2011-01-20 19:06:35 +08:00
|
|
|
early_boot_irqs_disabled = true;
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
|
2017-03-23 19:30:05 +08:00
|
|
|
/*
|
|
|
|
* Interrupts are still disabled. Do necessary setups, then
|
|
|
|
* enable them.
|
|
|
|
*/
|
2006-03-23 18:59:44 +08:00
|
|
|
boot_cpu_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
page_address_init();
|
2013-04-30 07:18:20 +08:00
|
|
|
pr_notice("%s", linux_banner);
|
2019-08-20 08:17:37 +08:00
|
|
|
early_security_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
setup_arch(&command_line);
|
2021-03-11 16:52:13 +08:00
|
|
|
setup_boot_config();
|
[PATCH] Dynamic kernel command-line: common
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 16:53:52 +08:00
|
|
|
setup_command_line(command_line);
|
2008-03-27 05:23:48 +08:00
|
|
|
setup_nr_cpu_ids();
|
2009-07-21 16:11:50 +08:00
|
|
|
setup_per_cpu_areas();
|
2006-03-23 18:59:44 +08:00
|
|
|
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
|
init: rename and re-order boot_cpu_state_init()
This is purely a preparatory patch for upcoming changes during the 4.19
merge window.
We have a function called "boot_cpu_state_init()" that isn't really
about the bootup cpu state: that is done much earlier by the similarly
named "boot_cpu_init()" (note lack of "state" in name).
This function initializes some hotplug CPU state, and needs to run after
the percpu data has been properly initialized. It even has a comment to
that effect.
Except it _doesn't_ actually run after the percpu data has been properly
initialized. On x86 it happens to do that, but on at least arm and
arm64, the percpu base pointers are initialized by the arch-specific
'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().
This had some unexpected results, and in particular we have a patch
pending for the merge window that did the obvious cleanup of using
'this_cpu_write()' in the cpu hotplug init code:
- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
+ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
which is obviously the right thing to do. Except because of the
ordering issue, it actually failed miserably and unexpectedly on arm64.
So this just fixes the ordering, and changes the name of the function to
be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
hotplug state, because the core CPU state was supposed to have already
been done earlier.
Marked for stable, since the (not yet merged) patch that will show this
problem is marked for stable.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-13 03:19:42 +08:00
|
|
|
boot_cpu_hotplug_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-01-11 00:04:43 +08:00
|
|
|
pr_notice("Kernel command line: %s\n", saved_command_line);
|
2019-04-19 08:50:44 +08:00
|
|
|
/* parameters may set static keys */
|
|
|
|
jump_label_init();
|
2009-06-11 00:40:04 +08:00
|
|
|
parse_early_param();
|
2014-04-28 10:04:33 +08:00
|
|
|
after_dashes = parse_args("Booting kernel",
|
|
|
|
static_command_line, __start___param,
|
|
|
|
__stop___param - __start___param,
|
module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.
@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
extern char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
));
@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
))
{
...
}
@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@
(
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
)
@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@
int func(char *param, char *val, const char *unused
+ , void *arg
)
{
...
}
@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@
- func(A1, A2, A3);
+ func(A1, A2, A3, NULL);
Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
|
|
|
-1, -1, NULL, &unknown_bootoption);
|
2021-07-01 09:56:28 +08:00
|
|
|
print_unknown_bootoptions();
|
2014-11-11 13:59:46 +08:00
|
|
|
if (!IS_ERR_OR_NULL(after_dashes))
|
2014-04-28 10:04:33 +08:00
|
|
|
parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
|
module: add extra argument for parse_params() callback
This adds an extra argument onto parse_params() to be used
as a way to make the unused callback a bit more useful and
generic by allowing the caller to pass on a data structure
of its choice. An example use case is to allow us to easily
make module parameters for every module which we will do
next.
@ parse @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
extern char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
));
@ parse_mod @
identifier name, args, params, num, level_min, level_max;
identifier unknown, param, val, doing;
type s16;
@@
char *parse_args(const char *name,
char *args,
const struct kernel_param *params,
unsigned num,
s16 level_min,
s16 level_max,
+ void *arg,
int (*unknown)(char *param, char *val,
const char *doing
+ , void *arg
))
{
...
}
@ parse_args_found @
expression R, E1, E2, E3, E4, E5, E6;
identifier func;
@@
(
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
R =
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
&func);
|
parse_args(E1, E2, E3, E4, E5, E6,
+ NULL,
NULL);
)
@ parse_args_unused depends on parse_args_found @
identifier parse_args_found.func;
@@
int func(char *param, char *val, const char *unused
+ , void *arg
)
{
...
}
@ mod_unused depends on parse_args_found @
identifier parse_args_found.func;
expression A1, A2, A3;
@@
- func(A1, A2, A3);
+ func(A1, A2, A3, NULL);
Generated-by: Coccinelle SmPL
Cc: cocci@systeme.lip6.fr
Cc: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-31 07:20:03 +08:00
|
|
|
NULL, set_init_arg);
|
2020-01-11 00:04:55 +08:00
|
|
|
if (extra_init_args)
|
|
|
|
parse_args("Setting extra init args", extra_init_args,
|
|
|
|
NULL, 0, -1, -1, NULL, set_init_arg);
|
2011-10-13 07:17:54 +08:00
|
|
|
|
random: split initialization into early step and later step
The full RNG initialization relies on some timestamps, made possible
with initialization functions like time_init() and timekeeping_init().
However, these are only available rather late in initialization.
Meanwhile, other things, such as memory allocator functions, make use of
the RNG much earlier.
So split RNG initialization into two phases. We can provide arch
randomness very early on, and then later, after timekeeping and such are
available, initialize the rest.
This ensures that, for example, slabs are properly randomized if RDRAND
is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree
of its security, because its random seed is potentially deterministic,
since it hasn't yet incorporated RDRAND. It also makes it possible to
use a better seed in kfence, which currently relies on only the cycle
counter.
Another positive consequence is that on systems with RDRAND, running
with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all.
One subtle side effect of this change is that on systems with no RDRAND,
RDTSC is now only queried by random_init() once, committing the moment
of the function call, instead of multiple times as before. This is
intentional, as the multiple RDTSCs in a loop before weren't
accomplishing very much, with jitter being better provided by
try_to_generate_entropy(). Plus, filling blocks with RDTSC is still
being done in extract_entropy(), which is necessarily called before
random bytes are served anyway.
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-26 23:43:14 +08:00
|
|
|
/* Architectural and non-timekeeping rng init, before allocator init */
|
|
|
|
random_init_early(command_line);
|
|
|
|
|
2009-06-11 00:40:04 +08:00
|
|
|
/*
|
|
|
|
* These use large bootmem allocations and must precede
|
2023-03-22 01:05:06 +08:00
|
|
|
* initalization of page allocator
|
2009-06-11 00:40:04 +08:00
|
|
|
*/
|
2011-05-25 08:13:20 +08:00
|
|
|
setup_log_buf(0);
|
2009-06-11 00:40:04 +08:00
|
|
|
vfs_caches_init_early();
|
|
|
|
sort_main_extable();
|
|
|
|
trap_init();
|
2023-03-22 01:05:06 +08:00
|
|
|
mm_core_init();
|
2022-10-26 03:38:25 +08:00
|
|
|
poking_init();
|
2017-03-04 02:43:34 +08:00
|
|
|
ftrace_init();
|
|
|
|
|
2017-03-04 02:37:33 +08:00
|
|
|
/* trace_printk can be enabled here */
|
|
|
|
early_trace_init();
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Set up the scheduler prior starting any interrupts (such as the
|
|
|
|
* timer interrupt). Full topology setup happens at smp_init()
|
|
|
|
* time - but meanwhile we still have a functioning scheduler.
|
|
|
|
*/
|
|
|
|
sched_init();
|
sched/core: Initialize the idle task with preemption disabled
As pointed out by commit
de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")
init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.
As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().
Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().
Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().
Secondary startups were patched via coccinelle:
@begone@
@@
-preempt_disable();
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12 17:46:36 +08:00
|
|
|
|
2014-08-09 05:23:44 +08:00
|
|
|
if (WARN(!irqs_disabled(),
|
|
|
|
"Interrupts were enabled *very* early, fixing it\n"))
|
2007-01-06 08:36:19 +08:00
|
|
|
local_irq_disable();
|
2016-12-20 23:27:56 +08:00
|
|
|
radix_tree_init();
|
Maple Tree: add new data structure
Patch series "Introducing the Maple Tree"
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Davidlor said
: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage. Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit. This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead. Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.
A similar work has been discovered in the academic press
https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
Sheer coincidence. We designed our tree with the intention of solving the
hardest problem first. Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article. So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.
This patch (of 70):
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
There is additional BUG_ON() calls added within the tree, most of which
are in debug code. These will be replaced with a WARN_ON() call in the
future. There is also additional BUG_ON() calls within the code which
will also be reduced in number at a later date. These exist to catch
things such as out-of-range accesses which would crash anyways.
Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-07 03:48:39 +08:00
|
|
|
maple_tree_init();
|
2016-09-17 03:49:32 +08:00
|
|
|
|
2017-11-03 23:27:49 +08:00
|
|
|
/*
|
|
|
|
* Set up housekeeping before setting up workqueues to allow the unbound
|
|
|
|
* workqueue to take non-housekeeping into account.
|
|
|
|
*/
|
|
|
|
housekeeping_init();
|
|
|
|
|
2016-09-17 03:49:32 +08:00
|
|
|
/*
|
|
|
|
* Allow workqueue creation and work item queueing/cancelling
|
|
|
|
* early. Work item execution depends on kthreads and starts after
|
|
|
|
* workqueue_init().
|
|
|
|
*/
|
|
|
|
workqueue_init_early();
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
rcu_init();
|
2014-12-13 09:05:10 +08:00
|
|
|
|
2017-03-04 02:37:33 +08:00
|
|
|
/* Trace events are available after this */
|
2014-12-13 09:05:10 +08:00
|
|
|
trace_init();
|
|
|
|
|
2018-03-27 01:31:07 +08:00
|
|
|
if (initcall_debug)
|
|
|
|
initcall_debug_enable();
|
|
|
|
|
2013-07-12 01:12:32 +08:00
|
|
|
context_tracking_init();
|
2008-12-06 10:58:31 +08:00
|
|
|
/* init some links before init_ISA_irqs() */
|
|
|
|
early_irq_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
init_IRQ();
|
2013-03-05 22:14:05 +08:00
|
|
|
tick_init();
|
2014-10-13 21:44:12 +08:00
|
|
|
rcu_init_nohz();
|
2005-04-17 06:20:36 +08:00
|
|
|
init_timers();
|
2021-04-09 06:38:59 +08:00
|
|
|
srcu_init();
|
2006-01-10 12:52:32 +08:00
|
|
|
hrtimers_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
softirq_init();
|
2006-06-26 15:25:06 +08:00
|
|
|
timekeeping_init();
|
2022-05-05 08:20:22 +08:00
|
|
|
time_init();
|
2019-04-20 11:27:05 +08:00
|
|
|
|
random: split initialization into early step and later step
The full RNG initialization relies on some timestamps, made possible
with initialization functions like time_init() and timekeeping_init().
However, these are only available rather late in initialization.
Meanwhile, other things, such as memory allocator functions, make use of
the RNG much earlier.
So split RNG initialization into two phases. We can provide arch
randomness very early on, and then later, after timekeeping and such are
available, initialize the rest.
This ensures that, for example, slabs are properly randomized if RDRAND
is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree
of its security, because its random seed is potentially deterministic,
since it hasn't yet incorporated RDRAND. It also makes it possible to
use a better seed in kfence, which currently relies on only the cycle
counter.
Another positive consequence is that on systems with RDRAND, running
with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all.
One subtle side effect of this change is that on systems with no RDRAND,
RDTSC is now only queried by random_init() once, committing the moment
of the function call, instead of multiple times as before. This is
intentional, as the multiple RDTSCs in a loop before weren't
accomplishing very much, with jitter being better provided by
try_to_generate_entropy(). Plus, filling blocks with RDTSC is still
being done in extract_entropy(), which is necessarily called before
random bytes are served anyway.
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-26 23:43:14 +08:00
|
|
|
/* This must be after timekeeping is initialized */
|
|
|
|
random_init();
|
|
|
|
|
|
|
|
/* These make use of the fully initialized rng */
|
|
|
|
kfence_init();
|
2019-04-20 11:27:05 +08:00
|
|
|
boot_init_stack_canary();
|
|
|
|
|
perf: Use hrtimers for event multiplexing
The current scheme of using the timer tick was fine for per-thread
events. However, it was causing bias issues in system-wide mode
(including for uncore PMUs). Event groups would not get their fair
share of runtime on the PMU. With tickless kernels, if a core is idle
there is no timer tick, and thus no event rotation (multiplexing).
However, there are events (especially uncore events) which do count
even though cores are asleep.
This patch changes the timer source for multiplexing. It introduces a
per-PMU per-cpu hrtimer. The advantage is that even when a core goes
idle, it will come back to service the hrtimer, thus multiplexing on
system-wide events works much better.
The per-PMU implementation (suggested by PeterZ) enables adjusting the
multiplexing interval per PMU. The preferred interval is stashed into
the struct pmu. If not set, it will be forced to the default interval
value.
In order to minimize the impact of the hrtimer, it is turned on and
off on demand. When the PMU on a CPU is overcommited, the hrtimer is
activated. It is stopped when the PMU is not overcommitted.
In order for this to work properly, we had to change the order of
initialization in start_kernel() such that hrtimer_init() is run
before perf_event_init().
The default interval in milliseconds is set to a timer tick just like
with the old code. We will provide a sysctl to tune this in another
patch.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-03 20:21:33 +08:00
|
|
|
perf_event_init();
|
2006-07-03 15:24:24 +08:00
|
|
|
profile_init();
|
2011-03-30 00:35:04 +08:00
|
|
|
call_function_init();
|
init: scream bloody murder if interrupts are enabled too early
As I was testing a lot of my code recently, and having several
"successes", I accidentally noticed in the dmesg this little line:
start_kernel(): bug: interrupts were enabled *very* early, fixing it
Sure enough, one of my patches two commits ago enabled interrupts early.
The sad part here is that I never noticed it, and I ran several tests with
ktest too, and ktest did not notice this line.
What ktest looks for (and so does many other automated testing scripts) is
a back trace produced by a WARN_ON() or BUG(). As a back trace was never
produced, my buggy patch could have slipped into linux-next, or even
worse, mainline.
Adding a WARN(!irqs_disabled()) makes this bug a little more obvious:
PID hash table entries: 4096 (order: 3, 32768 bytes)
__ex_table already sorted, skipping sort
Checking aperture...
No AGP bridge found
Calgary: detecting Calgary via BIOS EBDA area
Calgary: Unable to locate Rio Grande table in EBDA - bailing!
Memory: 2003252k/2054848k available (4857k kernel code, 460k absent, 51136k reserved, 6210k data, 1096k init)
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/init/main.c:543 start_kernel+0x21e/0x415()
Hardware name: To Be Filled By O.E.M.
Interrupts were enabled *very* early, fixing it
Modules linked in:
Pid: 0, comm: swapper/0 Not tainted 3.8.0-test+ #286
Call Trace:
warn_slowpath_common+0x83/0x9b
warn_slowpath_fmt+0x46/0x48
start_kernel+0x21e/0x415
x86_64_start_reservations+0x10e/0x112
x86_64_start_kernel+0x102/0x111
---[ end trace 007d8b0491b4f5d8 ]---
Preemptible hierarchical RCU implementation.
RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
NR_IRQS:4352 nr_irqs:712 16
Console: colour VGA+ 80x25
console [ttyS0] enabled, bootconsole disabled
Do you see it?
The original version of this patch just slapped a WARN_ON() in there and
kept the printk(). Ard van Breemen suggested using the WARN() interface,
which makes the code a bit cleaner.
Also, while examining other warnings in init/main.c, I found two other
locations that deserve a bloody murder scream if their conditions are hit,
and updated them accordingly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ard van Breemen <ard@telegraafnet.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 07:18:18 +08:00
|
|
|
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
|
2018-07-31 06:24:23 +08:00
|
|
|
|
2011-01-20 19:06:35 +08:00
|
|
|
early_boot_irqs_disabled = false;
|
2006-07-03 15:24:24 +08:00
|
|
|
local_irq_enable();
|
2009-06-18 11:24:12 +08:00
|
|
|
|
2009-06-12 19:03:06 +08:00
|
|
|
kmem_cache_init_late();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* HACK ALERT! This is early. We're enabling the console before
|
|
|
|
* we've done PCI setups etc, and console_init() must be aware of
|
|
|
|
* this. But we do want output early, in case something goes wrong.
|
|
|
|
*/
|
|
|
|
console_init();
|
|
|
|
if (panic_later)
|
2014-01-24 07:54:56 +08:00
|
|
|
panic("Too many boot %s vars at `%s'", panic_later,
|
|
|
|
panic_param);
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
|
2018-07-31 06:24:23 +08:00
|
|
|
lockdep_init();
|
[PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:24:50 +08:00
|
|
|
|
2006-07-03 15:24:33 +08:00
|
|
|
/*
|
|
|
|
* Need to run this when irqs are enabled, because it wants
|
|
|
|
* to self-test [hard/soft]-irqs on/off lock inversion bugs
|
|
|
|
* too:
|
|
|
|
*/
|
|
|
|
locking_selftest();
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_INITRD
|
|
|
|
if (initrd_start && !initrd_below_start_ok &&
|
2008-07-30 13:33:36 +08:00
|
|
|
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
|
2013-04-30 07:18:20 +08:00
|
|
|
pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
|
2008-07-30 13:33:36 +08:00
|
|
|
page_to_pfn(virt_to_page((void *)initrd_start)),
|
|
|
|
min_low_pfn);
|
2005-04-17 06:20:36 +08:00
|
|
|
initrd_start = 0;
|
|
|
|
}
|
|
|
|
#endif
|
2005-06-22 08:14:47 +08:00
|
|
|
setup_per_cpu_pageset();
|
2005-04-17 06:20:36 +08:00
|
|
|
numa_policy_init();
|
2017-09-13 17:17:54 +08:00
|
|
|
acpi_early_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
if (late_time_init)
|
|
|
|
late_time_init();
|
2018-07-20 04:55:42 +08:00
|
|
|
sched_clock_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
calibrate_delay();
|
2023-03-09 10:40:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This needs to be called before any devices perform DMA
|
|
|
|
* operations that might use the SWIOTLB bounce buffers. It will
|
|
|
|
* mark the bounce buffers as decrypted so that their usage will
|
|
|
|
* not cause "plain-text" data to be decrypted when accessed. It
|
|
|
|
* must be called after late_time_init() so that Hyper-V x86/x64
|
|
|
|
* hypercalls work when the SWIOTLB bounce buffers are decrypted.
|
|
|
|
*/
|
|
|
|
mem_encrypt_init();
|
|
|
|
|
2017-11-18 07:30:30 +08:00
|
|
|
pid_idr_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
anon_vma_init();
|
2012-12-16 07:15:24 +08:00
|
|
|
#ifdef CONFIG_X86
|
2012-11-14 17:42:35 +08:00
|
|
|
if (efi_enabled(EFI_RUNTIME_SERVICES))
|
2012-12-16 07:15:24 +08:00
|
|
|
efi_enter_virtual_mode();
|
|
|
|
#endif
|
Clarify naming of thread info/stack allocators
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.
But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.
Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical. That
identity then meant that we would have things like
ti = alloc_thread_info_node(tsk, node);
...
tsk->stack = ti;
which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.
So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack. The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.
This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.
The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter. It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-25 06:09:37 +08:00
|
|
|
thread_stack_cache_init();
|
CRED: Inaugurate COW credentials
Inaugurate copy-on-write credentials management. This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.
A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().
With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:
struct cred *new = prepare_creds();
int ret = blah(new);
if (ret < 0) {
abort_creds(new);
return ret;
}
return commit_creds(new);
There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.
To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const. The purpose of this is compile-time
discouragement of altering credentials through those pointers. Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:
(1) Its reference count may incremented and decremented.
(2) The keyrings to which it points may be modified, but not replaced.
The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
This now prepares and commits credentials in various places in the
security code rather than altering the current creds directly.
(2) Temporary credential overrides.
do_coredump() and sys_faccessat() now prepare their own credentials and
temporarily override the ones currently on the acting thread, whilst
preventing interference from other threads by holding cred_replace_mutex
on the thread being dumped.
This will be replaced in a future patch by something that hands down the
credentials directly to the functions being called, rather than altering
the task's objective credentials.
(3) LSM interface.
A number of functions have been changed, added or removed:
(*) security_capset_check(), ->capset_check()
(*) security_capset_set(), ->capset_set()
Removed in favour of security_capset().
(*) security_capset(), ->capset()
New. This is passed a pointer to the new creds, a pointer to the old
creds and the proposed capability sets. It should fill in the new
creds or return an error. All pointers, barring the pointer to the
new creds, are now const.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
Changed; now returns a value, which will cause the process to be
killed if it's an error.
(*) security_task_alloc(), ->task_alloc_security()
Removed in favour of security_prepare_creds().
(*) security_cred_free(), ->cred_free()
New. Free security data attached to cred->security.
(*) security_prepare_creds(), ->cred_prepare()
New. Duplicate any security data attached to cred->security.
(*) security_commit_creds(), ->cred_commit()
New. Apply any security effects for the upcoming installation of new
security by commit_creds().
(*) security_task_post_setuid(), ->task_post_setuid()
Removed in favour of security_task_fix_setuid().
(*) security_task_fix_setuid(), ->task_fix_setuid()
Fix up the proposed new credentials for setuid(). This is used by
cap_set_fix_setuid() to implicitly adjust capabilities in line with
setuid() changes. Changes are made to the new credentials, rather
than the task itself as in security_task_post_setuid().
(*) security_task_reparent_to_init(), ->task_reparent_to_init()
Removed. Instead the task being reparented to init is referred
directly to init's credentials.
NOTE! This results in the loss of some state: SELinux's osid no
longer records the sid of the thread that forked it.
(*) security_key_alloc(), ->key_alloc()
(*) security_key_permission(), ->key_permission()
Changed. These now take cred pointers rather than task pointers to
refer to the security context.
(4) sys_capset().
This has been simplified and uses less locking. The LSM functions it
calls have been merged.
(5) reparent_to_kthreadd().
This gives the current thread the same credentials as init by simply using
commit_thread() to point that way.
(6) __sigqueue_alloc() and switch_uid()
__sigqueue_alloc() can't stop the target task from changing its creds
beneath it, so this function gets a reference to the currently applicable
user_struct which it then passes into the sigqueue struct it returns if
successful.
switch_uid() is now called from commit_creds(), and possibly should be
folded into that. commit_creds() should take care of protecting
__sigqueue_alloc().
(7) [sg]et[ug]id() and co and [sg]et_current_groups.
The set functions now all use prepare_creds(), commit_creds() and
abort_creds() to build and check a new set of credentials before applying
it.
security_task_set[ug]id() is called inside the prepared section. This
guarantees that nothing else will affect the creds until we've finished.
The calling of set_dumpable() has been moved into commit_creds().
Much of the functionality of set_user() has been moved into
commit_creds().
The get functions all simply access the data directly.
(8) security_task_prctl() and cap_task_prctl().
security_task_prctl() has been modified to return -ENOSYS if it doesn't
want to handle a function, or otherwise return the return value directly
rather than through an argument.
Additionally, cap_task_prctl() now prepares a new set of credentials, even
if it doesn't end up using it.
(9) Keyrings.
A number of changes have been made to the keyrings code:
(a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
all been dropped and built in to the credentials functions directly.
They may want separating out again later.
(b) key_alloc() and search_process_keyrings() now take a cred pointer
rather than a task pointer to specify the security context.
(c) copy_creds() gives a new thread within the same thread group a new
thread keyring if its parent had one, otherwise it discards the thread
keyring.
(d) The authorisation key now points directly to the credentials to extend
the search into rather pointing to the task that carries them.
(e) Installing thread, process or session keyrings causes a new set of
credentials to be created, even though it's not strictly necessary for
process or session keyrings (they're shared).
(10) Usermode helper.
The usermode helper code now carries a cred struct pointer in its
subprocess_info struct instead of a new session keyring pointer. This set
of credentials is derived from init_cred and installed on the new process
after it has been cloned.
call_usermodehelper_setup() allocates the new credentials and
call_usermodehelper_freeinfo() discards them if they haven't been used. A
special cred function (prepare_usermodeinfo_creds()) is provided
specifically for call_usermodehelper_setup() to call.
call_usermodehelper_setkeys() adjusts the credentials to sport the
supplied keyring as the new session keyring.
(11) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) selinux_setprocattr() no longer does its check for whether the
current ptracer can access processes with the new SID inside the lock
that covers getting the ptracer's SID. Whilst this lock ensures that
the check is done with the ptracer pinned, the result is only valid
until the lock is released, so there's no point doing it inside the
lock.
(12) is_single_threaded().
This function has been extracted from selinux_setprocattr() and put into
a file of its own in the lib/ directory as join_session_keyring() now
wants to use it too.
The code in SELinux just checked to see whether a task shared mm_structs
with other tasks (CLONE_VM), but that isn't good enough. We really want
to know if they're part of the same thread group (CLONE_THREAD).
(13) nfsd.
The NFS server daemon now has to use the COW credentials to set the
credentials it is going to use. It really needs to pass the credentials
down to the functions it calls, but it can't do that until other patches
in this series have been applied.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 07:39:23 +08:00
|
|
|
cred_init();
|
2015-04-17 03:47:44 +08:00
|
|
|
fork_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
proc_caches_init();
|
2018-04-11 07:32:36 +08:00
|
|
|
uts_ns_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
key_init();
|
|
|
|
security_init();
|
2010-05-21 10:04:29 +08:00
|
|
|
dbg_late_init();
|
2022-02-06 01:01:25 +08:00
|
|
|
net_ns_init();
|
2015-08-07 06:46:20 +08:00
|
|
|
vfs_caches_init();
|
2016-12-25 11:00:30 +08:00
|
|
|
pagecache_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
signals_init();
|
2018-04-11 07:34:45 +08:00
|
|
|
seq_file_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
proc_root_init();
|
take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-01 22:57:28 +08:00
|
|
|
nsfs_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
cpuset_init();
|
2015-03-04 17:09:33 +08:00
|
|
|
cgroup_init();
|
2006-07-14 15:24:40 +08:00
|
|
|
taskstats_init_early();
|
2006-07-14 15:24:36 +08:00
|
|
|
delayacct_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
check_bugs();
|
|
|
|
|
ACPI / init: Switch over platform to the ACPI mode later
Commit 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before
timekeeping_init()" moved the ACPI subsystem initialization,
including the ACPI mode enabling, to an earlier point in the
initialization sequence, to allow the timekeeping subsystem
use ACPI early. Unfortunately, that resulted in boot regressions
on some systems and the early ACPI initialization was moved toward
its original position in the kernel initialization code by commit
c4e1acbb35e4 "ACPI / init: Invoke early ACPI initialization later".
However, that turns out to be insufficient, as boot is still broken
on the Tyan S8812 mainboard.
To fix that issue, split the ACPI early initialization code into
two pieces so the majority of it still located in acpi_early_init()
and the part switching over the platform into the ACPI mode goes into
a new function, acpi_subsystem_init(), executed at the original early
ACPI initialization spot.
That fixes the Tyan S8812 boot problem, but still allows ACPI
tables to be loaded earlier which is useful to the EFI code in
efi_enter_virtual_mode().
Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141
Fixes: 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()"
Reported-and-tested-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
2015-06-10 07:33:36 +08:00
|
|
|
acpi_subsystem_init();
|
2016-12-10 02:29:10 +08:00
|
|
|
arch_post_acpi_subsys_init();
|
2019-11-15 02:02:54 +08:00
|
|
|
kcsan_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* Do the rest non-__init'ed, we're now alive */
|
2018-08-31 16:42:24 +08:00
|
|
|
arch_call_rest_init();
|
x86: Fix early boot crash on gcc-10, third try
... or the odyssey of trying to disable the stack protector for the
function which generates the stack canary value.
The whole story started with Sergei reporting a boot crash with a kernel
built with gcc-10:
Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139
Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013
Call Trace:
dump_stack
panic
? start_secondary
__stack_chk_fail
start_secondary
secondary_startup_64
-—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary
This happens because gcc-10 tail-call optimizes the last function call
in start_secondary() - cpu_startup_entry() - and thus emits a stack
canary check which fails because the canary value changes after the
boot_init_stack_canary() call.
To fix that, the initial attempt was to mark the one function which
generates the stack canary with:
__attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
however, using the optimize attribute doesn't work cumulatively
as the attribute does not add to but rather replaces previously
supplied optimization options - roughly all -fxxx options.
The key one among them being -fno-omit-frame-pointer and thus leading to
not present frame pointer - frame pointer which the kernel needs.
The next attempt to prevent compilers from tail-call optimizing
the last function call cpu_startup_entry(), shy of carving out
start_secondary() into a separate compilation unit and building it with
-fno-stack-protector, was to add an empty asm("").
This current solution was short and sweet, and reportedly, is supported
by both compilers but we didn't get very far this time: future (LTO?)
optimization passes could potentially eliminate this, which leads us
to the third attempt: having an actual memory barrier there which the
compiler cannot ignore or move around etc.
That should hold for a long time, but hey we said that about the other
two solutions too so...
Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-04-23 00:11:30 +08:00
|
|
|
|
|
|
|
prevent_tail_call_optimization();
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2009-06-18 07:28:03 +08:00
|
|
|
/* Call all constructor functions linked into the kernel. */
|
|
|
|
static void __init do_ctors(void)
|
|
|
|
{
|
2021-02-05 10:32:28 +08:00
|
|
|
/*
|
|
|
|
* For UML, the constructors have already been called by the
|
|
|
|
* normal setup code as it's just a normal ELF binary, so we
|
|
|
|
* cannot do it again - but we do need CONFIG_CONSTRUCTORS
|
|
|
|
* even on UML for modules.
|
|
|
|
*/
|
|
|
|
#if defined(CONFIG_CONSTRUCTORS) && !defined(CONFIG_UML)
|
2009-12-15 10:00:18 +08:00
|
|
|
ctor_fn_t *fn = (ctor_fn_t *) __ctors_start;
|
2009-06-18 07:28:03 +08:00
|
|
|
|
2009-12-15 10:00:18 +08:00
|
|
|
for (; fn < (ctor_fn_t *) __ctors_end; fn++)
|
|
|
|
(*fn)();
|
2009-06-18 07:28:03 +08:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2014-06-05 07:12:17 +08:00
|
|
|
#ifdef CONFIG_KALLSYMS
|
|
|
|
struct blacklist_entry {
|
|
|
|
struct list_head next;
|
|
|
|
char *buf;
|
|
|
|
};
|
|
|
|
|
|
|
|
static __initdata_or_module LIST_HEAD(blacklisted_initcalls);
|
|
|
|
|
|
|
|
static int __init initcall_blacklist(char *str)
|
|
|
|
{
|
|
|
|
char *str_entry;
|
|
|
|
struct blacklist_entry *entry;
|
|
|
|
|
|
|
|
/* str argument is a comma-separated list of functions */
|
|
|
|
do {
|
|
|
|
str_entry = strsep(&str, ",");
|
|
|
|
if (str_entry) {
|
|
|
|
pr_debug("blacklisting initcall %s\n", str_entry);
|
memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.
Implicit alignment is done deep in the memblock allocator and it can
come as a surprise. Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.
Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.
For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.
The direct memblock APIs users were updated using the semantic patch below:
@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)
[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:09:57 +08:00
|
|
|
entry = memblock_alloc(sizeof(*entry),
|
|
|
|
SMP_CACHE_BYTES);
|
2019-03-12 14:30:20 +08:00
|
|
|
if (!entry)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n",
|
|
|
|
__func__, sizeof(*entry));
|
memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.
Implicit alignment is done deep in the memblock allocator and it can
come as a surprise. Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.
Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.
For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.
The direct memblock APIs users were updated using the semantic patch below:
@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)
[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:09:57 +08:00
|
|
|
entry->buf = memblock_alloc(strlen(str_entry) + 1,
|
|
|
|
SMP_CACHE_BYTES);
|
2019-03-12 14:30:20 +08:00
|
|
|
if (!entry->buf)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n",
|
|
|
|
__func__, strlen(str_entry) + 1);
|
2014-06-05 07:12:17 +08:00
|
|
|
strcpy(entry->buf, str_entry);
|
|
|
|
list_add(&entry->next, &blacklisted_initcalls);
|
|
|
|
}
|
|
|
|
} while (str_entry);
|
|
|
|
|
2022-03-24 07:06:14 +08:00
|
|
|
return 1;
|
2014-06-05 07:12:17 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static bool __init_or_module initcall_blacklisted(initcall_t fn)
|
|
|
|
{
|
|
|
|
struct blacklist_entry *entry;
|
2016-05-21 08:04:27 +08:00
|
|
|
char fn_name[KSYM_SYMBOL_LEN];
|
2016-06-25 05:50:30 +08:00
|
|
|
unsigned long addr;
|
2014-06-05 07:12:17 +08:00
|
|
|
|
2016-05-21 08:04:27 +08:00
|
|
|
if (list_empty(&blacklisted_initcalls))
|
2014-06-05 07:12:17 +08:00
|
|
|
return false;
|
|
|
|
|
2016-06-25 05:50:30 +08:00
|
|
|
addr = (unsigned long) dereference_function_descriptor(fn);
|
|
|
|
sprint_symbol_no_offset(fn_name, addr);
|
2016-05-21 08:04:27 +08:00
|
|
|
|
2016-08-03 05:07:15 +08:00
|
|
|
/*
|
|
|
|
* fn will be "function_name [module_name]" where [module_name] is not
|
|
|
|
* displayed for built-in init functions. Strip off the [module_name].
|
|
|
|
*/
|
|
|
|
strreplace(fn_name, ' ', '\0');
|
|
|
|
|
2016-03-16 05:52:46 +08:00
|
|
|
list_for_each_entry(entry, &blacklisted_initcalls, next) {
|
2014-06-05 07:12:17 +08:00
|
|
|
if (!strcmp(fn_name, entry->buf)) {
|
|
|
|
pr_debug("initcall %s blacklisted\n", fn_name);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static int __init initcall_blacklist(char *str)
|
|
|
|
{
|
|
|
|
pr_warn("initcall_blacklist requires CONFIG_KALLSYMS\n");
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool __init_or_module initcall_blacklisted(initcall_t fn)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
__setup("initcall_blacklist=", initcall_blacklist);
|
|
|
|
|
2018-03-27 01:31:07 +08:00
|
|
|
static __init_or_module void
|
|
|
|
trace_initcall_start_cb(void *data, initcall_t fn)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2022-09-28 09:45:39 +08:00
|
|
|
ktime_t *calltime = data;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-09-01 15:09:28 +08:00
|
|
|
printk(KERN_DEBUG "calling %pS @ %i\n", fn, task_pid_nr(current));
|
2018-03-27 01:31:07 +08:00
|
|
|
*calltime = ktime_get();
|
|
|
|
}
|
|
|
|
|
|
|
|
static __init_or_module void
|
|
|
|
trace_initcall_finish_cb(void *data, initcall_t fn, int ret)
|
|
|
|
{
|
2022-09-28 09:45:39 +08:00
|
|
|
ktime_t rettime, *calltime = data;
|
2018-03-27 01:31:07 +08:00
|
|
|
|
2010-08-10 08:20:32 +08:00
|
|
|
rettime = ktime_get();
|
2021-09-01 15:09:28 +08:00
|
|
|
printk(KERN_DEBUG "initcall %pS returned %d after %lld usecs\n",
|
2022-03-24 07:06:08 +08:00
|
|
|
fn, ret, (unsigned long long)ktime_us_delta(rettime, *calltime));
|
2018-03-27 01:31:07 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-03-27 01:31:07 +08:00
|
|
|
static ktime_t initcall_calltime;
|
|
|
|
|
2018-04-06 21:24:25 +08:00
|
|
|
#ifdef TRACEPOINTS_ENABLED
|
2018-03-27 01:31:07 +08:00
|
|
|
static void __init initcall_debug_enable(void)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = register_trace_initcall_start(trace_initcall_start_cb,
|
|
|
|
&initcall_calltime);
|
|
|
|
ret |= register_trace_initcall_finish(trace_initcall_finish_cb,
|
|
|
|
&initcall_calltime);
|
|
|
|
WARN(ret, "Failed to register initcall tracepoints\n");
|
2010-08-10 08:20:32 +08:00
|
|
|
}
|
2018-04-06 21:24:25 +08:00
|
|
|
# define do_trace_initcall_start trace_initcall_start
|
|
|
|
# define do_trace_initcall_finish trace_initcall_finish
|
|
|
|
#else
|
|
|
|
static inline void do_trace_initcall_start(initcall_t fn)
|
|
|
|
{
|
|
|
|
if (!initcall_debug)
|
|
|
|
return;
|
|
|
|
trace_initcall_start_cb(&initcall_calltime, fn);
|
|
|
|
}
|
|
|
|
static inline void do_trace_initcall_finish(initcall_t fn, int ret)
|
|
|
|
{
|
|
|
|
if (!initcall_debug)
|
|
|
|
return;
|
|
|
|
trace_initcall_finish_cb(&initcall_calltime, fn, ret);
|
|
|
|
}
|
|
|
|
#endif /* !TRACEPOINTS_ENABLED */
|
2010-08-10 08:20:32 +08:00
|
|
|
|
2010-08-10 08:20:32 +08:00
|
|
|
int __init_or_module do_one_initcall(initcall_t fn)
|
2010-08-10 08:20:32 +08:00
|
|
|
{
|
|
|
|
int count = preempt_count();
|
2013-07-04 06:05:37 +08:00
|
|
|
char msgbuf[64];
|
2018-03-27 01:31:07 +08:00
|
|
|
int ret;
|
2010-08-10 08:20:32 +08:00
|
|
|
|
2014-06-05 07:12:17 +08:00
|
|
|
if (initcall_blacklisted(fn))
|
|
|
|
return -EPERM;
|
|
|
|
|
2018-04-06 21:24:25 +08:00
|
|
|
do_trace_initcall_start(fn);
|
2018-03-27 01:31:07 +08:00
|
|
|
ret = fn();
|
2018-04-06 21:24:25 +08:00
|
|
|
do_trace_initcall_finish(fn, ret);
|
2007-05-08 15:28:26 +08:00
|
|
|
|
2008-05-16 09:14:01 +08:00
|
|
|
msgbuf[0] = 0;
|
2008-05-13 05:02:22 +08:00
|
|
|
|
2008-05-16 09:14:01 +08:00
|
|
|
if (preempt_count() != count) {
|
2013-05-02 01:35:51 +08:00
|
|
|
sprintf(msgbuf, "preemption imbalance ");
|
2013-08-14 20:55:24 +08:00
|
|
|
preempt_count_set(count);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2008-05-16 09:14:01 +08:00
|
|
|
if (irqs_disabled()) {
|
2008-05-16 04:52:41 +08:00
|
|
|
strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
|
2008-05-16 09:14:01 +08:00
|
|
|
local_irq_enable();
|
|
|
|
}
|
2019-03-26 03:32:28 +08:00
|
|
|
WARN(msgbuf[0], "initcall %pS returned with %s\n", fn, msgbuf);
|
2008-07-31 03:49:02 +08:00
|
|
|
|
gcc-plugins: Add latent_entropy plugin
This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.
The need for very-early boot entropy tends to be very architecture or
system design specific, so this plugin is more suited for those sorts
of special cases. The existing kernel RNG already attempts to extract
entropy from reliable runtime variation, but this plugin takes the idea to
a logical extreme by permuting a global variable based on any variation
in code execution (e.g. a different value (and permutation function)
is used to permute the global based on loop count, case statement,
if/then/else branching, etc).
To do this, the plugin starts by inserting a local variable in every
marked function. The plugin then adds logic so that the value of this
variable is modified by randomly chosen operations (add, xor and rol) and
random values (gcc generates separate static values for each location at
compile time and also injects the stack pointer at runtime). The resulting
value depends on the control flow path (e.g., loops and branches taken).
Before the function returns, the plugin mixes this local variable into
the latent_entropy global variable. The value of this global variable
is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
though it does not credit any bytes of entropy to the pool; the contents
of the global are just used to mix the pool.
Additionally, the plugin can pre-initialize arrays with build-time
random contents, so that two different kernel builds running on identical
hardware will not have the same starting values.
Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message and code comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-06-21 02:41:19 +08:00
|
|
|
add_latent_entropy();
|
2010-05-26 18:57:53 +08:00
|
|
|
return ret;
|
2008-05-16 09:14:01 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2018-08-22 12:56:13 +08:00
|
|
|
extern initcall_entry_t __initcall_start[];
|
|
|
|
extern initcall_entry_t __initcall0_start[];
|
|
|
|
extern initcall_entry_t __initcall1_start[];
|
|
|
|
extern initcall_entry_t __initcall2_start[];
|
|
|
|
extern initcall_entry_t __initcall3_start[];
|
|
|
|
extern initcall_entry_t __initcall4_start[];
|
|
|
|
extern initcall_entry_t __initcall5_start[];
|
|
|
|
extern initcall_entry_t __initcall6_start[];
|
|
|
|
extern initcall_entry_t __initcall7_start[];
|
|
|
|
extern initcall_entry_t __initcall_end[];
|
|
|
|
|
|
|
|
static initcall_entry_t *initcall_levels[] __initdata = {
|
2012-03-26 10:20:51 +08:00
|
|
|
__initcall0_start,
|
|
|
|
__initcall1_start,
|
|
|
|
__initcall2_start,
|
|
|
|
__initcall3_start,
|
|
|
|
__initcall4_start,
|
|
|
|
__initcall5_start,
|
|
|
|
__initcall6_start,
|
|
|
|
__initcall7_start,
|
|
|
|
__initcall_end,
|
|
|
|
};
|
|
|
|
|
2012-06-15 06:00:59 +08:00
|
|
|
/* Keep these in sync with initcalls in include/linux/init.h */
|
2019-01-04 07:27:29 +08:00
|
|
|
static const char *initcall_level_names[] __initdata = {
|
2018-03-23 07:28:54 +08:00
|
|
|
"pure",
|
params: add 3rd arg to option handler callback signature
Add a 3rd arg, named "doing", to unknown-options callbacks invoked
from parse_args(). The arg is passed as:
"Booting kernel" from start_kernel(),
initcall_level_names[i] from do_initcall_level(),
mod->name from load_module(), via parse_args(), parse_one()
parse_args() already has the "name" parameter, which is renamed to
"doing" to better reflect current uses 1,2 above. parse_args() passes
it to an altered parse_one(), which now passes it down into the
unknown option handler callbacks.
The mod->name will be needed to handle dyndbg for loadable modules,
since params passed by modprobe are not qualified (they do not have a
"$modname." prefix), and by the time the unknown-param callback is
called, the module name is not otherwise available.
Minor tweaks:
Add param-name to parse_one's pr_debug(), current message doesnt
identify the param being handled, add it.
Add a pr_info to print current level and level_name of the initcall,
and number of registered initcalls at that level. This adds 7 lines
to dmesg output, like:
initlevel:6=device, 172 registered initcalls
Drop "parameters" from initcall_level_names[], its unhelpful in the
pr_info() added above. This array is passed into parse_args() by
do_initcall_level().
CC: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-28 04:30:34 +08:00
|
|
|
"core",
|
|
|
|
"postcore",
|
|
|
|
"arch",
|
|
|
|
"subsys",
|
|
|
|
"fs",
|
|
|
|
"device",
|
|
|
|
"late",
|
2012-03-26 10:20:51 +08:00
|
|
|
};
|
|
|
|
|
2020-01-31 14:17:16 +08:00
|
|
|
static int __init ignore_unknown_bootoption(char *param, char *val,
|
|
|
|
const char *unused, void *arg)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-01-11 00:04:31 +08:00
|
|
|
static void __init do_initcall_level(int level, char *command_line)
|
2008-05-16 09:14:01 +08:00
|
|
|
{
|
2018-08-22 12:56:13 +08:00
|
|
|
initcall_entry_t *fn;
|
2008-05-16 09:14:01 +08:00
|
|
|
|
2012-03-26 10:20:51 +08:00
|
|
|
parse_args(initcall_level_names[level],
|
2020-01-11 00:04:31 +08:00
|
|
|
command_line, __start___param,
|
2012-03-26 10:20:51 +08:00
|
|
|
__stop___param - __start___param,
|
|
|
|
level, level,
|
2020-01-31 14:17:16 +08:00
|
|
|
NULL, ignore_unknown_bootoption);
|
2012-03-26 10:20:51 +08:00
|
|
|
|
2018-03-23 22:18:03 +08:00
|
|
|
trace_initcall_level(initcall_level_names[level]);
|
2012-03-26 10:20:51 +08:00
|
|
|
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
|
2018-08-22 12:56:13 +08:00
|
|
|
do_one_initcall(initcall_from_entry(fn));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2012-03-26 10:20:51 +08:00
|
|
|
static void __init do_initcalls(void)
|
|
|
|
{
|
|
|
|
int level;
|
proc: give /proc/cmdline size
Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.
cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:
let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();
will result in
openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)
open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.
In case of /proc/cmdline, the length is known precisely.
Make variables readonly while I'm at it.
P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.
Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09 02:21:54 +08:00
|
|
|
size_t len = saved_command_line_len + 1;
|
2020-01-11 00:04:31 +08:00
|
|
|
char *command_line;
|
|
|
|
|
|
|
|
command_line = kzalloc(len, GFP_KERNEL);
|
|
|
|
if (!command_line)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__, len);
|
|
|
|
|
|
|
|
for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++) {
|
|
|
|
/* Parser modifies command_line, restore it each time */
|
|
|
|
strcpy(command_line, saved_command_line);
|
|
|
|
do_initcall_level(level, command_line);
|
|
|
|
}
|
2012-03-26 10:20:51 +08:00
|
|
|
|
2020-01-11 00:04:31 +08:00
|
|
|
kfree(command_line);
|
2012-03-26 10:20:51 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Ok, the machine is now initialized. None of the devices
|
|
|
|
* have been touched yet, but the CPU subsystem is up and
|
|
|
|
* running, and memory and process management works.
|
|
|
|
*
|
|
|
|
* Now we can finally start doing some real work..
|
|
|
|
*/
|
|
|
|
static void __init do_basic_setup(void)
|
|
|
|
{
|
2009-03-25 17:06:30 +08:00
|
|
|
cpuset_init_smp();
|
2005-04-17 06:20:36 +08:00
|
|
|
driver_init();
|
2007-02-14 16:33:57 +08:00
|
|
|
init_irq_proc();
|
2009-06-18 07:28:03 +08:00
|
|
|
do_ctors();
|
2011-09-29 15:09:40 +08:00
|
|
|
do_initcalls();
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-07-26 10:45:11 +08:00
|
|
|
static void __init do_pre_smp_initcalls(void)
|
2008-07-26 10:45:11 +08:00
|
|
|
{
|
2018-08-22 12:56:13 +08:00
|
|
|
initcall_entry_t *fn;
|
2008-07-26 10:45:11 +08:00
|
|
|
|
2018-03-23 22:18:03 +08:00
|
|
|
trace_initcall_level("early");
|
2012-03-26 10:20:51 +08:00
|
|
|
for (fn = __initcall_start; fn < __initcall0_start; fn++)
|
2018-08-22 12:56:13 +08:00
|
|
|
do_one_initcall(initcall_from_entry(fn));
|
2008-07-26 10:45:11 +08:00
|
|
|
}
|
|
|
|
|
2012-10-11 09:28:25 +08:00
|
|
|
static int run_init_process(const char *init_filename)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-01-31 14:17:13 +08:00
|
|
|
const char *const *p;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
argv_init[0] = init_filename;
|
2018-08-22 12:58:37 +08:00
|
|
|
pr_info("Run %s as init process\n", init_filename);
|
2020-01-31 14:17:13 +08:00
|
|
|
pr_debug(" with arguments:\n");
|
|
|
|
for (p = argv_init; *p; p++)
|
|
|
|
pr_debug(" %s\n", *p);
|
|
|
|
pr_debug(" with environment:\n");
|
|
|
|
for (p = envp_init; *p; p++)
|
|
|
|
pr_debug(" %s\n", *p);
|
2020-07-14 01:06:48 +08:00
|
|
|
return kernel_execve(init_filename, argv_init, envp_init);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2013-11-13 07:10:21 +08:00
|
|
|
static int try_to_run_init_process(const char *init_filename)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = run_init_process(init_filename);
|
|
|
|
|
|
|
|
if (ret && ret != -ENOENT) {
|
|
|
|
pr_err("Starting init: %s exists but couldn't execute it (error %d)\n",
|
|
|
|
init_filename, ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-12-21 14:55:44 +08:00
|
|
|
static noinline void __init kernel_init_freeable(void);
|
2012-10-11 07:57:26 +08:00
|
|
|
|
2017-02-07 08:31:58 +08:00
|
|
|
#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
|
2016-11-14 14:15:05 +08:00
|
|
|
bool rodata_enabled __ro_after_init = true;
|
2022-08-17 23:40:22 +08:00
|
|
|
|
|
|
|
#ifndef arch_parse_debug_rodata
|
|
|
|
static inline bool arch_parse_debug_rodata(char *str) { return false; }
|
|
|
|
#endif
|
|
|
|
|
2016-02-18 06:41:13 +08:00
|
|
|
static int __init set_debug_rodata(char *str)
|
|
|
|
{
|
2022-08-17 23:40:22 +08:00
|
|
|
if (arch_parse_debug_rodata(str))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (str && !strcmp(str, "on"))
|
|
|
|
rodata_enabled = true;
|
|
|
|
else if (str && !strcmp(str, "off"))
|
|
|
|
rodata_enabled = false;
|
|
|
|
else
|
2022-03-24 07:06:14 +08:00
|
|
|
pr_warn("Invalid option string for rodata: '%s'\n", str);
|
2022-08-17 23:40:22 +08:00
|
|
|
return 0;
|
2016-02-18 06:41:13 +08:00
|
|
|
}
|
2022-08-17 23:40:22 +08:00
|
|
|
early_param("rodata", set_debug_rodata);
|
2016-11-14 14:15:05 +08:00
|
|
|
#endif
|
2016-02-18 06:41:13 +08:00
|
|
|
|
2017-02-07 08:31:58 +08:00
|
|
|
#ifdef CONFIG_STRICT_KERNEL_RWX
|
2016-02-18 06:41:13 +08:00
|
|
|
static void mark_readonly(void)
|
|
|
|
{
|
2017-02-28 06:30:22 +08:00
|
|
|
if (rodata_enabled) {
|
2018-05-12 07:01:42 +08:00
|
|
|
/*
|
2018-11-07 10:58:01 +08:00
|
|
|
* load_module() results in W+X mappings, which are cleaned
|
|
|
|
* up with call_rcu(). Let's make sure that queued work is
|
2018-05-12 07:01:42 +08:00
|
|
|
* flushed so that we don't hit false positives looking for
|
|
|
|
* insecure pages which are W+X.
|
|
|
|
*/
|
2018-11-07 10:58:01 +08:00
|
|
|
rcu_barrier();
|
2016-02-18 06:41:13 +08:00
|
|
|
mark_rodata_ro();
|
2017-02-28 06:30:22 +08:00
|
|
|
rodata_test();
|
|
|
|
} else
|
2016-02-18 06:41:13 +08:00
|
|
|
pr_info("Kernel memory protection disabled.\n");
|
|
|
|
}
|
2020-01-31 14:17:23 +08:00
|
|
|
#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
|
|
|
|
static inline void mark_readonly(void)
|
|
|
|
{
|
|
|
|
pr_warn("Kernel memory protection not selected by kernel config.\n");
|
|
|
|
}
|
2016-02-18 06:41:13 +08:00
|
|
|
#else
|
|
|
|
static inline void mark_readonly(void)
|
|
|
|
{
|
|
|
|
pr_warn("This architecture does not have kernel memory protection.\n");
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-05-14 08:18:40 +08:00
|
|
|
void __weak free_initmem(void)
|
|
|
|
{
|
2019-05-14 08:18:46 +08:00
|
|
|
free_initmem_default(POISON_FREE_INITMEM);
|
2019-05-14 08:18:40 +08:00
|
|
|
}
|
|
|
|
|
2012-10-11 07:57:26 +08:00
|
|
|
static int __ref kernel_init(void *unused)
|
2007-02-13 20:26:22 +08:00
|
|
|
{
|
2013-11-13 07:10:21 +08:00
|
|
|
int ret;
|
|
|
|
|
2021-05-31 18:21:13 +08:00
|
|
|
/*
|
|
|
|
* Wait until kthreadd is all set-up.
|
|
|
|
*/
|
|
|
|
wait_for_completion(&kthreadd_done);
|
|
|
|
|
2012-10-11 07:57:26 +08:00
|
|
|
kernel_init_freeable();
|
2009-01-08 00:45:46 +08:00
|
|
|
/* need to finish all async __init code before freeing the memory */
|
|
|
|
async_synchronize_full();
|
2021-11-06 04:40:40 +08:00
|
|
|
|
|
|
|
system_state = SYSTEM_FREEING_INITMEM;
|
2020-09-10 16:55:05 +08:00
|
|
|
kprobe_free_init_mem();
|
2017-04-04 00:57:35 +08:00
|
|
|
ftrace_free_init_mem();
|
2021-02-26 09:22:38 +08:00
|
|
|
kgdb_free_init_mem();
|
2021-09-04 23:54:09 +08:00
|
|
|
exit_boot_config();
|
2007-02-13 20:26:22 +08:00
|
|
|
free_initmem();
|
2016-02-18 06:41:13 +08:00
|
|
|
mark_readonly();
|
2018-07-18 17:41:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Kernel mappings are now finalized - update the userspace page-table
|
|
|
|
* to finalize PTI.
|
|
|
|
*/
|
|
|
|
pti_finalize();
|
|
|
|
|
2007-02-13 20:26:22 +08:00
|
|
|
system_state = SYSTEM_RUNNING;
|
|
|
|
numa_default_policy();
|
|
|
|
|
2015-11-26 08:52:36 +08:00
|
|
|
rcu_end_inkernel_boot();
|
|
|
|
|
kernel/sysctl: support setting sysctl parameters from kernel command line
Patch series "support setting sysctl parameters from kernel command line", v3.
This series adds support for something that seems like many people
always wanted but nobody added it yet, so here's the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1
The important part is Patch 1. The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl. I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.
I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying level of success:
- numa_zonelist_order is converted in Patch 2 together with adding the
necessary infrastructure. It's easy as it doesn't really do anything
but warn on deprecated value these days.
- hung_task_panic is converted in Patch 3, but there's a downside that
now it only accepts 0 and 1, while previously it was any integer
value
- nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
so there's no straighforward conversion possible
- traceoff_on_warning is a flag without value and it would be required
to handle that somehow in the conversion infractructure, which seems
pointless for a single flag
This patch (of 5):
A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have a
general support for passing sysctl parameters via command line.
Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.
Settings the vm_swappiness issue aside (the underlying issue might be
solved in a different way), quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.
A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.
Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using an temporary in-kernel procfs mount.
This mechanism was suggested by Eric W. Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls. Errors due to e.g. invalid parameter name or value
are reported in the kernel log.
The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized. At the moment the init process
can be started and eventually execute a process writing to /proc/sys/
then it should be also fine to do that from the kernel.
Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.
[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 12:40:24 +08:00
|
|
|
do_sysctl_args();
|
|
|
|
|
2007-02-13 20:26:22 +08:00
|
|
|
if (ramdisk_execute_command) {
|
2013-11-13 07:10:21 +08:00
|
|
|
ret = run_init_process(ramdisk_execute_command);
|
|
|
|
if (!ret)
|
2012-10-11 09:28:25 +08:00
|
|
|
return 0;
|
2013-11-13 07:10:21 +08:00
|
|
|
pr_err("Failed to execute %s (error %d)\n",
|
|
|
|
ramdisk_execute_command, ret);
|
2007-02-13 20:26:22 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We try each of these until one succeeds.
|
|
|
|
*
|
|
|
|
* The Bourne shell can be used instead of init if we are
|
|
|
|
* trying to recover a really broken machine.
|
|
|
|
*/
|
|
|
|
if (execute_command) {
|
2013-11-13 07:10:21 +08:00
|
|
|
ret = run_init_process(execute_command);
|
|
|
|
if (!ret)
|
2012-10-11 09:28:25 +08:00
|
|
|
return 0;
|
2014-12-11 07:52:19 +08:00
|
|
|
panic("Requested init %s failed (error %d).",
|
|
|
|
execute_command, ret);
|
2007-02-13 20:26:22 +08:00
|
|
|
}
|
init: allow distribution configuration of default init
Some init systems (eg. systemd) have init at their own paths, for
example, /usr/lib/systemd/systemd. A compatibility symlink to one of the
hardcoded init paths is provided by another package, usually named
something like systemd-sysvcompat or similar.
Currently distro maintainers who are hands-off on the bootloader are more
or less required to include those compatibility links as part of their
base distribution, because it's hard to migrate away from them since
there's a risk some users will not get the message to set init= on the
kernel command line appropriately.
Moreover, for distributions where the init system is something the
distribution itself is opinionated about (eg. Arch, which has systemd in
the required `base` package), we could usually reasonably configure this
ahead of time when building the distribution kernel. However, we
currently simply don't have any way to configure the kernel to do this.
Here's an example discussion where removing sysvcompat was discussed by
distro maintainers[0].
This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set
is tried before the hardcoded fallback list. So the order of precedence
is now thus:
1. init= on command line (on failure: panic)
2. CONFIG_DEFAULT_INIT (on failure: try #3)
3. Hardcoded fallback list (on failure: panic)
This new config parameter will allow distribution maintainers to move away
from these compatibility links safely, without having to worry that their
users might not have the right init=.
There are also two other benefits of this over having the distribution
maintain a symlink:
1. One of the value propositions over simply having distributions
maintain a /sbin/init symlink via a package is that it also frees
distributions which have a preferred default, but not mandatory, init
system from having their package manager fight with their users for
control of /{s,}bin/init. Instead, the distribution simply makes
their preference known in CONFIG_DEFAULT_INIT, and if the user
installs another init system and uninstalls the default one they can
still make use of /{s,}bin/init and friends for their own uses. This
makes more cases Just Work(tm) without the user having to perform
extra configuration via init=.
2. Since before this we don't know which path the distribution actually
_intends_ to serve init from, we don't pr_err if it is simply
missing, and usually will just silently put the user in a /bin/sh
shell. Now that the distribution can make a declaration of intent, we
can be more vocal when this init system fails to launch for any
reason, even if it's simply because no file exists at that location,
speeding up the palaver of init/mount dependency/etc debugging a bit.
[0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:50:53 +08:00
|
|
|
|
|
|
|
if (CONFIG_DEFAULT_INIT[0] != '\0') {
|
|
|
|
ret = run_init_process(CONFIG_DEFAULT_INIT);
|
|
|
|
if (ret)
|
|
|
|
pr_err("Default init %s failed (error %d)\n",
|
|
|
|
CONFIG_DEFAULT_INIT, ret);
|
|
|
|
else
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-11-13 07:10:21 +08:00
|
|
|
if (!try_to_run_init_process("/sbin/init") ||
|
|
|
|
!try_to_run_init_process("/etc/init") ||
|
|
|
|
!try_to_run_init_process("/bin/init") ||
|
|
|
|
!try_to_run_init_process("/bin/sh"))
|
2012-10-11 09:28:25 +08:00
|
|
|
return 0;
|
2007-02-13 20:26:22 +08:00
|
|
|
|
2013-11-13 07:10:21 +08:00
|
|
|
panic("No working init found. Try passing init= option to kernel. "
|
2016-10-18 20:12:27 +08:00
|
|
|
"See Linux Documentation/admin-guide/init.rst for guidance.");
|
2007-02-13 20:26:22 +08:00
|
|
|
}
|
|
|
|
|
2020-06-07 23:44:28 +08:00
|
|
|
/* Open /dev/console, for stdin/stdout/stderr, this should never fail */
|
2020-07-29 00:23:21 +08:00
|
|
|
void __init console_on_rootfs(void)
|
2018-10-23 22:00:10 +08:00
|
|
|
{
|
2020-06-07 23:44:28 +08:00
|
|
|
struct file *file = filp_open("/dev/console", O_RDWR, 0);
|
2018-10-23 22:00:10 +08:00
|
|
|
|
2020-06-07 23:44:28 +08:00
|
|
|
if (IS_ERR(file)) {
|
2021-01-08 19:48:47 +08:00
|
|
|
pr_err("Warning: unable to open an initial console.\n");
|
|
|
|
return;
|
2020-06-07 23:44:28 +08:00
|
|
|
}
|
2020-07-28 23:49:47 +08:00
|
|
|
init_dup(file);
|
|
|
|
init_dup(file);
|
|
|
|
init_dup(file);
|
|
|
|
fput(file);
|
2018-10-23 22:00:10 +08:00
|
|
|
}
|
|
|
|
|
2012-12-21 14:55:44 +08:00
|
|
|
static noinline void __init kernel_init_freeable(void)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-05-22 03:52:42 +08:00
|
|
|
/* Now the scheduler is fully set up and can do blocking allocations */
|
|
|
|
gfp_allowed_mask = __GFP_BITS_MASK;
|
|
|
|
|
cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the code of
memory policy. Because the memory policy is applied in the process's
context originally. After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.
But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression. But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set. In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 06:31:49 +08:00
|
|
|
/*
|
|
|
|
* init can allocate pages on any node
|
|
|
|
*/
|
2012-12-13 05:51:40 +08:00
|
|
|
set_mems_allowed(node_states[N_MEMORY]);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
pid: take a reference when initializing `cad_pid`
During boot, kernel_init_freeable() initializes `cad_pid` to the init
task's struct pid. Later on, we may change `cad_pid` via a sysctl, and
when this happens proc_do_cad_pid() will increment the refcount on the
new pid via get_pid(), and will decrement the refcount on the old pid
via put_pid(). As we never called get_pid() when we initialized
`cad_pid`, we decrement a reference we never incremented, can therefore
free the init task's struct pid early. As there can be dangling
references to the struct pid, we can later encounter a use-after-free
(e.g. when delivering signals).
This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
have been around since the conversion of `cad_pid` to struct pid in
commit 9ec52099e4b8 ("[PATCH] replace cad_pid by a struct pid") from the
pre-KASAN stone age of v2.6.19.
Fix this by getting a reference to the init task's struct pid when we
assign it to `cad_pid`.
Full KASAN splat below.
==================================================================
BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
ns_of_pid include/linux/pid.h:153 [inline]
task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
do_notify_parent+0x308/0xe60 kernel/signal.c:1950
exit_notify kernel/exit.c:682 [inline]
do_exit+0x2334/0x2bd0 kernel/exit.c:845
do_group_exit+0x108/0x2c8 kernel/exit.c:922
get_signal+0x4e4/0x2a88 kernel/signal.c:2781
do_signal arch/arm64/kernel/signal.c:882 [inline]
do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
work_pending+0xc/0x2dc
Allocated by task 0:
slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
slab_alloc_node mm/slub.c:2907 [inline]
slab_alloc mm/slub.c:2915 [inline]
kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
alloc_pid+0xdc/0xc00 kernel/pid.c:180
copy_process+0x2794/0x5e18 kernel/fork.c:2129
kernel_clone+0x194/0x13c8 kernel/fork.c:2500
kernel_thread+0xd4/0x110 kernel/fork.c:2552
rest_init+0x44/0x4a0 init/main.c:687
arch_call_rest_init+0x1c/0x28
start_kernel+0x520/0x554 init/main.c:1064
0x0
Freed by task 270:
slab_free_hook mm/slub.c:1562 [inline]
slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
slab_free mm/slub.c:3161 [inline]
kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
put_pid+0x30/0x48 kernel/pid.c:109
proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
call_write_iter include/linux/fs.h:1977 [inline]
new_sync_write+0x3ac/0x510 fs/read_write.c:518
vfs_write fs/read_write.c:605 [inline]
vfs_write+0x9c4/0x1018 fs/read_write.c:585
ksys_write+0x124/0x240 fs/read_write.c:658
__do_sys_write fs/read_write.c:670 [inline]
__se_sys_write fs/read_write.c:667 [inline]
__arm64_sys_write+0x78/0xb0 fs/read_write.c:667
__invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701
The buggy address belongs to the object at ffff23794dda0000
which belongs to the cache pid of size 224
The buggy address is located 4 bytes inside of
224-byte region [ffff23794dda0000, ffff23794dda00e0)
The buggy address belongs to the page:
page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
head:(____ptrval____) order:1 compound_mapcount:0
flags: 0x3fffc0000010200(slab|head)
raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
==================================================================
Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05 11:01:14 +08:00
|
|
|
cad_pid = get_pid(task_pid(current));
|
2006-10-02 17:19:00 +08:00
|
|
|
|
2008-01-30 20:33:17 +08:00
|
|
|
smp_prepare_cpus(setup_max_cpus);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-09-17 03:49:32 +08:00
|
|
|
workqueue_init();
|
|
|
|
|
2017-04-01 06:11:47 +08:00
|
|
|
init_mm_internals();
|
|
|
|
|
2020-12-10 04:27:31 +08:00
|
|
|
rcu_init_tasks_generic();
|
2005-04-17 06:20:36 +08:00
|
|
|
do_pre_smp_initcalls();
|
2010-11-26 01:38:29 +08:00
|
|
|
lockup_detector_init();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
smp_init();
|
|
|
|
sched_init_smp();
|
|
|
|
|
2020-06-04 06:59:35 +08:00
|
|
|
padata_init();
|
2015-07-01 05:57:27 +08:00
|
|
|
page_alloc_init_late();
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
do_basic_setup();
|
|
|
|
|
2020-08-05 04:47:43 +08:00
|
|
|
kunit_run_all_tests();
|
|
|
|
|
init/initramfs.c: do unpacking asynchronously
Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.
These two patches are independent, but better-together.
The second is a rather trivial patch that simply allows the developer to
change "/sbin/modprobe" to something else - e.g. the empty string, so
that all request_module() during early boot return -ENOENT early, without
even spawning a usermode helper, needlessly synchronizing with the
initramfs unpacking.
The first patch delegates decompressing the initramfs to a worker thread,
allowing do_initcalls() in main.c to proceed to the device_ and late_
initcalls without waiting for that decompression (and populating of
rootfs) to finish. Obviously, some of those later calls may rely on the
initramfs being available, so I've added synchronization points in the
firmware loader and usermodehelper paths - there might be other places
that would need this, but so far no one has been able to think of any
places I have missed.
There's not much to win if most of the functionality needed during boot is
only available as modules. But systems with a custom-made .config and
initramfs can boot faster, partly due to utilizing more than one cpu
earlier, partly by avoiding known-futile modprobe calls (which would still
trigger synchronization with the initramfs unpacking, thus eliminating
most of the first benefit).
This patch (of 2):
Most of the boot process doesn't actually need anything from the
initramfs, until of course PID1 is to be executed. So instead of doing
the decompressing and populating of the initramfs synchronously in
populate_rootfs() itself, push that off to a worker thread.
This is primarily motivated by an embedded ppc target, where unpacking
even the rather modest sized initramfs takes 0.6 seconds, which is long
enough that the external watchdog becomes unhappy that it doesn't get
attention soon enough. By doing the initramfs decompression in a worker
thread, we get to do the device_initcalls and hence start petting the
watchdog much sooner.
Normal desktops might benefit as well. On my mostly stock Ubuntu kernel,
my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
That takes almost two seconds:
[ 0.201454] Trying to unpack rootfs image as initramfs...
[ 1.976633] Freeing initrd memory: 29416K
Before this patch, these lines occur consecutively in dmesg. With this
patch, the timestamps on these two lines is roughly the same as above, but
with 172 lines inbetween - so more than one cpu has been kept busy doing
work that would otherwise only happen after the populate_rootfs()
finished.
Should one of the initcalls done after rootfs_initcall time (i.e., device_
and late_ initcalls) need something from the initramfs (say, a kernel
module or a firmware blob), it will simply wait for the initramfs
unpacking to be done before proceeding, which should in theory make this
completely safe.
But if some driver pokes around in the filesystem directly and not via one
of the official kernel interfaces (i.e. request_firmware*(),
call_usermodehelper*) that theory may not hold - also, I certainly might
have missed a spot when sprinkling wait_for_initramfs(). So there is an
escape hatch in the form of an initramfs_async= command line parameter.
Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 09:05:42 +08:00
|
|
|
wait_for_initramfs();
|
2018-10-23 22:00:10 +08:00
|
|
|
console_on_rootfs();
|
2010-03-03 15:53:19 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* check if there is an early userspace init. If yes, let it do all
|
|
|
|
* the work
|
|
|
|
*/
|
2020-07-22 17:14:02 +08:00
|
|
|
if (init_eaccess(ramdisk_execute_command) != 0) {
|
2005-09-07 06:17:19 +08:00
|
|
|
ramdisk_execute_command = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
prepare_namespace();
|
2005-09-07 06:17:19 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ok, we have completed the initial bootup, and
|
|
|
|
* we're essentially up and running. Get rid of the
|
|
|
|
* initmem segments and start the user-mode stuff..
|
2014-11-05 23:01:15 +08:00
|
|
|
*
|
|
|
|
* rootfs is available now, try loading the public keys
|
|
|
|
* and default modules
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2013-01-19 06:05:56 +08:00
|
|
|
|
2014-11-05 23:01:15 +08:00
|
|
|
integrity_load_keys();
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|