# SPDX-License-Identifier: GPL-2.0-only
config CC_VERSION_TEXT
	string
	default "$(CC_VERSION_TEXT)"
	help
	  This is used in unclear ways:

	  - Re-run Kconfig when the compiler is updated
	    The 'default' property references the environment variable
	    CC_VERSION_TEXT, so it is recorded in include/config/auto.conf.cmd.
	    When the compiler is updated, Kconfig will be invoked.

	  - Ensure full rebuild when the compiler is updated
	    include/linux/compiler-version.h contains this option in the comment
	    line so fixdep adds include/config/CC_VERSION_TEXT into the
	    auto-generated dependency. When the compiler is updated, syncconfig
	    will touch it and then every file will be rebuilt.

config CC_IS_GCC
	def_bool $(success,test "$(cc-name)" = GCC)

config GCC_VERSION
	int
	default $(cc-version) if CC_IS_GCC
	default 0

config CC_IS_CLANG
	def_bool $(success,test "$(cc-name)" = Clang)

config CLANG_VERSION
	int
	default $(cc-version) if CC_IS_CLANG
	default 0

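# Illustrative sketch only, not part of the upstream file. The numeric
# *_VERSION symbols appear to encode MAJOR.MINOR.PATCH as
# MAJOR * 10000 + MINOR * 100 + PATCH (a later check in this file compares
# GCC_VERSION against 110500 for GCC 11.5). Under that assumption, an option
# that needs GCC 12.4 or newer, or any Clang, could be gated as shown here.
# The symbol name EXAMPLE_NEEDS_NEW_CC is hypothetical:
#
# config EXAMPLE_NEEDS_NEW_CC
#	def_bool (CC_IS_GCC && GCC_VERSION >= 120400) || CC_IS_CLANG
#
# Because GCC_VERSION defaults to 0 when the compiler is not GCC, the explicit
# CC_IS_GCC test keeps the comparison from being read as "a very old GCC".
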
config AS_IS_GNU
	def_bool $(success,test "$(as-name)" = GNU)

config AS_IS_LLVM
	def_bool $(success,test "$(as-name)" = LLVM)

config AS_VERSION
	int
	# Use clang version if this is the integrated assembler
	default CLANG_VERSION if AS_IS_LLVM
	default $(as-version)

config LD_IS_BFD
	def_bool $(success,test "$(ld-name)" = BFD)

config LD_VERSION
	int
	default $(ld-version) if LD_IS_BFD
	default 0

config LD_IS_LLD
	def_bool $(success,test "$(ld-name)" = LLD)

config LLD_VERSION
	int
	default $(ld-version) if LD_IS_LLD
	default 0

config RUST_IS_AVAILABLE
	def_bool $(success,$(srctree)/scripts/rust_is_available.sh)
	help
	  This shows whether a suitable Rust toolchain is available (found).

	  Please see Documentation/rust/quick-start.rst for instructions on how
	  to satisfy the build requirements of Rust support.

	  In particular, the Makefile target 'rustavailable' is useful to check
	  why the Rust toolchain is not being detected.

config CC_CAN_LINK
	bool
	default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(USERCFLAGS) $(USERLDFLAGS) $(m64-flag)) if 64BIT
	default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(USERCFLAGS) $(USERLDFLAGS) $(m32-flag))

config CC_CAN_LINK_STATIC
	bool
	default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(USERCFLAGS) $(USERLDFLAGS) $(m64-flag) -static) if 64BIT
	default $(success,$(srctree)/scripts/cc-can-link.sh $(CC) $(CLANG_FLAGS) $(USERCFLAGS) $(USERLDFLAGS) $(m32-flag) -static)

config CC_HAS_ASM_GOTO_OUTPUT
	def_bool $(success,echo 'int foo(int x) { asm goto ("": "=r"(x) ::: bar); return x; bar: return 0; }' | $(CC) -x c - -c -o /dev/null)

config CC_HAS_ASM_GOTO_TIED_OUTPUT
	depends on CC_HAS_ASM_GOTO_OUTPUT
	# Detect buggy gcc and clang, fixed in gcc-11 clang-14.
	def_bool $(success,echo 'int foo(int *x) { asm goto (".long (%l[bar]) - .": "+m"(*x) ::: bar); return *x; bar: return 0; }' | $(CC) -x c - -c -o /dev/null)

config GCC_ASM_GOTO_OUTPUT_WORKAROUND
	bool
	depends on CC_IS_GCC && CC_HAS_ASM_GOTO_OUTPUT
	# Fixed in GCC 14, 13.3, 12.4 and 11.5
	# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921
	default y if GCC_VERSION < 110500
	default y if GCC_VERSION >= 120000 && GCC_VERSION < 120400
	default y if GCC_VERSION >= 130000 && GCC_VERSION < 130300

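# Worked reading of the three defaults above, added as a sketch only. The
# sample compiler versions are illustrations, and the mapping from a release
# number to GCC_VERSION assumes the MAJOR * 10000 + MINOR * 100 + PATCH
# encoding implied by the constants:
#
#   GCC 11.4.0 -> 110400: below 110500                  -> y
#   GCC 11.5.0 -> 110500: matches none of the ranges    -> n
#   GCC 12.3.0 -> 120300: within [120000, 120400)       -> y
#   GCC 13.3.0 -> 130300: not below 130300              -> n
#   GCC 14.1.0 -> 140100: matches none of the ranges    -> n
#
# A bool with no matching default evaluates to n, which is why the fixed
# releases need no explicit "default n" line.
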
config TOOLS_SUPPORT_RELR
	def_bool $(success,env "CC=$(CC)" "LD=$(LD)" "NM=$(NM)" "OBJCOPY=$(OBJCOPY)" $(srctree)/scripts/tools-support-relr.sh)

config CC_HAS_ASM_INLINE
	def_bool $(success,echo 'void foo(void) { asm inline (""); }' | $(CC) -x c - -c -o /dev/null)

config CC_HAS_NO_PROFILE_FN_ATTR
	def_bool $(success,echo '__attribute__((no_profile_instrument_function)) int x();' | $(CC) -x c - -c -o /dev/null -Werror)

config PAHOLE_VERSION
	int
	default $(shell,$(srctree)/scripts/pahole-version.sh $(PAHOLE))

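# The compiler probes above share one pattern: $(success,<command>) evaluates
# to y when the shell command exits with status 0 and to n otherwise, so
# piping a small C snippet into $(CC) tests a compiler feature at
# configuration time. A sketch of one more probe in the same style follows.
# The symbol name CC_HAS_EXAMPLE_ATTR and the choice of the 'cold' attribute
# are illustrative assumptions, not taken from this file:
#
# config CC_HAS_EXAMPLE_ATTR
#	def_bool $(success,echo '__attribute__((cold)) void foo(void);' | $(CC) -x c - -c -o /dev/null -Werror)
#
# As in CC_HAS_NO_PROFILE_FN_ATTR, -Werror matters here: a compiler that only
# warns about an unknown attribute would otherwise still exit successfully.
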
config CONSTRUCTORS
	bool

config IRQ_WORK
	bool

config BUILDTIME_TABLE_SORT
	bool

config THREAD_INFO_IN_TASK
	bool
	help
	  Select this to move thread_info off the stack into task_struct. To
	  make this work, an arch will need to remove all thread_info fields
	  except flags and fix any runtime bugs.

	  One subtle change that will be needed is to use try_get_task_stack()
	  and put_task_stack() in save_thread_stack_tsk() and get_wchan().

menu "General setup"

config BROKEN
	bool

config BROKEN_ON_SMP
	bool
	depends on BROKEN || !SMP
	default y

config INIT_ENV_ARG_LIMIT
	int
	default 32 if !UML
	default 128 if UML
	help
	  Maximum of each of the number of arguments and environment
	  variables passed to init from the kernel command line.

config COMPILE_TEST
	bool "Compile also drivers which will not load"
	depends on HAS_IOMEM
	help
	  Some drivers can be compiled on a different platform than they are
	  intended to be run on. Although they cannot be loaded there (or, even
	  when they load, cannot be used due to missing HW support),
	  developers, as opposed to distributors, might still want to build
	  such drivers to compile-test them.

	  If you are a developer and want to build everything available, say Y
	  here. If you are a user/distributor, say N here to exclude useless
	  drivers from being distributed.

config WERROR
	bool "Compile the kernel with warnings as errors"
	default COMPILE_TEST
	help
	  A kernel build should not cause any compiler warnings, and this
	  enables the '-Werror' (for C) and '-Dwarnings' (for Rust) flags
	  to enforce that rule by default. Certain warnings from other tools
	  such as the linker may be upgraded to errors with this option as
	  well.

	  However, if you have a new (or very old) compiler or linker with odd
	  and unusual warnings, or you have some architecture with problems,
	  you may need to disable this config option in order to
	  successfully build the kernel.

	  If in doubt, say Y.

config UAPI_HEADER_TEST
	bool "Compile test UAPI headers"
	depends on HEADERS_INSTALL && CC_CAN_LINK
	help
	  Compile test headers exported to user-space to ensure they are
	  self-contained, i.e. compilable as standalone units.

	  If you are a developer or tester and want to ensure the exported
	  headers are self-contained, say Y here. Otherwise, choose N.

config LOCALVERSION
	string "Local version - append to kernel release"
	help
	  Append an extra string to the end of your kernel version.
	  This will show up when you type uname, for example.
	  The string you set here will be appended after the contents of
	  any files with a filename matching localversion* in your
	  object and source tree, in that order. Your total string can
	  be a maximum of 64 characters.

config LOCALVERSION_AUTO
	bool "Automatically append version information to the version string"
	default y
	depends on !COMPILE_TEST
	help
	  This will try to automatically determine if the current tree is a
	  release tree by looking for git tags that belong to the current
	  top of tree revision.

	  A string of the format -gxxxxxxxx will be added to the localversion
	  if a git-based tree is found. The string generated by this will be
	  appended after any matching localversion* files, and after the value
	  set in CONFIG_LOCALVERSION.

	  (The actual string used here is the first 12 characters produced
	  by running the command:

	    $ git rev-parse --verify HEAD

	  which is done within the script "scripts/setlocalversion".)

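# Illustration only, not part of the upstream file. Assuming a git checkout,
# the 12-character hash that the LOCALVERSION_AUTO help text refers to can be
# previewed with standard tools:
#
#   $ git rev-parse --verify HEAD | cut -c1-12
#
# With CONFIG_LOCALVERSION="-test" (a made-up value), the automatic
# "-g<hash>" part would follow "-test", per the ordering described in the
# help text. Any further components that scripts/setlocalversion may add are
# not covered by this sketch.
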
config BUILD_SALT
	string "Build ID Salt"
	default ""
	help
	  The build ID is used to link binaries and their debug info. Setting
	  this option will use the value in the calculation of the build id.
	  This is mostly useful for distributions which want to ensure the
	  build is unique between builds. It's safe to leave the default.

config HAVE_KERNEL_GZIP
	bool

config HAVE_KERNEL_BZIP2
	bool

config HAVE_KERNEL_LZMA
	bool

config HAVE_KERNEL_XZ
	bool

config HAVE_KERNEL_LZO
	bool

config HAVE_KERNEL_LZ4
	bool

config HAVE_KERNEL_ZSTD
	bool

config HAVE_KERNEL_UNCOMPRESSED
	bool

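# The HAVE_KERNEL_* symbols above have no prompt, so they cannot be set from
# a configuration front end; another symbol has to select them. A sketch of
# how an architecture might advertise the compressors it supports follows.
# EXAMPLE_ARCH is a placeholder name, not a real architecture symbol:
#
# config EXAMPLE_ARCH
#	def_bool y
#	select HAVE_KERNEL_GZIP
#	select HAVE_KERNEL_XZ
#
# Each selected HAVE_KERNEL_* symbol then satisfies the "depends on" line of
# the matching KERNEL_* entry in the compression choice.
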
choice
	prompt "Kernel compression mode"
	default KERNEL_GZIP
	depends on HAVE_KERNEL_GZIP || HAVE_KERNEL_BZIP2 || HAVE_KERNEL_LZMA || HAVE_KERNEL_XZ || HAVE_KERNEL_LZO || HAVE_KERNEL_LZ4 || HAVE_KERNEL_ZSTD || HAVE_KERNEL_UNCOMPRESSED
	help
	  The Linux kernel is a kind of self-extracting executable.
	  Several compression algorithms are available, which differ
	  in efficiency, compression and decompression speed.
	  Compression speed is only relevant when building a kernel.
	  Decompression speed is relevant at each boot.

	  If you have any problems with bzip2 or lzma compressed
	  kernels, mail me (Alain Knaff) <alain@knaff.lu>. (An older
	  version of this functionality (bzip2 only), for 2.4, was
	  supplied by Christian Ludwig)

	  High compression options are mostly useful for users who
	  are low on disk space (embedded systems), but for whom RAM
	  size matters less.

	  If in doubt, select 'gzip'.

config KERNEL_GZIP
	bool "Gzip"
	depends on HAVE_KERNEL_GZIP
	help
	  The old and tried gzip compression. It provides a good balance
	  between compression ratio and decompression speed.

config KERNEL_BZIP2
	bool "Bzip2"
	depends on HAVE_KERNEL_BZIP2
	help
	  Its compression ratio and speed are intermediate.
	  Decompression speed is the slowest among the choices. The kernel
	  size is about 10% smaller with bzip2, in comparison to gzip.
	  Bzip2 uses a large amount of memory. For modern kernels you
	  will need at least 8MB of RAM for booting.

config KERNEL_LZMA
	bool "LZMA"
	depends on HAVE_KERNEL_LZMA
	help
	  This compression algorithm's ratio is the best. Decompression speed
	  is between gzip and bzip2. Compression is the slowest.
	  The kernel size is about 33% smaller with LZMA in comparison to gzip.

config KERNEL_XZ
	bool "XZ"
	depends on HAVE_KERNEL_XZ
	help
	  XZ uses the LZMA2 algorithm and instruction set specific
	  BCJ filters which can improve compression ratio of executable
	  code. The size of the kernel is about 30% smaller with XZ in
	  comparison to gzip. On architectures for which there is a BCJ
	  filter (i386, x86_64, ARM, IA-64, PowerPC, and SPARC), XZ
	  will create a few percent smaller kernel than plain LZMA.

	  The speed is about the same as with LZMA: The decompression
	  speed of XZ is better than that of bzip2 but worse than gzip
	  and LZO. Compression is slow.

lib: add support for LZO-compressed kernels
This patch series adds generic support for creating and extracting
LZO-compressed kernel images, as well as support for using such images on
the x86 and ARM architectures, and support for creating and using
LZO-compressed initrd and initramfs images.
Russell King said:
: Testing on a Cortex A9 model:
: - lzo decompressor is 65% of the time gzip takes to decompress a kernel
: - lzo kernel is 9% larger than a gzip kernel
:
: which I'm happy to say confirms your figures when comparing the two.
:
: However, when comparing your new gzip code to the old gzip code:
: - new is 99% of the size of the old code
: - new takes 42% of the time to decompress than the old code
:
: What this means is that for a proper comparison, the results get even better:
: - lzo is 7.5% larger than the old gzip'd kernel image
: - lzo takes 28% of the time that the old gzip code took
:
: So the expense seems definitely worth the effort. The only reason I
: can think of ever using gzip would be if you needed the additional
: compression (eg, because you have limited flash to store the image.)
:
: I would argue that the default for ARM should therefore be LZO.
This patch:
The lzo compressor is worse than gzip at compression, but faster at
extraction. Here are some figures for an ARM board I'm working on:
Uncompressed size: 3.24Mo
gzip 1.61Mo 0.72s
lzo 1.75Mo 0.48s
So for a compression ratio that is still relatively close to gzip, it's
much faster to extract, at least in that case.
This part contains:
- Makefile routine to support lzo compression
- Fixes to the existing lzo compressor so that it can be used in
compressed kernels
- wrapper around the existing lzo1x_decompress, as it only extracts one
block at a time, while we need to extract a whole file here
- config dialog for kernel compression
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Albin Tonnerre <albin.tonnerre@free-electrons.com>
Tested-by: Wu Zhangjin <wuzhangjin@gmail.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Russell King <rmk@arm.linux.org.uk>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-09 06:42:42 +08:00
|
|
|
config KERNEL_LZO
|
|
|
|
bool "LZO"
|
|
|
|
depends on HAVE_KERNEL_LZO
|
|
|
|
help
|
2012-06-01 07:26:46 +08:00
|
|
|
Its compression ratio is the poorest among the choices. The kernel
|
2010-07-14 17:23:08 +08:00
|
|
|
size is about 10% bigger than gzip; however its speed
|
2010-01-09 06:42:42 +08:00
|
|
|
(both compression and decompression) is the fastest.
|
|
|
|
|
2013-07-09 07:01:46 +08:00
|
|
|
config KERNEL_LZ4
|
|
|
|
bool "LZ4"
|
|
|
|
depends on HAVE_KERNEL_LZ4
|
|
|
|
help
|
|
|
|
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
|
|
|
|
A preliminary version of LZ4 de/compression tool is available at
|
|
|
|
<https://code.google.com/p/lz4/>.
|
|
|
|
|
|
|
|
Its compression ratio is worse than LZO. The size of the kernel
|
|
|
|
is about 8% bigger than LZO. But the decompression speed is
|
|
|
|
faster than LZO.
|
|
|
|
|
2020-07-31 03:08:36 +08:00
|
|
|
config KERNEL_ZSTD
|
|
|
|
bool "ZSTD"
|
|
|
|
depends on HAVE_KERNEL_ZSTD
|
|
|
|
help
|
|
|
|
ZSTD is a compression algorithm targeting intermediate compression
|
|
|
|
with fast decompression speed. It will compress better than GZIP and
|
|
|
|
decompress at around the same speed as LZO, but slower than LZ4. You
|
|
|
|
will need at least 192 KB of RAM for booting. The zstd command
|
|
|
|
line tool is required for compression.
|
|
|
|
|
2018-06-13 03:26:35 +08:00
|
|
|
config KERNEL_UNCOMPRESSED
|
|
|
|
bool "None"
|
|
|
|
depends on HAVE_KERNEL_UNCOMPRESSED
|
|
|
|
help
|
|
|
|
Produce an uncompressed kernel image. This option is usually not what
|
|
|
|
you want. It is useful for debugging the kernel in slow simulation
|
|
|
|
environments, where decompressing and moving the kernel is awfully
|
|
|
|
slow. This option allows early boot code to skip the decompressor
|
|
|
|
and jump straight to the uncompressed kernel image.
|
|
|
|
|
2009-01-05 05:46:17 +08:00
|
|
|
endchoice
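# Illustration (not part of the upstream file): a minimal sketch of the
# .config fragment produced by picking one member of the compression
# choice above, here ZSTD. It assumes the architecture selects
# HAVE_KERNEL_ZSTD along with the other HAVE_KERNEL_* symbols shown;
# members whose HAVE_KERNEL_* dependency is unmet do not appear at all.
# Exactly one member of a choice can be y, so the rest are "not set":
#
#   # CONFIG_KERNEL_LZMA is not set
#   # CONFIG_KERNEL_XZ is not set
#   # CONFIG_KERNEL_LZO is not set
#   # CONFIG_KERNEL_LZ4 is not set
#   CONFIG_KERNEL_ZSTD=y
#
# As the KERNEL_ZSTD help text notes, the zstd command line tool must be
# installed on the build host for this selection to build.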
|
|
|
|
|
init: allow distribution configuration of default init
Some init systems (eg. systemd) have init at their own paths, for
example, /usr/lib/systemd/systemd. A compatibility symlink to one of the
hardcoded init paths is provided by another package, usually named
something like systemd-sysvcompat or similar.
Currently distro maintainers who are hands-off on the bootloader are more
or less required to include those compatibility links as part of their
base distribution, because it's hard to migrate away from them since
there's a risk some users will not get the message to set init= on the
kernel command line appropriately.
Moreover, for distributions where the init system is something the
distribution itself is opinionated about (eg. Arch, which has systemd in
the required `base` package), we could usually reasonably configure this
ahead of time when building the distribution kernel. However, we
currently simply don't have any way to configure the kernel to do this.
Here's an example discussion where removing sysvcompat was discussed by
distro maintainers[0].
This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set
is tried before the hardcoded fallback list. So the order of precedence
is now thus:
1. init= on command line (on failure: panic)
2. CONFIG_DEFAULT_INIT (on failure: try #3)
3. Hardcoded fallback list (on failure: panic)
This new config parameter will allow distribution maintainers to move away
from these compatibility links safely, without having to worry that their
users might not have the right init=.
There are also two other benefits of this over having the distribution
maintain a symlink:
1. One of the value propositions over simply having distributions
maintain a /sbin/init symlink via a package is that it also frees
distributions which have a preferred default, but not mandatory, init
system from having their package manager fight with their users for
control of /{s,}bin/init. Instead, the distribution simply makes
their preference known in CONFIG_DEFAULT_INIT, and if the user
installs another init system and uninstalls the default one they can
still make use of /{s,}bin/init and friends for their own uses. This
makes more cases Just Work(tm) without the user having to perform
extra configuration via init=.
2. Since before this we don't know which path the distribution actually
_intends_ to serve init from, we don't pr_err if it is simply
missing, and usually will just silently put the user in a /bin/sh
shell. Now that the distribution can make a declaration of intent, we
can be more vocal when this init system fails to launch for any
reason, even if it's simply because no file exists at that location,
speeding up the palaver of init/mount dependency/etc debugging a bit.
[0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05 07:50:53 +08:00
|
|
|
config DEFAULT_INIT
|
|
|
|
string "Default init path"
|
|
|
|
default ""
|
|
|
|
help
|
|
|
|
This option determines the default init for the system if no init=
|
|
|
|
option is passed on the kernel command line. If the requested path is
|
|
|
|
not present, we will still move on to attempting further
|
|
|
|
locations (e.g. /sbin/init, etc). If this is empty, we will just use
|
|
|
|
the fallback list when init= is not passed.
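# Illustration (not part of the upstream file): a sketch of how a
# distribution kernel might set this option. The path is the systemd
# example quoted in the commit message for DEFAULT_INIT; it is only an
# assumption about what a given distribution would choose.
#
#   CONFIG_DEFAULT_INIT="/usr/lib/systemd/systemd"
#
# With this fragment, init= on the kernel command line still takes
# precedence. The configured path is tried next, and the hardcoded
# fallback list (e.g. /sbin/init) is consulted only if that path cannot
# be started.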
|
|
|
|
|
uts: make default hostname configurable, rather than always using "(none)"
The "hostname" tool falls back to setting the hostname to "localhost" if
/etc/hostname does not exist. Distribution init scripts have the same
fallback. However, if userspace never calls sethostname, such as when
booting with init=/bin/sh, or otherwise booting a minimal system without
the usual init scripts, the default hostname of "(none)" remains,
unhelpfully appearing in various places such as prompts ("root@(none):~#")
and logs. Furthermore, "(none)" doesn't typically resolve to anything
useful.
Make the default hostname configurable. This removes the need for the
standard fallback, provides a useful default for systems that never call
sethostname, and makes minimal systems that much more useful with less
configuration. Distributions could choose to use "localhost" here to
avoid the fallback, while embedded systems may wish to use a specific
target hostname.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Miller <davem@davemloft.net>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Kel Modderman <kel@otaku42.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-16 06:08:28 +08:00
|
|
|
config DEFAULT_HOSTNAME
|
|
|
|
string "Default hostname"
|
|
|
|
default "(none)"
|
|
|
|
help
|
|
|
|
This option determines the default system hostname before userspace
|
|
|
|
calls sethostname(2). The kernel traditionally uses "(none)" here,
|
|
|
|
but you may wish to use a different default here to make a minimal
|
|
|
|
system more usable with less configuration.
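# Illustration (not part of the upstream file): a sketch of overriding
# the traditional default. "localhost" is the value the commit message
# suggests for distributions; an embedded system would more likely use
# its own target hostname here.
#
#   CONFIG_DEFAULT_HOSTNAME="localhost"
#
# The value only applies until userspace calls sethostname(2), so init
# scripts that set a hostname are unaffected.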
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config SYSVIPC
|
|
|
|
bool "System V IPC"
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2005-04-17 06:20:36 +08:00
|
|
|
Inter Process Communication is a suite of library functions and
|
|
|
|
system calls which let processes (running programs) synchronize and
|
|
|
|
exchange information. It is generally considered to be a good thing,
|
|
|
|
and some programs won't run unless you say Y here. In particular, if
|
|
|
|
you want to run the DOS emulator dosemu under Linux (read the
|
|
|
|
DOSEMU-HOWTO, available from <http://www.tldp.org/docs.html#howto>),
|
|
|
|
you'll need to say Y here.
|
|
|
|
|
|
|
|
You can find documentation about IPC with "info ipc" and also in
|
|
|
|
section 6.4 of the Linux Programmer's Guide, available from
|
|
|
|
<http://www.tldp.org/guides.html>.
|
|
|
|
|
2007-02-14 16:34:06 +08:00
|
|
|
config SYSVIPC_SYSCTL
|
|
|
|
bool
|
|
|
|
depends on SYSVIPC
|
|
|
|
depends on SYSCTL
|
|
|
|
default y
|
|
|
|
|
2022-04-05 15:12:58 +08:00
|
|
|
config SYSVIPC_COMPAT
|
|
|
|
def_bool y
|
|
|
|
depends on COMPAT && SYSVIPC
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config POSIX_MQUEUE
|
|
|
|
bool "POSIX Message Queues"
|
2012-10-03 02:19:29 +08:00
|
|
|
depends on NET
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2005-04-17 06:20:36 +08:00
|
|
|
The POSIX variant of message queues is a part of IPC. In POSIX message
|
|
|
|
queues every message has a priority that determines the order in
|
|
|
|
which processes receive it. If you want to compile and run
|
|
|
|
programs written e.g. for Solaris that use its POSIX message
|
2007-05-09 13:25:13 +08:00
|
|
|
queues (functions mq_*) say Y here.
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
POSIX message queues are visible as a filesystem called 'mqueue'
|
|
|
|
and can be mounted somewhere if you want to do filesystem
|
|
|
|
operations on message queues.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
2009-04-07 10:01:11 +08:00
|
|
|
config POSIX_MQUEUE_SYSCTL
|
|
|
|
bool
|
|
|
|
depends on POSIX_MQUEUE
|
|
|
|
depends on SYSCTL
|
|
|
|
default y
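# Illustration (not part of the upstream file): a sketch of the .config
# fragment that enabling POSIX message queues would typically yield,
# assuming NET and SYSCTL are both enabled. POSIX_MQUEUE_SYSCTL has no
# prompt, so it is never set by hand; it follows from its dependencies
# and "default y".
#
#   CONFIG_NET=y
#   CONFIG_SYSCTL=y
#   CONFIG_POSIX_MQUEUE=y
#   CONFIG_POSIX_MQUEUE_SYSCTL=y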
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte boundary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
config WATCH_QUEUE
|
|
|
|
bool "General notification queue"
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
|
|
|
|
This is a general notification queue for the kernel to pass events to
|
|
|
|
userspace by splicing them into pipes. It can be used in conjunction
|
|
|
|
with watches for key/keyring change notifications and device
|
|
|
|
notifications.
|
|
|
|
|
2022-06-26 17:10:56 +08:00
|
|
|
See Documentation/core-api/watch_queue.rst
|
2020-01-15 01:07:11 +08:00
|
|
|
|
2014-06-05 07:10:50 +08:00
|
|
|
config CROSS_MEMORY_ATTACH
|
|
|
|
bool "Enable process_vm_readv/writev syscalls"
|
|
|
|
depends on MMU
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
Enabling this option adds the system calls process_vm_readv and
|
|
|
|
process_vm_writev which allow a process with the correct privileges
|
2014-08-13 04:46:11 +08:00
|
|
|
to directly read from or write to another process' address space.
|
2014-06-05 07:10:50 +08:00
|
|
|
See the man page for more details.
|
|
|
|
|
2014-04-04 05:48:27 +08:00
|
|
|
config USELIB
|
2022-04-30 05:38:01 +08:00
|
|
|
bool "uselib syscall (for libc5 and earlier)"
|
|
|
|
default ALPHA || M68K || SPARC
|
2014-04-04 05:48:27 +08:00
|
|
|
help
|
|
|
|
This option enables the uselib syscall, a system call used in the
|
|
|
|
dynamic linker from libc5 and earlier. glibc does not use this
|
|
|
|
system call. If you intend to run programs built on libc5 or
|
|
|
|
earlier, you may need to enable this syscall. Current systems
|
|
|
|
running glibc can safely disable this.
|
|
|
|
|
2012-09-09 20:22:07 +08:00
|
|
|
config AUDIT
|
|
|
|
bool "Auditing support"
|
|
|
|
depends on NET
|
|
|
|
help
|
|
|
|
Enable auditing infrastructure that can be used with another
|
|
|
|
kernel subsystem, such as SELinux (which requires this for
|
2016-01-13 22:18:55 +08:00
|
|
|
logging of avc messages output). System call auditing is included
|
|
|
|
on architectures which support it.
|
2012-09-09 20:22:07 +08:00
|
|
|
|
2014-02-25 17:16:24 +08:00
|
|
|
config HAVE_ARCH_AUDITSYSCALL
|
|
|
|
bool
|
|
|
|
|
2012-09-09 20:22:07 +08:00
|
|
|
config AUDITSYSCALL
|
2016-01-13 22:18:55 +08:00
|
|
|
def_bool y
|
2014-02-25 17:16:24 +08:00
|
|
|
depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
|
2012-09-09 20:22:07 +08:00
|
|
|
select FSNOTIFY
|
|
|
|
|
|
|
|
source "kernel/irq/Kconfig"
|
|
|
|
source "kernel/time/Kconfig"
|
2021-05-12 04:35:16 +08:00
|
|
|
source "kernel/bpf/Kconfig"
|
2018-07-31 19:39:32 +08:00
|
|
|
source "kernel/Kconfig.preempt"
|
2012-09-09 20:22:07 +08:00
|
|
|
|
|
|
|
menu "CPU/Task time and stats accounting"
|
|
|
|
|
2012-07-25 13:56:04 +08:00
|
|
|
config VIRT_CPU_ACCOUNTING
|
|
|
|
bool
|
|
|
|
|
2012-09-09 20:56:31 +08:00
|
|
|
choice
|
|
|
|
prompt "Cputime accounting"
|
2022-09-02 16:53:15 +08:00
|
|
|
default TICK_CPU_ACCOUNTING
|
2012-09-09 20:56:31 +08:00
|
|
|
|
|
|
|
# Kind of a stub config for the pure tick based cputime accounting
|
|
|
|
config TICK_CPU_ACCOUNTING
|
|
|
|
bool "Simple tick based cputime accounting"
|
2013-04-26 21:16:31 +08:00
|
|
|
depends on !S390 && !NO_HZ_FULL
|
2012-09-09 20:56:31 +08:00
|
|
|
help
|
|
|
|
This is the basic tick based cputime accounting that maintains
|
|
|
|
statistics about user, system and idle time spent at per-jiffy
|
|
|
|
granularity.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
2012-07-25 13:56:04 +08:00
|
|
|
config VIRT_CPU_ACCOUNTING_NATIVE
|
2012-06-16 21:39:34 +08:00
|
|
|
bool "Deterministic task and CPU time accounting"
|
2013-04-26 21:16:31 +08:00
|
|
|
depends on HAVE_VIRT_CPU_ACCOUNTING && !NO_HZ_FULL
|
2012-07-25 13:56:04 +08:00
|
|
|
select VIRT_CPU_ACCOUNTING
|
2012-06-16 21:39:34 +08:00
|
|
|
help
|
|
|
|
Select this option to enable more accurate task and CPU time
|
|
|
|
accounting. This is done by reading a CPU counter on each
|
|
|
|
kernel entry and exit and on transitions within the kernel
|
|
|
|
between system, softirq and hardirq state, so there is a
|
|
|
|
small performance impact. In the case of s390 or IBM POWER > 5,
|
|
|
|
this also enables accounting of stolen time on logically-partitioned
|
|
|
|
systems.
|
|
|
|
|
2012-07-25 13:56:04 +08:00
|
|
|
config VIRT_CPU_ACCOUNTING_GEN
|
|
|
|
bool "Full dynticks CPU time accounting"
|
2022-06-08 22:40:24 +08:00
|
|
|
depends on HAVE_CONTEXT_TRACKING_USER
|
2013-09-17 06:28:21 +08:00
|
|
|
depends on HAVE_VIRT_CPU_ACCOUNTING_GEN
|
2019-03-05 04:01:31 +08:00
|
|
|
depends on GENERIC_CLOCKEVENTS
|
2012-07-25 13:56:04 +08:00
|
|
|
select VIRT_CPU_ACCOUNTING
|
2022-06-08 22:40:24 +08:00
|
|
|
select CONTEXT_TRACKING_USER
|
2012-07-25 13:56:04 +08:00
|
|
|
help
|
|
|
|
Select this option to enable task and CPU time accounting on full
|
|
|
|
dynticks systems. This accounting is implemented by watching every
|
|
|
|
kernel-user boundary using the context tracking subsystem.
|
|
|
|
The accounting is thus performed at the expense of some significant
|
|
|
|
overhead.
|
|
|
|
|
|
|
|
For now this is only useful if you are working on the full
|
|
|
|
dynticks subsystem development.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
2016-07-13 22:50:02 +08:00
|
|
|
endchoice
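# Illustration (not part of the upstream file): a sketch of the .config
# fragment for the default member of the cputime accounting choice
# above. It assumes a non-s390 architecture with NO_HZ_FULL disabled,
# and that the architecture provides HAVE_VIRT_CPU_ACCOUNTING_GEN and
# HAVE_CONTEXT_TRACKING_USER so the generic alternative is visible.
# VIRT_CPU_ACCOUNTING_NATIVE is omitted on the assumption that
# HAVE_VIRT_CPU_ACCOUNTING is not selected.
#
#   CONFIG_TICK_CPU_ACCOUNTING=y
#   # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
#
# Choosing either VIRT_CPU_ACCOUNTING_* member instead would also set
# the hidden VIRT_CPU_ACCOUNTING symbol through "select".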
|
|
|
|
|
2012-09-09 20:56:31 +08:00
|
|
|
config IRQ_TIME_ACCOUNTING
|
|
|
|
bool "Fine granularity task level IRQ time accounting"
|
2016-07-13 22:50:02 +08:00
|
|
|
depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
|
2012-09-09 20:56:31 +08:00
|
|
|
help
|
|
|
|
Select this option to enable fine granularity task irq time
|
|
|
|
accounting. This is done by reading a timestamp on each
|
|
|
|
transition between softirq and hardirq state, so there can be a
|
|
|
|
small performance impact.
|
|
|
|
|
|
|
|
If in doubt, say N here.
|
|
|
|
|
2018-09-25 17:17:42 +08:00
|
|
|
config HAVE_SCHED_AVG_IRQ
|
|
|
|
def_bool y
|
|
|
|
depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
|
|
|
|
depends on SMP
|
|
|
|
|
2020-02-22 08:52:05 +08:00
|
|
|
config SCHED_THERMAL_PRESSURE
|
2020-07-13 00:59:16 +08:00
|
|
|
bool
|
2020-07-29 21:57:18 +08:00
|
|
|
default y if ARM && ARM_CPU_TOPOLOGY
|
|
|
|
default y if ARM64
|
2020-02-22 08:52:05 +08:00
|
|
|
depends on SMP
|
2020-07-13 00:59:16 +08:00
|
|
|
depends on CPU_FREQ_THERMAL
|
|
|
|
help
|
|
|
|
Select this option to enable thermal pressure accounting in the
|
|
|
|
scheduler. Thermal pressure is the value conveyed to the scheduler
|
|
|
|
that reflects the reduction in CPU compute capacity resulting from
|
|
|
|
thermal throttling. Thermal throttling occurs when the performance of
|
|
|
|
a CPU is capped due to high operating temperatures.
|
|
|
|
|
|
|
|
If selected, the scheduler will be able to balance tasks accordingly,
|
|
|
|
i.e. put less load on throttled CPUs than on non/less throttled ones.
|
|
|
|
|
|
|
|
This requires the architecture to implement
|
2021-11-10 03:57:14 +08:00
|
|
|
arch_update_thermal_pressure() and arch_scale_thermal_pressure().
|
2020-02-22 08:52:05 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config BSD_PROCESS_ACCT
|
|
|
|
bool "BSD Process Accounting"
|
kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.
This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.
When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.
The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.
Also, groups.c is compiled out completely.
In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.
This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.)
The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.
Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-16 07:16:41 +08:00
|
|
|
depends on MULTIUSER
|
2005-04-17 06:20:36 +08:00
|
|
|
help
|
|
|
|
If you say Y here, a user level program will be able to instruct the
|
|
|
|
kernel (via a special system call) to write process accounting
|
|
|
|
information to a file: whenever a process exits, information about
|
|
|
|
that process will be appended to the file by the kernel. The
|
|
|
|
information includes things such as creation time, owning user,
|
|
|
|
command name, memory usage, controlling terminal etc. (the complete
|
|
|
|
list is in the struct acct in <file:include/linux/acct.h>). It is
|
|
|
|
up to the user level program to do useful things with this
|
|
|
|
information. This is generally a good idea, so say Y.
|
|
|
|
|
|
|
|
config BSD_PROCESS_ACCT_V3
|
|
|
|
bool "BSD Process Accounting version 3 file format"
|
|
|
|
depends on BSD_PROCESS_ACCT
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
If you say Y here, the process accounting information is written
|
|
|
|
in a new file format that also logs the process IDs of each
|
2018-08-22 12:58:34 +08:00
|
|
|
process and its parent. Note that this file format is incompatible
|
2005-04-17 06:20:36 +08:00
|
|
|
with previous v0/v1/v2 file formats, so you will need updated tools
|
|
|
|
for processing it. A preliminary version of these tools is available
|
2008-06-18 16:45:13 +08:00
|
|
|
at <http://www.gnu.org/software/acct/>.
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-07-14 15:24:40 +08:00
|
|
|
config TASKSTATS
|
2012-10-03 02:19:29 +08:00
|
|
|
bool "Export task/process statistics through netlink"
|
2006-07-14 15:24:40 +08:00
|
|
|
depends on NET
|
2015-04-16 07:16:41 +08:00
|
|
|
depends on MULTIUSER
|
2006-07-14 15:24:40 +08:00
|
|
|
default n
|
|
|
|
help
|
|
|
|
Export selected statistics for tasks/processes through the
|
|
|
|
generic netlink interface. Unlike BSD process accounting, the
|
|
|
|
statistics are available during the lifetime of tasks/processes as
|
|
|
|
responses to commands. Like BSD accounting, they are sent to user
|
|
|
|
space on task exit.
|
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2006-07-14 15:24:36 +08:00
|
|
|
config TASK_DELAY_ACCT
|
2012-10-03 02:19:29 +08:00
|
|
|
bool "Enable per-task delay accounting"
|
2006-07-14 15:24:41 +08:00
|
|
|
depends on TASKSTATS
|
2015-06-26 02:23:37 +08:00
|
|
|
select SCHED_INFO
|
2006-07-14 15:24:36 +08:00
|
|
|
help
|
|
|
|
Collect information on time spent by a task waiting for system
|
|
|
|
resources like cpu, synchronous block I/O completion and swapping
|
|
|
|
in pages. Such statistics can help in setting a task's priorities
|
|
|
|
relative to other tasks for cpu, io, rss limits etc.
|
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2007-02-10 17:46:44 +08:00
|
|
|
config TASK_XACCT
|
2012-10-03 02:19:29 +08:00
|
|
|
bool "Enable extended accounting over taskstats"
|
2007-02-10 17:46:44 +08:00
|
|
|
depends on TASKSTATS
|
|
|
|
help
|
|
|
|
Collect extended task accounting data and send the data
|
|
|
|
to userland for processing over the taskstats interface.
|
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
|
|
|
config TASK_IO_ACCOUNTING
|
2012-10-03 02:19:29 +08:00
|
|
|
bool "Enable per-task storage I/O accounting"
|
2007-02-10 17:46:44 +08:00
|
|
|
depends on TASK_XACCT
|
|
|
|
help
|
|
|
|
Collect information on the number of bytes of storage I/O which this
|
|
|
|
task has caused.
|
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
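
The sampling and averaging described above can be sketched roughly as follows. This is only an illustration, not the kernel's implementation: the function names are hypothetical, and the exponential decay is an assumption modelled on how load averages are commonly computed (the kernel itself works in fixed-point arithmetic).

```python
import math

SAMPLE_PERIOD_S = 2.0            # sampling interval stated in the text
WINDOWS_S = (10.0, 60.0, 300.0)  # 10s, 1m and 5m averaging windows


def weighted_stall_fraction(per_cpu, period=SAMPLE_PERIOD_S):
    """Average per-CPU stall fractions, weighted by non-idle time.

    per_cpu is a list of (stall_seconds, nonidle_seconds) pairs, one per
    CPU, covering a single sampling period. Idle CPUs get weight zero,
    so they cannot dilute the result.
    """
    total_weight = sum(nonidle for _, nonidle in per_cpu)
    if total_weight == 0:
        return 0.0  # every CPU was idle, so nothing was stalled
    weighted = sum((stall / period) * nonidle for stall, nonidle in per_cpu)
    return weighted / total_weight


def update_averages(averages, sample_pct, period=SAMPLE_PERIOD_S):
    """Fold one new sample into the 10s/1m/5m running averages.

    Assumption: an exponentially decaying average, as used for the
    classic load average. Returns a new list; the input is not modified.
    """
    updated = []
    for avg, window in zip(averages, WINDOWS_S):
        decay = math.exp(-period / window)
        updated.append(avg * decay + sample_pct * (1.0 - decay))
    return updated
```

A caller would compute `weighted_stall_fraction()` once per period, multiply it by 100 and pass the result to `update_averages()`. Short windows react quickly to a new sample while long windows smooth it out.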
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:27 +08:00
|
|
|
config PSI
|
|
|
|
bool "Pressure stall information tracking"
|
2023-07-31 11:07:40 +08:00
|
|
|
select KERNFS
|
2018-10-27 06:06:27 +08:00
|
|
|
help
|
|
|
|
Collect metrics that indicate how overcommitted the CPU, memory,
|
|
|
|
and IO capacity are in the system.
|
|
|
|
|
|
|
|
If you say Y here, the kernel will create /proc/pressure/ with the
|
|
|
|
pressure statistics files cpu, memory, and io. These will indicate
|
|
|
|
the share of walltime in which some or all tasks in the system are
|
|
|
|
delayed due to contention of the respective resource.
|
|
|
|
|
2018-10-27 06:06:31 +08:00
|
|
|
In kernels with cgroup support, cgroups (cgroup2 only) will
|
|
|
|
have cpu.pressure, memory.pressure, and io.pressure files,
|
|
|
|
which aggregate pressure stalls for the grouped tasks only.
|
|
|
|
|
2019-04-17 16:46:08 +08:00
|
|
|
For more details see Documentation/accounting/psi.rst.
|
2018-10-27 06:06:27 +08:00
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2018-12-01 06:09:58 +08:00
|
|
|
config PSI_DEFAULT_DISABLED
|
|
|
|
bool "Require boot parameter to enable pressure stall information tracking"
|
|
|
|
default n
|
|
|
|
depends on PSI
|
|
|
|
help
|
|
|
|
If set, pressure stall information tracking will be disabled
|
2018-12-15 06:17:03 +08:00
|
|
|
by default but can be enabled by passing psi=1 on the
|
|
|
|
kernel commandline during boot.
|
2018-12-01 06:09:58 +08:00
|
|
|
|
2019-02-02 06:21:15 +08:00
|
|
|
This feature adds some code to the task wakeup and sleep
|
|
|
|
paths of the scheduler. The overhead is too low to affect
|
|
|
|
common scheduling-intense workloads in practice (such as
|
|
|
|
webservers, memcache), but it does show up in artificial
|
|
|
|
scheduler stress tests, such as hackbench.
|
|
|
|
|
|
|
|
If you are paranoid and not sure what the kernel will be
|
|
|
|
used for, say Y.
|
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2012-09-09 20:22:07 +08:00
|
|
|
endmenu # "CPU/Task time and stats accounting"
|
2010-09-27 20:45:59 +08:00
|
|
|
|
2017-10-27 10:42:34 +08:00
|
|
|
config CPU_ISOLATION
|
|
|
|
bool "CPU isolation"
|
2018-01-02 19:13:10 +08:00
|
|
|
depends on SMP || COMPILE_TEST
|
2017-12-15 02:18:26 +08:00
|
|
|
default y
|
2017-10-27 10:42:34 +08:00
|
|
|
help
|
|
|
|
Make sure that CPUs running critical tasks are not disturbed by
|
|
|
|
any source of "noise" such as unbound workqueues, timers, kthreads...
|
2017-12-15 02:18:26 +08:00
|
|
|
Unbound jobs get offloaded to housekeeping CPUs. This is driven by
|
|
|
|
the "isolcpus=" boot parameter.
|
|
|
|
|
|
|
|
Say Y if unsure.
|
2017-10-27 10:42:34 +08:00
|
|
|
|
2017-05-17 23:43:40 +08:00
|
|
|
source "kernel/rcu/Kconfig"
|
2009-01-16 04:28:29 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config IKCONFIG
|
2006-10-01 14:27:25 +08:00
|
|
|
tristate "Kernel .config support"
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2005-04-17 06:20:36 +08:00
|
|
|
This option enables the complete Linux kernel ".config" file
|
|
|
|
contents to be saved in the kernel. It provides documentation
|
|
|
|
of which kernel options are used in a running kernel or in an
|
|
|
|
on-disk kernel. This information can be extracted from the kernel
|
|
|
|
image file with the script scripts/extract-ikconfig and used as
|
|
|
|
input to rebuild the current kernel or to build another kernel.
|
|
|
|
It can also be extracted from a running kernel by reading
|
|
|
|
/proc/config.gz if enabled (below).
|
|
|
|
|
|
|
|
config IKCONFIG_PROC
|
|
|
|
bool "Enable access to .config through /proc/config.gz"
|
|
|
|
depends on IKCONFIG && PROC_FS
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2005-04-17 06:20:36 +08:00
|
|
|
This option enables access to the kernel configuration file
|
|
|
|
through /proc/config.gz.
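
As a rough illustration of how a tool might consume this file, the sketch below reads a gzip-compressed configuration and turns it into a dictionary. The helper names are hypothetical, and the line format (`CONFIG_FOO=value` for set options, `# CONFIG_FOO is not set` for disabled ones) is assumed from the conventional ".config" layout rather than taken from this help text.

```python
import gzip


def parse_config(text):
    """Parse ".config"-style text into a {name: value} dictionary.

    Assumed format: "CONFIG_FOO=value" for set options and
    "# CONFIG_FOO is not set" for disabled ones (recorded here as "n").
    Any other line is ignored.
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def read_proc_config(path="/proc/config.gz"):
    """Read and parse the compressed configuration of the running kernel.

    The file only exists when IKCONFIG_PROC is enabled.
    """
    with gzip.open(path, "rt") as handle:
        return parse_config(handle.read())
```

For example, `read_proc_config().get("CONFIG_PSI")` would report whether pressure stall tracking was built into the running kernel, provided the file is present.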
|
|
|
|
|
2019-05-16 05:35:51 +08:00
|
|
|
config IKHEADERS
|
|
|
|
tristate "Enable kernel headers through /sys/kernel/kheaders.tar.xz"
|
|
|
|
depends on SYSFS
|
|
|
|
help
|
|
|
|
This option enables access to the in-kernel headers that are generated during
|
|
|
|
the build process. These can be used to build eBPF tracing programs,
|
|
|
|
or similar programs. If you build the headers as a module, a module called
|
|
|
|
kheaders.ko is built which can be loaded on-demand to get access to headers.
|
Provide in-kernel headers to make extending kernel easier
Introduce in-kernel headers which are made available as an archive
through proc (/proc/kheaders.tar.xz file). This archive makes it
possible to run eBPF and other tracing programs that need to extend the
kernel for tracing purposes without any dependency on the file system
having headers.
A github PR is sent for the corresponding BCC patch at:
https://github.com/iovisor/bcc/pull/2312
On Android and embedded systems, it is common to switch kernels but not
have kernel headers available on the file system. Further once a
different kernel is booted, any headers stored on the file system will
no longer be useful. This is an issue even well known to distros.
By storing the headers as a compressed archive within the kernel, we can
avoid these issues that have been a hindrance for a long time.
The best way to use this feature is by building it in. Several users
have a need for this: when they switch debug kernels, they do not want to
update the filesystem or worry about where to store the headers on
it. However, the feature is also buildable as a module in case the user
desires it not being part of the kernel image. This makes it possible to
load and unload the headers from memory on demand. A tracing program can
load the module, do its operations, and then unload the module to save
kernel memory. The total memory needed is 3.3MB.
By having the archive available at a fixed location independent of
filesystem dependencies and conventions, all debugging tools can
directly refer to the fixed location for the archive, without being concerned
with where the headers live on a typical filesystem, which significantly
simplifies tooling that needs kernel headers.
The code to read the headers is based on /proc/config.gz code and uses
the same technique to embed the headers.
Other approaches were discussed such as having an in-memory mountable
filesystem, but that has drawbacks such as requiring an in-kernel xz
decompressor which we don't have today, and requiring usage of 42 MB of
kernel memory to host the decompressed headers at any time. Also this
approach is simpler than such approaches.
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27 03:04:29 +08:00
|
|
|
|
2007-05-08 15:31:15 +08:00
|
|
|
config LOG_BUF_SHIFT
|
|
|
|
int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
|
2022-02-23 15:47:20 +08:00
|
|
|
range 12 25
|
2008-04-29 15:58:58 +08:00
|
|
|
default 17
|
2014-10-04 07:00:54 +08:00
|
|
|
depends on PRINTK
|
2007-05-08 15:31:15 +08:00
|
|
|
help
|
2014-08-07 07:08:56 +08:00
|
|
|
Select the minimal kernel log buffer size as a power of 2.
|
|
|
|
The final size is affected by LOG_CPU_MAX_BUF_SHIFT config
|
|
|
|
parameter, see below. A larger size might also be forced
|
|
|
|
by the "log_buf_len" boot parameter.
|
|
|
|
|
2008-04-29 15:58:58 +08:00
|
|
|
Examples:
|
2014-08-07 07:08:56 +08:00
|
|
|
17 => 128 KB
|
2008-04-29 15:58:58 +08:00
|
|
|
16 => 64 KB
|
2014-08-07 07:08:56 +08:00
|
|
|
15 => 32 KB
|
|
|
|
14 => 16 KB
|
2007-05-08 15:31:15 +08:00
|
|
|
13 => 8 KB
|
|
|
|
12 => 4 KB
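
The mapping in the examples above is simply two raised to the shift value. A minimal sketch, with a hypothetical helper name and the 12..25 bounds taken from the `range` line of this option:

```python
def log_buf_bytes(shift):
    """Return the log buffer size in bytes for a LOG_BUF_SHIFT value."""
    if not 12 <= shift <= 25:
        raise ValueError("LOG_BUF_SHIFT must be between 12 and 25")
    return 1 << shift  # size is 2 ** shift bytes


def log_buf_kib(shift):
    """Return the same size expressed in KB (1 KB = 1024 bytes)."""
    return log_buf_bytes(shift) // 1024
```

Calling `log_buf_kib(17)` gives 128, matching the default listed above.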
|
|
|
|
|
2014-08-07 07:08:56 +08:00
|
|
|
config LOG_CPU_MAX_BUF_SHIFT
|
|
|
|
int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
|
2014-10-14 06:51:11 +08:00
|
|
|
depends on SMP
|
2014-08-07 07:08:56 +08:00
|
|
|
range 0 21
|
|
|
|
default 12 if !BASE_SMALL
|
|
|
|
default 0 if BASE_SMALL
|
2014-10-04 07:00:54 +08:00
|
|
|
depends on PRINTK
|
2014-08-07 07:08:56 +08:00
|
|
|
help
|
|
|
|
This option allows increasing the default ring buffer size
|
|
|
|
according to the number of CPUs. The value defines the contribution
|
|
|
|
of each CPU as a power of 2. The used space is typically only a few
|
|
|
|
lines; however, it might be much more when problems are reported,
|
|
|
|
e.g. backtraces.
|
|
|
|
|
|
|
|
The increased size means that a new buffer has to be allocated and
|
|
|
|
the original static one is unused. It makes sense only on systems
|
|
|
|
with more CPUs. Therefore this value is used only when the sum of
|
|
|
|
contributions is greater than half of the default kernel ring
|
|
|
|
buffer as defined by LOG_BUF_SHIFT. The default values are set
|
2020-08-11 17:29:23 +08:00
|
|
|
so that more than 16 CPUs are needed to trigger the allocation.
|
2014-08-07 07:08:56 +08:00
|
|
|
|
|
|
|
Also, this option is ignored when the "log_buf_len" kernel parameter is
|
|
|
|
used as it forces an exact (power of two) size of the ring buffer.
|
|
|
|
|
|
|
|
The number of possible CPUs is used for this computation, ignoring
|
2016-06-05 16:47:02 +08:00
|
|
|
hotplugging, making the computation optimal for the worst-case
|
|
|
|
scenario while allowing a simple algorithm to be used from bootup.
|
2014-08-07 07:08:56 +08:00
|
|
|
|
|
|
|
Example shift values and their meaning:
|
|
|
|
17 => 128 KB for each CPU
|
|
|
|
16 => 64 KB for each CPU
|
|
|
|
15 => 32 KB for each CPU
|
|
|
|
14 => 16 KB for each CPU
|
|
|
|
13 => 8 KB for each CPU
|
|
|
|
12 => 4 KB for each CPU
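
The rule stated in the help text (a new buffer is allocated only when the summed per-CPU contributions exceed half of the default buffer) can be sketched as below. The function name is hypothetical, and counting every possible CPU equally is a simplification of this help text; the kernel's actual accounting may differ in detail.

```python
def needs_dynamic_log_buf(num_cpus, log_buf_shift=17, cpu_shift=12):
    """Decide whether the per-CPU contributions trigger a new allocation.

    Defaults mirror LOG_BUF_SHIFT=17 and LOG_CPU_MAX_BUF_SHIFT=12.
    Assumption: every possible CPU contributes 2 ** cpu_shift bytes.
    """
    per_cpu_total = num_cpus * (1 << cpu_shift)
    half_default = (1 << log_buf_shift) // 2
    return per_cpu_total > half_default
```

With the default values, 16 CPUs contribute exactly half of the 128 KB buffer, so the allocation is only triggered from 17 CPUs upward. That is consistent with the "more than 16 CPUs" statement above.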
|
|
|
|
|
printk: Userspace format indexing support
We have a number of systems industry-wide that have a subset of their
functionality that works as follows:
1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.
As a couple of examples of places we have this implemented just inside
Facebook, although this isn't a Facebook-specific problem, we have this
inside our netconsole processing (for alarm classification), and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.
While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.
Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.
As one would expect, monitoring using printk is rather brittle for a
number of reasons -- most notably that the message might disappear
entirely in a new version of the kernel, or that the message may change
in some way that the regex or other classification methods start to
silently fail.
One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be a rare hardware bug which one wants to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.
This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.
Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (eg. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.
This provides a solution to the issue of silently changed or deleted
printks: we record pointers to all printk format strings known at
compile time into a new .printk_index section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/index/<module>, which emits the following format, both
readable by humans and able to be parsed by machines:
$ head -1 vmlinux; shuf -n 5 vmlinux
# <level[,flags]> filename:line function "format"
<5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
<4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
<6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
<6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
<6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.
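
A monitoring tool could consume the index format shown above along these lines. This is a sketch under stated assumptions: the function names and the regular expression are hypothetical, and they only cover the layout given in the header line of the sample output.

```python
import re

# Matches: <level[,flags]> filename:line function "format"
INDEX_LINE = re.compile(
    r'^<(?P<level>[^,>]*)(?:,(?P<flags>[^>]*))?> '
    r'(?P<file>\S+):(?P<line>\d+) '
    r'(?P<function>\S+) '
    r'"(?P<format>.*)"$'
)


def parse_index_line(text):
    """Parse one index line into a dict, or return None for headers."""
    text = text.rstrip("\n")
    if not text or text.startswith("#"):
        return None  # blank line or the "# <level[,flags]> ..." header
    match = INDEX_LINE.match(text)
    if match is None:
        raise ValueError("unrecognised index line: %r" % text)
    entry = match.groupdict()
    entry["line"] = int(entry["line"])
    return entry


def has_format(lines, function, fmt):
    """Check whether a monitored printk format is still present."""
    for text in lines:
        entry = parse_index_line(text)
        if entry and entry["function"] == function and entry["format"] == fmt:
            return True
    return False
```

A health-check daemon could run `has_format()` against each index file after a kernel upgrade and raise an alert when a message it matches on has changed or disappeared.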
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h}
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-06-16 00:52:53 +08:00
|
|
|
config PRINTK_INDEX
|
|
|
|
bool "Printk indexing debugfs interface"
|
|
|
|
depends on PRINTK && DEBUG_FS
|
|
|
|
help
|
|
|
|
Add support for indexing of all printk formats known at compile time
|
|
|
|
at <debugfs>/printk/index/<module>.
|
|
|
|
|
|
|
|
This can be used as part of maintaining daemons which monitor
|
|
|
|
/dev/kmsg, as it permits auditing the printk formats present in a
|
|
|
|
kernel, allowing detection of cases where monitored printks are
|
|
|
|
changed or no longer present.
|
|
|
|
|
|
|
|
There is no additional runtime cost to printk with this enabled.
|
|
|
|
|
2008-05-06 05:19:50 +08:00
|
|
|
#
|
|
|
|
# Architectures with an unreliable sched_clock() should select this:
|
|
|
|
#
|
|
|
|
config HAVE_UNSTABLE_SCHED_CLOCK
|
|
|
|
bool
|
|
|
|
|
2013-06-02 14:39:40 +08:00
|
|
|
config GENERIC_SCHED_CLOCK
|
|
|
|
bool
|
|
|
|
|
2019-06-21 16:42:02 +08:00
|
|
|
menu "Scheduler features"
|
|
|
|
|
|
|
|
config UCLAMP_TASK
|
|
|
|
bool "Enable utilization clamping for RT/FAIR tasks"
|
|
|
|
depends on CPU_FREQ_GOV_SCHEDUTIL
|
|
|
|
help
|
|
|
|
This feature enables the scheduler to track the clamped utilization
|
|
|
|
of each CPU based on RUNNABLE tasks scheduled on that CPU.
|
|
|
|
|
|
|
|
With this option, the user can specify the min and max CPU
|
|
|
|
utilization allowed for RUNNABLE tasks. The max utilization defines
|
|
|
|
the maximum frequency a task should use while the min utilization
|
|
|
|
defines the minimum frequency it should use.
|
|
|
|
|
|
|
|
Both min and max utilization clamp values are hints to the scheduler,
|
|
|
|
aiming at improving its frequency selection policy, but they do not
|
|
|
|
enforce or grant any specific bandwidth for tasks.
|
|
|
|
|
|
|
|
If in doubt, say N.
|
|
|
|
|
|
|
|
config UCLAMP_BUCKETS_COUNT
|
|
|
|
int "Number of supported utilization clamp buckets"
|
|
|
|
range 5 20
|
|
|
|
default 5
|
|
|
|
depends on UCLAMP_TASK
|
|
|
|
help
|
|
|
|
Defines the number of clamp buckets to use. The range of each bucket
|
|
|
|
will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
|
|
|
|
number of clamp buckets the finer their granularity and the higher
|
|
|
|
the precision of clamping aggregation and tracking at run-time.
|
|
|
|
|
|
|
|
For example, with the minimum configuration value we will have 5
|
|
|
|
clamp buckets tracking 20% utilization each. A 25% boosted task will
|
|
|
|
be refcounted in the [20..39]% bucket and will set the bucket clamp
|
|
|
|
effective value to 25%.
|
|
|
|
If a second 30% boosted task should be co-scheduled on the same CPU,
|
|
|
|
that task will be refcounted in the same bucket of the first task and
|
|
|
|
it will boost the bucket clamp effective value to 30%.
|
|
|
|
The clamp effective value of a bucket is reset to its nominal value
|
|
|
|
(20% in the example above) when there are no more tasks refcounted in
|
|
|
|
that bucket.
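
The bucket behaviour in this example can be modelled with a small sketch. It works in percentages for readability, whereas the kernel operates on SCHED_CAPACITY_SCALE units. The class and method names are hypothetical.

```python
class ClampBuckets:
    """Toy model of refcounted utilization clamp buckets (in percent)."""

    def __init__(self, count=5):
        if not 5 <= count <= 20:
            raise ValueError("bucket count must be between 5 and 20")
        self.count = count
        self.width = 100 / count
        self.refcount = [0] * count
        # Each bucket starts at its nominal (lowest) value.
        self.value = [i * self.width for i in range(count)]

    def bucket_of(self, boost_pct):
        """Map a clamp value in percent to its bucket index."""
        if not 0 <= boost_pct <= 100:
            raise ValueError("clamp value must be within 0..100")
        return min(int(boost_pct // self.width), self.count - 1)

    def add(self, boost_pct):
        """Account a task in its bucket, raising the effective value."""
        i = self.bucket_of(boost_pct)
        self.refcount[i] += 1
        self.value[i] = max(self.value[i], boost_pct)
        return i

    def remove(self, boost_pct):
        """Drop a task; the value resets only when the bucket is empty."""
        i = self.bucket_of(boost_pct)
        if self.refcount[i] == 0:
            raise ValueError("bucket is already empty")
        self.refcount[i] -= 1
        if self.refcount[i] == 0:
            self.value[i] = i * self.width
        return i
```

Adding a 25% task and then a 30% task raises the effective value of bucket 1 to 30%. Removing the 30% task leaves it at 30% while the 25% task remains, and the value drops back to the nominal 20% only once the bucket is empty.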
|
|
|
|
|
|
|
|
An additional boost/capping margin can be added to some tasks. In the
|
|
|
|
example above the 25% task will be boosted to 30% until it exits the
|
|
|
|
CPU. If that should be considered not acceptable on certain systems,
|
|
|
|
it's always possible to reduce the margin by increasing the number of
|
|
|
|
clamp buckets to trade off used memory for run-time tracking
|
|
|
|
precision.
|
|
|
|
|
|
|
|
If in doubt, use the default value.
|
|
|
|
|
|
|
|
endmenu
|
|
|
|
|
2012-10-04 07:50:47 +08:00
|
|
|
#
|
|
|
|
# For architectures that want to enable the support for NUMA-affine scheduler
|
|
|
|
# balancing logic:
|
|
|
|
#
|
|
|
|
config ARCH_SUPPORTS_NUMA_BALANCING
|
|
|
|
bool
|
|
|
|
|
2015-09-05 06:47:32 +08:00
|
|
|
#
|
|
|
|
# For architectures that prefer to flush all TLBs after a number of pages
|
|
|
|
# are unmapped instead of sending one IPI per page to flush. The architecture
|
|
|
|
# must provide guarantees on what happens if a clean TLB cache entry is
|
|
|
|
# written after the unmap. Details are in mm/rmap.c near the check for
|
|
|
|
# should_defer_flush. The architecture should also consider if the full flush
|
|
|
|
# and the refill costs are offset by the savings of sending fewer IPIs.
|
|
|
|
config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
|
|
|
|
bool
|
|
|
|
|
2019-11-08 20:22:27 +08:00
|
|
|
config CC_HAS_INT128
|
2020-03-10 18:12:50 +08:00
|
|
|
def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT
|
2019-11-08 20:22:27 +08:00
|
|
|
|
2021-11-14 08:57:25 +08:00
|
|
|
config CC_IMPLICIT_FALLTHROUGH
|
|
|
|
string
|
2021-11-15 10:48:44 +08:00
|
|
|
default "-Wimplicit-fallthrough=5" if CC_IS_GCC && $(cc-option,-Wimplicit-fallthrough=5)
|
2021-11-14 08:57:25 +08:00
|
|
|
default "-Wimplicit-fallthrough" if CC_IS_CLANG && $(cc-option,-Wunreachable-code-fallthrough)
|
|
|
|
|
2024-02-24 01:08:27 +08:00
|
|
|
# Currently, disable gcc-10+ array-bounds globally.
|
gcc: disable '-Warray-bounds' for gcc-13 too
We started disabling '-Warray-bounds' for gcc-12 originally on s390,
because it resulted in some warnings that weren't realistically fixable
(commit 8b202ee21839: "s390: disable -Warray-bounds").
That s390-specific issue was then found to be less common elsewhere, but
generic (see f0be87c42cbd: "gcc-12: disable '-Warray-bounds' universally
for now"), and then later the version check was expanded to
gcc-11 (5a41237ad1d4: "gcc: disable -Warray-bounds for gcc-11 too").
And it turns out that I was much too optimistic in thinking that it's
all going to go away, and here we are with gcc-13 showing all the same
issues. So instead of expanding this one version at a time, let's just
disable it for gcc-11+, and put an end limit to it only when we actually
find a solution.
Yes, I'm sure some of this is because the kernel just does odd things
(like our "container_of()" use, but also knowingly playing games with
things like linker tables and array layouts).
And yes, some of the warnings are likely signs of real bugs, but when
there are hundreds of false positives, that doesn't really help.
Oh well.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-24 00:56:20 +08:00
|
|
|
# It's still broken in gcc-13, so no upper bound yet.
|
2024-02-24 01:08:27 +08:00
|
|
|
config GCC10_NO_ARRAY_BOUNDS
|
2023-01-10 07:04:49 +08:00
|
|
|
def_bool y
|
|
|
|
|
gcc-12: disable '-Warray-bounds' universally for now
In commit 8b202ee21839 ("s390: disable -Warray-bounds") the s390 people
disabled the '-Warray-bounds' warning for gcc-12, because the new logic
in gcc would cause warnings for their use of the S390_lowcore macro,
which accesses absolute pointers.
It turns out gcc-12 has many other issues in this area, so this takes
that s390 warning disable logic, and turns it into a kernel build config
entry instead.
Part of the intent is that we can make this all much more targeted, and
use this config flag to disable it in only particular configurations
that cause problems, with the s390 case as an example:
select GCC12_NO_ARRAY_BOUNDS
and we could do that for other configuration cases that cause issues.
Or we could possibly use the CONFIG_CC_NO_ARRAY_BOUNDS thing in a more
targeted way, and disable the warning only for particular uses: again
the s390 case as an example:
KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds)
but this ends up just doing it globally in the top-level Makefile, since
the current issues are spread fairly widely all over:
KBUILD_CFLAGS-$(CONFIG_CC_NO_ARRAY_BOUNDS) += -Wno-array-bounds
We'll try to limit this later, since the gcc-12 problems are rare enough
that *much* of the kernel can be built with it without disabling this
warning.
Cc: Kees Cook <keescook@chromium.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-10 01:11:12 +08:00
|
|
|
config CC_NO_ARRAY_BOUNDS
|
|
|
|
bool
|
2024-02-24 01:08:27 +08:00
|
|
|
default y if CC_IS_GCC && GCC_VERSION >= 100000 && GCC10_NO_ARRAY_BOUNDS
|
2022-06-10 01:11:12 +08:00
|
|
|
|
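The more targeted variant that the commit message describes could look roughly like the fragment below. This is a minimal sketch, not the upstream code: `ARCH_EXAMPLE` is a hypothetical symbol, and the gate symbol is assumed to be declared as a plain `bool` so that it is only enabled when some configuration selects it.

```kconfig
# Hypothetical sketch: disable -Warray-bounds only where a configuration asks for it.
config GCC10_NO_ARRAY_BOUNDS
	bool

# ARCH_EXAMPLE is an invented symbol standing in for a configuration
# that trips the warning.
config ARCH_EXAMPLE
	bool "Example architecture (hypothetical)"
	select GCC10_NO_ARRAY_BOUNDS

# Promptless result symbol: y only for GCC 10 or newer with the gate selected.
config CC_NO_ARRAY_BOUNDS
	bool
	default y if CC_IS_GCC && GCC_VERSION >= 100000 && GCC10_NO_ARRAY_BOUNDS
```

Under these assumptions CC_NO_ARRAY_BOUNDS evaluates to y only when ARCH_EXAMPLE is enabled and the compiler is GCC 10 or newer, so the KBUILD_CFLAGS line from the commit message would add -Wno-array-bounds for that configuration alone.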
2013-11-19 01:27:06 +08:00
|
|
|
#
|
|
|
|
# For architectures that know their GCC __int128 support is sound
|
|
|
|
#
|
|
|
|
config ARCH_SUPPORTS_INT128
|
|
|
|
bool
|
|
|
|
|
2012-10-04 07:50:47 +08:00
|
|
|
# For architectures that (ab)use NUMA to represent different memory regions
|
|
|
|
# all cpu-local but of different latencies, such as SuperH.
|
|
|
|
#
|
|
|
|
config ARCH_WANT_NUMA_VARIABLE_LOCALITY
|
|
|
|
bool
|
|
|
|
|
|
|
|
config NUMA_BALANCING
|
|
|
|
bool "Memory placement aware NUMA scheduler"
|
|
|
|
depends on ARCH_SUPPORTS_NUMA_BALANCING
|
|
|
|
depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
|
2021-11-06 04:35:27 +08:00
|
|
|
depends on SMP && NUMA && MIGRATION && !PREEMPT_RT
|
2012-10-04 07:50:47 +08:00
|
|
|
help
|
|
|
|
This option adds support for automatic NUMA aware memory/task placement.
|
|
|
|
The mechanism is quite primitive and is based on migrating memory when
|
2013-08-13 23:06:50 +08:00
|
|
|
it has references to the node the task is running on.
|
2012-10-04 07:50:47 +08:00
|
|
|
|
|
|
|
This system will be inactive on UMA systems.
|
|
|
|
|
2014-12-11 07:43:37 +08:00
|
|
|
config NUMA_BALANCING_DEFAULT_ENABLED
|
|
|
|
bool "Automatically enable NUMA aware memory/task placement"
|
|
|
|
default y
|
|
|
|
depends on NUMA_BALANCING
|
|
|
|
help
|
|
|
|
If set, automatic NUMA balancing will be enabled if running on a NUMA
|
|
|
|
machine.
|
|
|
|
|
2009-01-16 05:50:58 +08:00
|
|
|
menuconfig CGROUPS
|
2014-12-21 04:41:11 +08:00
|
|
|
bool "Control Group support"
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Similarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Individual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used to be handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
2014-02-12 00:52:49 +08:00
|
|
|
select KERNFS
|
2009-01-08 10:07:30 +08:00
|
|
|
help
|
2009-01-16 05:50:58 +08:00
|
|
|
This option adds support for grouping sets of processes together, for
|
2009-01-08 10:07:30 +08:00
|
|
|
use with process control subsystems such as Cpusets, CFS, memory
|
|
|
|
controls or device isolation.
|
|
|
|
See
|
2019-06-13 01:53:03 +08:00
|
|
|
- Documentation/scheduler/sched-design-CFS.rst (CFS)
|
2019-06-28 00:08:35 +08:00
|
|
|
- Documentation/admin-guide/cgroup-v1/ (features for grouping, isolation
|
2009-01-16 05:50:59 +08:00
|
|
|
and resource control)
|
2009-01-08 10:07:30 +08:00
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2009-01-16 05:50:58 +08:00
|
|
|
if CGROUPS
|
|
|
|
|
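The `if CGROUPS` line opens a block whose condition is appended as a dependency to every entry up to the matching `endif`. A minimal sketch of that behaviour follows; `CGROUP_EXAMPLE` and `CGROUP_EXAMPLE_FLAT` are hypothetical symbols used only for illustration.

```kconfig
# Form 1: entry written inside an if block, as the controllers in this file are.
if CGROUPS

config CGROUP_EXAMPLE
	bool "Example controller (hypothetical)"

endif # CGROUPS

# Form 2: the same visibility rule spelled out with an explicit dependency.
config CGROUP_EXAMPLE_FLAT
	bool "Example controller, flat form (hypothetical)"
	depends on CGROUPS
```

Both forms hide the prompt and force the symbol to n while CGROUPS is disabled; the block form simply avoids repeating the dependency on every controller.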
mm: memcontrol: lockless page counters
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page. The
counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then only
happens when interfacing with userspace.
The removed locking overhead is noticeable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:
vanilla:
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
132.474343877 seconds time elapsed ( +- 0.21% )
lockless:
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
91.369330729 seconds time elapsed ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.
Notable differences between the old and new API:
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitly requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim up to 2MB (4MB) before the limit
would have been reached. This should be acceptable.
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11 07:42:31 +08:00
|
|
|
config PAGE_COUNTER
|
2019-12-05 08:52:28 +08:00
|
|
|
bool
|
2014-12-11 07:42:31 +08:00
|
|
|
|
2022-07-23 22:28:28 +08:00
|
|
|
config CGROUP_FAVOR_DYNMODS
|
|
|
|
bool "Favor dynamic modification latency reduction by default"
|
|
|
|
help
|
|
|
|
This option enables the "favordynmods" mount option by default
|
|
|
|
which reduces the latencies of dynamic cgroup modifications such
|
|
|
|
as task migrations and controller on/offs at the cost of making
|
|
|
|
hot path operations such as forks and exits more expensive.
|
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2012-08-01 07:43:02 +08:00
|
|
|
config MEMCG
|
2015-12-18 06:19:56 +08:00
|
|
|
bool "Memory controller"
|
2014-12-11 07:42:31 +08:00
|
|
|
select PAGE_COUNTER
|
2013-11-23 07:20:42 +08:00
|
|
|
select EVENTFD
|
2008-03-05 06:28:39 +08:00
|
|
|
help
|
2015-12-18 06:19:56 +08:00
|
|
|
Provides control over the memory footprint of tasks in a cgroup.
|
2008-03-05 06:28:39 +08:00
|
|
|
|
2018-08-18 06:47:25 +08:00
|
|
|
config MEMCG_KMEM
|
|
|
|
bool
|
2023-02-28 00:46:13 +08:00
|
|
|
depends on MEMCG
|
2018-08-18 06:47:25 +08:00
|
|
|
default y
|
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config BLK_CGROUP
|
|
|
|
bool "IO controller"
|
|
|
|
depends on BLOCK
|
2012-08-01 07:42:12 +08:00
|
|
|
default n
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2015-12-18 06:19:57 +08:00
|
|
|
Generic block IO controller cgroup interface. This is the common
|
|
|
|
cgroup interface which should be used by various IO controlling
|
|
|
|
policies.
|
2012-08-01 07:42:12 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
Currently, CFQ IO scheduler uses it to recognize task groups and
|
|
|
|
control disk bandwidth allocation (proportional time slice allocation)
|
|
|
|
to such task groups. It is also used by bio throttling logic in
|
|
|
|
block layer to implement upper limit in IO rates on a device.
|
2011-02-14 17:20:01 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
This option only enables generic Block IO controller infrastructure.
|
|
|
|
One needs to also enable actual IO controlling logic/policy. For
|
|
|
|
enabling proportional weight division of disk bandwidth in CFQ, set
|
2020-04-07 11:12:02 +08:00
|
|
|
CONFIG_BFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
|
2015-12-18 06:19:57 +08:00
|
|
|
CONFIG_BLK_DEV_THROTTLING=y.
|
|
|
|
|
2019-06-28 00:08:35 +08:00
|
|
|
See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
|
2015-12-18 06:19:57 +08:00
|
|
|
|
|
|
|
config CGROUP_WRITEBACK
|
|
|
|
bool
|
|
|
|
depends on MEMCG && BLK_CGROUP
|
|
|
|
default y
|
2011-02-14 17:20:01 +08:00
|
|
|
|
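CGROUP_WRITEBACK, like MEMCG_KMEM before it, has no prompt string, so it is not offered in the configuration menus and its value is derived from its dependencies. The pattern can be sketched as below; all three symbol names are hypothetical, and `def_bool y` is used as the shorthand for `bool` plus `default y`.

```kconfig
# Hypothetical symbols illustrating a derived (promptless) option.
config FEATURE_A
	bool "Feature A (hypothetical)"

config FEATURE_B
	bool "Feature B (hypothetical)"

# No prompt: the user cannot set this directly. With no selects involved,
# it is y when both dependencies are y and n otherwise.
config FEATURE_A_WITH_B
	def_bool y
	depends on FEATURE_A && FEATURE_B
```

Code can then test the single derived symbol instead of repeating the combined condition at every use.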
2010-01-20 20:26:18 +08:00
|
|
|
menuconfig CGROUP_SCHED
|
2015-12-18 06:19:56 +08:00
|
|
|
bool "CPU controller"
|
2010-01-20 20:26:18 +08:00
|
|
|
default n
|
|
|
|
help
|
|
|
|
This feature lets CPU scheduler recognize task groups and control CPU
|
|
|
|
bandwidth allocation to such task groups. It uses cgroups to group
|
|
|
|
tasks.
|
|
|
|
|
|
|
|
if CGROUP_SCHED
|
|
|
|
config FAIR_GROUP_SCHED
|
|
|
|
bool "Group scheduling for SCHED_OTHER"
|
|
|
|
depends on CGROUP_SCHED
|
|
|
|
default CGROUP_SCHED
|
|
|
|
|
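FAIR_GROUP_SCHED uses another symbol as its default value, so it follows its parent unless the user overrides it. A minimal sketch of that idiom, with hypothetical symbol names:

```kconfig
config PARENT_EXAMPLE
	bool "Parent feature (hypothetical)"

# Visible only when the parent is enabled, and then defaults to y
# because the default expression evaluates to the parent's value.
# The prompt still lets the user switch it off.
config CHILD_EXAMPLE
	bool "Child feature (hypothetical)"
	depends on PARENT_EXAMPLE
	default PARENT_EXAMPLE
```

Compare this with the entries here that say `default n`: those stay off until they are explicitly enabled.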
2011-07-22 00:43:28 +08:00
|
|
|
config CFS_BANDWIDTH
|
|
|
|
bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
|
|
|
|
depends on FAIR_GROUP_SCHED
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
This option allows users to define CPU bandwidth rates (limits) for
|
|
|
|
tasks running within the fair group scheduler. Groups with no limit
|
|
|
|
set are considered to be unconstrained and will run with no
|
|
|
|
restriction.
|
2019-06-13 01:53:03 +08:00
|
|
|
See Documentation/scheduler/sched-bwc.rst for more information.
|
2011-07-22 00:43:28 +08:00
|
|
|
|
2010-01-20 20:26:18 +08:00
|
|
|
config RT_GROUP_SCHED
|
|
|
|
bool "Group scheduling for SCHED_RR/FIFO"
|
|
|
|
depends on CGROUP_SCHED
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
This feature lets you explicitly allocate real CPU bandwidth
|
2010-03-24 13:17:19 +08:00
|
|
|
to task groups. If enabled, it will also make it impossible to
|
2010-01-20 20:26:18 +08:00
|
|
|
schedule realtime tasks for non-root users until you allocate
|
|
|
|
realtime bandwidth for them.
|
2019-06-13 01:53:03 +08:00
|
|
|
See Documentation/scheduler/sched-rt-group.rst for more information.
|
2010-01-20 20:26:18 +08:00
|
|
|
|
|
|
|
endif #CGROUP_SCHED
|
|
|
|
|
2022-11-23 04:39:09 +08:00
|
|
|
config SCHED_MM_CID
|
|
|
|
def_bool y
|
|
|
|
depends on SMP && RSEQ
|
|
|
|
|
sched/uclamp: Extend CPU's cgroup controller
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.
Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.
Specifically:
- uclamp.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the uclamp.min
utilization
- uclamp.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the uclamp.max
utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups. This system wide
interface enforces constraints on tasks in the root node.
b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent
c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing to control and restrict task requests.
Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.
Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
related updates.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-22 21:28:06 +08:00
|
|
|
config UCLAMP_TASK_GROUP
|
|
|
|
bool "Utilization clamping per group of tasks"
|
|
|
|
depends on CGROUP_SCHED
|
|
|
|
depends on UCLAMP_TASK
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
This feature enables the scheduler to track the clamped utilization
|
|
|
|
of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
|
|
|
|
|
|
|
|
When this option is enabled, the user can specify a min and max
|
|
|
|
CPU bandwidth which is allowed for each single task in a group.
|
|
|
|
The max bandwidth allows to clamp the maximum frequency a task
|
|
|
|
can use, while the min bandwidth allows to define a minimum
|
|
|
|
frequency a task will always use.
|
|
|
|
|
|
|
|
When task group based utilization clamping is enabled, an eventually
|
|
|
|
specified task-specific clamp value is constrained by the cgroup
|
|
|
|
specified clamp value. Both minimum and maximum task clamping cannot
|
|
|
|
be bigger than the corresponding clamping defined at task group level.
|
|
|
|
|
|
|
|
If in doubt, say N.
|
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config CGROUP_PIDS
|
|
|
|
bool "PIDs controller"
|
|
|
|
help
|
|
|
|
Provides enforcement of process number limits in the scope of a
|
|
|
|
cgroup. Any attempt to fork more processes than is allowed in the
|
|
|
|
cgroup will fail. PIDs are fundamentally a global resource because it
|
|
|
|
is fairly trivial to reach PID exhaustion before you reach even a
|
|
|
|
conservative kmemcg limit. As a result, it is possible to grind a
|
|
|
|
system to halt without being limited by other cgroup policies. The
|
2016-03-05 14:00:56 +08:00
|
|
|
PIDs controller is designed to stop this from happening.
|
2015-12-18 06:19:57 +08:00
|
|
|
|
|
|
|
It should be noted that organisational operations (such as attaching
|
2019-02-02 06:21:01 +08:00
|
|
|
to a cgroup hierarchy) will *not* be blocked by the PIDs controller,
|
2015-12-18 06:19:57 +08:00
|
|
|
since the PIDs limit only affects a process's ability to fork, not to
|
|
|
|
attach to a cgroup.
|
|
|
|
|
2017-01-10 08:02:13 +08:00
|
|
|
config CGROUP_RDMA
|
|
|
|
bool "RDMA controller"
|
|
|
|
help
|
|
|
|
Provides enforcement of RDMA resources defined by IB stack.
|
|
|
|
It is fairly easy for consumers to exhaust RDMA resources, which
|
|
|
|
can result into resource unavailability to other consumers.
|
|
|
|
RDMA controller is designed to stop this from happening.
|
|
|
|
Attaching processes with active RDMA resources to the cgroup
|
|
|
|
hierarchy is allowed even if it can cross the hierarchy's limit.
|
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config CGROUP_FREEZER
|
|
|
|
bool "Freezer controller"
|
|
|
|
help
|
|
|
|
Provides a way to freeze and unfreeze all tasks in a
|
|
|
|
cgroup.
|
|
|
|
|
2016-01-21 07:02:41 +08:00
|
|
|
This option affects the ORIGINAL cgroup interface. The cgroup2 memory
|
|
|
|
controller includes important in-kernel memory consumers by default.
|
|
|
|
|
|
|
|
If you're using cgroup2, say N.
|
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config CGROUP_HUGETLB
|
|
|
|
bool "HugeTLB controller"
|
|
|
|
depends on HUGETLB_PAGE
|
|
|
|
select PAGE_COUNTER
|
2010-04-27 01:27:56 +08:00
|
|
|
default n
|
2015-12-18 06:19:57 +08:00
|
|
|
help
|
|
|
|
Provides a cgroup controller for HugeTLB pages.
|
|
|
|
When you enable this, you can put a per cgroup limit on HugeTLB usage.
|
|
|
|
The limit is enforced during page fault. Since HugeTLB doesn't
|
|
|
|
support page reclaim, enforcing the limit at page fault time implies
|
|
|
|
that the application will get a SIGBUS signal if it tries to access
|
|
|
|
HugeTLB pages beyond its limit. This requires the application to know
|
|
|
|
beforehand how many HugeTLB pages it would require for its use. The
|
|
|
|
control group is tracked in the third page lru pointer. This means
|
|
|
|
that we cannot use the controller with huge pages smaller than 3 pages.
|
2010-04-27 01:27:56 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config CPUSETS
|
|
|
|
bool "Cpuset controller"
|
2017-06-15 01:19:23 +08:00
|
|
|
depends on SMP
|
2015-12-18 06:19:57 +08:00
|
|
|
help
|
|
|
|
This option will let you create and manage CPUSETs which
|
|
|
|
allow dynamically partitioning a system into sets of CPUs and
|
|
|
|
Memory Nodes and assigning tasks to run only within those sets.
|
|
|
|
This is primarily useful on large SMP or NUMA systems.
|
2010-04-27 01:27:56 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
Say N if unsure.
|
2010-04-27 01:27:56 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config PROC_PID_CPUSET
|
|
|
|
bool "Include legacy /proc/<pid>/cpuset file"
|
|
|
|
depends on CPUSETS
|
|
|
|
default y
|
2010-04-27 01:27:56 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config CGROUP_DEVICE
|
|
|
|
bool "Device controller"
|
|
|
|
help
|
|
|
|
Provides a cgroup controller implementing whitelists for
|
|
|
|
devices which a process in the cgroup can mknod or open.
|
|
|
|
|
|
|
|
config CGROUP_CPUACCT
|
|
|
|
bool "Simple CPU accounting controller"
|
|
|
|
help
|
|
|
|
Provides a simple controller for monitoring the
|
|
|
|
total CPU consumed by the tasks in a cgroup.
|
|
|
|
|
|
|
|
config CGROUP_PERF
|
|
|
|
bool "Perf controller"
|
|
|
|
depends on PERF_EVENTS
|
|
|
|
help
|
|
|
|
This option extends the perf per-cpu mode to restrict monitoring
|
|
|
|
to threads which belong to the cgroup specified and run on the
|
2020-03-25 20:45:29 +08:00
|
|
|
designated cpu. Or this can be used to have cgroup ID in samples
|
|
|
|
so that it can monitor performance events among cgroups.
|
2015-12-18 06:19:57 +08:00
|
|
|
|
|
|
|
Say N if unsure.
|
|
|
|
|
2016-11-23 23:52:26 +08:00
|
|
|
config CGROUP_BPF
|
|
|
|
bool "Support for eBPF programs attached to cgroups"
|
2016-12-17 00:33:45 +08:00
|
|
|
depends on BPF_SYSCALL
|
|
|
|
select SOCK_CGROUP_DATA
|
2016-11-23 23:52:26 +08:00
|
|
|
help
|
|
|
|
Allow attaching eBPF programs to a cgroup using the bpf(2)
|
|
|
|
syscall command BPF_PROG_ATTACH.
|
|
|
|
|
|
|
|
In which context these programs are accessed depends on the type
|
|
|
|
of attachment. For instance, programs that are attached using
|
|
|
|
BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
|
|
|
|
inet sockets.
|
|
|
|
|
2021-03-30 12:42:04 +08:00
|
|
|
config CGROUP_MISC
|
|
|
|
bool "Misc resource controller"
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
Provides a controller for miscellaneous resources on a host.
|
|
|
|
|
|
|
|
Miscellaneous scalar resources are the resources on the host system
|
|
|
|
which cannot be abstracted like the other cgroups. This controller
|
|
|
|
tracks and limits the miscellaneous resources used by a process
|
|
|
|
attached to a cgroup hierarchy.
|
|
|
|
|
|
|
|
For more information, please check the misc cgroup section in
|
|
|
|
/Documentation/admin-guide/cgroup-v2.rst.
|
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
config CGROUP_DEBUG
|
2017-06-14 05:18:03 +08:00
|
|
|
bool "Debug controller"
|
2010-04-27 01:27:56 +08:00
|
|
|
default n
|
2017-06-14 05:18:03 +08:00
|
|
|
depends on DEBUG_KERNEL
|
2015-12-18 06:19:57 +08:00
|
|
|
help
|
|
|
|
This option enables a simple controller that exports
|
2017-06-14 05:18:03 +08:00
|
|
|
debugging information about the cgroups framework. This
|
|
|
|
controller is for control cgroup debugging only. Its
|
|
|
|
interfaces are not stable.
|
2010-04-27 01:27:56 +08:00
|
|
|
|
2015-12-18 06:19:57 +08:00
|
|
|
Say N.
|
2015-05-23 05:13:36 +08:00
|
|
|
|
2017-01-10 20:08:06 +08:00
|
|
|
config SOCK_CGROUP_DATA
|
|
|
|
bool
|
|
|
|
default n
|
|
|
|
|
2009-01-16 05:50:58 +08:00
|
|
|
endif # CGROUPS
|
2009-01-08 10:07:57 +08:00
|
|
|
|
2010-10-28 06:34:38 +08:00
|
|
|
menuconfig NAMESPACES
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Namespaces support" if EXPERT
|
kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.
This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.
When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.
The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.
Also, groups.c is compiled out completely.
In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.
This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.)
The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.
Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-16 07:16:41 +08:00
|
|
|
depends on MULTIUSER
|
2011-01-21 06:44:16 +08:00
|
|
|
default !EXPERT
|
2008-02-08 20:18:19 +08:00
|
|
|
help
|
|
|
|
Provides a way to make tasks work with different objects using
|
|
|
|
the same id. For example, the same IPC id may refer to different objects
|
|
|
|
or the same user id or pid may refer to different tasks when used in
|
|
|
|
different namespaces.
|
|
|
|
|
2010-10-28 06:34:38 +08:00
|
|
|
if NAMESPACES
|
|
|
|
|
2008-02-08 20:18:21 +08:00
|
|
|
config UTS_NS
|
|
|
|
bool "UTS namespace"
|
2010-10-28 06:34:37 +08:00
|
|
|
default y
|
2008-02-08 20:18:21 +08:00
|
|
|
help
|
|
|
|
In this namespace tasks see different info provided with the
|
|
|
|
uname() system call.
|
|
|
|
|
ns: Introduce Time Namespace
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to change the date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
2019-11-12 09:26:52 +08:00
|
|
|
config TIME_NS
|
|
|
|
bool "TIME namespace"
|
2019-11-12 09:27:09 +08:00
|
|
|
depends on GENERIC_VDSO_TIME_NS
|
2019-11-12 09:26:52 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
In this namespace boottime and monotonic clocks can be set.
|
|
|
|
The time will keep going at the same pace.
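#
# Illustrative sketch, not part of the upstream file: following the example in
# time_namespaces(7), offsets are written to /proc/<pid>/timens_offsets as
# "<clock-id> <offset-secs> <offset-nanosecs>" before any process has entered
# the new namespace. Assuming a util-linux unshare that supports -T (--time)
# and root privileges:
#
#   unshare -T -- bash --norc
#   echo "monotonic 172800 0" > /proc/$$/timens_offsets
#   echo "boottime 604800 0" > /proc/$$/timens_offsets
#   uptime
#
# Children of that shell (such as uptime) then see the shifted clocks, which
# keep advancing at the normal rate.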
|
|
|
|
|
2008-02-08 20:18:22 +08:00
|
|
|
config IPC_NS
|
|
|
|
bool "IPC namespace"
|
2010-10-28 06:34:38 +08:00
|
|
|
depends on (SYSVIPC || POSIX_MQUEUE)
|
2010-10-28 06:34:37 +08:00
|
|
|
default y
|
2008-02-08 20:18:22 +08:00
|
|
|
help
|
|
|
|
In this namespace tasks work with IPC ids which correspond to
|
2009-04-07 10:01:08 +08:00
|
|
|
different IPC objects in different namespaces.
|
2008-02-08 20:18:22 +08:00
|
|
|
|
2008-02-08 20:18:23 +08:00
|
|
|
config USER_NS
|
2012-10-03 02:19:29 +08:00
|
|
|
bool "User namespace"
|
2011-11-18 02:23:55 +08:00
|
|
|
default n
|
2008-02-08 20:18:23 +08:00
|
|
|
help
|
|
|
|
This allows containers, i.e. vservers, to use user namespaces
|
|
|
|
to provide different user info for different servers.
|
2013-01-26 08:48:31 +08:00
|
|
|
|
|
|
|
When user namespaces are enabled in the kernel it is
|
2016-01-21 07:02:47 +08:00
|
|
|
recommended that the MEMCG option also be enabled and that
|
|
|
|
user-space use the memory control groups to limit the amount
|
|
|
|
of memory that unprivileged users can use.
|
2013-01-26 08:48:31 +08:00
|
|
|
|
2008-02-08 20:18:23 +08:00
|
|
|
If unsure, say N.
|
|
|
|
|
2008-02-08 20:18:24 +08:00
|
|
|
config PID_NS
|
2010-10-28 06:34:37 +08:00
|
|
|
bool "PID Namespaces"
|
2010-10-28 06:34:37 +08:00
|
|
|
default y
|
2008-02-08 20:18:24 +08:00
|
|
|
help
|
2008-07-06 20:48:02 +08:00
|
|
|
Support process id namespaces. This allows having multiple
|
2009-01-26 18:12:25 +08:00
|
|
|
processes with the same pid as long as they are in different
|
2008-02-08 20:18:24 +08:00
|
|
|
pid namespaces. This is a building block of containers.
|
|
|
|
|
2009-01-27 04:25:55 +08:00
|
|
|
config NET_NS
|
|
|
|
bool "Network namespace"
|
2010-10-28 06:34:38 +08:00
|
|
|
depends on NET
|
2010-10-28 06:34:37 +08:00
|
|
|
default y
|
2009-01-27 04:25:55 +08:00
|
|
|
help
|
|
|
|
Allow user space to create what appear to be multiple instances
|
|
|
|
of the network stack.
|
|
|
|
|
2010-10-28 06:34:38 +08:00
|
|
|
endif # NAMESPACES
|
|
|
|
|
2018-08-22 13:01:17 +08:00
|
|
|
config CHECKPOINT_RESTORE
|
|
|
|
bool "Checkpoint/restore support"
|
2022-09-29 15:00:57 +08:00
|
|
|
depends on PROC_FS
|
2018-08-22 13:01:17 +08:00
|
|
|
select PROC_CHILDREN
|
2021-02-06 06:00:12 +08:00
|
|
|
select KCMP
|
2018-08-22 13:01:17 +08:00
|
|
|
default n
|
|
|
|
help
|
|
|
|
Enables additional kernel features for the sake of checkpoint/restore.
|
|
|
|
In particular, it adds auxiliary prctl codes to set up process text,
|
|
|
|
data and heap segment sizes, and a few additional /proc filesystem
|
|
|
|
entries.
|
|
|
|
|
|
|
|
If unsure, say N here.
|
|
|
|
|
sched: Add 'autogroup' scheduling feature: automated per session task groups
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting its nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30 21:18:03 +08:00
|
|
|
config SCHED_AUTOGROUP
|
|
|
|
bool "Automatic process group scheduling"
|
|
|
|
select CGROUPS
|
|
|
|
select CGROUP_SCHED
|
|
|
|
select FAIR_GROUP_SCHED
|
|
|
|
help
|
|
|
|
This option optimizes the scheduler for common desktop workloads by
|
|
|
|
automatically creating and populating task groups. This separation
|
|
|
|
of workloads isolates aggressive CPU burners (like build jobs) from
|
|
|
|
desktop applications. Task group autogeneration is currently based
|
|
|
|
upon task session.
|
|
|
|
|
2010-10-28 06:34:41 +08:00
|
|
|
config RELAY
|
|
|
|
bool "Kernel->user space relay support (formerly relayfs)"
|
relay: Use irq_work instead of plain timer for deferred wakeup
Relay avoids calling wake_up_interruptible() for doing the wakeup of
readers/consumers, waiting for the generation of new data, from the
context of a process which produced the data. This is apparently done to
prevent the possibility of a deadlock in case Scheduler itself is
generating data for the relay, after acquiring rq->lock.
The following patch used a timer (to be scheduled at next jiffy), for
delegating the wakeup to another context.
commit 7c9cb38302e78d24e37f7d8a2ea7eed4ae5f2fa7
Author: Tom Zanussi <zanussi@comcast.net>
Date: Wed May 9 02:34:01 2007 -0700
relay: use plain timer instead of delayed work
relay doesn't need to use schedule_delayed_work() for waking readers
when a simple timer will do.
Scheduling a plain timer, at next jiffies boundary, to do the wakeup
causes a significant wakeup latency for the Userspace client, which makes
relay less suitable for the high-frequency low-payload use cases where the
data gets generated at a very high rate, like multiple sub buffers getting
filled within a milli second. Moreover the timer is re-scheduled on every
newly produced sub buffer so the timer keeps getting pushed out if sub
buffers are filled in a very quick succession (less than a jiffy gap
between filling of 2 sub buffers). As a result relay runs out of sub
buffers to store the new data.
By using irq_work it is ensured that wakeup of userspace client, blocked
in the poll call, is done at earliest (through self IPI or next timer
tick) enabling it to always consume the data in time. Also this makes
relay consistent with printk & ring buffers (trace), as they too use
irq_work for deferred wake up of readers.
[arnd@arndb.de: select CONFIG_IRQ_WORK]
Link: http://lkml.kernel.org/r/20160912154035.3222156-1-arnd@arndb.de
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1472906487-1559-1-git-send-email-akash.goel@intel.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Akash Goel <akash.goel@intel.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-12 04:54:33 +08:00
|
|
|
select IRQ_WORK
|
2010-10-28 06:34:41 +08:00
|
|
|
help
|
|
|
|
This option enables relay interface support in
|
|
|
|
certain file systems (such as debugfs).
|
|
|
|
It is designed to provide an efficient mechanism for tools and
|
|
|
|
facilities to relay large amounts of data from kernel space to
|
|
|
|
user space.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
2007-03-06 17:42:17 +08:00
|
|
|
config BLK_DEV_INITRD
|
|
|
|
bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
|
|
|
|
help
|
|
|
|
The initial RAM filesystem is a ramfs which is loaded by the
|
|
|
|
boot loader (loadlin or lilo) and that is mounted as root
|
|
|
|
before the normal boot procedure. It is typically used to
|
|
|
|
load modules needed to mount the "real" root file system,
|
2016-10-18 20:12:27 +08:00
|
|
|
etc. See <file:Documentation/admin-guide/initrd.rst> for details.
|
2007-03-06 17:42:17 +08:00
|
|
|
|
|
|
|
If RAM disk support (BLK_DEV_RAM) is also included, this
|
|
|
|
also enables initial RAM disk (initrd) support and adds
|
|
|
|
15 Kbytes (more on some other architectures) to the kernel size.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
2007-02-10 17:44:43 +08:00
|
|
|
if BLK_DEV_INITRD
|
|
|
|
|
2005-08-11 02:44:50 +08:00
|
|
|
source "usr/Kconfig"
|
|
|
|
|
2007-02-10 17:44:43 +08:00
|
|
|
endif
|
|
|
|
|
2020-01-11 00:03:32 +08:00
|
|
|
config BOOT_CONFIG
|
|
|
|
bool "Boot config support"
|
2022-04-06 10:31:19 +08:00
|
|
|
select BLK_DEV_INITRD if !BOOT_CONFIG_EMBED
|
2020-01-11 00:03:32 +08:00
|
|
|
help
|
|
|
|
Extra boot config allows the system admin to pass a config file as a
|
|
|
|
complementary extension of the kernel cmdline when booting.
|
2020-01-20 11:23:00 +08:00
|
|
|
The boot config file must be attached at the end of initramfs
|
2020-02-20 20:18:42 +08:00
|
|
|
with checksum, size and magic word.
|
2020-01-20 11:23:00 +08:00
|
|
|
See <file:Documentation/admin-guide/bootconfig.rst> for details.
|
2020-01-11 00:03:32 +08:00
|
|
|
|
|
|
|
If unsure, say Y.
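#
# Illustrative sketch, not part of the upstream file: a minimal bootconfig
# file, here hypothetically named my.bconf, holds "key = value" entries, with
# the "kernel." prefix marking kernel command line parameters:
#
#   kernel.loglevel = 7
#   kernel.panic = 10
#
# Per Documentation/admin-guide/bootconfig.rst, it can be appended to an
# initrd image (the image path below is a placeholder) with the in-tree
# helper, after which the kernel is booted with "bootconfig" on its command
# line:
#
#   tools/bootconfig/bootconfig -a my.bconf /boot/initrd.img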
|
|
|
|
|
2023-02-22 07:27:48 +08:00
|
|
|
config BOOT_CONFIG_FORCE
|
|
|
|
bool "Force unconditional bootconfig processing"
|
|
|
|
depends on BOOT_CONFIG
|
2023-02-22 07:27:49 +08:00
|
|
|
default y if BOOT_CONFIG_EMBED
|
2023-02-22 07:27:48 +08:00
|
|
|
help
|
|
|
|
With this Kconfig option set, BOOT_CONFIG processing is carried
|
|
|
|
out even when the "bootconfig" kernel-boot parameter is omitted.
|
|
|
|
In fact, with this Kconfig option set, there is no way to
|
|
|
|
make the kernel ignore the BOOT_CONFIG-supplied kernel-boot
|
|
|
|
parameters.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
2022-04-06 10:31:19 +08:00
|
|
|
config BOOT_CONFIG_EMBED
|
|
|
|
bool "Embed bootconfig file in the kernel"
|
|
|
|
depends on BOOT_CONFIG
|
|
|
|
help
|
|
|
|
Embed a bootconfig file given by BOOT_CONFIG_EMBED_FILE in the
|
|
|
|
kernel. Usually, the bootconfig file is loaded with the initrd
|
|
|
|
image. But if the system doesn't support initrd, this option will
|
|
|
|
help you by embedding a bootconfig file while building the kernel.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
|
|
|
config BOOT_CONFIG_EMBED_FILE
|
|
|
|
string "Embedded bootconfig file path"
|
|
|
|
depends on BOOT_CONFIG_EMBED
|
|
|
|
help
|
|
|
|
Specify a bootconfig file which will be embedded in the kernel.
|
|
|
|
This bootconfig will be used if there is no initrd or no other
|
|
|
|
bootconfig in the initrd.
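#
# Illustrative sketch, not part of the upstream file: a .config fragment that
# embeds a bootconfig file at build time. The path is a hypothetical
# placeholder.
#
#   CONFIG_BOOT_CONFIG=y
#   CONFIG_BOOT_CONFIG_EMBED=y
#   CONFIG_BOOT_CONFIG_EMBED_FILE="/path/to/my.bconf"
#
# Because BOOT_CONFIG_FORCE defaults to y when BOOT_CONFIG_EMBED is set, the
# embedded file is then processed even without "bootconfig" on the kernel
# command line.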
|
|
|
|
|
2022-05-10 09:29:19 +08:00
|
|
|
config INITRAMFS_PRESERVE_MTIME
|
|
|
|
bool "Preserve cpio archive mtimes in initramfs"
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
Each entry in an initramfs cpio archive carries an mtime value. When
|
|
|
|
enabled, extracted cpio items take this mtime, with directory mtime
|
|
|
|
setting deferred until after creation of any child entries.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
2020-01-11 00:03:32 +08:00
|
|
|
|
2016-04-25 23:35:27 +08:00
|
|
|
choice
|
|
|
|
prompt "Compiler optimization level"
|
2017-10-04 07:53:26 +08:00
|
|
|
default CC_OPTIMIZE_FOR_PERFORMANCE
|
2016-04-25 23:35:27 +08:00
|
|
|
|
|
|
|
config CC_OPTIMIZE_FOR_PERFORMANCE
|
2019-08-21 01:09:40 +08:00
|
|
|
bool "Optimize for performance (-O2)"
|
2016-04-25 23:35:27 +08:00
|
|
|
help
|
|
|
|
This is the default optimization level for the kernel, building
|
|
|
|
with the "-O2" compiler flag for best performance and most
|
|
|
|
helpful compile-time warnings.
|
|
|
|
|
Move size optimization option outside of EMBEDDED menu, mark it EXPERIMENTAL
Also, disable on sparc64 - a number of people report breakage. Probably
a compiler bug, but it's quite possible that it tickles some latent
kernel problem too.
It still defaults to 'y' everywhere else (when enabled through
EXPERIMENTAL), and Dave Jones points out that Fedora (and RHEL4) has
been building with size optimizations for a long time on x86, x86-64,
ia64, s390, s390x, ppc32 and ppc64. So it is really only moderately
experimental, but the sparc64 breakage certainly shows that it can
trigger "issues".
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-15 10:52:21 +08:00
|
|
|
config CC_OPTIMIZE_FOR_SIZE
|
2019-08-21 01:09:40 +08:00
|
|
|
bool "Optimize for size (-Os)"
|
2005-12-15 10:52:21 +08:00
|
|
|
help
|
2019-08-21 01:09:39 +08:00
|
|
|
Choosing this option will pass "-Os" to your compiler resulting
|
|
|
|
in a smaller kernel.
|
2005-12-15 10:52:21 +08:00
|
|
|
|
2016-04-25 23:35:27 +08:00
|
|
|
endchoice
|
|
|
|
|
2018-05-09 21:00:00 +08:00
|
|
|
config HAVE_LD_DEAD_CODE_DATA_ELIMINATION
|
|
|
|
bool
|
|
|
|
help
|
|
|
|
This requires that the arch annotates or otherwise protects
|
|
|
|
its external entry points from being discarded. Linker scripts
|
|
|
|
must also merge .text.*, .data.*, and .bss.* correctly into
|
|
|
|
output sections. Care must be taken not to pull in unrelated
|
|
|
|
sections (e.g., '.text.init'). Typically '.' in section names
|
|
|
|
is used to distinguish them from label names / C identifiers.
|
|
|
|
|
|
|
|
config LD_DEAD_CODE_DATA_ELIMINATION
|
|
|
|
bool "Dead code and data elimination (EXPERIMENTAL)"
|
|
|
|
depends on HAVE_LD_DEAD_CODE_DATA_ELIMINATION
|
|
|
|
depends on EXPERT
|
2018-08-22 21:51:09 +08:00
|
|
|
depends on $(cc-option,-ffunction-sections -fdata-sections)
|
|
|
|
depends on $(ld-option,--gc-sections)
|
2018-05-09 21:00:00 +08:00
|
|
|
help
|
2018-06-24 00:41:51 +08:00
|
|
|
Enable this if you want to do dead code and data elimination with
|
|
|
|
the linker by compiling with -ffunction-sections -fdata-sections,
|
|
|
|
and linking with --gc-sections.
|
2018-05-09 21:00:00 +08:00
|
|
|
|
|
|
|
This can reduce the on-disk and in-memory size of the kernel
|
|
|
|
code and static data, particularly for small configs and
|
|
|
|
on small systems. This has the possibility of introducing
|
|
|
|
a silently broken kernel if the required annotations are not
|
|
|
|
present. This option is not well tested yet, so use at your
|
|
|
|
own risk.
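#
# Illustrative sketch, not part of the upstream file: the same technique on a
# standalone userspace program, assuming GCC or Clang with a GNU-compatible
# linker and a hypothetical source file demo.c:
#
#   cc -ffunction-sections -fdata-sections -c demo.c -o demo.o
#   cc -Wl,--gc-sections -Wl,--print-gc-sections demo.o -o demo
#
# Each function and data object gets its own section, so the linker can drop
# those that nothing references; --print-gc-sections lists what was removed.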
|
|
|
|
|
2020-11-20 04:46:56 +08:00
|
|
|
config LD_ORPHAN_WARN
|
|
|
|
def_bool y
|
|
|
|
depends on ARCH_WANT_LD_ORPHAN_WARN
|
|
|
|
depends on $(ld-option,--orphan-handling=warn)
|
2022-10-25 15:30:23 +08:00
|
|
|
depends on $(ld-option,--orphan-handling=error)
|
|
|
|
|
|
|
|
config LD_ORPHAN_WARN_LEVEL
|
|
|
|
string
|
|
|
|
depends on LD_ORPHAN_WARN
|
|
|
|
default "error" if WERROR
|
|
|
|
default "warn"
|
2020-11-20 04:46:56 +08:00
|
|
|
|
2006-10-01 14:28:13 +08:00
|
|
|
config SYSCTL
|
|
|
|
bool
|
|
|
|
|
2013-05-01 06:28:45 +08:00
|
|
|
config HAVE_UID16
|
|
|
|
bool
|
|
|
|
|
|
|
|
config SYSCTL_EXCEPTION_TRACE
|
|
|
|
bool
|
|
|
|
help
|
|
|
|
Enable support for /proc/sys/debug/exception-trace.
|
|
|
|
|
|
|
|
config SYSCTL_ARCH_UNALIGN_NO_WARN
|
|
|
|
bool
|
|
|
|
help
|
|
|
|
Enable support for /proc/sys/kernel/ignore-unaligned-usertrap.
|
|
|
|
Allows arch to define/use @no_unaligned_warning to possibly warn
|
|
|
|
about unaligned access emulation going on under the hood.
|
|
|
|
|
|
|
|
config SYSCTL_ARCH_UNALIGN_ALLOW
|
|
|
|
bool
|
|
|
|
help
|
|
|
|
Enable support for /proc/sys/kernel/unaligned-trap.
|
|
|
|
Allows arches to define/use @unaligned_enabled to runtime toggle
|
|
|
|
the unaligned access emulation.
|
|
|
|
See arch/parisc/kernel/unaligned.c for reference.
|
|
|
|
|
|
|
|
config HAVE_PCSPKR_PLATFORM
|
|
|
|
bool
|
|
|
|
|
2014-10-24 09:41:08 +08:00
|
|
|
# interpreter that classic socket filters depend on
|
|
|
|
config BPF
|
|
|
|
bool
|
2022-07-10 05:18:49 +08:00
|
|
|
select CRYPTO_LIB_SHA1
|
2014-10-24 09:41:08 +08:00
|
|
|
|
2011-01-21 06:44:16 +08:00
|
|
|
menuconfig EXPERT
|
|
|
|
bool "Configure standard kernel features (expert users)"
|
2011-06-06 09:23:58 +08:00
|
|
|
# Unhide debug options, to make the on-by-default options visible
|
|
|
|
select DEBUG_KERNEL
|
2005-04-17 06:20:36 +08:00
|
|
|
help
|
|
|
|
This option allows certain base kernel options and settings
|
2019-12-05 08:52:28 +08:00
|
|
|
to be disabled or tweaked. This is for specialized
|
|
|
|
environments which can tolerate a "non-standard" kernel.
|
|
|
|
Only use this if you really know what you are doing.
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-09-17 03:15:53 +08:00
|
|
|
config UID16
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable 16-bit UID system calls" if EXPERT
|
kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.
This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.
When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.
The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.
Also, groups.c is compiled out completely.
In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.
This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.)
The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.
Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-16 07:16:41 +08:00
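The commit message above says that, without CONFIG_MULTIUSER, the listed syscalls are compiled out and attempts to use them fail with -ENOSYS. A minimal userspace sketch of how a program might detect this, assuming a Linux system with the usual libc syscall() wrapper; the function name probe_getresuid is hypothetical and not part of the kernel or of the original patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: probe one of the syscalls that the commit message
 * lists as compiled out when CONFIG_MULTIUSER is disabled.
 *
 * Returns  1 if getresuid works (the effective UID is stored in *euid_out),
 *          0 if the kernel reports ENOSYS (syscall not built in),
 *         -1 on any other error.
 */
int probe_getresuid(uid_t *euid_out)
{
    uid_t ruid = 0, euid = 0, suid = 0;

    assert(euid_out != NULL);

    if (syscall(SYS_getresuid, &ruid, &euid, &suid) == 0) {
        *euid_out = euid;
        return 1;
    }
    return errno == ENOSYS ? 0 : -1;
}
```

On a kernel built with the default MULTIUSER=y the helper is expected to return 1; a return of 0 would suggest the option was configured out.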
|
|
|
depends on HAVE_UID16 && MULTIUSER
|
2006-09-17 03:15:53 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
This enables the legacy 16-bit UID syscall wrappers.
|
|
|
|
|
2015-04-16 07:16:41 +08:00
|
|
|
config MULTIUSER
|
|
|
|
bool "Multiple users, groups and capabilities support" if EXPERT
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
This option enables support for non-root users, groups and
|
|
|
|
capabilities.
|
|
|
|
|
|
|
|
If you say N here, all processes will run with UID 0, GID 0, and all
|
|
|
|
possible capabilities. Saying N here also compiles out support for
|
|
|
|
system calls related to UIDs, GIDs, and capabilities, such as setuid,
|
|
|
|
setgid, and capset.
|
|
|
|
|
|
|
|
If unsure, say Y here.
|
|
|
|
|
2014-06-05 07:11:12 +08:00
|
|
|
config SGETMASK_SYSCALL
|
|
|
|
bool "sgetmask/ssetmask syscalls support" if EXPERT
|
2018-03-08 06:30:54 +08:00
|
|
|
def_bool PARISC || M68K || PPC || MIPS || X86 || SPARC || MICROBLAZE || SUPERH
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2014-06-05 07:11:12 +08:00
|
|
|
sys_sgetmask and sys_ssetmask are obsolete system calls
|
|
|
|
no longer supported in libc but still enabled by default in some
|
|
|
|
architectures.
|
|
|
|
|
|
|
|
If unsure, leave the default option here.
|
|
|
|
|
2014-04-04 05:48:25 +08:00
|
|
|
config SYSFS_SYSCALL
|
|
|
|
bool "Sysfs syscall support" if EXPERT
|
|
|
|
default y
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2014-04-04 05:48:25 +08:00
|
|
|
sys_sysfs is an obsolete system call no longer supported in libc.
|
|
|
|
Note that disabling this option is more secure but might break
|
|
|
|
compatibility with some systems.
|
|
|
|
|
|
|
|
If unsure, say Y here.
|
|
|
|
|
2017-11-18 07:31:47 +08:00
|
|
|
config FHANDLE
|
|
|
|
bool "open by fhandle syscalls" if EXPERT
|
|
|
|
select EXPORTFS
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
If you say Y here, a user level program will be able to map
|
|
|
|
file names to handles and later use those handles for
|
|
|
|
different file system operations. This is useful in implementing
|
|
|
|
userspace file servers, which now track files using handles instead
|
|
|
|
of names. The handle would remain the same even if file names
|
|
|
|
get renamed. Enables open_by_handle_at(2) and name_to_handle_at(2)
|
|
|
|
syscalls.
|
|
|
|
|
posix-timers: Make them configurable
Some embedded systems have no use for them. This removes about
25KB from the kernel binary size when configured out.
Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime, timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.
The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-11 13:10:10 +08:00
|
|
|
config POSIX_TIMERS
|
|
|
|
bool "Posix Clocks & timers" if EXPERT
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
This adds native support for POSIX timers to the kernel.
|
|
|
|
Some embedded systems have no use for them and therefore they
|
|
|
|
can be configured out to reduce the size of the kernel image.
|
|
|
|
|
|
|
|
When this option is disabled, the following syscalls won't be
|
|
|
|
available: timer_create, timer_gettime, timer_getoverrun,
|
|
|
|
timer_settime, timer_delete, clock_adjtime, getitimer,
|
|
|
|
setitimer, alarm. Furthermore, the clock_settime, clock_gettime,
|
|
|
|
clock_getres and clock_nanosleep syscalls will be limited to
|
|
|
|
CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME only.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
2005-05-01 23:59:02 +08:00
|
|
|
config PRINTK
|
|
|
|
default y
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable support for printk" if EXPERT
|
2012-10-13 00:00:23 +08:00
|
|
|
select IRQ_WORK
|
2005-05-01 23:59:02 +08:00
|
|
|
help
|
|
|
|
This option enables normal printk support. Removing it
|
|
|
|
eliminates most of the message strings from the kernel image
|
|
|
|
and makes the kernel more or less silent. As this makes it
|
|
|
|
very difficult to diagnose system problems, saying N here is
|
|
|
|
strongly discouraged.
|
|
|
|
|
2005-05-01 23:59:01 +08:00
|
|
|
config BUG
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "BUG() support" if EXPERT
|
2005-05-01 23:59:01 +08:00
|
|
|
default y
|
|
|
|
help
|
2019-12-05 08:52:28 +08:00
|
|
|
Disabling this option eliminates support for BUG and WARN, reducing
|
|
|
|
the size of your kernel image and potentially quietly ignoring
|
|
|
|
numerous fatal conditions. You should only consider disabling this
|
|
|
|
option for embedded systems with no facilities for reporting errors.
|
|
|
|
Just say Y.
|
2005-05-01 23:59:01 +08:00
|
|
|
|
2006-01-08 17:05:25 +08:00
|
|
|
config ELF_CORE
|
2012-10-05 08:15:23 +08:00
|
|
|
depends on COREDUMP
|
2006-01-08 17:05:25 +08:00
|
|
|
default y
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable ELF core dumps" if EXPERT
|
2006-01-08 17:05:25 +08:00
|
|
|
help
|
|
|
|
Enable support for generating core dumps. Disabling saves about 4k.
|
|
|
|
|
2011-06-02 02:05:09 +08:00
|
|
|
|
2008-05-07 18:39:56 +08:00
|
|
|
config PCSPKR_PLATFORM
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable PC-Speaker support" if EXPERT
|
2011-06-02 02:05:09 +08:00
|
|
|
depends on HAVE_PCSPKR_PLATFORM
|
2011-06-02 02:04:59 +08:00
|
|
|
select I8253_LOCK
|
2008-05-07 18:39:56 +08:00
|
|
|
default y
|
|
|
|
help
|
2019-12-05 08:52:28 +08:00
|
|
|
This option allows disabling the internal PC-Speaker
|
|
|
|
support, saving some memory.
|
2008-05-07 18:39:56 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config BASE_FULL
|
|
|
|
default y
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable full-sized data structures for core" if EXPERT
|
2005-04-17 06:20:36 +08:00
|
|
|
help
|
|
|
|
Disabling this option reduces the size of miscellaneous core
|
|
|
|
kernel data structures. This saves memory on small machines,
|
|
|
|
but may reduce performance.
|
|
|
|
|
|
|
|
config FUTEX
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable futex support" if EXPERT
|
2021-10-26 18:03:47 +08:00
|
|
|
depends on !(SPARC32 && SMP)
|
2005-04-17 06:20:36 +08:00
|
|
|
default y
|
2017-08-01 12:31:32 +08:00
|
|
|
imply RT_MUTEXES
|
2005-04-17 06:20:36 +08:00
|
|
|
help
|
|
|
|
Disabling this option will cause the kernel to be built without
|
|
|
|
support for "fast userspace mutexes". The resulting kernel may not
|
|
|
|
run glibc-based applications correctly.
|
|
|
|
|
2017-08-01 12:31:32 +08:00
|
|
|
config FUTEX_PI
|
|
|
|
bool
|
|
|
|
depends on FUTEX && RT_MUTEXES
|
|
|
|
default y
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config EPOLL
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable eventpoll support" if EXPERT
|
2005-04-17 06:20:36 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
Disabling this option will cause the kernel to be built without
|
|
|
|
support for the epoll family of system calls.
|
|
|
|
|
signal/timer/event: signalfd core
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:
http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor is created with the following API:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.
The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can be
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is shown below; which fields are valid depends on the
(->code & __SI_MASK) value, in the same way as for a struct siginfo:
struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_tid */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};
[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 13:23:13 +08:00
|
|
|
config SIGNALFD
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable signalfd() system call" if EXPERT
|
2007-05-11 13:23:13 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
Enable the signalfd() system call, which allows signals to be received
|
|
|
|
on a file descriptor.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
signal/timer/event: timerfd core
This patch introduces a new system call for timer events delivered through
file descriptors. This allows timer events to be used with standard POSIX
poll(2), select(2) and read(2). As a consequence of supporting the Linux
f_op->poll subsystem, they can be used with epoll(2) too.
The system call is defined as:
int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);
The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
a new file descriptor will be created, otherwise the existing "ufd" will be
re-programmed.
The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
specified in the "utmr->it_value" parameter is the expiry time for the timer.
If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
otherwise it's a relative time.
If the time specified in "utmr->it_interval" is not zero (that is, not both
tv_sec == 0 and tv_nsec == 0), this is the period at which the following ticks
should be generated.
The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create a
timerfd without the timer enabled.
The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file, or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2) and
epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
returned.
The read(2) call can be used, and it will return a u32 variable holding the
number of "ticks" that happened on the interface since the last call to
read(2). The read(2) call supports the O_NONBLOCK flag too, and EAGAIN will
be returned if no ticks happened.
A quick test program, shows timerfd working correctly on my amd64 box:
http://www.xmailserver.org/timerfd-test.c
[akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 13:23:16 +08:00
|
|
|
config TIMERFD
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable timerfd() system call" if EXPERT
|
2007-05-11 13:23:16 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
Enable the timerfd() system call, which allows receiving timer
|
|
|
|
events on a file descriptor.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
signal/timer/event: eventfd core
This is a very simple and light file descriptor, that can be used as event
wait/dispatch by userspace (both wait and dispatch) and by the kernel
(dispatch only). It can be used instead of pipe(2) in all cases where those
would simply be used to signal events. Their kernel overhead is much lower
than pipes, and they do not consume two fds. When used in the kernel, it can
offer an fd-bridge to enable, for example, functionalities like KAIO or
syslets/threadlets to signal to an fd the completion of certain operations.
But more in general, an eventfd can be used by the kernel to signal readiness,
in a POSIX poll/select way, of interfaces that would otherwise be incompatible
with it. The API is:
int eventfd(unsigned int count);
The eventfd API accepts an initial "count" parameter, and returns an eventfd
fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to the
internal counter.
The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks (unless
O_NONBLOCK is set, in which case -EAGAIN is returned).
But the eventfd_signal() function can do it, since it's supposed to not sleep
during its operation.
The read(2) function reads the __u64 counter value, and resets the internal
value to zero. If the value read is equal to (__u64) -1, an overflow happened
on the internal counter (due to 2^64 eventfd_signal() posts that have never
been retired - unlikely, but possible).
The write(2) call writes an __u64 count value, and adds it to the current
counter. The eventfd fd supports O_NONBLOCK also.
On the kernel side, we have:
struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);
The eventfd_fget() should be called to get a struct file* from an eventfd fd
(this is an fget() + check of f_op being an eventfd fops pointer).
The kernel can then call eventfd_signal() every time it wants to post an event
to userspace. The eventfd_signal() function can be called from any context.
An eventfd() simple test and bench is available here:
http://www.xmailserver.org/eventfd-bench.c
This is the eventfd-based version of pipetest-4 (pipe(2) based):
http://www.xmailserver.org/pipetest-4.c
Not that performance matters much in the eventfd case, but eventfd-bench
shows almost double the performance of pipetest-4.
[akpm@linux-foundation.org: fix i386 build]
[akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 13:23:19 +08:00
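The commit message above explains the counter semantics: each write(2) adds a __u64 value to the internal counter, and read(2) returns the accumulated value and resets it to zero. A minimal sketch of that behaviour follows. It uses the two-argument eventfd(initval, flags) wrapper from current glibc, not the one-argument prototype quoted in this early commit; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: post two values to a fresh eventfd and read back the
 * accumulated counter. Returns 0 on success, -1 on failure.
 *
 * The caller is expected to pass small values whose sum is non-zero: a
 * blocking read(2) on a zero counter would wait forever.
 */
int eventfd_sum(uint64_t first, uint64_t second, uint64_t *total)
{
    int fd, rc = -1;

    assert(total != NULL);
    assert(first + second > 0);

    fd = eventfd(0, 0);     /* counter starts at 0, blocking mode */
    if (fd < 0)
        return -1;

    /* Each write adds to the counter; one read drains it in a single step. */
    if (write(fd, &first, sizeof(first)) == (ssize_t)sizeof(first) &&
        write(fd, &second, sizeof(second)) == (ssize_t)sizeof(second) &&
        read(fd, total, sizeof(*total)) == (ssize_t)sizeof(*total))
        rc = 0;

    close(fd);
    return rc;
}
```

With a single descriptor serving both the posting and the waiting side, this shows why the commit message contrasts eventfd with pipe(2), which needs two fds for the same job.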
|
|
|
config EVENTFD
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable eventfd() system call" if EXPERT
|
2007-05-11 13:23:19 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
Enable the eventfd() system call, which allows receiving both
|
|
|
|
kernel notifications (e.g. KAIO) and userspace notifications.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config SHMEM
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Use full shmem filesystem" if EXPERT
|
2005-04-17 06:20:36 +08:00
|
|
|
default y
|
|
|
|
depends on MMU
|
|
|
|
help
|
|
|
|
shmem is an internal filesystem used to manage shared memory.
|
|
|
|
It is backed by swap and manages resource limits. It is also exported
|
|
|
|
to userspace as tmpfs if TMPFS is enabled. Disabling this
|
|
|
|
option replaces shmem and tmpfs with the much simpler ramfs code,
|
|
|
|
which may be appropriate on small systems without swap.
|
|
|
|
|
2008-10-16 13:05:12 +08:00
|
|
|
config AIO
|
2011-01-21 06:44:16 +08:00
|
|
|
bool "Enable AIO support" if EXPERT
|
2008-10-16 13:05:12 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
This option enables POSIX asynchronous I/O which may be used
|
2013-05-01 06:28:45 +08:00
|
|
|
by some high performance threaded applications. Disabling
|
|
|
|
this option saves about 7k.
|
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
config IO_URING
|
|
|
|
bool "Enable IO uring support" if EXPERT
|
2019-10-24 21:25:42 +08:00
|
|
|
select IO_WQ
|
2019-01-08 01:46:33 +08:00
|
|
|
default y
|
|
|
|
help
|
|
|
|
This option enables support for the io_uring interface, enabling
|
|
|
|
applications to submit and complete IO through submission and
|
|
|
|
completion rings that are shared between the kernel and application.
|
|
|
|
|
2014-08-18 08:41:09 +08:00
|
|
|
config ADVISE_SYSCALLS
|
|
|
|
bool "Enable madvise/fadvise syscalls" if EXPERT
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
This option enables the madvise and fadvise syscalls, used by
|
|
|
|
applications to advise the kernel about their future memory or file
|
|
|
|
usage, improving performance. If building an embedded system where no
|
|
|
|
applications use these syscalls, you can disable this option to save
|
|
|
|
space.
|
|
|
|
|
sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.
The existing applications of which I am aware that would be improved by
this system call are as follows:
* Through Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)
- Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
- Financial software (https://lkml.org/lkml/2015/3/23/189)
Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().
* Direct users of sys_membarrier
- core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.
To explain the benefit of this scheme, let's introduce two example threads:
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())
In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".
Before the change, we had, for each smp_mb() pair:
Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses
After the change, these pairs become:
Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses
As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).
1) Non-concurrent Thread A vs Thread B accesses:
Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses
In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.
2) Concurrent Thread A vs Thread B accesses
Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses
In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
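As a rough user-space sketch of the pairing above (not part of this patch): the frequent side uses only a compiler barrier when the kernel advertises the shared command, and the infrequent side pays for sys_membarrier(). The sketch_ names, the raw syscall wrapper, and the numeric command values (0 for the query, bit 0 for the shared command) are illustrative assumptions. If the command is unavailable, both sides fall back to a full fence, roughly in the spirit of the dynamic availability check mentioned in the liburcu results.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed command values: 0 queries support, bit 0 is the shared command. */
enum { SKETCH_CMD_QUERY = 0, SKETCH_CMD_SHARED = 1 };

/* Set once by sketch_init() before any other thread is started. */
static int sketch_use_membarrier;

/* Hypothetical wrapper around the raw system call. */
static long sketch_membarrier(int cmd, int flags)
{
	assert(flags == 0);	/* the man page requires flags to be 0 */
#ifdef SYS_membarrier
	return syscall(SYS_membarrier, cmd, flags);
#else
	(void)cmd;
	errno = ENOSYS;
	return -1;
#endif
}

/* Ask the kernel which commands it supports; returns 1 if shared is usable. */
static int sketch_init(void)
{
	long mask = sketch_membarrier(SKETCH_CMD_QUERY, 0);

	sketch_use_membarrier = mask >= 0 && (mask & SKETCH_CMD_SHARED) != 0;
	return sketch_use_membarrier;
}

/* Thread B (frequent): compiler barrier only when membarrier is usable. */
static void sketch_read_side_barrier(void)
{
	if (sketch_use_membarrier)
		__asm__ __volatile__("" ::: "memory");
	else
		__sync_synchronize();	/* full fence fallback */
}

/* Thread A (infrequent): barrier on all running threads via the kernel. */
static int sketch_write_side_barrier(void)
{
	if (sketch_use_membarrier)
		return (int)sketch_membarrier(SKETCH_CMD_SHARED, 0);
	__sync_synchronize();		/* pairs with the fallback fence above */
	return 0;
}
```

Both sides must agree on the mode, which is why the flag is decided once in sketch_init() and never changed afterwards; mixing a compiler barrier on one side with a plain fence on the other would lose the ordering guarantee.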
* Benchmarks
On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)
1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
* User-space user of this system call: Userspace RCU library
Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake-ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barriers anyway, because these are
implied by the scheduler context switches.
Results in liburcu:
Operations in 10s, 6 readers, 2 writers:
memory barriers in reader: 1701557485 reads, 2202847 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes
The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.
An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.
This patch adds the system call to x86 and to asm-generic.
[1] http://urcu.so
membarrier(2) man page:
MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
NAME
membarrier - issue memory barriers on a set of threads
SYNOPSIS
#include <linux/membarrier.h>
int membarrier(int cmd, int flags);
DESCRIPTION
The cmd argument is one of the following:
MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.
MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all
processes running on the system. This command returns 0.
The flags argument needs to be 0. For future extensions.
All memory accesses performed in program order from each targeted
thread are guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():
The pair ordering is detailed as (O: ordered, X: not ordered):
                   barrier()   smp_mb()   sys_membarrier()
barrier()              X          X              O
smp_mb()               X          O              O
sys_membarrier()       O          O              O
RETURN VALUE
On success, this system call returns zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.
ERRORS
ENOSYS System call is not implemented.
EINVAL Invalid arguments.
Linux 2015-04-15 MEMBARRIER(2)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-12 04:07:39 +08:00
|
|
|
config MEMBARRIER
|
|
|
|
bool "Enable membarrier() system call" if EXPERT
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
Enable the membarrier() system call that allows issuing memory
|
|
|
|
barriers across all running threads, which can be used to distribute
|
|
|
|
the cost of user-space memory barriers asymmetrically by transforming
|
|
|
|
pairs of memory barriers into pairs consisting of membarrier() and a
|
|
|
|
compiler barrier.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
2017-11-18 07:31:47 +08:00
|
|
|
config KALLSYMS
|
2019-12-05 08:52:28 +08:00
|
|
|
bool "Load all symbols for debugging/ksymoops" if EXPERT
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
Say Y here to let the kernel print out symbolic crash information and
|
|
|
|
symbolic stack backtraces. This increases the size of the kernel
|
|
|
|
somewhat, as all symbols have to be loaded into the kernel image.
|
2017-11-18 07:31:47 +08:00
|
|
|
|
kallsyms: Add self-test facility
Added test cases for basic functions and performance of functions
kallsyms_lookup_name(), kallsyms_on_each_symbol() and
kallsyms_on_each_match_symbol(). It also calculates the compression rate
of the kallsyms compression algorithm for the current symbol set.
The basic functions test begins by testing a set of symbols whose address
values are known. It then traverses all symbol addresses and finds the
corresponding symbol name for each address. It is impossible to
determine whether these addresses are correct, but the three functions
above can be cross-checked against each other using the addresses.
Because traversal with kallsyms_on_each_symbol() is too slow (only 60
symbols can be tested per second), it is tested on average once
every 128 symbols. The other two functions validate all symbols.
If the basic functions test is passed, print only performance test
results. If the test fails, print error information, but do not perform
subsequent performance tests.
Start self-test automatically after system startup if
CONFIG_KALLSYMS_SELFTEST=y.
Example of output content (the prefix 'kallsyms_selftest:' is omitted):
start
---------------------------------------------------------
| nr_symbols | compressed size | original size | ratio(%) |
|---------------------------------------------------------|
| 107543 | 1357912 | 2407433 | 56.40 |
---------------------------------------------------------
kallsyms_lookup_name() looked up 107543 symbols
The time spent on each symbol is (ns): min=630, max=35295, avg=7353
kallsyms_on_each_symbol() traverse all: 11782628 ns
kallsyms_on_each_match_symbol() traverse all: 9261 ns
finish
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-11-15 16:33:48 +08:00
|
|
|
config KALLSYMS_SELFTEST
|
|
|
|
bool "Test the basic functions and performance of kallsyms"
|
|
|
|
depends on KALLSYMS
|
|
|
|
default n
|
|
|
|
help
|
|
|
|
Test the basic functions and performance of some interfaces, such as
|
|
|
|
kallsyms_lookup_name. It also calculates the compression rate of the
|
|
|
|
kallsyms compression algorithm for the current symbol set.
|
|
|
|
|
|
|
|
Start self-test automatically after system startup. Suggest executing
|
|
|
|
"dmesg | grep kallsyms_selftest" to collect test results. "finish" is
|
|
|
|
displayed in the last line, indicating that the test is complete.
|
|
|
|
|
2017-11-18 07:31:47 +08:00
|
|
|
config KALLSYMS_ALL
|
|
|
|
bool "Include all symbols in kallsyms"
|
|
|
|
depends on DEBUG_KERNEL && KALLSYMS
|
|
|
|
help
|
2019-12-05 08:52:28 +08:00
|
|
|
Normally kallsyms only contains the symbols of functions for nicer
|
|
|
|
OOPS messages and backtraces (i.e., symbols from the text and inittext
|
2022-07-07 12:43:29 +08:00
|
|
|
sections). This is sufficient for most cases. All symbols (i.e., names
|
|
|
|
of variables from the data sections, etc.) are required only if you
|
|
|
|
want to enable kernel live patching or for other less common use cases
|
|
|
|
(e.g., when a debugger is used).
|
2017-11-18 07:31:47 +08:00
|
|
|
|
2019-12-05 08:52:28 +08:00
|
|
|
This option makes sure that all symbols are loaded into the kernel
|
|
|
|
image (i.e., symbols from all sections) at the cost of increased kernel
|
|
|
|
size (depending on the kernel configuration, it may be 300KiB or
|
|
|
|
something like this).
|
2017-11-18 07:31:47 +08:00
|
|
|
|
2022-07-07 12:43:29 +08:00
|
|
|
Say N unless you really need all symbols, or kernel live patching.
|
2017-11-18 07:31:47 +08:00
|
|
|
|
|
|
|
config KALLSYMS_ABSOLUTE_PERCPU
|
|
|
|
bool
|
|
|
|
depends on KALLSYMS
|
|
|
|
default X86_64 && SMP
|
|
|
|
|
|
|
|
config KALLSYMS_BASE_RELATIVE
|
|
|
|
bool
|
|
|
|
depends on KALLSYMS
|
2018-03-08 06:30:54 +08:00
|
|
|
default !IA64
|
2017-11-18 07:31:47 +08:00
|
|
|
help
|
|
|
|
Instead of emitting them as absolute values in the native word size,
|
|
|
|
emit the symbol references in the kallsyms table as 32-bit entries,
|
|
|
|
each containing a relative value in the range [base, base + U32_MAX]
|
|
|
|
or, when KALLSYMS_ABSOLUTE_PERCPU is in effect, each containing either
|
|
|
|
an absolute value in the range [0, S32_MAX] or a relative value in the
|
|
|
|
range [base, base + S32_MAX], where base is the lowest relative symbol
|
|
|
|
address encountered in the image.
|
|
|
|
|
|
|
|
On 64-bit builds, this reduces the size of the address table by 50%,
|
|
|
|
but more importantly, it results in entries whose values are build
|
|
|
|
time constants, and no relocation pass is required at runtime to fix
|
|
|
|
up the entries based on the runtime load address of the kernel.
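The decoding implied by the help text above can be sketched roughly as follows. This is an illustration only: the function name is hypothetical, and the convention that non-negative entries are absolute while negative entries encode a relative offset is an assumption chosen to match the two ranges given above, since the help text does not spell out how the two cases are told apart.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical decoder for one 32-bit kallsyms table entry.
 * 'base' stands for the lowest relative symbol address in the image.
 */
static uint64_t sketch_sym_address(uint64_t base, int32_t entry,
				   int absolute_percpu)
{
	if (!absolute_percpu) {
		/* The entry is an unsigned offset in [0, U32_MAX]. */
		assert(base <= UINT64_MAX - UINT32_MAX); /* no wrap-around */
		return base + (uint32_t)entry;
	}

	/* Assumed encoding: non-negative entries are absolute values. */
	if (entry >= 0)
		return (uint64_t)entry;

	/*
	 * Assumed encoding: negative entries are relative, with
	 * offset = -1 - entry, which lands in [base, base + S32_MAX].
	 */
	return base - 1 - (int64_t)entry;
}
```

Whichever encoding is used, every entry is a build-time constant, which is why no relocation pass is needed on the table at runtime.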
|
|
|
|
|
|
|
|
# end of the "standard kernel features (expert users)" menu
|
|
|
|
|
|
|
|
# syscall, maps, verifier
|
2020-03-29 08:43:49 +08:00
|
|
|
|
2018-01-30 04:20:11 +08:00
|
|
|
config ARCH_HAS_MEMBARRIER_CALLBACKS
|
|
|
|
bool
|
|
|
|
|
2018-01-30 04:20:17 +08:00
|
|
|
config ARCH_HAS_MEMBARRIER_SYNC_CORE
|
|
|
|
bool
|
|
|
|
|
2021-02-06 06:00:12 +08:00
|
|
|
config KCMP
|
|
|
|
bool "Enable kcmp() system call" if EXPERT
|
|
|
|
help
|
|
|
|
Enable the kernel resource comparison system call. It provides
|
|
|
|
user-space with the ability to compare two processes to see if they
|
|
|
|
share a common resource, such as a file descriptor or even virtual
|
|
|
|
memory space.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
        getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                  344.0            31.4       11.0
x86-64:                  15.3             2.0        7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
        getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                 2502.0          2250.0        1.1
x86-64:                 117.4            98.0        1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
        getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                  751.0           128.5        5.8
x86-64:                  53.4            28.6        1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
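For context, the "getcpu+atomic" baseline in the tables above might look roughly like the sketch below. The benchmark code itself is not included here, so this is only a guess at its shape: the sketch_ names and the fixed CPU bound are hypothetical, and the raw getcpu system call stands in for whatever lookup the benchmark actually used. The atomic add is needed because the thread may migrate between the CPU lookup and the update, which is the cost rseq is meant to avoid.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SKETCH_MAX_CPUS 1024	/* hypothetical upper bound for the sketch */

static atomic_ulong sketch_counters[SKETCH_MAX_CPUS];

/* Increment the counter slot of the current CPU; returns the slot used. */
static int sketch_counter_inc(void)
{
	unsigned int cpu = 0;

#ifdef SYS_getcpu
	if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
		cpu = 0;	/* lookup failed; fall back to slot 0 */
#endif
	assert(cpu < SKETCH_MAX_CPUS);	/* the sketch assumes a small machine */

	/* Atomic because another thread may be updating the same slot. */
	atomic_fetch_add_explicit(&sketch_counters[cpu], 1,
				  memory_order_relaxed);
	return (int)cpu;
}

/* Read the statistic by summing every per-CPU slot. */
static unsigned long sketch_counter_sum(void)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < SKETCH_MAX_CPUS; i++)
		sum += atomic_load_explicit(&sketch_counters[i],
					    memory_order_relaxed);
	return sum;
}
```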
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 20:43:54 +08:00
|
|
|
config RSEQ
|
|
|
|
bool "Enable rseq() system call" if EXPERT
|
|
|
|
default y
|
|
|
|
depends on HAVE_RSEQ
|
|
|
|
select MEMBARRIER
|
|
|
|
help
|
|
|
|
Enable the restartable sequences system call. It provides a
|
|
|
|
user-space cache for the current CPU number value, which
|
|
|
|
speeds up getting the current CPU number from user-space,
|
|
|
|
as well as an ABI to speed up user-space operations on
|
|
|
|
per-CPU data.
|
|
|
|
|
|
|
|
If unsure, say Y.
|
|
|
|
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
    #include <sys/mman.h>
    struct cachestat_range {
        __u64 off;
        __u64 len;
    };
    struct cachestat {
        __u64 nr_cache;
        __u64 nr_dirty;
        __u64 nr_writeback;
        __u64 nr_evicted;
        __u64 nr_recently_evicted;
    };
    int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
        struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e., no flags specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
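A minimal sketch of invoking the call from user space, assuming no C library wrapper is available. The sketch_ struct and function names are hypothetical local mirrors of the SYNOPSIS above, and the fallback syscall number 451 is an assumption that should be checked against the installed headers for the target architecture.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirrors of the structs in the SYNOPSIS (names are hypothetical). */
struct sketch_cachestat_range {
	uint64_t off;
	uint64_t len;
};

struct sketch_cachestat {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

#ifdef SYS_cachestat
#define SKETCH_NR_CACHESTAT SYS_cachestat
#else
#define SKETCH_NR_CACHESTAT 451	/* assumed number; verify for your arch */
#endif

/*
 * Query page cache statistics for the byte range given by off and len
 * (len == 0 means "from off to the end of the file"). Returns 0 on
 * success, or -1 with errno set.
 */
static int sketch_cachestat(int fd, uint64_t off, uint64_t len,
			    struct sketch_cachestat *out)
{
	struct sketch_cachestat_range range = { .off = off, .len = len };

	assert(out != NULL);
	/* flags is unused for now, so 0 is passed as the text recommends */
	return (int)syscall(SKETCH_NR_CACHESTAT, fd, &range, out, 0);
}
```

A caller would open a file, call sketch_cachestat(fd, 0, 0, &cs) to cover the whole file, and compare cs.nr_cache against the file size in pages; on kernels without the syscall the call fails with ENOSYS.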
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:07 +08:00
|
|
|
config CACHESTAT_SYSCALL
|
|
|
|
bool "Enable cachestat() system call" if EXPERT
|
|
|
|
default y
|
|
|
|
help
|
|
|
|
Enable the cachestat system call, which queries the page cache
|
|
|
|
statistics of a file (number of cached pages, dirty pages,
|
|
|
|
pages marked for writeback, (recently) evicted pages).
|
|
|
|
|
|
|
|
If unsure say Y here.
|
|
|
|
|
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
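For scale, the difference between the two averages can be expressed relative to the CONFIG_RSEQ=n run and compared with the run-to-run noise. The sketch below assumes the 10 runs per configuration are independent, which the text does not state; the variable names are illustrative.

```python
import math

runs = 10                      # hackbench runs per configuration
avg_off, sd_off = 41.37, 0.36  # CONFIG_RSEQ=n, seconds
avg_on, sd_on = 40.46, 0.33    # CONFIG_RSEQ=y, seconds

delta = avg_off - avg_on
relative = delta / avg_off
# Standard error of the difference of two means (independence assumed).
std_err = math.sqrt(sd_off**2 / runs + sd_on**2 / runs)

print(f"difference: {delta:.2f} s ({relative:.1%} faster with CONFIG_RSEQ=y)")
print(f"standard error of the difference: {std_err:.2f} s")
```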
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 20:43:54 +08:00
|
|
|
config DEBUG_RSEQ
|
|
|
|
default n
|
|
|
|
bool "Enable debugging of rseq() system call" if EXPERT
|
|
|
|
depends on RSEQ && DEBUG_KERNEL
|
|
|
|
help
|
|
|
|
Enable extra debugging checks for the rseq system call.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring and analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names (in an ABI-compatible fashion).
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility are not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the number of pending patches
is the smallest: the end of the merge window.
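The order of the -e expressions in the first sed call matters, because each expression sees the output of the previous one. The following sketch mimics that with plain string replacement in Python rather than sed; the sample identifiers are only illustrative input.

```python
# Substitutions from the first sed invocation, in the same order.
RENAMES = [
    ("PERF_EVENT_", "PERF_RECORD_"),
    ("PERF_COUNTER", "PERF_EVENT"),
    ("perf_counter", "perf_event"),
    ("nb_counters", "nb_events"),
    ("swcounter", "swevent"),
    ("tpcounter_event", "tp_event"),
]


def rename(text, rules=RENAMES):
    """Apply each literal substitution in turn, like chained sed -e expressions."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


sample = "PERF_EVENT_MMAP PERF_COUNTER_IOC_ENABLE perf_counter_open"
in_order = rename(sample)
# Swapping the first two rules renames the freshly created PERF_EVENT_ names again.
swapped = rename(sample, [RENAMES[1], RENAMES[0]] + RENAMES[2:])
print(in_order)
print(swapped)
```

With the rules in the original order only the old PERF_EVENT_ names become PERF_RECORD_; with the first two swapped, the former PERF_COUNTER names would be caught as well.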
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 18:02:48 +08:00
|
|
|
config HAVE_PERF_EVENTS
|
2008-12-05 03:12:29 +08:00
|
|
|
bool
|
2009-06-13 01:17:43 +08:00
|
|
|
help
|
|
|
|
See tools/perf/design.txt for details.
|
2008-12-05 03:12:29 +08:00
|
|
|
|
2021-11-11 10:07:29 +08:00
|
|
|
config GUEST_PERF_EVENTS
|
|
|
|
bool
|
|
|
|
depends on HAVE_PERF_EVENTS
|
|
|
|
|
2009-09-21 22:08:49 +08:00
|
|
|
config PERF_USE_VMALLOC
|
|
|
|
bool
|
|
|
|
help
|
|
|
|
See tools/perf/design.txt for details
|
|
|
|
|
2017-01-11 02:50:54 +08:00
|
|
|
config PC104
|
2017-12-30 04:14:59 +08:00
|
|
|
bool "PC/104 support" if EXPERT
|
2017-01-11 02:50:54 +08:00
|
|
|
help
|
|
|
|
Expose PC/104 form factor device drivers and options available for
|
|
|
|
selection and configuration. Enable this option if your target
|
|
|
|
machine has a PC/104 bus.
|
|
|
|
|
2009-09-21 18:20:38 +08:00
|
|
|
menu "Kernel Performance Events And Counters"
|
2008-12-05 03:12:29 +08:00
|
|
|
|
2009-09-21 18:02:48 +08:00
|
|
|
config PERF_EVENTS
|
2009-09-21 18:20:38 +08:00
|
|
|
bool "Kernel performance events and counters"
|
2012-04-06 00:24:44 +08:00
|
|
|
default y if PROFILING
|
2009-09-21 18:02:48 +08:00
|
|
|
depends on HAVE_PERF_EVENTS
|
2010-10-14 14:01:34 +08:00
|
|
|
select IRQ_WORK
|
2008-12-05 03:12:29 +08:00
|
|
|
help
|
2009-09-21 18:20:38 +08:00
|
|
|
Enable kernel support for various performance events provided
|
|
|
|
by software and hardware.
|
2008-12-05 03:12:29 +08:00
|
|
|
|
2009-10-31 05:32:25 +08:00
|
|
|
Software events are supported either built-in or via the
|
2009-09-21 18:20:38 +08:00
|
|
|
use of generic tracepoints.
|
2008-12-05 03:12:29 +08:00
|
|
|
|
2009-09-21 18:20:38 +08:00
|
|
|
Most modern CPUs support performance events via performance
|
|
|
|
counter registers. These registers count the number of certain
|
2008-12-05 03:12:29 +08:00
|
|
|
types of hw events: such as instructions executed, cache misses
|
|
|
|
suffered, or branches mis-predicted - without slowing down the
|
|
|
|
kernel or applications. These registers can also trigger interrupts
|
|
|
|
when a threshold number of events have passed - and can thus be
|
|
|
|
used to profile the code that runs on that CPU.
|
|
|
|
|
2009-09-21 18:20:38 +08:00
|
|
|
The Linux Performance Event subsystem provides an abstraction of
|
2009-10-31 05:32:25 +08:00
|
|
|
these software and hardware event capabilities, available via a
|
2009-09-21 18:20:38 +08:00
|
|
|
system call and used by the "perf" utility in tools/perf/. It
|
2008-12-05 03:12:29 +08:00
|
|
|
provides per task and per CPU counters, and it provides event
|
|
|
|
capabilities on top of those.
|
|
|
|
|
|
|
|
Say Y if unsure.
|
|
|
|
|
2009-09-21 22:08:49 +08:00
|
|
|
config DEBUG_PERF_USE_VMALLOC
|
|
|
|
default n
|
|
|
|
bool "Debug: use vmalloc to back perf mmap() buffers"
|
2015-05-04 14:26:39 +08:00
|
|
|
depends on PERF_EVENTS && DEBUG_KERNEL && !PPC
|
2009-09-21 22:08:49 +08:00
|
|
|
select PERF_USE_VMALLOC
|
|
|
|
help
|
2019-12-05 08:52:28 +08:00
|
|
|
Use vmalloc memory to back perf mmap() buffers.
|
2009-09-21 22:08:49 +08:00
|
|
|
|
2019-12-05 08:52:28 +08:00
|
|
|
Mostly useful for debugging the vmalloc code on platforms
|
|
|
|
that don't require it.
|
2009-09-21 22:08:49 +08:00
|
|
|
|
2019-12-05 08:52:28 +08:00
|
|
|
Say N if unsure.
|
2009-09-21 22:08:49 +08:00
|
|
|
|
2008-12-05 03:12:29 +08:00
|
|
|
endmenu
|
|
|
|
|
2015-07-21 04:16:28 +08:00
|
|
|
config SYSTEM_DATA_VERIFICATION
|
|
|
|
def_bool n
|
|
|
|
select SYSTEM_TRUSTED_KEYRING
|
|
|
|
select KEYS
|
|
|
|
select CRYPTO
|
2016-03-04 05:49:27 +08:00
|
|
|
select CRYPTO_RSA
|
2015-07-21 04:16:28 +08:00
|
|
|
select ASYMMETRIC_KEY_TYPE
|
|
|
|
select ASYMMETRIC_PUBLIC_KEY_SUBTYPE
|
|
|
|
select ASN1
|
|
|
|
select OID_REGISTRY
|
|
|
|
select X509_CERTIFICATE_PARSER
|
|
|
|
select PKCS7_MESSAGE_PARSER
|
2014-04-19 06:07:11 +08:00
|
|
|
help
|
2015-07-21 04:16:28 +08:00
|
|
|
Provide PKCS#7 message verification using the contents of the system
|
|
|
|
trusted keyring to provide public keys. This then can be used for
|
|
|
|
module verification, kexec image verification and firmware blob
|
|
|
|
verification.
|
2014-04-19 06:07:11 +08:00
|
|
|
|
2008-02-03 04:10:36 +08:00
|
|
|
config PROFILING
|
2010-02-26 22:01:23 +08:00
|
|
|
bool "Profiling support"
|
2008-02-03 04:10:36 +08:00
|
|
|
help
|
|
|
|
Say Y here to enable the extended profiling support mechanisms used
|
2021-01-14 19:35:30 +08:00
|
|
|
by profilers.
|
2008-02-03 04:10:36 +08:00
|
|
|
|
2021-07-03 22:42:57 +08:00
|
|
|
config RUST
|
|
|
|
bool "Rust support"
|
|
|
|
depends on HAVE_RUST
|
|
|
|
depends on RUST_IS_AVAILABLE
|
2024-04-04 22:17:02 +08:00
|
|
|
depends on !CFI_CLANG
|
2021-07-03 22:42:57 +08:00
|
|
|
depends on !MODVERSIONS
|
|
|
|
depends on !GCC_PLUGINS
|
|
|
|
depends on !RANDSTRUCT
|
2024-07-29 22:22:49 +08:00
|
|
|
depends on !SHADOW_CALL_STACK
|
2023-01-11 23:20:50 +08:00
|
|
|
depends on !DEBUG_INFO_BTF || PAHOLE_HAS_LANG_EXCLUDE
|
2021-07-03 22:42:57 +08:00
|
|
|
help
|
|
|
|
Enables Rust support in the kernel.
|
|
|
|
|
|
|
|
This allows other Rust-related options, like drivers written in Rust,
|
|
|
|
to be selected.
|
|
|
|
|
|
|
|
It is also required to be able to load external kernel modules
|
|
|
|
written in Rust.
|
|
|
|
|
|
|
|
See Documentation/rust/ for more information.
|
|
|
|
|
|
|
|
If unsure, say N.
|
|
|
|
|
|
|
|
config RUSTC_VERSION_TEXT
|
|
|
|
string
|
|
|
|
depends on RUST
|
2024-07-27 22:03:00 +08:00
|
|
|
default "$(shell,$(RUSTC) --version 2>/dev/null)"
|
2021-07-03 22:42:57 +08:00
|
|
|
|
|
|
|
config BINDGEN_VERSION_TEXT
|
|
|
|
string
|
|
|
|
depends on RUST
|
2024-07-10 00:06:03 +08:00
|
|
|
# The dummy parameter `workaround-for-0.69.0` is required to support 0.69.0
|
|
|
|
# (https://github.com/rust-lang/rust-bindgen/pull/2678). It can be removed when
|
|
|
|
# the minimum version is upgraded past that (0.69.1 already fixed the issue).
|
2024-07-27 22:03:00 +08:00
|
|
|
default "$(shell,$(BINDGEN) --version workaround-for-0.69.0 2>/dev/null)"
|
2021-07-03 22:42:57 +08:00
|
|
|
|
2008-07-23 20:15:22 +08:00
|
|
|
#
|
|
|
|
# Place an empty function call at each tracepoint site. Can be
|
|
|
|
# dynamically changed for a probe function.
|
|
|
|
#
|
tracing: Kernel Tracepoints
Implementation of kernel tracepoints. Inspired by the Linux Kernel
Markers. Allows complete typing verification by declaring both tracing
statement inline functions and probe registration/unregistration static
inline functions within the same macro "DEFINE_TRACE". No format string
is required. See the tracepoint Documentation and Samples patches for
usage examples.
Taken from the documentation patch :
"A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty (checking
a condition for a branch) and space penalty (adding a few bytes for the
function call at the end of the instrumented function and adding a data
structure in a separate section). When a tracepoint is "on", the
function you provide is called each time the tracepoint is executed, in
the execution context of the caller. When the function provided ends its
execution, it returns to the caller (continuing from the tracepoint
site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters, whose
prototypes are described in a tracepoint declaration placed in a header
file."
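As a conceptual model only (not the kernel implementation, which is C and relies on the RCU scheme described in this message), the on/off behaviour can be sketched as a hook that holds a list of probes and does nothing beyond a cheap check when that list is empty. The class, method and tracepoint names here are hypothetical.

```python
class Tracepoint:
    """Toy model of a tracepoint: 'on' when at least one probe is attached."""

    def __init__(self, name):
        self.name = name
        self._probes = ()          # immutable snapshot, replaced on update

    def register(self, probe):
        self._probes = self._probes + (probe,)

    def unregister(self, probe):
        self._probes = tuple(p for p in self._probes if p is not probe)

    def __call__(self, *args):
        probes = self._probes      # read the current snapshot once
        if not probes:             # "off": only the cost of this check
            return
        for probe in probes:       # "on": run probes in the caller's context
            probe(*args)


sched_switch = Tracepoint("sched_switch")  # hypothetical tracepoint name
seen = []
record = lambda prev, nxt: seen.append((prev, nxt))

sched_switch("idle", "bash")       # off: nothing recorded
sched_switch.register(record)
sched_switch("bash", "idle")       # on: probe is called
sched_switch.unregister(record)
sched_switch("idle", "bash")       # off again
print(seen)
```

Replacing the whole tuple on each update, rather than editing it in place, loosely mirrors the way the message describes the previous probe array being retired and freed after an update.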
Addition and removal of tracepoints is synchronized by RCU using the
scheduler (and preempt_disable) as guarantees to find a quiescent state
(this is really RCU "classic"). The update side uses rcu_barrier_sched()
with call_rcu_sched() and the read/execute side uses
"preempt_disable()/preempt_enable()".
We make sure the previous array containing probes, which has been
scheduled for deletion by the rcu callback, is indeed freed before we
proceed to the next update. It therefore limits the rate of modification
of a single tracepoint to one update per RCU period. The objective here
is to permit fast batch add/removal of probes on _different_
tracepoints.
Changelog :
- Use #name ":" #proto as string to identify the tracepoint in the
tracepoint table. This makes sure no type mismatch happens due to
connection of a probe with the wrong type to a tracepoint declared with
the same name in a different header.
- Add tracepoint_entry_free_old.
- Change __TO_TRACE to get rid of the 'i' iterator.
Masami Hiramatsu <mhiramat@redhat.com> :
Tested on x86-64.
Performance impact of a tracepoint : same as markers, except that it
adds about 70 bytes of instructions in an unlikely branch of each
instrumented function (the for loop, the stack setup and the function
call). It currently adds a memory read, a test and a conditional branch
at the instrumentation site (in the hot path). Immediate values will
eventually change this into a load immediate, test and branch. That
removes the memory read and makes the i-cache impact smaller
(changing the memory read for a load immediate removes 3-4 bytes per
site on x86_32, depending on mov prefixes, or 7-8 bytes on x86_64; it
also saves the d-cache hit).
About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in
scheduler code) was added.
Quoting Hideo Aoki about Markers :
I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
tree, which includes several markers for LTTng, using an ia64 server.
While the immediate trace mark feature isn't implemented on ia64, there
is no major performance regression. So, I think that we don't have any
issues to propose merging marker point patches into Linus's tree from
the viewpoint of performance impact.
I prepared two kernels to evaluate. The first one was compiled without
CONFIG_MARKERS. The second one was compiled with CONFIG_MARKERS enabled.
I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
I ran hackbench 5 times in each condition and calculated the average and
difference between the kernels.
The parameter of hackbench: every 50 from 50 to 800
The number of CPUs of the server: 2, 4, and 8
Below are the results. As you can see, no major performance regression
was found in any case. Even if the number of processes increases, the
differences between the marker-enabled kernel and the marker-disabled
kernel don't increase. Moreover, if the number of CPUs increases, the
differences don't increase either.
Curiously, the marker-enabled kernel is better than the marker-disabled
kernel in more than half of the cases, although I guess it comes from
the difference in memory access pattern.
* 2 CPUs
Number of | without      | with         | diff   | diff  |
processes | Marker [Sec] | Marker [Sec] | [Sec]  | [%]   |
----------------------------------------------------------
       50 |        4.811 |        4.872 | +0.061 | +1.27 |
      100 |        9.854 |       10.309 | +0.454 | +4.61 |
      150 |       15.602 |       15.040 | -0.562 |  -3.6 |
      200 |       20.489 |       20.380 | -0.109 | -0.53 |
      250 |       25.798 |       25.652 | -0.146 | -0.56 |
      300 |       31.260 |       30.797 | -0.463 | -1.48 |
      350 |       36.121 |       35.770 | -0.351 | -0.97 |
      400 |       42.288 |       42.102 | -0.186 | -0.44 |
      450 |       47.778 |       47.253 | -0.526 |  -1.1 |
      500 |       51.953 |       52.278 | +0.325 | +0.63 |
      550 |       58.401 |       57.700 | -0.701 |  -1.2 |
      600 |       63.334 |       63.222 | -0.112 | -0.18 |
      650 |       68.816 |       68.511 | -0.306 | -0.44 |
      700 |       74.667 |       74.088 | -0.579 | -0.78 |
      750 |       78.612 |       79.582 | +0.970 | +1.23 |
      800 |       85.431 |       85.263 | -0.168 |  -0.2 |
----------------------------------------------------------
* 4 CPUs
Number of | without      | with         | diff   | diff  |
processes | Marker [Sec] | Marker [Sec] | [Sec]  | [%]   |
----------------------------------------------------------
       50 |        2.586 |        2.584 | -0.003 |  -0.1 |
      100 |        5.254 |        5.283 | +0.030 | +0.56 |
      150 |        8.012 |        8.074 | +0.061 | +0.76 |
      200 |       11.172 |       11.000 | -0.172 | -1.54 |
      250 |       13.917 |       14.036 | +0.119 | +0.86 |
      300 |       16.905 |       16.543 | -0.362 | -2.14 |
      350 |       19.901 |       20.036 | +0.135 | +0.68 |
      400 |       22.908 |       23.094 | +0.186 | +0.81 |
      450 |       26.273 |       26.101 | -0.172 | -0.66 |
      500 |       29.554 |       29.092 | -0.461 | -1.56 |
      550 |       32.377 |       32.274 | -0.103 | -0.32 |
      600 |       35.855 |       35.322 | -0.533 | -1.49 |
      650 |       39.192 |       38.388 | -0.804 | -2.05 |
      700 |       41.744 |       41.719 | -0.025 | -0.06 |
      750 |       45.016 |       44.496 | -0.520 | -1.16 |
      800 |       48.212 |       47.603 | -0.609 | -1.26 |
----------------------------------------------------------
* 8 CPUs
Number of | without      | with         | diff   | diff  |
processes | Marker [Sec] | Marker [Sec] | [Sec]  | [%]   |
----------------------------------------------------------
       50 |        2.094 |        2.072 | -0.022 | -1.07 |
      100 |        4.162 |        4.273 | +0.111 | +2.66 |
      150 |        6.485 |        6.540 | +0.055 | +0.84 |
      200 |        8.556 |        8.478 | -0.078 | -0.91 |
      250 |       10.458 |       10.258 | -0.200 | -1.91 |
      300 |       12.425 |       12.750 | +0.325 | +2.62 |
      350 |       14.807 |       14.839 | +0.032 | +0.22 |
      400 |       16.801 |       16.959 | +0.158 | +0.94 |
      450 |       19.478 |       19.009 | -0.470 | -2.41 |
      500 |       21.296 |       21.504 | +0.208 | +0.98 |
      550 |       23.842 |       23.979 | +0.137 | +0.57 |
      600 |       26.309 |       26.111 | -0.198 | -0.75 |
      650 |       28.705 |       28.446 | -0.259 |  -0.9 |
      700 |       31.233 |       31.394 | +0.161 | +0.52 |
      750 |       34.064 |       33.720 | -0.344 | -1.01 |
      800 |       36.320 |       36.114 | -0.206 | -0.57 |
----------------------------------------------------------
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: 'Peter Zijlstra' <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-19 00:16:16 +08:00
|
|
|
config TRACEPOINTS
|
2008-07-23 20:15:22 +08:00
|
|
|
bool
|
tracing: Kernel Tracepoints
Implementation of kernel tracepoints. Inspired from the Linux Kernel
Markers. Allows complete typing verification by declaring both tracing
statement inline functions and probe registration/unregistration static
inline functions within the same macro "DEFINE_TRACE". No format string
is required. See the tracepoint Documentation and Samples patches for
usage examples.
Taken from the documentation patch :
"A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty (checking
a condition for a branch) and space penalty (adding a few bytes for the
function call at the end of the instrumented function and adds a data
structure in a separate section). When a tracepoint is "on", the
function you provide is called each time the tracepoint is executed, in
the execution context of the caller. When the function provided ends its
execution, it returns to the caller (continuing from the tracepoint
site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters, which
prototypes are described in a tracepoint declaration placed in a header
file."
Addition and removal of tracepoints is synchronized by RCU using the
scheduler (and preempt_disable) as guarantees to find a quiescent state
(this is really RCU "classic"). The update side uses rcu_barrier_sched()
with call_rcu_sched() and the read/execute side uses
"preempt_disable()/preempt_enable()".
We make sure the previous array containing probes, which has been
scheduled for deletion by the rcu callback, is indeed freed before we
proceed to the next update. It therefore limits the rate of modification
of a single tracepoint to one update per RCU period. The objective here
is to permit fast batch add/removal of probes on _different_
tracepoints.
Changelog :
- Use #name ":" #proto as string to identify the tracepoint in the
tracepoint table. This will make sure not type mismatch happens due to
connexion of a probe with the wrong type to a tracepoint declared with
the same name in a different header.
- Add tracepoint_entry_free_old.
- Change __TO_TRACE to get rid of the 'i' iterator.
Masami Hiramatsu <mhiramat@redhat.com> :
Tested on x86-64.
Performance impact of a tracepoint : same as markers, except that it
adds about 70 bytes of instructions in an unlikely branch of each
instrumented function (the for loop, the stack setup and the function
call). It currently adds a memory read, a test and a conditional branch
at the instrumentation site (in the hot path). Immediate values will
eventually change this into a load immediate, test and branch, which
removes the memory read which will make the i-cache impact smaller
(changing the memory read for a load immediate removes 3-4 bytes per
site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
also saves the d-cache hit).
About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in code
scheduler code) was added.
Quoting Hideo Aoki about Markers :
I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
tree, which includes several markers for LTTng, using an ia64 server.
While the immediate trace mark feature isn't implemented on ia64, there
is no major performance regression. So, I think that we don't have any
issues to propose merging marker point patches into Linus's tree from
the viewpoint of performance impact.
I prepared two kernels to evaluate. The first one was compiled without
CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS.
I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
I ran hackbench 5 times in each condition and calculated the average and
difference between the kernels.
The parameter of hackbench: every 50 from 50 to 800
The number of CPUs of the server: 2, 4, and 8
Below is the results. As you can see, major performance regression
wasn't found in any case. Even if number of processes increases,
differences between marker-enabled kernel and marker- disabled kernel
doesn't increase. Moreover, if number of CPUs increases, the differences
doesn't increase either.
Curiously, marker-enabled kernel is better than marker-disabled kernel
in more than half cases, although I guess it comes from the difference
of memory access pattern.
* 2 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 4.811 | 4.872 | +0.061 | +1.27 |
100 | 9.854 | 10.309 | +0.454 | +4.61 |
150 | 15.602 | 15.040 | -0.562 | -3.6 |
200 | 20.489 | 20.380 | -0.109 | -0.53 |
250 | 25.798 | 25.652 | -0.146 | -0.56 |
300 | 31.260 | 30.797 | -0.463 | -1.48 |
350 | 36.121 | 35.770 | -0.351 | -0.97 |
400 | 42.288 | 42.102 | -0.186 | -0.44 |
450 | 47.778 | 47.253 | -0.526 | -1.1 |
500 | 51.953 | 52.278 | +0.325 | +0.63 |
550 | 58.401 | 57.700 | -0.701 | -1.2 |
600 | 63.334 | 63.222 | -0.112 | -0.18 |
650 | 68.816 | 68.511 | -0.306 | -0.44 |
700 | 74.667 | 74.088 | -0.579 | -0.78 |
750 | 78.612 | 79.582 | +0.970 | +1.23 |
800 | 85.431 | 85.263 | -0.168 | -0.2 |
--------------------------------------------------------------
* 4 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 2.586 | 2.584 | -0.003 | -0.1 |
100 | 5.254 | 5.283 | +0.030 | +0.56 |
150 | 8.012 | 8.074 | +0.061 | +0.76 |
200 | 11.172 | 11.000 | -0.172 | -1.54 |
250 | 13.917 | 14.036 | +0.119 | +0.86 |
300 | 16.905 | 16.543 | -0.362 | -2.14 |
350 | 19.901 | 20.036 | +0.135 | +0.68 |
400 | 22.908 | 23.094 | +0.186 | +0.81 |
450 | 26.273 | 26.101 | -0.172 | -0.66 |
500 | 29.554 | 29.092 | -0.461 | -1.56 |
550 | 32.377 | 32.274 | -0.103 | -0.32 |
600 | 35.855 | 35.322 | -0.533 | -1.49 |
650 | 39.192 | 38.388 | -0.804 | -2.05 |
700 | 41.744 | 41.719 | -0.025 | -0.06 |
750 | 45.016 | 44.496 | -0.520 | -1.16 |
800 | 48.212 | 47.603 | -0.609 | -1.26 |
--------------------------------------------------------------
* 8 CPUs
Number of | without | with | diff | diff |
processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] |
--------------------------------------------------------------
50 | 2.094 | 2.072 | -0.022 | -1.07 |
100 | 4.162 | 4.273 | +0.111 | +2.66 |
150 | 6.485 | 6.540 | +0.055 | +0.84 |
200 | 8.556 | 8.478 | -0.078 | -0.91 |
250 | 10.458 | 10.258 | -0.200 | -1.91 |
300 | 12.425 | 12.750 | +0.325 | +2.62 |
350 | 14.807 | 14.839 | +0.032 | +0.22 |
400 | 16.801 | 16.959 | +0.158 | +0.94 |
450 | 19.478 | 19.009 | -0.470 | -2.41 |
500 | 21.296 | 21.504 | +0.208 | +0.98 |
550 | 23.842 | 23.979 | +0.137 | +0.57 |
600 | 26.309 | 26.111 | -0.198 | -0.75 |
650 | 28.705 | 28.446 | -0.259 | -0.9 |
700 | 31.233 | 31.394 | +0.161 | +0.52 |
750 | 34.064 | 33.720 | -0.344 | -1.01 |
800 | 36.320 | 36.114 | -0.206 | -0.57 |
--------------------------------------------------------------
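For reference, the two diff columns appear to follow the usual convention: "with" minus "without", and that difference relative to the without-marker time. This is inferred from the table values rather than stated above. Taking the 50-process row of the 2-CPU table as a worked example:

```latex
\[
\text{diff [Sec]} = t_{\text{with}} - t_{\text{without}}, \qquad
\text{diff [\%]} = \frac{t_{\text{with}} - t_{\text{without}}}{t_{\text{without}}} \times 100
\]
\[
4.872 - 4.811 = +0.061\ \text{s}, \qquad
\frac{0.061}{4.811} \times 100 \approx +1.27\,\%
\]
```

A negative value therefore means the marker-enabled kernel finished faster. A few rows differ from this arithmetic in the last digit (for example 10.309 - 9.854 = 0.455 against a listed +0.454), presumably because the columns were computed from unrounded timings.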
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: 'Peter Zijlstra' <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-19 00:16:16 +08:00
|
|
|
|
kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
Patch series "refactor Kconfig to consolidate KEXEC and CRASH options", v6.
The Kconfig is refactored to consolidate KEXEC and CRASH options from
various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec.
The Kconfig.kexec is now a submenu titled "Kexec and crash features"
located under "General Setup".
The following options are impacted:
- KEXEC
- KEXEC_FILE
- KEXEC_SIG
- KEXEC_SIG_FORCE
- KEXEC_IMAGE_VERIFY_SIG
- KEXEC_BZIMAGE_VERIFY_SIG
- KEXEC_JUMP
- CRASH_DUMP
Over time, these options have been copied between Kconfig files and
are very similar to one another, but with slight differences.
The following architectures are impacted by the refactor (because of
use of one or more KEXEC/CRASH options):
- arm
- arm64
- ia64
- loongarch
- m68k
- mips
- parisc
- powerpc
- riscv
- s390
- sh
- x86
More information:
In the patch series "crash: Kernel handling of CPU and memory hot
un/plug"
https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devolder@oracle.com/
the new kernel feature introduces the config option CRASH_HOTPLUG.
In reviewing, Thomas Gleixner requested that the new config option
not be placed in x86 Kconfig. Rather the option needs a generic/common
home. To Thomas' point, the KEXEC and CRASH options have largely been
duplicated in the various arch/<arch>/Kconfig files, with minor
differences. This kind of proliferation is to be avoided/stopped.
https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/
To that end, I have refactored the arch Kconfigs so as to consolidate
the various KEXEC and CRASH options. Generally speaking, this work has
the following themes:
- KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec
- These items from arch/Kconfig:
CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC
- These items from arch/x86/Kconfig form the common options:
KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE
KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP
- These items from arch/arm64/Kconfig form the common options:
KEXEC_IMAGE_VERIFY_SIG
- The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec
- The Kconfig.kexec is now a submenu titled "Kexec and crash features"
and is now listed in "General Setup" submenu from init/Kconfig.
- To control the common options, each has a new ARCH_SUPPORTS_<option>
option. These gateway options determine whether the common options
are valid for the architecture.
- To account for the slight differences in the original architecture
coding of the common options, each now has a corresponding
ARCH_SELECTS_<option> which is used to elicit the same side effects
as the original arch/<arch>/Kconfig files for KEXEC and CRASH options.
An example of the submenu as shown by 'make menuconfig':
> General setup > Kexec and crash features
[*] Enable kexec system call
[*] Enable kexec file based system call
[*] Verify kernel signature during kexec_file_load() syscall
[ ] Require a valid signature in kexec_file_load() syscall
[ ] Enable bzImage signature verification support
[*] kexec jump
[*] kernel crash dumps
[*] Update the crash elfcorehdr on system configuration changes
In the process of consolidating the common options, I encountered
slight differences in the coding of these options in several of the
architectures. As a result, I settled on the following solution:
- Each of the common options has a 'depends on ARCH_SUPPORTS_<option>'
statement. For example, the KEXEC_FILE option has a 'depends on
ARCH_SUPPORTS_KEXEC_FILE' statement.
This approach is needed on all common options so as to prevent
options from appearing for architectures which previously did
not allow/enable them. For example, arm supports KEXEC but not
KEXEC_FILE. The arch/arm/Kconfig does not provide
ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options
are not available to arm.
- The boolean ARCH_SUPPORTS_<option> in effect allows the arch to
determine when the feature is allowed. Archs which don't have the
feature simply do not provide the corresponding ARCH_SUPPORTS_<option>.
For each arch, where there previously were KEXEC and/or CRASH
options, these have been replaced with the corresponding boolean
ARCH_SUPPORTS_<option>, and an appropriate def_bool statement.
For example, if the arch supports KEXEC_FILE, then the
ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits
the KEXEC_FILE option to be available.
If the arch has a 'depends on' statement in its original coding
of the option, then that expression becomes part of the def_bool
expression. For example, arm64 had:
    config KEXEC
        depends on PM_SLEEP_SMP
and in this solution, this converts to:
    config ARCH_SUPPORTS_KEXEC
        def_bool PM_SLEEP_SMP
- In order to account for the architecture differences in the
coding for the common options, the ARCH_SELECTS_<option> in the
arch/<arch>/Kconfig is used. This option has a 'depends on
<option>' statement to couple it to the main option, and from
there can insert the differences from the common option and the
arch original coding of that option.
For example, a few archs enable CRYPTO and CRYPTO_SHA256 for
KEXEC_FILE. These require an ARCH_SELECTS_KEXEC_FILE with
'select CRYPTO' and 'select CRYPTO_SHA256' statements.
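Putting the three bullets together for KEXEC_FILE, a minimal sketch might look like the following. The prompt text is taken from the menuconfig listing above. The 'select KEXEC_CORE' line is carried over from the original arch definitions and is an assumption about the common option. The arch side is hypothetical and not a copy of any real arch/<arch>/Kconfig.

```kconfig
# kernel/Kconfig.kexec: the common option, gated by the arch
config KEXEC_FILE
	bool "Enable kexec file based system call"
	depends on ARCH_SUPPORTS_KEXEC_FILE
	select KEXEC_CORE	# assumption: kept from the original arch definitions

# arch/<arch>/Kconfig (hypothetical arch): permit the option
config ARCH_SUPPORTS_KEXEC_FILE
	def_bool y

# arch/<arch>/Kconfig (hypothetical arch): side effects when it is enabled
config ARCH_SELECTS_KEXEC_FILE
	def_bool y
	depends on KEXEC_FILE
	select CRYPTO
	select CRYPTO_SHA256
```

An arch that does not define ARCH_SUPPORTS_KEXEC_FILE leaves the dependency unmet, so KEXEC_FILE never appears in its menu and ARCH_SELECTS_KEXEC_FILE stays off.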
Illustrating the option relationships:
For each of the common KEXEC and CRASH options:
ARCH_SUPPORTS_<option> <- <option> <- ARCH_SELECTS_<option>
<option> # in Kconfig.kexec
ARCH_SUPPORTS_<option> # in arch/<arch>/Kconfig, as needed
ARCH_SELECTS_<option> # in arch/<arch>/Kconfig, as needed
For example, KEXEC:
ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC
KEXEC # in Kconfig.kexec
ARCH_SUPPORTS_KEXEC # in arch/<arch>/Kconfig, as needed
ARCH_SELECTS_KEXEC # in arch/<arch>/Kconfig, as needed
To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (i.e.
select statements).
Examples:
A few examples to show the new strategy in action:
===== x86 (minus the help section) =====
Original:
config KEXEC
    bool "kexec system call"
    select KEXEC_CORE

config KEXEC_FILE
    bool "kexec file based system call"
    select KEXEC_CORE
    select HAVE_IMA_KEXEC if IMA
    depends on X86_64
    depends on CRYPTO=y
    depends on CRYPTO_SHA256=y

config ARCH_HAS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

config KEXEC_SIG
    bool "Verify kernel signature during kexec_file_load() syscall"
    depends on KEXEC_FILE

config KEXEC_SIG_FORCE
    bool "Require a valid signature in kexec_file_load() syscall"
    depends on KEXEC_SIG

config KEXEC_BZIMAGE_VERIFY_SIG
    bool "Enable bzImage signature verification support"
    depends on KEXEC_SIG
    depends on SIGNED_PE_FILE_VERIFICATION
    select SYSTEM_TRUSTED_KEYRING

config CRASH_DUMP
    bool "kernel crash dumps"
    depends on X86_64 || (X86_32 && HIGHMEM)

config KEXEC_JUMP
    bool "kexec jump"
    depends on KEXEC && HIBERNATION
    help
becomes...
New:
config ARCH_SUPPORTS_KEXEC
    def_bool y

config ARCH_SUPPORTS_KEXEC_FILE
    def_bool X86_64 && CRYPTO && CRYPTO_SHA256

config ARCH_SELECTS_KEXEC_FILE
    def_bool y
    depends on KEXEC_FILE
    select HAVE_IMA_KEXEC if IMA

config ARCH_SUPPORTS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

config ARCH_SUPPORTS_KEXEC_SIG
    def_bool y

config ARCH_SUPPORTS_KEXEC_SIG_FORCE
    def_bool y

config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
    def_bool y

config ARCH_SUPPORTS_KEXEC_JUMP
    def_bool y

config ARCH_SUPPORTS_CRASH_DUMP
    def_bool X86_64 || (X86_32 && HIGHMEM)
===== powerpc (minus the help section) =====
Original:
config KEXEC
    bool "kexec system call"
    depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
    select KEXEC_CORE

config KEXEC_FILE
    bool "kexec file based system call"
    select KEXEC_CORE
    select HAVE_IMA_KEXEC if IMA
    select KEXEC_ELF
    depends on PPC64
    depends on CRYPTO=y
    depends on CRYPTO_SHA256=y

config ARCH_HAS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

config CRASH_DUMP
    bool "Build a dump capture kernel"
    depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
    select RELOCATABLE if PPC64 || 44x || PPC_85xx
becomes...
New:
config ARCH_SUPPORTS_KEXEC
    def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)

config ARCH_SUPPORTS_KEXEC_FILE
    def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y

config ARCH_SUPPORTS_KEXEC_PURGATORY
    def_bool KEXEC_FILE

config ARCH_SELECTS_KEXEC_FILE
    def_bool y
    depends on KEXEC_FILE
    select KEXEC_ELF
    select HAVE_IMA_KEXEC if IMA

config ARCH_SUPPORTS_CRASH_DUMP
    def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)

config ARCH_SELECTS_CRASH_DUMP
    def_bool y
    depends on CRASH_DUMP
    select RELOCATABLE if PPC64 || 44x || PPC_85xx
Testing Approach and Results
There are 388 config files in the arch/<arch>/configs directories.
For each of these config files, a .config is generated both before and
after this Kconfig series, and checked for equivalence. This approach
allows for a rather rapid check of all architectures and a wide
variety of configs wrt/ KEXEC and CRASH, and avoids requiring
compiling for all architectures and running kernels and run-time
testing.
For each config file, the olddefconfig, allnoconfig and allyesconfig
targets are utilized. In testing, randconfig has revealed problems
as well, but it is not used in the before and after equivalence check
since one cannot generate the "same" .config for before and after,
even when using the same KCONFIG_SEED, because the option list is
different.
As such, the following script steps compare the before and after
of 'make olddefconfig'. The new symbols introduced by this series
are filtered out, but otherwise the config files are PASS only if
they were equivalent, and FAIL otherwise.
The script performs the test by doing the following:
# Obtain the "golden" .config output for given config file
# Reset test sandbox
git checkout master
git branch -D test_Kconfig
git checkout -B test_Kconfig master
make distclean
# Write out updated config
cp -f <config file> .config
make ARCH=<arch> olddefconfig
# Track each item in .config, LHSB is "golden"
scoreboard .config
# Obtain the "changed" .config output for given config file
# Reset test sandbox
make distclean
# Apply this Kconfig series
git am <this Kconfig series>
# Write out updated config
cp -f <config file> .config
make ARCH=<arch> olddefconfig
# Track each item in .config, RHSB is "changed"
scoreboard .config
# Determine test result
# Filter-out new symbols introduced by this series
# Filter-out symbol=n which are not in either scoreboard
# Compare LHSB "golden" and RHSB "changed" scoreboards and issue PASS/FAIL
The script was instrumental during the refactoring of Kconfig as it
continually revealed problems. The end result is that the solution
presented in this series passes all configs as checked by the script,
with the following exceptions:
- arch/ia64/configs/zx1_config with olddefconfig
This config file has:
# CONFIG_KEXEC is not set
CONFIG_CRASH_DUMP=y
and this refactor now couples KEXEC to CRASH_DUMP, so it is not
possible to enable CRASH_DUMP without KEXEC.
- arch/sh/configs/* with allyesconfig
The arch/sh/Kconfig codes CRASH_DUMP as dependent upon BROKEN_ON_MMU
(which clearly is not meant to be set). This symbol is not provided
but with the allyesconfig it is set to yes which enables CRASH_DUMP.
But KEXEC is coded as dependent upon MMU, and is set to no in
arch/sh/mm/Kconfig, so KEXEC is not enabled.
This refactor now couples KEXEC to CRASH_DUMP, so it is not
possible to enable CRASH_DUMP without KEXEC.
While the above exceptions are not equivalent to their original,
the config file produced is valid (and in fact better wrt/ CRASH_DUMP
handling).
This patch (of 14)
The config options for kexec and crash features are consolidated
into new file kernel/Kconfig.kexec. Under the "General Setup" submenu
is a new submenu "Kexec and crash features". All the kexec and
crash options that were once in the arch-dependent submenu "Processor
type and features" are now consolidated in the new submenu.
The following options are impacted:
- KEXEC
- KEXEC_FILE
- KEXEC_SIG
- KEXEC_SIG_FORCE
- KEXEC_BZIMAGE_VERIFY_SIG
- KEXEC_JUMP
- CRASH_DUMP
The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.
Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.
Architectures can utilize the new ARCH_SELECTS_<option> config
options to specify additional components when <option> is enabled.
To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (i.e.
select statements).
Link: https://lkml.kernel.org/r/20230712161545.87870-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230712161545.87870-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Juerg Haefliger <juerg.haefliger@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Marc Aurèle La France <tsi@tuyoix.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Xin Li <xin3.li@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-13 00:15:32 +08:00
|
|
|
source "kernel/Kconfig.kexec"
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
endmenu # General setup
|
|
|
|
|
2018-07-31 19:39:30 +08:00
|
|
|
source "arch/Kconfig"
|
|
|
|
|
2006-09-17 03:15:53 +08:00
|
|
|
config RT_MUTEXES
|
2014-12-21 04:41:11 +08:00
|
|
|
bool
|
2022-02-09 01:21:10 +08:00
|
|
|
default y if PREEMPT_RT
|
2006-09-17 03:15:53 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
config BASE_SMALL
|
|
|
|
int
|
|
|
|
default 0 if BASE_FULL
|
|
|
|
default 1 if !BASE_FULL
|
|
|
|
|
2019-07-05 02:57:34 +08:00
|
|
|
config MODULE_SIG_FORMAT
|
|
|
|
def_bool n
|
|
|
|
select SYSTEM_DATA_VERIFICATION
|
|
|
|
|
2022-07-12 13:52:33 +08:00
|
|
|
source "kernel/module/Kconfig"
|
2015-05-27 09:39:37 +08:00
|
|
|
|
2008-12-13 18:49:41 +08:00
|
|
|
config INIT_ALL_POSSIBLE
|
|
|
|
bool
|
|
|
|
help
|
2012-03-29 13:08:31 +08:00
|
|
|
Back when each arch used to define their own cpu_online_mask and
|
|
|
|
cpu_possible_mask, some of them chose to initialize cpu_possible_mask
|
2008-12-13 18:49:41 +08:00
|
|
|
with all 1s, and others with all 0s. When they were centralised,
|
|
|
|
it was better to provide this option than to break all the archs
|
2009-01-26 18:12:25 +08:00
|
|
|
and have several arch maintainers pursuing me down dark alleys.
|
2008-12-13 18:49:41 +08:00
|
|
|
|
2005-11-04 15:43:35 +08:00
|
|
|
source "block/Kconfig"
|
2007-10-17 14:27:31 +08:00
|
|
|
|
|
|
|
config PREEMPT_NOTIFIERS
|
|
|
|
bool
|
2008-01-26 04:08:24 +08:00
|
|
|
|
2010-01-06 16:47:10 +08:00
|
|
|
config PADATA
|
|
|
|
depends on SMP
|
|
|
|
bool
|
|
|
|
|
2012-09-22 06:31:13 +08:00
|
|
|
config ASN1
|
|
|
|
tristate
|
|
|
|
help
|
|
|
|
Build a simple ASN.1 grammar compiler that produces a bytecode output
|
|
|
|
that can be interpreted by the ASN.1 stream decoder and used to
|
|
|
|
inform it as to what tags are to be expected in a stream and what
|
|
|
|
functions to call on what tags.
|
|
|
|
|
2009-11-09 23:21:34 +08:00
|
|
|
source "kernel/Kconfig.locks"
|
2018-01-30 04:20:15 +08:00
|
|
|
|
bpf: Restrict bpf_probe_read{, str}() only to archs where they work
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.
To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").
Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.
However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).
The remaining archs can easily work around this by relying on the
feature probe from bpftool, which emits defines that can be used from
BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro
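As a rough illustration of the mechanism (not the exact hunks of this patch), the new symbol is a promptless bool that an architecture opts into with a select. The enclosing 'config X86' entry is heavily abbreviated here, and the same one-line select would apply to arm and arm64:

```kconfig
# Generic side: promptless symbol, stays off unless an architecture selects it
config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
	bool

# arch/x86/Kconfig (abbreviated; the real entry selects many more symbols)
config X86
	def_bool y
	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
```

Because the symbol has no prompt, users cannot toggle it by hand. Only an explicit select from the architecture turns it on, which keeps the legacy helpers off by default on archs that have not opted in.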
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
2020-05-15 18:11:16 +08:00
|
|
|
config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
|
|
|
|
bool
|
|
|
|
|
2018-01-30 04:20:15 +08:00
|
|
|
config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
|
|
|
|
bool
|
2018-04-05 17:53:01 +08:00
|
|
|
|
|
|
|
# It may be useful for an architecture to override the definitions of the
|
2018-04-05 17:53:03 +08:00
|
|
|
# SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
|
|
|
|
# and the COMPAT_ variants in <linux/compat.h>, in particular to use a
|
|
|
|
# different calling convention for syscalls. They can also override the
|
|
|
|
# macros for not-implemented syscalls in kernel/sys_ni.c and
|
|
|
|
# kernel/time/posix-stubs.c. All these overrides need to be available in
|
|
|
|
# <asm/syscall_wrapper.h>.
|
2018-04-05 17:53:01 +08:00
|
|
|
config ARCH_HAS_SYSCALL_WRAPPER
|
|
|
|
def_bool n
|