OpenCloudOS-Kernel

Aleksa Sarai 9876cfe8ec memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy	2023-08-21 13:37:59 -07:00

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

  % cat mount-memfd.c
  #define _GNU_SOURCE	/* for asprintf() */
  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
  	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
  	assert(fsfd >= 0);
  	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

  	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
  	assert(dfd >= 0);

  	/* 0755: the mode must be a valid octal constant with the exec bits set. */
  	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
  	assert(execfd >= 0);
  	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
  	assert(!close(execfd));

  	char *execpath = NULL;
  	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
  	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
  	assert(execfd >= 0);
  	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
  	assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However, it should be noted that the setting is now completely
hierarchical. Previously, a cloned pidns would just copy the current pidns
setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace issues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
Fixes: 105ff5339f ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
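
The hierarchy rule described in the commit message (the effective value is the most restrictive one found walking up the pidns tree, a namespace cannot set a value below what its parent enforces, and a new namespace copies the parent's effective value) can be modelled in plain userspace C. This is only an illustrative sketch, not the kernel code from this commit: `struct model_pidns`, `effective_scope()`, `set_scope()`, `init_child()` and the enum names are hypothetical, and the permission check a real sysctl write would need is left out.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative names for the three vm.memfd_noexec levels (0, 1, 2). */
enum noexec_scope {
	SCOPE_EXEC = 0,
	SCOPE_NOEXEC_SEAL = 1,
	SCOPE_NOEXEC_ENFORCED = 2,
};

/* Hypothetical stand-in for a pid namespace: a stored value and a parent link. */
struct model_pidns {
	struct model_pidns *parent;	/* NULL for the root namespace */
	int scope;			/* value stored in this namespace */
};

/* Effective setting: the most restrictive value on the path to the root. */
static int effective_scope(const struct model_pidns *ns)
{
	int scope = SCOPE_EXEC;

	for (; ns != NULL; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/* Reject values below what the parent chain already enforces. */
static int set_scope(struct model_pidns *ns, int value)
{
	int floor = effective_scope(ns->parent);

	if (value < floor || value > SCOPE_NOEXEC_ENFORCED)
		return -EINVAL;
	ns->scope = value;
	return 0;
}

/* A new namespace starts with a copy of the parent's effective setting. */
static void init_child(struct model_pidns *child, struct model_pidns *parent)
{
	child->parent = parent;
	child->scope = effective_scope(parent);
}
```

In this model, raising the root to 2 immediately makes every descendant's effective value 2, and a child's attempt to write 1 then fails. Lowering the root back to 0 succeeds, which is the "no ratchet" behaviour the commit argues for, and the child's effective value falls back to its own stored value.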
Documentation	mm: remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers	2023-08-21 13:37:58 -07:00
LICENSES	LICENSES: Add the copyleft-next-0.3.1 license	2022-11-08 15:44:01 +01:00
arch	um: convert {pmd, pte}_free_tlb() to use ptdescs	2023-08-21 13:37:58 -07:00
block	block-6.5-2023-07-21	2023-07-22 11:05:15 -07:00
certs	KEYS: Add missing function documentation	2023-04-24 16:15:52 +03:00
crypto	crypto: algif_hash - Fix race between MORE and non-MORE sends	2023-07-08 22:48:42 +10:00
drivers	mm/memory_hotplug: embed vmem_altmap details in memory block	2023-08-21 13:37:49 -07:00
fs	mm: memtest: convert to memtest_report_meminfo()	2023-08-21 13:37:47 -07:00
include	memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy	2023-08-21 13:37:59 -07:00
init	mm: remove arguments of show_mem()	2023-08-18 10:12:02 -07:00
io_uring	io_uring-6.5-2023-07-28	2023-07-28 10:19:44 -07:00
ipc	Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2023-02-24 19:20:07 -08:00
kernel	memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy	2023-08-21 13:37:59 -07:00
lib	maple_tree: replace data before marking dead in split and spanning store	2023-08-21 13:37:41 -07:00
mm	memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy	2023-08-21 13:37:59 -07:00
net	mm: allow per-VMA locks on file-backed VMAs	2023-08-18 10:12:51 -07:00
rust	rust: error: `impl Debug` for `Error` with `errname()` integration	2023-06-13 01:24:42 +02:00
samples	arm64: ftrace: Add direct call trampoline samples support	2023-07-10 17:51:54 -04:00
scripts	x86:	2023-07-30 11:19:08 -07:00
security	selinux: use vma_is_initial_stack() and vma_is_initial_heap()	2023-08-21 13:37:31 -07:00
sound	ASoC: Fixes for v6.5	2023-07-27 14:54:23 +02:00
tools	memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2	2023-08-21 13:37:59 -07:00
usr	initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP	2023-06-06 17:54:49 +09:00
virt	KVM: Grab a reference to KVM for VM and vCPU stats file descriptors	2023-07-29 11:05:28 -04:00
.clang-format	iommu: Add for_each_group_device()	2023-05-23 08:15:51 +02:00
.cocciconfig	…
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes	.gitattributes: set diff driver for Rust source code files	2023-05-31 17:48:25 +02:00
.gitignore	Revert ".gitignore: ignore .cover and .mbx"	2023-07-04 15:05:12 -07:00
.mailmap	mailmap: update remaining active codeaurora.org email addresses	2023-07-27 13:07:05 -07:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	- Address -Wmissing-prototype warnings	2023-06-26 16:43:54 -07:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	mm: kill frontswap	2023-08-21 13:37:26 -07:00
Makefile	Linux 6.5-rc4	2023-07-30 13:23:47 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.