OpenCloudOS-Kernel/kernel
Aleksa Sarai 9876cfe8ec memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable.  Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it.  A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

  % cat mount-memfd.c
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
  	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
  	assert(fsfd >= 0);
  	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

  	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
  	assert(dfd >= 0);

  	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
  	assert(execfd >= 0);
  	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
  	assert(!close(execfd));

  	char *execpath = NULL;
  	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
  	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
  	assert(execfd >= 0);
  	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
  	assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this.  There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you.  The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time.  This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However it should be noted that now that the setting is completely
hierarchical.  Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propoagate to existing pid namespaces.  Now, the restriction
applies recursively.  This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
Fixes: 105ff5339f ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:59 -07:00
..
bpf bpf: Repeat check_max_stack_depth for async callbacks 2023-07-18 15:21:09 -07:00
cgroup sched/psi: use kernfs polling functions for PSI trigger polling 2023-07-10 09:52:30 +02:00
configs mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED 2023-05-26 19:01:47 +02:00
debug kdb: move kdb_send_sig() declaration to a better header file 2023-07-03 09:27:12 +01:00
dma dma-mapping fixes for Linux 6.5 2023-07-09 10:24:22 -07:00
entry ptrace: Provide set/get interface for syscall user dispatch 2023-04-16 14:23:07 +02:00
events perf/core: use vma_is_initial_stack() and vma_is_initial_heap() 2023-08-21 13:37:32 -07:00
futex mm/mm_init.c: remove obsolete macro HASH_SMALL 2023-08-18 10:12:07 -07:00
gcov gcov: add support for checksum field 2022-12-21 14:31:52 -08:00
irq irqdomain: Use return value of strreplace() 2023-06-30 11:13:44 +02:00
kcsan kcsan: Don't expect 64 bits atomic builtins from 32 bits architectures 2023-06-09 23:29:50 +10:00
livepatch livepatch: Make 'klp_stack_entries' static 2023-06-05 13:56:52 +02:00
locking locking/rtmutex: Fix task->pi_waiters integrity 2023-07-17 13:59:10 +02:00
module module: fix init_module_from_file() error handling 2023-07-04 10:17:11 -07:00
power Merge branches 'pm-sleep' and 'pm-qos' 2023-07-14 19:13:21 +02:00
printk seqlock/latch: Provide raw_read_seqcount_latch_retry() 2023-06-05 21:11:03 +02:00
rcu Merge branches 'doc.2023.05.10a', 'fixes.2023.05.11a', 'kvfree.2023.05.10a', 'nocb.2023.05.11a', 'rcu-tasks.2023.05.10a', 'torture.2023.05.15a' and 'rcu-urgent.2023.06.06a' into HEAD 2023-06-07 13:44:06 -07:00
sched sched/psi: use kernfs polling functions for PSI trigger polling 2023-07-10 09:52:30 +02:00
time hardening updates for v6.5-rc1 2023-06-27 21:24:18 -07:00
trace Probe fixes for 6.5-rc3: 2023-07-30 11:27:22 -07:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
Makefile v6.5-rc1-modules-next 2023-06-28 15:51:08 -07:00
acct.c acct: fix potential integer overflow in encode_comp_t() 2022-11-30 16:13:18 -08:00
async.c
audit.c audit: use time_after to compare time 2022-08-29 19:47:03 -04:00
audit.h audit: avoid missing-prototype warnings 2023-05-17 11:34:55 -04:00
audit_fsnotify.c audit: fix potential double free on error path from fsnotify_add_inode_mark 2022-08-22 18:50:06 -04:00
audit_tree.c
audit_watch.c audit_init_parent(): constify path 2022-09-01 17:39:30 -04:00
auditfilter.c
auditsc.c capability: just use a 'u64' instead of a 'u32[2]' array 2023-03-01 10:01:22 -08:00
backtracetest.c
bounds.c mm: multi-gen LRU: minimal implementation 2022-09-26 19:46:09 -07:00
capability.c capability: fix kernel-doc warnings in capability.c 2023-05-22 14:30:52 -04:00
cfi.c cfi: Switch to -fsanitize=kcfi 2022-09-26 10:13:13 -07:00
compat.c sched_getaffinity: don't assume 'cpumask_size()' is fully initialized 2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c locking/atomic: treewide: use raw_atomic*_<op>() 2023-06-05 09:57:20 +02:00
cpu.c cpu/hotplug: Fix off by one in cpuhp_bringup_mask() 2023-05-23 18:06:40 +02:00
cpu_pm.c cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}() 2023-01-13 11:48:15 +01:00
crash_core.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
crash_dump.c
cred.c cred: Do not default to init_cred in prepare_kernel_cred() 2022-11-01 10:04:52 -07:00
delayacct.c delayacct: track delays from IRQ/SOFTIRQ 2023-04-18 16:39:34 -07:00
dma.c
exec_domain.c
exit.c fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-06-01 17:15:33 -04:00
extable.c context_tracking: Take NMI eqs entrypoints over RCU 2022-07-05 13:32:59 -07:00
fail_function.c kernel/fail_function: fix memory leak with using debugfs_lookup() 2023-02-08 13:36:22 +01:00
fork.c fork: lock VMAs of the parent process when forking 2023-07-08 14:08:02 -07:00
freezer.c freezer,sched: Rewrite core freezer logic 2022-09-07 21:53:50 +02:00
gen_kheaders.sh Revert "kheaders: substituting --sort in archive creation" 2023-05-28 16:20:21 +09:00
groups.c security: Add LSM hook to setgroups() syscall 2022-07-15 18:21:49 +00:00
hung_task.c kernel/hung_task.c: set some hung_task.c variables storage-class-specifier to static 2023-04-08 13:45:37 -07:00
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c trace: Add trace_ipi_send_cpu() 2023-03-24 11:01:29 +01:00
jump_label.c jump_label: Prevent key->enabled int overflow 2022-12-01 15:53:05 -08:00
kallsyms.c kallsyms: strip LTO-only suffixes from promoted global functions 2023-07-12 15:39:34 -07:00
kallsyms_internal.h kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[] 2022-11-12 18:47:36 -08:00
kallsyms_selftest.c kallsyms: Delete an unused parameter related to {module_}kallsyms_on_each_symbol() 2023-03-19 13:27:19 -07:00
kallsyms_selftest.h kallsyms: Add self-test facility 2022-11-15 00:42:02 -08:00
kcmp.c
kcov.c kcov: add prototypes for helper functions 2023-06-09 17:44:17 -07:00
kexec.c kexec: introduce sysctl parameters kexec_load_limit_* 2023-02-02 22:50:05 -08:00
kexec_core.c kexec: enable kexec_crash_size to support two crash kernel regions 2023-06-09 17:44:24 -07:00
kexec_elf.c
kexec_file.c - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in 2023-06-28 10:59:38 -07:00
kexec_internal.h panic, kexec: make __crash_kexec() NMI safe 2022-09-11 21:55:06 -07:00
kheaders.c kheaders: Use array declaration instead of char 2023-03-24 20:10:59 -07:00
kprobes.c kprobes: Prohibit probing on CFI preamble symbol 2023-07-29 23:32:26 +09:00
ksyms_common.c kallsyms: make kallsyms_show_value() as generic function 2023-06-08 12:27:20 -07:00
ksysfs.c kernel/ksysfs.c: use sysfs_emit for sysfs show handlers 2023-03-24 17:09:14 +01:00
kthread.c - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in 2023-06-28 10:59:38 -07:00
latencytop.c latencytop: use the last element of latency_record of system 2022-09-11 21:55:12 -07:00
module_signature.c
notifier.c notifiers: add tracepoints to the notifiers infrastructure 2023-04-08 13:45:38 -07:00
nsproxy.c convert setns(2) to fdget()/fdput() 2023-04-20 22:55:35 -04:00
padata.c padata: use alignment when calculating the number of worker threads 2023-03-14 17:06:44 +08:00
panic.c mm: remove arguments of show_mem() 2023-08-18 10:12:02 -07:00
params.c kallsyms: Replace all non-returning strlcpy with strscpy 2023-06-14 12:27:38 -07:00
pid.c memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid_namespace.c memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid_sysctl.h memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
profile.c kernel/profile.c: simplify duplicated code in profile_setup() 2022-09-11 21:55:12 -07:00
ptrace.c ptrace: Provide set/get interface for syscall user dispatch 2023-04-16 14:23:07 +02:00
range.c
reboot.c kernel/reboot: Add SYS_OFF_MODE_RESTART_PREPARE mode 2022-10-04 15:59:36 +02:00
regset.c
relay.c relayfs: fix out-of-bounds access in relay_file_read 2023-05-02 17:23:27 -07:00
resource.c dax/kmem: Fix leak of memory-hotplug resources 2023-02-17 14:58:01 -08:00
resource_kunit.c
rseq.c rseq: Extend struct rseq with per-memory-map concurrency ID 2022-12-27 12:52:12 +01:00
scftorture.c
scs.c scs: add support for dynamic shadow call stacks 2022-11-09 18:06:35 +00:00
seccomp.c seccomp: simplify sysctls with register_sysctl_init() 2023-04-13 11:49:20 -07:00
signal.c mm: suppress mm fault logging if fatal signal already pending 2023-07-26 10:51:59 -07:00
smp.c trace,smp: Add tracepoints for scheduling remotelly called functions 2023-06-16 22:08:09 +02:00
smpboot.c cpu/hotplug: Remove unused state functions 2023-05-15 13:45:00 +02:00
smpboot.h
softirq.c Revert "softirq: Let ksoftirqd do its job" 2023-05-09 21:50:27 +02:00
stackleak.c stackleak: allow to specify arch specific stackleak poison function 2023-04-20 11:36:35 +02:00
stacktrace.c
static_call.c
static_call_inline.c static_call: Add call depth tracking support 2022-10-17 16:41:16 +02:00
stop_machine.c Scheduler changes in this cycle were: 2022-05-24 11:11:13 -07:00
sys.c prctl: move PR_GET_AUXV out of PR_MCE_KILL 2023-07-17 12:53:21 -07:00
sys_ni.c asm-generic updates for 6.5 2023-07-06 10:06:04 -07:00
sysctl-test.c kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of i_{zero/one_hundred} 2022-09-08 16:56:45 -07:00
sysctl.c v6.5-rc1-sysctl-next 2023-06-28 16:05:21 -07:00
task_work.c task_work: use try_cmpxchg in task_work_add, task_work_cancel_match and task_work_run 2022-09-11 21:55:10 -07:00
taskstats.c genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
torture.c torture: Fix hang during kthread shutdown phase 2023-01-05 12:10:35 -08:00
tracepoint.c tracepoint: Allow livepatch module add trace event 2023-02-18 14:34:36 -05:00
tsacct.c taskstats: version 12 with thread group and exe info 2022-04-29 14:38:03 -07:00
ucount.c ucounts: Split rlimit and ucount values and max values 2022-05-18 18:24:57 -05:00
uid16.c
uid16.h
umh.c sysctl: fix unused proc_cap_handler() function warning 2023-06-29 15:19:43 -07:00
up.c
user-return-notifier.c
user.c kernel/user: Allow user_struct::locked_vm to be usable for iommufd 2022-11-30 20:16:49 -04:00
user_namespace.c userns: fix a struct's kernel-doc notation 2023-02-02 22:50:04 -08:00
usermode_driver.c blob_to_mnt(): kern_unmount() is needed to undo kern_mount() 2022-05-19 23:25:47 -04:00
utsname.c
utsname_sysctl.c utsname: simplify one-level sysctl registration for uts_kern_table 2023-04-13 11:49:35 -07:00
vhost_task.c vhost: Fix worker hangs due to missed wake up calls 2023-06-08 15:43:09 -04:00
watch_queue.c watch_queue: prevent dangling pipe pointer 2023-06-06 10:47:04 +02:00
watchdog.c watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64 2023-06-19 16:25:29 -07:00
watchdog_buddy.c watchdog/hardlockup: move SMP barriers from common code to buddy code 2023-06-19 16:25:28 -07:00
watchdog_perf.c watchdog/perf: add a weak function for an arch to detect if perf can use NMIs 2023-06-09 17:44:21 -07:00
workqueue.c workqueue: Changes for v6.5 2023-06-27 16:32:52 -07:00
workqueue_internal.h workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE 2023-05-17 17:02:08 -10:00