docs: Document Syscall User Dispatch
Explain the interface, provide some background and security notes.

[ tglx: Add note about non-visibility, add it to the index and fix the
  kerneldoc warning ]

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-8-krisman@collabora.com
parent d87ae0fa21
commit a4452e671c
@@ -111,6 +111,7 @@ configure specific aspects of kernel behavior to your liking.
   rtc
   serial-console
   svga
   syscall-user-dispatch
   sysrq
   thunderbolt
   ufs

@@ -0,0 +1,90 @@
.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process. Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters. Therefore, a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace. The application is in control of a flip
switch, indicating the current personality of the process. A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable or disable the syscall redirection and either
execute syscalls directly (disabled) or send them to be emulated in
userspace through a SIGSYS (enabled).

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes. Instead,
a userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.

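The switch itself can be as small as one byte per thread. A minimal
sketch, assuming a thread-local selector named ``sud_selector`` and the
helpers ``enter_foreign_code``, ``enter_native_code`` and
``current_selector`` (all hypothetical names), and assuming the
selector's address has already been registered with the kernel through
the prctl described in the Interface section:

```c
/* Values a selector byte may hold. They are assumed to match
 * PR_SYS_DISPATCH_OFF and PR_SYS_DISPATCH_ON from <linux/prctl.h> and are
 * spelled out so the sketch also builds against older headers. */
#define SELECTOR_OFF 0  /* syscalls execute directly */
#define SELECTOR_ON  1  /* syscalls outside the allowed region raise SIGSYS */

/* One selector per thread, since the mechanism is configured per thread.
 * volatile keeps the compiler from caching or eliding the stores. */
static _Thread_local volatile unsigned char sud_selector = SELECTOR_OFF;

/* Called when crossing into the non-native part of the process. */
void enter_foreign_code(void)
{
    sud_selector = SELECTOR_ON;   /* a plain store, no syscall involved */
}

/* Called when crossing back into native code. */
void enter_native_code(void)
{
    sud_selector = SELECTOR_OFF;
}

/* Reports the personality currently selected for the calling thread. */
unsigned char current_selector(void)
{
    return sud_selector;
}
```

Both transitions are single byte stores, which is what makes the
boundary crossing cheap compared with a syscall per transition.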
There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux. Syscall User Dispatch, therefore,
doesn't rely on any of the syscall ABI to do the filtering. It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or
disable the mechanism globally for that thread. When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

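A hedged sketch of thin wrappers around that call follows. The names
``sud_enable`` and ``sud_disable`` are hypothetical, and the numeric
fallbacks are assumptions taken from the kernel's ``<linux/prctl.h>``;
they only matter when the installed headers predate the feature.

```c
#include <sys/prctl.h>

/* Fallbacks for headers that predate the feature. The values are assumed
 * from the kernel's <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Enable dispatch for the calling thread. Syscalls issued from
 * [offset, offset + length) always run directly; elsewhere the byte that
 * selector points to decides. Returns 0 on success, -1 with errno set
 * otherwise (for example on kernels without the feature). */
int sud_enable(unsigned long offset, unsigned long length,
               volatile unsigned char *selector)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)selector);
}

/* Disable dispatch for the calling thread. All other fields must be zero. */
int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

Callers should check the return value, since the prctl fails on kernels
or architectures that do not support the mechanism.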
[<offset>, <offset>+<length>) delimits a memory region interval
from which syscalls are always executed directly, regardless of the
userspace selector. This provides a fast path for the C library, which
includes the most common syscall dispatchers in native code
applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
interface should make sure that at least the signal trampoline code is
included in this region. In addition, on architectures that implement
the signal trampoline code in the vDSO, that trampoline is never
intercepted.

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly. The
selector can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any
other value terminates the program with a SIGSYS.

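The rules above can be summarised as a small decision function. This is
a userspace model for illustration only, not the kernel's code. The
names ``sud_outcome`` and ``sud_classify`` are hypothetical, the selector
values 0 and 1 are assumed to correspond to PR_SYS_DISPATCH_OFF and
PR_SYS_DISPATCH_ON, and the vDSO trampoline exemption is not modelled.

```c
#include <stdint.h>

/* Hypothetical names for the three outcomes described in the text. */
enum sud_outcome {
    SUD_EXECUTE,    /* the syscall runs directly */
    SUD_SIGSYS,     /* the syscall is redirected to the SIGSYS handler */
    SUD_TERMINATE   /* invalid selector: the program dies with SIGSYS */
};

/* ip is the address of the instruction that issued the syscall. */
enum sud_outcome sud_classify(uintptr_t ip, uintptr_t offset,
                              uintptr_t length, unsigned char selector)
{
    /* Unsigned subtraction wraps when ip < offset, so this single
     * comparison tests offset <= ip < offset + length without overflow. */
    if (ip - offset < length)
        return SUD_EXECUTE;     /* allowed region: selector is ignored */

    if (selector == 0)          /* assumed PR_SYS_DISPATCH_OFF */
        return SUD_EXECUTE;
    if (selector == 1)          /* assumed PR_SYS_DISPATCH_ON */
        return SUD_SIGSYS;
    return SUD_TERMINATE;
}
```

Note that the interval is half-open: an instruction located exactly at
``offset + length`` is outside the allowed region.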
Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process. It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value. If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.