/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * syscalls.h - Linux syscall interfaces (non-arch-specific)
 *
 * Copyright (c) 2004 Randy Dunlap
 * Copyright (c) 2004 Open Source Development Labs
 */

#ifndef _LINUX_SYSCALLS_H
#define _LINUX_SYSCALLS_H

struct __aio_sigset;
struct epoll_event;
struct iattr;
struct inode;
struct iocb;
struct io_event;
struct iovec;
struct __kernel_old_itimerval;
struct kexec_segment;
struct linux_dirent;
struct linux_dirent64;
struct list_head;
struct mmap_arg_struct;
struct msgbuf;

separate kernel- and userland-side msghdr
Kernel-side struct msghdr is (currently) using the same layout as
userland one, but it's not a one-to-one copy - even without considering
32bit compat issues, we have msg_iov, msg_name and msg_control copied
to kernel[1]. It's fairly localized, so we get away with a few functions
where that knowledge is needed (and we could shrink that set even
more). Pretty much everything deals with the kernel-side variant and
the few places that want userland one just use a bunch of force-casts
to paper over the differences.
The thing is, kernel-side definition of struct msghdr is *not* exposed
in include/uapi - libc doesn't see it, etc. So we can add struct user_msghdr,
with proper annotations and let the few places that ever deal with those
beasts use it for userland pointers. Saner typechecking aside, that will
allow changing the layout of kernel-side msghdr - e.g. replacing
msg_iov/msg_iovlen there with struct iov_iter, getting rid of the need
to modify the iovec as we copy data to/from it, etc.
We could introduce kernel_msghdr instead, but that would create much more
noise - the absolute majority of the instances would need to have the
type switched to kernel_msghdr and definition of struct msghdr in
include/linux/socket.h is not going to be seen by userland anyway.
This commit just introduces user_msghdr and switches the few places that
are dealing with userland-side msghdr to it.
[1] actually, it's even trickier than that - we copy msg_control for
sendmsg, but keep the userland address on recvmsg.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

struct user_msghdr;
struct mmsghdr;
struct msqid_ds;
struct new_utsname;
struct nfsctl_arg;
struct __old_kernel_stat;
struct oldold_utsname;
struct old_utsname;
struct pollfd;
struct rlimit;
struct rlimit64;
struct rusage;
struct sched_param;

sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
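
The three constraints listed above can be modelled with a small helper. This is only a sketch: `struct task_timing` and both functions are hypothetical names, not the kernel's `struct sched_attr`, and the ordering rule `runtime <= deadline <= period` is an assumption taken from the usual deadline-scheduling model rather than from this commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three timing constraints (nanoseconds).
 * This is NOT the kernel's struct sched_attr. */
struct task_timing {
	uint64_t runtime_ns;   /* (maximum/typical) instance execution time */
	uint64_t period_ns;    /* minimum interval between consecutive instances */
	uint64_t deadline_ns;  /* time by which each instance must be completed */
};

/* Assumed sanity rule: 0 < runtime <= deadline <= period. */
static bool task_timing_is_sane(const struct task_timing *t)
{
	return t->runtime_ns > 0 &&
	       t->runtime_ns <= t->deadline_ns &&
	       t->deadline_ns <= t->period_ns;
}

/* CPU share requested by the task, in parts per million (runtime / period).
 * The multiplication can overflow for runtimes of several hours; good
 * enough for a sketch. */
static uint64_t task_timing_utilization_ppm(const struct task_timing *t)
{
	if (t->period_ns == 0)
		return 0;
	return (t->runtime_ns * UINT64_C(1000000)) / t->period_ns;
}
```

For instance, a task that needs 10 ms of runtime every 100 ms with a 30 ms deadline passes the check and asks for one tenth of a CPU.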
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

struct sched_attr;
struct sel_arg_struct;
struct semaphore;
struct sembuf;
struct shmid_ds;
struct sockaddr;
struct stat;
struct stat64;
struct statfs;
struct statfs64;

statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};
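
The 256-byte buffer requirement can be checked against the layout given above. The following is a sketch using local mirror types (`statx_sketch` and `statx_timestamp_sketch` are hypothetical names, with `<stdint.h>` types standing in for `__u32` and friends); it reflects the structure as written in this commit message, not necessarily the layout in any later kernel header.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Local mirror of struct statx_timestamp as listed in the commit message. */
struct statx_timestamp_sketch {
	int64_t tv_sec;
	int32_t tv_nsec;
	int32_t reserved;
};

/* Local mirror of struct statx as listed in the commit message. */
struct statx_sketch {
	uint32_t stx_mask;
	uint32_t stx_blksize;
	uint64_t stx_attributes;
	uint32_t stx_nlink;
	uint32_t stx_uid;
	uint32_t stx_gid;
	uint16_t stx_mode;
	uint16_t spare0[1];
	uint64_t stx_ino;
	uint64_t stx_size;
	uint64_t stx_blocks;
	uint64_t spare1[1];
	struct statx_timestamp_sketch stx_atime;
	struct statx_timestamp_sketch stx_btime;
	struct statx_timestamp_sketch stx_ctime;
	struct statx_timestamp_sketch stx_mtime;
	uint32_t stx_rdev_major;
	uint32_t stx_rdev_minor;
	uint32_t stx_dev_major;
	uint32_t stx_dev_minor;
	uint64_t spare2[14];
};

/* Compile-time checks: the fields pack without padding into 256 bytes. */
static_assert(sizeof(struct statx_timestamp_sketch) == 16, "timestamp is 16 bytes");
static_assert(offsetof(struct statx_sketch, stx_ino) == 32, "stx_ino at offset 32");
static_assert(offsetof(struct statx_sketch, stx_atime) == 64, "timestamps start at 64");
static_assert(offsetof(struct statx_sketch, stx_rdev_major) == 128, "device numbers at 128");
static_assert(sizeof(struct statx_sketch) == 256, "buffer must be 256 bytes");
```

Because every field sits at a naturally aligned offset, the size should come out the same on 32-bit and 64-bit targets, which matches the stated goal of making the fields a consistent size on all arches.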
The defined bits in request_mask and stx_mask are:
	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
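
A sketch of what the sign convention described above means for arithmetic. `struct ts_sketch` and both helpers are hypothetical; the code simply follows the rule stated in this commit message (seconds and nanoseconds share a sign for pre-1970 times) and makes no claim about what any particular kernel version actually returns.

```c
#include <stdint.h>

/* Hypothetical local mirror of the seconds/nanoseconds pair. */
struct ts_sketch {
	int64_t tv_sec;
	int32_t tv_nsec;
};

/* Total nanoseconds since the epoch; works for either sign because both
 * fields carry the same sign under the convention described above. */
static int64_t ts_to_ns(struct ts_sketch t)
{
	return t.tv_sec * INT64_C(1000000000) + t.tv_nsec;
}

/* Convert to the POSIX timespec convention, where tv_nsec is always in
 * the range [0, 999999999]. */
static struct ts_sketch ts_normalize(struct ts_sketch t)
{
	if (t.tv_nsec < 0) {
		t.tv_sec -= 1;
		t.tv_nsec += 1000000000;
	}
	return t;
}
```

For example, 1.5 seconds before the epoch would be carried as `{-1, -500000000}` under this convention and normalizes to `{-2, 500000000}`; both forms represent the same instant.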
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
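
Since fields in classes (1) and (3) are only meaningful when their bit is set in stx_mask, a caller is expected to test the mask before trusting a value. A minimal sketch follows; the `SK_`-prefixed constants are local stand-ins whose numeric values are assumed to match the STATX_* definitions in the kernel's uapi header, and `field_is_valid()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the STATX_* request/result bits (values assumed to
 * match the kernel's uapi definitions). */
#define SK_STATX_UID          0x00000008u
#define SK_STATX_GID          0x00000010u
#define SK_STATX_BASIC_STATS  0x000007ffu
#define SK_STATX_BTIME        0x00000800u

/* True only if every bit in 'want' was reported as valid in stx_mask. */
static bool field_is_valid(uint32_t stx_mask, uint32_t want)
{
	return (stx_mask & want) == want;
}
```

With the `results=7ff` value shown in the example output of this commit message, all basic stats are valid but `SK_STATX_BTIME` is not, so stx_btime should be ignored for that file.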
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

struct statx;
struct sysinfo;
struct timespec;
struct __kernel_old_timeval;
struct __kernel_timex;
struct timezone;
struct tms;
struct utimbuf;
struct mq_attr;
struct compat_stat;

y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
	old				new
	---				---
	compat_time_t			old_time32_t
	struct compat_timeval		struct old_timeval32
	struct compat_timespec		struct old_timespec32
	struct compat_itimerspec	struct old_itimerspec32
	ns_to_compat_timeval()		ns_to_old_timeval32()
	get_compat_itimerspec64()	get_old_itimerspec32()
	put_compat_itimerspec64()	put_old_itimerspec32()
	compat_get_timespec64()		get_old_timespec32()
	compat_put_timespec64()		put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

struct old_timeval32;
struct robust_list_head;

futex: Implement sys_futex_waitv()
Add support to wait on multiple futexes. This is the interface
implemented by this syscall:
	futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
		    unsigned int flags, struct timespec *timeout, clockid_t clockid)

	struct futex_waitv {
		__u64 val;
		__u64 uaddr;
		__u32 flags;
		__u32 __reserved;
	};
Given an array of struct futex_waitv, wait on each uaddr. The thread
wakes if a futex_wake() is performed at any uaddr. The syscall returns
immediately if any waiter has *uaddr != val. *timeout is an optional
absolute timeout value for the operation. This syscall supports only
64bit sized timeout structs. The flags argument of the syscall should be
empty, but it can be used for future extensions. Flags for shared
futexes, sizes, etc. should be used on the individual flags of each
waiter.
__reserved is used for explicit padding and should be 0, but it might be
used for future extensions. If the userspace uses 32-bit pointers, it
should make sure to explicitly cast it when assigning to waitv::uaddr.
Returns the array index of one of the woken futexes. No information is
given about how many were woken, or about any particular attribute of the
returned one (whether it was the first woken, whether it has the smallest
index...).
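
The note about casting 32-bit pointers can be illustrated by filling one array entry by hand. This is a sketch with a local mirror type: `futex_waitv_sketch` and `make_waiter()` are hypothetical names, and no system call is made.

```c
#include <stdint.h>

/* Local mirror of struct futex_waitv as listed in the commit message. */
struct futex_waitv_sketch {
	uint64_t val;       /* expected value of *uaddr */
	uint64_t uaddr;     /* user address of the futex word */
	uint32_t flags;     /* per-waiter flags (shared, size, ...) */
	uint32_t reserved;  /* explicit padding, should be 0 */
};

/* Build one waiter entry. Going through uintptr_t first widens a 32-bit
 * pointer to 64 bits explicitly instead of relying on an implicit
 * pointer-to-integer conversion. */
static struct futex_waitv_sketch make_waiter(const uint32_t *futex_word,
					     uint32_t expected, uint32_t flags)
{
	struct futex_waitv_sketch w = {
		.val = expected,
		.uaddr = (uint64_t)(uintptr_t)futex_word,
		.flags = flags,
		.reserved = 0,
	};
	return w;
}
```

An array of such entries, one per futex word, is what the waiters argument described above would point to.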
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210923171111.300673-17-andrealmeid@collabora.com

struct futex_waitv;
struct getcpu_cache;
struct old_linux_dirent;

perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has outgrown its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done

  FILES=$(find . -name perf_event.*)

  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

struct perf_event_attr;
struct file_handle;
struct sigaltstack;

rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
  arm32:               344.0                 31.4          11.0
  x86-64:               15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
  per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
  arm32:              2502.0               2250.0           1.1
  x86-64:              117.4                 98.0           1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
  arm32:               751.0                128.5           5.8
  x86-64:               53.4                 28.6           1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

struct rseq;
union bpf_attr;

Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
struct io_uring_params;
|
fork: add clone3
This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals for what a new process
creation syscall or even API should look like. Some people even go
so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea, this patchset has
nothing to do with that and does not want to.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
in clone() into a dedicated struct clone_args
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
return argument in clone() their own dedicated return argument
(e.g. CLONE_PIDFD)
- use a separate kernel internal struct kernel_clone_args that is
properly typed according to current kernel conventions in fork.c and is
different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
syscalls such as fork(), vfork(), clone(), and clone3() behave identically
(Arnd suggested that we can probably also port do_fork() itself in a
separate patchset.)
- ease of transition for userspace from clone() to clone3()
This very much means that we do *not* remove functionality that userspace
currently relies on as the latter is a good way of creating a syscall
that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
long sys_clone3(struct clone_args __user *uargs, size_t size)
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
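As a rough userspace sketch of the split between the two structs above, the function below converts an all-u64 uapi-style struct into a typed internal one. The `sketch_` names are hypothetical and `__user` annotations are dropped. The numeric values of `SKETCH_CSIGNAL` and `SKETCH_CLONE_DETACHED` are written from memory of the uapi headers, and rejecting CSIGNAL bits in `flags` and out-of-range exit signals are assumptions about how the "deprecated" rule would be enforced, not a copy of fork.c.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_CSIGNAL        0x000000ffULL /* exit-signal byte of legacy clone() flags */
#define SKETCH_CLONE_DETACHED 0x00400000ULL

/* uapi-style layout: only 64-bit fields, same on 32 and 64 bit. */
struct sketch_clone_args {
	uint64_t flags, pidfd, child_tid, parent_tid;
	uint64_t exit_signal, stack, stack_size, tls;
};

/* Internal layout: each field carries its natural type. */
struct sketch_kernel_clone_args {
	uint64_t flags;
	int *pidfd;
	int *child_tid;
	int *parent_tid;
	int exit_signal;
	unsigned long stack, stack_size, tls;
};

static int sketch_convert(const struct sketch_clone_args *u,
			  struct sketch_kernel_clone_args *k)
{
	/* Assumption: both legacy encodings are refused in flags. */
	if (u->flags & (SKETCH_CSIGNAL | SKETCH_CLONE_DETACHED))
		return -EINVAL;
	/* Assumption: the dedicated exit_signal must fit the old range. */
	if (u->exit_signal > SKETCH_CSIGNAL)
		return -EINVAL;

	k->flags       = u->flags;
	k->pidfd       = (int *)(uintptr_t)u->pidfd;
	k->child_tid   = (int *)(uintptr_t)u->child_tid;
	k->parent_tid  = (int *)(uintptr_t)u->parent_tid;
	k->exit_signal = (int)u->exit_signal;
	k->stack       = (unsigned long)u->stack;
	k->stack_size  = (unsigned long)u->stack_size;
	k->tls         = (unsigned long)u->tls;
	return 0;
}
```

Keeping the uapi struct free of pointers and `long` fields is what makes a single layout work for both 32-bit and 64-bit callers; the conversion step is where the typing is restored.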
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. The new clone3() syscall
supports all of the old workloads and opens up the ability to add new
features, which should make switching to it more appealing for userspace. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
It is superseded by a dedicated "exit_signal" argument in struct
clone_args freeing up space for additional flags.
This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-25 17:36:41 +08:00
|
|
|
struct clone_args;
|
open: introduce openat2(2) syscall
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).
/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
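The size-based extension rule can be sketched in userspace as follows. It assumes the usual convention for extensible structs: a shorter struct from an older caller is zero-extended, and a longer struct from a newer caller is accepted only if the bytes the receiver does not know about are all zero. `sketch_copy_struct` is a hypothetical helper written for this illustration, not the kernel's implementation, and the choice of -E2BIG for the rejection case is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * dst/ksize: the struct version the receiver was built with.
 * src/usize: the struct and size the caller passed in.
 */
static int sketch_copy_struct(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	/* Newer caller: unknown trailing fields must be zero (no-op). */
	for (i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	memcpy(dst, src, n);
	/* Older caller: fields it does not know about default to zero. */
	memset((unsigned char *)dst + n, 0, ksize - n);
	return 0;
}
```

Because a zero value always means "feature not requested", old binaries keep working on new kernels and new binaries keep working on old kernels as long as they do not actually use a newer field.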
/* Description. */
The initial version of 'struct open_how' contains the following fields:
flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64-bits wide to
allow for more O_ flags than currently permitted with openat(2).
mode
The file mode for O_CREAT or O_TMPFILE.
Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
resolve
Restrict path resolution (in contrast to O_* flags they affect all
path components). The current set of flags is as follows (at the
moment, all of the RESOLVE_ flags are implemented as just passing
the corresponding LOOKUP_ flag).
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
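The strict checks described for 'flags' and 'mode' can be sketched as a small validator. The flag values and the `SKETCH_KNOWN_FLAGS` mask below are hypothetical placeholders chosen for the example, not the real O_* values, and the function illustrates the rules in the text rather than reproducing the kernel's check.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration only. */
#define SKETCH_O_RDWR    0x0002ULL
#define SKETCH_O_CREAT   0x0040ULL
#define SKETCH_O_TMPFILE 0x4000ULL
#define SKETCH_O_PATH    0x8000ULL
#define SKETCH_KNOWN_FLAGS \
	(SKETCH_O_RDWR | SKETCH_O_CREAT | SKETCH_O_TMPFILE | SKETCH_O_PATH)

struct sketch_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

static int sketch_validate_how(const struct sketch_open_how *how)
{
	/* Unknown bits are an error instead of being silently ignored. */
	if (how->flags & ~SKETCH_KNOWN_FLAGS)
		return -EINVAL;
	/* Nonsensical combinations such as O_PATH|O_RDWR are refused. */
	if ((how->flags & SKETCH_O_PATH) && (how->flags & SKETCH_O_RDWR))
		return -EINVAL;
	/* mode is only meaningful when a file may be created. */
	if (!(how->flags & (SKETCH_O_CREAT | SKETCH_O_TMPFILE)) && how->mode != 0)
		return -EINVAL;
	return 0;
}
```

With this behaviour, userspace can probe for a flag by trying it once and checking for -EINVAL, instead of guessing from the kernel version.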
open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).
After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.
In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).
Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.
Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-18 20:07:59 +08:00
|
|
|
struct open_how;
|
fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount
api allows changing the properties of a superblock, there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.
The new mount_setattr() syscall allows recursively clearing and setting
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:
int mount_setattr(int dfd, const char *path, unsigned flags,
struct mount_attr *uattr, size_t usize);
Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.
The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:
struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u64 propagation;
__u64 userns_fd;
};
The @attr_set and @attr_clr members are used to set and clear mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
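The set/clear semantics and the atime rule above can be sketched as a pure function over a bitmask. The `SKETCH_ATTR_*` values follow the uapi MOUNT_ATTR_* constants as best recalled and should be treated as illustrative; the real syscall operates on kernel-internal mount flags and walks a mount tree, which this sketch does not attempt.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_ATTR_RDONLY      0x01ULL
#define SKETCH_ATTR_NOEXEC      0x08ULL
#define SKETCH_ATTR__ATIME      0x70ULL /* mask covering the atime enum */
#define SKETCH_ATTR_RELATIME    0x00ULL /* enum value 0 inside the mask */
#define SKETCH_ATTR_NOATIME     0x10ULL
#define SKETCH_ATTR_STRICTATIME 0x20ULL

static int sketch_apply_attr(uint64_t *attrs, uint64_t set, uint64_t clr)
{
	uint64_t clr_atime = clr & SKETCH_ATTR__ATIME;

	/* The atime field may only be cleared as a whole. */
	if (clr_atime != 0 && clr_atime != SKETCH_ATTR__ATIME)
		return -EINVAL;
	/* A new atime setting requires clearing the old one first. */
	if ((set & SKETCH_ATTR__ATIME) && clr_atime == 0)
		return -EINVAL;

	*attrs = (*attrs & ~clr) | set;
	return 0;
}
```

Applying the same `set`/`clr` pair twice leaves the result unchanged, which matches the idempotency the commit message describes. Selecting relatime means clearing `SKETCH_ATTR__ATIME` and setting nothing, since its enum value is 0.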
The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
The @userns_fd field lets the user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.
[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-21 21:19:53 +08:00
|
|
|
struct mount_attr;
|
2021-04-22 23:41:18 +08:00
|
|
|
struct landlock_ruleset_attr;
|
|
|
|
enum landlock_rule_type;
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for the x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e. no flag specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
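How a byte range might map to page-cache indices can be sketched as below. This is an assumption-laden illustration: the page size is fixed at 4 KiB, the last byte queried is taken to be `off + len - 1`, and `len == 0` is represented by an open-ended upper bound. The `sketch_` names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

struct sketch_range {
	uint64_t off;
	uint64_t len;
};

/* Compute the first and last page index covered by the range. */
static void sketch_page_span(const struct sketch_range *r,
			     uint64_t *first, uint64_t *last)
{
	*first = r->off >> SKETCH_PAGE_SHIFT;
	if (r->len == 0)
		*last = UINT64_MAX; /* "until the end of the file" */
	else
		*last = (r->off + r->len - 1) >> SKETCH_PAGE_SHIFT;
}
```

A caller that wants statistics for a whole file would pass `off = 0` and `len = 0`; a one-byte range touches exactly one page.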
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:07 +08:00
|
|
|
struct cachestat_range;
|
|
|
|
struct cachestat;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#include <linux/types.h>
|
|
|
|
#include <linux/aio_abi.h>
|
|
|
|
#include <linux/capability.h>
|
2012-11-26 11:24:19 +08:00
|
|
|
#include <linux/signal.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/list.h>
|
2011-11-24 09:12:59 +08:00
|
|
|
#include <linux/bug.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/sem.h>
|
|
|
|
#include <asm/siginfo.h>
|
2009-08-11 04:52:47 +08:00
|
|
|
#include <linux/unistd.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/quota.h>
|
|
|
|
#include <linux/key.h>
|
2018-07-11 21:56:50 +08:00
|
|
|
#include <linux/personality.h>
|
2009-04-09 02:40:59 +08:00
|
|
|
#include <trace/syscall.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-04-05 17:53:01 +08:00
|
|
|
#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
|
|
|
|
/*
 * It may be useful for an architecture to override the definitions of the
 * SYSCALL_DEFINE0() and __SYSCALL_DEFINEx() macros, in particular to use a
 * different calling convention for syscalls. To allow for that, the prototypes
 * for the sys_*() functions below will *not* be included if
 * CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.
 */
|
|
|
|
#include <asm/syscall_wrapper.h>
|
|
|
|
#endif /* CONFIG_ARCH_HAS_SYSCALL_WRAPPER */
|
|
|
|
|
2013-01-22 04:03:44 +08:00
|
|
|
/*
 * __MAP - apply a macro to syscall arguments
 * __MAP(n, m, t1, a1, t2, a2, ..., tn, an) will expand to
 *    m(t1, a1), m(t2, a2), ..., m(tn, an)
 * The first argument must be equal to the amount of type/name
 * pairs given. Note that this list of pairs (i.e. the arguments
 * of __MAP starting at the third one) is in the same format as
 * for SYSCALL_DEFINE<n>/COMPAT_SYSCALL_DEFINE<n>
 */
|
2013-03-06 04:36:40 +08:00
|
|
|
#define __MAP0(m,...)
|
syscalls/x86: Use 'struct pt_regs' based syscall calling convention for 64-bit syscalls
Let's make use of ARCH_HAS_SYSCALL_WRAPPER=y on pure 64-bit x86-64 systems:
Each syscall defines a stub which takes struct pt_regs as its only
argument. It decodes just those parameters it needs, e.g:
asmlinkage long sys_xyzzy(const struct pt_regs *regs)
{
return SyS_xyzzy(regs->di, regs->si, regs->dx);
}
This approach avoids leaking random user-provided register content down
the call chain.
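The stub pattern can be sketched in userspace with a stand-in register struct. `struct sketch_regs` is a hypothetical, reduced layout and `sketch_do_xyzzy()` is an invented three-argument implementation; the point is only that the stub reads the registers it needs and never forwards the others.

```c
/* Hypothetical, reduced register file: the six syscall argument registers. */
struct sketch_regs {
	unsigned long di, si, dx, r10, r8, r9;
};

/* The typed implementation: takes exactly three arguments. */
static long sketch_do_xyzzy(int a, long b, unsigned int c)
{
	return a + b + (long)c;
}

/*
 * The stub: its only parameter is the register snapshot. It decodes the
 * first three argument registers and ignores r10, r8 and r9, so whatever
 * userspace left in those registers never reaches the implementation.
 */
static long sketch_sys_xyzzy(const struct sketch_regs *regs)
{
	return sketch_do_xyzzy((int)regs->di, (long)regs->si,
			       (unsigned int)regs->dx);
}
```

Calling `sketch_sys_xyzzy()` with garbage in `r10`, `r8` and `r9` gives the same result as with zeros there, which is the property the commit message is after.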
For example, for sys_recv() which is a 4-parameter syscall, the assembly
now is (in slightly reordered fashion):
<sys_recv>:
callq <__fentry__>
/* decode regs->di, ->si, ->dx and ->r10 */
mov 0x70(%rdi),%rdi
mov 0x68(%rdi),%rsi
mov 0x60(%rdi),%rdx
mov 0x38(%rdi),%rcx
[ SyS_recv() is automatically inlined by the compiler,
as it is not [yet] used anywhere else ]
/* clear %r9 and %r8, the 5th and 6th args */
xor %r9d,%r9d
xor %r8d,%r8d
/* do the actual work */
callq __sys_recvfrom
/* cleanup and return */
cltq
retq
The only valid place in an x86-64 kernel which rightfully calls
a syscall function on its own -- vsyscall -- needs to be modified
to pass struct pt_regs onwards as well.
To keep the syscall table generation working independent of
SYSCALL_PTREGS being enabled, the stubs are named the same as the
"original" syscall stubs, i.e. sys_*().
This patch is based on an original proof-of-concept
| From: Linus Torvalds <torvalds@linux-foundation.org>
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
and was split up and heavily modified by me, in particular to base it on
ARCH_HAS_SYSCALL_WRAPPER, to limit it to 64-bit-only for the time being,
and to update the vsyscall to the new calling convention.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180405095307.3730-4-linux@dominikbrodowski.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 17:53:02 +08:00
|
|
|
#define __MAP1(m,t,a,...) m(t,a)
|
2013-01-22 04:03:44 +08:00
|
|
|
#define __MAP2(m,t,a,...) m(t,a), __MAP1(m,__VA_ARGS__)
|
|
|
|
#define __MAP3(m,t,a,...) m(t,a), __MAP2(m,__VA_ARGS__)
|
|
|
|
#define __MAP4(m,t,a,...) m(t,a), __MAP3(m,__VA_ARGS__)
|
|
|
|
#define __MAP5(m,t,a,...) m(t,a), __MAP4(m,__VA_ARGS__)
|
|
|
|
#define __MAP6(m,t,a,...) m(t,a), __MAP5(m,__VA_ARGS__)
|
|
|
|
#define __MAP(n,...) __MAP##n(__VA_ARGS__)
|
|
|
|
|
|
|
|
#define __SC_DECL(t, a) t a
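A reduced copy of the mapping macros, renamed with a `SKETCH_` prefix to stay out of the reserved identifier space, shows how one type/name list produces a parameter list, a call argument list, and the string tables used for tracing metadata. The function `sketch_add3` and its arguments are invented for the example, and only arities up to three are defined. Like the originals, `SKETCH_MAP1` relies on the compiler accepting an empty variadic argument, as GCC and Clang do.

```c
#define SKETCH_MAP0(m,...)
#define SKETCH_MAP1(m,t,a,...) m(t,a)
#define SKETCH_MAP2(m,t,a,...) m(t,a), SKETCH_MAP1(m,__VA_ARGS__)
#define SKETCH_MAP3(m,t,a,...) m(t,a), SKETCH_MAP2(m,__VA_ARGS__)
#define SKETCH_MAP(n,...) SKETCH_MAP##n(__VA_ARGS__)

#define SKETCH_DECL(t, a)      t a /* "int x"  : parameter declaration */
#define SKETCH_ARGS(t, a)      a   /* "x"      : call argument         */
#define SKETCH_STR_TDECL(t, a) #t  /* "\"int\"": type as a string      */
#define SKETCH_STR_ADECL(t, a) #a  /* "\"x\""  : name as a string      */

/* Expands to: static long sketch_add3_impl(int x, long y, short z) */
static long sketch_add3_impl(SKETCH_MAP(3, SKETCH_DECL, int, x, long, y, short, z))
{
	return x + y + z;
}

static long sketch_add3(int x, long y, short z)
{
	/* Expands to: sketch_add3_impl(x, y, z) */
	return sketch_add3_impl(SKETCH_MAP(3, SKETCH_ARGS, int, x, long, y, short, z));
}

/* Expands to: { "int", "long", "short" } and { "x", "y", "z" } */
static const char *sketch_types[] = {
	SKETCH_MAP(3, SKETCH_STR_TDECL, int, x, long, y, short, z)
};
static const char *sketch_names[] = {
	SKETCH_MAP(3, SKETCH_STR_ADECL, int, x, long, y, short, z)
};
```

The first argument selects the arity by token pasting, which is why it must equal the number of type/name pairs that follow.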
|
2017-07-08 23:40:39 +08:00
|
|
|
#define __TYPE_AS(t, v) __same_type((__force t)0, v)
|
|
|
|
#define __TYPE_IS_L(t) (__TYPE_AS(t, 0L))
|
|
|
|
#define __TYPE_IS_UL(t) (__TYPE_AS(t, 0UL))
|
|
|
|
#define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
|
2013-01-22 04:16:58 +08:00
|
|
|
#define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
|
2017-07-08 23:40:39 +08:00
|
|
|
#define __SC_CAST(t, a) (__force t) a
|
2013-01-22 04:25:54 +08:00
|
|
|
#define __SC_ARGS(t, a) a
|
2013-01-22 04:16:58 +08:00
|
|
|
#define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))
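The type-selection trick behind these helpers can be tried in userspace with GNU C builtins. The sketch below renames the macros with a `SKETCH_` prefix, drops the sparse-only `__force` annotation, and spells out the same-type check with `__builtin_types_compatible_p`; it needs GCC or Clang. The two helper functions are invented for the example.

```c
#define SKETCH_SAME_TYPE(a, b) \
	__builtin_types_compatible_p(__typeof__(a), __typeof__(b))
#define SKETCH_TYPE_AS(t, v)  SKETCH_SAME_TYPE((t)0, v)
#define SKETCH_TYPE_IS_LL(t) \
	(SKETCH_TYPE_AS(t, 0LL) || SKETCH_TYPE_AS(t, 0ULL))

/* Declare 'a' as long long if t is a (un)signed long long, else as long. */
#define SKETCH_LONG(t, a) \
	__typeof__(__builtin_choose_expr(SKETCH_TYPE_IS_LL(t), 0LL, 0L)) a

/* Returns 1 if an int argument is carried in a plain long. */
static int sketch_int_widens_to_long(void)
{
	SKETCH_LONG(int, w) = 0;
	return SKETCH_SAME_TYPE(w, 0L);
}

/* Returns 1 if an unsigned long long argument is carried in a long long. */
static int sketch_ull_widens_to_ll(void)
{
	SKETCH_LONG(unsigned long long, w) = 0;
	return SKETCH_SAME_TYPE(w, 0LL);
}
```

Every argument is thus received in a register-sized integer and only then cast back to its declared type, so any stray upper bits in a narrower argument are discarded by the cast rather than trusted.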
|
2009-01-14 21:13:59 +08:00
|
|
|
|
2009-03-13 22:42:11 +08:00
|
|
|
#ifdef CONFIG_FTRACE_SYSCALLS
|
2013-01-22 04:03:44 +08:00
|
|
|
#define __SC_STR_ADECL(t, a) #a
|
|
|
|
#define __SC_STR_TDECL(t, a) #t
|
2009-03-13 22:42:11 +08:00
|
|
|
|
2015-05-05 23:45:27 +08:00
|
|
|
extern struct trace_event_class event_class_syscall_enter;
|
|
|
|
extern struct trace_event_class event_class_syscall_exit;
|
2010-04-23 22:00:22 +08:00
|
|
|
extern struct trace_event_functions enter_syscall_print_funcs;
|
|
|
|
extern struct trace_event_functions exit_syscall_print_funcs;
|
2010-04-20 22:47:33 +08:00
|
|
|
|
2009-08-11 04:52:47 +08:00
|
|
|
#define SYSCALL_TRACE_ENTER_EVENT(sname) \
|
tracing: Replace syscall_meta_data struct array with pointer array
Currently the syscall_meta structures for the syscall tracepoints are
placed in the __syscall_metadata section, and at link time, the linker
makes one large array of all these syscall metadata structures. On boot
up, this array is read (much like the initcall sections) and the syscall
data is processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are supposed to be in an array.
A hack was previously used to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the __syscall_metadata section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The __syscall_metadata section is also moved into the .init.data section
as it is now only needed at boot up.
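The pointer-array approach can be sketched in userspace on an ELF toolchain. The sketch assumes GCC or Clang with GNU ld or lld, which provide `__start_<section>` and `__stop_<section>` symbols for sections whose names are valid C identifiers. The section name `sketch_metadata`, the struct, and the registration macro are all hypothetical.

```c
struct sketch_meta {
	const char *name;
	int nb_args;
};

/*
 * Define the structure anywhere, then drop only a pointer to it into a
 * dedicated section. Pointers have natural alignment, so the linker
 * packs them into a gap-free array regardless of the struct's layout.
 */
#define SKETCH_REGISTER(sname, nb)					\
	static struct sketch_meta sketch_meta_##sname = {		\
		.name = "sys_" #sname,					\
		.nb_args = (nb),					\
	};								\
	static struct sketch_meta *sketch_ptr_##sname			\
	__attribute__((used, section("sketch_metadata"))) =		\
		&sketch_meta_##sname

SKETCH_REGISTER(read, 3);
SKETCH_REGISTER(getpid, 0);

/* Section bounds supplied by the linker. */
extern struct sketch_meta *__start_sketch_metadata[];
extern struct sketch_meta *__stop_sketch_metadata[];

/* Walk the pointer array the way boot code would walk the section. */
static int sketch_count(void)
{
	int n = 0;
	struct sketch_meta **p;

	for (p = __start_sketch_metadata; p < __stop_sketch_metadata; p++)
		n++;
	return n;
}

static int sketch_total_args(void)
{
	int total = 0;
	struct sketch_meta **p;

	for (p = __start_sketch_metadata; p < __stop_sketch_metadata; p++)
		total += (*p)->nb_args;
	return total;
}
```

Iterating over pointers rather than over the structures themselves means the walk never depends on how the compiler chose to align or pad `struct sketch_meta`.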
Suggested-by: David Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-03 06:06:09 +08:00
|
|
|
static struct syscall_metadata __syscall_meta_##sname; \
|
2015-05-05 23:45:27 +08:00
|
|
|
static struct trace_event_call __used \
|
2009-08-11 04:52:47 +08:00
|
|
|
event_enter_##sname = { \
|
2010-04-22 00:27:06 +08:00
|
|
|
.class = &event_class_syscall_enter, \
|
2014-04-10 05:06:08 +08:00
|
|
|
{ \
|
|
|
|
.name = "sys_enter"#sname, \
|
|
|
|
}, \
|
2010-04-23 22:00:22 +08:00
|
|
|
.event.funcs = &enter_syscall_print_funcs, \
|
2009-12-01 16:23:30 +08:00
|
|
|
.data = (void *)&__syscall_meta_##sname,\
|
2013-10-24 21:34:19 +08:00
|
|
|
.flags = TRACE_EVENT_FL_CAP_ANY, \
|
2010-11-18 09:11:42 +08:00
|
|
|
}; \
|
2015-05-05 23:45:27 +08:00
|
|
|
static struct trace_event_call __used \
|
2020-10-22 10:36:07 +08:00
|
|
|
__section("_ftrace_events") \
|
2011-01-26 16:49:00 +08:00
|
|
|
*__event_enter_##sname = &event_enter_##sname;
|
2009-08-11 04:52:47 +08:00
|
|
|
|
|
|
|
#define SYSCALL_TRACE_EXIT_EVENT(sname) \
|
2011-02-03 06:06:09 +08:00
|
|
|
static struct syscall_metadata __syscall_meta_##sname; \
|
2015-05-05 23:45:27 +08:00
|
|
|
static struct trace_event_call __used \
|
2009-08-11 04:52:47 +08:00
|
|
|
event_exit_##sname = { \
|
2010-04-22 00:27:06 +08:00
|
|
|
.class = &event_class_syscall_exit, \
|
2014-04-10 05:06:08 +08:00
|
|
|
{ \
|
|
|
|
.name = "sys_exit"#sname, \
|
|
|
|
}, \
|
2010-04-23 22:00:22 +08:00
|
|
|
.event.funcs = &exit_syscall_print_funcs, \
|
2009-12-01 16:23:30 +08:00
|
|
|
.data = (void *)&__syscall_meta_##sname,\
|
2013-10-24 21:34:19 +08:00
|
|
|
.flags = TRACE_EVENT_FL_CAP_ANY, \
|
2010-11-18 09:11:42 +08:00
|
|
|
}; \
|
2015-05-05 23:45:27 +08:00
|
|
|
static struct trace_event_call __used \
|
2020-10-22 10:36:07 +08:00
|
|
|
__section("_ftrace_events") \
|
2011-01-26 16:49:00 +08:00
|
|
|
*__event_exit_##sname = &event_exit_##sname;
|
2009-08-11 04:52:47 +08:00
|
|
|
|
2013-03-06 04:36:40 +08:00
|
|
|
#define SYSCALL_METADATA(sname, nb, ...) \
|
|
|
|
static const char *types_##sname[] = { \
|
|
|
|
__MAP(nb,__SC_STR_TDECL,__VA_ARGS__) \
|
|
|
|
}; \
|
|
|
|
static const char *args_##sname[] = { \
|
|
|
|
__MAP(nb,__SC_STR_ADECL,__VA_ARGS__) \
|
|
|
|
}; \
|
2009-08-19 15:54:51 +08:00
|
|
|
SYSCALL_TRACE_ENTER_EVENT(sname); \
|
|
|
|
SYSCALL_TRACE_EXIT_EVENT(sname); \
|
2010-04-22 22:35:55 +08:00
|
|
|
static struct syscall_metadata __used \
|
2009-03-13 22:42:11 +08:00
|
|
|
__syscall_meta_##sname = { \
|
|
|
|
.name = "sys"#sname, \
|
2011-02-03 11:27:20 +08:00
|
|
|
.syscall_nr = -1, /* Filled in at boot */ \
|
2009-03-13 22:42:11 +08:00
|
|
|
.nb_args = nb, \
|
2013-03-06 04:36:40 +08:00
|
|
|
.types = nb ? types_##sname : NULL, \
|
|
|
|
.args = nb ? args_##sname : NULL, \
|
2009-08-19 15:54:51 +08:00
|
|
|
.enter_event = &event_enter_##sname, \
|
|
|
|
.exit_event = &event_exit_##sname, \
|
2010-04-22 22:35:55 +08:00
|
|
|
.enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
|
tracing: Replace syscall_meta_data struct array with pointer array
Currently the syscall_meta structures for the syscall tracepoints are
placed in the __syscall_metadata section, and at link time, the linker
makes one large array of all these syscall metadata structures. On boot
up, this array is read (much like the initcall sections) and the syscall
data is processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are supposed to be in an array.
A hack was previously used to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the __syscall_metadata section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The __syscall_metadata section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-03 06:06:09 +08:00
|
|
|
}; \
|
|
|
|
static struct syscall_metadata __used \
|
2020-10-22 10:36:07 +08:00
|
|
|
__section("__syscalls_metadata") \
|
2011-02-03 06:06:09 +08:00
|
|
|
*__p_syscall_meta_##sname = &__syscall_meta_##sname;
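The pointer-array layout described in the commit message above can be sketched in user space. This is a minimal sketch, not the kernel's implementation: `struct demo_meta`, `DEMO_METADATA`, the `demo_meta_ptrs` section name and both helper functions are hypothetical, and it assumes GCC or Clang targeting ELF with a linker (GNU ld or lld) that provides `__start_<section>` and `__stop_<section>` symbols for sections whose names are valid C identifiers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct syscall_metadata. */
struct demo_meta {
	const char *name;
	int nb_args;
};

/*
 * Define one metadata structure plus a pointer to it. Only the pointer
 * goes into the named section, so the section is a dense array of
 * pointers no matter how the compiler aligns the structures themselves.
 */
#define DEMO_METADATA(sname, nb)					\
	static struct demo_meta demo_meta_##sname = {			\
		.name = "sys_" #sname,					\
		.nb_args = (nb),					\
	};								\
	static struct demo_meta *demo_p_meta_##sname			\
		__attribute__((used, section("demo_meta_ptrs"))) =	\
		&demo_meta_##sname

DEMO_METADATA(read, 3);
DEMO_METADATA(getpid, 0);

/* Section bounds supplied by the linker (an assumption, see above). */
extern struct demo_meta *__start_demo_meta_ptrs[];
extern struct demo_meta *__stop_demo_meta_ptrs[];

/* Number of pointers the linker collected into the section. */
size_t demo_meta_count(void)
{
	return (size_t)(__stop_demo_meta_ptrs - __start_demo_meta_ptrs);
}

/* Walk the pointer array and return the entry with a matching name. */
const struct demo_meta *demo_meta_find(const char *name)
{
	struct demo_meta **p;

	for (p = __start_demo_meta_ptrs; p < __stop_demo_meta_ptrs; p++)
		if (strcmp((*p)->name, name) == 0)
			return *p;
	return NULL;
}
```

Because every element in the section is a plain pointer, the stride between entries is `sizeof(void *)` on every architecture, which is the property the commit relies on.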
|
2017-08-05 07:00:09 +08:00
|
|
|
|
|
|
|
static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
|
|
|
|
{
|
|
|
|
return tp_event->class == &event_class_syscall_enter ||
|
|
|
|
tp_event->class == &event_class_syscall_exit;
|
|
|
|
}
|
|
|
|
|
2013-03-06 04:36:40 +08:00
|
|
|
#else
|
|
|
|
#define SYSCALL_METADATA(sname, nb, ...)
|
2017-08-05 07:00:09 +08:00
|
|
|
|
|
|
|
static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2013-03-06 04:36:40 +08:00
|
|
|
#endif
|
2009-03-13 22:42:11 +08:00
|
|
|
|
2018-04-05 17:53:01 +08:00
|
|
|
#ifndef SYSCALL_DEFINE0
|
2009-03-13 22:42:11 +08:00
|
|
|
#define SYSCALL_DEFINE0(sname) \
|
2013-03-06 04:36:40 +08:00
|
|
|
SYSCALL_METADATA(_##sname, 0); \
|
2018-03-22 09:59:08 +08:00
|
|
|
asmlinkage long sys_##sname(void); \
|
|
|
|
ALLOW_ERROR_INJECTION(sys_##sname, ERRNO); \
|
2009-03-13 22:42:11 +08:00
|
|
|
asmlinkage long sys_##sname(void)
|
2018-04-05 17:53:01 +08:00
|
|
|
#endif /* SYSCALL_DEFINE0 */
|
2009-03-13 22:42:11 +08:00
|
|
|
|
2009-02-12 05:04:38 +08:00
|
|
|
#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
|
|
|
|
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
|
|
|
|
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
|
|
|
|
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
|
|
|
|
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
|
|
|
|
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)
|
2009-01-14 21:13:59 +08:00
|
|
|
|
2017-09-08 09:36:15 +08:00
|
|
|
#define SYSCALL_DEFINE_MAXARGS 6
|
|
|
|
|
2009-03-13 22:42:11 +08:00
|
|
|
#define SYSCALL_DEFINEx(x, sname, ...) \
|
2013-03-06 04:36:40 +08:00
|
|
|
SYSCALL_METADATA(sname, x, __VA_ARGS__) \
|
2009-03-13 22:42:11 +08:00
|
|
|
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
|
|
|
|
|
2013-01-22 04:25:54 +08:00
|
|
|
#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
|
2018-04-05 17:53:01 +08:00
|
|
|
|
2018-04-09 18:51:42 +08:00
|
|
|
/*
|
|
|
|
* The asmlinkage stub is aliased to a function named __se_sys_*() which
|
|
|
|
* sign-extends 32-bit ints to longs whenever needed. The actual work is
|
|
|
|
* done within __do_sys_*().
|
|
|
|
*/
|
2018-04-05 17:53:01 +08:00
|
|
|
#ifndef __SYSCALL_DEFINEx
|
2009-03-13 22:42:11 +08:00
|
|
|
#define __SYSCALL_DEFINEx(x, name, ...) \
|
disable -Wattribute-alias warning for SYSCALL_DEFINEx()
gcc-8 warns for every single definition of a system call entry
point, e.g.:
include/linux/compat.h:56:18: error: 'compat_sys_rt_sigprocmask' alias between functions of incompatible types 'long int(int, compat_sigset_t *, compat_sigset_t *, compat_size_t)' {aka 'long int(int, struct <anonymous> *, struct <anonymous> *, unsigned int)'} and 'long int(long int, long int, long int, long int)' [-Werror=attribute-alias]
asmlinkage long compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))\
^~~~~~~~~~
include/linux/compat.h:45:2: note: in expansion of macro 'COMPAT_SYSCALL_DEFINEx'
COMPAT_SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
^~~~~~~~~~~~~~~~~~~~~~
kernel/signal.c:2601:1: note: in expansion of macro 'COMPAT_SYSCALL_DEFINE4'
COMPAT_SYSCALL_DEFINE4(rt_sigprocmask, int, how, compat_sigset_t __user *, nset,
^~~~~~~~~~~~~~~~~~~~~~
include/linux/compat.h:60:18: note: aliased declaration here
asmlinkage long compat_SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))\
^~~~~~~~~~
The new warning seems reasonable in principle, but it doesn't
help us here, since we rely on the type mismatch to sanitize the
system call arguments. After I reported this as GCC PR82435, a new
-Wno-attribute-alias option was added that could be used to turn the
warning off globally on the command line, but I'd prefer to do it a
little more fine-grained.
Interestingly, turning a warning off and on again inside of
a single macro doesn't always work. In this case I had to add
an extra statement in between and decided to copy the __SC_TEST
one from the native syscall to the compat syscall macro. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83256 for more details
about this.
[paul.burton@mips.com:
- Rebase atop current master.
- Split GCC & version arguments to __diag_ignore() in order to match
changes to the preceding patch.
- Add the comment argument to match the preceding patch.]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82435
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-06-20 04:14:57 +08:00
|
|
|
__diag_push(); \
|
|
|
|
__diag_ignore(GCC, 8, "-Wattribute-alias", \
|
|
|
|
"Type aliasing is used to sanitize syscall arguments");\
|
2013-11-13 07:08:36 +08:00
|
|
|
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
|
2018-04-09 18:51:42 +08:00
|
|
|
__attribute__((alias(__stringify(__se_sys##name)))); \
|
2018-03-22 09:59:08 +08:00
|
|
|
ALLOW_ERROR_INJECTION(sys##name, ERRNO); \
|
2018-04-09 18:51:42 +08:00
|
|
|
static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\
|
|
|
|
asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
|
|
|
|
asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
|
2009-01-14 21:13:59 +08:00
|
|
|
{ \
|
2018-04-09 18:51:42 +08:00
|
|
|
long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));\
|
2013-01-22 04:03:44 +08:00
|
|
|
__MAP(x,__SC_TEST,__VA_ARGS__); \
|
2013-01-22 04:25:54 +08:00
|
|
|
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
|
|
|
|
return ret; \
|
2009-01-14 21:13:59 +08:00
|
|
|
} \
|
2018-06-20 04:14:57 +08:00
|
|
|
__diag_pop(); \
|
2018-04-09 18:51:42 +08:00
|
|
|
static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
|
2018-04-05 17:53:01 +08:00
|
|
|
#endif /* __SYSCALL_DEFINEx */
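The split between the `__se_sys_*()` stub and the `__do_sys_*()` body can be illustrated outside the kernel. The sketch below is a simplified assumption of the pattern, not the macro expansion itself: the `demo_` names are hypothetical, the alias attribute and error-injection hooks are left out, and the casts stand in for what `__SC_CAST` does in the real macro.

```c
/*
 * Inner function with the real, typed signature. It plays the role of
 * __do_sys_*() and is the only place where the work is done.
 */
static inline long demo_do_sys_add(int a, unsigned int b);

/*
 * Outer stub. It receives every argument as a long, as a syscall entry
 * point would, and casts each one back to its declared type. Any stale
 * upper bits in a 32-bit argument are discarded by the cast, and the
 * callee then sees a correctly sign-extended value.
 */
long demo_se_sys_add(long a, long b)
{
	return demo_do_sys_add((int)a, (unsigned int)b);
}

static inline long demo_do_sys_add(int a, unsigned int b)
{
	return (long)a + (long)b;
}
```

Passing the zero-extended 32-bit pattern of -5 as the first argument still reaches the inner function as -5, because the narrowing cast happens in the stub rather than being trusted from the caller.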
|
2009-01-14 21:13:59 +08:00
|
|
|
|
2020-12-01 06:30:59 +08:00
|
|
|
/* For split 64-bit arguments on 32-bit architectures */
|
|
|
|
#ifdef __LITTLE_ENDIAN
|
|
|
|
#define SC_ARG64(name) u32, name##_lo, u32, name##_hi
|
|
|
|
#else
|
|
|
|
#define SC_ARG64(name) u32, name##_hi, u32, name##_lo
|
|
|
|
#endif
|
|
|
|
#define SC_VAL64(type, name) ((type) name##_hi << 32 | name##_lo)
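A short user-space sketch shows how the two 32-bit halves named by `SC_ARG64()` are recombined by `SC_VAL64()`. The `DEMO_VAL64` macro mirrors the definition above; the `demo_` functions and the choice of `uint64_t` as the target type are assumptions made for illustration.

```c
#include <stdint.h>

/* Same shape as SC_VAL64(): widen the high half, shift, OR in the low half. */
#define DEMO_VAL64(type, name) ((type)name##_hi << 32 | name##_lo)

/* Rebuild a 64-bit value from the halves a 32-bit caller would pass. */
uint64_t demo_join_offset(uint32_t offset_lo, uint32_t offset_hi)
{
	return DEMO_VAL64(uint64_t, offset);
}

/* Inverse operation: split a 64-bit value into its low and high halves. */
void demo_split_offset(uint64_t value, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)value;
	*hi = (uint32_t)(value >> 32);
}
```

The cast must be applied to the high half before the shift. Shifting a 32-bit operand by 32 bits would be undefined, which is why the macro widens first.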
|
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
2022-06-07 04:38:01 +08:00
|
|
|
#define SYSCALL32_DEFINE0 COMPAT_SYSCALL_DEFINE0
|
2020-12-01 06:30:59 +08:00
|
|
|
#define SYSCALL32_DEFINE1 COMPAT_SYSCALL_DEFINE1
|
|
|
|
#define SYSCALL32_DEFINE2 COMPAT_SYSCALL_DEFINE2
|
|
|
|
#define SYSCALL32_DEFINE3 COMPAT_SYSCALL_DEFINE3
|
|
|
|
#define SYSCALL32_DEFINE4 COMPAT_SYSCALL_DEFINE4
|
|
|
|
#define SYSCALL32_DEFINE5 COMPAT_SYSCALL_DEFINE5
|
|
|
|
#define SYSCALL32_DEFINE6 COMPAT_SYSCALL_DEFINE6
|
|
|
|
#else
|
2022-06-07 04:38:01 +08:00
|
|
|
#define SYSCALL32_DEFINE0 SYSCALL_DEFINE0
|
2020-12-01 06:30:59 +08:00
|
|
|
#define SYSCALL32_DEFINE1 SYSCALL_DEFINE1
|
|
|
|
#define SYSCALL32_DEFINE2 SYSCALL_DEFINE2
|
|
|
|
#define SYSCALL32_DEFINE3 SYSCALL_DEFINE3
|
|
|
|
#define SYSCALL32_DEFINE4 SYSCALL_DEFINE4
|
|
|
|
#define SYSCALL32_DEFINE5 SYSCALL_DEFINE5
|
|
|
|
#define SYSCALL32_DEFINE6 SYSCALL_DEFINE6
|
|
|
|
#endif
|
|
|
|
|
2017-06-15 09:12:01 +08:00
|
|
|
/*
|
|
|
|
* Called before coming back to user-mode. Returning to user-mode with an
|
|
|
|
* address limit other than USER_DS can allow kernel memory to be overwritten.
|
|
|
|
*/
|
|
|
|
static inline void addr_limit_user_check(void)
|
|
|
|
{
|
2017-09-07 23:30:44 +08:00
|
|
|
#ifdef TIF_FSCHECK
|
2017-06-15 09:12:01 +08:00
|
|
|
if (!test_thread_flag(TIF_FSCHECK))
|
|
|
|
return;
|
2017-09-07 23:30:44 +08:00
|
|
|
#endif
|
2017-06-15 09:12:01 +08:00
|
|
|
|
2017-09-07 23:30:44 +08:00
|
|
|
#ifdef TIF_FSCHECK
|
2017-06-15 09:12:01 +08:00
|
|
|
clear_thread_flag(TIF_FSCHECK);
|
|
|
|
#endif
|
2017-09-07 23:30:44 +08:00
|
|
|
}
|
2017-06-15 09:12:01 +08:00
|
|
|
|
2018-03-26 03:50:11 +08:00
|
|
|
/*
|
|
|
|
* These syscall function prototypes are kept in the same order as
|
|
|
|
* include/uapi/asm-generic/unistd.h. Architecture specific entries go below,
|
|
|
|
* followed by deprecated or obsolete system calls.
|
|
|
|
*
|
|
|
|
* Please note that these prototypes here are only provided for information
|
|
|
|
* purposes, for static analysis, and for linking from the syscall table.
|
|
|
|
* These functions should not be called elsewhere from kernel code.
|
2018-04-05 17:53:01 +08:00
|
|
|
*
|
|
|
|
* As the syscall calling convention may be different from the default
|
|
|
|
* for architectures overriding the syscall calling convention, do not
|
|
|
|
* include the prototypes if CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.
|
2018-03-26 03:50:11 +08:00
|
|
|
*/
|
2018-04-05 17:53:01 +08:00
|
|
|
#ifndef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t __user *ctx);
|
|
|
|
asmlinkage long sys_io_destroy(aio_context_t ctx);
|
|
|
|
asmlinkage long sys_io_submit(aio_context_t, long,
|
|
|
|
struct iocb __user * __user *);
|
|
|
|
asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,
|
|
|
|
struct io_event __user *result);
|
|
|
|
asmlinkage long sys_io_getevents(aio_context_t ctx_id,
|
|
|
|
long min_nr,
|
|
|
|
long nr,
|
|
|
|
struct io_event __user *events,
|
2018-09-20 12:41:08 +08:00
|
|
|
struct __kernel_timespec __user *timeout);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_io_getevents_time32(__u32 ctx_id,
|
|
|
|
__s32 min_nr,
|
|
|
|
__s32 nr,
|
|
|
|
struct io_event __user *events,
|
|
|
|
struct old_timespec32 __user *timeout);
|
aio: implement io_pgetevents
This is the io_getevents equivalent of ppoll/pselect. It allows signals
and aio completions to be mixed properly (especially with IOCB_CMD_POLL)
and atomically executes the following sequence:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ret = io_getevents(ctx, min_nr, nr, events, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
Note that unlike many other signal related calls we do not pass a sigmask
size, as that would get us to 7 arguments, which aren't easily supported
by the syscall infrastructure. It seems a lot less painful to just add a
new syscall variant in the unlikely case we're going to increase the
sigset size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-03 01:51:00 +08:00
|
|
|
asmlinkage long sys_io_pgetevents(aio_context_t ctx_id,
|
|
|
|
long min_nr,
|
|
|
|
long nr,
|
|
|
|
struct io_event __user *events,
|
2018-09-20 12:41:08 +08:00
|
|
|
struct __kernel_timespec __user *timeout,
|
|
|
|
const struct __aio_sigset *sig);
|
|
|
|
asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
|
|
|
|
long min_nr,
|
|
|
|
long nr,
|
|
|
|
struct io_event __user *events,
|
|
|
|
struct old_timespec32 __user *timeout,
|
2018-05-03 01:51:00 +08:00
|
|
|
const struct __aio_sigset *sig);
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
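The indirection described above, where the SQ ring holds indices into the io_uring_sqe array rather than the entries themselves, can be modelled with a toy single-threaded ring. This is a sketch under stated assumptions: every `demo_` name is hypothetical, the ring lives in ordinary process memory rather than a shared mapping, and the memory barriers a real producer and consumer would need are omitted.

```c
#include <stdint.h>

#define DEMO_ENTRIES 4u	/* must be a power of two for the mask below */

struct demo_sqe {
	uint64_t user_data;
};

struct demo_ring {
	struct demo_sqe sqes[DEMO_ENTRIES];	/* submission entries */
	uint32_t sq_array[DEMO_ENTRIES];	/* ring of indices into sqes[] */
	uint32_t head;				/* consumer position */
	uint32_t tail;				/* producer position */
};

/* Producer side: publish the index of an already filled-in sqe. */
int demo_sq_submit(struct demo_ring *r, uint32_t sqe_index)
{
	if (sqe_index >= DEMO_ENTRIES)
		return -1;			/* no such entry */
	if (r->tail - r->head == DEMO_ENTRIES)
		return -1;			/* ring is full */
	r->sq_array[r->tail & (DEMO_ENTRIES - 1)] = sqe_index;
	r->tail++;
	return 0;
}

/* Consumer side: follow the next index to the entry it names. */
int demo_sq_consume(struct demo_ring *r, uint64_t *user_data)
{
	uint32_t idx;

	if (r->head == r->tail)
		return -1;			/* ring is empty */
	idx = r->sq_array[r->head & (DEMO_ENTRIES - 1)];
	*user_data = r->sqes[idx].user_data;
	r->head++;
	return 0;
}
```

Submitting entry 3 and then entry 1 is consumed in that order even though the entries are not adjacent in `sqes[]`, which is the batching property the commit message points out.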
2019-01-08 01:46:33 +08:00
|
|
|
asmlinkage long sys_io_uring_setup(u32 entries,
|
|
|
|
struct io_uring_params __user *p);
|
|
|
|
asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
|
|
|
|
u32 min_complete, u32 flags,
|
2020-11-03 10:54:37 +08:00
|
|
|
const void __user *argp, size_t argsz);
|
io_uring: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we
set up the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having set up an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to set up a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
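The rule that an IO range must fall inside the registered buffer can be expressed as an overflow-safe containment check. This is a sketch only: `struct demo_fixed_buf` and `demo_range_in_buf()` are hypothetical names, and the commit message does not show how the kernel itself performs the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one registered buffer. */
struct demo_fixed_buf {
	uint64_t base;	/* start address of the mapped region */
	uint64_t len;	/* length of the mapped region in bytes */
};

/*
 * Return true when [addr, addr + len) lies entirely inside the buffer.
 * The comparison is done on offsets so that addr + len cannot wrap.
 */
bool demo_range_in_buf(const struct demo_fixed_buf *buf,
		       uint64_t addr, uint64_t len)
{
	uint64_t off;

	if (addr < buf->base)
		return false;
	off = addr - buf->base;
	if (off > buf->len)
		return false;
	return len <= buf->len - off;
}
```

Using a sub-range of a larger registration, as the commit message allows, passes this check as long as the range does not run past the end of the buffer.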
2019-01-10 00:16:05 +08:00
|
|
|
asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
|
|
|
|
void __user *arg, unsigned int nr_args);
|
2008-04-29 15:59:41 +08:00
|
|
|
asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
|
|
|
|
const void __user *value, size_t size, int flags);
|
|
|
|
asmlinkage long sys_lsetxattr(const char __user *path, const char __user *name,
|
|
|
|
const void __user *value, size_t size, int flags);
|
|
|
|
asmlinkage long sys_fsetxattr(int fd, const char __user *name,
|
|
|
|
const void __user *value, size_t size, int flags);
|
2009-01-14 21:13:54 +08:00
|
|
|
asmlinkage long sys_getxattr(const char __user *path, const char __user *name,
|
|
|
|
void __user *value, size_t size);
|
|
|
|
asmlinkage long sys_lgetxattr(const char __user *path, const char __user *name,
|
|
|
|
void __user *value, size_t size);
|
|
|
|
asmlinkage long sys_fgetxattr(int fd, const char __user *name,
|
|
|
|
void __user *value, size_t size);
|
|
|
|
asmlinkage long sys_listxattr(const char __user *path, char __user *list,
|
|
|
|
size_t size);
|
|
|
|
asmlinkage long sys_llistxattr(const char __user *path, char __user *list,
|
|
|
|
size_t size);
|
|
|
|
asmlinkage long sys_flistxattr(int fd, char __user *list, size_t size);
|
2008-04-29 15:59:41 +08:00
|
|
|
asmlinkage long sys_removexattr(const char __user *path,
|
|
|
|
const char __user *name);
|
|
|
|
asmlinkage long sys_lremovexattr(const char __user *path,
|
|
|
|
const char __user *name);
|
|
|
|
asmlinkage long sys_fremovexattr(int fd, const char __user *name);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
|
|
|
|
asmlinkage long sys_lookup_dcookie(u64 cookie64, char __user *buf, size_t len);
|
|
|
|
asmlinkage long sys_eventfd2(unsigned int count, int flags);
|
|
|
|
asmlinkage long sys_epoll_create1(int flags);
|
|
|
|
asmlinkage long sys_epoll_ctl(int epfd, int op, int fd,
|
|
|
|
struct epoll_event __user *event);
|
|
|
|
asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
|
|
|
|
int maxevents, int timeout,
|
|
|
|
const sigset_t __user *sigmask,
|
|
|
|
size_t sigsetsize);
|
2020-12-19 06:05:41 +08:00
|
|
|
asmlinkage long sys_epoll_pwait2(int epfd, struct epoll_event __user *events,
|
|
|
|
int maxevents,
|
|
|
|
const struct __kernel_timespec __user *timeout,
|
|
|
|
const sigset_t __user *sigmask,
|
|
|
|
size_t sigsetsize);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_dup(unsigned int fildes);
|
|
|
|
asmlinkage long sys_dup3(unsigned int oldfd, unsigned int newfd, int flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_fcntl(unsigned int fd, unsigned int cmd, unsigned long arg);
|
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
asmlinkage long sys_fcntl64(unsigned int fd,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
|
|
|
#endif
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_inotify_init1(int flags);
|
|
|
|
asmlinkage long sys_inotify_add_watch(int fd, const char __user *path,
|
|
|
|
u32 mask);
|
|
|
|
asmlinkage long sys_inotify_rm_watch(int fd, __s32 wd);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd,
|
|
|
|
unsigned long arg);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_ioprio_set(int which, int who, int ioprio);
|
|
|
|
asmlinkage long sys_ioprio_get(int which, int who);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_flock(unsigned int fd, unsigned int cmd);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_mknodat(int dfd, const char __user * filename, umode_t mode,
|
|
|
|
unsigned dev);
|
|
|
|
asmlinkage long sys_mkdirat(int dfd, const char __user * pathname, umode_t mode);
|
|
|
|
asmlinkage long sys_unlinkat(int dfd, const char __user * pathname, int flag);
|
|
|
|
asmlinkage long sys_symlinkat(const char __user * oldname,
|
|
|
|
int newdfd, const char __user * newname);
|
|
|
|
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
|
|
|
|
int newdfd, const char __user *newname, int flags);
|
|
|
|
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
|
|
|
|
int newdfd, const char __user * newname);
|
|
|
|
asmlinkage long sys_umount(char __user *name, int flags);
|
|
|
|
asmlinkage long sys_mount(char __user *dev_name, char __user *dir_name,
|
|
|
|
char __user *type, unsigned long flags,
|
|
|
|
void __user *data);
|
|
|
|
asmlinkage long sys_pivot_root(const char __user *new_root,
|
|
|
|
const char __user *put_old);
|
|
|
|
asmlinkage long sys_statfs(const char __user * path,
|
|
|
|
struct statfs __user *buf);
|
|
|
|
asmlinkage long sys_statfs64(const char __user *path, size_t sz,
|
|
|
|
struct statfs64 __user *buf);
|
|
|
|
asmlinkage long sys_fstatfs(unsigned int fd, struct statfs __user *buf);
|
|
|
|
asmlinkage long sys_fstatfs64(unsigned int fd, size_t sz,
|
|
|
|
struct statfs64 __user *buf);
|
|
|
|
asmlinkage long sys_truncate(const char __user *path, long length);
|
|
|
|
asmlinkage long sys_ftruncate(unsigned int fd, unsigned long length);
|
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
asmlinkage long sys_truncate64(const char __user *path, loff_t length);
|
|
|
|
asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length);
|
|
|
|
#endif
|
|
|
|
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
|
|
|
|
asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode);
|
2020-05-14 22:44:25 +08:00
|
|
|
asmlinkage long sys_faccessat2(int dfd, const char __user *filename, int mode,
|
|
|
|
int flags);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_chdir(const char __user *filename);
|
|
|
|
asmlinkage long sys_fchdir(unsigned int fd);
|
|
|
|
asmlinkage long sys_chroot(const char __user *filename);
|
|
|
|
asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
|
|
|
|
asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
|
|
|
|
umode_t mode);
|
|
|
|
asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
|
|
|
|
gid_t group, int flag);
|
|
|
|
asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
|
|
|
|
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
|
|
|
|
umode_t mode);
|
open: introduce openat2(2) syscall
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).
/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
/* Description. */
The initial version of 'struct open_how' contains the following fields:
flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64-bits wide to
allow for more O_ flags than currently permitted with openat(2).
mode
The file mode for O_CREAT or O_TMPFILE.
Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
resolve
Restrict path resolution (in contrast to O_* flags, these affect all
path components). The current set of flags is as follows (at the
moment, all of the RESOLVE_ flags are implemented as just passing
the corresponding LOOKUP_ flag).
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.
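As an illustration of the fields described above, here is a minimal userspace sketch that calls the syscall through syscall(2). Several things are assumptions rather than part of this patch description: the local struct name open_how_sketch, the fallback syscall number 437, and the value 0x08 for RESOLVE_BENEATH. Real programs would normally take these definitions from the kernel's UAPI headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437			/* assumption: generic syscall number */
#endif

#define RESOLVE_BENEATH_SKETCH 0x08	/* assumption: value of RESOLVE_BENEATH */

/* Local mirror of the initial layout: three u64 fields, no padding. */
struct open_how_sketch {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Returns a new file descriptor, or a negative errno value on failure. */
static int openat2_sketch(int dfd, const char *path, uint64_t flags,
			  uint64_t mode, uint64_t resolve)
{
	struct open_how_sketch how;
	long fd;

	assert(path != NULL);

	/* Zero everything first so unset fields keep their no-op default. */
	memset(&how, 0, sizeof(how));
	how.flags = flags;
	how.mode = mode;	/* must stay 0 without O_CREAT or O_TMPFILE */
	how.resolve = resolve;

	fd = syscall(SYS_openat2, dfd, path, &how, sizeof(how));
	return fd < 0 ? -errno : (int)fd;
}
```

A caller holding a directory descriptor could then use something like openat2_sketch(dirfd, "some/file", O_RDONLY, 0, RESOLVE_BENEATH_SKETCH) and treat -EINVAL as "this kernel rejected a flag or field" instead of having it silently ignored.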
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).
After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.
In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).
Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.
Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-18 20:07:59 +08:00
|
|
|
asmlinkage long sys_openat2(int dfd, const char __user *filename,
|
|
|
|
struct open_how *how, size_t size);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_close(unsigned int fd);
|
open: add close_range()
This adds the close_range() syscall. It allows a task to efficiently close
a range of file descriptors, up to and including all of its file descriptors.
I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle
was fast enough to merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627
https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
once it's merged in Linux too. Python is in the process of switching to
close_range() on FreeBSD and they are waiting on us to merge this to switch on
Linux as well: https://bugs.python.org/issue38061
The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.
First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. daemons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)
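The pattern above can be wrapped in a small helper. This is a sketch under stated assumptions, not a definitive implementation: close_fd_range() is a hypothetical name, the fallback syscall number 436 is assumed, and the fallback loop (used when the syscall is unavailable) is simply the slow approach this syscall is meant to replace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_close_range
#define SYS_close_range 436	/* assumption: generic syscall number */
#endif

/*
 * Close every descriptor in [lo, hi]. Returns 0 on success or a
 * negative errno value.
 */
static int close_fd_range(unsigned int lo, unsigned int hi)
{
	unsigned long fd, last = hi;
	long limit;

	if (syscall(SYS_close_range, lo, hi, 0U) == 0)
		return 0;

	/* ENOSYS: kernel lacks the syscall. EPERM: some sandboxes filter it. */
	if (errno != ENOSYS && errno != EPERM)
		return -errno;

	/* Fallback: plain close() loop, capped at the descriptor limit. */
	limit = sysconf(_SC_OPEN_MAX);
	if (limit <= 0)
		limit = 1024;	/* assumption: conservative default */
	if (last >= (unsigned long)limit)
		last = (unsigned long)limit - 1;
	for (fd = lo; fd <= last; fd++)
		close((int)fd);	/* errors are ignored, as with close_range() */
	return 0;
}
```

Before an execve(), a caller would use close_fd_range(3, ~0U), preceded by unshare(CLONE_FILES) only when the descriptor table is shared with other threads.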
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).
The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:
static int close_all_fds(void)
{
	int dir_fd;
	DIR *dir;
	struct dirent *direntp;
	dir = opendir("/proc/self/fd");
	if (!dir)
		return -1;
	dir_fd = dirfd(dir);
	while ((direntp = readdir(dir))) {
		int fd;
		if (strcmp(direntp->d_name, ".") == 0)
			continue;
		if (strcmp(direntp->d_name, "..") == 0)
			continue;
		fd = atoi(direntp->d_name);
		if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
			continue;
		close(fd);
	}
	closedir(dir);
	return 0;
}
to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us
2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us
close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extensions. One can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.
From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.
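The loop shape argued for above can be modelled with a toy table in userspace. This is a sketch only: struct toy_fdtable, toy_close_fd() and toy_close_range() are hypothetical stand-ins, no locking or RCU is modelled, and rejecting fd > max_fd with EINVAL is an assumption about the intended behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_FDS 64

/* Toy stand-in for the fd table of the calling task. */
struct toy_fdtable {
	unsigned int max_fds;		/* number of valid slots */
	bool open[TOY_MAX_FDS];
};

/* Stand-in for __close_fd(): each call looks the slot up afresh. */
static int toy_close_fd(struct toy_fdtable *t, unsigned int fd)
{
	if (fd >= t->max_fds || !t->open[fd])
		return -EBADF;
	t->open[fd] = false;
	return 0;
}

static int toy_close_range(struct toy_fdtable *t, unsigned int fd,
			   unsigned int max_fd)
{
	unsigned int cur_max;

	assert(t->max_fds <= TOY_MAX_FDS);
	if (fd > max_fd)
		return -EINVAL;
	if (t->max_fds == 0)
		return 0;

	/* Read the table size once and cap max_fd to the last valid index. */
	cur_max = t->max_fds - 1;
	if (max_fd > cur_max)
		max_fd = cur_max;

	/* Errors from individual closes are ignored. */
	while (fd <= max_fd)
		toy_close_fd(t, fd++);
	return 0;
}
```

The point of the model is that the only state shared across iterations is the capped max_fd; every individual close re-reads the table.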
/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
Note, in a recent patch series Florian made grantpt() a nop thereby
removing the code referenced here.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Tasks - especially non-daemons - rarely hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-24 17:30:34 +08:00
|
|
|
asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
|
|
|
|
unsigned int flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_vhangup(void);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_pipe2(int __user *fildes, int flags);
|
|
|
|
asmlinkage long sys_quotactl(unsigned int cmd, const char __user *special,
|
|
|
|
qid_t id, void __user *addr);
|
2021-05-25 22:07:48 +08:00
|
|
|
asmlinkage long sys_quotactl_fd(unsigned int fd, unsigned int cmd, qid_t id,
|
|
|
|
void __user *addr);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_getdents64(unsigned int fd,
|
|
|
|
struct linux_dirent64 __user *dirent,
|
|
|
|
unsigned int count);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_llseek(unsigned int fd, unsigned long offset_high,
|
|
|
|
unsigned long offset_low, loff_t __user *result,
|
2012-12-18 07:59:39 +08:00
|
|
|
unsigned int whence);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_lseek(unsigned int fd, off_t offset,
|
|
|
|
unsigned int whence);
|
2009-01-14 21:13:54 +08:00
|
|
|
asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_write(unsigned int fd, const char __user *buf,
|
|
|
|
size_t count);
|
2009-01-14 21:13:54 +08:00
|
|
|
asmlinkage long sys_readv(unsigned long fd,
|
|
|
|
const struct iovec __user *vec,
|
|
|
|
unsigned long vlen);
|
|
|
|
asmlinkage long sys_writev(unsigned long fd,
|
|
|
|
const struct iovec __user *vec,
|
|
|
|
unsigned long vlen);
|
|
|
|
asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
|
|
|
|
size_t count, loff_t pos);
|
|
|
|
asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
|
|
|
|
size_t count, loff_t pos);
|
2009-04-03 07:59:23 +08:00
|
|
|
asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
|
Make non-compat preadv/pwritev use native register size
Instead of always splitting the file offset into 32-bit 'high' and 'low'
parts, just split them into the largest natural word-size - which in C
terms is 'unsigned long'.
This allows 64-bit architectures to avoid the unnecessary 32-bit
shifting and masking for native format (while the compat interfaces will
obviously always have to do it).
This also changes the order of 'high' and 'low' to be "low first". Why?
Because when we have it like this, the 64-bit system calls now don't use
the "pos_high" argument at all, and it makes more sense for the native
system call to simply match the user-mode prototype.
This results in a much more natural calling convention, and allows the
compiler to generate much more straightforward code. On x86-64, we now
generate
testq %rcx, %rcx # pos_l
js .L122 #,
movq %rcx, -48(%rbp) # pos_l, pos
from the C source
loff_t pos = pos_from_hilo(pos_h, pos_l);
...
if (pos < 0)
return -EINVAL;
and the 'pos_h' register isn't even touched. It used to generate code
like
mov %r8d, %r8d # pos_low, pos_low
salq $32, %rcx #, tmp71
movq %r8, %rax # pos_low, pos.386
orq %rcx, %rax # tmp71, pos.386
js .L122 #,
movq %rax, -48(%rbp) # pos.386, pos
which isn't _that_ horrible, but it does show how the natural word size
is just a more sensible interface (same arguments will hold in the user
level glibc wrapper function, of course, so the kernel side is just half
of the equation!)
Note: in all cases the user code wrapper can again be the same. You can
just do
#define HALF_BITS (sizeof(unsigned long)*4)
__syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);
or something like that. That way the user mode wrapper will also be
nicely passing in a zero (it won't actually have to do the shifts, the
compiler will understand what is going on) for the last argument.
And that is a good idea, even if nobody will necessarily ever care: if
we ever do move to a 128-bit lloff_t, this particular system call might
be left alone. Of course, that will be the least of our worries if we
really ever need to care, so this may not be worth really caring about.
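The split and recombination described above can be sketched in portable userspace C. The helper names pos_from_hilo_sketch() and split_offset() are hypothetical, and unsigned arithmetic is used for the shifts so that the double shift is well defined on both 32-bit and 64-bit 'unsigned long'.

```c
#include <assert.h>
#include <stdint.h>

/* Half the bit width of 'unsigned long', as in the HALF_BITS macro above. */
#define HALF_BITS (sizeof(unsigned long) * 4)

static_assert(sizeof(uint64_t) >= sizeof(unsigned long),
	      "offset type must be at least as wide as a register");

/* Rebuild a 64-bit offset from two register-sized halves, low first. */
static int64_t pos_from_hilo_sketch(unsigned long high, unsigned long low)
{
	uint64_t pos = (uint64_t)high;

	/* Two half-width shifts: never shifts by the full type width at once. */
	pos = (pos << HALF_BITS) << HALF_BITS;
	return (int64_t)(pos | low);
}

/* What a user-mode wrapper would pass as the last two arguments. */
static void split_offset(int64_t offset, unsigned long *low,
			 unsigned long *high)
{
	uint64_t u = (uint64_t)offset;

	*low = (unsigned long)u;
	*high = (unsigned long)((u >> HALF_BITS) >> HALF_BITS);
}
```

On a 64-bit 'unsigned long' the high half always comes out as zero and is effectively unused; on a 32-bit one it carries the upper 32 bits of the offset.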
[ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03 23:03:22 +08:00
|
|
|
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
|
2009-04-03 07:59:23 +08:00
|
|
|
asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
|
2009-04-03 23:03:22 +08:00
|
|
|
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_sendfile64(int out_fd, int in_fd,
|
|
|
|
loff_t __user *offset, size_t count);
|
|
|
|
asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
|
2018-09-20 12:41:07 +08:00
|
|
|
fd_set __user *, struct __kernel_timespec __user *,
|
|
|
|
void __user *);
|
|
|
|
asmlinkage long sys_pselect6_time32(int, fd_set __user *, fd_set __user *,
|
|
|
|
fd_set __user *, struct old_timespec32 __user *,
|
2018-03-26 03:50:11 +08:00
|
|
|
void __user *);
|
|
|
|
asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
|
2018-09-20 12:41:06 +08:00
|
|
|
struct __kernel_timespec __user *, const sigset_t __user *,
|
|
|
|
size_t);
|
|
|
|
asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
|
|
|
|
struct old_timespec32 __user *, const sigset_t __user *,
|
2018-03-26 03:50:11 +08:00
|
|
|
size_t);
|
|
|
|
asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
|
|
|
|
asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
|
|
|
|
unsigned long nr_segs, unsigned int flags);
|
|
|
|
asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
|
|
|
|
int fd_out, loff_t __user *off_out,
|
|
|
|
size_t len, unsigned int flags);
|
|
|
|
asmlinkage long sys_tee(int fdin, int fdout, size_t len, unsigned int flags);
|
|
|
|
asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *buf,
|
|
|
|
int bufsiz);
|
|
|
|
asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
|
|
|
|
struct stat __user *statbuf, int flag);
|
|
|
|
asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
|
|
|
|
#if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
|
|
|
|
asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
|
|
|
|
asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
|
|
|
|
struct stat64 __user *statbuf, int flag);
|
|
|
|
#endif
|
|
|
|
asmlinkage long sys_sync(void);
|
|
|
|
asmlinkage long sys_fsync(unsigned int fd);
|
|
|
|
asmlinkage long sys_fdatasync(unsigned int fd);
|
|
|
|
asmlinkage long sys_sync_file_range2(int fd, unsigned int flags,
|
|
|
|
loff_t offset, loff_t nbytes);
|
|
|
|
asmlinkage long sys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
|
|
|
|
unsigned int flags);
|
|
|
|
asmlinkage long sys_timerfd_create(int clockid, int flags);
|
|
|
|
asmlinkage long sys_timerfd_settime(int ufd, int flags,
|
2018-06-17 13:11:44 +08:00
|
|
|
const struct __kernel_itimerspec __user *utmr,
|
|
|
|
struct __kernel_itimerspec __user *otmr);
|
|
|
|
asmlinkage long sys_timerfd_gettime(int ufd, struct __kernel_itimerspec __user *otmr);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_timerfd_gettime32(int ufd,
|
|
|
|
struct old_itimerspec32 __user *otmr);
|
|
|
|
asmlinkage long sys_timerfd_settime32(int ufd, int flags,
|
|
|
|
const struct old_itimerspec32 __user *utmr,
|
|
|
|
struct old_itimerspec32 __user *otmr);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_utimensat(int dfd, const char __user *filename,
|
2018-04-17 15:11:58 +08:00
|
|
|
struct __kernel_timespec __user *utimes,
|
|
|
|
int flags);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_utimensat_time32(unsigned int dfd,
|
|
|
|
const char __user *filename,
|
|
|
|
struct old_timespec32 __user *t, int flags);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_acct(const char __user *name);
|
|
|
|
asmlinkage long sys_capget(cap_user_header_t header,
|
|
|
|
cap_user_data_t dataptr);
|
|
|
|
asmlinkage long sys_capset(cap_user_header_t header,
|
|
|
|
const cap_user_data_t data);
|
|
|
|
asmlinkage long sys_personality(unsigned int personality);
|
|
|
|
asmlinkage long sys_exit(int error_code);
|
|
|
|
asmlinkage long sys_exit_group(int error_code);
|
|
|
|
asmlinkage long sys_waitid(int which, pid_t pid,
|
|
|
|
struct siginfo __user *infop,
|
|
|
|
int options, struct rusage __user *ru);
|
|
|
|
asmlinkage long sys_set_tid_address(int __user *tidptr);
|
|
|
|
asmlinkage long sys_unshare(unsigned long unshare_flags);
|
|
|
|
asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
|
2020-11-28 20:39:46 +08:00
|
|
|
const struct __kernel_timespec __user *utime,
|
|
|
|
u32 __user *uaddr2, u32 val3);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_futex_time32(u32 __user *uaddr, int op, u32 val,
|
2020-11-28 20:39:46 +08:00
|
|
|
const struct old_timespec32 __user *utime,
|
|
|
|
u32 __user *uaddr2, u32 val3);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_get_robust_list(int pid,
|
|
|
|
struct robust_list_head __user * __user *head_ptr,
|
|
|
|
size_t __user *len_ptr);
|
|
|
|
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
|
|
|
|
size_t len);
|
|
|
|
|
futex: Implement sys_futex_waitv()
Add support to wait on multiple futexes. This is the interface
implemented by this syscall:
futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
unsigned int flags, struct timespec *timeout, clockid_t clockid)
struct futex_waitv {
__u64 val;
__u64 uaddr;
__u32 flags;
__u32 __reserved;
};
Given an array of struct futex_waitv, wait on each uaddr. The thread
wakes if a futex_wake() is performed at any uaddr. The syscall returns
immediately if any waiter has *uaddr != val. *timeout is an optional
absolute timeout value for the operation. This syscall supports only
64bit sized timeout structs. The flags argument of the syscall should be
empty, but it can be used for future extensions. Flags for shared
futexes, sizes, etc. should be used on the individual flags of each
waiter.
__reserved is used for explicit padding and should be 0, but it might be
used for future extensions. If userspace uses 32-bit pointers, it should
explicitly cast them when assigning to waitv::uaddr.
Returns the array index of one of the woken futexes. No information is
given about how many were woken, or about any particular attribute of the
returned one (whether it was the first woken, whether it has the smallest
index, ...).
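A userspace sketch of filling one waiter and invoking the syscall follows. Several things here are assumptions and not part of this commit message: the local struct name futex_waitv_sketch (mirroring the layout shown above), the fallback syscall number 449, the value 2 for the 32-bit size flag, and a 'struct timespec' whose layout matches the 64-bit timeout the syscall expects.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef SYS_futex_waitv
#define SYS_futex_waitv 449	/* assumption: generic syscall number */
#endif

#define FUTEX_32_SKETCH 2	/* assumption: per-waiter "32-bit futex" flag */

/* Local mirror of the layout shown above. */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;	/* must be 0 */
};

static void waiter_init(struct futex_waitv_sketch *w, uint32_t *futex,
			uint32_t expected)
{
	assert(((uintptr_t)futex & 3) == 0);	/* futex words are 4-byte aligned */

	memset(w, 0, sizeof(*w));
	w->val = expected;
	/* Explicit cast so 32-bit pointers are zero-extended into the u64. */
	w->uaddr = (uint64_t)(uintptr_t)futex;
	w->flags = FUTEX_32_SKETCH;
}

/* Thin wrapper; the syscall-level flags argument is left empty. */
static long futex_waitv_call(struct futex_waitv_sketch *waiters,
			     unsigned int nr_futexes,
			     const struct timespec *timeout,
			     clockid_t clockid)
{
	return syscall(SYS_futex_waitv, waiters, nr_futexes, 0U,
		       timeout, clockid);
}
```

A caller would initialise one entry per futex with waiter_init() and treat a non-negative return value as the index of a woken entry, without inferring anything about the other entries.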
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210923171111.300673-17-andrealmeid@collabora.com
2021-09-24 01:11:05 +08:00
|
|
|
asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
|
|
|
|
unsigned int nr_futexes, unsigned int flags,
|
|
|
|
struct __kernel_timespec __user *timeout, clockid_t clockid);
|
2018-03-14 12:03:33 +08:00
|
|
|
asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
|
|
|
|
struct __kernel_timespec __user *rmtp);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_nanosleep_time32(struct old_timespec32 __user *rqtp,
|
|
|
|
struct old_timespec32 __user *rmtp);
|
2019-11-15 22:53:29 +08:00
|
|
|
asmlinkage long sys_getitimer(int which, struct __kernel_old_itimerval __user *value);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_setitimer(int which,
|
2019-11-15 22:53:29 +08:00
|
|
|
struct __kernel_old_itimerval __user *value,
|
|
|
|
struct __kernel_old_itimerval __user *ovalue);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
|
|
|
|
struct kexec_segment __user *segments,
|
|
|
|
unsigned long flags);
|
|
|
|
asmlinkage long sys_init_module(void __user *umod, unsigned long len,
|
|
|
|
const char __user *uargs);
|
|
|
|
asmlinkage long sys_delete_module(const char __user *name_user,
|
|
|
|
unsigned int flags);
|
|
|
|
asmlinkage long sys_timer_create(clockid_t which_clock,
|
|
|
|
struct sigevent __user *timer_event_spec,
|
|
|
|
timer_t __user * created_timer_id);
|
|
|
|
asmlinkage long sys_timer_gettime(timer_t timer_id,
|
2018-06-17 13:11:44 +08:00
|
|
|
struct __kernel_itimerspec __user *setting);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_timer_getoverrun(timer_t timer_id);
|
|
|
|
asmlinkage long sys_timer_settime(timer_t timer_id, int flags,
|
2018-06-17 13:11:44 +08:00
|
|
|
const struct __kernel_itimerspec __user *new_setting,
|
2019-01-02 00:34:39 +08:00
|
|
|
struct __kernel_itimerspec __user *old_setting);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_timer_delete(timer_t timer_id);
|
|
|
|
asmlinkage long sys_clock_settime(clockid_t which_clock,
|
2018-03-14 12:03:32 +08:00
|
|
|
const struct __kernel_timespec __user *tp);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_clock_gettime(clockid_t which_clock,
|
2018-03-14 12:03:32 +08:00
|
|
|
struct __kernel_timespec __user *tp);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_clock_getres(clockid_t which_clock,
|
2018-03-14 12:03:32 +08:00
|
|
|
struct __kernel_timespec __user *tp);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
|
2018-03-14 12:03:33 +08:00
|
|
|
const struct __kernel_timespec __user *rqtp,
|
|
|
|
struct __kernel_timespec __user *rmtp);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_timer_gettime32(timer_t timer_id,
|
|
|
|
struct old_itimerspec32 __user *setting);
|
|
|
|
asmlinkage long sys_timer_settime32(timer_t timer_id, int flags,
|
|
|
|
struct old_itimerspec32 __user *new,
|
|
|
|
struct old_itimerspec32 __user *old);
|
|
|
|
asmlinkage long sys_clock_settime32(clockid_t which_clock,
|
|
|
|
struct old_timespec32 __user *tp);
|
|
|
|
asmlinkage long sys_clock_gettime32(clockid_t which_clock,
|
|
|
|
struct old_timespec32 __user *tp);
|
|
|
|
asmlinkage long sys_clock_getres_time32(clockid_t which_clock,
|
|
|
|
struct old_timespec32 __user *tp);
|
|
|
|
asmlinkage long sys_clock_nanosleep_time32(clockid_t which_clock, int flags,
|
|
|
|
struct old_timespec32 __user *rqtp,
|
|
|
|
struct old_timespec32 __user *rmtp);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_syslog(int type, char __user *buf, int len);
|
|
|
|
asmlinkage long sys_ptrace(long request, long pid, unsigned long addr,
|
|
|
|
unsigned long data);
|
|
|
|
asmlinkage long sys_sched_setparam(pid_t pid,
|
|
|
|
struct sched_param __user *param);
|
|
|
|
asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
|
|
|
|
struct sched_param __user *param);
|
|
|
|
asmlinkage long sys_sched_getscheduler(pid_t pid);
|
|
|
|
asmlinkage long sys_sched_getparam(pid_t pid,
|
|
|
|
struct sched_param __user *param);
|
|
|
|
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
|
|
|
|
unsigned long __user *user_mask_ptr);
|
|
|
|
asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
|
|
|
|
unsigned long __user *user_mask_ptr);
|
|
|
|
asmlinkage long sys_sched_yield(void);
|
|
|
|
asmlinkage long sys_sched_get_priority_max(int policy);
|
|
|
|
asmlinkage long sys_sched_get_priority_min(int policy);
|
|
|
|
asmlinkage long sys_sched_rr_get_interval(pid_t pid,
|
2018-04-18 03:59:47 +08:00
|
|
|
struct __kernel_timespec __user *interval);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid,
|
|
|
|
struct old_timespec32 __user *interval);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_restart_syscall(void);
|
|
|
|
asmlinkage long sys_kill(pid_t pid, int sig);
|
|
|
|
asmlinkage long sys_tkill(pid_t pid, int sig);
|
|
|
|
asmlinkage long sys_tgkill(pid_t tgid, pid_t pid, int sig);
|
|
|
|
asmlinkage long sys_sigaltstack(const struct sigaltstack __user *uss,
|
|
|
|
struct sigaltstack __user *uoss);
|
|
|
|
asmlinkage long sys_rt_sigsuspend(sigset_t __user *unewset, size_t sigsetsize);
|
|
|
|
#ifndef CONFIG_ODD_RT_SIGACTION
|
|
|
|
asmlinkage long sys_rt_sigaction(int,
|
|
|
|
const struct sigaction __user *,
|
|
|
|
struct sigaction __user *,
|
|
|
|
size_t);
|
|
|
|
#endif
|
|
|
|
asmlinkage long sys_rt_sigprocmask(int how, sigset_t __user *set,
|
|
|
|
sigset_t __user *oset, size_t sigsetsize);
|
|
|
|
asmlinkage long sys_rt_sigpending(sigset_t __user *set, size_t sigsetsize);
|
|
|
|
asmlinkage long sys_rt_sigtimedwait(const sigset_t __user *uthese,
|
|
|
|
siginfo_t __user *uinfo,
|
2018-04-18 21:56:13 +08:00
|
|
|
const struct __kernel_timespec __user *uts,
|
2006-10-11 16:21:44 +08:00
|
|
|
size_t sigsetsize);
|
2018-04-18 22:15:37 +08:00
|
|
|
asmlinkage long sys_rt_sigtimedwait_time32(const sigset_t __user *uthese,
|
|
|
|
siginfo_t __user *uinfo,
|
|
|
|
const struct old_timespec32 __user *uts,
|
|
|
|
size_t sigsetsize);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user *uinfo);
|
|
|
|
asmlinkage long sys_setpriority(int which, int who, int niceval);
|
|
|
|
asmlinkage long sys_getpriority(int which, int who);
|
|
|
|
asmlinkage long sys_reboot(int magic1, int magic2, unsigned int cmd,
|
|
|
|
void __user *arg);
|
|
|
|
asmlinkage long sys_setregid(gid_t rgid, gid_t egid);
|
|
|
|
asmlinkage long sys_setgid(gid_t gid);
|
|
|
|
asmlinkage long sys_setreuid(uid_t ruid, uid_t euid);
|
|
|
|
asmlinkage long sys_setuid(uid_t uid);
|
|
|
|
asmlinkage long sys_setresuid(uid_t ruid, uid_t euid, uid_t suid);
|
|
|
|
asmlinkage long sys_getresuid(uid_t __user *ruid, uid_t __user *euid, uid_t __user *suid);
|
|
|
|
asmlinkage long sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid);
|
|
|
|
asmlinkage long sys_getresgid(gid_t __user *rgid, gid_t __user *egid, gid_t __user *sgid);
|
|
|
|
asmlinkage long sys_setfsuid(uid_t uid);
|
|
|
|
asmlinkage long sys_setfsgid(gid_t gid);
|
|
|
|
asmlinkage long sys_times(struct tms __user *tbuf);
|
|
|
|
asmlinkage long sys_setpgid(pid_t pid, pid_t pgid);
|
|
|
|
asmlinkage long sys_getpgid(pid_t pid);
|
|
|
|
asmlinkage long sys_getsid(pid_t pid);
|
|
|
|
asmlinkage long sys_setsid(void);
|
|
|
|
asmlinkage long sys_getgroups(int gidsetsize, gid_t __user *grouplist);
|
|
|
|
asmlinkage long sys_setgroups(int gidsetsize, gid_t __user *grouplist);
|
|
|
|
asmlinkage long sys_newuname(struct new_utsname __user *name);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_sethostname(char __user *name, int len);
|
|
|
|
asmlinkage long sys_setdomainname(char __user *name, int len);
|
|
|
|
asmlinkage long sys_getrlimit(unsigned int resource,
|
|
|
|
struct rlimit __user *rlim);
|
|
|
|
asmlinkage long sys_setrlimit(unsigned int resource,
|
|
|
|
struct rlimit __user *rlim);
|
|
|
|
asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
|
|
|
|
asmlinkage long sys_umask(int mask);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
|
|
|
|
unsigned long arg4, unsigned long arg5);
|
|
|
|
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
|
2019-10-26 04:56:17 +08:00
|
|
|
asmlinkage long sys_gettimeofday(struct __kernel_old_timeval __user *tv,
|
2018-03-26 03:50:11 +08:00
|
|
|
struct timezone __user *tz);
|
2018-08-16 02:04:11 +08:00
|
|
|
asmlinkage long sys_settimeofday(struct __kernel_old_timeval __user *tv,
|
2018-03-26 03:50:11 +08:00
|
|
|
struct timezone __user *tz);
|
2018-07-03 13:44:22 +08:00
|
|
|
asmlinkage long sys_adjtimex(struct __kernel_timex __user *txc_p);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_adjtimex_time32(struct old_timex32 __user *txc_p);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_getpid(void);
|
|
|
|
asmlinkage long sys_getppid(void);
|
|
|
|
asmlinkage long sys_getuid(void);
|
|
|
|
asmlinkage long sys_geteuid(void);
|
|
|
|
asmlinkage long sys_getgid(void);
|
|
|
|
asmlinkage long sys_getegid(void);
|
|
|
|
asmlinkage long sys_gettid(void);
|
|
|
|
asmlinkage long sys_sysinfo(struct sysinfo __user *info);
|
|
|
|
asmlinkage long sys_mq_open(const char __user *name, int oflag, umode_t mode, struct mq_attr __user *attr);
|
|
|
|
asmlinkage long sys_mq_unlink(const char __user *name);
|
2018-04-13 19:58:00 +08:00
|
|
|
asmlinkage long sys_mq_timedsend(mqd_t mqdes, const char __user *msg_ptr, size_t msg_len, unsigned int msg_prio, const struct __kernel_timespec __user *abs_timeout);
|
|
|
|
asmlinkage long sys_mq_timedreceive(mqd_t mqdes, char __user *msg_ptr, size_t msg_len, unsigned int __user *msg_prio, const struct __kernel_timespec __user *abs_timeout);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_mq_notify(mqd_t mqdes, const struct sigevent __user *notification);
|
|
|
|
asmlinkage long sys_mq_getsetattr(mqd_t mqdes, const struct mq_attr __user *mqstat, struct mq_attr __user *omqstat);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_mq_timedreceive_time32(mqd_t mqdes,
|
|
|
|
char __user *u_msg_ptr,
|
|
|
|
unsigned int msg_len, unsigned int __user *u_msg_prio,
|
|
|
|
const struct old_timespec32 __user *u_abs_timeout);
|
|
|
|
asmlinkage long sys_mq_timedsend_time32(mqd_t mqdes,
|
|
|
|
const char __user *u_msg_ptr,
|
|
|
|
unsigned int msg_len, unsigned int msg_prio,
|
|
|
|
const struct old_timespec32 __user *u_abs_timeout);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_msgget(key_t key, int msgflg);
|
ipc: rename old-style shmctl/semctl/msgctl syscalls
The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.
For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.
The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.
As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.
A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
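The split between the two entry points described in this commit message can be sketched in userspace C. This is an illustration only, not the kernel code: `parse_version_sketch`, `old_ctl_sketch`, and `new_ctl_sketch` are hypothetical names, and the `SKETCH_IPC_64` / `SKETCH_IPC_OLD` values are assumed to mirror the uapi `IPC_64` and `IPC_OLD` constants.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror the uapi constants; treat the values as illustrative. */
#define SKETCH_IPC_OLD 0
#define SKETCH_IPC_64  0x0100

/* Strip the IPC_64 flag from *cmd and report which layout the caller asked for. */
static int parse_version_sketch(int *cmd)
{
	assert(cmd != NULL);
	if (*cmd & SKETCH_IPC_64) {
		*cmd &= ~SKETCH_IPC_64;
		return SKETCH_IPC_64;
	}
	return SKETCH_IPC_OLD;
}

/* Old-style entry point: the caller selects the layout through the flag. */
static int old_ctl_sketch(int cmd, int *version)
{
	*version = parse_version_sketch(&cmd);
	return cmd;
}

/* Modern entry point: always the 64-bit layout; cmd is passed through unchanged. */
static int new_ctl_sketch(int cmd, int *version)
{
	*version = SKETCH_IPC_64;
	return cmd;
}
```

In this sketch both entry points return the command that a shared implementation would then act on; only the way the layout version is chosen differs.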
2019-01-01 05:22:40 +08:00
|
|
|
asmlinkage long sys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_msgrcv(int msqid, struct msgbuf __user *msgp,
|
|
|
|
size_t msgsz, long msgtyp, int msgflg);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_msgsnd(int msqid, struct msgbuf __user *msgp,
|
|
|
|
size_t msgsz, int msgflg);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_semget(key_t key, int nsems, int semflg);
|
2013-03-06 04:04:55 +08:00
|
|
|
asmlinkage long sys_semctl(int semid, int semnum, int cmd, unsigned long arg);
|
2019-01-01 05:22:40 +08:00
|
|
|
asmlinkage long sys_old_semctl(int semid, int semnum, int cmd, unsigned long arg);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_semtimedop(int semid, struct sembuf __user *sops,
|
|
|
|
unsigned nsops,
|
2018-04-13 19:58:00 +08:00
|
|
|
const struct __kernel_timespec __user *timeout);
|
2019-01-07 07:33:08 +08:00
|
|
|
asmlinkage long sys_semtimedop_time32(int semid, struct sembuf __user *sops,
|
|
|
|
unsigned nsops,
|
|
|
|
const struct old_timespec32 __user *timeout);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_semop(int semid, struct sembuf __user *sops,
|
|
|
|
unsigned nsops);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_shmget(key_t key, size_t size, int flag);
|
2019-01-01 05:22:40 +08:00
|
|
|
asmlinkage long sys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg);
|
|
|
|
asmlinkage long sys_shmdt(char __user *shmaddr);
|
|
|
|
asmlinkage long sys_socket(int, int, int);
|
|
|
|
asmlinkage long sys_socketpair(int, int, int, int __user *);
|
|
|
|
asmlinkage long sys_bind(int, struct sockaddr __user *, int);
|
|
|
|
asmlinkage long sys_listen(int, int);
|
|
|
|
asmlinkage long sys_accept(int, struct sockaddr __user *, int __user *);
|
|
|
|
asmlinkage long sys_connect(int, struct sockaddr __user *, int);
|
|
|
|
asmlinkage long sys_getsockname(int, struct sockaddr __user *, int __user *);
|
|
|
|
asmlinkage long sys_getpeername(int, struct sockaddr __user *, int __user *);
|
|
|
|
asmlinkage long sys_sendto(int, void __user *, size_t, unsigned,
|
|
|
|
struct sockaddr __user *, int);
|
|
|
|
asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
|
|
|
|
struct sockaddr __user *, int __user *);
|
|
|
|
asmlinkage long sys_setsockopt(int fd, int level, int optname,
|
|
|
|
char __user *optval, int optlen);
|
|
|
|
asmlinkage long sys_getsockopt(int fd, int level, int optname,
|
|
|
|
char __user *optval, int __user *optlen);
|
|
|
|
asmlinkage long sys_shutdown(int, int);
|
|
|
|
asmlinkage long sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned flags);
|
|
|
|
asmlinkage long sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned flags);
|
|
|
|
asmlinkage long sys_readahead(int fd, loff_t offset, size_t count);
|
|
|
|
asmlinkage long sys_brk(unsigned long brk);
|
|
|
|
asmlinkage long sys_munmap(unsigned long addr, size_t len);
|
|
|
|
asmlinkage long sys_mremap(unsigned long addr,
|
|
|
|
unsigned long old_len, unsigned long new_len,
|
|
|
|
unsigned long flags, unsigned long new_addr);
|
2005-04-17 06:20:36 +08:00
|
|
|
asmlinkage long sys_add_key(const char __user *_type,
|
|
|
|
const char __user *_description,
|
|
|
|
const void __user *_payload,
|
|
|
|
size_t plen,
|
|
|
|
key_serial_t destringid);
|
|
|
|
asmlinkage long sys_request_key(const char __user *_type,
|
|
|
|
const char __user *_description,
|
|
|
|
const char __user *_callout_info,
|
|
|
|
key_serial_t destringid);
|
|
|
|
asmlinkage long sys_keyctl(int cmd, unsigned long arg2, unsigned long arg3,
|
|
|
|
unsigned long arg4, unsigned long arg5);
|
2012-11-29 12:04:26 +08:00
|
|
|
#ifdef CONFIG_CLONE_BACKWARDS
|
clone: support passing tls argument via C rather than pt_regs magic
clone has some of the quirkiest syscall handling in the kernel, with a
pile of special cases, historical curiosities, and architecture-specific
calling conventions. In particular, clone with CLONE_SETTLS accepts a
parameter "tls" that the C entry point completely ignores and some
assembly entry points overwrite; instead, the low-level arch-specific
code pulls the tls parameter out of the arch-specific register captured
as part of pt_regs on entry to the kernel. That's a massive hack, and
it makes the arch-specific code only work when called via the specific
existing syscall entry points; because of this hack, any new clone-like
system call would have to accept an identical tls argument in exactly
the same arch-specific position, rather than providing a unified system
call entry point across architectures.
The first patch allows architectures to handle the tls argument via
normal C parameter passing, if they opt in by selecting
HAVE_COPY_THREAD_TLS. The second patch makes 32-bit and 64-bit x86 opt
into this.
These two patches came out of the clone4 series, which isn't ready for
this merge window, but these first two cleanup patches were entirely
uncontroversial and have acks. I'd like to go ahead and submit these
two so that other architectures can begin building on top of this and
opting into HAVE_COPY_THREAD_TLS. However, I'm also happy to wait and
send these through the next merge window (along with v3 of clone4) if
anyone would prefer that.
This patch (of 2):
clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread. sys_clone declares an int argument
tls_val in the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass along
that argument. Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes tls
in).
Apart from being awful and inscrutable, that also only works because only
one code path into copy_thread can pass the CLONE_SETTLS flag, and that
code path comes from sys_clone with its architecture-specific
argument-passing order. This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.
However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.
Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
and a new copy_thread_tls that accepts the tls parameter as an additional
unsigned long (syscall-argument-sized) argument. Change sys_clone's tls
argument to an unsigned long (which does not change the ABI), and pass
that down to copy_thread_tls.
Architectures that don't opt into copy_thread_tls will continue to ignore
the C argument to sys_clone in favor of the pt_regs captured at kernel
entry, and thus will be unable to introduce new versions of the clone
syscall.
Patch co-authored by Josh Triplett and Thiago Macieira.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26 06:01:19 +08:00
|
|
|
asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, unsigned long,
|
2012-11-29 12:04:26 +08:00
|
|
|
int __user *);
|
|
|
|
#else
|
2013-08-14 07:00:53 +08:00
|
|
|
#ifdef CONFIG_CLONE_BACKWARDS3
|
|
|
|
asmlinkage long sys_clone(unsigned long, unsigned long, int, int __user *,
|
2015-06-26 06:01:19 +08:00
|
|
|
int __user *, unsigned long);
|
2013-08-14 07:00:53 +08:00
|
|
|
#else
|
2012-11-29 12:04:26 +08:00
|
|
|
asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
|
2015-06-26 06:01:19 +08:00
|
|
|
int __user *, unsigned long);
|
2012-11-29 12:04:26 +08:00
|
|
|
#endif
|
2013-08-14 07:00:53 +08:00
|
|
|
#endif
|
fork: add clone3
This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset, a time namespace was discussed
at the Linux Plumbers Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals for what a new process
creation syscall or even API should look like. Some people even go so far
as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea, this patchset
does not have, and does not want to have, anything to do with it.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
  return argument in clone() their own dedicated return argument
  (e.g. CLONE_PIDFD)
- use a separate kernel internal struct kernel_clone_args that is
  properly typed according to current kernel conventions in fork.c and is
  different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identically
  (Arnd suggested that we can probably also port do_fork() itself in a
  separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on, as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus' suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)
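The layout goals listed above (fixed-width fields that look the same on 32-bit and 64-bit, plus a separately typed internal struct) can be illustrated with a small userspace sketch. Everything prefixed `sketch_` is hypothetical and only mimics the shape of the structs quoted in the commit message; the range checks in `sketch_to_kernel` are an assumption made for the illustration, not the validation the kernel performs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __aligned_u64: a 64-bit value with 8-byte alignment. */
typedef uint64_t sketch_aligned_u64 __attribute__((aligned(8)));

/* Userspace-facing layout: eight fixed-width fields, identical on 32 and 64 bit. */
struct sketch_clone_args {
	sketch_aligned_u64 flags;
	sketch_aligned_u64 pidfd;
	sketch_aligned_u64 child_tid;
	sketch_aligned_u64 parent_tid;
	sketch_aligned_u64 exit_signal;
	sketch_aligned_u64 stack;
	sketch_aligned_u64 stack_size;
	sketch_aligned_u64 tls;
};

/* Eight 8-byte fields and no padding, whatever the native word size is. */
_Static_assert(sizeof(struct sketch_clone_args) == 64, "unexpected size");
_Static_assert(offsetof(struct sketch_clone_args, tls) == 56, "unexpected offset");

/* Natively typed internal view, in the spirit of kernel_clone_args. */
struct sketch_kernel_clone_args {
	uint64_t flags;
	int *pidfd;
	int *child_tid;
	int *parent_tid;
	int exit_signal;
	unsigned long stack;
	unsigned long stack_size;
	unsigned long tls;
};

/* Convert the fixed-width layout to native types; -1 if a value does not fit. */
static int sketch_to_kernel(const struct sketch_clone_args *u,
			    struct sketch_kernel_clone_args *k)
{
	assert(u != NULL && k != NULL);

	/* Illustrative checks only: on 32-bit targets a u64 may not fit. */
	if (u->pidfd > UINTPTR_MAX || u->child_tid > UINTPTR_MAX ||
	    u->parent_tid > UINTPTR_MAX)
		return -1;
	if (u->stack > ULONG_MAX || u->stack_size > ULONG_MAX ||
	    u->tls > ULONG_MAX)
		return -1;
	if (u->exit_signal > INT_MAX)
		return -1;

	k->flags = u->flags;
	k->pidfd = (int *)(uintptr_t)u->pidfd;
	k->child_tid = (int *)(uintptr_t)u->child_tid;
	k->parent_tid = (int *)(uintptr_t)u->parent_tid;
	k->exit_signal = (int)u->exit_signal;
	k->stack = (unsigned long)u->stack;
	k->stack_size = (unsigned long)u->stack_size;
	k->tls = (unsigned long)u->tls;
	return 0;
}
```

Keeping every userspace-visible field at 64 bits means a 32-bit caller and a 64-bit kernel agree on offsets without a compat translation layer, while the internal struct stays free to use pointer and int types.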
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new API. Quite a lot of userspace APIs (e.g.
pthreads) are based on the clone() syscall. The new clone3() syscall
supports all of the old workloads and opens up the ability to add new
features, which should make switching to it more appealing for userspace. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseded by a dedicated "exit_signal" argument in struct
  clone_args, freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-25 17:36:41 +08:00
|
|
|
|
|
|
|
asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
|
|
|
|
|
2012-10-21 01:32:30 +08:00
|
|
|
asmlinkage long sys_execve(const char __user *filename,
|
|
|
|
const char __user *const __user *argv,
|
|
|
|
const char __user *const __user *envp);
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
|
|
|
|
|
2023-06-22 06:36:00 +08:00
|
|
|
/* CONFIG_MMU only */
|
2018-03-26 03:50:11 +08:00
|
|
|
asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
|
|
|
|
asmlinkage long sys_swapoff(const char __user *specialfile);
|
|
|
|
asmlinkage long sys_mprotect(unsigned long start, size_t len,
|
|
|
|
unsigned long prot);
|
|
|
|
asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
|
|
|
|
asmlinkage long sys_mlock(unsigned long start, size_t len);
|
|
|
|
asmlinkage long sys_munlock(unsigned long start, size_t len);
|
|
|
|
asmlinkage long sys_mlockall(int flags);
|
|
|
|
asmlinkage long sys_munlockall(void);
|
|
|
|
asmlinkage long sys_mincore(unsigned long start, size_t len,
|
|
|
|
unsigned char __user * vec);
|
|
|
|
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
|
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is a use case in which System Management Software (SMS) wants to give a
memory hint such as MADV_[COLD|PAGEOUT] to other processes; in the
case of Android, it is the ActivityManagerService.
The information required to make the reclaim decision is not known to the
app. Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall,
process_madvise(2). It uses the pidfd of an external process to give the
hint. It also supports a vector of address ranges, because an Android app has
thousands of vmas due to zygote, so it is a waste of CPU and power to
call the syscall once for each vma. (Testing 2000 single-vma
syscalls against one vectored syscall showed a 15% performance improvement. I
think it would be bigger in real practice because the test ran in a very
cache-friendly environment.)
Another potential use case for the vector range is to amortize the cost
of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In the
future, we could find more use cases for other advice values, so let's make it
part of the API now, since we are introducing a new syscall anyway. With
that, existing madvise(2) users could replace it with process_madvise(2)
with their own pid if they want batched address range
support.
Since it could affect another process's address range, only a privileged
process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
to ptrace the target (e.g., by having the same UID), can use it
successfully. The flag argument is reserved for future use if we need to
extend the API.
I think supporting every hint that madvise has or will have in
process_madvise is rather risky, because we are not sure that all hints
make sense from an external process, and the implementation of a hint may
rely on the caller being in the current context, so it could be
error-prone. Thus, I limited the hints to MADV_[COLD|PAGEOUT] in this
patch.
If someone wants to add other hints, we can hear the use case and review
each hint. That is safer for maintenance than introducing a buggy syscall
that is hard to fix later.
So finally, the API is as follows,
ssize_t process_madvise(int pidfd, const struct iovec *iovec,
unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges from external process as well as
local process. It provides the advice to address ranges of process
described by iovec and vlen. The goal of such advice is to improve
system or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
    struct iovec {
        void *iov_base;         /* starting address */
        size_t iov_len;         /* number of bytes to be advised */
    };
Each iovec describes an address range beginning at iov_base and
spanning iov_len bytes.
The vlen represents the number of elements in iovec.
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
The process_madvise supports every advice madvise(2) has if target
process is in same thread group with calling process so user could
use process_madvise(2) to extend existing madvise(2) to support
vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check return value
to determine whether a partial advice occurred.
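The return-value rule above can be sketched in C. This is only an illustration under stated assumptions: iov_total_bytes() and advise_ranges() are hypothetical helper names, the raw syscall(2) interface is used instead of any libc wrapper, and the fallback values for SYS_process_madvise (440) and MADV_COLD (20) are taken from the generic syscall table and uapi headers for toolchains whose headers predate them.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: fallbacks for older userspace headers (generic numbering). */
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Hypothetical helper: total number of bytes described by the vector. */
size_t iov_total_bytes(const struct iovec *iov, size_t vlen)
{
	size_t total = 0;

	for (size_t i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	return total;
}

/*
 * Hypothetical wrapper around the raw syscall.
 * Returns 0 if every requested byte was advised, 1 if the advice was
 * only partially applied, and -1 on error (errno is set by syscall()).
 */
int advise_ranges(int pidfd, const struct iovec *iov, size_t vlen, int advice)
{
	long ret = syscall(SYS_process_madvise, pidfd, iov, vlen, advice, 0U);

	if (ret < 0)
		return -1;
	return (size_t)ret == iov_total_bytes(iov, vlen) ? 0 : 1;
}
```

A caller would typically obtain pidfd from pidfd_open(2), fill the iovec array with the target's address ranges, and treat a return of 1 as a signal to retry or re-inspect the remaining ranges.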
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How is the race (i.e., object validation) handled between an
external process giving a hint and the target process receiving it?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the target
process can run between the time the process_madvise caller inspects
the target process address space and the time that process_madvise is
actually called, process_madvise may operate on memory regions that
the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. The
same applies to other APIs like move_pages and process_vm_writev.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization by using other API or
design from userside. It shouldn't be part of API itself. If someone
needs more fine-grained synchronization rather than process level,
there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
applicable via using last reserved argument of the API but I don't
think it's necessary right now since we have already ways to prevent
the race so don't want to add additional complexity with more
fine-grained optimization model.
To keep the API extensible, an unsigned long is reserved as the last
argument so we could support it in future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to act on the hint in the caller's context,
not the callee's, because the callee is usually limited by
cpuset/cgroups or even in a frozen state, so it can't act by itself
quickly enough, which causes more thrashing/kills. It also doesn't work
if the target process is already ptraced (e.g., by strace, a debugger,
or minidump) because a process can have at most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 07:14:59 +08:00
asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
			size_t vlen, int behavior, unsigned int flags);
2021-09-03 06:00:33 +08:00
asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
			unsigned long prot, unsigned long pgoff,
			unsigned long flags);
asmlinkage long sys_mbind(unsigned long start, unsigned long len,
				unsigned long mode,
				const unsigned long __user *nmask,
				unsigned long maxnode,
				unsigned flags);
asmlinkage long sys_get_mempolicy(int __user *policy,
				unsigned long __user *nmask,
				unsigned long maxnode,
				unsigned long addr, unsigned long flags);
asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
				unsigned long maxnode);
asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
				const unsigned long __user *from,
				const unsigned long __user *to);
asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
				const void __user * __user *pages,
				const int __user *nodes,
				int __user *status,
				int flags);
asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig,
		siginfo_t __user *uinfo);

perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out of its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in all, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks go to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done

  FILES=$(find . -name perf_event.*)

  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 18:02:48 +08:00
asmlinkage long sys_perf_event_open(
		struct perf_event_attr __user *attr_uptr,
2009-03-04 17:36:51 +08:00
		pid_t pid, int cpu, int group_fd, unsigned long flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_accept4(int, struct sockaddr __user *, int __user *, int);
asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
			     unsigned int vlen, unsigned flags,
2018-04-18 19:42:25 +08:00
			     struct __kernel_timespec __user *timeout);

y2038: socket: Add compat_sys_recvmmsg_time64
recvmmsg() takes two arguments that are pointers to structures that differ
between 32-bit and 64-bit architectures: mmsghdr and timespec.
For y2038 compatibility, we are changing the native system call from
timespec to __kernel_timespec with a 64-bit time_t (in another patch),
and use the existing compat system call on both 32-bit and 64-bit
architectures for compatibility with traditional 32-bit user space.
As we now have two variants of recvmmsg() for 32-bit tasks that are both
different from the variant that we use on 64-bit tasks, this means we
also require two compat system calls!
The solution I picked is to flip things around: The existing
compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
and now handles the case for old user space on all architectures that
have set CONFIG_COMPAT_32BIT_TIME. A new compat_sys_recvmmsg_time64()
call gets added in the old place for 64-bit architectures only, this
one handles the case of a compat mmsghdr structure combined with
__kernel_timespec.
In the indirect sys_socketcall(), we now need to call either
do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
architecture we are on. For compat_sys_socketcall(), no such change is
needed, we always call __compat_sys_recvmmsg().
I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
implementation for 64-bit time_t will need significant changes including
an updated asm/unistd.h, and it seems better to consistently use the
separate syscalls in that configuration, leaving the socketcall only for
backward compatibility with 32-bit time_t based libc.
The naming is asymmetric for the moment, so both existing syscall
entry points keep their names, while the new ones are recvmmsg_time32
and compat_recvmmsg_time64 respectively. I expect that we will rename
the compat syscalls later as we start using generated syscall tables
everywhere and add these entry points.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-18 19:43:52 +08:00
asmlinkage long sys_recvmmsg_time32(int fd, struct mmsghdr __user *msg,
			     unsigned int vlen, unsigned flags,
			     struct old_timespec32 __user *timeout);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
				int options, struct rusage __user *ru);
asmlinkage long sys_prlimit64(pid_t pid, unsigned int resource,
				const struct rlimit64 __user *new_rlim,
				struct rlimit64 __user *old_rlim);
asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags);
asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
				  u64 mask, int fd,
				  const char __user *pathname);
2011-01-29 21:13:26 +08:00
asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
				      struct file_handle __user *handle,
				      int __user *mnt_id, int flag);
2011-01-29 21:13:26 +08:00
asmlinkage long sys_open_by_handle_at(int mountdirfd,
				      struct file_handle __user *handle,
				      int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_clock_adjtime(clockid_t which_clock,
2018-07-03 13:44:22 +08:00
				struct __kernel_timex __user *tx);
2019-01-07 07:33:08 +08:00
asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
				struct old_timex32 __user *tx);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_syncfs(int fd);
2011-05-12 05:06:58 +08:00
asmlinkage long sys_setns(int fd, int nstype);
2019-05-24 18:43:51 +08:00
asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
			     unsigned int vlen, unsigned flags);
2011-11-01 08:06:39 +08:00
asmlinkage long sys_process_vm_readv(pid_t pid,
				     const struct iovec __user *lvec,
				     unsigned long liovcnt,
				     const struct iovec __user *rvec,
				     unsigned long riovcnt,
				     unsigned long flags);
asmlinkage long sys_process_vm_writev(pid_t pid,
				      const struct iovec __user *lvec,
				      unsigned long liovcnt,
				      const struct iovec __user *rvec,
				      unsigned long riovcnt,
				      unsigned long flags);
2012-06-01 07:26:44 +08:00
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
			 unsigned long idx1, unsigned long idx2);
2012-10-22 15:39:41 +08:00
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_sched_setattr(pid_t pid,
					struct sched_attr __user *attr,
					unsigned int flags);
asmlinkage long sys_sched_getattr(pid_t pid,
					struct sched_attr __user *attr,
					unsigned int size,
					unsigned int flags);
asmlinkage long sys_renameat2(int olddfd, const char __user *oldname,
			      int newdfd, const char __user *newname,
			      unsigned int flags);
2014-06-26 07:08:24 +08:00
asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
2018-12-10 02:24:12 +08:00
			    void __user *uargs);

random: introduce getrandom(2) system call
The getrandom(2) system call was requested by the LibreSSL Portable
developers. It is analogous to the getentropy(2) system call in
OpenBSD.
The rationale of this system call is to provide resilience against
file descriptor exhaustion attacks, where the attacker consumes all
available file descriptors, forcing the use of the fallback code where
/dev/[u]random is not available. Since the fallback code is often not
well-tested, it is better to eliminate this potential failure mode
entirely.
The other feature provided by this new system call is the ability to
request randomness from the /dev/urandom entropy pool, but to block
until at least 128 bits of entropy has been accumulated in the
/dev/urandom entropy pool. Historically, the emphasis in the
/dev/urandom development has been to ensure that urandom pool is
initialized as quickly as possible after system boot, and preferably
before the init scripts start execution.
This is because changing /dev/urandom reads to block represents an
interface change that could potentially break userspace which is not
acceptable. In practice, on most x86 desktop and server systems, in
general the entropy pool can be initialized before it is needed (and
in modern kernels, we will printk a warning message if not). However,
on an embedded system, this may not be the case. And so with this new
interface, we can provide the functionality of blocking until the
urandom pool has been initialized. Any userspace program which uses
this new functionality must take care to assure that if it is used
during the boot process, that it will not cause the init scripts or
other portions of the system startup to hang indefinitely.
SYNOPSIS
#include <linux/random.h>
int getrandom(void *buf, size_t buflen, unsigned int flags);
DESCRIPTION
The system call getrandom() fills the buffer pointed to by buf
with up to buflen random bytes which can be used to seed user
space random number generators (i.e., DRBG's) or for other
cryptographic uses. It should not be used for Monte Carlo
simulations or other programs/algorithms which are doing
probabilistic sampling.
If the GRND_RANDOM flags bit is set, then draw from the
/dev/random pool instead of the /dev/urandom pool. The
/dev/random pool is limited based on the entropy that can be
obtained from environmental noise, so if there is insufficient
entropy, the requested number of bytes may not be returned.
If there is no entropy available at all, getrandom(2) will
either block, or return an error with errno set to EAGAIN if
the GRND_NONBLOCK bit is set in flags.
If the GRND_RANDOM bit is not set, then the /dev/urandom pool
will be used. Unlike using read(2) to fetch data from
/dev/urandom, if the urandom pool has not been sufficiently
initialized, getrandom(2) will block (or return -1 with the
errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags).
The getentropy(2) system call in OpenBSD can be emulated using
the following function:
int getentropy(void *buf, size_t buflen)
{
	int ret;

	if (buflen > 256)
		goto failure;
	ret = getrandom(buf, buflen, 0);
	if (ret < 0)
		return ret;
	if (ret == buflen)
		return 0;
failure:
	errno = EIO;
	return -1;
}
RETURN VALUE
On success, the number of bytes that was filled in the buf is
returned. This may not be all the bytes requested by the
caller via buflen if insufficient entropy was present in the
/dev/random pool, or if the system call was interrupted by a
signal.
On error, -1 is returned, and errno is set appropriately.
ERRORS
EINVAL    An invalid flag was passed to getrandom(2).
EFAULT    buf is outside the accessible address space.
EAGAIN    The requested entropy was not available, and
          getrandom(2) would have blocked if the
          GRND_NONBLOCK flag was not set.
EINTR     While blocked waiting for entropy, the call was
          interrupted by a signal handler; see the description
          of how interrupted read(2) calls on "slow" devices
          are handled with and without the SA_RESTART flag
          in the signal(7) man page.
NOTES
For small requests (buflen <= 256) getrandom(2) will not
return EINTR when reading from the urandom pool once the
entropy pool has been initialized, and it will return all of
the bytes that have been requested. This is the recommended
way to use getrandom(2), and is designed for compatibility
with OpenBSD's getentropy() system call.
However, if you are using GRND_RANDOM, then getrandom(2) may
block until the entropy accounting determines that sufficient
environmental noise has been gathered such that getrandom(2)
will be operating as a NRBG instead of a DRBG for those people
who are working in the NIST SP 800-90 regime. Since it may
block for a long time, these guarantees do *not* apply. The
user may want to interrupt a hanging process using a signal,
so blocking until all of the requested bytes are returned
would be unfriendly.
For this reason, the user of getrandom(2) MUST always check
the return value, in case it returns some error, or if fewer
bytes than requested were returned. In the case of
!GRND_RANDOM and small request, the latter should never
happen, but the careful userspace code (and all crypto code
should be careful) should check for this anyway!
Finally, unless you are doing long-term key generation (and
perhaps not even then), you probably shouldn't be using
GRND_RANDOM. The cryptographic algorithms used for
/dev/urandom are quite conservative, and so should be
sufficient for all purposes. The disadvantage of GRND_RANDOM
is that it can block, and the increased complexity required to
deal with partially fulfilled getrandom(2) requests.
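The "MUST always check the return value" rule above can be sketched as a retry loop. This is a minimal illustration, not part of the patch: fill_random() is a hypothetical helper name, and the sketch assumes a libc that provides the getrandom() wrapper in <sys/random.h>.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * Hypothetical helper: fill buf with exactly len random bytes from the
 * urandom pool (flags == 0). Returns 0 on success, or -1 with errno set
 * by getrandom() on any error other than EINTR.
 */
int fill_random(void *buf, size_t len)
{
	unsigned char *p = buf;

	while (len > 0) {
		ssize_t n = getrandom(p, len, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal: retry */
			return -1;
		}
		p += n;			/* short read: ask for the remainder */
		len -= (size_t)n;
	}
	return 0;
}
```

For requests of 256 bytes or fewer without GRND_RANDOM the loop body should run once, per the notes above; the loop only matters for larger requests or interrupted calls.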
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Zach Brown <zab@zabbo.net>
2014-07-17 16:13:05 +08:00
asmlinkage long sys_getrandom(char __user *buf, size_t count,
			      unsigned int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
2014-09-26 15:16:58 +08:00
asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);

syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
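The AT_EMPTY_PATH case described above can be sketched as a /proc-free fexecve()-style call. This is only an illustration: exec_fd() is a hypothetical name, the raw syscall(2) interface is used, SYS_execveat is assumed to come from the installed kernel headers, and the AT_EMPTY_PATH fallback value (0x1000) is an assumption for headers that hide it.

```c
#define _GNU_SOURCE
#include <fcntl.h>		/* AT_EMPTY_PATH */
#include <sys/syscall.h>	/* SYS_execveat */
#include <unistd.h>		/* syscall() */

/* Assumption: fallback for headers that do not expose AT_EMPTY_PATH. */
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/*
 * Hypothetical wrapper: execute the file that fd refers to, using an
 * empty filename plus AT_EMPTY_PATH. Like execve(), it returns only on
 * failure (-1, with errno set).
 */
int exec_fd(int fd, char *const argv[], char *const envp[])
{
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}
```

As the text notes, the fd should not be O_CLOEXEC when the target is a script, since the interpreter would no longer be able to open it after exec.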
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 08:57:29 +08:00
asmlinkage long sys_execveat(int dfd, const char __user *filename,
			const char __user *const __user *argv,
			const char __user *const __user *envp, int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_userfaultfd(int flags);
2020-09-24 07:36:16 +08:00
asmlinkage long sys_membarrier(int cmd, unsigned int flags, int cpu_id);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
2015-11-11 05:53:30 +08:00
asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
				    int fd_out, loff_t __user *off_out,
				    size_t len, unsigned int flags);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
			    rwf_t flags);
asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
			    rwf_t flags);
2016-07-30 00:30:18 +08:00
asmlinkage long sys_pkey_mprotect(unsigned long start, size_t len,
				  unsigned long prot, int pkey);
asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);

statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
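As a rough illustration of the call described above, the following sketch queries a path lstat()-style and only trusts a field once its bit is set in stx_mask. It assumes the glibc statx() wrapper (glibc 2.28 or later) rather than anything defined by this patch; example_statx_size() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_* (assumes glibc >= 2.28) */

/*
 * Hypothetical helper: query a path without following a trailing symlink
 * and report its size. Returns 0 on success, -1 on failure with errno set.
 */
static int example_statx_size(const char *path, unsigned long long *size_out)
{
	struct statx stx;

	assert(path != NULL && size_out != NULL);
	memset(&stx, 0, sizeof(stx));

	/* Ask for the basic stat() set plus the creation time. */
	if (statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
		  STATX_BASIC_STATS | STATX_BTIME, &stx) != 0)
		return -1;

	/* Only trust a field if the kernel set its bit in stx_mask. */
	if (!(stx.stx_mask & STATX_SIZE)) {
		errno = ENODATA;
		return -1;
	}
	*size_out = stx.stx_size;

	if (stx.stx_mask & STATX_BTIME)
		printf("%s: birth time %lld s\n", path,
		       (long long)stx.stx_btime.tv_sec);
	else
		printf("%s: no birth time available\n", path);
	return 0;
}
```

Passing AT_SYMLINK_NOFOLLOW here stands in for lstat(); dropping it gives stat()-like behaviour.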
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-02-01 00:46:22 +08:00
|
|
|
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
|
|
|
|
unsigned mask, struct statx __user *buffer);
|
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space has the following benefits over the other mechanisms
currently available to read the current CPU number:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 20:43:54 +08:00
|
|
|
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
|
|
|
|
int flags, uint32_t sig);
|
2018-11-06 01:40:30 +08:00
|
|
|
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
|
2018-11-06 01:40:30 +08:00
|
|
|
asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
|
|
|
|
int to_dfd, const char __user *to_path,
|
|
|
|
unsigned int ms_flags);
|
fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount
api allows changing the properties of a superblock, there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.
The new mount_setattr() syscall allows recursively clearing and setting
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:
int mount_setattr(int dfd, const char *path, unsigned flags,
struct mount_attr *uattr, size_t usize);
Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.
The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:
struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u64 propagation;
__u64 userns_fd;
};
The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
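The atime rule above can be sketched as a small validity check. This is only an illustration of the stated rule, not the kernel's implementation. The EX_ constants are local copies that are assumed to match the uapi values in <linux/mount.h>, and example_atime_request_ok() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative copies (assumed values); real code should use <linux/mount.h>. */
#define EX_MOUNT_ATTR_RDONLY      0x00000001u
#define EX_MOUNT_ATTR_NOEXEC      0x00000008u
#define EX_MOUNT_ATTR__ATIME      0x00000070u /* mask covering the atime enum */
#define EX_MOUNT_ATTR_RELATIME    0x00000000u
#define EX_MOUNT_ATTR_NOATIME     0x00000010u
#define EX_MOUNT_ATTR_STRICTATIME 0x00000020u

/* Every atime setting has to fit inside the atime mask. */
static_assert((EX_MOUNT_ATTR_NOATIME & ~EX_MOUNT_ATTR__ATIME) == 0,
	      "atime values must fit in the atime mask");
static_assert((EX_MOUNT_ATTR_STRICTATIME & ~EX_MOUNT_ATTR__ATIME) == 0,
	      "atime values must fit in the atime mask");

/* Returns true if the attr_set/attr_clr pair obeys the atime rule. */
static bool example_atime_request_ok(uint64_t attr_set, uint64_t attr_clr)
{
	uint64_t clr_atime = attr_clr & EX_MOUNT_ATTR__ATIME;

	/* The atime field must be cleared as a whole or not at all. */
	if (clr_atime != 0 && clr_atime != EX_MOUNT_ATTR__ATIME)
		return false;

	/* An atime value in attr_set requires the full mask in attr_clr. */
	if ((attr_set & EX_MOUNT_ATTR__ATIME) != 0 &&
	    clr_atime != EX_MOUNT_ATTR__ATIME)
		return false;

	return true;
}
```

Under these assumptions, switching to noatime means passing EX_MOUNT_ATTR_NOATIME in attr_set together with EX_MOUNT_ATTR__ATIME in attr_clr. Because relatime is the zero value of the enum, selecting it means clearing the whole mask and setting no atime bits.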
The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
The @userns_fd field lets the user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.
[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-21 21:19:53 +08:00
|
|
|
asmlinkage long sys_mount_setattr(int dfd, const char __user *path,
|
|
|
|
unsigned int flags,
|
|
|
|
struct mount_attr __user *uattr, size_t usize);
|
vfs: syscall: Add fsopen() to prepare for superblock creation
Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle. fsopen() is given the name of the filesystem that will be used:
int mfd = fsopen(const char *fsname, unsigned int flags);
where flags can be 0 or FSOPEN_CLOEXEC.
For example:
sfd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fsinfo(sfd, NULL, ...); // query new superblock attributes
mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
sfd = fsopen("afs", -1);
fsconfig(sfd, FSCONFIG_SET_STRING, "source",
"#grand.central.org:root.cell", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(sfd, 0, MS_NODEV);
move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:
"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"
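A message read back from the context fd could be split into its parts along these lines. This is a minimal sketch that follows only the form quoted above; real messages may differ in detail. struct example_fs_msg and example_parse_fs_msg() are hypothetical names, not kernel or libc interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one parsed message; not a kernel type. */
struct example_fs_msg {
	char level;        /* 'e' in the examples above */
	char subsys[32];
	char problem[128];
};

/*
 * Parse "<level> <subsys>:<problem>".
 * Returns 0 on success, -1 if the line does not match or does not fit.
 */
static int example_parse_fs_msg(const char *line, struct example_fs_msg *out)
{
	const char *colon;
	size_t subsys_len, problem_len;

	assert(line != NULL && out != NULL);

	/* Expect a one-character level followed by a single space. */
	if (line[0] == '\0' || line[1] != ' ')
		return -1;

	colon = strchr(line + 2, ':');
	if (colon == NULL)
		return -1;

	subsys_len = (size_t)(colon - (line + 2));
	problem_len = strlen(colon + 1);

	/* Drop one trailing newline, if present. */
	if (problem_len > 0 && colon[problem_len] == '\n')
		problem_len--;

	if (subsys_len == 0 || subsys_len >= sizeof(out->subsys) ||
	    problem_len >= sizeof(out->problem))
		return -1;

	out->level = line[0];
	memcpy(out->subsys, line + 2, subsys_len);
	out->subsys[subsys_len] = '\0';
	memcpy(out->problem, colon + 1, problem_len);
	out->problem[problem_len] = '\0';
	return 0;
}
```

A caller would read() from the fd until ENODATA is reported and hand each message to the parser.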
Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.
The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.
Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.
Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
fsopen().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-02 07:33:31 +08:00
|
|
|
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
|
vfs: syscall: Add fsconfig() for configuring and managing a context
Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
long fsconfig(int fs_fd, unsigned int cmd, const char *key,
const void *value, int aux);
Where fs_fd indicates the context, cmd indicates the action to take, key
indicates the parameter name for parameter-setting actions and, if needed,
value points to a buffer containing the value and aux can give more
information for the value.
The following command IDs are proposed:
(*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be
boolean in nature. The key may be prefixed with "no" to invert the
setting. value must be NULL and aux must be 0.
(*) FSCONFIG_SET_STRING: A string value is specified. The parameter can
be expecting boolean, integer, string or take a path. A conversion to
an appropriate type will be attempted (which may include looking up as
a path). value points to a NUL-terminated string and aux must be 0.
(*) FSCONFIG_SET_BINARY: A binary blob is specified. value points to
the blob and aux indicates its size. The parameter must be expecting
a blob.
(*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must
be expecting a path object. value points to a NUL-terminated string
that is the path and aux is a file descriptor at which to start a
relative lookup or AT_FDCWD.
(*) FSCONFIG_SET_PATH_EMPTY: As FSCONFIG_SET_PATH, but with AT_EMPTY_PATH
implied.
(*) FSCONFIG_SET_FD: An open file descriptor is specified. value must
be NULL and aux indicates the file descriptor.
(*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
(*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
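The per-command rules for key, value and aux listed above can be summarised as a checker. This is a sketch under stated assumptions, not the kernel's validation code. The enum is a local stand-in for the proposed command IDs. Rejecting an empty blob, requiring a non-negative fd, and requiring NULL key and value for the two CMD actions are assumptions that go beyond the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the command IDs listed above. */
enum example_fsconfig_cmd {
	EX_SET_FLAG,
	EX_SET_STRING,
	EX_SET_BINARY,
	EX_SET_PATH,
	EX_SET_PATH_EMPTY,
	EX_SET_FD,
	EX_CMD_CREATE,
	EX_CMD_RECONFIGURE,
};

static_assert(EX_CMD_RECONFIGURE == 7, "eight command IDs are listed above");

/* Returns true if key/value/aux have the shape described for cmd. */
static bool example_fsconfig_args_ok(enum example_fsconfig_cmd cmd,
				     const char *key, const void *value,
				     int aux)
{
	switch (cmd) {
	case EX_SET_FLAG:
		/* Boolean parameter: no value, aux unused. */
		return key != NULL && value == NULL && aux == 0;
	case EX_SET_STRING:
		/* NUL-terminated string value, aux unused. */
		return key != NULL && value != NULL && aux == 0;
	case EX_SET_BINARY:
		/* aux is the blob size (assumption: must be positive). */
		return key != NULL && value != NULL && aux > 0;
	case EX_SET_PATH:
	case EX_SET_PATH_EMPTY:
		/* value is the path; aux is a dirfd or AT_FDCWD, not checked. */
		return key != NULL && value != NULL;
	case EX_SET_FD:
		/* aux carries the fd (assumption: non-negative). */
		return key != NULL && value == NULL && aux >= 0;
	case EX_CMD_CREATE:
	case EX_CMD_RECONFIGURE:
		/* Actions take no parameter (assumption from the examples). */
		return key == NULL && value == NULL && aux == 0;
	}
	return false;
}
```

Such a check might sit in a userspace wrapper to catch malformed calls before they reach the kernel, which remains the final authority on what is accepted.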
For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take. The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.
Source specification is also done the same way, using special keys
"source", "source1", "source2", etc.
[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem. Every filesystem that uses options
uses match_token() and co. to do this, and this will need to be changed -
but not all at once.
Example usage:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(fd, FSCONFIG_SET_PATH_EMPTY, "journal_path", "", journal_fd);
fsconfig(fd, FSCONFIG_SET_FD, "journal_fd", NULL, journal_fd);
fsconfig(fd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noacl", NULL, 0);
fsconfig(fd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "data", "journal", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "context", "unconfined_u:...", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("afs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("jffs2", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "mtd0", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-02 07:36:09 +08:00
|
|
|
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
|
|
|
|
const void __user *value, int aux);
|
2018-11-02 07:36:14 +08:00
|
|
|
asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
|
vfs: syscall: Add fspick() to select a superblock for reconfiguration
Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).
This looks like:
int fd = fspick(AT_FDCWD, "/mnt",
FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
At the point of fspick being called, the file descriptor referring to the
filesystem context is in exactly the same state as the one that was created
by fsopen() after fsmount() has been successfully called.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-02 07:36:23 +08:00
|
|
|
asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
|
signal: add pidfd_send_signal() syscall
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from /proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and argument */
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 work to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signals to threads and
process groups, appropriately named flags (e.g. PIDFD_TYPE_PGID and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that, given that the
flags argument decides the scope of the signal instead of different types
of fds, it might make sense to settle for either "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
former because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear what values certain members of struct siginfo_t
would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_any() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. [12]).
With the upcoming rework of syscall handling, things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
				       unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
	/* Fail the way syscall(2) does so that strerror(errno) is meaningful. */
	errno = ENOSYS;
	return -1;
#endif
}

int main(int argc, char *argv[])
{
	int fd, ret, saved_errno, sig;

	if (argc < 3)
		exit(EXIT_FAILURE);

	/* argv[1] is expected to be a /proc/<pid> directory. */
	fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
	if (fd < 0) {
		printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
		exit(EXIT_FAILURE);
	}

	sig = atoi(argv[2]);

	printf("Sending signal %d to process %s\n", sig, argv[1]);
	ret = do_pidfd_send_signal(fd, sig, NULL, 0);

	/* close() may clobber errno, so preserve it for the error message. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	if (ret < 0) {
		printf("%s - Failed to send signal %d to process %s\n",
		       strerror(errno), sig, argv[1]);
		exit(EXIT_FAILURE);
	}

	exit(EXIT_SUCCESS);
}
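A related use of the same interface is a liveness probe. The following is a minimal sketch, not part of the patch: it assumes that signal 0 behaves as it does for kill(2) (only the existence and permission checks run), and the helper name pidfd_probe() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: probe the process behind a /proc/<pid> directory by
 * sending signal 0 through its file descriptor.
 * Returns 0 if the process could be signaled, or a negative errno value.
 */
int pidfd_probe(const char *procdir)
{
	int fd, ret;

	assert(procdir != NULL);

	fd = open(procdir, O_DIRECTORY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

#ifdef __NR_pidfd_send_signal
	/* Signal 0 delivers nothing; only the checks are performed. */
	ret = (int)syscall(__NR_pidfd_send_signal, fd, 0, NULL, 0);
	if (ret < 0)
		ret = -errno;
#else
	ret = -ENOSYS;	/* the installed headers predate the syscall */
#endif
	close(fd);
	return ret;
}
```

A caller would typically treat -ESRCH as "the process has exited" (cf. A-01 below) and any other negative value as an ordinary error.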
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
 * For the sake of progress please consider these arguments settled unless
 * there is a new point that desperately needs to be addressed. Please check
 * the threads linked in this commit message to see whether your point has
 * already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created precisely so that struct task_struct does
not need to be pinned (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle, not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [21])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscall just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscall to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety of reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
2018-11-19 07:51:56 +08:00
asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
				siginfo_t __user *info,
				unsigned int flags);
2020-01-08 01:59:26 +08:00
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
2021-04-22 23:41:18 +08:00
asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr __user *attr,
				size_t size, __u32 flags);
asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
				const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
2021-07-08 09:08:11 +08:00
asmlinkage long sys_memfd_secret(unsigned int flags);
2022-01-15 06:08:21 +08:00
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
				unsigned long home_node,
				unsigned long flags);
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
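To make the cost described above concrete, here is a rough sketch of what a mincore()-based query looks like for a single file: map it, fetch one byte per page, and aggregate in userspace. The helper name count_resident_pages() is hypothetical and the code is an illustration, not something taken from this patch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the resident pages of one file via mincore().
 * Returns the number of resident pages, or -1 on error.
 */
long count_resident_pages(const char *path)
{
	long page = sysconf(_SC_PAGESIZE), resident = 0;
	unsigned char *vec;
	struct stat st;
	size_t npages, i;
	void *map;
	int fd;

	assert(path != NULL);
	if (page <= 0)
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {
		close(fd);
		return 0;	/* nothing to map, nothing cached */
	}

	/* Every file has to be mapped before it can be queried. */
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* The kernel writes one byte per page; userspace sums them up. */
	npages = ((size_t)st.st_size + (size_t)page - 1) / (size_t)page;
	vec = malloc(npages);
	if (vec && mincore(map, (size_t)st.st_size, vec) == 0) {
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	} else {
		resident = -1;
	}

	free(vec);
	munmap(map, (size_t)st.st_size);
	return resident;
}
```

For a directory tree this has to be repeated per file, which is the map/unmap and bitmap overhead that the text above refers to.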
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is wired up only for the x86
architecture.
NAME
    cachestat - query the page cache statistics of a file.

SYNOPSIS
    #include <sys/mman.h>

    struct cachestat_range {
        __u64 off;
        __u64 len;
    };

    struct cachestat {
        __u64 nr_cache;
        __u64 nr_dirty;
        __u64 nr_writeback;
        __u64 nr_evicted;
        __u64 nr_recently_evicted;
    };

    int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
        struct cachestat *cstat, unsigned int flags);

DESCRIPTION
    cachestat() queries the number of cached pages, number of dirty
    pages, number of pages marked for writeback, number of evicted
    pages, and number of recently evicted pages in the byte range given
    by `off` and `len`.

    An evicted page is a page that was previously in the page cache but
    has been evicted since. A page is recently evicted if its last
    eviction was recent enough that its reentry to the cache would
    indicate that it is actively being used by the system, and that
    there is memory pressure on the system.

    These values are returned in a cachestat struct, whose address is
    given by the `cstat` argument.

    The `off` and `len` arguments must be non-negative integers. If
    `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
    0, we will query in the range from `off` to the end of the file.

    The `flags` argument is unused for now, but is included for future
    extensibility. Users should pass 0 (i.e., no flags specified).

    Currently, hugetlbfs is not supported.

    Because the status of a page can change after cachestat() checks it
    but before it returns to the application, the returned values may
    contain stale information.

RETURN VALUE
    On success, cachestat returns 0. On error, -1 is returned, and errno
    is set to indicate the error.

ERRORS
    EFAULT      cstat or cstat_range points to an invalid address.
    EINVAL      invalid flags.
    EBADF       invalid file descriptor.
    EOPNOTSUPP  the file descriptor refers to a hugetlbfs file.
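The synopsis above can be exercised from userspace roughly as follows. This is a sketch under assumptions: no libc wrapper is assumed, so the raw syscall is used where the installed headers define __NR_cachestat; the demo_ struct and function names are hypothetical local mirrors of the layout shown in the synopsis, chosen to avoid clashing with newer kernel headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical local mirrors of the structs from the synopsis. */
struct demo_cachestat_range {
	uint64_t off;
	uint64_t len;
};

struct demo_cachestat {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Query the whole file: off = 0 and len = 0 select "from off to the end of
 * the file". Returns 0 on success or a negative errno value.
 */
int demo_cachestat_whole_file(const char *path, struct demo_cachestat *cs)
{
	struct demo_cachestat_range range = { .off = 0, .len = 0 };
	int fd, ret;

	assert(path != NULL);

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

#ifdef __NR_cachestat
	/* flags must be 0 for now, as described above. */
	ret = (int)syscall(__NR_cachestat, fd, &range, cs, 0);
	if (ret < 0)
		ret = -errno;
#else
	(void)range;
	(void)cs;
	ret = -ENOSYS;	/* the installed headers predate the syscall */
#endif
	close(fd);
	return ret;
}
```

On success the caller reads the counters out of the struct, for example cs->nr_cache for the number of cached pages in the queried range.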
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:07 +08:00
asmlinkage long sys_cachestat(unsigned int fd,
				struct cachestat_range __user *cstat_range,
				struct cachestat __user *cstat, unsigned int flags);
2018-03-11 18:34:25 +08:00

2018-03-26 03:50:11 +08:00
/*
 * Architecture-specific system calls
 */

2023-06-22 06:36:00 +08:00
/* x86 */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);

/* pciconfig: alpha, arm, arm64, ia64, sparc */
asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
				unsigned long off, unsigned long len,
				void __user *buf);
asmlinkage long sys_pciconfig_write(unsigned long bus, unsigned long dfn,
				unsigned long off, unsigned long len,
				void __user *buf);
asmlinkage long sys_pciconfig_iobase(long which, unsigned long bus, unsigned long devfn);

/* powerpc */
asmlinkage long sys_spu_run(int fd, __u32 __user *unpc,
				__u32 __user *ustatus);
asmlinkage long sys_spu_create(const char __user *name,
				unsigned int flags, umode_t mode, int fd);


/*
 * Deprecated system calls which are still defined in
 * include/uapi/asm-generic/unistd.h and wanted by >= 1 arch
 */

/* __ARCH_WANT_SYSCALL_NO_AT */
asmlinkage long sys_open(const char __user *filename,
				int flags, umode_t mode);
asmlinkage long sys_link(const char __user *oldname,
				const char __user *newname);
asmlinkage long sys_unlink(const char __user *pathname);
asmlinkage long sys_mknod(const char __user *filename, umode_t mode,
				unsigned dev);
asmlinkage long sys_chmod(const char __user *filename, umode_t mode);
asmlinkage long sys_chown(const char __user *filename,
				uid_t user, gid_t group);
asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
asmlinkage long sys_rmdir(const char __user *pathname);
asmlinkage long sys_lchown(const char __user *filename,
				uid_t user, gid_t group);
asmlinkage long sys_access(const char __user *filename, int mode);
asmlinkage long sys_rename(const char __user *oldname,
				const char __user *newname);
asmlinkage long sys_symlink(const char __user *old, const char __user *new);
#if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
asmlinkage long sys_stat64(const char __user *filename,
				struct stat64 __user *statbuf);
asmlinkage long sys_lstat64(const char __user *filename,
				struct stat64 __user *statbuf);
#endif

/* __ARCH_WANT_SYSCALL_NO_FLAGS */
asmlinkage long sys_pipe(int __user *fildes);
asmlinkage long sys_dup2(unsigned int oldfd, unsigned int newfd);
asmlinkage long sys_epoll_create(int size);
asmlinkage long sys_inotify_init(void);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask);

/* __ARCH_WANT_SYSCALL_OFF_T */
asmlinkage long sys_sendfile(int out_fd, int in_fd,
				off_t __user *offset, size_t count);
asmlinkage long sys_newstat(const char __user *filename,
				struct stat __user *statbuf);
asmlinkage long sys_newlstat(const char __user *filename,
				struct stat __user *statbuf);
asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);

/* __ARCH_WANT_SYSCALL_DEPRECATED */
asmlinkage long sys_alarm(unsigned int seconds);
asmlinkage long sys_getpgrp(void);
asmlinkage long sys_pause(void);
2019-11-05 18:10:01 +08:00
asmlinkage long sys_time(__kernel_old_time_t __user *tloc);
2019-01-07 07:33:08 +08:00
asmlinkage long sys_time32(old_time32_t __user *tloc);
y2038: Compile utimes()/futimesat() conditionally
There are four generations of utimes() syscalls: utime(), utimes(),
futimesat() and utimensat(), each one being a superset of the previous
one. For y2038 support, we have to add another one, which is the same
as the existing utimensat() but always passes 64-bit time_t based
timespec values.
There are currently 10 architectures that only use utimensat(), two
that use utimes(), futimesat() and utimensat() but not utime(), and 11
architectures that have all four, and those define __ARCH_WANT_SYS_UTIME
in order to get a sys_utime implementation. Since all the new
architectures only want utimensat(), moving all the legacy entry points
into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha
and ia64 grow a tiny bit as they now also get an unused sys_utime(),
but it didn't seem worth the extra complexity of adding yet another
ifdef for those.
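As a rough userspace illustration of why the newest generation can stand in for the older ones, the sketch below sets both timestamps of a file through utimensat() alone. The helper name set_file_times() is hypothetical and the code is not part of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: set the access and modification time of 'path' to the
 * same whole-second value, which is what utime() offers, using utimensat().
 * Returns 0 on success, or -1 with errno set on failure.
 */
int set_file_times(const char *path, time_t seconds)
{
	struct timespec ts[2] = {
		{ .tv_sec = seconds, .tv_nsec = 0 },	/* access time */
		{ .tv_sec = seconds, .tv_nsec = 0 },	/* modification time */
	};

	assert(path != NULL);

	/* AT_FDCWD resolves a relative path like utime() and utimes() do. */
	return utimensat(AT_FDCWD, path, ts, 0);
}
```

The microsecond resolution of utimes() and the directory file descriptor of futimesat() map onto the tv_nsec field and the first argument in the same way.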
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-17 15:11:58 +08:00
#ifdef __ARCH_WANT_SYS_UTIME
2018-03-26 03:50:11 +08:00
asmlinkage long sys_utime(char __user *filename,
				struct utimbuf __user *times);
2018-04-17 15:11:58 +08:00
asmlinkage long sys_utimes(char __user *filename,
2019-10-26 04:56:17 +08:00
				struct __kernel_old_timeval __user *utimes);
2018-04-17 15:11:58 +08:00
asmlinkage long sys_futimesat(int dfd, const char __user *filename,
2019-10-26 04:56:17 +08:00
				struct __kernel_old_timeval __user *utimes);
2018-04-17 15:11:58 +08:00
#endif
2019-01-07 07:33:08 +08:00
asmlinkage long sys_futimesat_time32(unsigned int dfd,
				const char __user *filename,
				struct old_timeval32 __user *t);
asmlinkage long sys_utime32(const char __user *filename,
				struct old_utimbuf32 __user *t);
asmlinkage long sys_utimes_time32(const char __user *filename,
				struct old_timeval32 __user *t);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_creat(const char __user *pathname, umode_t mode);
asmlinkage long sys_getdents(unsigned int fd,
				struct linux_dirent __user *dirent,
				unsigned int count);
asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
2019-10-26 04:56:17 +08:00
				fd_set __user *exp, struct __kernel_old_timeval __user *tvp);
2018-03-26 03:50:11 +08:00
asmlinkage long sys_poll(struct pollfd __user *ufds, unsigned int nfds,
				int timeout);
asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events,
				int maxevents, int timeout);
asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
asmlinkage long sys_vfork(void);
asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
asmlinkage long sys_send(int, void __user *, size_t, unsigned);
asmlinkage long sys_oldumount(char __user *name);
asmlinkage long sys_uselib(const char __user *library);
asmlinkage long sys_sysfs(int option,
				unsigned long arg1, unsigned long arg2);
asmlinkage long sys_fork(void);

2023-06-22 06:36:00 +08:00
/* obsolete */
2019-11-05 18:10:01 +08:00
asmlinkage long sys_stime(__kernel_old_time_t __user *tptr);
2019-01-07 07:33:08 +08:00
asmlinkage long sys_stime32(old_time32_t __user *tptr);
2018-03-26 03:50:11 +08:00

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_sigpending(old_sigset_t __user *uset);
asmlinkage long sys_sigprocmask(int how, old_sigset_t __user *set,
				old_sigset_t __user *oset);
#ifdef CONFIG_OLD_SIGSUSPEND
asmlinkage long sys_sigsuspend(old_sigset_t mask);
#endif

#ifdef CONFIG_OLD_SIGSUSPEND3
asmlinkage long sys_sigsuspend(int unused1, int unused2, old_sigset_t mask);
#endif

#ifdef CONFIG_OLD_SIGACTION
asmlinkage long sys_sigaction(int, const struct old_sigaction __user *,
				struct old_sigaction __user *);
#endif
asmlinkage long sys_sgetmask(void);
asmlinkage long sys_ssetmask(int newmask);
asmlinkage long sys_signal(int sig, __sighandler_t handler);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_nice(int increment);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
				unsigned long cmdline_len,
				const char __user *cmdline_ptr,
				unsigned long flags);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_waitpid(pid_t pid, int __user *stat_addr, int options);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
#ifdef CONFIG_HAVE_UID16
asmlinkage long sys_chown16(const char __user *filename,
				old_uid_t user, old_gid_t group);
asmlinkage long sys_lchown16(const char __user *filename,
				old_uid_t user, old_gid_t group);
asmlinkage long sys_fchown16(unsigned int fd, old_uid_t user, old_gid_t group);
asmlinkage long sys_setregid16(old_gid_t rgid, old_gid_t egid);
asmlinkage long sys_setgid16(old_gid_t gid);
asmlinkage long sys_setreuid16(old_uid_t ruid, old_uid_t euid);
asmlinkage long sys_setuid16(old_uid_t uid);
asmlinkage long sys_setresuid16(old_uid_t ruid, old_uid_t euid, old_uid_t suid);
asmlinkage long sys_getresuid16(old_uid_t __user *ruid,
				old_uid_t __user *euid, old_uid_t __user *suid);
asmlinkage long sys_setresgid16(old_gid_t rgid, old_gid_t egid, old_gid_t sgid);
asmlinkage long sys_getresgid16(old_gid_t __user *rgid,
				old_gid_t __user *egid, old_gid_t __user *sgid);
asmlinkage long sys_setfsuid16(old_uid_t uid);
asmlinkage long sys_setfsgid16(old_gid_t gid);
asmlinkage long sys_getgroups16(int gidsetsize, old_gid_t __user *grouplist);
asmlinkage long sys_setgroups16(int gidsetsize, old_gid_t __user *grouplist);
asmlinkage long sys_getuid16(void);
asmlinkage long sys_geteuid16(void);
asmlinkage long sys_getgid16(void);
asmlinkage long sys_getegid16(void);
#endif

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_socketcall(int call, unsigned long __user *args);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_stat(const char __user *filename,
				struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_lstat(const char __user *filename,
				struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_fstat(unsigned int fd,
				struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_readlink(const char __user *path,
				char __user *buf, int bufsiz);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_old_select(struct sel_arg_struct __user *arg);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_gethostname(char __user *name, int len);
asmlinkage long sys_uname(struct old_utsname __user *);
asmlinkage long sys_olduname(struct oldold_utsname __user *);
#ifdef __ARCH_WANT_SYS_OLD_GETRLIMIT
asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *rlim);
#endif

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_ipc(unsigned int call, int first, unsigned long second,
				unsigned long third, void __user *ptr, long fifth);

2023-06-22 06:36:00 +08:00
/* obsolete */
2018-03-26 03:50:11 +08:00
asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
				unsigned long prot, unsigned long flags,
				unsigned long fd, unsigned long pgoff);
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);


/*
 * Not a real system call, but a placeholder for syscalls which are
 * not implemented -- see kernel/sys_ni.c
 */
asmlinkage long sys_ni_syscall(void);

2018-04-05 17:53:01 +08:00
#endif /* CONFIG_ARCH_HAS_SYSCALL_WRAPPER */

2023-06-07 22:28:45 +08:00
asmlinkage long sys_ni_posix_timers(void);
2018-03-26 03:50:11 +08:00

2018-03-11 18:34:25 +08:00
/*
 * Kernel code should not call syscalls (i.e., sys_xyzyyz()) directly.
 * Instead, use one of the functions which work equivalently, such as
 * the ksys_xyzyyz() functions prototyped below.
 */
2018-03-11 18:34:41 +08:00
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count);
fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers
Using the fs-internal do_fchownat() wrapper allows us to get rid of
fs-internal calls to the sys_fchownat() syscall.
Introducing the ksys_fchown() helper and the ksys_{,l}chown() wrappers
allows us to avoid the in-kernel calls to the sys_{,l,f}chown() syscalls.
The ksys_ prefix denotes that these functions are meant as a drop-in
replacement for the syscalls. In particular, they use the same calling
convention as sys_{,l,f}chown().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
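The relationship these wrappers rely on can also be seen from userspace, where chown(2) and lchown(2) behave like fchownat(2) relative to AT_FDCWD. The sketch below mirrors that layering; the demo_ function names are hypothetical, and this is an analogy rather than the in-kernel code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Like chown(2): resolve relative to the cwd and follow symlinks. */
int demo_chown(const char *path, uid_t user, gid_t group)
{
	assert(path != NULL);
	return fchownat(AT_FDCWD, path, user, group, 0);
}

/* Like lchown(2): same lookup, but operate on the symlink itself. */
int demo_lchown(const char *path, uid_t user, gid_t group)
{
	assert(path != NULL);
	return fchownat(AT_FDCWD, path, user, group, AT_SYMLINK_NOFOLLOW);
}
```

The only difference between the two wrappers is the flag argument, which is the same shape the inline ksys_chown() and ksys_lchown() stubs take around do_fchownat().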
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-03-11 18:34:55 +08:00
|
|
|
int ksys_fchown(unsigned int fd, uid_t user, gid_t group);
|
2018-03-14 04:56:26 +08:00
|
|
|
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count);
|
2018-03-15 05:35:11 +08:00
|
|
|
void ksys_sync(void);
|
2018-03-11 18:34:42 +08:00
|
|
|
int ksys_unshare(unsigned long unshare_flags);
|
2018-03-16 19:36:06 +08:00
|
|
|
int ksys_setsid(void);
|
2018-03-11 18:34:47 +08:00
|
|
|
int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
|
|
|
|
unsigned int flags);
|
2018-03-20 00:38:31 +08:00
|
|
|
ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
|
|
|
|
loff_t pos);
|
|
|
|
ssize_t ksys_pwrite64(unsigned int fd, const char __user *buf,
|
|
|
|
size_t count, loff_t pos);
|
2018-03-20 00:46:32 +08:00
|
|
|
int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len);
|
2018-03-11 18:34:45 +08:00
|
|
|
#ifdef CONFIG_ADVISE_SYSCALLS
|
|
|
|
int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
|
|
|
|
#else
|
|
|
|
static inline int ksys_fadvise64_64(int fd, loff_t offset, loff_t len,
|
|
|
|
int advice)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
#endif
|
2018-03-11 18:34:46 +08:00
|
|
|
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
|
|
|
|
unsigned long prot, unsigned long flags,
|
|
|
|
unsigned long fd, unsigned long pgoff);
|
2018-03-20 00:51:36 +08:00
|
|
|
ssize_t ksys_readahead(int fd, loff_t offset, size_t count);
|
2019-01-16 21:15:20 +08:00
|
|
|
int ksys_ipc(unsigned int call, int first, unsigned long second,
	     unsigned long third, void __user *ptr, long fifth);
int compat_ksys_ipc(u32 call, int first, int second,
		    u32 third, u32 ptr, u32 fifth);
|
2018-03-11 18:34:39 +08:00
|
|
|
|
2018-03-11 18:34:47 +08:00
|
|
|
/*
 * The following kernel syscall equivalents are just wrappers to fs-internal
 * functions. Therefore, provide stubs to be inlined at the callsites.
 */
|
fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers
Using the fs-internal do_fchownat() wrapper allows us to get rid of
fs-internal calls to the sys_fchownat() syscall.
Introducing the ksys_fchown() helper and the ksys_{,l}chown() wrappers
allows us to avoid the in-kernel calls to the sys_{,l,f}chown() syscalls.
The ksys_ prefix denotes that these functions are meant as a drop-in
replacement for the syscalls. In particular, they use the same calling
convention as sys_{,l,f}chown().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
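The wrapper pattern described in this commit message can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: `do_fchownat()` is replaced by a mock that only records its arguments, the `__user` annotation is dropped, and `struct chown_call` and `last_call` are hypothetical names introduced for illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <stddef.h>
#include <sys/types.h>  /* uid_t, gid_t */

/* Hypothetical record of the most recent call, so forwarding can be checked. */
struct chown_call {
	int dfd;
	const char *filename;
	uid_t user;
	gid_t group;
	int flag;
};

static struct chown_call last_call;

/* Mock stand-in for the fs-internal helper: it only records its arguments. */
static int do_fchownat(int dfd, const char *filename, uid_t user,
		       gid_t group, int flag)
{
	assert(filename != NULL);
	last_call = (struct chown_call){ dfd, filename, user, group, flag };
	return 0;
}

/* Same arguments as chown(2): resolve relative to the cwd, follow symlinks. */
static inline long ksys_chown(const char *filename, uid_t user, gid_t group)
{
	return do_fchownat(AT_FDCWD, filename, user, group, 0);
}

/* Same arguments as lchown(2): as above, but act on the symlink itself. */
static inline long ksys_lchown(const char *filename, uid_t user, gid_t group)
{
	return do_fchownat(AT_FDCWD, filename, user, group,
			   AT_SYMLINK_NOFOLLOW);
}
```

In this sketch, `ksys_chown("/tmp/a", 1000, 1000)` records `AT_FDCWD` with a flag of 0, while `ksys_lchown()` records the same call with `AT_SYMLINK_NOFOLLOW`. The two wrappers differ only in that one flag, which is why they can be inlined at the callsites.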
2018-03-11 18:34:55 +08:00
|
|
|
extern int do_fchownat(int dfd, const char __user *filename, uid_t user,
		       gid_t group, int flag);

static inline long ksys_chown(const char __user *filename, uid_t user,
			      gid_t group)
{
	return do_fchownat(AT_FDCWD, filename, user, group, 0);
}

static inline long ksys_lchown(const char __user *filename, uid_t user,
			       gid_t group)
{
	return do_fchownat(AT_FDCWD, filename, user, group,
			     AT_SYMLINK_NOFOLLOW);
}
|
|
|
|
|
2018-03-11 18:34:54 +08:00
|
|
|
extern long do_sys_ftruncate(unsigned int fd, loff_t length, int small);
|
|
|
|
|
2020-06-10 19:48:51 +08:00
|
|
|
static inline long ksys_ftruncate(unsigned int fd, loff_t length)
|
2018-03-11 18:34:54 +08:00
|
|
|
{
	return do_sys_ftruncate(fd, length, 1);
}
|
|
|
|
|
2018-03-20 00:32:11 +08:00
|
|
|
extern long do_sys_truncate(const char __user *pathname, loff_t length);

static inline long ksys_truncate(const char __user *pathname, loff_t length)
{
	return do_sys_truncate(pathname, length);
}
|
|
|
|
|
2018-07-11 21:56:50 +08:00
|
|
|
static inline unsigned int ksys_personality(unsigned int personality)
{
	unsigned int old = current->personality;

	if (personality != 0xffffffff)
		set_personality(personality);

	return old;
}
|
|
|
|
|
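`ksys_personality()` above treats `0xffffffff` as a query-only value: the old personality is returned and nothing is changed. A minimal userspace sketch of that behaviour follows. `sketch_current_personality` stands in for `current->personality`, a plain assignment stands in for `set_personality()` (an assumption, since the real helper may be architecture-specific), and all `sketch_` names are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for current->personality. */
static unsigned int sketch_current_personality;

/* Returns the previous value; 0xffffffff means "query only, change nothing". */
static unsigned int sketch_ksys_personality(unsigned int personality)
{
	unsigned int old = sketch_current_personality;

	if (personality != 0xffffffff)
		sketch_current_personality = personality; /* assumed setter */

	return old;
}

/* Usage: set a value, then read it back without modifying it. */
static void sketch_personality_demo(void)
{
	sketch_current_personality = 0;
	assert(sketch_ksys_personality(8) == 0);          /* set, returns old */
	assert(sketch_ksys_personality(0xffffffff) == 8); /* query only */
	assert(sketch_current_personality == 8);          /* unchanged by query */
}
```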
ipc: fix sparc64 ipc() wrapper
Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
to a commit from my y2038 series in linux-5.1, as I missed the custom
sys_ipc() wrapper that sparc64 uses in place of the generic version that
I patched.
The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
now do not allow being called with the IPC_64 flag any more, resulting
in a -EINVAL error when they don't recognize the command.
Instead, the correct way to do this now is to call the internal
ksys_old_{sem,shm,msg}ctl() functions to select the API version.
As we generally move towards these functions anyway, change all of
sparc_ipc() to consistently use those in place of the sys_*() versions,
and move the required ksys_*() declarations into linux/syscalls.h
The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
errors when ipc is disabled.
Reported-by: Matt Turner <mattst88@gmail.com>
Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
Cc: stable@vger.kernel.org
Tested-by: Matt Turner <mattst88@gmail.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
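The failure mode described in this commit message can be sketched in userspace. This is an illustration under stated assumptions, not the kernel code: all `sketch_` functions and `SKETCH_` constants are hypothetical, with the constants intended to mirror the usual `IPC_STAT`, `IPC_64` and `IPC_OLD` values. The new-style entry point passes the command through unchanged, so a command that still carries the version flag is not recognized. The old-style entry point separates the flag from the command first.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_IPC_STAT 2       /* intended to mirror IPC_STAT */
#define SKETCH_IPC_64   0x0100  /* intended to mirror the IPC_64 version flag */
#define SKETCH_IPC_OLD  0       /* intended to mirror IPC_OLD */

/* Hypothetical common handler: it only knows bare command numbers. */
static long sketch_ctl_common(int cmd, int version)
{
	assert(version == SKETCH_IPC_64 || version == SKETCH_IPC_OLD);

	switch (cmd) {
	case SKETCH_IPC_STAT:
		return version; /* report which layout was selected */
	default:
		return -EINVAL; /* unknown command, including cmd | IPC_64 */
	}
}

/* New-style entry point: cmd is used as-is, the layout is always IPC_64. */
static long sketch_new_ctl(int cmd)
{
	return sketch_ctl_common(cmd, SKETCH_IPC_64);
}

/* Old-style entry point: split the version flag from the command first. */
static long sketch_old_ctl(int cmd)
{
	int version = (cmd & SKETCH_IPC_64) ? SKETCH_IPC_64 : SKETCH_IPC_OLD;

	cmd &= ~SKETCH_IPC_64;
	return sketch_ctl_common(cmd, version);
}
```

Here `sketch_new_ctl(SKETCH_IPC_STAT | SKETCH_IPC_64)` returns `-EINVAL`, which matches the symptom reported on sparc64, while `sketch_old_ctl()` accepts the same argument and selects the 64-bit layout.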
2019-09-05 22:48:38 +08:00
|
|
|
/* for __ARCH_WANT_SYS_IPC */
long ksys_semtimedop(int semid, struct sembuf __user *tsops,
		     unsigned int nsops,
		     const struct __kernel_timespec __user *timeout);
long ksys_semget(key_t key, int nsems, int semflg);
long ksys_old_semctl(int semid, int semnum, int cmd, unsigned long arg);
long ksys_msgget(key_t key, int msgflg);
long ksys_old_msgctl(int msqid, int cmd, struct msqid_ds __user *buf);
long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
		 long msgtyp, int msgflg);
long ksys_msgsnd(int msqid, struct msgbuf __user *msgp, size_t msgsz,
		 int msgflg);
long ksys_shmget(key_t key, size_t size, int shmflg);
long ksys_shmdt(char __user *shmaddr);
long ksys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems,
			    unsigned int nsops,
			    const struct old_timespec32 __user *timeout);
|
2021-08-11 15:30:23 +08:00
|
|
|
long __do_semtimedop(int semid, struct sembuf *tsems, unsigned int nsops,
		     const struct timespec64 *timeout,
		     struct ipc_namespace *ns);
|
2019-09-05 22:48:38 +08:00
|
|
|
|
2020-07-17 14:23:15 +08:00
|
|
|
int __sys_getsockopt(int fd, int level, int optname, char __user *optval,
		     int __user *optlen);
int __sys_setsockopt(int fd, int level, int optname, char __user *optval,
		     int optlen);
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|