License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
  SPDX license identifier                             # files
  ---------------------------------------------------|-------
  GPL-2.0                                               11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was tagged "GPL-2.0". The results were:
  SPDX license identifier                             # files
  ---------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
  SPDX license identifier                             # files
  ---------------------------------------------------|------
  GPL-2.0 WITH Linux-syscall-note                         270
  GPL-2.0+ WITH Linux-syscall-note                        169
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
  LGPL-2.1+ WITH Linux-syscall-note                        15
  GPL-1.0+ WITH Linux-syscall-note                         14
  ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
  LGPL-2.0+ WITH Linux-syscall-note                         4
  LGPL-2.1 WITH Linux-syscall-note                          3
  ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
  ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * linux/ipc/util.h
 * Copyright (C) 1999 Christoph Rohland
 *
 * ipc helper functions (c) 1999 Manfred Spraul <manfred@colorfullife.com>
 * namespaces support.      2006 OpenVZ, SWsoft Inc.
 *                               Pavel Emelianov <xemul@openvz.org>
 */

#ifndef _IPC_UTIL_H
#define _IPC_UTIL_H

#include <linux/unistd.h>
#include <linux/err.h>
#include <linux/ipc_namespace.h>

/*
 * The IPC ID contains 2 separate numbers - index and sequence number.
 * By default,
 *   bits  0-14: index (32k, 15 bits)
 *   bits 15-30: sequence number (64k, 16 bits)
 *
 * When IPCMNI extension mode is turned on, the composition changes:
 *   bits  0-23: index (16M, 24 bits)
 *   bits 24-30: sequence number (128, 7 bits)
 */
#define IPCMNI_SHIFT		15
#define IPCMNI_EXTEND_SHIFT	24
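The bit layout described in the comment above can be tried out in user space. The following is a minimal sketch, not kernel code: the `demo_*` helpers are hypothetical names introduced for illustration, and the shift is passed explicitly instead of being read from a global setting.

```c
#include <assert.h>
#include <limits.h>

/* Shift values copied from the defines above. */
#define IPCMNI_SHIFT        15
#define IPCMNI_EXTEND_SHIFT 24

/* Hypothetical helper: combine an index and a sequence number into one id. */
static int demo_build_id(int idx, int seq, int shift)
{
	assert(idx >= 0 && idx < (1 << shift));
	return (seq << shift) + idx;
}

/* Low bits of the id: the index. */
static int demo_id_to_idx(int id, int shift)
{
	return id & ((1 << shift) - 1);
}

/* Bits above the index: the sequence number. */
static int demo_id_to_seq(int id, int shift)
{
	return id >> shift;
}

/* Largest sequence number that still fits below bit 31. */
static int demo_seq_max(int shift)
{
	return INT_MAX >> shift;
}
```

With the default shift, index 5 and sequence number 3 combine to the id 98309; in extended mode only 128 sequence values (0 to 127) remain, which is why id reuse becomes more likely there.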
ipc: do cyclic id allocation for the ipc object.
For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.
To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.
To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcmni mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elements
Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.
- for extended mode:
Cycling always creates at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added; without cyclic allocation, the 3rd level is added
only with more than 4095 active objects.
For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.
Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.
2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.
3) The worst-case reuse cycle time is unfortunately unaffected:
If you have 2^24-1 ipc objects allocated, and get/remove the last
possible element in a loop, then the id is reused after 128
get/remove pairs.
Performance check:
A microbenchmark that performs no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.
1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%
Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).
V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).
[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 06:46:36 +08:00
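The two limits from the commit message above can be written down as a small user-space sketch. `demo_cycle_limit` is a hypothetical helper, not the kernel function; the final clamp to the size of the index space is an assumption added for the sketch, and the value 64 for RADIX_TREE_MAP_SIZE used in the usage note is likewise an assumption inferred from the "63 active elements" figure in the message.

```c
#include <assert.h>

/* Hypothetical helper: upper bound of the range that cyclic allocation walks. */
static int demo_cycle_limit(int in_use, int min_cycle, int mni)
{
	int limit = in_use * 3 / 2;	/* 1) cycle over in_use * 1.5 */

	if (limit < min_cycle)		/* 2) but over at least min_cycle entries */
		limit = min_cycle;
	if (limit > mni)		/* assumption: never exceed the index space */
		limit = mni;
	return limit;
}
```

For example, with a minimum cycle of 64, ten objects in use still cycle over 64 entries, while 100 objects in use cycle over 150. With the extended-mode minimum of 4096, the limit stays at 4096 for up to about 2730 objects, matching the threshold quoted in the message.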
#define IPCMNI_EXTEND_MIN_CYCLE	(RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE)
#define IPCMNI			(1 << IPCMNI_SHIFT)
#define IPCMNI_EXTEND		(1 << IPCMNI_EXTEND_SHIFT)

#ifdef CONFIG_SYSVIPC_SYSCTL
extern int ipc_mni;
extern int ipc_mni_shift;
extern int ipc_min_cycle;

#define ipcmni_seq_shift()	ipc_mni_shift
#define IPCMNI_IDX_MASK		((1 << ipc_mni_shift) - 1)

#else /* CONFIG_SYSVIPC_SYSCTL */

#define ipc_mni			IPCMNI
#define ipc_min_cycle		((int)RADIX_TREE_MAP_SIZE)
#define ipcmni_seq_shift()	IPCMNI_SHIFT
#define IPCMNI_IDX_MASK		((1 << IPCMNI_SHIFT) - 1)
#endif /* CONFIG_SYSVIPC_SYSCTL */

void sem_init(void);
void msg_init(void);
void shm_init(void);

struct ipc_namespace;
struct pid_namespace;

#ifdef CONFIG_POSIX_MQUEUE
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 10:01:10 +08:00
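The "decrement, and take the lock only if the count may reach zero" pattern that the message above relies on can be sketched in user space with C11 atomics. This is an analogue for illustration, not the kernel's atomic_dec_and_lock(): `demo_lock` is a toy spinlock standing in for mq_lock, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for mq_lock; user space only. */
struct demo_lock { atomic_flag held; };

static void demo_lock_acquire(struct demo_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void demo_lock_release(struct demo_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Drop one reference. Returns true with the lock held if the count
 * reached zero, so the caller can tear the object down under the lock.
 * Returns false, without the lock, otherwise.
 */
static bool demo_dec_and_lock(atomic_int *count, struct demo_lock *l)
{
	int old = atomic_load(count);

	/* Fast path: the count stays above zero, no lock needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return false;
	}

	/* Slow path: this may be the last reference, decide under the lock. */
	demo_lock_acquire(l);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;
	demo_lock_release(l);
	return false;
}
```

Because the final decrement happens with the lock held, a reader that looks up the object under the same lock either sees a live object or no object at all, which is the race described in points a) and b) of the message.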
extern void mq_clear_sbinfo(struct ipc_namespace *ns);
extern void mq_put_mnt(struct ipc_namespace *ns);
#else
static inline void mq_clear_sbinfo(struct ipc_namespace *ns) { }
static inline void mq_put_mnt(struct ipc_namespace *ns) { }
#endif

#ifdef CONFIG_SYSVIPC
void sem_init_ns(struct ipc_namespace *ns);
int msg_init_ns(struct ipc_namespace *ns);
void shm_init_ns(struct ipc_namespace *ns);

void sem_exit_ns(struct ipc_namespace *ns);
void msg_exit_ns(struct ipc_namespace *ns);
void shm_exit_ns(struct ipc_namespace *ns);
#else
static inline void sem_init_ns(struct ipc_namespace *ns) { }
static inline int msg_init_ns(struct ipc_namespace *ns) { return 0; }
static inline void shm_init_ns(struct ipc_namespace *ns) { }

static inline void sem_exit_ns(struct ipc_namespace *ns) { }
static inline void msg_exit_ns(struct ipc_namespace *ns) { }
static inline void shm_exit_ns(struct ipc_namespace *ns) { }
#endif

/*
 * Structure that holds the parameters needed by the ipc operations
 * (see after)
 */
struct ipc_params {
	key_t key;
	int flg;
	union {
		size_t size;	/* for shared memories */
		int nsems;	/* for semaphores */
	} u;			/* holds the getnew() specific param */
};

/*
 * Structure that holds some ipc operations. This structure is used to unify
 * the calls to sys_msgget(), sys_semget(), sys_shmget()
 *      . routine to call to create a new ipc object. Can be one of newque,
 *        newary, newseg
 *      . routine to call to check permissions for a new ipc object.
 *        Can be one of security_msg_associate, security_sem_associate,
 *        security_shm_associate
 *      . routine to call for an extra check if needed
 */
struct ipc_ops {
	int (*getnew)(struct ipc_namespace *, struct ipc_params *);
	int (*associate)(struct kern_ipc_perm *, int);
	int (*more_checks)(struct kern_ipc_perm *, struct ipc_params *);
};

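How an operations table like struct ipc_ops lets one generic "get" path serve several object types can be shown with a simplified user-space sketch. Everything here is hypothetical: the `demo_*` types and functions are stand-ins invented for illustration, the object table is a plain array, and the order of the checks is an assumption, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; all names are hypothetical. */
struct demo_params { int key; int flg; int nsems; };
struct demo_obj    { int key; int nsems; int in_use; };

struct demo_ops {
	int (*getnew)(struct demo_obj *slot, struct demo_params *p);
	int (*associate)(struct demo_obj *obj, int flg);
	int (*more_checks)(struct demo_obj *obj, struct demo_params *p);
};

/* Type-specific creation routine (plays the role of newary). */
static int demo_newary(struct demo_obj *slot, struct demo_params *p)
{
	slot->key = p->key;
	slot->nsems = p->nsems;
	slot->in_use = 1;
	return 0;
}

/* Type-specific permission hook; always permits in this sketch. */
static int demo_sem_associate(struct demo_obj *obj, int flg)
{
	(void)obj;
	(void)flg;
	return 0;
}

/* Type-specific extra check: reject requests larger than the existing set. */
static int demo_sem_more_checks(struct demo_obj *obj, struct demo_params *p)
{
	return p->nsems > obj->nsems ? -1 : 0;
}

/* Generic path: look the key up, otherwise create through ops->getnew. */
static int demo_get(struct demo_obj *tbl, size_t n,
		    const struct demo_ops *ops, struct demo_params *p)
{
	size_t i, free_slot = n;

	for (i = 0; i < n; i++) {
		if (!tbl[i].in_use) {
			if (free_slot == n)
				free_slot = i;	/* remember the first hole */
			continue;
		}
		if (tbl[i].key != p->key)
			continue;
		if (ops->more_checks && ops->more_checks(&tbl[i], p))
			return -1;
		if (ops->associate(&tbl[i], p->flg))
			return -1;
		return (int)i;
	}
	if (free_slot == n || ops->getnew(&tbl[free_slot], p))
		return -1;
	return (int)free_slot;
}

static const struct demo_ops demo_sem_ops = {
	.getnew = demo_newary,
	.associate = demo_sem_associate,
	.more_checks = demo_sem_more_checks,
};
```

A message-queue or shared-memory variant would supply its own three callbacks and reuse `demo_get` unchanged, which is the unification the comment on struct ipc_ops describes.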
struct seq_file;
struct ipc_ids;

void ipc_init_ids(struct ipc_ids *ids);
#ifdef CONFIG_PROC_FS
void __init ipc_init_proc_interface(const char *path, const char *header,
		int ids, int (*show)(struct seq_file *, void *));
struct pid_namespace *ipc_seq_pid_ns(struct seq_file *);
#else
#define ipc_init_proc_interface(path, header, ids, show) do {} while (0)
#endif

#define IPC_SEM_IDS	0
#define IPC_MSG_IDS	1
#define IPC_SHM_IDS	2

#define ipcid_to_idx(id)  ((id) & IPCMNI_IDX_MASK)
#define ipcid_to_seqx(id) ((id) >> ipcmni_seq_shift())
#define ipcid_seq_max()	  (INT_MAX >> ipcmni_seq_shift())

/* must be called with ids->rwsem acquired for writing */
int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);

/* must be called with both locks acquired. */
void ipc_rmid(struct ipc_ids *, struct kern_ipc_perm *);

ipc: optimize semget/shmget/msgget for lots of keys
ipc_findkey() used to scan all objects to look for the wanted key. This
is slow when using a high number of keys. This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that a lookup ceases to be O(n).
This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]
Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:
    for (int i = 0, k=0x424242; i < KEYS; ++i)
        semget(k++, 1, IPC_CREAT | 0600);

                 total      total       max single    max single
   KEYS        without       with     call without     call with
      1            3.5        4.9 µs           3.5           4.9
     10            7.6        8.6 µs           3.7           4.7
     32           16.2       15.9 µs           4.3           5.3
    100           72.9       41.8 µs           3.7           4.7
   1000        5,630.0      502.0 µs             *             *
  10000    1,340,000.0    7,240.0 µs             *             *
  31900   17,600,000.0   22,200.0 µs             *             *
*: unreliable measure: high variance
The duration for a lookup-only usage was obtained by the same loop once
the keys are present:
                 total      total       max single    max single
   KEYS        without       with     call without     call with
      1            2.1        2.5 µs           2.1           2.5
     10            4.5        4.8 µs           2.2           2.3
     32           13.0       10.8 µs           2.3           2.8
    100           82.9       25.1 µs             *           2.3
   1000        5,780.0      217.0 µs             *             *
  10000    1,470,000.0    2,520.0 µs             *             *
  31900   17,400,000.0    7,810.0 µs             *             *
Finally, executing each semget() in a new process gave, when still
summing only the durations of these syscalls:
creation:
                 total       total
   KEYS        without        with
      1            3.7         5.0 µs
     10           32.9        36.7 µs
     32          125.0       109.0 µs
    100          523.0       353.0 µs
   1000       20,300.0     3,280.0 µs
  10000    2,470,000.0    46,700.0 µs
  31900   27,800,000.0   219,000.0 µs
lookup-only:
                 total       total
   KEYS        without        with
      1            2.5         2.7 µs
     10           25.4        24.4 µs
     32          106.0        72.6 µs
    100          591.0       352.0 µs
   1000       22,400.0     2,250.0 µs
  10000    2,510,000.0    25,700.0 µs
  31900   28,200,000.0   115,000.0 µs
[1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop
Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09 07:17:55 +08:00
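The difference between scanning every object and looking a key up through a hash table, as described in the message above, can be sketched in user space. Note the swap: the kernel change uses an rhashtable, while this sketch uses a small fixed-size open-addressing table with linear probing purely for illustration. All `demo_*` names, the table size, and the hash constant are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 64	/* power of two; hypothetical fixed size, no resizing */

struct demo_entry { int used; int key; int id; };

/* Multiplicative hash reduced to a slot number. */
static size_t demo_hash(int key)
{
	return ((unsigned int)key * 2654435761u) & (DEMO_SLOTS - 1);
}

/* Insert key -> id with linear probing; returns -1 if the table is full. */
static int demo_insert(struct demo_entry *tbl, int key, int id)
{
	size_t i, h = demo_hash(key);

	for (i = 0; i < DEMO_SLOTS; i++) {
		struct demo_entry *e = &tbl[(h + i) & (DEMO_SLOTS - 1)];

		if (!e->used) {
			e->used = 1;
			e->key = key;
			e->id = id;
			return 0;
		}
	}
	return -1;
}

/* Expected O(1): probe from the hashed slot until the key or a hole. */
static int demo_find_hashed(const struct demo_entry *tbl, int key)
{
	size_t i, h = demo_hash(key);

	for (i = 0; i < DEMO_SLOTS; i++) {
		const struct demo_entry *e = &tbl[(h + i) & (DEMO_SLOTS - 1)];

		if (!e->used)
			return -1;
		if (e->key == key)
			return e->id;
	}
	return -1;
}

/* O(n): the old approach of visiting every object. */
static int demo_find_scan(const int *keys, size_t n, int key)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (keys[i] == key)
			return (int)i;
	return -1;
}
```

Both lookups return the same id for a present key; only the number of entries visited differs, which is what drives the gap between the "without" and "with" columns once the number of keys grows. The sketch does not support deletion or duplicate keys.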
/* must be called with both locks acquired. */
void ipc_set_key_private(struct ipc_ids *, struct kern_ipc_perm *);

/* must be called with ipcp locked */
int ipcperms(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp, short flg);

/**
 * ipc_get_maxidx - get the highest assigned index
 * @ids: ipc identifier set
 *
ipc/util.c: use binary search for max_idx
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used index in the kernel's internal array recording information about all
SysV objects of the requested type for the current namespace. (This
information can be used with repeated ..._STAT or ..._STAT_ANY operations
to obtain information about all SysV objects on the system.)
There is a cache for this value. But when the cache needs to be updated,
then the highest used index is determined by looping over all possible
values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the
index values do not need to be consecutive.
With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a
loop, I have observed a performance increase of around factor 13000.
As there is no get_last() function for idr structures: Implement a
"get_last()" using a binary search.
As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.
[akpm@linux-foundation.org: tweak comment, fix typo]
Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 09:57:18 +08:00
|
|
|
* The function returns the highest assigned index for @ids. The function
|
|
|
|
* doesn't scan the idr tree, it uses a cached value.
|
|
|
|
*
|
2017-11-18 07:31:18 +08:00
|
|
|
* Called with ipc_ids.rwsem held for reading.
|
|
|
|
*/
|
2018-08-22 13:02:00 +08:00
|
|
|
static inline int ipc_get_maxidx(struct ipc_ids *ids)
|
2017-11-18 07:31:18 +08:00
|
|
|
{
|
|
|
|
if (ids->in_use == 0)
|
|
|
|
return -1;
|
|
|
|
|
2019-05-15 06:46:29 +08:00
|
|
|
if (ids->in_use == ipc_mni)
|
|
|
|
return ipc_mni - 1;
|
2017-11-18 07:31:18 +08:00
|
|
|
|
2018-08-22 13:02:00 +08:00
|
|
|
return ids->max_idx;
|
2017-11-18 07:31:18 +08:00
|
|
|
}
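The "get_last()" binary search described in the "ipc/util.c: use binary search for max_idx" commit message can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the kernel's code: the `bool` array and the helper `demo_any_used_from()` are hypothetical stand-ins for the idr and its "find the next used entry at or above this index" lookup, and `demo_search_maxidx()` is a made-up name. The idea is to build the value (highest index + 1) one bit at a time, from the most significant bit down.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for an idr lookup: is any index >= start in use?
 * A real idr answers this with a tree walk; the linear scan here only
 * keeps the sketch self-contained.
 */
static bool demo_any_used_from(const bool *used, int size, int start)
{
	for (int i = start; i < size; i++)
		if (used[i])
			return true;
	return false;
}

/*
 * Return the highest used index in used[0..size-1], or -1 if none is used.
 * retval accumulates (highest index + 1) bit by bit.
 */
static int demo_search_maxidx(const bool *used, int size)
{
	int bits = 0;
	int retval = 0;

	/* Smallest bit count such that 2^bits >= size. */
	while ((1 << bits) < size)
		bits++;

	for (int i = bits; i >= 0; i--) {
		/*
		 * 0 is a valid index, so probe one below the candidate:
		 * "is anything used at or above candidate - 1?"
		 */
		int probe = (retval | (1 << i)) - 1;

		if (demo_any_used_from(used, size, probe))
			retval |= 1 << i;
	}
	return retval - 1;
}
```

With a lookup that costs one tree walk, this needs only about log2(size) probes instead of a loop over every possible index, which is the effect the commit message describes.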
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* For allocations that need to be freed by RCU.
|
|
|
|
* Objects are reference counted, they start with reference count 1.
|
|
|
|
* getref increases the refcount, the putref call that reduces the refcount
|
|
|
|
* to 0 schedules the rcu destruction. Caller must guarantee locking.
|
2017-07-13 05:35:34 +08:00
|
|
|
*
|
|
|
|
* refcount is initialized by ipc_addid(), before that point call_rcu()
|
|
|
|
* must be used.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2018-08-22 13:02:04 +08:00
|
|
|
bool ipc_rcu_getref(struct kern_ipc_perm *ptr);
|
2017-07-13 05:34:41 +08:00
|
|
|
void ipc_rcu_putref(struct kern_ipc_perm *ptr,
|
|
|
|
void (*func)(struct rcu_head *head));
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2015-07-01 05:58:42 +08:00
|
|
|
struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
|
|
|
|
void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
|
2012-02-08 08:54:11 +08:00
|
|
|
int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
|
2018-08-22 13:01:34 +08:00
|
|
|
struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
|
2013-05-01 10:15:24 +08:00
|
|
|
struct ipc_ids *ids, int id, int cmd,
|
|
|
|
struct ipc64_perm *perm, int extra_perm);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-03-23 13:22:05 +08:00
|
|
|
static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
|
|
|
|
{
|
|
|
|
struct pid *old = *pos;
|
|
|
|
if (old != pid) {
|
|
|
|
*pos = get_pid(pid);
|
|
|
|
put_pid(old);
|
|
|
|
}
|
|
|
|
}
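The swap pattern in ipc_update_pid() (take a reference on the new value before dropping the reference on the old one, and do nothing when both are the same) can be illustrated with a small userspace analogue. Everything below is hypothetical: `struct demo_ref`, `demo_get()` and `demo_put()` merely imitate a reference-counted object with NULL-tolerant get/put helpers and are not the kernel's `struct pid` API.

```c
#include <stddef.h>

/* Hypothetical reference-counted object standing in for struct pid. */
struct demo_ref {
	int refs;
};

/* Take a reference; NULL is tolerated and passed through. */
static struct demo_ref *demo_get(struct demo_ref *obj)
{
	if (obj)
		obj->refs++;
	return obj;
}

/* Drop a reference; NULL is tolerated. */
static void demo_put(struct demo_ref *obj)
{
	if (obj)
		obj->refs--;
}

/*
 * Point *pos at new_obj. The slot owns one reference on whatever it
 * points to, so a reference is taken on the new object and the one on
 * the old object is released. If nothing changes, no counter is touched.
 */
static void demo_update_slot(struct demo_ref **pos, struct demo_ref *new_obj)
{
	struct demo_ref *old = *pos;

	if (old != new_obj) {
		*pos = demo_get(new_obj);
		demo_put(old);
	}
}
```

Skipping the get/put pair when the value is unchanged avoids needless updates of the shared counter in the common case where the same process touches the object repeatedly.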
|
|
|
|
|
ipc: rename old-style shmctl/semctl/msgctl syscalls
The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.
For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.
The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.
As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.
A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-01-01 05:22:40 +08:00
|
|
|
#ifdef CONFIG_ARCH_WANT_IPC_PARSE_VERSION
|
2014-01-28 09:07:04 +08:00
|
|
|
int ipc_parse_version(int *cmd);
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|
|
|
|
|
|
|
|
extern void free_msg(struct msg_msg *msg);
|
ipc, msg: fix message length check for negative values
On 64 bit systems the test for negative message sizes is bogus as the
size, which may be positive when evaluated as a long, will get truncated
to an int when passed to load_msg(). So a long might very well contain a
positive value but when truncated to an int it would become negative.
That in combination with a small negative value of msg_ctlmax (which will
be promoted to an unsigned type for the comparison against msgsz, making
it a big positive value and therefore make it pass the check) will lead to
two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
small buffer as the addition of alen is effectively a subtraction. 2/ The
copy_from_user() call in load_msg() will first overflow the buffer with
userland data and then, when the userland access generates an access
violation, the fixup handler copy_user_handle_tail() will try to fill the
remainder with zeros -- roughly 4GB. That almost instantly results in a
system crash or reset.
,-[ Reproducer (needs to be run as root) ]--
| #include <sys/stat.h>
| #include <sys/msg.h>
| #include <unistd.h>
| #include <fcntl.h>
|
| int main(void) {
| long msg = 1;
| int fd;
|
| fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
| write(fd, "-1", 2);
| close(fd);
|
| msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
|
| return 0;
| }
'---
Fix the issue by preventing msgsz from getting truncated by consistently
using size_t for the message length. This way the size checks in
do_msgsnd() could still be passed with a negative value for msg_ctlmax but
we would fail on the buffer allocation in that case and error out.
Also change the type of m_ts from int to size_t to avoid similar nastiness
in other code paths -- it is used in similar constructs, i.e. signed vs.
unsigned checks. It should never become negative under normal
circumstances, though.
Setting msg_ctlmax to a negative value is an odd configuration and should
be prevented. As that might break existing userland, it will be handled
in a separate commit so it could easily be reverted and reworked without
reintroducing the above described bug.
Hardening mechanisms for user copy operations would have caught that bug
early -- e.g. checking slab object sizes on user copy operations as the
usercopy feature of the PaX patch does. Or, for that matter, detect the
long vs. int sign change due to truncation, as the size overflow plugin
of the very same patch does.
[akpm@linux-foundation.org: fix i386 min() warnings]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [ v2.3.27+ -- yes, that old ;) ]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:11:47 +08:00
|
|
|
extern struct msg_msg *load_msg(const void __user *src, size_t len);
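The two arithmetic effects described in the "fix message length check for negative values" commit message can be reproduced in isolation. The sketch below is a userspace illustration, not kernel code; the function names are hypothetical. Note that converting an out-of-range value to `int` is implementation-defined in C; the wrap-around shown is what common compilers such as GCC and Clang do.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Pre-fix behaviour: the message length was narrowed to int on its way
 * to load_msg(). On common compilers the value wraps modulo 2^32.
 */
static int demo_narrowed_len(size_t msgsz)
{
	return (int)msgsz;
}

/*
 * Limit check as described in the commit message: the signed limit is
 * converted to an unsigned type for the comparison, so a small negative
 * limit turns into a huge positive one.
 */
static bool demo_passes_limit_check(size_t msgsz, int msg_ctlmax)
{
	return !(msgsz > (size_t)msg_ctlmax);
}
```

With the reproducer's length of 0xfffffff0 and a limit of -1, the check passes while the narrowed length comes out negative, which is why the fix keeps the length as `size_t` throughout.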
|
2013-01-05 07:34:55 +08:00
|
|
|
extern struct msg_msg *copy_msg(struct msg_msg *src, struct msg_msg *dst);
|
2013-11-13 07:11:47 +08:00
|
|
|
extern int store_msg(void __user *dest, struct msg_msg *msg, size_t len);
|
2007-10-19 14:40:49 +08:00
|
|
|
|
2018-08-22 13:02:00 +08:00
|
|
|
static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int id)
|
2007-10-19 14:40:51 +08:00
|
|
|
{
|
2018-08-22 13:02:00 +08:00
|
|
|
return ipcid_to_seqx(id) != ipcp->seq;
|
2007-10-19 14:40:51 +08:00
|
|
|
}
|
|
|
|
|
2013-07-09 07:01:10 +08:00
|
|
|
static inline void ipc_lock_object(struct kern_ipc_perm *perm)
|
2007-10-19 14:40:51 +08:00
|
|
|
{
|
|
|
|
spin_lock(&perm->lock);
|
|
|
|
}
|
|
|
|
|
2013-07-09 07:01:10 +08:00
|
|
|
static inline void ipc_unlock_object(struct kern_ipc_perm *perm)
|
2007-10-19 14:40:51 +08:00
|
|
|
{
|
|
|
|
spin_unlock(&perm->lock);
|
|
|
|
}
|
|
|
|
|
2013-07-09 07:01:10 +08:00
|
|
|
static inline void ipc_assert_locked_object(struct kern_ipc_perm *perm)
|
ipc,sem: do not hold ipc lock more than necessary
Instead of holding the ipc lock for permissions and security checks, among
others, only acquire it when necessary.
Some numbers....
1) With Rik's semop-multi.c microbenchmark we can see the following
results:
Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409
+ 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.86% a.out [kernel.kallsyms] [k] ipc_lock
With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213
+ 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
+ 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check
2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:
Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%
With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%
[riel@redhat.com: fix two locking bugs]
[sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01 10:15:29 +08:00
|
|
|
{
|
2013-07-09 07:01:10 +08:00
|
|
|
assert_spin_locked(&perm->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ipc_unlock(struct kern_ipc_perm *perm)
|
|
|
|
{
|
2013-07-09 07:01:11 +08:00
|
|
|
ipc_unlock_object(perm);
|
2013-07-09 07:01:10 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2014-01-28 09:07:01 +08:00
|
|
|
/*
|
|
|
|
* ipc_valid_object() - helper to sort out IPC_RMID races for codepaths
|
|
|
|
* where the respective ipc_ids.rwsem is not being held down.
|
|
|
|
* Checks whether the ipc object is still around or if it's gone already, as
|
|
|
|
* ipc_rmid() may have already freed the ID while the ipc lock was spinning.
|
|
|
|
* Needs to be called with kern_ipc_perm.lock held -- exception made for one
|
|
|
|
* checkpoint case at sys_semtimedop() as noted in code commentary.
|
|
|
|
*/
|
|
|
|
static inline bool ipc_valid_object(struct kern_ipc_perm *perm)
|
|
|
|
{
|
2014-01-28 09:07:02 +08:00
|
|
|
return !perm->deleted;
|
2014-01-28 09:07:01 +08:00
|
|
|
}
|
|
|
|
|
2013-05-01 10:15:19 +08:00
|
|
|
struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id);
|
2008-02-08 20:18:54 +08:00
|
|
|
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
|
2014-06-07 05:37:36 +08:00
|
|
|
const struct ipc_ops *ops, struct ipc_params *params);
|
2009-06-18 07:27:57 +08:00
|
|
|
void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
|
|
|
|
void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
|
2017-07-09 10:52:47 +08:00
|
|
|
|
2018-10-31 06:07:24 +08:00
|
|
|
static inline int sem_check_semmni(struct ipc_namespace *ns) {
|
|
|
|
/*
|
2019-05-15 06:46:29 +08:00
|
|
|
* Check semmni range [0, ipc_mni]
|
2018-10-31 06:07:24 +08:00
|
|
|
* semmni is the last element of sem_ctls[4] array
|
|
|
|
*/
|
2019-05-15 06:46:29 +08:00
|
|
|
return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipc_mni))
|
2018-10-31 06:07:24 +08:00
|
|
|
? -ERANGE : 0;
|
|
|
|
}
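The range check in sem_check_semmni() accepts both endpoints of [0, ipc_mni]. A minimal userspace sketch of the same test is shown below; `demo_check_semmni()` is a hypothetical name, and the limit is passed as a parameter instead of being read from the kernel's `ipc_mni` variable.

```c
#include <errno.h>

/*
 * Return 0 if semmni lies in the inclusive range [0, limit],
 * otherwise -ERANGE. The limit is a parameter here purely for
 * illustration.
 */
static int demo_check_semmni(int semmni, int limit)
{
	return (semmni < 0 || semmni > limit) ? -ERANGE : 0;
}
```

A value equal to the limit is still valid; only values strictly above it, or negative values, are rejected.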
|
|
|
|
|
2017-07-09 10:52:47 +08:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
#include <linux/compat.h>
|
|
|
|
struct compat_ipc_perm {
|
|
|
|
key_t key;
|
|
|
|
__compat_uid_t uid;
|
|
|
|
__compat_gid_t gid;
|
|
|
|
__compat_uid_t cuid;
|
|
|
|
__compat_gid_t cgid;
|
|
|
|
compat_mode_t mode;
|
|
|
|
unsigned short seq;
|
|
|
|
};
|
|
|
|
|
2017-07-09 22:03:23 +08:00
|
|
|
void to_compat_ipc_perm(struct compat_ipc_perm *, struct ipc64_perm *);
|
|
|
|
void to_compat_ipc64_perm(struct compat_ipc64_perm *, struct ipc64_perm *);
|
|
|
|
int get_compat_ipc_perm(struct ipc64_perm *, struct compat_ipc_perm __user *);
|
|
|
|
int get_compat_ipc64_perm(struct ipc64_perm *,
|
|
|
|
struct compat_ipc64_perm __user *);
|
|
|
|
|
2017-07-09 10:52:47 +08:00
|
|
|
static inline int compat_ipc_parse_version(int *cmd)
|
|
|
|
{
|
|
|
|
int version = *cmd & IPC_64;
|
|
|
|
*cmd &= ~IPC_64;
|
|
|
|
return version;
|
|
|
|
}
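The flag handling in compat_ipc_parse_version() can be tried out in userspace with a copy of the same logic. In this sketch `DEMO_IPC_64` is a local constant assumed to mirror the kernel's `IPC_64` value (0x0100), and `demo_parse_version()` is a hypothetical name; neither is taken from a userspace header.

```c
/* Assumed to mirror the kernel's IPC_64 flag value. */
#define DEMO_IPC_64 0x0100

/*
 * Split a command word into the version flag and the bare command:
 * the return value is non-zero if the caller asked for the 64-bit
 * structure layout, and *cmd is left with the flag bit cleared.
 */
static int demo_parse_version(int *cmd)
{
	int version = *cmd & DEMO_IPC_64;

	*cmd &= ~DEMO_IPC_64;
	return version;
}
```

For a command word of `2 | DEMO_IPC_64`, the function returns `DEMO_IPC_64` and leaves `2` in `*cmd`; for a plain `2` it returns 0 and leaves the command unchanged.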
|
2018-03-21 02:48:14 +08:00
|
|
|
|
2019-01-01 05:22:40 +08:00
|
|
|
long compat_ksys_old_semctl(int semid, int semnum, int cmd, int arg);
|
|
|
|
long compat_ksys_old_msgctl(int msqid, int cmd, void __user *uptr);
|
2018-03-21 04:25:57 +08:00
|
|
|
long compat_ksys_msgrcv(int msqid, compat_uptr_t msgp, compat_ssize_t msgsz,
|
|
|
|
compat_long_t msgtyp, int msgflg);
|
2018-03-21 04:29:00 +08:00
|
|
|
long compat_ksys_msgsnd(int msqid, compat_uptr_t msgp,
|
|
|
|
compat_ssize_t msgsz, int msgflg);
|
2019-01-01 05:22:40 +08:00
|
|
|
long compat_ksys_old_shmctl(int shmid, int cmd, void __user *uptr);
|
ipc: fix sparc64 ipc() wrapper
Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
to a commit from my y2038 series in linux-5.1, as I missed the custom
sys_ipc() wrapper that sparc64 uses in place of the generic version that
I patched.
The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
now do not allow being called with the IPC_64 flag any more, resulting
in a -EINVAL error when they don't recognize the command.
Instead, the correct way to do this now is to call the internal
ksys_old_{sem,shm,msg}ctl() functions to select the API version.
As we generally move towards these functions anyway, change all of
sparc_ipc() to consistently use those in place of the sys_*() versions,
and move the required ksys_*() declarations into linux/syscalls.h
The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
errors when ipc is disabled.
Reported-by: Matt Turner <mattst88@gmail.com>
Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
Cc: stable@vger.kernel.org
Tested-by: Matt Turner <mattst88@gmail.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-05 22:48:38 +08:00
|
|
|
|
|
|
|
#endif
|
2018-03-21 02:48:14 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|