License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0                                              |   11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was tagged "GPL-2.0". The results were:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                      |     930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                      |     270
GPL-2.0+ WITH Linux-syscall-note                     |     169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)  |      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)  |      17
LGPL-2.1+ WITH Linux-syscall-note                    |      15
GPL-1.0+ WITH Linux-syscall-note                     |      14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) |       5
LGPL-2.0+ WITH Linux-syscall-note                    |       4
LGPL-2.1 WITH Linux-syscall-note                     |       3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)           |       3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)          |       1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
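The reconciliation rules in the list above can be sketched in C roughly as
follows. This is only an illustration, not the tooling that was actually used:
the names `reconcile` and `enum verdict`, and the use of NULL or an empty
string for "nothing detected", are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome of comparing two scanner results for one file (illustrative names). */
enum verdict {
	VERDICT_DEFAULT_LICENSE, /* neither scanner found anything: top-level COPYING applies */
	VERDICT_CONCLUDED,       /* both scanners detected the same license(s) */
	VERDICT_MANUAL_REVIEW,   /* one-sided or conflicting detection */
};

/* A NULL pointer or an empty string stands for "no license detected". */
static int detected(const char *id)
{
	return id != NULL && id[0] != '\0';
}

static enum verdict reconcile(const char *scan_a, const char *scan_b)
{
	if (!detected(scan_a) && !detected(scan_b))
		return VERDICT_DEFAULT_LICENSE;
	if (detected(scan_a) && detected(scan_b) && strcmp(scan_a, scan_b) == 0)
		return VERDICT_CONCLUDED;
	return VERDICT_MANUAL_REVIEW;
}

/* Usage sketch: one call per rule in the list. */
static void reconcile_examples(void)
{
	assert(reconcile(NULL, "") == VERDICT_DEFAULT_LICENSE);
	assert(reconcile("GPL-2.0", "GPL-2.0") == VERDICT_CONCLUDED);
	assert(reconcile("GPL-2.0", NULL) == VERDICT_MANUAL_REVIEW);
	assert(reconcile("GPL-2.0", "GPL-2.0+") == VERDICT_MANUAL_REVIEW);
}
```

A real comparison would have to normalize license expressions before comparing
them; plain string equality is used here only to keep the sketch short.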
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
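As a rough illustration of the header versus source distinction mentioned
above, the sketch below formats the SPDX line with a block comment for .h files
and a line comment for .c files, which is the convention the kernel's licensing
rules describe. It is not the script that Thomas and Greg wrote; the helper
name `spdx_line` and its signature are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the SPDX line for "path" into "buf", choosing the comment style
 * from the file extension. Returns the snprintf() result, or -1 for file
 * types this sketch does not handle.
 */
static int spdx_line(const char *path, const char *id, char *buf, size_t len)
{
	const char *dot;

	assert(path != NULL && id != NULL && buf != NULL);

	dot = strrchr(path, '.');
	if (dot != NULL && strcmp(dot, ".h") == 0)
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	if (dot != NULL && strcmp(dot, ".c") == 0)
		return snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
	return -1;
}
```

Calling `spdx_line("include/linux/user_namespace.h", "GPL-2.0", buf, sizeof(buf))`
yields the same first line that this header carries.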
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2007-07-16 14:40:59 +08:00
|
|
|
#ifndef _LINUX_USER_NAMESPACE_H
|
|
|
|
#define _LINUX_USER_NAMESPACE_H
|
|
|
|
|
|
|
|
#include <linux/kref.h>
|
|
|
|
#include <linux/nsproxy.h>
|
2014-11-01 10:56:04 +08:00
|
|
|
#include <linux/ns_common.h>
|
2007-07-16 14:40:59 +08:00
|
|
|
#include <linux/sched.h>
|
2017-02-06 16:56:40 +08:00
|
|
|
#include <linux/workqueue.h>
|
2017-02-09 01:51:58 +08:00
|
|
|
#include <linux/rwsem.h>
|
2017-02-03 17:06:45 +08:00
|
|
|
#include <linux/sysctl.h>
|
2007-07-16 14:41:01 +08:00
|
|
|
#include <linux/err.h>
|
2007-07-16 14:40:59 +08:00
|
|
|
|
userns: bump idmap limits to 340
There are quite a few use cases where users run into the current limit for
{g,u}id mappings. Consider a user requesting us to map everything but 999, and
1001 for a given range of 1000000000 with a sub{g,u}id layout of:
some-user:100000:1000000000
some-user:999:1
some-user:1000:1
some-user:1001:1
some-user:1002:1
This translates to:
MAPPING-TYPE | CONTAINER | HOST | RANGE |
-------------|-----------|---------|-----------|
uid | 999 | 999 | 1 |
uid | 1001 | 1001 | 1 |
uid | 0 | 1000000 | 999 |
uid | 1000 | 1001000 | 1 |
uid | 1002 | 1001002 | 999998998 |
------------------------------------------------
gid | 999 | 999 | 1 |
gid | 1001 | 1001 | 1 |
gid | 0 | 1000000 | 999 |
gid | 1000 | 1001000 | 1 |
gid | 1002 | 1001002 | 999998998 |
which is already the current limit.
As discussed at LPC, simply bumping the limit is not going to work
since this would mean that struct uid_gid_map won't fit into a single cache-line
anymore thereby regressing performance for the base-cases. The same problem
seems to arise when using a single pointer. So the idea is to use
struct uid_gid_extent {
u32 first;
u32 lower_first;
u32 count;
};
struct uid_gid_map { /* 64 bytes -- 1 cache line */
u32 nr_extents;
union {
struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
struct {
struct uid_gid_extent *forward;
struct uid_gid_extent *reverse;
};
};
};
For the base cases we will only use the struct uid_gid_extent extent member. If
we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
kmalloc() which means we can have a maximum of 340 mappings
(340 * sizeof(struct uid_gid_extent) = 4080). For the latter case we use two
pointers "forward" and "reverse". The forward pointer points to an array sorted
by "first" and the reverse pointer points to an array sorted by "lower_first".
We can then perform binary search on those arrays.
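A minimal user-space sketch of the lookup described above, assuming the forward
array is sorted by "first" and that extents do not overlap. The function names
`find_extent` and `map_id_down_sketch` are hypothetical and are not the kernel's
own helpers; `(u32)-1` is used here to signal "not mapped".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;

struct uid_gid_extent {
	u32 first;
	u32 lower_first;
	u32 count;
};

/* 340 extents of 12 bytes each use 4080 bytes, which fits a 4k allocation. */
_Static_assert(sizeof(struct uid_gid_extent) == 12, "extent is three u32s");
_Static_assert(340 * sizeof(struct uid_gid_extent) == 4080, "340 extents use 4080 bytes");

/* Binary search for the extent containing id in an array sorted by "first". */
static const struct uid_gid_extent *
find_extent(const struct uid_gid_extent *sorted, size_t nr, u32 id)
{
	size_t lo = 0, hi = nr;

	assert(sorted != NULL || nr == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct uid_gid_extent *e = &sorted[mid];

		if (id < e->first)
			hi = mid;		/* id lies before this extent */
		else if (id - e->first >= e->count)
			lo = mid + 1;		/* id lies after this extent */
		else
			return e;		/* first <= id < first + count */
	}
	return NULL;
}

/* Translate id through the forward array; (u32)-1 means "not mapped". */
static u32 map_id_down_sketch(const struct uid_gid_extent *forward, size_t nr, u32 id)
{
	const struct uid_gid_extent *e = find_extent(forward, nr, id);

	return e ? (id - e->first) + e->lower_first : (u32)-1;
}
```

The reverse direction would work the same way on the array sorted by
"lower_first", comparing against `lower_first` instead of `first`.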
Performance Testing:
When Eric introduced the extent-based struct uid_gid_map approach he measured
the performance impact of his idmap changes:
> My benchmark consisted of going to single user mode where nothing else was
> running. On an ext4 filesystem opening 1,000,000 files and looping through all
> of the files 1000 times and calling fstat on the individuals files. This was
> to ensure I was benchmarking stat times where the inodes were in the kernels
> cache, but the inode values were not in the processors cache. My results:
> v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
I used an identical approach on my laptop. Here's a thorough description of what
I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
booted into single user mode and used an ext4 filesystem to open/create
1,000,000 files. Then I looped through all of the files calling fstat() on each
of them 1000 times and calculated the mean fstat() time for a single file. (The
test program can be found below.)
Here are the results. For fun, I compared the first version of my patch which
scaled linearly with the new version of the patch:
| # MAPPINGS | PATCH-V1 | PATCH-NEW |
|--------------|------------|-----------|
| 0 mappings | 158 ns | 158 ns |
| 1 mappings | 164 ns | 157 ns |
| 2 mappings | 170 ns | 158 ns |
| 3 mappings | 175 ns | 161 ns |
| 5 mappings | 187 ns | 165 ns |
| 10 mappings | 218 ns | 199 ns |
| 50 mappings | 528 ns | 218 ns |
| 100 mappings | 980 ns | 229 ns |
| 200 mappings | 1880 ns | 239 ns |
| 300 mappings | 2760 ns | 240 ns |
| 340 mappings | not tested | 248 ns |
Here's the test program I used. I asked Eric what he did and this is a more
"advanced" implementation of the idea. It's pretty straight-forward:
#define _GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include <errno.h>
#include <dirent.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
	int ret;
	size_t i, k;
	int fd[1000000];
	int times[1000];
	char pathname[4096];
	struct stat st;
	struct timeval t1, t2;
	uint64_t time_in_mcs;
	uint64_t sum = 0;

	if (argc != 2) {
		fprintf(stderr, "Please specify a directory where to create "
				"the test files\n");
		exit(EXIT_FAILURE);
	}

	/* Create/open all test files and keep them open. */
	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
		sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
		fd[i] = open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
		if (fd[i] < 0) {
			ssize_t j;

			/* Close only the descriptors that were opened. */
			for (j = (ssize_t)i - 1; j >= 0; j--)
				close(fd[j]);
			exit(EXIT_FAILURE);
		}
	}

	/* Time 1000 passes of fstat() over all files. */
	for (k = 0; k < 1000; k++) {
		ret = gettimeofday(&t1, NULL);
		if (ret < 0)
			goto close_all;

		for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
			ret = fstat(fd[i], &st);
			if (ret < 0)
				goto close_all;
		}

		ret = gettimeofday(&t2, NULL);
		if (ret < 0)
			goto close_all;

		time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
			      (1000000 * t1.tv_sec + t1.tv_usec);
		printf("Total time in micro seconds: %" PRIu64 "\n",
		       time_in_mcs);
		printf("Total time in nanoseconds: %" PRIu64 "\n",
		       time_in_mcs * 1000);
		printf("Time per file in nanoseconds: %" PRIu64 "\n",
		       (time_in_mcs * 1000) / 1000000);
		times[k] = (time_in_mcs * 1000) / 1000000;
	}

close_all:
	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
		close(fd[i]);

	if (ret < 0)
		exit(EXIT_FAILURE);

	for (k = 0; k < 1000; k++) {
		sum += times[k];
	}

	printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);
	exit(EXIT_SUCCESS);
}
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
CC: Serge Hallyn <serge@hallyn.com>
CC: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-10-25 06:04:41 +08:00
|
|
|
#define UID_GID_MAP_MAX_BASE_EXTENTS 5
|
|
|
|
#define UID_GID_MAP_MAX_EXTENTS 340
|
2011-11-17 16:11:58 +08:00
|
|
|
|
2017-10-25 06:04:40 +08:00
|
|
|
struct uid_gid_extent {
|
|
|
|
u32 first;
|
|
|
|
u32 lower_first;
|
|
|
|
u32 count;
|
|
|
|
};
|
|
|
|
|
2017-10-25 06:04:41 +08:00
|
|
|
struct uid_gid_map { /* 64 bytes -- 1 cache line */
|
2011-11-17 16:11:58 +08:00
|
|
|
u32 nr_extents;
|
2017-10-25 06:04:40 +08:00
|
|
|
union {
|
2017-10-25 06:04:41 +08:00
|
|
|
struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
|
2017-10-25 06:04:40 +08:00
|
|
|
struct {
|
|
|
|
struct uid_gid_extent *forward;
|
|
|
|
struct uid_gid_extent *reverse;
|
|
|
|
};
|
|
|
|
};
|
2011-11-17 16:11:58 +08:00
|
|
|
};
|
|
|
|
|
2014-12-03 02:27:26 +08:00
|
|
|
#define USERNS_SETGROUPS_ALLOWED 1UL
|
|
|
|
|
|
|
|
#define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
|
|
|
|
|
2016-08-09 02:54:50 +08:00
|
|
|
struct ucounts;
|
2016-08-09 03:41:52 +08:00
|
|
|
|
|
|
|
enum ucount_type {
|
|
|
|
UCOUNT_USER_NAMESPACES,
|
2016-08-09 03:08:36 +08:00
|
|
|
UCOUNT_PID_NAMESPACES,
|
2016-08-09 03:11:25 +08:00
|
|
|
UCOUNT_UTS_NAMESPACES,
|
2016-08-09 03:20:23 +08:00
|
|
|
UCOUNT_IPC_NAMESPACES,
|
2016-08-09 03:33:23 +08:00
|
|
|
UCOUNT_NET_NAMESPACES,
|
2016-08-09 03:37:37 +08:00
|
|
|
UCOUNT_MNT_NAMESPACES,
|
2016-08-09 03:25:30 +08:00
|
|
|
UCOUNT_CGROUP_NAMESPACES,
|
ns: Introduce Time Namespace
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to change the date and
time in a container (CLOCK_REALTIME). Providing per-namespace notions of
CLOCK_REALTIME would be complex, carry massive overhead, and be of dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way it is
created: the unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
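A small user-space sketch of the flag placement described above. The fallback
values 0x00000080 for CLONE_NEWTIME and 0x000000ff for CSIGNAL are taken from
the Linux uapi headers and are only used when the installed libc headers do not
define them; the helper name `enter_new_time_namespace` is made up for this
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* fallback for older libc headers */
#endif
#ifndef CSIGNAL
#define CSIGNAL 0x000000ff		/* exit-signal mask of the clone() flags */
#endif

/* CLONE_NEWTIME sits inside CSIGNAL, as its highest bit. */
_Static_assert((CLONE_NEWTIME & CSIGNAL) == CLONE_NEWTIME, "flag overlaps CSIGNAL");
_Static_assert(CLONE_NEWTIME == (CSIGNAL + 1) / 2, "flag is the top CSIGNAL bit");

/*
 * Create a new time namespace for future children of the caller.
 * Returns 0 on success and -1 on failure, for example when the caller
 * lacks privileges or the kernel has no time namespace support.
 */
static int enter_new_time_namespace(void)
{
	if (unshare(CLONE_NEWTIME) < 0) {
		fprintf(stderr, "unshare(CLONE_NEWTIME): %s\n", strerror(errno));
		return -1;
	}
	return 0;
}
```

Because the low byte of the legacy clone() flags argument carries the exit
signal, a flag in that range cannot be passed there, which is why only
unshare() and clone3() accept it.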
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
2019-11-12 09:26:52 +08:00
|
|
|
UCOUNT_TIME_NAMESPACES,
|
2016-12-14 21:56:33 +08:00
|
|
|
#ifdef CONFIG_INOTIFY_USER
|
|
|
|
UCOUNT_INOTIFY_INSTANCES,
|
|
|
|
UCOUNT_INOTIFY_WATCHES,
|
2021-03-04 19:29:20 +08:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_FANOTIFY
|
|
|
|
UCOUNT_FANOTIFY_GROUPS,
|
|
|
|
UCOUNT_FANOTIFY_MARKS,
|
2016-12-14 21:56:33 +08:00
|
|
|
#endif
|
2021-04-22 20:27:11 +08:00
|
|
|
UCOUNT_RLIMIT_NPROC,
|
2021-04-22 20:27:12 +08:00
|
|
|
UCOUNT_RLIMIT_MSGQUEUE,
|
2021-04-22 20:27:13 +08:00
|
|
|
UCOUNT_RLIMIT_SIGPENDING,
|
2021-04-22 20:27:14 +08:00
|
|
|
UCOUNT_RLIMIT_MEMLOCK,
|
2016-08-09 03:41:52 +08:00
|
|
|
UCOUNT_COUNTS,
|
|
|
|
};
|
|
|
|
|
2021-04-22 20:27:11 +08:00
|
|
|
#define MAX_PER_NAMESPACE_UCOUNTS UCOUNT_RLIMIT_NPROC
|
|
|
|
|
2007-07-16 14:40:59 +08:00
|
|
|
struct user_namespace {
|
2011-11-17 16:11:58 +08:00
|
|
|
struct uid_gid_map uid_map;
|
|
|
|
struct uid_gid_map gid_map;
|
2012-08-30 16:24:05 +08:00
|
|
|
struct uid_gid_map projid_map;
|
2011-11-17 13:59:43 +08:00
|
|
|
struct user_namespace *parent;
|
2013-08-09 00:55:32 +08:00
|
|
|
int level;
|
2011-11-17 17:32:59 +08:00
|
|
|
kuid_t owner;
|
|
|
|
kgid_t group;
|
2014-11-01 10:56:04 +08:00
|
|
|
struct ns_common ns;
|
2014-12-03 02:27:26 +08:00
|
|
|
unsigned long flags;
|
capabilities: require CAP_SETFCAP to map uid 0
cap_setfcap is required to create file capabilities.
Since commit 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"),
a process running as uid 0 but without cap_setfcap is able to work
around this as follows: unshare a new user namespace which maps parent
uid 0 into the child namespace.
While this task will not have new capabilities against the parent
namespace, there is a loophole due to the way namespaced file
capabilities are represented as xattrs. File capabilities valid in
userns 1 are distinguished from file capabilities valid in userns 2 by
the kuid which underlies uid 0. Therefore the restricted root process
can unshare a new self-mapping namespace, add a namespaced file
capability onto a file, then use that file capability in the parent
namespace.
To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the
capability for setting file capabilities.
As a further wrinkle: a task can unshare its user namespace, then open
its uid_map file itself, and map (only) its own uid. In this case we do
not have the credential from before unshare, which was potentially more
restricted. So, when creating a user namespace, we record whether the
creator had CAP_SETFCAP. Then we can use that during map_write().
With this patch:
1. Unprivileged user can still unshare -Ur
ubuntu@caps:~$ unshare -Ur
root@caps:~# logout
2. Root user can still unshare -Ur
ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout
3. Root user without CAP_SETFCAP cannot unshare -Ur:
root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted
Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities. This approach can be seen at [1].
Background history: commit 95ebabde382c ("capabilities: Don't allow
writing ambiguous v3 file capabilities") tried to fix the issue by
preventing v3 fscaps from being written to disk when the root uid would
map to the same uid in nested user namespaces. This led to regressions
for various workloads. For example, see [2]. Ultimately this is a valid
use case we have to support, meaning we had to revert this change in
3b0c2d3eaa83 ("Revert 95ebabde382c ("capabilities: Don't allow writing
ambiguous v3 file capabilities")").
Link: https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 [1]
Link: https://github.com/containers/buildah/issues/3071 [2]
Signed-off-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andrew G. Morgan <morgan@kernel.org>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-20 21:43:34 +08:00
	/* parent_could_setfcap: true if the creator of this ns had CAP_SETFCAP
	 * in its effective capability set at the child ns creation time. */
	bool parent_could_setfcap;
2013-09-24 17:35:19 +08:00

2019-06-27 04:02:32 +08:00
#ifdef CONFIG_KEYS
2019-06-27 04:02:32 +08:00
	/* List of joinable keyrings in this namespace. Modification access of
	 * these pointers is controlled by keyring_sem. Once
	 * user_keyring_register is set, it won't be changed, so it can be
	 * accessed directly with READ_ONCE().
	 */
2019-06-27 04:02:32 +08:00
	struct list_head keyring_name_list;
2019-06-27 04:02:32 +08:00
	struct key *user_keyring_register;
	struct rw_semaphore keyring_sem;
2019-06-27 04:02:32 +08:00
#endif

2013-09-24 17:35:19 +08:00
	/* Register of per-UID persistent keyrings for this namespace */
#ifdef CONFIG_PERSISTENT_KEYRINGS
	struct key *persistent_keyring_register;
#endif
2016-07-31 02:53:37 +08:00
	struct work_struct work;
2016-07-31 02:58:49 +08:00
#ifdef CONFIG_SYSCTL
	struct ctl_table_set set;
	struct ctl_table_header *sysctls;
#endif
2016-08-09 02:54:50 +08:00
	struct ucounts *ucounts;
2021-04-22 20:27:08 +08:00
	long ucount_max[UCOUNT_COUNTS];
2016-10-28 16:22:25 +08:00
} __randomize_layout;
2016-08-09 02:54:50 +08:00

struct ucounts {
	struct hlist_node node;
	struct user_namespace *ns;
	kuid_t uid;
2021-04-22 20:27:10 +08:00
	atomic_t count;
2021-04-22 20:27:08 +08:00
	atomic_long_t ucount[UCOUNT_COUNTS];
2007-07-16 14:40:59 +08:00
};

extern struct user_namespace init_user_ns;
2021-04-22 20:27:09 +08:00
extern struct ucounts init_ucounts;
2016-08-09 02:54:50 +08:00

bool setup_userns_sysctls(struct user_namespace *ns);
void retire_userns_sysctls(struct user_namespace *ns);
2016-08-09 03:41:52 +08:00
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
2021-04-22 20:27:09 +08:00
struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
2021-04-22 20:27:10 +08:00
struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
2021-04-22 20:27:09 +08:00
void put_ucounts(struct ucounts *ucounts);
2007-07-16 14:40:59 +08:00

2021-04-22 20:27:11 +08:00
static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type type)
{
	return atomic_long_read(&ucounts->ucount[type]);
}

long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);

2021-04-22 20:27:16 +08:00
static inline void set_rlimit_ucount_max(struct user_namespace *ns,
		enum ucount_type type, unsigned long max)
{
	ns->ucount_max[type] = max <= LONG_MAX ? max : LONG_MAX;
}
2007-07-16 14:40:59 +08:00

#ifdef CONFIG_USER_NS

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
	if (ns)
2020-08-03 18:16:37 +08:00
		refcount_inc(&ns->ns.count);
2007-07-16 14:40:59 +08:00
	return ns;
}

2008-10-16 05:38:45 +08:00
extern int create_user_ns(struct cred *new);
2012-07-26 20:15:35 +08:00
extern int unshare_userns(unsigned long unshare_flags, struct cred **new_cred);
2016-07-31 02:53:37 +08:00
extern void __put_user_ns(struct user_namespace *ns);
2007-07-16 14:40:59 +08:00

static inline void put_user_ns(struct user_namespace *ns)
{
2020-08-03 18:16:37 +08:00
	if (ns && refcount_dec_and_test(&ns->ns.count))
2016-07-31 02:53:37 +08:00
		__put_user_ns(ns);
2007-07-16 14:40:59 +08:00
}

2011-11-17 16:11:58 +08:00
struct seq_operations;
2014-08-09 05:21:22 +08:00
extern const struct seq_operations proc_uid_seq_operations;
extern const struct seq_operations proc_gid_seq_operations;
extern const struct seq_operations proc_projid_seq_operations;
2011-11-17 16:11:58 +08:00
extern ssize_t proc_uid_map_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t proc_gid_map_write(struct file *, const char __user *, size_t, loff_t *);
2012-08-30 16:24:05 +08:00
extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t, loff_t *);
2014-12-03 02:27:26 +08:00
extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
extern int proc_setgroups_show(struct seq_file *m, void *v);
2014-12-06 08:01:11 +08:00
extern bool userns_may_setgroups(const struct user_namespace *ns);
2017-04-30 03:12:15 +08:00
extern bool in_userns(const struct user_namespace *ancestor,
		const struct user_namespace *child);
2015-09-24 04:16:04 +08:00
extern bool current_in_userns(const struct user_namespace *target_ns);
2016-09-06 15:47:13 +08:00
struct ns_common *ns_get_owner(struct ns_common *ns);
2007-07-16 14:40:59 +08:00
#else

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
	return &init_user_ns;
}

2008-10-16 05:38:45 +08:00
static inline int create_user_ns(struct cred *new)
2007-07-16 14:40:59 +08:00
{
2008-10-16 05:38:45 +08:00
	return -EINVAL;
2007-07-16 14:40:59 +08:00
}

2012-07-26 20:15:35 +08:00
static inline int unshare_userns(unsigned long unshare_flags,
		struct cred **new_cred)
{
	if (unshare_flags & CLONE_NEWUSER)
		return -EINVAL;
	return 0;
}

2007-07-16 14:40:59 +08:00
static inline void put_user_ns(struct user_namespace *ns)
{
}

2014-12-06 08:01:11 +08:00
static inline bool userns_may_setgroups(const struct user_namespace *ns)
{
	return true;
}
2015-09-24 04:16:04 +08:00

2017-04-30 03:12:15 +08:00
static inline bool in_userns(const struct user_namespace *ancestor,
		const struct user_namespace *child)
{
	return true;
}

2015-09-24 04:16:04 +08:00
static inline bool current_in_userns(const struct user_namespace *target_ns)
{
	return true;
}
2016-09-06 15:47:13 +08:00

static inline struct ns_common *ns_get_owner(struct ns_common *ns)
{
	return ERR_PTR(-EPERM);
}
2011-11-17 16:11:58 +08:00
#endif

2007-07-16 14:40:59 +08:00
#endif /* _LINUX_USER_H */