License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifndef _LINUX_PIPE_FS_I_H
|
|
|
|
#define _LINUX_PIPE_FS_I_H
|
|
|
|
|
2010-05-20 16:43:18 +08:00
|
|
|
#define PIPE_DEF_BUFFERS 16
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-05-03 16:35:26 +08:00
|
|
|
#define PIPE_BUF_FLAG_LRU 0x01 /* page is on the LRU */
|
|
|
|
#define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */
|
|
|
|
#define PIPE_BUF_FLAG_GIFT 0x04 /* page is a gift */
|
pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30 04:12:42 +08:00
|
|
|
#define PIPE_BUF_FLAG_PACKET 0x08 /* read() as a packet */
|
2020-05-20 23:58:12 +08:00
|
|
|
#define PIPE_BUF_FLAG_CAN_MERGE 0x10 /* can merge buffers */
|
Notifications over pipes + Keyring notifications
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL
C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw
f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ
JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0
Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK
KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP
E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12
72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK
C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7
GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq
eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU
hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops=
=YTGf
-----END PGP SIGNATURE-----
Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull notification queue from David Howells:
"This adds a general notification queue concept and adds an event
source for keys/keyrings, such as linking and unlinking keys and
changing their attributes.
Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
Without this, g-o-a has to constantly poll a keyring-based kerberos
cache to find out if kinit has changed anything.
[ There are other notification pending: mount/sb fsinfo notifications
for libmount that Karel Zak and Ian Kent have been working on, and
Christian Brauner would like to use them in lxc, but let's see how
this one works first ]
LSM hooks are included:
- A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The
LSM should use current's credentials. [Wanted by SELinux & Smack]
- A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is
given the credentials from the event generator (which may be the
system) and the watch setter. [Wanted by Smack]
I've provided SELinux and Smack with implementations of some of these
hooks.
WHY
===
Key/keyring notifications are desirable because if you have your
kerberos tickets in a file/directory, your Gnome desktop will monitor
that using something like fanotify and tell you if your credentials
cache changes.
However, we also have the ability to cache your kerberos tickets in
the session, user or persistent keyring so that it isn't left around
on disk across a reboot or logout. Keyrings, however, cannot currently
be monitored asynchronously, so the desktop has to poll for it - not
so good on a laptop. This facility will allow the desktop to avoid the
need to poll.
DESIGN DECISIONS
================
- The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:
pipe2(fds, O_NOTIFICATION_PIPE);
The special flag has the same value as O_EXCL (which doesn't seem
like it will ever be applicable in this context)[?]. It is given up
front to make it a lot easier to prohibit splice&co from accessing
the pipe.
[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?
The pipe is then configured::
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
Messages are then read out of the pipe using read().
- It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the
kernel has to be able to insert messages into the pipe *without*
holding pipe->mutex and the code to make this work needs careful
auditing.
- sendfile(), splice() and vmsplice() are disabled on notification
pipes because of the pipe->mutex issue and also because they
sometimes want to revert what they just did - but one or more
notification messages might've been interleaved in the ring.
- The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock
to update the queue pointers.
- Records in the buffer are binary, typed and have a length so that
they can be of varying size.
This allows multiple heterogeneous sources to share a common
buffer; there are 16 million types available, of which I've used
just a few, so there is scope for others to be used. Tags may be
specified when a watchpoint is created to help distinguish the
sources.
- Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.
- Notification pipes don't interfere with each other; each may be
bound to a different set of watches. Any particular notification
will be copied to all the queues that are currently watching for it
- and only those that are watching for it.
- When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's
insufficient space. read() will fabricate a loss notification
message at an appropriate point later.
- The notification pipe is created and then watchpoints are attached
to it, using one of:
keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
where in both cases, fd indicates the queue and the number after is
a tag between 0 and 255.
- Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.
Things I want to avoid:
- Introducing features that make the core VFS dependent on the
network stack or networking namespaces (ie. usage of netlink).
- Dumping all this stuff into dmesg and having a daemon that sits
there parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.
- Letting users see events they shouldn't be able to see.
TESTING AND MANPAGES
====================
- The keyutils tree has a pipe-watch branch that has keyctl commands
for making use of notifications. Proposed manual pages can also be
found on this branch, though a couple of them really need to go to
the main manpages repository instead.
If the kernel supports the watching of keys, then running "make
test" on that branch will cause the testing infrastructure to spawn
a monitoring process on the side that monitors a notifications pipe
for all the key/keyring changes induced by the tests and they'll
all be checked off to make sure they happened.
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
- A test program is provided (samples/watch_queue/watch_test) that
can be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout"
* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
smack: Implement the watch_key and post_notification hooks
selinux: Implement the watch_key security hook
keys: Make the KEY_NEED_* perms an enum rather than a mask
pipe: Add notification lossage handling
pipe: Allow buffers to be marked read-whole-or-error for notifications
Add sample notification program
watch_queue: Add a key/keyring notification facility
security: Add hooks to rule on setting a watch
pipe: Add general notification queue support
pipe: Add O_NOTIFICATION_PIPE
security: Add a hook for the point of notification insertion
uapi: General notification queue definitions
2020-06-14 00:56:21 +08:00
|
|
|
#define PIPE_BUF_FLAG_WHOLE 0x20 /* read() must return entire buffer or error */
|
2020-01-15 01:07:12 +08:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
Notifications over pipes + Keyring notifications
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL
C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw
f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ
JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0
Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK
KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP
E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12
72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK
C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7
GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq
eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU
hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops=
=YTGf
-----END PGP SIGNATURE-----
Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull notification queue from David Howells:
"This adds a general notification queue concept and adds an event
source for keys/keyrings, such as linking and unlinking keys and
changing their attributes.
Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
Without this, g-o-a has to constantly poll a keyring-based kerberos
cache to find out if kinit has changed anything.
[ There are other notification pending: mount/sb fsinfo notifications
for libmount that Karel Zak and Ian Kent have been working on, and
Christian Brauner would like to use them in lxc, but let's see how
this one works first ]
LSM hooks are included:
- A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The
LSM should use current's credentials. [Wanted by SELinux & Smack]
- A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is
given the credentials from the event generator (which may be the
system) and the watch setter. [Wanted by Smack]
I've provided SELinux and Smack with implementations of some of these
hooks.
WHY
===
Key/keyring notifications are desirable because if you have your
kerberos tickets in a file/directory, your Gnome desktop will monitor
that using something like fanotify and tell you if your credentials
cache changes.
However, we also have the ability to cache your kerberos tickets in
the session, user or persistent keyring so that it isn't left around
on disk across a reboot or logout. Keyrings, however, cannot currently
be monitored asynchronously, so the desktop has to poll for it - not
so good on a laptop. This facility will allow the desktop to avoid the
need to poll.
DESIGN DECISIONS
================
- The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:
pipe2(fds, O_NOTIFICATION_PIPE);
The special flag has the same value as O_EXCL (which doesn't seem
like it will ever be applicable in this context)[?]. It is given up
front to make it a lot easier to prohibit splice&co from accessing
the pipe.
[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?
The pipe is then configured::
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
Messages are then read out of the pipe using read().
- It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the
kernel has to be able to insert messages into the pipe *without*
holding pipe->mutex and the code to make this work needs careful
auditing.
- sendfile(), splice() and vmsplice() are disabled on notification
pipes because of the pipe->mutex issue and also because they
sometimes want to revert what they just did - but one or more
notification messages might've been interleaved in the ring.
- The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock
to update the queue pointers.
- Records in the buffer are binary, typed and have a length so that
they can be of varying size.
This allows multiple heterogeneous sources to share a common
buffer; there are 16 million types available, of which I've used
just a few, so there is scope for others to be used. Tags may be
specified when a watchpoint is created to help distinguish the
sources.
- Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.
- Notification pipes don't interfere with each other; each may be
bound to a different set of watches. Any particular notification
will be copied to all the queues that are currently watching for it
- and only those that are watching for it.
- When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's
insufficient space. read() will fabricate a loss notification
message at an appropriate point later.
- The notification pipe is created and then watchpoints are attached
to it, using one of:
keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
where in both cases, fd indicates the queue and the number after is
a tag between 0 and 255.
- Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.
Things I want to avoid:
- Introducing features that make the core VFS dependent on the
network stack or networking namespaces (ie. usage of netlink).
- Dumping all this stuff into dmesg and having a daemon that sits
there parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.
- Letting users see events they shouldn't be able to see.
TESTING AND MANPAGES
====================
- The keyutils tree has a pipe-watch branch that has keyctl commands
for making use of notifications. Proposed manual pages can also be
found on this branch, though a couple of them really need to go to
the main manpages repository instead.
If the kernel supports the watching of keys, then running "make
test" on that branch will cause the testing infrastructure to spawn
a monitoring process on the side that monitors a notifications pipe
for all the key/keyring changes induced by the tests and they'll
all be checked off to make sure they happened.
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
- A test program is provided (samples/watch_queue/watch_test) that
can be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout"
* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
smack: Implement the watch_key and post_notification hooks
selinux: Implement the watch_key security hook
keys: Make the KEY_NEED_* perms an enum rather than a mask
pipe: Add notification lossage handling
pipe: Allow buffers to be marked read-whole-or-error for notifications
Add sample notification program
watch_queue: Add a key/keyring notification facility
security: Add hooks to rule on setting a watch
pipe: Add general notification queue support
pipe: Add O_NOTIFICATION_PIPE
security: Add a hook for the point of notification insertion
uapi: General notification queue definitions
2020-06-14 00:56:21 +08:00
|
|
|
#define PIPE_BUF_FLAG_LOSS 0x40 /* Message loss happened after this buffer */
|
2020-01-15 01:07:12 +08:00
|
|
|
#endif
|
2006-04-03 05:11:04 +08:00
|
|
|
|
2007-06-13 02:51:32 +08:00
|
|
|
/**
|
|
|
|
* struct pipe_buffer - a linux kernel pipe buffer
|
|
|
|
* @page: the page containing the data for the pipe buffer
|
|
|
|
* @offset: offset of data inside the @page
|
|
|
|
* @len: length of data inside the @page
|
|
|
|
* @ops: operations associated with this buffer. See @pipe_buf_operations.
|
|
|
|
* @flags: pipe buffer flags. See above.
|
|
|
|
* @private: private data owned by the ops.
|
|
|
|
**/
|
2005-04-17 06:20:36 +08:00
|
|
|
struct pipe_buffer {
|
|
|
|
struct page *page;
|
|
|
|
unsigned int offset, len;
|
2006-12-13 16:34:04 +08:00
|
|
|
const struct pipe_buf_operations *ops;
|
2006-04-03 05:11:04 +08:00
|
|
|
unsigned int flags;
|
2007-06-11 18:00:45 +08:00
|
|
|
unsigned long private;
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2007-06-13 02:51:32 +08:00
|
|
|
/**
|
|
|
|
* struct pipe_inode_info - a linux kernel pipe
|
2013-03-21 14:32:24 +08:00
|
|
|
* @mutex: mutex protecting the whole thing
|
2020-02-10 11:36:14 +08:00
|
|
|
* @rd_wait: reader wait point in case of empty pipe
|
|
|
|
* @wr_wait: writer wait point in case of full pipe
|
2019-11-15 21:30:32 +08:00
|
|
|
* @head: The point of buffer production
|
|
|
|
* @tail: The point of buffer consumption
|
2020-01-15 01:07:12 +08:00
|
|
|
* @note_loss: The next read() should insert a data-lost message
|
2019-10-16 23:47:32 +08:00
|
|
|
* @max_usage: The maximum number of slots that may be used in the ring
|
2019-11-15 21:30:32 +08:00
|
|
|
* @ring_size: total number of buffers (should be a power of 2)
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
* @nr_accounted: The amount this pipe accounts for in user->pipe_bufs
|
2007-06-13 02:51:32 +08:00
|
|
|
* @tmp_page: cached released page
|
|
|
|
* @readers: number of current readers of this pipe
|
|
|
|
* @writers: number of current writers of this pipe
|
2014-02-18 21:54:36 +08:00
|
|
|
* @files: number of struct file referring this pipe (protected by ->i_lock)
|
2007-06-13 02:51:32 +08:00
|
|
|
* @r_counter: reader counter
|
|
|
|
* @w_counter: writer counter
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-06 01:04:43 +08:00
|
|
|
* @poll_usage: is this pipe used for epoll, which has crazy wakeups?
|
2007-06-13 02:51:32 +08:00
|
|
|
* @fasync_readers: reader side fasync
|
|
|
|
* @fasync_writers: writer side fasync
|
|
|
|
* @bufs: the circular array of pipe buffers
|
2016-01-18 23:36:09 +08:00
|
|
|
* @user: the user who created this pipe
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
* @watch_queue: If this pipe is a watch_queue, this is the stuff for that
|
2007-06-13 02:51:32 +08:00
|
|
|
**/
|
2007-06-04 21:03:12 +08:00
|
|
|
struct pipe_inode_info {
|
2013-03-21 14:32:24 +08:00
|
|
|
struct mutex mutex;
|
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-10 01:48:27 +08:00
|
|
|
wait_queue_head_t rd_wait, wr_wait;
|
2019-11-15 21:30:32 +08:00
|
|
|
unsigned int head;
|
|
|
|
unsigned int tail;
|
2019-10-16 23:47:32 +08:00
|
|
|
unsigned int max_usage;
|
2019-11-15 21:30:32 +08:00
|
|
|
unsigned int ring_size;
|
2020-01-15 01:07:12 +08:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
bool note_loss;
|
|
|
|
#endif
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
unsigned int nr_accounted;
|
2007-06-04 21:03:12 +08:00
|
|
|
unsigned int readers;
|
|
|
|
unsigned int writers;
|
2013-03-21 14:21:19 +08:00
|
|
|
unsigned int files;
|
2007-06-04 21:03:12 +08:00
|
|
|
unsigned int r_counter;
|
|
|
|
unsigned int w_counter;
|
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-06 01:04:43 +08:00
|
|
|
unsigned int poll_usage;
|
2010-05-20 16:43:18 +08:00
|
|
|
struct page *tmp_page;
|
2007-06-04 21:03:12 +08:00
|
|
|
struct fasync_struct *fasync_readers;
|
|
|
|
struct fasync_struct *fasync_writers;
|
2010-05-20 16:43:18 +08:00
|
|
|
struct pipe_buffer *bufs;
|
2016-01-18 23:36:09 +08:00
|
|
|
struct user_struct *user;
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
struct watch_queue *watch_queue;
|
|
|
|
#endif
|
2007-06-04 21:03:12 +08:00
|
|
|
};
|
|
|
|
|
2006-05-02 01:59:03 +08:00
|
|
|
/*
|
|
|
|
* Note on the nesting of these functions:
|
|
|
|
*
|
2007-06-14 19:10:48 +08:00
|
|
|
* ->confirm()
|
2020-05-20 23:58:16 +08:00
|
|
|
* ->try_steal()
|
2006-05-02 01:59:03 +08:00
|
|
|
*
|
2020-05-20 23:58:16 +08:00
|
|
|
* That is, ->try_steal() must be called on a confirmed buffer. See below for
|
|
|
|
* the meaning of each operation. Also see the kerneldoc in fs/pipe.c for the
|
|
|
|
* pipe and generic variants of these hooks.
|
2006-05-02 01:59:03 +08:00
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
struct pipe_buf_operations {
|
2007-06-13 02:51:32 +08:00
|
|
|
/*
|
|
|
|
* ->confirm() verifies that the data in the pipe buffer is there
|
|
|
|
* and that the contents are good. If the pages in the pipe belong
|
|
|
|
* to a file system, we may need to wait for IO completion in this
|
|
|
|
* hook. Returns 0 for good, or a negative error value in case of
|
2020-05-20 23:58:15 +08:00
|
|
|
* error. If not present all pages are considered good.
|
2007-06-13 02:51:32 +08:00
|
|
|
*/
|
2007-06-14 19:10:48 +08:00
|
|
|
int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);
|
2007-06-13 02:51:32 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When the contents of this pipe buffer has been completely
|
|
|
|
* consumed by a reader, ->release() is called.
|
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
|
2007-06-13 02:51:32 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempt to take ownership of the pipe buffer and its contents.
|
2020-05-20 23:58:16 +08:00
|
|
|
* ->try_steal() returns %true for success, in which case the contents
|
|
|
|
* of the pipe (the buf->page) is locked and now completely owned by the
|
|
|
|
* caller. The page may then be transferred to a different mapping, the
|
|
|
|
* most often used case is insertion into different file address space
|
|
|
|
* cache.
|
2007-06-13 02:51:32 +08:00
|
|
|
*/
|
2020-05-20 23:58:16 +08:00
|
|
|
bool (*try_steal)(struct pipe_inode_info *, struct pipe_buffer *);
|
2007-06-13 02:51:32 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a reference to the pipe buffer.
|
|
|
|
*/
|
2019-04-06 05:02:10 +08:00
|
|
|
bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2019-11-15 21:30:32 +08:00
|
|
|
/**
|
|
|
|
* pipe_empty - Return true if the pipe is empty
|
|
|
|
* @head: The pipe ring head pointer
|
|
|
|
* @tail: The pipe ring tail pointer
|
|
|
|
*/
|
|
|
|
static inline bool pipe_empty(unsigned int head, unsigned int tail)
|
|
|
|
{
|
|
|
|
return head == tail;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pipe_occupancy - Return number of slots used in the pipe
|
|
|
|
* @head: The pipe ring head pointer
|
|
|
|
* @tail: The pipe ring tail pointer
|
|
|
|
*/
|
|
|
|
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
|
|
|
|
{
|
|
|
|
return head - tail;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pipe_full - Return true if the pipe is full
|
|
|
|
* @head: The pipe ring head pointer
|
|
|
|
* @tail: The pipe ring tail pointer
|
|
|
|
* @limit: The maximum amount of slots available.
|
|
|
|
*/
|
|
|
|
static inline bool pipe_full(unsigned int head, unsigned int tail,
|
|
|
|
unsigned int limit)
|
|
|
|
{
|
|
|
|
return pipe_occupancy(head, tail) >= limit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pipe_space_for_user - Return number of slots available to userspace
|
|
|
|
* @head: The pipe ring head pointer
|
|
|
|
* @tail: The pipe ring tail pointer
|
|
|
|
* @pipe: The pipe info structure
|
|
|
|
*/
|
|
|
|
static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int tail,
|
|
|
|
struct pipe_inode_info *pipe)
|
|
|
|
{
|
|
|
|
unsigned int p_occupancy, p_space;
|
|
|
|
|
|
|
|
p_occupancy = pipe_occupancy(head, tail);
|
2019-10-16 23:47:32 +08:00
|
|
|
if (p_occupancy >= pipe->max_usage)
|
2019-11-15 21:30:32 +08:00
|
|
|
return 0;
|
|
|
|
p_space = pipe->ring_size - p_occupancy;
|
2019-10-16 23:47:32 +08:00
|
|
|
if (p_space > pipe->max_usage)
|
|
|
|
p_space = pipe->max_usage;
|
2019-11-15 21:30:32 +08:00
|
|
|
return p_space;
|
|
|
|
}
|
|
|
|
|
2016-09-27 16:45:12 +08:00
|
|
|
/**
|
|
|
|
* pipe_buf_get - get a reference to a pipe_buffer
|
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to get a reference to
|
2019-04-06 05:02:10 +08:00
|
|
|
*
|
|
|
|
* Return: %true if the reference was successfully obtained.
|
2016-09-27 16:45:12 +08:00
|
|
|
*/
|
2019-04-06 05:02:10 +08:00
|
|
|
static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
|
2016-09-27 16:45:12 +08:00
|
|
|
struct pipe_buffer *buf)
|
|
|
|
{
|
2019-04-06 05:02:10 +08:00
|
|
|
return buf->ops->get(pipe, buf);
|
2016-09-27 16:45:12 +08:00
|
|
|
}
|
|
|
|
|
2016-09-27 16:45:12 +08:00
|
|
|
/**
|
|
|
|
* pipe_buf_release - put a reference to a pipe_buffer
|
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to put a reference to
|
|
|
|
*/
|
|
|
|
static inline void pipe_buf_release(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
|
|
|
{
|
|
|
|
const struct pipe_buf_operations *ops = buf->ops;
|
|
|
|
|
|
|
|
buf->ops = NULL;
|
|
|
|
ops->release(pipe, buf);
|
|
|
|
}
|
|
|
|
|
2016-09-27 16:45:12 +08:00
|
|
|
/**
|
|
|
|
* pipe_buf_confirm - verify contents of the pipe buffer
|
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to confirm
|
|
|
|
*/
|
|
|
|
static inline int pipe_buf_confirm(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
|
|
|
{
|
2020-05-20 23:58:15 +08:00
|
|
|
if (!buf->ops->confirm)
|
|
|
|
return 0;
|
2016-09-27 16:45:12 +08:00
|
|
|
return buf->ops->confirm(pipe, buf);
|
|
|
|
}
|
|
|
|
|
2016-09-27 16:45:12 +08:00
|
|
|
/**
|
2020-05-20 23:58:16 +08:00
|
|
|
* pipe_buf_try_steal - attempt to take ownership of a pipe_buffer
|
2016-09-27 16:45:12 +08:00
|
|
|
* @pipe: the pipe that the buffer belongs to
|
|
|
|
* @buf: the buffer to attempt to steal
|
|
|
|
*/
|
2020-05-20 23:58:16 +08:00
|
|
|
static inline bool pipe_buf_try_steal(struct pipe_inode_info *pipe,
|
|
|
|
struct pipe_buffer *buf)
|
2016-09-27 16:45:12 +08:00
|
|
|
{
|
2020-05-20 23:58:16 +08:00
|
|
|
if (!buf->ops->try_steal)
|
|
|
|
return false;
|
|
|
|
return buf->ops->try_steal(pipe, buf);
|
2016-09-27 16:45:12 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
|
|
|
|
memory allocation, whereas PIPE_BUF makes atomicity guarantees. */
|
|
|
|
#define PIPE_SIZE PAGE_SIZE
|
|
|
|
|
2009-04-15 01:48:41 +08:00
|
|
|
/* Pipe lock and unlock operations */
|
|
|
|
void pipe_lock(struct pipe_inode_info *);
|
|
|
|
void pipe_unlock(struct pipe_inode_info *);
|
|
|
|
void pipe_double_lock(struct pipe_inode_info *, struct pipe_inode_info *);
|
|
|
|
|
2018-02-07 07:41:45 +08:00
|
|
|
extern unsigned int pipe_max_size;
|
2016-01-18 23:36:09 +08:00
|
|
|
extern unsigned long pipe_user_pages_hard;
|
|
|
|
extern unsigned long pipe_user_pages_soft;
|
2010-06-03 20:54:39 +08:00
|
|
|
|
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-02 10:14:36 +08:00
|
|
|
/* Wait for a pipe to be readable/writable while dropping the pipe lock */
|
|
|
|
void pipe_wait_readable(struct pipe_inode_info *);
|
|
|
|
void pipe_wait_writable(struct pipe_inode_info *);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-03-21 23:04:15 +08:00
|
|
|
struct pipe_inode_info *alloc_pipe_info(void);
|
2013-03-21 23:06:46 +08:00
|
|
|
void free_pipe_info(struct pipe_inode_info *);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-05-02 01:59:03 +08:00
|
|
|
/* Generic pipe buffer ops functions */
|
2019-04-06 05:02:10 +08:00
|
|
|
bool generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
|
2020-05-20 23:58:16 +08:00
|
|
|
bool generic_pipe_buf_try_steal(struct pipe_inode_info *, struct pipe_buffer *);
|
2009-05-07 21:37:36 +08:00
|
|
|
void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
|
2006-05-02 01:59:03 +08:00
|
|
|
|
2014-01-23 02:36:57 +08:00
|
|
|
extern const struct pipe_buf_operations nosteal_pipe_buf_ops;
|
|
|
|
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
unsigned long account_pipe_buffers(struct user_struct *user,
|
|
|
|
unsigned long old, unsigned long new);
|
|
|
|
bool too_many_pipe_buffers_soft(unsigned long user_bufs);
|
|
|
|
bool too_many_pipe_buffers_hard(unsigned long user_bufs);
|
|
|
|
bool pipe_is_unprivileged_user(void);
|
|
|
|
#endif
|
|
|
|
|
2010-05-20 16:43:18 +08:00
|
|
|
/* for F_SETPIPE_SZ and F_GETPIPE_SZ */
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
|
|
int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots);
|
|
|
|
#endif
|
2010-05-20 16:43:18 +08:00
|
|
|
long pipe_fcntl(struct file *, unsigned int, unsigned long arg);
|
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-15 01:07:11 +08:00
|
|
|
struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
|
2010-11-29 06:09:57 +08:00
|
|
|
|
2012-07-21 19:33:25 +08:00
|
|
|
int create_pipe_files(struct file **, int);
|
2018-02-07 07:42:00 +08:00
|
|
|
unsigned int round_pipe_size(unsigned long size);
|
2012-07-21 19:33:25 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
|