2005-04-17 06:20:36 +08:00
/*
 * POSIX message queues filesystem for Linux.
 *
 * Copyright (C) 2003,2004 Krzysztof Benedyczak (golbi@mat.uni.torun.pl)
2006-10-04 05:23:27 +08:00
 * Michal Wronski (michal.wronski@gmail.com)
2005-04-17 06:20:36 +08:00
 *
 * Spinlocks: Mohamed Abbas (abbas.mohamed@intel.com)
 * Lockless receive & send, fd based notify:
 * Manfred Spraul (manfred@colorfullife.com)
 *
2006-05-25 05:09:55 +08:00
 * Audit: George Wilson (ltcgcw@us.ibm.com)
 *
2005-04-17 06:20:36 +08:00
 * This file is released under the GPL.
 */

2006-01-12 04:17:46 +08:00
#include <linux/capability.h>
2005-04-17 06:20:36 +08:00
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/sysctl.h>
#include <linux/poll.h>
#include <linux/mqueue.h>
#include <linux/msg.h>
#include <linux/skbuff.h>
2012-06-01 07:26:30 +08:00
#include <linux/vmalloc.h>
2005-04-17 06:20:36 +08:00
#include <linux/netlink.h>
#include <linux/syscalls.h>
2006-05-25 05:09:55 +08:00
#include <linux/audit.h>
2005-05-01 23:59:14 +08:00
#include <linux/signal.h>
2006-03-26 17:37:17 +08:00
#include <linux/mutex.h>
2007-10-19 14:40:14 +08:00
#include <linux/nsproxy.h>
#include <linux/pid.h>
2009-04-07 10:01:08 +08:00
#include <linux/ipc_namespace.h>
user namespace: make signal.c respect user namespaces
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)
__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)
do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace
ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.
Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful userns_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.
A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.
Changelog:
Nov 18: previous patch missed a leading '&'
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo
There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 07:11:37 +08:00
#include <linux/user_namespace.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/slab.h>
2006-03-26 17:37:17 +08:00

2005-04-17 06:20:36 +08:00
#include <net/sock.h>
#include "util.h"

#define MQUEUE_MAGIC 0x19800202
#define DIRENT_SIZE 20
#define FILENT_SIZE 80

#define SEND 0
#define RECV 1

#define STATE_NONE 0
#define STATE_PENDING 1
#define STATE_READY 2

ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed to, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test                      Old                 New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
  Running mqueue test          user 49m10.744s     user 45m26.294s
                                sys  5m51.924s      sys  4m59.894s
                              total 55m02.668s    total 50m26.188s

  Running fake test            user 45m32.686s     user 45m18.552s
                                sys  5m12.465s      sys  4m56.468s
                              total 50m45.151s    total 50m15.020s

  % slowdown from mqueue
    cache thrashing                 ~8%                 ~.5%

Avg time to send/recv (in nanoseconds per message)
  when queue empty                305/288             349/318
  when queue full (65528 messages)
    constant priority          526589/823             362/314
    increasing priority        403105/916             495/445
    decreasing priority         73420/594             482/409
    random priority            280147/920             546/436

Time to fill/drain queue (65528 messages, in seconds)
  constant priority             17.37/.12             .13/.12
  increasing priority            4.14/.14             .21/.18
  decreasing priority           12.93/.13             .21/.18
  random priority                8.88/.16             .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by caching at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:35 +08:00
struct posix_msg_tree_node {
	struct rb_node rb_node;
	struct list_head msg_list;
	int priority;
};

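The structure above keeps one tree node per priority level, each carrying a list of the messages queued at that priority. The following is a minimal userspace sketch of that idea, not the kernel implementation: it assumes a plain unbalanced binary search tree in place of the kernel rbtree and a hand-rolled singly linked FIFO in place of `list_head`. All `demo_*` names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message type: one payload, linked into a per-priority FIFO. */
struct demo_msg {
	struct demo_msg *next;
	int payload;
};

/* One node per priority level (unbalanced BST stands in for the rbtree). */
struct demo_prio_node {
	struct demo_prio_node *left, *right;
	struct demo_msg *head, *tail;	/* FIFO of messages at this priority */
	int priority;
};

/* Append a message to the FIFO of its priority node, creating the node on demand. */
static int demo_msg_insert(struct demo_prio_node **root, int prio, int payload)
{
	struct demo_prio_node **p = root;
	struct demo_msg *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->next = NULL;
	m->payload = payload;

	/* Walk down to the node for this priority, or to the empty slot for it. */
	while (*p && (*p)->priority != prio)
		p = prio < (*p)->priority ? &(*p)->left : &(*p)->right;

	if (!*p) {
		*p = calloc(1, sizeof(**p));
		if (!*p) {
			free(m);
			return -1;
		}
		(*p)->priority = prio;
	}

	if ((*p)->tail)
		(*p)->tail->next = m;
	else
		(*p)->head = m;
	(*p)->tail = m;
	return 0;
}

/* Remove the oldest message at the highest priority; -1 if the queue is empty. */
static int demo_msg_get(struct demo_prio_node **root, int *payload)
{
	struct demo_prio_node **p = root, *node;
	struct demo_msg *m;

	if (!*p)
		return -1;
	while ((*p)->right)		/* rightmost node holds the highest priority */
		p = &(*p)->right;

	node = *p;
	m = node->head;
	node->head = m->next;
	if (!node->head) {
		/* Last message at this priority: the rightmost node has no
		 * right child, so its left subtree simply takes its place. */
		*p = node->left;
		free(node);
	}
	*payload = m->payload;
	free(m);
	return 0;
}
```

With this layout, inserting many messages at one priority costs a tree lookup plus a list append rather than shifting array elements, which is the case the commit message describes as the expensive one in the old code.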
2005-04-17 06:20:36 +08:00
struct ext_wait_queue {		/* queue of sleeping tasks */
	struct task_struct *task;
	struct list_head list;
	struct msg_msg *msg;	/* ptr of loaded message */
	int state;		/* one of STATE_* values */
};

struct mqueue_inode_info {
	spinlock_t lock;
	struct inode vfs_inode;
	wait_queue_head_t wait_q;

2012-06-01 07:26:35 +08:00
	struct rb_root msg_tree;
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
                  Old mqueue    new mqueue    new mqueue + caching
queue empty
  send/recv       305/288ns     349/318ns     310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old method's abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:38 +08:00
	struct posix_msg_tree_node *node_cache;
2005-04-17 06:20:36 +08:00
	struct mq_attr attr;

	struct sigevent notify;
2006-10-02 17:17:26 +08:00
	struct pid *notify_owner;
2011-11-17 14:57:55 +08:00
	struct user_namespace *notify_user_ns;
2005-09-10 15:26:54 +08:00
	struct user_struct *user;	/* user who created, for accounting */
2005-04-17 06:20:36 +08:00
	struct sock *notify_sock;
	struct sk_buff *notify_cookie;

	/* for tasks waiting for free space and messages, respectively */
	struct ext_wait_queue e_wait_q[2];

	unsigned long qsize; /* size of queue in memory (sum of all msgs) */
};

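The `node_cache` field holds at most one spare tree node, and the caching commit message describes allocating a node speculatively outside the lock and keeping it only if the cache is still empty once the lock is held. The sketch below illustrates that pattern in userspace under stated assumptions: a pthread mutex stands in for the spinlock, `malloc` stands in for both the `GFP_KERNEL` and `GFP_ATOMIC` allocations, and every `demo_*` name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct demo_node {
	int priority;
};

struct demo_queue {
	pthread_mutex_t lock;		/* stand-in for the spinlock */
	struct demo_node *node_cache;	/* at most one spare node */
};

/*
 * Called with q->lock held: prefer the cached node, otherwise allocate.
 * (In the kernel the fallback under the lock would have to be GFP_ATOMIC.)
 */
static struct demo_node *demo_node_get_locked(struct demo_queue *q)
{
	struct demo_node *n = q->node_cache;

	if (n)
		q->node_cache = NULL;
	else
		n = malloc(sizeof(*n));
	return n;
}

/* Called with q->lock held: keep one spare node, free any extra. */
static void demo_node_put_locked(struct demo_queue *q, struct demo_node *n)
{
	if (!q->node_cache)
		q->node_cache = n;
	else
		free(n);
}

/*
 * Speculative allocation outside the lock. The unlocked peek at node_cache
 * is only a hint; the decision is rechecked once the lock is held.
 */
static void demo_prefill_cache(struct demo_queue *q)
{
	struct demo_node *spare = NULL;

	if (!q->node_cache)
		spare = malloc(sizeof(*spare));

	pthread_mutex_lock(&q->lock);
	if (spare && !q->node_cache) {
		q->node_cache = spare;
		spare = NULL;
	}
	pthread_mutex_unlock(&q->lock);

	free(spare);	/* free(NULL) is a no-op */
}
```

A queue that alternates between zero and one message then reuses the same node on every send instead of allocating and freeing each time, which is the common path the commit message targets.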
2007-02-12 16:55:39 +08:00
static const struct inode_operations mqueue_dir_inode_operations;
2007-02-12 16:55:35 +08:00
static const struct file_operations mqueue_file_operations;
2009-09-22 08:01:09 +08:00
static const struct super_operations mqueue_super_ops;
2005-04-17 06:20:36 +08:00
static void remove_notification(struct mqueue_inode_info *info);

2006-12-07 12:33:20 +08:00
static struct kmem_cache *mqueue_inode_cachep;
2005-04-17 06:20:36 +08:00

static struct ctl_table_header *mq_sysctl_table;

static inline struct mqueue_inode_info *MQUEUE_I(struct inode *inode)
{
	return container_of(inode, struct mqueue_inode_info, vfs_inode);
}

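`MQUEUE_I()` relies on the inode being embedded inside `struct mqueue_inode_info`, so a pointer to the inode can be converted back to its container by subtracting the member offset. The sketch below shows the same conversion in userspace; it assumes a simplified `container_of` built on `offsetof` (the kernel macro adds type checking), and the `demo_*` types are hypothetical miniatures of the real structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature types; the real structures carry many more fields. */
struct demo_inode {
	unsigned long i_ino;
};

struct demo_mqueue_info {
	int lock;			/* placeholder for spinlock_t */
	struct demo_inode vfs_inode;	/* embedded, not pointed to */
	unsigned long qsize;
};

/* Same shape as MQUEUE_I(): map the embedded inode back to its container. */
static inline struct demo_mqueue_info *DEMO_MQUEUE_I(struct demo_inode *inode)
{
	return container_of(inode, struct demo_mqueue_info, vfs_inode);
}
```

Because the inode is embedded rather than referenced by pointer, the conversion is pure pointer arithmetic and needs no lookup table or back-pointer.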
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 10:01:10 +08:00
/*
 * This routine should be called with the mq_lock held.
 */
static inline struct ipc_namespace *__get_ns_from_inode(struct inode *inode)
2009-04-07 10:01:08 +08:00
{
2009-04-07 10:01:10 +08:00
	return get_ipc_ns(inode->i_sb->s_fs_info);
2009-04-07 10:01:08 +08:00
}

2009-04-07 10:01:10 +08:00
static struct ipc_namespace *get_ns_from_inode(struct inode *inode)
2009-04-07 10:01:08 +08:00
{
2009-04-07 10:01:10 +08:00
	struct ipc_namespace *ns;

	spin_lock(&mq_lock);
	ns = __get_ns_from_inode(inode);
	spin_unlock(&mq_lock);
	return ns;
2009-04-07 10:01:08 +08:00
}

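`get_ns_from_inode()` takes its reference under `mq_lock`, and the namespaces commit message pairs that with `atomic_dec_and_lock(&ns->count, mq_lock)` on the release side, so a lookup cannot race with the final put. The following is a rough userspace sketch of that pairing, assuming C11 atomics and a pthread mutex in place of the kernel primitives. The `demo_*` names and the `slot` pointer (standing in for `sb->s_fs_info`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

struct demo_ns {
	atomic_int count;
};

/*
 * Drop one reference. Returns true, with demo_lock held, only when the
 * count reached zero; the caller then clears the lookup slot and unlocks.
 */
static bool demo_dec_and_lock(struct demo_ns *ns)
{
	int old = atomic_load(&ns->count);

	/* Fast path: clearly not the last reference, no lock needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(&ns->count, &old, old - 1))
			return false;
	}

	/* Slow path: possibly the last reference, so decide under the lock. */
	pthread_mutex_lock(&demo_lock);
	if (atomic_fetch_sub(&ns->count, 1) == 1)
		return true;
	pthread_mutex_unlock(&demo_lock);
	return false;
}

/* Lookup side: take a reference only while holding the same lock. */
static struct demo_ns *demo_get_ns(struct demo_ns **slot)
{
	struct demo_ns *ns;

	pthread_mutex_lock(&demo_lock);
	ns = *slot;
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	pthread_mutex_unlock(&demo_lock);
	return ns;
}
```

Since the count can only drop to zero while the lock is held, a lookup that holds the lock sees either a live object or a slot that the releaser has already cleared.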
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed to, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple of conditions are known to be worst case for the old system, and for
some routines, so this tests all of them.
So, on to the results already:
Subsystem/Test                      Old                  New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
  Running mqueue test        user 49m10.744s      user 45m26.294s
                              sys  5m51.924s       sys  4m59.894s
                            total 55m02.668s     total 50m26.188s

  Running fake test          user 45m32.686s      user 45m18.552s
                              sys  5m12.465s       sys  4m56.468s
                            total 50m45.151s     total 50m15.020s

  % slowdown from mqueue
    cache thrashing               ~8%                  ~.5%

Avg time to send/recv (in nanoseconds per message)
  when queue empty              305/288              349/318
  when queue full (65528 messages)
    constant priority        526589/823              362/314
    increasing priority      403105/916              495/445
    decreasing priority       73420/594              482/409
    random priority          280147/920              546/436

Time to fill/drain queue (65528 messages, in seconds)
  constant priority           17.37/.12              .13/.12
  increasing priority          4.14/.14              .21/.18
  decreasing priority         12.93/.13              .21/.18
  random priority              8.88/.16              .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by caching at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow-on patches.
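The data structure this message argues for (one tree node per priority, each holding a FIFO list of messages) can be sketched in plain userspace C. This is a simplified illustration, not the kernel code: `pq`, `prio_node`, `pq_insert()` and `pq_get()` are hypothetical names, and an unbalanced binary search tree is used where the kernel uses a red-black tree.

```c
#include <assert.h>
#include <stdlib.h>

struct msg {
	struct msg *next;
	int payload;
};

/* One node per priority in use; messages at that priority queue FIFO. */
struct prio_node {
	struct prio_node *left, *right;
	unsigned int priority;
	struct msg *head, *tail;
};

struct pq {
	struct prio_node *root;
	unsigned long count;
};

/* Find (or create) the node for prio, then append to its FIFO. */
static int pq_insert(struct pq *q, unsigned int prio, int payload)
{
	struct prio_node **p = &q->root, *n;
	struct msg *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->next = NULL;
	m->payload = payload;

	/* Low priorities go left, high priorities go right. */
	while (*p && (*p)->priority != prio)
		p = prio < (*p)->priority ? &(*p)->left : &(*p)->right;
	n = *p;
	if (!n) {
		n = calloc(1, sizeof(*n));
		if (!n) {
			free(m);
			return -1;
		}
		n->priority = prio;
		*p = n;
	}
	if (n->tail)
		n->tail->next = m;
	else
		n->head = m;
	n->tail = m;
	q->count++;
	return 0;
}

/* Pop the oldest message of the highest priority; -1 if empty. */
static int pq_get(struct pq *q, int *payload)
{
	struct prio_node **p = &q->root, *n;
	struct msg *m;

	if (!*p)
		return -1;
	while ((*p)->right)	/* highest priority is the rightmost node */
		p = &(*p)->right;
	n = *p;
	m = n->head;
	n->head = m->next;
	if (!n->head) {
		/* Node drained: unlink it (it has no right child). */
		*p = n->left;
		free(n);
	}
	*payload = m->payload;
	free(m);
	q->count--;
	return 0;
}
```

With a single priority in use, the tree holds one node and each send is a list append, which is the common case the numbers above are about.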
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:35 +08:00
/* Auxiliary functions to manipulate messages' list */
static int msg_insert(struct msg_msg *msg, struct mqueue_inode_info *info)
{
	struct rb_node **p, *parent = NULL;
	struct posix_msg_tree_node *leaf;

	p = &info->msg_tree.rb_node;
	while (*p) {
		parent = *p;
		leaf = rb_entry(parent, struct posix_msg_tree_node, rb_node);

		if (likely(leaf->priority == msg->m_type))
			goto insert_msg;
		else if (msg->m_type < leaf->priority)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it drastically sped up the case where the queue was very
full compared with the original code. It did, however, slow down the case
where the queue was empty (though not drastically).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
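The three points above can be sketched in userspace C as follows. This is a minimal illustration under stated assumptions: `queue_sketch`, `cache_prefill()`, `node_take()` and `node_put()` are hypothetical names, `malloc()` stands in for both the GFP_KERNEL and GFP_ATOMIC allocations, and an `atomic_flag` spinlock stands in for the queue's spin lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node_sketch {
	unsigned int priority;
};

struct queue_sketch {
	atomic_flag lock;
	struct node_sketch *node_cache;	/* at most one spare node */
	unsigned long fallback_allocs;	/* allocations made under the lock */
};

static void q_lock(struct queue_sketch *q)
{
	while (atomic_flag_test_and_set(&q->lock))
		;	/* spin */
}

static void q_unlock(struct queue_sketch *q)
{
	atomic_flag_clear(&q->lock);
}

/* Speculatively allocate outside the lock, publish under it. */
static void cache_prefill(struct queue_sketch *q)
{
	struct node_sketch *spare = NULL;

	/* Unlocked peek, as in the description; rechecked under the lock. */
	if (!q->node_cache)
		spare = malloc(sizeof(*spare));

	q_lock(q);
	if (!q->node_cache && spare) {
		q->node_cache = spare;
		spare = NULL;
	}
	q_unlock(q);

	free(spare);	/* lost the race or not needed; free(NULL) is a no-op */
}

/* Caller holds the lock: prefer the cached node, else fall back. */
static struct node_sketch *node_take(struct queue_sketch *q, unsigned int prio)
{
	struct node_sketch *n = q->node_cache;

	if (n) {
		q->node_cache = NULL;
	} else {
		n = malloc(sizeof(*n));	/* the fallback allocation */
		if (!n)
			return NULL;
		q->fallback_allocs++;
	}
	n->priority = prio;	/* fully initialized on first use */
	return n;
}

/* Caller holds the lock: keep one freed node around for reuse. */
static void node_put(struct queue_sketch *q, struct node_sketch *n)
{
	if (q->node_cache)
		free(n);
	else
		q->node_cache = n;
}
```

A queue that bounces between 0 and 1 messages keeps handing the same node back and forth through `node_cache`, so neither path allocates in the steady state.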
The performance changes are:
                Old mqueue      new mqueue      new mqueue + caching
queue empty
send/recv       305/288ns       349/318ns       310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old method's abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:38 +08:00
	if (info->node_cache) {
		leaf = info->node_cache;
		info->node_cache = NULL;
	} else {
		leaf = kmalloc(sizeof(*leaf), GFP_ATOMIC);
		if (!leaf)
			return -ENOMEM;
		INIT_LIST_HEAD(&leaf->msg_list);
		info->qsize += sizeof(*leaf);
	}
ipc/mqueue: improve performance of send/recv
2012-06-01 07:26:35 +08:00
	leaf->priority = msg->m_type;
	rb_link_node(&leaf->rb_node, parent, p);
	rb_insert_color(&leaf->rb_node, &info->msg_tree);
insert_msg:
	info->attr.mq_curmsgs++;
	info->qsize += msg->m_ts;
	list_add_tail(&msg->m_list, &leaf->msg_list);
	return 0;
}

static inline struct msg_msg *msg_get(struct mqueue_inode_info *info)
{
	struct rb_node **p, *parent = NULL;
	struct posix_msg_tree_node *leaf;
	struct msg_msg *msg;

try_again:
	p = &info->msg_tree.rb_node;
	while (*p) {
		parent = *p;
		/*
		 * During insert, low priorities go to the left and high to the
		 * right. On receive, we want the highest priorities first, so
		 * walk all the way to the right.
		 */
		p = &(*p)->rb_right;
	}
	if (!parent) {
		if (info->attr.mq_curmsgs) {
			pr_warn_once("Inconsistency in POSIX message queue, "
				     "no tree element, but supposedly messages "
				     "should exist!\n");
			info->attr.mq_curmsgs = 0;
		}
		return NULL;
	}
	leaf = rb_entry(parent, struct posix_msg_tree_node, rb_node);
ipc/mqueue: add rbtree node caching support
2012-06-01 07:26:38 +08:00
	if (unlikely(list_empty(&leaf->msg_list))) {
ipc/mqueue: improve performance of send/recv
2012-06-01 07:26:35 +08:00
		pr_warn_once("Inconsistency in POSIX message queue, "
			     "empty leaf node but we haven't implemented "
			     "lazy leaf delete!\n");
		rb_erase(&leaf->rb_node, &info->msg_tree);
ipc/mqueue: add rbtree node caching support
2012-06-01 07:26:38 +08:00
		if (info->node_cache) {
			info->qsize -= sizeof(*leaf);
			kfree(leaf);
		} else {
			info->node_cache = leaf;
		}
ipc/mqueue: improve performance of send/recv
2012-06-01 07:26:35 +08:00
		goto try_again;
	} else {
		msg = list_first_entry(&leaf->msg_list,
				       struct msg_msg, m_list);
		list_del(&msg->m_list);
		if (list_empty(&leaf->msg_list)) {
			rb_erase(&leaf->rb_node, &info->msg_tree);
ipc/mqueue: add rbtree node caching support
2012-06-01 07:26:38 +08:00
			if (info->node_cache) {
				info->qsize -= sizeof(*leaf);
				kfree(leaf);
			} else {
				info->node_cache = leaf;
			}
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed to, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test                      Old                   New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
  Running mqueue test         user  49m10.744s      user  45m26.294s
                              sys    5m51.924s      sys    4m59.894s
                              total 55m02.668s      total 50m26.188s

  Running fake test           user  45m32.686s      user  45m18.552s
                              sys    5m12.465s      sys    4m56.468s
                              total 50m45.151s      total 50m15.020s

  % slowdown from mqueue
  cache thrashing                   ~8%                   ~.5%

Avg time to send/recv (in nanoseconds per message)
  when queue empty                  305/288               349/318
  when queue full (65528 messages)
    constant priority               526589/823            362/314
    increasing priority             403105/916            495/445
    decreasing priority             73420/594             482/409
    random priority                 280147/920            546/436

Time to fill/drain queue (65528 messages, in seconds)
  constant priority                 17.37/.12             .13/.12
  increasing priority               4.14/.14              .21/.18
  decreasing priority               12.93/.13             .21/.18
  random priority                   8.88/.16              .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by caching at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
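To make the cost difference concrete, here is a minimal user-space sketch, not the kernel code. It assumes a simplified model: array_queue mimics the old sorted array, and bucket_queue stands in for the new per-priority structure. The actual patch uses an rbtree of per-priority nodes, which this sketch replaces with a fixed array of counters. All names and the SKETCH_* limits are hypothetical.

```c
#include <stddef.h>

#define SKETCH_MAX_MSGS 1024	/* hypothetical capacity for the sketch */
#define SKETCH_MAX_PRIO 32	/* hypothetical number of priority levels */

/* Old-style queue: one array kept sorted by priority, highest at the end. */
struct array_queue {
	unsigned int prio[SKETCH_MAX_MSGS];
	size_t count;
};

/*
 * Insert by shifting every entry with priority >= the new one up by one
 * slot. Returns the number of entries moved, or (size_t)-1 if full.
 */
static size_t array_queue_insert(struct array_queue *q, unsigned int prio)
{
	size_t k = q->count;
	size_t moved = 0;

	if (q->count == SKETCH_MAX_MSGS)
		return (size_t)-1;
	while (k > 0 && q->prio[k - 1] >= prio) {
		q->prio[k] = q->prio[k - 1];
		k--;
		moved++;
	}
	q->prio[k] = prio;
	q->count++;
	return moved;
}

/* New-style idea: one counter (standing in for a FIFO list) per priority. */
struct bucket_queue {
	size_t pending[SKETCH_MAX_PRIO];
	size_t count;
};

/* Insert touches only the bucket for this priority; nothing is moved. */
static size_t bucket_queue_insert(struct bucket_queue *q, unsigned int prio)
{
	if (prio >= SKETCH_MAX_PRIO || q->count == SKETCH_MAX_MSGS)
		return (size_t)-1;
	q->pending[prio]++;
	q->count++;
	return 0;
}
```

In this model, 100 inserts at a single priority move 0 + 1 + ... + 99 = 4950 entries in the array version and none in the bucket version, which is the quadratic versus linear fill cost the table reflects.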
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:35 +08:00
		}
	}
	info->attr.mq_curmsgs--;
	info->qsize -= msg->m_ts;
	return msg;
}

namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
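The rule above relies on a "decrement, and take the lock only if the count reaches zero" pattern. A rough user-space sketch of that pattern follows, assuming C11 atomics and a pthread mutex in place of the kernel's atomic_t and spinlock. The name dec_and_lock is hypothetical and is not the kernel function.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference. Returns true with 'lock' held if this was the last
 * reference, so the caller can tear the object down and then unlock.
 * Returns false, with the lock not held, otherwise.
 */
static bool dec_and_lock(atomic_int *count, pthread_mutex_t *lock)
{
	int old = atomic_load(count);

	/* Fast path: while count > 1, drop a reference without the lock. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return false;
	}

	/* Slow path: this may be the last reference, so serialize. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;	/* count is now 0, lock stays held */
	pthread_mutex_unlock(lock);
	return false;
}
```

A reader that reaches the object through a shared pointer takes the same lock before bumping the count, so it either sees the pointer already cleared or gets its reference before the final decrement can complete.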
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 10:01:10 +08:00
static struct inode *mqueue_get_inode(struct super_block *sb,
2011-07-25 02:18:20 +08:00
		struct ipc_namespace *ipc_ns, umode_t mode,
2009-04-07 10:01:10 +08:00
		struct mq_attr *attr)
2005-04-17 06:20:36 +08:00
{
2008-11-14 07:39:18 +08:00
	struct user_struct *u = current_user();
2005-04-17 06:20:36 +08:00
	struct inode *inode;
2011-07-27 07:08:47 +08:00
	int ret = -ENOMEM;
2005-04-17 06:20:36 +08:00

	inode = new_inode(sb);
2011-07-27 07:08:46 +08:00
	if (!inode)
		goto err;

	inode->i_ino = get_next_ino();
	inode->i_mode = mode;
	inode->i_uid = current_fsuid();
	inode->i_gid = current_fsgid();
	inode->i_mtime = inode->i_ctime = inode->i_atime = CURRENT_TIME;

	if (S_ISREG(mode)) {
		struct mqueue_inode_info *info;
2012-06-01 07:26:35 +08:00
		unsigned long mq_bytes, mq_treesize;
2011-07-27 07:08:46 +08:00

		inode->i_fop = &mqueue_file_operations;
		inode->i_size = FILENT_SIZE;
		/* mqueue specific info */
		info = MQUEUE_I(inode);
		spin_lock_init(&info->lock);
		init_waitqueue_head(&info->wait_q);
		INIT_LIST_HEAD(&info->e_wait_q[0].list);
		INIT_LIST_HEAD(&info->e_wait_q[1].list);
		info->notify_owner = NULL;
2011-11-17 14:57:55 +08:00
		info->notify_user_ns = NULL;
2011-07-27 07:08:46 +08:00
		info->qsize = 0;
		info->user = NULL;	/* set when all is ok */
2012-06-01 07:26:35 +08:00
		info->msg_tree = RB_ROOT;
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
                 Old mqueue      new mqueue      new mqueue + caching
queue empty
send/recv        305/288ns       349/318ns       310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old method's abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
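The single-slot node cache described in this message can be sketched in user space roughly as follows. This is an illustration under simplifying assumptions: malloc/free replace the kernel allocators, the spin lock and the qsize accounting are omitted, and tree_node, queue_info, node_get and node_put are hypothetical names.

```c
#include <stdlib.h>

struct tree_node {
	int priority;		/* stand-in for the real node contents */
};

struct queue_info {
	struct tree_node *node_cache;	/* at most one spare node */
};

/* Take a node: reuse the cached one if present, otherwise allocate. */
static struct tree_node *node_get(struct queue_info *info)
{
	struct tree_node *n = info->node_cache;

	if (n)
		info->node_cache = NULL;
	else
		n = malloc(sizeof(*n));
	return n;
}

/* Release a node: keep it as the spare if the slot is empty, else free it. */
static void node_put(struct queue_info *info, struct tree_node *n)
{
	if (info->node_cache)
		free(n);
	else
		info->node_cache = n;
}
```

A queue that bounces between 0 and 1 messages then alternates node_put and node_get on the same node and never reaches the allocator after the first message.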
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:38 +08:00
		info->node_cache = NULL;
2011-07-27 07:08:46 +08:00
		memset(&info->attr, 0, sizeof(info->attr));
2012-06-01 07:26:33 +08:00
		info->attr.mq_maxmsg = min(ipc_ns->mq_msg_max,
					   ipc_ns->mq_msg_default);
		info->attr.mq_msgsize = min(ipc_ns->mq_msgsize_max,
					    ipc_ns->mq_msgsize_default);
2011-07-27 07:08:46 +08:00
		if (attr) {
			info->attr.mq_maxmsg = attr->mq_maxmsg;
			info->attr.mq_msgsize = attr->mq_msgsize;
		}
2012-06-01 07:26:35 +08:00
		/*
		 * We used to allocate a static array of pointers and account
		 * the size of that array as well as one msg_msg struct per
		 * possible message into the queue size. That's no longer
		 * accurate as the queue is now an rbtree and will grow and
		 * shrink depending on usage patterns. We can, however, still
		 * account one msg_msg struct per message, but the nodes are
		 * allocated depending on priority usage, and most programs
		 * only use one, or a handful, of priorities. However, since
		 * this is pinned memory, we need to assume worst case, so
		 * that means the min(mq_maxmsg, max_priorities) * struct
		 * posix_msg_tree_node.
		 */
		mq_treesize = info->attr.mq_maxmsg * sizeof(struct msg_msg) +
			min_t(unsigned int, info->attr.mq_maxmsg, MQ_PRIO_MAX) *
			sizeof(struct posix_msg_tree_node);
2011-07-27 07:08:46 +08:00

2012-06-01 07:26:35 +08:00
		mq_bytes = mq_treesize + (info->attr.mq_maxmsg *
					  info->attr.mq_msgsize);
2005-04-17 06:20:36 +08:00

2011-07-27 07:08:46 +08:00
		spin_lock(&mq_lock);
		if (u->mq_bytes + mq_bytes < u->mq_bytes ||
2012-01-21 06:34:01 +08:00
		    u->mq_bytes + mq_bytes > rlimit(RLIMIT_MSGQUEUE)) {
2011-07-27 07:08:46 +08:00
			spin_unlock(&mq_lock);
			/* mqueue_evict_inode() releases info->messages */
2011-07-27 07:08:47 +08:00
			ret = -EMFILE;
2011-07-27 07:08:46 +08:00
			goto out_inode;
2005-04-17 06:20:36 +08:00
		}
2011-07-27 07:08:46 +08:00
		u->mq_bytes += mq_bytes;
		spin_unlock(&mq_lock);

		/* all is ok */
		info->user = get_uid(u);
	} else if (S_ISDIR(mode)) {
		inc_nlink(inode);
		/* Some things misbehave if size == 0 on a directory */
		inode->i_size = 2 * DIRENT_SIZE;
		inode->i_op = &mqueue_dir_inode_operations;
		inode->i_fop = &simple_dir_operations;
2005-04-17 06:20:36 +08:00
	}
2011-07-27 07:08:46 +08:00

2005-04-17 06:20:36 +08:00
	return inode;
out_inode:
	iput(inode);
2011-07-27 07:08:46 +08:00
err:
2011-07-27 07:08:47 +08:00
	return ERR_PTR(ret);
2005-04-17 06:20:36 +08:00
}
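The worst-case accounting and the limit check in mqueue_get_inode() above can be illustrated with a small user-space sketch. It is only an approximation: the SKETCH_* sizes are made-up stand-ins for sizeof(struct msg_msg) and sizeof(struct posix_msg_tree_node), SKETCH_PRIO_MAX is assumed to match MQ_PRIO_MAX, and queue_bytes and charge are hypothetical helper names.

```c
#include <stdbool.h>

/* Hypothetical stand-in sizes; the kernel uses sizeof() on its own structs. */
#define SKETCH_MSG_HDR   48UL
#define SKETCH_NODE_SIZE 56UL
#define SKETCH_PRIO_MAX  32768UL	/* assumed value of MQ_PRIO_MAX */

/* Worst-case pinned bytes for a queue of 'maxmsg' messages of 'msgsize'. */
static unsigned long queue_bytes(unsigned long maxmsg, unsigned long msgsize)
{
	/* One tree node per priority in use, at most one per message. */
	unsigned long nodes = maxmsg < SKETCH_PRIO_MAX ? maxmsg : SKETCH_PRIO_MAX;
	unsigned long treesize = maxmsg * SKETCH_MSG_HDR +
				 nodes * SKETCH_NODE_SIZE;

	return treesize + maxmsg * msgsize;
}

/* Charge 'bytes' against '*used' unless the sum wraps or exceeds 'limit'. */
static bool charge(unsigned long *used, unsigned long bytes,
		   unsigned long limit)
{
	if (*used + bytes < *used || *used + bytes > limit)
		return false;
	*used += bytes;
	return true;
}
```

The first comparison in charge() catches unsigned wraparound, which the second comparison alone would miss: a huge request can wrap the sum to a small value that is below the limit.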

static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
{
	struct inode *inode;
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 10:01:10 +08:00
|
|
|
struct ipc_namespace *ns = data;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
sb->s_blocksize = PAGE_CACHE_SIZE;
|
|
|
|
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
|
|
|
|
sb->s_magic = MQUEUE_MAGIC;
|
|
|
|
sb->s_op = &mqueue_super_ops;
|
|
|
|
|
2012-01-09 11:15:13 +08:00
|
|
|
inode = mqueue_get_inode(sb, ns, S_IFDIR | S_ISVTX | S_IRWXUGO, NULL);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-01-09 11:15:13 +08:00
|
|
|
sb->s_root = d_make_root(inode);
|
|
|
|
if (!sb->s_root)
|
|
|
|
return -ENOMEM;
|
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
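mqueue_fill_super() above relies on the error-pointer convention: mqueue_get_inode() returns either a valid pointer or an errno value encoded with ERR_PTR(), and the caller separates the two with IS_ERR() and PTR_ERR(). The following is a userspace approximation of that convention, not the kernel's own definitions. The lower-case helper names and `get_object` are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the kernel's MAX_ERRNO: the largest errno that can be encoded. */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error codes -1..-MAX_ERRNO land in the top MAX_ERRNO addresses. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator that reports failure the way mqueue_get_inode() does. */
static void *get_object(int fail)
{
	static int object;

	return fail ? err_ptr(-ENOMEM) : (void *)&object;
}
```

A caller checks `is_err(p)` and returns `ptr_err(p)` on failure, which is the shape of the `IS_ERR(inode)` / `PTR_ERR(inode)` pair in mqueue_fill_super().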
|
|
|
|
|
2010-07-26 17:16:50 +08:00
|
|
|
static struct dentry *mqueue_mount(struct file_system_type *fs_type,
|
[PATCH] VFS: Permit filesystem to override root dentry on mount
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 17:02:57 +08:00
|
|
|
int flags, const char *dev_name,
|
2010-07-26 17:16:50 +08:00
|
|
|
void *data)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2009-04-07 10:01:10 +08:00
|
|
|
if (!(flags & MS_KERNMOUNT))
|
|
|
|
data = current->nsproxy->ipc_ns;
|
2010-07-26 17:16:50 +08:00
|
|
|
return mount_ns(fs_type, flags, data, mqueue_fill_super);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
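The namespace commit message quoted in this file says the ipc namespace refcount became an atomic_t so that atomic_dec_and_lock(&ns->count, mq_lock) could be used. The idea is that the final reference drop happens with the lock held, so a reader that takes the same lock cannot race with the teardown. Below is a userspace sketch of those semantics using C11 atomics. `struct toy_lock`, its helpers, and `dec_and_lock` are hypothetical names, and the spinlock is a toy substitute for the kernel's.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for the kernel's spinlock_t. */
struct toy_lock {
	atomic_flag held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void toy_lock_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Drop one reference. Returns true with the lock held if the count
 * reached zero, and false with the lock not held otherwise.
 */
static bool dec_and_lock(atomic_int *count, struct toy_lock *l)
{
	int old = atomic_load(count);

	/* Fast path: not the last reference, so no lock is needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return false;
	}

	/* Slow path: this may be the last reference; decide under the lock. */
	toy_lock_acquire(l);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;
	toy_lock_release(l);
	return false;
}
```

When `dec_and_lock()` returns true, the caller tears the object down and then releases the lock itself.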
|
|
|
|
|
2008-07-26 10:45:34 +08:00
|
|
|
static void init_once(void *foo)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct mqueue_inode_info *p = (struct mqueue_inode_info *) foo;
|
|
|
|
|
2007-05-17 13:10:57 +08:00
|
|
|
inode_init_once(&p->vfs_inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct inode *mqueue_alloc_inode(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct mqueue_inode_info *ei;
|
|
|
|
|
2006-12-07 12:33:17 +08:00
|
|
|
ei = kmem_cache_alloc(mqueue_inode_cachep, GFP_KERNEL);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!ei)
|
|
|
|
return NULL;
|
|
|
|
return &ei->vfs_inode;
|
|
|
|
}
|
|
|
|
|
2011-01-07 14:49:49 +08:00
|
|
|
static void mqueue_i_callback(struct rcu_head *head)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2011-01-07 14:49:49 +08:00
|
|
|
struct inode *inode = container_of(head, struct inode, i_rcu);
|
2005-04-17 06:20:36 +08:00
|
|
|
kmem_cache_free(mqueue_inode_cachep, MQUEUE_I(inode));
|
|
|
|
}
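mqueue_i_callback() above recovers the inode from its embedded rcu_head with container_of(), and MQUEUE_I() maps the inode back to its mqueue_inode_info in the same way. A simplified userspace sketch of that pattern follows. The `demo_*` types are hypothetical, and the macro leaves out the type checking that the kernel's version performs.

```c
#include <stddef.h>

/*
 * Simplified container_of: step back from a member pointer to the start
 * of the enclosing structure.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct inode. */
struct demo_inode {
	int i_mode;
};

/* Hypothetical stand-in for struct mqueue_inode_info. */
struct demo_info {
	int lock;
	struct demo_inode vfs_inode;	/* embedded, not pointed to */
};

/* Analogue of MQUEUE_I(): map the embedded inode back to its container. */
static struct demo_info *demo_info_of(struct demo_inode *inode)
{
	return container_of(inode, struct demo_info, vfs_inode);
}
```

Because the inode is embedded rather than referenced, a single slab allocation covers both structures. That is why mqueue_alloc_inode() returns `&ei->vfs_inode` and the RCU callback frees `MQUEUE_I(inode)`.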
|
|
|
|
|
2011-01-07 14:49:49 +08:00
|
|
|
static void mqueue_destroy_inode(struct inode *inode)
|
|
|
|
{
|
|
|
|
call_rcu(&inode->i_rcu, mqueue_i_callback);
|
|
|
|
}
|
|
|
|
|
2010-06-06 04:29:45 +08:00
|
|
|
static void mqueue_evict_inode(struct inode *inode)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
struct user_struct *user;
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed to, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test                        Old                   New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
  Running mqueue test          user  49m10.744s      user  45m26.294s
                               sys    5m51.924s      sys    4m59.894s
                               total 55m02.668s      total 50m26.188s

  Running fake test            user  45m32.686s      user  45m18.552s
                               sys    5m12.465s      sys    4m56.468s
                               total 50m45.151s      total 50m15.020s

  % slowdown from mqueue
  cache thrashing                    ~8%                   ~.5%

Avg time to send/recv (in nanoseconds per message)
  when queue empty                   305/288               349/318
  when queue full (65528 messages)
    constant priority             526589/823               362/314
    increasing priority           403105/916               495/445
    decreasing priority            73420/594               482/409
    random priority               280147/920               546/436

Time to fill/drain queue (65528 messages, in seconds)
  constant priority                17.37/.12               .13/.12
  increasing priority               4.14/.14               .21/.18
  decreasing priority              12.93/.13               .21/.18
  random priority                   8.88/.16               .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:35 +08:00
|
|
|
unsigned long mq_bytes, mq_treesize;
|
2009-04-07 10:01:10 +08:00
|
|
|
struct ipc_namespace *ipc_ns;
|
2012-06-01 07:26:35 +08:00
|
|
|
struct msg_msg *msg;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-05-03 20:48:02 +08:00
|
|
|
clear_inode(inode);
|
2010-06-06 04:29:45 +08:00
|
|
|
|
|
|
|
if (S_ISDIR(inode->i_mode))
|
2005-04-17 06:20:36 +08:00
|
|
|
return;
|
2010-06-06 04:29:45 +08:00
|
|
|
|
2009-04-07 10:01:10 +08:00
|
|
|
ipc_ns = get_ns_from_inode(inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
spin_lock(&info->lock);
|
2012-06-01 07:26:35 +08:00
|
|
|
while ((msg = msg_get(info)) != NULL)
|
|
|
|
free_msg(msg);
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
                 Old mqueue     new mqueue     new mqueue + caching
queue empty
send/recv        305/288ns      349/318ns      310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old method's abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:38 +08:00
|
|
|
kfree(info->node_cache);
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
2010-02-23 15:04:24 +08:00
|
|
|
/* Total amount of bytes accounted for the mqueue */
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test                        Old                     New

Time to compile linux kernel (make -j12 on a 6 core CPU)
  Running mqueue test          user  49m10.744s        user  45m26.294s
                               sys    5m51.924s        sys    4m59.894s
                               total 55m02.668s        total 50m26.188s
  Running fake test            user  45m32.686s        user  45m18.552s
                               sys    5m12.465s        sys    4m56.468s
                               total 50m45.151s        total 50m15.020s
  % slowdown from mqueue
    cache thrashing                   ~8%                     ~.5%

Avg time to send/recv (in nanoseconds per message)
  when queue empty                  305/288                 349/318
  when queue full (65528 messages)
    constant priority            526589/823                 362/314
    increasing priority          403105/916                 495/445
    decreasing priority           73420/594                 482/409
    random priority              280147/920                 546/436

Time to fill/drain queue (65528 messages, in seconds)
  constant priority               17.37/.12                 .13/.12
  increasing priority              4.14/.14                 .21/.18
  decreasing priority             12.93/.13                 .21/.18
  random priority                  8.88/.16                 .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by caching at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can go in follow-on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
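The cost described in this message can be illustrated with a small userspace sketch. It is not the kernel's msg_insert; sketch_msg and sketch_array_insert are hypothetical names, and the only assumption carried over from the text is that messages sit in a pointer array that is shifted one element at a time, with each pointer dereferenced to compare priorities.

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued message; only the priority matters. */
struct sketch_msg {
	unsigned int prio;
};

/*
 * Insert msg into a pointer array kept sorted by ascending priority.
 * Every element whose priority is >= the new one is moved up one slot,
 * and each comparison dereferences a pointer from the array.
 * The caller must provide room for one more slot.
 * Returns the number of elements that had to be moved.
 */
size_t sketch_array_insert(struct sketch_msg **slots, size_t *count,
			   struct sketch_msg *msg)
{
	size_t k = *count;
	size_t moved = 0;

	while (k > 0 && slots[k - 1]->prio >= msg->prio) {
		slots[k] = slots[k - 1];	/* shift one element up */
		k--;
		moved++;
	}
	slots[k] = msg;
	(*count)++;
	return moved;
}
```

In this sketch, a message whose priority equals everything already queued moves every existing element, so filling a queue with n equal-priority messages costs n(n-1)/2 moves in total. For n = 65,528 that is roughly 2.1 billion moves, which is the shape of the "constant priority" worst case reported above.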
2012-06-01 07:26:35 +08:00
|
|
|
mq_treesize = info->attr.mq_maxmsg * sizeof(struct msg_msg) +
|
|
|
|
min_t(unsigned int, info->attr.mq_maxmsg, MQ_PRIO_MAX) *
|
|
|
|
sizeof(struct posix_msg_tree_node);
|
|
|
|
|
|
|
|
mq_bytes = mq_treesize + (info->attr.mq_maxmsg *
|
|
|
|
info->attr.mq_msgsize);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
user = info->user;
|
|
|
|
if (user) {
|
|
|
|
spin_lock(&mq_lock);
|
|
|
|
user->mq_bytes -= mq_bytes;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 10:01:10 +08:00
|
|
|
/*
|
|
|
|
* get_ns_from_inode() ensures that the
|
|
|
|
* (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
|
|
|
|
* to which we now hold a reference, or it is NULL.
|
|
|
|
* We can't put it here under mq_lock, though.
|
|
|
|
*/
|
|
|
|
if (ipc_ns)
|
|
|
|
ipc_ns->mq_queues_count--;
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_unlock(&mq_lock);
|
|
|
|
free_uid(user);
|
|
|
|
}
|
2009-04-07 10:01:10 +08:00
|
|
|
if (ipc_ns)
|
|
|
|
put_ipc_ns(ipc_ns);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
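The byte accounting at the top of this function can be checked by hand with a userspace sketch. The function name, the SKETCH_PRIO_MAX value and the two structure sizes passed in are assumptions for illustration; the real sizes of struct msg_msg and struct posix_msg_tree_node depend on the kernel build.

```c
#include <stddef.h>

/* Assumed stand-in for MQ_PRIO_MAX; the value is an assumption here. */
#define SKETCH_PRIO_MAX 32768UL

/*
 * Bytes charged to a user for one queue:
 *   per-message headers + at most one tree node per distinct priority
 *   + the message payload area.
 * msg_hdr_size and tree_node_size are caller-supplied placeholders for
 * the sizeof() values used in the kernel code.
 */
unsigned long sketch_mq_bytes(unsigned long maxmsg, unsigned long msgsize,
			      size_t msg_hdr_size, size_t tree_node_size)
{
	unsigned long nodes = maxmsg < SKETCH_PRIO_MAX ? maxmsg : SKETCH_PRIO_MAX;
	unsigned long treesize = maxmsg * msg_hdr_size + nodes * tree_node_size;

	return treesize + maxmsg * msgsize;
}
```

With made-up sizes of 48 and 40 bytes, a queue of 10 messages of 8192 bytes each would be charged 10*48 + 10*40 + 10*8192 = 82800 bytes. The same amount has to be subtracted when the queue goes away, which is why the calculation is repeated here rather than only at creation time.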
|
|
|
|
|
|
|
|
static int mqueue_create(struct inode *dir, struct dentry *dentry,
|
2012-06-11 06:05:36 +08:00
|
|
|
umode_t mode, bool excl)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
struct mq_attr *attr = dentry->d_fsdata;
|
|
|
|
int error;
|
2009-04-07 10:01:10 +08:00
|
|
|
struct ipc_namespace *ipc_ns;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
spin_lock(&mq_lock);
|
2009-04-07 10:01:10 +08:00
|
|
|
ipc_ns = __get_ns_from_inode(dir);
|
|
|
|
if (!ipc_ns) {
|
|
|
|
error = -EACCES;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2012-06-01 07:26:29 +08:00
|
|
|
if (ipc_ns->mq_queues_count >= HARD_QUEUESMAX ||
|
|
|
|
(ipc_ns->mq_queues_count >= ipc_ns->mq_queues_max &&
|
|
|
|
!capable(CAP_SYS_RESOURCE))) {
|
2005-04-17 06:20:36 +08:00
|
|
|
error = -ENOSPC;
|
2009-04-07 10:01:08 +08:00
|
|
|
goto out_unlock;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2009-04-07 10:01:08 +08:00
|
|
|
ipc_ns->mq_queues_count++;
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_unlock(&mq_lock);
|
|
|
|
|
2009-04-07 10:01:10 +08:00
|
|
|
inode = mqueue_get_inode(dir->i_sb, ipc_ns, mode, attr);
|
2011-07-27 07:08:47 +08:00
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
error = PTR_ERR(inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_lock(&mq_lock);
|
2009-04-07 10:01:08 +08:00
|
|
|
ipc_ns->mq_queues_count--;
|
|
|
|
goto out_unlock;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2009-04-07 10:01:10 +08:00
|
|
|
put_ipc_ns(ipc_ns);
|
2005-04-17 06:20:36 +08:00
|
|
|
dir->i_size += DIRENT_SIZE;
|
|
|
|
dir->i_ctime = dir->i_mtime = dir->i_atime = CURRENT_TIME;
|
|
|
|
|
|
|
|
d_instantiate(dentry, inode);
|
|
|
|
dget(dentry);
|
|
|
|
return 0;
|
2009-04-07 10:01:08 +08:00
|
|
|
out_unlock:
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_unlock(&mq_lock);
|
2009-04-07 10:01:10 +08:00
|
|
|
if (ipc_ns)
|
|
|
|
put_ipc_ns(ipc_ns);
|
2005-04-17 06:20:36 +08:00
|
|
|
return error;
|
|
|
|
}
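The queue-count check in mqueue_create() combines a hard ceiling with a soft, per-namespace limit that a privileged caller may exceed. A minimal userspace sketch of that decision follows; sketch_queue_limit_check and its parameters are hypothetical names, and the boolean stands in for capable(CAP_SYS_RESOURCE).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether one more queue may be created.
 *   count      queues that already exist in the namespace
 *   soft_max   tunable limit (mq_queues_max in the code above)
 *   hard_max   absolute ceiling (HARD_QUEUESMAX in the code above)
 *   privileged stand-in for capable(CAP_SYS_RESOURCE)
 * Returns 0 when creation may proceed, -ENOSPC otherwise.
 */
int sketch_queue_limit_check(unsigned int count, unsigned int soft_max,
			     unsigned int hard_max, bool privileged)
{
	if (count >= hard_max)
		return -ENOSPC;		/* nobody may pass the hard ceiling */
	if (count >= soft_max && !privileged)
		return -ENOSPC;		/* soft limit binds unprivileged callers */
	return 0;
}
```

Note that in the kernel code the counter is incremented under mq_lock before the inode is allocated and decremented again if allocation fails, so the check and the reservation happen atomically; the sketch covers only the decision itself.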
|
|
|
|
|
|
|
|
static int mqueue_unlink(struct inode *dir, struct dentry *dentry)
|
|
|
|
{
|
|
|
|
struct inode *inode = dentry->d_inode;
|
|
|
|
|
|
|
|
dir->i_ctime = dir->i_mtime = dir->i_atime = CURRENT_TIME;
|
|
|
|
dir->i_size -= DIRENT_SIZE;
|
2006-10-01 14:29:03 +08:00
|
|
|
drop_nlink(inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
dput(dentry);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is the routine that handles read() on a queue file.
|
|
|
|
* To avoid mess with doing here some sort of mq_receive we allow
|
|
|
|
* to read only queue size & notification info (the only values
|
|
|
|
* that are interesting from user point of view and aren't accessible
|
|
|
|
* through std routines)
|
|
|
|
*/
|
|
|
|
static ssize_t mqueue_read_file(struct file *filp, char __user *u_data,
|
2008-07-25 16:48:07 +08:00
|
|
|
size_t count, loff_t *off)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2006-12-08 18:37:11 +08:00
|
|
|
struct mqueue_inode_info *info = MQUEUE_I(filp->f_path.dentry->d_inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
char buffer[FILENT_SIZE];
|
2008-07-25 16:48:07 +08:00
|
|
|
ssize_t ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
snprintf(buffer, sizeof(buffer),
|
|
|
|
"QSIZE:%-10lu NOTIFY:%-5d SIGNO:%-5d NOTIFY_PID:%-6d\n",
|
|
|
|
info->qsize,
|
|
|
|
info->notify_owner ? info->notify.sigev_notify : 0,
|
|
|
|
(info->notify_owner &&
|
|
|
|
info->notify.sigev_notify == SIGEV_SIGNAL) ?
|
|
|
|
info->notify.sigev_signo : 0,
|
2008-02-08 20:19:20 +08:00
|
|
|
pid_vnr(info->notify_owner));
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
buffer[sizeof(buffer)-1] = '\0';
|
|
|
|
|
2008-07-25 16:48:07 +08:00
|
|
|
ret = simple_read_from_buffer(u_data, count, off, buffer,
|
|
|
|
strlen(buffer));
|
|
|
|
if (ret <= 0)
|
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-12-08 18:37:11 +08:00
|
|
|
filp->f_path.dentry->d_inode->i_atime = filp->f_path.dentry->d_inode->i_ctime = CURRENT_TIME;
|
2008-07-25 16:48:07 +08:00
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
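The status line built by mqueue_read_file() uses left-justified, fixed-width fields. The sketch below reuses the same format string in userspace so the layout can be inspected; sketch_format_status is a hypothetical helper, and the values passed to it are made up rather than read from a real queue.

```c
#include <stdio.h>

/*
 * Format a queue status line with the same format string as the code above.
 * Returns what snprintf() returns: the length the full line would have,
 * not counting the terminating NUL.
 */
int sketch_format_status(char *buf, size_t len, unsigned long qsize,
			 int notify, int signo, int notify_pid)
{
	return snprintf(buf, len,
			"QSIZE:%-10lu NOTIFY:%-5d SIGNO:%-5d NOTIFY_PID:%-6d\n",
			qsize, notify, signo, notify_pid);
}
```

As long as each value fits its field width, the line is always 60 characters long including the newline, which makes it easy to parse by column.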
|
|
|
|
|
2006-06-23 17:05:12 +08:00
|
|
|
static int mqueue_flush_file(struct file *filp, fl_owner_t id)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2006-12-08 18:37:11 +08:00
|
|
|
struct mqueue_inode_info *info = MQUEUE_I(filp->f_path.dentry->d_inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
2006-10-02 17:17:26 +08:00
|
|
|
if (task_tgid(current) == info->notify_owner)
|
2005-04-17 06:20:36 +08:00
|
|
|
remove_notification(info);
|
|
|
|
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned int mqueue_poll_file(struct file *filp, struct poll_table_struct *poll_tab)
|
|
|
|
{
|
2006-12-08 18:37:11 +08:00
|
|
|
struct mqueue_inode_info *info = MQUEUE_I(filp->f_path.dentry->d_inode);
|
2005-04-17 06:20:36 +08:00
|
|
|
int retval = 0;
|
|
|
|
|
|
|
|
poll_wait(filp, &info->wait_q, poll_tab);
|
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
if (info->attr.mq_curmsgs)
|
|
|
|
retval = POLLIN | POLLRDNORM;
|
|
|
|
|
|
|
|
if (info->attr.mq_curmsgs < info->attr.mq_maxmsg)
|
|
|
|
retval |= POLLOUT | POLLWRNORM;
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
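The readiness rule in mqueue_poll_file() is small enough to restate as a pure function. This is a userspace sketch with a hypothetical name, sketch_poll_mask; it leaves out the poll_wait() registration and the locking, and only shows how the mask is derived from the message counts.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* ask for POLLRDNORM/POLLWRNORM on strict compilers */
#endif
#include <poll.h>

/*
 * Readable when at least one message is queued,
 * writable when there is room for at least one more.
 */
int sketch_poll_mask(long curmsgs, long maxmsg)
{
	int mask = 0;

	if (curmsgs > 0)
		mask |= POLLIN | POLLRDNORM;
	if (curmsgs < maxmsg)
		mask |= POLLOUT | POLLWRNORM;
	return mask;
}
```

A partly filled queue therefore reports both readable and writable, an empty one only writable, and a full one only readable.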
|
|
|
|
|
|
|
|
/* Adds current to info->e_wait_q[sr] before element with smaller prio */
|
|
|
|
static void wq_add(struct mqueue_inode_info *info, int sr,
|
|
|
|
struct ext_wait_queue *ewp)
|
|
|
|
{
|
|
|
|
struct ext_wait_queue *walk;
|
|
|
|
|
|
|
|
ewp->task = current;
|
|
|
|
|
|
|
|
list_for_each_entry(walk, &info->e_wait_q[sr].list, list) {
|
|
|
|
if (walk->task->static_prio <= current->static_prio) {
|
|
|
|
list_add_tail(&ewp->list, &walk->list);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
list_add_tail(&ewp->list, &info->e_wait_q[sr].list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Puts current task to sleep. Caller must hold queue lock. After return
|
|
|
|
* lock isn't held.
|
|
|
|
* sr: SEND or RECV
|
|
|
|
*/
|
|
|
|
static int wq_sleep(struct mqueue_inode_info *info, int sr,
|
2010-04-03 04:40:20 +08:00
|
|
|
ktime_t *timeout, struct ext_wait_queue *ewp)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int retval;
|
|
|
|
signed long time;
|
|
|
|
|
|
|
|
wq_add(info, sr, ewp);
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
set_current_state(TASK_INTERRUPTIBLE);
|
|
|
|
|
|
|
|
spin_unlock(&info->lock);
|
2011-11-01 08:06:35 +08:00
|
|
|
time = schedule_hrtimeout_range_clock(timeout, 0,
|
|
|
|
HRTIMER_MODE_ABS, CLOCK_REALTIME);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
while (ewp->state == STATE_PENDING)
|
|
|
|
cpu_relax();
|
|
|
|
|
|
|
|
if (ewp->state == STATE_READY) {
|
|
|
|
retval = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
if (ewp->state == STATE_READY) {
|
|
|
|
retval = 0;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
if (signal_pending(current)) {
|
|
|
|
retval = -ERESTARTSYS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (time == 0) {
|
|
|
|
retval = -ETIMEDOUT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
list_del(&ewp->list);
|
|
|
|
out_unlock:
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
out:
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns waiting task that should be serviced first or NULL if none exists
|
|
|
|
*/
|
|
|
|
static struct ext_wait_queue *wq_get_first_waiter(
|
|
|
|
struct mqueue_inode_info *info, int sr)
|
|
|
|
{
|
|
|
|
struct list_head *ptr;
|
|
|
|
|
|
|
|
ptr = info->e_wait_q[sr].list.prev;
|
|
|
|
if (ptr == &info->e_wait_q[sr].list)
|
|
|
|
return NULL;
|
|
|
|
return list_entry(ptr, struct ext_wait_queue, list);
|
|
|
|
}
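wq_add() and wq_get_first_waiter() together keep the waiters ordered so that the task to serve first is always at the tail of the list. The userspace sketch below mirrors that ordering with a hand-rolled circular list; the sketch_* names are hypothetical, and prio plays the role of static_prio, where a lower value means a more urgent task.

```c
#include <stddef.h>

/* Hypothetical waiter record; lower prio value means more urgent. */
struct sketch_waiter {
	int prio;
	struct sketch_waiter *prev, *next;
};

/* Initialise an empty circular list; head is a sentinel, not a waiter. */
void sketch_wq_init(struct sketch_waiter *head)
{
	head->prev = head->next = head;
}

/* Link w immediately before pos. */
static void sketch_insert_before(struct sketch_waiter *w,
				 struct sketch_waiter *pos)
{
	w->prev = pos->prev;
	w->next = pos;
	pos->prev->next = w;
	pos->prev = w;
}

/*
 * Walk from the head and insert w before the first waiter that is at
 * least as urgent. If there is none, w is the most urgent waiter and
 * goes to the tail.
 */
void sketch_wq_add(struct sketch_waiter *head, struct sketch_waiter *w)
{
	struct sketch_waiter *walk;

	for (walk = head->next; walk != head; walk = walk->next) {
		if (walk->prio <= w->prio) {
			sketch_insert_before(w, walk);
			return;
		}
	}
	sketch_insert_before(w, head);
}

/* The waiter to serve first sits at the tail; NULL when the list is empty. */
struct sketch_waiter *sketch_wq_first(struct sketch_waiter *head)
{
	return head->prev == head ? NULL : head->prev;
}
```

Because a new waiter is placed before an existing one of equal priority, the older of two equal-priority waiters stays closer to the tail and is served first.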
|
|
|
|
|
|
|
|
|
|
|
|
static inline void set_cookie(struct sk_buff *skb, char code)
|
|
|
|
{
|
|
|
|
((char*)skb->data)[NOTIFY_COOKIE_LEN-1] = code;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The next function exists only to split up the overly long sys_mq_timedsend
|
|
|
|
*/
|
|
|
|
static void __do_notify(struct mqueue_inode_info *info)
|
|
|
|
{
|
|
|
|
/* notification
|
|
|
|
* invoked when there is registered process and there isn't process
|
|
|
|
* waiting synchronously for message AND state of queue changed from
|
|
|
|
* empty to not empty. Here we are sure that no one is waiting
|
|
|
|
* synchronously. */
|
|
|
|
if (info->notify_owner &&
|
|
|
|
info->attr.mq_curmsgs == 1) {
|
|
|
|
struct siginfo sig_i;
|
|
|
|
switch (info->notify.sigev_notify) {
|
|
|
|
case SIGEV_NONE:
|
|
|
|
break;
|
|
|
|
case SIGEV_SIGNAL:
|
|
|
|
/* sends signal */
|
|
|
|
|
|
|
|
sig_i.si_signo = info->notify.sigev_signo;
|
|
|
|
sig_i.si_errno = 0;
|
|
|
|
sig_i.si_code = SI_MESGQ;
|
|
|
|
sig_i.si_value = info->notify.sigev_value;
|
user namespace: make signal.c respect user namespaces
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)
__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)
do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace
ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.
Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.
A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.
Changelog:
Nov 18: previous patch missed a leading '&'
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo
There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 07:11:37 +08:00
|
|
|
/* map current pid/uid into info->owner's namespaces */
|
|
|
|
rcu_read_lock();
|
2009-01-08 10:08:50 +08:00
|
|
|
sig_i.si_pid = task_tgid_nr_ns(current,
|
|
|
|
ns_of_pid(info->notify_owner));
|
2012-03-15 06:24:19 +08:00
|
|
|
sig_i.si_uid = from_kuid_munged(info->notify_user_ns, current_uid());
|
2012-01-11 07:11:37 +08:00
|
|
|
rcu_read_unlock();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-10-02 17:17:26 +08:00
|
|
|
kill_pid_info(info->notify.sigev_signo,
|
|
|
|
&sig_i, info->notify_owner);
|
2005-04-17 06:20:36 +08:00
|
|
|
break;
|
|
|
|
case SIGEV_THREAD:
|
|
|
|
set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
|
2007-10-11 12:14:03 +08:00
|
|
|
netlink_sendskb(info->notify_sock, info->notify_cookie);
|
2005-04-17 06:20:36 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* after notification unregisters process */
|
2006-10-02 17:17:26 +08:00
|
|
|
put_pid(info->notify_owner);
|
2011-11-17 14:57:55 +08:00
|
|
|
put_user_ns(info->notify_user_ns);
|
2006-10-02 17:17:26 +08:00
|
|
|
info->notify_owner = NULL;
|
2011-11-17 14:57:55 +08:00
|
|
|
info->notify_user_ns = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
wake_up(&info->wait_q);
|
|
|
|
}
|
|
|
|
|
2010-04-03 04:40:20 +08:00
|
|
|
static int prepare_timeout(const struct timespec __user *u_abs_timeout,
|
|
|
|
ktime_t *expires, struct timespec *ts)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2010-04-03 04:40:20 +08:00
|
|
|
if (copy_from_user(ts, u_abs_timeout, sizeof(struct timespec)))
|
|
|
|
return -EFAULT;
|
|
|
|
if (!timespec_valid(ts))
|
|
|
|
return -EINVAL;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2010-04-03 04:40:20 +08:00
|
|
|
*expires = timespec_to_ktime(*ts);
|
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
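The validation step in prepare_timeout() can be approximated in user space. This is a minimal sketch, assuming the usual rule that an absolute timeout needs a non-negative seconds field and a nanoseconds field in [0, 1e9); `abs_timeout_valid` is a hypothetical name, not a kernel or libc function.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical user-space stand-in for the timespec_valid() check:
 * reject negative seconds and out-of-range nanoseconds.
 */
static bool abs_timeout_valid(const struct timespec *ts)
{
    return ts->tv_sec >= 0 &&
           ts->tv_nsec >= 0 &&
           ts->tv_nsec < 1000000000L;
}
```

A caller would run this check before passing the timespec to mq_timedsend() or mq_timedreceive(), which are expected to fail with EINVAL otherwise.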
|
|
|
|
|
|
|
|
static void remove_notification(struct mqueue_inode_info *info)
|
|
|
|
{
|
2006-10-02 17:17:26 +08:00
|
|
|
if (info->notify_owner != NULL &&
|
2005-04-17 06:20:36 +08:00
|
|
|
info->notify.sigev_notify == SIGEV_THREAD) {
|
|
|
|
set_cookie(info->notify_cookie, NOTIFY_REMOVED);
|
2007-10-11 12:14:03 +08:00
|
|
|
netlink_sendskb(info->notify_sock, info->notify_cookie);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2006-10-02 17:17:26 +08:00
|
|
|
put_pid(info->notify_owner);
|
2011-11-17 14:57:55 +08:00
|
|
|
put_user_ns(info->notify_user_ns);
|
2006-10-02 17:17:26 +08:00
|
|
|
info->notify_owner = NULL;
|
2011-11-17 14:57:55 +08:00
|
|
|
info->notify_user_ns = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2009-04-07 10:01:08 +08:00
|
|
|
static int mq_attr_ok(struct ipc_namespace *ipc_ns, struct mq_attr *attr)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-06-01 07:26:36 +08:00
|
|
|
int mq_treesize;
|
|
|
|
unsigned long total_size;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (attr->mq_maxmsg <= 0 || attr->mq_msgsize <= 0)
|
2012-06-01 07:26:36 +08:00
|
|
|
return -EINVAL;
|
2005-04-17 06:20:36 +08:00
|
|
|
if (capable(CAP_SYS_RESOURCE)) {
|
2012-06-01 07:26:29 +08:00
|
|
|
if (attr->mq_maxmsg > HARD_MSGMAX ||
|
|
|
|
attr->mq_msgsize > HARD_MSGSIZEMAX)
|
2012-06-01 07:26:36 +08:00
|
|
|
return -EINVAL;
|
2005-04-17 06:20:36 +08:00
|
|
|
} else {
|
2009-04-07 10:01:08 +08:00
|
|
|
if (attr->mq_maxmsg > ipc_ns->mq_msg_max ||
|
|
|
|
attr->mq_msgsize > ipc_ns->mq_msgsize_max)
|
2012-06-01 07:26:36 +08:00
|
|
|
return -EINVAL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
/* check for overflow */
|
|
|
|
if (attr->mq_msgsize > ULONG_MAX/attr->mq_maxmsg)
|
2012-06-01 07:26:36 +08:00
|
|
|
return -EOVERFLOW;
|
2012-06-01 07:26:36 +08:00
|
|
|
mq_treesize = attr->mq_maxmsg * sizeof(struct msg_msg) +
|
|
|
|
min_t(unsigned int, attr->mq_maxmsg, MQ_PRIO_MAX) *
|
|
|
|
sizeof(struct posix_msg_tree_node);
|
|
|
|
total_size = attr->mq_maxmsg * attr->mq_msgsize;
|
|
|
|
if (total_size + mq_treesize < total_size)
|
2012-06-01 07:26:36 +08:00
|
|
|
return -EOVERFLOW;
|
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
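The two overflow checks in mq_attr_ok() follow a common pattern: divide before multiplying, then test whether an unsigned sum wrapped. The sketch below isolates that pattern in user space; `queue_size_fits` and its parameters are hypothetical, and `overhead` merely stands in for the tree bookkeeping size.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): returns true when
 * maxmsg * msgsize + overhead fits in an unsigned long.
 * Assumes maxmsg > 0, since values <= 0 are rejected earlier.
 */
static bool queue_size_fits(unsigned long maxmsg, unsigned long msgsize,
                            unsigned long overhead)
{
    unsigned long total;

    /* Divide instead of multiplying: the product itself could wrap. */
    if (msgsize > ULONG_MAX / maxmsg)
        return false;
    total = maxmsg * msgsize;

    /* Unsigned addition wraps, so a smaller sum signals overflow. */
    if (total + overhead < total)
        return false;
    return true;
}
```

For example, `queue_size_fits(10, 8192, 4096)` succeeds, while a message size of `ULONG_MAX` with two messages is rejected by the division check.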
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Invoked when creating a new queue via sys_mq_open
|
|
|
|
*/
|
2012-06-27 01:58:53 +08:00
|
|
|
static struct file *do_create(struct ipc_namespace *ipc_ns, struct inode *dir,
|
|
|
|
struct path *path, int oflag, umode_t mode,
|
2009-04-07 10:01:08 +08:00
|
|
|
struct mq_attr *attr)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2008-11-14 07:39:22 +08:00
|
|
|
const struct cred *cred = current_cred();
|
2005-04-17 06:20:36 +08:00
|
|
|
int ret;
|
|
|
|
|
2008-12-14 17:02:26 +08:00
|
|
|
if (attr) {
|
2012-06-01 07:26:36 +08:00
|
|
|
ret = mq_attr_ok(ipc_ns, attr);
|
|
|
|
if (ret)
|
2012-06-27 01:58:53 +08:00
|
|
|
return ERR_PTR(ret);
|
2005-04-17 06:20:36 +08:00
|
|
|
/* store for use during create */
|
2012-06-27 01:58:53 +08:00
|
|
|
path->dentry->d_fsdata = attr;
|
2012-06-01 07:26:36 +08:00
|
|
|
} else {
|
|
|
|
struct mq_attr def_attr;
|
|
|
|
|
|
|
|
def_attr.mq_maxmsg = min(ipc_ns->mq_msg_max,
|
|
|
|
ipc_ns->mq_msg_default);
|
|
|
|
def_attr.mq_msgsize = min(ipc_ns->mq_msgsize_max,
|
|
|
|
ipc_ns->mq_msgsize_default);
|
|
|
|
ret = mq_attr_ok(ipc_ns, &def_attr);
|
|
|
|
if (ret)
|
2012-06-27 01:58:53 +08:00
|
|
|
return ERR_PTR(ret);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2009-03-30 07:08:22 +08:00
|
|
|
mode &= ~current_umask();
|
2012-06-27 01:58:53 +08:00
|
|
|
ret = vfs_create(dir, path->dentry, mode, true);
|
|
|
|
path->dentry->d_fsdata = NULL;
|
2012-08-06 14:18:17 +08:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
return dentry_open(path, oflag, cred);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Opens existing queue */
|
2012-06-27 01:58:53 +08:00
|
|
|
static struct file *do_open(struct path *path, int oflag)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2008-11-14 07:39:22 +08:00
|
|
|
static const int oflag2acc[O_ACCMODE] = { MAY_READ, MAY_WRITE,
|
|
|
|
MAY_READ | MAY_WRITE };
|
2012-06-27 01:58:53 +08:00
|
|
|
int acc;
|
|
|
|
if ((oflag & O_ACCMODE) == (O_RDWR | O_WRONLY))
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
acc = oflag2acc[oflag & O_ACCMODE];
|
|
|
|
if (inode_permission(path->dentry->d_inode, acc))
|
|
|
|
return ERR_PTR(-EACCES);
|
|
|
|
return dentry_open(path, oflag, current_cred());
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2011-07-26 17:26:10 +08:00
|
|
|
SYSCALL_DEFINE4(mq_open, const char __user *, u_name, int, oflag, umode_t, mode,
|
2009-01-14 21:14:27 +08:00
|
|
|
struct mq_attr __user *, u_attr)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-06-27 01:58:53 +08:00
|
|
|
struct path path;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct file *filp;
|
|
|
|
char *name;
|
2008-12-14 17:02:26 +08:00
|
|
|
struct mq_attr attr;
|
2005-04-17 06:20:36 +08:00
|
|
|
int fd, error;
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 10:01:10 +08:00
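The "decrement, and take the lock only if the count hits zero" idea from the commit message above can be sketched in user space. This is an illustration with C11 atomics and a pthread mutex, not the kernel's atomic_dec_and_lock(); `ref_dec_and_lock` is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference. Returns true with 'lock' held if this was the last
 * reference, so teardown cannot race with a lookup done under the same lock.
 */
static bool ref_dec_and_lock(atomic_int *count, pthread_mutex_t *lock)
{
    int old = atomic_load(count);

    /* Fast path: drop a reference without the lock while others remain. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return false;
        /* A failed CAS reloaded 'old'; retry with the fresh value. */
    }

    /* Slow path: we may be the last holder, so serialize with lookups. */
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(count, 1) == 1)
        return true;    /* reached zero; caller unlocks after teardown */
    pthread_mutex_unlock(lock);
    return false;
}
```

A reader that looks up the object under the same lock either sees it before the final decrement or not at all, which is the property the commit message relies on.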
|
|
|
struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
|
2012-08-06 14:18:17 +08:00
|
|
|
struct vfsmount *mnt = ipc_ns->mq_mnt;
|
|
|
|
struct dentry *root = mnt->mnt_root;
|
|
|
|
int ro;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-12-14 17:02:26 +08:00
|
|
|
if (u_attr && copy_from_user(&attr, u_attr, sizeof(struct mq_attr)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
audit_mq_open(oflag, mode, u_attr ? &attr : NULL);
|
2006-05-25 05:09:55 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (IS_ERR(name = getname(u_name)))
|
|
|
|
return PTR_ERR(name);
|
|
|
|
|
2008-05-04 03:28:45 +08:00
|
|
|
fd = get_unused_fd_flags(O_CLOEXEC);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (fd < 0)
|
|
|
|
goto out_putname;
|
|
|
|
|
2012-08-06 14:18:17 +08:00
|
|
|
ro = mnt_want_write(mnt); /* we'll drop it in any case */
|
2012-06-27 01:58:53 +08:00
|
|
|
error = 0;
|
|
|
|
mutex_lock(&root->d_inode->i_mutex);
|
|
|
|
path.dentry = lookup_one_len(name, root, strlen(name));
|
|
|
|
if (IS_ERR(path.dentry)) {
|
|
|
|
error = PTR_ERR(path.dentry);
|
2010-02-23 15:04:28 +08:00
|
|
|
goto out_putfd;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2012-08-06 14:18:17 +08:00
|
|
|
path.mnt = mntget(mnt);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
if (oflag & O_CREAT) {
|
2012-06-27 01:58:53 +08:00
|
|
|
if (path.dentry->d_inode) { /* entry already exists */
|
2012-10-11 03:25:23 +08:00
|
|
|
audit_inode(name, path.dentry, 0);
|
2010-02-23 15:04:26 +08:00
|
|
|
if (oflag & O_EXCL) {
|
|
|
|
error = -EEXIST;
|
2006-01-15 04:29:55 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2012-06-27 01:58:53 +08:00
|
|
|
filp = do_open(&path, oflag);
|
2005-04-17 06:20:36 +08:00
|
|
|
} else {
|
2012-08-06 14:18:17 +08:00
|
|
|
if (ro) {
|
|
|
|
error = ro;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-06-27 01:58:53 +08:00
|
|
|
filp = do_create(ipc_ns, root->d_inode,
|
|
|
|
&path, oflag, mode,
|
2008-12-14 17:02:26 +08:00
|
|
|
u_attr ? &attr : NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2006-01-15 04:29:55 +08:00
|
|
|
} else {
|
2012-06-27 01:58:53 +08:00
|
|
|
if (!path.dentry->d_inode) {
|
2010-02-23 15:04:26 +08:00
|
|
|
error = -ENOENT;
|
2006-01-15 04:29:55 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2012-10-11 03:25:23 +08:00
|
|
|
audit_inode(name, path.dentry, 0);
|
2012-06-27 01:58:53 +08:00
|
|
|
filp = do_open(&path, oflag);
|
2006-01-15 04:29:55 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-06-27 01:58:53 +08:00
|
|
|
if (!IS_ERR(filp))
|
|
|
|
fd_install(fd, filp);
|
|
|
|
else
|
2005-04-17 06:20:36 +08:00
|
|
|
error = PTR_ERR(filp);
|
2006-01-15 04:29:55 +08:00
|
|
|
out:
|
2012-06-27 01:58:53 +08:00
|
|
|
path_put(&path);
|
2006-01-15 04:29:55 +08:00
|
|
|
out_putfd:
|
2012-06-27 01:58:53 +08:00
|
|
|
if (error) {
|
|
|
|
put_unused_fd(fd);
|
|
|
|
fd = error;
|
|
|
|
}
|
|
|
|
mutex_unlock(&root->d_inode->i_mutex);
|
2012-08-06 14:18:17 +08:00
|
|
|
mnt_drop_write(mnt);
|
2005-04-17 06:20:36 +08:00
|
|
|
out_putname:
|
|
|
|
putname(name);
|
|
|
|
return fd;
|
|
|
|
}
|
|
|
|
|
2009-01-14 21:14:27 +08:00
|
|
|
SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
char *name;
|
|
|
|
struct dentry *dentry;
|
|
|
|
struct inode *inode = NULL;
|
2009-04-07 10:01:10 +08:00
|
|
|
struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
|
2012-08-06 14:18:17 +08:00
|
|
|
struct vfsmount *mnt = ipc_ns->mq_mnt;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
name = getname(u_name);
|
|
|
|
if (IS_ERR(name))
|
|
|
|
return PTR_ERR(name);
|
|
|
|
|
2012-08-06 14:18:17 +08:00
|
|
|
err = mnt_want_write(mnt);
|
|
|
|
if (err)
|
|
|
|
goto out_name;
|
|
|
|
mutex_lock_nested(&mnt->mnt_root->d_inode->i_mutex, I_MUTEX_PARENT);
|
|
|
|
dentry = lookup_one_len(name, mnt->mnt_root, strlen(name));
|
2005-04-17 06:20:36 +08:00
|
|
|
if (IS_ERR(dentry)) {
|
|
|
|
err = PTR_ERR(dentry);
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
inode = dentry->d_inode;
|
2012-08-06 14:18:17 +08:00
|
|
|
if (!inode) {
|
|
|
|
err = -ENOENT;
|
|
|
|
} else {
|
2010-10-23 23:11:40 +08:00
|
|
|
ihold(inode);
|
2012-08-06 14:18:17 +08:00
|
|
|
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
dput(dentry);
|
|
|
|
|
|
|
|
out_unlock:
|
2012-08-06 14:18:17 +08:00
|
|
|
mutex_unlock(&mnt->mnt_root->d_inode->i_mutex);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (inode)
|
|
|
|
iput(inode);
|
2012-08-06 14:18:17 +08:00
|
|
|
mnt_drop_write(mnt);
|
|
|
|
out_name:
|
|
|
|
putname(name);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Pipelined send and receive functions.
|
|
|
|
*
|
|
|
|
* If a receiver finds no waiting message, then it registers itself in the
|
|
|
|
* list of waiting receivers. A sender checks that list before adding the new
|
|
|
|
* message into the message array. If there is a waiting receiver, then it
|
|
|
|
* bypasses the message array and directly hands the message over to the
|
|
|
|
* receiver.
|
|
|
|
* The receiver accepts the message and returns without grabbing the queue
|
|
|
|
* spinlock. Therefore an intermediate STATE_PENDING state and memory barriers
|
|
|
|
* are necessary. The same algorithm is used for sysv semaphores, see
|
2006-03-28 17:56:23 +08:00
|
|
|
* ipc/sem.c for more details.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* The same algorithm is used for senders.
|
|
|
|
*/
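The handoff described in the comment above can be sketched with C11 atomics. This is a user-space approximation under stated assumptions: `struct waiter`, `handoff` and `collect` are hypothetical names, and a release store stands in for the write barrier. The intermediate PENDING state tells the lockless receiver that the sender is still touching the waiter structure, so the receiver must not return (and invalidate that structure) until it sees READY.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

enum { STATE_NONE, STATE_PENDING, STATE_READY };

struct waiter {
    const char *msg;    /* payload handed over by the sender */
    atomic_int state;
};

/* Sender side: in the real code this runs with the queue lock held. */
static void handoff(struct waiter *w, const char *msg)
{
    w->msg = msg;
    atomic_store_explicit(&w->state, STATE_PENDING, memory_order_relaxed);
    /* ... the receiver would be woken here ... */
    /* Release ordering: msg is visible before READY can be observed. */
    atomic_store_explicit(&w->state, STATE_READY, memory_order_release);
}

/* Receiver side: lockless; waits out the short PENDING window. */
static const char *collect(struct waiter *w)
{
    while (atomic_load_explicit(&w->state, memory_order_acquire) ==
           STATE_PENDING)
        ;   /* busy-wait until the sender is done with 'w' */
    if (atomic_load_explicit(&w->state, memory_order_acquire) != STATE_READY)
        return NULL;    /* nothing was handed over */
    return w->msg;
}
```

After the final store of READY the sender must not touch the waiter again, because the receiver is then free to return and reuse the memory.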
|
|
|
|
|
|
|
|
/* pipelined_send() - send a message directly to the task waiting in
|
|
|
|
* sys_mq_timedreceive() (without inserting message into a queue).
|
|
|
|
*/
|
|
|
|
static inline void pipelined_send(struct mqueue_inode_info *info,
|
|
|
|
struct msg_msg *message,
|
|
|
|
struct ext_wait_queue *receiver)
|
|
|
|
{
|
|
|
|
receiver->msg = message;
|
|
|
|
list_del(&receiver->list);
|
|
|
|
receiver->state = STATE_PENDING;
|
|
|
|
wake_up_process(receiver->task);
|
2005-05-01 23:58:47 +08:00
|
|
|
smp_wmb();
|
2005-04-17 06:20:36 +08:00
|
|
|
receiver->state = STATE_READY;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* pipelined_receive() - if a task is waiting in sys_mq_timedsend(),
|
|
|
|
* take its message and put it into the queue (one free slot is guaranteed). */
|
|
|
|
static inline void pipelined_receive(struct mqueue_inode_info *info)
|
|
|
|
{
|
|
|
|
struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND);
|
|
|
|
|
|
|
|
if (!sender) {
|
|
|
|
/* for poll */
|
|
|
|
wake_up_interruptible(&info->wait_q);
|
|
|
|
return;
|
|
|
|
}
|
ipc/mqueue: improve performance of send/recv
The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.
OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed to, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.
"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:
Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.
So, on to the results already:
Subsystem/Test                            Old                 New

Time to compile linux kernel (make -j12 on a 6 core CPU)
  Running mqueue test                     user  49m10.744s    user  45m26.294s
                                          sys    5m51.924s    sys    4m59.894s
                                          total 55m02.668s    total 50m26.188s
  Running fake test                       user  45m32.686s    user  45m18.552s
                                          sys    5m12.465s    sys    4m56.468s
                                          total 50m45.151s    total 50m15.020s
  % slowdown from mqueue cache thrashing  ~8%                 ~.5%

Avg time to send/recv (in nanoseconds per message)
  when queue empty                        305/288             349/318
  when queue full (65528 messages)
    constant priority                     526589/823          362/314
    increasing priority                   403105/916          495/445
    decreasing priority                   73420/594           482/409
    random priority                       280147/920          546/436

Time to fill/drain queue (65528 messages, in seconds)
  constant priority                       17.37/.12           .13/.12
  increasing priority                     4.14/.14            .21/.18
  decreasing priority                     12.93/.13           .21/.18
  random priority                         8.88/.16            .22/.17
So, I think the results speak for themselves. It's possible this
implementation could be improved by caching at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.
[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:35 +08:00
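The core idea of the commit message above, keeping one FIFO per priority so that inserting at a constant priority does not depend on queue depth, can be sketched as follows. This is a simplification and not the kernel's data structure: a small fixed array indexed by priority stands in for the rbtree of priority nodes, and all names (`prio_queue`, `pq_insert`, `pq_pop`, `PRIO_LEVELS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PRIO_LEVELS 8   /* illustrative only; not the kernel's MQ_PRIO_MAX */

struct msg {
    struct msg *next;
    int id;
};

struct bucket {
    struct msg *head, *tail;    /* FIFO of messages at one priority */
};

struct prio_queue {
    struct bucket level[PRIO_LEVELS];
};

/* O(1) at any depth: append to the tail of the priority's FIFO.
 * The caller must ensure prio < PRIO_LEVELS. */
static void pq_insert(struct prio_queue *q, struct msg *m, unsigned int prio)
{
    struct bucket *b = &q->level[prio];

    m->next = NULL;
    if (b->tail)
        b->tail->next = m;
    else
        b->head = m;
    b->tail = m;
}

/* Highest priority first; oldest first within a priority. */
static struct msg *pq_pop(struct prio_queue *q)
{
    for (int p = PRIO_LEVELS - 1; p >= 0; p--) {
        struct bucket *b = &q->level[p];
        struct msg *m = b->head;

        if (!m)
            continue;
        b->head = m->next;
        if (!b->head)
            b->tail = NULL;
        return m;
    }
    return NULL;
}
```

A tree keyed by priority serves the same purpose when the priority range is too large for a flat array and only a few levels are in use at a time.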
|
|
|
if (msg_insert(sender->msg, info))
|
|
|
|
return;
|
2005-04-17 06:20:36 +08:00
|
|
|
list_del(&sender->list);
|
|
|
|
sender->state = STATE_PENDING;
|
|
|
|
wake_up_process(sender->task);
|
2005-05-01 23:58:47 +08:00
|
|
|
smp_wmb();
|
2005-04-17 06:20:36 +08:00
|
|
|
sender->state = STATE_READY;
|
|
|
|
}
|
|
|
|
|
2009-01-14 21:14:28 +08:00
|
|
|
SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
|
|
|
|
size_t, msg_len, unsigned int, msg_prio,
|
|
|
|
const struct timespec __user *, u_abs_timeout)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct inode *inode;
|
|
|
|
struct ext_wait_queue wait;
|
|
|
|
struct ext_wait_queue *receiver;
|
|
|
|
struct msg_msg *msg_ptr;
|
|
|
|
struct mqueue_inode_info *info;
|
2010-04-03 04:40:20 +08:00
|
|
|
ktime_t expires, *timeout = NULL;
|
|
|
|
struct timespec ts;
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
                 Old mqueue    new mqueue    new mqueue + caching
queue empty
  send/recv      305/288ns     349/318ns     310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old method's abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01 07:26:38 +08:00
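The "allocate speculatively outside the lock, then keep or discard under the lock" pattern from the commit message above can be sketched in user space. This is a minimal illustration with a pthread mutex and malloc(), not the kernel code; `struct queue`, `struct node` and `refill_cache` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct node { int unused; };

struct queue {
    pthread_mutex_t lock;
    struct node *node_cache;    /* at most one spare node */
};

static void refill_cache(struct queue *q)
{
    struct node *spare = NULL;

    /* Unlocked peek: a hint only, rechecked under the lock below. */
    if (!q->node_cache)
        spare = malloc(sizeof(*spare));

    pthread_mutex_lock(&q->lock);
    if (!q->node_cache && spare) {
        q->node_cache = spare;  /* the cache takes ownership */
        spare = NULL;
    }
    pthread_mutex_unlock(&q->lock);

    /* Cache was filled by someone else, or allocation was not needed. */
    free(spare);                /* free(NULL) is a no-op */
}
```

The point of the pattern is that the potentially slow allocation never happens while the lock is held. This sketch frees the unused node after unlocking, whereas the listing's code calls kfree() with the spinlock still held.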
|
|
|
struct posix_msg_tree_node *new_leaf = NULL;
|
2012-08-29 00:52:22 +08:00
|
|
|
int ret = 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-12-14 16:46:48 +08:00
|
|
|
if (u_abs_timeout) {
|
2010-04-03 04:40:20 +08:00
|
|
|
int res = prepare_timeout(u_abs_timeout, &expires, &ts);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
timeout = &expires;
|
2008-12-14 16:46:48 +08:00
|
|
|
}
|
2006-05-25 05:09:55 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (unlikely(msg_prio >= (unsigned long) MQ_PRIO_MAX))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2010-04-03 04:40:20 +08:00
|
|
|
audit_mq_sendrecv(mqdes, msg_len, msg_prio, timeout ? &ts : NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (unlikely(!f.file)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
inode = f.file->f_path.dentry->d_inode;
|
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_fput;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
info = MQUEUE_I(inode);
|
2012-10-11 03:25:23 +08:00
|
|
|
audit_inode(NULL, f.file->f_path.dentry, 0);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
if (unlikely(!(f.file->f_mode & FMODE_WRITE))) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_fput;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
if (unlikely(msg_len > info->attr.mq_msgsize)) {
|
|
|
|
ret = -EMSGSIZE;
|
|
|
|
goto out_fput;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* First try to allocate memory, before doing anything with
|
|
|
|
* existing queues. */
|
|
|
|
msg_ptr = load_msg(u_msg_ptr, msg_len);
|
|
|
|
if (IS_ERR(msg_ptr)) {
|
|
|
|
ret = PTR_ERR(msg_ptr);
|
|
|
|
goto out_fput;
|
|
|
|
}
|
|
|
|
msg_ptr->m_ts = msg_len;
|
|
|
|
msg_ptr->m_type = msg_prio;
|
|
|
|
|
2012-06-01 07:26:38 +08:00
|
|
|
/*
|
|
|
|
* msg_insert really wants us to have a valid, spare node struct so
|
|
|
|
* it doesn't have to kmalloc a GFP_ATOMIC allocation, but it will
|
|
|
|
* fall back to that if necessary.
|
|
|
|
*/
|
|
|
|
if (!info->node_cache)
|
|
|
|
new_leaf = kmalloc(sizeof(*new_leaf), GFP_KERNEL);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_lock(&info->lock);
|
|
|
|
|
ipc/mqueue: add rbtree node caching support
When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).
This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.
Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.
Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.
Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.
The net result of all of this is:
1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.
2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).
3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.
The performance changes are:
Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns
I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.
As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
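
The speculative allocation described in points 1) and 2) above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not the kernel code: `struct node`, `struct queue`, `queue_init`, `queue_lock_with_spare` and `queue_take_spare` are hypothetical names, `malloc` stands in for a GFP_KERNEL `kmalloc`, and a C11 `atomic_flag` stands in for `info->lock`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct posix_msg_tree_node. */
struct node {
	int priority;
};

/* Hypothetical stand-in for struct mqueue_inode_info. */
struct queue {
	atomic_flag lock;		/* plays the role of info->lock */
	struct node *_Atomic node_cache;	/* at most one spare node */
	size_t qsize;			/* bytes accounted to the queue */
};

static void queue_init(struct queue *q)
{
	atomic_flag_clear(&q->lock);
	atomic_init(&q->node_cache, NULL);
	q->qsize = 0;
}

static void queue_lock(struct queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void queue_unlock(struct queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/*
 * Try to make sure the queue has a spare node, then take the lock.
 * Returns with the lock held; the caller must call queue_unlock().
 */
static void queue_lock_with_spare(struct queue *q)
{
	struct node *new_leaf = NULL;

	/* Unlocked peek: a stale answer only costs a wasted allocation. */
	if (!q->node_cache)
		new_leaf = malloc(sizeof(*new_leaf));	/* no lock held here */

	queue_lock(q);

	if (!q->node_cache && new_leaf) {
		/* Cache is still empty: keep the speculative allocation. */
		q->node_cache = new_leaf;
		q->qsize += sizeof(*new_leaf);
	} else {
		/* The cache was refilled meanwhile, or malloc failed. */
		free(new_leaf);		/* free(NULL) is a no-op */
	}
}

/*
 * Take the spare node, if any. Caller must hold the lock.
 * This sketch leaves qsize untouched here; accounting for nodes
 * that are in use is outside its scope.
 */
static struct node *queue_take_spare(struct queue *q)
{
	struct node *n = q->node_cache;

	if (n)
		q->node_cache = NULL;
	return n;
}
```

A caller would invoke `queue_lock_with_spare()`, do its insert using `queue_take_spare()`, and fall back to an allocation under the lock only if that returns NULL, which corresponds to the GFP_ATOMIC fallback mentioned above. The point of the pattern is that the allocation that may block happens before the lock is taken, and the check is repeated under the lock so a racing thread's refill is never overwritten.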
2012-06-01 07:26:38 +08:00
|
|
|
if (!info->node_cache && new_leaf) {
|
|
|
|
/* Save our speculative allocation into the cache */
|
|
|
|
INIT_LIST_HEAD(&new_leaf->msg_list);
|
|
|
|
info->node_cache = new_leaf;
|
|
|
|
info->qsize += sizeof(*new_leaf);
|
|
|
|
new_leaf = NULL;
|
|
|
|
} else {
|
|
|
|
kfree(new_leaf);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (info->attr.mq_curmsgs == info->attr.mq_maxmsg) {
|
2012-08-29 00:52:22 +08:00
|
|
|
if (f.file->f_flags & O_NONBLOCK) {
|
2005-04-17 06:20:36 +08:00
|
|
|
ret = -EAGAIN;
|
|
|
|
} else {
|
|
|
|
wait.task = current;
|
|
|
|
wait.msg = (void *) msg_ptr;
|
|
|
|
wait.state = STATE_NONE;
|
|
|
|
ret = wq_sleep(info, SEND, timeout, &wait);
|
2012-06-01 07:26:38 +08:00
|
|
|
/*
|
|
|
|
* wq_sleep must be called with info->lock held, and
|
|
|
|
* returns with the lock released
|
|
|
|
*/
|
|
|
|
goto out_free;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
receiver = wq_get_first_waiter(info, RECV);
|
|
|
|
if (receiver) {
|
|
|
|
pipelined_send(info, msg_ptr, receiver);
|
|
|
|
} else {
|
|
|
|
/* adds message to the queue */
|
2012-06-01 07:26:38 +08:00
|
|
|
ret = msg_insert(msg_ptr, info);
|
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
2005-04-17 06:20:36 +08:00
|
|
|
__do_notify(info);
|
|
|
|
}
|
|
|
|
inode->i_atime = inode->i_mtime = inode->i_ctime =
|
|
|
|
CURRENT_TIME;
|
|
|
|
}
|
2012-06-01 07:26:38 +08:00
|
|
|
out_unlock:
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
out_free:
|
|
|
|
if (ret)
|
|
|
|
free_msg(msg_ptr);
|
2005-04-17 06:20:36 +08:00
|
|
|
out_fput:
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-01-14 21:14:28 +08:00
|
|
|
SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
|
|
|
|
size_t, msg_len, unsigned int __user *, u_msg_prio,
|
|
|
|
const struct timespec __user *, u_abs_timeout)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
struct msg_msg *msg_ptr;
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct inode *inode;
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
struct ext_wait_queue wait;
|
2010-04-03 04:40:20 +08:00
|
|
|
ktime_t expires, *timeout = NULL;
|
|
|
|
struct timespec ts;
|
2012-06-01 07:26:38 +08:00
|
|
|
struct posix_msg_tree_node *new_leaf = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-12-14 16:46:48 +08:00
|
|
|
if (u_abs_timeout) {
|
2010-04-03 04:40:20 +08:00
|
|
|
int res = prepare_timeout(u_abs_timeout, &expires, &ts);
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
timeout = &expires;
|
2008-12-14 16:46:48 +08:00
|
|
|
}
|
2006-05-25 05:09:55 +08:00
|
|
|
|
2010-04-03 04:40:20 +08:00
|
|
|
audit_mq_sendrecv(mqdes, msg_len, 0, timeout ? &ts : NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (unlikely(!f.file)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
inode = f.file->f_path.dentry->d_inode;
|
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_fput;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
info = MQUEUE_I(inode);
|
2012-10-11 03:25:23 +08:00
|
|
|
audit_inode(NULL, f.file->f_path.dentry, 0);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
if (unlikely(!(f.file->f_mode & FMODE_READ))) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_fput;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* checks if buffer is big enough */
|
|
|
|
if (unlikely(msg_len < info->attr.mq_msgsize)) {
|
|
|
|
ret = -EMSGSIZE;
|
|
|
|
goto out_fput;
|
|
|
|
}
|
|
|
|
|
2012-06-01 07:26:38 +08:00
|
|
|
/*
|
|
|
|
* msg_insert really wants us to have a valid, spare node struct so
|
|
|
|
* it doesn't have to kmalloc a GFP_ATOMIC allocation, but it will
|
|
|
|
* fall back to that if necessary.
|
|
|
|
*/
|
|
|
|
if (!info->node_cache)
|
|
|
|
new_leaf = kmalloc(sizeof(*new_leaf), GFP_KERNEL);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_lock(&info->lock);
|
2012-06-01 07:26:38 +08:00
|
|
|
|
|
|
|
if (!info->node_cache && new_leaf) {
|
|
|
|
/* Save our speculative allocation into the cache */
|
|
|
|
INIT_LIST_HEAD(&new_leaf->msg_list);
|
|
|
|
info->node_cache = new_leaf;
|
|
|
|
info->qsize += sizeof(*new_leaf);
|
|
|
|
} else {
|
|
|
|
kfree(new_leaf);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
if (info->attr.mq_curmsgs == 0) {
|
2012-08-29 00:52:22 +08:00
|
|
|
if (f.file->f_flags & O_NONBLOCK) {
|
2005-04-17 06:20:36 +08:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
ret = -EAGAIN;
|
|
|
|
} else {
|
|
|
|
wait.task = current;
|
|
|
|
wait.state = STATE_NONE;
|
|
|
|
ret = wq_sleep(info, RECV, timeout, &wait);
|
|
|
|
msg_ptr = wait.msg;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
msg_ptr = msg_get(info);
|
|
|
|
|
|
|
|
inode->i_atime = inode->i_mtime = inode->i_ctime =
|
|
|
|
CURRENT_TIME;
|
|
|
|
|
|
|
|
/* There is now free space in queue. */
|
|
|
|
pipelined_receive(info);
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
if (ret == 0) {
|
|
|
|
ret = msg_ptr->m_ts;
|
|
|
|
|
|
|
|
if ((u_msg_prio && put_user(msg_ptr->m_type, u_msg_prio)) ||
|
|
|
|
store_msg(u_msg_ptr, msg_ptr, msg_ptr->m_ts)) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
}
|
|
|
|
free_msg(msg_ptr);
|
|
|
|
}
|
|
|
|
out_fput:
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Notes: the case when user wants us to deregister (with NULL as pointer)
|
|
|
|
* and he isn't currently owner of notification, will be silently discarded.
|
|
|
|
* It isn't explicitly defined in the POSIX.
|
|
|
|
*/
|
2009-01-14 21:14:28 +08:00
|
|
|
SYSCALL_DEFINE2(mq_notify, mqd_t, mqdes,
|
|
|
|
const struct sigevent __user *, u_notification)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-08-29 00:52:22 +08:00
|
|
|
int ret;
|
|
|
|
struct fd f;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct sock *sock;
|
|
|
|
struct inode *inode;
|
|
|
|
struct sigevent notification;
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
struct sk_buff *nc;
|
|
|
|
|
2008-12-10 20:16:12 +08:00
|
|
|
if (u_notification) {
|
2005-04-17 06:20:36 +08:00
|
|
|
if (copy_from_user(&notification, u_notification,
|
|
|
|
sizeof(struct sigevent)))
|
|
|
|
return -EFAULT;
|
2008-12-10 20:16:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
audit_mq_notify(mqdes, u_notification ? &notification : NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-12-10 20:16:12 +08:00
|
|
|
nc = NULL;
|
|
|
|
sock = NULL;
|
|
|
|
if (u_notification != NULL) {
|
2005-04-17 06:20:36 +08:00
|
|
|
if (unlikely(notification.sigev_notify != SIGEV_NONE &&
|
|
|
|
notification.sigev_notify != SIGEV_SIGNAL &&
|
|
|
|
notification.sigev_notify != SIGEV_THREAD))
|
|
|
|
return -EINVAL;
|
|
|
|
if (notification.sigev_notify == SIGEV_SIGNAL &&
|
2005-05-01 23:59:14 +08:00
|
|
|
!valid_signal(notification.sigev_signo)) {
|
2005-04-17 06:20:36 +08:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
if (notification.sigev_notify == SIGEV_THREAD) {
|
2007-11-07 18:42:09 +08:00
|
|
|
long timeo;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* create the notify skb */
|
|
|
|
nc = alloc_skb(NOTIFY_COOKIE_LEN, GFP_KERNEL);
|
2010-02-23 15:04:26 +08:00
|
|
|
if (!nc) {
|
|
|
|
ret = -ENOMEM;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
if (copy_from_user(nc->data,
|
|
|
|
notification.sigev_value.sival_ptr,
|
|
|
|
NOTIFY_COOKIE_LEN)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EFAULT;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* TODO: add a header? */
|
|
|
|
skb_put(nc, NOTIFY_COOKIE_LEN);
|
|
|
|
/* and attach it to the socket */
|
|
|
|
retry:
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(notification.sigev_signo);
|
|
|
|
if (!f.file) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2012-08-29 00:52:22 +08:00
|
|
|
sock = netlink_getsockbyfilp(f.file);
|
|
|
|
fdput(f);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (IS_ERR(sock)) {
|
|
|
|
ret = PTR_ERR(sock);
|
|
|
|
sock = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2007-11-07 18:42:09 +08:00
|
|
|
timeo = MAX_SCHEDULE_TIMEOUT;
|
2008-06-06 02:23:39 +08:00
|
|
|
ret = netlink_attachskb(sock, nc, &timeo, NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (ret == 1)
|
2010-02-23 15:04:26 +08:00
|
|
|
goto retry;
|
2005-04-17 06:20:36 +08:00
|
|
|
if (ret) {
|
|
|
|
sock = NULL;
|
|
|
|
nc = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (!f.file) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
inode = f.file->f_path.dentry->d_inode;
|
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_fput;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
if (u_notification == NULL) {
|
2006-10-02 17:17:26 +08:00
|
|
|
if (info->notify_owner == task_tgid(current)) {
|
2005-04-17 06:20:36 +08:00
|
|
|
remove_notification(info);
|
|
|
|
inode->i_atime = inode->i_ctime = CURRENT_TIME;
|
|
|
|
}
|
2006-10-02 17:17:26 +08:00
|
|
|
} else if (info->notify_owner != NULL) {
|
2005-04-17 06:20:36 +08:00
|
|
|
ret = -EBUSY;
|
|
|
|
} else {
|
|
|
|
switch (notification.sigev_notify) {
|
|
|
|
case SIGEV_NONE:
|
|
|
|
info->notify.sigev_notify = SIGEV_NONE;
|
|
|
|
break;
|
|
|
|
case SIGEV_THREAD:
|
|
|
|
info->notify_sock = sock;
|
|
|
|
info->notify_cookie = nc;
|
|
|
|
sock = NULL;
|
|
|
|
nc = NULL;
|
|
|
|
info->notify.sigev_notify = SIGEV_THREAD;
|
|
|
|
break;
|
|
|
|
case SIGEV_SIGNAL:
|
|
|
|
info->notify.sigev_signo = notification.sigev_signo;
|
|
|
|
info->notify.sigev_value = notification.sigev_value;
|
|
|
|
info->notify.sigev_notify = SIGEV_SIGNAL;
|
|
|
|
break;
|
|
|
|
}
|
2006-10-02 17:17:26 +08:00
|
|
|
|
|
|
|
info->notify_owner = get_pid(task_tgid(current));
|
2011-11-17 14:57:55 +08:00
|
|
|
info->notify_user_ns = get_user_ns(current_user_ns());
|
2005-04-17 06:20:36 +08:00
|
|
|
inode->i_atime = inode->i_ctime = CURRENT_TIME;
|
|
|
|
}
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
out_fput:
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
if (sock) {
|
|
|
|
netlink_detachskb(sock, nc);
|
|
|
|
} else if (nc) {
|
|
|
|
dev_kfree_skb(nc);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-01-14 21:14:28 +08:00
|
|
|
SYSCALL_DEFINE3(mq_getsetattr, mqd_t, mqdes,
|
|
|
|
const struct mq_attr __user *, u_mqstat,
|
|
|
|
struct mq_attr __user *, u_omqstat)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct mq_attr mqstat, omqstat;
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct inode *inode;
|
|
|
|
struct mqueue_inode_info *info;
|
|
|
|
|
|
|
|
if (u_mqstat != NULL) {
|
|
|
|
if (copy_from_user(&mqstat, u_mqstat, sizeof(struct mq_attr)))
|
|
|
|
return -EFAULT;
|
|
|
|
if (mqstat.mq_flags & (~O_NONBLOCK))
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(mqdes);
|
|
|
|
if (!f.file) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
inode = f.file->f_path.dentry->d_inode;
|
|
|
|
if (unlikely(f.file->f_op != &mqueue_file_operations)) {
|
2010-02-23 15:04:26 +08:00
|
|
|
ret = -EBADF;
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_fput;
|
2010-02-23 15:04:26 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
info = MQUEUE_I(inode);
|
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
|
|
|
|
omqstat = info->attr;
|
2012-08-29 00:52:22 +08:00
|
|
|
omqstat.mq_flags = f.file->f_flags & O_NONBLOCK;
|
2005-04-17 06:20:36 +08:00
|
|
|
if (u_mqstat) {
|
2008-12-10 19:58:59 +08:00
|
|
|
audit_mq_getsetattr(mqdes, &mqstat);
|
2012-08-29 00:52:22 +08:00
|
|
|
spin_lock(&f.file->f_lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (mqstat.mq_flags & O_NONBLOCK)
|
2012-08-29 00:52:22 +08:00
|
|
|
f.file->f_flags |= O_NONBLOCK;
|
2005-04-17 06:20:36 +08:00
|
|
|
else
|
2012-08-29 00:52:22 +08:00
|
|
|
f.file->f_flags &= ~O_NONBLOCK;
|
|
|
|
spin_unlock(&f.file->f_lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
inode->i_atime = inode->i_ctime = CURRENT_TIME;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
if (u_omqstat != NULL && copy_to_user(u_omqstat, &omqstat,
|
|
|
|
sizeof(struct mq_attr)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
out_fput:
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2005-04-17 06:20:36 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
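The flag handling in mq_getsetattr above can be sketched in plain user-space C. This is a minimal illustration, not the kernel code: the `toy_*` helpers are hypothetical names introduced here, and locking, auditing and the user-space copies are deliberately left out. It shows the three steps the function performs: reject any `mq_flags` bit other than `O_NONBLOCK`, capture the old flag value first, then set or clear `O_NONBLOCK` on the file flags.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: only O_NONBLOCK may be set in mq_flags. */
static int toy_check_mq_flags(long mq_flags)
{
	if (mq_flags & ~(long)O_NONBLOCK)
		return -EINVAL;
	return 0;
}

/* Hypothetical helper: the value reported back as the old mq_flags. */
static long toy_report_flags(unsigned int f_flags)
{
	return (long)(f_flags & (unsigned int)O_NONBLOCK);
}

/* Hypothetical helper: set or clear O_NONBLOCK, leave other bits alone. */
static unsigned int toy_apply_nonblock(unsigned int f_flags, long mq_flags)
{
	if (mq_flags & O_NONBLOCK)
		return f_flags | (unsigned int)O_NONBLOCK;
	return f_flags & ~(unsigned int)O_NONBLOCK;
}

/* Walks through one get-then-set round trip on a toy flags word. */
static void toy_getsetattr_demo(void)
{
	unsigned int f_flags = (unsigned int)O_RDWR;
	long old;

	assert(toy_check_mq_flags(O_NONBLOCK) == 0);

	/* The old value is captured before the new flag is applied. */
	old = toy_report_flags(f_flags);
	f_flags = toy_apply_nonblock(f_flags, O_NONBLOCK);

	assert(old == 0);
	assert(f_flags == (unsigned int)(O_RDWR | O_NONBLOCK));
}
```

Calling `toy_getsetattr_demo()` from a test harness exercises the round trip. The ordering matters: the caller sees the attributes as they were before the update.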
|
|
|
|
|
2007-02-12 16:55:39 +08:00
|
|
|
static const struct inode_operations mqueue_dir_inode_operations = {
|
2005-04-17 06:20:36 +08:00
|
|
|
.lookup = simple_lookup,
|
|
|
|
.create = mqueue_create,
|
|
|
|
.unlink = mqueue_unlink,
|
|
|
|
};
|
|
|
|
|
2007-02-12 16:55:35 +08:00
|
|
|
static const struct file_operations mqueue_file_operations = {
|
2005-04-17 06:20:36 +08:00
|
|
|
.flush = mqueue_flush_file,
|
|
|
|
.poll = mqueue_poll_file,
|
|
|
|
.read = mqueue_read_file,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
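The selection rules listed at the top of the semantic patch can be read as a priority list. The sketch below is a simplified paraphrase in plain C, not a translation of the Coccinelle rules: the struct, the enum and `toy_pick_llseek` are hypothetical names, and the facts about each file_operations table are assumed to be known already rather than derived from source code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what is known about one file_operations table. */
struct toy_fops_facts {
	bool has_llseek;            /* .llseek is already set */
	bool open_is_nonseekable;   /* open calls nonseekable_open() */
	bool read_is_seq_read;      /* .read = seq_read */
	bool has_readdir;           /* .readdir is present */
	bool rw_touches_pos;        /* read or write uses the offset */
};

enum toy_llseek {
	TOY_KEEP_EXISTING,
	TOY_NO_LLSEEK,
	TOY_SEQ_LSEEK,
	TOY_DEFAULT_LLSEEK,
	TOY_NOOP_LLSEEK,
};

/* Applies the rules in the order the patch comments give them. */
static enum toy_llseek toy_pick_llseek(const struct toy_fops_facts *f)
{
	if (f->has_llseek)
		return TOY_KEEP_EXISTING;
	if (f->open_is_nonseekable)
		return TOY_NO_LLSEEK;
	if (f->read_is_seq_read)
		return TOY_SEQ_LSEEK;
	if (f->has_readdir || f->rw_touches_pos)
		return TOY_DEFAULT_LLSEEK;
	/* Nothing uses the offset, but lseek() must keep succeeding. */
	return TOY_NOOP_LLSEEK;
}

/* A table with a read handler that uses the offset and no llseek yet. */
static void toy_llseek_demo(void)
{
	struct toy_fops_facts facts = { .rw_touches_pos = true };

	assert(toy_pick_llseek(&facts) == TOY_DEFAULT_LLSEEK);
}
```

A table whose read handler uses the offset and which sets nothing else lands on default_llseek. That matches the `.llseek = default_llseek` line this commit added to mqueue_file_operations, presumably because mqueue_read_file uses its offset argument.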
2010-08-16 00:52:59 +08:00
|
|
|
.llseek = default_llseek,
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2009-09-22 08:01:09 +08:00
|
|
|
static const struct super_operations mqueue_super_ops = {
|
2005-04-17 06:20:36 +08:00
|
|
|
.alloc_inode = mqueue_alloc_inode,
|
|
|
|
.destroy_inode = mqueue_destroy_inode,
|
2010-06-06 04:29:45 +08:00
|
|
|
.evict_inode = mqueue_evict_inode,
|
2005-04-17 06:20:36 +08:00
|
|
|
.statfs = simple_statfs,
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct file_system_type mqueue_fs_type = {
|
|
|
|
.name = "mqueue",
|
2010-07-26 17:16:50 +08:00
|
|
|
.mount = mqueue_mount,
|
2005-04-17 06:20:36 +08:00
|
|
|
.kill_sb = kill_litter_super,
|
|
|
|
};
|
|
|
|
|
namespaces: ipc namespaces: implement support for posix msqueues
Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.
Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.
When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.
Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".
If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) its superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.
To make this happen, we must protect the ipc reference count when
a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns
b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.
So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
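The locking scheme described in this commit message can be sketched with C11 atomics. This is a user-space illustration under stated assumptions, not the kernel implementation: `toy_lock`, `toy_ns`, `toy_sb` and every `toy_*` function are hypothetical stand-ins for mq_lock, the ipc namespace, the mqueuefs superblock and atomic_dec_and_lock(). The point is that the final decrement and the clearing of `s_fs_info` happen under the same lock that a lookup through the superblock takes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_lock { atomic_flag held; };          /* stand-in for mq_lock */
struct toy_ns { atomic_int count; };            /* stand-in for the ipc ns */
struct toy_sb { struct toy_ns *s_fs_info; };    /* stand-in for the sb */

static void toy_lock_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the flag was previously clear */
}

static void toy_lock_release(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Returns true with the lock held only if the count dropped to zero. */
static bool toy_dec_and_lock(atomic_int *count, struct toy_lock *l)
{
	int old = atomic_load(count);

	/* Fast path: drop a reference without the lock while others remain. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return false;
	}
	/* Slow path: this may be the last reference, so decide under the lock. */
	toy_lock_acquire(l);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;
	toy_lock_release(l);
	return false;
}

/* Case a) in the text: a task drops its namespace reference. */
static void toy_put_ns(struct toy_ns *ns, struct toy_sb *sb, struct toy_lock *l)
{
	if (toy_dec_and_lock(&ns->count, l)) {
		sb->s_fs_info = NULL;	/* the sb outlives the namespace */
		toy_lock_release(l);
		/* the namespace itself would be freed here */
	}
}

/* Case b) in the text: a lookup through the superblock takes a reference. */
static struct toy_ns *toy_get_ns(struct toy_sb *sb, struct toy_lock *l)
{
	struct toy_ns *ns;

	toy_lock_acquire(l);
	ns = sb->s_fs_info;
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	toy_lock_release(l);
	return ns;
}
```

Because both paths take the same lock, a lookup either sees a live namespace and bumps its count before the last reference goes away, or sees NULL after the namespace is gone.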
2009-04-07 10:01:10 +08:00
|
|
|
int mq_init_ns(struct ipc_namespace *ns)
|
|
|
|
{
|
|
|
|
ns->mq_queues_count = 0;
|
|
|
|
ns->mq_queues_max = DFLT_QUEUESMAX;
|
|
|
|
ns->mq_msg_max = DFLT_MSGMAX;
|
|
|
|
ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
|
2012-06-01 07:26:33 +08:00
|
|
|
ns->mq_msg_default = DFLT_MSG;
|
|
|
|
ns->mq_msgsize_default = DFLT_MSGSIZE;
|
2009-04-07 10:01:10 +08:00
|
|
|
|
|
|
|
ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
|
|
|
|
if (IS_ERR(ns->mq_mnt)) {
|
|
|
|
int err = PTR_ERR(ns->mq_mnt);
|
|
|
|
ns->mq_mnt = NULL;
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void mq_clear_sbinfo(struct ipc_namespace *ns)
|
|
|
|
{
|
|
|
|
ns->mq_mnt->mnt_sb->s_fs_info = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
void mq_put_mnt(struct ipc_namespace *ns)
|
|
|
|
{
|
2011-12-09 13:38:50 +08:00
|
|
|
kern_unmount(ns->mq_mnt);
|
2009-04-07 10:01:10 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static int __init init_mqueue_fs(void)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
|
|
|
|
sizeof(struct mqueue_inode_info), 0,
|
2007-07-20 09:11:58 +08:00
|
|
|
SLAB_HWCACHE_ALIGN, init_once);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (mqueue_inode_cachep == NULL)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-02-23 15:04:27 +08:00
|
|
|
/* ignore failures - they are not fatal */
|
2009-04-07 10:01:11 +08:00
|
|
|
mq_sysctl_table = mq_register_sysctl_table();
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
error = register_filesystem(&mqueue_fs_type);
|
|
|
|
if (error)
|
|
|
|
goto out_sysctl;
|
|
|
|
|
2009-04-07 10:01:10 +08:00
|
|
|
spin_lock_init(&mq_lock);
|
|
|
|
|
2011-12-09 13:38:50 +08:00
|
|
|
error = mq_init_ns(&init_ipc_ns);
|
|
|
|
if (error)
|
2005-04-17 06:20:36 +08:00
|
|
|
goto out_filesystem;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_filesystem:
|
|
|
|
unregister_filesystem(&mqueue_fs_type);
|
|
|
|
out_sysctl:
|
|
|
|
if (mq_sysctl_table)
|
|
|
|
unregister_sysctl_table(mq_sysctl_table);
|
2006-09-27 16:49:40 +08:00
|
|
|
kmem_cache_destroy(mqueue_inode_cachep);
|
2005-04-17 06:20:36 +08:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
|
|
|
__initcall(init_mqueue_fs);
|