// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (C) 2009 Red Hat, Inc.
 * Author: Michael S. Tsirkin <mst@redhat.com>
 *
 * virtio-net server in host kernel.
 */

#include <linux/compat.h>
#include <linux/eventfd.h>
#include <linux/vhost.h>
#include <linux/virtio_net.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/sched/clock.h>
#include <linux/sched/signal.h>
#include <linux/vmalloc.h>

#include <linux/net.h>
#include <linux/if_packet.h>
#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <linux/if_macvlan.h>
#include <linux/if_tap.h>
#include <linux/if_vlan.h>
#include <linux/skb_array.h>
#include <linux/skbuff.h>

#include <net/sock.h>
#include <net/xdp.h>

#include "vhost.h"

static int experimental_zcopytx = 0;
module_param(experimental_zcopytx, int, 0444);
MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
                 " 1 -Enable; 0 - Disable");

/* Max number of bytes transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

/* Max number of packets transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others with small
 * pkts.
 */
#define VHOST_NET_PKT_WEIGHT 256

/* MAX number of TX used buffers for outstanding zerocopy */
#define VHOST_MAX_PEND 128
#define VHOST_GOODCOPY_LEN 256

/*
 * For transmit, used buffer len is unused; we override it to track buffer
 * status internally; used for zerocopy tx only.
 */
/* Lower device DMA failed */
#define VHOST_DMA_FAILED_LEN    ((__force __virtio32)3)
/* Lower device DMA done */
#define VHOST_DMA_DONE_LEN      ((__force __virtio32)2)
/* Lower device DMA in progress */
#define VHOST_DMA_IN_PROGRESS   ((__force __virtio32)1)
/* Buffer unused */
#define VHOST_DMA_CLEAR_LEN     ((__force __virtio32)0)

#define VHOST_DMA_IS_DONE(len) ((__force u32)(len) >= (__force u32)VHOST_DMA_DONE_LEN)

enum {
        VHOST_NET_FEATURES = VHOST_FEATURES |
                         (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
                         (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
                         (1ULL << VIRTIO_F_IOMMU_PLATFORM)
};

enum {
        VHOST_NET_BACKEND_FEATURES = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2)
};

enum {
        VHOST_NET_VQ_RX = 0,
        VHOST_NET_VQ_TX = 1,
        VHOST_NET_VQ_MAX = 2,
};

struct vhost_net_ubuf_ref {
        /* refcount follows semantics similar to kref:
         *  0: object is released
         *  1: no outstanding ubufs
         * >1: outstanding ubufs
         */
        atomic_t refcount;
        wait_queue_head_t wait;
        struct vhost_virtqueue *vq;
};

#define VHOST_NET_BATCH 64
struct vhost_net_buf {
        void **queue;
        int tail;
        int head;
};

struct vhost_net_virtqueue {
        struct vhost_virtqueue vq;
        size_t vhost_hlen;
        size_t sock_hlen;
        /* vhost zerocopy support fields below: */
        /* last used idx for outstanding DMA zerocopy buffers */
        int upend_idx;
        /* For TX, first used idx for DMA done zerocopy buffers
         * For RX, number of batched heads
         */
        int done_idx;
        /* Number of XDP frames batched */
        int batched_xdp;
        /* an array of userspace buffers info */
        struct ubuf_info *ubuf_info;
        /* Reference counting for outstanding ubufs.
         * Protected by vq mutex. Writers must also take device mutex. */
        struct vhost_net_ubuf_ref *ubufs;
        struct ptr_ring *rx_ring;
        struct vhost_net_buf rxq;
        /* Batched XDP buffs */
        struct xdp_buff *xdp;
};
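
/* Top-level per-device state: the vhost core device, one RX and one TX
 * virtqueue with their poll contexts, plus TX/zerocopy accounting and a
 * private page frag used when copying.
 */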
struct vhost_net {
        struct vhost_dev dev;
        struct vhost_net_virtqueue vqs[VHOST_NET_VQ_MAX];
        struct vhost_poll poll[VHOST_NET_VQ_MAX];
        /* Number of TX recently submitted.
         * Protected by tx vq lock. */
        unsigned tx_packets;
        /* Number of times zerocopy TX recently failed.
         * Protected by tx vq lock. */
        unsigned tx_zcopy_err;
        /* Flush in progress. Protected by tx vq lock. */
        bool tx_flush;
        /* Private page frag */
        struct page_frag page_frag;
        /* Refcount bias of page frag */
        int refcnt_bias;
};

static unsigned vhost_net_zcopy_mask __read_mostly;
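
/* Helpers for the per-RX-virtqueue buffer cache (struct vhost_net_buf):
 * rxq->queue holds up to VHOST_NET_BATCH pointers consumed in one batch
 * from the backend's ptr_ring; head indexes the next entry to hand out and
 * tail marks the end of the cached batch.
 */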
static void *vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
{
        if (rxq->tail != rxq->head)
                return rxq->queue[rxq->head];
        else
                return NULL;
}

static int vhost_net_buf_get_size(struct vhost_net_buf *rxq)
{
        return rxq->tail - rxq->head;
}

static int vhost_net_buf_is_empty(struct vhost_net_buf *rxq)
{
        return rxq->tail == rxq->head;
}

static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
{
        void *ret = vhost_net_buf_get_ptr(rxq);
        ++rxq->head;
        return ret;
}

static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
{
        struct vhost_net_buf *rxq = &nvq->rxq;

        rxq->head = 0;
        rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
                                             VHOST_NET_BATCH);
        return rxq->tail;
}

static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
{
        struct vhost_net_buf *rxq = &nvq->rxq;

        if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
                ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
                                   vhost_net_buf_get_size(rxq),
                                   tun_ptr_free);
                rxq->head = rxq->tail = 0;
        }
}
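
/* Peek helpers: report the length of the next cached buffer (an XDP frame
 * or a tagged skb pointer) without consuming it. vhost_net_buf_peek()
 * refills the cache from the ptr_ring when it is empty and returns 0 if
 * nothing is pending.
 */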
static int vhost_net_buf_peek_len(void *ptr)
{
        if (tun_is_xdp_frame(ptr)) {
                struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);

                return xdpf->len;
        }

        return __skb_array_len_with_tag(ptr);
}

static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
{
        struct vhost_net_buf *rxq = &nvq->rxq;

        if (!vhost_net_buf_is_empty(rxq))
                goto out;

        if (!vhost_net_buf_produce(nvq))
                return 0;

out:
        return vhost_net_buf_peek_len(vhost_net_buf_get_ptr(rxq));
}

static void vhost_net_buf_init(struct vhost_net_buf *rxq)
{
        rxq->head = rxq->tail = 0;
}
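
/* Zerocopy bookkeeping: each TX virtqueue that may use zerocopy gets a
 * vhost_net_ubuf_ref whose kref-like refcount tracks buffers still owned
 * by the lower device.
 */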
static void vhost_net_enable_zcopy(int vq)
{
        vhost_net_zcopy_mask |= 0x1 << vq;
}

static struct vhost_net_ubuf_ref *
vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
{
        struct vhost_net_ubuf_ref *ubufs;
        /* No zero copy backend? Nothing to count. */
        if (!zcopy)
                return NULL;
        ubufs = kmalloc(sizeof(*ubufs), GFP_KERNEL);
        if (!ubufs)
                return ERR_PTR(-ENOMEM);
        atomic_set(&ubufs->refcount, 1);
        init_waitqueue_head(&ubufs->wait);
        ubufs->vq = vq;
        return ubufs;
}

static int vhost_net_ubuf_put(struct vhost_net_ubuf_ref *ubufs)
{
        int r = atomic_sub_return(1, &ubufs->refcount);
        if (unlikely(!r))
                wake_up(&ubufs->wait);
        return r;
}

static void vhost_net_ubuf_put_and_wait(struct vhost_net_ubuf_ref *ubufs)
{
        vhost_net_ubuf_put(ubufs);
        wait_event(ubufs->wait, !atomic_read(&ubufs->refcount));
}

static void vhost_net_ubuf_put_wait_and_free(struct vhost_net_ubuf_ref *ubufs)
{
        vhost_net_ubuf_put_and_wait(ubufs);
        kfree(ubufs);
}

static void vhost_net_clear_ubuf_info(struct vhost_net *n)
{
        int i;

        for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
                kfree(n->vqs[i].ubuf_info);
                n->vqs[i].ubuf_info = NULL;
        }
}
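
/* Allocate the per-virtqueue ubuf_info arrays (UIO_MAXIOV entries each)
 * for every queue that has zerocopy enabled in vhost_net_zcopy_mask.
 */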
static int vhost_net_set_ubuf_info(struct vhost_net *n)
{
        bool zcopy;
        int i;

        for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
                zcopy = vhost_net_zcopy_mask & (0x1 << i);
                if (!zcopy)
                        continue;
                n->vqs[i].ubuf_info =
                        kmalloc_array(UIO_MAXIOV,
                                      sizeof(*n->vqs[i].ubuf_info),
                                      GFP_KERNEL);
                if (!n->vqs[i].ubuf_info)
                        goto err;
        }
        return 0;

err:
        vhost_net_clear_ubuf_info(n);
        return -ENOMEM;
}

static void vhost_net_vq_reset(struct vhost_net *n)
{
        int i;

        vhost_net_clear_ubuf_info(n);

        for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
                n->vqs[i].done_idx = 0;
                n->vqs[i].upend_idx = 0;
                n->vqs[i].ubufs = NULL;
                n->vqs[i].vhost_hlen = 0;
                n->vqs[i].sock_hlen = 0;
                vhost_net_buf_init(&n->vqs[i].rxq);
        }

}
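
/* TX accounting below drives a simple heuristic: the counters reset roughly
 * every 1024 packets, and zerocopy stays selected only while there is no TX
 * flush in progress and recent zerocopy failures do not exceed about 1/64 of
 * recent transmissions.
 */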
static void vhost_net_tx_packet(struct vhost_net *net)
{
        ++net->tx_packets;
        if (net->tx_packets < 1024)
                return;
        net->tx_packets = 0;
        net->tx_zcopy_err = 0;
}

static void vhost_net_tx_err(struct vhost_net *net)
{
        ++net->tx_zcopy_err;
}

static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
{
        /* TX flush waits for outstanding DMAs to be done.
         * Don't start new DMAs.
         */
        return !net->tx_flush &&
                net->tx_packets / 64 >= net->tx_zcopy_err;
}

static bool vhost_sock_zcopy(struct socket *sock)
{
        return unlikely(experimental_zcopytx) &&
                sock_flag(sock->sk, SOCK_ZEROCOPY);
}

static bool vhost_sock_xdp(struct socket *sock)
{
        return sock_flag(sock->sk, SOCK_XDP);
}

/* The lower device may complete DMA out of order. upend_idx tracks the end
 * of the used idx, done_idx tracks the head. Once the lower device has
 * finished DMA contiguously, we signal the used idx to the KVM guest.
 */
static void vhost_zerocopy_signal_used(struct vhost_net *net,
                                       struct vhost_virtqueue *vq)
{
        struct vhost_net_virtqueue *nvq =
                container_of(vq, struct vhost_net_virtqueue, vq);
        int i, add;
        int j = 0;

        for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
                if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
                        vhost_net_tx_err(net);
                if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
                        vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
                        ++j;
                } else
                        break;
        }
        while (j) {
                add = min(UIO_MAXIOV - nvq->done_idx, j);
                vhost_add_used_and_signal_n(vq->dev, vq,
                                            &vq->heads[nvq->done_idx], add);
                nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
                j -= add;
        }
}
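
/* Completion callback run by the lower device when a zerocopy buffer has
 * finished (or failed) DMA: record the outcome in the descriptor's len
 * field, drop the ubuf reference, and occasionally kick the poll thread.
 */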
static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
{
        struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
        struct vhost_virtqueue *vq = ubufs->vq;
        int cnt;

        rcu_read_lock_bh();

        /* set len to mark this desc buffers done DMA */
        vq->heads[ubuf->desc].len = success ?
                VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
        cnt = vhost_net_ubuf_put(ubufs);

        /*
         * Trigger polling thread if guest stopped submitting new buffers:
         * in this case, the refcount after decrement will eventually reach 1.
         * We also trigger polling periodically after each 16 packets
         * (the value 16 here is more or less arbitrary, it's tuned to trigger
         * less than 10% of times).
         */
        if (cnt <= 1 || !(cnt % 16))
                vhost_poll_queue(&vq->poll);

        rcu_read_unlock_bh();
}
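
/* Coarse clock for busy polling: scale local_clock() nanoseconds down by
 * 2^10 (roughly microseconds) so the timeout comparison stays cheap.
 */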
static inline unsigned long busy_clock(void)
{
        return local_clock() >> 10;
}

static bool vhost_can_busy_poll(unsigned long endtime)
{
        return likely(!need_resched() && !time_after(busy_clock(), endtime) &&
                      !signal_pending(current));
}

static void vhost_net_disable_vq(struct vhost_net *n,
                                 struct vhost_virtqueue *vq)
{
        struct vhost_net_virtqueue *nvq =
                container_of(vq, struct vhost_net_virtqueue, vq);
        struct vhost_poll *poll = n->poll + (nvq - n->vqs);
        if (!vq->private_data)
                return;
        vhost_poll_stop(poll);
}

static int vhost_net_enable_vq(struct vhost_net *n,
                                struct vhost_virtqueue *vq)
{
        struct vhost_net_virtqueue *nvq =
                container_of(vq, struct vhost_net_virtqueue, vq);
        struct vhost_poll *poll = n->poll + (nvq - n->vqs);
        struct socket *sock;

        sock = vq->private_data;
        if (!sock)
                return 0;

        return vhost_poll_start(poll, sock->file);
}
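
/* Flush the done_idx batched used heads to the used ring, signal the guest,
 * and reset the batch count.
 */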
static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
{
        struct vhost_virtqueue *vq = &nvq->vq;
        struct vhost_dev *dev = vq->dev;

        if (!nvq->done_idx)
                return;

        vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
        nvq->done_idx = 0;
}
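
/* Submit the batched XDP buffers with a single sendmsg() carrying a
 * TUN_MSG_PTR control block, then report the used descriptors to the guest.
 */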
static void vhost_tx_batch(struct vhost_net *net,
                           struct vhost_net_virtqueue *nvq,
                           struct socket *sock,
                           struct msghdr *msghdr)
{
        struct tun_msg_ctl ctl = {
                .type = TUN_MSG_PTR,
                .num = nvq->batched_xdp,
                .ptr = nvq->xdp,
        };
        int err;

        if (nvq->batched_xdp == 0)
                goto signal_used;

        msghdr->msg_control = &ctl;
        err = sock->ops->sendmsg(sock, msghdr, 0);
        if (unlikely(err < 0)) {
                vq_err(&nvq->vq, "Fail to batch sending packets\n");
                return;
        }

signal_used:
        vhost_net_signal_used(nvq);
        nvq->batched_xdp = 0;
}

static int sock_has_rx_data(struct socket *sock)
{
        if (unlikely(!sock))
                return 0;

        if (sock->ops->peek_len)
                return sock->ops->peek_len(sock);

        return skb_queue_empty(&sock->sk->sk_receive_queue);
}

static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
                                          struct vhost_virtqueue *vq)
{
        if (!vhost_vq_avail_empty(&net->dev, vq)) {
                vhost_poll_queue(&vq->poll);
        } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                vhost_disable_notify(&net->dev, vq);
                vhost_poll_queue(&vq->poll);
        }
}
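
/* Busy poll the paired virtqueue for up to busyloop_timeout: spin while
 * neither ring has work and no other vhost work is pending, then either
 * queue the poll handler or re-enable notification as appropriate.
 */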
static void vhost_net_busy_poll(struct vhost_net *net,
|
|
|
|
struct vhost_virtqueue *rvq,
|
|
|
|
struct vhost_virtqueue *tvq,
|
|
|
|
bool *busyloop_intr,
|
|
|
|
bool poll_rx)
|
|
|
|
{
|
|
|
|
unsigned long busyloop_timeout;
|
|
|
|
unsigned long endtime;
|
|
|
|
struct socket *sock;
|
|
|
|
struct vhost_virtqueue *vq = poll_rx ? tvq : rvq;
|
|
|
|
|
2018-12-13 10:53:38 +08:00
|
|
|
/* Try to hold the vq mutex of the paired virtqueue. We can't
|
|
|
|
* use mutex_lock() here since we could not guarantee a
|
|
|
|
* consistenet lock ordering.
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&vq->mutex))
|
|
|
|
return;
|
|
|
|
|
2018-09-25 20:36:51 +08:00
|
|
|
vhost_disable_notify(&net->dev, vq);
|
|
|
|
sock = rvq->private_data;
|
|
|
|
|
|
|
|
busyloop_timeout = poll_rx ? rvq->busyloop_timeout:
|
|
|
|
tvq->busyloop_timeout;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
endtime = busy_clock() + busyloop_timeout;
|
|
|
|
|
|
|
|
while (vhost_can_busy_poll(endtime)) {
|
|
|
|
if (vhost_has_work(&net->dev)) {
|
|
|
|
*busyloop_intr = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((sock_has_rx_data(sock) &&
|
|
|
|
!vhost_vq_avail_empty(&net->dev, rvq)) ||
|
|
|
|
!vhost_vq_avail_empty(&net->dev, tvq))
|
|
|
|
break;
|
|
|
|
|
|
|
|
cpu_relax();
|
|
|
|
}
|
|
|
|
|
|
|
|
preempt_enable();
|
|
|
|
|
|
|
|
if (poll_rx || sock_has_rx_data(sock))
|
|
|
|
vhost_net_busy_poll_try_queue(net, vq);
|
|
|
|
else if (!poll_rx) /* On tx here, sock has no rx data. */
|
|
|
|
vhost_enable_notify(&net->dev, rvq);
|
|
|
|
|
|
|
|
mutex_unlock(&vq->mutex);
|
|
|
|
}
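/* Illustrative note (not part of the driver): handle_tx()/handle_rx()
 * already hold their own vq mutex (taken with mutex_lock_nested()), and
 * vhost_net_busy_poll() then grabs the mutex of the paired virtqueue.
 * A plain mutex_lock() here could deadlock because the two workers would
 * acquire the mutexes in opposite orders:
 *
 *	TX worker: lock(tvq->mutex) -> lock(rvq->mutex)
 *	RX worker: lock(rvq->mutex) -> lock(tvq->mutex)
 *
 * mutex_trylock() simply gives up under contention, at the cost of
 * skipping one busy-poll opportunity.
 */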
|
|
|
|
|
2016-03-04 19:24:53 +08:00
|
|
|
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
|
2018-09-25 20:36:52 +08:00
|
|
|
struct vhost_net_virtqueue *tnvq,
|
2018-07-03 15:31:32 +08:00
|
|
|
unsigned int *out_num, unsigned int *in_num,
|
2018-09-12 11:17:09 +08:00
|
|
|
struct msghdr *msghdr, bool *busyloop_intr)
|
2016-03-04 19:24:53 +08:00
|
|
|
{
|
2018-09-25 20:36:52 +08:00
|
|
|
struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
|
|
|
|
struct vhost_virtqueue *rvq = &rnvq->vq;
|
|
|
|
struct vhost_virtqueue *tvq = &tnvq->vq;
|
|
|
|
|
|
|
|
int r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
|
2016-06-23 14:04:32 +08:00
|
|
|
out_num, in_num, NULL, NULL);
|
2016-03-04 19:24:53 +08:00
|
|
|
|
2018-09-25 20:36:52 +08:00
|
|
|
if (r == tvq->num && tvq->busyloop_timeout) {
|
2018-09-12 11:17:09 +08:00
|
|
|
/* Flush batched packets first */
|
2018-09-25 20:36:52 +08:00
|
|
|
if (!vhost_sock_zcopy(tvq->private_data))
|
|
|
|
vhost_tx_batch(net, tnvq, tvq->private_data, msghdr);
|
|
|
|
|
|
|
|
vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, false);
|
|
|
|
|
|
|
|
r = vhost_get_vq_desc(tvq, tvq->iov, ARRAY_SIZE(tvq->iov),
|
2016-06-23 14:04:32 +08:00
|
|
|
out_num, in_num, NULL, NULL);
|
2016-03-04 19:24:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
2017-01-18 15:02:02 +08:00
|
|
|
static bool vhost_exceeds_maxpend(struct vhost_net *net)
|
|
|
|
{
|
|
|
|
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
|
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
|
|
|
|
vhost_net: do not stall on zerocopy depletion
Vhost-net has a hard limit on the number of zerocopy skbs in flight.
When reached, transmission stalls. Stalls cause latency, as well as
head-of-line blocking of other flows that do not use zerocopy.
Instead of stalling, revert to copy-based transmission.
Tested by sending two udp flows from guest to host, one with payload
of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
large flow is redirected to a netem instance with 1MBps rate limit
and deep 1000 entry queue.
modprobe ifb
ip link set dev ifb0 up
tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
tc qdisc add dev tap0 ingress
tc filter add dev tap0 parent ffff: protocol ip \
u32 match ip dport 8000 0xffff \
action mirred egress redirect dev ifb0
Before the delay, both flows process around 80K pps. With the delay,
before this patch, both process around 400. After this patch, the
large flow is still rate limited, while the small reverts to its
original rate. See also discussion in the first link, below.
Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to
send at 100% zerocopy.
The limit in vhost_exceeds_maxpend must be carefully chosen. With
vq->num >> 1, the flows remain correlated. This value happens to
correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
fractions and ensure correctness also for much smaller values of
vq->num, by testing the min() of both explicitly. See also the
discussion in the second link below.
Changes
v1 -> v2
- replaced min with typed min_t
- avoid unnecessary whitespace change
Link: http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
Link: http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 01:22:31 +08:00
|
|
|
return (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx) % UIO_MAXIOV >
|
|
|
|
min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
|
2017-01-18 15:02:02 +08:00
|
|
|
}
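/* Worked example (illustrative only): the expression above is the ring
 * distance between upend_idx and done_idx, i.e. the number of zerocopy
 * buffers still in flight:
 *
 *	in_flight = (upend_idx + UIO_MAXIOV - done_idx) % UIO_MAXIOV;
 *
 * compared against min(VHOST_MAX_PEND, vq->num >> 2).  For vq->num == 256
 * that is min(128, 64) == 64; for vq->num == 1024 it is min(128, 256) ==
 * 128.  Once the limit is hit, handle_tx_zerocopy() reverts to copy-based
 * transmission instead of stalling, as described in the commit message
 * above.
 */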
|
|
|
|
|
2018-07-20 08:15:14 +08:00
|
|
|
static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
|
|
|
|
size_t hdr_size, int out)
|
|
|
|
{
|
|
|
|
/* Skip header. TODO: support TSO. */
|
|
|
|
size_t len = iov_length(vq->iov, out);
|
|
|
|
|
|
|
|
iov_iter_init(iter, WRITE, vq->iov, out, len);
|
|
|
|
iov_iter_advance(iter, hdr_size);
|
|
|
|
|
|
|
|
return iov_iter_count(iter);
|
|
|
|
}
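/* Illustrative example (not part of the driver): with mergeable rx
 * buffers negotiated, nvq->vhost_hlen is typically
 * sizeof(struct virtio_net_hdr_mrg_rxbuf) == 12.  For a TX descriptor
 * chain whose "out" iovecs cover 12 + 1514 bytes, init_iov_iter()
 * advances the iterator past the 12-byte header and returns 1514, the
 * payload length actually handed to sendmsg().
 */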
|
|
|
|
|
2018-07-20 08:15:16 +08:00
|
|
|
static int get_tx_bufs(struct vhost_net *net,
|
|
|
|
struct vhost_net_virtqueue *nvq,
|
|
|
|
struct msghdr *msg,
|
|
|
|
unsigned int *out, unsigned int *in,
|
|
|
|
size_t *len, bool *busyloop_intr)
|
|
|
|
{
|
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
|
|
|
int ret;
|
|
|
|
|
2018-09-12 11:17:09 +08:00
|
|
|
ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, msg, busyloop_intr);
|
2018-07-20 08:15:21 +08:00
|
|
|
|
2018-07-20 08:15:16 +08:00
|
|
|
if (ret < 0 || ret == vq->num)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (*in) {
|
|
|
|
vq_err(vq, "Unexpected descriptor format for TX: out %d, int %d\n",
|
|
|
|
*out, *in);
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Sanity check */
|
|
|
|
*len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
|
|
|
|
if (*len == 0) {
|
|
|
|
vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
|
|
|
|
*len, nvq->vhost_hlen);
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-07-20 08:15:17 +08:00
|
|
|
static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
|
|
|
|
{
|
|
|
|
return total_len < VHOST_NET_WEIGHT &&
|
|
|
|
!vhost_vq_avail_empty(vq->dev, vq);
|
|
|
|
}
|
|
|
|
|
2018-11-15 17:43:09 +08:00
|
|
|
#define SKB_FRAG_PAGE_ORDER get_order(32768)
|
|
|
|
|
|
|
|
static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
|
|
|
|
struct page_frag *pfrag, gfp_t gfp)
|
|
|
|
{
|
|
|
|
if (pfrag->page) {
|
|
|
|
if (pfrag->offset + sz <= pfrag->size)
|
|
|
|
return true;
|
|
|
|
__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
|
|
|
|
}
|
|
|
|
|
|
|
|
pfrag->offset = 0;
|
|
|
|
net->refcnt_bias = 0;
|
|
|
|
if (SKB_FRAG_PAGE_ORDER) {
|
|
|
|
/* Avoid direct reclaim but allow kswapd to wake */
|
|
|
|
pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
|
|
|
|
__GFP_COMP | __GFP_NOWARN |
|
|
|
|
__GFP_NORETRY,
|
|
|
|
SKB_FRAG_PAGE_ORDER);
|
|
|
|
if (likely(pfrag->page)) {
|
|
|
|
pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
pfrag->page = alloc_page(gfp);
|
|
|
|
if (likely(pfrag->page)) {
|
|
|
|
pfrag->size = PAGE_SIZE;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
|
|
|
|
done:
|
|
|
|
net->refcnt_bias = USHRT_MAX;
|
|
|
|
page_ref_add(pfrag->page, USHRT_MAX - 1);
|
|
|
|
return true;
|
|
|
|
}
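/* Illustrative note (not part of the driver): the refcount bias trades
 * one atomic per packet for one atomic per page.  After a successful
 * refill the page reference count is bumped by USHRT_MAX - 1 and
 * net->refcnt_bias is set to USHRT_MAX; each buffer carved out of the
 * frag then only does --net->refcnt_bias.  When the page is retired, the
 * unused bias is dropped in one go:
 *
 *	__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
 *
 * which is equivalent to, but cheaper than, taking and dropping one page
 * reference per allocation.
 */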
|
|
|
|
|
2018-09-12 11:17:09 +08:00
|
|
|
#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
|
|
|
|
|
|
|
|
static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
|
|
|
|
struct iov_iter *from)
|
|
|
|
{
|
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
2018-11-15 17:43:09 +08:00
|
|
|
struct vhost_net *net = container_of(vq->dev, struct vhost_net,
|
|
|
|
dev);
|
2018-09-12 11:17:09 +08:00
|
|
|
struct socket *sock = vq->private_data;
|
2018-11-15 17:43:09 +08:00
|
|
|
struct page_frag *alloc_frag = &net->page_frag;
|
2018-09-12 11:17:09 +08:00
|
|
|
struct virtio_net_hdr *gso;
|
|
|
|
struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
|
|
|
|
struct tun_xdp_hdr *hdr;
|
|
|
|
size_t len = iov_iter_count(from);
|
|
|
|
int headroom = vhost_sock_xdp(sock) ? XDP_PACKET_HEADROOM : 0;
|
|
|
|
int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
|
|
|
|
int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + headroom + nvq->sock_hlen);
|
|
|
|
int sock_hlen = nvq->sock_hlen;
|
|
|
|
void *buf;
|
|
|
|
int copied;
|
|
|
|
|
|
|
|
if (unlikely(len < nvq->sock_hlen))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (SKB_DATA_ALIGN(len + pad) +
|
|
|
|
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
buflen += SKB_DATA_ALIGN(len + pad);
|
|
|
|
alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
|
2018-11-15 17:43:09 +08:00
|
|
|
if (unlikely(!vhost_net_page_frag_refill(net, buflen,
|
|
|
|
alloc_frag, GFP_KERNEL)))
|
2018-09-12 11:17:09 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
|
|
|
|
copied = copy_page_from_iter(alloc_frag->page,
|
|
|
|
alloc_frag->offset +
|
|
|
|
offsetof(struct tun_xdp_hdr, gso),
|
|
|
|
sock_hlen, from);
|
|
|
|
if (copied != sock_hlen)
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
hdr = buf;
|
|
|
|
gso = &hdr->gso;
|
|
|
|
|
|
|
|
if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
|
|
|
|
vhost16_to_cpu(vq, gso->csum_start) +
|
|
|
|
vhost16_to_cpu(vq, gso->csum_offset) + 2 >
|
|
|
|
vhost16_to_cpu(vq, gso->hdr_len)) {
|
|
|
|
gso->hdr_len = cpu_to_vhost16(vq,
|
|
|
|
vhost16_to_cpu(vq, gso->csum_start) +
|
|
|
|
vhost16_to_cpu(vq, gso->csum_offset) + 2);
|
|
|
|
|
|
|
|
if (vhost16_to_cpu(vq, gso->hdr_len) > len)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
len -= sock_hlen;
|
|
|
|
copied = copy_page_from_iter(alloc_frag->page,
|
|
|
|
alloc_frag->offset + pad,
|
|
|
|
len, from);
|
|
|
|
if (copied != len)
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
xdp->data_hard_start = buf;
|
|
|
|
xdp->data = buf + pad;
|
|
|
|
xdp->data_end = xdp->data + len;
|
|
|
|
hdr->buflen = buflen;
|
|
|
|
|
2018-11-15 17:43:09 +08:00
|
|
|
--net->refcnt_bias;
|
2018-09-12 11:17:09 +08:00
|
|
|
alloc_frag->offset += buflen;
|
|
|
|
|
|
|
|
++nvq->batched_xdp;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
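/* Rough layout of one frag-allocated XDP buffer built above (all sizes
 * rounded with SKB_DATA_ALIGN; headroom is XDP_PACKET_HEADROOM only when
 * an XDP program is attached to the tap):
 *
 *	buf
 *	+---------------------------------+-------------+-----------------+
 *	| pad = VHOST_NET_RX_PAD +        | packet data | skb_shared_info |
 *	|       headroom + sock_hlen      | (len bytes) | (for build_skb) |
 *	+---------------------------------+-------------+-----------------+
 *	^ xdp->data_hard_start            ^ xdp->data   ^ xdp->data_end
 *
 * The virtio-net header is copied to offsetof(struct tun_xdp_hdr, gso)
 * inside the pad area, so the tap side can reuse it without another copy.
 */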
|
|
|
|
|
2018-07-20 08:15:18 +08:00
|
|
|
static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
|
2010-01-14 14:17:27 +08:00
|
|
|
{
|
2013-04-27 15:07:46 +08:00
|
|
|
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
|
2013-04-28 20:51:40 +08:00
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
2014-12-11 04:00:58 +08:00
|
|
|
unsigned out, in;
|
2010-06-24 21:59:59 +08:00
|
|
|
int head;
|
2010-01-14 14:17:27 +08:00
|
|
|
struct msghdr msg = {
|
|
|
|
.msg_name = NULL,
|
|
|
|
.msg_namelen = 0,
|
|
|
|
.msg_control = NULL,
|
|
|
|
.msg_controllen = 0,
|
|
|
|
.msg_flags = MSG_DONTWAIT,
|
|
|
|
};
|
|
|
|
size_t len, total_len = 0;
|
2013-04-11 04:50:48 +08:00
|
|
|
int err;
|
2018-04-09 15:22:17 +08:00
|
|
|
int sent_pkts = 0;
|
2018-09-12 11:17:09 +08:00
|
|
|
bool sock_can_batch = (sock->sk->sk_sndbuf == INT_MAX);
|
2010-03-10 02:24:45 +08:00
|
|
|
|
vhost_net: fix possible infinite loop
When the rx buffer is too small for a packet, we will discard the vq
descriptor and retry it for the next packet:
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
					      &busyloop_intr))) {
	...
	/* On overrun, truncate and discard */
	if (unlikely(headcount > UIO_MAXIOV)) {
		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
		err = sock->ops->recvmsg(sock, &msg,
					 1, MSG_DONTWAIT | MSG_TRUNC);
		pr_debug("Discarded rx packet: len %zd\n", sock_len);
		continue;
	}
	...
}
This makes it possible to trigger an infinite while..continue loop
through the co-operation of two VMs:
1) Malicious VM1 allocates a 1 byte rx buffer and tries to slow down the
vhost process as much as possible, e.g. by using indirect descriptors or
other means.
2) Malicious VM2 generates packets to VM1 as fast as possible.
Fix this by checking against the weight at the end of the RX and TX
loops. This also eliminates other similar cases where:
- userspace is consuming the packets in the meanwhile
- a theoretical TOCTOU attack has the guest move the avail index back and
forth to hit the continue after vhost finds the guest just added new
buffers
This addresses CVE-2019-3900.
Fixes: d8316f3991d20 ("vhost: fix total length when packets are too short")
Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-05-17 12:29:50 +08:00
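/* Illustrative shape of the weight-bounded service loop used by both TX
 * paths below (a sketch, not a verbatim copy):
 *
 *	int pkts = 0;
 *	size_t total_len = 0;
 *
 *	do {
 *		// ...handle one buffer, total_len += len...
 *	} while (likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
 *
 * vhost_exceeds_weight() re-queues the virtqueue and returns true once
 * the packet or byte budget for this worker invocation is spent, so a
 * guest (or a co-operating peer VM) can no longer pin the vhost worker
 * indefinitely.
 */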
|
|
|
do {
|
2018-07-20 08:15:18 +08:00
|
|
|
bool busyloop_intr = false;
|
2010-01-14 14:17:27 +08:00
|
|
|
|
2018-09-12 11:17:09 +08:00
|
|
|
if (nvq->done_idx == VHOST_NET_BATCH)
|
|
|
|
vhost_tx_batch(net, nvq, sock, &msg);
|
|
|
|
|
2018-07-20 08:15:18 +08:00
|
|
|
head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
|
|
|
|
&busyloop_intr);
|
|
|
|
/* On error, stop handling until the next kick. */
|
|
|
|
if (unlikely(head < 0))
|
|
|
|
break;
|
|
|
|
/* Nothing new? Wait for eventfd to tell us they refilled. */
|
|
|
|
if (head == vq->num) {
|
|
|
|
if (unlikely(busyloop_intr)) {
|
|
|
|
vhost_poll_queue(&vq->poll);
|
|
|
|
} else if (unlikely(vhost_enable_notify(&net->dev,
|
|
|
|
vq))) {
|
|
|
|
vhost_disable_notify(&net->dev, vq);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
2016-06-23 14:04:32 +08:00
|
|
|
|
2018-07-20 08:15:18 +08:00
|
|
|
total_len += len;
|
2018-09-12 11:17:09 +08:00
|
|
|
|
|
|
|
/* For simplicity, TX batching is only enabled if
|
|
|
|
* sndbuf is unlimited.
|
|
|
|
*/
|
|
|
|
if (sock_can_batch) {
|
|
|
|
err = vhost_net_build_xdp(nvq, &msg.msg_iter);
|
|
|
|
if (!err) {
|
|
|
|
goto done;
|
|
|
|
} else if (unlikely(err != -ENOSPC)) {
|
|
|
|
vhost_tx_batch(net, nvq, sock, &msg);
|
|
|
|
vhost_discard_vq_desc(vq, 1);
|
|
|
|
vhost_net_enable_vq(net, vq);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
			/* We can't build an XDP buff, so fall back to the
			 * single-packet path, but flush the batched
			 * packets first.
			 */
|
|
|
|
vhost_tx_batch(net, nvq, sock, &msg);
|
|
|
|
msg.msg_control = NULL;
|
|
|
|
} else {
|
|
|
|
if (tx_can_batch(vq, total_len))
|
|
|
|
msg.msg_flags |= MSG_MORE;
|
|
|
|
else
|
|
|
|
msg.msg_flags &= ~MSG_MORE;
|
|
|
|
}
|
2018-07-20 08:15:18 +08:00
|
|
|
|
|
|
|
/* TODO: Check specific error and bomb out unless ENOBUFS? */
|
|
|
|
err = sock->ops->sendmsg(sock, &msg, len);
|
|
|
|
if (unlikely(err < 0)) {
|
|
|
|
vhost_discard_vq_desc(vq, 1);
|
|
|
|
vhost_net_enable_vq(net, vq);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (err != len)
|
|
|
|
pr_debug("Truncated TX packet: len %d != %zd\n",
|
|
|
|
err, len);
|
2018-09-12 11:17:09 +08:00
|
|
|
done:
|
|
|
|
vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
|
|
|
|
vq->heads[nvq->done_idx].len = 0;
|
|
|
|
++nvq->done_idx;
|
2019-05-17 12:29:50 +08:00
|
|
|
} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
|
2018-07-20 08:15:21 +08:00
|
|
|
|
2018-09-12 11:17:09 +08:00
|
|
|
vhost_tx_batch(net, nvq, sock, &msg);
|
2018-07-20 08:15:18 +08:00
|
|
|
}
|
2010-01-14 14:17:27 +08:00
|
|
|
|
2018-07-20 08:15:18 +08:00
|
|
|
static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
|
|
|
|
{
|
|
|
|
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
|
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
|
|
|
unsigned out, in;
|
|
|
|
int head;
|
|
|
|
struct msghdr msg = {
|
|
|
|
.msg_name = NULL,
|
|
|
|
.msg_namelen = 0,
|
|
|
|
.msg_control = NULL,
|
|
|
|
.msg_controllen = 0,
|
|
|
|
.msg_flags = MSG_DONTWAIT,
|
|
|
|
};
|
2018-09-12 11:17:06 +08:00
|
|
|
struct tun_msg_ctl ctl;
|
2018-07-20 08:15:18 +08:00
|
|
|
size_t len, total_len = 0;
|
|
|
|
int err;
|
|
|
|
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
|
|
|
|
bool zcopy_used;
|
|
|
|
int sent_pkts = 0;
|
2010-01-14 14:17:27 +08:00
|
|
|
|
2019-05-17 12:29:50 +08:00
|
|
|
do {
|
2018-07-03 15:31:32 +08:00
|
|
|
bool busyloop_intr;
|
|
|
|
|
2011-07-18 11:48:46 +08:00
|
|
|
		/* Release DMA-done buffers first */
|
2018-07-20 08:15:18 +08:00
|
|
|
vhost_zerocopy_signal_used(net, vq);
|
2011-07-18 11:48:46 +08:00
|
|
|
|
2018-07-03 15:31:32 +08:00
|
|
|
busyloop_intr = false;
|
2018-07-20 08:15:16 +08:00
|
|
|
head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
|
|
|
|
&busyloop_intr);
|
2010-06-24 21:59:59 +08:00
|
|
|
/* On error, stop handling until the next kick. */
|
2010-07-01 23:40:12 +08:00
|
|
|
if (unlikely(head < 0))
|
2010-06-24 21:59:59 +08:00
|
|
|
break;
|
2010-01-14 14:17:27 +08:00
|
|
|
/* Nothing new? Wait for eventfd to tell us they refilled. */
|
|
|
|
if (head == vq->num) {
|
2018-07-03 15:31:32 +08:00
|
|
|
if (unlikely(busyloop_intr)) {
|
|
|
|
vhost_poll_queue(&vq->poll);
|
|
|
|
} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
|
2011-05-20 07:10:54 +08:00
|
|
|
vhost_disable_notify(&net->dev, vq);
|
2010-01-14 14:17:27 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
2013-09-02 16:40:59 +08:00
|
|
|
|
2018-07-20 08:15:18 +08:00
|
|
|
zcopy_used = len >= VHOST_GOODCOPY_LEN
|
|
|
|
&& !vhost_exceeds_maxpend(net)
|
|
|
|
&& vhost_net_tx_select_zcopy(net);
|
2012-12-06 23:00:18 +08:00
|
|
|
|
2011-07-18 11:48:46 +08:00
|
|
|
/* use msg_control to pass vhost zerocopy ubuf info to skb */
|
2012-12-06 23:00:18 +08:00
|
|
|
if (zcopy_used) {
|
2013-09-02 16:40:59 +08:00
|
|
|
struct ubuf_info *ubuf;
|
|
|
|
ubuf = nvq->ubuf_info + nvq->upend_idx;
|
|
|
|
|
2014-10-24 19:19:48 +08:00
|
|
|
vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
|
2013-09-02 16:40:59 +08:00
|
|
|
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
|
|
|
|
ubuf->callback = vhost_zerocopy_callback;
|
|
|
|
ubuf->ctx = nvq->ubufs;
|
|
|
|
ubuf->desc = nvq->upend_idx;
|
2017-09-01 07:48:22 +08:00
|
|
|
refcount_set(&ubuf->refcnt, 1);
|
2018-09-12 11:17:06 +08:00
|
|
|
msg.msg_control = &ctl;
|
|
|
|
ctl.type = TUN_MSG_UBUF;
|
|
|
|
ctl.ptr = ubuf;
|
|
|
|
msg.msg_controllen = sizeof(ctl);
|
2013-09-02 16:40:59 +08:00
|
|
|
ubufs = nvq->ubufs;
|
2014-02-13 17:42:05 +08:00
|
|
|
atomic_inc(&ubufs->refcount);
|
2013-04-27 15:07:46 +08:00
|
|
|
nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
|
2013-09-02 16:40:59 +08:00
|
|
|
} else {
|
2013-06-05 15:40:46 +08:00
|
|
|
msg.msg_control = NULL;
|
2013-09-02 16:40:59 +08:00
|
|
|
ubufs = NULL;
|
|
|
|
}
|
2017-01-18 15:02:02 +08:00
|
|
|
total_len += len;
|
2018-07-20 08:15:17 +08:00
|
|
|
if (tx_can_batch(vq, total_len) &&
|
2017-01-18 15:02:02 +08:00
|
|
|
likely(!vhost_exceeds_maxpend(net))) {
|
|
|
|
msg.msg_flags |= MSG_MORE;
|
|
|
|
} else {
|
|
|
|
msg.msg_flags &= ~MSG_MORE;
|
|
|
|
}
|
|
|
|
|
2010-01-14 14:17:27 +08:00
|
|
|
/* TODO: Check specific error and bomb out unless ENOBUFS? */
|
2015-03-02 15:37:48 +08:00
|
|
|
err = sock->ops->sendmsg(sock, &msg, len);
|
2010-01-14 14:17:27 +08:00
|
|
|
if (unlikely(err < 0)) {
|
2012-12-06 23:00:18 +08:00
|
|
|
if (zcopy_used) {
|
2013-09-02 16:40:59 +08:00
|
|
|
vhost_net_ubuf_put(ubufs);
|
2013-04-27 15:07:46 +08:00
|
|
|
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
|
|
|
|
% UIO_MAXIOV;
|
2011-07-18 11:48:46 +08:00
|
|
|
}
|
2010-07-27 23:52:21 +08:00
|
|
|
vhost_discard_vq_desc(vq, 1);
|
2017-11-13 11:45:34 +08:00
|
|
|
vhost_net_enable_vq(net, vq);
|
2010-01-14 14:17:27 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (err != len)
|
2010-06-24 22:10:25 +08:00
|
|
|
pr_debug("Truncated TX packet: "
|
|
|
|
" len %d != %zd\n", err, len);
|
2012-12-06 23:00:18 +08:00
|
|
|
if (!zcopy_used)
|
2011-07-18 11:48:46 +08:00
|
|
|
vhost_add_used_and_signal(&net->dev, vq, head, 0);
|
2012-05-02 11:42:41 +08:00
|
|
|
else
|
2012-11-01 17:16:51 +08:00
|
|
|
vhost_zerocopy_signal_used(net, vq);
|
|
|
|
vhost_net_tx_packet(net);
|
2019-05-17 12:29:50 +08:00
|
|
|
} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
|
2018-07-20 08:15:18 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Expects to always be run from workqueue - which acts as
 * read-side critical section for our kind of RCU. */
|
|
|
|
static void handle_tx(struct vhost_net *net)
|
|
|
|
{
|
|
|
|
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
|
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
|
|
|
struct socket *sock;
|
|
|
|
|
2018-09-25 20:36:50 +08:00
|
|
|
mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_TX);
|
2018-07-20 08:15:18 +08:00
|
|
|
sock = vq->private_data;
|
|
|
|
if (!sock)
|
|
|
|
goto out;
|
|
|
|
|
2019-05-24 16:12:15 +08:00
|
|
|
if (!vq_meta_prefetch(vq))
|
2018-07-20 08:15:18 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
vhost_disable_notify(&net->dev, vq);
|
|
|
|
vhost_net_disable_vq(net, vq);
|
|
|
|
|
|
|
|
if (vhost_sock_zcopy(sock))
|
|
|
|
handle_tx_zerocopy(net, sock);
|
|
|
|
else
|
|
|
|
handle_tx_copy(net, sock);
|
|
|
|
|
2013-05-07 14:54:33 +08:00
|
|
|
out:
|
2010-01-14 14:17:27 +08:00
|
|
|
mutex_unlock(&vq->mutex);
|
|
|
|
}
|
|
|
|
|
2017-05-17 12:14:45 +08:00
|
|
|
static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
|
2010-07-27 23:52:21 +08:00
|
|
|
{
|
|
|
|
struct sk_buff *head;
|
|
|
|
int len = 0;
|
2011-01-17 16:11:17 +08:00
|
|
|
unsigned long flags;
|
2010-07-27 23:52:21 +08:00
|
|
|
|
2018-01-04 11:14:27 +08:00
|
|
|
if (rvq->rx_ring)
|
2017-05-17 12:14:45 +08:00
|
|
|
return vhost_net_buf_peek(rvq);
|
2016-06-30 14:45:36 +08:00
|
|
|
|
2011-01-17 16:11:17 +08:00
|
|
|
spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
|
2010-07-27 23:52:21 +08:00
|
|
|
head = skb_peek(&sk->sk_receive_queue);
|
2012-05-04 06:55:23 +08:00
|
|
|
if (likely(head)) {
|
2010-07-27 23:52:21 +08:00
|
|
|
len = head->len;
|
2015-01-14 00:13:44 +08:00
|
|
|
if (skb_vlan_tag_present(head))
|
2012-05-04 06:55:23 +08:00
|
|
|
len += VLAN_HLEN;
|
|
|
|
}
|
|
|
|
|
2011-01-17 16:11:17 +08:00
|
|
|
spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
|
2010-07-27 23:52:21 +08:00
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2018-07-03 15:31:33 +08:00
|
|
|
static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
|
|
|
|
bool *busyloop_intr)
|
2016-03-04 19:24:53 +08:00
|
|
|
{
|
2018-07-03 15:31:31 +08:00
|
|
|
struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
|
|
|
|
struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
|
2018-07-03 15:31:34 +08:00
|
|
|
struct vhost_virtqueue *rvq = &rnvq->vq;
|
2018-07-03 15:31:31 +08:00
|
|
|
struct vhost_virtqueue *tvq = &tnvq->vq;
|
|
|
|
int len = peek_head_len(rnvq, sk);
|
2016-03-04 19:24:53 +08:00
|
|
|
|
2018-09-25 20:36:51 +08:00
|
|
|
if (!len && rvq->busyloop_timeout) {
|
2018-05-29 14:18:19 +08:00
|
|
|
/* Flush batched heads first */
|
2018-07-20 08:15:19 +08:00
|
|
|
vhost_net_signal_used(rnvq);
|
2016-03-04 19:24:53 +08:00
|
|
|
/* Both tx vq and rx socket were polled here */
|
2018-09-25 20:36:51 +08:00
|
|
|
vhost_net_busy_poll(net, rvq, tvq, busyloop_intr, true);
|
2016-03-04 19:24:53 +08:00
|
|
|
|
2018-07-03 15:31:31 +08:00
|
|
|
len = peek_head_len(rnvq, sk);
|
2016-03-04 19:24:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2010-07-27 23:52:21 +08:00
|
|
|
/* This is a multi-buffer version of vhost_get_vq_desc, that works if
|
|
|
|
* vq has read descriptors only.
|
|
|
|
* @vq - the relevant virtqueue
|
|
|
|
* @datalen - data length we'll be reading
|
|
|
|
* @iovcount - returned count of io vectors we fill
|
|
|
|
* @log - vhost log
|
|
|
|
* @log_num - log offset
|
2011-01-17 16:11:08 +08:00
|
|
|
* @quota - headcount quota, 1 for big buffer
|
2010-07-27 23:52:21 +08:00
|
|
|
* returns number of buffer heads allocated, negative on error
|
|
|
|
*/
|
|
|
|
static int get_rx_bufs(struct vhost_virtqueue *vq,
|
|
|
|
struct vring_used_elem *heads,
|
|
|
|
int datalen,
|
|
|
|
unsigned *iovcount,
|
|
|
|
struct vhost_log *log,
|
2011-01-17 16:11:08 +08:00
|
|
|
unsigned *log_num,
|
|
|
|
unsigned int quota)
|
2010-07-27 23:52:21 +08:00
|
|
|
{
|
|
|
|
unsigned int out, in;
|
|
|
|
int seg = 0;
|
|
|
|
int headcount = 0;
|
|
|
|
unsigned d;
|
|
|
|
int r, nlogs = 0;
|
2014-10-24 19:19:48 +08:00
|
|
|
/* len is always initialized before use since we are always called with
|
|
|
|
* datalen > 0.
|
|
|
|
*/
|
|
|
|
u32 uninitialized_var(len);
|
2010-07-27 23:52:21 +08:00
|
|
|
|
2011-01-17 16:11:08 +08:00
|
|
|
while (datalen > 0 && headcount < quota) {
|
2010-09-14 23:53:05 +08:00
|
|
|
if (unlikely(seg >= UIO_MAXIOV)) {
|
2010-07-27 23:52:21 +08:00
|
|
|
r = -ENOBUFS;
|
|
|
|
goto err;
|
|
|
|
}
|
2014-06-05 20:20:27 +08:00
|
|
|
r = vhost_get_vq_desc(vq, vq->iov + seg,
|
2010-07-27 23:52:21 +08:00
|
|
|
ARRAY_SIZE(vq->iov) - seg, &out,
|
|
|
|
&in, log, log_num);
|
2014-03-27 18:53:37 +08:00
|
|
|
if (unlikely(r < 0))
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
d = r;
|
2010-07-27 23:52:21 +08:00
|
|
|
if (d == vq->num) {
|
|
|
|
r = 0;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
if (unlikely(out || in <= 0)) {
|
|
|
|
vq_err(vq, "unexpected descriptor format for RX: "
|
|
|
|
"out %d, in %d\n", out, in);
|
|
|
|
r = -EINVAL;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
if (unlikely(log)) {
|
|
|
|
nlogs += *log_num;
|
|
|
|
log += *log_num;
|
|
|
|
}
|
2014-10-24 19:19:48 +08:00
|
|
|
heads[headcount].id = cpu_to_vhost32(vq, d);
|
|
|
|
len = iov_length(vq->iov + seg, in);
|
|
|
|
heads[headcount].len = cpu_to_vhost32(vq, len);
|
|
|
|
datalen -= len;
|
2010-07-27 23:52:21 +08:00
|
|
|
++headcount;
|
|
|
|
seg += in;
|
|
|
|
}
|
2015-01-07 16:51:00 +08:00
|
|
|
heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
|
2010-07-27 23:52:21 +08:00
|
|
|
*iovcount = seg;
|
|
|
|
if (unlikely(log))
|
|
|
|
*log_num = nlogs;
|
2014-03-27 18:00:26 +08:00
|
|
|
|
|
|
|
/* Detect overrun */
|
|
|
|
if (unlikely(datalen > 0)) {
|
|
|
|
r = UIO_MAXIOV + 1;
|
|
|
|
goto err;
|
|
|
|
}
|
2010-07-27 23:52:21 +08:00
|
|
|
return headcount;
|
|
|
|
err:
|
|
|
|
vhost_discard_vq_desc(vq, headcount);
|
|
|
|
return r;
|
|
|
|
}
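get_rx_bufs() keeps pulling guest buffers until datalen bytes are covered, records each buffer's full capacity in heads[], and then trims the last entry so the recorded lengths sum to exactly the packet size; if the quota runs out first it reports an overrun (UIO_MAXIOV + 1) so the caller can truncate and discard. A simplified, runnable user-space model of that accounting; gather_bufs() and the fixed buffer array are illustration only, and the log and discard paths are omitted:

#include <stdio.h>

#define MAX_HEADS	8
#define OVERRUN		(MAX_HEADS + 1)	/* mirrors the UIO_MAXIOV + 1 sentinel */

struct head {
	int id;
	int len;
};

static int gather_bufs(const int *bufs, int nbufs, int datalen,
		       struct head *heads)
{
	int headcount = 0;
	int len = 0;

	while (datalen > 0 && headcount < nbufs && headcount < MAX_HEADS) {
		len = bufs[headcount];
		heads[headcount].id = headcount;
		heads[headcount].len = len;
		datalen -= len;
		headcount++;
	}
	/* Trim the final head: len + datalen is what the packet actually used. */
	if (headcount)
		heads[headcount - 1].len = len + datalen;
	return datalen > 0 ? OVERRUN : headcount;
}

int main(void)
{
	int bufs[] = { 1500, 1500, 1500 };	/* three guest buffers */
	struct head heads[MAX_HEADS];
	int i, n = gather_bufs(bufs, 3, 2000, heads);	/* 2000-byte packet */

	if (n > MAX_HEADS) {
		printf("overrun\n");
		return 0;
	}
	for (i = 0; i < n; i++)
		printf("head %d: id=%d len=%d\n", i, heads[i].id, heads[i].len);
	return 0;
}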
|
|
|
|
|
2010-01-14 14:17:27 +08:00
|
|
|
/* Expects to always be run from a workqueue - which acts as a
|
|
|
|
* read-side critical section for our kind of RCU. */
|
2011-01-17 16:11:08 +08:00
|
|
|
static void handle_rx(struct vhost_net *net)
|
2010-01-14 14:17:27 +08:00
|
|
|
{
|
2013-04-28 20:51:40 +08:00
|
|
|
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
|
|
|
|
struct vhost_virtqueue *vq = &nvq->vq;
|
2010-07-27 23:52:21 +08:00
|
|
|
unsigned uninitialized_var(in), log;
|
|
|
|
struct vhost_log *vq_log;
|
|
|
|
struct msghdr msg = {
|
|
|
|
.msg_name = NULL,
|
|
|
|
.msg_namelen = 0,
|
|
|
|
.msg_control = NULL, /* FIXME: get and handle RX aux data. */
|
|
|
|
.msg_controllen = 0,
|
|
|
|
.msg_flags = MSG_DONTWAIT,
|
|
|
|
};
|
2015-02-15 16:35:17 +08:00
|
|
|
struct virtio_net_hdr hdr = {
|
|
|
|
.flags = 0,
|
|
|
|
.gso_type = VIRTIO_NET_HDR_GSO_NONE
|
2010-07-27 23:52:21 +08:00
|
|
|
};
|
|
|
|
size_t total_len = 0;
|
2012-10-25 02:37:51 +08:00
|
|
|
int err, mergeable;
|
2018-05-29 14:18:19 +08:00
|
|
|
s16 headcount;
|
2010-07-27 23:52:21 +08:00
|
|
|
size_t vhost_hlen, sock_hlen;
|
|
|
|
size_t vhost_len, sock_len;
|
2018-07-03 15:31:33 +08:00
|
|
|
bool busyloop_intr = false;
|
2013-05-07 14:54:33 +08:00
|
|
|
struct socket *sock;
|
2014-12-11 04:51:28 +08:00
|
|
|
struct iov_iter fixup;
|
2015-02-15 16:35:17 +08:00
|
|
|
__virtio16 num_buffers;
|
2018-04-24 16:34:36 +08:00
|
|
|
int recv_pkts = 0;
|
2010-07-27 23:52:21 +08:00
|
|
|
|
2018-09-25 20:36:50 +08:00
|
|
|
mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
|
2013-05-07 14:54:33 +08:00
|
|
|
sock = vq->private_data;
|
|
|
|
if (!sock)
|
|
|
|
goto out;
|
2016-06-23 14:04:32 +08:00
|
|
|
|
2019-05-24 16:12:15 +08:00
|
|
|
if (!vq_meta_prefetch(vq))
|
2016-06-23 14:04:32 +08:00
|
|
|
goto out;
|
|
|
|
|
2011-05-20 07:10:54 +08:00
|
|
|
vhost_disable_notify(&net->dev, vq);
|
2016-06-01 13:56:33 +08:00
|
|
|
vhost_net_disable_vq(net, vq);
|
2013-05-07 14:54:33 +08:00
|
|
|
|
2013-04-28 20:51:40 +08:00
|
|
|
vhost_hlen = nvq->vhost_hlen;
|
|
|
|
sock_hlen = nvq->sock_hlen;
|
2010-07-27 23:52:21 +08:00
|
|
|
|
2014-06-05 20:20:23 +08:00
|
|
|
vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
|
2010-07-27 23:52:21 +08:00
|
|
|
vq->log : NULL;
|
2014-06-05 20:20:23 +08:00
|
|
|
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
|
2010-07-27 23:52:21 +08:00
|
|
|
|
vhost_net: fix possible infinite loop
When the rx buffer is too small for a packet, we discard the vq
descriptor and retry it for the next packet:
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
					      &busyloop_intr))) {
	...
	/* On overrun, truncate and discard */
	if (unlikely(headcount > UIO_MAXIOV)) {
		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
		err = sock->ops->recvmsg(sock, &msg,
					 1, MSG_DONTWAIT | MSG_TRUNC);
		pr_debug("Discarded rx packet: len %zd\n", sock_len);
		continue;
	}
	...
}
This makes it possible to trigger an infinite while..continue loop
through the cooperation of two VMs:
1) A malicious VM1 allocates 1-byte rx buffers and tries to slow down
the vhost process as much as possible, e.g. by using indirect
descriptors.
2) A malicious VM2 generates packets to VM1 as fast as possible.
Fix this by checking against the weight at the end of the RX and TX
loops. This also eliminates other similar cases:
- userspace consuming the packets in the meantime
- a theoretical TOCTOU attack where the guest moves the avail index
back and forth to hit the continue right after vhost sees newly added
buffers
This addresses CVE-2019-3900.
Fixes: d8316f3991d20 ("vhost: fix total length when packets are too short")
Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-05-17 12:29:50 +08:00
|
|
|
do {
|
|
|
|
sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
|
|
|
|
&busyloop_intr);
|
|
|
|
if (!sock_len)
|
|
|
|
break;
|
2010-07-27 23:52:21 +08:00
|
|
|
sock_len += sock_hlen;
|
|
|
|
vhost_len = sock_len + vhost_hlen;
|
2018-05-29 14:18:19 +08:00
|
|
|
headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
|
|
|
|
vhost_len, &in, vq_log, &log,
|
2011-01-17 16:11:08 +08:00
|
|
|
likely(mergeable) ? UIO_MAXIOV : 1);
|
2010-07-27 23:52:21 +08:00
|
|
|
/* On error, stop handling until the next kick. */
|
|
|
|
if (unlikely(headcount < 0))
|
2016-06-01 13:56:33 +08:00
|
|
|
goto out;
|
2010-07-27 23:52:21 +08:00
|
|
|
/* OK, now we need to know about added descriptors. */
|
|
|
|
if (!headcount) {
|
2018-07-03 15:31:34 +08:00
|
|
|
if (unlikely(busyloop_intr)) {
|
|
|
|
vhost_poll_queue(&vq->poll);
|
|
|
|
} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
|
2010-07-27 23:52:21 +08:00
|
|
|
/* They have slipped one in as we were
|
|
|
|
* doing that: check again. */
|
2011-05-20 07:10:54 +08:00
|
|
|
vhost_disable_notify(&net->dev, vq);
|
2010-07-27 23:52:21 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/* Nothing new? Wait for eventfd to tell us
|
|
|
|
* they refilled. */
|
2016-06-01 13:56:33 +08:00
|
|
|
goto out;
|
2010-07-27 23:52:21 +08:00
|
|
|
}
|
2018-07-03 15:31:34 +08:00
|
|
|
busyloop_intr = false;
|
2018-01-04 11:14:27 +08:00
|
|
|
if (nvq->rx_ring)
|
2017-12-01 18:10:36 +08:00
|
|
|
msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
|
|
|
|
/* On overrun, truncate and discard */
|
|
|
|
if (unlikely(headcount > UIO_MAXIOV)) {
|
|
|
|
iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
|
|
|
|
err = sock->ops->recvmsg(sock, &msg,
|
|
|
|
1, MSG_DONTWAIT | MSG_TRUNC);
|
|
|
|
pr_debug("Discarded rx packet: len %zd\n", sock_len);
|
|
|
|
continue;
|
|
|
|
}
|
2010-07-27 23:52:21 +08:00
|
|
|
/* We don't need to be notified again. */
|
2014-12-11 04:51:28 +08:00
|
|
|
iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
|
|
|
|
fixup = msg.msg_iter;
|
|
|
|
if (unlikely(vhost_hlen)) {
|
|
|
|
/* We will supply the header ourselves
|
|
|
|
* TODO: support TSO.
|
|
|
|
*/
|
|
|
|
iov_iter_advance(&msg.msg_iter, vhost_hlen);
|
|
|
|
}
|
2015-03-02 15:37:48 +08:00
|
|
|
err = sock->ops->recvmsg(sock, &msg,
|
2010-07-27 23:52:21 +08:00
|
|
|
sock_len, MSG_DONTWAIT | MSG_TRUNC);
|
|
|
|
/* Userspace might have consumed the packet meanwhile:
|
|
|
|
* it's not supposed to do this usually, but might be hard
|
|
|
|
* to prevent. Discard data we got (if any) and keep going. */
|
|
|
|
if (unlikely(err != sock_len)) {
|
|
|
|
pr_debug("Discarded rx packet: "
|
|
|
|
" len %d, expected %zd\n", err, sock_len);
|
|
|
|
vhost_discard_vq_desc(vq, headcount);
|
|
|
|
continue;
|
|
|
|
}
|
2014-12-11 04:51:28 +08:00
|
|
|
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
|
2015-02-25 22:19:28 +08:00
|
|
|
if (unlikely(vhost_hlen)) {
|
|
|
|
if (copy_to_iter(&hdr, sizeof(hdr),
|
|
|
|
&fixup) != sizeof(hdr)) {
|
|
|
|
vq_err(vq, "Unable to write vnet_hdr "
|
|
|
|
"at addr %p\n", vq->iov->iov_base);
|
2016-06-01 13:56:33 +08:00
|
|
|
goto out;
|
2015-02-25 22:19:28 +08:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* Header came from socket; we'll need to patch
|
|
|
|
* ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
|
|
|
|
*/
|
|
|
|
iov_iter_advance(&fixup, sizeof(hdr));
|
2010-07-27 23:52:21 +08:00
|
|
|
}
|
|
|
|
/* TODO: Should check and handle checksum. */
|
2015-02-03 17:07:06 +08:00
|
|
|
|
2015-02-15 16:35:17 +08:00
|
|
|
num_buffers = cpu_to_vhost16(vq, headcount);
|
2011-01-17 16:10:59 +08:00
|
|
|
if (likely(mergeable) &&
|
2015-02-25 22:20:01 +08:00
|
|
|
copy_to_iter(&num_buffers, sizeof num_buffers,
|
|
|
|
&fixup) != sizeof num_buffers) {
|
2010-07-27 23:52:21 +08:00
|
|
|
vq_err(vq, "Failed num_buffers write");
|
|
|
|
vhost_discard_vq_desc(vq, headcount);
|
2016-06-01 13:56:33 +08:00
|
|
|
goto out;
|
2010-07-27 23:52:21 +08:00
|
|
|
}
|
2018-05-29 14:18:19 +08:00
|
|
|
nvq->done_idx += headcount;
|
2018-07-20 08:15:20 +08:00
|
|
|
if (nvq->done_idx > VHOST_NET_BATCH)
|
2018-07-20 08:15:19 +08:00
|
|
|
vhost_net_signal_used(nvq);
|
2010-07-27 23:52:21 +08:00
|
|
|
if (unlikely(vq_log))
|
2019-01-16 16:54:42 +08:00
|
|
|
vhost_log_write(vq, vq_log, log, vhost_len,
|
|
|
|
vq->iov, in);
|
2010-07-27 23:52:21 +08:00
|
|
|
total_len += vhost_len;
|
2019-05-17 12:29:50 +08:00
|
|
|
} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
|
|
|
|
|
2018-07-03 15:31:33 +08:00
|
|
|
if (unlikely(busyloop_intr))
|
|
|
|
vhost_poll_queue(&vq->poll);
|
2019-05-17 12:29:50 +08:00
|
|
|
else if (!sock_len)
|
2018-07-03 15:31:33 +08:00
|
|
|
vhost_net_enable_vq(net, vq);
|
2013-05-07 14:54:33 +08:00
|
|
|
out:
|
2018-07-20 08:15:19 +08:00
|
|
|
vhost_net_signal_used(nvq);
|
2010-07-27 23:52:21 +08:00
|
|
|
mutex_unlock(&vq->mutex);
|
|
|
|
}
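The receive loop above is deliberately bounded: vhost_exceeds_weight() stops it once either the packet-count or the byte-count budget is spent, which is the fix for the infinite while..continue loop described in the commit message before the loop (CVE-2019-3900). A tiny user-space model of that bound, with stand-in limits in place of VHOST_NET_PKT_WEIGHT and VHOST_NET_WEIGHT:

#include <stdbool.h>
#include <stdio.h>

#define PKT_WEIGHT	256		/* stand-in for VHOST_NET_PKT_WEIGHT */
#define BYTE_WEIGHT	0x80000		/* stand-in for VHOST_NET_WEIGHT */

/* True once either the packet or the byte budget is exhausted. */
static bool exceeds_weight(int pkts, long long bytes)
{
	return pkts >= PKT_WEIGHT || bytes >= BYTE_WEIGHT;
}

int main(void)
{
	long long total_len = 0;
	int pkts = 0;

	do {
		/* Worst case from the CVE note: 1-byte packets forever. */
		total_len += 1;
	} while (!exceeds_weight(++pkts, total_len));

	printf("handler yields after %d packets, %lld bytes\n",
	       pkts, total_len);
	return 0;
}

Even with a peer feeding 1-byte packets as fast as it can, the handler yields after a bounded amount of work instead of monopolizing the vhost thread.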
|
|
|
|
|
2010-06-03 02:40:00 +08:00
|
|
|
static void handle_tx_kick(struct vhost_work *work)
|
2010-01-14 14:17:27 +08:00
|
|
|
{
|
2010-06-03 02:40:00 +08:00
|
|
|
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
|
|
|
|
poll.work);
|
|
|
|
struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
|
|
|
|
|
2010-01-14 14:17:27 +08:00
|
|
|
handle_tx(net);
|
|
|
|
}
|
|
|
|
|
2010-06-03 02:40:00 +08:00
|
|
|
static void handle_rx_kick(struct vhost_work *work)
|
2010-01-14 14:17:27 +08:00
|
|
|
{
|
2010-06-03 02:40:00 +08:00
|
|
|
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
|
|
|
|
poll.work);
|
|
|
|
struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
|
|
|
|
|
2010-01-14 14:17:27 +08:00
|
|
|
handle_rx(net);
|
|
|
|
}
|
|
|
|
|
2010-06-03 02:40:00 +08:00
|
|
|
static void handle_tx_net(struct vhost_work *work)
|
2010-01-14 14:17:27 +08:00
|
|
|
{
|
2010-06-03 02:40:00 +08:00
|
|
|
struct vhost_net *net = container_of(work, struct vhost_net,
|
|
|
|
poll[VHOST_NET_VQ_TX].work);
|
2010-01-14 14:17:27 +08:00
|
|
|
handle_tx(net);
|
|
|
|
}
|
|
|
|
|
2010-06-03 02:40:00 +08:00
|
|
|
static void handle_rx_net(struct vhost_work *work)
|
2010-01-14 14:17:27 +08:00
|
|
|
{
|
2010-06-03 02:40:00 +08:00
|
|
|
struct vhost_net *net = container_of(work, struct vhost_net,
|
|
|
|
poll[VHOST_NET_VQ_RX].work);
|
2010-01-14 14:17:27 +08:00
|
|
|
handle_rx(net);
|
|
|
|
}
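handle_tx_kick(), handle_rx_kick(), handle_tx_net() and handle_rx_net() are all thin wrappers: each recovers the enclosing object from the embedded vhost_work via container_of() and then calls the real handler. A self-contained user-space illustration of that idiom; struct work and struct net_ctx are made up for the example:

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work {
	int pending;
};

struct net_ctx {
	int id;
	struct work rx_work;	/* embedded, like poll[VHOST_NET_VQ_RX].work */
};

/* The callback only sees the work item, yet can reach its container. */
static void rx_handler(struct work *w)
{
	struct net_ctx *ctx = container_of(w, struct net_ctx, rx_work);

	printf("handling rx for ctx %d\n", ctx->id);
}

int main(void)
{
	struct net_ctx ctx = { .id = 42 };

	rx_handler(&ctx.rx_work);
	return 0;
}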
|
|
|
|
|
|
|
|
static int vhost_net_open(struct inode *inode, struct file *f)
|
|
|
|
{
|
2013-01-24 04:46:47 +08:00
|
|
|
struct vhost_net *n;
|
2010-06-03 02:40:00 +08:00
|
|
|
struct vhost_dev *dev;
|
2013-04-27 11:16:48 +08:00
|
|
|
struct vhost_virtqueue **vqs;
|
2018-01-04 11:14:27 +08:00
|
|
|
void **queue;
|
2018-09-12 11:17:09 +08:00
|
|
|
struct xdp_buff *xdp;
|
2013-12-07 04:13:03 +08:00
|
|
|
int i;
|
2010-06-03 02:40:00 +08:00
|
|
|
|
2017-07-13 05:36:45 +08:00
|
|
|
n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
|
2017-05-09 06:57:15 +08:00
|
|
|
if (!n)
|
|
|
|
return -ENOMEM;
|
2018-06-13 04:55:00 +08:00
|
|
|
vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
|
2013-04-27 11:16:48 +08:00
|
|
|
if (!vqs) {
|
2014-06-12 16:42:34 +08:00
|
|
|
kvfree(n);
|
2013-04-27 11:16:48 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2010-06-03 02:40:00 +08:00
|
|
|
|
2018-07-20 08:15:20 +08:00
|
|
|
queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
|
2017-05-17 12:14:45 +08:00
|
|
|
GFP_KERNEL);
|
|
|
|
if (!queue) {
|
|
|
|
kfree(vqs);
|
|
|
|
kvfree(n);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
|
|
|
|
|
2018-09-12 11:17:09 +08:00
|
|
|
xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
|
|
|
|
if (!xdp) {
|
|
|
|
kfree(vqs);
|
|
|
|
kvfree(n);
|
|
|
|
kfree(queue);
|
2018-09-20 18:01:59 +08:00
|
|
|
return -ENOMEM;
|
2018-09-12 11:17:09 +08:00
|
|
|
}
|
|
|
|
n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
|
|
|
|
|
2010-06-03 02:40:00 +08:00
|
|
|
dev = &n->dev;
|
2013-04-27 11:16:48 +08:00
|
|
|
vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
|
|
|
|
vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
|
|
|
|
n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
|
|
|
|
n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
|
2013-04-27 15:07:46 +08:00
|
|
|
for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
|
|
|
|
n->vqs[i].ubufs = NULL;
|
|
|
|
n->vqs[i].ubuf_info = NULL;
|
|
|
|
n->vqs[i].upend_idx = 0;
|
|
|
|
n->vqs[i].done_idx = 0;
|
2018-09-12 11:17:09 +08:00
|
|
|
n->vqs[i].batched_xdp = 0;
|
2013-04-28 20:51:40 +08:00
|
|
|
n->vqs[i].vhost_hlen = 0;
|
|
|
|
n->vqs[i].sock_hlen = 0;
|
2018-03-09 14:50:32 +08:00
|
|
|
n->vqs[i].rx_ring = NULL;
|
2017-05-17 12:14:45 +08:00
|
|
|
vhost_net_buf_init(&n->vqs[i].rxq);
|
2013-04-27 15:07:46 +08:00
|
|
|
}
|
2019-01-28 15:05:05 +08:00
|
|
|
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
|
2019-05-17 12:29:49 +08:00
|
|
|
UIO_MAXIOV + VHOST_NET_BATCH,
|
2019-05-17 12:29:50 +08:00
|
|
|
VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT);
|
2010-01-14 14:17:27 +08:00
|
|
|
|
2018-02-12 06:34:03 +08:00
|
|
|
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
|
|
|
|
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
|
2010-01-14 14:17:27 +08:00
|
|
|
|
|
|
|
f->private_data = n;
|
2018-11-15 17:43:09 +08:00
|
|
|
n->page_frag.page = NULL;
|
|
|
|
n->refcnt_bias = 0;
|
2010-01-14 14:17:27 +08:00

        return 0;
}
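This is the tail of the open path for /dev/vhost-net. As a rough illustration of how userspace reaches it, the following sketch (an assumption for illustration, not part of net.c) opens the character device and claims ownership with VHOST_SET_OWNER, the usual first step before rings and backends are configured:

/* Illustrative userspace sketch only -- not kernel code from this file.
 * Opens /dev/vhost-net and claims ownership of the device. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

int open_vhost_net(void)
{
        int vhost_fd = open("/dev/vhost-net", O_RDWR);

        if (vhost_fd < 0) {
                perror("open /dev/vhost-net");
                return -1;
        }
        /* Bind the device to the calling process before any other setup. */
        if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) {
                perror("VHOST_SET_OWNER");
                close(vhost_fd);
                return -1;
        }
        return vhost_fd;
}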

static struct socket *vhost_net_stop_vq(struct vhost_net *n,
                                        struct vhost_virtqueue *vq)
{
        struct socket *sock;
        struct vhost_net_virtqueue *nvq =
                container_of(vq, struct vhost_net_virtqueue, vq);

        mutex_lock(&vq->mutex);
        sock = vq->private_data;
        vhost_net_disable_vq(n, vq);
        vq->private_data = NULL;
        vhost_net_buf_unproduce(nvq);
        nvq->rx_ring = NULL;
        mutex_unlock(&vq->mutex);
        return sock;
}

static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
                           struct socket **rx_sock)
{
        *tx_sock = vhost_net_stop_vq(n, &n->vqs[VHOST_NET_VQ_TX].vq);
        *rx_sock = vhost_net_stop_vq(n, &n->vqs[VHOST_NET_VQ_RX].vq);
}

static void vhost_net_flush_vq(struct vhost_net *n, int index)
{
        vhost_poll_flush(n->poll + index);
        vhost_poll_flush(&n->vqs[index].vq.poll);
}

static void vhost_net_flush(struct vhost_net *n)
{
        vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
        vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
        if (n->vqs[VHOST_NET_VQ_TX].ubufs) {
                mutex_lock(&n->vqs[VHOST_NET_VQ_TX].vq.mutex);
                n->tx_flush = true;
                mutex_unlock(&n->vqs[VHOST_NET_VQ_TX].vq.mutex);
                /* Wait for all lower device DMAs done. */
                vhost_net_ubuf_put_and_wait(n->vqs[VHOST_NET_VQ_TX].ubufs);
                mutex_lock(&n->vqs[VHOST_NET_VQ_TX].vq.mutex);
                n->tx_flush = false;
                atomic_set(&n->vqs[VHOST_NET_VQ_TX].ubufs->refcount, 1);
                mutex_unlock(&n->vqs[VHOST_NET_VQ_TX].vq.mutex);
        }
}

static int vhost_net_release(struct inode *inode, struct file *f)
{
        struct vhost_net *n = f->private_data;
        struct socket *tx_sock;
        struct socket *rx_sock;

        vhost_net_stop(n, &tx_sock, &rx_sock);
        vhost_net_flush(n);
        vhost_dev_stop(&n->dev);
        vhost_dev_cleanup(&n->dev);
        vhost_net_vq_reset(n);
        if (tx_sock)
                sockfd_put(tx_sock);
        if (rx_sock)
                sockfd_put(rx_sock);
        /* Make sure no callbacks are outstanding */
        synchronize_rcu();
        /* We do an extra flush before freeing memory,
         * since jobs can re-queue themselves. */
        vhost_net_flush(n);
        kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
        kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
        kfree(n->dev.vqs);
        if (n->page_frag.page)
                __page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
        kvfree(n);
        return 0;
}

static struct socket *get_raw_socket(int fd)
{
        struct {
                struct sockaddr_ll sa;
                char buf[MAX_ADDR_LEN];
        } uaddr;
        int r;
        struct socket *sock = sockfd_lookup(fd, &r);

        if (!sock)
                return ERR_PTR(-ENOTSOCK);

        /* Parameter checking */
        if (sock->sk->sk_type != SOCK_RAW) {
                r = -ESOCKTNOSUPPORT;
                goto err;
        }

        r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa, 0);
        if (r < 0)
                goto err;

        if (uaddr.sa.sll_family != AF_PACKET) {
                r = -EPFNOSUPPORT;
                goto err;
        }
        return sock;
err:
        sockfd_put(sock);
        return ERR_PTR(r);
}
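get_raw_socket() only accepts an AF_PACKET SOCK_RAW socket. For reference, a hedged userspace sketch (an illustration, not part of net.c) of a backend fd that would pass these checks; the interface name is an example:

/* Illustrative userspace sketch only: create a raw packet socket bound to
 * a network interface, suitable as a vhost-net backend fd. */
#include <arpa/inet.h>
#include <net/if.h>
#include <netpacket/packet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if_ether.h>

int open_raw_backend(const char *ifname)  /* e.g. "eth0" (example only) */
{
        struct sockaddr_ll sll;
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        if (fd < 0) {
                perror("socket(AF_PACKET, SOCK_RAW)");
                return -1;
        }
        memset(&sll, 0, sizeof(sll));
        sll.sll_family   = AF_PACKET;   /* matches the sll_family check above */
        sll.sll_protocol = htons(ETH_P_ALL);
        sll.sll_ifindex  = if_nametoindex(ifname);
        if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                perror("bind");
                close(fd);
                return -1;
        }
        return fd;
}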

static struct ptr_ring *get_tap_ptr_ring(int fd)
{
        struct ptr_ring *ring;
        struct file *file = fget(fd);

        if (!file)
                return NULL;
        ring = tun_get_tx_ring(file);
        if (!IS_ERR(ring))
                goto out;
        ring = tap_get_ptr_ring(file);
        if (!IS_ERR(ring))
                goto out;
        ring = NULL;
out:
        fput(file);
        return ring;
}

static struct socket *get_tap_socket(int fd)
{
        struct file *file = fget(fd);
        struct socket *sock;

        if (!file)
                return ERR_PTR(-EBADF);
        sock = tun_get_socket(file);
        if (!IS_ERR(sock))
                return sock;
        sock = tap_get_socket(file);
        if (IS_ERR(sock))
                fput(file);
        return sock;
}
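get_tap_socket() accepts an fd belonging to either a tun/tap or a macvtap device. A hedged sketch (illustration only) of how such a tap fd is usually created before being handed to vhost-net; the device name is an example, and IFF_VNET_HDR is typically set so the tap carries a virtio-net header:

/* Illustrative userspace sketch only: open a tap device whose fd can be
 * handed to vhost-net as a backend. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int open_tap_backend(const char *name)    /* e.g. "tap0" (example only) */
{
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0) {
                perror("open /dev/net/tun");
                return -1;
        }
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                perror("TUNSETIFF");
                close(fd);
                return -1;
        }
        return fd;
}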

static struct socket *get_socket(int fd)
{
        struct socket *sock;

        /* special case to disable backend */
        if (fd == -1)
                return NULL;
        sock = get_raw_socket(fd);
        if (!IS_ERR(sock))
                return sock;
        sock = get_tap_socket(fd);
        if (!IS_ERR(sock))
                return sock;
        return ERR_PTR(-ENOTSOCK);
}
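get_socket() is driven by the VHOST_NET_SET_BACKEND ioctl handled by vhost_net_set_backend() below. A hedged userspace sketch (illustration only) of attaching a backend fd to one of the virtqueues, and of the fd == -1 special case noted above, which detaches the backend:

/* Illustrative userspace sketch only: attach (or detach with fd == -1) a
 * tap/packet-socket backend to one virtqueue of an open vhost-net device. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int set_net_backend(int vhost_fd, unsigned int index, int backend_fd)
{
        struct vhost_vring_file backend = {
                .index = index,       /* which virtqueue to bind */
                .fd    = backend_fd,  /* tap or packet socket fd, -1 to detach */
        };

        if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend) < 0) {
                perror("VHOST_NET_SET_BACKEND");
                return -1;
        }
        return 0;
}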

static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
{
        struct socket *sock, *oldsock;
        struct vhost_virtqueue *vq;
        struct vhost_net_virtqueue *nvq;
        struct vhost_net_ubuf_ref *ubufs, *oldubufs = NULL;
|
|
|
int r;
|
|
|
|
|
|
|
|
mutex_lock(&n->dev.mutex);
|
|
|
|
r = vhost_dev_check_owner(&n->dev);
|
|
|
|
if (r)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
if (index >= VHOST_NET_VQ_MAX) {
|
|
|
|
r = -ENOBUFS;
|
|
|
|
goto err;
|
|
|
|
}
|
2013-04-27 11:16:48 +08:00
|
|
|
vq = &n->vqs[index].vq;
|
2013-04-27 15:07:46 +08:00
|
|
|
nvq = &n->vqs[index];
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
mutex_lock(&vq->mutex);
|
|
|
|
|
|
|
|
/* Verify that ring has been setup correctly. */
|
|
|
|
if (!vhost_vq_access_ok(vq)) {
|
|
|
|
r = -EFAULT;
|
2010-03-05 05:10:14 +08:00
|
|
|
goto err_vq;
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
}
|
|
|
|
sock = get_socket(fd);
|
|
|
|
if (IS_ERR(sock)) {
|
|
|
|
r = PTR_ERR(sock);
|
2010-03-05 05:10:14 +08:00
|
|
|
goto err_vq;
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* start polling new socket */
|
2013-05-07 14:54:36 +08:00
|
|
|
oldsock = vq->private_data;
|
2010-07-21 09:25:24 +08:00
|
|
|
if (sock != oldsock) {
|
2013-05-06 16:38:24 +08:00
|
|
|
ubufs = vhost_net_ubuf_alloc(vq,
|
|
|
|
sock && vhost_sock_zcopy(sock));
|
2011-07-18 11:48:46 +08:00
|
|
|
if (IS_ERR(ubufs)) {
|
|
|
|
r = PTR_ERR(ubufs);
|
|
|
|
goto err_ubufs;
|
|
|
|
}
|
2013-01-28 09:05:17 +08:00
|
|
|
|
2011-03-01 19:36:37 +08:00
|
|
|
vhost_net_disable_vq(n, vq);
|
2013-05-07 14:54:36 +08:00
|
|
|
vq->private_data = sock;
|
2017-05-17 12:14:45 +08:00
|
|
|
vhost_net_buf_unproduce(nvq);
|
2016-02-16 22:59:44 +08:00
|
|
|
r = vhost_vq_init_access(vq);
|
2011-06-21 18:04:27 +08:00
|
|
|
if (r)
|
2013-01-28 09:05:17 +08:00
|
|
|
goto err_used;
|
2013-01-28 09:05:18 +08:00
|
|
|
r = vhost_net_enable_vq(n, vq);
|
|
|
|
if (r)
|
|
|
|
goto err_used;
|
2018-03-09 14:50:33 +08:00
|
|
|
if (index == VHOST_NET_VQ_RX)
|
|
|
|
nvq->rx_ring = get_tap_ptr_ring(fd);
|
2013-01-28 09:05:17 +08:00
|
|
|
|
2013-04-27 15:07:46 +08:00
|
|
|
oldubufs = nvq->ubufs;
|
|
|
|
nvq->ubufs = ubufs;
|
2012-12-03 15:31:51 +08:00
|
|
|
|
|
|
|
n->tx_packets = 0;
|
|
|
|
n->tx_zcopy_err = 0;
|
2012-12-04 06:17:14 +08:00
|
|
|
n->tx_flush = false;
|
2010-03-05 05:10:14 +08:00
|
|
|
}
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
|
2010-07-15 20:19:12 +08:00
|
|
|
mutex_unlock(&vq->mutex);
|
|
|
|
|
2011-07-20 18:41:31 +08:00
|
|
|
if (oldubufs) {
|
2013-06-25 22:29:46 +08:00
|
|
|
vhost_net_ubuf_put_wait_and_free(oldubufs);
|
2011-07-20 18:41:31 +08:00
|
|
|
mutex_lock(&vq->mutex);
|
2012-11-01 17:16:51 +08:00
|
|
|
vhost_zerocopy_signal_used(n, vq);
|
2011-07-20 18:41:31 +08:00
|
|
|
mutex_unlock(&vq->mutex);
|
|
|
|
}
|
2011-07-18 11:48:46 +08:00
|
|
|
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
if (oldsock) {
|
|
|
|
vhost_net_flush_vq(n, index);
|
2014-03-06 09:39:00 +08:00
|
|
|
sockfd_put(oldsock);
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
}
|
2010-03-05 05:10:14 +08:00
|
|
|
|
2010-07-15 20:19:12 +08:00
|
|
|
mutex_unlock(&n->dev.mutex);
|
|
|
|
return 0;
|
|
|
|
|
2013-01-28 09:05:17 +08:00
|
|
|
err_used:
|
2013-05-07 14:54:36 +08:00
|
|
|
vq->private_data = oldsock;
|
2013-01-28 09:05:17 +08:00
|
|
|
vhost_net_enable_vq(n, vq);
|
|
|
|
if (ubufs)
|
2013-06-25 22:29:46 +08:00
|
|
|
vhost_net_ubuf_put_wait_and_free(ubufs);
|
2011-07-18 11:48:46 +08:00
|
|
|
err_ubufs:
|
2018-06-21 13:11:31 +08:00
|
|
|
if (sock)
|
|
|
|
sockfd_put(sock);
|
2010-03-05 05:10:14 +08:00
|
|
|
err_vq:
|
|
|
|
mutex_unlock(&vq->mutex);
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
err:
|
|
|
|
mutex_unlock(&n->dev.mutex);
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
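
/* Detach the current owner: stop and flush both virtqueues, reset the
 * device state with a freshly allocated memory table, and release the old
 * backend sockets outside the device mutex.
 */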
static long vhost_net_reset_owner(struct vhost_net *n)
{
        struct socket *tx_sock = NULL;
        struct socket *rx_sock = NULL;
        long err;
        struct vhost_umem *umem;

        mutex_lock(&n->dev.mutex);
        err = vhost_dev_check_owner(&n->dev);
        if (err)
                goto done;
        umem = vhost_dev_reset_owner_prepare();
        if (!umem) {
                err = -ENOMEM;
                goto done;
        }
        vhost_net_stop(n, &tx_sock, &rx_sock);
        vhost_net_flush(n);
        vhost_dev_stop(&n->dev);
        vhost_dev_reset_owner(&n->dev, umem);
        vhost_net_vq_reset(n);
done:
        mutex_unlock(&n->dev.mutex);
        if (tx_sock)
                sockfd_put(tx_sock);
        if (rx_sock)
                sockfd_put(rx_sock);
        return err;
}
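
/* Record the acked backend features on every virtqueue, taking each
 * virtqueue mutex in turn under the device mutex.
 */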
static int vhost_net_set_backend_features(struct vhost_net *n, u64 features)
{
        int i;

        mutex_lock(&n->dev.mutex);
        for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
                mutex_lock(&n->vqs[i].vq.mutex);
                n->vqs[i].vq.acked_backend_features = features;
                mutex_unlock(&n->vqs[i].vq.mutex);
        }
        mutex_unlock(&n->dev.mutex);

        return 0;
}
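
/* Pick the vnet header length implied by the negotiated features (the
 * mergeable-rxbuf/virtio-1 header vs. the legacy one), decide whether
 * vhost or the backend socket supplies that header, and store the result
 * in every virtqueue.  Returns -EFAULT if dirty logging or the device
 * IOTLB cannot be set up for the requested features.
 */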
static int vhost_net_set_features(struct vhost_net *n, u64 features)
|
|
|
|
{
|
2010-07-27 23:52:21 +08:00
|
|
|
size_t vhost_hlen, sock_hlen, hdr_len;
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
int i;
|
2010-07-27 23:52:21 +08:00
|
|
|
|
2014-10-24 19:23:52 +08:00
|
|
|
hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
|
|
|
|
(1ULL << VIRTIO_F_VERSION_1))) ?
|
2010-07-27 23:52:21 +08:00
|
|
|
sizeof(struct virtio_net_hdr_mrg_rxbuf) :
|
|
|
|
sizeof(struct virtio_net_hdr);
|
|
|
|
if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
|
|
|
|
/* vhost provides vnet_hdr */
|
|
|
|
vhost_hlen = hdr_len;
|
|
|
|
sock_hlen = 0;
|
|
|
|
} else {
|
|
|
|
/* socket provides vnet_hdr */
|
|
|
|
vhost_hlen = 0;
|
|
|
|
sock_hlen = hdr_len;
|
|
|
|
}
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
mutex_lock(&n->dev.mutex);
|
|
|
|
if ((features & (1 << VHOST_F_LOG_ALL)) &&
|
2016-06-23 14:04:32 +08:00
|
|
|
!vhost_log_access_ok(&n->dev))
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
if ((features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))) {
|
|
|
|
if (vhost_init_device_iotlb(&n->dev, true))
|
|
|
|
goto out_unlock;
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
}
|
2016-06-23 14:04:32 +08:00
|
|
|
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
|
2013-04-27 11:16:48 +08:00
|
|
|
mutex_lock(&n->vqs[i].vq.mutex);
|
2014-06-05 20:20:23 +08:00
|
|
|
n->vqs[i].vq.acked_features = features;
|
2013-04-28 20:51:40 +08:00
|
|
|
n->vqs[i].vhost_hlen = vhost_hlen;
|
|
|
|
n->vqs[i].sock_hlen = sock_hlen;
|
2013-04-27 11:16:48 +08:00
|
|
|
mutex_unlock(&n->vqs[i].vq.mutex);
|
vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-14 14:17:27 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&n->dev.mutex);
|
|
|
|
return 0;
|
2016-06-23 14:04:32 +08:00
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&n->dev.mutex);
|
|
|
|
return -EFAULT;
|
}

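/*
 * VHOST_SET_OWNER handler: bind the device to the calling process. It
 * fails with -EBUSY if an owner is already attached; otherwise the
 * zero-copy ubuf state is set up and ownership is handed to the vhost
 * core via vhost_dev_set_owner(). On failure the ubuf state is released
 * again, and pending work is flushed before the device mutex is dropped.
 */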
static long vhost_net_set_owner(struct vhost_net *n)
{
	int r;

	mutex_lock(&n->dev.mutex);
	if (vhost_dev_has_owner(&n->dev)) {
		r = -EBUSY;
		goto out;
	}
	r = vhost_net_set_ubuf_info(n);
	if (r)
		goto out;
	r = vhost_dev_set_owner(&n->dev);
	if (r)
		vhost_net_clear_ubuf_info(n);
	vhost_net_flush(n);
out:
	mutex_unlock(&n->dev.mutex);
	return r;
}

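/*
 * ioctl entry point for /dev/vhost-net. Net-specific requests (backend,
 * features, owner handling) are decoded here; everything else falls
 * through to the generic vhost device/vring ioctls.
 *
 * A minimal userspace sketch of the usual setup order, for illustration
 * only (assumes a tap/socket fd prepared elsewhere, error handling and
 * memory table/vring setup via the generic ioctls omitted):
 *
 *	int vhost = open("/dev/vhost-net", O_RDWR);
 *	u64 features;
 *	ioctl(vhost, VHOST_SET_OWNER, NULL);
 *	ioctl(vhost, VHOST_GET_FEATURES, &features);
 *	ioctl(vhost, VHOST_SET_FEATURES, &features);
 *	struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
 *	ioctl(vhost, VHOST_NET_SET_BACKEND, &backend);
 */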
static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
			    unsigned long arg)
{
	struct vhost_net *n = f->private_data;
	void __user *argp = (void __user *)arg;
	u64 __user *featurep = argp;
	struct vhost_vring_file backend;
	u64 features;
	int r;

	switch (ioctl) {
	case VHOST_NET_SET_BACKEND:
		if (copy_from_user(&backend, argp, sizeof backend))
			return -EFAULT;
		return vhost_net_set_backend(n, backend.index, backend.fd);
	case VHOST_GET_FEATURES:
		features = VHOST_NET_FEATURES;
		if (copy_to_user(featurep, &features, sizeof features))
			return -EFAULT;
		return 0;
	case VHOST_SET_FEATURES:
		if (copy_from_user(&features, featurep, sizeof features))
			return -EFAULT;
		if (features & ~VHOST_NET_FEATURES)
			return -EOPNOTSUPP;
		return vhost_net_set_features(n, features);
	case VHOST_GET_BACKEND_FEATURES:
		features = VHOST_NET_BACKEND_FEATURES;
		if (copy_to_user(featurep, &features, sizeof(features)))
			return -EFAULT;
		return 0;
	case VHOST_SET_BACKEND_FEATURES:
		if (copy_from_user(&features, featurep, sizeof(features)))
			return -EFAULT;
		if (features & ~VHOST_NET_BACKEND_FEATURES)
			return -EOPNOTSUPP;
		return vhost_net_set_backend_features(n, features);
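	/*
	 * Owner handling: VHOST_RESET_OWNER detaches the device from its
	 * current owner and resets queue state; VHOST_SET_OWNER attaches
	 * it to the calling process (see vhost_net_set_owner() above).
	 */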
	case VHOST_RESET_OWNER:
		return vhost_net_reset_owner(n);
	case VHOST_SET_OWNER:
		return vhost_net_set_owner(n);
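	/*
	 * Everything else is a generic vhost request: try the device-level
	 * ioctls first and fall back to the vring ioctls on -ENOIOCTLCMD,
	 * flushing outstanding work when the device-level ioctl handled
	 * the request.
	 */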
	default:
		mutex_lock(&n->dev.mutex);
		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
		if (r == -ENOIOCTLCMD)
			r = vhost_vring_ioctl(&n->dev, ioctl, argp);
		else
			vhost_net_flush(n);
		mutex_unlock(&n->dev.mutex);
		return r;
	}
}

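/*
 * read/write/poll on the char device are forwarded to the vhost core's
 * IOTLB message channel: userspace reads IOTLB miss requests from the
 * device and writes IOTLB update/invalidate messages back.
 */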
static ssize_t vhost_net_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct file *file = iocb->ki_filp;
	struct vhost_net *n = file->private_data;
	struct vhost_dev *dev = &n->dev;
	int noblock = file->f_flags & O_NONBLOCK;

	return vhost_chr_read_iter(dev, to, noblock);
}

static ssize_t vhost_net_chr_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct vhost_net *n = file->private_data;
	struct vhost_dev *dev = &n->dev;

	return vhost_chr_write_iter(dev, from);
}

static __poll_t vhost_net_chr_poll(struct file *file, poll_table *wait)
{
	struct vhost_net *n = file->private_data;
	struct vhost_dev *dev = &n->dev;

	return vhost_chr_poll(file, dev, wait);
}

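/*
 * File operations for /dev/vhost-net. The fd is not seekable; ioctl is
 * the main control path, while read/write carry the IOTLB messages
 * handled by the wrappers above.
 */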
static const struct file_operations vhost_net_fops = {
	.owner = THIS_MODULE,
	.release = vhost_net_release,
	.read_iter = vhost_net_chr_read_iter,
	.write_iter = vhost_net_chr_write_iter,
	.poll = vhost_net_chr_poll,
	.unlocked_ioctl = vhost_net_ioctl,
	.compat_ioctl = compat_ptr_ioctl,
	.open = vhost_net_open,
	.llseek = noop_llseek,
};

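/*
 * Register as a misc character device: /dev/vhost-net gets the fixed
 * minor VHOST_NET_MINOR, and the module aliases at the end of the file
 * allow the module to be autoloaded when the node is first opened.
 */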
static struct miscdevice vhost_net_misc = {
	.minor = VHOST_NET_MINOR,
	.name = "vhost-net",
	.fops = &vhost_net_fops,
};

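/*
 * Module init: optionally enable experimental zero-copy TX (as requested
 * by the experimental_zcopytx module parameter) before registering the
 * misc device.
 */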
static int vhost_net_init(void)
{
	if (experimental_zcopytx)
		vhost_net_enable_zcopy(VHOST_NET_VQ_TX);
	return misc_register(&vhost_net_misc);
}
module_init(vhost_net_init);

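/* Module exit: remove the /dev/vhost-net misc device. */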
static void vhost_net_exit(void)
{
	misc_deregister(&vhost_net_misc);
}
module_exit(vhost_net_exit);

MODULE_VERSION("0.0.1");
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Michael S. Tsirkin");
MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
MODULE_ALIAS("devname:vhost-net");