Commit Graph

206 Commits

Author SHA1 Message Date
Linus Torvalds 70e71ca0af Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

 2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers.  Thanks to Al Viro
    and Herbert Xu.

 3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

 4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

 5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

 6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

 7) Add a variable of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

 8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

 9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets.  From Alexei
    Starovoitov.

10) Support TSO/LSO in sunvnet driver, from David L Stevens.

11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

12) Remote checksum offload, from Tom Herbert.

13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

14) Add MPLS support to openvswitch, from Simon Horman.

15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet.  This tries to resolve the conflicting goals between the
    desired handling of bulk vs.  RPC-like traffic.

17) Allow userspace to ask for the CPU upon what a packet was
    received/steered, via SO_INCOMING_CPU.  From Eric Dumazet.

18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

21) Add VLAN packet scheduler action, from Jiri Pirko.

22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
  Fix race condition between vxlan_sock_add and vxlan_sock_release
  net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
  net/mlx4: Add support for A0 steering
  net/mlx4: Refactor QUERY_PORT
  net/mlx4_core: Add explicit error message when rule doesn't meet configuration
  net/mlx4: Add A0 hybrid steering
  net/mlx4: Add mlx4_bitmap zone allocator
  net/mlx4: Add a check if there are too many reserved QPs
  net/mlx4: Change QP allocation scheme
  net/mlx4_core: Use tasklet for user-space CQ completion events
  net/mlx4_core: Mask out host side virtualization features for guests
  net/mlx4_en: Set csum level for encapsulated packets
  be2net: Export tunnel offloads only when a VxLAN tunnel is created
  gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
  cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
  net: fec: only enable mdio interrupt before phy device link up
  net: fec: clear all interrupt events to support i.MX6SX
  net: fec: reset fep link status in suspend function
  net: sock: fix access via invalid file descriptor
  net: introduce helper macro for_each_cmsghdr
  ...
2014-12-11 14:27:06 -08:00
Al Viro c0371da604 put iov_iter into msghdr
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09 16:29:03 -05:00
Michael S. Tsirkin 41e3e42108 vhost/net: enable virtio 1.0
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-12-09 12:05:30 +02:00
Michael S. Tsirkin e4fca7d6ff vhost/net: larger header for virtio 1.0
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2014-12-09 12:05:30 +02:00
Michael S. Tsirkin 8b38694a2d vhost/net: virtio 1.0 byte swap
I had to add an explicit tag to suppress compiler warning:
gcc isn't smart enough to notice that
len is always initialized since function is called with size > 0.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2014-12-09 12:05:29 +02:00
Michael S. Tsirkin bf99573496 vhost/net: force len for TX to host endian
vhost/net keeps a copy of the used ring in host memory but (ab)uses
the length field for internal house-keeping. This works because the
length in the used ring for tx is always 0. In order to suppress sparse
warnings, we force native endianness here.
Note that these values are never exposed to guests.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
2014-12-09 12:05:29 +02:00
Romain Francoise d04257b07f vhost-net: don't open-code kvfree
Commit 23cc5a991c ("vhost-net: extend device allocation to vmalloc")
added another open-coded version of kvfree (which is available since
v3.15-rc5), nuke it.

Signed-off-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-23 09:22:48 +03:00
Michael S. Tsirkin 47283bef7e vhost: move memory pointer to VQs
commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc
    vhost: replace rcu with mutex
replaced rcu sync for memory accesses with VQ mutex locl/unlock.
This is correct since all accesses are under VQ mutex, but incomplete:
we still do useless rcu lock/unlock operations, someone might copy this
code into some other context where this won't be right.
This use of RCU is also non standard and hard to understand.
Let's copy the pointer to each VQ structure, this way
the access rules become straight-forward, and there's
no need for RCU anymore.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-09 16:21:07 +03:00
Michael S. Tsirkin ea16c51433 vhost: move acked_features to VQs
Refactor code to make sure features are only accessed
under VQ mutex. This makes everything simpler, no need
for RCU here anymore.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-09 16:21:06 +03:00
Michael S. Tsirkin 23cc5a991c vhost-net: extend device allocation to vmalloc
Michael Mueller provided a patch to reduce the size of
vhost-net structure as some allocations could fail under
memory pressure/fragmentation. We are still left with
high order allocations though.

This patch is handling the problem at the core level, allowing
vhost structures to use vmalloc() if kmalloc() failed.

As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.

People are still looking at cleaner ways to handle the problem
at the API level, probably passing in multiple iovecs.
This hack seems consistent with approaches
taken since then by drivers/vhost/scsi.c and net/core/dev.c

Based on patch by Romain Francoise.

Cc: Michael Mueller <mimu@linux.vnet.ibm.com>
Signed-off-by: Romain Francoise <romain@orebokech.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-09 16:21:05 +03:00
Al Viro 09aaacf02a vhost: don't open-code sockfd_put()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:10 -04:00
Michael S. Tsirkin a39ee449f9 vhost: validate vhost_get_vq_desc return value
vhost fails to validate negative error code
from vhost_get_vq_desc causing
a crash: we are using -EFAULT which is 0xfffffff2
as vector size, which exceeds the allocated size.

The code in question was introduced in commit
8dd014adfe
    vhost-net: mergeable buffers support

CVE-2014-0055

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:10:35 -04:00
Michael S. Tsirkin d8316f3991 vhost: fix total length when packets are too short
When mergeable buffers are disabled, and the
incoming packet is too large for the rx buffer,
get_rx_bufs returns success.

This was intentional in order for make recvmsg
truncate the packet and then handle_rx would
detect err != sock_len and drop it.

Unfortunately we pass the original sock_len to
recvmsg - which means we use parts of iov not fully
validated.

Fix this up by detecting this overrun and doing packet drop
immediately.

CVE-2014-0077

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:07:31 -04:00
Michael S. Tsirkin b0c057ca7e vhost: fix a theoretical race in device cleanup
vhost_zerocopy_callback accesses VQ right after it drops a ubuf
reference.  In theory, this could race with device removal which waits
on the ubuf kref, and crash on use after free.

Do all accesses within rcu read side critical section, and synchronize
on release.

Since callbacks are always invoked from bh, synchronize_rcu_bh seems
enough and will help release complete a bit faster.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:47:30 -05:00
Michael S. Tsirkin 0ad8b480d6 vhost: fix ref cnt checking deadlock
vhost checked the counter within the refcnt before decrementing.  It
really wanted to know that it is the one that has the last reference, as
a way to batch freeing resources a bit more efficiently.

Note: we only let refcount go to 0 on device release.

This works well but we now access the ref counter twice so there's a
race: all users might see a high count and decide to defer freeing
resources.
In the end no one initiates freeing resources until the last reference
is gone (which is on VM shotdown so might happen after a looooong time).

Let's do what we probably should have done straight away:
switch from kref to plain atomic, documenting the
semantics, return the refcount value atomically after decrement,
then use that to avoid the deadlock.

Reported-by: Qin Chuanyu <qinchuanyu@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:47:30 -05:00
Zhi Yong Wu 59566b6e8c vhost: remove the dead branch
Since vhost_dev_init() forever return 0, some branches are never run,
therefore need to be removed.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 15:22:05 -05:00
Jason Wang f7c6be404d vhost_net: correctly limit the max pending buffers
As Michael point out, We used to limit the max pending DMAs to get better cache
utilization. But it was not done correctly since it was one done when there's no
new buffers submitted from guest. Guest can easily exceeds the limitation by
keeping sending packets.

So this patch moves the check into main loop. Tests shows about 5%-10%
improvement on per cpu throughput for guest tx.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 22:46:58 -04:00
Jason Wang 19c73b3e08 vhost_net: poll vhost queue after marking DMA is done
We used to poll vhost queue before making DMA is done, this is racy if vhost
thread were waked up before marking DMA is done which can result the signal to
be missed. Fix this by always polling the vhost thread before DMA is done.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 22:46:57 -04:00
Jason Wang ce21a02913 vhost_net: determine whether or not to use zerocopy at one time
Currently, even if the packet length is smaller than VHOST_GOODCOPY_LEN, if
upend_idx != done_idx we still set zcopy_used to true and rollback this choice
later. This could be avoided by determining zerocopy once by checking all
conditions at one time before.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 22:46:57 -04:00
Jason Wang c92112aed3 vhost_net: use vhost_add_used_and_signal_n() in vhost_zerocopy_signal_used()
We tend to batch the used adding and signaling in vhost_zerocopy_callback()
which may result more than 100 used buffers to be updated in
vhost_zerocopy_signal_used() in some cases. So switch to use
vhost_add_used_and_signal_n() to avoid multiple calls to
vhost_add_used_and_signal(). Which means much less times of used index
updating and memory barriers.

2% performance improvement were seen on netperf TCP_RR test.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 22:46:57 -04:00
Jason Wang 094afe7d55 vhost_net: make vhost_zerocopy_signal_used() return void
None of its caller use its return value, so let it return void.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 22:46:57 -04:00
Asias He 22fa90c7fb vhost: Remove custom vhost rcu usage
Now, vq->private_data is always accessed under vq mutex. No need to play
the vhost rcu trick.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-07-11 15:38:40 +03:00
Asias He 2e26af79b7 vhost-net: Always access vq->private_data under vq mutex
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-07-11 15:36:45 +03:00
Asias He 0a1febf7ba vhost: Make local function static
$ make C=1 M=drivers/vhost

drivers/vhost/net.c:168:5: warning: symbol 'vhost_net_set_ubuf_info' was not declared. Should it be static?
drivers/vhost/net.c:194:6: warning: symbol 'vhost_net_vq_reset' was not declared. Should it be static?
drivers/vhost/scsi.c:219:6: warning: symbol 'tcm_vhost_done_inflight' was not declared. Should it be static?

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-07-07 17:34:07 +03:00
Michael S. Tsirkin c38e39c378 vhost-net: fix use-after-free in vhost_net_flush
vhost_net_ubuf_put_and_wait has a confusing name:
it will actually also free it's argument.
Thus since commit 1280c27f8e
    "vhost-net: flush outstanding DMAs on memory change"
vhost_net_flush tries to use the argument after passing it
to vhost_net_ubuf_put_and_wait, this results
in use after free.
To fix, don't free the argument in vhost_net_ubuf_put_and_wait,
add an new API for callers that want to free ubufs.

Acked-by: Asias He <asias@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-07-07 14:06:22 +03:00
Michael S. Tsirkin 288cfe78c8 vhost: fix ubuf_info cleanup
vhost_net_clear_ubuf_info didn't clear ubuf_info
after kfree, this could trigger double free.
Fix this and simplify this code to make it more robust: make sure
ubuf info is always freed through vhost_net_clear_ubuf_info.

Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:46:21 -07:00
Michael S. Tsirkin 05c0535194 vhost: check owner before we overwrite ubuf_info
If device has an owner, we shouldn't touch ubuf_info
since it might be in use.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:46:21 -07:00
Jason Wang 4364d5f96e vhost_net: clear msg.control for non-zerocopy case during tx
When we decide not use zero-copy, msg.control should be set to NULL otherwise
macvtap/tap may set zerocopy callbacks which may decrease the kref of ubufs
wrongly.

Bug were introduced by commit cedb9bdce0
(vhost-net: skip head management if no outstanding).

This solves the following warnings:

WARNING: at include/linux/kref.h:47 handle_tx+0x477/0x4b0 [vhost_net]()
Modules linked in: vhost_net macvtap macvlan tun nfsd exportfs bridge stp llc openvswitch kvm_amd kvm bnx2 megaraid_sas [last unloaded: tun]
CPU: 5 PID: 8670 Comm: vhost-8668 Not tainted 3.10.0-rc2+ #1566
Hardware name: Dell Inc. PowerEdge R715/00XHKG, BIOS 1.5.2 04/19/2011
ffffffffa0198323 ffff88007c9ebd08 ffffffff81796b73 ffff88007c9ebd48
ffffffff8103d66b 000000007b773e20 ffff8800779f0000 ffff8800779f43f0
ffff8800779f8418 000000000000015c 0000000000000062 ffff88007c9ebd58
Call Trace:
[<ffffffff81796b73>] dump_stack+0x19/0x1e
[<ffffffff8103d66b>] warn_slowpath_common+0x6b/0xa0
[<ffffffff8103d6b5>] warn_slowpath_null+0x15/0x20
[<ffffffffa0197627>] handle_tx+0x477/0x4b0 [vhost_net]
[<ffffffffa0197690>] handle_tx_kick+0x10/0x20 [vhost_net]
[<ffffffffa019541e>] vhost_worker+0xfe/0x1a0 [vhost_net]
[<ffffffffa0195320>] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net]
[<ffffffffa0195320>] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net]
[<ffffffff81061f46>] kthread+0xc6/0xd0
[<ffffffff81061e80>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff817a1aec>] ret_from_fork+0x7c/0xb0
[<ffffffff81061e80>] ? kthread_freezable_should_stop+0x70/0x70

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 14:31:45 -07:00
Asias He fe729a57c8 vhost-net: Cleanup vhost_ubuf and vhost_zcopy
- Rename vhost_ubuf to vhost_net_ubuf
- Rename vhost_zcopy_mask to vhost_net_zcopy_mask
- Make funcs static

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-06 13:25:47 +03:00
Asias He 8570a6e72c vhost: Move VHOST_NET_FEATURES to net.c
vhost.h should not depend on device specific marcos like
VHOST_NET_F_VIRTIO_NET_HDR and VIRTIO_NET_F_MRG_RXBUF.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-06 13:21:00 +03:00
Asias He b1ad8496c9 vhost-net: Free ubuf when vhost_dev_set_owner fails
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-06 12:57:54 +03:00
Michael S. Tsirkin 150b9e51ae vhost: fix error handling in RESET_OWNER ioctl
RESET_OWNER ioctl would leave the fd in a bad state if
memory allocation failed: device is stopped
but owner is not reset. Make state changes
after allocating memory, such that a failed
ioctl has no effect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-01 10:02:54 +03:00
Michael S. Tsirkin 81f95a5580 vhost: move per-vq net specific fields out to net
This will remove the need for vhost scsi to pull
in virtio-net.h.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-01 10:02:53 +03:00
Asias He 2839400f8f vhost: move vhost-net zerocopy fields to net.c
On top of 'vhost: Allow device specific fields per vq', we can move device
specific fields to device virt queue from vhost virt queue.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-01 10:02:52 +03:00
Asias He 3ab2e420ec vhost: Allow device specific fields per vq
This is useful for any device who wants device specific fields per vq.
For example, tcm_vhost wants a per vq field to track requests which are
in flight on the vq. Also, on top of this we can add patches to move
things like ubufs from vhost.h out to net.c.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2013-05-01 10:02:45 +03:00
Jason Wang 70181d5120 vhost_net: remove tx polling state
After commit 2b8b328b61 (vhost_net: handle polling
errors when setting backend), we in fact track the polling state through
poll->wqh, so there's no need to duplicate the work with an extra
vhost_net_polling_state. So this patch removes this and make the code simpler.

This patch also removes the all tx starting/stopping code in tx path according
to Michael's suggestion.

Netperf test shows almost the same result in stream test, but gets improvements
on TCP_RR tests (both zerocopy or copy) especially on low load cases.

Tested between multiqueue kvm guest and external host with two direct
connected 82599s.

zerocopy disabled:

sessions|transaction rates|normalize|
before/after/+improvements
1 | 9510.24/11727.29/+23.3%    | 693.54/887.68/+28.0%   |
25| 192931.50/241729.87/+25.3% | 2376.80/2771.70/+16.6% |
50| 277634.64/291905.76/+5%    | 3118.36/3230.11/+3.6%  |

zerocopy enabled:

sessions|transaction rates|normalize|
before/after/+improvements
1 | 7318.33/11929.76/+63.0%    | 521.86/843.30/+61.6%   |
25| 167264.88/242422.15/+44.9% | 2181.60/2788.16/+27.8% |
50| 272181.02/294347.04/+8.1%  | 3071.56/3257.85/+6.1%  |

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-11 16:16:22 -04:00
Michael S. Tsirkin 46aa92d1ba vhost/net: fix heads usage of ubuf_info
ubuf info allocator uses guest controlled head as an index,
so a malicious guest could put the same head entry in the ring twice,
and we will get two callbacks on the same value.
To fix use upend_idx which is guaranteed to be unique.

Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17 14:28:54 -04:00
Jason Wang 2b8b328b61 vhost_net: handle polling errors when setting backend
Currently, the polling errors were ignored, which can lead following issues:

- vhost remove itself unconditionally from waitqueue when stopping the poll,
  this may crash the kernel since the previous attempt of starting may fail to
  add itself to the waitqueue
- userspace may think the backend were successfully set even when the polling
  failed.

Solve this by:

- check poll->wqh before trying to remove from waitqueue
- report polling errors in vhost_poll_start(), tx_poll_start(), the return value
  will be checked and returned when userspace want to set the backend

After this fix, there still could be a polling failure after backend is set, it
will addressed by the next patch.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:43:03 -05:00
Jason Wang 692a998b90 vhost_net: correct error handling in vhost_net_set_backend()
Currently, when vhost_init_used() fails the sock refcnt and ubufs were
leaked. Correct this by calling vhost_init_used() before assign ubufs and
restore the oldsock when it fails.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:43:03 -05:00
Michael S. Tsirkin f9611c43ab vhost-net: enable zerocopy tx by default
Zero copy TX has been around for a while now.
We seem to be down to eliminating theoretical bugs
and performance tuning at this point:
it's probably time to enable it by default so that
most users get the benefit.

Keep the flag around meanwhile so users can experiment
with disabling this if they experience regressions.
I expect that we will remove it in the future.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-12-06 17:09:18 +02:00
Michael S. Tsirkin cedb9bdce0 vhost-net: skip head management if no outstanding
For short packets zerocopy mode adds overhead
of managing heads which isn't necessary: we
could simly update used ring directly
same as with zerocopy disabled.

Things seem to run a bit faster if we detect
and bypass head management when zcopy isn't used.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-12-06 17:09:18 +02:00
Michael S. Tsirkin 1280c27f8e vhost-net: flush outstanding DMAs on memory change
When memory map changes, we need to flush outstanding
DMAs as they might in theory reference old memory addresses.
To do this simply stop initiating new DMAs
and wait for ubufs ref count to drop to 0.
Afterwards reset the count back to 1.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-12-06 17:09:18 +02:00
Michael S. Tsirkin 935cdee7ee vhost: avoid backend flush on vring ops
vring changes already do a flush internally where appropriate, so we do
not need a second flush.

It's currently not very expensive but a follow-up patch makes flush more
heavy-weight, so remove the extra flush here to avoid regressing
performance if call or kick fds are changed on data path.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-12-06 17:09:18 +02:00
Michael S. Tsirkin 64e9a9b8a0 vhost-net: initialize zcopy packet counters
These packet counters are used to drive the zercopy
selection heuristic so nothing too bad happens if they are off a bit -
and they are also reset once in a while.
But it's cleaner to clear them when backend is set so that
we start in a known state.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-03 13:42:17 -05:00
David S. Miller d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Michael S. Tsirkin 24eb21a148 vhost-net: reduce vq polling on tx zerocopy
It seems that to avoid deadlocks it is enough to poll vq before
 we are going to use the last buffer.  This is faster than
c70aa540c7.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-02 21:29:58 -04:00
Michael S. Tsirkin eaae8132ef vhost-net: select tx zero copy dynamically
Even when vhost-net is in zero-copy transmit mode,
net core might still decide to copy the skb later
which is somewhat slower than a copy in user
context: data copy overhead is added to the cost of
page pin/unpin. The result is that enabling tx zero copy
option leads to higher CPU utilization for guest to guest
and guest to host traffic.

To fix this, suppress zero copy tx after a given number of
packets triggered late data copy. Re-enable periodically
to detect workload changes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-02 21:29:58 -04:00
Michael S. Tsirkin b211616d71 vhost: move -net specific code out
Zerocopy handling code is vhost-net specific.
Move it from vhost.c/vhost.h out to net.c

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-02 21:29:58 -04:00
Michael S. Tsirkin 70e4cb9aaf vhost-net: cleanup macros for DMA status tracking
Better document macros for DMA tracking. Add an
explicit one for DMA in progress instead of
relying on user supplying len != 1.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-02 21:29:57 -04:00
Michael S. Tsirkin 910a578f7e vhost: fix mergeable bufs on BE hosts
We copy head count to a 16 bit field, this works by chance on LE but on
BE guest gets 0. Fix it up.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Alexander Graf <agraf@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-24 23:19:30 -04:00
Stefan Hajnoczi 0dd05a3b60 vhost: Separate vhost-net features from vhost features
In order for other vhost devices to use the VHOST_FEATURES bits the
vhost-net specific bits need to be moved to their own VHOST_NET_FEATURES
constant.

(Asias: Update drivers/vhost/test.c to use VHOST_NET_FEATURES)

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Asias He <asias@redhat.com>
Signed-off-by: Nicholas A. Bellinger <nab@risingtidesystems.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-07-22 01:21:53 +03:00
David S. Miller 028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
Basil Gor c53cff5e42 vhost-net: fix handle_rx buffer size
Take vlan header length into account, when vlan id is stored as
vlan_tci. Otherwise tagged packets coming from macvtap will be
truncated.

Signed-off-by: Basil Gor <basil.gor@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-11 18:16:57 -04:00
Jason Wang c8fb217af5 vhost_net: zerocopy: adding and signalling immediately when fully copied
When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:24 +03:00
Jason Wang dbf34207c6 vhost_net: re-poll only on EAGAIN or ENOBUFS
Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.

The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:24 +03:00
Jason Wang c460f05739 vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.

As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:22 +03:00
Michael S. Tsirkin ca8f4fb21d skbuff: struct ubuf_info callback type safety
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-13 13:09:19 -04:00
Michael S. Tsirkin ea5d404655 vhost: fix release path lockdep checks
We shouldn't hold any locks on release path. Pass a flag to
vhost_dev_cleanup to use the lockdep info correctly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
2012-02-28 09:13:22 +02:00
stephen hemminger 7c7c7f01cc vhost-net: add module alias (v2.1)
By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.

Also:
  - use C99 style initialization.
  - add missing entry in documentation for loop-control

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-13 10:12:23 -08:00
Shirley Ma 9e380825ab vhost: handle wrap around in # of bufs math
The meth for calculating the # of outstanding buffers gives
incorrect results when vq->upend_idx wraps around zero.
Fix that.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-07-21 10:48:27 +03:00
Michael S. Tsirkin c047e5f317 vhost-net: update used ring on backend change
On backend change, we flushed out outstanding skbs
but forgot to update the used ring, so that
done entries were left in the ubuf_info ring.
As a result we lose heads or complete incorrect ones,
crashing the guest or leaking memory.
Fix by updating the used ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-07-21 10:23:31 +03:00
Jason Wang f59281dafb vhost: init used ring after backend was set
Move the used ring initialization after backend was set. This
makes it possible to disable the backend and tweak the used ring,
then restart. This will also make it possible to log the used ring
write correctly.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-07-19 13:28:34 +03:00
Michael S. Tsirkin bab632d69e vhost: vhost TX zero-copy support
>From: Shirley Ma <mashirle@us.ibm.com>

This adds experimental zero copy support in vhost-net,
disabled by default. To enable, set
experimental_zcopytx module option to 1.

This patch maintains the outstanding userspace buffers in the
sequence it is delivered to vhost. The outstanding userspace buffers
will be marked as done once the lower device buffers DMA has finished.
This is monitored through last reference of kfree_skb callback. Two
buffer indices are used for this purpose.

The vhost-net device passes the userspace buffers info to lower device
skb through message control. DMA done status check and guest
notification are handled by handle_tx: in the worst case is all buffers
in the vq are in pending/done status, so we need to notify guest to
release DMA done buffers first before we get any new buffers from the
vq.

One known problem is that if the guest stops submitting
buffers, buffers might never get used until some
further action, e.g. device reset. This does not
seem to affect linux guests.

Signed-off-by: Shirley <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-18 10:42:32 -07:00
Michael S. Tsirkin 8ea8cf89e1 vhost: support event index
Support the new event index feature. When acked,
utilize it to reduce the # of interrupts sent to the guest.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-05-30 11:14:15 +09:30
Michael S. Tsirkin de4d768a42 vhost-net: remove unlocked use of receive_queue
Use of skb_queue_empty(&sock->sk->sk_receive_queue)
without taking the sk_receive_queue.lock is unsafe
or useless. Take it out.

Reported-by:  Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-03-13 23:08:19 +02:00
Jason Wang 783e398854 vhost: lock receive queue, not the socket
vhost takes a sock lock to try and prevent
the skb from being pulled from the receive queue
after skb_peek.  However this is not the right lock to use for that,
sk_receive_queue.lock is. Fix that up.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-03-13 23:08:04 +02:00
Jason Wang 94249369e9 vhost-net: Unify the code of mergeable and big buffer handling
Codes duplication were found between the handling of mergeable and big
buffers, so this patch tries to unify them. This could be easily done
by adding a quota to the get_rx_bufs() which is used to limit the
number of buffers it returns (for mergeable buffer, the quota is
simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
previous handle_rx_mergeable() could be resued also for big buffers.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-03-13 17:00:10 +02:00
Jason Wang cfbdab9513 vhost-net: check the support of mergeable buffer outside the receive loop
No need to check the support of mergeable buffer inside the recevie
loop as the whole handle_rx()_xx is in the read critical region.  So
this patch move it ahead of the receiving loop.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-03-13 16:57:30 +02:00
Krishna Kumar d47effe1be vhost: Cleanup vhost.c and net.c
Minor cleanup of vhost.c and net.c to match coding style.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-03-08 18:02:47 +02:00
Michael S. Tsirkin 5e18247b02 vhost: rcu annotation fixup
When built with rcu checks enabled, vhost triggers
bogus warnings as vhost features are read without
dev->mutex sometimes, and private pointer is read
with our kind of rcu where work serves as a
read side critical section.

Fixing it properly is not trivial.
Disable the warnings by stubbing out the checks for now.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2011-02-01 16:48:46 +02:00
David S. Miller 9fe146aef4 Merge branch 'vhost-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost 2010-12-14 11:33:23 -08:00
Jason Wang a290aec88a vhost: fix typos in comment
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-12-09 15:39:15 +02:00
Michael S. Tsirkin 11cd1a8b8c vhost/net: fix rcu check usage
Incorrect rcu check was used as rcu isn't done
under mutex here. Force check to 1 for now,
to stop it from complaining.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-11-25 11:29:16 +02:00
Michael S. Tsirkin 64e1c80748 vhost-net: batch use/unuse mm
Move use/unuse mm to vhost.c which makes it possible to batch these
operations.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-11-04 13:22:11 +02:00
Linus Torvalds 5f05647dd8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
  bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
  vlan: Calling vlan_hwaccel_do_receive() is always valid.
  tproxy: use the interface primary IP address as a default value for --on-ip
  tproxy: added IPv6 support to the socket match
  cxgb3: function namespace cleanup
  tproxy: added IPv6 support to the TPROXY target
  tproxy: added IPv6 socket lookup function to nf_tproxy_core
  be2net: Changes to use only priority codes allowed by f/w
  tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
  tproxy: added tproxy sockopt interface in the IPV6 layer
  tproxy: added udp6_lib_lookup function
  tproxy: added const specifiers to udp lookup functions
  tproxy: split off ipv6 defragmentation to a separate module
  l2tp: small cleanup
  nf_nat: restrict ICMP translation for embedded header
  can: mcp251x: fix generation of error frames
  can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
  can-raw: add msg_flags to distinguish local traffic
  9p: client code cleanup
  rds: make local functions/variables static
  ...

Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
2010-10-23 11:47:02 -07:00
Linus Torvalds 092e0e7e52 Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  vfs: make no_llseek the default
  vfs: don't use BKL in default_llseek
  llseek: automatically add .llseek fop
  libfs: use generic_file_llseek for simple_attr
  mac80211: disallow seeks in minstrel debug code
  lirc: make chardev nonseekable
  viotape: use noop_llseek
  raw: use explicit llseek file operations
  ibmasmfs: use generic_file_llseek
  spufs: use llseek in all file operations
  arm/omap: use generic_file_llseek in iommu_debug
  lkdtm: use generic_file_llseek in debugfs
  net/wireless: use generic_file_llseek in debugfs
  drm: use noop_llseek
2010-10-22 10:52:56 -07:00
Arnd Bergmann 6038f373a3 llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time.  Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
//   but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
   *off = E
|
   *off += E
|
   func(..., off, ...)
|
   E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
  *off = E
|
  *off += E
|
  func(..., off, ...)
|
  E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
 ...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
 .llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
 .read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
 .write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
 .open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
...  .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
...  .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
...  .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+	.llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
 .write = write_f,
 .read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-15 15:53:27 +02:00
Ingo Molnar 556ef63255 Merge commit 'v2.6.36-rc7' into core/rcu
Merge reason: Update from -rc3 to -rc7.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-07 09:43:45 +02:00
Jason Wang e0e9b40647 vhost: max s/g to match qemu
Qemu supports up to UIO_MAXIOV s/g so we have to match that because guest
drivers may rely on this.

Allocate indirect and log arrays dynamically to avoid using too much contigious
memory and make the length of hdr array to match the header length since each
iovec entry has a least one byte.

Test with copying large files w/ and w/o migration in both linux and windows
guests.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-10-05 13:33:43 +02:00
Michael S. Tsirkin ee05d6939e vhost-net: fix range checking in mrg bufs case
In mergeable buffer case, we use headcount, log_num
and seg as indexes in same-size arrays, and
we know that headcount <= seg and
log_num equals either 0 or seg.

Therefore, the right thing to do is range-check seg,
not headcount as we do now: these will be different
if guest chains s/g descriptors (this does not
happen now, but we can not trust the guest).

Long term, we should add BUG_ON checks to verify
two other indexes are what we think they should be.

Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-09-14 15:22:41 +02:00
Arnd Bergmann 28457ee69c vhost: add __rcu annotations
Also add rcu_dereference_protected() for code paths where locks are held.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
2010-08-21 16:27:36 -07:00
Linus Torvalds 6ba74014c1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
  phy/marvell: add 88ec048 support
  igb: Program MDICNFG register prior to PHY init
  e1000e: correct MAC-PHY interconnect register offset for 82579
  hso: Add new product ID
  can: Add driver for esd CAN-USB/2 device
  l2tp: fix export of header file for userspace
  can-raw: Fix skb_orphan_try handling
  Revert "net: remove zap_completion_queue"
  net: cleanup inclusion
  phy/marvell: add 88e1121 interface mode support
  u32: negative offset fix
  net: Fix a typo from "dev" to "ndev"
  igb: Use irq_synchronize per vector when using MSI-X
  ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
  e1000e: Fix irq_synchronize in MSI-X case
  e1000e: register pm_qos request on hardware activation
  ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
  net: Add getsockopt support for TCP thin-streams
  cxgb4: update driver version
  cxgb4: add new PCI IDs
  ...

Manually fix up conflicts in:
 - drivers/net/e1000e/netdev.c: due to pm_qos registration
   infrastructure changes
 - drivers/net/phy/marvell.c: conflict between adding 88ec048 support
   and cleaning up the IDs
 - drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
   conflict (registration change vs marking it static)
2010-08-04 11:47:58 -07:00
David Stevens 8dd014adfe vhost-net: mergeable buffers support
This adds support for mergeable buffers in vhost-net: this is needed
for older guests without indirect buffer support, as well
as for zero copy with some devices.

Includes changes by Michael S. Tsirkin to make the
patch as low risk as possible (i.e., close to no changes
when feature is disabled).

Signed-off-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-07-28 15:45:36 +03:00
Tejun Heo c23f3445e6 vhost: replace vhost_workqueue with per-vhost kthread
Replace vhost_workqueue with per-vhost kthread.  Other than callback
argument change from struct work_struct * to struct vhost_work *,
there's no visible change to vhost_poll_*() interface.

This conversion is to make each vhost use a dedicated kthread so that
resource control via cgroup can be applied.

Partially based on Sridhar Samudrala's patch.

* Updated to use sub structure vhost_work instead of directly using
  vhost_poll at Michael's suggestion.

* Added flusher wake_up() optimization at Michael's suggestion.

Changes by MST:
* Converted atomics/barrier use to a spinlock.
* Create thread on SET_OWNER
* Fix flushing

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
2010-07-28 15:44:53 +03:00
David S. Miller 4c3e5edf2f vhost net: Fix warning.
Reported by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-21 21:09:23 -07:00
David S. Miller 11fe883936 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/vhost/net.c
	net/bridge/br_device.c

Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.

Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20 18:25:24 -07:00
Linus Torvalds 516bd66415 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (24 commits)
  bridge: Partially disable netpoll support
  tcp: fix crash in tcp_xmit_retransmit_queue
  IPv6: fix CoA check in RH2 input handler (mip6_rthdr_input())
  ibmveth: lost IRQ while closing/opening device leads to service loss
  rt2x00: Fix lockdep warning in rt2x00lib_probe_dev()
  vhost: avoid pr_err on condition guest can trigger
  ipmr: Don't leak memory if fib lookup fails.
  vhost-net: avoid flush under lock
  net: fix problem in reading sock TX queue
  net/core: neighbour update Oops
  net: skb_tx_hash() fix relative to skb_orphan_try()
  rfs: call sock_rps_record_flow() in tcp_splice_read()
  xfrm: do not assume that template resolving always returns xfrms
  hostap_pci: set dev->base_addr during probe
  axnet_cs: use spin_lock_irqsave in ax_interrupt
  dsa: Fix Kconfig dependencies.
  act_nat: not all of the ICMP packets need an IP header payload
  r8169: incorrect identifier for a 8168dp
  Phonet: fix skb leak in pipe endpoint accept()
  Bluetooth: Update sec_level/auth_type for already existing connections
  ...
2010-07-20 16:26:42 -07:00
Michael S. Tsirkin 95c0ec6a97 vhost: avoid pr_err on condition guest can trigger
Guest can trigger packet truncation by posting
a very short buffer and disabling buffer merging.
Convert pr_err to pr_debug to avoid log from filling
up when this happens.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-07-16 12:18:17 +03:00
Michael S. Tsirkin 1680e9063e vhost-net: avoid flush under lock
We flush under vq mutex when changing backends.
This creates a deadlock as workqueue being flushed
needs this lock as well.

https://bugzilla.redhat.com/show_bug.cgi?id=612421

Drop the vq mutex before flush: we have the device mutex
which is sufficient to prevent another ioctl from touching
the vq.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-07-15 15:26:12 +03:00
Linus Torvalds 2aa72f6121 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (35 commits)
  NET: SB1250: Initialize .owner
  vxge: show startup message with KERN_INFO
  ll_temac: Fix missing iounmaps
  bridge: Clear IPCB before possible entry into IP stack
  bridge br_multicast: BUG: unable to handle kernel NULL pointer dereference
  net: Fix definition of netif_vdbg() when VERBOSE_DEBUG is defined
  net/ne: fix memory leak in ne_drv_probe()
  xfrm: fix xfrm by MARK logic
  virtio_net: fix oom handling on tx
  virtio_net: do not reschedule rx refill forever
  s2io: resolve statistics issues
  linux/net.h: fix kernel-doc warnings
  net: decreasing real_num_tx_queues needs to flush qdisc
  sched: qdisc_reset_all_tx is calling qdisc_reset without qdisc_lock
  qlge: fix a eeh handler to not add a pending timer
  qlge: Replacing add_timer() to mod_timer()
  usbnet: Set parent device early for netdev_printk()
  net: Revert "rndis_host: Poll status channel before control channel"
  netfilter: ip6t_REJECT: fix a dst leak in ipv6 REJECT
  drivers: bluetooth: bluecard_cs.c: Fixed include error, changed to linux/io.h
  ...
2010-07-07 19:56:00 -07:00
David S. Miller 597e608a84 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-07-07 15:59:38 -07:00
Michael S. Tsirkin 7b3384fc30 vhost: add unlikely annotations to error path
patch 'break out of polling loop on error' caused
a minor performance regression on my machine: recover
that performance by adding a bunch of unlikely annotations
in the error handling.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-07-01 19:25:40 +03:00
Michael S. Tsirkin d5675bd204 vhost: break out of polling loop on error
When ring parsing fails, we currently handle this
as ring empty condition. This means that we enable
kicks and recheck ring empty: if this not empty,
we re-start polling which of course will fail again.

Instead, let's return a negative error code and stop polling.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-06-27 11:52:25 +03:00
Alan Cox 79907d89c3 misc: Fix allocation 'borrowed' by vhost_net
10, 233 is allocated officially to /dev/kmview which is shipping in
Ubuntu and Debian distributions.  vhost_net seem to have borrowed it
without making a proper request and this causes regressions in the other
distributions.

vhost_net can use a dynamic minor so use that instead.  Also update the
file with a comment to try and avoid future misunderstandings.

cc: stable@kernel.org
Signed-off-by: Alan Cox <device@lanana.org>
[ We should have caught this before 2.6.34 got released.  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-09 08:50:31 -07:00
David S. Miller 0dea7c12fc Merge branch 'vhost-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost 2010-06-02 08:26:36 -07:00
Takuya Yoshikawa d3553a5249 vhost-net: fix to check the return value of copy_to/from_user() correctly
copy_to/from_user() returns the number of bytes that could not be copied.

So we need to check if it is not zero, and in that case, we should return
the error number -EFAULT rather than directly return the return value from
copy_to/from_user().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-05-27 13:55:13 +03:00
Jeff Dike dd1f4078f0 vhost-net: minor cleanup
Delete a label and goto from vhost_net_set_backend
Inverting a test allows a label and goto to be eliminated.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-05-27 12:25:37 +03:00
Tobias Klauser 373a83a699 vhost: Storage class should be before const qualifier
The C99 specification states in section 6.11.5:

The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-05-27 12:23:49 +03:00
David S. Miller fea0691526 Merge branch 'vhost' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost 2010-04-14 22:52:46 -07:00
Christoph Hellwig a8d3782f9e vhost: fix sparse warnings
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-04-14 11:52:08 +03:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Michael S. Tsirkin 0e25557212 vhost: fix interrupt mitigation with raw sockets
A thinko in code means we never trigger interrupt
mitigation. Fix this.

Reported-by: Juan Quintela <quintela@redhat.com>
Reported-by: Unai Uribarri <unai.uribarri@optenet.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-03-17 16:44:20 +02:00
Jeff Dike 1dace8c801 vhost: fix error path in vhost_net_set_backend
An error could cause vhost_net_set_backend to exit without unlocking
vq->mutex. Fix this.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2010-03-07 13:28:53 +02:00
Sridhar Samudrala 39286fa41a vhost-net: restart tx poll on sk_sndbuf full
guest to remote communication with vhost net sometimes stops until
guest driver is restarted. This happens when we get guest kick precisely
when the backend send queue is full, as a result handle_tx() returns without
polling backend. This patch fixes this by restarting tx poll on this condition.

Signed-off-by: Sridhar Samudrala <samudrala@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Tom Lendacky <toml@us.ibm.com>
2010-02-28 19:50:33 +02:00
Arnd Bergmann 501c774cb1 net/macvtap: add vhost support
This adds support for passing a macvtap file descriptor into
vhost-net, much like we already do for tun/tap.

Most of the new code is taken from the respective patch
in the tun driver and may get consolidated in the future.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-18 14:08:38 -08:00
Michael S. Tsirkin 3a4d5c94e9 vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device.  Backend is also configured by userspace, including vlan/mac
etc.

Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item.  The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.

(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)

Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-15 01:43:29 -08:00