Commit Graph

155 Commits

Author SHA1 Message Date
Michael S. Tsirkin 39ec7de709 macvtap: fix uninitialized access on TUNSETIFF
flags field in ifreq is only 16 bit wide, but
we read it as a 32 bit value.
If userspace doesn't zero-initialize unused fields,
this will lead to failures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-16 11:19:41 -05:00
Linus Torvalds 70e71ca0af Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

 2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers.  Thanks to Al Viro
    and Herbert Xu.

 3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

 4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

 5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

 6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

 7) Add a variable of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

 8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

 9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets.  From Alexei
    Starovoitov.

10) Support TSO/LSO in sunvnet driver, from David L Stevens.

11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

12) Remote checksum offload, from Tom Herbert.

13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

14) Add MPLS support to openvswitch, from Simon Horman.

15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet.  This tries to resolve the conflicting goals between the
    desired handling of bulk vs.  RPC-like traffic.

17) Allow userspace to ask for the CPU upon what a packet was
    received/steered, via SO_INCOMING_CPU.  From Eric Dumazet.

18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

21) Add VLAN packet scheduler action, from Jiri Pirko.

22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
  Fix race condition between vxlan_sock_add and vxlan_sock_release
  net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
  net/mlx4: Add support for A0 steering
  net/mlx4: Refactor QUERY_PORT
  net/mlx4_core: Add explicit error message when rule doesn't meet configuration
  net/mlx4: Add A0 hybrid steering
  net/mlx4: Add mlx4_bitmap zone allocator
  net/mlx4: Add a check if there are too many reserved QPs
  net/mlx4: Change QP allocation scheme
  net/mlx4_core: Use tasklet for user-space CQ completion events
  net/mlx4_core: Mask out host side virtualization features for guests
  net/mlx4_en: Set csum level for encapsulated packets
  be2net: Export tunnel offloads only when a VxLAN tunnel is created
  gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
  cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
  net: fec: only enable mdio interrupt before phy device link up
  net: fec: clear all interrupt events to support i.MX6SX
  net: fec: reset fep link status in suspend function
  net: sock: fix access via invalid file descriptor
  net: introduce helper macro for_each_cmsghdr
  ...
2014-12-11 14:27:06 -08:00
Al Viro c0371da604 put iov_iter into msghdr
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09 16:29:03 -05:00
Michael S. Tsirkin 6ae7feb316 macvtap: TUN_VNET_LE support
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
2014-12-09 12:05:31 +02:00
Jason Wang f51a5e82ea tun/macvtap: use consume_skb() instead of kfree_skb() when needed
To be more friendly with drop monitor, we should only call kfree_skb() when
the packets were dropped and use consume_skb() in other cases.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-05 21:45:09 -08:00
Al Viro f5ff53b4d9 {macvtap,tun}_get_user(): switch to iov_iter
allows to switch macvtap and tun from ->aio_write() to ->write_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24 05:15:41 -05:00
Al Viro 3af0bfe582 switch macvtap to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24 04:32:52 -05:00
Jason Wang 7cc76f5150 macvtap: advance iov iterator when needed in macvtap_put_user()
When mergeable buffer is used, vnet_hdr_sz is greater than sizeof struct
virtio_net_hdr. So we need advance the iov iterators in this case.

Fixes 6c36d2e26c ("macvtap: Use iovec iterators")
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:59:19 -05:00
Herbert Xu 6c36d2e26c macvtap: Use iovec iterators
This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07 12:13:34 -05:00
Herbert Xu 3ce9b20f19 macvtap: Fix csum_start when VLAN tags are present
When VLAN is in use in macvtap_put_user, we end up setting
csum_start to the wrong place.  The result is that the whoever
ends up doing the checksum setting will corrupt the packet instead
of writing the checksum to the expected location, usually this
means writing the checksum with an offset of -4.

This patch fixes this by adjusting csum_start when VLAN tags are
detected.

Fixes: f09e2249c4 ("macvtap: restore vlan header on user read")
Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 14:52:20 -05:00
Ben Hutchings 5188cd44c5 drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
UFO is now disabled on all drivers that work with virtio net headers,
but userland may try to send UFO/IPv6 packets anyway.  Instead of
sending with ID=0, we should select identifiers on their behalf (as we
used to).

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30 20:01:18 -04:00
Ben Hutchings 3d0ad09412 drivers/net: Disable UFO through virtio
IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header.  UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.

Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug.  Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.

Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.

We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30 20:01:18 -04:00
Tom Herbert 04ffcb255f net: Add ndo_gso_check
Add ndo_gso_check which a device can define to indicate whether is
is capable of doing GSO on a packet. This funciton would be called from
the stack to determine whether software GSO is needed to be done. A
driver should populate this function if it advertises GSO types for
which there are combinations that it wouldn't be able to handle. For
instance a device that performs UDP tunneling might only implement
support for transparent Ethernet bridging type of inner packets
or might have limitations on lengths of inner headers.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 12:11:00 -04:00
Vlad Yasevich 40b8fe45d1 macvtap: Fix race between device delete and open.
In macvtap device delete and open calls can race and
this causes a list curruption of the vlan queue_list.

The race intself is triggered by the idr accessors
that located the vlan device.  The device is stored
into and removed from the idr under both an rtnl and
a mutex.  However, when attempting to locate the device
in idr, only a mutex is taken.  As a result, once cpu
perfoming a delete may take an rtnl and wait for the mutex,
while another cput doing an open() will take the idr
mutex first to fetch the device pointer and later take
an rtnl to add a queue for the device which may have
just gotten deleted.

With this patch, we now hold the rtnl for the duration
of the macvtap_open() call thus making sure that
open will not race with delete.

CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:20:26 -04:00
Vlad Yasevich cbdb04279c mactap: Fix checksum errors for non-gso packets in bridge mode
The following is a problematic configuration:

 VM1: virtio-net device connected to macvtap0@eth0
 VM2: e1000 device connect to macvtap1@eth0

The problem is is that virtio-net supports checksum offloading
and thus sends the packets to the host with CHECKSUM_PARTIAL set.
On the other hand, e1000 does not support any acceleration.

For small TCP packets (and this includes the 3-way handshake),
e1000 ends up receiving packets that only have a partial checksum
set.  This causes TCP to fail checksum validation and to drop
packets.  As a result tcp connections can not be established.

Commit 3e4f8b7873
	macvtap: Perform GSO on forwarding path.
fixes this issue for large packets wthat will end up undergoing GSO.
This commit adds a check for the non-GSO case and attempts to
compute the checksum for partially checksummed packets in the
non-GSO case.

CC: Daniel Lezcano <daniel.lezcano@free.fr>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrian Nord <nightnord@gmail.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-30 16:12:22 -04:00
Paul Gortmaker a81ab36bf5 drivers/net: delete non-required instances of include <linux/init.h>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>.   Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.

This covers everything under drivers/net except for wireless, which
has been submitted separately.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 11:53:26 -08:00
David S. Miller 143c905494 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/i40e/i40e_main.c
	drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:42:06 -05:00
Tom Herbert 3958afa1b2 net: Change skb_get_rxhash to skb_get_hash
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00
Vlad Yasevich 2f6a1b6607 macvlan: Remove custom recieve and forward handlers
Since now macvlan and macvtap use the same receive and
forward handlers, we can remove them completely and use
netif_rx and dev_forward_skb() directly.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12 13:38:39 -05:00
Vlad Yasevich 6acf54f1cf macvtap: Add support of packet capture on macvtap device.
Macvtap device currently doesn not allow a user to capture
traffic on due to the fact that it steals the packets
from the network stack before the skb->dev is set correctly
on the receive side, and that use uses macvlan transmit
path directly on the send side.  As a result, we never
get a change to give traffic to the taps while the correct
device is set in the skb.

This patch makes macvtap device behave almost exaclty like
macvlan.  On the send side, we switch to using dev_queue_xmit().
On the receive side, to deliver packets to macvtap, we now
use netif_rx and dev_forward_skb just like macvlan.  The only
differnce now is that macvtap has its own rx_handler which is
attached to the macvtap netdev.  It is here that we now steal
the packet and provide it to the socket.

As a result, we can now capture traffic on the macvtap device:
   tcpdump -i macvtap0

It also gives us the abilit to add tc actions to the macvtap
device and actually utilize different bandwidth management
queues on output.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12 13:38:39 -05:00
Jason Wang ce232ce01d macvtap: signal truncated packets
macvtap_put_user() never return a value grater than iov length, this in fact
bypasses the truncated checking in macvtap_recvmsg(). Fix this by always
returning the size of packet plus the possible vlan header to let the trunca
checking work.

Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 15:23:06 -05:00
David S. Miller bbd37626e6 net: Revert macvtap/tun truncation signalling changes.
Jason Wang and Michael S. Tsirkin are still discussing how
to properly fix this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:10:21 -05:00
Jason Wang 730054da38 macvtap: signal truncated packets
macvtap_put_user() never return a value grater than iov length, this in fact
bypasses the truncated checking in macvtap_recvmsg(). Fix this by always
returning the size of packet plus the possible vlan header to let the truncated
checking work.

Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:06:49 -05:00
David S. Miller de2aa4760b Revert "macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()"
This reverts commit 41e4af69a5.

MSG_TRUNC handling was broken and is going to be fixed in the
'net' tree, so revert this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:06:18 -05:00
Zhi Yong Wu 41e4af69a5 macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()
By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless coeds in both above functions.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:35:42 -05:00
David S. Miller 34f9f43710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:20:14 -05:00
Zhi Yong Wu 55ec8e25cd macvtap: remove unused parameter in macvtap_do_read()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 15:22:05 -05:00
Zhi Yong Wu 359d44d764 macvtap: remove the dead branch
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 15:22:05 -05:00
Zhi Yong Wu e6ebc7f16c macvtap: update file current position
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 12:42:14 -05:00
Vlad Yasevich 006da7b07b macvtap: Do not double-count received packets
Currently macvlan will count received packets after calling each
vlans receive handler.   Macvtap attempts to count the packet
yet again when the user reads the packet from the tap socket.
This code doesn't do this consistently either.  Remove the
counting from macvtap and let only macvlan count received
packets.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:07:08 -05:00
Jason Wang cd3e22b75c macvtap: fix tx_dropped counting error
After commit 8ffab51b3d
(macvlan: lockless tx path), tx stat counter were converted to percpu stat
structure. So we need use to this also for tx_dropped in macvtap. Otherwise, the
management won't notice the dropping packet in macvtap tx path.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28 18:32:22 -05:00
Jason Wang 16a3fa2863 macvtap: limit head length of skb allocated
We currently use hdr_len as a hint of head length which is advertised by
guest. But when guest advertise a very big value, it can lead to an 64K+
allocating of kmalloc() which has a very high possibility of failure when host
memory is fragmented or under heavy stress. The huge hdr_len also reduce the
effect of zerocopy or even disable if a gso skb is linearized in guest.

To solves those issues, this patch introduces an upper limit (PAGE_SIZE) of the
head, which guarantees an order 0 allocation each time.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14 16:05:27 -05:00
David S. Miller b05930f5d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/iwlwifi/pcie/trans.c
	include/linux/inetdevice.h

The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.

The iwlwifi conflict is a context overlap.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-26 16:37:08 -04:00
Vlad Yasevich e5733321d5 macvtap: Ignore tap features when VNET_HDR is off
When the user turns off VNET_HDR support on the
macvtap device, there is no way to provide any
offload information to the user.  So, it's safer
to ignore offload setting then depend on the user
setting them correctly.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:09:12 -07:00
Vlad Yasevich e558b0188b macvtap: Correctly set tap features when IFF_VNET_HDR is disabled.
When the user turns off IFF_VNET_HDR flag, attempts to change
offload features via TUNSETOFFLOAD do not work.  This could cause
GSO packets to be delivered to the user when the user is
not prepared to handle them.

To solve, allow processing of TUNSETOFFLOAD when IFF_VNET_HDR is
disabled.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:09:11 -07:00
Vlad Yasevich a567dd6252 macvtap: simplify usage of tap_features
In macvtap, tap_features specific the features of that the user
has specified via ioctl().  If we treat macvtap as a macvlan+tap
then we could all the tap a pseudo-device and give it other features
like SG and GSO.  Then we can stop using the features of lower
device (macvlan) when forwarding the traffic the tap.

This solves the issue of possible checksum offload mismatch between
tap feature and macvlan features.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:09:11 -07:00
David S. Miller 2ff1cf12c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
Eric Dumazet 29d7919692 macvtap: fix two races
Since commit ac4e4af1e5 ("macvtap: Consistently use rcu functions"),
Thomas gets two different warnings :

BUG: using smp_processor_id() in preemptible [00000000] code: vhost-45891/45892
caller is macvtap_do_read+0x45c/0x600 [macvtap]
CPU: 1 PID: 45892 Comm: vhost-45891 Not tainted 3.11.0-bisecttest #13
Call Trace:
([<00000000001126ee>] show_trace+0x126/0x144)
 [<00000000001127d2>] show_stack+0xc6/0xd4
 [<000000000068bcec>] dump_stack+0x74/0xd8
 [<0000000000481066>] debug_smp_processor_id+0xf6/0x114
 [<000003ff802e9a18>] macvtap_do_read+0x45c/0x600 [macvtap]
 [<000003ff802e9c1c>] macvtap_recvmsg+0x60/0x88 [macvtap]
 [<000003ff80318c5e>] handle_rx+0x5b2/0x800 [vhost_net]
 [<000003ff8028f77c>] vhost_worker+0x15c/0x1c4 [vhost]
 [<000000000015f3ac>] kthread+0xd8/0xe4
 [<00000000006934a6>] kernel_thread_starter+0x6/0xc
 [<00000000006934a0>] kernel_thread_starter+0x0/0xc

And

BUG: using smp_processor_id() in preemptible [00000000] code: vhost-45897/45898
caller is macvlan_start_xmit+0x10a/0x1b4 [macvlan]
CPU: 1 PID: 45898 Comm: vhost-45897 Not tainted 3.11.0-bisecttest #16
Call Trace:
([<00000000001126ee>] show_trace+0x126/0x144)
 [<00000000001127d2>] show_stack+0xc6/0xd4
 [<000000000068bdb8>] dump_stack+0x74/0xd4
 [<0000000000481132>] debug_smp_processor_id+0xf6/0x114
 [<000003ff802b72ca>] macvlan_start_xmit+0x10a/0x1b4 [macvlan]
 [<000003ff802ea69a>] macvtap_get_user+0x982/0xbc4 [macvtap]
 [<000003ff802ea92a>] macvtap_sendmsg+0x4e/0x60 [macvtap]
 [<000003ff8031947c>] handle_tx+0x494/0x5ec [vhost_net]
 [<000003ff8028f77c>] vhost_worker+0x15c/0x1c4 [vhost]
 [<000000000015f3ac>] kthread+0xd8/0xe4
 [<000000000069356e>] kernel_thread_starter+0x6/0xc
 [<0000000000693568>] kernel_thread_starter+0x0/0xc
2 locks held by vhost-45897/45898:
 #0:  (&vq->mutex){+.+.+.}, at: [<000003ff8031903c>] handle_tx+0x54/0x5ec [vhost_net]
 #1:  (rcu_read_lock){.+.+..}, at: [<000003ff802ea53c>] macvtap_get_user+0x824/0xbc4 [macvtap]

In the first case, macvtap_put_user() calls macvlan_count_rx()
in a preempt-able context, and this is not allowed.

In the second case, macvtap_get_user() calls
macvlan_start_xmit() with BH enabled, and this is not allowed.

Reported-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Bisected-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-11 21:48:58 -07:00
Eric Dumazet 28d6427109 net: attempt high order allocations in sock_alloc_send_pskb()
Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    57981.11

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10 01:16:44 -07:00
Jason Wang c3bdeb5c7c net: move zerocopy_sg_from_iovec() to net/core/datagram.c
To let it be reused and reduce code duplication. Also document this function.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 16:52:34 -07:00
Jason Wang b4bf07771f net: move iov_pages() to net/core/iovec.c
To let it be reused and reduce code duplication.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 16:52:33 -07:00
Jason Wang ece793fcfc macvtap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS
We try to linearize part of the skb when the number of iov is greater than
MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest
network.

Solve this problem by calculate the pages needed for iov before trying to do
zerocopy and switch to use copy instead of zerocopy if it needs more than
MAX_SKB_FRAGS.

This is done through introducing a new helper to count the pages for iov, and
call uarg->callback() manually when switching from zerocopy to copy to notify
vhost.

We can do further optimization on top.

This bug were introduced from b92946e291
(macvtap: zerocopy: validate vectors before building skb).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-18 13:04:25 -07:00
Jason Wang 0fbe0d47b1 macvtap: do not assume 802.1Q when send vlan packets
The hard-coded 8021.q proto will break 802.1ad traffic. So switch to use
vlan->proto.

Cc: Basil Gor <basil.gor@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-16 13:01:57 -07:00
Jason Wang 82a19eb8c0 macvtap: fix the missing ret value of TUNSETQUEUE
Commit 441ac0fcaa
(macvtap: Convert to using rtnl lock) forget to return what
macvtap_ioctl_set_queue() returns to its caller. This may break multiqueue API
by always falling through to TUNGETFEATURES.

Cc: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-16 13:01:57 -07:00
Jason Wang 61d46bf979 macvtap: correctly linearize skb when zerocopy is used
Userspace may produce vectors greater than MAX_SKB_FRAGS. When we try to
linearize parts of the skb to let the rest of iov to be fit in
the frags, we need count copylen into linear when calling macvtap_alloc_skb()
instead of partly counting it into data_len. Since this breaks
zerocopy_sg_from_iovec() since its inner counter assumes nr_frags should
be zero at beginning. This cause nr_frags to be increased wrongly without
setting the correct frags.

This bug were introduced from b92946e291
(macvtap: zerocopy: validate vectors before building skb).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 18:45:52 -07:00
David S. Miller 0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
Vlad Yasevich 3e4f8b7873 macvtap: Perform GSO on forwarding path.
When macvtap forwards skb to its tap, it needs to check
if GSO needs to be performed.  This is sometimes necessary
when the HW device performed GRO, but the guest reading
from the tap does not support it (ex: Windows 7).

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:45:23 -07:00
Vlad Yasevich 2be5c76794 macvtap: Let TUNSETOFFLOAD actually controll offload features.
When the user issues TUNSETOFFLOAD ioctl, macvtap does not do
anything other then to verify arguments.  This patch adds
functionality to allow users to actually control offload features.
NETIF_F_GSO and NETIF_F_GRO are always on, but the rest of the
features can be controlled.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:44:56 -07:00
Vlad Yasevich ac4e4af1e5 macvtap: Consistently use rcu functions
Currently macvtap uses rcu_bh functions in its
user facing fuction macvtap_get_user() and macvtap_put_user().
However, its packet handlers use normal rcu as the rcu_read_lock()
is taken in netif_receive_skb().  We can safely discontinue
the usage or rcu with bh disabled.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:44:56 -07:00
Vlad Yasevich 441ac0fcaa macvtap: Convert to using rtnl lock
Macvtap uses a private lock to protect the relationship between
macvtap_queue and macvlan_dev.  The private lock is not needed
since the relationship is managed by user via open(), release(),
and dellink() calls.  dellink() already happens under rtnl, so
we can safely convert open() and release(), and use it in ioctl()
as well.

Suggested by Eric Dumazet.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:44:56 -07:00
Michael S. Tsirkin 4c7ab054ab macvtap: fix recovery from gup errors
get user pages might fail partially in macvtap zero copy
mode. To recover we need to put all pages that we got,
but code used a wrong index resulting in double-free
errors.

Reported-by: Brad Hubbard <bhubbard@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:17:10 -07:00
Jason Wang f57855a54f macvtap: fix uninitialized return value macvtap_ioctl_set_queue()
Return -EINVAL on illegal flag instead of uninitialized value. This fixes the
kbuild test warning.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:23:11 -07:00
Jason Wang d9a90a3105 macvtap: slient sparse warnings
This patch silents the following sparse warnings:

drivers/net/macvtap.c:98:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:98:9:    expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:98:9:    got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:120:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:120:9:    expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:120:9:    got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:151:22: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:233:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:243:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:247:15: error: incompatible types in comparison expression
(different address spaces)
  CC [M]  drivers/net/macvtap.o
drivers/net/macvlan.c:232:24: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:23:11 -07:00
Jason Wang df09b36f22 macvtap: enable multiqueue flag
To notify the userspace about our capability of multiqueue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:49:09 -07:00
Jason Wang 815f236d62 macvtap: add TUNSETQUEUE ioctl
This patch adds TUNSETQUEUE ioctl to let userspace can temporarily disable or
enable a queue of macvtap. This is used to be compatible at API layer of tuntap
to simplify the userspace to manage the queues. This is done through introducing
a linked list to track all taps while using vlan->taps array to only track
active taps.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:49:09 -07:00
Jason Wang 376b1aabe1 macvtap: eliminate linear search
Linear search were used in both get_slot() and macvtap_get_queue(), this is
because:

- macvtap didn't reshuffle the array of taps when create or destroy a queue, so
  when adding a new queue, macvtap must do linear search to find a location for
  the new queue. This will also complicate the TUNSETQUEUE implementation for
  multiqueue API.
- the queue itself didn't track the queue index, so the we must do a linear
  search in the array to find the location of a existed queue.

The solution is straightforward: reshuffle the array and introduce a queue_index
to macvtap_queue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:49:09 -07:00
Jason Wang 8f475a318a macvtap: introduce macvtap_get_vlan()
Factor out the device holding logic to a macvtap_get_vlan(), this will be also
used by multiqueue API.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:49:08 -07:00
Jason Wang 89cee917de macvtap: do not add self to waitqueue if doing a nonblock read
There's no need to add self to waitqueue if doing a nonblock read. This could
help to avoid the spinlock contention.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:49:08 -07:00
Jason Wang ed0483fa06 macvtap: fix a possible race between queue selection and changing queues
Complier may generate codes that re-read the vlan->numvtaps during
macvtap_get_queue(). This may lead a race if vlan->numvtaps were changed in the
same time and which can lead unexpected result (e.g. very huge value).

We need prevent the compiler from generating such codes by adding an
ACCESS_ONCE() to make sure vlan->numvtaps were only read once.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:49:08 -07:00
Jiri Pirko 351638e7de net: pass info struct via netdevice notifier
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: fix typo on simeth
	shortened dev_getter
	shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:01 -07:00
Jason Wang 40893fd0fd net: switch to use skb_probe_transport_header()
Switch to use the new help skb_probe_transport_header() to do the l4 header
probing for untrusted sources. For packets with partial csum, the header should
already been set by skb_partial_csum_set().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-27 12:48:31 -04:00
Jason Wang 9b4d669bc0 macvtap: set transport header before passing skb to lower device
Set the transport header for 1) some drivers (e.g ixgbe) needs l4 header 2)
precise packet length estimation (introduced in 1def9238) needs l4 header to
compute header length.

For the packets with partial checksum, the patch just set the transport header
to csum_start. Otherwise tries to use skb_flow_dissect() to get l4 offset, if it
fails, just pretend no l4 header.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:44:43 -04:00
Tejun Heo ec09ebc143 macvtap: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:18 -08:00
Pravin B Shelar c9af6db4c1 net: Fix possible wrong checksum generation.
Patch cef401de7b (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-13 13:30:10 -05:00
Eric Dumazet cef401de7b net: fix possible wrong checksum generation
Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-28 00:27:15 -05:00
Denis Efremov 3a7f8c34fe macvtap: rcu_dereference outside read-lock section
rcu_dereference occurs in update section. Replacement by
rcu_dereference_protected in order to prevent lockdep
complaint.

Found by Linux Driver Verification project (linuxtesting.org)

Signed-off-by: Denis Efremov <yefremov.denis@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-12 13:40:48 -07:00
Hong zhi guo ccf7e72b54 macvtap: use prepare_to_wait/finish_wait to ensure mb
instead of raw assignment to current->state

Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-07 13:18:54 -07:00
David S. Miller 028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
Basil Gor f09e2249c4 macvtap: restore vlan header on user read
Ethernet vlan header is not on the packet and kept in the skb->vlan_tci
when it comes from lower dev. This patch inserts vlan header in user
buffer during skb copy on user read.

Signed-off-by: Basil Gor <basil.gor@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-11 18:16:57 -04:00
Jason Wang b92946e291 macvtap: zerocopy: validate vectors before building skb
There're several reasons that the vectors need to be validated:

- Return error when caller provides vectors whose num is greater than UIO_MAXIOV.
- Linearize part of skb when userspace provides vectors grater than MAX_SKB_FRAGS.
- Return error when userspace provides vectors whose total length may exceed
- MAX_SKB_FRAGS * PAGE_SIZE.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:21 +03:00
Jason Wang 01d6657b38 macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:20 +03:00
Jason Wang 02ce04bb3d macvtap: zerocopy: put page when fail to get all requested user pages
When get_user_pages_fast() fails to get all requested pages, we could not use
kfree_skb() to free it as it has not been put in the skb fragments. So we need
to call put_page() instead.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:19 +03:00
Jason Wang 4ef67ebedf macvtap: zerocopy: fix truesize underestimation
As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:19 +03:00
Jason Wang 3afc9621f1 macvtap: zerocopy: fix offset calculation when building skb
This patch fixes the offset calculation when building skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-05-02 18:22:17 +03:00
Al Viro 4040153087 security: trim security.h
Trim security.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
2012-02-14 10:45:42 +11:00
Krishna Kumar ef0002b577 macvtap: Fix macvtap_get_queue to use rxhash first
It was reported that the macvtap device selects a
Acked-by: Michael S. Tsirkin <mst@redhat.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-20 13:45:55 -05:00
Eric Dumazet 2cfa5a0471 net: treewide use of RCU_INIT_POINTER
rcu_assign_pointer(ptr, NULL) can be safely replaced by
RCU_INIT_POINTER(ptr, NULL)

(old rcu_assign_pointer() macro was testing the NULL value and could
omit the smp_wmb(), but this had to be removed because of compiler
warnings)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23 18:48:19 -05:00
Eric W. Biederman e09eff7fc1 macvtap: Fix the minor device number allocation
On systems that create and delete lots of dynamic devices the
31bit linux ifindex fails to fit in the 16bit macvtap minor,
resulting in unusable macvtap devices.  I have systems running
automated tests that that hit this condition in just a few days.

Use a linux idr allocator to track which mavtap minor numbers
are available and and to track the association between macvtap
minor numbers and macvtap network devices.

Remove the unnecessary unneccessary check to see if the network
device we have found is indeed a macvtap device.  With macvtap
specific data structures it is impossible to find any other
kind of networking device.

Increase the macvtap minor range from 65536 to the full 20 bits
that is supported by linux device numbers.  It doesn't solve the
original problem but there is no penalty for a larger minor
device range.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 02:53:07 -04:00
Eric W. Biederman 9bf1907f42 macvtap: Rewrite macvtap_newlink so the error handling works.
Place macvlan_common_newlink at the end of macvtap_newlink because
failing in newlink after registering your network device is not
supported.

Move device_create into a netdevice creation notifier.   The network device
notifier is the only hook that is called after the network device has been
registered with the device layer and before register_network_device returns
success.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 02:53:07 -04:00
Eric W. Biederman 2259fef0bb macvtap: Don't leak unreceived packets when we delete a macvtap device.
To avoid leaking packets in the receive queue.  Add a socket destructor
that will run whenever destroy a macvtap socket.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 02:53:07 -04:00
Eric W. Biederman 047af9cfed macvtap: Fix macvtap_open races in the zero copy enable code.
To see if it is appropriate to enable the macvtap zero copy feature
don't test the lowerdev network device flags.   Instead test the
macvtap network device flags which are a direct copy of the lowerdev
flags.  This is important because nothing holds a reference to lowerdev
and on a very bad day we lowerdev could be a pointer to stale memory.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 02:53:07 -04:00
Eric W. Biederman 99f34b38cd macvtap: Close a race between macvtap_open and macvtap_dellink.
There is a small window in macvtap_open between looking up a
networking device and calling macvtap_set_queue in which
macvtap_del_queues called from macvtap_dellink.   After
calling macvtap_del_queues it is totally incorrect to
allow macvtap_set_queue to proceed so prevent success by
reporting that all of the available queues are in use.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-21 02:53:06 -04:00
Jason Wang 653fc91557 macvtap: fix the uninitialized var using in macvtap_alloc_skb()
Commit d1b08284 use new frag API but would leave f to be used
uninitialized, this patch fix it.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-20 14:37:22 -04:00
Ian Campbell d1b08284ad macvtap: convert to SKB paged frag API.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 15:34:59 -04:00
Shirley Ma 97bc3633be macvtap: macvtapTX zero-copy support
Only 128 bytes is copied, the rest of data is DMA mapped directly from
userspace.

Signed-off-by: Shirley Ma <xma@...ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-07 04:41:24 -07:00
Jason Wang 10a8d94a95 virtio_net: introduce VIRTIO_NET_HDR_F_DATA_VALID
There's no need for the guest to validate the checksum if it have been
validated by host nics. So this patch introduces a new flag -
VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
examing in guest. The backend (tap/macvtap) may set this flag when
met skbs with CHECKSUM_UNNECESSARY to save cpu utilization.

No feature negotiation is needed as old driver just ignore this flag.

Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
when gro is on no difference as it produces skb with partial
checksum. But when gro is disabled, 20% or even higher improvement
could be measured by netperf.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-11 15:57:47 -07:00
David S. Miller 33175d84ee Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10 14:26:00 -08:00
Nicolas Kaiser ce3c869283 drivers/net/macvtap: fix error check
'len' is unsigned of type size_t and can't be negative.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07 15:57:58 -08:00
Eric Dumazet 13707f9e5e drivers/net: remove some rcu sparse warnings
Add missing __rcu annotations and helpers.
minor : Fix some rcu_dereference() calls in macvtap

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 15:02:57 -08:00
Eric Dumazet 1ac9ad1394 net: remove dev_txq_stats_fold()
After recent changes, (percpu stats on vlan/tunnels...), we dont need
anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

1) ixgbe can be converted to use existing tx_ring counters.

2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.

3) sch_teql : almost revert ab35cd4b8f (Use net_device internal stats)
    Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)

4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets

This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.

Ref: http://www.spinics.net/lists/netdev/msg149202.html

Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-13 21:44:34 -08:00
Michał Mirosław 55508d601d net: Use skb_checksum_start_offset()
Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:43:14 -08:00
Krishna Kumar 1565c7c1c4 macvtap: Implement multiqueue for macvtap driver
Implement multiqueue facility for macvtap driver. The idea is that
a macvtap device can be opened multiple times and the fd's can be
used to register eg, as backend for vhost.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-16 21:06:25 -07:00
David S. Miller bb7e95c8fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x_main.c

Merge bnx2x bug fixes in by hand... :-/

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-27 21:01:35 -07:00
Herbert Xu 8a35747a5d macvtap: Limit packet queue length
Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.

This appears to be caused by the fact that macvtap packet queues
are unlimited in length.  This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.

This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.

Please note that macvtap currently has no way of giving congestion
notification, that means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.

This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.

Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.

Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.

Chris Wright noticed that for this patch to work, we need a
non-zero TX queue length.  This patch includes his work to change
the default macvtap TX queue length to 500.

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-22 13:08:56 -07:00
David S. Miller 1ebed71ae2 macvtap: Use dev_t for macvtap_major.
Reported-by: "Robert P. J. Day" <rpjday@crashcourse.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-10 19:25:50 -07:00
Michael S. Tsirkin 55afbd0810 macvtap: add ioctl to modify vnet header size
This adds TUNSETVNETHDRSZ/TUNGETVNETHDRSZ support
to macvtap.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David S. Miller <davem@davemloft.net>
2010-05-04 01:35:47 +03:00
Eric Dumazet 4381548237 net: sock_def_readable() and friends RCU conversion
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect RCU grace period when freeing a "struct socket_wq"

4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"

5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep

6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to :
  - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
  - Use wq_has_sleeper() to eventually wakeup tasks.
  - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use rcu protection as well.

9) Exceptions :
  macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-01 15:00:15 -07:00
Eric Dumazet 4a4771a58e net: use sk_sleep()
Commit aa395145 (net: sk_sleep() helper) missed three files in the
conversion.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-26 11:18:45 -07:00
Eric Dumazet aa39514516 net: sk_sleep() helper
Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 16:37:13 -07:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00