OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Michael S. Tsirkin	7e24bfbe43	tun: fix recovery from gup errors get user pages might fail partially in tun zero copy mode. To recover we need to put all pages that we got, but code used a wrong index resulting in double-free errors. Reported-by: Brad Hubbard <bhubbard@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-25 16:16:45 -07:00
Jason Wang	19a6afb23e	tuntap: set SOCK_ZEROCOPY flag during open Commit `54f968d6ef` (tuntap: move socket to tun_file) forgets to set SOCK_ZEROCOPY flag, which will prevent vhost_net from doing zercopy w/ tap. This patch fixes this by setting it during file open. Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-12 00:44:35 -07:00
Jason Wang	92bb73ea2c	tuntap: fix a possible race between queue selection and changing queues Complier may generate codes that re-read the tun->numqueues during tun_select_queue(). This may be a race if vlan->numqueues were changed in the same time and can lead unexpected result (e.g. very huge value). We need prevent the compiler from generating such codes by adding an ACCESS_ONCE() to make sure tun->numqueues were only read once. Bug were introduced by commit `c8d68e6be1` (tuntap: multiqueue support). Reported-by: Michael S. Tsirkin <mst@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-10 14:32:47 -07:00
Jason Wang	8e6d91ae09	tuntap: forbid changing mq flag for persistent device We currently allow changing the mq flag (IFF_MULTI_QUEUE) for a persistent device. This will result a mismatch between the number the queues in netdev and tuntap. This is because we only allocate a 1q netdevice when IFF_MULTI_QUEUE was not specified, so when we set the IFF_MULTI_QUEUE and try to attach more queues later, netif_set_real_num_tx_queues() may fail which result a single queue netdevice with multiple sockets attached. Solve this by disallowing changing the mq flag for persistent device. Bug was introduced by commit `edfb6a148c` (tuntap: reduce memory using of queues). Reported-by: Sriram Narasimhan <sriram.narasimhan@hp.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-29 00:21:32 -07:00
David S. Miller	58717686cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c drivers/net/ethernet/emulex/benet/be.h include/net/tcp.h net/mac802154/mac802154.h Most conflicts were minor overlapping stuff. The be2net driver brought in some fixes that added __vlan_put_tag calls, which in net-next take an additional argument. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 03:55:20 -04:00
Gao feng	3811ae76bc	net: tun: release the reference of tun device in tun_recvmsg We forget to release the reference of tun device in tun_recvmsg. bug introduced in commit `54f968d6ef` (tuntap: move socket to tun_file) Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 11:06:37 -04:00
Jason Wang	e8dbad66ef	tuntap: correct the return value in tun_set_iff() commit (`3be8fbab` tuntap: fix error return code in tun_set_iff()) breaks the creation of multiqueue tuntap since it forbids to create more than one queues for a multiqueue tuntap device. We need return 0 instead -EBUSY here since we don't want to re-initialize the device when one or more queues has been already attached. Add a comment and correct the return value to zero. Reported-by: Jerry Chu <hkchu@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Wei Yongjun <weiyj.lk@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Jerry Chu <hkchu@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-25 01:48:23 -04:00
David S. Miller	6e0895c2ea	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 20:32:51 -04:00
Wei Yongjun	3be8fbab18	tuntap: fix error return code in tun_set_iff() Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. [ Bug added in linux-3.8 , commit `4008e97f86` ("tuntap: fix ambigious multiqueue API") ] Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-12 15:00:04 -04:00
Jason Wang	c0317998c3	tuntap: initialize vlan_features The vlan_features was zero which prevents vlan GSO packets to be transmitted to userspace. This is suboptimal so enable this by initialize vlan_features for tuntap. Netperf shows better performance of guest receiving since vlan TSO works for tuntap: before: netperf -H 192.168.5.4 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.01 2786.67 after: netperf -H 192.168.5.4 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 8085.49 Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-11 16:21:57 -04:00
Jason Wang	40893fd0fd	net: switch to use skb_probe_transport_header() Switch to use the new help skb_probe_transport_header() to do the l4 header probing for untrusted sources. For packets with partial csum, the header should already been set by skb_partial_csum_set(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:31 -04:00
Jason Wang	38502af77e	tuntap: set transport header before passing it to kernel Currently, for the packets receives from tuntap, before doing header check, kernel just reset the transport header in netif_receive_skb() which pretends no l4 header. This is suboptimal for precise packet length estimation (introduced in `1def9238`) which needs correct l4 header for gso packets. So this patch set the transport header to csum_start for partial checksum packets, otherwise it first try skb_flow_dissect(), if it fails, just reset the transport header. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-26 12:44:43 -04:00
Wei Yongjun	f7de0b9368	tuntap: remove unused variable in __tun_detach() The variable dev is initialized but never used otherwise, so remove the unused variable. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-13 11:31:58 -04:00
Eric Dumazet	f8af75f351	tun: add a missing nf_reset() in tun_net_xmit() Dave reported following crash : general protection fault: 0000 [#1] SMP CPU 2 Pid: 25407, comm: qemu-kvm Not tainted 3.7.9-205.fc18.x86_64 #1 Hewlett-Packard HP Z400 Workstation/0B4Ch RIP: 0010:[<ffffffffa0399bd5>] [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack] RSP: 0018:ffff880276913d78 EFLAGS: 00010206 RAX: 50626b6b7876376c RBX: ffff88026e530d68 RCX: ffff88028d158e00 RDX: ffff88026d0d5470 RSI: 0000000000000011 RDI: 0000000000000002 RBP: ffff880276913d88 R08: 0000000000000000 R09: ffff880295002900 R10: 0000000000000000 R11: 0000000000000003 R12: ffffffff81ca3b40 R13: ffffffff8151a8e0 R14: ffff880270875000 R15: 0000000000000002 FS: 00007ff3bce38a00(0000) GS:ffff88029fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fd1430bd000 CR3: 000000027042b000 CR4: 00000000000027e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process qemu-kvm (pid: 25407, threadinfo ffff880276912000, task ffff88028c369720) Stack: ffff880156f59100 ffff880156f59100 ffff880276913d98 ffffffff815534f7 ffff880276913db8 ffffffff8151a74b ffff880270875000 ffff880156f59100 ffff880276913dd8 ffffffff8151a5a6 ffff880276913dd8 ffff88026d0d5470 Call Trace: [<ffffffff815534f7>] nf_conntrack_destroy+0x17/0x20 [<ffffffff8151a74b>] skb_release_head_state+0x7b/0x100 [<ffffffff8151a5a6>] __kfree_skb+0x16/0xa0 [<ffffffff8151a666>] kfree_skb+0x36/0xa0 [<ffffffff8151a8e0>] skb_queue_purge+0x20/0x40 [<ffffffffa02205f7>] __tun_detach+0x117/0x140 [tun] [<ffffffffa022184c>] tun_chr_close+0x3c/0xd0 [tun] [<ffffffff8119669c>] __fput+0xec/0x240 [<ffffffff811967fe>] ____fput+0xe/0x10 [<ffffffff8107eb27>] task_work_run+0xa7/0xe0 [<ffffffff810149e1>] do_notify_resume+0x71/0xb0 [<ffffffff81640152>] int_signal+0x12/0x17 Code: 00 00 04 48 89 e5 41 54 53 48 89 fb 4c 8b a7 e8 00 00 00 0f 85 de 00 00 00 0f b6 73 3e 0f b7 7b 2a e8 10 40 00 00 48 85 c0 74 0e <48> 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 c7 c7 08 6a 3a a0 RIP [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack] RSP <ffff880276913d78> This is because tun_net_xmit() needs to call nf_reset() before queuing skb into receive_queue Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-06 16:05:00 -05:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Pravin B Shelar	c9af6db4c1	net: Fix possible wrong checksum generation. Patch `cef401de7b` (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 13:30:10 -05:00
David S. Miller	188d1f76d0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/e1000e/ethtool.c drivers/net/vmxnet3/vmxnet3_drv.c drivers/net/wireless/iwlwifi/dvm/tx.c net/ipv6/route.c The ipv6 route.c conflict is simple, just ignore the 'net' side change as we fixed the same problem in 'net-next' by eliminating cached neighbours from ipv6 routes. The e1000e conflict is an addition of a new statistic in the ethtool code, trivial. The vmxnet3 conflict is about one change in 'net' removing a guarding conditional, whilst in 'net-next' we had a netdev_info() conversion. The iwlwifi conflict is dealing with a WARN_ON() conversion in 'net-next' vs. a revert happening in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-05 14:12:20 -05:00
Jason Wang	9e85722d58	tuntap: allow polling/writing/reading when detached We forbid polling, writing and reading when the file were detached, this may complex the user in several cases: - when guest pass some buffers to vhost/qemu and then disable some queues, host/qemu needs to do its own cleanup on those buffers which is complex sometimes. We can do this simply by allowing a user can still write to an disabled queue. Write to an disabled queue will cause the packet pass to the kernel and read will get nothing. - align the polling behavior with macvtap which never fails when the queue is created. This can simplify the polling errors handling of its user (e.g vhost) We can simply achieve this by don't assign NULL to tfile->tun when detached. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:43:04 -05:00
Michael S. Tsirkin	af668b3c27	tun: fix carrier on/off status Commit `c8d68e6be1` removed carrier off call from tun_detach since it's now called on queue disable and not only on tun close. This confuses userspace which used this flag to detect a free tun. To fix, put this back but under if (clean). Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Jason Wang <jasowang@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:43:03 -05:00
David S. Miller	f1e7b73acc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Bring in the 'net' tree so that we can get some ipv4/ipv6 bug fixes that some net-next work will build upon. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:32:13 -05:00
Eric Dumazet	cef401de7b	net: fix possible wrong checksum generation Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 00:27:15 -05:00
Jason Wang	b8732fb7f8	tuntap: limit the number of flow caches We create new flow caches when a new flow is identified by tuntap, This may lead some issues: - userspace may produce a huge amount of short live flows to exhaust host memory - the unlimited number of flow caches may produce a long list which increase the time in the linear searching Solve this by introducing a limit of total number of flow caches. Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-23 13:47:06 -05:00
Jason Wang	edfb6a148c	tuntap: reduce memory using of queues A MAX_TAP_QUEUES(1024) queues of tuntap device is always allocated unconditionally even userspace only requires a single queue device. This is unnecessary and will lead a very high order of page allocation when has a high possibility to fail. Solving this by creating a one queue net device when userspace only use one queue and also reduce MAX_TAP_QUEUES to DEFAULT_MAX_NUM_RSS_QUEUES which can guarantee the success of the allocation. Reported-by: Dirk Hohndel <dirk@hohndel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-23 13:47:06 -05:00
Paul Moore	5dbbaf2de8	tun: fix LSM/SELinux labeling of tun/tap devices This patch corrects some problems with LSM/SELinux that were introduced with the multiqueue patchset. The problem stems from the fact that the multiqueue work changed the relationship between the tun device and its associated socket; before the socket persisted for the life of the device, however after the multiqueue changes the socket only persisted for the life of the userspace connection (fd open). For non-persistent devices this is not an issue, but for persistent devices this can cause the tun device to lose its SELinux label. We correct this problem by adding an opaque LSM security blob to the tun device struct which allows us to have the LSM security state, e.g. SELinux labeling information, persist for the lifetime of the tun device. In the process we tweak the LSM hooks to work with this new approach to TUN device/socket labeling and introduce a new LSM hook, security_tun_dev_attach_queue(), to approve requests to attach to a TUN queue via TUNSETQUEUE. The SELinux code has been adjusted to match the new LSM hooks, the other LSMs do not make use of the LSM TUN controls. This patch makes use of the recently added "tun_socket:attach_queue" permission to restrict access to the TUNSETQUEUE operation. On older SELinux policies which do not define the "tun_socket:attach_queue" permission the access control decision for TUNSETQUEUE will be handled according to the SELinux policy's unknown permission setting. Signed-off-by: Paul Moore <pmoore@redhat.com> Acked-by: Eric Paris <eparis@parisplace.org> Tested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-14 18:16:59 -05:00
Jason Wang	dd38bd8530	tuntap: fix leaking reference count Reference count leaking of both module and sock were found: - When a detached file were closed, its sock refcnt from device were not released, solving this by add the sock_put(). - The module were hold or drop unconditionally in TUNSETPERSIST, which means we if we set the persist flag for N times, we need unset it for another N times. Solving this by only hold or drop an reference when there's a flag change and also drop the reference count when the persist device is deleted. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-11 19:42:02 -08:00
Jason Wang	7c0c3b1a8a	tuntap: forbid calling TUNSETIFF when detached Michael points out that even after Stefan's fix the TUNSETIFF is still allowed to create a new tap device. This because we only check tfile->tun but the tfile->detached were introduced. Fix this by failing early in tun_set_iff() if the file is detached. After this fix, there's no need to do the check again in tun_set_iff(), so this patch removes it. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-11 19:42:02 -08:00
Jason Wang	b8deabd3ee	tuntap: switch to use rtnl_dereference() Switch to use rtnl_dereference() instead of the open code, suggested by Eric. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-11 19:42:02 -08:00
Michael S. Tsirkin	9d43a18c6e	tun: avoid owner checks on IFF_ATTACH_QUEUE At the moment, we check owner when we enable queue in tun. This seems redundant and will break some valid uses where fd is passed around: I think TUNSETOWNER is there to prevent others from attaching to a persistent device not owned by them. Here the fd is already attached, enabling/disabling queue is more like read/write. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 14:26:43 -08:00
Stefan Hajnoczi	6e331f4c83	tuntap: refuse to re-attach to different tun_struct Multiqueue tun devices support detaching a tun_file from its tun_struct and re-attaching at a later point in time. This allows users to disable a specific queue temporarily. ioctl(TUNSETIFF) allows the user to specify the network interface to attach by name. This means the user can attempt to attach to interface "B" after detaching from interface "A". The driver is not designed to support this so check we are re-attaching to the right tun_struct. Failure to do so may lead to oops. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 14:24:10 -08:00
Eric Dumazet	9fdc6bef5f	tuntap: dont use a private kmem_cache Commit `96442e4242` (tuntap: choose the txq based on rxq) added a per tun_struct kmem_cache. As soon as several tun_struct are used, we get an error because two caches cannot have same name. Use the default kmalloc()/kfree_rcu(), as it reduce code size and doesn't have performance impact here. Reported-by: Paul Moore <pmoore@redhat.com> Tested-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-21 13:14:01 -08:00
Jason Wang	d32649d171	tuntap: fix sparse warning Make tun_enable_queue() static to fix the sparse warning: drivers/net/tun.c:399:19: sparse: symbol 'tun_enable_queue' was not declared. Should it be static? Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-17 20:49:06 -08:00
Eric Dumazet	76fe45812a	tuntap: reset network header before calling skb_get_rxhash() Commit `499744209b` (tuntap: dont use skb after netif_rx_ni(skb)) introduced another bug. skb_get_rxhash() needs to access the network header, and it was set for us in netif_rx_ni(). We need to reset network header or else skb_flow_dissect() behavior is out of control. Reported-and-tested-by: Kirill A. Shutemov <kirill@shutemov.name> Tested-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-17 12:32:44 -08:00
Jason Wang	4008e97f86	tuntap: fix ambigious multiqueue API The current multiqueue API is ambigious which may confuse both user and LSM to do things correctly: - Both TUNSETIFF and TUNSETQUEUE could be used to create the queues of a tuntap device. - TUNSETQUEUE were used to disable and enable a specific queue of the device. But since the state of tuntap were completely removed from the queue, it could be used to attach to another device (there's no such kind of requirement currently, and it needs new kind of LSM policy. - TUNSETQUEUE could be used to attach to a persistent device without any queues. This kind of attching bypass the necessary checking during TUNSETIFF and may lead unexpected result. So this patch tries to make a cleaner and simpler API by: - Only allow TUNSETIFF to create queues. - TUNSETQUEUE could be only used to disable and enabled the queues of a device, and the state of the tuntap device were not detachd from the queues when it was disabled, so TUNSETQUEUE could be only used after TUNSETIFF and with the same device. This is done by introducing a list which keeps track of all queues which were disabled. The queue would be moved between this list and tfiles[] array when it was enabled/disabled. A pointer of the tun_struct were also introdued to track the device it belongs to when it was disabled. After the change, the isolation between management and application could be done through: TUNSETIFF were only called by management software and TUNSETQUEUE were only called by application.For LSM/SELinux, the things left is to do proper check during tun_set_queue() if needed. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-14 13:14:06 -05:00
Eric Dumazet	499744209b	tuntap: dont use skb after netif_rx_ni(skb) On Wed, 2012-12-12 at 23:16 -0500, Dave Jones wrote: > Since todays net merge, I see this when I start openvpn.. > > general protection fault: 0000 [#1] PREEMPT SMP > Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs iTCO_wdt iTCO_vendor_support snd_emu10k1 snd_util_mem snd_ac97_codec coretemp ac97_bus microcode snd_hwdep snd_seq pcspkr snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 snd_rawmidi mfd_core snd_seq_device snd e1000e soundcore emu10k1_gp gameport i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate firewire_ohci sata_sil firewire_core crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy > CPU 0 > Pid: 1381, comm: openvpn Not tainted 3.7.0+ #14 /D975XBX > RIP: 0010:[<ffffffff815b54a4>] [<ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0 > RSP: 0018:ffff88007d0d9c48 EFLAGS: 00010206 > RAX: 000000000000055d RBX: 6b6b6b6b6b6b6b4b RCX: 1471030a0180040a > RDX: 0000000000000005 RSI: 00000000ffffffe0 RDI: ffff8800ba83fa80 > RBP: ffff88007d0d9cb8 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000101 R12: ffff8800ba83fa80 > R13: 0000000000000008 R14: ffff88007d0d9cc8 R15: ffff8800ba83fa80 > FS: 00007f6637104800(0000) GS:ffff8800bf600000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f563f5b01c4 CR3: 000000007d140000 CR4: 00000000000007f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process openvpn (pid: 1381, threadinfo ffff88007d0d8000, task ffff8800a540cd60) > Stack: > ffff8800ba83fa80 0000000000000296 0000000000000000 0000000000000000 > ffff88007d0d9cc8 ffffffff815bcff4 ffff88007d0d9ce8 ffffffff815b1831 > ffff88007d0d9ca8 00000000703f6364 ffff8800ba83fa80 0000000000000000 > Call Trace: > [<ffffffff815bcff4>] ? netif_rx+0x114/0x4c0 > [<ffffffff815b1831>] ? skb_copy_datagram_from_iovec+0x61/0x290 > [<ffffffff815b672a>] __skb_get_rxhash+0x1a/0xd0 > [<ffffffffa03b9538>] tun_get_user+0x418/0x810 [tun] > [<ffffffff8135f468>] ? delay_tsc+0x98/0xf0 > [<ffffffff8109605c>] ? __rcu_read_unlock+0x5c/0xa0 > [<ffffffffa03b9a41>] tun_chr_aio_write+0x81/0xb0 [tun] > [<ffffffff81145011>] ? __buffer_unlock_commit+0x41/0x50 > [<ffffffff811db917>] do_sync_write+0xa7/0xe0 > [<ffffffff811dc01f>] vfs_write+0xaf/0x190 > [<ffffffff811dc375>] sys_write+0x55/0xa0 > [<ffffffff81705540>] tracesys+0xdd/0xe2 > Code: 41 8b 44 24 68 41 2b 44 24 6c 01 de 29 f0 83 f8 03 0f 8e a0 00 00 00 48 63 de 49 03 9c 24 e0 00 00 00 48 85 db 0f 84 72 fe ff ff <8b> 03 41 89 46 08 b8 01 00 00 00 e9 43 fd ff ff 0f 1f 40 00 48 > RIP [<ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0 > RSP <ffff88007d0d9c48> > ---[ end trace 6d42c834c72c002e ]--- > > > Faulting instruction is > > 0: 8b 03 mov (%rbx),%eax > > rbx is slab poison (-20) so this looks like a use-after-free here... > > flow->ports = *ports; > 314: 8b 03 mov (%rbx),%eax > 316: 41 89 46 08 mov %eax,0x8(%r14) > > in the inlined skb_header_pointer in skb_flow_dissect > > Dave > commit `96442e4242` (tuntap: choose the txq based on rxq) added a use after free. Cache rxhash in a temp variable before calling netif_rx_ni() Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Wang <jasowang@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-13 12:58:11 -05:00
stephen hemminger	a676847b39	tun: allow setting ethernet addresss while running This is a pure software device, and ok with live address change. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-11 12:49:53 -05:00
Paul Moore	b3943aef7e	tun: correctly report an error in tun_flow_init() On error, the error code from tun_flow_init() is lost inside tun_set_iff(), this patch fixes this by assigning the tun_flow_init() error code to the "err" variable which is returned by the tun_flow_init() function on error. Signed-off-by: Paul Moore <pmoore@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-07 13:20:46 -05:00
Michael S. Tsirkin	5d09710925	tun: only queue packets on device Historically tun supported two modes of operation: - in default mode, a small number of packets would get queued at the device, the rest would be queued in qdisc - in one queue mode, all packets would get queued at the device This might have made sense up to a point where we made the queue depth for both modes the same and set it to a huge value (500) so unless the consumer is stuck the chance of losing packets is small. Thus in practice both modes behave the same, but the default mode has some problems: - if packets are never consumed, fragments are never orphaned which cases a DOS for sender using zero copy transmit - overrun errors are hard to diagnose: fifo error is incremented only once so you can not distinguish between userspace that is stuck and a transient failure, tcpdump on the device does not show any traffic Userspace solves this simply by enabling IFF_ONE_QUEUE but there seems to be little point in not doing the right thing for everyone, by default. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-03 15:07:36 -05:00
Jason Wang	eb0fb363f9	tuntap: attach queue 0 before registering netdevice We attach queue 0 after registering netdevice currently. This leads to call netif_set_real_num_{tx\|rx}_queues() after registering the netdevice. Since we allow tun/tap has a maximum of 1024 queues, this may lead a huge number of uevents to be injected to userspace since we create 2048 kobjects and then remove 2046. Solve this problem by attaching queue 0 and set the real number of queues before registering netdevice. Reported-by: Jiri Slaby <jslaby@suse.cz> Tested-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-03 13:47:57 -05:00
Rami Rosen	3872baf618	tun: put correct method name in a debug message. This patch puts the correct method name, tun_do_read, in a debug message. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-26 17:22:10 -05:00
Rami Rosen	36fe8c0973	vtun: fix typos. This patch fixes four typos in drivers/net/vtun.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-26 17:22:09 -05:00
Rami Rosen	9ce99cf6dc	tun: change tun_get_iff() prototype. This patch changes tun_get_iff() prototype to return void as it never fails. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-23 14:24:46 -05:00
Eric W. Biederman	c260b7722f	net: Allow userns root to control tun and tap devices Allow an unpriviled user who has created a user namespace, and then created a network namespace to effectively use the new network namespace, by reducing capable(CAP_NET_ADMIN) calls to ns_capable(net->user_ns,CAP_NET_ADMIN) calls. Allow setting of the tun iff flags. Allow creating of tun devices. Allow adding a new queue to a tun device. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-19 14:15:54 -05:00
Michael S. Tsirkin	149d36f718	tun: report orphan frags errors to zero copy callback When tun transmits a zero copy skb, it orphans the frags which might need to allocate extra memory, in atomic context. If that fails, notify ubufs callback before freeing the skb as a hint that device should disable zerocopy mode. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-02 21:29:57 -04:00
Jason Wang	96442e4242	tuntap: choose the txq based on rxq This patch implements a simple multiqueue flow steering policy - tx follows rx for tun/tap. The idea is simple, it just choose the txq based on which rxq it comes. The flow were identified through the rxhash of a skb, and the hash to queue mapping were recorded in a hlist with an ageing timer to retire the mapping. The mapping were created when tun receives packet from userspace, and was quired in .ndo_select_queue(). I run co-current TCP_CRR test and didn't see any mapping manipulation helpers in perf top, so the overhead could be negelected. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-01 11:14:09 -04:00
Jason Wang	cde8b15f1a	tuntap: add ioctl to attach or detach a file form tuntap device Sometimes usespace may need to active/deactive a queue, this could be done by detaching and attaching a file from tuntap device. This patch introduces a new ioctls - TUNSETQUEUE which could be used to do this. Flag IFF_ATTACH_QUEUE were introduced to do attaching while IFF_DETACH_QUEUE were introduced to do the detaching. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-01 11:14:08 -04:00
Jason Wang	c8d68e6be1	tuntap: multiqueue support This patch converts tun/tap to a multiqueue devices and expose the multiqueue queues as multiple file descriptors to userspace. Internally, each tun_file were abstracted as a queue, and an array of pointers to tun_file structurs were stored in tun_structure device, so multiple tun_files were allowed to be attached to the device as multiple queues. When choosing txq, we first try to identify a flow through its rxhash, if it does not have such one, we could try recorded rxq and then use them to choose the transmit queue. This policy may be changed in the future. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-01 11:14:08 -04:00
Jason Wang	6e914fc707	tuntap: RCUify dereferencing between tun_struct and tun_file RCU were introduced in this patch to synchronize the dereferences between tun_struct and tun_file. All tun_{get\|put} were replaced with RCU, the dereference from one to other must be done under rtnl lock or rcu read critical region. This is needed for the following patches since the one of the goal of multiqueue tuntap is to allow adding or removing queues during workload. Without RCU, control path would hold tx locks when adding or removing queues (which may cause sme delay) and it's hard to change the number of queues without stopping the net device. With the help of rcu, there's also no need for tun_file hold an refcnt to tun_struct. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-01 11:14:07 -04:00
Jason Wang	54f968d6ef	tuntap: move socket to tun_file Current tuntap makes use of the socket receive queue as its tx queue. To implement multiple tx queues for tuntap and enable the ability of adding and removing queues during workload, the first step is to move the socket related structures to tun_file. Then we could let multiple fds/sockets to be attached to the tuntap. This patch removes tun_sock and moves socket related structures from tun_sock or tun_struct to tun_file. Two exceptions are tap_filter and sock_fprog, they are still kept in tun_structure since they are used to filter packets for the net device instead of per transmit queue (at least I see no requirements for them). After those changes, socket were created and destroyed during file open and close (instead of device creation and destroy), the socket structures could be dereferenced from tun_file instead of the file of tun_struct structure itself. For persisent device, since we purge during datching and wouldn't queue any packets when no interface were attached, there's no behaviod changes before and after this patch, so the changes were transparent to the userspace. To keep the attributes such as sndbuf, socket filter and vnet header, those would be re-initialize after a new interface were attached to an persist device. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-01 11:14:07 -04:00
Jason Wang	1e5883382c	tuntap: log the unsigned informaiton with %u Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-01 11:14:06 -04:00
Daniel Wagner	6a328d8c6f	cgroup: net_cls: Rework update socket logic The cgroup logic part of net_cls is very similar as the one in net_prio. Let's stream line the net_cls logic with the net_prio one. The net_prio update logic was changed by following commit (note there were some changes necessary later on) commit `406a3c638c` Author: John Fastabend <john.r.fastabend@intel.com> Date: Fri Jul 20 10:39:25 2012 +0000 net: netprio_cgroup: rework update socket logic Instead of updating the sk_cgrp_prioidx struct field on every send this only updates the field when a task is moved via cgroup infrastructure. This allows sockets that may be used by a kernel worker thread to be managed. For example in the iscsi case today a user can put iscsid in a netprio cgroup and control traffic will be sent with the correct sk_cgrp_prioidx value set but as soon as data is sent the kernel worker thread isssues a send and sk_cgrp_prioidx is updated with the kernel worker threads value which is the default case. It seems more correct to only update the field when the user explicitly sets it via control group infrastructure. This allows the users to manage sockets that may be used with other threads. Since classid is now updated when the task is moved between the cgroups, we don't have to call sock_update_classid() from various places to ensure we always using the latest classid value. [v2: Use iterate_fd() instead of open coding] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Li Zefan <lizefan@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Joe Perches <joe@perches.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: <netdev@vger.kernel.org> Cc: <cgroups@vger.kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-26 03:40:51 -04:00

1 2 3 4 5

207 Commits