Commit Graph

41959 Commits

Author SHA1 Message Date
Florian Westphal 5e3c61f981 netfilter: conntrack: fix lookup race during hash resize
When resizing the conntrack hash table at runtime via
echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
the conntrack lookup path -- reads can happen in parallel and nothing
prevents readers from observing the newly allocated hash but the old
size (or vice versa).

So access to hash[bucket] can trigger OOB read access in case the table got
expanded and we saw the new size but the old hash pointer (or it got shrunk
and we got new hash ptr but the size of the old and larger table):

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
[..]
Call Trace:
[<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
[<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
[<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760

Use a generation counter to obtain the address/length of the same table.

Also add a synchronize_net before freeing the old hash.
AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that lockless reader got delayed by another event:

CPU1			CPU2
seq_begin
seq_retry
<delay>			resize occurs
			free oldhash
for_each(oldhash[size])

Note that resize is only supported in init_netns. It took over 2 minutes
of constant resizing+flooding to produce the warning, so this isn't a
big problem in practice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:43 +02:00
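A minimal user-space sketch of the generation-counter idea described in 5e3c61f981, assuming C11 atomics in place of the kernel's seqcount primitives. The names table_publish and table_snapshot are hypothetical and do not come from the patch, and the deferred free (synchronize_net before freeing the old hash) is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the conntrack hash: pointer and size belong together. */
struct table_view {
    int *buckets;
    size_t size;
};

static atomic_uint generation;          /* even: stable, odd: resize in progress */
static _Atomic(int *) cur_buckets;
static atomic_size_t cur_size;

/* Writer: make the counter odd, publish both fields, make it even again. */
static void table_publish(int *buckets, size_t size)
{
    assert(buckets != NULL && size > 0);
    atomic_fetch_add(&generation, 1);
    atomic_store(&cur_buckets, buckets);
    atomic_store(&cur_size, size);
    atomic_fetch_add(&generation, 1);
}

/* Reader: retry until pointer and size were read within one generation. */
static struct table_view table_snapshot(void)
{
    struct table_view v;
    unsigned int seq;

    for (;;) {
        seq = atomic_load(&generation);
        if (seq & 1)
            continue;                   /* writer active, try again */
        v.buckets = atomic_load(&cur_buckets);
        v.size = atomic_load(&cur_size);
        if (atomic_load(&generation) == seq)
            return v;                   /* consistent pair */
    }
}
```

A reader that indexes v.buckets only with values below v.size cannot mix the old size with the new pointer, which is the out-of-bounds case the commit message describes.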
Florian Westphal 2cf1234807 netfilter: conntrack: keep BH enabled during lookup
No need to disable BH here anymore:

stats are switched to _ATOMIC variant (== this_cpu_inc()), which
nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:43 +02:00
Florian Westphal 1ad8f48df6 netfilter: nftables: add connlabel set support
Conntrack labels are currently sized depending on the iptables
ruleset, i.e. if we're asked to test or set bits 1, 2, and 65 then we
would allocate enough room to store at least bit 65.

However, with nft, the input is just a register with arbitrary runtime
content.

We therefore ask for the upper ceiling we currently have, which is
enough room to store 128 bits.

Alternatively, we could alter nf_connlabel_replace to increase
net->ct.label_words at run time, but since 128 bits is not that
big we'd only save sizeof(long) so it doesn't seem worth it for now.

This follows a similar approach that xtables 'connlabel'
match uses, so when user inputs

    ct label set bar

then we will set the bit used by the 'bar' label and leave the rest alone.

This is done by passing the sreg content to nf_connlabels_replace
as both value and mask argument.
Labels (bits) already set thus cannot be re-set to zero, but
this is not supported by xtables connlabel match either.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:27:59 +02:00
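The "value and mask are the same register" behaviour from 1ad8f48df6 can be illustrated with a small sketch. It assumes a 128-bit label area held in four 32-bit words; labels_replace and labels_set are hypothetical helpers, not the kernel's nf_connlabels_replace.

```c
#include <assert.h>
#include <stdint.h>

#define LABEL_WORDS 4   /* 4 x 32 bits = 128 label bits */

/* Hypothetical helper: overwrite only the bits selected by mask. */
static void labels_replace(uint32_t *dst, const uint32_t *val,
                           const uint32_t *mask)
{
    assert(dst && val && mask);
    for (int i = 0; i < LABEL_WORDS; i++)
        dst[i] = (dst[i] & ~mask[i]) | (val[i] & mask[i]);
}

/* "ct label set": the register content serves as both value and mask. */
static void labels_set(uint32_t *dst, const uint32_t *reg)
{
    labels_replace(dst, reg, reg);
}
```

Because the mask only covers bits that are 1 in the register, bits that are already set can never be cleared this way, matching the limitation noted in the commit message.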
Eric Dumazet 777c6ae57e tcp: two more missing bh disable
percpu_counter only has protection against preemption.

TCP stack uses them possibly from BH, so we need BH protection
in contexts that could be run in process context.

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 23:47:54 -04:00
Eric Dumazet 614bdd4d6e tcp: must block bh in __inet_twsk_hashdance()
__inet_twsk_hashdance() might be called from process context,
better block BH before acquiring bind hash and established locks

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
Eric Dumazet 46cc6e4976 tcp: fix lockdep splat in tcp_snd_una_update()
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit & SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
David S. Miller 32b583a0cb Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-05-04

1) The flowcache can hit an OOM condition if too
   many entries are in the gc_list. Fix this by
   counting the entries in the gc_list and refusing
   new allocations if the value is too high.

2) The inner headers are invalid after an xfrm transformation,
   so reset the skb encapsulation field to ensure nobody tries
   to access the inner headers. Otherwise tunnel devices stacked
   on top of xfrm may build the outer headers based on wrong
   information.

3) Add pmtu handling to vti, we need it to report
   pmtu information for locally generated packets.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:35:31 -04:00
David S. Miller 5332174a83 Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
pull request: batman-adv 20160504

In this pull request you have:
- two changes to the MAINTAINERS file where one marks our mailing list
  as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:21:08 -04:00
Kangjie Lu 5f8e44741f net: fix infoleak in rtnetlink
The stack object “map” has a total size of 32 bytes. Its last 4
bytes are padding generated by compiler. These padding bytes are
not initialized and sent out via “nla_put”.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:19:42 -04:00
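The class of leak fixed in 5f8e44741f can be sketched in user space. The struct below is a hypothetical stand-in with 28 bytes of members, which a typical 64-bit ABI pads to 32 bytes; the fix pattern shown (zeroing the whole object before filling it) is the usual remedy and is an assumption about the approach, not a copy of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: 28 bytes of members, usually 4 bytes of tail padding. */
struct ifmap_like {
    uint64_t mem_start;
    uint64_t mem_end;
    uint64_t base_addr;
    uint16_t irq;
    uint8_t  dma;
    uint8_t  port;
};

/* Zero the whole object first so padding never carries stale stack bytes. */
static void ifmap_fill(struct ifmap_like *m, uint64_t start, uint64_t end)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
    m->mem_start = start;
    m->mem_end = end;
}

/* Number of bytes that follow the last member. */
static size_t ifmap_tail_padding(void)
{
    return sizeof(struct ifmap_like)
           - (offsetof(struct ifmap_like, port) + sizeof(uint8_t));
}
```

Copying sizeof(struct ifmap_like) bytes out of an object that was filled member by member, without the memset, would also copy whatever happened to be in the padding.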
Kangjie Lu b8670c09f3 net: fix infoleak in llc
The stack object “info” has a total size of 12 bytes. Its last byte
is padding which is not initialized and leaked via “put_cmsg”.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:18:48 -04:00
Florian Westphal 9b36627ace net: remove dev->trans_start
Previous patches removed all direct accesses to dev->trans_start,
so change the netif_trans_update helper to update trans_start of
netdev queue 0 instead and then remove trans_start from struct net_device.

AFAICS a lot of the netif_trans_update() invocations are now useless
because they occur in ndo_start_xmit and driver doesn't set LLTX
(i.e. stack already took care of the update).

As I can't test any of them it seems better to just leave them alone.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:50 -04:00
Florian Westphal 860e9538a9 treewide: replace dev->trans_start update with helper
Replace all trans_start updates with netif_trans_update helper.
The change was done via spatch:

struct net_device *d;
@@
- d->trans_start = jiffies
+ netif_trans_update(d)

Compile tested only.

Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: linux-xtensa@linux-xtensa.org
Cc: linux1394-devel@lists.sourceforge.net
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: MPT-FusionLinux.pdl@broadcom.com
Cc: linux-scsi@vger.kernel.org
Cc: linux-can@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-omap@vger.kernel.org
Cc: linux-hams@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: linux-bluetooth@vger.kernel.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:49 -04:00
Arnd Bergmann 8bf42e9e51 gre6: add Kconfig dependency for NET_IPGRE_DEMUX
The ipv6 gre implementation was cleaned up to share more code
with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX
is disabled, resulting in a link error:

net/built-in.o: In function `gre_rcv':
:(.text+0x17f5d0): undefined reference to `gre_parse_header'
ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined!

This adds a Kconfig dependency to prevent that now invalid
configuration.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 308edfdf15 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:12:36 -04:00
Jiri Benc 125372faa4 gre: receive also TEB packets for lwtunnels
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.

To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:32 -04:00
Jiri Benc 244a797bdc gre: move iptunnel_pull_header down to ipgre_rcv
This will allow making the pull dependent on the tunnel type.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:31 -04:00
Jiri Benc 00b2034029 gre: remove superfluous pskb_may_pull
The call to gre_parse_header is either followed by iptunnel_pull_header, or
in the case of ICMP error path, the actual header is not accessed at all.

In the first case, iptunnel_pull_header will call pskb_may_pull anyway and
it's pointless to do it twice. The only difference is what call will fail
with what error code but the net effect is still the same in all call sites.

In the second case, pskb_may_pull is pointless, as skb->data is at the outer
IP header and not at the GRE header.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:31 -04:00
Alexander Duyck b1dc497b28 net: Fix netdev_fix_features so that TSO_MANGLEID is only available with TSO
This change makes it so that we will strip the TSO_MANGLEID bit if TSO is
not present.  This way we will also handle ECN correctly if TSO is not
present.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:27 -04:00
Alexander Duyck 36c983824b gso: Only allow GSO_PARTIAL if we can checksum the inner protocol
This patch addresses a possible issue that can occur if we get into any odd
corner cases where we support TSO for a given protocol but not the checksum
or scatter-gather offload.  There are a few drivers floating around that
set up their tunnels this way and by enforcing the checksum piece we can
avoid mangling any frames.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:27 -04:00
Alexander Duyck d7fb5a8049 gso: Do not perform partial GSO if number of partial segments is 1 or less
In the event that the number of partial segments is equal to 1 we don't
really need to perform partial segmentation offload.  As such we should
skip multiplying the MSS and instead just clear the partial_segs value
since it will not provide any gain to advertise the frame as being GSO when
it is a single frame.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:26 -04:00
Jiri Benc f132ae7c46 gre: change gre_parse_header to return the header length
It's easier for gre_parse_header to return the header length instead of
filling it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.

In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.

This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 12:44:45 -04:00
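The calling convention described in f132ae7c46 (return the header length, or a negative value on error) can be sketched as follows. This is not the kernel's gre_parse_header: the function names are hypothetical, the flag constants are written in host byte order for readability, and only the length computation from the GRE header layout (RFC 2784, RFC 2890) is shown.

```c
#include <assert.h>
#include <stdint.h>

/* GRE flag bits, host byte order (assumption made for readability). */
#define GRE_F_CSUM    0x8000
#define GRE_F_ROUTING 0x4000
#define GRE_F_KEY     0x2000
#define GRE_F_SEQ     0x1000
#define GRE_F_VERSION 0x0007

/* Hypothetical parser: returns the header length in bytes, or -1 on error. */
static int gre_hdr_len_sketch(uint16_t flags)
{
    int len = 4;                        /* flags + protocol type */

    if (flags & (GRE_F_VERSION | GRE_F_ROUTING))
        return -1;                      /* unsupported header variant */
    if (flags & GRE_F_CSUM)
        len += 4;                       /* checksum + reserved */
    if (flags & GRE_F_KEY)
        len += 4;
    if (flags & GRE_F_SEQ)
        len += 4;
    return len;
}

/* Caller that only needs to know whether parsing failed. */
static int gre_payload_len_sketch(uint16_t flags, int pkt_len)
{
    int hdr_len = gre_hdr_len_sketch(flags);

    if (hdr_len < 0)
        return -1;
    assert(hdr_len % 4 == 0);
    return pkt_len - hdr_len;
}
```

A single return value covers both cases: callers that need the length use it directly, and callers that do not simply test for a value below zero.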
Eric Dumazet d4011239f4 tcp: guarantee forward progress in tcp_sendmsg()
Under high rx pressure, it is possible tcp_sendmsg() never has a
chance to allocate an skb and loop forever as sk_flush_backlog()
would always return true.

Fix this by calling sk_flush_backlog() only if one skb had been
allocated and filled before last backlog check.

Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 12:44:36 -04:00
David S. Miller cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Christophe Ricard 1c53855f6b nfc: nci: Add nci_nfcc_loopback to the nci core
For test purposes, provide the generic nci loopback function.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:48:16 +02:00
Christophe Ricard 9b8d1a4cf2 nfc: nci: Add an additional parameter to identify a connection id
According to NCI specification, destination type and destination
specific parameters shall uniquely identify a single destination
for the Logical Connection.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:43:21 +02:00
Christophe Ricard de5ea8517c nfc: nci: Fix nci_core_conn_close
nci_core_conn_close was not retrieving a conn_info using the correct
connection id.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:42:31 +02:00
Christophe Ricard 18836029d8 nfc: nci: Fix nci_core_conn_create to allow empty destination
NCI_CORE_CONN_CREATE may not have any destination type parameter.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:41:03 +02:00
Nicolas Dichtel 79e8dc8b80 ipv6/ila: fix nlsize calculation for lwtunnel
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.

Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:21:33 -04:00
Wei Wang 26879da587 ipv6: add new struct ipcm6_cookie
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie, similar to ipcm_cookie in
ipv4, that includes the above-mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:08:14 -04:00
Sowmini Varadhan bd7c5f983f RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.
An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.

The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex.  A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).

Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:03:44 -04:00
Sowmini Varadhan eb19284026 RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").

Specifically, we may end up dereferencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
    rds_tcp_accept_one                  rds_send_xmit

                                        conn is RDS_CONN_UP, so
                                        invoke rds_tcp_xmit

                                        tc = conn->c_transport_data
    rds_tcp_restore_callbacks
        /* reset t_sock */
                                        null ptr deref from tc->t_sock

The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.

Fixes: 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:03:44 -04:00
Eric Dumazet 1d2077ac01 net: add __sock_wfree() helper
Hosts sending a lot of ACK packets exhibit high sock_wfree() cost
because of a cache line miss when testing SOCK_USE_WRITE_QUEUE.

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoids one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:02:36 -04:00
Alexander Duyck 996e802187 net: Disable segmentation if checksumming is not supported
The mlx4 and mlx5 drivers do not support IPv6 checksum
offload for tunnels.  With this being the case we should disable GSO in
addition to the checksum offload features when we find that a device cannot
perform a checksum on a given packet type.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:00:54 -04:00
Jon Paul Maloy 10724cc7bb tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level whose only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.

Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for every 256 consumed messages.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.

A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fixed window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.

This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.

This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.

In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:16 -04:00
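The block-based accounting described in 10724cc7bb can be sketched as plain arithmetic. The helper names are hypothetical and the exact formulas are assumptions drawn from the commit text (1024-byte blocks, sizes rounded upwards, a worst-case truesize/msgsize ratio of four); the kernel's own helpers may differ in detail.

```c
#include <assert.h>

#define FLOWCTL_BLK_SZ 1024u   /* base unit named in the commit message */
#define TRUESIZE_RATIO 4u      /* worst-case truesize/msgsize ratio used */

/* Blocks charged for one message: its size rounded up to whole blocks. */
static unsigned int msg_blocks(unsigned int msglen)
{
    return (msglen + FLOWCTL_BLK_SZ - 1) / FLOWCTL_BLK_SZ;
}

/* Window, in blocks, that a receive buffer of rcvbuf bytes can advertise. */
static unsigned int adv_window_blocks(unsigned int rcvbuf)
{
    assert(rcvbuf >= FLOWCTL_BLK_SZ * TRUESIZE_RATIO);
    return rcvbuf / FLOWCTL_BLK_SZ / TRUESIZE_RATIO;
}
```

Under these assumptions a 2 MB receive buffer advertises 512 blocks, so a small message costs one block while a large message costs proportionally more, instead of every message counting as one unit.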
Jon Paul Maloy 60020e1857 tipc: propagate peer node capabilities to socket layer
During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only been stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.

This commit adds this functionality.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:15 -04:00
Jon Paul Maloy 7c8bcfb125 tipc: re-enable compensation for socket receive buffer double counting
In the refactoring commit d570d86497 ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test

if (sk->sk_backlog.len == 0)
     atomic_set(&tsk->dupl_rcvcnt, 0);

with

if (sk->sk_backlog.len)
     atomic_set(&tsk->dupl_rcvcnt, 0);

This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.

We now fix this by inverting the mentioned condition.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:14 -04:00
Sven Eckelmann 64ae744553 batman-adv: Split batadv_iv_ogm_orig_del_if function
batadv_iv_ogm_orig_del_if handles two different buffers bcast_own and
bcast_own_sum which should be resized. The error handling for
allocating these two buffers causes the complexity of this function. This can
be avoided completely when the function is split into a main function
handling the locking, freeing and call of the subfunctions.

The subfunction can then independently handle the resize of the buffers.
This also makes it easy to reuse the old buffer (which always is larger) in
case a smaller buffer could not be allocated without increasing the code
complexity.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Simon Wunderlich 86de37c1fb batman-adv: Merge batadv_v_ogm_orig_update into batadv_v_ogm_route_update
Since batadv_v_ogm_orig_update() was only called from one place and the
calling function became very short, merge these two functions together.

This should also reflect the protocol description of B.A.T.M.A.N. V
better.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Simon Wunderlich efcc9d3069 batman-adv: move and restructure batadv_v_ogm_forward
To match our code better to the protocol description of B.A.T.M.A.N. V,
move batadv_v_ogm_forward() out into batadv_v_ogm_process_per_outif()
and move all checks directly deciding whether the OGM should be
forwarded into batadv_v_ogm_forward().

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Simon Wunderlich 121bdca0d4 batman-adv: fix debuginfo macro style issue
Structure initialization within the macros should follow the general
coding style used in the kernel: put the initialization of the first
variable and the closing brace on a separate line.

Reported-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Sven Eckelmann 6fc77a5486 batman-adv: Fix function names on new line starting with '*'
Some really long function names in batman-adv require a newline between
return type and the function name. This has led to some lines starting
with *batadv_...

This * belongs to the return type and thus should be on the same line as
the return type.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Sven Eckelmann f298cb94d6 batman-adv: Add kernel-doc for batadv_interface_rx
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Sven Eckelmann 98a5b1d88c batman-adv: Fix kerneldoc for batadv_compare_claim
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Sven Eckelmann d3abce780d batman-adv: Fix checkpatch warning about 'unsigned' type
checkpatch.pl warns about the use of 'unsigned' as a short form for
'unsigned int'.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Antonio Quartulli 6d030de89f batman-adv: fix wrong names in kerneldoc
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: Fix additional names]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2016-05-04 02:22:03 +08:00
Geliang Tang 4ba4bc0f74 batman-adv: use to_delayed_work
Use to_delayed_work() instead of open-coding it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Geliang Tang fb1f23eab6 batman-adv: use list_for_each_entry_safe
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Antonio Quartulli 925a6f3790 batman-adv: use static string for table headers
Use a static string when showing table headers rather than
a nonsense parametric one with fixed arguments.

It is easier to grep and it does not need to be recomputed
at runtime each time.

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2016-05-04 02:22:03 +08:00
Simon Wunderlich 565489df24 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-04 02:22:03 +08:00
Julia Lawall 56130915bb VSOCK: constify vsock_transport structure
The vsock_transport structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 13:03:05 -04:00
Eric Dumazet 9d18562a22 fq_codel: add batch ability to fq_codel_drop()
In presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering fq_codel qdisc.

fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.

Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.

TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.

With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profile in presence of few inelastic flows.

[1] Assuming a 64byte cache line, and 1024 buckets
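
A rough userspace sketch of the batching described above, not the kernel
implementation. The model_* names, the ring-buffer queue and the small flow
count are assumptions made for illustration; only the "scan once, then drop
up to half of the fat flow, capped at max_packets" logic follows the text:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FLOWS 8   /* the commit assumes 1024 buckets; kept small here */
#define MODEL_QLEN  128 /* per-flow FIFO capacity in this model */

struct model_flow {
    unsigned int pkt_len[MODEL_QLEN]; /* ring buffer, oldest packet at head */
    size_t head;
    size_t count;
};

struct model_qdisc {
    struct model_flow flows[MODEL_FLOWS];
    unsigned int backlogs[MODEL_FLOWS]; /* queued bytes per flow */
};

/* Append one packet of len bytes to flow idx; returns 0 on success. */
static int model_enqueue(struct model_qdisc *q, size_t idx, unsigned int len)
{
    struct model_flow *f;

    if (idx >= MODEL_FLOWS || len == 0)
        return -1;
    f = &q->flows[idx];
    if (f->count == MODEL_QLEN)
        return -1;
    f->pkt_len[(f->head + f->count) % MODEL_QLEN] = len;
    f->count++;
    q->backlogs[idx] += len;
    return 0;
}

/*
 * One linear scan finds the fattest flow, then up to max_packets of its
 * oldest packets are dropped, stopping once half of its backlog is gone.
 * Returns the number of packets dropped.
 */
static unsigned int model_drop(struct model_qdisc *q, unsigned int max_packets)
{
    unsigned int maxbacklog = 0, threshold, len = 0, dropped = 0;
    size_t i, idx = 0;
    struct model_flow *f;

    for (i = 0; i < MODEL_FLOWS; i++) {
        if (q->backlogs[i] > maxbacklog) {
            maxbacklog = q->backlogs[i];
            idx = i;
        }
    }
    if (maxbacklog == 0 || max_packets == 0)
        return 0;

    threshold = maxbacklog >> 1;
    f = &q->flows[idx];
    do {
        unsigned int plen;

        assert(f->count > 0); /* a non-zero backlog implies queued packets */
        plen = f->pkt_len[f->head];
        f->head = (f->head + 1) % MODEL_QLEN;
        f->count--;
        q->backlogs[idx] -= plen;
        len += plen;
    } while (++dropped < max_packets && len < threshold && f->count > 0);

    return dropped;
}
```

The footnote's arithmetic works out as follows: 1024 buckets in 4 KB is
4 bytes per bucket, so one scan touches 4096 / 64 = 64 cache lines, and a
batch of 64 drops amortizes that to one cache line per drop.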

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Morton <chromatix99@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Dave Taht
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 12:47:09 -04:00
Neil Horman 6071bd1aa1 netem: Segment GSO packets on enqueue
This was recently reported to me, and reproduced on the latest net kernel,
when attempting to run netperf from a host that had a netem qdisc attached
to the egress interface:

[  788.073771] ---------------------[ cut here ]---------------------------
[  788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[  788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[  788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[  788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G        W
------------   3.10.0-327.el7.x86_64 #1
[  788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[  788.542260]  ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[  788.576332]  ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[  788.611943]  0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[  788.647241] Call Trace:
[  788.658817]  <IRQ>  [<ffffffff816351f1>] dump_stack+0x19/0x1b
[  788.686193]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[  788.713803]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[  788.741314]  [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[  788.767018]  [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[  788.796117]  [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[  788.823392]  [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[  788.854487]  [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[  788.880870]  [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...

The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).

The solution I think is to simply segment the skb in a similar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.

tested successfully by myself on the latest net kernel, to which this applies

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
CC: stephen@networkplumber.org
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:33:14 -04:00
Eric Dumazet 9580bf2edb net: relax expensive skb_unclone() in iptunnel_handle_offloads()
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_unclone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.
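
A small userspace model of the two checks being contrasted, assuming the
dataref encoding in which the low 16 bits count all references to the data
and the high 16 bits count payload-only references. The model_* names are
hypothetical and the struct is a stand-in, not struct sk_buff:

```c
#include <stdbool.h>

#define MODEL_DATAREF_SHIFT 16
#define MODEL_DATAREF_MASK  ((1 << MODEL_DATAREF_SHIFT) - 1)

struct model_skb {
    bool cloned;  /* data may be shared with a clone */
    int dataref;  /* low half: all references; high half: payload-only ones */
};

/* True if anyone else holds a reference to the data at all. */
static bool model_skb_cloned(const struct model_skb *skb)
{
    return skb->cloned && (skb->dataref & MODEL_DATAREF_MASK) != 1;
}

/* True only if someone else holds a reference that covers the header. */
static bool model_skb_header_cloned(const struct model_skb *skb)
{
    int refs;

    if (!skb->cloned)
        return false;
    refs = (skb->dataref & MODEL_DATAREF_MASK) -
           (skb->dataref >> MODEL_DATAREF_SHIFT);
    return refs != 1;
}
```

In this model, a packet with two references of which one is payload-only
counts as cloned but not as header-cloned, so a check based on the second
helper would not need to reallocate the head.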

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:22:19 -04:00
David S. Miller 9b40d5aaef Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
In this small batch of patches you have:
- a fix for our Distributed ARP Table that makes sure that the input
  provided to the hash function during a query is the same as the one
  provided during an insert (so to prevent false negatives), by Antonio
  Quartulli
- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures
  that a hard interface is properly re-activated when it is brought down
  and then up again, by Antonio Quartulli
- two fixes respectively to the reference counting of the tt_local_entry
  and neigh_node objects, by Sven Eckelmann. Such a bug is rather severe
  as it would prevent the netdev objects referenced by batman-adv from
  being released after shutdown.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:17:38 -04:00
Nikolay Aleksandrov a60c090361 bridge: netlink: export per-vlan stats
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:27:06 -04:00
Nikolay Aleksandrov 6dada9b10a bridge: vlan: learn to count
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:27:06 -04:00
Nikolay Aleksandrov 97a47facf3 net: rtnetlink: add linkxstats callbacks and attribute
Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link type
attribute. This allows any number of separate private attributes
and avoids one call to get the dev link type.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:27:06 -04:00
Nikolay Aleksandrov e8872a25a0 net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter
The new prividx argument allows the current dumping device to save a
private state counter which would enable it to continue dumping from
where it left off. And the idxattr is used to save the current idx user
so multiple prividx using attributes can be requested at the same time
as suggested by Roopa Prabhu.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 22:27:06 -04:00
Tom Herbert b05229f442 gre6: Cleanup GREv6 transmit path, call common GRE functions
Changes in GREv6 transmit path:
  - Call gre_checksum, remove gre6_checksum
  - Rename ip6gre_xmit2 to __gre6_xmit
  - Call gre_build_header utility function
  - Call ip6_tnl_xmit common function
  - Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:32 -04:00
Tom Herbert 79ecb90e65 ipv6: Generic tunnel cleanup
A few generic changes to generalize tunnels in IPv6:
  - Export ip6_tnl_change_mtu so that it can be called by ip6_gre
  - Add tun_hlen to ip6_tnl structure.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:32 -04:00
Tom Herbert 182a352d2d gre: Create common functions for transmit
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
  - GRE checksum calculation. Move gre_checksum to gre.h.
  - Building a GRE header. Move GRE build_header and rename
    gre_build_header.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Tom Herbert 8eb30be035 ipv6: Create ip6_tnl_xmit
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Tom Herbert 308edfdf15 gre6: Cleanup GREv6 receive path, call common GRE functions
  - Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
  - Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
    most of the code in ip6gre_rcv.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Tom Herbert 95f5c64c3c gre: Move utility functions to common headers
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).

These include:
  - GRE flag handling functions are moved to gre.h
  - GRE build_header is moved to gre.h and renamed gre_build_header
  - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
  - iptunnel_pull_header is taken out of gre_parse_header. This is now
    done by caller. The header length is returned from gre_parse_header
    in an int* argument.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Tom Herbert 0d3c703a9d ipv6: Cleanup IPv6 tunnel receive path
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
  - Make ip6_tnl_rcv non-static so that GREv6 and others can call it
  - Make ip6_tnl_rcv look like ip_tunnel_rcv
  - Switch to gro_cells_receive
  - Make ip6_tnl_rcv non-static and export it

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Eric Dumazet d41a69f1d3 tcp: make tcp_sendmsg() aware of socket backlog
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:26 -04:00
Eric Dumazet 5413d1babe net: do not block BH while processing socket backlog
Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:26 -04:00
Eric Dumazet 860fbbc343 sctp: prepare for socket backlog behavior change
sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.

It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:26 -04:00
Eric Dumazet e61da9e259 udp: prepare for non BH masking at backlog processing
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet 7309f8821f dccp: do not assume DCCP code is non preemptible
DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet fb3477c0f4 tcp: do not block bh during prequeue processing
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet c10d9310ed tcp: do not assume TCP code is non preemptible
We want to make TCP stack preemptible, as draining prequeue
and backlog queues can take a lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Johannes Berg 866daf6eaa wext: remove a/b/g/n from SIOCGIWNAME
Since a/b/g/n no longer exist as spec amendments and VHT (ex 802.11ac)
wasn't handled at all, it's better to just remove the amendment strings
to avoid confusion.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-05-02 22:48:09 +02:00
Linus Torvalds 9c5d1bc2b7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips,
    from Sara Sharon.

 2) Fix SKB size checks in batman-adv stack on receive, from Sven
    Eckelmann.

 3) Leak fix on mac80211 interface add error paths, from Johannes Berg.

 4) Cannot invoke napi_disable() with BH disabled in myri10ge driver,
    fix from Stanislaw Gruszka.

 5) Fix sign extension problem when computing feature masks in
    net_gso_ok(), from Marcelo Ricardo Leitner.

 6) lan78xx driver doesn't count packets and packet lengths in its
    statistics properly, fix from Woojung Huh.

 7) Fix the buffer allocation sizes in pegasus USB driver, from Petko
    Manolov.

 8) Fix refcount overflows in bpf, from Alexei Starovoitov.

 9) Unified dst cache handling introduced a preempt warning in
    ip_tunnel, fix by resetting rather than setting the cached route.
    From Paolo Abeni.

10) Listener hash collision test fix in soreuseport, from Craig Gallek.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
  gre: do not pull header in ICMP error processing
  net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
  tipc: only process unicast on intended node
  cxgb3: fix out of bounds read
  net/smscx5xx: use the device tree for mac address
  soreuseport: Fix TCP listener hash collision
  net: l2tp: fix reversed udp6 checksum flags
  ip_tunnel: fix preempt warning in ip tunnel creation/updating
  samples/bpf: fix trace_output example
  bpf: fix check_map_func_compatibility logic
  bpf: fix refcnt overflow
  drivers: net: cpsw: use of_phy_connect() in fixed-link case
  dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive
  drivers: net: cpsw: don't ignore phy-mode if phy-handle is used
  drivers: net: cpsw: fix segfault in case of bad phy-handle
  drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
  MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
  gre: reject GUE and FOU in collect metadata mode
  pegasus: fixes reported packet length
  pegasus: fixes URB buffer allocation size;
  ...
2016-05-02 09:40:42 -07:00
Jiri Benc b7f8fe251e gre: do not pull header in ICMP error processing
iptunnel_pull_header expects that IP header was already pulled; with this
expectation, it pulls the tunnel header. This is not true in gre_err.
Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
to the IP header.

We cannot pull the tunnel header in this path. It's just a matter of not
calling iptunnel_pull_header - we don't need any of its effects.

Fixes: bda7bb4634 ("gre: Allow multiple protocol listener for gre protocol.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 00:19:58 -04:00
Marcelo Ricardo Leitner 0970f5b366 sctp: signal sk_data_ready earlier on data chunks reception
Dave Miller pointed out that fb586f2530 ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency especially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.

This patch thus basically inverts the logic on fb586f2530 and signals
it as early as possible, similar to what we had before.

Fixes: fb586f2530 ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 21:06:10 -04:00
Hamish Martin efe790502b tipc: only process unicast on intended node
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3d ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.

While we are aware that this change only patches over a root problem that
we still haven't identified, this is a sanity check that is always
legitimate to perform. It will remain in the code even after we identify and
fix the real problem.
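
The destination check described above can be sketched as follows. This is
a simplified userspace model, not the TIPC code: the model_* types and
fields are hypothetical, and sequence-number wraparound is ignored:

```c
#include <stdbool.h>
#include <stdint.h>

struct model_msg {
    uint32_t dest_node; /* node the unicast message is addressed to */
    uint16_t ack;       /* highest sequence number being acknowledged */
};

struct model_link {
    uint32_t own_node;  /* address of the node owning this link endpoint */
    uint16_t acked;     /* highest sequence number acknowledged by the peer */
};

/* Returns true if the message was processed, false if it was discarded. */
static bool model_rcv_unicast(struct model_link *l, const struct model_msg *m)
{
    /* Only process unicast messages destined for this node. */
    if (m->dest_node != l->own_node)
        return false;

    if (m->ack > l->acked) /* wraparound ignored in this model */
        l->acked = m->ack;
    return true;
}
```

In the scenario from the text, an ack of 20 addressed to node 3 that
reaches node 1 is discarded, so the ack state of the node 1 to node 2 link
cannot move past what node 1 has actually sent.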

Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 21:03:30 -04:00
Jon Paul Maloy def22c47d7 tipc: set 'active' state correctly for first established link
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although in reality it
is ACTIVE.

This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 19:40:22 -04:00
Craig Gallek 90e5d0db2b soreuseport: Fix TCP listener hash collision
I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.

Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 19:36:54 -04:00
Wang Shanker 018f825858 net: l2tp: fix reversed udp6 checksum flags
This patch fixes a bug that makes the kernel handle udp6 checksums of a
udp6 encapsulated l2tp tunnel in the opposite way to what the userspace
program requests.

When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, the
udp6 checksums of packets received on the newly created l2tp tunnel are
expected to be ignored. In `l2tp_netlink.c`:
`l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
according to the flag, and then passed to `l2tp_core.c`:
`l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
`l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
to the same value as `cfg.udp6_zero_rx_checksums`. However, if we want
the checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be
set to `false`, i.e. to the opposite value. Similarly, the same should be
done to `udp_conf.use_udp6_tx_checksums`.
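
The inversion described above can be sketched in isolation. The structs
below are hypothetical stand-ins holding only the fields named in the text;
`udp6_zero_tx_checksums` is assumed by analogy with the rx field:

```c
#include <stdbool.h>

/* Stand-in for the tunnel config filled from the netlink attributes. */
struct model_l2tp_cfg {
    bool udp6_zero_tx_checksums; /* userspace asked for zero tx checksums */
    bool udp6_zero_rx_checksums; /* userspace asked to accept zero rx checksums */
};

/* Stand-in for the UDP socket config. */
struct model_udp_conf {
    bool use_udp6_tx_checksums;
    bool use_udp6_rx_checksums;
};

/* "zero checksum" requested means "do not use checksums", hence the negation. */
static void model_fill_udp_conf(struct model_udp_conf *udp_conf,
                                const struct model_l2tp_cfg *cfg)
{
    udp_conf->use_udp6_tx_checksums = !cfg->udp6_zero_tx_checksums;
    udp_conf->use_udp6_rx_checksums = !cfg->udp6_zero_rx_checksums;
}
```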

Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 19:32:16 -04:00
Nikolay Aleksandrov f4b05d27ec net: constify is_skb_forwardable's arguments
is_skb_forwardable is not supposed to change anything so constify its
arguments

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29 16:13:36 -04:00
Paolo Abeni f27337e16f ip_tunnel: fix preempt warning in ip tunnel creation/updating
After the commit e09acddf87 ("ip_tunnel: replace dst_cache with generic
implementation"), a preemption debug warning is triggered on ip4
tunnels updating; the dst cache helper needs to be invoked in unpreemptible
context.

We don't need to load the cache on tunnel update, so this commit fixes
the warning by replacing the load with a dst cache reset, which is
preempt safe.

Fixes: e09acddf87 ("ip_tunnel: replace dst_cache with generic implementation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29 14:11:46 -04:00
Liping Zhang cec5913c15 netfilter: IDLETIMER: fix race condition when destroy the target
The workqueue may still be running while we destroy the IDLETIMER target,
which causes a use-after-free error. Add cancel_work_sync() to avoid this
situation.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-29 14:28:48 +02:00
Sven Eckelmann abe59c6522 batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node
The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held
an implicit reference to it. But this reference was never stored in form of
a pointer in the batadv_neigh_node itself. Instead
batadv_neigh_node_release depends on a consistent state of
hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the
batadv_hardif_neigh_node object which it has a reference for. But
batadv_hardif_neigh_get cannot guarantee that because it is working only
with rcu_read_lock on this list. It can therefore happen that a neigh_addr
is in this list twice or that batadv_hardif_neigh_get cannot find the
batadv_hardif_neigh_node for a neigh_addr due to some other list
operations taking place at the same time.

Instead add a batadv_hardif_neigh_node pointer directly in
batadv_neigh_node which will be used for the reference counter decremented
on release of batadv_neigh_node.
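
The ownership pattern described above, reduced to a userspace sketch. The
model_* types are hypothetical and the counters are plain integers rather
than the kernel's atomic reference counts:

```c
#include <assert.h>
#include <stdlib.h>

struct model_hardif_neigh {
    unsigned int refcount;
};

struct model_neigh_node {
    unsigned int refcount;
    struct model_hardif_neigh *hardif_neigh; /* explicit, counted reference */
};

/* Create a node that holds its own reference on hn. */
static struct model_neigh_node *
model_neigh_node_create(struct model_hardif_neigh *hn)
{
    struct model_neigh_node *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    n->refcount = 1;
    hn->refcount++;
    n->hardif_neigh = hn;
    return n;
}

/* Drop one reference; on the last one, release exactly the object we hold. */
static void model_neigh_node_put(struct model_neigh_node *n)
{
    assert(n->refcount > 0);
    if (--n->refcount > 0)
        return;
    n->hardif_neigh->refcount--; /* no list lookup needed */
    free(n);
}
```

Because the release path follows the stored pointer, it no longer depends on
a lookup in a list that may be changing concurrently.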

Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-04-29 19:46:11 +08:00
Sven Eckelmann a33d970d0b batman-adv: Fix reference counting of vlan object for tt_local_entry
The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an
implicit reference to it. But this reference was never stored in form of a
pointer in the tt_local_entry itself. Instead batadv_tt_local_remove,
batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend
on a consistent state of bat_priv->softif_vlan_list and that
batadv_softif_vlan_get always returns the batadv_softif_vlan object which
it has a reference for. But batadv_softif_vlan_get cannot guarantee that
because it is working only with rcu_read_lock on this list. It can
therefore happen that a vid is in this list twice or that
batadv_softif_vlan_get cannot find the batadv_softif_vlan for a vid due to
some other list operations taking place at the same time.

Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry
which will be used for the reference counter decremented on release of
batadv_tt_local_entry.

Fixes: 35df3b298f ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-04-29 19:46:11 +08:00
Antonio Quartulli b6cf5d499f batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event
At the moment there is no explicit reactivation of a hard-interface
upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is
reactivated as soon as the next OGM is scheduled for sending, but this
mechanism does not work with B.A.T.M.A.N. V. The latter does not rely
on the same scheduling mechanism as its predecessor and for this reason
the hard-interface remains deactivated forever after being brought down
once.

This patch fixes the reactivation mechanism by adding a new routing API
which explicitly allows each algorithm to perform any needed operation
upon interface re-activation.

This API is optional and is implemented by B.A.T.M.A.N. V only, where it
just takes care of setting the iface status to ACTIVE.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2016-04-29 19:46:11 +08:00
Antonio Quartulli 2871734e85 batman-adv: fix DAT candidate selection (must use vid)
Now that DAT is VLAN aware, it must use the VID when
computing the DHT address of the candidate nodes where
an entry is going to be stored/retrieved.
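
A minimal sketch of why the VID has to be part of the DHT key: the same IPv4
address on two VLANs must map to two different lookup keys. The names
dat_key and dht_address, the 16-bit address space and the use of CRC32 as the
hash are assumptions made for illustration, not the batman-adv implementation.

```python
import socket
import struct
import zlib

DHT_ADDR_SPACE = 1 << 16  # assumed size of the DHT address space


def dat_key(ip: str, vid: int) -> bytes:
    """Pack the IPv4 address together with the VLAN ID into one lookup key."""
    return socket.inet_aton(ip) + struct.pack("!H", vid)


def dht_address(ip: str, vid: int) -> int:
    """Map (ip, vid) to a DHT address; CRC32 stands in for the real hash."""
    return zlib.crc32(dat_key(ip, vid)) % DHT_ADDR_SPACE
```

Because the VID is packed into the key, an entry stored for one VLAN is looked
up under a key that differs from the same IP on another VLAN.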

Fixes: be1db4f661 ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2016-04-29 19:46:10 +08:00
Florian Westphal 70d72b7e06 netfilter: conntrack: init all_locks to avoid debug warning
Else we get 'BUG: spinlock bad magic on CPU#' on resize when
spin lock debugging is enabled.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-29 11:27:10 +02:00
Pablo Neira Ayuso 92b4423e3a netfilter: fix IS_ERR_VALUE usage
This is a forward-port of the original patch from Andrzej Hajda,
he said:

"IS_ERR_VALUE should be used only with unsigned long type.
Otherwise it can work incorrectly. To achieve this function
xt_percpu_counter_alloc is modified to return unsigned long,
and its result is assigned to temporary variable to perform
error checking, before assigning to .pcnt field.

The patch follows conclusion from discussion on LKML [1][2].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"
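
The quoted reasoning can be illustrated with a small model. It assumes a
64-bit unsigned long and treats IS_ERR_VALUE(x) as "x >= (unsigned long)-MAX_ERRNO";
the helper name is_err_value is only for this sketch.

```python
MAX_ERRNO = 4095
ULONG_BITS = 64                      # assumption: 64-bit unsigned long
ULONG_MASK = (1 << ULONG_BITS) - 1
UINT_MASK = (1 << 32) - 1
ENOMEM = 12


def is_err_value(x: int) -> bool:
    """Model of IS_ERR_VALUE(): true for the top MAX_ERRNO values of unsigned long."""
    return (x & ULONG_MASK) >= ((-MAX_ERRNO) & ULONG_MASK)


# -ENOMEM kept in an unsigned long is recognised as an error value ...
as_ulong = (-ENOMEM) & ULONG_MASK
# ... but after a round trip through a 32-bit unsigned int (zero-extended)
# it is just a large positive number and the check no longer fires.
as_uint = (-ENOMEM) & UINT_MASK
```

This is why the result is first held in an unsigned long temporary and checked
there, before being assigned to a narrower field.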

Original patch from Andrzej is here:

http://patchwork.ozlabs.org/patch/582970/

This patch has clashed with input validation fixes for x_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-29 11:02:33 +02:00
Linus Torvalds 6fa9bffbcc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "There is a lifecycle fix in the auth code, a fix for a narrow race
  condition on map, and a helpful message in the log when there is a
  feature mismatch (which happens frequently now that the default
  server-side options have changed)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: report unsupported features to syslog
  rbd: fix rbd map vs notify races
  libceph: make authorizer destruction independent of ceph_auth_client
2016-04-28 18:59:24 -07:00
Florian Fainelli badf3ada60 net: dsa: Provide CPU port statistics to master netdev
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).

We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.

We take this approach as opposed to providing a set of DSA helper
functions that would retrieve the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.

The new ethtool -S <iface> output would therefore look like this now:
<iface> statistics
p<2 digits cpu port number>_<switch MIB counter names>
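
The overlay described above can be sketched as a wrapper around the saved
original operations. The classes MasterOps and DsaOverlayOps and the sample
counter names are hypothetical; only the three overloaded operations and the
p<2 digits>_ prefix come from the description.

```python
class MasterOps:
    """Stand-in for the master netdev's original ethtool_ops (hypothetical)."""

    def get_sset_count(self):
        return 2

    def get_strings(self):
        return ["rx_packets", "tx_packets"]

    def get_ethtool_stats(self):
        return [100, 90]


class DsaOverlayOps:
    """Calls into the saved original ops, then appends CPU-port switch counters."""

    def __init__(self, orig_ops, cpu_port, switch_mib):
        self.orig_ops = orig_ops      # retained copy of the original ops
        self.cpu_port = cpu_port
        self.switch_mib = switch_mib  # ordered mapping: counter name -> value

    def get_sset_count(self):
        return self.orig_ops.get_sset_count() + len(self.switch_mib)

    def get_strings(self):
        prefix = "p%02d_" % self.cpu_port
        return self.orig_ops.get_strings() + [prefix + n for n in self.switch_mib]

    def get_ethtool_stats(self):
        return self.orig_ops.get_ethtool_stats() + list(self.switch_mib.values())


ops = DsaOverlayOps(MasterOps(), cpu_port=8, switch_mib={"RxOctets": 1234})
```

The master driver's own operations are left untouched, which is the point of
the approach: the MAC driver needs no DSA-specific code.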

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:16:17 -04:00
Eric Dumazet 0cef6a4c34 tcp: give prequeue mode some care
TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.

Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.

The current prequeue implementation has the following issues:

1) It only checks size of the prequeue against sk_rcvbuf

   It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
   But we now have ~8MB values to cope with modern networking needs.
   We have to add sk_rmem_alloc in the equation, since out of order
   packets can definitely use up to sk_rcvbuf memory themselves.

2) Even with a fixed memory truesize check, prequeue can be filled
   by thousands of packets. When prequeue needs to be flushed, either
   from softirq context (in tcp_prequeue() or timer code), or process
   context (in tcp_prequeue_process()), this adds a latency spike
   which is often not desirable.
   I added a fixed limit of 32 packets, as this translated to a max
   flush time of 60 us on my test hosts.

   Also note that all packets in prequeue are not accounted for tcp_mem,
   since they are not charged against sk_forward_alloc at this point.
   This is probably not a big deal.
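
The two limits above can be written as a single predicate. This is a sketch
under the assumption that both counters already include the packet that was
just queued; the function and parameter names are illustrative.

```python
PREQUEUE_MAX_PACKETS = 32  # fixed limit named in the description above


def prequeue_must_flush(qlen, prequeue_truesize, sk_rmem_alloc, sk_rcvbuf):
    """Return True when the prequeue should be flushed to the stack.

    qlen and prequeue_truesize describe the prequeue after the new packet
    has been added (an assumption of this sketch).
    """
    too_many_packets = qlen >= PREQUEUE_MAX_PACKETS
    # sk_rmem_alloc is added because out-of-order packets may already be
    # using up to sk_rcvbuf bytes on their own.
    too_much_memory = prequeue_truesize + sk_rmem_alloc > sk_rcvbuf
    return too_many_packets or too_much_memory
```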

Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:14:35 -04:00
Jiri Benc 946b636f17 gre: reject GUE and FOU in collect metadata mode
The collect metadata mode does not support GUE nor FOU. This might be
implemented later; until then, we should reject such config.

I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.

For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.
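
A sketch of the value check, with the netlink attributes modelled as a plain
dict. The key names are hypothetical; the numeric encap types follow the
kernel's tunnel_encap_types enum (NONE, FOU, GUE).

```python
TUNNEL_ENCAP_NONE = 0
TUNNEL_ENCAP_FOU = 1
TUNNEL_ENCAP_GUE = 2


def validate_gre_config(attrs: dict) -> bool:
    """attrs models the parsed netlink attributes (hypothetical key names)."""
    collect_md = "collect_metadata" in attrs
    # iproute2 always sends the encap type attribute, so presence alone
    # says nothing; only a value other than NONE requests FOU or GUE.
    encap_type = attrs.get("encap_type", TUNNEL_ENCAP_NONE)
    if collect_md and encap_type != TUNNEL_ENCAP_NONE:
        return False
    return True
```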

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:09:37 -04:00
Jiri Benc 2090714e1d gre: build header correctly for collect metadata tunnels
In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
case, the interface is ARPHRD_IPGRE and the protocol to be used for
encapsulation is skb->protocol.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:02:45 -04:00
Jiri Benc a64b04d86d gre: do not assign header_ops in collect metadata mode
In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a0).
This is not the case, we're controlling the encapsulation addresses by
lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
mode does not make sense.

Although it would be more user friendly to reject requests that specify
both the collect metadata flag and a remote/local IP address, this would
break current users of gretap or introduce ugly code and differences in
handling ipgre and gretap configuration. Keep the current behavior of
remote/local IP address being ignored in such case.

v3: Back to v1, added explanation paragraph.
v2: Reject configuration specifying both remote/local address and collect
    metadata flag.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:02:44 -04:00
David S. Miller 12395d0647 Just a single fix, for a per-CPU memory leak in a
(root user triggerable) error case.

Merge tag 'mac80211-for-davem-2016-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a single fix, for a per-CPU memory leak in a
(root user triggerable) error case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:55:26 -04:00
Dan Carpenter b43586576e tipc: remove an unnecessary NULL check
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:54:12 -04:00
David S. Miller 956a7ffe00 In this patchset you can find the following fixes:
1) check skb size to avoid reading beyond its border when delivering
    payloads, by Sven Eckelmann
 2) initialize last_seen time in neigh_node object to prevent cleanup
    routine from accidentally purging it, by Marek Lindner
 3) release "recently added" slave interfaces upon virtual/batman
    interface shutdown, by Sven Eckelmann
 4) properly decrease router object reference counter upon routing table
    update, by Sven Eckelmann
 5) release queue slots when purging OGM packets of deactivating slave
    interface, by Linus Lüssing
 
 Patch 2 and 3 have no "Fixes:" tag because the offending commits date
 back to when batman-adv was not yet officially in the net tree.

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
In this patchset you can find the following fixes:

1) check skb size to avoid reading beyond its border when delivering
   payloads, by Sven Eckelmann
2) initialize last_seen time in neigh_node object to prevent cleanup
   routine from accidentally purging it, by Marek Lindner
3) release "recently added" slave interfaces upon virtual/batman
   interface shutdown, by Sven Eckelmann
4) properly decrease router object reference counter upon routing table
   update, by Sven Eckelmann
5) release queue slots when purging OGM packets of deactivating slave
   interface, by Linus Lüssing

Patch 2 and 3 have no "Fixes:" tag because the offending commits date
back to when batman-adv was not yet officially in the net tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:42:40 -04:00
Jason Wang 3df97ba830 tuntap: calculate rps hash only when needed
There's no need to calculate the rps hash if RPS is not enabled. So this
patch exports rps_needed and checks it before trying to get the rps
hash. Tests (using pktgen to inject packets into the guest) show this can
improve pps by about 13% (when rps is disabled).

Before:
~1150000 pps
After:
~1300000 pps

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:38:54 -04:00
Martin KaFai Lau a166140e81 tcp: Handle eor bit when fragmenting a skb
When fragmenting a skb, the next_skb should carry
the eor from prev_skb.  The eor of prev_skb should
also be reset.
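
A toy model of the rule, assuming an skb is just a byte string plus an eor
flag. The Skb class and the fragment function are illustrative only and do
not mirror the kernel data structures.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    data: bytes
    eor: bool = False


def fragment(prev: Skb, length: int) -> Skb:
    """Split prev at `length` bytes; the tail becomes next_skb."""
    # The tail now holds the last byte of the message, so it inherits eor.
    next_skb = Skb(data=prev.data[length:], eor=prev.eor)
    prev.data = prev.data[:length]
    prev.eor = False  # the head no longer ends the record
    return next_skb
```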

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:19 -04:00
Martin KaFai Lau a643b5d41c tcp: Handle eor bit when coalescing skb
This patch:
1. Prevent next_skb from coalescing to the prev_skb if
   TCP_SKB_CB(prev_skb)->eor is set
2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
   allowed
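
The two rules can be sketched with a toy skb model (a byte string plus an eor
flag). Skb and try_coalesce are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    data: bytes
    eor: bool = False


def try_coalesce(prev: Skb, nxt: Skb) -> bool:
    """Merge nxt into prev unless prev is sealed by eor."""
    if prev.eor:
        return False          # rule 1: nothing may be appended after an eor
    prev.data += nxt.data
    prev.eor = nxt.eor        # rule 2: the merged skb ends where nxt ended
    return True
```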

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:19 -04:00
Martin KaFai Lau c134ecb878 tcp: Make use of MSG_EOR in tcp_sendmsg
This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg.  The eor bit
will prevent data from appending to that skb in the future.

The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.

This patch handles the tcp_sendmsg case.  The followup patches
will handle other skb coalescing and fragment cases.

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).
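
A simplified model of the send path, ignoring MSS and skb size limits: data
is appended to the tail skb of the write queue unless that skb is sealed by
eor. The Skb class and sendmsg function are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    data: bytes
    eor: bool = False


def sendmsg(write_queue: list, payload: bytes, msg_eor: bool = False) -> None:
    """Queue payload; with msg_eor the skb holding the last byte is sealed."""
    if write_queue and not write_queue[-1].eor:
        write_queue[-1].data += payload      # append to the tail skb
    else:
        write_queue.append(Skb(payload))     # empty queue, or tail sealed by eor
    if msg_eor:
        write_queue[-1].eor = True
```

With each response sent using MSG_EOR, every message ends on an skb boundary,
which is what makes a per-message ack timestamp meaningful.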

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:18 -04:00
Soheil Hassas Yeganeh 0a2cf20c3f tcp: remove SKBTX_ACK_TSTAMP since it is redundant
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Soheil Hassas Yeganeh 863c1fd981 tcp: remove an unnecessary check in tcp_tx_timestamp
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.

tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Eric Dumazet 13415e46c5 net: snmp: kill STATS_BH macros
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.

Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.

This more closely matches convention used to update
percpu variables.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:25 -04:00
Eric Dumazet f3832ed2c2 ipv6: kill ICMP6MSGIN_INC_STATS_BH()
IPv6 ICMP stats are atomics anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:25 -04:00
Eric Dumazet c2005eb010 ipv6: rename IP6_UPD_PO_STATS_BH()
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:25 -04:00
Eric Dumazet 1d01550359 ipv6: rename IP6_INC_STATS_BH()
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet 02a1d6e7a6 net: rename NET_{ADD|INC}_STATS_BH()
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet b15084ec7d net: rename IP_UPD_PO_STATS_BH()
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet 98f619957e net: rename IP_ADD_STATS_BH()
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet a16292a0f0 net: rename ICMP6_INC_STATS_BH()
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet b45386efa2 net: rename IP_INC_STATS_BH()
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 08e3baef65 net: sctp: rename SCTP_INC_STATS_BH()
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 214d3f1f87 net: icmp: rename ICMPMSGIN_INC_STATS_BH()
Remove misleading _BH suffix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 90bbcc6083 net: tcp: rename TCP_INC_STATS_BH
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 02c223470c net: udp: rename UDP_INC_STATS_BH()
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 5d3848bc33 net: rename ICMP_INC_STATS_BH()
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
Eric Dumazet aa62d76b6e dccp: rename DCCP_INC_STATS_BH()
Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
Eric Dumazet 6aef70a851 net: snmp: kill various STATS_USER() helpers
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.

After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.

We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()

Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
David S. Miller c0cc53162a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes in the conflicts.

In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 15:43:10 -04:00
David Ahern 8c14586fc3 net: ipv6: Use passed in table for nexthop lookups
Similar to 3bfd847203 ("net: Use passed in table for nexthop lookups")
for IPv4, if the route spec contains a table id use that to lookup the
next hop first and fall back to a full lookup if it fails (per the fix
4c9bcd1179 ("net: Fix nexthop lookups")).
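
The lookup order can be sketched as follows. Routing tables are modelled as
lists of (prefix, device) pairs and the "full lookup" as a search across all
tables; both are simplifications, and the function names are hypothetical.

```python
import ipaddress


def lookup_in(routes, gateway):
    """Return the device of the first prefix containing gateway, else None."""
    addr = ipaddress.ip_address(gateway)
    for prefix, dev in routes:
        if addr in ipaddress.ip_network(prefix):
            return dev
    return None


def find_nexthop_dev(tables, table_id, gateway):
    """Try the table named in the route spec first, then fall back."""
    if table_id is not None and table_id in tables:
        dev = lookup_in(tables[table_id], gateway)
        if dev is not None:
            return dev
    # Fallback: full lookup, modelled here as searching every table.
    for routes in tables.values():
        dev = lookup_in(routes, gateway)
        if dev is not None:
            return dev
    return None


tables = {
    "red": [("2100:1::/120", "eth1")],
    "main": [("2001:db8::/64", "eth0")],
}
```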

Example:

    root@kenny:~# ip -6 ro ls table red
    local 2100:1::1 dev lo  proto none  metric 0  pref medium
    2100:1::/120 dev eth1  proto kernel  metric 256  pref medium
    local 2100:2::1 dev lo  proto none  metric 0  pref medium
    2100:2::/120 dev eth2  proto kernel  metric 256  pref medium
    local fe80::e0:f9ff:fe09:3cac dev lo  proto none  metric 0  pref medium
    local fe80::e0:f9ff:fe1c:b974 dev lo  proto none  metric 0  pref medium
    fe80::/64 dev eth1  proto kernel  metric 256  pref medium
    fe80::/64 dev eth2  proto kernel  metric 256  pref medium
    ff00::/8 dev red  metric 256  pref medium
    ff00::/8 dev eth1  metric 256  pref medium
    ff00::/8 dev eth2  metric 256  pref medium
    unreachable default dev lo  metric 240  error -113 pref medium

    root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
    RTNETLINK answers: No route to host

Route add fails even though 2100:1::64 is a reachable next hop:
    root@kenny:~# ping6 -I red  2100:1::64
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms

With this patch:
    root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
    root@kenny:~# ip -6 ro ls table red
    local 2100:1::1 dev lo  proto none  metric 0  pref medium
    2100:1::/120 dev eth1  proto kernel  metric 256  pref medium
    local 2100:2::1 dev lo  proto none  metric 0  pref medium
    2100:2::/120 dev eth2  proto kernel  metric 256  pref medium
    2100:3::/64 via 2100:1::64 dev eth1  metric 1024  pref medium
    local fe80::e0:f9ff:fe09:3cac dev lo  proto none  metric 0  pref medium
    local fe80::e0:f9ff:fe1c:b974 dev lo  proto none  metric 0  pref medium
    fe80::/64 dev eth1  proto kernel  metric 256  pref medium
    fe80::/64 dev eth2  proto kernel  metric 256  pref medium
    ff00::/8 dev red  metric 256  pref medium
    ff00::/8 dev eth1  metric 256  pref medium
    ff00::/8 dev eth2  metric 256  pref medium
    unreachable default dev lo  metric 240  error -113 pref medium

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 15:34:42 -04:00
Johannes Berg d686b920ab nl80211: use nla_put_u64_64bit() for the remaining u64 attributes
Nicolas converted most users, but didn't realize some were generated
by macros. Convert those over as well.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-27 11:01:13 +02:00
Johannes Berg e6436be21e mac80211: fix statistics leak if dev_alloc_name() fails
In the case that dev_alloc_name() fails, e.g. because the name was
given by the user and already exists, we need to clean up properly
and free the per-CPU statistics. Fix that.

Cc: stable@vger.kernel.org
Fixes: 5a490510ba ("mac80211: use per-CPU TX/RX statistics")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-27 10:06:58 +02:00
Florian Westphal f0cdf76c10 net: remove NETDEV_TX_LOCKED support
No more users in the tree, remove NETDEV_TX_LOCKED support.
Adds another hole in softnet_stats struct, but better than keeping
the unused collision counter around.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 15:53:05 -04:00
Xin Long f052f20a82 sctp: sctp_diag should fill RMEM_ALLOC with asoc->rmem_alloc when rcvbuf_policy is set
For an sctp assoc, when rcvbuf_policy is set, it will have its own
rmem_alloc. When we dump asoc info in sctp_diag, we should use that
value for RMEM_ALLOC as well, just like WMEM_ALLOC.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 15:18:48 -04:00
David S. Miller c0b0479307 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-04-26

Here's another set of Bluetooth & 802.15.4 patches for the 4.7 kernel:

 - Cleanups & refactoring of ieee802154 & 6lowpan code
 - Security related additions to ieee802154 and mrf24j40 driver
 - Memory corruption fix to Bluetooth 6lowpan code
 - Race condition fix in vhci driver
 - Enhancements to the atusb 802.15.4 driver

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 13:15:56 -04:00
Nicolas Dichtel 9854518ea0 sched: align nlattr properly when needed
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:49 -04:00
Nicolas Dichtel b676338fb3 neigh: align nlattr properly when needed
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:49 -04:00
Nicolas Dichtel 270cb4d05b rtnl: align nlattr properly when needed
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:49 -04:00
Nicolas Dichtel 66c7a5ee1a ovs: align nlattr properly when needed
I also fix commit 8b32ab9e6ef1: use nla_total_size_64bit() for
OVS_FLOW_ATTR_USED in ovs_flow_cmd_msg_size().
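
The size difference can be shown with a small model of the netlink helpers.
It assumes the usual 4-byte attribute header and 4-byte alignment, and that
a padding attribute is reserved only on architectures without efficient
unaligned access; the Python function names merely mirror the C helpers.

```python
NLA_ALIGNTO = 4
NLA_HDRLEN = 4  # aligned size of struct nlattr


def nla_align(n: int) -> int:
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_attr_size(payload: int) -> int:
    return NLA_HDRLEN + payload


def nla_total_size(payload: int) -> int:
    return nla_align(nla_attr_size(payload))


def nla_total_size_64bit(payload: int, efficient_unaligned_access: bool = False) -> int:
    size = nla_align(nla_attr_size(payload))
    if not efficient_unaligned_access:
        size += nla_align(nla_attr_size(0))  # room for an empty padding attribute
    return size
```

A message size estimate that uses nla_total_size() for a u64 written with
nla_put_u64_64bit() can therefore come up 4 bytes short.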

Fixes: 8b32ab9e6ef1 ("ovs: use nla_put_u64_64bit()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
Nicolas Dichtel 6ed46d1247 sock_diag: align nlattr properly when needed
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f1
which is only in net-next right now, thus I didn't make a separate patch.

Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
David Ahern 38bd10c447 net: ipv6: Delete host routes on an ifdown
It was a simple idea -- save IPv6 configured addresses on a link down
so that IPv6 behaves similar to IPv4. As always the devil is in the
details and the IPv6 stack has too many behavioral differences from IPv4
making the simple idea more complicated than it needs to be.

The current implementation for keeping IPv6 addresses can panic or spit
out a warning in one of many paths:

1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
   rt6_fill_node while handling a route dump request.

2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
   fib6_del

3. Panic in fib6_purge_rt because rt6i_ref count is not 1.

The root cause of all these is references related to the host route for
an address that is retained.

So, this patch deletes the host route every time the ifdown loop runs.
Since the host route is deleted and will be re-generated on up there is
no longer a need for the l3mdev fix up. On the 'admin up' side move
addrconf_permanent_addr into the NETDEV_UP event handling so that it
runs only once versus on UP and CHANGE events.

All of the current panics and warnings appear to be related to
addresses on the loopback device, but given the catastrophic nature when
a bug is triggered this patch takes the conservative approach and evicts
all host routes rather than trying to determine when it can be re-used
and when it can not. That can be a later optimization if desired.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 11:48:26 -04:00
David S. Miller 6a923934c3 Revert "ipv6: Revert optional address flusing on ifdown."
This reverts commit 841645b5f2.

Ok, this puts the feature back.  I've decided to apply David A.'s
bug fix and run with that rather than make everyone wait another
whole release for this feature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 11:47:41 -04:00
Kanchanapally, Vidyullatha e705498945 cfg80211: Add option to report the bss entry in connect result
Since cfg80211 maintains separate BSS table entries for APs if the same
BSSID, SSID pair is seen on multiple channels, it is possible that it
can map the current_bss to a BSS entry on the wrong channel. This
current_bss will not get flushed unless disconnected and cfg80211
reports a wrong channel as the associated channel.

Fix this by introducing a new cfg80211_connect_bss() function which is
similar to cfg80211_connect_result(), but it includes an additional
parameter: the bss the STA is connected to. This allows drivers to
provide the exact bss entry that matches the BSS to which the connection
was completed.

Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-26 09:40:12 +02:00
Mohammed Shafi Shajakhan 739960f128 cfg80211/nl80211: Add support for NL80211_STA_INFO_RX_DURATION
Add support for a station statistics netlink attribute:
NL80211_STA_INFO_RX_DURATION.

If present, this attribute contains the aggregate PPDU duration (in
microseconds) for all the frames from the peer. This is useful to
help understand the total time spent transmitting to us by all of
the connected peers.

Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-26 09:40:11 +02:00
Tom Herbert 90bfe662db ila: add checksum neutral ILA translations
Support checksum neutral ILA as described in the ILA draft. The low
order 16 bits of the identifier are used to contain the checksum
adjustment value.

The csum-mode parameter is added to describe checksum processing. There
are three values:
 - adjust transport checksum (previous behavior)
 - do checksum neutral mapping
 - do nothing

On output the csum-mode in the ila_params is checked and acted on. If
mode is checksum neutral mapping then do mapping and set C-bit.

On input, C-bit is checked. If it is set checksum-neutral mapping is
done (regardless of csum-mode in ila params) and C-bit will be cleared.
If it is not set then action in csum-mode is taken.
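A minimal sketch of the checksum-neutral idea, not the kernel code. It treats the IPv6 address as eight 16-bit words, with words 0..3 as the locator and word 7 as the low-order 16 bits of the identifier. When the locator is replaced, the one's-complement difference between the old and new locator is folded into word 7, so the address sums to the same value and a transport checksum over the pseudo-header stays valid. The function names are hypothetical, and C-bit handling is deliberately left out.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t sketch_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement sum over n 16-bit words. */
static uint16_t sketch_csum16(const uint16_t *words, int n)
{
	uint32_t sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += words[i];
	return sketch_fold(sum);
}

/*
 * Replace the locator (words 0..3) and compensate in word 7 so the
 * one's-complement sum of the whole address does not change.
 */
static void sketch_ila_translate_neutral(uint16_t addr[8],
					 const uint16_t new_loc[4])
{
	uint16_t old_sum = sketch_csum16(addr, 4);
	uint16_t new_sum = sketch_csum16(new_loc, 4);
	/* old - new in one's-complement arithmetic */
	uint16_t adj = sketch_fold((uint32_t)old_sum +
				   (uint16_t)~new_sum);
	int i;

	for (i = 0; i < 4; i++)
		addr[i] = new_loc[i];
	addr[7] = sketch_fold((uint32_t)addr[7] + adj);
}
```

Because the adjustment overwrites the low-order 16 bits of the identifier, the receiver cannot match on the full identifier until the translation is undone.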

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 01:27:07 -04:00
Tom Herbert 642c2c9558 ila: xlat changes
Change model of xlat to be used only for input where lookup is done on
the locator part of an address (comparing to locator_match as key
in rhashtable). This is needed for checksum neutral translation
which obfuscates the low order 16 bits of the identifier. It also
permits hosts to be in multiple ILA domains (each locator can map
to a different SIR address). A check is also added to disallow
translating non-ILA addresses (check of type in identifier).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 01:26:04 -04:00
Tom Herbert 351596aad5 ila: Add struct definitions and helpers
Add structures for identifiers, locators, and an ila address which
is composed of a locator and identifier and in6_addr can be cast to
it. This includes a three bit type field and enums for the types defined
in ILA I-D.

In ILA lwt don't allow user to set a translation for a non-ILA
address (type of identifier is zero meaning it is an IID). This also
requires that the destination prefix is at least 65 bytes (64
bit locator and first byte of identifier).
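An illustrative userspace sketch of the layout described above, assuming the three-bit type field sits in the high-order bits of the first identifier byte (as in the ILA draft) and that type zero denotes a plain interface identifier. The struct and function names are hypothetical and do not match the kernel definitions.

```c
#include <stdint.h>
#include <string.h>

/* A 128-bit address viewed as a 64-bit locator plus 64-bit identifier. */
struct sketch_ila_addr {
	uint8_t locator[8];
	uint8_t identifier[8];
};

#define SKETCH_ILA_TYPE_IID 0	/* assumption: zero means "not ILA" */

/* Assumption: the type is the top three bits of the identifier. */
static unsigned int sketch_ila_type(const struct sketch_ila_addr *a)
{
	return a->identifier[0] >> 5;
}

/* Returns 1 if the raw 16-byte address may be translated, else 0. */
static int sketch_ila_can_translate(const uint8_t raw[16])
{
	struct sketch_ila_addr a;

	/* Copy instead of casting to stay clear of aliasing issues. */
	memcpy(&a, raw, sizeof(a));
	return sketch_ila_type(&a) != SKETCH_ILA_TYPE_IID;
}
```

Since the type lives in the first byte of the identifier, a route prefix has to cover the locator and the start of the identifier before the check can be applied.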

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 01:25:22 -04:00
Glenn Ruben Bakke 55441070ca Bluetooth: 6lowpan: Fix memory corruption of ipv6 destination address
The memcpy of ipv6 header destination address to the skb control block
(skb->cb) in header_create() results in corrupted memory when bt_xmit()
is issued. The skb->cb is "released" in the return of header_create()
making room for lower layer to manipulate the skb->cb.

The value retrieved in bt_xmit is not persistent across header creation
and sending, and the lower layer will overwrite portions of skb->cb,
making the copied destination address wrong.

The memory corruption will lead to non-working multicast as the first 4
bytes of the copied destination address is replaced by a value that
resolves into a non-multicast prefix.

This fix removes the dependency on the skb control block between header
creation and send, by moving the destination address memcpy to the send
function path (setup_create, which is called from bt_xmit).

Signed-off-by: Glenn Ruben Bakke <glenn.ruben.bakke@nordicsemi.no>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.5+
2016-04-26 01:08:25 +02:00
Sowmini Varadhan 947d2756cd RDS: TCP: Call pskb_extract() helper function
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvement when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 16:54:14 -04:00
Sowmini Varadhan 6fa01ccd88 skbuff: Add pskb_extract() helper function
A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.

The existing code uses the sequence
	clone = skb_clone(..);
	pskb_pull(clone, off, ..);
	pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().

To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.
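A loose userspace analogy of the two approaches, using flat byte buffers rather than sk_buffs. The first helper duplicates the whole segment and only then keeps the wanted range, which mirrors the redundant copy described above. The second copies just the `to_copy' bytes starting at `off'. Both helper names are hypothetical, and each returns the number of bytes it copied, or -1 on error.

```c
#include <stdlib.h>
#include <string.h>

/* Analogue of the old pattern: copy everything, then discard. */
static long sketch_extract_copy_all(const unsigned char *seg, size_t len,
				    size_t off, size_t to_copy,
				    unsigned char *out)
{
	unsigned char *tmp;

	if (len == 0 || off > len || to_copy > len - off)
		return -1;
	tmp = malloc(len);
	if (!tmp)
		return -1;
	memcpy(tmp, seg, len);			/* includes bytes we drop */
	memcpy(out, tmp + off, to_copy);	/* keep only the range */
	free(tmp);
	return (long)(len + to_copy);
}

/* Analogue of the new helper: copy only the requested range. */
static long sketch_extract_direct(const unsigned char *seg, size_t len,
				  size_t off, size_t to_copy,
				  unsigned char *out)
{
	if (off > len || to_copy > len - off)
		return -1;
	memcpy(out, seg + off, to_copy);
	return (long)to_copy;
}
```

Both helpers produce the same output bytes. The difference is only in how much data is moved to get there, which is where the gain comes from.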

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 16:54:14 -04:00
Michal Kazior d068ca2ae2 codel: split into multiple files
It was impossible to include codel.h for the
purpose of having access to codel_params or
codel_vars structure definitions and using them
for embedding in other more complex structures.

This split allows codel.h itself to be treated
like any other header file while codel_qdisc.h and
codel_impl.h contain function definitions with
logic that was previously in codel.h.

This copies over copyrights and doesn't involve
code changes other than adding a few additional
include directives to net/sched/sch*codel.c.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 16:44:27 -04:00
Michal Kazior 79bdc4c862 codel: generalize the implementation
This strips out qdisc specific bits from the code
and makes it slightly more reusable. Codel will be
used by wireless/mac80211 in the future.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 16:44:27 -04:00
Eric Dumazet 960a26282f net: better drop monitoring in ip{6}_recv_error()
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:48:10 -04:00
Eric Dumazet 0aea76d35c tcp: SYN packets are now simply consumed
We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.

We replace the kfree_skb() by consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.

[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:48:10 -04:00
David S. Miller 841645b5f2 ipv6: Revert optional address flusing on ifdown.
This reverts the following three commits:

70af921db6
799977d9aa
f1705ec197

The feature was ill conceived, has terrible semantics, and has added
nothing but regressions to the already fragile ipv6 stack.

Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:33:55 -04:00
Nicolas Dichtel 2dad624e6d wireless: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:11 -04:00
Nicolas Dichtel cbdeafd7e1 netfilter/ipvs: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:11 -04:00
Nicolas Dichtel a558da0916 ieee802154: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:11 -04:00
Nicolas Dichtel 1c714a9283 l2tp: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:10 -04:00