Commit Graph

45143 Commits

Author SHA1 Message Date
Florian Fainelli bc1727d242 net: dsa: Move ports assignment closer to error checking
Move the assignment of ports in _dsa_register_switch() closer to where
it is checked, no functional change. Re-order declarations to be
preserve the inverted christmas tree style.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26 15:43:53 -05:00
Florian Fainelli 3512a8e95e net: dsa: Suffix function manipulating device_node with _dn
Make it clear that these functions take a device_node structure pointer

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26 15:43:53 -05:00
Florian Fainelli 293784a8f8 net: dsa: Make most functions take a dsa_port argument
In preparation for allowing platform data, and therefore no valid
device_node pointer, make most DSA functions takes a pointer to a
dsa_port structure whenever possible. While at it, introduce a
dsa_port_is_valid() helper function which checks whether port->dn is
NULL or not at the moment.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26 15:43:53 -05:00
Florian Fainelli 55ed0ce089 net: dsa: Pass device pointer to dsa_register_switch
In preparation for allowing dsa_register_switch() to be supplied with
device/platform data, pass down a struct device pointer instead of a
struct device_node.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26 15:43:52 -05:00
David S. Miller 49b3eb7725 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - ignore self-generated loop detect MAC addresses in translation table,
    by Simon Wunderlich
 
  - install uapi batman_adv.h header, by Sven Eckelmann
 
  - bump copyright years, by Sven Eckelmann
 
  - Remove an unused variable in translation table code, by Sven Eckelmann
 
  - Handle NET_XMIT_CN like NET_XMIT_SUCCESS (revised according to Davids
    suggestion), and a follow up code clean up, by Gao Feng (2 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAliKJcgWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoYzFEACGNOZcW2bFFpuE5CZWZpTW1n24
 YAY/3C+THc89oUyn7ZbWpZ05HcG+JZXOWCWHaRTnd4UeA+sbki53Ioo52ZGgo79C
 7LsyXMMH3F2NrXVEfCK/MUGZcqBAxMd4dcDPsYz5q5osxydjG3hEJLikqOluop3P
 JKK9FTEeP2JeoWz9eEf7Io2te6EwcIjo3TRa9f53uyyhPnk5eS2NeSle5axZqS7c
 l+NSZVlrrG7oUB0UdzABt5NWOvjDc+Lqp1dkoJo17PHOgXialIfjZOKUlKtjVx6E
 06ow2HZaEYCtdlb0awjyPxtIpwiy0szTzJa/h4XtzeQHgbOKIqJWAEb80X7imHVE
 aljP7A1uuGm0bcVQ+pq21PX8yLu4RCDPDE5Khu9atkiQP5+sVEdGiJ8Soaw4PmoD
 yhDmXshrPGR8u5txN8gaHWG4MHt19645s8dHqHQ7tf5h+mf2QXQ2v/jJQsCV2UfY
 vLd/JOD4ZVYWDCspcDdEwGc8KB9r6P31wXuSVjYfkqTFXocdzDr87V7C69CS0U+b
 Lzel8oa/eVa2ppR+OhpELxhL2ahO7p1jZI2ix4NHftx5MAV4WJ3RfR2ev2Sf9Ukc
 aGjzjuml3JVo2i+11lqiAtOaBK9wDNv1CaC+D2NmC6wpWzWQryNryw3fm2SoH3Xm
 S+UoBDZpik+SIEBv3Q==
 =cd1Y
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20170126' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - ignore self-generated loop detect MAC addresses in translation table,
   by Simon Wunderlich

 - install uapi batman_adv.h header, by Sven Eckelmann

 - bump copyright years, by Sven Eckelmann

 - Remove an unused variable in translation table code, by Sven Eckelmann

 - Handle NET_XMIT_CN like NET_XMIT_SUCCESS (revised according to Davids
   suggestion), and a follow up code clean up, by Gao Feng (2 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26 14:31:08 -05:00
David S. Miller 086cb6a412 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a large batch with Netfilter fixes for
your net tree, they are:

1) Two patches to solve conntrack garbage collector cpu hogging, one to
   remove GC_MAX_EVICTS and another to look at the ratio (scanned entries
   vs. evicted entries) to make a decision on whether to reduce or not
   the scanning interval. From Florian Westphal.

2) Two patches to fix incorrect set element counting if NLM_F_EXCL is
   is not set. Moreover, don't decrenent set->nelems from abort patch
   if -ENFILE which leaks a spare slot in the set. This includes a
   patch to deconstify the set walk callback to update set->ndeact.

3) Two fixes for the fwmark_reflect sysctl feature: Propagate mark to
   reply packets both from nf_reject and local stack, from Pau Espin Pedrol.

4) Fix incorrect handling of loopback traffic in rpfilter and nf_tables
   fib expression, from Liping Zhang.

5) Fix oops on stateful objects netlink dump, when no filter is specified.
   Also from Liping Zhang.

6) Fix a build error if proc is not available in ipt_CLUSTERIP, related
   to fix that was applied in the previous batch for net. From Arnd Bergmann.

7) Fix lack of string validation in table, chain, set and stateful
   object names in nf_tables, from Liping Zhang. Moreover, restrict
   maximum log prefix length to 127 bytes, otherwise explicitly bail
   out.

8) Two patches to fix spelling and typos in nf_tables uapi header file
   and Kconfig, patches from Alexander Alemayhu and William Breathitt Gray.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26 12:54:50 -05:00
Gao Feng c33705188c batman-adv: Treat NET_XMIT_CN as transmit successfully
The tc could return NET_XMIT_CN as one congestion notification, but
it does not mean the packet is lost. Other modules like ipvlan,
macvlan, and others treat NET_XMIT_CN as success too.

So batman-adv should handle NET_XMIT_CN also as NET_XMIT_SUCCESS.

Signed-off-by: Gao Feng <gfree.wind@gmail.com>
[sven@narfation.org: Moved NET_XMIT_CN handling to batadv_send_skb_packet]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26 08:41:18 +01:00
Gao Feng 0843f197c4 batman-adv: Remove one condition check in batadv_route_unicast_packet
It could decrease one condition check to collect some statements in the
first condition block.

Signed-off-by: Gao Feng <gfree.wind@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26 08:37:01 +01:00
Sven Eckelmann 269cee6218 batman-adv: Remove unused variable in batadv_tt_local_set_flags
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26 08:34:20 +01:00
Sven Eckelmann ac79cbb96b batman-adv: update copyright years for 2017
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26 08:34:19 +01:00
Simon Wunderlich d3e9768ab9 batman-adv: don't add loop detect macs to TT
The bridge loop avoidance (BLA) feature of batman-adv sends packets to
probe for Mesh/LAN packet loops. Those packets are not sent by real
clients and should therefore not be added to the translation table (TT).

Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
2017-01-26 08:34:18 +01:00
Arnd Bergmann 5b9d6b154a bridge: move maybe_deliver_addr() inside #ifdef
The only caller of this new function is inside of an #ifdef checking
for CONFIG_BRIDGE_IGMP_SNOOPING, so we should move the implementation
there too, in order to avoid this harmless warning:

net/bridge/br_forward.c:177:13: error: 'maybe_deliver_addr' defined but not used [-Werror=unused-function]

Fixes: 6db6f0eae6 ("bridge: multicast to unicast")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 23:16:06 -05:00
David S. Miller 214767faa2 Here is a batman-adv bugfix:
- fix reference count handling on fragmentation error, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAliI0D8WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoU2WD/9EdJuxSK0++Ka/sdXNEmy931dl
 ojXukizbSIg5+vEs0GCP+E5btvCQkiz1e43AAljHjE4FTv4nu9BeHs7g4msQzZt1
 7Pgy7A5wOz3UP/GN5QxEStGXYNmMeeKuaLklwxx+I619pMBX83bwlcfi8Xf40BWP
 twmeyC11TCXbyyR7sH8nbDUiuZdP4soO3yr2WzTndHYui5UKTPlQ/5/VJLFtnIYw
 ZDGYqAy7cg7wHO4khd2HaWuetIW4QIk59rqJX9pmwhDADuKL4XE9O9K3KawFH873
 lYv226vc6d7Nc6mBfD0zDPXKImOsvi3padtVgXr2AYI8bBlh7eDp9NUFLITly8Jo
 GIr8ABp5u6ZbN8A+16C5yHj5BZArQTe5EjZCS9yLLZQ0a9dXp4ZlEQuHSN9BJo05
 CWYbpoj5Hunooh2EwOYfwWQa7kbDlDh/q7+qIqJcqhd+KT9cCHvPH3bsmi8gZMrh
 fxyMouSX5LQmwnAgs/r4KUQ/5eCcf81o+SVTi/yvEf1Y9pfIqFna2IU/oREpjwY5
 csFrY8Czc/O2E20+dzrCJAKK+Sa1U+WnUGlguqIQFIue/mMcOg6vvRcj8K2uz/7x
 jsnfjA0ZsK2mRVpaWN/36RVcNyMQhrM35sVv7mnvwCWAViGBaPOzOzc+bv4fdzq0
 8KOz8mx3qqU3ZIRKkw==
 =xS4v
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20170125' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - fix reference count handling on fragmentation error, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 23:11:13 -05:00
Florian Fainelli f154be241d net: dsa: Bring back device detaching in dsa_slave_suspend()
Commit 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid
lockdep splat") removed the netif_device_detach() call done in
dsa_slave_suspend() which is necessary, and paired with a corresponding
netif_device_attach(), bring it back.

Fixes: 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:47:44 -05:00
Willy Tarreau 3979ad7e82 net/tcp-fastopen: make connect()'s return case more consistent with non-TFO
Without TFO, any subsequent connect() call after a successful one returns
-1 EISCONN. The last API update ensured that __inet_stream_connect() can
return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
indicate that the connection is now in progress. Unfortunately since this
function is used both for connect() and sendmsg(), it has the undesired
side effect of making connect() now return -1 EINPROGRESS as well after
a successful call, while at the same time poll() returns POLLOUT. This
can confuse some applications which happen to call connect() and to
check for -1 EISCONN to ensure the connection is usable, and for which
EINPROGRESS indicates a need to poll, causing a loop.

This problem was encountered in haproxy where a call to connect() is
precisely used in certain cases to confirm a connection's readiness.
While arguably haproxy's behaviour should be improved here, it seems
important to aim at a more robust behaviour when the goal of the new
API is to make it easier to implement TFO in existing applications.

This patch simply ensures that we preserve the same semantics as in
the non-TFO case on the connect() syscall when using TFO, while still
returning -1 EINPROGRESS on sendmsg(). For this we simply tell
__inet_stream_connect() whether we're doing a regular connect() or in
fact connecting for a sendmsg() call.

Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:12:21 -05:00
Wei Wang 19f6d3f3c8 net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences are not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created w/o changing other socket calls sequence:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the
    kernel.

  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.

  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.

  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.

The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:04:38 -05:00
Wei Wang 25776aa943 net: Remove __sk_dst_reset() in tcp_v6_connect()
Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
will eventually get called when sk is released. No need to handle it in
the protocol specific connect call.
This is also to make the code path consistent with ipv4.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:04:38 -05:00
Wei Wang 065263f40f net/tcp-fastopen: refactor cookie check logic
Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called else where in later changes.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:04:38 -05:00
Jason Baron 56d806222a tcp: correct memory barrier usage in tcp_check_space()
sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.

Fixes: 3c7151275c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:23:36 -05:00
Eric Dumazet 60b1af3300 tcp: reduce skb overhead in selected places
tcp_add_backlog() can use skb_condense() helper to get better
gains and less SKB_TRUESIZE() magic. This only happens when socket
backlog has to be used.

Some attacks involve specially crafted out of order tiny TCP packets,
clogging the ofo queue of (many) sockets.
Then later, expensive collapse happens, trying to copy all these skbs
into single ones.
This unfortunately does not work if each skb has no neighbor in TCP
sequence order.

By using skb_condense() if the skb could not be coalesced to a prior
one, we defeat these kind of threats, potentially saving 4K per skb
(or more, since this is one page fragment).

A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
in skb->head, meaning the copy done by skb_condense() is limited to
about 200 bytes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:13:31 -05:00
Dan Carpenter a08ef4768f tipc: uninitialized return code in tipc_setsockopt()
We shuffled some code around and added some new case statements here and
now "res" isn't initialized on all paths.

Fixes: 01fd12bb18 ("tipc: make replicast a user selectable option")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:41:34 -05:00
Jamal Hadi Salim 1045ba77a5 net sched actions: Add support for user cookies
Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (eg in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ intepret it.  The user can store whatever they wish in the
128 bits.

Sample exercise(showing variable length use of cookie)

.. create an accept action with cookie a1b2c3d4
sudo $TC actions add action ok index 1 cookie a1b2c3d4

.. dump all gact actions..
sudo $TC -s actions ls action gact

    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 1 bind 0 installed 5 sec used 5 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    cookie a1b2c3d4

.. bind the accept action to a filter..
sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1

... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:37:04 -05:00
Xin Long 5207f39963 sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment
Now sctp gso puts segments into skb's frag_list, then processes these
segments in skb_segment. But skb_segment handles them only when gs is
enabled, as it's in the same branch with skb's frags.

Although almost all the NICs support sg other than some old ones, but
since commit 1e16aa3ddf ("net: gso: use feature flag argument in all
protocol gso handlers"), features &= skb->dev->hw_enc_features, and
xfrm_output_gso call skb_segment with features = 0, which means sctp
gso would call skb_segment with sg = 0, and skb_segment would not work
as expected.

This patch is to fix it by setting features param with NETIF_F_SG when
calling skb_segment so that it can go the right branch to process the
skb's frag_list.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:28:33 -05:00
Xin Long 6f29a13061 sctp: sctp_addr_id2transport should verify the addr before looking up assoc
sctp_addr_id2transport is a function for sockopt to look up assoc by
address. As the address is from userspace, it can be a v4-mapped v6
address. But in sctp protocol stack, it always handles a v4-mapped
v6 address as a v4 address. So it's necessary to convert it to a v4
address before looking up assoc by address.

This patch is to fix it by calling sctp_verify_addr in which it can do
this conversion before calling sctp_endpoint_lookup_assoc, just like
what sctp_sendmsg and __sctp_connect do for the address from users.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:26:55 -05:00
Robert Shearman 85c814016c lwtunnel: Fix oops on state free after encap module unload
When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: lwtstate_free+0x18/0x40
[..]
task: ffff88003e372380 task.stack: ffffc900001fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
[..]
Call Trace:
 <IRQ>
 free_fib_info_rcu+0x195/0x1a0
 ? rt_fibinfo_free+0x50/0x50
 rcu_process_callbacks+0x2d3/0x850
 ? rcu_process_callbacks+0x296/0x850
 __do_softirq+0xe4/0x4cb
 irq_exit+0xb0/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is after the module for the encap can be unloaded the
corresponding ops is removed and is thus NULL here.

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.

Fixes: 1104d9ba44 ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:21:36 -05:00
Robert Shearman 88ff7334f2 net: Specify the owning module for lwtunnel ops
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:21:36 -05:00
Parthasarathy Bhuvaragan 35e22e49a5 tipc: fix cleanup at module unload
In tipc_server_stop(), we iterate over the connections with limiting
factor as server's idr_in_use. We ignore the fact that this variable
is decremented in tipc_close_conn(), leading to premature exit.

In this commit, we iterate until the we have no connections left.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan 4c887aa65d tipc: ignore requests when the connection state is not CONNECTED
In tipc_conn_sendmsg(), we first queue the request to the outqueue
followed by the connection state check. If the connection is not
connected, we should not queue this message.

In this commit, we reject the messages if the connection state is
not CF_CONNECTED.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan 9dc3abdd1f tipc: fix nametbl_lock soft lockup at module exit
Commit 333f796235 ("tipc: fix a race condition leading to
subscriber refcnt bug") reveals a soft lockup while acquiring
nametbl_lock.

Before commit 333f796235, we call tipc_conn_shutdown() from
tipc_close_conn() in the context of tipc_topsrv_stop(). In that
context, we are allowed to grab the nametbl_lock.

Commit 333f796235, moved tipc_conn_release (renamed from
tipc_conn_shutdown) to the connection refcount cleanup. This allows
either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup.

Since tipc_exit_net() first calls tipc_topsrv_stop() and then
tipc_nametble_withdraw() increases the chances for the later to
perform the connection cleanup.

The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
when it performs the tipc_conn_kref_release() as it tries to grab
nametbl_lock again while holding it already.
tipc_nametbl_withdraw() grabs nametbl_lock
  tipc_nametbl_remove_publ()
    tipc_subscrp_report_overlap()
      tipc_subscrp_send_event()
        tipc_conn_sendmsg()
          << if (con->flags != CF_CONNECTED) we do conn_put(),
             triggering the cleanup as refcount=0. >>
          tipc_conn_kref_release
            tipc_sock_release
              tipc_conn_release
                tipc_subscrb_delete
                  tipc_subscrp_delete
                    tipc_nametbl_unsubscribe << Soft Lockup >>

The previous changes in this series fixes the race conditions fixed
by commit 333f796235. Hence we can now revert the commit.

Fixes: 333f796235 ("tipc: fix a race condition leading to subscriber refcnt bug")
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan fc0adfc8fd tipc: fix connection refcount error
Until now, the generic server framework maintains the connection
id's per subscriber in server's conn_idr. At tipc_close_conn, we
remove the connection id from the server list, but the connection is
valid until we call the refcount cleanup. Hence we have a window
where the server allocates the same connection to an new subscriber
leading to inconsistent reference count. We have another refcount
warning we grab the refcount in tipc_conn_lookup() for connections
with flag with CF_CONNECTED not set. This usually occurs at shutdown
when the we stop the topology server and withdraw TIPC_CFG_SRV
publication thereby triggering a withdraw message to subscribers.

In this commit, we:
1. remove the connection from the server list at recount cleanup.
2. grab the refcount for a connection only if CF_CONNECTED is set.

Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Parthasarathy Bhuvaragan d094c4d5f5 tipc: add subscription refcount to avoid invalid delete
Until now, the subscribers keep track of the subscriptions using
reference count at subscriber level. At subscription cancel or
subscriber delete, we delete the subscription only if the timer
was pending for the subscription. This approach is incorrect as:
1. del_timer() is not SMP safe, if on CPU0 the check for pending
   timer returns true but CPU1 might schedule the timer callback
   thereby deleting the subscription. Thus when CPU0 is scheduled,
   it deletes an invalid subscription.
2. We export tipc_subscrp_report_overlap(), which accesses the
   subscription pointer multiple times. Meanwhile the subscription
   timer can expire thereby freeing the subscription and we might
   continue to access the subscription pointer leading to memory
   violations.

In this commit, we introduce subscription refcount to avoid deleting
an invalid subscription.

Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Parthasarathy Bhuvaragan 93f955aad4 tipc: fix nametbl_lock soft lockup at node/link events
We trigger a soft lockup as we grab nametbl_lock twice if the node
has a pending node up/down or link up/down event while:
- we process an incoming named message in tipc_named_rcv() and
  perform an tipc_update_nametbl().
- we have pending backlog items in the name distributor queue
  during a nametable update using tipc_nametbl_publish() or
  tipc_nametbl_withdraw().

The following are the call chain associated:
tipc_named_rcv() Grabs nametbl_lock
   tipc_update_nametbl() (publish/withdraw)
     tipc_node_subscribe()/unsubscribe()
       tipc_node_write_unlock()
          << lockup occurs if an outstanding node/link event
             exits, as we grabs nametbl_lock again >>

tipc_nametbl_withdraw() Grab nametbl_lock
  tipc_named_process_backlog()
    tipc_update_nametbl()
      << rest as above >>

The function tipc_node_write_unlock(), in addition to releasing the
lock processes the outstanding node/link up/down events. To do this,
we need to grab the nametbl_lock again leading to the lockup.

In this commit we fix the soft lockup by introducing a fast variant of
node_unlock(), where we just release the lock. We adapt the
node_subscribe()/node_unsubscribe() to use the fast variants.

Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Pablo Neira Ayuso b2c11e4b95 netfilter: nf_tables: bump set->ndeact on set flush
Add missing set->ndeact update on each deactivated element from the set
flush path. Otherwise, sets with fixed size break after flush since
accounting breaks.

 # nft add set x y { type ipv4_addr\; size 2\; }
 # nft add element x y { 1.1.1.1 }
 # nft add element x y { 1.1.1.2 }
 # nft flush set x y
 # nft add element x y { 1.1.1.1 }
 <cmdline>:1:1-28: Error: Could not process rule: Too many open files in system

Fixes: 8411b6442e ("netfilter: nf_tables: support for set flushing")
Reported-by: Elise Lennion <elise.lennion@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24 21:46:59 +01:00
Pablo Neira Ayuso de70185de0 netfilter: nf_tables: deconstify walk callback function
The flush operation needs to modify set and element objects, so let's
deconstify this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24 21:46:58 +01:00
Pablo Neira Ayuso 35d0ac9070 netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL
If the element exists and no NLM_F_EXCL is specified, do not bump
set->nelems, otherwise we leak one set element slot. This problem
amplifies if the set is full since the abort path always decrements the
counter for the -ENFILE case too, giving one spare extra slot.

Fix this by moving set->nelems update to nft_add_set_elem() after
successful element insertion. Moreover, remove the element if the set is
full so there is no need to rely on the abort path to undo things
anymore.

Fixes: c016c7e45d ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24 21:46:57 +01:00
Liping Zhang 5ce6b04ce9 netfilter: nft_log: restrict the log prefix length to 127
First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127,
at nf_log_packet(), so the extra part is useless.

Second, after adding a log rule with a very very long prefix, we will
fail to dump the nft rules after this _special_ one, but acctually,
they do exist. For example:
  # name_65000=$(printf "%0.sQ" {1..65000})
  # nft add rule filter output log prefix "$name_65000"
  # nft add rule filter output counter
  # nft add rule filter output counter
  # nft list chain filter output
  table ip filter {
      chain output {
          type filter hook output priority 0; policy accept;
      }
  }

So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1.

Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24 21:46:29 +01:00
Kinglong Mee c929ea0b91 SUNRPC: cleanup ida information when removing sunrpc module
After removing sunrpc module, I get many kmemleak information as,
unreferenced object 0xffff88003316b1e0 (size 544):
  comm "gssproxy", pid 2148, jiffies 4294794465 (age 4200.081s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffb0cfb58a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffffb03507fe>] kmem_cache_alloc+0x15e/0x1f0
    [<ffffffffb0639baa>] ida_pre_get+0xaa/0x150
    [<ffffffffb0639cfd>] ida_simple_get+0xad/0x180
    [<ffffffffc06054fb>] nlmsvc_lookup_host+0x4ab/0x7f0 [lockd]
    [<ffffffffc0605e1d>] lockd+0x4d/0x270 [lockd]
    [<ffffffffc06061e5>] param_set_timeout+0x55/0x100 [lockd]
    [<ffffffffc06cba24>] svc_defer+0x114/0x3f0 [sunrpc]
    [<ffffffffc06cbbe7>] svc_defer+0x2d7/0x3f0 [sunrpc]
    [<ffffffffc06c71da>] rpc_show_info+0x8a/0x110 [sunrpc]
    [<ffffffffb044a33f>] proc_reg_write+0x7f/0xc0
    [<ffffffffb038e41f>] __vfs_write+0xdf/0x3c0
    [<ffffffffb0390f1f>] vfs_write+0xef/0x240
    [<ffffffffb0392fbd>] SyS_write+0xad/0x130
    [<ffffffffb0d06c37>] entry_SYSCALL_64_fastpath+0x1a/0xa9
    [<ffffffffffffffff>] 0xffffffffffffffff

I found, the ida information (dynamic memory) isn't cleanup.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 2f048db468 ("SUNRPC: Add an identifier for struct rpc_clnt")
Cc: stable@vger.kernel.org # v3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-24 15:29:24 -05:00
Colin Ian King 1464086939 net: sctp: fix array overrun read on sctp_timer_tbl
Table sctp_timer_tbl is missing a TIMEOUT_RECONF string so
add this in. Also compare timeout with the size of the array
sctp_timer_tbl rather than SCTP_EVENT_TIMEOUT_MAX.  Also add
a build time check that SCTP_EVENT_TIMEOUT_MAX is correct
so we don't ever get this kind of mismatch between the table
and SCTP_EVENT_TIMEOUT_MAX in the future.

Kudos to Marcelo Ricardo Leitner for spotting the missing string
and suggesting the build time sanity check.

Fixes CoverityScan CID#1397639 ("Out-of-bounds read")

Fixes: 7b9438de0c ("sctp: add stream reconf timer")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 15:24:35 -05:00
Florian Fainelli 82272db84d net: dsa: Drop WARN() in tag_brcm.c
We may be able to see invalid Broadcom tags when the hardware and drivers are
misconfigured, or just while exercising the error path. Instead of flooding
the console with messages, flat out drop the packet.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 15:00:22 -05:00
Eric Dumazet fbfa743a9d ipv6: fix ip6_tnl_parse_tlv_enc_lim()
This function suffers from multiple issues.

First one is that pskb_may_pull() may reallocate skb->head,
so the 'raw' pointer needs either to be reloaded or not used at all.

Second issue is that NEXTHDR_DEST handling does not validate
that the options are present in skb->data, so we might read
garbage or access non existent memory.

With help from Willem de Bruijn.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov  <dvyukov@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:53:24 -05:00
Eric Dumazet 21b995a9cb ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(),
we must reload any pointer that was related to skb->head
(or skb->data), or risk use after free.

Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Kozlov <xeb@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:53:24 -05:00
Daniel Borkmann d1b662adcd bpf: allow option for setting bpf_l4_csum_replace from scratch
When programs need to calculate the csum from scratch for small UDP
packets and use bpf_l4_csum_replace() to feed the result from helpers
like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0
that would ignore the case of current csum being 0, and which would
still allow for the helper to set the csum and transform when needed
to CSUM_MANGLED_0.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:46:06 -05:00
Daniel Borkmann 2492d3b867 bpf: enable load bytes helper for filter/reuseport progs
BPF_PROG_TYPE_SOCKET_FILTER are used in various facilities such as
for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm,
etc, and yet the only facility they can use is BPF_LD with {BPF_ABS,
BPF_IND} for single byte/half/word access.

Direct packet access is only restricted to tc programs right now,
but we can still facilitate usage by allowing skb_load_bytes() helper
added back then in 05c74e5e53 ("bpf: add bpf_skb_load_bytes helper")
that calls skb_header_pointer() similarly to bpf_load_pointer(), but
for stack buffers with larger access size.

Name the previous sk_filter_func_proto() as bpf_base_func_proto()
since this is used everywhere else as well, similarly for the ctx
converter, that is, bpf_convert_ctx_access().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:46:05 -05:00
Daniel Borkmann 4faf940dd8 bpf: simplify __is_valid_access test on cb
The __is_valid_access() test for cb[] from 62c7989b24 ("bpf: allow
b/h/w/dw access for bpf's cb in ctx") was done unnecessarily complex,
we can just simplify it the same way as recent fix from 2d071c643f
("bpf, trace: make ctx access checks more robust") did. Overflow can
never happen as size is 1/2/4/8 depending on access.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:46:05 -05:00
WANG Cong 0fb44559ff af_unix: move unix_mknod() out of bindlock
Dmitry reported a deadlock scenario:

unix_bind() path:
u->bindlock ==> sb_writer

do_splice() path:
sb_writer ==> pipe->mutex ==> u->bindlock

In the unix_bind() code path, unix_mknod() does not have to
be done with u->bindlock held, since it is a pure fs operation,
so we can just move unix_mknod() out.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:30:56 -05:00
Yotam Gigi 5c5670fae4 net/sched: Introduce sample tc action
This action allows the user to sample traffic matched by tc classifier.
The sampling consists of choosing packets randomly and sampling them using
the psample module. The user can configure the psample group number, the
sampling rate and the packet's truncation (to save kernel-user traffic).

Example:
To sample ingress traffic from interface eth1, one may use the commands:

tc qdisc add dev eth1 handle ffff: ingress

tc filter add dev eth1 parent ffff: \
	   matchall action sample rate 12 group 4

Where the first command adds an ingress qdisc and the second starts
sampling randomly with an average of one sampled packet per 12 packets on
dev eth1 to psample group 4.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 13:44:28 -05:00
Yotam Gigi 6ae0a62861 net: Introduce psample, a new genetlink channel for packet sampling
Add a general way for kernel modules to sample packets, without being tied
to any specific subsystem. This netlink channel can be used by tc,
iptables, etc. and allow to standardize packet sampling in the kernel.

For every sampled packet, the psample module adds the following metadata
fields:

PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable

PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable

PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
   truncated during sampling

PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
   user who initiated the sampling. This field allows the user to
   differentiate between several samplers working simultaneously and
   filter packets relevant to him

PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
   sequence is kept for each group

PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets

PSAMPLE_ATTR_DATA - the actual packet bits

The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
group. In addition, add the GET_GROUPS netlink command which allows the
user to see the current sample groups, their refcount and sequence number.
This command currently supports only netlink dump mode.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 13:44:28 -05:00
Andrew Lunn 23e3d618e4 net: dsa: Fix inverted test for multiple CPU interface
Remove the wrong !, otherwise we get false positives about having
multiple CPU interfaces.

Fixes: b22de49086 ("net: dsa: store CPU switch structure in the tree")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 13:33:50 -05:00
Felix Fietkau 6db6f0eae6 bridge: multicast to unicast
Implements an optional, per bridge port flag and feature to deliver
multicast packets to any host on the according port via unicast
individually. This is done by copying the packet per host and
changing the multicast destination MAC to a unicast one accordingly.

multicast-to-unicast works on top of the multicast snooping feature of
the bridge. Which means unicast copies are only delivered to hosts which
are interested in it and signalized this via IGMP/MLD reports
previously.

This feature is intended for interface types which have a more reliable
and/or efficient way to deliver unicast packets than broadcast ones
(e.g. wifi).

However, it should only be enabled on interfaces where no IGMPv2/MLDv1
report suppression takes place. This feature is disabled by default.

The initial patch and idea is from Felix Fietkau.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
[linus.luessing@c0d3.blue: various bug + style fixes, commit message]
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 12:39:52 -05:00
Krister Johansen 4548b683b7 Introduce a sysctl that modifies the value of PROT_SOCK.
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
that denotes the first unprivileged inet port in the namespace.  To
disable all privileged ports set this to zero.  It also checks for
overlap with the local port range.  The privileged and local range may
not overlap.

The use case for this change is to allow containerized processes to bind
to priviliged ports, but prevent them from ever being allowed to modify
their container's network configuration.  The latter is accomplished by
ensuring that the network namespace is not a child of the user
namespace.  This modification was needed to allow the container manager
to disable a namespace's priviliged port restrictions without exposing
control of the network namespace to processes in the user namespace.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 12:10:51 -05:00
Johannes Berg 115865fa08 mac80211: don't try to sleep in rate_control_rate_init()
In my previous patch, I missed that rate_control_rate_init() is
called from some places that cannot sleep, so it cannot call
ieee80211_recalc_min_chandef(). Remove that call for now to fix
the context bug, we'll have to find a different way to fix the
minimum channel width issue.

Fixes: 96aa2e7cf1 ("mac80211: calculate min channel width correctly")
Reported-by: Xiaolong Ye (via lkp-robot) <xiaolong.ye@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-01-24 16:31:54 +01:00
Liping Zhang b2fbd04498 netfilter: nf_tables: validate the name size when possible
Currently, if the user add a stateful object with the name size exceed
NFT_OBJ_MAXNAMELEN - 1 (i.e. 31), we truncate it down to 31 silently.
This is not friendly, furthermore, this will cause duplicated stateful
objects when the first 31 characters of the name is same. So limit the
stateful object's name size to NFT_OBJ_MAXNAMELEN - 1.

After apply this patch, error message will be printed out like this:
  # name_32=$(printf "%0.sQ" {1..32})
  # nft add counter filter $name_32
  <cmdline>:1:1-52: Error: Could not process rule: Numerical result out
  of range
  add counter filter QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Also this patch cleans up the codes which missing the name size limit
validation in nftables.

Fixes: e50092404c ("netfilter: nf_tables: add stateful objects")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-23 23:36:50 +01:00
Florian Fainelli 4078b76cac net: dsa: Check return value of phy_connect_direct()
We need to check the return value of phy_connect_direct() in
dsa_slave_phy_connect() otherwise we may be continuing the
initialization of a slave network device with a PHY that already
attached somewhere else and which will soon be in error because the PHY
device is in error.

The conditions for such an error to occur are that we have a port of our
switch that is not disabled, and has the same port number as a PHY
address (say both 5) that can be probed using the DSA slave MII bus. We
end-up having this slave network device find a PHY at the same address
as our port number, and we try to attach to it.

A slave network (e.g: port 0) has already attached to our PHY device,
and we try to re-attach it with a different network device, but since we
ignore the error we would end-up initializating incorrect device
references by the time the slave network interface is opened.

The code has been (re)organized several times, making it hard to provide
an exact Fixes tag, this is a bugfix nonetheless.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 15:43:35 -05:00
David Ahern 9f427a0e47 net: mpls: Fix multipath selection for LSR use case
MPLS multipath for LSR is broken -- always selecting the first nexthop
in the one label case. For example:

    $ ip -f mpls ro ls
    100
            nexthop as to 200 via inet 172.16.2.2  dev virt12
            nexthop as to 300 via inet 172.16.3.2  dev virt13
    101
            nexthop as to 201 via inet6 2000:2::2  dev virt12
            nexthop as to 301 via inet6 2000:3::2  dev virt13

In this example incoming packets have a single MPLS labels which means
BOS bit is set. The BOS bit is passed from mpls_forward down to
mpls_multipath_hash which never processes the hash loop because BOS is 1.

Update mpls_multipath_hash to process the entire label stack. mpls_hdr_len
tracks the total mpls header length on each pass (on pass N mpls_hdr_len
is N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set
it verifies the skb has sufficient header for ipv4 or ipv6, and find the
IPv4 and IPv6 header by using the last mpls_hdr pointer and adding 1 to
advance past it.

With these changes I have verified the code correctly sees the label,
BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp
traffic for ipv4 and ipv6 are distributed across the nexthops.

Fixes: 1c78efa831 ("mpls: flow-based multipath selection")
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 12:48:14 -05:00
Eric Dumazet 9ca677b1bd ipv6: add NUMA awareness to seg6_hmac_init_algo()
Since we allocate per cpu storage, let's also use NUMA hints.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-22 16:50:36 -05:00
Geliang Tang 530cef21d9 6lowpan: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-22 16:46:13 -05:00
Linus Torvalds e90665a5d3 Three filesystem endianness fixes (one goes back to the 2.6 era, all
marked for stable) and two fixups for this merge window's patches.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYghs8AAoJEEp/3jgCEfOLOz0IAI/xNUMO121S57GEhzkKDdWC
 5PCHjg9itU+2eMCCZ2Nyuikj2NVwEFh9HLpMz5jtFa3oWCIhljh9wT8zlKDgpn5R
 Q1GCT4LkHGhV+HA2sM04aynKBmC90ZVAHfDt/BTs5mLzW7neSpxFOQEPdS4FG6Zg
 NxUGcI/GhqmfpcLnm5IqXxI1cc0bXf6BmEzlGrPAkvzJBhHXWKCVpr1Q/nBW96Q5
 ko1EpP16wZoeRvsr1ztXmBTNURUrCi7S6PyK4M5MAro381U3a7zwQuFq9uuREahO
 nJtCjWD3bd6U3ENDe/Gacz3czXQyjOjE2/w42jL1dA84UMQbz+wv1SyNCkQgiyI=
 =1LTx
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.10-rc5' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Three filesystem endianness fixes (one goes back to the 2.6 era, all
  marked for stable) and two fixups for this merge window's patches"

* tag 'ceph-for-4.10-rc5' of git://github.com/ceph/ceph-client:
  ceph: fix bad endianness handling in parse_reply_info_extra
  ceph: fix endianness bug in frag_tree_split_cmp
  ceph: fix endianness of getattr mask in ceph_d_revalidate
  libceph: make sure ceph_aes_crypt() IV is aligned
  ceph: fix ceph_get_caps() interruption
2017-01-20 12:15:48 -08:00
Ivan Vecera b6677449df bridge: netlink: call br_changelink() during br_dev_newlink()
Any bridge options specified during link creation (e.g. ip link add)
are ignored as br_dev_newlink() does not process them.
Use br_changelink() to do it.

Fixes: 1332351617 ("bridge: implement rtnl_link_ops->changelink")
Signed-off-by: Ivan Vecera <cera@cera.cz>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 15:07:27 -05:00
Andrew Lunn cf1a56a4cf net: dsa: Remove hwmon support
Only the Marvell mv88e6xxx DSA driver made use of the HWMON support in
DSA. The temperature sensor registers are actually in the embedded
PHYs, and the PHY driver now supports it. So remove all HWMON support
from DSA and drivers.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 14:42:51 -05:00
Josef Bacik 319554f284 inet: don't use sk_v6_rcv_saddr directly
When comparing two sockets we need to use inet6_rcv_saddr so we get a NULL
sk_v6_rcv_saddr if the socket isn't AF_INET6, otherwise our comparison function
can be wrong.

Fixes: 637bc8b ("inet: reset tb->fastreuseport when adding a reuseport sk")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 14:35:51 -05:00
Mahesh Bandewar 1b7cd0044e net: remove duplicate code.
netdev_rx_handler_register() checks to see if the handler is already
busy which was recently separated into netdev_is_rx_handler_busy(). So
use the same function inside register() to avoid code duplication.
Essentially this change should be a no-op

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:22:25 -05:00
Andrew Collins 264b87fa61 fq_codel: Avoid regenerating skb flow hash unless necessary
The fq_codel qdisc currently always regenerates the skb flow hash.
This wastes some cycles and prevents flow seperation in cases where
the traffic has been encrypted and can no longer be understood by the
flow dissector.

Change it to use the prexisting flow hash if one exists, and only
regenerate if necessary.

Signed-off-by: Andrew Collins <acollins@cradlepoint.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:15:14 -05:00
Jon Paul Maloy 01fd12bb18 tipc: make replicast a user selectable option
If the bearer carrying multicast messages supports broadcast, those
messages will be sent to all cluster nodes, irrespective of whether
these nodes host any actual destinations socket or not. This is clearly
wasteful if the cluster is large and there are only a few real
destinations for the message being sent.

In this commit we extend the eligibility of the newly introduced
"replicast" transmit option. We now make it possible for a user to
select which method he wants to be used, either as a mandatory setting
via setsockopt(), or as a relative setting where we let the broadcast
layer decide which method to use based on the ratio between cluster
size and the message's actual number of destination nodes.

In the latter case, a sending socket must stick to a previously
selected method until it enters an idle period of at least 5 seconds.
This eliminates the risk of message reordering caused by method change,
i.e., when changes to cluster size or number of destinations would
otherwise mandate a new method to be used.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:17 -05:00
Jon Paul Maloy a853e4c6d0 tipc: introduce replicast as transport option for multicast
TIPC multicast messages are currently carried over a reliable
'broadcast link', making use of the underlying media's ability to
transport packets as L2 broadcast or IP multicast to all nodes in
the cluster.

When the used bearer is lacking that ability, we can instead emulate
the broadcast service by replicating and sending the packets over as
many unicast links as needed to reach all identified destinations.
We now introduce a new TIPC link-level 'replicast' service that does
this.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:17 -05:00
Jon Paul Maloy 2ae0b8af1f tipc: add functionality to lookup multicast destination nodes
As a further preparation for the upcoming 'replicast' functionality,
we add some necessary structs and functions for looking up and returning
a list of all nodes that host destinations for a given multicast message.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:16 -05:00
Jon Paul Maloy 9999974a83 tipc: add function for checking broadcast support in bearer
As a preparation for the 'replicast' functionality we are going to
introduce in the next commits, we need the broadcast base structure to
store whether bearer broadcast is available at all from the currently
used bearer or bearers.

We do this by adding a new function tipc_bearer_bcast_support() to
the bearer layer, and letting the bearer selection function in
bcast.c use this to give a new boolean field, 'bcast_support' the
appropriate value.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:15 -05:00
Phil Sutter 9af15c3825 device: Implement a bus agnostic dev_num_vf routine
Now that pci_bus_type has num_vf callback set, dev_num_vf can be
implemented in a bus type independent way and the check for whether a
PCI device is being handled in rtnetlink can be dropped.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:43:17 -05:00
Jakub Sitnicki c10aa71b9d gre6: Clean up unused struct ipv6_tel_txoption definition
Commit b05229f442 ("gre6: Cleanup GREv6 transmit path, call common GRE
functions") removed the ip6gre specific transmit function, but left the
struct ipv6_tel_txoption definition. Clean it up.

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:37:01 -05:00
David S. Miller 91e744653c Revert "net: sctp: fix array overrun read on sctp_timer_tbl"
This reverts commit 0e73fc9a56.

This fix wasn't correct, a better one is coming right up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:29:43 -05:00
Eric Dumazet c2a2efbbfc net: remove bh disabling around percpu_counter accesses
Shaohua Li made percpu_counter irq safe in commit 098faf5805
("percpu_counter: make APIs irq safe")

We can safely remove BH disable/enable sections around various
percpu_counter manipulations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:27:22 -05:00
Colin Ian King 0e73fc9a56 net: sctp: fix array overrun read on sctp_timer_tbl
The comparison on the timeout can lead to an array overrun
read on sctp_timer_tbl because of an off-by-one error. Fix
this by using < instead of <= and also compare to the array
size rather than SCTP_EVENT_TIMEOUT_MAX.

Fixes CoverityScan CID#1397639 ("Out-of-bounds read")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:26:01 -05:00
Eric Dumazet e363116b90 ipv6: seg6_genl_set_tunsrc() must check kmemdup() return value
seg6_genl_get_tunsrc() and set_tun_src() do not handle tun_src being
possibly NULL, so we must check kmemdup() return value and abort if
it is NULL

Fixes: 915d7e5e59 ("ipv6: sr: add code base for control plane support of SR-IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Lebrun <david.lebrun@uclouvain.be>
Acked-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:07:48 -05:00
Jason Wang 6391a4481b virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving
Commit 501db51139 ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.

Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:01:17 -05:00
David Ahern a1a22c1206 net: ipv6: Keep nexthop of multipath route on admin down
IPv6 deletes route entries associated with multipath routes on an
admin down where IPv4 does not. For example:
    $ ip ro ls vrf red
    unreachable default metric 8192
    1.1.1.0/24 metric 64
            nexthop via 10.100.1.254  dev eth1 weight 1
            nexthop via 10.100.2.254  dev eth2 weight 1
    10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4
    10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4

    $ ip -6 ro ls vrf red
    2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
    2001:db8:2:: dev red proto none metric 0  pref medium
    2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
    2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024  pref medium
    2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024  pref medium
    ...

Set link down:
    $ ip li set eth1 down

IPv4 retains the multihop route but flags eth1 route as dead:

    $ ip ro ls vrf red
    unreachable default metric 8192
    1.1.1.0/24
            nexthop via 10.100.1.16  dev eth1 weight 1 dead linkdown
            nexthop via 10.100.2.16  dev eth2 weight 1
    10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4

and IPv6 deletes the route as part of flushing all routes for the device:

    $ ip -6 ro ls vrf red
    2001:db8:2:: dev red proto none metric 0  pref medium
    2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
    2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024  pref medium
    ...

Worse, on admin up of the device the multipath route has to be deleted
to get this leg of the route re-added.

This patch keeps routes that are part of a multipath route if
ignore_routes_with_linkdown is set with the dead and linkdown flags
enabling consistency between IPv4 and IPv6:

    $ ip -6 ro ls vrf red
    2001:db8:2:: dev red proto none metric 0  pref medium
    2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
    2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown  pref medium
    2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024  pref medium
    ...

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 23:38:51 -05:00
Tobias Klauser 54c30f604d net: caif: Remove unused stats member from struct chnl_net
The stats member of struct chnl_net is used nowhere in the code, so it
might as well be removed.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 11:45:21 -05:00
Alexey Kodanev 0dbd7ff3ac tcp: initialize max window for a new fastopen socket
Found that if we run LTP netstress test with large MSS (65K),
the first attempt from server to send data comparable to this
MSS on fastopen connection will be delayed by the probe timer.

Here is an example:

     < S  seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
     > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
     < .  ack 1 win 342 length 0

Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
as well as in 'size_goal'. This results the segment not queued for
transmition until all the data copied from user buffer. Then, inside
__tcp_push_pending_frames(), it breaks on send window test and
continues with the check probe timer.

Fragmentation occurs in tcp_write_wakeup()...

+0.2 > P. seq 1:43777 ack 1 win 342 length 43776
     < .  ack 43777, win 1365 length 0
     > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
     ...

This also contradicts with the fact that we should bound to the half
of the window if it is large.

Fix this flaw by correctly initializing max_window. Before that, it
could have large values that affect further calculations of 'size_goal'.

Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 11:35:26 -05:00
Arnd Bergmann 39b7b6a624 net/sched: cls_flower: reduce fl_change stack size
The new ARP support has pushed the stack size over the edge on ARM,
as there are two large objects on the stack in this function (mask
and tb) and both have now grown a bit more:

net/sched/cls_flower.c: In function 'fl_change':
net/sched/cls_flower.c:928:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

We can solve this by dynamically allocating one or both of them.
I first tried to do it just for the mask, but that only saved
152 bytes on ARM, while this version just does it for the 'tb'
array, bringing the stack size back down to 664 bytes.

Fixes: 99d31326cb ("net/sched: cls_flower: Support matching on ARP")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 11:18:53 -05:00
Kefeng Wang 03e4deff49 ipv6: addrconf: Avoid addrconf_disable_change() using RCU read-side lock
Just like commit 4acd4945cd ("ipv6: addrconf: Avoid calling
netdevice notifiers with RCU read-side lock"), it is unnecessary
to make addrconf_disable_change() use RCU iteration over the
netdev list, since it already holds the RTNL lock, or we may meet
Illegal context switch in RCU read-side critical section.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 11:08:53 -05:00
Florian Westphal e5072053b0 netfilter: conntrack: refine gc worker heuristics, redux
This further refines the changes made to conntrack gc_worker in
commit e0df8cae6c ("netfilter: conntrack: refine gc worker heuristics").

The main idea of that change was to reduce the scan interval when evictions
take place.

However, on the reporters' setup, there are 1-2 million conntrack entries
in total and roughly 8k new (and closing) connections per second.

In this case we'll always evict at least one entry per gc cycle and scan
interval is always at 1 jiffy because of this test:

 } else if (expired_count) {
     gc_work->next_gc_run /= 2U;
     next_run = msecs_to_jiffies(1);

being true almost all the time.

Given we scan ~10k entries per run its clearly wrong to reduce interval
based on nonzero eviction count, it will only waste cpu cycles since a vast
majorities of conntracks are not timed out.

Thus only look at the ratio (scanned entries vs. evicted entries) to make
a decision on whether to reduce or not.

Because evictor is supposed to only kick in when system turns idle after
a busy period, pick a high ratio -- this makes it 50%.  We thus keep
the idea of increasing scan rate when its likely that table contains many
expired entries.

In order to not let timed-out entries hang around for too long
(important when using event logging, in which case we want to timely
destroy events), we now scan the full table within at most
GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all
timed-out entries sit in same slot.

I tested this with a vm under synflood (with
sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3).

While flood is ongoing, interval now stays at its max rate
(GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms).

With feedback from Nicolas Dichtel.

Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Fixes: b87a2f9199 ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-19 14:28:01 +01:00
Florian Westphal 524b698db0 netfilter: conntrack: remove GC_MAX_EVICTS break
Instead of breaking loop and instant resched, don't bother checking
this in first place (the loop calls cond_resched for every bucket anyway).

Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-19 14:27:41 +01:00
Tobias Klauser 4a7c972644 net: Remove usage of net_device last_rx member
The network stack no longer uses the last_rx member of struct net_device
since the bonding driver switched to use its own private last_rx in
commit 9f24273837 ("bonding: use last_arp_rx in slave_last_rx()").

However, some drivers still (ab)use the field for their own purposes and
some driver just update it without actually using it.

Previously, there was an accompanying comment for the last_rx member
added in commit 4dc89133f4 ("net: add a comment on netdev->last_rx")
which asked drivers not to update is, unless really needed. However,
this commend was removed in commit f8ff080dac ("bonding: remove
useless updating of slave->dev->last_rx"), so some drivers added later
on still did update last_rx.

Remove all usage of last_rx and switch three drivers (sky2, atp and
smc91c92_cs) which actually read and write it to use their own private
copy in netdev_priv.

Compile-tested with allyesconfig and allmodconfig on x86 and arm.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 17:22:49 -05:00
David Ahern 9ed59592e3 lwtunnel: fix autoload of lwt modules
Trying to add an mpls encap route when the MPLS modules are not loaded
hangs. For example:

    CONFIG_MPLS=y
    CONFIG_NET_MPLS_GSO=m
    CONFIG_MPLS_ROUTING=m
    CONFIG_MPLS_IPTUNNEL=m

    $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

The ip command hangs:
root       880   826  0 21:25 pts/0    00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

    $ cat /proc/880/stack
    [<ffffffff81065a9b>] call_usermodehelper_exec+0xd6/0x134
    [<ffffffff81065efc>] __request_module+0x27b/0x30a
    [<ffffffff814542f6>] lwtunnel_build_state+0xe4/0x178
    [<ffffffff814aa1e4>] fib_create_info+0x47f/0xdd4
    [<ffffffff814ae451>] fib_table_insert+0x90/0x41f
    [<ffffffff814a8010>] inet_rtm_newroute+0x4b/0x52
    ...

modprobe is trying to load rtnl-lwt-MPLS:

root       881     5  0 21:25 ?        00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS

and it hangs after loading mpls_router:

    $ cat /proc/881/stack
    [<ffffffff81441537>] rtnl_lock+0x12/0x14
    [<ffffffff8142ca2a>] register_netdevice_notifier+0x16/0x179
    [<ffffffffa0033025>] mpls_init+0x25/0x1000 [mpls_router]
    [<ffffffff81000471>] do_one_initcall+0x8e/0x13f
    [<ffffffff81119961>] do_init_module+0x5a/0x1e5
    [<ffffffff810bd070>] load_module+0x13bd/0x17d6
    ...

The problem is that lwtunnel_build_state is called with rtnl lock
held preventing mpls_init from registering.

Given the potential references held by the time lwtunnel_build_state it
can not drop the rtnl lock to the load module. So, extract the module
loading code from lwtunnel_build_state into a new function to validate
the encap type. The new function is called while converting the user
request into a fib_config which is well before any table, device or
fib entries are examined.

Fixes: 745041e2aa ("lwtunnel: autoload of lwt modules")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 17:07:14 -05:00
Vivien Didelot 9520ed8fb8 net: dsa: use cpu_switch instead of ds[0]
Now that the DSA Ethernet switches are true Linux devices, the CPU
switch is not necessarily the first one. If its address is higher than
the second switch on the same MDIO bus, its index will be 1, not 0.

Avoid any confusion by using dst->cpu_switch instead of dst->ds[0].

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 16:49:47 -05:00
Vivien Didelot b22de49086 net: dsa: store CPU switch structure in the tree
Store a dsa_switch pointer to the CPU switch in the tree instead of only
its index. This avoids the need to initialize it to -1.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 16:49:46 -05:00
David Ahern f8cfe2ceb1 net: ipv6: remove prefix arg to rt6_fill_node
The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route
where a user is requesting a prefix only dump. Simplify rt6_fill_node
by removing the prefix arg and moving the prefix check to rt6_dump_route.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:43:59 -05:00
David Ahern fd61c6ba31 net: ipv6: remove nowait arg to rt6_fill_node
All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and
simplify rt6_fill_node accordingly.

rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the
nowait arg from it as well.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:43:59 -05:00
Eric Dumazet 7be2c82cfd net: fix harmonize_features() vs NETIF_F_HIGHDMA
Ashizuka reported a highmem oddity and sent a patch for freescale
fec driver.

But the problem root cause is that core networking stack
must ensure no skb with highmem fragment is ever sent through
a device that does not assert NETIF_F_HIGHDMA in its features.

We need to call illegal_highdma() from harmonize_features()
regardless of CSUM checks.

Fixes: ec5f061564 ("net: Kill link between CSUM and SG features.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Reported-by: "Ashizuka, Yuusuke" <ashiduka@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:24:27 -05:00
Gao Feng 1a28ad74eb netfilter: nf_tables: eliminate useless condition checks
The return value of nf_tables_table_lookup() is valid pointer or one
pointer error. There are two cases:

1) IS_ERR(table) is true, it would return the error or reset the
   table as NULL, it is unnecessary to perform the latter check
   "table != NULL".

2) IS_ERR(obj) is false, the table is one valid pointer. It is also
   unnecessary to perform that check.

The nf_tables_newset() and nf_tables_newobj() have same logic codes.

In summary, we could move the block of condition check "table != NULL"
in the else block to eliminate the original condition checks.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18 21:10:29 +01:00
Arnd Bergmann 3fd0b634de netfilter: ipt_CLUSTERIP: fix build error without procfs
We can't access c->pde if CONFIG_PROC_FS is disabled:

net/ipv4/netfilter/ipt_CLUSTERIP.c: In function 'clusterip_config_find_get':
net/ipv4/netfilter/ipt_CLUSTERIP.c:147:9: error: 'struct clusterip_config' has no member named 'pde'

This moves the check inside of another #ifdef.

Fixes: 6c5d5cfbe3 ("netfilter: ipt_CLUSTERIP: check duplicate config when initializing")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18 20:59:22 +01:00
Eran Ben Elisha 31a86d1372 net: ethtool: Initialize buffer when querying device channel settings
Ethtool channels respond struct was uninitialized when querying device
channel boundaries settings. As a result, unreported fields by the driver
hold garbage.  This may cause sending unsupported params to driver.

Fixes: 8bf3686204 ('ethtool: ensure channel counts are within bounds ...')
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
CC: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:58:23 -05:00
Xin Long 7f9d68ac94 sctp: implement sender-side procedures for SSN Reset Request Parameter
This patch is to implement sender-side procedures for the Outgoing
and Incoming SSN Reset Request Parameter described in rfc6525 section
5.1.2 and 5.1.3.

It is also add sockopt SCTP_RESET_STREAMS in rfc6525 section 6.3.2
for users.

Note that the new asoc member strreset_outstanding is to make sure
only one reconf request chunk on the fly as rfc6525 section 5.1.1
demands.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:11 -05:00
Xin Long 9fb657aec0 sctp: add sockopt SCTP_ENABLE_STREAM_RESET
This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set
strreset_enable to indicate which reconf request type it supports,
which is described in rfc6525 section 6.3.1.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:10 -05:00
Xin Long c28445c3cb sctp: add reconf_enable in asoc ep and netns
This patch is to add reconf_enable field in all of asoc ep and netns
to indicate if they support stream reset.

When initializing, asoc reconf_enable get the default value from ep
reconf_enable which is from netns netns reconf_enable by default.

It is also to add reconf_capable in asoc peer part to know if peer
supports reconf_enable, the value is set if ext params have reconf
chunk support when processing init chunk, just as rfc6525 section
5.1.1 demands.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:10 -05:00
Xin Long 7a090b0452 sctp: add stream reconf primitive
This patch is to add a primitive based on sctp primitive frame for
sending stream reconf request. It works as the other primitives,
and create a SCTP_CMD_REPLY command to send the request chunk out.

sctp_primitive_RECONF would be the api to send a reconf request
chunk.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:10 -05:00
Xin Long 7b9438de0c sctp: add stream reconf timer
This patch is to add a per transport timer based on sctp timer frame
for stream reconf chunk retransmission. It would start after sending
a reconf request chunk, and stop after receiving the response chunk.

If the timer expires, besides retransmitting the reconf request chunk,
it would also do the same thing with data RTO timer. like to increase
the appropriate error counts, and perform threshold management, possibly
destroying the asoc if sctp retransmission thresholds are exceeded, just
as section 5.1.1 describes.

This patch is also to add asoc strreset_chunk, it is used to save the
reconf request chunk, so that it can be retransmitted, and to check if
the response is really for this request by comparing the information
inside with the response chunk as well.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:10 -05:00
Xin Long cc16f00f65 sctp: add support for generating stream reconf ssn reset request chunk
This patch is to add asoc strreset_outseq and strreset_inseq for
saving the reconf request sequence, initialize them when create
assoc and process init, and also to define Incoming and Outgoing
SSN Reset Request Parameter described in rfc6525 section 4.1 and
4.2, As they can be in one same chunk as section rfc6525 3.1-3
describes, it makes them in one function.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:55:09 -05:00
Liping Zhang f169fd695b netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family
After adding the following nft rule, then ping 224.0.0.1:
  # nft add rule netdev t c pkttype host counter

The warning complain message will be printed out again and again:
  WARNING: CPU: 0 PID: 10182 at net/netfilter/nft_meta.c:163 \
           nft_meta_get_eval+0x3fe/0x460 [nft_meta]
  [...]
  Call Trace:
  <IRQ>
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_null+0x1d/0x20
  nft_meta_get_eval+0x3fe/0x460 [nft_meta]
  nft_do_chain+0xff/0x5e0 [nf_tables]

So we should deal with PACKET_LOOPBACK in netdev family too. For ipv4,
convert it to PACKET_BROADCAST/MULTICAST according to the destination
address's type; For ipv6, convert it to PACKET_MULTICAST directly.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18 20:32:43 +01:00
Liping Zhang 9a6d876262 netfilter: pkttype: unnecessary to check ipv6 multicast address
Since there's no broadcast address in IPV6, so in ipv6 family, the
PACKET_LOOPBACK must be multicast packets, there's no need to check
it again.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18 20:32:43 +01:00
Josef Bacik 637bc8bbe6 inet: reset tb->fastreuseport when adding a reuseport sk
If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
never set it again.  Which means that in the future if we end up adding a bunch
of reuseport sk's to that tb we'll have to do the expensive scan every time.
Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
so we know what comparison to make, and the ipv6 only setting so we can make
sure to compare with new sockets appropriately.  Once one sk has made it onto
the list we know that there are no potential bind conflicts on the owners list
that match that sk's rcv_addr.  So copy the sk's information into our bind
bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do
an extra check for subsequent reuseport sockets and skip the expensive bind
conflict check.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 13:04:29 -05:00
Josef Bacik 289141b768 inet: split inet_csk_get_port into two functions
inet_csk_get_port does two different things, it either scans for an open port,
or it tries to see if the specified port is available for use.  Since these two
operations have different rules and are basically independent lets split them
into two different functions to make them both more readable.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 13:04:29 -05:00
Josef Bacik 6cd6661683 inet: don't check for bind conflicts twice when searching for a port
This is just wasted time, we've already found a tb that doesn't have a bind
conflict, and we don't drop the head lock so scanning again isn't going to give
us a different answer.  Instead move the tb->reuse setting logic outside of the
found_tb path and put it in the success: path.  Then make it so that we don't
goto again if we find a bind conflict in the found_tb path as we won't reach
this anymore when we are scanning for an ephemeral port.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 13:04:29 -05:00
Josef Bacik b9470c2760 inet: kill smallest_size and smallest_port
In inet_csk_get_port we seem to be using smallest_port to figure out where the
best place to look for a SO_REUSEPORT sk that matches with an existing set of
SO_REUSEPORT's.  However if we get to the logic

if (smallest_size != -1) {
	port = smallest_port;
	goto have_port;
}

we will do a useless search, because we would have already done the
inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
would have gone to found_tb and succeeded.  Since this logic makes us do yet
another trip through inet_csk_bind_conflict for a port we know won't work just
delete this code and save us the time.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 13:04:29 -05:00
Josef Bacik aa078842b7 inet: drop ->bind_conflict
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
is how they check the rcv_saddr, so delete this call back and simply
change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 13:04:28 -05:00
Josef Bacik fe38d2a1c8 inet: collapse ipv4/v6 rcv_saddr_equal functions into one
We pass these per-protocol equal functions around in various places, but
we can just have one function that checks the sk->sk_family and then do
the right comparison function.  I've also changed the ipv4 version to
not cast to inet_sock since it is unneeded.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 13:04:28 -05:00
Ilya Dryomov 124f930b8c libceph: make sure ceph_aes_crypt() IV is aligned
... otherwise the crypto stack will align it for us with a GFP_ATOMIC
allocation and a memcpy() -- see skcipher_walk_first().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-01-18 17:58:45 +01:00
Jason Baron 0e40f4c959 tcp: accept RST for rcv_nxt - 1 after receiving a FIN
Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (they are rate-limited) does not
reply with any RST as would be expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

A packetdrill test demonstrating the behavior:

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 15:51:55 -05:00
Gao Feng a7ef6715c4 net: ping: Use right format specifier to avoid type casting
The inet_num is u16, so use %hu instead of casting it to int. And
the sk_bound_dev_if is int actually, so it needn't cast to int.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 15:25:55 -05:00
Lance Richardson 53631a5f9c bridge: sparse fixes in br_ip6_multicast_alloc_query()
Changed type of csum field in struct igmpv3_query from __be16 to
__sum16 to eliminate type warning, made same change in struct
igmpv3_report for consistency.

Fixed up an ntohs() where htons() should have been used instead.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 15:22:05 -05:00
David S. Miller 580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Robert Shearman 27d691056b mpls: Packet stats
Having MPLS packet stats is useful for observing network operation and
for diagnosing network problems. In the absence of anything better,
RFC2863 and RFC3813 are used for guidance for which stats to expose
and the semantics of them. In particular rx_noroutes maps to in
unknown protos in RFC2863. The stats are exposed to userspace via
AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
RTM_GETSTATS messages.

All the introduced fields are 64-bit, even error ones, to ensure no
overflow with long uptimes. Per-CPU counters are used to avoid
cache-line contention on the commonly used fields. The other fields
have also been made per-CPU for code to avoid performance problems in
error conditions on the assumption that on some platforms the cost of
atomic operations could be more expensive than sending the packet
(which is what would be done in the success case). If that's not the
case, we could instead not use per-CPU counters for these fields.

Only unicast and non-fragment are exposed at the moment, but other
counters can be exposed in the future either by adding to the end of
struct mpls_link_stats or by additional netlink attributes in the
AF_MPLS IFLA_STATS_AF_SPEC nested attribute.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 14:38:43 -05:00
Robert Shearman aefb4d4ad8 net: AF-specific RTM_GETSTATS attributes
Add the functionality for including address-family-specific per-link
stats in RTM_GETSTATS messages. This is done through adding a new
IFLA_STATS_AF_SPEC attribute under which address family attributes are
nested and then the AF-specific attributes can be further nested. This
follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
advantage of presenting an easily extended hierarchy. The rtnl_af_ops
structure is extended to provide AFs with the opportunity to fill and
provide the size of their stats attributes.

One alternative would have been to provide AFs with the ability to add
attributes directly into the RTM_GETSTATS message without a nested
hierarchy. I discounted this approach as it increases the rate at
which the 32 attribute number space is used up and it makes
implementation a little more tricky for stats dump resuming (at the
moment the order in which attributes are added to the message has to
match the numeric order of the attributes).

Another alternative would have been to register per-AF RTM_GETSTATS
handlers. I discounted this approach as I perceived a common use-case
to be getting all the stats for an interface and this approach would
necessitate multiple requests/dumps to retrieve them all.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17 14:38:43 -05:00
Linus Torvalds 4b19a9e20b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle multicast packets properly in fast-RX path of mac80211, from
    Johannes Berg.

 2) Because of a logic bug, the user can't actually force SW
    checksumming on r8152 devices. This makes diagnosis of hw
    checksumming bugs really annoying. Fix from Hayes Wang.

 3) VXLAN route lookup does not take the source and destination ports
    into account, which means IPSEC policies cannot be matched properly.
    Fix from Martynas Pumputis.

 4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.

 5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.

 6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
    construction properly in fib_dump_info(), from David Ahern.

 7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
    Schmidt.

 8) Openvswitch conntack actions need to maintain a correct checksum,
    fix from Lance Richardson.

 9) ax25_disconnect() is missing a check for ax25->sk being NULL, in
    fact it already checks this, but not in all of the necessary spots.
    Fix from Basil Gunn.

10) Action GET operations in the packet scheduler can erroneously bump
    the reference count of the entry, making it unreleasable. Fix from
    Jamal Hadi Salim. Jamal gives a great set of example command lines
    that trigger this in the commit message.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  net sched actions: fix refcnt when GETing of action after bind
  net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
  net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
  net/mlx4_core: Fix racy CQ (Completion Queue) free
  net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
  net/mlx5e: Fix a -Wmaybe-uninitialized warning
  ax25: Fix segfault after sock connection timeout
  bpf: rework prog_digest into prog_tag
  tipc: allocate user memory with GFP_KERNEL flag
  net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
  ip6_tunnel: Account for tunnel header in tunnel MTU
  mld: do not remove mld souce list info when set link down
  be2net: fix MAC addr setting on privileged BE3 VFs
  be2net: don't delete MAC on close on unprivileged BE3 VFs
  be2net: fix status check in be_cmd_pmac_add()
  cpmac: remove hopeless #warning
  ravb: do not use zero-length alignment DMA descriptor
  mlx4: do not call napi_schedule() without care
  openvswitch: maintain correct checksum state in conntrack actions
  tcp: fix tcp_fastopen unaligned access complaints on sparc
  ...
2017-01-17 09:33:10 -08:00
Steffen Klassert eb758c8864 esp: Introduce a helper to setup the trailer
We need to setup the trailer in two different cases,
so add a helper to avoid code duplication.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-17 10:23:08 +01:00
Steffen Klassert 03e2a30f6a esp6: Avoid skb_cow_data whenever possible
This patch tries to avoid skb_cow_data on esp6.

On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.

On the decrypt side, we leave the buffer as it is
if it is not cloned.

With this, we can avoid a linearization of the buffer
in most of the cases.

Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-17 10:23:03 +01:00
Steffen Klassert cac2661c53 esp4: Avoid skb_cow_data whenever possible
This patch tries to avoid skb_cow_data on esp4.

On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.

On the decrypt side, we leave the buffer as it is
if it is not cloned.

With this, we can avoid a linearization of the buffer
in most of the cases.

Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-17 10:22:57 +01:00
Jamal Hadi Salim 0faa9cb5b3 net sched actions: fix refcnt when GETing of action after bind
Demonstrating the issue:

.. add a drop action
$sudo $TC actions add action drop index 10

.. retrieve it
$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 0 installed 29 sec used 29 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

... bug 1 above: reference is two.
    Reference is actually 1 but we forget to subtract 1.

... do a GET again and we see the same issue
    try a few times and nothing changes
~$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 0 installed 31 sec used 31 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

... lets try to bind the action to a filter..
$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
  u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

... and now a few GETs:
$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 3 bind 1 installed 204 sec used 204 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 4 bind 1 installed 206 sec used 206 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 5 bind 1 installed 235 sec used 235 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

.... as can be observed the reference count keeps going up.

After the fix

$ sudo $TC actions add action drop index 10
$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 1 bind 0 installed 4 sec used 4 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 1 bind 0 installed 6 sec used 6 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
  u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 1 installed 32 sec used 32 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 1 installed 33 sec used 33 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

Fixes: aecc5cefc3 ("net sched actions: fix GETing actions")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 19:43:19 -05:00
Paul Blakey a3308d8fd1 net/sched: cls_flower: Disallow duplicate internal elements
Flower currently allows having the same filter twice with the same
priority. Actions (and statistics update) will always execute on the
first inserted rule leaving the second rule unused.
This patch disallows that.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:05:44 -05:00
Basil Gunn 8a367e74c0 ax25: Fix segfault after sock connection timeout
The ax.25 socket connection timed out & the sock struct has been
previously taken down ie. sock struct is now a NULL pointer. Checking
the sock_flag causes the segfault.  Check if the socket struct pointer
is NULL before checking sock_flag. This segfault is seen in
timed out netrom connections.

Please submit to -stable.

Signed-off-by: Basil Gunn <basil@pacabunga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:39:58 -05:00
Daniel Borkmann f1f7714ea5 bpf: rework prog_digest into prog_tag
Commit 7bd509e311 ("bpf: add prog_digest and expose it via
fdinfo/netlink") was recently discussed, partially due to
admittedly suboptimal name of "prog_digest" in combination
with sha1 hash usage, thus inevitably and rightfully concerns
about its security in terms of collision resistance were
raised with regards to use-cases.

The intended use cases are for debugging resp. introspection
only for providing a stable "tag" over the instruction sequence
that both kernel and user space can calculate independently.
It's not usable at all for making a security relevant decision.
So collisions where two different instruction sequences generate
the same tag can happen, but ideally at a rather low rate. The
"tag" will be dumped in hex and is short enough to introspect
in tracepoints or kallsyms output along with other data such
as stack trace, etc. Thus, this patch performs a rename into
prog_tag and truncates the tag to a short output (64 bits) to
make it obvious it's not collision-free.

Should in future a hash or facility be needed with a security
relevant focus, then we can think about requirements, constraints,
etc that would fit to that situation. For now, rework the exposed
parts for the current use cases as long as nothing has been
released yet. Tested on x86_64 and s390x.

Fixes: 7bd509e311 ("bpf: add prog_digest and expose it via fdinfo/netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:03:31 -05:00
Marcelo Ricardo Leitner cdfb1a9f30 sctp: remove useless code from sctp_apply_peer_addr_params
sctp_frag_point() doesn't store anything, and thus just calling it
cannot do anything useful.

sctp_apply_peer_addr_params is only called by
sctp_setsockopt_peer_addr_params. When operating on an asoc,
sctp_setsockopt_peer_addr_params will call sctp_apply_peer_addr_params
once for the asoc, and then once for each transport this asoc has,
meaning that the frag_point will be recomputed when updating the
transports and calling it when updating the asoc is not necessary.
IOW, no action is needed here and we can remove this call.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:51:40 -05:00
Marcelo Ricardo Leitner 11d05ac1df sctp: remove unused var from sctp_process_asconf
Assigned but not used.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:51:40 -05:00
Colin Ian King 57b68ec2a7 flow dissector: check if arp_eth is null rather than arp
arp is being checked instead of arp_eth to see if the call to
__skb_header_pointer failed. Fix this by checking arp_eth is
null instead of arp.   Also fix to use length hlen rather than
hlen - sizeof(_arp); thanks to Eric Dumazet for spotting
this latter issue.

CoverityScan CID#1396428 ("Logically dead code") on 2nd
arp comparison (which should be arp_eth instead).

Fixes: commit 55733350e5 ("flow disector: ARP support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:48:48 -05:00
Eric Dumazet e89df81317 netlink: do not enter direct reclaim from netlink_trim()
In commit d35c99ff77 ("netlink: do not enter direct reclaim from
netlink_dump()") we made sure to not trigger expensive memory reclaim.

Problem is that a bit later, netlink_trim() might be called and
trigger memory reclaim.

netlink_trim() should be best effort, and really as fast as possible.
Under memory pressure, it is fine to not trim this skb.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:39:35 -05:00
Parthasarathy Bhuvaragan 57d5f64d83 tipc: allocate user memory with GFP_KERNEL flag
Until now, we allocate memory always with GFP_ATOMIC flag.
When the system is under memory pressure and a user tries to send,
the send fails due to low memory. However, the user application
can wait for free memory if we allocate it using GFP_KERNEL flag.

In this commit, we use allocate memory with GFP_KERNEL for all user
allocation.

Reported-by: Rune Torgersen <runet@innovsys.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:31:53 -05:00
Jakub Sitnicki 02ca0423fd ip6_tunnel: Account for tunnel header in tunnel MTU
With ip6gre we have a tunnel header which also makes the tunnel MTU
smaller. We need to reserve room for it. Previously we were using up
space reserved for the Tunnel Encapsulation Limit option
header (RFC 2473).

Also, after commit b05229f442 ("gre6: Cleanup GREv6 transmit path,
call common GRE functions") our contract with the caller has
changed. Now we check if the packet length exceeds the tunnel MTU after
the tunnel header has been pushed, unlike before.

This is reflected in the check where we look at the packet length minus
the size of the tunnel header, which is already accounted for in tunnel
MTU.

Fixes: b05229f442 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:22:12 -05:00
Hangbin Liu 1666d49e1d mld: do not remove mld souce list info when set link down
This is an IPv6 version of commit 24803f38a5 ("igmp: do not remove igmp
souce list..."). In mld_del_delrec(), we will restore back all source filter
info instead of flush them.

Move mld_clear_delrec() from ipv6_mc_down() to ipv6_mc_destroy_dev() since
we should not remove source list info when set link down. Remove
igmp6_group_dropped() in ipv6_mc_destroy_dev() since we have called it in
ipv6_mc_down().

Also clear all source info after igmp6_group_dropped() instead of in it
because ipv6_mc_down() will call igmp6_group_dropped().

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 12:44:59 -05:00
Linus Torvalds 2eabb8b8d6 Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for older
bugs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYfOk6AAoJECebzXlCjuG+Lj4QALaLKRRbIdrz6nmg7gUmpTWc
 CdW8NMbzwSCXmYoivsTHBlhXZKsi5vVjnFXMCM/P85ddmipXdcTFCDLmmNoKUQ0M
 jODlLX90ctaZKCDBVSaH4htAz2gkFv7z5IllX0YDQqHyiuzh/9KoV+AFCgPZPTpL
 O1XRmfWz+yJDydz4hb3i5f2JvMk9P/tCXLnheuxxTIMSl2/fIfgF81eWwDpFqcA2
 27+PyWWjZehVnZ77ca/mWJj2n0+gBINiKafcfF39NK/Hv2q4aauB3k7c4blecc9Q
 m/IT3mKifvHvdNCmvHD5s74h4OikEGYpqaSjonMptZnWgfM4/gtF7yTiQjsOMDx/
 w6W/tfHlGrvegpzhjaIaoZZ50EZp7xwGNNZYgH4J44kytYpolrhsOR6NqCLTqpej
 xG2Kd89ZtnAgc/7T7ET/1PqpZ8f9M9pyV3E8s36OvF4AYQUNrfzbWSTQcZy3WGBP
 YuoUCzacIbNbGgu4m6Zx5l/vKW5yn45xbUMp7T9S4WoxYMx6a5vViU0NiF7KsQDu
 pcDT92DZ57KJFtCw7Ig08ILKsSXmNApH5/4mIrkX3quZuH4j2XapEJ9u//fmfZBd
 Q+Sgv8RXcGELUJIg9yfmoWgPDA/oYslc7ynBV0lXLNgBuod//dGSlZ+6KfFFJYr8
 XVOxwPTiiBIlc9lvB9eA
 =tb4L
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for
  older bugs"

* tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: avoid duplicate dma unmapping during error recovery
  sunrpc: don't call sleeping functions from the notifier block callbacks
  svcrpc: don't leak contexts on PROC_DESTROY
  nfsd: fix supported attributes for acl & labels
2017-01-16 09:34:37 -08:00
William Breathitt Gray e4670b058a netfilter: Fix typo in NF_CONNTRACK Kconfig option description
The NF_CONNTRACK Kconfig option description makes an incorrect reference
to the "meta" expression where the "ct" expression would be correct.This
patch fixes the respective typographical error.

Fixes: d497c63527 ("netfilter: add help information to new nf_tables Kconfig options")
Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-16 14:23:02 +01:00
Liping Zhang d21e540b4d netfilter: nf_tables: fix possible oops when dumping stateful objects
When dumping nft stateful objects, if NFTA_OBJ_TABLE and NFTA_OBJ_TYPE
attributes are not specified either, filter will become NULL, so oops
will happen(actually nft utility will always set NFTA_OBJ_TABLE attr,
so I write a test program to make this happen):

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: nf_tables_dump_obj+0x17c/0x330 [nf_tables]
  [...]
  Call Trace:
  ? nf_tables_dump_obj+0x5/0x330 [nf_tables]
  ? __kmalloc_reserve.isra.35+0x31/0x90
  ? __alloc_skb+0x5b/0x1e0
  netlink_dump+0x124/0x2a0
  __netlink_dump_start+0x161/0x190
  nf_tables_getobj+0xe8/0x280 [nf_tables]

Fixes: a9fea2a3c3 ("netfilter: nf_tables: allow to filter stateful object dumps by type")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-16 14:23:02 +01:00
Liping Zhang 6443ebc3fd netfilter: rpfilter: fix incorrect loopback packet judgment
Currently, we check the existing rtable in PREROUTING hook, if RTCF_LOCAL
is set, we assume that the packet is loopback.

But this assumption is incorrect, for example, a packet encapsulated
in ipsec transport mode was received and routed to local, after
decapsulation, it would be delivered to local again, and the rtable
was not dropped, so RTCF_LOCAL check would trigger. But actually, the
packet was not loopback.

So for these normal loopback packets, we can check whether the in device
is IFF_LOOPBACK or not. For these locally generated broadcast/multicast,
we can check whether the skb->pkt_type is PACKET_LOOPBACK or not.

Finally, there's a subtle difference between nft fib expr and xtables
rpfilter extension, user can add the following nft rule to do strict
rpfilter check:
  # nft add rule x y meta iif eth0 fib saddr . iif oif != eth0 drop

So when the packet is loopback, it's better to store the in device
instead of the LOOPBACK_IFINDEX, otherwise, after adding the above
nft rule, locally generated broad/multicast packets will be dropped
incorrectly.

Fixes: f83a7ea207 ("netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too")
Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-16 14:23:01 +01:00
Gilad Ben-Yossef 726282aa6b IPsec: do not ignore crypto err in ah6 input
ah6 input processing uses the asynchronous hash crypto API which
supplies an error code as part of the operation completion but
the error code was being ignored.

Treat a crypto API error indication as a verification failure.

While a crypto API reported error would almost certainly result
in a memcpy of the digest failing anyway and thus the security
risk seems minor, performing a memory compare on what might be
uninitialized memory is wrong.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-16 12:57:48 +01:00
Gilad Ben-Yossef ebd89a2d06 IPsec: do not ignore crypto err in ah4 input
ah4 input processing uses the asynchronous hash crypto API which
supplies an error code as part of the operation completion but
the error code was being ignored.

Treat a crypto API error indication as a verification failure.

While a crypto API reported error would almost certainly result
in a memcpy of the digest failing anyway and thus the security
risk seems minor, performing a memory compare on what might be
uninitialized memory is wrong.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-16 12:57:48 +01:00
Florian Westphal 3819a35fdb xfrm: fix possible null deref in xfrm_init_tempstate
Dan reports following smatch warning:
 net/xfrm/xfrm_state.c:659
 error: we previously assumed 'afinfo' could be null (see line 651)

 649  struct xfrm_state_afinfo *afinfo = xfrm_state_afinfo_get_rcu(family);
 651  if (afinfo)
		...
 658  }
 659  afinfo->init_temprop(x, tmpl, daddr, saddr);

I am resonably sure afinfo cannot be NULL here.

xfrm_state4.c and state6.c are both part of ipv4/ipv6 (depends on
CONFIG_XFRM, a boolean) but even if ipv6 is a module state6.c can't
be removed (ipv6 lacks module_exit so it cannot be removed).

The only callers for xfrm6_fini that leads to state backend unregister
are error unwinding paths that can be called during ipv6 init function.

So after ipv6 module is loaded successfully the state backend cannot go
away anymore.

The family value from policy lookup path is taken from dst_entry, so
that should always be AF_INET(6).

However, since this silences the warning and avoids readers of this
code wondering about possible null deref it seems preferrable to
be defensive and just add the old check back.

Fixes: 711059b975 ("xfrm: add and use xfrm_state_afinfo_get_rcu")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-01-16 08:36:33 +01:00
David S. Miller 1a717fcf8b We have a number of fixes, in part because I was late
in actually sending them out - will try to do better in
 the future:
  * handle VHT opmode properly when hostapd is controlling
    full station state
  * two fixes for minimum channel width in mac80211
  * don't leave SMPS set to junk in HT capabilities
  * fix headroom when forwarding mesh packets, recently
    broken by another fix that failed to take into account
    frame encryption
  * fix the TID in null-data packets indicating EOSP (end
    of service period) in U-APSD
  * prevent attempting to use (and then failing which
    results in crashes) TXQs on stations that aren't added
    to the driver yet
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYeNveAAoJEGt7eEactAAdCjgP+gIIUjpH08MazRE9Nl6gqRGW
 G6EAoaQDQ8AjiVo8nIJ33X2jmHRIPcyelyfMhdww3AtMJRrvl9wVhQvwKWI4+l6D
 o8UVMBu66fq2luQA6AtOkkU6SkKwC7Se72Mdx6OA/zvdZz4/9Y4nZpOFPVwCbxqL
 UJRzjrhnbS51YgL/Y0s0Vp+7Rv7IYCuH9JORNarMO5sYxaVpLWihJhbShX1bOshw
 uFTHOjNRseImLhM4GOvVA7fSUiK8jxEuMECmlDKQB/6nVxMskE54yLOqMB5Ys6va
 2CKjp5xqM4FfqB4LMNp7soJAXUOXvqk25JXAAbkVNo4VRd2Y4GoJ7XdBPqd4kfKb
 rPlr+UP2xaSQsfqIoe+uFr/lVUmm4oCTrS/Mo7YVrjSMU7fntYwZccr9aw5jrQFW
 YU+1QTF25HEb1SL18FH9JNWpoTlOlpB3bwpbAW4BHzsqDYc76CE/oyDLdZ6zQYlu
 z92cLuycGTwgNhi1zDNThxB81zhZsWH1Jbh9ppZBdDPxo6E4DMK71okreGAXBEMJ
 IdFZZqJbYW4I/TsUtse4atozk6oXlTEFY/pX4Qm0gwGwmqQx+wfSNLbymsD7gR42
 OkL+bZ701HBl2gJUWuICrM2lFWtD4/o6oWpaW2I7QUAQhjvUr4ld9kthVMF2vdlt
 w316EIfZQzBKw7Z7zjZI
 =GU6q
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
We have a number of fixes, in part because I was late
in actually sending them out - will try to do better in
the future:
 * handle VHT opmode properly when hostapd is controlling
   full station state
 * two fixes for minimum channel width in mac80211
 * don't leave SMPS set to junk in HT capabilities
 * fix headroom when forwarding mesh packets, recently
   broken by another fix that failed to take into account
   frame encryption
 * fix the TID in null-data packets indicating EOSP (end
   of service period) in U-APSD
 * prevent attempting to use (and then failing which
   results in crashes) TXQs on stations that aren't added
   to the driver yet
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-15 22:17:59 -05:00
Lance Richardson 75f01a4c9c openvswitch: maintain correct checksum state in conntrack actions
When executing conntrack actions on skbuffs with checksum mode
CHECKSUM_COMPLETE, the checksum must be updated to account for
header pushes and pulls. Otherwise we get "hw csum failure"
logs similar to this (ICMP packet received on geneve tunnel
via ixgbe NIC):

[  405.740065] genev_sys_6081: hw csum failure
[  405.740106] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G          I     4.10.0-rc3+ #1
[  405.740108] Call Trace:
[  405.740110]  <IRQ>
[  405.740113]  dump_stack+0x63/0x87
[  405.740116]  netdev_rx_csum_fault+0x3a/0x40
[  405.740118]  __skb_checksum_complete+0xcf/0xe0
[  405.740120]  nf_ip_checksum+0xc8/0xf0
[  405.740124]  icmp_error+0x1de/0x351 [nf_conntrack_ipv4]
[  405.740132]  nf_conntrack_in+0xe1/0x550 [nf_conntrack]
[  405.740137]  ? find_bucket.isra.2+0x62/0x70 [openvswitch]
[  405.740143]  __ovs_ct_lookup+0x95/0x980 [openvswitch]
[  405.740145]  ? netif_rx_internal+0x44/0x110
[  405.740149]  ovs_ct_execute+0x147/0x4b0 [openvswitch]
[  405.740153]  do_execute_actions+0x22e/0xa70 [openvswitch]
[  405.740157]  ovs_execute_actions+0x40/0x120 [openvswitch]
[  405.740161]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
[  405.740166]  ovs_vport_receive+0x73/0xd0 [openvswitch]
[  405.740168]  ? udp_rcv+0x1a/0x20
[  405.740170]  ? ip_local_deliver_finish+0x93/0x1e0
[  405.740172]  ? ip_local_deliver+0x6f/0xe0
[  405.740174]  ? ip_rcv_finish+0x3a0/0x3a0
[  405.740176]  ? ip_rcv_finish+0xdb/0x3a0
[  405.740177]  ? ip_rcv+0x2a7/0x400
[  405.740180]  ? __netif_receive_skb_core+0x970/0xa00
[  405.740185]  netdev_frame_hook+0xd3/0x160 [openvswitch]
[  405.740187]  __netif_receive_skb_core+0x1dc/0xa00
[  405.740194]  ? ixgbe_clean_rx_irq+0x46d/0xa20 [ixgbe]
[  405.740197]  __netif_receive_skb+0x18/0x60
[  405.740199]  netif_receive_skb_internal+0x40/0xb0
[  405.740201]  napi_gro_receive+0xcd/0x120
[  405.740204]  gro_cell_poll+0x57/0x80 [geneve]
[  405.740206]  net_rx_action+0x260/0x3c0
[  405.740209]  __do_softirq+0xc9/0x28c
[  405.740211]  irq_exit+0xd9/0xf0
[  405.740213]  do_IRQ+0x51/0xd0
[  405.740215]  common_interrupt+0x93/0x93

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-15 22:15:16 -05:00
David S. Miller bb60b8b35a For 4.11, we seem to have more than in the past few releases:
* socket owner support for connections, so when the wifi
    manager (e.g. wpa_supplicant) is killed, connections are
    torn down - wpa_supplicant is critical to managing certain
    operations, and can opt in to this where applicable
  * minstrel & minstrel_ht updates to be more efficient (time and space)
  * set wifi_acked/wifi_acked_valid for skb->destructor use in the
    kernel, which was already available to userspace
  * don't indicate new mesh peers that might be used if there's no
    room to add them
  * multicast-to-unicast support in mac80211, for better medium usage
    (since unicast frames can use *much* higher rates, by ~3 orders of
    magnitude)
  * add API to read channel (frequency) limitations from DT
  * add infrastructure to allow randomizing public action frames for
    MAC address privacy (still requires driver support)
  * many cleanups and small improvements/fixes across the board
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYeKu7AAoJEGt7eEactAAdwjEP/RA4bXFMfkC7qUJ++cLrMMwY
 yCvjb8+ULWL2wbCzpfY37acbGJgot3DNoQJzrO2jMQPqyM9nRlTMg5aF49cI7t62
 gU6daNKJaGBe/0yeG7lTJ4n5UtVCDtN45hGc06Yert+ewb9njiJf+XYrtCWetsIJ
 5bOLYQKPWOz/7UyMH7uJ25zrPFaiA3y7XnXKPEudagG/EwEq9ZuUpSSfLwEAEBPi
 6i/2w4fLj32vXRsQMvQT0sU6mjd+1ub8Is7w5l2F06iWwNYPzdSM0IbU+E+ie2tk
 sE6RA70c4ILrp8KisTAz2lJPa4XEpFkLhI3lzRRy8CVzjyyo/OJen92zvr2R7TVb
 /uZG9qfRQ3UitQmgeKd+wS8PsbRAyWUR/xhNxD2r7zARH2vliwyneU+zEpXLeGA1
 Y4PrN1+Fk45Ye4/4XSbPO4cf1MHX7qinN4rjrpsJKPwoYD/gQ1cZvef4AbaKPvq6
 oCKRVrwNoUuSB8NTcMLPqze3WCfhnJyVUhCZTyzHeW4uG81qrHwrvBvM25vcWGcm
 CcSWFktFIpuGML4FCU3byZfb0NkmJtpCD4n7P98WFPGjvsWIEVCMckqlC8x1F7B7
 BqqjGS2mGA17Xy0uLfmN/JempesQJnZhnAnFERdyX1S1YQuKhLwEu7OsYegnStDL
 Cn1wFw2/qcgeTkJfBICB
 =UToW
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2017-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
For 4.11, we seem to have more than in the past few releases:
 * socket owner support for connections, so when the wifi
   manager (e.g. wpa_supplicant) is killed, connections are
   torn down - wpa_supplicant is critical to managing certain
   operations, and can opt in to this where applicable
 * minstrel & minstrel_ht updates to be more efficient (time and space)
 * set wifi_acked/wifi_acked_valid for skb->destructor use in the
   kernel, which was already available to userspace
 * don't indicate new mesh peers that might be used if there's no
   room to add them
 * multicast-to-unicast support in mac80211, for better medium usage
   (since unicast frames can use *much* higher rates, by ~3 orders of
   magnitude)
 * add API to read channel (frequency) limitations from DT
 * add infrastructure to allow randomizing public action frames for
   MAC address privacy (still requires driver support)
 * many cleanups and small improvements/fixes across the board
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-14 12:02:15 -05:00
Yuchung Cheng 94bdc9785a tcp: disable fack by default
This patch disables FACK by default as RACK is the successor of FACK
(inspired by the insights behind FACK).

FACK[1] in Linux works as follows: a packet P is deemed lost,
if packet Q of higher sequence is s/acked and P and Q are distant
by at least dupthresh number of packets in sequence space.

FACK is more aggressive than the IETF recommened recovery for SACK
(RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
 Recovery Algorithm for TCP), because a single SACK may trigger
fast recovery. This obviously won't work well with reordering so
FACK is dynamically disabled upon detecting reordering.

RACK supersedes FACK by using time distance instead of sequence
distance. On reordering, RACK waits for a quarter of RTT receiving
a single SACK before starting recovery. (the timer can be made more
adaptive in the future by measuring reordering distance in time,
but currently RTT/4 seem to work well.) Once the recovery starts,
RACK behaves almost like FACK because it reduces the reodering
window to 1ms, so it fast retransmits quickly. In addition RACK
can detect loss retransmission as it does not care about the packet
sequences (being repeated or not), which is extremely useful when
the connection is going through a traffic policer.

Google server experiments indicate that disabling FACK after enabling
RACK has negligible impact on the overall loss recovery performance
with more reordering events detected.  But we still keep the FACK
implementation for backup if RACK has bugs that needs to be disabled.

[1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng 4a7f600944 tcp: remove thin_dupack feature
Thin stream DUPACK is to start fast recovery on only one DUPACK
provided the connection is a thin stream (i.e., low inflight).  But
this older feature is now subsumed with RACK. If a connection
receives only a single DUPACK, RACK would arm a reordering timer
and soon starts fast recovery instead of timeout if no further
ACKs are received.

The socket option (THIN_DUPACK) is kept as a nop for compatibility.
Note that this patch does not change another thin-stream feature
which enables linear RTO. Although it might be good to generalize
that in the future (i.e., linear RTO for the first say 3 retries).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng ac229dca7e tcp: remove RFC4653 NCR
This patch removes the (partial) implementation of the aggressive
limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).

NCR is a mitigation to the problem created by the dynamic
DUPACK threshold.  With the current adaptive DUPACK threshold
(tp->reordering) could cause timeouts by preventing fast recovery.
For example, if the last packet of a cwnd burst was reordered, the
threshold will be set to the size of cwnd. But if next application
burst is smaller than threshold and has drops instead of reorderings,
the sender would not trigger fast recovery but instead resorts to a
timeout recovery.

NCR mitigates this issue by checking the number of DUPACKs against
the current flight size additionally. The techniqueue is similar to
the early retransmit RFC.

With RACK loss detection, this mitigation is not needed, because RACK
does not use DUPACK threshold to detect losses. RACK arms a reordering
timer to fire at most a quarter RTT later to start fast recovery.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng bec41a11dd tcp: remove early retransmit
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng 840a3cbe89 tcp: remove forward retransmit feature
Forward retransmit is an esoteric feature in RFC3517 (condition(3)
in the NextSeg()). Basically if a packet is not considered lost by
the current criteria (# of dupacks etc), but the congestion window
has room for more packets, then retransmit this packet.

However it actually conflicts with the rest of recovery design. For
example, when reordering is detected we want to be conservative
in retransmitting packets but forward-retransmit feature would
break that to force more retransmission. Also the implementation is
fairly complicated inside the retransmission logic inducing extra
iterations in the write queue. With RACK losses are being detected
timely and this heuristic is no longer necessary. There this patch
removes the feature.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng 89fe18e44f tcp: extend F-RTO to catch more spurious timeouts
Current F-RTO reverts cwnd reset whenever a never-retransmitted
packet was (s)acked. The timeout can be declared spurious because
the packets acknoledged with this ACK was transmitted before the
timeout, so clearly not all the packets are lost to reset the cwnd.

This nice detection does not really depend F-RTO internals. This
patch applies the detection universally. On Google servers this
change detected 20% more spurious timeouts.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng a0370b3f3f tcp: enable RACK loss detection to trigger recovery
This patch changes two things:

1. Start fast recovery with RACK in addition to other heuristics
   (e.g., DUPACK threshold, FACK). Prior to this change RACK
   is enabled to detect losses only after the recovery has
   started by other algorithms.

2. Disable TCP early retransmit. RACK subsumes the early retransmit
   with the new reordering timer feature. A latter patch in this
   series removes the early retransmit code.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng 98e36d449c tcp: check undo conditions before detecting losses
Currently RACK would mark loss before the undo operations in TCP
loss recovery. This could incorrectly identify real losses as
spurious. For example a sender first experiences a delay spike and
then eventually some packets were lost due to buffer overrun.
In this case, the sender should perform fast recovery b/c not all
the packets were lost.

But the sender may first trigger a (spurious) RTO and reset
cwnd to 1. The following ACKs may used to mark real losses by
tcp_rack_mark_lost. Then in tcp_process_loss this ACK could trigger
F-RTO undo condition and unmark real losses and revert the cwnd
reduction. If there are no more ACKs coming back, eventually the
sender would timeout again instead of performing fast recovery.

The patch fixes this incorrect process by always performing
the undo checks before detecting losses.

Fixes: 4f41b1c58a ("tcp: use RACK to detect losses")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng 1d0833df59 tcp: use sequence to break TS ties for RACK loss detection
The packets inside a jumbo skb (e.g., TSO) share the same skb
timestamp, even though they are sent sequentially on the wire. Since
RACK is based on time, it can not detect some packets inside the
same skb are lost.  However, we can leverage the packet sequence
numbers as extended timestamps to detect losses. Therefore, when
RACK timestamp is identical to skb's timestamp (i.e., one of the
packets of the skb is acked or sacked), we use the sequence numbers
of the acked and unacked packets to break ties.

We can use the same sequence logic to advance RACK xmit time as
well to detect more losses and avoid timeout.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng 57dde7f70d tcp: add reordering timer in RACK loss detection
This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accomodate reordering.

It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to

  RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

This translates to expecting a packet (Packet) should take
(RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.

When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.

The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hang the timer. The next jiffy is to lower-bound the timeout
to 2 jiffies when reo_wnd is < 1ms.

When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng deed7be78f tcp: record most recent RTT in RACK loss detection
Record the most recent RTT in RACK. It is often identical to the
"ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
been retransmitted, RACK choses to believe the ACK is for the
(latest) retransmitted packet if the RTT is over minimum RTT.

This requires passing the arrival time of the most recent ACK to
RACK routines. The timestamp is now recorded in the "ack_time"
in tcp_sacktag_state during the ACK processing.

This patch does not change the RACK algorithm itself. It only adds
the RTT variable to prepare the next main patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng e636f8b010 tcp: new helper for RACK to detect loss
Create a new helper tcp_rack_detect_loss to prepare the upcoming
RACK reordering timer patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng db8da6bb57 tcp: new helper function for RACK loss detection
Create a new helper tcp_rack_mark_skb_lost to prepare the
upcoming RACK reordering timer support.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Shannon Nelson 003c941057 tcp: fix tcp_fastopen unaligned access complaints on sparc
Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.

This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 12:31:24 -05:00