Commit Graph

41725 Commits

Author SHA1 Message Date
Xin Long cb2050a7b8 sctp: export some functions for sctp_diag in inet_diag
inet_diag_msg_common_fill is used to fill the diag msg common info,
we need to use it in sctp_diag as well, so export it.

inet_diag_msg_attrs_fill is used to fill some common attrs info between
sctp diag and tcp diag.

v2->v3:
- do not need to define and export inet_diag_get_handler any more.
  cause all the functions in it are in sctp_diag.ko, we just call
  them in sctp_diag.ko.

- add inet_diag_msg_attrs_fill to make codes clear.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:29:36 -04:00
Xin Long 626d16f50f sctp: export some apis or variables for sctp_diag and reuse some for proc
For some main variables in sctp.ko, we couldn't export it to other modules,
so we have to define some api to access them.

It will include sctp transport and endpoint's traversal.

There are some transport traversal functions for sctp_diag, we can also
use it for sctp_proc. cause they have the similar situation to traversal
transport.

v2->v3:
- rhashtable_walk_init need the parameter gfp, because of recent upstrem
  update

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:29:36 -04:00
Xin Long 52c52a61a3 sctp: add sctp_info dump api for sctp_diag
sctp_diag will dump some important details of sctp's assoc or ep, we use
sctp_info to describe them,  sctp_get_sctp_info to get them, and export
it to sctp_diag.ko.

v2->v3:
- we will not use list_for_each_safe in sctp_get_sctp_info, cause
  all the callers of it will use lock_sock.

- fix the holes in struct sctp_info with __reserved* field.
  because sctp_diag is a new feature, and sctp_info is just for now,
  it may be changed in the future.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:29:35 -04:00
Marcelo Ricardo Leitner 311b21774f sctp: simplify sk_receive_queue locking
SCTP already serializes access to rcvbuf through its sock lock:
sctp_recvmsg takes it right in the start and release at the end, while
rx path will also take the lock before doing any socket processing. On
sctp_rcv() it will check if there is an user using the socket and, if
there is, it will queue incoming packets to the backlog. The backlog
processing will do the same. Even timers will do such check and
re-schedule if an user is using the socket.

Simplifying this will allow us to remove sctp_skb_list_tail and get ride
of some expensive lockings.  The lists that it is used on are also
mangled with functions like __skb_queue_tail and __skb_unlink in the
same context, like on sctp_ulpq_tail_event() and sctp_clear_pd().
sctp_close() will also purge those while using only the sock lock.

Therefore the lockings performed by sctp_skb_list_tail() are not
necessary. This patch removes this function and replaces its calls with
just skb_queue_splice_tail_init() instead.

The biggest gain is at sctp_ulpq_tail_event(), because the events always
contain a list, even if it's queueing a single skb and this was
triggering expensive calls to spin_lock_irqsave/_irqrestore for every
data chunk received.

As SCTP will deliver each data chunk on a corresponding recvmsg, the
more effective the change will be.
Before this patch, with chunks with 30 bytes:
netperf -t SCTP_STREAM -H 192.168.1.2 -cC -l 60 -- -m 30 -S 400000
400000 -s 400000 400000
on a 10Gbit link with 1500 MTU:

SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

425984 425984     30    60.00       137.45   7.34     7.36     52.504  52.608

With it:

SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

425984 425984     30    60.00       179.10   7.97     6.70     43.740  36.788

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:22:20 -04:00
Peter Heise ee1c279772 net/hsr: Added support for HSR v1
This patch adds support for the newer version 1 of the HSR
networking standard. Version 0 is still default and the new
version has to be selected via iproute2.

Main changes are in the supervision frame handling and its
ethertype field.

Signed-off-by: Peter Heise <peter.heise@airbus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:06:48 -04:00
Eric Dumazet 8804b2722d tcp: remove false sharing in tcp_rcv_state_process()
Last known hot point during SYNFLOOD attack is the clearing
of rx_opt.saw_tstamp in tcp_rcv_state_process()

It is not needed for a listener, so we move it where it matters.

Performance while a SYNFLOOD hits a single listener socket
went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:45:44 -04:00
Eric Dumazet b3d051477c tcp: do not mess with listener sk_wmem_alloc
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.

We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)

In this patch, I chose to not attach a socket to syncookies skb.

Performance is now 5 Mpps instead of 3.2 Mpps.

Following patch will remove last known false sharing in
tcp_rcv_state_process()

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:45:44 -04:00
Jon Paul Maloy 34b9cd64c8 tipc: let first message on link be a state message
According to the link FSM, a received traffic packet can take a link
from state ESTABLISHING to ESTABLISHED, but the link can still not be
fully set up in one atomic operation. This means that even if the the
very first packet on the link is a traffic packet with sequence number
1 (one), it has to be dropped and retransmitted.

This can be avoided if we let the mentioned packet be preceded by a
LINK_PROTOCOL/STATE message, which takes up the endpoint before the
arrival of the traffic.

We add this small feature in this commit.

This is a fully compatible change.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:06 -04:00
Jon Paul Maloy de7e07f9ee tipc: ensure that first packets on link are sent in order
In some link establishment scenarios we see that packet #2 may be sent
out before packet #1, forcing the receiver to demand retransmission of
the missing packet. This is harmless, but may cause confusion among
people tracing the packet flow.

Since this is extremely easy to fix, we do so by adding en extra send
call to the bearer immediately after the link has come up.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:06 -04:00
Jon Paul Maloy 42b18f605f tipc: refactor function tipc_link_timeout()
The function tipc_link_timeout() is unnecessary complex, and can
easily be made more readable.

We do that with this commit. The only functional change is that we
remove a redundant test for whether the broadcast link is up or not.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:06 -04:00
Jon Paul Maloy 88e8ac7000 tipc: reduce transmission rate of reset messages when link is down
When a link is down, it will continuously try to re-establish contact
with the peer by sending out a RESET or an ACTIVATE message at each
timeout interval. The default value for this interval is currently
375 ms. This is wasteful, and may become a problem in very large
clusters with dozens or hundreds of nodes being down simultaneously.

We now introduce a simple backoff algorithm for these cases. The
first five messages are sent at default rate; thereafter a message
is sent only each 16th timer interval.

This will cover the vast majority of link recycling cases, since the
endpoint starting last will transmit at the higher speed, and the link
should normally be established well be before the rate needs to be
reduced.

The only case where we will see a degradation of link re-establishment
times is when the endpoints remain intact, and a glitch in the
transmission media is causing the link reset. We will then experience
a worst-case re-establishing time of 6 seconds, something we deem
acceptable.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:05 -04:00
Jon Paul Maloy 634696b197 tipc: guarantee peer bearer id exchange after reboot
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.

According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.

However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.

The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.

In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.

Although never reported, this problem seems to have always been around.

This protocol addition is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:05 -04:00
Daniel Borkmann 074f528eed bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK
This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:40:41 -04:00
Marcelo Ricardo Leitner 486bdee013 sctp: add support for RPS and RFS
This patch adds what's missing to properly support RPS and RFS on SCTP,
as some of it is already implemented in common calls.

Having support for RPS and RFS allows better scaling specially because
not all NICs support hashing SCTP headers.

Save the hash right when we dequeue a skb from inqueue so we do it only
once per skb instead of per chunk. New sockets will then inherit the
hash through sctp_copy_sock().

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:40:24 -04:00
Eric Dumazet d21fd63ea3 net: validate_xmit_skb() changes
skbs given to validate_xmit_skb() should not have a next
pointer anymore.

Also if a packet is dropped, increment dev->tx_dropped
__dev_queue_xmit() no longer has to change tx_dropped in this case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:40:24 -04:00
Linus Torvalds 16382ed978 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes an NFS regression caused by the skcipher/hash conversion in
  sunrpc.  It also fixes a build problem in certain configurations with
  bcm63xx"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  hwrng: bcm63xx - fix device tree compilation
  sunrpc: Fix skcipher/shash conversion
2016-04-14 18:15:40 -07:00
Craig Gallek d894ba18d4 soreuseport: fix ordering for mixed v4/v6 sockets
With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:14:03 -04:00
Weongyo Jeong da37845fdc packet: uses kfree_skb() for errors.
consume_skb() isn't for error cases that kfree_skb() is more proper
one.  At this patch, it fixed tpacket_rcv() and packet_rcv() to be
consistent for error or non-error cases letting perf trace its event
properly.

Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 17:50:44 -04:00
Parthasarathy Bhuvaragan 333f796235 tipc: fix a race condition leading to subscriber refcnt bug
Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.

This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.

If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:

[31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380         [tipc]
[31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
[31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
[...]
[31.228377] Call Trace:
[31.228377]  [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
[31.228377]  [<ffffffff8105a311>] __warn+0xd1/0xf0
[31.228377]  [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
[31.228377]  [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228377]  [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
[31.228377]  [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
[31.228377]  [<ffffffff81071925>] process_one_work+0x145/0x3d0
[31.246554] ---[ end trace c3882c9baa05a4fd ]---
[31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
[31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
[31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
[31.249323] PGD 0
[31.249323] Oops: 0000 [#1] SMP

In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
  to a new function tipc_sock_release(), which is executed before
  we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:46:46 -04:00
Martin KaFai Lau e646b657f6 ipv6: udp: Do a route lookup and update during release_cb
This patch adds a release_cb for UDPv6.  It does a route lookup
and updates sk->sk_dst_cache if it is needed.  It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.

It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).

Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:29:53 -04:00
Martin KaFai Lau 33c162a980 ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
   logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.

Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph).  It is because a UDP socket can become
connected after sending out some datagrams in un-connected state.  or
It can be connected multiple times to different destinations.  Hence,
iph may not be related to where sk is currently connected to.

It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect()  (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.

For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.

Test:

Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0  proto kernel  metric 256  pref medium
2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium

A simple python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
    try:
	data = s.recv(1024)
    except socket.error as se:
	if se.errno == errno.EMSGSIZE:
		pmtu = s.getsockopt(41, 24)
		print("PMTU:%d" % pmtu)
		break
s.close()

Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300

Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0  metric 0
    cache  expires 463sec mtu 1300 pref medium

Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message.  Here is the scapy script I have
used:

>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')

Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:29:51 -04:00
Martin KaFai Lau 7e2040db15 ipv6: datagram: Refactor dst lookup and update codes to a new function
This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()

It will be reused during the pmtu update in the later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:29:50 -04:00
Martin KaFai Lau 80fbdb208f ipv6: datagram: Refactor flowi6 init codes to a new function
Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().

Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
   (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()

This new function will be reused during pmtu update in the later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:29:49 -04:00
Alexander Duyck 802ab55adc GSO: Support partial segmentation offload
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck 1530545ed6 GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames.  Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck cbc53e08a7 GSO: Add GSO type for fixed IPv4 ID
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:40 -04:00
Alexander Duyck 518f213ddd ethtool: Add support for toggling any of the GSO offloads
The strings were missing for several of the GSO offloads that are
available.  This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:40 -04:00
Jiri Pirko df38dafd25 devlink: implement shared buffer occupancy monitoring interface
User needs to monitor shared buffer occupancy. For that, he issues a
snapshot command in order to instruct hardware to catch current and
maximal occupancy values, and clear command in order to clear the
historical maximal values.

Also port-pool and tc-pool-bind command response messages are extended to
carry occupancy values.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:22:03 -04:00
Jiri Pirko bf7974710a devlink: add shared buffer configuration
Define userspace API and drivers API for configuration of shared
buffers. Four basic objects are defined:
shared buffer - attributes are size, number of pools and TCs
pool - chunk of sharedbuffer definition, it has some size and either
       static or dynamic threshold
port pool threshold - to set per-port threshold for each pool
port tc threshold bind - to bind port and TC to specified pool
                         with threshold.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:22:03 -04:00
David S. Miller 5e26502912 This has just the single fix from Dmitry Ivanov, adding the missing
netlink notifier family check to avoid the socket close DoS problem.
 -----BEGIN PGP SIGNATURE-----
 
 iQIbBAABCgAGBQJXD0YaAAoJEGt7eEactAAdJqUP9R0Ljud2jnH2JeNRv1qdEAXw
 /PKERFsiZjAnU7kXXdOlm5mqYNJILNGbtPaKF5oi6mswzojwgaIQ+rjsHidX0SVs
 1F1+VFxt0//pyfp6udEJ3gyx6wxGlncTdWUtxmzikgmMacn20ejT/wJFzF6y8MHO
 YSyI8NGNxdTn4eBPLL1QdipsDFYpkS7Z2P8ZUquJmR+uMccjIJ4m4kB2NM8AGQgf
 Z6KVjVP5dOvelWoEjSKaw2MyAAN/jRTxc9PFtEARoKMMtGMHBkn9qzin9wDjLA6k
 h3M34Aj5i4R8AIFWJmrCw06rDhPT6j02l4g2IXpmZQqvbP7zewyXyRUkmxMiFvQd
 n6zpNnQf3hDSNC8f7sBjNM5syi23j9ineJgtXdYtHqyqO0xY2qAVRYm7cSEXxFVk
 WaxrWDO/wtGyKm1lTQYudAAsFARIgomra4QJd5FqFdIwX2nuZekuikOUsytUqnEV
 ta+cwLJGLmof5jicrkoYD83974awJHcngLYMYLbkZC0kwWEZEiXX58D3o+FoF9ns
 4D+qkLPRDAVIJI/W9j5yuIA6/Aj7Cy9kuxtKf5CoqxL35k47lu69aHEXEceB6fDD
 M24G8nwEFcMimEWeYz4AARqaEJ08PWcA3/5dgf+N0XDCvQT95OutsZbqAdR2kqzF
 ApCKvGmbx/sNeyZtjbs=
 =FvAT
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This has just the single fix from Dmitry Ivanov, adding the missing
netlink notifier family check to avoid the socket close DoS problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 12:00:59 -04:00
Lars Persson 3dcd493fbe net: sched: do not requeue a NULL skb
A failure in validate_xmit_skb_list() triggered an unconditional call
to dev_requeue_skb with skb=NULL. This slowly grows the queue
discipline's qlen count until all traffic through the queue stops.

We take the optimistic approach and continue running the queue after a
failure since it is unknown if later packets also will fail in the
validate path.

Fixes: 55a93b3ea7 ("qdisc: validate skb without holding lock")
Signed-off-by: Lars Persson <larper@axis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 01:28:51 -04:00
Mathias Krause 309cf37fe2 packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface
Because we miss to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.

Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].

Fixes: eea68e2f1a ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 00:46:39 -04:00
David S. Miller 6c61403dae Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-04-14 00:39:15 -04:00
Eric Dumazet 743b03a832 net: remove netdevice gso_min_segs
After introduction of ndo_features_check(), we believe that very
specific checks for rare features should not be done in core
networking stack.

No driver uses gso_min_segs yet, so we revert this feature and save
few instructions per tx packet in fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 00:37:08 -04:00
Julia Lawall cfe2f14c72 qdisc: constify meta_type_ops structures
The meta_type_ops structures are never modified, so declare them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 00:35:30 -04:00
Chris Friesen d6d5e999e5 route: do not cache fib route info on local routes with oif
For local routes that require a particular output interface we do not want
to cache the result.  Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface.  The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.

This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board.  The receiving process
should see an IP_PKTINFO ipi_ifndex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo).  The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.

Sample dhcp_release command line:

   dhcp_release eth1 192.168.204.222 02:11:33:22:44:66

Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Signed off-by: Chris Friesen <chris.friesen@windriver.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 23:33:01 -04:00
Marcelo Ricardo Leitner fb586f2530 sctp: delay calls to sk_data_ready() as much as possible
Currently processing of multiple chunks in a single SCTP packet leads to
multiple calls to sk_data_ready, causing multiple wake up signals which
are costy and doesn't make it wake up any faster.

With this patch it will note that the wake up is pending and will do it
before leaving the state machine interpreter, latest place possible to
do it realiably and cleanly.

Note that sk_data_ready events are not dependent on asocs, unlike waking
up writers.

v2: series re-checked
v3: use local vars to cleanup the code, suggested by Jakub Sitnicki
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 23:04:44 -04:00
David Ahern 70af921db6 net: ipv6: Do not keep linklocal and loopback addresses
f1705ec197 added the option to retain user configured addresses on an
admin down. A comment to one of the later revisions suggested using the
IFA_F_PERMANENT flag rather than adding a user_managed boolean to the
ifaddr struct. A side effect of this change is that link local and
loopback addresses are also retained which is not part of the objective
of f1705ec197. Add check to drop those addresses.

Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:58:37 -04:00
Xin Long bf871ad792 bridge: a netlink notification should be sent when those attributes are changed by ioctl
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.

We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.

Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.

This patch is used for ioctl.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:42:33 -04:00
Xin Long bdaf0d5d98 bridge: a netlink notification should be sent when those attributes are changed by br_sysfs_if
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.

We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.

Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.

This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out
of store_flag.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:42:33 -04:00
Xin Long 047831a9b9 bridge: a netlink notification should be sent when those attributes are changed by br_sysfs_br
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.

We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.

Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.

This patch is used for br_sysfs_br. and we also need to remove some
rtnl_trylock in old functions so that we can call it in a common one.

For group_addr_store, we cannot make it use store_bridge_parm, because
it's not a string-to-long convert, we will add notification on it
individually.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:42:33 -04:00
Xin Long 4436156b6f bridge: simplify the stp_state_store by calling store_bridge_parm
There are some repetitive codes in stp_state_store, we can remove
them by calling store_bridge_parm.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:42:32 -04:00
Xin Long 347db6b49e bridge: simplify the forward_delay_store by calling store_bridge_parm
There are some repetitive codes in forward_delay_store, we can remove
them by calling store_bridge_parm.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:42:32 -04:00
Xin Long 14f31bb39f bridge: simplify the flush_store by calling store_bridge_parm
There are some repetitive codes in flush_store, we can remove
them by calling store_bridge_parm, also, it would send rtnl notification
after we add it in store_bridge_parm in the following patches.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:42:32 -04:00
Daniel Borkmann 47e27d5e92 ipv6, token: allow for clearing the current device token
The original tokenized iid support implemented via f53adae4ea ("net: ipv6:
add tokenized interface identifier support") didn't allow for clearing a
device token as it was intended that this addressing mode was the only one
active for globally scoped IPv6 addresses. Later we relaxed that restriction
via 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses"),
and we should also allow for clearing tokens as there's no good reason why
it shouldn't be allowed.

Fixes: 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses")
Reported-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:38:14 -04:00
Hannes Frederic Sowa fafc4e1ea1 sock: tigthen lockdep checks for sock_owned_by_user
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.

Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:37:20 -04:00
Willem de Bruijn 31c2e4926f udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6 pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.

Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checkum when reading truncated data.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:24:52 -04:00
Willem de Bruijn 9f9a45beaa udp: do not expect udp headers on ioctl SIOCINQ
On udp sockets, ioctl SIOCINQ returns the payload size of the first
packet. Since commit e6afc8ace6 pulled the headers, the result is
incorrect when subtracting header length. Remove that operation.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:24:52 -04:00
David S. Miller 60e19518d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree. More
specifically, they are:

1) Fix missing filter table per-netns registration in arptables, from
   Florian Westphal.

2) Resolve out of bound access when parsing TCP options in
   nf_conntrack_tcp, patch from Jozsef Kadlecsik.

3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables,
   this resolves conflict between xt_limit and ebt_limit, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 21:49:03 -04:00
Pablo Neira Ayuso 4054ff4545 netfilter: ctnetlink: remove unnecessary inlining
Many of these functions are called from control plane path.  Move
ctnetlink_nlmsg_size() under CONFIG_NF_CONNTRACK_EVENTS to avoid a
compilation warning when CONFIG_NF_CONNTRACK_EVENTS=n.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:39:34 +02:00
Florian Westphal d7591f0c41 netfilter: x_tables: introduce and use xt_copy_counters_from_user
The three variants use same copy&pasted code, condense this into a
helper and use that.

Make sure info.name is 0-terminated.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:41 +02:00
Florian Westphal aded9f3e9f netfilter: x_tables: remove obsolete check
Since 'netfilter: x_tables: validate targets of jumps' change we
validate that the target aligns exactly with beginning of a rule,
so offset test is now redundant.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:41 +02:00
Florian Westphal 95609155d7 netfilter: x_tables: remove obsolete overflow check for compat case too
commit 9e67d5a739
("[NETFILTER]: x_tables: remove obsolete overflow check") left the
compat parts alone, but we can kill it there as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:40 +02:00
Florian Westphal 09d9686047 netfilter: x_tables: do compat validation via translate_table
This looks like refactoring, but its also a bug fix.

Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.

For example, we do not check for underflows and the base chain policies.

While its possible to also add such checks to the compat path, its more
copy&pastry, for instance we cannot reuse check_underflow() helper as
e->target_offset differs in the compat case.

Other problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.

At a high level 32 bit compat works like this:
1- initial pass over blob:
   validate match/entry offsets, bounds checking
   lookup all matches and targets
   do bookkeeping wrt. size delta of 32/64bit structures
   assign match/target.u.kernel pointer (points at kernel
   implementation, needed to access ->compatsize etc.)

2- allocate memory according to the total bookkeeping size to
   contain the translated ruleset

3- second pass over original blob:
   for each entry, copy the 32bit representation to the newly allocated
   memory.  This also does any special match translations (e.g.
   adjust 32bit to 64bit longs, etc).

4- check if ruleset is free of loops (chase all jumps)

5-first pass over translated blob:
   call the checkentry function of all matches and targets.

The alternative implemented by this patch is to drop steps 3&4 from the
compat process, the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.

In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name .

This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.

This has two drawbacks:

1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.

THe upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.

iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003

shows no noticeable differences in restore times:
old:   0m30.796s
new:   0m31.521s
64bit: 0m25.674s

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:40 +02:00
Florian Westphal 0188346f21 netfilter: x_tables: xt_compat_match_from_user doesn't need a retval
Always returned 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:40 +02:00
Florian Westphal 8dddd32756 netfilter: arp_tables: simplify translate_compat_table args
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:39 +02:00
Florian Westphal 329a080712 netfilter: ip6_tables: simplify translate_compat_table args
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:39 +02:00
Florian Westphal 7d3f843eed netfilter: ip_tables: simplify translate_compat_table args
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:38 +02:00
Florian Westphal 13631bfc60 netfilter: x_tables: validate all offsets and sizes in a rule
Validate that all matches (if any) add up to the beginning of
the target and that each match covers at least the base structure size.

The compat path should be able to safely re-use the function
as the structures only differ in alignment; added a
BUILD_BUG_ON just in case we have an arch that adds padding as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:38 +02:00
Florian Westphal ce683e5f9d netfilter: x_tables: check for bogus target offset
We're currently asserting that targetoff + targetsize <= nextoff.

Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.

We also need the e->elems pointer in a followup change to validate matches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:37 +02:00
Florian Westphal 7ed2abddd2 netfilter: x_tables: check standard target size too
We have targets and standard targets -- the latter carries a verdict.

The ip/ip6tables validation functions will access t->verdict for the
standard targets to fetch the jump offset or verdict for chainloop
detection, but this happens before the targets get checked/validated.

Thus we also need to check for verdict presence here, else t->verdict
can point right after a blob.

Spotted with UBSAN while testing malformed blobs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:37 +02:00
Florian Westphal fc1221b3a1 netfilter: x_tables: add compat version of xt_check_entry_offsets
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:36 +02:00
Florian Westphal a08e4e190b netfilter: x_tables: assert minimum target size
The target size includes the size of the xt_entry_target struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:36 +02:00
Florian Westphal aa412ba225 netfilter: x_tables: kill check_entry helper
Once we add more sanity testing to xt_check_entry_offsets it
becomes relvant if we're expecting a 32bit 'config_compat' blob
or a normal one.

Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short just fold its contents into the callers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:36 +02:00
Florian Westphal 7d35812c32 netfilter: x_tables: add and use xt_check_entry_offsets
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.

Unfortunately these checks are not sufficient.

To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:35 +02:00
Florian Westphal 3647234101 netfilter: x_tables: validate targets of jumps
When we see a jump also check that the offset gets us to beginning of
a rule (an ipt_entry).

The extra overhead is negible, even with absurd cases.

300k custom rules, 300k jumps to 'next' user chain:
[ plus one jump from INPUT to first userchain ]:

Before:
real    0m24.874s
user    0m7.532s
sys     0m16.076s

After:
real    0m27.464s
user    0m7.436s
sys     0m18.840s

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:35 +02:00
Florian Westphal f24e230d25 netfilter: x_tables: don't move to non-existent next rule
Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

Base chains enforce absolute verdict.

User defined chains are supposed to end with an unconditional return,
xtables userspace adds them automatically.

But if such return is missing we will move to non-existent next rule.

Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:34 +02:00
Andrew Lunn 74c3e2a54b dsa: Rename phys_port_mask to enabled_port_mask
The phys in phys_port_mask suggests this mask is about PHYs. In fact,
it means physical ports. Rename to enabled_port_mask, indicating
external enabled ports of the switch, which is hopefully less
confusing.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 18:15:23 -04:00
Andrew Lunn 5feebd0a8a net: dsa: Remove allocation of driver private memory
The drivers now allocate their own memory for private usage. Remove
the allocation from the core code.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 18:15:23 -04:00
Andrew Lunn 7543a6d535 net: dsa: Have the switch driver allocate there own private memory
Now the switch devices have a dev pointer, make use of it for allocating
the drivers private data structures using a devm_kzalloc().

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 18:15:23 -04:00
Andrew Lunn bbb8d79399 net: dsa: Pass the dsa device to the switch drivers
By passing a device structure to the switch devices, it allows them
to use devm_* methods for resource management.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 18:15:22 -04:00
David S. Miller 71bbe25d01 To synchronize with Kalle, here's just a big change that affects
all drivers - removing the duplicated enum ieee80211_band and
 replacing it by enum nl80211_band. On top of that, just a small
 documentation update.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXDp6hAAoJEGt7eEactAAd/xUQAJtKNwp9CLsx+QFx6lMoXX4x
 r0XA8DFgLp1BflS9P05/g1m0NiQxm3YuRtpze/FdPglb6AAVjLcqksf+vTkU+Lng
 p7rIkb/fQv5s5aoYPxNrD5zgwALVv9y5fI7rV7scj355iesCC0PmAP34own2Dihi
 eBVSammsh5ZNTQKLBk8vXECb0UKWsDBMgp4uQc35Bpw8XSx5Nrtl5JI/hMcckte0
 a/FQyQKjmjl3O/nRLn3kzGPv1OnRiJOMb5fMWB+Xm2cLtmKPHIErgVk2l/CMaiYj
 sRJR8KaZQpQsyWiQU59UNpywlejy7Z1RsSWmuPhm0xTGzIF1wVIgHJSsRI/gNGD2
 8Ey1P+RXkM8NVxrQr/0fis9XWyWfE8ne4tFsPiPOD3VmBiStIB9fAukJHLrvTmKU
 JrkXCePUkfNY/PqJqlP/RONBcysI253/snVF49oZ7LMBZiGDPhdRcEEcCaS0tmMM
 Qa+a78XvaH5xaKuMIDZ4qMdnMMcdv4g8G1DQeA1mb0EIGL1Gtu9BJsu9q8PqmjQU
 1ZAf4MlWJWdYk+CtTNT4slSIQVKAN78s6j1HSB/bNcpWk9y93wBhJW0FdP7FtJ1I
 pjJGIVcLU98FKdqi2jqPEezbDXXzOz0gNQDbqfJyM9/R7ijnJcaPllviaWjEg/O7
 8jMBOg87Hn7kq7JJGpKA
 =2xfe
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
To synchronize with Kalle, here's just a big change that affects
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 17:58:51 -04:00
Jon Paul Maloy 7d45a04cbc tipc: remove remnants of old broadcast code
We remove a couple of leftover fields in struct tipc_bearer. Those
were used by the old broadcast implementation, and are not needed
any longer. There is no functional changes in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 17:49:11 -04:00
Alexander Aring edc73417d8 6lowpan: move mac802154 header
In case of link-layer specific handling for 802.15.4 we need to cast to
802.15.4 sepcific structures. Simple add this header when include the
6lowpan header.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:10 +02:00
Alexander Aring 2732363181 6lowpan: add lowpan_is_ll function
This patch adds the lowpan_is_ll function, which can be used to make a
special 6lowpan linklayer handling for a specific 6lowpan linklayer
type.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:10 +02:00
Alexander Aring a5862f2aba 6lowpan: move eui64 uncompress function
This function will be use in later functionality in other branches than
generic 6lowpan, so we move it to the global 6lowpan header.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:10 +02:00
Alexander Aring 2bc068c3d6 6lowpan: iphc: remove unnecessary zero data
This patch removes unnecessary zero data for a stack variable.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:09 +02:00
Alexander Aring 7115a968b7 6lowpan: iphc: rename add lowpan prefix
This patch adds a lowpan prefix to each functions which doesn't have
such prefix currently.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:09 +02:00
Alexander Aring 353c224e28 6lowpan: move lowpan_802154_dev to 6lowpan
This patch moves the 802.15.4 link layer specific structures to generic
6lowpan. This is necessary for special 802.15.4 6lowpan handling in
6lowpan generic layer.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:09 +02:00
Alexander Aring 2e4d60cbcf 6lowpan: change naming for lowpan private data
This patch changes the naming for interface private data for lowpan
intefaces. The current private data scheme is:

-------------------------------------------------
|    6LoWPAN Generic   |    LinkLayer 6LoWPAN   |
-------------------------------------------------

the current naming schemes are:

- 6LoWPAN Generic:
  - lowpan_priv
- LinkLayer 6LoWPAN:
  - BTLE
    - lowpan_dev
  - 802.15.4:
    - lowpan_dev_info

the new naming scheme with this patch will be:

- 6LoWPAN Generic:
  - lowpan_dev
- LinkLayer 6LoWPAN:
  - BTLE
    - lowpan_btle_dev
  - 802.15.4:
    - lowpan_802154_dev

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:09 +02:00
Alexander Aring 5a7f97e570 ieee802154: 6lowpan: fix short addr hash
The short address is unique in combination with the panid. This patch
will add the panid for generating an ieee802154 address hash.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:09 +02:00
Alexander Aring 9e3b71f343 nl802154: avoid address change while running lowpan
The generation of autoconfigured IPv6 link-local addresses starts with a
notification on interface up.

To handle autoconfiguration according to RFC 4944 requires pan id and
short address to generate an autoconfigured link-local address. This
patch will avoid changing of these link-layer address configuration
while lowpan interface is up.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-13 10:41:09 +02:00
David S. Miller da0caadf0a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains the first batch of Netfilter updates for
your net-next tree.

1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong.

2) Define and register netfilter's afinfo for the bridge family,
   this comes in preparation for native nfqueue's bridge for nft,
   from Stephane Bryant.

3) Add new attributes to store layer 2 and VLAN headers to nfqueue,
   also from Stephane Bryant.

4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes
   coming from userspace, from Stephane Bryant.

5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit
   in IPv6 SYNPROXY, from Liping Zhang.

6) Remove unnecessary check for dst == NULL in nf_reject_ipv6,
   from Haishuang Yan.

7) Deinline ctnetlink event report functions, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-12 22:34:56 -04:00
Phil Sutter bcf4934288 netfilter: ebtables: Fix extension lookup with identical name
If a requested extension exists as module and is not loaded,
ebt_check_match() might accidentally use an NFPROTO_UNSPEC one with same
name and fail.

Reproduced with limit match: Given xt_limit and ebt_limit both built as
module, the following would fail:

  modprobe xt_limit
  ebtables -I INPUT --limit 1/s -j ACCEPT

The fix is to make ebt_check_match() distrust a found NFPROTO_UNSPEC
extension and retry after requesting an appropriate module.

Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-13 01:16:57 +02:00
Florian Westphal ecdfb48cdd netfilter: conntrack: move expectation event helper to ecache.c
Not performance critical, it is only invoked when an expectation is
added/destroyed.

While at it, kill unused nf_ct_expect_event() wrapper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-12 23:01:57 +02:00
Florian Westphal 3c435e2e41 netfilter: conntrack: de-inline nf_conntrack_eventmask_report
Way too large; move it to nf_conntrack_ecache.c.
Reduces total object size by 1216 byte on my machine.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-12 23:01:52 +02:00
David S. Miller 69fb78121b Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-04-12

Here's a set of Bluetooth & 802.15.4 patches intended for the 4.7 kernel:

 - Fix for race condition in vhci driver
 - Memory leak fix for ieee802154/adf7242 driver
 - Improvements to deal with single-mode (LE-only) Bluetooth controllers
 - Fix for allowing the BT_SECURITY_FIPS security level
 - New BCM2E71 ACPI ID
 - NULL pointer dereference fix fox hci_ldisc driver

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-12 11:57:53 -04:00
Johannes Berg 57fbcce37b cfg80211: remove enum ieee80211_band
This enum is already perfectly aliased to enum nl80211_band, and
the only reason for it is that we get IEEE80211_NUM_BANDS out of
it. There's no really good reason to not declare the number of
bands in nl80211 though, so do that and remove the cfg80211 one.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-12 15:56:15 +02:00
Dmitry Ivanov 8f815cdde3 nl80211: check netlink protocol in socket release notification
A non-privileged user can create a netlink socket with the same port_id as
used by an existing open nl80211 netlink socket (e.g. as used by a hostapd
process) with a different protocol number.

Closing this socket will then lead to the notification going to nl80211's
socket release notification handler, and possibly cause an action such as
removing a virtual interface.

Fix this issue by checking that the netlink protocol is NETLINK_GENERIC.
Since generic netlink has no notifier chain of its own, we can't fix the
problem more generically.

Fixes: 026331c4d9 ("cfg80211/mac80211: allow registering for and sending action frames")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-12 15:39:06 +02:00
David Ahern 4f7f34eaab net: vrf: Fix dev refcnt leak due to IPv6 prefix route
ifupdown2 found a kernel bug with IPv6 routes and movement from the main
table to the VRF table. Sequence of events:

Create the interface and add addresses:
    ip link add dev eth4.105 link eth4 type vlan id 105
    ip addr add dev eth4.105 8.105.105.10/24
    ip -6 addr add dev eth4.105 2008:105:105::10/64

At this point IPv6 has inserted a prefix route in the main table even
though the interface is 'down'. From there the VRF device is created:
    ip link add dev vrf105 type vrf table 105
    ip addr add dev vrf105 9.9.105.10/32
    ip -6 addr add dev vrf105 2000:9:105::10/128
    ip link set vrf105 up

Then the interface is enslaved, while still in the 'down' state:
    ip link set dev eth4.105 master vrf105

Since the device is down the VRF driver cycling the device does not
send the NETDEV_UP and NETDEV_DOWN but rather the NETDEV_CHANGE event
which does not flush the routes inserted prior.

When the link is brought up
    ip link set dev eth4.105 up

the prefix route is added in the VRF table, but does not remove
the route from the main table.

Fix by handling the NETDEV_CHANGEUPPER event similar what was implemented
for IPv4 in 7f49e7a38b ("net: Flush local routes when device changes vrf
association")

Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:56:20 -04:00
David Ahern 9ab179d83b net: vrf: Fix dst reference counting
Vivek reported a kernel exception deleting a VRF with an active
connection through it. The root cause is that the socket has a cached
reference to a dst that is destroyed. Converting the dst_destroy to
dst_release and letting proper reference counting kick in does not
work as the dst has a reference to the device which needs to be released
as well.

I talked to Hannes about this at netdev and he pointed out the ipv4 and
ipv6 dst handling has dst_ifdown for just this scenario. Rather than
continuing with the reinvented dst wheel in VRF just remove it and
leverage the ipv4 and ipv6 versions.

Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:56:20 -04:00
David Howells e0e4d82f3b rxrpc: Create a null security type and get rid of conditional calls
Create a null security type for security index 0 and get rid of all
conditional calls to the security operations.  We expect normally to be
using security, so this should be of little negative impact.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:41 -04:00
David Howells 648af7fca1 rxrpc: Absorb the rxkad security module
Absorb the rxkad security module into the af_rxrpc module so that there's
only one module file.  This avoids a circular dependency whereby rxkad pins
af_rxrpc and cached connections pin rxkad but can't be manually evicted
(they will expire eventually and cease pinning).

With this change, af_rxrpc can just be unloaded, despite having cached
connections.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:41 -04:00
David Howells 6dd050f88d rxrpc: Don't assume transport address family and size when using it
Don't assume transport address family and size when using the peer address
to send a packet.  Instead, use the start of the transport address rather
than any particular element of the union and use the transport address
length noted inside the sockaddr_rxrpc struct.

This will be necessary when IPv6 support is introduced.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:41 -04:00
David Howells 843099cac0 rxrpc: Don't pass gfp around in incoming call handling functions
Don't pass gfp around in incoming call handling functions, but rather hard
code it at the points where we actually need it since the value comes from
within the rxrpc driver and is always the same.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:41 -04:00
David Howells dc44b3a09a rxrpc: Differentiate local and remote abort codes in structs
In the rxrpc_connection and rxrpc_call structs, there's one field to hold
the abort code, no matter whether that value was generated locally to be
sent or was received from the peer via an abort packet.

Split the abort code fields in two for cleanliness sake and add an error
field to hold the Linux error number to the rxrpc_call struct too
(sometimes this is generated in a context where we can't return it to
userspace directly).

Furthermore, add a skb mark to indicate a packet that caused a local abort
to be generated so that recvmsg() can pick up the correct abort code.  A
future addition will need to be to indicate to userspace the difference
between aborts via a control message.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:40 -04:00
David Howells 5b3e87f19e rxrpc: Static arrays of strings should be const char *const[]
Static arrays of strings should be const char *const[].

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:40 -04:00
David Howells 8e688d9c16 rxrpc: Move some miscellaneous bits out into their own file
Move some miscellaneous bits out into their own file to make it easier to
split the call handling.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:40 -04:00
David Howells 8f7e6e75d3 rxrpc: Disable a debugging statement that has been left enabled.
Disable a debugging statement that has been left enabled

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:40 -04:00
Willem de Bruijn 4d0fc73ebe rxrpc: do not pull udp headers on receive
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.

Rxrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:31:33 -04:00
Willem de Bruijn 1da8c681d5 sunrpc: do not pull udp headers on receive
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.

Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Reported-by: Franklin S Cooper Jr. <fcooper@ti.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:31:33 -04:00
Erik Hugne ddb1d33969 tipc: purge deferred updates from dead nodes
If a peer node becomes unavailable, in addition to removing the
nametable entries from this node we also need to purge all deferred
updates associated with this node.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:22:20 -04:00
Erik Hugne 541726abe7 tipc: make dist queue pernet
Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:22:20 -04:00
David Ahern a6db4494d2 net: ipv4: Consider failed nexthops in multipath routes
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:16:13 -04:00
Dmitry Ivanov e272602039 netlink: don't send NETLINK_URELEASE for unbound sockets
All existing users of NETLINK_URELEASE use it to clean up resources that
were previously allocated to a socket via some command. As a result, no
users require getting this notification for unbound sockets.

Sending it for unbound sockets, however, is a problem because any user
(including unprivileged users) can create a socket that uses the same ID
as an existing socket. Binding this new socket will fail, but if the
NETLINK_URELEASE notification is generated for such sockets, the users
thereof will be tricked into thinking the socket that they allocated the
resources for is closed.

In the nl80211 case, this will cause destruction of virtual interfaces
that still belong to an existing hostapd process; this is the case that
Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
of netlink log/queue it will cause them to stop reporting events, as if
NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.

Fix this problem by checking that the socket is bound before generating
the NETLINK_URELEASE notification.

Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-10 23:32:23 -04:00
David S. Miller a36a0d4008 decnet: Do not build routes to devices without decnet private data.
In particular, make sure we check for decnet private presence
for loopback devices.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-10 23:01:30 -04:00
Marcelo Ricardo Leitner ba6f5e33bd sctp: avoid refreshing heartbeat timer too often
Currently on high rate SCTP streams the heartbeat timer refresh can
consume quite a lot of resources as timer updates are costly and it
contains a random factor, which a) is also costly and b) invalidates
mod_timer() optimization for not editing a timer to the same value.
It may even cause the timer to be slightly advanced, for no good reason.

As suggested by David Laight this patch now removes this timer update
from hot path by leaving the timer on and re-evaluating upon its
expiration if the heartbeat is still needed or not, similarly to what is
done for TCP. If it's not needed anymore the timer is re-scheduled to
the new timeout, considering the time already elapsed.

For this, we now record the last tx timestamp per transport, updated in
the same spots as hb timer was restarted on tx. Also split up
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
heartbeat one.

On loopback with MTU of 65535 and data chunks with 1636, so that we
have a considerable amount of chunks without stressing system calls,
netperf -t SCTP_STREAM -l 30, perf looked like this before:

Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
  Overhead  Command  Shared Object      Symbol
+    6,15%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
-    5,43%  netperf  [kernel.vmlinux]   [k] _raw_write_unlock_irqrestore
   - _raw_write_unlock_irqrestore
      - 96,54% _raw_spin_unlock_irqrestore
         - 36,14% mod_timer
            + 97,24% sctp_transport_reset_timers
            + 2,76% sctp_do_sm
         + 33,65% __wake_up_sync_key
         + 28,77% sctp_ulpq_tail_event
         + 1,40% del_timer
      - 1,84% mod_timer
         + 99,03% sctp_transport_reset_timers
         + 0,97% sctp_do_sm
      + 1,50% sctp_ulpq_tail_event

And after this patch, now with netperf -l 60:

Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
  Overhead  Command  Shared Object      Symbol
+    5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
+    5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
-    5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      + 49,89% __wake_up_sync_key
      + 45,68% sctp_ulpq_tail_event
      - 2,85% mod_timer
         + 76,51% sctp_transport_reset_t3_rtx
         + 23,49% sctp_do_sm
      + 1,55% del_timer
+    2,50%  netperf  [sctp]             [k] sctp_datamsg_from_user
+    2,26%  netperf  [sctp]             [k] sctp_sendmsg

Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
~3.7%.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-10 22:22:34 -04:00
David S. Miller ae95d71261 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-09 17:41:41 -04:00
Eric Dumazet 03c5b53418 ipv6: fix inet6_lookup_listener()
A stupid refactoring bug in inet6_lookup_listener() needs to be fixed
in order to get proper SO_REUSEPORT behavior.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-09 16:53:52 -04:00
Linus Torvalds 9ef11ceb0d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Stale SKB data pointer access across pskb_may_pull() calls in L2TP,
    from Haishuang Yan.

 2) Fix multicast frame handling in mac80211 AP code, from Felix
    Fietkau.

 3) mac80211 station hashtable insert errors not handled properly, fix
    from Johannes Berg.

 4) Fix TX descriptor count limit handling in e1000, from Alexander
    Duyck.

 5) Revert a buggy netdev refcount fix in netpoll, from Bjorn Helgaas.

 6) Must assign rtnl_link_ops of the device before registering it, fix
    in ip6_tunnel from Thadeu Lima de Souza Cascardo.

 7) Memory leak fix in tc action net exit, from WANG Cong.

 8) Add missing AF_KCM entries to name tables, from Dexuan Cui.

 9) Fix regression in GRE handling of csums wrt.  FOU, from Alexander
    Duyck.

10) Fix memory allocation alignment and congestion map corruption in
    RDS, from Shamir Rabinovitch.

11) Fix default qdisc regression in tuntap driver, from Jason Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  bridge, netem: mark mailing lists as moderated
  tuntap: restore default qdisc
  mpls: find_outdev: check for err ptr in addition to NULL check
  ipv6: Count in extension headers in skb->network_header
  RDS: fix congestion map corruption for PAGE_SIZE > 4k
  RDS: memory allocated must be align to 8
  GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
  net: add the AF_KCM entries to family name tables
  MAINTAINERS: intel-wired-lan list is moderated
  lib/test_bpf: Add additional BPF_ADD tests
  lib/test_bpf: Add test to check for result of 32-bit add that overflows
  lib/test_bpf: Add tests for unsigned BPF_JGT
  lib/test_bpf: Fix JMP_JSET tests
  VSOCK: Detach QP check should filter out non matching QPs.
  stmmac: fix adjust link call in case of a switch is attached
  af_packet: tone down the Tx-ring unsupported spew.
  net_sched: fix a memory leak in tc action
  samples/bpf: Enable powerpc support
  samples/bpf: Use llc in PATH, rather than a hardcoded value
  samples/bpf: Fix build breakage with map_perf_test_user.c
  ...
2016-04-09 10:50:44 -07:00
Vivien Didelot 4d5770b397 net: dsa: make the VLAN add function return void
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.

The DSA layer separates port_vlan_prepare and port_vlan_add for
simplicity and convenience. If an hardware error occurs during the
commit phase, there is no need to report it outside the driver itself.

Make the DSA port_vlan_add routine return void for explicitness.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 16:50:41 -04:00
Vivien Didelot 8497aa618d net: dsa: make the FDB add function return void
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.

The DSA layer separates port_fdb_prepare and port_fdb_add for simplicity
and convenience. If an hardware error occurs during the commit phase,
there is no need to report it outside the DSA driver itself.

Make the DSA port_fdb_add routine return void for explicitness.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 16:50:40 -04:00
Vivien Didelot 43c44a9f65 net: dsa: make the STP state function return void
The DSA layer doesn't care about the return code of the port_stp_update
routine, so make it void in the layer and the DSA drivers.

Replace the useless dsa_slave_stp_update function with a
dsa_slave_stp_state function used to reply to the switchdev
SWITCHDEV_ATTR_ID_PORT_STP_STATE attribute.

In the meantime, rename port_stp_update to port_stp_state_set to
explicit the state change.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 16:50:40 -04:00
David S. Miller 1089ac6977 For the 4.6 cycle, we have a number of changes:
* Bob's mesh mode rhashtable conversion, this includes
    the rhashtable API change for allocation flags
  * BSSID scan, connect() command reassoc support (Jouni)
  * fast (optimised data only) and support for RSS in mac80211 (myself)
  * various smaller changes
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXBQ4GAAoJEGt7eEactAAdWiMP/ibaP3I79NDc0s7wCDA+KRkm
 hx0Qx4a0wwm7lDFlnGBjY6yKr+XFDliCvdGX7XGpLSsTioNg7eXPpwx5FQoj6RiV
 8+5RKE9fTguN9ofUzqAwHd9sVOaxvdlXbKfb/N93Gzjpw/meYk58wXdF7Almkroa
 ukgJeMzIlIh+6D96zFEA+Ofzp5chwh+x2Dn0wXutEe9P9fOERA859veAvx65b+Ql
 IRGTqyuY5B/wcbkr4o+DWQwgrdt7Vop9nYVPNWtMHm2JTzfuCSaQ2cD9TnVAK/bg
 /vtqC46KKNLyBRGexAPqdftY9PWcfipgE+n7k+Et4iGSmNm7Z3dEyewgXmqli7XJ
 X8Uiaq+N6Fpe06DVSU7aSRt8NLV64A44jXSfKRI9U2POUqKMn/PMdm8bhPW8qCdM
 ra6myWpQGHWK9e0TQQdShq0NQKGxCZAiSRiiIrbbvXl1CwXxkPCG39wAC3Sh1tEN
 ou4lGraeywGnTjaq+mwLEtHLoug8Y2x+Fz+Ze4Cu2enXxna9lp4lr+rFlc+2+0Er
 o9oPxkTk8krZGIj9M6PNc5W+InMwchaFX3076n67hnFHzFRlOQzkfffbPYlhKJDQ
 f8c9JiNZIoX/fD1TAKsrdO1+EKm/xo7w7pLgbMwQal8Jr88SkITDg0i3oXc56vNQ
 ZK2gUzwvrD/jh0AUyDfN
 =sj7y
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
For the 4.7 cycle, we have a number of changes:
 * Bob's mesh mode rhashtable conversion, this includes
   the rhashtable API change for allocation flags
 * BSSID scan, connect() command reassoc support (Jouni)
 * fast (optimised data only) and support for RSS in mac80211 (myself)
 * various smaller changes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 16:42:31 -04:00
David S. Miller 30d237a6c2 For the current RC series, we have the following fixes:
* TDLS fixes from Arik and Ilan
  * rhashtable fixes from Ben and myself
  * documentation fixes from Luis
  * U-APSD fixes from Emmanuel
  * a TXQ fix from Felix
  * and a compiler warning suppression from Jeff
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXBQyAAAoJEGt7eEactAAdymsP/i2zU+VQFpkB8RG+nn/AYogY
 x2RXgKjCWOs6FwUcu+VHMx0whKuMADjqbMABdlsGGK62Xa5aYYcObvY+CgCUAI+m
 unV7kYDIBUHudiTpxXgZYUvylhIvW37VYjc6BDoaq4Jc1rz/L69zSrNHmoNiQv+Y
 113T0Ft5EEmEO1LP4s2GLMZTPqwgi2FnaP6UYFdTr2/ZfaRHlj2xRRG62WiT/q1x
 DMT2KWHvETCftpK3GwwkMSr0Au8CVV1soiQOoioTPRaevYbBFVi60GVXQeDlvFEV
 PqVCOEfsSvsw84phfHrW1bOxBeVsNYHbY/T4eVlC0zssUzz6KNH5jAfpyla1p0lP
 WniSqAaWxMcUYWCEBiOLa5LV2XVpXOuTpI84xcgc/BmprgzNyLgDAiCDtehpxALf
 Qmhc/rPR5BbLhTNY8z5qANG6mhQCHCo+52ypvBLMhZoajkPjgyBabwoqaRGje2ub
 vgzbAfqEguJmCAszw04KZ2UHInBcCDAZ5aOKiinawWvpftkDN0IJmO/HW5vnh4xV
 kawpo1eh1JzDcEyYfSjySHHFmR/5qnaDzaPL2cJthcOY/fGywibJHCQmoDPyH5jz
 Bkk0F0rMEcdQWs9pJLIMMzkA7BAlxYLYip0J9QImHL77sWK0QwUDCoryrgD6lL1D
 v2V31g1TZwPF2Noe9Rk7
 =r3Hr
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
For the current RC series, we have the following fixes:
 * TDLS fixes from Arik and Ilan
 * rhashtable fixes from Ben and myself
 * documentation fixes from Luis
 * U-APSD fixes from Emmanuel
 * a TXQ fix from Felix
 * and a compiler warning suppression from Jeff
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 16:41:28 -04:00
Jiri Pirko 1fc2257e83 devlink: share user_ptr pointer for both devlink and devlink_port
Ptr to devlink structure can be easily obtained from
devlink_port->devlink. So share user_ptr[0] pointer for both and leave
user_ptr[1] free for other users.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 15:40:08 -04:00
Jiri Pirko a9844881ba devlink: remove implicit type set in port register
As we rely on caller zeroing or correctly set the struct before the call,
this implicit type set is either no-op (DEVLINK_PORT_TYPE_NOTSET is 0)
or it rewrites wanted value. So remove this.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 15:38:42 -04:00
Alexander Aring feb2add323 6lowpan: iphc: fix handling of link-local compression
This patch fixes handling in case of link-local address compression. A
IPv6 link-local address is defined as fe80::/10 prefix which is also
what ipv6_addr_type checks for link-local addresses.

But IPHC compression for link-local addresses are for fe80::/64 types
only. This patch adds additional checks for zero padded bits in case of
link-local address compression to match on a fe80::/64 address only.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-08 19:28:13 +02:00
Patrik Flykt a164cee111 Bluetooth: Allow setting BT_SECURITY_FIPS with setsockopt
Update the security level check to allow setting BT_SECURITY_FIPS for
an L2CAP socket.

Signed-off-by: Patrik Flykt <patrik.flykt@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-08 19:10:57 +02:00
Johan Hedberg 56b40fbf61 Bluetooth: Ignore unknown advertising packet types
In case of buggy controllers send advertising packet types that we
don't know of we should simply ignore them instead of trying to react
to them in some (potentially wrong) way.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-08 18:51:44 +02:00
Johan Hedberg f18ba58f53 Bluetooth: Fix setting NO_BREDR advertising flag
If we're dealing with a single-mode controller or BR/EDR is disable
for a dual-mode one, the NO_BREDR flag needs to be unconditionally
present in the advertising data. This patch moves it out from behind
an extra condition to be always set in the create_instance_adv_data()
function if BR/EDR is disabled.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-04-08 18:50:40 +02:00
Roopa Prabhu 94a57f1f8a mpls: find_outdev: check for err ptr in addition to NULL check
find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to
find the output device. In case of an error, inet{,6}_fib_lookup_dev()
returns error pointer and dev_get_by_index() returns NULL. But the function
only checks for NULL and thus can end up calling dev_put on an ERR_PTR.
This patch adds an additional check for err ptr after the NULL check.

Before: Trying to add an mpls route with no oif from user, no available
path to 10.1.1.8 and no default route:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
[  822.337195] BUG: unable to handle kernel NULL pointer dereference at
00000000000003a3
[  822.340033] IP: [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[  822.340033] PGD 1db38067 PUD 1de9e067 PMD 0
[  822.340033] Oops: 0000 [#1] SMP
[  822.340033] Modules linked in:
[  822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54
[  822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
04/01/2014
[  822.340033] task: ffff88001db82580 ti: ffff88001dad4000 task.ti:
ffff88001dad4000
[  822.340033] RIP: 0010:[<ffffffff8148781e>]  [<ffffffff8148781e>]
mpls_nh_assign_dev+0x10b/0x182
[  822.340033] RSP: 0018:ffff88001dad7a88  EFLAGS: 00010282
[  822.340033] RAX: ffffffffffffff9b RBX: ffffffffffffff9b RCX:
0000000000000002
[  822.340033] RDX: 00000000ffffff9b RSI: 0000000000000008 RDI:
0000000000000000
[  822.340033] RBP: ffff88001ddc9ea0 R08: ffff88001e9f1768 R09:
0000000000000000
[  822.340033] R10: ffff88001d9c1100 R11: ffff88001e3c89f0 R12:
ffffffff8187e0c0
[  822.340033] R13: ffffffff8187e0c0 R14: ffff88001ddc9e80 R15:
0000000000000004
[  822.340033] FS:  00007ff9ed798700(0000) GS:ffff88001fc00000(0000)
knlGS:0000000000000000
[  822.340033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  822.340033] CR2: 00000000000003a3 CR3: 000000001de89000 CR4:
00000000000006f0
[  822.340033] Stack:
[  822.340033]  0000000000000000 0000000100000000 0000000000000000
0000000000000000
[  822.340033]  0000000000000000 0801010a00000000 0000000000000000
0000000000000000
[  822.340033]  0000000000000004 ffffffff8148749b ffffffff8187e0c0
000000000000001c
[  822.340033] Call Trace:
[  822.340033]  [<ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e
[  822.340033]  [<ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2
[  822.340033]  [<ffffffff810e7bbc>] ? get_page+0x5/0xa
[  822.340033]  [<ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191
[  822.340033]  [<ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e
[  822.340033]  [<ffffffff813c9393>] ?
rht_key_hashfn.isra.20.constprop.57+0x14/0x1f
[  822.340033]  [<ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc
[  822.340033]  [<ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82
[  822.340033]  [<ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28
[  822.340033]  [<ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189
[  822.340033]  [<ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8
[  822.340033]  [<ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b
[  822.340033]  [<ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3
[  822.340033]  [<ffffffff810e4f35>] ?
__alloc_pages_nodemask+0x11c/0x1e4
[  822.340033]  [<ffffffff8110619c>] ? PageAnon+0x5/0xd
[  822.340033]  [<ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52
[  822.340033]  [<ffffffff810e7bbc>] ? get_page+0x5/0xa
[  822.340033]  [<ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a
[  822.340033]  [<ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30
[  822.340033]  [<ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a
[  822.340033]  [<ffffffff8148f597>] ?
entry_SYSCALL_64_fastpath+0x12/0x6a
[  822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff
eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db
74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07
[  822.340033] RIP  [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[  822.340033]  RSP <ffff88001dad7a88>
[  822.340033] CR2: 00000000000003a3
[  822.435363] ---[ end trace 98cc65e6f6b8bf11 ]---

After patch:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
RTNETLINK answers: Network is unreachable

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 12:43:20 -04:00
Jakub Sitnicki 3ba3458fb9 ipv6: Count in extension headers in skb->network_header
When sending a UDPv6 message longer than MTU, account for the length
of fragmentable IPv6 extension headers in skb->network_header offset.
Same as we do in alloc_new_skb path in __ip6_append_data().

This ensures that later on __ip6_make_skb() will make space in
headroom for fragmentable extension headers:

	/* move skb->data to ip header from ext header */
	if (skb->data < skb_network_header(skb))
		__skb_pull(skb, skb_network_offset(skb));

Prevents a splat due to skb_under_panic:

skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \
head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] KASAN
CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
[...]
Call Trace:
 [<ffffffff813eb7b9>] skb_push+0x79/0x80
 [<ffffffff8143397b>] eth_header+0x2b/0x100
 [<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
 [<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
 [<ffffffff814efe3a>] ip6_output+0x16a/0x280
 [<ffffffff815440c1>] ip6_local_out+0xb1/0xf0
 [<ffffffff814f1115>] ip6_send_skb+0x45/0xd0
 [<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
 [<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
[...]

Reported-by: Ji Jianwen <jiji@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 22:41:37 -04:00
Jon Paul Maloy 5b7066c3dd tipc: stricter filtering of packets in bearer layer
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.

To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only.  This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 17:00:13 -04:00
Jon Paul Maloy 4e801fa14f tipc: eliminate buffer leak in bearer layer
When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a  bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.

This commit fixes this problem by introducing two changes:

1) Instead of attemting to send the discovery message directly, we let
   tipc_disc_create() return the discovery buffer to the calling
   function, tipc_enable_bearer(), so that the latter can send it
   when the enabling sequence is finished.

2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
   functions at the bearer layer, we now free the indicated buffer or
   buffer chain when a valid bearer cannot be found.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 17:00:13 -04:00
shamir rabinovitch 579ba85552 RDS: fix congestion map corruption for PAGE_SIZE > 4k
When PAGE_SIZE > 4k single page can contain 2 RDS fragments. If
'rds_ib_cong_recv' ignore the RDS fragment offset in to the page it
then read the data fragment as far congestion map update and lead to
corruption of the RDS connection far congestion map.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:58:28 -04:00
shamir rabinovitch e98499ac63 RDS: memory allocated must be align to 8
Fix issue in 'rds_ib_cong_recv' when accessing unaligned memory
allocated by 'rds_page_remainder_alloc' using uint64_t pointer.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:58:27 -04:00
Alexander Duyck a0ca153f98 GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.

The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header.  As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.

Fixes: c3483384ee ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:56:33 -04:00
Tom Herbert 46aa2f30aa udp: Remove udp_offloads
Now that the UDP encapsulation GRO functions have been moved to the UDP
socket we not longer need the udp_offload insfrastructure so removing it.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:30 -04:00
Tom Herbert d92283e338 fou: change to use UDP socket GRO
Adapt gue_gro_receive, gue_gro_complete to take a socket argument.
Don't set udp_offloads any more.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Tom Herbert 38fd2af24f udp: Add socket based GRO and config
Add gro_receive and  gro_complete to struct udp_tunnel_sock_cfg.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Tom Herbert a6024562ff udp: Add GRO functions to UDP socket
This patch adds GRO functions (gro_receive and gro_complete) to UDP
sockets. udp_gro_receive is changed to perform socket lookup on a
packet. If a socket is found the related GRO functions are called.

This features obsoletes using UDP offload infrastructure for GRO
(udp_offload). This has the advantage of not being limited to provide
offload on a per port basis, GRO is now applied to whatever individual
UDP sockets are bound to.  This also allows the possbility of
"application defined GRO"-- that is we can attach something like
a BPF program to a UDP socket to perfrom GRO on an application
layer protocol.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Tom Herbert 63058308cd udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb
Add externally visible functions to lookup a UDP socket by skb. This
will be used for GRO in UDP sockets. These functions also check
if skb->dst is set, and if it is not skb->dev is used to get dev_net.
This allows calling lookup functions before dst has been set on the
skbuff.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:14 -04:00
Hannes Frederic Sowa 8ced425ee6 tun: use socket locks for sk_{attach,detatch}_filter
This reverts commit 5a5abb1fa3 ("tun, bpf: fix suspicious RCU usage
in tun_{attach, detach}_filter") and replaces it to use lock_sock around
sk_{attach,detach}_filter. The checks inside filter.c are updated with
lockdep_sock_is_held to check for proper socket locks.

It keeps the code cleaner by ensuring that only one lock governs the
socket filter instead of two independent locks.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:44:14 -04:00
Hannes Frederic Sowa 1e1d04e678 net: introduce lockdep_is_held and update various places to use it
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:44:14 -04:00
Hannes Frederic Sowa 61881cfb5a sock: fix lockdep annotation in release_sock
During release_sock we use callbacks to finish the processing
of outstanding skbs on the socket. We actually are still locked,
sk_locked.owned == 1, but we already told lockdep that the mutex
is released. This could lead to false positives in lockdep for
lockdep_sock_is_held (we don't hold the slock spinlock during processing
the outstanding skbs).

I took over this patch from Eric Dumazet and tested it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:44:14 -04:00
Haishuang Yan 85f1e7c29a netfilter: ipv6: unnecessary to check whether ip6_route_output() returns NULL
ip6_route_output() never returns NULL, so it is not appropriate to
check if the return value is NULL.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-07 18:53:08 +02:00
Jozsef Kadlecsik 644c7e48cb netfilter: nf_conntrack_tcp: Fix stack out of bounds when parsing TCP options
Baozeng Ding reported a KASAN stack out of bounds issue - it uncovered that
the TCP option parsing routines in netfilter TCP connection tracking could
read one byte out of the buffer of the TCP options.  Therefore in the patch
we check that the available data length is large enough to parse both TCP
option code and size.

Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-07 18:42:37 +02:00
Eric Dumazet 8501786929 tcp/dccp: fix inet_reuseport_add_sock()
David Ahern reported panics in __inet_hash() caused by my recent commit.

The reason is inet_reuseport_add_sock() was still using
sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
SO_REUSEPORT enabled listeners were causing an instant crash.

While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
flag, as it is inherited from the parent at clone time.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 12:02:33 -04:00
Florian Westphal ff76def3bd netfilter: arp_tables: register table in initns
arptables is broken since we didn't register the table anymore --
even 'arptables -L' fails.

Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-07 11:58:49 +02:00
Dexuan Cui 0a1a37b6d6 net: add the AF_KCM entries to family name tables
This is for the recent kcm driver, which introduces AF_KCM(41) in
b7ac4eb(kcm: Kernel Connection Multiplexor module).

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:59:01 -04:00
Jiri Benc a6d5bbf34e ip_tunnel: implement __iptunnel_pull_header
Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner
protocol.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:50:32 -04:00
Jorgen Hansen 8ab18d71de VSOCK: Detach QP check should filter out non matching QPs.
The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.

Testing: Before this change, a detach from a peer on a different
socket would cause an active stream socket to register a detach.

Reviewed-by: George Zhang <georgezhang@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:39:09 -04:00
Dave Jones 6ae81ced37 af_packet: tone down the Tx-ring unsupported spew.
Trinity and other fuzzers can hit this WARN on far too easily,
resulting in a tainted kernel that hinders automated fuzzing.

Replace it with a rate-limited printk.

Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:05:20 -04:00
David S. Miller 32fa270c8a Revert "bridge: Fix incorrect variable assignment on error path in br_sysfs_addbr"
This reverts commit c862cc9b70.

Patch lacks a real-name Signed-off-by.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 15:42:45 -04:00
Jeff Mahoney b4201cc4fc mac80211: fix "warning: ‘target_metric’ may be used uninitialized"
This fixes:

net/mac80211/mesh_hwmp.c:603:26: warning: ‘target_metric’ may be used uninitialized in this function

target_metric is only consumed when reply = true so no bug exists here,
but not all versions of gcc realize it.  Initialize to 0 to remove the
warning.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-06 15:10:25 +02:00
Jouni Malinen 4ce2bd9c4c cfg80211: Allow reassociation to be requested with internal SME
If the user space issues a NL80211_CMD_CONNECT with
NL80211_ATTR_PREV_BSSID when there is already a connection, allow this
to proceed as a reassociation instead of rejecting the new connect
command with EALREADY.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[validate prev_bssid]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-06 15:09:28 +02:00
Jouni Malinen ba6fbacf9c cfg80211: Add option to specify previous BSSID for Connect command
This extends NL80211_CMD_CONNECT to allow the NL80211_ATTR_PREV_BSSID
attribute to be used similarly to way this was already allowed with
NL80211_CMD_ASSOCIATE. This allows user space to request reassociation
(instead of association) when already connected to an AP. This provides
an option to reassociate within an ESS without having to disconnect and
associate with the AP.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-06 13:18:21 +02:00
Felix Fietkau 918fe04b28 mac80211: minstrel_ht: set A-MSDU tx limits based on selected max_prob_rate
Prevents excessive A-MSDU aggregation at low data rates or bad
conditions.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-06 13:18:20 +02:00
Felix Fietkau 6e0456b545 mac80211: add A-MSDU tx support
Requires software tx queueing and fast-xmit support. For good
performance, drivers need frag_list support as well. This avoids the
need for copying data of aggregated frames. Running without it is only
supported for debugging purposes.

To avoid performance and packet size issues, the rate control module or
driver needs to limit the maximum A-MSDU size by setting
max_rc_amsdu_len in struct ieee80211_sta.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
[fix locking issue]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-06 13:18:19 +02:00