Commit Graph

39051 Commits

Author SHA1 Message Date
Jason Gunthorpe 2f31fa881f net/9p: Remove ib_get_dma_mr calls
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:12:36 -04:00
Matan Barak 816dd19b3d net: Add info for NETDEV_CHANGEUPPER event
Some consumers of NETDEV_CHANGEUPPER event would like to know which
upper device was linked/unlinked and what operation was carried.

Add information in the notifier info block for that purpose.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:49 -04:00
Matan Barak 399e6f9581 net/ipv6: Export addrconf_ifid_eui48
For loopback purposes, RoCE devices should have a default GID in the
port GID table, even when the interface is down. In order to do so,
we use the IPv6 link local address which would have been genenrated
for the related Ethernet netdevice when it goes up as a default GID.

addrconf_ifid_eui48 is used to gernerate this address, export it.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:49 -04:00
Sagi Grimberg fc27995942 RDS: Convert to ib_alloc_mr
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:46 -04:00
Steve Wise 9ac07501e1 svcrdma: limit FRMR page list lengths to device max
Svcrdma was incorrectly allocating fastreg MRs and page lists using
RPCSVC_MAXPAGES, which can exceed the device capabilities.  So limit
the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:46 -04:00
Sagi Grimberg 0410e38eca xprtrdma, svcrdma: Convert to ib_alloc_mr
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:45 -04:00
Haggai Eran 7c1eb45a22 IB/core: lock client data with lists_rwsem
An ib_client callback that is called with the lists_rwsem locked only for
read is protected from changes to the IB client lists, but not from
ib_unregister_device() freeing its client data. This is because
ib_unregister_device() will remove the device from the device list with
lists_rwsem locked for write, but perform the rest of the cleanup,
including the call to remove() without that lock.

Mark client data that is undergoing de-registration with a new going_down
flag in the client data context. Lock the client data list with lists_rwsem
for write in addition to using the spinlock, so that functions calling the
callback would be able to lock only lists_rwsem for read and let callbacks
sleep.

Since ib_unregister_client() now marks the client data context, no need for
remove() to search the context again, so pass the client data directly to
remove() callbacks.

Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 15:48:21 -04:00
Trond Myklebust 099392048c SUNRPC: Prevent SYN+SYNACK+RST storms
Add a shutdown() call before we release the socket in order to ensure the
reset is sent before we try to reconnect.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-29 19:11:21 -07:00
Pravin B Shelar a581b96dbf openvswitch: Remove vport-net
This structure is not used anymore.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 19:07:15 -07:00
Pravin B Shelar 8c876639c9 openvswitch: Remove vport stats.
Since all vport types are now backed by netdev, we can directly
use netdev stats. Following patch removes redundant stat
from vport.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 19:07:15 -07:00
Pravin B Shelar 3eedb41fb4 openvswitch: Remove egress_tun_info.
tun info is passed using skb-dst pointer. Now we have
converted all vports to netdev based implementation so
Now we can remove redundant pointer to tun-info from OVS_CB.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 19:07:15 -07:00
Pravin B Shelar 24d43f32d8 openvswitch: Remove vport get_name()
Remove unused get_name() function pointer from vport ops.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 19:07:15 -07:00
Trond Myklebust 0c78789e3a SUNRPC: xs_reset_transport must mark the connection as disconnected
In case the reconnection attempt fails.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-29 13:40:32 -07:00
Simon Horman c30da49789 openvswitch: retain parsed IPv6 header fields in flow on error skipping extension headers
When an error occurs skipping IPv6 extension headers retain the already
parsed IP protocol and IPv6 addresses in the flow. Also assume that the
packet is not a fragment in the absence of information to the contrary;
that is always use the frag_off value set by ipv6_skip_exthdr().

This allows matching on the IP protocol and IPv6 addresses of packets
with malformed extension headers.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:39:59 -07:00
David S. Miller f5004a14fa Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-08-28

One more bunch of Bluetooth patches for 4.3:

 - Crash fix for hci_bcm driver
 - Enhancements to hci_intel driver (e.g. baudrate configuration)
 - Fix for SCO link type after multiple connect attempts
 - Cleanups & minor fixes in a few other places

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:15:03 -07:00
Jiri Benc a43a9ef6a2 vxlan: do not receive IPv4 packets on IPv6 socket
By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.

In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.

Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seems to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc b9b6695cf0 fou: reject IPv6 config
fou does not really support IPv6 encapsulation. After an UDP socket is
created in fou_create, the encap_rcv callback is set either to fou_udp_recv
or to gue_udp_recv. Both of those unconditionally assume that the received
packet has an IPv4 header and access the data at network_header as it was an
IPv4 header. This leads to IPv6 flow label being interpreted as IP packet
length, etc.

Disallow fou tunnel to be configured as IPv6 until real IPv6 support is
added to fou.

CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc 7f9562a1f4 ip_tunnels: record IP version in tunnel info
There's currently nothing preventing directing packets with IPv6
encapsulation data to IPv4 tunnels (and vice versa). If this happens,
IPv6 addresses are incorrectly interpreted as IPv4 ones.

Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
in ip_tunnel_info. Reject packets at appropriate places if they are supposed
to be encapsulated into an incompatible protocol.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc 46fa062ad6 ip_tunnels: convert the mode field of ip_tunnel_info to flags
The mode field holds a single bit of information only (whether the
ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags.
This allows more mode flags to be added.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
David Ahern f6d3c19274 net: FIB tracepoints
A few useful tracepoints developing VRF driver.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:05:16 -07:00
Vlad Yasevich 73e6742027 sctp: Do not try to search for the transport twice
When removing an non-primary transport during ASCONF
processing, we end up traversing the transport list
twice: once in sctp_cmd_del_non_primary, and once in
sctp_assoc_del_peer.  We can avoid the second
search and call sctp_assoc_rm_peer() instead.
Found by code inspection during code reviews.

Signed-off-by: Vladislav Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 22:25:43 -07:00
lucien 7c5a946181 sctp: ASCONF-ACK with Unresolvable Address should be sent
RFC 5061:
    This is an opaque integer assigned by the sender to identify each
    request parameter.  The receiver of the ASCONF Chunk will copy this
    32-bit value into the ASCONF Response Correlation ID field of the
    ASCONF-ACK response parameter.  The sender of the ASCONF can use this
    same value in the ASCONF-ACK to find which request the response is
    for.  Note that the receiver MUST NOT change this 32-bit value.

    Address Parameter: TLV

    This field contains an IPv4 or IPv6 address parameter, as described
    in Section 3.3.2.1 of [RFC4960].

ASCONF chunk with Error Cause Indication Parameter (Unresolvable Address)
should be sent if the Delete IP Address is not part of the association.

  Endpoint A                           Endpoint B
  (ESTABLISHED)                        (ESTABLISHED)

  ASCONF        ----------------->
  (Delete IP Address)
                <-----------------      ASCONF-ACK
                                        (Unresolvable Address)

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 22:25:42 -07:00
Ken-ichirou MATSUZAWA 7084a31589 netlink: mmap: fix lookup frame position
__netlink_lookup_frame() was always called with the same "pos"
value in netlink_forward_ring(). It will look at the same ring entry
header over and over again, every time through this loop. Then cycle
through the whole ring, advancing ring->head, not "pos" until it
equals the "ring->head != head" loop test fails.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 22:25:42 -07:00
Christophe Ricard 0a6a3a23ea netlink: add NETLINK_CAP_ACK socket option
Since commit c05cdb1b86 ("netlink: allow large data transfers from
user-space"), the kernel may fail to allocate the necessary room for the
acknowledgment message back to userspace. This patch introduces a new
socket option that trims off the payload of the original netlink message.

The netlink message header is still included, so the user can guess from
the sequence number what is the message that has triggered the
acknowledgment.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 22:25:42 -07:00
Joe Stringer 0d5cdef8d5 openvswitch: Fix conntrack compilation without mark.
Fix build with !CONFIG_NF_CONNTRACK_MARK && CONFIG_OPENVSWITCH_CONNTRACK

Fixes: 182e304 ("openvswitch: Allow matching on conntrack mark")
Reported-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Tested-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 22:23:59 -07:00
Steve Wise bc3fe2e376 svcrdma: Use max_sge_rd for destination read depths
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-28 23:02:11 -04:00
David S. Miller 581a5f2a61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree.
In sum, patches to address fallout from the previous round plus updates from
the IPVS folks via Simon Horman, they are:

1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm
   directs network connections to the server with the highest weight that is
   currently available and overflows to the next when active connections exceed
   the node's weight. From Raducu Deaconu.

2) Fix locking ordering in IPVS, always take rtnl_lock in first place. Patch
   from Julian Anastasov.

3) Allow to indicate the MTU to the IPVS in-kernel state sync daemon. From
   Julian Anastasov.

4) Enhance multicast configuration for the IPVS state sync daemon. Also from
   Julian.

5) Resolve sparse warnings in the nf_dup modules.

6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set.

7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative
   subsets of code 1. From Andreas Herz.

8) Revert the jumpstack size calculation from mark_source_chains due to chain
   depth miscalculations, from Florian Westphal.

9) Calm down more sparse warning around the Netfilter tree, again from Florian
   Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 16:29:59 -07:00
Nikolay Aleksandrov c9fd56b34e netpoll: warn on netpoll_send_udp users who haven't disabled irqs
Make sure we catch future netpoll_send_udp users who use it without
disabling irqs and also as a hint for poll_controller users.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 16:24:54 -07:00
Pablo Neira Ayuso a9de9777d6 netfilter: nfnetlink: work around wrong endianess in res_id field
The convention in nfnetlink is to use network byte order in every header field
as well as in the attribute payload. The initial version of the batching
infrastructure assumes that res_id comes in host byte order though.

The only client of the batching infrastructure is nf_tables, so let's add a
workaround to address this inconsistency. We currently have 11 nfnetlink
subsystems according to NFNL_SUBSYS_COUNT, so we can assume that the subsystem
2560, ie. htons(10), will not be allocated anytime soon, so it can be an alias
of nf_tables from the nfnetlink batching path when interpreting the res_id
field.

Based on original patch from Florian Westphal.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-29 01:02:35 +02:00
Elad Raz 96be5f2806 netfilter: ipset: Fixing unnamed union init
In continue to proposed Vinson Lee's post [1], this patch fixes compilation
issues founded at gcc 4.4.7. The initialization of .cidr field of unnamed
unions causes compilation error in gcc 4.4.x.

References

Visible links
[1] https://lkml.org/lkml/2015/7/5/74

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-29 01:02:28 +02:00
Daniel Borkmann c1b3b19923 net: sched: don't break line in tc_classify loop notification
Just some minor noise follow-up to address some stylistic issues of
commit 3b3ae88026 ("net: sched: consolidate tc_classify{,_compat}").
Accidentally v1 instead of v2 of that commit got applied, so this
patch adds the relative diff.

Suggested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 14:01:15 -07:00
David S. Miller 8b72ca67fe Included changes:
- code beautification
 - remove obsolete 'deleted' attribute for bat-gw node
 - increase internal version number
 - prevent potential access to netdev object after deregistration
 - set needed_head/tail_room for batman virtual interface
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJV4CMzAAoJEOb/4TMchkvfsjYP/1IckH6GA4vMROXZ6+Y89ZVl
 Ii6uQck+0FUCkzld+sn8RYPDClqmNTB9PcYAVrrKDMFxOZVCdIvK+iLh1s2PQK4o
 eiab+V93VNK3DTlouvszsW0mdWCaBXiKR01PpEcr528d4LxFnGbiCKDGkcr/A0HM
 yMW9ZwmpohFk0/CRZPZlDJjTlTAUsvp+YXqA2vBUdBT1zN8Lb3BB7jK/rXXreLJ9
 nnNNn+AEw6Fy/FXTDc3QKMrCdyFlN+bHnyXds6345p5asy0JCVh0Sk2zuySi2MDh
 LT4cujIP5LtAhp8i52R5ZjOvNOQfGAyRxvBd20gZM/lKfMcIK3KEaMe7Se87NySR
 IhQ1nwE5S4AWnlmU+xfmX4vReojb9sCLhBN8oMC5bm3uqa4kLR9e4CBaPdRSpZex
 P3Z9DPaV1GAe3yyGFz9pRUTXzAuVMtYKjkObDkOVq+oKqBpG5UMf5gwEkycILvzb
 wwqHbjE7QY1izE8TgDe1y+GKPR8Qivwm7X8gwuHqmkKCNCmrqNkGjPkp1W4RGkzg
 BGJmS9FENxxILvt68rXZYqQ2JUg3Z+k/Jiw5gmQGPjwH62JnEMkKD4NMphDf6hxm
 85bT9J8mF2AJJKDvEYwDu+KFEhX0LyhcN38SGq5QYRS+9nf0JoZzWujHJ7oXHJiw
 o7mwRayOYPLzeLLWpjWy
 =KKXM
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
Included changes:
- code beautification
- remove obsolete 'deleted' attribute for bat-gw node
- increase internal version number
- prevent potential access to netdev object after deregistration
- set needed_head/tail_room for batman virtual interface
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:43:33 -07:00
Valentin Rothberg 9723e6abc7 openswitch: fix typo CONFIG_NF_CONNTRACK_LABEL
Fix typo in conntrack.c
s/CONFIG_NF_CONNTRACK_LABEL/CONFIG_NF_CONNTRACK_LABELS/

Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:39:53 -07:00
David Ahern 192132b9a0 net: Add support for VRFs to inetpeer cache
inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Enhance the ipv4 address key
to contain both the IPv4 address and VRF device index.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:32:36 -07:00
David Ahern d39d14ffa2 net: Add helper function to compare inetpeer addresses
tcp_metrics and inetpeer both have functions to compare inetpeer
addresses. Consolidate into 1 version.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:32:36 -07:00
David Ahern 3abef286cf net: Add set,get helpers for inetpeer addresses
Use inetpeer set,get helpers in tcp_metrics rather than peeking into
the inetpeer_addr struct.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:32:36 -07:00
David Ahern 72afa352d6 net: Introduce ipv4_addr_hash and use it for tcp metrics
Refactors a common line into helper function.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:32:35 -07:00
Philip Downey df2cf4a78e IGMP: Inhibit reports for local multicast groups
The range of addresses between 224.0.0.0 and 224.0.0.255 inclusive, is
reserved for the use of routing protocols and other low-level topology
discovery or maintenance protocols, such as gateway discovery and
group membership reporting.  Multicast routers should not forward any
multicast datagram with destination addresses in this range,
regardless of its TTL.

Currently, IGMP reports are generated for this reserved range of
addresses even though a router will ignore this information since it
has no purpose.  However, the presence of reserved group addresses in
an IGMP membership report uses up network bandwidth and can also
obscure addresses of interest when inspecting membership reports using
packet inspection or debug messages.

Although the RFCs for the various version of IGMP (e.g.RFC 3376 for
v3) do not specify that the reserved addresses be excluded from
membership reports, it should do no harm in doing so.  In particular
there should be no adverse effect in any IGMP snooping functionality
since 224.0.0.x is specifically excluded as per RFC 4541 (IGMP and MLD
Snooping Switches Considerations) section 2.1.2. Data Forwarding
Rules:

    2) Packets with a destination IP (DIP) address in the 224.0.0.X
       range which are not IGMP must be forwarded on all ports.

IGMP reports for local multicast groups can now be optionally
inhibited by means of a system control variable (by setting the value
to zero) e.g.:
    echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports

To retain backwards compatibility the previous behaviour is retained
by default on system boot or reverted by setting the value back to
non-zero e.g.:
    echo 1 >  /proc/sys/net/ipv4/igmp_link_local_mcast_reports

Signed-off-by: Philip Downey <pdowney@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 13:28:47 -07:00
Florian Westphal 851345c5bb netfilter: reduce sparse warnings
bridge/netfilter/ebtables.c:290:26: warning: incorrect type in assignment (different modifiers)
-> remove __pure annotation.

ipv6/netfilter/ip6t_SYNPROXY.c:240:27: warning: cast from restricted __be16
-> switch ntohs to htons and vice versa.

netfilter/core.c:391:30: warning: symbol 'nfq_ct_nat_hook' was not declared. Should it be static?
-> delete it, got removed

net/netfilter/nf_synproxy_core.c:221:48: warning: cast to restricted __be32
-> Use __be32 instead of u32.

Tested with objdiff that these changes do not affect generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-28 21:04:12 +02:00
Florian Westphal 98dbbfc3f1 Revert "netfilter: xtables: compute exact size needed for jumpstack"
This reverts commit 98d1bd802c.

mark_source_chains will not re-visit chains, so

*filter
:INPUT ACCEPT [365:25776]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [217:45832]
:t1 - [0:0]
:t2 - [0:0]
:t3 - [0:0]
:t4 - [0:0]
-A t1 -i lo -j t2
-A t2 -i lo -j t3
-A t3 -i lo -j t4
# -A INPUT -j t4
# -A INPUT -j t3
# -A INPUT -j t2
-A INPUT -j t1
COMMIT

Will compute a chain depth of 2 if the comments are removed.
Revert back to counting the number of chains for the time being.

Reported-by: Cong Wang <cwang@twopensource.com>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-28 21:04:11 +02:00
Kuba Pawlak 618353b1f3 Bluetooth: Fix SCO link type handling on connection complete
Synchronous connections are initially created with type eSCO.
Link manager may reject proposed link parameters, which triggers
connection setup retry with a different set. Link type embedded
in responses should be disregarded until Synchronous Connect Complete
returns Success (0x00). Current code updates link type every time
which creates an issue when link type changes to SCO and back to eSCO
on further attepts.

Issue happens with BlackBerry 9100 and 9700 with Intel WilkinsPeak
on third connection setup attept

2015-05-18 01:27:57.332242 < HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
    handle 256 voice setting 0x0060 ptype 0x0380
2015-05-18 01:27:57.333604 > HCI Event: Command Status (0x0f) plen 4
    Setup Synchronous Connection (0x01|0x0028) status 0x00 ncmd 1
2015-05-18 01:27:57.334614 > HCI Event: Synchronous Connect Complete (0x2c) plen 17
    status 0x1a handle 0 bdaddr 30:7C:30:B3:A8:86 type SCO
    Error: Unsupported Remote Feature / Unsupported LMP Feature
2015-05-18 01:27:57.334895 < HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
    handle 256 voice setting 0x0060 ptype 0x0380
2015-05-18 01:27:57.335601 > HCI Event: Command Status (0x0f) plen 4
    Setup Synchronous Connection (0x01|0x0028) status 0x00 ncmd 1
2015-05-18 01:27:57.336610 > HCI Event: Synchronous Connect Complete (0x2c) plen 17
    status 0x1a handle 0 bdaddr 30:7C:30:B3:A8:86 type SCO
    Error: Unsupported Remote Feature / Unsupported LMP Feature
2015-05-18 01:27:57.336685 < HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
    handle 256 voice setting 0x0060 ptype 0x03c8
2015-05-18 01:27:57.337603 > HCI Event: Command Status (0x0f) plen 4
    Setup Synchronous Connection (0x01|0x0028) status 0x00 ncmd 1
2015-05-18 01:27:57.342608 > HCI Event: Max Slots Change (0x1b) plen 3
    handle 256 slots 1
2015-05-18 01:27:57.377631 > HCI Event: Synchronous Connect Complete (0x2c) plen 17
    status 0x00 handle 257 bdaddr 30:7C:30:B3:A8:86 type eSCO
    Air mode: CVSD

Signed-off-by: Kuba Pawlak <kubax.t.pawlak@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-28 21:03:00 +02:00
Stefan Schmidt 4e1795de10 nl802154: stricter input checking for boolean inputs
So far we handled boolean input by forcing them with !! and assigning
them into a bool. This allowed userspace to send values > 1 which were
used as 1. We should be stricter here and return -EINVAL for all but
0 or 1.

Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-28 21:00:37 +02:00
Nicholas Krause df945360ce Bluetooth: Make the function sco_conn_del have a return type of void
This makes the function sco_conn_del have a return type of void now
due to this function always running successfully and thus never
needing to signal its caller when a non recoverable internal failure
occurs by returning a error code to its respective caller.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-28 21:00:37 +02:00
Jozsef Kadlecsik 6fe7ccfd77 netfilter: ipset: Out of bound access in hash:net* types fixed
Dave Jones reported that KASan detected out of bounds access in hash:net*
types:

[   23.139532] ==================================================================
[   23.146130] BUG: KASan: out of bounds access in hash_net4_add_cidr+0x1db/0x220 at addr ffff8800d4844b58
[   23.152937] Write of size 4 by task ipset/457
[   23.159742] =============================================================================
[   23.166672] BUG kmalloc-512 (Not tainted): kasan: bad access detected
[   23.173641] -----------------------------------------------------------------------------
[   23.194668] INFO: Allocated in hash_net_create+0x16a/0x470 age=7 cpu=1 pid=456
[   23.201836]  __slab_alloc.constprop.66+0x554/0x620
[   23.208994]  __kmalloc+0x2f2/0x360
[   23.216105]  hash_net_create+0x16a/0x470
[   23.223238]  ip_set_create+0x3e6/0x740
[   23.230343]  nfnetlink_rcv_msg+0x599/0x640
[   23.237454]  netlink_rcv_skb+0x14f/0x190
[   23.244533]  nfnetlink_rcv+0x3f6/0x790
[   23.251579]  netlink_unicast+0x272/0x390
[   23.258573]  netlink_sendmsg+0x5a1/0xa50
[   23.265485]  SYSC_sendto+0x1da/0x2c0
[   23.272364]  SyS_sendto+0xe/0x10
[   23.279168]  entry_SYSCALL_64_fastpath+0x12/0x6f

The bug is fixed in the patch and the testsuite is extended in ipset
to check cidr handling more thoroughly.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-08-28 18:51:30 +02:00
David S. Miller 0d36938bb8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-08-27 21:45:31 -07:00
Linus Torvalds e001d7084a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Some straggler bug fixes here:

   1) Netlink_sendmsg() doesn't check iterator type properly in mmap
      case, from Ken-ichirou MATSUZAWA.

   2) Don't sleep in atomic context in bcmgenet driver, from Florian
      Fainelli.

   3) The pfkey_broadcast() code patch can't actually ever use anything
      other than GFP_ATOMIC.  And the cases that right now pass
      GFP_KERNEL or similar will currently trigger an RCU splat.  Just
      use GFP_ATOMIC unconditionally.  From David Ahern.

   4) Fix FD bit timings handling in pcan_usb driver, from Marc
      Kleine-Budde.

   5) Cache dst leaked in ip6_gre tunnel removal, fix from Huaibin Wang.

   6) Traversal into drivers/net/ethernet/renesas should be triggered by
      CONFIG_NET_VENDOR_RENESAS, not a particular driver's config
      option.  From Kazuya Mizuguchi.

   7) Fix regression in handling of igmp_join errors in vxlan, from
      Marcelo Ricardo Leitner.

   8) Make phy_{read,write}_mmd_indirect() properly take the mdio_lock
      mutex when programming the registers.  From Russell King.

   9) Fix non-forced handling in u32_destroy(), from WANG Cong.

  10) Test the EVENT_NO_RUNTIME_PM flag before it is cleared in
      usbnet_stop(), from Eugene Shatokhin.

  11) In sfc driver, don't fetch statistics firmware isn't capable of,
      from Bert Kenward.

  12) Verify ASCONF address parameter location in SCTP, from Xin Long"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  sctp: donot reset the overall_error_count in SHUTDOWN_RECEIVE state
  sctp: asconf's process should verify address parameter is in the beginning
  sfc: only use vadaptor stats if firmware is capable
  net: phy: fixed: propagate fixed link values to struct
  usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared
  drivers: net: xgene: fix: Oops in linkwatch_fire_event
  cls_u32: complete the check for non-forced case in u32_destroy()
  net: fec: use reinit_completion() in mdio accessor functions
  net: phy: add locking to phy_read_mmd_indirect()/phy_write_mmd_indirect()
  vxlan: re-ignore EADDRINUSE from igmp_join
  net: compile renesas directory if NET_VENDOR_RENESAS is configured
  ip6_gre: release cached dst on tunnel removal
  phylib: Make PHYs children of their MDIO bus, not the bus' parent.
  can: pcan_usb: don't provide CAN FD bittimings by non-FD adapters
  net: Fix RCU splat in af_key
  net: bcmgenet: fix uncleaned dma flags
  net: bcmgenet: Avoid sleeping in bcmgenet_timeout
  netlink: mmap: fix tx type check
2015-08-27 17:52:38 -07:00
Phil Sutter 3e692f2153 net: sched: simplify attach_one_default_qdisc()
Now that noqueue qdisc can be attached just like any other qdisc, no
special treatment is necessary anymore when attaching it as default
qdisc.

This change has the added benefit that 'tc qdisc show' prints noqueue
instead of nothing for devices defaulting to noqueue.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:30 -07:00
Phil Sutter d66d6c3152 net: sched: register noqueue qdisc
This way users can attach noqueue just like any other qdisc using tc
without having to mess with tx_queue_len first.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:30 -07:00
Phil Sutter db4094bca7 net: sched: ignore tx_queue_len when assigning default qdisc
Since alloc_netdev_mqs() sets IFF_NO_QUEUE for drivers not initializing
tx_queue_len, it is safe to assume that if tx_queue_len is zero,
dev->priv flags always contains IFF_NO_QUEUE.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:29 -07:00
Phil Sutter f84bb1eac0 net: fix IFF_NO_QUEUE for drivers using alloc_netdev
Printing a warning in alloc_netdev_mqs() if tx_queue_len is zero and
IFF_NO_QUEUE not set is not appropriate since drivers may use one of the
alloc_netdev* macros instead of alloc_etherdev*, thereby not
intentionally leaving tx_queue_len uninitialized. Instead check here if
tx_queue_len is zero and set IFF_NO_QUEUE, so the value of tx_queue_len
can be ignored in net/sched_generic.c.

Fixes: 906470c ("net: warn if drivers set tx_queue_len = 0")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:29 -07:00
Jean Sacren 69dba9bbc5 sock: fix kernel doc error
The symbol '__sk_reclaim' is not present in the current tree. Apparently
'__sk_reclaim' was meant to be '__sk_mem_reclaim', so fix it with the
right symbol name for the kernel doc.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:12:41 -07:00
lucien f648f807f6 sctp: donot reset the overall_error_count in SHUTDOWN_RECEIVE state
Commit f8d9605243 ("sctp: Enforce retransmission limit during shutdown")
fixed a problem with excessive retransmissions in the SHUTDOWN_PENDING by not
resetting the association overall_error_count.  This allowed the association
to better enforce assoc.max_retrans limit.

However, the same issue still exists when the association is in SHUTDOWN_RECEIVED
state.  In this state, HB-ACKs will continue to reset the overall_error_count
for the association would extend the lifetime of association unnecessarily.

This patch solves this by resetting the overall_error_count whenever the current
state is small then SCTP_STATE_SHUTDOWN_PENDING.  As a small side-effect, we
end up also handling SCTP_STATE_SHUTDOWN_ACK_SENT and SCTP_STATE_SHUTDOWN_SENT
states, but they are not really impacted because we disable Heartbeats in those
states.

Fixes: Commit f8d9605243 ("sctp: Enforce retransmission limit during shutdown")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:11:44 -07:00
Nikolay Aleksandrov b22fbf22f8 bridge: fdb: rearrange net_bridge_fdb_entry
While looking into fixing the local entries scalability issue I noticed
that the structure is badly arranged because vlan_id would fall in a
second cache line while keeping rcu which is used only when deleting
in the first, so re-arrange the structure and push rcu to the end so we
can get 16 bytes which can be used for other fields (by pushing rcu
fully in the second 64 byte chunk). With this change all the core
necessary information when doing fdb lookups will be available in a
single cache line.

pahole before (note vlan_id):
struct net_bridge_fdb_entry {
	struct hlist_node          hlist;                /*     0    16 */
	struct net_bridge_port *   dst;                  /*    16     8 */
	struct callback_head       rcu;                  /*    24    16 */
	long unsigned int          updated;              /*    40     8 */
	long unsigned int          used;                 /*    48     8 */
	mac_addr                   addr;                 /*    56     6 */
	unsigned char              is_local:1;           /*    62: 7  1 */
	unsigned char              is_static:1;          /*    62: 6  1 */
	unsigned char              added_by_user:1;      /*    62: 5  1 */
	unsigned char              added_by_external_learn:1; /*    62: 4  1 */

	/* XXX 4 bits hole, try to pack */
	/* XXX 1 byte hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	__u16                      vlan_id;              /*    64     2 */

	/* size: 72, cachelines: 2, members: 11 */
	/* sum members: 65, holes: 1, sum holes: 1 */
	/* bit holes: 1, sum bit holes: 4 bits */
	/* padding: 6 */
	/* last cacheline: 8 bytes */
}

pahole after (note vlan_id):
struct net_bridge_fdb_entry {
	struct hlist_node          hlist;                /*     0    16 */
	struct net_bridge_port *   dst;                  /*    16     8 */
	long unsigned int          updated;              /*    24     8 */
	long unsigned int          used;                 /*    32     8 */
	mac_addr                   addr;                 /*    40     6 */
	__u16                      vlan_id;              /*    46     2 */
	unsigned char              is_local:1;           /*    48: 7  1 */
	unsigned char              is_static:1;          /*    48: 6  1 */
	unsigned char              added_by_user:1;      /*    48: 5  1 */
	unsigned char              added_by_external_learn:1; /*    48: 4  1 */

	/* XXX 4 bits hole, try to pack */
	/* XXX 7 bytes hole, try to pack */

	struct callback_head       rcu;                  /*    56    16 */
	/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */

	/* size: 72, cachelines: 2, members: 11 */
	/* sum members: 65, holes: 1, sum holes: 7 */
	/* bit holes: 1, sum bit holes: 4 bits */
	/* last cacheline: 8 bytes */
}

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 16:38:52 -07:00
Joe Stringer 7b85b4dff2 openvswitch: Include ip6_fib.h.
kbuild test robot reports that certain configurations will not
automatically pick up on the "struct rt6_info" definition, so explicitly
include the header for this structure.

Fixes: 7f8a436 "openvswitch: Add conntrack action"
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 16:35:51 -07:00
Jiri Pirko 35d4e17252 net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag
Add this helper so code can easily figure out if netdev is openswitch.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 16:28:35 -07:00
Jiri Pirko 0e4ead9d7b net: introduce change upper device notifier change info
Add info that is passed along with NETDEV_CHANGEUPPER event.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 16:28:34 -07:00
Pravin B Shelar 371bd1061d geneve: Consolidate Geneve functionality in single module.
geneve_core module handles send and receive functionality.
This way OVS could use the Geneve API. Now with use of
tunnel meatadata mode OVS can directly use Geneve netdevice.
So there is no need for separate module for Geneve. Following
patch consolidates Geneve protocol processing in single module.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 15:42:48 -07:00
Pravin B Shelar 6b001e682e openvswitch: Use Geneve device.
With help of tunnel metadata mode OVS can directly use
Geneve devices to implement Geneve tunnels.
This patch removes all of the OVS specific Geneve code
and make OVS use a Geneve net_device. Basic geneve vport
is still there to handle compatibility with current
userspace application.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 15:42:47 -07:00
Pravin B Shelar c29a70d2ca tunnel: introduce udp_tun_rx_dst()
Introduce function udp_tun_rx_dst() to initialize tunnel dst on
receive path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 15:42:47 -07:00
Toshiaki Makita d2d427b392 bridge: Add netlink support for vlan_protocol attribute
This enables bridge vlan_protocol to be configured through netlink.

When CONFIG_BRIDGE_VLAN_FILTERING is disabled, kernel behaves the
same way as this feature is not implemented.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 15:35:33 -07:00
Daniel Borkmann 3b3ae88026 net: sched: consolidate tc_classify{,_compat}
For classifiers getting invoked via tc_classify(), we always need an
extra function call into tc_classify_compat(), as both are being
exported as symbols and tc_classify() itself doesn't do much except
handling of reclassifications when tp->classify() returned with
TC_ACT_RECLASSIFY.

CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
all others use tc_classify(). When tc actions are being configured
out in the kernel, tc_classify() effectively does nothing besides
delegating.

We could spare this layer and consolidate both functions. pktgen on
single CPU constantly pushing skbs directly into the netif_receive_skb()
path with a dummy classifier on ingress qdisc attached, improves
slightly from 22.3Mpps to 23.1Mpps.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 14:18:48 -07:00
lucien ce7b4ccc4f sctp: asconf's process should verify address parameter is in the beginning
in sctp_process_asconf(), we get address parameter from the beginning of
the addip params. but we never check if it's really there. if the addr
param is not there, it still can pass sctp_verify_asconf(), then to be
handled by sctp_process_asconf(), it will not be safe.

so add a code in sctp_verify_asconf() to check the address parameter is in
the beginning, or return false to send abort.

note that this can also detect multiple address parameters, and reject it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 13:59:33 -07:00
Joe Stringer cae3a26275 openvswitch: Allow attaching helpers to ct action
Add support for using conntrack helpers to assist protocol detection.
The new OVS_CT_ATTR_HELPER attribute of the CT action specifies a helper
to be used for this connection. If no helper is specified, then helpers
will be automatically applied as per the sysctl configuration of
net.netfilter.nf_conntrack_helper.

The helper may be specified as part of the conntrack action, eg:
ct(helper=ftp). Initial packets for related connections should be
committed to allow later packets for the flow to be considered
established.

Example ovs-ofctl flows allowing FTP connections from ports 1->2:
in_port=1,tcp,action=ct(helper=ftp,commit),2
in_port=2,tcp,ct_state=-trk,action=ct(recirc)
in_port=2,tcp,ct_state=+trk-new+est,action=1
in_port=2,tcp,ct_state=+trk+rel,action=1

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:43 -07:00
Joe Stringer c2ac667358 openvswitch: Allow matching on conntrack label
Allow matching and setting the ct_label field. As with ct_mark, this is
populated by executing the CT action. The label field may be modified by
specifying a label and mask nested under the CT action. It is stored as
metadata attached to the connection. Label modification occurs after
lookup, and will only persist when the conntrack entry is committed by
providing the COMMIT flag to the CT action. Labels are currently fixed
to 128 bits in size.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:43 -07:00
Joe Stringer 86ca02e774 netfilter: connlabels: Export setting connlabel length
Add functions to change connlabel length into nf_conntrack_labels.c so
they may be reused by other modules like OVS and nftables without
needing to jump through xt_match_check() hoops.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:43 -07:00
Joe Stringer 55e5713f2b netfilter: Always export nf_connlabels_replace()
The following patches will reuse this code from OVS.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:43 -07:00
Joe Stringer 182e3042e1 openvswitch: Allow matching on conntrack mark
Allow matching and setting the ct_mark field. As with ct_state and
ct_zone, these fields are populated when the CT action is executed. To
write to this field, a value and mask can be specified as a nested
attribute under the CT action. This data is stored with the conntrack
entry, and is executed after the lookup occurs for the CT action. The
conntrack entry itself must be committed using the COMMIT flag in the CT
action flags for this change to persist.

Signed-off-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:43 -07:00
Joe Stringer 7f8a436eaa openvswitch: Add conntrack action
Expose the kernel connection tracker via OVS. Userspace components can
make use of the CT action to populate the connection state (ct_state)
field for a flow. This state can be subsequently matched.

Exposed connection states are OVS_CS_F_*:
- NEW (0x01) - Beginning of a new connection.
- ESTABLISHED (0x02) - Part of an existing connection.
- RELATED (0x04) - Related to an established connection.
- INVALID (0x20) - Could not track the connection for this packet.
- REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
- TRACKED (0x80) - This packet has been sent through conntrack.

When the CT action is executed by itself, it will send the packet
through the connection tracker and populate the ct_state field with one
or more of the connection state flags above. The CT action will always
set the TRACKED bit.

When the COMMIT flag is passed to the conntrack action, this specifies
that information about the connection should be stored. This allows
subsequent packets for the same (or related) connections to be
correlated with this connection. Sending subsequent packets for the
connection through conntrack allows the connection tracker to consider
the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.

The CT action may optionally take a zone to track the flow within. This
allows connections with the same 5-tuple to be kept logically separate
from connections in other zones. If the zone is specified, then the
"ct_zone" match field will be subsequently populated with the zone id.

IP fragments are handled by transparently assembling them as part of the
CT action. The maximum received unit (MRU) size is tracked so that
refragmentation can occur during output.

IP frag handling contributed by Andy Zhou.

Based on original design by Justin Pettit.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:43 -07:00
Joe Stringer 5b49004724 ipv6: Export nf_ct_frag6_gather()
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:42 -07:00
Joe Stringer be26b9a88f openvswitch: Move MASKED* macros to datapath.h
This will allow the ovs-conntrack code to reuse these macros.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:42 -07:00
Joe Stringer 8e2fed1c0c openvswitch: Serialize acts with original netlink len
Previously, we used the kernel-internal netlink actions length to
calculate the size of messages to serialize back to userspace.
However,the sw_flow_actions may not be formatted exactly the same as the
actions on the wire, so store the original actions length when
de-serializing and re-use the original length when serializing.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 11:40:42 -07:00
Marek Lindner ed29266347 batman-adv: turn batadv_neigh_node_get() into local function
commit c214ebe1eb29 ("batman-adv: move neigh_node list add into
batadv_neigh_node_new()") removed external calls to
batadv_neigh_node_get().

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:34 +02:00
Sven Eckelmann 7bca68c784 batman-adv: Add lower layer needed_(head|tail)room to own ones
The maximum of hard_header_len and maximum of all needed_(head|tail)room of
all slave interfaces of a batman-adv device must be used to define the
batman-adv device needed_(head|tail)room. This is required to avoid too
small buffer problems when these slave devices try to send the encapsulated
packet in a tx path without the possibility to resize the skbuff.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:34 +02:00
Antonio Quartulli a5256f7e74 batman-adv: don't access unregistered net_device object
In batadv_hardif_disable_interface() there is a call to
batadv_softif_destroy_sysfs() which in turns invokes
unregister_netdevice() on the soft_iface.
After this point we cannot rely on the soft_iface object
anymore because it might get free'd by the netdev periodic
routine at any time.

For this reason the netdev_upper_dev_unlink(.., soft_iface) call
is moved before the invocation of batadv_softif_destroy_sysfs() so
that we can be sure that the soft_iface object is still valid.

Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2015-08-27 20:15:33 +02:00
Simon Wunderlich 07c48eca16 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:33 +02:00
Simon Wunderlich 1e3b4669e7 batman-adv: fix gateway client style issues
commit 0511575c4d03 ("batman-adv: remove obsolete deleted attribute for
gateway node") incorrectly added an empy line and forgot to remove an
include.

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:32 +02:00
Marek Lindner 3f32f8a687 batman-adv: rearrange batadv_neigh_node_new() arguments to follow convention
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:32 +02:00
Simon Wunderlich bd3524c14b batman-adv: remove obsolete deleted attribute for gateway node
With rcu, the gateway node deleted attribute is not needed anymore. In
fact, it may delay the free of the gateway node and its referenced
structures. Therefore remove it altogether and simplify purging as well.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:32 +02:00
Marek Lindner 741aa06bfb batman-adv: move neigh_node list add into batadv_neigh_node_new()
All batadv_neigh_node_* functions expect the neigh_node list item to be part
of the orig_node->neigh_list, therefore the constructor of said list item
should be adding the newly created neigh_node to the respective list.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:31 +02:00
Marek Lindner 39bf7618f0 batman-adv: remove redundant hard_iface assignment
The batadv_neigh_node_new() function already sets the hard_iface pointer.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:31 +02:00
Marek Lindner f729dc70da batman-adv: move hardif refcount inc to batadv_neigh_node_new()
The batadv_neigh_node cleanup function 'batadv_neigh_node_free_rcu()'
takes care of reducing the hardif refcounter, hence it's only logical
to assume the creating function of that same object
'batadv_neigh_node_new()' takes care of increasing the same refcounter.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-27 20:15:30 +02:00
Alexei Starovoitov 1dd34b5ad8 bpf: fix bpf_skb_set_tunnel_key() helper
Make sure to indicate to tunnel driver that key.tun_id is set,
otherwise gre won't recognize the metadata.

Fixes: d3aa45ce6b ("bpf: add helpers to access tunnel metadata")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 17:38:13 -07:00
Pablo Neira Ayuso 1b383bf912 Merge tag 'ipvs2-for-v4.3' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Second Round of IPVS Updates for v4.3

I realise these are a little late in the cycle, so if you would prefer
me to repost them for v4.4 then just let me know.

The updates include:
* A new scheduler from Raducu Deaconu
* Enhanced configurability of the sync daemon from Julian Anastasov
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-26 20:34:46 +02:00
Andreas Herz 1afe839e6b netfilter: ip6t_REJECT: added missing icmpv6 codes
RFC 4443 added two new codes values for ICMPv6 type 1:

 5 - Source address failed ingress/egress policy
 6 - Reject route to destination

And RFC 7084 states in L-14 that IPv6 Router MUST send ICMPv6 Destination
Unreachable with code 5 for packets forwarded to it that use an address
from a prefix that has been invalidated.

Codes 5 and 6 are more informative subsets of code 1.

Signed-off-by: Andreas Herz <andi@geekosphere.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-26 20:32:35 +02:00
Alexei Starovoitov cff82457c5 net_sched: act_bpf: remove spinlock in fast path
Similar to act_gact/act_mirred, act_bpf can be lockless in packet processing
with extra care taken to free bpf programs after rcu grace period.
Replacement of existing act_bpf (very rare) is done with synchronize_rcu()
and final destruction is done from tc_action_ops->cleanup() callback that is
called from tcf_exts_destroy()->tcf_action_destroy()->__tcf_hash_release() when
bind and refcnt reach zero which is only possible when classifier is destroyed.
Previous two patches fixed the last two classifiers (tcindex and rsvp) to
call tcf_exts_destroy() from rcu callback.

Similar to gact/mirred there is a race between prog->filter and
prog->tcf_action. Meaning that the program being replaced may use
previous default action if it happened to return TC_ACT_UNSPEC.
act_mirred race betwen tcf_action and tcfm_dev is similar.
In all cases the race is harmless.
Long term we may want to improve the situation by replacing the whole
tc_action->priv as single pointer instead of updating inner fields one by one.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:45 -07:00
Alexei Starovoitov 9e528d8915 net_sched: convert rsvp to call tcf_exts_destroy from rcu callback
Adjust destroy path of cls_rsvp to call tcf_exts_destroy() after
rcu grace period.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:45 -07:00
Alexei Starovoitov ed7aa879ce net_sched: convert tcindex to call tcf_exts_destroy from rcu callback
Adjust destroy path of cls_tcindex to call tcf_exts_destroy() after
rcu grace period.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:44 -07:00
Alexei Starovoitov faa54be4c7 net_sched: act_bpf: remove unnecessary copy
Fix harmless typo and avoid unnecessary copy of empty 'prog' into
unused 'strcut tcf_bpf_cfg old'.

Fixes: f4eaed28c7 ("act_bpf: fix memory leaks when replacing bpf programs")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:44 -07:00
Alexei Starovoitov 3c645621b7 net_sched: make tcf_hash_destroy() static
tcf_hash_destroy() used once. Make it static.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:44 -07:00
WANG Cong a6c1aea044 cls_u32: complete the check for non-forced case in u32_destroy()
In commit 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
I added a check in u32_destroy() to see if all real filters are gone
for each tp, however, that is only done for root_ht, same is needed
for others.

This can be reproduced by the following tc commands:

tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32
ht 15:2: match ip src 10.0.0.2 flowid 1:10
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32
ht 15:2: match ip src 10.0.0.3 flowid 1:10

Fixes: 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 17:02:48 -07:00
santosh.shilimkar@oracle.com 2724121419 RDS: remove superfluous from rds_ib_alloc_fmr()
Memory allocated for 'ibmr' uses kzalloc_node() which already
initialises the memory to zero. There is no need to do
memset() 0 on that memory.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com ef5217a6e2 RDS: flush the FMR pool less often
FMR flush is an expensive and time consuming operation. Reduce the
frequency of FMR pool flush by 50% so that more FMR work gets accumulated
for more efficient flushing.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com ad1d7dc0d7 RDS: push FMR pool flush work to its own worker
RDS FMR flush operation and also it races with connect/reconect
which happes a lot with RDS. FMR flush being on common rds_wq aggrevates
the problem. Lets push RDS FMR pool flush work to its own worker.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
Wengang Wang 6116c2030f RDS: fix fmr pool dirty_count
In rds_ib_flush_mr_pool(), dirty_count accounts the clean ones
which is wrong. This can lead to a negative dirty count value.

Lets fix it.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com 3f6b314303 RDS: Fix rds MR reference count in rds_rdma_unuse()
rds_rdma_unuse() drops the mr reference count which it hasn't
taken. Correct way of removing mr is to remove mr from the tree
and then rdma_destroy_mr() it first, then rds_mr_put() to decrement
its reference count. Whichever thread holds last reference will free
the mr via rds_mr_put()

This bug was triggering weird null pointer crashes. One if the trace
for it is captured below.

BUG: unable to handle kernel NULL pointer dereference at
0000000000000104
IP: [<ffffffffa0899471>] rds_ib_free_mr+0x31/0x130 [rds_rdma]
PGD 4366fa067 PUD 4366f9067 PMD 0
Oops: 0000 [#1] SMP

[...]

task: ffff88046da6a000 ti: ffff88046da6c000 task.ti: ffff88046da6c000
RIP: 0010:[<ffffffffa0899471>]  [<ffffffffa0899471>]
rds_ib_free_mr+0x31/0x130 [rds_rdma]
RSP: 0018:ffff88046fa43bd8  EFLAGS: 00010286
RAX: 0000000071d38b80 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880079e7ff40
RBP: ffff88046fa43bf8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88046fa43ca8 R11: ffff88046a802ed8 R12: ffff880079e7fa40
R13: 0000000000000000 R14: ffff880079e7ff40 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88046fa40000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000104 CR3: 00000004366fb000 CR4: 00000000000006e0
Stack:
 ffff880079e7fa40 ffff880671d38f08 ffff880079e7ff40 0000000000000296
 ffff88046fa43c28 ffffffffa087a38b ffff880079e7fa40 ffff880671d38f10
 0000000000000000 0000000000000292 ffff88046fa43c48 ffffffffa087a3b6
Call Trace:
 <IRQ>
 [<ffffffffa087a38b>] rds_destroy_mr+0x8b/0xa0 [rds]
 [<ffffffffa087a3b6>] __rds_put_mr_final+0x16/0x30 [rds]
 [<ffffffffa087a492>] rds_rdma_unuse+0xc2/0x120 [rds]
 [<ffffffffa08766d3>] rds_recv_incoming_exthdrs+0x83/0xa0 [rds]
 [<ffffffffa0876782>] rds_recv_incoming+0x92/0x200 [rds]
 [<ffffffffa0895269>] rds_ib_process_recv+0x259/0x320 [rds_rdma]
 [<ffffffffa08962a8>] rds_ib_recv_tasklet_fn+0x1a8/0x490 [rds_rdma]
 [<ffffffff810dcd78>] ? __remove_hrtimer+0x58/0x90
 [<ffffffff810799e1>] tasklet_action+0xb1/0xc0
 [<ffffffff81079b52>] __do_softirq+0xe2/0x290
 [<ffffffff81079df6>] irq_exit+0xa6/0xb0
 [<ffffffff81613915>] do_IRQ+0x65/0xf0
 [<ffffffff816118ab>] common_interrupt+0x6b/0x6b

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:10 -07:00
santosh.shilimkar@oracle.com ba54d3ced9 RDS: fix the dangling reference to rds_ib_incoming_slab
On rds_ib_frag_slab allocation failure, ensure rds_ib_incoming_slab
is not pointing to the detsroyed memory.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:28:10 -07:00
David S. Miller b8766e4ed3 Included changes:
- code restyling and beautification
 - use int kernel types instead of C99
 - update kereldoc
 - prevent potential hlist double deletion of VLAN objects
 - fix gw bandwidth calculation
 - convert list to hlist when needed
 - add lockdep_asserts calls in function with lock requirements
   described in kerneldoc
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJV26RQAAoJEOb/4TMchkvfsGkP/jJrS9XAUL6SkYARWU7JB++I
 G0tVjzJVzh9aPXMV5uFcqwYYB/j2SvqIsad9MCcRpgsmsEhc52TELZCsY6HtZFLt
 0la5mOAkMmjBvW8/gzaGM4nzDc388UYOxW1gKwl+obczUCpEnbyu/R7CiC8zB5d7
 oHYfgmYbGTAtoFzwiA5zXdu11QfPJhLMivMVsNt2sf7daBjJP6arFl+hJWWEGE/L
 oYfEuiOajFvp2d8H/VXnV43aJHtWHvUis1CqB2dFvTFWkPyWuFpmIwoJWydD/e1s
 vjcbq4HdwTwG6Io4DkbmL1YLTONW/Z83FViIHCSDu4TRxQnDrY52WHetbmoE6ARi
 EtQfUlU4i39FydZUnMIEmmY2lyU74YeGd3cqqDsYO2j30xJ0tpcih815cXBL9twC
 jZZqlk6NNeJ0zLAHTXe6azEbg/jv7TBL0wjcDDgmHUDPZkFDMD6gn8okn26Yi3oY
 qd5/HCU2oj8RP9OaHf7kkWo0cg1bwq2ygoRkWcEIwCcUyq3utJJ+8RMI8jUkC3is
 siiPPSbEEQW02fl60KRJZeiyVY2Em2fDYR0gVrj4czcIr0HALtKgJ+u5jDYqqtj3
 uzGT0YcXeAFYRJgo8yjOnzUz7sM2Tgld4qDI59FegrUilsoW/fwi3uEU32Vzda1P
 4aLuUtbkeQfnaLMJCuE2
 =hRJW
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
Included changes:
- code restyling and beautification
- use int kernel types instead of C99
- update kereldoc
- prevent potential hlist double deletion of VLAN objects
- fix gw bandwidth calculation
- convert list to hlist when needed
- add lockdep_asserts calls in function with lock requirements
  described in kerneldoc
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:20:38 -07:00
David S. Miller b01d04aa51 rds: Fix improper gfp_t usage.
>> net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types)
   net/rds/ib_recv.c:382:28:    expected int [signed] can_wait
   net/rds/ib_recv.c:382:28:    got restricted gfp_t
   net/rds/ib_recv.c:828:23: sparse: cast to restricted __le64

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 15:54:25 -07:00
huaibin Wang d4257295ba ip6_gre: release cached dst on tunnel removal
When a tunnel is deleted, the cached dst entry should be released.

This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
  unregister_netdevice: waiting for lo to become free. Usage count = 3

CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 14:33:00 -07:00
WANG Cong e252b3d1a1 route: fix a use-after-free
This patch fixes the following crash:

 general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc7+ #166
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff88010656d280 ti: ffff880106570000 task.ti: ffff880106570000
 RIP: 0010:[<ffffffff8182f91b>]  [<ffffffff8182f91b>] dst_destroy+0xa6/0xef
 RSP: 0018:ffff880107603e38  EFLAGS: 00010202
 RAX: 0000000000000001 RBX: ffff8800d225a000 RCX: ffffffff82250fd0
 RDX: 0000000000000001 RSI: ffffffff82250fd0 RDI: 6b6b6b6b6b6b6b6b
 RBP: ffff880107603e58 R08: 0000000000000001 R09: 0000000000000001
 R10: 000000000000b530 R11: ffff880107609000 R12: 0000000000000000
 R13: ffffffff82343c40 R14: 0000000000000000 R15: ffffffff8182fb4f
 FS:  0000000000000000(0000) GS:ffff880107600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 00007fcabd9d3000 CR3: 00000000d7279000 CR4: 00000000000006e0
 Stack:
  ffffffff82250fd0 ffff8801077d6f00 ffffffff82253c40 ffff8800d225a000
  ffff880107603e68 ffffffff8182fb5d ffff880107603f08 ffffffff810d795e
  ffffffff810d7648 ffff880106574000 ffff88010656d280 ffff88010656d280
 Call Trace:
  <IRQ>
  [<ffffffff8182fb5d>] dst_destroy_rcu+0xe/0x1d
  [<ffffffff810d795e>] rcu_process_callbacks+0x618/0x7eb
  [<ffffffff810d7648>] ? rcu_process_callbacks+0x302/0x7eb
  [<ffffffff8182fb4f>] ? dst_gc_task+0x1eb/0x1eb
  [<ffffffff8107e11b>] __do_softirq+0x178/0x39f
  [<ffffffff8107e52e>] irq_exit+0x41/0x95
  [<ffffffff81a4f215>] smp_apic_timer_interrupt+0x34/0x40
  [<ffffffff81a4d5cd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff8100b968>] ? default_idle+0x21/0x32
  [<ffffffff8100b966>] ? default_idle+0x1f/0x32
  [<ffffffff8100bf19>] arch_cpu_idle+0xf/0x11
  [<ffffffff810b0bc7>] default_idle_call+0x1f/0x21
  [<ffffffff810b0dce>] cpu_startup_entry+0x1ad/0x273
  [<ffffffff8102fe67>] start_secondary+0x135/0x156

dst is freed right before lwtstate_put(), this is not correct...

Fixes: 61adedf3e3 ("route: move lwtunnel state to dst_entry")
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 14:29:19 -07:00
Masanari Iida d749916010 net-next: Fix warning while make xmldocs caused by skbuff.c
This patch fix following warnings.

.//net/core/skbuff.c:407: warning: No description found
for parameter 'len'
.//net/core/skbuff.c:407: warning: Excess function parameter
 'length' description in '__netdev_alloc_skb'
.//net/core/skbuff.c:476: warning: No description found
 for parameter 'len'
.//net/core/skbuff.c:476: warning: Excess function parameter
'length' description in '__napi_alloc_skb'

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 14:20:48 -07:00
David S. Miller 94c10f0ea3 ah4: Fix error return in ah_input().
Noticed by Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:38:50 -07:00
Julia Lawall 25105051fd ah6: fix error return code
Return a negative error code on failure.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:37:31 -07:00
santosh.shilimkar@oracle.com ae05368afa RDS: check for valid cm_id before initiating connection
Connection could have been dropped while the route is being resolved
so check for valid cm_id before initiating the connection.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:31 -07:00
Mukesh Kacker 06e8941e22 RDS: return EMSGSIZE for oversize requests before processing/queueing
rds_send_queue_rm() allows for the "current datagram" being queued
to exceed SO_SNDBUF thresholds by checking bytes queued without
counting in length of current datagram. (Since sk_sndbuf is set
to twice requested SO_SNDBUF value as a kernel heuristic this
is usually fine!)

If this "current datagram" squeezing past the threshold is itself
many times the size of the sk_sndbuf threshold itself then even
twice the SO_SNDBUF does not save us and it gets queued but
cannot be transmitted. Threads block and deadlock and device
becomes unusable. The check for this datagram not exceeding
SNDBUF thresholds (EMSGSIZE) is not done on this datagram as
that check is only done if queueing attempt fails.
(Datagrams that follow this datagram fail queueing attempts, go
through the check and eventually trip EMSGSIZE error but zero
length datagrams silently fail!)

This fix moves the check for datagrams exceeding SNDBUF limits
before any processing or queueing is attempted and returns EMSGSIZE
early in the rds_sndmsg() code. This change also ensures that all
datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits
and the large datagrams that exceed those limits do not get to
rds_send_queue_rm() code for processing.

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com dfcec251d2 RDS: make sure rds_send_drop_to properly takes the m_rs_lock
rds_send_drop_to() is used during socket tear down to find all the
messages on the socket and flush them .  It can race with the
acking code unless it takes the m_rs_lock on each and every message.

This plugs a hole where we didn't take m_rs_lock on any message that
didn't have the RDS_MSG_ON_CONN set.  Taking m_rs_lock avoids
double frees and other memory corruptions as the ack code trusts
the message m_rs pointer on a socket that had actually been freed.

We must take m_rs_lock to access m_rs.  Because of lock nesting and
rs access, we also need to acquire rs_lock.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:31 -07:00
Santosh Shilimkar 1c3be624f4 RDS: Don't destroy the rdma id until after we're done using it
During connection resets, we are destroying the rdma id too soon. We can't
destroy it when it is still in use. So lets move rdma_destroy_id() after
we clear the rings.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com 5c240fa2ab RDS: Fix assertion level from fatal to warning
Fix the asserion level since its not fatal and can be hit
in normal execution paths. There is no need to take the
system down.

We keep the WARN_ON() to detect the condition if we get
here with bad pages.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com 3049147ca7 RDS: Make sure we do a signaled send for large-send
WR(Work Requests )always generate a WC(Work Completion) with
signaled send. Default RDS ib code is setup for un-signaled
completion. Since RDS connction is persistent, we can end up
sending the data even after large-send when the remote end is
not active(for any reason).

By doing  a signaled send at least once per large-send,
we can at least detect the problem in work completion
handler there by avoiding sending more data to
inactive remote.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com 4f73113c63 RDS: Mark message mapped before transmit
rds_send_xmit() marks the rds message map flag after
xmit_[rdma/atomic]() which is clearly wrong.  We need
to maintain the ownership between transport and rds.

Also take care of error path.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com 0df5f9a68a RDS: add a sock_destruct callback debug aid
This helps to detect the accidental processes/apps trying to destroy
the RDS socket which they are sharing with other processes/apps.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com 0c48424021 RDS: check for congestion updates during rds_send_xmit
Ensure we don't keep sending the data if the link is congested.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com 73ce4317bf RDS: make sure we post recv buffers
If we get an ENOMEM during rds_ib_recv_refill, we might never come
back and refill again later. Patch makes sure to kick krdsd into
helping out.

To achieve this we add RDS_RECV_REFILL flag and update in the refill
path based on that so that at least some therad will keep posting
receive buffers.

Since krdsd and softirq both might race for refill, we decide to
schedule on work queue based on ring_low instead of ring_empty.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com e1f475a738 RDS: don't update ip address tables if the address hasn't changed
If the ip address tables hasn't changed, there is no need to remove
them only to be added back again.

Lets fix it.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:29 -07:00
santosh.shilimkar@oracle.com 1bc7b863f2 RDS: destroy the ib state earlier during shutdown
Destroy ib state early during shutdown. Otherwise we can get callbacks
after the QP isn't really able to handle them.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:29 -07:00
santosh.shilimkar@oracle.com 43962dd7ee RDS: always free recv frag as we free its ring entry
We were still seeing rare occurrences of the WARN_ON(recv->r_frag) which
indicates that the recv refill path was finding allocated frags in ring
entries that were marked free. These were usually followed by OOM crashes.
They only seem to be occurring in the presence of completion errors and
connection resets.

This patch ensures that we free the frag as we mark the ring entry free.
This should stop the refill path from finding allocated frags in ring
entries that were marked free.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:29 -07:00
santosh.shilimkar@oracle.com 1d2e3f396c RDS: restore return value in rds_cmsg_rdma_args()
In rds_cmsg_rdma_args() 'ret' is used by rds_pin_pages() which returns
number of pinned pages on success. And the same value is returned to the
caller of rds_cmsg_rdma_args() on success which is not intended.

Commit f4a3fc03c1 ("RDS: Clean up error handling in rds_cmsg_rdma_args")
removed the 'ret = 0' line which broke RDS RDMA mode.

Fix it by restoring the return value on rds_pin_pages() success
keeping the clean-up in place.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 13:35:29 -07:00
Eric Dumazet 43e122b014 tcp: refine pacing rate determination
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.

At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.

We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :

- After cwnd reduction, it is safer to ramp up more slowly,
  as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
  cwnd every other RTT.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:33:54 -07:00
David Ahern 4ec3b28c27 xfrm: Use VRF master index if output device is enslaved
Directs route lookups to VRF table. Compiles out if NET_VRF is not
enabled. With this patch able to successfully bring up ipsec tunnels
in VRFs, even with duplicate network configuration.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:25:09 -07:00
Eric Dumazet 6f021c62d6 tcp: fix slow start after idle vs TSO/GSO
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.

With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.

Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.

As Neal pointed out, we also need to perform the check
if/when receive window opens.

Tested:

Following packetdrill test demonstrates the problem
// Test of slow start after idle

`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0    accept(3, ..., ...) = 4
+0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0

+0    write(4, ..., 26000) = 26000
+0    > . 1:5001(5000) ack 1
+0    > . 5001:10001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10 }%

+.100 < . 1:1(0) ack 10001 win 511
+0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0    > . 10001:20001(10000) ack 1
+0    > P. 20001:26001(6000) ack 1

+.100 < . 1:1(0) ack 26001 win 511
+0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%

+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0    > . 26001:31001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0    > . 31001:36001(5000) ack 1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:22:50 -07:00
Marek Lindner 854d2a63de batman-adv: beautify supported routing algorithm list
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:24 +02:00
Sven Eckelmann 87b40f534d batman-adv: Fix conditional statements indentation
commit 29b9256e6631 ("batman-adv: consider outgoing interface in OGM
sending") incorrectly indented the interface check code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:24 +02:00
Sven Eckelmann 5274cd68d7 batman-adv: Add lockdep_asserts for documented external locks
Some functions already have documentation about locks they require inside
their kerneldoc header. These can be directly tested during runtime using
the lockdep asserts.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:23 +02:00
Sven Eckelmann 2c72d655b0 batman-adv: Annotate deleting functions with external lock via lockdep
Functions which use (h)list_del* are requiring correct locking when they
operate on global lists. Most of the time the search in the list and the
delete are done in the same function. All other cases should have it
visible that they require a special lock to avoid race conditions.

Lockdep asserts can be used to check these problem during runtime when the
lockdep functionality is enabled.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:23 +02:00
Marek Lindner 7c26a53ba5 batman-adv: convert bat_priv->tt.req_list to hlist
Since the list's tail is never accessed using a double linked list head
wastes memory.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:22 +02:00
Sven Eckelmann 0b8336f5fc batman-adv: Fix gw_bandwidth calculation on 32 bit systems
The TVLV for the gw_bandwidth stores everything as u32. But the
gw_bandwidth reads the signed long which limits the maximum value to
(2 ** 31 - 1) on systems with 4 byte long. Also the input value is always
converted from either Mibit/s or Kibit/s to 100Kibit/s. This reduces the
values even further when the user sets it via the default unit Kibit/s. It
may even cause an integer overflow and end up with a value the user never
intended.

Instead read the values as u64, check for possible overflows, do the unit
adjustments and then reduce the size to u32.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:22 +02:00
Sven Eckelmann 77927b7d9d batman-adv: Return EINVAL on invalid gw_bandwidth change
Invalid speed settings by the user are currently acknowledged as correct
but not stored. Instead the return of the store operation of the file
"gw_bandwidth" should indicate that the given value is not acceptable.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:21 +02:00
Marek Lindner a121048a89 batman-adv: prevent potential hlist double deletion
The hlist_del_rcu() call in batadv_tt_global_size_mod() does not check
if the element still is part of the list prior to deletion. The atomic
list counter should prevent the worst but converting to
hlist_del_init_rcu() ensures the element can't be deleted more than
once.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:21 +02:00
Marek Lindner d0fa4f3f5b batman-adv: convert orig_node->vlan_list to hlist
Since the list's tail is never accessed using a double linked list head
wastes memory.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:20 +02:00
Sven Eckelmann 2bdd1888f1 batman-adv: Remove batadv_ types forward declarations
main.h is included in every file and is the only way to access types.h.
This makes forward declarations for all types defined in types.h
unnecessary.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:20 +02:00
Marek Lindner 383b863620 batman-adv: rename batadv_new_tt_req_node to batadv_tt_req_node_new
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:19 +02:00
Marek Lindner 433ff98f3f batman-adv: update kernel doc of batadv_tt_global_del_orig_entry()
The updated kernel doc & additional comment shall prevent accidental
copy & paste errors or calling the function without the required
precautions.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:19 +02:00
Sven Eckelmann 4f248cff9e batman-adv: Remove multiple assignment per line
The Linux CodingStyle disallows multiple assignments in a single line.
(see chapter 1)

Reported-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:18 +02:00
Sven Eckelmann 34473822da batman-adv: Fix kerneldoc over 80 column lines
Kerneldoc required single line documentation in the past (before 2009).
Therefore, the 80 columns limit per line check of checkpatch was disabled
for kerneldoc. But kerneldoc is not excluded anymore from it and checkpatch
now enabled the check again.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:18 +02:00
Sven Eckelmann 6b5e971a28 batman-adv: Replace C99 int types with kernel type
(s|u)(8|16|32|64) are the preferred types in the kernel. The use of the
standard C99 types u?int(8|16|32|64)_t are objected by some people and even
checkpatch now warns about using them.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25 00:12:17 +02:00
David Ahern ba51b6be38 net: Fix RCU splat in af_key
Hit the following splat testing VRF change for ipsec:

[  113.475692] ===============================
[  113.476194] [ INFO: suspicious RCU usage. ]
[  113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED Not tainted
[  113.477545] -------------------------------
[  113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 Illegal context switch in RCU read-side critical section!
[  113.479288]
[  113.479288] other info that might help us debug this:
[  113.479288]
[  113.480207]
[  113.480207] rcu_scheduler_active = 1, debug_locks = 1
[  113.480931] 2 locks held by setkey/6829:
[  113.481371]  #0:  (&net->xfrm.xfrm_cfg_mutex){+.+.+.}, at: [<ffffffff814e9887>] pfkey_sendmsg+0xfb/0x213
[  113.482509]  #1:  (rcu_read_lock){......}, at: [<ffffffff814e767f>] rcu_read_lock+0x0/0x6e
[  113.483509]
[  113.483509] stack backtrace:
[  113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED
[  113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[  113.486845]  0000000000000001 ffff88001d4c7a98 ffffffff81518af2 ffffffff81086962
[  113.487732]  ffff88001d538480 ffff88001d4c7ac8 ffffffff8107ae75 ffffffff8180a154
[  113.488628]  0000000000000b30 0000000000000000 00000000000000d0 ffff88001d4c7ad8
[  113.489525] Call Trace:
[  113.489813]  [<ffffffff81518af2>] dump_stack+0x4c/0x65
[  113.490389]  [<ffffffff81086962>] ? console_unlock+0x3d6/0x405
[  113.491039]  [<ffffffff8107ae75>] lockdep_rcu_suspicious+0xfa/0x103
[  113.491735]  [<ffffffff81064032>] rcu_preempt_sleep_check+0x45/0x47
[  113.492442]  [<ffffffff8106404d>] ___might_sleep+0x19/0x1c8
[  113.493077]  [<ffffffff81064268>] __might_sleep+0x6c/0x82
[  113.493681]  [<ffffffff81133190>] cache_alloc_debugcheck_before.isra.50+0x1d/0x24
[  113.494508]  [<ffffffff81134876>] kmem_cache_alloc+0x31/0x18f
[  113.495149]  [<ffffffff814012b5>] skb_clone+0x64/0x80
[  113.495712]  [<ffffffff814e6f71>] pfkey_broadcast_one+0x3d/0xff
[  113.496380]  [<ffffffff814e7b84>] pfkey_broadcast+0xb5/0x11e
[  113.497024]  [<ffffffff814e82d1>] pfkey_register+0x191/0x1b1
[  113.497653]  [<ffffffff814e9770>] pfkey_process+0x162/0x17e
[  113.498274]  [<ffffffff814e9895>] pfkey_sendmsg+0x109/0x213

In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes
the RCU lock.

Since pfkey_broadcast takes the RCU lock the allocation argument is
pointless since GFP_ATOMIC must be used between the rcu_read_{,un}lock.
The one call outside of rcu can be done with GFP_KERNEL.

Fixes: 7f6b9dbd5a ("af_key: locking change")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-24 14:48:10 -07:00
Tom Herbert 92b78aff85 ila: Precompute checksum difference for translations
In the ILA build state for LWT compute the checksum difference to apply
to transport checksums that include the IPv6 pseudo header. The
difference is between the route destination (from fib6_config) and the
locator to write.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-24 10:34:40 -07:00
Tom Herbert 127eb7cd3c lwt: Add cfg argument to build_state
Add cfg and family arguments to lwt build state functions. cfg is a void
pointer and will either be a pointer to a fib_config or fib6_config
structure. The family parameter indicates which one (either AF_INET
or AF_INET6).

LWT encpasulation implementation may use the fib configuration to build
the LWT state.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-24 10:34:40 -07:00
David S. Miller d9893d1351 NFC 4.3 pull request
This is the NFC pull request for 4.3.
 With this one we have:
 
 - A new driver for Samsung's S3FWRN5 NFC chipset. In order to
   properly support this driver, a few NCI core routines needed
   to be exported. Future drivers like Intel's Fields Peak will
   benefit from this.
 
 - SPI support as a physical transport for STM st21nfcb.
 
 - An additional netlink API for sending replies back to userspace
   from vendor commands.
 
 - 2 small fixes for TI's trf7970a
 
 - A few st-nci fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV1l7oAAoJEIqAPN1PVmxK7DwP+QF3A6wCBzaNQKcla6LOl+Ru
 lGquPpFyihlDT/916IP7MnMNZYOP3ENdGll5lKts2yKxuty327Bb2UWNkaFP63Ei
 +zZwhoZXwh6dbK35kwnd87Cwgn0E8vTF+zHhC2MmP8uGDIgOuevb//2GBDayG88T
 rw4QMZsunT4o9x/bNK1uTlYaKPDs5pxXYFYUPXOQ2F5GqpUVFag3pLbZcJhpSZg7
 9Y1KNLxwi/1rwO90JTXH9CQ+oWgVcH86nIlzqGznxgNdOoCF/V/0hdGlTzj8sE6T
 A3a0Qy1gaWQw9+9QoqE7YWu6JfEIgEhRgFx8dm4SsmUbrnlmgYBavEFeG7++1AG5
 QByWh/h2po0MysNRCfhey2EExlZgdvc1WyLQlS+0w+aWmM5MYq+J4lx+sqXdUDdu
 MZyNDytRfGgRimBPSxyjuHtzrBWR8MyenjsbpraNoVDlFD8wVnGa5OcAv2gi7dkD
 wloOSZj5jJq9WoMeGWEwp2TsQMySJDydBSTvgqovlk2K8gmeY69g2YUUmwR2Truf
 fulBsO8upet/v2cRbCepI2X4NvS37wZBRcJtPpGcCoXUagA1vWH8eBKfjoduGERt
 vc5c9DYjSA0VEJ3kzu3Atro/oNrZDDqvX1wEn+64fk1B8v53Lvf2X0ESQQUcfWpz
 k7hPYzOt2IOA7d41CY59
 =KMsn
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.3 pull request

This is the NFC pull request for 4.3.
With this one we have:

- A new driver for Samsung's S3FWRN5 NFC chipset. In order to
  properly support this driver, a few NCI core routines needed
  to be exported. Future drivers like Intel's Fields Peak will
  benefit from this.

- SPI support as a physical transport for STM st21nfcb.

- An additional netlink API for sending replies back to userspace
  from vendor commands.

- 2 small fixes for TI's trf7970a

- A few st-nci fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 20:42:57 -07:00
Jon Paul Maloy 2be80c2d87 tipc: fix stale link problem during synchronization
Recent changes to the link synchronization means that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has lead to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.

Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen at rare occasions.

The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.

However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trig a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:14:45 -07:00
Jon Paul Maloy 5ae2f8e685 tipc: interrupt link synchronization when a link goes down
When we introduced the new link failover/synch mechanism
in commit 6e498158a8
("tipc: move link synch and failover to link aggregation level"),
we missed the case when the non-tunnel link goes down during the link
synchronization period. In this case the tunnel link will remain in
state LINK_SYNCHING, something leading to unpredictable behavior when
the failover procedure is initiated.

In this commit, we ensure that the node and remaining link goes
back to regular communication state (SELF_UP_PEER_UP/LINK_ESTABLISHED)
when one of the parallel links goes down. We also ensure that we don't
re-enter synch mode if subsequent SYNCH packets arrive on the remaining
link.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:14:45 -07:00
Jon Paul Maloy 17b2063077 tipc: eliminate risk of premature link setup during failover
When a link goes down, and there is still a working link towards its
destination node, a failover is initiated, and the failed link is not
allowed to re-establish until that procedure is finished. To ensure
this, the concerned link endpoints are set to state LINK_FAILINGOVER,
and the node endpoints to NODE_FAILINGOVER during the failover period.

However, if the link reset is due to a disabled bearer, the corres-
ponding link endpoint is deleted, and only the node endpoint knows
about the ongoing failover. Now, if the disabled bearer is re-enabled
during the failover period, the discovery mechanism may create a new
link endpoint that is ready to be established, despite that this is not
permitted. This situation may cause both the ongoing failover and any
subsequent link synchronization to fail.

In this commit, we ensure that a newly created link goes directly to
state LINK_FAILINGOVER if the corresponding node state is
NODE_FAILINGOVER. This eliminates the problem described above.

Furthermore, we tighten the criteria for which packets are allowed
to end a failover state in the function tipc_node_check_state().
By checking that the receiving link is up and running, instead of just
checking that it is not in failover mode, we eliminate the risk that
protocol packets from the re-created link may cause the failover to
be prematurely terminated.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:14:45 -07:00
Ken-ichirou MATSUZAWA c953e23936 netlink: mmap: fix tx type check
I can't send netlink message via mmaped netlink socket since

    commit: a8866ff6a5
    netlink: make the check for "send from tx_ring" deterministic

msg->msg_iter.type is set to WRITE (1) at

    SYSCALL_DEFINE6(sendto, ...
        import_single_range(WRITE, ...
            iov_iter_init(1, WRITE, ...

call path, so that we need to check the type by iter_is_iovec()
to accept the WRITE.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:04:46 -07:00
Tom Herbert 270136613b fou: Do WARN_ON_ONCE in gue_gro_receive for bad proto callbacks
Do WARN_ON_ONCE instead of WARN_ON in gue_gro_receive when the offload
callcaks are bad (either don't exist or gro_receive is not specified).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 15:59:56 -07:00
Tom Herbert b7fe10e5eb gro: Fix remcsum offload to deal with frags in GRO
The remote checksum offload GRO did not consider the case that frag0
might be in use. This patch fixes that by accessing headers using the
skb_gro functions and not saving offsets relative to skb->head.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 15:59:56 -07:00
Linus Torvalds c4c53bad40 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 9p regression fix from Al Viro:
 "Fix for breakage introduced when switching p9_client_{read,write}() to
  struct iov_iter * (went into 4.1)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: ensure err is initialized to 0 in p9_client_read/write
2015-08-22 20:22:11 -07:00
Vincent Bernat 999b8b88c6 9p: ensure err is initialized to 0 in p9_client_read/write
Some use of those functions were providing unitialized values to those
functions. Notably, when reading 0 bytes from an empty file on a 9P
filesystem, the return code of read() was not 0.

Tested with this simple program:

    #include <assert.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, const char **argv)
    {
        assert(argc == 2);
        char buffer[256];
        int fd = open(argv[1], O_RDONLY|O_NOCTTY);
        assert(fd >= 0);
        assert(read(fd, buffer, 0) == 0);
        return 0;
    }

Cc: stable@vger.kernel.org # v4.1
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-22 21:35:02 -04:00
Michal Hocko 2f064f3485 mm: make page pfmemalloc check more robust
Commit c48a11c7ad ("netvm: propagate page->pfmemalloc to skb") added
checks for page->pfmemalloc to __skb_fill_page_desc():

        if (page->pfmemalloc && !page->mapping)
                skb->pfmemalloc = true;

It assumes page->mapping == NULL implies that page->pfmemalloc can be
trusted.  However, __delete_from_page_cache() can set set page->mapping
to NULL and leave page->index value alone.  Due to being in union, a
non-zero page->index will be interpreted as true page->pfmemalloc.

So the assumption is invalid if the networking code can see such a page.
And it seems it can.  We have encountered this with a NFS over loopback
setup when such a page is attached to a new skbuf.  There is no copying
going on in this case so the page confuses __skb_fill_page_desc which
interprets the index as pfmemalloc flag and the network stack drops
packets that have been allocated using the reserves unless they are to
be queued on sockets handling the swapping which is the case here and
that leads to hangs when the nfs client waits for a response from the
server which has been dropped and thus never arrive.

The struct page is already heavily packed so rather than finding another
hole to put it in, let's do a trick instead.  We can reuse the index
again but define it to an impossible value (-1UL).  This is the page
index so it should never see the value that large.  Replace all direct
users of page->pfmemalloc by page_is_pfmemalloc which will hide this
nastiness from unspoiled eyes.

The information will get lost if somebody wants to use page->index
obviously but that was the case before and the original code expected
that the information should be persisted somewhere else if that is
really needed (e.g.  what SLAB and SLUB do).

[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.com>
Debugged-by: Jiri Bohac <jbohac@suse.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>	[3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-21 14:30:10 -07:00
Pablo Neira Ayuso 116984a316 netfilter: xt_TEE: use IS_ENABLED(CONFIG_NF_DUP_IPV6)
Instead of IS_ENABLED(CONFIG_IPV6), otherwise we hit:

et/built-in.o: In function `tee_tg6':
>> xt_TEE.c:(.text+0x6cd8c): undefined reference to `nf_dup_ipv6'

when:

 CONFIG_IPV6=y
 CONFIG_NF_DUP_IPV4=y
 # CONFIG_NF_DUP_IPV6 is not set
 CONFIG_NETFILTER_XT_TARGET_TEE=y

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-21 22:08:41 +02:00
Pablo Neira Ayuso 59e26423e0 netfilter: nf_dup: fix sparse warnings
>> net/ipv4/netfilter/nft_dup_ipv4.c:29:37: sparse: incorrect type in initializer (different base types)
   net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    expected restricted __be32 [user type] s_addr
   net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    got unsigned int [unsigned] <noident>

>> net/ipv6/netfilter/nf_dup_ipv6.c:48:23: sparse: incorrect type in assignment (different base types)
   net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    expected restricted __be32 [addressable] [assigned] [usertype] flowlabel
   net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    got int

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-21 21:29:19 +02:00
David S. Miller dc25b25897 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/qmi_wwan.c

Overlapping additions of new device IDs to qmi_wwan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21 11:44:04 -07:00
Julian Anastasov d33288172e ipvs: add more mcast parameters for the sync daemon
- mcast_group: configure the multicast address, now IPv6
is supported too

- mcast_port: configure the multicast port

- mcast_ttl: configure the multicast TTL/HOP_LIMIT

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21 09:10:11 -07:00
Julian Anastasov e4ff675130 ipvs: add sync_maxlen parameter for the sync daemon
Allow setups with large MTU to send large sync packets by
adding sync_maxlen parameter. The default value is now based
on MTU but no more than 1500 for compatibility reasons.

To avoid problems if MTU changes allow fragmentation by
sending packets with DF=0. Problem reported by Dan Carpenter.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21 09:10:03 -07:00
Julian Anastasov e0b26cc997 ipvs: call rtnl_lock early
When the sync damon is started we need to hold rtnl
lock while calling ip_mc_join_group. Currently, we have
a wrong locking order because the correct one is
rtnl_lock->__ip_vs_mutex. It is implied from the usage
of __ip_vs_mutex in ip_vs_dst_event() which is called
under rtnl lock during NETDEV_* notifications.

Fix the problem by calling rtnl_lock early only for the
start_sync_thread call. As a bonus this fixes the usage
__dev_get_by_name which was not called under rtnl lock.

This patch actually extends and depends on commit 54ff9ef36b
("ipv4, ipv6: kill ip_mc_{join, leave}_group and
ipv6_sock_mc_{join, drop}").

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21 09:09:09 -07:00
Raducu Deaconu eefa32d3f3 ipvs: Add ovf scheduler
The weighted overflow scheduling algorithm directs network connections
to the server with the highest weight that is currently available
and overflows to the next when active connections exceed the node's weight.

Signed-off-by: Raducu Deaconu <rhadoo.io88@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21 09:08:39 -07:00
David S. Miller a9e01ed986 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

This is second pull request includes the conflict resolution patch that
resulted from the updates that we got for the conntrack template through
kmalloc. No changes with regards to the previously sent 15 patches.

The following patchset contains Netfilter updates for your net-next tree, they
are:

1) Rework the existing nf_tables counter expression to make it per-cpu.

2) Prepare and factor out common packet duplication code from the TEE target so
   it can be reused from the new dup expression.

3) Add the new dup expression for the nf_tables IPv4 and IPv6 families.

4) Convert the nf_tables limit expression to use a token-based approach with
   64-bits precision.

5) Enhance the nf_tables limit expression to support limiting at packet byte.
   This comes after several preparation patches.

6) Add a burst parameter to indicate the amount of packets or bytes that can
   exceed the limiting.

7) Add netns support to nfacct, from Andreas Schultz.

8) Pass the nf_conn_zone structure instead of the zone ID in nf_tables to allow
   accessing more zone specific information, from Daniel Borkmann.

9) Allow to define zone per-direction to support netns containers with
   overlapping network addressing, also from Daniel.

10) Extend the CT target to allow setting the zone based on the skb->mark as a
   way to support simple mappings from iptables, also from Daniel.

11) Make the nf_tables payload expression aware of the fact that VLAN offload
    may have removed a vlan header, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 22:18:45 -07:00
Pablo Neira Ayuso 81bf1c64e7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflicts with conntrack template fixes.

Conflicts:
	net/netfilter/nf_conntrack_core.c
	net/netfilter/nf_synproxy_core.c
	net/netfilter/xt_CT.c

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-21 06:09:05 +02:00
Jiri Benc 32a2b002ce ipv6: route: per route IP tunnel metadata via lightweight tunnel
Allow specification of per route IP tunnel instructions also for IPv6.
This complements commit 3093fbe7ff ("route: Per route IP tunnel metadata
via lightweight tunnel").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
CC: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:38 -07:00
Jiri Benc 904af04d30 ipv6: route: extend flow representation with tunnel key
Use flowi_tunnel in flowi6 similarly to what is done with IPv4.
This complements commit 1b7179d3ad ("route: Extend flow representation
with tunnel key").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc ab450605b3 ipv6: ndisc: inherit metadata dst when creating ndisc requests
If output device wants to see the dst, inherit the dst of the original skb
in the ndisc request.

This is an IPv6 counterpart of commit 0accfc268f ("arp: Inherit metadata
dst when creating ARP requests").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc 06e9d040ba ipv6: drop metadata dst in ip6_route_input
The fix in commit 48fb6b5545 is incomplete, as now ip6_route_input can be
called with non-NULL dst if it's a metadata dst and the reference is leaked.
Drop the reference.

Fixes: 48fb6b5545 ("ipv6: fix crash over flow-based vxlan device")
Fixes: ee122c79d4 ("vxlan: Flow based tunneling")
CC: Wei-Chun Chao <weichunc@plumgrid.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc 61adedf3e3 route: move lwtunnel state to dst_entry
Currently, the lwtunnel state resides in per-protocol data. This is
a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
The xmit function of the tunnel does not know whether the packet has been
routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
lwtstate data to dst_entry makes such inter-protocol tunneling possible.

As a bonus, this brings a nice diffstat.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc 7c383fb225 ip_tunnels: use tos and ttl fields also for IPv6
Rename the ipv4_tos and ipv4_ttl fields to just 'tos' and 'ttl', as they'll
be used with IPv6 tunnels, too.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc c1ea5d672a ip_tunnels: add IPv6 addresses to ip_tunnel_key
Add the IPv6 addresses as an union with IPv4 ones. When using IPv4, the
newly introduced padding after the IPv4 addresses needs to be zeroed out.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
David Ahern 4c9bcd1179 net: Fix nexthop lookups
Andreas reported breakage adding routes with local nexthops:
$ ip route show table main
...
172.28.0.0/24 dev vnf-xe1p0  proto kernel  scope link  src 172.28.0.16

$ ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
RTNETLINK answers: Resource temporarily unavailable

3bfd847203 changed the lookup to use the passed in table but for cases like
this the nexthop is in the local table rather than the passed in table.

Fixes: 3bfd847203 ("net: Use passed in table for nexthop lookups")
Reported-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 14:42:39 -07:00
Scott Feldman eb4cb85180 bridge: fix netlink max attr size
.maxtype should match .policy.  Probably just been getting lucky here
because IFLA_BRPORT_MAX > IFLA_BR_MAX.

Fixes: 13323516 ("bridge: implement rtnl_link_ops->changelink")
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 14:36:25 -07:00
Ying Xue e01286ef03 ipv4: Make fib_encap_match static
Make fib_encap_match() static as it isn't used outside the file.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 14:12:23 -07:00
Christophe Ricard 29e76924cf nfc: netlink: Add capability to reply to vendor_cmd with data
A proprietary vendor command may send back useful data to the user
application.

For example, the field level applied on the NFC router antenna.

Still based on net/wireless/nl80211.c implementation,
add nfc_vendor_cmd_alloc_reply_skb and nfc_vendor_cmd_reply in
order to send back over netlink data generated by a proprietary
command.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-08-20 22:00:11 +02:00
Christophe Ricard 5a9e0ffc0f nfc: nci: hci: Add check on skb nci_hci_send_cmd parameter
skb can be NULL and may lead to a NULL pointer error.

Add a check condition before setting HCI rx buffer.

Cc: stable@vger.kernel.org
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-08-20 21:56:36 +02:00
Robert Baldyga 025a0cb838 NFC: nci: export nci_core_reset and nci_core_init
Some drivers needs to have ability to reinit NCI core, for example
after updating firmware in setup() of post_setup() callback. This
patch makes nci_core_reset() and nci_core_init() functions public,
to make it possible.

Signed-off-by: Robert Baldyga <r.baldyga@samsung.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-08-20 21:23:11 +02:00
Robert Baldyga fdf79bd488 NFC: nci: Add post_setup handler
Some drivers require non-standard configuration after NCI_CORE_INIT
request, because they need to know ndev->manufact_specific_info or
ndev->manufact_id. This patch adds post_setup handler allowing to do
such custom configuration.

Signed-off-by: Robert Baldyga <r.baldyga@samsung.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-08-20 21:22:34 +02:00
Nikolay Aleksandrov 18041e3174 vrf: vrf_master_ifindex_rcu is not always called with rcu read lock
While running net-next I hit this:
[  634.073119] ===============================
[  634.073150] [ INFO: suspicious RCU usage. ]
[  634.073182] 4.2.0-rc6+ #45 Not tainted
[  634.073213] -------------------------------
[  634.073244] include/net/vrf.h:38 suspicious rcu_dereference_check()
usage!
[  634.073274]
               other info that might help us debug this:

[  634.073307]
               rcu_scheduler_active = 1, debug_locks = 1
[  634.073338] 2 locks held by swapper/0/0:
[  634.073369]  #0:  (((&n->timer))){+.-...}, at: [<ffffffff8112bc35>]
call_timer_fn+0x5/0x480
[  634.073412]  #1:  (slock-AF_INET){+.-...}, at: [<ffffffff8174f0f5>]
icmp_send+0x155/0x5f0
[  634.073450]
               stack backtrace:
[  634.073483] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6+ #45
[  634.073514] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[  634.073545]  0000000000000000 0593ba8242d9ace4 ffff88002fc03b48
ffffffff81803f1b
[  634.073612]  0000000000000000 ffffffff81e12500 ffff88002fc03b78
ffffffff811003c5
[  634.073642]  0000000000000000 ffff88002ec4e600 ffffffff81f00f80
ffff88002fc03cf0
[  634.073669] Call Trace:
[  634.073694]  <IRQ>  [<ffffffff81803f1b>] dump_stack+0x4c/0x65
[  634.073728]  [<ffffffff811003c5>] lockdep_rcu_suspicious+0xc5/0x100
[  634.073763]  [<ffffffff8174eb56>] icmp_route_lookup+0x176/0x5c0
[  634.073793]  [<ffffffff8174f2fb>] ? icmp_send+0x35b/0x5f0
[  634.073818]  [<ffffffff8174f274>] ? icmp_send+0x2d4/0x5f0
[  634.073844]  [<ffffffff8174f3ce>] icmp_send+0x42e/0x5f0
[  634.073873]  [<ffffffff8170b662>] ipv4_link_failure+0x22/0xa0
[  634.073899]  [<ffffffff8174bdda>] arp_error_report+0x3a/0x80
[  634.073926]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
[  634.073952]  [<ffffffff816d396e>] neigh_invalidate+0x8e/0x110
[  634.073984]  [<ffffffff816d62ae>] neigh_timer_handler+0x1ae/0x290
[  634.074013]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
[  634.074013]  [<ffffffff8112bce3>] call_timer_fn+0xb3/0x480
[  634.074013]  [<ffffffff8112bc35>] ? call_timer_fn+0x5/0x480
[  634.074013]  [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
[  634.074013]  [<ffffffff8112c2bc>] run_timer_softirq+0x20c/0x430
[  634.074013]  [<ffffffff810af50e>] __do_softirq+0xde/0x630
[  634.074013]  [<ffffffff810afc97>] irq_exit+0x117/0x120
[  634.074013]  [<ffffffff81810976>] smp_apic_timer_interrupt+0x46/0x60
[  634.074013]  [<ffffffff8180e950>] apic_timer_interrupt+0x70/0x80
[  634.074013]  <EOI>  [<ffffffff8106b9d6>] ? native_safe_halt+0x6/0x10
[  634.074013]  [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10
[  634.074013]  [<ffffffff81027d43>] default_idle+0x23/0x200
[  634.074013]  [<ffffffff8102852f>] arch_cpu_idle+0xf/0x20
[  634.074013]  [<ffffffff810f89ba>] default_idle_call+0x2a/0x40
[  634.074013]  [<ffffffff810f8dcc>] cpu_startup_entry+0x39c/0x4c0
[  634.074013]  [<ffffffff817f9cad>] rest_init+0x13d/0x150
[  634.074013]  [<ffffffff81f69038>] start_kernel+0x4a8/0x4c9
[  634.074013]  [<ffffffff81f68120>] ?
early_idt_handler_array+0x120/0x120
[  634.074013]  [<ffffffff81f68339>] x86_64_start_reservations+0x2a/0x2c
[  634.074013]  [<ffffffff81f68485>] x86_64_start_kernel+0x14a/0x16d

It would seem vrf_master_ifindex_rcu() can be called without RCU held in
other contexts as well so introduce a new helper which acquires rcu and
returns the ifindex.
Also add curly braces around both the "if" and "else" parts as per the
style guide.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-19 22:13:20 -07:00
Trond Myklebust c2126157ea SUNRPC: Allow sockets to do GFP_NOIO allocations
Follow up to commit c4a7ca7749 ("SUNRPC: Allow waiting on memory
allocation"). Allows the RPC socket code to do non-IO blocking.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19 21:46:15 -05:00
Florian Westphal 8cfd23e674 netfilter: nft_payload: work around vlan header stripping
make payload expression aware of the fact that VLAN offload may have
removed a vlan header.

When we encounter tagged skb, transparently insert the tag into the
register so that vlan header matching can work without userspace being
aware of offload features.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-19 08:39:53 +02:00
Jiri Benc 2d79849903 lwtunnel: ip tunnel: fix multiple routes with different encap
Currently, two routes going through the same tunnel interface are considered
the same even when they are routed to a different host after encapsulation.
This causes all routes added after the first one to have incorrect
encapsulation parameters.

This is nicely visible by doing:

  # ip r a 192.168.1.2/32 dev vxlan0 tunnel dst 10.0.0.2
  # ip r a 192.168.1.3/32 dev vxlan0 tunnel dst 10.0.0.3
  # ip r
  [...]
  192.168.1.2/32 tunnel id 0 src 0.0.0.0 dst 10.0.0.2 [...]
  192.168.1.3/32 tunnel id 0 src 0.0.0.0 dst 10.0.0.2 [...]

Implement the missing comparison function.

Fixes: 3093fbe7ff ("route: Per route IP tunnel metadata via lightweight tunnel")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 19:11:20 -07:00
Jiri Benc df383e6240 lwtunnel: fix memory leak
The built lwtunnel_state struct has to be freed after comparison.

Fixes: 571e722676 ("ipv4: support for fib route lwtunnel encap attributes")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 19:11:19 -07:00
Sven Eckelmann fd7dec25a1 batman-adv: Fix memory leak on tt add with invalid vlan
The object tt_local is allocated with kmalloc and not initialized when the
function batadv_tt_local_add checks for the vlan. But this function can
only cleanup the object when the (not yet initialized) reference counter of
the object is 1. This is unlikely and thus the object would leak when the
vlan could not be found.

Instead the uninitialized object tt_local has to be freed manually and the
pointer has to set to NULL to avoid calling the function which would try to
decrement the reference counter of the not existing object.

CID: 1316518
Fixes: 354136bcc3 ("batman-adv: fix kernel crash due to missing NULL checks")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 19:08:23 -07:00
Andrew Lunn 1e72e6f885 net: dsa: Allow multi hop routes to be expressed
With more than two switches in a hierarchy, it becomes necessary to
describe multi-hop routes between switches. The current binding does
not allow this, although the older platform_data did. Extend the link
property to be a list rather than a single phandle to a remote switch.
It is then possible to express that a port should be used to reach
more than one switch and the switch maybe more than one hop away.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 14:17:21 -07:00
Phil Sutter 348e3435cb net: sched: drop all special handling of tx_queue_len == 0
Those were all workarounds for the formerly double meaning of
tx_queue_len, which broke scheduling algorithms if untreated.

Now that all in-tree drivers have been converted away from setting
tx_queue_len = 0, it should be safe to drop these workarounds for
categorically broken setups.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:08 -07:00
Phil Sutter 906470c19d net: warn if drivers set tx_queue_len = 0
Due to the introduction of IFF_NO_QUEUE, there is a better way for
drivers to indicate that no qdisc should be attached by default. Though,
the old convention can't be dropped since ignoring that setting would
break drivers still using it. Instead, add a warning so out-of-tree
driver maintainers get a chance to adjust their code before we finally
get rid of any special handling of tx_queue_len == 0.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:08 -07:00
Phil Sutter 4676a15207 net: caif: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:07 -07:00
Phil Sutter 9ad09c5c05 net: hsr: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:07 -07:00
Phil Sutter cdf7370391 net: batman-adv: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:07 -07:00
Phil Sutter 0a5f107b67 net: dsa: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:06 -07:00
Phil Sutter 4afbc0db72 net: 6lowpan: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:06 -07:00
Phil Sutter ccecb2a47c net: bridge: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:06 -07:00
Phil Sutter 2e659c0551 net: 8021q: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:06 -07:00
Tom Herbert 65d7ab8de5 net: Identifier Locator Addressing module
Adding new module name ila. This implements ILA translation. Light
weight tunnel redirection is used to perform the translation in
the data path. This is configured by the "ip -6 route" command
using the "encap ila <locator>" option, where <locator> is the
value to set in destination locator of the packet. e.g.

ip -6 route add 3333:0:0:1:5555:0:1:0/128 \
      encap ila 2001:0:0:1 via 2401:db00:20:911a:face:0:25:0

Sets a route where 3333:0:0:1 will be overwritten by
2001:0:0:1 on output.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 21:33:06 -07:00
Tom Herbert abc5d1ff3e net: Add inet_proto_csum_replace_by_diff utility function
This function updates a checksum field value and skb->csum based on
a value which is the difference between the old and new checksum.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 21:33:06 -07:00
Tom Herbert 4b048d6d9d net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool
inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
the checksum field carries a pseudo header. This argument should be a
boolean instead of an int.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 21:33:06 -07:00
Tom Herbert 2536862311 lwt: Add support to redirect dst.input
This patch adds the capability to redirect dst input in the same way
that dst output is redirected by LWT.

Also, save the original dst.input and and dst.out when setting up
lwtunnel redirection. These can be called by the client as a pass-
through.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 21:33:05 -07:00
Daniel Borkmann 5e8018fc61 netfilter: nf_conntrack: add efficient mark to zone mapping
This work adds the possibility of deriving the zone id from the skb->mark
field in a scalable manner. This allows for having only a single template
serving hundreds/thousands of different zones, for example, instead of the
need to have one match for each zone as an extra CT jump target.

Note that we'd need to have this information attached to the template as at
the time when we're trying to lookup a possible ct object, we already need
to know zone information for a possible match when going into
__nf_conntrack_find_get(). This work provides a minimal implementation for
a possible mapping.

In order to not add/expose an extra ct->status bit, the zone structure has
been extended to carry a flag for deriving the mark.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-18 01:24:05 +02:00
Daniel Borkmann deedb59039 netfilter: nf_conntrack: add direction support for zones
This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.

The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.

As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those, that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.

Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.

If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).

Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.

Below a minimal, simplified collision example (script in [2]) with
netperf sessions:

  +--- tenant-1 ---+   mark := 1
  |    netperf     |--+
  +----------------+  |                CT zone := mark [ORIGINAL]
   [ip,sport] := X   +--------------+  +--- gateway ---+
                     | mark routing |--|     SNAT      |-- ... +
                     +--------------+  +---------------+       |
  +--- tenant-2 ---+  |                                     ~~~|~~~
  |    netperf     |--+                +-----------+           |
  +----------------+   mark := 2       | netserver |------ ... +
   [ip,sport] := X                     +-----------+
                                        [ip,port] := Y
On the gateway netns, example:

  iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
  iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully

  iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
  iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

conntrack dump from gateway netns:

  netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns

  tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                        src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                        src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2

Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.

I also did run various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.

  [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
  [2] https://paste.fedoraproject.org/242835/65657871/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-18 01:22:50 +02:00
David Ahern dc028da54e inet: Move VRF table lookup to inlined function
Table lookup compiles out when VRF is not enabled.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 15:58:57 -07:00
David S. Miller 0aa65cc0c2 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-08-16

Here's what's likely the last bluetooth-next pull request for 4.3:

 - 6lowpan/802.15.4 refactoring, cleanups & fixes
 - Document 6lowpan netdev usage in Documentation/networking/6lowpan.txt
 - Support for UART based QCA Bluetooth controllers
 - Power management support for Broeadcom Bluetooth controllers
 - Change LE connection initiation to always use passive scanning first
 - Support for new Silicon Wave USB ID

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 15:41:21 -07:00
David S. Miller 2ea273d76a net: Export bpf_prog_create_from_user().
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:37:06 -07:00
Ian Morris ec120da6f0 ipv6: trivial whitespace fix
Change brace placement to be in line with coding standards

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:34:48 -07:00
David S. Miller c1f066d4ee Included changes:
- avoid integer overflow in GW selection routine
 - prevent race condition by making capability bit changes atomic (use
   clear/set/test_bit)
 - fix synchronization issue in mcast tvlv handler
 - fix crash on double list removal of TT Request objects
 - fix leak by puring packets enqueued for sending upon iface removal
 - ensure network header pointer is set in skb
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVzlUGAAoJEOb/4TMchkvfy9EQAJ0DBKdTqiuWwNQzZ9ghCeLi
 Njgwyq4WdvZXp8vIj9cq0AR1JsFvH9fGp4BlLE8pGsQyxURiWQmR/Jurvb0UGMhM
 SPVvl/b0nnCrRu937UcVEiq6wRrFwEr1fYfL45fY0c1eTkXSu/54iEURDvWlkpyX
 COB4cuydoPY6O2LGiPRsGlOGTYbQ/u89G1HyxFr9Ch2IVRMPVXB6zuI7puuEg2ee
 EwgI2IGO/GHHaScbXQpVy+bL6XL/OcQ4nCrYeaNpwbOKAOfGabNzfVncSCLRKMSW
 jZh5D3cG3AKGUPeooEtd8yGgNG3SSVNTXt7CL+WOM7/2EKvYeNdJFaQzbdZQEtT0
 begOcUJ2K0yas/iExqYXtEPFCWXhjuzrb4u6wu5psGgu3zt1nltpCkplDEMcSEia
 C+dYE5r/JM2S1lsFclnjz+rBXu4ZfCUvV0q+50+lGiWPhfemM2M2XqJNcay0ioSm
 radSwMe549sh6r3ajwdxF5z66tU94PERMMxVxJ9Z3gmHUZ56phxEe6X0xBO91bhW
 aSCcFr6PQMxmfwanLj5Ydxm5MPHdDXF84YPMTFcGb4ya9ZWWURGNq7Sr85Ek7N6W
 5DIA8F8/vsgaXYUn011aBu4q1mm5ZsTs9nR3MAxchAuuu3sTBOkd/BecS4T7CEOE
 IHnbxsJxXReJ/aCQFoks
 =ZEuN
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
Included changes:
- avoid integer overflow in GW selection routine
- prevent race condition by making capability bit changes atomic (use
  clear/set/test_bit)
- fix synchronization issue in mcast tvlv handler
- fix crash on double list removal of TT Request objects
- fix leak by puring packets enqueued for sending upon iface removal
- ensure network header pointer is set in skb
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:31:42 -07:00
Martin KaFai Lau 9c7370a166 ipv6: Fix a potential deadlock when creating pcpu rt
rt6_make_pcpu_route() is called under read_lock(&table->tb6_lock).
rt6_make_pcpu_route() calls ip6_rt_pcpu_alloc(rt) which then
calls dst_alloc().  dst_alloc() _may_ call ip6_dst_gc() which takes
the write_lock(&tabl->tb6_lock).  A visualized version:

read_lock(&table->tb6_lock);
rt6_make_pcpu_route();
=> ip6_rt_pcpu_alloc();
=> dst_alloc();
=> ip6_dst_gc();
=> write_lock(&table->tb6_lock); /* oops */

The fix is to do a read_unlock first before calling ip6_rt_pcpu_alloc().

A reported stack:

[141625.537638] INFO: rcu_sched self-detected stall on CPU { 27}  (t=60000 jiffies g=4159086 c=4159085 q=2139)
[141625.547469] Task dump for CPU 27:
[141625.550881] mtr             R  running task        0 22121  22081 0x00000008
[141625.558069]  0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b
[141625.565641]  ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000
[141625.573220]  ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00
[141625.580803] Call Trace:
[141625.583345]  <IRQ>  [<ffffffff8106e488>] sched_show_task+0xc1/0xc6
[141625.589650]  [<ffffffff810702b0>] dump_cpu_task+0x35/0x39
[141625.595144]  [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c
[141625.601320]  [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4
[141625.607669]  [<ffffffff810940c8>] update_process_times+0x2a/0x4f
[141625.613925]  [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e
[141625.619923]  [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c
[141625.625830]  [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d
[141625.632171]  [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166
[141625.638258]  [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52
[141625.645036]  [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a
[141625.651643]  [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70
[141625.657895]  <EOI>  [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5
[141625.664188]  [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20
[141625.670272]  [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f
[141625.677140]  [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25
[141625.683218]  [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82
[141625.689124]  [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20
[141625.695207]  [<ffffffff813d6058>] fib6_clean_all+0xe/0x10
[141625.700854]  [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8
[141625.706329]  [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9
[141625.711718]  [<ffffffff81346d68>] dst_alloc+0x55/0x159
[141625.717105]  [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63
[141625.723620]  [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8
[141625.729441]  [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13
[141625.735700]  [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf
[141625.741698]  [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17
[141625.748043]  [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a
[141625.754050]  [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9
[141625.761002]  [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c
[141625.766914]  [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17
[141625.773260]  [<ffffffff813d008c>] ip6_route_output+0x7a/0x82
[141625.779177]  [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112
[141625.785437]  [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b
[141625.791604]  [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6
[141625.797423]  [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2
[141625.804464]  [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e
[141625.810028]  [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c
[141625.815588]  [<ffffffff8132be57>] SyS_sendto+0xfe/0x143
[141625.821063]  [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67
[141625.827146]  [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11
[141625.833660]  [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2
[141625.839565]  [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a

Fixes: d52d3997f8 ("pv6: Create percpu rt6_info")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:28:03 -07:00
Martin KaFai Lau a73e419563 ipv6: Add rt6_make_pcpu_route()
It is a prep work for fixing a potential deadlock when creating
a pcpu rt.

The current rt6_get_pcpu_route() will also create a pcpu rt if one does not
exist.  This patch moves the pcpu rt creation logic into another function,
rt6_make_pcpu_route().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:28:03 -07:00