Commit Graph

44469 Commits

Author SHA1 Message Date
Paolo Abeni c915fe13cb udplite: fix NULL pointer dereference
The commit 850cbaddb5 ("udp: use it's own memory accounting schema")
assumes that the socket proto has memory accounting enabled,
but this is not the case for UDPLITE.
Fix it enabling memory accounting for UDPLITE and performing
fwd allocated memory reclaiming on socket shutdown.
UDP and UDPLITE share now the same memory accounting limits.
Also drop the backlog receive operation, since is no more needed.

Fixes: 850cbaddb5 ("udp: use it's own memory accounting schema")
Reported-by: Andrei Vagin <avagin@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:59:38 -05:00
David S. Miller bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Felix Fietkau a786f96da0 mac80211: fix A-MSDU aggregation with fast-xmit + txq
A-MSDU aggregation alters the QoS header after a frame has been
enqueued, so it needs to be ready before enqueue and not overwritten
again afterwards

Fixes: bb42f2d13f ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:37:30 +01:00
Felix Fietkau fff712cbe3 mac80211: remove bogus skb vif assignment
The call to ieee80211_txq_enqueue overwrites the vif pointer with the
codel enqueue time, so setting it just before that call makes no sense.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:37:21 +01:00
Felix Fietkau c1f4c9ede3 mac80211: update A-MPDU flag on tx dequeue
The sequence number counter is used to derive the starting sequence
number. Since that counter is updated on tx dequeue, the A-MPDU flag
needs to be up to date at the tme of dequeue as well.

This patch prevents sending more A-MPDU frames after the session has
been terminated and also ensures that aggregation starts right after the
session has been established

Fixes: bb42f2d13f ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:37:12 +01:00
Pedersen, Thomas 8fdd136f22 cfg80211: add bitrate for 20MHz MCS 9
Some drivers (ath10k) report MCS 9 @ 20MHz, which
technically isn't defined. To get more meaningful value
than 0 out of this however, just extrapolate a bitrate
from ratio of MCS 7 and 9 in channels where it is allowed.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
[add a comment about it in the code]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:34:00 +01:00
Felix Fietkau 6c18a6b4e7 Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE"
This reverts commit c68df2e7be.

__sta_info_recalc_tim turns into a no-op if local->ops->set_tim is not
set. This prevents the beacon TIM bit from being set for all drivers
that do not implement this op (almost all of them), thus thoroughly
essential AP mode powersave functionality.

Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Fixes: c68df2e7be ("mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:32:09 +01:00
Filip Matusiak c8eaf3479e mac80211: Ignore VHT IE from peer with wrong rx_mcs_map
This is a workaround for VHT-enabled STAs which break the spec
and have the VHT-MCS Rx map filled in with value 3 for all eight
spacial streams, an example is AR9462 in AP mode.

As per spec, in section 22.1.1 Introduction to the VHT PHY
A VHT STA shall support at least single spactial stream VHT-MCSs
0 to 7 (transmit and receive) in all supported channel widths.

Some devices in STA mode will get firmware assert when trying to
associate, examples are QCA9377 & QCA6174.

Packet example of broken VHT Cap IE of AR9462:

Tag: VHT Capabilities (IEEE Std 802.11ac/D3.1)
    Tag Number: VHT Capabilities (IEEE Std 802.11ac/D3.1) (191)
    Tag length: 12
    VHT Capabilities Info: 0x00000000
    VHT Supported MCS Set
        Rx MCS Map: 0xffff
            .... .... .... ..11 = Rx 1 SS: Not Supported (0x0003)
            .... .... .... 11.. = Rx 2 SS: Not Supported (0x0003)
            .... .... ..11 .... = Rx 3 SS: Not Supported (0x0003)
            .... .... 11.. .... = Rx 4 SS: Not Supported (0x0003)
            .... ..11 .... .... = Rx 5 SS: Not Supported (0x0003)
            .... 11.. .... .... = Rx 6 SS: Not Supported (0x0003)
            ..11 .... .... .... = Rx 7 SS: Not Supported (0x0003)
            11.. .... .... .... = Rx 8 SS: Not Supported (0x0003)
        ...0 0000 0000 0000 = Rx Highest Long GI Data Rate (in Mb/s, 0 = subfield not in use): 0x0000
        Tx MCS Map: 0xffff
        ...0 0000 0000 0000 = Tx Highest Long GI Data Rate  (in Mb/s, 0 = subfield not in use): 0x0000

Signed-off-by: Filip Matusiak <filip.matusiak@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:18:43 +01:00
Dwip Banerjee 8d8e20e2d7 ipvs: Decrement ttl
We decrement the IP ttl in all the modes in order to prevent infinite
route loops. The changes were done based on Julian Anastasov's
suggestions in a prior thread.

The ttl based check/discard and the actual decrement are done in
__ip_vs_get_out_rt() and in __ip_vs_get_out_rt_v6(), for the IPv6
case. decrement_ttl() implements the actual functionality for the
two cases.

Signed-off-by: Dwip Banerjee <dwip@linux.vnet.ibm.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2016-11-15 09:49:20 +01:00
Gao Feng fe24a0c3a9 ipvs: Use IS_ERR_OR_NULL(svc) instead of IS_ERR(svc) || svc == NULL
This minor refactoring does not change the logic of function
ip_vs_genl_dump_dests.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2016-11-15 09:49:19 +01:00
Linus Torvalds e76d21c40b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
    from Alexander Duyck.

 2) Fix lockdep splats in iwlwifi, from Johannes Berg.

 3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
    is disabled, from Liping Zhang.

 4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.

 5) Disable UFO when path will apply IPSEC tranformations, from Jakub
    Sitnicki.

 6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.

 7) skb_checksum_help() should never actually use the value "0" for the
    resulting checksum, that has a special meaning, use CSUM_MANGLED_0
    instead. From Eric Dumazet.

 8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
    Yuval MIntz.

 9) Fix SCTP reference counting of associations and transports in
    sctp_diag. From Xin Long.

10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
    a previous layer or similar, so explicitly clear the ipv6 control
    block in the skb. From Eli Cooper.

11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
    Cong.

12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad
    Shenai.

13) Fix potential access past the end of the skb page frag array in
    tcp_sendmsg(). From Eric Dumazet.

14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
    from David Ahern.

15) Don't return an error in tcp_sendmsg() if we wronte any bytes
    successfully, from Eric Dumazet.

16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
    but forgot to purge these unlock calls. From Eric Dumazet.

17) Fix memory leak in error path of __genl_register_family(). We leak
    the attrbuf, from WANG Cong.

18) cgroupstats netlink policy table is mis-sized, from WANG Cong.

19) Several XDP bug fixes in mlx5, from Saeed Mahameed.

20) Fix several device refcount leaks in network drivers, from Johan
    Hovold.

21) icmp6_send() should use skb dst device not skb->dev to determine L3
    routing domain. From David Ahern.

22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.

23) We leak new macvlan port in some cases of maclan_common_netlink()
    errors. Fix from Gao Feng.

24) Similar to the icmp6_send() fix, icmp_route_lookup() should
    determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
    Also from David Ahern.

25) Several fixes for route offloading and FIB notification handling in
    mlxsw driver, from Jiri Pirko.

26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.

27) Fix long standing regression in ipv4 redirect handling, wrt.
    validating the new neighbour's reachability. From Stephen Suryaputra
    Lin.

28) If sk_filter() trims the packet excessively, handle it reasonably in
    tcp input instead of exploding. From Eric Dumazet.

29) Fix handling of napi hash state when copying channels in sfc driver,
    from Bert Kenward.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
  mlxsw: spectrum_router: Flush FIB tables during fini
  net: stmmac: Fix lack of link transition for fixed PHYs
  sctp: change sk state only when it has assocs in sctp_shutdown
  bnx2: Wait for in-flight DMA to complete at probe stage
  Revert "bnx2: Reset device during driver initialization"
  ps3_gelic: fix spelling mistake in debug message
  net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
  ibmvnic: Fix size of debugfs name buffer
  ibmvnic: Unmap ibmvnic_statistics structure
  sfc: clear napi_hash state when copying channels
  mlxsw: spectrum_router: Correctly dump neighbour activity
  mlxsw: spectrum: Fix refcount bug on span entries
  bnxt_en: Fix VF virtual link state.
  bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
  Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
  tcp: take care of truncations done by sk_filter()
  ipv4: use new_gw for redirect neigh lookup
  r8152: Fix error path in open function
  net: bpqether.h: remove if_ether.h guard
  net: __skb_flow_dissect() must cap its return value
  ...
2016-11-14 14:15:53 -08:00
Xin Long 5bf35ddfee sctp: change sk state only when it has assocs in sctp_shutdown
Now when users shutdown a sock with SEND_SHUTDOWN in sctp, even if
this sock has no connection (assoc), sk state would be changed to
SCTP_SS_CLOSING, which is not as we expect.

Besides, after that if users try to listen on this sock, kernel
could even panic when it dereference sctp_sk(sk)->bind_hash in
sctp_inet_listen, as bind_hash is null when sock has no assoc.

This patch is to move sk state change after checking sk assocs
is not empty, and also merge these two if() conditions and reduce
indent level.

Fixes: d46e416c11 ("sctp: sctp should change socket state when shutdown is received")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 16:22:33 -05:00
WANG Cong d9dc8b0f8b net: fix sleeping for sk_wait_event()
Similar to commit 14135f30e3 ("inet: fix sleeping inside inet_wait_for_connect()"),
sk_wait_event() needs to fix too, because release_sock() is blocking,
it changes the process state back to running after sleep, which breaks
the previous prepare_to_wait().

Switch to the new wait API.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 13:17:21 -05:00
Scott Mayhew ea08e39230 sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
This fixes the following panic that can occur with NFSoRDMA.

general protection fault: 0000 [#1] SMP
Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm sg ioatdma
ipmi_devintf ipmi_ssif dcdbas iTCO_wdt iTCO_vendor_support pcspkr
irqbypass sb_edac shpchp dca crc32_pclmul ghash_clmulni_intel edac_core
lpc_ich aesni_intel lrw gf128mul glue_helper ablk_helper mei_me mei
ipmi_si cryptd wmi ipmi_msghandler acpi_pad acpi_power_meter nfsd
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt ahci fb_sys_fops ttm libahci mlx5_core
tg3 crct10dif_pclmul drm crct10dif_common
ptp i2c_core libata crc32c_intel pps_core fjes dm_mirror dm_region_hash
dm_log dm_mod
CPU: 1 PID: 120 Comm: kworker/1:1 Not tainted 3.10.0-514.el7.x86_64 #1
Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
Workqueue: events check_lifetime
task: ffff88031f506dd0 ti: ffff88031f584000 task.ti: ffff88031f584000
RIP: 0010:[<ffffffff8168d847>]  [<ffffffff8168d847>]
_raw_spin_lock_bh+0x17/0x50
RSP: 0018:ffff88031f587ba8  EFLAGS: 00010206
RAX: 0000000000020000 RBX: 20041fac02080072 RCX: ffff88031f587fd8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 20041fac02080072
RBP: ffff88031f587bb0 R08: 0000000000000008 R09: ffffffff8155be77
R10: ffff880322a59b00 R11: ffffea000bf39f00 R12: 20041fac02080072
R13: 000000000000000d R14: ffff8800c4fbd800 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff880322a40000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3c52d4547e CR3: 00000000019ba000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
20041fac02080002 ffff88031f587bd0 ffffffff81557830 20041fac02080002
ffff88031f587c78 ffff88031f587c40 ffffffff8155ae08 000000010157df32
0000000800000001 ffff88031f587c20 ffffffff81096acb ffffffff81aa37d0
Call Trace:
[<ffffffff81557830>] lock_sock_nested+0x20/0x50
[<ffffffff8155ae08>] sock_setsockopt+0x78/0x940
[<ffffffff81096acb>] ? lock_timer_base.isra.33+0x2b/0x50
[<ffffffff8155397d>] kernel_setsockopt+0x4d/0x50
[<ffffffffa0386284>] svc_age_temp_xprts_now+0x174/0x1e0 [sunrpc]
[<ffffffffa03b681d>] nfsd_inetaddr_event+0x9d/0xd0 [nfsd]
[<ffffffff81691ebc>] notifier_call_chain+0x4c/0x70
[<ffffffff810b687d>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffff810b68b6>] blocking_notifier_call_chain+0x16/0x20
[<ffffffff815e8538>] __inet_del_ifa+0x168/0x2d0
[<ffffffff815e8cef>] check_lifetime+0x25f/0x270
[<ffffffff810a7f3b>] process_one_work+0x17b/0x470
[<ffffffff810a8d76>] worker_thread+0x126/0x410
[<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
[<ffffffff810b052f>] kthread+0xcf/0xe0
[<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81696418>] ret_from_fork+0x58/0x90
[<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
Code: ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb d9 66 0f 1f 44 00 00 0f 1f
44 00 00 55 48 89 e5 53 48 89 fb e8 7e 04 a0 ff b8 00 00 02 00 <f0> 0f
c1 03 89 c2 c1 ea 10 66 39 c2 75 03 5b 5d c3 83 e2 fe 0f
RIP  [<ffffffff8168d847>] _raw_spin_lock_bh+0x17/0x50
RSP <ffff88031f587ba8>

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: c3d4879e ("sunrpc: Add a function to close temporary transports immediately")
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 10:30:58 -05:00
David S. Miller 7d384846b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains a second batch of Netfilter updates for
your net-next tree. This includes a rework of the core hook
infrastructure that improves Netfilter performance by ~15% according to
synthetic benchmarks. Then, a large batch with ipset updates, including
a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a
couple of assorted updates.

Regarding the core hook infrastructure rework to improve performance,
using this simple drop-all packets ruleset from ingress:

        nft add table netdev x
        nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; }
        nft add rule netdev x y drop

And generating traffic through Jesper Brouer's
samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i
option. perf report shows nf_tables calls in its top 10:

    17.30%  kpktgend_0   [nf_tables]            [k] nft_do_chain
    15.75%  kpktgend_0   [kernel.vmlinux]       [k] __netif_receive_skb_core
    10.39%  kpktgend_0   [nf_tables_netdev]     [k] nft_do_chain_netdev

I'm measuring here an improvement of ~15% in performance with this
patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R)
Core(TM) i5-3320M CPU @ 2.60GHz 4-cores.

This rework contains more specifically, in strict order, these patches:

1) Remove compile-time debugging from core.

2) Remove obsolete comments that predate the rcu era. These days it is
   well known that a Netfilter hook always runs under rcu_read_lock().

3) Remove threshold handling, this is only used by br_netfilter too.
   We already have specific code to handle this from br_netfilter,
   so remove this code from the core path.

4) Deprecate NF_STOP, as this is only used by br_netfilter.

5) Place nf_state_hook pointer into xt_action_param structure, so
   this structure fits into one single cacheline according to pahole.
   This also implicit affects nftables since it also relies on the
   xt_action_param structure.

6) Move state->hook_entries into nf_queue entry. The hook_entries
   pointer is only required by nf_queue(), so we can store this in the
   queue entry instead.

7) use switch() statement to handle verdict cases.

8) Remove hook_entries field from nf_hook_state structure, this is only
   required by nf_queue, so store it in nf_queue_entry structure.

9) Merge nf_iterate() into nf_hook_slow() that results in a much more
   simple and readable function.

10) Handle NF_REPEAT away from the core, so far the only client is
    nf_conntrack_in() and we can restart the packet processing using a
    simple goto to jump back there when the TCP requires it.
    This update required a second pass to fix fallout, fix from
    Arnd Bergmann.

11) Set random seed from nft_hash when no seed is specified from
    userspace.

12) Simplify nf_tables expression registration, in a much smarter way
    to save lots of boiler plate code, by Liping Zhang.

13) Simplify layer 4 protocol conntrack tracker registration, from
    Davide Caratti.

14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due
    to recent generalization of the socket infrastructure, from Arnd
    Bergmann.

15) Then, the ipset batch from Jozsef, he describes it as it follows:

* Cleanup: Remove extra whitespaces in ip_set.h
* Cleanup: Mark some of the helpers arguments as const in ip_set.h
* Cleanup: Group counter helper functions together in ip_set.h
* struct ip_set_skbinfo is introduced instead of open coded fields
  in skbinfo get/init helper funcions.
* Use kmalloc() in comment extension helper instead of kzalloc()
  because it is unnecessary to zero out the area just before
  explicit initialization.
* Cleanup: Split extensions into separate files.
* Cleanup: Separate memsize calculation code into dedicated function.
* Cleanup: group ip_set_put_extensions() and ip_set_get_extensions()
  together.
* Add element count to hash headers by Eric B Munson.
* Add element count to all set types header for uniform output
  across all set types.
* Count non-static extension memory into memsize calculation for
  userspace.
* Cleanup: Remove redundant mtype_expire() arguments, because
  they can be get from other parameters.
* Cleanup: Simplify mtype_expire() for hash types by removing
  one level of intendation.
* Make NLEN compile time constant for hash types.
* Make sure element data size is a multiple of u32 for the hash set
  types.
* Optimize hash creation routine, exit as early as possible.
* Make struct htype per ipset family so nets array becomes fixed size
  and thus simplifies the struct htype allocation.
* Collapse same condition body into a single one.
* Fix reported memory size for hash:* types, base hash bucket structure
  was not taken into account.
* hash:ipmac type support added to ipset by Tomasz Chilinski.
* Use setup_timer() and mod_timer() instead of init_timer()
  by Muhammad Falak R Wani, individually for the set type families.

16) Remove useless connlabel field in struct netns_ct, patch from
    Florian Westphal.

17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify
    {ip,ip6,arp}tables code that uses this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 22:41:25 -05:00
Julia Lawall eb1a6bdc28 netfilter: x_tables: simplify IS_ERR_OR_NULL to NULL test
Since commit 7926dbfa4b ("netfilter: don't use
mutex_lock_interruptible()"), the function xt_find_table_lock can only
return NULL on an error.  Simplify the call sites and update the
comment before the function.

The semantic patch that change the code is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression t,e;
@@

t = \(xt_find_table_lock(...)\|
      try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
- ! IS_ERR_OR_NULL(t)
+ t

@@
expression t,e;
@@

t = \(xt_find_table_lock(...)\|
      try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
- IS_ERR_OR_NULL(t)
+ !t

@@
expression t,e,e1;
@@

t = \(xt_find_table_lock(...)\|
      try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
?- t ? PTR_ERR(t) : e1
+ e1
... when any

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-13 22:26:13 +01:00
Eric Dumazet ac6e780070 tcp: take care of truncations done by sk_filter()
With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 12:30:02 -05:00
Stephen Suryaputra Lin 969447f226 ipv4: use new_gw for redirect neigh lookup
In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
and then the state of the neigh for the new_gw is checked. If the state
isn't valid then the redirected route is deleted. This behavior is
maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
is assigned to peer->redirect_learned.a4 before calling
ipv4_neigh_lookup().

After commit 5943634fc5 ("ipv4: Maintain redirect and PMTU info in
struct rtable again."), ipv4_neigh_lookup() is performed without the
rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
isn't zero, the function uses it as the key. The neigh is most likely
valid since the old_gw is the one that sends the ICMP redirect message.
Then the new_gw is assigned to fib_nh_exception. The problem is: the
new_gw ARP may never gets resolved and the traffic is blackholed.

So, use the new_gw for neigh lookup.

Changes from v1:
 - use __ipv4_neigh_lookup instead (per Eric Dumazet).

Fixes: 5943634fc5 ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 12:24:44 -05:00
Jiri Benc 217ac77a3c openvswitch: allow L3 netdev ports
Allow ARPHRD_NONE interfaces to be added to ovs bridge.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 91820da6ae openvswitch: add Ethernet push and pop actions
It's not allowed to push Ethernet header in front of another Ethernet
header.

It's not allowed to pop Ethernet header if there's a vlan tag. This
preserves the invariant that L3 packet never has a vlan tag.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 0a6410fbde openvswitch: netlink: support L3 packets
Extend the ovs flow netlink protocol to support L3 packets. Packets without
OVS_KEY_ATTR_ETHERNET attribute specify L3 packets; for those, the
OVS_KEY_ATTR_ETHERTYPE attribute is mandatory.

Push/pop vlan actions are only supported for Ethernet packets.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 5108bbaddc openvswitch: add processing of L3 packets
Support receiving, extracting flow key and sending of L3 packets (packets
without an Ethernet header).

Note that even after this patch, non-Ethernet interfaces are still not
allowed to be added to bridges. Similarly, netlink interface for sending and
receiving L3 packets to/from user space is not in place yet.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 1560a074df openvswitch: support MPLS push and pop for L3 packets
Update Ethernet header only if there is one.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc e2d9d8358c openvswitch: pass mac_proto to ovs_vport_send
We'll need it to alter packets sent to ARPHRD_NONE interfaces.

Change do_output() to use the actual L2 header size of the packet when
deciding on the minimum cutlen. The assumption here is that what matters is
not the output interface hard_header_len but rather the L2 header of the
particular packet. For example, ARPHRD_NONE tunnels that encapsulate
Ethernet should get at least the Ethernet header.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 329f45bc4f openvswitch: add mac_proto field to the flow key
Use a hole in the structure. We support only Ethernet so far and will add
a support for L2-less packets shortly. We could use a bool to indicate
whether the Ethernet header is present or not but the approach with the
mac_proto field is more generic and occupies the same number of bytes in the
struct, while allowing later extensibility. It also makes the code in the
next patches more self explaining.

It would be nice to use ARPHRD_ constants but those are u16 which would be
waste. Thus define our own constants.

Another upside of this is that we can overload this new field to also denote
whether the flow key is valid. This has the advantage that on
refragmentation, we don't have to reparse the packet but can rely on the
stored eth.type. This is especially important for the next patches in this
series - instead of adding another branch for L2-less packets before calling
ovs_fragment, we can just remove all those branches completely.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 738314a084 openvswitch: use hard_header_len instead of hardcoded ETH_HLEN
On tx, use hard_header_len while deciding whether to refragment or drop the
packet. That way, all combinations are calculated correctly:

* L2 packet going to L2 interface (the L2 header len is subtracted),
* L2 packet going to L3 interface (the L2 header is included in the packet
  lenght),
* L3 packet going to L3 interface.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Eric Dumazet 34fad54c25 net: __skb_flow_dissect() must cap its return value
After Tom patch, thoff field could point past the end of the buffer,
this could fool some callers.

If an skb was provided, skb->len should be the upper limit.
If not, hlen is supposed to be the upper limit.

Fixes: a6e544b0a8 ("flow_dissector: Jump to exit code in __skb_flow_dissect")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yibin Yang <yibyang@cisco.com
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12 23:41:53 -05:00
Martin KaFai Lau 4e3264d21b bpf: Fix bpf_redirect to an ipip/ip6tnl dev
If the bpf program calls bpf_redirect(dev, 0) and dev is
an ipip/ip6tnl, it currently includes the mac header.
e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
of IP-IP.

The fix is to pull the mac header.  At ingress, skb_postpull_rcsum()
is not needed because the ethhdr should have been pulled once already
and then got pushed back just before calling the bpf_prog.
At egress, this patch calls skb_postpull_rcsum().

If bpf_redirect(dev, BPF_F_INGRESS) is called,
it also fails now because it calls dev_forward_skb() which
eventually calls eth_type_trans(skb, dev).  The eth_type_trans()
will set skb->type = PACKET_OTHERHOST because the mac address
does not match the redirecting dev->dev_addr.  The PACKET_OTHERHOST
will eventually cause the ip_rcv() errors out.  To fix this,
____dev_forward_skb() is added.

Joint work with Daniel Borkmann.

Fixes: cfc7381b30 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12 23:38:07 -05:00
Linus Torvalds ef091b3cef Ceph's ->read_iter() implementation is incompatible with the new
generic_file_splice_read() code that went into -rc1.  Switch to the
 less efficient default_file_splice_read() for now; the proper fix is
 being held for 4.10.
 
 We also have a fix for a 4.8 regression and a trival libceph fixup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYJdjPAAoJEEp/3jgCEfOLzEoH/A3B1qqiqs2WoMn0O4pnEdcM
 TxaU46VOkYcK2wh/xbYAns2kZEXKgcCySv+kXn4l3Gh6/lXVxv4WexNqWdO1o6yN
 GqEufIH7yQM6QOE/hkwtUciBXmPfQMPxF14vvprYQuyu5Bs96mrphiAa7vX6Vbk5
 VhfE/j0shb8Q2QQj/Om0mWqM6JtOAlr5aFtEcJcodbCk1k8CptUcBsSoQ31PXMC7
 UcaBHh1VHGLvx9WeG1Rw1g9tc2LiUyu+UK0csolp51+amB7HezgfmzDQzHtXzBmm
 n90SQwonf0DrdWUGuQlOpHnREwxLSgN19s68FCjLc0jeMTP4b6TFEIUgFxiqWc4=
 =Ws5s
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client

Pull Ceph fixes from Ilya Dryomov:
 "Ceph's ->read_iter() implementation is incompatible with the new
  generic_file_splice_read() code that went into -rc1.  Switch to the
  less efficient default_file_splice_read() for now; the proper fix is
  being held for 4.10.

  We also have a fix for a 4.8 regression and a trival libceph fixup"

* tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client:
  libceph: initialize last_linger_id with a large integer
  libceph: fix legacy layout decode with pool 0
  ceph: use default file splice read callback
2016-11-11 09:17:10 -08:00
Linus Torvalds ef5beed998 NFS client bugfixes for Linux 4.9
Bugfixes:
 - Trim extra slashes in v4 nfs_paths to fix tools that use this
 - Fix a -Wmaybe-uninitialized warnings
 - Fix suspicious RCU usages
 - Fix Oops when mounting multiple servers at once
 - Suppress a false-positive pNFS error
 - Fix a DMAR failure in NFS over RDMA
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYJOCbAAoJENfLVL+wpUDrbO0QAIkcxdUu2iQeOrk07VP48kDE
 UEfJTal8vbW/KtKyL9bIeRa1qCvYpSJXnnKcR/Uo5VHE5nMz/5omoJofWf5Zg0UM
 iEHyZfOsuGFieBbl1NBaLjEd6MCoJYpmWFUj+3drZ8zqSdqDTL+JgrP7k3XEU2Mx
 glKb7U0AKoclm3h1MKyCyo5TgDVeI5TOhi+i3VVw2IN79VY2CUp4lHWMY4vloghp
 h+GuJWeVFS1nBpfCF9PpTU6LdHDfg4o/J5+DrP+IjIffD1XGzGEjfFR0BX5HyDcN
 PgOSF3fc7uVOOUIBEAqHUHY/7XiKlv6TEMRPdM8ALVoCXZ6hPSSFxq8JBJSWoVEp
 r11ts66VgYxdQgHbs51Y5AaKudLBwU60KosWuddbdZVb4YPM0cn5WQzVezrpoQYu
 k4rfrpt+LFv23NGfIJa6JaTSFBzM+YXmggEGUI8TI/YUFSN+wEp4uzLB4r19nqAP
 ff32iunzV9Z5edpPQFDCf3/1HAhzrL5KWo7E8EvijpdQKZl5k5CnUJxbG22lh4ct
 QIyYg51LjhCayzbRH8Mu+TKUFT29ORlcSp851BotLjT8ZdUetWXcFab93nAkQI7g
 sMREml4DvcXWy8qFAOzi8mX1ddTBumxBfOD0m3skPg+odxwsl/KiwjLCRwfTrgwS
 jfSXsXmrwTniPCDWgKg3
 =hFod
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Most of these fix regressions in 4.9, and none are going to stable
  this time around.

  Bugfixes:
   - Trim extra slashes in v4 nfs_paths to fix tools that use this
   - Fix a -Wmaybe-uninitialized warnings
   - Fix suspicious RCU usages
   - Fix Oops when mounting multiple servers at once
   - Suppress a false-positive pNFS error
   - Fix a DMAR failure in NFS over RDMA"

* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
  fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
  NFS: Don't print a pNFS error if we aren't using pNFS
  NFS: Ignore connections that have cl_rpcclient uninitialized
  SUNRPC: Fix suspicious RCU usage
  NFSv4.1: work around -Wmaybe-uninitialized warning
  NFS: Trim extra slash in v4 nfs_path
2016-11-11 09:15:30 -08:00
Ilya Dryomov 264048afab libceph: initialize last_linger_id with a large integer
osdc->last_linger_id is a counter for lreq->linger_id, which is used
for watch cookies.  Starting with a large integer should ease the task
of telling apart kernel and userspace clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:08 +01:00
Yan, Zheng 3890dce1d3 libceph: fix legacy layout decode with pool 0
If your data pool was pool 0, ceph_file_layout_from_legacy()
transform that to -1 unconditionally, which broke upgrades.
We only want do that for a fully zeroed ceph_file_layout,
so that it still maps to a file_layout_t.  If any fields
are set, though, we trust the fl_pgpool to be a valid pool.

Fixes: 7627151ea3 ("libceph: define new ceph_file_layout structure")
Link: http://tracker.ceph.com/issues/17825
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:08 +01:00
Lance Richardson 0ace81ec71 ipv4: update comment to document GSO fragmentation cases.
This is a follow-up to commit 9ee6c5dc81 ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.

Suggested-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-10 12:01:54 -05:00
Chuck Lever 62bdf94a20 xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
When a LOCALINV WR is flushed, the frmr is marked STALE, then
frwr_op_unmap_sync DMA-unmaps the frmr's SGL. These STALE frmrs
are then recovered when frwr_op_map hunts for an INVALID frmr to
use.

All other cases that need frmr recovery leave that SGL DMA-mapped.
The FRMR recovery path unconditionally DMA-unmaps the frmr's SGL.

To avoid DMA unmapping the SGL twice for flushed LOCAL_INV WRs,
alter the recovery logic (rather than the hot frwr_op_unmap_sync
path) to distinguish among these cases. This solution also takes
care of the case where multiple LOCAL_INV WRs are issued for the
same rpcrdma_req, some complete successfully, but some are flushed.

Reported-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Vasco Steinmetz <linux@kyberraum.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-10 11:04:54 -05:00
kbuild test robot 737d387b75 netfilter: ipset: hash: fix boolreturn.cocci warnings
net/netfilter/ipset/ip_set_hash_ipmac.c:70:8-9: WARNING: return of 0/1 in function 'hash_ipmac4_data_list' with return type bool
net/netfilter/ipset/ip_set_hash_ipmac.c:178:8-9: WARNING: return of 0/1 in function 'hash_ipmac6_data_list' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

CC: Tomasz Chilinski <tomasz.chilinski@chilan.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:50 +01:00
Jozsef Kadlecsik fcb58a0332 netfilter: ipset: use setup_timer() and mod_timer().
Use setup_timer() and instead of init_timer(), being the preferred way
of setting up a timer.

Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
   active timer (if the timer is inactive it will be activated).

Use setup_timer() and mod_timer() to setup and arm a timer, making the
code compact and easier to read.

Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:49 +01:00
Tomasz Chilinski c0db95c7bc netfilter: ipset: hash:ipmac type support added to ipset
Introduce the hash:ipmac type.

Signed-off-by: Tomasz Chili??ski <tomasz.chilinski@chilan.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:49 +01:00
Jozsef Kadlecsik a71bdbfa99 netfilter: ipset: Fix reported memory size for hash:* types
The calculation of the full allocated memory did not take
into account the size of the base hash bucket structure at some
places.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:48 +01:00
Jozsef Kadlecsik 9be37d2acd netfilter: ipset: Collapse same condition body to a single one
The set full case (with net_ratelimit()-ed pr_warn()) is already
handled, simply jump there.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:48 +01:00
Jozsef Kadlecsik 21956ab290 netfilter: ipset: Make struct htype per ipset family
Before this patch struct htype created at the first source
of ip_set_hash_gen.h and it is common for both IPv4 and IPv6
set variants.

Make struct htype per ipset family and use NLEN to make
nets array fixed size to simplify struct htype allocation.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:48 +01:00
Jozsef Kadlecsik 961509ac18 netfilter: ipset: Optimize hash creation routine
Exit as easly as possible on error and use RCU_INIT_POINTER()
as set is not seen at creation time.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:47 +01:00
Jozsef Kadlecsik 5a902e6d4b netfilter: ipset: Make sure element data size is a multiple of u32
Data for hashing required to be array of u32. Make sure that
element data always multiple of u32.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:47 +01:00
Jozsef Kadlecsik cee8b97b6c netfilter: ipset: Make NLEN compile time constant for hash types
Hash types define HOST_MASK before inclusion of ip_set_hash_gen.h
and the only place where NLEN needed to be calculated at runtime
is *_create() method.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:46 +01:00
Jozsef Kadlecsik 509debc975 netfilter: ipset: Simplify mtype_expire() for hash types
Remove one leve of intendation by using continue while
iterating over elements in bucket.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:46 +01:00
Jozsef Kadlecsik 5fdb5f6938 netfilter: ipset: Remove redundant mtype_expire() arguments
Remove redundant parameters nets_length and dsize, because
they can be get from other parameters.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:46 +01:00
Jozsef Kadlecsik 9e41f26a50 netfilter: ipset: Count non-static extension memory for userspace
Non-static (i.e. comment) extension was not counted into the memory
size. A new internal counter is introduced for this. In the case of
the hash types the sizes of the arrays are counted there as well so
that we can avoid to scan the whole set when just the header data
is requested.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:45 +01:00
Jozsef Kadlecsik 702b71e7c6 netfilter: ipset: Add element count to all set types header
It is better to list the set elements for all set types, thus the
header information is uniform. Element counts are therefore added
to the bitmap and list types.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:45 +01:00
Eric B Munson a54dad51a8 netfilter: ipset: Add element count to hash headers
It would be useful for userspace to query the size of an ipset hash,
however, this data is not exposed to userspace outside of counting the
number of member entries.  This patch uses the attribute
IPSET_ATTR_ELEMENTS to indicate the size in the the header that is
exported to userspace.  This field is then printed by the userspace
tool for hashes.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Josh Hunt <johunt@akamai.com>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:44 +01:00
Jozsef Kadlecsik 722a94519a netfilter: ipset: Separate memsize calculation code into dedicated function
Hash types already has it's memsize calculation code in separate
functions. Clean up and do the same for *bitmap* and *list* sets.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:44 +01:00
Jozsef Kadlecsik bec810d973 netfilter: ipset: Improve skbinfo get/init helpers
Use struct ip_set_skbinfo in struct ip_set_ext instead of open
coded fields and assign structure members in get/init helpers
instead of copying members one by one. Explicitly note that
struct ip_set_skbinfo must be padded to prevent non-aligned
access in the extension blob.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:42 +01:00
Eric Dumazet f522a5fcf6 tcp: remove unaligned accesses from tcp_get_info()
After commit 6ed46d1247 ("sock_diag: align nlattr properly when
needed"), tcp_get_info() gets 64bit aligned memory, so we can avoid
the unaligned helpers.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 22:51:33 -05:00
David Ahern 9b6c14d51b net: tcp response should set oif only if it is L3 master
Lorenzo noted an Android unit test failed due to e0d56fdd7342:
"The expectation in the test was that the RST replying to a SYN sent to a
closed port should be generated with oif=0. In other words it should not
prefer the interface where the SYN came in on, but instead should follow
whatever the routing table says it should do."

Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
that the oif in the flow is set to the skb_iif only if skb_iif is an L3
master.

Fixes: e0d56fdd73 ("net: l3mdev: remove redundant calls")
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 22:32:10 -05:00
David S. Miller d401c1d1e8 This feature and cleanup patchset includes the following changes:
- netlink and code cleanups by Sven Eckelmann (3 patches)
 
  - Cleanup and minor fixes by Linus Luessing (3 patches)
 
  - Speed up multicast update intervals, by Linus Luessing
 
  - Avoid (re)broadcast in meshes for some easy cases,
    by Linus Luessing
 
  - Clean up tx return state handling, by Sven Eckelmann (6 patches)
 
  - Fix some special mac address handling cases, by Sven Eckelmann
    (3 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYI6GFFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqHyEw/9GkYNRQJOk1JMuW0cDvj9uWqoendvXRNPVkCvqh4gjX4o+aQaeyumv1/v
 eYqpslQWmSsrIlGQ6UGCegzyzZ7jXo6ZijM7wvz2bWwB2C0NzUGlBBCzOeA6Bui/
 3Fq+Xmx0Xcf5+c82YmrLor/yYp4FIFTao4+a80vHzQeI/Hg8RuJTOFJdtVNV3JPP
 VrfzMAPLLXJPPKHjt1PN3lfANWqX6nWLUMhHBNkMpYB+mMdyaCve6X+MxPF+WYBH
 wBO8spU35chW7dp8HOncof5nRDv2xVHWs6TN2kdJ762YrZ1oL0GXwWXViKhWskSQ
 QEeOLboyj3IuwPsxOQOLQEbAMrp6jqj3L/6lYWRkV2U6Bbi8EYdozW8L3utxMcvA
 Dft8D2U5JAD5ja0VUFyGhwNaBFien2B9JSEwsyOLtUbaQSASNyvym75WrN2Ey/d7
 JhBzUt6Iwh8RNJylY3nG5OkoNnyXYv3VrQLsIW4QTHc8Um9eaiOeFHtuAi6WNBtI
 HgMwPcdErNbmPd3w9OM6kk6aBQ/DTUK/7CNUKYVoGDayGxKYGDwqhoog9zm18wrt
 wc/TtdIY+q95hgm8fDCJefrnkaIxDJrVtChs30N/pJ24MeKcHuibop3HzxIngze2
 zPZTuXRKA2VSt79+EV4KORAutexi1WQIN7nRH1a8zMsYfyMKG8Y=
 =1xrj
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20161108-v2' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
pull request for net-next: batman-adv 2016-11-08 v2

This feature and cleanup patchset includes the following changes:

 - netlink and code cleanups by Sven Eckelmann (3 patches)

 - Cleanup and minor fixes by Linus Luessing (3 patches)

 - Speed up multicast update intervals, by Linus Luessing

 - Avoid (re)broadcast in meshes for some easy cases,
   by Linus Luessing

 - Clean up tx return state handling, by Sven Eckelmann (6 patches)

 - Fix some special mac address handling cases, by Sven Eckelmann
   (3 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 22:15:28 -05:00
Eric Dumazet 149d6ad836 net: napi_hash_add() is no longer exported
There are no more users except from net/core/dev.c
napi_hash_add() can now be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 21:16:05 -05:00
David Lebrun a149e7c7ce ipv6: sr: add support for SRH injection through setsockopt
This patch adds support for per-socket SRH injection with the setsockopt
system call through the IPPROTO_IPV6, IPV6_RTHDR options.
The SRH is pushed through the ipv6_push_nfrag_opts function.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun 613fa3ca9e ipv6: add source address argument for ipv6_push_nfrag_opts
This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun 9baee83406 ipv6: sr: add calls to verify and insert HMAC signatures
This patch enables the verification of the HMAC signature for transiting
SR-enabled packets, and its insertion on encapsulated/injected SRH.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun 4f4853dc1c ipv6: sr: implement API to control SR HMAC structure
This patch provides an implementation of the genetlink commands
to associate a given HMAC key identifier with an hashing algorithm
and a secret.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun bf355b8d2c ipv6: sr: add core files for SR HMAC support
This patch adds the necessary functions to compute and check the HMAC signature
of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and
hmac(sha256).

In order to avoid dynamic memory allocation for each HMAC computation,
a per-cpu ring buffer is allocated for this purpose.

A new per-interface sysctl called seg6_require_hmac is added, allowing a
user-defined policy for processing HMAC-signed SR-enabled packets.
A value of -1 means that the HMAC field will always be ignored.
A value of 0 means that if an HMAC field is present, its validity will
be enforced (the packet is dropped is the signature is incorrect).
Finally, a value of 1 means that any SR-enabled packet that does not
contain an HMAC signature or whose signature is incorrect will be dropped.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun 6c8702c60b ipv6: sr: add support for SRH encapsulation and injection with lwtunnels
This patch creates a new type of interfaceless lightweight tunnel (SEG6),
enabling the encapsulation and injection of SRH within locally emitted
packets and forwarded packets.

>From a configuration viewpoint, a seg6 tunnel would be configured as follows:

  ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0

Any packet whose destination address is fc00::1 would thus be encapsulated
within an outer IPv6 header containing the SRH with three segments, and would
actually be routed to the first segment of the list. If `mode inline' was
specified instead of `mode encap', then the SRH would be directly inserted
after the IPv6 header without outer encapsulation.

The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This
feature was made configurable because direct header insertion may break
several mechanisms such as PMTUD or IPSec AH.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun 915d7e5e59 ipv6: sr: add code base for control plane support of SR-IPv6
This patch adds the necessary hooks and structures to provide support
for SR-IPv6 control plane, essentially the Generic Netlink commands
that will be used for userspace control over the Segment Routing
kernel structures.

The genetlink commands provide control over two different structures:
tunnel source and HMAC data. The tunnel source is the source address
that will be used by default when encapsulating packets into an
outer IPv6 header + SRH. If the tunnel source is set to :: then an
address of the outgoing interface will be selected as the source.

The HMAC commands currently just return ENOTSUPP and will be implemented
in a future patch.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun 1ababeba4a ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)
Implement minimal support for processing of SR-enabled packets
as described in
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.

This patch implements the following operations:
- Intermediate segment endpoint: incrementation of active segment and rerouting.
- Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
  and routing of inner packet.
- Cleanup flag support for SR-inlined packets: removal of SRH if we are the
  penultimate segment endpoint.

A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
packets. Default is deny.

This patch does not provide support for HMAC-signed packets.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David S. Miller 9fa684ec86 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:

1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
   support the set element updates from the packet path. From Liping
   Zhang.

2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

3) Fix a race when inserting new elements to the set hash from the
   packet path, also from Liping.

4) Handle segmented TCP SIP packets properly, basically avoid that the
   INVITE in the allow header create bogus expectations by performing
   stricter SIP message parsing, from Ulrich Weber.

5) nft_parse_u32_check() should return signed integer for errors, from
   John Linville.

6) Fix wrong allocation instead of connlabels, allocate 16 instead of
   32 bytes, from Florian Westphal.

7) Fix compilation breakage when building the ip_vs_sync code with
   CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

8) Destroy the new set if the transaction object cannot be allocated,
   also from Liping Zhang.

9) Use device to route duplicated packets via nft_dup only when set by
   the user, otherwise packets may not follow the right route, again
   from Liping.

10) Fix wrong maximum genetlink attribute definition in IPVS, from
    WANG Cong.

11) Ignore untracked conntrack objects from xt_connmark, from Florian
    Westphal.

12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
    via CT target, otherwise we cannot use the h.245 helper, from
    Florian.

13) Revisit garbage collection heuristic in the new workqueue-based
    timer approach for conntrack to evict objects earlier, again from
    Florian.

14) Fix crash in nf_tables when inserting an element into a verdict map,
    from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:38:18 -05:00
Mathias Krause f567e950bf rtnl: reset calcit fptr in rtnl_unregister()
To avoid having dangling function pointers left behind, reset calcit in
rtnl_unregister(), too.

This is no issue so far, as only the rtnl core registers a netlink
handler with a calcit hook which won't be unregistered, but may become
one if new code makes use of the calcit hook.

Fixes: c7ac8679be ("rtnetlink: Compute and store minimum ifinfo...")
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:18:19 -05:00
Asbjørn Sloth Tønnesen 3f9b9770b4 net: l2tp: fix negative assignment to unsigned int
recv_seq, send_seq and lns_mode mode are all defined as
unsigned int foo:1;

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen 57ceb8611d net: l2tp: cleanup: remove redundant condition
These assignments follow this pattern:

	unsigned int foo:1;
	struct nlattr *nla = info->attrs[bar];

	if (nla)
		foo = nla_get_flag(nla); /* expands to: foo = !!nla */

This could be simplified to: if (nla) foo = 1;
but lets just remove the condition and use the macro,

	foo = nla_get_flag(nla);

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen 97b7af097e net: l2tp: netlink: l2tp_nl_tunnel_send: set UDP6 checksum flags
This patch causes the proper attribute flags to be set,
in the case that IPv6 UDP checksums are disabled, so that
userspace ie. `ip l2tp show tunnel` knows about it.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen 7ff516ffe4 net: l2tp: only set L2TP_ATTR_UDP_CSUM if AF_INET
Only set L2TP_ATTR_UDP_CSUM in l2tp_nl_tunnel_send()
when it's running over IPv4.

This prepares the code to also have IPv6 specific attributes.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
David Ahern 9d1a6c4ea4 net: icmp_route_lookup should use rt dev to determine L3 domain
icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.

Fixes: 613d09b30f ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:49:39 -05:00
Eric Dumazet d61d072e87 net-gro: avoid reorders
Receiving a GSO packet in dev_gro_receive() is not uncommon
in stacked devices, or devices partially implementing LRO/GRO
like bnx2x. GRO is implementing the aggregation the device
was not able to do itself.

Current code causes reorders, like in following case :

For a given flow where sender sent 3 packets P1,P2,P3,P4

Receiver might receive P1 as a single packet, stored in GRO engine.

Then P2-P4 are received as a single GSO packet, immediately given to
upper stack, while P1 is held in GRO engine.

This patch will make sure P1 is given to upper stack, then P2-P4
immediately after.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:48:54 -05:00
Arnd Bergmann 56a62e2218 netfilter: conntrack: fix NF_REPEAT handling
gcc correctly identified a theoretical uninitialized variable use:

net/netfilter/nf_conntrack_core.c: In function 'nf_conntrack_in':
net/netfilter/nf_conntrack_core.c:1125:14: error: 'l4proto' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This could only happen when we 'goto out' before looking up l4proto,
and then enter the retry, implying that l3proto->get_l4proto()
returned NF_REPEAT. This does not currently get returned in any
code path and probably won't ever happen, but is not good to
rely on.

Moving the repeat handling up a little should have the same
behavior as today but avoids the warning by making that case
impossible to enter.

[ I have mangled this original patch to remove the check for tmpl, we
  should inconditionally jump back to the repeat label in case we hit
  NF_REPEAT instead. I have also moved the comment that explains this
  where it belongs. --pablo ]

Fixes: 08733a0cb7 ("netfilter: handle NF_REPEAT from nf_conntrack_in()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-10 00:19:33 +01:00
Arnd Bergmann 30f5815848 udp: provide udp{4,6}_lib_lookup for nf_socket_ipv{4,6}
Since commit ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
the udp6_lib_lookup and udp4_lib_lookup functions are only
provided when it is actually possible to call them.

However, moving the callers now caused a link error:

net/built-in.o: In function `nf_sk_lookup_slow_v6':
(.text+0x131a39): undefined reference to `udp6_lib_lookup'
net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'

This extends the #ifdef so we also provide the functions when
CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively
are set.

Fixes: 8db4c5be88 ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-10 00:05:14 +01:00
Davide Caratti 0e54d2179f netfilter: conntrack: simplify init/uninit of L4 protocol trackers
modify registration and deregistration of layer-4 protocol trackers to
facilitate inclusion of new elements into the current list of builtin
protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP,
UDPlite) layer-4 protocol trackers usually register/deregister themselves
using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...).
This sequence is interrupted and rolled back in case of error; in order to
simplify addition of builtin protocols, the input of the above functions
has been modified to allow registering/unregistering multiple protocols.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:49:25 +01:00
Liping Zhang 4e24877e61 netfilter: nf_tables: simplify the basic expressions' init routine
Some basic expressions are built into nf_tables.ko, such as nft_cmp,
nft_lookup, nft_range and so on. But these basic expressions' init
routine is a little ugly, too many goto errX labels, and we forget
to call nft_range_module_exit in the exit routine, although it is
harmless.

Acctually, the init and exit routines of these basic expressions
are same, i.e. do nft_register_expr in the init routine and do
nft_unregister_expr in the exit routine.

So it's better to arrange them into an array and deal with them
together.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:42:23 +01:00
Pablo Neira Ayuso f86dab3aa6 netfilter: nft_hash: get random bytes if seed is not specified
If the user doesn't specify a seed, generate one at configuration time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:41:09 +01:00
Hadar Hen Zion 75bfbca01e net/sched: act_tunnel_key: Add UDP dst port option
The current tunnel set action supports only IP addresses and key
options. Add UDP dst port option.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:55 -05:00
Hadar Hen Zion 24ba898d43 net/dst: Add dst port to dst_metadata utility functions
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst
utility functions.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion f4d997fd61 net/sched: cls_flower: Add UDP port to tunnel parameters
The current IP tunneling classification supports only IP addresses and key.
Enhance UDP based IP tunneling classification parameters by adding UDP
src and dst port.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion 519d10521c net/sched: cls_flower: Allow setting encapsulation fields as used key
When encapsulation field is set, mark it as used key for the flow
dissector. This will be used by offloading drivers.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion 9ce183b4c4 net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_key
Needed for drivers to pick the relevant action when offloading tunnel
key act.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:53 -05:00
Lorenzo Colitti 35b80733b3 net: core: add missing check for uid_range in rule_exists.
Without this check, it is not possible to create two rules that
are identical except for their UID ranges. For example:

root@net-test:/# ip rule add prio 1000 lookup 300
root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300
RTNETLINK answers: File exists
root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100
root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200
root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100
RTNETLINK answers: File exists

Tested: https://android-review.googlesource.com/#/c/299980/
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:28:10 -05:00
Maciej Żenczykowski fb56be83e4 net-ipv6: on device mtu change do not add mtu to mtu-less routes
Routes can specify an mtu explicitly or inherit the mtu from
the underlying device - this inheritance is implemented in
dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().

Currently changing the mtu of a device adds mtu explicitly
to routes using that device.

ie.
  # ip link set dev lo mtu 65536
  # ip -6 route add local 2000::1 dev lo
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium

  # ip link set dev lo mtu 65535
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium

  # ip link set dev lo mtu 65536
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium

  # ip -6 route del local 2000::1

After this patch the route entry no longer changes unless it already has an mtu.
There is no need: this inheritance is already done in ip6_mtu()

  # ip link set dev lo mtu 65536
  # ip -6 route add local 2000::1 dev lo
  # ip -6 route add local 2000::2 dev lo mtu 2000
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium

  # ip link set dev lo mtu 65535
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium

  # ip link set dev lo mtu 1501
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium

  # ip link set dev lo mtu 65536
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium

  # ip -6 route del local 2000::1
  # ip -6 route del local 2000::2

This is desirable because changing device mtu and then resetting it
to the previous value shouldn't change the user visible routing table.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:19:32 -05:00
Soheil Hassas Yeganeh 3023898b7d sock: fix sendmmsg for partial sendmsg
Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.

Fixes: 228e548e60 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:18:12 -05:00
Eric Dumazet 67db3e4bfb tcp: no longer hold ehash lock while calling tcp_get_info()
We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.

We would like to add more instrumentation points for TCP, and
avoiding grabing socket lock in tcp_getinfo() was too costly.

Being able to lock the socket allows to provide consistent set
of fields.

inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.

We can remove syncp added in commit d654976cbf
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:02:27 -05:00
Eric Dumazet ccbf3bfaee tcp: shortcut listeners in tcp_get_info()
Being lockless in tcp_get_info() is hard, because we need to add
specific synchronization in TCP fast path, like seqcount.

Following patch will change inet_diag_dump_icsk() to no longer
hold any lock for non listeners, so that we can properly acquire
socket lock in get_tcp_info() and let it return more consistent counters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:02:27 -05:00
Sowmini Varadhan 117d15bbfd RDS: TCP: start multipath acceptor loop at 0
The for() loop in rds_tcp_accept_one() assumes that the 0'th
rds_tcp_conn_path is UP and starts multipath accepts at index 1.
But this assumption may not always be true: if the 0'th path
has failed (ERROR or DOWN state) an incoming connection request
should be used to resurrect this path.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 12:47:49 -05:00
Sowmini Varadhan 1ac507d4ff RDS: TCP: report addr/port info based on TCP socket in rds-info
The socket argument passed to rds_tcp_tc_info() is a PF_RDS socket,
so it is incorrect to report the address port info based on
rds_getname() as part of TCP state report.

Invoke inet_getname() for the t_sock associated with the
rds_tcp_connection instead.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 12:47:49 -05:00
Liping Zhang 58c78e104d netfilter: nf_tables: fix oops when inserting an element into a verdict map
Dalegaard says:
 The following ruleset, when loaded with 'nft -f bad.txt'
 ----snip----
 flush ruleset
 table ip inlinenat {
   map sourcemap {
     type ipv4_addr : verdict;
   }

   chain postrouting {
     ip saddr vmap @sourcemap accept
   }
 }
 add chain inlinenat test
 add element inlinenat sourcemap { 100.123.10.2 : jump test }
 ----snip----

 results in a kernel oops:
 BUG: unable to handle kernel paging request at 0000000000001344
 IP: [<ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
 [...]
 Call Trace:
  [<ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
  [<ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
  [<ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
  [<ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
  [<ffffffff8132c630>] ? nla_strcmp+0x40/0x50
  [<ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
  [<ffffffff8132c400>] ? nla_validate+0x60/0x80
  [<ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

Because we forget to fill the net pointer in bind_ctx, so dereferencing
it may cause kernel crash.

Reported-by: Dalegaard <dalegaard@gmail.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:39 +01:00
Florian Westphal e0df8cae6c netfilter: conntrack: refine gc worker heuristics
Nicolas Dichtel says:
  After commit b87a2f9199 ("netfilter: conntrack: add gc worker to
  remove timed-out entries"), netlink conntrack deletion events may be
  sent with a huge delay.

Nicolas further points at this line:

  goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all.  Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge.  But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:38 +01:00
Florian Westphal 6114cc516d netfilter: conntrack: fix CT target for UNSPEC helpers
Thomas reports its not possible to attach the H.245 helper:

iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"

This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.

We should treat UNSPEC as wildcard and ignore the l3num instead.

Reported-by: Thomas Woerner <twoerner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:37 +01:00
Florian Westphal fb9c9649a1 netfilter: connmark: ignore skbs with magic untracked conntrack objects
The (percpu) untracked conntrack entries can end up with nonzero connmarks.

The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).

Untracked conntrack objects are implementation detail, we might as well use
invalid magic address instead to tell INVALID and UNTRACKED apart.

Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.

Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:36 +01:00
WANG Cong 8fbfef7f50 ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr
family.maxattr is the max index for policy[], the size of
ops[] is determined with ARRAY_SIZE().

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:30 +01:00
Linus Lüssing 9b4aec647a batman-adv: fix rare race conditions on interface removal
In rare cases during shutdown the following general protection fault can
happen:

  general protection fault: 0000 [#1] SMP
  Modules linked in: batman_adv(O-) [...]
  CPU: 3 PID: 1714 Comm: rmmod Tainted: G           O    4.6.0-rc6+ #1
  [...]
  Call Trace:
   [<ffffffffa0363294>] batadv_hardif_disable_interface+0x29a/0x3a6 [batman_adv]
   [<ffffffffa0373db4>] batadv_softif_destroy_netlink+0x4b/0xa4 [batman_adv]
   [<ffffffff813b52f3>] __rtnl_link_unregister+0x48/0x92
   [<ffffffff813b9240>] rtnl_link_unregister+0xc1/0xdb
   [<ffffffff8108547c>] ? bit_waitqueue+0x87/0x87
   [<ffffffffa03850d2>] batadv_exit+0x1a/0xf48 [batman_adv]
   [<ffffffff810c26f9>] SyS_delete_module+0x136/0x1b0
   [<ffffffff8144dc65>] entry_SYSCALL_64_fastpath+0x18/0xa8
   [<ffffffff8108aaca>] ? trace_hardirqs_off_caller+0x37/0xa6
  Code: 89 f7 e8 21 bd 0d e1 4d 85 e4 75 0e 31 f6 48 c7 c7 50 d7 3b a0 e8 50 16 f2 e0 49 8b 9c 24 28 01 00 00 48 85 db 0f 84 b2 00 00 00 <48> 8b 03 4d 85 ed 48 89 45 c8 74 09 4c 39 ab f8 00 00 00 75 1c
  RIP  [<ffffffffa0371852>] batadv_purge_outstanding_packets+0x1c8/0x291 [batman_adv]
   RSP <ffff88001da5fd78>
  ---[ end trace 803b9bdc6a4a952b ]---
  Kernel panic - not syncing: Fatal exception in interrupt
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: Fatal exception in interrupt

It does not happen often, but may potentially happen when frequently
shutting down and reinitializing an interface. With some carefully
placed msleep()s/mdelay()s it can be reproduced easily.

The issue is, that on interface removal, any still running worker thread
of a forwarding packet will race with the interface purging routine to
free a forwarding packet. Temporarily giving up a spin-lock to be able
to sleep in the purging routine is not safe.

Furthermore, there is a potential general protection fault not just for
the purging side shown above, but also on the worker side: Temporarily
removing a forw_packet from the according forw_{bcast,bat}_list will make
it impossible for the purging routine to catch and cancel it.

 # How this patch tries to fix it:

With this patch we split the queue purging into three steps: Step 1),
removing forward packets from the queue of an interface and by that
claim it as our responsibility to free.

Step 2), we are either lucky to cancel a pending worker before it starts
to run. Or if it is already running, we wait and let it do its thing,
except two things:

Through the claiming in step 1) we prevent workers from a) re-arming
themselves. And b) prevent workers from freeing packets which we still
hold in the interface purging routine.

Finally, step 3, we are sure that no forwarding packets are pending or
even running anymore on the interface to remove. We can then safely free
the claimed forwarding packets.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:39 +01:00
Sven Eckelmann 2c0c06ff44 batman-adv: Add module alias for batadv netlink family
The batman-adv module has to be loaded to fulfill genl request by the
userspace. When it is not loaded then requests will fail. It is therefore
useful to get the module automatically loaded when such a request is made.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:39 +01:00
Sven Eckelmann ee3b5e9fe8 batman-adv: Update wifi flags on upper link change
Things like VLANs don't have their link set when they are created. Thus
the wifi flags have to be evaluated later to fix their contents for the
link interface.

Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:38 +01:00
Marek Lindner 1942de1bba batman-adv: retrieve B.A.T.M.A.N. V WiFi neighbor stats from real interface
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[sven.eckelmann@open-mesh.com: re-add batadv_get_real_netdev to take rtnl
 semaphore for batadv_get_real_netdevice]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:38 +01:00
Marek Lindner 5ed4a460a1 batman-adv: additional checks for virtual interfaces on top of WiFi
In a few situations batman-adv tries to determine whether a given interface
is a WiFi interface to enable specific WiFi optimizations. If the interface
batman-adv has been configured with is a virtual interface (e.g. VLAN) it
would not be properly detected as WiFi interface and thus not benefit from
the special WiFi treatment.
This patch changes that by peeking under the hood whenever a virtual
interface is in play.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[sven.eckelmann@open-mesh.com: integrate in wifi_flags caching, retrieve
 namespace of link interface]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:38 +01:00
Sven Eckelmann 10b1bbb46c batman-adv: Cache the type of wifi device for each hardif
batman-adv is requiring the type of wifi device in different contexts. Some
of them can take the rtnl semaphore and some of them already have the
semaphore taken. But even others don't allow that the semaphore will be
taken.

The data has to be retrieved when the hardif is added to batman-adv because
some of the wifi information for an hardif will only be available with rtnl
lock. It can then be cached in the batadv_hard_iface and the functions
is_wifi_netdev and is_cfg80211_netdev can just compare the correct bits
without imposing extra locking requirements.

Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:37 +01:00
Marek Lindner f44a3ae9a2 batman-adv: refactor wifi interface detection
The ELP protocol requires cfg80211 to auto-detect the WiFi througput
to a given neighbor. Use batadv_is_cfg80211_netdev() to determine
whether or not an interface is eligible.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:37 +01:00
Sven Eckelmann 93bbaab455 batman-adv: Reject unicast packet with zero/mcast dst address
An unicast batman-adv packet cannot be transmitted to a multicast or zero
mac address. So reject incoming packets which still have these classes of
addresses as destination mac address in the outer ethernet header.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:36 +01:00
Sven Eckelmann 88ffc7d0e2 batman-adv: Return non-const ptr in batadv_getlink_net
The returned net_namespace of batadv_getlink_net may be used with functions
that potentially modify the struct. Thus it must return the pointer as
non-const like rtnl_link_ops::get_link_net does.

Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:36 +01:00
Sven Eckelmann 92eef520d7 batman-adv: Disallow zero and mcast src address for mgmt frames
The routing check for management frames is validating the source mac
address in the outer ethernet header. It rejects every source mac address
which is a broadcast address. But it also has to reject the zero-mac
address and multicast mac addresses.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:35 +01:00
Sven Eckelmann 9f75c8e1c8 batman-adv: Disallow mcast src address for data frames
The routing checks are validating the source mac address of the outer
ethernet header. They reject every source mac address which is a broadcast
address. But they also have to reject any multicast mac addresses.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
[sw@simonwunderlich.de: fix commit message typo]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:35 +01:00
Sven Eckelmann 7d72d174c7 batman-adv: Remove dev_queue_xmit return code exception
No caller of batadv_send_skb_to_orig is expecting the results to be -1
(-EPERM) anymore when the skbuff was not consumed. They will instead expect
that the skbuff is always consumed. Having such return code filter is
therefore not needed anymore.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:34 +01:00
Sven Eckelmann b91a2543b4 batman-adv: Consume skb in receive handlers
Receiving functions in Linux consume the supplied skbuff. Doing the same in
the batadv_rx_handler functions makes the behavior more similar to the rest
of the Linux network code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:34 +01:00
Alexander Duyck fd0285a39b fib_trie: Correct /proc/net/route off by one error
The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.

In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.

This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key.  In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid.  With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.

Fixes: 25b97c016b ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft <apw@canonical.com>
Reported-by: Jason Baron <jbaron@akamai.com>
Tested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:40:27 -05:00
David Ahern 5d41ce29e3 net: icmp6_send should use dst dev to determine L3 domain
icmp6_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have a dst set.
Update icmp6_send to use the dst on the skb to determine L3 domain.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:30:19 -05:00
Soheil Hassas Yeganeh f5f99309fa sock: do not set sk_err in sock_dequeue_err_skb
Do not set sk_err when dequeuing errors from the error queue.
Doing so results in:
a) Bugs: By overwriting existing sk_err values, it possibly
   hides legitimate errors. It is also incorrect when local
   errors are queued with ip_local_error. That happens in the
   context of a system call, which already returns the error
   code.
b) Inconsistent behavior: When there are pending errors on
   the error queue, sk_err is sometimes 0 (e.g., for
   the first timestamp on the error queue) and sometimes
   set to an error code (after dequeuing the first
   timestamp).
c) Suboptimality: Setting sk_err to ENOMSG on simple
   TX timestamps can abort parallel reads and writes.

Removing this line doesn't break userspace. This is because
userspace code cannot rely on sk_err for detecting whether
there is something on the error queue. Except for ICMP messages
received for UDP and RAW, sk_err is not set at enqueue time,
and as a result sk_err can be 0 while there are plenty of
errors on the error queue.

For ICMP packets in UDP and RAW, sk_err is set when they are
enqueued on the error queue, but that does not result in aborting
reads and writes. For such cases, sk_err is only readable via
getsockopt(SO_ERROR) which will reset the value of sk_err on
its own. More importantly, prior to this patch,
recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
sk_err is set by sock_dequeue_err_skb without atomic ops or
locks) which can store 0 in sk_err even when we have ICMP
messages pending. Removing this line from sock_dequeue_err_skb
eliminates that race.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:29:10 -05:00
Jesper Dangaard Brouer 84c46dd865 qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device
It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc, this were caught and fixed in[2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole, and
forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Jesper Dangaard Brouer 1159708432 net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a
default qdisc attached, given they will be backed by physical device,
that already have a qdisc attached for pushback.

It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as
this can be useful for difference policy reasons (e.g. bandwidth
limiting containers).  For this to work, the tx_queue_len need to have
a sane value, because some qdiscs inherit/copy the tx_queue_len
(namely, pfifo, bfifo, gred, htb, plug and sfb).

Commit a813104d92 ("IFF_NO_QUEUE: Fix for drivers not calling
ether_setup()") caught situations where some drivers didn't initialize
tx_queue_len.  The problem with the commit was choosing 1 as the
fallback value.

A qdisc queue length of 1 causes more harm than good, because it
creates hard to debug situations for userspace. It gives userspace a
false sense of a working config after attaching a qdisc.  As low
volume traffic (that doesn't activate the qdisc policy) works,
like ping, while traffic that e.g. needs shaping cannot reach the
configured policy levels, given the queue length is too small.

This patch change the value to DEFAULT_TX_QUEUE_LEN, given other
IFF_NO_QUEUE devices (that call ether_setup()) also use this value.

Fixes: a813104d92 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Jesper Dangaard Brouer d0a81f67cd net: make default TX queue length a defined constant
The default TX queue length of Ethernet devices have been a magic
constant of 1000, ever since the initial git import.

Looking back in historical trees[1][2] the value used to be 100,
with the same comment "Ethernet wants good queues". The commit[3]
that changed this from 100 to 1000 didn't describe why, but from
conversations with Robert Olsson it seems that it was changed
when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
link speed increased x10 the queue size were also adjusted.  This
value later caused much heartache for the bufferbloat community.

This patch merely moves the value into a defined constant.

[1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
[2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
[3] https://git.kernel.org/tglx/history/c/98921832c232

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Anna Schumaker bb29dd8433 SUNRPC: Fix suspicious RCU usage
We need to hold the rcu_read_lock() when calling rcu_dereference(),
otherwise we can't guarantee that the object being dereferenced still
exists.

Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 14:35:59 -05:00
Paolo Abeni 7c13f97ffd udp: do fwd memory scheduling on dequeue
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks	vanilla		patched
1		440		560
3		2150		2300
6		3650		3800
9		4450		4600
12		6250		6450

v1 -> v2:
 - do rmem and allocated memory scheduling under the receive lock
 - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
 - avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Paolo Abeni ad959036a7 net/sock: add an explicit sk argument for ip_cmsg_recv_offset()
So that we can use it even after orphaining the skbuff.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Marcelo Ricardo Leitner 7233bc84a3 sctp: assign assoc_id earlier in __sctp_connect
sctp_wait_for_connect() currently already holds the asoc to keep it
alive during the sleep, in case another thread release it. But Andrey
Konovalov and Dmitry Vyukov reported an use-after-free in such
situation.

Problem is that __sctp_connect() doesn't get a ref on the asoc and will
do a read on the asoc after calling sctp_wait_for_connect(), but by then
another thread may have closed it and the _put on sctp_wait_for_connect
will actually release it, causing the use-after-free.

Fix is, instead of doing the read after waiting for the connect, do it
before so, and avoid this issue as the socket is still locked by then.
There should be no issue on returning the asoc id in case of failure as
the application shouldn't trust on that number in such situations
anyway.

This issue doesn't exist in sctp_sendmsg() path.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:18:37 -05:00
David Ahern cd2c0f4540 net: Update raw socket bind to consider l3 domain
Binding a raw socket to a local address fails if the socket is bound
to an L3 domain:

    $ vrf-test  -s -l 10.100.1.2 -R -I red
    error binding socket: 99: Cannot assign requested address

Update raw_bind to look consider if sk_bound_dev_if is bound to an L3
domain and use inet_addr_type_table to lookup the address.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:14:34 -05:00
Linus Torvalds fb415f222c Fixes for some recent regressions including fallout from the vmalloc'd
stack change (after which we can no longer encrypt stuff on the stack).
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYHNwpAAoJECebzXlCjuG++DMP/3mUUAF09DfFR/EHl7knDT1f
 kZ53UVHYzr02w0wXfwxVLlp2H7TdSAufgsSvPT6qksA3eY7gL6nJ9zHkl+Nv5yCx
 y6vsFWjO1QEUWFOZWCKcmT2dAI3Ddt9IhK13pfZEKN1XKvK2zWB16HEVzSg6fR2K
 NwHlpMnQUI4HWThURzwTZb1M5YhxRCAnyiv8BTAAPjbEfzPzdL7j3jxwqtH8bOWp
 qIcDDvjC744b9zy0YuAEY/NyGBhYZPdM6gWsBBes1TRzBWUL9qsUYTWDJTmg/F1l
 Or0Jz7CUEN9uOHLGnkATPDc+eBg9YFV+bSsSnJu1/W4Er7dX1Af/lol79zEp/Zw1
 Snd9FelSPj3vxmYAFTCLnHRTRgsyiDhbbb7gVrzH9bxnCrRNR6p2kY018s1Cl9Td
 uWQoNNFQwwnYxWYEeZdO5PgX+pcgoCzhHACNk5oA93YaBE0GuLHHugwwIrYE8TM1
 1iY20sLC5lJcnPqxdgnoprZnnHMuL6rx5KRbvBeflNZ4huK2PIcPJyeB83XH6s12
 G67PjJ0rfWzSBF14O/ZtQA6he+kXvnH3pKqpNnaMiBxZZ2J8E1eQvrKTLLIwmtlP
 18KKJpZIzh7jTTZ/99nAMAt/BGw97P9TToLdnI8dCxYygHEaywpEYtcsE8IWFAvA
 3XkS5QdlJhhAaAUUYBXy
 =oPbZ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fixes for some recent regressions including fallout from the vmalloc'd
  stack change (after which we can no longer encrypt stuff on the
  stack)"

* tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix general protection fault in release_lock_stateid()
  svcrdma: backchannel cannot share a page for send and rcv buffers
  sunrpc: fix some missing rq_rbuffer assignments
  sunrpc: don't pass on-stack memory to sg_set_buf
  nfsd: move blocked lock handling under a dedicated spinlock
2016-11-04 20:12:10 -07:00
Lorenzo Colitti e2d118a1cb net: inet: Support UID-based routing in IP protocols.
- Use the UID in routing lookups made by protocol connect() and
  sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
  (e.g., Path MTU discovery) take the UID of the socket into
  account.
- For packets not associated with a userspace socket, (e.g., ping
  replies) use UID 0 inside the user namespace corresponding to
  the network namespace the socket belongs to. This allows
  all namespaces to apply routing and iptables rules to
  kernel-originated traffic in that namespaces by matching UID 0.
  This is better than using the UID of the kernel socket that is
  sending the traffic, because the UID of kernel sockets created
  at namespace creation time (e.g., the per-processor ICMP and
  TCP sockets) is the UID of the user that created the socket,
  which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:23 -04:00
Lorenzo Colitti 622ec2c9d5 net: core: add UID to flows, rules, and routes
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
  range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
  rtnetlink. The value INVALID_UID indicates no UID was
  specified.
- Add a UID field to the flow structures.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:23 -04:00
Lorenzo Colitti 86741ec254 net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.

Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().

Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:

1. Whenever sk_socket is non-null: sk_uid is the same as the UID
   in sk_socket, i.e., matches the return value of sock_i_uid.
   Specifically, the UID is set when userspace calls socket(),
   fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
   - For a socket that no longer has a sk_socket because
     userspace has called close(): the previous UID.
   - For a cloned socket (e.g., an incoming connection that is
     established but on which userspace has not yet called
     accept): the UID of the socket it was cloned from.
   - For a socket that has never had an sk_socket: UID 0 inside
     the user namespace corresponding to the network namespace
     the socket belongs to.

Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:22 -04:00
Sven Eckelmann e13258f38e batman-adv: Detect missing primaryif during tp_send as error
The throughput meter detects different situations as problems for the
current test. It stops the test after these and reports it to userspace.
This also has to be done when the primary interface disappeared during the
test.

Fixes: 33a3bb4a33 ("batman-adv: throughput meter implementation")
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-04 12:27:39 +01:00
Sven Eckelmann 27915aa610 batman-adv: Revert "fix splat on disabling an interface"
The commit 9799c50372 ("batman-adv: fix splat on disabling an interface")
fixed a warning but at the same time broke the rtnl function add_slave for
devices which were temporarily removed.

batadv_softif_slave_add requires soft_iface of and hard_iface to be NULL
before it is allowed to be enslaved. But this resetting of soft_iface to
NULL in batadv_hardif_disable_interface was removed with the aforementioned
commit.

Reported-by: Julian Labus <julian@freifunk-rtk.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-04 12:27:34 +01:00
WANG Cong 00ffc1ba02 genetlink: fix a memory leak on error path
In __genl_register_family(), when genl_validate_assign_mc_groups()
fails, we forget to free the memory we possibly allocate for
family->attrbuf.

Note, some callers call genl_unregister_family() to clean up
on error path, it doesn't work because the family is inserted
to the global list in the nearly last step.

Cc: Jakub Kicinski <kubakici@wp.pl>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:52:29 -04:00
Eric Dumazet 990ff4d844 ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
While fuzzing kernel with syzkaller, Andrey reported a nasty crash
in inet6_bind() caused by DCCP lacking a required method.

Fixes: ab1e0a13d7 ("[SOCK] proto: Add hashinfo member to struct proto")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:50:27 -04:00
Simon Horman 5976c5f45c net/sched: cls_flower: Support matching on SCTP ports
Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: \
        flower indev eth0 ip_proto sctp dst_port 80 \
        action drop

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:26:39 -04:00
Eric Dumazet 1aa9d1a0e7 ipv6: dccp: fix out of bound access in dccp_v6_err()
dccp_v6_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmpv6_notify() are more than enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet 93636d1f1f netlink: netlink_diag_dump() runs without locks
A recent commit removed locking from netlink_diag_dump() but forgot
one error case.

=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ #336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock ([   36.220068] nl_table_lock
) at:
[<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!

other info that might help us debug this:
3 locks held by syz-executor/4018:
 #0: [   36.220068]  (
sock_diag_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
 #1: [   36.220068]  (
sock_diag_table_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
 #2: [   36.220068]  (
nlk->cb_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0

stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
 ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
 dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
kernel/locking/lockdep.c:3388
 [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
 [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
 [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
 [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
 [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
 [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110

Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet 6706a97fec dccp: fix out of bound access in dccp_v4_err()
dccp_v4_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

This patch might allow to process more ICMP messages, as some routers
are still limiting the size of reflected bytes to 28 (RFC 792), instead
of extended lengths (RFC 1812 4.3.2.3)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet 346da62cc1 dccp: do not send reset to already closed sockets
Andrey reported following warning while fuzzing with syzkaller

WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
 ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
 0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179
 [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542
 [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
 [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
 [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
 [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
 [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570
 [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017
 [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208
 [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244
 [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116
 [<     inline     >] exit_task_work include/linux/task_work.h:21
 [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828
 [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931
 [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307
 [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
 [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130
arch/x86/entry/common.c:156
 [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0
arch/x86/entry/common.c:259
 [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

Fix this the same way we did for TCP in commit 565b7b2d2e
("tcp: do not send reset to already closed sockets")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet c3f24cfb3e dccp: do not release listeners too soon
Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [<     inline     >] dst_input ./include/net/dst.h:507
 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [<     inline     >] new_sync_write fs/read_write.c:499
 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
 [<     inline     >] SYSC_write fs/read_write.c:607
 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:50 -04:00
Eric Dumazet 79d8665b95 tcp: fix return value for partial writes
After my commit, tcp_sendmsg() might restart its loop after
processing socket backlog.

If sk_err is set, we blindly return an error, even though we
copied data to user space before.

We should instead return number of bytes that could be copied,
otherwise user space might resend data and corrupt the stream.

This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
to process timestamps.

Issue was diagnosed by Soheil and Willem, big kudos to them !

Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:12:06 -04:00
Lance Richardson 9ee6c5dc81 ipv4: allow local fragmentation in ip_finish_output_gso()
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b6 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:10:26 -04:00
Willem de Bruijn dbd1759e6a ipv6: on reassembly, record frag_max_size
IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Willem de Bruijn 0cc0aa614b ipv6: add IPV6_RECVFRAGSIZE cmsg
When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.

At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.

Tested using the test for IP_RECVFRAGSIZE plus

  ip netns exec to ip addr add dev veth1 fc07::1/64
  ip netns exec from ip addr add dev veth0 fc07::2/64

  ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
  ip netns exec from nc -q 1 -u fc07::1 6000 < payload

Both with and without enabling connection tracking

  ip6tables -A INPUT -m state --state NEW -p udp -j LOG

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Willem de Bruijn 70ecc24841 ipv4: add IP_RECVFRAGSIZE cmsg
The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.

Tested:
  Sent data over a veth pair of which the source has a small mtu.
  Sent data using netcat, received using a dedicated process.

  Verified that the cmsg IP_RECVFRAGSIZE is returned only when
  data arrives fragmented, and in that cases matches the veth mtu.

    ip link add veth0 type veth peer name veth1

    ip netns add from
    ip netns add to

    ip link set dev veth1 netns to
    ip netns exec to ip addr add dev veth1 192.168.10.1/24
    ip netns exec to ip link set dev veth1 up

    ip link set dev veth0 netns from
    ip netns exec from ip addr add dev veth0 192.168.10.2/24
    ip netns exec from ip link set dev veth0 up
    ip netns exec from ip link set dev veth0 mtu 1300
    ip netns exec from ethtool -K veth0 ufo off

    dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

    ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
    ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

  using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Eric Dumazet ac9e70b17e tcp: fix potential memory corruption
Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.

Then max_skb_frags is lowered to 14 or smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.

Fixes: 5f74f82ea3 ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:33:30 -04:00
Cyrill Gorcunov 9999370fae net: ip, raw_diag -- Use jump for exiting from nested loop
I managed to miss that sk_for_each is called under "for"
cycle so need to use goto here to return matching socket.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:25:26 -04:00
Cyrill Gorcunov cd05a0eca8 net: ip, raw_diag -- Fix socket leaking for destroy request
In raw_diag_destroy the helper raw_sock_get returns
with sock_hold call, so we have to put it then.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:25:26 -04:00
WANG Cong 14135f30e3 inet: fix sleeping inside inet_wait_for_connect()
Andrey reported this kernel warning:

  WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
  __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
  do not call blocking ops when !TASK_RUNNING; state=1 set at
  [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
  kernel/sched/wait.c:178
  Modules linked in:
  CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
   ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
   ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
  Call Trace:
   [<     inline     >] __dump_stack lib/dump_stack.c:15
   [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
   [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
   [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
   [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
   [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
   [<     inline     >] slab_alloc_node mm/slub.c:2634
   [<     inline     >] slab_alloc mm/slub.c:2716
   [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
   [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
   [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
   [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
   [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
   [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
   [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
   [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
   [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
   [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
   [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
   [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
   [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
   [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
   [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
   [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
   [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
   [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
  arch/x86/entry/entry_64.S:209

Unlike commit 26cabd3125
("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
sleeping function is called before schedule_timeout(), this is indeed
a bug. Fix this by moving the wait logic to the new API, it is similar
to commit ff960a7317
("netdev, sched/wait: Fix sleeping inside wait event").

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:18:07 -04:00
Pablo Neira Ayuso 08733a0cb7 netfilter: handle NF_REPEAT from nf_conntrack_in()
NF_REPEAT is only needed from nf_conntrack_in() under a very specific
case required by the TCP protocol tracker, we can handle this case
without returning to the core hook path. Handling of NF_REPEAT from the
nf_reinject() is left untouched.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:53:00 +01:00
Pablo Neira Ayuso 26dfab7216 netfilter: merge nf_iterate() into nf_hook_slow()
nf_iterate() has become rather simple, we can integrate this code into
nf_hook_slow() to reduce the amount of LOC in the core path.

However, we still need nf_iterate() around for nf_queue packet handling,
so move this function there where we only need it. I think it should be
possible to refactor nf_queue code to get rid of it definitely, but
given this is slow path anyway, let's have a look this later.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:59 +01:00
Pablo Neira Ayuso 01886bd91f netfilter: remove hook_entries field from nf_hook_state
This field is only useful for nf_queue, so store it in the
nf_queue_entry structure instead, away from the core path. Pass
hook_head to nf_hook_slow().

Since we always have a valid entry on the first iteration in
nf_iterate(), we can use 'do { ... } while (entry)' loop instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:58 +01:00
Pablo Neira Ayuso c63cbc4604 netfilter: use switch() to handle verdict cases from nf_hook_slow()
Use switch() for verdict handling and add explicit handling for
NF_STOLEN and other non-conventional verdicts.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:57 +01:00
Pablo Neira Ayuso 0e5a1c7eb3 netfilter: nf_tables: use hook state from xt_action_param structure
Don't copy relevant fields from hook state structure, instead use the
one that is already available in struct xt_action_param.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:34 +01:00
Pablo Neira Ayuso 613dbd9572 netfilter: x_tables: move hook state into xt_action_param structure
Place pointer to hook state in xt_action_param structure instead of
copying the fields that we need. After this change xt_action_param fits
into one cacheline.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:21 +01:00
Pablo Neira Ayuso 06fd3a392b netfilter: deprecate NF_STOP
NF_STOP is only used by br_netfilter these days, and it can be emulated
with a combination of NF_STOLEN plus explicit call to the ->okfn()
function as Florian suggests.

To retain binary compatibility with userspace nf_queue application, we
have to keep NF_STOP around, so libnetfilter_queue userspace userspace
applications still work if they use NF_STOP for some exotic reason.

Out of tree modules using NF_STOP would break, but we don't care about
those.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:17 +01:00
Pablo Neira Ayuso 1610a73c41 netfilter: kill NF_HOOK_THRESH() and state->tresh
Patch c5136b15ea ("netfilter: bridge: add and use br_nf_hook_thresh")
introduced br_nf_hook_thresh().

Replace NF_HOOK_THRESH() by br_nf_hook_thresh from
br_nf_forward_finish(), so we have no more callers for this macro.

As a result, state->thresh and explicit thresh parameter in the hook
state structure is not required anymore. And we can get rid of
skip-hook-under-thresh loop in nf_iterate() in the core path that is
only used by br_netfilter to search for the filter hook.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:12 +01:00
Pablo Neira Ayuso d2be66f685 netfilter: remove comments that predate rcu days
We cannot block/sleep on nf_iterate because netfilter runs under rcu
read lock these days, where blocking is well-known to be illegal. So
let's remove these old comments.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:05 +01:00
Pablo Neira Ayuso b250a7fc3b netfilter: get rid of useless debugging from core
This patch remove compile time code to catch inconventional verdicts.
We have better ways to handle this case these days, eg. pr_debug() but
even though I don't think this is useful at all, so let's remove this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:55:54 +01:00
Tom Herbert 1913540a13 ila: Fix crash caused by rhashtable changes
commit ca26893f05 ("rhashtable: Add rhlist interface")
added a field to rhashtable_iter so that length became 56 bytes
and would exceed the size of args in netlink_callback (which is
48 bytes). The netlink diag dump function already has been
allocating a iter structure and storing the pointed to that
in the args of netlink_callback. ila_xlat also uses
rhahstable_iter but is still putting that directly in
the arg block. Now since rhashtable_iter size is increased
we are overwriting beyond the structure. The next field
happens to be cb_mutex pointer in netlink_sock and hence the crash.

Fix is to alloc the rhashtable_iter and save it as pointer
in arg.

Tested:

  modprobe ila
  ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
  ./ip ila list  # NO crash now

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:26:02 -04:00
Cyrill Gorcunov 3de864f8c9 net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect
While being preparing patches for killing raw sockets via
diag netlink interface I noticed that my runs are stuck:

 | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
 | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
 | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
 | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
 | [<ffffffff8179b517>] raw_abort+0x33/0x42
 | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52

which has not been the case before. I narrowed it down to the commit

 | commit 286c72deab
 | Author: Eric Dumazet <edumazet@google.com>
 | Date:   Thu Oct 20 09:39:40 2016 -0700
 |
 |     udp: must lock the socket in udp_disconnect()

where we start locking the socket for different reason.

So the raw_abort escaped the renaming and we have to
fix this typo using __udp_disconnect instead.

Fixes: 286c72deab ("udp: must lock the socket in udp_disconnect()")
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:25:16 -04:00
Eric Dumazet 2331ccc5b3 tcp: enhance tcp collapsing
As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
time even if the skb at the right has fragments.

We simply have to use more generic skb_copy_bits() instead of
skb_copy_from_linear_data() in tcp_collapse_retrans()

Also need to guard this skb_copy_bits() in case there is nothing to
copy, otherwise skb_put() could panic if left skb has frags.

Tested:

Used following packetdrill test

// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.100 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 write(4, ..., 200) = 200
   +0 > P. 1:201(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 201:401(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 401:601(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 601:801(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 801:1001(200) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1001:1101(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1101:1201(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1201:1301(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1301:1401(100) ack 1

+.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
// Check that TCP collapse works :
   +0 > P. 1:1001(1000) ack 1

Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:21:36 -04:00
Eli Cooper 4fd19c15de ip6_udp_tunnel: remove unused IPCB related codes
Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are
never used before it reaches ip6tunnel_xmit(), and past that point the
control buffer is no longer interpreted as IPCB.

This clears these unused IPCB related codes. Currently there is no skb
scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be
cleared for IPv4 packets, as shown in 5146d1f151
("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").

Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:18:36 -04:00
Xin Long e4ff952a7e sctp: clean up sctp_packet_transmit
After adding sctp gso, sctp_packet_transmit is a quite big function now.

This patch is to extract the codes for packing packet to sctp_packet_pack
from sctp_packet_transmit, and add some comments, simplify the err path by
freeing auth chunk when freeing packet chunk_list in out path and freeing
head skb early if it fails to pack packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:03:13 -04:00
Roi Dayan 13fa876ebd net/sched: cls_flower: merge filter delete/destroy common code
Move common code from fl_delete and fl_detroy to __fl_delete.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:00:47 -04:00
Roi Dayan a1a8f7fe92 net/sched: cls_flower: add missing unbind call when destroying flows
tcf_unbind was called in fl_delete but was missing in fl_destroy when
force deleting flows.

Fixes: 77b9900ef5 ('tc: introduce Flower classifier')
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:00:47 -04:00
David S. Miller 4cb551a100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:

1) Add fib lookup expression for nf_tables, from Florian Westphal. This
   new expression provides a native replacement for iptables addrtype
   and rp_filter matches. This is more flexible though, since we can
   populate the kernel flowi representation to inquire fib to
   accomodate new usecases, such as RTBH through skb mark.

2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
   new expression allow you to access skbuff route metadata, more
   specifically nexthop and classid fields.

3) Add notrack support for nf_tables, to skip conntracking, requested by
   many users already.

4) Add boilerplate code to allow to use nf_log infrastructure from
   nf_tables ingress.

5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
   the xtables cluster match, from Liping Zhang.

6) Move socket lookup code into generic nf_socket_* infrastructure so
   we can provide a native replacement for the xtables socket match.

7) Make sure nfnetlink_queue data that is updated on every packets is
   placed in a different cache from read-only data, from Florian Westphal.

8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.

9) Start round robin number generation in nft_numgen from zero,
   instead of n-1, for consistency with xtables statistics match,
   patch from Liping Zhang.

10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
    given we retry with a smaller allocation on failure, from Calvin Owens.

11) Cleanup xt_multiport to use switch(), from Gao feng.

12) Remove superfluous check in nft_immediate and nft_cmp, from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 14:57:47 -04:00
Florian Westphal 886bc50348 netfilter: nf_queue: place volatile data in own cacheline
As the comment indicates, the data at the end of nfqnl_instance struct is
written on every queue/dequeue, so it should reside in its own cacheline.

Before this change, 'lock' was in first cacheline so we dirtied both.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:33 +01:00
Liping Zhang e41e9d623c netfilter: nf_tables: remove useless U8_MAX validation
After call nft_data_init, size is already validated and desc.len will
not exceed the sizeof(struct nft_data), i.e. 16 bytes. So it will never
exceed U8_MAX.

Furthermore, in nft_immediate_init, we forget to call nft_data_uninit
when desc.len exceeds U8_MAX, although this will not happen, but it's
a logical mistake.

Now remove these redundant validation introduced by commit 36b701fae1
("netfilter: nf_tables: validate maximum value of u32 netlink attributes")

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:32 +01:00
Anders K. Pedersen 2fa841938c netfilter: nf_tables: introduce routing expression
Introduces an nftables rt expression for routing related data with support
for nexthop (i.e. the directly connected IP address that an outgoing packet
is sent to), which can be used either for matching or accounting, eg.

 # nft add rule filter postrouting \
	ip daddr 192.168.1.0/24 rt nexthop != 192.168.0.1 drop

This will drop any traffic to 192.168.1.0/24 that is not routed via
192.168.0.1.

 # nft add rule filter postrouting \
	flow table acct { rt nexthop timeout 600s counter }
 # nft add rule ip6 filter postrouting \
	flow table acct { rt nexthop timeout 600s counter }

These rules count outgoing traffic per nexthop. Note that the timeout
releases an entry if no traffic is seen for this nexthop within 10 minutes.

 # nft add rule inet filter postrouting \
	ether type ip \
	flow table acct { rt nexthop timeout 600s counter }
 # nft add rule inet filter postrouting \
	ether type ip6 \
	flow table acct { rt nexthop timeout 600s counter }

Same as above, but via the inet family, where the ether type must be
specified explicitly.

"rt classid" is also implemented identical to "meta rtclassid", since it
is more logical to have this match in the routing expression going forward.

Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:31 +01:00
Pablo Neira Ayuso 8db4c5be88 netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c
We need this split to reuse existing codebase for the upcoming nf_tables
socket expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:31 +01:00
Pablo Neira Ayuso 1fddf4bad0 netfilter: nf_log: add packet logging for netdev family
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.

This patch adds the boiler plate code to register the netdev logging
family.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:30 +01:00
Florian Westphal f6d0cbcf09 netfilter: nf_tables: add fib expression
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
just dispatches to ipv4 or ipv6 one based on nfproto).

Currently supports fetching output interface index/name and the
rtm_type associated with an address.

This can be used for adding path filtering. rtm_type is useful
to e.g. enforce a strong-end host model where packets
are only accepted if daddr is configured on the interface the
packet arrived on.

The fib expression is a native nftables alternative to the
xtables addrtype and rp_filter matches.

FIB result order for oif/oifname retrieval is as follows:
 - if packet is local (skb has rtable, RTF_LOCAL set, this
   will also catch looped-back multicast packets), set oif to
   the loopback interface.
 - if fib lookup returns an error, or result points to local,
   store zero result.  This means '--local' option of -m rpfilter
   is not supported. It is possible to use 'fib type local' or add
   explicit saddr/daddr matching rules to create exceptions if this
   is really needed.
 - store result in the destination register.
   In case of multiple routes, search set for desired oif in case
   strict matching is requested.

ipv4 and ipv6 behave fib expressions are supposed to behave the same.

[ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")

	http://patchwork.ozlabs.org/patch/688615/

  to address fallout from this patch after rebasing nf-next, that was
  posted to address compilation warnings. --pablo ]

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:14 +01:00
Chuck Lever 8d42629be0 svcrdma: backchannel cannot share a page for send and rcv buffers
The underlying transport releases the page pointed to by rq_buffer
during xprt_rdma_bc_send_request. When the backchannel reply arrives,
rq_rbuffer then points to freed memory.

Fixes: 68778945e4 ('SUNRPC: Separate buffer pointers for RPC ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:23:58 -04:00
Isaac Boukris e7947ea770 unix: escape all null bytes in abstract unix domain socket
Abstract unix domain socket may embed null characters,
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.

This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).

Signed-off-by: Isaac Boukris <iboukris@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 12:15:13 -04:00
Wei Yongjun 22ca904ad7 genetlink: fix error return code in genl_register_family()
Fix to return a negative error code from the idr_alloc() error handling
case instead of 0, as done elsewhere in this function.

Also fix the return value check of idr_alloc() since idr_alloc return
negative errors on failure, not zero.

Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 12:13:13 -04:00
David Ahern e58e415968 net: Enable support for VRF with ipv4 multicast
Enable support for IPv4 multicast:
- similar to unicast the flow struct is updated to L3 master device
  if relevant prior to calling fib_rules_lookup. The table id is saved
  to the lookup arg so the rule action for ipmr can return the table
  associated with the device.

- ip_mr_forward needs to check for master device mismatch as well
  since the skb->dev is set to it

- allow multicast address on VRF device for Rx by checking for the
  daddr in the VRF device as well as the original ingress device

- on Tx need to drop to __mkroute_output when FIB lookup fails for
  multicast destination address.

- if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
  IPMR FIB rules on first device create similar to FIB rules. In
  addition the VRF driver does not divert IPv4 multicast packets:
  it breaks on Tx since the fib lookup fails on the mcast address.

With this patch, ipmr forwarding and local rx/tx work.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:54:26 -04:00
Parthasarathy Bhuvaragan f40acbaf42 tipc: remove SS_CONNECTED sock state
In this commit, we replace references to sock->state SS_CONNECTE
with sk_state TIPC_ESTABLISHED.

Finally, the sock->state is no longer explicitly used by tipc.
The FSM below is for various types of connection oriented sockets.

Stream Server Listening Socket:
+-----------+       +-------------+
| TIPC_OPEN |------>| TIPC_LISTEN |
+-----------+       +-------------+

Stream Server Data Socket:
+-----------+       +------------------+
| TIPC_OPEN |------>| TIPC_ESTABLISHED |
+-----------+       +------------------+
                          ^   |
                          |   |
                          |   v
                    +--------------------+
                    | TIPC_DISCONNECTING |
                    +--------------------+

Stream Socket Client:
+-----------+       +-----------------+
| TIPC_OPEN |------>| TIPC_CONNECTING |------+
+-----------+       +-----------------+      |
                            |                |
                            |                |
                            v                |
                    +------------------+     |
                    | TIPC_ESTABLISHED |     |
                    +------------------+     |
                          ^   |              |
                          |   |              |
                          |   v              |
                    +--------------------+   |
                    | TIPC_DISCONNECTING |<--+
                    +--------------------+

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan 99a2088981 tipc: create TIPC_CONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_CONNECTING
by primarily replacing the SS_CONNECTING with TIPC_CONNECTING.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan 6f00089c73 tipc: remove SS_DISCONNECTING state
In this commit, we replace the references to SS_DISCONNECTING with
the combination of sk_state TIPC_DISCONNECTING and flags set in
sk_shutdown.
We introduce a new function _tipc_shutdown(), which provides
the common code required by tipc_release() and tipc_shutdown().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 9fd4b070f6 tipc: create TIPC_DISCONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_DISCONNECTING in
sk_state. TIPC_DISCONNECTING is replacing the socket connection status
update using SS_DISCONNECTING.
TIPC_DISCONNECTING is set for connection oriented sockets at:
- tipc_shutdown()
- connection probe timeout
- when we receive an error message on the connection.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 438adcaf0d tipc: create TIPC_OPEN as a new sk_state
In this commit, we create a new tipc socket state TIPC_OPEN in
sk_state. We primarily replace the SS_UNCONNECTED sock->state with
TIPC_OPEN.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 8ea642ee9a tipc: create TIPC_ESTABLISHED as a new sk_state
Until now, tipc maintains probing state for connected sockets in
tsk->probing_state variable.

In this commit, we express this information as socket states and
this remove the variable. We set probe_unacked flag when a probe
is sent out and reset it if we receive a reply. Instead of the
probing state TIPC_CONN_OK, we create a new state TIPC_ESTABLISHED.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 0c288c8692 tipc: create TIPC_LISTEN as a new sk_state
Until now, tipc maintains the socket state in sock->state variable.
This is used to maintain generic socket states, but in tipc
we overload it and save tipc socket states like TIPC_LISTEN.
Other protocols like TCP, UDP store protocol specific states
in sk->sk_state instead.

In this commit, we :
- declare a new tipc state TIPC_LISTEN, that replaces SS_LISTEN
- Create a new function tipc_set_state(), to update sk->sk_state.
- TIPC_LISTEN state is maintained in sk->sk_state.
- replace references to SS_LISTEN with TIPC_LISTEN.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan c752023aab tipc: remove socket state SS_READY
Until now, tipc socket state SS_READY declares that the socket is a
connectionless socket.

In this commit, we remove the state SS_READY and replace it with a
condition which returns true for datagram / connectionless sockets.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan 360aab6b49 tipc: remove probing_intv from tipc_sock
Until now, probing_intv is a variable in struct tipc_sock but is
always set to a constant CONN_PROBING_INTERVAL. The socket
connection is probed based on this value.

In this commit, we remove this variable and setup the socket
timer based on the constant CONN_PROBING_INTERVAL.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan d6fb7e9c99 tipc: remove tsk->connected from tipc_sock
Until now, we determine if a socket is connected or not based on
tsk->connected, which is set once when the probing state is set
to TIPC_CONN_OK. It is unset when the sock->state is updated from
SS_CONNECTED to any other state.

In this commit, we remove connected variable from tipc_sock and
derive socket connection status from the following condition:
sock->state == SS_CONNECTED => tsk->connected

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan 87227fe7e4 tipc: remove tsk->connected for connectionless sockets
Until now, for connectionless sockets the peer information during
connect is stored in tsk->peer and a connection state is set in
tsk->connected. This is redundant.

In this commit, for connectionless sockets we update:
- __tipc_sendmsg(), when the destination is NULL the peer existence
  is determined by tsk->peer.family, instead of tsk->connected.
- tipc_connect(), remove set/unset of tsk->connected.
Hence tsk->connected is no longer used for connectionless sockets.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan aeda16b6ae tipc: rename tsk->remote to tsk->peer for consistent naming
Until now, the peer information for connect is stored in tsk->remote
but the rest of code uses the name peer for peer/remote.

In this commit, we rename tsk->remote to tsk->peer to align with
naming convention followed in the rest of the code.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan ba8aebe943 tipc: rename struct tipc_skb_cb member handle to bytes_read
In this commit, we rename handle to bytes_read indicating the
purpose of the member.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan cb5da847af tipc: set kern=0 in sk_alloc() during tipc_accept()
Until now, tipc_accept() calls sk_alloc() with kern=1. This is
incorrect as the data socket's owner is the user application.
Thus for these accepted data sockets the network namespace
refcount is skipped.

In this commit, we fix this by setting kern=0.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan 4891d8fe16 tipc: wakeup sleeping users at disconnect
Until now, in filter_connect() when we terminate a connection due to
an error message from peer, we set the socket state to DISCONNECTING.

The socket is notified about this broken connection using EPIPE when
a user tries to send a message. However if a socket was waiting on a
poll() while the connection is being terminated, we fail to wakeup
that socket.

In this commit, we wakeup sleeping sockets at connection termination.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan 7cf87fa278 tipc: return early for non-blocking sockets at link congestion
Until now, in stream/mcast send() we pass the message to the link
layer even when the link is congested and add the socket to the
link's wakeup queue. This is unnecessary for non-blocking sockets.
If a socket is set to non-blocking and sends multicast with zero
back off time while receiving EAGAIN, we exhaust the memory.

In this commit, we return immediately at stream/mcast send() for
non-blocking sockets.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
David S. Miller 2c8657d226 linux-can-fixes-for-4.9-20161031
-----BEGIN PGP SIGNATURE-----
 
 iQEwBAABCgAaBQJYF6ANExxta2xAcGVuZ3V0cm9uaXguZGUACgkQPTuqJaypJWqQ
 DAf+JRm7uhjaXEgITIhgiGIZBmCce7ONSBmBRr6cKcQyzMPaAHjfEZMcR5pltvH1
 z03bspu5oug3bfSxk4iYK9hNxxAV+URxFxBDbOMdnolAsl6ZBcpOkx2jj/JFByZ5
 fmOk8ChQr28qteEMLcTlm6O188eq2fZ+GUVZOh9iRslSVVcvGrM90NQoPFMUfP7G
 On9+rZzSZlyNvdCSdFwrjUDzMVEW77P1Hv3gG10koefgIZKLelmMOIwDg7EvGTon
 7HjqdZs6MFoSFAcbzJHB9lN0pAtK8nxWd3SPhdQof3kca8Dff/si4r5ZB2gAdSBq
 w17LfdWoVE4p5/XvSyM4SM/X9A==
 =Per+
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-4.9-20161031' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2016-10-31

this is a pull request of two patches for the upcoming v4.9 release.

The first patch is by Lukas Resch for the sja1000 plx_pci driver that adds
support for Moxa CAN devices. The second patch is by Oliver Hartkopp and fixes
a potential kernel panic in the CAN broadcast manager.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 21:01:03 -04:00
Xin Long dae399d7fd sctp: hold transport instead of assoc when lookup assoc in rx path
Prior to this patch, in rx path, before calling lock_sock, it needed to
hold assoc when got it by __sctp_lookup_association, in case other place
would free/put assoc.

But in __sctp_lookup_association, it lookup and hold transport, then got
assoc by transport->assoc, then hold assoc and put transport. It means
it didn't hold transport, yet it was returned and later on directly
assigned to chunk->transport.

Without the protection of sock lock, the transport may be freed/put by
other places, which would cause a use-after-free issue.

This patch is to fix this issue by holding transport instead of assoc.
As holding transport can make sure to access assoc is also safe, and
actually it looks up assoc by searching transport rhashtable, to hold
transport here makes more sense.

Note that the function will be renamed later on on another patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:20:33 -04:00
Xin Long 7c17fcc726 sctp: return back transport in __sctp_rcv_init_lookup
Prior to this patch, it used a local variable to save the transport that is
looked up by __sctp_lookup_association(), and didn't return it back. But in
sctp_rcv, it is used to initialize chunk->transport. So when hitting this,
even if it found the transport, it was still initializing chunk->transport
with null instead.

This patch is to return the transport back through transport pointer
that is from __sctp_rcv_lookup_harder().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:20:32 -04:00
Xin Long cd26da4ff4 sctp: hold transport instead of assoc in sctp_diag
In sctp_transport_lookup_process(), Commit 1cceda7849 ("sctp: fix
the issue sctp_diag uses lock_sock in rcu_read_lock") moved cb() out
of rcu lock, but it put transport and hold assoc instead, and ignore
that cb() still uses transport. It may cause a use-after-free issue.

This patch is to hold transport instead of assoc there.

Fixes: 1cceda7849 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:20:32 -04:00
Nikolay Aleksandrov 91b02d3d13 bridge: mcast: add router port on PIM hello message
When we receive a PIM Hello message on a port we can consider that it
has a multicast router attached, thus it is correct to add it to the
router list. The only catch is it shouldn't be considered for a querier.

Using Daniel's description:
leaf-11  leaf-12  leaf-13
       \   |    /
        bridge-1
         /    \
    host-11  host-12

 - all ports in bridge-1 are in a single vlan aware bridge
 - leaf-11 is the IGMP querier
 - leaf-13 is the PIM DR
 - host-11 TXes packets to 226.10.10.10
 - bridge-1 only forwards the 226.10.10.10 traffic out the port to
   leaf-11, it should also forward this traffic out the port to leaf-13

Suggested-by: Daniel Walton <dwalton@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:18:30 -04:00
Nikolay Aleksandrov 56245cae19 net: pim: add all RFC7761 message types
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:18:30 -04:00
Oliver Hartkopp deb507f91f can: bcm: fix warning in bcm_connect/proc_register
Andrey Konovalov reported an issue with proc_register in bcm.c.
As suggested by Cong Wang this patch adds a lock_sock() protection and
a check for unsuccessful proc_create_data() in bcm_connect().

Reference: http://marc.info/?l=linux-netdev&m=147732648731237

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-10-31 20:48:19 +01:00
Eric Dumazet 4f2e4ad56a net: mangle zero checksum in skb_checksum_help()
Sending zero checksum is ok for TCP, but not for UDP.

UDPv6 receiver should by default drop a frame with a 0 checksum,
and UDPv4 would not verify the checksum and might accept a corrupted
packet.

Simply replace such checksum by 0xffff, regardless of transport.

This error was caught on SIT tunnels, but seems generic.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:29:11 -04:00
Eric Dumazet e551c32d57 net: clear sk_err_soft in sk_clone_lock()
At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.

Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:25:55 -04:00
Florian Westphal ce6dd23329 dctcp: avoid bogus doubling of cwnd after loss
If a congestion control module doesn't provide .undo_cwnd function,
tcp_undo_cwnd_reduction() will set cwnd to

   tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);

... which makes sense for reno (it sets ssthresh to half the current cwnd),
but it makes no sense for dctcp, which sets ssthresh based on the current
congestion estimate.

This can cause severe growth of cwnd (eventually overflowing u32).

Fix this by saving last cwnd on loss and restore cwnd based on that,
similar to cubic and other algorithms.

Fixes: e3118e8359 ("net: tcp: add DCTCP congestion control algorithm")
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:16:28 -04:00
Alexander Duyck 184c449f91 net: Add support for XPS with QoS via traffic classes
This patch adds support for setting and using XPS when QoS via traffic
classes is enabled.  With this change we will factor in the priority and
traffic class mapping of the packet and use that information to correctly
select the queue.

This allows us to define a set of queues for a given traffic class via
mqprio and then configure the XPS mapping for those queues so that the
traffic flows can avoid head-of-line blocking between the individual CPUs
if so desired.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:48 -04:00
Alexander Duyck 6234f87407 net: Refactor removal of queues from XPS map and apply on num_tc changes
This patch updates the code for removing queues from the XPS map and makes
it so that we can apply the code any time we change either the number of
traffic classes or the mapping of a given block of queues.  This way we
avoid having queues pulling traffic from a foreign traffic class.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:48 -04:00
Alexander Duyck 8d059b0f6f net: Add sysfs value to determine queue traffic class
Add a sysfs attribute for a Tx queue that allows us to determine the
traffic class for a given queue.  This will allow us to more easily
determine this in the future.  It is needed as XPS will take the traffic
class for a group of queues into account in order to avoid pulling traffic
from one traffic class into another.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:47 -04:00
Alexander Duyck 9cf1f6a8c4 net: Move functions for configuring traffic classes out of inline headers
The functions for configuring the traffic class to queue mappings have
other effects that need to be addressed.  Instead of trying to export a
bunch of new functions just relocate the functions so that we can
instrument them directly with the functionality they will need.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:47 -04:00
Xin Long 19bda36c42 ipv6: add mtu lock check in __ip6_rt_update_pmtu
Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
It leaded to that mtu lock doesn't really work when receiving the pkt
of ICMPV6_PKT_TOOBIG.

This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
did in __ip_rt_update_pmtu.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 14:24:24 -04:00
Jakub Sitnicki f89c56ce71 ipv6: Don't use ufo handling on later transformed packets
Similar to commit c146066ab8 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).

Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:

  DEV=dum0 LEN=1493

  ip li add $DEV type dummy
  ip addr add fc00::1/64 dev $DEV nodad
  ip link set $DEV up
  ip xfrm policy add dir out src fc00::1 dst fc00::2 \
     tmpl src fc00::1 dst fc00::2 proto esp spi 1
  ip xfrm state add src fc00::1 dst fc00::2 \
     proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b

  tcpdump -n -nn -i $DEV -t &
  socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN

tcpdump output before:

  IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
  IP6 fc00::1 > fc00::2: frag (1448|48)
  IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88

... and after:

  IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
  IP6 fc00::1 > fc00::2: frag (1448|80)

Fixes: e89e9cf539 ("[IPv4/IPv6]: UFO Scatter-gather approach")

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 13:10:41 -04:00
Andrey Vagin c62cce2cae net: add an ioctl to get a socket network namespace
Each socket operates in a network namespace where it has been created,
so if we want to dump and restore a socket, we have to know its network
namespace.

We have a socket_diag to get information about sockets, it doesn't
report sockets which are not bound or connected.

This patch introduces a new socket ioctl, which is called SIOCGSKNS
and used to get a file descriptor for a socket network namespace.

A task must have CAP_NET_ADMIN in a target network namespace to
use this ioctl.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 10:56:36 -04:00
Liping Zhang b73b8a1ba5 netfilter: nft_dup: do not use sreg_dev if the user doesn't specify it
The NFTA_DUP_SREG_DEV attribute is not a must option, so we should use it
in routing lookup only when the user specify it.

Fixes: d877f07112 ("netfilter: nf_tables: add nft_dup expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31 13:17:38 +01:00
Liping Zhang c17c3cdff1 netfilter: nf_tables: destroy the set if fail to add transaction
When the memory is exhausted, then we will fail to add the NFT_MSG_NEWSET
transaction. In such case, we should destroy the set before we free it.

Fixes: 958bee14d0 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31 13:17:29 +01:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Sven Eckelmann 1ad5bcb2a0 batman-adv: Consume skb in batadv_send_skb_to_orig
Sending functions in Linux consume the supplied skbuff. Doing the same in
batadv_send_skb_to_orig avoids the hack of returning -1 (-EPERM) to signal
the caller that he is responsible for cleaning up the skb.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:37 +01:00
Sven Eckelmann 8def0be82d batman-adv: Consume skb in batadv_frag_send_packet
Sending functions in Linux consume the supplied skbuff. Doing the same in
batadv_frag_send_packet avoids the hack of returning -1 (-EPERM) to signal
the caller that he is responsible for cleaning up the skb.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:37 +01:00
Sven Eckelmann eaac2c876e batman-adv: Count all non-success TX packets as dropped
A failure during the submission also causes dropped packets.
batadv_interface_tx should therefore also increase the DROPPED counter for
these returns.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:36 +01:00
Sven Eckelmann bd687fe419 batman-adv: use consume_skb for non-dropped packets
kfree_skb assumes that an skb is dropped after an failure and notes that.
consume_skb should be used in non-failure situations. Such information is
important for dropmonitor netlink which tells how many packets were dropped
and where this drop happened.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:36 +01:00
Linus Lüssing 3111beed0d batman-adv: Simple (re)broadcast avoidance
With this patch, (re)broadcasting on a specific interfaces is avoided:

* No neighbor: There is no need to broadcast on an interface if there
  is no node behind it.

* Single neighbor is source: If there is just one neighbor on an
  interface and if this neighbor is the one we actually got this
  broadcast packet from, then we do not need to echo it back.

* Single neighbor is originator: If there is just one neighbor on
  an interface and if this neighbor is the originator of this
  broadcast packet, then we do not need to echo it back.

Goodies for BATMAN V:

("Upgrade your BATMAN IV network to V now to get these for free!")

Thanks to the split of OGMv1 into two packet types, OGMv2 and ELP
that is, we can now apply the same optimizations stated above to OGMv2
packets, too.

Furthermore, with BATMAN V, rebroadcasts can be reduced in certain
multi interface cases, too, where BATMAN IV cannot. This is thanks to
the removal of the "secondary interface originator" concept in BATMAN V.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:35 +01:00
Linus Lüssing cbebd363b2 batman-adv: Use own timer for multicast TT and TVLV updates
Instead of latching onto the OGM period, this patch introduces a worker
dedicated to multicast TT and TVLV updates.

The reasoning is, that upon roaming especially the translation table
should be updated timely to minimize connectivity issues.

With BATMAN V, the idea is to greatly increase the OGM interval to
reduce overhead. Unfortunately, right now this could lead to
a bad user experience if multicast traffic is involved.

Therefore this patch introduces a fixed 500ms update interval for
multicast TT entries and the multicast TVLV.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:35 +01:00
Linus Lüssing eb6915e2eb batman-adv: Remove unused skb_reset_mac_header()
During broadcast queueing, the skb_reset_mac_header() sets the skb
to a place invalid for a MAC header, pointing right into the
batman-adv broadcast packet. Luckily, no one seems to actually use
eth_hdr(skb) afterwards until batadv_send_skb_packet() resets the
header to a valid position again.

Therefore removing this unnecessary, weird skb_reset_mac_header()
call.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:34 +01:00
Linus Lüssing b77697633d batman-adv: Remove unnecessary lockdep in batadv_mcast_mla_list_free
batadv_mcast_mla_list_free() just frees some leftovers of a local feast
in batadv_mcast_mla_update(). No lockdep needed as it has nothing to do
with bat_priv->mcast.mla_list.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:34 +01:00
Linus Lüssing cbfc16de30 batman-adv: Add wrapper for ARP reply creation
Removing duplicate code.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:33 +01:00
Sven Eckelmann 75721643d5 batman-adv: Close two alignment holes in batadv_hard_iface
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:33 +01:00
Sven Eckelmann ce1a21d142 batman-adv: Mark batadv_netlink_ops as const
The genl_ops don't need to be written by anyone and thus can be moved in a
ro memory range.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:32 +01:00
Sven Eckelmann 9bcb94c861 batman-adv: Introduce missing headers for genetlink restructure
Fixes: 56989f6d85 ("genetlink: mark families as __ro_after_init")
Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:32 +01:00
Linus Torvalds 2a26d99b25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of fixes, mostly drivers as is usually the case.

   1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
      Khoroshilov.

   2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
      Pedersen.

   3) Don't put aead_req crypto struct on the stack in mac80211, from
      Ard Biesheuvel.

   4) Several uninitialized variable warning fixes from Arnd Bergmann.

   5) Fix memory leak in cxgb4, from Colin Ian King.

   6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

   7) Several VRF semantic fixes from David Ahern.

   8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

   9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

  10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

  11) Fix stale link state during failover in NCSCI driver, from Gavin
      Shan.

  12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

  13) Propvide proper handle when emitting notifications of filter
      deletes, from Jamal Hadi Salim.

  14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

  15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

  16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

  17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

  18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
      Leitner.

  19) Revert a netns locking change that causes regressions, from Paul
      Moore.

  20) Add recursion limit to GRO handling, from Sabrina Dubroca.

  21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

  22) Avoid accessing stale vxlan/geneve socket in data path, from
      Pravin Shelar"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
  geneve: avoid using stale geneve socket.
  vxlan: avoid using stale vxlan socket.
  qede: Fix out-of-bound fastpath memory access
  net: phy: dp83848: add dp83822 PHY support
  enic: fix rq disable
  tipc: fix broadcast link synchronization problem
  ibmvnic: Fix missing brackets in init_sub_crq_irqs
  ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
  Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
  arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
  net/mlx4_en: Save slave ethtool stats command
  net/mlx4_en: Fix potential deadlock in port statistics flow
  net/mlx4: Fix firmware command timeout during interrupt test
  net/mlx4_core: Do not access comm channel if it has not yet been initialized
  net/mlx4_en: Fix panic during reboot
  net/mlx4_en: Process all completions in RX rings after port goes up
  net/mlx4_en: Resolve dividing by zero in 32-bit system
  net/mlx4_core: Change the default value of enable_qos
  net/mlx4_core: Avoid setting ports to auto when only one port type is supported
  net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
  ...
2016-10-29 20:33:20 -07:00
pravin shelar 0e82c76359 genetlink: Fix generic netlink family unregister
This patch fixes a typo in unregister operation.

Following crash is fixed by this patch. It can be easily reproduced
by repeating modprobe and rmmod module that uses genetlink.

[  261.446686] BUG: unable to handle kernel paging request at ffffffffa0264088
[  261.448921] IP: [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.450494] PGD 1c09067
[  261.451266] PUD 1c0a063
[  261.452091] PMD 8068d5067
[  261.452525] PTE 0
[  261.453164]
[  261.453618] Oops: 0000 [#1] SMP
[  261.454577] Modules linked in: openvswitch(+) ...
[  261.480753] RIP: 0010:[<ffffffff813cb70e>]  [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.483069] RSP: 0018:ffffc90003c0bc28  EFLAGS: 00010282
[  261.510145] Call Trace:
[  261.510896]  [<ffffffff816f10ca>] genl_family_find_byname+0x5a/0x70
[  261.512819]  [<ffffffff816f2319>] genl_register_family+0xb9/0x630
[  261.514805]  [<ffffffffa02840bc>] dp_init+0xbc/0x120 [openvswitch]
[  261.518268]  [<ffffffff8100217d>] do_one_initcall+0x3d/0x160
[  261.525041]  [<ffffffff811808a9>] do_init_module+0x60/0x1f1
[  261.526754]  [<ffffffff8110687f>] load_module+0x22af/0x2860
[  261.530144]  [<ffffffff81107026>] SYSC_finit_module+0x96/0xd0
[  261.531901]  [<ffffffff8110707e>] SyS_finit_module+0xe/0x10
[  261.533605]  [<ffffffff8100391e>] do_syscall_64+0x6e/0x180
[  261.535284]  [<ffffffff817c2faf>] entry_SYSCALL64_slow_path+0x25/0x25
[  261.546512] RIP  [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.550198] ---[ end trace 76505a814dd68770 ]---

Fixes: 2ae0f17df1 ("genetlink: use idr to track families").

Reported-by: Jarno Rajahalme <jarno@ovn.org>
CC: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 20:58:15 -04:00
David S. Miller 32ab0a38f0 Among various cleanups and improvements, we have the following:
* client FILS authentication support in mac80211 (Jouni)
  * AP/VLAN multicast improvements (Michael Braun)
  * config/advertising support for differing beacon intervals on
    multiple virtual interfaces (Purushottam Kushwaha, myself)
  * deprecate the old WDS mode for cfg80211-based drivers, the
    mode is hardly usable since it doesn't support any "modern"
    features like WPA encryption (2003), HT (2009) or VHT (2014),
    I'm not even sure WEP (introduced in 1997) could be done.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYEy/9AAoJEGt7eEactAAd/V0P/0FmHGS8HjlSjm+1p6sbWKbt
 5v8bb3cuKHQiYiUM6euIXql2OYuOEHVAQEpNoPXN9CsfKFYgbIH6yW6d8HtKNedV
 n9lmMy/U6yJX9nYt7yMIQ3kLkbEg+YU58B9Hf47waWXLLSNVumS8rfNBn43EoNQf
 VKWYPWpetsCRIWJ1fnLuxvMHCtOOYCxH+491BUonof32+DKPEAsAbnszZ2ElufTR
 7KNyA3K6leOtTd5Ml52dvLOGNc+h2C83VAMxiShq/6r8OnlX5tPifaubzd9n3m41
 jiJJH/92ESrtF2AaWEm8slcgtcfHS/O7y/FSoV4r0PMSvPTBdjwQ9nqCsbONd831
 vjj6c6YWNxgHPcISX0XcWz+FHnLJdUGaDUtHjAJYw4oH4gaRXwfSw0U+jvdlSMUf
 2CBUArk5f0OEguzwa/5X4Jio3OPPIj4jY/lKplcpLOUu8K2FWTLDuIlww/FHXovs
 rDzTLQeXZkx+MkszkTJN42qSEfOFly91J6OA2Wju+emBqrLIbkAGmvyLVg8U8BQd
 gG7oltgmZ6Xg6fEnUQqpIDO7UJlQ+GXAU04SpNDMv1j/ueUJskxlr3hYM7E9ueQv
 LJcZcVV0RAwNRw52cEsdcYCMLuSMYRrO1OHlkl0wd+x2hFrUCWnVzUgLEUhNBV+c
 ICmNMr96nKhrZI217yzF
 =SjfQ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Among various cleanups and improvements, we have the following:
 * client FILS authentication support in mac80211 (Jouni)
 * AP/VLAN multicast improvements (Michael Braun)
 * config/advertising support for differing beacon intervals on
   multiple virtual interfaces (Purushottam Kushwaha, myself)
 * deprecate the old WDS mode for cfg80211-based drivers, the
   mode is hardly usable since it doesn't support any "modern"
   features like WPA encryption (2003), HT (2009) or VHT (2014),
   I'm not even sure WEP (introduced in 1997) could be done.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:28:45 -04:00
Jon Paul Maloy 06bd2b1ed0 tipc: fix broadcast link synchronization problem
In commit 2d18ac4ba7 ("tipc: extend broadcast link initialization
criteria") we tried to fix a problem with the initial synchronization
of broadcast link acknowledge values. Unfortunately that solution is
not sufficient to solve the issue.

We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
broadcast acknowledge numbers, leading to premature opening of the
broadcast link. When the bypassed packets finally arrive, they are
inadvertently accepted, and the already correctly initialized
acknowledge number in the broadcast receive link is overwritten by
the invalid (zero) value of the said packets. After this the broadcast
link goes stale.

We now fix this by marking the packets where we know the acknowledge
value is or may be invalid, and then ignoring the acks from those.

To this purpose, we claim an unused bit in the header to indicate that
the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
packets, plus those LINK_PROTOCOL packets sent out before the broadcast
links are fully synchronized.

This minor protocol update is fully backwards compatible.

Reported-by: John Thompson <thompa.atl@gmail.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:21:09 -04:00
Neal Cardwell 9b9375b5b7 tcp_bbr: add a state transition diagram and accompanying comment
Document the possible state transitions for a BBR flow, and also add a
prose summary of the state machine, covering the life of a typical BBR
flow.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:12:43 -04:00
David S. Miller a283ad5066 This code cleanup patchset includes the following changes (chronological
order):
 
  - bump version strings, by Simon Wunderlich
 
  - README updates/clean up, by Sven Eckelmann (4 patches)
 
  - Code clean up and restructuring by Sven Eckelmann (2 patches)
 
  - Kerneldoc fix in forw_packet structure, by Linus Luessing
 
  - Remove unused argument in dbg_arp, by Antonio Quartulli
 
  - Add support to build batman-adv without wireless, by Linus Luessing
 
  - Restructure error handling for is_ap_isolated, by Markus Elfring
 
  - Remove unused initialization in various functions, by Sven Eckelmann
 
  - Use better names for fragment and gateway list heads, by Sven
    Eckelmann (2 patches)
 
  - Convert to octal permissions for files, by Sven Eckelmann
 
  - Avoid precedence issues for some macros, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYEk4JFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqFEXQ/+M5JTsV+6oN6yknj30D8HpBKp785UiLbKeu7AadpA3KUn/kDbivihF0Cq
 YrfORAbXyGY/CNrSbx0rZbC79jyBSuvSt9whI92rPsmAvbkeypRdOSpKz5HbkNTh
 MxPblI29AvBwIMH5I+vW9iZvqxpdtJh0h/zCT0db5NQVCW7sVpfIMhCJvOULMFJj
 WuPS/YXA4rBW2RJtGHrAy//byhF7ng26qCV06Bxy5xwKK/ncZX5GmgZVrZ5YWgO6
 3kcEjLL1iASnYAoUu6X5PWeT3SZLutlZ2sWu8O3X7ShtpRT48Uisj1OFlIOgOGc5
 fF2iFxWAUlEt8gu4aAADVnknANPMdChfvJG9w9XOlXte4bJtKpr8dJz/jhHiPMYb
 NScHhpAcdpkFSej8RUHibIC+TjVc50vOqPCFtM0smKQ7cgyWNU+V1Bo5pXMkYGA8
 7OhbWJ87mOkPWj2Yr/YeWHhvnrwbnyZFMSaPHJuCOei+zsgCpAtjPK8InfzDQoWu
 UlRBgaO7zxEDUlmaAHHBUOCNB0uM1H8LR1opi7GjPdNUbONyN2mris3nz9UB1aL3
 +b9bWiNzNh6DsEF+UEJfHW/Nrx+yI7mMQw5/zOknhdHUhFTM3kPbWNYdJfo0yPgn
 KSq2tr4mmGLySFsPEVQI5QBDT2iF0Ee7P6PMlA/J64dwSJj+Mc8=
 =eDGz
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20161027' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This code cleanup patchset includes the following changes (chronological
order):

 - bump version strings, by Simon Wunderlich

 - README updates/clean up, by Sven Eckelmann (4 patches)

 - Code clean up and restructuring by Sven Eckelmann (2 patches)

 - Kerneldoc fix in forw_packet structure, by Linus Luessing

 - Remove unused argument in dbg_arp, by Antonio Quartulli

 - Add support to build batman-adv without wireless, by Linus Luessing

 - Restructure error handling for is_ap_isolated, by Markus Elfring

 - Remove unused initialization in various functions, by Sven Eckelmann

 - Use better names for fragment and gateway list heads, by Sven
   Eckelmann (2 patches)

 - Convert to octal permissions for files, by Sven Eckelmann

 - Avoid precedence issues for some macros, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 16:26:50 -04:00
shamir rabinovitch ff57087f31 rds: debug messages are enabled by default
rds use Kconfig option called "RDS_DEBUG" to enable rds debug messages.
This option cause the rds Makefile to add -DDEBUG to the rds gcc command
line.

When CONFIG_DYNAMIC_DEBUG is enabled, the "DEBUG" macro is used by
include/linux/dynamic_debug.h to decide if dynamic debug prints should
be sent by default to the kernel log.

rds should not enable this macro for production builds. rds dynamic
debug work as expected follow this fix.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:55:57 -04:00
David S. Miller 880b583ce1 Just two fixes:
* a fix to process all events while suspending, so any
    potential calls into the driver are done before it is
    suspended
  * small markup fixes for the sphinx documentation conversion
    that's coming into the tree via the doc tree
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYEbSGAAoJEGt7eEactAAdOo4P/A1RyOl2uTspwB4lZFMMA5Tp
 IIwaB+WlNcjGq05C8kemdQJ30AKJUy6aKMTb5FIuFXWLRK6u5uIoQsYe6cRI8WBs
 bgTpo5RMcNGx6dUgFS/XwM2bLXUNXAXOt+43c2kWSjnqQZpQuSXh8/4R8ybZxyZB
 SNyKCzkx1LYPox4Qnufmi+pz/hI8hHRY7gMdyhUWU2uUk9u/nuY0Zxyvj4OFgyow
 SOGasCPoDzT9556hUGyG9M8HmwBiRZUzfzo5mSR2V0a9Tij4ZOvoSkcAUDbj38fZ
 0UPJ48xwUDMF88W+NEO+rrd1IQXtlsHJk6x/mAQpSkEvzymb5BEvraVeTV3+chgo
 fn45ZDP/GbTwqJ3YacSmKgyGwaGPF9XN2+Iuxotp0UeuYxvxwnkWNr0JbRecDPau
 ZL/P2qV7Y3gjy0+6nHafrYIYPuPFZjwPIJ7k6x7mVtO6zfJhtKF57YP/YoFYD78l
 TtCEE5o7lAYkdWKa2nm4Fg16NM897PS/GTM9dQCFtnYLV7ynZPY2TGf4TelqJ3cz
 FU1/vlM0t/jRkkRKIhlmy8re4ASjTEccf//dllz1CDyPdxHY1NmGftftz539s7qW
 NpCvvO0IXDchBT1jeoSMwcMzlZSfQeIKtkp1P0z4lyPiMcmekdgUQ7sb84FhwSa+
 SYRpiZ/Xv6xa5Djhonci
 =toQY
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two fixes:
 * a fix to process all events while suspending, so any
   potential calls into the driver are done before it is
   suspended
 * small markup fixes for the sphinx documentation conversion
   that's coming into the tree via the doc tree
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:54:16 -04:00
David Ahern 46b5ab1a7c net: dev: Fix non-RCU based lower dev walker
netdev_walk_all_lower_dev is not properly walking the lower device
list.  Commit 1a3f060c1a made netdev_walk_all_lower_dev similar
to netdev_walk_all_upper_dev_rcu and netdev_walk_all_lower_dev_rcu
but failed to update its netdev_next_lower_dev iterator. This patch
fixes that.

Fixes: 1a3f060c1a ("net: Introduce new api for walking upper and
                     lower devices")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:50:30 -04:00
Florian Westphal b917783c7b flow_dissector: __skb_get_hash_symmetric arg can be const
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:10:21 -04:00
Eric Dumazet 5ea8ea2cb7 tcp/dccp: drop SYN packets if accept queue is full
Per listen(fd, backlog) rules, there is really no point accepting a SYN,
sending a SYNACK, and dropping the following ACK packet if accept queue
is full, because application is not draining accept queue fast enough.

This behavior is fooling TCP clients that believe they established a
flow, while there is nothing at server side. They might then send about
10 MSS (if using IW10) that will be dropped anyway while server is under
stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:09:21 -04:00
David S. Miller ad60133909 Here are three batman-adv bugfix patches:
- Fix RCU usage for neighbor list, by Sven Eckelmann
 
  - Fix BATADV_DBG_ALL loglevel to include TP Meter messages, by Sven Eckelmann
 
  - Fix possible splat when disabling an interface, by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYENCVFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqFEdA/9HItVGDUIWhLhLRPJFsPSWaENQ1SFaYACHn7zYTMMmbu4uVjMDqlITcu2
 aDyPcgbAkjGPDCDsX6ON6FsJHoovXiKUQCbri45jTgye4Gc0zmh7+Aj/MusCy20s
 U+hYvJ5jTEKFgqh96fc5ot8qxLBveBKrzLa6L+RO2pZmB15ZF/AexCOxUdM56PuL
 nITvewDXJLCbdY4V555K3m2B8cPo2O/Q4eTBg9bwKHG+lGpHqF9qDmlX7S1selcI
 F9toA6XO1e5fRNMbcHR0BxC0ZhJufCWi0ZGylW8M0HwmUVO1+QttVORkN4ECBQOu
 /J6rGfSZbZIRlQ9uGiqqBRkyGBVcoNoNn9mz9c0l9tNQ2AtfW98EXZbtHm8m/BQL
 UgDndVoFW/RHxBsF9IwavSCWBpAg5RAVvV+XwlH07A0WuOehxOV+UkbXzFD1z7Yb
 c+4wuLgfdNTiV8ttfC25lb5PpYo8g8wZf65sGfw3Ej+6iCSQplakoPzuE5kqiMBR
 Q45vELled0elY8GiaNKhrwB42daN7lctn36vL+SfP0R49FtymnaeCDBnH8L7lYVn
 VUP1Fls8BOA2zi1GKKWOMortgxZLynMWleNJC+5Noa04JX8nyMgNJyof85UkaTz+
 QZ/WK4D8gwWfrkWvTH2UQM0iDOcgURWhKu6T8yhlbmymocnSVZI=
 =/5y8
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20161026' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are three batman-adv bugfix patches:

 - Fix RCU usage for neighbor list, by Sven Eckelmann

 - Fix BATADV_DBG_ALL loglevel to include TP Meter messages, by Sven Eckelmann

 - Fix possible splat when disabling an interface, by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:05:53 -04:00
Willem de Bruijn 104ba78c98 packet: on direct_xmit, limit tso and csum to supported devices
When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.

Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.

Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.

Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.

When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ON checksum not in linear segment. Packet sockets
always allocate at least up to csum_start + csum_off + 2 as linear.

Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

  ethtool -K eth0 tso off tx on
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

  ethtool -K eth0 tx off
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

v2:
  - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

Fixes: d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:02:15 -04:00
Johannes Berg 4700e9ce6e net_sched actions: use nla_parse_nested()
Use nla_parse_nested instead of open-coding the call to
nla_parse() with the attribute data/len.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:01:01 -04:00
Ido Schimmel c778453b13 switchdev: Remove redundant variable
Instead of storing return value in 'err' and returning, just return
directly.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:58:33 -04:00
Thomas Graf b15ca182ed netlink: Add nla_memdup() to wrap kmemdup() use on nlattr
Wrap several common instances of:
	kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL);

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:57:42 -04:00
Arnd Bergmann f8da977989 net: ip, diag: include net/inet_sock.h
The newly added raw_diag.c fails to build in some configurations
unless we include this header:

In file included from net/ipv4/raw_diag.c:6:0:
include/net/raw.h:71:21: error: field 'inet' has incomplete type
net/ipv4/raw_diag.c: In function 'raw_diag_dump':
net/ipv4/raw_diag.c:166:29: error: implicit declaration of function 'inet_sk'

Fixes: 432490f9d4 ("net: ip, diag -- Add diag interface for raw sockets")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:52:18 -04:00
Eli Cooper ae148b0858 ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()
This patch updates skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit() when an
IPv6 header is installed to a socket buffer.

This is not a cosmetic change.  Without updating this value, GSO packets
transmitted through an ipip6 tunnel have the protocol of ETH_P_IP and
skb_mac_gso_segment() will attempt to call gso_segment() for IPv4,
which results in the packets being dropped.

Fixes: b8921ca83e ("ip4ip6: Support for GSO/GRO")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:49:31 -04:00
Craig Gallek e4cabca549 inet: Fix missing return value in inet6_hash
As part of a series to implement faster SO_REUSEPORT lookups,
commit 086c653f58 ("sock: struct proto hash function may error")
added return values to protocol hash functions and
commit 496611d7b5 ("inet: create IPv6-equivalent inet_hash function")
implemented a new hash function for IPv6.  However, the latter does
not respect the former's convention.

This properly propagates the hash errors in the IPv6 case.

Fixes: 496611d7b5 ("inet: create IPv6-equivalent inet_hash function")
Reported-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 12:01:49 -04:00
Marcelo Ricardo Leitner bf911e985d sctp: validate chunk len before actually using it
Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond the boundaries. It was caused because when handling out of the
blue packets in function sctp_sf_ootb() it was checking the chunk len
only after already processing the first chunk, validating only for the
2nd and subsequent ones.

The fix is to just move the check upwards so it's also validated for the
1st chunk.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 12:00:10 -04:00
Jeff Layton 18e601d6ad sunrpc: fix some missing rq_rbuffer assignments
We've been seeing some crashes in testing that look like this:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
PGD 212ca2067 PUD 212ca3067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88020d7ed200 task.stack: ffff880211838000
RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
Stack:
 ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
 ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
 ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
Call Trace:
 [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
 [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
 [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
 [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
 [<ffffffff810a9418>] kthread+0xd8/0xf0
 [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
 [<ffffffff810a9340>] ? kthread_park+0x60/0x60
Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
 RSP <ffff88021183bdd0>
CR2: 0000000000000000

Both Bruce and Eryu ran a bisect here and found that the problematic
patch was 68778945e4 (SUNRPC: Separate buffer pointers for RPC Call and
Reply messages).

That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
set up the receive buffer, but didn't change all of the necessary
codepaths to set it properly. In particular the backchannel setup was
missing.

We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 68778945e4 "SUNRPC: Separate buffer pointers..."
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-28 16:57:33 -04:00
Colin Ian King b09edbd07f net caif: insert missing spaces in pr_* messages and unbreak multi-line strings
Some of the pr_* messages are missing spaces, so insert these and also
unbreak multi-line literal strings in pr_* messages

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28 13:47:33 -04:00
David S. Miller 1d00578836 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-10-25

Just a leftover from the last development cycle.

1) Remove some unused code, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28 13:26:27 -04:00
Arnd Bergmann 5747620257 netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning
Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
confuses the compiler to the point where it produces a rather
dubious warning message:

net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  struct ip_vs_sync_conn_options opt;
                                 ^~~
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

The problem appears to be a combination of a number of factors, including
the __builtin_bswap32 compiler builtin being slightly odd, having a large
amount of code inlined into a single function, and the way that some
functions only get partially inlined here.

I've spent way too much time trying to work out a way to improve the
code, but the best I've come up with is to add an explicit memset
right before the ip_vs_seq structure is first initialized here. When
the compiler works correctly, this has absolutely no effect, but in the
case that produces the warning, the warning disappears.

In the process of analysing this warning, I also noticed that
we use memcpy to copy the larger ip_vs_sync_conn_options structure
over two members of the ip_vs_conn structure. This works because
the layout is identical, but seems error-prone, so I'm changing
this in the process to directly copy the two members. This change
seemed to have no effect on the object code or the warning, but
it deals with the same data, so I kept the two changes together.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-28 14:14:51 +02:00
Arnd Bergmann 514877182b mac80211: fils_aead: fix encrypt error handling
gcc -Wmaybe-uninitialized reports a bug in aes_siv_encryp:

net/mac80211/fils_aead.c: In function ‘aes_siv_encrypt.constprop’:
net/mac80211/fils_aead.c:84:26: error: ‘tfm2’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

At the time that the memory allocation fails, 'tfm2' has not been
allocated, so we should not attempt to free it later, and we can
simply return an error.

Fixes: 39404feee6 ("mac80211: FILS AEAD protection for station mode association frames")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-28 12:59:12 +02:00
Andrey Vagin 002d8a1a6c net: skip genenerating uevents for network namespaces that are exiting
No one can see these events, because a network namespace can not be
destroyed, if it has sockets.

Unlike other devices, uevent-s for network devices are generated
only inside their network namespaces. They are filtered in
kobj_bcast_filter()

My experiments shows that net namespaces are destroyed more 30% faster
with this optimization.

Here is a perf output for destroying network namespaces without this
patch.

-   94.76%     0.02%  kworker/u48:1  [kernel.kallsyms]     [k] cleanup_net
   - 94.74% cleanup_net
      - 94.64% ops_exit_list.isra.4
         - 41.61% default_device_exit_batch
            - 41.47% unregister_netdevice_many
               - rollback_registered_many
                  - 40.36% netdev_unregister_kobject
                     - 14.55% device_del
                        + 13.71% kobject_uevent
                     - 13.04% netdev_queue_update_kobjects
                        + 12.96% kobject_put
                     - 12.72% net_rx_queue_update_kobjects
                          kobject_put
                        - kobject_release
                           + 12.69% kobject_uevent
                  + 0.80% call_netdevice_notifiers_info
         + 19.57% nfsd_exit_net
         + 11.15% tcp_net_metrics_exit
         + 8.25% rpcsec_gss_exit_net

It's very critical to optimize the exit path for network namespaces,
because they are destroyed under net_mutex and many namespaces can be
destroyed for one iteration.

v2: use dev_set_uevent_suppress()

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 17:14:47 -04:00
Jamal Hadi Salim 9ee7837449 net sched filters: fix notification of filter delete with proper handle
Daniel says:

While trying out [1][2], I noticed that tc monitor doesn't show the
correct handle on delete:

$ tc monitor
qdisc clsact ffff: dev eno1 parent ffff:fff1
filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80

some context to explain the above:
The user identity of any tc filter is represented by a 32-bit
identifier encoded in tcm->tcm_handle. Example 0x2a in the bpf filter
above. A user wishing to delete, get or even modify a specific filter
uses this handle to reference it.
Every classifier is free to provide its own semantics for the 32 bit handle.
Example: classifiers like u32 use schemes like 800:1:801 to describe
the semantics of their filters represented as hash table, bucket and
node ids etc.
Classifiers also have internal per-filter representation which is different
from this externally visible identity. Most classifiers set this
internal representation to be a pointer address (which allows fast retrieval
of said filters in their implementations). This internal representation
is referenced with the "fh" variable in the kernel control code.

When a user successfuly deletes a specific filter, by specifying the correct
tcm->tcm_handle, an event is generated to user space which indicates
which specific filter was deleted.

Before this patch, the "fh" value was sent to user space as the identity.
As an example what is shown in the sample bpf filter delete event above
is 0xf3be0c80. This is infact a 32-bit truncation of 0xffff8807f3be0c80
which happens to be a 64-bit memory address of the internal filter
representation (address of the corresponding filter's struct cls_bpf_prog);

After this patch the appropriate user identifiable handle as encoded
in the originating request tcm->tcm_handle is generated in the event.
One of the cardinal rules of netlink rules is to be able to take an
event (such as a delete in this case) and reflect it back to the
kernel and successfully delete the filter. This patch achieves that.

Note, this issue has existed since the original TC action
infrastructure code patch back in 2004 as found in:
https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/

[1] http://patchwork.ozlabs.org/patch/682828/
[2] http://patchwork.ozlabs.org/patch/682829/

Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 17:12:33 -04:00
Arnd Bergmann bc72f3dd89 flow_dissector: fix vlan tag handling
gcc warns about an uninitialized pointer dereference in the vlan
priority handling:

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:281:61: error: 'vlan' may be used uninitialized in this function [-Werror=maybe-uninitialized]

As pointed out by Jiri Pirko, the variable is never actually used
without being initialized first as the only way it end up uninitialized
is with skb_vlan_tag_present(skb)==true, and that means it does not
get accessed.

However, the warning hints at some related issues that I'm addressing
here:

- the second check for the vlan tag is different from the first one
  that tests the skb for being NULL first, causing both the warning
  and a possible NULL pointer dereference that was not entirely fixed.
- The same patch that introduced the NULL pointer check dropped an
  earlier optimization that skipped the repeated check of the
  protocol type
- The local '_vlan' variable is referenced through the 'vlan' pointer
  but the variable has gone out of scope by the time that it is
  accessed, causing undefined behavior

Caching the result of the 'skb && skb_vlan_tag_present(skb)' check
in a local variable allows the compiler to further optimize the
later check. With those changes, the warning also disappears.

Fixes: 3805a938a6 ("flow_dissector: Check skb for VLAN only if skb specified.")
Fixes: d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:36:03 -04:00
David Ahern d5d32e4b76 net: ipv6: Do not consider link state for nexthop validation
Similar to IPv4, do not consider link state when validating next hops.

Currently, if the link is down default routes can fail to insert:
 $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
 RTNETLINK answers: No route to host

With this patch the command succeeds.

Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:33:12 -04:00
David Ahern 830218c1ad net: ipv6: Fix processing of RAs in presence of VRF
rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not leading to the failure to
process RA's:

    ICMPv6: RA: ndisc_router_discovery failed to add default route

Fix the 'get' functions by using the table id associated with the
device when applicable.

Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting if it is has a default route via RA.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:30:52 -04:00
Johannes Berg 56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg 2ae0f17df1 genetlink: use idr to track families
Since generic netlink family IDs are small integers, allocated
densely, IDR is an ideal match for lookups. Replace the existing
hand-written hash-table with IDR for allocation and lookup.

This lets the families only be written to once, during register,
since the list_head can be removed and removal of a family won't
cause any writes.

It also slightly reduces the code size (by about 1.3k on x86-64).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg 489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg c90c39dab3 genetlink: introduce and use genl_family_attrbuf()
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:08 -04:00