Commit Graph

7808 Commits

Author SHA1 Message Date
Eric Dumazet d985d15151 net: ipv4: fix sparse error in fib_good_nh()
Fixes following sparse errors :

net/ipv4/fib_semantics.c:1579:61: warning: incorrect type in argument 2
(different base types)
net/ipv4/fib_semantics.c:1579:61:    expected unsigned int [unsigned]
[usertype] key
net/ipv4/fib_semantics.c:1579:61:    got restricted __be32 const
[usertype] nh_gw

Fixes: a6db4494d2 ("net: ipv4: Consider failed nexthops in multipath routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19 17:07:30 -07:00
Eric Dumazet 217375a0c6 udp: include addrconf.h
Include ipv4_rcv_saddr_equal() definition to avoid this sparse error :

net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not
declared. Should it be static?

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19 17:06:58 -07:00
Eric Dumazet b6c6b645d2 tcp: md5: remove tcp_md5_hash_header()
After commit 19689e38ec ("tcp: md5: use kmalloc() backed scratch
areas") this function is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19 17:06:58 -07:00
Xunlei Pang 98a384eca9 fib_trie: Fix the description of pos and bits
1) Fix one typo: s/tn/tp/
2) Fix the description about the "u" bits.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:51:23 -07:00
Eric Dumazet 36a6503fed tcp: refine tcp_prune_ofo_queue() to not drop all packets
Over the years, TCP BDP has increased a lot, and is typically
in the order of ~10 Mbytes with help of clever Congestion Control
modules.

In presence of packet losses, TCP stores incoming packets into an out of
order queue, and number of skbs sitting there waiting for the missing
packets to be received can match the BDP (~10 Mbytes)

In some cases, TCP needs to make room for incoming skbs, and current
strategy can simply remove all skbs in the out of order queue as a last
resort, incurring a huge penalty, both for receiver and sender.

Unfortunately these 'last resort events' are quite frequent, forcing
sender to send all packets again, stalling the flow and wasting a lot of
resources.

This patch cleans only a part of the out of order queue in order
to meet the memory constraints.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: C. Stephen Gun <csg@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:36:07 -07:00
Eric Dumazet dca0aaf847 tcp: defer sacked assignment
While chasing tcp_xmit_retransmit_queue() kasan issue, I found
that we could avoid reading sacked field of skb that we wont send,
possibly removing one cache line miss.

Very minor change in slow path, but why not ? ;)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:27:27 -07:00
David S. Miller 60747ef4d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes for both merge conflicts.

Resolution work done by Stephen Rothwell was used
as a reference.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 01:17:32 -04:00
Thierry Reding d2d371ae5d net: ipconfig: Fix more use after free
While commit 9c706a49d6 ("net: ipconfig: fix use after free") avoids
the use after free, the resulting code still ends up calling both the
ic_setup_if() and ic_setup_routes() after calling ic_close_devs(), and
access to the device is still required.

Move the call to ic_close_devs() to the very end of the function.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17 19:33:40 -04:00
Simon Horman 3d7b332092 gre: set inner_protocol on xmit
Ensure that the inner_protocol is set on transmit so that GSO segmentation,
which relies on that field, works correctly.

This is achieved by setting the inner_protocol in gre_build_header rather
than each caller of that function. It ensures that the inner_protocol is
set when gre_fb_xmit() is used to transmit GRE which was not previously the
case.

I have observed this is not the case when OvS transmits GRE using
lwtunnel metadata (which it always does).

Fixes: 3872035241 ("gre: Use inner_proto to obtain inner header protocol")
Cc: Pravin Shelar <pshelar@ovn.org>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-15 13:37:12 -07:00
Pablo Neira Ayuso adf0516845 netfilter: remove ip_conntrack* sysctl compat code
This backward compatibility has been around for more than ten years,
since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have
alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and
the conntrack utility got adopted by many people in the user community
according to what I observed on the netfilter user mailing list.

So let's get rid of this.

Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
not need to be exported as symbol anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-13 13:27:13 +02:00
Florian Westphal d0b35b93d4 netfilter: use_nf_conn_expires helper in more places
... so we don't need to touch all of these places when we get rid of the
timer in nf_conn.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-12 00:43:13 +02:00
Liping Zhang 9f7c824a44 netfilter: nf_dup4: remove redundant checksum recalculation
IP header checksum will be recalculated at ip_local_out, so
there's no need to calculated it here, remove it. Also update
code comments to illustrate it, and delete the misleading
comments about checksum recalculation.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-12 00:42:47 +02:00
Uwe Kleine-König 9c706a49d6 net: ipconfig: fix use after free
ic_close_devs() calls kfree() for all devices's ic_device. Since commit
2647cffb2b ("net: ipconfig: Support using "delayed" DHCP replies")
the active device's ic_device is still used however to print the
ipconfig summary which results in an oops if the memory is already
changed. So delay freeing until after the autoconfig results are
reported.

Fixes: 2647cffb2b ("net: ipconfig: Support using "delayed" DHCP replies")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-10 14:04:23 -07:00
David Ahern 631fee7d70 net: Remove fib_local variable
After commit 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
fib_local is set but not used. Remove it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 14:57:39 -07:00
Lance Richardson a5d0dc810a vti: flush x-netns xfrm cache when vti interface is removed
When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):

  kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.

Address this problem by garbage collecting the tunnel netns flow cache
when a cross-namespace vti interface receives a NETDEV_DOWN notification.

A more detailed description of the problem scenario (referencing commands
in the script below):

(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1

  The vti_test interface is created in the init namespace. vti_tunnel_init()
  attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
  setting the tunnel net to &init_net.

(2) ip link set vti_test netns secure

  The vti_test interface is moved to the "secure" netns. Note that
  the associated struct ip_tunnel still has tunnel->net set to &init_net.

(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

  The first packet sent using the vti device causes xfrm_lookup() to be
  called as follows:

      dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);

  Note that tunnel->net is the init namespace, while skb_dst(skb) references
  the vti_test interface in the "secure" namespace. The returned dst
  references an interface in the init namespace.

  Also note that the first parameter to xfrm_lookup() determines which flow
  cache is used to store the computed xfrm bundle, so after xfrm_lookup()
  returns there will be a cached bundle in the init namespace flow cache
  with a dst referencing a device in the "secure" namespace.

(4) ip netns del secure

  Kernel begins to delete the "secure" namespace.  At some point the
  vti_test interface is deleted, at which point dst_ifdown() changes
  the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
  in the "secure" namespace however).
  Since nothing has happened to cause the init namespace's flow cache
  to be garbage collected, this dst remains attached to the flow cache,
  so the kernel loops waiting for the last reference to lo to go away.

<Begin script>
ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8

ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
   proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
   proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788

ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

ip netns del secure
<End script>

Reported-by: Hangbin Liu <haliu@redhat.com>
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 12:57:49 -07:00
Hangbin Liu a052517a8f net/multicast: should not send source list records when have filter mode change
Based on RFC3376 5.1 and RFC3810 6.1

   If the per-interface listening change that triggers the new report is
   a filter mode change, then the next [Robustness Variable] State
   Change Reports will include a Filter Mode Change Record.  This
   applies even if any number of source list changes occur in that
   period.

   Old State         New State         State Change Record Sent
   ---------         ---------         ------------------------
   INCLUDE (A)       EXCLUDE (B)       TO_EX (B)
   EXCLUDE (A)       INCLUDE (B)       TO_IN (B)

So we should not send source-list change if there is a filter-mode change.

Here are two scenarios:
1. Group deleted and filter mode is EXCLUDE, which means we need send a
   TO_IN { }.
2. Not group deleted, but has pcm->crcount, which means we need send a
   normal filter-mode-change.

At the same time, if the type is ALLOW or BLOCK, and have psf->sf_crcount,
we stop add records and decrease sf_crcount directly

Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 16:04:39 -07:00
Uwe Kleine-König e068853409 net: ipconfig: drop inter-device timeout
Now that ipconfig learned to handle "delayed replies" in the previous
commit, there is no reason any more to delay sending a first request per
device.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 15:40:05 -07:00
Uwe Kleine-König 2647cffb2b net: ipconfig: Support using "delayed" DHCP replies
The dhcp code only waits 1s between sending DHCP requests on different
devices and only accepts an answer for the device that sent out the last
request. Only the timeout at the end of a loop is increased iteratively
which favours only the last device. This makes it impossible to work
with a dhcp server that takes little more than 1s connected to a device
that is not the last one.

Instead of also increasing the inter-device timeout, teach the code to
handle delayed replies.

To accomplish that, make *ic_dev track the current ic_device instead of
the current net_device and adapt all users accordingly. The relevant
change then is to reset d to ic_dev on a reply to assert that the
followup request goes through the right device.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 15:40:05 -07:00
Uwe Kleine-König 22fc538872 net: ipconfig: Add device name to debug messages
This simplifies understanding what happens when there is more than one
device.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 15:40:05 -07:00
David Forster 94d9f1c590 ipv4: panic in leaf_walk_rcu due to stale node pointer
Panic occurs when issuing "cat /proc/net/route" whilst
populating FIB with > 1M routes.

Use of cached node pointer in fib_route_get_idx is unsafe.

 BUG: unable to handle kernel paging request at ffffc90001630024
 IP: [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
 PGD 11b08d067 PUD 11b08e067 PMD dac4b067 PTE 0
 Oops: 0000 [#1] SMP
 Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscac
 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep virti
 acpi_cpufreq button parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd
tio_ring virtio floppy uhci_hcd ehci_hcd usbcore usb_common libata scsi_mod
 CPU: 1 PID: 785 Comm: cat Not tainted 4.2.0-rc8+ #4
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
 task: ffff8800da1c0bc0 ti: ffff88011a05c000 task.ti: ffff88011a05c000
 RIP: 0010:[<ffffffff814cf6a0>]  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
 RSP: 0018:ffff88011a05fda0  EFLAGS: 00010202
 RAX: ffff8800d8a40c00 RBX: ffff8800da4af940 RCX: ffff88011a05ff20
 RDX: ffffc90001630020 RSI: 0000000001013531 RDI: ffff8800da4af950
 RBP: 0000000000000000 R08: ffff8800da1f9a00 R09: 0000000000000000
 R10: ffff8800db45b7e4 R11: 0000000000000246 R12: ffff8800da4af950
 R13: ffff8800d97a74c0 R14: 0000000000000000 R15: ffff8800d97a7480
 FS:  00007fd3970e0700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: ffffc90001630024 CR3: 000000011a7e4000 CR4: 00000000000006e0
 Stack:
  ffffffff814d00d3 0000000000000000 ffff88011a05ff20 ffff8800da1f9a00
  ffffffff811dd8b9 0000000000000800 0000000000020000 00007fd396f35000
  ffffffff811f8714 0000000000003431 ffffffff8138dce0 0000000000000f80
 Call Trace:
  [<ffffffff814d00d3>] ? fib_route_seq_start+0x93/0xc0
  [<ffffffff811dd8b9>] ? seq_read+0x149/0x380
  [<ffffffff811f8714>] ? fsnotify+0x3b4/0x500
  [<ffffffff8138dce0>] ? process_echoes+0x70/0x70
  [<ffffffff8121cfa7>] ? proc_reg_read+0x47/0x70
  [<ffffffff811bb823>] ? __vfs_read+0x23/0xd0
  [<ffffffff811bbd42>] ? rw_verify_area+0x52/0xf0
  [<ffffffff811bbe61>] ? vfs_read+0x81/0x120
  [<ffffffff811bcbc2>] ? SyS_read+0x42/0xa0
  [<ffffffff81549ab2>] ? entry_SYSCALL_64_fastpath+0x16/0x75
 Code: 48 85 c0 75 d8 f3 c3 31 c0 c3 f3 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00
a 04 89 f0 33 02 44 89 c9 48 d3 e8 0f b6 4a 05 49 89
 RIP  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
  RSP <ffff88011a05fda0>
 CR2: ffffc90001630024

Signed-off-by: Dave Forster <dforster@brocade.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-06 00:10:05 -04:00
Soheil Hassas Yeganeh f626300a3e tcp: consider recv buf for the initial window scale
tcp_select_initial_window() intends to advertise a window
scaling for the maximum possible window size. To do so,
it considers the maximum of net.ipv4.tcp_rmem[2] and
net.core.rmem_max as the only possible upper-bounds.
However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE
to set the socket's receive buffer size to values
larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max.
Thus, SO_RCVBUFFORCE is effectively ignored by
tcp_select_initial_window().

To fix this, consider the maximum of net.ipv4.tcp_rmem[2],
net.core.rmem_max and socket's initial buffer space.

Fixes: b0573dea1f ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 21:21:57 -07:00
Linus Torvalds 7a1e8b80fb Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

   - TPM core and driver updates/fixes
   - IPv6 security labeling (CALIPSO)
   - Lots of Apparmor fixes
   - Seccomp: remove 2-phase API, close hole where ptrace can change
     syscall #"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
  apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
  tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
  tpm: Factor out common startup code
  tpm: use devm_add_action_or_reset
  tpm2_i2c_nuvoton: add irq validity check
  tpm: read burstcount from TPM_STS in one 32-bit transaction
  tpm: fix byte-order for the value read by tpm2_get_tpm_pt
  tpm_tis_core: convert max timeouts from msec to jiffies
  apparmor: fix arg_size computation for when setprocattr is null terminated
  apparmor: fix oops, validate buffer size in apparmor_setprocattr()
  apparmor: do not expose kernel stack
  apparmor: fix module parameters can be changed after policy is locked
  apparmor: fix oops in profile_unpack() when policy_db is not present
  apparmor: don't check for vmalloc_addr if kvzalloc() failed
  apparmor: add missing id bounds check on dfa verification
  apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
  apparmor: use list_next_entry instead of list_entry_next
  apparmor: fix refcount race when finding a child profile
  apparmor: fix ref count leak when profile sha1 hash is read
  apparmor: check that xindex is in trans_table bounds
  ...
2016-07-29 17:38:46 -07:00
Linus Torvalds 468fc7ed55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

 2) Make DSA binding more sane, from Andrew Lunn.

 3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

 4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

 5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface.  From Brenden Blanco and
    others.

 6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

 7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

 8) Simplify netlink conntrack entry layout, from Florian Westphal.

 9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

11) Support qdisc packet injection in pktgen, from John Fastabend.

12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

13) Add NV congestion control support to TCP, from Lawrence Brakmo.

14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

16) Support MPLS over IPV4, from Simon Horman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
  xgene: Fix build warning with ACPI disabled.
  be2net: perform temperature query in adapter regardless of its interface state
  l2tp: Correctly return -EBADF from pppol2tp_getname.
  net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
  net: ipmr/ip6mr: update lastuse on entry change
  macsec: ensure rx_sa is set when validation is disabled
  tipc: dump monitor attributes
  tipc: add a function to get the bearer name
  tipc: get monitor threshold for the cluster
  tipc: make cluster size threshold for monitoring configurable
  tipc: introduce constants for tipc address validation
  net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
  MAINTAINERS: xgene: Add driver and documentation path
  Documentation: dtb: xgene: Add MDIO node
  dtb: xgene: Add MDIO node
  drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
  drivers: net: xgene: Use exported functions
  drivers: net: xgene: Enable MDIO driver
  drivers: net: xgene: Add backward compatibility
  drivers: net: phy: xgene: Add MDIO driver
  ...
2016-07-27 12:03:20 -07:00
Nikolay Aleksandrov 90b5ca1766 net: ipmr/ip6mr: update lastuse on entry change
Currently lastuse is updated on entry creation and cache hit, but it should
also be updated on entry change. Since both on add and update the ttl array
is updated we can simply update the lastuse in ipmr_update_thresholds.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26 15:18:31 -07:00
Daniel Borkmann ba66bbe548 udp: use sk_filter_trim_cap for udp{,6}_queue_rcv_skb
After a612769774 ("udp: prevent bugcheck if filter truncates packet
too much"), there followed various other fixes for similar cases such
as f4979fcea7 ("rose: limit sk_filter trim to payload").

Latter introduced a new helper sk_filter_trim_cap(), where we can pass
the trim limit directly to the socket filter handling. Make use of it
here as well with sizeof(struct udphdr) as lower cap limit and drop the
extra skb->len test in UDP's input path.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25 21:40:33 -07:00
Linus Torvalds 55392c4c06 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "This update provides the following changes:

   - The rework of the timer wheel which addresses the shortcomings of
     the current wheel (cascading, slow search for next expiring timer,
     etc).  That's the first major change of the wheel in almost 20
     years since Finn implemted it.

   - A large overhaul of the clocksource drivers init functions to
     consolidate the Device Tree initialization

   - Some more Y2038 updates

   - A capability fix for timerfd

   - Yet another clock chip driver

   - The usual pile of updates, comment improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
  tick/nohz: Optimize nohz idle enter
  clockevents: Make clockevents_subsys static
  clocksource/drivers/time-armada-370-xp: Fix return value check
  timers: Implement optimization for same expiry time in mod_timer()
  timers: Split out index calculation
  timers: Only wake softirq if necessary
  timers: Forward the wheel clock whenever possible
  timers/nohz: Remove pointless tick_nohz_kick_tick() function
  timers: Optimize collect_expired_timers() for NOHZ
  timers: Move __run_timers() function
  timers: Remove set_timer_slack() leftovers
  timers: Switch to a non-cascading wheel
  timers: Reduce the CPU index space to 256k
  timers: Give a few structs and members proper names
  hlist: Add hlist_is_singular_node() helper
  signals: Use hrtimer for sigtimedwait()
  timers: Remove the deprecated mod_timer_pinned() API
  timers, net/ipv4/inet: Initialize connection request timers as pinned
  timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
  timers, drivers/tty/metag_da: Initialize the poll timer as pinned
  ...
2016-07-25 20:43:12 -07:00
David S. Miller c42d7121fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next,
they are:

1) Count pre-established connections as active in "least connection"
   schedulers such that pre-established connections to avoid overloading
   backend servers on peak demands, from Michal Kubecek via Simon Horman.

2) Address a race condition when resizing the conntrack table by caching
   the bucket size when fulling iterating over the hashtable in these
   three possible scenarios: 1) dump via /proc/net/nf_conntrack,
   2) unlinking userspace helper and 3) unlinking custom conntrack timeout.
   From Liping Zhang.

3) Revisit early_drop() path to perform lockless traversal on conntrack
   eviction under stress, use del_timer() as synchronization point to
   avoid two CPUs evicting the same entry, from Florian Westphal.

4) Move NAT hlist_head to nf_conn object, this simplifies the existing
   NAT extension and it doesn't increase size since recent patches to
   align nf_conn, from Florian.

5) Use rhashtable for the by-source NAT hashtable, also from Florian.

6) Don't allow --physdev-is-out from OUTPUT chain, just like
   --physdev-out is not either, from Hangbin Liu.

7) Automagically set on nf_conntrack counters if the user tries to
   match ct bytes/packets from nftables, from Liping Zhang.

8) Remove possible_net_t fields in nf_tables set objects since we just
   simply pass the net pointer to the backend set type implementations.

9) Fix possible off-by-one in h323, from Toby DiPasquale.

10) early_drop() may be called from ctnetlink patch, so we must hold
    rcu read size lock from them too, this amends Florian's patch #3
    coming in this batch, from Liping Zhang.

11) Use binary search to validate jump offset in x_tables, this
    addresses the O(n!) validation that was introduced recently
    resolve security issues with unpriviledge namespaces, from Florian.

12) Fix reference leak to connlabel in error path of nft_ct, from Zhang.

13) Three updates for nft_log: Fix log prefix leak in error path. Bail
    out on loglevel larger than debug in nft_log and set on the new
    NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang.

14) Allow to filter rule dumps in nf_tables based on table and chain
    names.

15) Simplify connlabel to always use 128 bits to store labels and
    get rid of unused function in xt_connlabel, from Florian.

16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack
    helper, by Gao Feng.

17) Put back x_tables module reference in nft_compat on error, from
    Liping Zhang.

18) Add a reference count to the x_tables extensions cache in
    nft_compat, so we can remove them when unused and avoid a crash
    if the extensions are rmmod, again from Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 22:02:36 -07:00
David S. Miller de0ba9a0d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 00:53:32 -04:00
Shmulik Ladkani b8247f095e net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs
Given:
 - tap0 and vxlan0 are bridged
 - vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)

Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).

After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu.

These skbs are accidentally passed to ip_finish_output2 AS IS.
Alas, each final segment (segmented either by validate_xmit_skb or by
hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain networks.

This behavior is not aligned with the NON-GSO case:
Assume a non-gso 1500-sized IP packet arrives from tap0. After
encapsulation, the vxlan datagram is fragmented normally at the
ip_finish_output-->ip_fragment code path.

The expected behavior for the GSO case would be segmenting the
"gso-oversized" skb first, then fragmenting each segment according to
dst mtu, and finally passing the resulting fragments to ip_finish_output2.

'ip_finish_output_gso' already supports this "Slowpath" behavior,
according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
forwarding (not set in the bridged case).

In order to support the bridged case, we'll mark skbs arriving from an
ingress interface that get udp-encaspulated as "allowed to be fragmented",
causing their network_seglen to be validated by 'ip_finish_output_gso'
(and fragment if needed).

Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
gso and non-gso cases), which serves users wishing to forbid
fragmentation at the udp tunnel endpoint.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19 16:40:22 -07:00
Shmulik Ladkani 359ebda25a net/ipv4: Introduce IPSKB_FRAG_SEGS bit to inet_skb_parm.flags
This flag indicates whether fragmentation of segments is allowed.

Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
either ip_forward or ipmr_forward).

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19 16:40:22 -07:00
Florian Westphal f4dc77713f netfilter: x_tables: speed up jump target validation
The dummy ruleset I used to test the original validation change was broken,
most rules were unreachable and were not tested by mark_source_chains().

In some cases rulesets that used to load in a few seconds now require
several minutes.

sample ruleset that shows the behaviour:

echo "*filter"
for i in $(seq 0 100000);do
        printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 100000);do
   printf -- "-A INPUT -j chain_%06x\n" $i
   printf -- "-A INPUT -j chain_%06x\n" $i
   printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT

[ pipe result into iptables-restore ]

This ruleset will be about 74mbyte in size, with ~500k searches
though all 500k[1] rule entries. iptables-restore will take forever
(gave up after 10 minutes)

Instead of always searching the entire blob for a match, fill an
array with the start offsets of every single ipt_entry struct,
then do a binary search to check if the jump target is present or not.

After this change ruleset restore times get again close to what one
gets when reverting 3647234101 (~3 seconds on my workstation).

[1] every user-defined rule gets an implicit RETURN, so we get
300k jumps + 100k userchains + 100k returns -> 500k rule entries

Fixes: 3647234101 ("netfilter: x_tables: validate targets of jumps")
Reported-by: Jeff Wu <wujiafu@gmail.com>
Tested-by: Jeff Wu <wujiafu@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-18 21:35:23 +02:00
Nikolay Aleksandrov 43b9e12740 net: ipmr/ip6mr: add support for keeping an entry age
In preparation for hardware offloading of ipmr/ip6mr we need an
interface that allows to check (and later update) the age of entries.
Relying on stats alone can show activity but not actual age of the entry,
furthermore when there're tens of thousands of entries a lot of the
hardware implementations only support "hit" bits which are cleared on
read to denote that the entry was active and shouldn't be aged out,
these can then be naturally translated into age timestamp and will be
compatible with the software forwarding age. Using a lastuse entry doesn't
affect performance because the members in that cache line are written to
along with the age.
Since all new users are encouraged to use ipmr via netlink, this is
exported via the RTA_EXPIRES attribute.
Also do a minor local variable declaration style adjustment - arrange them
longest to shortest.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
CC: Satish Ashok <sashok@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-16 20:19:43 -07:00
Richard Sailer c380d37e97 tcp_timer.c: Add kernel-doc function descriptions
This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.

Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15 23:18:14 -07:00
Jason Baron 083ae30828 tcp: enable per-socket rate limiting of all 'challenge acks'
The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:

commit f2b2c582e8 ("tcp: mitigate ACK loops for connections as tcp_sock")

And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.

Since we have the global tcp_challenge_ack_limit, this patch allows for
tcp_challenge_ack_limit to be set to a large value and effectively rely on
the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
still prevents a single connections from consuming the entire challenge ack
quota.

It further moves in the direction of eliminating the global limit at some
point, as Eric Dumazet has suggested. This a follow-up to:
Subject: tcp: make challenge acks less predictable

Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15 14:18:29 -07:00
Paul Gortmaker d3fc0353f7 ipv4: af_inet: make it explicitly non-modular
The Makefile controlling compilation of this file is obj-y,
meaning that it currently is never being built as a module.

Since MODULE_ALIAS is a no-op for non-modular code, we can simply
remove the MODULE_ALIAS_NETPROTO variant used here.

We replace module.h with kmod.h since the file does make use of
request_module() in order to load other modules from here.

We don't have to worry about init.h coming in via the removed
module.h since the file explicitly includes init.h already.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:44:26 -07:00
Julian Anastasov 80610229ef ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space
Vegard Nossum is reporting for a crash in fib_dump_info
when nh_dev = NULL and fib_nhs == 1:

Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
RIP: 0033:[<00000000602b3d18>]
RSP: 0000000062623890  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00
RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000
R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00
R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000
Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
Stack:
 626238f0 960226a02 00000400 000000fe
 62623910 600afca7 62623970 62623a48
 62468038 00000018 00000000 00000000
Call Trace:
 [<602b3e93>] rtmsg_fib+0xd3/0x190
 [<602b6680>] fib_table_insert+0x260/0x500
 [<602b0e5d>] inet_rtm_newroute+0x4d/0x60
 [<60250def>] rtnetlink_rcv_msg+0x8f/0x270
 [<60267079>] netlink_rcv_skb+0xc9/0xe0
 [<60250d4b>] rtnetlink_rcv+0x3b/0x50
 [<60265400>] netlink_unicast+0x1a0/0x2c0
 [<60265e47>] netlink_sendmsg+0x3f7/0x470
 [<6021dc9a>] sock_sendmsg+0x3a/0x90
 [<6021e0d0>] ___sys_sendmsg+0x300/0x360
 [<6021fa64>] __sys_sendmsg+0x54/0xa0
 [<6021fac0>] SyS_sendmsg+0x10/0x20
 [<6001ea68>] handle_syscall+0x88/0x90
 [<600295fd>] userspace+0x3fd/0x500
 [<6001ac55>] fork_handler+0x85/0x90

$ addr2line -e vmlinux -i 0x602b3d18
include/linux/inetdevice.h:222
net/ipv4/fib_semantics.c:1264

Problem happens when RTNH_F_LINKDOWN is provided from user space
when creating routes that do not use the flag, catched with
netlink fuzzer.

Currently, the kernel allows user space to set both flags
to nh_flags and fib_flags but this is not intentional, the
assumption was that they are not set. Fix this by rejecting
both flags with EINVAL.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Fixes: 0eeb075fad ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Dinesh Dutt <ddutt@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:41:09 -07:00
Eric Dumazet 75ff39ccc1 tcp: make challenge acks less predictable
Yue Cao claims that current host rate limiting of challenge ACKS
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.

This patch increases the default limit from 100 to 1000, and adds
some randomization so that the attacker can no longer hijack
sessions without spending a considerable amount of probes.

Based on initial analysis and patch from Linus.

Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.

v2: randomize the count of challenge acks per second, not the period.

Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
Reported-by: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:33:35 -07:00
Simon Horman aa9667e7f6 tunnels: correct conditional build of MPLS and IPv6
Using a combination if #if conditionals and goto labels to unwind
tunnel4_init seems unwieldy. This patch takes a simpler approach of
directly unregistering previously registered protocols when an error
occurs.

This fixes a number of problems with the current implementation
including the potential presence of labels when they are unused
and the potential absence of unregister code when it is needed.

Fixes: 8afe97e5d4 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:27:06 -07:00
Michal Kubeček a612769774 udp: prevent bugcheck if filter truncates packet too much
If socket filter truncates an udp packet below the length of UDP header
in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
kernel is configured that way) can be easily enforced by an unprivileged
user which was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 12:43:15 -07:00
Liping Zhang 64b87639c9 netfilter: conntrack: fix race between nf_conntrack proc read and hash resize
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
happen, because reader can observe a newly allocated hash but the old size
(or vice versa). So oops will happen like follows:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
  IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
  Call Trace:
  [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
  [<ffffffff81261a1c>] seq_read+0x2cc/0x390
  [<ffffffff812a8d62>] proc_reg_read+0x42/0x70
  [<ffffffff8123bee7>] __vfs_read+0x37/0x130
  [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
  [<ffffffff8123cf75>] vfs_read+0x95/0x140
  [<ffffffff8123e475>] SyS_read+0x55/0xc0
  [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4

It is very easy to reproduce this kernel crash.
1. open one shell and input the following cmds:
  while : ; do
    echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
  done
2. open more shells and input the following cmds:
  while : ; do
    cat /proc/net/nf_conntrack
  done
3. just wait a monent, oops will happen soon.

The solution in this patch is based on Florian's Commit 5e3c61f981
("netfilter: conntrack: fix lookup race during hash resize"). And
add a wrapper function nf_conntrack_get_ht to get hash and hsize
suggested by Florian Westphal.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11 11:38:57 +02:00
Eric Dumazet fa17806cde ipv4: do not abuse GFP_ATOMIC in inet_netconf_notify_devconf()
inet_forward_change() runs with RTNL held.
We are allowed to sleep if required.

If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
we no longer have to use GFP_ATOMIC allocations in
inet_netconf_notify_devconf(), meaning we are less likely to miss
notifications under memory pressure, and wont touch precious memory
reserves either and risk dropping incoming packets.

inet_netconf_get_devconf() can also use GFP_KERNEL allocation.

Fixes: edc9e74893 ("rtnl/ipv4: use netconf msg to advertise forwarding status")
Fixes: 9e5511106f ("rtnl/ipv4: add support of RTM_GETNETCONF")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 18:12:25 -04:00
Simon Horman 1b69e7e6c4 ipip: support MPLS over IPv4
Extend the IPIP driver to support MPLS over IPv4. The implementation is an
extension of existing support for IPv4 over IPv4 and is based of multiple
inner-protocol support for the SIT driver.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:45:56 -04:00
Simon Horman 8afe97e5d4 tunnels: support MPLS over IPv4 tunnels
Extend tunnel support to MPLS over IPv4.  The implementation extends the
existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS
over IPv4.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 17:45:56 -04:00
Thomas Gleixner f3438bc781 timers, net/ipv4/inet: Initialize connection request timers as pinned
Pinned timers must carry the pinned attribute in the timer structure
itself, so convert the code to the new API.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.617891430@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:06 +02:00
James Morris d011a4d861 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/selinux into next 2016-07-07 10:15:34 +10:00
David S. Miller 30d0844bdc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx5/core/en.h
	drivers/net/ethernet/mellanox/mlx5/core/en_main.c
	drivers/net/usb/r8152.c

All three conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 10:35:22 -07:00
David S. Miller ae3e4562e2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next,
they are:

1) Don't use userspace datatypes in bridge netfilter code, from
   Tobin Harding.

2) Iterate only once over the expectation table when removing the
   helper module, instead of once per-netns, from Florian Westphal.

3) Extra sanitization in xt_hook_ops_alloc() to return error in case
   we ever pass zero hooks, xt_hook_ops_alloc():

4) Handle NFPROTO_INET from the logging core infrastructure, from
   Liping Zhang.

5) Autoload loggers when TRACE target is used from rules, this doesn't
   change the behaviour in case the user already selected nfnetlink_log
   as preferred way to print tracing logs, also from Liping Zhang.

6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields
   by cache lines, increases the size of entries in 11% per entry.
   From Florian Westphal.

7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.

8) Remove useless defensive check in nf_logger_find_get() from Shivani
   Bhardwaj.

9) Remove zone extension as place it in the conntrack object, this is
   always include in the hashing and we expect more intensive use of
   zones since containers are in place. Also from Florian Westphal.

10) Owner match now works from any namespace, from Eric Bierdeman.

11) Make sure we only reply with TCP reset to TCP traffic from
    nf_reject_ipv4, patch from Liping Zhang.

12) Introduce --nflog-size to indicate amount of network packet bytes
    that are copied to userspace via log message, from Vishwanath Pai.
    This obsoletes --nflog-range that has never worked, it was designed
    to achieve this but it has never worked.

13) Introduce generic macros for nf_tables object generation masks.

14) Use generation mask in table, chain and set objects in nf_tables.
    This allows fixes interferences with ongoing preparation phase of
    the commit protocol and object listings going on at the same time.
    This update is introduced in three patches, one per object.

15) Check if the object is active in the next generation for element
    deactivation in the rbtree implementation, given that deactivation
    happens from the commit phase path we have to observe the future
    status of the object.

16) Support for deletion of just added elements in the hash set type.

17) Allow to resize hashtable from /proc entry, not only from the
    obscure /sys entry that maps to the module parameter, from Florian
    Westphal.

18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised
    anymore since we tear down the ruleset whenever the netdevice
    goes away.

19) Support for matching inverted set lookups, from Arturo Borrero.

20) Simplify the iptables_mangle_hook() by removing a superfluous
    extra branch.

21) Introduce ether_addr_equal_masked() and use it from the netfilter
    codebase, from Joe Perches.

22) Remove references to "Use netfilter MARK value as routing key"
    from the Netfilter Kconfig description given that this toggle
    doesn't exists already for 10 years, from Moritz Sichert.

23) Introduce generic NF_INVF() and use it from the xtables codebase,
    from Joe Perches.

24) Setting logger to NONE via /proc was not working unless explicit
    nul-termination was included in the string. This fixes seems to
    leave the former behaviour there, so we don't break backward.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 09:15:15 -07:00
Joe Perches c37a2dfa67 netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF
netfilter uses multiple FWINV #defines with identical form that hide a
specific structure variable and dereference it with a invflags member.

$ git grep "#define FWINV"
include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))

Consolidate these macros into a single NF_INVF macro.

Miscellanea:

o Neaten the alignment around these uses
o A few lines are > 80 columns for intelligibility

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-03 10:55:07 +02:00
Pablo Neira Ayuso 468b021b94 netfilter: x_tables: simplify ip{6}table_mangle_hook()
No need for a special case to handle NF_INET_POST_ROUTING, this is
basically the same handling as for prerouting, input, forward.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01 16:37:02 +02:00
Eric Dumazet 19689e38ec tcp: md5: use kmalloc() backed scratch areas
Some arches have virtually mapped kernel stacks, or will soon have.

tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.

David says that using percpu storage is also problematic on non SMP
builds.

Just use kmalloc() to allocate scratch areas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 04:02:55 -04:00
Shmulik Ladkani fedbb6b4ff ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).

However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).

OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
 - In case of local sockets, sk is same as skb->sk;
 - In case of a udp tunnel, sk is the tunneling socket.

Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 09:02:48 -04:00
Andrey Vagin b1ed4c4fa9 tcp: add an ability to dump and restore window parameters
We found that sometimes a restored tcp socket doesn't work.

A reason of this bug is incorrect window parameters and in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).

Data from a send queue is sent only if there is enough space in a
window, so when we restore unacked data, we need to expand a window to
fit this data.

This was in a first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"

Then Alexey recommended me to restore window parameters instead of
adjusted them according with data in a sent queue. This sounds resonable.

rcv_wnd has to be restored, because it was reported to another side
and the offered window is never shrunk.
One of reasons why we need to restore snd_wnd was described above.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 08:15:31 -04:00
David S. Miller ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Eric Dumazet a3d2e9f8eb tcp: do not send too big packets at retransmit time
Arjun reported a bug in TCP stack and bisected it to a recent commit.

In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.

However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.

Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.

Thanks to Arjun for reporting the bug and running the tests !

Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-29 05:25:11 -04:00
David Ahern 637c841dd7 net: diag: Add support to filter on device index
Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-28 05:25:04 -04:00
Tom Goff 70a0dec451 ipmr/ip6mr: Initialize the last assert time of mfc entries.
This fixes wrong-interface signaling on 32-bit platforms for entries
created when jiffies > 2^31 + MFC_ASSERT_THRESH.

Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-28 04:14:09 -04:00
Huw Davies 56ac42bc94 ipv6: Allow request socks to contain IPv6 options.
If set, these will take precedence over the parent's options during
both sending and child creation.  If they're not set, the parent's
options (if any) will be used.

This is to allow the security_inet_conn_request() hook to modify the
IPv6 options in just the same way that it already may do for IPv4.

Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-27 15:05:28 -04:00
Huw Davies 3faa8f982f netlabel: Move bitmap manipulation functions to the NetLabel core.
This is to allow the CALIPSO labelling engine to use these.

Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-27 15:02:51 -04:00
Liping Zhang e1dbbc5907 netfilter: nf_reject_ipv4: don't send tcp RST if the packet is non-TCP
In iptables, if the user add a rule to send tcp RST and specify the
non-TCP protocol, such as UDP, kernel will reject this request. But
in nftables, this validity check only occurs in nft tool, i.e. only
in userspace.

This means that user can add such a rule like follows via nfnetlink:
  "nft add rule filter forward ip protocol udp reject with tcp reset"

This will generate some confusing tcp RST packets. So we should send
tcp RST only when it is TCP packet.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:22 +02:00
Herbert Xu 962fcef33b esp: Fix ESN generation under UDP encapsulation
Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.

This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.

Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 11:52:00 -04:00
Eric Dumazet 20e1954fe2 ipv6: RFC 4884 partial support for SIT/GRE tunnels
When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.

This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.

It also avoids complex skb logic if the data_len is not
a multiple of 8.

RFC states :

   The "original datagram" field MUST contain at least 128 octets.
   If the original datagram did not contain 128 octets, the
   "original datagram" field MUST be zero padded to 128 octets.

In practice routers use 128 bytes of original datagram, not more.

Initial translation was added in commit ca15a078bd
("sit: generate icmpv6 error when receiving icmpv4 error")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:39 -07:00
Eric Dumazet 9b8c6d7bf2 gre: better support for ICMP messages for gre+ipv6
ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:39 -07:00
Alexander Duyck 1938ee1fd3 net: Remove deprecated tunnel specific UDP offload functions
Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:32 -07:00
Alexander Duyck 7c46a640de net: Merge VXLAN and GENEVE push notifiers into a single notifier
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier.  The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.

In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed.  The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number.  In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.

I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names.  I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:29 -07:00
Alexander Duyck e7b3db5e60 net: Combine GENEVE and VXLAN port notifiers into single functions
This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path.  This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 20:23:29 -07:00
Arnd Bergmann 318d3cc04e net: xfrm: fix old-style declaration
Modern C standards expect the '__inline__' keyword to come before the return
type in a declaration, and we get a couple of warnings for this with "make W=1"
in the xfrm{4,6}_policy.c files:

net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
 static int inline xfrm6_net_sysctl_init(struct net *net)
net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
 static void inline xfrm6_net_sysctl_exit(struct net *net)
net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
 static int inline xfrm4_net_sysctl_init(struct net *net)
net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
 static void inline xfrm4_net_sysctl_exit(struct net *net)

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 22:06:30 -07:00
Eric Dumazet e582615ad3 gre: fix error handler
1) gre_parse_header() can be called from gre_err()

   At this point transport header points to ICMP header, not the inner
header.

2) We can not really change transport header as ipgre_err() will later
assume transport header still points to ICMP header (using icmp_hdr())

3) pskb_may_pull() logic in gre_parse_header() really works
  if we are interested at zone pointed by skb->data

4) As Jiri explained in commit b7f8fe251e ("gre: do not pull header in
ICMP error processing") we should not pull headers in error handler.

So this fix :

A) changes gre_parse_header() to use skb->data instead of
skb_transport_header()

B) Adds a nhs parameter to gre_parse_header() so that we can skip the
not pulled IP header from error path.
  This offset is 0 for normal receive path.

C) remove obsolete IPV6 includes

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 22:15:21 -07:00
Philip Prindeville 22a59be8b7 net: ipv4: Add ability to have GRE ignore DF bit in IPv4 payloads
In the presence of firewalls which improperly block ICMP Unreachable
    (including Fragmentation Required) messages, Path MTU Discovery is
    prevented from working.

    A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
    is done for other payloads like AppleTalk--and doing transparent
    fragmentation and reassembly.

    Redux includes the enforcement of mutual exclusion between this feature
    and Path MTU Discovery as suggested by Alexander Duyck.

    Cc: Alexander Duyck <alexander.duyck@gmail.com>
    Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
    Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 21:39:59 -07:00
Neal Cardwell dcf1158b27 tcp: return sizeof tcp_dctcp_info in dctcp_get_info()
Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info).  There is no problem yet, but that union that may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:46:30 -07:00
Nicolas Dichtel da6f1da819 ovs/gre: fix rtnl notifications on iface deletion
The function gretap_fb_dev_create() (only used by ovs) never calls
rtnl_configure_link(). The consequence is that dev->rtnl_link_state is
never set to RTNL_LINK_INITIALIZED.
During the deletion phase, the function rollback_registered_many() sends
a RTM_DELLINK only if dev->rtnl_link_state is set to RTNL_LINK_INITIALIZED.

Fixes: b2acd1dc39 ("openvswitch: Use regular GRE net_device instead of vport")
CC: Thomas Graf <tgraf@suug.ch>
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:21:44 -07:00
Nicolas Dichtel 106da663ff ovs/gre,geneve: fix error path when creating an iface
After ipgre_newlink()/geneve_configure() call, the netdev is registered.

Fixes: 7e059158d5 ("vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices")
CC: David Wragg <david@weave.works>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:21:44 -07:00
Su, Xuemin d1e37288c9 udp reuseport: fix packet of same flow hashed to different socket
There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:

1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().

2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.

That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.

This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
2) From the same host send udp packets to 127.0.0.1:40000, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:40000, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:40000.

It's the same case for IPv6, and this patch also fixes that.

Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 17:23:09 -04:00
Hannes Frederic Sowa b46d9f625b ipv4: fix checksum annotation in udp4_csum_init
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 15:28:04 -04:00
David S. Miller 86ef7f9cbf ipconfig: Protect ic_addrservaddr with IPCONFIG_DYNAMIC.
>> net/ipv4/ipconfig.c:130:15: warning: 'ic_addrservaddr' defined but not used [-Wunused-variable]
    static __be32 ic_addrservaddr = NONE; /* IP Address of the IP addresses'server */

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11 20:40:24 -07:00
Ben Dooks 0b392be9a8 net: ipconfig: avoid warning by making ic_addrservaddr static
The symbol ic_addrservaddr is not static, but has no declaration
to match so make it static to fix the following warning:

net/ipv4/ipconfig.c:130:8: warning: symbol 'ic_addrservaddr' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:28:15 -07:00
Lawrence Brakmo 699fafafab tcp: add NV congestion control
TCP-NV (New Vegas) is a major update to TCP-Vegas.
An earlier version of NV was presented at 2010's LPC.
It is a delayed based congestion avoidance for the
data center. This version has been tested within a
10G rack where the HW RTTs are 20-50us and with
1 to 400 flows.

A description of TCP-NV, including implementation
details as well as experimental results, can be found at:
http://www.brakmo.org/networking/tcp-nv/TCPNV.html

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:07:49 -07:00
Lawrence Brakmo 6f094b9ec6 tcp: add in_flight to tcp_skb_cb
Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:07:49 -07:00
David S. Miller 1578b0a5e9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/sched/act_police.c
	net/sched/sch_drr.c
	net/sched/sch_hfsc.c
	net/sched/sch_prio.c
	net/sched/sch_red.c
	net/sched/sch_tbf.c

In net-next the drop methods of the packet schedulers got removed, so
the bug fixes to them in 'net' are irrelevant.

A packet action unload crash fix conflicts with the addition of the
new firstuse timestamp.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 11:52:24 -07:00
David Ahern 96c63fa739 net: Add l3mdev rule
Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
loopup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.

This patch introduces a new rule attribute l3mdev. The l3mdev rule
means the table id used for the lookup is pulled from the L3 master
device (e.g., VRF) rather than being statically defined. With the
l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
rule per address family (IPv4 and IPv6).

If an admin wishes to insert higher priority rules for specific VRFs
those rules will co-exist with the l3mdev rule. This capability means
current VRF scripts will co-exist with this new simpler implementation.

Currently, the rules list for both ipv4 and ipv6 look like this:
    $ ip  ru ls
    1000:       from all oif vrf1 lookup 1001
    1000:       from all iif vrf1 lookup 1001
    1000:       from all oif vrf2 lookup 1002
    1000:       from all iif vrf2 lookup 1002
    1000:       from all oif vrf3 lookup 1003
    1000:       from all iif vrf3 lookup 1003
    1000:       from all oif vrf4 lookup 1004
    1000:       from all iif vrf4 lookup 1004
    1000:       from all oif vrf5 lookup 1005
    1000:       from all iif vrf5 lookup 1005
    1000:       from all oif vrf6 lookup 1006
    1000:       from all iif vrf6 lookup 1006
    1000:       from all oif vrf7 lookup 1007
    1000:       from all iif vrf7 lookup 1007
    1000:       from all oif vrf8 lookup 1008
    1000:       from all iif vrf8 lookup 1008
    ...
    32765:      from all lookup local
    32766:      from all lookup main
    32767:      from all lookup default

With the l3mdev rule the list is just the following regardless of the
number of VRFs:
    $ ip ru ls
    1000:       from all lookup [l3mdev table]
    32765:      from all lookup local
    32766:      from all lookup main
    32767:      from all lookup default

(Note: the above pretty print of the rule is based on an iproute2
       prototype. Actual verbage may change)

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 11:36:02 -07:00
Pau Espin Pedrol e00431bc93 tcp: accept RST if SEQ matches right edge of right-most SACK block
RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.

However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently run into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.

In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which gets usually lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in few seconds, allocating all the
resources of the server easily.

For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.

This patch was tested in a 3.18 kernel and probed to improve the
situation in the scenario described above.

Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 00:36:18 -07:00
Tom Herbert c1e48af796 gue: Implement direction IP encapsulation
This patch implements direct encapsulation of IPv4 and IPv6 packets
in UDP. This is done a version "1" of GUE and as explained in I-D
draft-ietf-nvo3-gue-03.

Changes here are only in the receive path, fou with IPxIPx already
supports the transmit side. Both the normal receive path and
GRO path are modified to check for GUE version and check for
IP version in the case that GUE version is "1".

Tested:

IPIP with direct GUE encap
  1 TCP_STREAM
    4530 Mbps
  200 TCP_RR
    1297625 tps
    135/232/444 90/95/99% latencies

IP4IP6 with direct GUE encap
  1 TCP_STREAM
    4903 Mbps
  200 TCP_RR
    1184481 tps
    149/253/473 90/95/99% latencies

IP6IP6 direct GUE encap
  1 TCP_STREAM
   5146 Mbps
  200 TCP_RR
    1202879 tps
    146/251/472 90/95/99% latencies

SIT with direct GUE encap
  1 TCP_STREAM
    6111 Mbps
  200 TCP_RR
    1250337 tps
    139/241/467 90/95/99% latencies

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 23:51:14 -07:00
Michal Kubeček 30759219f5 net: disable fragment reassembly if high_thresh is zero
Before commit 6d7b857d54 ("net: use lib/percpu_counter API for
fragmentation mem accounting"), setting the reassembly high threshold
to 0 prevented fragment reassembly as first fragment would be always
evicted before second could be added to the queue. While inefficient,
some users apparently relied on this method.

Since the commit mentioned above, a percpu counter is used for
reassembly memory accounting and high batch size avoids taking slow path
in most common scenarios. As a result, a whole full sized packet can be
reassembled without the percpu counter's main counter changing its value
so that even with high_thresh set to 0, fragmented packets can be still
reassembled and processed.

Add explicit check preventing reassembly if high threshold is zero.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-05 22:56:42 -04:00
Marcelo Ricardo Leitner ae7ef81ef0 skbuff: introduce skb_gso_validate_mtu
skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.

This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given TMU, and improve the check.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03 19:37:21 -04:00
Eric Dumazet ce25d66ad5 Possible problem with e6afc8ac ("udp: remove headers from UDP packets before queueing")
Paul Moore tracked a regression caused by a recent commit, which
mistakenly assumed that sk_filter() could be avoided if socket
had no current BPF filter.

The intent was to avoid udp_lib_checksum_complete() overhead.

But sk_filter() also checks skb_pfmemalloc() and
security_sock_rcv_skb(), so better call it.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Moore <paul@paul-moore.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: samanthakumar <samanthakumar@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-02 18:29:49 -04:00
Ezequiel Garcia 049bbf589e ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=n
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
moves the default TTL assignment, and as side-effect IPv4 TTL now
has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).

The sysctl_ip_default_ttl is fundamental for IP to work properly,
as it provides the TTL to be used as default. The defautl TTL may be
used in ip_selected_ttl, through the following flow:

  ip_select_ttl
    ip4_dst_hoplimit
      net->ipv4.sysctl_ip_default_ttl

This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
in net_init_net, called during ipv4's initialization.

Without this commit, a kernel built without sysctl support will send
all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
with setsockopt).

Given a similar issue might appear on the other knobs that were
namespaceify, this commit also moves them.

Fixes: fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-23 14:32:06 -07:00
Hannes Frederic Sowa e5aed006be udp: prevent skbs lingering in tunnel socket queues
In case we find a socket with encapsulation enabled we should call
the encap_recv function even if just a udp header without payload is
available. The callbacks are responsible for correctly verifying and
dropping the packets.

Also, in case the header validation fails for geneve and vxlan we
shouldn't put the skb back into the socket queue, no one will pick
them up there.  Instead we can simply discard them in the respective
encap_recv functions.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 19:56:02 -04:00
Tom Herbert b8921ca83e ip4ip6: Support for GSO/GRO
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:17 -04:00
Tom Herbert 058214a4d1 ip6_tun: Add infrastructure for doing encapsulation
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting encap hlen, setting up encap on a tunnel, performing
encapsulation operation.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:16 -04:00
Tom Herbert 5f914b6812 fou: Support IPv6 in fou
This patch adds receive path support for IPv6 with fou.

- Add address family to fou structure for open sockets. This supports
  AF_INET and AF_INET6. Lookups for fou ports are performed on both the
  port number and family.
- In fou and gue receive adjust tot_len in IPv4 header or payload_len
  based on address family.
- Allow AF_INET6 in FOU_ATTR_AF netlink attribute.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:16 -04:00
Tom Herbert dc969b81eb fou: Split out {fou,gue}_build_header
Create __fou_build_header and __gue_build_header. These implement the
protocol generic parts of building the fou and gue header.
fou_build_header and gue_build_header implement the IPv4 specific
functions and call the __*_build_header functions.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:16 -04:00
Tom Herbert 440924bbc0 fou: Call setup_udp_tunnel_sock
Use helper function to set up UDP tunnel related information for a fou
socket.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:16 -04:00
Tom Herbert 55c2bc1432 net: Cleanup encap items in ip_tunnels.h
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they call be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:16 -04:00
Tom Herbert 7e13318daa net: define gso types for IPx over IPv4 and IPv6
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to described IP in IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:15 -04:00
Tom Herbert 5c7cdf339a gso: Remove arbitrary checks for unsupported GSO
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place that stack should not allow setting it.

This is a code simplication especially for add new GSO types.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:03:15 -04:00
Eric Dumazet ea1627c20c tcp: minor optimizations around tcp_hdr() usage
tcp_hdr() is slightly more expensive than using skb->data in contexts
where we know they point to the same byte.

In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
as tcp header has not been pulled yet.

In output path, the same can be said when we just pushed the tcp header
in the skb, in tcp_transmit_skb() and tcp_make_synack()

Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
so that compiler can further optimize and avoid a reload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 13:46:23 -04:00
Eric Dumazet 2632616bc4 sock: propagate __sock_cmsg_send() error
__sock_cmsg_send() might return different error codes, not only -EINVAL.

Fixes: 24025c465f ("ipv4: process socket-level control messages in IPv4")
Fixes: ad1e46a837 ("ipv6: process socket-level control messages in IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 13:46:23 -04:00
David S. Miller 909b27f706 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.

The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.

Conflicts:
	drivers/net/wireless/intel/iwlwifi/mvm/tx.c
	net/ipv4/ip_gre.c
	net/netfilter/nf_conntrack_core.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-15 13:32:48 -04:00
Paolo Abeni 626abd59e5 net/route: enforce hoplimit max value
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.

The caller can i.e. set hoplimit to 256, and when such route will
 be used, packets will be sent with hoplimit/ttl equal to 0.

This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.

This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14 15:33:32 -04:00
Alexander Duyck ed7cbbce54 udp: Resolve NULL pointer dereference over flow-based vxlan device
While testing an OpenStack configuration using VXLANs I saw the following
call trace:

 RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
 RSP: 0018:ffff88103867bc50  EFLAGS: 00010286
 RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
 RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
 RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
 R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
 R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
 FS:  0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
 Stack:
  ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
  ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
  ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
 Call Trace:
  [<ffffffff81576515>] ? skb_checksum+0x35/0x50
  [<ffffffff815733c0>] ? skb_push+0x40/0x40
  [<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
  [<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
  [<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
  [<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
  [<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
  [<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
  [<ffffffff815899d0>] net_rx_action+0x160/0x380
  [<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
  [<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
  [<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
  [<ffffffff8109a400>] ? sort_range+0x30/0x30
  [<ffffffff81096da8>] kthread+0xd8/0xf0
  [<ffffffff81693c82>] ret_from_fork+0x22/0x40
  [<ffffffff81096cd0>] ? kthread_park+0x60/0x60

The following trace is seen when receiving a DHCP request over a flow-based
VXLAN tunnel.  I believe this is caused by the metadata dst having a NULL
dev value and as a result dev_net(dev) is causing a NULL pointer dereference.

To resolve this I am replacing the check for skb_dst(skb)->dev with just
skb->dev.  This makes sense as the callers of this function are usually in
the receive path and as such skb->dev should always be populated.  In
addition other functions in the area where these are called are already
using dev_net(skb->dev) to determine the namespace the UDP packet belongs
in.

Fixes: 63058308cd ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-13 01:56:14 -04:00
Haishuang Yan da73b4e953 gre: Fix wrong tpi->proto in WCCP
When dealing with WCCP in gre6 tunnel, it sets the wrong tpi->protocol,
that is, ETH_P_IP instead of ETH_P_IPV6 for the encapuslated traffic.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 16:53:58 -04:00
David Ahern 0b922b7a82 net: original ingress device index in PKTINFO
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.

The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 19:31:40 -04:00
David Ahern 74b20582ac net: l3mdev: Add hook in ip and ipv6
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.

This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.

dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 19:31:40 -04:00
Jiri Benc e271c7b442 gre: do not keep the GRE header around in collect medata mode
For ipgre interface in collect metadata mode, it doesn't make sense for the
interface to be of ARPHRD_IPGRE type. The outer header of received packets
is not needed, as all the information from it is present in metadata_dst. We
already don't set ipgre_header_ops for collect metadata interfaces, which is
the only consumer of mac_header pointing to the outer IP header.

Just set the interface type to ARPHRD_NONE in collect metadata mode for
ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
mac_header.

Fixes: a64b04d86d ("gre: do not assign header_ops in collect metadata mode")
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 15:16:32 -04:00
Lawrence Brakmo 756ee1729b tcp: replace cnt & rtt with struct in pkts_acked()
Replace 2 arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This will allow adding more
information without having to modify existing congestion control
modules (tcp_nv in particular needs bytes in flight when packet
was sent).

As proposed by Neal Cardwell in his comments to the tcp_nv patch.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 14:43:19 -04:00
Eric Dumazet 10a81980fc tcp: refresh skb timestamp at retransmit time
In the very unlikely case __tcp_retransmit_skb() can not use the cloning
done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing
the copy and transmit, otherwise TCP TS val will be an exact copy of
original transmit.

Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10 15:58:41 -04:00
David Ahern 1ff23beebd net: l3mdev: Allow send on enslaved interface
Allow udp and raw sockets to send by oif that is an enslaved interface
versus the l3mdev/VRF device. For example, this allows BFD to use ifindex
from IP_PKTINFO on a receive to send a response without the need to
convert to the VRF index. It also allows ping and ping6 to work when
specifying an enslaved interface (e.g., ping -I swp1 <ip>) which is
a natural use case.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 22:33:52 -04:00
David S. Miller e800072c18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
In netdevice.h we removed the structure in net-next that is being
changes in 'net'.  In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.

The mlx5 conflicts have to do with vxlan support dependencies.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 15:59:24 -04:00
David S. Miller e8ed77dfa9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.

Several updates for IPVS from Marco Angaroni:

1) Allow SIP connections originating from real-servers to be load
   balanced by the SIP persistence engine as is already implemented
   in the other direction.

2) Release connections immediately for One-packet-scheduling (OPS)
   in IPVS, instead of making it via timer and rcu callback.

3) Skip deleting conntracks for each one packet in OPS, and don't call
   nf_conntrack_alter_reply() since no reply is expected.

4) Enable drop on exhaustion for OPS + SIP persistence.

Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:

5) Move conntrack generation counter out of conntrack pernet structure
   since this is only used by the init_ns to allow hash resizing.

6) Use get_random_once() from packet path to collect hash random seed
    instead of our compound.

7) Don't disable BH from ____nf_conntrack_find() for statistics,
   use NF_CT_STAT_INC_ATOMIC() instead.

8) Fix lookup race during conntrack hash resizing.

9) Introduce clash resolution on conntrack insertion for connectionless
   protocol.

Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:

11) Use a single conntrack hashtable for all namespaces. Include netns
    in object comparisons and make it part of the hash calculation.
    Adapt early_drop() to consider netns.

12) Use single expectation and NAT hashtable for all namespaces.

13) Use a single slab cache for all namespaces for conntrack objects.

14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
    conntrack counter tells us the table is empty (ie. equals zero).

Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.

15) Parse element flags from element deletion path and pass it up to the
    backend set implementation.

16) Allow adjacent intervals in the rbtree set type for dynamic interval
    updates.

17) Add support to set connlabel from nf_tables, from Florian Westphal.

18) Allow set names up to 32 bytes in nf_tables.

Several x_tables fixes and updates:

19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
    from Andrzej Hajda.

And finally, miscelaneous netfilter updates such as:

20) Disable automatic helper assignment by default. Note this proc knob
    was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
    disable automatic helper assignment") 4 years ago to start moving
    towards explicit conntrack helper configuration via iptables CT
    target.

21) Get rid of obsolete and inconsistent debugging instrumentation
    in x_tables.

22) Remove unnecessary check for null after ip6_route_output().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 15:02:58 -04:00
Jarno Rajahalme 229740c631 udp_offload: Set encapsulation before inner completes.
UDP tunnel segmentation code relies on the inner offsets being set for
an UDP tunnel GSO packet, but the inner *_complete() functions will
set the inner offsets only if 'encapsulation' is set before calling
them.  Currently, udp_gro_complete() sets 'encapsulation' only after
the inner *_complete() functions are done.  This causes the inner
offsets having invalid values after udp_gro_complete() returns, which
in turn will make it impossible to properly segment the packet in case
it needs to be forwarded, which would be visible to the user either as
invalid packets being sent or as packet loss.

This patch fixes this by setting skb's 'encapsulation' in
udp_gro_complete() before calling into the inner complete functions,
and by making each possible UDP tunnel gro_complete() callback set the
inner_mac_header to the beginning of the tunnel payload.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 18:25:26 -04:00
Jarno Rajahalme 43b8448cd7 udp_tunnel: Remove redundant udp_tunnel_gro_complete().
The setting of the UDP tunnel GSO type is already performed by
udp[46]_gro_complete().

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 18:25:26 -04:00
Eric Dumazet 47dcc20a39 ipv4: tcp: ip_send_unicast_reply() is not BH safe
I forgot that ip_send_unicast_reply() is not BH safe (yet).

Disabling preemption before calling it was not a good move.

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andres Lagar-Cavilla  <andreslc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 16:02:43 -04:00
David Ahern b3b4663c97 net: vrf: Create FIB tables on link create
Tables have to exist for VRFs to function. Ensure they exist
when VRF device is created.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 15:51:47 -04:00
Florian Westphal 0a93aaedc4 netfilter: conntrack: use a single expectation table for all namespaces
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the expectation table.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-06 11:50:01 +02:00
Florian Westphal 03d7dc5cdf netfilter: conntrack: check netns when walking expect hash
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-06 11:50:01 +02:00
Pablo Neira Ayuso d7cdf81657 netfilter: x_tables: get rid of old and inconsistent debugging
The dprintf() and duprintf() functions are enabled at compile time,
these days we have better runtime debugging through pr_debug() and
static keys.

On top of this, this debugging is so old that I don't expect anyone
using this anymore, so let's get rid of this.

IP_NF_ASSERT() is still left in place, although this needs that
NETFILTER_DEBUG is enabled, I think these assertions provide useful
context information when reading the code.

Note that ARP_NF_ASSERT() has been removed as there is no user of
this.

Kill also DEBUG_ALLOW_ALL and a couple of pr_error() and pr_debug()
spots that are inconsistently placed in the code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:51 +02:00
Florian Westphal 56d52d4892 netfilter: conntrack: use a single hashtable for all namespaces
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.

Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.

NAT bysrc and expectation hash is still per namespace, those will
changed too soon.

Future patch will also make conntrack object slab cache global again.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:47 +02:00
Florian Westphal e0c7d47221 netfilter: conntrack: check netns when comparing conntrack objects
Once we place all conntracks in the same hash table we must also compare
the netns pointer to skip conntracks that belong to a different namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:46 +02:00
Florian Westphal 245cfdcaba netfilter: conntrack: small refactoring of conntrack seq_printf
The iteration process is lockless, so we test if the conntrack object is
eligible for printing (e.g. is AF_INET) after obtaining the reference
count.

Once we put all conntracks into same hash table we might see more
entries that need to be skipped.

So add a helper and first perform the test in a lockless fashion
for fast skip.

Once we obtain the reference count, just repeat the check.

Note that this refactoring also includes a missing check for unconfirmed
conntrack entries due to slab rcu object re-usage, so they need to be
skipped since they are not part of the listing.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:45 +02:00
Eric Dumazet 777c6ae57e tcp: two more missing bh disable
percpu_counter only have protection against preemption.

TCP stack uses them possibly from BH, so we need BH protection
in contexts that could be run in process context

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 23:47:54 -04:00
Eric Dumazet 614bdd4d6e tcp: must block bh in __inet_twsk_hashdance()
__inet_twsk_hashdance() might be called from process context,
better block BH before acquiring bind hash and established locks

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
Eric Dumazet 46cc6e4976 tcp: fix lockdep splat in tcp_snd_una_update()
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit & SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
David S. Miller 32b583a0cb Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-05-04

1) The flowcache can hit an OOM condition if too
   many entries are in the gc_list. Fix this by
   counting the entries in the gc_list and refuse
   new allocations if the value is too high.

2) The inner headers are invalid after a xfrm transformation,
   so reset the skb encapsulation field to ensure nobody tries
   access the inner headers. Otherwise tunnel devices stacked
   on top of xfrm may build the outer headers based on wrong
   informations.

3) Add pmtu handling to vti, we need it to report
   pmtu informations for local generated packets.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:35:31 -04:00
Jiri Benc 125372faa4 gre: receive also TEB packets for lwtunnels
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows to have mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.

To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:32 -04:00
Jiri Benc 244a797bdc gre: move iptunnel_pull_header down to ipgre_rcv
This will allow to make the pull dependent on the tunnel type.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:31 -04:00
Jiri Benc 00b2034029 gre: remove superfluous pskb_may_pull
The call to gre_parse_header is either followed by iptunnel_pull_header, or
in the case of ICMP error path, the actual header is not accessed at all.

In the first case, iptunnel_pull_header will call pskb_may_pull anyway and
it's pointless to do it twice. The only difference is what call will fail
with what error code but the net effect is still the same in all call sites.

In the second case, pskb_may_pull is pointless, as skb->data is at the outer
IP header and not at the GRE header.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:31 -04:00
Jiri Benc f132ae7c46 gre: change gre_parse_header to return the header length
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.

In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.

This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 12:44:45 -04:00
Eric Dumazet d4011239f4 tcp: guarantee forward progress in tcp_sendmsg()
Under high rx pressure, it is possible tcp_sendmsg() never has a
chance to allocate an skb and loop forever as sk_flush_backlog()
would always return true.

Fix this by calling sk_flush_backlog() only if one skb had been
allocated and filled before last backlog check.

Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 12:44:36 -04:00
David S. Miller cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Eric Dumazet 1d2077ac01 net: add __sock_wfree() helper
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:02:36 -04:00
Eric Dumazet 9580bf2edb net: relax expensive skb_unclone() in iptunnel_handle_offloads()
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:22:19 -04:00
Tom Herbert 182a352d2d gre: Create common functions for transmit
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
  - GRE checksum calculation. Move gre_checksum to gre.h.
  - Building a GRE header. Move GRE build_header and rename
    gre_build_header.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Tom Herbert 95f5c64c3c gre: Move utility functions to common headers
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).

These include:
  - GRE flag handling functions are move to gre.h
  - GRE build_header is moved to gre.h and renamed gre_build_header
  - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
  - iptunnel_pull_header is taken out of gre_parse_header. This is now
    done by caller. The header length is returned from gre_parse_header
    in an int* argument.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Eric Dumazet d41a69f1d3 tcp: make tcp_sendmsg() aware of socket backlog
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:26 -04:00
Eric Dumazet e61da9e259 udp: prepare for non BH masking at backlog processing
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet fb3477c0f4 tcp: do not block bh during prequeue processing
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet c10d9310ed tcp: do not assume TCP code is non preemptible
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Jiri Benc b7f8fe251e gre: do not pull header in ICMP error processing
iptunnel_pull_header expects that IP header was already pulled; with this
expectation, it pulls the tunnel header. This is not true in gre_err.
Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
to the IP header.

We cannot pull the tunnel header in this path. It's just a matter of not
calling iptunnel_pull_header - we don't need any of its effects.

Fixes: bda7bb4634 ("gre: Allow multiple protocol listener for gre protocol.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 00:19:58 -04:00
Craig Gallek 90e5d0db2b soreuseport: Fix TCP listener hash collision
I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.

Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 19:36:54 -04:00
Paolo Abeni f27337e16f ip_tunnel: fix preempt warning in ip tunnel creation/updating
After the commit e09acddf87 ("ip_tunnel: replace dst_cache with generic
implementation"), a preemption debug warning is triggered on ip4
tunnels updating; the dst cache helper needs to be invoked in unpreemptible
context.

We don't need to load the cache on tunnel update, so this commit fixes
the warning replacing the load with a dst cache reset, which is
preempt safe.

Fixes: e09acddf87 ("ip_tunnel: replace dst_cache with generic implementation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29 14:11:46 -04:00
Pablo Neira Ayuso 92b4423e3a netfilter: fix IS_ERR_VALUE usage
This is a forward-port of the original patch from Andrzej Hajda,
he said:

"IS_ERR_VALUE should be used only with unsigned long type.
Otherwise it can work incorrectly. To achieve this function
xt_percpu_counter_alloc is modified to return unsigned long,
and its result is assigned to temporary variable to perform
error checking, before assigning to .pcnt field.

The patch follows conclusion from discussion on LKML [1][2].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"

Original patch from Andrzej is here:

http://patchwork.ozlabs.org/patch/582970/

This patch has clashed with input validation fixes for x_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-29 11:02:33 +02:00
Eric Dumazet 0cef6a4c34 tcp: give prequeue mode some care
TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.

Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.

Current prequeue implementation has following issues :

1) It only checks size of the prequeue against sk_rcvbuf

   It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
   But we now have ~8MB values to cope with modern networking needs.
   We have to add sk_rmem_alloc in the equation, since out of order
   packets can definitely use up to sk_rcvbuf memory themselves.

2) Even with a fixed memory truesize check, prequeue can be filled
   by thousands of packets. When prequeue needs to be flushed, either
   from sofirq context (in tcp_prequeue() or timer code), or process
   context (in tcp_prequeue_process()), this adds a latency spike
   which is often not desirable.
   I added a fixed limit of 32 packets, as this translated to a max
   flush time of 60 us on my test hosts.

   Also note that all packets in prequeue are not accounted for tcp_mem,
   since they are not charged against sk_forward_alloc at this point.
   This is probably not a big deal.

Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:14:35 -04:00
Jiri Benc 946b636f17 gre: reject GUE and FOU in collect metadata mode
The collect metadata mode does not support GUE nor FOU. This might be
implemented later; until then, we should reject such config.

I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.

For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:09:37 -04:00
Jiri Benc 2090714e1d gre: build header correctly for collect metadata tunnels
In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
case, the interface is ARPHRD_IPGRE and the protocol to be used for
encapsulation is skb->protocol.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:02:45 -04:00
Jiri Benc a64b04d86d gre: do not assign header_ops in collect metadata mode
In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a0).
This is not the case, we're controlling the encapsulation addresses by
lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
mode does not make sense.

Although it would be more user firendly to reject requests that specify
both the collect metadata flag and a remote/local IP address, this would
break current users of gretap or introduce ugly code and differences in
handling ipgre and gretap configuration. Keep the current behavior of
remote/local IP address being ignored in such case.

v3: Back to v1, added explanation paragraph.
v2: Reject configuration specifying both remote/local address and collect
    metadata flag.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:02:44 -04:00
Martin KaFai Lau a166140e81 tcp: Handle eor bit when fragmenting a skb
When fragmenting a skb, the next_skb should carry
the eor from prev_skb.  The eor of prev_skb should
also be reset.

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:19 -04:00
Martin KaFai Lau a643b5d41c tcp: Handle eor bit when coalescing skb
This patch:
1. Prevent next_skb from coalescing to the prev_skb if
   TCP_SKB_CB(prev_skb)->eor is set
2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
   allowed

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:19 -04:00
Martin KaFai Lau c134ecb878 tcp: Make use of MSG_EOR in tcp_sendmsg
This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg.  The eor bit
will prevent data from appending to that skb in the future.

The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.

This patch handles the tcp_sendmsg case.  The followup patches
will handle other skb coalescing and fragment cases.

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:18 -04:00
Soheil Hassas Yeganeh 0a2cf20c3f tcp: remove SKBTX_ACK_TSTAMP since it is redundant
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Soheil Hassas Yeganeh 863c1fd981 tcp: remove an unnecessary check in tcp_tx_timestamp
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.

tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Eric Dumazet 02a1d6e7a6 net: rename NET_{ADD|INC}_STATS_BH()
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet b15084ec7d net: rename IP_UPD_PO_STATS_BH()
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet 98f619957e net: rename IP_ADD_STATS_BH()
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet b45386efa2 net: rename IP_INC_STATS_BH()
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 214d3f1f87 net: icmp: rename ICMPMSGIN_INC_STATS_BH()
Remove misleading _BH suffix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 90bbcc6083 net: tcp: rename TCP_INC_STATS_BH
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 02c223470c net: udp: rename UDP_INC_STATS_BH()
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet 5d3848bc33 net: rename ICMP_INC_STATS_BH()
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
Eric Dumazet 6aef70a851 net: snmp: kill various STATS_USER() helpers
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.

After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.

We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()

Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
David S. Miller c0cc53162a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes in the conflicts.

In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 15:43:10 -04:00
Nicolas Dichtel 6ed46d1247 sock_diag: align nlattr properly when needed
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f1
which is only in net-next right now, thus I didn't make a separate patch.

Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
Eric Dumazet 960a26282f net: better drop monitoring in ip{6}_recv_error()
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:48:10 -04:00
Eric Dumazet 0aea76d35c tcp: SYN packets are now simply consumed
We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.

We replace the kfree_skb() by consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.

[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:48:10 -04:00
Craig Gallek d296ba60d8 soreuseport: Resolve merge conflict for v4/v6 ordering fix
d894ba18d4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b58 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 13:27:54 -04:00
Paolo Abeni 391a20333b ipv4/fib: don't warn when primary address is missing if in_dev is dead
After commit fbd40ea018 ("ipv4: Don't do expensive useless work
during inetdev destroy.") when deleting an interface,
fib_del_ifaddr() can be executed without any primary address
present on the dead interface.

The above is safe, but triggers some "bug: prim == NULL" warnings.

This commit avoids warning if the in_dev is dead

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 23:26:29 -04:00
Eric Dumazet 10d3be5692 tcp-tso: do not split TSO packets at retransmit time
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.

This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.

Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
 - Less memory overhead, because write queues have less skbs
 - Less cpu overhead at ACK processing.
 - Better SACK processing, as lot of studies mentioned how
   awful linux was at this ;)
 - Less cpu overhead to send the rtx packets
   (IP stack traversal, netfilter traversal, drivers...)
 - Better latencies in presence of losses.
 - Smaller spikes in fq like packet schedulers, as retransmits
   are not constrained by TCP Small Queues.

1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:43:59 -04:00
Martin KaFai Lau 2de8023e7b tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp
When collapsing skbs, txstamp_ack also needs to be merged.

Retrans Collapse Test:
~~~~~~
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

BPF Output Before:
~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~
<...>-2027  [007] d.s.    79.765921: : ee_data:1459

Sacks Collapse Test:
~~~~~
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

BPF Output Before:
~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~
<...>-2049  [007] d.s.    89.185538: : ee_data:14599

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:06:43 -04:00
Martin KaFai Lau b51e13faf7 tcp: Carry txstamp_ack in tcp_fragment_tstamp
When a tcp skb is sliced into two smaller skbs (e.g. in
tcp_fragment() and tso_fragment()),  it does not carry
the txstamp_ack bit to the newly created skb if it is needed.
The end result is a timestamping event (SCM_TSTAMP_ACK) will
be missing from the sk->sk_error_queue.

This patch carries this bit to the new skb2
in tcp_fragment_tstamp().

BPF Output Before:
~~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~~
<...>-2050  [000] d.s.   100.928763: : ee_data:14599

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 14600) = 14600
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257

0.300 close(4) = 0
0.300 > F. 14601:14601(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:06:43 -04:00
David S. Miller 11afbff861 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:

1) Ensure we don't go over the ruleset blob boundaries in
   mark_source_chains().

2) Validate that target jumps land on an existing xt_entry. This extra
   sanitization comes with a performance penalty when loading the ruleset.

3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.

4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.

5) Make sure the minimal possible target size in x_tables.

6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.

7) Check that standard target size is valid.

8) More sanitization to ensure that the target_offset field is correct.

9) Add xt_check_entry_match() to validate that matches are well-formed.

10-12) Three patch to reduce the number of parameters in
    translate_compat_table() for {arp,ip,ip6}tables by using a container
    structure.

13) No need to return value from xt_compat_match_from_user(), so make
    it void.

14) Consolidate translate_table() so it can be used by compat code too.

15) Remove obsolete check for compat code, so we keep consistent with
    what was already removed in the native layout code (back in 2007).

16) Get rid of target jump validation from mark_source_chains(),
    obsoleted by #2.

17) Introduce xt_copy_counters_from_user() to consolidate counter
    copying, and use it from {arp,ip,ip6}tables.

18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
    functions.

19) Move nf_connlabel_match() to xt_connlabel.

20) Skip event notification if connlabel did not change.

21) Update of nf_connlabels_get() to make the upcoming nft connlabel
    support easier.

23) Remove spinlock to read protocol state field in conntrack.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 00:12:08 -04:00
Nicolas Dichtel 2175d87cc3 libnl: nla_put_msecs(): align on a 64-bit area
nla_data() is now aligned on a 64-bit area.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 20:13:24 -04:00
Nicolas Dichtel b46f6ded90 libnl: nla_put_be64(): align on a 64-bit area
nla_data() is now aligned on a 64-bit area.

A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
This function is removed in the next patch.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 20:13:24 -04:00
David S. Miller 1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Martin KaFai Lau cfea5a688e tcp: Merge tx_flags and tskey in tcp_shifted_skb
After receiving sacks, tcp_shifted_skb() will collapse
skbs if possible.  tx_flags and tskey also have to be
merged.

This patch reuses the tcp_skb_collapse_tstamp() to handle
them.

BPF Output Before:
~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~
<...>-2024  [007] d.s.    88.644374: : ee_data:14599

Packetdrill Script:
~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:40:55 -04:00
Martin KaFai Lau 082ac2d51d tcp: Merge tx_flags and tskey in tcp_collapse_retrans
If two skbs are merged/collapsed during retransmission, the current
logic does not merge the tx_flags and tskey.  The end result is
the SCM_TSTAMP_ACK timestamp could be missing for a packet.

The patch:
1. Merge the tx_flags
2. Overwrite the prev_skb's tskey with the next_skb's tskey

BPF Output Before:
~~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~~
packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:40:55 -04:00
Nicolas Dichtel a9a080422e ipmr: align RTA_MFC_STATS on 64-bit
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:22:13 -04:00
Xin Long b7de529c79 net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diag
EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
back to before jiffies_to_msecs() has been introduced.

Now we can remove it and use jiffies_to_msecs().

Suggested-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 13:55:33 -04:00
Martin KaFai Lau 479f85c366 tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
it could incorrectly think that a skb has already
been acked and queue a SCM_TSTAMP_ACK cmsg to the
sk->sk_error_queue.

In tcp_ack_tstamp(), it checks
'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
script, between() returns true but the tskey is actually not acked.
e.g. try between(3, 2, 1).

The fix is to replace between() with one before() and one !before().
By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
removed.

A packetdrill script is used to reproduce the dup ack scenario.
Due to the lacking cmsg support in packetdrill (may be I
cannot find it),  a BPF prog is used to kprobe to
sock_queue_err_skb() and print out the value of
serr->ee.ee_data.

Both the packetdrill and the bcc BPF script is attached at the end of
this commit message.

BPF Output Before Fix:
~~~~~~
      <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599

BPF Output After Fix:
~~~~~~
      <...>-2049  [000] d.s.   113.517039: : ee_data:1459
      <...>-2049  [000] d.s.   113.517253: : ee_data:14599

BCC BPF Script:
~~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 1460) = 1460
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 13:45:43 -04:00
Dan Carpenter 110361f41c udp: fix if statement in SIOCINQ ioctl
We deleted a line of code and accidentally made the "return put_user()"
part of the if statement when it's supposed to be unconditional.

Fixes: 9f9a45beaa ('udp: do not expect udp headers on ioctl SIOCINQ')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-18 13:40:08 -04:00
Alexander Duyck e0c20967c8 GRE: Add support for GRO/GSO of IPv6 GRE traffic
Since GRE doesn't really care about L3 protocol we can support IPv4 and
IPv6 using the same offloads.  With that being the case we can add a call
to register the offloads for IPv6 as a part of our GRE offload
initialization.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 19:09:13 -04:00
Alexander Duyck aed069df09 ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skb
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function.  This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.

In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 19:09:13 -04:00
Xin Long cb2050a7b8 sctp: export some functions for sctp_diag in inet_diag
inet_diag_msg_common_fill is used to fill the diag msg common info,
we need to use it in sctp_diag as well, so export it.

inet_diag_msg_attrs_fill is used to fill some common attrs info between
sctp diag and tcp diag.

v2->v3:
- do not need to define and export inet_diag_get_handler any more.
  cause all the functions in it are in sctp_diag.ko, we just call
  them in sctp_diag.ko.

- add inet_diag_msg_attrs_fill to make codes clear.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:29:36 -04:00
Eric Dumazet 8804b2722d tcp: remove false sharing in tcp_rcv_state_process()
Last known hot point during SYNFLOOD attack is the clearing
of rx_opt.saw_tstamp in tcp_rcv_state_process()

It is not needed for a listener, so we move it where it matters.

Performance while a SYNFLOOD hits a single listener socket
went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:45:44 -04:00
Eric Dumazet b3d051477c tcp: do not mess with listener sk_wmem_alloc
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.

We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)

In this patch, I chose to not attach a socket to syncookies skb.

Performance is now 5 Mpps instead of 3.2 Mpps.

Following patch will remove last known false sharing in
tcp_rcv_state_process()

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:45:44 -04:00
Craig Gallek d894ba18d4 soreuseport: fix ordering for mixed v4/v6 sockets
With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:14:03 -04:00
Alexander Duyck 802ab55adc GSO: Support partial segmentation offload
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck 1530545ed6 GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames.  Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck cbc53e08a7 GSO: Add GSO type for fixed IPv4 ID
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:40 -04:00
Chris Friesen d6d5e999e5 route: do not cache fib route info on local routes with oif
For local routes that require a particular output interface we do not want
to cache the result.  Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface.  The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.

This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board.  The receiving process
should see an IP_PKTINFO ipi_ifndex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo).  The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.

Sample dhcp_release command line:

   dhcp_release eth1 192.168.204.222 02:11:33:22:44:66

Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Signed off-by: Chris Friesen <chris.friesen@windriver.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 23:33:01 -04:00
Willem de Bruijn 31c2e4926f udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6 pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.

Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checkum when reading truncated data.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:24:52 -04:00
Willem de Bruijn 9f9a45beaa udp: do not expect udp headers on ioctl SIOCINQ
On udp sockets, ioctl SIOCINQ returns the payload size of the first
packet. Since commit e6afc8ace6 pulled the headers, the result is
incorrect when subtracting header length. Remove that operation.

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:24:52 -04:00
David S. Miller 60e19518d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree. More
specifically, they are:

1) Fix missing filter table per-netns registration in arptables, from
   Florian Westphal.

2) Resolve out of bound access when parsing TCP options in
   nf_conntrack_tcp, patch from Jozsef Kadlecsik.

3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables,
   this resolves conflict between xt_limit and ebt_limit, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 21:49:03 -04:00
Florian Westphal d7591f0c41 netfilter: x_tables: introduce and use xt_copy_counters_from_user
The three variants use same copy&pasted code, condense this into a
helper and use that.

Make sure info.name is 0-terminated.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:41 +02:00
Florian Westphal aded9f3e9f netfilter: x_tables: remove obsolete check
Since 'netfilter: x_tables: validate targets of jumps' change we
validate that the target aligns exactly with beginning of a rule,
so offset test is now redundant.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:41 +02:00
Florian Westphal 95609155d7 netfilter: x_tables: remove obsolete overflow check for compat case too
commit 9e67d5a739
("[NETFILTER]: x_tables: remove obsolete overflow check") left the
compat parts alone, but we can kill it there as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:40 +02:00
Florian Westphal 09d9686047 netfilter: x_tables: do compat validation via translate_table
This looks like refactoring, but its also a bug fix.

Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.

For example, we do not check for underflows and the base chain policies.

While its possible to also add such checks to the compat path, its more
copy&pastry, for instance we cannot reuse check_underflow() helper as
e->target_offset differs in the compat case.

Other problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.

At a high level 32 bit compat works like this:
1- initial pass over blob:
   validate match/entry offsets, bounds checking
   lookup all matches and targets
   do bookkeeping wrt. size delta of 32/64bit structures
   assign match/target.u.kernel pointer (points at kernel
   implementation, needed to access ->compatsize etc.)

2- allocate memory according to the total bookkeeping size to
   contain the translated ruleset

3- second pass over original blob:
   for each entry, copy the 32bit representation to the newly allocated
   memory.  This also does any special match translations (e.g.
   adjust 32bit to 64bit longs, etc).

4- check if ruleset is free of loops (chase all jumps)

5-first pass over translated blob:
   call the checkentry function of all matches and targets.

The alternative implemented by this patch is to drop steps 3&4 from the
compat process, the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.

In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name .

This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.

This has two drawbacks:

1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.

THe upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.

iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003

shows no noticeable differences in restore times:
old:   0m30.796s
new:   0m31.521s
64bit: 0m25.674s

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:40 +02:00
Florian Westphal 0188346f21 netfilter: x_tables: xt_compat_match_from_user doesn't need a retval
Always returned 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:40 +02:00
Florian Westphal 8dddd32756 netfilter: arp_tables: simplify translate_compat_table args
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:39 +02:00
Florian Westphal 7d3f843eed netfilter: ip_tables: simplify translate_compat_table args
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:38 +02:00
Florian Westphal ce683e5f9d netfilter: x_tables: check for bogus target offset
We're currently asserting that targetoff + targetsize <= nextoff.

Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.

We also need the e->elems pointer in a followup change to validate matches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:37 +02:00
Florian Westphal fc1221b3a1 netfilter: x_tables: add compat version of xt_check_entry_offsets
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:36 +02:00
Florian Westphal aa412ba225 netfilter: x_tables: kill check_entry helper
Once we add more sanity testing to xt_check_entry_offsets it
becomes relvant if we're expecting a 32bit 'config_compat' blob
or a normal one.

Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short just fold its contents into the callers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:36 +02:00
Florian Westphal 7d35812c32 netfilter: x_tables: add and use xt_check_entry_offsets
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.

Unfortunately these checks are not sufficient.

To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:35 +02:00
Florian Westphal 3647234101 netfilter: x_tables: validate targets of jumps
When we see a jump also check that the offset gets us to beginning of
a rule (an ipt_entry).

The extra overhead is negible, even with absurd cases.

300k custom rules, 300k jumps to 'next' user chain:
[ plus one jump from INPUT to first userchain ]:

Before:
real    0m24.874s
user    0m7.532s
sys     0m16.076s

After:
real    0m27.464s
user    0m7.436s
sys     0m18.840s

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:35 +02:00
Florian Westphal f24e230d25 netfilter: x_tables: don't move to non-existent next rule
Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

Base chains enforce absolute verdict.

User defined chains are supposed to end with an unconditional return,
xtables userspace adds them automatically.

But if such return is missing we will move to non-existent next rule.

Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14 00:30:34 +02:00
David Ahern 9ab179d83b net: vrf: Fix dst reference counting
Vivek reported a kernel exception deleting a VRF with an active
connection through it. The root cause is that the socket has a cached
reference to a dst that is destroyed. Converting the dst_destroy to
dst_release and letting proper reference counting kick in does not
work as the dst has a reference to the device which needs to be released
as well.

I talked to Hannes about this at netdev and he pointed out the ipv4 and
ipv6 dst handling has dst_ifdown for just this scenario. Rather than
continuing with the reinvented dst wheel in VRF just remove it and
leverage the ipv4 and ipv6 versions.

Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:56:20 -04:00
David Ahern a6db4494d2 net: ipv4: Consider failed nexthops in multipath routes
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:16:13 -04:00
David S. Miller ae95d71261 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-09 17:41:41 -04:00
Alexander Duyck a0ca153f98 GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.

The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header.  As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.

Fixes: c3483384ee ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:56:33 -04:00
Tom Herbert 46aa2f30aa udp: Remove udp_offloads
Now that the UDP encapsulation GRO functions have been moved to the UDP
socket we not longer need the udp_offload insfrastructure so removing it.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:30 -04:00
Tom Herbert d92283e338 fou: change to use UDP socket GRO
Adapt gue_gro_receive, gue_gro_complete to take a socket argument.
Don't set udp_offloads any more.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Tom Herbert 38fd2af24f udp: Add socket based GRO and config
Add gro_receive and  gro_complete to struct udp_tunnel_sock_cfg.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Tom Herbert a6024562ff udp: Add GRO functions to UDP socket
This patch adds GRO functions (gro_receive and gro_complete) to UDP
sockets. udp_gro_receive is changed to perform socket lookup on a
packet. If a socket is found the related GRO functions are called.

This features obsoletes using UDP offload infrastructure for GRO
(udp_offload). This has the advantage of not being limited to provide
offload on a per port basis, GRO is now applied to whatever individual
UDP sockets are bound to.  This also allows the possbility of
"application defined GRO"-- that is we can attach something like
a BPF program to a UDP socket to perfrom GRO on an application
layer protocol.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:29 -04:00
Tom Herbert 63058308cd udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb
Add externally visible functions to lookup a UDP socket by skb. This
will be used for GRO in UDP sockets. These functions also check
if skb->dst is set, and if it is not skb->dev is used to get dev_net.
This allows calling lookup functions before dst has been set on the
skbuff.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:53:14 -04:00
Hannes Frederic Sowa 1e1d04e678 net: introduce lockdep_is_held and update various places to use it
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:44:14 -04:00
Eric Dumazet 8501786929 tcp/dccp: fix inet_reuseport_add_sock()
David Ahern reported panics in __inet_hash() caused by my recent commit.

The reason is inet_reuseport_add_sock() was still using
sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
SO_REUSEPORT enabled listeners were causing an instant crash.

While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
flag, as it is inherited from the parent at clone time.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 12:02:33 -04:00
Florian Westphal ff76def3bd netfilter: arp_tables: register table in initns
arptables is broken since we didn't register the table anymore --
even 'arptables -L' fails.

Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-07 11:58:49 +02:00
Jiri Benc a6d5bbf34e ip_tunnel: implement __iptunnel_pull_header
Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner
protocol.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:50:32 -04:00
samanthakumar 627d2d6b55 udp: enable MSG_PEEK at non-zero offset
Enable peeking at UDP datagrams at the offset specified with socket
option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
to the end of the given datagram.

Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f6e
("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
on peek, decrease it on regular reads.

When peeking, always checksum the packet immediately, to avoid
recomputation on subsequent peeks and final read.

The socket lock is not held for the duration of udp_recvmsg, so
peek and read operations can run concurrently. Only the last store
to sk_peek_off is preserved.

Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-05 16:29:37 -04:00
samanthakumar e6afc8ace6 udp: remove headers from UDP packets before queueing
Remove UDP transport headers before queueing packets for reception.
This change simplifies a follow-up patch to add MSG_PEEK support.

Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-05 16:29:37 -04:00
Eric Dumazet 4ce7e93cb3 tcp: rate limit ACK sent by SYN_RECV request sockets
Attackers like to use SYNFLOOD targeting one 5-tuple, as they
hit a single RX queue (and cpu) on the victim.

If they use random sequence numbers in their SYN, we detect
they do not match the expected window and send back an ACK.

This patch adds a rate limitation, so that the effect of such
attacks is limited to ingress only.

We roughly double our ability to absorb such attacks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:20 -04:00
Eric Dumazet a9d6532b56 ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply()
TCP uses per cpu 'sockets' to send some packets :
- RST packets ( tcp_v4_send_reset()) )
- ACK packets for SYN_RECV and TIMEWAIT sockets

By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree()
to not call sk_write_space() since these internal sockets
do not care.

This gives a small performance improvement, merely by allowing
cpu to properly predict the sock_wfree() conditional branch,
and avoiding one atomic operation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:20 -04:00
Eric Dumazet 9caad86415 tcp: increment sk_drops for listeners
Goal: packets dropped by a listener are accounted for.

This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
so that children do not inherit their parent drop count.

Note that we no longer increment LINUX_MIB_LISTENDROPS counter when
sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
We already have a separate LINUX_MIB_SYNCOOKIESSENT

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:20 -04:00
Eric Dumazet 532182cd61 tcp: increment sk_drops for dropped rx packets
Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.

Following patch takes care of listeners drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:20 -04:00
Eric Dumazet 3b24d854cb tcp/dccp: do not touch listener sk_refcnt under synflood
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :

On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:20 -04:00
Eric Dumazet 2d331915a0 tcp/dccp: use rcu locking in inet_diag_find_one_icsk()
RX packet processing holds rcu_read_lock(), so we can remove
pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
if inet_diag also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment
refcount on the found listener.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:19 -04:00
Eric Dumazet ca065d0cf8 udp: no longer use SLAB_DESTROY_BY_RCU
Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a grace
period before freeing a socket. UDP sockets are not short lived in the
high usage case, so the added cost of call_rcu() should not be a concern.

This actually removes a lot of complexity in UDP stack.

Multicast receives no longer need to hold a bucket spinlock.

Note that ip early demux still needs to take a reference on the socket.

Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
but this might be changed later.

Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.

Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.

v2: Addressed Willem de Bruijn feedback in multicast handling
 - keep early demux break in __udp4_lib_demux_lookup()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:19 -04:00
Soheil Hassas Yeganeh c14ac9451c sock: enable timestamping using control messages
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh 24025c465f ipv4: process socket-level control messages in IPv4
Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh 6b084928ba tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.

To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:29 -04:00
Haishuang Yan 7822ce73e6 netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address
Since nla_get_in_addr and nla_put_in_addr were implemented,
so use them appropriately.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02 20:15:58 -04:00
Yuchung Cheng 2349262397 tcp: remove cwnd moderation after recovery
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.

On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.

By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.

v2 changes: revised commit message on bogus ACKs and burst, and
            missing signature

Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02 20:11:43 -04:00
Steffen Klassert d6af1a31cc vti: Add pmtu handling to vti_xmit.
We currently rely on the PMTU discovery of xfrm.
However if a packet is locally sent, the PMTU mechanism
of xfrm tries to do local socket notification what
might not work for applications like ping that don't
check for this. So add pmtu handling to vti_xmit to
report MTU changes immediately.

Reported-by: Mark McKinstry <Mark.McKinstry@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-03-31 08:59:56 +02:00
Alexander Duyck c3483384ee gro: Allow tunnel stacking in the case of FOU/GUE
This patch should fix the issues seen with a recent fix to prevent
tunnel-in-tunnel frames from being generated with GRO.  The fix itself is
correct for now as long as we do not add any devices that support
NETIF_F_GSO_GRE_CSUM.  When such a device is added it could have the
potential to mess things up due to the fact that the outer transport header
points to the outer UDP header and not the GRE header as would be expected.

Fixes: fac8e0f579 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 16:02:33 -04:00
Liping Zhang 29421198c3 netfilter: ipv4: fix NULL dereference
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
use sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always exist, so when it is NULL, oops will happen.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:29 +02:00
Pablo Neira Ayuso b301f25387 netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES
Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.

Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:24 +02:00
Florian Westphal 54d83fc74a netfilter: x_tables: fix unconditional helper
Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so its supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeeded to the next (not-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.

Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:15 +02:00
Florian Westphal 6e94e0cfb0 netfilter: x_tables: make sure e->next_offset covers remaining blob size
Otherwise this function may read data beyond the ruleset blob.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:08 +02:00
Florian Westphal bdf533de69 netfilter: x_tables: validate e->target_offset early
We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:04 +02:00
Quentin Armitage 995096a0a4 Fix returned tc and hoplimit values for route with IPv6 encapsulation
For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.

># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1  encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2  scope link

Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-27 22:35:02 -04:00
Alexander Duyck 5197f3499c net: Reset encap_level to avoid resetting features on inner IP headers
This patch corrects an oversight in which we were allowing the encap_level
value to pass from the outer headers to the inner headers.  As a result we
were incorrectly identifying UDP or GRE tunnels as also making use of ipip
or sit when the second header actually represented a tunnel encapsulated in
either a UDP or GRE tunnel which already had the features masked.

Fixes: 7644345622 ("net: Move GSO csum into SKB_GSO_CB")
Reported-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-23 14:19:08 -04:00
Lance Richardson 4cfc86f3da ipv4: initialize flowi4_flags before calling fib_lookup()
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:

	if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {

Fix by initializing the entire fl4 structure, which will prevent
similar issues as fields are added in the future by ensuring that
all fields are initialized to zero unless explicitly initialized
to another value.

Fixes: 58189ca7b2 ("net: Fix vti use case with oif in dst lookups")
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22 15:59:23 -04:00
Paolo Abeni ad0ea1989c ipv4: fix broadcast packets reception
Currently, ingress ipv4 broadcast datagrams are dropped since,
in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
bcast packets.

This patch addresses the issue, invoking ip_check_mc_rcu()
only for mcast packets.

Fixes: 6e54030932 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22 15:53:50 -04:00
Deepa Dinamani 3ba9d300c9 net: ipv4: Fix truncated timestamp returned by inet_current_timestamp()
The millisecond timestamps returned by the function is
converted to network byte order by making a call to htons().
htons() only returns __be16 while __be32 is required here.

This was identified by the sparse warning from the buildbot:
net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return
			    expression (different base types)
net/ipv4/af_inet.c:1405:16: expected restricted __be32
net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident>

Change the function to use htonl() to return the correct __be32 type
instead so that the millisecond value doesn't get truncated.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Fixes: 822c868532 ("net: ipv4: Convert IP network timestamps to be y2038 safe")
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot]
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21 22:56:38 -04:00
Jesse Gross a09a4c8dd1 tunnels: Remove encapsulation offloads on decap.
If a packet is either locally encapsulated or processed through GRO
it is marked with the offloads that it requires. However, when it is
decapsulated these tunnel offload indications are not removed. This
means that if we receive an encapsulated TCP packet, aggregate it with
GRO, decapsulate, and retransmit the resulting frame on a NIC that does
not support encapsulation, we won't be able to take advantage of hardware
offloads even though it is just a simple TCP packet at this point.

This fixes the problem by stripping off encapsulation offload indications
when packets are decapsulated.

The performance impacts of this bug are significant. In a test where a
Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated,
and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a
result of avoiding unnecessary segmentation at the VM tap interface.

Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com>
Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:40 -04:00
Jesse Gross fac8e0f579 tunnels: Don't apply GRO to multiple layers of encapsulation.
When drivers express support for TSO of encapsulated packets, they
only mean that they can do it for one layer of encapsulation.
Supporting additional levels would mean updating, at a minimum,
more IP length fields and they are unaware of this.

No encapsulation device expresses support for handling offloaded
encapsulated packets, so we won't generate these types of frames
in the transmit path. However, GRO doesn't have a check for
multiple levels of encapsulation and will attempt to build them.

UDP tunnel GRO actually does prevent this situation but it only
handles multiple UDP tunnels stacked on top of each other. This
generalizes that solution to prevent any kind of tunnel stacking
that would cause problems.

Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:40 -04:00
Jesse Gross b8cba75bdf ipip: Properly mark ipip GRO packets as encapsulated.
ipip encapsulated packets can be merged together by GRO but the result
does not have the proper GSO type set or even marked as being
encapsulated at all. Later retransmission of these packets will likely
fail if the device does not support ipip offloads. This is similar to
the issue resolved in IPv6 sit in feec0cb3
("ipv6: gro: support sit protocol").

Reported-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
Fixes: 9667e9bb ("ipip: Add gro callbacks to ipip offload")
Tested-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:39 -04:00
Linus Torvalds 1200b6809d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support more Realtek wireless chips, from Jes Sorenson.

   2) New BPF types for per-cpu hash and arrap maps, from Alexei
      Starovoitov.

   3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

   4) Allow the use of SO_REUSEPORT in order to do per-thread processing
   of incoming TCP/UDP connections.  The muxing can be done using a
   BPF program which hashes the incoming packet.  From Craig Gallek.

   5) Add a multiplexer for TCP streams, to provide a messaged based
      interface.  BPF programs can be used to determine the message
      boundaries.  From Tom Herbert.

   6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

   7) Avoid factorial complexity when taking down an inetdev interface
      with lots of configured addresses.  We were doing things like
      traversing the entire address less for each address removed, and
      flushing the entire netfilter conntrack table for every address as
      well.

   8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

   9) Allow offloading u32 classifiers to hardware, and implement for
      ixgbe, from John Fastabend.

  10) Allow configuring IRQ coalescing parameters on a per-queue basis,
      from Kan Liang.

  11) Extend ethtool so that larger link mode masks can be supported.
      From David Decotigny.

  12) Introduce devlink, which can be used to configure port link types
      (ethernet vs Infiniband, etc.), port splitting, and switch device
      level attributes as a whole.  From Jiri Pirko.

  13) Hardware offload support for flower classifiers, from Amir Vadai.

  14) Add "Local Checksum Offload".  Basically, for a tunneled packet
      the checksum of the outer header is 'constant' (because with the
      checksum field filled into the inner protocol header, the payload
      of the outer frame checksums to 'zero'), and we can take advantage
      of that in various ways.  From Edward Cree"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
  bonding: fix bond_get_stats()
  net: bcmgenet: fix dma api length mismatch
  net/mlx4_core: Fix backward compatibility on VFs
  phy: mdio-thunder: Fix some Kconfig typos
  lan78xx: add ndo_get_stats64
  lan78xx: handle statistics counter rollover
  RDS: TCP: Remove unused constant
  RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
  net: smc911x: convert pxa dma to dmaengine
  team: remove duplicate set of flag IFF_MULTICAST
  bonding: remove duplicate set of flag IFF_MULTICAST
  net: fix a comment typo
  ethernet: micrel: fix some error codes
  ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
  bpf, dst: add and use dst_tclassid helper
  bpf: make skb->tc_classid also readable
  net: mvneta: bm: clarify dependencies
  cls_bpf: reset class and reuse major in da
  ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
  ldmvsw: Add ldmvsw.c driver code
  ...
2016-03-19 10:05:34 -07:00
Daniel Borkmann fca5fdf67d ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded
value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX
for this, which makes the code a bit more generic and allows to remove
BPF_TUNLEN_MAX from eBPF code.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 19:38:46 -04:00
Eric Dumazet e316ea62e3 tcp/dccp: remove obsolete WARN_ON() in icmp handlers
Now SYN_RECV request sockets are installed in ehash table, an ICMP
handler can find a request socket while another cpu handles an incoming
packet transforming this SYN_RECV request socket into an ESTABLISHED
socket.

We need to remove the now obsolete WARN_ON(req->sk), since req->sk
is set when a new child is created and added into listener accept queue.

If this race happens, the ICMP will do nothing special.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ben Lazarus <blazarus@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-17 21:06:40 -04:00
Linus Torvalds 70477371dc Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 4.6:

  API:
   - Convert remaining crypto_hash users to shash or ahash, also convert
     blkcipher/ablkcipher users to skcipher.
   - Remove crypto_hash interface.
   - Remove crypto_pcomp interface.
   - Add crypto engine for async cipher drivers.
   - Add akcipher documentation.
   - Add skcipher documentation.

  Algorithms:
   - Rename crypto/crc32 to avoid name clash with lib/crc32.
   - Fix bug in keywrap where we zero the wrong pointer.

  Drivers:
   - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
   - Add PIC32 hwrng driver.
   - Support BCM6368 in bcm63xx hwrng driver.
   - Pack structs for 32-bit compat users in qat.
   - Use crypto engine in omap-aes.
   - Add support for sama5d2x SoCs in atmel-sha.
   - Make atmel-sha available again.
   - Make sahara hashing available again.
   - Make ccp hashing available again.
   - Make sha1-mb available again.
   - Add support for multiple devices in ccp.
   - Improve DMA performance in caam.
   - Add hashing support to rockchip"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: qat - remove redundant arbiter configuration
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: qat - Change the definition of icp_qat_uof_regtype
  hwrng: exynos - use __maybe_unused to hide pm functions
  crypto: ccp - Add abstraction for device-specific calls
  crypto: ccp - CCP versioning support
  crypto: ccp - Support for multiple CCPs
  crypto: ccp - Remove check for x86 family and model
  crypto: ccp - memset request context to zero during import
  lib/mpi: use "static inline" instead of "extern inline"
  lib/mpi: avoid assembler warning
  hwrng: bcm63xx - fix non device tree compatibility
  crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
  crypto: qat - The AE id should be less than the maximal AE number
  lib/mpi: Endianness fix
  crypto: rockchip - add hash support for crypto engine in rk3288
  crypto: xts - fix compile errors
  crypto: doc - add skcipher API documentation
  crypto: doc - update AEAD AD handling
  ...
2016-03-17 11:22:54 -07:00
Peter Zijlstra 25528213fe tags: Fix DEFINE_PER_CPU expansions
$ make tags
  GEN     tags
ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

Which are all the result of the DEFINE_PER_CPU pattern:

  scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
  scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'

The below cures them. All except the workqueue one are within reasonable
distance of the 80 char limit. TJ do you have any preference on how to
fix the wq one, or shall we just not care its too long?

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
David S. Miller 1cdba55055 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS updates for net-next

The following patchset contains Netfilter/IPVS fixes and OVS NAT
support, more specifically this batch is composed of:

1) Fix a crash in ipset when performing a parallel flush/dump with
   set:list type, from Jozsef Kadlecsik.

2) Make sure NFACCT_FILTER_* netlink attributes are in place before
   accessing them, from Phil Turnbull.

3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP
   helper, from Arnd Bergmann.

4) Add workaround to IPVS to reschedule existing connections to new
   destination server by dropping the packet and wait for retransmission
   of TCP syn packet, from Julian Anastasov.

5) Allow connection rescheduling in IPVS when in CLOSE state, also
   from Julian.

6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni.

7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef.

8) Check match/targetinfo netlink attribute size in nft_compat,
   patch from Florian Westphal.

9) Check for integer overflow on 32-bit systems in x_tables, from
   Florian Westphal.

Several patches from Jarno Rajahalme to prepare the introduction of
NAT support to OVS based on the Netfilter infrastructure:

10) Schedule IP_CT_NEW_REPLY definition for removal in
    nf_conntrack_common.h.

11) Simplify checksumming recalculation in nf_nat.

12) Add comments to the openvswitch conntrack code, from Jarno.

13) Update the CT state key only after successful nf_conntrack_in()
    invocation.

14) Find existing conntrack entry after upcall.

15) Handle NF_REPEAT case due to templates in nf_conntrack_in().

16) Call the conntrack helper functions once the conntrack has been
    confirmed.

17) And finally, add the NAT interface to OVS.

The batch closes with:

18) Cleanup to use spin_unlock_wait() instead of
    spin_lock()/spin_unlock(), from Nicholas Mc Guire.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 22:10:25 -04:00
Eric Dumazet acffb584cd net: diag: add a scheduling point in inet_diag_dump_icsk()
On loaded TCP servers, looking at millions of sockets can hold
cpu for many seconds, if the lookup condition is very narrow.

(eg : ss dst 1.2.3.4 )

Better add a cond_resched() to allow other processes to access
the cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 19:38:09 -04:00
Jarno Rajahalme 264619055b netfilter: Allow calling into nat helper without skb_dst.
NAT checksum recalculation code assumes existence of skb_dst, which
becomes a problem for a later patch in the series ("openvswitch:
Interface with NAT.").  Simplify this by removing the check on
skb_dst, as the checksum will be dealt with later in the stack.

Suggested-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:27 +01:00
Martin KaFai Lau a44d6eacda tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
Per RFC4898, they count segments sent/received
containing a positive length data segment (that includes
retransmission segments carrying data).  Unlike
tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
carrying no data (e.g. pure ack).

The patch also updates the segs_in in tcp_fastopen_add_skb()
so that segs_in >= data_segs_in property is kept.

Together with retransmission data, tcpi_data_segs_out
gives a better signal on the rxmit rate.

v6: Rebase on the latest net-next

v5: Eric pointed out that checking skb->len is still needed in
tcp_fastopen_add_skb() because skb can carry a FIN without data.
Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
helper is used.  Comment is added to the fastopen case to explain why
segs_in has to be reset and tcp_segs_in() has to be called before
__skb_pull().

v4: Add comment to the changes in tcp_fastopen_add_skb()
and also add remark on this case in the commit message.

v3: Add const modifier to the skb parameter in tcp_segs_in()

v2: Rework based on recent fix by Eric:
commit a9d99ce28e ("tcp: fix tcpi_segs_in after connection establishment")

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 14:55:26 -04:00
Alexander Duyck 0833482495 GSO/UDP: Use skb->len instead of udph->len to determine length of original skb
It is possible for tunnels to end up generating IP or IPv6 datagrams that
are larger than 64K and expecting to be segmented.  As such we need to deal
with length values greater than 64K.  In order to accommodate this we need
to update the code to work with a 32b length value instead of a 16b one.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 23:55:14 -04:00
David S. Miller fbd40ea018 ipv4: Don't do expensive useless work during inetdev destroy.
When an inetdev is destroyed, every address assigned to the interface
is removed.  And in this scenerio we do two pointless things which can
be very expensive if the number of assigned interfaces is large:

1) Address promotion.  We are deleting all addresses, so there is no
   point in doing this.

2) A full nf conntrack table purge for every address.  We only need to
   do this once, as is already caught by the existing
   masq_dev_notifier so masq_inet_event() can skip this.

Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
2016-03-13 23:28:35 -04:00
Zhang Shengju 136ba622de netconf: add macro to represent all attributes
This patch adds macro NETCONFA_ALL to represent all type of netconf
attributes for IPv4 and IPv6.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 21:54:44 -04:00
Alexander Duyck c194cf93c1 gro: Defer clearing of flush bit in tunnel paths
This patch updates the GRO handlers for GRE, VXLAN, GENEVE, and FOU so that
we do not clear the flush bit until after we have called the next level GRO
handler.  Previously this was being cleared before parsing through the list
of frames, however this resulted in several paths where either the bit
needed to be reset but wasn't as in the case of FOU, or cases where it was
being set as in GENEVE.  By just deferring the clearing of the bit until
after the next level protocol has been parsed we can avoid any unnecessary
bit twiddling and avoid bugs.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 15:01:00 -04:00
Tom Herbert 473bd239b8 tcp: Add tcp_inq to get available receive bytes on socket
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 16:36:14 -05:00
David S. Miller 4c38cd61ae Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next tree,
they are:

1) Remove useless debug message when deleting IPVS service, from
   Yannick Brosseau.

2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
   several spots of the IPVS code, from Arnd Bergmann.

3) Add prandom_u32 support to nft_meta, from Florian Westphal.

4) Remove unused variable in xt_osf, from Sudip Mukherjee.

5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
   since fixing af_packet defragmentation issues, from Joe Stringer.

6) On-demand hook registration for iptables from netns. Instead of
   registering the hooks for every available netns whenever we need
   one of the support tables, we register this on the specific netns
   that needs it, patchset from Florian Westphal.

7) Add missing port range selection to nf_tables masquerading support.

BTW, just for the record, there is a typo in the description of
5f6c253ebe ("netfilter: bridge: register hooks only when bridge
interface is added") that refers to the cluster match as deprecated, but
it is actually the CLUSTERIP target (which registers hooks
inconditionally) the one that is scheduled for removal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 14:25:20 -05:00
Daniel Borkmann db3c6139e6 bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit
The assumptions from commit 0c1d70af92 ("net: use dst_cache for vxlan
device"), 468dfffcd7 ("geneve: add dst caching support") and 3c1cb4d260
("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
when ip_tunnel_info is used is unfortunately not always valid as assumed.

While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
with different remote dsts, tos, etc, therefore they cannot be assumed as
packet independent.

Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
would reuse the same entry that was first created on the initial route
lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
a different one.

Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
in vxlan needs to be handeled differently in this context as it is currently
inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
helper is added for the three tunnel cases, which checks if we can use dst
cache.

Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Fixes: 468dfffcd7 ("geneve: add dst caching support")
Fixes: 3c1cb4d260 ("net/ipv4: add dst cache support for gre lwtunnels")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:58:47 -05:00
David S. Miller 810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Eric Dumazet a9d99ce28e tcp: fix tcpi_segs_in after connection establishment
If final packet (ACK) of 3WHS is lost, it appears we do not properly
account the following incoming segment into tcpi_segs_in

While we are at it, starts segs_in with one, to count the SYN packet.

We do not yet count number of SYN we received for a request sock, we
might add this someday.

packetdrill script showing proper behavior after fix :

// Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.020 < P. 1:1001(1000) ack 1 win 32792

   +0 accept(3, ..., ...) = 4

+.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%

Fixes: 2efd055c53 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 15:47:13 -05:00
Zhang Shengju 8dfd329fbc arp: correct return value of arp_rcv
Currently, arp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.

v1->v2:
If sanity check is failed, call kfree_skb() instead of consume_skb(), then
return the correct return value.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 14:55:04 -05:00
Benjamin Poirier 1837b2e2bc mld, igmp: Fix reserved tailroom calculation
The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 15:41:07 -05:00
Eric Engestrom a9d562358b net/ipv4: remove left over dead code
8cc785f6f4 ("net: ipv4: make the ping
/proc code AF-independent") removed the code using it, but renamed this
variable instead of removing it.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 15:00:55 -05:00
Pablo Neira Ayuso 8a6bf5da1a netfilter: nft_masq: support port range
Complete masquerading support by allowing port range selection.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:27 +01:00
Florian Westphal b9e69e1273 netfilter: xtables: don't hook tables by default
delay hook registration until the table is being requested inside a
namespace.

Historically, a particular table (iptables mangle, ip6tables filter, etc)
was registered on module load.

When netns support was added to iptables only the ip/ip6tables ruleset was
made namespace aware, not the actual hook points.

This means f.e. that when ipt_filter table/module is loaded on a system,
then each namespace on that system has an (empty) iptables filter ruleset.

In other words, if a namespace sends a packet, such skb is 'caught' by
netfilter machinery and fed to hooking points for that table (i.e. INPUT,
FORWARD, etc).

Thanks to Eric Biederman, hooks are no longer global, but per namespace.

This means that we can avoid allocation of empty ruleset in a namespace and
defer hook registration until we need the functionality.

We register a tables hook entry points ONLY in the initial namespace.
When an iptables get/setockopt is issued inside a given namespace, we check
if the table is found in the per-namespace list.

If not, we attempt to find it in the initial namespace, and, if found,
create an empty default table in the requesting namespace and register the
needed hooks.

Hook points are destroyed only once namespace is deleted, there is no
'usage count' (it makes no sense since there is no 'remove table' operation
in xtables api).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:24 +01:00
Florian Westphal a67dd266ad netfilter: xtables: prepare for on-demand hook register
This change prepares for upcoming on-demand xtables hook registration.

We change the protoypes of the register/unregister functions.
A followup patch will then add nf_hook_register/unregister calls
to the iptables one.

Once a hook is registered packets will be picked up, so all assignments
of the form

net->ipv4.iptable_$table = new_table

have to be moved to ip(6)t_register_table, else we can see NULL
net->ipv4.iptable_$table later.

This patch doesn't change functionality; without this the actual change
simply gets too big.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:23 +01:00
Joe Stringer 5f547391f5 netfilter: nf_defrag_ipv4: Drop redundant ip_send_check()
Since commit 0848f6428b ("inet: frags: fix defragmented packet's IP
header for af_packet"), ip_send_check() would be called twice for
defragmentation that occurs from netfilter ipv4 defrag hooks. Remove the
extra call.

Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:22 +01:00
WANG Cong 64d4e3431e net: remove skb_sender_cpu_clear()
After commit 52bd2d62ce ("net: better skb->sender_cpu and skb->napi_id cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:36:47 -05:00
Deepa Dinamani b1b270d863 net: ipv4: tcp_probe: Replace timespec with timespec64
TCP probe log timestamps use struct timespec which is
not y2038 safe. Even though timespec might be good enough here
as it is used to represent delta time, the plan is to get rid
of all uses of timespec in the kernel.
Replace with struct timespec64 which is y2038 safe.

Prints still use unsigned long format and type.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:18:44 -05:00
Deepa Dinamani 822c868532 net: ipv4: Convert IP network timestamps to be y2038 safe
ICMP timestamp messages and IP source route options require
timestamps to be in milliseconds modulo 24 hours from
midnight UT format.

Add inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.

Timestamp calculation is also changed to call ktime_get_real_ts64()
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday() which uses struct timespec.
struct timespec is not y2038 safe.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:18:44 -05:00
Alexander Duyck 2246387662 GSO: Provide software checksum of tunneled UDP fragmentation offload
On reviewing the code I realized that GRE and UDP tunnels could cause a
kernel panic if we used GSO to segment a large UDP frame that was sent
through the tunnel with an outer checksum and hardware offloads were not
available.

In order to correct this we need to update the feature flags that are
passed to the skb_segment function so that in the event of UDP
fragmentation being requested for the inner header the segmentation
function will correctly generate the checksum for the payload if we cannot
segment the outer header.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 14:23:35 -05:00
David Lamparter 17b693cdd8 net: l3mdev: prefer VRF master for source address selection
When selecting an address in context of a VRF, the vrf master should be
preferred for address selection.  If it isn't, the user has a hard time
getting the system to select to their preference - the code will pick
the address off the first in-VRF interface it can find, which on a
router could well be a non-routable address.

Signed-off-by: David Lamparter <equinox@diac24.net>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
[dsa: Fixed comment style and removed extra blank link ]
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 14:22:26 -05:00
David Ahern 3f2fb9a834 net: l3mdev: address selection should only consider devices in L3 domain
David Lamparter noted a use case where the source address selection fails
to pick an address from a VRF interface - unnumbered interfaces.

Relevant commands from his script:
    ip addr add 9.9.9.9/32 dev lo
    ip link set lo up

    ip link add name vrf0 type vrf table 101
    ip rule add oif vrf0 table 101
    ip rule add iif vrf0 table 101
    ip link set vrf0 up
    ip addr add 10.0.0.3/32 dev vrf0

    ip link add name dummy2 type dummy
    ip link set dummy2 master vrf0 up

    --> note dummy2 has no address - unnumbered device

    ip route add 10.2.2.2/32 dev dummy2 table 101
    ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02

    tcpdump -ni dummy2 &

And using ping instead of his socat example:
    $ ping -I vrf0 -c1 10.2.2.2
    ping: Warning: source address might be selected on device other than vrf0.
    PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data.

>From tcpdump:
    12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64

Note the source address is from lo and is not a VRF local address. With
this patch:

    $ ping -I vrf0 -c1 10.2.2.2
    PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data.

>From tcpdump:
    12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64

Now the source address comes from vrf0.

The ipv4 function for selecting source address takes a const argument.
Removing the const requires touching a lot of places, so instead
l3mdev_master_ifindex_rcu is changed to take a const argument and then
do the typecast to non-const as required by netdev_master_upper_dev_get_rcu.
This is similar to what l3mdev_fib_table_rcu does.

IPv6 for unnumbered interfaces appears to be selecting the addresses
properly.

Cc: David Lamparter <david@opensourcerouting.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 14:22:26 -05:00
Hannes Frederic Sowa a8c4a2522a ipv4: only create late gso-skb if skb is already set up with CHECKSUM_PARTIAL
Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
skbs down. This can easily happen with UDP+IPv4 sockets with the first
MSG_MORE write smaller than the MTU, second write is a sendfile.

Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
were we calculate the checksum manually during copying.

Commit d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
sockets") started to exposes this bug.

Fixes: d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
Reported-by: Jiri Benc <jbenc@redhat.com>
Cc: Jiri Benc <jbenc@redhat.com>
Reported-by: Wakko Warner <wakko@animx.eu.org>
Cc: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-24 14:11:34 -05:00
Craig Gallek e5fbfc1c2d soreuseport: fix merge conflict in tcp bind
One of the validation checks for the new array-based TCP SO_REUSEPORT
validation was unintentionally dropped in ea8add2b19.  This adds it back.

Lack of this check allows the user to allocate multiple sock_reuseport
structures (leaking all but the first).

Fixes: ea8add2b19 ("tcp/dccp: better use of ephemeral ports in bind()")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-24 13:38:18 -05:00
Bernie Harris 5146d1f151 tunnel: Clear IPCB(skb)->opt before dst_link_failure called
IPCB may contain data from previous layers (in the observed case the
qdisc layer). In the observed scenario, the data was misinterpreted as
ip header options, which later caused the ihl to be set to an invalid
value (<5). This resulted in an infinite loop in the mips implementation
of ip_fast_csum.

This patch clears IPCB(skb)->opt before dst_link_failure can be called for
various types of tunnels. This change only applies to encapsulated ipv4
packets.

The code introduced in 11c21a30 which clears all of IPCB has been removed
to be consistent with these changes, and instead the opt field is cleared
unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
SIT, GRE, and IPIP tunnels.

The relevant vti, l2tp, and pptp functions already contain similar code for
clearing the IPCB.

Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 19:11:56 -05:00
Konstantin Khlebnikov 9bdfb3b79e tcp: convert cached rtt from usec to jiffies when feeding initial rto
Currently it's converted into msecs, thus HZ=1000 intact.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 18:28:46 -05:00
David S. Miller b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Anton Protopopov a97eb33ff2 rtnl: RTM_GETNETCONF: fix wrong return value
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:33:46 -05:00
Jiri Benc d13b161c2c gre: clear IFF_TX_SKB_SHARING
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.

Also, clean up whitespace in ipgre_tap_setup when we're already touching it.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:43:48 -05:00
Jiri Benc 7f290c9435 iptunnel: scrub packet in iptunnel_pull_header
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Eric Dumazet 7716682cc5 tcp/dccp: fix another race at listener dismantle
Ilya reported following lockdep splat:

kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0

To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.

We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.

Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:35:51 -05:00
Xin Long deed49df73 route: check and remove route cache when we get route
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.

Fix this issue by checking  and removing route cache when we get route.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:31:36 -05:00
Insu Yun 1eea84b74c tcp: correctly crypto_alloc_hash return check
crypto_alloc_hash never returns NULL

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 22:23:04 -05:00
Ben Hutchings 7bbf3cae65 ipv4: Remove inet_lro library
There are no longer any in-tree drivers that use it.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 16:15:46 -05:00
Nikolay Borisov 52a773d645 net: Export ip fragment sysctl to unprivileged users
Now that all the ip fragmentation related sysctls are namespaceified
there is no reason to hide them anymore from "root" users inside
containers.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov 0fbf4cb27e ipv4: namespacify ip fragment max dist sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov e21145a987 ipv4: namespacify ip_early_demux sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov 287b7f38fd ipv4: Namespacify ip_dynaddr sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov dcd87999d4 igmp: net: Move igmp namespace init to correct file
When igmp related sysctl were namespacified their initializatin was
erroneously put into the tcp socket namespace constructor. This
patch moves the relevant code into the igmp namespace constructor to
keep things consistent.

Also sprinkle some #ifdefs to silence warnings

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov fa50d974d1 ipv4: Namespaceify ip_default_ttl sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Eric Dumazet cd9b266095 tcp: add tcpi_min_rtt and tcpi_notsent_bytes to tcp_info
tcpi_min_rtt reports the minimal rtt observed by TCP stack for the flow,
in usec unit. Might be ~0U if not yet known.

tcpi_notsent_bytes reports the amount of bytes in the write queue that
were not yet sent.

This is done in a single patch to not add a temporary 32bit padding hole
in tcp_info.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:27:35 -05:00
Eric Dumazet 729235554d tcp: md5: release request socket instead of listener
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.

The bug was added for IPv4 only.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:24:06 -05:00
Paolo Abeni 3c1cb4d260 net/ipv4: add dst cache support for gre lwtunnels
In case of UDP traffic with datagram length below MTU this
gives about 4% performance increase

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:49 -05:00
Paolo Abeni e09acddf87 ip_tunnel: replace dst_cache with generic implementation
The current ip_tunnel cache implementation is prone to a race
that will cause the wrong dst to be cached on cuncurrent dst cache
miss and ip tunnel update via netlink.

Replacing with the generic implementation fix the issue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
Eric Dumazet 372022830b tcp: do not set rtt_min to 1
There are some cases where rtt_us derives from deltas of jiffies,
instead of using usec timestamps.

Since we want to track minimal rtt, better to assume a delta of 0 jiffie
might be in fact be very close to 1 jiffie.

It is kind of sad jiffies_to_usecs(1) calls a function instead of simply
using a constant.

Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 16:07:13 -05:00
Eric Dumazet 919483096b ipv4: fix memory leaks in ip_cmsg_send() callers
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.

Callers are responsible for the freeing.

Many thanks to Dmitry for the report and diagnostic.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-13 05:57:39 -05:00