OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Roopa Prabhu	0241b83673	vxlan: fix default fdb entry netlink notify ordering during netdev create Problem: In vxlan_newlink, a default fdb entry is added before register_netdev. The default fdb creation function also notifies user-space of the fdb entry on the vxlan device which user-space does not know about yet. (RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex). This patch fixes the user-space netlink notification ordering issue with the following changes: - decouple fdb notify from fdb create. - Move fdb notify after register_netdev. - Call rtnl_configure_link in vxlan newlink handler to notify userspace about the newlink before fdb notify and hence avoiding the user-space race. Fixes: `afbd8bae9c` ("vxlan: add implicit fdb entry for default destination") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-07 20:20:46 +09:00
Roopa Prabhu	4c2438ba85	vxlan: make netlink notify in vxlan_fdb_destroy optional Add a new option do_notify to vxlan_fdb_destroy to make sending netlink notify optional. Used by a later patch. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-07 20:20:46 +09:00
Roopa Prabhu	25e20e730d	vxlan: add new fdb alloc and create helpers - Add new vxlan_fdb_alloc helper - rename existing vxlan_fdb_create into vxlan_fdb_update: because it really creates or updates an existing fdb entry - move new fdb creation into a separate vxlan_fdb_create Main motivation for this change is to introduce the ability to decouple vxlan fdb creation and notify, used in a later patch. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-07 20:20:46 +09:00
David S. Miller	5cd3da4ba2	Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net Simple overlapping changes in stmmac driver. Adjust skb_gro_flush_final_remcsum function signature to make GRO list changes in net-next, as per Stephen Rothwell's example merge resolution. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-03 10:29:26 +09:00
Sabrina Dubroca	603d4cf8fe	net: fix use-after-free in GRO with ESP Since the addition of GRO for ESP, gro_receive can consume the skb and return -EINPROGRESS. In that case, the lower layer GRO handler cannot touch the skb anymore. Commit `5f114163f2` ("net: Add a skb_gro_flush_final helper.") converted some of the gro_receive handlers that can lead to ESP's gro_receive so that they wouldn't access the skb when -EINPROGRESS is returned, but missed other spots, mainly in tunneling protocols. This patch finishes the conversion to using skb_gro_flush_final(), and adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and GUE. Fixes: `5f114163f2` ("net: Add a skb_gro_flush_final helper.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-02 20:34:04 +09:00
Pieter Jansen van Vuuren	256c87c17c	net: check tunnel option type in tunnel flags Check the tunnel option type stored in tunnel flags when creating options for tunnels. Thereby ensuring we do not set geneve, vxlan or erspan tunnel options on interfaces that are not associated with them. Make sure all users of the infrastructure set correct flags, for the BPF helper we have to set all bits to keep backward compatibility. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-06-29 23:50:26 +09:00
David Miller	d4546c2509	net: Convert GRO SKB handling to list_head. Manage pending per-NAPI GRO packets via list_head. Return an SKB pointer from the GRO receive handlers. When GRO receive handlers return non-NULL, it means that this SKB needs to be completed at this time and removed from the NAPI queue. Several operations are greatly simplified by this transformation, especially timing out the oldest SKB in the list when gro_count exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue in reverse order. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-06-26 11:33:04 +09:00
Hangbin Liu	72f6d71e49	vxlan: add ttl inherit support Like tos inherit, ttl inherit should also means inherit the inner protocol's ttl values, which actually not implemented in vxlan yet. But we could not treat ttl == 0 as "use the inner TTL", because that would be used also when the "ttl" option is not specified and that would be a behavior change, and breaking real use cases. So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is specified with ip cmd. Reported-by: Jianlin Shi <jishi@redhat.com> Suggested-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-17 13:53:13 -04:00
David S. Miller	3e3ab9ccca	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-29 10:15:51 -05:00
Nicolas Dichtel	f15ca723c1	net: don't call update_pmtu unconditionally Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to: "BUG: unable to handle kernel NULL pointer dereference at (null)" Let's add a helper to check if update_pmtu is available before calling it. Fixes: `52a589d51f` ("geneve: update skb dst pmtu on tx path") Fixes: `a93bf0ff44` ("vxlan: update skb dst pmtu on tx path") CC: Roman Kapl <code@rkapl.cz> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-25 16:27:34 -05:00
David S. Miller	fba961ab29	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Lots of overlapping changes. Also on the net-next side the XDP state management is handled more in the generic layers so undo the 'net' nfp fix which isn't applicable in net-next. Include a necessary change by Jakub Kicinski, with log message: ==================== cls_bpf no longer takes care of offload tracking. Make sure netdevsim performs necessary checks. This fixes a warning caused by TC trying to remove a filter it has not added. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-22 11:16:31 -05:00
Xin Long	a93bf0ff44	vxlan: update skb dst pmtu on tx path Unlike ip tunnels, now vxlan doesn't do any pmtu update for upper dst pmtu, even if it doesn't match the lower dst pmtu any more. The problem can be reproduced when reducing the vxlan lower dev's pmtu when running netperf. In jianlin's testing, the performance went to 1/7 of the previous. This patch is to update the upper dst pmtu to match the lower dst pmtu on tx path so that packets can be sent out even when lower dev's pmtu has been changed. It also works for metadata dst. Note that this patch doesn't process any pmtu icmp packet. But even in the future, the support for pmtu icmp packets process of udp tunnels will also needs this. The same thing will be done for geneve in another patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 13:37:27 -05:00
Haishuang Yan	57b61127ab	vxlan: speedup vxlan tunnels dismantle Since we now hold RTNL lock in vxlan_exit_net, it's better to batch them to speedup vxlan tunnels dismantle. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 10:59:44 -05:00
Alexey Kodanev	f870c1ff65	vxlan: restore dev->mtu setting based on lower device Stefano Brivio says: Commit `a985343ba9` ("vxlan: refactor verification and application of configuration") introduced a change in the behaviour of initial MTU setting: earlier, the MTU for a link created on top of a given lower device, without an initial MTU specification, was set to the MTU of the lower device minus headroom as a result of this path in vxlan_dev_configure(): if (!conf->mtu) dev->mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM); which is now gone. Now, the initial MTU, in absence of a configured value, is simply set by ether_setup() to ETH_DATA_LEN (1500 bytes). This breaks userspace expectations in case the MTU of the lower device is higher than 1500 bytes minus headroom. This patch restores the previous behaviour on newlink operation. Since max_mtu can be negative and we update dev->mtu directly, also check it for valid minimum. Reported-by: Junhan Yan <juyan@redhat.com> Fixes: `a985343ba9` ("vxlan: refactor verification and application of configuration") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-16 23:05:10 -05:00
Xin Long	fc39c38bdc	vxlan: use __be32 type for the param vni in __vxlan_fdb_delete All callers of __vxlan_fdb_delete pass vni with __be32 type, and this param should be declared as __be32 type. Fixes: `3ad7a4b141` ("vxlan: support fdb and learning in COLLECT_METADATA mode") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-11-28 15:57:12 -05:00
Linus Torvalds	5bbcc0f595	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Highlights: 1) Maintain the TCP retransmit queue using an rbtree, with 1GB windows at 100Gb this really has become necessary. From Eric Dumazet. 2) Multi-program support for cgroup+bpf, from Alexei Starovoitov. 3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew Lunn. 4) Add meter action support to openvswitch, from Andy Zhou. 5) Add a data meta pointer for BPF accessible packets, from Daniel Borkmann. 6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet. 7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli. 8) More work to move the RTNL mutex down, from Florian Westphal. 9) Add 'bpftool' utility, to help with bpf program introspection. From Jakub Kicinski. 10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper Dangaard Brouer. 11) Support 'blocks' of transformations in the packet scheduler which can span multiple network devices, from Jiri Pirko. 12) TC flower offload support in cxgb4, from Kumar Sanghvi. 13) Priority based stream scheduler for SCTP, from Marcelo Ricardo Leitner. 14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg. 15) Add RED qdisc offloadability, and use it in mlxsw driver. From Nogah Frankel. 16) eBPF based device controller for cgroup v2, from Roman Gushchin. 17) Add some fundamental tracepoints for TCP, from Song Liu. 18) Remove garbage collection from ipv6 route layer, this is a significant accomplishment. From Wei Wang. 19) Add multicast route offload support to mlxsw, from Yotam Gigi" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits) tcp: highest_sack fix geneve: fix fill_info when link down bpf: fix lockdep splat net: cdc_ncm: GetNtbFormat endian fix openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start netem: remove unnecessary 64 bit modulus netem: use 64 bit divide by rate tcp: Namespace-ify sysctl_tcp_default_congestion_control net: Protect iterations over net::fib_notifier_ops in fib_seq_sum() ipv6: set all.accept_dad to 0 by default uapi: fix linux/tls.h userspace compilation error usbnet: ipheth: prevent TX queue timeouts when device not ready vhost_net: conditionally enable tx polling uapi: fix linux/rxrpc.h userspace compilation errors net: stmmac: fix LPI transitioning for dwmac4 atm: horizon: Fix irq release error net-sysfs: trigger netlink notification on ifalias change via sysfs openvswitch: Using kfree_rcu() to simplify the code openvswitch: Make local function ovs_nsh_key_attr_size() static openvswitch: Fix return value check in ovs_meter_cmd_features() ...	2017-11-15 11:56:19 -08:00
Xin Long	8bff3685a4	vxlan: fix the issue that neigh proxy blocks all icmpv6 packets Commit `f1fb08f633` ("vxlan: fix ND proxy when skb doesn't have transport header offset") removed icmp6_code and icmp6_type check before calling neigh_reduce when doing neigh proxy. It means all icmpv6 packets would be blocked by this, not only ns packet. In Jianlin's env, even ping6 couldn't work through it. This patch is to bring the icmp6_code and icmp6_type check back and also removed the same check from neigh_reduce(). Fixes: `f1fb08f633` ("vxlan: fix ND proxy when skb doesn't have transport header offset") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Vincent Bernat <vincent@bernat.im> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-11-14 16:12:01 +09:00
Vasily Averin	0e4ec5acad	vxlan: exit_net cleanup checks added Be sure that sock_list array initialized in net_init hook was return to initial state Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-11-14 15:45:52 +09:00
Kees Cook	df7e828c1b	timer: Remove init_timer_deferrable() in favor of timer_setup() This refactors the only users of init_timer_deferrable() to use the new timer_setup() and from_timer(). Removes definition of init_timer_deferrable(). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> # for networking parts Acked-by: Sebastian Reichel <sre@kernel.org> # for drivers/hsi parts Cc: linux-mips@linux-mips.org Cc: Petr Mladek <pmladek@suse.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kalle Valo <kvalo@qca.qualcomm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: linux1394-devel@lists.sourceforge.net Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: linux-s390@vger.kernel.org Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ursula Braun <ubraun@linux.vnet.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Harish Patil <harish.patil@cavium.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Manish Chopra <manish.chopra@cavium.com> Cc: Len Brown <len.brown@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-pm@vger.kernel.org Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Gross <mark.gross@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-watchdog@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-wireless@vger.kernel.org Cc: Sebastian Reichel <sre@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Michael Reed <mdr@sgi.com> Cc: netdev@vger.kernel.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Link: https://lkml.kernel.org/r/1507159627-127660-6-git-send-email-keescook@chromium.org	2017-10-05 15:01:18 +02:00
Jiri Benc	fa20e0e32c	vxlan: factor out VXLAN-GPE next protocol The values are shared between VXLAN-GPE and NSH. Originally probably by coincidence but I notified both working groups about this last year and they seem to keep the values in sync since then. Hopefully they'll get a single IANA registry for the values, too. (I asked them for that.) Factor out the code to be shared by the NSH implementation. NSH and MPLS values are added in this patch, too. For MPLS, the drafts incorrectly assign only a single value, while we have two MPLS ethertypes. I raised the problem with both groups. For now, I assume the value is for unicast. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-29 15:16:52 -07:00
Girish Moodalbail	653ef6a3e4	vxlan: change vxlan_[config_]validate() to use netlink_ext_ack for error reporting The kernel log is not where users expect error messages for netlink requests; as we have extended acks now, we can replace pr_debug() with NL_SET_ERR_MSG_ATTR(). Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-13 20:00:29 -07:00
David S. Miller	3118e6e19d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net The UDP offload conflict is dealt with by simply taking what is in net-next where we have removed all of the UFO handling code entirely. The TCP conflict was a case of local variables in a function being removed from both net and net-next. In netvsc we had an assignment right next to where a missing set of u64 stats sync object inits were added. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:28:45 -07:00
K. Den	be73b3043b	vxlan: fix remcsum when GRO on and CHECKSUM_PARTIAL boundary is outer UDP In the case that GRO is turned on and the original received packet is CHECKSUM_PARTIAL, if the outer UDP header is exactly at the last csum-unnecessary point, which for instance could occur if the packet comes from another Linux guest on the same Linux host, we have to do either remcsum_adjust or set up CHECKSUM_PARTIAL again with its csum_start properly reset considering RCO. However, since b7fe10e5ebac("gro: Fix remcsum offload to deal with frags in GRO") that barrier in such case could be skipped if GRO turned on, hence we pass over it and the inner L4 validation mistakenly reckons it as a bad csum. This patch makes remcsum_offload being reset at the same time of GRO remcsum cleanup, so as to make it work in such case as before. Fixes: `b7fe10e5eb` ("gro: Fix remcsum offload to deal with frags in GRO") Signed-off-by: Koichiro Den <den@klaipeden.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-01 16:09:14 -07:00
Sabrina Dubroca	04584957b5	geneve/vxlan: offload ports on register/unregister events This improves consistency of handling when moving a netdev to another netns. Most drivers currently do a full reset when the device goes up, so that will flush the offload state anyway. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-24 13:52:59 -07:00
Sabrina Dubroca	2d2b13fcff	geneve/vxlan: add support for NETDEV_UDP_TUNNEL_DROP_INFO Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-24 13:52:59 -07:00
Reshetova, Elena	66af846fe5	net, vxlan: convert vxlan_sock.refcnt from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-04 22:35:15 +01:00
Jiri Benc	69e766612c	vxlan: fix hlist corruption It's not a good idea to add the same hlist_node to two different hash lists. This leads to various hard to debug memory corruptions. Fixes: `b1be00a6c3` ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-03 02:36:27 -07:00
Sabrina Dubroca	889ce937c9	vxlan: correctly set vxlan->net when creating the device in a netns Commit `a985343ba9` ("vxlan: refactor verification and application of configuration") modified vxlan device creation, and replaced the assignment of vxlan->net to src_net with dev_net(netdev) in ->setup(). But dev_net(netdev) is not the same as src_net. At the time ->setup() is called, dev_net hasn't been set yet, so we end up creating the socket for the vxlan device in init_net. Fix this by bringing back the assignment of vxlan->net during device creation. Fixes: `a985343ba9` ("vxlan: refactor verification and application of configuration") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-03 02:04:10 -07:00
Matthias Schiffer	019b13ae85	vxlan: fix incorrect nlattr access in MTU check The access to the wrong variable could lead to a NULL dereference and possibly other invalid memory reads in vxlan newlink/changelink requests with a IFLA_MTU attribute. Fixes: `a985343ba9` "vxlan: refactor verification and application of configuration" Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-27 14:40:35 -04:00
Matthias Schiffer	a8b8a889e3	net: add netlink_ext_ack argument to rtnl_link_ops.validate Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-26 23:13:22 -04:00
Matthias Schiffer	ad744b223c	net: add netlink_ext_ack argument to rtnl_link_ops.changelink Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-26 23:13:22 -04:00
Matthias Schiffer	7a3f4a1851	net: add netlink_ext_ack argument to rtnl_link_ops.newlink Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-26 23:13:21 -04:00
Matthias Schiffer	49f810f00f	vxlan: allow multiple VXLANs with same VNI for IPv6 link-local addresses As link-local addresses are only valid for a single interface, we can allow to use the same VNI for multiple independent VXLANs, as long as the used interfaces are distinct. This way, VXLANs can always be used as a drop-in replacement for VLANs with greater ID space. This also extends VNI lookup to respect the ifindex when link-local IPv6 addresses are used, so using the same VNI on multiple interfaces can actually work. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-20 13:37:03 -04:00
Matthias Schiffer	87613de950	vxlan: fix snooping for link-local IPv6 addresses If VXLAN is run over link-local IPv6 addresses, it is necessary to store the ifindex in the FDB entries. Otherwise, the used interface is undefined and unicast communication will most likely fail. Support for link-local IPv4 addresses should be possible as well, but as the semantics aren't as well defined as for IPv6, and there doesn't seem to be much interest in having the support, it's not implemented for now. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-20 13:37:03 -04:00
Matthias Schiffer	0f22a3c68d	vxlan: check valid combinations of address scopes * Multicast addresses are never valid as local address * Link-local IPv6 unicast addresses may only be used as remote when the local address is link-local as well * Don't allow link-local IPv6 local/remote addresses without interface We also store in the flags field if link-local addresses are used for the follow-up patches that actually make VXLAN over link-local IPv6 work. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-20 13:37:02 -04:00
Matthias Schiffer	ce44a4aea5	vxlan: improve validation of address family configuration Address families of source and destination addresses must match, and changelink operations can't change the address family. In addition, always use the VXLAN_F_IPV6 to check if a VXLAN device uses IPv4 or IPv6. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-20 13:37:02 -04:00
Matthias Schiffer	dc5321d796	vxlan: get rid of redundant vxlan_dev.flags There is no good reason to keep the flags twice in vxlan_dev and vxlan_config. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-20 13:37:02 -04:00
Matthias Schiffer	a985343ba9	vxlan: refactor verification and application of configuration The vxlan_dev_configure function was mixing validation and application of the vxlan configuration; this could easily lead to bugs with the changelink operation, as it was hard to see if the function wcould return an error after parts of the configuration had already been applied. This commit splits validation and application out of vxlan_dev_configure as separate functions to make it clearer where error returns are allowed and where the vxlan_dev or net_device may be configured. Log messages in these functions are removed, as it is generally unexpected to find error output for netlink requests in the kernel log. Userspace should be able to handle errors based on the error codes returned via netlink just fine. In addition, some validation and initialization is moved to vxlan_validate and vxlan_setup respectively to improve grouping of similar settings. Finally, this also fixes two actual bugs: * if set, conf->mtu would overwrite dev->mtu in each changelink operation, reverting other changes of dev->mtu * the "if (!conf->dst_port)" branch would never be run, as conf->dst_port was set in vxlan_setup before. This caused VXLAN-GPE to use the same default port as other VXLAN sockets instead of the intended IANA-assigned 4790. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-20 13:37:01 -04:00
Johannes Berg	d58ff35122	networking: make skb_push & __skb_push return void pointers It seems like a historic accident that these return unsigned char , and in many places that means casts are required, more often than not. Make these functions return void and remove all the casts across the tree, adding a (u8 ) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; @@ - (fn(SKB, LEN)) + (u8 )fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; type T; @@ - E = ((T )(fn(SKB, LEN))) + E = fn(SKB, LEN) @@ expression SKB, LEN; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; @@ - fn(SKB, LEN)[0] + (u8 )fn(SKB, LEN) Note that the last part there converts from push(...)[0] to the more idiomatic (u8 *)push(...). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-16 11:48:40 -04:00
Johannes Berg	b080db5853	networking: convert many more places to skb_put_zero() There were many places that my previous spatch didn't find, as pointed out by yuan linyu in various patches. The following spatch found many more and also removes the now unnecessary casts: @@ identifier p, p2; expression len; expression skb; type t, t2; @@ ( -p = skb_put(skb, len); +p = skb_put_zero(skb, len); \| -p = (t)skb_put(skb, len); +p = skb_put_zero(skb, len); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, len); \| -memset(p, 0, len); ) @@ type t, t2; identifier p, p2; expression skb; @@ t p; ... ( -p = skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); \| -p = (t )skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, sizeof(p)); \| -memset(p, 0, sizeof(p)); ) @@ expression skb, len; @@ -memset(skb_put(skb, len), 0, len); +skb_put_zero(skb, len); Apply it to the tree (with one manual fixup to keep the comment in vxlan.c, which spatch removed.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-16 11:48:35 -04:00
David S. Miller	0ddead90b2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 11:59:32 -04:00
Roopa Prabhu	e0090a9e97	vxlan: dont migrate permanent fdb entries during learn This patch fixes vxlan_snoop to not move permanent fdb entries on learn events. This is consistent with the bridge fdb handling of permanent entries. Fixes: `26a41ae604` ("vxlan: only migrate dynamic FDB entries") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-12 11:01:00 -04:00
David S. Miller	cf124db566	net: Fix inconsistent teardown and release of private netdev state. Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-07 15:53:24 -04:00
Mark Bloch	57d88182ea	vxlan: use a more suitable function when assigning NULL When stopping the vxlan interface we detach it from the socket. Use RCU_INIT_POINTER() and not rcu_assign_pointer() to do so. Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-07 15:26:03 -04:00
Mark Bloch	a53cb29b0a	vxlan: fix use-after-free on deletion Adding a vxlan interface to a socket isn't symmetrical, while adding is done in vxlan_open() the deletion is done in vxlan_dellink(). This can cause a use-after-free error when we close the vxlan interface before deleting it. We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call it from vxlan_stop() to match the call from vxlan_open(). Fixes: `56ef9c909b` ("vxlan: Move socket initialization to within rtnl scope") Acked-by: Jiri Benc <jbenc@redhat.com> Tested-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-02 14:29:16 -04:00
Lance Richardson	35cf284556	vxlan: eliminate cached dst leak After commit `0c1d70af92` ("net: use dst_cache for vxlan device"), cached dst entries could be leaked when more than one remote was present for a given vxlan_fdb entry, causing subsequent netns operations to block indefinitely and "unregister_netdevice: waiting for lo to become free." messages to appear in the kernel log. Fix by properly releasing cached dst and freeing resources in this case. Fixes: `0c1d70af92` ("net: use dst_cache for vxlan device") Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-01 14:45:21 -04:00
Jiri Benc	baf4d78607	vxlan: do not output confusing error message The message "Cannot bind port X, err=Y" creates only confusion. In metadata based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and no error message should be printed. But when IPv6 tunnel was requested, such failure is fatal. The vxlan_socket_create does not know when the error is harmless and when it's not. Instead of passing such information down to vxlan_socket_create, remove the message completely. It's not useful. We propagate the error code up to the user space and the port number comes from the user space. There's nothing in the message that the process creating vxlan interface does not know. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-30 22:30:13 -04:00
Jiri Benc	d074bf9600	vxlan: correctly handle ipv6.disable module parameter When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns -EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole operation of bringing up the tunnel. Ignore failure of IPv6 socket creation for metadata based tunnels caused by IPv6 not being available. Fixes: `b1be00a6c3` ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-30 22:30:13 -04:00
Vincent Bernat	f1fb08f633	vxlan: fix ND proxy when skb doesn't have transport header offset When an incoming frame is tagged or when GRO is disabled, the skb handled to vxlan_xmit() doesn't contain a valid transport header offset. This makes ND proxying fail. We combine two changes: replace use of skb_transport_offset() and ensure the necessary amount of skb is linear just before using it: - In vxlan_xmit(), when determining if we have an ICMPv6 neighbor discovery packet, just check if it is an ICMPv6 packet and rely on neigh_reduce() to do more checks if this is the case. The use of pskb_may_pull() is replaced by skb_header_pointer() for just the IPv6 header. - In neigh_reduce(), add pskb_may_pull() for IPv6 header and neighbor discovery message since this was removed from vxlan_xmit(). Replace skb_transport_header() with ipv6_hdr() + 1. - In vxlan_na_create(), replace first skb_transport_offset() with ipv6_hdr() + 1 and second with skb_network_offset() + sizeof(struct ipv6hdr). Additionally, ensure we pskb_may_pull() the whole skb as we need it to iterate over the options. Signed-off-by: Vincent Bernat <vincent@bernat.im> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-03 18:50:42 -07:00
Felix Manlunas	d6acfeb17d	vxlan: vxlan dev should inherit lowerdev's gso_max_size vxlan dev currently ignores lowerdev's gso_max_size, which adversely affects TSO performance of liquidio if it's the lowerdev. Egress TCP packets' skb->len often exceed liquidio's advertised gso_max_size. This may happen on other NIC drivers. Fix it by assigning lowerdev's gso_max_size to that of vxlan dev. Might as well do likewise for gso_max_segs. Single flow TSO throughput of liquidio as lowerdev (using iperf3): Before the patch: 139 Mbps After the patch : 8.68 Gbps Percent increase: 6,144 % Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: Satanand Burla <satananda.burla@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-01 11:43:38 -07:00

1 2 3 4 5 ...

432 Commits