OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Chuck Lever	96ceddea37	xprtrdma: Remove usage of "mw" Clean up: struct rpcrdma_mw was named after Memory Windows, but xprtrdma no longer supports a Memory Window registration mode. Rename rpcrdma_mw and its fields to reduce confusion and make the code more sensible to read. Renaming "mw" was suggested by Tom Talpey, the author of the original xprtrdma implementation. It's a good idea, but I haven't done this until now because it's a huge diffstat for no benefit other than code readability. However, I'm about to introduce static trace points that expose a few of xprtrdma's internal data structures. They should make sense in the trace report, and it's reasonable to treat trace points as a kernel API contract which might be difficult to change later. While I'm churning things up, two additional changes: - rename variables unhelpfully called "r" to "mr", to improve code clarity, and - rename the MR-related helper functions using the form "rpcrdma_mr_<verb>", to be consistent with other areas of the code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:51 -05:00
Chuck Lever	ce5b371782	xprtrdma: Replace all usage of "frmr" with "frwr" Clean up: Over time, the industry has adopted the term "frwr" instead of "frmr". The term "frwr" is now more widely recognized. For the past couple of years I've attempted to add new code using "frwr" , but there still remains plenty of older code that still uses "frmr". Replace all usage of "frmr" to avoid confusion. While we're churning code, rename variables unhelpfully called "f" to "frwr", to improve code clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:50 -05:00
Chuck Lever	30b5416bf0	xprtrdma: Don't clear RPC_BC_PA_IN_USE on pre-allocated rpc_rqst's No need for the overhead of atomically setting and clearing this bit flag for every use of a pre-allocated backchannel rpc_rqst. These are a distinct pool of rpc_rqsts that are used only for callback operations, so it is safe to simply leave the bit set. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:49 -05:00
Chuck Lever	cf73daf527	xprtrdma: Split xprt_rdma_send_request Clean up. @rqst is set up differently for backchannel Replies. For example, rqst->rq_task and task->tk_client are both NULL. So it is easier to understand and maintain this code path if it is separated. Also, we can get rid of the confusing rl_connect_cookie hack in rpcrdma_bc_receive_call. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:48 -05:00
Chuck Lever	6c537f2c7c	xprtrdma: buf_free not called for CB replies Since commit `5a6d1db455` ("SUNRPC: Add a transport-specific private field in rpc_rqst"), the rpc_rqst's for RPC-over-RDMA backchannel operations leave rq_buffer set to NULL. xprt_release does not invoke ->op->buf_free when rq_buffer is NULL. The RPCRDMA_REQ_F_BACKCHANNEL check in xprt_rdma_free is therefore redundant because xprt_rdma_free is not invoked for backchannel requests. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:47 -05:00
Chuck Lever	a2b6470b1c	xprtrdma: Move unmap-safe logic to rpcrdma_marshal_req Clean up. This logic is related to marshaling the request, and I'd like to keep everything that touches req->rl_registered close together, for CPU cache efficiency. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:47 -05:00
Chuck Lever	20035edf3c	xprtrdma: Support IPv6 in xprt_rdma_set_port Clean up a harmless oversight. xprtrdma's ->set_port method has never properly supported IPv6. This issue has never been a problem because NFS/RDMA mounts have always required "port=20049", thus so far, rpcbind is not invoked for these mounts. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:46 -05:00
Chuck Lever	dd229cee4e	xprtrdma: Remove another sockaddr_storage field (cdata::addr) Save more space in struct rpcrdma_xprt by removing the redundant "addr" field from struct rpcrdma_create_data_internal. Wherever we have rpcrdma_xprt, we also have the rpc_xprt, which has a sockaddr_storage field with the same content. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:45 -05:00
Chuck Lever	d461f1f2fb	xprtrdma: Initialize the xprt address string array earlier This makes the address strings available for debugging messages in earlier stages of transport set up. The first benefit is to get rid of the single-use rep_remote_addr field, saving 128+ bytes in struct rpcrdma_ep. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:45 -05:00
Chuck Lever	104927042c	xprtrdma: Remove unused padding variables Clean up. Remove fields that should have been removed by commit `b3221d6a53` ("xprtrdma: Remove logic that constructs RDMA_MSGP type calls"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:44 -05:00
Chuck Lever	3f0e3edd6a	xprtrdma: Remove ri_reminv_expected Clean up. Commit `b5f0afbea4` ("xprtrdma: Per-connection pad optimization") should have removed this. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:43 -05:00
Chuck Lever	c34416182f	xprtrdma: Per-mode handling for Remote Invalidation Refactoring change: Remote Invalidation is particular to the memory registration mode that is use. Use a callout instead of a generic function to handle Remote Invalidation. This gets rid of the 8-byte flags field in struct rpcrdma_mw, of which only a single bit flag has been allocated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:43 -05:00
Chuck Lever	42b9f5c58a	xprtrdma: Eliminate unnecessary lock cycle in xprt_rdma_send_request The rpcrdma_req is not shared yet, and its associated Send hasn't been posted, thus RMW should be safe. There's no need for the expense of a lock cycle here. Fixes: `0ba6f37012` ("xprtrdma: Refactor rpcrdma_deferred_completion") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:42 -05:00
Chuck Lever	d698c4a02e	xprtrdma: Fix backchannel allocation of extra rpcrdma_reps The backchannel code uses rpcrdma_recv_buffer_put to add new reps to the free rep list. This also decrements rb_recv_count, which spoofs the receive overrun logic in rpcrdma_buffer_get_rep. Commit `9b06688bc3` ("xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)") replaced the original open-coded list_add with a call to rpcrdma_recv_buffer_put(), but then a year later, commit `05c974669e` ("xprtrdma: Fix receive buffer accounting") added rep accounting to rpcrdma_recv_buffer_put. It was an oversight to let the backchannel continue to use this function. The fix this, let's combine the "add to free list" logic with rpcrdma_create_rep. Also, do not allocate RPCRDMA_MAX_BC_REQUESTS rpcrdma_reps in rpcrdma_buffer_create and then allocate additional rpcrdma_reps in rpcrdma_bc_setup_reps. Allocating the extra reps during backchannel set-up is sufficient. Fixes: `05c974669e` ("xprtrdma: Fix receive buffer accounting") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:41 -05:00
Chuck Lever	03ac1a76ce	xprtrdma: Fix buffer leak after transport set up failure This leak has been around forever, and is exceptionally rare. EINVAL causes mount to fail with "an incorrect mount option was specified" although it's not likely that one of the mount options is incorrect. Instead, return ENODEV in this case, as this appears to be an issue with system or device configuration rather than a specific mount option. Some obsolete comments are also removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-01-16 11:19:41 -05:00
Arnd Bergmann	41e4b39115	netfilter: nf_defrag: move NF_CONNTRACK bits into #ifdef We cannot access the skb->_nfct field when CONFIG_NF_CONNTRACK is disabled: net/ipv4/netfilter/nf_defrag_ipv4.c: In function 'ipv4_conntrack_defrag': net/ipv4/netfilter/nf_defrag_ipv4.c:83:9: error: 'struct sk_buff' has no member named '_nfct' net/ipv6/netfilter/nf_defrag_ipv6_hooks.c: In function 'ipv6_defrag': net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68:9: error: 'struct sk_buff' has no member named '_nfct' Both functions already have an #ifdef for this, so let's move the check in there. Fixes: `902d6a4c2a` ("netfilter: nf_defrag: Skip defrag if NOTRACK is set") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-16 01:52:08 +01:00
Arnd Bergmann	b069b37adb	netfilter: nf_defrag: mark xt_table structures 'const' again As a side-effect of adding the module option, we now get a section mismatch warning: WARNING: net/ipv4/netfilter/iptable_raw.o(.data+0x1c): Section mismatch in reference from the variable packet_raw to the function .init.text:iptable_raw_table_init() The variable packet_raw references the function __init iptable_raw_table_init() If the reference is valid then annotate the variable with __init* or __refdata (see linux/init.h) or name the variable: _template, _timer, _sht, _ops, _probe, _probe_one, *_console Apparently it's ok to link to a __net_init function from .rodata but not from .data. We can address this by rearranging the logic so that the structure is read-only again. Instead of writing to the .priority field later, we have an extra copies of the structure with that flag. An added advantage is that that we don't have writable function pointers with this approach. Fixes: `902d6a4c2a` ("netfilter: nf_defrag: Skip defrag if NOTRACK is set") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-16 01:52:07 +01:00
Subash Abhinov Kasiviswanathan	83f1999cae	netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460 ipv6_defrag pulls network headers before fragment header. In case of an error, the netfilter layer is currently dropping these packets. This results in failure of some IPv6 standards tests which passed on older kernels due to the netfilter framework using cloning. The test case run here is a check for ICMPv6 error message replies when some invalid IPv6 fragments are sent. This specific test case is listed in https://www.ipv6ready.org/docs/Core_Conformance_Latest.pdf in the Extension Header Processing Order section. A packet with unrecognized option Type 11 is sent and the test expects an ICMP error in line with RFC2460 section 4.2 - 11 - discard the packet and, only if the packet's Destination Address was not a multicast address, send an ICMP Parameter Problem, Code 2, message to the packet's Source Address, pointing to the unrecognized Option Type. Since netfilter layer now drops all invalid IPv6 frag packets, we no longer see the ICMP error message and fail the test case. To fix this, save the transport header. If defrag is unable to process the packet due to RFC2460, restore the transport header and allow packet to be processed by stack. There is no change for other packet processing paths. Tested by confirming that stack sends an ICMP error when it receives these packets. Also tested that fragmented ICMP pings succeed. v1->v2: Instead of cloning always, save the transport_header and restore it in case of this specific error. Update the title and commit message accordingly. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-16 01:52:06 +01:00
Florian Westphal	e3eeacbac4	netfilter: x_tables: don't return garbage pointer on modprobe failure request_module may return a positive error result from modprobe, if we cast this to ERR_PTR this returns a garbage result (it passes IS_ERR checks). Fix it by ignoring modprobe return values entirely, just retry the table lookup instead. Reported-by: syzbot+980925dbfbc7f93bc2ef@syzkaller.appspotmail.com Fixes: `03d13b6868` ("netfilter: xtables: add and use xt_request_find_table_lock") Fixes: `20651cefd2` ("netfilter: x_tables: unbreak module auto loading") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-16 01:51:59 +01:00
Arnd Bergmann	9be9d04b28	netfilter: nf_tables: flow_offload depends on flow_table Without CONFIG_NF_FLOW_TABLE, the new nft_flow_offload module produces a link error: net/netfilter/nft_flow_offload.o: In function `nft_flow_offload_iterate_cleanup': nft_flow_offload.c:(.text+0xb0): undefined reference to `nf_flow_table_iterate' net/netfilter/nft_flow_offload.o: In function `flow_offload_iterate_cleanup': nft_flow_offload.c:(.text+0x160): undefined reference to `flow_offload_dead' net/netfilter/nft_flow_offload.o: In function `nft_flow_offload_eval': nft_flow_offload.c:(.text+0xc4c): undefined reference to `flow_offload_alloc' nft_flow_offload.c:(.text+0xc64): undefined reference to `flow_offload_add' nft_flow_offload.c:(.text+0xc94): undefined reference to `flow_offload_free' This adds a Kconfig dependency for it. Fixes: `a3c90f7a23` ("netfilter: nf_tables: flow offload expression") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-16 01:19:16 +01:00
David S. Miller	79d891c1bb	linux-can-next-for-4.16-20180105 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlpPT5ATHG1rbEBwZW5n dXRyb25peC5kZQAKCRAe/sog7Dgc9tyZB/wNk7hfmWT7qMSq4nB1/l4DvlCVtQR+ 7t7jLltd2ld1bqFr62S1/NExWbgm9GXS25wHgLQQn8I0jwCyuFb8K+VIe/+t9vSu PXOihUlIXCqpJwI9FtvGb/jmIbHV1JbnGv1b/J1q34FzhThsXN3DPX5BI5+T+Hy4 9hnHuYtcveyGlU08RsePyc6WfCzBJafR1YpJYSSsIxmtT6Db0SyRSZjY4MFzv9eA mV+wvSpvepiw7tDN9XhSdNQJR9HAh/AXkYRgU448BysqhR5tK5oq8QAjsJK2Usy7 X1RY/M32fn1QdcwfWEWw5xB9ZblKMnxRzB3vmGLkyvIuPnP/JGQoq5sW =BrhI -----END PGP SIGNATURE----- Merge tag 'linux-can-next-for-4.16-20180105' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2017-12-01,Re: pull-request: can-next this is a pull request of 7 patches for net-next/master. All patches are by me. Patch 6 is for the "can_raw" protocol and add error checking to the bind() function. All other patches clean up the coding style and remove unused parameters in various CAN drivers and infrastructure. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 16:13:34 -05:00
Kees Cook	289a4860d1	net: Restrict unwhitelisted proto caches to size 0 Now that protocols have been annotated (the copy of icsk_ca_ops->name is of an ops field from outside the slab cache): $ git grep 'copy_._user.sk.*->' caif/caif_socket.c: copy_from_user(&cf_sk->conn_req.param.data, ov, ol)) { ipv4/raw.c: if (copy_from_user(&raw_sk(sk)->filter, optval, optlen)) ipv4/raw.c: copy_to_user(optval, &raw_sk(sk)->filter, len)) ipv4/tcp.c: if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) ipv4/tcp.c: if (copy_to_user(optval, icsk->icsk_ulp_ops->name, len)) ipv6/raw.c: if (copy_from_user(&raw6_sk(sk)->filter, optval, optlen)) ipv6/raw.c: if (copy_to_user(optval, &raw6_sk(sk)->filter, len)) sctp/socket.c: if (copy_from_user(&sctp_sk(sk)->subscribe, optval, optlen)) sctp/socket.c: if (copy_to_user(optval, &sctp_sk(sk)->subscribe, len)) sctp/socket.c: if (copy_to_user(optval, &sctp_sk(sk)->initmsg, len)) we can switch the default proto usercopy region to size 0. Any protocols needing to add whitelisted regions must annotate the fields with the useroffset and usersize fields of struct proto. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2018-01-15 12:08:02 -08:00
David Windsor	b2ce04c2a3	sctp: Copy struct sctp_sock.autoclose to userspace using put_user() The autoclose field can be copied with put_user(), so there is no need to use copy_to_user(). In both cases, hardened usercopy is being bypassed since the size is constant, and not open to runtime manipulation. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log] Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-sctp@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2018-01-15 12:08:01 -08:00
David Windsor	ab9ee8e38b	sctp: Define usercopy region in SCTP proto slab cache The SCTP socket event notification subscription information need to be copied to/from userspace. In support of usercopy hardening, this patch defines a region in the struct proto slab cache in which userspace copy operations are allowed. Additionally moves the usercopy fields to be adjacent for the region to cover both. example usage trace: net/sctp/socket.c: sctp_getsockopt_events(...): ... copy_to_user(..., &sctp_sk(sk)->subscribe, len) sctp_setsockopt_events(...): ... copy_from_user(&sctp_sk(sk)->subscribe, ..., optlen) sctp_getsockopt_initmsg(...): ... copy_to_user(..., &sctp_sk(sk)->initmsg, len) This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: split from network patch, move struct members adjacent] [kees: add SCTPv6 struct whitelist, provide usage trace] Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-sctp@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2018-01-15 12:08:00 -08:00
David Windsor	93070d339d	caif: Define usercopy region in caif proto slab cache The CAIF channel connection request parameters need to be copied to/from userspace. In support of usercopy hardening, this patch defines a region in the struct proto slab cache in which userspace copy operations are allowed. example usage trace: net/caif/caif_socket.c: setsockopt(...): ... copy_from_user(&cf_sk->conn_req.param.data, ..., ol) This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: split from network patch, provide usage trace] Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2018-01-15 12:08:00 -08:00
David Windsor	8c2bc895a9	ip: Define usercopy region in IP proto slab cache The ICMP filters for IPv4 and IPv6 raw sockets need to be copied to/from userspace. In support of usercopy hardening, this patch defines a region in the struct proto slab cache in which userspace copy operations are allowed. example usage trace: net/ipv4/raw.c: raw_seticmpfilter(...): ... copy_from_user(&raw_sk(sk)->filter, ..., optlen) raw_geticmpfilter(...): ... copy_to_user(..., &raw_sk(sk)->filter, len) net/ipv6/raw.c: rawv6_seticmpfilter(...): ... copy_from_user(&raw6_sk(sk)->filter, ..., optlen) rawv6_geticmpfilter(...): ... copy_to_user(..., &raw6_sk(sk)->filter, len) This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: split from network patch, provide usage trace] Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2018-01-15 12:07:59 -08:00
David Windsor	30c2c9f158	net: Define usercopy region in struct proto slab cache In support of usercopy hardening, this patch defines a region in the struct proto slab cache in which userspace copy operations are allowed. Some protocols need to copy objects to/from userspace, and they can declare the region via their proto structure with the new usersize and useroffset fields. Initially, if no region is specified (usersize == 0), the entire field is marked as whitelisted. This allows protocols to be whitelisted in subsequent patches. Once all protocols have been annotated, the full-whitelist default can be removed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, split off per-proto patches] [kees: add logic for by-default full-whitelist] Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2018-01-15 12:07:58 -08:00
Jim Westfall	cd9ff4de01	ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY Map all lookup neigh keys to INADDR_ANY for loopback/point-to-point devices to avoid making an entry for every remote ip the device needs to talk to. This used the be the old behavior but became broken in `a263b30936` (ipv4: Make neigh lookups directly in output packet path) and later removed in `0bb4087cbe` (ipv4: Fix neigh lookup keying over loopback/point-to-point devices) because it was broken. Signed-off-by: Jim Westfall <jwestfall@surrealistic.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:53:43 -05:00
Jim Westfall	096b9854c0	net: Allow neigh contructor functions ability to modify the primary_key Use n->primary_key instead of pkey to account for the possibility that a neigh constructor function may have modified the primary_key value. Signed-off-by: Jim Westfall <jwestfall@surrealistic.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:53:43 -05:00
William Tu	95a332088e	Revert "openvswitch: Add erspan tunnel support." This reverts commit `ceaa001a17`. The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed as a nested attribute to support all ERSPAN v1 and v2's fields. The current attr is a be32 supporting only one field. Thus, this patch reverts it and later patch will redo it using nested attr. Signed-off-by: William Tu <u9012063@gmail.com> Cc: Jiri Benc <jbenc@redhat.com> Cc: Pravin Shelar <pshelar@ovn.org> Acked-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:33:16 -05:00
Ido Schimmel	6802f3adcb	ipv6: Fix build with gcc-4.4.5 Emil reported the following compiler errors: net/ipv6/route.c: In function `rt6_sync_up`: net/ipv6/route.c:3586: error: unknown field `nh_flags` specified in initializer net/ipv6/route.c:3586: warning: missing braces around initializer net/ipv6/route.c:3586: warning: (near initialization for `arg.<anonymous>`) net/ipv6/route.c: In function `rt6_sync_down_dev`: net/ipv6/route.c:3695: error: unknown field `event` specified in initializer net/ipv6/route.c:3695: warning: missing braces around initializer net/ipv6/route.c:3695: warning: (near initialization for `arg.<anonymous>`) Problem is with the named initializers for the anonymous union members. Fix this by adding curly braces around the initialization. Fixes: `4c981e28d3` ("ipv6: Prepare to handle multiple netdev events") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com> Tested-by: Emil S Tantilov <emils.tantilov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:28:05 -05:00
Jon Maloy	e9a034456a	tipc: fix bug during lookup of multicast destination nodes In commit `232d07b74a` ("tipc: improve groupcast scope handling") we inadvertently broke non-group multicast transmission when changing the parameter 'domain' to 'scope' in the function tipc_nametbl_lookup_dst_nodes(). We missed to make the corresponding change in the calling function, with the result that the lookup always fails. A closer anaysis reveals that this parameter is not needed at all. Non-group multicast is hard coded to use CLUSTER_SCOPE, and in the current implementation this will be delivered to all matching destinations except those which are published with NODE_SCOPE on other nodes. Since such publications never will be visible on the sending node anyway, it makes no sense to discriminate by scope at all. We now remove this parameter altogether. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:27:13 -05:00
Kirill Tkhai	273c28bc57	net: Convert atomic_t net::count to refcount_t Since net could be obtained from RCU lists, and there is a race with net destruction, the patch converts net::count to refcount_t. This provides sanity checks for the cases of incrementing counter of already dead net, when maybe_get_net() has to used instead of get_net(). Drivers: allyesconfig and allmodconfig are OK. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:23:42 -05:00
r.hering@avm.de	30be8f8dba	net/tls: Fix inverted error codes to avoid endless loop sendfile() calls can hang endless with using Kernel TLS if a socket error occurs. Socket error codes must be inverted by Kernel TLS before returning because they are stored with positive sign. If returned non-inverted they are interpreted as number of bytes sent, causing endless looping of the splice mechanic behind sendfile(). Signed-off-by: Robert Hering <r.hering@avm.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:21:57 -05:00
Eric Dumazet	95ef498d97	ipv6: ip6_make_skb() needs to clear cork.base.dst In my last patch, I missed fact that cork.base.dst was not initialized in ip6_make_skb() : If ip6_setup_cork() returns an error, we might attempt a dst_release() on some random pointer. Fixes: `862c03ee1d` ("ipv6: fix possible mem leaks in ipv6_make_skb()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 14:19:32 -05:00
Marcelo Ricardo Leitner	594831a8ab	sctp: removed unused var from sctp_make_auth Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:59:42 -05:00
Marcelo Ricardo Leitner	37f47bc90c	sctp: avoid compiler warning on implicit fallthru These fall-through are expected. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:56:13 -05:00
Lorenzo Colitti	6503a30440	net: ipv4: Make "ip route get" match iif lo rules again. Commit `3765d35ed8` ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") broke "ip route get" in the presence of rules that specify iif lo. Host-originated traffic always has iif lo, because ip_route_output_key_hash and ip6_route_output_flags set the flow iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a convenient way to select only originated traffic and not forwarded traffic. inet_rtm_getroute used to match these rules correctly because even though it sets the flow iif to 0, it called ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX. But now that it calls ip_route_output_key_hash_rcu, the ifindex will remain 0 and not match the iif lo in the rule. As a result, "ip route get" will return ENETUNREACH. Fixes: `3765d35ed8` ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Tested: https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py passes again Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:53:30 -05:00
David Ahern	cbbdf8433a	netlink: extack needs to be reset each time through loop syzbot triggered the WARN_ON in netlink_ack testing the bad_attr value. The problem is that netlink_rcv_skb loops over the skb repeatedly invoking the callback and without resetting the extack leaving potentially stale data. Initializing each time through avoids the WARN_ON. Fixes: `2d4bc93368` ("netlink: extended ACK reporting") Reported-by: syzbot+315fa6766d0f7c359327@syzkaller.appspotmail.com Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:50:07 -05:00
Cong Wang	59b36613e8	tipc: fix a memory leak in tipc_nl_node_get_link() When tipc_node_find_by_name() fails, the nlmsg is not freed. While on it, switch to a goto label to properly free it. Fixes: be9c086715c ("tipc: narrow down exposure of struct tipc_node") Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:45:50 -05:00
Jon Maloy	febafc8455	tipc: fix a potental access after delete in tipc_sk_join() In commit `d12d2e12ce` "tipc: send out join messages as soon as new member is discovered") we added a call to the function tipc_group_join() without considering the case that the preceding tipc_sk_publish() might have failed, and the group item already deleted. We fix this by returning from tipc_sk_join() directly after the failed tipc_sk_publish. Reported-by: syzbot+e3eeae78ea88b8d6d858@syzkaller.appspotmail.com Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:43:51 -05:00
Mike Maloney	749439bfac	ipv6: fix udpv6 sendmsg crash caused by too small MTU The logic in __ip6_append_data() assumes that the MTU is at least large enough for the headers. A device's MTU may be adjusted after being added while sendmsg() is processing data, resulting in __ip6_append_data() seeing any MTU. For an mtu smaller than the size of the fragmentation header, the math results in a negative 'maxfraglen', which causes problems when refragmenting any previous skb in the skb_write_queue, leaving it possibly malformed. Instead sendmsg returns EINVAL when the mtu is calculated to be less than IPV6_MIN_MTU. Found by syzkaller: kernel BUG at ./include/linux/skbuff.h:2064! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 14216 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #2 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801d0b68580 task.stack: ffff8801ac6b8000 RIP: 0010:__skb_pull include/linux/skbuff.h:2064 [inline] RIP: 0010:__ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617 RSP: 0018:ffff8801ac6bf570 EFLAGS: 00010216 RAX: 0000000000010000 RBX: 0000000000000028 RCX: ffffc90003cce000 RDX: 00000000000001b8 RSI: ffffffff839df06f RDI: ffff8801d9478ca0 RBP: ffff8801ac6bf780 R08: ffff8801cc3f1dbc R09: 0000000000000000 R10: ffff8801ac6bf7a0 R11: 43cb4b7b1948a9e7 R12: ffff8801cc3f1dc8 R13: ffff8801cc3f1d40 R14: 0000000000001036 R15: dffffc0000000000 FS: 00007f43d740c700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7834984000 CR3: 00000001d79b9000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip6_finish_skb include/net/ipv6.h:911 [inline] udp_v6_push_pending_frames+0x255/0x390 net/ipv6/udp.c:1093 udpv6_sendmsg+0x280d/0x31a0 net/ipv6/udp.c:1363 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 SYSC_sendto+0x352/0x5a0 net/socket.c:1750 SyS_sendto+0x40/0x50 net/socket.c:1718 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x4512e9 RSP: 002b:00007f43d740bc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000007180a8 RCX: 00000000004512e9 RDX: 000000000000002e RSI: 0000000020d08000 RDI: 0000000000000005 RBP: 0000000000000086 R08: 00000000209c1000 R09: 000000000000001c R10: 0000000000040800 R11: 0000000000000216 R12: 00000000004b9c69 R13: 00000000ffffffff R14: 0000000000000005 R15: 00000000202c2000 Code: 9e 01 fe e9 c5 e8 ff ff e8 7f 9e 01 fe e9 4a ea ff ff 48 89 f7 e8 52 9e 01 fe e9 aa eb ff ff e8 a8 b6 cf fd 0f 0b e8 a1 b6 cf fd <0f> 0b 49 8d 45 78 4d 8d 45 7c 48 89 85 78 fe ff ff 49 8d 85 ba RIP: __skb_pull include/linux/skbuff.h:2064 [inline] RSP: ffff8801ac6bf570 RIP: __ip6_make_skb+0x18cf/0x1f70 net/ipv6/ip6_output.c:1617 RSP: ffff8801ac6bf570 Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Mike Maloney <maloney@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:28:18 -05:00
Stephen Hemminger	d542296a4d	9p: add missing module license for xen transport The 9P of Xen module is missing required license and module information. See https://bugzilla.kernel.org/show_bug.cgi?id=198109 Reported-by: Alan Bartlett <ajb@elrepo.org> Fixes: `868eb12273` ("xen/9pfs: introduce Xen 9pfs transport driver") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-15 13:13:53 -05:00
Johannes Berg	59b179b48c	cfg80211: check dev_set_name() return value syzbot reported a warning from rfkill_alloc(), and after a while I think that the reason is that it was doing fault injection and the dev_set_name() failed, leaving the name NULL, and we didn't check the return value and got to rfkill_alloc() with a NULL name. Since we really don't want a NULL name, we ought to check the return value. Fixes: `fb28ad3590` ("net: struct device - replace bus_id with dev_name(), dev_set_name()") Reported-by: syzbot+1ddfb3357e1d7bb5b5d3@syzkaller.appspotmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-15 11:35:06 +01:00
Johannes Berg	51a1aaa631	mac80211_hwsim: validate number of different channels When creating a new radio on the fly, hwsim allows this to be done with an arbitrary number of channels, but cfg80211 only supports a limited number of simultaneous channels, leading to a warning. Fix this by validating the number - this requires moving the define for the maximum out to a visible header file. Reported-by: syzbot+8dd9051ff19940290931@syzkaller.appspotmail.com Fixes: `b59ec8dd43` ("mac80211_hwsim: fix number of channels in interface combinations") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-15 09:34:45 +01:00
Dominik Brodowski	7a94b8c2ee	nl80211: take RCU read lock when calling ieee80211_bss_get_ie() As ieee80211_bss_get_ie() derefences an RCU to return ssid_ie, both the call to this function and any operation on this variable need protection by the RCU read lock. Fixes: `44905265bc` ("nl80211: don't expose wdev->ssid for most interfaces") Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-15 09:15:04 +01:00
Johannes Berg	a48a52b7be	cfg80211: fully initialize old channel for event Paul reported that he got a report about undefined behaviour that seems to me to originate in using uninitialized memory when the channel structure here is used in the event code in nl80211 later. He never reported whether this fixed it, and I wasn't able to trigger this so far, but we should do the right thing and fully initialize the on-stack structure anyway. Reported-by: Paul Menzel <pmenzel+linux-wireless@molgen.mpg.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-15 09:15:03 +01:00
Trond Myklebust	0af3442af7	SUNRPC: Add explicit rescheduling points in the receive path When reading the reply from the server, insert an explicit cond_resched() to avoid starving higher priority tasks. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2018-01-14 23:06:30 -05:00
Trond Myklebust	3d188805f8	SUNRPC: Chunk reading of replies from the server Read the TCP data in chunks of max 2MB so that we do not hog the socket lock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2018-01-14 23:06:30 -05:00
Chuck Lever	024fbf9c2e	SUNRPC: Remove rpc_protocol() Since nfs4_create_referral_server was the only call site of rpc_protocol, rpc_protocol can now be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2018-01-14 23:06:30 -05:00
Alexei Starovoitov	68fda450a7	bpf: fix 32-bit divide by zero due to some JITs doing if (src_reg == 0) check in 64-bit mode for div/mod operations mask upper 32-bits of src register before doing the check Fixes: `622582786c` ("net: filter: x86: internal BPF JIT") Fixes: `7a12b5031c` ("sparc64: Add eBPF JIT.") Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-01-14 23:05:33 +01:00
David S. Miller	564737f981	Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 10GbE Intel Wired LAN Driver Updates 2018-01-12 This series contains updates to ixgbe, fm10k and net core. Alex updates the driver to remove a duplicate MAC address check and verifies that we have not run out of resources to configure a MAC rule in our filter table. Also do not assume that dev->num_tc was populated and configured with the driver, since it can be configured via mqprio without any hardware coordination. Fixed the recording of stats for MACVLAN in ixgbe and fm10k instead of recording the receive queue on MACVLAN offloaded frames. When handling a MACVLAN offload, we should be stopping/starting traffic on our own queues instead of the upper devices transmit queues. Fixed possible race conditions with the MACVLAN cleanup with the interface cleanup on shutdown. With the recent fixes to ixgbe, we can cap the number of queues regardless of accel_priv being in use or not, since the actual number of queues are being reported via real_num_tx_queues. Tony fixes up the kernel documentation for ixgbe and ixgbevf to resolve warnings when W=1 is used. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-14 12:25:04 -05:00
Nogah Frankel	7fdb61b44c	net: sch: prio: Add offload ability to PRIO qdisc Add the ability to offload PRIO qdisc by using ndo_setup_tc. There are three commands for PRIO offloading: * TC_PRIO_REPLACE: handles set and tune * TC_PRIO_DESTROY: handles qdisc destroy * TC_PRIO_STATS: updates the qdiscs counters (given as reference) Like RED qdisc, the indication of whether PRIO is being offloaded is being set and updated as part of the dump function. It is so because the driver could decide to offload or not based on the qdisc parent, which could change without notifying the qdisc. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Reviewed-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-14 12:21:11 -05:00
Linus Torvalds	9e8f8f1ef4	Char/Misc fixes for 4.15-rc8 Here are two bugfixes for some driver bugs for 4.15-rc8 The first is a bluetooth security bug that has been ignored by the Bluetooth developers for months for no obvious reason at all, so I've taken it through my tree. The second is a simple double-free bug in the mux subsystem. Both have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWlppww8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylnwgCeOrW4MKzAG9nc+ZWKRw5CeWVqx9AAoLyQeiF6 KyLdQ6C2GiSRHtz7memv =Zbvd -----END PGP SIGNATURE----- Merge tag 'char-misc-4.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc fixes from Greg KH: "Here are two bugfixes for some driver bugs for 4.15-rc8 The first is a bluetooth security bug that has been ignored by the Bluetooth developers for months for no obvious reason at all, so I've taken it through my tree. The second is a simple double-free bug in the mux subsystem. Both have been in linux-next for a while with no reported issues" * tag 'char-misc-4.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: mux: core: fix double get_device() Bluetooth: Prevent stack info leak from the EFS element.	2018-01-13 14:01:59 -08:00
Jesper Dangaard Brouer	daaf24c634	bpf: simplify xdp_convert_ctx_access for xdp_rxq_info As pointed out by Daniel Borkmann, using bpf_target_off() is not necessary for xdp_rxq_info when extracting queue_index and ifindex, as these members are u32 like BPF_W. Also fix trivial spelling mistake introduced in same commit. Fixes: `02dd3291b2` ("bpf: finally expose xdp_rxq_info to XDP bpf-programs") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-01-13 00:10:18 +01:00
Alexander Duyck	d584527c70	net: Cap number of queues even with accel_priv With the recent fix to ixgbe we can cap the number of queues always regardless of if accel_priv is being used or not since the actual number of queues are being reported via real_num_tx_queues. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2018-01-12 08:20:36 -08:00
David S. Miller	9c70f1a7fa	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-01-11 1) Don't allow to change the encap type on state updates. The encap type is set on state initialization and should not change anymore. From Herbert Xu. 2) Skip dead policies when rehashing to fix a slab-out-of-bounds bug in xfrm_hash_rebuild. From Florian Westphal. 3) Two buffer overread fixes in pfkey. From Eric Biggers. 4) Fix rcu usage in xfrm_get_type_offload, request_module can sleep, so can't be used under rcu_read_lock. From Sabrina Dubroca. 5) Fix an uninitialized lock in xfrm_trans_queue. Use __skb_queue_tail instead of skb_queue_tail in xfrm_trans_queue as we don't need the lock. From Herbert Xu. 6) Currently it is possible to create an xfrm state with an unknown encap type in ESP IPv4. Fix this by returning an error on unknown encap types. Also from Herbert Xu. 7) Fix sleeping inside a spinlock in xfrm_policy_cache_flush. From Florian Westphal. 8) Fix ESP GRO when the headers not fully in the linear part of the skb. We need to pull before we can access them. 9) Fix a skb leak on error in key_notify_policy. 10) Fix a race in the xdst pcpu cache, we need to run the resolver routines with bottom halfes off like the old flowcache did. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-12 10:32:49 -05:00
David S. Miller	19d28fbd30	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net BPF alignment tests got a conflict because the registers are output as Rn_w instead of just Rn in net-next, and in net a fixup for a testcase prohibits logical operations on pointers before using them. Also, we should attempt to patch BPF call args if JIT always on is enabled. Instead, if we fail to JIT the subprogs we should pass an error back up and fail immediately. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-11 22:13:42 -05:00
David S. Miller	8c2e6c904f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-01-11 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Various BPF related improvements and fixes to nfp driver: i) do not register XDP RXQ structure to control queues, ii) round up program stack size to word size for nfp, iii) restrict MTU changes when BPF offload is active, iv) add more fully featured relocation support to JIT, v) add support for signed compare instructions to the nfp JIT, vi) export and reuse verfier log routine for nfp, and many more, from Jakub, Quentin and Nic. 2) Fix a syzkaller reported GPF in BPF's copy_verifier_state() when we hit kmalloc failure path, from Alexei. 3) Add two follow-up fixes for the recent XDP RXQ series: i) kvzalloc() allocated memory was only kfree()'ed, and ii) fix a memory leak where RX queue was not freed in netif_free_rx_queues(), from Jakub. 4) Add a sample for transferring XDP meta data into the skb, here it is used for setting skb->mark with the buffer from XDP, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-11 13:59:41 -05:00
Subash Abhinov Kasiviswanathan	902d6a4c2a	netfilter: nf_defrag: Skip defrag if NOTRACK is set conntrack defrag is needed only if some module like CONNTRACK or NAT explicitly requests it. For plain forwarding scenarios, defrag is not needed and can be skipped if NOTRACK is set in a rule. Since conntrack defrag is currently higher priority than raw table, setting NOTRACK is not sufficient. We need to move raw to a higher priority for iptables only. This is achieved by introducing a module parameter "raw_before_defrag" which allows to change the priority of raw table to place it before defrag. By default, the parameter is disabled and the priority of raw table is NF_IP_PRI_RAW to support legacy behavior. If the module parameter is enabled, then the priority of the raw table is set to NF_IP_PRI_RAW_BEFORE_DEFRAG. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-11 13:14:20 +01:00
Florian Westphal	5ed001baee	netfilter: clusterip: make sure arp hooks are available The clusterip target needs to register an arp mangling hook, so make sure NF_ARP hooks are available. Fixes: `2a95183a5e` ("netfilter: don't allocate space for arp/bridge hooks unless needed") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-11 13:12:26 +01:00
Linus Torvalds	cbd0a6a2cc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs regression fix from Al Viro/ Fix a leak in socket() introduced by commit `8e1611e235` ("make sock_alloc_file() do sock_release() on failures"). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Fix a leak in socket(2) when we fail to allocate a file descriptor.	2018-01-10 17:55:42 -08:00
Al Viro	ce4bb04cae	Fix a leak in socket(2) when we fail to allocate a file descriptor. Got broken by "make sock_alloc_file() do sock_release() on failures" - cleanup after sock_map_fd() failure got pulled all the way into sock_alloc_file(), but it used to serve the case when sock_map_fd() failed before getting to sock_alloc_file() as well, and that got lost. Trivial to fix, fortunately. Fixes: `8e1611e235` (make sock_alloc_file() do sock_release() on failures) Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-10 18:47:05 -05:00
Nogah Frankel	f8253df553	net: sch: red: Change offloaded xstats to be incremental Change the value of the xstats requested from the driver for offloaded RED to be incremental, like the normal stats. It increases consistency - if a qdisc stops being offloaded its xstats don't change. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Reviewed-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 16:07:40 -05:00
Mathieu Xhonneux	ccc12b11c5	ipv6: sr: fix TLVs not being copied using setsockopt Function ipv6_push_rthdr4 allows to add an IPv6 Segment Routing Header to a socket through setsockopt, but the current implementation doesn't copy possible TLVs at the end of the SRH received from userspace. Therefore, the execution of the following branch if (sr_has_hmac(sr_phdr)) { ... } will never complete since the len and type fields of a possible HMAC TLV are not copied, hence seg6_get_tlv_hmac will return an error, and the HMAC will not be computed. This commit adds a memcpy in case TLVs have been appended to the SRH. Fixes: `a149e7c7ce` ("ipv6: sr: add support for SRH injection through setsockopt") Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 16:03:55 -05:00
Eric Dumazet	862c03ee1d	ipv6: fix possible mem leaks in ipv6_make_skb() ip6_setup_cork() might return an error, while memory allocations have been done and must be rolled back. Fixes: `6422398c2a` ("ipv6: introduce ipv6_make_skb") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Reported-by: Mike Maloney <maloney@google.com> Acked-by: Mike Maloney <maloney@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 16:01:25 -05:00
Wei Yongjun	809a79e913	tcp: make local function tcp_recv_timestamp static Fixes the following sparse warning: net/ipv4/tcp.c:1736:6: warning: symbol 'tcp_recv_timestamp' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:55:35 -05:00
Cong Wang	78bbb15f22	8021q: fix a memory leak for VLAN 0 device A vlan device with vid 0 is allow to creat by not able to be fully cleaned up by unregister_vlan_dev() which checks for vlan_id!=0. Also, VLAN 0 is probably not a valid number and it is kinda "reserved" for HW accelerating devices, but it is probably too late to reject it from creation even if makes sense. Instead, just remove the check in unregister_vlan_dev(). Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: `ad1afb0039` ("vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)") Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:31:07 -05:00
Ido Schimmel	398958ae48	ipv6: Add support for non-equal-cost multipath The use of hash-threshold instead of modulo-N makes it trivial to add support for non-equal-cost multipath. Instead of dividing the multipath hash function's output space equally between the nexthops, each nexthop is assigned a region size which is proportional to its weight. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:14:44 -05:00
Ido Schimmel	3d709f69a3	ipv6: Use hash-threshold instead of modulo-N Now that each nexthop stores its region boundary in the multipath hash function's output space, we can use hash-threshold instead of modulo-N in multipath selection. This reduces the number of checks we need to perform during lookup, as dead and linkdown nexthops are assigned a negative region boundary. In addition, in contrast to modulo-N, only flows near region boundaries are affected when a nexthop is added or removed. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:14:44 -05:00
Ido Schimmel	7696c06a18	ipv6: Use a 31-bit multipath hash The hash thresholds assigned to IPv6 nexthops are in the range of [-1, 2^31 - 1], where a negative value is assigned to nexthops that should not be considered during multipath selection. Therefore, in a similar fashion to IPv4, we need to use the upper 31-bits of the multipath hash for multipath selection. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:14:44 -05:00
Ido Schimmel	d7dedee184	ipv6: Calculate hash thresholds for IPv6 nexthops Before we convert IPv6 to use hash-threshold instead of modulo-N, we first need each nexthop to store its region boundary in the hash function's output space. The boundary is calculated by dividing the output space equally between the different active nexthops. That is, nexthops that are not dead or linkdown. The boundaries are rebalanced whenever a nexthop is added or removed to a multipath route and whenever a nexthop becomes active or inactive. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:14:44 -05:00
Xiongfeng Wang	b0d55b5bc7	caif_usb: use strlcpy() instead of strncpy() gcc-8 reports net/caif/caif_usb.c: In function 'cfusbl_device_notify': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may be truncated copying 15 bytes from a string of length 15 [-Wstringop-truncation] The compiler require that the input param 'len' of strncpy() should be greater than the length of the src string, so that '\0' is copied as well. We can just use strlcpy() to avoid this warning. Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 15:06:14 -05:00
David S. Miller	65d51f2682	mlx5-updates-2018-01-08 Four patches from Or that add Hairpin support to mlx5: =========================================================== From: Or Gerlitz <ogerlitz@mellanox.com> We refer the ability of NIC HW to fwd packet received on one port to the other port (also from a port to itself) as hairpin. The application API is based on ingress tc/flower rules set on the NIC with the mirred redirect action. Other actions can apply to packets during the redirect. Hairpin allows to offload the data-path of various SW DDoS gateways, load-balancers, etc to HW. Packets go through all the required processing in HW (header re-write, encap/decap, push/pop vlan) and then forwarded, CPU stays at practically zero usage. HW Flow counters are used by the control plane for monitoring and accounting. Hairpin is implemented by pairing a receive queue (RQ) to send queue (SQ). All the flows that share <recv NIC, mirred NIC> are redirected through the same hairpin pair. Currently, only header-rewrite is supported as a packet modification action. I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing this functionality on HW simulator, before it was avail in the FW so the driver code could be tested early. =========================================================== From Feras three patches that provide very small changes that allow IPoIB to support RX timestamping for child interfaces, simply by hooking the mlx5e timestamping PTP ioctl to IPoIB child interface netdev profile. One patch from Gal to fix a spilling mistake. Two patches from Eugenia adds drop counters to VF statistics to be reported as part of VF statistics in netlink (iproute2) and implemented them in mlx5 eswitch. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJaVF5WAAoJEEg/ir3gV/o+fRkH/0PxjwJRA3REqhi/H8HOdH9f cBLrOzFdqTCYQWQFCLFbMQ/Zgoel3KglpJ0iQMjuVFfjMbybVXOe8FAEVdbWHnfL C+2HRMe8dplKrsq5UkxJhbyKhFKhl2XeMFYWonw9dSM7Nz5DyowQ1y1r5SgMlMAv t3mYAIa4kZHK18BjDoIsCoAXXwsHiztR2irMp5+DwataTGP7vC7AsrucDxLA/qFf I3E15DZk9s1f53PUuY7CYnUnJfMMP3VJdxpyx4k6xt9J2IMuilF4YyD6wpAKsVQU /LzRkWI9x/6QindffqlrACeeidimOeY4pC4txIhS5uXgFXulugDHq1/Ih1sgZS8= =g5vr -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2018-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux mlx5-updates-2018-01-08 Four patches from Or that add Hairpin support to mlx5: =========================================================== From: Or Gerlitz <ogerlitz@mellanox.com> We refer the ability of NIC HW to fwd packet received on one port to the other port (also from a port to itself) as hairpin. The application API is based on ingress tc/flower rules set on the NIC with the mirred redirect action. Other actions can apply to packets during the redirect. Hairpin allows to offload the data-path of various SW DDoS gateways, load-balancers, etc to HW. Packets go through all the required processing in HW (header re-write, encap/decap, push/pop vlan) and then forwarded, CPU stays at practically zero usage. HW Flow counters are used by the control plane for monitoring and accounting. Hairpin is implemented by pairing a receive queue (RQ) to send queue (SQ). All the flows that share <recv NIC, mirred NIC> are redirected through the same hairpin pair. Currently, only header-rewrite is supported as a packet modification action. I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing this functionality on HW simulator, before it was avail in the FW so the driver code could be tested early. =========================================================== From Feras three patches that provide very small changes that allow IPoIB to support RX timestamping for child interfaces, simply by hooking the mlx5e timestamping PTP ioctl to IPoIB child interface netdev profile. One patch from Gal to fix a spilling mistake. Two patches from Eugenia adds drop counters to VF statistics to be reported as part of VF statistics in netlink (iproute2) and implemented them in mlx5 eswitch. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 14:57:19 -05:00
Marcelo Ricardo Leitner	c76f97c99a	sctp: make use of pre-calculated len Some sockopt handling functions were calculating the length of the buffer to be written to userspace and then calculating it again when actually writing the buffer, which could lead to some write not using an up-to-date length. This patch updates such places to just make use of the len variable. Also, replace some sizeof(type) to sizeof(var). Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 14:53:22 -05:00
Marcelo Ricardo Leitner	5960cefab9	sctp: add a ceiling to optlen in some sockopts Hangbin Liu reported that some sockopt calls could cause the kernel to log a warning on memory allocation failure if the user supplied a large optlen value. That is because some of them called memdup_user() without a ceiling on optlen, allowing it to try to allocate really large buffers. This patch adds a ceiling by limiting optlen to the maximum allowed that would still make sense for these sockopt. Reported-by: Hangbin Liu <haliu@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 14:53:22 -05:00
Marcelo Ricardo Leitner	2e83acb970	sctp: GFP_ATOMIC is not needed in sctp_setsockopt_events So replace it with GFP_USER and also add __GFP_NOWARN. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 14:53:22 -05:00
Arnd Bergmann	a0a97f2a1a	netfilter: improve flow table Kconfig dependencies The newly added NF_FLOW_TABLE options cause some build failures in randconfig kernels: - when CONFIG_NF_CONNTRACK is disabled, or is a loadable module but NF_FLOW_TABLE is built-in: In file included from net/netfilter/nf_flow_table.c:8:0: include/net/netfilter/nf_conntrack.h:59:22: error: field 'ct_general' has incomplete type struct nf_conntrack ct_general; include/net/netfilter/nf_conntrack.h: In function 'nf_ct_get': include/net/netfilter/nf_conntrack.h:148:15: error: 'const struct sk_buff' has no member named '_nfct' include/net/netfilter/nf_conntrack.h: In function 'nf_ct_put': include/net/netfilter/nf_conntrack.h:157:2: error: implicit declaration of function 'nf_conntrack_put'; did you mean 'nf_ct_put'? [-Werror=implicit-function-declaration] net/netfilter/nf_flow_table.o: In function `nf_flow_offload_work_gc': (.text+0x1540): undefined reference to `nf_ct_delete' - when CONFIG_NF_TABLES is disabled: In file included from net/ipv6/netfilter/nf_flow_table_ipv6.c:13:0: include/net/netfilter/nf_tables.h: In function 'nft_gencursor_next': include/net/netfilter/nf_tables.h:1189:14: error: 'const struct net' has no member named 'nft'; did you mean 'nf'? - when CONFIG_NF_FLOW_TABLE_INET is enabled, but NF_FLOW_TABLE_IPV4 or NF_FLOW_TABLE_IPV6 are not, or are loadable modules net/netfilter/nf_flow_table_inet.o: In function `nf_flow_offload_inet_hook': nf_flow_table_inet.c:(.text+0x94): undefined reference to `nf_flow_offload_ipv6_hook' nf_flow_table_inet.c:(.text+0x40): undefined reference to `nf_flow_offload_ip_hook' - when CONFIG_NF_FLOW_TABLES is disabled, but the other options are enabled: net/netfilter/nf_flow_table_inet.o: In function `nf_flow_offload_inet_hook': nf_flow_table_inet.c:(.text+0x6c): undefined reference to `nf_flow_offload_ipv6_hook' net/netfilter/nf_flow_table_inet.o: In function `nf_flow_inet_module_exit': nf_flow_table_inet.c:(.exit.text+0x8): undefined reference to `nft_unregister_flowtable_type' net/netfilter/nf_flow_table_inet.o: In function `nf_flow_inet_module_init': nf_flow_table_inet.c:(.init.text+0x8): undefined reference to `nft_register_flowtable_type' net/ipv4/netfilter/nf_flow_table_ipv4.o: In function `nf_flow_ipv4_module_exit': nf_flow_table_ipv4.c:(.exit.text+0x8): undefined reference to `nft_unregister_flowtable_type' net/ipv4/netfilter/nf_flow_table_ipv4.o: In function `nf_flow_ipv4_module_init': nf_flow_table_ipv4.c:(.init.text+0x8): undefined reference to `nft_register_flowtable_type' This adds additional Kconfig dependencies to ensure that NF_CONNTRACK and NF_TABLES are always visible from NF_FLOW_TABLE, and that the internal dependencies between the four new modules are met. Fixes: `7c23b629a8` ("netfilter: flow table support for the mixed IPv4/IPv6 family") Fixes: `0995210753` ("netfilter: flow table support for IPv6") Fixes: `97add9f0d6` ("netfilter: flow table support for IPv4") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 18:18:18 +01:00
David S. Miller	661e4e33a9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2018-01-09 The following pull-request contains BPF updates for your net tree. The main changes are: 1) Prevent out-of-bounds speculation in BPF maps by masking the index after bounds checks in order to fix spectre v1, and add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for removing the BPF interpreter from the kernel in favor of JIT-only mode to make spectre v2 harder, from Alexei. 2) Remove false sharing of map refcount with max_entries which was used in spectre v1, from Daniel. 3) Add a missing NULL psock check in sockmap in order to fix a race, from John. 4) Fix test_align BPF selftest case since a recent change in verifier rejects the bit-wise arithmetic on pointers earlier but test_align update was missing, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-10 11:17:21 -05:00
Ahmed Abdelsalam	202a8ff545	netfilter: add IPv6 segment routing header 'srh' match It allows matching packets based on Segment Routing Header (SRH) information. The implementation considers revision 7 of the SRH draft. https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07 Currently supported match options include: (1) Next Header (2) Hdr Ext Len (3) Segments Left (4) Last Entry (5) Tag value of SRH Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 16:28:44 +01:00
Pablo Neira Ayuso	cbef426ce7	netfilter: core: return EBUSY in case NAT hook is already in use EEXIST is used for an object that already exists, with the same name/handle. However, there no same object there, instead there is a object that is using the single slot that is available for NAT hooks since patch `f92b40a8b2` ("netfilter: core: only allow one nat hook per hook point"). Let's change this return value before this behaviour gets exposed in the first -rc. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:16 +01:00
Wei Yongjun	99eadf67c8	netfilter: remove duplicated include Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:15 +01:00
Wei Yongjun	0ded1785f3	netfilter: core: make local function __nf_unregister_net_hook static Fixes the following sparse warning: net/netfilter/core.c:380:6: warning: symbol '__nf_unregister_net_hook' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:14 +01:00
Wei Yongjun	03a0120f75	netfilter: nf_tables: fix a typo in nf_tables_getflowtable() Fix a typo, we should check 'flowtable' instead of 'table'. Fixes: `3b49e2e94e` ("netfilter: nf_tables: add flow table netlink frontend") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:13 +01:00
Florian Westphal	20651cefd2	netfilter: x_tables: unbreak module auto loading a typo causes module auto load support to never be compiled in. Fixes: `03d13b6868` ("netfilter: xtables: add and use xt_request_find_table_lock") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:12 +01:00
Pablo Neira Ayuso	98319cb908	netfilter: nf_tables: get rid of struct nft_af_info abstraction Remove the infrastructure to register/unregister nft_af_info structure, this structure stores no useful information anymore. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:11 +01:00
Pablo Neira Ayuso	dd4cbef723	netfilter: nf_tables: get rid of pernet families Now that we have a single table list for each netns, we can get rid of one pointer per family and the global afinfo list, thus, shrinking struct netns for nftables that now becomes 64 bytes smaller. And call __nft_release_afinfo() from __net_exit path accordingly to release netnamespace objects on removal. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:10 +01:00
Pablo Neira Ayuso	36596dadf5	netfilter: nf_tables: add single table list for all families Place all existing user defined tables in struct net *, instead of having one list per family. This saves us from one level of indentation in netlink dump functions. Place pointer to struct nft_af_info in struct nft_table temporarily, as we still need this to put back reference module reference counter on table removal. This patch comes in preparation for the removal of struct nft_af_info. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:08 +01:00
Pablo Neira Ayuso	1ea26cca52	netfilter: nf_tables: remove struct nft_af_info parameter in nf_tables_chain_type_lookup() Pass family number instead, this comes in preparation for the removal of struct nft_af_info. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:07 +01:00
Pablo Neira Ayuso	c9c17211ec	netfilter: nf_tables: no need for struct nft_af_info to enable/disable table nf_tables_table_enable() and nf_tables_table_disable() take a pointer to struct nft_af_info that is never used, remove it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:06 +01:00
Pablo Neira Ayuso	e7bb5c7140	netfilter: nf_tables: remove flag field from struct nft_af_info Replace it by a direct check for the netdev protocol family. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:05 +01:00
Pablo Neira Ayuso	fe19c04ca1	netfilter: nf_tables: remove nhooks field from struct nft_af_info We already validate the hook through bitmask, so this check is superfluous. When removing this, this patch is also fixing a bug in the new flowtable codebase, since ctx->afi points to the table family instead of the netdev family which is where the flowtable is really hooked in. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-10 15:32:04 +01:00
Steffen Klassert	76a4201191	xfrm: Fix a race in the xdst pcpu cache. We need to run xfrm_resolve_and_create_bundle() with bottom halves off. Otherwise we may reuse an already released dst_enty when the xfrm lookup functions are called from process context. Fixes: c30d78c14a813db39a647b6a348b428 ("xfrm: add xdst pcpu cache") Reported-by: Darius Ski <darius.ski@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-01-10 12:14:28 +01:00
Jakub Kicinski	82aaff2f63	net: free RX queue structures Looks like commit `e817f85652` ("xdp: generic XDP handling of xdp_rxq_info") replaced kvfree(dev->_rx) in free_netdev() with a call to netif_free_rx_queues() which doesn't actually free the rings? While at it remove the unnecessary temporary variable. Fixes: `e817f85652` ("xdp: generic XDP handling of xdp_rxq_info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-01-10 12:06:17 +01:00
Jakub Kicinski	141b52a98a	net: use the right variant of kfree kvzalloc'ed memory should be kvfree'd. Fixes: `e817f85652` ("xdp: generic XDP handling of xdp_rxq_info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-01-10 12:06:17 +01:00
Steffen Klassert	1e532d2b49	af_key: Fix memory leak in key_notify_policy. We leak the allocated out_skb in case pfkey_xfrm_policy2msg() fails. Fix this by freeing it on error. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-01-10 09:45:11 +01:00
Alexei Starovoitov	290af86629	bpf: introduce BPF_JIT_ALWAYS_ON config The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715. A quote from goolge project zero blog: "At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets." To make attacker job harder introduce BPF_JIT_ALWAYS_ON config option that removes interpreter from the kernel in favor of JIT-only mode. So far eBPF JIT is supported by: x64, arm64, arm32, sparc64, s390, powerpc64, mips64 The start of JITed program is randomized and code page is marked as read-only. In addition "constant blinding" can be turned on with net.core.bpf_jit_harden v2->v3: - move __bpf_prog_ret0 under ifdef (Daniel) v1->v2: - fix init order, test_bpf and cBPF (Daniel's feedback) - fix offloaded bpf (Jakub's feedback) - add 'return 0' dummy in case something can invoke prog->bpf_func - retarget bpf tree. For bpf-next the patch would need one extra hunk. It will be sent when the trees are merged back to net-next Considered doing: int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT; but it seems better to land the patch as-is and in bpf-next remove bpf_jit_enable global variable from all JITs, consolidate in one place and remove this jit_init() function. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-01-09 22:25:26 +01:00
Jon Maloy	eb929a91b2	tipc: improve poll() for group member socket The current criteria for returning POLLOUT from a group member socket is too simplistic. It basically returns POLLOUT as soon as the group has external destinations, something obviously leading to a lot of spinning during destination congestion situations. At the same time, the internal congestion handling is unnecessarily complex. We now change this as follows. - We introduce an 'open' flag in struct tipc_group. This flag is used only to help poll() get the setting of POLLOUT right, and not for congeston handling as such. This means that a user can choose to ignore an EAGAIN for a destination and go on sending messages to other destinations in the group if he wants to. - The flag is set to false every time we return EAGAIN on a send call. - The flag is set to true every time any member, i.e., not necessarily the member that caused EAGAIN, is removed from the small_win list. - We remove the group member 'usr_pending' flag. The size of the send window and presence in the 'small_win' list is sufficient criteria for recognizing congestion. This solution seems to be a reasonable compromise between 'anycast', which is normally not waiting for POLLOUT for a specific destination, and the other three send modes, which are. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:58 -05:00
Jon Maloy	232d07b74a	tipc: improve groupcast scope handling When a member joins a group, it also indicates a binding scope. This makes it possible to create both node local groups, invisible to other nodes, as well as cluster global groups, visible everywhere. In order to avoid that different members end up having permanently differing views of group size and memberhip, we must inhibit locally and globally bound members from joining the same group. We do this by using the binding scope as an additional separator between groups. I.e., a member must ignore all membership events from sockets using a different scope than itself, and all lookups for message destinations must require an exact match between the message's lookup scope and the potential target's binding scope. Apart from making it possible to create local groups using the same identity on different nodes, a side effect of this is that it now also becomes possible to create a cluster global group with the same identity across the same nodes, without interfering with the local groups. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:58 -05:00
Jon Maloy	8348500f80	tipc: add option to suppress PUBLISH events for pre-existing publications Currently, when a user is subscribing for binding table publications, he will receive a PUBLISH event for all already existing matching items in the binding table. However, a group socket making a subscriptions doesn't need this initial status update from the binding table, because it has already scanned it during the join operation. Worse, the multiplicatory effect of issuing mutual events for dozens or hundreds group members within a short time frame put a heavy load on the topology server, with the end result that scale out operations on a big group tend to take much longer than needed. We now add a new filter option, TIPC_SUB_NO_STATUS, for topology server subscriptions, so that this initial avalanche of events is suppressed. This change, along with the previous commit, significantly improves the range and speed of group scale out operations. We keep the new option internal for the tipc driver, at least for now. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:58 -05:00
Jon Maloy	d12d2e12ce	tipc: send out join messages as soon as new member is discovered When a socket is joining a group, we look up in the binding table to find if there are already other members of the group present. This is used for being able to return EAGAIN instead of EHOSTUNREACH if the user proceeds directly to a send attempt. However, the information in the binding table can be used to directly set the created member in state MBR_PUBLISHED and send a JOIN message to the peer, instead of waiting for a topology PUBLISH event to do this. When there are many members in a group, the propagation time for such events can be significant, and we can save time during the join operation if we use the initial lookup result fully. In this commit, we eliminate the member state MBR_DISCOVERED which has been the result of the initial lookup, and do instead go directly to MBR_PUBLISHED, which initiates the setup. After this change, the tipc_member FSM looks as follows: +-----------+ ---->\| PUBLISHED \|-----------------------------------------------+ PUB- +-----------+ LEAVE/WITHRAW \| LISH \|JOIN \| \| +-------------------------------------------+ \| \| \| LEAVE/WITHDRAW \| \| \| \| +------------+ \| \| \| \| +----------->\| PENDING \|---------+ \| \| \| \| \|msg/maxactv +-+---+------+ LEAVE/ \| \| \| \| \| \| \| \| WITHDRAW \| \| \| \| \| \| +----------+ \| \| \| \| \| \| \| \|revert/maxactv\| \| \| \| \| \| \| V V V V V \| +----------+ msg +------------+ +-----------+ +-->\| JOINED \|------>\| ACTIVE \|------>\| LEAVING \|---> \| +----------+ +--- -+------+ LEAVE/+-----------+DOWN \| A A \| WITHDRAW A A A EVT \| \| \| \|RECLAIM \| \| \| \| \| \|REMIT V \| \| \| \| \| \|== adv +------------+ \| \| \| \| \| +---------\| RECLAIMING \|--------+ \| \| \| \| +-----+------+ LEAVE/ \| \| \| \| \|REMIT WITHDRAW \| \| \| \| \|< adv \| \| \| \|msg/ V LEAVE/ \| \| \| \|adv==ADV_IDLE+------------+ WITHDRAW \| \| \| +-------------\| REMITTED \|------------+ \| \| +------------+ \| \|PUBLISH \| JOIN +-----------+ LEAVE/WITHDRAW \| ---->\| JOINING \|-----------------------------------------------+ +-----------+ Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:58 -05:00
Jon Maloy	c2b22bcf2e	tipc: simplify group LEAVE sequence After the changes in the previous commit the group LEAVE sequence can be simplified. We now let the arrival of a LEAVE message unconditionally issue a group DOWN event to the user. When a topology WITHDRAW event is received, the member, if it still there, is set to state LEAVING, but we only issue a group DOWN event when the link to the peer node is gone, so that no LEAVE message is to be expected. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:57 -05:00
Jon Maloy	7ad32bcb78	tipc: create group member event messages when they are needed In the current implementation, a group socket receiving topology events about other members just converts the topology event message into a group event message and stores it until it reaches the right state to issue it to the user. This complicates the code unnecessarily, and becomes impractical when we in the coming commits will need to create and issue membership events independently. In this commit, we change this so that we just notice the type and origin of the incoming topology event, and then drop the buffer. Only when it is time to actually send a group event to the user do we explicitly create a new message and send it upwards. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:57 -05:00
Jon Maloy	0233493a5f	tipc: adjustment to group member FSM Analysis reveals that the member state MBR_QURANTINED in reality is unnecessary, and can be replaced by the state MBR_JOINING at all occurrencs. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:57 -05:00
Jon Maloy	4ea5dab541	tipc: let group member stay in JOINED mode if unable to reclaim We handle a corner case in the function tipc_group_update_rcv_win(). During extreme pessure it might happen that a message receiver has all its active senders in RECLAIMING or REMITTED mode, meaning that there is nobody to reclaim advertisements from if an additional sender tries to go active. Currently we just set the new sender to ACTIVE anyway, hence at least theoretically opening up for a receiver queue overflow by exceeding the MAX_ACTIVE limit. The correct solution to this is to instead add the member to the pending queue, while letting the oldest member in that queue revert to JOINED state. In this commit we refactor the code for handling message arrival from a JOINED member, both to make it more comprehensible and to cover the case described above. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:57 -05:00
Jon Maloy	8d5dee21f6	tipc: a couple of cleanups - We remove the 'reclaiming' member list in struct tipc_group, since it doesn't serve any purpose. - We simplify the GRP_REMIT_MSG branch of tipc_group_protocol_rcv(). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:35:57 -05:00
Wei Wang	4512c43eac	ipv6: remove null_entry before adding default route In the current code, when creating a new fib6 table, tb6_root.leaf gets initialized to net->ipv6.ip6_null_entry. If a default route is being added with rt->rt6i_metric = 0xffffffff, fib6_add() will add this route after net->ipv6.ip6_null_entry. As null_entry is shared, it could cause problem. In order to fix it, set fn->leaf to NULL before calling fib6_add_rt2node() when trying to add the first default route. And reset fn->leaf to null_entry when adding fails or when deleting the last default route. syzkaller reported the following issue which is fixed by this commit: WARNING: suspicious RCU usage 4.15.0-rc5+ #171 Not tainted ----------------------------- net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by swapper/0/0: #0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline] #0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310 #1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline] #1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007 #2: (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560 #3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline] #3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline] fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320 expire_timers kernel/time/timer.c:1357 [inline] __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:540 [inline] smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904 </IRQ> Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: `66f5d6ce53` ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 12:33:55 -05:00
Nicolai Stange	20b50d7997	net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg() Commit `8f659a03a0` ("net: ipv4: fix for a race condition in raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling due to concurrent updates by reading this bit-field member into a local variable and using the thus stabilized value in subsequent tests. However, aforementioned commit also adds the (correct) comment that /* hdrincl should be READ_ONCE(inet->hdrincl) * but READ_ONCE() doesn't work with bit fields */ because as it stands, the compiler is free to shortcut or even eliminate the local variable at its will. Note that I have not seen anything like this happening in reality and thus, the concern is a theoretical one. However, in order to be on the safe side, emulate a READ_ONCE() on the bit-field by doing it on the local 'hdrincl' variable itself: int hdrincl = inet->hdrincl; hdrincl = READ_ONCE(hdrincl); This breaks the chain in the sense that the compiler is not allowed to replace subsequent reads from hdrincl with reloads from inet->hdrincl. Fixes: `8f659a03a0` ("net: ipv4: fix for a race condition in raw_sendmsg") Signed-off-by: Nicolai Stange <nstange@suse.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 11:59:16 -05:00
Eugenia Emantayev	37e2d99b59	ethtool: Ensure new ring parameters are within bounds during SRINGPARAM Add a sanity check to ensure that all requested ring parameters are within bounds, which should reduce errors in driver implementation. Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 11:54:49 -05:00
Xiongfeng Wang	3dc2fa4754	net: caif: use strlcpy() instead of strncpy() gcc-8 reports net/caif/caif_dev.c: In function 'caif_enroll_dev': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may be truncated copying 15 bytes from a string of length 15 [-Wstringop-truncation] net/caif/cfctrl.c: In function 'cfctrl_linkup_request': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may be truncated copying 15 bytes from a string of length 15 [-Wstringop-truncation] net/caif/cfcnfg.c: In function 'caif_connect_client': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may be truncated copying 15 bytes from a string of length 15 [-Wstringop-truncation] The compiler require that the input param 'len' of strncpy() should be greater than the length of the src string, so that '\0' is copied as well. We can just use strlcpy() to avoid this warning. Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 11:52:18 -05:00
Colin Ian King	709af180ee	ipv6: use ARRAY_SIZE for array sizing calculation on array seg6_action_table Use the ARRAY_SIZE macro on array seg6_action_table to determine size of the array. Improvement suggested by coccinelle. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 11:40:46 -05:00
Andrii Vladyka	b8fd0823e0	net: core: fix module type in sock_diag_bind Use AF_INET6 instead of AF_INET in IPv6-related code path Signed-off-by: Andrii Vladyka <tulup@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-09 11:28:58 -05:00
David S. Miller	a0ce093180	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-01-09 10:37:00 -05:00
Steffen Klassert	374d1b5a81	esp: Fix GRO when the headers not fully in the linear part of the skb. The GRO layer does not necessarily pull the complete headers into the linear part of the skb, a part may remain on the first page fragment. This can lead to a crash if we try to pull the headers, so make sure we have them on the linear part before pulling. Fixes: `7785bba299` ("esp: Add a software GRO codepath") Reported-by: syzbot+82bbd65569c49c6c0c4d@syzkaller.appspotmail.com Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-01-09 13:01:58 +01:00
Eugenia Emantayev	c5a9f6f0ab	net/core: Add drop counters to VF statistics Modern hardware can decide to drop packets going to/from a VF. Add receive and transmit drop counters to be displayed at hypervisor layer in iproute2 per VF statistics. Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>	2018-01-09 07:40:48 +02:00
Linus Torvalds	ef7f8cec80	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Frag and UDP handling fixes in i40e driver, from Amritha Nambiar and Alexander Duyck. 2) Undo unintentional UAPI change in netfilter conntrack, from Florian Westphal. 3) Revert a change to how error codes are returned from dev_get_valid_name(), it broke some apps. 4) Cannot cache routes for ipv6 tunnels in the tunnel is ipv4/ipv6 dual-stack. From Eli Cooper. 5) Fix missed PMTU updates in geneve, from Xin Long. 6) Cure double free in macvlan, from Gao Feng. 7) Fix heap out-of-bounds write in rds_message_alloc_sgs(), from Mohamed Ghannam. 8) FEC bug fixes from FUgang Duan (mis-accounting of dev_id, missed deferral of probe when the regulator is not ready yet). 9) Missing DMA mapping error checks in 3c59x, from Neil Horman. 10) Turn off Broadcom tags for some b53 switches, from Florian Fainelli. 11) Fix OOPS when get_target_net() is passed an SKB whose NETLINK_CB() isn't initialized. From Andrei Vagin. 12) Fix crashes in fib6_add(), from Wei Wang. 13) PMTU bug fixes in SCTP from Marcelo Ricardo Leitner. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits) sh_eth: fix TXALCR1 offsets mdio-sun4i: Fix a memory leak phylink: mark expected switch fall-throughs in phylink_mii_ioctl sctp: fix the handling of ICMP Frag Needed for too small MTUs sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled xen-netfront: enable device after manual module load bnxt_en: Fix the 'Invalid VF' id check in bnxt_vf_ndo_prep routine. bnxt_en: Fix population of flow_type in bnxt_hwrm_cfa_flow_alloc() sh_eth: fix SH7757 GEther initialization net: fec: free/restore resource in related probe error pathes uapi/if_ether.h: prevent redefinition of struct ethhdr ipv6: fix general protection fault in fib6_add() RDS: null pointer dereference in rds_atomic_free_op sh_eth: fix TSU resource handling net: stmmac: enable EEE in MII, GMII or RGMII only rtnetlink: give a user socket to get_target_net() MAINTAINERS: Update my email address. can: ems_usb: improve error reporting for error warning and error passive can: flex_can: Correct the checking for frame length in flexcan_start_xmit() can: gs_usb: fix return value of the "set_bittiming" callback ...	2018-01-08 20:21:39 -08:00
Yang Shi	f4803f1b73	net: tipc: remove unused hardirq.h Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by TIPC at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 20:59:25 -05:00
Yang Shi	419091f1cc	net: ovs: remove unused hardirq.h Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by openvswitch at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: dev@openvswitch.org Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 20:59:25 -05:00
Yang Shi	373372b31b	net: caif: remove unused hardirq.h Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by caif at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 20:59:25 -05:00
David S. Miller	9f0e896f35	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree: 1) Free hooks via call_rcu to speed up netns release path, from Florian Westphal. 2) Reduce memory footprint of hook arrays, skip allocation if family is not present - useful in case decnet support is not compiled built-in. Patches from Florian Westphal. 3) Remove defensive check for malformed IPv4 - including ihl field - and IPv6 headers in x_tables and nf_tables. 4) Add generic flow table offload infrastructure for nf_tables, this includes the netlink control plane and support for IPv4, IPv6 and mixed IPv4/IPv6 dataplanes. This comes with NAT support too. This patchset adds the IPS_OFFLOAD conntrack status bit to indicate that this flow has been offloaded. 5) Add secpath matching support for nf_tables, from Florian. 6) Save some code bytes in the fast path for the nf_tables netdev, bridge and inet families. 7) Allow one single NAT hook per point and do not allow to register NAT hooks in nf_tables before the conntrack hook, patches from Florian. 8) Seven patches to remove the struct nf_af_info abstraction, instead we perform direct calls for IPv4 which is faster. IPv6 indirections are still needed to avoid dependencies with the 'ipv6' module, but these now reside in struct nf_ipv6_ops. 9) Seven patches to handle NFPROTO_INET from the Netfilter core, hence we can remove specific code in nf_tables to handle this pseudofamily. 10) No need for synchronize_net() call for nf_queue after conversion to hook arrays. Also from Florian. 11) Call cond_resched_rcu() when dumping large sets in ipset to avoid softlockup. Again from Florian. 12) Pass lockdep_nfnl_is_held() to rcu_dereference_protected(), patch from Florian Westphal. 13) Fix matching of counters in ipset, from Jozsef Kadlecsik. 14) Missing nfnl lock protection in the ip_set_net_exit path, also from Jozsef. 15) Move connlimit code that we can reuse from nf_tables into nf_conncount, from Florian Westhal. And asorted cleanups: 16) Get rid of nft_dereference(), it only has one single caller. 17) Add nft_set_is_anonymous() helper function. 18) Remove NF_ARP_FORWARD leftover chain definition in nf_tables_arp. 19) Remove unnecessary comments in nf_conntrack_h323_asn1.c From Varsha Rao. 20) Remove useless parameters in frag_safe_skb_hp(), from Gao Feng. 21) Constify layer 4 conntrack protocol definitions, function parameters to register/unregister these protocol trackers, and timeouts. Patches from Florian Westphal. 22) Remove nlattr_size indirection, from Florian Westphal. 23) Add fall-through comments as -Wimplicit-fallthrough needs this, from Gustavo A. R. Silva. 24) Use swap() macro to exchange values in ipset, patch from Gustavo A. R. Silva. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 20:40:42 -05:00
Guillaume Nault	23fe846f9a	l2tp: adjust comments about L2TPv3 offsets The "offset" option has been removed by commit `900631ee6a` ("l2tp: remove configurable payload offset"). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Acked-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 14:22:42 -05:00
Marcelo Ricardo Leitner	b6c5734db0	sctp: fix the handling of ICMP Frag Needed for too small MTUs syzbot reported a hang involving SCTP, on which it kept flooding dmesg with the message: [ 246.742374] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too low, using default minimum of 512 That happened because whenever SCTP hits an ICMP Frag Needed, it tries to adjust to the new MTU and triggers an immediate retransmission. But it didn't consider the fact that MTUs smaller than the SCTP minimum MTU allowed (512) would not cause the PMTU to change, and issued the retransmission anyway (thus leading to another ICMP Frag Needed, and so on). As IPv4 (ip_rt_min_pmtu=556) and IPv6 (IPV6_MIN_MTU=1280) minimum MTU are higher than that, sctp_transport_update_pmtu() is changed to re-fetch the PMTU that got set after our request, and with that, detect if there was an actual change or not. The fix, thus, skips the immediate retransmission if the received ICMP resulted in no change, in the hope that SCTP will select another path. Note: The value being used for the minimum MTU (512, SCTP_DEFAULT_MINSEGMENT) is not right and instead it should be (576, SCTP_MIN_PMTU), but such change belongs to another patch. Changes from v1: - do not disable PMTU discovery, in the light of commit `06ad391919` ("[SCTP] Don't disable PMTU discovery when mtu is small") and as suggested by Xin Long. - changed the way to break the rtx loop by detecting if the icmp resulted in a change or not Changes from v2: none See-also: https://lkml.org/lkml/2017/12/22/811 Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 14:19:13 -05:00
Marcelo Ricardo Leitner	cc35c3d1ed	sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled Currently, if PMTU discovery is disabled on a given transport, but the configured value is higher than the actual PMTU, it is likely that we will get some icmp Frag Needed. The issue is, if PMTU discovery is disabled, we won't update the information and will issue a retransmission immediately, which may very well trigger another ICMP, and another retransmission, leading to a loop. The fix is to simply not trigger immediate retransmissions if PMTU discovery is disabled on the given transport. Changes from v2: - updated stale comment, noticed by Xin Long Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 14:19:13 -05:00
Stefano Brivio	c8c9aeb519	tcp: Split BUG_ON() in tcp_tso_should_defer() into two assertions The two conditions triggering BUG_ON() are somewhat unrelated: the tcp_skb_pcount() check is meant to catch TSO flaws, the second one checks sanity of congestion window bookkeeping. Split them into two separate BUG_ON() assertions on two lines, so that we know which one actually triggers, when they do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 14:12:26 -05:00
David Ahern	54dc3e3324	net: ipv6: Allow connect to linklocal address from socket bound to vrf Allow a process bound to a VRF to connect to a linklocal address. Currently, this fails because of a mismatch between the scope of the linklocal address and the sk_bound_dev_if inherited by the VRF binding: $ ssh -6 fe80::70b8:cff:fedd:ead8%eth1 ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument Relax the scope check to allow the socket to be bound to the same L3 device as the scope id. This makes ipv6 linklocal consistent with other relaxed checks enabled by commits `1ff23beebd` ("net: l3mdev: Allow send on enslaved interface") and `7bb387c5ab` ("net: Allow IP_MULTICAST_IF to set index to L3 slave"). Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-08 14:11:18 -05:00
Jozsef Kadlecsik	f998b6b101	netfilter: ipset: Missing nfnl_lock()/nfnl_unlock() is added to ip_set_net_exit() Patch "netfilter: ipset: use nfnl_mutex_is_locked" is added the real mutex locking check, which revealed the missing locking in ip_set_net_exit(). Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Reported-by: syzbot+36b06f219f2439fe62e1@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:12 +01:00
Jozsef Kadlecsik	4750005a85	netfilter: ipset: Fix "don't update counters" mode when counters used at the matching The matching of the counters was not taken into account, fixed. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:12 +01:00
Gustavo A. R. Silva	c045337751	netfilter: ipset: use swap macro instead of _manually_ swapping values Make use of the swap macro and remove unnecessary variables tmp. This makes the code easier to read and maintain. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:11 +01:00
Pablo Neira Ayuso	a3c90f7a23	netfilter: nf_tables: flow offload expression Add new instruction for the nf_tables VM that allows us to specify what flows are offloaded into a given flow table via name. This new instruction creates the flow entry and adds it to the flow table. Only established flows, ie. we have seen traffic in both directions, are added to the flow table. You can still decide to offload entries at a later stage via packet counting or checking the ct status in case you want to offload assured conntracks. This new extension depends on the conntrack subsystem. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:10 +01:00
Pablo Neira Ayuso	7c23b629a8	netfilter: flow table support for the mixed IPv4/IPv6 family This patch adds the IPv6 flow table type, that implements the datapath flow table to forward IPv6 traffic. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:09 +01:00
Pablo Neira Ayuso	0995210753	netfilter: flow table support for IPv6 This patch adds the IPv6 flow table type, that implements the datapath flow table to forward IPv6 traffic. This patch exports ip6_dst_mtu_forward() that is required to check for mtu to pass up packets that need PMTUD handling to the classic forwarding path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:08 +01:00
Pablo Neira Ayuso	97add9f0d6	netfilter: flow table support for IPv4 This patch adds the IPv4 flow table type, that implements the datapath flow table to forward IPv4 traffic. Rationale is: 1) Look up for the packet in the flow table, from the ingress hook. 2) If there's a hit, decrement ttl and pass it on to the neighbour layer for transmission. 3) If there's a miss, packet is passed up to the classic forwarding path. This patch also supports layer 3 source and destination NAT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:08 +01:00
Pablo Neira Ayuso	ac2a66665e	netfilter: add generic flow table infrastructure This patch defines the API to interact with flow tables, this allows to add, delete and lookup for entries in the flow table. This also adds the generic garbage code that removes entries that have expired, ie. no traffic has been seen for a while. Users of the flow table infrastructure can delete entries via flow_offload_dead(), which sets the dying bit, this signals the garbage collector to release an entry from user context. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:07 +01:00
Pablo Neira Ayuso	3b49e2e94e	netfilter: nf_tables: add flow table netlink frontend This patch introduces a netlink control plane to create, delete and dump flow tables. Flow tables are identified by name, this name is used from rules to refer to an specific flow table. Flow tables use the rhashtable class and a generic garbage collector to remove expired entries. This also adds the infrastructure to add different flow table types, so we can add one for each layer 3 protocol family. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:06 +01:00
Pablo Neira Ayuso	90964016e5	netfilter: nf_conntrack: add IPS_OFFLOAD status bit This new bit tells us that the conntrack entry is owned by the flow table offload infrastructure. # cat /proc/net/nf_conntrack ipv4 2 tcp 6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2 Note the [OFFLOAD] tag in the listing. The timer of such conntrack entries look like stopped from userspace. In practise, to make sure the conntrack entry does not go away, the conntrack timer is periodically set to an arbitrary large value that gets refreshed on every iteration from the garbage collector, so it never expires- and they display no internal state in the case of TCP flows. This allows us to save a bitcheck from the packet path via nf_ct_is_expired(). Conntrack entries that have been offloaded to the flow table infrastructure cannot be deleted/flushed via ctnetlink. The flow table infrastructure is also responsible for releasing this conntrack entry. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:05 +01:00
Pablo Neira Ayuso	0befd061af	netfilter: nf_tables: remove nft_dereference() This macro is unnecessary, it just hides details for one single caller. nfnl_dereference() is just enough. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:05 +01:00
Pablo Neira Ayuso	a7f87b47e6	netfilter: remove defensive check on malformed packets from raw sockets Users cannot forge malformed IPv4/IPv6 headers via raw sockets that they can inject into the stack. Specifically, not for IPv4 since `55888dfb6b` ("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2)"). IPv6 raw sockets also ensure that packets have a well-formed IPv6 header available in the skbuff. At quick glance, br_netfilter also validates layer 3 headers and it drops malformed both IPv4 and IPv6 packets. Therefore, let's remove this defensive check all over the place. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:04 +01:00
Florian Westphal	f6931f5f5b	netfilter: meta: secpath support replacement for iptables "-m policy --dir in --policy {ipsec,none}". Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:03 +01:00
Pablo Neira Ayuso	b3a61254d8	netfilter: remove struct nf_afinfo and its helper functions This abstraction has no clients anymore, remove it. This is what remains from previous authors, so correct copyright statement after recent modifications and code removal. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:02 +01:00
Pablo Neira Ayuso	464356234f	netfilter: remove route_key_size field in struct nf_afinfo This is only needed by nf_queue, place this code where it belongs. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:11:01 +01:00
Pablo Neira Ayuso	ce388f452f	netfilter: move reroute indirection to struct nf_ipv6_ops We cannot make a direct call to nf_ip6_reroute() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define reroute indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:10:53 +01:00
Pablo Neira Ayuso	3f87c08c61	netfilter: move route indirection to struct nf_ipv6_ops We cannot make a direct call to nf_ip6_route() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define route indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:26 +01:00
Pablo Neira Ayuso	7db9a51e0f	netfilter: remove saveroute indirection in struct nf_afinfo This is only used by nf_queue.c and this function comes with no symbol dependencies with IPv6, it just refers to structure layouts. Therefore, we can replace it by a direct function call from where it belongs. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:25 +01:00
Pablo Neira Ayuso	f7dcbe2f36	netfilter: move checksum_partial indirection to struct nf_ipv6_ops We cannot make a direct call to nf_ip6_checksum_partial() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define checksum_partial indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:24 +01:00
Pablo Neira Ayuso	ef71fe27ec	netfilter: move checksum indirection to struct nf_ipv6_ops We cannot make a direct call to nf_ip6_checksum() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define checksum indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:23 +01:00
Florian Westphal	625c556118	netfilter: connlimit: split xt_connlimit into front and backend This allows to reuse xt_connlimit infrastructure from nf_tables. The upcoming nf_tables frontend can just pass in an nftables register as input key, this allows limiting by any nft-supported key, including concatenations. For xt_connlimit, pass in the zone and the ip/ipv6 address. With help from Yi-Hung Wei. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:22 +01:00
Pablo Neira Ayuso	c2f9eafee9	netfilter: nf_tables: remove hooks from family definition They don't belong to the family definition, move them to the filter chain type definition instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:22 +01:00
Pablo Neira Ayuso	c974a3a364	netfilter: nf_tables: remove multihook chains and families Since NFPROTO_INET is handled from the core, we don't need to maintain extra infrastructure in nf_tables to handle the double hook registration, one for IPv4 and another for IPv6. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:21 +01:00
Pablo Neira Ayuso	12355d3670	netfilter: nf_tables_inet: don't use multihook infrastructure anymore Use new native NFPROTO_INET support in netfilter core, this gets rid of ad-hoc code in the nf_tables API codebase. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:20 +01:00
Pablo Neira Ayuso	cb7ccd835e	netfilter: core: support for NFPROTO_INET hook registration Expand NFPROTO_INET in two hook registrations, one for NFPROTO_IPV4 and another for NFPROTO_IPV6. Hence, we handle NFPROTO_INET from the core. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:19 +01:00
Pablo Neira Ayuso	3025940811	netfilter: core: pass family as parameter to nf_remove_net_hook() So static_key_slow_dec applies to the family behind NFPROTO_INET. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:19 +01:00
Pablo Neira Ayuso	62a0fe46e2	netfilter: core: pass hook number, family and device to nf_find_hook_list() Instead of passing struct nf_hook_ops, this is needed by follow up patches to handle NFPROTO_INET from the core. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:18 +01:00
Pablo Neira Ayuso	3d3cdc38e8	netfilter: core: add nf_remove_net_hook Just a cleanup, __nf_unregister_net_hook() is used by a follow up patch when handling NFPROTO_INET as a real family from the core. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:17 +01:00
Pablo Neira Ayuso	408070d6ee	netfilter: nf_tables: add nft_set_is_anonymous() helper Add helper function to test for the NFT_SET_ANONYMOUS flag. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:16 +01:00
Pablo Neira Ayuso	7a4473a31a	netfilter: nf_tables: explicit nft_set_pktinfo() call from hook path Instead of calling this function from the family specific variant, this reduces the code size in the fast path for the netdev, bridge and inet families. After this change, we must call nft_set_pktinfo() upfront from the chain hook indirection. Before: text data bss dec hex filename 2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o After: text data bss dec hex filename 2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:15 +01:00
Pablo Neira Ayuso	fa45a76021	netfilter: nf_tables_arp: don't set forward chain 46928a0b49f3 ("netfilter: nf_tables: remove multihook chains and families") already removed this, this is a leftover. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:15 +01:00
Florian Westphal	84ba7dd71a	netfilter: nf_tables: reject nat hook registration if prio is before conntrack No problem for iptables as priorities are fixed values defined in the nat modules, but in nftables the priority its coming from userspace. Reject in case we see that such a hook would not work. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:14 +01:00
Florian Westphal	f92b40a8b2	netfilter: core: only allow one nat hook per hook point The netfilter NAT core cannot deal with more than one NAT hook per hook location (prerouting, input ...), because the NAT hooks install a NAT null binding in case the iptables nat table (iptable_nat hooks) or the corresponding nftables chain (nft nat hooks) doesn't specify a nat transformation. Null bindings are needed to detect port collsisions between NAT-ed and non-NAT-ed connections. This causes nftables NAT rules to not work when iptable_nat module is loaded, and vice versa because nat binding has already been attached when the second nat hook is consulted. The netfilter core is not really the correct location to handle this (hooks are just hooks, the core has no notion of what kinds of side effects a hook implements), but its the only place where we can check for conflicts between both iptables hooks and nftables hooks without adding dependencies. So add nat annotation to hook_ops to describe those hooks that will add NAT bindings and then make core reject if such a hook already exists. The annotation fills a padding hole, in case further restrictions appar we might change this to a 'u8 type' instead of bool. iptables error if nft nat hook active: iptables -t nat -A POSTROUTING -j MASQUERADE iptables v1.4.21: can't initialize iptables table `nat': File exists Perhaps iptables or your kernel needs to be upgraded. nftables error if iptables nat table present: nft -f /etc/nftables/ipv4-nat /usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists table nat { ^^ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:13 +01:00
Florian Westphal	03d13b6868	netfilter: xtables: add and use xt_request_find_table_lock currently we always return -ENOENT to userspace if we can't find a particular table, or if the table initialization fails. Followup patch will make nat table init fail in case nftables already registered a nat hook so this change makes xt_find_table_lock return an ERR_PTR to return the errno value reported from the table init function. Add xt_request_find_table_lock as try_then_request_module replacement and use it where needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:12 +01:00
Florian Westphal	2a95183a5e	netfilter: don't allocate space for arp/bridge hooks unless needed no need to define hook points if the family isn't supported. Because we need these hooks for either nftables, arp/ebtables or the 'call-iptables' hack we have in the bridge layer add two new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the users select them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:11 +01:00
Florian Westphal	bb4badf3a3	netfilter: don't allocate space for decnet hooks unless needed no need to define hook points if the family isn't supported. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:10 +01:00
Florian Westphal	ef57170bbf	netfilter: reduce hook array sizes to what is needed Not all families share the same hook count, adjust sizes to what is needed. struct net before: /* size: 6592, cachelines: 103, members: 46 / after: / size: 5952, cachelines: 93, members: 46 */ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:09 +01:00
Florian Westphal	b0f38338ae	netfilter: reduce size of hook entry point locations struct net contains: struct nf_hook_entries __rcu hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; which store the hook entry point locations for the various protocol families and the hooks. Using array results in compact c code when doing accesses, i.e. x = rcu_dereference(net->nf.hooks[pf][hook]); but its also wasting a lot of memory, as most families are not used. So split the array into those families that are used, which are only 5 (instead of 13). In most cases, the 'pf' argument is constant, i.e. gcc removes switch statement. struct net before: / size: 5184, cachelines: 81, members: 46 / after: / size: 4672, cachelines: 73, members: 46 */ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:08 +01:00
Florian Westphal	8c873e2199	netfilter: core: free hooks with call_rcu Giuseppe Scrivano says: "SELinux, if enabled, registers for each new network namespace 6 netfilter hooks." Cost for this is high. With synchronize_net() removed: "The net benefit on an SMP machine with two cores is that creating a new network namespace takes -40% of the original time." This patch replaces synchronize_net+kvfree with call_rcu(). We store rcu_head at the tail of a structure that has no fixed layout, i.e. we cannot use offsetof() to compute the start of the original allocation. Thus store this information right after the rcu head. We could simplify this by just placing the rcu_head at the start of struct nf_hook_entries. However, this structure is used in packet processing hotpath, so only place what is needed for that at the beginning of the struct. Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:07 +01:00
Florian Westphal	26888dfd7e	netfilter: core: remove synchronize_net call if nfqueue is used since commit `960632ece6` ("netfilter: convert hook list to an array") nfqueue no longer stores a pointer to the hook that caused the packet to be queued. Therefore no extra synchronize_net() call is needed after dropping the packets enqueued by the old rule blob. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:06 +01:00
Florian Westphal	4e645b47c4	netfilter: core: make nf_unregister_net_hooks simple wrapper again This reverts commit `d3ad2c17b4` ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls"). Nothing wrong with it. However, followup patch will delay freeing of hooks with call_rcu, so all synchronize_net() calls become obsolete and there is no need anymore for this batching. This revert causes a temporary performance degradation when destroying network namespace, but its resolved with the upcoming call_rcu conversion. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:05 +01:00
Varsha Rao	ca9b01473a	netfilter: nf_conntrack_h323: Remove unwanted comments. Change old multi-line comment style to kernel comment style and remove unwanted comments. Signed-off-by: Varsha Rao <rvarsha016@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:05 +01:00
Florian Westphal	a778a15fa5	netfilter: ipset: add resched points during set listing When sets are extremely large we can get softlockup during ipset -L. We could fix this by adding cond_resched_rcu() at the right location during iteration, but this only works if RCU nesting depth is 1. At this time entire variant->list() is called under under rcu_read_lock_bh. This used to be a read_lock_bh() but as rcu doesn't really lock anything, it does not appear to be needed, so remove it (ipset increments set reference count before this, so a set deletion should not be possible). Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:04 +01:00
Florian Westphal	49971b8853	netfilter: ipset: use nfnl_mutex_is_locked Check that we really hold nfnl mutex here instead of relying on correct usage alone. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:03 +01:00
Gao Feng	6b3d933000	netfilter: ipvs: Remove useless ipvsh param of frag_safe_skb_hp The param of frag_safe_skb_hp, ipvsh, isn't used now. So remove it and update the callers' codes too. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:02 +01:00
Florian Westphal	2c9e8637ea	netfilter: conntrack: timeouts can be const Nowadays this is just the default template that is used when setting up the net namespace, so nothing writes to these locations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:02 +01:00
Gustavo A. R. Silva	e8542dcec0	netfilter: mark expected switch fall-throughs In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:01:01 +01:00
Florian Westphal	9dae47aba0	netfilter: conntrack: l4 protocol trackers can be const previous patches removed all writes to these structs so we can now mark them as const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 18:00:54 +01:00
Florian Westphal	cd9ceafc0a	netfilter: conntrack: constify list of builtin trackers Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 16:47:14 +01:00
Florian Westphal	3921584674	netfilter: conntrack: remove nlattr_size pointer from l4proto trackers similar to previous commit, but instead compute this at compile time and turn nlattr_size into an u16. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-08 16:47:14 +01:00
Florian Westphal	b1bdcb59b6	xfrm: don't call xfrm_policy_cache_flush while holding spinlock xfrm_policy_cache_flush can sleep, so it cannot be called while holding a spinlock. We could release the lock first, but I don't see why we need to invoke this function here in first place, the packet path won't reuse an xdst entry unless its still valid. While at it, add an annotation to xfrm_policy_cache_flush, it would have probably caught this bug sooner. Fixes: `ec30d78c14` ("xfrm: add xdst pcpu cache") Reported-by: syzbot+e149f7d1328c26f9c12f@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-01-08 10:16:40 +01:00
Herbert Xu	bcfd09f783	xfrm: Return error on unknown encap_type in init_state Currently esp will happily create an xfrm state with an unknown encap type for IPv4, without setting the necessary state parameters. This patch fixes it by returning -EINVAL. There is a similar problem in IPv6 where if the mode is unknown we will skip initialisation while returning zero. However, this is harmless as the mode has already been checked further up the stack. This patch removes this anomaly by aligning the IPv6 behaviour with IPv4 and treating unknown modes (which cannot actually happen) as transport mode. Fixes: `38320c70d2` ("[IPSEC]: Use crypto_aead and authenc in ESP") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-01-08 07:17:52 +01:00
Ido Schimmel	1de178edc7	ipv6: Flush multipath routes when all siblings are dead By default, IPv6 deletes nexthops from a multipath route when the nexthop device is put administratively down. This differs from IPv4 where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A multipath route is flushed when all of its nexthops become dead. Align IPv6 with IPv4 and have it conform to the same guidelines. In case the multipath route needs to be flushed, its siblings are flushed one by one. Otherwise, the nexthops are marked with the appropriate flags and the tree walker is instructed to skip all the siblings. As explained in previous patches, care is taken to update the sernum of the affected tree nodes, so as to prevent the use of wrong dst entries. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:41 -05:00
Ido Schimmel	922c2ac82e	ipv6: Take table lock outside of sernum update function The next patch is going to allow dead routes to remain in the FIB tree in certain situations. When this happens we need to be sure to bump the sernum of the nodes where these are stored so that potential copies cached in sockets are invalidated. The function that performs this update assumes the table lock is not taken when it is invoked, but that will not be the case when it is invoked by the tree walker. Have the function assume the lock is taken and make the single caller take the lock itself. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:41 -05:00
Ido Schimmel	4a8e56ee2c	ipv6: Export sernum update function We are going to allow dead routes to stay in the FIB tree (e.g., when they are part of a multipath route, directly connected route with no carrier) and revive them when their nexthop device gains carrier or when it is put administratively up. This is equivalent to the addition of the route to the FIB tree and we should therefore take care of updating the sernum of all the parent nodes of the node where the route is stored. Otherwise, we risk sockets caching and using sub-optimal dst entries. Export the function that performs the above, so that it could be invoked from fib6_ifup() later on. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	b5cb5a755b	ipv6: Teach tree walker to skip multipath routes As explained in previous patch, fib6_ifdown() needs to consider the state of all the sibling routes when a multipath route is traversed. This is done by evaluating all the siblings when the first sibling in a multipath route is traversed. If the multipath route does not need to be flushed (e.g., not all siblings are dead), then we should just skip the multipath route as our work is done. Have the tree walker jump to the last sibling when it is determined that the multipath route needs to be skipped. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	f9d882ea57	ipv6: Report dead flag during route dump Up until now the RTNH_F_DEAD flag was only reported in route dump when the 'ignore_routes_with_linkdown' sysctl was set. This is expected as dead routes were flushed otherwise. The reliance on this sysctl is going to be removed, so we need to report the flag regardless of the sysctl's value. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	8067bb8c1d	ipv6: Ignore dead routes during lookup Currently, dead routes are only present in the routing tables in case the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are flushed. Subsequent patches are going to remove the reliance on this sysctl and make IPv6 more consistent with IPv4. Before this is done, we need to make sure dead routes are skipped during route lookup, so as to not cause packet loss. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	44c9f2f206	ipv6: Check nexthop flags in route dump instead of carrier Similar to previous patch, there is no need to check for the carrier of the nexthop device when dumping the route and we can instead check for the presence of the RTNH_F_LINKDOWN flag. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	14c5206c2d	ipv6: Check nexthop flags during route lookup instead of carrier Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the need to dereference the nexthop device and check its carrier and instead check for the presence of the flag. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	5609b80a37	ipv6: Set nexthop flags during route creation It is valid to install routes with a nexthop device that does not have a carrier, so we need to make sure they're marked accordingly. As explained in the previous patch, host and anycast routes are never marked with the 'linkdown' flag. Note that reject routes are unaffected, as these use the loopback device which always has a carrier. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	27c6fa73f9	ipv6: Set nexthop flags upon carrier change Similar to IPv4, when the carrier of a netdev changes we should toggle the 'linkdown' flag on all the nexthops using it as their nexthop device. This will later allow us to test for the presence of this flag during route lookup and dump. Up until commit `4832c30d54` ("net: ipv6: put host and anycast routes on device with address") host and anycast routes used the loopback netdev as their nexthop device and thus were not marked with the 'linkdown' flag. The patch preserves this behavior and allows one to ping the local address even when the nexthop device does not have a carrier and the 'ignore_routes_with_linkdown' sysctl is set. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	4c981e28d3	ipv6: Prepare to handle multiple netdev events To make IPv6 more in line with IPv4 we need to be able to respond differently to different netdev events. For example, when a netdev is unregistered all the routes using it as their nexthop device should be flushed, whereas when the netdev's carrier changes only the 'linkdown' flag should be toggled. Currently, this is not possible, as the function that traverses the routing tables is not aware of the triggering event. Propagate the triggering event down, so that it could be used in later patches. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:40 -05:00
Ido Schimmel	2127d95aef	ipv6: Clear nexthop flags upon netdev up Previous patch marked nexthops with the 'dead' and 'linkdown' flags. Clear these flags when the netdev comes back up. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:39 -05:00
Ido Schimmel	2b2413610e	ipv6: Mark dead nexthops with appropriate flags When a netdev is put administratively down or unregistered all the nexthops using it as their nexthop device should be marked with the 'dead' and 'linkdown' flags. Currently, when a route is dumped its nexthop device is tested and the flags are set accordingly. A similar check is performed during route lookup. Instead, we can simply mark the nexthops based on netdev events and avoid checking the netdev's state during route dump and lookup. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:39 -05:00
Ido Schimmel	9fcb0714dc	ipv6: Remove redundant route flushing during namespace dismantle By the time fib6_net_exit() is executed all the netdevs in the namespace have been either unregistered or pushed back to the default namespace. That is because pernet subsys operations are always ordered before pernet device operations and therefore invoked after them during namespace dismantle. Thus, all the routing tables in the namespace are empty by the time fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed. This allows us to simplify the condition in fib6_ifdown() as it's only ever called with an actual netdev. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:29:39 -05:00
David S. Miller	7f0b800048	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-01-07 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Add a start of a framework for extending struct xdp_buff without having the overhead of populating every data at runtime. Idea is to have a new per-queue struct xdp_rxq_info that holds read mostly data (currently that is, queue number and a pointer to the corresponding netdev) which is set up during rxqueue config time. When a XDP program is invoked, struct xdp_buff holds a pointer to struct xdp_rxq_info that the BPF program can then walk. The user facing BPF program that uses struct xdp_md for context can use these members directly, and the verifier rewrites context access transparently by walking the xdp_rxq_info and net_device pointers to load the data, from Jesper. 2) Redo the reporting of offload device information to user space such that it works in combination with network namespaces. The latter is reported through a device/inode tuple as similarly done in other subsystems as well (e.g. perf) in order to identify the namespace. For this to work, ns_get_path() has been generalized such that the namespace can be retrieved not only from a specific task (perf case), but also from a callback where we deduce the netns (ns_common) from a netdevice. bpftool support using the new uapi info and extensive test cases for test_offload.py in BPF selftests have been added as well, from Jakub. 3) Add two bpftool improvements: i) properly report the bpftool version such that it corresponds to the version from the kernel source tree. So pick the right linux/version.h from the source tree instead of the installed one. ii) fix bpftool and also bpf_jit_disasm build with bintutils >= 2.9. The reason for the build breakage is that binutils library changed the function signature to select the disassembler. Given this is needed in multiple tools, add a proper feature detection to the tools/build/features infrastructure, from Roman. 4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the stacktrace map. It is currently unimplemented, but there are use cases where user space needs to walk all stacktrace map entries e.g. for dumping or deleting map entries w/o having to close and recreate the map. Add BPF selftests along with it, from Yonghong. 5) Few follow-up cleanups for the bpftool cgroup code: i) rename the cgroup 'list' command into 'show' as we have it for other subcommands as well, ii) then alias the 'show' command such that 'list' is accepted which is also common practice in iproute2, and iii) remove couple of newlines from error messages using p_err(), from Jakub. 6) Two follow-up cleanups to sockmap code: i) remove the unused bpf_compute_data_end_sk_skb() function and ii) only build the sockmap infrastructure when CONFIG_INET is enabled since it's only aware of TCP sockets at this time, from John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-07 21:26:31 -05:00
Linus Torvalds	75d4276e83	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - untangle sys_close() abuses in xt_bpf - deal with register_shrinker() failures in sget() * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'" sget(): handle failures of register_shrinker() mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.	2018-01-06 17:13:21 -08:00
Dmitry Vyukov	889c604fd0	netfilter: x_tables: fix int overflow in xt_alloc_table_info() syzkaller triggered OOM kills by passing ipt_replace.size = -1 to IPT_SO_SET_REPLACE. The root cause is that SMP_ALIGN() in xt_alloc_table_info() causes int overflow and the size check passes when it should not. SMP_ALIGN() is no longer needed leftover. Remove SMP_ALIGN() call in xt_alloc_table_info(). Reported-by: syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-01-07 00:17:23 +01:00
Jesper Dangaard Brouer	02dd3291b2	bpf: finally expose xdp_rxq_info to XDP bpf-programs Now all XDP driver have been updated to setup xdp_rxq_info and assign this to xdp_buff->rxq. Thus, it is now safe to enable access to some of the xdp_rxq_info struct members. This patch extend xdp_md and expose UAPI to userspace for ingress_ifindex and rx_queue_index. Access happens via bpf instruction rewrite, that load data directly from struct xdp_rxq_info. * ingress_ifindex map to xdp_rxq_info->dev->ifindex * rx_queue_index map to xdp_rxq_info->queue_index Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-01-05 15:21:22 -08:00
Jesper Dangaard Brouer	e817f85652	xdp: generic XDP handling of xdp_rxq_info Hook points for xdp_rxq_info: * reg : netif_alloc_rx_queues * unreg: netif_free_rx_queues The net_device have some members (num_rx_queues + real_num_rx_queues) and data-area (dev->_rx with struct netdev_rx_queue's) that were primarily used for exporting information about RPS (CONFIG_RPS) queues to sysfs (CONFIG_SYSFS). For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info, and remove some of the CONFIG_SYSFS ifdefs. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-01-05 15:21:22 -08:00
Jesper Dangaard Brouer	c0124f327e	xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg The driver code qede_free_fp_array() depend on kfree() can be called with a NULL pointer. This stems from the qede_alloc_fp_array() function which either (kz)alloc memory for fp->txq or fp->rxq. This also simplifies error handling code in case of memory allocation failures, but xdp_rxq_info_unreg need to know the difference. Introduce xdp_rxq_info_is_reg() to handle if a memory allocation fails and detect this is the failure path by seeing that xdp_rxq_info was not registred yet, which first happens after successful alloaction in qede_init_fp(). Driver hook points for xdp_rxq_info: * reg : qede_init_fp * unreg: qede_free_fp_array Tested on actual hardware with samples/bpf program. V2: Driver have no proper error path for failed XDP RX-queue info reg, as qede_init_fp() is a void function. Cc: everest-linux-l2@cavium.com Cc: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-01-05 15:21:21 -08:00
Jesper Dangaard Brouer	aecd67b607	xdp: base API for new XDP rx-queue info concept This patch only introduce the core data structures and API functions. All XDP enabled drivers must use the API before this info can used. There is a need for XDP to know more about the RX-queue a given XDP frames have arrived on. For both the XDP bpf-prog and kernel side. Instead of extending xdp_buff each time new info is needed, the patch creates a separate read-mostly struct xdp_rxq_info, that contains this info. We stress this data/cache-line is for read-only info. This is NOT for dynamic per packet info, use the data_meta for such use-cases. The performance advantage is this info can be setup at RX-ring init time, instead of updating N-members in xdp_buff. A possible (driver level) micro optimization is that xdp_buff->rxq assignment could be done once per XDP/NAPI loop. The extra pointer deref only happens for program needing access to this info (thus, no slowdown to existing use-cases). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-01-05 15:21:20 -08:00
Sowmini Varadhan	3db6e0d172	rds: use RCU to synchronize work-enqueue with connection teardown rds_sendmsg() can enqueue work on cp_send_w from process context, but it should not enqueue this work if connection teardown has commenced (else we risk enquing work after rds_conn_path_destroy() has assumed that all work has been cancelled/flushed). Similarly some other functions like rds_cong_queue_updates and rds_tcp_data_ready are called in softirq context, and may end up enqueuing work on rds_wq after rds_conn_path_destroy() has assumed that all workqs are quiesced. Check the RDS_DESTROY_PENDING bit and use rcu synchronization to avoid all these races. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 13:39:18 -05:00
Sowmini Varadhan	c90ecbfaf5	rds: Use atomic flag to track connections being destroyed Replace c_destroy_in_prog by using a bit in cp_flags that can set/tested atomically. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 13:39:18 -05:00
Jon Maloy	d84d1b3b6b	tipc: simplify small window members' sorting algorithm We simplify the sorting algorithm in tipc_update_member(). We also make the remaining conditional call to this function unconditional, since the same condition now is tested for inside the said function. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 13:37:03 -05:00
Jon Maloy	38266ca17c	tipc: some clarifying name changes We rename some functions and variables, to make their purpose clearer. - tipc_group::congested -> tipc_group::small_win. Members in this list are not necessarily (and typically) congested. Instead, they may potentially be subject to congestion because their send window is less than ADV_IDLE, and therefore need to be checked during message transmission. - tipc_group_is_receiver() -> tipc_group_is_sender(). This socket will accept messages coming from members fulfilling this condition, i.e., they are senders from this member's viewpoint. - tipc_group_is_enabled() -> tipc_group_is_receiver(). Members fulfilling this condition will accept messages sent from the current socket, i.e., they are receivers from its viewpoint. There are no functional changes in this commit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 13:37:03 -05:00
Al Viro	040ee69226	fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'" Descriptor table is a shared object; it's not a place where you can stick temporary references to files, especially when we don't need an opened file at all. Cc: stable@vger.kernel.org # v4.14 Fixes: `98589a0998` ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-01-05 11:43:39 -05:00
Florian Fainelli	bf08c34086	net: dsa: Move padding into Broadcom tagger Instead of having the different master network device drivers potentially used by DSA/Broadcom tags, move the padding necessary for the switches to accept short packets where it makes most sense: within tag_brcm.c. This avoids multiplying the number of similar commits to e.g: bgmac, bcmsysport, etc. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 11:21:31 -05:00
Soheil Hassas Yeganeh	0a38806f31	net: revert "Update RFS target at poll for tcp/udp" On multi-threaded processes, one common architecture is to have one (or a small number of) threads polling sockets, and a considerably larger pool of threads reading form and writing to the sockets. When we set RPS core on tcp_poll() or udp_poll() we essentially steer all packets of all the polled FDs to one (or small number of) cores, creaing a bottleneck and/or RPS misprediction. Another common architecture is to shard FDs among threads pinned to cores. In such a setting, setting RPS core in tcp_poll() and udp_poll() is redundant because the RFS core is correctly set in recvmsg and sendmsg. Thus, revert the following commit: `c3f1dbaf6e` ("net: Update RFS target at poll for tcp/udp"). Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 11:14:57 -05:00
Soheil Hassas Yeganeh	e3f2c4a3db	ip: do not set RFS core on error queue reads We should only record RPS on normal reads and writes. In single threaded processes, all calls record the same state. In multi-threaded processes where a separate thread processes errors, the RFS table mispredicts. Note that, when CONFIG_RPS is disabled, sock_rps_record_flow is a noop and no branch is added as a result of this patch. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 11:14:56 -05:00
James Chapman	900631ee6a	l2tp: remove configurable payload offset If L2TP_ATTR_OFFSET is set to a non-zero value in L2TPv3 tunnels, it results in L2TPv3 packets being transmitted which might not be compliant with the L2TPv3 RFC. This patch has l2tp ignore the offset setting and send all packets with no offset. In more detail: L2TPv2 supports a variable offset from the L2TPv2 header to the payload. The offset value is indicated by an optional field in the L2TP header. Our L2TP implementation already detects the presence of the optional offset and skips that many bytes when handling data received packets. All transmitted packets are always transmitted with no offset. L2TPv3 has no optional offset field in the L2TPv3 packet header. Instead, L2TPv3 defines optional fields in a "Layer-2 Specific Sublayer". At the time when the original L2TP code was written, there was talk at IETF of offset being implemented in a new Layer-2 Specific Sublayer. A L2TP_ATTR_OFFSET netlink attribute was added so that this offset could be configured and the intention was to allow it to be also used to set the tx offset for L2TPv2. However, no L2TPv3 offset was ever specified and the L2TP_ATTR_OFFSET parameter was forgotten about. Setting L2TP_ATTR_OFFSET results in L2TPv3 packets being transmitted with the specified number of bytes padding between L2TPv3 header and payload. This is not compliant with L2TPv3 RFC3931. This change removes the configurable offset altogether while retaining L2TP_ATTR_OFFSET for backwards compatibility. Any L2TP_ATTR_OFFSET value is ignored. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 11:03:19 -05:00
James Chapman	de3b58bc35	l2tp: revert "l2tp: fix missing print session offset info" Revert commit `820da53575` ("l2tp: fix missing print session offset info"). The peer_offset parameter is removed. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 11:03:13 -05:00
James Chapman	863def15b9	l2tp: revert "l2tp: add peer_offset parameter" Revert commit `f15bc54eee` ("l2tp: add peer_offset parameter"). This is removed because it is adding another configurable offset and configurable offsets are being removed. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 11:03:04 -05:00
David S. Miller	f737be8d61	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix chain filtering when dumping rules via nf_tables_dump_rules(). 2) Fix accidental change in NF_CT_STATE_UNTRACKED_BIT through uapi, introduced when removing the untracked conntrack object, from Florian Westphal. 3) Fix potential nul-dereference when releasing dump filter in nf_tables_dump_obj_done(), patch from Hangbin Liu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-05 10:33:01 -05:00
Marc Kleine-Budde	ff847ee47b	can: af_can: give struct holding the CAN per device receive lists a sensible name This patch adds a "can_" prefix to the "struct dev_rcv_lists" to better reflect the meaning and improbe code readability. The conversion is done with: sed -i \ -e "s/struct dev_rcv_lists/struct can_dev_rcv_lists/g" \ net/can/*.[ch] include/net/netns/can.h Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2018-01-05 11:12:08 +01:00
Marc Kleine-Budde	adb552c319	can: raw: raw_bind(): bail out if can_family is not AF_CAN Until now CAN raw's bind() doesn't check if the can_familiy in the struct sockaddr_can is set to AF_CAN. This patch adds the missing check. Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2018-01-05 11:12:08 +01:00
Herbert Xu	d16b46e4fd	xfrm: Use __skb_queue_tail in xfrm_trans_queue We do not need locking in xfrm_trans_queue because it is designed to use per-CPU buffers. However, the original code incorrectly used skb_queue_tail which takes the lock. This patch switches it to __skb_queue_tail instead. Reported-and-tested-by: Artem Savkov <asavkov@redhat.com> Fixes: `acf568ee85` ("xfrm: Reinject transport-mode packets...") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-01-05 09:33:14 +01:00
David S. Miller	72deacce01	We have things all over the place, no point listing them. One thing is notable: I applied two patches and later reverted them - we'll get back to that once all the driver situation is sorted out. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlpOQXUACgkQB8qZga/f l8Qpuw/+OclRyelfxh1v1xwFYDUAJZZmU9wr/Yx/ezZ8NoebA5bfJSXV/s+Tgw5E oORx7LUkbxwreQtoEHtc9/IE7SCfXrB8kWoy5A/Q094SDglWOiQbRuYQ0gn4pMkV zukm4O3+cHHGj1slnSOzQWNeF/5mbNwEMo5Id5ZnSjMfoPl+CWH8qvfu4oRFhmiG tZ0gIGARX9FL3v+RyqEhugTxfCzAYRTinGQhG4r6LlkgCqTnza7VhG+3N+fPMkjS 4Rs9ucnMnunrbd9lbbpTb+vWAJ+McJfVw/Gtmjp/W8vyxZFEr0EHiY31btmMAhTO ibZVZYCslL3WM2vIxxy0nGR6O28eCRzU4ETSOrInv4ZooplvmFHVnjms1hqiSaZO 4qy8Yb8cPrIPTcI3OYWvicBAAcHLqlEw8GC4rltf2bw6a0FdJ3igitFy9MPFhxBW OZ0YS+exHAb9lBbk49qOM0Bqu7ug5MUTygX9RGTeWB0sRDmc5OQVsAqvfaapGts9 u+huQzO2Y1b8IDVAL/tTOoDz6A1Qc/S2BFDNilFKVeGOhB35jFA3BN4vJzmpp9Oy cz8150ls6BbfHjrFiuHlQWwaoG6GTebSln9XnEqNXfh5GFj1H/FYTQOv4rIaIjrY wdirSv6UopaRjnBSb062glmb9ZFHQEKBWDvRC7jTRaXMRTBQ2zo= =M0gz -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2018-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== We have things all over the place, no point listing them. One thing is notable: I applied two patches and later reverted them - we'll get back to that once all the driver situation is sorted out. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-04 14:33:29 -05:00
Wei Wang	7bbfe00e02	ipv6: fix general protection fault in fib6_add() In fib6_add(), pn could be NULL if fib6_add_1() failed to return a fib6 node. Checking pn != fn before accessing pn->leaf makes sure pn is not NULL. This fixes the following GPF reported by syzkaller: general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244 RSP: 0018:ffff8801c7626a70 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000020 RCX: ffffffff84794465 RDX: 0000000000000004 RSI: ffff8801d38935f0 RDI: 0000000000000282 RBP: ffff8801c7626da0 R08: 1ffff10038ec4c35 R09: 0000000000000000 R10: ffff8801c7626c68 R11: 0000000000000000 R12: 00000000fffffffe R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8801db200000(0063) knlGS:0000000009b70840 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000020be1000 CR3: 00000001d585a006 CR4: 00000000001606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006 ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833 inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957 rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411 netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423 netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline] netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864 sock_sendmsg_nosec net/socket.c:636 [inline] sock_sendmsg+0xca/0x110 net/socket.c:646 sock_write_iter+0x31a/0x5d0 net/socket.c:915 call_write_iter include/linux/fs.h:1772 [inline] do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653 do_iter_write+0x154/0x540 fs/read_write.c:932 compat_writev+0x225/0x420 fs/read_write.c:1246 do_compat_writev+0x115/0x220 fs/read_write.c:1267 C_SYSC_writev fs/read_write.c:1278 [inline] compat_SyS_writev+0x26/0x30 fs/read_write.c:1274 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline] do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389 entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:125 Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: `66f5d6ce53` ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-04 14:29:20 -05:00
Mohamed Ghannam	7d11f77f84	RDS: null pointer dereference in rds_atomic_free_op set rm->atomic.op_active to 0 when rds_pin_pages() fails or the user supplied address is invalid, this prevents a NULL pointer usage in rds_atomic_free_op() Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-04 14:19:26 -05:00
Andrei Vagin	f428fe4a04	rtnetlink: give a user socket to get_target_net() This function is used from two places: rtnl_dump_ifinfo and rtnl_getlink. In rtnl_getlink(), we give a request skb into get_target_net(), but in rtnl_dump_ifinfo, we give a response skb into get_target_net(). The problem here is that NETLINK_CB() isn't initialized for the response skb. In both cases we can get a user socket and give it instead of skb into get_target_net(). This bug was found by syzkaller with this call-trace: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3149 Comm: syzkaller140561 Not tainted 4.15.0-rc4-mm1+ #47 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__netlink_ns_capable+0x8b/0x120 net/netlink/af_netlink.c:868 RSP: 0018:ffff8801c880f348 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8443f900 RDX: 000000000000007b RSI: ffffffff86510f40 RDI: 00000000000003d8 RBP: ffff8801c880f360 R08: 0000000000000000 R09: 1ffff10039101e4f R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff86510f40 R13: 000000000000000c R14: 0000000000000004 R15: 0000000000000011 FS: 0000000001a1a880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020151000 CR3: 00000001c9511005 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: netlink_ns_capable+0x26/0x30 net/netlink/af_netlink.c:886 get_target_net+0x9d/0x120 net/core/rtnetlink.c:1765 rtnl_dump_ifinfo+0x2e5/0xee0 net/core/rtnetlink.c:1806 netlink_dump+0x48c/0xce0 net/netlink/af_netlink.c:2222 __netlink_dump_start+0x4f0/0x6d0 net/netlink/af_netlink.c:2319 netlink_dump_start include/linux/netlink.h:214 [inline] rtnetlink_rcv_msg+0x7f0/0xb10 net/core/rtnetlink.c:4485 netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2441 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4540 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline] netlink_unicast+0x4be/0x6a0 net/netlink/af_netlink.c:1334 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897 Cc: Jiri Benc <jbenc@redhat.com> Fixes: `79e1ad148c` ("rtnetlink: use netnsid to query interface") Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-04 13:42:20 -05:00
Ben Seri	06e7e776ca	Bluetooth: Prevent stack info leak from the EFS element. In the function l2cap_parse_conf_rsp and in the function l2cap_parse_conf_req the following variable is declared without initialization: struct l2cap_conf_efs efs; In addition, when parsing input configuration parameters in both of these functions, the switch case for handling EFS elements may skip the memcpy call that will write to the efs variable: ... case L2CAP_CONF_EFS: if (olen == sizeof(efs)) memcpy(&efs, (void *)val, olen); ... The olen in the above if is attacker controlled, and regardless of that if, in both of these functions the efs variable would eventually be added to the outgoing configuration request that is being built: l2cap_add_conf_opt(&ptr, L2CAP_CONF_EFS, sizeof(efs), (unsigned long) &efs); So by sending a configuration request, or response, that contains an L2CAP_CONF_EFS element, but with an element length that is not sizeof(efs) - the memcpy to the uninitialized efs variable can be avoided, and the uninitialized variable would be returned to the attacker (16 bytes). This issue has been assigned CVE-2017-1000410 Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Ben Seri <ben@armis.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-04 17:01:01 +01:00
Johannes Berg	736a80bbfd	mac80211: mesh: drop frames appearing to be from us If there are multiple mesh stations with the same MAC address, they will both get confused and start throwing warnings. Obviously in this case nothing can actually work anyway, so just drop frames that look like they're from ourselves early on. Reported-by: Gui Iribarren <gui@altermundi.net> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-04 15:51:53 +01:00
Peter Große	3a3713ec36	mac80211: Fix setting TX power on monitor interfaces Instead of calling ieee80211_recalc_txpower on monitor interfaces directly, call it using the virtual monitor interface, if one exists. In case of a single monitor interface given, reject setting TX power, if no virtual monitor interface exists. That being checked, don't warn in ieee80211_bss_info_change_notify, after setting TX power on a monitor interface. Fixes warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 2193 at net/mac80211/driver-ops.h:167 ieee80211_bss_info_change_notify+0x111/0x190 Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core rndis_host cdc_ether usbnet mii tp_smapi(O) thinkpad_ec(O) ohci_hcd vboxpci(O) vboxnetadp(O) vboxnetflt(O) v boxdrv(O) x86_pkg_temp_thermal kvm_intel kvm irqbypass iwldvm iwlwifi ehci_pci ehci_hcd tpm_tis tpm_tis_core tpm CPU: 0 PID: 2193 Comm: iw Tainted: G O 4.12.12-gentoo #2 task: ffff880186fd5cc0 task.stack: ffffc90001b54000 RIP: 0010:ieee80211_bss_info_change_notify+0x111/0x190 RSP: 0018:ffffc90001b57a10 EFLAGS: 00010246 RAX: 0000000000000006 RBX: ffff8801052ce840 RCX: 0000000000000064 RDX: 00000000fffffffc RSI: 0000000000040000 RDI: ffff8801052ce840 RBP: ffffc90001b57a38 R08: 0000000000000062 R09: 0000000000000000 R10: ffff8802144b5000 R11: ffff880049dc4614 R12: 0000000000040000 R13: 0000000000000064 R14: ffff8802105f0760 R15: ffffc90001b57b48 FS: 00007f92644b4580(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9263c109f0 CR3: 00000001df850000 CR4: 00000000000406f0 Call Trace: ieee80211_recalc_txpower+0x33/0x40 ieee80211_set_tx_power+0x40/0x180 nl80211_set_wiphy+0x32e/0x950 Reported-by: Peter Große <pegro@friiks.de> Signed-off-by: Peter Große <pegro@friiks.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-04 15:27:48 +01:00
Hao Chen	3ea15452ee	nl80211: Check for the required netlink attribute presence nl80211_nan_add_func() does not check if the required attribute NL80211_NAN_FUNC_FOLLOW_UP_DEST is present when processing NL80211_CMD_ADD_NAN_FUNCTION request. This request can be issued by users with CAP_NET_ADMIN privilege and may result in NULL dereference and a system crash. Add a check for the required attribute presence. Signed-off-by: Hao Chen <flank3rsky@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-01-04 15:22:02 +01:00
Marcelo Ricardo Leitner	79d0895140	sctp: fix error path in sctp_stream_init syzbot noticed a NULL pointer dereference panic in sctp_stream_free() which was caused by an incomplete error handling in sctp_stream_init(). By not clearing stream->outcnt, it made a for() in sctp_stream_free() think that it had elements to free, but not, leading to the panic. As suggested by Xin Long, this patch also simplifies the error path by moving it to the only if() that uses it. See-also: https://www.spinics.net/lists/netdev/msg473756.html See-also: https://www.spinics.net/lists/netdev/msg465024.html Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: `f952be79ce` ("sctp: introduce struct sctp_stream_out_ext") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-03 11:29:42 -05:00
Mohamed Ghannam	c095508770	RDS: Heap OOB write in rds_message_alloc_sgs() When args->nr_local is 0, nr_pages gets also 0 due some size calculation via rds_rm_size(), which is later used to allocate pages for DMA, this bug produces a heap Out-Of-Bound write access to a specific memory region. Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-03 11:23:05 -05:00
Ingo Molnar	475c5ee193	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Updates to use cond_resched() instead of cond_resched_rcu_qs() where feasible (currently everywhere except in kernel/rcu and in kernel/torture.c). Also a couple of fixes to avoid sending IPIs to offline CPUs. - Updates to simplify RCU's dyntick-idle handling. - Updates to remove almost all uses of smp_read_barrier_depends() and read_barrier_depends(). - Miscellaneous fixes. - Torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-01-03 14:14:18 +01:00
Florian Fainelli	21602e1a55	net: dsa: Fix dsa_legacy_register() return value We need to make the dsa_legacy_register() stub return 0 in order for dsa_init_module() to successfully register and continue registering the ETH_P_XDSA packet handler. Fixes: `2a93c1a365` ("net: dsa: Allow compiling out legacy support") Reported-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 21:52:48 -05:00
Jon Maloy	f9c935db80	tipc: fix problems with multipoint-to-point flow control In commit `04d7b574b2` ("tipc: add multipoint-to-point flow control") we introduced a protocol for preventing buffer overflow when many group members try to simultaneously send messages to the same receiving member. Stress test of this mechanism has revealed a couple of related bugs: - When the receiving member receives an advertisement REMIT message from one of the senders, it will sometimes prematurely activate a pending member and send it the remitted advertisement, although the upper limit for active senders has been reached. This leads to accumulation of illegal advertisements, and eventually to messages being dropped because of receive buffer overflow. - When the receiving member leaves REMITTED state while a received message is being read, we miss to look at the pending queue, to activate the oldest pending peer. This leads to some pending senders being starved out, and never getting the opportunity to profit from the remitted advertisement. We fix the former in the function tipc_group_proto_rcv() by returning directly from the function once it becomes clear that the remitting peer cannot leave REMITTED state at that point. We fix the latter in the function tipc_group_update_rcv_win() by looking up and activate the longest pending peer when it becomes clear that the remitting peer now can leave REMITTED state. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 21:52:07 -05:00
Stephen Hemminger	71891e2dab	ethtool: do not print warning for applications using legacy API In kernel log ths message appears on every boot: "warning: `NetworkChangeNo' uses legacy ethtool link settings API, link modes are only partially reported" When ethtool link settings API changed, it started complaining about usages of old API. Ironically, the original patch was from google but the application using the legacy API is chrome. Linux ABI is fixed as much as possible. The kernel must not break it and should not complain about applications using legacy API's. This patch just removes the warning since using legacy API's in Linux is perfectly acceptable. Fixes: `3f1ac7a700` ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David Decotigny <decot@googlers.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 21:49:17 -05:00
Masami Hiramatsu	a56c1470c2	net: dccp: Remove dccpprobe module Remove DCCP probe module since jprobe has been deprecated. That function is now replaced by dccp/dccp_probe trace-event. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 14:27:30 -05:00
Masami Hiramatsu	ee549be6f0	net: dccp: Add DCCP sendmsg trace event Add DCCP sendmsg trace event (dccp/dccp_probe) for replacing dccpprobe. User can trace this event via ftrace or perftools. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 14:27:30 -05:00
Masami Hiramatsu	fa4475f792	net: sctp: Remove debug SCTP probe module Remove SCTP probe module since jprobe has been deprecated. That function is now replaced by sctp/sctp_probe and sctp/sctp_probe_path trace-events. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 14:27:29 -05:00
Masami Hiramatsu	103d750c88	net: sctp: Add SCTP ACK tracking trace event Add SCTP ACK tracking trace event to trace the changes of SCTP association state in response to incoming packets. It is used for debugging SCTP congestion control algorithms, and will replace sctp_probe module. Note that this event a bit tricky. Since this consists of 2 events (sctp_probe and sctp_probe_path) so you have to enable both events as below. # cd /sys/kernel/debug/tracing # echo 1 > events/sctp/sctp_probe/enable # echo 1 > events/sctp/sctp_probe_path/enable Or, you can enable all the events under sctp. # echo 1 > events/sctp/enable Since sctp_probe_path event is always invoked from sctp_probe event, you can not see any output if you only enable sctp_probe_path. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 14:27:29 -05:00
Masami Hiramatsu	6987990c3e	net: tcp: Remove TCP probe module Remove TCP probe module since jprobe has been deprecated. That function is now replaced by tcp/tcp_probe trace-event. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 14:27:29 -05:00
Masami Hiramatsu	c3fde1bd28	net: tcp: Add trace events for TCP congestion window tracing This adds an event to trace TCP stat variables with slightly intrusive trace-event. This uses ftrace/perf event log buffer to trace those state, no needs to prepare own ring-buffer, nor custom user apps. User can use ftrace to trace this event as below; # cd /sys/kernel/debug/tracing # echo 1 > events/tcp/tcp_probe/enable (run workloads) # cat trace Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 14:27:29 -05:00
Kristian Evensen	bbb6189df4	inet_diag: Add equal-operator for ports inet_diag currently provides less/greater than or equal operators for comparing ports when filtering sockets. An equal comparison can be performed by combining the two existing operators, or a user can for example request a port range and then do the final filtering in userspace. However, these approaches both have drawbacks. Implementing equal using LE/GE causes the size and complexity of a filter to grow quickly as the number of ports increase, while it on busy machines would be great if the kernel only returns information about relevant sockets. This patch introduces source and destination port equal operators. INET_DIAG_BC_S_EQ is used to match a source port, INET_DIAG_BC_D_EQ a destination port, and usage is the same as for the existing port operators. I.e., the port to match is stored in the no-member of the next inet_diag_bc_op-struct in the filter. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:54:04 -05:00
Julia Lawall	e0b10844d9	openvswitch: drop unneeded newline OVS_NLERR prints a newline at the end of the message string, so the message string does not need to include a newline explicitly. Done using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:49:53 -05:00
Julia Lawall	62262ffd95	net: dccp: drop unneeded newline DCCP_CRIT prints some other text and then a newline after the message string, so the message string does not need to include a newline explicitly. Done using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:49:32 -05:00
Wei Yongjun	9540d97761	net: sched: fix skb leak in dev_requeue_skb() When dev_requeue_skb() is called with bulked skb list, only the first skb of the list will be requeued to qdisc layer, and leak the others without free them. TCP is broken due to skb leak since no free skb will be considered as still in the host queue and never be retransmitted. This happend when dev_requeue_skb() called from qdisc_restart(). qdisc_restart \|-- dequeue_skb \|-- sch_direct_xmit() \|-- dev_requeue_skb() <-- skb may bluked Fix dev_requeue_skb() to requeue the full bluked list. Also change to use __skb_queue_tail() in __dev_requeue_skb() to avoid skb out of order. Fixes: `a53851e2c3` ("net: sched: explicit locking in gso_cpu fallback") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:48:29 -05:00
Roi Dayan	3bb23421a5	net/sched: Fix update of lastuse in act modules implementing stats_update We need to update lastuse to to the most updated value between what is already set and the new value. If HW matching fails, i.e. because of an issue, the stats are not updated but it could be that software did match and updated lastuse. Fixes: `5712bf9c5c` ("net/sched: act_mirred: Use passed lastuse argument") Fixes: `9fea47d93b` ("net/sched: act_gact: Update statistics when offloaded to hardware") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:27:52 -05:00
Nogah Frankel	44edf2f897	net: sched: Move offload check till after dump call Move the check of the offload state to after the qdisc dump action was called, so the qdisc could update it if it was changed. Fixes: `7a4fa29106` ("net: sched: Add TCA_HW_OFFLOAD") Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Reviewed-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:17:45 -05:00
Nogah Frankel	8234af2db3	net_sch: red: Fix the new offload indication Update the offload flag, TCQ_F_OFFLOADED, in each dump call (and ignore the offloading function return value in relation to this flag). This is done because a qdisc is being initialized, and therefore offloaded before being grafted. Since the ability of the driver to offload the qdisc depends on its location, a qdisc can be offloaded and un-offloaded by graft calls, that doesn't effect the qdisc itself. Fixes: `428a68af3a` ("net: sched: Move to new offload indication in RED" Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Reviewed-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 13:17:45 -05:00
Xin Long	2fa771be95	ip6_tunnel: allow ip6gre dev mtu to be set below 1280 Commit `582442d6d5` ("ipv6: Allow the MTU of ipip6 tunnel to be set below 1280") fixed a mtu setting issue. It works for ipip6 tunnel. But ip6gre dev updates the mtu also with ip6_tnl_change_mtu. Since the inner packet over ip6gre can be ipv4 and it's mtu should also be allowed to set below 1280, the same issue also exists on ip6gre. This patch is to fix it by simply changing to check if parms.proto is IPPROTO_IPV6 in ip6_tnl_change_mtu instead, to make ip6gre to go to 'else' branch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 12:36:14 -05:00
Eli Cooper	23263ec86a	ip6_tunnel: disable dst caching if tunnel is dual-stack When an ip6_tunnel is in mode 'any', where the transport layer protocol can be either 4 or 41, dst_cache must be disabled. This is because xfrm policies might apply to only one of the two protocols. Caching dst would cause xfrm policies for one protocol incorrectly used for the other. Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 12:31:12 -05:00
David S. Miller	55a5ec9b77	Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns" This reverts commit `87c320e515`. Changing the error return code in some situations turns out to be harmful in practice. In particular Michael Ellerman reports that DHCP fails on his powerpc machines, and this revert gets things working again. Johannes Berg agrees that this revert is the best course of action for now. Fixes: `029b6d1405` ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-01-02 11:50:12 -05:00
Greg Kroah-Hartman	87ad3722bf	Merge 4.15-rc6 into staging-next We need the staging fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-01-02 15:02:04 +01:00
Sabrina Dubroca	2f10a61cee	xfrm: fix rcu usage in xfrm_get_type_offload request_module can sleep, thus we cannot hold rcu_read_lock() while calling it. The function also jumps back and takes rcu_read_lock() again (in xfrm_state_get_afinfo()), resulting in an imbalance. This codepath is triggered whenever a new offloaded state is created. Fixes: `ffdb5211da` ("xfrm: Auto-load xfrm offload modules") Reported-by: syzbot+ca425f44816d749e8eb49755567a75ee48cf4a30@syzkaller.appspotmail.com Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-31 16:29:24 +01:00
Eric Biggers	4e765b4972	af_key: fix buffer overread in parse_exthdrs() If a message sent to a PF_KEY socket ended with an incomplete extension header (fewer than 4 bytes remaining), then parse_exthdrs() read past the end of the message, into uninitialized memory. Fix it by returning -EINVAL in this case. Reproducer: #include <linux/pfkeyv2.h> #include <sys/socket.h> #include <unistd.h> int main() { int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2); char buf[17] = { 0 }; struct sadb_msg msg = (void )buf; msg->sadb_msg_version = PF_KEY_V2; msg->sadb_msg_type = SADB_DELETE; msg->sadb_msg_len = 2; write(sock, buf, 17); } Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-30 09:52:08 +01:00
Eric Biggers	06b335cb51	af_key: fix buffer overread in verify_address_len() If a message sent to a PF_KEY socket ended with one of the extensions that takes a 'struct sadb_address' but there were not enough bytes remaining in the message for the ->sa_family member of the 'struct sockaddr' which is supposed to follow, then verify_address_len() read past the end of the message, into uninitialized memory. Fix it by returning -EINVAL in this case. This bug was found using syzkaller with KMSAN. Reproducer: #include <linux/pfkeyv2.h> #include <sys/socket.h> #include <unistd.h> int main() { int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2); char buf[24] = { 0 }; struct sadb_msg msg = (void )buf; struct sadb_address addr = (void )(msg + 1); msg->sadb_msg_version = PF_KEY_V2; msg->sadb_msg_type = SADB_DELETE; msg->sadb_msg_len = 3; addr->sadb_address_len = 1; addr->sadb_address_exttype = SADB_EXT_ADDRESS_SRC; write(sock, buf, 24); } Reported-by: Alexander Potapenko <glider@google.com> Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-30 09:52:07 +01:00
Florian Westphal	862591bf4f	xfrm: skip policies marked as dead while rehashing syzkaller triggered following KASAN splat: BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618 read of size 2 at addr ffff8801c8e92fe4 by task kworker/1:1/23 [..] Workqueue: events xfrm_hash_rebuild [..] __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428 xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618 process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112 worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..] The reproducer triggers: 1016 if (error) { 1017 list_move_tail(&walk->walk.all, &x->all); 1018 goto out; 1019 } in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump callback returns -ENOBUFS). In this case, *walk is located the pfkey socket struct, so this socket becomes visible in the global policy list. It looks like this is intentional -- phony walker has walk.dead set to 1 and all other places skip such "policies". Ccing original authors of the two commits that seem to expose this issue (first patch missed ->dead check, second patch adds pfkey sockets to policies dumper list). Fixes: `880a6fab8f` ("xfrm: configure policy hash table thresholds by netlink") Fixes: `12a169e7d8` ("ipsec: Put dumpers on the dump list") Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Timo Teras <timo.teras@iki.fi> Cc: Christophe Gouault <christophe.gouault@6wind.com> Reported-by: syzbot <bot+c028095236fcb6f4348811565b75084c754dc729@syzkaller.appspotmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-30 09:18:47 +01:00
Herbert Xu	257a4b018d	xfrm: Forbid state updates from changing encap type Currently we allow state updates to competely replace the contents of x->encap. This is bad because on the user side ESP only sets up header lengths depending on encap_type once when the state is first created. This could result in the header lengths getting out of sync with the actual state configuration. In practice key managers will never do a state update to change the encapsulation type. Only the port numbers need to be changed as the peer NAT entry is updated. Therefore this patch adds a check in xfrm_state_update to forbid any changes to the encap_type. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-30 09:18:47 +01:00
David S. Miller	6bb8824732	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net net/ipv6/ip6_gre.c is a case of parallel adds. include/trace/events/tcp.h is a little bit more tricky. The removal of in-trace-macro ifdefs in 'net' paralleled with moving show_tcp_state_name and friends over to include/trace/events/sock.h in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-29 15:42:26 -05:00
Tom Herbert	d66fa9ec53	strparser: Call sock_owned_by_user_nocheck strparser wants to check socket ownership without producing any warnings. As indicated by the comment in the code, it is permissible for owned_by_user to return true. Fixes: `43a0c6751a` ("strparser: Stream parser for messages") Reported-by: syzbot <syzkaller@googlegroups.com> Reported-and-tested-by: <syzbot+c91c53af67f9ebe599a337d2e70950366153b295@syzkaller.appspotmail.com> Signed-off-by: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-28 14:28:22 -05:00
Willem de Bruijn	f72c4ac695	skbuff: in skb_copy_ubufs unclone before releasing zerocopy skb_copy_ubufs must unclone before it is safe to modify its skb_shared_info with skb_zcopy_clear. Commit `b90ddd5687` ("skbuff: skb_copy_ubufs must release uarg even without user frags") ensures that all skbs release their zerocopy state, even those without frags. But I forgot an edge case where such an skb arrives that is cloned. The stack does not build such packets. Vhost/tun skbs have their frags orphaned before cloning. TCP skbs only attach zerocopy state when a frag is added. But if TCP packets can be trimmed or linearized, this might occur. Tracing the code I found no instance so far (e.g., skb_linearize ends up calling skb_zcopy_clear if !skb->data_len). Still, it is non-obvious that no path exists. And it is fragile to rely on this. Fixes: `b90ddd5687` ("skbuff: skb_copy_ubufs must release uarg even without user frags") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-28 14:26:22 -05:00
Jiri Pirko	8ec6957403	net: sched: don't set extack message in case the qdisc will be created If the qdisc is not found here, it is going to be created. Therefore, this is not an error path. Remove the extack message set and don't confuse user with error message in case the qdisc was created successfully. Fixes: `0921559811` ("net: sched: sch_api: handle generic qdisc errors") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-28 12:18:31 -05:00
Parthasarathy Bhuvaragan	517d7c79bd	tipc: fix hanging poll() for stream sockets In commit `42b531de17` ("tipc: Fix missing connection request handling"), we replaced unconditional wakeup() with condtional wakeup for clients with flags POLLIN \| POLLRDNORM \| POLLRDBAND. This breaks the applications which do a connect followed by poll with POLLOUT flag. These applications are not woken when the connection is ESTABLISHED and hence sleep forever. In this commit, we fix it by including the POLLOUT event for sockets in TIPC_CONNECTING state. Fixes: `42b531de17` ("tipc: Fix missing connection request handling") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-28 12:15:26 -05:00
David S. Miller	fcffe2edbd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2017-12-28 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Fix incorrect state pruning related to recognition of zero initialized stack slots, where stacksafe exploration would mistakenly return a positive pruning verdict too early ignoring other slots, from Gianluca. 2) Various BPF to BPF calls related follow-up fixes. Fix an off-by-one in maximum call depth check, and rework maximum stack depth tracking logic to fix a bypass of the total stack size check reported by Jann. Also fix a bug in arm64 JIT where prog->jited_len was uninitialized. Addition of various test cases to BPF selftests, from Alexei. 3) Addition of a BPF selftest to test_verifier that is related to BPF to BPF calls which demonstrates a late caller stack size increase and thus out of bounds access. Fixed above in 2). Test case from Jann. 4) Addition of correlating BPF helper calls, BPF to BPF calls as well as BPF maps to bpftool xlated dump in order to allow for better BPF program introspection and debugging, from Daniel. 5) Fixing several bugs in BPF to BPF calls kallsyms handling in order to get it actually to work for subprogs, from Daniel. 6) Extending sparc64 JIT support for BPF to BPF calls and fix a couple of build errors for libbpf on sparc64, from David. 7) Allow narrower context access for BPF dev cgroup typed programs in order to adapt to LLVM code generation. Also adjust memlock rlimit in the test_dev_cgroup BPF selftest, from Yonghong. 8) Add netdevsim Kconfig entry to BPF selftests since test_offload.py relies on netdevsim device being available, from Jakub. 9) Reduce scope of xdp_do_generic_redirect_map() to being static, from Xiongwei. 10) Minor cleanups and spelling fixes in BPF verifier, from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 20:40:32 -05:00
Willem de Bruijn	8ddab50839	tcp: do not allocate linear memory for zerocopy skbs Zerocopy payload is now always stored in frags, and space for headers is reversed, so this memory is unused. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 16:44:14 -05:00
Willem de Bruijn	02583adeff	tcp: place all zerocopy payload in frags This avoids an unnecessary copy of 1-2KB and improves tso_fragment, which has to fall back to tcp_fragment if skb->len != skb_data_len. It also avoids a surprising inconsistency in notifications: Zerocopy packets sent over loopback have their frags copied, so set SO_EE_CODE_ZEROCOPY_COPIED in the notification. But this currently does not happen for small packets, because when all data fits in the linear fragment, data is not copied in skb_orphan_frags_rx. Reported-by: Tom Deseyn <tom.deseyn@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 16:44:13 -05:00
Willem de Bruijn	111856c758	tcp: push full zerocopy packets Skbs that reach MAX_SKB_FRAGS cannot be extended further. Do the same for zerocopy frags as non-zerocopy frags and set the PSH bit. This improves GRO assembly. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 16:44:13 -05:00
Willem de Bruijn	bf5c25d608	skbuff: in skb_segment, call zerocopy functions once per nskb This is a net-next follow-up to commit `268b790679` ("skbuff: orphan frags before zerocopy clone"), which fixed a bug in net, but added a call to skb_zerocopy_clone at each frag to do so. When segmenting skbs with user frags, either the user frags must be replaced with private copies and uarg released, or the uarg must have its refcount increased for each new skb. skb_orphan_frags does the first, except for cases that can handle reference counting. skb_zerocopy_clone then does the second. Call these once per nskb, instead of once per frag. That is, in the common case. With a frag list, also refresh when the origin skb (frag_skb) changes. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 16:44:13 -05:00
Tonghao Zhang	8cb38a6024	sctp: Replace use of sockets_allocated with specified macro. The patch(`180d8cd942`) replaces all uses of struct sock fields' memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem to accessor macros. But the sockets_allocated field of sctp sock is not replaced at all. Then replace it now for unifying the code. Fixes: `180d8cd942` ("foundations of per-cgroup memory pressure controlling.") Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 13:47:52 -05:00
Sowmini Varadhan	66261da169	rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc() If kmem_cache_alloc() fails in the middle of the for() loop, cleanup anything that might have been allocated so far. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 13:37:27 -05:00
Sowmini Varadhan	b319109396	rds: tcp: initialize t_tcp_detached to false Commit `f10b4cff98` ("rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete") adds the field t_tcp_detached, but this needs to be initialized explicitly to false. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 13:37:27 -05:00
Sowmini Varadhan	7ae0c649c4	rds; Reset rs->rs_bound_addr in rds_add_bound() failure path If the rds_sock is not added to the bind_hash_table, we must reset rs_bound_addr so that rds_remove_bound will not trip on this rds_sock. rds_add_bound() does a rds_sock_put() in this failure path, so failing to reset rs_bound_addr will result in a socket refcount bug, and will trigger a WARN_ON with the stack shown below when the application subsequently tries to close the PF_RDS socket. WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \ rds_sock_destruct+0x15/0x30 [rds] : __sk_destruct+0x21/0x190 rds_remove_bound.part.13+0xb6/0x140 [rds] rds_release+0x71/0x120 [rds] sock_release+0x1a/0x70 sock_close+0xe/0x20 __fput+0xd5/0x210 task_work_run+0x82/0xa0 do_exit+0x2ce/0xb30 ? syscall_trace_enter+0x1cc/0x2b0 do_group_exit+0x39/0xa0 SyS_exit_group+0x10/0x10 do_syscall_64+0x61/0x1a0 Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 13:37:27 -05:00
Lorenzo Bianconi	f15bc54eee	l2tp: add peer_offset parameter Introduce peer_offset parameter in order to add the capability to specify two different values for payload offset on tx/rx side. If just offset is provided by userspace use it for rx side as well in order to maintain compatibility with older l2tp versions Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 12:11:51 -05:00
Hangbin Liu	820da53575	l2tp: fix missing print session offset info Report offset parameter in L2TP_CMD_SESSION_GET command if it has been configured by userspace Fixes: `309795f4be` ("l2tp: Add netlink control API for L2TP") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 12:11:51 -05:00
David S. Miller	9f30e5c5c2	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-12-22 1) Separate ESP handling from segmentation for GRO packets. This unifies the IPsec GSO and non GSO codepath. 2) Add asynchronous callbacks for xfrm on layer 2. This adds the necessary infrastructure to core networking. 3) Allow to use the layer2 IPsec GSO codepath for software crypto, all infrastructure is there now. 4) Also allow IPsec GSO with software crypto for local sockets. 5) Don't require synchronous crypto fallback on IPsec offloading, it is not needed anymore. 6) Check for xdo_dev_state_free and only call it if implemented. From Shannon Nelson. 7) Check for the required add and delete functions when a driver registers xdo_dev_ops. From Shannon Nelson. 8) Define xfrmdev_ops only with offload config. From Shannon Nelson. 9) Update the xfrm stats documentation. From Shannon Nelson. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 11:15:14 -05:00
David S. Miller	65bbbf6c20	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-12-22 1) Check for valid id proto in validate_tmpl(), otherwise we may trigger a warning in xfrm_state_fini(). From Cong Wang. 2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute. From Michal Kubecek. 3) Verify the state is valid when encap_type < 0, otherwise we may crash on IPsec GRO . From Aviv Heller. 4) Fix stack-out-of-bounds read on socket policy lookup. We access the flowi of the wrong address family in the IPv4 mapped IPv6 case, fix this by catching address family missmatches before we do the lookup. 5) fix xfrm_do_migrate() with AEAD to copy the geniv field too. Otherwise the state is not fully initialized and migration fails. From Antony Antony. 6) Fix stack-out-of-bounds with misconfigured transport mode policies. Our policy template validation is not strict enough. It is possible to configure policies with transport mode template where the address family of the template does not match the selectors address family. Fix this by refusing such a configuration, address family can not change on transport mode. 7) Fix a policy reference leak when reusing pcpu xdst entry. From Florian Westphal. 8) Reinject transport-mode packets through tasklet, otherwise it is possible to reate a recursion loop. From Herbert Xu. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 10:58:23 -05:00
Tommi Rantala	642a8439dd	tipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path Calling tipc_mon_delete() before the monitor has been created will oops. This can happen in tipc_enable_bearer() error path if tipc_disc_create() fails. [ 48.589074] BUG: unable to handle kernel paging request at 0000000000001008 [ 48.590266] IP: tipc_mon_delete+0xea/0x270 [tipc] [ 48.591223] PGD 1e60c5067 P4D 1e60c5067 PUD 1eb0cf067 PMD 0 [ 48.592230] Oops: 0000 [#1] SMP KASAN [ 48.595610] CPU: 5 PID: 1199 Comm: tipc Tainted: G B 4.15.0-rc4-pc64-dirty #5 [ 48.597176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 [ 48.598489] RIP: 0010:tipc_mon_delete+0xea/0x270 [tipc] [ 48.599347] RSP: 0018:ffff8801d827f668 EFLAGS: 00010282 [ 48.600705] RAX: ffff8801ee813f00 RBX: 0000000000000204 RCX: 0000000000000000 [ 48.602183] RDX: 1ffffffff1de6a75 RSI: 0000000000000297 RDI: 0000000000000297 [ 48.604373] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1dd1533 [ 48.605607] R10: ffffffff8eafbb05 R11: fffffbfff1dd1534 R12: 0000000000000050 [ 48.607082] R13: dead000000000200 R14: ffffffff8e73f310 R15: 0000000000001020 [ 48.608228] FS: 00007fc686484800(0000) GS:ffff8801f5540000(0000) knlGS:0000000000000000 [ 48.610189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 48.611459] CR2: 0000000000001008 CR3: 00000001dda70002 CR4: 00000000003606e0 [ 48.612759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 48.613831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 48.615038] Call Trace: [ 48.615635] tipc_enable_bearer+0x415/0x5e0 [tipc] [ 48.620623] tipc_nl_bearer_enable+0x1ab/0x200 [tipc] [ 48.625118] genl_family_rcv_msg+0x36b/0x570 [ 48.631233] genl_rcv_msg+0x5a/0xa0 [ 48.631867] netlink_rcv_skb+0x1cc/0x220 [ 48.636373] genl_rcv+0x24/0x40 [ 48.637306] netlink_unicast+0x29c/0x350 [ 48.639664] netlink_sendmsg+0x439/0x590 [ 48.642014] SYSC_sendto+0x199/0x250 [ 48.649912] do_syscall_64+0xfd/0x2c0 [ 48.650651] entry_SYSCALL64_slow_path+0x25/0x25 [ 48.651843] RIP: 0033:0x7fc6859848e3 [ 48.652539] RSP: 002b:00007ffd25dff938 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 48.654003] RAX: ffffffffffffffda RBX: 00007ffd25dff990 RCX: 00007fc6859848e3 [ 48.655303] RDX: 0000000000000054 RSI: 00007ffd25dff990 RDI: 0000000000000003 [ 48.656512] RBP: 00007ffd25dff980 R08: 00007fc685c35fc0 R09: 000000000000000c [ 48.657697] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000d13010 [ 48.658840] R13: 00007ffd25e009c0 R14: 0000000000000000 R15: 0000000000000000 [ 48.662972] RIP: tipc_mon_delete+0xea/0x270 [tipc] RSP: ffff8801d827f668 [ 48.664073] CR2: 0000000000001008 [ 48.664576] ---[ end trace e811818d54d5ce88 ]--- Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 10:55:00 -05:00
Tommi Rantala	19142551b2	tipc: error path leak fixes in tipc_enable_bearer() Fix memory leak in tipc_enable_bearer() if enable_media() fails, and cleanup with bearer_disable() if tipc_mon_create() fails. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 10:54:59 -05:00
Avinash Repaka	14e138a86f	RDS: Check cmsg_len before dereferencing CMSG_DATA RDS currently doesn't check if the length of the control message is large enough to hold the required data, before dereferencing the control message data. This results in following crash: BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013 [inline] BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066 Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157 CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x25b/0x340 mm/kasan/report.c:409 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 rds_rdma_bytes net/rds/send.c:1013 [inline] rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066 sock_sendmsg_nosec net/socket.c:628 [inline] sock_sendmsg+0xca/0x110 net/socket.c:638 ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018 __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108 SYSC_sendmmsg net/socket.c:2139 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2134 entry_SYSCALL_64_fastpath+0x1f/0x96 RIP: 0033:0x43fe49 RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49 RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0 R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000 To fix this, we verify that the cmsg_len is large enough to hold the data to be read, before proceeding further. Reported-by: syzbot <syzkaller-bugs@googlegroups.com> Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-27 10:37:23 -05:00
William Tu	214bb1c78a	net: erspan: remove md NULL check The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and since we've checked 'tun_dst', 'md' will never be NULL. The patch removes it at both ipv4 and ipv6 erspan. Fixes: `afb4c97d90` ("ip6_gre: fix potential memory leak in ip6erspan_rcv") Fixes: `50670b6ee9` ("ip_gre: fix potential memory leak in erspan_rcv") Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 17:30:11 -05:00
Mat Martineau	fb7df5e400	tcp: md5: Handle RCU dereference of md5sig_info Dereference tp->md5sig_info in tcp_v4_destroy_sock() the same way it is done in the adjacent call to tcp_clear_md5_list(). Resolves this sparse warning: net/ipv4/tcp_ipv4.c:1914:17: warning: incorrect type in argument 1 (different address spaces) net/ipv4/tcp_ipv4.c:1914:17: expected struct callback_head head net/ipv4/tcp_ipv4.c:1914:17: got struct callback_head [noderef] <asn:4><noident> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 17:23:50 -05:00
Tobias Brunner	09ee9dba96	ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT If SNAT modifies the source address the resulting packet might match an IPsec policy, reinject the packet if that's the case. The exact same thing is already done for IPv4. Signed-off-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 17:14:56 -05:00
Jon Maloy	3a33a19bf8	tipc: fix memory leak of group member when peer node is lost When a group member receives a member WITHDRAW event, this might have two reasons: either the peer member is leaving the group, or the link to the member's node has been lost. In the latter case we need to issue a DOWN event to the user right away, and let function tipc_group_filter_msg() perform delete of the member item. However, in this case we miss to change the state of the member item to MBR_LEAVING, so the member item is not deleted, and we have a memory leak. We now separate better between the four sub-cases of a WITHRAW event and make sure that each case is handled correctly. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 13:06:36 -05:00
Jiri Pirko	4853f128c1	net: sched: fix possible null pointer deref in tcf_block_put We need to check block for being null in both tcf_block_put and tcf_block_put_ext. Fixes: `343723dd51` ("net: sched: fix clsact init error path") Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 13:02:05 -05:00
Jon Maloy	0a3d805c9c	tipc: base group replicast ack counter on number of actual receivers In commit `2f487712b8` ("tipc: guarantee that group broadcast doesn't bypass group unicast") we introduced a mechanism that requires the first (replicated) broadcast sent after a unicast to be acknowledged by all receivers before permitting sending of the next (true) broadcast. The counter for keeping track of the number of acknowledges to expect is based on the tipc_group::member_cnt variable. But this misses that some of the known members may not be ready for reception, and will never acknowledge the message, either because they haven't fully joined the group or because they are leaving the group. Such members are identified by not fulfilling the condition tested for in the function tipc_group_is_enabled(). We now set the counter for the actual number of acks to receive at the moment the message is sent, by just counting the number of recipients satisfying the tipc_group_is_enabled() test. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 13:00:04 -05:00
Cong Wang	b2fb01f426	net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap() The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for flying RCU callback installed by a previous mini_qdisc_pair_swap(), however we miss it on the tp_head==NULL path, which leads to that the RCU callback still uses miniq_old->rcu after it is freed together with qdisc in qdisc_graft(). So just add it on that path too. Fixes: `46209401f8` ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath ") Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com> Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 12:28:40 -05:00
Alexey Kodanev	e5a9336adb	ip6_gre: fix device features for ioctl setup When ip6gre is created using ioctl, its features, such as scatter-gather, GSO and tx-checksumming will be turned off: # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1 # ethtool -k gre6 (truncated output) tx-checksumming: off scatter-gather: off tcp-segmentation-offload: off generic-segmentation-offload: off [requested on] But when netlink is used, they will be enabled: # ip link add gre6 type ip6gre remote fd00::1 # ethtool -k gre6 (truncated output) tx-checksumming: on scatter-gather: on tcp-segmentation-offload: on generic-segmentation-offload: on This results in a loss of performance when gre6 is created via ioctl. The issue was found with LTP/gre tests. Fix it by moving the setup of device features to a separate function and invoke it with ndo_init callback because both netlink and ioctl will eventually call it via register_netdevice(): register_netdevice() - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init() - ip6gre_tunnel_init_common() - ip6gre_tnl_init_features() The moved code also contains two minor style fixes: * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line. * fixed the issue reported by checkpatch: "Unnecessary parentheses around 'nt->encap.type == TUNNEL_ENCAP_NONE'" Fixes: `ac4eb009e4` ("ip6gre: Add support for basic offloads offloads excluding GSO") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-26 12:21:19 -05:00
Hangbin Liu	8bea728dce	netfilter: nf_tables: fix potential NULL-ptr deref in nf_tables_dump_obj_done() If there is no NFTA_OBJ_TABLE and NFTA_OBJ_TYPE, the c.data will be NULL in nf_tables_getobj(). So before free filter->table in nf_tables_dump_obj_done(), we need to check if filter is NULL first. Fixes: `e46abbcc05` ("netfilter: nf_tables: Allow table names of up to 255 chars") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-12-26 17:16:47 +01:00
David S. Miller	fba961ab29	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Lots of overlapping changes. Also on the net-next side the XDP state management is handled more in the generic layers so undo the 'net' nfp fix which isn't applicable in net-next. Include a necessary change by Jakub Kicinski, with log message: ==================== cls_bpf no longer takes care of offload tracking. Make sure netdevsim performs necessary checks. This fixes a warning caused by TC trying to remove a filter it has not added. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-22 11:16:31 -05:00
Sven Eckelmann	5b0890a972	flow_dissector: Parse batman-adv unicast headers The batman-adv unicast packets contain a full layer 2 frame in encapsulated form. The flow dissector must therefore be able to parse the batman-adv unicast header to reach the layer 2+3 information. +--------------------+ \| ip(v6)hdr \| +--------------------+ \| inner ethhdr \| +--------------------+ \| batadv unicast hdr \| +--------------------+ \| outer ethhdr \| +--------------------+ The obtained information from the upper layer can then be used by RPS to schedule the processing on separate cores. This allows better distribution of multiple flows from the same neighbor to different cores. Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:35:53 -05:00
Sven Eckelmann	fec149f5d3	batman-adv: Convert packet.h to uapi header The header file is used by different userspace programs to inject packets or to decode sniffed packets. It should therefore be available to them as userspace header. Also other components in the kernel (like the flow dissector) require access to the packet definitions to be able to decode ETH_P_BATMAN ethernet packets. Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:35:53 -05:00
Sven Eckelmann	adbf9b7324	batman-adv: Remove kernel fixed width types in packet.h The uapi headers use the __u8/__u16/... version of the fixed width types instead of u8/u16/... The use of the latter must be avoided before packet.h is copied to include/uapi/linux/. Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:35:53 -05:00
Sven Eckelmann	a6cb82b5c2	batman-adv: Remove usage of BIT(x) in packet.h The BIT(x) macro is no longer available for uapi headers because it is defined outside of it (linux/bitops.h). The use of it must therefore be avoided and replaced by an appropriate other representation. Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:35:53 -05:00
Sven Eckelmann	4e58452b6f	batman-adv: Let packet.h include its headers directly The headers used by packet.h should also be included by it directly. main.h is currently dealing with it in batman-adv, but this will no longer work when this header is moved to include/uapi/linux/. Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:35:53 -05:00
Willem de Bruijn	b90ddd5687	skbuff: skb_copy_ubufs must release uarg even without user frags skb_copy_ubufs creates a private copy of frags[] to release its hold on user frags, then calls uarg->callback to notify the owner. Call uarg->callback even when no frags exist. This edge case can happen when zerocopy_sg_from_iter finds enough room in skb_headlen to copy all the data. Fixes: `3ece782693` ("sock: skb_copy_ubufs support for compound pages") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:00:58 -05:00
Willem de Bruijn	268b790679	skbuff: orphan frags before zerocopy clone Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate calls to skb_uarg(skb)->callback for the same data. skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb with each segment. This is only safe for uargs that do refcounting, which is those that pass skb_orphan_frags without dropping their shared frags. For others, skb_orphan_frags drops the user frags and sets the uarg to NULL, after which sock_zerocopy_clone has no effect. Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback calls for the same data causing the vhost_net_ubuf_ref_>refcount to drop below zero. Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com> Fixes: `1f8b977ab3` ("sock: enable MSG_ZEROCOPY") Reported-by: Andreas Hartmann <andihartmann@01019freenet.de> Reported-by: David Hill <dhill@redhat.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 15:00:58 -05:00
Shaohua Li	513674b5a2	net: reevalulate autoflowlabel setting after sysctl setting sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2. If sockopt doesn't set autoflowlabel, outcome packets from the hosts are supposed to not include flowlabel. This is true for normal packet, but not for reset packet. The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't changed, so the sock will keep the old behavior in terms of auto flowlabel. Reset packet is suffering from this problem, because reset packet is sent from a special control socket, which is created at boot time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control socket will always have its ipv6_pinfo.autoflowlabel set, even after user set sysctl.ipv6.auto_flowlabels to 1, so reset packset will always have flowlabel. Normal sock created before sysctl setting suffers from the same issue. We can't even turn off autoflowlabel unless we kill all socks in the hosts. To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the autoflowlabel setting from user, otherwise we always call ip6_default_np_autolabel() which has the new settings of sysctl. Note, this changes behavior a little bit. Before commit `42240901f7` (ipv6: Implement different admin modes for automatic flow labels), the autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes, existing connection will change autoflowlabel behavior. After that commit, autoflowlabel behavior is sticky in the whole life of the sock. With this patch, the behavior isn't sticky again. Cc: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 13:07:20 -05:00
Eric Garver	c48e74736f	openvswitch: Fix pop_vlan action for double tagged frames skb_vlan_pop() expects skb->protocol to be a valid TPID for double tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop() shift the true ethertype into position for us. Fixes: `5108bbaddc` ("openvswitch: add processing of L3 packets") Signed-off-by: Eric Garver <e@erig.me> Reviewed-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 13:02:08 -05:00
Alexander Aring	a7c31693e1	net: sch: sch_drr: add extack support This patch adds extack support for the drr qdisc implementation by adding NL_SET_ERR_MSG in validation of user input. Also it serves to illustrate a use case of how the infrastructure ops api changes are to be used by individual qdiscs. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:51 -05:00
Alexander Aring	710fb39689	net: sch: sch_cbs: add extack support This patch adds extack support for the cbs qdisc implementation by adding NL_SET_ERR_MSG in validation of user input. Also it serves to illustrate a use case of how the infrastructure ops api changes are to be used by individual qdiscs. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:51 -05:00
Alexander Aring	62a6de62dc	net: sch: sch_cbq: add extack support This patch adds extack support for the cbq qdisc implementation by adding NL_SET_ERR_MSG in validation of user input. Also it serves to illustrate a use case of how the infrastructure ops api changes are to be used by individual qdiscs. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:51 -05:00
Alexander Aring	a38a98821c	net: sch: api: add extack support in qdisc_create_dflt This patch adds extack support for the function qdisc_create_dflt which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why qdisc_create_dflt failed. The function qdisc_create_dflt will also call an init callback which can fail by any per-qdisc specific handling. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:51 -05:00
Alexander Aring	d0bd684ddd	net: sch: api: add extack support in qdisc_alloc This patch adds extack support for the function qdisc_alloc which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why qdisc_alloc failed. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:51 -05:00
Alexander Aring	8d1a77f974	net: sch: api: add extack support in tcf_block_get This patch adds extack support for the function tcf_block_get which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why tcf_block_get failed. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:51 -05:00
Alexander Aring	e9bc3fa28b	net: sch: api: add extack support in qdisc_get_rtab This patch adds extack support for the function qdisc_get_rtab which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why qdisc_get_rtab failed. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:50 -05:00
Alexander Aring	653d6fd68d	net: sched: sch: add extack for graft callback This patch adds extack support for graft callback to prepare per-qdisc specific changes for extack. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:50 -05:00
Alexander Aring	cbaacc4e8a	net: sched: sch: add extack for block callback This patch adds extack support for block callback to prepare per-qdisc specific changes for extack. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:50 -05:00
Alexander Aring	793d81d6a1	net: sched: sch: add extack to change class This patch adds extack support for class change callback api. This prepares to handle extack support inside each specific class implementation. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:50 -05:00
Alexander Aring	2030721cc0	net: sched: sch: add extack for change qdisc ops This patch adds extack support for change callback for qdisc ops structtur to prepare per-qdisc specific changes for extack. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:50 -05:00
Alexander Aring	e63d7dfd2d	net: sched: sch: add extack for init callback This patch adds extack support for init callback to prepare per-qdisc specific changes for extack. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:50 -05:00
Alexander Aring	0921559811	net: sched: sch_api: handle generic qdisc errors This patch adds extack support for generic qdisc handling. The extack will be set deeper to each called function which is not part of netdev core api. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:49 -05:00
Alexander Aring	ac8ef4ab73	net: sched: fix coding style issues This patch fix checkpatch issues for upcomming patches according to the sched api file. It changes mostly how to check on null pointer. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 12:32:49 -05:00
Ido Schimmel	58acfd714e	ipv6: Honor specified parameters in fibmatch lookup Currently, parameters such as oif and source address are not taken into account during fibmatch lookup. Example (IPv4 for reference) before patch: $ ip -4 route show 192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1 198.51.100.0/24 dev dummy1 proto kernel scope link src 198.51.100.1 $ ip -6 route show 2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium 2001:db8:2::/64 dev dummy1 proto kernel metric 256 pref medium fe80::/64 dev dummy0 proto kernel metric 256 pref medium fe80::/64 dev dummy1 proto kernel metric 256 pref medium $ ip -4 route get fibmatch 192.0.2.2 oif dummy0 192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1 $ ip -4 route get fibmatch 192.0.2.2 oif dummy1 RTNETLINK answers: No route to host $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0 2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1 2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium After: $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0 2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1 RTNETLINK answers: Network is unreachable The problem stems from the fact that the necessary route lookup flags are not set based on these parameters. Instead of duplicating the same logic for fibmatch, we can simply resolve the original route from its copy and dump it instead. Fixes: `18c3a61c42` ("net: ipv6: RTM_GETROUTE: return matched fib result when requested") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-21 11:51:06 -05:00
Shannon Nelson	92a2320697	xfrm: check for xdo_dev_ops add and delete This adds a check for the required add and delete functions up front at registration time to be sure both are defined. Since both the features check and the registration check are looking at the same things, break out the check for both to call. Lastly, for some reason the feature check was setting xfrmdev_ops to NULL if the NETIF_F_HW_ESP bit was missing, which would probably surprise the driver later if the driver turned its NETIF_F_HW_ESP bit back on. We shouldn't be messing with the driver's callback list, so we stop doing that with this patch. Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-21 08:17:48 +01:00
Pablo Neira Ayuso	24c0df82ef	netfilter: nf_tables: fix chain filter in nf_tables_dump_rules() ctx->chain may be null now that we have very large object names, so we cannot check for ctx->chain[0] here. Fixes: `b7263e071a` ("netfilter: nf_tables: Allow table names of up to 255 chars") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Phil Sutter <phil@nwl.cc>	2017-12-21 00:15:11 +01:00
Ido Schimmel	b4681c2829	ipv4: Fix use-after-free when flushing FIB tables Since commit `0ddcf43d5d` ("ipv4: FIB Local/MAIN table collapse") the local table uses the same trie allocated for the main table when custom rules are not in use. When a net namespace is dismantled, the main table is flushed and freed (via an RCU callback) before the local table. In case the callback is invoked before the local table is iterated, a use-after-free can occur. Fix this by iterating over the FIB tables in reverse order, so that the main table is always freed after the local table. v3: Reworded comment according to Alex's suggestion. v2: Add a comment to make the fix more explicit per Dave's and Alex's feedback. Fixes: `0ddcf43d5d` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 15:12:39 -05:00
Jon Maloy	bb25c3855a	tipc: remove joining group member from congested list When we receive a JOIN message from a peer member, the message may contain an advertised window value ADV_IDLE that permits removing the member in question from the tipc_group::congested list. However, since the removal has been made conditional on that the advertised window is not ADV_IDLE, we miss this case. This has the effect that a sender sometimes may enter a state of permanent, false, broadcast congestion. We fix this by unconditinally removing the member from the congested list before calling tipc_member_update(), which might potentially sort it into the list again. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 14:56:48 -05:00
David S. Miller	a943e8bc05	This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - de-inline hash functions to save memory footprint, by Denys Vlasenko - Add License information to various files, by Sven Eckelmann (3 patches) - Change batman_adv.h from ISC to MIT, by Sven Eckelmann - Improve various includes, by Sven Eckelmann (5 patches) - Lots of kernel-doc work by Sven Eckelmann (8 patches) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlo6QfgWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeobWfEADPEOdxWS7nW4Xkhug+7vLbcloJ Om1VDKFD4n5NfB6e+vh8kQAnnnQ/LzFSv53giNdnjjE9IPNKxNhzBQFS95H189EP ebP0mKOesTadkTx+MjFxenhnaTnzK0hkngdxz/frvaq+i6ECMhnq8Bw0elVG0nSg X9ts0x6BuNIw6EjIdPP0GvfOV1DvUmdMz1YLJy4yoJ8Tm671y86y07jTJaaEN2Ex 9dp0j3hEqtrYZvgRdQ/hzYLFJ9fvcF1FyA+duufK3FNbkJtZn89zqseIuN2saqqw QN/nDBrduzW6SR9y0JfXlatI6FN6316jskLcpqorz5/88KrwQeTOg2ZXXn0iw6l1 r0tkBP5/eEu2Dcd3WRsNtTnMZmGfc2uuqmvD1Pz0wN7RAzIYhPvs9to6TVy3mK5p 0OyFJaZU9vurtIPqVYjNtSofkUHKMOZZ1H7LocWaINGIVsREx/i0DmX//M0ZiqbB mS4ybkn8yOYfFUizg51RYo2nKVyw/ZXNqBYBPUWio91CPj0vOUFitvfxnxNitV/m 182HVdoOnGhorYbS3J/8Su3AyEyhJTVWnmq0z3u1CuvdMhppbAzvXvmERotAJN9e 5Wp26PE2a7zb4LJhMBQ4q+RnUnZV5ADhigrIEGPOQoY5KdIrEOcW65wrgfhWYlER NVJc2okVZKhg3o7amA== =5JUK -----END PGP SIGNATURE----- Merge tag 'batadv-next-for-davem-20171220' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - de-inline hash functions to save memory footprint, by Denys Vlasenko - Add License information to various files, by Sven Eckelmann (3 patches) - Change batman_adv.h from ISC to MIT, by Sven Eckelmann - Improve various includes, by Sven Eckelmann (5 patches) - Lots of kernel-doc work by Sven Eckelmann (8 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 14:33:03 -05:00
Yafang Shao	cbabf46364	net: tracepoint: using sock_set_state tracepoint to trace SCTP state transition With changes in inet_ files, SCTP state transitions are traced with inet_sock_set_state tracepoint. As SCTP state names, i.e. SCTP_SS_CLOSED, SCTP_SS_ESTABLISHED, have the same value with TCP state names. So the output info still print the TCP state names, that makes the code easy. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 14:00:25 -05:00
Yafang Shao	b0832e3005	net: tracepoint: using sock_set_state tracepoint to trace DCCP state transition With changes in inet_ files, DCCP state transitions are traced with inet_sock_set_state tracepoint. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 14:00:25 -05:00
Yafang Shao	986ffdfd08	net: sock: replace sk_state_load with inet_sk_state_load and remove sk_state_store sk_state_load is only used by AF_INET/AF_INET6, so rename it to inet_sk_state_load and move it into inet_sock.h. sk_state_store is removed as it is not used any more. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 14:00:25 -05:00
Yafang Shao	563e0bb0dc	net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint As sk_state is a common field for struct sock, so the state transition tracepoint should not be a TCP specific feature. Currently it traces all AF_INET state transition, so I rename this tracepoint to inet_sock_set_state tracepoint with some minor changes and move it into trace/events/sock.h. We dont need to create a file named trace/events/inet_sock.h for this one single tracepoint. Two helpers are introduced to trace sk_state transition - void inet_sk_state_store(struct sock sk, int newstate); - void inet_sk_set_state(struct sock sk, int state); As trace header should not be included in other header files, so they are defined in sock.c. The protocol such as SCTP maybe compiled as a ko, hence export inet_sk_set_state(). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 14:00:25 -05:00
Haishuang Yan	afb4c97d90	ip6_gre: fix potential memory leak in ip6erspan_rcv If md is NULL, tun_dst must be freed, otherwise it will cause memory leak. Fixes: `ef7baf5e08` ("ip6_gre: add ip6 erspan collect_md mode") Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 13:56:39 -05:00
Haishuang Yan	50670b6ee9	ip_gre: fix potential memory leak in erspan_rcv If md is NULL, tun_dst must be freed, otherwise it will cause memory leak. Fixes: `1a66a836da` ("gre: add collect_md mode to ERSPAN tunnel") Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 13:56:39 -05:00
Haishuang Yan	a7343211f0	ip6_gre: fix error path when ip6erspan_rcv failed Same as ipv4 code, when ip6erspan_rcv call return PACKET_REJECT, we should call icmpv6_send to send icmp unreachable message in error path. Fixes: `5a963eb61b` ("ip6_gre: Add ERSPAN native tunnel support") Acked-by: William Tu <u9012063@gmail.com> Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 13:51:46 -05:00
Haishuang Yan	dd8d5b8c5b	ip_gre: fix error path when erspan_rcv failed When erspan_rcv call return PACKET_REJECT, we shoudn't call ipgre_rcv to process packets again, instead send icmp unreachable message in error path. Fixes: `84e54fe0a5` ("gre: introduce native tunnel support for ERSPAN") Acked-by: William Tu <u9012063@gmail.com> Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 13:51:46 -05:00
Haishuang Yan	293a1991cf	ip6_gre: fix a pontential issue in ip6erspan_rcv pskb_may_pull() can change skb->data, so we need to load ipv6h/ershdr at the right place. Fixes: `5a963eb61b` ("ip6_gre: Add ERSPAN native tunnel support") Cc: William Tu <u9012063@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 13:48:39 -05:00
Jakub Kicinski	102740bd94	cls_bpf: fix offload assumptions after callback conversion cls_bpf used to take care of tracking what offload state a filter is in, i.e. it would track if offload request succeeded or not. This information would then be used to issue correct requests to the driver, e.g. requests for statistics only on offloaded filters, removing only filters which were offloaded, using add instead of replace if previous filter was not added etc. This tracking of offload state no longer functions with the new callback infrastructure. There could be multiple entities trying to offload the same filter. Throw out all the tracking and corresponding commands and simply pass to the drivers both old and new bpf program. Drivers will have to deal with offload state tracking by themselves. Fixes: `3f7889c4c7` ("net: sched: cls_bpf: call block callbacks for offload") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 13:08:18 -05:00
Andy Shevchenko	223b229b63	bridge: Use helpers to handle MAC address Use %pM to print MAC mac_pton() to convert it from ASCII to binary format, and ether_addr_copy() to copy. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 12:46:11 -05:00
Eric W. Biederman	21b5944350	net: Fix double free and memory corruption in get_net_ns_by_id() (I can trivially verify that that idr_remove in cleanup_net happens after the network namespace count has dropped to zero --EWB) Function get_net_ns_by_id() does not check for net::count after it has found a peer in netns_ids idr. It may dereference a peer, after its count has already been finaly decremented. This leads to double free and memory corruption: put_net(peer) rtnl_lock() atomic_dec_and_test(&peer->count) [count=0] ... __put_net(peer) get_net_ns_by_id(net, id) spin_lock(&cleanup_list_lock) list_add(&net->cleanup_list, &cleanup_list) spin_unlock(&cleanup_list_lock) queue_work() peer = idr_find(&net->netns_ids, id) \| get_net(peer) [count=1] \| ... \| (use after final put) v ... cleanup_net() ... spin_lock(&cleanup_list_lock) ... list_replace_init(&cleanup_list, ..) ... spin_unlock(&cleanup_list_lock) ... ... ... ... put_net(peer) ... atomic_dec_and_test(&peer->count) [count=0] ... spin_lock(&cleanup_list_lock) ... list_add(&net->cleanup_list, &cleanup_list) ... spin_unlock(&cleanup_list_lock) ... queue_work() ... rtnl_unlock() rtnl_lock() ... for_each_net(tmp) { ... id = __peernet2id(tmp, peer) ... spin_lock_irq(&tmp->nsid_lock) ... idr_remove(&tmp->netns_ids, id) ... ... ... net_drop_ns() ... net_free(peer) ... } ... \| v cleanup_net() ... (Second free of peer) Also, put_net() on the right cpu may reorder with left's cpu list_replace_init(&cleanup_list, ..), and then cleanup_list will be corrupted. Since cleanup_net() is executed in worker thread, while put_net(peer) can happen everywhere, there should be enough time for concurrent get_net_ns_by_id() to pick the peer up, and the race does not seem to be unlikely. The patch fixes the problem in standard way. (Also, there is possible problem in peernet2id_alloc(), which requires check for net::count under nsid_lock and maybe_get_net(peer), but in current stable kernel it's used under rtnl_lock() and it has to be safe. Openswitch begun to use peernet2id_alloc(), and possibly it should be fixed too. While this is not in stable kernel yet, so I'll send a separate message to netdev@ later). Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Fixes: `0c7aecd4bd` "netns: add rtnl cmd to add and get peer netns ids" Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 12:42:22 -05:00
Alexey Kodanev	53c81e95df	ip6_vti: adjust vti mtu according to mtu of lower device LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams over ip6_vti that require fragmentation and the underlying device has an MTU smaller than 1500 plus some extra space for headers. This happens because ip6_vti, by default, sets MTU to ETH_DATA_LEN and not updating it depending on a destination address or link parameter. Further attempts to send UDP packets may succeed because pmtu gets updated on ICMPV6_PKT_TOOBIG in vti6_err(). In case the lower device has larger MTU size, e.g. 9000, ip6_vti works but not using the possible maximum size, output packets have 1500 limit. The above cases require manual MTU setup after ip6_vti creation. However ip_vti already updates MTU based on lower device with ip_tunnel_bind_dev(). Here is the example when the lower device MTU is set to 9000: # ip a sh ltp_ns_veth2 ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ... inet 10.0.0.2/24 scope global ltp_ns_veth2 inet6 fd00::2/64 scope global # ip li add vti6 type vti6 local fd00::2 remote fd00::1 # ip li show vti6 vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ... link/tunnel6 fd00::2 peer fd00::1 After the patch: # ip li add vti6 type vti6 local fd00::2 remote fd00::1 # ip li show vti6 vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ... link/tunnel6 fd00::2 peer fd00::1 Reported-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-20 11:52:32 -05:00
Steffen Klassert	f58869c44f	esp: Don't require synchronous crypto fallback on offloading anymore. We support asynchronous crypto on layer 2 ESP now. So no need to force synchronous crypto fallback on offloading anymore. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-20 10:41:53 +01:00
Steffen Klassert	95bff4b580	xfrm: Allow to use the layer2 IPsec GSO codepath for software crypto. We now have support for asynchronous crypto operations in the layer 2 TX path. This was the missing part to allow the GSO codepath for software crypto, so allow this codepath now. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-20 10:41:42 +01:00
Steffen Klassert	f53c723902	net: Add asynchronous callbacks for xfrm on layer 2. This patch implements asynchronous crypto callbacks and a backlog handler that can be used when IPsec is done at layer 2 in the TX path. It also extends the skb validate functions so that we can update the driver transmit return codes based on async crypto operation or to indicate that we queued the packet in a backlog queue. Joint work with: Aviv Heller <avivh@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-20 10:41:36 +01:00
Steffen Klassert	3dca3f38cf	xfrm: Separate ESP handling from segmentation for GRO packets. We change the ESP GSO handlers to only segment the packets. The ESP handling and encryption is defered to validate_xmit_xfrm() where this is done for non GRO packets too. This makes the code more robust and prepares for asynchronous crypto handling. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-20 10:41:31 +01:00
Phil Sutter	d03a45572e	ipv4: fib: Fix metrics match when deleting a route The recently added fib_metrics_match() causes a regression for routes with both RTAX_FEATURES and RTAX_CC_ALGO if the latter has TCP_CONG_NEEDS_ECN flag set: \| # ip link add d0 type dummy \| # ip link set d0 up \| # ip route add 172.29.29.0/24 dev d0 features ecn congctl dctcp \| # ip route del 172.29.29.0/24 dev d0 features ecn congctl dctcp \| RTNETLINK answers: No such process During route insertion, fib_convert_metrics() detects that the given CC algo requires ECN and hence sets DST_FEATURE_ECN_CA bit in RTAX_FEATURES. During route deletion though, fib_metrics_match() compares stored RTAX_FEATURES value with that from userspace (which obviously has no knowledge about DST_FEATURE_ECN_CA) and fails. Fixes: `5f9ae3d9e7` ("ipv4: do metrics match when looking up and deleting a route") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 14:21:58 -05:00
Cong Wang	1df94c3c5d	net_sched: properly check for empty skb array on error path First, the check of &q->ring.queue against NULL is wrong, it is always false. We should check the value rather than the address. Secondly, we need the same check in pfifo_fast_reset() too, as both ->reset() and ->destroy() are called in qdisc_destroy(). Fixes: `c5ad119fb6` ("net: sched: pfifo_fast use skb_array") Reported-by: syzbot <syzkaller@googlegroups.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 14:13:12 -05:00
Jon Maloy	3db0960117	tipc: fix list sorting bug in function tipc_group_update_member() When, during a join operation, or during message transmission, a group member needs to be added to the group's 'congested' list, we sort it into the list in ascending order, according to its current advertised window size. However, we miss the case when the member is already on that list. This will have the result that the member, after the window size has been decremented, might be at the wrong position in that list. This again may have the effect that we during broadcast and multicast transmissions miss the fact that a destination is not yet ready for reception, and we end up sending anyway. From this point on, the behavior during the remaining session is unpredictable, e.g., with underflowing window sizes. We now correct this bug by unconditionally removing the member from the list before (re-)sorting it in. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 14:10:03 -05:00
David S. Miller	748a709974	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2017-12-18 Here's the first bluetooth-next pull request for the 4.16 kernel. - hci_ll: multiple cleanups & fixes - Remove Gustavo Padovan from the MAINTAINERS file - Support BLE Adversing while connected (if the controller can do it) - DT updates for TI chips - Various other smaller cleanups & fixes Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 13:53:39 -05:00
Xin Long	c9fefa0819	ip6_tunnel: get the min mtu properly in ip6_tnl_xmit Now it's using IPV6_MIN_MTU as the min mtu in ip6_tnl_xmit, but IPV6_MIN_MTU actually only works when the inner packet is ipv6. With IPV6_MIN_MTU for ipv4 packets, the new pmtu for inner dst couldn't be set less than 1280. It would cause tx_err and the packet to be dropped when the outer dst pmtu is close to 1280. Jianlin found it by running ipv4 traffic with the topo: (client) gre6 <---> eth1 (route) eth2 <---> gre6 (server) After changing eth2 mtu to 1300, the performance became very low, or the connection was even broken. The issue also affects ip4ip6 and ip6ip6 tunnels. So if the inner packet is ipv4, 576 should be considered as the min mtu. Note that for ip4ip6 and ip6ip6 tunnels, the inner packet can only be ipv4 or ipv6, but for gre6 tunnel, it may also be ARP. This patch using 576 as the min mtu for non-ipv6 packet works for all those cases. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 13:45:33 -05:00
Xin Long	2c52129a7d	ip6_gre: remove the incorrect mtu limit for ipgre tap The same fix as the patch "ip_gre: remove the incorrect mtu limit for ipgre tap" is also needed for ip6_gre. Fixes: `61e84623ac` ("net: centralize net_device min/max MTU checking") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 13:45:32 -05:00
Xin Long	cfddd4c33c	ip_gre: remove the incorrect mtu limit for ipgre tap ipgre tap driver calls ether_setup(), after commit `61e84623ac` ("net: centralize net_device min/max MTU checking"), the range of mtu is [min_mtu, max_mtu], which is [68, 1500] by default. It causes the dev mtu of the ipgre tap device to not be greater than 1500, this limit value is not correct for ipgre tap device. Besides, it's .change_mtu already does the right check. So this patch is just to set max_mtu as 0, and leave the check to it's .change_mtu. Fixes: `61e84623ac` ("net: centralize net_device min/max MTU checking") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 13:45:32 -05:00
Michael Chan	56f5aa77cd	net: Disable GRO_HW when generic XDP is installed on a device. Hardware should not aggregate any packets when generic XDP is installed. Cc: Ariel Elior <Ariel.Elior@cavium.com> Cc: everest-linux-l2@cavium.com Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 10:38:36 -05:00
Michael Chan	fb1f5f79ae	net: Introduce NETIF_F_GRO_HW. Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware GRO. With this flag, we can now independently turn on or off hardware GRO when GRO is on. Previously, drivers were using NETIF_F_GRO to control hardware GRO and so it cannot be independently turned on or off without affecting GRO. Hardware GRO (just like GRO) guarantees that packets can be re-segmented by TSO/GSO to reconstruct the original packet stream. Logically, GRO_HW should depend on GRO since it a subset, but we will let individual drivers enforce this dependency as they see fit. Since NETIF_F_GRO is not propagated between upper and lower devices, NETIF_F_GRO_HW should follow suit since it is a subset of GRO. In other words, a lower device can independent have GRO/GRO_HW enabled or disabled and no feature propagation is required. This will preserve the current GRO behavior. This can be changed later if we decide to propagate GRO/ GRO_HW/RXCSUM from upper to lower devices. Cc: Ariel Elior <Ariel.Elior@cavium.com> Cc: everest-linux-l2@cavium.com Signed-off-by: Michael Chan <michael.chan@broadcom.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 10:38:36 -05:00
Tonghao Zhang	648845ab7e	sock: Move the socket inuse to namespace. In some case, we want to know how many sockets are in use in different _net_ namespaces. It's a key resource metric. This patch add a member in struct netns_core. This is a counter for socket-inuse in the _net_ namespace. The patch will add/sub counter in the sk_alloc, sk_clone_lock and __sk_free. This patch will not counter the socket created in kernel. It's not very useful for userspace to know how many kernel sockets we created. The main reasons for doing this are that: 1. When linux calls the 'do_exit' for process to exit, the functions 'exit_task_namespaces' and 'exit_task_work' will be called sequentially. 'exit_task_namespaces' may have destroyed the _net_ namespace, but 'sock_release' called in 'exit_task_work' may use the _net_ namespace if we counter the socket-inuse in sock_release. 2. socket and sock are in pair. More important, sock holds the _net_ namespace. We counter the socket-inuse in sock, for avoiding holding _net_ namespace again in socket. It's a easy way to maintain the code. Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 09:58:14 -05:00
Tonghao Zhang	08fc7f8140	sock: Change the netns_core member name. Change the member name will make the code more readable. This patch will be used in next patch. Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 09:58:14 -05:00
David S. Miller	c6479d6257	A few more fixes: * hwsim: - set To-DS bit in some frames missing it - fix sleeping in atomic * nl80211: - doc cleanup - fix locking in an error path * build: - don't append to created certs C files - ship certificate pre-hexdumped -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlo44hIACgkQB8qZga/f l8RSchAAlk1jd9WTkvzQEUe3KpQ9cPoLe6SZ5ozPJCccJQ4GPlQiB9NK1hi+qfv/ WWdJl7zSWMujVtneeEhxHFnlwhRGh3s1t9IP1RACPEhlvFbyKob55cXJnbbiRtCJ MEwW15c/SbCND79QRVXLQxV0JiNM0I8j1LIdEWoi2xFA+/g+zA7InZ1g/WvPUQq7 N28bGJ4lVEmBFJ1nnaF4kFpGRn73Cq2E9CqkWtS3Gfo9Gso6XCMvmPRjO2+BzJ67 Ovmy9jEJTZO9aGOUXtcaZbeNhlxJMqVdRv2mjpR/l9g6r2/u+BD4uiAlLy+RNcRU gABqk3RJl+CA7BDl+YJahm33rLhokkZZIwq7UV5jsYOR1UTUdK/j7s5MpoqaSIVI ZA5/emcEoY8XTt+MMCEJbhRIJA9rQXKOq6vdeqLK5YBL0HKlg5eIWtGqh3EcY6iO +qe0cc4TFp4sWzOOrKShmDh5agDta1+gHUoBQsTBZnnU6pEybcY2FTUF+xN4u3mk cEOOMFX8CiMDIyCf2GUALRbnwfZgWesxjfY7NXCAOL6T7JK+WFVX9Njn2JqT/eIO FKUZl29nathswPTb1YdHmoAUJPU3TicuVnVPJzkJVG0lN5x7pwFX7/nkrZxphCBT 06t0r3RohfbMGwYTVXqi7z7FkDVr2YbP9fQ63r5yvWOgPCT+xLM= =OWs9 -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2017-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A few more fixes: * hwsim: - set To-DS bit in some frames missing it - fix sleeping in atomic * nl80211: - doc cleanup - fix locking in an error path * build: - don't append to created certs C files - ship certificate pre-hexdumped ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-19 09:39:11 -05:00
Sunil Dutt	983dafaab7	cfg80211: Scan results to also report the per chain signal strength This commit enhances the scan results to report the per chain signal strength based on the latest BSS update. This provides similar information to what is already available through STA information. Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-19 10:37:31 +01:00
David Spinadel	86b6c46572	nl80211: send deauth reason if locally generated Send disconnection reason code to user space even if it's locally generated, since some tests that check reason code may fail because of the current behavior. Signed-off-by: David Spinadel <david.spinadel@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-19 10:13:01 +01:00
Johannes Berg	e7881bd594	Revert "mac80211: Add TXQ scheduling API" This reverts commit `e937b8da5a`. Turns out that a new driver (mt76) is coming in through Kalle's tree, and will conflict with this. It also has some conflicting requirements, so we'll revisit this later. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-19 10:12:48 +01:00
Johannes Berg	0973dd45ec	Revert "mac80211: Add airtime account and scheduling to TXQs" This reverts commit `b0d52ad821`. We need to revert the TXQ scheduling API due to conflicts with a new driver, and this depends on that API. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-19 10:12:26 +01:00
Johannes Berg	04a7279ff1	cfg80211: ship certificates as hex files Not only does this remove the need for the hexdump code in most normal kernel builds (still there for the extra directory), but it also removes the need to ship binary files, which apparently is somewhat problematic, as Randy reported. While at it, also add the generated files to clean-files. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-19 09:28:01 +01:00
Thierry Reding	5d32407396	cfg80211: always rewrite generated files from scratch Currently the certs C code generation appends to the generated files, which is most likely a leftover from commit `715a123347` ("wireless: don't write C files on failures"). This causes duplicate code in the generated files if the certificates have their timestamps modified between builds and thereby trigger the generation rules. Fixes: `715a123347` ("wireless: don't write C files on failures") Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-19 09:13:26 +01:00
Herbert Xu	acf568ee85	xfrm: Reinject transport-mode packets through tasklet This is an old bugbear of mine: https://www.mail-archive.com/netdev@vger.kernel.org/msg03894.html By crafting special packets, it is possible to cause recursion in our kernel when processing transport-mode packets at levels that are only limited by packet size. The easiest one is with DNAT, but an even worse one is where UDP encapsulation is used in which case you just have to insert an UDP encapsulation header in between each level of recursion. This patch avoids this problem by reinjecting tranport-mode packets through a tasklet. Fixes: `b05e106698` ("[IPV4/6]: Netfilter IPsec input hooks") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-19 08:23:21 +01:00
Xiongwei Song	c060bc6115	bpf: make function xdp_do_generic_redirect_map() static The function xdp_do_generic_redirect_map() is only used in this file, so make it static. Clean up sparse warning: net/core/filter.c:2687:5: warning: no previous prototype for 'xdp_do_generic_redirect_map' [-Wmissing-prototypes] Signed-off-by: Xiongwei Song <sxwjean@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2017-12-19 01:37:16 +01:00
William Tu	d91e8db5b6	net: erspan: reload pointer after pskb_may_pull pskb_may_pull() can change skb->data, so we need to re-load pkt_md and ershdr at the right place. Fixes: `94d7d8f292` ("ip6_gre: add erspan v2 support") Fixes: `f551c91de2` ("net: erspan: introduce erspan v2 for ip_gre") Signed-off-by: William Tu <u9012063@gmail.com> Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 15:11:25 -05:00
William Tu	ae3e13373b	net: erspan: fix wrong return value If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM. Fixes: `94d7d8f292` ("ip6_gre: add erspan v2 support") Fixes: `f551c91de2` ("net: erspan: introduce erspan v2 for ip_gre") Signed-off-by: William Tu <u9012063@gmail.com> Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 15:11:25 -05:00
Samuel Mendoza-Jonas	75e8e15635	net/ncsi: Don't take any action on HNCDSC AEN The current HNCDSC handler takes the status flag from the AEN packet and will update or change the current channel based on this flag and the current channel status. However the flag from the HNCDSC packet merely represents the host link state. While the state of the host interface is potentially interesting information it should not affect the state of the NCSI link. Indeed the NCSI specification makes no mention of any recommended action related to the host network controller driver state. Update the HNCDSC handler to record the host network driver status but take no other action. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Acked-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 14:50:11 -05:00
Nikolay Aleksandrov	84aeb437ab	net: bridge: fix early call to br_stp_change_bridge_id and plug newlink leaks The early call to br_stp_change_bridge_id in bridge's newlink can cause a memory leak if an error occurs during the newlink because the fdb entries are not cleaned up if a different lladdr was specified, also another minor issue is that it generates fdb notifications with ifindex = 0. Another unrelated memory leak is the bridge sysfs entries which get added on NETDEV_REGISTER event, but are not cleaned up in the newlink error path. To remove this special case the call to br_stp_change_bridge_id is done after netdev register and we cleanup the bridge on changelink error via br_dev_delete to plug all leaks. This patch makes netlink bridge destruction on newlink error the same as dellink and ioctl del which is necessary since at that point we have a fully initialized bridge device. To reproduce the issue: $ ip l add br0 address 00:11:22:33:44:55 type bridge group_fwd_mask 1 RTNETLINK answers: Invalid argument $ rmmod bridge [ 1822.142525] ============================================================================= [ 1822.143640] BUG bridge_fdb_cache (Tainted: G O ): Objects remaining in bridge_fdb_cache on __kmem_cache_shutdown() [ 1822.144821] ----------------------------------------------------------------------------- [ 1822.145990] Disabling lock debugging due to kernel taint [ 1822.146732] INFO: Slab 0x0000000092a844b2 objects=32 used=2 fp=0x00000000fef011b0 flags=0x1ffff8000000100 [ 1822.147700] CPU: 2 PID: 13584 Comm: rmmod Tainted: G B O 4.15.0-rc2+ #87 [ 1822.148578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 1822.150008] Call Trace: [ 1822.150510] dump_stack+0x78/0xa9 [ 1822.151156] slab_err+0xb1/0xd3 [ 1822.151834] ? __kmalloc+0x1bb/0x1ce [ 1822.152546] __kmem_cache_shutdown+0x151/0x28b [ 1822.153395] shutdown_cache+0x13/0x144 [ 1822.154126] kmem_cache_destroy+0x1c0/0x1fb [ 1822.154669] SyS_delete_module+0x194/0x244 [ 1822.155199] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 1822.155773] entry_SYSCALL_64_fastpath+0x23/0x9a [ 1822.156343] RIP: 0033:0x7f929bd38b17 [ 1822.156859] RSP: 002b:00007ffd160e9a98 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0 [ 1822.157728] RAX: ffffffffffffffda RBX: 00005578316ba090 RCX: 00007f929bd38b17 [ 1822.158422] RDX: 00007f929bd9ec60 RSI: 0000000000000800 RDI: 00005578316ba0f0 [ 1822.159114] RBP: 0000000000000003 R08: 00007f929bff5f20 R09: 00007ffd160e8a11 [ 1822.159808] R10: 00007ffd160e9860 R11: 0000000000000202 R12: 00007ffd160e8a80 [ 1822.160513] R13: 0000000000000000 R14: 0000000000000000 R15: 00005578316ba090 [ 1822.161278] INFO: Object 0x000000007645de29 @offset=0 [ 1822.161666] INFO: Object 0x00000000d5df2ab5 @offset=128 Fixes: `30313a3d57` ("bridge: Handle IFLA_ADDRESS correctly when creating bridge device") Fixes: `5b8d5429da` ("bridge: netlink: register netdevice before executing changelink") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 13:29:01 -05:00
Xin Long	d196975905	sctp: add SCTP_CID_RECONF conversion in sctp_cname Whenever a new type of chunk is added, the corresp conversion in sctp_cname should be added. Otherwise, in some places, pr_debug will print it as "unknown chunk". Fixes: `cc16f00f65` ("sctp: add support for generating stream reconf ssn reset request chunk") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 13:21:46 -05:00
Xin Long	5c468674d1	sctp: fix the issue that a __u16 variable may overflow in sctp_ulpq_renege Now when reneging events in sctp_ulpq_renege(), the variable freed could be increased by a __u16 value twice while freed is of __u16 type. It means freed may overflow at the second addition. This patch is to fix it by using __u32 type for 'freed', while at it, also to remove 'if (chunk)' check, as all renege commands are generated in sctp_eat_data and it can't be NULL. Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 13:21:46 -05:00
Jon Maloy	3f42f5fe31	tipc: remove leaving group member from all lists A group member going into state LEAVING should never go back to any other state before it is finally deleted. However, this might happen if the socket needs to send out a RECLAIM message during this interval. Since we forget to remove the leaving member from the group's 'active' or 'pending' list, the member might be selected for reclaiming, change state to RECLAIMING, and get stuck in this state instead of being deleted. This might lead to suppression of the expected 'member down' event to the receiver. We fix this by removing the member from all lists, except the RB tree, at the moment it goes into state LEAVING. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 13:16:40 -05:00
Jon Maloy	234833991e	tipc: fix lost member events bug Group messages are not supposed to be returned to sender when the destination socket disappears. This is done correctly for regular traffic messages, by setting the 'dest_droppable' bit in the header. But we forget to do that in group protocol messages. This has the effect that such messages may sometimes bounce back to the sender, be perceived as a legitimate peer message, and wreak general havoc for the rest of the session. In particular, we have seen that a member in state LEAVING may go back to state RECLAIMED or REMITTED, hence causing suppression of an otherwise expected 'member down' event to the user. We fix this by setting the 'dest_droppable' bit even in group protocol messages. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 13:16:40 -05:00
David S. Miller	b36025b19a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2017-12-17 The following pull-request contains BPF updates for your net tree. The main changes are: 1) Fix a corner case in generic XDP where we have non-linear skbs but enough tailroom in the skb to not miss to linearizing there, from Song. 2) Fix BPF JIT bugs in s390x and ppc64 to not recache skb data when BPF context is not skb, from Daniel. 3) Fix a BPF JIT bug in sparc64 where recaching skb data after helper call would use the wrong register for the skb, from Daniel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-18 10:49:22 -05:00
Greg Kroah-Hartman	7f9d04bc56	Merge 4.15-rc4 into staging-next We want the staging fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-12-18 09:12:51 +01:00
Brendan McGrath	588753f1eb	ipv6: icmp6: Allow icmp messages to be looped back One example of when an ICMPv6 packet is required to be looped back is when a host acts as both a Multicast Listener and a Multicast Router. A Multicast Router will listen on address ff02::16 for MLDv2 messages. Currently, MLDv2 messages originating from a Multicast Listener running on the same host as the Multicast Router are not being delivered to the Multicast Router. This is due to dst.input being assigned the default value of dst_discard. This results in the packet being looped back but discarded before being delivered to the Multicast Router. This patch sets dst.input to ip6_input to ensure a looped back packet is delivered to the Multicast Router. Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-16 22:51:26 -05:00
David S. Miller	c30abd5e40	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Three sets of overlapping changes, two in the packet scheduler and one in the meson-gxl PHY driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-16 22:11:55 -05:00
Linus Torvalds	d025fbf1a2	NFS client fixes for Linux 4.15-rc4 Stable bugfixes: - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a commit in the case that there were no commit requests. - SUNRPC: Fix a race in the receive code path Other fixes: - NFS: Fix a deadlock in nfs client initialization - xprtrdma: Fix a performance regression for small IOs -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlo0PdMACgkQ18tUv7Cl QOvlUg/+KoXWXNwItHIyyegYgRXcAPpaCtdnCjjOP6R9HEJ+clnLcaqDxdDKVWQ/ oDvEcQcsBpywbUi7vVrvdar4mofwuyjXPpbcZPlDP1Ru4yyAlyylftwIuQW/nzdd vX2tZaVf+B9y1XvSD5NI+2EKWmp7MVrPdNhYxAB39TQZnAAvYDFHhywtZ0UR7vJt 7YVcZoPtKUhg15jhCOr73eaCT0884/tlgedfd6DkDGR6bCtSQC2PySfqq9Lnnl/1 ruDzzcgTARzSEzvta/uyBRspOLBHeeBhTdQUp79lMfekC4+68Tx6DFWnydIUttuE G7LphN6hfbJLF20U/ENb2H8v10WZsKvGEuxM+fp5PXGcIMSlX4qoJUe/egJFiiSL IaikgibvfiKmYSJvwdxTlOcr793X2Ej19HNciNjJQp4pviDOdZixgtGvVVHJBmh6 LYzE5q9jgbW9wQXwTTeWHp/nyqL80NslX0UARYnS2Ua0B96GRCESXqCUFtxK6tKR wbYiHzKc4dOfSxpNlKI+FlX63m5oSAmTEii3ODsWZjObbwYHNX2Zqj2cVFiSLCpv ZXgmpNL+tL2zBWxPvn6rzYhpaXo++PqlHK7vv2QVBI6XM2J8ztpj5Wr5zneRoJaE ejk8nw/mR43bfdQuUGZRKh/Z+FTqL0/2WbDgJMXl09c+zRz7J2c= =XhEC -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: "This has two stable bugfixes, one to fix a BUG_ON() when nfs_commit_inode() is called with no outstanding commit requests and another to fix a race in the SUNRPC receive codepath. Additionally, there are also fixes for an NFS client deadlock and an xprtrdma performance regression. Summary: Stable bugfixes: - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a commit in the case that there were no commit requests. - SUNRPC: Fix a race in the receive code path Other fixes: - NFS: Fix a deadlock in nfs client initialization - xprtrdma: Fix a performance regression for small IOs" * tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: Fix a race in the receive code path nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests xprtrdma: Spread reply processing over more CPUs nfs: fix a deadlock in nfs client initialization	2017-12-16 13:12:53 -08:00
Linus Torvalds	7a3c296ae0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot. 2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik Brueckner. 3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes Berg. 4) Add missing barriers to ptr_ring, from Michael S. Tsirkin. 5) Don't advertise gigabit in sh_eth when not available, from Thomas Petazzoni. 6) Check network namespace when delivering to netlink taps, from Kevin Cernekee. 7) Kill a race in raw_sendmsg(), from Mohamed Ghannam. 8) Use correct address in TCP md5 lookups when replying to an incoming segment, from Christoph Paasch. 9) Add schedule points to BPF map alloc/free, from Eric Dumazet. 10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also from Eric Dumazet. 11) Fix SKB leak in tipc, from Jon Maloy. 12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz. 13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn. 14) Add some new qmi_wwan device IDs, from Daniele Palmas. 15) Fix static key imbalance in ingress qdisc, from Jiri Pirko. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits) net: qcom/emac: Reduce timeout for mdio read/write net: sched: fix static key imbalance in case of ingress/clsact_init error net: sched: fix clsact init error path ip_gre: fix wrong return value of erspan_rcv net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support pkt_sched: Remove TC_RED_OFFLOADED from uapi net: sched: Move to new offload indication in RED net: sched: Add TCA_HW_OFFLOAD net: aquantia: Increment driver version net: aquantia: Fix typo in ethtool statistics names net: aquantia: Update hw counters on hw init net: aquantia: Improve link state and statistics check interval callback net: aquantia: Fill in multicast counter in ndev stats from hardware net: aquantia: Fill ndev stat couters from hardware net: aquantia: Extend stat counters to 64bit values net: aquantia: Fix hardware DMA stream overload on large MRRS net: aquantia: Fix actual speed capabilities reporting sock: free skb in skb_complete_tx_timestamp on error s390/qeth: update takeover IPs after configuration change s390/qeth: lock IP table while applying takeover changes ...	2017-12-15 13:08:37 -08:00
Jiri Pirko	b59e6979a8	net: sched: fix static key imbalance in case of ingress/clsact_init error Move static key increments to the beginning of the init function so they pair 1:1 with decrements in ingress/clsact_destroy, which is called in case ingress/clsact_init fails. Fixes: `6529eaba33` ("net: sched: introduce tcf block infractructure") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 15:43:12 -05:00
Jiri Pirko	343723dd51	net: sched: fix clsact init error path Since in qdisc_create, the destroy op is called when init fails, we don't do cleanup in init and leave it up to destroy. This fixes use-after-free when trying to put already freed block. Fixes: `6e40cf2d4d` ("net: sched: use extended variants of block_get/put in ingress and clsact qdiscs") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 15:43:12 -05:00
Trond Myklebust	90d91b0cd3	SUNRPC: Fix a race in the receive code path We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot race with the call to xprt_complete_rqst(). Reported-by: Chuck Lever <chuck.lever@oracle.com> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317 Fixes: `ce7c252a8c` ("SUNRPC: Add a separate spinlock to protect..") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-12-15 14:31:56 -05:00
Chuck Lever	ccede75985	xprtrdma: Spread reply processing over more CPUs Commit `d8f532d20e` ("xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion") introduced a performance regression for NFS I/O small enough to not need memory registration. In multi- threaded benchmarks that generate primarily small I/O requests, IOPS throughput is reduced by nearly a third. This patch restores the previous level of throughput. Because workqueues are typically BOUND (in particular ib_comp_wq, nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend to aggregate on the CPU that is handling Receive completions. The usual approach to addressing this problem is to create a QP and CQ for each CPU, and then schedule transactions on the QP for the CPU where you want the transaction to complete. The transaction then does not require an extra context switch during completion to end up on the same CPU where the transaction was started. This approach doesn't work for the Linux NFS/RDMA client because currently the Linux NFS client does not support multiple connections per client-server pair, and the RDMA core API does not make it straightforward for ULPs to determine which CPU is responsible for handling Receive completions for a CQ. So for the moment, record the CPU number in the rpcrdma_req before the transport sends each RPC Call. Then during Receive completion, queue the RPC completion on that same CPU. Additionally, move all RPC completion processing to the deferred handler so that even RPCs with simple small replies complete on the CPU that sent the corresponding RPC Call. Fixes: `d8f532d20e` ("xprtrdma: Invoke rpcrdma_reply_handler ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-12-15 14:31:50 -05:00
Haishuang Yan	c05fad5713	ip_gre: fix wrong return value of erspan_rcv If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM. Fixes: `84e54fe0a5` ("gre: introduce native tunnel support for ERSPAN") Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 14:10:39 -05:00
Xin Long	463118c34a	sctp: support sysctl to allow users to use stream interleave This is the last patch for support of stream interleave, after this patch, users could enable stream interleave by systcl -w net.sctp.intl_enable=1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:22 -05:00
Xin Long	107e242569	sctp: update mid instead of ssn when doing stream and asoc reset When using idata and doing stream and asoc reset, setting ssn with 0 could only clear the 1st 16 bits of mid. So to make this work for both data and idata, it sets mid with 0 instead of ssn, and also mid_uo for unordered idata also need to be cleared, as said in section 2.3.2 of RFC8260. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:22 -05:00
Xin Long	ef4775e340	sctp: add stream interleave support in stream scheduler As Marcelo said in the stream scheduler patch: Support for I-DATA chunks, also described in RFC8260, with user message interleaving is straightforward as it just requires the schedulers to probe for the feature and ignore datamsg boundaries when dequeueing. All needs to do is just to ignore datamsg boundaries when dequeueing. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:22 -05:00
Xin Long	de60fe9105	sctp: implement handle_ftsn for sctp_stream_interleave handle_ftsn is added as a member of sctp_stream_interleave, used to skip ssn for data or mid for idata, called for SCTP_CMD_PROCESS_FWDTSN cmd. sctp_handle_iftsn works for ifwdtsn, and sctp_handle_fwdtsn works for fwdtsn. Note that different from sctp_handle_fwdtsn, sctp_handle_iftsn could do stream abort pd. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:22 -05:00
Xin Long	47b20a8856	sctp: implement report_ftsn for sctp_stream_interleave report_ftsn is added as a member of sctp_stream_interleave, used to skip tsn from tsnmap, remove old events from reasm or lobby queue, and abort pd for data or idata, called for SCTP_CMD_REPORT_FWDTSN cmd and asoc reset. sctp_report_iftsn works for ifwdtsn, and sctp_report_fwdtsn works for fwdtsn. Note that sctp_report_iftsn doesn't do asoc abort_pd, as stream abort_pd will be done when handling ifwdtsn. But when ftsn is equal with ftsn, which means asoc reset, asoc abort_pd has to be done. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:22 -05:00
Xin Long	0fc2ea922c	sctp: implement validate_ftsn for sctp_stream_interleave validate_ftsn is added as a member of sctp_stream_interleave, used to validate ssn/chunk type for fwdtsn or mid (message id)/chunk type for ifwdtsn, called in sctp_sf_eat_fwd_tsn, just as validate_data. If this check fails, an abort packet will be sent, as said in section 2.3.1 of RFC8260. As ifwdtsn and fwdtsn chunks have different length, it also defines ftsn_chunk_len for sctp_stream_interleave to describe the chunk size. Then it replaces all sizeof(struct sctp_fwdtsn_chunk) with sctp_ftsnchk_len. It also adds the process for ifwdtsn in rx path. As Marcelo pointed out, there's no need to add event table for ifwdtsn, but just share prsctp_chunk_event_table with fwdtsn's. It would drop fwdtsn chunk for ifwdtsn and drop ifwdtsn chunk for fwdtsn by calling validate_ftsn in sctp_sf_eat_fwd_tsn. After this patch, the ifwdtsn can be accepted. Note that this patch also removes the sctp.intl_enable check for idata chunks in sctp_chunk_event_lookup, as it will do this check in validate_data later. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:22 -05:00
Xin Long	8e0c3b73ce	sctp: implement generate_ftsn for sctp_stream_interleave generate_ftsn is added as a member of sctp_stream_interleave, used to create fwdtsn or ifwdtsn chunk according to abandoned chunks, called in sctp_retransmit and sctp_outq_sack. sctp_generate_iftsn works for ifwdtsn, and sctp_generate_fwdtsn is still used for making fwdtsn. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:21 -05:00
Xin Long	2d07a49ade	sctp: add basic structures and make chunk function for ifwdtsn sctp_ifwdtsn_skip, sctp_ifwdtsn_hdr and sctp_ifwdtsn_chunk are used to define and parse I-FWD TSN chunk format, and sctp_make_ifwdtsn is a function to build the chunk. The I-FORWARD-TSN Chunk Format is defined in section 2.3.1 of RFC8260. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:52:21 -05:00
Yuval Mintz	428a68af3a	net: sched: Move to new offload indication in RED Let RED utilize the new internal flag, TCQ_F_OFFLOADED, to mark a given qdisc as offloaded instead of using a dedicated indication. Also, change internal logic into looking at said flag when possible. Fixes: `602f3baf22` ("net_sch: red: Add offload ability to RED qdisc") Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:35:36 -05:00
Yuval Mintz	7a4fa29106	net: sched: Add TCA_HW_OFFLOAD Qdiscs can be offloaded to HW, but current implementation isn't uniform. Instead, qdiscs either pass information about offload status via their TCA_OPTIONS or omit it altogether. Introduce a new attribute - TCA_HW_OFFLOAD that would form a uniform uAPI for the offloading status of qdiscs. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 13:35:36 -05:00
William Tu	94d7d8f292	ip6_gre: add erspan v2 support Similar to support for ipv4 erspan, this patch adds erspan v2 to ip6erspan tunnel. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 12:34:00 -05:00
William Tu	f551c91de2	net: erspan: introduce erspan v2 for ip_gre The patch adds support for erspan version 2. Not all features are supported in this patch. The SGT (security group tag), GRA (timestamp granularity), FT (frame type) are set to fixed value. Only hardware ID and direction are configurable. Optional subheader is also not supported. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 12:34:00 -05:00
William Tu	1d7e2ed22f	net: erspan: refactor existing erspan code The patch refactors the existing erspan implementation in order to support erspan version 2, which has additional metadata. So, in stead of having one 'struct erspanhdr' holding erspan version 1, breaks it into 'struct erspan_base_hdr' and 'struct erspan_metadata'. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 12:33:59 -05:00
Willem de Bruijn	35b99dffc3	sock: free skb in skb_complete_tx_timestamp on error skb_complete_tx_timestamp must ingest the skb it is passed. Call kfree_skb if the skb cannot be enqueued. Fixes: `b245be1f4d` ("net-timestamp: no-payload only sysctl") Fixes: `9ac25fc063` ("net: fix socket refcounting in skb_complete_tx_timestamp()") Reported-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 11:30:36 -05:00
Sven Eckelmann	ff15c27c97	batman-adv: Add kernel-doc to externally visible functions According to the kernel-doc documentation, externally visible functions should be documented. This refers to all all non-static function which can (and will) be used by functions in other sources files. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:24 +01:00
Sven Eckelmann	e57acf8e93	batman-adv: Add kernel-doc to functions in headers Externally visible functions should be documented with kernel-doc. This usually refers to non-static functions but also static inline files in headers are visible in other files and should therefore be documented. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:23 +01:00
Sven Eckelmann	73844a8c78	batman-adv: Add kernel-doc to enums in headers All enums in types.h are already documented. But some other headers still have private enums which also should be documented. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:22 +01:00
Sven Eckelmann	c93effcf72	batman-adv: Add kernel-doc to structs in headers All structs in types.h are already documented. But some other headers still have private structs which also should be documented. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:22 +01:00
Sven Eckelmann	a07369d7fb	batman-adv: Fix kernel-doc references to struct members The correct syntax to create references in kernel-doc to a struct member is not "struct_name::member"" but "&struct_name->member" or "&struct_name.member". The correct syntax is required to get the correct cross-referencing in the reStructuredText text output. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:21 +01:00
Sven Eckelmann	8b84cc4fb5	batman-adv: Use inline kernel-doc for enum/struct The inline kernel-doc comments make it easier to keep changes to the struct/enum synchronized with the documentation of the it. And it makes it easier for larger structures like struct batadv_priv to read the documentation inside the code. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:20 +01:00
Sven Eckelmann	7e9a8c2ce7	batman-adv: Use parentheses in function kernel-doc The documentation describing kernel-doc comments for functions ("How to format kernel-doc comments") uses parentheses at the end of the function name. Using this format allows to use a consistent style when adding documentation to a function and when referencing this function in a different kernel-doc section. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:20 +01:00
Sven Eckelmann	6a3038f07c	batman-adv: Add missing kernel-doc to packet.h Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:19 +01:00
Sven Eckelmann	eface060c2	batman-adv: Remove unused sched.h include The linux/wait.h include was removed with commit 421d988b2c08 ("batman-adv: Consolidate logging related functions"). The previously required (but not unused) linux/sched.h include can also be dropped now. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:18 +01:00
Sven Eckelmann	8cfba951a0	batman-adv: include kobject.h for kobject_* functions The linux/kobject.h provides the kobject_* function declarations and should therefore be included directly before they are used. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:18 +01:00
Sven Eckelmann	3a64469e4f	batman-adv: Include net.h for net_ratelimited_function The linux/net.h provides the net_ratelimited_function. It should therefore be included directly before it is used. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:17 +01:00
Sven Eckelmann	ecc36f5ee6	batman-adv: include build_bug.h for BUILD_BUG_ON define commit `bc6245e5ef` ("bug: split BUILD_BUG stuff out into <linux/build_bug.h>") added a new header for BUILD_BUG_ON. It should therefore be included instead of linux/bug.h Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:29:15 +01:00
Sven Eckelmann	b92b94ac73	batman-adv: include gfp.h for GFP_* defines The linux/gfp.h provides the GFP_ATOMIC and GFP_KERNEL define. It should therefore be included instead of linux/fs.h. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:24:10 +01:00
Sven Eckelmann	9969ffa85f	batman-adv: Add license header to Kconfig The last remaining file without license notice and/or SPDX license identifier under net/batman-adv/ is the Kconfig. It should have been licensed under the same conditions as the rest of batman-adv and the Makefile which uses the CONFIG_* variables from Kconfig. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:24:08 +01:00
Sven Eckelmann	7db7d9f369	batman-adv: Add SPDX license identifier above copyright header The "Linux kernel licensing rules" require that each file has a SPDX license identifier as first line (and sometimes as second line). The FSFE REUSE practices [1] would also require the same tags but have no restrictions on the placement in the source file. Using the "Linux kernel licensing rules" is therefore also fulfilling the FSFE REUSE practices requirements at the same time. [1] https://reuse.software/practices/ Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2017-12-15 17:22:45 +01:00
David S. Miller	8ce38aeb55	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-12-15 1) Currently we can add or update socket policies, but not clear them. Support clearing of socket policies too. From Lorenzo Colitti. 2) Add documentation for the xfrm device offload api. From Shannon Nelson. 3) Fix IPsec extended sequence numbers (ESN) for IPsec offloading. From Yossef Efraim. 4) xfrm_dev_state_add function returns success even for unsupported options, fix this to fail in such cases. From Yossef Efraim. 5) Remove a redundant xfrm_state assignment. From Aviv Heller. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 11:10:27 -05:00
David S. Miller	0f546ffcd0	Here are some batman-adv bugfixes: - Initialize the fragment headers, by Sven Eckelmann - Fix a NULL check in BATMAN V, by Sven Eckelmann - Fix kernel doc for the time_setup() change, by Sven Eckelmann - Use the right lock in BATMAN IV OGM Update, by Sven Eckelmann -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlozsZQWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZ1uD/94DURJf4b8kgUDjgMfuKZZyafH c4N8SI+3+gZExtgXbb48f91a9S5T7qIAm46c7nHNJpTLEhc/ucup3c06i8VjBc2r Tz37jnnjrHzfbw0p10zpLkdDlU0mLtEQBpNGQxru8h2x3gs2RBfDy/yX7MzzAp4m maZ3ub0wPAtlRKQ4OywTStP9ZP7BXj4/puCRrn3kPIKDc1Ahwh4dS4r4euSvvR6z 2Qkfnpb3sbzh1Sa526GNAwgEo6hpssOugQyCIZUK4YJh1OfjnS+xJHWmfBkLw7q9 OfkuwAjkZMivrcxK/lvdMJ6WltQUyWVkKxDuA6iAtwRV4tX5W5b+9OEToZlHuPid 11bDXDfmWTfnpYM7uTT7tRbjFpvJ9X4ftUNHHtlL+h+VRz8Uf0xcteBIi7FxHHm5 fRqMLj5J0WU5rVf2vKTz2XIi5dXWgy/bkoPzalMc8UxWVsKthr34UZmcPRDVvCja bwad1H+EEelx4tvPeWTEEhdhLFvalmfgRG3i/paqr547ETllmO462kdUTOmxDnvT 76GZVsLeHBPrnZi/tqc230EyoZijiM/l41twOZPjb0cklvOEFYR0vs/BUfmvdM/3 xlUF+CBaVnHcY3HDn7ywVICWP6jnG+LOQ4tOL6jjsKiqbEpPHRHuQ++MnZ2uRUgs 5e/yBVxdEjo9RoLVsg== =US7g -----END PGP SIGNATURE----- Merge tag 'batadv-net-for-davem-20171215' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - Initialize the fragment headers, by Sven Eckelmann - Fix a NULL check in BATMAN V, by Sven Eckelmann - Fix kernel doc for the time_setup() change, by Sven Eckelmann - Use the right lock in BATMAN IV OGM Update, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 11:02:11 -05:00
Sean Wang	f0af34317f	net: dsa: mediatek: combine MediaTek tag with VLAN tag In order to let MT7530 switch can recognize well those egress packets having both special tag and VLAN tag, the information about the special tag should be carried on the existing VLAN tag. On the other hand, it's unnecessary for extra handling for ingress packets when VLAN tag is present since it is able to put the VLAN tag after the special tag and then follow the existing way to parse. Signed-off-by: Sean Wang <sean.wang@mediatek.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-15 10:31:54 -05:00
Song Liu	2d17d8d79e	xdp: linearize skb in netif_receive_generic_xdp() In netif_receive_generic_xdp(), it is necessary to linearize all nonlinear skb. However, in current implementation, skb with troom <= 0 are not linearized. This patch fixes this by calling skb_linearize() for all nonlinear skb. Fixes: `de8f3a83b0` ("bpf: add meta pointer for direct access") Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2017-12-15 14:34:36 +01:00
Thiago Rafael Becker	bdcf0a423e	kernel: make groups_sort calling a responsibility group_info allocators In testing, we found that nfsd threads may call set_groups in parallel for the same entry cached in auth.unix.gid, racing in the call of groups_sort, corrupting the groups for that entry and leading to permission denials for the client. This patch: - Make groups_sort globally visible. - Move the call to groups_sort to the modifiers of group_info - Remove the call to groups_sort from set_groups Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-12-14 16:00:49 -08:00
Eric Dumazet	4688eb7cf3	tcp: refresh tcp_mstamp from timers callbacks Only the retransmit timer currently refreshes tcp_mstamp We should do the same for delayed acks and keepalives. Even if RFC 7323 does not request it, this is consistent to what linux did in the past, when TS values were based on jiffies. Fixes: `385e20706f` ("tcp: use tp->tcp_mstamp in output path") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Mike Maloney <maloney@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Mike Maloney <maloney@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 16:04:04 -05:00
Wei Wang	9ee11bd03c	tcp: fix potential underestimation on rcv_rtt When ms timestamp is used, current logic uses 1us in tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us. This could cause rcv_rtt underestimation. Fix it by always using a min value of 1ms if ms timestamp is used. Fixes: `645f4c6f2e` ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 16:01:17 -05:00
Yuchung Cheng	7268586baa	tcp: pause Fast Open globally after third consecutive timeout Prior to this patch, active Fast Open is paused on a specific destination IP address if the previous connections to the IP address have experienced recurring timeouts . But recent experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla browsers indicate the isssue is often caused by broken middle-boxes sitting close to the client. Therefore it is much better user experience if Fast Open is disabled out-right globally to avoid experiencing further timeouts on connections toward other destinations. This patch changes the destination-IP disablement to global disablement if a connection experiencing recurring timeouts or aborts due to timeout. Repeated incidents would still exponentially increase the pause time, starting from an hour. This is extremely conservative but an unfortunate compromise to minimize bad experience due to broken middle-boxes. Reported-by: Dragana Damjanovic <ddamjanovic@mozilla.com> Reported-by: Patrick McManus <mcmanus@ducksong.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 15:51:12 -05:00
Willem de Bruijn	8d74e9f88d	net: avoid skb_warn_bad_offload on IS_ERR skb_warn_bad_offload warns when packets enter the GSO stack that require skb_checksum_help or vice versa. Do not warn on arbitrary bad packets. Packet sockets can craft many. Syzkaller was able to demonstrate another one with eth_type games. In particular, suppress the warning when segmentation returns an error, which is for reasons other than checksum offload. See also commit `36c9247449` ("net: WARN if skb_checksum_help() is called on skb requiring segmentation") for context on this warning. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 15:14:10 -05:00
Nikolay Aleksandrov	eb7935830d	net: bridge: use rhashtable for fdbs Before this patch the bridge used a fixed 256 element hash table which was fine for small use cases (in my tests it starts to degrade above 1000 entries), but it wasn't enough for medium or large scale deployments. Modern setups have thousands of participants in a single bridge, even only enabling vlans and adding a few thousand vlan entries will cause a few thousand fdbs to be automatically inserted per participating port. So we need to scale the fdb table considerably to cope with modern workloads, and this patch converts it to use a rhashtable for its operations thus improving the bridge scalability. Tests show the following results (10 runs each), at up to 1000 entries rhashtable is ~3% slower, at 2000 rhashtable is 30% faster, at 3000 it is 2 times faster and at 30000 it is 50 times faster. Obviously this happens because of the properties of the two constructs and is expected, rhashtable keeps pretty much a constant time even with 10000000 entries (tested), while the fixed hash table struggles considerably even above 10000. As a side effect this also reduces the net_bridge struct size from 3248 bytes to 1344 bytes. Also note that the key struct is 8 bytes. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 15:10:01 -05:00
Eric Dumazet	ec94c2696f	tcp/dccp: avoid one atomic operation for timewait hashdance First, rename __inet_twsk_hashdance() to inet_twsk_hashdance() Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead of 4, but adding a fat warning that we do not have the right to access tw anymore after inet_twsk_hashdance() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 14:33:10 -05:00
Andy Shevchenko	22b371cbb9	Bluetooth: introduce DEFINE_SHOW_ATTRIBUTE() macro This macro deduplicates a lot of similar code across the hci_debugfs.c module. Targeting to be moved to seq_file.h eventually. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-12-13 20:20:34 +01:00
David S. Miller	d6da83813f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The follow patchset contains Netfilter fixes for your net tree, they are: 1) Fix compilation warning in x_tables with clang due to useless redundant reassignment, from Colin Ian King. 2) Add bugtrap to net_exit to catch uninitialized lists, patch from Vasily Averin. 3) Fix out of bounds memory reads in H323 conntrack helper, this comes with an initial patch to remove replace the obscure CHECK_BOUND macro as a dependency. From Eric Sesterhenn. 4) Reduce retransmission timeout when window is 0 in TCP conntrack, from Florian Westphal. 6) ctnetlink clamp timeout to INT_MAX if timeout is too large, otherwise timeout wraps around and it results in killing the entry that is being added immediately. 7) Missing CAP_NET_ADMIN checks in cthelper and xt_osf, due to no netns support. From Kevin Cernekee. 8) Missing maximum number of instructions checks in xt_bpf, patch from Jann Horn. 9) With no CONFIG_PROC_FS ipt_CLUSTERIP compilation breaks, patch from Arnd Bergmann. 10) Missing netlink attribute policy in nftables exthdr, from Florian Westphal. 11) Enable conntrack with IPv6 MASQUERADE rules, as `a357b3f80b` should have done in first place, from Konstantin Khlebnikov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 14:12:20 -05:00
Neal Cardwell	b4f70c3d4e	tcp: allow TLP in ECN CWR This patch enables tail loss probe in cwnd reduction (CWR) state to detect potential losses. Prior to this patch, since the sender uses PRR to determine the cwnd in CWR state, the combination of CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer() defers sending wait for more ACKs. The ACKs may not come due to packet losses. Disallowing TLP when there is unused cwnd had the primary effect of disallowing TLP when there is TSO deferral, Nagle deferral, or we hit the rwin limit. Because basically every application write() or incoming ACK will cause us to run tcp_write_xmit() to see if we can send more, and then if we sent something we call tcp_schedule_loss_probe() to see if we should schedule a TLP. At that point, there are a few common reasons why some cwnd budget could still be unused: (a) rwin limit (b) nagle check (c) TSO deferral (d) TSQ For (d), after the next packet tx completion the TSQ mechanism will allow us to send more packets, so we don't really need a TLP (in practice it shouldn't matter whether we schedule one or not). But for (a), (b), (c) the sender won't send any more packets until it gets another ACK. But if the whole flight was lost, or all the ACKs were lost, then we won't get any more ACKs, and ideally we should schedule and send a TLP to get more feedback. In particular for a long time we have wanted some kind of timer for TSO deferral, and at least this would give us some kind of timer Reported-by: Steve Ibanez <sibanez@stanford.edu> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Nandita Dukkipati <nanditad@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:59:21 -05:00
Cong Wang	039af9c66b	net_sched: switch to exit_batch for action pernet ops Since we now hold RTNL lock in tc_action_net_exit(), it is good to batch them to speedup tc action dismantle. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:58:41 -05:00
Kevin Cernekee	a46182b002	net: igmp: Use correct source address on IGMPv3 reports Closing a multicast socket after the final IPv4 address is deleted from an interface can generate a membership report that uses the source IP from a different interface. The following test script, run from an isolated netns, reproduces the issue: #!/bin/bash ip link add dummy0 type dummy ip link add dummy1 type dummy ip link set dummy0 up ip link set dummy1 up ip addr add 10.1.1.1/24 dev dummy0 ip addr add 192.168.99.99/24 dev dummy1 tcpdump -U -i dummy0 & socat EXEC:"sleep 2" \ UDP4-DATAGRAM:239.101.1.68:8889,ip-add-membership=239.0.1.68:10.1.1.1 & sleep 1 ip addr del 10.1.1.1/24 dev dummy0 sleep 5 kill %tcpdump RFC 3376 specifies that the report must be sent with a valid IP source address from the destination subnet, or from address 0.0.0.0. Add an extra check to make sure this is the case. Signed-off-by: Kevin Cernekee <cernekee@chromium.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:51:27 -05:00
Jon Maloy	c545a945d0	tipc: eliminate potential memory leak In the function tipc_sk_mcast_rcv() we call refcount_dec(&skb->users) on received sk_buffers. Since the reference counter might hit zero at this point, we have a potential memory leak. We fix this by replacing refcount_dec() with kfree_skb(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:44:36 -05:00
Pravin Shedge	83593010d3	net: remove duplicate includes These duplicate includes have been found with scripts/checkincludes.pl but they have been removed manually to avoid removing false positives. Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:18:46 -05:00
Eric Dumazet	b5476022bb	ipv4: igmp: guard against silly MTU values IPv4 stack reacts to changes to small MTU, by disabling itself under RTNL. But there is a window where threads not using RTNL can see a wrong device mtu. This can lead to surprises, in igmp code where it is assumed the mtu is suitable. Fix this by reading device mtu once and checking IPv4 minimal MTU. This patch adds missing IPV4_MIN_MTU define, to not abuse ETH_MIN_MTU anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:13:58 -05:00
Eric Dumazet	b9b312a7a4	ipv6: mcast: better catch silly mtu values syzkaller reported crashes in IPv6 stack [1] Xin Long found that lo MTU was set to silly values. IPv6 stack reacts to changes to small MTU, by disabling itself under RTNL. But there is a window where threads not using RTNL can see a wrong device mtu. This can lead to surprises, in mld code where it is assumed the mtu is suitable. Fix this by reading device mtu once and checking IPv6 minimal MTU. [1] skbuff: skb_over_panic: text:0000000010b86b8d len:196 put:20 head:000000003b477e60 data:000000000e85441e tail:0xd4 end:0xc0 dev:lo ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:104! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100 RSP: 0018:ffff8801db307508 EFLAGS: 00010286 RAX: 0000000000000082 RBX: ffff8801c517e840 RCX: 0000000000000000 RDX: 0000000000000082 RSI: 1ffff1003b660e61 RDI: ffffed003b660e95 RBP: ffff8801db307570 R08: 1ffff1003b660e23 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff85bd4020 R13: ffffffff84754ed2 R14: 0000000000000014 R15: ffff8801c4e26540 FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000463610 CR3: 00000001c6698000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> skb_over_panic net/core/skbuff.c:109 [inline] skb_put+0x181/0x1c0 net/core/skbuff.c:1694 add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695 add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817 mld_send_cr net/ipv6/mcast.c:1903 [inline] mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448 call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320 expire_timers kernel/time/timer.c:1357 [inline] __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686 __do_softirq+0x29d/0xbb2 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 [inline] irq_exit+0x1d3/0x210 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:540 [inline] smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-13 13:13:15 -05:00
Łukasz Rymanowski	9e1e9f20ca	Bluetooth: Add support to advertise when connected So far, kernel did not allow to advertise when there was a connection established. With this patch kernel does allow it if controller supports it. If controller supports non-connectable advertising when connected, then only non-connectable advertising instances will be advertised. Signed-off-by: Łukasz Rymanowski <lukasz.rymanowski@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-12-13 09:41:37 +01:00
Jaganath Kanakkassery	94386b6a5b	Bluetooth: Remove redundant disable_advertising() There is already __hci_req_disable_advertising() function for disabling, so use it. Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-12-13 00:28:42 +01:00
Andy Shevchenko	8a95079448	Bluetooth: Utilize %ph specifier Instead of open coding byte-by-byte printing, re-use %ph specifier. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-12-13 00:28:42 +01:00
Markus Elfring	1b259904a2	Bluetooth: Use common error handling code in bt_init() * Improve jump targets so that a bit of exception handling can be better reused at the end of this function. * Adjust five condition checks. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-12-13 00:28:40 +01:00
Christoph Paasch	30791ac419	tcp md5sig: Use skb's saddr when replying to an incoming segment The MD5-key that belongs to a connection is identified by the peer's IP-address. When we are in tcp_v4(6)_reqsk_send_ack(), we are replying to an incoming segment from tcp_check_req() that failed the seq-number checks. Thus, to find the correct key, we need to use the skb's saddr and not the daddr. This bug seems to have been there since quite a while, but probably got unnoticed because the consequences are not catastrophic. We will call tcp_v4_reqsk_send_ack only to send a challenge-ACK back to the peer, thus the connection doesn't really fail. Fixes: `9501f97229` ("tcp md5sig: Let the caller pass appropriate key for tcp_v{4,6}_do_calc_md5_hash().") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-12 11:15:42 -05:00
Eric Dumazet	c3916ad932	tcp: smoother receiver autotuning Back in linux-3.13 (commit `b0983d3c9b` ("tcp: fix dynamic right sizing")) I addressed the pressing issues we had with receiver autotuning. But DRS suffers from extra latencies caused by rcv_rtt_est.rtt_us drifts. One common problem happens during slow start, since the apparent RTT measured by the receiver can be inflated by ~50%, at the end of one packet train. Also, a single drop can delay read() calls by one RTT, meaning tcp_rcv_space_adjust() can be called one RTT too late. By replacing the tri-modal heuristic with a continuous function, we can offset the effects of not growing 'at the optimal time'. The curve of the function matches prior behavior if the space increased by 25% and 50% exactly. Cost of added multiply/divide is small, considering a TCP flow typically would run this part of the code few times in its life. I tested this patch with 100 ms RTT / 1% loss link, 100 runs of (netperf -l 5), and got an average throughput of 4600 Mbit instead of 1700 Mbit. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-12 10:53:04 -05:00
Eric Dumazet	607065bad9	tcp: avoid integer overflows in tcp_rcv_space_adjust() When using large tcp_rmem[2] values (I did tests with 500 MB), I noticed overflows while computing rcvwin. Lets fix this before the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-12 10:53:04 -05:00
Eric Dumazet	02db55718d	tcp: do not overshoot window_clamp in tcp_rcv_space_adjust() While rcvbuf is properly clamped by tcp_rmem[2], rcvwin is left to a potentially too big value. It has no serious effect, since : 1) tcp_grow_window() has very strict checks. 2) window_clamp can be mangled by user space to any value anyway. tcp_init_buffer_space() and companions use tcp_full_space(), we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-12 10:53:03 -05:00
Florian Westphal	d2950278d2	xfrm: put policies when reusing pcpu xdst entry We need to put the policies when re-using the pcpu xdst entry, else this leaks the reference. Fixes: `ec30d78c14` ("xfrm: add xdst pcpu cache") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-12-12 06:39:05 +01:00
Xin Long	2342b8d95b	sctp: make sure stream nums can match optlen in sctp_setsockopt_reset_streams Now in sctp_setsockopt_reset_streams, it only does the check optlen < sizeof(*params) for optlen. But it's not enough, as params->srs_number_streams should also match optlen. If the streams in params->srs_stream_list are less than stream nums in params->srs_number_streams, later when dereferencing the stream list, it could cause a slab-out-of-bounds crash, as reported by syzbot. This patch is to fix it by also checking the stream numbers in sctp_setsockopt_reset_streams to make sure at least it's not greater than the streams in the list. Fixes: `7f9d68ac94` ("sctp: implement sender-side procedures for SSN Reset Request Parameter") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 14:08:21 -05:00
Mohamed Ghannam	8f659a03a0	net: ipv4: fix for a race condition in raw_sendmsg inet->hdrincl is racy, and could lead to uninitialized stack pointer usage, so its value should be read only once. Fixes: `c008ba5bdc` ("ipv4: Avoid reading user iov twice after raw_probe_proto_opt") Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 14:05:31 -05:00
Paul E. McKenney	1dfa55e019	Merge branches 'cond_resched.2017.12.04a', 'dyntick.2017.11.28a', 'fixes.2017.12.11a', 'srbd.2017.12.05a' and 'torture.2017.12.11a' into HEAD cond_resched.2017.12.04a: Convert cond_resched_rcu_qs() to cond_resched() dyntick.2017.11.28a: Make RCU dynticks handle interrupts from NMI fixes.2017.12.11a: Miscellaneous fixes srbd.2017.12.05a: Remove now-redundant smp_read_barrier_depends() torture.2017.12.11a: Torture-testing update	2017-12-11 09:21:58 -08:00
Kevin Cernekee	93c647643b	netlink: Add netns check on taps Currently, a nlmon link inside a child namespace can observe systemwide netlink activity. Filter the traffic so that nlmon can only sniff netlink messages from its own netns. Test case: vpnns -- bash -c "ip link add nlmon0 type nlmon; \ ip link set nlmon0 up; \ tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" & sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \ spi 0x1 mode transport \ auth sha1 0x6162633132330000000000000000000000000000 \ enc aes 0x00000000000000000000000000000000 grep --binary abc123 /tmp/nlmon.pcap Signed-off-by: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:58:18 -05:00
Xin Long	132282386f	sctp: add support for the process of unordered idata Unordered idata process is more complicated than unordered data: - It has to add mid into sctp_stream_out to save the next mid value, which is separated from ordered idata's. - To support pd for unordered idata, another mid and pd_mode need to be added to save the message id and pd state in sctp_stream_in. - To make unordered idata reasm easier, it adds a new event queue to save frags for idata. The patch mostly adds the samilar reasm functions for unordered idata as ordered idata's, and also adjusts some other codes on assign_mid, abort_pd and ulpevent_data for idata. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:05 -05:00
Xin Long	65f5e35783	sctp: implement abort_pd for sctp_stream_interleave abort_pd is added as a member of sctp_stream_interleave, used to abort partial delivery for data or idata, called in sctp_cmd_assoc_failed. Since stream interleave allows to do partial delivery for each stream at the same time, sctp_intl_abort_pd for idata would be very different from the old function sctp_ulpq_abort_pd for data. Note that sctp_ulpevent_make_pdapi will support per stream in this patch by adding pdapi_stream and pdapi_seq in sctp_pdapi_event, as described in section 6.1.7 of RFC6458. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:05 -05:00
Xin Long	be4e0ce10d	sctp: implement start_pd for sctp_stream_interleave start_pd is added as a member of sctp_stream_interleave, used to do partial_delivery for data or idata when datalen >= asoc->rwnd in sctp_eat_data. The codes have been done in last patches, but they need to be extracted into start_pd, so that it could be used for SCTP_CMD_PART_DELIVER cmd as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:05 -05:00
Xin Long	94014e8d87	sctp: implement renege_events for sctp_stream_interleave renege_events is added as a member of sctp_stream_interleave, used to renege some old data or idata in reasm or lobby queue properly to free some memory for the new data when there's memory stress. It defines sctp_renege_events for idata, and leaves sctp_ulpq_renege as it is for data. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:05 -05:00
Xin Long	9162e0ed9e	sctp: implement enqueue_event for sctp_stream_interleave enqueue_event is added as a member of sctp_stream_interleave, used to enqueue either data, idata or notification events into user socket rx queue. It replaces sctp_ulpq_tail_event used in the other places with enqueue_event. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:05 -05:00
Xin Long	bd4d627dbd	sctp: implement ulpevent_data for sctp_stream_interleave ulpevent_data is added as a member of sctp_stream_interleave, used to do the most process in ulpq, including to convert data or idata chunk to event, reasm them in reasm queue and put them in lobby queue in right order, and deliver them up to user sk rx queue. This procedure is described in section 2.2.3 of RFC8260. It adds most functions for idata here to do the similar process as the old functions for data. But since the details are very different between them, the old functions can not be reused for idata. event->ssn and event->ppid settings are moved to ulpevent_data from sctp_ulpevent_make_rcvmsg, so that sctp_ulpevent_make_rcvmsg could work for both data and idata. Note that mid is added in sctp_ulpevent for idata, __packed has to be used for defining sctp_ulpevent, or it would exceeds the skb cb that saves a sctp_ulpevent variable for ulp layer process. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:05 -05:00
Xin Long	9d4ceaf154	sctp: implement validate_data for sctp_stream_interleave validate_data is added as a member of sctp_stream_interleave, used to validate ssn/chunk type for data or mid (message id)/chunk type for idata, called in sctp_eat_data. If this check fails, an abort packet will be sent, as said in section 2.2.3 of RFC8260. It also adds the process for idata in rx path. As Marcelo pointed out, there's no need to add event table for idata, but just share chunk_event_table with data's. It would drop data chunk for idata and drop idata chunk for data by calling validate_data in sctp_eat_data. As last patch did, it also replaces sizeof(struct sctp_data_chunk) with sctp_datachk_len for rx path. After this patch, the idata can be accepted and delivered to ulp layer. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:04 -05:00
Xin Long	668c9beb90	sctp: implement assign_number for sctp_stream_interleave assign_number is added as a member of sctp_stream_interleave, used to assign ssn for data or mid (message id) for idata, called in sctp_packet_append_data. sctp_chunk_assign_ssn is left as it is, and sctp_chunk_assign_mid is added for sctp_stream_interleave_1. This procedure is described in section 2.2.2 of RFC8260. All sizeof(struct sctp_data_chunk) in tx path is replaced with sctp_datachk_len, to make it right for idata as well. And also adjust sctp_chunk_is_data for SCTP_CID_I_DATA. After this patch, idata can be built and sent in tx path. Note that if sp strm_interleave is set, it has to wait_connect in sctp_sendmsg, as asoc intl_enable need to be known after 4 shake- hands, to decide if it should use data or idata later. data and idata can't be mixed to send in one asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:04 -05:00
Xin Long	0c3f6f6554	sctp: implement make_datafrag for sctp_stream_interleave To avoid hundreds of checks for the different process on I-DATA chunk, struct sctp_stream_interleave is defined as a group of functions used to replace the codes in some place where it needs to do different job according to if the asoc intl_enabled is set. With these ops, it only needs to initialize asoc->stream.si with sctp_stream_interleave_0 for normal data if asoc intl_enable is 0, or sctp_stream_interleave_1 for idata if asoc intl_enable is set in sctp_stream_init. After that, the members in asoc->stream.si can be used directly in some special places without checking asoc intl_enable. make_datafrag is the first member for sctp_stream_interleave, it's used to make data or idata frags, called in sctp_datamsg_from_user. The old function sctp_make_datafrag_empty needs to be adjust some to fit in this ops. Note that as idata and data chunks have different length, it also defines data_chunk_len for sctp_stream_interleave to describe the chunk size. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:04 -05:00
Xin Long	ad05a7a05e	sctp: add basic structures and make chunk function for idata sctp_idatahdr and sctp_idata_chunk are used to define and parse I-DATA chunk format, and sctp_make_idata is a function to build the chunk. The I-DATA Chunk Format is defined in section 2.1 of RFC8260. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:04 -05:00
Xin Long	96b120b3c1	sctp: add asoc intl_enable negotiation during 4 shakehands asoc intl_enable will be set when local sp strm_interleave is set and there's I-DATA chunk in init and init_ack extensions, as said in section 2.2.1 of RFC8260. asoc intl_enable indicates all data will be sent as I-DATA chunks. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:04 -05:00
Xin Long	772a58693f	sctp: add stream interleave enable members and sockopt This patch adds intl_enable in asoc and netns, and strm_interleave in sctp_sock to indicate if stream interleave is enabled and supported. netns intl_enable would be set via procfs, but that is not added yet until all stream interleave codes are completely implemented; asoc intl_enable will be set when doing 4-shakehands. sp strm_interleave can be set by sockopt SCTP_INTERLEAVING_SUPPORTED which is also added in this patch. This socket option is defined in section 4.3.1 of RFC8260. Note that strm_interleave can only be set by sockopt when both netns intl_enable and sp frag_interleave are set. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 11:23:04 -05:00
Konstantin Khlebnikov	23715275e4	netfilter: ip6t_MASQUERADE: add dependency on conntrack module After commit `4d3a57f23d` ("netfilter: conntrack: do not enable connection tracking unless needed") conntrack is disabled by default unless some module explicitly declares dependency in particular network namespace. Fixes: `a357b3f80b` ("netfilter: nat: add dependencies on conntrack module") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-12-11 17:04:50 +01:00
Cong Wang	b1042d3563	netlink: convert netlink tap spinlock to mutex Both netlink_add_tap() and netlink_remove_tap() are called in process context, no need to bother spinlock. Note, in fact, currently we always hold RTNL when calling these two functions, so we don't need any other lock at all, but keeping this lock doesn't harm anything. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 10:56:55 -05:00
Cong Wang	25e3f70fcb	netlink: make netlink tap per netns nlmon device is not supposed to capture netlink events from other netns, so instead of filtering events, we can simply make netlink tap itself per netns. Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 10:56:55 -05:00
Tom Herbert	97a6ec4ac0	rhashtable: Change rhashtable_walk_start to return void Most callers of rhashtable_walk_start don't care about a resize event which is indicated by a return value of -EAGAIN. So calls to rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something like this is common: ret = rhashtable_walk_start(rhiter); if (ret && ret != -EAGAIN) goto out; Since zero and -EAGAIN are the only possible return values from the function this check is pointless. The condition never evaluates to true. This patch changes rhashtable_walk_start to return void. This simplifies code for the callers that ignore -EAGAIN. For the few cases where the caller cares about the resize event, particularly where the table can be walked in mulitple parts for netlink or seq file dump, the function rhashtable_walk_start_check has been added that returns -EAGAIN on a resize event. Signed-off-by: Tom Herbert <tom@quantonium.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 09:58:38 -05:00
Stephen Hemminger	a0b586fa75	rtnetlink: fix typo in GSO max segments Fixes: `46e6b992c2` ("rtnetlink: allow GSO maximums to be set on device creation") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 09:45:59 -05:00
David S. Miller	f0f1d0166b	Three fixes: * for certificate C file generation, don't use hexdump as it's not always installed by default, use pure posix instead (od/sed) * for certificate C file generation, don't write the file if anything fails, so the build abort will not cause a bad build upon a second attempt * fix locking in ieee80211_sta_tear_down_BA_sessions() which had been causing lots of locking warnings -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlouVRAACgkQB8qZga/f l8RRdg/+J6dq+Y5WmdRVniKh+XR+6wu1nMLh5LcTKhhh3e1yv2No8T3V8CxT4eGC YEsqjoJT4MF8WMN/qx6DqqKi8jRPuUEUIoFsM0Joz8EBXkkS3lC0Rnzp1ZlMXHwD aJ+aY5SrQL9isVN50traO6DPbr0fXiy4af8XLty49lmFdAO0OdxwMFu6WhkS24ej g7bMMW8MqHgytolhWrpaiAEcj1wa2kTyzGXmAvv0IztxLrJyRNljnJwnme3vN1b7 E5niCdTbnHkifxHIZqgX2u5Vhn26c+kkhom6WJGv3TTsVQGeERbT3qQHksCN2sNf /mgiLgEQp9XW8PejYgTrGlalIX5dr2E5mU0kXYIiG1xf8Q0BwZzDRiSqZD9p5dYW NSbnK+ZmHc+5WyvdNZqv09rTKRFeq8QUwy53MOFeD2MMHqjn1KfUaqm+d2AGbPn0 Trm1Vpk0NFvBvEhpfZrauye7iZkXWYdasaZDRdmRkkiWTW9tgtSH3BTUQIaRbbxh sgE+oY1yZxle/1atLQV70Kku7yFChjPc6qEfFEobygWzl+wbyu0ipBvCD7gyEJKy yPQ4kEJLqY4Vh4A/riprrjyD/06QTVUF5PgBeTfLgTI5NJ1Z4Dfw51Y65fr3stb6 QvUj+eJiVebBRNP4C4CjLA4n65QlYFTJWQnNyreLqkjlB8tob64= =Mgd9 -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2017-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Three fixes: * for certificate C file generation, don't use hexdump as it's not always installed by default, use pure posix instead (od/sed) * for certificate C file generation, don't write the file if anything fails, so the build abort will not cause a bad build upon a second attempt * fix locking in ieee80211_sta_tear_down_BA_sessions() which had been causing lots of locking warnings ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-11 09:39:14 -05:00
Florian Westphal	f5b5702ac5	netfilter: exthdr: add missign attributes to policy Add missing netlink attribute policy. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-12-11 13:46:04 +01:00
Toke Høiland-Jørgensen	b0d52ad821	mac80211: Add airtime account and scheduling to TXQs This adds airtime accounting and scheduling to the mac80211 TXQ scheduler. A new hardware flag, AIRTIME_ACCOUNTING, is added that drivers can set if they support reporting airtime usage of transmissions. When this flag is set, mac80211 will expect the actual airtime usage to be reported in the tx_time and rx_time fields of the respective status structs. When airtime information is present, mac80211 will schedule TXQs (through ieee80211_next_txq()) in a way that enforces airtime fairness between active stations. This scheduling works the same way as the ath9k in-driver airtime fairness scheduling. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-11 12:40:24 +01:00
Toke Høiland-Jørgensen	e937b8da5a	mac80211: Add TXQ scheduling API This adds an API to mac80211 to handle scheduling of TXQs and changes the interface between driver and mac80211 for TXQ handling as follows: - The wake_tx_queue callback interface no longer includes the TXQ. Instead, the driver is expected to retrieve that from ieee80211_next_txq() - Two new mac80211 functions are added: ieee80211_next_txq() and ieee80211_schedule_txq(). The former returns the next TXQ that should be scheduled, and is how the driver gets a queue to pull packets from. The latter is called internally by mac80211 to start scheduling a queue, and the driver is supposed to call it to re-schedule the TXQ after it is finished pulling packets from it (unless the queue emptied). The ath9k and ath10k drivers are changed to use the new API. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-11 12:37:51 +01:00
Johannes Berg	4564b187c1	nl80211: fix nl80211_send_iface() error paths Evidently I introduced a locking bug in my change here, the nla_put_failure sometimes needs to unlock. Fix it. Fixes: `44905265bc` ("nl80211: don't expose wdev->ssid for most interfaces") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-11 12:33:47 +01:00
David Spinadel	9de18d8186	mac80211: Add MIC space only for TX key option Add a key flag to indicates that the device only needs MIC space and not a real MIC. In such cases, keep the MIC zeroed for ease of debug. Signed-off-by: David Spinadel <david.spinadel@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-12-11 12:20:17 +01:00

... 7 8 9 10 11 ...

49975 Commits