linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Chuck Lever	7dbb53baed	sunrpc: Simplify do_enqueue tracing There are three cases where svc_xprt_do_enqueue() returns without waking an nfsd thread: 1. There is no work to do 2. The transport is already busy 3. There are no available nfsd threads Only 3. is truly interesting. Move the trace point so it records that there was work to do and either an nfsd thread was awoken, or a free one could not found. As an additional clean up, remove a redundant comment and a couple of dprintk call sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-04-03 15:08:11 -04:00
Chuck Lever	caa3e106dc	sunrpc: Move trace_svc_xprt_dequeue() Reduce the amount of noise generated by trace_svc_xprt_dequeue by moving it to the end of svc_get_next_xprt. This generates exactly one trace event when a ready xprt is found, rather than spurious events when there is no work to do. The empty events contain no information that can't be obtained simply by tracing function calls to svc_xprt_dequeue. A small additional benefit is simplification of the svc_xprt_event trace class, which no longer has to handle the case when the @xprt parameter is NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-04-03 15:08:10 -04:00
Chuck Lever	989f881ebf	svc: Simplify ->xpo_secure_port Clean up: Instead of returning a value that is used to set or clear a bit, just make ->xpo_secure_port mangle that bit, and return void. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-04-03 15:08:09 -04:00
Chuck Lever	63a1b15693	sunrpc: Remove unneeded pointer dereference Clean up: Noticed during code inspection that there is already a local automatic variable "xprt" so dereferencing rqst->rq_xprt again is unnecessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-04-03 15:08:09 -04:00
Szymon Janc	082f2300cf	Bluetooth: Fix connection if directed advertising and privacy is used Local random address needs to be updated before creating connection if RPA from LE Direct Advertising Report was resolved in host. Otherwise remote device might ignore connection request due to address mismatch. This was affecting following qualification test cases: GAP/CONN/SCEP/BV-03-C, GAP/CONN/GCEP/BV-05-C, GAP/CONN/DCEP/BV-05-C Before patch: < HCI Command: LE Set Random Address (0x08\|0x0005) plen 6 #11350 [hci0] 84680.231216 Address: 56:BC:E8:24:11:68 (Resolvable) Identity type: Random (0x01) Identity: F2:F1:06:3D:9C:42 (Static) > HCI Event: Command Complete (0x0e) plen 4 #11351 [hci0] 84680.246022 LE Set Random Address (0x08\|0x0005) ncmd 1 Status: Success (0x00) < HCI Command: LE Set Scan Parameters (0x08\|0x000b) plen 7 #11352 [hci0] 84680.246417 Type: Passive (0x00) Interval: 60.000 msec (0x0060) Window: 30.000 msec (0x0030) Own address type: Random (0x01) Filter policy: Accept all advertisement, inc. directed unresolved RPA (0x02) > HCI Event: Command Complete (0x0e) plen 4 #11353 [hci0] 84680.248854 LE Set Scan Parameters (0x08\|0x000b) ncmd 1 Status: Success (0x00) < HCI Command: LE Set Scan Enable (0x08\|0x000c) plen 2 #11354 [hci0] 84680.249466 Scanning: Enabled (0x01) Filter duplicates: Enabled (0x01) > HCI Event: Command Complete (0x0e) plen 4 #11355 [hci0] 84680.253222 LE Set Scan Enable (0x08\|0x000c) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 18 #11356 [hci0] 84680.458387 LE Direct Advertising Report (0x0b) Num reports: 1 Event type: Connectable directed - ADV_DIRECT_IND (0x01) Address type: Random (0x01) Address: 53:38:DA:46:8C:45 (Resolvable) Identity type: Public (0x00) Identity: 11:22:33:44:55:66 (OUI 11-22-33) Direct address type: Random (0x01) Direct address: 7C:D6:76:8C:DF:82 (Resolvable) Identity type: Random (0x01) Identity: F2:F1:06:3D:9C:42 (Static) RSSI: -74 dBm (0xb6) < HCI Command: LE Set Scan Enable (0x08\|0x000c) plen 2 #11357 [hci0] 84680.458737 Scanning: Disabled (0x00) Filter duplicates: Disabled (0x00) > HCI Event: Command Complete (0x0e) plen 4 #11358 [hci0] 84680.469982 LE Set Scan Enable (0x08\|0x000c) ncmd 1 Status: Success (0x00) < HCI Command: LE Create Connection (0x08\|0x000d) plen 25 #11359 [hci0] 84680.470444 Scan interval: 60.000 msec (0x0060) Scan window: 60.000 msec (0x0060) Filter policy: White list is not used (0x00) Peer address type: Random (0x01) Peer address: 53:38:DA:46:8C:45 (Resolvable) Identity type: Public (0x00) Identity: 11:22:33:44:55:66 (OUI 11-22-33) Own address type: Random (0x01) Min connection interval: 30.00 msec (0x0018) Max connection interval: 50.00 msec (0x0028) Connection latency: 0 (0x0000) Supervision timeout: 420 msec (0x002a) Min connection length: 0.000 msec (0x0000) Max connection length: 0.000 msec (0x0000) > HCI Event: Command Status (0x0f) plen 4 #11360 [hci0] 84680.474971 LE Create Connection (0x08\|0x000d) ncmd 1 Status: Success (0x00) < HCI Command: LE Create Connection Cancel (0x08\|0x000e) plen 0 #11361 [hci0] 84682.545385 > HCI Event: Command Complete (0x0e) plen 4 #11362 [hci0] 84682.551014 LE Create Connection Cancel (0x08\|0x000e) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 19 #11363 [hci0] 84682.551074 LE Connection Complete (0x01) Status: Unknown Connection Identifier (0x02) Handle: 0 Role: Master (0x00) Peer address type: Public (0x00) Peer address: 00:00:00:00:00:00 (OUI 00-00-00) Connection interval: 0.00 msec (0x0000) Connection latency: 0 (0x0000) Supervision timeout: 0 msec (0x0000) Master clock accuracy: 0x00 After patch: < HCI Command: LE Set Scan Parameters (0x08\|0x000b) plen 7 #210 [hci0] 667.152459 Type: Passive (0x00) Interval: 60.000 msec (0x0060) Window: 30.000 msec (0x0030) Own address type: Random (0x01) Filter policy: Accept all advertisement, inc. directed unresolved RPA (0x02) > HCI Event: Command Complete (0x0e) plen 4 #211 [hci0] 667.153613 LE Set Scan Parameters (0x08\|0x000b) ncmd 1 Status: Success (0x00) < HCI Command: LE Set Scan Enable (0x08\|0x000c) plen 2 #212 [hci0] 667.153704 Scanning: Enabled (0x01) Filter duplicates: Enabled (0x01) > HCI Event: Command Complete (0x0e) plen 4 #213 [hci0] 667.154584 LE Set Scan Enable (0x08\|0x000c) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 18 #214 [hci0] 667.182619 LE Direct Advertising Report (0x0b) Num reports: 1 Event type: Connectable directed - ADV_DIRECT_IND (0x01) Address type: Random (0x01) Address: 50:52:D9:A6:48:A0 (Resolvable) Identity type: Public (0x00) Identity: 11:22:33:44:55:66 (OUI 11-22-33) Direct address type: Random (0x01) Direct address: 7C:C1:57:A5:B7:A8 (Resolvable) Identity type: Random (0x01) Identity: F4:28:73:5D:38:B0 (Static) RSSI: -70 dBm (0xba) < HCI Command: LE Set Scan Enable (0x08\|0x000c) plen 2 #215 [hci0] 667.182704 Scanning: Disabled (0x00) Filter duplicates: Disabled (0x00) > HCI Event: Command Complete (0x0e) plen 4 #216 [hci0] 667.183599 LE Set Scan Enable (0x08\|0x000c) ncmd 1 Status: Success (0x00) < HCI Command: LE Set Random Address (0x08\|0x0005) plen 6 #217 [hci0] 667.183645 Address: 7C:C1:57:A5:B7:A8 (Resolvable) Identity type: Random (0x01) Identity: F4:28:73:5D:38:B0 (Static) > HCI Event: Command Complete (0x0e) plen 4 #218 [hci0] 667.184590 LE Set Random Address (0x08\|0x0005) ncmd 1 Status: Success (0x00) < HCI Command: LE Create Connection (0x08\|0x000d) plen 25 #219 [hci0] 667.184613 Scan interval: 60.000 msec (0x0060) Scan window: 60.000 msec (0x0060) Filter policy: White list is not used (0x00) Peer address type: Random (0x01) Peer address: 50:52:D9:A6:48:A0 (Resolvable) Identity type: Public (0x00) Identity: 11:22:33:44:55:66 (OUI 11-22-33) Own address type: Random (0x01) Min connection interval: 30.00 msec (0x0018) Max connection interval: 50.00 msec (0x0028) Connection latency: 0 (0x0000) Supervision timeout: 420 msec (0x002a) Min connection length: 0.000 msec (0x0000) Max connection length: 0.000 msec (0x0000) > HCI Event: Command Status (0x0f) plen 4 #220 [hci0] 667.186558 LE Create Connection (0x08\|0x000d) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 19 #221 [hci0] 667.485824 LE Connection Complete (0x01) Status: Success (0x00) Handle: 0 Role: Master (0x00) Peer address type: Random (0x01) Peer address: 50:52:D9:A6:48:A0 (Resolvable) Identity type: Public (0x00) Identity: 11:22:33:44:55:66 (OUI 11-22-33) Connection interval: 50.00 msec (0x0028) Connection latency: 0 (0x0000) Supervision timeout: 420 msec (0x002a) Master clock accuracy: 0x07 @ MGMT Event: Device Connected (0x000b) plen 13 {0x0002} [hci0] 667.485996 LE Address: 11:22:33:44:55:66 (OUI 11-22-33) Flags: 0x00000000 Data length: 0 Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org	2018-04-03 16:12:56 +02:00
Linus Torvalds	642e7fd233	Merge branch 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux Pull removal of in-kernel calls to syscalls from Dominik Brodowski: "System calls are interaction points between userspace and the kernel. Therefore, system call functions such as sys_xyzzy() or compat_sys_xyzzy() should only be called from userspace via the syscall table, but not from elsewhere in the kernel. At least on 64-bit x86, it will likely be a hard requirement from v4.17 onwards to not call system call functions in the kernel: It is better to use use a different calling convention for system calls there, where struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands processing over to the actual syscall function. This means that only those parameters which are actually needed for a specific syscall are passed on during syscall entry, instead of filling in six CPU registers with random user space content all the time (which may cause serious trouble down the call chain). Those x86-specific patches will be pushed through the x86 tree in the near future. Moreover, rules on how data may be accessed may differ between kernel data and user data. This is another reason why calling sys_xyzzy() is generally a bad idea, and -- at most -- acceptable in arch-specific code. This patchset removes all in-kernel calls to syscall functions in the kernel with the exception of arch/. On top of this, it cleans up the three places where many syscalls are referenced or prototyped, namely kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h" * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits) bpf: whitelist all syscalls for error injection kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions kernel/sys_ni: sort cond_syscall() entries syscalls/x86: auto-create compat_sys_() prototypes syscalls: sort syscall prototypes in include/linux/compat.h net: remove compat_sys_() prototypes from net/compat.h syscalls: sort syscall prototypes in include/linux/syscalls.h kexec: move sys_kexec_load() prototype to syscalls.h x86/sigreturn: use SYSCALL_DEFINE0 x86: fix sys_sigreturn() return type to be long, not unsigned long x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm() mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead() mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff() mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64() fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate() fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate() fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid() kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare() ...	2018-04-02 21:22:12 -07:00
Dominik Brodowski	6df354653e	net: socket: add __compat_sys_...msg() helpers; remove in-kernel calls to compat syscalls Using the net-internal helpers __compat_sys_...msg() allows us to avoid the internal calls to the compat_sys_...msg() syscalls. compat_sys_recvmmsg() is handled in a different patch. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:20 +02:00
Dominik Brodowski	157b334aa8	net: socket: add __compat_sys_recvmmsg() helper; remove in-kernel call to compat syscall Using the net-internal helper __compat_sys_recvmmsg() allows us to avoid the internal calls to the compat_sys_recvmmsg() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:19 +02:00
Dominik Brodowski	8770cf4a58	net: socket: add __compat_sys_getsockopt() helper; remove in-kernel call to compat syscall Using the net-internal helper __compat_sys_getsockopt() allows us to avoid the internal calls to the compat_sys_getsockopt() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:19 +02:00
Dominik Brodowski	73ee3eafd5	net: socket: add __compat_sys_setsockopt() helper; remove in-kernel call to compat syscall Using the net-internal helper __compat_sys_setsockopt() allows us to avoid the internal calls to the compat_sys_setsockopt() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:18 +02:00
Dominik Brodowski	fd4e82f5b8	net: socket: add __compat_sys_recvfrom() helper; remove in-kernel call to compat syscall Using the net-internal helper __compat_sys_recvfrom() allows us to avoid the internal calls to the compat_sys_recvfrom() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:17 +02:00
Dominik Brodowski	d27e9afc64	net: socket: replace call to sys_recv() with __sys_recvfrom() sys_recv() merely expands the parameters to __sys_recvfrom() by NULL and NULL. Open-code this in the two places which used sys_recv() as a wrapper to __sys_recvfrom(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:16 +02:00
Dominik Brodowski	f3bf896b1d	net: socket: replace calls to sys_send() with __sys_sendto() sys_send() merely expands the parameters to __sys_sendto() by NULL and 0. Open-code this in the two places which used sys_send() as a wrapper to __sys_sendto(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:16 +02:00
Dominik Brodowski	e1834a329d	net: socket: move check for forbid_cmsg_compat to __sys_...msg() The non-compat codepaths for sys_...msg() verify that MSG_CMSG_COMPAT is not set. By moving this check to the __sys_...msg() functions (and making it dependent on a static flag passed to this function), we can call the __sys...msg() functions instead of the syscall functions in all cases. __sys_recvmmsg() does not need this trickery, as the check is handled within the do_sys_recvmmsg() function internal to net/socket.c. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:15 +02:00
Dominik Brodowski	1255e26906	net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall Using the net-internal helper do_sys_recvmmsg() allows us to avoid the internal calls to the sys_getsockopt() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:14 +02:00
Dominik Brodowski	13a2d70e2b	net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall Using the net-internal helper __sys_getsockopt() allows us to avoid the internal calls to the sys_getsockopt() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:13 +02:00
Dominik Brodowski	cc36dca0df	net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall Using the net-internal helper __sys_setsockopt() allows us to avoid the internal calls to the sys_setsockopt() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:13 +02:00
Dominik Brodowski	005a1aeac4	net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall Using the net-internal helper __sys_shutdown() allows us to avoid the internal calls to the sys_shutdown() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:12 +02:00
Dominik Brodowski	6debc8d834	net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall Using the net-internal helper __sys_socketpair() allows us to avoid the internal calls to the sys_socketpair() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:11 +02:00
Dominik Brodowski	b21c8f838a	net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall Using the net-internal helper __sys_getpeername() allows us to avoid the internal calls to the sys_getpeername() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:10 +02:00
Dominik Brodowski	8882a107b3	net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall Using the net-internal helper __sys_getsockname() allows us to avoid the internal calls to the sys_getsockname() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:09 +02:00
Dominik Brodowski	25e290eed9	net: socket: add __sys_listen() helper; remove in-kernel call to syscall Using the net-internal helper __sys_listen() allows us to avoid the internal calls to the sys_listen() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:09 +02:00
Dominik Brodowski	1387c2c2f9	net: socket: add __sys_connect() helper; remove in-kernel call to syscall Using the net-internal helper __sys_connect() allows us to avoid the internal calls to the sys_connect() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:08 +02:00
Dominik Brodowski	a87d35d87a	net: socket: add __sys_bind() helper; remove in-kernel call to syscall Using the net-internal helper __sys_bind() allows us to avoid the internal calls to the sys_bind() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:07 +02:00
Dominik Brodowski	9d6a15c3f2	net: socket: add __sys_socket() helper; remove in-kernel call to syscall Using the net-internal helper __sys_socket() allows us to avoid the internal calls to the sys_socket() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:06 +02:00
Dominik Brodowski	4541e80560	net: socket: add __sys_accept4() helper; remove in-kernel call to syscall Using the net-internal helper __sys_accept4() allows us to avoid the internal calls to the sys_accept4() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:06 +02:00
Dominik Brodowski	211b634b7f	net: socket: add __sys_sendto() helper; remove in-kernel call to syscall Using the net-internal helper __sys_sendto() allows us to avoid the internal calls to the sys_sendto() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:05 +02:00
Dominik Brodowski	7a09e1eb9c	net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall Using the net-internal helper __sys_recvfrom() allows us to avoid the internal calls to the sys_recvfrom() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:04 +02:00
Eric Dumazet	6e00f7dd5e	ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh I forgot to change ip6frag_low_thresh proc_handler from proc_dointvec_minmax to proc_doulongvec_minmax Fixes: `3e67f106f6` ("inet: frags: break the 2GB limit for frags storage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-02 10:10:50 -04:00
Luis Henriques	fb18a57568	ceph: quota: add initial infrastructure to support cephfs quotas This patch adds the infrastructure required to support cephfs quotas as it is currently implemented in the ceph fuse client. Cephfs quotas can be set on any directory, and can restrict the number of bytes or the number of files stored beneath that point in the directory hierarchy. Quotas are set using the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', and can be removed by setting these attributes to '0'. Link: http://tracker.ceph.com/issues/22372 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:51 +02:00
Chengguang Xu	57a35dfb52	libceph, ceph: add __init attribution to init funcitons Add __init attribution to the functions which are called only once during initiating/registering operations and deleting unnecessary symbol exports. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:48 +02:00
Chengguang Xu	f2f87877b8	libceph: adding missing message types to ceph_msg_type_name() Some of message types are missing in ceph_msg_type_name(), so just adding them for better understanding of output information. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:46 +02:00
Chengguang Xu	7377324e5b	libceph: fix misjudgement of maximum monitor number num_mon should allow up to CEPH_MAX_MON in ceph_monmap_decode(). Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:45 +02:00
Chengguang Xu	11e1478df9	libceph, ceph: change permission for readonly debugfs entries Remove write permission for debugfs entries which only have readonly function. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:45 +02:00
Chengguang Xu	4c069a5821	ceph: add newline to end of debug message format Some of dout format do not include newline in the end, fix for the files which are in fs/ceph and net/ceph directories, and changing printk to dout for printing debug info in super.c Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:44 +02:00
Ilya Dryomov	08c1ac508b	libceph, ceph: move ceph_calc_file_object_mapping() to striper.c ceph_calc_file_object_mapping() has nothing to do with osdmaps. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:43 +02:00
Ilya Dryomov	ed0811d2d2	libceph: striping framework implementation Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:42 +02:00
Ilya Dryomov	45a267dbb4	libceph: handle zero-length data items rbd needs this for null copyups -- if copyup data is all zeroes, we want to save some I/O and network bandwidth. See rbd_obj_issue_copyup() in the next commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2018-04-02 10:12:40 +02:00
Ilya Dryomov	b9e281c2b3	libceph: introduce BVECS data type In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for working with bio_vec array data buffers. The wrappers are trivial, but make it look similar to ceph_bio_iter. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:39 +02:00
Ilya Dryomov	5359a17d27	libceph, rbd: new bio handling code (aka don't clone bios) The reason we clone bios is to be able to give each object request (and consequently each ceph_osd_data/ceph_msg_data item) its own pointer to a (list of) bio(s). The messenger then initializes its cursor with cloned bio's ->bi_iter, so it knows where to start reading from/writing to. That's all the cloned bios are used for: to determine each object request's starting position in the provided data buffer. Introduce ceph_bio_iter to do exactly that -- store position within bio list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:38 +02:00
Ilya Dryomov	dccbf08005	libceph, ceph: change ceph_calc_file_object_mapping() signature - make it void - xlen (object extent length) out parameter should be u32 because only a single stripe unit is mapped at a time Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2018-04-02 10:12:38 +02:00
Ilya Dryomov	db2196a589	libceph: eliminate overflows in ceph_calc_file_object_mapping() bl, stripeno and objsetno should be u64 -- otherwise large enough files get corrupted. How large depends on file layout: - 4M-objects layout (default): any file over 16P - 64K-objects layout (smallest possible object size): any file over 512T Only CephFS is affected, rbd doesn't use ceph_calc_file_object_mapping() yet. Fortunately, CephFS has a max_file_size configurable, the default for which is way below both of the above numbers. Reimplement the logic from scratch with no layout validation -- it's done on the MDS side. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2018-04-02 10:12:38 +02:00
Xin Long	6174a30df1	route: check sysctl_fib_multipath_use_neigh earlier than hash Prior to this patch, when one packet is hashed into path [1] (hash <= nh_upper_bound) and it's neigh is dead, it will try path [2]. However, if path [2]'s neigh is alive but it's hash > nh_upper_bound, it will not return this alive path. This packet will never be sent even if path [2] is alive. 3.3.3.1/24: nexthop via 1.1.1.254 dev eth1 weight 1 <--[1] (dead neigh) nexthop via 2.2.2.254 dev eth2 weight 1 <--[2] With sysctl_fib_multipath_use_neigh set is supposed to find an available path respecting to the l3/l4 hash. But if there is no available route with this hash, it should at least return an alive route even with other hash. This patch is to fix it by processing fib_multipath_use_neigh earlier than the hash check, so that it will at least return an alive route if there is when fib_multipath_use_neigh is enabled. It's also compatible with before when there are alive routes with the l3/l4 hash. Fixes: `a6db4494d2` ("net: ipv4: Consider failed nexthops in multipath routes") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 20:57:39 -04:00
Li RongQing	da01ec4ee5	net: sched: do not emit messages while holding spinlock move messages emitting out of sch_tree_lock to avoid holding this lock too long. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 20:53:50 -04:00
Hangbin Liu	ec1d8ccb07	vlan: also check phy_driver ts_info for vlan's real device Just like function ethtool_get_ts_info(), we should also consider the phy_driver ts_info call back. For example, driver dp83640. Fixes: `37dd9255b2` ("vlan: Pass ethtool get_ts_info queries to real device.") Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 20:53:50 -04:00
David S. Miller	c0b458a946	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c, we had some overlapping changes: 1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE --> MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE 2) In 'net-next' params->log_rq_size is renamed to be params->log_rq_mtu_frames. 3) In 'net-next' params->hard_mtu is added. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 19:49:34 -04:00
David S. Miller	859a59352e	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2018-04-01 Here's (most likely) the last bluetooth-next pull request for the 4.17 kernel: - Remove unused btuart_cs driver (replaced by serial_cs + hci_uart) - New USB ID for Edimax EW-7611ULB controller - Cleanups & fixes to hci_bcm driver - Clenups to btmrvl driver Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 17:46:01 -04:00
Gustavo A. R. Silva	9ea471320e	Bluetooth: Mark expected switch fall-throughs In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2018-04-01 21:43:03 +03:00
Eric Dumazet	1f4c6eb240	ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data() While testing my inet defrag changes, I found that the senders could spend ~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for every fragment they cook, competing with TX completion of prior skbs possibly happening on another cpus. The solution to this problem is to use alloc_skb() instead of sock_wmalloc() and manually perform a single sk_wmem_alloc change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 14:08:21 -04:00
Eric Dumazet	694aba690d	ipv4: factorize sk_wmem_alloc updates done by __ip_append_data() While testing my inet defrag changes, I found that the senders could spend ~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for every fragment they cook. The solution to this problem is to use alloc_skb() instead of sock_wmalloc() and manually perform a single sk_wmem_alloc change. Similar change for IPv6 is provided in following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 14:08:21 -04:00
Alexey Kodanev	ec1258903a	ip6_gre: remove redundant 'tunnel' setting in ip6erspan_tap_init() 'tunnel' was already set at the start of ip6erspan_tap_init(). Fixes: `5a963eb61b` ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 14:02:32 -04:00
Atul Gupta	cc35c88ae4	crypto : chtls - CPL handler definition Exchange messages with hardware to program the TLS session CPL handlers for messages received from chip. Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Signed-off-by: Michael Werner <werner@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:37:32 -04:00
Atul Gupta	e0be6bea25	ethtool: enable Inline TLS in HW Ethtool option enables TLS record offload on HW, user configures the feature for netdev capable of Inline TLS. This allows user to define custom sk_prot for Inline TLS sock Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:37:32 -04:00
Atul Gupta	dd0bed1665	tls: support for Inline tls record Facility to register Inline TLS drivers to net/tls. Setup TLS_HW_RECORD prot to listen on offload device. Cases handled - Inline TLS device exists, setup prot for TLS_HW_RECORD - Atleast one Inline TLS exists, sets TLS_HW_RECORD. - If non-inline device establish connection, move to TLS_SW_TX Signed-off-by: Atul Gupta <atul.gupta@chelsio.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:37:32 -04:00
David S. Miller	d4069fe6fc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-03-31 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Add raw BPF tracepoint API in order to have a BPF program type that can access kernel internal arguments of the tracepoints in their raw form similar to kprobes based BPF programs. This infrastructure also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which returns an anon-inode backed fd for the tracepoint object that allows for automatic detach of the BPF program resp. unregistering of the tracepoint probe on fd release, from Alexei. 2) Add new BPF cgroup hooks at bind() and connect() entry in order to allow BPF programs to reject, inspect or modify user space passed struct sockaddr, and as well a hook at post bind time once the port has been allocated. They are used in FB's container management engine for implementing policy, replacing fragile LD_PRELOAD wrapper intercepting bind() and connect() calls that only works in limited scenarios like glibc based apps but not for other runtimes in containerized applications, from Andrey. 3) BPF_F_INGRESS flag support has been added to sockmap programs for their redirect helper call bringing it in line with cls_bpf based programs. Support is added for both variants of sockmap programs, meaning for tx ULP hooks as well as recv skb hooks, from John. 4) Various improvements on BPF side for the nfp driver, besides others this work adds BPF map update and delete helper call support from the datapath, JITing of 32 and 64 bit XADD instructions as well as offload support of bpf_get_prandom_u32() call. Initial implementation of nfp packet cache has been tackled that optimizes memory access (see merge commit for further details), from Jakub and Jiong. 5) Removal of struct bpf_verifier_env argument from the print_bpf_insn() API has been done in order to prepare to use print_bpf_insn() soon out of perf tool directly. This makes the print_bpf_insn() API more generic and pushes the env into private data. bpftool is adjusted as well with the print_bpf_insn() argument removal, from Jiri. 6) Couple of cleanups and prep work for the upcoming BTF (BPF Type Format). The latter will reuse the current BPF verifier log as well, thus bpf_verifier_log() is further generalized, from Martin. 7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read and write support has been added in similar fashion to existing IPv6 IPV6_TCLASS socket option we already have, from Nikita. 8) Fixes in recent sockmap scatterlist API usage, which did not use sg_init_table() for initialization thus triggering a BUG_ON() in scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and uses a small helper sg_init_marker() to properly handle the affected cases, from Prashant. 9) Let the BPF core follow IDR code convention and therefore use the idr_preload() and idr_preload_end() helpers, which would also help idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory pressure, from Shaohua. 10) Last but not least, a spelling fix in an error message for the BPF cookie UID helper under BPF sample code, from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:33:04 -04:00
Eric Dumazet	f2d1c724fc	inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:40 -04:00
Eric Dumazet	219badfaad	ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[], we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:40 -04:00
Eric Dumazet	bf66337140	inet: frags: get rid of ipfrag_skb_cb/FRAG_CB ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately this integer is currently in a different cache line than skb->next, meaning that we use two cache lines per skb when finding the insertion point. By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields in a single cache line and save precious memory bandwidth. Note that after the fast path added by Changli Gao in commit `d6bebca92c` ("fragment: add fast path for in-order fragments") this change wont help the fast path, since we still need to access prev->len (2nd cache line), but will show great benefits when slow path is entered, since we perform a linear scan of a potentially long list. Also, note that this potential long list is an attack vector, we might consider also using an rb-tree there eventually. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:40 -04:00
Eric Dumazet	05c0b86b96	ipv6: frags: rewrite ip6_expire_frag_queue() Make it similar to IPv4 ip_expire(), and release the lock before calling icmp functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	1eec5d5670	inet: frags: do not clone skb in ip_expire() An skb_clone() was added in commit `ec4fbd6475` ("inet: frag: release spinlock before calling icmp_send()") While fixing the bug at that time, it also added a very high cost for DDOS frags, as the ICMP rate limit is applied after this expensive operation (skb_clone() + consume_skb(), implying memory allocations, copy, and freeing) We can use skb_get(head) here, all we want is to make sure skb wont be freed by another cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	3e67f106f6	inet: frags: break the 2GB limit for frags storage Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonnably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true : if (... \|\| frag_mem_limit(nf) > nf->high_thresh) Tested: $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh <frag DDOS> $ grep FRAG /proc/net/sockstat FRAG: inuse 14705885 memory 16000002880 $ nstat -n ; sleep 1 ; nstat \| grep Reas IpReasmReqds 3317150 0.0 IpReasmFails 3317112 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	2d44ed22e6	inet: frags: remove inet_frag_maybe_warn_overflow() This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	399d1404be	inet: frags: get rif of inet_frag_evicting() This refactors ip_expire() since one indentation level is removed. Note: in the future, we should try hard to avoid the skb_clone() since this is a serious performance cost. Under DDOS, the ICMP message wont be sent because of rate limits. Fact that ip6_expire_frag_queue() does not use skb_clone() is disturbing too. Presumably IPv6 should have the same issue than the one we fixed in commit `ec4fbd6475` ("inet: frag: release spinlock before calling icmp_send()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	6befe4a78b	inet: frags: remove some helpers Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem() Also since we use rhashtable we can bring back the number of fragments in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was removed in commit `434d305405` ("inet: frag: don't account number of fragment queues") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	648700f76b	inet: frags: use rhashtables for reassembly units Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load. It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!) A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations. This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire. Then there is the problem of sharing this hash table for all netns. It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days. Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations. Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS. After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout) $ grep FRAG /proc/net/sockstat FRAG: inuse 1966916 memory 2140004608 A followup patch will change the limits for 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Florian Westphal <fw@strlen.de> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:39 -04:00
Eric Dumazet	483a6e4fa0	inet: frags: refactor ipfrag_init() We need to call inet_frags_init() before register_pernet_subsys(), as a prereq for following patch ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:38 -04:00
Eric Dumazet	807f1844df	inet: frags: refactor lowpan_net_frag_init() We want to call lowpan_net_frag_init() earlier. Similar to commit "inet: frags: refactor ipv6_frag_init()" This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:38 -04:00
Eric Dumazet	5b975bab23	inet: frags: refactor ipv6_frag_init() We want to call inet_frags_init() earlier. This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:38 -04:00
Eric Dumazet	093ba72914	inet: frags: add a pointer to struct netns_frags In order to simplify the API, add a pointer to struct inet_frags. This will allow us to make things less complex. These functions no longer have a struct inet_frags parameter : inet_frag_destroy(struct inet_frag_queue q /, struct inet_frags f /) inet_frag_put(struct inet_frag_queue q /, struct inet_frags f /) inet_frag_kill(struct inet_frag_queue q /, struct inet_frags f /) inet_frags_exit_net(struct netns_frags nf /, struct inet_frags f /) ip6_expire_frag_queue(struct net net, struct frag_queue fq) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:38 -04:00
Eric Dumazet	787bea7748	inet: frags: change inet_frags_init_net() return value We will soon initialize one rhashtable per struct netns_frags in inet_frags_init_net(). This patch changes the return value to eventually propagate an error. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:25:38 -04:00
Wei Yongjun	eeb0a2a526	vlan: vlan_hw_filter_capable() can be static Fixes the following sparse warning: net/8021q/vlan_core.c:168:6: warning: symbol 'vlan_hw_filter_capable' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 23:20:48 -04:00
David S. Miller	e2e80c027f	RxRPC development -----BEGIN PGP SIGNATURE----- iQIVAwUAWr6a+fu3V2unywtrAQJDgQ/+Pdyv8EW7VtdPRNdePGJLh4WHVrCnSL64 hKzHV3gEt8PK7sJi53lkbvTYJ3hpB/SC9Li0TdOvph4ngH8pUer7HiXW9g2zrcy+ KsbhmWslMOH6uy+8f2yWoI0HmSz7XvzQATCKOJz5Lm7X0JfvSoKkhgn6VLxqG91C p3IAYr5735/mqZIqs+FCIP2jLIPtCH/Plv18F9w847dSS1qsH6twAsUVjzn9CtJv MHQ+Wo57tpoOwynnZO+CDLhZGdXJ5YIZ8wXNqu/EUaeAHXgqkpSd4IUZeguDpoJR YrkvMf8cQlcBDohHBBmIVz9OOIZ8A7WbygNf7UPS9DGYjNVL9vKxF89xwAKYA5Ky LYmUug8WYp2FIHxuWGomF15SrzbNNDfD/v7ZvjXmNxvJXXV+AbzNoXZifSAuyo2V 2bgolw22PbZAMgRvBr5OjtxGD65lpCD00ZJWBgFrWn8l3hdK/40NJW5s3nQCShmm NCUAFV+nHvazfUNfcBAbVRPXCN8Pc0Cuspr6HJ3Wi4q5lnpJQY+hS+KFNizL6iyu 1fNRcqBF8KnVM5bE5Y27o96W/RhNdFudYB+sAH5IEGRqqG5YjkwvPgILvXhTH2kI MDxHMFjRgvWijzLLPpBd9YAsPLddvE6WRlngqOrSp6ICP4o8Vmeb2mxgo4tbQBXu 3dIL1fEfhME= =Lyzx -----END PGP SIGNATURE----- Merge tag 'rxrpc-next-20180330' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes and more traces Here are some patches that add some more tracepoints to AF_RXRPC and fix some issues therein: (1) Fix the use of VERSION packets to keep firewall routes open. (2) Fix the incorrect current time usage in a tracepoint. (3) Fix Tx ring annotation corruption. (4) Fix accidental conversion of call-level abort into connection-level abort. (5) Fix calculation of resend time. (6) Remove a couple of unused variables. (7) Fix a bunch of checker warnings and an error. Note that not all warnings can be quashed as checker doesn't seem to correctly handle seqlocks. (8) Fix a potential race between call destruction and socket/net destruction. (9) Add a tracepoint to track rxrpc_local refcounting. (10) Fix an apparent leak of rxrpc_local objects. (11) Add a tracepoint to track rxrpc_peer refcounting. (12) Fix a leak of rxrpc_peer objects. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:29:12 -04:00
Kirill Tkhai	554873e517	net: Do not take net_rwsem in __rtnl_link_unregister() This function calls call_netdevice_notifier(), which also may take net_rwsem. So, we can't use net_rwsem here. This patch makes callers of this functions take pernet_ops_rwsem, like register_netdevice_notifier() does. This will protect the modifications of net_namespace_list, and allows notifiers to take it (they won't have to care about context). Since __rtnl_link_unregister() is used on module load and unload (which are not frequent operations), this looks for me better, than make all call_netdevice_notifier() always executing in "protected net_namespace_list" context. Also, this fixes the problem we had a deal in `328fbe747a` "Close race between {un, }register_netdevice_notifier and ...", and guarantees __rtnl_link_unregister() does not skip exitting net. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:24:58 -04:00
Kirill Tkhai	fc1dd36992	net: Remove net_rwsem from {, un}register_netdevice_notifier() These functions take net_rwsem, while wireless_nlevent_flush() also takes it. But down_read() can't be taken recursive, because of rw_semaphore design, which prevents it to be occupied by only readers forever. Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(), net list can't change, so these down_read()/up_read() can be removed. Fixes: `f0b07bb151` "net: Introduce net_rwsem to protect net_namespace_list" Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:24:58 -04:00
Jon Maloy	7494cfa6d3	tipc: avoid possible string overflow gcc points out that the combined length of the fixed-length inputs to l->name is larger than the destination buffer size: net/tipc/link.c: In function 'tipc_link_create': net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes into a region of size between 26 and 58 [-Werror=format-overflow=] sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str); net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes (assuming 75) into a destination of size 60 sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str); A detailed analysis reveals that the theoretical maximum length of a link name is: max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name = 16 + 1 + 15 + 1 + 16 + 1 + 15 = 65 Since we also need space for a trailing zero we now set MAX_LINK_NAME to 68. Just to be on the safe side we also replace the sprintf() call with snprintf(). Fixes: `25b0b9c4e8` ("tipc: handle collisions of 32-bit node address hash values") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:19:52 -04:00
Jon Maloy	37922ea4a3	tipc: permit overlapping service ranges in name table With the new RB tree structure for service ranges it becomes possible to solve an old problem; - we can now allow overlapping service ranges in the table. When inserting a new service range to the tree, we use 'lower' as primary key, and when necessary 'upper' as secondary key. Since there may now be multiple service ranges matching an indicated 'lower' value, we must also add the 'upper' value to the functions used for removing publications, so that the correct, corresponding range item can be found. These changes guarantee that a well-formed publication/withdrawal item from a peer node never will be rejected, and make it possible to eliminate the problematic backlog functionality we currently have for handling such cases. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:19:52 -04:00
Jon Maloy	f20889f72b	tipc: refactor name table translate function The function tipc_nametbl_translate() function is ugly and hard to follow. This can be improved somewhat by introducing a stack variable for holding the publication list to be used and re-ordering the if- clauses for selection of algorithm. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:19:52 -04:00
Jon Maloy	218527fe27	tipc: replace name table service range array with rb tree The current design of the binding table has an unnecessary memory consuming and complex data structure. It aggregates the service range items into an array, which is expanded by a factor two every time it becomes too small to hold a new item. Furthermore, the arrays never shrink when the number of ranges diminishes. We now replace this array with an RB tree that is holding the range items as tree nodes, each range directly holding a list of bindings. This, along with a few name changes, improves both readability and volume of the code, as well as reducing memory consumption and hopefully improving cache hit rate. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:19:52 -04:00
Nikolay Aleksandrov	804b854d37	net: bridge: disable bridge MTU auto tuning if it was set manually As Roopa noted today the biggest source of problems when configuring bridge and ports is that the bridge MTU keeps changing automatically on port events (add/del/changemtu). That leads to inconsistent behaviour and network config software needs to chase the MTU and fix it on each such event. Let's improve on that situation and allow for the user to set any MTU within ETH_MIN/MAX limits, but once manually configured it is the user's responsibility to keep it correct afterwards. In case the MTU isn't manually set - the behaviour reverts to the previous and the bridge follows the minimum MTU. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:19:00 -04:00
Nikolay Aleksandrov	f40aa23339	net: bridge: set min MTU on port events and allow user to set max Recently the bridge was changed to automatically set maximum MTU on port events (add/del/changemtu) when vlan filtering is enabled, but that actually changes behaviour in a way which breaks some setups and can lead to packet drops. In order to still allow that maximum to be set while being compatible, we add the ability for the user to tune the bridge MTU up to the maximum when vlan filtering is enabled, but that has to be done explicitly and all port events (add/del/changemtu) lead to resetting that MTU to the minimum as before. Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-31 22:19:00 -04:00
Linus Torvalds	b5dbc28762	Kbuild fixes for v4.16 (3rd) - fix missed rebuild of TRIM_UNUSED_KSYMS - fix rpm-pkg for GNU tar >= 1.29 - include scripts/dtc/include-prefixes/* to kernel header deb-pkg - add -no-integrated-as option ealier to fix building with Clang - fix netfilter Makefile for parallel building -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJavwJpAAoJED2LAQed4NsGQuIQAK/UmPVczOxT7RefB4BrAsZG Zlai7HnfpzWk5EZE6fbTHTmbFu6HZ1TuYhOW5UlJcxd3P+nJfL5WwDo0H52LVfLT UkSubLCtZBl+DqtbuOg4Xrmh8k3WneGqYT7H9D19LRXTeeoh82g81+mWYL3F9UOA OWGzKf9+3CQhP7OjeVlfdQ8qv2UR+snyIK0jNRImTuhtys8iy2Q4EP/nQYtF7oAA KcYY62rS3qVKfTrdk5NY7kxvpp6/1m6141UPR75Xve7h+Emx/u0RthiMUW08e2bv PX5IlyI8XFz54wD2tojawMEo235cYPJAKQHZAry5tiLXvOF5vEZvoPGc8oUZnMGe bMNONRfXrKWi10/pcTqEfl6gEAE+bvOrqIKj/DECT4hF1av2uEeou/SzuEX+wbqK GxU4L5mnUwDsJNLPiUeVjyl4GD48X16lBdCs9laamRzYat5lKzJFBmgNf0dyHdI+ l/myEtk17nSeohPWRgUeTBcP8O+E27rER7U/+KC0c4spwKrEfLFIzzNauLLJdugN o1VNYacseg3cLQnjSpmC26jxZw29jMFaLM5mBuiI7/F9mUlK6zaG6gyoDzV3A5lN jgPw48apNj4SLnUMrOi+1RYWXWkguF09f8GecjJKXvR5wGqzY7E3ZDi/zgXBf72q 5r5dDuIExh0KXcO9Risp =2WPN -----END PGP SIGNATURE----- Merge tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - fix missed rebuild of TRIM_UNUSED_KSYMS - fix rpm-pkg for GNU tar >= 1.29 - include scripts/dtc/include-prefixes/* to kernel header deb-pkg - add -no-integrated-as option ealier to fix building with Clang - fix netfilter Makefile for parallel building * tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: netfilter: nf_nat_snmp_basic: add correct dependency to Makefile kbuild: rpm-pkg: Support GNU tar >= 1.29 builddeb: Fix header package regarding dtc source links kbuild: set no-integrated-as before incl. arch Makefile kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races	2018-03-30 18:53:57 -10:00
Andrey Ignatov	aac3fc320d	bpf: Post-hooks for sys_bind "Post-hooks" are hooks that are called right before returning from sys_bind. At this time IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind but BPF program has a chance to inspect the socket and change sys_bind result. Specifically it can e.g. inspect what port was allocated and if it doesn't satisfy some policy, BPF program can force sys_bind to fail and return EPERM to user. Another example of usage is recording the IP:port pair to some map to use it in later calls to sys_connect. E.g. if some TCP server inside cgroup was bound to some IP:port_n, it can be recorded to a map. And later when some TCP client inside same cgroup is trying to connect to 127.0.0.1:port_n, BPF hook for sys_connect can override the destination and connect application to IP:port_n instead of 127.0.0.1:port_n. That helps forcing all applications inside a cgroup to use desired IP and not break those applications if they e.g. use localhost to communicate between each other. == Implementation details == Post-hooks are implemented as two new attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`. Separate attach types for IPv4 and IPv6 are introduced to avoid access to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from `inet6_bind()` since those fields might not make sense in such cases. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-31 02:16:26 +02:00
Andrey Ignatov	d74bad4e74	bpf: Hooks for sys_connect == The problem == See description of the problem in the initial patch of this patch set. == The solution == The patch provides much more reliable in-kernel solution for the 2nd part of the problem: making outgoing connecttion from desired IP. It adds new attach types `BPF_CGROUP_INET4_CONNECT` and `BPF_CGROUP_INET6_CONNECT` for program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both source and destination of a connection at connect(2) time. Local end of connection can be bound to desired IP using newly introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though, and doesn't support binding to port, i.e. leverages `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this: * looking for a free port is expensive and can affect performance significantly; * there is no use-case for port. As for remote end (`struct sockaddr *` passed by user), both parts of it can be overridden, remote IP and remote port. It's useful if an application inside cgroup wants to connect to another application inside same cgroup or to itself, but knows nothing about IP assigned to the cgroup. Support is added for IPv4 and IPv6, for TCP and UDP. IPv4 and IPv6 have separate attach types for same reason as sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. == Implementation notes == The patch introduces new field in `struct proto`: `pre_connect` that is a pointer to a function with same signature as `connect` but is called before it. The reason is in some cases BPF hooks should be called way before control is passed to `sk->sk_prot->connect`. Specifically `inet_dgram_connect` autobinds socket before calling `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since it'd cause double-bind. On the other hand `proto.pre_connect` provides a flexible way to add BPF hooks for connect only for necessary `proto` and call them at desired time before `connect`. Since `bpf_bind()` is allowed to bind only to IP and autobind in `inet_dgram_connect` binds only port there is no chance of double-bind. bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite of value of `bind_address_no_port` socket field. bpf_bind() sets `with_lock` to `false` when calling to __inet_bind() and __inet6_bind() since all call-sites, where bpf_bind() is called, already hold socket lock. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-31 02:15:54 +02:00
Andrey Ignatov	3679d585bb	net: Introduce __inet_bind() and __inet6_bind Refactor `bind()` code to make it ready to be called from BPF helper function `bpf_bind()` (will be added soon). Implementation of `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and `__inet6_bind()` correspondingly. These function can be used from both `sk_prot->bind` and `bpf_bind()` contexts. New functions have two additional arguments. `force_bind_address_no_port` forces binding to IP only w/o checking `inet_sock.bind_address_no_port` field. It'll allow to bind local end of a connection to desired IP in `bpf_bind()` w/o changing `bind_address_no_port` field of a socket. It's useful since `bpf_bind()` can return an error and we'd need to restore original value of `bind_address_no_port` in that case if we changed this before calling to the helper. `with_lock` specifies whether to lock socket when working with `struct sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old behavior is preserved. But it will be set to `false` for `bpf_bind()` use-case. The reason is all call-sites, where `bpf_bind()` will be called, already hold that socket lock. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-31 02:15:43 +02:00
Andrey Ignatov	4fbac77d2d	bpf: Hooks for sys_bind == The problem == There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IP configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code since it's often not possible. Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever TCP/UDP server calls `bind(2)`, the library replaces IP in sockaddr before passing arguments to syscall. When application calls `connect(2)` the library transparently binds the local end of connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty). Shared library approach is fragile though, e.g.: * some applications clear env vars (incl. `LD_PRELOAD`); * `/etc/ld.so.preload` doesn't help since some applications are linked with option `-z nodefaultlib`; * other applications don't use glibc and there is nothing to intercept. == The solution == The patch provides much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on desired IP. It does not depend on application environment and implementation details (whether glibc is used or not). It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`). The new program type is intended to be used with sockets (`struct sock`) in a cgroup and provided by user `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of newly added types. The new attach types provides hooks in `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override IP addresses and ports user program tries to bind to and apply such a program for whole cgroup. == Implementation notes == [1] Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for corresponding socket family. E.g. if user passes `sockaddr_in` it doesn't make sense to read from / write to `user_ip6[]` context fields. [2] The write access to `struct bpf_sock_addr_kern` is implemented using special field as an additional "register". There are just two registers in `sock_addr_convert_ctx_access`: `src` with value to write and `dst` with pointer to context that can't be changed not to break later instructions. But the fields, allowed to write to, are not available directly and to access them address of corresponding pointer has to be loaded first. To get additional register the 1st not used by `src` and `dst` one is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load address of pointer field, and finally the register's content is restored from the temporary field after writing `src` value. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-31 02:15:18 +02:00
Andrey Ignatov	5e43f899b0	bpf: Check attach type at prog load time == The problem == There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers. E.g. context structure may have fields for both IPv4 and IPv6 but it doesn't make sense to read from / write to IPv6 field when attach point is somewhere in IPv4 stack. Same applies to BPF-helpers: it may make sense to call some helper from some attach point, but not from other for same prog type. == The solution == Introduce `expected_attach_type` field in in `struct bpf_attr` for `BPF_PROG_LOAD` command. If scenario described in "The problem" section is the case for some prog type, the field will be checked twice: 1) At load time prog type is checked to see if attach type for it must be known to validate program permissions correctly. Prog will be rejected with EINVAL if it's the case and `expected_attach_type` is not specified or has invalid value. 2) At attach time `attach_type` is compared with `expected_attach_type`, if prog type requires to have one, and, if they differ, attach will be rejected with EINVAL. The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` () and can be used to check context accesses and calls to helpers correspondingly. Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and Daniel Borkmann <daniel@iogearbox.net> here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2 Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-31 02:14:44 +02:00
David Howells	17226f1240	rxrpc: Fix leak of rxrpc_peer objects When a new client call is requested, an rxrpc_conn_parameters struct object is passed in with a bunch of parameters set, such as the local endpoint to use. A pointer to the target peer record is also placed in there by rxrpc_get_client_conn() - and this is removed if and only if a new connection object is allocated. Thus it leaks if a new connection object isn't allocated. Fix this by putting any peer object attached to the rxrpc_conn_parameters object in the function that allocated it. Fixes: `19ffa01c9c` ("rxrpc: Use structs to hold connection params and protocol info") Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:05:44 +01:00
David Howells	1159d4b496	rxrpc: Add a tracepoint to track rxrpc_peer refcounting Add a tracepoint to track reference counting on the rxrpc_peer struct. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:05:38 +01:00
David Howells	31f5f9a169	rxrpc: Fix apparent leak of rxrpc_local objects rxrpc_local objects cannot be disposed of until all the connections that point to them have been RCU'd as a connection object holds refcount on the local endpoint it is communicating through. Currently, this can cause an assertion failure to occur when a network namespace is destroyed as there's no check that the RCU destructors for the connections have been run before we start trying to destroy local endpoints. The kernel reports: rxrpc: AF_RXRPC: Leaked local 0000000036a41bc1 {5} ------------[ cut here ]------------ kernel BUG at ../net/rxrpc/local_object.c:439! Fix this by keeping a count of the live connections and waiting for it to go to zero at the end of rxrpc_destroy_all_connections(). Fixes: `dee46364ce` ("rxrpc: Add RCU destruction for connections and calls") Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:05:33 +01:00
David Howells	09d2bf595d	rxrpc: Add a tracepoint to track rxrpc_local refcounting Add a tracepoint to track reference counting on the rxrpc_local struct. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:05:28 +01:00
David Howells	d3be4d2443	rxrpc: Fix potential call vs socket/net destruction race rxrpc_call structs don't pin sockets or network namespaces, but may attempt to access both after their refcount reaches 0 so that they can detach themselves from the network namespace. However, there's no guarantee that the socket still exists at this point (so sock_net(&call->socket->sk) may be invalid) and the namespace may have gone away if the call isn't pinning a peer. Fix this by (a) carrying a net pointer in the rxrpc_call struct and (b) waiting for all calls to be destroyed when the network namespace goes away. This was detected by checker: net/rxrpc/call_object.c:634:57: warning: incorrect type in argument 1 (different address spaces) net/rxrpc/call_object.c:634:57: expected struct sock const sk net/rxrpc/call_object.c:634:57: got struct sock [noderef] <asn:4><noident> Fixes: `2baec2c3f8` ("rxrpc: Support network namespacing") Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:05:23 +01:00
David Howells	88f2a8257c	rxrpc: Fix checker warnings and errors Fix various issues detected by checker. Errors: () rxrpc_discard_prealloc() should be using rcu_assign_pointer to set call->socket. Warnings: () rxrpc_service_connection_reaper() should be passing NULL rather than 0 to trace_rxrpc_conn() as the where argument. () rxrpc_disconnect_client_call() should get its net pointer via the call->conn rather than call->sock to avoid a warning about accessing an RCU pointer without protection. () Proc seq start/stop functions need annotation as they pass locks between the functions. False positives: () Checker doesn't correctly handle of seq-retry lock context balance in rxrpc_find_service_conn_rcu(). () Checker thinks execution may proceed past the BUG() in rxrpc_publish_service_conn(). (*) Variable length array warnings from SKCIPHER_REQUEST_ON_STACK() in rxkad.c. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:05:17 +01:00
Sebastian Andrzej Siewior	edb63e2b27	rxrpc: remove unused static variables The rxrpc_security_methods and rxrpc_security_sem user has been removed in `648af7fca1` ("rxrpc: Absorb the rxkad security module"). This was noticed by kbuild test robot for the -RT tree but is also true for !RT. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:04:44 +01:00
Marc Dionne	59299aa102	rxrpc: Fix resend event time calculation Commit `a158bdd3` ("rxrpc: Fix call timeouts") reworked the time calculation for the next resend event. For this calculation, "oldest" will be before "now", so ktime_sub(oldest, now) will yield a negative value. When passed to nsecs_to_jiffies which expects an unsigned value, the end result will be a very large value, and a resend event scheduled far into the future. This could cause calls to stall if some packets were lost. Fix by ordering the arguments to ktime_sub correctly. Fixes: `a158bdd324` ("rxrpc: Fix call timeouts") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:04:44 +01:00
David Howells	57b0c9d49b	rxrpc: Don't treat call aborts as conn aborts If a call-level abort is received for the previous call to complete on a connection channel, then that abort is queued for the connection processor to handle. Unfortunately, the connection processor then assumes without checking that the abort is connection-level (ie. callNumber is 0) and distributes it over all active calls on that connection, thereby incorrectly aborting them. Fix this by discarding aborts aimed at a completed call. Further, discard all packets aimed at a call that's complete if there's currently an active call on a channel, since the DATA packets associated with the new call automatically terminate the old call. Fixes: `18bfeba50d` ("rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:04:44 +01:00
David Howells	03877bf6a3	rxrpc: Fix Tx ring annotation after initial Tx failure rxrpc calls have a ring of packets that are awaiting ACK or retransmission and a parallel ring of annotations that tracks the state of those packets. If the initial transmission of a packet on the underlying UDP socket fails then the packet annotation is marked for resend - but the setting of this mark accidentally erases the last-packet mark also stored in the same annotation slot. If this happens, a call won't switch out of the Tx phase when all the packets have been transmitted. Fix this by retaining the last-packet mark and only altering the packet state. Fixes: `248f219cb8` ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:04:43 +01:00
David Howells	f82eb88b0f	rxrpc: Fix a bit of time confusion The rxrpc_reduce_call_timer() function should be passed the 'current time' in jiffies, not the current ktime time. It's confusing in rxrpc_resend because that has to deal with both. Pass the correct current time in. Note that this only affects the trace produced and not the functioning of the code. Fixes: `a158bdd324` ("rxrpc: Fix call timeouts") Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:04:43 +01:00
David Howells	ace45bec6d	rxrpc: Fix firewall route keepalive Fix the firewall route keepalive part of AF_RXRPC which is currently function incorrectly by replying to VERSION REPLY packets from the server with VERSION REQUEST packets. Instead, send VERSION REPLY packets to the peers of service connections to act as keep-alives 20s after the latest packet was transmitted to that peer. Also, just discard VERSION REPLY packets rather than replying to them. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-30 21:04:43 +01:00
David Ahern	b6cdbc8523	net/ipv6: Fix route leaking between VRFs Donald reported that IPv6 route leaking between VRFs is not working. The root cause is the strict argument in the call to rt6_lookup when validating the nexthop spec. ip6_route_check_nh validates the gateway and device (if given) of a route spec. It in turn could call rt6_lookup (e.g., lookup in a given table did not succeed so it falls back to a full lookup) and if so sets the strict argument to 1. That means if the egress device is given, the route lookup needs to return a result with the same device. This strict requirement does not work with VRFs (IPv4 or IPv6) because the oif in the flow struct is overridden with the index of the VRF device to trigger a match on the l3mdev rule and force the lookup to its table. The right long term solution is to add an l3mdev index to the flow struct such that the oif is not overridden. That solution will not backport well, so this patch aims for a simpler solution to relax the strict argument if the route spec device is an l3mdev slave. As done in other places, use the FLOWI_FLAG_SKIP_NH_OIF to know that the RT6_LOOKUP_F_IFACE flag needs to be removed. Fixes: `ca254490c8` ("net: Add VRF support to IPv6 stack") Reported-by: Donald Sharp <sharpd@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 14:23:59 -04:00
David Lebrun	5807b22c91	ipv6: sr: fix seg6 encap performances with TSO enabled Enabling TSO can lead to abysmal performances when using seg6 in encap mode, such as with the ixgbe driver. This patch adds a call to iptunnel_handle_offloads() to remove the encapsulation bit if needed. Before: root@comp4-seg6bpf:~# iperf3 -c fc00::55 Connecting to host fc00::55, port 5201 [ 4] local fc45::4 port 36592 connected to fc00::55 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 196 KBytes 1.60 Mbits/sec 47 6.66 KBytes [ 4] 1.00-2.00 sec 304 KBytes 2.49 Mbits/sec 100 5.33 KBytes [ 4] 2.00-3.00 sec 284 KBytes 2.32 Mbits/sec 92 5.33 KBytes After: root@comp4-seg6bpf:~# iperf3 -c fc00::55 Connecting to host fc00::55, port 5201 [ 4] local fc45::4 port 43062 connected to fc00::55 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 1.03 GBytes 8.89 Gbits/sec 0 743 KBytes [ 4] 1.00-2.00 sec 1.03 GBytes 8.87 Gbits/sec 0 743 KBytes [ 4] 2.00-3.00 sec 1.03 GBytes 8.87 Gbits/sec 0 743 KBytes Reported-by: Tom Herbert <tom@quantonium.net> Fixes: `6c8702c60b` ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 14:14:33 -04:00
Toshiaki Makita	ae4745730c	net: Fix untag for vlan packets without ethernet header In some situation vlan packets do not have ethernet headers. One example is packets from tun devices. Users can specify vlan protocol in tun_pi field instead of IP protocol, and skb_vlan_untag() attempts to untag such packets. skb_vlan_untag() (more precisely, skb_reorder_vlan_header() called by it) however did not expect packets without ethernet headers, so in such a case size argument for memmove() underflowed and triggered crash. ==== BUG: unable to handle kernel paging request at ffff8801cccb8000 IP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 PGD 9cee067 P4D 9cee067 PUD 1d9401063 PMD 1cccb7063 PTE 2810100028101 Oops: 000b [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 17663 Comm: syz-executor2 Not tainted 4.16.0-rc7+ #368 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 RSP: 0018:ffff8801cc046e28 EFLAGS: 00010287 RAX: ffff8801ccc244c4 RBX: fffffffffffffffe RCX: fffffffffff6c4c2 RDX: fffffffffffffffe RSI: ffff8801cccb7ffc RDI: ffff8801cccb8000 RBP: ffff8801cc046e48 R08: ffff8801ccc244be R09: ffffed0039984899 R10: 0000000000000001 R11: ffffed0039984898 R12: ffff8801ccc244c4 R13: ffff8801ccc244c0 R14: ffff8801d96b7c06 R15: ffff8801d96b7b40 FS: 00007febd562d700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8801cccb8000 CR3: 00000001ccb2f006 CR4: 00000000001606e0 DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: memmove include/linux/string.h:360 [inline] skb_reorder_vlan_header net/core/skbuff.c:5031 [inline] skb_vlan_untag+0x470/0xc40 net/core/skbuff.c:5061 __netif_receive_skb_core+0x119c/0x3460 net/core/dev.c:4460 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4627 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4701 netif_receive_skb+0xae/0x390 net/core/dev.c:4725 tun_rx_batched.isra.50+0x5ee/0x870 drivers/net/tun.c:1555 tun_get_user+0x299e/0x3c20 drivers/net/tun.c:1962 tun_chr_write_iter+0xb9/0x160 drivers/net/tun.c:1990 call_write_iter include/linux/fs.h:1782 [inline] new_sync_write fs/read_write.c:469 [inline] __vfs_write+0x684/0x970 fs/read_write.c:482 vfs_write+0x189/0x510 fs/read_write.c:544 SYSC_write fs/read_write.c:589 [inline] SyS_write+0xef/0x220 fs/read_write.c:581 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x454879 RSP: 002b:00007febd562cc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007febd562d6d4 RCX: 0000000000454879 RDX: 0000000000000157 RSI: 0000000020000180 RDI: 0000000000000014 RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000006b0 R14: 00000000006fc120 R15: 0000000000000000 Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20 RIP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 RSP: ffff8801cc046e28 CR2: ffff8801cccb8000 ==== We don't need to copy headers for packets which do not have preceding headers of vlan headers, so skip memmove() in that case. Fixes: `4bbb3e0e82` ("net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 12:36:27 -04:00
Lorenzo Bianconi	428604fb11	ipv6: do not set routes if disable_ipv6 has been enabled Do not allow setting ipv6 routes from userspace if disable_ipv6 has been enabled. The issue can be triggered using the following reproducer: - sysctl net.ipv6.conf.all.disable_ipv6=1 - ip -6 route add a🅱️c:d::/64 dev em1 - ip -6 route show a🅱️c:d::/64 dev em1 metric 1024 pref medium Fix it checking disable_ipv6 value in ip6_route_info_create routine Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 12:20:52 -04:00
David S. Miller	d162190bde	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. This batch comes with more input sanitization for xtables to address bug reports from fuzzers, preparation works to the flowtable infrastructure and assorted updates. In no particular order, they are: 1) Make sure userspace provides a valid standard target verdict, from Florian Westphal. 2) Sanitize error target size, also from Florian. 3) Validate that last rule in basechain matches underflow/policy since userspace assumes this when decoding the ruleset blob that comes from the kernel, from Florian. 4) Consolidate hook entry checks through xt_check_table_hooks(), patch from Florian. 5) Cap ruleset allocations at 512 mbytes, 134217728 rules and reject very large compat offset arrays, so we have a reasonable upper limit and fuzzers don't exercise the oom-killer. Patches from Florian. 6) Several WARN_ON checks on xtables mutex helper, from Florian. 7) xt_rateest now has a hashtable per net, from Cong Wang. 8) Consolidate counter allocation in xt_counters_alloc(), from Florian. 9) Earlier xt_table_unlock() call in {ip,ip6,arp,eb}tables, patch from Xin Long. 10) Set FLOW_OFFLOAD_DIR_* to IP_CT_DIR_* definitions, patch from Felix Fietkau. 11) Consolidate code through flow_offload_fill_dir(), also from Felix. 12) Inline ip6_dst_mtu_forward() just like ip_dst_mtu_maybe_forward() to remove a dependency with flowtable and ipv6.ko, from Felix. 13) Cache mtu size in flow_offload_tuple object, this is safe for forwarding as `f87c10a8aa` describes, from Felix. 14) Rename nf_flow_table.c to nf_flow_table_core.o, to simplify too modular infrastructure, from Felix. 15) Add rt0, rt2 and rt4 IPv6 routing extension support, patch from Ahmed Abdelsalam. 16) Remove unused parameter in nf_conncount_count(), from Yi-Hung Wei. 17) Support for counting only to nf_conncount infrastructure, patch from Yi-Hung Wei. 18) Add strict NFT_CT_{SRC_IP,DST_IP,SRC_IP6,DST_IP6} key datatypes to nft_ct. 19) Use boolean as return value from ipt_ah and from IPVS too, patch from Gustavo A. R. Silva. 20) Remove useless parameters in nfnl_acct_overquota() and nf_conntrack_broadcast_help(), from Taehee Yoo. 21) Use ipv6_addr_is_multicast() from xt_cluster, also from Taehee Yoo. 22) Statify nf_tables_obj_lookup_byhandle, patch from Fengguang Wu. 23) Fix typo in xt_limit, from Geert Uytterhoeven. 24) Do no use VLAs in Netfilter code, again from Gustavo. 25) Use ADD_COUNTER from ebtables, from Taehee Yoo. 26) Bitshift support for CONNMARK and MARK targets, from Jack Ma. 27) Use pr_*() and add pr_fmt(), from Arushi Singhal. 28) Add synproxy support to ctnetlink. 29) ICMP type and IGMP matching support for ebtables, patches from Matthias Schiffer. 30) Support for the revision infrastructure to ebtables, from Bernie Harris. 31) String match support for ebtables, also from Bernie. 32) Documentation for the new flowtable infrastructure. 33) Use generic comparison functions in ebt_stp, from Joe Perches. 34) Demodularize filter chains in nftables. 35) Register conntrack hooks in case nftables NAT chain is added. 36) Merge assignments with return in a couple of spots in the Netfilter codebase, also from Arushi. 37) Document that xtables percpu counters are stored in the same memory area, from Ben Hutchings. 38) Revert mark_source_chains() sanity checks that break existing rulesets, from Florian Westphal. 39) Use is_zero_ether_addr() in the ipset codebase, from Joe Perches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 11:41:18 -04:00
Kirill Tkhai	328fbe747a	net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net() {un,}register_netdevice_notifier() iterate over all net namespaces hashed to net_namespace_list. But pernet_operations register and unregister netdevices in unhashed net namespace, and they are not seen for netdevice notifiers. This results in asymmetry: 1)Race with register_netdevice_notifier() pernet_operations::init(net) ... register_netdevice() ... call_netdevice_notifiers() ... ... nb is not called ... ... register_netdevice_notifier(nb) -> net skipped ... ... list_add_tail(&net->list, ..) ... Then, userspace stops using net, and it's destructed: pernet_operations::exit(net) unregister_netdevice() call_netdevice_notifiers() ... nb is called ... This always happens with net::loopback_dev, but it may be not the only device. 2)Race with unregister_netdevice_notifier() pernet_operations::init(net) register_netdevice() call_netdevice_notifiers() ... nb is called ... Then, userspace stops using net, and it's destructed: list_del_rcu(&net->list) ... pernet_operations::exit(net) unregister_netdevice_notifier(nb) -> net skipped dev_change_net_namespace() ... call_netdevice_notifiers() ... nb is not called ... unregister_netdevice() call_netdevice_notifiers() ... nb is not called ... This race is more danger, since dev_change_net_namespace() moves real network devices, which use not trivial netdevice notifiers, and if this will happen, the system will be left in unpredictable state. The patch closes the race. During the testing I found two places, where register_netdevice_notifier() is called from pernet init/exit methods (which led to deadlock) and fixed them (see previous patches). The review moved me to one more unusual registration place: raw_init() (can driver). It may be a reason of problems, if someone creates in-kernel CAN_RAW sockets, since they will be destroyed in exit method and raw_release() will call unregister_netdevice_notifier(). But grep over kernel tree does not show, someone creates such sockets from kernel space. Theoretically, there can be more places like this, and which are hidden from review, but we found them on the first bumping there (since there is no a race, it will be 100% reproducible). Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 10:59:35 -04:00
Kirill Tkhai	9e2f6c5d78	netfilter: Rework xt_TEE netdevice notifier Register netdevice notifier for every iptable entry is not good, since this breaks modularity, and the hidden synchronization is based on rtnl_lock(). This patch reworks the synchronization via new lock, while the rest of logic remains as it was before. This is required for the next patch. Tested via: while :; do unshare -n iptables -t mangle -A OUTPUT -j TEE --gateway 1.1.1.2 --oif lo; done Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 10:59:23 -04:00
Kirill Tkhai	e9a441b6e7	xfrm: Register xfrm_dev_notifier in appropriate place Currently, driver registers it from pernet_operations::init method, and this breaks modularity, because initialization of net namespace and netdevice notifiers are orthogonal actions. We don't have per-namespace netdevice notifiers; all of them are global for all devices in all namespaces. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 10:59:23 -04:00
Russell King	e679c9c1db	sfp/phylink: move module EEPROM ethtool access into netdev core ethtool Provide a pointer to the SFP bus in struct net_device, so that the ethtool module EEPROM methods can access the SFP directly, rather than needing every user to provide a hook for it. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 10:11:06 -04:00
Gal Pressman	9daae9bd47	net: Call add/kill vid ndo on vlan filter feature toggling NETIF_F_HW_VLAN_[CS]TAG_FILTER features require more than just a bit flip in dev->features in order to keep the driver in a consistent state. These features notify the driver of each added/removed vlan, but toggling of vlan-filter does not notify the driver accordingly for each of the existing vlans. This patch implements a similar solution to NETIF_F_RX_UDP_TUNNEL_PORT behavior (which notifies the driver about UDP ports in the same manner that vids are reported). Each toggling of the features propagates to the 8021q module, which iterates over the vlans and call add/kill ndo accordingly. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-30 09:58:59 -04:00
Joe Perches	26c97c5d8d	netfilter: ipset: Use is_zero_ether_addr instead of static and memcmp To make the test a bit clearer and to reduce object size a little. Miscellanea: o remove now unnecessary static const array $ size ip_set_hash_mac.o* text data bss dec hex filename 22822 4619 64 27505 6b71 ip_set_hash_mac.o.allyesconfig.new 22932 4683 64 27679 6c1f ip_set_hash_mac.o.allyesconfig.old 10443 1040 0 11483 2cdb ip_set_hash_mac.o.defconfig.new 10507 1040 0 11547 2d1b ip_set_hash_mac.o.defconfig.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 12:20:44 +02:00
Florian Westphal	e3b5e1ec75	Revert "netfilter: x_tables: ensure last rule in base chain matches underflow/policy" This reverts commit `0d7df906a0`. Valdis Kletnieks reported that xtables is broken in linux-next since `0d7df906a0` ("netfilter: x_tables: ensure last rule in base chain matches underflow/policy"), as kernel rejects the (well-formed) ruleset: [ 64.402790] ip6_tables: last base chain position 1136 doesn't match underflow 1344 (hook 1) mark_source_chains is not the correct place for such a check, as it terminates evaluation of a chain once it sees an unconditional verdict (following rules are known to be unreachable). It seems preferrable to fix libiptc instead, so remove this check again. Fixes: `0d7df906a0` ("netfilter: x_tables: ensure last rule in base chain matches underflow/policy") Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 12:20:32 +02:00
Ben Hutchings	9ba5c404bf	netfilter: x_tables: Add note about how to free percpu counters Due to the way percpu counters are allocated and freed in blocks, it is not safe to free counters individually. Currently all callers do the right thing, but let's note this restriction. Fixes: `ae0ac0ed6f` ("netfilter: x_tables: pack percpu counter allocations") Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:44:27 +02:00
Arushi Singhal	c47d36b385	netfilter: Merge assignment with return Merge assignment with return statement to directly return the value. Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:44:27 +02:00
Pablo Neira Ayuso	a3073c17dd	netfilter: nf_tables: use nft_set_lookup_global from nf_tables_newsetelem() Replace opencoded implementation of nft_set_lookup_global() by call to this function. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:29:20 +02:00
Pablo Neira Ayuso	10659cbab7	netfilter: nf_tables: rename to nft_set_lookup_global() To prepare shorter introduction of shorter function prefix. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:29:20 +02:00
Pablo Neira Ayuso	43a605f2f7	netfilter: nf_tables: enable conntrack if NAT chain is registered Register conntrack hooks if the user adds NAT chains. Users get confused with the existing behaviour since they will see no packets hitting this chain until they add the first rule that refers to conntrack. This patch adds new ->init() and ->free() indirections to chain types that can be used by NAT chains to invoke the conntrack dependency. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:29:19 +02:00
Pablo Neira Ayuso	02c7b25e5f	netfilter: nf_tables: build-in filter chain type One module per supported filter chain family type takes too much memory for very little code - too much modularization - place all chain filter definitions in one single file. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:29:19 +02:00
Pablo Neira Ayuso	cc07eeb0e5	netfilter: nf_tables: nft_register_chain_type() returns void Use WARN_ON() instead since it should not happen that neither family goes over NFPROTO_NUMPROTO nor there is already a chain of this type already registered. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:29:18 +02:00
Pablo Neira Ayuso	32537e9184	netfilter: nf_tables: rename struct nf_chain_type Use nft_ prefix. By when I added chain types, I forgot to use the nftables prefix. Rename enum nft_chain_type to enum nft_chain_types too, otherwise there is an overlap. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:29:17 +02:00
Joe Perches	9124a20d87	netfilter: ebt_stp: Use generic functions for comparisons Instead of unnecessary const declarations, use the generic functions to save a little object space. $ size net/bridge/netfilter/ebt_stp.o* text data bss dec hex filename 1250 144 0 1394 572 net/bridge/netfilter/ebt_stp.o.new 1344 144 0 1488 5d0 net/bridge/netfilter/ebt_stp.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:24:29 +02:00
Bernie Harris	1be3ac9844	netfilter: ebtables: Add string filter This patch is part of a proposal to add a string filter to ebtables, which would be similar to the string filter in iptables. Like iptables, the ebtables filter uses the xt_string module. Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:04:12 +02:00
Bernie Harris	39c202d228	netfilter: ebtables: Add support for specifying match revision Currently ebtables assumes that the revision number of all match modules is 0, which is an issue when trying to use existing xtables matches with ebtables. The solution is to modify ebtables to allow extensions to specify a revision number, similar to iptables. This gets passed down to the kernel, which is then able to find the match module correctly. To main binary backwards compatibility, the size of the ebt_entry structures is not changed, only the size of the name field is decreased by 1 byte to make room for the revision field. Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-30 11:03:39 +02:00
John Fastabend	fa246693a1	bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT: Add support for the BPF_F_INGRESS flag in skb redirect helper. To do this convert skb into a scatterlist and push into ingress queue. This is the same logic that is used in the sk_msg redirect helper so it should feel familiar. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-30 00:09:43 +02:00
John Fastabend	8934ce2fd0	bpf: sockmap redirect ingress support Add support for the BPF_F_INGRESS flag in sk_msg redirect helper. To do this add a scatterlist ring for receiving socks to check before calling into regular recvmsg call path. Additionally, because the poll wakeup logic only checked the skb recv queue we need to add a hook in TCP stack (similar to write side) so that we have a way to wake up polling socks when a scatterlist is redirected to that sock. After this all that is needed is for the redirect helper to push the scatterlist into the psock receive queue. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-30 00:09:43 +02:00
David S. Miller	e15f20ea33	We have a fair number of patches, but many of them are from the first bullet here: * EAPoL-over-nl80211 from Denis - this will let us fix some long-standing issues with bridging, races with encryption and more * DFS offload support from the qtnfmac folks * regulatory database changes for the new ETSI adaptivity requirements * various other fixes and small enhancements -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlq85QQACgkQB8qZga/f l8Sdbg//bc8C/4F1TUdJqZWGK1j9Lwd6nZpP0iFqH5Ees3MMnti5XdV2dCL31ivz c8DOuRAU8ZG/RLPgSTHVZHwh+f7S5/TxSKg8WOBvrYk7a0C1uvVVhe5XZQEmqE7g eqM0+UQ5DyzUYnu0kSUrFKPV7BqDa2YzVDdK8e8iozqZmAnvGN9k8H7EDEeUxtxk LEl+bEcmhDSfIssU2Iaksl+9qoZP6BkoVGAOmDzIL654WV4XVKorxRX6vndqSQGu 0cCz2Occ+/0hfvszONBRR4M/gtI/Yyn3u+D1Q0YD3X40Q9gJE11fcodmMT61l5C7 rGcu94RIGilvRvjZScK3giiU2L7DD+VETUa+YGnjd8gLpmrYd6cURxlm4yuHbw8C UScLCApAUuY+skmPLeuyHW9mnzHaC336vzVjk8OhdNRhX23/rB8nk1yIywgqETVW g/iub8/Xp6TRfdyh76I18wlfqCp1It2JAeICgKH5NPlwUA6U0xFR0/ddSR8FuAcK ZLY8mgsc2kIH6r4x5sjeH+Yb6tGi/Z3HMZM2hna+t4vSpn6Q5+GPsA6yuHuBUhJb 8QswMiLDSux8I4guKgQyROiHaCzE3zOigJ4o1z9XITKsgluZVxnKr+ETKdr88WFp II8U0qH/kejXIfxUjbv5Wla70J9wi/hjxR6vOfSkEtYNvIApdfc= =hI0F -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2018-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== We have a fair number of patches, but many of them are from the first bullet here: * EAPoL-over-nl80211 from Denis - this will let us fix some long-standing issues with bridging, races with encryption and more * DFS offload support from the qtnfmac folks * regulatory database changes for the new ETSI adaptivity requirements * various other fixes and small enhancements ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 16:23:26 -04:00
Arnd Bergmann	7ae665f132	sctp: fix unused lable warning The proc file cleanup left a label possibly unused: net/sctp/protocol.c: In function 'sctp_defaults_init': net/sctp/protocol.c:1304:1: error: label 'err_init_proc' defined but not used [-Werror=unused-label] This adds an #ifdef around it to match the respective 'goto'. Fixes: `d47d08c8ca` ("sctp: use proc_remove_subtree()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:38:27 -04:00
Eric Dumazet	18dcbe12fe	ipv6: export ip6 fragments sysctl to unprivileged users IPv4 was changed in commit `52a773d645` ("net: Export ip fragment sysctl to unprivileged users") The only sysctl that is not per-netns is not used : ip6frag_secret_interval Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:14:03 -04:00
David Ahern	2233000cba	net/ipv6: Move call_fib6_entry_notifiers up for route adds Move call to call_fib6_entry_notifiers for new IPv6 routes to right before the insertion into the FIB. At this point notifier handlers can decide the fate of the new route with a clean path to delete the potential new entry if the notifier returns non-0. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:10:31 -04:00
David Ahern	c1d7ee67ac	net/ipv4: Allow notifier to fail route replace Add checking to call to call_fib_entry_notifiers for IPv4 route replace. Allows a notifier handler to fail the replace. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:10:30 -04:00
David Ahern	6635f311ea	net/ipv4: Move call_fib_entry_notifiers up for new routes Move call to call_fib_entry_notifiers for new IPv4 routes to right before the call to fib_insert_alias. At this point the only remaining failure path is memory allocations in fib_insert_node. Handle that very unlikely failure with a call to call_fib_entry_notifiers to tell drivers about it. At this point notifier handlers can decide the fate of the new route with a clean path to delete the potential new entry if the notifier returns non-0. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:10:30 -04:00
David Ahern	9776d32537	net: Move call_fib_rule_notifiers up in fib_nl_newrule Move call_fib_rule_notifiers up in fib_nl_newrule to the point right before the rule is inserted into the list. At this point there are no more failure paths within the core rule code, so if the notifier does not fail then the rule will be inserted into the list. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:10:30 -04:00
David Ahern	c30d9356e9	net: Fix fib notifer to return errno Notifier handlers use notifier_from_errno to convert any potential error to an encoded format. As a consequence the other side, call_fib_notifier{s} in this case, needs to use notifier_to_errno to return the error from the handler back to its caller. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 14:10:30 -04:00
Kirill Tkhai	152f253152	net: Remove rtnl_lock() in nf_ct_iterate_destroy() rtnl_lock() doesn't protect net::ct::count, and it's not needed for__nf_ct_unconfirmed_destroy() and for nf_queue_nf_hook_drop(). Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 13:47:54 -04:00
Kirill Tkhai	ec9c780925	ovs: Remove rtnl_lock() from ovs_exit_net() Here we iterate for_each_net() and removes vport from alive net to the exiting net. ovs_net::dps are protected by ovs_mutex(), and the others, who change it (ovs_dp_cmd_new(), __dp_destroy()) also take it. The same with datapath::ports list. So, we remove rtnl_lock() here. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 13:47:54 -04:00
Kirill Tkhai	10256debb9	net: Don't take rtnl_lock() in wireless_nlevent_flush() This function iterates over net_namespace_list and flushes the queue for every of them. What does this rtnl_lock() protects?! Since we may add skbs to net::wext_nlevents without rtnl_lock(), it does not protects us about queuers. It guarantees, two threads can't flush the queue in parallel, that can change the order, but since skb can be queued in any order, it doesn't matter, how many threads do this in parallel. In case of several threads, this will be even faster. So, we can remove rtnl_lock() here, as it was used for iteration over net_namespace_list only. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 13:47:53 -04:00
Kirill Tkhai	f0b07bb151	net: Introduce net_rwsem to protect net_namespace_list rtnl_lock() is used everywhere, and contention is very high. When someone wants to iterate over alive net namespaces, he/she has no a possibility to do that without exclusive lock. But the exclusive rtnl_lock() in such places is overkill, and it just increases the contention. Yes, there is already for_each_net_rcu() in kernel, but it requires rcu_read_lock(), and this can't be sleepable. Also, sometimes it may be need really prevent net_namespace_list growth, so for_each_net_rcu() is not fit there. This patch introduces new rw_semaphore, which will be used instead of rtnl_mutex to protect net_namespace_list. It is sleepable and allows not-exclusive iterations over net namespaces list. It allows to stop using rtnl_lock() in several places (what is made in next patches) and makes less the time, we keep rtnl_mutex. Here we just add new lock, while the explanation of we can remove rtnl_lock() there are in next patches. Fine grained locks generally are better, then one big lock, so let's do that with net_namespace_list, while the situation allows that. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 13:47:53 -04:00
David S. Miller	36fc2d727e	RxRPC development -----BEGIN PGP SIGNATURE----- iQIVAwUAWrrCyfu3V2unywtrAQJRgg//VbXndX7THWFy6NhfZL2lWH88zvhnF3om vVSxYqqrxAmnpUxtzQWa67By+LL7A6nnUJPhNm660GDfZQWDso4Vltji72unEAJT qvNqB2mcKJzdswBh+RB5QNJYmMhMs1flssUxxyBH1G0X1eWddfDal0QOz1nv0AqY E7WM++fbvo4RBi3jSZVJnOitHQLfYdQAcPv2LwaImUwWiKyaGxU54yqRk0r8Wcmh OZMipZFiHMFYHETPsu7ZI1lZHPPn+7zEyxRiIvvJlo8QOL/Sk7WsO/kKGhnsFGRQ yWsIePh6mU8lYmvwv3ZgqPf1uZu2kbInEQ0Mmzu/5T5VWEq2P1I9zsT7IT6khhYa BKxrzaVRd5D3GCl4HV7ANt2eDxtAIBjJkRNsXgs3OQEg93bokjP39PySU4RGtF7c Cb0kWBEk2HfdcKtQwb4ksdQ2g7Sby6p4wZ6YEH/hQybPXkApMqM2DCv1N9f0ELYk KbRxuo8ann5x7EXmXMUERZZYMiaGSf2FbcTwCFC9/WnUq3+qDYIJU8Wifv6nP7ti XcPw/brH8XkgOzE0/lSyKlD1e0XS+lf+HVVGYC0ws3bbR5UTSpjapD6CwzwxgUTZ NkKb/Erqtdt5XQoAg4QAW5snHsMQ1IDzM73BqEz4LVyNss4o7spCj9TWOu5y+5dy xFGAL2BI1+E= =BwJ2 -----END PGP SIGNATURE----- Merge tag 'rxrpc-next-20180327' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Tracing updates Here are some patches that update tracing in AF_RXRPC and AFS: (1) Add a tracepoint for tracking resend events. (2) Use debug_ids in traces rather than pointers (as pointers are now hashed) and allow use of the same debug_id in AFS calls as in the corresponding AF_RXRPC calls. This makes filtering the trace output much easier. (3) Add a tracepoint for tracking call completion. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 12:02:08 -04:00
David S. Miller	5568cdc368	ip_tunnel: Resolve ipsec merge conflict properly. We want to use dev_set_mtu() regardless of how we calculate the mtu value. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 11:42:14 -04:00
David S. Miller	56455e0998	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2018-03-29 1) Remove a redundant pointer initialization esp_input_set_header(). From Colin Ian King. 2) Mark the xfrm kmem_caches as __ro_after_init. From Alexey Dobriyan. 3) Do the checksum for an ipsec offlad packet in software if the device does not advertise NETIF_F_HW_ESP_TX_CSUM. From Shannon Nelson. 4) Use booleans for true and false instead of integers in xfrm_policy_cache_flush(). From Gustavo A. R. Silva Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 11:22:31 -04:00
David S. Miller	020295d95e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-03-29 1) Fix a rcu_read_lock/rcu_read_unlock imbalance in the error path of xfrm_local_error(). From Taehee Yoo. 2) Some VTI MTU fixes. From Stefano Brivio. 3) Fix a too early overwritten skb control buffer on xfrm transport mode. Please note that this pull request has a merge conflict in net/ipv4/ip_tunnel.c. The conflict is between commit `f6cc9c054e` ("ip_tunnel: Emit events for post-register MTU changes") from the net tree and commit `24fc79798b` ("ip_tunnel: Clamp MTU to bounds on new link") from the ipsec tree. It can be solved as it is currently done in linux-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-29 10:12:47 -04:00
Emmanuel Grumbach	c470bdc1aa	mac80211: don't WARN on bad WMM parameters from buggy APs Apparently, some APs are buggy enough to send a zeroed WMM IE. Don't WARN on this since this is not caused by a bug on the client's system. This aligns the condition of the WARNING in drv_conf_tx with the validity check in ieee80211_sta_wmm_params. We will now pick the default values whenever we get a zeroed WMM IE. This has been reported here: https://bugzilla.kernel.org/show_bug.cgi?id=199161 Fixes: `f409079bb6` ("mac80211: sanity check CW_min/CW_max towards driver") Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 15:02:38 +02:00
Denis Kenzior	018f6fbf54	mac80211: Send control port frames over nl80211 If userspace requested control port frames to go over 80211, then do so. The control packets are intercepted just prior to delivery of the packet to the underlying network device. Pre-authentication type frames (protocol: 0x88c7) are also forwarded over nl80211. Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 14:08:30 +02:00
Denis Kenzior	9118064914	mac80211: Add support for tx_control_port Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 14:01:27 +02:00
Denis Kenzior	1224f5831a	nl80211: Add control_port_over_nl80211 to mesh_setup Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 14:01:27 +02:00
Denis Kenzior	c3bfe1f6fc	nl80211: Add control_port_over_nl80211 for ibss Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 14:00:27 +02:00
Denis Kenzior	64bf3d4bc2	nl80211: Add CONTROL_PORT_OVER_NL80211 attribute Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 13:45:04 +02:00
Denis Kenzior	2576a9ace4	nl80211: Implement TX of control port frames This commit implements the TX side of NL80211_CMD_CONTROL_PORT_FRAME. Userspace provides the raw EAPoL frame using NL80211_ATTR_FRAME. Userspace should also provide the destination address and the protocol type to use when sending the frame. This is used to implement TX of Pre-authentication frames. If CONTROL_PORT_ETHERTYPE_NO_ENCRYPT is specified, then the driver will be asked not to encrypt the outgoing frame. A new EXT_FEATURE flag is introduced so that nl80211 code can check whether a given wiphy has capability to pass EAPoL frames over nl80211. Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 13:44:19 +02:00
Denis Kenzior	6a671a50f8	nl80211: Add CMD_CONTROL_PORT_FRAME API This commit also adds cfg80211_rx_control_port function. This is used to generate a CMD_CONTROL_PORT_FRAME event out to userspace. The conn_owner_nlportid is used as the unicast destination. This means that userspace must specify NL80211_ATTR_SOCKET_OWNER flag if control port over nl80211 routing is requested in NL80211_CMD_CONNECT, NL80211_CMD_ASSOCIATE, NL80211_CMD_START_AP or IBSS/mesh join. Signed-off-by: Denis Kenzior <denkenz@gmail.com> [johannes: fix return value of cfg80211_rx_control_port()] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 13:44:04 +02:00
Johannes Berg	4d191c7536	mac80211: remove shadowing duplicated variable We already have 'ifmgd' here, and it's already assigned to the same value, so remove the duplicate. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 11:35:23 +02:00
Manikanta Pubbisetty	db3bdcb9c3	mac80211: allow AP_VLAN operation on crypto controlled devices In the current implementation, mac80211 advertises the support of AP_VLANs based on the driver's support for AP mode; it also blocks encrypted AP_VLAN operation on devices advertising SW_CRYPTO_CONTROL. The implementation seems weird in it's current form and could be often confusing, this is because there can be drivers advertising both SW_CRYPTO_CONTROL and AP mode support (ex: ath10k) in which case AP_VLAN will still be supported but only in open BSS and not in secured BSS. When SW_CRYPTO_CONTROL is enabled, it makes more sense if the decision to support AP_VLANs is left to the driver. Mac80211 can then allow AP_VLAN operations depending on the driver support. Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 11:35:22 +02:00
Haim Dreyfuss	19d3577e35	cfg80211: Add API to allow querying regdb for wmm_rule In general regulatory self managed devices maintain their own regulatory profiles thus it doesn't have to query the regulatory database on country change. ETSI has recently introduced a new channel access mechanism for 5GHz that all wlan devices need to comply with. These values are stored in the regulatory database. There are self managed devices which can't maintain these values on their own. Add API to allow self managed regulatory devices to query the regulatory database for high band wmm rule. Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> [johannes: fix documentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 11:35:17 +02:00
Haim Dreyfuss	e552af0581	mac80211: limit wmm params to comply with ETSI requirements ETSI has recently added new requirements that restrict the WMM parameter values for 5GHz frequencies. We need to take care of the following scenarios in order to comply with these new requirements: 1. When using mac80211 default values; 2. When the userspace tries to configure its own values; 3. When associating to an AP which advertises WWM IE. When associating to an AP, the client uses the values in the advertised WMM IE. But the AP may not comply with the new ETSI requirements, so the client needs to check the current regulatory rules and use those limits accordingly. Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 11:11:50 +02:00
Johannes Berg	5bf16a11ba	cfg80211: don't require RTNL held for regdomain reads The whole code is set up to allow RCU reads of this data, but then uses rtnl_dereference() which requires the RTNL. Convert it to rcu_dereference_rtnl() which makes it require only RCU or the RTNL, to allow RCU-protected reading of the data. Reviewed-by: Coelho, Luciano <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 11:11:50 +02:00
Haim Dreyfuss	230ebaa189	cfg80211: read wmm rules from regulatory database ETSI EN 301 893 v2.1.1 (2017-05) standard defines a new channel access mechanism that all devices (WLAN and LAA) need to comply with. The regulatory database can now be loaded into the kernel and also has the option to load optional data. In order to be able to comply with ETSI standard, we add wmm_rule into regulatory rule and add the option to read its value from the regulatory database. Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> [johannes: fix memory leak in error path] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 11:11:40 +02:00
Denis Kenzior	466a306142	nl80211: Add SOCKET_OWNER support to START_AP Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:47:28 +02:00
Denis Kenzior	188c1b3c04	nl80211: Add SOCKET_OWNER support to JOIN_MESH Signed-off-by: Denis Kenzior <denkenz@gmail.com> [johannes: fix race with wdev lock/unlock by just acquiring once] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:38:24 +02:00
Denis Kenzior	f8d16d3edb	nl80211: Add SOCKET_OWNER support to JOIN_IBSS Signed-off-by: Denis Kenzior <denkenz@gmail.com> [johannes: fix race with wdev lock/unlock by just acquiring once] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:36:22 +02:00
Denis Kenzior	37b1c00468	cfg80211: Support all iftypes in autodisconnect_wk Currently autodisconnect_wk assumes that only interface types of P2P_CLIENT and STATION use conn_owner_nlportid. Change this so all interface types are supported. Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:22:19 +02:00
Dmitry Lebed	2c390e44e4	cfg80211: enable use of non-cleared DFS channels for DFS offload Currently channel switch/start_ap to DFS channel cannot be done to non-CAC-cleared channel even if DFS offload if enabled. Make non-cleared DFS channels available if DFS offload is enabled. CAC will be started by HW after channel change, start_ap call, etc. Signed-off-by: Dmitry Lebed <dlebed@quantenna.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:21:35 +02:00
Dmitry Lebed	850964519a	cfg80211: fix CAC_STARTED event handling Exclude CAC_STARTED event from !wdev->cac_started check, since cac_started will be set later in the same function. Signed-off-by: Dmitry Lebed <dlebed@quantenna.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:21:16 +02:00
tamizhr@codeaurora.org	97f5f4254a	mac80211: Use proper chan_width enum in sta opmode event Bandwidth change value reported via nl80211 contains mac80211 specific enum value(ieee80211_sta_rx_bw) and which is not understand by userspace application. Map the mac80211 specific value to nl80211_chan_width enum value to avoid using wrong value in the userspace application. And used station's ht/vht capability to map IEEE80211_STA_RX_BW_20 and IEEE80211_STA_RX_BW_160 with proper nl80211 value. Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:19:59 +02:00
tamizhr@codeaurora.org	57566b2003	mac80211: Use proper smps_mode enum in sta opmode event SMPS_MODE change value notified via nl80211 contains mac80211 specific value(ieee80211_smps_mode) and user space application will not know those values. This patch add support to map the mac80211 enum value to nl80211_smps_mode which will be understood by the userspace application. Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-29 10:19:55 +02:00
Masahiro Yamada	28913ee819	netfilter: nf_nat_snmp_basic: add correct dependency to Makefile nf_nat_snmp_basic_main.c includes a generated header, but the necessary dependency is missing in Makefile. This could cause build error in parallel building. Remove a weird line, and add a correct one. Fixes: `cc2d58634e` ("netfilter: nf_nat_snmp_basic: use asn1 decoder library") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>	2018-03-29 09:42:32 +09:00
Alexei Starovoitov	14624a9387	net/mac802154: disambiguate mac80215 vs mac802154 trace events two trace events defined with the same name and both unused. They conflict in allyesconfig build. Rename one of them. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-28 22:55:18 +02:00
Alexei Starovoitov	c105547501	treewide: remove large struct-pass-by-value from tracepoint arguments - fix trace_hfi1_ctxt_info() to pass large struct by reference instead of by value - convert 'type array[]' tracepoint arguments into 'type array', since compiler will warn that sizeof('type array[]') == sizeof('type array') and later should be used instead The CAST_TO_U64 macro in the later patch will enforce that tracepoint arguments can only be integers, pointers, or less than 8 byte structures. Larger structures should be passed by reference. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-28 22:55:18 +02:00
Nikita V. Shirokov	6f5c39fa5c	bpf: Add sock_ops R/W access to ipv4 tos Sample usage for tos ... bpf_getsockopt(skops, SOL_IP, IP_TOS, &v, sizeof(v)) ... where skops is a pointer to the ctx (struct bpf_sock_ops). Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-28 21:05:35 +02:00
David Howells	1bae5d2295	rxrpc: Trace call completion Add a tracepoint to track rxrpc calls moving into the completed state and to log the completion type and the recorded error value and abort code. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-27 23:08:20 +01:00
David Howells	a25e21f0bc	rxrpc, afs: Use debug_ids rather than pointers in traces In rxrpc and afs, use the debug_ids that are monotonically allocated to various objects as they're allocated rather than pointers as kernel pointers are now hashed making them less useful. Further, the debug ids aren't reused anywhere nearly as quickly. In addition, allow kernel services that use rxrpc, such as afs, to take numbers from the rxrpc counter, assign them to their own call struct and pass them in to rxrpc for both client and service calls so that the trace lines for each will have the same ID tag. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-27 23:03:00 +01:00
David Howells	827efed6a6	rxrpc: Trace resend Add a tracepoint to trace packet resend events and to dump the Tx annotation buffer for added illumination. Signed-off-by: David Howells <dhowells@rdhat.com>	2018-03-27 23:02:47 +01:00
Kirill Tkhai	8518e9bb98	net: Add more comments This adds comments to different places to improve readability. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 13:18:09 -04:00
Kirill Tkhai	4420bf21fb	net: Rename net_sem to pernet_ops_rwsem net_sem is some undefined area name, so it will be better to make the area more defined. Rename it to pernet_ops_rwsem for better readability and better intelligibility. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 13:18:09 -04:00
Kirill Tkhai	2f635ceeb2	net: Drop pernet_operations::async Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 13:18:09 -04:00
Kirill Tkhai	094374e5e1	net: Reflect all pernet_operations are converted All pernet_operations are reviewed and converted, hooray! Reflect this in core code: setup_net() and cleanup_net() will take down_read() always. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 13:18:07 -04:00
Ursula Braun	ab6f6dd18a	net/smc: use announced length in sock_recvmsg() Not every CLC proposal message needs the maximum buffer length. Due to the MSG_WAITALL flag, it is important to use the peeked real length when receiving the message. Fixes: `d63d271ce2` ("smc: switch to sock_recvmsg()") Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 11:59:49 -04:00
Cong Wang	b85ab56c3f	llc: properly handle dev_queue_xmit() return value llc_conn_send_pdu() pushes the skb into write queue and calls llc_conn_send_pdus() to flush them out. However, the status of dev_queue_xmit() is not returned to caller, in this case, llc_conn_state_process(). llc_conn_state_process() needs hold the skb no matter success or failure, because it still uses it after that, therefore we should hold skb before dev_queue_xmit() when that skb is the one being processed by llc_conn_state_process(). For other callers, they can just pass NULL and ignore the return value as they are. Reported-by: Noam Rathaus <noamr@beyondsecurity.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 11:56:00 -04:00
David S. Miller	5f75a1863e	mlx5-updates-2018-03-22 (Misc updates) This series includes misc updates for mlx5 core and netdev dirver, Highlights: From Inbar, three patches to add support for PFC stall prevention statistics and enable/disable through new ethtool tunable, as requested from previous submission. From Moshe, four patches, added more drop counters: - drop counter for netdev steering miss - drop counter for when VF logical link is down - drop counter for when netdev logical link is down. From Or, three patches to support vlan push/pop offload via tc HW action, for newer HW (Connectx-5 and onward) via HW steering flow actions rather than the emulated path for the older HW brands. And five more misc small trivial patches. -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJauVxkAAoJEEg/ir3gV/o+Oi4H/1Mzv+XEuHLwVJHpzMNVLVeR EK1GDW3724zjY24Iy+9FRppL1mV0hY3aw1qhD+rUmUKQx+Q3OBrH1WN2QJkjBkM9 i60ygFLTHvcHyPFmHtPzCckO7ODdWsXSlEwsl8kCSITQTn1Mjf6v7Mogd0CLCbwN iu8ocpHTD+/whE5gq9aDDrGtuzgJbPiRqavvsof9pc2g1JlnvtHs9xoiN+vzoqIa 9lGS+2UM2+wFw9rMtbdaqtfVPsupMeEdGZ98chn0gpWshOvOda5NuRE59M+5un5E C3nJ4Js//PLkvQ9Nu7+goXtbfPLCxvcuurid5TM2Su9hmD+9Us223TcAvdpOMoY= =VdCO -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2018-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2018-03-22 (Misc updates) This series includes misc updates for mlx5 core and netdev dirver, Highlights: From Inbar, three patches to add support for PFC stall prevention statistics and enable/disable through new ethtool tunable, as requested from previous submission. From Moshe, four patches, added more drop counters: - drop counter for netdev steering miss - drop counter for when VF logical link is down - drop counter for when netdev logical link is down. From Or, three patches to support vlan push/pop offload via tc HW action, for newer HW (Connectx-5 and onward) via HW steering flow actions rather than the emulated path for the older HW brands. And five more misc small trivial patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 11:05:23 -04:00
Dave Watson	cd00edc179	strparser: Fix sign of err codes strp_parser_err is called with a negative code everywhere, which then calls abort_parser with a negative code. strp_msg_timeout calls abort_parser directly with a positive code. Negate ETIMEDOUT to match signed-ness of other calls. The default abort_parser callback, strp_abort_strp, sets sk->sk_err to err. Also negate the error here so sk_err always holds a positive value, as the rest of the net code expects. Currently a negative sk_err can result in endless loops, or user code that thinks it actually sent/received err bytes. Found while testing net/tls_sw recv path. Fixes: `43a0c6751a` ("strparser: Stream parser for messages") Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 11:00:18 -04:00
Craig Dillabaugh	734549eb55	net sched actions: fix dumping which requires several messages to user space Fixes a bug in the tcf_dump_walker function that can cause some actions to not be reported when dumping a large number of actions. This issue became more aggrevated when cookies feature was added. In particular this issue is manifest when large cookie values are assigned to the actions and when enough actions are created that the resulting table must be dumped in multiple batches. The number of actions returned in each batch is limited by the total number of actions and the memory buffer size. With small cookies the numeric limit is reached before the buffer size limit, which avoids the code path triggering this bug. When large cookies are used buffer fills before the numeric limit, and the erroneous code path is hit. For example after creating 32 csum actions with the cookie aaaabbbbccccdddd $ tc actions ls action csum total acts 26 action order 0: csum (tcp) action continue index 1 ref 1 bind 0 cookie aaaabbbbccccdddd ..... action order 25: csum (tcp) action continue index 26 ref 1 bind 0 cookie aaaabbbbccccdddd total acts 6 action order 0: csum (tcp) action continue index 28 ref 1 bind 0 cookie aaaabbbbccccdddd ...... action order 5: csum (tcp) action continue index 32 ref 1 bind 0 cookie aaaabbbbccccdddd Note that the action with index 27 is omitted from the report. Fixes: `4b3550ef53` ("[NET_SCHED]: Use nla_nest_start/nla_nest_end")" Signed-off-by: Craig Dillabaugh <cdillaba@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:59:12 -04:00
Joe Perches	e32ac25018	ipv6: addrconf: Use normal debugging style Remove local ADBG macro and use netdev_dbg/pr_debug Miscellanea: o Remove unnecessary debug message after allocation failure as there already is a dump_stack() on the failure paths o Leave the allocation failure message on snmp6_alloc_dev as there is one code path that does not do a dump_stack() Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:54:40 -04:00
Eric Dumazet	1dfe82ebd7	net: fix possible out-of-bound read in skb_network_protocol() skb mac header is not necessarily set at the time skb_network_protocol() is called. Use skb->data instead. BUG: KASAN: slab-out-of-bounds in skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739 Read of size 2 at addr ffff8801b3097a0b by task syz-executor5/14242 CPU: 1 PID: 14242 Comm: syz-executor5 Not tainted 4.16.0-rc6+ #280 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_address_description+0x73/0x250 mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report+0x23c/0x360 mm/kasan/report.c:412 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:443 skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739 harmonize_features net/core/dev.c:2924 [inline] netif_skb_features+0x509/0x9b0 net/core/dev.c:3011 validate_xmit_skb+0x81/0xb00 net/core/dev.c:3084 validate_xmit_skb_list+0xbf/0x120 net/core/dev.c:3142 packet_direct_xmit+0x117/0x790 net/packet/af_packet.c:256 packet_snd net/packet/af_packet.c:2944 [inline] packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969 sock_sendmsg_nosec net/socket.c:629 [inline] sock_sendmsg+0xca/0x110 net/socket.c:639 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047 __sys_sendmsg+0xe5/0x210 net/socket.c:2081 Fixes: `19acc32725` ("gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pravin B Shelar <pshelar@ovn.org> Reported-by: Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:53:27 -04:00
Wei Yongjun	e1a22d13eb	tipc: tipc_node_create() can be static Fixes the following sparse warning: net/tipc/node.c:336:18: warning: symbol 'tipc_node_create' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:50:22 -04:00
Wei Yongjun	c76f2481b6	tipc: fix error handling in tipc_udp_enable() Release alloced resource before return from the error handling case in tipc_udp_enable(), otherwise will cause memory leak. Fixes: `52dfae5c85` ("tipc: obtain node identity from interface by default") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:50:02 -04:00
David S. Miller	d7785b59e7	Here are some batman-adv bugfixes: - fix multicast-via-unicast transmissions for AP isolation and gateway extension, by Linus Luessing (2 patches) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlq466kWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeobv5EACwz70r21xIUxwq6FWXysTfxxXt H3qcXt2BzCaGYKmYt8xqz9z368v7wpgCsPfZsvW2eBYhcpCkGQm6PR2XNy7WMoBz 6aFO4SDDi5L/S+5O0iXg79niRiTe9lCbOl/r6uv/2FtY22rIfhocFwkBL8cjIC3E hOWuFUbk3mAu8i+WMyhfDEUL33oy9CMvlJqaxqJuTHf3HxjOeGVjusOvSbQu+OGG K+TCAbIjazQ02R9p6lXQoIAjfg+kfsYIdT84MTgjk91qCtp1ztRxr3MNAwTgPvA4 Vi4uZGLbLCMvaAF1wqVBSuOnFFwbUCc8IBIoUvvZSMQTYm3agcg96pFLMYqNzrhY aHco1neJ+a+pGUmByijmYbsrJI2dYqK0V0OZlLG9WKwkrE3Y023LGGdXUbdl4RhS LXqMLGVJ9eqV+m7MCuv/q/PTVOL89loAun0DuIU4IhJuDM2+5yEG5jBiI6ImIym+ KX3F9F9nyefr5aCYsd14izX6WHxTXkJQaVpjVNBP56P6eZMLMx81eozBD9eotFyv A1EgQolLFEKWmtKU2KUK+qrGFNXaLlc9z8ZGkbizi/CTEjN0tr1UjW6B6toOOoZ+ fhF3HwlpgqVOdDodT+eDtxY8YboZBCpAnAOiE1LP3leVY4yab23Q02kskOPTPg1G CfpZwDjXIyZburUvfA== =j2w0 -----END PGP SIGNATURE----- Merge tag 'batadv-net-for-davem-20180326' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - fix multicast-via-unicast transmissions for AP isolation and gateway extension, by Linus Luessing (2 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:40:19 -04:00
Colin Ian King	8daf1a2d7e	net/ncsi: check for null return from call to nla_nest_start The call to nla_nest_start calls nla_put which can lead to a NULL return so it's possible for attr to become NULL and we can potentially get a NULL pointer dereference on attr. Fix this by checking for a NULL return. Detected by CoverityScan, CID#1466125 ("Dereference null return") Fixes: `955dc68cb9` ("net/ncsi: Add generic netlink family") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:38:26 -04:00
Xin Long	5306653850	sctp: remove unnecessary asoc in sctp_has_association After Commit `dae399d7fd` ("sctp: hold transport instead of assoc when lookup assoc in rx path"), it put transport instead of asoc in sctp_has_association. Variable 'asoc' is not used any more. So this patch is to remove it, while at it, it also changes the return type of sctp_has_association to bool, and does the same for it's caller sctp_endpoint_is_peeled_off. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 10:22:11 -04:00
Geert Uytterhoeven	b903036aad	net: Spelling s/stucture/structure/ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2018-03-27 09:51:23 +02:00
Inbar Karmy	e1577c1c88	ethtool: Add support for configuring PFC stall prevention in ethtool In the event where the device unexpectedly becomes unresponsive for a long period of time, flow control mechanism may propagate pause frames which will cause congestion spreading to the entire network. To prevent this scenario, when the device is stalled for a period longer than a pre-configured timeout, flow control mechanisms are automatically disabled. This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION as a tunable. This API provides support for configuring flow control storm prevention timeout (msec). Signed-off-by: Inbar Karmy <inbark@mellanox.com> Cc: Michal Kubecek <mkubecek@suse.cz> Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>	2018-03-26 13:46:46 -07:00
Yuval Mintz	8c13af2a21	ip6mr: Add refcounting to mfc Since ipmr and ip6mr are using the same mr_mfc struct at their core, we can now refactor the ipmr_cache_{hold,put} logic and apply refcounting to both ipmr and ip6mr. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:14:43 -04:00
Yuval Mintz	d3c07e5b99	ip6mr: Add API for default_rule fib Add the ability to discern whether a given FIB rule notification relates to the default rule inserted when registering ip6mr or a different one. Would later be used by drivers wishing to offload ipv6 multicast routes but unable to offload rules other than the default one. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:14:43 -04:00
Yuval Mintz	088aa3eec2	ip6mr: Support fib notifications In similar fashion to ipmr, support fib notifications for ip6mr mfc and vif related events. This would later allow drivers to react to said notifications and offload the IPv6 mroutes. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:14:43 -04:00
Yuval Mintz	cdc9f9443b	ipmr: Make ipmr_dump() common Since all the primitive elements used for the notification done by ipmr are now common [mr_table, mr_mfc, vif_device] we can refactor the logic for dumping them to a common file. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:14:43 -04:00
Yuval Mintz	54c4cad97b	ipmr: Make MFC fib notifiers common Like vif notifications, move the notifier struct for MFC as well as its helpers into a common file; Currently they're only used by ipmr. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:14:42 -04:00
Yuval Mintz	bc67a0daf8	ipmr: Make vif fib notifiers common The fib-notifiers are tightly coupled with the vif_device which is already common. Move the notifier struct definition and helpers to the common file; Currently they're only used by ipmr. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:14:42 -04:00
Kirill Tkhai	5e804a6077	net: Convert sunrpc_net_ops These pernet_operations look similar to rpcsec_gss_net_ops, they just create and destroy another caches. So, they also can be async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:03:26 -04:00
Kirill Tkhai	855aeba340	net: Convert rpcsec_gss_net_ops These pernet_operations initialize and destroy sunrpc_net_id refered per-net items. Only used global list is cache_list, and accesses already serialized. sunrpc_destroy_cache_detail() check for list_empty() without cache_list_lock, but when it's called from unregister_pernet_subsys(), there can't be callers in parallel, so we won't miss list_empty() in this case. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:03:26 -04:00
John Fastabend	eb82a99447	net: sched, fix OOO packets with pfifo_fast After the qdisc lock was dropped in pfifo_fast we allow multiple enqueue threads and dequeue threads to run in parallel. On the enqueue side the skb bit ooo_okay is used to ensure all related skbs are enqueued in-order. On the dequeue side though there is no similar logic. What we observe is with fewer queues than CPUs it is possible to re-order packets when two instances of __qdisc_run() are running in parallel. Each thread will dequeue a skb and then whichever thread calls the ndo op first will be sent on the wire. This doesn't typically happen because qdisc_run() is usually triggered by the same core that did the enqueue. However, drivers will trigger __netif_schedule() when queues are transitioning from stopped to awake using the netif_tx_wake_* APIs. When this happens netif_schedule() calls qdisc_run() on the same CPU that did the netif_tx_wake_* which is usually done in the interrupt completion context. This CPU is selected with the irq affinity which is unrelated to the enqueue operations. To resolve this we add a RUNNING bit to the qdisc to ensure only a single dequeue per qdisc is running. Enqueue and dequeue operations can still run in parallel and also on multi queue NICs we can still have a dequeue in-flight per qdisc, which is typically per CPU. Fixes: `c5ad119fb6` ("net: sched: pfifo_fast use skb_array") Reported-by: Jakob Unterwurzacher <jakob.unterwurzacher@theobroma-systems.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 12:36:23 -04:00
Joe Perches	d6444062f8	net: Use octal not symbolic permissions Prefer the direct use of octal for permissions. Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace and some typing. Miscellanea: o Whitespace neatening around these conversions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 12:07:48 -04:00
Kirill Tkhai	070f2d7e26	net: Drop NETDEV_UNREGISTER_FINAL Last user is gone after `bdf5bd7f21` "rds: tcp: remove register_netdevice_notifier infrastructure.", so we can remove this netdevice command. This allows to delete rtnl_lock() in netdev_run_todo(), which is hot path for net namespace unregistration. dev_change_net_namespace() and netdev_wait_allrefs() have rcu_barrier() before NETDEV_UNREGISTER_FINAL call, and the source commits say they were introduced to delemit the call with NETDEV_UNREGISTER, but this patch leaves them on the places, since they require additional analysis, whether we need in them for something else. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 11:34:00 -04:00
Kirill Tkhai	ede2762d93	net: Make NETDEV_XXX commands enum { } This patch is preparation to drop NETDEV_UNREGISTER_FINAL. Since the cmd is used in usnic_ib_netdev_event_to_string() to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL from everywhere, we'd have holes in event2str[] in this function. Instead of that, let's make NETDEV_XXX commands names available for everyone, and to define netdev_cmd_to_name() in the way we won't have to shaffle names after their numbers are changed. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 11:33:26 -04:00
Sagi Grimberg	a470143fc8	net/utils: Introduce inet_addr_is_any Can be useful to check INET_ANY address for both ipv4/ipv6 addresses. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-03-26 08:53:43 -06:00
kbuild test robot	da18ab32d7	tipc: tipc_disc_addr_trial_msg() can be static Fixes: `25b0b9c4e8` ("tipc: handle collisions of 32-bit node address hash values") Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Jon Maloy jon.maloy@ericsson.com Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-25 21:21:43 -04:00
Paolo Abeni	10b8a3de60	ipv6: the entire IPv6 header chain must fit the first fragment While building ipv6 datagram we currently allow arbitrary large extheaders, even beyond pmtu size. The syzbot has found a way to exploit the above to trigger the following splat: kernel BUG at ./include/linux/skbuff.h:2073! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 4230 Comm: syzkaller672661 Not tainted 4.16.0-rc2+ #326 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__skb_pull include/linux/skbuff.h:2073 [inline] RIP: 0010:__ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP: 0018:ffff8801bc18f0f0 EFLAGS: 00010293 RAX: ffff8801b17400c0 RBX: 0000000000000738 RCX: ffffffff84f01828 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b415ac18 RBP: ffff8801bc18f360 R08: ffff8801b4576844 R09: 0000000000000000 R10: ffff8801bc18f380 R11: ffffed00367aee4e R12: 00000000000000d6 R13: ffff8801b415a740 R14: dffffc0000000000 R15: ffff8801b45767c0 FS: 0000000001535880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002000b000 CR3: 00000001b4123001 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip6_finish_skb include/net/ipv6.h:969 [inline] udp_v6_push_pending_frames+0x269/0x3b0 net/ipv6/udp.c:1073 udpv6_sendmsg+0x2a96/0x3400 net/ipv6/udp.c:1343 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg+0xca/0x110 net/socket.c:640 ___sys_sendmsg+0x320/0x8b0 net/socket.c:2046 __sys_sendmmsg+0x1ee/0x620 net/socket.c:2136 SYSC_sendmmsg net/socket.c:2167 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2162 do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x4404c9 RSP: 002b:00007ffdce35f948 EFLAGS: 00000217 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004404c9 RDX: 0000000000000003 RSI: 0000000020001f00 RDI: 0000000000000003 RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8 R10: 0000000020000080 R11: 0000000000000217 R12: 0000000000401df0 R13: 0000000000401e80 R14: 0000000000000000 R15: 0000000000000000 Code: ff e8 1d 5e b9 fc e9 15 e9 ff ff e8 13 5e b9 fc e9 44 e8 ff ff e8 29 5e b9 fc e9 c0 e6 ff ff e8 3f f3 80 fc 0f 0b e8 38 f3 80 fc <0f> 0b 49 8d 87 80 00 00 00 4d 8d 87 84 00 00 00 48 89 85 20 fe RIP: __skb_pull include/linux/skbuff.h:2073 [inline] RSP: ffff8801bc18f0f0 RIP: __ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP: ffff8801bc18f0f0 As stated by RFC 7112 section 5: When a host fragments an IPv6 datagram, it MUST include the entire IPv6 Header Chain in the First Fragment. So this patch addresses the issue dropping datagrams with excessive extheader length. It also updates the error path to report to the calling socket nonnegative pmtu values. The issue apparently predates git history. v1 -> v2: cleanup error path, as per Eric's suggestion Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+91e6f9932ff122fa4410@syzkaller.appspotmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-25 21:17:20 -04:00
Alexander Potapenko	7880287981	netlink: make sure nladdr has correct size in netlink_connect() KMSAN reports use of uninitialized memory in the case when \|alen\| is smaller than sizeof(struct sockaddr_nl), and therefore \|nladdr\| isn't fully copied from the userspace. Signed-off-by: Alexander Potapenko <glider@google.com> Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-25 21:14:51 -04:00
Hans Wippel	bc58a1baf2	net/ipv4: disable SMC TCP option with SYN Cookies Currently, the SMC experimental TCP option in a SYN packet is lost on the server side when SYN Cookies are active. However, the corresponding SYNACK sent back to the client contains the SMC option. This causes an inconsistent view of the SMC capabilities on the client and server. This patch disables the SMC option in the SYNACK when SYN Cookies are active to avoid this issue. Fixes: `60e2a77807` ("tcp: TCP experimental option for SMC") Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-25 20:53:54 -04:00
Yonghong Song	13acc94eff	net: permit skb_segment on head_frag frag_list skb One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at function skb_segment(), line 3667. The bpf program attaches to clsact ingress, calls bpf_skb_change_proto to change protocol from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect to send the changed packet out. 3472 struct sk_buff skb_segment(struct sk_buff head_skb, 3473 netdev_features_t features) 3474 { 3475 struct sk_buff segs = NULL; 3476 struct sk_buff tail = NULL; ... 3665 while (pos < offset + len) { 3666 if (i >= nfrags) { 3667 BUG_ON(skb_headlen(list_skb)); 3668 3669 i = 0; 3670 nfrags = skb_shinfo(list_skb)->nr_frags; 3671 frag = skb_shinfo(list_skb)->frags; 3672 frag_skb = list_skb; ... call stack: ... #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525 #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7 #4 [ffff883ffef03668] die at ffffffff8101deb2 #5 [ffff883ffef03698] do_trap at ffffffff8101a700 #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0 #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab [exception RIP: skb_segment+3044] RIP: ffffffff817e4dd4 RSP: ffff883ffef03860 RFLAGS: 00010216 RAX: 0000000000002bf6 RBX: ffff883feb7aaa00 RCX: 0000000000000011 RDX: ffff883fb87910c0 RSI: 0000000000000011 RDI: ffff883feb7ab500 RBP: ffff883ffef03928 R8: 0000000000002ce2 R9: 00000000000027da R10: 000001ea00000000 R11: 0000000000002d82 R12: ffff883f90a1ee80 R13: ffff883fb8791120 R14: ffff883feb7abc00 R15: 0000000000002ce2 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7 Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-25 16:46:04 -04:00
David S. Miller	b9ee96b45f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Don't pick fixed hash implementation for NFT_SET_EVAL sets, otherwise userspace hits EOPNOTSUPP with valid rules using the meter statement, from Florian Westphal. 2) If you send a batch that flushes the existing ruleset (that contains a NAT chain) and the new ruleset definition comes with a new NAT chain, don't bogusly hit EBUSY. Also from Florian. 3) Missing netlink policy attribute validation, from Florian. 4) Detach conntrack template from skbuff if IP_NODEFRAG is set on, from Paolo Abeni. 5) Cache device names in flowtable object, otherwise we may end up walking over devices going aways given no rtnl_lock is held. 6) Fix incorrect net_device ingress with ingress hooks. 7) Fix crash when trying to read more data than available in UDP packets from the nf_socket infrastructure, from Subash. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-24 17:10:01 -04:00
Subash Abhinov Kasiviswanathan	32c1733f0d	netfilter: nf_socket: Fix out of bounds access in nf_sk_lookup_slow_v{4,6} skb_header_pointer will copy data into a buffer if data is non linear, otherwise it will return a pointer in the linear section of the data. nf_sk_lookup_slow_v{4,6} always copies data of size udphdr but later accesses memory within the size of tcphdr (th->doff) in case of TCP packets. This causes a crash when running with KASAN with the following call stack - BUG: KASAN: stack-out-of-bounds in xt_socket_lookup_slow_v4+0x524/0x718 net/netfilter/xt_socket.c:178 Read of size 2 at addr ffffffe3d417a87c by task syz-executor/28971 CPU: 2 PID: 28971 Comm: syz-executor Tainted: G B W O 4.9.65+ #1 Call trace: [<ffffff9467e8d390>] dump_backtrace+0x0/0x428 arch/arm64/kernel/traps.c:76 [<ffffff9467e8d7e0>] show_stack+0x28/0x38 arch/arm64/kernel/traps.c:226 [<ffffff946842d9b8>] __dump_stack lib/dump_stack.c:15 [inline] [<ffffff946842d9b8>] dump_stack+0xd4/0x124 lib/dump_stack.c:51 [<ffffff946811d4b0>] print_address_description+0x68/0x258 mm/kasan/report.c:248 [<ffffff946811d8c8>] kasan_report_error mm/kasan/report.c:347 [inline] [<ffffff946811d8c8>] kasan_report.part.2+0x228/0x2f0 mm/kasan/report.c:371 [<ffffff946811df44>] kasan_report+0x5c/0x70 mm/kasan/report.c:372 [<ffffff946811bebc>] check_memory_region_inline mm/kasan/kasan.c:308 [inline] [<ffffff946811bebc>] __asan_load2+0x84/0x98 mm/kasan/kasan.c:739 [<ffffff94694d6f04>] __tcp_hdrlen include/linux/tcp.h:35 [inline] [<ffffff94694d6f04>] xt_socket_lookup_slow_v4+0x524/0x718 net/netfilter/xt_socket.c:178 Fix this by copying data into appropriate size headers based on protocol. Fixes: `a583636a83` ("inet: refactor inet[6]_lookup functions to take skb") Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-24 21:17:14 +01:00
Linus Lüssing	a752c0a452	batman-adv: fix packet loss for broadcasted DHCP packets to a server DHCP connectivity issues can currently occur if the following conditions are met: 1) A DHCP packet from a client to a server 2) This packet has a multicast destination 3) This destination has a matching entry in the translation table (FF:FF:FF:FF:FF:FF for IPv4, 33:33:00:01:00:02/33:33:00:01:00:03 for IPv6) 4) The orig-node determined by TT for the multicast destination does not match the orig-node determined by best-gateway-selection In this case the DHCP packet will be dropped. The "gateway-out-of-range" check is supposed to only be applied to unicasted DHCP packets to a specific DHCP server. In that case dropping the the unicasted frame forces the client to retry via a broadcasted one, but now directed to the new best gateway. A DHCP packet with broadcast/multicast destination is already ensured to always be delivered to the best gateway. Dropping a multicasted DHCP packet here will only prevent completing DHCP as there is no other fallback. So far, it seems the unicast check was implicitly performed by expecting the batadv_transtable_search() to return NULL for multicast destinations. However, a multicast address could have always ended up in the translation table and in fact is now common. To fix this potential loss of a DHCP client-to-server packet to a multicast address this patch adds an explicit multicast destination check to reliably bail out of the gateway-out-of-range check for such destinations. The issue and fix were tested in the following three node setup: - Line topology, A-B-C - A: gateway client, DHCP client - B: gateway server, hop-penalty increased: 30->60, DHCP server - C: gateway server, code modifications to announce FF:FF:FF:FF:FF:FF Without this patch, A would never transmit its DHCP Discover packet due to an always "out-of-range" condition. With this patch, a full DHCP handshake between A and B was possible again. Fixes: `be7af5cf9c` ("batman-adv: refactoring gateway handling code") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2018-03-24 10:25:49 +01:00
Linus Lüssing	f8fb3419ea	batman-adv: fix multicast-via-unicast transmission with AP isolation For multicast frames AP isolation is only supposed to be checked on the receiving nodes and never on the originating one. Furthermore, the isolation or wifi flag bits should only be intepreted as such for unicast and never multicast TT entries. By injecting flags to the multicast TT entry claimed by a single target node it was verified in tests that this multicast address becomes unreachable, leading to packet loss. Omitting the "src" parameter to the batadv_transtable_search() call successfully skipped the AP isolation check and made the target reachable again. Fixes: `1d8ab8d3c1` ("batman-adv: Modified forwarding behaviour for multicast packets") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2018-03-24 10:25:07 +01:00
Davide Caratti	94cb549240	net/sched: act_vlan: declare push_vid with host byte order use u16 in place of __be16 to suppress the following sparse warnings: net/sched/act_vlan.c:150:26: warning: incorrect type in assignment (different base types) net/sched/act_vlan.c:150:26: expected restricted __be16 [usertype] push_vid net/sched/act_vlan.c:150:26: got unsigned short net/sched/act_vlan.c:151:21: warning: restricted __be16 degrades to integer net/sched/act_vlan.c:208:26: warning: incorrect type in assignment (different base types) net/sched/act_vlan.c:208:26: expected unsigned short [unsigned] [usertype] tcfv_push_vid net/sched/act_vlan.c:208:26: got restricted __be16 [usertype] push_vid Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 21:54:07 -04:00
Davide Caratti	affaa0c724	net/sched: remove tcf_idr_cleanup() tcf_idr_cleanup() is no more used, so remove it. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 21:52:19 -04:00
Eric Dumazet	1bfa26ff8c	ipv6: fix possible deadlock in rt6_age_examine_exception() syzbot reported a LOCKDEP splat [1] in rt6_age_examine_exception() rt6_age_examine_exception() is called while rt6_exception_lock is held. This lock is the lower one in the lock hierarchy, thus we can not call dst_neigh_lookup() function, as it can fallback to neigh_create() We should instead do a pure RCU lookup. As a bonus we avoid a pair of atomic operations on neigh refcount. [1] WARNING: possible circular locking dependency detected 4.16.0-rc4+ #277 Not tainted syz-executor7/4015 is trying to acquire lock: (&ndev->lock){++--}, at: [<00000000416dce19>] __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928 but task is already holding lock: (&tbl->lock){++-.}, at: [<00000000b5cb1d65>] neigh_ifdown+0x3d/0x250 net/core/neighbour.c:292 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&tbl->lock){++-.}: __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline] _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312 __neigh_create+0x87e/0x1d90 net/core/neighbour.c:528 neigh_create include/net/neighbour.h:315 [inline] ip6_neigh_lookup+0x9a7/0xba0 net/ipv6/route.c:228 dst_neigh_lookup include/net/dst.h:405 [inline] rt6_age_examine_exception net/ipv6/route.c:1609 [inline] rt6_age_exceptions+0x381/0x660 net/ipv6/route.c:1645 fib6_age+0xfb/0x140 net/ipv6/ip6_fib.c:2033 fib6_clean_node+0x389/0x580 net/ipv6/ip6_fib.c:1919 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1845 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1893 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1970 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1986 fib6_clean_all net/ipv6/ip6_fib.c:1997 [inline] fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2053 ndisc_netdev_event+0x3c2/0x4a0 net/ipv6/ndisc.c:1781 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x262/0x430 net/core/dev.c:6960 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919 sock_do_ioctl+0xef/0x390 net/socket.c:957 sock_ioctl+0x36b/0x610 net/socket.c:1081 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #2 (rt6_exception_lock){+.-.}: __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168 spin_lock_bh include/linux/spinlock.h:315 [inline] rt6_flush_exceptions+0x21/0x210 net/ipv6/route.c:1367 fib6_del_route net/ipv6/ip6_fib.c:1677 [inline] fib6_del+0x624/0x12c0 net/ipv6/ip6_fib.c:1761 __ip6_del_rt+0xc7/0x120 net/ipv6/route.c:2980 ip6_del_rt+0x132/0x1a0 net/ipv6/route.c:2993 __ipv6_dev_ac_dec+0x3b1/0x600 net/ipv6/anycast.c:332 ipv6_dev_ac_dec net/ipv6/anycast.c:345 [inline] ipv6_sock_ac_close+0x2b4/0x3e0 net/ipv6/anycast.c:200 inet6_release+0x48/0x70 net/ipv6/af_inet6.c:433 sock_release+0x8d/0x1e0 net/socket.c:594 sock_close+0x16/0x20 net/socket.c:1149 __fput+0x327/0x7e0 fs/file_table.c:209 ____fput+0x15/0x20 fs/file_table.c:243 task_work_run+0x199/0x270 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x9bb/0x1ad0 kernel/exit.c:865 do_group_exit+0x149/0x400 kernel/exit.c:968 get_signal+0x73a/0x16d0 kernel/signal.c:2469 do_signal+0x90/0x1e90 arch/x86/kernel/signal.c:809 exit_to_usermode_loop+0x258/0x2f0 arch/x86/entry/common.c:162 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:265 [inline] do_syscall_64+0x6ec/0x940 arch/x86/entry/common.c:292 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #1 (&(&tb->tb6_lock)->rlock){+.-.}: __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168 spin_lock_bh include/linux/spinlock.h:315 [inline] __ip6_ins_rt+0x56/0x90 net/ipv6/route.c:1007 ip6_route_add+0x141/0x190 net/ipv6/route.c:2955 addrconf_prefix_route+0x44f/0x620 net/ipv6/addrconf.c:2359 fixup_permanent_addr net/ipv6/addrconf.c:3368 [inline] addrconf_permanent_addr net/ipv6/addrconf.c:3391 [inline] addrconf_notify+0x1ad2/0x2310 net/ipv6/addrconf.c:3460 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x15d/0x430 net/core/dev.c:6958 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 do_setlink+0xa22/0x3bb0 net/core/rtnetlink.c:2357 rtnl_newlink+0xf37/0x1a50 net/core/rtnetlink.c:2965 rtnetlink_rcv_msg+0x57f/0xb10 net/core/rtnetlink.c:4641 netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4659 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline] netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897 sock_sendmsg_nosec net/socket.c:629 [inline] sock_sendmsg+0xca/0x110 net/socket.c:639 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047 __sys_sendmsg+0xe5/0x210 net/socket.c:2081 SYSC_sendmsg net/socket.c:2092 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2088 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #0 (&ndev->lock){++--}: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline] _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312 __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928 ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961 pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392 pneigh_ifdown net/core/neighbour.c:695 [inline] neigh_ifdown+0x149/0x250 net/core/neighbour.c:294 rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874 addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633 addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x262/0x430 net/core/dev.c:6960 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919 packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066 sock_do_ioctl+0xef/0x390 net/socket.c:957 sock_ioctl+0x36b/0x610 net/socket.c:1081 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 other info that might help us debug this: Chain exists of: &ndev->lock --> rt6_exception_lock --> &tbl->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&tbl->lock); lock(rt6_exception_lock); lock(&tbl->lock); lock(&ndev->lock); * DEADLOCK * 2 locks held by syz-executor7/4015: #0: (rtnl_mutex){+.+.}, at: [<00000000a2f16daa>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (&tbl->lock){++-.}, at: [<00000000b5cb1d65>] neigh_ifdown+0x3d/0x250 net/core/neighbour.c:292 stack backtrace: CPU: 0 PID: 4015 Comm: syz-executor7 Not tainted 4.16.0-rc4+ #277 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_circular_bug.isra.38+0x2cd/0x2dc kernel/locking/lockdep.c:1223 check_prev_add kernel/locking/lockdep.c:1863 [inline] check_prevs_add kernel/locking/lockdep.c:1976 [inline] validate_chain kernel/locking/lockdep.c:2417 [inline] __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3431 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline] _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312 __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928 ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961 pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392 pneigh_ifdown net/core/neighbour.c:695 [inline] neigh_ifdown+0x149/0x250 net/core/neighbour.c:294 rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874 addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633 addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x262/0x430 net/core/dev.c:6960 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919 packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066 sock_do_ioctl+0xef/0x390 net/socket.c:957 sock_ioctl+0x36b/0x610 net/socket.c:1081 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: `c757faa8bf` ("ipv6: prepare fib6_age() for exception table") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:40:34 -04:00
Jon Maloy	52dfae5c85	tipc: obtain node identity from interface by default Selecting and explicitly configuring a TIPC node identity may be unwanted in some cases. In this commit we introduce a default setting if the identity has not been set at the moment the first bearer is enabled. We do this by using a raw copy of a unique identifier from the used interface: MAC address in the case of an L2 bearer, IPv4/IPv6 address in the case of a UDP bearer. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:18 -04:00
Jon Maloy	25b0b9c4e8	tipc: handle collisions of 32-bit node address hash values When a 32-bit node address is generated from a 128-bit identifier, there is a risk of collisions which must be discovered and handled. We do this as follows: - We don't apply the generated address immediately to the node, but do instead initiate a 1 sec trial period to allow other cluster members to discover and handle such collisions. - During the trial period the node periodically sends out a new type of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast, to all the other nodes in the cluster. - When a node is receiving such a message, it must check that the presented 32-bit identifier either is unused, or was used by the very same peer in a previous session. In both cases it accepts the request by not responding to it. - If it finds that the same node has been up before using a different address, it responds with a DSC_TRIAL_FAIL_MSG containing that address. - If it finds that the address has already been taken by some other node, it generates a new, unused address and returns it to the requester. - During the trial period the requesting node must always be prepared to accept a failure message, i.e., a message where a peer suggests a different (or equal) address to the one tried. In those cases it must apply the suggested value as trial address and restart the trial period. This algorithm ensures that in the vast majority of cases a node will have the same address before and after a reboot. If a legacy user configures the address explicitly, there will be no trial period and messages, so this protocol addition is completely backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:18 -04:00
Jon Maloy	d50ccc2d39	tipc: add 128-bit node identifier We add a 128-bit node identity, as an alternative to the currently used 32-bit node address. For the sake of compatibility and to minimize message header changes we retain the existing 32-bit address field. When not set explicitly by the user, this field will be filled with a hash value generated from the much longer node identity, and be used as a shorthand value for the latter. We permit either the address or the identity to be set by configuration, but not both, so when the address value is set by a legacy user the corresponding 128-bit node identity is generated based on the that value. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:18 -04:00
Jon Maloy	23fd3eace0	tipc: remove direct accesses to own_addr field in struct tipc_net As a preparation to changing the addressing structure of TIPC we replace all direct accesses to the tipc_net::own_addr field with the function dedicated for this, tipc_own_addr(). There are no changes to program logics in this commit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:18 -04:00
Jon Maloy	b89afb116c	tipc: allow closest-first lookup algorithm when legacy address is configured The removal of an internal structure of the node address has an unwanted side effect. - Currently, if a user is sending an anycast message with destination domain 0, the tipc_namebl_translate() function will use the 'closest- first' algorithm to first look for a node local destination, and only when no such is found, will it resort to the cluster global 'round- robin' lookup algorithm. - Current users can get around this, and enforce unconditional use of global round-robin by indicating a destination as Z.0.0 or Z.C.0. - This option disappears when we make the node address flat, since the lookup algorithm has no way of recognizing this case. So, as long as there are node local destinations, the algorithm will always select one of those, and there is nothing the sender can do to change this. We solve this by eliminating the 'closest-first' option, which was never a good idea anyway, for non-legacy users, but only for those. To distinguish between legacy users and non-legacy users we introduce a new flag 'legacy_addr_format' in struct tipc_core, to be set when the user configures a legacy-style Z.C.N node address. Hence, when a legacy user indicates a zero lookup domain 'closest-first' is selected, and in all other cases we use 'round-robin'. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:18 -04:00
Jon Maloy	2026364149	tipc: remove restrictions on node address values Nominally, TIPC organizes network nodes into a three-level network hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This hierarchy is reflected in the node address format, - it is sub-divided into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id. However, the 'zone' and 'cluster' levels have in reality never been fully implemented,and never will be. The result of this has been that the first 20 bits the node identity structure have been wasted, and the usable node identity range within a cluster has been limited to 12 bits. This is starting to become a problem. In the following commits, we will need to be able to connect between nodes which are using the whole 32-bit value space of the node address. We therefore remove the restrictions on which values can be assigned to node identity, -it is from now on only a 32-bit integer with no assumed internal structure. Isolation between clusters is now achieved only by setting different values for the 'network id' field used during neighbor discovery, in practice leading to the latter becoming the new cluster identity. The rules for accepting discovery requests/responses from neighboring nodes now become: - If the user is using legacy address format on both peers, reception of discovery messages is subject to the legacy lookup domain check in addition to the cluster id check. - Otherwise, the discovery request/response is always accepted, provided both peers have the same network id. This secures backwards compatibility for users who have been using zone or cluster identities as cluster separators, instead of the intended 'network id'. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:18 -04:00
Jon Maloy	b39e465e56	tipc: some cleanups in the file discover.c To facilitate the coming changes in the neighbor discovery functionality we make some renaming and refactoring of that code. The functional changes in this commit are trivial, e.g., that we move the message sending call in tipc_disc_timeout() outside the spinlock protected region. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:17 -04:00
Jon Maloy	cb30a63384	tipc: refactor function tipc_enable_bearer() As a preparation for the next commits we try to reduce the footprint of the function tipc_enable_bearer(), while hopefully making is simpler to follow. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:12:17 -04:00
Kirill Tkhai	b2864fbdc5	net: Convert rxrpc_net_ops These pernet_operations modifies rxrpc_net_id-pointed per-net entities. There is external link to AF_RXRPC in fs/afs/Kconfig, but it seems there is no other pernet_operations interested in that per-net entities. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:00:47 -04:00
Kirill Tkhai	fc18999ed2	net: Convert udp_sysctl_ops These pernet_operations just initialize udp4 defaults. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 13:00:46 -04:00
Petr Machata	f6cc9c054e	ip_tunnel: Emit events for post-register MTU changes For tunnels created with IFLA_MTU, MTU of the netdevice is set by rtnl_create_link() (called from rtnl_newlink()) before the device is registered. However without IFLA_MTU that's not done. rtnl_newlink() proceeds by calling struct rtnl_link_ops.newlink, which via ip_tunnel_newlink() calls register_netdevice(), and that emits NETDEV_REGISTER. Thus any listeners that inspect the netdevice get the MTU of 0. After ip_tunnel_newlink() corrects the MTU after registering the netdevice, but since there's no event, the listeners don't get to know about the MTU until something else happens--such as a NETDEV_UP event. That's not ideal. So instead of setting the MTU directly, go through dev_set_mtu(), which takes care of distributing the necessary NETDEV_PRECHANGEMTU and NETDEV_CHANGEMTU events. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:54:34 -04:00
Nikolay Aleksandrov	82792a070b	net: bridge: fix direct access to bridge vlan_enabled and use helper We need to use br_vlan_enabled() helper otherwise we'll break builds without bridge vlans: net/bridge//br_if.c: In function ‘br_mtu’: net/bridge//br_if.c:458:8: error: ‘const struct net_bridge’ has no member named ‘vlan_enabled’ if (br->vlan_enabled) ^ net/bridge//br_if.c:462:1: warning: control reaches end of non-void function [-Wreturn-type] } ^ scripts/Makefile.build:324: recipe for target 'net/bridge//br_if.o' failed Fixes: `419d14af9e` ("bridge: Allow max MTU when multiple VLANs present") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:41:14 -04:00
Dave Watson	c46234ebb4	tls: RX path for ktls Add rx path for tls software implementation. recvmsg, splice_read, and poll implemented. An additional sockopt TLS_RX is added, with the same interface as TLS_TX. Either TLX_RX or TLX_TX may be provided separately, or together (with two different setsockopt calls with appropriate keys). Control messages are passed via CMSG in a similar way to transmit. If no cmsg buffer is passed, then only application data records will be passed to userspace, and EIO is returned for other types of alerts. EBADMSG is passed for decryption errors, and EMSGSIZE is passed for framing too big, and EBADMSG for framing too small (matching openssl semantics). EINVAL is returned for TLS versions that do not match the original setsockopt call. All are unrecoverable. strparser is used to parse TLS framing. Decryption is done directly in to userspace buffers if they are large enough to support it, otherwise sk_cow_data is called (similar to ipsec), and buffers are decrypted in place and copied. splice_read always decrypts in place, since no buffers are provided to decrypt in to. sk_poll is overridden, and only returns POLLIN if a full TLS message is received. Otherwise we wait for strparser to finish reading a full frame. Actual decryption is only done during recvmsg or splice_read calls. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:25:54 -04:00
Dave Watson	583715853a	tls: Refactor variable names Several config variables are prefixed with tx, drop the prefix since these will be used for both tx and rx. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:25:54 -04:00
Dave Watson	f4a8e43f1f	tls: Pass error code explicitly to tls_err_abort Pass EBADMSG explicitly to tls_err_abort. Receive path will pass additional codes - EMSGSIZE if framing is larger than max TLS record size, EINVAL if TLS version mismatch. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:25:54 -04:00
Dave Watson	dbe425599b	tls: Move cipher info to a separate struct Separate tx crypto parameters to a separate cipher_context struct. The same parameters will be used for rx using the same struct. tls_advance_record_sn is modified to only take the cipher info. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:25:53 -04:00
Dave Watson	69ca9293e8	tls: Generalize zerocopy_from_iter Refactor zerocopy_from_iter to take arguments for pages and size, such that it can be used for both tx and rx. RX will also support zerocopy direct to output iter, as long as the full message can be copied at once (a large enough userspace buffer was provided). Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:25:53 -04:00
Chas Williams	419d14af9e	bridge: Allow max MTU when multiple VLANs present If the bridge is allowing multiple VLANs, some VLANs may have different MTUs. Instead of choosing the minimum MTU for the bridge interface, choose the maximum MTU of the bridge members. With this the user only needs to set a larger MTU on the member ports that are participating in the large MTU VLANS. Signed-off-by: Chas Williams <3chas3@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 12:17:30 -04:00
David S. Miller	03fe2debbb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Fun set of conflict resolutions here... For the mac80211 stuff, these were fortunately just parallel adds. Trivially resolved. In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the function phy_disable_interrupts() earlier in the file, whilst in 'net-next' the phy_error() call from this function was removed. In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the 'rt_table_id' member of rtable collided with a bug fix in 'net' that added a new struct member "rt_mtu_locked" which needs to be copied over here. The mlxsw driver conflict consisted of net-next separating the span code and definitions into separate files, whilst a 'net' bug fix made some changes to that moved code. The mlx5 infiniband conflict resolution was quite non-trivial, the RDMA tree's merge commit was used as a guide here, and here are their notes: ==================== Due to bug fixes found by the syzkaller bot and taken into the for-rc branch after development for the 4.17 merge window had already started being taken into the for-next branch, there were fairly non-trivial merge issues that would need to be resolved between the for-rc branch and the for-next branch. This merge resolves those conflicts and provides a unified base upon which ongoing development for 4.17 can be based. Conflicts: drivers/infiniband/hw/mlx5/main.c - Commit `42cea83f95` (IB/mlx5: Fix cleanup order on unload) added to for-rc and commit `b5ca15ad7e` (IB/mlx5: Add proper representors support) add as part of the devel cycle both needed to modify the init/de-init functions used by mlx5. To support the new representors, the new functions added by the cleanup patch needed to be made non-static, and the init/de-init list added by the representors patch needed to be modified to match the init/de-init list changes made by the cleanup patch. Updates: drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function prototypes added by representors patch to reflect new function names as changed by cleanup patch drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init stage list to match new order from cleanup patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 11:31:58 -04:00
Pradeep Kumar Chitrapu	dcbe73ca55	mac80211: notify driver for change in multicast rates With drivers implementing rate control in driver or firmware rate_control_send_low() may not get called, and thus the driver needs to know about changes in the multicast rate. Add and use a new BSS change flag for this. Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org> [rewrite commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-23 13:23:17 +01:00
Steffen Klassert	9a3fb9fb84	xfrm: Fix transport mode skb control buffer usage. A recent commit introduced a new struct xfrm_trans_cb that is used with the sk_buff control buffer. Unfortunately it placed the structure in front of the control buffer and overlooked that the IPv4/IPv6 control buffer is still needed for some layer 4 protocols. As a result the IPv4/IPv6 control buffer is overwritten with this structure. Fix this by setting a apropriate header in front of the structure. Fixes `acf568ee85` ("xfrm: Reinject transport-mode packets ...") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-03-23 07:56:04 +01:00
Kirill Tkhai	d9ff304973	net: Replace ip_ra_lock with per-net mutex Since ra_chain is per-net, we may use per-net mutexes to protect them in ip_ra_control(). This improves scalability. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 15:12:56 -04:00
Kirill Tkhai	5796ef75ec	net: Make ip_ra_chain per struct net This is optimization, which makes ip_call_ra_chain() iterate less sockets to find the sockets it's looking for. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 15:12:56 -04:00
Kirill Tkhai	128aaa98ad	net: Revert "ipv4: fix a deadlock in ip_ra_control" This reverts commit `1215e51eda`. Since raw_close() is used on every RAW socket destruction, the changes made by `1215e51eda` scale sadly. This clearly seen on endless unshare(CLONE_NEWNET) test, and cleanup_net() kwork spends a lot of time waiting for rtnl_lock() introduced by this commit. Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(), so we revert this patch. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 15:12:56 -04:00
Kirill Tkhai	0526947f9d	net: Move IP_ROUTER_ALERT out of lock_sock(sk) ip_ra_control() does not need sk_lock. Who are the another users of ip_ra_chain? ip_mroute_setsockopt() doesn't take sk_lock, while parallel IP_ROUTER_ALERT syscalls are synchronized by ip_ra_lock. So, we may move this command out of sk_lock. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 15:12:55 -04:00
Kirill Tkhai	76d3e153d0	net: Revert "ipv4: get rid of ip_ra_lock" This reverts commit `ba3f571d5d`. The commit was made after `1215e51eda` "ipv4: fix a deadlock in ip_ra_control", and killed ip_ra_lock, which became useless after rtnl_lock() made used to destroy every raw ipv4 socket. This scales very bad, and next patch in series reverts `1215e51eda`. ip_ra_lock will be used again. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 15:12:55 -04:00
Colin Ian King	1574639411	gre: fix TUNNEL_SEQ bit check on sequence numbering The current logic of flags \| TUNNEL_SEQ is always non-zero and hence sequence numbers are always incremented no matter the setting of the TUNNEL_SEQ bit. Fix this by using & instead of \|. Detected by CoverityScan, CID#1466039 ("Operands don't affect result") Fixes: `77a5196a80` ("gre: add sequence number for collect md mode.") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 14:52:43 -04:00
GhantaKrishnamurthy MohanKrishna	872619d8cf	tipc: step sk->sk_drops when rcv buffer is full Currently when tipc is unable to queue a received message on a socket, the message is rejected back to the sender with error TIPC_ERR_OVERLOAD. However, the application on this socket has no knowledge about these discards. In this commit, we try to step the sk_drops counter when tipc is unable to queue a received message. Export sk_drops using tipc socket diagnostics. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 14:43:37 -04:00
GhantaKrishnamurthy MohanKrishna	c30b70deb5	tipc: implement socket diagnostics for AF_TIPC This commit adds socket diagnostics capability for AF_TIPC in netlink family NETLINK_SOCK_DIAG in a new kernel module (diag.ko). The following are key design considerations: - config TIPC_DIAG has default y, like INET_DIAG. - only requests with flag NLM_F_DUMP is supported (dump all). - tipc_sock_diag_req message is introduced to send filter parameters. - the response attributes are of TLV, some nested. To avoid exposing data structures between diag and tipc modules and avoid code duplication, the following additions are required: - export tipc_nl_sk_walk function to reuse socket iterator. - export tipc_sk_fill_sock_diag to fill the tipc diag attributes. - create a sock_diag response message in __tipc_add_sock_diag defined in diag.c and use the above exported tipc_sk_fill_sock_diag to fill response. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 14:43:35 -04:00
GhantaKrishnamurthy MohanKrishna	dfde331e75	tipc: modify socket iterator for sock_diag The current socket iterator function tipc_nl_sk_dump, handles socket locks and calls __tipc_nl_add_sk for each socket. To reuse this logic in sock_diag implementation, we do minor modifications to make these functions generic as described below. In this commit, we add a two new functions __tipc_nl_sk_walk, __tipc_nl_add_sk_info and modify tipc_nl_sk_dump, __tipc_nl_add_sk accordingly. In __tipc_nl_sk_walk we: 1. acquire and release socket locks 2. for each socket, execute the specified callback function In __tipc_nl_add_sk we: - Move the netlink attribute insertion to __tipc_nl_add_sk_info. tipc_nl_sk_dump calls tipc_nl_sk_walk with __tipc_nl_add_sk as argument. sock_diag will use these generic functions in a later commit. There is no functional change in this commit. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 14:43:29 -04:00
David S. Miller	e0645d9b96	Two more fixes (in three patches): * ath9k_htc doesn't like QoS NDP frames, use regular ones * hwsim: set up wmediumd for radios created later -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlqySjUACgkQB8qZga/f l8RDig//bV/Fnwn7deyR7LJOi/g6HVyueCUi+cturTo9RQEQHnBtRVRu3c2Lnd+o 74f+2ofEWEFOYqTvZq5jUWjANOZ/ZGgA+fUw6tOfSmjkEPw9EkST8mQl7lKH0dSR DYrRLKCHwPof9MQQgXHLq44e/26yFCxYP+ADSn9Q6yqo84e/cxP76nqSwYGLforx By1zzkMXPKtfvXTZ41UZTfXRMml2LIxBbCoWTfScIZWvusQjCl653f3lBfR3fynj qwzJJfcjt1TnOQSgCLzk03xi+ci5o157//GmvjT2hrRjRL3i22e+cmFtnXGNCjtg avmc7HftV0sAY8gecqhCLOiWnCuxxjDz1KxZW3dccuJxh1/jfsGtby3H/6stkU4s EU7AU/QhXL7HPHop+fyI+DapFmry0/h1tdKqoSGffONF/qMJmbkXFUvG0kAMqoWh nb/LTlGaVqk8Bz7hNbzP6zkPgEep3i6dBC9iRdfK3gclcEALsh2Z0mx6+ae3FEX3 HHbfB9RTDYUT0Tk33cVFrjMsOpwdhcGsu1ncMJc/iuTOGI/AO6TVhgPEjEj/4xth FO7SzpIpRUe47yKqM0Ykvoz5EwE2IWUt0/eXj+1X2hPNGpcYEZ4bvHQGk8D2h7tA kwd0YqEXjiUw53cSnS4+A45ecOhErH/ZND/ULb65KMl/JDf5gOQ= =fs06 -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2018-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more fixes (in three patches): * ath9k_htc doesn't like QoS NDP frames, use regular ones * hwsim: set up wmediumd for radios created later ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 13:19:10 -04:00
David Ahern	145307460b	devlink: Remove top_hierarchy arg to devlink_resource_register top_hierarchy arg can be determined by comparing parent_resource_id to DEVLINK_RESOURCE_ID_PARENT_TOP so it does not need to be a separate argument. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 13:08:41 -04:00
David Ahern	68e2ffdeb5	net/ipv6: Handle onlink flag with multipath routes For multipath routes the ONLINK flag can be specified per nexthop in rtnh_flags or globally in rtm_flags. Update ip6_route_multipath_add to consider the ONLINK setting coming from rtnh_flags. Each loop over nexthops the config for the sibling route is initialized to the global config and then per nexthop settings overlayed. The flag is 'or'ed into fib6_config to handle the ONLINK flag coming from either rtm_flags or rtnh_flags. Fixes: `fc1e64e109` ("net/ipv6: Add support for onlink flag") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 12:40:04 -04:00
David Lebrun	8936ef7604	ipv6: sr: fix NULL pointer dereference when setting encap source address When using seg6 in encap mode, we call ipv6_dev_get_saddr() to set the source address of the outer IPv6 header, in case none was specified. Using skb->dev can lead to BUG() when it is in an inconsistent state. This patch uses the net_device attached to the skb's dst instead. [940807.667429] BUG: unable to handle kernel NULL pointer dereference at 000000000000047c [940807.762427] IP: ipv6_dev_get_saddr+0x8b/0x1d0 [940807.815725] PGD 0 P4D 0 [940807.847173] Oops: 0000 [#1] SMP PTI [940807.890073] Modules linked in: [940807.927765] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G W 4.16.0-rc1-seg6bpf+ #2 [940808.028988] Hardware name: HP ProLiant DL120 G6/ProLiant DL120 G6, BIOS O26 09/06/2010 [940808.128128] RIP: 0010:ipv6_dev_get_saddr+0x8b/0x1d0 [940808.187667] RSP: 0018:ffff88043fd836b0 EFLAGS: 00010206 [940808.251366] RAX: 0000000000000005 RBX: ffff88042cb1c860 RCX: 00000000000000fe [940808.338025] RDX: 00000000000002c0 RSI: ffff88042cb1c860 RDI: 0000000000004500 [940808.424683] RBP: ffff88043fd83740 R08: 0000000000000000 R09: ffffffffffffffff [940808.511342] R10: 0000000000000040 R11: 0000000000000000 R12: ffff88042cb1c850 [940808.598012] R13: ffffffff8208e380 R14: ffff88042ac8da00 R15: 0000000000000002 [940808.684675] FS: 0000000000000000(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000 [940808.783036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [940808.852975] CR2: 000000000000047c CR3: 00000004255fe000 CR4: 00000000000006e0 [940808.939634] Call Trace: [940808.970041] <IRQ> [940808.995250] ? ip6t_do_table+0x265/0x640 [940809.043341] seg6_do_srh_encap+0x28f/0x300 [940809.093516] ? seg6_do_srh+0x1a0/0x210 [940809.139528] seg6_do_srh+0x1a0/0x210 [940809.183462] seg6_output+0x28/0x1e0 [940809.226358] lwtunnel_output+0x3f/0x70 [940809.272370] ip6_xmit+0x2b8/0x530 [940809.313185] ? ac6_proc_exit+0x20/0x20 [940809.359197] inet6_csk_xmit+0x7d/0xc0 [940809.404173] tcp_transmit_skb+0x548/0x9a0 [940809.453304] __tcp_retransmit_skb+0x1a8/0x7a0 [940809.506603] ? ip6_default_advmss+0x40/0x40 [940809.557824] ? tcp_current_mss+0x24/0x90 [940809.605925] tcp_retransmit_skb+0xd/0x80 [940809.654016] tcp_xmit_retransmit_queue.part.17+0xf9/0x210 [940809.719797] tcp_ack+0xa47/0x1110 [940809.760612] tcp_rcv_established+0x13c/0x570 [940809.812865] tcp_v6_do_rcv+0x151/0x3d0 [940809.858879] tcp_v6_rcv+0xa5c/0xb10 [940809.901770] ? seg6_output+0xdd/0x1e0 [940809.946745] ip6_input_finish+0xbb/0x460 [940809.994837] ip6_input+0x74/0x80 [940810.034612] ? ip6_rcv_finish+0xb0/0xb0 [940810.081663] ipv6_rcv+0x31c/0x4c0 ... Fixes: `6c8702c60b` ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Reported-by: Tom Herbert <tom@quantonium.net> Signed-off-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 12:22:45 -04:00
David Lebrun	191f86ca8e	ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state The seg6_build_state() function is called with RCU read lock held, so we cannot use GFP_KERNEL. This patch uses GFP_ATOMIC instead. [ 92.770271] ============================= [ 92.770628] WARNING: suspicious RCU usage [ 92.770921] 4.16.0-rc4+ #12 Not tainted [ 92.771277] ----------------------------- [ 92.771585] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section! [ 92.772279] [ 92.772279] other info that might help us debug this: [ 92.772279] [ 92.773067] [ 92.773067] rcu_scheduler_active = 2, debug_locks = 1 [ 92.773514] 2 locks held by ip/2413: [ 92.773765] #0: (rtnl_mutex){+.+.}, at: [<00000000e5461720>] rtnetlink_rcv_msg+0x441/0x4d0 [ 92.774377] #1: (rcu_read_lock){....}, at: [<00000000df4f161e>] lwtunnel_build_state+0x59/0x210 [ 92.775065] [ 92.775065] stack backtrace: [ 92.775371] CPU: 0 PID: 2413 Comm: ip Not tainted 4.16.0-rc4+ #12 [ 92.775791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014 [ 92.776608] Call Trace: [ 92.776852] dump_stack+0x7d/0xbc [ 92.777130] __schedule+0x133/0xf00 [ 92.777393] ? unwind_get_return_address_ptr+0x50/0x50 [ 92.777783] ? __sched_text_start+0x8/0x8 [ 92.778073] ? rcu_is_watching+0x19/0x30 [ 92.778383] ? kernel_text_address+0x49/0x60 [ 92.778800] ? __kernel_text_address+0x9/0x30 [ 92.779241] ? unwind_get_return_address+0x29/0x40 [ 92.779727] ? pcpu_alloc+0x102/0x8f0 [ 92.780101] _cond_resched+0x23/0x50 [ 92.780459] __mutex_lock+0xbd/0xad0 [ 92.780818] ? pcpu_alloc+0x102/0x8f0 [ 92.781194] ? seg6_build_state+0x11d/0x240 [ 92.781611] ? save_stack+0x9b/0xb0 [ 92.781965] ? __ww_mutex_wakeup_for_backoff+0xf0/0xf0 [ 92.782480] ? seg6_build_state+0x11d/0x240 [ 92.782925] ? lwtunnel_build_state+0x1bd/0x210 [ 92.783393] ? ip6_route_info_create+0x687/0x1640 [ 92.783846] ? ip6_route_add+0x74/0x110 [ 92.784236] ? inet6_rtm_newroute+0x8a/0xd0 Fixes: `6c8702c60b` ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 12:21:11 -04:00
David S. Miller	755f6633d6	This feature/cleanup patchset includes the following patches: - avoid redundant multicast TT entries, by Linus Luessing - add netlink support for distributed arp table cache and multicast flags, by Linus Luessing (2 patches) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlqv59kWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSBaD/9r/oJC+Q/3Eu6DTAAiS7Jx2IpQ kOwU7l4hkGK8mBZ098CkmHTBY+zurqYwcokCHhKJO5mqJEpvlM27PuQxzqWSBMMO FWFax2YlKPpJp+/f/rSD9HS73RTY7npv6l5/eFg6+0WSQET04PjLB1KxPrO5u1+Z JujrAxp0GEyMoVQgMy9uloedkpizhADyYSZzDDXnHhq1NiAPU87cjrTLv/xdtdp7 TvbNfobhZUmKZ951yaRlDmE+mH8IoTQoY7HD/JANnduYeFJAlIPnHQEQa8+5tLwO qWUeLa4Acv5MhO2KjbKQpu5r2dFbs0x+jmsja8xBmgNWO5meKh/aE8TKGJeDVQEW TTynEivf82suiquCIZ573fBnliJkipffg32ZHgtNGrh54hh+YU7Sts0t9Lsou4ar aOU6lup3MHFysf3s9hK6y9TzSqwFJ8Mak0UFsa03r0Ub8am6bKHTqMFaCgN0aK9P vBL4atSvJVguwPlzxLMi44K4NxqEVfa41dZ7nQ99P3BFzWwSvph3i4lu+cxGxwI7 4kgWU5Cz8T51dH7g8j3vUish36DzwQtUsKLWZVpV+DV4BaHJ/rLyqeug3ffUrWRk p0bFU7wBAv8rKeFPd30m2tdJ/nMo+rDbN6Tm9n43tK4NWKOGBndhCoNhjfrzhM8U un6Iy7taISgeElZ5fQ== =HVxO -----END PGP SIGNATURE----- Merge tag 'batadv-next-for-davem-20180319' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - avoid redundant multicast TT entries, by Linus Luessing - add netlink support for distributed arp table cache and multicast flags, by Linus Luessing (2 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 11:28:54 -04:00
David S. Miller	ee54a9f9ae	Here are some batman-adv bugfixes: - fix possible IPv6 packet loss when multicast extension is used, by Linus Luessing - fix SKB handling issues for TTVN and DAT, by Matthias Schiffer (two patches) - fix include for eventpoll, by Sven Eckelmann - fix skb checksum for ttvn reroutes, by Sven Eckelmann -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlqv5tcWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeob7PD/wPnjVmFl6uQlRHfOfUBzGvW9hU ASakGylzfZTXzP2VviMJ+7JehXQgediM4caTr/8jWhJFeVxu5jmeYzv8nzOlwZJ/ PjNlX+ChuIAWv1wyxHI3YWwj7Y1Ox9Lp8Gs7m5UbmfyiZEtb+ybHjvPzZ/FMfhWo hOgbndSpFKfwFuXFkXyitektBcKTiFUyjk9U22v91XCQZZzMb8KqougVBE+bdg0w N8LfgKXVC448sUNKTEczhji7bFuJwR+Sogx1TwIIsyfT2Tp4JXY3A7M1LDwAL/Rc nTu9L58vk4qUouRjIQJ7rhhoxdPSrD5p3oinzVnLvJBxgOS3pxegY3asX8sRMXsj bgolIGCZ9vDfA21WXqa3RyjUiBEx9UId6W++h+22kVqVHt5tqIFK2nVQ6IInOXwB kzd3UDrRBLgRdBpwlu/ii9rn68MOEdLNpL3Seo9DkViJESfLn1ODnFA1Y2rFoK6S cVeYl1DVj4SQSXjNilqYhFZoTEo/kxN5KhgHRXzujTv6MCHVa2SnCZaXS5EXp8n6 u4gpPc363Isj17A+UVlmHmmzA6d8Jqe8veVvET0xSvQPHmE7eepMkJU9hIUNXfcv fENmQnU10pS1keJqS6xuR6k2RxMoofYRUPpKEigCcH7z/0GytWQc6QVGmo19LD9b GNcWAhKI453pc1IuUA== =hsM8 -----END PGP SIGNATURE----- Merge tag 'batadv-net-for-davem-20180319' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - fix possible IPv6 packet loss when multicast extension is used, by Linus Luessing - fix SKB handling issues for TTVN and DAT, by Matthias Schiffer (two patches) - fix include for eventpoll, by Sven Eckelmann - fix skb checksum for ttvn reroutes, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 11:26:13 -04:00
Sowmini Varadhan	bdf5bd7f21	rds: tcp: remove register_netdevice_notifier infrastructure. The netns deletion path does not need to wait for all net_devices to be unregistered before dismantling rds_tcp state for the netns (we are able to dismantle this state on module unload even when all net_devices are active so there is no dependency here). This patch removes code related to netdevice notifiers and refactors all the code needed to dismantle rds_tcp state into a ->exit callback for the pernet_operations used with register_pernet_device(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 11:21:45 -04:00
Kirill Tkhai	aa65f63654	net: Convert nf_ct_net_ops These pernet_operations register and unregister sysctl. Also, there is inet_frags_exit_net() called in exit method, which has to be safe after `a560002437` "net: Fix hlist corruptions in inet_evict_bucket()". Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 11:11:30 -04:00
Kirill Tkhai	08012631d6	net: Convert lowpan_frags_ops These pernet_operations register and unregister sysctl. Also, there is inet_frags_exit_net() called in exit method, which has to be safe after `a560002437` "net: Fix hlist corruptions in inet_evict_bucket()". Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 11:11:29 -04:00
Kirill Tkhai	1ae7762760	net: Convert can_pernet_ops These pernet_operations create and destroy /proc entries and cancel per-net timer. Also, there are unneed iterations over empty list of net devices, since all net devices must be already moved to init_net or unregistered by default_device_ops. This already was mentioned here: https://marc.info/?l=linux-can&m=150169589119335&w=2 So, it looks safe to make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-22 11:11:29 -04:00
Pablo Neira Ayuso	90d2723c6d	netfilter: nf_tables: do not hold reference on netdevice from preparation phase The netfilter netdevice event handler hold the nfnl_lock mutex, this avoids races with a device going away while such device is being attached to hooks from the netlink control plane. Therefore, either control plane bails out with ENOENT or netdevice event path waits until the hook that is attached to net_device is registered. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-22 13:17:52 +01:00
Pablo Neira Ayuso	d92191aa84	netfilter: nf_tables: cache device name in flowtable object Devices going away have to grab the nfnl_lock from the netdev event path to avoid races with control plane updates. However, netlink dumps in netfilter do not hold nfnl_lock mutex. Cache the device name into the objects to avoid an use-after-free situation for a device that is going away. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-22 12:57:07 +01:00
Paolo Abeni	aebfa52a92	netfilter: drop template ct when conntrack is skipped. The ipv4 nf_ct code currently skips the nf_conntrak_in() call for fragmented packets. As a results later matches/target can end up manipulating template ct entry instead of 'real' ones. Exploiting the above, syzbot found a way to trigger the following splat: WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55 xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 4242 Comm: syzkaller027971 Not tainted 4.16.0-rc2+ #243 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 panic+0x1e4/0x41c kernel/panic.c:183 __warn+0x1dc/0x200 kernel/panic.c:547 report_bug+0x211/0x2d0 lib/bug.c:184 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178 fixup_bug arch/x86/kernel/traps.c:247 [inline] do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957 RIP: 0010:xt_cluster_hash net/netfilter/xt_cluster.c:55 [inline] RIP: 0010:xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127 RSP: 0018:ffff8801d2f6f2d0 EFLAGS: 00010293 RAX: ffff8801af700540 RBX: 0000000000000000 RCX: ffffffff84a2d1e1 RDX: 0000000000000000 RSI: ffff8801d2f6f478 RDI: ffff8801cafd336a RBP: ffff8801d2f6f2e8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b03b3d18 R13: ffff8801cafd3300 R14: dffffc0000000000 R15: ffff8801d2f6f478 ipt_do_table+0xa91/0x19b0 net/ipv4/netfilter/ip_tables.c:296 iptable_filter_hook+0x65/0x80 net/ipv4/netfilter/iptable_filter.c:41 nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline] nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483 nf_hook include/linux/netfilter.h:243 [inline] NF_HOOK include/linux/netfilter.h:286 [inline] raw_send_hdrinc.isra.17+0xf39/0x1880 net/ipv4/raw.c:432 raw_sendmsg+0x14cd/0x26b0 net/ipv4/raw.c:669 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763 sock_sendmsg_nosec net/socket.c:629 [inline] sock_sendmsg+0xca/0x110 net/socket.c:639 SYSC_sendto+0x361/0x5c0 net/socket.c:1748 SyS_sendto+0x40/0x50 net/socket.c:1716 do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x441b49 RSP: 002b:00007ffff5ca8b18 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000441b49 RDX: 0000000000000030 RSI: 0000000020ff7000 RDI: 0000000000000003 RBP: 00000000006cc018 R08: 000000002066354c R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000403470 R13: 0000000000403500 R14: 0000000000000000 R15: 0000000000000000 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 86400 seconds.. Instead of adding checks for template ct on every target/match manipulating skb->_nfct, simply drop the template ct when skipping nf_conntrack_in(). Fixes: `7b4fdf77a4` ("netfilter: don't track fragmented packets") Reported-and-tested-by: syzbot+0346441ae0545cfcea3a@syzkaller.appspotmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-22 12:56:10 +01:00
Davide Caratti	f29cdfbe33	net/sched: fix idr leak in the error path of tcf_skbmod_init() tcf_skbmod_init() can fail after the idr has been successfully reserved. When this happens, every subsequent attempt to configure skbmod rules using the same idr value will systematically fail with -ENOSPC, unless the first attempt was done using the 'replace' keyword: # tc action add action skbmod swap mac index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action add action skbmod swap mac index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel # tc action add action skbmod swap mac index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel ... Fix this in tcf_skbmod_init(), ensuring that tcf_idr_release() is called on the error path when the idr has been reserved, but not yet inserted. Also, don't test 'ovr' in the error path, to avoid a 'replace' failure implicitly become a 'delete' that leaks refcount in act_skbmod module: # rmmod act_skbmod; modprobe act_skbmod # tc action add action skbmod swap mac index 100 # tc action add action skbmod swap mac continue index 100 RTNETLINK answers: File exists We have an error talking to the kernel # tc action replace action skbmod swap mac continue index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action list action skbmod # # rmmod act_skbmod rmmod: ERROR: Module act_skbmod is in use Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:12:37 -04:00
Davide Caratti	d7f2001573	net/sched: fix idr leak in the error path of tcf_vlan_init() tcf_vlan_init() can fail after the idr has been successfully reserved. When this happens, every subsequent attempt to configure vlan rules using the same idr value will systematically fail with -ENOSPC, unless the first attempt was done using the 'replace' keyword. # tc action add action vlan pop index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action add action vlan pop index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel # tc action add action vlan pop index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel ... Fix this in tcf_vlan_init(), ensuring that tcf_idr_release() is called on the error path when the idr has been reserved, but not yet inserted. Also, don't test 'ovr' in the error path, to avoid a 'replace' failure implicitly become a 'delete' that leaks refcount in act_vlan module: # rmmod act_vlan; modprobe act_vlan # tc action add action vlan push id 5 index 100 # tc action replace action vlan push id 7 index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action list action vlan # # rmmod act_vlan rmmod: ERROR: Module act_vlan is in use Fixes: `4c5b9d9642` ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update") Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:12:27 -04:00
Davide Caratti	1e46ef1762	net/sched: fix idr leak in the error path of __tcf_ipt_init() __tcf_ipt_init() can fail after the idr has been successfully reserved. When this happens, subsequent attempts to configure xt/ipt rules using the same idr value systematically fail with -ENOSPC: # tc action add action xt -j LOG --log-prefix test1 index 100 tablename: mangle hook: NF_IP_POST_ROUTING target: LOG level warning prefix "test1" index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel Command "(null)" is unknown, try "tc actions help". # tc action add action xt -j LOG --log-prefix test1 index 100 tablename: mangle hook: NF_IP_POST_ROUTING target: LOG level warning prefix "test1" index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel Command "(null)" is unknown, try "tc actions help". # tc action add action xt -j LOG --log-prefix test1 index 100 tablename: mangle hook: NF_IP_POST_ROUTING target: LOG level warning prefix "test1" index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel ... Fix this in the error path of __tcf_ipt_init(), calling tcf_idr_release() in place of tcf_idr_cleanup(). Since tcf_ipt_release() can now be called when tcfi_t is NULL, we also need to protect calls to ipt_destroy_target() to avoid NULL pointer dereference. Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:12:16 -04:00
Davide Caratti	94fa3f929e	net/sched: fix idr leak in the error path of tcp_pedit_init() tcf_pedit_init() can fail to allocate 'keys' after the idr has been successfully reserved. When this happens, subsequent attempts to configure a pedit rule using the same idr value systematically fail with -ENOSPC: # tc action add action pedit munge ip ttl set 63 index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action add action pedit munge ip ttl set 63 index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel # tc action add action pedit munge ip ttl set 63 index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel ... Fix this in the error path of tcf_act_pedit_init(), calling tcf_idr_release() in place of tcf_idr_cleanup(). Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:12:08 -04:00
Davide Caratti	5bf7f8185f	net/sched: fix idr leak in the error path of tcf_act_police_init() tcf_act_police_init() can fail after the idr has been successfully reserved (e.g., qdisc_get_rtab() may return NULL). When this happens, subsequent attempts to configure a police rule using the same idr value systematiclly fail with -ENOSPC: # tc action add action police rate 1000 burst 1000 drop index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action add action police rate 1000 burst 1000 drop index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel # tc action add action police rate 1000 burst 1000 drop index 100 RTNETLINK answers: No space left on device ... Fix this in the error path of tcf_act_police_init(), calling tcf_idr_release() in place of tcf_idr_cleanup(). Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:12:00 -04:00
Davide Caratti	60e10b3adc	net/sched: fix idr leak in the error path of tcf_simp_init() if the kernel fails to duplicate 'sdata', creation of a new action fails with -ENOMEM. However, subsequent attempts to install the same action using the same value of 'index' systematically fail with -ENOSPC, and that value of 'index' will no more be usable by act_simple, until rmmod / insmod of act_simple.ko is done: # tc actions add action simple sdata hello index 100 # tc actions list action simple action order 0: Simple <hello> index 100 ref 1 bind 0 # tc actions flush action simple # tc actions add action simple sdata hello index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc actions flush action simple # tc actions add action simple sdata hello index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel # tc actions add action simple sdata hello index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel ... Fix this in the error path of tcf_simp_init(), calling tcf_idr_release() in place of tcf_idr_cleanup(). Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:11:53 -04:00
Davide Caratti	bbc09e7842	net/sched: fix idr leak on the error path of tcf_bpf_init() when the following command sequence is entered # tc action add action bpf bytecode '4,40 0 0 12,31 0 1 2048,6 0 0 262144,6 0 0 0' index 100 RTNETLINK answers: Invalid argument We have an error talking to the kernel # tc action add action bpf bytecode '4,40 0 0 12,21 0 1 2048,6 0 0 262144,6 0 0 0' index 100 RTNETLINK answers: No space left on device We have an error talking to the kernel act_bpf correctly refuses to install the first TC rule, because 31 is not a valid instruction. However, it refuses to install the second TC rule, even if the BPF code is correct. Furthermore, it's no more possible to install any other rule having the same value of 'index' until act_bpf module is unloaded/inserted again. After the idr has been reserved, call tcf_idr_release() instead of tcf_idr_cleanup(), to fix this issue. Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 18:11:46 -04:00
David S. Miller	454bfe9783	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2018-03-21 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Add a BPF hook for sendmsg and sendfile by reusing the ULP infrastructure and sockmap. Three helpers are added along with this, bpf_msg_apply_bytes(), bpf_msg_cork_bytes(), and bpf_msg_pull_data(). The first is used to tell for how many bytes the verdict should be applied to, the second to tell that x bytes need to be queued first to retrigger the BPF program for a verdict, and the third helper is mainly for the sendfile case to pull in data for making it private for reading and/or writing, from John. 2) Improve address to symbol resolution of user stack traces in BPF stackmap. Currently, the latter stores the address for each entry in the call trace, however to map these addresses to user space files, it is necessary to maintain the mapping from these virtual addresses to symbols in the binary which is not practical for system-wide profiling. Instead, this option for the stackmap rather stores the ELF build id and offset for the call trace entries, from Song. 3) Add support that allows BPF programs attached to perf events to read the address values recorded with the perf events. They are requested through PERF_SAMPLE_ADDR via perf_event_open(). Main motivation behind it is to support building memory or lock access profiling and tracing tools with the help of BPF, from Teng. 4) Several improvements to the tools/bpf/ Makefiles. The 'make bpf' in the tools directory does not provide the standard quiet output except for bpftool and it also does not respect specifying a build output directory. 'make bpf_install' command neither respects specified destination nor prefix, all from Jiri. In addition, Jakub fixes several other minor issues in the Makefiles on top of that, e.g. fixing dependency paths, phony targets and more. 5) Various doc updates e.g. add a comment for BPF fs about reserved names to make the dentry lookup from there a bit more obvious, and a comment to the bpf_devel_QA file in order to explain the diff between native and bpf target clang usage with regards to pointer size, from Quentin and Daniel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-21 12:08:01 -04:00
Dmitry Lebed	13cf6dec93	cfg80211/nl80211: add DFS offload flag Add wiphy EXT_FEATURE flag to indicate that HW or driver does all DFS actions by itself. User-space functionality already implemented in hostapd using vendor-specific (QCA) OUI to advertise DFS offload support. Need to introduce generic flag to inform about DFS offload support. For devices with DFS_OFFLOAD flag set user-space will no longer need to issue CAC or do any actions in response to "radar detected" events. HW will do everything by itself and send events to user-space to indicate that CAC was started/finished, etc. Signed-off-by: Dmitrii Lebed <dlebed@quantenna.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-21 11:29:59 +01:00
Dmitry Lebed	2cb021f5de	cfg80211/nl80211: add CAC_STARTED event CAC_STARTED event is needed for DFS offload feature and should be generated by driver/HW if DFS_OFFLOAD is enabled. Signed-off-by: Dmitry Lebed <dlebed@quantenna.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-21 11:29:59 +01:00
Tosoni	1ad22fb5bb	mac80211: inform wireless layer when frame RSSI is invalid When the low-level driver returns an invalid RSSI indication, set the signal value to 0 as an indication to the upper layer. Also, skip average level computation if signal is invalid. Signed-off-by: Jean Pierre TOSONI <jp.tosoni@acksys.fr> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-21 11:27:11 +01:00
Ben Caradoc-Davies	7c181f4fcd	mac80211: add ieee80211_hw flag for QoS NDP support Commit `7b6ddeaf27` ("mac80211: use QoS NDP for AP probing") added an argument qos_ok to ieee80211_nullfunc_get to support QoS NDP. Despite the claim in the commit log "Change all the drivers to not allow QoS NDP for now, even though it looks like most of them should be OK with that", this commit enables QoS NDP in response to beacons (see change to mlme.c:ieee80211_send_nullfunc), causing ath9k_htc to lose IP connectivity. See: https://patchwork.kernel.org/patch/10241109/ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=891060 Introduce a hardware flag to allow such buggy drivers to override the correct default behaviour of mac80211 of sending QoS NDP packets. Signed-off-by: Ben Caradoc-Davies <ben@transient.nz> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-03-21 10:56:18 +01:00
Chuck Lever	6f29d07ca4	svcrdma: Clean up rdma_build_arg_xdr Clean up: The value of the byte_count parameter is already passed to rdma_build_arg_xdr as part of the svc_rdma_op_ctxt structure. Further, without the parameter called "byte_count" there is no need to have the abbreviated "bc" automatic variable. "bc" can now be called something more intuitive. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-20 17:32:13 -04:00
Chuck Lever	97cc326450	svcrdma: Consult max_qp_init_rd_atom when accepting connections The target needs to return the lesser of the client's Inbound RDMA Read Queue Depth (IRD), provided in the connection parameters, and the local device's Outbound RDMA Read Queue Depth (ORD). The latter limit is max_qp_init_rd_atom, not max_qp_rd_atom. The svcrdma_ord value caps the ORD value for iWARP transports, which do not exchange ORD/IRD values at connection time. Since no other Linux kernel RDMA-enabled storage target sees fit to provide this cap, I'm removing it here too. initiator_depth is a u8, so ensure the computed ORD value does not overflow that field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-20 17:32:13 -04:00
Chuck Lever	0c4398ff8b	svcrdma: Use pr_err to report Receive errors Clean up: Other completion handlers use pr_err, not pr_warn. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-20 17:32:12 -04:00
Stefano Brivio	5f2fb802ee	ipv6: old_dport should be a __be16 in __ip6_datagram_connect() Fixes: `2f987a76a9` ("net: ipv6: keep sk status consistent after datagram connect failure") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-20 12:43:43 -04:00
Matthias Schiffer	78d9f4d49b	netfilter: ebtables: add support for matching IGMP type We already have ICMPv6 type/code matches (which can be used to distinguish different types of MLD packets). Add support for IPv4 IGMP matches in the same way. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 17:24:10 +01:00
Matthias Schiffer	5adc1668dd	netfilter: ebtables: add support for matching ICMP type and code We already have ICMPv6 type/code matches. This adds support for IPv4 ICMP matches in the same way. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 17:24:03 +01:00
Florian Westphal	467697d289	netfilter: nf_tables: add missing netlink attrs to policies Fixes: `8aeff920dc` ("netfilter: nf_tables: add stateful object reference to set elements") Fixes: `f25ad2e907` ("netfilter: nf_tables: prepare for expressions associated to set elements") Fixes: `1a94e38d25` ("netfilter: nf_tables: add NFTA_RULE_ID attribute") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 16:47:22 +01:00
Arkadi Sharshevsky	7fe4d6dcbc	devlink: Remove redundant free on error path The current code performs unneeded free. Remove the redundant skb freeing during the error path. Fixes: `1555d204e7` ("devlink: Support for pipeline debug (dpipe)") Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-20 10:59:29 -04:00
Pablo Neira Ayuso	20710b3b81	netfilter: ctnetlink: synproxy support This patch exposes synproxy information per-conntrack. Moreover, send sequence adjustment events once server sends us the SYN,ACK packet, so we can synchronize the sequence adjustment too for packets going as reply from the server, as part of the synproxy logic. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 14:39:31 +01:00
Florian Westphal	ae6153b50f	netfilter: nf_tables: permit second nat hook if colliding hook is going away Sergei Trofimovich reported that restoring an nft ruleset doesn't work anymore unless old rule content is flushed first. The problem stems from a recent change designed to prevent multiple nat hooks at the same hook point locations and nftables transaction model. A 'flush ruleset' won't take effect until the entire transaction has completed. So, if one has a nft.rules file that contains a 'flush ruleset', followed by a nat hook register request, then 'nft -f file' will work, but running 'nft -f file' again will fail with -EBUSY. Reason is that nftables will place the flush/removal requests in the transaction list, but it will not act on the removal until after all new rules are in place. The netfilter core will therefore get request to register a new nat hook before the old one is removed -- this now fails as the netfilter core can't know the existing hook is staged for removal. To fix this, we can search the transaction log when a hook collision is detected. The collision is okay if 1. there is a delete request pending for the nat hook that is already registered. 2. there is no second add request for a matching nat hook. This is required to only apply the exception once. Fixes: `f92b40a8b2` ("netfilter: core: only allow one nat hook per hook point") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:55:03 +01:00
Florian Westphal	4f2921ca21	netfilter: nf_tables: meter: pick a set backend that supports updates in nftables, 'meter' can be used to instantiate a hash-table at run time: rule add filter forward iif "internal" meter hostacct { ip saddr counter} nft list meter ip filter hostacct table ip filter { meter hostacct { type ipv4_addr elements = { 192.168.0.1 : counter packets 8 bytes 2672, .. because elemets get added on the fly, the kernel must chose a set backend type that implements the ->update() function, otherwise rule insertion fails with EOPNOTSUPP. Therefore, skip set types that lack ->update, and also make sure we do not discard a (bad) candidate when we did yet find any candidate at all. This could happen when userspace prefers low memory footprint -- the set implementation currently checked might not be a fit at all. Make sure we pick it anyway (!bops). In case next candidate is a better fix, it will be chosen instead. But in case nothing else is found we at least have a non-ideal match rather than no match at all. Fixes: `6c03ae210c` ("netfilter: nft_set_hash: add non-resizable hashtable implementation") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:52:10 +01:00
Arushi Singhal	5191d70f83	netfilter: Replace printk() with pr_() and define pr_fmt() Using pr_<loglevel>() is more concise than printk(KERN_<LOGLEVEL>). This patch: Replace printks having a log level with the appropriate pr_() macros. Define pr_fmt() to include relevant name. * Remove redundant prefixes from pr_() calls. Indent the code where possible. * Remove the useless output messages. * Remove periods from messages. Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:44:14 +01:00
Jack Ma	472a73e007	netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets. This patch introduces a new feature that allows bitshifting (left and right) operations to co-operate with existing iptables options. Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jack Ma <jack.ma@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:41:41 +01:00
Taehee Yoo	d72133e628	netfilter: ebtables: use ADD_COUNTER macro xtables uses ADD_COUNTER macro to increase packet and byte count. ebtables also can use this. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:41:40 +01:00
Gustavo A. R. Silva	5b4c6e3860	netfilter: nf_tables: remove VLA usage In preparation to enabling -Wvla, remove VLA and replace it with dynamic memory allocation. >From a security viewpoint, the use of Variable Length Arrays can be a vector for stack overflow attacks. Also, in general, as the code evolves it is easy to lose track of how big a VLA can get. Thus, we can end up having segfaults that are hard to debug. Also, fixed as part of the directive to remove all VLAs from the kernel: https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:41:39 +01:00
Gustavo A. R. Silva	1446385904	netfilter: nfnetlink_cthelper: Remove VLA usage In preparation to enabling -Wvla, remove VLA and replace it with dynamic memory allocation. >From a security viewpoint, the use of Variable Length Arrays can be a vector for stack overflow attacks. Also, in general, as the code evolves it is easy to lose track of how big a VLA can get. Thus, we can end up having segfaults that are hard to debug. Also, fixed as part of the directive to remove all VLAs from the kernel: https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:41:38 +01:00
Gustavo A. R. Silva	8039ab43ee	netfilter: cttimeout: remove VLA usage In preparation to enabling -Wvla, remove VLA and replace it with dynamic memory allocation. >From a security viewpoint, the use of Variable Length Arrays can be a vector for stack overflow attacks. Also, in general, as the code evolves it is easy to lose track of how big a VLA can get. Thus, we can end up having segfaults that are hard to debug. Also, fixed as part of the directive to remove all VLAs from the kernel: https://lkml.org/lkml/2018/3/7/621 While at it, remove likely() notation which is not necessary from the control plane code. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:41:04 +01:00
Pablo Neira Ayuso	d719e3f21c	netfilter: nft_ct: add NFT_CT_{SRC,DST}_{IP,IP6} All existing keys, except the NFT_CT_SRC and NFT_CT_DST are assumed to have strict datatypes. This is causing problems with sets and concatenations given the specific length of these keys is not known. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Florian Westphal <fw@strlen.de>	2018-03-20 13:27:19 +01:00
Yi-Hung Wei	35d8deb80c	netfilter: conncount: Support count only use case Currently, nf_conncount_count() counts the number of connections that matches key and inserts a conntrack 'tuple' with the same key into the accounting data structure. This patch supports another use case that only counts the number of connections where 'tuple' is not provided. Therefore, proper changes are made on nf_conncount_count() to support the case where 'tuple' is NULL. This could be useful for querying statistics or debugging purpose. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:27:18 +01:00
Yi-Hung Wei	6aec208786	netfilter: Refactor nf_conncount Remove parameter 'family' in nf_conncount_count() and count_tree(). It is because the parameter is not useful after commit `625c556118` ("netfilter: connlimit: split xt_connlimit into front and backend"). Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-03-20 13:27:17 +01:00
NeilBrown	3b68e6ee3c	SUNRPC: cache: ignore timestamp written to 'flush' file. The interface for flushing the sunrpc auth cache was poorly designed and has caused problems a number of times. The design is that you write a timestamp, and all entries created before that time are discarded. The most obvious problem is that this is not what people actually want. They want to just flush the whole cache. The 1-second granularity can be a problem, as can the use of wall-clock time. A current problem is that code will write the current time to this file - expecting it to clear everything - and if the seconds number ticks over before this timestamp is checked, the test "then >= now" fails, and a full flush isn't forced. So lets just drop the subtleties and always flush the whole cache. The worst this could do is impose an extra cost refilling it, but that would require someone to be using non-standard tools. We still report an error if the string written is not a number, but we cause any valid number to flush the whole cache. Reported-by: "Wang, Alan 1. (NSB - CN/Hangzhou)" <alan.1.wang@nokia-sbell.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:12 -04:00
James Ettle	90a9b1473d	sunrpc: Fix unaligned access on sparc64 Fix unaligned access in gss_{get,verify}_mic_v2() on sparc64 Signed-off-by: James Ettle <james@ettle.org.uk> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:12 -04:00
John Fastabend	015632bb30	bpf: sk_msg program helper bpf_sk_msg_pull_data Currently, if a bpf sk msg program is run the program can only parse data that the (start,end) pointers already consumed. For sendmsg hooks this is likely the first scatterlist element. For sendpage this will be the range (0,0) because the data is shared with userspace and by default we want to avoid allowing userspace to modify data while (or after) BPF verdict is being decided. To support pulling in additional bytes for parsing use a new helper bpf_sk_msg_pull(start, end, flags) which works similar to cls tc logic. This helper will attempt to point the data start pointer at 'start' bytes offest into msg and data end pointer at 'end' bytes offset into message. After basic sanity checks to ensure 'start' <= 'end' and 'end' <= msg_length there are a few cases we need to handle. First the sendmsg hook has already copied the data from userspace and has exclusive access to it. Therefor, it is not necessesary to copy the data. However, it may be required. After finding the scatterlist element with 'start' offset byte in it there are two cases. One the range (start,end) is entirely contained in the sg element and is already linear. All that is needed is to update the data pointers, no allocate/copy is needed. The other case is (start, end) crosses sg element boundaries. In this case we allocate a block of size 'end - start' and copy the data to linearize it. Next sendpage hook has not copied any data in initial state so that data pointers are (0,0). In this case we handle it similar to the above sendmsg case except the allocation/copy must always happen. Then when sending the data we have possibly three memory regions that need to be sent, (0, start - 1), (start, end), and (end + 1, msg_length). This is required to ensure any writes by the BPF program are correctly transmitted. Lastly this operation will invalidate any previous data checks so BPF programs will have to revalidate pointers after making this BPF call. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:39 +01:00
John Fastabend	91843d540a	bpf: sockmap, add msg_cork_bytes() helper In the case where we need a specific number of bytes before a verdict can be assigned, even if the data spans multiple sendmsg or sendfile calls. The BPF program may use msg_cork_bytes(). The extreme case is a user can call sendmsg repeatedly with 1-byte msg segments. Obviously, this is bad for performance but is still valid. If the BPF program needs N bytes to validate a header it can use msg_cork_bytes to specify N bytes and the BPF program will not be called again until N bytes have been accumulated. The infrastructure will attempt to coalesce data if possible so in many cases (most my use cases at least) the data will be in a single scatterlist element with data pointers pointing to start/end of the element. However, this is dependent on available memory so is not guaranteed. So BPF programs must validate data pointer ranges, but this is the case anyways to convince the verifier the accesses are valid. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:39 +01:00
John Fastabend	2a100317c9	bpf: sockmap, add bpf_msg_apply_bytes() helper A single sendmsg or sendfile system call can contain multiple logical messages that a BPF program may want to read and apply a verdict. But, without an apply_bytes helper any verdict on the data applies to all bytes in the sendmsg/sendfile. Alternatively, a BPF program may only care to read the first N bytes of a msg. If the payload is large say MB or even GB setting up and calling the BPF program repeatedly for all bytes, even though the verdict is already known, creates unnecessary overhead. To allow BPF programs to control how many bytes a given verdict applies to we implement a bpf_msg_apply_bytes() helper. When called from within a BPF program this sets a counter, internal to the BPF infrastructure, that applies the last verdict to the next N bytes. If the N is smaller than the current data being processed from a sendmsg/sendfile call, the first N bytes will be sent and the BPF program will be re-run with start_data pointing to the N+1 byte. If N is larger than the current data being processed the BPF verdict will be applied to multiple sendmsg/sendfile calls until N bytes are consumed. Note1 if a socket closes with apply_bytes counter non-zero this is not a problem because data is not being buffered for N bytes and is sent as its received. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:39 +01:00
John Fastabend	4f738adba3	bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. In order to support this a new program type BPF_PROG_TYPE_SK_MSG is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets a sockmap is used with a new program attach type BPF_SK_MSG_VERDICT. Similar to previous sockmap usages when a sock is added to a sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT program type attached then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in sendmsg case and page/offset in sendpage case. BPF_PROG_TYPE_SK_MSG Semantics/API: BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and SK_DROP. Returning SK_DROP free's the copied data in the sendmsg case and in the sendpage case leaves the data untouched. Both cases return -EACESS to the user. Returning SK_PASS will allow the msg to be sent. In the sendmsg case data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element. In the sendpage case data is not copied. We opt not to copy the data by default here, because the BPF infrastructure does not know what bytes will be needed nor when they will be needed. So copying all bytes may be wasteful. Because of this the initial start/end data pointers are (0,0). Meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default so that the page is exclusively owned by the BPF call. The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the currently being processed message. The helper msg_redirect_map() can be used to select the socket to send the data on. This is used similar to existing redirect use cases. This allows policy to redirect msgs. Pseudo code simple example: The basic logic to attach a program to a socket is as follows, // load the programs bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog); // lookup the sockmap bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map"); // get fd for sockmap map_fd_msg = bpf_map__fd(bpf_map_msg); // attach program to sockmap bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0); Adding sockets to the map is done in the normal way, // Add a socket 'fd' to sockmap at location 'i' bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY); After the above any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call. For a complete example see BPF selftests or sockmap samples. Implementation notes: It seemed the simplest, to me at least, to use a refcnt to ensure psock is not lost across the sendmsg copy into the sg, the bpf program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method. Another item that will come after basic support is in place is supporting MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass down the stack. Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work. Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series shortly. Future work could improve size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was being used already in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:38 +01:00
John Fastabend	8c05dbf04b	net: generalize sk_alloc_sg to work with scatterlist rings The current implementation of sk_alloc_sg expects scatterlist to always start at entry 0 and complete at entry MAX_SKB_FRAGS. Future patches will want to support starting at arbitrary offset into scatterlist so add an additional sg_start parameters and then default to the current values in TLS code paths. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:38 +01:00
John Fastabend	312fc2b4c8	net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG When calling do_tcp_sendpages() from in kernel and we know the data has no references from user side we can omit SKBTX_SHARED_FRAG flag. This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used to omit setting SKBTX_SHARED_FRAG. The flag is not exposed to userspace because the sendpage call from the splice logic masks out all bits except MSG_MORE. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:38 +01:00
John Fastabend	2c3682f0be	sock: make static tls function alloc_sg generic sock helper The TLS ULP module builds scatterlists from a sock using page_frag_refill(). This is going to be useful for other ULPs so move it into sock file for more general use. In the process remove useless goto at end of while loop. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-03-19 21:14:38 +01:00
Stefano Brivio	f8a554b4aa	vti6: Fix dev->max_mtu setting We shouldn't allow a tunnel to have IP_MAX_MTU as MTU, because another IPv6 header is going on top of our packets. Without this patch, we might end up building packets bigger than IP_MAX_MTU. Fixes: `b96f9afee4` ("ipv4/6: use core net MTU range checking") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-03-19 08:45:50 +01:00
Stefano Brivio	7a67e69a33	vti6: Keep set MTU on link creation or change, validate it In vti6_link_config(), if MTU is already given on link creation or change, validate and use it instead of recomputing it. To do that, we need to propagate the knowledge that MTU was set by userspace all the way down to vti6_link_config(). To keep this simple, vti6_dev_init() sets the new 'keep_mtu' argument of vti6_link_config() to true: on initialization, we don't have convenient access to netlink attributes there, but we will anyway check whether dev->mtu is set in vti6_link_config(). If it's non-zero, it was set to the value of the IFLA_MTU attribute during creation. Otherwise, determine a reasonable value. Fixes: `ed1efb2aef` ("ipv6: Add support for IPsec virtual tunnel interfaces") Fixes: `53c81e95df` ("ip6_vti: adjust vti mtu according to mtu of lower device") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-03-19 08:45:50 +01:00
Stefano Brivio	c6741fbed6	vti6: Properly adjust vti6 MTU from MTU of lower device If a lower device is found, we don't need to subtract LL_MAX_HEADER to calculate our MTU: just use its MTU, the link layer headers are already taken into account by it. If the lower device is not found, start from ETH_DATA_LEN instead, and only in this case subtract a worst-case LL_MAX_HEADER. We then need to subtract our additional IPv6 header from the calculation. While at it, note that vti6 doesn't have a hardware header, so it doesn't need to set dev->hard_header_len. And as vti6_link_config() now always sets the MTU, there's no need to set a default value in vti6_dev_setup(). This makes the behaviour consistent with IPv4 vti, after commit `a32452366b` ("vti4: Don't count header length twice."), which was accidentally reverted by merge commit `f895f0cfbb` ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec"). While commit `53c81e95df` ("ip6_vti: adjust vti mtu according to mtu of lower device") improved on the original situation, this was still not ideal. As reported in that commit message itself, if we start from an underlying veth MTU of 9000, we end up with an MTU of 8832, that is, 9000 - LL_MAX_HEADER - sizeof(ipv6hdr). This should simply be 8880, or 9000 - sizeof(ipv6hdr) instead: we found the lower device (veth) and we know we don't have any additional link layer header, so there's no need to subtract an hypothetical worst-case number. Fixes: `53c81e95df` ("ip6_vti: adjust vti mtu according to mtu of lower device") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-03-19 08:45:50 +01:00
Stefano Brivio	03080e5ec7	vti4: Don't override MTU passed on link creation via IFLA_MTU Don't hardcode a MTU value on vti tunnel initialization, ip_tunnel_newlink() is able to deal with this already. See also commit `ffc2b6ee41` ("ip_gre: fix IFLA_MTU ignored on NEWLINK"). Fixes: `1181412c1a` ("net/ipv4: VTI support new module for ip_vti.") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-03-19 08:45:50 +01:00

... 4 5 6 7 8 ...

50999 Commits