OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Eric Dumazet	ab443c5391	sch_cake: do not call cake_destroy() from cake_init() qdiscs are not supposed to call their own destroy() method from init(), because core stack already does that. syzbot was able to trigger use after free: DEBUG_LOCKS_WARN_ON(lock->magic != lock) WARNING: CPU: 0 PID: 21902 at kernel/locking/mutex.c:586 __mutex_lock_common kernel/locking/mutex.c:586 [inline] WARNING: CPU: 0 PID: 21902 at kernel/locking/mutex.c:586 __mutex_lock+0x9ec/0x12f0 kernel/locking/mutex.c:740 Modules linked in: CPU: 0 PID: 21902 Comm: syz-executor189 Not tainted 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__mutex_lock_common kernel/locking/mutex.c:586 [inline] RIP: 0010:__mutex_lock+0x9ec/0x12f0 kernel/locking/mutex.c:740 Code: 08 84 d2 0f 85 19 08 00 00 8b 05 97 38 4b 04 85 c0 0f 85 27 f7 ff ff 48 c7 c6 20 00 ac 89 48 c7 c7 a0 fe ab 89 e8 bf 76 ba ff <0f> 0b e9 0d f7 ff ff 48 8b 44 24 40 48 8d b8 c8 08 00 00 48 89 f8 RSP: 0018:ffffc9000627f290 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88802315d700 RSI: ffffffff815f1db8 RDI: fffff52000c4fe44 RBP: ffff88818f28e000 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815ebb5e R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: ffffc9000627f458 R15: 0000000093c30000 FS: 0000555556abc400(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fda689c3303 CR3: 000000001cfbb000 CR4: 0000000000350ef0 Call Trace: <TASK> tcf_chain0_head_change_cb_del+0x2e/0x3d0 net/sched/cls_api.c:810 tcf_block_put_ext net/sched/cls_api.c:1381 [inline] tcf_block_put_ext net/sched/cls_api.c:1376 [inline] tcf_block_put+0xbc/0x130 net/sched/cls_api.c:1394 cake_destroy+0x3f/0x80 net/sched/sch_cake.c:2695 qdisc_create.constprop.0+0x9da/0x10f0 net/sched/sch_api.c:1293 tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5571 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2496 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xdf0 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:724 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2409 ___sys_sendmsg+0xf3/0x170 net/socket.c:2463 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2492 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f1bb06badb9 Code: Unable to access opcode bytes at RIP 0x7f1bb06bad8f. RSP: 002b:00007fff3012a658 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f1bb06badb9 RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000003 R10: 0000000000000003 R11: 0000000000000246 R12: 00007fff3012a688 R13: 00007fff3012a6a0 R14: 00007fff3012a6e0 R15: 00000000000013c2 </TASK> Fixes: `046f6fd5da` ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://lore.kernel.org/r/20211210142046.698336-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-10 08:11:36 -08:00
Eric Dumazet	285ec2fef4	l2tp: add netns refcount tracker to l2tp_dfs_seq_data Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-10 06:38:27 -08:00
Eric Dumazet	ffa84b5ffb	net: add netns refcount tracker to struct sock Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-10 06:38:26 -08:00
Eric Dumazet	9ba74e6c9e	net: add networking namespace refcount tracker We have 100+ syzbot reports about netns being dismantled too soon, still unresolved as of today. We think a missing get_net() or an extra put_net() is the root cause. In order to find the bug(s), and be able to spot future ones, this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers to precisely pair all put_net() with corresponding get_net(). To use these helpers, each data structure owning a refcount should also use a "netns_tracker" to pair the get and put. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-10 06:38:26 -08:00
Minghao Chi	cde3fac565	batman-adv: remove unneeded variable in batadv_nc_init Return status directly from function called. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2021-12-10 08:52:52 +01:00
Jean Sacren	9745177c94	net: x25: drop harmless check of !more 'more' is checked first. When !more is checked immediately after that, it is always true. We should drop this check. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Link: https://lore.kernel.org/r/20211208024732.142541-5-sakiwit@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 18:35:11 -08:00
Jakub Kicinski	3150a73366	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 13:23:02 -08:00
Eric Dumazet	4177e49605	xfrm: use net device refcount tracker helpers xfrm4_fill_dst() and xfrm6_fill_dst() build dst, getting a device reference that will likely be released by standard dst_release() code. We have to track these references or risk a warning if CONFIG_NET_DEV_REFCNT_TRACKER=y Note to XFRM maintainers : Error path in xfrm6_fill_dst() releases the reference, but does not clear xdst->u.dst.dev, so I wonder if this could lead to double dev_put() in some cases, where a dst_release() _is_ called by the callers in their error path. This extra dev_put() was added in commit `84c4a9dfbf` ("xfrm6: release dev before returning error") Fixes: `9038c32000` ("net: dst: add net device refcount tracking to dst_entry") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <amwang@redhat.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Link: https://lore.kernel.org/r/20211207193203.2706158-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 11:51:45 -08:00
Russell King (Oracle)	0a9f0794d9	net: dsa: mark DSA phylink as legacy_pre_march2020 The majority of DSA drivers do not make use of the PCS support, and thus operate in legacy mode. In order to preserve this behaviour in future, we need to set the legacy_pre_march2020 flag so phylink knows this may require the legacy calls. There are some DSA drivers that do make use of PCS support, and these will continue operating as before - legacy_pre_march2020 will not prevent split-PCS support enabling the newer phylink behaviour. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 11:21:03 -08:00
Eric Dumazet	61c2402665	net/sched: fq_pie: prevent dismantle issue For some reason, fq_pie_destroy() did not copy working code from pie_destroy() and other qdiscs, thus causing elusive bug. Before calling del_timer_sync(&q->adapt_timer), we need to ensure timer will not rearm itself. rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (4416 ticks this GP) idle=60d/1/0x4000000000000000 softirq=10433/10434 fqs=2579 (t=10501 jiffies g=13085 q=3989) NMI backtrace for cpu 0 CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:111 nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline] rcu_dump_cpu_stacks+0x25e/0x3f0 kernel/rcu/tree_stall.h:343 print_cpu_stall kernel/rcu/tree_stall.h:627 [inline] check_cpu_stall kernel/rcu/tree_stall.h:711 [inline] rcu_pending kernel/rcu/tree.c:3878 [inline] rcu_sched_clock_irq.cold+0x9d/0x746 kernel/rcu/tree.c:2597 update_process_times+0x16d/0x200 kernel/time/timer.c:1785 tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226 tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1428 __run_hrtimer kernel/time/hrtimer.c:1685 [inline] __hrtimer_run_queues+0x1c0/0xe50 kernel/time/hrtimer.c:1749 hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline] __sysvec_apic_timer_interrupt+0x146/0x530 arch/x86/kernel/apic/apic.c:1103 sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1097 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 RIP: 0010:write_comp_data kernel/kcov.c:221 [inline] RIP: 0010:__sanitizer_cov_trace_const_cmp1+0x1d/0x80 kernel/kcov.c:273 Code: 54 c8 20 48 89 10 c3 66 0f 1f 44 00 00 53 41 89 fb 41 89 f1 bf 03 00 00 00 65 48 8b 0c 25 40 70 02 00 48 89 ce 4c 8b 54 24 08 <e8> 4e f7 ff ff 84 c0 74 51 48 8b 81 88 15 00 00 44 8b 81 84 15 00 RSP: 0018:ffffc90000d27b28 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff888064bf1bf0 RCX: ffff888011928000 RDX: ffff888011928000 RSI: ffff888011928000 RDI: 0000000000000003 RBP: ffff888064bf1c28 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff875d8295 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8880783dd300 R14: 0000000000000000 R15: 0000000000000000 pie_calculate_probability+0x405/0x7c0 net/sched/sch_pie.c:418 fq_pie_timer+0x170/0x2a0 net/sched/sch_fq_pie.c:383 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421 expire_timers kernel/time/timer.c:1466 [inline] __run_timers.part.0+0x675/0xa20 kernel/time/timer.c:1734 __run_timers kernel/time/timer.c:1715 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 run_ksoftirqd kernel/softirq.c:921 [inline] run_ksoftirqd+0x2d/0x60 kernel/softirq.c:913 smpboot_thread_fn+0x645/0x9c0 kernel/smpboot.c:164 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Fixes: `ec97ecf1eb` ("net: sched: add Flow Queue PIE packet scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Cc: Sachin D. Patil <sdp.sachin@gmail.com> Cc: V. Saicharan <vsaicharan1998@gmail.com> Cc: Mohit Bhasi <mohitbhasi1998@gmail.com> Cc: Leslie Monis <lesliemonis@gmail.com> Cc: Gautam Ramakrishnan <gautamramk@gmail.com> Link: https://lore.kernel.org/r/20211209084937.3500020-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 08:01:00 -08:00
Andrea Mayer	ae68d93354	seg6: fix the iif in the IPv6 socket control block When an IPv4 packet is received, the ip_rcv_core(...) sets the receiving interface index into the IPv4 socket control block (v5.16-rc4, net/ipv4/ip_input.c line 510): IPCB(skb)->iif = skb->skb_iif; If that IPv4 packet is meant to be encapsulated in an outer IPv6+SRH header, the seg6_do_srh_encap(...) performs the required encapsulation. In this case, the seg6_do_srh_encap function clears the IPv6 socket control block (v5.16-rc4 net/ipv6/seg6_iptunnel.c line 163): memset(IP6CB(skb), 0, sizeof(*IP6CB(skb))); The memset(...) was introduced in commit `ef489749aa` ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation") a long time ago (2019-01-29). Since the IPv6 socket control block and the IPv4 socket control block share the same memory area (skb->cb), the receiving interface index info is lost (IP6CB(skb)->iif is set to zero). As a side effect, that condition triggers a NULL pointer dereference if commit `0857d6f8c7` ("ipv6: When forwarding count rx stats on the orig netdev") is applied. To fix that issue, we set the IP6CB(skb)->iif with the index of the receiving interface once again. Fixes: `ef489749aa` ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation") Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211208195409.12169-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 07:55:42 -08:00
Krzysztof Kozlowski	4cd8371a23	nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done The done() netlink callback nfc_genl_dump_ses_done() should check if received argument is non-NULL, because its allocation could fail earlier in dumpit() (nfc_genl_dump_ses()). Fixes: `ac22ac466a` ("NFC: Add a GET_SE netlink API") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211209081307.57337-1-krzysztof.kozlowski@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 07:50:32 -08:00
Tadeusz Struk	fd79a0cbf0	nfc: fix segfault in nfc_genl_dump_devices_done When kmalloc in nfc_genl_dump_devices() fails then nfc_genl_dump_devices_done() segfaults as below KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 25 Comm: kworker/0:1 Not tainted 5.16.0-rc4-01180-g2a987e65025e-dirty #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-6.fc35 04/01/2014 Workqueue: events netlink_sock_destruct_work RIP: 0010:klist_iter_exit+0x26/0x80 Call Trace: <TASK> class_dev_iter_exit+0x15/0x20 nfc_genl_dump_devices_done+0x3b/0x50 genl_lock_done+0x84/0xd0 netlink_sock_destruct+0x8f/0x270 __sk_destruct+0x64/0x3b0 sk_destruct+0xa8/0xd0 __sk_free+0x2e8/0x3d0 sk_free+0x51/0x90 netlink_sock_destruct_work+0x1c/0x20 process_one_work+0x411/0x710 worker_thread+0x6fd/0xa80 Link: https://syzkaller.appspot.com/bug?id=fc0fa5a53db9edd261d56e74325419faf18bd0df Reported-by: syzbot+f9f76f4a0766420b4a02@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211208182742.340542-1-tadeusz.struk@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 07:50:23 -08:00
Jianguo Wu	158390e456	udp: using datalen to cap max gso segments The max number of UDP gso segments is intended to cap to UDP_MAX_SEGMENTS, this is checked in udp_send_skb(): if (skb->len > cork->gso_size * UDP_MAX_SEGMENTS) { kfree_skb(skb); return -EINVAL; } skb->len contains network and transport header len here, we should use only data len instead. Fixes: `bec1f6f697` ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/900742e5-81fb-30dc-6e0b-375c6cdd7982@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-09 07:48:07 -08:00
Eric Dumazet	7770a39d7c	xfrm: fix a small bug in xfrm_sa_len() copy_user_offload() will actually push a struct struct xfrm_user_offload, which is different than (struct xfrm_state *)->xso (struct xfrm_state_offload) Fixes: `d77e38e612` ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-12-09 10:23:05 +01:00
Antoine Tenart	5f1c802ca6	net-sysfs: warn if new queue objects are being created during device unregistration Calling netdev_queue_update_kobjects is allowed during device unregistration since commit `5c56580b74` ("net: Adjust TX queue kobjects if number of queues changes during unregister"). But this is solely to allow queue unregistrations. Any path attempting to add new queues after a device started its unregistration should be fixed. This patch adds a warning to detect such illegal use. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 18:36:32 -08:00
Antoine Tenart	d7dac08341	net-sysfs: update the queue counts in the unregistration path When updating Rx and Tx queue kobjects, the queue count should always be updated to match the queue kobjects count. This was not done in the net device unregistration path, fix it. Tracking all queue count updates will allow in a following up patch to detect illegal updates. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 18:36:32 -08:00
Eric Dumazet	e195e9b5de	net, neigh: clear whole pneigh_entry at alloc time Commit `2c611ad97a` ("net, neigh: Extend neigh->flags to 32 bit to allow for extensions") enables a new KMSAM warning [1] I think the bug is actually older, because the following intruction only occurred if ndm->ndm_flags had NTF_PROXY set. pn->flags = ndm->ndm_flags; Let's clear all pneigh_entry fields at alloc time. [1] BUG: KMSAN: uninit-value in pneigh_fill_info+0x986/0xb30 net/core/neighbour.c:2593 pneigh_fill_info+0x986/0xb30 net/core/neighbour.c:2593 pneigh_dump_table net/core/neighbour.c:2715 [inline] neigh_dump_info+0x1e3f/0x2c60 net/core/neighbour.c:2832 netlink_dump+0xaca/0x16a0 net/netlink/af_netlink.c:2265 __netlink_dump_start+0xd1c/0xee0 net/netlink/af_netlink.c:2370 netlink_dump_start include/linux/netlink.h:254 [inline] rtnetlink_rcv_msg+0x181b/0x18c0 net/core/rtnetlink.c:5534 netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5589 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] sock_write_iter+0x594/0x690 net/socket.c:1057 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0x1318/0x2030 fs/read_write.c:590 ksys_write+0x28c/0x520 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0xdb/0x120 fs/read_write.c:652 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] slab_alloc mm/slub.c:3259 [inline] __kmalloc+0xc3c/0x12d0 mm/slub.c:4437 kmalloc include/linux/slab.h:595 [inline] pneigh_lookup+0x60f/0xd70 net/core/neighbour.c:766 arp_req_set_public net/ipv4/arp.c:1016 [inline] arp_req_set+0x430/0x10a0 net/ipv4/arp.c:1032 arp_ioctl+0x8d4/0xb60 net/ipv4/arp.c:1232 inet_ioctl+0x4ef/0x820 net/ipv4/af_inet.c:947 sock_do_ioctl net/socket.c:1118 [inline] sock_ioctl+0xa3f/0x13e0 net/socket.c:1235 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl+0x2df/0x4a0 fs/ioctl.c:860 __x64_sys_ioctl+0xd8/0x110 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae CPU: 1 PID: 20001 Comm: syz-executor.0 Not tainted 5.16.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `62dd93181a` ("[IPV6] NDISC: Set per-entry is_router flag in Proxy NA.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211206165329.1049835-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 17:41:44 -08:00
Jakub Kicinski	fd31cb0c6a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Fix bogus compilter warning in nfnetlink_queue, from Florian Westphal. 2) Don't run conntrack on vrf with !dflt qdisc, from Nicolas Dichtel. 3) Fix nft_pipapo bucket load in AVX2 lookup routine for six 8-bit groups, from Stefano Brivio. 4) Break rule evaluation on malformed TCP options. 5) Use socat instead of nc in selftests/netfilter/nft_zones_many.sh, also from Florian 6) Fix KCSAN data-race in conntrack timeout updates, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: conntrack: annotate data-races around ct->timeout selftests: netfilter: switch zone stress to socat netfilter: nft_exthdr: break evaluation if setting TCP option fails selftests: netfilter: Add correctness test for mac,net set type nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups vrf: don't run conntrack on vrf with !dflt qdisc netfilter: nfnetlink_queue: silence bogus compiler warning ==================== Link: https://lore.kernel.org/r/20211209000847.102598-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 17:02:35 -08:00
Jakub Kicinski	6efcdadc15	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== bpf 2021-12-08 We've added 12 non-merge commits during the last 22 day(s) which contain a total of 29 files changed, 659 insertions(+), 80 deletions(-). The main changes are: 1) Fix an off-by-two error in packet range markings and also add a batch of new tests for coverage of these corner cases, from Maxim Mikityanskiy. 2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh. 3) Fix two functional regressions and a build warning related to BTF kfunc for modules, from Kumar Kartikeya Dwivedi. 4) Fix outdated code and docs regarding BPF's migrate_disable() use on non- PREEMPT_RT kernels, from Sebastian Andrzej Siewior. 5) Add missing includes in order to be able to detangle cgroup vs bpf header dependencies, from Jakub Kicinski. 6) Fix regression in BPF sockmap tests caused by missing detachment of progs from sockets when they are removed from the map, from John Fastabend. 7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF dispatcher, from Björn Töpel. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add selftests to cover packet access corner cases bpf: Fix the off-by-two error in range markings treewide: Add missing includes masked by cgroup -> bpf dependency tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets bpf: Fix bpf_check_mod_kfunc_call for built-in modules bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL mips, bpf: Fix reference to non-existing Kconfig symbol bpf: Make sure bpf_disable_instrumentation() is safe vs preemption. Documentation/locking/locktypes: Update migrate_disable() bits. bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap bpf, sockmap: Attach map progs to psock early for feature probes bpf, x86: Fix "no previous prototype" warning ==================== Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 16:06:44 -08:00
Vladimir Oltean	857fdd74fb	net: dsa: eliminate dsa_switch_ops :: port_bridge_tx_fwd_{,un}offload We don't really need new switch API for these, and with new switches which intend to add support for this feature, it will become cumbersome to maintain. The change consists in restructuring the two drivers that implement this offload (sja1105 and mv88e6xxx) such that the offload is enabled and disabled from the ->port_bridge_{join,leave} methods instead of the old ->port_bridge_tx_fwd_{,un}offload. The only non-trivial change is that mv88e6xxx_map_virtual_bridge_to_pvt() has been moved to avoid a forward declaration, and the mv88e6xxx_reg_lock() calls from inside it have been removed, since locking is now done from mv88e6xxx_port_bridge_{join,leave}. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:16 -08:00
Vladimir Oltean	b079922ba2	net: dsa: add a "tx_fwd_offload" argument to ->port_bridge_join This is a preparation patch for the removal of the DSA switch methods ->port_bridge_tx_fwd_offload() and ->port_bridge_tx_fwd_unoffload(). The plan is for the switch to report whether it offloads TX forwarding directly as a response to the ->port_bridge_join() method. This change deals with the noisy portion of converting all existing function prototypes to take this new boolean pointer argument. The bool is placed in the cross-chip notifier structure for bridge join, and a reference to it is provided to drivers. In the next change, DSA will then actually look at this value instead of calling ->port_bridge_tx_fwd_offload(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:16 -08:00
Vladimir Oltean	d3eed0e57d	net: dsa: keep the bridge_dev and bridge_num as part of the same structure The main desire behind this is to provide coherent bridge information to the fast path without locking. For example, right now we set dp->bridge_dev and dp->bridge_num from separate code paths, it is theoretically possible for a packet transmission to read these two port properties consecutively and find a bridge number which does not correspond with the bridge device. Another desire is to start passing more complex bridge information to dsa_switch_ops functions. For example, with FDB isolation, it is expected that drivers will need to be passed the bridge which requested an FDB/MDB entry to be offloaded, and along with that bridge_dev, the associated bridge_num should be passed too, in case the driver might want to implement an isolation scheme based on that number. We already pass the {bridge_dev, bridge_num} pair to the TX forwarding offload switch API, however we'd like to remove that and squash it into the basic bridge join/leave API. So that means we need to pass this pair to the bridge join/leave API. During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we call the driver's .port_bridge_leave with what used to be our dp->bridge_dev, but provided as an argument. When bridge_dev and bridge_num get folded into a single structure, we need to preserve this behavior in dsa_port_bridge_leave: we need a copy of what used to be in dp->bridge. Switch drivers check bridge membership by comparing dp->bridge_dev with the provided bridge_dev, but now, if we provide the struct dsa_bridge as a pointer, they cannot keep comparing dp->bridge to the provided pointer, since this only points to an on-stack copy. To make this obvious and prevent driver writers from forgetting and doing stupid things, in this new API, the struct dsa_bridge is provided as a full structure (not very large, contains an int and a pointer) instead of a pointer. An explicit comparison function needs to be used to determine bridge membership: dsa_port_offloads_bridge(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:16 -08:00
Vladimir Oltean	6a43cba303	net: dsa: export bridging offload helpers to drivers Move the static inline helpers from net/dsa/dsa_priv.h to include/net/dsa.h, so that drivers can call functions such as dsa_port_offloads_bridge_dev(), which will be necessary after the transition to a more complex bridge structure. More functions than are needed right now are being moved, but this is done for uniformity. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:16 -08:00
Vladimir Oltean	936db8a2db	net: dsa: rename dsa_port_offloads_bridge to dsa_port_offloads_bridge_dev Currently the majority of dsa_port_bridge_dev_get() calls in drivers is just to check whether a port is under the bridge device provided as argument by the DSA API. We'd like to change that DSA API so that a more complex structure is provided as argument. To keep things more generic, and considering that the new complex structure will be provided by value and not by reference, direct comparisons between dp->bridge and the provided bridge will be broken. The generic way to do the checking would simply be to do something like dsa_port_offloads_bridge(dp, &bridge). But there's a problem, we already have a function named that way, which actually takes a bridge_dev net_device as argument. Rename it so that we can use dsa_port_offloads_bridge for something else. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:15 -08:00
Vladimir Oltean	36cbf39b56	net: dsa: hide dp->bridge_dev and dp->bridge_num in the core behind helpers The location of the bridge device pointer and number is going to change. It is not going to be kept individually per port, but in a common structure allocated dynamically and which will have lockdep validation. Create helpers to access these elements so that we have a migration path to the new organization. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:15 -08:00
Vladimir Oltean	947c8746e2	net: dsa: assign a bridge number even without TX forwarding offload The service where DSA assigns a unique bridge number for each forwarding domain is useful even for drivers which do not implement the TX forwarding offload feature. For example, drivers might use the dp->bridge_num for FDB isolation. So rename ds->num_fwd_offloading_bridges to ds->max_num_bridges, and calculate a unique bridge_num for all drivers that set this value. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:14 -08:00
Vladimir Oltean	3f9bb0301d	net: dsa: make dp->bridge_num one-based I have seen too many bugs already due to the fact that we must encode an invalid dp->bridge_num as a negative value, because the natural tendency is to check that invalid value using (!dp->bridge_num). Latest example can be seen in commit `1bec0f0506` ("net: dsa: fix bridge_num not getting cleared after ports leaving the bridge"). Convert the existing users to assume that dp->bridge_num == 0 is the encoding for invalid, and valid bridge numbers start from 1. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-08 14:31:14 -08:00
Wei Wang	1db8f5fc2e	virtio/vsock: fix the transport to work with VMADDR_CID_ANY The VMADDR_CID_ANY flag used by a socket means that the socket isn't bound to any specific CID. For example, a host vsock server may want to be bound with VMADDR_CID_ANY, so that a guest vsock client can connect to the host server with CID=VMADDR_CID_HOST (i.e. 2), and meanwhile, a host vsock client can connect to the same local server with CID=VMADDR_CID_LOCAL (i.e. 1). The current implementation sets the destination socket's svm_cid to a fixed CID value after the first client's connection, which isn't an expected operation. For example, if the guest client first connects to the host server, the server's svm_cid gets set to VMADDR_CID_HOST, then other host clients won't be able to connect to the server anymore. Reproduce steps: 1. Run the host server: socat VSOCK-LISTEN:1234,fork - 2. Run a guest client to connect to the host server: socat - VSOCK-CONNECT:2:1234 3. Run a host client to connect to the host server: socat - VSOCK-CONNECT:1:1234 Without this patch, step 3. above fails to connect, and socat complains "socat[1720] E connect(5, AF=40 cid:1 port:1234, 16): Connection reset by peer". With this patch, the above works well. Fixes: `c0cfa2d8a7` ("vsock: add multi-transports support") Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20211126011823.1760-1-wei.w.wang@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>	2021-12-08 15:41:50 -05:00
Kees Cook	c0e084e342	hv_sock: Extract hvs_send_data() helper that takes only header When building under -Warray-bounds, the compiler is especially conservative when faced with casts from a smaller object to a larger object. While this has found many real bugs, there are some cases that are currently false positives (like here). With this as one of the last few instances of the warning in the kernel before -Warray-bounds can be enabled globally, rearrange the functions so that there is a header-only version of hvs_send_data(). Silences this warning: net/vmw_vsock/hyperv_transport.c: In function 'hvs_shutdown_lock_held.constprop': net/vmw_vsock/hyperv_transport.c:231:32: warning: array subscript 'struct hvs_send_buf[0]' is partly outside array bounds of 'struct vmpipe_proto_header[1]' [-Warray-bounds] 231 \| send_buf->hdr.pkt_type = 1; \| ~~~~~~~~~~~~~~~~~~~~~~~^~~ net/vmw_vsock/hyperv_transport.c:465:36: note: while referencing 'hdr' 465 \| struct vmpipe_proto_header hdr; \| ^~~ This change results in no executable instruction differences. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20211207063217.2591451-1-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 21:49:19 -08:00
Eric Dumazet	ada066b2e0	net: sched: act_mirred: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:45:00 -08:00
Eric Dumazet	e7c8ab8419	openvswitch: add net device refcount tracker to struct vport Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:45:00 -08:00
Eric Dumazet	e4b8954074	netlink: add net device refcount tracker to struct ethnl_req_info Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:59 -08:00
Eric Dumazet	b60645248a	net/smc: add net device tracker to struct smc_pnetentry Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:59 -08:00
Eric Dumazet	035f1f2b96	pktgen add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:59 -08:00
Eric Dumazet	615d069dcf	llc: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:59 -08:00
Eric Dumazet	66ce07f780	ax25: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:59 -08:00
Eric Dumazet	e44b14ebae	inet: add net device refcount tracker to struct fib_nh_common Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:59 -08:00
Eric Dumazet	4fc003fe03	net: switchdev: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:58 -08:00
Eric Dumazet	f12bf6f3f9	net: watchdog: add net device refcount tracker Add a netdevice_tracker inside struct net_device, to track the self reference when a device has an active watchdog timer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:58 -08:00
Eric Dumazet	b2dcdc7f73	net: bridge: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:58 -08:00
Eric Dumazet	19c9ebf6ed	vlan: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 20:44:58 -08:00
Eric Dumazet	802a7dc5cf	netfilter: conntrack: annotate data-races around ct->timeout (struct nf_conn)->timeout can be read/written locklessly, add READ_ONCE()/WRITE_ONCE() to prevent load/store tearing. BUG: KCSAN: data-race in __nf_conntrack_alloc / __nf_conntrack_find_get write to 0xffff888132e78c08 of 4 bytes by task 6029 on cpu 0: __nf_conntrack_alloc+0x158/0x280 net/netfilter/nf_conntrack_core.c:1563 init_conntrack+0x1da/0xb30 net/netfilter/nf_conntrack_core.c:1635 resolve_normal_ct+0x502/0x610 net/netfilter/nf_conntrack_core.c:1746 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0x72/0x170 net/netfilter/core.c:619 nf_hook include/linux/netfilter.h:262 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push_pending_frames include/net/tcp.h:1897 [inline] tcp_data_snd_check+0x62/0x2e0 net/ipv4/tcp_input.c:5452 tcp_rcv_established+0x880/0x10e0 net/ipv4/tcp_input.c:5947 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0xf2/0x270 net/core/sock.c:2768 release_sock+0x40/0x110 net/core/sock.c:3300 sk_stream_wait_memory+0x435/0x700 net/core/stream.c:145 tcp_sendmsg_locked+0xb85/0x25a0 net/ipv4/tcp.c:1402 tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1440 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:644 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] __sys_sendto+0x21e/0x2c0 net/socket.c:2036 __do_sys_sendto net/socket.c:2048 [inline] __se_sys_sendto net/socket.c:2044 [inline] __x64_sys_sendto+0x74/0x90 net/socket.c:2044 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888132e78c08 of 4 bytes by task 17446 on cpu 1: nf_ct_is_expired include/net/netfilter/nf_conntrack.h:286 [inline] ____nf_conntrack_find net/netfilter/nf_conntrack_core.c:776 [inline] __nf_conntrack_find_get+0x1c7/0xac0 net/netfilter/nf_conntrack_core.c:807 resolve_normal_ct+0x273/0x610 net/netfilter/nf_conntrack_core.c:1734 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0x72/0x170 net/netfilter/core.c:619 nf_hook include/linux/netfilter.h:262 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402 __tcp_send_ack+0x1fd/0x300 net/ipv4/tcp_output.c:3956 tcp_send_ack+0x23/0x30 net/ipv4/tcp_output.c:3962 __tcp_ack_snd_check+0x2d8/0x510 net/ipv4/tcp_input.c:5478 tcp_ack_snd_check net/ipv4/tcp_input.c:5523 [inline] tcp_rcv_established+0x8c2/0x10e0 net/ipv4/tcp_input.c:5948 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0xf2/0x270 net/core/sock.c:2768 release_sock+0x40/0x110 net/core/sock.c:3300 tcp_sendpage+0x94/0xb0 net/ipv4/tcp.c:1114 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833 rds_tcp_xmit+0x376/0x5f0 net/rds/tcp_send.c:118 rds_send_xmit+0xbed/0x1500 net/rds/send.c:367 rds_send_worker+0x43/0x200 net/rds/threads.c:200 process_one_work+0x3fc/0x980 kernel/workqueue.c:2298 worker_thread+0x616/0xa70 kernel/workqueue.c:2445 kthread+0x2c7/0x2e0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 value changed: 0x00027cc2 -> 0x00000000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17446 Comm: kworker/u4:5 Tainted: G W 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: krdsd rds_send_worker Note: I chose an arbitrary commit for the Fixes: tag, because I do not think we need to backport this fix to very old kernels. Fixes: `e37542ba11` ("netfilter: conntrack: avoid possible false sharing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-12-08 01:29:15 +01:00
Pablo Neira Ayuso	962e5a4035	netfilter: nft_exthdr: break evaluation if setting TCP option fails Break rule evaluation on malformed TCP options. Fixes: `99d1712bc4` ("netfilter: exthdr: tcp option set support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-12-08 01:05:55 +01:00
Stefano Brivio	b7e945e228	nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups The sixth byte of packet data has to be looked up in the sixth group, not in the seventh one, even if we load the bucket data into ymm6 (and not ymm5, for convenience of tracking stalls). Without this fix, matching on a MAC address as first field of a set, if 8-bit groups are selected (due to a small set size) would fail, that is, the given MAC address would never match. Reported-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Cc: <stable@vger.kernel.org> # 5.6.x Fixes: `7400b06396` ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-By: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-12-08 01:05:55 +01:00
Maxim Galaganov	4f6e14bd19	mptcp: support TCP_CORK and TCP_NODELAY First, add cork and nodelay fields to the mptcp_sock structure so they can be used in sync_socket_options(), and fill them on setsockopt while holding the msk socket lock. Then, on setsockopt set proper tcp_sk(ssk)->nonagle values for subflows by calling __tcp_sock_set_cork() or __tcp_sock_set_nodelay() on the ssk while holding the ssk socket lock. tcp_push_pending_frames() will be invoked on the ssk if a cork was cleared or nodelay was set. Also set MPTCP_PUSH_PENDING bit by calling mptcp_check_and_set_pending(). This will lead to __mptcp_push_pending() being called inside mptcp_release_cb() with new tcp_sk(ssk)->nonagle. Also add getsockopt support for TCP_CORK and TCP_NODELAY. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Maxim Galaganov <max@internet.ru> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:31 -08:00
Maxim Galaganov	8b38217a2a	mptcp: expose mptcp_check_and_set_pending Expose the mptcp_check_and_set_pending() function for use inside MPTCP sockopt code. The next patch will call it when TCP_CORK is cleared or TCP_NODELAY is set on the MPTCP socket in order to push pending data from mptcp_release_cb(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Maxim Galaganov <max@internet.ru> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:31 -08:00
Maxim Galaganov	6fadaa5658	tcp: expose __tcp_sock_set_cork and __tcp_sock_set_nodelay Expose __tcp_sock_set_cork() and __tcp_sock_set_nodelay() for use in MPTCP setsockopt code -- namely for syncing MPTCP socket options with subflows inside sync_socket_options() while already holding the subflow socket lock. Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Maxim Galaganov <max@internet.ru> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:30 -08:00
Florian Westphal	3b1e21eb60	mptcp: getsockopt: add support for IP_TOS earlier patch added IP_TOS setsockopt support, this allows to get the value set by earlier setsockopt. Extends mptcp_put_int_option to handle u8 input/output by adding required cast. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:30 -08:00
Davide Caratti	602837e847	mptcp: allow changing the "backup" bit by endpoint id a non-zero 'id' is sufficient to identify MPTCP endpoints: allow changing the value of 'backup' bit by simply specifying the endpoint id. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/158 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:30 -08:00
Florian Westphal	644807e3e4	mptcp: add SIOCINQ, OUTQ and OUTQNSD ioctls Allows to query in-sequence data ready for read(), total bytes in write queue and total bytes in write queue that have not yet been sent. v2: remove unneeded READ_ONCE() (Paolo Abeni) v3: check for new data unconditionally in SIOCINQ ioctl (Mat Martineau) Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:29 -08:00
Florian Westphal	2c9e77659a	mptcp: add TCP_INQ cmsg support Support the TCP_INQ setsockopt. This is a boolean that tells recvmsg path to include the remaining in-sequence bytes in the cmsg data. v2: do not use CB(skb)->offset, increment map_seq instead (Paolo Abeni) v3: adjust CB(skb)->map_seq when taking skb from ofo queue (Paolo Abeni) Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/224 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-07 11:36:29 -08:00
Luiz Augusto von Dentz	8aca46f91c	Bluetooth: mgmt: Introduce mgmt_alloc_skb and mgmt_send_event_skb This introduces mgmt_alloc_skb and mgmt_send_event_skb which are convenient when building MGMT events that have variable length as the likes of skb_put_data can be used to insert portion directly on the skb instead of having to first build an intermediate buffer just to be copied over the skb. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:52 +01:00
Luiz Augusto von Dentz	9a667031b9	Bluetooth: msft: Fix compilation when CONFIG_BT_MSFTEXT is not set This fixes compilation when CONFIG_BT_MSFTEXT is not set. Fixes: 6b3d4c8fcf3f2 ("Bluetooth: hci_event: Use of a function table to handle HCI events") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Luiz Augusto von Dentz	853b70b506	Bluetooth: hci_sync: Set Privacy Mode when updating the resolving list This adds support for Set Privacy Mode when updating the resolving list when HCI_CONN_FLAG_DEVICE_PRIVACY so the controller shall use Device Mode for devices programmed in the resolving list, Device Mode is actually required when the remote device are not able to use RPA as otherwise the default mode is Network Privacy Mode in which only allows RPAs thus the controller would filter out advertisement using identity addresses for which there is an IRK. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Luiz Augusto von Dentz	6126ffabba	Bluetooth: Introduce HCI_CONN_FLAG_DEVICE_PRIVACY device flag This introduces HCI_CONN_FLAG_DEVICE_PRIVACY which can be used by userspace to indicate to the controller to use Device Privacy Mode to a specific device. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Luiz Augusto von Dentz	fe92ee6425	Bluetooth: hci_core: Rework hci_conn_params flags This reworks hci_conn_params flags to use bitmap_* helpers and add support for setting the supported flags in hdev->conn_flags so it can easily be accessed. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Luiz Augusto von Dentz	6f59f991b4	Bluetooth: MGMT: Use hci_dev_test_and_{set,clear}_flag This make use of hci_dev_test_and_{set,clear}_flag instead of doing 2 operations in a row. Fixes: `cbbdfa6f33` ("Bluetooth: Enable controller RPA resolution using Experimental feature") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Aditya Garg	d2f8114f95	Bluetooth: add quirk disabling LE Read Transmit Power Some devices have a bug causing them to not work if they query LE tx power on startup. Thus we add a quirk in order to not query it and default min/max tx power values to HCI_TX_POWER_INVALID. Signed-off-by: Aditya Garg <gargaditya08@live.com> Reported-by: Orlando Chamberlain <redecorating@protonmail.com> Tested-by: Orlando Chamberlain <redecorating@protonmail.com> Link: https://lore.kernel.org/r/4970a940-211b-25d6-edab-21a815313954@protonmail.com Fixes: `7c395ea521` ("Bluetooth: Query LE tx power on startup") Cc: stable@vger.kernel.org Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Luiz Augusto von Dentz	147306ccbb	Bluetooth: hci_event: Use of a function table to handle Command Status This change the use of switch statement to a function table which is easier to extend. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:51 +01:00
Luiz Augusto von Dentz	c8992cffbe	Bluetooth: hci_event: Use of a function table to handle Command Complete This change the use of switch statement to a function table which is easier to extend and can include min/max length of each command. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	95118dd4ed	Bluetooth: hci_event: Use of a function table to handle LE subevents This change the use of switch statement to a function table which is easier to extend and can include min/max length of each subevent. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	3e54c5890c	Bluetooth: hci_event: Use of a function table to handle HCI events This change the use of switch statement to a function table which is easier to extend and can include min/max length of each HCI event. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	a3679649a1	Bluetooth: HCI: Use skb_pull_data to parse LE Direct Advertising Report event This uses skb_pull_data to check the LE Direct Advertising Report events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	b48b833f9e	Bluetooth: HCI: Use skb_pull_data to parse LE Ext Advertising Report event This uses skb_pull_data to check the LE Extended Advertising Report events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	47afe93c91	Bluetooth: HCI: Use skb_pull_data to parse LE Advertising Report event This uses skb_pull_data to check the LE Advertising Report events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	12cfe4176a	Bluetooth: HCI: Use skb_pull_data to parse LE Metaevents This uses skb_pull_data to check the LE Metaevents received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	70a6b8de6a	Bluetooth: HCI: Use skb_pull_data to parse Extended Inquiry Result event This uses skb_pull_data to check the Extended Inquiry Result events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	8d08d324fd	Bluetooth: HCI: Use skb_pull_data to parse Inquiry Result with RSSI event This uses skb_pull_data to check the Inquiry Result with RSSI events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	27d9eb4bca	Bluetooth: HCI: Use skb_pull_data to parse Inquiry Result event This uses skb_pull_data to check the Inquiry Result events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	aadc3d2f42	Bluetooth: HCI: Use skb_pull_data to parse Number of Complete Packets event This uses skb_pull_data to check the Number of Complete Packets events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:50 +01:00
Luiz Augusto von Dentz	e3f3a1aea8	Bluetooth: HCI: Use skb_pull_data to parse Command Complete event This uses skb_pull_data to check the Command Complete events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:49 +01:00
Luiz Augusto von Dentz	ae61a10d9d	Bluetooth: HCI: Use skb_pull_data to parse BR/EDR events This uses skb_pull_data to check the BR/EDR events received have the minimum required length. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:49 +01:00
Luiz Augusto von Dentz	13244cccc2	skbuff: introduce skb_pull_data Like skb_pull but returns the original data pointer before pulling the data after performing a check against sbk->len. This allows to change code that does "struct foo p = (void )skb->data;" which is hard to audit and error prone, to: p = skb_pull_data(skb, sizeof(*p)); if (!p) return; Which is both safer and cleaner. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-12-07 17:05:38 +01:00
Tony Lu	1c5526968e	net/smc: Clear memory when release and reuse buffer Currently, buffers are cleared when smc connections are created and buffers are reused. This slows down the speed of establishing new connections. In most cases, the applications want to establish connections as quickly as possible. This patch moves memset() from connection creation path to release and buffer unuse path, this trades off between speed of establishing and release. Test environments: - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4 - socket sndbuf / rcvbuf: 16384 / 131072 bytes - w/o first round, 5 rounds, avg, 100 conns batch per round - smc_buf_create() use bpftrace kprobe, introduces extra latency Latency benchmarks for smc_buf_create(): w/o patch : 19040.0 ns w/ patch : 1932.6 ns ratio : 10.2% (-89.8%) Latency benchmarks for socket create and connect: w/o patch : 143.3 us w/ patch : 102.2 us ratio : 71.3% (-28.7%) The latency of establishing connections is reduced by 28.7%. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/20211203113331.2818873-1-kgraul@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 17:01:28 -08:00
Eric Dumazet	4dbb0dad8e	devlink: fix netns refcount leak in devlink_nl_cmd_reload() While preparing my patch series adding netns refcount tracking, I spotted bugs in devlink_nl_cmd_reload() Some error paths forgot to release a refcount on a netns. To fix this, we can reduce the scope of get_net()/put_net() section around the call to devlink_reload(). Fixes: `ccdf07219d` ("devlink: Add reload action option to devlink reload command") Fixes: `dc64cc7c63` ("devlink: Add devlink reload limit option") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Moshe Shemesh <moshe@mellanox.com> Cc: Jacob Keller <jacob.e.keller@intel.com> Cc: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20211205192822.1741045-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:56:32 -08:00
Antoine Tenart	dde91ccfa2	ethtool: do not perform operations on net devices being unregistered There is a short period between a net device starts to be unregistered and when it is actually gone. In that time frame ethtool operations could still be performed, which might end up in unwanted or undefined behaviours[1]. Do not allow ethtool operations after a net device starts its unregistration. This patch targets the netlink part as the ioctl one isn't affected: the reference to the net device is taken and the operation is executed within an rtnl lock section and the net device won't be found after unregister. [1] For example adding Tx queues after unregister ends up in NULL pointer exceptions and UaFs, such as: BUG: KASAN: use-after-free in kobject_get+0x14/0x90 Read of size 1 at addr ffff88801961248c by task ethtool/755 CPU: 0 PID: 755 Comm: ethtool Not tainted 5.15.0-rc6+ #778 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/014 Call Trace: dump_stack_lvl+0x57/0x72 print_address_description.constprop.0+0x1f/0x140 kasan_report.cold+0x7f/0x11b kobject_get+0x14/0x90 kobject_add_internal+0x3d1/0x450 kobject_init_and_add+0xba/0xf0 netdev_queue_update_kobjects+0xcf/0x200 netif_set_real_num_tx_queues+0xb4/0x310 veth_set_channels+0x1c3/0x550 ethnl_set_channels+0x524/0x610 Fixes: `041b1c5d4a` ("ethtool: helper functions for netlink interface") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20211203101318.435618-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:53:32 -08:00
Eric Dumazet	5fa5ae6058	netpoll: add net device refcount tracker to struct netpoll Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:06:02 -08:00
Eric Dumazet	42120a8643	ipmr, ip6mr: add net device refcount tracker to struct vif_device Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:06:02 -08:00
Eric Dumazet	095e200f17	net: failover: add net device refcount tracker Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:06:02 -08:00
Eric Dumazet	63f13937cb	net: linkwatch: add net device refcount tracker Add a netdevice_tracker inside struct net_device, to track the self reference when a device is in lweventlist. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:44 -08:00
Eric Dumazet	606509f27f	net/sched: add net device refcount tracker to struct Qdisc Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:12 -08:00
Eric Dumazet	c04438f58d	ipv4: add net device refcount tracker to struct in_device Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	8c727003c4	ipv6: add net device refcount tracker to struct inet6_dev Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	f77159a348	net: add net device refcount tracker to struct netdev_adjacent Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	08d622568e	net: add net device refcount tracker to struct neigh_parms Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	77a23b1f95	net: add net device refcount tracker to struct pneigh_entry Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	85662c9f8c	net: add net device refcount tracker to struct neighbour Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	56c1c77948	ipv6: add net device refcount tracker to struct ip6_tnl Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	c0fd407a06	sit: add net device refcount tracking to ip_tunnel Note that other ip_tunnel users do not seem to hold a reference on tunnel->dev. Probably needs some investigations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	fb67510ba9	ipv6: add net device refcount tracker to rt6_probe_deferred() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:11 -08:00
Eric Dumazet	9038c32000	net: dst: add net device refcount tracking to dst_entry We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:10 -08:00
Eric Dumazet	4dbd24f65c	drop_monitor: add net device refcount tracker We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:10 -08:00
Eric Dumazet	14ed029b5e	net: add net device refcount tracker to dev_ifsioc() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:10 -08:00
Eric Dumazet	5ae2195088	net: add net device refcount tracker to ethtool_phys_id() This helper might hold a netdev reference for a long time, lets add reference tracking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:10 -08:00
Eric Dumazet	0b688f24b7	net: add net device refcount tracker to struct netdev_queue This will help debugging pesky netdev reference leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:10 -08:00
Eric Dumazet	80e8921b2b	net: add net device refcount tracker to struct netdev_rx_queue This helps debugging net device refcount leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:10 -08:00
Eric Dumazet	4d92b95ff2	net: add net device refcount tracker infrastructure net device are refcounted. Over the years we had numerous bugs caused by imbalanced dev_hold() and dev_put() calls. The general idea is to be able to precisely pair each decrement with a corresponding prior increment. Both share a cookie, basically a pointer to private data storing stack traces. This patch adds dev_hold_track() and dev_put_track(). To use these helpers, each data structure owning a refcount should also use a "netdevice_tracker" to pair the hold and put. netdevice_tracker dev_tracker; ... dev_hold_track(dev, &dev_tracker, GFP_ATOMIC); ... dev_put_track(dev, &dev_tracker); Whenever a leak happens, we will get precise stack traces of the point dev_hold_track() happened, at device dismantle phase. We will also get a stack trace if too many dev_put_track() for the same netdevice_tracker are attempted. This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-06 16:05:07 -08:00
Benjamin Berg	2250abadd3	Bluetooth: hci_core: Cancel sync command if sending a frame failed If sending a frame failed any sync command associated with it will never be completed. As such, cancel any such command immediately to avoid timing out. Signed-off-by: Benjamin Berg <bberg@redhat.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2021-12-03 10:41:59 -08:00
Benjamin Berg	914b08b330	Bluetooth: Add hci_cmd_sync_cancel to public API After transfer errors it makes sense to cancel an ongoing synchronous command that cannot complete anymore. To permit this, export the old hci_req_sync_cancel function as hci_cmd_sync_cancel in the API. Signed-off-by: Benjamin Berg <bberg@redhat.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2021-12-03 10:41:59 -08:00
Benjamin Berg	ae422391e1	Bluetooth: Reset more state when cancelling a sync command Resetting the timers and cmd_cnt means that we assume the device will be in a good state after the sync command finishes. Without this a chain of synchronous commands might get stuck if one of them is cancelled. Signed-off-by: Benjamin Berg <bberg@redhat.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2021-12-03 10:41:59 -08:00
Eric Dumazet	03cfda4fa6	tcp: fix another uninit-value (sk_rx_queue_mapping) KMSAN is still not happy [1]. I missed that passive connections do not inherit their sk_rx_queue_mapping values from the request socket, but instead tcp_child_process() is calling sk_mark_napi_id(child, skb) We have many sk_mark_napi_id() callers, so I am providing a new helper, forcing the setting sk_rx_queue_mapping and sk_napi_id. Note that we had no KMSAN report for sk_napi_id because passive connections got a copy of this field from the listener. sk_rx_queue_mapping in the other hand is inside the sk_dontcopy_begin/sk_dontcopy_end so sk_clone_lock() leaves this field uninitialized. We might remove dead code populating req->sk_rx_queue_mapping in the future. [1] BUG: KMSAN: uninit-value in __sk_rx_queue_set include/net/sock.h:1924 [inline] BUG: KMSAN: uninit-value in sk_rx_queue_update include/net/sock.h:1938 [inline] BUG: KMSAN: uninit-value in sk_mark_napi_id include/net/busy_poll.h:136 [inline] BUG: KMSAN: uninit-value in tcp_child_process+0xb42/0x1050 net/ipv4/tcp_minisocks.c:833 __sk_rx_queue_set include/net/sock.h:1924 [inline] sk_rx_queue_update include/net/sock.h:1938 [inline] sk_mark_napi_id include/net/busy_poll.h:136 [inline] tcp_child_process+0xb42/0x1050 net/ipv4/tcp_minisocks.c:833 tcp_v4_rcv+0x3d83/0x4ed0 net/ipv4/tcp_ipv4.c:2066 ip_protocol_deliver_rcu+0x760/0x10b0 net/ipv4/ip_input.c:204 ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ip_local_deliver+0x584/0x8c0 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:460 [inline] ip_sublist_rcv_finish net/ipv4/ip_input.c:551 [inline] ip_list_rcv_finish net/ipv4/ip_input.c:601 [inline] ip_sublist_rcv+0x11fd/0x1520 net/ipv4/ip_input.c:609 ip_list_rcv+0x95f/0x9a0 net/ipv4/ip_input.c:644 __netif_receive_skb_list_ptype net/core/dev.c:5505 [inline] __netif_receive_skb_list_core+0xe34/0x1240 net/core/dev.c:5553 __netif_receive_skb_list+0x7fc/0x960 net/core/dev.c:5605 netif_receive_skb_list_internal+0x868/0xde0 net/core/dev.c:5696 gro_normal_list net/core/dev.c:5850 [inline] napi_complete_done+0x579/0xdd0 net/core/dev.c:6587 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] virtnet_poll+0x17b6/0x2350 drivers/net/virtio_net.c:1557 __napi_poll+0x14e/0xbc0 net/core/dev.c:7020 napi_poll net/core/dev.c:7087 [inline] net_rx_action+0x824/0x1880 net/core/dev.c:7174 __do_softirq+0x1fe/0x7eb kernel/softirq.c:558 run_ksoftirqd+0x33/0x50 kernel/softirq.c:920 smpboot_thread_fn+0x616/0xbf0 kernel/smpboot.c:164 kthread+0x721/0x850 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 Uninit was created at: __alloc_pages+0xbc7/0x10a0 mm/page_alloc.c:5409 alloc_pages+0x8a5/0xb80 alloc_slab_page mm/slub.c:1810 [inline] allocate_slab+0x287/0x1c20 mm/slub.c:1947 new_slab mm/slub.c:2010 [inline] ___slab_alloc+0xbdf/0x1e90 mm/slub.c:3039 __slab_alloc mm/slub.c:3126 [inline] slab_alloc_node mm/slub.c:3217 [inline] slab_alloc mm/slub.c:3259 [inline] kmem_cache_alloc+0xbb3/0x11c0 mm/slub.c:3264 sk_prot_alloc+0xeb/0x570 net/core/sock.c:1914 sk_clone_lock+0xd6/0x1940 net/core/sock.c:2118 inet_csk_clone_lock+0x8d/0x6a0 net/ipv4/inet_connection_sock.c:956 tcp_create_openreq_child+0xb1/0x1ef0 net/ipv4/tcp_minisocks.c:453 tcp_v4_syn_recv_sock+0x268/0x2710 net/ipv4/tcp_ipv4.c:1563 tcp_check_req+0x207c/0x2a30 net/ipv4/tcp_minisocks.c:765 tcp_v4_rcv+0x36f5/0x4ed0 net/ipv4/tcp_ipv4.c:2047 ip_protocol_deliver_rcu+0x760/0x10b0 net/ipv4/ip_input.c:204 ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ip_local_deliver+0x584/0x8c0 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:460 [inline] ip_sublist_rcv_finish net/ipv4/ip_input.c:551 [inline] ip_list_rcv_finish net/ipv4/ip_input.c:601 [inline] ip_sublist_rcv+0x11fd/0x1520 net/ipv4/ip_input.c:609 ip_list_rcv+0x95f/0x9a0 net/ipv4/ip_input.c:644 __netif_receive_skb_list_ptype net/core/dev.c:5505 [inline] __netif_receive_skb_list_core+0xe34/0x1240 net/core/dev.c:5553 __netif_receive_skb_list+0x7fc/0x960 net/core/dev.c:5605 netif_receive_skb_list_internal+0x868/0xde0 net/core/dev.c:5696 gro_normal_list net/core/dev.c:5850 [inline] napi_complete_done+0x579/0xdd0 net/core/dev.c:6587 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] virtnet_poll+0x17b6/0x2350 drivers/net/virtio_net.c:1557 __napi_poll+0x14e/0xbc0 net/core/dev.c:7020 napi_poll net/core/dev.c:7087 [inline] net_rx_action+0x824/0x1880 net/core/dev.c:7174 __do_softirq+0x1fe/0x7eb kernel/softirq.c:558 Fixes: `342159ee39` ("net: avoid dirtying sk->sk_rx_queue_mapping") Fixes: `a37a0ee4d2` ("net: avoid uninit-value from tcp_conn_request") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Tested-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-03 14:15:49 +00:00
Eric Dumazet	a941892455	inet: use #ifdef CONFIG_SOCK_RX_QUEUE_MAPPING consistently Since commit `4e1beecc3b` ("net/sock: Add kernel config SOCK_RX_QUEUE_MAPPING"), sk_rx_queue_mapping access is guarded by CONFIG_SOCK_RX_QUEUE_MAPPING. Fixes: `54b92e8419` ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Tariq Toukan <tariqt@nvidia.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-03 14:15:07 +00:00
Chris Mi	43332cf974	net/sched: act_ct: Offload only ASSURED connections Short-lived connections increase the insertion rate requirements, fill the offload table and provide very limited offload value since they process a very small amount of packets. The ct ASSURED flag is designed to filter short-lived connections for early expiration. Offload connections when they are ESTABLISHED and ASSURED. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-03 11:02:25 +00:00
Jakub Kicinski	fc993be36f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-02 11:44:56 -08:00
Linus Torvalds	a51e3ac43d	Networking fixes for 5.16-rc4, including fixes from wireless, and wireguard. Current release - regressions: - smc: keep smc_close_final()'s error code during active close Current release - new code bugs: - iwlwifi: various static checker fixes (int overflow, leaks, missing error codes) - rtw89: fix size of firmware header before transfer, avoid crash - mt76: fix timestamp check in tx_status; fix pktid leak; - mscc: ocelot: fix missing unlock on error in ocelot_hwstamp_set() Previous releases - regressions: - smc: fix list corruption in smc_lgr_cleanup_early - ipv4: convert fib_num_tclassid_users to atomic_t Previous releases - always broken: - tls: fix authentication failure in CCM mode - vrf: reset IPCB/IP6CB when processing outbound pkts, prevent incorrect processing - dsa: mv88e6xxx: fixes for various device errata - rds: correct socket tunable error in rds_tcp_tune() - ipv6: fix memory leak in fib6_rule_suppress - wireguard: reset peer src endpoint when netns exits - wireguard: improve resilience to DoS around incoming handshakes - tcp: fix page frag corruption on page fault which involves TCP - mpls: fix missing attributes in delete notifications - mt7915: fix NULL pointer dereference with ad-hoc mode Misc: - rt2x00: be more lenient about EPROTO errors during start - mlx4_en: update reported link modes for 1/10G Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGo6KMACgkQMUZtbf5S Iruq+BAAhRMTcL+X4eRIL9lIEvWEKHMKLCA/pUaQWNlSxsbEeydJWRNSc37Cs3pv z0rYIEhfieOz8+QXS1Kq+yZwJVXjA8Jvgld2qw9V9Y5w+N15Mj8RUtG8NaUw+o4E U8PCAbaamnbzyPdlCYcVHschd8MD0BCXm5+jAGeIyCP+KQCnhEpFZv+bvHaWzQR8 FZLYrhXTR9W0DFsrKG9+haqFwFBR3+VDqTGILhaHPE+r2o6wKQQ5yJMhd8fq0SaC nne8zDkGuFEeW3cxj0VbhdRMyrV97eMK+P4dZ2P0Z7xcrsed9/2XJkNQNJGtuRnj GGJV6utupJRAY+lnJNUkifqS4Wt7KirfZsSsyaKKa4plyoVgtGhiqEYFTQVLagC0 CF4Qe+3qks6rESbRu6PEFN4oWSkMEhRzdcDpg7vBDURUKcrRs9fgtNUJUCi8nKFA A/F/K+7IHBoBZyQYZbYmnGdNsNauKbF3rUY3hwMGBfQZIr/wsql9+jhtLsmZX77m V/L7KzT2jhhNc5gDzuLps25K3P7snKuV19qQSsY2LeuGj1x3gmWZ+ibN6ynhB+Gt KBnfHDMTI/4aciZBIbwJmwfeRhCF8tOfw0WZdUP7FRIXukbfVuDBoznWLz4BKKgf GSYSTNDs/PHZQo5vCQ/onvTwUK5aN6zoPNy5ih7lp9YZBYtN2TI= =r0Jh -----END PGP SIGNATURE----- Merge tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless, and wireguard. Mostly scattered driver changes this week, with one big clump in mv88e6xxx. Nothing of note, really. Current release - regressions: - smc: keep smc_close_final()'s error code during active close Current release - new code bugs: - iwlwifi: various static checker fixes (int overflow, leaks, missing error codes) - rtw89: fix size of firmware header before transfer, avoid crash - mt76: fix timestamp check in tx_status; fix pktid leak; - mscc: ocelot: fix missing unlock on error in ocelot_hwstamp_set() Previous releases - regressions: - smc: fix list corruption in smc_lgr_cleanup_early - ipv4: convert fib_num_tclassid_users to atomic_t Previous releases - always broken: - tls: fix authentication failure in CCM mode - vrf: reset IPCB/IP6CB when processing outbound pkts, prevent incorrect processing - dsa: mv88e6xxx: fixes for various device errata - rds: correct socket tunable error in rds_tcp_tune() - ipv6: fix memory leak in fib6_rule_suppress - wireguard: reset peer src endpoint when netns exits - wireguard: improve resilience to DoS around incoming handshakes - tcp: fix page frag corruption on page fault which involves TCP - mpls: fix missing attributes in delete notifications - mt7915: fix NULL pointer dereference with ad-hoc mode Misc: - rt2x00: be more lenient about EPROTO errors during start - mlx4_en: update reported link modes for 1/10G" * tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: dsa: b53: Add SPI ID table gro: Fix inconsistent indenting selftests: net: Correct case name net/rds: correct socket tunable error in rds_tcp_tune() mctp: Don't let RTM_DELROUTE delete local routes net/smc: Keep smc_close_final rc during active close ibmvnic: drop bad optimization in reuse_tx_pools() ibmvnic: drop bad optimization in reuse_rx_pools() net/smc: fix wrong list_del in smc_lgr_cleanup_early Fix Comment of ETH_P_802_3_MIN ethernet: aquantia: Try MAC address from device tree ipv4: convert fib_num_tclassid_users to atomic_t net: avoid uninit-value from tcp_conn_request net: annotate data-races on txq->xmit_lock_owner octeontx2-af: Fix a memleak bug in rvu_mbox_init() net/mlx4_en: Fix an use-after-free bug in mlx4_en_try_alloc_resources() vrf: Reset IPCB/IP6CB when processing outbound pkts in vrf dev xmit net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings() net: dsa: mv88e6xxx: Link in pcs_get_state() if AN is bypassed net: dsa: mv88e6xxx: Fix inband AN for 2500base-x on 88E6393X family ...	2021-12-02 11:22:06 -08:00
Alexei Starovoitov	8293eb995f	bpf: Rename btf_member accessors. Rename btf_member_bit_offset() and btf_member_bitfield_size() to avoid conflicts with similarly named helpers in libbpf's btf.h. Rename the kernel helpers, since libbpf helpers are part of uapi. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-3-alexei.starovoitov@gmail.com	2021-12-02 11:18:34 -08:00
Xu Wang	d9e56d1839	mctp: Remove redundant if statements The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 13:09:33 +00:00
Xu Wang	98fa41d627	net: openvswitch: Remove redundant if statements The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 13:08:19 +00:00
Jiapeng Chong	1ebb87cc89	gro: Fix inconsistent indenting Eliminate the follow smatch warning: net/ipv6/ip6_offload.c:249 ipv6_gro_receive() warn: inconsistent indenting. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 12:21:11 +00:00
William Kucharski	19f36edf14	net/rds: correct socket tunable error in rds_tcp_tune() Correct an error where setting /proc/sys/net/rds/tcp/rds_tcp_rcvbuf would instead modify the socket's sk_sndbuf and would leave sk_rcvbuf untouched. Fixes: `c6a58ffed5` ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket") Signed-off-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 12:16:57 +00:00
Matt Johnston	76d001603c	mctp: Don't let RTM_DELROUTE delete local routes We need to test against the existing route type, not the rtm_type in the netlink request. Fixes: `83f0a0b728` ("mctp: Specify route types, require rtm_type in RTM_*ROUTE messages") Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 12:15:25 +00:00
Tony Lu	00e158fb91	net/smc: Keep smc_close_final rc during active close When smc_close_final() returns error, the return code overwrites by kernel_sock_shutdown() in smc_close_active(). The return code of smc_close_final() is more important than kernel_sock_shutdown(), and it will pass to userspace directly. Fix it by keeping both return codes, if smc_close_final() raises an error, return it or kernel_sock_shutdown()'s. Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/ Fixes: `606a63c978` ("net/smc: Ensure the active closing peer first closes clcsock") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 12:14:36 +00:00
Dust Li	789b6cc2a5	net/smc: fix wrong list_del in smc_lgr_cleanup_early smc_lgr_cleanup_early() meant to delete the link group from the link group list, but it deleted the list head by mistake. This may cause memory corruption since we didn't remove the real link group from the list and later memseted the link group structure. We got a list corruption panic when testing: [ 231.277259] list_del corruption. prev->next should be ffff8881398a8000, but was 0000000000000000 [ 231.278222] ------------[ cut here ]------------ [ 231.278726] kernel BUG at lib/list_debug.c:53! [ 231.279326] invalid opcode: 0000 [#1] SMP NOPTI [ 231.279803] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.10.46+ #435 [ 231.280466] Hardware name: Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014 [ 231.281248] Workqueue: events smc_link_down_work [ 231.281732] RIP: 0010:__list_del_entry_valid+0x70/0x90 [ 231.282258] Code: 4c 60 82 e8 7d cc 6a 00 0f 0b 48 89 fe 48 c7 c7 88 4c 60 82 e8 6c cc 6a 00 0f 0b 48 89 fe 48 c7 c7 c0 4c 60 82 e8 5b cc 6a 00 <0f> 0b 48 89 fe 48 c7 c7 00 4d 60 82 e8 4a cc 6a 00 0f 0b cc cc cc [ 231.284146] RSP: 0018:ffffc90000033d58 EFLAGS: 00010292 [ 231.284685] RAX: 0000000000000054 RBX: ffff8881398a8000 RCX: 0000000000000000 [ 231.285415] RDX: 0000000000000001 RSI: ffff88813bc18040 RDI: ffff88813bc18040 [ 231.286141] RBP: ffffffff8305ad40 R08: 0000000000000003 R09: 0000000000000001 [ 231.286873] R10: ffffffff82803da0 R11: ffffc90000033b90 R12: 0000000000000001 [ 231.287606] R13: 0000000000000000 R14: ffff8881398a8000 R15: 0000000000000003 [ 231.288337] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 [ 231.289160] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 231.289754] CR2: 0000000000e72058 CR3: 000000010fa96006 CR4: 00000000003706f0 [ 231.290485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 231.291211] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 231.291940] Call Trace: [ 231.292211] smc_lgr_terminate_sched+0x53/0xa0 [ 231.292677] smc_switch_conns+0x75/0x6b0 [ 231.293085] ? update_load_avg+0x1a6/0x590 [ 231.293517] ? ttwu_do_wakeup+0x17/0x150 [ 231.293907] ? update_load_avg+0x1a6/0x590 [ 231.294317] ? newidle_balance+0xca/0x3d0 [ 231.294716] smcr_link_down+0x50/0x1a0 [ 231.295090] ? __wake_up_common_lock+0x77/0x90 [ 231.295534] smc_link_down_work+0x46/0x60 [ 231.295933] process_one_work+0x18b/0x350 Fixes: `a0a62ee15a` ("net/smc: separate locks for SMCD and SMCR link group lists") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 12:07:46 +00:00
Eric Dumazet	213f5f8f31	ipv4: convert fib_num_tclassid_users to atomic_t Before commit `faa041a40b` ("ipv4: Create cleanup helper for fib_nh") changes to net->ipv4.fib_num_tclassid_users were protected by RTNL. After the change, this is no longer the case, as free_fib_info_rcu() runs after rcu grace period, without rtnl being held. Fixes: `faa041a40b` ("ipv4: Create cleanup helper for fib_nh") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 11:56:04 +00:00
Eric Dumazet	7a10d8c810	net: annotate data-races on txq->xmit_lock_owner syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner without annotations. No serious issue there, let's document what is happening there. BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0: __netif_tx_unlock include/linux/netdevice.h:4437 [inline] __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_hh_output include/net/neighbour.h:511 [inline] neigh_output include/net/neighbour.h:525 [inline] ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1: __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523 neigh_output include/net/neighbour.h:527 [inline] ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443 folio_test_anon include/linux/page-flags.h:581 [inline] PageAnon include/linux/page-flags.h:586 [inline] zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347 zap_pmd_range mm/memory.c:1467 [inline] zap_pud_range mm/memory.c:1496 [inline] zap_p4d_range mm/memory.c:1517 [inline] unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538 unmap_single_vma+0x157/0x210 mm/memory.c:1583 unmap_vmas+0xd0/0x180 mm/memory.c:1615 exit_mmap+0x23d/0x470 mm/mmap.c:3170 __mmput+0x27/0x1b0 kernel/fork.c:1113 mmput+0x3d/0x50 kernel/fork.c:1134 exit_mm+0xdb/0x170 kernel/exit.c:507 do_exit+0x608/0x17a0 kernel/exit.c:819 do_group_exit+0xce/0x180 kernel/exit.c:929 get_signal+0xfc3/0x1550 kernel/signal.c:2852 arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffffff Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-01 19:14:26 -08:00
Eric Dumazet	ce8299b6f7	Revert "net: snmp: add statistics for tcp small queue check" This reverts commit `aeeecb8891`. The new SNMP variable (TCPSmallQueueFailure) can be incremented for good reasons, even on a 100Gbit single TCP_STREAM flow. If we really wanted to ease driver debugging [1], this would require something more sophisticated. [1] Usually, if a driver is delaying TX completions too much, this can lead to stalls in TCP output. Various work arounds have been used in the past, like skb_orphan() in ndo_start_xmit(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Menglong Dong <imagedong@tencent.com> Link: https://lore.kernel.org/r/20211201033246.2826224-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-01 19:06:09 -08:00
Russell King (Oracle)	5938bce4b6	net: dsa: support use of phylink_generic_validate() Support the use of phylink_generic_validate() when there is no phylink_validate method given in the DSA switch operations and mac_capabilities have been set in the phylink_config structure by the DSA switch driver. This gives DSA switch drivers the option to use this if they provide the supported_interfaces and mac_capabilities, while still giving them an option to override the default implementation if necessary. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-01 18:58:00 -08:00
Russell King (Oracle)	072eea6c22	net: dsa: replace phylink_get_interfaces() with phylink_get_caps() Phylink needs slightly more information than phylink_get_interfaces() allows us to get from the DSA drivers - we need the MAC capabilities. Replace the phylink_get_interfaces() method with phylink_get_caps() to allow DSA drivers to fill in the phylink_config MAC capabilities field as well. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-01 18:58:00 -08:00
Russell King (Oracle)	21bd64bd71	net: dsa: consolidate phylink creation The code in port.c and slave.c creating the phylink instance is very similar - let's consolidate this into a single function. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-12-01 18:58:00 -08:00
Jean Sacren	ac1077e928	net: xfrm: drop check of pols[0] for the second time !pols[0] is checked earlier. If we don't return, pols[0] is always true. We should drop the check of pols[0] for the second time and the binary is also smaller. Before: text data bss dec hex filename 48395 957 240 49592 c1b8 net/xfrm/xfrm_policy.o After: text data bss dec hex filename 48379 957 240 49576 c1a8 net/xfrm/xfrm_policy.o Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-12-01 09:23:01 +01:00
Yang Yingliang	5cfe53cfeb	mctp: remove unnecessary check before calling kfree_skb() The skb will be checked inside kfree_skb(), so remove the outside check. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20211130031243.768823-1-yangyingliang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-30 18:09:23 -08:00
Harshit Mogalapalli	f123cffdd8	net: netlink: af_netlink: Prevent empty skb by adding a check on len. Adding a check on len parameter to avoid empty skb. This prevents a division error in netem_enqueue function which is caused when skb->len=0 and skb->data_len=0 in the randomized corruption step as shown below. skb->data[prandom_u32() % skb_headlen(skb)] ^= 1<<(prandom_u32() % 8); Crash Report: [ 343.170349] netdevsim netdevsim0 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0 [ 343.216110] netem: version 1.3 [ 343.235841] divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 343.236680] CPU: 3 PID: 4288 Comm: reproducer Not tainted 5.16.0-rc1+ [ 343.237569] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 343.238707] RIP: 0010:netem_enqueue+0x1590/0x33c0 [sch_netem] [ 343.239499] Code: 89 85 58 ff ff ff e8 5f 5d e9 d3 48 8b b5 48 ff ff ff 8b 8d 50 ff ff ff 8b 85 58 ff ff ff 48 8b bd 70 ff ff ff 31 d2 2b 4f 74 <f7> f1 48 b8 00 00 00 00 00 fc ff df 49 01 d5 4c 89 e9 48 c1 e9 03 [ 343.241883] RSP: 0018:ffff88800bcd7368 EFLAGS: 00010246 [ 343.242589] RAX: 00000000ba7c0a9c RBX: 0000000000000001 RCX: 0000000000000000 [ 343.243542] RDX: 0000000000000000 RSI: ffff88800f8edb10 RDI: ffff88800f8eda40 [ 343.244474] RBP: ffff88800bcd7458 R08: 0000000000000000 R09: ffffffff94fb8445 [ 343.245403] R10: ffffffff94fb8336 R11: ffffffff94fb8445 R12: 0000000000000000 [ 343.246355] R13: ffff88800a5a7000 R14: ffff88800a5b5800 R15: 0000000000000020 [ 343.247291] FS: 00007fdde2bd7700(0000) GS:ffff888109780000(0000) knlGS:0000000000000000 [ 343.248350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 343.249120] CR2: 00000000200000c0 CR3: 000000000ef4c000 CR4: 00000000000006e0 [ 343.250076] Call Trace: [ 343.250423] <TASK> [ 343.250713] ? memcpy+0x4d/0x60 [ 343.251162] ? netem_init+0xa0/0xa0 [sch_netem] [ 343.251795] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.252443] netem_enqueue+0xe28/0x33c0 [sch_netem] [ 343.253102] ? stack_trace_save+0x87/0xb0 [ 343.253655] ? filter_irq_stacks+0xb0/0xb0 [ 343.254220] ? netem_init+0xa0/0xa0 [sch_netem] [ 343.254837] ? __kasan_check_write+0x14/0x20 [ 343.255418] ? _raw_spin_lock+0x88/0xd6 [ 343.255953] dev_qdisc_enqueue+0x50/0x180 [ 343.256508] __dev_queue_xmit+0x1a7e/0x3090 [ 343.257083] ? netdev_core_pick_tx+0x300/0x300 [ 343.257690] ? check_kcov_mode+0x10/0x40 [ 343.258219] ? _raw_spin_unlock_irqrestore+0x29/0x40 [ 343.258899] ? __kasan_init_slab_obj+0x24/0x30 [ 343.259529] ? setup_object.isra.71+0x23/0x90 [ 343.260121] ? new_slab+0x26e/0x4b0 [ 343.260609] ? kasan_poison+0x3a/0x50 [ 343.261118] ? kasan_unpoison+0x28/0x50 [ 343.261637] ? __kasan_slab_alloc+0x71/0x90 [ 343.262214] ? memcpy+0x4d/0x60 [ 343.262674] ? write_comp_data+0x2f/0x90 [ 343.263209] ? __kasan_check_write+0x14/0x20 [ 343.263802] ? __skb_clone+0x5d6/0x840 [ 343.264329] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.264958] dev_queue_xmit+0x1c/0x20 [ 343.265470] netlink_deliver_tap+0x652/0x9c0 [ 343.266067] netlink_unicast+0x5a0/0x7f0 [ 343.266608] ? netlink_attachskb+0x860/0x860 [ 343.267183] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.267820] ? write_comp_data+0x2f/0x90 [ 343.268367] netlink_sendmsg+0x922/0xe80 [ 343.268899] ? netlink_unicast+0x7f0/0x7f0 [ 343.269472] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.270099] ? write_comp_data+0x2f/0x90 [ 343.270644] ? netlink_unicast+0x7f0/0x7f0 [ 343.271210] sock_sendmsg+0x155/0x190 [ 343.271721] ____sys_sendmsg+0x75f/0x8f0 [ 343.272262] ? kernel_sendmsg+0x60/0x60 [ 343.272788] ? write_comp_data+0x2f/0x90 [ 343.273332] ? write_comp_data+0x2f/0x90 [ 343.273869] ___sys_sendmsg+0x10f/0x190 [ 343.274405] ? sendmsg_copy_msghdr+0x80/0x80 [ 343.274984] ? slab_post_alloc_hook+0x70/0x230 [ 343.275597] ? futex_wait_setup+0x240/0x240 [ 343.276175] ? security_file_alloc+0x3e/0x170 [ 343.276779] ? write_comp_data+0x2f/0x90 [ 343.277313] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.277969] ? write_comp_data+0x2f/0x90 [ 343.278515] ? __fget_files+0x1ad/0x260 [ 343.279048] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.279685] ? write_comp_data+0x2f/0x90 [ 343.280234] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.280874] ? sockfd_lookup_light+0xd1/0x190 [ 343.281481] __sys_sendmsg+0x118/0x200 [ 343.281998] ? __sys_sendmsg_sock+0x40/0x40 [ 343.282578] ? alloc_fd+0x229/0x5e0 [ 343.283070] ? write_comp_data+0x2f/0x90 [ 343.283610] ? write_comp_data+0x2f/0x90 [ 343.284135] ? __sanitizer_cov_trace_pc+0x21/0x60 [ 343.284776] ? ktime_get_coarse_real_ts64+0xb8/0xf0 [ 343.285450] __x64_sys_sendmsg+0x7d/0xc0 [ 343.285981] ? syscall_enter_from_user_mode+0x4d/0x70 [ 343.286664] do_syscall_64+0x3a/0x80 [ 343.287158] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 343.287850] RIP: 0033:0x7fdde24cf289 [ 343.288344] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 db 2c 00 f7 d8 64 89 01 48 [ 343.290729] RSP: 002b:00007fdde2bd6d98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 343.291730] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fdde24cf289 [ 343.292673] RDX: 0000000000000000 RSI: 00000000200000c0 RDI: 0000000000000004 [ 343.293618] RBP: 00007fdde2bd6e20 R08: 0000000100000001 R09: 0000000000000000 [ 343.294557] R10: 0000000100000001 R11: 0000000000000246 R12: 0000000000000000 [ 343.295493] R13: 0000000000021000 R14: 0000000000000000 R15: 00007fdde2bd7700 [ 343.296432] </TASK> [ 343.296735] Modules linked in: sch_netem ip6_vti ip_vti ip_gre ipip sit ip_tunnel geneve macsec macvtap tap ipvlan macvlan 8021q garp mrp hsr wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic curve25519_x86_64 libcurve25519_generic libchacha xfrm_interface xfrm6_tunnel tunnel4 veth netdevsim psample batman_adv nlmon dummy team bonding tls vcan ip6_gre ip6_tunnel tunnel6 gre tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter ppdev bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec parport_pc drm joydev floppy parport sg syscopyarea sysfillrect sysimgblt i2c_piix4 qemu_fw_cfg fb_sys_fops pcspkr [ 343.297459] ip_tables xfs virtio_net net_failover failover sd_mod sr_mod cdrom t10_pi ata_generic pata_acpi ata_piix libata virtio_pci virtio_pci_legacy_dev serio_raw virtio_pci_modern_dev dm_mirror dm_region_hash dm_log dm_mod [ 343.311074] Dumping ftrace buffer: [ 343.311532] (ftrace buffer empty) [ 343.312040] ---[ end trace a2e3db5a6ae05099 ]--- [ 343.312691] RIP: 0010:netem_enqueue+0x1590/0x33c0 [sch_netem] [ 343.313481] Code: 89 85 58 ff ff ff e8 5f 5d e9 d3 48 8b b5 48 ff ff ff 8b 8d 50 ff ff ff 8b 85 58 ff ff ff 48 8b bd 70 ff ff ff 31 d2 2b 4f 74 <f7> f1 48 b8 00 00 00 00 00 fc ff df 49 01 d5 4c 89 e9 48 c1 e9 03 [ 343.315893] RSP: 0018:ffff88800bcd7368 EFLAGS: 00010246 [ 343.316622] RAX: 00000000ba7c0a9c RBX: 0000000000000001 RCX: 0000000000000000 [ 343.317585] RDX: 0000000000000000 RSI: ffff88800f8edb10 RDI: ffff88800f8eda40 [ 343.318549] RBP: ffff88800bcd7458 R08: 0000000000000000 R09: ffffffff94fb8445 [ 343.319503] R10: ffffffff94fb8336 R11: ffffffff94fb8445 R12: 0000000000000000 [ 343.320455] R13: ffff88800a5a7000 R14: ffff88800a5b5800 R15: 0000000000000020 [ 343.321414] FS: 00007fdde2bd7700(0000) GS:ffff888109780000(0000) knlGS:0000000000000000 [ 343.322489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 343.323283] CR2: 00000000200000c0 CR3: 000000000ef4c000 CR4: 00000000000006e0 [ 343.324264] Kernel panic - not syncing: Fatal exception in interrupt [ 343.333717] Dumping ftrace buffer: [ 343.334175] (ftrace buffer empty) [ 343.334653] Kernel Offset: 0x13600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 343.336027] Rebooting in 86400 seconds.. Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Link: https://lore.kernel.org/r/20211129175328.55339-1-harshit.m.mogalapalli@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-30 17:45:01 -08:00
Florian Westphal	28b78ecffe	netfilter: bridge: add support for pppoe filtering This makes 'bridge-nf-filter-pppoe-tagged' sysctl work for bridged traffic. Looking at the original commit it doesn't appear this ever worked: static unsigned int br_nf_post_routing(unsigned int hook, struct sk_buff **pskb, [..] if (skb->protocol == htons(ETH_P_8021Q)) { skb_pull(skb, VLAN_HLEN); skb->network_header += VLAN_HLEN; + } else if (skb->protocol == htons(ETH_P_PPP_SES)) { + skb_pull(skb, PPPOE_SES_HLEN); + skb->network_header += PPPOE_SES_HLEN; } [..] NF_HOOK(... POST_ROUTING, ...) ... but the adjusted offsets are never restored. The alternative would be to rip this code out for good, but otoh we'd have to keep this anyway for the vlan handling (which works because vlan tag info is in the skb, not the packet payload). Reported-and-tested-by: Amish Chana <amish@3g.co.za> Fixes: `516299d2f5` ("[NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 23:37:55 +01:00
Pablo Neira Ayuso	f87b9464d1	netfilter: nft_fwd_netdev: Support egress hook Allow packet redirection to another interface upon egress. [lukas: set skb_iif, add commit message, original patch from Pablo. ] Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 23:37:16 +01:00
Florian Westphal	b43c2793f5	netfilter: nfnetlink_queue: silence bogus compiler warning net/netfilter/nfnetlink_queue.c:601:36: warning: variable 'ctinfo' is uninitialized when used here [-Wuninitialized] if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0) ctinfo is only uninitialized if ct == NULL. Init it to 0 to silence this. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 23:15:11 +01:00
Bernard Zhao	632cb151ca	netfilter: ctnetlink: remove useless type conversion to bool dying is bool, the type conversion to true/false value is not needed. Signed-off-by: Bernard Zhao <bernard@vivo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 23:11:03 +01:00
Florian Westphal	c5fc837bf9	netfilter: nf_queue: remove leftover synchronize_rcu Its no longer needed after commit `8702997074` ("netfilter: nf_queue: move hookfn registration out of struct net"). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 22:50:43 +01:00
Kees Cook	4be1dbb75c	netfilter: conntrack: Use memset_startat() to zero struct nf_conn In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Use memset_startat() to avoid confusing memset() about writing beyond the target struct member. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 22:49:29 +01:00
GuoYong Zheng	fc5e0352cc	ipvs: remove unused variable for ip_vs_new_dest The dest variable is not used after ip_vs_new_dest anymore in ip_vs_add_dest, do not need pass it to ip_vs_new_dest, remove it. Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-11-30 22:46:08 +01:00
Christoph Hellwig	06edc59c1f	bpf, docs: Prune all references to "internal BPF" The eBPF name has completely taken over from eBPF in general usage for the actual eBPF representation, or BPF for any general in-kernel use. Prune all remaining references to "internal BPF". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de	2021-11-30 10:52:11 -08:00
Leon Romanovsky	4c897cfc46	devlink: Simplify devlink resources unregister call The devlink_resources_unregister() used second parameter as an entry point for the recursive removal of devlink resources. None of the callers outside of devlink core needed to use this field, so let's remove it. As part of this removal, the "struct devlink_resource" was moved from .h to .c file as it is not possible to use in any place in the code except devlink.c. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-30 12:23:32 +00:00
Nikolay Aleksandrov	6130805066	net: ipv6: use the new fib6_nh_release_dsts helper in fib6_nh_release We can remove a bit of code duplication by reusing the new fib6_nh_release_dsts helper in fib6_nh_release. Their only difference is that fib6_nh_release's version doesn't use atomic operation to swap the pointers because it assumes the fib6_nh is no longer visible, while fib6_nh_release_dsts can be used anywhere. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-30 11:59:53 +00:00
Nikolay Aleksandrov	7709efa62c	net: nexthop: reduce rcu synchronizations when replacing resilient groups We can optimize resilient nexthop group replaces by reducing the number of synchronize_net calls. After commit `1005f19b93` ("net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group") we always do a synchronize_net because we must ensure no new dsts can be created for the replaced group's removed nexthops, but we already did that when replacing resilient groups, so if we always call synchronize_net after any group type replacement we'll take care of both cases and reduce synchronize_net calls for resilient groups. Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-30 11:59:18 +00:00
Tianjia Zhang	dc2724a64e	net/tls: simplify the tls_set_sw_offload function Assigning crypto_info variables in advance can simplify the logic of accessing value and move related local variables to a smaller scope. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-30 11:58:34 +00:00
Christophe JAILLET	72a2ff567f	ethtool: netlink: Slightly simplify 'ethnl_features_to_bitmap()' The 'dest' bitmap is fully initialized by the 'for' loop, so there is no need to explicitly reset it. This also makes this function in line with 'ethnl_features_to_bitmap32()' which does not clear the destination before writing it. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/17fca158231c6f03689bd891254f0dd1f4e84cb8.1638091829.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 20:16:46 -08:00
Jakub Kicinski	5fdc2333e6	AF_RXRPC fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmGk9MIACgkQ+7dXa6fL C2sFUw/9EulPVJSzUywa6Jf13BqWCqciuMoo+NrNlTFY7P0CnurloqOl+6hJ+hYI cKr4JaClzOR/LT8DaJHYp+uyB0cn5U/udEDr2iekxvh16B/GDDRPWDDIvm9mzczw jVqzZHH54Jl6s+fnfkIhkLZA6rTuH4kFtsEb1S0VBXIJC1zv4O79Ba15nxkGPQoQ GfJzZqLTUOMmWPoiTuKTJmPhvCPHEzIGduKzU2hRgjrDJdk695BGdwFK1cs6/BcA n0aEVnYmndnaJ3MyLjxzROi1D6MFfHs8+a4vKTD13kcWQRT8kPE51/KKhMvUELHn HJsVXqFeip2FyQFqRlmNfRgfvhC+SWV9kceP0G6pZIoLAXlEMGkmhmB+BWZtnY6U yXaOJlwchNEP83f2KWUBkX4dV9Zx8mdy5UUQBTSQbMwFxSgZ5FfOyCkf98xx03DS LTqV9ZciJ+Z6QbDscYIrnJV2WvWTJeyUB5lsI8W3lAXXugu28WHL+lorhEKIIJWG jdmxO4sEa29tdE7FVngQ3RLcNJETcJCKpMypNaAhWOCPlfnW7t0FpkcAoza7t1yy DGQaGkR7zC1sZ5hr31YvWBhaZT8IjlAePOg/1GfpUnyL9122jsAqR9dz0xgPGnf6 fK0BLMcfnuYObtJdSEOAMqnx2qkRKTU9qrT2vXNWd/bApFTY4E0= =q6GF -----END PGP SIGNATURE----- Merge tag 'rxrpc-fixes-20211129' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Leak fixes Here are a couple of fixes for leaks in AF_RXRPC: (1) Fix a leak of rxrpc_peer structs in rxrpc_look_up_bundle(). (2) Fix a leak of rxrpc_local structs in rxrpc_lookup_peer(). * tag 'rxrpc-fixes-20211129' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: rxrpc: Fix rxrpc_local leak in rxrpc_lookup_peer() rxrpc: Fix rxrpc_peer leak in rxrpc_look_up_bundle() ==================== Link: https://lore.kernel.org/r/163820097905.226370.17234085194655347888.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 20:04:11 -08:00
Jason A. Donenfeld	20ae1d6aa1	wireguard: device: reset peer src endpoint when netns exits Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Paolo Abeni <pabeni@redhat.com> Fixes: `900575aa33` ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 19:50:45 -08:00
Eiichi Tsukata	beacff50ed	rxrpc: Fix rxrpc_local leak in rxrpc_lookup_peer() Need to call rxrpc_put_local() for peer candidate before kfree() as it holds a ref to rxrpc_local. [DH: v2: Changed to abstract the peer freeing code out into a function] Fixes: `9ebeddef58` ("rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record") Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/all/20211121041608.133740-2-eiichi.tsukata@nutanix.com/ # v1	2021-11-29 15:40:02 +00:00
Eiichi Tsukata	ca77fba821	rxrpc: Fix rxrpc_peer leak in rxrpc_look_up_bundle() Need to call rxrpc_put_peer() for bundle candidate before kfree() as it holds a ref to rxrpc_peer. [DH: v2: Changed to abstract out the bundle freeing code into a function] Fixes: `245500d853` ("rxrpc: Rewrite the client connection manager") Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/20211121041608.133740-1-eiichi.tsukata@nutanix.com/ # v1	2021-11-29 15:39:40 +00:00
msizanoen1	cdef485217	ipv6: fix memory leak in fib6_rule_suppress The kernel leaks memory when a `fib` rule is present in IPv6 nftables firewall rules and a suppress_prefix rule is present in the IPv6 routing rules (used by certain tools such as wg-quick). In such scenarios, every incoming packet will leak an allocation in `ip6_dst_cache` slab cache. After some hours of `bpftrace`-ing and source code reading, I tracked down the issue to `ca7a03c417` ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule"). The problem with that change is that the generic `args->flags` always have `FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag `RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not decreasing the refcount when needed. How to reproduce: - Add the following nftables rule to a prerouting chain: meta nfproto ipv6 fib saddr . mark . iif oif missing drop This can be done with: sudo nft create table inet test sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }' sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop - Run: sudo ip -6 rule add table main suppress_prefixlength 0 - Watch `sudo slabtop -o \| grep ip6_dst_cache` to see memory usage increase with every incoming ipv6 packet. This patch exposes the protocol-specific flags to the protocol specific `suppress` function, and check the protocol-specific `flags` argument for RT6_LOOKUP_F_DST_NOREF instead of the generic FIB_LOOKUP_NOREF when decreasing the refcount, like this. [1]: `ca7a03c417/net/ipv6/fib6_rules.c (L71)` [2]: `ca7a03c417/net/ipv6/fib6_rules.c (L99)` Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105 Fixes: `ca7a03c417` ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 14:43:35 +00:00
Menglong Dong	aeeecb8891	net: snmp: add statistics for tcp small queue check Once tcp small queue check failed in tcp_small_queue_check(), the throughput of tcp will be limited, and it's hard to distinguish whether it is out of tcp congestion control. Add statistics of LINUX_MIB_TCPSMALLQUEUEFAILURE for this scene. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 13:05:47 +00:00
Leon Romanovsky	e9538f8270	devlink: Remove misleading internal_flags from health reporter dump DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET command doesn't have .doit callback and has no use in internal_flags at all. Remove this misleading assignment. Fixes: `e44ef4e451` ("devlink: Hang reporter's dump method on a dumpit cb") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 13:04:26 +00:00
Jeremy Kerr	d851956544	mctp: test: fix skb free in test device tx In our test device, we're currently freeing skbs in the transmit path with kfree(), rather than kfree_skb(). This change uses the correct kfree_skb() instead. Fixes: `ded21b7229` ("mctp: Add test utils") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 13:00:09 +00:00
Tianjia Zhang	5961060692	net/tls: Fix authentication failure in CCM mode When the TLS cipher suite uses CCM mode, including AES CCM and SM4 CCM, the first byte of the B0 block is flags, and the real IV starts from the second byte. The XOR operation of the IV and rec_seq should be skip this byte, that is, add the iv_offset. Fixes: `f295b3ae9f` ("net/tls: Add support of AES128-CCM based ciphers") Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Cc: Vakul Garg <vakul.garg@nxp.com> Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:48:28 +00:00
Benjamin Poirier	f05b0b9733	net: mpls: Make for_nexthops iterator const There are separate for_nexthops and change_nexthops iterators. The for_nexthops variant should use const. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:46:52 +00:00
Benjamin Poirier	69d9c0d077	net: mpls: Remove duplicate variable from iterator macro __nh is just a copy of nh with a different type. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:46:52 +00:00
Benjamin Poirier	189168181b	net: mpls: Remove rcu protection from nh_dev Following the previous commit, nh_dev can no longer be accessed and modified concurrently. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:39:42 +00:00
Benjamin Poirier	7d4741eacd	net: mpls: Fix notifications when deleting a device There are various problems related to netlink notifications for mpls route changes in response to interfaces being deleted: * delete interface of only nexthop DELROUTE notification is missing RTA_OIF attribute * delete interface of non-last nexthop NEWROUTE notification is missing entirely * delete interface of last nexthop DELROUTE notification is missing nexthop All of these problems stem from the fact that existing routes are modified in-place before sending a notification. Restructure mpls_ifdown() to avoid changing the route in the DELROUTE cases and to create a copy in the NEWROUTE case. Fixes: `f8efb73c97` ("mpls: multipath route support") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:39:42 +00:00
Sebastian Andrzej Siewior	fd888e85fe	net: Write lock dev_base_lock without disabling bottom halves. The writer acquires dev_base_lock with disabled bottom halves. The reader can acquire dev_base_lock without disabling bottom halves because there is no writer in softirq context. On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources, that are protected by disabling bottom halves, remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves (as in write_lock_bh()) and somewhere else with enabled bottom halves (as by read_lock() in netstat_show()) followed by disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout() -> spin_lock_bh()). This is the reverse locking order. All read_lock() invocation are from sysfs callback which are not invoked from softirq context. Therefore there is no need to disable bottom halves while acquiring a write lock. Acquire the write lock of dev_base_lock without disabling bottom halves. Reported-by: Pei Zhang <pezhang@redhat.com> Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:12:36 +00:00
Tom Parkin	07b8ca3792	net/l2tp: convert tunnel rwlock_t to rcu Previously commit `e02d494d2c` ("l2tp: Convert rwlock to RCU") converted most, but not all, rwlock instances in the l2tp subsystem to RCU. The remaining rwlock protects the per-tunnel hashlist of sessions which is used for session lookups in the UDP-encap data path. Convert the remaining rwlock to rcu to improve performance of UDP-encap tunnels. Note that the tunnel and session, which both live on RCU-protected lists, use slightly different approaches to incrementing their refcounts in the various getter functions. The tunnel has to use refcount_inc_not_zero because the tunnel shutdown process involves dropping the refcount to zero prior to synchronizing RCU readers (via. kfree_rcu). By contrast, the session shutdown removes the session from the list(s) it is on, synchronizes with readers, and then decrements the session refcount. Since the getter functions increment the session refcount with the RCU read lock held we prevent getters seeing a zero session refcount, and therefore don't need to use refcount_inc_not_zero. Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:11:25 +00:00
Finn Behrens	1eda919126	nl80211: reset regdom when reloading regdb Reload the regdom when the regulatory db is reloaded. Otherwise, the user had to change the regulatoy domain to a different one and then reset it to the correct one to have a new regulatory db take effect after a reload. Signed-off-by: Finn Behrens <fin@nyantec.com> Link: https://lore.kernel.org/r/YaIIZfxHgqc/UTA7@gimli.kloenk.dev [edit commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-29 09:31:56 +01:00
Johannes Berg	af9d3a2984	mac80211: add docs for ssn in struct tid_ampdu_tx As pointed out by Stephen, add the missing docs. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/r/20211129091948.1327ec82beab.Iecc5975406a3028d35c65ff8d2dec31a693888d3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-29 09:31:17 +01:00
Miri Korenblit	75c5bd68b6	ieee80211: change HE nominal packet padding value defines It's easier to use and understand, and to extend for EHT later, if we use the values here instead of the shifted values. Unfortunately, we need to add _POS so that we can use it in places like iwlwifi/mvm where constants are needed. While at it, fix the typo ("NOMIMAL") which also helps catch any conflicts. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://lore.kernel.org/r/20211126104817.7c29a05b8eb5.I2ca9faf06e177e3035bec91e2ae53c2f91d41774@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-28 21:53:04 +01:00
Johannes Berg	fb8b53acf6	cfg80211: use ieee80211_bss_get_elem() instead of _get_ie() Use the structured helper for finding an element instead of the unstructured ieee80211_bss_get_ie(). Link: https://lore.kernel.org/r/20210930131130.e94709f341c3.I4ddb7fcb40efca27987deda7f9a144a5702ebfae@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-28 21:53:04 +01:00
Linus Torvalds	d06c942efe	vhost,virtio,vdpa: bugfixes Misc fixes all over the place. Revert of virtio used length validation series: the approach taken does not seem to work, breaking too many guests in the process. We'll need to do length validation using some other approach. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmGe0sEPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRp8WEH/imDIq1iduDeAuvFnmrm5eEO9w3wzXCT4NiG 8Pla241FzQ1pEFEAne16KP0+SlLhj7P0oc5FR8vkYvxxuyneDbCzcS2M1kYMOpA1 ry28PuObAnekzE/WXxvC031ozB5Zb/FL54gmw+/1EdAOdMGL0CdQ1aJxREBHRTBo p4ZHr83GA2D2C/IyKCsgQ8cB9ZrMqImTQQ4vRD89HoFBp+GH2u2Di1iyXEWuOqdI n1+7M9jjbyW8A+N1bkOicpShS/6UcyJQOOcg8kvUQOV6srVkYhfaiWC/CbOP2g73 8PKK+/K2Htf92s6RdvDUPSKmvqGR/4KPZWPtWThXBYXGgWul0uI= =q6tO -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull vhost,virtio,vdpa bugfixes from Michael Tsirkin: "Misc fixes all over the place. Revert of virtio used length validation series: the approach taken does not seem to work, breaking too many guests in the process. We'll need to do length validation using some other approach" [ This merge also ends up reverting commit `f7a36b03a7` ("vsock/virtio: suppress used length validation"), which came in through the networking tree in the meantime, and was part of that whole used length validation series - Linus ] * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa_sim: avoid putting an uninitialized iova_domain vhost-vdpa: clean irqs before reseting vdpa device virtio-blk: modify the value type of num in virtio_queue_rq() vhost/vsock: cleanup removing `len` variable vhost/vsock: fix incorrect used length reported to the guest Revert "virtio_ring: validate used buffer length" Revert "virtio-net: don't let virtio core to validate used length" Revert "virtio-blk: don't let virtio core to validate used length" Revert "virtio-scsi: don't let virtio core to validate used buffer length"	2021-11-28 11:58:52 -08:00
Linus Torvalds	7413927713	NFS client bugfixes for Linux 5.16 Highlights include: Stable fixes: - NFSv42: Fix pagecache invalidation after COPY/CLONE Bugfixes: - NFSv42: Don't fail clone() just because the server failed to return post-op attributes - SUNRPC: use different lockdep keys for INET6 and LOCAL - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION - SUNRPC: fix header include guard in trace header -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGiTnMACgkQZwvnipYK APJoJQ//VZYSCx/mGaTIj5oUwjBKE/n/9rz2EUGS1cfYjZjPpb5Xgm1tn+1A4c01 Ztu9/hKgwrDqknqkmtKvP1GsX5vYUqgfqAlc880Q2nXAqaLJBBZgB6BFMmTtcoQx C24L0tgxlZVD9Vw0DJEVDVgDxXA/9VmdSQK6uptQRQhcYf4VtR1wAzELHWdkdkfq 1WrREeAwGWw1BaPTrlPn9XwW9qTaMlBH05XRHh6dM7gFmoIe3td7kq7BOnFxFsnA AQZ/nCgeMTE04kQQMzYqUc4YqZvnzUHxueZ6q8s0K1RKJBpNIQNgdWUMa285Qo4a JA9oBCPPo0JjmsEge2Km12zyBJoA7lLQDfc6UQJON50ADF0sTu3wszgGuC63KkhE V+kUogK2WOlnGky2yYrHmv43mcCcyoJ/g+g+38GXNYGorsFi/XUhvctEpFFFPF71 0umQwWhA6Dhc52hMj5DN4nfspp//hEuV9o7/zJzlPi0elC+xtVaBWZCorNvznlXW C/O5yobJVd89PuIE17Stg+c0Rq3k7RVPPoyS2IaMkZ5cs7DtT5Tz3nKClPAA8Bur mPLAMkHSOSLO26cy30SVZCIx1JDMJU9dx/PkFemCexkzYXQxgp9px/8wmM2Xe6Oc /hDqi8V7ayJUuYpYuJ6sA8oUqj3j+NkoP4w5HhlnmZDBhFMEBhM= =jeHm -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client fixes from Trond Myklebust: "Highlights include: Stable fixes: - NFSv42: Fix pagecache invalidation after COPY/CLONE Bugfixes: - NFSv42: Don't fail clone() just because the server failed to return post-op attributes - SUNRPC: use different lockdep keys for INET6 and LOCAL - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION - SUNRPC: fix header include guard in trace header" * tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: use different lock keys for INET6 and LOCAL sunrpc: fix header include guard in trace header NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION NFSv42: Fix pagecache invalidation after COPY/CLONE NFS: Add a tracepoint to show the results of nfs_set_cache_invalid() NFSv42: Don't fail clone() unless the OP_CLONE operation failed	2021-11-27 10:33:55 -08:00
Kuniyuki Iwashima	9acbc584c3	af_unix: Relax race in unix_autobind(). When we bind an AF_UNIX socket without a name specified, the kernel selects an available one from 0x00000 to 0xFFFFF. unix_autobind() starts searching from a number in the 'static' variable and increments it after acquiring two locks. If multiple processes try autobind, they obtain the same lock and check if a socket in the hash list has the same name. If not, one process uses it, and all except one end up retrying the _next_ number (actually not, it may be incremented by the other processes). The more we autobind sockets in parallel, the longer the latency gets. We can avoid such a race by searching for a name from a random number. These show latency in unix_autobind() while 64 CPUs are simultaneously autobind-ing 1024 sockets for each. Without this patch: usec : count distribution 0 : 1176 \|* \| 2 : 3655 \|******* \| 4 : 4094 \|********* \| 6 : 3831 \|******** \| 8 : 3829 \|******** \| 10 : 3844 \|******** \| 12 : 3638 \|******* \| 14 : 2992 \|***** \| 16 : 2485 \|*** \| 18 : 2230 \|*** \| 20 : 2095 \|** \| 22 : 1853 \|* \| 24 : 1827 \|* \| 26 : 1677 \|* \| 28 : 1473 \| \| 30 : 1573 \|* \| 32 : 1417 \| \| 34 : 1385 \| \| 36 : 1345 \| \| 38 : 1344 \| \| 40 : 1200 \|* \| With this patch: usec : count distribution 0 : 1855 \|**** \| 2 : 6464 \|***************** \| 4 : 9936 \|**************************** \| 6 : 12107 \|************************************\| 8 : 10441 \|****************************** \| 10 : 7264 \|******************* \| 12 : 4254 \|********** \| 14 : 2538 \|**** \| 16 : 1596 \|* \| 18 : 1088 \|* \| 20 : 800 \| \| 22 : 670 \| \| 24 : 601 \|* \| 26 : 562 \|* \| 28 : 525 \|* \| 30 : 446 \|* \| 32 : 378 \|* \| 34 : 337 \|* \| 36 : 317 \|* \| 38 : 314 \|* \| 40 : 298 \| \| Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:58 -08:00
Kuniyuki Iwashima	afd20b9290	af_unix: Replace the big lock with small locks. The hash table of AF_UNIX sockets is protected by the single lock. This patch replaces it with per-hash locks. The effect is noticeable when we handle multiple sockets simultaneously. Here is a test result on an EC2 c5.24xlarge instance. It shows latency (under 10us only) in unix_insert_unbound_socket() while 64 CPUs creating 1024 sockets for each in parallel. Without this patch: nsec : count distribution 0 : 179 \| \| 500 : 3021 \|******* \| 1000 : 6271 \|*************** \| 1500 : 6318 \|*************** \| 2000 : 5828 \|************* \| 2500 : 5124 \|*********** \| 3000 : 4426 \|********* \| 3500 : 3672 \|******* \| 4000 : 3138 \|***** \| 4500 : 2811 \|**** \| 5000 : 2384 \|*** \| 5500 : 2023 \|** \| 6000 : 1954 \|* \| 6500 : 1737 \|* \| 7000 : 1749 \|* \| 7500 : 1520 \| \| 8000 : 1469 \| \| 8500 : 1394 \| \| 9000 : 1232 \|* \| 9500 : 1138 \|* \| 10000 : 994 \|* \| With this patch: nsec : count distribution 0 : 1634 \|** \| 500 : 13170 \|************************************\| 1000 : 13156 \|*********************************** \| 1500 : 9010 \|*********************** \| 2000 : 6363 \|*************** \| 2500 : 4443 \|********* \| 3000 : 3240 \|***** \| 3500 : 2549 \|*** \| 4000 : 1872 \|* \| 4500 : 1504 \| \| 5000 : 1247 \|* \| 5500 : 1035 \|* \| 6000 : 889 \| \| 6500 : 744 \|** \| 7000 : 634 \|* \| 7500 : 498 \|* \| 8000 : 433 \|* \| 8500 : 355 \|* \| 9000 : 336 \|* \| 9500 : 284 \| \| 10000 : 243 \| \| Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:58 -08:00
Kuniyuki Iwashima	e6b4b87389	af_unix: Save hash in sk_hash. To replace unix_table_lock with per-hash locks in the next patch, we need to save a hash in each socket because /proc/net/unix or BPF prog iterate sockets while holding a hash table lock and release it later in a different function. Currently, we store a real/pseudo hash in struct unix_address. However, we do not allocate it to unbound sockets, nor should we do just for that. For this purpose, we can use sk_hash. Then, we no longer use the hash field in struct unix_address and can remove it. Also, this patch does - rename unix_insert_socket() to unix_insert_unbound_socket() - remove the redundant list argument from __unix_insert_socket() and unix_insert_unbound_socket() - use 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash() - remove 'inline' from unix_remove_socket() and unix_insert_unbound_socket(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:57 -08:00
Kuniyuki Iwashima	f452be496a	af_unix: Add helpers to calculate hashes. This patch adds three helper functions that calculate hashes for unbound sockets and bound sockets with BSD/abstract addresses. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:57 -08:00
Kuniyuki Iwashima	5ce7ab4961	af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead. In BSD and abstract address cases, we store sockets in the hash table with keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket varies depending on its address type; sockets with BSD addresses always have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash. This is just for the UNIX_ABSTRACT() macro used to check the address type. The difference of the saved hashes comes from the first byte of the address in the first place. So, we can test it directly. Then we can keep a real hash in each socket and replace unix_table_lock with per-hash locks in the later patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:56 -08:00
Kuniyuki Iwashima	12f21c49ad	af_unix: Allocate unix_address in unix_bind_(bsd\|abstract)(). To terminate address with '\0' in unix_bind_bsd(), we add unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract(). Also, unix_bind_abstract() does not return -EEXIST. Only kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it, so we move the last error check in unix_bind() to unix_bind_bsd(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:56 -08:00
Kuniyuki Iwashima	5c32a3ed64	af_unix: Remove unix_mkname(). This patch removes unix_mkname() and postpones calculating a hash to unix_bind_abstract(). Some BSD stuffs still remain in unix_bind() though, the next patch packs them into unix_bind_bsd(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:55 -08:00
Kuniyuki Iwashima	d2d8c9fddb	af_unix: Copy unix_mkname() into unix_find_(bsd\|abstract)(). We should not call unix_mkname() before unix_find_other() and instead do the same thing where necessary based on the address type: - terminating the address with '\0' in unix_find_bsd() - calculating the hash in unix_find_abstract(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:55 -08:00
Kuniyuki Iwashima	b8a58aa6fc	af_unix: Cut unix_validate_addr() out of unix_mkname(). unix_mkname() tests socket address length and family and does some processing based on the address type. It is called in the early stage, and therefore some instructions are redundant and can end up in vain. The address length/family tests are done twice in unix_bind(). Also, the address type is rechecked later in unix_bind() and unix_find_other(), where we can do the same processing. Moreover, in the BSD address case, the hash is set to 0 but never used and confusing. This patch moves the address tests out of unix_mkname(), and the following patches move the other part into appropriate places and remove unix_mkname() finally. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:55 -08:00
Kuniyuki Iwashima	aed26f557b	af_unix: Return an error as a pointer in unix_find_other(). We can return an error as a pointer and need not pass an additional argument to unix_find_other(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:54 -08:00
Kuniyuki Iwashima	fa39ef0e47	af_unix: Factorise unix_find_other() based on address types. As done in the commit `fa42d910a3` ("unix_bind(): take BSD and abstract address cases into new helpers"), this patch moves BSD and abstract address cases from unix_find_other() into unix_find_bsd() and unix_find_abstract(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:54 -08:00
Kuniyuki Iwashima	f7ed31f461	af_unix: Pass struct sock to unix_autobind(). We do not use struct socket in unix_autobind() and pass struct sock to unix_bind_bsd() and unix_bind_abstract(). Let's pass it to unix_autobind() as well. Also, this patch fixes these errors by checkpatch.pl. ERROR: do not use assignment in if condition #1795: FILE: net/unix/af_unix.c:1795: + if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr CHECK: Logical continuations should be on the previous line #1796: FILE: net/unix/af_unix.c:1796: + if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr + && (err = unix_autobind(sock)) != 0) Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:53 -08:00
Kuniyuki Iwashima	755662ce78	af_unix: Use offsetof() instead of sizeof(). The length of the AF_UNIX socket address contains an offset to the member sun_path of struct sockaddr_un. Currently, the preceding member is just sun_family, and its type is sa_family_t and resolved to short. Therefore, the offset is represented by sizeof(short). However, it is not clear and fragile to changes in struct sockaddr_storage or sockaddr_un. This commit makes it clear and robust by rewriting sizeof() with offsetof(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:53 -08:00
Xin Long	442b03c32c	bridge: use __set_bit in __br_vlan_set_default_pvid The same optimization as the one in commit `cc0be1ad68` ("net: bridge: Slightly optimize 'find_portno()'") is needed for the 'changed' bitmap in __br_vlan_set_default_pvid(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/4e35f415226765e79c2a11d2c96fbf3061c486e2.1637782773.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 16:47:56 -08:00
Tonghao Zhang	bde3b0fd80	net: ethtool: set a default driver name The netdev (e.g. ifb, bareudp), which not support ethtool ops (e.g. .get_drvinfo), we can use the rtnl kind as a default name. ifb netdev may be created by others prefix, not ifbX. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Hao Chen <chenhao288@hisilicon.com> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Danielle Ratson <danieller@nvidia.com> Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20211125163049.84970-1-xiangxia.m.yue@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 16:44:53 -08:00
Jakub Kicinski	93d5404e89	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ipa/ipa_main.c `8afc7e471a` ("net: ipa: separate disabling setup from modem stop") `76b5fbcd6b` ("net: ipa: kill ipa_modem_init()") Duplicated include, drop one. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 13:45:19 -08:00
Tony Lu	bacb6c1e47	net/smc: Don't call clcsock shutdown twice when smc shutdown When applications call shutdown() with SHUT_RDWR in userspace, smc_close_active() calls kernel_sock_shutdown(), and it is called twice in smc_shutdown(). This fixes this by checking sk_state before do clcsock shutdown, and avoids missing the application's call of smc_shutdown(). Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/ Fixes: `606a63c978` ("net/smc: Ensure the active closing peer first closes clcsock") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/20211126024134.45693-1-tonylu@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 11:23:35 -08:00
Ziyang Xuan	01d9cc2dea	net: vlan: fix underflow for the real_dev refcnt Inject error before dev_hold(real_dev) in register_vlan_dev(), and execute the following testcase: ip link add dev dummy1 type dummy ip link add name dummy1.100 link dummy1 type vlan id 100 ip link del dev dummy1 When the dummy netdevice is removed, we will get a WARNING as following: ======================================================================= refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 2 PID: 0 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 and an endless loop of: ======================================================================= unregister_netdevice: waiting for dummy1 to become free. Usage count = -1073741824 That is because dev_put(real_dev) in vlan_dev_free() be called without dev_hold(real_dev) in register_vlan_dev(). It makes the refcnt of real_dev underflow. Move the dev_hold(real_dev) to vlan_dev_init() which is the call-back of ndo_init(). That makes dev_hold() and dev_put() for vlan's real_dev symmetrical. Fixes: `563bcbae3b` ("net: vlan: fix a UAF in vlan_dev_real_dev()") Reported-by: Petr Machata <petrm@nvidia.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/r/20211126015942.2918542-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 11:20:46 -08:00
Julian Wiedmann	0276af2176	ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce() ethtool_set_coalesce() now uses both the .get_coalesce() and .set_coalesce() callbacks. But the check for their availability is buggy, so changing the coalesce settings on a device where the driver provides only _one_ of the callbacks results in a NULL pointer dereference instead of an -EOPNOTSUPP. Fix the condition so that the availability of both callbacks is ensured. This also matches the netlink code. Note that reproducing this requires some effort - it only affects the legacy ioctl path, and needs a specific combination of driver options: - have .get_coalesce() and .coalesce_supported but no .set_coalesce(), or - have .set_coalesce() but no .get_coalesce(). Here eg. ethtool doesn't cause the crash as it first attempts to call ethtool_get_coalesce() and bails out on error. Fixes: `f3ccfda193` ("ethtool: extend coalesce setting uAPI with CQE mode") Cc: Yufeng Mo <moyufeng@huawei.com> Cc: Huazhong Tan <tanhuazhong@huawei.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Link: https://lore.kernel.org/r/20211126175543.28000-1-jwi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 11:17:47 -08:00
Davide Caratti	de6d25924c	net/sched: sch_ets: don't peek at classes beyond 'nbands' when the number of DRR classes decreases, the round-robin active list can contain elements that have already been freed in ets_qdisc_change(). As a consequence, it's possible to see a NULL dereference crash, caused by the attempt to call cl->qdisc->ops->peek(cl->qdisc) when cl->qdisc is NULL: BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 910 Comm: mausezahn Not tainted 5.16.0-rc1+ #475 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: 0010:ets_qdisc_dequeue+0x129/0x2c0 [sch_ets] Code: c5 01 41 39 ad e4 02 00 00 0f 87 18 ff ff ff 49 8b 85 c0 02 00 00 49 39 c4 0f 84 ba 00 00 00 49 8b ad c0 02 00 00 48 8b 7d 10 <48> 8b 47 18 48 8b 40 38 0f ae e8 ff d0 48 89 c3 48 85 c0 0f 84 9d RSP: 0000:ffffbb36c0b5fdd8 EFLAGS: 00010287 RAX: ffff956678efed30 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000002 RSI: ffffffff9b938dc9 RDI: 0000000000000000 RBP: ffff956678efed30 R08: e2f3207fe360129c R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff956678efeac0 R13: ffff956678efe800 R14: ffff956611545000 R15: ffff95667ac8f100 FS: 00007f2aa9120740(0000) GS:ffff95667b800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000018 CR3: 000000011070c000 CR4: 0000000000350ee0 Call Trace: <TASK> qdisc_peek_dequeued+0x29/0x70 [sch_ets] tbf_dequeue+0x22/0x260 [sch_tbf] __qdisc_run+0x7f/0x630 net_tx_action+0x290/0x4c0 __do_softirq+0xee/0x4f8 irq_exit_rcu+0xf4/0x130 sysvec_apic_timer_interrupt+0x52/0xc0 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0033:0x7f2aa7fc9ad4 Code: b9 ff ff 48 8b 54 24 18 48 83 c4 08 48 89 ee 48 89 df 5b 5d e9 ed fc ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa <53> 48 83 ec 10 48 8b 05 10 64 33 00 48 8b 00 48 85 c0 0f 85 84 00 RSP: 002b:00007ffe5d33fab8 EFLAGS: 00000202 RAX: 0000000000000002 RBX: 0000561f72c31460 RCX: 0000561f72c31720 RDX: 0000000000000002 RSI: 0000561f72c31722 RDI: 0000561f72c31720 RBP: 000000000000002a R08: 00007ffe5d33fa40 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 0000561f7187e380 R13: 0000000000000000 R14: 0000000000000000 R15: 0000561f72c31460 </TASK> Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt intel_rapl_msr iTCO_vendor_support intel_rapl_common joydev virtio_balloon lpc_ich i2c_i801 i2c_smbus pcspkr ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel serio_raw libata virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod CR2: 0000000000000018 Ensuring that 'alist' was never zeroed [1] was not sufficient, we need to remove from the active list those elements that are no more SP nor DRR. [1] https://lore.kernel.org/netdev/60d274838bf09777f0371253416e8af71360bc08.1633609148.git.dcaratti@redhat.com/ v3: fix race between ets_qdisc_change() and ets_qdisc_dequeue() delisting DRR classes beyond 'nbands' in ets_qdisc_change() with the qdisc lock acquired, thanks to Cong Wang. v2: when a NULL qdisc is found in the DRR active list, try to dequeue skb from the next list item. Reported-by: Hangbin Liu <liuhangbin@gmail.com> Fixes: `dcc68b4d80` ("net: sch_ets: Add a new Qdisc") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/7a5c496eed2d62241620bdbb83eb03fb9d571c99.1637762721.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 11:10:20 -08:00
John Crispin	eb87d3e089	mac80211: notify non-transmitting BSS of color changes When color change is triggered in multiple bssid case, allow only for transmitting BSS, and when it changes its bss color, notify the non transmitting BSSs also of the new bss color. Signed-off-by: John Crispin <john@phrozen.org> Co-developed-by: Lavanya Suresh <lavaks@codeaurora.org> Signed-off-by: Lavanya Suresh <lavaks@codeaurora.org> Co-developed-by: Rameshkumar Sundaram <quic_ramess@quicinc.com> Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com> Link: https://lore.kernel.org/r/1637146647-16282-1-git-send-email-quic_ramess@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:51:25 +01:00
Peter Seiderer	dc53078320	mac80211: minstrel_ht: remove unused SAMPLE_SWITCH_THR define Remove unused SAMPLE_SWITCH_THR define. Signed-off-by: Peter Seiderer <ps.report@gmx.net> Link: https://lore.kernel.org/r/20211116221244.30844-1-ps.report@gmx.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:51:07 +01:00
Lorenzo Bianconi	8415816493	cfg80211: allow continuous radar monitoring on offchannel chain Allow continuous radar detection on the offchannel chain in order to switch to the monitored channel whenever the underlying driver reports a radar pattern on the main channel. Tested-by: Owen Peng <owen.peng@mediatek.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/d46217310a49b14ff0e9c002f0a6e0547d70fd2c.1637071350.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:50:27 +01:00
Lorenzo Bianconi	c47240cb46	cfg80211: schedule offchan_cac_abort_wk in cfg80211_radar_event If necessary schedule offchan_cac_abort_wk work in cfg80211_radar_event routine adding offchan parameter to cfg80211_radar_event signature. Rename cfg80211_radar_event in __cfg80211_radar_event and introduce the two following inline helpers: - cfg80211_radar_event - cfg80211_offchan_radar_event Doing so the drv will not need to run cfg80211_offchan_cac_abort() after radar detection on the offchannel chain. Tested-by: Owen Peng <owen.peng@mediatek.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/3ff583e021e3343a3ced54a7b09b5e184d1880dc.1637062727.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:49:54 +01:00
liuguoqiang	3536672bbd	cfg80211: delete redundant free code When kzalloc failed and rdev->sacn_req or rdev->scan_msg is null, pass a null pointer to kfree is redundant, delete it and return directly. Signed-off-by: liuguoqiang <liuguoqiang@uniontech.com> Link: https://lore.kernel.org/r/20211115092139.24407-1-liuguoqiang@uniontech.com [remove now unused creq = NULL assigment] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:49:17 +01:00
Felix Fietkau	d787a3e38f	mac80211: add support for .ndo_fill_forward_path This allows drivers to provide a destination device + info for flow offload Only supported in combination with 802.3 encap offload Signed-off-by: Felix Fietkau <nbd@nbd.name> Tested-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/20211112112223.1209-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:47:26 +01:00
luo penghao	71abf71e9e	mac80211: Remove unused assignment statements The assignment of these three local variables in the file will not be used in the corresponding functions, so they should be deleted. The clang_analyzer complains as follows: net/mac80211/wpa.c:689:2 warning: net/mac80211/wpa.c:883:2 warning: net/mac80211/wpa.c:452:2 warning: Value stored to 'hdr' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Link: https://lore.kernel.org/r/20211104061411.1744-1-luo.penghao@zte.com.cn Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:46:24 +01:00
Lorenzo Bianconi	91e89c7732	cfg80211: fix possible NULL pointer dereference in cfg80211_stop_offchan_radar_detection Fix the following NULL pointer dereference in cfg80211_stop_offchan_radar_detection routine that occurs when hostapd is stopped during the CAC on offchannel chain: Sat Jan 1 0[ 779.567851] ESR = 0x96000005 0:12:50 2000 dae[ 779.572346] EC = 0x25: DABT (current EL), IL = 32 bits mon.debug hostap[ 779.578984] SET = 0, FnV = 0 d: hostapd_inter[ 779.583445] EA = 0, S1PTW = 0 face_deinit_free[ 779.587936] Data abort info: : num_bss=1 conf[ 779.592224] ISV = 0, ISS = 0x00000005 ->num_bss=1 Sat[ 779.597403] CM = 0, WnR = 0 Jan 1 00:12:50[ 779.601749] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000418b2000 2000 daemon.deb[ 779.609601] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 ug hostapd: host[ 779.619657] Internal error: Oops: 96000005 [#1] SMP [ 779.770810] CPU: 0 PID: 2202 Comm: hostapd Not tainted 5.10.75 #0 [ 779.776892] Hardware name: MediaTek MT7622 RFB1 board (DT) [ 779.782370] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--) [ 779.788384] pc : cfg80211_chandef_valid+0x10/0x490 [cfg80211] [ 779.794128] lr : cfg80211_check_station_change+0x3190/0x3950 [cfg80211] [ 779.800731] sp : ffffffc01204b7e0 [ 779.804036] x29: ffffffc01204b7e0 x28: ffffff80039bdc00 [ 779.809340] x27: 0000000000000000 x26: ffffffc008cb3050 [ 779.814644] x25: 0000000000000000 x24: 0000000000000002 [ 779.819948] x23: ffffff8002630000 x22: ffffff8003e748d0 [ 779.825252] x21: 0000000000000cc0 x20: ffffff8003da4a00 [ 779.830556] x19: 0000000000000000 x18: ffffff8001bf7ce0 [ 779.835860] x17: 00000000ffffffff x16: 0000000000000000 [ 779.841164] x15: 0000000040d59200 x14: 00000000000019c0 [ 779.846467] x13: 00000000000001c8 x12: 000636b9e9dab1c6 [ 779.851771] x11: 0000000000000141 x10: 0000000000000820 [ 779.857076] x9 : 0000000000000000 x8 : ffffff8003d7d038 [ 779.862380] x7 : 0000000000000000 x6 : ffffff8003d7d038 [ 779.867683] x5 : 0000000000000e90 x4 : 0000000000000038 [ 779.872987] x3 : 0000000000000002 x2 : 0000000000000004 [ 779.878291] x1 : 0000000000000000 x0 : 0000000000000000 [ 779.883594] Call trace: [ 779.886039] cfg80211_chandef_valid+0x10/0x490 [cfg80211] [ 779.891434] cfg80211_check_station_change+0x3190/0x3950 [cfg80211] [ 779.897697] nl80211_radar_notify+0x138/0x19c [cfg80211] [ 779.903005] cfg80211_stop_offchan_radar_detection+0x7c/0x8c [cfg80211] [ 779.909616] __cfg80211_leave+0x2c/0x190 [cfg80211] [ 779.914490] cfg80211_register_netdevice+0x1c0/0x6d0 [cfg80211] [ 779.920404] raw_notifier_call_chain+0x50/0x70 [ 779.924841] call_netdevice_notifiers_info+0x54/0xa0 [ 779.929796] __dev_close_many+0x40/0x100 [ 779.933712] __dev_change_flags+0x98/0x190 [ 779.937800] dev_change_flags+0x20/0x60 [ 779.941628] devinet_ioctl+0x534/0x6d0 [ 779.945370] inet_ioctl+0x1bc/0x230 [ 779.948849] sock_do_ioctl+0x44/0x200 [ 779.952502] sock_ioctl+0x268/0x4c0 [ 779.955985] __arm64_sys_ioctl+0xac/0xd0 [ 779.959900] el0_svc_common.constprop.0+0x60/0x110 [ 779.964682] do_el0_svc+0x1c/0x24 [ 779.967990] el0_svc+0x10/0x1c [ 779.971036] el0_sync_handler+0x9c/0x120 [ 779.974950] el0_sync+0x148/0x180 [ 779.978259] Code: a9bc7bfd 910003fd a90153f3 aa0003f3 (f9400000) [ 779.984344] ---[ end trace 0e67b4f5d6cdeec7 ]--- [ 779.996400] Kernel panic - not syncing: Oops: Fatal exception [ 780.002139] SMP: stopping secondary CPUs [ 780.006057] Kernel Offset: disabled [ 780.009537] CPU features: 0x0000002,04002004 [ 780.013796] Memory Limit: none Fixes: b8f5facf286b ("cfg80211: implement APIs for dedicated radar detection HW") Reported-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Tested-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/c2e34c065bf8839c5ffa45498ae154021a72a520.1635958796.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:46:12 +01:00
Ahmed Zaki	8f9dcc2956	mac80211: fix a memory leak where sta_info is not freed The following is from a system that went OOM due to a memory leak: wlan0: Allocated STA 74:83:c2:64:0b:87 wlan0: Allocated STA 74:83:c2:64:0b:87 wlan0: IBSS finish 74:83:c2:64:0b:87 (---from ieee80211_ibss_add_sta) wlan0: Adding new IBSS station 74:83:c2:64:0b:87 wlan0: moving STA 74:83:c2:64:0b:87 to state 2 wlan0: moving STA 74:83:c2:64:0b:87 to state 3 wlan0: Inserted STA 74:83:c2:64:0b:87 wlan0: IBSS finish 74:83:c2:64:0b:87 (---from ieee80211_ibss_work) wlan0: Adding new IBSS station 74:83:c2:64:0b:87 wlan0: moving STA 74:83:c2:64:0b:87 to state 2 wlan0: moving STA 74:83:c2:64:0b:87 to state 3 . . wlan0: expiring inactive not authorized STA 74:83:c2:64:0b:87 wlan0: moving STA 74:83:c2:64:0b:87 to state 2 wlan0: moving STA 74:83:c2:64:0b:87 to state 1 wlan0: Removed STA 74:83:c2:64:0b:87 wlan0: Destroyed STA 74:83:c2:64:0b:87 The ieee80211_ibss_finish_sta() is called twice on the same STA from 2 different locations. On the second attempt, the allocated STA is not destroyed creating a kernel memory leak. This is happening because sta_info_insert_finish() does not call sta_info_free() the second time when the STA already exists (returns -EEXIST). Note that the caller sta_info_insert_rcu() assumes STA is destroyed upon errors. Same fix is applied to -ENOMEM. Signed-off-by: Ahmed Zaki <anzaki@gmail.com> Link: https://lore.kernel.org/r/20211002145329.3125293-1-anzaki@gmail.com [change the error path label to use the existing code] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:39:16 +01:00
Xing Song	942bd1070c	mac80211: set up the fwd_skb->dev for mesh forwarding Mesh forwarding requires that the fwd_skb->dev is set up for TX handling, otherwise the following warning will be generated, so set it up for the pending frames. [ 72.835674 ] WARNING: CPU: 0 PID: 1193 at __skb_flow_dissect+0x284/0x1298 [ 72.842379 ] Modules linked in: ksmbd pppoe ppp_async l2tp_ppp ... [ 72.962020 ] CPU: 0 PID: 1193 Comm: kworker/u5:1 Tainted: P S 5.4.137 #0 [ 72.969938 ] Hardware name: MT7622_MT7531 RFB (DT) [ 72.974659 ] Workqueue: napi_workq napi_workfn [ 72.979025 ] pstate: 60000005 (nZCv daif -PAN -UAO) [ 72.983822 ] pc : __skb_flow_dissect+0x284/0x1298 [ 72.988444 ] lr : __skb_flow_dissect+0x54/0x1298 [ 72.992977 ] sp : ffffffc010c738c0 [ 72.996293 ] x29: ffffffc010c738c0 x28: 0000000000000000 [ 73.001615 ] x27: 000000000000ffc2 x26: ffffff800c2eb818 [ 73.006937 ] x25: ffffffc010a987c8 x24: 00000000000000ce [ 73.012259 ] x23: ffffffc010c73a28 x22: ffffffc010a99c60 [ 73.017581 ] x21: 000000000000ffc2 x20: ffffff80094da800 [ 73.022903 ] x19: 0000000000000000 x18: 0000000000000014 [ 73.028226 ] x17: 00000000084d16af x16: 00000000d1fc0bab [ 73.033548 ] x15: 00000000715f6034 x14: 000000009dbdd301 [ 73.038870 ] x13: 00000000ea4dcbc3 x12: 0000000000000040 [ 73.044192 ] x11: 000000000eb00ff0 x10: 0000000000000000 [ 73.049513 ] x9 : 000000000eb00073 x8 : 0000000000000088 [ 73.054834 ] x7 : 0000000000000000 x6 : 0000000000000001 [ 73.060155 ] x5 : 0000000000000000 x4 : 0000000000000000 [ 73.065476 ] x3 : ffffffc010a98000 x2 : 0000000000000000 [ 73.070797 ] x1 : 0000000000000000 x0 : 0000000000000000 [ 73.076120 ] Call trace: [ 73.078572 ] __skb_flow_dissect+0x284/0x1298 [ 73.082846 ] __skb_get_hash+0x7c/0x228 [ 73.086629 ] ieee80211_txq_may_transmit+0x7fc/0x17b8 [mac80211] [ 73.092564 ] ieee80211_tx_prepare_skb+0x20c/0x268 [mac80211] [ 73.098238 ] ieee80211_tx_pending+0x144/0x330 [mac80211] [ 73.103560 ] tasklet_action_common.isra.16+0xb4/0x158 [ 73.108618 ] tasklet_action+0x2c/0x38 [ 73.112286 ] __do_softirq+0x168/0x3b0 [ 73.115954 ] do_softirq.part.15+0x88/0x98 [ 73.119969 ] __local_bh_enable_ip+0xb0/0xb8 [ 73.124156 ] napi_workfn+0x58/0x90 [ 73.127565 ] process_one_work+0x20c/0x478 [ 73.131579 ] worker_thread+0x50/0x4f0 [ 73.135249 ] kthread+0x124/0x128 [ 73.138484 ] ret_from_fork+0x10/0x1c Signed-off-by: Xing Song <xing.song@mediatek.com> Tested-By: Frank Wunderlich <frank-w@public-files.de> Link: https://lore.kernel.org/r/20211123033123.2684-1-xing.song@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:36:31 +01:00
Felix Fietkau	73111efacd	mac80211: fix regression in SSN handling of addba tx Some drivers that do their own sequence number allocation (e.g. ath9k) rely on being able to modify params->ssn on starting tx ampdu sessions. This was broken by a change that modified it to use sta->tid_seq[tid] instead. Cc: stable@vger.kernel.org Fixes: `31d8bb4e07` ("mac80211: agg-tx: refactor sending addba") Reported-by: Eneas U de Queiroz <cotequeiroz@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20211124094024.43222-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:35:45 +01:00
Felix Fietkau	18688c80ad	mac80211: fix rate control for retransmitted frames Since retransmission clears info->control, rate control needs to be called again, otherwise the driver might crash due to invalid rates. Cc: stable@vger.kernel.org # 5.14+ Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reported-by: Robert W <rwbugreport@lost-in-the-void.net> Fixes: `03c3911d2d` ("mac80211: call ieee80211_tx_h_rate_ctrl() when dequeue") Signed-off-by: Felix Fietkau <nbd@nbd.name> Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi> Link: https://lore.kernel.org/r/20211122204323.9787-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:35:32 +01:00
Johannes Berg	d5e568c3a4	mac80211: track only QoS data frames for admission control For admission control, obviously all of that only works for QoS data frames, otherwise we cannot even access the QoS field in the header. Syzbot reported (see below) an uninitialized value here due to a status of a non-QoS nullfunc packet, which isn't even long enough to contain the QoS header. Fix this to only do anything for QoS data packets. Reported-by: syzbot+614e82b88a1a4973e534@syzkaller.appspotmail.com Fixes: `02219b3abc` ("mac80211: add WMM admission control support") Link: https://lore.kernel.org/r/20211122124737.dad29e65902a.Ieb04587afacb27c14e0de93ec1bfbefb238cc2a0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:35:20 +01:00
Maxime Bizon	48c06708e6	mac80211: fix TCP performance on mesh interface sta is NULL for mesh point (resolved later), so sk pacing parameters were not applied. Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Link: https://lore.kernel.org/r/66f51659416ac35d6b11a313bd3ffe8b8a43dd55.camel@freebox.fr Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-26 11:34:29 +01:00
Xin Long	703319094c	sctp: make the raise timer more simple and accurate Currently, the probe timer is reused as the raise timer when PLPMTUD is in the Search Complete state. raise_count was introduced to count how many times the probe timer has timed out. When raise_count reaches to 30, the raise timer handler will be triggered. During the whole processing above, the timer keeps timing out every probe_ interval. It is a waste for the Search Complete state, as the raise timer only needs to time out after 30 * probe_interval. Since the raise timer and probe timer are never used at the same time, it is no need to keep probe timer 'alive' in the Search Complete state. This patch to introduce sctp_transport_reset_raise_timer() to start the timer as the raise timer when entering the Search Complete state. When entering the other states, sctp_transport_reset_probe_timer() will still be called to reset the timer to the probe timer. raise_count can be removed from sctp_transport as no need to count probe timer timeout for raise timer timeout. last_rtx_chunks can be removed as sctp_transport_reset_probe_timer() can be called in the place where asoc rtx_data_chunks is changed. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/edb0e48988ea85997488478b705b11ddc1ba724a.1637781974.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-25 21:00:45 -08:00
Xin Long	0c51dffcc8	tipc: delete the unlikely branch in tipc_aead_encrypt When a skb comes to tipc_aead_encrypt(), it's always linear. The unlikely check 'skb_cloned(skb) && tailen <= skb_tailroom(skb)' can completely be taken care of in skb_cow_data() by the code in branch "if (!skb_has_frag_list())". Also, remove the 'TODO:' annotation, as the pages in skbs are not writable, see more on commit `3cf4375a09` ("tipc: do not write skb_shinfo frags when doing decrytion"). Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/47a478da0b6095b76e3cbe7a75cbd25d9da1df9a.1637773872.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-25 20:58:16 -08:00
Jakub Kicinski	f3911f73f5	tls: fix replacing proto_ops We replace proto_ops whenever TLS is configured for RX. But our replacement also overrides sendpage_locked, which will crash unless TX is also configured. Similarly we plug both of those in for TLS_HW (NIC crypto offload) even tho TLS_HW has a completely different implementation for TX. Last but not least we always plug in something based on inet_stream_ops even though a few of the callbacks differ for IPv6 (getname, release, bind). Use a callback building method similar to what we do for struct proto. Fixes: `c46234ebb4` ("tls: RX path for ktls") Fixes: `d4ffb02dee` ("net/tls: enable sk_msg redirect to tls socket egress") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-25 19:28:16 -08:00
Jakub Kicinski	e062fe99cc	tls: splice_read: fix accessing pre-processed records recvmsg() will put peek()ed and partially read records onto the rx_list. splice_read() needs to consult that list otherwise it may miss data. Align with recvmsg() and also put partially-read records onto rx_list. tls_sw_advance_skb() is pretty pointless now and will be removed in net-next. Fixes: `692d7b5d1f` ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-25 19:28:16 -08:00
Jakub Kicinski	520493f66f	tls: splice_read: fix record type check We don't support splicing control records. TLS 1.3 changes moved the record type check into the decrypt if(). The skb may already be decrypted and still be an alert. Note that decrypt_skb_update() is idempotent and updates ctx->decrypted so the if() is pointless. Reorder the check for decryption errors with the content type check while touching them. This part is not really a bug, because if decryption failed in TLS 1.3 content type will be DATA, and for TLS 1.2 it will be correct. Nevertheless its strange to touch output before checking if the function has failed. Fixes: `fedf201e12` ("net: tls: Refactor control message handling on recv") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-25 19:28:16 -08:00
Archie Pusaka	dbf6811abb	Bluetooth: Limit duration of Remote Name Resolve When doing remote name request, we cannot scan. In the normal case it's OK since we can expect it to finish within a short amount of time. However, there is a possibility to scan lots of devices that (1) requires Remote Name Resolve (2) is unresponsive to Remote Name Resolve When this happens, we are stuck to do Remote Name Resolve until all is done before continue scanning. This patch adds a time limit to stop us spending too long on remote name request. Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-25 21:08:19 +01:00
Archie Pusaka	ea13aed5e5	Bluetooth: Send device found event on name resolve failure Introducing NAME_REQUEST_FAILED flag that will be sent together with device found event on name resolve failure. This will provide the userspace with an information so it can decide not to resolve the name for these devices in the future. Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-25 21:08:19 +01:00
Luiz Augusto von Dentz	7978656caf	Bluetooth: HCI: Fix definition of hci_rp_delete_stored_link_key num_keys is actually 2 octects not 1: BLUETOOTH CORE SPECIFICATION Version 5.3 \| Vol 4, Part E page 1989: Num_Keys_Deleted: Size: 2 octets 0xXXXX Number of Link Keys Deleted Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-25 21:06:18 +01:00
Luiz Augusto von Dentz	e88422bccd	Bluetooth: HCI: Fix definition of hci_rp_read_stored_link_key Both max_num_keys and num_key are 2 octects: BLUETOOTH CORE SPECIFICATION Version 5.3 \| Vol 4, Part E page 1985: Max_Num_Keys: Size: 2 octets Range: 0x0000 to 0xFFFF Num_Keys_Read: Size: 2 octets Range: 0x0000 to 0xFFFF Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-25 21:03:45 +01:00
Guo DaXing	9ebb0c4b27	net/smc: Fix loop in smc_listen The kernel_listen function in smc_listen will fail when all the available ports are occupied. At this point smc->clcsock->sk->sk_data_ready has been changed to smc_clcsock_data_ready. When we call smc_listen again, now both smc->clcsock->sk->sk_data_ready and smc->clcsk_data_ready point to the smc_clcsock_data_ready function. The smc_clcsock_data_ready() function calls lsmc->clcsk_data_ready which now points to itself resulting in an infinite loop. This patch restores smc->clcsock->sk->sk_data_ready with the old value. Fixes: `a60a2b1e0a` ("net/smc: reduce active tcp_listen workers") Signed-off-by: Guo DaXing <guodaxing@huawei.com> Acked-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 19:02:21 -08:00
Karsten Graul	587acad41f	net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk() Coverity reports a possible NULL dereferencing problem: in smc_vlan_by_tcpsk(): 6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times). 7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next. 1623 ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower); CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS) 8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev. 1624 if (is_vlan_dev(ndev)) { Remove the manual implementation and use netdev_walk_all_lower_dev() to iterate over the lower devices. While on it remove an obsolete function parameter comment. Fixes: `cb9d43f677` ("net/smc: determine vlan_id of stacked net_device") Suggested-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 19:02:21 -08:00
Maciej Żenczykowski	305e95bb89	net-ipv6: changes to ->tclass (via IPV6_TCLASS) should sk_dst_reset() This is to match ipv4 behaviour, see __ip_sock_set_tos() implementation. Technically for ipv6 this might not be required because normally we do not allow tclass to influence routing, yet the cli tooling does support it: lpk11:~# ip -6 rule add pref 5 tos 45 lookup 5 lpk11:~# ip -6 rule 5: from all tos 0x45 lookup 5 and in general dscp/tclass based routing does make sense. We already have cases where dscp can affect vlan priority and/or transmit queue (especially on wifi). So let's just make things match. Easier to reason about and no harm. Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123223208.1117871-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 18:57:23 -08:00
Maciej Żenczykowski	9f7b3a69c8	net-ipv6: do not allow IPV6_TCLASS to muck with tcp's ECN This is to match ipv4 behaviour, see __ip_sock_set_tos() implementation at ipv4/ip_sockglue.c:579 void __ip_sock_set_tos(struct sock *sk, int val) { if (sk->sk_type == SOCK_STREAM) { val &= ~INET_ECN_MASK; val \|= inet_sk(sk)->tos & INET_ECN_MASK; } if (inet_sk(sk)->tos != val) { inet_sk(sk)->tos = val; sk->sk_priority = rt_tos2priority(val); sk_dst_reset(sk); } } Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211123223154.1117794-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 18:57:14 -08:00
Maciej Żenczykowski	079925cce1	net: allow SO_MARK with CAP_NET_RAW A CAP_NET_RAW capable process can already spoof (on transmit) anything it desires via raw packet sockets... There is no good reason to not allow it to also be able to play routing tricks on packets from its own normal sockets. There is a desire to be able to use SO_MARK for routing table selection (via ip rule fwmark) from within a user process without having to run it as root. Granting it CAP_NET_RAW is much less dangerous than CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change, while CAP_NET_ADMIN does - by for example allowing the reconfiguration of the routing tables and/or bringing up/down devices). Let's keep CAP_NET_ADMIN for persistent state changes, while using CAP_NET_RAW for non-configuration related stuff. Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 18:57:05 -08:00
Maciej Żenczykowski	a1b519b745	net: allow CAP_NET_RAW to setsockopt SO_PRIORITY CAP_NET_ADMIN is and should continue to be about configuring the system as a whole, not about configuring per-socket or per-packet parameters. Sending and receiving raw packets is what CAP_NET_RAW is all about. It can already send packets with any VLAN tag, and any IPv4 TOS mark, and any IPv6 TCLASS mark, simply by virtue of building such a raw packet. Not to mention using any protocol and source/ /destination ip address/port tuple. These are the fields that networking gear uses to prioritize packets. Hence, a CAP_NET_RAW process is already capable of affecting traffic prioritization after it hits the wire. This change makes it capable of affecting traffic prioritization even in the host at the nic and before that in the queueing disciplines (provided skb->priority is actually being used for prioritization, and not the TOS/TCLASS field) Hence it makes sense to allow a CAP_NET_RAW process to set the priority of sockets and thus packets it sends. Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 18:57:02 -08:00
Eric Dumazet	4e1fddc98d	tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows While testing BIG TCP patch series, I was expecting that TCP_RR workloads with 80KB requests/answers would send one 80KB TSO packet, then being received as a single GRO packet. It turns out this was not happening, and the root cause was that cubic Hystart ACK train was triggering after a few (2 or 3) rounds of RPC. Hystart was wrongly setting CWND/SSTHRESH to 30, while my RPC needed a budget of ~20 segments. Ideally these TCP_RR flows should not exit slow start. Cubic Hystart should reset itself at each round, instead of assuming every TCP flow is a bulk one. Note that even after this patch, Hystart can still trigger, depending on scheduling artifacts, but at a higher CWND/SSTHRESH threshold, keeping optimal TSO packet sizes. Tested: ip link set dev eth0 gro_ipv6_max_size 131072 gso_ipv6_max_size 131072 nstat -n; netperf -H ... -t TCP_RR -l 5 -- -r 80000,80000 -K cubic; nstat\|egrep "Ip6InReceives\|Hystart\|Ip6OutRequests" Before: 8605 Ip6InReceives 87541 0.0 Ip6OutRequests 129496 0.0 TcpExtTCPHystartTrainDetect 1 0.0 TcpExtTCPHystartTrainCwnd 30 0.0 After: 8760 Ip6InReceives 88514 0.0 Ip6OutRequests 87975 0.0 Fixes: `ae27e98a51` ("[TCP] CUBIC v2.3") Co-developed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20211123202535.1843771-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 17:26:05 -08:00
Ido Schimmel	5a45ab3f24	net: bridge: Allow base 16 inputs in sysfs Cited commit converted simple_strtoul() to kstrtoul() as suggested by the former's documentation. However, it also forced all the inputs to be decimal resulting in user space breakage. Fix by setting the base to '0' so that the base is automatically detected. Before: # ip link add name br0 type bridge vlan_filtering 1 # echo "0x88a8" > /sys/class/net/br0/bridge/vlan_protocol bash: echo: write error: Invalid argument After: # ip link add name br0 type bridge vlan_filtering 1 # echo "0x88a8" > /sys/class/net/br0/bridge/vlan_protocol # echo $? 0 Fixes: `520fbdf7fb` ("net/bridge: replace simple_strtoul to kstrtol") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20211124101122.3321496-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 17:23:52 -08:00
Eric Dumazet	627b94f75b	gro: remove rcu_read_lock/rcu_read_unlock from gro_complete handlers All gro_complete() handlers are called from napi_gro_complete() while rcu_read_lock() has been called. There is no point stacking more rcu_read_lock() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 17:21:42 -08:00
Eric Dumazet	fc1ca3348a	gro: remove rcu_read_lock/rcu_read_unlock from gro_receive handlers All gro_receive() handlers are called from dev_gro_receive() while rcu_read_lock() has been called. There is no point stacking more rcu_read_lock() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-24 17:21:42 -08:00
Brian Gix	899663be5e	Bluetooth: refactor malicious adv data check Check for out-of-bound read was being performed at the end of while num_reports loop, and would fill journal with false positives. Added check to beginning of loop processing so that it doesn't get checked after ptr has been advanced. Signed-off-by: Brian Gix <brian.gix@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-24 17:29:12 +01:00
Kumar Thangavel	ac13285214	net/ncsi : Add payload to be 32-bit aligned to fix dropped packets Update NC-SI command handler (both standard and OEM) to take into account of payload paddings in allocating skb (in case of payload size is not 32-bit aligned). The checksum field follows payload field, without taking payload padding into account can cause checksum being truncated, leading to dropped packets. Fixes: `fb4ee67529` ("net/ncsi: Add NCSI OEM command support") Signed-off-by: Kumar Thangavel <thangavel.k@hcl.com> Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-24 11:53:17 +00:00
Kuniyuki Iwashima	b4a8e7493d	dccp: Inline dccp_listen_start(). This patch inlines dccp_listen_start() and removes a stale comment in inet_dccp_listen() so that it looks like inet_listen(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Richard Sailer <richard_siegfried@systemli.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-23 20:16:22 -08:00
Kuniyuki Iwashima	e7049395b1	dccp/tcp: Remove an unused argument in inet_csk_listen_start(). The commit `1295e2cf30` ("inet: minor optimization for backlog setting in listen(2)") added change so that sk_max_ack_backlog is initialised earlier in inet_dccp_listen() and inet_listen(). Since then, we no longer use backlog in inet_csk_listen_start(), so let's remove it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Richard Sailer <richard_siegfried@systemli.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-23 20:16:18 -08:00
Jakub Kicinski	2106efda78	net: remove .ndo_change_proto_down .ndo_change_proto_down was added seemingly to enable out-of-tree implementations. Over 2.5yrs later we still have no real users upstream. Hardwire the generic implementation for now, we can revert once real users materialize. (rocker is a test vehicle, not a user.) We need to drop the optimization on the sysfs side, because unlike ndos priv_flags will be changed at runtime, so we'd need READ_ONCE/WRITE_ONCE everywhere.. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:18:48 +00:00
Tony Lu	606a63c978	net/smc: Ensure the active closing peer first closes clcsock The side that actively closed socket, it's clcsock doesn't enter TIME_WAIT state, but the passive side does it. It should show the same behavior as TCP sockets. Consider this, when client actively closes the socket, the clcsock in server enters TIME_WAIT state, which means the address is occupied and won't be reused before TIME_WAIT dismissing. If we restarted server, the service would be unavailable for a long time. To solve this issue, shutdown the clcsock in [A], perform the TCP active close progress first, before the passive closed side closing it. So that the actively closed side enters TIME_WAIT, not the passive one. Client \| Server close() // client actively close \| smc_release() \| smc_close_active() // PEERCLOSEWAIT1 \| smc_close_final() // abort or closed = 1\| smc_cdc_get_slot_and_msg_send() \| [A] \| \|smc_cdc_msg_recv_action() // ACTIVE \| queue_work(smc_close_wq, &conn->close_work) \| smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1 \| smc_close_passive_abort_received() // only in abort \| \|close() // server recv zero, close \| smc_release() // PROCESSABORT or APPCLOSEWAIT1 \| smc_close_active() \| smc_close_abort() or smc_close_final() // CLOSED \| smc_cdc_get_slot_and_msg_send() // abort or closed = 1 smc_cdc_msg_recv_action() \| smc_clcsock_release() queue_work(smc_close_wq, &conn->close_work) \| sock_release(tcp) // actively close clc, enter TIME_WAIT smc_close_passive_work() // PEERCLOSEWAIT1 \| smc_conn_free() smc_close_passive_abort_received() // CLOSED\| smc_conn_free() \| smc_clcsock_release() \| sock_release(tcp) // passive close clc \| Link: https://www.spinics.net/lists/netdev/msg780407.html Fixes: `b38d732477` ("smc: socket closing and linkgroup cleanup") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:42:24 +00:00
Tony Lu	45c3ff7a9a	net/smc: Clean up local struct sock variables There remains some variables to replace with local struct sock. So clean them up all. Fixes: `3163c5071f` ("net/smc: use local struct sock variables consistently") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:42:24 +00:00
Nikolay Aleksandrov	1c743127cc	net: nexthop: fix null pointer dereference when IPv6 is not enabled When we try to add an IPv6 nexthop and IPv6 is not enabled (!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug has been present since the beginning of IPv6 nexthop gateway support. Commit `1aefd3de7b` ("ipv6: Add fib6_nh_init and release to stubs") tells us that only fib6_nh_init has a dummy stub because fib6_nh_release should not be called if fib6_nh_init returns an error, but the commit below added a call to ipv6_stub->fib6_nh_release in its error path. To fix it return the dummy stub's -EAFNOSUPPORT error directly without calling ipv6_stub->fib6_nh_release in nh_create_ipv6()'s error path. [1] Output is a bit truncated, but it clearly shows the error. BUG: kernel NULL pointer dereference, address: 000000000000000000 #PF: supervisor instruction fetch in kernel modede #PF: error_code(0x0010) - not-present pagege PGD 0 P4D 0 Oops: 0010 [#1] PREEMPT SMP NOPTI CPU: 4 PID: 638 Comm: ip Kdump: loaded Not tainted 5.16.0-rc1+ #446 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:0x0 Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. RSP: 0018:ffff888109f5b8f0 EFLAGS: 00010286^Ac RAX: 0000000000000000 RBX: ffff888109f5ba28 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8881008a2860 RBP: ffff888109f5b9d8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff888109f5b978 R11: ffff888109f5b948 R12: 00000000ffffff9f R13: ffff8881008a2a80 R14: ffff8881008a2860 R15: ffff8881008a2840 FS: 00007f98de70f100(0000) GS:ffff88822bf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 0000000100efc000 CR4: 00000000000006e0 Call Trace: <TASK> nh_create_ipv6+0xed/0x10c rtm_new_nexthop+0x6d7/0x13f3 ? check_preemption_disabled+0x3d/0xf2 ? lock_is_held_type+0xbe/0xfd rtnetlink_rcv_msg+0x23f/0x26a ? check_preemption_disabled+0x3d/0xf2 ? rtnl_calcit.isra.0+0x147/0x147 netlink_rcv_skb+0x61/0xb2 netlink_unicast+0x100/0x187 netlink_sendmsg+0x37f/0x3a0 ? netlink_unicast+0x187/0x187 sock_sendmsg_nosec+0x67/0x9b ____sys_sendmsg+0x19d/0x1f9 ? copy_msghdr_from_user+0x4c/0x5e ? rcu_read_lock_any_held+0x2a/0x78 ___sys_sendmsg+0x6c/0x8c ? asm_sysvec_apic_timer_interrupt+0x12/0x20 ? lockdep_hardirqs_on+0xd9/0x102 ? sockfd_lookup_light+0x69/0x99 __sys_sendmsg+0x50/0x6e do_syscall_64+0xcb/0xf2 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f98dea28914 Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53 RSP: 002b:00007fff859f5e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e2e RAX: ffffffffffffffda RBX: 00000000619cb810 RCX: 00007f98dea28914 RDX: 0000000000000000 RSI: 00007fff859f5ed0 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008 R10: fffffffffffffce6 R11: 0000000000000246 R12: 0000000000000001 R13: 000055c0097ae520 R14: 000055c0097957fd R15: 00007fff859f63a0 </TASK> Modules linked in: bridge stp llc bonding virtio_net Cc: stable@vger.kernel.org Fixes: `53010f991a` ("nexthop: Add support for IPv6 gateways") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:39:53 +00:00
Ghalem Boudour	bcf141b2eb	xfrm: fix policy lookup for ipv6 gre packets On egress side, xfrm lookup is called from __gre6_xmit() with the fl6_gre_key field not initialized leading to policies selectors check failure. Consequently, gre packets are sent without encryption. On ingress side, INET6_PROTO_NOPOLICY was set, thus packets were not checked against xfrm policies. Like for egress side, fl6_gre_key should be correctly set, this is now done in decode_session6(). Fixes: `c12b395a46` ("gre: Support GRE over IPv6") Cc: stable@vger.kernel.org Signed-off-by: Ghalem Boudour <ghalem.boudour@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-11-23 10:12:21 +01:00
NeilBrown	064a91771f	SUNRPC: use different lock keys for INET6 and LOCAL xprtsock.c reclassifies sock locks based on the protocol. However there are 3 protocols and only 2 classification keys. The same key is used for both INET6 and LOCAL. This causes lockdep complaints. The complaints started since Commit `ea9afca88b` ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock") which resulted in the sock locks beings used more. So add another key, and renumber them slightly. Fixes: `ea9afca88b` ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock") Fixes: `176e21ee2e` ("SUNRPC: Support for RPC over AF_LOCAL transports") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-11-22 17:44:49 -05:00
Shiraz Saleem	325e0d0aa6	devlink: Add 'enable_iwarp' generic device param Add a new device generic parameter to enable and disable iWARP functionality on a multi-protocol RDMA device. Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Tested-by: Leszek Kaliszczuk <leszek.kaliszczuk@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-11-22 08:41:39 -08:00
Nikolay Aleksandrov	1005f19b93	net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group When replacing a nexthop group, we must release the IPv6 per-cpu dsts of the removed nexthop entries after an RCU grace period because they contain references to the nexthop's net device and to the fib6 info. With specific series of events[1] we can reach net device refcount imbalance which is unrecoverable. IPv4 is not affected because dsts don't take a refcount on the route. [1] $ ip nexthop list id 200 via 2002:db8::2 dev bridge.10 scope link onlink id 201 via 2002:db8::3 dev bridge scope link onlink id 203 group 201/200 $ ip -6 route 2001:db8::10 nhid 203 metric 1024 pref medium nexthop via 2002:db8::3 dev bridge weight 1 onlink nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink Create rt6_info through one of the multipath legs, e.g.: $ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10 (pkt_inj is just a custom packet generator, nothing special) Then remove that leg from the group by replace (let's assume it is id 200 in this case): $ ip nexthop replace id 203 group 201 Now remove the IPv6 route: $ ip -6 route del 2001:db8::10/128 The route won't be really deleted due to the stale rt6_info holding 1 refcnt in nexthop id 200. At this point we have the following reference count dependency: (deleted) IPv6 route holds 1 reference over nhid 203 nh 203 holds 1 ref over id 201 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info Now to create circular dependency between nh 200 and the IPv6 route, and also to get a reference over nh 200, restore nhid 200 in the group: $ ip nexthop replace id 203 group 201/200 And now we have a permanent circular dependncy because nhid 203 holds a reference over nh 200 and 201, but the route holds a ref over nh 203 and is deleted. To trigger the bug just delete the group (nhid 203): $ ip nexthop del id 203 It won't really be deleted due to the IPv6 route dependency, and now we have 2 unlinked and deleted objects that reference each other: the group and the IPv6 route. Since the group drops the reference it holds over its entries at free time (i.e. its own refcount needs to drop to 0) that will never happen and we get a permanent ref on them, since one of the entries holds a reference over the IPv6 route it will also never be released. At this point the dependencies are: (deleted, only unlinked) IPv6 route holds reference over group nh 203 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info This is the last point where it can be fixed by running traffic through nh 200, and specifically through the same CPU so the rt6_info (dst) will get released due to the IPv6 genid, that in turn will free the IPv6 route, which in turn will free the ref count over the group nh 203. If nh 200 is deleted at this point, it will never be released due to the ref from the unlinked group 203, it will only be unlinked: $ ip nexthop del id 200 $ ip nexthop $ Now we can never release that stale rt6_info, we have IPv6 route with ref over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with rt6_info (dst) with ref over the net device and the IPv6 route. All of these objects are only unlinked, and cannot be released, thus they can't release their ref counts. Message from syslogd@dev at Nov 19 14:04:10 ... kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 Message from syslogd@dev at Nov 19 14:04:20 ... kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 Fixes: `7bf4796dd0` ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:44:49 +00:00
Nikolay Aleksandrov	8837cbbf85	net: ipv6: add fib6_nh_release_dsts stub We need a way to release a fib6_nh's per-cpu dsts when replacing nexthops otherwise we can end up with stale per-cpu dsts which hold net device references, so add a new IPv6 stub called fib6_nh_release_dsts. It must be used after an RCU grace period, so no new dsts can be created through a group's nexthop entry. Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed so it doesn't need a dummy stub when IPv6 is not enabled. Fixes: `7bf4796dd0` ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:44:49 +00:00
Kees Cook	03f61041c1	skbuff: Switch structure bounds to struct_group() In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Replace the existing empty member position markers "headers_start" and "headers_end" with a struct_group(). This will allow memcpy() and sizeof() to more easily reason about sizes, and improve readability. "pahole" shows no size nor member offset changes to struct sk_buff. "objdump -d" shows no object code changes (outside of WARNs affected by source line number changes). Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/* Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:13:54 +00:00
Kees Cook	fba84957e2	skbuff: Move conditional preprocessor directives out of struct sk_buff In preparation for using the struct_group() macro in struct sk_buff, move the conditional preprocessor directives out of the region of struct sk_buff that will be enclosed by struct_group(). While GCC and Clang are happy with conditional preprocessor directives here, sparse is not, even under -Wno-directive-within-macro[1], as would be seen under a C=1 build: net/core/filter.c: note: in included file (through include/linux/netlink.h, include/linux/sock_diag.h): ./include/linux/skbuff.h:820:1: warning: directive in macro's argument list ./include/linux/skbuff.h:822:1: warning: directive in macro's argument list ./include/linux/skbuff.h:846:1: warning: directive in macro's argument list ./include/linux/skbuff.h:848:1: warning: directive in macro's argument list Additionally remove empty macro argument definitions and usage. "objdump -d" shows no object code differences. [1] https://www.spinics.net/lists/linux-sparse/msg10857.html Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:13:54 +00:00
Daniel Borkmann	4177d5b017	net, neigh: Fix crash in v6 module initialization error path When IPv6 module gets initialized, but it's hitting an error in inet6_init() where it then needs to undo all the prior initialization work, it also might do a call to ndisc_cleanup() which then calls neigh_table_clear(). In there is a missing timer cancellation of the table's managed_work item. The kernel test robot explicitly triggered this error path and caused a UAF crash similar to the below: [...] [ 28.833183][ C0] BUG: unable to handle page fault for address: f7a43288 [ 28.833973][ C0] #PF: supervisor write access in kernel mode [ 28.834660][ C0] #PF: error_code(0x0002) - not-present page [ 28.835319][ C0] pde = 06b2c067 pte = 00000000 [ 28.835853][ C0] Oops: 0002 [#1] PREEMPT [ 28.836367][ C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7 [ 28.837293][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014 [ 28.838338][ C0] EIP: __run_timers.constprop.0+0x82/0x440 [...] [ 28.845607][ C0] Call Trace: [ 28.845942][ C0] <SOFTIRQ> [ 28.846333][ C0] ? check_preemption_disabled.isra.0+0x2a/0x80 [ 28.846975][ C0] ? __this_cpu_preempt_check+0x8/0xa [ 28.847570][ C0] run_timer_softirq+0xd/0x40 [ 28.848050][ C0] __do_softirq+0xf5/0x576 [ 28.848547][ C0] ? __softirqentry_text_start+0x10/0x10 [ 28.849127][ C0] do_softirq_own_stack+0x2b/0x40 [ 28.849749][ C0] </SOFTIRQ> [ 28.850087][ C0] irq_exit_rcu+0x7d/0xc0 [ 28.850587][ C0] common_interrupt+0x2a/0x40 [ 28.851068][ C0] asm_common_interrupt+0x119/0x120 [...] Note that IPv6 module cannot be unloaded as per `8ce4406103` ("ipv6: do not allow ipv6 module to be removed") hence this can only be seen during module initialization error. Tested with kernel test robot's reproducer. Fixes: `7482e3841d` ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Li Zhijian <zhijianx.li@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:09:51 +00:00
Wen Gu	7a61432dc8	net/smc: Avoid warning of possible recursive locking Possible recursive locking is detected by lockdep when SMC falls back to TCP. The corresponding warnings are as follows: ============================================ WARNING: possible recursive locking detected 5.16.0-rc1+ #18 Tainted: G E -------------------------------------------- wrk/1391 is trying to acquire lock: ffff975246c8e7d8 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0x109/0x250 [smc] but task is already holding lock: ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ei->socket.wq.wait); lock(&ei->socket.wq.wait); * DEADLOCK * May be due to missing lock nesting notation 2 locks held by wrk/1391: #0: ffff975246040130 (sk_lock-AF_SMC){+.+.}-{0:0}, at: smc_connect+0x43/0x150 [smc] #1: ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc] stack backtrace: Call Trace: <TASK> dump_stack_lvl+0x56/0x7b __lock_acquire+0x951/0x11f0 lock_acquire+0x27a/0x320 ? smc_switch_to_fallback+0x109/0x250 [smc] ? smc_switch_to_fallback+0xfe/0x250 [smc] _raw_spin_lock_irq+0x3b/0x80 ? smc_switch_to_fallback+0x109/0x250 [smc] smc_switch_to_fallback+0x109/0x250 [smc] smc_connect_fallback+0xe/0x30 [smc] __smc_connect+0xcf/0x1090 [smc] ? mark_held_locks+0x61/0x80 ? __local_bh_enable_ip+0x77/0xe0 ? lockdep_hardirqs_on+0xbf/0x130 ? smc_connect+0x12a/0x150 [smc] smc_connect+0x12a/0x150 [smc] __sys_connect+0x8a/0xc0 ? syscall_enter_from_user_mode+0x20/0x70 __x64_sys_connect+0x16/0x20 do_syscall_64+0x34/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The nested locking in smc_switch_to_fallback() is considered to possibly cause a deadlock because smc_wait->lock and clc_wait->lock are the same type of lock. But actually it is safe so far since there is no other place trying to obtain smc_wait->lock when clc_wait->lock is held. So the patch replaces spin_lock() with spin_lock_nested() to avoid false report by lockdep. Link: https://lkml.org/lkml/2021/11/19/962 Fixes: `2153bd1e3d` ("Transfer remaining wait queue entries during fallback") Reported-by: syzbot+e979d3597f48262cb4ee@syzkaller.appspotmail.com Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 14:51:45 +00:00
Michael S. Tsirkin	f7a36b03a7	vsock/virtio: suppress used length validation It turns out that vhost vsock violates the virtio spec by supplying the out buffer length in the used length (should just be the in length). As a result, attempts to validate the used length fail with: vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0 Since vsock driver does not use the length fox tx and validates the length before use for rx, it is safe to suppress the validation in virtio core for this driver. Reported-by: Halil Pasic <pasic@linux.ibm.com> Fixes: `939779f515` ("virtio_ring: validate used buffer length") Cc: "Jason Wang" <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 14:49:03 +00:00
Yajun Deng	e968b1b3e9	arp: Remove #ifdef CONFIG_PROC_FS proc_create_net() and remove_proc_entry() already contain the case whether to define CONFIG_PROC_FS, so remove #ifdef CONFIG_PROC_FS. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 14:34:07 +00:00
Christophe JAILLET	08a7abf4af	net-sysfs: Slightly optimize 'xps_queue_show()' The 'mask' bitmap is local to this function. So the non-atomic '__set_bit()' can be used to save a few cycles. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 14:30:19 +00:00
Christophe JAILLET	db473c075f	rds: Fix a typo in a comment s/cold/could/ Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-By: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 14:28:37 +00:00
Eric Dumazet	6d872df3e3	net: annotate accesses to dev->gso_max_segs dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 12:49:42 +00:00
Eric Dumazet	4b66d2161b	net: annotate accesses to dev->gso_max_size dev->gso_max_size is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 12:49:42 +00:00
Eric Dumazet	19d36c5f29	ipv6: fix typos in __ip6_finish_output() We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and IP6SKB_REROUTED, instead of IPCB(skb)->flags and IPSKB_REROUTED Found by code inspection, please double check that fixing this bug does not surface other bugs. Fixes: `09ee9dba96` ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tobias Brunner <tobias@strongswan.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Tested-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 12:42:07 +00:00
Hao Chen	7462494408	ethtool: extend ringparam setting/getting API with rx_buf_len Add two new parameters kernel_ringparam and extack for .get_ringparam and .set_ringparam to extend more ring params through netlink. Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 12:31:49 +00:00
Hao Chen	0b70c256eb	ethtool: add support to set/get rx buf len via ethtool Add support to set rx buf len via ethtool -G parameter and get rx buf len via ethtool -g parameter. Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 12:31:48 +00:00
Hao Chen	448f413a8b	ethtool: add support to set/get tx copybreak buf size via ethtool Add support for ethtool to set/get tx copybreak buf size. Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 12:31:47 +00:00
Vincent Whitchurch	f9390b249c	af_unix: fix regression in read after shutdown On kernels before v5.15, calling read() on a unix socket after shutdown(SHUT_RD) or shutdown(SHUT_RDWR) would return the data previously written or EOF. But now, while read() after shutdown(SHUT_RD) still behaves the same way, read() after shutdown(SHUT_RDWR) always fails with -EINVAL. This behaviour change was apparently inadvertently introduced as part of a bug fix for a different regression caused by the commit adding sockmap support to af_unix, commit `94531cfcbe` ("af_unix: Add unix_stream_proto for sockmap"). Those commits, for unclear reasons, started setting the socket state to TCP_CLOSE on shutdown(SHUT_RDWR), while this state change had previously only been done in unix_release_sock(). Restore the original behaviour. The sockmap tests in tests/selftests/bpf continue to pass after this patch. Fixes: `d0c6416bd7` ("unix: Fix an issue in unix_shutdown causing the other end read/write failures") Link: https://lore.kernel.org/lkml/20211111140000.GA10779@axis.com/ Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 15:10:30 +00:00
Paolo Abeni	bcd9773431	mptcp: use delegate action to schedule 3rd ack retrans Scheduling a delack in mptcp_established_options_mp() is not a good idea: such function is called by tcp_send_ack() and the pending delayed ack will be cleared shortly after by the tcp_event_ack_sent() call in __tcp_transmit_skb(). Instead use the mptcp delegated action infrastructure to schedule the delayed ack after the current bh processing completes. Additionally moves the schedule_3rdack_retransmission() helper into protocol.c to avoid making it visible in a different compilation unit. Fixes: `ec3edaa7ca` ("mptcp: Add handling of outgoing MP_JOIN requests") Reviewed-by: Mat Martineau <mathew.j.martineau>@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 14:24:00 +00:00
Eric Dumazet	ee50e67ba0	mptcp: fix delack timer To compute the rtx timeout schedule_3rdack_retransmission() does multiple things in the wrong way: srtt_us is measured in usec/8 and the timeout itself is an absolute value. Fixes: `ec3edaa7ca` ("mptcp: Add handling of outgoing MP_JOIN requests") Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau>@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 14:24:00 +00:00
Florian Westphal	c9406a23c1	mptcp: sockopt: add SOL_IP freebind & transparent options These options also need to be set before bind, so do the sync of msk to new ssk socket a bit earlier. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 14:11:00 +00:00
Poorva Sonparote	ffcacff87c	mptcp: Support for IP_TOS for MPTCP setsockopt() SOL_IP provides a way to configure network layer attributes in a socket. This patch adds support for IP_TOS for setsockopt(.. ,SOL_IP, ..) Support for SOL_IP is added in mptcp_setsockopt() and IP_TOS is handled in a private function. The idea here is to take in the value passed for IP_TOS and set it to the current subflow, open subflows as well new subflows that might be created after the initial call to setsockopt(). This sync is done using sync_socket_options(.., ssk) and setting the value of tos using __ip_sock_set_tos(ssk,..). The patch has been tested using the packetdrill script here - https://github.com/multipath-tcp/mptcp_net-next/issues/220#issuecomment-947863717 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/220 Signed-off-by: Poorva Sonparote <psonparo@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 14:11:00 +00:00
Poorva Sonparote	4f47d5d507	ipv4: Exposing __ip_sock_set_tos() in ip.h Making the static function __ip_sock_set_tos() from net/ipv4/ip_sockglue.c accessible by declaring it in include/net/ip.h The reason for doing this is to use this function to set IP_TOS value in mptcp_setsockopt() without the lock. Signed-off-by: Poorva Sonparote <psonparo@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 14:11:00 +00:00
Jakub Kicinski	2c193f2cb1	net: kunit: add a test for dev_addr_lists Add a KUnit test for the dev_addr API. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 12:25:57 +00:00
Jakub Kicinski	a387ff8e5d	dev_addr_list: put the first addr on the tree Since all netdev->dev_addr modifications go via dev_addr_mod() we can put it on the list. When address is change remove it and add it back. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 12:25:57 +00:00
Jakub Kicinski	d07b26f5bb	dev_addr: add a modification check netdev->dev_addr should only be modified via helpers, but someone may be casting off the const. Add a runtime check to catch abuses. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 12:25:57 +00:00
Jakub Kicinski	5f0b692384	net: unexport dev_addr_init() & dev_addr_flush() There are no module callers in-tree and it's hard to justify why anyone would init or flush addresses of a netdev (note the flush is more of a destructor, it frees netdev->dev_addr). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 12:25:57 +00:00
Jakub Kicinski	adeef3e321	net: constify netdev->dev_addr Commit `406f42fa0d` ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. We converted all users to make modifications via appropriate helpers, make netdev->dev_addr const. The update helpers need to upcast from the buffer to struct netdev_hw_addr. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-20 12:25:57 +00:00
John Fastabend	c0d95d3380	bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap When a sock is added to a sock map we evaluate what proto op hooks need to be used. However, when the program is removed from the sock map we have not been evaluating if that changes the required program layout. Before the patch listed in the 'fixes' tag this was not causing failures because the base program set handles all cases. Specifically, the case with a stream parser and the case with out a stream parser are both handled. With the fix below we identified a race when running with a proto op that attempts to read skbs off both the stream parser and the skb->receive_queue. Namely, that a race existed where when the stream parser is empty checking the skb->receive_queue from recvmsg at the precies moment when the parser is paused and the receive_queue is not empty could result in skipping the stream parser. This may break a RX policy depending on the parser to run. The fix tag then loads a specific proto ops that resolved this race. But, we missed removing that proto ops recv hook when the sock is removed from the sockmap. The result is the stream parser is stopped so no more skbs will be aggregated there, but the hook and BPF program continues to be attached on the psock. User space will then get an EBUSY when trying to read the socket because the recvmsg() handler is now waiting on a stopped stream parser. To fix we rerun the proto ops init() function which will look at the new set of progs attached to the psock and rest the proto ops hook to the correct handlers. And in the above case where we remove the sock from the sock map the RX prog will no longer be listed so the proto ops is removed. Fixes: `c5d2177a72` ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com	2021-11-20 00:56:24 +01:00
John Fastabend	38207a5e81	bpf, sockmap: Attach map progs to psock early for feature probes When a TCP socket is added to a sock map we look at the programs attached to the map to determine what proto op hooks need to be changed. Before the patch in the 'fixes' tag there were only two categories -- the empty set of programs or a TX policy. In any case the base set handled the receive case. After the fix we have an optimized program for receive that closes a small, but possible, race on receive. This program is loaded only when the map the psock is being added to includes a RX policy. Otherwise, the race is not possible so we don't need to handle the race condition. In order for the call to sk_psock_init() to correctly evaluate the above conditions all progs need to be set in the psock before the call. However, in the current code this is not the case. We end up evaluating the requirements on the old prog state. If your psock is attached to multiple maps -- for example a tx map and rx map -- then the second update would pull in the correct maps. But, the other pattern with a single rx enabled map the correct receive hooks are not used. The result is the race fixed by the patch in the fixes tag below may still be seen in this case. To fix we simply set all psock->progs before doing the call into sock_map_init(). With this the init() call gets the full list of programs and chooses the correct proto ops on the first iteration instead of requiring the second update to pull them in. This fixes the race case when only a single map is used. Fixes: `c5d2177a72` ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com	2021-11-20 00:56:16 +01:00
Bernard Zhao	520fbdf7fb	net/bridge: replace simple_strtoul to kstrtol simple_strtoull is obsolete, use kstrtol instead. Signed-off-by: Bernard Zhao <bernard@vivo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 14:16:42 +00:00
Kees Cook	812ad3d270	ethtool: stats: Use struct_group() to clear all stats at once In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Add struct_group() to mark region of struct stats_reply_data that should be initialized, which can now be done in a single memset() call. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:53:02 +00:00
Kees Cook	b5d8cf0af1	net/af_iucv: Use struct_group() to zero struct iucv_sock region In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Add struct_group() to mark the region of struct iucv_sock that gets initialized to zero. Avoid the future warning: In function 'fortify_memset_chk', inlined from 'iucv_sock_alloc' at net/iucv/af_iucv.c:476:2: ./include/linux/fortify-string.h:199:4: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] 199 \| __write_overflow_field(p_size_field, size); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Acked-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/lkml/19ff61a0-0cda-6000-ce56-dc6b367c00d6@linux.ibm.com/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:52:25 +00:00
Kees Cook	8f2a83b454	ipv6: Use memset_after() to zero rt6_info In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Use memset_after() to clear everything after the dst_entry member of struct rt6_info. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:51:26 +00:00
Kees Cook	e3617433c3	net: 802: Use memset_startat() to clear struct fields In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Use memset_startat() so memset() doesn't get confused about writing beyond the destination member that is intended to be the starting point of zeroing through the end of the struct. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:23:23 +00:00
Kees Cook	f5455a1d9d	net: dccp: Use memset_startat() for TP zeroing In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Use memset_startat() so memset() doesn't get confused about writing beyond the destination member that is intended to be the starting point of zeroing through the end of the struct. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:22:49 +00:00
Heiko Carstens	7c8e1a9155	net/af_iucv: fix kernel doc comments Fix kernel doc comments where appropriate, or remove incorrect kernel doc indicators. Acked-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:12:30 +00:00
Heiko Carstens	682026a5e9	net/iucv: fix kernel doc comments Fix kernel doc comments where appropriate or remove incorrect kernel doc indicators. Also move kernel doc comments directly before functions. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:12:29 +00:00
David S. Miller	d6821c5bc6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Add selftest for vrf+conntrack, from Florian Westphal. 2) Extend nfqueue selftest to cover nfqueue, also from Florian. 3) Remove duplicated include in nft_payload, from Wan Jiabing. 4) Several improvements to the nat port shadowing selftest, from Phil Sutter. 5) Fix filtering of reply tuple in ctnetlink, from Florent Fourcot. 6) Do not override error with -EINVAL in filter setup path, also from Florent. 7) Honor sysctl_expire_nodest_conn regardless conn_reuse_mode for reused connections, from yangxingwu. 8) Replace snprintf() by sysfs_emit() in xt_IDLETIMER as reported by Coccinelle, from Jing Yao. 9) Incorrect IPv6 tunnel match in flowtable offload, from Will Mortensen. 10) Switch port shadow selftest to use socat, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-19 11:00:01 +00:00
Lorenzo Bianconi	1507b15319	cfg80211: move offchan_cac_event to a dedicated work In order to make cfg80211_offchan_cac_abort() (renamed from cfg80211_offchan_cac_event) callable in other contexts and without so much locking restrictions, make it trigger a new work instead of operating directly. Do some other renames while at it to clarify. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/6145c3d0f30400a568023f67981981d24c7c6133.1635325205.git.lorenzo@kernel.org [rewrite commit log] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-19 09:38:50 +01:00
Lorenzo Bianconi	237337c230	mac80211: introduce set_radar_offchan callback Similar to cfg80211, introduce set_radar_offchan callback in mac80211_ops in order to configure a dedicated offchannel chain available on some hw (e.g. mt7915) to perform offchannel CAC detection and avoid tx/rx downtime. Tested-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/201110606d4f3a7dfdf31440e351f2e2c375d4f0.1634979655.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-19 09:38:49 +01:00
Lorenzo Bianconi	bc2dfc0283	cfg80211: implement APIs for dedicated radar detection HW If a dedicated (off-channel) radar detection hardware (chain) is available in the hardware/driver, allow this to be used by calling the NL80211_CMD_RADAR_DETECT command with a new flag attribute requesting off-channel radar detection is used. Offchannel CAC (channel availability check) avoids the CAC downtime when switching to a radar channel or when turning on the AP. Drivers advertise support for this using the new feature flag NL80211_EXT_FEATURE_RADAR_OFFCHAN. Tested-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/7468e291ef5d05d692c1738d25b8f778d8ea5c3f.1634979655.git.lorenzo@kernel.org Link: https://lore.kernel.org/r/1e60e60fef00e14401adae81c3d49f3e5f307537.1634979655.git.lorenzo@kernel.org Link: https://lore.kernel.org/r/85fa50f57fc3adb2934c8d9ca0be30394de6b7e8.1634979655.git.lorenzo@kernel.org Link: https://lore.kernel.org/r/4b6c08671ad59aae0ac46fc94c02f31b1610eb72.1634979655.git.lorenzo@kernel.org Link: https://lore.kernel.org/r/241849ccaf2c228873c6f8495bf87b19159ba458.1634979655.git.lorenzo@kernel.org [remove offchan_mutex, fix cfg80211_stop_offchan_radar_detection(), remove gfp_t argument, fix documentation, fix tracing] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-19 09:38:49 +01:00
Jakub Kicinski	50fc24944a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-18 13:13:16 -08:00
Linus Torvalds	8d0112ac6f	Networking fixes for 5.16-rc2, including fixes from bpf, mac80211. Current release - regressions: - devlink: don't throw an error if flash notification sent before devlink visible - page_pool: Revert "page_pool: disable dma mapping support...", turns out there are active arches who need it Current release - new code bugs: - amt: cancel delayed_work synchronously in amt_fini() Previous releases - regressions: - xsk: fix crash on double free in buffer pool - bpf: fix inner map state pruning regression causing program rejections - mac80211: drop check for DONT_REORDER in __ieee80211_select_queue, preventing mis-selecting the best effort queue - mac80211: do not access the IV when it was stripped - mac80211: fix radiotap header generation, off-by-one - nl80211: fix getting radio statistics in survey dump - e100: fix device suspend/resume Previous releases - always broken: - tcp: fix uninitialized access in skb frags array for Rx 0cp - bpf: fix toctou on read-only map's constant scalar tracking - bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs - tipc: only accept encrypted MSG_CRYPTO msgs - smc: transfer remaining wait queue entries during fallback, fix missing wake ups - udp: validate checksum in udp_read_sock() (when sockmap is used) - sched: act_mirred: drop dst for the direction from egress to ingress - virtio_net_hdr_to_skb: count transport header in UFO, prevent allowing bad skbs into the stack - nfc: reorder the logic in nfc_{un,}register_device, fix unregister - ipsec: check return value of ipv6_skip_exthdr - usb: r8152: add MAC passthrough support for more Lenovo Docks Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGWf08ACgkQMUZtbf5S Irt+lxAAj8FAoLoSmQKUK3LttLLh0ZQQXu8Riey+wrP8Z9Yp8xWXIaVRF1c0vCE6 clbrF+mLfk6Wvv/RzOgwyBMHvK+djr/oVDNSmjlRvss4MLDfOQZhUV8V4XpvF4Up hI7wyKfHtd7niosNqil6wklJFpLU8WyIAWrPSIPE6JlPkJmcm3GUGsliwEPwdLY1 yl7z4zsxigjA+hKxYqNQX6tixF3xnbDUbAnWshrSPL89melwz4GMao45qmcxJEVr EipPhKifk0hT067jG08FMXcKBFKt6rKk7SVNo4mtq8Tl6HleJBj8fdaJAjSdFahB +rYJ0sDZwGoDL5CxZ5mD3fM1cDgh4WFEM0z//0b/bZhoPDRKEpLr9LPuv+N6+/rA 8D98EHsvyNjlFgdyd8celMstiGtBn4YLEoLNYYh9Qibgm0XsCuv0yox7g0AOLMmQ QiBmh2EnaXNPQ8PRZNMK3VH5ol2KoYWL6yrpJYV+wOWVLfezwlSsjkPSfW5pF9FG hU0iQBp/YTCdCadR9YLj8qfDWDUAkCN7WpqIu9EA9FXJcYjJVaix0MA/tAVlzXyR xlB7cU6O5NABcs/+04zPkKLwKbVYNMqgvKE+FVDVm+BKxo0UMxcmz/Np/ZYxfhkF bwKplaiPb2H4D6t0sdxqaeYirPwt1BcleLilae6vHG1jO90H9Vw= =tlqV -----END PGP SIGNATURE----- Merge tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, mac80211. Current release - regressions: - devlink: don't throw an error if flash notification sent before devlink visible - page_pool: Revert "page_pool: disable dma mapping support...", turns out there are active arches who need it Current release - new code bugs: - amt: cancel delayed_work synchronously in amt_fini() Previous releases - regressions: - xsk: fix crash on double free in buffer pool - bpf: fix inner map state pruning regression causing program rejections - mac80211: drop check for DONT_REORDER in __ieee80211_select_queue, preventing mis-selecting the best effort queue - mac80211: do not access the IV when it was stripped - mac80211: fix radiotap header generation, off-by-one - nl80211: fix getting radio statistics in survey dump - e100: fix device suspend/resume Previous releases - always broken: - tcp: fix uninitialized access in skb frags array for Rx 0cp - bpf: fix toctou on read-only map's constant scalar tracking - bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs - tipc: only accept encrypted MSG_CRYPTO msgs - smc: transfer remaining wait queue entries during fallback, fix missing wake ups - udp: validate checksum in udp_read_sock() (when sockmap is used) - sched: act_mirred: drop dst for the direction from egress to ingress - virtio_net_hdr_to_skb: count transport header in UFO, prevent allowing bad skbs into the stack - nfc: reorder the logic in nfc_{un,}register_device, fix unregister - ipsec: check return value of ipv6_skip_exthdr - usb: r8152: add MAC passthrough support for more Lenovo Docks" * tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits) ptp: ocp: Fix a couple NULL vs IS_ERR() checks net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock() net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound ipv6: check return value of ipv6_skip_exthdr e100: fix device suspend/resume devlink: Don't throw an error if flash notification sent before devlink visible page_pool: Revert "page_pool: disable dma mapping support..." ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port() octeontx2-af: debugfs: don't corrupt user memory NFC: add NCI_UNREG flag to eliminate the race NFC: reorder the logic in nfc_{un,}register_device NFC: reorganize the functions in nci_request tipc: check for null after calling kmemdup i40e: Fix display error code in dmesg i40e: Fix creation of first queue by omitting it if is not power of two i40e: Fix warning message and call stack during rmmod i40e driver i40e: Fix ping is lost after configuring ADq on VF i40e: Fix changing previously set num_queue_pairs for PFs i40e: Fix NULL ptr dereference on VSI filter sync i40e: Fix correct max_pkt_size on VF RX queue ...	2021-11-18 12:54:24 -08:00
luo penghao	2e1809208a	xfrm: Remove duplicate assignment The statement in the switch is repeated with the statement at the beginning of the while loop, so this statement is meaningless. The clang_analyzer complains as follows: net/xfrm/xfrm_policy.c:3392:2 warning: Value stored to 'exthdr' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-11-18 19:15:05 +01:00
luo penghao	c6e7871894	ipv6/esp6: Remove structure variables and alignment statements The definition of this variable is just to find the length of the structure after aligning the structure. The PTR alignment function is to optimize the size of the structure. In fact, it doesn't seem to be of much use, because both members of the structure are of type u32. So I think that the definition of the variable and the corresponding alignment can be deleted, the value of extralen can be directly passed in the size of the structure. The clang_analyzer complains as follows: net/ipv6/esp6.c:117:27 warning: Value stored to 'extra' during its initialization is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2021-11-18 19:15:04 +01:00
Jeremy Kerr	f6ef47e5bd	mctp/test: Update refcount checking in route fragment tests In `99ce45d5e`, we moved a route refcount decrement from mctp_do_fragment_route into the caller. This invalidates the assumption that the route test makes about refcount behaviour, so the route tests fail. This change fixes the test case to suit the new refcount behaviour. Fixes: `99ce45d5e7` ("mctp: Implement extended addressing") Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-18 12:01:14 +00:00
Yao Jing	4cdf85ef23	ipv6: ah6: use swap() to make code cleaner Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid opencoding it. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Yao Jing <yao.jing2@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-18 12:00:15 +00:00
Jordy Zomer	5f9c55c806	ipv6: check return value of ipv6_skip_exthdr The offset value is used in pointer math on skb->data. Since ipv6_skip_exthdr may return -1 the pointer to uh and th may not point to the actual udp and tcp headers and potentially overwrite other stuff. This is why I think this should be checked. EDIT: added {}'s, thanks Kees Signed-off-by: Jordy Zomer <jordy@pwning.systems> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-18 11:42:06 +00:00
Leon Romanovsky	fec1faf221	devlink: Don't throw an error if flash notification sent before devlink visible The mlxsw driver calls to various devlink flash routines even before users can get any access to the devlink instance itself. For example, mlxsw_core_fw_rev_validate() one of such functions. __mlxsw_core_bus_device_register -> mlxsw_core_fw_rev_validate -> mlxsw_core_fw_flash -> mlxfw_firmware_flash -> mlxfw_status_notify -> devlink_flash_update_status_notify -> __devlink_flash_update_notify -> WARN_ON(...) It causes to the WARN_ON to trigger warning about devlink not registered. Fixes: `cf53021740` ("devlink: Notify users when objects are accessible") Reported-by: Danielle Ratson <danieller@nvidia.com> Tested-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-18 11:34:17 +00:00
Yunsheng Lin	f915b75bff	page_pool: Revert "page_pool: disable dma mapping support..." This reverts commit `d00e60ee54`. As reported by Guillaume in [1]: Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT in 32-bit systems, which breaks the bootup proceess when a ethernet driver is using page pool with PP_FLAG_DMA_MAP flag. As we were hoping we had no active consumers for such system when we removed the dma mapping support, and LPAE seems like a common feature for 32 bits system, so revert it. 1. https://www.spinics.net/lists/netdev/msg779890.html Fixes: `d00e60ee54` ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reported-by: "kernelci.org bot" <bot@kernelci.org> Tested-by: "kernelci.org bot" <bot@kernelci.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-18 11:29:40 +00:00
Riccardo Paolo Bestetti	8ff978b8b2	ipv4/raw: support binding to nonlocal addresses Add support to inet v4 raw sockets for binding to nonlocal addresses through the IP_FREEBIND and IP_TRANSPARENT socket options, as well as the ipv4.ip_nonlocal_bind kernel parameter. Add helper function to inet_sock.h to check for bind address validity on the base of the address type and whether nonlocal address are enabled for the socket via any of the sockopts/sysctl, deduplicating checks in ipv4/ping.c, ipv4/af_inet.c, ipv6/af_inet6.c (for mapped v4->v6 addresses), and ipv4/raw.c. Add test cases with IP[V6]_FREEBIND verifying that both v4 and v6 raw sockets support binding to nonlocal addresses after the change. Add necessary support for the test cases to nettest. Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211117090010.125393-1-pbl@bestov.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-17 20:21:52 -08:00
Lin Ma	48b71a9e66	NFC: add NCI_UNREG flag to eliminate the race There are two sites that calls queue_work() after the destroy_workqueue() and lead to possible UAF. The first site is nci_send_cmd(), which can happen after the nci_close_device as below nfcmrvl_nci_unregister_dev \| nfc_genl_dev_up nci_close_device \| flush_workqueue \| del_timer_sync \| nci_unregister_device \| nfc_get_device destroy_workqueue \| nfc_dev_up nfc_unregister_device \| nci_dev_up device_del \| nci_open_device \| __nci_request \| nci_send_cmd \| queue_work !!! Another site is nci_cmd_timer, awaked by the nci_cmd_work from the nci_send_cmd. ... \| ... nci_unregister_device \| queue_work destroy_workqueue \| nfc_unregister_device \| ... device_del \| nci_cmd_work \| mod_timer \| ... \| nci_cmd_timer \| queue_work !!! For the above two UAF, the root cause is that the nfc_dev_up can race between the nci_unregister_device routine. Therefore, this patch introduce NCI_UNREG flag to easily eliminate the possible race. In addition, the mutex_lock in nci_close_device can act as a barrier. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: `6a2968aaf5` ("NFC: basic NCI protocol implementation") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211116152732.19238-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-17 20:17:05 -08:00
Lin Ma	3e3b5dfcd1	NFC: reorder the logic in nfc_{un,}register_device There is a potential UAF between the unregistration routine and the NFC netlink operations. The race that cause that UAF can be shown as below: (FREE) \| (USE) nfcmrvl_nci_unregister_dev \| nfc_genl_dev_up nci_close_device \| nci_unregister_device \| nfc_get_device nfc_unregister_device \| nfc_dev_up rfkill_destory \| device_del \| rfkill_blocked ... \| ... The root cause for this race is concluded below: 1. The rfkill_blocked (USE) in nfc_dev_up is supposed to be placed after the device_is_registered check. 2. Since the netlink operations are possible just after the device_add in nfc_register_device, the nfc_dev_up() can happen anywhere during the rfkill creation process, which leads to data race. This patch reorder these actions to permit 1. Once device_del is finished, the nfc_dev_up cannot dereference the rfkill object. 2. The rfkill_register need to be placed after the device_add of nfc_dev because the parent device need to be created first. So this patch keeps the order but inject device_lock to prevent the data race. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: `be055b2f89` ("NFC: RFKILL support") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211116152652.19217-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-17 20:17:05 -08:00
Lin Ma	86cdf8e387	NFC: reorganize the functions in nci_request There is a possible data race as shown below: thread-A in nci_request() \| thread-B in nci_close_device() \| mutex_lock(&ndev->req_lock); test_bit(NCI_UP, &ndev->flags); \| ... \| test_and_clear_bit(NCI_UP, &ndev->flags) mutex_lock(&ndev->req_lock); \| \| This race will allow __nci_request() to be awaked while the device is getting removed. Similar to commit `e2cb6b891a` ("bluetooth: eliminate the potential race condition when removing the HCI controller"). this patch alters the function sequence in nci_request() to prevent the data races between the nci_close_device(). Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: `6a2968aaf5` ("NFC: basic NCI protocol implementation") Link: https://lore.kernel.org/r/20211115145600.8320-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-17 20:16:53 -08:00
Tadeusz Struk	3e6db07975	tipc: check for null after calling kmemdup kmemdup can return a null pointer so need to check for it, otherwise the null key will be dereferenced later in tipc_crypto_key_xmit as can be seen in the trace [1]. Cc: tipc-discussion@lists.sourceforge.net Cc: stable@vger.kernel.org # 5.15, 5.14, 5.10 [1] https://syzkaller.appspot.com/bug?id=bca180abb29567b189efdbdb34cbf7ba851c2a58 Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20211115160143.5099-1-tadeusz.struk@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-17 20:04:52 -08:00
Eric Dumazet	bec251bc8b	net: no longer stop all TX queues in dev_watchdog() There is no reason for stopping all TX queues from dev_watchdog() Not only this stops feeding the NIC, it also migrates all qdiscs to be serviced on the cpu calling netif_tx_unlock(), causing a potential latency artifact. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-17 14:56:16 +00:00
Eric Dumazet	dab8fe3207	net: do not inline netif_tx_lock()/netif_tx_unlock() These are not fast path, there is no point in inlining them. Also provide netif_freeze_queues()/netif_unfreeze_queues() so that we can use them from dev_watchdog() in the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-17 14:56:16 +00:00
Eric Dumazet	5337824f4d	net: annotate accesses to queue->trans_start In following patches, dev_watchdog() will no longer stop all queues. It will read queue->trans_start locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-17 14:56:16 +00:00
Eric Dumazet	8160fb43d5	net: use an atomic_long_t for queue->trans_timeout tx_timeout_show() assumed dev_watchdog() would stop all the queues, to fetch queue->trans_timeout under protection of the queue->_xmit_lock. As we want to no longer disrupt transmits, we use an atomic_long_t instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: david decotigny <david.decotigny@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-17 14:56:16 +00:00
David S. Miller	b32563b6cc	bluetooth-next pull request for net-next: - Add support for AOSP Bluetooth Quality Report - Enables AOSP extension for Mediatek Chip (MT7921 & MT7922) - Rework of HCI command execution serialization -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmGUIpgZHGx1aXoudm9u LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKc+KEACi9xeDA0c2n9U4pGL4Vl6a kcLWHH5/TczQWpqsqAdDDO6M5tcNM92kcrzYgplM/mBHLLxrSE5JhIiH4lIdM0Gu jQntiEu9/AmI4wuF9N2ErRxn+0mJmzyO7jAV/pT0J/UqBuL+ahNIQN9l2392W1sF fU51bpdQlm4sEslAdFB+/p2XOwV4RDGcja9PHbbsuKM5FhMD9aUugiHe8S+7EVrO wCyIgIKvDecwWehgxE/6thpYg42N3z52oBKH2zLGlYf6VmQa6ODoYU6Wq6NMaGKg 2LbWpEBTn4hUye2gVyjJf0I2Yxn/xV/VrZ5QLLvUz4po9PL7ZsbXMnrJYYKpDfWd PGKlwyJTc/hhe1v9TBO4eAsD9rrEjMXbvTfmep3Y+mfMgQFjjKmy4ZfaNt+IDlje U0GbIwqHSCGmAUYtrp6FdZvEsv/YPKKoWNav+MG1ZOdeawKHZQjn3UncU3tA0Nev dZWhFgVNFqDcUYuxuaVgBCYTlC8KS/Kcfuc2THU1ATcxcFk8amrnIPUSwZZJrVso ZiJqOQLMMnK6GiABcgJmo48y/QW0lQkptbYKt/Gtk09wvQaaM8aCVpKOS20hMyyj qDe95Bm8onvdtS1n3ajqzZLVMOZiCoudUm62+q4Ked+RaUJ6VIc5Wn0uv5oO0mo1 7bPLzQ1h4APUHofbXCo0Tg== =BMgm -----END PGP SIGNATURE----- Merge tag 'for-net-next-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for AOSP Bluetooth Quality Report - Enables AOSP extension for Mediatek Chip (MT7921 & MT7922) - Rework of HCI command execution serialization ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-17 14:52:44 +00:00
Xin Long	f799ada6bf	net: sched: act_mirred: drop dst for the direction from egress to ingress Without dropping dst, the packets sent from local mirred/redirected to ingress will may still use the old dst. ip_rcv() will drop it as the old dst is for output and its .input is dst_discard. This patch is to fix by also dropping dst for those packets that are mirred or redirected from egress to ingress in act_mirred. Note that we don't drop it for the direction change from ingress to egress, as on which there might be a user case attaching a metadata dst by act_tunnel_key that would be used later. Fixes: `b57dc7c13e` ("net/sched: Introduce action ct") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-16 19:17:38 -08:00
Eric Dumazet	49ecc2e9c3	net: align static siphash keys siphash keys use 16 bytes. Define siphash_aligned_key_t macro so that we can make sure they are not crossing a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-16 19:07:54 -08:00
Jakub Kicinski	f5c741608b	Couple of fixes: * bad dont-reorder check * throughput LED trigger for various new(ish) paths * radiotap header generation * locking assertions in mac80211 with monitor mode * radio statistics * don't try to access IV when not present * call stop_ap for P2P_GO as well as we should -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmGT1iIACgkQB8qZga/f l8QDhA/+JeKZeXQ+Ct8kh34xMN4VxCUeGlJNY1aH2mTlJFSwxqJ4XH7eCiSTezXB ZEZ46wDLOpUF9epGl4NLzFkR5+Sanpn3LJOMLBDYQwkl3xcvNiQEZ0y4T8PayJRR kSs9ina3zxvZvBjCYX1KKdFatZiE+X5qwYyaK8T4ZjNQ4vjQpeX/V3eFpFeMDjuX fzIUN27yiESZG4heB2BkyeTR/jb4hG/gCbdGfNxa9MbF0OGJMJkuwH9MNHRQR2cx NvEPoq2IcooZfbqWnVk9lxWzRpm8Buio+gth9Knd13AGDl4Mez4+n0KxAEhafjhE ljUYekjh+55iSkt7eczW7wjvxmdGC+lfGN7GUop9mw4LU6NvpFCKXW5MqaH8XBy4 UIkcblXsMKkD8sFbMGzNIl7pT4JGdXkxlorxg7xI5s02nMFppnWJ7mrXQmcBkBG7 5JLkaXahiQzQYq/VGepCPMYYwKQqbHXts9TPMq/W3aIeR8QxA3uINZ5P9eGF9qn8 xTcvEQPYvIid1BIorO5/0V/VgE9VNDP1w8aKvn8bJjOmbWfuE68Wx+K6LlHy+E+O /MhjiRwnCvvTfdi6vPUjTxis8Xh5e9xfYmcqyRVIQujLmhta31FEJ/q+R+A1A4Rs RsesqpLgEOo3rkSciAAfItmRf3SVgJlIuZcfe16XFw0s45oYEck= =aD/1 -----END PGP SIGNATURE----- Merge tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Couple of fixes: * bad dont-reorder check * throughput LED trigger for various new(ish) paths * radiotap header generation * locking assertions in mac80211 with monitor mode * radio statistics * don't try to access IV when not present * call stop_ap for P2P_GO as well as we should * tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: fix throughput LED trigger mac80211: fix monitor_sdata RCU/locking assertions mac80211: drop check for DONT_REORDER in __ieee80211_select_queue mac80211: fix radiotap header generation mac80211: do not access the IV when it was stripped nl80211: fix radio statistics in survey dump cfg80211: call cfg80211_stop_ap when switch from P2P_GO type ==================== Link: https://lore.kernel.org/r/20211116160845.157214-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-16 16:53:59 -08:00
Jakub Kicinski	f083ec3160	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2021-11-16 We've added 12 non-merge commits during the last 5 day(s) which contain a total of 23 files changed, 573 insertions(+), 73 deletions(-). The main changes are: 1) Fix pruning regression where verifier went overly conservative rejecting previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer. 2) Fix verifier TOCTOU bug when using read-only map's values as constant scalars during verification, from Daniel Borkmann. 3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson. 4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker. 5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_() helpers in tracing programs due to deadlock possibilities, from Dmitrii Banshchikov. 6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang. 7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin. 8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: udp: Validate checksum in udp_read_sock() bpf: Fix toctou on read-only map's constant scalar tracking samples/bpf: Fix build error due to -isystem removal selftests/bpf: Add tests for restricted helpers bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs libbpf: Perform map fd cleanup for gen_loader in case of error samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu tools/runqslower: Fix cross-build samples/bpf: Fix summary per-sec stats in xdp_sample_user selftests/bpf: Check map in map pruning bpf: Fix inner map state pruning regression. xsk: Fix crash on double free in buffer pool ==================== Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-16 16:53:48 -08:00
Archie Pusaka	1f9d565743	Bluetooth: Attempt to clear HCI_LE_ADV on adv set terminated error event We should clear the flag if the adv instance removed due to receiving this error status is the last one we have. Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 15:17:51 +01:00
Archie Pusaka	0f281a5e5b	Bluetooth: Ignore HCI_ERROR_CANCELLED_BY_HOST on adv set terminated event This event is received when the controller stops advertising, specifically for these three reasons: (a) Connection is successfully created (success). (b) Timeout is reached (error). (c) Number of advertising events is reached (error). () This event is NOT generated when the host stops the advertisement. Refer to the BT spec ver 5.3 vol 4 part E sec 7.7.65.18. Note that the section was revised from BT spec ver 5.0 vol 2 part E sec 7.7.65.18 which was ambiguous about (). Some chips (e.g. RTL8822CE) send this event when the host stops the advertisement with status = HCI_ERROR_CANCELLED_BY_HOST (due to (*) above). This is treated as an error and the advertisement will be removed and userspace will be informed via MGMT event. On suspend, we are supposed to temporarily disable advertisements, and continue advertising on resume. However, due to the behavior above, the advertisements are removed instead. This patch returns early if HCI_ERROR_CANCELLED_BY_HOST is received. Btmon snippet of the unexpected behavior: @ MGMT Command: Remove Advertising (0x003f) plen 1 Instance: 1 < HCI Command: LE Set Extended Advertising Enable (0x08\|0x0039) plen 6 Extended advertising: Disabled (0x00) Number of sets: 1 (0x01) Entry 0 Handle: 0x01 Duration: 0 ms (0x00) Max ext adv events: 0 > HCI Event: LE Meta Event (0x3e) plen 6 LE Advertising Set Terminated (0x12) Status: Operation Cancelled by Host (0x44) Handle: 1 Connection handle: 0 Number of completed extended advertising events: 5 > HCI Event: Command Complete (0x0e) plen 4 LE Set Extended Advertising Enable (0x08\|0x0039) ncmd 2 Status: Success (0x00) Signed-off-by: Archie Pusaka <apusaka@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 15:16:30 +01:00
Luiz Augusto von Dentz	9482c5074a	Bluetooth: hci_request: Remove bg_scan_update work This work is no longer necessary since all the code using it has been converted to use hci_passive_scan/hci_passive_scan_sync. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 15:13:34 +01:00
Luiz Augusto von Dentz	f056a65783	Bluetooth: hci_sync: Convert MGMT_OP_SET_CONNECTABLE to use cmd_sync This makes MGMT_OP_SET_CONNEABLE use hci_cmd_sync_queue instead of use a dedicated connetable_update work. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 15:13:34 +01:00
Luiz Augusto von Dentz	2bd1b23761	Bluetooth: hci_sync: Convert MGMT_OP_SET_DISCOVERABLE to use cmd_sync This makes MGMT_OP_SET_DISCOVERABLE use hci_cmd_sync_queue instead of use a dedicated discoverable_update work. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 15:13:34 +01:00
Eric Dumazet	b3cb764aa1	net: drop nopreempt requirement on sock_prot_inuse_add() This is distracting really, let's make this simpler, because many callers had to take care of this by themselves, even if on x86 this adds more code than really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:20:45 +00:00
Eric Dumazet	4199bae10c	net: merge net->core.prot_inuse and net->core.sock_inuse net->core.sock_inuse is a per cpu variable (int), while net->core.prot_inuse is another per cpu variable of 64 integers. per cpu allocator tend to place them in very different places. Grouping them together makes sense, since it makes updates potentially faster, if hitting the same cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:20:45 +00:00
Eric Dumazet	d477eb9004	net: make sock_inuse_add() available MPTCP hard codes it, let us instead provide this helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:20:45 +00:00
Eric Dumazet	2a12ae5d43	net: inline sock_prot_inuse_add() sock_prot_inuse_add() is very small, we can inline it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:20:45 +00:00
Eric Dumazet	587652bbdd	net: gro: populate net/core/gro.c Move gro code and data from net/core/dev.c to net/core/gro.c to ease maintenance. gro_normal_list() and gro_normal_one() are inlined because they are called from both files. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:16:54 +00:00
Eric Dumazet	e456a18a39	net: gro: move skb_gro_receive into net/core/gro.c net/core/gro.c will contain all core gro functions, to shrink net/core/skbuff.c and net/core/dev.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:16:54 +00:00
Eric Dumazet	0b935d7f8c	net: gro: move skb_gro_receive_list to udp_offload.c This helper is used once, no need to keep it in fat net/core/skbuff.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:16:54 +00:00
Eric Dumazet	4721031c35	net: move gro definitions to include/net/gro.h include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:16:54 +00:00
Eric Dumazet	29fbc26e6d	tcp: do not call tcp_cleanup_rbuf() if we have a backlog Under pressure, tcp recvmsg() has logic to process the socket backlog, but calls tcp_cleanup_rbuf() right before. Avoiding sending ACK right before processing new segments makes a lot of sense, as this decrease the number of ACK packets, with no impact on effective ACK clocking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:35 +00:00
Eric Dumazet	8bd172b787	tcp: check local var (timeo) before socket fields in one test Testing timeo before sk_err/sk_state/sk_shutdown makes more sense. Modern applications use non-blocking IO, while a socket is terminated only once during its life time. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:35 +00:00
Eric Dumazet	f35f821935	tcp: defer skb freeing after socket lock is released tcp recvmsg() (or rx zerocopy) spends a fair amount of time freeing skbs after their payload has been consumed. A typical ~64KB GRO packet has to release ~45 page references, eventually going to page allocator for each of them. Currently, this freeing is performed while socket lock is held, meaning that there is a high chance that BH handler has to queue incoming packets to tcp socket backlog. This can cause additional latencies, because the user thread has to process the backlog at release_sock() time, and while doing so, additional frames can be added by BH handler. This patch adds logic to defer these frees after socket lock is released, or directly from BH handler if possible. Being able to free these skbs from BH handler helps a lot, because this avoids the usual alloc/free assymetry, when BH handler and user thread do not run on same cpu or NUMA node. One cpu can now be fully utilized for the kernel->user copy, and another cpu is handling BH processing and skb/page allocs/frees (assuming RFS is not forcing use of a single CPU) Tested: 100Gbit NIC Max throughput for one TCP_STREAM flow, over 10 runs MTU : 1500 Before: 55 Gbit After: 66 Gbit MTU : 4096+(headers) Before: 82 Gbit After: 95 Gbit Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:35 +00:00
Eric Dumazet	3df684c1a3	tcp: avoid indirect calls to sock_rfree TCP uses sk_eat_skb() when skbs can be removed from receive queue. However, the call to skb_orphan() from __kfree_skb() incurs an indirect call so sock_rfee(), which is more expensive than a direct call, especially for CONFIG_RETPOLINE=y. Add tcp_eat_recv_skb() function to make the call before __kfree_skb(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:35 +00:00
Eric Dumazet	b96c51bd3b	tcp: tp->urg_data is unlikely to be set Use some unlikely() hints in the fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	7b6a893a59	tcp: annotate races around tp->urg_data tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock owned. Also, it is faster to first check tp->urg_data in tcp_poll(), then tp->urg_seq == tp->copied_seq, because tp->urg_seq is located in a different/cold cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	0307a0b74b	tcp: annotate data-races on tp->segs_in and tp->data_segs_in tcp_segs_in() can be called from BH, while socket spinlock is held but socket owned by user, eventually reading these fields from tcp_get_info() Found by code inspection, no need to backport this patch to older kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	d2489c7b6d	tcp: add RETPOLINE mitigation to sk_backlog_rcv Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	93afcfd1db	tcp: small optimization in tcp recvmsg() When reading large chunks of data, incoming packets might be added to the backlog from BH. tcp recvmsg() detects the backlog queue is not empty, and uses a release_sock()/lock_sock() pair to process this backlog. We now have __sk_flush_backlog() to perform this a bit faster. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	91b6d32563	net: cache align tcp_memory_allocated, tcp_sockets_allocated tcp_memory_allocated and tcp_sockets_allocated often share a common cache line, source of false sharing. Also take care of udp_memory_allocated and mptcp_sockets_allocated. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	aba546565b	net: remove sk_route_nocaps Instead of using a full netdev_features_t, we can use a single bit, as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from sk->sk_route_cap. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	d0d598ca86	net: remove sk_route_forced_caps We were only using one bit, and we can replace it by sk_is_tcp() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	42f67eea3b	net: use sk_is_tcp() in more places Move sk_is_tcp() to include/net/sock.h and use it where we can. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	3735440200	tcp: small optimization in tcp_v6_send_check() For TCP flows, inet6_sk(sk)->saddr has the same value than sk->sk_v6_rcv_saddr. Using sk->sk_v6_rcv_saddr increases data locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Eric Dumazet	d519f35096	tcp: minor optimization in tcp_add_backlog() If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values are not used. Defer their access to the point we need them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-16 13:10:34 +00:00
Jesse Melhuish	385315decf	Bluetooth: Don't initialize msft/aosp when using user channel A race condition is triggered when usermode control is given to userspace before the kernel's MSFT query responds, resulting in an unexpected response to userspace's reset command. Issue can be observed in btmon: < HCI Command: Vendor (0x3f\|0x001e) plen 2 #3 [hci0] 05 01 .. @ USER Open: bt_stack_manage (privileged) version 2.22 {0x0002} [hci0] < HCI Command: Reset (0x03\|0x0003) plen 0 #4 [hci0] > HCI Event: Command Complete (0x0e) plen 5 #5 [hci0] Vendor (0x3f\|0x001e) ncmd 1 Status: Command Disallowed (0x0c) 05 . > HCI Event: Command Complete (0x0e) plen 4 #6 [hci0] Reset (0x03\|0x0003) ncmd 2 Status: Success (0x00) Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Signed-off-by: Jesse Melhuish <melhuishj@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 14:03:11 +01:00
Jackie Liu	a27c519a81	Bluetooth: fix uninitialized variables notify_evt Coverity Scan report: [...] *** CID 1493985: Uninitialized variables (UNINIT) /net/bluetooth/hci_event.c: 4535 in hci_sync_conn_complete_evt() 4529 4530 /* Notify only in case of SCO over HCI transport data path which 4531 * is zero and non-zero value shall be non-HCI transport data path 4532 / 4533 if (conn->codec.data_path == 0) { 4534 if (hdev->notify) >>> CID 1493985: Uninitialized variables (UNINIT) >>> Using uninitialized value "notify_evt" when calling "hdev->notify". 4535 hdev->notify(hdev, notify_evt); 4536 } 4537 4538 hci_connect_cfm(conn, ev->status); 4539 if (ev->status) 4540 hci_conn_del(conn); [...] Although only btusb uses air_mode, and he only handles HCI_NOTIFY_ENABLE_SCO_CVSD and HCI_NOTIFY_ENABLE_SCO_TRANSP, there is still a very small chance that ev->air_mode is not equal to 0x2 and 0x3, but notify_evt is initialized to HCI_NOTIFY_ENABLE_SCO_CVSD or HCI_NOTIFY_ENABLE_SCO_TRANSP. the context is maybe not correct. Let us directly use the required function instead of re-initializing it, so as to restore the original logic and make the code more correct. Addresses-Coverity: ("Uninitialized variables") Fixes: `f4f9fa0c07` ("Bluetooth: Allow usb to auto-suspend when SCO use non-HCI transport") Suggested-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 14:01:30 +01:00
Pavel Skripkin	3a56ef719f	Bluetooth: stop proccessing malicious adv data Syzbot reported slab-out-of-bounds read in hci_le_adv_report_evt(). The problem was in missing validaion check. We should check if data is not malicious and we can read next data block. If we won't check ptr validness, code can read a way beyond skb->end and it can cause problems, of course. Fixes: `e95beb4141` ("Bluetooth: hci_le_adv_report_evt code refactoring") Reported-and-tested-by: syzbot+e3fcb9c4f3c2a931dc40@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2021-11-16 13:59:20 +01:00
Cong Wang	099f896f49	udp: Validate checksum in udp_read_sock() It turns out the skb's in sock receive queue could have bad checksums, as both ->poll() and ->recvmsg() validate checksums. We have to do the same for ->read_sock() path too before they are redirected in sockmap. Fixes: `d7f571188e` ("udp: Implement ->read_sock() for sockmap") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211115044006.26068-1-xiyou.wangcong@gmail.com	2021-11-16 13:18:23 +01:00
Dmitrii Banshchikov	5e0bc3082e	bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing progs may result in locking issues. bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that isn't safe for any context: ====================================================== WARNING: possible circular locking dependency detected 5.15.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.4/14877 is trying to acquire lock: ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255 but task is already holding lock: ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&obj_hash[i].lock){-.-.}-{2:2}: lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162 __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569 debug_hrtimer_init kernel/time/hrtimer.c:414 [inline] debug_init kernel/time/hrtimer.c:468 [inline] hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592 ntp_init_cmos_sync kernel/time/ntp.c:676 [inline] ntp_init+0xa1/0xad kernel/time/ntp.c:1095 timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639 start_kernel+0x267/0x56e init/main.c:1030 secondary_startup_64_no_verify+0xb1/0xbb -> #0 (tk_core.seq.seqcount){----}-{0:0}: check_prev_add kernel/locking/lockdep.c:3051 [inline] check_prevs_add kernel/locking/lockdep.c:3174 [inline] validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789 __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015 lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625 seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103 ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255 ktime_get_coarse include/linux/timekeeping.h:120 [inline] ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline] ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline] bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171 bpf_prog_a99735ebafdda2f1+0x10/0xb50 bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline] __bpf_prog_run include/linux/filter.h:626 [inline] bpf_prog_run include/linux/filter.h:633 [inline] BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline] trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127 perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708 perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39 trace_lock_release+0x128/0x150 include/trace/events/lock.h:58 lock_release+0x82/0x810 kernel/locking/lockdep.c:5636 __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline] _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194 debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline] debug_deactivate kernel/time/hrtimer.c:481 [inline] __run_hrtimer kernel/time/hrtimer.c:1653 [inline] __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749 hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline] __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103 sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline] _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194 try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118 wake_up_process kernel/sched/core.c:4200 [inline] wake_up_q+0x9a/0xf0 kernel/sched/core.c:953 futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184 do_futex+0x367/0x560 kernel/futex/syscalls.c:127 __do_sys_futex kernel/futex/syscalls.c:199 [inline] __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae There is a possible deadlock with bpf_timer_* set of helpers: hrtimer_start() lock_base(); trace_hrtimer...() perf_event() bpf_run() bpf_timer_start() hrtimer_start() lock_base() <- DEADLOCK Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT and BPF_PROG_TYPE_RAW_TRACEPOINT prog types. Fixes: `d055126180` ("bpf: Add bpf_ktime_get_coarse_ns helper") Fixes: `b00628b1c7` ("bpf: Introduce bpf timers.") Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru	2021-11-15 20:35:58 -08:00
Jakub Kicinski	a5bdc36354	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2021-11-15 We've added 72 non-merge commits during the last 13 day(s) which contain a total of 171 files changed, 2728 insertions(+), 1143 deletions(-). The main changes are: 1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to BTF such that BPF verifier will be able to detect misuse, from Yonghong Song. 2) Big batch of libbpf improvements including various fixes, future proofing APIs, and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko. 3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush. 4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to ensure exception handling still works, from Russell King and Alan Maguire. 5) Add a new bpf_find_vma() helper for tracing to map an address to the backing file such as shared library, from Song Liu. 6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump, updating documentation and bash-completion among others, from Quentin Monnet. 7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as the API is heavily tailored around perf and is non-generic, from Dave Marchevsky. 8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an opt-out for more relaxed BPF program requirements, from Stanislav Fomichev. 9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits) bpftool: Use libbpf_get_error() to check error bpftool: Fix mixed indentation in documentation bpftool: Update the lists of names for maps and prog-attach types bpftool: Fix indent in option lists in the documentation bpftool: Remove inclusion of utilities.mak from Makefiles bpftool: Fix memory leak in prog_dump() selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning selftests/bpf: Fix an unused-but-set-variable compiler warning bpf: Introduce btf_tracing_ids bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs bpftool: Enable libbpf's strict mode by default docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support selftests/bpf: Clarify llvm dependency with btf_tag selftest selftests/bpf: Add a C test for btf_type_tag selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests selftests/bpf: Test libbpf API function btf__add_type_tag() bpftool: Support BTF_KIND_TYPE_TAG libbpf: Support BTF_KIND_TYPE_TAG ... ==================== Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-15 08:49:23 -08:00
Wen Gu	cf4f5530bb	net/smc: Make sure the link_id is unique The link_id is supposed to be unique, but smcr_next_link_id() doesn't skip the used link_id as expected. So the patch fixes this. Fixes: `026c381fb4` ("net/smc: introduce link_idx for link group array") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 14:42:24 +00:00
Tetsuo Handa	938cca9e41	sock: fix /proc/net/sockstat underflow in sk_clone_lock() sk_clone_lock() needs to call sock_inuse_add(1) before entering the sk_free_unlock_clone() error path, for __sk_free() from sk_free() from sk_free_unlock_clone() calls sock_inuse_add(-1). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: `648845ab7e` ("sock: Move the socket inuse to namespace.") Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 14:27:39 +00:00
Xin Long	271351d255	tipc: only accept encrypted MSG_CRYPTO msgs The MSG_CRYPTO msgs are always encrypted and sent to other nodes for keys' deployment. But when receiving in peers, if those nodes do not validate it and make sure it's encrypted, one could craft a malicious MSG_CRYPTO msg to deploy its key with no need to know other nodes' keys. This patch is to do that by checking TIPC_SKB_CB(skb)->decrypted and discard it if this packet never got decrypted. Note that this is also a supplementary fix to CVE-2021-43267 that can be triggered by an unencrypted malicious MSG_CRYPTO msg. Fixes: `1ef6f7c939` ("tipc: add automatic session key exchange") Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 14:25:22 +00:00
liuguoqiang	6def480181	net: return correct error code When kmemdup called failed and register_net_sysctl return NULL, should return ENOMEM instead of ENOBUFS Signed-off-by: liuguoqiang <liuguoqiang@uniontech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 14:22:12 +00:00
Christophe JAILLET	cc0be1ad68	net: bridge: Slightly optimize 'find_portno()' The 'inuse' bitmap is local to this function. So we can use the non-atomic '__set_bit()' to save a few cycles. While at it, also remove some useless {}. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 14:07:06 +00:00
Wen Gu	2153bd1e3d	net/smc: Transfer remaining wait queue entries during fallback The SMC fallback is incomplete currently. There may be some wait queue entries remaining in smc socket->wq, which should be removed to clcsocket->wq during the fallback. For example, in nginx/wrk benchmark, this issue causes an all-zeros test result: server: nginx -g 'daemon off;' client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html Running 5s test @ http://11.200.15.93/index.html 1 threads and 1 connections Thread Stats Avg Stdev Max ± Stdev Latency 0.00us 0.00us 0.00us -nan% Req/Sec 0.00 0.00 0.00 -nan% 0 requests in 5.00s, 0.00B read Requests/sec: 0.00 Transfer/sec: 0.00B The reason for this all-zeros result is that when wrk used SMC to replace TCP, it added an eppoll_entry into smc socket->wq and expected to be notified if epoll events like EPOLL_IN/ EPOLL_OUT occurred on the smc socket. However, once a fallback occurred, wrk switches to use clcsocket. Now it is clcsocket->wq instead of smc socket->wq which will be woken up. The eppoll_entry remaining in smc socket->wq does not work anymore and wrk stops the test. This patch fixes this issue by removing remaining wait queue entries from smc socket->wq to clcsocket->wq during the fallback. Link: https://www.spinics.net/lists/netdev/msg779769.html Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 13:27:02 +00:00
Harshit Mogalapalli	cb3ef7b000	net: sched: sch_netem: Refactor code in 4-state loss generator Fixed comments to match description with variable names and refactored code to match the convention as per [1]. To match the convention mapping is done as follows: State 3 - LOST_IN_BURST_PERIOD State 4 - LOST_IN_GAP_PERIOD [1] S. Salsano, F. Ludovici, A. Ordine, "Definition of a general and intuitive loss model for packet networks and its implementation in the Netem module in the Linux kernel" Fixes: `a6e2fe17eb` ("sch_netem: replace magic numbers with enumerate") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 13:23:23 +00:00
Tadeusz Struk	86c3a3e964	tipc: use consistent GFP flags Some functions, like tipc_crypto_start use inconsisten GFP flags when allocating memory. The mentioned function use GFP_ATOMIC to to alloc a crypto instance, and then calls alloc_ordered_workqueue() which allocates memory with GFP_KERNEL. tipc_aead_init() function even uses GFP_KERNEL and GFP_ATOMIC interchangeably. No doc comment specifies what context a function is designed to work in, but the flags should at least be consistent within a function. Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: tipc-discussion@lists.sourceforge.net Cc: linux-kernel@vger.kernel.org Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-15 12:57:31 +00:00
Linus Lüssing	9057d6c23e	batman-adv: allow netlink usage in unprivileged containers Currently, creating a batman-adv interface in an unprivileged LXD container and attaching secondary interfaces to it with "ip" or "batctl" works fine. However all batctl debug and configuration commands fail: root@container:~# batctl originators Error received: Operation not permitted root@container:~# batctl orig_interval 1000 root@container:~# batctl orig_interval 2000 root@container:~# batctl orig_interval 1000 To fix this change the generic netlink permissions from GENL_ADMIN_PERM to GENL_UNS_ADMIN_PERM. This way a batman-adv interface is fully maintainable as root from within a user namespace, from an unprivileged container. All except one batman-adv netlink setting are per interface and do not leak information or change settings from the host system and are therefore save to retrieve or modify as root from within an unprivileged container. "batctl routing_algo" / BATADV_CMD_GET_ROUTING_ALGOS is the only exception: It provides the batman-adv kernel module wide default routing algorithm. However it is read-only from netlink and an unprivileged container is still not allowed to modify /sys/module/batman_adv/parameters/routing_algo. Instead it is advised to use the newly introduced "batctl if create routing_algo RA_NAME" / IFLA_BATADV_ALGO_NAME to set the routing algorithm on interface creation, which already works fine in an unprivileged container. Cc: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2021-11-15 12:01:14 +01:00
Simon Wunderlich	c2262123cc	batman-adv: Start new development cycle This version will contain all the (major or even only minor) changes for Linux 5.17. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2021-11-15 12:01:07 +01:00
Felix Fietkau	30f6cf9691	mac80211: fix throughput LED trigger The codepaths for rx with decap offload and tx with itxq were not updating the counters for the throughput led trigger. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20211113063415.55147-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:56:57 +01:00
Johannes Berg	6dd2360334	mac80211: fix monitor_sdata RCU/locking assertions Since commit `a05829a722` ("cfg80211: avoid holding the RTNL when calling the driver") we've not only been protecting the pointer to monitor_sdata with the RTNL, but also with the wiphy->mtx. This is relevant in a number of lockdep assertions, e.g. the one we hit in ieee80211_set_monitor_channel(). However, we're now protecting all the assignments/dereferences, even the one in interface iter, with the wiphy->mtx, so switch over the lockdep assertions to that lock. Fixes: `a05829a722` ("cfg80211: avoid holding the RTNL when calling the driver") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20211112135143.cb8e8ceffef3.Iaa210f16f6904c8a7a24954fb3396da0ef86ec08@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:55:55 +01:00
Felix Fietkau	f6ab25d41b	mac80211: drop check for DONT_REORDER in __ieee80211_select_queue When __ieee80211_select_queue is called, skb->cb has not been cleared yet, which means that info->control.flags can contain garbage. In some cases this leads to IEEE80211_TX_CTRL_DONT_REORDER being set, causing packets marked for other queues to randomly end up in BE instead. This flag only needs to be checked in ieee80211_select_queue_80211, since the radiotap parser is the only piece of code that sets it Fixes: `66d06c8473` ("mac80211: adhere to Tx control flag that prevents frame reordering") Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20211110212201.35452-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:55:40 +01:00
Johannes Berg	c033a38a81	mac80211: fix radiotap header generation In commit `8c89f7b3d3` ("mac80211: Use flex-array for radiotap header bitmap") we accidentally pointed the position to the wrong place, so we overwrite a present bitmap, and thus cause all kinds of trouble. To see the issue, note that the previous code read: pos = (void )(it_present + 1); The requirement now is that we need to calculate pos via it_optional, to not trigger the compiler hardening checks, as: pos = (void )&rthdr->it_optional[...]; Rewriting the original expression, we get (obviously, since that just adds "+ x - x" terms): pos = (void )(it_present + 1 + rthdr->it_optional - rthdr->it_optional) and moving the "+ rthdr->it_optional" outside to be used as an array: pos = (void )&rthdr->it_optional[it_present + 1 - rthdr->it_optional]; The original is off by one, fix it. Cc: stable@vger.kernel.org Fixes: `8c89f7b3d3` ("mac80211: Use flex-array for radiotap header bitmap") Reported-by: Sid Hayn <sidhayn@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Tested-by: Sid Hayn <sidhayn@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20211109100203.c61007433ed6.I1dade57aba7de9c4f48d68249adbae62636fd98c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:55:20 +01:00
Xing Song	77dfc2bc0b	mac80211: do not access the IV when it was stripped ieee80211_get_keyid() will return false value if IV has been stripped, such as return 0 for IP/ARP frames due to LLC header, and return -EINVAL for disassociation frames due to its length... etc. Don't try to access it if it's not present. Signed-off-by: Xing Song <xing.song@mediatek.com> Link: https://lore.kernel.org/r/20211101024657.143026-1-xing.song@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:54:37 +01:00
Johannes Berg	ce6b697499	nl80211: fix radio statistics in survey dump Even if userspace specifies the NL80211_ATTR_SURVEY_RADIO_STATS attribute, we cannot get the statistics because we're not really parsing the incoming attributes properly any more. Fix this by passing the attrbuf to nl80211_prepare_wdev_dump() and filling it there, if given, and using a local version only if no output is desired. Since I'm touching it anyway, make nl80211_prepare_wdev_dump() static. Fixes: `50508d941c` ("cfg80211: use parallel_ops for genl") Reported-by: Jan Fuchs <jf@simonwunderlich.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Tested-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20211029092539.2851b4799386.If9736d4575ee79420cbec1bd930181e1d53c7317@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:51:51 +01:00
Nguyen Dinh Phi	563fbefed4	cfg80211: call cfg80211_stop_ap when switch from P2P_GO type If the userspace tools switch from NL80211_IFTYPE_P2P_GO to NL80211_IFTYPE_ADHOC via send_msg(NL80211_CMD_SET_INTERFACE), it does not call the cleanup cfg80211_stop_ap(), this leads to the initialization of in-use data. For example, this path re-init the sdata->assigned_chanctx_list while it is still an element of assigned_vifs list, and makes that linked list corrupt. Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com> Reported-by: syzbot+bbf402b783eeb6d908db@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20211027173722.777287-1-phind.uet@gmail.com Cc: stable@vger.kernel.org Fixes: `ac800140c2` ("cfg80211: .stop_ap when interface is going down") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2021-11-15 10:51:13 +01:00
Paul Moore	1aa3b2207e	net,lsm,selinux: revert the security_sctp_assoc_established() hook This patch reverts two prior patches, `e7310c9402` ("security: implement sctp_assoc_established hook in selinux") and `7c2ef0240e` ("security: add sctp_assoc_established hook"), which create the security_sctp_assoc_established() LSM hook and provide a SELinux implementation. Unfortunately these two patches were merged without proper review (the Reviewed-by and Tested-by tags from Richard Haines were for previous revisions of these patches that were significantly different) and there are outstanding objections from the SELinux maintainers regarding these patches. Work is currently ongoing to correct the problems identified in the reverted patches, as well as others that have come up during review, but it is unclear at this point in time when that work will be ready for inclusion in the mainline kernel. In the interest of not keeping objectionable code in the kernel for multiple weeks, and potentially a kernel release, we are reverting the two problematic patches. Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-14 12:21:53 +00:00
luo penghao	1274a4eb31	ipv6: Remove duplicate statements This statement is repeated with the initialization statement Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-14 12:20:44 +00:00
luo penghao	0de3521500	ipv4: Remove duplicate assignments there is a same action when the variable is initialized Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-14 12:20:44 +00:00
luo penghao	ef14102914	ipv4: drop unused assignment The assignment in the if statement will be overwritten by the following statement Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-14 12:20:44 +00:00
Linus Torvalds	0ecca62beb	One notable change here is that async creates and unlinks introduced in 5.7 are now enabled by default. This should greatly speed up things like rm, tar and rsync. To opt out, wsync mount option can be used. Other than that we have a pile of bug fixes all across the filesystem from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from Luis. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmGORY8THGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi5XmB/0SRQTW+BRAhYvSD/Ib2/c6uVZGnmJU MUCO/uuD4fZvfxyMVb3qnAzHIGh3YlFWa/dgSZNStvLmY0L9P0MIvyaYolVuC4Tu 7llX1I+yckTns9VmiULBNy9D812eRY282nMbRzikMGPO1eb6Yqo3r50AdTvkam/R Qs5pfwRqLerbP7VUv4vSsrBflwVyHOrCFZaUUVTu4f2kKz/FzZd/FCw5VV51KYzq ygN+8eFotf5zluMX0tl0FXVgJ13N9vbg+YNxyOzHmOAV0AhSmmtV8vJ+j+m7+sOj b7wNl5AuI5Ogeg8PHSGsXOHBVn6IbhdGDcougEVL+1gHkxxOQ5hfBHWQ =22Vd -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "One notable change here is that async creates and unlinks introduced in 5.7 are now enabled by default. This should greatly speed up things like rm, tar and rsync. To opt out, wsync mount option can be used. Other than that we have a pile of bug fixes all across the filesystem from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from Luis" * tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client: ceph: add a new metric to keep track of remote object copies libceph, ceph: move ceph_osdc_copy_from() into cephfs code ceph: clean-up metrics data structures to reduce code duplication ceph: split 'metric' debugfs file into several files ceph: return the real size read when it hits EOF ceph: properly handle statfs on multifs setups ceph: shut down mount on bad mdsmap or fsmap decode ceph: fix mdsmap decode when there are MDS's beyond max_mds ceph: ignore the truncate when size won't change with Fx caps issued ceph: don't rely on error_string to validate blocklisted session. ceph: just use ci->i_version for fscache aux info ceph: shut down access to inode when async create fails ceph: refactor remove_session_caps_cb ceph: fix auth cap handling logic in remove_session_caps_cb ceph: drop private list from remove_session_caps_cb ceph: don't use -ESTALE as special return code in try_get_cap_refs ceph: print inode numbers instead of pointer values ceph: enable async dirops by default libceph: drop ->monmap and err initialization ceph: convert to noop_direct_IO	2021-11-13 11:31:07 -08:00
Arjun Roy	70701b83e2	tcp: Fix uninitialized access in skb frags array for Rx 0cp. TCP Receive zerocopy iterates through the SKB queue via tcp_recv_skb(), acquiring a pointer to an SKB and an offset within that SKB to read from. From there, it iterates the SKB frags array to determine which offset to start remapping pages from. However, this is built on the assumption that the offset read so far within the SKB is smaller than the SKB length. If this assumption is violated, we can attempt to read an invalid frags array element, which would cause a fault. tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN flag is set. Therefore, we must guard against this occurrence inside skb_advance_frag(). One way that we can reproduce this error follows: 1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with: char some_array[32 * 1024]; struct tcp_zerocopy_receive zc = { .copybuf_address = (__u64) &some_array[0], .copybuf_len = 32 * 1024, }; 2) In a sender program, after a TCP handshake, send the following sequence of packets: i) Seq = [X, X+4000] ii) Seq = [X+4000, X+5000] iii) Seq = [X+4000, X+5000], Flags = FIN \| URG, urgptr=1000 (This can happen without URG, if we have a signal pending, but URG is a convenient way to reproduce the behaviour). In this case, the following event sequence will occur on the receiver: tcp_zerocopy_receive(): -> receive_fallback_to_copy() // copybuf_len >= inq -> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG -> tcp_recv_skb() // yields skb with skb->len == offset -> tcp_zerocopy_set_hint_for_skb() -> skb_advance_to_frag() // will returns a frags ptr. >= nr_frags -> find_next_mappable_frag() // will dereference this bad frags ptr. With this patch, skb_advance_to_frag() will no longer return an invalid frags pointer, and will return NULL instead, fixing the issue. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `05255b823a` ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive") Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-12 20:13:28 -08:00
Song Liu	9e2ad638ae	bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs syzbot reported the following BUG w/o CONFIG_DEBUG_INFO_BTF BUG: KASAN: global-out-of-bounds in task_iter_init+0x212/0x2e7 kernel/bpf/task_iter.c:661 Read of size 4 at addr ffffffff90297404 by task swapper/0/1 CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.15.0-syzkaller #0 Hardware name: ... Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0xf/0x309 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 task_iter_init+0x212/0x2e7 kernel/bpf/task_iter.c:661 do_one_initcall+0x103/0x650 init/main.c:1295 do_initcall_level init/main.c:1368 [inline] do_initcalls init/main.c:1384 [inline] do_basic_setup init/main.c:1403 [inline] kernel_init_freeable+0x6b1/0x73a init/main.c:1606 kernel_init+0x1a/0x1d0 init/main.c:1497 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> This is caused by hard-coded name[1] in BTF_ID_LIST_GLOBAL (w/o CONFIG_DEBUG_INFO_BTF). Fix this by adding a parameter n to BTF_ID_LIST_GLOBAL. This avoids ifdef CONFIG_DEBUG_INFO_BTF in btf.c and filter.c. Fixes: `7c7e3d31e7` ("bpf: Introduce helper bpf_find_vma") Reported-by: syzbot+e0d81ec552a21d9071aa@syzkaller.appspotmail.com Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211112150243.1270987-2-songliubraving@fb.com	2021-11-12 10:19:09 -08:00
Paul Moore	32a370abf1	net,lsm,selinux: revert the security_sctp_assoc_established() hook This patch reverts two prior patches, `e7310c9402` ("security: implement sctp_assoc_established hook in selinux") and `7c2ef0240e` ("security: add sctp_assoc_established hook"), which create the security_sctp_assoc_established() LSM hook and provide a SELinux implementation. Unfortunately these two patches were merged without proper review (the Reviewed-by and Tested-by tags from Richard Haines were for previous revisions of these patches that were significantly different) and there are outstanding objections from the SELinux maintainers regarding these patches. Work is currently ongoing to correct the problems identified in the reverted patches, as well as others that have come up during review, but it is unclear at this point in time when that work will be ready for inclusion in the mainline kernel. In the interest of not keeping objectionable code in the kernel for multiple weeks, and potentially a kernel release, we are reverting the two problematic patches. Signed-off-by: Paul Moore <paul@paul-moore.com>	2021-11-12 12:07:02 -05:00
Magnus Karlsson	199d983bc0	xsk: Fix crash on double free in buffer pool Fix a crash in the buffer pool allocator when a buffer is double freed. It is possible to trigger this behavior not only from a faulty driver, but also from user space like this: Create a zero-copy AF_XDP socket. Load an XDP program that will issue XDP_DROP for all packets. Put the same umem buffer into the fill ring multiple times, then bind the socket and send some traffic. This will crash the kernel as the XDP_DROP action triggers one call to xsk_buff_free()/xp_free() for every packet dropped. Each call will add the corresponding buffer entry to the free_list and increase the free_list_cnt. Some entries will have been added multiple times due to the same buffer being freed. The buffer allocation code will then traverse this broken list and since the same buffer is in the list multiple times, it will try to delete the same buffer twice from the list leading to a crash. The fix for this is just to test that the buffer has not been added before in xp_free(). If it has been, just return from the function and do not put it in the free_list a second time. Note that this bug was not present in the code before the commit referenced in the Fixes tag. That code used one list entry per allocated buffer, so multiple frees did not have any side effects. But the commit below optimized the usage of the pool and only uses a single entry per buffer in the umem, meaning that multiple allocations/frees of the same buffer will also only use one entry, thus leading to the problem. Fixes: `47e4075df3` ("xsk: Batched buffer allocation for the pool") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20211111075707.21922-1-magnus.karlsson@gmail.com	2021-11-12 15:55:27 +01:00
Linus Torvalds	f54ca91fe6	Networking fixes for 5.16-rc1, including fixes from bpf, can and netfilter. Current release - regressions: - bpf: do not reject when the stack read size is different from the tracked scalar size - net: fix premature exit from NAPI state polling in napi_disable() - riscv, bpf: fix RV32 broken build, and silence RV64 warning Current release - new code bugs: - net: fix possible NULL deref in sock_reserve_memory - amt: fix error return code in amt_init(); fix stopping the workqueue - ax88796c: use the correct ioctl callback Previous releases - always broken: - bpf: stop caching subprog index in the bpf_pseudo_func insn - security: fixups for the security hooks in sctp - nfc: add necessary privilege flags in netlink layer, limit operations to admin only - vsock: prevent unnecessary refcnt inc for non-blocking connect - net/smc: fix sk_refcnt underflow on link down and fallback - nfnetlink_queue: fix OOB when mac header was cleared - can: j1939: ignore invalid messages per standard - bpf, sockmap: - fix race in ingress receive verdict with redirect to self - fix incorrect sk_skb data_end access when src_reg = dst_reg - strparser, and tls are reusing qdisc_skb_cb and colliding - ethtool: fix ethtool msg len calculation for pause stats - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to access an unregistering real_dev - udp6: make encap_rcv() bump the v6 not v4 stats - drv: prestera: add explicit padding to fix m68k build - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge - drv: mvpp2: fix wrong SerDes reconfiguration order Misc & small latecomers: - ipvs: auto-load ipvs on genl access - mctp: sanity check the struct sockaddr_mctp padding fields - libfs: support RENAME_EXCHANGE in simple_rename() - avoid double accounting for pure zerocopy skbs Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2 5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8 JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4 KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY= =Ttgq -----END PGP SIGNATURE----- Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, can and netfilter. Current release - regressions: - bpf: do not reject when the stack read size is different from the tracked scalar size - net: fix premature exit from NAPI state polling in napi_disable() - riscv, bpf: fix RV32 broken build, and silence RV64 warning Current release - new code bugs: - net: fix possible NULL deref in sock_reserve_memory - amt: fix error return code in amt_init(); fix stopping the workqueue - ax88796c: use the correct ioctl callback Previous releases - always broken: - bpf: stop caching subprog index in the bpf_pseudo_func insn - security: fixups for the security hooks in sctp - nfc: add necessary privilege flags in netlink layer, limit operations to admin only - vsock: prevent unnecessary refcnt inc for non-blocking connect - net/smc: fix sk_refcnt underflow on link down and fallback - nfnetlink_queue: fix OOB when mac header was cleared - can: j1939: ignore invalid messages per standard - bpf, sockmap: - fix race in ingress receive verdict with redirect to self - fix incorrect sk_skb data_end access when src_reg = dst_reg - strparser, and tls are reusing qdisc_skb_cb and colliding - ethtool: fix ethtool msg len calculation for pause stats - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to access an unregistering real_dev - udp6: make encap_rcv() bump the v6 not v4 stats - drv: prestera: add explicit padding to fix m68k build - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge - drv: mvpp2: fix wrong SerDes reconfiguration order Misc & small latecomers: - ipvs: auto-load ipvs on genl access - mctp: sanity check the struct sockaddr_mctp padding fields - libfs: support RENAME_EXCHANGE in simple_rename() - avoid double accounting for pure zerocopy skbs" * tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits) selftests/net: udpgso_bench_rx: fix port argument net: wwan: iosm: fix compilation warning cxgb4: fix eeprom len when diagnostics not implemented net: fix premature exit from NAPI state polling in napi_disable() net/smc: fix sk_refcnt underflow on linkdown and fallback net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer() gve: fix unmatched u64_stats_update_end() net: ethernet: lantiq_etop: Fix compilation error selftests: forwarding: Fix packet matching in mirroring selftests vsock: prevent unnecessary refcnt inc for nonblocking connect net: marvell: mvpp2: Fix wrong SerDes reconfiguration order net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory net: stmmac: allow a tc-taprio base-time of zero selftests: net: test_vxlan_under_vrf: fix HV connectivity test net: hns3: allow configure ETS bandwidth of all TCs net: hns3: remove check VF uc mac exist when set by PF net: hns3: fix some mac statistics is always 0 in device version V2 net: hns3: fix kernel crash when unload VF while it is being reset net: hns3: sync rx ring head in echo common pull net: hns3: fix pfc packet number incorrect after querying pfc parameters ...	2021-11-11 09:49:36 -08:00
Alexander Lobakin	0315a075f1	net: fix premature exit from NAPI state polling in napi_disable() Commit `719c571970` ("net: make napi_disable() symmetric with enable") accidentally introduced a bug sometimes leading to a kernel BUG when bringing an iface up/down under heavy traffic load. Prior to this commit, napi_disable() was polling n->state until none of (NAPIF_STATE_SCHED \| NAPIF_STATE_NPSVC) is set and then always flip them. Now there's a possibility to get away with the NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg() call with an uninitialized variable, rather than straight to another round of the state check. Error path looks like: napi_disable(): unsigned long val, new; /* new is uninitialized / do { val = READ_ONCE(n->state); / NAPIF_STATE_NPSVC and/or NAPIF_STATE_SCHED is set / if (val & (NAPIF_STATE_SCHED \| NAPIF_STATE_NPSVC)) { / true / usleep_range(20, 200); continue; / go straight to the condition check / } new = val \| <...> } while (cmpxchg(&n->state, val, new) != val); / state == val, cmpxchg() writes garbage / napi_enable(): do { val = READ_ONCE(n->state); BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); / 50/50 boom */ <...> while the typical BUG splat is like: [ 172.652461] ------------[ cut here ]------------ [ 172.652462] kernel BUG at net/core/dev.c:6937! [ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42 [ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021 [ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0 [ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48 [ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246 [ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0 < snip > [ 172.782403] Call Trace: [ 172.784857] <TASK> [ 172.786963] ice_up_complete+0x6f/0x210 [ice] [ 172.791349] ice_xdp+0x136/0x320 [ice] [ 172.795108] ? ice_change_mtu+0x180/0x180 [ice] [ 172.799648] dev_xdp_install+0x61/0xe0 [ 172.803401] dev_xdp_attach+0x1e0/0x550 [ 172.807240] dev_change_xdp_fd+0x1e6/0x220 [ 172.811338] do_setlink+0xee8/0x1010 [ 172.814917] rtnl_setlink+0xe5/0x170 [ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10 [ 172.823732] ? security_capable+0x36/0x50 < snip > Fix this by replacing 'do { } while (cmpxchg())' with an "infinite" for-loop with an explicit break. From v1 [0]: - just use a for-loop to simplify both the fix and the existing code (Eric). [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com Fixes: `719c571970` ("net: make napi_disable() symmetric with enable") Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-10 17:45:15 -08:00
Linus Torvalds	38764c7340	A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping support for a filehandle format deprecated 20 years ago, and further xdr-related cleanup from Chuck. -----BEGIN PGP SIGNATURE----- iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx 29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/ pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1 Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD 0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD 6aiufVWKNNRQef9y =yKBl -----END PGP SIGNATURE----- Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping support for a filehandle format deprecated 20 years ago, and further xdr-related cleanup from Chuck" * tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits) nfsd4: remove obselete comment nfsd: document server-to-server-copy parameters NFSD:fix boolreturn.cocci warning nfsd: update create verifier comment SUNRPC: Change return value type of .pc_encode SUNRPC: Replace the "__be32 p" parameter to .pc_encode NFSD: Save location of NFSv4 COMPOUND status SUNRPC: Change return value type of .pc_decode SUNRPC: Replace the "__be32 p" parameter to .pc_decode SUNRPC: De-duplicate .pc_release() call sites SUNRPC: Simplify the SVC dispatch code path SUNRPC: Capture value of xdr_buf::page_base SUNRPC: Add trace event when alloc_pages_bulk() makes no progress svcrdma: Split svcrmda_wc_{read,write} tracepoints svcrdma: Split the svcrdma_wc_send() tracepoint svcrdma: Split the svcrdma_wc_receive() tracepoint NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment() SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases NFSD: Initialize pointer ni with NULL and not plain integer 0 NFSD: simplify struct nfsfh ...	2021-11-10 16:45:54 -08:00
Linus Torvalds	2ec20f4895	NFS client updates for Linux 5.16 Highlights include: Features: - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN - Optimisations for READDIR and the 'ls -l' style workload - Further replacements of dprintk() with tracepoints and other tracing improvements - Ensure we re-probe NFSv4 server capabilities when the user does a "mount -o remount" Bugfixes: - Fix an Oops in pnfs_mark_request_commit() - Fix up deadlocks in the commit code - Fix regressions in NFSv2/v3 attribute revalidation due to the change_attr_type optimisations - Fix some dentry verifier races - Fix some missing dentry verifier settings - Fix a performance regression in nfs_set_open_stateid_locked() - SUNRPC was sending multiple SYN calls when re-establishing a TCP connection. - Fix multiple NFSv4 issues due to missing sanity checking of server return values - Fix a potential Oops when FREE_STATEID races with an unmount Cleanups: - Clean up the labelled NFS code - Remove unused header <linux/pnfs_osd_xdr.h> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2 XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g= =n8lj -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN - Optimisations for READDIR and the 'ls -l' style workload - Further replacements of dprintk() with tracepoints and other tracing improvements - Ensure we re-probe NFSv4 server capabilities when the user does a "mount -o remount" Bugfixes: - Fix an Oops in pnfs_mark_request_commit() - Fix up deadlocks in the commit code - Fix regressions in NFSv2/v3 attribute revalidation due to the change_attr_type optimisations - Fix some dentry verifier races - Fix some missing dentry verifier settings - Fix a performance regression in nfs_set_open_stateid_locked() - SUNRPC was sending multiple SYN calls when re-establishing a TCP connection. - Fix multiple NFSv4 issues due to missing sanity checking of server return values - Fix a potential Oops when FREE_STATEID races with an unmount Cleanups: - Clean up the labelled NFS code - Remove unused header <linux/pnfs_osd_xdr.h>" * tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits) NFSv4: Sanity check the parameters in nfs41_update_target_slotid() NFS: Remove the nfs4_label argument from decode_getattr_*() functions NFS: Remove the nfs4_label argument from nfs_setsecurity NFS: Remove the nfs4_label argument from nfs_fhget() NFS: Remove the nfs4_label argument from nfs_add_or_obtain() NFS: Remove the nfs4_label argument from nfs_instantiate() NFS: Remove the nfs4_label from the nfs_setattrres NFS: Remove the nfs4_label from the nfs4_getattr_res NFS: Remove the f_label from the nfs4_opendata and nfs_openres NFS: Remove the nfs4_label from the nfs4_lookupp_res struct NFS: Remove the label from the nfs4_lookup_res struct NFS: Remove the nfs4_label from the nfs4_link_res struct NFS: Remove the nfs4_label from the nfs4_create_res struct NFS: Remove the nfs4_label from the nfs_entry struct NFS: Create a new nfs_alloc_fattr_with_label() function NFS: Always initialise fattr->label in nfs_fattr_alloc() NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open() NFSv4: Remove unnecessary 'minor version' check NFSv4: Fix potential Oops in decode_op_map() ...	2021-11-10 16:32:46 -08:00
Mark Pashmfouroush	f89315650b	bpf: Add ingress_ifindex to bpf_sk_lookup It may be helpful to have access to the ifindex during bpf socket lookup. An example may be to scope certain socket lookup logic to specific interfaces, i.e. an interface may be made exempt from custom lookup code. Add the ifindex of the arriving connection to the bpf_sk_lookup API. Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com	2021-11-10 16:29:58 -08:00

... 5 6 7 8 9 ...

67820 Commits