Commit Graph

1015339 Commits

Author SHA1 Message Date
Dave Ertman f9f83202b7 ice: Allow all LLDP packets from PF to Tx
Currently in the ice driver, the check whether to
allow a LLDP packet to egress the interface from the
PF_VSI is being based on the SKB's priority field.
It checks to see if the packets priority is equal to
TC_PRIO_CONTROL.  Injected LLDP packets do not always
meet this condition.

SCAPY defaults to a sk_buff->protocol value of ETH_P_ALL
(0x0003) and does not set the priority field.  There will
be other injection methods (even ones used by end users)
that will not correctly configure the socket so that
SKB fields are correctly populated.

Then ethernet header has to have to correct value for
the protocol though.

Add a check to also allow packets whose ethhdr->h_proto
matches ETH_P_LLDP (0x88CC).

Fixes: 0c3a6101ff ("ice: Allow egress control packets from PF_VSI")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04 07:37:48 -07:00
Paul Greenwalt 5cd349c349 ice: report supported and advertised autoneg using PHY capabilities
Ethtool incorrectly reported supported and advertised auto-negotiation
settings for a backplane PHY image which did not support auto-negotiation.
This can occur when using media or PHY type for reporting ethtool
supported and advertised auto-negotiation settings.

Remove setting supported and advertised auto-negotiation settings based
on PHY type in ice_phy_type_to_ethtool(), and MAC type in
ice_get_link_ksettings().

Ethtool supported and advertised auto-negotiation settings should be
based on the PHY image using the AQ command get PHY capabilities with
media. Add setting supported and advertised auto-negotiation settings
based get PHY capabilities with media in ice_get_link_ksettings().

Fixes: 48cb27f2fd ("ice: Implement handlers for ethtool PHY/link operations")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04 07:37:48 -07:00
Haiyue Wang c7ee6ce1cf ice: handle the VF VSI rebuild failure
VSI rebuild can be failed for LAN queue config, then the VF's VSI will
be NULL, the VF reset should be stopped with the VF entering into the
disable state.

Fixes: 12bb018c53 ("ice: Refactor VF reset")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04 07:37:48 -07:00
Brett Creeley 8679f07a99 ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared
Some AVF drivers expect the VF_MBX_ATQLEN register to be cleared for any
type of VFR/VFLR. Fix this by clearing the VF_MBX_ATQLEN register at the
same time as VF_MBX_ARQLEN.

Fixes: 82ba01282c ("ice: clear VF ARQLEN register on reset")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04 07:37:48 -07:00
Brett Creeley f0457690af ice: Fix allowing VF to request more/less queues via virtchnl
Commit 12bb018c53 ("ice: Refactor VF reset") caused a regression
that removes the ability for a VF to request a different amount of
queues via VIRTCHNL_OP_REQUEST_QUEUES. This prevents VF drivers to
either increase or decrease the number of queue pairs they are
allocated. Fix this by using the variable vf->num_req_qs when
determining the vf->num_vf_qs during VF VSI creation.

Fixes: 12bb018c53 ("ice: Refactor VF reset")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04 07:37:48 -07:00
Colin Ian King ebbf5fcb94 netdevsim: Fix unsigned being compared to less than zero
The comparison of len < 0 is always false because len is a size_t. Fix
this by making len a ssize_t instead.

Addresses-Coverity: ("Unsigned compared against 0")
Fixes: d395381909 ("netdevsim: Add max_vfs to bus_dev")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:33:17 -07:00
David S. Miller 579028dec1 bluetooth pull request for net:
- Fixes UAF and CVE-2021-3564
  - Fix VIRTIO_ID_BT to use an unassigned ID
  - Fix firmware loading on some Intel Controllers
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmC5RWQZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKS0+D/4kJF7G9FohvLJUzTrrhcPx
 nEE/5IL1eZeCQVCdKmgMeiy6K2iARGY9ZNqnx/AX1SJN9bHI7WsL6uy2RV7r57kx
 iP2XZsV2uzXbwY9KVvfXBMNoCA2E4xS0UxpxA2h1znRUgMWDFLFkZydwYsBieGb6
 tXZwJo3WOnDp169RbKdWTrWstYlL6KTTJoIxaVYWlghXVZ8Fl8LUHbhnx5MEqhqz
 469AfGDlUKEoiYUUDwNrwX1ory/RWhcDxTFpDeji48U0P7oLFL73Aoyy/WP0B2FO
 dhOErn38YUDivwBqSO2O21RUsICREbyLqHy6K/JWe4RqY50nEmWhfQo59ApzSuV3
 e2HcbDwK5vgGYxmU6T9vb5S0nV1AgTV+5O3t1Mj6ZVqTAl6b2OkfqskCZzTrklIS
 aKIP4viRAPLsJMdKKHW1mhR3zBH0deYEovIpFy+LkjX5aFsrEgc8hRn7i5ceF8GW
 d+Ov9LPJQJQTK+r6W7xPiCUkC1dj/SMZ756Gr6cGhXPzY1DgBoyaaoZV1K4mz17g
 dlLwXfF4nIJqJFop3iTPVGWVoeapZ/tgu73iTUdkXIEbqj19wj67nw+xz0WGs1pB
 B1H/OemQS4/yfo4IsfLRDAJ14Q+5JS4qRKBf7p4e/yj533BW6lia0GTdujO+N4eT
 FQfnUoYaexkiPYwGMyjRpQ==
 =X9Cg
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2021-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

bluetooth pull request for net:

 - Fixes UAF and CVE-2021-3564
 - Fix VIRTIO_ID_BT to use an unassigned ID
 - Fix firmware loading on some Intel Controllers

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:32:21 -07:00
Andreas Roeseler e32ea44c7a icmp: fix lib conflict with trinity
Including <linux/in.h> and <netinet/in.h> in the dependencies breaks
compilation of trinity due to multiple definitions. <linux/in.h> is only
used in <linux/icmp.h> to provide the definition of the struct in_addr,
but this can be substituted out by using the datatype __be32.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:31:34 -07:00
Nathan Chancellor 118de61067 net: ethernet: rmnet: Restructure if checks to avoid uninitialized warning
Clang warns that proto in rmnet_map_v5_checksum_uplink_packet() might be
used uninitialized:

drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:283:14: warning:
variable 'proto' is used uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]
                } else if (skb->protocol == htons(ETH_P_IPV6)) {
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:295:36: note:
uninitialized use occurs here
                check = rmnet_map_get_csum_field(proto, trans);
                                                 ^~~~~
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:283:10: note:
remove the 'if' if its condition is always true
                } else if (skb->protocol == htons(ETH_P_IPV6)) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:270:11: note:
initialize the variable 'proto' to silence this warning
                u8 proto;
                        ^
                         = '\0'
1 warning generated.

This is technically a false positive because there is an if statement
above this one that checks skb->protocol for not being either
ETH_P_IP{,V6}. However, it is more obvious to sink that into the if
statement as an else branch, which makes the code clearer and fixes the
warning.

At the same time, move the "IS_ENABLED(CONFIG_IPV6)" into the else if
condition so that the else branch of the preprocessor conditional can
be shared, since there is no build failure with CONFIG_IPV6 disabled.

Fixes: b6e5d27e32 ("net: ethernet: rmnet: Add support for MAPv5 egress packets")
Link: https://github.com/ClangBuiltLinux/linux/issues/1390
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:29:54 -07:00
Xuan Zhuo 1a8024239d virtio-net: fix for skb_over_panic inside big mode
In virtio-net's large packet mode, there is a hole in the space behind
buf.

    hdr_padded_len - hdr_len

We must take this into account when calculating tailroom.

[   44.544385] skb_put.cold (net/core/skbuff.c:5254 (discriminator 1) net/core/skbuff.c:5252 (discriminator 1))
[   44.544864] page_to_skb (drivers/net/virtio_net.c:485) [   44.545361] receive_buf (drivers/net/virtio_net.c:849 drivers/net/virtio_net.c:1131)
[   44.545870] ? netif_receive_skb_list_internal (net/core/dev.c:5714)
[   44.546628] ? dev_gro_receive (net/core/dev.c:6103)
[   44.547135] ? napi_complete_done (./include/linux/list.h:35 net/core/dev.c:5867 net/core/dev.c:5862 net/core/dev.c:6565)
[   44.547672] virtnet_poll (drivers/net/virtio_net.c:1427 drivers/net/virtio_net.c:1525)
[   44.548251] __napi_poll (net/core/dev.c:6985)
[   44.548744] net_rx_action (net/core/dev.c:7054 net/core/dev.c:7139)
[   44.549264] __do_softirq (./arch/x86/include/asm/jump_label.h:19 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:560)
[   44.549762] irq_exit_rcu (kernel/softirq.c:433 kernel/softirq.c:637 kernel/softirq.c:649)
[   44.551384] common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 13))
[   44.551991] ? asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)
[   44.552654] asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)

Fixes: fb32856b16 ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reported-by: Corentin Noël <corentin.noel@collabora.com>
Tested-by: Corentin Noël <corentin.noel@collabora.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:29:04 -07:00
Nathan Chancellor 819fb78f69 net: ks8851: Make ks8851_read_selftest() return void
clang points out that ret in ks8851_read_selftest() is set but unused:

drivers/net/ethernet/micrel/ks8851_common.c:1028:6: warning: variable
'ret' set but not used [-Wunused-but-set-variable]
        int ret = 0;
            ^
1 warning generated.

The return code of this function has never been checked so just remove
ret and make the function return void.

Fixes: 3ba81f3ece ("net: Micrel KS8851 SPI network driver")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:27:37 -07:00
Yu Kuai a10541f5d9 sch_htb: fix doc warning in htb_add_to_id_tree()
Add description for parameters of htb_add_to_id_tree() to fix
gcc W=1 warnings:
net/sched/sch_htb.c:282: warning: Function parameter or member 'root' not described in 'htb_add_to_id_tree'
net/sched/sch_htb.c:282: warning: Function parameter or member 'cl' not described in 'htb_add_to_id_tree'
net/sched/sch_htb.c:282: warning: Function parameter or member 'prio' not described in 'htb_add_to_id_tree'

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:26:32 -07:00
Colin Ian King 92e1b57c38 bonding: remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read,
it is being updated later on.  The assignment is redundant and can be
removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:25:29 -07:00
Russell King feb938fad6 net: phy: marvell: use phy_modify_changed() for marvell_set_polarity()
Rather than open-coding the phy_modify_changed() sequence, use this
helper in marvell_set_polarity().

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Marek Behún <kabel@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:24:34 -07:00
David S. Miller e31d57ca14 Merge tag 'ieee802154-for-davem-2021-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:

====================
An update from ieee802154 for your *net* tree.

This time we have fixes for the ieee802154 netlink code, as well as a driver
fix. Zhen Lei, Wei Yongjun and Yang Li each had  a patch to cleanup some return
code handling ensuring we actually get a real error code when things fails.

Dan Robertson fixed a potential null dereference in our netlink handling.

Andy Shevchenko removed of_match_ptr()usage in the mrf24j40 driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:21:58 -07:00
Coco Li 821bbf79fe ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
Reported by syzbot:
HEAD commit:    90c911ad Merge tag 'fixes' of git://git.kernel.org/pub/scm..
git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
dashboard link: https://syzkaller.appspot.com/bug?extid=123aa35098fd3c000eb7
compiler:       Debian clang version 11.0.1-2

==================================================================
BUG: KASAN: slab-out-of-bounds in fib6_nh_get_excptn_bucket net/ipv6/route.c:1604 [inline]
BUG: KASAN: slab-out-of-bounds in fib6_nh_flush_exceptions+0xbd/0x360 net/ipv6/route.c:1732
Read of size 8 at addr ffff8880145c78f8 by task syz-executor.4/17760

CPU: 0 PID: 17760 Comm: syz-executor.4 Not tainted 5.12.0-rc8-syzkaller #0
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x202/0x31e lib/dump_stack.c:120
 print_address_description+0x5f/0x3b0 mm/kasan/report.c:232
 __kasan_report mm/kasan/report.c:399 [inline]
 kasan_report+0x15c/0x200 mm/kasan/report.c:416
 fib6_nh_get_excptn_bucket net/ipv6/route.c:1604 [inline]
 fib6_nh_flush_exceptions+0xbd/0x360 net/ipv6/route.c:1732
 fib6_nh_release+0x9a/0x430 net/ipv6/route.c:3536
 fib6_info_destroy_rcu+0xcb/0x1c0 net/ipv6/ip6_fib.c:174
 rcu_do_batch kernel/rcu/tree.c:2559 [inline]
 rcu_core+0x8f6/0x1450 kernel/rcu/tree.c:2794
 __do_softirq+0x372/0x7a6 kernel/softirq.c:345
 invoke_softirq kernel/softirq.c:221 [inline]
 __irq_exit_rcu+0x22c/0x260 kernel/softirq.c:422
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:434
 sysvec_apic_timer_interrupt+0x91/0xb0 arch/x86/kernel/apic/apic.c:1100
 </IRQ>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0010:lock_acquire+0x1f6/0x720 kernel/locking/lockdep.c:5515
Code: f6 84 24 a1 00 00 00 02 0f 85 8d 02 00 00 f7 c3 00 02 00 00 49 bd 00 00 00 00 00 fc ff df 74 01 fb 48 c7 44 24 40 0e 36 e0 45 <4b> c7 44 3d 00 00 00 00 00 4b c7 44 3d 09 00 00 00 00 43 c7 44 3d
RSP: 0018:ffffc90009e06560 EFLAGS: 00000206
RAX: 1ffff920013c0cc0 RBX: 0000000000000246 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90009e066e0 R08: dffffc0000000000 R09: fffffbfff1f992b1
R10: fffffbfff1f992b1 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 1ffff920013c0cb4
 rcu_lock_acquire+0x2a/0x30 include/linux/rcupdate.h:267
 rcu_read_lock include/linux/rcupdate.h:656 [inline]
 ext4_get_group_info+0xea/0x340 fs/ext4/ext4.h:3231
 ext4_mb_prefetch+0x123/0x5d0 fs/ext4/mballoc.c:2212
 ext4_mb_regular_allocator+0x8a5/0x28f0 fs/ext4/mballoc.c:2379
 ext4_mb_new_blocks+0xc6e/0x24f0 fs/ext4/mballoc.c:4982
 ext4_ext_map_blocks+0x2be3/0x7210 fs/ext4/extents.c:4238
 ext4_map_blocks+0xab3/0x1cb0 fs/ext4/inode.c:638
 ext4_getblk+0x187/0x6c0 fs/ext4/inode.c:848
 ext4_bread+0x2a/0x1c0 fs/ext4/inode.c:900
 ext4_append+0x1a4/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir+0x337/0xa10 fs/ext4/namei.c:2768
 ext4_mkdir+0x4b8/0xc00 fs/ext4/namei.c:2814
 vfs_mkdir+0x45b/0x640 fs/namei.c:3819
 ovl_do_mkdir fs/overlayfs/overlayfs.h:161 [inline]
 ovl_mkdir_real+0x53/0x1a0 fs/overlayfs/dir.c:146
 ovl_create_real+0x280/0x490 fs/overlayfs/dir.c:193
 ovl_workdir_create+0x425/0x600 fs/overlayfs/super.c:788
 ovl_make_workdir+0xed/0x1140 fs/overlayfs/super.c:1355
 ovl_get_workdir fs/overlayfs/super.c:1492 [inline]
 ovl_fill_super+0x39ee/0x5370 fs/overlayfs/super.c:2035
 mount_nodev+0x52/0xe0 fs/super.c:1413
 legacy_get_tree+0xea/0x180 fs/fs_context.c:592
 vfs_get_tree+0x86/0x270 fs/super.c:1497
 do_new_mount fs/namespace.c:2903 [inline]
 path_mount+0x196f/0x2be0 fs/namespace.c:3233
 do_mount fs/namespace.c:3246 [inline]
 __do_sys_mount fs/namespace.c:3454 [inline]
 __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3431
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x4665f9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f68f2b87188 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 000000000056bf60 RCX: 00000000004665f9
RDX: 00000000200000c0 RSI: 0000000020000000 RDI: 000000000040000a
RBP: 00000000004bfbb9 R08: 0000000020000100 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056bf60
R13: 00007ffe19002dff R14: 00007f68f2b87300 R15: 0000000000022000

Allocated by task 17768:
 kasan_save_stack mm/kasan/common.c:38 [inline]
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:427 [inline]
 ____kasan_kmalloc+0xc2/0xf0 mm/kasan/common.c:506
 kasan_kmalloc include/linux/kasan.h:233 [inline]
 __kmalloc+0xb4/0x380 mm/slub.c:4055
 kmalloc include/linux/slab.h:559 [inline]
 kzalloc include/linux/slab.h:684 [inline]
 fib6_info_alloc+0x2c/0xd0 net/ipv6/ip6_fib.c:154
 ip6_route_info_create+0x55d/0x1a10 net/ipv6/route.c:3638
 ip6_route_add+0x22/0x120 net/ipv6/route.c:3728
 inet6_rtm_newroute+0x2cd/0x2260 net/ipv6/route.c:5352
 rtnetlink_rcv_msg+0xb34/0xe70 net/core/rtnetlink.c:5553
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2502
 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1338
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1927
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2350
 ___sys_sendmsg net/socket.c:2404 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2433
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Last potentially related work creation:
 kasan_save_stack+0x27/0x50 mm/kasan/common.c:38
 kasan_record_aux_stack+0xee/0x120 mm/kasan/generic.c:345
 __call_rcu kernel/rcu/tree.c:3039 [inline]
 call_rcu+0x1b1/0xa30 kernel/rcu/tree.c:3114
 fib6_info_release include/net/ip6_fib.h:337 [inline]
 ip6_route_info_create+0x10c4/0x1a10 net/ipv6/route.c:3718
 ip6_route_add+0x22/0x120 net/ipv6/route.c:3728
 inet6_rtm_newroute+0x2cd/0x2260 net/ipv6/route.c:5352
 rtnetlink_rcv_msg+0xb34/0xe70 net/core/rtnetlink.c:5553
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2502
 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1338
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1927
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2350
 ___sys_sendmsg net/socket.c:2404 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2433
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Second to last potentially related work creation:
 kasan_save_stack+0x27/0x50 mm/kasan/common.c:38
 kasan_record_aux_stack+0xee/0x120 mm/kasan/generic.c:345
 insert_work+0x54/0x400 kernel/workqueue.c:1331
 __queue_work+0x981/0xcc0 kernel/workqueue.c:1497
 queue_work_on+0x111/0x200 kernel/workqueue.c:1524
 queue_work include/linux/workqueue.h:507 [inline]
 call_usermodehelper_exec+0x283/0x470 kernel/umh.c:433
 kobject_uevent_env+0x1349/0x1730 lib/kobject_uevent.c:617
 kvm_uevent_notify_change+0x309/0x3b0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4809
 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:877 [inline]
 kvm_put_kvm+0x9c/0xd10 arch/x86/kvm/../../../virt/kvm/kvm_main.c:920
 kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3120
 __fput+0x352/0x7b0 fs/file_table.c:280
 task_work_run+0x146/0x1c0 kernel/task_work.c:140
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
 exit_to_user_mode_prepare+0x10b/0x1e0 kernel/entry/common.c:208
 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
 syscall_exit_to_user_mode+0x26/0x70 kernel/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff8880145c7800
 which belongs to the cache kmalloc-192 of size 192
The buggy address is located 56 bytes to the right of
 192-byte region [ffff8880145c7800, ffff8880145c78c0)
The buggy address belongs to the page:
page:ffffea00005171c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x145c7
flags: 0xfff00000000200(slab)
raw: 00fff00000000200 ffffea00006474c0 0000000200000002 ffff888010c41a00
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880145c7780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff8880145c7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8880145c7880: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
                                                                ^
 ffff8880145c7900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880145c7980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================

In the ip6_route_info_create function, in the case that the nh pointer
is not NULL, the fib6_nh in fib6_info has not been allocated.
Therefore, when trying to free fib6_info in this error case using
fib6_info_release, the function will call fib6_info_destroy_rcu,
which it will access fib6_nh_release(f6i->fib6_nh);
However, f6i->fib6_nh doesn't have any refcount yet given the lack of allocation
causing the reported memory issue above.
Therefore, releasing the empty pointer directly instead would be the solution.

Fixes: f88d8ea67f ("ipv6: Plumb support for nexthop object in a fib6_info")
Fixes: 706ec91916 ("ipv6: Fix nexthop refcnt leak when creating ipv6 route info")
Signed-off-by: Coco Li <lixiaoyan@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:19:49 -07:00
David S. Miller 5e7a2c6494 wireless-drivers fixes for v5.13
We have only mt76 fixes this time, most important being the fix for
 A-MSDU injection attacks.
 
 mt76
 
 * mitigate A-MSDU injection attacks (CVE-2020-24588)
 
 * fix possible array out of bound access in mt7921_mcu_tx_rate_report
 
 * various aggregation and HE setting fixes
 
 * suspend/resume fix for pci devices
 
 * mt7615: fix crash when runtime-pm is not supported
 -----BEGIN PGP SIGNATURE-----
 
 iQFJBAABCgAzFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmC4mW8VHGt2YWxvQGNv
 ZGVhdXJvcmEub3JnAAoJEG4XJFUm622bdu0IAKfYKc00/3VhdWXqWiagMfxIyBAQ
 vGolP4xaBEWmZof3TeOcjMPmgLLLYV1quH5dr6T95VPwrZLw8gn5u79lbboF6NHA
 f8EfKwTmkRRH1kTPSk38kMMHwNlmAXBDbgLx+MYQdzrs33H4lvHT/IYpMO7TOVrO
 kvWpD+Zy7Qgg4O9+jz2E6ut9ghlXkoKut7WVQz+fIPhkWXeKpteDk/y/l6ReA401
 /VYY6OAk24TXQYwVtOVC4VjxpuBi/8I6r/cXTXBDjO/3jQjvJMHdZWij2uwxBGNa
 G1GvvSSd8CGo6WiiavDzgLN5paR0RgXMIeHJkWvhiJT0YlyQvc9srRbpkGc=
 =htnX
 -----END PGP SIGNATURE-----

Merge tag 'wireless-drivers-2021-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for v5.13

We have only mt76 fixes this time, most important being the fix for
A-MSDU injection attacks.

mt76

* mitigate A-MSDU injection attacks (CVE-2020-24588)

* fix possible array out of bound access in mt7921_mcu_tx_rate_report

* various aggregation and HE setting fixes

* suspend/resume fix for pci devices

* mt7615: fix crash when runtime-pm is not supported
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:17:33 -07:00
Zheng Yongjun 59607863c5 fib: Return the correct errno code
When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:13:56 -07:00
Zheng Yongjun 49251cd002 net: Return the correct errno code
When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:13:56 -07:00
Zheng Yongjun d773695866 net/x25: Return the correct errno code
When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:13:56 -07:00
Rahul Lakkireddy a27fb314cb cxgb4: fix regression with HASH tc prio value update
commit db43b30cd8 ("cxgb4: add ethtool n-tuple filter deletion")
has moved searching for next highest priority HASH filter rule to
cxgb4_flow_rule_destroy(), which searches the rhashtable before the
the rule is removed from it and hence always finds at least 1 entry.
Fix by removing the rule from rhashtable first before calling
cxgb4_flow_rule_destroy() and hence avoid fetching stale info.

Fixes: db43b30cd8 ("cxgb4: add ethtool n-tuple filter deletion")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:12:42 -07:00
David S. Miller e5118f5723 Merge branch 'ipa-inline-csum'
Alex Elder says:

====================
net: ipa: support inline checksum offload

Inline offload--required for checksum offload support on IPA version
4.5 and above--is now supported by the RMNet driver:
  https://lore.kernel.org/netdev/162259440606.2786.10278242816453240434.git-patchwork-notify@kernel.org/

Add support for it in the IPA driver, and revert the commit that
disabled it pending acceptance of the RMNet code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:09:40 -07:00
Alex Elder d15ec19333 Revert "net: ipa: disable checksum offload for IPA v4.5+"
This reverts commit c88c34fcf8.

The RMNet driver now supports inline checksum offload.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:09:40 -07:00
Alex Elder 5567d4d9e7 net: ipa: add support for inline checksum offload
Starting with IPA v4.5, IP payload checksum offload is implemented
differently.

Prior to v4.5, the IPA hardware appends an rmnet_map_dl_csum_trailer
structure to each packet if checksum offload is enabled in the
download direction (modem->AP).  In the upload direction (AP->modem)
a rmnet_map_ul_csum_header structure is prepended before each sent
packet.

Starting with IPA v4.5, checksum offload is implemented using a
single new rmnet_map_v5_csum_header structure which sits between
the QMAP header and the packet data.  The same header structure
is used in both directions.

The new header contains a header type (CSUM_OFFLOAD); a checksum
flag; and a flag indicating whether any other headers follow this
one.  The checksum flag indicates whether the hardware should
compute (and insert) the checksum on a sent packet.  On a received
packet the checksum flag indicates whether the hardware confirms the
checksum value in the payload is correct.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:09:40 -07:00
David S. Miller e03101824d Merge branch 'caif-fixes'
Pavel Skripkin says:

====================
This patch series fix 2 memory leaks in caif
interface.

Syzbot reported memory leak in cfserl_create().
The problem was in cfcnfg_add_phy_layer() function.
This function accepts struct cflayer *link_support and
assign it to corresponting structures, but it can fail
in some cases.

These cases must be handled to prevent leaking allocated
struct cflayer *link_support pointer, because if error accured
before assigning link_support pointer to somewhere, this pointer
must be freed.

Fail log:

[   49.051872][ T7010] caif:cfcnfg_add_phy_layer(): Too many CAIF Link Layers (max 6)
[   49.110236][ T7042] caif:cfcnfg_add_phy_layer(): Too many CAIF Link Layers (max 6)
[   49.134936][ T7045] caif:cfcnfg_add_phy_layer(): Too many CAIF Link Layers (max 6)
[   49.163083][ T7043] caif:cfcnfg_add_phy_layer(): Too many CAIF Link Layers (max 6)
[   55.248950][ T6994] kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

int cfcnfg_add_phy_layer(..., struct cflayer *link_support, ...)
{
...
	/* CAIF protocol allow maximum 6 link-layers */
	for (i = 0; i < 7; i++) {
		phyid = (dev->ifindex + i) & 0x7;
		if (phyid == 0)
			continue;
		if (cfcnfg_get_phyinfo_rcu(cnfg, phyid) == NULL)
			goto got_phyid;
	}
	pr_warn("Too many CAIF Link Layers (max 6)\n");
	goto out;
...
	if (link_support != NULL) {
		link_support->id = phyid;
		layer_set_dn(frml, link_support);
		layer_set_up(link_support, frml);
		layer_set_dn(link_support, phy_layer);
		layer_set_up(phy_layer, link_support);
	}
...
}

As you can see, if cfcnfg_add_phy_layer fails before layer_set_*,
link_support becomes leaked.

So, in this series, I made cfcnfg_add_phy_layer()
return an int and added error handling code to prevent
leaking link_support pointer in caif_device_notify()
and cfusbl_device_notify() functions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:05:07 -07:00
Pavel Skripkin 7f5d86669f net: caif: fix memory leak in cfusbl_device_notify
In case of caif_enroll_dev() fail, allocated
link_support won't be assigned to the corresponding
structure. So simply free allocated pointer in case
of error.

Fixes: 7ad65bf68d ("caif: Add support for CAIF over CDC NCM USB interface")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:05:07 -07:00
Pavel Skripkin b53558a950 net: caif: fix memory leak in caif_device_notify
In case of caif_enroll_dev() fail, allocated
link_support won't be assigned to the corresponding
structure. So simply free allocated pointer in case
of error

Fixes: 7c18d2205e ("caif: Restructure how link caif link layer enroll")
Cc: stable@vger.kernel.org
Reported-and-tested-by: syzbot+7ec324747ce876a29db6@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:05:07 -07:00
Pavel Skripkin a2805dca51 net: caif: add proper error handling
caif_enroll_dev() can fail in some cases. Ingnoring
these cases can lead to memory leak due to not assigning
link_support pointer to anywhere.

Fixes: 7c18d2205e ("caif: Restructure how link caif link layer enroll")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:05:06 -07:00
Pavel Skripkin bce130e7f3 net: caif: added cfserl_release function
Added cfserl_release() function.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:05:06 -07:00
David S. Miller 4189777ca8 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
This series contains updates to igb, igc, ixgbe, ixgbevf, i40e and ice
drivers.

Kurt Kanzenbach fixes XDP for igb when PTP is enabled by pulling the
timestamp and adjusting appropriate values prior to XDP operations.

Magnus adds missing exception tracing for XDP on igb, igc, ixgbe,
ixgbevf, i40e and ice drivers.

Maciej adds tracking of AF_XDP zero copy enabled queues to resolve an
issue with copy mode Tx for the ice driver.

Note: Patch 7 will conflict when merged with net-next. Please carry
these changes forward. IGC_XDP_TX and IGC_XDP_REDIRECT will need to be
changed to return to conform with the net-next changes. Let me know if
you have issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:02:55 -07:00
David S. Miller fcd1a53064 mlx5-updates-2021-06-03
This series contains misc updates for mlx5 driver
 
 1) Alaa disables advanced features when kdump mode to save on memory
 2) Jakub counts all link flap events
 3) Meir adds support for IPoIB NDR speed
 4) Various misc cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmC5Ny0ACgkQSD+KveBX
 +j4ZhQgAs71PeGPSvxdwIylXje3ZcQq5dubLdiVNOKiuRd9JOfc0hlvfXU6qDHOM
 t0zOYM/vR2S43zEv+lx6xT0gYivoR8Yqng18T8ImAoO1I43gQDvtHgdVrcyFPRmy
 vAm/vxQl8L9Skd7PELmZdKlgzYdgfF3+craqGgkBz3D1zsZ3cAxh5O+b7LCnD8Pt
 D/44chJTDLMoPE/36zY7NyzByvxrXiCC6sGq5RIxNWkvy73c4JXTSrPN4te8QzpB
 yTYn56UDSPJ8ENLP8TBJ7HhmyOgrCoun1X9LHTqAVE3cGUbdcWjgBHTgei22k691
 3iep8YpiN28bj8AtklzwwVVCy+VIPQ==
 =FbSJ
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
This series provides misc updates for mlx5 drivers.
For more information please see tag log below.

Please pull and let me know if there is any problem.

mlx5-updates-2021-06-03

This series contains misc updates for mlx5 driver

1) Alaa disables advanced features when kdump mode to save on memory
2) Jakub counts all link flap events
3) Meir adds support for IPoIB NDR speed
4) Various misc cleanup
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 15:00:30 -07:00
Íñigo Huguet 6a8dd8b2fa net:cxgb3: fix code style issues
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:57:37 -07:00
Íñigo Huguet 5e0b892892 net:cxgb3: replace tasklets with works
OFLD and CTRL TX queues can be stopped if there is no room in
their DMA rings. If this happens, they're tried to be restarted
later after having made some room in the corresponding ring.

The tasks of restarting these queues were triggered using
tasklets, but they can be replaced for workqueue works, getting
them out of softirq context.

This queues stop/restart probably doesn't happen often and they
can be quite lengthy because they try to send all pending skbs.
Moreover, given that probably the ring is not empty yet, so the
DMA still has work to do, we don't need to be so fast to justify
using tasklets/softirq instead of running in a thread.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:57:37 -07:00
Yuchung Cheng a29cb69146 net: tcp better handling of reordering then loss cases
This patch aims to improve the situation when reordering and loss are
ocurring in the same flight of packets.

Previously the reordering would first induce a spurious recovery, then
the subsequent ACK may undo the cwnd (based on the timestamps e.g.).
However the current loss recovery does not proceed to invoke
RACK to install a reordering timer. If some packets are also lost, this
may lead to a long RTO-based recovery. An example is
https://groups.google.com/g/bbr-dev/c/OFHADvJbTEI

The solution is to after reverting the recovery, always invoke RACK
to either mount the RACK timer to fast retransmit after the reordering
window, or restarts the recovery if new loss is identified. Hence
it is possible the sender may go from Recovery to Disorder/Open to
Recovery again in one ACK.

Reported-by: mingkun bian <bianmingkun@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:20:44 -07:00
David S. Miller 86b84066dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-06-02

The following pull-request contains BPF updates for your *net* tree.

We've added 2 non-merge commits during the last 7 day(s) which contain
a total of 4 files changed, 19 insertions(+), 24 deletions(-).

The main changes are:

1) Fix pahole BTF generation when ccache is used, from Javier Martinez Canillas.

2) Fix BPF lockdown hooks in bpf_probe_read_kernel{,_str}() helpers which caused
   a deadlock from bcc programs, triggered OOM killer from audit side and didn't
   work generally with SELinux policy rules due to pointing to wrong task struct,
   from Daniel Borkmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:17:42 -07:00
Kees Cook 43902070fb net: bonding: Use strscpy_pad() instead of manually-truncated strncpy()
Silence this warning by using strscpy_pad() directly:

drivers/net/bonding/bond_main.c:4877:3: warning: 'strncpy' specified bound 16 equals destination size [-Wstringop-truncation]
    4877 |   strncpy(params->primary, primary, IFNAMSIZ);
         |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additionally replace other strncpy() uses, as it is considered deprecated:
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202102150705.fdR6obB0-lkp@intel.com
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:16:14 -07:00
Kees Cook 9c153d3889 net: vlan: Avoid using strncpy()
Use strscpy_pad() instead of strncpy() which is considered deprecated:
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:15:10 -07:00
Pavel Skripkin c47cc30499 net: kcm: fix memory leak in kcm_sendmsg
Syzbot reported memory leak in kcm_sendmsg()[1].
The problem was in non-freed frag_list in case of error.

In the while loop:

	if (head == skb)
		skb_shinfo(head)->frag_list = tskb;
	else
		skb->next = tskb;

frag_list filled with skbs, but nothing was freeing them.

backtrace:
  [<0000000094c02615>] __alloc_skb+0x5e/0x250 net/core/skbuff.c:198
  [<00000000e5386cbd>] alloc_skb include/linux/skbuff.h:1083 [inline]
  [<00000000e5386cbd>] kcm_sendmsg+0x3b6/0xa50 net/kcm/kcmsock.c:967 [1]
  [<00000000f1613a8a>] sock_sendmsg_nosec net/socket.c:652 [inline]
  [<00000000f1613a8a>] sock_sendmsg+0x4c/0x60 net/socket.c:672

Reported-and-tested-by: syzbot+b039f5699bd82e1fb011@syzkaller.appspotmail.com
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:13:26 -07:00
David S. Miller 5ff5622ea1 Merge branch 'NVMeTCP-Offload-ULP'
Shai Malin says:

====================
NVMeTCP Offload ULP

With the goal of enabling a generic infrastructure that allows NVMe/TCP
offload devices like NICs to seamlessly plug into the NVMe-oF stack, this
patch series introduces the nvme-tcp-offload ULP host layer, which will
be a new transport type called "tcp-offload" and will serve as an
abstraction layer to work with vendor specific nvme-tcp offload drivers.

NVMeTCP offload is a full offload of the NVMeTCP protocol, this includes
both the TCP level and the NVMeTCP level.

The nvme-tcp-offload transport can co-exist with the existing tcp and
other transports. The tcp offload was designed so that stack changes are
kept to a bare minimum: only registering new transports.
All other APIs, ops etc. are identical to the regular tcp transport.
Representing the TCP offload as a new transport allows clear and manageable
differentiation between the connections which should use the offload path
and those that are not offloaded (even on the same device).

The nvme-tcp-offload layers and API compared to nvme-tcp and nvme-rdma:

* NVMe layer: *

       [ nvme/nvme-fabrics/blk-mq ]
             |
        (nvme API and blk-mq API)
             |
             |
* Vendor agnostic transport layer: *

      [ nvme-rdma ] [ nvme-tcp ] [ nvme-tcp-offload ]
             |        |             |
           (Verbs)
             |        |             |
             |     (Socket)
             |        |             |
             |        |        (nvme-tcp-offload API)
             |        |             |
             |        |             |
* Vendor Specific Driver: *

             |        |             |
           [ qedr ]
                      |             |
                   [ qede ]
                                    |
                                  [ qedn ]

Performance:
============
With this implementation on top of the Marvell qedn driver (using the
Marvell FastLinQ NIC), we were able to demonstrate the following CPU
utilization improvement:

On AMD EPYC 7402, 2.80GHz, 28 cores:
- For 16K queued read IOs, 16jobs, 4qd (50Gbps line rate):
  Improved the CPU utilization from 15.1% with NVMeTCP SW to 4.7% with
  NVMeTCP offload.

On Intel(R) Xeon(R) Gold 5122 CPU, 3.60GHz, 16 cores:
- For 512K queued read IOs, 16jobs, 4qd (25Gbps line rate):
  Improved the CPU utilization from 16.3% with NVMeTCP SW to 1.1% with
  NVMeTCP offload.

In addition, we were able to demonstrate the following latency improvement:
- For 200K read IOPS (16 jobs, 16 qd, with fio rate limiter):
  Improved the average latency from 105 usec with NVMeTCP SW to 39 usec
  with NVMeTCP offload.

  Improved the 99.99 tail latency from 570 usec with NVMeTCP SW to 91 usec
  with NVMeTCP offload.

The end-to-end offload latency was measured from fio while running against
back end of null device.

Upstream plan:
==============
The RFC series "NVMeTCP Offload ULP and QEDN Device Driver"
https://lore.kernel.org/netdev/20210531225222.16992-1-smalin@marvell.com/
was designed in a modular way so that part 1 (nvme-tcp-offload) and
part 2 (qed) are independent and part 3 (qedn) depends on both parts 1+2.

- Part 1 (RFC patch 1-8): NVMeTCP Offload ULP
  The nvme-tcp-offload patches, will be sent to
  'linux-nvme@lists.infradead.org'.

- Part 2 (RFC patches 9-15): QED NVMeTCP Offload
  The qed infrastructure, will be sent to 'netdev@vger.kernel.org'.

Once part 1 and 2 are accepted:

- Part 3 (RFC patches 16-27): QEDN NVMeTCP Offload
  The qedn patches, will be sent to 'linux-nvme@lists.infradead.org'.

Marvell is fully committed to maintain, test, and address issues with
the new nvme-tcp-offload layer.

Usage:
======
With the Marvell NVMeTCP offload design, the network-device (qede) and the
offload-device (qedn) are paired on each port - Logically similar to the
RDMA model.
The user will interact with the network-device in order to configure
the ip/vlan. The NVMeTCP configuration is populated as part of the
nvme connect command.

Example:
Assign IP to the net-device (from any existing Linux tool):

    ip addr add 100.100.0.101/24 dev p1p1

This IP will be used by both net-device (qede) and offload-device (qedn).

In order to connect from "sw" nvme-tcp through the net-device (qede):

    nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn

In order to connect from "offload" nvme-tcp through the offload-device (qedn):

    nvme connect -t tcp_offload -s 4420 -a 100.100.0.100 -n testnqn

An alternative approach, and as a future enhancement that will not impact this
series will be to modify nvme-cli with a new flag that will determine
if "-t tcp" should be the regular nvme-tcp (which will be the default)
or nvme-tcp-offload.
Exmaple:
    nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn -[new flag]

Queue Initialization Design:
============================
The nvme-tcp-offload ULP module shall register with the existing
nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following ops:
- claim_dev() - in order to resolve the route to the target according to
                the paired net_dev.
- create_queue() - in order to create offloaded nvme-tcp queue.

The nvme-tcp-offload ULP module shall manage all the controller level
functionalities, call claim_dev and based on the return values shall call
the relevant module create_queue in order to create the admin queue and
the IO queues.

IO-path Design:
===============
The nvme-tcp-offload shall work at the IO-level - the nvme-tcp-offload
ULP module shall pass the request (the IO) to the nvme-tcp-offload vendor
driver and later, the nvme-tcp-offload vendor driver returns the request
completion (the IO completion).
No additional handling is needed in between; this design will reduce the
CPU utilization as we will describe below.

The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
- send_req() - in order to pass the request to the handling of the
               offload driver that shall pass it to the vendor specific device.
- poll_queue()

Once the IO completes, the nvme-tcp-offload vendor driver shall call
command.done() that will invoke the nvme-tcp-offload ULP layer to
complete the request.

TCP events:
===========
The Marvell FastLinQ NIC HW engine handle all the TCP re-transmissions
and OOO events.

Teardown and errors:
====================
In case of NVMeTCP queue error the nvme-tcp-offload vendor driver shall
call the nvme_tcp_ofld_report_queue_err.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following teardown ops:
- drain_queue()
- destroy_queue()

The Marvell FastLinQ NIC HW engine:
====================================
The Marvell NIC HW engine is capable of offloading the entire TCP/IP
stack and managing up to 64K connections per PF, already implemented and
upstream use cases for this include iWARP (by the Marvell qedr driver)
and iSCSI (by the Marvell qedi driver).
In addition, the Marvell NIC HW engine offloads the NVMeTCP queue layer
and is able to manage the IO level also in case of TCP re-transmissions
and OOO events.
The HW engine enables direct data placement (including the data digest CRC
calculation and validation) and direct data transmission (including data
digest CRC calculation).

The Marvell qedn driver:
========================
The new driver will be added under "drivers/nvme/hw" and will be enabled
by the Kconfig "Marvell NVM Express over Fabrics TCP offload".
As part of the qedn init, the driver will register as a pci device driver
and will work with the Marvell fastlinQ NIC.
As part of the probe, the driver will register to the nvme_tcp_offload
(ULP) and to the qed module (qed_nvmetcp_ops) - similar to other
"qed_*_ops" which are used by the qede, qedr, qedf and qedi device
drivers.

nvme-tcp-offload Future work:
=============================
- NVMF_OPT_HOST_IFACE Support.

Changes since RFC v1:
=====================
- nvme-tcp-offload: Fix nvme_tcp_ofld_ops return values.
- nvme-tcp-offload: Remove NVMF_TRTYPE_TCP_OFFLOAD.
- nvme-tcp-offload: Add nvme_tcp_ofld_poll() implementation.
- nvme-tcp-offload: Fix nvme_tcp_ofld_queue_rq() to check map_sg() and
  send_req() return values.

Changes since RFC v2:
=====================
- nvme-tcp-offload: Fixes in controller and queue level (patches 3-6).
- qedn: Add the Marvell's NVMeTCP HW offload vendor driver init and probe
  (patches 8-11).

Changes since RFC v3:
=====================
- nvme-tcp-offload: Add the full implementation of the nvme-tcp-offload layer
  including the new ops: setup_ctrl(), release_ctrl(), commit_rqs() and new
  flows (ASYNC and timeout).
- nvme-tcp-offload: Add device maximums: max_hw_sectors, max_segments.
- nvme-tcp-offload: layer design and optimization changes.

Changes since RFC v4:
=====================
(Many thanks to Hannes Reinecke for his feedback)
- nvme_tcp_offload: Add num_hw_vectors in order to limit the number of queues.
- nvme_tcp_offload: Add per device private_data.
- nvme_tcp_offload: Fix header digest, data digest and tos initialization.

Changes since RFC v5:
=====================
(Many thanks to Sagi Grimberg for his feedback)
- nvme-fabrics: Expose nvmf_check_required_opts() globally (as a new patch).
- nvme_tcp_offload: Remove io-queues BLK_MQ_F_BLOCKING.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_stop_queue (drain_queue) flow.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_free_queue (destroy_queue) flow.
- nvme_tcp_offload: Change rwsem to mutex.
- nvme_tcp_offload: remove redundant fields.
- nvme_tcp_offload: Remove the "new" from setup_ctrl().
- nvme_tcp_offload: Remove the init_req() and commit_rqs() ops.
- nvme_tcp_offload: Minor fixes in nvme_tcp_ofld_create_ctrl() ansd
  nvme_tcp_ofld_free_queue().
- nvme_tcp_offload: Patch 8 (timeout and async) was squeashed into
  patch 7 (io level).

Changes since RFC v6:
=====================
- No changes in nvme_tcp_offload (only in qedn).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:22 -07:00
Dean Balandin 35155e2626 nvme-tcp-offload: Add IO level implementation
In this patch, we present the IO level functionality.
The nvme-tcp-offload shall work on the IO-level, meaning the
nvme-tcp-offload ULP module shall pass the request to the nvme-tcp-offload
vendor driver and shall expect for the request completion.
No additional handling is needed in between, this design will reduce the
CPU utilization as we will describe below.

The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
 - send_req - in order to pass the request to the handling of the offload
   driver that shall pass it to the vendor specific device
 - poll_queue

The vendor driver will manage the context from which the request will be
executed and the request aggregations.
Once the IO completed, the nvme-tcp-offload vendor driver shall call
command.done() that shall invoke the nvme-tcp-offload ULP layer for
completing the request.

This patch also add support for the nvme-tcp-offload timeout and
nvme-tcp-offload ASYNC flow.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Dean Balandin <dbalandin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Dean Balandin e4ba452ded nvme-tcp-offload: Add queue level implementation
In this patch we implement queue level functionality.
The implementation is similar to the nvme-tcp module, the main
difference being that we call the vendor specific create_queue op which
creates the TCP connection, and NVMeTPC connection including
icreq+icresp negotiation.
Once create_queue returns successfully, we can move on to the fabrics
connect.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Dean Balandin <dbalandin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Arie Gershberg 5faf6d6855 nvme-tcp-offload: Add controller level error recovery implementation
In this patch, we implement controller level error handling and recovery.
Upon an error discovered by the ULP or reset controller initiated by the
nvme-core (using reset_ctrl workqueue), the ULP will initiate a controller
recovery which includes teardown and re-connect of all queues.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Arie Gershberg <agershberg@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Arie Gershberg 5aadd5f931 nvme-tcp-offload: Add controller level implementation
In this patch we implement controller level functionality including:
- create_ctrl.
- delete_ctrl.
- free_ctrl.

The implementation is similar to other nvme fabrics modules, the main
difference being that the nvme-tcp-offload ULP calls the vendor specific
claim_dev() op with the given TCP/IP parameters to determine which device
will be used for this controller.
Once found, the vendor specific device and controller will be paired and
kept in a controller list managed by the ULP.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Arie Gershberg <agershberg@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Dean Balandin 4b8178ec57 nvme-tcp-offload: Add device scan implementation
As part of create_ctrl(), it scans the registered devices and calls
the claim_dev op on each of them, to find the first devices that matches
the connection params. Once the correct devices is found (claim_dev
returns true), we raise the refcnt of that device and return that device
as the device to be used for ctrl currently being created.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Dean Balandin <dbalandin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Prabhakar Kushwaha af527935bd nvme-fabrics: Expose nvmf_check_required_opts() globally
nvmf_check_required_opts() is used to check if user provided opts has
the required_opts or not. if not, it will log which options are not
provided.

It can be leveraged by nvme-tcp-offload to check if provided opts are
supported by this specific vendor driver or not.

So expose nvmf_check_required_opts() globally.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Prabhakar Kushwaha 98a5097d1e nvme-fabrics: Move NVMF_ALLOWED_OPTS and NVMF_REQUIRED_OPTS definitions
Move NVMF_ALLOWED_OPTS and NVMF_REQUIRED_OPTS definitions
to header file, so it can be used by the different HW devices.

NVMeTCP offload devices might have different limitations of the
allowed options, for example, a device that does not support all the
queue types. With tcp and rdma, only the nvme-tcp and nvme-rdma layers
handle those attributes and the HW devices do not create any limitations
for the allowed options.

An alternative design could be to add separate fields in
nvme_tcp_ofld_ops such as max_hw_sectors and max_segments that
we already have in this series.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Arie Gershberg <agershberg@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
Shai Malin f0e8cb6106 nvme-tcp-offload: Add nvme-tcp-offload - NVMeTCP HW offload ULP
This patch will present the structure for the NVMeTCP offload common
layer driver. This module is added under "drivers/nvme/host/" and future
offload drivers which will register to it will be placed under
"drivers/nvme/hw".
This new driver will be enabled by the Kconfig "NVM Express over Fabrics
TCP offload commmon layer".
In order to support the new transport type, for host mode, no change is
needed.

Each new vendor-specific offload driver will register to this ULP during
its probe function, by filling out the nvme_tcp_ofld_dev->ops and
nvme_tcp_ofld_dev->private_data and calling nvme_tcp_ofld_register_dev
with the initialized struct.

The internal implementation:
- tcp-offload.h:
  Includes all common structs and ops to be used and shared by offload
  drivers.

- tcp-offload.c:
  Includes the init function which registers as a NVMf transport just
  like any other transport.

Acked-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Dean Balandin <dbalandin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:11:21 -07:00
David S. Miller ae1d9cc312 Merge branch 'tipc-cleanups'
Jon Maloy says:

====================
tipc: some small cleanups

We make some minor code cleanups and improvements.

v2: Changed value of TIPC_ANY_SCOPE macro in patch #3
    to avoid compiler warning
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:06:39 -07:00
Jon Maloy 5ef213258d tipc: simplify handling of lookup scope during multicast message reception
We introduce a new macro TIPC_ANY_SCOPE to make the handling of the
lookup scope value more comprehensible during multicast reception.

The (unchanged) rules go as follows:

1) Multicast messages sent from own node are delivered to all matching
   sockets on the own node, irrespective of their binding scope.

2) Multicast messages sent from other nodes arrive here because they
   have found TIPC_CLUSTER_SCOPE bindings emanating from this node.
   Those messages should be delivered to exactly those sockets, but not
   to local sockets bound with TIPC_NODE_SCOPE, since the latter
   obviously were not meant to be visible for those senders.

3) Group multicast/broadcast messages are delivered to the sockets with
   a binding scope matching exactly the lookup scope indicated in the
   message header, and nobody else.

Reviewed-by: Xin Long <lucien.xin@gmail.com>
Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:06:39 -07:00
Jon Maloy 62633c2f17 tipc: refactor function tipc_sk_anc_data_recv()
We refactor tipc_sk_anc_data_recv() to make it slightly more
comprehensible, but also to facilitate application of some additions
to the code in a future commit.

Reviewed-by: Xin Long <lucien.xin@gmail.com>
Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03 14:06:39 -07:00