OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jakub Kicinski	adc2e56ebe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-06-18 19:47:02 -07:00
Linus Torvalds	9ed13a17e3	Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of 0.0.0.0 - cdc_ncm: switch to eth%d interface naming" * tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits) net: ethernet: fix potential use-after-free in ec_bhf_remove selftests/net: Add icmp.sh for testing ICMP dummy address responses icmp: don't send out ICMP messages with a source address of 0.0.0.0 net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY net: ll_temac: Fix TX BD buffer overwrite net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Make sure to free skb when it is completely used MAINTAINERS: add Guvenc as SMC maintainer bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path bnxt_en: Fix TQM fastpath ring backing store computation bnxt_en: Rediscover PHY capabilities after firmware reset cxgb4: fix wrong shift. mac80211: handle various extensible elements correctly mac80211: reset profile_periodicity/ema_ap cfg80211: avoid double free of PMSR request cfg80211: make certificate generation more robust mac80211: minstrel_ht: fix sample time check net: qed: Fix memcpy() overflow of qed_dcbx_params() net: cdc_eem: fix tx fixup skb leak net: hamradio: fix memory leak in mkiss_close ...	2021-06-18 18:55:29 -07:00
Aya Levin	0232fc2ddc	net/mlx5: Reset mkey index on creation Reset only the index part of the mkey and keep the variant part. On devlink reload, driver recreates mkeys, so the mkey index may change. Trying to preserve the variant part of the mkey, driver mistakenly merged the mkey index with current value. In case of a devlink reload, current value of index part is dirty, so the index may be corrupted. Fixes: `54c62e13ad` ("{IB,net}/mlx5: Setup mkey variant before mr create command invocation") Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:49 -07:00
Dmytro Linkin	a5ae8fc905	net/mlx5e: Don't create devices during unload flow Running the devlink reload command for a port in switchdev mode corrupts resources: the driver can't release the allocated EQ and reclaim memory pages, because the "rdma" auxiliary device has added CQs, which block the EQ from deletion. The erroneous sequence happens during the reload-down phase, as follows: 1. detach device - suspends auxiliary devices which support it, destroys others. During this step "eth-rep" and "rdma-rep" are destroyed, "eth" - suspended. 2. disable SRIOV - moves device to legacy mode; as part of disablement - rescans drivers. This step adds "rdma" auxiliary device. 3. destroy EQ table - <failure>. Driver shouldn't create any device during unload flows. To handle that, implement the MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset it on device attach. If the flag is set, do a no-op on drivers rescan. Fixes: `a925b5e309` ("net/mlx5: Register mlx5 devices to auxiliary virtual bus") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:47 -07:00
Alex Vesker	65fb7d109a	net/mlx5: DR, Fix STEv1 incorrect L3 decapsulation padding Decapsulation L3 on small inner packets which are less than 64 Bytes was done incorrectly. In small packets there is an extra padding added in L2 which should not be included in L3 length. The issue was that after decapL3 the extra L2 padding caused an update on the L3 length. To avoid this issue the new header is pushed to the beginning of the packet (offset 0) which should not cause a HW reparse and update the L3 length. Fixes: `c349b4137c` ("net/mlx5: DR, Add STEv1 modify header logic") Reviewed-by: Erez Shitrit <erezsh@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:45 -07:00
Parav Pandit	c7d6c19b3b	net/mlx5: SF_DEV, remove SF device on invalid state When auxiliary bus autoprobe is disabled and SF is in ACTIVE state, on SF port deletion it transitions from ACTIVE->ALLOCATED->INVALID. When the VHCA event handler queries the state, it has already transitioned to INVALID state. In this scenario, the event handler failed to delete the SF device. Fix it by deleting the SF when SF state is INVALID. Fixes: `90d010b863` ("net/mlx5: SF, Add auxiliary device support") Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:42 -07:00
Parav Pandit	ca36fc4d77	net/mlx5: E-Switch, Allow setting GUID for host PF vport E-switch should be able to set the GUID of host PF vport. Currently it returns an error. This results in below error when user attempts to configure MAC address of the PF of an external controller. $ devlink port function set pci/0000:03:00.0/196608 \ hw_addr 00:00:00:11:22:33 mlx5_core 0000:03:00.0: mlx5_esw_set_vport_mac_locked:1876:(pid 6715):\ "Failed to set vport 0 node guid, err = -22. RDMA_CM will not function properly for this VF." Check for zero vport is no longer needed. Fixes: `330077d14d` ("net/mlx5: E-switch, Supporting setting devlink port function mac address") Signed-off-by: Yuval Avnery <yuvalav@nvidia.com> Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Bodong Wang <bodong@nvidia.com> Reviewed-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:40 -07:00
Parav Pandit	bbc8222dc4	net/mlx5: E-Switch, Read PF mac address External controller PF's MAC address is not read from the device during vport setup. Failing to read it results in showing all zeros to the user even though the factory-programmed MAC is a valid value. $ devlink port show eth1 -jp { "port": { "pci/0000:03:00.0/196608": { "type": "eth", "netdev": "eth1", "flavour": "pcipf", "controller": 1, "pfnum": 0, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Hence, read it when enabling a vport. After the fix, $ devlink port show eth1 -jp { "port": { "pci/0000:03:00.0/196608": { "type": "eth", "netdev": "eth1", "flavour": "pcipf", "controller": 1, "pfnum": 0, "splittable": false, "function": { "hw_addr": "98:03:9b:a0:60:11" } } } } Fixes: `f099fde16d` ("net/mlx5: E-switch, Support querying port function mac address") Signed-off-by: Bodong Wang <bodong@nvidia.com> Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:38 -07:00
Leon Romanovsky	2058cc9c80	net/mlx5: Check that driver was probed prior attaching the device The device can be requested to be attached despite being not probed. This situation is possible if devlink reload races with module removal, and the following kernel panic is an outcome of such race. mlx5_core 0000:00:09.0: firmware version: 4.7.9999 mlx5_core 0000:00:09.0: 0.000 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x255 link) BUG: unable to handle page fault for address: fffffffffffffff0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3218067 P4D 3218067 PUD 321a067 PMD 0 Oops: 0000 [#1] SMP KASAN NOPTI CPU: 7 PID: 250 Comm: devlink Not tainted 5.12.0-rc2+ #2836 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5_attach_device+0x80/0x280 [mlx5_core] Code: f8 48 c1 e8 03 42 80 3c 38 00 0f 85 80 01 00 00 48 8b 45 68 48 8d 78 f0 48 89 fe 48 c1 ee 03 42 80 3c 3e 00 0f 85 70 01 00 00 <48> 8b 40 f0 48 85 c0 74 0d 48 89 ef ff d0 85 c0 0f 85 84 05 0e 00 RSP: 0018:ffff8880129675f0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff827407f1 RDX: 1ffff110011336cf RSI: 1ffffffffffffffe RDI: fffffffffffffff0 RBP: ffff888008e0c000 R08: 0000000000000008 R09: ffffffffa0662ee7 R10: fffffbfff40cc5dc R11: 0000000000000000 R12: ffff88800ea002e0 R13: ffffed1001d459f7 R14: ffffffffa05ef4f8 R15: dffffc0000000000 FS: 00007f51dfeaf740(0000) GS:ffff88806d5c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffff0 CR3: 000000000bc82006 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mlx5_load_one+0x117/0x1d0 [mlx5_core] devlink_reload+0x2d5/0x520 ? devlink_remote_reload_actions_performed+0x30/0x30 ? mutex_trylock+0x24b/0x2d0 ? 
devlink_nl_cmd_reload+0x62b/0x1070 devlink_nl_cmd_reload+0x66d/0x1070 ? devlink_reload+0x520/0x520 ? devlink_nl_pre_doit+0x64/0x4d0 genl_family_rcv_msg_doit+0x1e9/0x2f0 ? mutex_lock_io_nested+0x1130/0x1130 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240 ? security_capable+0x51/0x90 genl_rcv_msg+0x27f/0x4a0 ? genl_get_cmd+0x3c0/0x3c0 ? lock_acquire+0x1a9/0x6d0 ? devlink_reload+0x520/0x520 ? lock_release+0x6c0/0x6c0 netlink_rcv_skb+0x11d/0x340 ? genl_get_cmd+0x3c0/0x3c0 ? netlink_ack+0x9f0/0x9f0 ? lock_release+0x1f9/0x6c0 genl_rcv+0x24/0x40 netlink_unicast+0x433/0x700 ? netlink_attachskb+0x730/0x730 ? _copy_from_iter_full+0x178/0x650 ? __alloc_skb+0x113/0x2b0 netlink_sendmsg+0x6f1/0xbd0 ? netlink_unicast+0x700/0x700 ? netlink_unicast+0x700/0x700 sock_sendmsg+0xb0/0xe0 __sys_sendto+0x193/0x240 ? __x64_sys_getpeername+0xb0/0xb0 ? copy_page_range+0x2300/0x2300 ? __up_read+0x1a1/0x7b0 ? do_user_addr_fault+0x219/0xdc0 __x64_sys_sendto+0xdd/0x1b0 ? syscall_enter_from_user_mode+0x1d/0x50 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f51dffb514a Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c RSP: 002b:00007ffcaef22e78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f51dffb514a RDX: 0000000000000030 RSI: 000055750daf2440 RDI: 0000000000000003 RBP: 000055750daf2410 R08: 00007f51e0081200 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: mlx5_core(-) ptp pps_core ib_ipoib rdma_ucm rdma_cm iw_cm ib_cm ib_umad ib_uverbs ib_core [last unloaded: mlx5_ib] CR2: fffffffffffffff0 ---[ end trace 7789831bfe74fa42 ]--- Fixes: `a925b5e309` ("net/mlx5: Register mlx5 devices to auxiliary virtual bus") Signed-off-by: Leon Romanovsky 
<leonro@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:35 -07:00
Leon Romanovsky	94a4b8414d	net/mlx5: Fix error path for set HCA defaults In the case of the failure to execute mlx5_core_set_hca_defaults(), we used wrong goto label to execute error unwind flow. Fixes: `5bef709d76` ("net/mlx5: Enable host PF HCA after eswitch is initialized") Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-16 15:36:32 -07:00
Colin Ian King	fb0a1dacf2	mlxsw: spectrum_router: remove redundant continue statement The continue statement at the end of a for-loop has no effect, remove it. Addresses-Coverity: ("Continue has no effect") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-16 12:46:21 -07:00
Shay Drory	c36326d38d	net/mlx5: Round-Robin EQs over IRQs Whenever users provide affinity for an EQ creation request, map the EQ to a matching IRQ. Matching IRQ=IRQ with the same affinity and type (completion/control) of the EQ created. This mapping is done in an aggressive dedicated IRQ allocation scheme, which is described below. First, we check whether there is a matching IRQ whose min threshold is not exhausted. - min_eqs_threshold = 3 for control EQ. - min_eqs_threshold = 1 for completion EQ. In case no matching IRQ was found, try to request a new IRQ. In case we can't request a new IRQ, reuse the least-used matching IRQ. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:58:00 -07:00
Shay Drory	c8ea212bfd	net/mlx5: Separate between public and private API of sf.h Move mlx5_sf_max_functions() and friends from the private sf/sf.h to the public lib/sf.h. This is done in order to have one direction include paths. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:58:00 -07:00
Shay Drory	71e084e264	net/mlx5: Allocating a pool of MSI-X vectors for SFs SFs (Sub Functions) currently use IRQs from the global IRQ table their parent Physical Function has. In order to better scale, we need to allocate more IRQs and share them between different SFs. Driver will maintain 3 separate irq pools: 1. A pool that serves the PF consumer (PF's netdev, rdma stacks), similar to what the driver had before this patch. i.e., this pool will share irqs between rdma and netdev, and will keep the irq indexes and allocation order. The last is important for PF netdev rmap (aRFS). 2. A pool of control IRQs for SFs. The size of this pool is the number of SFs that can be created divided by SFS_PER_IRQ. This pool will serve the control path EQs of the SFs. 3. A pool of completion data path IRQs for SFs transport queues. The size of this pool is: num_irqs_allocated - pf_pool_size - sf_ctrl_pool_size. This pool will serve netdev and rdma stacks. Moreover, rmap is not supported on SFs. Sharing methodology of the SFs pools is explained in the next patch. Important note: rmap is not supported on SFs because rmap mapping cannot function correctly for IRQs that are shared for different core/netdev RX rings. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:58:00 -07:00
Shay Drory	fc63dd2a85	net/mlx5: Change IRQ storage logic from static to dynamic Store newly created IRQs in the xarray DB instead of a static array, so we will be able to store only IRQs which are being used. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:59 -07:00
Shay Drory	2d74524c01	net/mlx5: Moving rmap logic to EQs IRQs are being simplified in order to ease their sharing and any feature specific object will be moved to upper layer. Hence we move rmap object into eq_table. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:59 -07:00
Shay Drory	e8abebb3a4	net/mlx5: Extend mlx5_irq_request to request IRQ from the kernel Extend mlx5_irq_request so that IRQs will be requested upon EQ creation, and not on driver boot. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:59 -07:00
Shay Drory	2de6153837	net/mlx5: Removing rmap per IRQ In next patches, IRQs will be requested according to demand, instead of statically on driver boot. Also, currently, rmap is managed by the IRQ layer. rmap management will move out from the IRQ layer in future patches. Therefore, we want to remove the IRQ from the rmap, when IRQ is destroyed, instead of removing all the IRQs from the rmap when irq_table is destroyed. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:58 -07:00
Leon Romanovsky	652e3581f2	net/mlx5: Clean license text in eq.[c|h] files The eq.[c|h] files are under major rewrite, so use this opportunity and update their copyright and license texts. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:58 -07:00
Leon Romanovsky	e4e3f24b82	net/mlx5: Provide cpumask at EQ creation phase The users of EQ are running their code on different CPUs and with various affinity patterns. Move the cpumask setting close to their actual usage. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:57 -07:00
Shay Drory	3b43190b2f	net/mlx5: Introduce API for request and release IRQs Introduce new API that will allow IRQs users to hold a pointer to mlx5_irq. In the end of this series, IRQs will be allocated on demand. Hence, this will allow us to properly manage and use IRQs. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:57 -07:00
Leon Romanovsky	c38421abcf	net/mlx5: Delay IRQ destruction till all users are gone Shared IRQs are consumed by multiple EQ users, and in order to properly initialize and later release such IRQs, we add kref counting of the IRQ structure. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:57 -07:00
Mark Bloch	8a66e45859	net/mlx5: Change ownership model for lag Lag is used to combine two PCI functions of the same HCA into a single logical unit. This is a core functionality and as such should be managed by the core driver. Currently this isn't the case. While we store the lag software structure inside the lower device, its lifetime (creation / destruction) is dictated by the mlx5e part. Change the ownership model so lag is tied to the lifetime of the lower level driver instead of to the mlx5e part. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:56 -07:00
Mark Bloch	8ed19471fd	net/mlx5: Lag, Don't rescan if the device is going down If MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV is set it means the device is going down and mlx5_rescan_drivers_locked() shouldn't be called. With this patch and the previous one in the series, unbinding a PCI function when its netdev is part of a bond works and leaves the system in a working state. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:56 -07:00
Mark Bloch	8c22ad36ee	net/mlx5: Lag, refactor disable flow When a net device is removed (can happen if the PCI function is unbound from the system) it's not enough to destroy the hardware lag. The system should recreate the original devices that were present before the lag. As the same flow is done when a net device is removed from the bond, refactor and reuse the code. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-14 20:57:56 -07:00
Linus Torvalds	29a877d576	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma fixes from Jason Gunthorpe: "A mixture of small bug fixes and a small security issue: - WARN_ON when IPoIB is automatically moved between namespaces - Long standing bug where mlx5 would use the wrong page for the doorbell recovery memory if fork is used - Security fix for mlx4 that disables the timestamp feature - Several crashers for mlx5 - Plug a recent mlx5 memory leak for the sig_mr" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/mlx5: Fix initializing CQ fragments buffer RDMA/mlx5: Delete right entry from MR signature database RDMA: Verify port when creating flow rule RDMA/mlx5: Block FDB rules when not in switchdev mode RDMA/mlx4: Do not map the core_clock page to user space unless enabled RDMA/mlx5: Use different doorbell memory for different processes RDMA/ipoib: Fix warning caused by destroying non-initial netns	2021-06-10 10:53:04 -07:00
Vlad Buslov	9724fd5d9c	net/mlx5: Bridge, add tracepoints Move private bridge structures to dedicated headers that is accessible to bridge tracepoint header. Implemented following tracepoints: - Initialize FDB entry. - Refresh FDB entry. - Cleanup FDB entry. - Create VLAN. - Cleanup VLAN. - Attach port to bridge. - Detach port from bridge. Usage example: ># cd /sys/kernel/debug/tracing ># echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event ># cat trace ... kworker/u20:1-96 [001] .... 231.892503: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 lastuse=4294895695 Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:12 -07:00
Vlad Buslov	cc2987c44b	net/mlx5: Bridge, filter tagged packets that didn't match tagged fg With support for pvid vlans in mlx5 bridge it is possible to have rules in untagged flow group when vlan filtering is enabled. However, such rules can also match tagged packets that didn't match anything in tagged flow group. Filter such packets by introducing an additional flow group between the tagged and untagged groups. When filtering is enabled on the bridge, create an additional flow in the vlan filtering flow group that matches tagged packets with the specified source MAC address and redirects them to the new "skip" table. The skip table is a new lowest-level empty table that is used to skip all further processing on the packet in bridge priority. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:12 -07:00
Vlad Buslov	36e55079e5	net/mlx5: Bridge, support pvid and untagged vlan configurations Implement support for pushing vlan header into untagged packet on ingress of port that has pvid configured and support for popping vlan on egress of port that has the matching vlan configured as untagged. To support such configurations packet reformat contexts of {INSERT|REMOVE}_HEADER types are created per such vlan and saved to struct mlx5_esw_bridge_vlan which allows all FDB entries on particular vlan to share single packet reformat instance. When initializing FDB entries with pvid or untagged vlan type set its mlx5_flow_act->pkt_reformat action accordingly. Flush all flows when removing vlan from port. This is necessary because even though software bridge removes all FDB entries before removing their vlan, mlx5 bridge implementation deletes their corresponding flow entries from hardware in asynchronous workqueue task, which will cause firmware error if vlan packet reformat context is deleted before all flows that point to it. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:11 -07:00
Vlad Buslov	ffc89ee5e5	net/mlx5: Bridge, match FDB entry vlan tag Add support for FDB vlan-tagged entries. Extend ingress and egress flow tables with flow groups to match packet vlan tag. Modify the flow creation code to include vlan tag, if vlan is configured on port and vlan configuration is supported for offload. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:11 -07:00
Vlad Buslov	d75b9e8048	net/mlx5: Bridge, implement infrastructure for vlans Establish all the necessary infrastructure for implementing vlan matching and vlan push/pop in following patches: - Add new per-vport struct mlx5_esw_bridge_port that is used to store metadata for all port vlans. Initialize and cleanup the instance of the structure when port representor is linked/unlinked to bridge. Use xarray to allow quick vport metadata lookup by vport number. - Add new per-port-vlan struct mlx5_esw_bridge_vlan that is used to store vlan-specific data (vid, flags). Handle SWITCHDEV_PORT_OBJ_{ADD|DEL} switchdev blocking event for SWITCHDEV_OBJ_ID_PORT_VLAN object by creating/deleting the vlan structure and saving it in per-vport xarray for quick lookup. - Implement support for SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING object attribute that is used to toggle vlan filtering. Remove all FDB entries from hardware when vlan filtering state is changed. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:10 -07:00
Vlad Buslov	c636a0f0f3	net/mlx5: Bridge, dynamic entry ageing Dynamic FDB entries require capability to age out unused entries. Such entries are either aged out by kernel software bridge implementation or by hardware switch that offloaded them (and notified the kernel to mark them as SWITCHDEV_FDB_ADD_TO_BRIDGE). Leaving ageing to kernel bridge would result in it deleting offloaded dynamic FDB entries every ageing_time period due to packets being processed by hardware and, consequently, 'used' timestamp for FDB entry not being updated. However, since hardware doesn't support ageing, software solution inside the driver is required. In order to emulate hardware ageing in driver, extend bridge FDB ingress flows with counter and create delayed br_offloads->update_work task on bridge offloads workqueue. Run the task every second, update 'used' timestamp in software bridge dynamic entry by sending SWITCHDEV_FDB_ADD_TO_BRIDGE for the entry, if its flow hardware counter lastuse field has changed since last update. If lastuse hasn't changed for ageing_time period, then delete the FDB entry and notify kernel bridge by sending SWITCHDEV_FDB_DEL_TO_BRIDGE notification. Register blocking switchdev notifier callback and handle attribute set SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME event to allow user to dynamically configure bridge FDB entry ageing timeout. Save the value per-bridge in struct mlx5_esw_bridge. Silently ignore SWITCHDEV_ATTR_ID_PORT_{PRE_}BRIDGE_FLAGS switchdev event since mlx5 bridge implementation relies on software bridge for implementing necessary behavior for all of these flags. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:10 -07:00
Vlad Buslov	7cd6a54a82	net/mlx5: Bridge, handle FDB events Hardware supported by mlx5 driver doesn't provide learning and requires the driver to emulate all switch-like behavior in software. As such, all packets by default go through miss path, appear on representor and get to software bridge, if it is the upper device of the representor. This causes bridge to process packet in software, learn the MAC address to FDB and send SWITCHDEV_FDB_ADD_TO_DEVICE event to all subscribers. In order to offload FDB entries in mlx5, register switchdev notifier callback and implement support for both 'added_by_user' and dynamic FDB entry SWITCHDEV_FDB_ADD_TO_DEVICE events asynchronously using new mlx5_esw_bridge_offloads->wq ordered workqueue. In workqueue callback offload the ingress rule (matching FDB entry MAC as packet source MAC) and egress table rule (matching FDB entry MAC as destination MAC). For ingress table rule also match source vport to ensure that only traffic coming from expected bridge port is matched by offloaded rule. Save all the relevant FDB entry data in struct mlx5_esw_bridge_fdb_entry instance and insert the instance in new mlx5_esw_bridge->fdb_list list (for traversing all entries by software ageing implementation in following patch) and in new mlx5_esw_bridge->fdb_ht hash table for fast retrieval. Notify the bridge that FDB entry has been offloaded by sending SWITCHDEV_FDB_OFFLOADED notification. Delete FDB entry on reception of SWITCHDEV_FDB_DEL_TO_DEVICE event. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:10 -07:00
Vlad Buslov	19e9bfa044	net/mlx5: Bridge, add offload infrastructure Create new files bridge.{c\|h} in en/rep directory that implement bridge interaction with representor netdevices and handle required events/notifications, bridge.{c\|h} in esw directory that implement all necessary eswitch offloading infrastructure and work on vport/eswitch level. Provide new kconfig MLX5_BRIDGE which is automatically selected when both kernel bridge and mlx5 eswitch configs are enabled. Provide basic infrastructure for bridge offloads: - struct mlx5_esw_bridge_offloads - per-eswitch bridge offload structure that encapsulates generic bridge-offloads data (notifier blocks, ingress flow table/group, etc.) that is created/deleted on enable/disable eswitch offloads. - struct mlx5_esw_bridge - per-bridge structure that encapsulates per-bridge data (reference counter, FDB, egress flow table/group, etc.) that is created when first eswitch representor is attached to new bridge and deleted when last representor is removed from the bridge as a result of NETDEV_CHANGEUPPER event. The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB namespace. The new priority is between tc-miss and slow path priorities. 
The priority consists of two levels: the ingress table that is global per eswitch and matches incoming packets by src_mac/vid and redirects them to next level (egress table) that is chosen according to ingress port bridge membership and matches on dst_mac/vid in order to redirect packet to vport according to the following diagram: + \| +---------v----------+ \| \| \| FDB_TC_OFFLOAD \| \| \| +---------+----------+ \| \| +---------v----------+ \| \| \| FDB_FT_OFFLOAD \| \| \| +---------+----------+ \| \| +---------v----------+ \| \| \| FDB_TC_MISS \| \| \| +---------+----------+ \| +--------------------------------------+ \| \| \| \| +------+ \| \| \| \| \| +------v--------+ FDB_BR_OFFLOAD \| \| \| INGRESS_TABLE \| \| \| +------+---+----+ \| \| \| \| match \| \| \| +---------+ \| \| \| \| \| +-------+ \| \| +-------v-------+ match \| \| \| \| \| \| EGRESS_TABLE +------------> vport \| \| \| +-------+-------+ \| \| \| \| \| \| \| +-------+ \| \| miss \| \| \| +------+------+ \| \| \| \| +--------------------------------------+ \| \| +---------v----------+ \| \| \| FDB_SLOW_PATH \| \| \| +---------+----------+ \| v Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:09 -07:00
Vlad Buslov	0781015288	net/mlx5e: Refactor mlx5e_eswitch_{*}rep() helpers Change the helper functions to accept a constant pointer to struct net_device. This is necessary for following patches in series that pass mlx5e_eswitch_rep() as a callback to kernel bridge infrastructure code. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:09 -07:00
Vlad Buslov	ec3be8873d	net/mlx5: Create TC-miss priority and table In order to adhere to kernel software datapath model bridge offloads must come after TC and NF FDBs. Following patches in this series add new FDB priority for bridge after FDB_FT_OFFLOAD. However, since netfilter offload is implemented with unmanaged tables, its miss path is not automatically connected to next priority and requires the code to manually connect with slow table. To keep bridge offloads encapsulated and not mix it with eswitch offloads, create a new FDB_TC_MISS priority between FDB_FT_OFFLOAD and FDB_SLOW_PATH: + \| +---------v----------+ \| \| \| FDB_TC_OFFLOAD \| \| \| +---------+----------+ \| \| \| +---------v----------+ \| \| \| FDB_FT_OFFLOAD \| \| \| +---------+----------+ \| \| \| +---------v----------+ \| \| \| FDB_TC_MISS \| \| \| +---------+----------+ \| \| \| +---------v----------+ \| \| \| FDB_SLOW_PATH \| \| \| +---------+----------+ \| v Initialize the new priority with single default empty managed table and use the table as TC/NF miss path instead of slow table. This approach allows bridge offloads to be created as new FDB namespace priority between FDB_TC_MISS and FDB_SLOW_PATH without exposing its internal tables to any other modules since miss path of managed TC-miss table is automatically wired to next priority. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik	ded6a877a3	net/mlx5: DR, Support EMD tag in modify header for STEv1 Add support for EMD tag in modify header set/copy actions on device that supports STEv1. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik	7ea9b39852	net/mlx5: DR, Added support for INSERT_HEADER reformat type Add support for INSERT_HEADER packet reformat context type. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik	3f3f05ab88	net/mlx5: Added new parameters to reformat context Adding new reformat context type (INSERT_HEADER) requires adding two new parameters to reformat context - reformat_param_0 and reformat_param_1. As defined by HW spec, these parameters have different meanings for different reformat context types. The first parameter (reformat_param_0) is not new to HW spec, but it wasn't used by any of the supported reformats. The second parameter (reformat_param_1) is new to the HW spec - it was added to allow supporting INSERT_HEADER. For INSERT_HEADER, reformat_param_0 indicates the header used to reference the location of the inserted header, and reformat_param_1 indicates the offset of the inserted header from the reference point defined by reformat_param_0. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik	d7418b4efa	net/mlx5: DR, Allow encap action for RX for supporting devices Encap actions on RX flow were not supported on older devices. However, this is no longer the case in devices that support STEv1. This patch adds support for encap l3/l2 on RX flow for supported devices: update actions state machine by adding the newly supported transitions and add the required support in STEv0/1 files. The new transitions that are supported are: - from decap/modify-header/pop-vlan to encap - from encap to termination table Signed-off-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik	28de41a4ba	net/mlx5: DR, Split reformat state to Encap and Decap Split single reformat state into two separate states for encap and decap. This will allow adding actions to the specific domain, such as encap on RX. Signed-off-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik	67133eaa93	net/mlx5: mlx5_ifc support for header insert/remove Add support for HCA caps 2 that contains capabilities for the new insert/remove header actions. Added the required definitions for supporting the new reformat type: added packet reformat parameters, reformat anchors and definitions to allow copy/set into the inserted EMD (Embedded MetaData) tag. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 18:36:06 -07:00
Aya Levin	54e1217b90	net/mlx5e: Block offload of outer header csum for GRE tunnel The device is able to offload either the outer header csum or inner header csum. The driver utilizes the inner csum offload. So, prohibit setting of tx-gre-csum-segmentation and let it be: off[fixed]. Fixes: `2729984149` ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:06 -07:00
Aya Levin	6d6727dddc	net/mlx5e: Block offload of outer header csum for UDP tunnels The device is able to offload either the outer header csum or inner header csum. The driver utilizes the inner csum offload. Hence, block setting of tx-udp_tnl-csum-segmentation and set it to off[fixed]. Fixes: `b49663c8fb` ("net/mlx5e: Add support for UDP tunnel segmentation with outer checksum offload") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:06 -07:00
Shay Drory	7a545077cb	Revert "net/mlx5: Arm only EQs with EQEs" In the scenario described below, an EQ can remain in FIRED state which can result in missing an interrupt generation. The scenario: device mlx5_core driver ------ ---------------- EQ1.eqe generated EQ1.MSI-X sent EQ1.state = FIRED EQ2.eqe generated mlx5_irq() polls - eq1_eqes() arm eq1 polls - eq2_eqes() arm eq2 EQ2.MSI-X sent EQ2.state = FIRED mlx5_irq() polls - eq2_eqes() -- no eqes found driver skips EQ arming; ->EQ2 remains fired, misses generating interrupt. Hence, always arm the EQ by reverting the cited commit in fixes tag. Fixes: `d894892dda` ("net/mlx5: Arm only EQs with EQEs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:05 -07:00
Aya Levin	a6ee6f5f10	net/mlx5e: Fix select queue to consider SKBTX_HW_TSTAMP Steering packets to PTP-SQ should be done only if the SKB has SKBTX_HW_TSTAMP set in the tx_flags. While here, take the function into a header and inline it. Set the whole condition to select the PTP-SQ to unlikely. Fixes: `24c22dd091` ("net/mlx5e: Add states to PTP channel") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:05 -07:00
Aya Levin	9ae8c18c5e	net/mlx5e: Don't update netdev RQs with PTP-RQ Since the driver opens the PTP-RQ under channel 0, it appears to the stack as if the SKB was received on rxq0. So from the stack POV there are still the same number of RX queues. Fixes: `960fbfe222` ("net/mlx5e: Allow coexistence of CQE compression and HW TS PTP") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:05 -07:00
Chris Mi	11f5ac3e05	net/mlx5e: Verify dev is present in get devlink port ndo When changing eswitch mode, the netdev is detached from the hardware resources. So verify dev is present in get devlink port ndo. Otherwise, we will hit the following panic: [241535.973539] RIP: 0010:__devlink_port_phys_port_name_get+0x13/0x1b0 [241535.976471] RSP: 0018:ffff9eaf0ae1b7c8 EFLAGS: 00010292 [241535.977471] RAX: 000000000002d370 RBX: 000000000002d370 RCX: 0000000000000000 [241535.978479] RDX: 0000000000000010 RSI: ffff9eaf0ae1b858 RDI: 000000000002d370 [241535.979482] RBP: ffff9eaf0ae1b7e0 R08: 000000000000002a R09: ffff8888d54d13da [241535.980486] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8888e6700000 [241535.981491] R13: ffff9eaf0ae1b858 R14: 0000000000000010 R15: 0000000000000000 [241535.982489] FS: 00007fd374ef3740(0000) GS:ffff88909ea00000(0000) knlGS:0000000000000000 [241535.983494] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [241535.984487] CR2: 000000000002d444 CR3: 000000089fd26006 CR4: 00000000003706e0 [241535.985502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [241535.986499] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [241535.987477] Call Trace: [241535.988426] ? nla_put_64bit+0x71/0xa0 [241535.989368] devlink_compat_phys_port_name_get+0x50/0xa0 [241535.990312] dev_get_phys_port_name+0x4b/0x60 [241535.991252] rtnl_fill_ifinfo+0x57b/0xcb0 [241535.992192] rtnl_dump_ifinfo+0x58f/0x6d0 [241535.993123] ? ksize+0x14/0x20 [241535.994033] ? __alloc_skb+0xe8/0x250 [241535.994935] netlink_dump+0x17c/0x300 [241535.995821] netlink_recvmsg+0x1de/0x2c0 [241535.996677] sock_recvmsg+0x70/0x80 [241535.997518] ____sys_recvmsg+0x9b/0x1b0 [241535.998360] ? iovec_from_user+0x82/0x120 [241535.999202] ? __import_iovec+0x2c/0x130 [241536.000031] ___sys_recvmsg+0x94/0x130 [241536.000850] ? __handle_mm_fault+0x56d/0x6e0 [241536.001668] __sys_recvmsg+0x5f/0xb0 [241536.002464] ? 
syscall_enter_from_user_mode+0x2b/0x80 [241536.003242] __x64_sys_recvmsg+0x1f/0x30 [241536.004008] do_syscall_64+0x38/0x50 [241536.004767] entry_SYSCALL_64_after_hwframe+0x44/0xae [241536.005532] RIP: 0033:0x7fd375014f47 Fixes: `2ff349c5ed` ("net/mlx5e: Verify dev is present in some ndos") Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:04 -07:00
Maor Gottlieb	4aaf96ac8b	net/mlx5: DR, Don't use SW steering when RoCE is not supported SW steering uses RC QP to write/read to/from ICM, hence it's not supported when RoCE is not supported as well. Fixes: `70605ea545` ("net/mlx5: DR, Expose APIs for direct rule managing") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:04 -07:00
Maor Gottlieb	c189716b2a	net/mlx5: Consider RoCE cap before init RDMA resources Check if RoCE is supported by the device before enabling it in the vport context and creating all the RDMA steering objects. Fixes: `80f09dfc23` ("net/mlx5: Eswitch, enable RoCE loopback traffic") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:04 -07:00
Dima Chumak	a3e5fd9314	net/mlx5e: Fix page reclaim for dead peer hairpin When adding a hairpin flow, a firmware-side send queue is created for the peer net device, which claims some host memory pages for its internal ring buffer. If the peer net device is removed/unbound before the hairpin flow is deleted, then the send queue is not destroyed which leads to a stack trace on pci device remove: [ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource [ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110 [ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0 [ 748.002171] ------------[ cut here ]------------ [ 748.001177] FW pages counter is 4 after reclaiming all pages [ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core] [ +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core] [ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1 [ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core] [ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9 [ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286 [ 748.001149] RAX: 
0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000 [ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51 [ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8 [ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30 [ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000 [ 748.001429] FS: 00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000 [ 748.001695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0 [ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 748.001654] Call Trace: [ 748.000576] ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core] [ 748.001416] ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core] [ 748.001354] ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core] [ 748.001203] mlx5_function_teardown+0x30/0x60 [mlx5_core] [ 748.001275] mlx5_uninit_one+0xa7/0xc0 [mlx5_core] [ 748.001200] remove_one+0x5f/0xc0 [mlx5_core] [ 748.001075] pci_device_remove+0x9f/0x1d0 [ 748.000833] device_release_driver_internal+0x1e0/0x490 [ 748.001207] unbind_store+0x19f/0x200 [ 748.000942] ? sysfs_file_ops+0x170/0x170 [ 748.001000] kernfs_fop_write_iter+0x2bc/0x450 [ 748.000970] new_sync_write+0x373/0x610 [ 748.001124] ? new_sync_read+0x600/0x600 [ 748.001057] ? lock_acquire+0x4d6/0x700 [ 748.000908] ? lockdep_hardirqs_on_prepare+0x400/0x400 [ 748.001126] ? fd_install+0x1c9/0x4d0 [ 748.000951] vfs_write+0x4d0/0x800 [ 748.000804] ksys_write+0xf9/0x1d0 [ 748.000868] ? __x64_sys_read+0xb0/0xb0 [ 748.000811] ? filp_open+0x50/0x50 [ 748.000919] ? 
syscall_enter_from_user_mode+0x1d/0x50 [ 748.001223] do_syscall_64+0x3f/0x80 [ 748.000892] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 748.001026] RIP: 0033:0x7f58bcfb22f7 [ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7 [ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003 [ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001 [ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d [ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700 [ 748.001564] irq event stamp: 0 [ 748.000787] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0 [ 748.001854] softirqs last enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0 [ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 748.001492] ---[ end trace a6fabd773d1c51ae ]--- Fix by destroying the send queue of a hairpin peer net device that is being removed/unbound, which returns the allocated ring buffer pages to the host. Fixes: `4d8fcf216c` ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:03 -07:00
Huy Nguyen	8ad893e516	net/mlx5e: Remove dependency in IPsec initialization flows Currently, IPsec feature is disabled because mlx5e_build_nic_netdev is required to be called after mlx5e_ipsec_init. This requirement is invalid as mlx5e_build_nic_netdev and mlx5e_ipsec_init initialize independent resources. Remove ipsec pointer check in mlx5e_build_nic_netdev so that the two functions can be called in any order. Fixes: `547eede070` ("net/mlx5e: IPSec, Innova IPSec offload infrastructure") Signed-off-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:03 -07:00
Vlad Buslov	fb1a3132ee	net/mlx5e: Fix use-after-free of encap entry in neigh update handler Function mlx5e_rep_neigh_update() wasn't updated to accommodate rtnl lock removal from TC filter update path and properly handle concurrent encap entry insertion/deletion which can lead to following use-after-free: [23827.464923] ================================================================== [23827.469446] BUG: KASAN: use-after-free in mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.470971] Read of size 4 at addr ffff8881d132228c by task kworker/u20:6/21635 [23827.472251] [23827.472615] CPU: 9 PID: 21635 Comm: kworker/u20:6 Not tainted 5.13.0-rc3+ #5 [23827.473788] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [23827.475639] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core] [23827.476731] Call Trace: [23827.477260] dump_stack+0xbb/0x107 [23827.477906] print_address_description.constprop.0+0x18/0x140 [23827.478896] ? mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.479879] ? mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.480905] kasan_report.cold+0x7c/0xd8 [23827.481701] ? mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.482744] kasan_check_range+0x145/0x1a0 [23827.493112] mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.494054] ? mlx5e_tc_tun_encap_info_equal_generic+0x140/0x140 [mlx5_core] [23827.495296] mlx5e_rep_neigh_update+0x41e/0x5e0 [mlx5_core] [23827.496338] ? mlx5e_rep_neigh_entry_release+0xb80/0xb80 [mlx5_core] [23827.497486] ? read_word_at_a_time+0xe/0x20 [23827.498250] ? strscpy+0xa0/0x2a0 [23827.498889] process_one_work+0x8ac/0x14e0 [23827.499638] ? lockdep_hardirqs_on_prepare+0x400/0x400 [23827.500537] ? pwq_dec_nr_in_flight+0x2c0/0x2c0 [23827.501359] ? rwlock_bug.part.0+0x90/0x90 [23827.502116] worker_thread+0x53b/0x1220 [23827.502831] ? process_one_work+0x14e0/0x14e0 [23827.503627] kthread+0x328/0x3f0 [23827.504254] ? _raw_spin_unlock_irq+0x24/0x40 [23827.505065] ? 
__kthread_bind_mask+0x90/0x90 [23827.505912] ret_from_fork+0x1f/0x30 [23827.506621] [23827.506987] Allocated by task 28248: [23827.507694] kasan_save_stack+0x1b/0x40 [23827.508476] __kasan_kmalloc+0x7c/0x90 [23827.509197] mlx5e_attach_encap+0xde1/0x1d40 [mlx5_core] [23827.510194] mlx5e_tc_add_fdb_flow+0x397/0xc40 [mlx5_core] [23827.511218] __mlx5e_add_fdb_flow+0x519/0xb30 [mlx5_core] [23827.512234] mlx5e_configure_flower+0x191c/0x4870 [mlx5_core] [23827.513298] tc_setup_cb_add+0x1d5/0x420 [23827.514023] fl_hw_replace_filter+0x382/0x6a0 [cls_flower] [23827.514975] fl_change+0x2ceb/0x4a51 [cls_flower] [23827.515821] tc_new_tfilter+0x89a/0x2070 [23827.516548] rtnetlink_rcv_msg+0x644/0x8c0 [23827.517300] netlink_rcv_skb+0x11d/0x340 [23827.518021] netlink_unicast+0x42b/0x700 [23827.518742] netlink_sendmsg+0x743/0xc20 [23827.519467] sock_sendmsg+0xb2/0xe0 [23827.520131] ____sys_sendmsg+0x590/0x770 [23827.520851] ___sys_sendmsg+0xd8/0x160 [23827.521552] __sys_sendmsg+0xb7/0x140 [23827.522238] do_syscall_64+0x3a/0x70 [23827.522907] entry_SYSCALL_64_after_hwframe+0x44/0xae [23827.523797] [23827.524163] Freed by task 25948: [23827.524780] kasan_save_stack+0x1b/0x40 [23827.525488] kasan_set_track+0x1c/0x30 [23827.526187] kasan_set_free_info+0x20/0x30 [23827.526968] __kasan_slab_free+0xed/0x130 [23827.527709] slab_free_freelist_hook+0xcf/0x1d0 [23827.528528] kmem_cache_free_bulk+0x33a/0x6e0 [23827.529317] kfree_rcu_work+0x55f/0xb70 [23827.530024] process_one_work+0x8ac/0x14e0 [23827.530770] worker_thread+0x53b/0x1220 [23827.531480] kthread+0x328/0x3f0 [23827.532114] ret_from_fork+0x1f/0x30 [23827.532785] [23827.533147] Last potentially related work creation: [23827.534007] kasan_save_stack+0x1b/0x40 [23827.534710] kasan_record_aux_stack+0xab/0xc0 [23827.535492] kvfree_call_rcu+0x31/0x7b0 [23827.536206] mlx5e_tc_del_fdb_flow+0x577/0xef0 [mlx5_core] [23827.537305] mlx5e_flow_put+0x49/0x80 [mlx5_core] [23827.538290] mlx5e_delete_flower+0x6d1/0xe60 [mlx5_core] [23827.539300] 
tc_setup_cb_destroy+0x18e/0x2f0 [23827.540144] fl_hw_destroy_filter+0x1d2/0x310 [cls_flower] [23827.541148] __fl_delete+0x4dc/0x660 [cls_flower] [23827.541985] fl_delete+0x97/0x160 [cls_flower] [23827.542782] tc_del_tfilter+0x7ab/0x13d0 [23827.543503] rtnetlink_rcv_msg+0x644/0x8c0 [23827.544257] netlink_rcv_skb+0x11d/0x340 [23827.544981] netlink_unicast+0x42b/0x700 [23827.545700] netlink_sendmsg+0x743/0xc20 [23827.546424] sock_sendmsg+0xb2/0xe0 [23827.547084] ____sys_sendmsg+0x590/0x770 [23827.547850] ___sys_sendmsg+0xd8/0x160 [23827.548606] __sys_sendmsg+0xb7/0x140 [23827.549303] do_syscall_64+0x3a/0x70 [23827.549969] entry_SYSCALL_64_after_hwframe+0x44/0xae [23827.550853] [23827.551217] The buggy address belongs to the object at ffff8881d1322200 [23827.551217] which belongs to the cache kmalloc-256 of size 256 [23827.553341] The buggy address is located 140 bytes inside of [23827.553341] 256-byte region [ffff8881d1322200, ffff8881d1322300) [23827.555747] The buggy address belongs to the page: [23827.556847] page:00000000898762aa refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1d1320 [23827.558651] head:00000000898762aa order:2 compound_mapcount:0 compound_pincount:0 [23827.559961] flags: 0x2ffff800010200(slab\|head\|node=0\|zone=2\|lastcpupid=0x1ffff) [23827.561243] raw: 002ffff800010200 dead000000000100 dead000000000122 ffff888100042b40 [23827.562653] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 [23827.564112] page dumped because: kasan: bad access detected [23827.565439] [23827.565932] Memory state around the buggy address: [23827.566917] ffff8881d1322180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [23827.568485] ffff8881d1322200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [23827.569818] >ffff8881d1322280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [23827.571143] ^ [23827.571879] ffff8881d1322300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [23827.573283] ffff8881d1322380: fc fc fc fc fc fc fc fc fc 
fc fc fc fc fc fc fc [23827.574654] ================================================================== Most of the necessary logic is already correctly implemented by mlx5e_get_next_valid_encap() helper that is used in neigh stats update handler. Make the handler generic by renaming it to mlx5e_get_next_matching_encap() and use callback to test whether flow is matching instead of hardcoded check for 'valid' flag value. Implement mlx5e_get_next_valid_encap() by calling mlx5e_get_next_matching_encap() with callback that tests encap MLX5_ENCAP_ENTRY_VALID flag. Implement new mlx5e_get_next_init_encap() helper by calling mlx5e_get_next_matching_encap() with callback that tests encap completion result to be non-error and use it in mlx5e_rep_neigh_update() to safely iterate over nhe->encap_list. Remove encap completion logic from mlx5e_rep_update_flows() since the encap entries passed to this function are already guaranteed to be properly initialized by similar code in mlx5e_get_next_init_encap(). Fixes: `2a1f1768fa` ("net/mlx5e: Refactor neigh update for concurrent execution") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:03 -07:00
Yang Li	2bf8d2ae34	net/mlx5e: Fix an error code in mlx5e_arfs_create_tables() When the code executes 'if (!priv->fs.arfs->wq)', the value of err is 0. So, we use -ENOMEM to indicate that the function create_singlethread_workqueue() returned NULL. Clean up smatch warning: drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c:373 mlx5e_arfs_create_tables() warn: missing error code 'err'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Fixes: `f6755b80d6` ("net/mlx5e: Dynamic alloc arfs table for netdev when needed") Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-09 17:20:02 -07:00
Colin Ian King	f3b5a89075	mlxsw: thermal: Fix null dereference of NULL temperature parameter The call to mlxsw_thermal_module_temp_and_thresholds_get passes a NULL pointer for the temperature and this can be dereferenced in this function if the mlxsw_reg_query call fails. The simplest fix is to pass the address of a dummy temperature variable instead of a NULL pointer. Addresses-Coverity: ("Explicit null dereferenced") Fixes: `72a64c2fe9` ("mlxsw: thermal: Read module temperature thresholds using MTMP register") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-09 15:39:14 -07:00
Mykola Kostenok	72a64c2fe9	mlxsw: thermal: Read module temperature thresholds using MTMP register mlxsw_thermal_module_trips_update() is used to update the trip points of the module's thermal zone. Currently, this is done by querying the thresholds from the module's EEPROM via MCIA register. This data does not pass validation and in some cases can be unreliable, for example, due to some problem with the transceiver module. Previous patch made it possible to read module's temperature and thresholds via MTMP register. Therefore, extend mlxsw_thermal_module_trips_update() to use the thresholds queried from MTMP, if valid. This is both more reliable and more efficient than current method, as temperature and thresholds are queried in one transaction instead of three. This is significant when working over a slow bus such as I2C. Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com> Acked-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-08 14:39:07 -07:00
Mykola Kostenok	e57977b34a	mlxsw: thermal: Add function for reading module temperature and thresholds Provide new function mlxsw_thermal_module_temp_and_thresholds_get() for reading temperature and temperature thresholds by a single operation. The motivation is to reduce the number of transactions with the device which is important when operating over a slow bus such as I2C. Currently, the sole caller of the function is only using it to read the module's temperature. The next patch will also use it to query the module's temperature thresholds. Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com> Acked-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-08 14:39:07 -07:00
Mykola Kostenok	befc204808	mlxsw: core_env: Read module temperature thresholds using MTMP register Currently, module temperature thresholds are obtained from Management Cable Info Access (MCIA) register by specifying the thresholds offsets within module EEPROM layout. This data does not pass validation and in some cases can be unreliable. For example, due to some problem with the module. Add support for a new feature provided by Management Temperature (MTMP) register for sanitization of temperature thresholds values. Extend mlxsw_env_module_temp_thresholds_get() to get temperature thresholds through MTMP field 'max_operational_temperature' - if it is not zero, feature is supported. Otherwise fallback to old method and get the thresholds through MCIA. Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com> Acked-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-08 14:39:07 -07:00
Mykola Kostenok	314dbb19f9	mlxsw: reg: Extend MTMP register with new threshold field Extend Management Temperature (MTMP) register with new field specifying the maximum temperature threshold. Extend mlxsw_reg_mtmp_unpack() function with two extra arguments, providing high and maximum temperature thresholds. For modules, these thresholds correspond to critical and emergency thresholds that are read from the module's EEPROM. Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com> Acked-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-08 14:39:07 -07:00
Amit Cohen	a08a61934c	mlxsw: spectrum_router: Remove abort mechanism The abort mechanism was introduced in commit `8e05fd7166` ("fib: hook IPv4 fib for hardware offload") with the purpose of falling back to software-based routing in case of a route programming error in hardware. The process is irreversible and requires users to reload the offloading driver or reboot the machine. While this approach might make sense in theory, it makes very little sense in practice. In the case of high speed ASICs such as the Spectrum ASIC, the abort mechanism effectively kills the machine upon a non-fatal error such as a route programming error. Such an extreme policy does not belong in the kernel, especially when user space can simply try to reprogram the route following the RTM_NEWROUTE failure notification. Therefore, remove the abort mechanism. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-08 14:39:06 -07:00
Matteo Croce	c420c98982	skbuff: add a parameter to __skb_frag_unref This is a prerequisite patch, the next one is enabling recycling of skbs and fragments. Add an extra argument on __skb_frag_unref() to handle recycling, and update the current users of the function with that. Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 14:11:47 -07:00
Mykola Kostenok	2fd8d84ce3	mlxsw: core: Set thermal zone polling delay argument to real value at init Thermal polling delay argument for modules and gearboxes thermal zones used to be initialized with zero value, while actual delay was used to be set by mlxsw_thermal_set_mode() by thermal operation callback set_mode(). After operations set_mode()/get_mode() have been removed by cited commits, modules and gearboxes thermal zones always have polling time set to zero and do not perform temperature monitoring. Set non-zero "polling_delay" in thermal_zone_device_register() routine, thus, the relevant thermal zones will perform thermal monitoring. Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com> Fixes: `5d7bd8aa7c` ("thermal: Simplify or eliminate unnecessary set_mode() methods") Fixes: `1ee14820fd` ("thermal: remove get_mode() operation of drivers") Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com> Acked-by: Vadim Pasternak <vadimp@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 13:11:41 -07:00
Petr Machata	d566ed04e4	mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() In mlxsw Qdisc offload, find_class() is an operation that yields a qdisc offload descriptor given a parental qdisc descriptor and a class handle. In __mlxsw_sp_qdisc_ets_graft() however, a band number is passed to that function instead of a handle. This can lead to a trigger of a WARN_ON with the following splat: WARNING: CPU: 3 PID: 808 at drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c:1356 __mlxsw_sp_qdisc_ets_graft+0x115/0x130 [mlxsw_spectrum] [...] Call Trace: mlxsw_sp_setup_tc_prio+0xe3/0x100 [mlxsw_spectrum] qdisc_offload_graft_helper+0x35/0xa0 prio_graft+0x176/0x290 [sch_prio] qdisc_graft+0xb3/0x540 tc_modify_qdisc+0x56a/0x8a0 rtnetlink_rcv_msg+0x12c/0x370 netlink_rcv_skb+0x49/0xf0 netlink_unicast+0x1f6/0x2b0 netlink_sendmsg+0x1fb/0x410 ____sys_sendmsg+0x1f3/0x220 ___sys_sendmsg+0x70/0xb0 __sys_sendmsg+0x54/0xa0 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae Since the parent handle is not passed with the offload information, compute it from the band number and qdisc handle. Fixes: 28052e618b04 ("mlxsw: spectrum_qdisc: Track children per qdisc") Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 13:11:41 -07:00
Petr Machata	306b9228c0	mlxsw: reg: Spectrum-3: Enforce lowest max-shaper burst size of 11 A max-shaper is the HW component responsible for delaying egress traffic above a configured transmission rate. Burst size is the amount of traffic that is allowed to pass without accounting. The burst size value needs to be such that it can be expressed as 2^BS * 512 bits, where BS lies in a certain ASIC-dependent range. mlxsw enforces that this holds before attempting to configure the shaper. The assumption for Spectrum-3 was that the lower limit of BS would be 5, like for Spectrum-1. But as of now, the limit is still 11. Therefore fix the driver accordingly, so that incorrect values are rejected early with a proper message. Fixes: `23effa2479` ("mlxsw: reg: Add max_shaper_bs to QoS ETS Element Configuration") Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 13:11:41 -07:00
David S. Miller	126285651b	Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net Bug fixes overlapping feature additions and refactoring, mostly. Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 13:01:52 -07:00
Vladyslav Tarasiuk	f68406ca3b	net/mlx5e: Remove unreachable code in mlx5e_xmit() After some commits, mlx5e_txwqe_build_eseg() lost its ability to return boolean value and became effectively void. Change its return type to void and remove unreachable branches. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:21 -07:00
Alaa Hleihel	39e8cc6d75	net/mlx5e: Disable TLS device offload in kdump mode Under kdump environment we want to use the smallest possible amount of resources, that includes setting SQ size to minimum. However, when running on a device that supports TLS device offload, then the SQ stop room becomes larger than with non-capable device and requires increasing the SQ size. Since TLS device offload is not necessary in kdump mode, disable it to reduce the memory requirements for capable devices. With this change, the needed SQ stop room size drops by 33. Signed-off-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:20 -07:00
Alaa Hleihel	040ee6172e	net/mlx5e: Disable TX MPWQE in kdump mode Under kdump environment we want to use the smallest possible amount of resources, that includes setting SQ size to minimum. However, when running on a device that supports TX MPWQE, then the SQ stop room becomes larger than with non-capable device and requires increasing the SQ size. Since TX MPWQE offload is not necessary in kdump mode, disable it to reduce the memory requirements for capable devices. With this change, the needed SQ stop room size drops by 31. Signed-off-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:20 -07:00
Tariq Toukan	8ec5d438a3	net/mlx5e: RX, Re-place page pool numa node change logic Move the logic that updates the page pool upon changes in numa node. Before this patch, logic was placed in the RX polling function, being called also when no RX traffic, wasting cpu cycles. Here we move it to the RX post_wqes function, to be called only when new RX descriptors are going to be allocated. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:19 -07:00
Lama Kayal	771a563ea0	net/mlx5e: Zero-init DIM structures Initialize structs to avoid unexpected behavior. No immediate issue in current code, structs are return values, it's safer to initialize. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:19 -07:00
Meir Lichtinger	ab57a912be	net/mlx5e: IPoIB, Add support for NDR speed Add NDR IB PTYS coding and NDR speed 100GHz. Fixes: `235b6ac306` ("RDMA/ipoib: Add 50Gb and 100Gb link speeds to ethtool") Signed-off-by: Meir Lichtinger <meirl@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:18 -07:00
Shaokun Zhang	c4cf987ebe	net/mlx5e: Remove the repeated declaration Function 'mlx5e_deactivate_rq' is declared twice, so remove the repeated declaration. Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:18 -07:00
Dan Carpenter	b74fc1ca6a	net/mlx5: check for allocation failure in mlx5_ft_pool_init() Add a check for if the kzalloc() fails. Fixes: `4a98544d18` ("net/mlx5: Move chains ft pool to be used by all firmware steering") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:17 -07:00
Jiapeng Chong	e6dfa4a54a	net/mlx5: Fix duplicate included vhca_event.h Clean up the following includecheck warning: ./drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c: vhca_event.h is included more than once. No functional change. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:17 -07:00
Jakub Kicinski	490dcecabb	mlx5: count all link events mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This breaks link flap detection based on Linux carrier state transition count as netif_carrier_on() does nothing if carrier is already on. Make sure we count such events. netif_carrier_event() increments the counters and fires the linkwatch events. The latter is not necessary for the use case but seems like the right thing to do. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-03 13:10:17 -07:00
Shay Drory	404e5a1269	RDMA/mlx4: Do not map the core_clock page to user space unless enabled Currently when mlx4 maps the hca_core_clock page to the user space there are read-modifiable registers, one of which is semaphore, on this page as well as the clock counter. If user reads the wrong offset, it can modify the semaphore and hang the device. Do not map the hca_core_clock page to the user space unless the device has been put in a backwards compatibility mode to support this feature. After this patch, mlx4 core_clock won't be mapped to user space on the majority of existing devices and the uverbs device time feature in ibv_query_rt_values_ex() will be disabled. Fixes: `52033cfb5a` ("IB/mlx4: Add mmap call to map the hardware clock") Link: https://lore.kernel.org/r/9632304e0d6790af84b3b706d8c18732bc0d5e27.1622726305.git.leonro@nvidia.com Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2021-06-03 14:19:53 -03:00
Yevgeny Kliteynik	216214c64a	net/mlx5: DR, Create multi-destination flow table with level less than 64 Flow table that contains flow pointing to multiple flow tables or multiple TIRs must have a level lower than 64. In our case it applies to multi-destination flow table. Fix the level of the created table to comply with HW Spec definitions, and still make sure that its level is lower than SW-owned tables, so that it would be possible to point from the multi-destination FW table to SW tables. Fixes: `34583beea4` ("net/mlx5: DR, Create multi-destination table for SW-steering use") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:21 -07:00
Aya Levin	5349cbba75	net/mlx5e: Fix conflict with HW TS and CQE compression When a driver's profile doesn't support a dedicated PTP-RQ, configuration of CQE compression while HW TS is configured should fail. Fixes: `885b8cfb16` ("net/mlx5e: Update ethtool setting of CQE compression") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:21 -07:00
Aya Levin	256f79d13c	net/mlx5e: Fix HW TS with CQE compression according to profile When the driver's profile doesn't support a dedicated PTP-RQ, the PTP accuracy of HW TS is affected by the CQE compression. In this case, turn off CQE compression. Otherwise, the driver crashes: BUG: kernel NULL pointer dereference, address:0000000000000018 ... ... RIP: 0010:mlx5e_ptp_rx_set_fs+0x25/0x1a0 [mlx5_core] ... ... Call Trace: mlx5e_ptp_activate_channel+0xb2/0xf0 [mlx5_core] mlx5e_activate_priv_channels+0x3b9/0x8c0 [mlx5_core] ? __mutex_unlock_slowpath+0x45/0x2a0 ? mlx5e_refresh_tirs+0x151/0x1e0 [mlx5_core] mlx5e_switch_priv_channels+0x1cd/0x2d0 [mlx5_core] ? mlx5e_xdp_allowed+0x150/0x150 [mlx5_core] mlx5e_safe_switch_params+0x118/0x3c0 [mlx5_core] ? __mutex_lock+0x6e/0x8e0 ? mlx5e_hwstamp_set+0xa9/0x300 [mlx5_core] mlx5e_hwstamp_set+0x194/0x300 [mlx5_core] ? dev_ioctl+0x9b/0x3d0 mlx5i_ioctl+0x37/0x60 [mlx5_core] mlx5i_pkey_ioctl+0x12/0x20 [mlx5_core] dev_ioctl+0xa9/0x3d0 sock_ioctl+0x268/0x420 __x64_sys_ioctl+0x3d8/0x790 ? lockdep_hardirqs_on_prepare+0xe4/0x190 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: `960fbfe222` ("net/mlx5e: Allow coexistence of CQE compression and HW TS PTP") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:21 -07:00
Roi Dayan	2a2c84facd	net/mlx5e: Fix adding encap rules to slow path On some devices the ignore flow level cap is not supported and we shouldn't use it. Setting the dest ft with mlx5_chains_get_tc_end_ft() already gives the correct end ft if ignore flow level cap is supported or not. Fixes: `39ac237ce0` ("net/mlx5: E-Switch, Refactor chains and priorities") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:20 -07:00
Roi Dayan	afe93f71b5	net/mlx5e: Check for needed capability for cvlan matching If not supported show an error and return instead of trying to offload to the hardware and fail. Fixes: `699e96ddf4` ("net/mlx5e: Support offloading tc double vlan headers match") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:20 -07:00
Moshe Shemesh	5940e64281	net/mlx5: Check firmware sync reset requested is set before trying to abort it In case driver sent NACK to firmware on sync reset request, it will get sync reset abort event while it didn't set sync reset requested mode. Thus, on abort sync reset event handler, driver should check reset requested is set before trying to stop sync reset poll. Fixes: `7dd6df329d` ("net/mlx5: Handle sync reset abort event") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:20 -07:00
Roi Dayan	b38742e411	net/mlx5e: Disable TLS offload for uplink representor TLS offload is not supported in switchdev mode. Fixes: `7a9fb35e8c` ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:19 -07:00
Aya Levin	d8ec92005f	net/mlx5e: Fix incompatible casting Device supports setting of a single fec mode at a time, enforce this by bitmap_weight == 1. Input from fec command is in u32, avoid cast to unsigned long and use bitmap_from_arr32 to populate bitmap safely. Fixes: `4bd9d5070b` ("net/mlx5e: Enforce setting of a single FEC mode") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-06-01 18:30:19 -07:00
Jakub Kicinski	af9207adb6	mlx5-updates-2021-05-26 Misc update for mlx5 driver, 1) Clean up patches for lag and SF 2) Reserve bit 31 in steering register C1 for IPSec offload usage 3) Move steering tables pool logic into the steering core and increase the maximum table size to 2G entries when software steering is enabled. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmCv6vAACgkQSD+KveBX +j6qnAgAz0eKWKCsFCqlXGIgF1cg3FrGR5W2Zi5euriHhHwNqnZof3AIMkzcXjLL wBlPjWk3YLfBaBNPTziz6EJuGl1vZZxuSdc7bqsNnl0srujRtQFu3JyerdgXEXNL W2NxjSTiVwu8lq2qlYauQvcE0v+JrB/LMe9tvq1UQ2v9FtBMMhs9hGUSCro2huwj XYF0m0ve89+mYlm6/m0SIUpPVdMiIhm4+coO1wibk7+8jn6+ZT6EJbbZvjc9eQg7 ZKr8f/TpfmvHToG8LPOc6HqHzRiHlp3Yzsft+xm54r082n4F/noGhL+Hqvvj1aTj C6Ip5N7VkzT+erMLMrjIbrmEP94cyQ== =torZ -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2021-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-05-26 Misc update for mlx5 driver, 1) Clean up patches for lag and SF 2) Reserve bit 31 in steering register C1 for IPSec offload usage 3) Move steering tables pool logic into the steering core and increase the maximum table size to 2G entries when software steering is enabled. 
* tag 'mlx5-updates-2021-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Fix lag port remapping logic net/mlx5: Use boolean arithmetic to evaluate roce_lag net/mlx5: Remove unnecessary spin lock protection net/mlx5: Cap the maximum flow group size to 16M entries net/mlx5: DR, Set max table size to 2G entries net/mlx5: Move chains ft pool to be used by all firmware steering net/mlx5: Move table size calculation to steering cmd layer net/mlx5: Add case for FS_FT_NIC_TX FT in MLX5_CAP_FLOWTABLE_TYPE net/mlx5: DR, Remove unused field of send_ring struct net/mlx5e: RX, Remove unnecessary check in RX CQE compression handling net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet net/mlx5e: TC: Reserved bit 31 of REG_C1 for IPsec offload net/mlx5e: TC: Use bit counts for register mapping net/mlx5: CT: Avoid reusing modify header context for natted entries net/mlx5e: CT, Remove newline from ct_dbg call ==================== Link: https://lore.kernel.org/r/20210527185624.694304-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-27 17:14:23 -07:00
Jiri Pirko	7dafcc4c9d	mlxsw: core: use PSID string define in devlink info Instead of having the string spelled out in the driver, use the global define with the same value. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-27 14:51:18 -07:00
Jiri Pirko	f55c998c27	mlxsw: core: Expose FW version over defined keyword To be aligned with the rest of the drivers, expose FW version under "fw" keyword in devlink dev info, in addition to the existing "fw.version", which is currently Mellanox-specific. devlink output before: running: fw.version 30.2008.2018 after: running: fw.version 30.2008.2018 fw 30.2008.2018 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-27 14:51:18 -07:00
Jiri Pirko	2754125ebd	net/mlx5: Expose FW version over defined keyword To be aligned with the rest of the drivers, expose FW version under "fw" keyword in devlink dev info, in addition to the existing "fw.version", which is currently Mellanox-specific. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-27 14:51:17 -07:00
Eli Cohen	8613641063	net/mlx5: Fix lag port remapping logic Fix the logic so that if both ports netdevices are enabled or disabled, use the trivial mapping without swapping. If only one of the netdevice's tx is enabled, use it to remap traffic to that port. Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:39 -07:00
Eli Cohen	2b14767525	net/mlx5: Use boolean arithmetic to evaluate roce_lag Avoid mixing boolean and bit arithmetic when evaluating validity of roce_lag. Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:39 -07:00
Eli Cohen	a546432f2f	net/mlx5: Remove unnecessary spin lock protection Taking lag_lock to access ldev->tracker is meaningless in the context of do_bond() and mlx5_lag_netdev_event(). Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:39 -07:00
Paul Blakey	71513c05a9	net/mlx5: Cap the maximum flow group size to 16M entries The maximum number of large flow groups applies to both small and large tables. For very large tables (such as the 2G SW steering tables) this may create a small number of flow groups each with an unrealistic entries domain (> 16M). Set the maximum number of large flow groups to at least what user requested, but with a maximum per group size of 16M entries. For software steering, if user requested less than 128 large flow groups, it will give us about 128 16M groups in a 2G entries tables. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:38 -07:00
Paul Blakey	9e11799840	net/mlx5: DR, Set max table size to 2G entries SW steering has no table size limitations. However, fs_core API is size aware. Set SW steering tables to the maximum possible table size (INT_MAX). Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:38 -07:00
Paul Blakey	4a98544d18	net/mlx5: Move chains ft pool to be used by all firmware steering Firmware FT pool is per device, but the software tracking of this pool only services fs_chains users, and if another layer takes a flow table, the pool will not be updated, and fs_chains will fail creating a flow table, with no recovery till the flow table is returned. Move FT pool to be global per device, and stored at the cmd level, so all layers can use it. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:38 -07:00
Paul Blakey	04745afb2a	net/mlx5: Move table size calculation to steering cmd layer Currently the table size is calculated by the fs_core layer. However, each steering cmd instance has a different allocation logic. FW steering uses a predefined pools of multiple sizes. SW steering doesn't have a pool, and can allocate any size of tables. Move the table size calculation to the steering cmd layer as a pre-step for moving fs_chains pool logic globally to firmware steering, and increasing table sizes for software steering. In addition, change the size parameter to absolute size to allow the special zero value to mean "get next available maximum size". Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:37 -07:00
Paul Blakey	e01b58e9d5	net/mlx5: Add case for FS_FT_NIC_TX FT in MLX5_CAP_FLOWTABLE_TYPE Commit `16f1c5bb3e` ("net/mlx5: Check device capability for maximum flow counters") added MLX5_CAP_FLOWTABLE_TYPE but forgot to account for FS_FT_NIC_TX case in the expression. Although the expression will return 1 for this case instead of the actual cap, there are currently no known side effects of missing this case. Add the FS_FT_NIC_TX case. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:37 -07:00
Yevgeny Kliteynik	b72ce870f5	net/mlx5: DR, Remove unused field of send_ring struct Remove unused field of struct mlx5dr_send_ring Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:37 -07:00
Tariq Toukan	2ef9c7c613	net/mlx5e: RX, Remove unnecessary check in RX CQE compression handling There are two reasons for exiting mlx5e_decompress_cqes_cont(): 1. The compression session is completed (cqd.left == 0). 2. The budget is exhausted (work_done == budget). If after calling mlx5e_decompress_cqes_cont() we have cqd.left > 0, it necessarily implies that budget is exhausted. The first part of the complex condition is covered by the second, hence we remove it here. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:36 -07:00
Huy Nguyen	c07274ab1a	net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet rep_tc copy REG_C1 to REG_B. IPsec crypto utilizes the whole REG_B register with BIT31 as IPsec marker. rep_tc_update_skb drops IPsec because it thought REG_B contains bad value. In previous patch, BIT 31 of REG_C1 is reserved for IPsec. Skip the rep_tc_update_skb if BIT31 of REG_B is set. Signed-off-by: Huy Nguyen <huyn@nvidia.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:36 -07:00
Huy Nguyen	b973cf3245	net/mlx5e: TC: Reserved bit 31 of REG_C1 for IPsec offload Currently ASAP features fully utilize all the bits of the CQE's flow tag and ft_metadata field. The flow tag field cannot be used because the flow table tagging in FTE does not allow partial write. We agree to reserve bit 31 of CQE's ft_metadata for IPsec to avoid ASAP CT from dropping IPsec offloaded packet Here is the new bit layout of REG_C1. Tunnel option id is reduced to 11 bits: < IPSEC MARKER (1) \| ESW_TUN_ID(12) \| ESW_TUN_OPTS(11) \| ESW_ZONE_ID(8) > Signed-off-by: Huy Nguyen <huyn@nvidia.com> Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com>	2021-05-27 11:54:36 -07:00
Paul Blakey	ed2fe7ba7b	net/mlx5e: TC: Use bit counts for register mapping To prepare for next patch where we will use a non-byte aligned mapping, change all byte counts in register mapping to bits. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:35 -07:00
Paul Blakey	7fac5c2ece	net/mlx5: CT: Avoid reusing modify header context for natted entries Currently the driver is designed to reuse header modify context entries. Natted entries will always have a unique modify header, as such the modify header hashtable lookup is introducing an overhead. When the hashtable size exceeded 200k entries the tested insertion rate dropped from ~10k entries/sec to ~300 entries/sec. Don't use the re-use mechanism when creating modify headers for natted tuples. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:35 -07:00
Roi Dayan	74097a0dcd	net/mlx5e: CT, Remove newline from ct_dbg call ct_dbg() already adds a newline. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-27 11:54:35 -07:00
Jakub Kicinski	5ada57a9a6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net cdc-wdm: s/kill_urbs/poison_urbs/ to fix build Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-27 09:55:10 -07:00
Vlad Buslov	9453d45ecb	net: zero-initialize tc skb extension on allocation Function skb_ext_add() doesn't initialize created skb extension with any value and leaves it up to the user. However, since extension of type TC_SKB_EXT originally contained only single value tc_skb_ext->chain its users used to just assign the chain value without setting whole extension memory to zero first. This assumption changed when TC_SKB_EXT extension was extended with additional fields but not all users were updated to initialize the new fields which leads to use of uninitialized memory afterwards. UBSAN log: [ 778.299821] UBSAN: invalid-load in net/openvswitch/flow.c:899:28 [ 778.301495] load of value 107 is not a valid value for type '_Bool' [ 778.303215] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7+ #2 [ 778.304933] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 778.307901] Call Trace: [ 778.308680] <IRQ> [ 778.309358] dump_stack+0xbb/0x107 [ 778.310307] ubsan_epilogue+0x5/0x40 [ 778.311167] __ubsan_handle_load_invalid_value.cold+0x43/0x48 [ 778.312454] ? memset+0x20/0x40 [ 778.313230] ovs_flow_key_extract.cold+0xf/0x14 [openvswitch] [ 778.314532] ovs_vport_receive+0x19e/0x2e0 [openvswitch] [ 778.315749] ? ovs_vport_find_upcall_portid+0x330/0x330 [openvswitch] [ 778.317188] ? create_prof_cpu_mask+0x20/0x20 [ 778.318220] ? arch_stack_walk+0x82/0xf0 [ 778.319153] ? secondary_startup_64_no_verify+0xb0/0xbb [ 778.320399] ? stack_trace_save+0x91/0xc0 [ 778.321362] ? stack_trace_consume_entry+0x160/0x160 [ 778.322517] ? lock_release+0x52e/0x760 [ 778.323444] netdev_frame_hook+0x323/0x610 [openvswitch] [ 778.324668] ? ovs_netdev_get_vport+0xe0/0xe0 [openvswitch] [ 778.325950] __netif_receive_skb_core+0x771/0x2db0 [ 778.327067] ? lock_downgrade+0x6e0/0x6f0 [ 778.328021] ? lock_acquire+0x565/0x720 [ 778.328940] ? generic_xdp_tx+0x4f0/0x4f0 [ 778.329902] ? inet_gro_receive+0x2a7/0x10a0 [ 778.330914] ? 
lock_downgrade+0x6f0/0x6f0 [ 778.331867] ? udp4_gro_receive+0x4c4/0x13e0 [ 778.332876] ? lock_release+0x52e/0x760 [ 778.333808] ? dev_gro_receive+0xcc8/0x2380 [ 778.334810] ? lock_downgrade+0x6f0/0x6f0 [ 778.335769] __netif_receive_skb_list_core+0x295/0x820 [ 778.336955] ? process_backlog+0x780/0x780 [ 778.337941] ? mlx5e_rep_tc_netdevice_event_unregister+0x20/0x20 [mlx5_core] [ 778.339613] ? seqcount_lockdep_reader_access.constprop.0+0xa7/0xc0 [ 778.341033] ? kvm_clock_get_cycles+0x14/0x20 [ 778.342072] netif_receive_skb_list_internal+0x5f5/0xcb0 [ 778.343288] ? __kasan_kmalloc+0x7a/0x90 [ 778.344234] ? mlx5e_handle_rx_cqe_mpwrq+0x9e0/0x9e0 [mlx5_core] [ 778.345676] ? mlx5e_xmit_xdp_frame_mpwqe+0x14d0/0x14d0 [mlx5_core] [ 778.347140] ? __netif_receive_skb_list_core+0x820/0x820 [ 778.348351] ? mlx5e_post_rx_mpwqes+0xa6/0x25d0 [mlx5_core] [ 778.349688] ? napi_gro_flush+0x26c/0x3c0 [ 778.350641] napi_complete_done+0x188/0x6b0 [ 778.351627] mlx5e_napi_poll+0x373/0x1b80 [mlx5_core] [ 778.352853] __napi_poll+0x9f/0x510 [ 778.353704] ? mlx5_flow_namespace_set_mode+0x260/0x260 [mlx5_core] [ 778.355158] net_rx_action+0x34c/0xa40 [ 778.356060] ? napi_threaded_poll+0x3d0/0x3d0 [ 778.357083] ? sched_clock_cpu+0x18/0x190 [ 778.358041] ? 
__common_interrupt+0x8e/0x1a0 [ 778.359045] __do_softirq+0x1ce/0x984 [ 778.359938] __irq_exit_rcu+0x137/0x1d0 [ 778.360865] irq_exit_rcu+0xa/0x20 [ 778.361708] common_interrupt+0x80/0xa0 [ 778.362640] </IRQ> [ 778.363212] asm_common_interrupt+0x1e/0x40 [ 778.364204] RIP: 0010:native_safe_halt+0xe/0x10 [ 778.365273] Code: 4f ff ff ff 4c 89 e7 e8 50 3f 40 fe e9 dc fe ff ff 48 89 df e8 43 3f 40 fe eb 90 cc e9 07 00 00 00 0f 00 2d 74 05 62 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 64 05 62 00 f4 c3 cc cc 0f 1f 44 00 [ 778.369355] RSP: 0018:ffffffff84407e48 EFLAGS: 00000246 [ 778.370570] RAX: ffff88842de46a80 RBX: ffffffff84425840 RCX: ffffffff83418468 [ 778.372143] RDX: 000000000026f1da RSI: 0000000000000004 RDI: ffffffff8343af5e [ 778.373722] RBP: fffffbfff0884b08 R08: 0000000000000000 R09: ffff88842de46bcb [ 778.375292] R10: ffffed1085bc8d79 R11: 0000000000000001 R12: 0000000000000000 [ 778.376860] R13: ffffffff851124a0 R14: 0000000000000000 R15: dffffc0000000000 [ 778.378491] ? rcu_eqs_enter.constprop.0+0xb8/0xe0 [ 778.379606] ? default_idle_call+0x5e/0xe0 [ 778.380578] default_idle+0xa/0x10 [ 778.381406] default_idle_call+0x96/0xe0 [ 778.382350] do_idle+0x3d4/0x550 [ 778.383153] ? arch_cpu_idle_exit+0x40/0x40 [ 778.384143] cpu_startup_entry+0x19/0x20 [ 778.385078] start_kernel+0x3c7/0x3e5 [ 778.385978] secondary_startup_64_no_verify+0xb0/0xbb Fix the issue by providing new function tc_skb_ext_alloc() that allocates tc skb extension and initializes its memory to 0 before returning it to the caller. Change all existing users to use new API instead of calling skb_ext_add() directly. Fixes: `038ebb1a71` ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct") Fixes: `d29334c15d` ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-25 15:36:42 -07:00
Ido Schimmel	daeabf89eb	mlxsw: spectrum_router: Add support for custom multipath hash policy When this policy is set, only enable the packet fields that were enabled by user space for multipath hash computation. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-19 12:47:47 -07:00
Ido Schimmel	01848e05f8	mlxsw: spectrum_router: Add support for inner layer 3 multipath hash policy When this policy is set, the kernel uses the inner layer 3 fields for multipath hash computation and falls back to the outer fields if no encapsulation was encountered. This behavior is most likely influenced by the behavior of the flow dissector, which is used for the packet dissection. The Spectrum ASIC, however, cannot fall back to outer fields if inner fields are not available. This should not result in a discrepancy from the software data path because if several flows have matching inner fields, they will tend to have matching outer fields as well. Therefore, implement this policy by enabling both outer and inner layer 3 fields for the multipath hash computation. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-19 12:47:47 -07:00
Ido Schimmel	b7b8f435ea	mlxsw: spectrum_router: Factor out helper for common outer fields Outer IPv4 and IPv6 addresses are used by multiple multipath hash policies. Factor out helpers that set these fields to increase code sharing between different policies. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-19 12:47:47 -07:00
Ido Schimmel	28bc824807	mlxsw: reg: Add inner packet fields to RECRv2 register The RECRv2 register is used for setting up the router's ECMP hash configuration. Extend it with inner packet fields to allow the ECMP hash to be calculated based on inner flow information. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-19 12:47:47 -07:00
Ido Schimmel	9d23d3eb6f	mlxsw: spectrum_router: Move multipath hash configuration to a bitmap Currently, the multipath hash configuration is written directly to the register payload. While this is OK for the two currently supported policies, it is going to be hard to follow when more policies and more packet fields are added. Instead, set the required headers and fields in a bitmap and then dump it to the register payload. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-19 12:47:47 -07:00
Ido Schimmel	7725c1c8f7	mlxsw: spectrum_router: Replace if statement with a switch statement The code was written when only two multipath hash policies were present, so the if statement was sufficient. The next patch and future patches are going to add support for more policies, so move to a switch statement. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-19 12:47:47 -07:00
Jakub Kicinski	e63052a5dd	mlx5e: add missing BH locking around napi_schedule() It's not correct to call napi_schedule() in pure process context. Because we use __raise_softirq_irqoff() we require callers to be in a context which will eventually lead to softirq handling (hardirq, bh disabled, etc.). With code as is users will see: NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! Fixes: `a8dd7ac12f` ("net/mlx5e: Generalize RQ activation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:55 -07:00
Ariel Levkovich	6ff51ab8aa	net/mlx5: Set term table as an unmanaged flow table Termination tables are restricted to have the default miss action and cannot be set to forward to another table in case of a miss. If the fs prio of the termination table is not the last one in the list, fs_core will attempt to attach it to another table. Set the unmanaged ft flag when creating the termination table ft and select the tc offload prio for it to prevent fs_core from selecting the forwarding to next ft miss action and use the default one. In addition, set the flow that forwards to the termination table to ignore ft level restrictions since the ft level is not set by fs_core for unmanaged fts. Fixes: `249ccc3c95` ("net/mlx5e: Add support for offloading traffic from uplink to uplink") Signed-off-by: Ariel Levkovich <lariel@nvidia.com>	2021-05-18 23:01:53 -07:00
Leon Romanovsky	75e8564e91	net/mlx5: Don't overwrite HCA capabilities when setting MSI-X count During driver probe of device that has dynamic MSI-X feature enabled, the following error is printed in some FW flavour (not released yet). mlx5_core 0000:06:00.0: firmware version: 4.7.4387 mlx5_core 0000:06:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link) mlx5_core 0000:06:00.0: mlx5_cmd_check:777:(pid 70599): SET_HCA_CAP(0x109) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x0) mlx5_core 0000:06:00.0: set_hca_cap:622:(pid 70599): handle_hca_cap failed mlx5_core 0000:06:00.0: mlx5_function_setup:1045:(pid 70599): set_hca_cap failed mlx5_core 0000:06:00.0: probe_one:1465:(pid 70599): mlx5_init_one failed with error code -22 mlx5_core: probe of 0000:06:00.0 failed with error -22 In order to make the setting capability of MSI-X future proof, let's query the current capabilities first. Fixes: `604774add5` ("net/mlx5: Dynamically assign MSI-X vectors count") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:51 -07:00
Eli Cohen	7c9f131f36	{net,vdpa}/mlx5: Configure interface MAC into mpfs L2 table net/mlx5: Expose MPFS configuration API MPFS is the multi physical function switch that bridges traffic between the physical port and any physical functions associated with it. The driver is required to add or remove MAC entries to properly forward incoming traffic to the correct physical function. We export the API to control MPFS so that other drivers, such as mlx5_vdpa are able to add MAC addresses of their network interfaces. The MAC address of the vdpa interface must be configured into the MPFS L2 address. Failing to do so could cause, in some NIC configurations, failure to forward packets to the vdpa network device instance. Fix this by adding calls to update the MPFS table. CC: <mst@redhat.com> CC: <jasowang@redhat.com> CC: <virtualization@lists.linux-foundation.org> Fixes: `1a86b377aa` ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:48 -07:00
Aya Levin	5e7923acbd	net/mlx5e: Fix error path of updating netdev queues Avoid division by zero in the error flow. In the driver, the TC number can be either 1 or 8. When the TC count is set to 1, the driver zeroes netdev->num_tc. Hence, it needs to be converted back from 0 to 1 in the error flow. Fixes: `fa3748775b` ("net/mlx5e: Handle errors from netif_set_real_num_{tx,rx}_queues") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:46 -07:00
Vlad Buslov	7d1a3d08c8	net/mlx5e: Reject mirroring on source port change encap rules Rules with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE dest flag are translated to destination FT in eswitch. Currently it is not possible to mirror such rules because firmware doesn't support mixing FT and Vport destinations in single rule when one of them adds encapsulation. Since the only use case for MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE destination is support for tunnel endpoints on VF and trying to offload such rule with mirror action causes either crash in fs_core or firmware error with syndrome 0xff6a1d, reject all such rules in mlx5 TC layer. Fixes: `10742efc20` ("net/mlx5e: VF tunnel TX traffic offloading") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:43 -07:00
Dima Chumak	97817fcc68	net/mlx5e: Fix multipath lag activation When handling FIB_EVENT_ENTRY_REPLACE event for a new multipath route, lag activation can be missed if a stale (struct lag_mp)->mfi pointer exists, which was associated with an older multipath route that had been removed. Normally, when a route is removed, it triggers mlx5_lag_fib_event(), which handles FIB_EVENT_ENTRY_DEL and clears mfi pointer. But, if mlx5_lag_check_prereq() condition isn't met, for example when eswitch is in legacy mode, the fib event is skipped and mfi pointer becomes stale. Fix by resetting mfi pointer to NULL every time mlx5_lag_mp_init() is called. Fixes: `544fe7c2e6` ("net/mlx5e: Activate HW multipath and handle port affinity based on FIB events") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:41 -07:00
Saeed Mahameed	77ecd10d0a	net/mlx5e: reset XPS on error flow if netdev isn't registered yet mlx5e_attach_netdev can be called prior to registering the netdevice: Example stack: ipoib_new_child_link -> ipoib_intf_init-> rdma_init_netdev-> mlx5_rdma_setup_rn-> mlx5e_attach_netdev-> mlx5e_num_channels_changed -> mlx5e_set_default_xps_cpumasks -> netif_set_xps_queue -> __netif_set_xps_queue -> kmalloc If any later stage fails at any point after mlx5e_num_channels_changed() returns, XPS allocated maps will never be freed as they are only freed during netdev unregistration, which will never happen for yet to be registered netdevs. Fixes: `3909a12e79` ("net/mlx5e: Fix configuration of XPS cpumasks and netdev queues in corner cases") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>	2021-05-18 23:01:38 -07:00
Roi Dayan	eb96cc1592	net/mlx5e: Make sure fib dev exists in fib event For an unreachable route entry the fib dev does not exist. Fixes: `8914add2c9` ("net/mlx5e: Handle FIB events to update tunnel endpoint device") Reported-by: Dennis Afanasev <dennis.afanasev@stateless.net> Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:36 -07:00
Roi Dayan	83026d8318	net/mlx5e: Fix null deref accessing lag dev The lag dev may be null, in which case stop processing the event. In bond_enslave() the active/backup slave is set before the upper dev, so the first event arrives without an upper dev. After setting the upper dev with bond_master_upper_dev_link() there is a second event and in that event we have an upper dev. Fixes: `7e51891a23` ("net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:34 -07:00
Dima Chumak	fe7738eb3c	net/mlx5e: Fix nullptr in mlx5e_tc_add_fdb_flow() The result of __dev_get_by_index() is not checked for NULL, which is then passed to mlx5e_attach_encap() and gets dereferenced. Also, in case of a successful lookup, the net_device reference count is not incremented, which may result in net_device pointer becoming invalid at any time during mlx5e_attach_encap() execution. Fix by using dev_get_by_index(), which does proper reference counting on the net_device pointer. Also, handle nullptr return value when mirred device is not found. It's safe to call dev_put() on the mirred net_device pointer, right after mlx5e_attach_encap() call, because it's not being saved/copied down the call chain. Fixes: `3c37745ec6` ("net/mlx5e: Properly deal with encap flows add/del under neigh update") Addresses-Coverity: ("Dereference null return value") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:31 -07:00
Parav Pandit	82041634d9	net/mlx5: SF, Fix show state inactive when it is inactivated When a SF is inactivated and when it is in a TEARDOWN_REQUEST state, driver still returns its state as active. This is incorrect. Fix it by treating TEARDOWN_REQUEST as inactive state. When a SF is still attached to the driver, a user request to reactivate it returns EINVAL. Report this with the more accurate error code EBUSY and an informative error message. Fixes: `6a32732174` ("net/mlx5: SF, Port function state change support") Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:29 -07:00
Roi Dayan	fca086617a	net/mlx5: Fix err prints and return when creating termination table Fix print to print correct error code and not using IS_ERR() which will just result in always printing 1. Also return real err instead of always -EOPNOTSUPP. Fixes: `10caabdaad` ("net/mlx5e: Use termination table for VLAN push actions") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:26 -07:00
Jianbo Liu	442b3d7b67	net/mlx5: Set reformat action when needed for termination rules For remote mirroring, after the tunnel packets are received, they are decapsulated and sent to representor, then re-encapsulated and sent out over another tunnel. So reformat action is set only when the destination is required to do encapsulation. Fixes: `249ccc3c95` ("net/mlx5e: Add support for offloading traffic from uplink to uplink") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:24 -07:00
Dima Chumak	dca59f4a79	net/mlx5e: Fix nullptr in add_vlan_push_action() The result of dev_get_by_index_rcu() is not checked for NULL and then gets dereferenced immediately. Also, the RCU lock must be held by the caller of dev_get_by_index_rcu(), which isn't satisfied by the call stack. Fix by handling nullptr return value when iflink device is not found. Add RCU locking around dev_get_by_index_rcu() to avoid possible adverse effects while iterating over the net_device's hlist. It is safe not to increment reference count of the net_device pointer in case of a successful lookup, because it's already handled by VLAN code during VLAN device registration (see register_vlan_dev and netdev_upper_dev_link). Fixes: `278748a95a` ("net/mlx5e: Offload TC e-switch rules with egress VLAN device") Addresses-Coverity: ("Dereference null return value") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:21 -07:00
Maor Gottlieb	3410fbcd47	{net, RDMA}/mlx5: Fix override of log_max_qp by other device mlx5_core_dev holds a pointer to a static profile, hence when the log_max_qp of the profile is overridden by some device, it affects all other mlx5 devices that share the same profile. Fix it by having a profile instance for every mlx5 device. Fixes: `883371c453` ("net/mlx5: Check FW limitations on log_max_qp before setting it") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-05-18 23:01:19 -07:00
Amit Cohen	b0d80c013b	mlxsw: Remove Mellanox SwitchX-2 ASIC support Initial support for the Mellanox SwitchX-2 ASIC was added in July 2015. Since then all development efforts shifted towards the Mellanox Spectrum ASICs and development of this driver stopped, aside from trivial fixes and refactoring. Therefore, the driver does not support any switch offloads and simply traps all traffic to the CPU, rendering it irrelevant for deployment. In addition, support for this ASIC was dropped by Mellanox a few years ago. Given the driver is not used by any users and that there is no intention of investing in its development, remove it from the kernel. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-17 15:15:46 -07:00
Amit Cohen	9b43fbb8ce	mlxsw: Remove Mellanox SwitchIB ASIC support Initial support for the Mellanox SwitchIB and SwitchIB-2 ASICs was added in October 2016, but since then development of this driver stopped. Therefore, the driver does not support any offloads and simply registers devlink ports for its front panel ports, rendering it irrelevant for deployment. Given the driver is not used by any users and that there is no intention of investing in its development, remove it from the kernel. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-17 15:15:46 -07:00
Ido Schimmel	51746a353b	mlxsw: spectrum_router: Avoid missing error code warning Explicitly set the error code to zero before the goto statement to avoid the following smatch warning: drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:3598 mlxsw_sp_nexthop_group_refresh() warn: missing error code 'err' The warning is a false positive, but the change both suppresses the warning and makes it clear to future readers that this is not an error path. The original report and discussion can be found here [1]. [1] https://lore.kernel.org/lkml/202105141823.Td2h3Mbi-lkp@intel.com/ Cc: Dan Carpenter <dan.carpenter@oracle.com> Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-17 15:15:46 -07:00
Ido Schimmel	8c2b58e65d	mlxsw: core: Avoid unnecessary EMAD buffer copy mlxsw_emad_transmit() takes care of sending EMAD transactions to the device. Since these transactions can time out, the driver performs up to 5 retransmissions, each time copying the skb with the original request. The data of the skb does not change throughout the process, so there is no need to copy it each time. Instead, only the skb itself can be copied. Therefore, use skb_clone() instead of skb_copy(). This reduces the latency of the function by about 16%. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-17 15:15:46 -07:00
Danielle Ratson	837ec05cfe	mlxsw: Verify the accessed index doesn't exceed the array length There are a few cases in which an array index queried from a fw register is used without any validation that it doesn't exceed the array length. Add a proper length validation, so that memory past the end of an array is never accessed. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-17 15:15:46 -07:00
Danielle Ratson	ece5df874d	mlxsw: spectrum_buffers: Switch function arguments In the call path: mlxsw_sp_hdroom_bufs_reset_sizes() mlxsw_sp_hdroom_int_buf_size_get() ->int_buf_size_get() The 'speed' and 'mtu' arguments were mistakenly switched twice. The two bugs thus canceled each other. Clean this up by switching the arguments in both call sites, so that they are passed in the right order. Found during manual code inspection. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-17 15:15:46 -07:00
Vladyslav Tarasiuk	db825feefc	net/mlx4: Fix EEPROM dump support Fix SFP and QSFP* EEPROM queries by setting i2c_address, offset and page number correctly. For SFP set the following params: - I2C address for offsets 0-255 is 0x50. For 256-511 - 0x51. - Page number is zero. - Offset is 0-255. At the same time, QSFP* parameters are different: - I2C address is always 0x50. - Page number is not limited to zero. - Offset is 0-255 for page zero and 128-255 for others. To set parameters according to the cable used, implement a function to query the module ID and implement respective helper functions to set parameters correctly. Fixes: `135dd9594f` ("net/mlx4_en: ethtool, Remove unsupported SFP EEPROM high pages query") Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-10 14:34:39 -07:00
Linus Torvalds	fc858a5231	Networking fixes for 5.13-rc1, including fixes from bpf, can and netfilter trees. Self-contained fixes, nothing risky. Current release - new code bugs: - dsa: ksz: fix a few bugs found by static-checker in the new driver - stmmac: fix frame preemption handshake not triggering after interface restart Previous releases - regressions: - make nla_strcmp handle more than one trailing null character - fix stack OOB reads while fragmenting IPv4 packets in openvswitch and net/sched - sctp: do asoc update earlier in sctp_sf_do_dupcook_a - sctp: delay auto_asconf init until binding the first addr - stmmac: clear receive all(RA) bit when promiscuous mode is off - can: mcp251x: fix resume from sleep before interface was brought up Previous releases - always broken: - bpf: fix leakage of uninitialized bpf stack under speculation - bpf: fix masking negation logic upon negative dst register - netfilter: don't assume that skb_header_pointer() will never fail - only allow init netns to set default tcp cong to a restricted algo - xsk: fix xp_aligned_validate_desc() when len == chunk_size to avoid false positive errors - ethtool: fix missing NLM_F_MULTI flag when dumping - can: m_can: m_can_tx_work_queue(): fix tx_skb race condition - sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b - bridge: fix NULL-deref caused by a race between assigning rx_handler_data and setting the IFF_BRIDGE_PORT bit Latecomer: - seg6: add counters support for SRv6 Behaviors Signed-off-by: Jakub Kicinski <kuba@kernel.org> Merge tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski. * tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits) atm: firestream: Use fallthrough pseudo-keyword net: stmmac: Do not enable RX FIFO overflow interrupts mptcp: fix splat when closing unaccepted socket i40e: Remove LLDP frame filters i40e: Fix PHY type identifiers for 2.5G and 5G adapters i40e: fix the restart auto-negotiation after FEC modified i40e: Fix use-after-free in i40e_client_subtask() i40e: fix broken XDP support netfilter: nftables: avoid potential overflows on 32bit arches netfilter: nftables: avoid overflows in nft_hash_buckets() tcp: Specify cmsgbuf is user pointer for receive zerocopy. mlxsw: spectrum_mr: Update egress RIF list before route's action net: ipa: fix inter-EE IRQ register definitions can: m_can: m_can_tx_work_queue(): fix tx_skb race condition can: mcp251x: fix resume from sleep before interface was brought up can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path can: mcp251xfd: mcp251xfd_probe(): fix an error pointer dereference in probe netfilter: nftables: Fix a memleak from userdata error path in new objects netfilter: remove BUG_ON() after skb_header_pointer() netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check ...	2021-05-08 08:31:46 -07:00
Ido Schimmel	cbaf3f6af9	mlxsw: spectrum_mr: Update egress RIF list before route's action Each multicast route that is forwarding packets (as opposed to trapping them) points to a list of egress router interfaces (RIFs) through which packets are replicated. A route's action can transition from trap to forward when a RIF is created for one of the route's egress virtual interfaces (eVIF). When this happens, the route's action is first updated and only later the list of egress RIFs is committed to the device. This results in the route pointing to an invalid list. In case the list pointer is out of range (due to uninitialized memory), the device will complain: mlxsw_spectrum2 0000:06:00.0: EMAD reg access failed (tid=5733bf490000905c,reg_id=300f(pefa),type=write,status=7(bad parameter)) Fix this by first committing the list of egress RIFs to the device and only later update the route's action. Note that a fix is not needed in the reverse function (i.e., mlxsw_sp_mr_route_evif_unresolve()), as there the route's action is first updated and only later the RIF is removed from the list. Cc: stable@vger.kernel.org Fixes: `c011ec1bbf` ("mlxsw: spectrum: Add the multicast routing offloading logic") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/20210506072308.3834303-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-06 17:19:29 -07:00
Linus Torvalds	583f2bcf86	- Remove duplicate error message for the amlogic driver (Tang Bin) - Fix spellos in comments for the tegra and sun8i (Bhaskar Chowdhury) - Add the missing fifth node on the rcar_gen3 sensor (Niklas Söderlund) - Remove duplicate include in ti-bandgap (Zhang Yunkai) - Assign error code in the error path in the function thermal_of_populate_bind_params() (Jia-Ju Bai) - Fix spelling mistake in a comment 'disabed' -> 'disabled' (Colin Ian King) - Use the device name instead of auto-numbering for a better identification of the cooling device (Daniel Lezcano) - Improve a bit the division accuracy in the power allocator governor (Jeson Gao) - Enable the missing third sensor on msm8976 (Konrad Dybcio) - Add QCom tsens driver co-maintainer (Thara Gopinath) - Fix memory leak and use after free errors in the core code (Daniel Lezcano) - Add the MDM9607 compatible bindings (Konrad Dybcio) - Fix trivial spello in the copyright name for Hisilicon (Hao Fang) - Fix negative index array access when converting the frequency to power in the energy model (Brian-sy Yang) - Add support for Gen2 new PMIC support for Qcom SPMI (David Collins) - Update maintainer file for CPU cooling device section (Lukasz Luba) - Fix missing put_device on error in the Qcom tsens driver (Guangqing Zhu) - Add compatible DT binding for sm8350 (Robert Foss) - Add support for the MDM9607's tsens driver (Konrad Dybcio) - Remove duplicate error messages in thermal_mmio and the bcm2835 driver (Ruiqi Gong) - Add the Thermal Temperature Cooling driver (Zhang Rui) - Remove duplicate error messages in the Hisilicon sensor driver (Ye Bin) - Use the devm_platform_ioremap_resource_byname() function instead of a couple of corresponding calls (dingsenjie) - Sort the headers alphabetically in the ti-bandgap driver (Zhen Lei) - Add missing property in the DT thermal sensor binding (Rafał Miłecki) - Remove dead code in the ti-bandgap sensor driver (Lin Ruizhe) - Convert the BRCM DT bindings to the yaml schema (Rafał Miłecki) - Replace the thermal_notify_framework() call by a call to the thermal_zone_device_update() function. Remove the function as well as the corresponding documentation (Thara Gopinath) - Add support for the ipq8064-tsens sensor along with a set of cleanups and code preparation (Ansuel Smith) - Add a lockless __thermal_cdev_update() function to improve the locking scheme in the core code and governors (Lukasz Luba) - Fix multiple cooling device notification changes (Lukasz Luba) - Remove unneeded variable initialization (Colin Ian King) Merge tag 'thermal-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux Pull thermal updates from Daniel Lezcano. * tag 'thermal-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (55 commits) thermal/drivers/mtk_thermal: Remove redundant initializations of several variables thermal/core/power allocator: Use the lockless __thermal_cdev_update() function thermal/core/fair share: Use the lockless __thermal_cdev_update() function thermal/core/fair share: Lock the thermal zone while looping over instances thermal/core/power_allocator: Update once cooling devices when temp is low thermal/core/power_allocator: Maintain the device statistics from going stale thermal/core: Create a helper __thermal_cdev_update() without a lock dt-bindings: thermal: tsens: Document ipq8064 bindings thermal/drivers/tsens: Add support for ipq8064-tsens thermal/drivers/tsens: Drop unused define for msm8960 thermal/drivers/tsens: Replace custom 8960 apis with generic apis thermal/drivers/tsens: Fix bug in sensor enable for msm8960 thermal/drivers/tsens: Use init_common for msm8960 thermal/drivers/tsens: Add VER_0 tsens version thermal/drivers/tsens: Convert msm8960 to reg_field thermal/drivers/tsens: Don't hardcode sensor slope Documentation: driver-api: thermal: Remove thermal_notify_framework from documentation thermal/core: Remove thermal_notify_framework iwlwifi: mvm: tt: Replace thermal_notify_framework dt-bindings: thermal: brcm,ns-thermal: Convert to the json-schema ...	2021-05-05 12:46:48 -07:00
Linus Torvalds	f34b2cf178	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "This is significantly bug fixes and general cleanups. The noteworthy new features are fairly small: - XRC support for HNS and improves RQ operations - Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw and qib - Quite a few general cleanups on spelling, error handling, static checker detections, etc - Increase the number of device ports supported beyond 255. 
High port count software switches now exist - Several bug fixes for rtrs - mlx5 Device Memory support for host controlled atomics - Report SRQ tables through to rdma-tool" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (145 commits) IB/qib: Remove redundant assignment to ret RDMA/nldev: Add copy-on-fork attribute to get sys command RDMA/bnxt_re: Fix a double free in bnxt_qplib_alloc_res RDMA/siw: Fix a use after free in siw_alloc_mr IB/hfi1: Remove redundant variable rcd RDMA/nldev: Add QP numbers to SRQ information RDMA/nldev: Return SRQ information RDMA/restrack: Add support to get resource tracking for SRQ RDMA/nldev: Return context information RDMA/core: Add CM to restrack after successful attachment to a device RDMA/cma: Skip device which doesn't support CM RDMA/rxe: Fix a bug in rxe_fill_ip_info() RDMA/mlx5: Expose private query port RDMA/mlx4: Remove an unused variable RDMA/mlx5: Fix type assignment for ICM DM IB/mlx5: Set right RoCE l3 type and roce version while deleting GID RDMA/i40iw: Fix error unwinding when i40iw_hmc_sd_one fails RDMA/cxgb4: add missing qpid increment IB/ipoib: Remove unnecessary struct declaration RDMA/bnxt_re: Get rid of custom module reference counting ...	2021-05-01 09:15:05 -07:00
Parav Pandit	f1b9acd3a5	net/mlx5: SF, Extend SF table for additional SF id range Extended the SF table to cover additional SF id range of external controller. A user optionally provides the external controller number when user wants to create SF on the external controller. An example on eswitch system: $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev $ devlink port show pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1 pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 external true splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:59:07 -07:00
Parav Pandit	a3088f87d9	net/mlx5: SF, Split mlx5_sf_hw_table into two parts The device has SF ids in two different contiguous ranges: one for the local controller and a second for the external controller's PF. Each such range has its own maximum number of functions and base id. To allocate an SF from either range, prepare the code by splitting the range-specific fields into their own structure. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:59:04 -07:00
Parav Pandit	01ed9550e8	net/mlx5: SF, Use helpers for allocation and free Use helper routines for SF id and SF table allocation and free so that subsequent patch can reuse it for multiple SF function id range. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:59:01 -07:00
Parav Pandit	326c08a020	net/mlx5: SF, Consider own vhca events of SF devices Vhca events on eswitch manager are received for all the functions on the NIC, including for SFs of external host PF controllers. The SF device handler, however, is only interested in SF device events related to its own PF. Hence, validate whether the function belongs to self or not. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:59 -07:00
Parav Pandit	7e6ccbc187	net/mlx5: SF, Store and use start function id SF ids in the device are in two different contiguous ranges. One for the local controller and second for the external host controller. Prepare code to handle multiple start function id by storing it in the table. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:56 -07:00
Parav Pandit	a1ab3e4554	devlink: Extend SF port attributes to have external attribute Extended SF port attributes to have optional external flag similar to PCI PF and VF port attributes. External attribute is required to generate unique phys_port_name when PF number and SF number are overlapping between two controllers similar to SR-IOV VFs. An example view of an external SF port and its config sequence, when an SF is for an external controller, on an eswitch system: $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev $ devlink port show pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1 pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached phys_port_name construction: $ cat /sys/class/net/eth1/phys_port_name c1pf0sf77 Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:53 -07:00
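The phys_port_name shown in the commit above (c1pf0sf77) can be read as an optional controller prefix followed by the PF and SF numbers. The following is a minimal userspace sketch of that naming scheme, inferred only from the example in the message; the function name sf_phys_port_name() and its signature are hypothetical and are not the devlink implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a devlink function): build a name such as
 * "c1pf0sf77" for an SF port. The "c<controller>" prefix is emitted only
 * for ports flagged external, which is what keeps names unique when two
 * controllers use overlapping PF and SF numbers.
 * Returns the name length, or -1 if the buffer is too small.
 */
int sf_phys_port_name(char *buf, size_t len, bool external,
                      unsigned int controller, unsigned int pfnum,
                      unsigned int sfnum)
{
    int n = 0;
    int m;

    assert(buf != NULL && len > 0);

    if (external) {
        n = snprintf(buf, len, "c%u", controller);
        if (n < 0 || (size_t)n >= len)
            return -1;
    }

    m = snprintf(buf + n, len - (size_t)n, "pf%usf%u", pfnum, sfnum);
    if (m < 0 || (size_t)m >= len - (size_t)n)
        return -1;

    return n + m;
}
```

Calling it with external set, controller 1, pfnum 0 and sfnum 77 reproduces the string from the commit message; with external cleared, the same numbers give a name without the controller prefix.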
Parav Pandit	1d7979352f	net/mlx5: SF, Rely on hw table for SF devlink port allocation Supporting SF allocation is currently checked at two places: (a) SF devlink port allocator and (b) SF HW table handler. Both layers are using HCA CAP to identify it using helper routine mlx5_sf_supported() and mlx5_sf_max_functions(). Instead, rely on the HW table handler to check if SF is supported or not. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:51 -07:00
Parav Pandit	87bd418ea7	net/mlx5: E-Switch, Consider SF ports of host PF Query SF vports count and base id of host PF from the firmware. Account these ports in the total port calculation whenever it is non zero. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:48 -07:00
Parav Pandit	47dd7e609f	net/mlx5: E-Switch, Use xarray for vport number to vport and rep mapping Currently vport number to vport and its representor are mapped using an array and an index. Vport numbers of different types of functions are not contiguous. Adding new such discontiguous range using index and number mapping is increasingly complex and hard to maintain. Hence, maintain an xarray of vport and rep whose lookup is done based on the vport number. Each VF and SF entry is marked with an xarray mark to identify the function type. Additionally PF and VF need special handling for legacy inline mode. They are additionally marked as host function using additional HOST_FN mark. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:45 -07:00
Parav Pandit	9f8c7100c8	net/mlx5: E-Switch, Prepare to return total vports from eswitch struct Total vports are already stored during eswitch initialization. Instead of calculating every time, read directly from eswitch. Additionally, host PF's SF vport information is available using QUERY_HCA_CAP command. It is not available through HCA_CAP of the eswitch manager PF. Hence, this patch prepares to return the total eswitch vport count from the existing eswitch struct. This further helps to keep eswitch port counting macros and logic within eswitch. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:43 -07:00
Parav Pandit	06ec5acc77	net/mlx5: E-Switch, Return eswitch max ports when eswitch is supported mlx5_eswitch_get_total_vports() doesn't honor MLX5_ESWITCH Kconfig flag. When MLX5_ESWITCH is disabled, FS layer continues to initialize eswitch specific ACL namespaces. Instead, start honoring MLX5_ESWITCH flag and perform vport specific initialization only when vport count is non zero. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-24 00:58:40 -07:00
Hans Westgaard Ry	79ebfb11fe	net/mlx4: Treat VFs fair when handling comm_channel_events Handling comm_channel_event in mlx4_master_comm_channel uses a double loop to determine which slaves have requested work. The search is always started at lowest slave. This leads to unfairness; lower VFs tend to be prioritized over higher VFs. The patch uses find_next_bit to determine which slaves to handle. Fairness is implemented by always starting at the next to the last start. An MPI program has been used to measure improvements. It runs 500 ibv_reg_mr, synchronizes with all other instances and then runs 500 ibv_dereg_mr. The results running 500 processes, time reported is for running 500 calls: ibv_reg_mr: Mod. Org. mlx4_1 403.356ms 424.674ms mlx4_2 403.355ms 424.674ms mlx4_3 403.354ms 424.674ms mlx4_4 403.355ms 424.674ms mlx4_5 403.357ms 424.677ms mlx4_6 403.354ms 424.676ms mlx4_7 403.357ms 424.675ms mlx4_8 403.355ms 424.675ms ibv_dereg_mr: Mod. Org. mlx4_1 116.408ms 142.818ms mlx4_2 116.434ms 142.793ms mlx4_3 116.488ms 143.247ms mlx4_4 116.679ms 143.230ms mlx4_5 112.017ms 107.204ms mlx4_6 112.032ms 107.516ms mlx4_7 112.083ms 184.195ms mlx4_8 115.089ms 190.618ms Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-22 14:59:26 -07:00
Petr Machata	7de85b0431	mlxsw: spectrum_qdisc: Index future FIFOs by band number mlxsw used to hold an array of qdiscs indexed by the TC number. In the previous patch, it was changed to allocate child qdiscs dynamically, and they are now indexed by band number. Follow suit with the array of future FIFOs. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	5cbd960253	mlxsw: spectrum_qdisc: Allocate child qdiscs dynamically Instead of keeping qdiscs in globally-preallocated arrays, introduce a per-qdisc-kind value num_classes, and then allocate the necessary child qdiscs (if any) based on that value. Since now dynamic allocation is involved, mlxsw_sp_qdisc_replace() gets messy enough that it is worth it to split it to two cases: a new qdisc allocation and a change of existing qdisc. (Note that the change also includes what TC formally calls replace, if the qdisc kind is the same.) Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	cff99e2045	mlxsw: spectrum_qdisc: Guard all qdisc accesses with a lock The FIFO handler currently guards accesses to the future FIFO tracking by asserting RTNL. In the future, the changes to the qdisc state will be more thorough, so other qdiscs will need this guarding as well. In order to not further the RTNL infestation, instead convert to a custom lock that will guard accesses to the qdisc state. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	51d52ed955	mlxsw: spectrum_qdisc: Track children per qdisc mlxsw currently allows a two-level structure of qdiscs: the root and possibly a number of children. In order to support offloading more general qdisc trees, introduce to struct mlxsw_sp_qdisc a pointer to child qdiscs. Refer to the child qdiscs through this pointer, instead of going through the tclass_qdiscs in qdisc_state. Additionally introduce a field num_classes, which holds number of given qdisc's children. Also introduce a generic function for walking qdisc trees. Rewrite mlxsw_sp_qdisc_find() and _find_by_handle() to use the generic walker. For now, keep the qdisc_state.tclass_qdisc, and just point root_qdiscs's children to this array. Following patches will make the allocation dynamic. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	b21832b568	mlxsw: spectrum_qdisc: Promote backlog reduction to mlxsw_sp_qdisc_destroy() When a qdisc is removed, it is necessary to update the backlog value at its parent--unless the qdisc is at root position. RED, TBF and FIFO all do that, each separately. Since all of them need to do this, just promote the operation directly to mlxsw_sp_qdisc_destroy(), instead of deferring it to individual destructors. Since FIFO dtor thus becomes trivial, remove it. Add struct mlxsw_sp_qdisc.parent to point at the parent qdisc. This will be handy later as deeper structures are offloaded. Use the parent qdisc to find the chain of parents whose backlog value needs to be updated. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
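The parent-chain update described in the commit above (on removal of a qdisc, subtract its backlog from every ancestor) can be sketched as follows. The struct qdisc_node and the function name are hypothetical simplifications for illustration; the real struct mlxsw_sp_qdisc carries more state and the driver code may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an offloaded qdisc. */
struct qdisc_node {
    struct qdisc_node *parent; /* NULL for the root qdisc */
    uint64_t backlog;          /* includes the backlog of all children */
};

/*
 * When q is destroyed, its backlog no longer contributes to any ancestor,
 * so walk the chain of parents and subtract it from each one. A root qdisc
 * has no parent, so the loop body never runs for it.
 */
void qdisc_reduce_parent_backlog(const struct qdisc_node *q)
{
    assert(q != NULL);

    for (struct qdisc_node *p = q->parent; p != NULL; p = p->parent)
        p->backlog -= q->backlog;
}
```

Doing this once in the common destroy path, rather than in each qdisc kind's destructor, is the consolidation the commit message describes.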
Petr Machata	017a131cde	mlxsw: spectrum_qdisc: Track tclass_num as int, not u8 tclass_num is just a number, a value that would be ordinarily passed around as an int. (Which is unlike a u8 prio_bitmap.) In several places, tclass_num already is an int. Convert the remaining instances. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	549f2aae84	mlxsw: spectrum_qdisc: Drop an always-true condition The function mlxsw_sp_qdisc_compare() is invoked a couple lines above this check, which will bounce any requests where this condition does not hold. Therefore drop it. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	290fe2c595	mlxsw: spectrum_qdisc: Simplify mlxsw_sp_qdisc_compare() The purpose of this function is to filter out events that are related to qdiscs that are not offloaded, or are not offloaded anymore. But the function is unnecessarily thorough: - mlxsw_sp_qdisc pointer is never NULL in the context where it is called - Two qdiscs with the same handle will never have different types. Even when replacing one qdisc with another in the same class, Linux will not permit handle reuse unless the qdisc type also matches. Simplify the function by omitting these two unnecessary conditions. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Petr Machata	17c0e6d175	mlxsw: spectrum_qdisc: Drop one argument from check_params callback The mlxsw_sp_qdisc argument is not used in any of the actual callbacks. Drop it. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-20 16:43:13 -07:00
Parav Pandit	dedbc2d358	IB/mlx5: Set right RoCE l3 type and roce version while deleting GID Currently, when a GID is deleted, all the fields of the RoCE address are zeroed out in the SET_ROCE_ADDRESS command for a specified index. roce_version = 0 means RoCEv1 in the SET_ROCE_ADDRESS command. This assumes that device has RoCEv1 always enabled which is not always correct. For example Subfunction does not support RoCEv1. Due to this assumption a previously added RoCEv2 GID is always deleted as RoCEv1 GID. This results in the syndrome below: mlx5_core.sf mlx5_core.sf.4: mlx5_cmd_check:777:(pid 4256): SET_ROCE_ADDRESS(0x761) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x12822d) Hence set the right RoCE version during GID deletion provided by the core. Link: https://lore.kernel.org/r/d3f54129c90ca329caf438dbe31875d8ad08d91a.1618753425.git.leonro@nvidia.com Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2021-04-20 09:41:10 -03:00
Yevgeny Kliteynik	aeacb52a8d	net/mlx5: DR, Add support for isolate_vl_tc QP When using SW steering, rule insertion rate depends on the RDMA RC QP performance used for writing to the ICM. During stress this QP is competing on the HW resources with all the other QPs that are used to send data. To protect SW steering QP's performance in such cases, we set this QP to use isolated VL. The VL number is reserved by FW and is not exposed to the driver. Support for this QP on isolated VL exists only when both force-loopback and isolate_vl_tc capabilities are set. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:46 -07:00
Yevgeny Kliteynik	7304d603a5	net/mlx5: DR, Add support for force-loopback QP When supported by the device, SW steering RoCE RC QP that is used to write/read to/from ICM will be created with force-loopback attribute. Such QP doesn't require GID index upon creation. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:43 -07:00
Yevgeny Kliteynik	df9dd15ae1	net/mlx5: DR, Add support for matching tunnel GTP-U Enable matching on tunnel GTP-U and GTP-U first extension header using dynamic flex parser. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:40 -07:00
Yevgeny Kliteynik	35ba005d82	net/mlx5: DR, Set flex parser for TNL_MPLS dynamically Query the flex_parser id that's intended for TNL_MPLS and use an appropriate flex parser for MPLS over UDP/GRE. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:37 -07:00
Yevgeny Kliteynik	3442e0335e	net/mlx5: DR, Add support for matching on geneve TLV option Enable matching on tunnel geneve TLV option using the flex parser. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:34 -07:00
Yevgeny Kliteynik	4923938d2f	net/mlx5: DR, Set STEv0 ICMP flex parser dynamically Set the flex parser ID dynamically for ICMP instead of relying on hardcoded values. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:31 -07:00
Yevgeny Kliteynik	160e9cb37a	net/mlx5: DR, Add support for dynamic flex parser Flex parser is a HW parser that can support protocols that are not natively supported by the HCA, such as Geneve (TLV options) and GTP-U. There are 8 such parsers, and each of them can be assigned to parse a specific set of protocols. This patch adds misc4 match params which allows using a correct flex parser that was programmed to the required protocol. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:28 -07:00
Muhammad Sammar	323b91acc1	net/mlx5: DR, Remove protocol-specific flex_parser_3 definitions Remove MPLS specific fields from flex parser 3 layout. Flex parser can be used for multiple protocols and should not be hardcoded to a specific type. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:24 -07:00
Yevgeny Kliteynik	25cb317680	net/mlx5: E-Switch, Improve error messages in term table creation Add the error code to the error messages and remove a duplicated message: if termination table creation failed, we already get an error message in mlx5_eswitch_termtbl_create, so no need for the additional error print in the calling function. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:18 -07:00
Yevgeny Kliteynik	ff1925bb0d	net/mlx5: DR, Fix SQ/RQ in doorbell bitmask QP doorbell size is 16 bits. Fixing sw steering's QP doorbell bitmask, which had 20 bits. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:15 -07:00
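The effect of the mask width in the commit above can be shown with a small sketch. It assumes a hypothetical free-running 32-bit counter whose low bits are written to a 16-bit doorbell field; the macro and function names are illustrative and do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DB_MASK_16 UINT32_C(0xffff)  /* matches a 16-bit doorbell field */
#define DB_MASK_20 UINT32_C(0xfffff) /* too wide: keeps 4 extra bits */

/* Hypothetical helper: mask a free-running counter down to a doorbell value. */
uint32_t doorbell_value(uint32_t counter, uint32_t mask)
{
    return counter & mask;
}

/* True when the value can be stored in a 16-bit doorbell field unchanged. */
bool fits_doorbell(uint32_t value)
{
    return value <= UINT16_MAX;
}
```

Once the counter passes 0xffff, the 20-bit mask yields values that no longer fit the 16-bit field, while the 16-bit mask wraps cleanly.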
Yevgeny Kliteynik	7d22ad732d	net/mlx5: DR, Rename an argument in dr_rdma_segments Rename the argument to better reflect that the meaning is not number of records, but whether or not we should ring the doorbell. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:12 -07:00
Tariq Toukan	6980ffa0c5	net/mlx5e: RX, Add checks for calculated Striding RQ attributes Striding RQ attributes below are mutually dependent. An unaware change to one might take the others out of the valid range derived by the HW caps: - The MPWQE size in bytes - The number of strides in a MPWQE - The stride size Add checks to verify they are valid and comply with the HW spec and SW assumptions/requirements. This is not a fix, no particular issue exists today. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:09 -07:00
Vladyslav Tarasiuk	6a5689ba02	net/mlx5e: Fix possible non-initialized struct usage If mlx5e_devlink_port_register() fails, driver may try to register devlink health TX and RX reporters on non-registered devlink port. Instead, create health reporters only if mlx5e_devlink_port_register() does not fail. And destroy reporters only if devlink_port is registered. Also, change mlx5e_get_devlink_port() behavior and return NULL in case port is not registered to replicate devlink's wrapper when ndo is not implemented. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:06 -07:00
Tariq Toukan	d408c01cae	net/mlx5e: Fix lost changes during code movements The changes done in commit [1] were missed by the code movements done in [2], as they were developed in ~parallel. Here we re-apply them. [1] commit `e4484d9df5` ("net/mlx5e: Enable striding RQ for Connect-X IPsec capable devices") [2] commit `b3a131c2a1` ("net/mlx5e: Move params logic into its dedicated file") Fixes: `b3a131c2a1` ("net/mlx5e: Move params logic into its dedicated file") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-19 20:17:03 -07:00
Jakub Kicinski	8203c7ce4e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/stmicro/stmmac/stmmac_main.c - keep the ZC code, drop the code related to reinit net/bridge/netfilter/ebtables.c - fix build after move to net_generic Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-17 11:08:07 -07:00
Jakub Kicinski	b572ec9ff0	mlx5: implement ethtool standard stats Add support for PHY/MAC/Ctrl/RMON stats. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-16 16:59:47 -07:00
Jakub Kicinski	c1912ab0ee	mlxsw: implement ethtool standard stats mlxsw has nicely grouped stats, add support for standard uAPI. I'm guessing the register access part. Compile tested only. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-16 16:59:20 -07:00
David S. Miller	03e481e88b	Merge tag 'mlx5-updates-2021-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-04-16 This patchset introduces updates to mlx5e netdev driver. 1) Tariq refactors TLS offloads and adds resiliency against RX resync failures 2) Maxim reduces code duplications by unifying channels reset flow regardless if channels are closed or open 3) Aya Enhances TX/RX health reporters diagnostics to expose the internal clock time-stamping format 4) Moshe adds support for ethtool extended link state, to show the reason for link down ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-16 16:53:52 -07:00
Vladimir Oltean	2c4eca3ef7	net: bridge: switchdev: include local flag in FDB notifications As explained in bugfix commit `6ab4c3117a` ("net: bridge: don't notify switchdev for local FDB addresses") as well as in this discussion: https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/ the switchdev notifiers for FDB entries managed to have a zero-day bug, which was that drivers would not know what to do with local FDB entries, because they were not told that they are local. The bug fix was to simply not notify them of those addresses. Let us now add the 'is_local' bit to bridge FDB entries, and make all drivers ignore these entries by their own choice. Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-16 15:15:45 -07:00
Aya Levin	95742c1cc5	net/mlx5: Enhance diagnostics info for TX/RX reporters Add ts_format to 'Common Config' section of the TX/RX devlink reporters diagnostics info. Possible values for ts_format: 'RT' or 'FRC', which stand for Real Time and Free Running Counters respectively. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:34 -07:00
Aya Levin	302522e67c	net/mlx5: Add helper to initialize 1PPS Wrap 1PPS initialization in a helper for a cleaner init flow. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:31 -07:00
Moshe Tal	b3446acb2b	net/mlx5e: Add ethtool extended link state In case the interface was set up but cannot establish the link, ethtool will print more information to help the user troubleshoot the state. For example, no link due to missing cable: $ ethtool eth1 ... Link detected: no (No cable) Besides the general extended state, drivers can pass additional information about the link state using the sub-state field. For example: $ ethtool eth1 ... Link detected: no (Autoneg, No partner detected) The extended state is available only for specific cases; in other cases ethtool will print only "Link detected: no" as before. Signed-off-by: Moshe Tal <moshet@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:28 -07:00
Maor Dickman	5cec6de0ae	net/mlx5: Allocate FC bulk structs with kvzalloc() instead of kzalloc() The bulk size is larger than 16K so use kvzalloc(). The bulk bitmask upper size limit is 16K so use kvcalloc(). Signed-off-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:22 -07:00
Maxim Mikityanskiy	94872d4ef9	net/mlx5e: Cleanup safe switch channels API by passing params mlx5e_safe_switch_channels accepts new_chs as a parameter and opens new channels in place, then copies them to priv->channels. It requires all the callers to allocate space for this temporary storage of the new channels. This commit cleans up the API by replacing new_chs with new_params, a meaningful subset of new_chs to be filled by the caller. The temporary space for the new channels is allocated inside mlx5e_safe_switch_params (a new name for mlx5e_safe_switch_channels). An extra copy of params is made, but since it's control flow, it's not critical. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:20 -07:00
Maxim Mikityanskiy	b3b886cf96	net/mlx5e: Refactor on-the-fly configuration changes This commit extends mlx5e_safe_switch_channels() to support on-the-fly configuration changes, when the channels are open, but don't need to be recreated. Such flows exist when a parameter being changed doesn't affect how the queues are created, or when the queues can be modified while remaining active. Before this commit, such flows were handled as special cases on the caller site. This commit adds this functionality to mlx5e_safe_switch_channels(), allowing the caller to pass a boolean indicating whether it's required to recreate the channels or it's allowed to skip it. The logic of switching channel parameters is now completely encapsulated into mlx5e_safe_switch_channels(). Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:17 -07:00
Maxim Mikityanskiy	69cc4185dc	net/mlx5e: Use mlx5e_safe_switch_channels when channels are closed This commit uses new functionality of mlx5e_safe_switch_channels introduced by the previous commit to reduce the amount of repeating similar code all over the driver. It's very common in mlx5e to call mlx5e_safe_switch_channels when the channels are open, but assign parameters and run hardware commands manually when the channels are closed. After the previous commit it's no longer needed to do such manual things every time, so this commit removes unneeded code and relies on the new functionality of mlx5e_safe_switch_channels. Some of the places are refactored and simplified, where more complex flows are used to change configuration on the fly, without recreating the channels (the logic is rewritten in a more robust way, with a reset required by default and a list of exceptions). Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:14 -07:00
Maxim Mikityanskiy	6cad120d9e	net/mlx5e: Allow mlx5e_safe_switch_channels to work with channels closed mlx5e_safe_switch_channels is used to modify channel parameters and/or hardware configuration in a safe way, so that if anything goes wrong, everything reverts to the old configuration and remains in a consistent state. However, this function only works when the channels are open. When the caller needs to modify some parameters, first it has to check that the channels are open, otherwise it has to assign parameters directly, and such boilerplate repeats in many different places. This commit prepares for the refactoring of such places by allowing mlx5e_safe_switch_channels to work when the channels are closed. In this case it will assign the new parameters and run the preactivate hook. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:11 -07:00
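The revert-on-failure behaviour described for the closed-channels case can be pictured with a small userspace sketch. It is not the driver's code: `struct dev`, `struct params`, `safe_switch_params_closed()` and the `reject_zero_channels()` hook are hypothetical names chosen for illustration. Only the idea of assigning the new parameters, running a preactivate hook, and restoring the old parameters on error is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parameter set; the real driver tracks many more fields. */
struct params {
	unsigned int num_channels;
	unsigned int mtu;
};

/* Hypothetical device state with channels currently closed. */
struct dev {
	struct params params;
	bool opened;
};

typedef int (*preactivate_fn)(struct dev *d);

/* Example hook (assumption): refuse a configuration with no channels. */
static inline int reject_zero_channels(struct dev *d)
{
	return d->params.num_channels ? 0 : -1;
}

/*
 * Assign the new parameters, run the optional hook, and restore the
 * previous parameters if the hook reports an error.
 */
static inline int safe_switch_params_closed(struct dev *d,
					     const struct params *new_params,
					     preactivate_fn hook)
{
	struct params old;
	int err;

	assert(d != NULL && new_params != NULL);

	old = d->params;
	d->params = *new_params;

	if (hook) {
		err = hook(d);
		if (err) {
			d->params = old;	/* roll back to a consistent state */
			return err;
		}
	}
	return 0;
}
```

Keeping a copy of the old parameters on the stack is what makes the rollback trivial; a caller never observes a half-applied configuration.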
Tariq Toukan	e9ce991bce	net/mlx5e: kTLS, Add resiliency to RX resync failures When the TLS logic finds a tcp seq match for a kTLS RX resync request, it calls the driver callback function mlx5e_ktls_resync() to handle it and communicate it to the device. Errors might occur during mlx5e_ktls_resync(); however, they are not reported to the stack. Moreover, there is no error handling in the stack for these errors. In this patch, the driver takes responsibility for error handling, adding queue and retry mechanisms to these resyncs. We maintain a linked list of resync matches, and try posting them to the async ICOSQ in the NAPI context. The only possible failure that demands driver handling is the ICOSQ being full. By relying on the NAPI mechanism, we make sure that the entries in the list will be handled when ICOSQ completions arrive and make some room available. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:08 -07:00
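The queue-and-retry idea in this commit (keep pending resync matches on a list, post them while the send queue has room, and retry once completions free up space) can be sketched in plain C. Everything below is a simplified, hypothetical model: `struct sq`, `struct resync_req`, and the helper names are assumptions, and no locking or NAPI integration is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending resync request. */
struct resync_req {
	unsigned int seq;
	struct resync_req *next;
};

/* Hypothetical FIFO of requests waiting for queue space. */
struct resync_list {
	struct resync_req *head;
	struct resync_req *tail;
};

/* Hypothetical send queue with a fixed number of slots. */
struct sq {
	unsigned int capacity;
	unsigned int in_flight;
};

static inline void resync_list_add(struct resync_list *l, struct resync_req *r)
{
	r->next = NULL;
	if (l->tail)
		l->tail->next = r;
	else
		l->head = r;
	l->tail = r;
}

/* Try to take one slot; fails only when the queue is full. */
static inline bool sq_post(struct sq *q)
{
	if (q->in_flight == q->capacity)
		return false;
	q->in_flight++;
	return true;
}

/* Completions arrived: release n slots. */
static inline void sq_complete(struct sq *q, unsigned int n)
{
	assert(n <= q->in_flight);
	q->in_flight -= n;
}

/*
 * Called from the poll loop: post as many pending requests as fit and
 * leave the rest on the list for the next poll. Returns the number posted.
 */
static inline unsigned int resync_list_flush(struct resync_list *l, struct sq *q)
{
	unsigned int posted = 0;

	while (l->head && sq_post(q)) {
		struct resync_req *r = l->head;

		l->head = r->next;
		if (!l->head)
			l->tail = NULL;
		posted++;
	}
	return posted;
}
```

Because a full queue is the only failure modelled here, leaving the unposted entries on the list and flushing again after completions is enough to guarantee eventual delivery.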
Tariq Toukan	72f6f2f8d6	net/mlx5e: TX, Inline function mlx5e_tls_handle_tx_wqe() When TLS is supported, WQE ctrl segment of every transmitted packet is updated with the (possibly empty, for non-TLS packets) TISN field. Take this one-liner function into the header file and inline it, to save the overhead of a function call per packet. While here, remove unused function parameter. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:05 -07:00
Tariq Toukan	b6b3ad2175	net/mlx5e: TX, Inline TLS skb check When TLS is supported and enabled, every transmitted packet is tested to identify if TLS offload is required. Take the early-return condition into an inline function, to save the overhead of a function call for non-TLS packets. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:48:02 -07:00
Tariq Toukan	8668587a33	net/mlx5e: Cleanup unused function parameter Socket parameter is not used in accel_rule_init(), remove it. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:47:59 -07:00
Tariq Toukan	2f014f4016	net/mlx5e: Remove non-essential TLS SQ state bit Maintaining an SQ state bit to indicate TLS support has no real need, a simple and fast test [1] for the SKB is almost equally good. [1] !skb->sk \|\| !tls_is_sk_tx_device_offloaded(skb->sk) Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-16 11:47:56 -07:00
Jakub Kicinski	1703bb50df	mlx5: implement ethtool::get_fec_stats Report corrected bits. v2: catch reg access errors (Saeed) Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-15 17:08:29 -07:00
wenxu	e3e0f9b279	net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta In nft_offload there is a meta flow_dissector with no ingress_ifindex but with an ingress_iftype that is only used in software. So if the mask of ingress_ifindex in meta is 0, this meta check should be bypassed. Fixes: `6d65bc64e2` ("net/mlx5e: Add mlx5e_flower_parse_meta support") Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 16:13:00 -07:00
Aya Levin	7a320c9db3	net/mlx5e: Fix setting of RS FEC mode Change register setting from bit number to bit mask. Fixes: `b5ede32d33` ("net/mlx5e: Add support for FEC modes based on 50G per lane links") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 16:12:57 -07:00
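The difference between a bit number and a bit mask, which this fix turns on, is easy to show in isolation. The sketch below is illustrative only: the enum values and helper names are hypothetical and do not reflect the actual mlx5 register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions inside a 16-bit register field. */
enum fec_bit_pos {
	FEC_POS_NOFEC = 0,
	FEC_POS_FIRECODE = 1,
	FEC_POS_RS = 2,
};

/* Correct: convert the bit position into the mask the field expects. */
static inline uint16_t fec_pos_to_mask(enum fec_bit_pos pos)
{
	assert(pos < 16);
	return (uint16_t)(1u << pos);
}

/* Buggy pattern: writes the position itself, which sets the wrong bit. */
static inline uint16_t fec_field_from_pos_buggy(enum fec_bit_pos pos)
{
	return (uint16_t)pos;
}
```

With these hypothetical values, writing position 2 directly produces the same register contents as the mask for position 1, so the hardware would be asked for a different mode than intended.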
Aya Levin	41bafb31dc	net/mlx5: Fix setting of devlink traps in switchdev mode Prevent setting of devlink traps on the uplink while in switchdev mode. In this mode, it is the SW switch's responsibility to handle packets with a mismatch in either destination MAC or VLAN ID. Therefore, there are no flow steering tables to trap undesirable packets, and the driver crashes upon setting a trap. Fixes: `241dc15939` ("net/mlx5: Notify on trap action by blocking event") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 16:12:54 -07:00
Aya Levin	5b232ea94c	net/mlx5e: Fix RQ creation flow for queues which doesn't support XDP Allow to create an RQ which is not registered as an XDP RQ. For example: the trap-RQ doesn't register as an XDP RQ. Fixes: `869c5f9262` ("net/mlx5e: Generalize open RQ") Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:03:10 -07:00
Wenpeng Liang	31450b435f	net/mlx5: Replace spaces with tab at the start of a line There should be no spaces at the start of the line. Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:03:07 -07:00
Wenpeng Liang	9dee115bc1	net/mlx5: Remove return statement exist at the end of void function void function return statements are not generally useful. Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:03:04 -07:00
Wenpeng Liang	02f47c04c3	net/mlx5: Add a blank line after declarations There should be a blank line after declarations. Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> Signed-off-by: Weihang Li <liweihang@huawei.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:03:01 -07:00
Colin Ian King	82c3ba31c3	net/mlx5: Fix bit-wise and with zero The bit-wise and of the action field with MLX5_ACCEL_ESP_ACTION_DECRYPT is incorrect as MLX5_ACCEL_ESP_ACTION_DECRYPT is zero and not intended to be a bit-flag. Fix this by using the == operator as was originally intended. Addresses-Coverity: ("Logically dead code") Fixes: `7dfee4b1d7` ("net/mlx5: IPsec, Refactor SA handle creation and destruction") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:58 -07:00
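Why a bit-wise AND with a zero-valued constant is dead code can be seen in a few lines. This is a sketch under stated assumptions: the enum below is a hypothetical stand-in, and only the fact that the decrypt action has the value zero is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical action values: an enumeration, not a set of bit flags. */
enum esp_action {
	ESP_ACTION_DECRYPT = 0,
	ESP_ACTION_ENCRYPT = 1,
};

static_assert(ESP_ACTION_DECRYPT == 0,
	      "DECRYPT is an enum value of zero, not a flag");

/* Buggy: x & 0 is always 0, so this test can never be true. */
static inline bool is_decrypt_buggy(enum esp_action action)
{
	return (action & ESP_ACTION_DECRYPT) != 0;
}

/* Fixed: an enumeration value is compared with ==. */
static inline bool is_decrypt(enum esp_action action)
{
	return action == ESP_ACTION_DECRYPT;
}
```

Static analysers flag the buggy form as logically dead code because the branch guarded by it is unreachable for every input.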
Roi Dayan	b7f86258a2	net/mlx5: DR, Alloc cmd buffer with kvzalloc() instead of kzalloc() The cmd size is 8K so use kvzalloc(). Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:55 -07:00
Jianbo Liu	9dac2966c5	net/mlx5: DR, Use variably sized data structures for different actions mlx5dr_action is a generally used data structure, and it contains a union for the different types of actions. The size of mlx5dr_action is about 72 bytes, but for those actions with fewer fields, most of the allocated memory is wasted. Remove this union, so that mlx5dr_action becomes a generic action header. Actions are then dynamically allocated with only the memory they need, and the data for each action is stored right after the header. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:52 -07:00
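The "generic header plus per-type data stored right after it" layout can be sketched with a C flexible array member. This is a userspace illustration, not the mlx5dr code: the action types, `struct push_vlan_data`, and the helper names are hypothetical, and `calloc()` stands in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical action types. */
enum action_type {
	ACTION_DROP,		/* needs no extra data */
	ACTION_PUSH_VLAN,	/* needs a small payload */
};

/* Generic header shared by every action. */
struct action {
	enum action_type type;
	uint32_t refcount;
	max_align_t data[];	/* per-type payload follows the header */
};

/* Hypothetical payload for one action type. */
struct push_vlan_data {
	uint32_t vlan_hdr;
};

static inline size_t action_payload_size(enum action_type type)
{
	switch (type) {
	case ACTION_PUSH_VLAN:
		return sizeof(struct push_vlan_data);
	case ACTION_DROP:
	default:
		return 0;
	}
}

/* Allocate the header plus exactly the payload this type needs. */
static inline struct action *action_create(enum action_type type)
{
	struct action *a = calloc(1, sizeof(*a) + action_payload_size(type));

	if (!a)
		return NULL;
	a->type = type;
	a->refcount = 1;
	return a;
}

/* Typed accessor for the payload stored after the header. */
static inline struct push_vlan_data *action_push_vlan(struct action *a)
{
	assert(a->type == ACTION_PUSH_VLAN);
	return (struct push_vlan_data *)a->data;
}
```

Compared with a union sized for the largest member, an action with no payload here costs only the header, which is the saving the commit message describes.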
Parav Pandit	a74ed24c43	net/mlx5: SF, Reuse stored hardware function id SF's hardware function id is already stored in mlx5_sf. Reuse it, instead of querying the hw table. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:49 -07:00
Parav Pandit	6e74e6ea1b	net/mlx5: SF, Use device pointer directly At many places in the code, device pointer is directly available. Make use of it, instead of accessing it from the table. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:46 -07:00
Parav Pandit	57b92bdd9e	net/mlx5: E-Switch, Initialize eswitch acls ns when eswitch is enabled Currently the eswitch flow steering (FS) namespaces of a vport's ingress and egress ACLs are enabled when the FS layer is initialized. This is done even when eswitch is disabled, which requires the total number of eswitch ports to be known to the FS layer even when eswitch is not in use. Given that the FS core is not dependent on eswitch, make the namespace init and cleanup routines helper routines to be invoked only when eswitch is needed. With this change, ingress and egress ACL namespaces are created only when eswitch legacy/offloads mode is enabled. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:43 -07:00
Parav Pandit	b55b35382e	net/mlx5: E-Switch, Move legacy code to a individual file Currently eswitch offers two modes: legacy and offloads. Offloads code is already in its own file, eswitch_offloads.c. However, eswitch.c contains both the eswitch legacy code and common infrastructure code. To enable future extensions and to better manage generic common eswitch infrastructure code, move the legacy code to its own legacy.c file. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:40 -07:00
Parav Pandit	b16f2bb6b6	net/mlx5: E-Switch, Convert a macro to a helper routine Convert ESW_ALLOWED macro to a helper routine so that it can be used in other eswitch files. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:38 -07:00
Parav Pandit	13795553a8	net/mlx5: E-Switch Make cleanup sequence mirror of init Make cleanup sequence mirror of init sequence for cleaning up reps and freeing vports. Also when reps initialization fails, there is no need to perform reps cleanup. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:35 -07:00
Parav Pandit	6308a5f06b	net/mlx5: E-Switch, Make vport number u16 Vport number is 16-bit field in hardware. Make it u16. Move location of vport in the structure so that it reduces a hole in the structure. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:32 -07:00
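The effect of narrowing a field to 16 bits and moving it next to other small members can be checked with `sizeof`. The two layouts below are hypothetical and are not the driver's actual structure; they only illustrate how member order and width determine padding holes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout: a wide field ahead of a pointer. */
struct vport_before {
	int vport;		/* wider than the 16-bit hardware field */
	void *dev;		/* pointer alignment may force padding above */
	uint16_t index;
	uint8_t enabled;
};

/* Hypothetical "after" layout: u16 vport grouped with the small members. */
struct vport_after {
	void *dev;
	uint16_t vport;
	uint16_t index;
	uint8_t enabled;
};

/* The regrouped layout is never larger than the original one. */
static_assert(sizeof(struct vport_after) <= sizeof(struct vport_before),
	      "narrowing and regrouping must not grow the structure");
```

On a typical LP64 target the first layout occupies 24 bytes and the second 16, though the exact numbers depend on the ABI.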
Parav Pandit	7d5ae47891	net/mlx5: E-Switch, Skip querying SF enabled bits With vhca events, SF state is queried through the VHCA events. Device no longer expects SF bitmap in the query eswitch functions command. Hence, remove it to simplify the code. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-14 11:02:29 -07:00
Parav Pandit	7bf481d7e7	net/mlx5: E-Switch, let user to enable disable metadata Currently each packet inserted in eswitch is tagged with an internal metadata to indicate source vport. Metadata tagging is not always needed. Metadata insertion is needed for multi-port RoCE, failover between representors and stacked devices. In many other cases, metadata enablement is not needed. Metadata insertion slows down the packet processing rate of the E-switch when it is in switchdev mode. The table below shows the performance gain with metadata disabled for VXLAN offload rules in both SMFS and DMFS steering modes on a ConnectX-5 device.

----------------------------------------------
| steering | metadata | pkt size | rx pps    |
| mode     |          |          | (million) |
----------------------------------------------
| smfs     | disabled | 128Bytes | 42        |
----------------------------------------------
| smfs     | enabled  | 128Bytes | 36        |
----------------------------------------------
| dmfs     | disabled | 128Bytes | 42        |
----------------------------------------------
| dmfs     | enabled  | 128Bytes | 36        |
----------------------------------------------

Hence, allow the user to disable metadata using a driver-specific devlink parameter. The metadata setting of the eswitch is applicable only in switchdev mode.

Example to show and disable metadata before changing eswitch mode:

$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
pci/0000:06:00.0:
  name esw_port_metadata type driver-specific
    values:
      cmode runtime value true

$ devlink dev param set pci/0000:06:00.0 \
	name esw_port_metadata value false cmode runtime

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> --- changelog: v1->v2: - added performance numbers in commit log - updated commit log and documentation for switchdev mode - added explicit note on when user can disable metadata in documentation	2021-04-14 11:02:26 -07:00
Jason Gunthorpe	a0354d2308	Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== This pr contains changes from mlx5-next branch, already reviewed on netdev and rdma mailing lists, links below. 1) From Leon, Dynamically assign MSI-X vectors count Already Acked by Bjorn Helgaas. https://patchwork.kernel.org/project/netdevbpf/cover/20210314124256.70253-1-leon@kernel.org/ 2) Cleanup series: https://patchwork.kernel.org/project/netdevbpf/cover/20210311070915.321814-1-saeed@kernel.org/ From Mark, E-Switch cleanups and refactoring, and the addition of single FDB mode needed HW bits. From Mikhael, Remove unused struct field From Saeed, Cleanup W=1 prototype warning From Zheng, Esw related cleanup From Tariq, User order-0 page allocation for EQs ==================== * mlx5-next: net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks net/mlx5: Dynamically assign MSI-X vectors count net/mlx5: Add dynamic MSI-X capabilities bits PCI/IOV: Add sysfs MSI-X vector assignment interface net/mlx5: Use order-0 allocations for EQs net/mlx5: Add IFC bits needed for single FDB mode net/mlx5: E-Switch, Refactor send to vport to be more generic RDMA/mlx5: Use representor E-Switch when getting netdev and metadata net/mlx5: E-Switch, Add eswitch pointer to each representor net/mlx5: E-Switch, Add match on vhca id to default send rules net/mlx5: Remove unused mlx5_core_health member recover_work net/mlx5: simplify the return expression of mlx5_esw_offloads_pair() net/mlx5: Cleanup prototype warning Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2021-04-12 13:49:48 -03:00
Vladyslav Tarasiuk	4c88fa412a	net/mlx5: Add support for DSFP module EEPROM dumps Allow the driver to recognise DSFP transceiver module ID and therefore allow its EEPROM dumps using ethtool. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-11 16:34:56 -07:00
Vladyslav Tarasiuk	e109d2b204	net/mlx5: Implement get_module_eeprom_by_page() Implement ethtool_ops::get_module_eeprom_by_page() to enable support of new SFP standards. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-11 16:34:56 -07:00
Vladyslav Tarasiuk	e19b0a3474	net/mlx5: Refactor module EEPROM query Prepare for ethtool_ops::get_module_eeprom_data() implementation by extracting common part of mlx5_query_module_eeprom() into a separate function. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-11 16:34:56 -07:00
Jakub Kicinski	8859a44ea0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts: MAINTAINERS - keep Chandrasekar drivers/net/ethernet/mellanox/mlx5/core/en_main.c - simple fix + trust the code re-added to param.c in -next is fine include/linux/bpf.h - trivial include/linux/ethtool.h - trivial, fix kdoc while at it include/linux/skmsg.h - move to relevant place in tcp.c, comment re-wrapped net/core/skmsg.c - add the sk = sk // sk = NULL around calls net/tipc/crypto.c - trivial Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-09 20:48:35 -07:00
Jakub Kicinski	95b5c29132	Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-next 2021-04-09 This pr contains changes from mlx5-next branch, already reviewed on netdev and rdma mailing lists, links below. 1) From Leon, Dynamically assign MSI-X vectors count Already Acked by Bjorn Helgaas. https://patchwork.kernel.org/project/netdevbpf/cover/20210314124256.70253-1-leon@kernel.org/ 2) Cleanup series: https://patchwork.kernel.org/project/netdevbpf/cover/20210311070915.321814-1-saeed@kernel.org/ From Mark, E-Switch cleanups and refactoring, and the addition of single FDB mode needed HW bits. From Mikhael, Remove unused struct field From Saeed, Cleanup W=1 prototype warning From Zheng, Esw related cleanup From Tariq, User order-0 page allocation for EQs * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks net/mlx5: Dynamically assign MSI-X vectors count net/mlx5: Add dynamic MSI-X capabilities bits PCI/IOV: Add sysfs MSI-X vector assignment interface net/mlx5: Use order-0 allocations for EQs net/mlx5: Add IFC bits needed for single FDB mode net/mlx5: E-Switch, Refactor send to vport to be more generic RDMA/mlx5: Use representor E-Switch when getting netdev and metadata net/mlx5: E-Switch, Add eswitch pointer to each representor net/mlx5: E-Switch, Add match on vhca id to default send rules net/mlx5: Remove unused mlx5_core_health member recover_work net/mlx5: simplify the return expression of mlx5_esw_offloads_pair() net/mlx5: Cleanup prototype warning ==================== Link: https://lore.kernel.org/r/20210409200704.10886-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-09 18:07:21 -07:00
Danielle Ratson	a975d7d8a3	ethtool: Remove link_mode param and derive link params from driver Some drivers clear the 'ethtool_link_ksettings' struct in their get_link_ksettings() callback, before populating it with actual values. Such drivers will set the new 'link_mode' field to zero, resulting in user space receiving wrong link mode information given that zero is a valid value for the field. Another problem is that some drivers (notably tun) can report random values in the 'link_mode' field. This can result in a general protection fault when the field is used as an index to the 'link_mode_params' array [1]. This happens because such drivers implement their set_link_ksettings() callback by simply overwriting their private copy of 'ethtool_link_ksettings' struct with the one they get from the stack, which is not always properly initialized. Fix these problems by removing 'link_mode' from 'ethtool_link_ksettings' and instead have drivers call ethtool_params_from_link_mode() with the current link mode. The function will derive the link parameters (e.g., speed) from the link mode and fill them in the 'ethtool_link_ksettings' struct. v3: * Remove link_mode parameter and derive the link parameters in the driver instead of passing link_mode parameter to ethtool and derive it there. v2: * Introduce 'cap_link_mode_supported' instead of adding a validity field to 'ethtool_link_ksettings' struct. 
[1] general protection fault, probably for non-canonical address 0xdffffc00f14cc32c: 0000 [#1] PREEMPT SMP KASAN KASAN: probably user-memory-access in range [0x000000078a661960-0x000000078a661967] CPU: 0 PID: 8452 Comm: syz-executor360 Not tainted 5.11.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__ethtool_get_link_ksettings+0x1a3/0x3a0 net/ethtool/ioctl.c:446 Code: b7 3e fa 83 fd ff 0f 84 30 01 00 00 e8 16 b0 3e fa 48 8d 3c ed 60 d5 69 8a 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 +38 d0 7c 08 84 d2 0f 85 b9 RSP: 0018:ffffc900019df7a0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff888026136008 RCX: 0000000000000000 RDX: 00000000f14cc32c RSI: ffffffff873439ca RDI: 000000078a661960 RBP: 00000000ffff8880 R08: 00000000ffffffff R09: ffff88802613606f R10: ffffffff873439bc R11: 0000000000000000 R12: 0000000000000000 R13: ffff88802613606c R14: ffff888011d0c210 R15: ffff888011d0c210 FS: 0000000000749300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004b60f0 CR3: 00000000185c2000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: linkinfo_prepare_data+0xfd/0x280 net/ethtool/linkinfo.c:37 ethnl_default_notify+0x1dc/0x630 net/ethtool/netlink.c:586 ethtool_notify+0xbd/0x1f0 net/ethtool/netlink.c:656 ethtool_set_link_ksettings+0x277/0x330 net/ethtool/ioctl.c:620 dev_ethtool+0x2b35/0x45d0 net/ethtool/ioctl.c:2842 dev_ioctl+0x463/0xb70 net/core/dev_ioctl.c:440 sock_do_ioctl+0x148/0x2d0 net/socket.c:1060 sock_ioctl+0x477/0x6a0 net/socket.c:1177 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 
entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `c8907043c6` ("ethtool: Get link mode in use instead of speed and duplex parameters") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-07 14:53:04 -07:00
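The failure mode described here (an unvalidated `link_mode` value used as an array index) and the fix direction (derive speed and related parameters from a known link mode) can be sketched in userspace C. The table, the enum, and `params_from_link_mode()` below are hypothetical stand-ins rather than the kernel's `ethtool_params_from_link_mode()`; in particular, returning an error code is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of link modes. */
enum link_mode {
	LM_10BASET_HALF,
	LM_10BASET_FULL,
	LM_1000BASET_FULL,
	LM_COUNT,
};

struct link_mode_info {
	uint32_t speed;		/* Mb/s */
	uint8_t lanes;
	uint8_t full_duplex;
};

/* Per-mode parameters, indexed by link mode. */
static const struct link_mode_info link_mode_table[LM_COUNT] = {
	[LM_10BASET_HALF]   = { 10,   1, 0 },
	[LM_10BASET_FULL]   = { 10,   1, 1 },
	[LM_1000BASET_FULL] = { 1000, 1, 1 },
};

/* Hypothetical settings structure filled in by the driver. */
struct link_settings {
	uint32_t speed;
	uint8_t lanes;
	uint8_t full_duplex;
};

/*
 * Fill in the derived parameters for a link mode. The bounds check is
 * what prevents an uninitialized or random value from indexing past
 * the table. Returns 0 on success, -1 if the mode is out of range.
 */
static inline int params_from_link_mode(struct link_settings *ls,
					 unsigned int mode)
{
	assert(ls != NULL);

	if (mode >= LM_COUNT)
		return -1;

	ls->speed = link_mode_table[mode].speed;
	ls->lanes = link_mode_table[mode].lanes;
	ls->full_duplex = link_mode_table[mode].full_duplex;
	return 0;
}
```

Having the driver pass a link mode it knows to be current, instead of storing an index in a structure that drivers may zero or overwrite, removes both problems listed in the commit message.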
David S. Miller	f86c70ed04	Merge tag 'mlx5-updates-2021-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-04-06 Introduce TC sample offload Background ---------- The tc sample action allows user to sample traffic matched by tc classifier. The sampling consists of choosing packets randomly and sampling them using psample module. 
The tc sample parameters include group id, sampling rate and packet's truncation (to save kernel-user traffic). Sample in TC SW --------------- User must specify rate and group id for sample action, truncate is optional. tc filter add dev enp4s0f0_0 ingress protocol ip prio 1 flower \ src_mac 02:25:d0:14:01:02 dst_mac 02:25:d0:14:01:03 \ action sample rate 10 group 5 trunc 60 \ action mirred egress redirect dev enp4s0f0_1 The tc sample action kernel module 'act_sample' will call another kernel module 'psample' to send sampled packets to userspace. MLX5 sample HW offload - MLX5 driver patches -------------------------------------------- The sample action is translated to a goto flow table object destination which samples packets according to the provided sample ratio. Sampled packets are duplicated. One copy is processed by a termination table, named the sample table, which sends the packet to the eswitch manager port (that will be processed by software). The second copy is processed by the default table which executes the subsequent actions. The default table is created per <vport, chain, prio> tuple as rules with different prios and chains may overlap. 
For example, for the following typical flow table: +-------------------------------+ + original flow table + +-------------------------------+ + original match + +-------------------------------+ + sample action + other actions + +-------------------------------+ We translate the tc filter with sample action to the following HW model: +---------------------+ + original flow table + +---------------------+ + original match + +---------------------+ \| v +------------------------------------------------+ + Flow Sampler Object + +------------------------------------------------+ + sample ratio + +------------------------------------------------+ + sample table id \| default table id + +------------------------------------------------+ \| \| v v +-----------------------------+ +----------------------------------------+ + sample table + + default table per <vport, chain, prio> + +-----------------------------+ +----------------------------------------+ + forward to management vport + + original match + +-----------------------------+ +----------------------------------------+ + other actions + +----------------------------------------+ Flow sampler object ------------------- Hardware introduces flow sampler object to do sample. It is a new destination type. Driver needs to specify two flow table ids in it. One is sample table id. The other one is the default table id. Sample table samples the packets according to the sample rate and forward the sampled packets to eswitch manager port. Default table finishes the subsequent actions. Group id and reg_c0 ------------------- Userspace program will take different actions for sampled packets according to tc sample action group id. So hardware must pass group id to software for each sampled packets. In Paul Blakey's "Introduce connection tracking offload" patch set, reg_c0 lower 16 bits are used for miss packet chain id restore. We convert reg_c0 lower 16 bits to a common object pool, so other features can also use it. 
Since sample group id is 32 bits, create a 16 bits object id to map the group id and write the object id to reg_c0 lower 16 bits. reg_c0 can only be used for matching. Write reg_c0 to flow_tag, so software can get the object id via flow_tag and find group id via the common object pool. Sampler restore handle ---------------------- Use common object pool to create an object id to map sample parameters. Allocate a modify header action to write the object id to reg_c0 lower 16 bits. Create a restore rule to pass the object id to software. So software can identify sampled packets via the object id and send it to userspace. Aggregate the modify header action, restore rule and object id to a sample restore handle. Re-use identical sample restore handle for the same object id. Send sampled packets to userspace --------------------------------- The destination for sampled packets is eswitch manager port, so representors can receive sampled packets together with the group id. Driver will send sampled packets and group id to userspace via psample. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-07 14:38:24 -07:00
Vadim Pasternak	d567fd6e82	mlxsw: core: Remove critical trip points from thermal zones Disable software thermal protection by removing critical trip points from all thermal zones. The software thermal protection is redundant given there are two layers of protection below it in firmware and hardware. The first layer is performed by firmware, the second, in case firmware was not able to perform protection, by hardware. The temperature threshold set for hardware protection is always higher than for firmware. Signed-off-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-07 14:26:18 -07:00
Chris Mi	f94d6389f6	net/mlx5e: TC, Add support to offload sample action The following diagram illustrates the hardware model for tc sample action: +---------------------+ + original flow table + +---------------------+ + original match + +---------------------+ \| v +------------------------------------------------+ + Flow Sampler Object + +------------------------------------------------+ + sample ratio + +------------------------------------------------+ + sample table id \| default table id + +------------------------------------------------+ \| \| v v +-----------------------------+ +----------------------------------------+ + sample table + + default table per <vport, chain, prio> + +-----------------------------+ +----------------------------------------+ + forward to management vport + + original match + +-----------------------------+ +----------------------------------------+ + other actions + +----------------------------------------+ The sample action is translated to a goto flow table object destination which samples packets according to the provided sample ratio. Sampled packets are duplicated. One copy is processed by a termination table, named the sample table, which sends the packet to the eswitch manager port (that will be processed by software). The second copy is processed by the default table which executes the subsequent actions. The default table is created per <vport, chain, prio> tuple as rules with different prios and chains may overlap. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:05 -07:00
Chris Mi	be9dc00474	net/mlx5e: TC, Handle sampled packets Mark the sampled packets with a sample restore object. Send sampled packets using the psample api. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:04 -07:00
Chris Mi	7319a1cc3c	net/mlx5e: TC, Refactor tc update skb function Refactor the function as a pre-step for processing sampled packets in it. Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:04 -07:00
Chris Mi	36a3196256	net/mlx5e: TC, Add sampler restore handle API Use common object pool to create an object ID to map sample parameters. Allocate a modify header action to write the object ID to reg_c0 lower 16 bits. Create a restore rule to pass the object ID to software. So software can identify sampled packets via the object ID and send it to userspace. Aggregate the modify header action, restore rule and object ID to a sample restore handle. Re-use identical sample restore handle for the same object ID. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:04 -07:00
Chris Mi	11ecd6c60b	net/mlx5e: TC, Add sampler object API In order to offload sample action, HW introduces sampler object. The sampler object samples packets according to the provided sample ratio. Sampled packets are duplicated. One copy is processed by a termination table, named the sample table, which sends the packet up to software. The second copy is processed by the default table. Instantiate sampler object. Re-use identical sampler object for the same sample ratio, sample table and default table as a prestep for offloading tc sample actions. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:03 -07:00
Chris Mi	2a9ab10a56	net/mlx5e: TC, Add sampler termination table API Sampled packets are sent to software using termination tables. There is only one rule in that table that is to forward sampled packets to the e-switch management vport. Create a sampler termination table and rule for each eswitch. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:03 -07:00
Chris Mi	41c2fd9498	net/mlx5e: TC, Parse sample action Parse TC sample action and save sample parameters in flow attribute data structure. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:03 -07:00
Chris Mi	c935568271	net/mlx5: Instantiate separate mapping objects for FDB and NIC tables Currently, the u32 chain id is mapped to u16 value which is stored on the lower 16 bits of reg_c0 for FDB and reg_b for NIC tables. The mapping is internally maintained by the chains object. However, with the introduction of reg_c0 objects the fdb may store more than just the chain id on reg_c0. This is not relevant for NIC tables. Separate the chains mapping instantiation for FDB and NIC tables. Remove the mapping from the chains object. For FDB tables, create the mapping per eswitch. For NIC tables, create the mapping per tc table. Pass the corresponding mapping pointer when creating the chains object. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:02 -07:00
Chris Mi	a91d98a0a2	net/mlx5: Map register values to restore objects Currently reg_c0 lower 16 bits and reg_b are used to store the chain id that missed in FDB and NIC tables accordingly. However, the registers' values may index a restore object, rather than a single u32 value. Different object types can be used to restore mutually exclusive contexts such as chain id and sample group id. Use the mapping object to associate an index with a restore object as a prestep for supporting additional restore types. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:02 -07:00
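The restore-object idea in the commit above can be illustrated with a minimal userspace sketch in C. It is not the driver's code: the names `restore_obj`, `restore_pool_add`, `restore_pool_find`, `reg_set_obj_id` and `reg_get_obj_id`, the tiny pool size, and the linear search are all assumptions made for illustration. It only shows how a 16-bit index stored in the lower half of a register can stand in for a typed object such as a chain id or a 32-bit sample group id.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical restore-object kinds; the names are illustrative only. */
enum restore_type { RESTORE_NONE = 0, RESTORE_CHAIN, RESTORE_SAMPLE };

struct restore_obj {
    enum restore_type type;
    union {
        uint32_t chain_id;
        uint32_t sample_group_id;
    } u;
};

#define POOL_SIZE 16 /* tiny for the demo; a 16-bit index allows far more */

struct restore_pool {
    /* Index 0 is reserved and means "no object". */
    struct restore_obj objs[POOL_SIZE + 1];
};

static int restore_obj_equal(const struct restore_obj *a,
                             const struct restore_obj *b)
{
    if (a->type != b->type)
        return 0;
    switch (a->type) {
    case RESTORE_CHAIN:
        return a->u.chain_id == b->u.chain_id;
    case RESTORE_SAMPLE:
        return a->u.sample_group_id == b->u.sample_group_id;
    default:
        return 0;
    }
}

/* Return the index of an identical object, or claim a free slot for it.
 * Returns 0 when the pool is exhausted. */
static uint16_t restore_pool_add(struct restore_pool *p, struct restore_obj obj)
{
    uint16_t free_slot = 0;

    for (uint16_t i = 1; i <= POOL_SIZE; i++) {
        if (restore_obj_equal(&p->objs[i], &obj))
            return i;
        if (p->objs[i].type == RESTORE_NONE && !free_slot)
            free_slot = i;
    }
    if (free_slot)
        p->objs[free_slot] = obj;
    return free_slot;
}

/* Look up the object behind an index; NULL if the index is unused. */
static const struct restore_obj *
restore_pool_find(const struct restore_pool *p, uint16_t id)
{
    if (id == 0 || id > POOL_SIZE || p->objs[id].type == RESTORE_NONE)
        return NULL;
    return &p->objs[id];
}

/* Store the object id in the lower 16 bits, keeping the upper 16 bits. */
static uint32_t reg_set_obj_id(uint32_t reg, uint16_t id)
{
    return (reg & 0xffff0000u) | id;
}

static uint16_t reg_get_obj_id(uint32_t reg)
{
    return (uint16_t)(reg & 0xffffu);
}
```

In this sketch a chain id of 7 and a sample group id of 7 get different indices because the type tag is part of the comparison, which is the point of indexing an object rather than a bare u32.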
Chris Mi	c1904360dd	net/mlx5: E-switch, Set per vport table default group number Each per vport table is created using its own per vport table namespace. Because a variable can't be used to set the namespace member value, if the max group number is 0 in the namespace, use the eswitch default max group number. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:02 -07:00
Chris Mi	c796bb7cd2	net/mlx5: E-switch, Generalize per vport table API Currently, per vport table was used only for port mirroring actions. However, sample action will also require a per vport table instance. Generalize the vport table API to work with multiple namespaces where each namespace manages its own vport table instance. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:01 -07:00
Chris Mi	0a9e230787	net/mlx5: E-switch, Rename functions to follow naming convention. Public API names start with mlx5; remove the mlx5 prefix from non-public API names. Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:01 -07:00
Chris Mi	4c7f40287a	net/mlx5: E-switch, Move vport table functions to a new file Currently, the vport table functions are in common eswitch offload file. This file is too big. Move the vport table create, delete and lookup functions to a separate file. Put the file in esw directory. Pre-step for generalizing its functionality for serving both the mirroring and the sample features. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:36:01 -07:00
Xiaoming Ni	d5f9b005c3	net/mlx5: fix kfree mismatch in indir_table.c Memory allocated by kvzalloc() should be freed by kvfree(). Fixes: `34ca65352d` ("net/mlx5: E-Switch, Indirect table infrastructur") Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:04:36 -07:00
Eli Cohen	1a73704c82	net/mlx5: Fix HW spec violation configuring uplink Make sure to modify uplink port to follow only if the uplink_follow capability is set as required by the HW spec. Failure to do so causes traffic to the uplink representor net device to cease after switching to switchdev mode. Fixes: `7d0314b11c` ("net/mlx5e: Modify uplink state on interface up/down") Signed-off-by: Eli Cohen <elic@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-06 21:04:35 -07:00
Leon Romanovsky	e71b75f737	net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks The mlx5 implementation executes a firmware command on the PF to change the configuration of the selected VF. Link: https://lore.kernel.org/linux-pci/20210314124256.70253-5-leon@kernel.org Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2021-04-04 10:30:38 +03:00
Leon Romanovsky	604774add5	net/mlx5: Dynamically assign MSI-X vectors count The number of MSI-X vectors is a PCI property visible through lspci. The field is read-only and configured by the device. The mlx5 devices work in a static or dynamic assignment mode. Static assignment means that all newly created VFs have a preset number of MSI-X vectors determined by device configuration parameters. This can result in some VFs having too many or too few MSI-X vectors. Till now this has been the only means of fine-tuning the MSI-X vector count and it was acceptable for small numbers of VFs. With dynamic assignment the inefficiency of having a fixed number of MSI-X vectors can be avoided with each VF having exactly the required vectors. Userspace will provide this information while provisioning the VF for use, based on the intended use. For instance if being used with a VM, the MSI-X vector count might be matched to the CPU count of the VM. For compatibility mlx5 continues to start up with MSI-X vector assignment, but the kernel can now access a larger dynamic vector pool and assign more vectors to created VFs. Link: https://lore.kernel.org/linux-pci/20210314124256.70253-4-leon@kernel.org Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>	2021-04-04 10:29:48 +03:00
Vu Pham	6783f0a21a	net/mlx5e: Dynamic alloc vlan table for netdev when needed Dynamically allocate the vlan table in mlx5e_priv for EN netdev when needed. Don't allocate it for representor netdev. Signed-off-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:08 -07:00
Vu Pham	f6755b80d6	net/mlx5e: Dynamic alloc arfs table for netdev when needed Dynamically allocate the arfs table in mlx5e_priv for EN netdev when needed. Don't allocate it for representor netdev. Signed-off-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:08 -07:00
Ariel Levkovich	bb5696570b	net/mlx5e: Reject tc rules which redirect from a VF to itself Since there are self loopback prevention mechanisms at the VF level, offloading such rules which redirect from a VF to itself in the eswitch will break the datapath since the packets will be dropped once they go back to the vport they came from. Therefore, offloading such rules will be rejected and left to be handled by SW. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:08 -07:00
Roi Dayan	8802b8a44e	net/mlx5: Use ida_alloc_range() instead of ida_simple_alloc() ida_simple_alloc() and remove functions are deprecated. Related change: commit `3264ceec8f` ("lib/idr.c: document that ida_simple_{get,remove}() are deprecated") Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:07 -07:00
Parav Pandit	233dd7d656	net/mlx5: E-Switch, move QoS specific fields to existing qos struct Function QoS related fields are already defined in qos related struct. min and max rate are left out to mlx5_vport_info struct. Move them to existing qos struct. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:07 -07:00
Parav Pandit	b47e105625	net/mlx5: E-Switch, cut down mlx5_vport_info structure size by 8 bytes Structure mlx5_vport_info consumes 40 bytes of space due to a hole in it. After packing it reduces to 32 bytes. Currently: $ pahole -C mlx5_vport_info drivers/net/ethernet/mellanox/mlx5/core/eswitch.o struct mlx5_vport_info { u8 mac[6]; /* 0 6 */ u16 vlan; /* 6 2 */ u8 qos; /* 8 1 */ /* XXX 7 bytes hole, try to pack */ u64 node_guid; /* 16 8 */ int link_state; /* 24 4 */ u32 min_rate; /* 28 4 */ u32 max_rate; /* 32 4 */ bool spoofchk; /* 36 1 */ bool trusted; /* 37 1 */ /* size: 40, cachelines: 1, members: 9 */ /* sum members: 31, holes: 1, sum holes: 7 */ /* padding: 2 */ /* last cacheline: 40 bytes */ }; After packing: $ pahole -C mlx5_vport_info drivers/net/ethernet/mellanox/mlx5/core/eswitch.o struct mlx5_vport_info { u8 mac[6]; /* 0 6 */ u16 vlan; /* 6 2 */ u64 node_guid; /* 8 8 */ int link_state; /* 16 4 */ u32 min_rate; /* 20 4 */ u32 max_rate; /* 24 4 */ u8 qos; /* 28 1 */ u8 spoofchk:1; /* 29: 0 1 */ u8 trusted:1; /* 29: 1 1 */ /* size: 32, cachelines: 1, members: 9 */ /* padding: 2 */ /* bit_padding: 6 bits */ /* last cacheline: 32 bytes */ }; Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:07 -07:00
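The size change reported by pahole in the commit above can be reproduced in userspace with a short C sketch. The struct names `vport_info_before` and `vport_info_after` are hypothetical stand-ins that mirror the two field orders using `<stdint.h>` types. The numbers assume a common LP64 ABI (such as x86-64 or arm64) where `uint64_t` is 8-byte aligned, and a compiler such as GCC or Clang that accepts `uint8_t` bit-fields; on other ABIs the assertions may not hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field order before the change: the 1-byte qos at offset 8 forces
 * 7 bytes of padding so that node_guid can start at offset 16. */
struct vport_info_before {
    uint8_t  mac[6];
    uint16_t vlan;
    uint8_t  qos;
    uint64_t node_guid;
    int      link_state;
    uint32_t min_rate;
    uint32_t max_rate;
    bool     spoofchk;
    bool     trusted;
};

/* Field order after the change: small members are grouped at the end
 * and the two flags share one byte as bit-fields. */
struct vport_info_after {
    uint8_t  mac[6];
    uint16_t vlan;
    uint64_t node_guid;
    int      link_state;
    uint32_t min_rate;
    uint32_t max_rate;
    uint8_t  qos;
    uint8_t  spoofchk : 1;
    uint8_t  trusted : 1;
};

/* Compile-time checks matching the pahole output quoted above
 * (valid under the ABI assumptions stated in the lead-in). */
static_assert(offsetof(struct vport_info_before, node_guid) == 16,
              "hole before node_guid");
static_assert(sizeof(struct vport_info_before) == 40, "unpacked size");
static_assert(offsetof(struct vport_info_after, node_guid) == 8,
              "no hole before node_guid");
static_assert(sizeof(struct vport_info_after) == 32, "packed size");
```

Reordering alone removes the 7-byte hole; the bit-fields only save one more byte, which the trailing alignment padding absorbs.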
Parav Pandit	19779f28c9	net/mlx5: Pair mutex_destory with mutex_init for rate limit table Add missing mutex_destroy() to pair with mutex_init(). This should be done only when table is initialized, hence perform mutex_init() only when table is initialized. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:06 -07:00
Parav Pandit	6b30b6d4d3	net/mlx5: Allocate rate limit table when rate is configured A device supports 128 rate limiters. A static table allocation consumes 8KB of memory even when rate is not configured. Instead, allocate the table when at least one rate is configured. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:06 -07:00
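The allocate-on-first-use pattern described in the commit above, combined with the single-place refcount increment from the neighbouring rate-limit commits, can be sketched in plain C. This is a userspace illustration under stated assumptions, not the mlx5 code: `rl_table`, `rl_entry`, `rl_add_rate` and `rl_remove_rate` are hypothetical names, locking is omitted, and freeing the table once the last rate is removed is an assumption of this sketch rather than something the commit messages state.

```c
#include <stdint.h>
#include <stdlib.h>

#define RL_MAX_ENTRIES 128 /* the commit message cites 128 rate limiters */

struct rl_entry {
    uint32_t rate;
    uint32_t refcount;
};

struct rl_table {
    struct rl_entry *entries; /* NULL until the first rate is configured */
    unsigned int used;        /* entries with a non-zero refcount */
};

/* Configure a rate, allocating the table lazily on first use.
 * Returns 0 and stores the entry index on success, -1 on failure. */
static int rl_add_rate(struct rl_table *t, uint32_t rate, unsigned int *index)
{
    struct rl_entry *slot = NULL;

    if (rate == 0)
        return -1;

    if (!t->entries) {
        t->entries = calloc(RL_MAX_ENTRIES, sizeof(*t->entries));
        if (!t->entries)
            return -1;
    }

    for (unsigned int i = 0; i < RL_MAX_ENTRIES; i++) {
        struct rl_entry *e = &t->entries[i];

        if (e->refcount && e->rate == rate) {
            slot = e; /* reuse an identical rate */
            break;
        }
        if (!e->refcount && !slot)
            slot = e; /* remember the first free entry */
    }
    if (!slot)
        return -1; /* table full */

    if (!slot->refcount) {
        slot->rate = rate;
        t->used++;
    }
    slot->refcount++; /* one increment for both new and reused entries */
    *index = (unsigned int)(slot - t->entries);
    return 0;
}

/* Drop one reference; release the table when no rate remains
 * (the release policy is an assumption of this sketch). */
static void rl_remove_rate(struct rl_table *t, unsigned int index)
{
    if (!t->entries || index >= RL_MAX_ENTRIES ||
        !t->entries[index].refcount)
        return;

    if (--t->entries[index].refcount == 0 && --t->used == 0) {
        free(t->entries);
        t->entries = NULL;
    }
}
```

A zero-initialised `struct rl_table` costs only a pointer and a counter until `rl_add_rate()` is first called, which is the memory saving the commit is after.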
Parav Pandit	97d85aba25	net/mlx5: Use helper to increment, decrement rate entry refcount Rate limit entry refcount can be incremented uniformly when it is newly allocated or reused. So simplify the code to increment refcount at one place. Use decrement refcount helper in two routines. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:06 -07:00
Parav Pandit	51ccc9f5f1	net/mlx5: Use helpers to allocate and free rl table entries Use helper routines to allocate and free rate limit table entries. Subsequent patch extends use of these helpers to do allocation during rate entry allocation callback. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:05 -07:00
Parav Pandit	16e74672a2	net/mlx5: Do not hold mutex while reading table constants Table max_size, min and max rate are constants initialized while table is created. Reading it doesn't need to hold a table mutex. Hence, read them without holding table mutex. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:05 -07:00
Parav Pandit	c6baac47d9	net/mlx5: Use unsigned int for free_count Fix the warning due to missing int. WARNING: Prefer 'unsigned int' to bare use of 'unsigned' + unsigned free_count; Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2021-04-02 16:13:04 -07:00


7886 Commits