Commit Graph

7886 Commits

Author SHA1 Message Date
Jakub Kicinski adc2e56ebe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh

scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18 19:47:02 -07:00
Linus Torvalds 9ed13a17e3 Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
bluetooth, netfilter and can.
 

Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
  bluetooth, netfilter and can.

  Current release - regressions:

   - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
     to fix modifying offloaded qdiscs

   - lantiq: net: fix duplicated skb in rx descriptor ring

   - rtnetlink: fix regression in bridge VLAN configuration, empty info
     is not an error, bot-generated "fix" was not needed

   - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem
     creation

  Current release - new code bugs:

   - ethtool: fix NULL pointer dereference during module EEPROM dump via
     the new netlink API

   - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose
     queue should not be visible to the stack

   - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs

   - mlx5e: verify dev is present in get devlink port ndo, avoid a panic

  Previous releases - regressions:

   - neighbour: allow NUD_NOARP entries to be force GCed

   - further fixes for fallout from reorg of WiFi locking (staging:
     rtl8723bs, mac80211, cfg80211)

   - skbuff: fix incorrect msg_zerocopy copy notifications

   - mac80211: fix NULL ptr deref for injected rate info

   - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs

  Previous releases - always broken:

   - bpf: more speculative execution fixes

   - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local

   - udp: fix race between close() and udp_abort() resulting in a panic

   - fix out of bounds when parsing TCP options before packets are
     validated (in netfilter: synproxy, tc: sch_cake and mptcp)

   - mptcp: improve operation under memory pressure, add missing
     wake-ups

   - mptcp: fix double-lock/soft lookup in subflow_error_report()

   - bridge: fix races (null pointer deref and UAF) in vlan tunnel
     egress

   - ena: fix DMA mapping function issues in XDP

   - rds: fix memory leak in rds_recvmsg

  Misc:

   - vrf: allow larger MTUs

   - icmp: don't send out ICMP messages with a source address of 0.0.0.0

   - cdc_ncm: switch to eth%d interface naming"

* tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits)
  net: ethernet: fix potential use-after-free in ec_bhf_remove
  selftests/net: Add icmp.sh for testing ICMP dummy address responses
  icmp: don't send out ICMP messages with a source address of 0.0.0.0
  net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
  net: ll_temac: Fix TX BD buffer overwrite
  net: ll_temac: Add memory-barriers for TX BD access
  net: ll_temac: Make sure to free skb when it is completely used
  MAINTAINERS: add Guvenc as SMC maintainer
  bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
  bnxt_en: Fix TQM fastpath ring backing store computation
  bnxt_en: Rediscover PHY capabilities after firmware reset
  cxgb4: fix wrong shift.
  mac80211: handle various extensible elements correctly
  mac80211: reset profile_periodicity/ema_ap
  cfg80211: avoid double free of PMSR request
  cfg80211: make certificate generation more robust
  mac80211: minstrel_ht: fix sample time check
  net: qed: Fix memcpy() overflow of qed_dcbx_params()
  net: cdc_eem: fix tx fixup skb leak
  net: hamradio: fix memory leak in mkiss_close
  ...
2021-06-18 18:55:29 -07:00
Aya Levin 0232fc2ddc net/mlx5: Reset mkey index on creation
Reset only the index part of the mkey and keep the variant part. On
devlink reload, the driver recreates mkeys, so the mkey index may change.
While trying to preserve the variant part of the mkey, the driver mistakenly
merged the mkey index with the current value. In case of a devlink reload,
the current value of the index part is dirty, so the index may be corrupted.
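For illustration only (not taken from the patch), a sketch of the intended
composition, assuming the common mlx5 layout where the low 8 bits of the key
carry the variant and the upper bits carry the firmware-assigned index:

  /* Sketch with an assumed bit layout: keep the caller-chosen 8-bit variant,
   * take the index part only from the value firmware returned at creation.
   */
  u32 variant = mkey->key & 0xff;
  mkey->key = (mkey_index << 8) | variant;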

Fixes: 54c62e13ad ("{IB,net}/mlx5: Setup mkey variant before mr create command invocation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:49 -07:00
Dmytro Linkin a5ae8fc905 net/mlx5e: Don't create devices during unload flow
Running the devlink reload command for a port in switchdev mode causes
resource corruption: the driver can't release the allocated EQ and reclaim
memory pages, because the "rdma" auxiliary device has added CQs which block
the EQ from deletion.
The erroneous sequence happens during the reload-down phase, as follows:

1. detach device - suspends auxiliary devices which support it, destroys
   the others. During this step "eth-rep" and "rdma-rep" are destroyed,
   "eth" is suspended.
2. disable SRIOV - moves the device to legacy mode; as part of disablement,
   rescans drivers. This step adds the "rdma" auxiliary device.
3. destroy EQ table - <failure>.

The driver shouldn't create any device during unload flows. To handle that,
implement the MLX5_PRIV_FLAGS_DETACH flag: set it on device detach and clear
it on device attach. If the flag is set, make the drivers rescan a no-op.
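A minimal sketch of the intended gating, assuming the flag sits in the
existing dev->priv.flags bitmask (the placement is an assumption):

  /* sketch: inside the auxiliary-driver rescan path (simplified) */
  if (dev->priv.flags & MLX5_PRIV_FLAGS_DETACH)
          return 0;       /* detach in progress: don't create any devices */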

Fixes: a925b5e309 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:47 -07:00
Alex Vesker 65fb7d109a net/mlx5: DR, Fix STEv1 incorrect L3 decapsulation padding
L3 decapsulation on small inner packets of less than 64 bytes was
done incorrectly. In small packets there is extra padding added in L2
which should not be included in the L3 length. The issue was that after
decap L3 the extra L2 padding caused an update of the L3 length.

To avoid this issue the new header is pushed to the beginning
of the packet (offset 0), which should not cause a HW reparse
and an update of the L3 length.

Fixes: c349b4137c ("net/mlx5: DR, Add STEv1 modify header logic")
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:45 -07:00
Parav Pandit c7d6c19b3b net/mlx5: SF_DEV, remove SF device on invalid state
When auxiliary bus autoprobe is disabled and the SF is in ACTIVE state,
on SF port deletion it transitions from ACTIVE->ALLOCATED->INVALID.

When the VHCA event handler queries the state, it has already transitioned
to the INVALID state.

In this scenario, the event handler misses deleting the SF device.

Fix it by deleting the SF when the SF state is INVALID.

Fixes: 90d010b863 ("net/mlx5: SF, Add auxiliary device support")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:42 -07:00
Parav Pandit ca36fc4d77 net/mlx5: E-Switch, Allow setting GUID for host PF vport
The E-switch should be able to set the GUID of the host PF vport.
Currently it returns an error. This results in the error below
when the user attempts to configure the MAC address of the PF of an
external controller.

$ devlink port function set pci/0000:03:00.0/196608 \
   hw_addr 00:00:00:11:22:33

mlx5_core 0000:03:00.0: mlx5_esw_set_vport_mac_locked:1876:(pid 6715):\
"Failed to set vport 0 node guid, err = -22.
RDMA_CM will not function properly for this VF."

The check for zero vport is no longer needed.

Fixes: 330077d14d ("net/mlx5: E-switch, Supporting setting devlink port function mac address")
Signed-off-by: Yuval Avnery <yuvalav@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:40 -07:00
Parav Pandit bbc8222dc4 net/mlx5: E-Switch, Read PF mac address
The external controller PF's MAC address is not read from the device during
vport setup. Failing to read it results in showing all zeros to the user,
while the factory-programmed MAC is a valid value.

$ devlink port show eth1 -jp
{
    "port": {
        "pci/0000:03:00.0/196608": {
            "type": "eth",
            "netdev": "eth1",
            "flavour": "pcipf",
            "controller": 1,
            "pfnum": 0,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:00:00"
            }
        }
    }
}

Hence, read it when enabling a vport.

After the fix,

$ devlink port show eth1 -jp
{
    "port": {
        "pci/0000:03:00.0/196608": {
            "type": "eth",
            "netdev": "eth1",
            "flavour": "pcipf",
            "controller": 1,
            "pfnum": 0,
            "splittable": false,
            "function": {
                "hw_addr": "98:03:9b:a0:60:11"
            }
        }
    }
}

Fixes: f099fde16d ("net/mlx5: E-switch, Support querying port function mac address")
Signed-off-by: Bodong Wang <bodong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:38 -07:00
Leon Romanovsky 2058cc9c80 net/mlx5: Check that driver was probed prior attaching the device
The device can be requested to be attached despite not being probed.
This situation is possible if devlink reload races with module removal,
and the following kernel panic is an outcome of such a race.

 mlx5_core 0000:00:09.0: firmware version: 4.7.9999
 mlx5_core 0000:00:09.0: 0.000 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x255 link)
 BUG: unable to handle page fault for address: fffffffffffffff0
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 3218067 P4D 3218067 PUD 321a067 PMD 0
 Oops: 0000 [#1] SMP KASAN NOPTI
 CPU: 7 PID: 250 Comm: devlink Not tainted 5.12.0-rc2+ #2836
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:mlx5_attach_device+0x80/0x280 [mlx5_core]
 Code: f8 48 c1 e8 03 42 80 3c 38 00 0f 85 80 01 00 00 48 8b 45 68 48 8d 78 f0 48 89 fe 48 c1 ee 03 42 80 3c 3e 00 0f 85 70 01 00 00 <48> 8b 40 f0 48 85 c0 74 0d 48 89 ef ff d0 85 c0 0f 85 84 05 0e 00
 RSP: 0018:ffff8880129675f0 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff827407f1
 RDX: 1ffff110011336cf RSI: 1ffffffffffffffe RDI: fffffffffffffff0
 RBP: ffff888008e0c000 R08: 0000000000000008 R09: ffffffffa0662ee7
 R10: fffffbfff40cc5dc R11: 0000000000000000 R12: ffff88800ea002e0
 R13: ffffed1001d459f7 R14: ffffffffa05ef4f8 R15: dffffc0000000000
 FS:  00007f51dfeaf740(0000) GS:ffff88806d5c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: fffffffffffffff0 CR3: 000000000bc82006 CR4: 0000000000370ea0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  mlx5_load_one+0x117/0x1d0 [mlx5_core]
  devlink_reload+0x2d5/0x520
  ? devlink_remote_reload_actions_performed+0x30/0x30
  ? mutex_trylock+0x24b/0x2d0
  ? devlink_nl_cmd_reload+0x62b/0x1070
  devlink_nl_cmd_reload+0x66d/0x1070
  ? devlink_reload+0x520/0x520
  ? devlink_nl_pre_doit+0x64/0x4d0
  genl_family_rcv_msg_doit+0x1e9/0x2f0
  ? mutex_lock_io_nested+0x1130/0x1130
  ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
  ? security_capable+0x51/0x90
  genl_rcv_msg+0x27f/0x4a0
  ? genl_get_cmd+0x3c0/0x3c0
  ? lock_acquire+0x1a9/0x6d0
  ? devlink_reload+0x520/0x520
  ? lock_release+0x6c0/0x6c0
  netlink_rcv_skb+0x11d/0x340
  ? genl_get_cmd+0x3c0/0x3c0
  ? netlink_ack+0x9f0/0x9f0
  ? lock_release+0x1f9/0x6c0
  genl_rcv+0x24/0x40
  netlink_unicast+0x433/0x700
  ? netlink_attachskb+0x730/0x730
  ? _copy_from_iter_full+0x178/0x650
  ? __alloc_skb+0x113/0x2b0
  netlink_sendmsg+0x6f1/0xbd0
  ? netlink_unicast+0x700/0x700
  ? netlink_unicast+0x700/0x700
  sock_sendmsg+0xb0/0xe0
  __sys_sendto+0x193/0x240
  ? __x64_sys_getpeername+0xb0/0xb0
  ? copy_page_range+0x2300/0x2300
  ? __up_read+0x1a1/0x7b0
  ? do_user_addr_fault+0x219/0xdc0
  __x64_sys_sendto+0xdd/0x1b0
  ? syscall_enter_from_user_mode+0x1d/0x50
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f51dffb514a
 Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
 RSP: 002b:00007ffcaef22e78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f51dffb514a
 RDX: 0000000000000030 RSI: 000055750daf2440 RDI: 0000000000000003
 RBP: 000055750daf2410 R08: 00007f51e0081200 R09: 000000000000000c
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 Modules linked in: mlx5_core(-) ptp pps_core ib_ipoib rdma_ucm rdma_cm iw_cm ib_cm ib_umad ib_uverbs ib_core [last unloaded: mlx5_ib]
 CR2: fffffffffffffff0
 ---[ end trace 7789831bfe74fa42 ]---

Fixes: a925b5e309 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:35 -07:00
Leon Romanovsky 94a4b8414d net/mlx5: Fix error path for set HCA defaults
In the case of a failure to execute mlx5_core_set_hca_defaults(),
we used the wrong goto label for the error unwind flow.

Fixes: 5bef709d76 ("net/mlx5: Enable host PF HCA after eswitch is initialized")
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:32 -07:00
Colin Ian King fb0a1dacf2 mlxsw: spectrum_router: remove redundant continue statement
The continue statement at the end of a for-loop has no effect;
remove it.

Addresses-Coverity: ("Continue has no effect")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:46:21 -07:00
Shay Drory c36326d38d net/mlx5: Round-Robin EQs over IRQs
Whenever users provide affinity for an EQ creation request, map the
EQ to a matching IRQ.
A matching IRQ is an IRQ with the same affinity and type (completion/control)
as the EQ being created.

This mapping is done in an aggressive dedicated IRQ allocation scheme,
described below.

First, we check whether there is a matching IRQ whose min threshold
is not exhausted.
   - min_eqs_threshold = 3 for control EQs.
   - min_eqs_threshold = 1 for completion EQs.
In case no matching IRQ was found, try to request a new IRQ.
In case we can't request a new IRQ, reuse the least-used matching IRQ.
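For illustration, a simplified standalone model of that selection order
(the struct, helper name and fields are made up for the sketch, not the
driver's actual code):

  #include <stdbool.h>
  #include <stddef.h>

  struct irq_slot { int cpu; bool ctrl; int eqs; };

  static struct irq_slot *pick_irq(struct irq_slot *irqs, int n,
                                   int cpu, bool ctrl, bool can_request_new)
  {
          int threshold = ctrl ? 3 : 1;   /* min_eqs_threshold from above */
          struct irq_slot *least = NULL;

          for (int i = 0; i < n; i++) {
                  if (irqs[i].cpu != cpu || irqs[i].ctrl != ctrl)
                          continue;               /* affinity/type must match */
                  if (irqs[i].eqs < threshold)
                          return &irqs[i];        /* still below its threshold */
                  if (!least || irqs[i].eqs < least->eqs)
                          least = &irqs[i];
          }
          return can_request_new ? NULL : least;  /* NULL => request a new IRQ */
  }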

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:58:00 -07:00
Shay Drory c8ea212bfd net/mlx5: Separate between public and private API of sf.h
Move mlx5_sf_max_functions() and friends from the private sf/sf.h
to the public lib/sf.h. This is done in order to have one-directional
include paths.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:58:00 -07:00
Shay Drory 71e084e264 net/mlx5: Allocating a pool of MSI-X vectors for SFs
SFs (Sub Functions) currently use IRQs from the global IRQ table their
parent Physical Function has. In order to scale better, we need to
allocate more IRQs and share them between different SFs.

The driver will maintain 3 separate IRQ pools:
1. A pool that serves the PF consumers (the PF's netdev and rdma stacks),
similar to what the driver had before this patch. I.e., this pool will share
IRQs between rdma and netdev, and will keep the IRQ indexes and allocation
order. The latter is important for the PF netdev rmap (aRFS).

2. A pool of control IRQs for SFs. The size of this pool is the number
of SFs that can be created divided by SFS_PER_IRQ. This pool will serve
the control path EQs of the SFs.

3. A pool of completion data path IRQs for SF transport queues. The
size of this pool is:
num_irqs_allocated - pf_pool_size - sf_ctrl_pool_size.
This pool will serve the netdev and rdma stacks. Moreover, rmap is not
supported on SFs.

The sharing methodology of the SF pools is explained in the next patch.

Important note: rmap is not supported on SFs because rmap mapping cannot
function correctly for IRQs that are shared between different core/netdev RX
rings.
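A small illustrative helper (names are placeholders) that mirrors the pool
sizing rules above:

  /* Illustrative only -- mirrors the pool sizing described above. */
  static int sf_ctrl_pool_size(int max_num_sfs, int sfs_per_irq)
  {
          return max_num_sfs / sfs_per_irq;                       /* pool 2 */
  }

  static int sf_comp_pool_size(int total_irqs, int pf_pool, int sf_ctrl_pool)
  {
          return total_irqs - pf_pool - sf_ctrl_pool;             /* pool 3 */
  }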

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:58:00 -07:00
Shay Drory fc63dd2a85 net/mlx5: Change IRQ storage logic from static to dynamic
Store newly created IRQs in the xarray DB instead of a static array,
so we will be able to store only IRQs which are being used.
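A minimal sketch of the xarray pattern meant here (the index key and helper
names are placeholders, not the driver's code):

  #include <linux/xarray.h>

  struct mlx5_irq;        /* opaque here; only pointers are stored */

  /* sketch: keep only live IRQs, keyed by their index */
  static int irq_store(struct xarray *xa, unsigned long idx, struct mlx5_irq *irq)
  {
          return xa_err(xa_store(xa, idx, irq, GFP_KERNEL));
  }

  static struct mlx5_irq *irq_lookup(struct xarray *xa, unsigned long idx)
  {
          return xa_load(xa, idx);
  }

  static void irq_release(struct xarray *xa, unsigned long idx)
  {
          xa_erase(xa, idx);      /* drop the entry when the IRQ is freed */
  }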

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:59 -07:00
Shay Drory 2d74524c01 net/mlx5: Moving rmap logic to EQs
IRQs are being simplified in order to ease their sharing, and any
feature-specific object will be moved to an upper layer.
Hence we move the rmap object into the eq_table.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:59 -07:00
Shay Drory e8abebb3a4 net/mlx5: Extend mlx5_irq_request to request IRQ from the kernel
Extend mlx5_irq_request so that IRQs will be requested upon EQ creation,
and not on driver boot.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:59 -07:00
Shay Drory 2de6153837 net/mlx5: Removing rmap per IRQ
In the next patches, IRQs will be requested on demand, instead of
statically at driver boot.
Also, currently, rmap is managed by the IRQ layer. rmap management will
move out of the IRQ layer in future patches.

Therefore, we want to remove an IRQ from the rmap when that IRQ is destroyed,
instead of removing all the IRQs from the rmap when the irq_table is destroyed.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:58 -07:00
Leon Romanovsky 652e3581f2 net/mlx5: Clean license text in eq.[c|h] files
The eq.[c|h] files are under a major rewrite, so use this opportunity to
update their copyright and license texts.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:58 -07:00
Leon Romanovsky e4e3f24b82 net/mlx5: Provide cpumask at EQ creation phase
The users of EQ are running their code on different CPUs and with
various affinity patterns. Move the cpumask setting close to their
actual usage.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:57 -07:00
Shay Drory 3b43190b2f net/mlx5: Introduce API for request and release IRQs
Introduce a new API that will allow IRQ users to hold a pointer to
mlx5_irq.
At the end of this series, IRQs will be allocated on demand. Hence,
this will allow us to properly manage and use IRQs.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:57 -07:00
Leon Romanovsky c38421abcf net/mlx5: Delay IRQ destruction till all users are gone
Shared IRQs are consumed by multiple EQ users; in order to properly
initialize and later release such IRQs, add kref counting to the IRQ
structure.
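A minimal sketch of the kref pattern referred to here (the wrapper struct
and helper names are illustrative, not the driver's):

  #include <linux/kref.h>
  #include <linux/slab.h>

  struct mlx5_irq_ref {                   /* illustrative wrapper only */
          struct kref kref;               /* kref_init() on IRQ creation */
          int irqn;
  };

  static void irq_ref_release(struct kref *kref)
  {
          struct mlx5_irq_ref *irq = container_of(kref, struct mlx5_irq_ref, kref);

          kfree(irq);     /* real code would also free_irq()/clean up here */
  }

  /* each EQ user takes a reference; the IRQ is destroyed on the last put */
  static void irq_get(struct mlx5_irq_ref *irq)
  {
          kref_get(&irq->kref);
  }

  static void irq_put(struct mlx5_irq_ref *irq)
  {
          kref_put(&irq->kref, irq_ref_release);
  }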

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:57 -07:00
Mark Bloch 8a66e45859 net/mlx5: Change ownership model for lag
Lag is used to combine two PCI functions of the same HCA into a single
logical unit. This is core functionality and as such should be managed by
the core driver. Currently this isn't the case. While we store the lag
software structure inside the lower device, its lifetime (creation /
destruction) is dictated by the mlx5e part. Change the ownership model so
that lag is tied to the lifetime of the lower-level driver instead of to the
mlx5e part.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:56 -07:00
Mark Bloch 8ed19471fd net/mlx5: Lag, Don't rescan if the device is going down
If MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV is set it means the device is going
down and mlx5_rescan_drivers_locked() shouldn't be called.
With this patch and the previous one in the series, unbinding a PCI
function when its netdev is part of a bond works and leaves the system in a
working state.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:56 -07:00
Mark Bloch 8c22ad36ee net/mlx5: Lag, refactor disable flow
When a net device is removed (which can happen if the PCI function is unbound
from the system) it's not enough to destroy the hardware lag. The system
should recreate the original devices that were present before the lag.
As the same flow is done when a net device is removed from the bond,
refactor and reuse the code.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14 20:57:56 -07:00
Linus Torvalds 29a877d576 RDMA second v5.13 rc Pull Request

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "A mixture of small bug fixes and a small security issue:

   - WARN_ON when IPoIB is automatically moved between namespaces

   - Long standing bug where mlx5 would use the wrong page for the
     doorbell recovery memory if fork is used

   - Security fix for mlx4 that disables the timestamp feature

   - Several crashers for mlx5

   - Plug a recent mlx5 memory leak for the sig_mr"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/mlx5: Fix initializing CQ fragments buffer
  RDMA/mlx5: Delete right entry from MR signature database
  RDMA: Verify port when creating flow rule
  RDMA/mlx5: Block FDB rules when not in switchdev mode
  RDMA/mlx4: Do not map the core_clock page to user space unless enabled
  RDMA/mlx5: Use different doorbell memory for different processes
  RDMA/ipoib: Fix warning caused by destroying non-initial netns
2021-06-10 10:53:04 -07:00
Vlad Buslov 9724fd5d9c net/mlx5: Bridge, add tracepoints
Move the private bridge structures to a dedicated header that is accessible to
the bridge tracepoint header. Implement the following tracepoints:

- Initialize FDB entry.
- Refresh FDB entry.
- Cleanup FDB entry.
- Create VLAN.
- Cleanup VLAN.
- Attach port to bridge.
- Detach port from bridge.

Usage example:

># cd /sys/kernel/debug/tracing
># echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
># cat trace
...
   kworker/u20:1-96      [001] ....   231.892503: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 lastuse=4294895695

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:12 -07:00
Vlad Buslov cc2987c44b net/mlx5: Bridge, filter tagged packets that didn't match tagged fg
With support for pvid vlans in the mlx5 bridge it is possible to have rules in
the untagged flow group when vlan filtering is enabled. However, such rules can
also match tagged packets that didn't match anything in the tagged flow group.
Filter such packets by introducing an additional flow group between the tagged
and untagged groups. When filtering is enabled on the bridge, create an
additional flow in the vlan filtering flow group that matches tagged packets
with the specified source MAC address and redirects them to the new "skip"
table. The skip table is a new lowest-level empty table that is used to skip
all further processing of the packet in the bridge priority.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:12 -07:00
Vlad Buslov 36e55079e5 net/mlx5: Bridge, support pvid and untagged vlan configurations
Implement support for pushing a vlan header into untagged packets on ingress
of a port that has a pvid configured, and support for popping the vlan on
egress of a port that has the matching vlan configured as untagged. To support
such configurations, packet reformat contexts of {INSERT|REMOVE}_HEADER types
are created per such vlan and saved to struct mlx5_esw_bridge_vlan, which
allows all FDB entries on a particular vlan to share a single packet reformat
instance. When initializing FDB entries with pvid or untagged vlan type, set
their mlx5_flow_act->pkt_reformat action accordingly.

Flush all flows when removing a vlan from a port. This is necessary because
even though the software bridge removes all FDB entries before removing their
vlan, the mlx5 bridge implementation deletes the corresponding flow entries
from hardware in an asynchronous workqueue task, which will cause a firmware
error if the vlan packet reformat context is deleted before all flows that
point to it.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:11 -07:00
Vlad Buslov ffc89ee5e5 net/mlx5: Bridge, match FDB entry vlan tag
Add support for FDB vlan-tagged entries. Extend ingress and egress flow
tables with flow groups to match packet vlan tag. Modify the flow creation
code to include vlan tag, if vlan is configured on port and vlan
configuration is supported for offload.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:11 -07:00
Vlad Buslov d75b9e8048 net/mlx5: Bridge, implement infrastructure for vlans
Establish all the necessary infrastructure for implementing vlan matching
and vlan push/pop in the following patches:

- Add a new per-vport struct mlx5_esw_bridge_port that is used to store
metadata for all port vlans. Initialize and clean up the instance of the
structure when a port representor is linked/unlinked to the bridge. Use an
xarray to allow quick vport metadata lookup by vport number.

- Add a new per-port-vlan struct mlx5_esw_bridge_vlan that is used to store
vlan-specific data (vid, flags). Handle the SWITCHDEV_PORT_OBJ_{ADD|DEL}
switchdev blocking event for the SWITCHDEV_OBJ_ID_PORT_VLAN object by
creating/deleting the vlan structure and saving it in the per-vport xarray for
quick lookup.

- Implement support for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING object
attribute that is used to toggle vlan filtering. Remove all FDB entries
from hardware when the vlan filtering state is changed.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:10 -07:00
Vlad Buslov c636a0f0f3 net/mlx5: Bridge, dynamic entry ageing
Dynamic FDB entries require the capability to age out unused entries. Such
entries are either aged out by the kernel software bridge implementation or by
the hardware switch that offloaded them (and notified the kernel to mark them
as SWITCHDEV_FDB_ADD_TO_BRIDGE). Leaving ageing to the kernel bridge would
result in it deleting offloaded dynamic FDB entries every ageing_time period,
because packets are processed by hardware and, consequently, the 'used'
timestamp for the FDB entry is not updated. However, since the hardware doesn't
support ageing, a software solution inside the driver is required.

In order to emulate hardware ageing in the driver, extend bridge FDB ingress
flows with a counter and create a delayed br_offloads->update_work task on the
bridge offloads workqueue. Run the task every second and update the 'used'
timestamp in the software bridge dynamic entry by sending
SWITCHDEV_FDB_ADD_TO_BRIDGE for the entry if its flow hardware counter
lastuse field changed since the last update. If lastuse didn't change for an
ageing_time period, then delete the FDB entry and notify the kernel bridge by
sending a SWITCHDEV_FDB_DEL_TO_BRIDGE notification.
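A rough sketch of that polling task (it assumes the br_offloads structure
carries the update_work and wq members described in this series; the interval
constant name is a placeholder):

  #include <linux/workqueue.h>
  #include <linux/jiffies.h>

  #define MLX5_ESW_BRIDGE_UPDATE_INTERVAL 1000    /* ms, i.e. once per second */

  static void mlx5_esw_bridge_update_work(struct work_struct *work)
  {
          struct mlx5_esw_bridge_offloads *br_offloads =
                  container_of(work, struct mlx5_esw_bridge_offloads,
                               update_work.work);

          /* walk fdb_list: send SWITCHDEV_FDB_ADD_TO_BRIDGE when a flow
           * counter's lastuse moved since the last run, otherwise age the
           * entry out with SWITCHDEV_FDB_DEL_TO_BRIDGE after ageing_time
           */

          queue_delayed_work(br_offloads->wq, &br_offloads->update_work,
                             msecs_to_jiffies(MLX5_ESW_BRIDGE_UPDATE_INTERVAL));
  }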

Register a blocking switchdev notifier callback and handle the attribute set
SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME event to allow the user to dynamically
configure the bridge FDB entry ageing timeout. Save the value per-bridge in
struct mlx5_esw_bridge. Silently ignore the
SWITCHDEV_ATTR_ID_PORT_{PRE_}BRIDGE_FLAGS switchdev event since the mlx5 bridge
implementation relies on the software bridge for implementing the necessary
behavior for all of these flags.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:10 -07:00
Vlad Buslov 7cd6a54a82 net/mlx5: Bridge, handle FDB events
Hardware supported by the mlx5 driver doesn't provide learning and requires the
driver to emulate all switch-like behavior in software. As such, all
packets by default go through the miss path, appear on the representor and get
to the software bridge, if it is the upper device of the representor. This
causes the bridge to process the packet in software, learn the MAC address into
the FDB and send a SWITCHDEV_FDB_ADD_TO_DEVICE event to all subscribers.

In order to offload FDB entries in mlx5, register a switchdev notifier
callback and implement support for both 'added_by_user' and dynamic FDB
entry SWITCHDEV_FDB_ADD_TO_DEVICE events asynchronously using the new
mlx5_esw_bridge_offloads->wq ordered workqueue. In the workqueue callback
offload the ingress rule (matching the FDB entry MAC as the packet source MAC)
and the egress table rule (matching the FDB entry MAC as the destination MAC).
For the ingress table rule also match the source vport to ensure that only
traffic coming from the expected bridge port is matched by the offloaded rule.
Save all the relevant FDB entry data in a struct mlx5_esw_bridge_fdb_entry
instance and insert the instance in the new mlx5_esw_bridge->fdb_list list (for
traversal by the software ageing implementation in a following patch) and in
the new mlx5_esw_bridge->fdb_ht hash table for fast retrieval. Notify the
bridge that the FDB entry has been offloaded by sending a
SWITCHDEV_FDB_OFFLOADED notification.

Delete the FDB entry on reception of the SWITCHDEV_FDB_DEL_TO_DEVICE event.
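A condensed sketch of the notifier entry point (the handler name is a
placeholder, and the real handler queues a work item on the ordered workqueue
instead of doing the offload inline):

  #include <linux/notifier.h>
  #include <linux/printk.h>
  #include <net/switchdev.h>

  static int mlx5_esw_bridge_fdb_event(struct notifier_block *nb,
                                       unsigned long event, void *ptr)
  {
          struct switchdev_notifier_fdb_info *fdb_info;

          switch (event) {
          case SWITCHDEV_FDB_ADD_TO_DEVICE:
          case SWITCHDEV_FDB_DEL_TO_DEVICE:
                  fdb_info = container_of(ptr,
                                          struct switchdev_notifier_fdb_info,
                                          info);
                  pr_debug("fdb %pM vid %u event %lu\n",
                           fdb_info->addr, fdb_info->vid, event);
                  /* offload/remove ingress+egress rules for addr/vid, then
                   * notify the bridge with SWITCHDEV_FDB_OFFLOADED when done
                   */
                  break;
          }
          return NOTIFY_DONE;
  }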

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:10 -07:00
Vlad Buslov 19e9bfa044 net/mlx5: Bridge, add offload infrastructure
Create new files bridge.{c|h} in en/rep directory that implement bridge
interaction with representor netdevices and handle required
events/notifications, bridge.{c|h} in esw directory that implement all
necessary eswitch offloading infrastructure and works on vport/eswitch
level. Provide new kconfig MLX5_BRIDGE which is automatically selected when
both kernel bridge and mlx5 eswitch configs are enabled.

Provide basic infrastructure for bridge offloads:

- struct mlx5_esw_bridge_offloads - per-eswitch bridge offload structure
that encapsulates generic bridge-offloads data (notifier blocks, ingress
flow table/group, etc.) that is created/deleted on enable/disable eswitch
offloads.

- struct mlx5_esw_bridge - per-bridge structure that encapsulates
per-bridge data (reference counter, FDB, egress flow table/group, etc.)
that is created when the first eswitch representor is attached to a new bridge
and deleted when the last representor is removed from the bridge as a result
of a NETDEV_CHANGEUPPER event.

The bridge tables are created with the new priority FDB_BR_OFFLOAD in the FDB
namespace. The new priority is between the tc-miss and slow path priorities.
The priority consists of two levels: the ingress table, which is global per
eswitch, matches incoming packets by src_mac/vid and redirects them to the
next level (the egress table), which is chosen according to ingress port bridge
membership and matches on dst_mac/vid in order to redirect the packet to a
vport according to the following diagram:

                +
                |
      +---------v----------+
      |                    |
      |   FDB_TC_OFFLOAD   |
      |                    |
      +---------+----------+
                |
                |
      +---------v----------+
      |                    |
      |   FDB_FT_OFFLOAD   |
      |                    |
      +---------+----------+
                |
                |
      +---------v----------+
      |                    |
      |    FDB_TC_MISS     |
      |                    |
      +---------+----------+
                |
+--------------------------------------+
|               |                      |
|        +------+                      |
|        |                             |
| +------v--------+   FDB_BR_OFFLOAD   |
| | INGRESS_TABLE |                    |
| +------+---+----+                    |
|        |   |      match              |
|        |   +---------+               |
|        |             |               |    +-------+
|        |     +-------v-------+ match |    |       |
|        |     | EGRESS_TABLE  +------------> vport |
|        |     +-------+-------+       |    |       |
|        |             |               |    +-------+
|        |    miss     |               |
|        +------+------+               |
|               |                      |
+--------------------------------------+
                |
                |
      +---------v----------+
      |                    |
      |   FDB_SLOW_PATH    |
      |                    |
      +---------+----------+
                |
                v

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:09 -07:00
Vlad Buslov 0781015288 net/mlx5e: Refactor mlx5e_eswitch_{*}rep() helpers
Change the helper functions to accept a constant pointer to struct
net_device. This is necessary for following patches in the series that pass
mlx5e_eswitch_rep() as a callback to kernel bridge infrastructure code.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:09 -07:00
Vlad Buslov ec3be8873d net/mlx5: Create TC-miss priority and table
In order to adhere to the kernel software datapath model, bridge offloads must
come after the TC and NF FDBs. Following patches in this series add a new FDB
priority for the bridge after FDB_FT_OFFLOAD. However, since netfilter offload
is implemented with unmanaged tables, its miss path is not automatically
connected to the next priority and requires the code to manually connect it
with the slow table. To keep bridge offloads encapsulated and not mix them with
eswitch offloads, create a new FDB_TC_MISS priority between FDB_FT_OFFLOAD
and FDB_SLOW_PATH:

          +
          |
+---------v----------+
|                    |
|   FDB_TC_OFFLOAD   |
|                    |
+---------+----------+
          |
          |
          |
+---------v----------+
|                    |
|   FDB_FT_OFFLOAD   |
|                    |
+---------+----------+
          |
          |
          |
+---------v----------+
|                    |
|    FDB_TC_MISS     |
|                    |
+---------+----------+
          |
          |
          |
+---------v----------+
|                    |
|   FDB_SLOW_PATH    |
|                    |
+---------+----------+
          |
          v

Initialize the new priority with a single default empty managed table and use
the table as the TC/NF miss path instead of the slow table. This approach
allows bridge offloads to be created as a new FDB namespace priority between
FDB_TC_MISS and FDB_SLOW_PATH without exposing its internal tables to any
other modules, since the miss path of the managed TC-miss table is
automatically wired to the next priority.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik ded6a877a3 net/mlx5: DR, Support EMD tag in modify header for STEv1
Add support for EMD tag in modify header set/copy actions
on device that supports STEv1.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik 7ea9b39852 net/mlx5: DR, Added support for INSERT_HEADER reformat type
Add support for INSERT_HEADER packet reformat context type

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik 3f3f05ab88 net/mlx5: Added new parameters to reformat context
Adding the new reformat context type (INSERT_HEADER) requires adding two new
parameters to the reformat context - reformat_param_0 and reformat_param_1.
As defined by the HW spec, these parameters have different meanings for
different reformat context types.

The first parameter (reformat_param_0) is not new to the HW spec, but it
wasn't used by any of the supported reformats. The second parameter
(reformat_param_1) is new to the HW spec - it was added to allow
supporting INSERT_HEADER.

For INSERT_HEADER, reformat_param_0 indicates the header used to
reference the location of the inserted header, and reformat_param_1
indicates the offset of the inserted header from the reference point
defined by reformat_param_0.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik d7418b4efa net/mlx5: DR, Allow encap action for RX for supporting devices
Encap actions on the RX flow were not supported on older devices.
However, this is no longer the case in devices that support STEv1.
This patch adds support for encap l3/l2 on the RX flow for supported
devices: update the actions state machine by adding the newly supported
transitions and add the required support in the STEv0/1 files.
The new transitions that are supported are:
 - from decap/modify-header/pop-vlan to encap
 - from encap to termination table

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik 28de41a4ba net/mlx5: DR, Split reformat state to Encap and Decap
Split single reformat state into two separate states for encap and decap.
This will allow adding actions to the specific domain, such as encap on RX.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik 67133eaa93 net/mlx5: mlx5_ifc support for header insert/remove
Add support for HCA caps 2 that contains capabilities for the new
insert/remove header actions.

Added the required definitions for supporting the new reformat type:
added packet reformat parameters, reformat anchors and definitions
to allow copy/set into the inserted EMD (Embedded MetaData) tag.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:06 -07:00
Aya Levin 54e1217b90 net/mlx5e: Block offload of outer header csum for GRE tunnel
The device is able to offload either the outer header csum or the inner
header csum. The driver utilizes the inner csum offload. So, prohibit
setting of tx-gre-csum-segmentation and let it be: off [fixed].
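In netdev feature-flag terms (a sketch, not the exact hunk),
tx-gre-csum-segmentation corresponds to NETIF_F_GSO_GRE_CSUM, so the driver
roughly does:

  /* sketch: keep the flag out of hw_features so ethtool reports
   * tx-gre-csum-segmentation: off [fixed]
   */
  netdev->features    &= ~NETIF_F_GSO_GRE_CSUM;
  netdev->hw_features &= ~NETIF_F_GSO_GRE_CSUM;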

Fixes: 2729984149 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:06 -07:00
Aya Levin 6d6727dddc net/mlx5e: Block offload of outer header csum for UDP tunnels
The device is able to offload either the outer header csum or the inner
header csum. The driver utilizes the inner csum offload. Hence, block
setting of tx-udp_tnl-csum-segmentation and set it to off [fixed].

Fixes: b49663c8fb ("net/mlx5e: Add support for UDP tunnel segmentation with outer checksum offload")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:06 -07:00
Shay Drory 7a545077cb Revert "net/mlx5: Arm only EQs with EQEs"
In the scenario described below, an EQ can remain in the FIRED state, which
can result in a missed interrupt.

The scenario:

device                       mlx5_core driver
------                       ----------------
EQ1.eqe generated
EQ1.MSI-X sent
EQ1.state = FIRED
EQ2.eqe generated
                             mlx5_irq()
                               polls - eq1_eqes()
                               arm eq1
                               polls - eq2_eqes()
                               arm eq2
EQ2.MSI-X sent
EQ2.state = FIRED
                              mlx5_irq()
                              polls - eq2_eqes() -- no eqes found
                              driver skips EQ arming;

->EQ2 remains fired, misses generating interrupt.

Hence, always arm the EQ by reverting the commit cited in the Fixes tag.

Fixes: d894892dda ("net/mlx5: Arm only EQs with EQEs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:05 -07:00
Aya Levin a6ee6f5f10 net/mlx5e: Fix select queue to consider SKBTX_HW_TSTAMP
Steering packets to the PTP-SQ should be done only if the SKB has
SKBTX_HW_TSTAMP set in its tx_flags. While here, move the function into
a header and inline it.
Mark the whole condition that selects the PTP-SQ as unlikely().
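A minimal sketch of the check (the helper name is illustrative, and the
surrounding PTP-state checks are omitted):

  #include <linux/skbuff.h>

  /* sketch: steer to the PTP-SQ only for HW-timestamp-requesting skbs */
  static inline bool mlx5e_use_ptpsq(struct sk_buff *skb)
  {
          return unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP);
  }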

Fixes: 24c22dd091 ("net/mlx5e: Add states to PTP channel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:05 -07:00
Aya Levin 9ae8c18c5e net/mlx5e: Don't update netdev RQs with PTP-RQ
Since the driver opens the PTP-RQ under channel 0, it appears to the
stack as if the SKB was received on rxq0. So from the stack's POV there
is still the same number of RX queues.

Fixes: 960fbfe222 ("net/mlx5e: Allow coexistence of CQE compression and HW TS PTP")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:05 -07:00
Chris Mi 11f5ac3e05 net/mlx5e: Verify dev is present in get devlink port ndo
When changing the eswitch mode, the netdev is detached from the
hardware resources. So verify the dev is present in the get devlink
port ndo. Otherwise, we will hit the following panic:

[241535.973539] RIP: 0010:__devlink_port_phys_port_name_get+0x13/0x1b0
[241535.976471] RSP: 0018:ffff9eaf0ae1b7c8 EFLAGS: 00010292
[241535.977471] RAX: 000000000002d370 RBX: 000000000002d370 RCX: 0000000000000000
[241535.978479] RDX: 0000000000000010 RSI: ffff9eaf0ae1b858 RDI: 000000000002d370
[241535.979482] RBP: ffff9eaf0ae1b7e0 R08: 000000000000002a R09: ffff8888d54d13da
[241535.980486] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8888e6700000
[241535.981491] R13: ffff9eaf0ae1b858 R14: 0000000000000010 R15: 0000000000000000
[241535.982489] FS:  00007fd374ef3740(0000) GS:ffff88909ea00000(0000) knlGS:0000000000000000
[241535.983494] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[241535.984487] CR2: 000000000002d444 CR3: 000000089fd26006 CR4: 00000000003706e0
[241535.985502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[241535.986499] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[241535.987477] Call Trace:
[241535.988426]  ? nla_put_64bit+0x71/0xa0
[241535.989368]  devlink_compat_phys_port_name_get+0x50/0xa0
[241535.990312]  dev_get_phys_port_name+0x4b/0x60
[241535.991252]  rtnl_fill_ifinfo+0x57b/0xcb0
[241535.992192]  rtnl_dump_ifinfo+0x58f/0x6d0
[241535.993123]  ? ksize+0x14/0x20
[241535.994033]  ? __alloc_skb+0xe8/0x250
[241535.994935]  netlink_dump+0x17c/0x300
[241535.995821]  netlink_recvmsg+0x1de/0x2c0
[241535.996677]  sock_recvmsg+0x70/0x80
[241535.997518]  ____sys_recvmsg+0x9b/0x1b0
[241535.998360]  ? iovec_from_user+0x82/0x120
[241535.999202]  ? __import_iovec+0x2c/0x130
[241536.000031]  ___sys_recvmsg+0x94/0x130
[241536.000850]  ? __handle_mm_fault+0x56d/0x6e0
[241536.001668]  __sys_recvmsg+0x5f/0xb0
[241536.002464]  ? syscall_enter_from_user_mode+0x2b/0x80
[241536.003242]  __x64_sys_recvmsg+0x1f/0x30
[241536.004008]  do_syscall_64+0x38/0x50
[241536.004767]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[241536.005532] RIP: 0033:0x7fd375014f47
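A sketch of the kind of guard meant here (mirroring the earlier "verify dev
is present" fix; the exact placement is an assumption):

  /* in the .ndo_get_devlink_port handler */
  if (!netif_device_present(dev))
          return NULL;    /* netdev detached from HW, nothing to report */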

Fixes: 2ff349c5ed ("net/mlx5e: Verify dev is present in some ndos")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:04 -07:00
Maor Gottlieb 4aaf96ac8b net/mlx5: DR, Don't use SW steering when RoCE is not supported
SW steering uses an RC QP to write/read to/from ICM, hence it is not
supported when RoCE is not supported either.

Fixes: 70605ea545 ("net/mlx5: DR, Expose APIs for direct rule managing")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:04 -07:00
Maor Gottlieb c189716b2a net/mlx5: Consider RoCE cap before init RDMA resources
Check if RoCE is supported by the device before enabling it in
the vport context and creating all the RDMA steering objects.
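A sketch of the cap check using the standard mlx5 cap accessor (placement is
an assumption):

  /* sketch: skip RoCE enablement and RDMA steering setup when unsupported */
  if (!MLX5_CAP_GEN(dev, roce))
          return 0;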

Fixes: 80f09dfc23 ("net/mlx5: Eswitch, enable RoCE loopback traffic")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:04 -07:00
Dima Chumak a3e5fd9314 net/mlx5e: Fix page reclaim for dead peer hairpin
When adding a hairpin flow, a firmware-side send queue is created for
the peer net device, which claims some host memory pages for its
internal ring buffer. If the peer net device is removed/unbound before
the hairpin flow is deleted, then the send queue is not destroyed which
leads to a stack trace on pci device remove:

[ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
[ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110
[ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0
[ 748.002171] ------------[ cut here ]------------
[ 748.001177] FW pages counter is 4 after reclaiming all pages
[ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]
[  +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core]
[ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1
[ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]
[ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9
[ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286
[ 748.001149] RAX: 0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000
[ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51
[ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8
[ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30
[ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000
[ 748.001429] FS:  00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000
[ 748.001695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0
[ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 748.001654] Call Trace:
[ 748.000576]  ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core]
[ 748.001416]  ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core]
[ 748.001354]  ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core]
[ 748.001203]  mlx5_function_teardown+0x30/0x60 [mlx5_core]
[ 748.001275]  mlx5_uninit_one+0xa7/0xc0 [mlx5_core]
[ 748.001200]  remove_one+0x5f/0xc0 [mlx5_core]
[ 748.001075]  pci_device_remove+0x9f/0x1d0
[ 748.000833]  device_release_driver_internal+0x1e0/0x490
[ 748.001207]  unbind_store+0x19f/0x200
[ 748.000942]  ? sysfs_file_ops+0x170/0x170
[ 748.001000]  kernfs_fop_write_iter+0x2bc/0x450
[ 748.000970]  new_sync_write+0x373/0x610
[ 748.001124]  ? new_sync_read+0x600/0x600
[ 748.001057]  ? lock_acquire+0x4d6/0x700
[ 748.000908]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[ 748.001126]  ? fd_install+0x1c9/0x4d0
[ 748.000951]  vfs_write+0x4d0/0x800
[ 748.000804]  ksys_write+0xf9/0x1d0
[ 748.000868]  ? __x64_sys_read+0xb0/0xb0
[ 748.000811]  ? filp_open+0x50/0x50
[ 748.000919]  ? syscall_enter_from_user_mode+0x1d/0x50
[ 748.001223]  do_syscall_64+0x3f/0x80
[ 748.000892]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 748.001026] RIP: 0033:0x7f58bcfb22f7
[ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7
[ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003
[ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001
[ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d
[ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700
[ 748.001564] irq event stamp: 0
[ 748.000787] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0
[ 748.001854] softirqs last  enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0
[ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 748.001492] ---[ end trace a6fabd773d1c51ae ]---

Fix by destroying the send queue of a hairpin peer net device that is
being removed/unbound, which returns the allocated ring buffer pages to
the host.

Fixes: 4d8fcf216c ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:03 -07:00
Huy Nguyen 8ad893e516 net/mlx5e: Remove dependency in IPsec initialization flows
Currently, the IPsec feature is disabled because mlx5e_build_nic_netdev
is required to be called after mlx5e_ipsec_init. This requirement is
invalid as mlx5e_build_nic_netdev and mlx5e_ipsec_init initialize
independent resources.

Remove the ipsec pointer check in mlx5e_build_nic_netdev so that the
two functions can be called in any order.

Fixes: 547eede070 ("net/mlx5e: IPSec, Innova IPSec offload infrastructure")
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:03 -07:00
Vlad Buslov fb1a3132ee net/mlx5e: Fix use-after-free of encap entry in neigh update handler
Function mlx5e_rep_neigh_update() wasn't updated to accommodate rtnl lock
removal from TC filter update path and properly handle concurrent encap
entry insertion/deletion which can lead to following use-after-free:

 [23827.464923] ==================================================================
 [23827.469446] BUG: KASAN: use-after-free in mlx5e_encap_take+0x72/0x140 [mlx5_core]
 [23827.470971] Read of size 4 at addr ffff8881d132228c by task kworker/u20:6/21635
 [23827.472251]
 [23827.472615] CPU: 9 PID: 21635 Comm: kworker/u20:6 Not tainted 5.13.0-rc3+ #5
 [23827.473788] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [23827.475639] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
 [23827.476731] Call Trace:
 [23827.477260]  dump_stack+0xbb/0x107
 [23827.477906]  print_address_description.constprop.0+0x18/0x140
 [23827.478896]  ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
 [23827.479879]  ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
 [23827.480905]  kasan_report.cold+0x7c/0xd8
 [23827.481701]  ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
 [23827.482744]  kasan_check_range+0x145/0x1a0
 [23827.493112]  mlx5e_encap_take+0x72/0x140 [mlx5_core]
 [23827.494054]  ? mlx5e_tc_tun_encap_info_equal_generic+0x140/0x140 [mlx5_core]
 [23827.495296]  mlx5e_rep_neigh_update+0x41e/0x5e0 [mlx5_core]
 [23827.496338]  ? mlx5e_rep_neigh_entry_release+0xb80/0xb80 [mlx5_core]
 [23827.497486]  ? read_word_at_a_time+0xe/0x20
 [23827.498250]  ? strscpy+0xa0/0x2a0
 [23827.498889]  process_one_work+0x8ac/0x14e0
 [23827.499638]  ? lockdep_hardirqs_on_prepare+0x400/0x400
 [23827.500537]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
 [23827.501359]  ? rwlock_bug.part.0+0x90/0x90
 [23827.502116]  worker_thread+0x53b/0x1220
 [23827.502831]  ? process_one_work+0x14e0/0x14e0
 [23827.503627]  kthread+0x328/0x3f0
 [23827.504254]  ? _raw_spin_unlock_irq+0x24/0x40
 [23827.505065]  ? __kthread_bind_mask+0x90/0x90
 [23827.505912]  ret_from_fork+0x1f/0x30
 [23827.506621]
 [23827.506987] Allocated by task 28248:
 [23827.507694]  kasan_save_stack+0x1b/0x40
 [23827.508476]  __kasan_kmalloc+0x7c/0x90
 [23827.509197]  mlx5e_attach_encap+0xde1/0x1d40 [mlx5_core]
 [23827.510194]  mlx5e_tc_add_fdb_flow+0x397/0xc40 [mlx5_core]
 [23827.511218]  __mlx5e_add_fdb_flow+0x519/0xb30 [mlx5_core]
 [23827.512234]  mlx5e_configure_flower+0x191c/0x4870 [mlx5_core]
 [23827.513298]  tc_setup_cb_add+0x1d5/0x420
 [23827.514023]  fl_hw_replace_filter+0x382/0x6a0 [cls_flower]
 [23827.514975]  fl_change+0x2ceb/0x4a51 [cls_flower]
 [23827.515821]  tc_new_tfilter+0x89a/0x2070
 [23827.516548]  rtnetlink_rcv_msg+0x644/0x8c0
 [23827.517300]  netlink_rcv_skb+0x11d/0x340
 [23827.518021]  netlink_unicast+0x42b/0x700
 [23827.518742]  netlink_sendmsg+0x743/0xc20
 [23827.519467]  sock_sendmsg+0xb2/0xe0
 [23827.520131]  ____sys_sendmsg+0x590/0x770
 [23827.520851]  ___sys_sendmsg+0xd8/0x160
 [23827.521552]  __sys_sendmsg+0xb7/0x140
 [23827.522238]  do_syscall_64+0x3a/0x70
 [23827.522907]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [23827.523797]
 [23827.524163] Freed by task 25948:
 [23827.524780]  kasan_save_stack+0x1b/0x40
 [23827.525488]  kasan_set_track+0x1c/0x30
 [23827.526187]  kasan_set_free_info+0x20/0x30
 [23827.526968]  __kasan_slab_free+0xed/0x130
 [23827.527709]  slab_free_freelist_hook+0xcf/0x1d0
 [23827.528528]  kmem_cache_free_bulk+0x33a/0x6e0
 [23827.529317]  kfree_rcu_work+0x55f/0xb70
 [23827.530024]  process_one_work+0x8ac/0x14e0
 [23827.530770]  worker_thread+0x53b/0x1220
 [23827.531480]  kthread+0x328/0x3f0
 [23827.532114]  ret_from_fork+0x1f/0x30
 [23827.532785]
 [23827.533147] Last potentially related work creation:
 [23827.534007]  kasan_save_stack+0x1b/0x40
 [23827.534710]  kasan_record_aux_stack+0xab/0xc0
 [23827.535492]  kvfree_call_rcu+0x31/0x7b0
 [23827.536206]  mlx5e_tc_del_fdb_flow+0x577/0xef0 [mlx5_core]
 [23827.537305]  mlx5e_flow_put+0x49/0x80 [mlx5_core]
 [23827.538290]  mlx5e_delete_flower+0x6d1/0xe60 [mlx5_core]
 [23827.539300]  tc_setup_cb_destroy+0x18e/0x2f0
 [23827.540144]  fl_hw_destroy_filter+0x1d2/0x310 [cls_flower]
 [23827.541148]  __fl_delete+0x4dc/0x660 [cls_flower]
 [23827.541985]  fl_delete+0x97/0x160 [cls_flower]
 [23827.542782]  tc_del_tfilter+0x7ab/0x13d0
 [23827.543503]  rtnetlink_rcv_msg+0x644/0x8c0
 [23827.544257]  netlink_rcv_skb+0x11d/0x340
 [23827.544981]  netlink_unicast+0x42b/0x700
 [23827.545700]  netlink_sendmsg+0x743/0xc20
 [23827.546424]  sock_sendmsg+0xb2/0xe0
 [23827.547084]  ____sys_sendmsg+0x590/0x770
 [23827.547850]  ___sys_sendmsg+0xd8/0x160
 [23827.548606]  __sys_sendmsg+0xb7/0x140
 [23827.549303]  do_syscall_64+0x3a/0x70
 [23827.549969]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [23827.550853]
 [23827.551217] The buggy address belongs to the object at ffff8881d1322200
 [23827.551217]  which belongs to the cache kmalloc-256 of size 256
 [23827.553341] The buggy address is located 140 bytes inside of
 [23827.553341]  256-byte region [ffff8881d1322200, ffff8881d1322300)
 [23827.555747] The buggy address belongs to the page:
 [23827.556847] page:00000000898762aa refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1d1320
 [23827.558651] head:00000000898762aa order:2 compound_mapcount:0 compound_pincount:0
 [23827.559961] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
 [23827.561243] raw: 002ffff800010200 dead000000000100 dead000000000122 ffff888100042b40
 [23827.562653] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
 [23827.564112] page dumped because: kasan: bad access detected
 [23827.565439]
 [23827.565932] Memory state around the buggy address:
 [23827.566917]  ffff8881d1322180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 [23827.568485]  ffff8881d1322200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 [23827.569818] >ffff8881d1322280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 [23827.571143]                       ^
 [23827.571879]  ffff8881d1322300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 [23827.573283]  ffff8881d1322380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 [23827.574654] ==================================================================

Most of the necessary logic is already correctly implemented by
mlx5e_get_next_valid_encap() helper that is used in neigh stats update
handler. Make the handler generic by renaming it to
mlx5e_get_next_matching_encap() and use callback to test whether flow is
matching instead of hardcoded check for 'valid' flag value. Implement
mlx5e_get_next_valid_encap() by calling mlx5e_get_next_matching_encap()
with callback that tests encap MLX5_ENCAP_ENTRY_VALID flag. Implement new
mlx5e_get_next_init_encap() helper by calling
mlx5e_get_next_matching_encap() with callback that tests encap completion
result to be non-error and use it in mlx5e_rep_neigh_update() to safely
iterate over nhe->encap_list.

Remove encap completion logic from mlx5e_rep_update_flows() since the encap
entries passed to this function are already guaranteed to be properly
initialized by similar code in mlx5e_get_next_init_encap().

Fixes: 2a1f1768fa ("net/mlx5e: Refactor neigh update for concurrent execution")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:03 -07:00
Yang Li 2bf8d2ae34 net/mlx5e: Fix an error code in mlx5e_arfs_create_tables()
When the code executes 'if (!priv->fs.arfs->wq)', the value of err is 0.
So, use -ENOMEM to indicate that create_singlethread_workqueue()
returned NULL.

Clean up smatch warning:
drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c:373
mlx5e_arfs_create_tables() warn: missing error code 'err'.
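
As an illustration of the fix (struct, label and workqueue names here are
simplified stand-ins, not the exact driver code), the error path should
look roughly like:

  #include <linux/errno.h>
  #include <linux/workqueue.h>

  static int arfs_create_tables_sketch(struct workqueue_struct **wq)
  {
          int err;

          *wq = create_singlethread_workqueue("mlx5e_arfs");
          if (!*wq) {
                  err = -ENOMEM;  /* previously 'err' was left at 0 here */
                  goto err_out;
          }
          return 0;

  err_out:
          return err;
  }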

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: f6755b80d6 ("net/mlx5e: Dynamic alloc arfs table for netdev when needed")
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:02 -07:00
Colin Ian King f3b5a89075 mlxsw: thermal: Fix null dereference of NULL temperature parameter
The call to mlxsw_thermal_module_temp_and_thresholds_get passes a NULL
pointer for the temperature, and this can be dereferenced in this function
if the mlxsw_reg_query call fails. The simplest fix is to pass the
address of a dummy temperature variable instead of a NULL pointer.

Addresses-Coverity: ("Explicit null dereferenced")
Fixes: 72a64c2fe9 ("mlxsw: thermal: Read module temperature thresholds using MTMP register")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09 15:39:14 -07:00
Mykola Kostenok 72a64c2fe9 mlxsw: thermal: Read module temperature thresholds using MTMP register
mlxsw_thermal_module_trips_update() is used to update the trip points of
the module's thermal zone. Currently, this is done by querying the
thresholds from the module's EEPROM via the MCIA register. This data does
not pass validation and in some cases can be unreliable, for example, due
to a problem with the transceiver module.

Previous patch made it possible to read module's temperature and
thresholds via MTMP register. Therefore, extend
mlxsw_thermal_module_trips_update() to use the thresholds queried from
MTMP, if valid.

This is both more reliable and more efficient than the current method, as
temperature and thresholds are queried in one transaction instead of
three. This is significant when working over a slow bus such as I2C.

Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com>
Acked-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:39:07 -07:00
Mykola Kostenok e57977b34a mlxsw: thermal: Add function for reading module temperature and thresholds
Provide new function mlxsw_thermal_module_temp_and_thresholds_get() for
reading temperature and temperature thresholds by a single operation.
The motivation is to reduce the number of transactions with the device
which is important when operating over a slow bus such as I2C.

Currently, the sole caller of the function is only using it to read the
module's temperature. The next patch will also use it to query the
module's temperature thresholds.

Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com>
Acked-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:39:07 -07:00
Mykola Kostenok befc204808 mlxsw: core_env: Read module temperature thresholds using MTMP register
Currently, module temperature thresholds are obtained from the Management
Cable Info Access (MCIA) register by specifying the threshold offsets
within the module EEPROM layout. This data does not pass validation and in
some cases can be unreliable, for example, due to a problem with the
module.

Add support for a new feature provided by Management Temperature (MTMP)
register for sanitization of temperature thresholds values.

Extend mlxsw_env_module_temp_thresholds_get() to get the temperature
thresholds through the MTMP field 'max_operational_temperature' - if it is
not zero, the feature is supported. Otherwise, fall back to the old method
and get the thresholds through MCIA.

Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com>
Acked-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:39:07 -07:00
Mykola Kostenok 314dbb19f9 mlxsw: reg: Extend MTMP register with new threshold field
Extend Management Temperature (MTMP) register with new field specifying
the maximum temperature threshold.

Extend mlxsw_reg_mtmp_unpack() function with two extra arguments,
providing high and maximum temperature thresholds. For modules, these
thresholds correspond to critical and emergency thresholds that are read
from the module's EEPROM.

Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com>
Acked-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:39:07 -07:00
Amit Cohen a08a61934c mlxsw: spectrum_router: Remove abort mechanism
The abort mechanism was introduced in commit 8e05fd7166 ("fib: hook
IPv4 fib for hardware offload") with the purpose of falling back to
software-based routing in case of a route programming error in hardware.
The process is irreversible and requires users to reload the offloading
driver or reboot the machine.

While this approach might make sense in theory, it makes very little
sense in practice. In the case of high speed ASICs such as the Spectrum
ASIC, the abort mechanism effectively kills the machine upon a non-fatal
error such as a route programming error.

Such an extreme policy does not belong in the kernel, especially when
user space can simply try to reprogram the route following the
RTM_NEWROUTE failure notification.

Therefore, remove the abort mechanism.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:39:06 -07:00
Matteo Croce c420c98982 skbuff: add a parameter to __skb_frag_unref
This is a prerequisite patch, the next one is enabling recycling of
skbs and fragments. Add an extra argument on __skb_frag_unref() to
handle recycling, and update the current users of the function with that.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 14:11:47 -07:00
Mykola Kostenok 2fd8d84ce3 mlxsw: core: Set thermal zone polling delay argument to real value at init
The thermal polling delay argument for module and gearbox thermal zones
used to be initialized with a zero value, while the actual delay was set
by mlxsw_thermal_set_mode() through the thermal operation callback
set_mode(). After the set_mode()/get_mode() operations were removed by
the cited commits, module and gearbox thermal zones always have their
polling time set to zero and do not perform temperature monitoring.

Set a non-zero "polling_delay" in the thermal_zone_device_register()
routine, so that the relevant thermal zones will perform thermal
monitoring.

Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Fixes: 5d7bd8aa7c ("thermal: Simplify or eliminate unnecessary set_mode() methods")
Fixes: 1ee14820fd ("thermal: remove get_mode() operation of drivers")
Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com>
Acked-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 13:11:41 -07:00
Petr Machata d566ed04e4 mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
In mlxsw Qdisc offload, find_class() is an operation that yields a qdisc
offload descriptor given a parent qdisc descriptor and a class handle. In
__mlxsw_sp_qdisc_ets_graft(), however, a band number is passed to that
function instead of a handle. This can trigger a WARN_ON with the
following splat:

 WARNING: CPU: 3 PID: 808 at drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c:1356 __mlxsw_sp_qdisc_ets_graft+0x115/0x130 [mlxsw_spectrum]
 [...]
 Call Trace:
  mlxsw_sp_setup_tc_prio+0xe3/0x100 [mlxsw_spectrum]
  qdisc_offload_graft_helper+0x35/0xa0
  prio_graft+0x176/0x290 [sch_prio]
  qdisc_graft+0xb3/0x540
  tc_modify_qdisc+0x56a/0x8a0
  rtnetlink_rcv_msg+0x12c/0x370
  netlink_rcv_skb+0x49/0xf0
  netlink_unicast+0x1f6/0x2b0
  netlink_sendmsg+0x1fb/0x410
  ____sys_sendmsg+0x1f3/0x220
  ___sys_sendmsg+0x70/0xb0
  __sys_sendmsg+0x54/0xa0
  do_syscall_64+0x3a/0x70
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Since the parent handle is not passed with the offload information, compute
it from the band number and qdisc handle.
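
A minimal sketch of that computation, assuming the usual convention that
class minors are 1-based band numbers (names are illustrative):

  #include <linux/pkt_sched.h>
  #include <linux/types.h>

  /* Build the class handle find_class() expects from the qdisc handle and
   * the 0-based band number, instead of passing the band number directly.
   */
  static u32 band_to_class_handle(u32 qdisc_handle, unsigned int band)
  {
          return TC_H_MAKE(TC_H_MAJ(qdisc_handle), band + 1);
  }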

Fixes: 28052e618b04 ("mlxsw: spectrum_qdisc: Track children per qdisc")
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 13:11:41 -07:00
Petr Machata 306b9228c0 mlxsw: reg: Spectrum-3: Enforce lowest max-shaper burst size of 11
A max-shaper is the HW component responsible for delaying egress traffic
above a configured transmission rate. Burst size is the amount of traffic
that is allowed to pass without accounting. The burst size value needs to
be such that it can be expressed as 2^BS * 512 bits, where BS lies in a
certain ASIC-dependent range. mlxsw enforces that this holds before
attempting to configure the shaper.

The assumption for Spectrum-3 was that the lower limit of BS would be 5,
like for Spectrum-1. But as of now, the limit is still 11. Therefore fix
the driver accordingly, so that incorrect values are rejected early with a
proper message.
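
As a hedged illustration of the constraint (the register plumbing is
omitted and the BS range is passed in as parameters): a burst size is
programmable only if it equals 2^BS * 512 bits for a BS in the allowed
range, so with a lowest BS of 11 the smallest valid burst is
2^11 * 512 bits = 128 KiB.

  #include <linux/log2.h>
  #include <linux/types.h>

  static bool max_shaper_bs_valid(u32 burst_bytes, u8 lowest_bs, u8 highest_bs)
  {
          u64 bits = (u64)burst_bytes * 8;
          u64 chunks;
          u8 bs;

          if (bits < 512 || (bits & 511))    /* not a multiple of 512 bits */
                  return false;
          chunks = bits >> 9;                /* bits / 512 */
          if (!is_power_of_2(chunks))        /* must be exactly 2^BS */
                  return false;
          bs = ilog2(chunks);
          return bs >= lowest_bs && bs <= highest_bs;
  }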

Fixes: 23effa2479 ("mlxsw: reg: Add max_shaper_bs to QoS ETS Element Configuration")
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 13:11:41 -07:00
David S. Miller 126285651b Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Bug fixes overlapping feature additions and refactoring, mostly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 13:01:52 -07:00
Vladyslav Tarasiuk f68406ca3b net/mlx5e: Remove unreachable code in mlx5e_xmit()
After some commits, mlx5e_txwqe_build_eseg() lost its ability to return
boolean value and became effectively void.

Change its return type to void and remove unreachable branches.

Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:21 -07:00
Alaa Hleihel 39e8cc6d75 net/mlx5e: Disable TLS device offload in kdump mode
In a kdump environment we want to use the smallest possible amount
of resources, which includes setting the SQ size to the minimum.
However, when running on a device that supports TLS device offload,
the SQ stop room becomes larger than with a non-capable device and
requires increasing the SQ size.

Since TLS device offload is not necessary in kdump mode, disable it to
reduce the memory requirements for capable devices.

With this change, the needed SQ stop room size drops by 33.

Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:20 -07:00
Alaa Hleihel 040ee6172e net/mlx5e: Disable TX MPWQE in kdump mode
In a kdump environment we want to use the smallest possible amount
of resources, which includes setting the SQ size to the minimum.
However, when running on a device that supports TX MPWQE, the SQ stop
room becomes larger than with a non-capable device and requires
increasing the SQ size.

Since TX MPWQE offload is not necessary in kdump mode, disable it to
reduce the memory requirements for capable devices.

With this change, the needed SQ stop room size drops by 31.

Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:20 -07:00
Tariq Toukan 8ec5d438a3 net/mlx5e: RX, Re-place page pool numa node change logic
Move the logic that updates the page pool upon changes in the NUMA node.
Before this patch, the logic was placed in the RX polling function and was
called even when there was no RX traffic, wasting CPU cycles. Here we move
it to the RX post_wqes function, to be called only when new RX descriptors
are going to be allocated.
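
A rough sketch of the relocated check (the surrounding mlx5e structures
are omitted); page_pool_nid_changed() only touches the pool when the node
actually changed:

  #include <linux/topology.h>
  #include <net/page_pool.h>

  /* Called from the RX post_wqes path, i.e. only when new RX buffers are
   * about to be allocated, rather than on every NAPI poll.
   */
  static void rq_update_page_pool_nid(struct page_pool *pool)
  {
          if (pool)
                  page_pool_nid_changed(pool, numa_mem_id());
  }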

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:19 -07:00
Lama Kayal 771a563ea0 net/mlx5e: Zero-init DIM structures
Initialize structs to avoid unexpected behavior.

There is no immediate issue in the current code; the structs are return
values, but it is safer to initialize them.
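
A minimal illustration of the pattern (the moderation values are made up):
an empty designated initializer guarantees that any field the function
does not set explicitly is zero rather than stack garbage.

  #include <linux/dim.h>

  static struct dim_cq_moder rx_moderation_sketch(void)
  {
          struct dim_cq_moder moder = {};   /* zero-init the returned struct */

          moder.usec = 8;
          moder.pkts = 64;
          return moder;
  }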

Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:19 -07:00
Meir Lichtinger ab57a912be net/mlx5e: IPoIB, Add support for NDR speed
Add NDR IB PTYS coding and NDR speed 100GHz.

Fixes: 235b6ac306 ("RDMA/ipoib: Add 50Gb and 100Gb link speeds to ethtool")
Signed-off-by: Meir Lichtinger <meirl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:18 -07:00
Shaokun Zhang c4cf987ebe net/mlx5e: Remove the repeated declaration
Function 'mlx5e_deactivate_rq' is declared twice, so remove the
repeated declaration.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:18 -07:00
Dan Carpenter b74fc1ca6a net/mlx5: check for allocation failure in mlx5_ft_pool_init()
Add a check for kzalloc() failure.
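
In generic form (the real pool type is stood in by a dummy struct), the
added check looks like:

  #include <linux/err.h>
  #include <linux/slab.h>

  struct ft_pool_sketch { int dummy; };     /* stand-in for the real pool type */

  static struct ft_pool_sketch *ft_pool_init_sketch(void)
  {
          struct ft_pool_sketch *pool;

          pool = kzalloc(sizeof(*pool), GFP_KERNEL);
          if (!pool)                        /* the check this patch adds */
                  return ERR_PTR(-ENOMEM);
          return pool;
  }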

Fixes: 4a98544d18 ("net/mlx5: Move chains ft pool to be used by all firmware steering")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:17 -07:00
Jiapeng Chong e6dfa4a54a net/mlx5: Fix duplicate included vhca_event.h
Clean up the following includecheck warning:

./drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c: vhca_event.h is
included more than once.

No functional change.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:17 -07:00
Jakub Kicinski 490dcecabb mlx5: count all link events
mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE
events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This
breaks link flap detection based on Linux carrier state transition
count as netif_carrier_on() does nothing if carrier is already on.
Make sure we count such events.

netif_carrier_event() increments the counters and fires the linkwatch
events. The latter is not necessary for the use case but seems like
the right thing to do.
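
A hedged sketch of the idea (the driver plumbing around the port-change
event is omitted): when an ACTIVE event arrives while the carrier is
already up, bump the carrier counters via netif_carrier_event() instead of
doing nothing.

  #include <linux/netdevice.h>

  static void handle_port_active_sketch(struct net_device *netdev)
  {
          if (netif_carrier_ok(netdev))
                  netif_carrier_event(netdev);    /* count the repeated UP event */
          else
                  netif_carrier_on(netdev);
  }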

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-03 13:10:17 -07:00
Shay Drory 404e5a1269 RDMA/mlx4: Do not map the core_clock page to user space unless enabled
Currently, when mlx4 maps the hca_core_clock page to user space, there
are read-modifiable registers, one of which is a semaphore, on this page
as well as the clock counter. If a user reads from the wrong offset, it
can modify the semaphore and hang the device.

Do not map the hca_core_clock page to the user space unless the device has
been put in a backwards compatibility mode to support this feature.

After this patch, mlx4 core_clock won't be mapped to user space on the
majority of existing devices and the uverbs device time feature in
ibv_query_rt_values_ex() will be disabled.

Fixes: 52033cfb5a ("IB/mlx4: Add mmap call to map the hardware clock")
Link: https://lore.kernel.org/r/9632304e0d6790af84b3b706d8c18732bc0d5e27.1622726305.git.leonro@nvidia.com
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-03 14:19:53 -03:00
Yevgeny Kliteynik 216214c64a net/mlx5: DR, Create multi-destination flow table with level less than 64
A flow table that contains a flow pointing to multiple flow tables or
multiple TIRs must have a level lower than 64. In our case this applies to
the multi-destination flow table.
Fix the level of the created table to comply with the HW spec definitions,
and still make sure that its level is lower than that of the SW-owned
tables, so that it is possible to point from the multi-destination FW
table to SW tables.

Fixes: 34583beea4 ("net/mlx5: DR, Create multi-destination table for SW-steering use")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:21 -07:00
Aya Levin 5349cbba75 net/mlx5e: Fix conflict with HW TS and CQE compression
When a driver's profile doesn't support a dedicated PTP-RQ,
configuration of CQE compression while HW TS is configured should fail.

Fixes: 885b8cfb16 ("net/mlx5e: Update ethtool setting of CQE compression")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:21 -07:00
Aya Levin 256f79d13c net/mlx5e: Fix HW TS with CQE compression according to profile
When the driver's profile doesn't support a dedicated PTP-RQ, the PTP
accuracy of HW TS is affected by the CQE compression. In this case,
turn off CQE compression. Otherwise, the driver crashes:

BUG: kernel NULL pointer dereference, address:0000000000000018
...
...
RIP: 0010:mlx5e_ptp_rx_set_fs+0x25/0x1a0 [mlx5_core]
...
...
Call Trace:
 mlx5e_ptp_activate_channel+0xb2/0xf0 [mlx5_core]
 mlx5e_activate_priv_channels+0x3b9/0x8c0 [mlx5_core]
 ? __mutex_unlock_slowpath+0x45/0x2a0
 ? mlx5e_refresh_tirs+0x151/0x1e0 [mlx5_core]
 mlx5e_switch_priv_channels+0x1cd/0x2d0 [mlx5_core]
 ? mlx5e_xdp_allowed+0x150/0x150 [mlx5_core]
 mlx5e_safe_switch_params+0x118/0x3c0 [mlx5_core]
 ? __mutex_lock+0x6e/0x8e0
 ? mlx5e_hwstamp_set+0xa9/0x300 [mlx5_core]
 mlx5e_hwstamp_set+0x194/0x300 [mlx5_core]
 ? dev_ioctl+0x9b/0x3d0
 mlx5i_ioctl+0x37/0x60 [mlx5_core]
 mlx5i_pkey_ioctl+0x12/0x20 [mlx5_core]
 dev_ioctl+0xa9/0x3d0
 sock_ioctl+0x268/0x420
 __x64_sys_ioctl+0x3d8/0x790
 ? lockdep_hardirqs_on_prepare+0xe4/0x190
 do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 960fbfe222 ("net/mlx5e: Allow coexistence of CQE compression and HW TS PTP")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:21 -07:00
Roi Dayan 2a2c84facd net/mlx5e: Fix adding encap rules to slow path
On some devices the ignore flow level cap is not supported and we
shouldn't use it. Setting the dest ft with mlx5_chains_get_tc_end_ft()
already gives the correct end ft, whether or not the ignore flow level
cap is supported.

Fixes: 39ac237ce0 ("net/mlx5: E-Switch, Refactor chains and priorities")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:20 -07:00
Roi Dayan afe93f71b5 net/mlx5e: Check for needed capability for cvlan matching
If cvlan matching is not supported, show an error and return instead of
trying to offload to the hardware and failing.

Fixes: 699e96ddf4 ("net/mlx5e: Support offloading tc double vlan headers match")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:20 -07:00
Moshe Shemesh 5940e64281 net/mlx5: Check firmware sync reset requested is set before trying to abort it
In case the driver sent a NACK to the firmware on a sync reset request,
it will get a sync reset abort event even though it didn't set the sync
reset requested mode. Thus, in the sync reset abort event handler, the
driver should check that reset requested is set before trying to stop the
sync reset poll.

Fixes: 7dd6df329d ("net/mlx5: Handle sync reset abort event")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:20 -07:00
Roi Dayan b38742e411 net/mlx5e: Disable TLS offload for uplink representor
TLS offload is not supported in switchdev mode.

Fixes: 7a9fb35e8c ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:19 -07:00
Aya Levin d8ec92005f net/mlx5e: Fix incompatible casting
The device supports setting a single FEC mode at a time; enforce this
by requiring bitmap_weight == 1. Input from the FEC command is a u32, so
avoid casting it to unsigned long and use bitmap_from_arr32 to populate
the bitmap safely.
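
A minimal sketch of the safe conversion described above (the error code
choice is illustrative):

  #include <linux/bitmap.h>
  #include <linux/errno.h>

  static int validate_fec_policy(u32 fec_policy)
  {
          DECLARE_BITMAP(fec_modes, 32);

          /* Populate the bitmap from the u32 instead of casting its address */
          bitmap_from_arr32(fec_modes, &fec_policy, 32);

          /* Exactly one FEC mode may be requested at a time */
          if (bitmap_weight(fec_modes, 32) != 1)
                  return -EOPNOTSUPP;
          return 0;
  }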

Fixes: 4bd9d5070b ("net/mlx5e: Enforce setting of a single FEC mode")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01 18:30:19 -07:00
Jakub Kicinski af9207adb6 mlx5-updates-2021-05-26
Misc update for mlx5 driver,
 
 1) Clean up patches for lag and SF
 
 2) Reserve bit 31 in steering register C1 for IPSec offload usage
 
 3) Move steering tables pool logic into the steering core and
   increase the maximum table size to 2G entries when software steering
   is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmCv6vAACgkQSD+KveBX
 +j6qnAgAz0eKWKCsFCqlXGIgF1cg3FrGR5W2Zi5euriHhHwNqnZof3AIMkzcXjLL
 wBlPjWk3YLfBaBNPTziz6EJuGl1vZZxuSdc7bqsNnl0srujRtQFu3JyerdgXEXNL
 W2NxjSTiVwu8lq2qlYauQvcE0v+JrB/LMe9tvq1UQ2v9FtBMMhs9hGUSCro2huwj
 XYF0m0ve89+mYlm6/m0SIUpPVdMiIhm4+coO1wibk7+8jn6+ZT6EJbbZvjc9eQg7
 ZKr8f/TpfmvHToG8LPOc6HqHzRiHlp3Yzsft+xm54r082n4F/noGhL+Hqvvj1aTj
 C6Ip5N7VkzT+erMLMrjIbrmEP94cyQ==
 =torZ
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-05-26

Misc update for mlx5 driver,

1) Clean up patches for lag and SF

2) Reserve bit 31 in steering register C1 for IPSec offload usage

3) Move steering tables pool logic into the steering core and
  increase the maximum table size to 2G entries when software steering
  is enabled.

* tag 'mlx5-updates-2021-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Fix lag port remapping logic
  net/mlx5: Use boolean arithmetic to evaluate roce_lag
  net/mlx5: Remove unnecessary spin lock protection
  net/mlx5: Cap the maximum flow group size to 16M entries
  net/mlx5: DR, Set max table size to 2G entries
  net/mlx5: Move chains ft pool to be used by all firmware steering
  net/mlx5: Move table size calculation to steering cmd layer
  net/mlx5: Add case for FS_FT_NIC_TX FT in MLX5_CAP_FLOWTABLE_TYPE
  net/mlx5: DR, Remove unused field of send_ring struct
  net/mlx5e: RX, Remove unnecessary check in RX CQE compression handling
  net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet
  net/mlx5e: TC: Reserved bit 31 of REG_C1 for IPsec offload
  net/mlx5e: TC: Use bit counts for register mapping
  net/mlx5: CT: Avoid reusing modify header context for natted entries
  net/mlx5e: CT, Remove newline from ct_dbg call
====================

Link: https://lore.kernel.org/r/20210527185624.694304-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 17:14:23 -07:00
Jiri Pirko 7dafcc4c9d mlxsw: core: use PSID string define in devlink info
Instead of having the string spelled out in the driver, use the global
define with the same value.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 14:51:18 -07:00
Jiri Pirko f55c998c27 mlxsw: core: Expose FW version over defined keyword
To be aligned with the rest of the drivers, expose FW version under "fw"
keyword in devlink dev info, in addition to the existing "fw.version",
which is currently Mellanox-specific.

devlink output before:
       running:
         fw.version 30.2008.2018
after:
       running:
         fw.version 30.2008.2018
         fw 30.2008.2018

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 14:51:18 -07:00
Jiri Pirko 2754125ebd net/mlx5: Expose FW version over defined keyword
To be aligned with the rest of the drivers, expose FW version under "fw"
keyword in devlink dev info, in addition to the existing "fw.version",
which is currently Mellanox-specific.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 14:51:17 -07:00
Eli Cohen 8613641063 net/mlx5: Fix lag port remapping logic
Fix the logic so that if both ports' netdevices are enabled or disabled,
the trivial mapping is used without swapping.

If only one of the netdevices has TX enabled, use it to remap traffic to
that port.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:39 -07:00
Eli Cohen 2b14767525 net/mlx5: Use boolean arithmetic to evaluate roce_lag
Avoid mixing boolean and bit arithmetic when evaluating validity of
roce_lag.

Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:39 -07:00
Eli Cohen a546432f2f net/mlx5: Remove unnecessary spin lock protection
Taking lag_lock to access ldev->tracker is meaningless in the context of
do_bond() and mlx5_lag_netdev_event().

Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:39 -07:00
Paul Blakey 71513c05a9 net/mlx5: Cap the maximum flow group size to 16M entries
The maximum number of large flow groups applies to both small and large
tables. For very large tables (such as the 2G SW steering tables) this may
create a small number of flow groups, each with an unrealistically large
entries domain (> 16M).

Set the number of large flow groups to at least what the user requested,
but with a maximum per-group size of 16M entries.
For software steering, if the user requested fewer than 128 large flow
groups, this gives us about 128 groups of 16M entries each in a 2G-entry
table.
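
Illustrative arithmetic only (macro and helper names are made up): with a
16M-entry cap, a 2G-entry table ends up with
max(requested, 2G / 16M) = max(requested, 128) groups.

  #include <linux/kernel.h>

  #define MAX_FLOW_GROUP_SIZE   (16 * 1024 * 1024)   /* 16M entries */

  static u32 calc_num_flow_groups(u32 table_size, u32 requested_groups)
  {
          return max_t(u32, requested_groups,
                       DIV_ROUND_UP(table_size, MAX_FLOW_GROUP_SIZE));
  }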

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:38 -07:00
Paul Blakey 9e11799840 net/mlx5: DR, Set max table size to 2G entries
SW steering has no table size limitations.
However, fs_core API is size aware.

Set SW steering tables to the maximum possible table size (INT_MAX).

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:38 -07:00
Paul Blakey 4a98544d18 net/mlx5: Move chains ft pool to be used by all firmware steering
The firmware FT pool is per device, but the software tracking of this
pool only serves fs_chains users; if another layer takes a flow table,
the pool is not updated, and fs_chains will fail to create a flow table,
with no recovery until the flow table is returned.

Move FT pool to be global per device, and stored at the cmd level,
so all layers can use it.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:38 -07:00
Paul Blakey 04745afb2a net/mlx5: Move table size calculation to steering cmd layer
Currently the table size is calculated by the fs_core layer. However, each
steering cmd instance has a different allocation logic. FW steering uses
a predefined pools of multiple sizes. SW steering doesn't have a pool,
and can allocate any size of tables.

Move the table size calculation to the steering cmd layer as a pre-step
for moving fs_chains pool logic globally to firmware steering, and
increasing table sizes for software steering. In addition, change the
size parameter to absolute size to allow the special zero value to
mean "get next available maximum size".

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:37 -07:00
Paul Blakey e01b58e9d5 net/mlx5: Add case for FS_FT_NIC_TX FT in MLX5_CAP_FLOWTABLE_TYPE
Commit 16f1c5bb3e ("net/mlx5: Check device capability for maximum flow
counters") added MLX5_CAP_FLOWTABLE_TYPE but forgot to account
for FS_FT_NIC_TX case in the expression.

Although the expression will return 1 for this case instead of the
actual cap, there are currently no known side effects of missing this
case.

Add the FS_FT_NIC_TX case.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:37 -07:00
Yevgeny Kliteynik b72ce870f5 net/mlx5: DR, Remove unused field of send_ring struct
Remove unused field of struct mlx5dr_send_ring

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:37 -07:00
Tariq Toukan 2ef9c7c613 net/mlx5e: RX, Remove unnecessary check in RX CQE compression handling
There are two reasons for exiting mlx5e_decompress_cqes_cont():
1. The compression session is completed (cqd.left == 0).
2. The budget is exhausted (work_done == budget).

If after calling mlx5e_decompress_cqes_cont() we have cqd.left > 0,
it necessarily implies that budget is exhausted.

The first part of the complex condition is covered by the second,
hence we remove it here.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:36 -07:00
Huy Nguyen c07274ab1a net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet
rep_tc copies REG_C1 to REG_B. IPsec crypto utilizes the whole REG_B
register, with BIT31 as the IPsec marker. rep_tc_update_skb drops the
IPsec packet because it thinks REG_B contains a bad value.

In the previous patch, BIT31 of REG_C1 was reserved for IPsec.
Skip rep_tc_update_skb if BIT31 of REG_B is set.

Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:36 -07:00
Huy Nguyen b973cf3245 net/mlx5e: TC: Reserved bit 31 of REG_C1 for IPsec offload
Currently, ASAP features fully utilize all the bits of the CQE's flow tag
and ft_metadata fields. The flow tag field cannot be used because flow
table tagging in the FTE does not allow a partial write.

We agreed to reserve bit 31 of the CQE's ft_metadata for IPsec, to prevent
ASAP CT from dropping IPsec offloaded packets.

Here is the new bit layout of REG_C1. Tunnel option id is reduced to
11 bits:
< IPSEC MARKER (1) | ESW_TUN_ID(12) | ESW_TUN_OPTS(11) | ESW_ZONE_ID(8) >

Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
2021-05-27 11:54:36 -07:00
Paul Blakey ed2fe7ba7b net/mlx5e: TC: Use bit counts for register mapping
To prepare for next patch where we will use a non-byte
aligned mapping, change all byte counts in register
mapping to bits.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:35 -07:00
Paul Blakey 7fac5c2ece net/mlx5: CT: Avoid reusing modify header context for natted entries
Currently the driver is designed to reuse header modify context entries.
Natted entries will always have a unique modify header; as such, the
modify header hashtable lookup introduces an overhead. When the hashtable
size exceeded 200k entries, the tested insertion rate dropped from ~10k
entries/sec to ~300 entries/sec.

Don't use the re-use mechanism when creating modify headers
for natted tuples.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:35 -07:00
Roi Dayan 74097a0dcd net/mlx5e: CT, Remove newline from ct_dbg call
ct_dbg() already adds a newline.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-27 11:54:35 -07:00
Jakub Kicinski 5ada57a9a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 09:55:10 -07:00
Vlad Buslov 9453d45ecb net: zero-initialize tc skb extension on allocation
Function skb_ext_add() doesn't initialize created skb extension with any
value and leaves it up to the user. However, since extension of type
TC_SKB_EXT originally contained only single value tc_skb_ext->chain its
users used to just assign the chain value without setting whole extension
memory to zero first. This assumption changed when TC_SKB_EXT extension was
extended with additional fields but not all users were updated to
initialize the new fields which leads to use of uninitialized memory
afterwards. UBSAN log:

[  778.299821] UBSAN: invalid-load in net/openvswitch/flow.c:899:28
[  778.301495] load of value 107 is not a valid value for type '_Bool'
[  778.303215] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7+ #2
[  778.304933] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  778.307901] Call Trace:
[  778.308680]  <IRQ>
[  778.309358]  dump_stack+0xbb/0x107
[  778.310307]  ubsan_epilogue+0x5/0x40
[  778.311167]  __ubsan_handle_load_invalid_value.cold+0x43/0x48
[  778.312454]  ? memset+0x20/0x40
[  778.313230]  ovs_flow_key_extract.cold+0xf/0x14 [openvswitch]
[  778.314532]  ovs_vport_receive+0x19e/0x2e0 [openvswitch]
[  778.315749]  ? ovs_vport_find_upcall_portid+0x330/0x330 [openvswitch]
[  778.317188]  ? create_prof_cpu_mask+0x20/0x20
[  778.318220]  ? arch_stack_walk+0x82/0xf0
[  778.319153]  ? secondary_startup_64_no_verify+0xb0/0xbb
[  778.320399]  ? stack_trace_save+0x91/0xc0
[  778.321362]  ? stack_trace_consume_entry+0x160/0x160
[  778.322517]  ? lock_release+0x52e/0x760
[  778.323444]  netdev_frame_hook+0x323/0x610 [openvswitch]
[  778.324668]  ? ovs_netdev_get_vport+0xe0/0xe0 [openvswitch]
[  778.325950]  __netif_receive_skb_core+0x771/0x2db0
[  778.327067]  ? lock_downgrade+0x6e0/0x6f0
[  778.328021]  ? lock_acquire+0x565/0x720
[  778.328940]  ? generic_xdp_tx+0x4f0/0x4f0
[  778.329902]  ? inet_gro_receive+0x2a7/0x10a0
[  778.330914]  ? lock_downgrade+0x6f0/0x6f0
[  778.331867]  ? udp4_gro_receive+0x4c4/0x13e0
[  778.332876]  ? lock_release+0x52e/0x760
[  778.333808]  ? dev_gro_receive+0xcc8/0x2380
[  778.334810]  ? lock_downgrade+0x6f0/0x6f0
[  778.335769]  __netif_receive_skb_list_core+0x295/0x820
[  778.336955]  ? process_backlog+0x780/0x780
[  778.337941]  ? mlx5e_rep_tc_netdevice_event_unregister+0x20/0x20 [mlx5_core]
[  778.339613]  ? seqcount_lockdep_reader_access.constprop.0+0xa7/0xc0
[  778.341033]  ? kvm_clock_get_cycles+0x14/0x20
[  778.342072]  netif_receive_skb_list_internal+0x5f5/0xcb0
[  778.343288]  ? __kasan_kmalloc+0x7a/0x90
[  778.344234]  ? mlx5e_handle_rx_cqe_mpwrq+0x9e0/0x9e0 [mlx5_core]
[  778.345676]  ? mlx5e_xmit_xdp_frame_mpwqe+0x14d0/0x14d0 [mlx5_core]
[  778.347140]  ? __netif_receive_skb_list_core+0x820/0x820
[  778.348351]  ? mlx5e_post_rx_mpwqes+0xa6/0x25d0 [mlx5_core]
[  778.349688]  ? napi_gro_flush+0x26c/0x3c0
[  778.350641]  napi_complete_done+0x188/0x6b0
[  778.351627]  mlx5e_napi_poll+0x373/0x1b80 [mlx5_core]
[  778.352853]  __napi_poll+0x9f/0x510
[  778.353704]  ? mlx5_flow_namespace_set_mode+0x260/0x260 [mlx5_core]
[  778.355158]  net_rx_action+0x34c/0xa40
[  778.356060]  ? napi_threaded_poll+0x3d0/0x3d0
[  778.357083]  ? sched_clock_cpu+0x18/0x190
[  778.358041]  ? __common_interrupt+0x8e/0x1a0
[  778.359045]  __do_softirq+0x1ce/0x984
[  778.359938]  __irq_exit_rcu+0x137/0x1d0
[  778.360865]  irq_exit_rcu+0xa/0x20
[  778.361708]  common_interrupt+0x80/0xa0
[  778.362640]  </IRQ>
[  778.363212]  asm_common_interrupt+0x1e/0x40
[  778.364204] RIP: 0010:native_safe_halt+0xe/0x10
[  778.365273] Code: 4f ff ff ff 4c 89 e7 e8 50 3f 40 fe e9 dc fe ff ff 48 89 df e8 43 3f 40 fe eb 90 cc e9 07 00 00 00 0f 00 2d 74 05 62 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 64 05 62 00 f4 c3 cc cc 0f 1f 44 00
[  778.369355] RSP: 0018:ffffffff84407e48 EFLAGS: 00000246
[  778.370570] RAX: ffff88842de46a80 RBX: ffffffff84425840 RCX: ffffffff83418468
[  778.372143] RDX: 000000000026f1da RSI: 0000000000000004 RDI: ffffffff8343af5e
[  778.373722] RBP: fffffbfff0884b08 R08: 0000000000000000 R09: ffff88842de46bcb
[  778.375292] R10: ffffed1085bc8d79 R11: 0000000000000001 R12: 0000000000000000
[  778.376860] R13: ffffffff851124a0 R14: 0000000000000000 R15: dffffc0000000000
[  778.378491]  ? rcu_eqs_enter.constprop.0+0xb8/0xe0
[  778.379606]  ? default_idle_call+0x5e/0xe0
[  778.380578]  default_idle+0xa/0x10
[  778.381406]  default_idle_call+0x96/0xe0
[  778.382350]  do_idle+0x3d4/0x550
[  778.383153]  ? arch_cpu_idle_exit+0x40/0x40
[  778.384143]  cpu_startup_entry+0x19/0x20
[  778.385078]  start_kernel+0x3c7/0x3e5
[  778.385978]  secondary_startup_64_no_verify+0xb0/0xbb

Fix the issue by providing a new function tc_skb_ext_alloc() that
allocates the tc skb extension and initializes its memory to 0 before
returning it to the caller. Change all existing users to use the new API
instead of calling skb_ext_add() directly.
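
A sketch of such a helper, under the assumption that it simply wraps
skb_ext_add() and zeroes the result (the exact upstream definition may
differ):

  #include <linux/skbuff.h>
  #include <linux/string.h>
  #include <net/pkt_cls.h>

  static struct tc_skb_ext *tc_skb_ext_alloc_sketch(struct sk_buff *skb)
  {
          struct tc_skb_ext *tc_skb_ext = skb_ext_add(skb, TC_SKB_EXT);

          if (tc_skb_ext)
                  memset(tc_skb_ext, 0, sizeof(*tc_skb_ext));   /* no stale fields */
          return tc_skb_ext;
  }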

Fixes: 038ebb1a71 ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Fixes: d29334c15d ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25 15:36:42 -07:00
Ido Schimmel daeabf89eb mlxsw: spectrum_router: Add support for custom multipath hash policy
When this policy is set, only enable the packet fields that were enabled
by user space for multipath hash computation.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19 12:47:47 -07:00
Ido Schimmel 01848e05f8 mlxsw: spectrum_router: Add support for inner layer 3 multipath hash policy
When this policy is set, the kernel uses the inner layer 3 fields for
multipath hash computation and falls back to the outer fields if no
encapsulation was encountered. This behavior is most likely influenced
by the behavior of the flow dissector, which is used for the packet
dissection.

The Spectrum ASIC, however, cannot fallback to outer fields if inner
fields are not available. This should not result in a discrepancy from
the software data path because if several flows have matching inner
fields, they will tend to have matching outer fields as well.

Therefore, implement this policy by enabling both outer and inner layer
3 fields for the multipath hash computation.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19 12:47:47 -07:00
Ido Schimmel b7b8f435ea mlxsw: spectrum_outer: Factor out helper for common outer fields
Outer IPv4 and IPv6 addresses are used by multiple multipath hash
policies. Factor out helpers that set these fields to increase code
sharing between different policies.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19 12:47:47 -07:00
Ido Schimmel 28bc824807 mlxsw: reg: Add inner packet fields to RECRv2 register
The RECRv2 register is used for setting up the router's ECMP hash
configuration. Extend it with inner packet fields to allow the ECMP hash
to be calculated based on inner flow information.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19 12:47:47 -07:00
Ido Schimmel 9d23d3eb6f mlxsw: spectrum_router: Move multipath hash configuration to a bitmap
Currently, the multipath hash configuration is written directly to the
register payload. While this is OK for the two currently supported
policies, it is going to be hard to follow when more policies and more
packet fields are added.

Instead, set the required headers and fields in a bitmap and then dump
it to the register payload.
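
An illustrative version of the pattern (the field IDs and the debug print
are made up; the real code fills the RECRv2 payload):

  #include <linux/bitmap.h>
  #include <linux/printk.h>

  enum {
          HASH_FIELD_OUTER_SIP,
          HASH_FIELD_OUTER_DIP,
          HASH_FIELD_IP_PROTO,
          HASH_FIELD_COUNT,
  };

  static void hash_fields_to_payload_sketch(void)
  {
          DECLARE_BITMAP(fields, HASH_FIELD_COUNT);
          unsigned int i;

          bitmap_zero(fields, HASH_FIELD_COUNT);
          __set_bit(HASH_FIELD_OUTER_SIP, fields);        /* policy-dependent */
          __set_bit(HASH_FIELD_OUTER_DIP, fields);

          /* Single place that translates the bitmap into the register payload */
          for_each_set_bit(i, fields, HASH_FIELD_COUNT)
                  pr_debug("enable hash field %u\n", i);
  }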

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19 12:47:47 -07:00
Ido Schimmel 7725c1c8f7 mlxsw: spectrum_router: Replace if statement with a switch statement
The code was written when only two multipath hash policies were present,
so the if statement was sufficient. The next patch and future patches
are going to add support for more policies, so move to a switch
statement.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-19 12:47:47 -07:00
Jakub Kicinski e63052a5dd mlx5e: add missing BH locking around napi_schedule()
It's not correct to call napi_schedule() in pure process
context. Because we use __raise_softirq_irqoff() we require
callers to be in a context which will eventually lead to
softirq handling (hardirq, bh disabled, etc.).

With the code as is, users will see:

 NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
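
The fix pattern in general form (only the locking around napi_schedule()
is shown):

  #include <linux/bottom_half.h>
  #include <linux/netdevice.h>

  static void trigger_napi_from_process_context(struct napi_struct *napi)
  {
          local_bh_disable();
          napi_schedule(napi);    /* raised softirq is handled on local_bh_enable() */
          local_bh_enable();
  }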

Fixes: a8dd7ac12f ("net/mlx5e: Generalize RQ activation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:55 -07:00
Ariel Levkovich 6ff51ab8aa net/mlx5: Set term table as an unmanaged flow table
Termination tables are restricted to have the default miss action and
cannot be set to forward to another table in case of a miss.
If the fs prio of the termination table is not the last one in the
list, fs_core will attempt to attach it to another table.

Set the unmanaged ft flag when creating the termination table ft and
select the tc offload prio for it, to prevent fs_core from selecting the
forward-to-next-ft miss action, and use the default one instead.

In addition, set the flow that forwards to the termination table to
ignore ft level restrictions, since the ft level is not set by fs_core
for unmanaged fts.

Fixes: 249ccc3c95 ("net/mlx5e: Add support for offloading traffic from uplink to uplink")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
2021-05-18 23:01:53 -07:00
Leon Romanovsky 75e8564e91 net/mlx5: Don't overwrite HCA capabilities when setting MSI-X count
During driver probe of a device that has the dynamic MSI-X feature
enabled, the following error is printed with some FW flavours (not
released yet).

 mlx5_core 0000:06:00.0: firmware version: 4.7.4387
 mlx5_core 0000:06:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
 mlx5_core 0000:06:00.0: mlx5_cmd_check:777:(pid 70599): SET_HCA_CAP(0x109) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x0)
 mlx5_core 0000:06:00.0: set_hca_cap:622:(pid 70599): handle_hca_cap failed
 mlx5_core 0000:06:00.0: mlx5_function_setup:1045:(pid 70599): set_hca_cap failed
 mlx5_core 0000:06:00.0: probe_one:1465:(pid 70599): mlx5_init_one failed with error code -22
 mlx5_core: probe of 0000:06:00.0 failed with error -22

In order to make the setting capability of MSI-X future proof, let's
query the current capabilities first.

Fixes: 604774add5 ("net/mlx5: Dynamically assign MSI-X vectors count")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:51 -07:00
Eli Cohen 7c9f131f36 {net,vdpa}/mlx5: Configure interface MAC into mpfs L2 table
net/mlx5: Expose MPFS configuration API

MPFS is the multi physical function switch that bridges traffic between
the physical port and any physical functions associated with it. The
driver is required to add or remove MAC entries to properly forward
incoming traffic to the correct physical function.

We export the API to control MPFS so that other drivers, such as
mlx5_vdpa are able to add MAC addresses of their network interfaces.

The MAC address of the vdpa interface must be configured into the MPFS L2
address. Failing to do so could cause, in some NIC configurations, failure
to forward packets to the vdpa network device instance.

Fix this by adding calls to update the MPFS table.

CC: <mst@redhat.com>
CC: <jasowang@redhat.com>
CC: <virtualization@lists.linux-foundation.org>
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:48 -07:00
Aya Levin 5e7923acbd net/mlx5e: Fix error path of updating netdev queues
Avoid division by zero in the error flow. In the driver, the TC number
can be either 1 or 8. When the TC count is set to 1, the driver zeroes
netdev->num_tc. Hence, we need to convert it back from 0 to 1 in the
error flow.
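
A minimal sketch of that conversion (helper name is illustrative):

  #include <linux/netdevice.h>

  static unsigned int effective_num_tc(const struct net_device *netdev)
  {
          /* num_tc is zeroed when a single TC is used; never divide by 0 */
          return netdev->num_tc ? netdev->num_tc : 1;
  }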

Fixes: fa3748775b ("net/mlx5e: Handle errors from netif_set_real_num_{tx,rx}_queues")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:46 -07:00
Vlad Buslov 7d1a3d08c8 net/mlx5e: Reject mirroring on source port change encap rules
Rules with the MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE dest flag are
translated to a destination FT in the eswitch. Currently it is not
possible to mirror such rules because the firmware doesn't support mixing
FT and Vport destinations in a single rule when one of them adds
encapsulation. Since the only use case for the
MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE destination is support for tunnel
endpoints on a VF, and trying to offload such a rule with a mirror action
causes either a crash in fs_core or a firmware error with syndrome
0xff6a1d, reject all such rules in the mlx5 TC layer.

Fixes: 10742efc20 ("net/mlx5e: VF tunnel TX traffic offloading")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:43 -07:00
Dima Chumak 97817fcc68 net/mlx5e: Fix multipath lag activation
When handling FIB_EVENT_ENTRY_REPLACE event for a new multipath route,
lag activation can be missed if a stale (struct lag_mp)->mfi pointer
exists, which was associated with an older multipath route that had been
removed.

Normally, when a route is removed, it triggers mlx5_lag_fib_event(),
which handles FIB_EVENT_ENTRY_DEL and clears mfi pointer. But, if
mlx5_lag_check_prereq() condition isn't met, for example when eswitch is
in legacy mode, the fib event is skipped and mfi pointer becomes stale.

Fix by resetting mfi pointer to NULL every time mlx5_lag_mp_init() is
called.

Fixes: 544fe7c2e6 ("net/mlx5e: Activate HW multipath and handle port affinity based on FIB events")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:41 -07:00
Saeed Mahameed 77ecd10d0a net/mlx5e: reset XPS on error flow if netdev isn't registered yet
mlx5e_attach_netdev can be called prior to registering the netdevice:
Example stack:

ipoib_new_child_link ->
ipoib_intf_init->
rdma_init_netdev->
mlx5_rdma_setup_rn->

mlx5e_attach_netdev->
mlx5e_num_channels_changed ->
mlx5e_set_default_xps_cpumasks ->
netif_set_xps_queue ->
__netif_set_xps_queue -> kmalloc

If any later stage fails at any point after mlx5e_num_channels_changed()
returns, the allocated XPS maps will never be freed, as they are only
freed during netdev unregistration, which will never happen for netdevs
that are yet to be registered.

Fixes: 3909a12e79 ("net/mlx5e: Fix configuration of XPS cpumasks and netdev queues in corner cases")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
2021-05-18 23:01:38 -07:00
Roi Dayan eb96cc1592 net/mlx5e: Make sure fib dev exists in fib event
For an unreachable route entry, the fib dev does not exist.

Fixes: 8914add2c9 ("net/mlx5e: Handle FIB events to update tunnel endpoint device")
Reported-by: Dennis Afanasev <dennis.afanasev@stateless.net>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:36 -07:00
Roi Dayan 83026d8318 net/mlx5e: Fix null deref accessing lag dev
It could be that the lag dev is NULL, so stop processing the event.
In bond_enslave() the active/backup slave is set before setting the
upper dev, so the first event arrives without an upper dev.
After setting the upper dev with bond_master_upper_dev_link() there is
a second event, and in that event we have an upper dev.

Fixes: 7e51891a23 ("net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:34 -07:00
Dima Chumak fe7738eb3c net/mlx5e: Fix nullptr in mlx5e_tc_add_fdb_flow()
The result of __dev_get_by_index() is not checked for NULL, which then
passed to mlx5e_attach_encap() and gets dereferenced.

Also, in case of a successful lookup, the net_device reference count is
not incremented, which may result in net_device pointer becoming invalid
at any time during mlx5e_attach_encap() execution.

Fix by using dev_get_by_index(), which does proper reference counting on
the net_device pointer. Also, handle nullptr return value when mirred
device is not found.

It's safe to call dev_put() on the mirred net_device pointer, right
after mlx5e_attach_encap() call, because it's not being saved/copied
down the call chain.
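
A minimal sketch of the resulting lookup pattern, with illustrative
placeholder names (example_attach_encap() stands in for the real attach
call; this is not the actual mlx5e code):

  #include <linux/errno.h>
  #include <linux/netdevice.h>

  int example_attach_encap(struct net_device *mirred_dev); /* placeholder */

  static int example_parse_mirred(struct net *net, int ifindex)
  {
      struct net_device *mirred_dev;
      int err;

      /* dev_get_by_index() takes a reference, unlike __dev_get_by_index() */
      mirred_dev = dev_get_by_index(net, ifindex);
      if (!mirred_dev)
          return -ENODEV; /* mirred device not found */

      err = example_attach_encap(mirred_dev);

      /* The pointer is not saved past this point, so the reference can be
       * dropped right after the attach call. */
      dev_put(mirred_dev);
      return err;
  }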

Fixes: 3c37745ec6 ("net/mlx5e: Properly deal with encap flows add/del under neigh update")
Addresses-Coverity: ("Dereference null return value")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:31 -07:00
Parav Pandit 82041634d9 net/mlx5: SF, Fix show state inactive when its inactivated
When an SF is inactivated and is in the TEARDOWN_REQUEST state, the
driver still reports its state as active. This is incorrect.
Fix it by treating TEARDOWN_REQUEST as an inactive state. When an SF is
still attached to the driver and the user requests to reactivate it, a
bare EINVAL error is returned. Inform the user with the more accurate
code EBUSY and an informative error message.
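
A hedged sketch of the two behaviours described above; the hw state enum,
the attached check and the extack message are illustrative placeholders
for driver internals, not the actual mlx5 definitions:

  #include <linux/errno.h>
  #include <linux/netlink.h>
  #include <net/devlink.h>

  enum example_hw_state {
      EXAMPLE_HW_ACTIVE,
      EXAMPLE_HW_IN_USE,
      EXAMPLE_HW_TEARDOWN_REQUEST,
      EXAMPLE_HW_DETACHED,
  };

  static enum devlink_port_fn_state example_state_get(enum example_hw_state hw)
  {
      switch (hw) {
      case EXAMPLE_HW_ACTIVE:
      case EXAMPLE_HW_IN_USE:
          return DEVLINK_PORT_FN_STATE_ACTIVE;
      case EXAMPLE_HW_TEARDOWN_REQUEST: /* now reported as inactive */
      default:
          return DEVLINK_PORT_FN_STATE_INACTIVE;
      }
  }

  static int example_activate(bool still_attached,
                              struct netlink_ext_ack *extack)
  {
      if (still_attached) {
          NL_SET_ERR_MSG(extack, "SF is still attached to the driver");
          return -EBUSY; /* previously a bare -EINVAL */
      }
      /* ... proceed with activation ... */
      return 0;
  }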

Fixes: 6a32732174 ("net/mlx5: SF, Port function state change support")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:29 -07:00
Roi Dayan fca086617a net/mlx5: Fix err prints and return when creating termination table
Fix the print to output the correct error code instead of using IS_ERR(),
which just results in always printing 1.
Also return the real error instead of always -EOPNOTSUPP.

Fixes: 10caabdaad ("net/mlx5e: Use termination table for VLAN push actions")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:26 -07:00
Jianbo Liu 442b3d7b67 net/mlx5: Set reformat action when needed for termination rules
For remote mirroring, after the tunnel packets are received, they are
decapsulated and sent to the representor, then re-encapsulated and sent
out over another tunnel. So the reformat action is set only when the
destination is required to do encapsulation.

Fixes: 249ccc3c95 ("net/mlx5e: Add support for offloading traffic from uplink to uplink")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:24 -07:00
Dima Chumak dca59f4a79 net/mlx5e: Fix nullptr in add_vlan_push_action()
The result of dev_get_by_index_rcu() is not checked for NULL and is
dereferenced immediately.

Also, the RCU lock must be held by the caller of dev_get_by_index_rcu(),
which isn't satisfied by the call stack.

Fix by handling nullptr return value when iflink device is not found.
Add RCU locking around dev_get_by_index_rcu() to avoid possible adverse
effects while iterating over the net_device's hlist.

It is safe not to increment reference count of the net_device pointer in
case of a successful lookup, because it's already handled by VLAN code
during VLAN device registration (see register_vlan_dev and
netdev_upper_dev_link).
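
A hedged sketch of the fixed lookup (illustrative name, not the actual
mlx5e function):

  #include <linux/netdevice.h>
  #include <linux/rcupdate.h>

  static struct net_device *example_get_iflink_dev(struct net *net, int iflink)
  {
      struct net_device *uplink_dev;

      rcu_read_lock(); /* required by dev_get_by_index_rcu() */
      uplink_dev = dev_get_by_index_rcu(net, iflink);
      rcu_read_unlock();

      /* May be NULL; the caller must check before dereferencing.
       * No dev_hold() is needed here because, per the reasoning above,
       * the VLAN code already holds a reference on the real device. */
      return uplink_dev;
  }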

Fixes: 278748a95a ("net/mlx5e: Offload TC e-switch rules with egress VLAN device")
Addresses-Coverity: ("Dereference null return value")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:21 -07:00
Maor Gottlieb 3410fbcd47 {net, RDMA}/mlx5: Fix override of log_max_qp by other device
mlx5_core_dev holds a pointer to a static profile, hence when the
log_max_qp of the profile is overridden by one device, it affects all
other mlx5 devices that share the same profile.
Fix it by having a profile instance for every mlx5 device.

Fixes: 883371c453 ("net/mlx5: Check FW limitations on log_max_qp before setting it")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-05-18 23:01:19 -07:00
Amit Cohen b0d80c013b mlxsw: Remove Mellanox SwitchX-2 ASIC support
Initial support for the Mellanox SwitchX-2 ASIC was added in July 2015.
Since then all development efforts shifted towards the Mellanox Spectrum
ASICs, and development of this driver stopped aside from trivial fixes and
refactoring. Therefore, the driver does not support any switch offloads
and simply traps all traffic to the CPU, rendering it irrelevant for
deployment.

In addition, support for this ASIC was dropped by Mellanox a few years
ago.

Given the driver is not used by any users and that there is no
intention of investing in its development, remove it from the kernel.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17 15:15:46 -07:00
Amit Cohen 9b43fbb8ce mlxsw: Remove Mellanox SwitchIB ASIC support
Initial support for the Mellanox SwitchIB and SwitchIB-2 ASICs was added
in October 2016, but since then development of this driver stopped.
Therefore, the driver does not support any offloads and simply registers
devlink ports for its front panel ports, rendering it irrelevant for
deployment.

Given the driver is not used by any users and that there is no intention
of investing in its development, remove it from the kernel.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17 15:15:46 -07:00
Ido Schimmel 51746a353b mlxsw: spectrum_router: Avoid missing error code warning
Explicitly set the error code to zero before the goto statement to avoid
the following smatch warning:

drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:3598 mlxsw_sp_nexthop_group_refresh() warn: missing error code 'err'

The warning is a false positive, but the change both suppresses the
warning and makes it clear to future readers that this is not an error
path.

The original report and discussion can be found here [1].

[1] https://lore.kernel.org/lkml/202105141823.Td2h3Mbi-lkp@intel.com/

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17 15:15:46 -07:00
Ido Schimmel 8c2b58e65d mlxsw: core: Avoid unnecessary EMAD buffer copy
mlxsw_emad_transmit() takes care of sending EMAD transactions to the
device. Since these transactions can time out, the driver performs up to
5 retransmissions, each time copying the skb with the original request.

The data of the skb does not change throughout the process, so there is
no need to copy it each time. Instead, only the skb metadata needs to be
duplicated. Therefore, use skb_clone() instead of skb_copy().

This reduces the latency of the function by about 16%.
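
A hedged sketch of the change (illustrative function name, not the mlxsw
code):

  #include <linux/skbuff.h>

  static struct sk_buff *example_dup_emad_skb(struct sk_buff *trans_skb)
  {
      /* Before: skb_copy() duplicated both the skb and its data. The EMAD
       * payload is immutable across retransmissions, so cloning the skb
       * (shared data, private metadata) is sufficient. */
      return skb_clone(trans_skb, GFP_KERNEL);
  }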

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17 15:15:46 -07:00
Danielle Ratson 837ec05cfe mlxsw: Verify the accessed index doesn't exceed the array length
There are a few cases in which an array index queried from a firmware
register is accessed without any validation that it doesn't exceed the
array length.

Add proper length validation, so that accessing memory past the end of an
array is forbidden.
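
A hedged illustration of the validation pattern (generic code, not tied to
a specific mlxsw register):

  #include <linux/errno.h>
  #include <linux/types.h>

  static int example_lookup_by_fw_index(const u32 *table, size_t table_len,
                                        u32 fw_index, u32 *out)
  {
      /* The index comes from the device; never trust it blindly. */
      if (fw_index >= table_len)
          return -EINVAL;

      *out = table[fw_index];
      return 0;
  }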

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17 15:15:46 -07:00
Danielle Ratson ece5df874d mlxsw: spectrum_buffers: Switch function arguments
In the call path:

mlxsw_sp_hdroom_bufs_reset_sizes()
    mlxsw_sp_hdroom_int_buf_size_get()
        ->int_buf_size_get()

The 'speed' and 'mtu' arguments were mistakenly switched twice. The two
bugs thus canceled each other.

Clean this up by switching the arguments in both call sites, so that
they are passed in the right order.

Found during manual code inspection.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17 15:15:46 -07:00
Vladyslav Tarasiuk db825feefc net/mlx4: Fix EEPROM dump support
Fix SFP and QSFP* EEPROM queries by setting i2c_address, offset and page
number correctly. For SFP set the following params:
- I2C address for offsets 0-255 is 0x50. For 256-511 - 0x51.
- Page number is zero.
- Offset is 0-255.

At the same time, QSFP* parameters are different:
- I2C address is always 0x50.
- Page number is not limited to zero.
- Offset is 0-255 for page zero and 128-255 for others.

To set the parameters according to the cable used, implement a function to
query the module ID and respective helper functions to set the parameters
correctly.
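
A hedged sketch encoding the rules above (struct and function names are
illustrative, not the mlx4 code):

  #include <linux/kernel.h>
  #include <linux/types.h>

  struct example_eeprom_params {
      u8  i2c_addr;
      u8  page;
      u16 page_off;
  };

  static void example_set_eeprom_params(bool is_sfp, u8 page, u16 offset,
                                        struct example_eeprom_params *p)
  {
      if (is_sfp) {
          /* SFP: page is always 0; offsets 256-511 are reached via I2C
           * address 0x51, offsets 0-255 via 0x50. */
          p->i2c_addr = offset < 256 ? 0x50 : 0x51;
          p->page = 0;
          p->page_off = offset & 0xff;
      } else {
          /* QSFP*: I2C address is always 0x50; page may be non-zero, and
           * offsets on non-zero pages are limited to 128-255. */
          p->i2c_addr = 0x50;
          p->page = page;
          p->page_off = page ? clamp_t(u16, offset, 128, 255) : offset;
      }
  }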

Fixes: 135dd9594f ("net/mlx4_en: ethtool, Remove unsupported SFP EEPROM high pages query")
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-10 14:34:39 -07:00
Linus Torvalds fc858a5231 Networking fixes for 5.13-rc1, including fixes from bpf, can
and netfilter trees. Self-contained fixes, nothing risky.
 
 Current release - new code bugs:
 
  - dsa: ksz: fix a few bugs found by static-checker in the new driver
 
  - stmmac: fix frame preemption handshake not triggering after
            interface restart
 
 Previous releases - regressions:
 
  - make nla_strcmp handle more than one trailing null character
 
  - fix stack OOB reads while fragmenting IPv4 packets in openvswitch
    and net/sched
 
  - sctp: do asoc update earlier in sctp_sf_do_dupcook_a
 
  - sctp: delay auto_asconf init until binding the first addr
 
  - stmmac: clear receive all(RA) bit when promiscuous mode is off
 
  - can: mcp251x: fix resume from sleep before interface was brought up
 
 Previous releases - always broken:
 
  - bpf: fix leakage of uninitialized bpf stack under speculation
 
  - bpf: fix masking negation logic upon negative dst register
 
  - netfilter: don't assume that skb_header_pointer() will never fail
 
  - only allow init netns to set default tcp cong to a restricted algo
 
  - xsk: fix xp_aligned_validate_desc() when len == chunk_size to
         avoid false positive errors
 
  - ethtool: fix missing NLM_F_MULTI flag when dumping
 
  - can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
 
  - sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b
 
  - bridge: fix NULL-deref caused by a race between assigning
            rx_handler_data and setting the IFF_BRIDGE_PORT bit
 
 Latecomer:
 
  - seg6: add counters support for SRv6 Behaviors
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCV3YoACgkQMUZtbf5S
 IrsQ2w//Q8/qbl6wGTKUfu6DZHYUU5j5sTwiHR823PKKSgXI+okWMN0KUlZszOsz
 qnPkH6GuojRooOE1s8PFLSlt9axKhQ0y7uzMTrWYafQ+JZTtgg9/MiPxQ8fdiE5i
 uOG1ngttZ+1jlE5tMPL4GAOSegg3rWVDclzqnJTdsPPOco3MWj6SL9xN0LDPxCEL
 BDysRqL/UiOIoh4v6IXQRx2UWjsNGu4biM1po+Jfumnd9T0zKoEpzu6UN6yPShbx
 284LihZSQtughCbhGqkErBOxfjZcvpFOQrqmjEvI+Z/eYg4InfWZemt8Sa92/alE
 yAFjK76MUTaUxaAO/gk8XauhvkYOzJJwKpqhbOmlaM7oj55QdzT5/8JxMxVoA6hV
 pscHOixk15GVse49PdPV8v47cyTLc/Xi69i+/uUdNVVfuORL1wft1w1xbd0S6Pbe
 7Gqax21S7zxcDsrUli7cFheYiqtbQAL0anlIUz8tUOZFz0VQ/zPuFd4rUYZ/o38V
 Mrevdk3t6CXNxS4CRXyUW4UejYB1O6Qw12sUue31e3h73d6LiN3NAiN5Qp7SEk1/
 fvk+jfOf8vvmtimYvcUK2i0D+vqj4Ec/qRIE/XXuUDBcp22tPL9uWMfWavwTdAj1
 Se4SzksTWF+NM0lO0ItonMyPh3ZXcSLhIv/gHrZwEKuWkXCGO4M=
 =JmWS
 -----END PGP SIGNATURE-----

Merge tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.13-rc1, including fixes from bpf, can and
  netfilter trees. Self-contained fixes, nothing risky.

  Current release - new code bugs:

   - dsa: ksz: fix a few bugs found by static-checker in the new driver

   - stmmac: fix frame preemption handshake not triggering after
     interface restart

  Previous releases - regressions:

   - make nla_strcmp handle more than one trailing null character

   - fix stack OOB reads while fragmenting IPv4 packets in openvswitch
     and net/sched

   - sctp: do asoc update earlier in sctp_sf_do_dupcook_a

   - sctp: delay auto_asconf init until binding the first addr

   - stmmac: clear receive all(RA) bit when promiscuous mode is off

   - can: mcp251x: fix resume from sleep before interface was brought up

  Previous releases - always broken:

   - bpf: fix leakage of uninitialized bpf stack under speculation

   - bpf: fix masking negation logic upon negative dst register

   - netfilter: don't assume that skb_header_pointer() will never fail

   - only allow init netns to set default tcp cong to a restricted algo

   - xsk: fix xp_aligned_validate_desc() when len == chunk_size to avoid
     false positive errors

   - ethtool: fix missing NLM_F_MULTI flag when dumping

   - can: m_can: m_can_tx_work_queue(): fix tx_skb race condition

   - sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b

   - bridge: fix NULL-deref caused by a race between assigning
     rx_handler_data and setting the IFF_BRIDGE_PORT bit

  Latecomer:

   - seg6: add counters support for SRv6 Behaviors"

* tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
  atm: firestream: Use fallthrough pseudo-keyword
  net: stmmac: Do not enable RX FIFO overflow interrupts
  mptcp: fix splat when closing unaccepted socket
  i40e: Remove LLDP frame filters
  i40e: Fix PHY type identifiers for 2.5G and 5G adapters
  i40e: fix the restart auto-negotiation after FEC modified
  i40e: Fix use-after-free in i40e_client_subtask()
  i40e: fix broken XDP support
  netfilter: nftables: avoid potential overflows on 32bit arches
  netfilter: nftables: avoid overflows in nft_hash_buckets()
  tcp: Specify cmsgbuf is user pointer for receive zerocopy.
  mlxsw: spectrum_mr: Update egress RIF list before route's action
  net: ipa: fix inter-EE IRQ register definitions
  can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
  can: mcp251x: fix resume from sleep before interface was brought up
  can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path
  can: mcp251xfd: mcp251xfd_probe(): fix an error pointer dereference in probe
  netfilter: nftables: Fix a memleak from userdata error path in new objects
  netfilter: remove BUG_ON() after skb_header_pointer()
  netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
  ...
2021-05-08 08:31:46 -07:00
Ido Schimmel cbaf3f6af9 mlxsw: spectrum_mr: Update egress RIF list before route's action
Each multicast route that is forwarding packets (as opposed to trapping
them) points to a list of egress router interfaces (RIFs) through which
packets are replicated.

A route's action can transition from trap to forward when a RIF is
created for one of the route's egress virtual interfaces (eVIF). When
this happens, the route's action is first updated and only later the
list of egress RIFs is committed to the device.

This results in the route pointing to an invalid list. In case the list
pointer is out of range (due to uninitialized memory), the device will
complain:

mlxsw_spectrum2 0000:06:00.0: EMAD reg access failed (tid=5733bf490000905c,reg_id=300f(pefa),type=write,status=7(bad parameter))

Fix this by first committing the list of egress RIFs to the device and
only later updating the route's action.

Note that a fix is not needed in the reverse function (i.e.,
mlxsw_sp_mr_route_evif_unresolve()), as there the route's action is
first updated and only later the RIF is removed from the list.

Cc: stable@vger.kernel.org
Fixes: c011ec1bbf ("mlxsw: spectrum: Add the multicast routing offloading logic")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20210506072308.3834303-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-06 17:19:29 -07:00
Linus Torvalds 583f2bcf86 - Remove duplicate error message for the amlogic driver (Tang Bin)
- Fix spellos in comments for the tegra and sun8i (Bhaskar Chowdhury)
 
 - Add the missing fifth node on the rcar_gen3 sensor (Niklas
   Söderlund)
 
 - Remove duplicate include in ti-bandgap (Zhang Yunkai)
 
 - Assign error code in the error path in the function
   thermal_of_populate_bind_params() (Jia-Ju Bai)
 
 - Fix spelling mistake in a comment 'disabed' -> 'disabled' (Colin Ian
   King)
 
 - Use the device name instead of auto-numbering for a better
   identification of the cooling device (Daniel Lezcano)
 
 - Improve a bit the division accuracy in the power allocator governor
   (Jeson Gao)
 
 - Enable the missing third sensor on msm8976 (Konrad Dybcio)
 
 - Add QCom tsens driver co-maintainer (Thara Gopinath)
 
 - Fix memory leak and use after free errors in the core code (Daniel
   Lezcano)
 
 - Add the MDM9607 compatible bindings (Konrad Dybcio)
 
 - Fix trivial spello in the copyright name for Hisilicon (Hao Fang)
 
 - Fix negative index array access when converting the frequency to
   power in the energy model (Brian-sy Yang)
 
 - Add support for Gen2 new PMIC support for Qcom SPMI (David Collins)
 
 - Update maintainer file for CPU cooling device section (Lukasz Luba)
 
 - Fix missing put_device on error in the Qcom tsens driver (Guangqing
   Zhu)
 
 - Add compatible DT binding for sm8350 (Robert Foss)
 
 - Add support for the MDM9607's tsens driver (Konrad Dybcio)
 
 - Remove duplicate error messages in thermal_mmio and the bcm2835
   driver (Ruiqi Gong)
 
 - Add the Thermal Temperature Cooling driver (Zhang Rui)
 
 - Remove duplicate error messages in the Hisilicon sensor driver (Ye
   Bin)
 
 - Use the devm_platform_ioremap_resource_byname() function instead of
   a couple of corresponding calls (dingsenjie)
 
 - Sort the headers alphabetically in the ti-bandgap driver (Zhen Lei)
 
 - Add missing property in the DT thermal sensor binding (Rafał
   Miłecki)
 
 - Remove dead code in the ti-bandgap sensor driver (Lin Ruizhe)
 
 - Convert the BRCM DT bindings to the yaml schema (Rafał Miłecki)
 
 - Replace the thermal_notify_framework() call by a call to the
   thermal_zone_device_update() function. Remove the function as well
   as the corresponding documentation (Thara Gopinath)
 
 - Add support for the ipq8064-tsens sensor along with a set of
   cleanups and code preparation (Ansuel Smith)
 
 - Add a lockless __thermal_cdev_update() function to improve the
   locking scheme in the core code and governors (Lukasz Luba)
 
 - Fix multiple cooling device notification changes (Lukasz Luba)
 
 - Remove unneeded variable initialization (Colin Ian King)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAmCRqDIACgkQqDIjiipP
 6E8O2Qf5AQvSVoN9WYRBLo1+a4mkGsJ/wHQMEsOA4FVHft5/QVkRtpMNbSiyq00O
 YTpNuoBqiYm/tSTyzK/5Oh+0ucgm/ef4c4dTyPjZYw2GB+3rYNRAXdX/tB6Ggjl/
 oUArUCoSQZjOU6Y573B05rcHp1PVM/XL9LgD1uX76tXA1MaGvsyC0cyPRAdOANke
 W83BWI0XMhv8B1bZwHVB2Oft5x6HhqWBl3HKbNOmPEMtwkqqBCFAqB0wNEH88ZTf
 2hyBjBoZQHdMkJsC0piMvIyAjHZiIjQB47VWz31EvKB3/E28xCqRqPViPq9QbrA5
 got0+oDbxI96T024ndXRomc0SSxZnw==
 =5THg
 -----END PGP SIGNATURE-----

Merge tag 'thermal-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux

Pull thermal updates from Daniel Lezcano:

 - Remove duplicate error message for the amlogic driver (Tang Bin)

 - Fix spellos in comments for the tegra and sun8i (Bhaskar Chowdhury)

 - Add the missing fifth node on the rcar_gen3 sensor (Niklas Söderlund)

 - Remove duplicate include in ti-bandgap (Zhang Yunkai)

 - Assign error code in the error path in the function
   thermal_of_populate_bind_params() (Jia-Ju Bai)

 - Fix spelling mistake in a comment 'disabed' -> 'disabled' (Colin Ian
   King)

 - Use the device name instead of auto-numbering for a better
   identification of the cooling device (Daniel Lezcano)

 - Improve a bit the division accuracy in the power allocator governor
   (Jeson Gao)

 - Enable the missing third sensor on msm8976 (Konrad Dybcio)

 - Add QCom tsens driver co-maintainer (Thara Gopinath)

 - Fix memory leak and use after free errors in the core code (Daniel
   Lezcano)

 - Add the MDM9607 compatible bindings (Konrad Dybcio)

 - Fix trivial spello in the copyright name for Hisilicon (Hao Fang)

 - Fix negative index array access when converting the frequency to
   power in the energy model (Brian-sy Yang)

 - Add support for Gen2 new PMIC support for Qcom SPMI (David Collins)

 - Update maintainer file for CPU cooling device section (Lukasz Luba)

 - Fix missing put_device on error in the Qcom tsens driver (Guangqing
   Zhu)

 - Add compatible DT binding for sm8350 (Robert Foss)

 - Add support for the MDM9607's tsens driver (Konrad Dybcio)

 - Remove duplicate error messages in thermal_mmio and the bcm2835
   driver (Ruiqi Gong)

 - Add the Thermal Temperature Cooling driver (Zhang Rui)

 - Remove duplicate error messages in the Hisilicon sensor driver (Ye
   Bin)

 - Use the devm_platform_ioremap_resource_byname() function instead of a
   couple of corresponding calls (dingsenjie)

 - Sort the headers alphabetically in the ti-bandgap driver (Zhen Lei)

 - Add missing property in the DT thermal sensor binding (Rafał Miłecki)

 - Remove dead code in the ti-bandgap sensor driver (Lin Ruizhe)

 - Convert the BRCM DT bindings to the yaml schema (Rafał Miłecki)

 - Replace the thermal_notify_framework() call by a call to the
   thermal_zone_device_update() function. Remove the function as well as
   the corresponding documentation (Thara Gopinath)

 - Add support for the ipq8064-tsens sensor along with a set of cleanups
   and code preparation (Ansuel Smith)

 - Add a lockless __thermal_cdev_update() function to improve the
   locking scheme in the core code and governors (Lukasz Luba)

 - Fix multiple cooling device notification changes (Lukasz Luba)

 - Remove unneeded variable initialization (Colin Ian King)

* tag 'thermal-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (55 commits)
  thermal/drivers/mtk_thermal: Remove redundant initializations of several variables
  thermal/core/power allocator: Use the lockless __thermal_cdev_update() function
  thermal/core/fair share: Use the lockless __thermal_cdev_update() function
  thermal/core/fair share: Lock the thermal zone while looping over instances
  thermal/core/power_allocator: Update once cooling devices when temp is low
  thermal/core/power_allocator: Maintain the device statistics from going stale
  thermal/core: Create a helper __thermal_cdev_update() without a lock
  dt-bindings: thermal: tsens: Document ipq8064 bindings
  thermal/drivers/tsens: Add support for ipq8064-tsens
  thermal/drivers/tsens: Drop unused define for msm8960
  thermal/drivers/tsens: Replace custom 8960 apis with generic apis
  thermal/drivers/tsens: Fix bug in sensor enable for msm8960
  thermal/drivers/tsens: Use init_common for msm8960
  thermal/drivers/tsens: Add VER_0 tsens version
  thermal/drivers/tsens: Convert msm8960 to reg_field
  thermal/drivers/tsens: Don't hardcode sensor slope
  Documentation: driver-api: thermal: Remove thermal_notify_framework from documentation
  thermal/core: Remove thermal_notify_framework
  iwlwifi: mvm: tt: Replace thermal_notify_framework
  dt-bindings: thermal: brcm,ns-thermal: Convert to the json-schema
  ...
2021-05-05 12:46:48 -07:00
Linus Torvalds f34b2cf178 RDMA merge window pull request
This is mostly bug fixes and general cleanups. The noteworthy new
 features are fairly small:
 
 - XRC support for HNS and improved RQ operations
 
 - Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw and
   qib
 
 - Quite a few general cleanups on spelling, error handling, static checker
   detections, etc
 
 - Increase the number of device ports supported beyond 255. High port
   count software switches now exist
 
 - Several bug fixes for rtrs
 
 - mlx5 Device Memory support for host controlled atomics
 
 - Report SRQ tables through to rdma-tool
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmCMMHEACgkQOG33FX4g
 mxri3Q//RAgIExCGHebQ9xkptZHVyTLLJMpiMl2cqk3ZVRdDZ7QdiQjIqY2KqlUK
 nxBj7EXJeX6rV5a1xqCcOO1gBetB28TSwnCNE2ZqrXP5B59ISW8D052IWza3UkUz
 WmHLARxHQlyKBWA4+ZAgfoUGL0NmWA8QPf56t/RK/3/OsuYnGzcnWmmFbt8XKFcH
 NtO3KC45mKWDqqG0A0XRrLbEQz/ElO3OuPBqlBKgB3ZgGPzgsOUTOGkm1tCcZ89L
 /pvZGB7SklKZdCX8TxdpVGd9h0zHl8pqh1yEzvTA1ypNAYSUId2mvZXluU8J5yJl
 FLk7E1IxE5050FNEc7T5uZdUVntulYiqL2558coRI34l5w26pKGjIMxw/nTB8hg8
 4ZfBtKVemIG6yzW5Up6iBpK7qWYpvLWVShwYAWhbNsjN7JGzJuh1gJnjbmYgyz2P
 RTMU9wjFPLL2wZxg4LDHACVJNBb82j6KKuE+kZWpk11ro7INw9+7YwRuTo7/ezxC
 BwXKu8wF4igwSigV55jM+WnGXLhxdC3qmx/2cbtWyLM/PzdRL96tM0RWW5v8/Nv7
 teFhkt+f3RVqcfYH5K1qCXy3UFrxG6bxFSvcHHSBx2bdIrqhuTY5FqszAYImeW2j
 iHoyIsuSuGu79HQgOzAQZsEyksWi6OYDvA9Q9VBoPP4bJ3DOAa4=
 =vsXA
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This is significantly bug fixes and general cleanups. The noteworthy
  new features are fairly small:

   - XRC support for HNS and improved RQ operations

   - Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw
     and qib

   - Quite a few general cleanups on spelling, error handling, static
     checker detections, etc

   - Increase the number of device ports supported beyond 255. High port
     count software switches now exist

   - Several bug fixes for rtrs

   - mlx5 Device Memory support for host controlled atomics

   - Report SRQ tables through to rdma-tool"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (145 commits)
  IB/qib: Remove redundant assignment to ret
  RDMA/nldev: Add copy-on-fork attribute to get sys command
  RDMA/bnxt_re: Fix a double free in bnxt_qplib_alloc_res
  RDMA/siw: Fix a use after free in siw_alloc_mr
  IB/hfi1: Remove redundant variable rcd
  RDMA/nldev: Add QP numbers to SRQ information
  RDMA/nldev: Return SRQ information
  RDMA/restrack: Add support to get resource tracking for SRQ
  RDMA/nldev: Return context information
  RDMA/core: Add CM to restrack after successful attachment to a device
  RDMA/cma: Skip device which doesn't support CM
  RDMA/rxe: Fix a bug in rxe_fill_ip_info()
  RDMA/mlx5: Expose private query port
  RDMA/mlx4: Remove an unused variable
  RDMA/mlx5: Fix type assignment for ICM DM
  IB/mlx5: Set right RoCE l3 type and roce version while deleting GID
  RDMA/i40iw: Fix error unwinding when i40iw_hmc_sd_one fails
  RDMA/cxgb4: add missing qpid increment
  IB/ipoib: Remove unnecessary struct declaration
  RDMA/bnxt_re: Get rid of custom module reference counting
  ...
2021-05-01 09:15:05 -07:00
Parav Pandit f1b9acd3a5 net/mlx5: SF, Extend SF table for additional SF id range
Extend the SF table to cover the additional SF id range of an external
controller.

A user optionally provides the external controller number when the user
wants to create an SF on the external controller.

An example on eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:59:07 -07:00
Parav Pandit a3088f87d9 net/mlx5: SF, Split mlx5_sf_hw_table into two parts
The device has SF ids in two different contiguous ranges: one for the
local controller and a second for the external controller's PF.

Each such range has its own maximum number of functions and base id.
To allocate an SF from either of the ranges, prepare the code by splitting
the range-specific fields into their own structure.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:59:04 -07:00
Parav Pandit 01ed9550e8 net/mlx5: SF, Use helpers for allocation and free
Use helper routines for SF id and SF table allocation and free, so that
a subsequent patch can reuse them for multiple SF function id ranges.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:59:01 -07:00
Parav Pandit 326c08a020 net/mlx5: SF, Consider own vhca events of SF devices
Vhca events on the eswitch manager are received for all the functions on
the NIC, including SFs of external host PF controllers.

The SF device handler, however, is only interested in SF device events
related to its own PF. Hence, validate whether the function belongs to
self or not.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:59 -07:00
Parav Pandit 7e6ccbc187 net/mlx5: SF, Store and use start function id
SF ids in the device are in two different contiguous ranges: one for the
local controller and a second for the external host controller.

Prepare the code to handle multiple start function ids by storing the
start id in the table.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:56 -07:00
Parav Pandit a1ab3e4554 devlink: Extend SF port attributes to have external attribute
Extend SF port attributes to have an optional external flag, similar to
PCI PF and VF port attributes.

The external attribute is required to generate a unique phys_port_name
when the PF number and SF number overlap between two controllers, similar
to SR-IOV VFs.

Below is an example view and config sequence of an external SF port,
i.e. an SF for an external controller.

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:53 -07:00
Parav Pandit 1d7979352f net/mlx5: SF, Rely on hw table for SF devlink port allocation
Whether SF allocation is supported is currently checked in two places:
(a) the SF devlink port allocator and
(b) the SF HW table handler.

Both layers use the HCA CAP to identify support, using the helper routines
mlx5_sf_supported() and mlx5_sf_max_functions().

Instead, rely on the HW table handler alone to check whether SF is
supported.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:51 -07:00
Parav Pandit 87bd418ea7 net/mlx5: E-Switch, Consider SF ports of host PF
Query the SF vport count and base id of the host PF from the firmware.

Account these ports in the total port calculation whenever the count is
non-zero.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:48 -07:00
Parav Pandit 47dd7e609f net/mlx5: E-Switch, Use xarray for vport number to vport and rep mapping
Currently vport numbers are mapped to vports and their representors using
an array and an index.

Vport numbers of different function types are not contiguous. Adding
another such discontiguous range with index-and-number mapping is
increasingly complex and hard to maintain.

Hence, maintain an xarray of vports and reps whose lookup is done based on
the vport number. Each VF and SF entry is tagged with an xarray mark to
identify the function type. Additionally, PF and VF need special handling
for legacy inline mode; they are also marked as host functions using an
extra HOST_FN mark.
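
A hedged sketch of the lookup scheme: the vport number is the xarray index
and function types are tracked with xarray marks (mark assignments and the
rep structure are illustrative, not the actual mlx5 definitions):

  #include <linux/types.h>
  #include <linux/xarray.h>

  #define EXAMPLE_MARK_VF      XA_MARK_0
  #define EXAMPLE_MARK_SF      XA_MARK_1
  #define EXAMPLE_MARK_HOST_FN XA_MARK_2

  struct example_rep {
      u16 vport_num;
      /* ... representor state ... */
  };

  static DEFINE_XARRAY(example_reps);

  static int example_add_rep(struct example_rep *rep, bool is_sf)
  {
      /* The vport number is the key; ranges may be sparse/discontiguous. */
      int err = xa_insert(&example_reps, rep->vport_num, rep, GFP_KERNEL);

      if (err)
          return err;
      xa_set_mark(&example_reps, rep->vport_num,
                  is_sf ? EXAMPLE_MARK_SF : EXAMPLE_MARK_VF);
      return 0;
  }

  static struct example_rep *example_lookup_rep(u16 vport_num)
  {
      return xa_load(&example_reps, vport_num);
  }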

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:45 -07:00
Parav Pandit 9f8c7100c8 net/mlx5: E-Switch, Prepare to return total vports from eswitch struct
The total vport count is already stored during eswitch initialization.
Instead of calculating it every time, read it directly from the eswitch.

Additionally, the host PF's SF vport information is available using the
QUERY_HCA_CAP command; it is not available through the HCA_CAP of the
eswitch manager PF. Hence, this patch prepares returning the total eswitch
vport count from the existing eswitch struct.

This further helps to keep the eswitch port counting macros and logic
within the eswitch.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:43 -07:00
Parav Pandit 06ec5acc77 net/mlx5: E-Switch, Return eswitch max ports when eswitch is supported
mlx5_eswitch_get_total_vports() doesn't honor the MLX5_ESWITCH Kconfig
flag.

When MLX5_ESWITCH is disabled, the FS layer continues to initialize
eswitch-specific ACL namespaces. Instead, start honoring the MLX5_ESWITCH
flag and perform vport-specific initialization only when the vport count
is non-zero.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:40 -07:00
Hans Westgaard Ry 79ebfb11fe net/mlx4: Treat VFs fair when handling comm_channel_events
Handling comm_channel_event in mlx4_master_comm_channel uses a double
loop to determine which slaves have requested work. The search always
starts at the lowest slave. This leads to unfairness; lower VFs tend to
be prioritized over higher VFs.

The patch uses find_next_bit to determine which slaves to handle. Fairness
is implemented by always starting the search just after the slave where
the previous search started.
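
A hedged sketch of the fairness scheme (illustrative names, not the actual
mlx4 code):

  #include <linux/bitops.h>

  static int example_next_pending_slave(const unsigned long *pending,
                                        int num_slaves, int last_start)
  {
      int slave;

      /* Start scanning just after where the previous scan started ... */
      slave = find_next_bit(pending, num_slaves, last_start + 1);
      if (slave >= num_slaves)
          /* ... and wrap around so low slave numbers get no advantage. */
          slave = find_next_bit(pending, num_slaves, 0);

      return slave < num_slaves ? slave : -1; /* -1: no slave pending */
  }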

An MPI program has been used to measure improvements. It runs 500
ibv_reg_mr, synchronizes with all other instances and then runs 500
ibv_dereg_mr.

The results below were obtained with 500 processes; the time reported is
for running 500 calls:

ibv_reg_mr:
             Mod.   Org.
mlx4_1    403.356ms 424.674ms
mlx4_2    403.355ms 424.674ms
mlx4_3    403.354ms 424.674ms
mlx4_4    403.355ms 424.674ms
mlx4_5    403.357ms 424.677ms
mlx4_6    403.354ms 424.676ms
mlx4_7    403.357ms 424.675ms
mlx4_8    403.355ms 424.675ms

ibv_dereg_mr:
             Mod.   Org.
mlx4_1    116.408ms 142.818ms
mlx4_2    116.434ms 142.793ms
mlx4_3    116.488ms 143.247ms
mlx4_4    116.679ms 143.230ms
mlx4_5    112.017ms 107.204ms
mlx4_6    112.032ms 107.516ms
mlx4_7    112.083ms 184.195ms
mlx4_8    115.089ms 190.618ms

Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-22 14:59:26 -07:00
Petr Machata 7de85b0431 mlxsw: spectrum_qdisc: Index future FIFOs by band number
mlxsw used to hold an array of qdiscs indexed by the TC number. In the
previous patch, it was changed to allocate child qdiscs dynamically, and
they are now indexed by band number. Follow suit with the array of future
FIFOs.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata 5cbd960253 mlxsw: spectrum_qdisc: Allocate child qdiscs dynamically
Instead of keeping qdiscs in globally-preallocated arrays, introduce a
per-qdisc-kind value num_classes, and then allocate the necessary child
qdiscs (if any) based on that value. Since dynamic allocation is now
involved, mlxsw_sp_qdisc_replace() gets messy enough that it is worth
splitting it into two cases: a new qdisc allocation and a change of an
existing qdisc. (Note that the change also includes what TC formally calls
replace, if the qdisc kind is the same.)

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata cff99e2045 mlxsw: spectrum_qdisc: Guard all qdisc accesses with a lock
The FIFO handler currently guards accesses to the future FIFO tracking by
asserting RTNL. In the future, the changes to the qdisc state will be more
thorough, so other qdiscs will need this guarding as well. In order not to
further the RTNL infestation, convert to a custom lock that will guard
accesses to the qdisc state.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata 51d52ed955 mlxsw: spectrum_qdisc: Track children per qdisc
mlxsw currently allows a two-level structure of qdiscs: the root and
possibly a number of children. In order to support offloading more general
qdisc trees, introduce to struct mlxsw_sp_qdisc a pointer to child qdiscs.
Refer to the child qdiscs through this pointer, instead of going through
the tclass_qdiscs in qdisc_state. Additionally introduce a field
num_classes, which holds the number of a given qdisc's children.

Also introduce a generic function for walking qdisc trees. Rewrite
mlxsw_sp_qdisc_find() and _find_by_handle() to use the generic walker.
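
A hedged sketch of such a generic pre-order walk (the structure layout is
illustrative, not the actual mlxsw_sp_qdisc definition):

  #include <linux/types.h>

  struct example_qdisc {
      u32 handle;
      struct example_qdisc *children; /* num_classes entries */
      unsigned int num_classes;
  };

  typedef struct example_qdisc *(*example_walk_cb)(struct example_qdisc *qdisc,
                                                   void *data);

  static struct example_qdisc *example_walk(struct example_qdisc *qdisc,
                                            example_walk_cb cb, void *data)
  {
      struct example_qdisc *found;
      unsigned int i;

      found = cb(qdisc, data); /* visit the node itself first */
      if (found)
          return found;

      for (i = 0; i < qdisc->num_classes; i++) {
          found = example_walk(&qdisc->children[i], cb, data);
          if (found)
              return found;
      }
      return NULL;
  }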

For now, keep qdisc_state.tclass_qdisc, and just point the root qdisc's
children to this array. Following patches will make the allocation
dynamic.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata b21832b568 mlxsw: spectrum_qdisc: Promote backlog reduction to mlxsw_sp_qdisc_destroy()
When a qdisc is removed, it is necessary to update the backlog value at
its parent, unless the qdisc is at the root position. RED, TBF and FIFO
all do that, each separately. Since all of them need to do this, just
promote the
operation directly to mlxsw_sp_qdisc_destroy(), instead of deferring it to
individual destructors. Since FIFO dtor thus becomes trivial, remove it.

Add struct mlxsw_sp_qdisc.parent to point at the parent qdisc. This will be
handy later as deeper structures are offloaded. Use the parent qdisc to
find the chain of parents whose backlog value needs to be updated.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata 017a131cde mlxsw: spectrum_qdisc: Track tclass_num as int, not u8
tclass_num is just a number, a value that would be ordinarily passed around
as an int. (Which is unlike a u8 prio_bitmap.) In several places,
tclass_num already is an int. Convert the remaining instances.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata 549f2aae84 mlxsw: spectrum_qdisc: Drop an always-true condition
The function mlxsw_sp_qdisc_compare() is invoked a couple of lines above
this check and will bounce any request where this condition does not hold.
Therefore drop the check.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata 290fe2c595 mlxsw: spectrum_qdisc: Simplify mlxsw_sp_qdisc_compare()
The purpose of this function is to filter out events that are related to
qdiscs that are not offloaded, or are not offloaded anymore. But the
function is unnecessarily thorough:

- mlxsw_sp_qdisc pointer is never NULL in the context where it is called
- Two qdiscs with the same handle will never have different types. Even
  when replacing one qdisc with another in the same class, Linux will not
  permit handle reuse unless the qdisc type also matches.

Simplify the function by omitting these two unnecessary conditions.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Petr Machata 17c0e6d175 mlxsw: spectrum_qdisc: Drop one argument from check_params callback
The mlxsw_sp_qdisc argument is not used in any of the actual callbacks.
Drop it.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:43:13 -07:00
Parav Pandit dedbc2d358 IB/mlx5: Set right RoCE l3 type and roce version while deleting GID
Currently when GID is deleted, it zero out all the fields of the RoCE
address in the SET_ROCE_ADDRESS command for a specified index.

roce_version = 0 means RoCEv1 in the SET_ROCE_ADDRESS command.

This assumes that the device always has RoCEv1 enabled, which is not
always correct. For example, a Subfunction does not support RoCEv1.

Due to this assumption a previously added RoCEv2 GID is always deleted as
a RoCEv1 GID. This results in the below syndrome:

   mlx5_core.sf mlx5_core.sf.4: mlx5_cmd_check:777:(pid 4256): SET_ROCE_ADDRESS(0x761) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x12822d)

Hence, during GID deletion, set the right RoCE version as provided by the
core.

Link: https://lore.kernel.org/r/d3f54129c90ca329caf438dbe31875d8ad08d91a.1618753425.git.leonro@nvidia.com
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-20 09:41:10 -03:00
Yevgeny Kliteynik aeacb52a8d net/mlx5: DR, Add support for isolate_vl_tc QP
When using SW steering, the rule insertion rate depends on the performance
of the RDMA RC QP used for writing to the ICM. During stress this QP
competes for the HW resources with all the other QPs that are used to send
data.
To protect the SW steering QP's performance in such cases, we set this QP
to use an isolated VL. The VL number is reserved by FW and is not exposed
to the driver.
Support for running this QP on an isolated VL exists only when both the
force-loopback and isolate_vl_tc capabilities are set.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:46 -07:00
Yevgeny Kliteynik 7304d603a5 net/mlx5: DR, Add support for force-loopback QP
When supported by the device, the SW steering RoCE RC QP that is used to
write/read to/from the ICM will be created with the force-loopback
attribute. Such a QP doesn't require a GID index upon creation.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:43 -07:00
Yevgeny Kliteynik df9dd15ae1 net/mlx5: DR, Add support for matching tunnel GTP-U
Enable matching on tunnel GTP-U and GTP-U first extension
header using dynamic flex parser.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:40 -07:00
Yevgeny Kliteynik 35ba005d82 net/mlx5: DR, Set flex parser for TNL_MPLS dynamically
Query the flex_parser id that's intended for TNL_MPLS
and use an appropriate flex parser for MPLS over UDP/GRE.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:37 -07:00
Yevgeny Kliteynik 3442e0335e net/mlx5: DR, Add support for matching on geneve TLV option
Enable matching on tunnel geneve TLV option using the flex parser.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:34 -07:00
Yevgeny Kliteynik 4923938d2f net/mlx5: DR, Set STEv0 ICMP flex parser dynamically
Set the flex parser ID dynamically for ICMP instead of relying on
hardcoded values.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:31 -07:00
Yevgeny Kliteynik 160e9cb37a net/mlx5: DR, Add support for dynamic flex parser
Flex parser is a HW parser that can support protocols that are not
natively supported by the HCA, such as Geneve (TLV options) and GTP-U.
There are 8 such parsers, and each of them can be assigned to parse a
specific set of protocols.
This patch adds misc4 match params, which allow using the correct flex
parser that was programmed for the required protocol.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:28 -07:00
Muhammad Sammar 323b91acc1 net/mlx5: DR, Remove protocol-specific flex_parser_3 definitions
Remove MPLS-specific fields from the flex parser 3 layout.
A flex parser can be used for multiple protocols and should not be
hardcoded to a specific type.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:24 -07:00
Yevgeny Kliteynik 25cb317680 net/mlx5: E-Switch, Improve error messages in term table creation
Add the error code to the error messages and remove a duplicated message:
if termination table creation fails, we already get an error message in
mlx5_eswitch_termtbl_create, so there is no need for the additional error
print in the calling function.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:18 -07:00
Yevgeny Kliteynik ff1925bb0d net/mlx5: DR, Fix SQ/RQ in doorbell bitmask
The QP doorbell size is 16 bits.
Fix SW steering's QP doorbell bitmask, which had 20 bits.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:15 -07:00
Yevgeny Kliteynik 7d22ad732d net/mlx5: DR, Rename an argument in dr_rdma_segments
Rename the argument to better reflect that the meaning is not the number
of records, but whether or not we should ring the doorbell.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:12 -07:00
Tariq Toukan 6980ffa0c5 net/mlx5e: RX, Add checks for calculated Striding RQ attributes
The Striding RQ attributes below are mutually dependent. An unaware
change to one might take the others out of the valid range derived from
the HW caps:
- The MPWQE size in bytes
- The number of strides in a MPWQE
- The stride size

Add checks to verify they are valid and comply with the HW spec and SW
assumptions/requirements.
This is not a fix; no particular issue exists today.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:09 -07:00
Vladyslav Tarasiuk 6a5689ba02 net/mlx5e: Fix possible non-initialized struct usage
If mlx5e_devlink_port_register() fails, the driver may try to register
devlink health TX and RX reporters on a non-registered devlink port.

Instead, create the health reporters only if mlx5e_devlink_port_register()
does not fail, and destroy the reporters only if the devlink_port is
registered.

Also, change the mlx5e_get_devlink_port() behavior to return NULL in case
the port is not registered, replicating devlink's wrapper when the ndo is
not implemented.

Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:06 -07:00
Tariq Toukan d408c01cae net/mlx5e: Fix lost changes during code movements
The changes done in commit [1] were missed by the code movements done
in [2], as the two were developed roughly in parallel.
Here we re-apply them.

[1] commit e4484d9df5 ("net/mlx5e: Enable striding RQ for Connect-X IPsec capable devices")
[2] commit b3a131c2a1 ("net/mlx5e: Move params logic into its dedicated file")

Fixes: b3a131c2a1 ("net/mlx5e: Move params logic into its dedicated file")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-19 20:17:03 -07:00
Jakub Kicinski 8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
Jakub Kicinski b572ec9ff0 mlx5: implement ethtool standard stats
Add support for PHY/MAC/Ctrl/RMON stats.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:59:47 -07:00
Jakub Kicinski c1912ab0ee mlxsw: implement ethtool standard stats
mlxsw has nicely grouped stats, add support for standard uAPI.
I'm guessing the register access part. Compile tested only.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:59:20 -07:00
David S. Miller 03e481e88b mlx5-updates-2021-04-16
This patchset introduces updates to mlx5e netdev driver.
 
 1) Tariq refactors TLS offloads and adds resiliency against RX resync
    failures
 
 2) Maxim reduces code duplications by unifying channels reset flow
    regardless if channels are closed or open
 
 3) Aya Enhances TX/RX health reporters diagnostics to expose the
    internal clock time-stamping format
 
 4) Moshe adds support for ethtool extended link state, to show the reason
    for link down
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmB53AUACgkQSD+KveBX
 +j6rzAf+JwJG9G7GSj3a/xird4dlgt4xPbRLB19pTw19ZyHZyujDxdN4QM3r5hTk
 5ua1PnhYYaUcyPFvdgR9J0cIJ3QRaxZ+q/XnkE9Yo0eZ1DJ0SL/n6rxEQpcxpee1
 XP7qjJu3leVwh5mVW2uOx/ClrL9vYb/fG3Q00j59rUB+i9bZszXZgZ99hJvYBFTB
 k7W/9X6BNxuLlEg/Ui9L499aDWHRcIY5J2ku+1v/8paJZltk+IFv5glYszylE++M
 l68drIy3dIjl/Sxj6WR2rHTBus6AIFxWFH8C2L7uqGl97BPjS80snMPIefLJhW+y
 bQvzMDtfKDmIpvEIdzHPuEhEdqqteg==
 =YCy6
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-04-16

This patchset introduces updates to mlx5e netdev driver.

1) Tariq refactors TLS offloads and adds resiliency against RX resync
   failures

2) Maxim reduces code duplications by unifying channels reset flow
   regardless if channels are closed or open

3) Aya Enhances TX/RX health reporters diagnostics to expose the
   internal clock time-stamping format

4) Moshe adds support for ethtool extended link state, to show the reason
   for link down
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:53:52 -07:00
Vladimir Oltean 2c4eca3ef7 net: bridge: switchdev: include local flag in FDB notifications
As explained in bugfix commit 6ab4c3117a ("net: bridge: don't notify
switchdev for local FDB addresses") as well as in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/

the switchdev notifiers for FDB entries managed to have a zero-day bug,
which was that drivers would not know what to do with local FDB entries,
because they were not told that they are local. The bug fix was to
simply not notify them of those addresses.

Let us now add the 'is_local' bit to bridge FDB entries, and make all
drivers ignore these entries by their own choice.
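
A hedged sketch of the driver-side check this change enables (the handler
name is illustrative):

  #include <net/switchdev.h>

  static void example_fdb_offload_add(const struct switchdev_notifier_fdb_info *fdb_info)
  {
      /* Local (host-terminated) entries are now explicitly flagged and can
       * simply be skipped by drivers that do not offload them. */
      if (fdb_info->is_local)
          return;

      /* ... program fdb_info->addr / fdb_info->vid into hardware ... */
  }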

Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:15:45 -07:00
Aya Levin 95742c1cc5 net/mlx5: Enhance diagnostics info for TX/RX reporters
Add ts_format to the 'Common Config' section of the TX/RX devlink
reporters diagnostics info. Possible values for ts_format are 'RT' and
'FRC', which stand for Real Time and Free Running Counter, respectively.

Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:34 -07:00
Aya Levin 302522e67c net/mlx5: Add helper to initialize 1PPS
Wrap 1PPS initialization in a helper for a cleaner init flow.

Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:31 -07:00
Moshe Tal b3446acb2b net/mlx5e: Add ethtool extended link state
In case the interface was set up but cannot establish the link, ethtool
will print more information to help the user troubleshoot the state.

For example, no link due to missing cable:
$ ethtool eth1
...
Link detected: no (No cable)

Besides the general extended state, drivers can pass additional
information about the link state using the sub-state field. For example:

$ ethtool eth1
...
Link detected: no (Autoneg, No partner detected)

The extended state is available only for specific cases; in other cases
ethtool will print only "Link detected: no" as before.
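
A hedged sketch of the ethtool hook involved (example_module_plugged() is
an illustrative placeholder for a device-specific query, not a real API):

  #include <linux/ethtool.h>
  #include <linux/netdevice.h>

  bool example_module_plugged(struct net_device *dev); /* placeholder */

  static int example_get_link_ext_state(struct net_device *dev,
                                        struct ethtool_link_ext_state_info *info)
  {
      if (!example_module_plugged(dev)) {
          info->link_ext_state = ETHTOOL_LINK_EXT_STATE_NO_CABLE;
          return 0;
      }

      /* Example of a sub-state: autoneg completed, no partner detected. */
      info->link_ext_state = ETHTOOL_LINK_EXT_STATE_AUTONEG;
      info->autoneg = ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED;
      return 0;
  }

  static const struct ethtool_ops example_ethtool_ops = {
      .get_link_ext_state = example_get_link_ext_state,
  };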

Signed-off-by: Moshe Tal <moshet@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:28 -07:00
Maor Dickman 5cec6de0ae net/mlx5: Allocate FC bulk structs with kvzalloc() instead of kzalloc()
The bulk size is larger than 16K so use kvzalloc().
The bulk bitmask upper size limit is 16K so use kvcalloc().
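
A hedged illustration of the pattern (the structure is a stand-in, not the
actual mlx5 flow-counter bulk):

  #include <linux/bitops.h>
  #include <linux/mm.h>
  #include <linux/overflow.h>
  #include <linux/slab.h>
  #include <linux/types.h>

  struct example_bulk {
      unsigned long *bitmask;
      u32 data[];
  };

  static struct example_bulk *example_alloc_bulk(unsigned int nr)
  {
      struct example_bulk *b;

      /* Large (>16K) allocation: kvzalloc() may fall back to vmalloc. */
      b = kvzalloc(struct_size(b, data, nr), GFP_KERNEL);
      if (!b)
          return NULL;

      /* Bitmask bounded by 16K entries: kvcalloc() for the same reason. */
      b->bitmask = kvcalloc(BITS_TO_LONGS(nr), sizeof(unsigned long),
                            GFP_KERNEL);
      if (!b->bitmask) {
          kvfree(b);
          return NULL;
      }
      return b;
  }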

Signed-off-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:22 -07:00
Maxim Mikityanskiy 94872d4ef9 net/mlx5e: Cleanup safe switch channels API by passing params
mlx5e_safe_switch_channels accepts new_chs as a parameter and opens new
channels in place, then copies them to priv->channels. It requires all
the callers to allocate space for this temporary storage of the new
channels.

This commit cleans up the API by replacing new_chs with new_params, a
meaningful subset of new_chs to be filled by the caller. The temporary
space for the new channels is allocated inside mlx5e_safe_switch_params
(a new name for mlx5e_safe_switch_channels). An extra copy of params is
made, but since it's control flow, it's not critical.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:20 -07:00
Maxim Mikityanskiy b3b886cf96 net/mlx5e: Refactor on-the-fly configuration changes
This commit extends mlx5e_safe_switch_channels() to support on-the-fly
configuration changes, when the channels are open, but don't need to be
recreated. Such flows exist when a parameter being changed doesn't
affect how the queues are created, or when the queues can be modified
while remaining active.

Before this commit, such flows were handled as special cases on the
caller site. This commit adds this functionality to
mlx5e_safe_switch_channels(), allowing the caller to pass a boolean
indicating whether it's required to recreate the channels or it's
allowed to skip it. The logic of switching channel parameters is now
completely encapsulated into mlx5e_safe_switch_channels().

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:17 -07:00
Maxim Mikityanskiy 69cc4185dc net/mlx5e: Use mlx5e_safe_switch_channels when channels are closed
This commit uses new functionality of mlx5e_safe_switch_channels
introduced by the previous commit to reduce the amount of repeating
similar code all over the driver.

It's very common in mlx5e to call mlx5e_safe_switch_channels when the
channels are open, but assign parameters and run hardware commands
manually when the channels are closed.

After the previous commit it's no longer needed to do such manual things
every time, so this commit removes unneeded code and relies on the new
functionality of mlx5e_safe_switch_channels. Some of the places are
refactored and simplified, where more complex flows are used to change
configuration on the fly, without recreating the channels (the logic is
rewritten in a more robust way, with a reset required by default and a
list of exceptions).

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:14 -07:00
Maxim Mikityanskiy 6cad120d9e net/mlx5e: Allow mlx5e_safe_switch_channels to work with channels closed
mlx5e_safe_switch_channels is used to modify channel parameters and/or
hardware configuration in a safe way, so that if anything goes wrong,
everything reverts to the old configuration and remains in a consistent
state.

However, this function only works when the channels are open. When the
caller needs to modify some parameters, first it has to check that the
channels are open, otherwise it has to assign parameters directly, and
such boilerplate repeats in many different places.

This commit prepares for the refactoring of such places by allowing
mlx5e_safe_switch_channels to work when the channels are closed. In this
case it will assign the new parameters and run the preactivate hook.
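A rough sketch of the closed-channels branch described above, assuming the
mlx5e MLX5E_STATE_OPENED flag and a preactivate(priv, context) hook; this is
illustrative, not the actual diff:

        int err = 0;

        if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
                /* channels are closed: just adopt the new parameters */
                priv->channels.params = *params;
                if (preactivate)
                        err = preactivate(priv, context);
                return err;
        }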

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:11 -07:00
Tariq Toukan e9ce991bce net/mlx5e: kTLS, Add resiliency to RX resync failures
When the TLS logic finds a tcp seq match for a kTLS RX resync
request, it calls the driver callback function mlx5e_ktls_resync()
to handle it and communicate it to the device.

Errors might occur during mlx5e_ktls_resync(), however, they are not
reported to the stack. Moreover, there is no error handling in the
stack for these errors.

In this patch, the driver obtains responsibility on errors handling,
adding queue and retry mechanisms to these resyncs.

We maintain a linked list of resync matches, and try posting them
to the async ICOSQ in the NAPI context.

The only possible failure that demands driver handling is the ICOSQ being
full. By relying on the NAPI mechanism, we make sure that the entries in
the list will be handled when ICOSQ completions arrive and free up some
room.
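A rough sketch of the queue-and-retry idea (helper names are hypothetical;
only the list-plus-NAPI pattern comes from the description above):

struct ktls_resync_entry {
        struct list_head list;
        u32 tcp_seq;                            /* illustrative payload */
};

/* called from NAPI: drain pending resyncs while the async ICOSQ has room */
static void ktls_rx_resync_flush(struct mlx5e_icosq *sq,
                                 struct list_head *pending, spinlock_t *lock)
{
        struct ktls_resync_entry *entry, *tmp;

        spin_lock(lock);
        list_for_each_entry_safe(entry, tmp, pending, list) {
                if (!icosq_has_room(sq))        /* the only failure to handle */
                        break;                  /* retry on the next completion */
                post_resync_wqe(sq, entry);     /* hypothetical helpers */
                list_del(&entry->list);
                kfree(entry);
        }
        spin_unlock(lock);
}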

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:08 -07:00
Tariq Toukan 72f6f2f8d6 net/mlx5e: TX, Inline function mlx5e_tls_handle_tx_wqe()
When TLS is supported, WQE ctrl segment of every transmitted packet
is updated with the (possibly empty, for non-TLS packets) TISN field.

Take this one-liner function into the header file and inline it,
to save the overhead of a function call per packet.

While here, remove unused function parameter.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:05 -07:00
Tariq Toukan b6b3ad2175 net/mlx5e: TX, Inline TLS skb check
When TLS is supported and enabled, every transmitted packet is tested
to identify if TLS offload is required.

Take the early-return condition into an inline function, to save
the overhead of a function call for non-TLS packets.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:48:02 -07:00
Tariq Toukan 8668587a33 net/mlx5e: Cleanup unused function parameter
Socket parameter is not used in accel_rule_init(), remove it.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:47:59 -07:00
Tariq Toukan 2f014f4016 net/mlx5e: Remove non-essential TLS SQ state bit
There is no real need to maintain an SQ state bit to indicate TLS
support; a simple and fast test [1] on the SKB is almost equally good.

[1] !skb->sk || !tls_is_sk_tx_device_offloaded(skb->sk)
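For illustration, the test can be wrapped in a header as something like the
following (the function name is hypothetical; tls_is_sk_tx_device_offloaded()
is the in-tree helper from <net/tls.h>):

static inline bool mlx5e_skb_is_tls_tx(struct sk_buff *skb)
{
        return skb->sk && tls_is_sk_tx_device_offloaded(skb->sk);
}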

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-16 11:47:56 -07:00
Jakub Kicinski 1703bb50df mlx5: implement ethtool::get_fec_stats
Report corrected bits.

v2: catch reg access errors (Saeed)
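A minimal sketch of what such a callback may look like; only the
ethtool_ops::get_fec_stats signature and the corrected_bits counter follow
the ethtool API, the query helper is hypothetical:

static void mlx5e_get_fec_stats(struct net_device *netdev,
                                struct ethtool_fec_stats *fec_stats)
{
        struct mlx5e_priv *priv = netdev_priv(netdev);
        u64 corrected_bits;

        /* hypothetical register query; on access error, report nothing */
        if (mlx5e_query_fec_corrected_bits(priv, &corrected_bits))
                return;

        fec_stats->corrected_bits.total = corrected_bits;
}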

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-15 17:08:29 -07:00
wenxu e3e0f9b279 net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
In the nft_offload case there is a meta flow_dissector with no
ingress_ifindex but with an ingress_iftype that is only used in
software. So if the mask of ingress_ifindex in meta is 0, this meta
check should be bypassed.
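A sketch of the resulting bypass, using the flow offload match helpers
(illustrative, not the exact driver diff):

        struct flow_match_meta match;

        if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META))
                return 0;

        flow_rule_match_meta(rule, &match);
        if (!match.mask->ingress_ifindex)
                return 0;       /* mask is zero: nothing to match on, skip */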

Fixes: 6d65bc64e2 ("net/mlx5e: Add mlx5e_flower_parse_meta support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 16:13:00 -07:00
Aya Levin 7a320c9db3 net/mlx5e: Fix setting of RS FEC mode
Change register setting from bit number to bit mask.
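Illustrative only (field names are hypothetical): the bug class is writing
the bit number where the register expects a bit mask.

        /* before: stores the bit index, e.g. 2 */
        fec_override = fec_policy;

        /* after: stores the bit mask, e.g. BIT(2) == 0x4 */
        fec_override = BIT(fec_policy);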

Fixes: b5ede32d33 ("net/mlx5e: Add support for FEC modes based on 50G per lane links")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 16:12:57 -07:00
Aya Levin 41bafb31dc net/mlx5: Fix setting of devlink traps in switchdev mode
Prevent setting of devlink traps on the uplink while in switchdev mode.
In this mode, it is the SW switch's responsibility to handle packets with
a mismatch in either destination MAC or VLAN ID. Therefore, there are no
flow steering tables to trap undesirable packets, and the driver crashes
upon setting a trap.

Fixes: 241dc15939 ("net/mlx5: Notify on trap action by blocking event")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 16:12:54 -07:00
Aya Levin 5b232ea94c net/mlx5e: Fix RQ creation flow for queues which don't support XDP
Allow creating an RQ which is not registered as an XDP RQ. For example,
the trap-RQ is not registered as an XDP RQ.

Fixes: 869c5f9262 ("net/mlx5e: Generalize open RQ")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:03:10 -07:00
Wenpeng Liang 31450b435f net/mlx5: Replace spaces with tab at the start of a line
There should be no spaces at the start of the line.

Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:03:07 -07:00
Wenpeng Liang 9dee115bc1 net/mlx5: Remove return statement at the end of void function
Void function return statements are not generally useful.

Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:03:04 -07:00
Wenpeng Liang 02f47c04c3 net/mlx5: Add a blank line after declarations
There should be a blank line after declarations.

Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:03:01 -07:00
Colin Ian King 82c3ba31c3 net/mlx5: Fix bit-wise and with zero
The bit-wise AND of the action field with MLX5_ACCEL_ESP_ACTION_DECRYPT
is incorrect, as MLX5_ACCEL_ESP_ACTION_DECRYPT is zero and not intended
to be a bit-flag. Fix this by using the == operator as was originally
intended.
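A minimal before/after sketch of the comparison (the surrounding code is
hypothetical):

        /* before: branch is dead because MLX5_ACCEL_ESP_ACTION_DECRYPT is 0 */
        if (action & MLX5_ACCEL_ESP_ACTION_DECRYPT)
                setup_decrypt(sa);      /* never runs */

        /* after: compare the enum value directly */
        if (action == MLX5_ACCEL_ESP_ACTION_DECRYPT)
                setup_decrypt(sa);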

Addresses-Coverity: ("Logically dead code")
Fixes: 7dfee4b1d7 ("net/mlx5: IPsec, Refactor SA handle creation and destruction")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:58 -07:00
Roi Dayan b7f86258a2 net/mlx5: DR, Alloc cmd buffer with kvzalloc() instead of kzalloc()
The cmd size is 8K so use kvzalloc().

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:55 -07:00
Jianbo Liu 9dac2966c5 net/mlx5: DR, Use variably sized data structures for different actions
mlx5dr_action is a generally used data structure, and there is a
union for the different types of actions in it. The size of mlx5dr_action
is about 72 bytes, but for those actions with fewer fields, most of
the allocated memory is wasted.
Remove this union so that mlx5dr_action becomes a generic action header.
Actions are then dynamically allocated with only the needed memory, and
the data for each action is stored right after the header.
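An illustrative layout of the idea (struct members and helper are
simplified, not the exact driver definitions):

struct mlx5dr_action {
        enum mlx5dr_action_type type;
        refcount_t refcount;
        /* per-type data is stored right after this header */
};

static struct mlx5dr_action *dr_action_alloc(enum mlx5dr_action_type type,
                                             size_t data_size)
{
        struct mlx5dr_action *action;

        action = kzalloc(sizeof(*action) + data_size, GFP_KERNEL);
        if (!action)
                return NULL;

        action->type = type;
        refcount_set(&action->refcount, 1);
        return action;          /* type data lives at (void *)(action + 1) */
}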

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:52 -07:00
Parav Pandit a74ed24c43 net/mlx5: SF, Reuse stored hardware function id
SF's hardware function id is already stored in mlx5_sf. Reuse it,
instead of querying the hw table.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:49 -07:00
Parav Pandit 6e74e6ea1b net/mlx5: SF, Use device pointer directly
At many places in the code, device pointer is directly available. Make
use of it, instead of accessing it from the table.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:46 -07:00
Parav Pandit 57b92bdd9e net/mlx5: E-Switch, Initialize eswitch acls ns when eswitch is enabled
Currently the eswitch flow steering (FS) namespaces of a vport's ingress
and egress ACLs are enabled when the FS layer is initialized. This is done
even when the eswitch is disabled, and it requires the total number of
eswitch ports to be known to the FS layer even when the eswitch is not in
use.

Given that the FS core does not depend on the eswitch, turn the namespace
init and cleanup routines into helpers that are invoked only when the
eswitch is needed.

With this change, ingress and egress ACL namespaces are created only
when eswitch legacy/offloads mode is enabled.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:43 -07:00
Parav Pandit b55b35382e net/mlx5: E-Switch, Move legacy code to an individual file
Currently eswitch offers two modes. Legacy and offloads.
Offloads code is already in its own file eswitch_offloads.c

However, eswitch.c contains the eswitch legacy code and common
infrastructure code.

To enable future extensions and to better manage generic common eswitch
infrastructure code, move the legacy code to its own legacy.c file.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:40 -07:00
Parav Pandit b16f2bb6b6 net/mlx5: E-Switch, Convert a macro to a helper routine
Convert ESW_ALLOWED macro to a helper routine so that it can be used in
other eswitch files.
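A sketch of the conversion, assuming the macro checked that the device is
an eswitch manager (helper name follows the mlx5 prefix convention; not
copied verbatim from the driver):

/* before: macro only usable where it is defined */
#define ESW_ALLOWED(esw) ((esw) && MLX5_ESWITCH_MANAGER((esw)->dev))

/* after: helper callable from other eswitch files */
static inline bool mlx5_esw_allowed(struct mlx5_eswitch *esw)
{
        return esw && MLX5_ESWITCH_MANAGER(esw->dev);
}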

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:38 -07:00
Parav Pandit 13795553a8 net/mlx5: E-Switch, Make cleanup sequence mirror of init
Make cleanup sequence mirror of init sequence for cleaning up reps
and freeing vports.

Also when reps initialization fails, there is no need to perform reps
cleanup.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:35 -07:00
Parav Pandit 6308a5f06b net/mlx5: E-Switch, Make vport number u16
Vport number is 16-bit field in hardware. Make it u16.

Move location of vport in the structure so that it reduces a hole
in the structure.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:32 -07:00
Parav Pandit 7d5ae47891 net/mlx5: E-Switch, Skip querying SF enabled bits
SF state is now queried through VHCA events. The device no longer
expects the SF bitmap in the query eswitch functions command.

Hence, remove it to simplify the code.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14 11:02:29 -07:00
Parav Pandit 7bf481d7e7 net/mlx5: E-Switch, let user enable/disable metadata
Currently each packet inserted in the eswitch is tagged with internal
metadata to indicate the source vport. Metadata tagging is not always
needed. Metadata insertion is needed for multi-port RoCE, failover
between representors and stacked devices. In many other cases,
metadata enablement is not needed.

Metadata insertion slows down the packet processing rate of the E-switch
when it is in switchdev mode.

The table below shows the performance gain with metadata disabled for
VXLAN offload rules in both SMFS and DMFS steering modes on a ConnectX-5
device.

----------------------------------------------
| steering | metadata | pkt size | rx pps    |
| mode     |          |          | (million) |
----------------------------------------------
| smfs     | disabled | 128Bytes | 42        |
----------------------------------------------
| smfs     | enabled  | 128Bytes | 36        |
----------------------------------------------
| dmfs     | disabled | 128Bytes | 42        |
----------------------------------------------
| dmfs     | enabled  | 128Bytes | 36        |
----------------------------------------------

Hence, allow the user to disable metadata using a driver-specific devlink
parameter. The metadata setting of the eswitch is applicable only to
switchdev mode.

Example to show and disable metadata before changing eswitch mode:
$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
pci/0000:06:00.0:
  name esw_port_metadata type driver-specific
    values:
      cmode runtime value true

$ devlink dev param set pci/0000:06:00.0 \
	  name esw_port_metadata value false cmode runtime

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
changelog:
v1->v2:
 - added performance numbers in commit log
 - updated commit log and documentation for switchdev mode
 - added explicit note on when user can disable metadata in
   documentation
2021-04-14 11:02:26 -07:00
Jason Gunthorpe a0354d2308 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================

This PR contains changes from the mlx5-next branch,
already reviewed on the netdev and rdma mailing lists; links below.

1) From Leon, Dynamically assign MSI-X vectors count
Already Acked by Bjorn Helgaas.
https://patchwork.kernel.org/project/netdevbpf/cover/20210314124256.70253-1-leon@kernel.org/

2) Cleanup series:
https://patchwork.kernel.org/project/netdevbpf/cover/20210311070915.321814-1-saeed@kernel.org/

From Mark, E-Switch cleanups and refactoring, and the addition
of single FDB mode needed HW bits.

From Mikhael, Remove unused struct field

From Saeed, Cleanup W=1 prototype warning

From Zheng, Esw related cleanup

From Tariq, Use order-0 page allocation for EQs

====================

* mlx5-next:
  net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Add dynamic MSI-X capabilities bits
  PCI/IOV: Add sysfs MSI-X vector assignment interface
  net/mlx5: Use order-0 allocations for EQs
  net/mlx5: Add IFC bits needed for single FDB mode
  net/mlx5: E-Switch, Refactor send to vport to be more generic
  RDMA/mlx5: Use representor E-Switch when getting netdev and metadata
  net/mlx5: E-Switch, Add eswitch pointer to each representor
  net/mlx5: E-Switch, Add match on vhca id to default send rules
  net/mlx5: Remove unused mlx5_core_health member recover_work
  net/mlx5: simplify the return expression of mlx5_esw_offloads_pair()
  net/mlx5: Cleanup prototype warning

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-12 13:49:48 -03:00
Vladyslav Tarasiuk 4c88fa412a net/mlx5: Add support for DSFP module EEPROM dumps
Allow the driver to recognise DSFP transceiver module ID and therefore
allow its EEPROM dumps using ethtool.
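For reference, such a dump is typically requested from userspace with
(assuming a recent ethtool binary):

$ ethtool -m eth0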

Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-11 16:34:56 -07:00
Vladyslav Tarasiuk e109d2b204 net/mlx5: Implement get_module_eeprom_by_page()
Implement ethtool_ops::get_module_eeprom_by_page() to enable
support of new SFP standards.
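A skeleton of the new callback; the signature and struct
ethtool_module_eeprom fields follow the new ethtool API, while the
register-read helper is hypothetical:

static int mlx5e_get_module_eeprom_by_page(struct net_device *netdev,
                                           const struct ethtool_module_eeprom *page_data,
                                           struct netlink_ext_ack *extack)
{
        struct mlx5e_priv *priv = netdev_priv(netdev);

        /* hypothetical helper issuing the module page read; returns the
         * number of bytes read or a negative error */
        return mlx5e_read_module_page(priv, page_data->page, page_data->bank,
                                      page_data->i2c_address, page_data->offset,
                                      page_data->length, page_data->data);
}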

Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-11 16:34:56 -07:00
Vladyslav Tarasiuk e19b0a3474 net/mlx5: Refactor module EEPROM query
Prepare for ethtool_ops::get_module_eeprom_data() implementation by
extracting common part of mlx5_query_module_eeprom() into a separate
function.

Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-11 16:34:56 -07:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Jakub Kicinski 95b5c29132 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:

====================
mlx5-next 2021-04-09

This PR contains changes from the mlx5-next branch,
already reviewed on the netdev and rdma mailing lists; links below.

1) From Leon, Dynamically assign MSI-X vectors count
Already Acked by Bjorn Helgaas.
https://patchwork.kernel.org/project/netdevbpf/cover/20210314124256.70253-1-leon@kernel.org/

2) Cleanup series:
https://patchwork.kernel.org/project/netdevbpf/cover/20210311070915.321814-1-saeed@kernel.org/

From Mark, E-Switch cleanups and refactoring, and the addition
of single FDB mode needed HW bits.

From Mikhael, Remove unused struct field

From Saeed, Cleanup W=1 prototype warning

From Zheng, Esw related cleanup

From Tariq, Use order-0 page allocation for EQs

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Add dynamic MSI-X capabilities bits
  PCI/IOV: Add sysfs MSI-X vector assignment interface
  net/mlx5: Use order-0 allocations for EQs
  net/mlx5: Add IFC bits needed for single FDB mode
  net/mlx5: E-Switch, Refactor send to vport to be more generic
  RDMA/mlx5: Use representor E-Switch when getting netdev and metadata
  net/mlx5: E-Switch, Add eswitch pointer to each representor
  net/mlx5: E-Switch, Add match on vhca id to default send rules
  net/mlx5: Remove unused mlx5_core_health member recover_work
  net/mlx5: simplify the return expression of mlx5_esw_offloads_pair()
  net/mlx5: Cleanup prototype warning
====================

Link: https://lore.kernel.org/r/20210409200704.10886-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 18:07:21 -07:00
Danielle Ratson a975d7d8a3 ethtool: Remove link_mode param and derive link params from driver
Some drivers clear the 'ethtool_link_ksettings' struct in their
get_link_ksettings() callback, before populating it with actual values.
Such drivers will set the new 'link_mode' field to zero, resulting in
user space receiving wrong link mode information given that zero is a
valid value for the field.

Another problem is that some drivers (notably tun) can report random
values in the 'link_mode' field. This can result in a general protection
fault when the field is used as an index to the 'link_mode_params' array
[1].

This happens because such drivers implement their set_link_ksettings()
callback by simply overwriting their private copy of
'ethtool_link_ksettings' struct with the one they get from the stack,
which is not always properly initialized.

Fix these problems by removing 'link_mode' from 'ethtool_link_ksettings'
and instead have drivers call ethtool_params_from_link_mode() with the
current link mode. The function will derive the link parameters (e.g.,
speed) from the link mode and fill them in the 'ethtool_link_ksettings'
struct.
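A hedged sketch of a driver-side get_link_ksettings() using the helper;
ethtool_params_from_link_mode() is the in-tree function, the chosen link
mode is just an example and this is not the actual mlx5e implementation:

static int mlx5e_get_link_ksettings(struct net_device *netdev,
                                    struct ethtool_link_ksettings *ks)
{
        /* ... query the device and map the result to a link mode ... */
        enum ethtool_link_mode_bit_indices link_mode =
                ETHTOOL_LINK_MODE_25000baseCR_Full_BIT; /* example value */

        /* derive speed/duplex/lanes from the link mode inside the driver */
        ethtool_params_from_link_mode(ks, link_mode);
        return 0;
}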

v3:
	* Remove link_mode parameter and derive the link parameters in
	  the driver instead of passing link_mode parameter to ethtool
	  and derive it there.

v2:
	* Introduce 'cap_link_mode_supported' instead of adding a
	  validity field to 'ethtool_link_ksettings' struct.

[1]
general protection fault, probably for non-canonical address 0xdffffc00f14cc32c: 0000 [#1] PREEMPT SMP KASAN
KASAN: probably user-memory-access in range [0x000000078a661960-0x000000078a661967]
CPU: 0 PID: 8452 Comm: syz-executor360 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__ethtool_get_link_ksettings+0x1a3/0x3a0 net/ethtool/ioctl.c:446
Code: b7 3e fa 83 fd ff 0f 84 30 01 00 00 e8 16 b0 3e fa 48 8d 3c ed 60 d5 69 8a 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03
+38 d0 7c 08 84 d2 0f 85 b9
RSP: 0018:ffffc900019df7a0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888026136008 RCX: 0000000000000000
RDX: 00000000f14cc32c RSI: ffffffff873439ca RDI: 000000078a661960
RBP: 00000000ffff8880 R08: 00000000ffffffff R09: ffff88802613606f
R10: ffffffff873439bc R11: 0000000000000000 R12: 0000000000000000
R13: ffff88802613606c R14: ffff888011d0c210 R15: ffff888011d0c210
FS:  0000000000749300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004b60f0 CR3: 00000000185c2000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 linkinfo_prepare_data+0xfd/0x280 net/ethtool/linkinfo.c:37
 ethnl_default_notify+0x1dc/0x630 net/ethtool/netlink.c:586
 ethtool_notify+0xbd/0x1f0 net/ethtool/netlink.c:656
 ethtool_set_link_ksettings+0x277/0x330 net/ethtool/ioctl.c:620
 dev_ethtool+0x2b35/0x45d0 net/ethtool/ioctl.c:2842
 dev_ioctl+0x463/0xb70 net/core/dev_ioctl.c:440
 sock_do_ioctl+0x148/0x2d0 net/socket.c:1060
 sock_ioctl+0x477/0x6a0 net/socket.c:1177
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: c8907043c6 ("ethtool: Get link mode in use instead of speed and duplex parameters")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-07 14:53:04 -07:00
David S. Miller f86c70ed04 mlx5-updates-2021-04-06
Introduce TC sample offload
 
 Background
 ----------
 
 The tc sample action allows user to sample traffic matched by tc
 classifier. The sampling consists of choosing packets randomly and
 sampling them using psample module.
 
 The tc sample parameters include group id, sampling rate and packet's
 truncation (to save kernel-user traffic).
 
 Sample in TC SW
 ---------------
 
 User must specify rate and group id for sample action, truncate is
 optional.
 
 tc filter add dev enp4s0f0_0 ingress protocol ip prio 1 flower	\
 	src_mac 02:25:d0:14:01:02 dst_mac 02:25:d0:14:01:03	\
 	action sample rate 10 group 5 trunc 60			\
 	action mirred egress redirect dev enp4s0f0_1
 
 The tc sample action kernel module 'act_sample' will call another
 kernel module 'psample' to send sampled packets to userspace.
 
 MLX5 sample HW offload - MLX5 driver patches
 --------------------------------------------
 
 The sample action is translated to a goto flow table object
 destination which samples packets according to the provided
 sample ratio. Sampled packets are duplicated. One copy is
 processed by a termination table, named the sample table,
 which sends the packet to the eswitch manager port (that will
 be processed by software).
 
 The second copy is processed by the default table which executes
 the subsequent actions. The default table is created per <vport,
 chain, prio> tuple as rules with different prios and chains may
 overlap.
 
 For example, for the following typical flow table:
 
 +-------------------------------+
 +       original flow table     +
 +-------------------------------+
 +         original match        +
 +-------------------------------+
 + sample action + other actions +
 +-------------------------------+
 
 We translate the tc filter with sample action to the following HW model:
 
         +---------------------+
         + original flow table +
         +---------------------+
         +   original match    +
         +---------------------+
                    |
                    v
 +------------------------------------------------+
 +                Flow Sampler Object             +
 +------------------------------------------------+
 +                    sample ratio                +
 +------------------------------------------------+
 +    sample table id    |    default table id    +
 +------------------------------------------------+
            |                            |
            v                            v
 +-----------------------------+  +----------------------------------------+
 +        sample table         +  + default table per <vport, chain, prio> +
 +-----------------------------+  +----------------------------------------+
 + forward to management vport +  +            original match              +
 +-----------------------------+  +----------------------------------------+
                                  +            other actions               +
                                  +----------------------------------------+
 
 Flow sampler object
 -------------------
 
 Hardware introduces flow sampler object to do sample. It is a new
 destination type. Driver needs to specify two flow table ids in it.
 One is sample table id. The other one is the default table id.
 Sample table samples the packets according to the sample rate and
 forward the sampled packets to eswitch manager port. Default table
 finishes the subsequent actions.
 
 Group id and reg_c0
 -------------------
 
 Userspace program will take different actions for sampled packets
 according to tc sample action group id. So hardware must pass group
 id to software for each sampled packets. In Paul Blakey's "Introduce
 connection tracking offload" patch set, reg_c0 lower 16 bits are used
 for miss packet chain id restore. We convert reg_c0 lower 16 bits to
 a common object pool, so other features can also use it.
 
 Since sample group id is 32 bits, create a 16 bits object id to map
 the group id and write the object id to reg_c0 lower 16 bits. reg_c0
 can only be used for matching. Write reg_c0 to flow_tag, so software
 can get the object id via flow_tag and find group id via the common
 object pool.
 
 Sampler restore handle
 ----------------------
 
 Use common object pool to create an object id to map sample parameters.
 Allocate a modify header action to write the object id to reg_c0 lower
 16 bits. Create a restore rule to pass the object id to software. So
 software can identify sampled packets via the object id and send it to
 userspace.
 
 Aggregate the modify header action, restore rule and object id to a
 sample restore handle. Re-use identical sample restore handle for
 the same object id.
 
 Send sampled packets to userspace
 ---------------------------------
 
 The destination for sampled packets is eswitch manager port, so
 representors can receive sampled packets together with the group id.
 Driver will send sampled packets and group id to userspace via psample.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmBtNrUACgkQSD+KveBX
 +j6cRQf/ZARhVPDEgCvFd+wD+n2VCM11FJCpIumGecfqpA9DB/7i0iQrBWG2cGy6
 Go3XZ7HCPy0bAeDnVMBulF5RshfQkB/CNJfCTrw0QkNvenO/eYPZrl0XAGwL7w8W
 9vkeK51VG70bj7VEMeWVovL0X2VoGea0MD0ASLgOG3qZmCjFX0Aw3yY4WNZAA1fn
 i9rSP0AgTXqbR+nUezqP9xDHCyEf4etqpdPO/gosFvasZxTa9Xm6tXxT8YrcjAEH
 MjIYJVS5SERem/gxqrRi5p0u1RNrbZ3vPMmZQIr6x2eBXLwMhvjvcxKqZ2l9PvD5
 +O+Hf43GAmhAoqZukvU8H8oMWArciA==
 =MkzD
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-04-06

Introduce TC sample offload

Background
----------

The tc sample action allows user to sample traffic matched by tc
classifier. The sampling consists of choosing packets randomly and
sampling them using psample module.

The tc sample parameters include group id, sampling rate and packet's
truncation (to save kernel-user traffic).

Sample in TC SW
---------------

User must specify rate and group id for sample action, truncate is
optional.

tc filter add dev enp4s0f0_0 ingress protocol ip prio 1 flower	\
	src_mac 02:25:d0:14:01:02 dst_mac 02:25:d0:14:01:03	\
	action sample rate 10 group 5 trunc 60			\
	action mirred egress redirect dev enp4s0f0_1

The tc sample action kernel module 'act_sample' will call another
kernel module 'psample' to send sampled packets to userspace.

MLX5 sample HW offload - MLX5 driver patches
--------------------------------------------

The sample action is translated to a goto flow table object
destination which samples packets according to the provided
sample ratio. Sampled packets are duplicated. One copy is
processed by a termination table, named the sample table,
which sends the packet to the eswitch manager port (that will
be processed by software).

The second copy is processed by the default table which executes
the subsequent actions. The default table is created per <vport,
chain, prio> tuple as rules with different prios and chains may
overlap.

For example, for the following typical flow table:

+-------------------------------+
+       original flow table     +
+-------------------------------+
+         original match        +
+-------------------------------+
+ sample action + other actions +
+-------------------------------+

We translate the tc filter with sample action to the following HW model:

        +---------------------+
        + original flow table +
        +---------------------+
        +   original match    +
        +---------------------+
                   |
                   v
+------------------------------------------------+
+                Flow Sampler Object             +
+------------------------------------------------+
+                    sample ratio                +
+------------------------------------------------+
+    sample table id    |    default table id    +
+------------------------------------------------+
           |                            |
           v                            v
+-----------------------------+  +----------------------------------------+
+        sample table         +  + default table per <vport, chain, prio> +
+-----------------------------+  +----------------------------------------+
+ forward to management vport +  +            original match              +
+-----------------------------+  +----------------------------------------+
                                 +            other actions               +
                                 +----------------------------------------+

Flow sampler object
-------------------

Hardware introduces flow sampler object to do sample. It is a new
destination type. Driver needs to specify two flow table ids in it.
One is sample table id. The other one is the default table id.
Sample table samples the packets according to the sample rate and
forward the sampled packets to eswitch manager port. Default table
finishes the subsequent actions.

Group id and reg_c0
-------------------

Userspace program will take different actions for sampled packets
according to tc sample action group id. So hardware must pass group
id to software for each sampled packets. In Paul Blakey's "Introduce
connection tracking offload" patch set, reg_c0 lower 16 bits are used
for miss packet chain id restore. We convert reg_c0 lower 16 bits to
a common object pool, so other features can also use it.

Since sample group id is 32 bits, create a 16 bits object id to map
the group id and write the object id to reg_c0 lower 16 bits. reg_c0
can only be used for matching. Write reg_c0 to flow_tag, so software
can get the object id via flow_tag and find group id via the common
object pool.

Sampler restore handle
----------------------

Use common object pool to create an object id to map sample parameters.
Allocate a modify header action to write the object id to reg_c0 lower
16 bits. Create a restore rule to pass the object id to software. So
software can identify sampled packets via the object id and send it to
userspace.

Aggregate the modify header action, restore rule and object id to a
sample restore handle. Re-use identical sample restore handle for
the same object id.

Send sampled packets to userspace
---------------------------------

The destination for sampled packets is eswitch manager port, so
representors can receive sampled packets together with the group id.
Driver will send sampled packets and group id to userspace via psample.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-07 14:38:24 -07:00
Vadim Pasternak d567fd6e82 mlxsw: core: Remove critical trip points from thermal zones
Disable software thermal protection by removing critical trip points
from all thermal zones.

The software thermal protection is redundant given there are two layers
of protection below it in firmware and hardware. The first layer is
performed by firmware, the second, in case firmware was not able to
perform protection, by hardware.
The temperature threshold set for hardware protection is always higher
than for firmware.

Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-07 14:26:18 -07:00
Chris Mi f94d6389f6 net/mlx5e: TC, Add support to offload sample action
The following diagram illustrates the hardware model for tc sample action:

        +---------------------+
        + original flow table +
        +---------------------+
        +   original match    +
        +---------------------+
                   |
                   v
+------------------------------------------------+
+                Flow Sampler Object             +
+------------------------------------------------+
+                    sample ratio                +
+------------------------------------------------+
+    sample table id    |    default table id    +
+------------------------------------------------+
           |                            |
           v                            v
+-----------------------------+  +----------------------------------------+
+        sample table         +  + default table per <vport, chain, prio> +
+-----------------------------+  +----------------------------------------+
+ forward to management vport +  +            original match              +
+-----------------------------+  +----------------------------------------+
                                 +            other actions               +
                                 +----------------------------------------+

The sample action is translated to a goto flow table object
destination which samples packets according to the provided
sample ratio. Sampled packets are duplicated. One copy is
processed by a termination table, named the sample table,
which sends the packet to the eswitch manager port (that will
be processed by software).

The second copy is processed by the default table which executes
the subsequent actions. The default table is created per <vport,
chain, prio> tuple as rules with different prios and chains may
overlap.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:05 -07:00
Chris Mi be9dc00474 net/mlx5e: TC, Handle sampled packets
Mark the sampled packets with a sample restore object. Send sampled
packets using the psample api.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:04 -07:00
Chris Mi 7319a1cc3c net/mlx5e: TC, Refactor tc update skb function
As a pre-step to process sampled packet in this function.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:04 -07:00
Chris Mi 36a3196256 net/mlx5e: TC, Add sampler restore handle API
Use common object pool to create an object ID to map sample parameters.
Allocate a modify header action to write the object ID to reg_c0 lower
16 bits. Create a restore rule to pass the object ID to software. So
software can identify sampled packets via the object ID and send it to
userspace.

Aggregate the modify header action, restore rule and object ID to a
sample restore handle. Re-use identical sample restore handle for
the same object ID.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:04 -07:00
Chris Mi 11ecd6c60b net/mlx5e: TC, Add sampler object API
In order to offload sample action, HW introduces sampler object. The
sampler object samples packets according to the provided sample ratio.
Sampled packets are duplicated. One copy is processed by a termination
table, named the sample table, which sends the packet up to software.
The second copy is processed by the default table.

Instantiate sampler object. Re-use identical sampler object for
the same sample ratio, sample table and default table as a prestep for
offloading tc sample actions.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:03 -07:00
Chris Mi 2a9ab10a56 net/mlx5e: TC, Add sampler termination table API
Sampled packets are sent to software using termination tables. There
is only one rule in that table, which forwards sampled packets to
the e-switch management vport.

Create a sampler termination table and rule for each eswitch.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:03 -07:00
Chris Mi 41c2fd9498 net/mlx5e: TC, Parse sample action
Parse TC sample action and save sample parameters in flow attribute
data structure.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:03 -07:00
Chris Mi c935568271 net/mlx5: Instantiate separate mapping objects for FDB and NIC tables
Currently, the u32 chain id is mapped to u16 value which is stored on
the lower 16 bits of reg_c0 for FDB and reg_b for NIC tables. The
mapping is internally maintained by the chains object. However, with
the introduction of reg_c0 objects the fdb may store more than just
the chain id on reg_c0. This is not relevant for NIC tables.

Separate the chains mapping instantiation for FDB and NIC tables.
Remove the mapping from the chains object. For FDB tables, create
the mapping per eswitch. For NIC tables, create the mapping per tc
table. Pass the corresponding mapping pointer when creating the
chains object.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:02 -07:00
Chris Mi a91d98a0a2 net/mlx5: Map register values to restore objects
Currently reg_c0 lower 16 bits and reg_b are used to store the chain
id that missed in FDB and NIC tables respectively. However, the
registers' values may index a restore object, rather than a single u32
value. Different object types can be used to restore mutually exclusive
contexts such as chain id and sample group id.

Use the mapping object to associate an index with a restore object
as a prestep for supporting additional restore types.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:02 -07:00
Chris Mi c1904360dd net/mlx5: E-switch, Set per vport table default group number
A different per vport table is created using a different per vport table
namespace, because we can't use a variable to set the namespace member
value. If the max group number in a namespace is 0, use the eswitch
default max group number.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:02 -07:00
Chris Mi c796bb7cd2 net/mlx5: E-switch, Generalize per vport table API
Currently, per vport table was used only for port mirroring actions.
However, sample action will also require a per vport table instance.

Generalize the vport table API to work with multiple namespaces where
each namespace manages its own vport table instance.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:01 -07:00
Chris Mi 0a9e230787 net/mlx5: E-switch, Rename functions to follow naming convention.
Public API functions start with the mlx5 prefix; remove the mlx5 prefix
from non-public API functions.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:01 -07:00
Chris Mi 4c7f40287a net/mlx5: E-switch, Move vport table functions to a new file
Currently, the vport table functions are in common eswitch offload
file. This file is too big. Move the vport table create, delete and
lookup functions to a separate file. Put the file in esw directory.

Pre-step for generalizing its functionality for serving both the
mirroring and the sample features.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:36:01 -07:00
Xiaoming Ni d5f9b005c3 net/mlx5: fix kfree mismatch in indir_table.c
Memory allocated by kvzalloc() should be freed by kvfree().
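The pattern at issue, in miniature:

        in = kvzalloc(inlen, GFP_KERNEL);
        if (!in)
                return -ENOMEM;
        /* ... */
        kvfree(in);     /* not kfree(): the buffer may come from vmalloc */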

Fixes: 34ca65352d ("net/mlx5: E-Switch, Indirect table infrastructur")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:04:36 -07:00
Eli Cohen 1a73704c82 net/mlx5: Fix HW spec violation configuring uplink
Make sure to modify the uplink port to 'follow' only if the uplink_follow
capability is set, as required by the HW spec. Failure to do so causes
traffic to the uplink representor net device to cease after switching to
switchdev mode.

Fixes: 7d0314b11c ("net/mlx5e: Modify uplink state on interface up/down")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-06 21:04:35 -07:00
Leon Romanovsky e71b75f737 net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks
The mlx5 implementation executes a firmware command on the PF to change
the configuration of the selected VF.

Link: https://lore.kernel.org/linux-pci/20210314124256.70253-5-leon@kernel.org
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2021-04-04 10:30:38 +03:00
Leon Romanovsky 604774add5 net/mlx5: Dynamically assign MSI-X vectors count
The number of MSI-X vectors is a PCI property visible through lspci. The
field is read-only and configured by the device. The mlx5 devices work in
a static or dynamic assignment mode.

Static assignment means that all newly created VFs have a preset number of
MSI-X vectors determined by device configuration parameters. This can
result in some VFs having too many or too few MSI-X vectors. Till now this
has been the only means of fine-tuning the MSI-X vector count and it was
acceptable for small numbers of VFs.

With dynamic assignment the inefficiency of having a fixed number of MSI-X
vectors can be avoided with each VF having exactly the required
vectors. Userspace will provide this information while provisioning the VF
for use, based on the intended use. For instance if being used with a VM,
the MSI-X vector count might be matched to the CPU count of the VM.

For compatibility mlx5 continues to start up with MSI-X vector assignment,
but the kernel can now access a larger dynamic vector pool and assign more
vectors to created VFs.
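Illustrative usage of the companion sysfs interface (the attribute names
below are assumed from the PCI/IOV patch referenced in this series; PCI
addresses are placeholders):

# on the PF: total MSI-X vectors available for distribution to its VFs
$ cat /sys/bus/pci/devices/0000:08:00.0/sriov_vf_total_msix

# on a VF that is not yet probed: assign 4 MSI-X vectors to it
$ echo 4 > /sys/bus/pci/devices/0000:08:00.2/sriov_vf_msix_count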

Link: https://lore.kernel.org/linux-pci/20210314124256.70253-4-leon@kernel.org
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2021-04-04 10:29:48 +03:00
Vu Pham 6783f0a21a net/mlx5e: Dynamic alloc vlan table for netdev when needed
Dynamically allocate the vlan table in mlx5e_priv for the EN netdev
when needed. Don't allocate it for the representor netdev.

Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:08 -07:00
Vu Pham f6755b80d6 net/mlx5e: Dynamic alloc arfs table for netdev when needed
Dynamically allocate the arfs table in mlx5e_priv for the EN netdev
when needed. Don't allocate it for the representor netdev.

Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:08 -07:00
Ariel Levkovich bb5696570b net/mlx5e: Reject tc rules which redirect from a VF to itself
Since there are self loopback prevention mechanisms at the
VF level, offloading such rules which redirect from a VF
to itself in the eswitch will break the datapath since the
packets will be dropped once they go back to the vport they
came from.

Therefore, offloading such rules will be rejected and left to
be handled by SW.

Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:08 -07:00
Roi Dayan 8802b8a44e net/mlx5: Use ida_alloc_range() instead of ida_simple_alloc()
ida_simple_alloc() and its remove counterpart are deprecated.
Related change:
commit 3264ceec8f ("lib/idr.c: document that ida_simple_{get,remove}() are deprecated")
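The replacement pattern, in short (the ida field name is illustrative):

        id = ida_alloc_range(&table->ida, min_id, max_id, GFP_KERNEL);
        if (id < 0)
                return id;
        /* ... */
        ida_free(&table->ida, id);      /* instead of ida_simple_remove() */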

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:07 -07:00
Parav Pandit 233dd7d656 net/mlx5: E-Switch, move QoS specific fields to existing qos struct
Function QoS related fields are already defined in the qos struct;
only min and max rate were left in the mlx5_vport_info struct.

Move them to the existing qos struct.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:07 -07:00
Parav Pandit b47e105625 net/mlx5: E-Switch, cut down mlx5_vport_info structure size by 8 bytes
Structure mlx5_vport_info consumes 40 bytes of space due to a hole
in it. After packing it reduces to 32 bytes.

Currently:
pahole -C mlx5_vport_info drivers/net/ethernet/mellanox/mlx5/core/eswitch.o
struct mlx5_vport_info {
        u8                         mac[6];               /*     0     6 */
        u16                        vlan;                 /*     6     2 */
        u8                         qos;                  /*     8     1 */

        /* XXX 7 bytes hole, try to pack */

        u64                        node_guid;            /*    16     8 */
        int                        link_state;           /*    24     4 */
        u32                        min_rate;             /*    28     4 */
        u32                        max_rate;             /*    32     4 */
        bool                       spoofchk;             /*    36     1 */
        bool                       trusted;              /*    37     1 */

        /* size: 40, cachelines: 1, members: 9 */
        /* sum members: 31, holes: 1, sum holes: 7 */
        /* padding: 2 */
        /* last cacheline: 40 bytes */
};

After packing:

$ pahole -C mlx5_vport_info drivers/net/ethernet/mellanox/mlx5/core/eswitch.o

struct mlx5_vport_info {
        u8                         mac[6];               /*     0     6 */
        u16                        vlan;                 /*     6     2 */
        u64                        node_guid;            /*     8     8 */
        int                        link_state;           /*    16     4 */
        u32                        min_rate;             /*    20     4 */
        u32                        max_rate;             /*    24     4 */
        u8                         qos;                  /*    28     1 */
        u8                         spoofchk:1;           /*    29: 0  1 */
        u8                         trusted:1;            /*    29: 1  1 */

        /* size: 32, cachelines: 1, members: 9 */
        /* padding: 2 */
        /* bit_padding: 6 bits */
        /* last cacheline: 32 bytes */
};

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:07 -07:00
Parav Pandit 19779f28c9 net/mlx5: Pair mutex_destroy with mutex_init for rate limit table
Add missing mutex_destroy() to pair with mutex_init().

This should be done only when the table is initialized; hence, perform
mutex_init() only when the table is initialized.
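In miniature (field names are illustrative):

        /* init path: only set up the lock when the table is really used */
        if (table->max_size)
                mutex_init(&table->rl_lock);

        /* cleanup path: destroy it under the same condition */
        if (table->max_size)
                mutex_destroy(&table->rl_lock);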

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:06 -07:00
Parav Pandit 6b30b6d4d3 net/mlx5: Allocate rate limit table when rate is configured
A device supports 128 rate limiters. A static table allocation consumes
8KB of memory even when rate is not configured.

Instead, allocate the table when at least one rate is configured.
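A small sketch of the lazy allocation (field names assumed, not the actual
driver code):

        /* allocate the 128-entry table only when the first rate is set */
        if (!table->rl_entry) {
                table->rl_entry = kvcalloc(table->max_size,
                                           sizeof(*table->rl_entry), GFP_KERNEL);
                if (!table->rl_entry)
                        return -ENOMEM;
        }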

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:06 -07:00
Parav Pandit 97d85aba25 net/mlx5: Use helper to increment, decrement rate entry refcount
Rate limit entry refcount can be incremented uniformly when it is newly
allocated or reused.
So simplify the code to increment refcount at one place.

Use decrement refcount helper in two routines.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:06 -07:00
Parav Pandit 51ccc9f5f1 net/mlx5: Use helpers to allocate and free rl table entries
Use helper routines to allocate and free rate limit table entries.
A subsequent patch extends the use of these helpers to do the allocation
in the rate entry allocation callback.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:05 -07:00
Parav Pandit 16e74672a2 net/mlx5: Do not hold mutex while reading table constants
Table max_size, min and max rate are constants initialized when the table
is created. Reading them does not require holding the table mutex. Hence,
read them without holding it.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:05 -07:00
Parav Pandit c6baac47d9 net/mlx5: Use unsigned int for free_count
Fix the warning due to missing int.

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
+       unsigned free_count;

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-02 16:13:04 -07:00