Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller: 1) Fix OOPS during nf_tables rule dump, from Florian Westphal. 2) Use after free in ip_vs_in, from Yue Haibing. 3) Fix various kTLS bugs (NULL deref during device removal resync, netdev notification ignoring, etc.) From Jakub Kicinski. 4) Fix ipv6 redirects with VRF, from David Ahern. 5) Memory leak fix in igmpv3_del_delrec(), from Eric Dumazet. 6) Missing memory allocation failure check in ip6_ra_control(), from Gen Zhang. And likewise fix ip_ra_control(). 7) TX clean budget logic error in aquantia, from Igor Russkikh. 8) SKB leak in llc_build_and_send_ui_pkt(), from Eric Dumazet. 9) Double frees in mlx5, from Parav Pandit. 10) Fix lost MAC address in r8169 during PCI D3, from Heiner Kallweit. 11) Fix botched register access in mvpp2, from Antoine Tenart. 12) Use after free in napi_gro_frags(), from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (89 commits) net: correct zerocopy refcnt with udp MSG_MORE ethtool: Check for vlan etype or vlan tci when parsing flow_rule net: don't clear sock->sk early to avoid trouble in strparser net-gro: fix use-after-free read in napi_gro_frags() net: dsa: tag_8021q: Create a stable binary format net: dsa: tag_8021q: Change order of rx_vid setup net: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value ipv4: tcp_input: fix stack out of bounds when parsing TCP options. mlxsw: spectrum: Prevent force of 56G mlxsw: spectrum_acl: Avoid warning after identical rules insertion net: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT r8169: fix MAC address being lost in PCI D3 net: core: support XDP generic on stacked devices. netvsc: unshare skb in VF rx handler udp: Avoid post-GRO UDP checksum recalculation net: phy: dp83867: Set up RGMII TX delay net: phy: dp83867: do not call config_init twice net: phy: dp83867: increase SGMII autoneg timer duration net: phy: dp83867: fix speed 10 in sgmii mode net: phy: marvell10g: report if the PHY fails to boot firmware ...
2019-05-30 21:11:22 -07:00 · 2019-05-30 21:11:22 -07:00 · 036e343109
parent adc3f554fa 100f6d8e09
commit 036e343109
87 changed files with 1769 additions and 684 deletions
--- a/Documentation/ABI/testing/sysfs-bus-mdio
+++ b/Documentation/ABI/testing/sysfs-bus-mdio
@ -1,29 +0,0 @@
-What:		/sys/bus/mdio_bus/devices/.../phy_id
-Date:		November 2012
-KernelVersion:	3.8
-Contact:	netdev@vger.kernel.org
-Description:
-		This attribute contains the 32-bit PHY Identifier as reported
-		by the device during bus enumeration, encoded in hexadecimal.
-		This ID is used to match the device with the appropriate
-		driver.
-
-What:		/sys/bus/mdio_bus/devices/.../phy_interface
-Date:		February 2014
-KernelVersion:	3.15
-Contact:	netdev@vger.kernel.org
-Description:
-		This attribute contains the PHY interface as configured by the
-		Ethernet driver during bus enumeration, encoded in string.
-		This interface mode is used to configure the Ethernet MAC with the
-		appropriate mode for its data lines to the PHY hardware.
-
-What:		/sys/bus/mdio_bus/devices/.../phy_has_fixups
-Date:		February 2014
-KernelVersion:	3.15
-Contact:	netdev@vger.kernel.org
-Description:
-		This attribute contains the boolean value whether a given PHY
-		device has had any "fixup" workaround running on it, encoded as
-		a boolean. This information is provided to help troubleshooting
-		PHY configurations.
--- a/Documentation/ABI/testing/sysfs-class-net-phydev
+++ b/Documentation/ABI/testing/sysfs-class-net-phydev
@ -11,24 +11,31 @@ Date:		February 2014
 KernelVersion:	3.15
 Contact:	netdev@vger.kernel.org
 Description:
-		Boolean value indicating whether the PHY device has
-		any fixups registered against it (phy_register_fixup)
+		This attribute contains the boolean value whether a given PHY
+		device has had any "fixup" workaround running on it, encoded as
+		a boolean. This information is provided to help troubleshooting
+		PHY configurations.

 What:		/sys/class/mdio_bus/<bus>/<device>/phy_id
 Date:		November 2012
 KernelVersion:	3.8
 Contact:	netdev@vger.kernel.org
 Description:
-		32-bit hexadecimal value corresponding to the PHY device's OUI,
-		model and revision number.
+		This attribute contains the 32-bit PHY Identifier as reported
+		by the device during bus enumeration, encoded in hexadecimal.
+		This ID is used to match the device with the appropriate
+		driver.

 What:		/sys/class/mdio_bus/<bus>/<device>/phy_interface
 Date:		February 2014
 KernelVersion:	3.15
 Contact:	netdev@vger.kernel.org
 Description:
-		String value indicating the PHY interface, possible
-		values are:.
+		This attribute contains the PHY interface as configured by the
+		Ethernet driver during bus enumeration, encoded in string.
+		This interface mode is used to configure the Ethernet MAC with the
+		appropriate mode for its data lines to the PHY hardware.
+		Possible values are:
 		<empty> (not available), mii, gmii, sgmii, tbi, rev-mii,
 		rmii, rgmii, rgmii-id, rgmii-rxid, rgmii-txid, rtbi, smii
 		xgmii, moca, qsgmii, trgmii, 1000base-x, 2500base-x, rxaui,
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Vendor Device Drivers
+=====================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   freescale/dpaa2/index
+   intel/e100
+   intel/e1000
+   intel/e1000e
+   intel/fm10k
+   intel/igb
+   intel/igbvf
+   intel/ixgb
+   intel/ixgbe
+   intel/ixgbevf
+   intel/i40e
+   intel/iavf
+   intel/ice
+
+.. only::  subproject
+
+   Indices
+   =======
+
+   * :ref:`genindex`
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@ -11,19 +11,7 @@ Contents:
   batman-adv
   can
   can_ucan_protocol
-   device_drivers/freescale/dpaa2/index
-   device_drivers/intel/e100
-   device_drivers/intel/e1000
-   device_drivers/intel/e1000e
-   device_drivers/intel/fm10k
-   device_drivers/intel/igb
-   device_drivers/intel/igbvf
-   device_drivers/intel/ixgb
-   device_drivers/intel/ixgbe
-   device_drivers/intel/ixgbevf
-   device_drivers/intel/i40e
-   device_drivers/intel/iavf
-   device_drivers/intel/ice
+   device_drivers/index
   dsa/index
   devlink-info-versions
   ieee802154
@ -40,6 +28,8 @@ Contents:
   checksum-offloads
   segmentation-offloads
   scaling
+   tls
+   tls-offload

 .. only::  subproject

--- a/Documentation/networking/tls-offload-layers.svg
+++ b/Documentation/networking/tls-offload-layers.svg
--- a/Documentation/networking/tls-offload-reorder-bad.svg
+++ b/Documentation/networking/tls-offload-reorder-bad.svg
--- a/Documentation/networking/tls-offload-reorder-good.svg
+++ b/Documentation/networking/tls-offload-reorder-good.svg
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@ -0,0 +1,482 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+==================
+Kernel TLS offload
+==================
+
+Kernel TLS operation
+====================
+
+Linux kernel provides TLS connection offload infrastructure. Once a TCP
+connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
+Layer Protocol (ULP) and install the cryptographic connection state.
+For details regarding the user-facing interface refer to the TLS
+documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.
+
+``ktls`` can operate in three modes:
+
+ * Software crypto mode (``TLS_SW``) - CPU handles the cryptography.
+   In most basic cases only crypto operations synchronous with the CPU
+   can be used, but depending on calling context CPU may utilize
+   asynchronous crypto accelerators. The use of accelerators introduces extra
+   latency on socket reads (decryption only starts when a read syscall
+   is made) and additional I/O load on the system.
+ * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
+   on a packet by packet basis, provided the packets arrive in order.
+   This mode integrates best with the kernel stack and is described in detail
+   in the remaining part of this document
+   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
+ * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
+   NIC driver and firmware replace the kernel networking stack
+   with its own TCP handling, it is not usable in production environments
+   making use of the Linux networking stack for example any firewalling
+   abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``).
+
+The operation mode is selected automatically based on device configuration,
+offload opt-in or opt-out on per-connection basis is not currently supported.
+
+TX
+--
+
+At a high level user write requests are turned into a scatter list, the TLS ULP
+intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
+mode) and then hands the modified scatter list to the TCP layer. From this
+point on the TCP stack proceeds as normal.
+
+In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
+Instead packets reach a device driver, the driver will mark the packets
+for crypto offload based on the socket the packet is attached to,
+and send them to the device for encryption and transmission.
+
+RX
+--
+
+On the receive side if the device handled decryption and authentication
+successfully, the driver will set the decrypted bit in the associated
+:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
+are handled normally. ``ktls`` is informed when data is queued to the socket
+and the ``strparser`` mechanism is used to delineate the records. Upon read
+request, records are retrieved from the socket and passed to decryption routine.
+If device decrypted all the segments of the record the decryption is skipped,
+otherwise software path handles decryption.
+
+.. kernel-figure::  tls-offload-layers.svg
+   :alt:	TLS offload layers
+   :align:	center
+   :figwidth:	28em
+
+   Layers of Kernel TLS stack
+
+Device configuration
+====================
+
+During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and
+``NETIF_F_HW_TLS_TX`` features and installs its
+:c:type:`struct tlsdev_ops <tlsdev_ops>`
+pointer in the :c:member:`tlsdev_ops` member of the
+:c:type:`struct net_device <net_device>`.
+
+When TLS cryptographic connection state is installed on a ``ktls`` socket
+(note that it is done twice, once for RX and once for TX direction,
+and the two are completely independent), the kernel checks if the underlying
+network device is offload-capable and attempts the offload. In case offload
+fails the connection is handled entirely in software using the same mechanism
+as if the offload was never tried.
+
+Offload request is performed via the :c:member:`tls_dev_add` callback of
+:c:type:`struct tlsdev_ops <tlsdev_ops>`:
+
+.. code-block:: c
+
+	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+			   enum tls_offload_ctx_dir direction,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn);
+
+``direction`` indicates whether the cryptographic information is for
+the received or transmitted packets. Driver uses the ``sk`` parameter
+to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
+Cryptographic information in ``crypto_info`` includes the key, iv, salt
+as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates
+which TCP sequence number corresponds to the beginning of the record with
+sequence number from ``crypto_info``. The driver can add its state
+at the end of kernel structures (see :c:member:`driver_state` members
+in ``include/net/tls.h``) to avoid additional allocations and pointer
+dereferences.
+
+TX
+--
+
+After TX state is installed, the stack guarantees that the first segment
+of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
+number, simplifying TCP sequence number matching.
+
+TX offload being fully initialized does not imply that all segments passing
+through the driver and which belong to the offloaded socket will be after
+the expected sequence number and will have kernel record information.
+In particular, already encrypted data may have been queued to the socket
+before installing the connection state in the kernel.
+
+RX
+--
+
+In RX direction local networking stack has little control over the segmentation,
+so the initial records' TCP sequence number may be anywhere inside the segment.
+
+Normal operation
+================
+
+At the minimum the device maintains the following state for each connection, in
+each direction:
+
+ * crypto secrets (key, iv, salt)
+ * crypto processing state (partial blocks, partial authentication tag, etc.)
+ * record metadata (sequence number, processing offset and length)
+ * expected TCP sequence number
+
+There are no guarantees on record length or record segmentation. In particular
+segments may start at any point of a record and contain any number of records.
+Assuming segments are received in order, the device should be able to perform
+crypto operations and authentication regardless of segmentation. For this
+to be possible device has to keep small amount of segment-to-segment state.
+This includes at least:
+
+ * partial headers (if a segment carried only a part of the TLS header)
+ * partial data block
+ * partial authentication tag (all data had been seen but part of the
+   authentication tag has to be written or read from the subsequent segment)
+
+Record reassembly is not necessary for TLS offload. If the packets arrive
+in order the device should be able to handle them separately and make
+forward progress.
+
+TX
+--
+
+The kernel stack performs record framing reserving space for the authentication
+tag and populating all other TLS header and tailer fields.
+
+Both the device and the driver maintain expected TCP sequence numbers
+due to the possibility of retransmissions and the lack of software fallback
+once the packet reaches the device.
+For segments passed in order, the driver marks the packets with
+a connection identifier (note that a 5-tuple lookup is insufficient to identify
+packets requiring HW offload, see the :ref:`5tuple_problems` section)
+and hands them to the device. The device identifies the packet as requiring
+TLS handling and confirms the sequence number matches its expectation.
+The device performs encryption and authentication of the record data.
+It replaces the authentication tag and TCP checksum with correct values.
+
+RX
+--
+
+Before a packet is DMAed to the host (but after NIC's embedded switching
+and packet transformation functions) the device validates the Layer 4
+checksum and performs a 5-tuple lookup to find any TLS connection the packet
+may belong to (technically a 4-tuple
+lookup is sufficient - IP addresses and TCP port numbers, as the protocol
+is always TCP). If connection is matched device confirms if the TCP sequence
+number is the expected one and proceeds to TLS handling (record delineation,
+decryption, authentication for each record in the packet). The device leaves
+the record framing unmodified, the stack takes care of record decapsulation.
+Device indicates successful handling of TLS offload in the per-packet context
+(descriptor) passed to the host.
+
+Upon reception of a TLS offloaded packet, the driver sets
+the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
+corresponding to the segment. Networking stack makes sure decrypted
+and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
+and takes care of partial decryption.
+
+Resync handling
+===============
+
+In presence of packet drops or network packet reordering, the device may lose
+synchronization with the TLS stream, and require a resync with the kernel's
+TCP stack.
+
+Note that resync is only attempted for connections which were successfully
+added to the device table and are in TLS_HW mode. For example,
+if the table was full when cryptographic state was installed in the kernel,
+such connection will never get offloaded. Therefore the resync request
+does not carry any cryptographic connection state.
+
+TX
+--
+
+Segments transmitted from an offloaded socket can get out of sync
+in similar ways to the receive side-retransmissions - local drops
+are possible, though network reorders are not.
+
+Whenever an out of order segment is transmitted the driver provides
+the device with enough information to perform cryptographic operations.
+This means most likely that the part of the record preceding the current
+segment has to be passed to the device as part of the packet context,
+together with its TCP sequence number and TLS record number. The device
+can then initialize its crypto state, process and discard the preceding
+data (to be able to insert the authentication tag) and move onto handling
+the actual packet.
+
+In this mode depending on the implementation the driver can either ask
+for a continuation with the crypto state and the new sequence number
+(next expected segment is the one after the out of order one), or continue
+with the previous stream state - assuming that the out of order segment
+was just a retransmission. The former is simpler, and does not require
+retransmission detection therefore it is the recommended method until
+such time it is proven inefficient.
+
+RX
+--
+
+A small amount of RX reorder events may not require a full resynchronization.
+In particular the device should not lose synchronization
+when record boundary can be recovered:
+
+.. kernel-figure::  tls-offload-reorder-good.svg
+   :alt:	reorder of non-header segment
+   :align:	center
+
+   Reorder of non-header segment
+
+Green segments are successfully decrypted, blue ones are passed
+as received on wire, red stripes mark start of new records.
+
+In above case segment 1 is received and decrypted successfully.
+Segment 2 was dropped so 3 arrives out of order. The device knows
+the next record starts inside 3, based on record length in segment 1.
+Segment 3 is passed untouched, because due to lack of data from segment 2
+the remainder of the previous record inside segment 3 cannot be handled.
+The device can, however, collect the authentication algorithm's state
+and partial block from the new record in segment 3 and when 4 and 5
+arrive continue decryption. Finally when 2 arrives it's completely outside
+of expected window of the device so it's passed as is without special
+handling. ``ktls`` software fallback handles the decryption of record
+spanning segments 1, 2 and 3. The device did not get out of sync,
+even though two segments did not get decrypted.
+
+Kernel synchronization may be necessary if the lost segment contained
+a record header and arrived after the next record header has already passed:
+
+.. kernel-figure::  tls-offload-reorder-bad.svg
+   :alt:	reorder of header segment
+   :align:	center
+
+   Reorder of segment with a TLS header
+
+In this example segment 2 gets dropped, and it contains a record header.
+Device can only detect that segment 4 also contains a TLS header
+if it knows the length of the previous record from segment 2. In this case
+the device will lose synchronization with the stream.
+
+When the device gets out of sync and the stream reaches TCP sequence
+numbers more than a max size record past the expected TCP sequence number,
+the device starts scanning for a known header pattern. For example
+for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
+in the SSL/TLS version field of the header. Once pattern is matched
+the device continues attempting parsing headers at expected locations
+(based on the length fields at guessed locations).
+Whenever the expected location does not contain a valid header the scan
+is restarted.
+
+When the header is matched the device sends a confirmation request
+to the kernel, asking if the guessed location is correct (if a TLS record
+really starts there), and which record sequence number the given header had.
+The kernel confirms the guessed location was correct and tells the device
+the record sequence number. Meanwhile, the device had been parsing
+and counting all records since the just-confirmed one, it adds the number
+of records it had seen to the record number provided by the kernel.
+At this point the device is in sync and can resume decryption at next
+segment boundary.
+
+In a pathological case the device may latch onto a sequence of matching
+headers and never hear back from the kernel (there is no negative
+confirmation from the kernel). The implementation may choose to periodically
+restart scan. Given how unlikely falsely-matching stream is, however,
+periodic restart is not deemed necessary.
+
+Special care has to be taken if the confirmation request is passed
+asynchronously to the packet stream and record may get processed
+by the kernel before the confirmation request.
+
+Error handling
+==============
+
+TX
+--
+
+Packets may be redirected or rerouted by the stack to a different
+device than the selected TLS offload device. The stack will handle
+such condition using the :c:func:`sk_validate_xmit_skb` helper
+(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
+Offload maintains information about all records until the data is
+fully acknowledged, so if skbs reach the wrong device they can be handled
+by software fallback.
+
+Any device TLS offload handling error on the transmission side must result
+in the packet being dropped. For example if a packet got out of order
+due to a bug in the stack or the device, reached the device and can't
+be encrypted such packet must be dropped.
+
+RX
+--
+
+If the device encounters any problems with TLS offload on the receive
+side it should pass the packet to the host's networking stack as it was
+received on the wire.
+
+For example authentication failure for any record in the segment should
+result in passing the unmodified packet to the software fallback. This means
+packets should not be modified "in place". Splitting segments to handle partial
+decryption is not advised. In other words either all records in the packet
+had been handled successfully and authenticated or the packet has to be passed
+to the host's stack as it was on the wire (recovering original packet in the
+driver if device provides precise error is sufficient).
+
+The Linux networking stack does not provide a way of reporting per-packet
+decryption and authentication errors, packets with errors must simply not
+have the :c:member:`decrypted` mark set.
+
+A packet should also not be handled by the TLS offload if it contains
+incorrect checksums.
+
+Performance metrics
+===================
+
+TLS offload can be characterized by the following basic metrics:
+
+ * max connection count
+ * connection installation rate
+ * connection installation latency
+ * total cryptographic performance
+
+Note that each TCP connection requires a TLS session in both directions,
+the performance may be reported treating each direction separately.
+
+Max connection count
+--------------------
+
+The number of connections device can support can be exposed via
+``devlink resource`` API.
+
+Total cryptographic performance
+-------------------------------
+
+Offload performance may depend on segment and record size.
+
+Overload of the cryptographic subsystem of the device should not have
+significant performance impact on non-offloaded streams.
+
+Statistics
+==========
+
+Following minimum set of TLS-related statistics should be reported
+by the driver:
+
+ * ``rx_tls_decrypted`` - number of successfully decrypted TLS segments
+ * ``tx_tls_encrypted`` - number of in-order TLS segments passed to device
+   for encryption
+ * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
+   but did not arrive in the expected order
+ * ``tx_tls_drop_no_sync_data`` - number of TX packets dropped because
+   they arrived out of order and associated record could not be found
+   (see also :ref:`pre_tls_data`)
+
+Notable corner cases, exceptions and additional requirements
+============================================================
+
+.. _5tuple_problems:
+
+5-tuple matching limitations
+----------------------------
+
+The device can only recognize received packets based on the 5-tuple
+of the socket. Current ``ktls`` implementation will not offload sockets
+routed through software interfaces such as those used for tunneling
+or virtual networking. However, many packet transformations performed
+by the networking stack (most notably any BPF logic) do not require
+any intermediate software device, therefore a 5-tuple match may
+consistently miss at the device level. In such cases the device
+should still be able to perform TX offload (encryption) and should
+fallback cleanly to software decryption (RX).
+
+Out of order
+------------
+
+Introducing extra processing in NICs should not cause packets to be
+transmitted or received out of order, for example pure ACK packets
+should not be reordered with respect to data segments.
+
+Ingress reorder
+---------------
+
+A device is permitted to perform packet reordering for consecutive
+TCP segments (i.e. placing packets in the correct order) but any form
+of additional buffering is disallowed.
+
+Coexistence with standard networking offload features
+-----------------------------------------------------
+
+Offloaded ``ktls`` sockets should support standard TCP stack features
+transparently. Enabling device TLS offload should not cause any difference
+in packets as seen on the wire.
+
+Transport layer transparency
+----------------------------
+
+The device should not modify any packet headers for the purpose
+of the simplifying TLS offload.
+
+The device should not depend on any packet headers beyond what is strictly
+necessary for TLS offload.
+
+Segment drops
+-------------
+
+Dropping packets is acceptable only in the event of catastrophic
+system errors and should never be used as an error handling mechanism
+in cases arising from normal operation. In other words, reliance
+on TCP retransmissions to handle corner cases is not acceptable.
+
+TLS device features
+-------------------
+
+Drivers should ignore the changes to TLS the device feature flags.
+These flags will be acted upon accordingly by the core ``ktls`` code.
+TLS device feature flags only control adding of new TLS connection
+offloads, old connections will remain active after flags are cleared.
+
+Known bugs
+==========
+
+skb_orphan() leaks clear text
+-----------------------------
+
+Currently drivers depend on the :c:member:`sk` member of
+:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
+encryption. Any operation which removes or does not preserve the socket
+association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
+will cause the driver to miss the packets and lead to clear text leaks.
+
+Redirects leak clear text
+-------------------------
+
+In the RX direction, if segment has already been decrypted by the device
+and it gets redirected or mirrored - clear text will be transmitted out.
+
+.. _pre_tls_data:
+
+Transmission of pre-TLS data
+----------------------------
+
+User can enqueue some already encrypted and framed records before enabling
+``ktls`` on the socket. Those records have to get sent as they are. This is
+perfectly easy to handle in the software case - such data will be waiting
+in the TCP layer, TLS ULP won't see it. In the offloaded case when pre-queued
+segment reaches transmission point it appears to be out of order (before the
+expected TCP sequence number) and the stack does not have a record information
+associated.
+
+All segments without record information cannot, however, be assumed to be
+pre-queued data, because a race condition exists between TCP stack queuing
+a retransmission, the driver seeing the retransmission and TCP ACK arriving
+for the retransmitted data.
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@ -1,3 +1,9 @@
+.. _kernel_tls:
+
+==========
+Kernel TLS
+==========
+
 Overview
 ========

@ -12,6 +18,8 @@ Creating a TLS connection

 First create a new TCP socket and set the TLS ULP.

+.. code-block:: c
+
  sock = socket(AF_INET, SOCK_STREAM, 0);
  setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

@ -21,6 +29,8 @@ handshake is complete, we have all the parameters required to move the
 data-path to the kernel. There is a separate socket option for moving
 the transmit and the receive into the kernel.

+.. code-block:: c
+
  /* From linux/tls.h */
  struct tls_crypto_info {
          unsigned short version;
@ -58,6 +68,8 @@ After setting the TLS_TX socket option all application data sent over this
 socket is encrypted using TLS and the parameters provided in the socket option.
 For example, we can send an encrypted hello world record as follows:

+.. code-block:: c
+
  const char *msg = "hello world\n";
  send(sock, msg, strlen(msg));

@ -67,6 +79,8 @@ to the encrypted kernel send buffer if possible.
 The sendfile system call will send the file's data over TLS records of maximum
 length (2^14).

+.. code-block:: c
+
  file = open(filename, O_RDONLY);
  fstat(file, &stat);
  sendfile(sock, file, &offset, stat.st_size);
@ -89,6 +103,8 @@ After setting the TLS_RX socket option, all recv family socket calls
 are decrypted using TLS parameters provided.  A full TLS record must
 be received before decryption can happen.

+.. code-block:: c
+
  char buffer[16384];
  recv(sock, buffer, 16384);

@ -97,12 +113,12 @@ large enough, and no additional allocations occur.  If the userspace
 buffer is too small, data is decrypted in the kernel and copied to
 userspace.

-EINVAL is returned if the TLS version in the received message does not
+``EINVAL`` is returned if the TLS version in the received message does not
 match the version passed in setsockopt.

-EMSGSIZE is returned if the received message is too big.
+``EMSGSIZE`` is returned if the received message is too big.

-EBADMSG is returned if decryption failed for any other reason.
+``EBADMSG`` is returned if decryption failed for any other reason.

 Send TLS control messages
 -------------------------
@ -113,7 +129,9 @@ These messages can be sent over the socket by providing the TLS record type
 via a CMSG. For example the following function sends @data of @length bytes
 using a record of type @record_type.

-/* send TLS control message using record_type */
+.. code-block:: c
+
+  /* send TLS control message using record_type */
  static int klts_send_ctrl_message(int sock, unsigned char record_type,
                                    void *data, size_t length)
  {
@ -151,6 +169,8 @@ type passed via cmsg.  If no cmsg buffer is provided, an error is
 returned if a control message is received.  Data messages may be
 received without a cmsg buffer set.

+.. code-block:: c
+
  char buffer[16384];
  char cmsg[CMSG_SPACE(sizeof(unsigned char))];
  struct msghdr msg = {0};
@ -186,12 +206,10 @@ Integrating in to userspace TLS library
 At a high level, the kernel TLS ULP is a replacement for the record
 layer of a userspace TLS library.

-A patchset to OpenSSL to use ktls as the record layer is here:
+A patchset to OpenSSL to use ktls as the record layer is
+`here <https://github.com/Mellanox/openssl/commits/tls_rx2>`_.

-https://github.com/Mellanox/openssl/commits/tls_rx2
-
-An example of calling send directly after a handshake using
-gnutls.  Since it doesn't implement a full record layer, control
-messages are not supported:
-
-https://github.com/ktls/af_ktls-tool/commits/RX
+`An example <https://github.com/ktls/af_ktls-tool/commits/RX>`_
+of calling send directly after a handshake using gnutls.
+Since it doesn't implement a full record layer, control
+messages are not supported.
--- a/drivers/isdn/mISDN/dsp_cmx.c
+++ b/drivers/isdn/mISDN/dsp_cmx.c
@ -1851,14 +1851,14 @@ dsp_cmx_send(void *arg)

 	/* unlock */
 	spin_unlock_irqrestore(&dsp_lock, flags);
-	}
+}

 /*
 * audio data is transmitted from upper layer to the dsp
 */
-	void
-		dsp_cmx_transmit(struct dsp *dsp, struct sk_buff *skb)
-	{
+void
+dsp_cmx_transmit(struct dsp *dsp, struct sk_buff *skb)
+{
 	u_int w, ww;
 	u8 *d, *p;
 	int space; /* todo: , l = skb->len; */
@ -1884,7 +1884,6 @@ dsp_cmx_send(void *arg)
 		/* write until all byte are copied */
 		ww = (w + skb->len) & CMX_BUFF_MASK;
 	dsp->tx_W = ww;
-
 		/* show current buffer */
 #ifdef CMX_DEBUG
 	printk(KERN_DEBUG
@ -1908,14 +1907,14 @@ dsp_cmx_send(void *arg)
 	printk(KERN_DEBUG "%s\n", debugbuf);
 #endif

-	}
+}

 /*
 * hdlc data is received from card and sent to all members.
 */
-	void
-		dsp_cmx_hdlc(struct dsp *dsp, struct sk_buff *skb)
-	{
+void
+dsp_cmx_hdlc(struct dsp *dsp, struct sk_buff *skb)
+{
 	struct sk_buff *nskb = NULL;
 	struct dsp_conf_member *member;
 	struct mISDNhead *hh;
@ -1958,4 +1957,4 @@ dsp_cmx_send(void *arg)
 			}
 		}
 	}
-	}
+}
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@ -393,7 +393,7 @@ data_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 			memcpy(di.channelmap, dev->channelmap,
 			       sizeof(di.channelmap));
 			di.nrbchan = dev->nrbchan;
-			strcpy(di.name, dev_name(&dev->dev));
+			strscpy(di.name, dev_name(&dev->dev), sizeof(di.name));
 			if (copy_to_user((void __user *)arg, &di, sizeof(di)))
 				err = -EFAULT;
 		} else
@ -676,7 +676,7 @@ base_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 			memcpy(di.channelmap, dev->channelmap,
 			       sizeof(di.channelmap));
 			di.nrbchan = dev->nrbchan;
-			strcpy(di.name, dev_name(&dev->dev));
+			strscpy(di.name, dev_name(&dev->dev), sizeof(di.name));
 			if (copy_to_user((void __user *)arg, &di, sizeof(di)))
 				err = -EFAULT;
 		} else
@ -690,6 +690,7 @@ base_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 			err = -EFAULT;
 			break;
 		}
+		dn.name[sizeof(dn.name) - 1] = '\0';
 		dev = get_mdevice(dn.id);
 		if (dev)
 			err = device_rename(&dev->dev, dn.name);
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@ -3122,13 +3122,18 @@ static int bond_slave_netdev_event(unsigned long event,
 	case NETDEV_CHANGE:
 		/* For 802.3ad mode only:
 		 * Getting invalid Speed/Duplex values here will put slave
-		 * in weird state. So mark it as link-fail for the time
-		 * being and let link-monitoring (miimon) set it right when
-		 * correct speeds/duplex are available.
+		 * in weird state. Mark it as link-fail if the link was
+		 * previously up or link-down if it hasn't yet come up, and
+		 * let link-monitoring (miimon) set it right when correct
+		 * speeds/duplex are available.
 		 */
 		if (bond_update_speed_duplex(slave) &&
-		    BOND_MODE(bond) == BOND_MODE_8023AD)
+		    BOND_MODE(bond) == BOND_MODE_8023AD) {
+			if (slave->last_link_up)
 				slave->link = BOND_LINK_FAIL;
+			else
+				slave->link = BOND_LINK_DOWN;
+		}

 		if (BOND_MODE(bond) == BOND_MODE_8023AD)
 			bond_3ad_adapter_speed_duplex_changed(slave);
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@ -785,7 +785,7 @@ static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip,
 			err = mv88e6xxx_port_read(chip, port, s->reg + 1, &reg);
 			if (err)
 				return U64_MAX;
-			high = reg;
+			low |= ((u32)reg) << 16;
 		}
 		break;
 	case STATS_TYPE_BANK1:
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@ -223,10 +223,10 @@ void aq_ring_queue_stop(struct aq_ring_s *ring)
 bool aq_ring_tx_clean(struct aq_ring_s *self)
 {
 	struct device *dev = aq_nic_get_dev(self->aq_nic);
-	unsigned int budget = AQ_CFG_TX_CLEAN_BUDGET;
+	unsigned int budget;

-	for (; self->sw_head != self->hw_head && budget--;
-		self->sw_head = aq_ring_next_dx(self, self->sw_head)) {
+	for (budget = AQ_CFG_TX_CLEAN_BUDGET;
+	     budget && self->sw_head != self->hw_head; budget--) {
 		struct aq_ring_buff_s *buff = &self->buff_ring[self->sw_head];

 		if (likely(buff->is_mapped)) {
@ -251,6 +251,7 @@ bool aq_ring_tx_clean(struct aq_ring_s *self)

 		buff->pa = 0U;
 		buff->eop_index = 0xffffU;
+		self->sw_head = aq_ring_next_dx(self, self->sw_head);
 	}

 	return !!budget;
@ -298,35 +299,47 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 		unsigned int i = 0U;
 		u16 hdr_len;

-		if (buff->is_error)
-			continue;
-
 		if (buff->is_cleaned)
 			continue;

 		if (!buff->is_eop) {
-			for (next_ = buff->next,
-			     buff_ = &self->buff_ring[next_]; true;
+			buff_ = buff;
+			do {
 				next_ = buff_->next,
-			     buff_ = &self->buff_ring[next_]) {
+				buff_ = &self->buff_ring[next_];
 				is_rsc_completed =
 					aq_ring_dx_in_range(self->sw_head,
 							    next_,
 							    self->hw_head);

-				if (unlikely(!is_rsc_completed)) {
-					is_rsc_completed = false;
+				if (unlikely(!is_rsc_completed))
 					break;
-				}

-				if (buff_->is_eop)
-					break;
-			}
+				buff->is_error |= buff_->is_error;
+
+			} while (!buff_->is_eop);

 			if (!is_rsc_completed) {
 				err = 0;
 				goto err_exit;
 			}
+			if (buff->is_error) {
+				buff_ = buff;
+				do {
+					next_ = buff_->next,
+					buff_ = &self->buff_ring[next_];
+
+					buff_->is_cleaned = true;
+				} while (!buff_->is_eop);
+
+				++self->stats.rx.errors;
+				continue;
+			}
+		}
+
+		if (buff->is_error) {
+			++self->stats.rx.errors;
+			continue;
 		}

 		dma_sync_single_range_for_cpu(aq_nic_get_dev(self->aq_nic),
@ -389,6 +402,12 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 							AQ_CFG_RX_FRAME_MAX);
 					page_ref_inc(buff_->rxdata.page);
 					buff_->is_cleaned = 1;
+
+					buff->is_ip_cso &= buff_->is_ip_cso;
+					buff->is_udp_cso &= buff_->is_udp_cso;
+					buff->is_tcp_cso &= buff_->is_tcp_cso;
+					buff->is_cso_err |= buff_->is_cso_err;
+
 				} while (!buff_->is_eop);
 			}
 		}
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@ -266,12 +266,11 @@ static int hw_atl_b0_hw_offload_set(struct aq_hw_s *self,
 		 */
 		hw_atl_rpo_lro_max_coalescing_interval_set(self, 50);

-
 		hw_atl_rpo_lro_qsessions_lim_set(self, 1U);

 		hw_atl_rpo_lro_total_desc_lim_set(self, 2U);

-		hw_atl_rpo_lro_patch_optimization_en_set(self, 0U);
+		hw_atl_rpo_lro_patch_optimization_en_set(self, 1U);

 		hw_atl_rpo_lro_min_pay_of_first_pkt_set(self, 10U);

@ -713,7 +712,7 @@ static int hw_atl_b0_hw_ring_rx_receive(struct aq_hw_s *self,
 		if ((rx_stat & BIT(0)) || rxd_wb->type & 0x1000U) {
 			/* MAC error or DMA error */
 			buff->is_error = 1U;
-		} else {
+		}
 		if (self->aq_nic_cfg->is_rss) {
 			/* last 4 byte */
 			u16 rss_type = rxd_wb->type & 0xFU;
@ -733,6 +732,10 @@ static int hw_atl_b0_hw_ring_rx_receive(struct aq_hw_s *self,
 			buff->next = 0U;
 			buff->is_eop = 1U;
 		} else {
+			buff->len =
+				rxd_wb->pkt_len > AQ_CFG_RX_FRAME_MAX ?
+				AQ_CFG_RX_FRAME_MAX : rxd_wb->pkt_len;
+
 			if (HW_ATL_B0_RXD_WB_STAT2_RSCCNT &
 				rxd_wb->status) {
 				/* LRO */
@ -747,7 +750,6 @@ static int hw_atl_b0_hw_ring_rx_receive(struct aq_hw_s *self,
 			}
 		}
 	}
-	}

 	return aq_hw_err_from_flags(self);
 }
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@ -1642,6 +1642,8 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 		skb = bnxt_copy_skb(bnapi, data_ptr, len, dma_addr);
 		bnxt_reuse_rx_data(rxr, cons, data);
 		if (!skb) {
+			if (agg_bufs)
+				bnxt_reuse_rx_agg_bufs(cpr, cp_cons, agg_bufs);
 			rc = -ENOMEM;
 			goto next_rx;
 		}
@ -6377,7 +6379,7 @@ static int bnxt_alloc_ctx_mem(struct bnxt *bp)
 	if (!ctx || (ctx->flags & BNXT_CTX_FLAG_INITED))
 		return 0;

-	if (bp->flags & BNXT_FLAG_ROCE_CAP) {
+	if ((bp->flags & BNXT_FLAG_ROCE_CAP) && !is_kdump_kernel()) {
 		pg_lvl = 2;
 		extra_qps = 65536;
 		extra_srqs = 8192;
@ -7616,22 +7618,23 @@ static void bnxt_clear_int_mode(struct bnxt *bp)
 	bp->flags &= ~BNXT_FLAG_USING_MSIX;
 }

-int bnxt_reserve_rings(struct bnxt *bp)
+int bnxt_reserve_rings(struct bnxt *bp, bool irq_re_init)
 {
 	int tcs = netdev_get_num_tc(bp->dev);
-	bool reinit_irq = false;
+	bool irq_cleared = false;
 	int rc;

 	if (!bnxt_need_reserve_rings(bp))
 		return 0;

-	if (BNXT_NEW_RM(bp) && (bnxt_get_num_msix(bp) != bp->total_irqs)) {
+	if (irq_re_init && BNXT_NEW_RM(bp) &&
+	    bnxt_get_num_msix(bp) != bp->total_irqs) {
 		bnxt_ulp_irq_stop(bp);
 		bnxt_clear_int_mode(bp);
-		reinit_irq = true;
+		irq_cleared = true;
 	}
 	rc = __bnxt_reserve_rings(bp);
-	if (reinit_irq) {
+	if (irq_cleared) {
 		if (!rc)
 			rc = bnxt_init_int_mode(bp);
 		bnxt_ulp_irq_restart(bp, rc);
@ -8530,7 +8533,7 @@ static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
 			return rc;
 		}
 	}
-	rc = bnxt_reserve_rings(bp);
+	rc = bnxt_reserve_rings(bp, irq_re_init);
 	if (rc)
 		return rc;
 	if ((bp->flags & BNXT_FLAG_RFS) &&
@ -10434,7 +10437,7 @@ static int bnxt_set_dflt_rings(struct bnxt *bp, bool sh)

 	if (sh)
 		bp->flags |= BNXT_FLAG_SHARED_RINGS;
-	dflt_rings = netif_get_num_default_rss_queues();
+	dflt_rings = is_kdump_kernel() ? 1 : netif_get_num_default_rss_queues();
 	/* Reduce default rings on multi-port cards so that total default
 	 * rings do not exceed CPU count.
 	 */
@ -10722,11 +10725,12 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto init_err_pci_clean;
 	}

+	if (BNXT_PF(bp)) {
 		/* Read the adapter's DSN to use as the eswitch switch_id */
 		rc = bnxt_pcie_dsn_get(bp, bp->switch_id);
 		if (rc)
 			goto init_err_pci_clean;
-
+	}
 	bnxt_hwrm_func_qcfg(bp);
 	bnxt_hwrm_vnic_qcaps(bp);
 	bnxt_hwrm_port_led_qcaps(bp);
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@ -20,6 +20,7 @@

 #include <linux/interrupt.h>
 #include <linux/rhashtable.h>
+#include <linux/crash_dump.h>
 #include <net/devlink.h>
 #include <net/dst_metadata.h>
 #include <net/xdp.h>
@ -1369,7 +1370,8 @@ struct bnxt {
 #define BNXT_CHIP_TYPE_NITRO_A0(bp) ((bp)->flags & BNXT_FLAG_CHIP_NITRO_A0)
 #define BNXT_RX_PAGE_MODE(bp)	((bp)->flags & BNXT_FLAG_RX_PAGE_MODE)
 #define BNXT_SUPPORTS_TPA(bp)	(!BNXT_CHIP_TYPE_NITRO_A0(bp) &&	\
-				 !(bp->flags & BNXT_FLAG_CHIP_P5))
+				 !(bp->flags & BNXT_FLAG_CHIP_P5) &&	\
+				 !is_kdump_kernel())

 /* Chip class phase 5 */
 #define BNXT_CHIP_P5(bp)			\
@ -1790,7 +1792,7 @@ unsigned int bnxt_get_avail_stat_ctxs_for_en(struct bnxt *bp);
 unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp);
 unsigned int bnxt_get_avail_cp_rings_for_en(struct bnxt *bp);
 int bnxt_get_avail_msix(struct bnxt *bp, int num);
-int bnxt_reserve_rings(struct bnxt *bp);
+int bnxt_reserve_rings(struct bnxt *bp, bool irq_re_init);
 void bnxt_tx_disable(struct bnxt *bp);
 void bnxt_tx_enable(struct bnxt *bp);
 int bnxt_hwrm_set_pause(struct bnxt *);
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@ -831,7 +831,7 @@ static int bnxt_set_channels(struct net_device *dev,
 			 */
 		}
 	} else {
-		rc = bnxt_reserve_rings(bp);
+		rc = bnxt_reserve_rings(bp, true);
 	}

 	return rc;
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
@ -147,7 +147,7 @@ static int bnxt_req_msix_vecs(struct bnxt_en_dev *edev, int ulp_id,
 			bnxt_close_nic(bp, true, false);
 			rc = bnxt_open_nic(bp, true, false);
 		} else {
-			rc = bnxt_reserve_rings(bp);
+			rc = bnxt_reserve_rings(bp, true);
 		}
 	}
 	if (rc) {
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@ -1080,6 +1080,11 @@ struct macb_ptp_info {
 			 struct ifreq *ifr, int cmd);
 };

+struct macb_pm_data {
+	u32 scrt2;
+	u32 usrio;
+};
+
 struct macb_config {
 	u32			caps;
 	unsigned int		dma_burst_length;
@ -1220,6 +1225,8 @@ struct macb {
 	int	tx_bd_rd_prefetch;

 	u32	rx_intr_mask;
+
+	struct macb_pm_data pm_data;
 };

 #ifdef CONFIG_MACB_USE_HWSTAMP
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@ -2849,10 +2849,14 @@ static int macb_get_ts_info(struct net_device *netdev,

 static void gem_enable_flow_filters(struct macb *bp, bool enable)
 {
+	struct net_device *netdev = bp->dev;
 	struct ethtool_rx_fs_item *item;
 	u32 t2_scr;
 	int num_t2_scr;

+	if (!(netdev->features & NETIF_F_NTUPLE))
+		return;
+
 	num_t2_scr = GEM_BFEXT(T2SCR, gem_readl(bp, DCFG8));

 	list_for_each_entry(item, &bp->rx_fs_list.list, list) {
@ -3012,7 +3016,6 @@ static int gem_add_flow_filter(struct net_device *netdev,
 	gem_prog_cmp_regs(bp, fs);
 	bp->rx_fs_list.count++;
 	/* enable filtering if NTUPLE on */
-	if (netdev->features & NETIF_F_NTUPLE)
 	gem_enable_flow_filters(bp, 1);

 	spin_unlock_irqrestore(&bp->rx_fs_lock, flags);
@ -3201,6 +3204,50 @@ static int macb_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 }

+static inline void macb_set_txcsum_feature(struct macb *bp,
+					   netdev_features_t features)
+{
+	u32 val;
+
+	if (!macb_is_gem(bp))
+		return;
+
+	val = gem_readl(bp, DMACFG);
+	if (features & NETIF_F_HW_CSUM)
+		val |= GEM_BIT(TXCOEN);
+	else
+		val &= ~GEM_BIT(TXCOEN);
+
+	gem_writel(bp, DMACFG, val);
+}
+
+static inline void macb_set_rxcsum_feature(struct macb *bp,
+					   netdev_features_t features)
+{
+	struct net_device *netdev = bp->dev;
+	u32 val;
+
+	if (!macb_is_gem(bp))
+		return;
+
+	val = gem_readl(bp, NCFGR);
+	if ((features & NETIF_F_RXCSUM) && !(netdev->flags & IFF_PROMISC))
+		val |= GEM_BIT(RXCOEN);
+	else
+		val &= ~GEM_BIT(RXCOEN);
+
+	gem_writel(bp, NCFGR, val);
+}
+
+static inline void macb_set_rxflow_feature(struct macb *bp,
+					   netdev_features_t features)
+{
+	if (!macb_is_gem(bp))
+		return;
+
+	gem_enable_flow_filters(bp, !!(features & NETIF_F_NTUPLE));
+}
+
 static int macb_set_features(struct net_device *netdev,
 			     netdev_features_t features)
 {
@ -3208,39 +3255,35 @@ static int macb_set_features(struct net_device *netdev,
 	netdev_features_t changed = features ^ netdev->features;

 	/* TX checksum offload */
-	if ((changed & NETIF_F_HW_CSUM) && macb_is_gem(bp)) {
-		u32 dmacfg;
-
-		dmacfg = gem_readl(bp, DMACFG);
-		if (features & NETIF_F_HW_CSUM)
-			dmacfg |= GEM_BIT(TXCOEN);
-		else
-			dmacfg &= ~GEM_BIT(TXCOEN);
-		gem_writel(bp, DMACFG, dmacfg);
-	}
+	if (changed & NETIF_F_HW_CSUM)
+		macb_set_txcsum_feature(bp, features);

 	/* RX checksum offload */
-	if ((changed & NETIF_F_RXCSUM) && macb_is_gem(bp)) {
-		u32 netcfg;
-
-		netcfg = gem_readl(bp, NCFGR);
-		if (features & NETIF_F_RXCSUM &&
-		    !(netdev->flags & IFF_PROMISC))
-			netcfg |= GEM_BIT(RXCOEN);
-		else
-			netcfg &= ~GEM_BIT(RXCOEN);
-		gem_writel(bp, NCFGR, netcfg);
-	}
+	if (changed & NETIF_F_RXCSUM)
+		macb_set_rxcsum_feature(bp, features);

 	/* RX Flow Filters */
-	if ((changed & NETIF_F_NTUPLE) && macb_is_gem(bp)) {
-		bool turn_on = features & NETIF_F_NTUPLE;
+	if (changed & NETIF_F_NTUPLE)
+		macb_set_rxflow_feature(bp, features);

-		gem_enable_flow_filters(bp, turn_on);
-	}
 	return 0;
 }

+static void macb_restore_features(struct macb *bp)
+{
+	struct net_device *netdev = bp->dev;
+	netdev_features_t features = netdev->features;
+
+	/* TX checksum offload */
+	macb_set_txcsum_feature(bp, features);
+
+	/* RX checksum offload */
+	macb_set_rxcsum_feature(bp, features);
+
+	/* RX Flow Filters */
+	macb_set_rxflow_feature(bp, features);
+}
+
 static const struct net_device_ops macb_netdev_ops = {
 	.ndo_open		= macb_open,
 	.ndo_stop		= macb_close,
@ -4273,6 +4316,12 @@ static int __maybe_unused macb_suspend(struct device *dev)
 		spin_lock_irqsave(&bp->lock, flags);
 		macb_reset_hw(bp);
 		spin_unlock_irqrestore(&bp->lock, flags);
+
+		if (!(bp->caps & MACB_CAPS_USRIO_DISABLED))
+			bp->pm_data.usrio = macb_or_gem_readl(bp, USRIO);
+
+		if (netdev->hw_features & NETIF_F_NTUPLE)
+			bp->pm_data.scrt2 = gem_readl_n(bp, ETHT, SCRT2_ETHT);
 	}

 	netif_carrier_off(netdev);
@ -4301,6 +4350,13 @@ static int __maybe_unused macb_resume(struct device *dev)
 		disable_irq_wake(bp->queues[0].irq);
 	} else {
 		macb_writel(bp, NCR, MACB_BIT(MPE));
+
+		if (netdev->hw_features & NETIF_F_NTUPLE)
+			gem_writel_n(bp, ETHT, SCRT2_ETHT, bp->pm_data.scrt2);
+
+		if (!(bp->caps & MACB_CAPS_USRIO_DISABLED))
+			macb_or_gem_writel(bp, USRIO, bp->pm_data.usrio);
+
 		for (q = 0, queue = bp->queues; q < bp->num_queues;
 		     ++q, ++queue)
 			napi_enable(&queue->napi);
@ -4312,6 +4368,7 @@ static int __maybe_unused macb_resume(struct device *dev)
 	bp->macbgem_ops.mog_init_rings(bp);
 	macb_init_hw(bp);
 	macb_set_rx_mode(netdev);
+	macb_restore_features(bp);
 	netif_device_attach(netdev);
 	if (bp->ptp_info)
 		bp->ptp_info->ptp_init(netdev);
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@ -197,6 +197,9 @@ static void cxgb4_process_flow_match(struct net_device *dev,
 		fs->val.ivlan = vlan_tci;
 		fs->mask.ivlan = vlan_tci_mask;

+		fs->val.ivlan_vld = 1;
+		fs->mask.ivlan_vld = 1;
+
 		/* Chelsio adapters use ivlan_vld bit to match vlan packets
 		 * as 802.1Q. Also, when vlan tag is present in packets,
 		 * ethtype match is used then to match on ethtype of inner
@ -207,8 +210,6 @@ static void cxgb4_process_flow_match(struct net_device *dev,
 		 * ethtype value with ethtype of inner header.
 		 */
 		if (fs->val.ethtype == ETH_P_8021Q) {
-			fs->val.ivlan_vld = 1;
-			fs->mask.ivlan_vld = 1;
 			fs->val.ethtype = 0;
 			fs->mask.ethtype = 0;
 		}
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@ -7253,10 +7253,21 @@ int t4_fixup_host_params(struct adapter *adap, unsigned int page_size,
 			 unsigned int cache_line_size)
 {
 	unsigned int page_shift = fls(page_size) - 1;
+	unsigned int sge_hps = page_shift - 10;
 	unsigned int stat_len = cache_line_size > 64 ? 128 : 64;
 	unsigned int fl_align = cache_line_size < 32 ? 32 : cache_line_size;
 	unsigned int fl_align_log = fls(fl_align) - 1;

+	t4_write_reg(adap, SGE_HOST_PAGE_SIZE_A,
+		     HOSTPAGESIZEPF0_V(sge_hps) |
+		     HOSTPAGESIZEPF1_V(sge_hps) |
+		     HOSTPAGESIZEPF2_V(sge_hps) |
+		     HOSTPAGESIZEPF3_V(sge_hps) |
+		     HOSTPAGESIZEPF4_V(sge_hps) |
+		     HOSTPAGESIZEPF5_V(sge_hps) |
+		     HOSTPAGESIZEPF6_V(sge_hps) |
+		     HOSTPAGESIZEPF7_V(sge_hps));
+
 	if (is_t4(adap->params.chip)) {
 		t4_set_reg_field(adap, SGE_CONTROL_A,
 				 INGPADBOUNDARY_V(INGPADBOUNDARY_M) |
--- a/drivers/net/ethernet/dec/tulip/de4x5.c
+++ b/drivers/net/ethernet/dec/tulip/de4x5.c
@ -2107,7 +2107,6 @@ static struct eisa_driver de4x5_eisa_driver = {
 		.remove  = de4x5_eisa_remove,
        }
 };
-MODULE_DEVICE_TABLE(eisa, de4x5_eisa_ids);
 #endif

 #ifdef CONFIG_PCI
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@ -780,7 +780,7 @@ static void dpaa_eth_add_channel(u16 channel)
 	struct qman_portal *portal;
 	int cpu;

-	for_each_cpu(cpu, cpus) {
+	for_each_cpu_and(cpu, cpus, cpu_online_mask) {
 		portal = qman_get_affine_portal(cpu);
 		qman_p_static_dequeue_add(portal, pool);
 	}
@ -896,7 +896,7 @@ static void dpaa_fq_setup(struct dpaa_priv *priv,
 	u16 channels[NR_CPUS];
 	struct dpaa_fq *fq;

-	for_each_cpu(cpu, affine_cpus)
+	for_each_cpu_and(cpu, affine_cpus, cpu_online_mask)
 		channels[num_portals++] = qman_affine_channel(cpu);

 	if (num_portals == 0)
@ -2174,7 +2174,6 @@ static int dpaa_eth_poll(struct napi_struct *napi, int budget)
 	if (cleaned < budget) {
 		napi_complete_done(napi, cleaned);
 		qman_p_irqsource_add(np->p, QM_PIRQ_DQRI);
-
 	} else if (np->down) {
 		qman_p_irqsource_add(np->p, QM_PIRQ_DQRI);
 	}
@ -2448,7 +2447,7 @@ static void dpaa_eth_napi_enable(struct dpaa_priv *priv)
 	struct dpaa_percpu_priv *percpu_priv;
 	int i;

-	for_each_possible_cpu(i) {
+	for_each_online_cpu(i) {
 		percpu_priv = per_cpu_ptr(priv->percpu_priv, i);

 		percpu_priv->np.down = 0;
@ -2461,7 +2460,7 @@ static void dpaa_eth_napi_disable(struct dpaa_priv *priv)
 	struct dpaa_percpu_priv *percpu_priv;
 	int i;

-	for_each_possible_cpu(i) {
+	for_each_online_cpu(i) {
 		percpu_priv = per_cpu_ptr(priv->percpu_priv, i);

 		percpu_priv->np.down = 1;
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
@ -569,7 +569,7 @@ static int dpaa_set_coalesce(struct net_device *dev,
 	qman_dqrr_get_ithresh(portal, &prev_thresh);

 	/* set new values */
-	for_each_cpu(cpu, cpus) {
+	for_each_cpu_and(cpu, cpus, cpu_online_mask) {
 		portal = qman_get_affine_portal(cpu);
 		res = qman_portal_set_iperiod(portal, period);
 		if (res)
@ -586,7 +586,7 @@ static int dpaa_set_coalesce(struct net_device *dev,

 revert_values:
 	/* restore previous values */
-	for_each_cpu(cpu, cpus) {
+	for_each_cpu_and(cpu, cpus, cpu_online_mask) {
 		if (!needs_revert[cpu])
 			continue;
 		portal = qman_get_affine_portal(cpu);
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@ -1972,7 +1972,7 @@ alloc_channel(struct dpaa2_eth_priv *priv)

 	channel->dpcon = setup_dpcon(priv);
 	if (IS_ERR_OR_NULL(channel->dpcon)) {
-		err = PTR_ERR(channel->dpcon);
+		err = PTR_ERR_OR_ZERO(channel->dpcon);
 		goto err_setup;
 	}

@ -2028,7 +2028,7 @@ static int setup_dpio(struct dpaa2_eth_priv *priv)
 		/* Try to allocate a channel */
 		channel = alloc_channel(priv);
 		if (IS_ERR_OR_NULL(channel)) {
-			err = PTR_ERR(channel);
+			err = PTR_ERR_OR_ZERO(channel);
 			if (err != -EPROBE_DEFER)
 				dev_info(dev,
 					 "No affine channel for cpu %d and above\n", i);
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@ -467,7 +467,7 @@ enum dpaa2_eth_rx_dist {
 #define DPAA2_ETH_DIST_IPPROTO		BIT(6)
 #define DPAA2_ETH_DIST_L4SRC		BIT(7)
 #define DPAA2_ETH_DIST_L4DST		BIT(8)
-#define DPAA2_ETH_DIST_ALL		(~0U)
+#define DPAA2_ETH_DIST_ALL		(~0ULL)

 static inline
 unsigned int dpaa2_eth_needed_headroom(struct dpaa2_eth_priv *priv,
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
@ -4,6 +4,7 @@
 */

 #include <linux/net_tstamp.h>
+#include <linux/nospec.h>

 #include "dpni.h"	/* DPNI_LINK_OPT_* */
 #include "dpaa2-eth.h"
@ -648,6 +649,8 @@ static int dpaa2_eth_get_rxnfc(struct net_device *net_dev,
 	case ETHTOOL_GRXCLSRULE:
 		if (rxnfc->fs.location >= max_rules)
 			return -EINVAL;
+		rxnfc->fs.location = array_index_nospec(rxnfc->fs.location,
+							max_rules);
 		if (!priv->cls_rules[rxnfc->fs.location].in_use)
 			return -EINVAL;
 		rxnfc->fs = priv->cls_rules[rxnfc->fs.location].fs;
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@ -3556,7 +3556,7 @@ failed_init:
 	if (fep->reg_phy)
 		regulator_disable(fep->reg_phy);
 failed_reset:
-	pm_runtime_put(&pdev->dev);
+	pm_runtime_put_noidle(&pdev->dev);
 	pm_runtime_disable(&pdev->dev);
 failed_regulator:
 	clk_disable_unprepare(fep->clk_ahb);
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@ -4674,7 +4674,7 @@ static int mvneta_probe(struct platform_device *pdev)
 	err = register_netdev(dev);
 	if (err < 0) {
 		dev_err(&pdev->dev, "failed to register\n");
-		goto err_free_stats;
+		goto err_netdev;
 	}

 	netdev_info(dev, "Using %s mac address %pM\n", mac_from,
@ -4685,14 +4685,12 @@ static int mvneta_probe(struct platform_device *pdev)
 	return 0;

 err_netdev:
-	unregister_netdev(dev);
 	if (pp->bm_priv) {
 		mvneta_bm_pool_destroy(pp->bm_priv, pp->pool_long, 1 << pp->id);
 		mvneta_bm_pool_destroy(pp->bm_priv, pp->pool_short,
 				       1 << pp->id);
 		mvneta_bm_put(pp->bm_priv);
 	}
-err_free_stats:
 	free_percpu(pp->stats);
 err_free_ports:
 	free_percpu(pp->ports);
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c
@ -1271,6 +1271,9 @@ int mvpp2_ethtool_cls_rule_ins(struct mvpp2_port *port,
 	if (ret)
 		goto clean_eth_rule;

+	ethtool_rx_flow_rule_destroy(ethtool_rule);
+	efs->rule.flow = NULL;
+
 	memcpy(&efs->rxnfc, info, sizeof(*info));
 	port->rfs_rules[efs->rule.loc] = efs;
 	port->n_rfs_rules++;
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@ -1455,7 +1455,7 @@ static inline void mvpp2_xlg_max_rx_size_set(struct mvpp2_port *port)
 /* Set defaults to the MVPP2 port */
 static void mvpp2_defaults_set(struct mvpp2_port *port)
 {
-	int tx_port_num, val, queue, ptxq, lrxq;
+	int tx_port_num, val, queue, lrxq;

 	if (port->priv->hw_version == MVPP21) {
 		/* Update TX FIFO MIN Threshold */
@ -1476,11 +1476,9 @@ static void mvpp2_defaults_set(struct mvpp2_port *port)
 	mvpp2_write(port->priv, MVPP2_TXP_SCHED_FIXED_PRIO_REG, 0);

 	/* Close bandwidth for all queues */
-	for (queue = 0; queue < MVPP2_MAX_TXQ; queue++) {
-		ptxq = mvpp2_txq_phys(port->id, queue);
+	for (queue = 0; queue < MVPP2_MAX_TXQ; queue++)
 		mvpp2_write(port->priv,
-			    MVPP2_TXQ_SCHED_TOKEN_CNTR_REG(ptxq), 0);
-	}
+			    MVPP2_TXQ_SCHED_TOKEN_CNTR_REG(queue), 0);

 	/* Set refill period to 1 usec, refill tokens
 	 * and bucket size to maximum
@ -2336,7 +2334,7 @@ static void mvpp2_txq_deinit(struct mvpp2_port *port,
 	txq->descs_dma         = 0;

 	/* Set minimum bandwidth for disabled TXQs */
-	mvpp2_write(port->priv, MVPP2_TXQ_SCHED_TOKEN_CNTR_REG(txq->id), 0);
+	mvpp2_write(port->priv, MVPP2_TXQ_SCHED_TOKEN_CNTR_REG(txq->log_id), 0);

 	/* Set Tx descriptors queue starting address and size */
 	thread = mvpp2_cpu_to_thread(port->priv, get_cpu());
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@ -3687,6 +3687,12 @@ static netdev_features_t mlx5e_fix_features(struct net_device *netdev,
 			netdev_warn(netdev, "Disabling LRO, not supported in legacy RQ\n");
 	}

+	if (MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS)) {
+		features &= ~NETIF_F_RXHASH;
+		if (netdev->features & NETIF_F_RXHASH)
+			netdev_warn(netdev, "Disabling rxhash, not supported when CQE compress is active\n");
+	}
+
 	mutex_unlock(&priv->state_lock);

 	return features;
@ -3812,6 +3818,9 @@ int mlx5e_hwstamp_set(struct mlx5e_priv *priv, struct ifreq *ifr)
 	memcpy(&priv->tstamp, &config, sizeof(config));
 	mutex_unlock(&priv->state_lock);

+	/* might need to fix some features */
+	netdev_update_features(priv->netdev);
+
 	return copy_to_user(ifr->ifr_data, &config,
 			    sizeof(config)) ? -EFAULT : 0;
 }
@ -4680,6 +4689,10 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	if (!priv->channels.params.scatter_fcs_en)
 		netdev->features  &= ~NETIF_F_RXFCS;

+	/* prefere CQE compression over rxhash */
+	if (MLX5E_GET_PFLAG(&priv->channels.params, MLX5E_PFLAG_RX_CQE_COMPRESS))
+		netdev->features &= ~NETIF_F_RXHASH;
+
 #define FT_CAP(f) MLX5_CAP_FLOWTABLE(mdev, flow_table_properties_nic_receive.f)
 	if (FT_CAP(flow_modify_en) &&
 	    FT_CAP(modify_root) &&
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@ -813,7 +813,7 @@ static int mlx5e_nic_rep_netdevice_event(struct notifier_block *nb,
 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);

 	if (!mlx5e_tc_tun_device_to_offload(priv, netdev) &&
-	    !is_vlan_dev(netdev))
+	    !(is_vlan_dev(netdev) && vlan_dev_real_dev(netdev) == rpriv->netdev))
 		return NOTIFY_OK;

 	switch (event) {
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@ -2284,7 +2284,7 @@ static struct mlx5_flow_root_namespace
 		cmds = mlx5_fs_cmd_get_default_ipsec_fpga_cmds(table_type);

 	/* Create the root namespace */
-	root_ns = kvzalloc(sizeof(*root_ns), GFP_KERNEL);
+	root_ns = kzalloc(sizeof(*root_ns), GFP_KERNEL);
 	if (!root_ns)
 		return NULL;

@ -2427,6 +2427,7 @@ static void cleanup_egress_acls_root_ns(struct mlx5_core_dev *dev)
 		cleanup_root_ns(steering->esw_egress_root_ns[i]);

 	kfree(steering->esw_egress_root_ns);
+	steering->esw_egress_root_ns = NULL;
 }

 static void cleanup_ingress_acls_root_ns(struct mlx5_core_dev *dev)
@ -2441,6 +2442,7 @@ static void cleanup_ingress_acls_root_ns(struct mlx5_core_dev *dev)
 		cleanup_root_ns(steering->esw_ingress_root_ns[i]);

 	kfree(steering->esw_ingress_root_ns);
+	steering->esw_ingress_root_ns = NULL;
 }

 void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
@ -2474,11 +2476,7 @@ static int init_sniffer_tx_root_ns(struct mlx5_flow_steering *steering)

 	/* Create single prio */
 	prio = fs_create_prio(&steering->sniffer_tx_root_ns->ns, 0, 1);
-	if (IS_ERR(prio)) {
-		cleanup_root_ns(steering->sniffer_tx_root_ns);
-		return PTR_ERR(prio);
-	}
-	return 0;
+	return PTR_ERR_OR_ZERO(prio);
 }

 static int init_sniffer_rx_root_ns(struct mlx5_flow_steering *steering)
@ -2491,11 +2489,7 @@ static int init_sniffer_rx_root_ns(struct mlx5_flow_steering *steering)

 	/* Create single prio */
 	prio = fs_create_prio(&steering->sniffer_rx_root_ns->ns, 0, 1);
-	if (IS_ERR(prio)) {
-		cleanup_root_ns(steering->sniffer_rx_root_ns);
-		return PTR_ERR(prio);
-	}
-	return 0;
+	return PTR_ERR_OR_ZERO(prio);
 }

 static int init_rdma_rx_root_ns(struct mlx5_flow_steering *steering)
@ -2511,11 +2505,7 @@ static int init_rdma_rx_root_ns(struct mlx5_flow_steering *steering)

 	/* Create single prio */
 	prio = fs_create_prio(&steering->rdma_rx_root_ns->ns, 0, 1);
-	if (IS_ERR(prio)) {
-		cleanup_root_ns(steering->rdma_rx_root_ns);
-		return PTR_ERR(prio);
-	}
-	return 0;
+	return PTR_ERR_OR_ZERO(prio);
 }
 static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
 {
@ -2637,6 +2627,7 @@ cleanup_root_ns:
 	for (i--; i >= 0; i--)
 		cleanup_root_ns(steering->esw_egress_root_ns[i]);
 	kfree(steering->esw_egress_root_ns);
+	steering->esw_egress_root_ns = NULL;
 	return err;
 }

@ -2664,6 +2655,7 @@ cleanup_root_ns:
 	for (i--; i >= 0; i--)
 		cleanup_root_ns(steering->esw_ingress_root_ns[i]);
 	kfree(steering->esw_ingress_root_ns);
+	steering->esw_ingress_root_ns = NULL;
 	return err;
 }

--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@ -1067,7 +1067,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 	err = mlx5_core_set_hca_defaults(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to set hca defaults\n");
-		goto err_fs;
+		goto err_sriov;
 	}

 	err = mlx5_sriov_attach(dev);
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@ -3128,6 +3128,10 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev,
 	ops->reg_ptys_eth_unpack(mlxsw_sp, ptys_pl, &eth_proto_cap, NULL, NULL);

 	autoneg = cmd->base.autoneg == AUTONEG_ENABLE;
+	if (!autoneg && cmd->base.speed == SPEED_56000) {
+		netdev_err(dev, "56G not supported with autoneg off\n");
+		return -EINVAL;
+	}
 	eth_proto_new = autoneg ?
 		ops->to_ptys_advert_link(mlxsw_sp, cmd) :
 		ops->to_ptys_speed(mlxsw_sp, cmd->base.speed);
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_erp.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_erp.c
@ -1171,13 +1171,12 @@ mlxsw_sp_acl_erp_delta_fill(const struct mlxsw_sp_acl_erp_key *parent_key,
 			return -EINVAL;
 	}
 	if (si == -1) {
-		/* The masks are the same, this cannot happen.
-		 * That means the caller is broken.
+		/* The masks are the same, this can happen in case eRPs with
+		 * the same mask were created in both A-TCAM and C-TCAM.
+		 * The only possible condition under which this can happen
+		 * is identical rule insertion. Delta is not possible here.
 		 */
-		WARN_ON(1);
-		*delta_start = 0;
-		*delta_mask = 0;
-		return 0;
+		return -EINVAL;
 	}
 	pmask = (unsigned char) parent_key->mask[__MASK_IDX(si)];
 	mask = (unsigned char) key->mask[__MASK_IDX(si)];
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@ -593,45 +593,25 @@ static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }

-static void ocelot_mact_mc_reset(struct ocelot_port *port)
+static int ocelot_mc_unsync(struct net_device *dev, const unsigned char *addr)
 {
-	struct ocelot *ocelot = port->ocelot;
-	struct netdev_hw_addr *ha, *n;
+	struct ocelot_port *port = netdev_priv(dev);

-	/* Free and forget all the MAC addresses stored in the port private mc
-	 * list. These are mc addresses that were previously added by calling
-	 * ocelot_mact_mc_add().
-	 */
-	list_for_each_entry_safe(ha, n, &port->mc, list) {
-		ocelot_mact_forget(ocelot, ha->addr, port->pvid);
-		list_del(&ha->list);
-		kfree(ha);
-	}
+	return ocelot_mact_forget(port->ocelot, addr, port->pvid);
 }

-static int ocelot_mact_mc_add(struct ocelot_port *port,
-			      struct netdev_hw_addr *hw_addr)
+static int ocelot_mc_sync(struct net_device *dev, const unsigned char *addr)
 {
-	struct ocelot *ocelot = port->ocelot;
-	struct netdev_hw_addr *ha = kzalloc(sizeof(*ha), GFP_ATOMIC);
+	struct ocelot_port *port = netdev_priv(dev);

-	if (!ha)
-		return -ENOMEM;
-
-	memcpy(ha, hw_addr, sizeof(*ha));
-	list_add_tail(&ha->list, &port->mc);
-
-	ocelot_mact_learn(ocelot, PGID_CPU, ha->addr, port->pvid,
+	return ocelot_mact_learn(port->ocelot, PGID_CPU, addr, port->pvid,
 				 ENTRYTYPE_LOCKED);
-
-	return 0;
 }

 static void ocelot_set_rx_mode(struct net_device *dev)
 {
 	struct ocelot_port *port = netdev_priv(dev);
 	struct ocelot *ocelot = port->ocelot;
-	struct netdev_hw_addr *ha;
 	int i;
 	u32 val;

@ -643,13 +623,7 @@ static void ocelot_set_rx_mode(struct net_device *dev)
 	for (i = ocelot->num_phys_ports + 1; i < PGID_CPU; i++)
 		ocelot_write_rix(ocelot, val, ANA_PGID_PGID, i);

-	/* Handle the device multicast addresses. First remove all the
-	 * previously installed addresses and then add the latest ones to the
-	 * mac table.
-	 */
-	ocelot_mact_mc_reset(port);
-	netdev_for_each_mc_addr(ha, dev)
-		ocelot_mact_mc_add(port, ha);
+	__dev_mc_sync(dev, ocelot_mc_sync, ocelot_mc_unsync);
 }

 static int ocelot_port_get_phys_port_name(struct net_device *dev,
@ -1657,7 +1631,6 @@ int ocelot_probe_port(struct ocelot *ocelot, u8 port,
 	ocelot_port->regs = regs;
 	ocelot_port->chip_port = port;
 	ocelot_port->phy = phy;
-	INIT_LIST_HEAD(&ocelot_port->mc);
 	ocelot->ports[port] = ocelot_port;

 	dev->netdev_ops = &ocelot_port_netdev_ops;
--- a/drivers/net/ethernet/mscc/ocelot.h
+++ b/drivers/net/ethernet/mscc/ocelot.h
@ -441,10 +441,6 @@ struct ocelot_port {
 	struct phy_device *phy;
 	void __iomem *regs;
 	u8 chip_port;
-	/* Keep a track of the mc addresses added to the mac table, so that they
-	 * can be removed when needed.
-	 */
-	struct list_head mc;

 	/* Ingress default VLAN (pvid) */
 	u16 pvid;
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@ -6722,6 +6722,8 @@ static int rtl8169_resume(struct device *device)
 	struct net_device *dev = dev_get_drvdata(device);
 	struct rtl8169_private *tp = netdev_priv(dev);

+	rtl_rar_set(tp, dev->dev_addr);
+
 	clk_prepare_enable(tp->clk);

 	if (netif_running(dev))
@ -6755,6 +6757,7 @@ static int rtl8169_runtime_resume(struct device *device)
 {
 	struct net_device *dev = dev_get_drvdata(device);
 	struct rtl8169_private *tp = netdev_priv(dev);
+
 	rtl_rar_set(tp, dev->dev_addr);

 	if (!tp->TxDescArray)
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@ -1594,6 +1594,10 @@ static void sh_eth_dev_exit(struct net_device *ndev)
 	sh_eth_get_stats(ndev);
 	mdp->cd->soft_reset(ndev);

+	/* Set the RMII mode again if required */
+	if (mdp->cd->rmiimode)
+		sh_eth_write(ndev, 0x1, RMIIMODE);
+
 	/* Set MAC address again */
 	update_mac_address(ndev);
 }
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
@ -455,7 +455,11 @@ static int dwc_eth_dwmac_probe(struct platform_device *pdev)
 	priv = data->probe(pdev, plat_dat, &stmmac_res);
 	if (IS_ERR(priv)) {
 		ret = PTR_ERR(priv);
-		dev_err(&pdev->dev, "failed to probe subdriver: %d\n", ret);
+
+		if (ret != -EPROBE_DEFER)
+			dev_err(&pdev->dev, "failed to probe subdriver: %d\n",
+				ret);
+
 		goto remove_config;
 	}

--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
@ -346,8 +346,6 @@ static int mediatek_dwmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);

 	plat_dat->interface = priv_plat->phy_mode;
-	/* clk_csr_i = 250-300MHz & MDC = clk_csr_i/124 */
-	plat_dat->clk_csr = 5;
 	plat_dat->has_gmac4 = 1;
 	plat_dat->has_gmac = 0;
 	plat_dat->pmt = 0;
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@ -3338,6 +3338,7 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv, u32 queue)
 		entry = STMMAC_GET_ENTRY(entry, DMA_RX_SIZE);
 	}
 	rx_q->dirty_rx = entry;
+	stmmac_set_rx_tail_ptr(priv, priv->ioaddr, rx_q->rx_tail_addr, queue);
 }

 /**
@ -4379,10 +4380,10 @@ int stmmac_dvr_probe(struct device *device,
 	 * set the MDC clock dynamically according to the csr actual
 	 * clock input.
 	 */
-	if (!priv->plat->clk_csr)
-		stmmac_clk_csr_set(priv);
-	else
+	if (priv->plat->clk_csr >= 0)
 		priv->clk_csr = priv->plat->clk_csr;
+	else
+		stmmac_clk_csr_set(priv);

 	stmmac_check_pcs_mode(priv);

--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@ -267,7 +267,8 @@ int stmmac_mdio_reset(struct mii_bus *bus)
 			of_property_read_u32_array(np,
 				"snps,reset-delays-us", data->delays, 3);

-			if (gpio_request(data->reset_gpio, "mdio-reset"))
+			if (devm_gpio_request(priv->device, data->reset_gpio,
+					      "mdio-reset"))
 				return 0;
 		}

--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@ -408,7 +408,10 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac)
 	/* Default to phy auto-detection */
 	plat->phy_addr = -1;

-	/* Get clk_csr from device tree */
+	/* Default to get clk_csr from stmmac_clk_crs_set(),
+	 * or get clk_csr from device tree.
+	 */
+	plat->clk_csr = -1;
 	of_property_read_u32(np, "clk_csr", &plat->clk_csr);

 	/* "snps,phy-addr" is not a standard property. Mark it as deprecated
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@ -2000,6 +2000,12 @@ static rx_handler_result_t netvsc_vf_handle_frame(struct sk_buff **pskb)
 	struct netvsc_vf_pcpu_stats *pcpu_stats
 		 = this_cpu_ptr(ndev_ctx->vf_stats);

+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return RX_HANDLER_CONSUMED;
+
+	*pskb = skb;
+
 	skb->dev = ndev;

 	u64_stats_update_begin(&pcpu_stats->syncp);
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@ -26,10 +26,18 @@

 /* Extended Registers */
 #define DP83867_CFG4            0x0031
+#define DP83867_CFG4_SGMII_ANEG_MASK (BIT(5) | BIT(6))
+#define DP83867_CFG4_SGMII_ANEG_TIMER_11MS   (3 << 5)
+#define DP83867_CFG4_SGMII_ANEG_TIMER_800US  (2 << 5)
+#define DP83867_CFG4_SGMII_ANEG_TIMER_2US    (1 << 5)
+#define DP83867_CFG4_SGMII_ANEG_TIMER_16MS   (0 << 5)
+
 #define DP83867_RGMIICTL	0x0032
 #define DP83867_STRAP_STS1	0x006E
 #define DP83867_RGMIIDCTL	0x0086
 #define DP83867_IO_MUX_CFG	0x0170
+#define DP83867_10M_SGMII_CFG   0x016F
+#define DP83867_10M_SGMII_RATE_ADAPT_MASK BIT(7)

 #define DP83867_SW_RESET	BIT(15)
 #define DP83867_SW_RESTART	BIT(14)
@ -247,10 +255,8 @@ static int dp83867_config_init(struct phy_device *phydev)
 		ret = phy_write(phydev, MII_DP83867_PHYCTRL, val);
 		if (ret)
 			return ret;
-	}

-	if ((phydev->interface >= PHY_INTERFACE_MODE_RGMII_ID) &&
-	    (phydev->interface <= PHY_INTERFACE_MODE_RGMII_RXID)) {
+		/* Set up RGMII delays */
 		val = phy_read_mmd(phydev, DP83867_DEVADDR, DP83867_RGMIICTL);

 		if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID)
@ -277,6 +283,33 @@ static int dp83867_config_init(struct phy_device *phydev)
 				       DP83867_IO_MUX_CFG_IO_IMPEDANCE_CTRL);
 	}

+	if (phydev->interface == PHY_INTERFACE_MODE_SGMII) {
+		/* For support SPEED_10 in SGMII mode
+		 * DP83867_10M_SGMII_RATE_ADAPT bit
+		 * has to be cleared by software. That
+		 * does not affect SPEED_100 and
+		 * SPEED_1000.
+		 */
+		ret = phy_modify_mmd(phydev, DP83867_DEVADDR,
+				     DP83867_10M_SGMII_CFG,
+				     DP83867_10M_SGMII_RATE_ADAPT_MASK,
+				     0);
+		if (ret)
+			return ret;
+
+		/* After reset SGMII Autoneg timer is set to 2us (bits 6 and 5
+		 * are 01). That is not enough to finalize autoneg on some
+		 * devices. Increase this timer duration to maximum 16ms.
+		 */
+		ret = phy_modify_mmd(phydev, DP83867_DEVADDR,
+				     DP83867_CFG4,
+				     DP83867_CFG4_SGMII_ANEG_MASK,
+				     DP83867_CFG4_SGMII_ANEG_TIMER_16MS);
+
+		if (ret)
+			return ret;
+	}
+
 	/* Enable Interrupt output INT_OE in CFG3 register */
 	if (phy_interrupt_is_valid(phydev)) {
 		val = phy_read(phydev, DP83867_CFG3);
@ -307,7 +340,7 @@ static int dp83867_phy_reset(struct phy_device *phydev)

 	usleep_range(10, 20);

-	return dp83867_config_init(phydev);
+	return 0;
 }

 static struct phy_driver dp83867_driver[] = {
--- a/drivers/net/phy/marvell10g.c
+++ b/drivers/net/phy/marvell10g.c
@ -31,6 +31,9 @@
 #define MV_PHY_ALASKA_NBT_QUIRK_REV	(MARVELL_PHY_ID_88X3310 | 0xa)

 enum {
+	MV_PMA_BOOT		= 0xc050,
+	MV_PMA_BOOT_FATAL	= BIT(0),
+
 	MV_PCS_BASE_T		= 0x0000,
 	MV_PCS_BASE_R		= 0x1000,
 	MV_PCS_1000BASEX	= 0x2000,
@ -213,6 +216,16 @@ static int mv3310_probe(struct phy_device *phydev)
 	    (phydev->c45_ids.devices_in_package & mmd_mask) != mmd_mask)
 		return -ENODEV;

+	ret = phy_read_mmd(phydev, MDIO_MMD_PMAPMD, MV_PMA_BOOT);
+	if (ret < 0)
+		return ret;
+
+	if (ret & MV_PMA_BOOT_FATAL) {
+		dev_warn(&phydev->mdio.dev,
+			 "PHY failed to boot firmware, status=%04x\n", ret);
+		return -ENODEV;
+	}
+
 	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
 	if (!priv)
 		return -ENOMEM;
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@ -51,6 +51,10 @@ struct phylink {

 	/* The link configuration settings */
 	struct phylink_link_state link_config;
+
+	/* The current settings */
+	phy_interface_t cur_interface;
+
 	struct gpio_desc *link_gpio;
 	struct timer_list link_poll;
 	void (*get_fixed_state)(struct net_device *dev,
@ -446,12 +450,12 @@ static void phylink_resolve(struct work_struct *w)
 		if (!link_state.link) {
 			netif_carrier_off(ndev);
 			pl->ops->mac_link_down(ndev, pl->link_an_mode,
-					       pl->phy_state.interface);
+					       pl->cur_interface);
 			netdev_info(ndev, "Link is Down\n");
 		} else {
+			pl->cur_interface = link_state.interface;
 			pl->ops->mac_link_up(ndev, pl->link_an_mode,
-					     pl->phy_state.interface,
-					     pl->phydev);
+					     pl->cur_interface, pl->phydev);

 			netif_carrier_on(ndev);

--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@ -260,6 +260,15 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 *	will remain unchanged.
 *	Returns a negative error code or zero. An error code must be returned
 *	if at least one unsupported change was requested.
+ * @get_rxfh_context: Get the contents of the RX flow hash indirection table,
+ *	hash key, and/or hash function assiciated to the given rss context.
+ *	Returns a negative error code or zero.
+ * @set_rxfh_context: Create, remove and configure RSS contexts. Allows setting
+ *	the contents of the RX flow hash indirection table, hash key, and/or
+ *	hash function associated to the given context. Arguments which are set
+ *	to %NULL or zero will remain unchanged.
+ *	Returns a negative error code or zero. An error code must be returned
+ *	if at least one unsupported change was requested.
 * @get_channels: Get number of channels.
 * @set_channels: Set number of channels.  Returns a negative error code or
 *	zero.
--- a/include/net/netfilter/nft_fib.h
+++ b/include/net/netfilter/nft_fib.h
@ -34,5 +34,5 @@ void nft_fib6_eval(const struct nft_expr *expr, struct nft_regs *regs,
 		   const struct nft_pktinfo *pkt);

 void nft_fib_store_result(void *reg, const struct nft_fib *priv,
-			  const struct nft_pktinfo *pkt, int index);
+			  const struct net_device *dev);
 #endif
--- a/include/net/udp.h
+++ b/include/net/udp.h
@ -471,12 +471,19 @@ void udpv6_encap_enable(void);
 static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
 					      struct sk_buff *skb, bool ipv4)
 {
+	netdev_features_t features = NETIF_F_SG;
 	struct sk_buff *segs;

+	/* Avoid csum recalculation by skb_segment unless userspace explicitly
+	 * asks for the final checksum values
+	 */
+	if (!inet_get_convert_csum(sk))
+		features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+
 	/* the GSO CB lays after the UDP one, no need to save and restore any
 	 * CB fragment
 	 */
-	segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+	segs = __skb_gso_segment(skb, features, false);
 	if (unlikely(IS_ERR_OR_NULL(segs))) {
 		int segs_nr = skb_shinfo(skb)->gso_segs;

--- a/net/core/dev.c
+++ b/net/core/dev.c
@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff *skb)

 	trace_netif_rx(skb);

-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		int ret;
-
-		preempt_disable();
-		rcu_read_lock();
-		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
-		rcu_read_unlock();
-		preempt_enable();
-
-		/* Consider XDP consuming the packet a success from
-		 * the netdev point of view we do not want to count
-		 * this as an error.
-		 */
-		if (ret != XDP_PASS)
-			return NET_RX_SUCCESS;
-	}
-
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
@ -4858,6 +4841,18 @@ another_round:

 	__this_cpu_inc(softnet_data.processed);

+	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		int ret2;
+
+		preempt_disable();
+		ret2 = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
+		preempt_enable();
+
+		if (ret2 != XDP_PASS)
+			return NET_RX_DROP;
+		skb_reset_mac_len(skb);
+	}
+
 	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
 	    skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
 		skb = skb_vlan_untag(skb);
@ -5178,19 +5173,6 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	if (skb_defer_rx_timestamp(skb))
 		return NET_RX_SUCCESS;

-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		int ret;
-
-		preempt_disable();
-		rcu_read_lock();
-		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
-		rcu_read_unlock();
-		preempt_enable();
-
-		if (ret != XDP_PASS)
-			return NET_RX_DROP;
-	}
-
 	rcu_read_lock();
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
@ -5211,7 +5193,6 @@ static int netif_receive_skb_internal(struct sk_buff *skb)

 static void netif_receive_skb_list_internal(struct list_head *head)
 {
-	struct bpf_prog *xdp_prog = NULL;
 	struct sk_buff *skb, *next;
 	struct list_head sublist;

@ -5224,21 +5205,6 @@ static void netif_receive_skb_list_internal(struct list_head *head)
 	}
 	list_splice_init(&sublist, head);

-	if (static_branch_unlikely(&generic_xdp_needed_key)) {
-		preempt_disable();
-		rcu_read_lock();
-		list_for_each_entry_safe(skb, next, head, list) {
-			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
-			skb_list_del_init(skb);
-			if (do_xdp_generic(xdp_prog, skb) == XDP_PASS)
-				list_add_tail(&skb->list, &sublist);
-		}
-		rcu_read_unlock();
-		preempt_enable();
-		/* Put passed packets back on main list */
-		list_splice_init(&sublist, head);
-	}
-
 	rcu_read_lock();
 #ifdef CONFIG_RPS
 	if (static_branch_unlikely(&rps_needed)) {
@ -5809,7 +5775,6 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 	skb_reset_mac_header(skb);
 	skb_gro_reset_offset(skb);

-	eth = skb_gro_header_fast(skb, 0);
 	if (unlikely(skb_gro_header_hard(skb, hlen))) {
 		eth = skb_gro_header_slow(skb, hlen, 0);
 		if (unlikely(!eth)) {
@ -5819,6 +5784,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 			return NULL;
 		}
 	} else {
+		eth = (const struct ethhdr *)skb->data;
 		gro_pull_from_frag0(skb, hlen);
 		NAPI_GRO_CB(skb)->frag0 += hlen;
 		NAPI_GRO_CB(skb)->frag0_len -= hlen;
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@ -3010,11 +3010,12 @@ ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec_input *input)
 		const struct ethtool_flow_ext *ext_h_spec = &fs->h_ext;
 		const struct ethtool_flow_ext *ext_m_spec = &fs->m_ext;

-		if (ext_m_spec->vlan_etype &&
-		    ext_m_spec->vlan_tci) {
+		if (ext_m_spec->vlan_etype) {
 			match->key.vlan.vlan_tpid = ext_h_spec->vlan_etype;
 			match->mask.vlan.vlan_tpid = ext_m_spec->vlan_etype;
+		}

+		if (ext_m_spec->vlan_tci) {
 			match->key.vlan.vlan_id =
 				ntohs(ext_h_spec->vlan_tci) & 0x0fff;
 			match->mask.vlan.vlan_id =
@ -3024,7 +3025,10 @@ ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec_input *input)
 				(ntohs(ext_h_spec->vlan_tci) & 0xe000) >> 13;
 			match->mask.vlan.vlan_priority =
 				(ntohs(ext_m_spec->vlan_tci) & 0xe000) >> 13;
+		}

+		if (ext_m_spec->vlan_etype ||
+		    ext_m_spec->vlan_tci) {
 			match->dissector.used_keys |=
 				BIT(FLOW_DISSECTOR_KEY_VLAN);
 			match->dissector.offset[FLOW_DISSECTOR_KEY_VLAN] =
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@ -1036,7 +1036,11 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 			uarg->len++;
 			uarg->bytelen = bytelen;
 			atomic_set(&sk->sk_zckey, ++next);
+
+			/* no extra ref when appending to datagram (MSG_MORE) */
+			if (sk->sk_type == SOCK_STREAM)
 				sock_zerocopy_get(uarg);
+
 			return uarg;
 		}
 	}
--- a/net/dsa/tag_8021q.c
+++ b/net/dsa/tag_8021q.c
@ -11,20 +11,59 @@

 #include "dsa_priv.h"

-/* Allocating two VLAN tags per port - one for the RX VID and
- * the other for the TX VID - see below
+/* Binary structure of the fake 12-bit VID field (when the TPID is
+ * ETH_P_DSA_8021Q):
+ *
+ * | 11  | 10  |  9  |  8  |  7  |  6  |  5  |  4  |  3  |  2  |  1  |  0  |
+ * +-----------+-----+-----------------+-----------+-----------------------+
+ * |    DIR    | RSV |    SWITCH_ID    |    RSV    |          PORT         |
+ * +-----------+-----+-----------------+-----------+-----------------------+
+ *
+ * DIR - VID[11:10]:
+ *	Direction flags.
+ *	* 1 (0b01) for RX VLAN,
+ *	* 2 (0b10) for TX VLAN.
+ *	These values make the special VIDs of 0, 1 and 4095 to be left
+ *	unused by this coding scheme.
+ *
+ * RSV - VID[9]:
+ *	To be used for further expansion of SWITCH_ID or for other purposes.
+ *
+ * SWITCH_ID - VID[8:6]:
+ *	Index of switch within DSA tree. Must be between 0 and
+ *	DSA_MAX_SWITCHES - 1.
+ *
+ * RSV - VID[5:4]:
+ *	To be used for further expansion of PORT or for other purposes.
+ *
+ * PORT - VID[3:0]:
+ *	Index of switch port. Must be between 0 and DSA_MAX_PORTS - 1.
 */
-#define DSA_8021Q_VID_RANGE	(DSA_MAX_SWITCHES * DSA_MAX_PORTS)
-#define DSA_8021Q_VID_BASE	(VLAN_N_VID - 2 * DSA_8021Q_VID_RANGE - 1)
-#define DSA_8021Q_RX_VID_BASE	(DSA_8021Q_VID_BASE)
-#define DSA_8021Q_TX_VID_BASE	(DSA_8021Q_VID_BASE + DSA_8021Q_VID_RANGE)
+
+#define DSA_8021Q_DIR_SHIFT		10
+#define DSA_8021Q_DIR_MASK		GENMASK(11, 10)
+#define DSA_8021Q_DIR(x)		(((x) << DSA_8021Q_DIR_SHIFT) & \
+						 DSA_8021Q_DIR_MASK)
+#define DSA_8021Q_DIR_RX		DSA_8021Q_DIR(1)
+#define DSA_8021Q_DIR_TX		DSA_8021Q_DIR(2)
+
+#define DSA_8021Q_SWITCH_ID_SHIFT	6
+#define DSA_8021Q_SWITCH_ID_MASK	GENMASK(8, 6)
+#define DSA_8021Q_SWITCH_ID(x)		(((x) << DSA_8021Q_SWITCH_ID_SHIFT) & \
+						 DSA_8021Q_SWITCH_ID_MASK)
+
+#define DSA_8021Q_PORT_SHIFT		0
+#define DSA_8021Q_PORT_MASK		GENMASK(3, 0)
+#define DSA_8021Q_PORT(x)		(((x) << DSA_8021Q_PORT_SHIFT) & \
+						 DSA_8021Q_PORT_MASK)

 /* Returns the VID to be inserted into the frame from xmit for switch steering
 * instructions on egress. Encodes switch ID and port ID.
 */
 u16 dsa_8021q_tx_vid(struct dsa_switch *ds, int port)
 {
-	return DSA_8021Q_TX_VID_BASE + (DSA_MAX_PORTS * ds->index) + port;
+	return DSA_8021Q_DIR_TX | DSA_8021Q_SWITCH_ID(ds->index) |
+	       DSA_8021Q_PORT(port);
 }
 EXPORT_SYMBOL_GPL(dsa_8021q_tx_vid);

@ -33,21 +72,22 @@ EXPORT_SYMBOL_GPL(dsa_8021q_tx_vid);
 */
 u16 dsa_8021q_rx_vid(struct dsa_switch *ds, int port)
 {
-	return DSA_8021Q_RX_VID_BASE + (DSA_MAX_PORTS * ds->index) + port;
+	return DSA_8021Q_DIR_RX | DSA_8021Q_SWITCH_ID(ds->index) |
+	       DSA_8021Q_PORT(port);
 }
 EXPORT_SYMBOL_GPL(dsa_8021q_rx_vid);

 /* Returns the decoded switch ID from the RX VID. */
 int dsa_8021q_rx_switch_id(u16 vid)
 {
-	return ((vid - DSA_8021Q_RX_VID_BASE) / DSA_MAX_PORTS);
+	return (vid & DSA_8021Q_SWITCH_ID_MASK) >> DSA_8021Q_SWITCH_ID_SHIFT;
 }
 EXPORT_SYMBOL_GPL(dsa_8021q_rx_switch_id);

 /* Returns the decoded port ID from the RX VID. */
 int dsa_8021q_rx_source_port(u16 vid)
 {
-	return ((vid - DSA_8021Q_RX_VID_BASE) % DSA_MAX_PORTS);
+	return (vid & DSA_8021Q_PORT_MASK) >> DSA_8021Q_PORT_SHIFT;
 }
 EXPORT_SYMBOL_GPL(dsa_8021q_rx_source_port);

@ -128,10 +168,7 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
 		u16 flags;

 		if (i == upstream)
-			/* CPU port needs to see this port's RX VID
-			 * as tagged egress.
-			 */
-			flags = 0;
+			continue;
 		else if (i == port)
 			/* The RX VID is pvid on this port */
 			flags = BRIDGE_VLAN_INFO_UNTAGGED |
@ -150,6 +187,20 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
 			return err;
 		}
 	}
+
+	/* CPU port needs to see this port's RX VID
+	 * as tagged egress.
+	 */
+	if (enabled)
+		err = dsa_port_vid_add(upstream_dp, rx_vid, 0);
+	else
+		err = dsa_port_vid_del(upstream_dp, rx_vid);
+	if (err) {
+		dev_err(ds->dev, "Failed to apply RX VID %d to port %d: %d\n",
+			rx_vid, port, err);
+		return err;
+	}
+
 	/* Finally apply the TX VID on this port and on the CPU port */
 	if (enabled)
 		err = dsa_port_vid_add(dp, tx_vid, BRIDGE_VLAN_INFO_UNTAGGED);
--- a/net/hsr/hsr_framereg.c
+++ b/net/hsr/hsr_framereg.c
@ -365,6 +365,14 @@ void hsr_prune_nodes(struct timer_list *t)

 	rcu_read_lock();
 	list_for_each_entry_rcu(node, &hsr->node_db, mac_list) {
+		/* Don't prune own node. Neither time_in[HSR_PT_SLAVE_A]
+		 * nor time_in[HSR_PT_SLAVE_B], will ever be updated for
+		 * the master port. Thus the master node will be repeatedly
+		 * pruned leading to packet loss.
+		 */
+		if (hsr_addr_is_self(hsr, node->macaddress_A))
+			continue;
+
 		/* Shorthand */
 		time_a = node->time_in[HSR_PT_SLAVE_A];
 		time_b = node->time_in[HSR_PT_SLAVE_B];
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@ -428,8 +428,8 @@ int inet_release(struct socket *sock)
 		if (sock_flag(sk, SOCK_LINGER) &&
 		    !(current->flags & PF_EXITING))
 			timeout = sk->sk_lingertime;
-		sock->sk = NULL;
 		sk->sk_prot->close(sk, timeout);
+		sock->sk = NULL;
 	}
 	return 0;
 }
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@ -188,6 +188,17 @@ static void ip_ma_put(struct ip_mc_list *im)
 	     pmc != NULL;					\
 	     pmc = rtnl_dereference(pmc->next_rcu))

+static void ip_sf_list_clear_all(struct ip_sf_list *psf)
+{
+	struct ip_sf_list *next;
+
+	while (psf) {
+		next = psf->sf_next;
+		kfree(psf);
+		psf = next;
+	}
+}
+
 #ifdef CONFIG_IP_MULTICAST

 /*
@ -633,6 +644,13 @@ static void igmpv3_clear_zeros(struct ip_sf_list **ppsf)
 	}
 }

+static void kfree_pmc(struct ip_mc_list *pmc)
+{
+	ip_sf_list_clear_all(pmc->sources);
+	ip_sf_list_clear_all(pmc->tomb);
+	kfree(pmc);
+}
+
 static void igmpv3_send_cr(struct in_device *in_dev)
 {
 	struct ip_mc_list *pmc, *pmc_prev, *pmc_next;
@ -669,7 +687,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 			else
 				in_dev->mc_tomb = pmc_next;
 			in_dev_put(pmc->interface);
-			kfree(pmc);
+			kfree_pmc(pmc);
 		} else
 			pmc_prev = pmc;
 	}
@ -1215,14 +1233,18 @@ static void igmpv3_del_delrec(struct in_device *in_dev, struct ip_mc_list *im)
 		im->interface = pmc->interface;
 		if (im->sfmode == MCAST_INCLUDE) {
 			im->tomb = pmc->tomb;
+			pmc->tomb = NULL;
+
 			im->sources = pmc->sources;
+			pmc->sources = NULL;
+
 			for (psf = im->sources; psf; psf = psf->sf_next)
 				psf->sf_crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv;
 		} else {
 			im->crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv;
 		}
 		in_dev_put(pmc->interface);
-		kfree(pmc);
+		kfree_pmc(pmc);
 	}
 	spin_unlock_bh(&im->lock);
 }
@ -1243,21 +1265,18 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
 		nextpmc = pmc->next;
 		ip_mc_clear_src(pmc);
 		in_dev_put(pmc->interface);
-		kfree(pmc);
+		kfree_pmc(pmc);
 	}
 	/* clear dead sources, too */
 	rcu_read_lock();
 	for_each_pmc_rcu(in_dev, pmc) {
-		struct ip_sf_list *psf, *psf_next;
+		struct ip_sf_list *psf;

 		spin_lock_bh(&pmc->lock);
 		psf = pmc->tomb;
 		pmc->tomb = NULL;
 		spin_unlock_bh(&pmc->lock);
-		for (; psf; psf = psf_next) {
-			psf_next = psf->sf_next;
-			kfree(psf);
-		}
+		ip_sf_list_clear_all(psf);
 	}
 	rcu_read_unlock();
 }
@ -2123,7 +2142,7 @@ static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,

 static void ip_mc_clear_src(struct ip_mc_list *pmc)
 {
-	struct ip_sf_list *psf, *nextpsf, *tomb, *sources;
+	struct ip_sf_list *tomb, *sources;

 	spin_lock_bh(&pmc->lock);
 	tomb = pmc->tomb;
@ -2135,14 +2154,8 @@ static void ip_mc_clear_src(struct ip_mc_list *pmc)
 	pmc->sfcount[MCAST_EXCLUDE] = 1;
 	spin_unlock_bh(&pmc->lock);

-	for (psf = tomb; psf; psf = nextpsf) {
-		nextpsf = psf->sf_next;
-		kfree(psf);
-	}
-	for (psf = sources; psf; psf = nextpsf) {
-		nextpsf = psf->sf_next;
-		kfree(psf);
-	}
+	ip_sf_list_clear_all(tomb);
+	ip_sf_list_clear_all(sources);
 }

 /* Join a multicast group
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@ -878,7 +878,7 @@ static int __ip_append_data(struct sock *sk,
 	int csummode = CHECKSUM_NONE;
 	struct rtable *rt = (struct rtable *)cork->dst;
 	unsigned int wmem_alloc_delta = 0;
-	bool paged, extra_uref;
+	bool paged, extra_uref = false;
 	u32 tskey = 0;

 	skb = skb_peek_tail(queue);
@ -918,7 +918,7 @@ static int __ip_append_data(struct sock *sk,
 		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
 			return -ENOBUFS;
-		extra_uref = true;
+		extra_uref = !skb;	/* only extra ref if !MSG_MORE */
 		if (rt->dst.dev->features & NETIF_F_SG &&
 		    csummode == CHECKSUM_PARTIAL) {
 			paged = true;
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@ -343,6 +343,8 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 		return -EINVAL;

 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
+	if (on && !new_ra)
+		return -ENOMEM;

 	mutex_lock(&net->ipv4.ra_mutex);
 	for (rap = &net->ipv4.ra_chain;
--- a/net/ipv4/netfilter/nft_fib_ipv4.c
+++ b/net/ipv4/netfilter/nft_fib_ipv4.c
@ -58,11 +58,6 @@ void nft_fib4_eval_type(const struct nft_expr *expr, struct nft_regs *regs,
 }
 EXPORT_SYMBOL_GPL(nft_fib4_eval_type);

-static int get_ifindex(const struct net_device *dev)
-{
-	return dev ? dev->ifindex : 0;
-}
-
 void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 		   const struct nft_pktinfo *pkt)
 {
@ -94,8 +89,7 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,

 	if (nft_hook(pkt) == NF_INET_PRE_ROUTING &&
 	    nft_fib_is_loopback(pkt->skb, nft_in(pkt))) {
-		nft_fib_store_result(dest, priv, pkt,
-				     nft_in(pkt)->ifindex);
+		nft_fib_store_result(dest, priv, nft_in(pkt));
 		return;
 	}

@ -108,8 +102,7 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 	if (ipv4_is_zeronet(iph->saddr)) {
 		if (ipv4_is_lbcast(iph->daddr) ||
 		    ipv4_is_local_multicast(iph->daddr)) {
-			nft_fib_store_result(dest, priv, pkt,
-					     get_ifindex(pkt->skb->dev));
+			nft_fib_store_result(dest, priv, pkt->skb->dev);
 			return;
 		}
 	}
@ -150,17 +143,7 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 		found = oif;
 	}

-	switch (priv->result) {
-	case NFT_FIB_RESULT_OIF:
-		*dest = found->ifindex;
-		break;
-	case NFT_FIB_RESULT_OIFNAME:
-		strncpy((char *)dest, found->name, IFNAMSIZ);
-		break;
-	default:
-		WARN_ON_ONCE(1);
-		break;
-	}
+	nft_fib_store_result(dest, priv, found);
 }
 EXPORT_SYMBOL_GPL(nft_fib4_eval);

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@ -3791,6 +3791,8 @@ void tcp_parse_options(const struct net *net,
 			length--;
 			continue;
 		default:
+			if (length < 2)
+				return;
 			opsize = *ptr++;
 			if (opsize < 2) /* "silly options" */
 				return;
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@ -5661,18 +5661,6 @@ static const struct nla_policy inet6_af_policy[IFLA_INET6_MAX + 1] = {
 	[IFLA_INET6_TOKEN]		= { .len = sizeof(struct in6_addr) },
 };

-static int inet6_validate_link_af(const struct net_device *dev,
-				  const struct nlattr *nla)
-{
-	struct nlattr *tb[IFLA_INET6_MAX + 1];
-
-	if (dev && !__in6_dev_get(dev))
-		return -EAFNOSUPPORT;
-
-	return nla_parse_nested_deprecated(tb, IFLA_INET6_MAX, nla,
-					   inet6_af_policy, NULL);
-}
-
 static int check_addr_gen_mode(int mode)
 {
 	if (mode != IN6_ADDR_GEN_MODE_EUI64 &&
@ -5693,14 +5681,44 @@ static int check_stable_privacy(struct inet6_dev *idev, struct net *net,
 	return 1;
 }

-static int inet6_set_link_af(struct net_device *dev, const struct nlattr *nla)
+static int inet6_validate_link_af(const struct net_device *dev,
+				  const struct nlattr *nla)
 {
-	int err = -EINVAL;
-	struct inet6_dev *idev = __in6_dev_get(dev);
 	struct nlattr *tb[IFLA_INET6_MAX + 1];
+	struct inet6_dev *idev = NULL;
+	int err;

+	if (dev) {
+		idev = __in6_dev_get(dev);
 		if (!idev)
 			return -EAFNOSUPPORT;
+	}
+
+	err = nla_parse_nested_deprecated(tb, IFLA_INET6_MAX, nla,
+					  inet6_af_policy, NULL);
+	if (err)
+		return err;
+
+	if (!tb[IFLA_INET6_TOKEN] && !tb[IFLA_INET6_ADDR_GEN_MODE])
+		return -EINVAL;
+
+	if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
+		u8 mode = nla_get_u8(tb[IFLA_INET6_ADDR_GEN_MODE]);
+
+		if (check_addr_gen_mode(mode) < 0)
+			return -EINVAL;
+		if (dev && check_stable_privacy(idev, dev_net(dev), mode) < 0)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int inet6_set_link_af(struct net_device *dev, const struct nlattr *nla)
+{
+	struct inet6_dev *idev = __in6_dev_get(dev);
+	struct nlattr *tb[IFLA_INET6_MAX + 1];
+	int err;

 	if (nla_parse_nested_deprecated(tb, IFLA_INET6_MAX, nla, NULL, NULL) < 0)
 		BUG();
@ -5714,15 +5732,10 @@ static int inet6_set_link_af(struct net_device *dev, const struct nlattr *nla)
 	if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
 		u8 mode = nla_get_u8(tb[IFLA_INET6_ADDR_GEN_MODE]);

-		if (check_addr_gen_mode(mode) < 0 ||
-		    check_stable_privacy(idev, dev_net(dev), mode) < 0)
-			return -EINVAL;
-
 		idev->cnf.addr_gen_mode = mode;
-		err = 0;
 	}

-	return err;
+	return 0;
 }

 static int inet6_fill_ifinfo(struct sk_buff *skb, struct inet6_dev *idev,
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@ -1275,7 +1275,7 @@ static int __ip6_append_data(struct sock *sk,
 	int csummode = CHECKSUM_NONE;
 	unsigned int maxnonfragsize, headersize;
 	unsigned int wmem_alloc_delta = 0;
-	bool paged, extra_uref;
+	bool paged, extra_uref = false;

 	skb = skb_peek_tail(queue);
 	if (!skb) {
@ -1344,7 +1344,7 @@ emsgsize:
 		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
 			return -ENOBUFS;
-		extra_uref = true;
+		extra_uref = !skb;	/* only extra ref if !MSG_MORE */
 		if (rt->dst.dev->features & NETIF_F_SG &&
 		    csummode == CHECKSUM_PARTIAL) {
 			paged = true;
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@ -68,6 +68,8 @@ int ip6_ra_control(struct sock *sk, int sel)
 		return -ENOPROTOOPT;

 	new_ra = (sel >= 0) ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
+	if (sel >= 0 && !new_ra)
+		return -ENOMEM;

 	write_lock_bh(&ip6_ra_lock);
 	for (rap = &ip6_ra_chain; (ra = *rap) != NULL; rap = &ra->next) {
--- a/net/ipv6/netfilter/nft_fib_ipv6.c
+++ b/net/ipv6/netfilter/nft_fib_ipv6.c
@ -169,8 +169,7 @@ void nft_fib6_eval(const struct nft_expr *expr, struct nft_regs *regs,

 	if (nft_hook(pkt) == NF_INET_PRE_ROUTING &&
 	    nft_fib_is_loopback(pkt->skb, nft_in(pkt))) {
-		nft_fib_store_result(dest, priv, pkt,
-				     nft_in(pkt)->ifindex);
+		nft_fib_store_result(dest, priv, nft_in(pkt));
 		return;
 	}

@ -187,18 +186,7 @@ void nft_fib6_eval(const struct nft_expr *expr, struct nft_regs *regs,
 	if (oif && oif != rt->rt6i_idev->dev)
 		goto put_rt_err;

-	switch (priv->result) {
-	case NFT_FIB_RESULT_OIF:
-		*dest = rt->rt6i_idev->dev->ifindex;
-		break;
-	case NFT_FIB_RESULT_OIFNAME:
-		strncpy((char *)dest, rt->rt6i_idev->dev->name, IFNAMSIZ);
-		break;
-	default:
-		WARN_ON_ONCE(1);
-		break;
-	}
-
+	nft_fib_store_result(dest, priv, rt->rt6i_idev->dev);
 put_rt_err:
 	ip6_rt_put(rt);
 }
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@ -2512,6 +2512,12 @@ static struct rt6_info *__ip6_route_redirect(struct net *net,
 	struct fib6_info *rt;
 	struct fib6_node *fn;

+	/* l3mdev_update_flow overrides oif if the device is enslaved; in
+	 * this case we must match on the real ingress device, so reset it
+	 */
+	if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF)
+		fl6->flowi6_oif = skb->dev->ifindex;
+
 	/* Get the "current" route for this destination and
 	 * check if the redirect has come from appropriate router.
 	 *
--- a/net/llc/llc_output.c
+++ b/net/llc/llc_output.c
@ -72,6 +72,8 @@ int llc_build_and_send_ui_pkt(struct llc_sap *sap, struct sk_buff *skb,
 	rc = llc_mac_hdr_init(skb, skb->dev->dev_addr, dmac);
 	if (likely(!rc))
 		rc = dev_queue_xmit(skb);
+	else
+		kfree_skb(skb);
 	return rc;
 }

--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@ -2312,7 +2312,6 @@ static void __net_exit __ip_vs_cleanup(struct net *net)
 {
 	struct netns_ipvs *ipvs = net_ipvs(net);

-	nf_unregister_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
 	ip_vs_service_net_cleanup(ipvs);	/* ip_vs_flush() with locks */
 	ip_vs_conn_net_cleanup(ipvs);
 	ip_vs_app_net_cleanup(ipvs);
@ -2327,6 +2326,7 @@ static void __net_exit __ip_vs_dev_cleanup(struct net *net)
 {
 	struct netns_ipvs *ipvs = net_ipvs(net);
 	EnterFunction(2);
+	nf_unregister_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
 	ipvs->enable = 0;	/* Disable packet reception */
 	smp_wmb();
 	ip_vs_sync_net_cleanup(ipvs);
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@ -244,8 +244,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	rt = (struct rtable *)flow->tuplehash[dir].tuple.dst_cache;
 	outdev = rt->dst.dev;

-	if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)) &&
-	    (ip_hdr(skb)->frag_off & htons(IP_DF)) != 0)
+	if (unlikely(nf_flow_exceeds_mtu(skb, flow->tuplehash[dir].tuple.mtu)))
 		return NF_ACCEPT;

 	if (skb_try_make_writable(skb, sizeof(*iph)))
--- a/net/netfilter/nf_nat_helper.c
+++ b/net/netfilter/nf_nat_helper.c
@ -170,7 +170,7 @@ nf_nat_mangle_udp_packet(struct sk_buff *skb,
 	if (!udph->check && skb->ip_summed != CHECKSUM_PARTIAL)
 		return true;

-	nf_nat_csum_recalc(skb, nf_ct_l3num(ct), IPPROTO_TCP,
+	nf_nat_csum_recalc(skb, nf_ct_l3num(ct), IPPROTO_UDP,
 			   udph, &udph->check, datalen, oldlen);

 	return true;
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@ -255,6 +255,7 @@ static unsigned int nf_iterate(struct sk_buff *skb,
 repeat:
 		verdict = nf_hook_entry_hookfn(hook, skb, state);
 		if (verdict != NF_ACCEPT) {
+			*index = i;
 			if (verdict != NF_REPEAT)
 				return verdict;
 			goto repeat;
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@ -2270,13 +2270,13 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
 				    u32 flags, int family,
 				    const struct nft_table *table,
 				    const struct nft_chain *chain,
-				    const struct nft_rule *rule)
+				    const struct nft_rule *rule,
+				    const struct nft_rule *prule)
 {
 	struct nlmsghdr *nlh;
 	struct nfgenmsg *nfmsg;
 	const struct nft_expr *expr, *next;
 	struct nlattr *list;
-	const struct nft_rule *prule;
 	u16 type = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);

 	nlh = nlmsg_put(skb, portid, seq, type, sizeof(struct nfgenmsg), flags);
@ -2296,8 +2296,7 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
 			 NFTA_RULE_PAD))
 		goto nla_put_failure;

-	if ((event != NFT_MSG_DELRULE) && (rule->list.prev != &chain->rules)) {
-		prule = list_prev_entry(rule, list);
+	if (event != NFT_MSG_DELRULE && prule) {
 		if (nla_put_be64(skb, NFTA_RULE_POSITION,
 				 cpu_to_be64(prule->handle),
 				 NFTA_RULE_PAD))
@ -2344,7 +2343,7 @@ static void nf_tables_rule_notify(const struct nft_ctx *ctx,

 	err = nf_tables_fill_rule_info(skb, ctx->net, ctx->portid, ctx->seq,
 				       event, 0, ctx->family, ctx->table,
-				       ctx->chain, rule);
+				       ctx->chain, rule, NULL);
 	if (err < 0) {
 		kfree_skb(skb);
 		goto err;
@ -2369,12 +2368,13 @@ static int __nf_tables_dump_rules(struct sk_buff *skb,
 				  const struct nft_chain *chain)
 {
 	struct net *net = sock_net(skb->sk);
+	const struct nft_rule *rule, *prule;
 	unsigned int s_idx = cb->args[0];
-	const struct nft_rule *rule;

+	prule = NULL;
 	list_for_each_entry_rcu(rule, &chain->rules, list) {
 		if (!nft_is_active(net, rule))
-			goto cont;
+			goto cont_skip;
 		if (*idx < s_idx)
 			goto cont;
 		if (*idx > s_idx) {
@ -2386,11 +2386,13 @@ static int __nf_tables_dump_rules(struct sk_buff *skb,
 					NFT_MSG_NEWRULE,
 					NLM_F_MULTI | NLM_F_APPEND,
 					table->family,
-					table, chain, rule) < 0)
+					table, chain, rule, prule) < 0)
 			return 1;

 		nl_dump_check_consistent(cb, nlmsg_hdr(skb));
 cont:
+		prule = rule;
+cont_skip:
 		(*idx)++;
 	}
 	return 0;
@ -2546,7 +2548,7 @@ static int nf_tables_getrule(struct net *net, struct sock *nlsk,

 	err = nf_tables_fill_rule_info(skb2, net, NETLINK_CB(skb).portid,
 				       nlh->nlmsg_seq, NFT_MSG_NEWRULE, 0,
-				       family, table, chain, rule);
+				       family, table, chain, rule, NULL);
 	if (err < 0)
 		goto err;

--- a/net/netfilter/nft_fib.c
+++ b/net/netfilter/nft_fib.c
@ -135,17 +135,17 @@ int nft_fib_dump(struct sk_buff *skb, const struct nft_expr *expr)
 EXPORT_SYMBOL_GPL(nft_fib_dump);

 void nft_fib_store_result(void *reg, const struct nft_fib *priv,
-			  const struct nft_pktinfo *pkt, int index)
+			  const struct net_device *dev)
 {
-	struct net_device *dev;
 	u32 *dreg = reg;
+	int index;

 	switch (priv->result) {
 	case NFT_FIB_RESULT_OIF:
+		index = dev ? dev->ifindex : 0;
 		*dreg = (priv->flags & NFTA_FIB_F_PRESENT) ? !!index : index;
 		break;
 	case NFT_FIB_RESULT_OIFNAME:
-		dev = dev_get_by_index_rcu(nft_net(pkt), index);
 		if (priv->flags & NFTA_FIB_F_PRESENT)
 			*dreg = !!dev;
 		else
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@ -13,7 +13,6 @@
 #include <net/netfilter/nf_conntrack_core.h>
 #include <linux/netfilter/nf_conntrack_common.h>
 #include <net/netfilter/nf_flow_table.h>
-#include <net/netfilter/nf_conntrack_helper.h>

 struct nft_flow_offload {
 	struct nft_flowtable	*flowtable;
@ -50,14 +49,19 @@ static int nft_flow_route(const struct nft_pktinfo *pkt,
 	return 0;
 }

-static bool nft_flow_offload_skip(struct sk_buff *skb)
+static bool nft_flow_offload_skip(struct sk_buff *skb, int family)
 {
-	struct ip_options *opt  = &(IPCB(skb)->opt);
+	if (skb_sec_path(skb))
+		return true;
+
+	if (family == NFPROTO_IPV4) {
+		const struct ip_options *opt;
+
+		opt = &(IPCB(skb)->opt);

 		if (unlikely(opt->optlen))
 			return true;
-	if (skb_sec_path(skb))
-		return true;
+	}

 	return false;
 }
@ -68,15 +72,15 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,
 {
 	struct nft_flow_offload *priv = nft_expr_priv(expr);
 	struct nf_flowtable *flowtable = &priv->flowtable->data;
-	const struct nf_conn_help *help;
 	enum ip_conntrack_info ctinfo;
 	struct nf_flow_route route;
 	struct flow_offload *flow;
 	enum ip_conntrack_dir dir;
+	bool is_tcp = false;
 	struct nf_conn *ct;
 	int ret;

-	if (nft_flow_offload_skip(pkt->skb))
+	if (nft_flow_offload_skip(pkt->skb, nft_pf(pkt)))
 		goto out;

 	ct = nf_ct_get(pkt->skb, &ctinfo);
@ -85,14 +89,16 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,

 	switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
 	case IPPROTO_TCP:
+		is_tcp = true;
+		break;
 	case IPPROTO_UDP:
 		break;
 	default:
 		goto out;
 	}

-	help = nfct_help(ct);
-	if (help)
+	if (nf_ct_ext_exist(ct, NF_CT_EXT_HELPER) ||
+	    ct->status & IPS_SEQ_ADJUST)
 		goto out;

 	if (!nf_ct_is_confirmed(ct))
@ -109,6 +115,11 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,
 	if (!flow)
 		goto err_flow_alloc;

+	if (is_tcp) {
+		ct->proto.tcp.seen[0].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
+		ct->proto.tcp.seen[1].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
+	}
+
 	ret = flow_offload_add(flowtable, flow);
 	if (ret < 0)
 		goto err_flow_add;
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@ -800,7 +800,7 @@ int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[],

 	for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
 		a = actions[i];
-		nest = nla_nest_start_noflag(skb, a->order);
+		nest = nla_nest_start_noflag(skb, i + 1);
 		if (nest == NULL)
 			goto nla_put_failure;
 		err = tcf_action_dump_1(skb, a, bind, ref);
@ -1303,7 +1303,6 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 			ret = PTR_ERR(act);
 			goto err;
 		}
-		act->order = i;
 		attr_size += tcf_action_fill_size(act);
 		actions[i - 1] = act;
 	}
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@ -553,8 +553,8 @@ void tls_device_write_space(struct sock *sk, struct tls_context *ctx)
 void handle_device_resync(struct sock *sk, u32 seq, u64 rcd_sn)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct net_device *netdev = tls_ctx->netdev;
 	struct tls_offload_context_rx *rx_ctx;
+	struct net_device *netdev;
 	u32 is_req_pending;
 	s64 resync_req;
 	u32 req_seq;
@ -568,10 +568,15 @@ void handle_device_resync(struct sock *sk, u32 seq, u64 rcd_sn)
 	is_req_pending = resync_req;

 	if (unlikely(is_req_pending) && req_seq == seq &&
-	    atomic64_try_cmpxchg(&rx_ctx->resync_req, &resync_req, 0))
-		netdev->tlsdev_ops->tls_dev_resync_rx(netdev, sk,
-						      seq + TLS_HEADER_SIZE - 1,
+	    atomic64_try_cmpxchg(&rx_ctx->resync_req, &resync_req, 0)) {
+		seq += TLS_HEADER_SIZE - 1;
+		down_read(&device_offload_lock);
+		netdev = tls_ctx->netdev;
+		if (netdev)
+			netdev->tlsdev_ops->tls_dev_resync_rx(netdev, sk, seq,
 							      rcd_sn);
+		up_read(&device_offload_lock);
+	}
 }

 static int tls_device_reencrypt(struct sock *sk, struct sk_buff *skb)
@ -934,12 +939,6 @@ void tls_device_offload_cleanup_rx(struct sock *sk)
 	if (!netdev)
 		goto out;

-	if (!(netdev->features & NETIF_F_HW_TLS_RX)) {
-		pr_err_ratelimited("%s: device is missing NETIF_F_HW_TLS_RX cap\n",
-				   __func__);
-		goto out;
-	}
-
 	netdev->tlsdev_ops->tls_dev_del(netdev, tls_ctx,
 					TLS_OFFLOAD_CTX_DIR_RX);

@ -998,7 +997,8 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

-	if (!(dev->features & (NETIF_F_HW_TLS_RX | NETIF_F_HW_TLS_TX)))
+	if (!dev->tlsdev_ops &&
+	    !(dev->features & (NETIF_F_HW_TLS_RX | NETIF_F_HW_TLS_TX)))
 		return NOTIFY_DONE;

 	switch (event) {
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@ -1712,15 +1712,14 @@ int tls_sw_recvmsg(struct sock *sk,
 		copied = err;
 	}

-	len = len - copied;
-	if (len) {
-		target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-		timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
-	} else {
+	if (len <= copied)
 		goto recv_end;
-	}

-	do {
+	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
+	len = len - copied;
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+	while (len && (decrypted + copied < target || ctx->recv_pkt)) {
 		bool retain_skb = false;
 		bool zc = false;
 		int to_decrypt;
@ -1851,11 +1850,7 @@ pick_next_record:
 		} else {
 			break;
 		}
-
-		/* If we have a new message from strparser, continue now. */
-		if (decrypted >= target && !ctx->recv_pkt)
-			break;
-	} while (len);
+	}

 recv_end:
 	if (num_async) {
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@ -208,8 +208,8 @@ tunnel6_a_addr="fd00:2::a"
 tunnel6_b_addr="fd00:2::b"
 tunnel6_mask="64"

-dummy6_0_addr="fc00:1000::0"
-dummy6_1_addr="fc00:1001::0"
+dummy6_0_prefix="fc00:1000::"
+dummy6_1_prefix="fc00:1001::"
 dummy6_mask="64"

 cleanup_done=1
@ -1005,13 +1005,13 @@ test_pmtu_vti6_link_change_mtu() {
 	run_cmd ${ns_a} ip link set dummy0 up
 	run_cmd ${ns_a} ip link set dummy1 up

-	run_cmd ${ns_a} ip addr add ${dummy6_0_addr}/${dummy6_mask} dev dummy0
-	run_cmd ${ns_a} ip addr add ${dummy6_1_addr}/${dummy6_mask} dev dummy1
+	run_cmd ${ns_a} ip addr add ${dummy6_0_prefix}1/${dummy6_mask} dev dummy0
+	run_cmd ${ns_a} ip addr add ${dummy6_1_prefix}1/${dummy6_mask} dev dummy1

 	fail=0

 	# Create vti6 interface bound to device, passing MTU, check it
-	run_cmd ${ns_a} ip link add vti6_a mtu 1300 type vti6 remote ${dummy6_0_addr} local ${dummy6_0_addr}
+	run_cmd ${ns_a} ip link add vti6_a mtu 1300 type vti6 remote ${dummy6_0_prefix}2 local ${dummy6_0_prefix}1
 	mtu="$(link_get_mtu "${ns_a}" vti6_a)"
 	if [ ${mtu} -ne 1300 ]; then
 		err "  vti6 MTU ${mtu} doesn't match configured value 1300"
@ -1020,7 +1020,7 @@ test_pmtu_vti6_link_change_mtu() {

 	# Move to another device with different MTU, without passing MTU, check
 	# MTU is adjusted
-	run_cmd ${ns_a} ip link set vti6_a type vti6 remote ${dummy6_1_addr} local ${dummy6_1_addr}
+	run_cmd ${ns_a} ip link set vti6_a type vti6 remote ${dummy6_1_prefix}2 local ${dummy6_1_prefix}1
 	mtu="$(link_get_mtu "${ns_a}" vti6_a)"
 	if [ ${mtu} -ne $((3000 - 40)) ]; then
 		err "  vti MTU ${mtu} is not dummy MTU 3000 minus IPv6 header length"
@ -1028,7 +1028,7 @@ test_pmtu_vti6_link_change_mtu() {
 	fi

 	# Move it back, passing MTU, check MTU is not overridden
-	run_cmd ${ns_a} ip link set vti6_a mtu 1280 type vti6 remote ${dummy6_0_addr} local ${dummy6_0_addr}
+	run_cmd ${ns_a} ip link set vti6_a mtu 1280 type vti6 remote ${dummy6_0_prefix}2 local ${dummy6_0_prefix}1
 	mtu="$(link_get_mtu "${ns_a}" vti6_a)"
 	if [ ${mtu} -ne 1280 ]; then
 		err "  vti6 MTU ${mtu} doesn't match configured value 1280"
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@ -442,6 +442,21 @@ TEST_F(tls, multiple_send_single_recv)
 	EXPECT_EQ(memcmp(send_mem, recv_mem + send_len, send_len), 0);
 }

+TEST_F(tls, single_send_multiple_recv_non_align)
+{
+	const unsigned int total_len = 15;
+	const unsigned int recv_len = 10;
+	char recv_mem[recv_len * 2];
+	char send_mem[total_len];
+
+	EXPECT_GE(send(self->fd, send_mem, total_len, 0), 0);
+	memset(recv_mem, 0, total_len);
+
+	EXPECT_EQ(recv(self->cfd, recv_mem, recv_len, 0), recv_len);
+	EXPECT_EQ(recv(self->cfd, recv_mem + recv_len, recv_len, 0), 5);
+	EXPECT_EQ(memcmp(send_mem, recv_mem, total_len), 0);
+}
+
 TEST_F(tls, recv_partial)
 {
 	char const *test_str = "test_read_partial";
@ -575,6 +590,25 @@ TEST_F(tls, recv_peek_large_buf_mult_recs)
 	EXPECT_EQ(memcmp(test_str, buf, len), 0);
 }

+TEST_F(tls, recv_lowat)
+{
+	char send_mem[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
+	char recv_mem[20];
+	int lowat = 8;
+
+	EXPECT_EQ(send(self->fd, send_mem, 10, 0), 10);
+	EXPECT_EQ(send(self->fd, send_mem, 5, 0), 5);
+
+	memset(recv_mem, 0, 20);
+	EXPECT_EQ(setsockopt(self->cfd, SOL_SOCKET, SO_RCVLOWAT,
+			     &lowat, sizeof(lowat)), 0);
+	EXPECT_EQ(recv(self->cfd, recv_mem, 1, MSG_WAITALL), 1);
+	EXPECT_EQ(recv(self->cfd, recv_mem + 1, 6, MSG_WAITALL), 6);
+	EXPECT_EQ(recv(self->cfd, recv_mem + 7, 10, 0), 8);
+
+	EXPECT_EQ(memcmp(send_mem, recv_mem, 10), 0);
+	EXPECT_EQ(memcmp(send_mem, recv_mem + 10, 5), 0);
+}

 TEST_F(tls, pollin)
 {
--- a/tools/testing/selftests/netfilter/Makefile
+++ b/tools/testing/selftests/netfilter/Makefile
@ -2,6 +2,6 @@
 # Makefile for netfilter selftests

 TEST_PROGS := nft_trans_stress.sh nft_nat.sh bridge_brouter.sh \
-	conntrack_icmp_related.sh
+	conntrack_icmp_related.sh nft_flowtable.sh

 include ../lib.mk
--- a/tools/testing/selftests/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/netfilter/nft_flowtable.sh
@ -0,0 +1,324 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This tests basic flowtable functionality.
+# Creates following topology:
+#
+# Originator (MTU 9000) <-Router1-> MTU 1500 <-Router2-> Responder (MTU 2000)
+# Router1 is the one doing flow offloading, Router2 has no special
+# purpose other than having a link that is smaller than either Originator
+# and responder, i.e. TCPMSS announced values are too large and will still
+# result in fragmentation and/or PMTU discovery.
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+ret=0
+
+ns1in=""
+ns2in=""
+ns1out=""
+ns2out=""
+
+log_netns=$(sysctl -n net.netfilter.nf_log_all_netns)
+
+nft --version > /dev/null 2>&1
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without nft tool"
+	exit $ksft_skip
+fi
+
+ip -Version > /dev/null 2>&1
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without ip tool"
+	exit $ksft_skip
+fi
+
+which nc > /dev/null 2>&1
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without nc (netcat)"
+	exit $ksft_skip
+fi
+
+ip netns add nsr1
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not create net namespace"
+	exit $ksft_skip
+fi
+
+ip netns add ns1
+ip netns add ns2
+
+ip netns add nsr2
+
+cleanup() {
+	for i in 1 2; do
+		ip netns del ns$i
+		ip netns del nsr$i
+	done
+
+	rm -f "$ns1in" "$ns1out"
+	rm -f "$ns2in" "$ns2out"
+
+	[ $log_netns -eq 0 ] && sysctl -q net.netfilter.nf_log_all_netns=$log_netns
+}
+
+trap cleanup EXIT
+
+sysctl -q net.netfilter.nf_log_all_netns=1
+
+ip link add veth0 netns nsr1 type veth peer name eth0 netns ns1
+ip link add veth1 netns nsr1 type veth peer name veth0 netns nsr2
+
+ip link add veth1 netns nsr2 type veth peer name eth0 netns ns2
+
+for dev in lo veth0 veth1; do
+  for i in 1 2; do
+    ip -net nsr$i link set $dev up
+  done
+done
+
+ip -net nsr1 addr add 10.0.1.1/24 dev veth0
+ip -net nsr1 addr add dead:1::1/64 dev veth0
+
+ip -net nsr2 addr add 10.0.2.1/24 dev veth1
+ip -net nsr2 addr add dead:2::1/64 dev veth1
+
+# set different MTUs so we need to push packets coming from ns1 (large MTU)
+# to ns2 (smaller MTU) to stack either to perform fragmentation (ip_no_pmtu_disc=1),
+# or to do PTMU discovery (send ICMP error back to originator).
+# ns2 is going via nsr2 with a smaller mtu, so that TCPMSS announced by both peers
+# is NOT the lowest link mtu.
+
+ip -net nsr1 link set veth0 mtu 9000
+ip -net ns1 link set eth0 mtu 9000
+
+ip -net nsr2 link set veth1 mtu 2000
+ip -net ns2 link set eth0 mtu 2000
+
+# transfer-net between nsr1 and nsr2.
+# these addresses are not used for connections.
+ip -net nsr1 addr add 192.168.10.1/24 dev veth1
+ip -net nsr1 addr add fee1:2::1/64 dev veth1
+
+ip -net nsr2 addr add 192.168.10.2/24 dev veth0
+ip -net nsr2 addr add fee1:2::2/64 dev veth0
+
+for i in 1 2; do
+  ip netns exec nsr$i sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null
+  ip netns exec nsr$i sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null
+
+  ip -net ns$i link set lo up
+  ip -net ns$i link set eth0 up
+  ip -net ns$i addr add 10.0.$i.99/24 dev eth0
+  ip -net ns$i route add default via 10.0.$i.1
+  ip -net ns$i addr add dead:$i::99/64 dev eth0
+  ip -net ns$i route add default via dead:$i::1
+  ip netns exec ns$i sysctl net.ipv4.tcp_no_metrics_save=1 > /dev/null
+
+  # don't set ip DF bit for first two tests
+  ip netns exec ns$i sysctl net.ipv4.ip_no_pmtu_disc=1 > /dev/null
+done
+
+ip -net nsr1 route add default via 192.168.10.2
+ip -net nsr2 route add default via 192.168.10.1
+
+ip netns exec nsr1 nft -f - <<EOF
+table inet filter {
+  flowtable f1 {
+     hook ingress priority 0
+     devices = { veth0, veth1 }
+   }
+
+   chain forward {
+      type filter hook forward priority 0; policy drop;
+
+      # flow offloaded? Tag ct with mark 1, so we can detect when it fails.
+      meta oif "veth1" tcp dport 12345 flow offload @f1 counter
+
+      # use packet size to trigger 'should be offloaded by now'.
+      # otherwise, if 'flow offload' expression never offloads, the
+      # test will pass.
+      tcp dport 12345 meta length gt 200 ct mark set 1 counter
+
+      # this turns off flow offloading internally, so expect packets again
+      tcp flags fin,rst ct mark set 0 accept
+
+      # this allows large packets from responder, we need this as long
+      # as PMTUd is off.
+      # This rule is deleted for the last test, when we expect PMTUd
+      # to kick in and ensure all packets meet mtu requirements.
+      meta length gt 1500 accept comment something-to-grep-for
+
+      # next line blocks connection w.o. working offload.
+      # we only do this for reverse dir, because we expect packets to
+      # enter slow path due to MTU mismatch of veth0 and veth1.
+      tcp sport 12345 ct mark 1 counter log prefix "mark failure " drop
+
+      ct state established,related accept
+
+      # for packets that we can't offload yet, i.e. SYN (any ct that is not confirmed)
+      meta length lt 200 oif "veth1" tcp dport 12345 counter accept
+
+      meta nfproto ipv4 meta l4proto icmp accept
+      meta nfproto ipv6 meta l4proto icmpv6 accept
+   }
+}
+EOF
+
+if [ $? -ne 0 ]; then
+	echo "SKIP: Could not load nft ruleset"
+	exit $ksft_skip
+fi
+
+# test basic connectivity
+ip netns exec ns1 ping -c 1 -q 10.0.2.99 > /dev/null
+if [ $? -ne 0 ];then
+  echo "ERROR: ns1 cannot reach ns2" 1>&2
+  bash
+  exit 1
+fi
+
+ip netns exec ns2 ping -c 1 -q 10.0.1.99 > /dev/null
+if [ $? -ne 0 ];then
+  echo "ERROR: ns2 cannot reach ns1" 1>&2
+  exit 1
+fi
+
+if [ $ret -eq 0 ];then
+	echo "PASS: netns routing/connectivity: ns1 can reach ns2"
+fi
+
+ns1in=$(mktemp)
+ns1out=$(mktemp)
+ns2in=$(mktemp)
+ns2out=$(mktemp)
+
+make_file()
+{
+	name=$1
+	who=$2
+
+	SIZE=$((RANDOM % (1024 * 8)))
+	TSIZE=$((SIZE * 1024))
+
+	dd if=/dev/urandom of="$name" bs=1024 count=$SIZE 2> /dev/null
+
+	SIZE=$((RANDOM % 1024))
+	SIZE=$((SIZE + 128))
+	TSIZE=$((TSIZE + SIZE))
+	dd if=/dev/urandom conf=notrunc of="$name" bs=1 count=$SIZE 2> /dev/null
+}
+
+check_transfer()
+{
+	in=$1
+	out=$2
+	what=$3
+
+	cmp "$in" "$out" > /dev/null 2>&1
+	if [ $? -ne 0 ] ;then
+		echo "FAIL: file mismatch for $what" 1>&2
+		ls -l "$in"
+		ls -l "$out"
+		return 1
+	fi
+
+	return 0
+}
+
+test_tcp_forwarding()
+{
+	local nsa=$1
+	local nsb=$2
+	local lret=0
+
+	ip netns exec $nsb nc -w 5 -l -p 12345 < "$ns2in" > "$ns2out" &
+	lpid=$!
+
+	sleep 1
+	ip netns exec $nsa nc -w 4 10.0.2.99 12345 < "$ns1in" > "$ns1out" &
+	cpid=$!
+
+	sleep 3
+
+	kill $lpid
+	kill $cpid
+	wait
+
+	check_transfer "$ns1in" "$ns2out" "ns1 -> ns2"
+	if [ $? -ne 0 ];then
+		lret=1
+	fi
+
+	check_transfer "$ns2in" "$ns1out" "ns1 <- ns2"
+	if [ $? -ne 0 ];then
+		lret=1
+	fi
+
+	return $lret
+}
+
+make_file "$ns1in" "ns1"
+make_file "$ns2in" "ns2"
+
+# First test:
+# No PMTU discovery, nsr1 is expected to fragment packets from ns1 to ns2 as needed.
+test_tcp_forwarding ns1 ns2
+if [ $? -eq 0 ] ;then
+	echo "PASS: flow offloaded for ns1/ns2"
+else
+	echo "FAIL: flow offload for ns1/ns2:" 1>&2
+	ip netns exec nsr1 nft list ruleset
+	ret=1
+fi
+
+# delete default route, i.e. ns2 won't be able to reach ns1 and
+# will depend on ns1 being masqueraded in nsr1.
+# expect ns1 has nsr1 address.
+ip -net ns2 route del default via 10.0.2.1
+ip -net ns2 route del default via dead:2::1
+ip -net ns2 route add 192.168.10.1 via 10.0.2.1
+
+# Second test:
+# Same, but with NAT enabled.
+ip netns exec nsr1 nft -f - <<EOF
+table ip nat {
+   chain postrouting {
+      type nat hook postrouting priority 0; policy accept;
+      meta oifname "veth1" masquerade
+   }
+}
+EOF
+
+test_tcp_forwarding ns1 ns2
+
+if [ $? -eq 0 ] ;then
+	echo "PASS: flow offloaded for ns1/ns2 with NAT"
+else
+	echo "FAIL: flow offload for ns1/ns2 with NAT" 1>&2
+	ip netns exec nsr1 nft list ruleset
+	ret=1
+fi
+
+# Third test:
+# Same as second test, but with PMTU discovery enabled.
+handle=$(ip netns exec nsr1 nft -a list table inet filter | grep something-to-grep-for | cut -d \# -f 2)
+
+ip netns exec nsr1 nft delete rule inet filter forward $handle
+if [ $? -ne 0 ] ;then
+	echo "FAIL: Could not delete large-packet accept rule"
+	exit 1
+fi
+
+ip netns exec ns1 sysctl net.ipv4.ip_no_pmtu_disc=0 > /dev/null
+ip netns exec ns2 sysctl net.ipv4.ip_no_pmtu_disc=0 > /dev/null
+
+test_tcp_forwarding ns1 ns2
+if [ $? -eq 0 ] ;then
+	echo "PASS: flow offloaded for ns1/ns2 with NAT and pmtu discovery"
+else
+	echo "FAIL: flow offload for ns1/ns2 with NAT and pmtu discovery" 1>&2
+	ip netns exec nsr1 nft list ruleset
+fi
+
+exit $ret
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@ -36,7 +36,11 @@ trap cleanup EXIT
 ip netns add ns1
 ip netns add ns2

-ip link add veth0 netns ns0 type veth peer name eth0 netns ns1
+ip link add veth0 netns ns0 type veth peer name eth0 netns ns1 > /dev/null 2>&1
+if [ $? -ne 0 ];then
+    echo "SKIP: No virtual ethernet pair device support in kernel"
+    exit $ksft_skip
+fi
 ip link add veth1 netns ns0 type veth peer name eth0 netns ns2

 ip -net ns0 link set lo up