Commit Graph

1172897 Commits

Author SHA1 Message Date
Jakub Kicinski e61caf04b9 Merge branch 'page_pool-allow-caching-from-safely-localized-napi'
Jakub Kicinski says:

====================
page_pool: allow caching from safely localized NAPI

I went back to the explicit "are we in NAPI method", mostly
because I don't like having both around :( (even tho I maintain
that in_softirq() && !in_hardirq() is as safe, as softirqs do
not nest).

Still returning the skbs to a CPU, tho, not to the NAPI instance.
I reckon we could create a small refcounted struct per NAPI instance
which would allow sockets and other users so hold a persisent
and safe reference. But that's a bigger change, and I get 90+%
recycling thru the cache with just these patches (for RR and
streaming tests with 100% CPU use it's almost 100%).

Some numbers for streaming test with 100% CPU use (from previous version,
but really they perform the same):

		HW-GRO				page=page
		before		after		before		after
recycle:
cached:			0	138669686		0	150197505
cache_full:		0	   223391		0	    74582
ring:		138551933         9997191	149299454		0
ring_full: 		0             488	     3154	   127590
released_refcnt:	0		0		0		0

alloc:
fast:		136491361	148615710	146969587	150322859
slow:		     1772	     1799	      144	      105
slow_high_order:	0		0		0		0
empty:		     1772	     1799	      144	      105
refill:		  2165245	   156302	  2332880	     2128
waive:			0		0		0		0

v1: https://lore.kernel.org/all/20230411201800.596103-1-kuba@kernel.org/
rfcv2: https://lore.kernel.org/all/20230405232100.103392-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20230413042605.895677-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:14 -07:00
Jakub Kicinski 294e39e0d0 bnxt: hook NAPIs to page pools
bnxt has 1:1 mapping of page pools and NAPIs, so it's safe
to hoook them up together.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:12 -07:00
Jakub Kicinski 8c48eea3ad page_pool: allow caching from safely localized NAPI
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:12 -07:00
Jakub Kicinski b07a2d97ba net: skb: plumb napi state thru skb freeing paths
We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.

Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:12 -07:00
Yevgeny Kliteynik 220ae98783 net/mlx5: DR, Enable patterns and arguments for supporting devices
Check if patterns and arguments for modify header action
are supported and enable them accordingly.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:22 -07:00
Yevgeny Kliteynik a21e52bb8f net/mlx5: DR, Add support for the pattern/arg parameters in debug dump
Support the pattern/args-based MODIFY_HDR and TNL_L3_TO_L2 actions in dbg dump

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:22 -07:00
Yevgeny Kliteynik 40ff097f25 net/mlx5: DR, Modify header action of size 1 optimization
Set modify header action of size 1 directly on the STE for supporting
devices, thus reducing number of hops and cache misses.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:22 -07:00
Yevgeny Kliteynik 947e258537 net/mlx5: DR, Support decap L3 action using pattern / arg mechanism
Use the new accelerated action for decap L3 on RX side:
use the mechanism of pattern and argument same as in
modify-header action.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:22 -07:00
Yevgeny Kliteynik 62e40c8568 net/mlx5: DR, Apply new accelerated modify action and decapl3
If there is support for pattern/args, use the new accelerated modify
header action for modify header and decap L3 actions.
Otherwise fall back to the old modify-header implementation.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:22 -07:00
Yevgeny Kliteynik 0caebadda5 net/mlx5: DR, Add modify header argument pointer to actions attributes
While building the actions, add the pointer of the arguments for
accelerated modify list action into the action's attributes.
This will be used later on while building the specific STE
for this action.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:22 -07:00
Yevgeny Kliteynik 608d4f1769 net/mlx5: DR, Add modify header arg pool mechanism
Added new mechanism for handling arguments for modify-header action.
The new action "accelerated modify-header" asks for the arguments from
separated area from the pattern, this area accessed via general objects.
Handling of these object is done via the pool-manager struct.

When the new header patterns are supported, while loading the domain,
a few pools for argument creations will be created. The requests for
allocating/deallocating arg objects are done via the pool manager API.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:21 -07:00
Yevgeny Kliteynik 17dc71c336 net/mlx5: DR, Fix QP continuous allocation
When allocating a QP we allocate an RQ and an SQ, the RQ is stored first
in memory and followed by the SQ.
This allocation is not physically continiuos - it may span across different
physical pages. SW Steering code always writes in pairs: 1BB write + 1BB read,
or 2 continuous BBs of GTA WQE.

This lead to an issue where RQ allocation was 4x16 which is equal to 1 WQE BB,
causing 1 BB offset in the page and splitting the GTA WQE between different
physical pages.

The solution was to create the RQ with a even number of BBs and to have the
RQ aligned to a page.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:21 -07:00
Yevgeny Kliteynik 7d7c9453d6 net/mlx5: DR, Read ICM memory into dedicated buffer
Instead of using the write buffer for reading we will use a dedicated
buffer only for reading ICM memory.
Due to the new support for args, we can have a case with pending_wc
being odd number, and with reading into the same write buffer, it is
possible to overwrite next write on the same slot.
For example:
pending_wc is 17 so the buffer for write is:
   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
and we have requests as follows:
   r wr wr wr wr wr wr wr wr
Now, the first read will be written into the last write because we use
the same buffer for read and write, before it was written to the HW and
we will have a wrong data in the ICM area.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:21 -07:00
Yevgeny Kliteynik 4605fc0a2b net/mlx5: DR, Add support for writing modify header argument
The accelerated modify header arguments are written in the HW area
with special WQE and specific data format.
New function was added to support writing of new argument type.
Note that GTA WQE is larger than READ and WRITE, so the queue
management logic was updated to support this.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:21 -07:00
Yevgeny Kliteynik de69696b6e net/mlx5: DR, Add create/destroy for modify-header-argument general object
Add functions for creation/destruction of the new type of general object.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:21 -07:00
Yevgeny Kliteynik b7ba743a2f net/mlx5: DR, Check for modify_header_argument device capabilities
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:21 -07:00
Yevgeny Kliteynik 2533e726f4 net/mlx5: DR, Split chunk allocation to HW-dependent ways
This way we are able to allocate chunk for modify_headers from 2 types:
STEv0 that is allocated from the action area, and STEv1 that is allocating
the chunks from the special area for patterns.

Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:20 -07:00
Yevgeny Kliteynik da5d0027d6 net/mlx5: DR, Add cache for modify header pattern
Starting with ConnectX-6 Dx, we use new design of modify_header FW object.
The current modify_header object allows for having only limited number
of FW objects, so the new design of pattern and argument allows pattern
reuse, saving memory, and having a large number of modify_header objects.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:20 -07:00
Yevgeny Kliteynik b47dddc624 net/mlx5: DR, Move ACTION_CACHE_LINE_SIZE macro to header
Move ACTION_CACHE_LINE_SIZE macro to header to be used by
the pattern functions as well.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-14 15:06:20 -07:00
Linus Torvalds 7a934f4bd7 RISC-V Fixes for 6.3-rc7
* A fix for a missing fence when generating the NOMMU sigreturn
   trampoline.
 * A set of fixes for early DTB handling of reserved memory nodes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmQ5ZsoTHHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYiUhED/9IsUdJfMdQHzrL7XZL7U9FgHhSCVl4
 HgpcpngG6Op+F8lGnVhQV4ivoVYrk85lR2JqJeES9FpChSmZXCptB6K9NFCDLdW+
 9+Qnqa6NnmcJ9Aba7Ckr7zwApohtDGucegIKx3bqafsnBqh2SM/l7ljCWEuaxIbZ
 5qFEbWGcILVGQrXJheLgkx7uGbBmEvrmLi9T5jZ1Vwltg1QS2STDjOCSR5cfBpx7
 kqoWfSa1+fb5RF428KRvDTuIjUzkkKvUEo/yEzJMMp1s3vI2mKT8ytqQHw4GrQ9w
 6X/wvabjDtz8iGRWsh4VVg7iJ+xImzpqRO4UXkd4wArCDdCvjSu4BpAs3cOqKVTl
 4w+dTbQPjL22PYCdhBVmJH8K4TmJGzSkOPK8wcIlK1mQaRCAjAalSvCLQIN/+ttq
 4AwkXLpfwLvJLhn78Y/7WS6dbedYjMbNxdz2Sy6yFm3qZj1dqfgS5yIuHpLVo7UW
 AK7nS3FGJRyii8w/lywxDgLTdMeJiZ8/Xy0gJkWps5sGolBIPqSwUcLDhFS68HoM
 6x+IzOXDlHnw7i/sj6Q10IRvjFVLieNDYniD7CnOYDKFSgUOvehCMlky7l4naRdO
 vfVHsHPX38z5ZTeM/vZ1HKobKeTmi3Hy/C1MWUwMgj2nPkui777c9mAQLHY7ppeZ
 dr4Cn/4GNfZz3w==
 =1Tw7
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - A fix for a missing fence when generating the NOMMU sigreturn
   trampoline

 - A set of fixes for early DTB handling of reserved memory nodes

* tag 'riscv-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: No need to relocate the dtb as it lies in the fixmap region
  riscv: Do not set initial_boot_params to the linear address of the dtb
  riscv: Move early dtb mapping into the fixmap region
  riscv: add icache flush for nommu sigreturn trampoline
2023-04-14 10:44:48 -07:00
Linus Torvalds 95abc817ab ACPI fixes for 6.3-rc7
- Add a quirk to force StorageD3Enable on AMD Picasso systems (Mario
    Limonciello).
 
  - Add an ACPI IRQ override quirk for ASUS ExpertBook B1502CBA (Paul
    Menzel).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQ5V/4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx3Z8P/3LgmzvoNt+Xsfx3u3//vyu02XGHd48N
 xA9SQgNJeqwPQxpjnZnycuX5RPex4NGjzEQaIuXkxRVf3UnMzxFir4AYT22qJfea
 7m1S59ym/LBCOIhGWuQ4de9vhmebR60nSMrR5geaQVwWedTUfMphIjh10volcnae
 IwWuaABCtL9vNoPdzy1syhg1nWFZnMIvDrGDdexBQ+68BQznAnEvL0ZlEUv4gls3
 Dh2I4ZYe+95Me8qAaNthG3U+BfLDR+/7Sqo3H0urOoTDcLxq/n2Wu6YfeC1Y6dp2
 LoBPWGVQpmuITZjTrr5BDTbzjePEA7l0vRdKkhKqmkQ6cfgelwL6LxK2ShIU8Ulx
 5OFxlR/i3kZmdpc3n32WdCaVScNEjZ1NEhoGkqAOBXyIplDm83K+uscy6M5Gv5fM
 lCgwzWu5Vn+NLDzZNx2fIFHOs5sLXb8R9cnycazVUas/NMC96VYRqViK4jcooFHz
 +y/Z4Sl23gx9i8C85UYN/02jPcjjALWH1PlOyVXWx/0wOvukYLQ1z943+ZEZyfkb
 vtPwv/xFuLNiZIrnDBBMoi0eri5amdqchYC/JKzDlVs6oPtR2F7hM7nqaajJ1JOz
 WLQy4XH5lAueUeMneAY2Th7Ogc3lMHsAPOChLwXAvPRJyxPsBaiS0PUeTVcGdiQ/
 vYUg1budlxa+
 =SG+Z
 -----END PGP SIGNATURE-----

Merge tag 'acpi-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These add two ACPI-related quirks:

   - Add a quirk to force StorageD3Enable on AMD Picasso systems (Mario
     Limonciello)

   - Add an ACPI IRQ override quirk for ASUS ExpertBook B1502CBA (Paul
     Menzel)"

* tag 'acpi-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: resource: Skip IRQ override on ASUS ExpertBook B1502CBA
  ACPI: x86: utils: Add Picasso to the list for forcing StorageD3Enable
2023-04-14 10:37:07 -07:00
Linus Torvalds 4b992ead33 Power management fix for 6.3-rc7
Make the amd-pstate cpufreq driver take all of the possible combinations
 of the "old" and "new" status values correctly while changing the
 operation mode via sysfs (Wyes Karny).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQ5WF8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxGt0P/0NGMuOoDVS9zNhROphsUhBWw/XuVYQQ
 xrtAkouF9zPDRR7R3p/tbqHNeaSOGapmGbFK6Xm4Z+Cko09jlHpIssfrkCK9W+mh
 BI8CC6jdFavstzbifzLwVg9dkTeD2ANgCwkSM4CLxzyzktV3C4WOcrBombzpK6CI
 rTL039jabNaelSzB+dp2asX8GVQD+H57BXTPb33h6BLfH2tvnreVEfTOTlOvrCKD
 eClLvDw8QArp74qh4U0VwtJ1EIZsYqv84AxSfmKQi9dOrfwIK81D9dHhJXpKfj4V
 4DSOuRRxG4wbxF816gXm1mbFiFs2+aK3SkrPXLP4LyphqMfDRP3ID/HojcrUBBtp
 YCiAQ1JHa9DLWuqOO8/pqUgIetwLuZYD/10JBgXX+ULmkC4Wuqd1Lz1GspdP9/Xy
 YJrMcwTAj5KCZw2ehdaeZMBa7Ox2flkeS1QdMY8uBposX2w/F8IhWTC4Xe5tbmk2
 s1hYZL7yHAZWTyNtcTY7+gPte4qRzQJ/dhFgvbwclaL6z4cgLitkRMf1Rv+vceV+
 pKpqzZWRUBqVCNSl8zZdYfyYy2qfAqMszvbDH5fa+26wqQNsKjAGrfQy4qOZGBje
 BQKsVA1MlAAEckmRH5wOoO0XieT9pn7hJjuTcocg4R4mdF+IfgmUsHgDiGnxG6X+
 BkdRd1ec5NX+
 =gU5P
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Make the amd-pstate cpufreq driver take all of the possible
  combinations of the 'old' and 'new' status values correctly while
  changing the operation mode via sysfs (Wyes Karny)"

* tag 'pm-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  amd-pstate: Fix amd_pstate mode switch
2023-04-14 10:25:30 -07:00
Linus Torvalds d0b85e7e60 Thermal control fix for 6.3-rc7
Modify the Intel thermal throttling code to avoid updating unsupported
 status clearing mask bits which causes the kernel to complain about
 unchecked MSR access (Srinivas Pandruvada).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQ5WNoSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxN3sP/jxTTvyatqFDmrmjuf4rysrapgGhh9IF
 l1iTixZiHWN3KsbpGoXGvK6QmbyAB0S61FMzucnKktrdzA5tSLSms+UqQcTlqucP
 pjNzZikojs4pEML9GvlChC2V2aKhALNrzEfIuS24Pw/TEDTH2Mbm0Wrz+yRg5PPt
 pQhdBwe9Y/NTSVRt3JlZJEzGcWajRsx5YZMyud2zGUqXtNJnpSw5y3Klrbdd3urs
 6501j/jLUcLRmLbnmf8oZGs0tCZK/FVhfgVpBemhJgMxeqflMnXBwYw43d0GLJNA
 bg7kq6eSCr7IP743AAuJaory9WCqAoa+6Km1BGS5TpTOdVtsJ5UDr7ARaCB7E+jg
 o0lYIs3O1q4WqFBV85R0JtRDopgdkFYxWSjNNa5lyAVwVbVA7Xi7jt48LUaSRO23
 Ag76U7VWMXNKSpznPxhTboywE5MH4Gu+VeoVeitXNTG7hM7plB1UoFZ5br3+dPfT
 qyxHvMkG2n+V9a7JXxaGg2/LvG97QX9LPtCPMTyVAMgQWtg2JN3NblXzGjunZnVK
 kXScxvRYKKH2xz1ArdTXJj0z9CsnyHRebKXk+PcbFY0kKtQupKpNXSJsMv5zC4kP
 /uKVxxdAroLfoMkD8BFjhiDV7GHrhXpSBJz72dOIB0eG4jEV+82XCohdNIj9xzEV
 LtJcGOutktSu
 =8D2v
 -----END PGP SIGNATURE-----

Merge tag 'thermal-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull thermal control fix from Rafael Wysocki:
 "Modify the Intel thermal throttling code to avoid updating unsupported
  status clearing mask bits which causes the kernel to complain about
  unchecked MSR access (Srinivas Pandruvada)"

* tag 'thermal-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: intel: Avoid updating unsupported THERM_STATUS_CLEAR mask bits
2023-04-14 10:19:18 -07:00
Linus Torvalds e251c42318 sound fixes for 6.3-rc7
A collection of small fixes.  At this time, quite a few fixes for
 the old PCI drivers are found.  Although they are no regression
 fixes, I took these as they are materials for stable kernels.
 In addition, a couple of regression fixes and another couple of
 HD-audio quirks are included.
 -----BEGIN PGP SIGNATURE-----
 
 iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmQ5RlAOHHRpd2FpQHN1
 c2UuZGUACgkQLtJE4w1nLE+lzxAAzfsapwLW4pN5gDApDiAFxu9Lu9VQvk3e4k7W
 khQFb7o+2jdjyCTiGEQCfFbPu/Ru4KlrYXkCkHEXRR0/95NiRq9GQx6xJtU+S7/m
 WBTWPhIh2Hic90yYUTD62Pb7j40P8C+wsATwpQftQIdGixBdv7nirbbzBPi5Xfcs
 +4iu8TBEh9izFIAnXADl/O+WA3E4r6TOvDeuD2FZLQWPcJLeHF9h79BU0k7UmUYR
 ZgLg+0GJrwJR9A1SzGs5kc47Q0zmP62gRyExBSnskHFjCglbTCo0VhVpBoBi+o0y
 epHyOJrvs/AdpMz4VFvn4WP5IVtyxa0diUWmd/eRczzzxJThLpMQYx2+ukYkUJMc
 +fua+NraWqXwC+s7+C/N8MhFXbSrRm+4fzOPmBdE9dV/hnAQIOCWM6f9PhciUcLm
 kZfcCCwZXNR0TVmtlZxKq5s4xyoYVVkEctQU3QO8TeHXKmV8EYQO+zzYQjch2izD
 H6gJyT8/QcKhRQSkthLEiltnkuFY+nq+jDdlCRH2/9v5VjvY0XZlBpxuP0vPIYX4
 iChCuCu5qkforCijetDkdArWdO4+WiFums4t+h1hekvhnN9IrrXUxU+dn+Hu63jJ
 oVtlxW1AVzEcYynUowI3jSo4z/5jxBF8a10XF4+ysCbUoTQ6pYTMVl4wYDEEEYJB
 2RUdHPQ=
 =Gmhh
 -----END PGP SIGNATURE-----

Merge tag 'sound-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of small fixes.

  At this time, quite a few fixes for the old PCI drivers are found.
  Although they are not regression fixes, I took these as they are
  materials for stable kernels.

  In addition, a couple of regression fixes and another couple of
  HD-audio quirks are included"

* tag 'sound-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/hdmi: disable KAE for Intel DG2
  ALSA: hda/realtek: Add quirks for Lenovo Z13/Z16 Gen2
  ALSA: hda: patch_realtek: add quirk for Asus N7601ZM
  ALSA: firewire-tascam: add missing unwind goto in snd_tscm_stream_start_duplex()
  ALSA: emu10k1: don't create old pass-through playback device on Audigy
  ALSA: emu10k1: fix capture interrupt handler unlinking
  ALSA: hda/sigmatel: fix S/PDIF out on Intel D*45* motherboards
  ALSA: hda/sigmatel: add pin overrides for Intel DP45SG motherboard
  ALSA: i2c/cs8427: fix iec958 mixer control deactivation
2023-04-14 10:13:54 -07:00
Linus Torvalds aee3c14e86 v6.3 rc RDMA pull request
Driver bug fixes:
 
 - irdma should not generate extra completions during flushing
 
 - Fix several memory leaks
 
 - Do not get confused in irdma's iwarp mode if IPv6 is present
 
 - Correct a link speed calculation in mlx5
 
 - Increase the EQ/WQ limits on erdma as they are too small for big
   applications
 
 - Use the right math for erdma's inline mtt feature
 
 - Make erdma probing more robust to boot time ordering differences
 
 - Fix a KMSAN crash in CMA due to uninitialized qkey
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZDlBPgAKCRCFwuHvBreF
 YfOtAQDAX3rEL4T6xu4jIHAhInTYZCYVI7pJALTzHh62DmdoZAD+Owy2vTRTmkvJ
 OLfA++yDRWdrzeSeCWbTYpwEo+00TAA=
 =9XAm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "We had a fairly slow cycle on the rc side this time, here are the
  accumulated fixes, mostly in drivers:

   - irdma should not generate extra completions during flushing

   - Fix several memory leaks

   - Do not get confused in irdma's iwarp mode if IPv6 is present

   - Correct a link speed calculation in mlx5

   - Increase the EQ/WQ limits on erdma as they are too small for big
     applications

   - Use the right math for erdma's inline mtt feature

   - Make erdma probing more robust to boot time ordering differences

   - Fix a KMSAN crash in CMA due to uninitialized qkey"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/core: Fix GID entry ref leak when create_ah fails
  RDMA/cma: Allow UD qp_type to join multicast only
  RDMA/erdma: Defer probing if netdevice can not be found
  RDMA/erdma: Inline mtt entries into WQE if supported
  RDMA/erdma: Update default EQ depth to 4096 and max_send_wr to 8192
  RDMA/erdma: Fix some typos
  IB/mlx5: Add support for 400G_8X lane speed
  RDMA/irdma: Add ipv4 check to irdma_find_listener()
  RDMA/irdma: Increase iWARP CM default rexmit count
  RDMA/irdma: Fix memory leak of PBLE objects
  RDMA/irdma: Do not generate SW completions for NOPs
2023-04-14 10:06:50 -07:00
Ilya Leoshkevich c730fce7c7 s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL
Thomas Richter reported a crash in linux-next with a backtrace similar
to the following one:

	 [<0000000000000000>] 0x0
	([<000000000031a182>] bpf_trace_run4+0xc2/0x218)
	 [<00000000001d59f4>] __bpf_trace_sched_switch+0x1c/0x28
	 [<0000000000c44a3a>] __schedule+0x43a/0x890
	 [<0000000000c44ef8>] schedule+0x68/0x110
	 [<0000000000c4e5ca>] do_nanosleep+0xa2/0x168
	 [<000000000026e7fe>] hrtimer_nanosleep+0xf6/0x1c0
	 [<000000000026eb6e>] __s390x_sys_nanosleep+0xb6/0xf0
	 [<0000000000c3b81c>] __do_syscall+0x1e4/0x208
	 [<0000000000c50510>] system_call+0x70/0x98
	Last Breaking-Event-Address:
	 [<000003ff7fda1814>] bpf_prog_65e887c70a835bbf_on_switch+0x1a4/0x1f0

The problem is that bpf_arch_text_poke() with new_addr == NULL is
susceptible to the following race condition:

	T1                 T2
        -----------------  -------------------
	plt.target = NULL
	                   entry: brcl 0xf,plt
	entry.mask = 0
	                   lgrl %r1,plt.target
	                   br %r1

Fix by setting PLT target to the instruction following `brcl 0xf,plt`
instead of 0. This way T2 will simply resume the execution of the eBPF
program, which is the desired effect of passing new_addr == NULL.

Fixes: f1d5df84cd ("s390/bpf: Implement bpf_arch_text_poke()")
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/bpf/20230414154755.184502-1-iii@linux.ibm.com
2023-04-14 18:28:52 +02:00
Rafael J. Wysocki a3babdb7a8 Merge branch 'acpi-x86'
Merge a quirk to force StorageD3Enable on AMD Picasso systems (Mario
Limonciello).

* acpi-x86:
  ACPI: x86: utils: Add Picasso to the list for forcing StorageD3Enable
2023-04-14 15:15:32 +02:00
Ming Lei 860e1c7f8b io_uring: complete request via task work in case of DEFER_TASKRUN
So far io_req_complete_post() only covers DEFER_TASKRUN by completing
request via task work when the request is completed from IOWQ.

However, uring command could be completed from any context, and if io
uring is setup with DEFER_TASKRUN, the command is required to be
completed from current context, otherwise wait on IORING_ENTER_GETEVENTS
can't be wakeup, and may hang forever.

The issue can be observed on removing ublk device, but turns out it is
one generic issue for uring command & DEFER_TASKRUN, so solve it in
io_uring core code.

Fixes: e6aeb2721d ("io_uring: complete all requests in task context")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@kernel.dk/
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-14 06:38:23 -06:00
Jens Axboe f7ca1ae32b Merge branch 'nvme-6.3' of git://git.infradead.org/nvme into block-6.3
Pull NVMe fix from Christoph.

* 'nvme-6.3' of git://git.infradead.org/nvme:
  nvme-pci: add NVME_QUIRK_BOGUS_NID for T-FORCE Z330 SSD
2023-04-14 06:29:00 -06:00
Arnd Bergmann d75eecc3d1 A few more Qualcomm ARM64 DeviceTree fixes for 6.3
The GPIO polarity of the WSA881x shutdown GPIO was inconsistent and had
 to be corrected in the driver, this fixes the polarity in the DeviceTree
 for QRB5165 RB5, SM8250 MTP, Samsung Galaxy Book 2 and Lenovo Yoga C630.
 
 The recent rearrangement of nodes among the IPQ8074 accidentally enabled
 the PCIe PHYs, rather than the PCIe controllers. This is being
 corrected, to restore PCIe functionality.
 
 PMK8280 PON node has the wrong compatible, which recently caused the
 driver to stop probing. This is corrected and the required "pbs" region
 is added.
 
 With support for HBR3 introduced, it's noted that SC7280 Herobrine
 devices are having trouble running at this rate. This drops the claim
 that it's supported, until further analysis can be done.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAmQ0LU8VHGFuZGVyc3Nv
 bkBrZXJuZWwub3JnAAoJEAsfOT8Nma3Fa+wP/23+62lBn+5sfsUMo5k/ajYel0Ug
 sfqMS7PDvTPG3AsVhk1R95NrKWb9eWmkpvkwVKsBHD75BeRkuDovIx9yxSLSDZm2
 yWlV514AOLWlhgBu75XzOIDy/J9BQccZPmkBHPJY2I9vpnvlhWnQhaPx4tKgQM8T
 5qKuQUf2Q1sQy/QDcbOrMEpHCBHl+Zm11ffmnFosXtVsF0xEzcGyHcKwCOMSaxg1
 KVxeK0wS3Mt09D4SiPwVNaMA+c0vcexp2jpySoQBMbEmbYOZr3DEU+b5lyZDQqOi
 odMwF4EhMHaGvUCRwM+c4GgdcdXP8IT4eex9ZjdgYblzJsAUfdpFQ5NBB4PeoHPt
 9GOp1n+s4+L3pVm9oyeqUDAUNhE3X6PinOfEScwtgFnQAob45k+6vznVgWt66Apg
 uZAXRE1kONxtJO7bR8hzH+NH395mtb18lHh1MF09KdIR48wl8bK2f91yP0NkSTY8
 N4mKcdy9GwBAmTK3FpEmExOWfmwq0EXdeV2j2RdrVTGNBvTuX9nzpKHUdckk77Ay
 sOS5KfUANAHy/7UIOZHwerPKImRyo+y7gZ8kRcihtUgGOXaESeT1s1KPsTUE2ZqX
 xWuJWRxYizPHdn809iEr5jgAHOAZlKh5RsJ6QQijY1XKNwNLU5/8iLb61x55PbHv
 fnvqlKTO04pDSWD9
 =3Uaf
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmQ5PpEACgkQYKtH/8kJ
 Uic7Vw/8CTiwm+gErIrED6Wm7eCNkzvENkruXboy5437tyrPsnd68sJ+KFs7lznz
 0cXSA06bGKp6rj/i1nodfwO8A0PtEl+RAtCaRuesCI8hQRfkF9o6BOEjpmQ4mYbn
 09nr0ehIatVdwl+RxFn9GZEBQT4yPIt8cTjjgT/zyxUsomYJjDv6dl9iODyX8TBY
 w3J2VInSUW1GRrlf+GD7XPxbIHhHl7tqWG1neBENQirWb7N0xjDuq2X7absWyzgt
 TaPaKYiaqr9X5cE1cfDff+RDoBVZRdtG5H7Gs1u28UaVvmziP04oCxgD8L5cHbX5
 niT8cq0d0+f367o+B9WXfcqJXHk/0q3clWP+Ad5q6Bt7HhAkfxY5eezYUMgoDpN4
 3+fRrdBJBVK7e2OobxK8+tUt5Zu97zIrNL1b0k9+RkauhBL1efuOQ/tAfh4AY4LR
 tguyJ6RBIneNYlAWsWPW7f4TpkR/2XmnFYityjFMLLj75rVK2oC+q1kYU/4/Un6a
 Z9G20Uotr3YXThGQJasZSwh+R0maymIEN7pugg9mrImC5zvGOUXA1DC+6MqdLmd1
 ocqSr3a0p6jRe9t0xSBxS8YtFT33zU9pevOTnIIYrQhyQ1xL8DFXjmoEFbuQdvWG
 Ks2x83/F5OefLoGYXG5Q/K10eKmodaWsuyPsSePOkOTo4VWZEoE=
 =ikZD
 -----END PGP SIGNATURE-----

Merge tag 'qcom-arm64-fixes-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes

A few more Qualcomm ARM64 DeviceTree fixes for 6.3

The GPIO polarity of the WSA881x shutdown GPIO was inconsistent and had
to be corrected in the driver, this fixes the polarity in the DeviceTree
for QRB5165 RB5, SM8250 MTP, Samsung Galaxy Book 2 and Lenovo Yoga C630.

The recent rearrangement of nodes among the IPQ8074 accidentally enabled
the PCIe PHYs, rather than the PCIe controllers. This is being
corrected, to restore PCIe functionality.

PMK8280 PON node has the wrong compatible, which recently caused the
driver to stop probing. This is corrected and the required "pbs" region
is added.

With support for HBR3 introduced, it's noted that SC7280 Herobrine
devices are having trouble running at this rate. This drops the claim
that it's supported, until further analysis can be done.

* tag 'qcom-arm64-fixes-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
  arm64: dts: qcom: sc7280: remove hbr3 support on herobrine boards
  arm64: dts: qcom: sc8280xp-pmics: fix pon compatible and registers
  arm64: dts: qcom: ipq8074-hk10: enable QMP device, not the PHY node
  arm64: dts: qcom: ipq8074-hk01: enable QMP device, not the PHY node
  arm64: dts: qcom: qrb5165-rb5: Use proper WSA881x shutdown GPIO polarity
  arm64: dts: qcom: sm8250-mtp: Use proper WSA881x shutdown GPIO polarity
  arm64: dts: qcom: sdm850-samsung-w737: Use proper WSA881x shutdown GPIO polarity
  arm64: dts: qcom: sdm850-lenovo-yoga-c630: Use proper WSA881x shutdown GPIO polarity

Link: https://lore.kernel.org/r/20230410153850.4752-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-04-14 13:52:48 +02:00
Arnd Bergmann 43950556b7 Lower sd card speeds for two boards to make them run more reliable,
missing 32k clock definition for Anbric xx3 devices, missing cache-levels
 for rk3588, fixed rk3326-board display supplies and more dt-schema fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCAAuFiEE7v+35S2Q1vLNA3Lx86Z5yZzRHYEFAmQx/6UQHGhlaWtvQHNu
 dGVjaC5kZQAKCRDzpnnJnNEdgYMiB/45skI/LcajSjuqIJpAC+cDeCs/IaQb6aev
 4fnwBRYgb836JvThTGeOn4KRe2avnZMfyjn3IlkHhYBHtLukS8jcuNuU+T+oD1/n
 Yss63kUSbMCvNl13uCSz/dqsaFZJ1x8hHth0b19PlGWnnBgtzWWWScr371m+PsF6
 2x2dnteRvQcUyWa8/5Xavgc8bseMds2SiKEc9Y58MTwRVL/pqfh9AFJhsWRJYFsN
 x0SU+Gwopg2qW+Mfdm6DOK1m08Vu6Uaw1zftF06/oAnDqvYMiVSWfsFIpW5eEi/T
 kb5ovzTTDwnLz/n8jkG1ldB7hQY362NQjyIc/GoUab/03yxyFE63
 =JS4f
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmQ5PlAACgkQYKtH/8kJ
 UifHPRAAwF4jqjWejgzFnHsB7RPmjvUtvuobU0gmI25HIjKnWLdcuS4IFXWvFgLU
 gxc1oQS1m4C7eM8ceI/jGR3oDZfhKXCIrpif9Tg+LxtmwRaeo3tbpasPANwiO+Eq
 BI6LssLFStBblwwaaMzC1E/8fCdBsZYVbQe4LqsJmzb9hOvFCih4At1YFHKQXcHM
 L+1sCdpKMo/f/AQ3sjS3kEYnFwrIYdskxImHXmYkYe2e1OMkXSQcWjz958A7GPB1
 kdm5yMPARR8FZxY9RZBf6QLg3xnIc/OqraeHYCjpFLs+gDGGoEMGwwBW5u3/JJmE
 SLTq2EYYf6szdVh156zjfMtCFCIM51BXoWwlB3swhWI80WD87lzrKzYYRbW5H6jp
 5SIW74a8vR2OxOKyQu00Qb8S0SsTJF5Ord9EImYsm6rXpQnSs894kp18zNjmjUDh
 FLxNhMMuY8aiwbypYUm4YRFN9BJJth3SKLXIyP1xxBEpN3IIeQSIXeD9ZoA+sc6e
 OdEREqMwQV9xPq84JQLsxTqjcSCwO1MJECqwNvKj07w0qdaIMgDAgirLCTXSPJPh
 b8b2p07Dpb5wFKw/FwGSBuvi3DEzSLhK7zbJzky8zNhmWNAx0VS7gR/koFQDGxR0
 gNcVU5vPB2ujvBdW76MegqynP2UowsGdD3Sy3KlB/FAZ+LTGMGY=
 =Yjl8
 -----END PGP SIGNATURE-----

Merge tag 'v6.3-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes

Lower sd card speeds for two boards to make them run more reliable,
missing 32k clock definition for Anbric xx3 devices, missing cache-levels
for rk3588, fixed rk3326-board display supplies and more dt-schema fixes.

* tag 'v6.3-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
  arm64: dts: rockchip: correct panel supplies on some rk3326 boards
  arm64: dts: rockchip: use just "port" in panel on RockPro64
  arm64: dts: rockchip: use just "port" in panel on Pinebook Pro
  arm64: dts: rockchip: Remove non-existing pwm-delay-us property
  arm64: dts: rockchip: Add clk_rtc_32k to Anbernic xx3 Devices
  arm64: dts: rockchip: add rk3588 cache level information
  arm64: dts: rockchip: Lower SD card speed on rk3399 Pinebook Pro
  arm64: dts: rockchip: Lower sd speed on rk3566-soquartz
  ARM: dts: rockchip: fix a typo error for rk3288 spdif node
  arm64: dts: rockchip: Fix rk3399 GICv3 ITS node name

Link: https://lore.kernel.org/r/10559306.CDJkKcVGEf@phil
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-04-14 13:51:43 +02:00
Johan Hovold 4323516879 firmware/psci: demote suspend-mode warning to info level
On some Qualcomm platforms, like SC8280XP, the attempt to set PC mode
during boot fails with PSCI_RET_DENIED and since commit 998fcd001f
("firmware/psci: Print a warning if PSCI doesn't accept PC mode") this
is now logged at warning level:

	psci: failed to set PC mode: -3

As there is nothing users can do about the firmware behaving this way,
demote the warning to info level and clearly mark it as a firmware bug:

	psci: [Firmware Bug]: failed to set PC mode: -3

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Acked-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-04-14 13:48:56 +02:00
David S. Miller c11d2e718c Merge branch 'msg_control-split'
Kevin Brodsky says:

====================
net: Finish up ->msg_control{,_user} split

Commit 1f466e1f15 ("net: cleanly handle kernel vs user buffers for
->msg_control") introduced the msg_control_user and
msg_control_is_user fields in struct msghdr, to ensure that user
pointers are represented as such. It also took care of converting most
users of struct msghdr::msg_control where user pointers are involved. It
did however miss a number of cases, and some code using msg_control
inappropriately has also appeared in the meantime.

This series is attempting to complete the split, by eliminating the
remaining cases where msg_control is used when in fact a user
pointer is stored in the union (patch 1).

It also addresses a couple of issues with msg_control_is_user: one where
it is not updated as it should (patch 2), and one where it is not
initialised (patch 3).

v1..v2:
* Split out the msg_control_is_user fixes into separate patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 11:09:27 +01:00
Kevin Brodsky b6d85cf5bd net/ipv6: Initialise msg_control_is_user
do_ipv6_setsockopt() makes use of struct msghdr::msg_control in the
IPV6_2292PKTOPTIONS case. Make sure to initialise
msg_control_is_user accordingly.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 11:09:27 +01:00
Kevin Brodsky 60daf8d40b net/compat: Update msg_control_is_user when setting a kernel pointer
cmsghdr_from_user_compat_to_kern() is an unusual case w.r.t. how
the kmsg->msg_control* fields are used. The input struct msghdr
holds a pointer to a user buffer, i.e. ksmg->msg_control_user is
active. However, upon success, a kernel pointer is stored in
kmsg->msg_control. kmsg->msg_control_is_user should therefore be
updated accordingly.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 11:09:27 +01:00
Kevin Brodsky c39ef21304 net: Ensure ->msg_control_user is used for user buffers
Since commit 1f466e1f15 ("net: cleanly handle kernel vs user
buffers for ->msg_control"), pointers to user buffers should be
stored in struct msghdr::msg_control_user, instead of the
msg_control field.  Most users of msg_control have already been
converted (where user buffers are involved), but not all of them.

This patch attempts to address the remaining cases. An exception is
made for null checks, as it should be safe to use msg_control
unconditionally for that purpose.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 11:09:27 +01:00
Arseniy Krasnov eaaa4e9239 vsock/loopback: don't disable irqs for queue access
This replaces 'skb_queue_tail()' with 'virtio_vsock_skb_queue_tail()'.
The first one uses 'spin_lock_irqsave()', second uses 'spin_lock_bh()'.
There is no need to disable interrupts in the loopback transport as
there is no access to the queue with skbs from interrupt context. Both
virtio and vhost transports work in the same way.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 11:04:04 +01:00
Gwangun Jung 3037933448 net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
If the TCA_QFQ_LMAX value is not offered through nlattr, lmax is determined by the MTU value of the network device.
The MTU of the loopback device can be set up to 2^31-1.
As a result, it is possible to have an lmax value that exceeds QFQ_MIN_LMAX.

Due to the invalid lmax value, an index is generated that exceeds the QFQ_MAX_INDEX(=24) value, causing out-of-bounds read/write errors.

The following reports a oob access:

[   84.582666] BUG: KASAN: slab-out-of-bounds in qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
[   84.583267] Read of size 4 at addr ffff88810f676948 by task ping/301
[   84.583686]
[   84.583797] CPU: 3 PID: 301 Comm: ping Not tainted 6.3.0-rc5 #1
[   84.584164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   84.584644] Call Trace:
[   84.584787]  <TASK>
[   84.584906] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
[   84.585108] print_report (mm/kasan/report.c:320 mm/kasan/report.c:430)
[   84.585570] kasan_report (mm/kasan/report.c:538)
[   84.585988] qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
[   84.586599] qfq_enqueue (net/sched/sch_qfq.c:1255)
[   84.587607] dev_qdisc_enqueue (net/core/dev.c:3776)
[   84.587749] __dev_queue_xmit (./include/net/sch_generic.h:186 net/core/dev.c:3865 net/core/dev.c:4212)
[   84.588763] ip_finish_output2 (./include/net/neighbour.h:546 net/ipv4/ip_output.c:228)
[   84.589460] ip_output (net/ipv4/ip_output.c:430)
[   84.590132] ip_push_pending_frames (./include/net/dst.h:444 net/ipv4/ip_output.c:126 net/ipv4/ip_output.c:1586 net/ipv4/ip_output.c:1606)
[   84.590285] raw_sendmsg (net/ipv4/raw.c:649)
[   84.591960] sock_sendmsg (net/socket.c:724 net/socket.c:747)
[   84.592084] __sys_sendto (net/socket.c:2142)
[   84.593306] __x64_sys_sendto (net/socket.c:2150)
[   84.593779] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[   84.593902] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[   84.594070] RIP: 0033:0x7fe568032066
[   84.594192] Code: 0e 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c09[ 84.594796] RSP: 002b:00007ffce388b4e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c

Code starting with the faulting instruction
===========================================
[   84.595047] RAX: ffffffffffffffda RBX: 00007ffce388cc70 RCX: 00007fe568032066
[   84.595281] RDX: 0000000000000040 RSI: 00005605fdad6d10 RDI: 0000000000000003
[   84.595515] RBP: 00005605fdad6d10 R08: 00007ffce388eeec R09: 0000000000000010
[   84.595749] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
[   84.595984] R13: 00007ffce388cc30 R14: 00007ffce388b4f0 R15: 0000001d00000001
[   84.596218]  </TASK>
[   84.596295]
[   84.596351] Allocated by task 291:
[   84.596467] kasan_save_stack (mm/kasan/common.c:46)
[   84.596597] kasan_set_track (mm/kasan/common.c:52)
[   84.596725] __kasan_kmalloc (mm/kasan/common.c:384)
[   84.596852] __kmalloc_node (./include/linux/kasan.h:196 mm/slab_common.c:967 mm/slab_common.c:974)
[   84.596979] qdisc_alloc (./include/linux/slab.h:610 ./include/linux/slab.h:731 net/sched/sch_generic.c:938)
[   84.597100] qdisc_create (net/sched/sch_api.c:1244)
[   84.597222] tc_modify_qdisc (net/sched/sch_api.c:1680)
[   84.597357] rtnetlink_rcv_msg (net/core/rtnetlink.c:6174)
[   84.597495] netlink_rcv_skb (net/netlink/af_netlink.c:2574)
[   84.597627] netlink_unicast (net/netlink/af_netlink.c:1340 net/netlink/af_netlink.c:1365)
[   84.597759] netlink_sendmsg (net/netlink/af_netlink.c:1942)
[   84.597891] sock_sendmsg (net/socket.c:724 net/socket.c:747)
[   84.598016] ____sys_sendmsg (net/socket.c:2501)
[   84.598147] ___sys_sendmsg (net/socket.c:2557)
[   84.598275] __sys_sendmsg (./include/linux/file.h:31 net/socket.c:2586)
[   84.598399] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[   84.598520] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[   84.598688]
[   84.598744] The buggy address belongs to the object at ffff88810f674000
[   84.598744]  which belongs to the cache kmalloc-8k of size 8192
[   84.599135] The buggy address is located 2664 bytes to the right of
[   84.599135]  allocated 7904-byte region [ffff88810f674000, ffff88810f675ee0)
[   84.599544]
[   84.599598] The buggy address belongs to the physical page:
[   84.599777] page:00000000e638567f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10f670
[   84.600074] head:00000000e638567f order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   84.600330] flags: 0x200000000010200(slab|head|node=0|zone=2)
[   84.600517] raw: 0200000000010200 ffff888100043180 dead000000000122 0000000000000000
[   84.600764] raw: 0000000000000000 0000000080020002 00000001ffffffff 0000000000000000
[   84.601009] page dumped because: kasan: bad access detected
[   84.601187]
[   84.601241] Memory state around the buggy address:
[   84.601396]  ffff88810f676800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.601620]  ffff88810f676880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.601845] >ffff88810f676900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.602069]                                               ^
[   84.602243]  ffff88810f676980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.602468]  ffff88810f676a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.602693] ==================================================================
[   84.602924] Disabling lock debugging due to kernel taint

Fixes: 3015f3d2a3 ("pkt_sched: enable QFQ to support TSO/GSO")
Reported-by: Gwangun Jung <exsociety@gmail.com>
Signed-off-by: Gwangun Jung <exsociety@gmail.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 10:59:26 +01:00
David S. Miller c61fcc090f Merge branch 'mana-jumbo-frames'
Haiyang Zhang says:

====================
net: mana: Add support for jumbo frame

The set adds support for jumbo frame,
with some optimization for the RX path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:20 +01:00
Haiyang Zhang 80f6215b45 net: mana: Add support for jumbo frame
During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when MTU is set to a too large value, XDP is not allowed
to run.
Also, to prevent changing MTU fails, and leaves the NIC in a bad state,
pre-allocate all buffers before starting the change. So in low memory
condition, it will return error, without affecting the NIC.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Haiyang Zhang 2fbbd712ba net: mana: Enable RX path to handle various MTU sizes
Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Haiyang Zhang a2917b2349 net: mana: Refactor RX buffer allocation code to prepare for various MTU
Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Haiyang Zhang ce518bc3e9 net: mana: Use napi_build_skb in RX path
Use napi_build_skb() instead of build_skb() to take advantage of the
NAPI percpu caches to obtain skbuff_head.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Kai Vehmanen 6ab6f98fcd ALSA: hda/hdmi: disable KAE for Intel DG2
Use of keep-alive (KAE) has resulted in loss of audio on some A750/770
cards as the transition from keep-alive to stream playback is not
working as expected. As there is limited benefit of the new KAE mode
on discrete cards, revert back to older silent-stream implementation
on these systems.

Cc: stable@vger.kernel.org
Fixes: 15175a4f2b ("ALSA: hda/hdmi: add keep-alive support for ADL-P and DG2")
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/8307
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Link: https://lore.kernel.org/r/20230413191153.3692049-1-kai.vehmanen@linux.intel.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-04-14 07:50:52 +02:00
Jakub Kicinski e473ea818b mlx5-updates-2023-04-11
1) Vlad adds the support for linux bridge multicast offload support
    Patches #1 through #9
    Synopsis
 
 Vlad Says:
 ==============
 Implement support of bridge multicast offload in mlx5. Handle port object
 attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
 offload and bridge snooping support on bridge. Handle port object
 SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
 
 Steering architecture
 
 Existing offload infrastructure relies on two levels of flow tables - bridge
 ingress and egress. For multicast offload the architecture is extended with
 additional layer of per-port multicast replication tables. Such tables filter
 loopback traffic (so packets are not replicated to their source port) and pop
 VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
 egress table. MDB egress rule can point to multiple per-port multicast tables,
 which causes matching multicast traffic to be replicated to all of them, and,
 consecutively, to several bridge ports:
 
                                                                                                                             +--------+--+
                                                                                     +---------------------------------------> Port 1 |  |
                                                                                     |                                       +-^------+--+
                                                                                     |                                         |
                                                                                     |                                         |
                                        +-----------------------------------------+  |     +---------------------------+       |
                                        | EGRESS table                            |  |  +--> PORT 1 multicast table    |       |
 +----------------------------------+   +-----------------------------------------+  |  |  +---------------------------+       |
 | INGRESS table                    |   |                                         |  |  |  |                           |       |
 +----------------------------------+   | dst_mac=P1,vlan=X -> pop vlan, goto P1  +--+  |  | FG0:                      |       |
 |                                  |   | dst_mac=P1,vlan=Y -> pop vlan, goto P1  |     |  | src_port=dst_port -> drop |       |
 | src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2  +--+  |  | FG1:                      |       |
 | ...                              |   | dst_mac=P2,vlan=Y -> goto P2            |  |  |  | VLAN X -> pop, goto port  |       |
 |                                  |   | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+  | ...                       |       |
 +----------------------------------+   |                                         |  |  |  | VLAN Y -> pop, goto port  +-------+
                                        +-----------------------------------------+  |  |  | FG3:                      |
                                                                                     |  |  | matchall -> goto port     |
                                                                                     |  |  |                           |
                                                                                     |  |  +---------------------------+
                                                                                     |  |
                                                                                     |  |
                                                                                     |  |                                    +--------+--+
                                                                                     +---------------------------------------> Port 2 |  |
                                                                                        |                                    +-^------+--+
                                                                                        |                                      |
                                                                                        |                                      |
                                                                                        |  +---------------------------+       |
                                                                                        +--> PORT 2 multicast table    |       |
                                                                                           +---------------------------+       |
                                                                                           |                           |       |
                                                                                           | FG0:                      |       |
                                                                                           | src_port=dst_port -> drop |       |
                                                                                           | FG1:                      |       |
                                                                                           | VLAN X -> pop, goto port  |       |
                                                                                           | ...                       |       |
                                                                                           |                           |       |
                                                                                           | FG3:                      |       |
                                                                                           | matchall -> goto port     +-------+
                                                                                           |                           |
                                                                                           +---------------------------+
 
 Patches overview:
 
 - Patch 1 adds hardware definition bits for capabilities required to replicate
   multicast packets to multiple per-port tables. These bits are used by
   following patches to only attempt multicast offload if firmware and hardware
   provide necessary support.
 
 - Pathces 2-4 patches are preparations and refactoring.
 
 - Patch 5 implements necessary infrastructure to toggle multicast offload
   via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
   This also enabled IGMP and MLD snooping.
 
 - Patch 6 implements per-port multicast replication tables. It only supports
   filtering of loopback packets.
 
 - Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
   VLANs.
 
 - Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
   creates MDB replication rules in egress table that can replicate packets to
   multiple per-port multicast tables.
 
 - Patch 9 adds tracepoints for MDB events.
 
 ==============
 
 2) Parav Create a new allocation profile for SFs, to save on memory
 
 3) Yevgeny provides some initial patches for upcoming software steering
    support new pattern/arguments type of modify_header actions.
 
 Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
 The current modify_header object allows for having only limited number of
 these FW objects, which means that we are limited in the number of offloaded
 flows that require modify_header action.
 
 As a preparation Yevgeny provides the following 4 patches:
  - Patch 1: Add required mlx5_ifc HW bits
  - Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
    support and adds appropriate support in dr_send.c
  - Patch 4: Add ICM pool for modify-header-pattern objects and implement
    patterns cache, allowing patterns reuse for different flows
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmQ2LDIACgkQSD+KveBX
 +j59UAf+NfLEq7i/gYOri1rLwXOrgPd13w64Oob79+pKVXzRqik/gBg4d0CKdc2X
 Qafntudus7IvVrGJCM0vurw7nC75H9BMFh3dQW2u3uaR9m6i0k4tSKIAoJUFHMdj
 dMgAJywYkCCXlBbGVhRE3yybal2EXfOSJ+Fr2tGeqpL4vRlg4kIaVuFGcGk2gMpq
 gOahX6wtuop3ABzY3sqKCIQ1Q2jWx9AWWz+lEmw7HJGmHGugscYrgSIloHDnu1fz
 Bt6uODn7GFr866rQr9+JemG0VyK8ahyisEFnX1oJ3642ZidPcDrbIrrtMWeEqo7X
 1tcmY/wNti87xFbg3gL+DZekhHTHoQ==
 =oIN8
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-04-11

1) Vlad adds the support for linux bridge multicast offload support
   Patches #1 through #9
   Synopsis

Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.

Steering architecture

Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:

                                                                                                                            +--------+--+
                                                                                    +---------------------------------------> Port 1 |  |
                                                                                    |                                       +-^------+--+
                                                                                    |                                         |
                                                                                    |                                         |
                                       +-----------------------------------------+  |     +---------------------------+       |
                                       | EGRESS table                            |  |  +--> PORT 1 multicast table    |       |
+----------------------------------+   +-----------------------------------------+  |  |  +---------------------------+       |
| INGRESS table                    |   |                                         |  |  |  |                           |       |
+----------------------------------+   | dst_mac=P1,vlan=X -> pop vlan, goto P1  +--+  |  | FG0:                      |       |
|                                  |   | dst_mac=P1,vlan=Y -> pop vlan, goto P1  |     |  | src_port=dst_port -> drop |       |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2  +--+  |  | FG1:                      |       |
| ...                              |   | dst_mac=P2,vlan=Y -> goto P2            |  |  |  | VLAN X -> pop, goto port  |       |
|                                  |   | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+  | ...                       |       |
+----------------------------------+   |                                         |  |  |  | VLAN Y -> pop, goto port  +-------+
                                       +-----------------------------------------+  |  |  | FG3:                      |
                                                                                    |  |  | matchall -> goto port     |
                                                                                    |  |  |                           |
                                                                                    |  |  +---------------------------+
                                                                                    |  |
                                                                                    |  |
                                                                                    |  |                                    +--------+--+
                                                                                    +---------------------------------------> Port 2 |  |
                                                                                       |                                    +-^------+--+
                                                                                       |                                      |
                                                                                       |                                      |
                                                                                       |  +---------------------------+       |
                                                                                       +--> PORT 2 multicast table    |       |
                                                                                          +---------------------------+       |
                                                                                          |                           |       |
                                                                                          | FG0:                      |       |
                                                                                          | src_port=dst_port -> drop |       |
                                                                                          | FG1:                      |       |
                                                                                          | VLAN X -> pop, goto port  |       |
                                                                                          | ...                       |       |
                                                                                          |                           |       |
                                                                                          | FG3:                      |       |
                                                                                          | matchall -> goto port     +-------+
                                                                                          |                           |
                                                                                          +---------------------------+

Patches overview:

- Patch 1 adds hardware definition bits for capabilities required to replicate
  multicast packets to multiple per-port tables. These bits are used by
  following patches to only attempt multicast offload if firmware and hardware
  provide necessary support.

- Pathces 2-4 patches are preparations and refactoring.

- Patch 5 implements necessary infrastructure to toggle multicast offload
  via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
  This also enabled IGMP and MLD snooping.

- Patch 6 implements per-port multicast replication tables. It only supports
  filtering of loopback packets.

- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
  VLANs.

- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
  creates MDB replication rules in egress table that can replicate packets to
  multiple per-port multicast tables.

- Patch 9 adds tracepoints for MDB events.

==============

2) Parav Create a new allocation profile for SFs, to save on memory

3) Yevgeny provides some initial patches for upcoming software steering
   support new pattern/arguments type of modify_header actions.

Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.

As a preparation Yevgeny provides the following 4 patches:
 - Patch 1: Add required mlx5_ifc HW bits
 - Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
   support and adds appropriate support in dr_send.c
 - Patch 4: Add ICM pool for modify-header-pattern objects and implement
   patterns cache, allowing patterns reuse for different flows

* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Add modify-header-pattern ICM pool
  net/mlx5: DR, Prepare sending new WQE type
  net/mlx5: Add new WQE for updating flow table
  net/mlx5: Add mlx5_ifc bits for modify header argument
  net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
  net/mlx5: Create a new profile for SFs
  net/mlx5: Bridge, add tracepoints for multicast
  net/mlx5: Bridge, implement mdb offload
  net/mlx5: Bridge, support multicast VLAN pop
  net/mlx5: Bridge, add per-port multicast replication tables
  net/mlx5: Bridge, snoop igmp/mld packets
  net/mlx5: Bridge, extract code to lookup parent bridge of port
  net/mlx5: Bridge, move additional data structures to priv header
  net/mlx5: Bridge, increase bridge tables sizes
  net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================

Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:28:03 -07:00
Jakub Kicinski f7d29571ab Merge branch 'add-kernel-tc-mqprio-and-tc-taprio-support-for-preemptible-traffic-classes'
Vladimir Oltean says:

====================
Add kernel tc-mqprio and tc-taprio support for preemptible traffic classes

The last RFC in August 2022 contained a proposal for the UAPI of both
TSN standards which together form Frame Preemption (802.1Q and 802.3):
https://lore.kernel.org/netdev/20220816222920.1952936-1-vladimir.oltean@nxp.com/

It wasn't clear at the time whether the 802.1Q portion of Frame Preemption
should be exposed via the tc qdisc (mqprio, taprio) or via some other
layer (perhaps also ethtool like the 802.3 portion, or dcbnl), even
though the options were discussed extensively, with pros and cons:
https://lore.kernel.org/netdev/20220816222920.1952936-3-vladimir.oltean@nxp.com/

So the 802.3 portion got submitted separately and finally was accepted:
https://lore.kernel.org/netdev/20230119122705.73054-1-vladimir.oltean@nxp.com/

leaving the only remaining question: how do we expose the 802.1Q bits?

This series proposes that we use the Qdisc layer, through separate
(albeit very similar) UAPI in mqprio and taprio, and that both these
Qdiscs pass the information down to the offloading device driver through
the common mqprio offload structure (which taprio also passes).

An implementation is provided for the NXP LS1028A on-board Ethernet
endpoint (enetc). Previous versions also contained support for its
embedded switch (felix), but this needs more work and will be submitted
separately.

v4: https://lore.kernel.org/netdev/20230403103440.2895683-1-vladimir.oltean@nxp.com/
v2: https://lore.kernel.org/netdev/20230219135309.594188-1-vladimir.oltean@nxp.com/
v1: https://lore.kernel.org/netdev/20230216232126.3402975-1-vladimir.oltean@nxp.com/
====================

Link: https://lore.kernel.org/r/20230411180157.1850527-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:12 -07:00
Vladimir Oltean 01e23b2b3b net: enetc: add support for preemptible traffic classes
PFs which support the MAC Merge layer also have a set of 8 registers
called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)".
Through these, a traffic class (group of TX rings of same dequeue
priority) can be mapped to the eMAC or to the pMAC.

There's nothing particularly spectacular here. We should probably only
commit the preemptible TCs to hardware once the MAC Merge layer became
active, but unlike Felix, we don't have an IRQ that notifies us of that.
We'd have to sleep for up to verifyTime (127 ms) to wait for a
resolution coming from the verification state machine; not only from the
ndo_setup_tc() code path, but also from enetc_mm_link_state_update().
Since it's relatively complicated and has a relatively small benefit,
I'm not doing it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Vladimir Oltean 50764da37c net: enetc: rename "mqprio" to "qopt"
To gain access to the larger encapsulating structure which has the type
tc_mqprio_qopt_offload, rename just the "qopt" field as "qopt".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Vladimir Oltean a721c3e54b net/sched: taprio: allow per-TC user input of FP adminStatus
This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.

I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.

Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.

Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.

A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Vladimir Oltean f62af20bed net/sched: mqprio: allow per-TC user input of FP adminStatus
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).

The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.

I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).

The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.

Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00