Jakub Kicinski says:
====================
page_pool: allow caching from safely localized NAPI
I went back to the explicit "are we in NAPI method", mostly
because I don't like having both around :( (even tho I maintain
that in_softirq() && !in_hardirq() is as safe, as softirqs do
not nest).
Still returning the skbs to a CPU, tho, not to the NAPI instance.
I reckon we could create a small refcounted struct per NAPI instance
which would allow sockets and other users so hold a persisent
and safe reference. But that's a bigger change, and I get 90+%
recycling thru the cache with just these patches (for RR and
streaming tests with 100% CPU use it's almost 100%).
Some numbers for streaming test with 100% CPU use (from previous version,
but really they perform the same):
HW-GRO page=page
before after before after
recycle:
cached: 0 138669686 0 150197505
cache_full: 0 223391 0 74582
ring: 138551933 9997191 149299454 0
ring_full: 0 488 3154 127590
released_refcnt: 0 0 0 0
alloc:
fast: 136491361 148615710 146969587 150322859
slow: 1772 1799 144 105
slow_high_order: 0 0 0 0
empty: 1772 1799 144 105
refill: 2165245 156302 2332880 2128
waive: 0 0 0 0
v1: https://lore.kernel.org/all/20230411201800.596103-1-kuba@kernel.org/
rfcv2: https://lore.kernel.org/all/20230405232100.103392-1-kuba@kernel.org/
====================
Link: https://lore.kernel.org/r/20230413042605.895677-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bnxt has 1:1 mapping of page pools and NAPIs, so it's safe
to hoook them up together.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.
Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).
If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.
Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).
The CPU use of relevant functions decreases as expected:
page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%
Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.
The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.
Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Check if patterns and arguments for modify header action
are supported and enable them accordingly.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Support the pattern/args-based MODIFY_HDR and TNL_L3_TO_L2 actions in dbg dump
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Set modify header action of size 1 directly on the STE for supporting
devices, thus reducing number of hops and cache misses.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Use the new accelerated action for decap L3 on RX side:
use the mechanism of pattern and argument same as in
modify-header action.
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If there is support for pattern/args, use the new accelerated modify
header action for modify header and decap L3 actions.
Otherwise fall back to the old modify-header implementation.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
While building the actions, add the pointer of the arguments for
accelerated modify list action into the action's attributes.
This will be used later on while building the specific STE
for this action.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Added new mechanism for handling arguments for modify-header action.
The new action "accelerated modify-header" asks for the arguments from
separated area from the pattern, this area accessed via general objects.
Handling of these object is done via the pool-manager struct.
When the new header patterns are supported, while loading the domain,
a few pools for argument creations will be created. The requests for
allocating/deallocating arg objects are done via the pool manager API.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When allocating a QP we allocate an RQ and an SQ, the RQ is stored first
in memory and followed by the SQ.
This allocation is not physically continiuos - it may span across different
physical pages. SW Steering code always writes in pairs: 1BB write + 1BB read,
or 2 continuous BBs of GTA WQE.
This lead to an issue where RQ allocation was 4x16 which is equal to 1 WQE BB,
causing 1 BB offset in the page and splitting the GTA WQE between different
physical pages.
The solution was to create the RQ with a even number of BBs and to have the
RQ aligned to a page.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Instead of using the write buffer for reading we will use a dedicated
buffer only for reading ICM memory.
Due to the new support for args, we can have a case with pending_wc
being odd number, and with reading into the same write buffer, it is
possible to overwrite next write on the same slot.
For example:
pending_wc is 17 so the buffer for write is:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
and we have requests as follows:
r wr wr wr wr wr wr wr wr
Now, the first read will be written into the last write because we use
the same buffer for read and write, before it was written to the HW and
we will have a wrong data in the ICM area.
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The accelerated modify header arguments are written in the HW area
with special WQE and specific data format.
New function was added to support writing of new argument type.
Note that GTA WQE is larger than READ and WRITE, so the queue
management logic was updated to support this.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add functions for creation/destruction of the new type of general object.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This way we are able to allocate chunk for modify_headers from 2 types:
STEv0 that is allocated from the action area, and STEv1 that is allocating
the chunks from the special area for patterns.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Starting with ConnectX-6 Dx, we use new design of modify_header FW object.
The current modify_header object allows for having only limited number
of FW objects, so the new design of pattern and argument allows pattern
reuse, saving memory, and having a large number of modify_header objects.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move ACTION_CACHE_LINE_SIZE macro to header to be used by
the pattern functions as well.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* A fix for a missing fence when generating the NOMMU sigreturn
trampoline.
* A set of fixes for early DTB handling of reserved memory nodes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmQ5ZsoTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiUhED/9IsUdJfMdQHzrL7XZL7U9FgHhSCVl4
HgpcpngG6Op+F8lGnVhQV4ivoVYrk85lR2JqJeES9FpChSmZXCptB6K9NFCDLdW+
9+Qnqa6NnmcJ9Aba7Ckr7zwApohtDGucegIKx3bqafsnBqh2SM/l7ljCWEuaxIbZ
5qFEbWGcILVGQrXJheLgkx7uGbBmEvrmLi9T5jZ1Vwltg1QS2STDjOCSR5cfBpx7
kqoWfSa1+fb5RF428KRvDTuIjUzkkKvUEo/yEzJMMp1s3vI2mKT8ytqQHw4GrQ9w
6X/wvabjDtz8iGRWsh4VVg7iJ+xImzpqRO4UXkd4wArCDdCvjSu4BpAs3cOqKVTl
4w+dTbQPjL22PYCdhBVmJH8K4TmJGzSkOPK8wcIlK1mQaRCAjAalSvCLQIN/+ttq
4AwkXLpfwLvJLhn78Y/7WS6dbedYjMbNxdz2Sy6yFm3qZj1dqfgS5yIuHpLVo7UW
AK7nS3FGJRyii8w/lywxDgLTdMeJiZ8/Xy0gJkWps5sGolBIPqSwUcLDhFS68HoM
6x+IzOXDlHnw7i/sj6Q10IRvjFVLieNDYniD7CnOYDKFSgUOvehCMlky7l4naRdO
vfVHsHPX38z5ZTeM/vZ1HKobKeTmi3Hy/C1MWUwMgj2nPkui777c9mAQLHY7ppeZ
dr4Cn/4GNfZz3w==
=1Tw7
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for a missing fence when generating the NOMMU sigreturn
trampoline
- A set of fixes for early DTB handling of reserved memory nodes
* tag 'riscv-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: No need to relocate the dtb as it lies in the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: Move early dtb mapping into the fixmap region
riscv: add icache flush for nommu sigreturn trampoline
- Add a quirk to force StorageD3Enable on AMD Picasso systems (Mario
Limonciello).
- Add an ACPI IRQ override quirk for ASUS ExpertBook B1502CBA (Paul
Menzel).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQ5V/4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx3Z8P/3LgmzvoNt+Xsfx3u3//vyu02XGHd48N
xA9SQgNJeqwPQxpjnZnycuX5RPex4NGjzEQaIuXkxRVf3UnMzxFir4AYT22qJfea
7m1S59ym/LBCOIhGWuQ4de9vhmebR60nSMrR5geaQVwWedTUfMphIjh10volcnae
IwWuaABCtL9vNoPdzy1syhg1nWFZnMIvDrGDdexBQ+68BQznAnEvL0ZlEUv4gls3
Dh2I4ZYe+95Me8qAaNthG3U+BfLDR+/7Sqo3H0urOoTDcLxq/n2Wu6YfeC1Y6dp2
LoBPWGVQpmuITZjTrr5BDTbzjePEA7l0vRdKkhKqmkQ6cfgelwL6LxK2ShIU8Ulx
5OFxlR/i3kZmdpc3n32WdCaVScNEjZ1NEhoGkqAOBXyIplDm83K+uscy6M5Gv5fM
lCgwzWu5Vn+NLDzZNx2fIFHOs5sLXb8R9cnycazVUas/NMC96VYRqViK4jcooFHz
+y/Z4Sl23gx9i8C85UYN/02jPcjjALWH1PlOyVXWx/0wOvukYLQ1z943+ZEZyfkb
vtPwv/xFuLNiZIrnDBBMoi0eri5amdqchYC/JKzDlVs6oPtR2F7hM7nqaajJ1JOz
WLQy4XH5lAueUeMneAY2Th7Ogc3lMHsAPOChLwXAvPRJyxPsBaiS0PUeTVcGdiQ/
vYUg1budlxa+
=SG+Z
-----END PGP SIGNATURE-----
Merge tag 'acpi-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These add two ACPI-related quirks:
- Add a quirk to force StorageD3Enable on AMD Picasso systems (Mario
Limonciello)
- Add an ACPI IRQ override quirk for ASUS ExpertBook B1502CBA (Paul
Menzel)"
* tag 'acpi-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: resource: Skip IRQ override on ASUS ExpertBook B1502CBA
ACPI: x86: utils: Add Picasso to the list for forcing StorageD3Enable
Make the amd-pstate cpufreq driver take all of the possible combinations
of the "old" and "new" status values correctly while changing the
operation mode via sysfs (Wyes Karny).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQ5WF8SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxGt0P/0NGMuOoDVS9zNhROphsUhBWw/XuVYQQ
xrtAkouF9zPDRR7R3p/tbqHNeaSOGapmGbFK6Xm4Z+Cko09jlHpIssfrkCK9W+mh
BI8CC6jdFavstzbifzLwVg9dkTeD2ANgCwkSM4CLxzyzktV3C4WOcrBombzpK6CI
rTL039jabNaelSzB+dp2asX8GVQD+H57BXTPb33h6BLfH2tvnreVEfTOTlOvrCKD
eClLvDw8QArp74qh4U0VwtJ1EIZsYqv84AxSfmKQi9dOrfwIK81D9dHhJXpKfj4V
4DSOuRRxG4wbxF816gXm1mbFiFs2+aK3SkrPXLP4LyphqMfDRP3ID/HojcrUBBtp
YCiAQ1JHa9DLWuqOO8/pqUgIetwLuZYD/10JBgXX+ULmkC4Wuqd1Lz1GspdP9/Xy
YJrMcwTAj5KCZw2ehdaeZMBa7Ox2flkeS1QdMY8uBposX2w/F8IhWTC4Xe5tbmk2
s1hYZL7yHAZWTyNtcTY7+gPte4qRzQJ/dhFgvbwclaL6z4cgLitkRMf1Rv+vceV+
pKpqzZWRUBqVCNSl8zZdYfyYy2qfAqMszvbDH5fa+26wqQNsKjAGrfQy4qOZGBje
BQKsVA1MlAAEckmRH5wOoO0XieT9pn7hJjuTcocg4R4mdF+IfgmUsHgDiGnxG6X+
BkdRd1ec5NX+
=gU5P
-----END PGP SIGNATURE-----
Merge tag 'pm-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Make the amd-pstate cpufreq driver take all of the possible
combinations of the 'old' and 'new' status values correctly while
changing the operation mode via sysfs (Wyes Karny)"
* tag 'pm-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
amd-pstate: Fix amd_pstate mode switch
Modify the Intel thermal throttling code to avoid updating unsupported
status clearing mask bits which causes the kernel to complain about
unchecked MSR access (Srinivas Pandruvada).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQ5WNoSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxN3sP/jxTTvyatqFDmrmjuf4rysrapgGhh9IF
l1iTixZiHWN3KsbpGoXGvK6QmbyAB0S61FMzucnKktrdzA5tSLSms+UqQcTlqucP
pjNzZikojs4pEML9GvlChC2V2aKhALNrzEfIuS24Pw/TEDTH2Mbm0Wrz+yRg5PPt
pQhdBwe9Y/NTSVRt3JlZJEzGcWajRsx5YZMyud2zGUqXtNJnpSw5y3Klrbdd3urs
6501j/jLUcLRmLbnmf8oZGs0tCZK/FVhfgVpBemhJgMxeqflMnXBwYw43d0GLJNA
bg7kq6eSCr7IP743AAuJaory9WCqAoa+6Km1BGS5TpTOdVtsJ5UDr7ARaCB7E+jg
o0lYIs3O1q4WqFBV85R0JtRDopgdkFYxWSjNNa5lyAVwVbVA7Xi7jt48LUaSRO23
Ag76U7VWMXNKSpznPxhTboywE5MH4Gu+VeoVeitXNTG7hM7plB1UoFZ5br3+dPfT
qyxHvMkG2n+V9a7JXxaGg2/LvG97QX9LPtCPMTyVAMgQWtg2JN3NblXzGjunZnVK
kXScxvRYKKH2xz1ArdTXJj0z9CsnyHRebKXk+PcbFY0kKtQupKpNXSJsMv5zC4kP
/uKVxxdAroLfoMkD8BFjhiDV7GHrhXpSBJz72dOIB0eG4jEV+82XCohdNIj9xzEV
LtJcGOutktSu
=8D2v
-----END PGP SIGNATURE-----
Merge tag 'thermal-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"Modify the Intel thermal throttling code to avoid updating unsupported
status clearing mask bits which causes the kernel to complain about
unchecked MSR access (Srinivas Pandruvada)"
* tag 'thermal-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: Avoid updating unsupported THERM_STATUS_CLEAR mask bits
A collection of small fixes. At this time, quite a few fixes for
the old PCI drivers are found. Although they are no regression
fixes, I took these as they are materials for stable kernels.
In addition, a couple of regression fixes and another couple of
HD-audio quirks are included.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmQ5RlAOHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE+lzxAAzfsapwLW4pN5gDApDiAFxu9Lu9VQvk3e4k7W
khQFb7o+2jdjyCTiGEQCfFbPu/Ru4KlrYXkCkHEXRR0/95NiRq9GQx6xJtU+S7/m
WBTWPhIh2Hic90yYUTD62Pb7j40P8C+wsATwpQftQIdGixBdv7nirbbzBPi5Xfcs
+4iu8TBEh9izFIAnXADl/O+WA3E4r6TOvDeuD2FZLQWPcJLeHF9h79BU0k7UmUYR
ZgLg+0GJrwJR9A1SzGs5kc47Q0zmP62gRyExBSnskHFjCglbTCo0VhVpBoBi+o0y
epHyOJrvs/AdpMz4VFvn4WP5IVtyxa0diUWmd/eRczzzxJThLpMQYx2+ukYkUJMc
+fua+NraWqXwC+s7+C/N8MhFXbSrRm+4fzOPmBdE9dV/hnAQIOCWM6f9PhciUcLm
kZfcCCwZXNR0TVmtlZxKq5s4xyoYVVkEctQU3QO8TeHXKmV8EYQO+zzYQjch2izD
H6gJyT8/QcKhRQSkthLEiltnkuFY+nq+jDdlCRH2/9v5VjvY0XZlBpxuP0vPIYX4
iChCuCu5qkforCijetDkdArWdO4+WiFums4t+h1hekvhnN9IrrXUxU+dn+Hu63jJ
oVtlxW1AVzEcYynUowI3jSo4z/5jxBF8a10XF4+ysCbUoTQ6pYTMVl4wYDEEEYJB
2RUdHPQ=
=Gmhh
-----END PGP SIGNATURE-----
Merge tag 'sound-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes.
At this time, quite a few fixes for the old PCI drivers are found.
Although they are not regression fixes, I took these as they are
materials for stable kernels.
In addition, a couple of regression fixes and another couple of
HD-audio quirks are included"
* tag 'sound-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/hdmi: disable KAE for Intel DG2
ALSA: hda/realtek: Add quirks for Lenovo Z13/Z16 Gen2
ALSA: hda: patch_realtek: add quirk for Asus N7601ZM
ALSA: firewire-tascam: add missing unwind goto in snd_tscm_stream_start_duplex()
ALSA: emu10k1: don't create old pass-through playback device on Audigy
ALSA: emu10k1: fix capture interrupt handler unlinking
ALSA: hda/sigmatel: fix S/PDIF out on Intel D*45* motherboards
ALSA: hda/sigmatel: add pin overrides for Intel DP45SG motherboard
ALSA: i2c/cs8427: fix iec958 mixer control deactivation
Driver bug fixes:
- irdma should not generate extra completions during flushing
- Fix several memory leaks
- Do not get confused in irdma's iwarp mode if IPv6 is present
- Correct a link speed calculation in mlx5
- Increase the EQ/WQ limits on erdma as they are too small for big
applications
- Use the right math for erdma's inline mtt feature
- Make erdma probing more robust to boot time ordering differences
- Fix a KMSAN crash in CMA due to uninitialized qkey
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZDlBPgAKCRCFwuHvBreF
YfOtAQDAX3rEL4T6xu4jIHAhInTYZCYVI7pJALTzHh62DmdoZAD+Owy2vTRTmkvJ
OLfA++yDRWdrzeSeCWbTYpwEo+00TAA=
=9XAm
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"We had a fairly slow cycle on the rc side this time, here are the
accumulated fixes, mostly in drivers:
- irdma should not generate extra completions during flushing
- Fix several memory leaks
- Do not get confused in irdma's iwarp mode if IPv6 is present
- Correct a link speed calculation in mlx5
- Increase the EQ/WQ limits on erdma as they are too small for big
applications
- Use the right math for erdma's inline mtt feature
- Make erdma probing more robust to boot time ordering differences
- Fix a KMSAN crash in CMA due to uninitialized qkey"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/core: Fix GID entry ref leak when create_ah fails
RDMA/cma: Allow UD qp_type to join multicast only
RDMA/erdma: Defer probing if netdevice can not be found
RDMA/erdma: Inline mtt entries into WQE if supported
RDMA/erdma: Update default EQ depth to 4096 and max_send_wr to 8192
RDMA/erdma: Fix some typos
IB/mlx5: Add support for 400G_8X lane speed
RDMA/irdma: Add ipv4 check to irdma_find_listener()
RDMA/irdma: Increase iWARP CM default rexmit count
RDMA/irdma: Fix memory leak of PBLE objects
RDMA/irdma: Do not generate SW completions for NOPs
Thomas Richter reported a crash in linux-next with a backtrace similar
to the following one:
[<0000000000000000>] 0x0
([<000000000031a182>] bpf_trace_run4+0xc2/0x218)
[<00000000001d59f4>] __bpf_trace_sched_switch+0x1c/0x28
[<0000000000c44a3a>] __schedule+0x43a/0x890
[<0000000000c44ef8>] schedule+0x68/0x110
[<0000000000c4e5ca>] do_nanosleep+0xa2/0x168
[<000000000026e7fe>] hrtimer_nanosleep+0xf6/0x1c0
[<000000000026eb6e>] __s390x_sys_nanosleep+0xb6/0xf0
[<0000000000c3b81c>] __do_syscall+0x1e4/0x208
[<0000000000c50510>] system_call+0x70/0x98
Last Breaking-Event-Address:
[<000003ff7fda1814>] bpf_prog_65e887c70a835bbf_on_switch+0x1a4/0x1f0
The problem is that bpf_arch_text_poke() with new_addr == NULL is
susceptible to the following race condition:
T1 T2
----------------- -------------------
plt.target = NULL
entry: brcl 0xf,plt
entry.mask = 0
lgrl %r1,plt.target
br %r1
Fix by setting PLT target to the instruction following `brcl 0xf,plt`
instead of 0. This way T2 will simply resume the execution of the eBPF
program, which is the desired effect of passing new_addr == NULL.
Fixes: f1d5df84cd ("s390/bpf: Implement bpf_arch_text_poke()")
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/bpf/20230414154755.184502-1-iii@linux.ibm.com
Merge a quirk to force StorageD3Enable on AMD Picasso systems (Mario
Limonciello).
* acpi-x86:
ACPI: x86: utils: Add Picasso to the list for forcing StorageD3Enable
So far io_req_complete_post() only covers DEFER_TASKRUN by completing
request via task work when the request is completed from IOWQ.
However, uring command could be completed from any context, and if io
uring is setup with DEFER_TASKRUN, the command is required to be
completed from current context, otherwise wait on IORING_ENTER_GETEVENTS
can't be wakeup, and may hang forever.
The issue can be observed on removing ublk device, but turns out it is
one generic issue for uring command & DEFER_TASKRUN, so solve it in
io_uring core code.
Fixes: e6aeb2721d ("io_uring: complete all requests in task context")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@kernel.dk/
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The GPIO polarity of the WSA881x shutdown GPIO was inconsistent and had
to be corrected in the driver, this fixes the polarity in the DeviceTree
for QRB5165 RB5, SM8250 MTP, Samsung Galaxy Book 2 and Lenovo Yoga C630.
The recent rearrangement of nodes among the IPQ8074 accidentally enabled
the PCIe PHYs, rather than the PCIe controllers. This is being
corrected, to restore PCIe functionality.
PMK8280 PON node has the wrong compatible, which recently caused the
driver to stop probing. This is corrected and the required "pbs" region
is added.
With support for HBR3 introduced, it's noted that SC7280 Herobrine
devices are having trouble running at this rate. This drops the claim
that it's supported, until further analysis can be done.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAmQ0LU8VHGFuZGVyc3Nv
bkBrZXJuZWwub3JnAAoJEAsfOT8Nma3Fa+wP/23+62lBn+5sfsUMo5k/ajYel0Ug
sfqMS7PDvTPG3AsVhk1R95NrKWb9eWmkpvkwVKsBHD75BeRkuDovIx9yxSLSDZm2
yWlV514AOLWlhgBu75XzOIDy/J9BQccZPmkBHPJY2I9vpnvlhWnQhaPx4tKgQM8T
5qKuQUf2Q1sQy/QDcbOrMEpHCBHl+Zm11ffmnFosXtVsF0xEzcGyHcKwCOMSaxg1
KVxeK0wS3Mt09D4SiPwVNaMA+c0vcexp2jpySoQBMbEmbYOZr3DEU+b5lyZDQqOi
odMwF4EhMHaGvUCRwM+c4GgdcdXP8IT4eex9ZjdgYblzJsAUfdpFQ5NBB4PeoHPt
9GOp1n+s4+L3pVm9oyeqUDAUNhE3X6PinOfEScwtgFnQAob45k+6vznVgWt66Apg
uZAXRE1kONxtJO7bR8hzH+NH395mtb18lHh1MF09KdIR48wl8bK2f91yP0NkSTY8
N4mKcdy9GwBAmTK3FpEmExOWfmwq0EXdeV2j2RdrVTGNBvTuX9nzpKHUdckk77Ay
sOS5KfUANAHy/7UIOZHwerPKImRyo+y7gZ8kRcihtUgGOXaESeT1s1KPsTUE2ZqX
xWuJWRxYizPHdn809iEr5jgAHOAZlKh5RsJ6QQijY1XKNwNLU5/8iLb61x55PbHv
fnvqlKTO04pDSWD9
=3Uaf
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmQ5PpEACgkQYKtH/8kJ
Uic7Vw/8CTiwm+gErIrED6Wm7eCNkzvENkruXboy5437tyrPsnd68sJ+KFs7lznz
0cXSA06bGKp6rj/i1nodfwO8A0PtEl+RAtCaRuesCI8hQRfkF9o6BOEjpmQ4mYbn
09nr0ehIatVdwl+RxFn9GZEBQT4yPIt8cTjjgT/zyxUsomYJjDv6dl9iODyX8TBY
w3J2VInSUW1GRrlf+GD7XPxbIHhHl7tqWG1neBENQirWb7N0xjDuq2X7absWyzgt
TaPaKYiaqr9X5cE1cfDff+RDoBVZRdtG5H7Gs1u28UaVvmziP04oCxgD8L5cHbX5
niT8cq0d0+f367o+B9WXfcqJXHk/0q3clWP+Ad5q6Bt7HhAkfxY5eezYUMgoDpN4
3+fRrdBJBVK7e2OobxK8+tUt5Zu97zIrNL1b0k9+RkauhBL1efuOQ/tAfh4AY4LR
tguyJ6RBIneNYlAWsWPW7f4TpkR/2XmnFYityjFMLLj75rVK2oC+q1kYU/4/Un6a
Z9G20Uotr3YXThGQJasZSwh+R0maymIEN7pugg9mrImC5zvGOUXA1DC+6MqdLmd1
ocqSr3a0p6jRe9t0xSBxS8YtFT33zU9pevOTnIIYrQhyQ1xL8DFXjmoEFbuQdvWG
Ks2x83/F5OefLoGYXG5Q/K10eKmodaWsuyPsSePOkOTo4VWZEoE=
=ikZD
-----END PGP SIGNATURE-----
Merge tag 'qcom-arm64-fixes-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes
A few more Qualcomm ARM64 DeviceTree fixes for 6.3
The GPIO polarity of the WSA881x shutdown GPIO was inconsistent and had
to be corrected in the driver, this fixes the polarity in the DeviceTree
for QRB5165 RB5, SM8250 MTP, Samsung Galaxy Book 2 and Lenovo Yoga C630.
The recent rearrangement of nodes among the IPQ8074 accidentally enabled
the PCIe PHYs, rather than the PCIe controllers. This is being
corrected, to restore PCIe functionality.
PMK8280 PON node has the wrong compatible, which recently caused the
driver to stop probing. This is corrected and the required "pbs" region
is added.
With support for HBR3 introduced, it's noted that SC7280 Herobrine
devices are having trouble running at this rate. This drops the claim
that it's supported, until further analysis can be done.
* tag 'qcom-arm64-fixes-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: dts: qcom: sc7280: remove hbr3 support on herobrine boards
arm64: dts: qcom: sc8280xp-pmics: fix pon compatible and registers
arm64: dts: qcom: ipq8074-hk10: enable QMP device, not the PHY node
arm64: dts: qcom: ipq8074-hk01: enable QMP device, not the PHY node
arm64: dts: qcom: qrb5165-rb5: Use proper WSA881x shutdown GPIO polarity
arm64: dts: qcom: sm8250-mtp: Use proper WSA881x shutdown GPIO polarity
arm64: dts: qcom: sdm850-samsung-w737: Use proper WSA881x shutdown GPIO polarity
arm64: dts: qcom: sdm850-lenovo-yoga-c630: Use proper WSA881x shutdown GPIO polarity
Link: https://lore.kernel.org/r/20230410153850.4752-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
missing 32k clock definition for Anbric xx3 devices, missing cache-levels
for rk3588, fixed rk3326-board display supplies and more dt-schema fixes.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCAAuFiEE7v+35S2Q1vLNA3Lx86Z5yZzRHYEFAmQx/6UQHGhlaWtvQHNu
dGVjaC5kZQAKCRDzpnnJnNEdgYMiB/45skI/LcajSjuqIJpAC+cDeCs/IaQb6aev
4fnwBRYgb836JvThTGeOn4KRe2avnZMfyjn3IlkHhYBHtLukS8jcuNuU+T+oD1/n
Yss63kUSbMCvNl13uCSz/dqsaFZJ1x8hHth0b19PlGWnnBgtzWWWScr371m+PsF6
2x2dnteRvQcUyWa8/5Xavgc8bseMds2SiKEc9Y58MTwRVL/pqfh9AFJhsWRJYFsN
x0SU+Gwopg2qW+Mfdm6DOK1m08Vu6Uaw1zftF06/oAnDqvYMiVSWfsFIpW5eEi/T
kb5ovzTTDwnLz/n8jkG1ldB7hQY362NQjyIc/GoUab/03yxyFE63
=JS4f
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmQ5PlAACgkQYKtH/8kJ
UifHPRAAwF4jqjWejgzFnHsB7RPmjvUtvuobU0gmI25HIjKnWLdcuS4IFXWvFgLU
gxc1oQS1m4C7eM8ceI/jGR3oDZfhKXCIrpif9Tg+LxtmwRaeo3tbpasPANwiO+Eq
BI6LssLFStBblwwaaMzC1E/8fCdBsZYVbQe4LqsJmzb9hOvFCih4At1YFHKQXcHM
L+1sCdpKMo/f/AQ3sjS3kEYnFwrIYdskxImHXmYkYe2e1OMkXSQcWjz958A7GPB1
kdm5yMPARR8FZxY9RZBf6QLg3xnIc/OqraeHYCjpFLs+gDGGoEMGwwBW5u3/JJmE
SLTq2EYYf6szdVh156zjfMtCFCIM51BXoWwlB3swhWI80WD87lzrKzYYRbW5H6jp
5SIW74a8vR2OxOKyQu00Qb8S0SsTJF5Ord9EImYsm6rXpQnSs894kp18zNjmjUDh
FLxNhMMuY8aiwbypYUm4YRFN9BJJth3SKLXIyP1xxBEpN3IIeQSIXeD9ZoA+sc6e
OdEREqMwQV9xPq84JQLsxTqjcSCwO1MJECqwNvKj07w0qdaIMgDAgirLCTXSPJPh
b8b2p07Dpb5wFKw/FwGSBuvi3DEzSLhK7zbJzky8zNhmWNAx0VS7gR/koFQDGxR0
gNcVU5vPB2ujvBdW76MegqynP2UowsGdD3Sy3KlB/FAZ+LTGMGY=
=Yjl8
-----END PGP SIGNATURE-----
Merge tag 'v6.3-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes
Lower sd card speeds for two boards to make them run more reliable,
missing 32k clock definition for Anbric xx3 devices, missing cache-levels
for rk3588, fixed rk3326-board display supplies and more dt-schema fixes.
* tag 'v6.3-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
arm64: dts: rockchip: correct panel supplies on some rk3326 boards
arm64: dts: rockchip: use just "port" in panel on RockPro64
arm64: dts: rockchip: use just "port" in panel on Pinebook Pro
arm64: dts: rockchip: Remove non-existing pwm-delay-us property
arm64: dts: rockchip: Add clk_rtc_32k to Anbernic xx3 Devices
arm64: dts: rockchip: add rk3588 cache level information
arm64: dts: rockchip: Lower SD card speed on rk3399 Pinebook Pro
arm64: dts: rockchip: Lower sd speed on rk3566-soquartz
ARM: dts: rockchip: fix a typo error for rk3288 spdif node
arm64: dts: rockchip: Fix rk3399 GICv3 ITS node name
Link: https://lore.kernel.org/r/10559306.CDJkKcVGEf@phil
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
On some Qualcomm platforms, like SC8280XP, the attempt to set PC mode
during boot fails with PSCI_RET_DENIED and since commit 998fcd001f
("firmware/psci: Print a warning if PSCI doesn't accept PC mode") this
is now logged at warning level:
psci: failed to set PC mode: -3
As there is nothing users can do about the firmware behaving this way,
demote the warning to info level and clearly mark it as a firmware bug:
psci: [Firmware Bug]: failed to set PC mode: -3
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Acked-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Kevin Brodsky says:
====================
net: Finish up ->msg_control{,_user} split
Commit 1f466e1f15 ("net: cleanly handle kernel vs user buffers for
->msg_control") introduced the msg_control_user and
msg_control_is_user fields in struct msghdr, to ensure that user
pointers are represented as such. It also took care of converting most
users of struct msghdr::msg_control where user pointers are involved. It
did however miss a number of cases, and some code using msg_control
inappropriately has also appeared in the meantime.
This series is attempting to complete the split, by eliminating the
remaining cases where msg_control is used when in fact a user
pointer is stored in the union (patch 1).
It also addresses a couple of issues with msg_control_is_user: one where
it is not updated as it should (patch 2), and one where it is not
initialised (patch 3).
v1..v2:
* Split out the msg_control_is_user fixes into separate patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
do_ipv6_setsockopt() makes use of struct msghdr::msg_control in the
IPV6_2292PKTOPTIONS case. Make sure to initialise
msg_control_is_user accordingly.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
cmsghdr_from_user_compat_to_kern() is an unusual case w.r.t. how
the kmsg->msg_control* fields are used. The input struct msghdr
holds a pointer to a user buffer, i.e. ksmg->msg_control_user is
active. However, upon success, a kernel pointer is stored in
kmsg->msg_control. kmsg->msg_control_is_user should therefore be
updated accordingly.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 1f466e1f15 ("net: cleanly handle kernel vs user
buffers for ->msg_control"), pointers to user buffers should be
stored in struct msghdr::msg_control_user, instead of the
msg_control field. Most users of msg_control have already been
converted (where user buffers are involved), but not all of them.
This patch attempts to address the remaining cases. An exception is
made for null checks, as it should be safe to use msg_control
unconditionally for that purpose.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces 'skb_queue_tail()' with 'virtio_vsock_skb_queue_tail()'.
The first one uses 'spin_lock_irqsave()', second uses 'spin_lock_bh()'.
There is no need to disable interrupts in the loopback transport as
there is no access to the queue with skbs from interrupt context. Both
virtio and vhost transports work in the same way.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the TCA_QFQ_LMAX value is not offered through nlattr, lmax is determined by the MTU value of the network device.
The MTU of the loopback device can be set up to 2^31-1.
As a result, it is possible to have an lmax value that exceeds QFQ_MIN_LMAX.
Due to the invalid lmax value, an index is generated that exceeds the QFQ_MAX_INDEX(=24) value, causing out-of-bounds read/write errors.
The following reports a oob access:
[ 84.582666] BUG: KASAN: slab-out-of-bounds in qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
[ 84.583267] Read of size 4 at addr ffff88810f676948 by task ping/301
[ 84.583686]
[ 84.583797] CPU: 3 PID: 301 Comm: ping Not tainted 6.3.0-rc5 #1
[ 84.584164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 84.584644] Call Trace:
[ 84.584787] <TASK>
[ 84.584906] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
[ 84.585108] print_report (mm/kasan/report.c:320 mm/kasan/report.c:430)
[ 84.585570] kasan_report (mm/kasan/report.c:538)
[ 84.585988] qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
[ 84.586599] qfq_enqueue (net/sched/sch_qfq.c:1255)
[ 84.587607] dev_qdisc_enqueue (net/core/dev.c:3776)
[ 84.587749] __dev_queue_xmit (./include/net/sch_generic.h:186 net/core/dev.c:3865 net/core/dev.c:4212)
[ 84.588763] ip_finish_output2 (./include/net/neighbour.h:546 net/ipv4/ip_output.c:228)
[ 84.589460] ip_output (net/ipv4/ip_output.c:430)
[ 84.590132] ip_push_pending_frames (./include/net/dst.h:444 net/ipv4/ip_output.c:126 net/ipv4/ip_output.c:1586 net/ipv4/ip_output.c:1606)
[ 84.590285] raw_sendmsg (net/ipv4/raw.c:649)
[ 84.591960] sock_sendmsg (net/socket.c:724 net/socket.c:747)
[ 84.592084] __sys_sendto (net/socket.c:2142)
[ 84.593306] __x64_sys_sendto (net/socket.c:2150)
[ 84.593779] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 84.593902] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 84.594070] RIP: 0033:0x7fe568032066
[ 84.594192] Code: 0e 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c09[ 84.594796] RSP: 002b:00007ffce388b4e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
Code starting with the faulting instruction
===========================================
[ 84.595047] RAX: ffffffffffffffda RBX: 00007ffce388cc70 RCX: 00007fe568032066
[ 84.595281] RDX: 0000000000000040 RSI: 00005605fdad6d10 RDI: 0000000000000003
[ 84.595515] RBP: 00005605fdad6d10 R08: 00007ffce388eeec R09: 0000000000000010
[ 84.595749] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
[ 84.595984] R13: 00007ffce388cc30 R14: 00007ffce388b4f0 R15: 0000001d00000001
[ 84.596218] </TASK>
[ 84.596295]
[ 84.596351] Allocated by task 291:
[ 84.596467] kasan_save_stack (mm/kasan/common.c:46)
[ 84.596597] kasan_set_track (mm/kasan/common.c:52)
[ 84.596725] __kasan_kmalloc (mm/kasan/common.c:384)
[ 84.596852] __kmalloc_node (./include/linux/kasan.h:196 mm/slab_common.c:967 mm/slab_common.c:974)
[ 84.596979] qdisc_alloc (./include/linux/slab.h:610 ./include/linux/slab.h:731 net/sched/sch_generic.c:938)
[ 84.597100] qdisc_create (net/sched/sch_api.c:1244)
[ 84.597222] tc_modify_qdisc (net/sched/sch_api.c:1680)
[ 84.597357] rtnetlink_rcv_msg (net/core/rtnetlink.c:6174)
[ 84.597495] netlink_rcv_skb (net/netlink/af_netlink.c:2574)
[ 84.597627] netlink_unicast (net/netlink/af_netlink.c:1340 net/netlink/af_netlink.c:1365)
[ 84.597759] netlink_sendmsg (net/netlink/af_netlink.c:1942)
[ 84.597891] sock_sendmsg (net/socket.c:724 net/socket.c:747)
[ 84.598016] ____sys_sendmsg (net/socket.c:2501)
[ 84.598147] ___sys_sendmsg (net/socket.c:2557)
[ 84.598275] __sys_sendmsg (./include/linux/file.h:31 net/socket.c:2586)
[ 84.598399] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 84.598520] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 84.598688]
[ 84.598744] The buggy address belongs to the object at ffff88810f674000
[ 84.598744] which belongs to the cache kmalloc-8k of size 8192
[ 84.599135] The buggy address is located 2664 bytes to the right of
[ 84.599135] allocated 7904-byte region [ffff88810f674000, ffff88810f675ee0)
[ 84.599544]
[ 84.599598] The buggy address belongs to the physical page:
[ 84.599777] page:00000000e638567f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10f670
[ 84.600074] head:00000000e638567f order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 84.600330] flags: 0x200000000010200(slab|head|node=0|zone=2)
[ 84.600517] raw: 0200000000010200 ffff888100043180 dead000000000122 0000000000000000
[ 84.600764] raw: 0000000000000000 0000000080020002 00000001ffffffff 0000000000000000
[ 84.601009] page dumped because: kasan: bad access detected
[ 84.601187]
[ 84.601241] Memory state around the buggy address:
[ 84.601396] ffff88810f676800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 84.601620] ffff88810f676880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 84.601845] >ffff88810f676900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 84.602069] ^
[ 84.602243] ffff88810f676980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 84.602468] ffff88810f676a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 84.602693] ==================================================================
[ 84.602924] Disabling lock debugging due to kernel taint
Fixes: 3015f3d2a3 ("pkt_sched: enable QFQ to support TSO/GSO")
Reported-by: Gwangun Jung <exsociety@gmail.com>
Signed-off-by: Gwangun Jung <exsociety@gmail.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang says:
====================
net: mana: Add support for jumbo frame
The set adds support for jumbo frame,
with some optimization for the RX path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when MTU is set to a too large value, XDP is not allowed
to run.
Also, to prevent changing MTU fails, and leaves the NIC in a bad state,
pre-allocate all buffers before starting the change. So in low memory
condition, it will return error, without affecting the NIC.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use napi_build_skb() instead of build_skb() to take advantage of the
NAPI percpu caches to obtain skbuff_head.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmQ2LDIACgkQSD+KveBX
+j59UAf+NfLEq7i/gYOri1rLwXOrgPd13w64Oob79+pKVXzRqik/gBg4d0CKdc2X
Qafntudus7IvVrGJCM0vurw7nC75H9BMFh3dQW2u3uaR9m6i0k4tSKIAoJUFHMdj
dMgAJywYkCCXlBbGVhRE3yybal2EXfOSJ+Fr2tGeqpL4vRlg4kIaVuFGcGk2gMpq
gOahX6wtuop3ABzY3sqKCIQ1Q2jWx9AWWz+lEmw7HJGmHGugscYrgSIloHDnu1fz
Bt6uODn7GFr866rQr9+JemG0VyK8ahyisEFnX1oJ3642ZidPcDrbIrrtMWeEqo7X
1tcmY/wNti87xFbg3gL+DZekhHTHoQ==
=oIN8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-11
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Add modify-header-pattern ICM pool
net/mlx5: DR, Prepare sending new WQE type
net/mlx5: Add new WQE for updating flow table
net/mlx5: Add mlx5_ifc bits for modify header argument
net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
net/mlx5: Create a new profile for SFs
net/mlx5: Bridge, add tracepoints for multicast
net/mlx5: Bridge, implement mdb offload
net/mlx5: Bridge, support multicast VLAN pop
net/mlx5: Bridge, add per-port multicast replication tables
net/mlx5: Bridge, snoop igmp/mld packets
net/mlx5: Bridge, extract code to lookup parent bridge of port
net/mlx5: Bridge, move additional data structures to priv header
net/mlx5: Bridge, increase bridge tables sizes
net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================
Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
PFs which support the MAC Merge layer also have a set of 8 registers
called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)".
Through these, a traffic class (group of TX rings of same dequeue
priority) can be mapped to the eMAC or to the pMAC.
There's nothing particularly spectacular here. We should probably only
commit the preemptible TCs to hardware once the MAC Merge layer became
active, but unlike Felix, we don't have an IRQ that notifies us of that.
We'd have to sleep for up to verifyTime (127 ms) to wait for a
resolution coming from the verification state machine; not only from the
ndo_setup_tc() code path, but also from enetc_mm_link_state_update().
Since it's relatively complicated and has a relatively small benefit,
I'm not doing it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To gain access to the larger encapsulating structure which has the type
tc_mqprio_qopt_offload, rename just the "qopt" field as "qopt".
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.
I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.
Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.
Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).
The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:
| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.
It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.
I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).
The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.
Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>