The nexthop and nexthop bucket dump callbacks previously returned a
positive return code even when the dump was complete, prompting the core
netlink code to invoke the callback again, until returning zero.
Zero was only returned by these callbacks when no information was filled
in the provided skb, which was achieved by incrementing the dump
sentinel at the end of the dump beyond the ID of the last nexthop.
This is no longer necessary as when the dump is complete these callbacks
return zero.
Remove the unnecessary increment.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before commit f10d3d9df4 ("nexthop: Make nexthop bucket dump more
efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return
code for each resilient nexthop group whose buckets it dumped,
regardless if it encountered an error or not.
This meant that the sentinel ('dd->ctx->nh.idx') used by the function
that walked the different nexthops could not be used as a sentinel for
the bucket dump, as otherwise buckets from the same group would be
dumped over and over again.
This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx')
that was incremented by rtm_dump_nexthop_bucket_nh() after successfully
dumping all the buckets from a given group.
After the previously mentioned commit this sentinel is no longer
necessary since the function no longer returns a non-zero return code
when successfully dumping all the buckets from a given group.
Remove this sentinel and simplify the code.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 54810 does not support c45. The mmd_phy_indirect accesses return
arbirtary values leading to odd behavior like saying it supports EEE
when it doesn't. We also see that reading/writing these non-existent
MMD registers leads to phy instability in some cases.
Fixes: b14995ac25 ("net: phy: broadcom: Add BCM54810 PHY entry")
Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/1691901708-28650-1-git-send-email-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Mayer says:
====================
seg6: add NEXT-C-SID support for SRv6 End.X behavior
In the Segment Routing (SR) architecture a list of instructions, called
segments, can be added to the packet headers to influence the forwarding and
processing of the packets in an SR enabled network.
Considering the Segment Routing over IPv6 data plane (SRv6) [1], the segment
identifiers (SIDs) are IPv6 addresses (128 bits) and the segment list (SID
List) is carried in the Segment Routing Header (SRH). A segment may correspond
to a "behavior" that is executed by a node when the packet is received.
The Linux kernel currently supports a large subset of the behaviors described
in [2] (e.g., End, End.X, End.T and so on).
In some SRv6 scenarios, the number of segments carried by the SID List may
increase dramatically, reducing the MTU (Maximum Transfer Unit) size and/or
limiting the processing power of legacy hardware devices (due to longer IPv6
headers).
The NEXT-C-SID mechanism [3] extends the SRv6 architecture by providing several
ways to efficiently represent the SID List.
By leveraging the NEXT-C-SID, it is possible to encode several SRv6 segments
within a single 128 bit SID address (also referenced as Compressed SID
Container). In this way, the length of the SID List can be drastically reduced.
The NEXT-C-SID mechanism is built upon the "flavors" framework defined in [2].
This framework is already supported by the Linux SRv6 subsystem and is used to
modify and/or extend a subset of existing behaviors.
In this patchset, we extend the SRv6 End.X behavior in order to support the
NEXT-C-SID mechanism.
In details, the patchset is made of:
- patch 1/2: add NEXT-C-SID support for SRv6 End.X behavior;
- patch 2/2: add selftest for NEXT-C-SID in SRv6 End.X behavior.
From the user space perspective, we do not need to change the iproute2 code to
support the NEXT-C-SID flavor for the SRv6 End.X behavior. However, we will
update the man page considering the NEXT-C-SID flavor applied to the SRv6 End.X
behavior in a separate patch.
[1] - https://datatracker.ietf.org/doc/html/rfc8754
[2] - https://datatracker.ietf.org/doc/html/rfc8986
[3] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
====================
Link: https://lore.kernel.org/r/20230812180926.16689-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This selftest is designed for testing the support of NEXT-C-SID flavor
for SRv6 End.X behavior. It instantiates a virtual network composed of
several nodes: hosts and SRv6 routers. Each node is realized using a
network namespace that is properly interconnected to others through veth
pairs, according to the topology depicted in the selftest script file.
The test considers SRv6 routers implementing IPv4/IPv6 L3 VPNs leveraged
by hosts for communicating with each other. Such routers i) apply
different SRv6 Policies to the traffic received from connected hosts,
considering the IPv4 or IPv6 protocols; ii) use the NEXT-C-SID
compression mechanism for encoding several SRv6 segments within a single
128-bit SID address, referred to as a Compressed SID (C-SID) container.
The NEXT-C-SID is provided as a "flavor" of the SRv6 End.X behavior,
enabling it to properly process the C-SID containers. The correct
execution of the enabled NEXT-C-SID SRv6 End.X behavior is verified
through reachability tests carried out between hosts belonging to the
same VPN.
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Co-developed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-3-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NEXT-C-SID mechanism described in [1] offers the possibility of
encoding several SRv6 segments within a single 128 bit SID address. Such
a SID address is called a Compressed SID (C-SID) container. In this way,
the length of the SID List can be drastically reduced.
A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
logically structured in three main blocks: i) Locator-Block; ii)
Locator-Node Function; iii) Argument.
C-SID container
+------------------------------------------------------------------+
| Locator-Block |Loc-Node| Argument |
| |Function| |
+------------------------------------------------------------------+
<--------- B -----------> <- NF -> <------------- A --------------->
(i) The Locator-Block can be any IPv6 prefix available to the provider;
(ii) The Locator-Node Function represents the node and the function to
be triggered when a packet is received on the node;
(iii) The Argument carries the remaining C-SIDs in the current C-SID
container.
This patch leverages the NEXT-C-SID mechanism previously introduced in the
Linux SRv6 subsystem [2] to support SID compression capabilities in the
SRv6 End.X behavior [3].
An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior
but it is capable of processing the compressed SID List encoded in C-SID
containers.
An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support
user-provided Locator-Block and Locator-Node Function lengths. In this
implementation, such lengths must be evenly divisible by 8 (i.e. must be
byte-aligned), otherwise the kernel informs the user about invalid
values with a meaningful error code and message through netlink_ext_ack.
If Locator-Block and/or Locator-Node Function lengths are not provided
by the user during configuration of an SRv6 End.X behavior instance with
NEXT-C-SID flavor, the kernel will choose their default values i.e.,
32-bit Locator-Block and 16-bit Locator-Node Function.
[1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
[2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/
[3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not allow to insert elements from datapath to objects maps.
Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use maybe_get_net() since GC workqueue might race with netns exit path.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Netlink event path is missing a synchronization point with GC
transactions. Add GC sequence number update to netns release path and
netlink event path, any GC transaction losing race will be discarded.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
When two threads run proc_do_sync_threshold() in parallel,
data races could happen between the two memcpy():
Thread-1 Thread-2
memcpy(val, valp, sizeof(val));
memcpy(valp, val, sizeof(val));
This race might mess up the (struct ctl_table *) table->data,
so we add a mutex lock to serialize them.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/
Signed-off-by: Sishuai Gong <sishuai.system@gmail.com>
Acked-by: Simon Horman <horms@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and
SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout
value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300
msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state.
As Paolo Valerio noticed, this might cause unwanted expiration of the ct
entry. In my test, with 1s tc netem delay set on the NAT path, after the
SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND
state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is
sent back from the peer, the sctp ct entry has expired and been deleted,
and then the SHUTDOWN_ACK has to be dropped.
Also, it is confusing these two sysctl options always show 0 due to all
timeout values using sec as unit:
net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0
net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0
This patch fixes it by also using 3 secs for sctp shutdown send and recv
state in sctp conntrack, which is also RTO.initial value in SCTP protocol.
Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV
was probably used for a rare scenario where SHUTDOWN is sent on 1st path
but SHUTDOWN_ACK is replied on 2nd path, then a new connection started
immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV
to CLOSE when receiving INIT in the ORIGINAL direction.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Reported-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
nftables selftests fail:
run-tests.sh testcases/sets/0044interval_overlap_0
Expected: 0-2 . 0-3, got:
W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1
Insertion must ignore duplicate but expired entries.
Moreover, there is a strange asymmetry in nft_pipapo_activate:
It refetches the current element, whereas the other ->activate callbacks
(bitmap, hash, rhash, rbtree) use elem->priv.
Same for .remove: other set implementations take elem->priv,
nft_pipapo_remove fetches elem->priv, then does a relookup,
remove this.
I suspect this was the reason for the change that prompted the
removal of the expired check in pipapo_get() in the first place,
but skipping exired elements there makes no sense to me, this helper
is used for normal get requests, insertions (duplicate check)
and deactivate callback.
In first two cases expired elements must be skipped.
For ->deactivate(), this gets called for DELSETELEM, so it
seems to me that expired elements should be skipped as well, i.e.
delete request should fail with -ENOENT error.
Fixes: 24138933b9 ("netfilter: nf_tables: don't skip expired elements during walk")
Signed-off-by: Florian Westphal <fw@strlen.de>
When flushing, individual set elements are disabled in the next
generation via the ->flush callback.
Catchall elements are not disabled. This is incorrect and may lead to
double-deactivations of catchall elements which then results in memory
leaks:
WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730
CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60
RIP: 0010:nft_map_deactivate+0x549/0x730
[..]
? nft_map_deactivate+0x549/0x730
nf_tables_delset+0xb66/0xeb0
(the warn is due to nft_use_dec() detecting underflow).
Fixes: aaa31047a6 ("netfilter: nftables: add catch-all set element support")
Reported-by: lonial con <kongln9170@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Jakub Kicinski says:
We've got some new kdoc warnings here:
net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc'
net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc'
include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set'
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Fixes: f6c383b8c3 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>
Jakub Kicinski says:
====================
genetlink: provide struct genl_info to dumps
One of the biggest (which is not to say only) annoyances with genetlink
handling today is that doit and dumpit need some of the same information,
but it is passed to them in completely different structs.
The implementations commonly end up writing a _fill() method which
populates a message and have to pass at least 6 parameters. 3 of which
are extracted manually from request info.
After a lot of umming and ahing I decided to populate struct genl_info
for dumps, without trying to factor out only the common parts.
This makes the adoption easiest.
In the future we may add a new version of dump which takes
struct genl_info *info as the second argument, instead of
struct netlink_callback *cb. For now developers have to call
genl_info_dump(cb) to get the info.
Typical genetlink families no longer get exposed to netlink protocol
internals like pid and seq numbers.
v3:
- correct the condition in ethtool code (patch 10)
v2: https://lore.kernel.org/all/20230810233845.2318049-1-kuba@kernel.org/
- replace the GENL_INFO_NTF() macro with init helper
- fix the commit messages
v1: https://lore.kernel.org/all/20230809182648.1816537-1-kuba@kernel.org/
====================
Link: https://lore.kernel.org/r/20230814214723.2924989-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We had a number of bugs in the past because developers forgot
to fully test dumps, which pass NULL as info to .prepare_data.
.prepare_data implementations would try to access info->extack
leading to a null-deref.
Now that dumps and notifications can access struct genl_info
we can pass it in, and remove the info null checks.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add some APIs and helpers required for convenient construction
of replies and notifications based on struct genl_info.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Having family in struct genl_info is quite useful. It cuts
down the number of arguments which need to be passed to
helpers which already take struct genl_info.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since dumps carry struct genl_info now, use the attrs pointer
from genl_info and remove the one in struct genl_dumpit_info.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netlink GET implementations must currently juggle struct genl_info
and struct netlink_callback, depending on whether they were called
from doit or dumpit.
Add genl_info to the dump state and populate the fields.
This way implementations can simply pass struct genl_info around.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to user header in struct genl_info
is an overkill. Compute the header pointer at runtime.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct netlink_callback has a const nlh pointer, make the
pointer in struct genl_info const as well, to make copying
between the two easier.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add helpers which take/release the genl mutex based
on family->parallel_ops. Remove the separation between
handling of ops in locked and parallel families.
Future patches would make the duplicated code grow even more.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kumar reported a KASAN splat in tcp_v6_rcv:
bash-5.2# ./test_progs -t btf_skc_cls_ingress
...
[ 51.810085] BUG: KASAN: slab-out-of-bounds in tcp_v6_rcv+0x2d7d/0x3440
[ 51.810458] Read of size 2 at addr ffff8881053f038c by task test_progs/226
The problem is that inet[6]_steal_sock accesses sk->sk_protocol without
accounting for request or timewait sockets. To fix this we can't just
check sock_common->skc_reuseport since that flag is present on timewait
sockets.
Instead, add a fullsock check to avoid the out of bands access of sk_protocol.
Fixes: 9c02bec959 ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230815-bpf-next-v2-1-95126eaa4c1b@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
- Fix TLB ptlock checks to not break lightweight spinlock checks
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCZNva3QAKCRD3ErUQojoP
X+r+AQCYl32LWsQ7cNOxCjaT+sQ+a2poQbf2TFjrYYjpGwTvWQEAoxq3ycBKquOj
WY1CY6dO9vrj1EVGc90P8NvRWDRbaQM=
=e6iw
-----END PGP SIGNATURE-----
Merge tag 'parisc-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
"Fix the parisc TLB ptlock checks so that they can be enabled together
with the lightweight spinlock checks"
* tag 'parisc-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix CONFIG_TLB_PTLOCK to work with lightweight spinlock checks
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTaMT8ACgkQiiy9cAdy
T1F7Ygv/ed2tYvcdwvrakFOvLWEvgTpAh/tb7dh+l58mG1WV7ZDRDWamSAyzy8OT
s8IeBYmIaOdOv5opYakuQrY0lhTcUSRCoSsvH6k2vZAMLG0SX9nXVyHv0JgPDTuz
gNvxDRrOusOhNfVlGya0dhH90hDyvW1wzU66HlCMbzrfmeQKKG6A6shOztGfw1y6
cXVKr4k315dcH9sAHzMDcg5bv3ucyKWztdAaF68dK71oEUwceMTmKpKc7OYPxThn
DOY4blVefIUAPTZYh7RD1Ota1VYfQafZFu01ttqh3XvG9PtOlDTuEbRlANpYv2d/
Awn6ZIdx2tV8MERJ7R0p/vKdVj5m5sDaTls0q4PWc/OMFrOFGfvuMhoZ/uAFPhFc
e9EKjg7u0B7q3F8aT4E34Hqwl6UNhyDvRqn5BhztcDgMdIke7OVuvHQSiLGXfMQT
XJN0bTynTB6RnHDsFxG8i7YBlsHDk6Ic/xOAcSG42U/5hNBTrovY9HyhYdMzZ/TD
9sETDezn
=Woor
-----END PGP SIGNATURE-----
Merge tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Three smb client fixes, all for stable:
- fix for oops in unmount race with lease break of deferred close
- debugging improvement for reconnect
- fix for fscache deadlock (folio_wait_bit_common hang)"
* tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: display network namespace in debug information
cifs: Release folio lock on fscache read hit.
cifs: fix potential oops in cifs_oplock_break
A couple of small, driver specific fixes - one incorrect definition for
one of the Qualcomm regulators and better handling of poorly formed DTs
in the DA9063 driver.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAmTbcNIACgkQJNaLcl1U
h9DoGwf8DdO2/wJNjo4JmvT5C5QpUyZ9NE1nKyQXn51yNof9rmJS1O4eaGtL6b6R
LeNgvmQ2Oy8V15wXjM/SgMIobmWC5blgcZsBGtBUbTbZwYkzST1N/OdYg9cptcXx
eJ5Jnt3YXaOvU2RdM0o60bEJwdnNpUoiiYLjAacRlBvwbcRrMuUDx0mw6shTCW5E
jDX3y+wNQ2F/bxXfeqvpU7idpZncThS8AfdifbLY2bWGa6PDOiXldABtMPulzgnB
fQd+459guH3PXp0mp6Y3l0Fbsx0IwD25TANoQHmRqAbAoDnG0oFpF+gKreedNo6H
bR8bqtKT6NmiSgeYkMHHxD7i4ApCBQ==
=X+AW
-----END PGP SIGNATURE-----
Merge tag 'regulator-fix-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"Two small driver specific fixes: one incorrect definition for one of
the Qualcomm regulators and better handling of poorly formed DTs in
the DA9063 driver"
* tag 'regulator-fix-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: qcom-rpmh: Fix LDO 12 regulator for PM8550
regulator: da9063: better fix null deref with partial DT
In the real workload, I encountered an issue which could cause the RTO
timer to retransmit the skb per 1ms with linear option enabled. The amount
of lost-retransmitted skbs can go up to 1000+ instantly.
The root cause is that if the icsk_rto happens to be zero in the 6th round
(which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
due to the changed calculation method in tcp_retransmit_timer() as follows:
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
Above line could be converted to
icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0
Therefore, the timer expires so quickly without any doubt.
I read through the RFC 6298 and found that the RTO value can be rounded
up to a certain value, in Linux, say TCP_RTO_MIN as default, which is
regarded as the lower bound in this patch as suggested by Eric.
Fixes: 36e31b0af5 ("net: TCP thin linear timeouts")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The encode_dma() function has some validation on in_trans->size but it
would be more clear to move those checks to find_and_map_user_pages().
The encode_dma() had two checks:
if (in_trans->addr + in_trans->size < in_trans->addr || !in_trans->size)
return -EINVAL;
The in_trans->addr variable is the starting address. The in_trans->size
variable is the total size of the transfer. The transfer can occur in
parts and the resources->xferred_dma_size tracks how many bytes we have
already transferred.
This patch introduces a new variable "remaining" which represents the
amount we want to transfer (in_trans->size) minus the amount we have
already transferred (resources->xferred_dma_size).
I have modified the check for if in_trans->size is zero to instead check
if in_trans->size is less than resources->xferred_dma_size. If we have
already transferred more bytes than in_trans->size then there are negative
bytes remaining which doesn't make sense. If there are zero bytes
remaining to be copied, just return success.
The check in encode_dma() checked that "addr + size" could not overflow
and barring a driver bug that should work, but it's easier to check if
we do this in parts. First check that "in_trans->addr +
resources->xferred_dma_size" is safe. Then check that "xfer_start_addr +
remaining" is safe.
My final concern was that we are dealing with u64 values but on 32bit
systems the kmalloc() function will truncate the sizes to 32 bits. So
I calculated "total = in_trans->size + offset_in_page(xfer_start_addr);"
and returned -EINVAL if it were >= SIZE_MAX. This will not affect 64bit
systems.
Fixes: 129776ac2e ("accel/qaic: Add control path")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://patchwork.freedesktop.org/patch/msgid/24d3348b-25ac-4c1b-b171-9dae7c43e4e0@moroto.mountain
The temporary buffer storing slicing configuration data from user is only
freed on error. This is a memory leak. Free the buffer unconditionally.
Fixes: ff13be8303 ("accel/qaic: Add datapath")
Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230802145937.14827-1-quic_jhugo@quicinc.com
just a bunch of bugfixes all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmTVQCMPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpclIH/A1l9iFK6v0eQ7cfdRjONTsd0YzlAdqU9J1V
yoygkdd3xZIXSwAUuRTEHq7AbTyHeHb2aTV6cPFSNzTckjAQV7v6Nk6vbxeO2uLM
d620V6cOI2QJGcib0nEkk1QMvyCJ/gipYRDvXwakyZTsga/VztpOG+EdXbTcuDUk
ud8lY90LgiVc/JozMT+bCUmbYsorfe/Xo52twLwb+irtHJ/w5h8sQbq0M7jd1B7k
hM/KebGlg9vGHH/G4UdfQx/3Td1GnNhNBDLVZLvqmie9FkA3c0S4SQwoKLXMbHfu
By0bnvmJaQFs64Qr2AvYGmyaUniu9qnpXDGNMBweASAzJA04tg0=
=h3Pg
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Just a bunch of bugfixes all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (26 commits)
virtio-mem: check if the config changed before fake offlining memory
virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub Block Mode (SBM)
virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY
virtio-mem: remove unsafe unplug in Big Block Mode (BBM)
pds_vdpa: fix up debugfs feature bit printing
pds_vdpa: alloc irq vectors on DRIVER_OK
pds_vdpa: clean and reset vqs entries
pds_vdpa: always allow offering VIRTIO_NET_F_MAC
pds_vdpa: reset to vdpa specified mac
virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case
vdpa/mlx5: Fix crash on shutdown for when no ndev exists
vdpa/mlx5: Delete control vq iotlb in destroy_mr only when necessary
vdpa/mlx5: Fix mr->initialized semantics
vdpa/mlx5: Correct default number of queues when MQ is on
virtio-vdpa: Fix cpumask memory leak in virtio_vdpa_find_vqs()
vduse: Use proper spinlock for IRQ injection
vdpa: Enable strict validation for netlinks ops
vdpa: Add max vqp attr to vdpa_nl_policy for nlattr length check
vdpa: Add queue index attr to vdpa_nl_policy for nlattr length check
vdpa: Add features attr to vdpa_nl_policy for nlattr length check
...
David Vernet says:
====================
The struct bpf_struct_ops structure in BPF is a framework that allows
subsystems to extend themselves using BPF. In commit 68b04864ca
("bpf: Create links for BPF struct_ops maps") and commit aef56f2e91
("bpf: Update the struct_ops of a bpf_link"), the structure was updated
to include new ->validate() and ->update() callbacks respectively in
support of allowing struct_ops maps to be created with BPF_F_LINK.
The intention was that struct bpf_struct_ops implementations could
support map updates through the link. Because map validation and
registration would take place in two separate steps for struct_ops
maps managed by the link (the first in map update elem, and the latter
in link create), the ->validate() callback was added, and any struct_ops
implementation that wished to use BPF_F_LINK, even just for lifetime
management, would then be required to define both it and ->update().
Not all struct_ops implementations can or will support update, however.
For example, the sched_ext struct_ops implementation proposed in [0]
will not be able to support atomic map updates because it can race with
sysrq, has to cycle tasks through various states in order to safely
transition, etc. It can, however, benefit from letting the BPF link
automatically evict the struc_ops map when the application exits (e.g.
if it crashes).
This patch set therefore:
1. Updates the struct_ops implementation to support default values for
->validate() and ->update() so that struct_ops implementations can
benefit from BPF_F_LINK management even if they can't support
updates.
2. Documents struct bpf_struct_ops so that the semantics are clear and
well defined.
---
v2: https://lore.kernel.org/bpf/0f5ea3de-c6e7-490f-b5ec-b5c7cd288687@gmail.com/T/
Changes from v2 -> v3:
- Add patch 2/2 that documents the struct bpf_struct_ops structure.
- Add Kui-Feng's Acked-by tag to patch 1/2.
v1: https://lore.kernel.org/lkml/20230811150934.GA542801@maniforge/
Changes from v1 -> v2:
- Move the if (!st_map->st_ops->update) check outside of the critical
section before we acquire the update_mutex.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Subsystems that want to implement a struct bpf_struct_ops structure to
enable struct_ops maps must currently reverse engineer how the structure
works. Given that this is meant to be a way for subsystem maintainers to
extend their subsystems using BPF, let's document it to make it a bit
easier on them.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230814185908.700553-3-void@manifault.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
define the .validate() and .update() callbacks in its corresponding
struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
in its own right to ensure that the map is unloaded if an application
crashes. For example, with sched_ext, we want to automatically unload
the host-wide scheduler if the application crashes. We would likely
never support updating elements of a sched_ext struct_ops map, so we'd
have to implement these callbacks showing that they _can't_ support
element updates just to benefit from the basic lifetime management of
struct_ops links.
Let's enable struct_ops maps to work with BPF_F_LINK even if they
haven't defined these callbacks, by assuming that a struct_ops map
element cannot be updated by default.
Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The failure handling procedure destroys page pools for all queues,
including those that haven't had their page pool created yet. this patch
introduces necessary adjustments to prevent potential risks and
inconsistency with the error handling behavior.
Fixes: 0ebab78cbc ("net: veth: add page_pool for page recycling")
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Link: https://lore.kernel.org/r/20230812023016.10553-1-liangchen.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Schmidt says:
====================
octeon_ep: fixes for error and remove paths
I have an Octeon card that's misconfigured in a way that exposes a
couple of bugs in the octeon_ep driver's error paths. It can reproduce
the issues that patches 1 & 4 are fixing. Patches 2 & 3 are a result of
reviewing the nearby code.
====================
Link: https://lore.kernel.org/r/20230810150114.107765-1-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If it fails to get the devices's MAC address, octep_probe exits while
leaving the delayed work intr_poll_task queued. When the work later
runs, it's a use after free.
Move the cancelation of intr_poll_task from octep_remove into
octep_device_cleanup. This does not change anything in the octep_remove
flow, but octep_device_cleanup is called also in the octep_probe error
path, where the cancelation is needed.
Note that the cancelation of ctrl_mbox_task has to follow
intr_poll_task's, because the ctrl_mbox_task may be queued by
intr_poll_task.
Fixes: 24d4333233 ("octeon_ep: poll for control messages")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://lore.kernel.org/r/20230810150114.107765-5-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
intr_poll_task may queue ctrl_mbox_task. The function
octep_poll_non_ioq_interrupts_cn93_pf does this.
When removing the driver and canceling these two works, cancel
ctrl_mbox_task last to guarantee it does not run anymore.
Fixes: 24d4333233 ("octeon_ep: poll for control messages")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://lore.kernel.org/r/20230810150114.107765-4-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tx_timeout_task is canceled too early when removing the driver. Nothing
prevents .ndo_tx_timeout from triggering and queuing the work again.
Better cancel it after the netdev is unregistered.
It's harmless for octep_tx_timeout_task to run in the window between the
unregistration and cancelation, because it checks netif_running.
Fixes: 862cd659a6 ("octeon_ep: Add driver framework and device initialization")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://lore.kernel.org/r/20230810150114.107765-3-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The intention was to wait up to 500 ms for the mbox response.
The third argument to wait_event_interruptible_timeout() is supposed to
be the timeout duration. The driver mistakenly passed absolute time
instead.
Fixes: 577f0d1b1c ("octeon_ep: add separate mailbox command and response queues")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230810150114.107765-2-mschmidt@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On Zynq UltraScale+ MPSoC ubuntu platform when systemctl issues suspend,
network manager bring down the interface and goes into suspend. When it
wakes up it again enables the interface.
This leads to xilinx-psgtr "PLL lock timeout" on interface bringup, as
the power management controller power down the entire FPD (including
SERDES) if none of the FPD devices are in use and serdes is not
initialized on resume.
$ sudo rtcwake -m no -s 120 -v
$ sudo systemctl suspend <this does ifconfig eth1 down>
$ ifconfig eth1 up
xilinx-psgtr fd400000.phy: lane 0 (type 10, protocol 5): PLL lock timeout
phy phy-fd400000.phy.0: phy poweron failed --> -110
macb driver is called in this way:
1. macb_close: Stop network interface. In this function, it
reset MACB IP and disables PHY and network interface.
2. macb_suspend: It is called in kernel suspend flow. But because
network interface has been disabled(netif_running(ndev) is
false), it does nothing and returns directly;
3. System goes into suspend state. Some time later, system is
waken up by RTC wakeup device;
4. macb_resume: It does nothing because network interface has
been disabled;
5. macb_open: It is called to enable network interface again. ethernet
interface is initialized in this API but serdes which is power-off
by PMUFW during FPD-off suspend is not initialized again and so
we hit GT PLL lock issue on open.
To resolve this PLL timeout issue always do PS GTR initialization
when ethernet device is configured as non-wakeup source.
Fixes: f22bd29ba1 ("net: macb: Fix ZynqMP SGMII non-wakeup source resume failure")
Fixes: 8b73fa3ae0 ("net: macb: Added ZynqMP-specific initialization")
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/1691414091-2260697-1-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a phylink_get_caps implementation for Marvell 88e6060 DSA switch.
This is a fast ethernet switch, with internal PHYs for ports 0 through
4. Port 4 also supports MII, REVMII, REVRMII and SNI. Port 5 supports
MII, REVMII, REVRMII and SNI without an internal PHY.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1qUkx7-003dMX-9b@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever mlx5 driver is probed or reloaded, it queries some capabilities
in MAX mode via set_hca_cap() API. Afterwards, the driver queries all
capabilities in MAX mode via mlx5_query_hca_caps() API.
Since MAX caps are read only caps, querying them twice is redundant.
Hence, delete the second query.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Each device cap has two modes: MAX and CUR. The driver maintains a
cache of both modes of the capabilities. For most device caps, the MAX
cap mode is never used.
Hence, remove all driver queries of the MAX mode of the said caps as
well as their helper MACROs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but
there isn't any user who requires them.
As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used.
Thus, drop all usages and definitions of the mentioned caps above.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>