By using an independent drop reason it makes it easy to distinguish
between QoS-triggered or flow-triggered drop.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Eric Garver <eric@garver.life>
This adds an explicit drop action. This is used by OVS to drop packets
for which it cannot determine what to do. An explicit action in the
kernel allows passing the reason _why_ the packet is being dropped or
zero to indicate no particular error happened (i.e: OVS intentionally
dropped the packet).
Since the error codes coming from userspace mean nothing for the kernel,
we squash all of them into only two drop reasons:
- OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed
- OVS_DROP_EXPLICIT to indicate a zero value was passed (no error)
e.g. trace all OVS dropped skbs
# perf trace -e skb:kfree_skb --filter="reason >= 0x30000"
[..]
106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \
location:0xffffffffc0d9b462, protocol: 2048, reason: 196611)
reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT)
Also, this patch allows ovs-dpctl.py to add explicit drop actions as:
"drop" -> implicit empty-action drop
"drop(0)" -> explicit non-error action drop
"drop(42)" -> explicit error action drop
Signed-off-by: Eric Garver <eric@garver.life>
Co-developed-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a drop reason for packets that are dropped because an action
returns a non-zero error code.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new drop reason subsystem for openvswitch and add the first
drop reason to represent last-action drops.
Last-action drops happen when a flow has an empty action list or there
is no action that consumes the packet (output, userspace, recirc, etc).
It is the most common way in which OVS drops packets.
Implementation-wise, most of these skb-consuming actions already call
"consume_skb" internally and return directly from within the
do_execute_actions() loop so with minimal changes we can assume that
any skb that exits the loop normally is a packet drop.
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__mptcp_init_sock() always returns 0 because mptcp_init_sock() used
to return the value directly.
But after commit 18b683bff8 ("mptcp: queue data for mptcp level
retransmission"), __mptcp_init_sock() need not return value anymore.
Let's remove the unnecessary test for __mptcp_init_sock() and make
it return void.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Such field is now unused just as a flag to control the first subflow
deletion at close() time. Introduce a new bit flag for that and finally
drop the mentioned field.
As an intended side effect, now the first subflow sock is not freed
before close() even for passive sockets. The msk has no open/active
subflows if the first one is closed and the subflow list is singular,
update accordingly the state check in mptcp_stream_accept().
Among other benefits, the subflow removal, reduces the amount of memory
used on the client side for each mptcp connection, allows passive sockets
to go through successful accept()/disconnect()/connect() and makes return
error code consistent for failing both passive and active sockets.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch the __mptcp_nmpc_socket helper is used
only to ensure that the MPTCP socket is a suitable status - that
is, the mptcp capable handshake is not started yet.
Change the return value to the relevant subflow sock, to finally
remove the last references to first subflow socket in the MPTCP stack.
As a bonus, we can get rid of a few local variables in different
functions.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is one of the few remaining spots actually manipulating the
first subflow socket. We can leverage the recently introduced
inet helpers to get rid of ssock there.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp sockopt infrastructure unneedly uses the first subflow
socket struct in a few spots. We are going to remove such field
soon, so use directly the first subflow sock instead.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection at accept() time. Instead access
directly the first subflow sock, and update mptcp_accept() to
operate on it. This allows dropping a duplicated check in
mptcp_accept().
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection at poll() time. Instead access
directly the first subflow sock.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection via at listen() time. Instead call
directly the recently introduced helper on the first subflow sock.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of them
is inet_listen().
Factor out an helper operating directly on the (locked) struct sock,
to allow get rid of the above dependency in the next patch without
duplicating the existing code.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection via at bind() time. Instead call directly
the recently introduced helpers on the first subflow sock.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of
them is bind().
Factor out the helpers operating directly on the struct sock, to
allow get rid of the above dependency in the next patch without
duplicating the existing code.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
accessing it in mptcp_get_port(). Instead, access directly the
first subflow sock.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of them is
__inet_stream_connect().
We are going to remove the first subflow socket soon, so avoid
the additional indirection via at connect time, calling directly
into the sock-level connect() ops.
The sk-level connect never return -EINPROGRESS, cleanup the error
path accordingly. Additionally, the ssk status on error is always
TCP_CLOSE. Avoid unneeded access to the subflow sk state.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP protocol currently clears the msk token both at connect() and
listen() time. That is needed to deal with failing connect() calls that
can create a new token while leaving the sk in TCP_CLOSE,SS_UNCONNECTED
status and thus allowing later connect() and/or listen() calls.
Let's deal with such failures explicitly, cleaning the token in a timely
manner and avoid the confusing early mptcp_token_destroy().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add new VID/PID for Mediatek MT7922
- Add support multiple BIS/BIG
- Add support for Intel Gale Peak
- Add support for Qualcomm WCN3988
- Add support for BT_PKT_STATUS for ISO sockets
- Various fixes for experimental ISO support
- Load FW v2 for RTL8852C
- Add support for NXP AW693 chipset
- Add support for Mediatek MT2925
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmTWigQZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKd8sD/92kBczbO3v+nSNyiYcbVmB
x3Z7x1l2ExxHnPdW8xBmEzHlDErYB/KBKYdJWM8y6Bam5z1lnsX7LflXSy+bhZeX
iOFYl94Gh/9/ooyYOwwYUKC2fLKWT54PLg1TcJzyfp8uUizQNWAg9QD7vjvxe7lN
HXrW6CaA4Oohcq2YXagZV1h6Q/jl3BjcfEe7N0E6YYjeonplsJsv6rYG8Ku5n0Pi
9YhB5IkX5zszTGKBSSWURKvaJjbFd7pr3mYkgLZG2pIMGQcUAFJZ9kL7de9xeBWI
TRfgehZZPB2bUac1LxGLcAfONTmzUmo3/trjL1opdxreVCAX565JlaVSJwd0zuQk
cBrmtU3Q8peFSOgJRb1Ci5junE8tqjEWzFRIgw7/wL1Ys3mrbbVDDGKqPhwhvjdq
grOBf6UGaDpEO797yWWpBl5DLV3klMQDi4v84J0yTdvf4GXF8t8fuZU+zIpknVou
BwdeeF33yzqtk01BjomQcLVOrrGOP7+Salc5g7eEVU1jZnaw0MH9aH+o6R2JYtP8
uIiH4QOUJh7NA543F+/wPdZU+OV1E+Io+b34pTZ1oIyM2UT9Dy57Tex/DDKq2UCe
69WV6aVM+FTt2VSMUS2J0XrXkxbI4f6/ABOLht5hHKxT1m6LhOh8mCSTof+UENrr
G0sVoCodRrSljSMS/VltTA==
=akZ8
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2023-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
bluetooth-next pull request for net-next:
- Add new VID/PID for Mediatek MT7922
- Add support multiple BIS/BIG
- Add support for Intel Gale Peak
- Add support for Qualcomm WCN3988
- Add support for BT_PKT_STATUS for ISO sockets
- Various fixes for experimental ISO support
- Load FW v2 for RTL8852C
- Add support for NXP AW693 chipset
- Add support for Mediatek MT2925
Commit 39de828179 ("RDS: Main header file") declared but never implemented
rds_trans_init() and rds_trans_exit(), remove it.
Commit d37c935905 ("RDS: Move loop-only function to loop.c") removed the
implementation rds_message_inc_free() but not the declaration.
Since commit 55b7ed0b58 ("RDS: Common RDMA transport code")
rds_rdma_conn_connect() is never implemented and used.
rds_tcp_map_seq() is never implemented and used since
commit 70041088e3 ("RDS: Add TCP transport to RDS").
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_diag_put_flags(), netlink_setsockopt(), netlink_getsockopt()
and others use nlk->flags without correct locking.
Use set_bit(), clear_bit(), test_bit(), assign_bit() to remove
data-races.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The debug message in tcp_retransmit_timer() is slightly wrong, because
they could be printed even if we did not receive a new ACK packet from
the remote peer.
Change it to probing zero-window, as it is a expected case now. The
description may be not correct.
Adding the duration since the last ACK we received, and the duration of
the retransmission, which are useful for debugging.
And the message now like this:
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_retransmit_timer(), a window shrunk connection will be regarded
as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
right all the time.
The retransmits will become zero-window probes in tcp_retransmit_timer()
if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
TCP_RTO_MAX sooner or later.
However, the timer can be delayed and be triggered after 122877ms, not
TCP_RTO_MAX, as I tested.
Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO come up to TCP_RTO_MAX, and the socket will die.
Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exact the timestamp of the timeout.
However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp
could already be a long time (minutes or hours) in the past even on the
first RTO. So we double check the timeout with the duration of the
retransmission.
Meanwhile, making "2 * TCP_RTO_MAX" as the timeout to avoid the socket
dying too soon.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fow now, an ACK can update the window in following case, according to
the tcp_may_update_window():
1. the ACK acknowledged new data
2. the ACK has new data
3. the ACK expand the window and the seq of it is valid
Now, we allow the ACK update the window if the window is 0, and the
seq/ack of it is valid. This is for the case that the receiver replies
an zero-window ACK when it is under memory stress and can't queue the new
data.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For now, skb will be dropped when no memory, which makes client keep
retrans util timeout and it's not friendly to the users.
In this patch, we reply an ACK with zero-window in this case to update
the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
connection and will probe the zero-window with the retransmits.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unify the type of tty_operations::write() counters with the 'count'
parameter. I.e. use size_t for them.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Link: https://lore.kernel.org/r/20230810091510.13006-37-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some hooks in struct tty_ldisc_ops still reference buffers by 'unsigned
char'. Unify to 'u8' as the rest of the tty layer does.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230810091510.13006-32-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This makes all those 'char's an explicit 'u8'. This is part of the
continuing unification of chars and flags to be consistent u8.
This approaches tty_port_default_receive_buf().
Note that we do not change signedness as we compile with
-funsigned-char.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Chris Brannon <chris@the-brannons.com>
Cc: Kirk Reiser <kirk@reisers.ca>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-18-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Count passed to tty_ldisc_ops::receive_buf*(), ::lookahead_buf(), and
returned from ::receive_buf2() is expected to be size_t. So set it to
size_t to unify with the rest of the code.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Chris Brannon <chris@the-brannons.com>
Cc: Kirk Reiser <kirk@reisers.ca>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-16-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tty_ldisc_ops::poll() is optional and needs not be provided. It is equal
to returning 0. So remove all those from the code.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230810091510.13006-4-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The CIS/CIG ids of ISO connections are defined only when the connection
is unicast.
Fix the lookup functions to check for unicast first. Ensure CIG/CIS
IDs have valid value also in state BT_OPEN.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When user tries to connect a new CIS when its CIG is not configurable,
that connection shall fail, but pre-existing connections shall not be
affected. However, currently hci_cc_le_set_cig_params deletes all CIS
of the CIG on error so it doesn't work, even though controller shall not
change CIG/CIS configuration if the command fails.
Fix by failing on command error only the connections that are not yet
bound, so that we keep the previous CIS configuration like the
controller does.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Remove unnecessary NULL check which causes coccinelle warning:
net/bluetooth/coredump.c:104:2-7: WARNING: NULL check before some
freeing functions is not needed.
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
KSAN reports use-after-free in hci_add_adv_monitor().
While adding an adv monitor,
hci_add_adv_monitor() calls ->
msft_add_monitor_pattern() calls ->
msft_add_monitor_sync() calls ->
msft_le_monitor_advertisement_cb() calls in an error case ->
hci_free_adv_monitor() which frees the *moniter.
This is referenced by bt_dev_dbg() in hci_add_adv_monitor().
Fix the bt_dev_dbg() by using handle instead of monitor->handle.
Fixes: b747a83690 ("Bluetooth: hci_sync: Refactor add Adv Monitor")
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Similar to commit c5d2b6fa26 ("Bluetooth: Fix use-after-free in
hci_remove_ltk/hci_remove_irk"). We can not access k after kfree_rcu()
call.
Fixes: d7d41682ef ("Bluetooth: Fix Suspicious RCU usage warnings")
Signed-off-by: Min Li <lm0963hack@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When running with concurrent task only one CIS was being assigned so
this attempts to rework the way the PDU is constructed so it is handled
later at the callback instead of in place.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This moves hci_is_le_conn_scanning to hci_core.h so it can be used by
different files without having to duplicate its code.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Only the number of CIS shall be limited to 0x1f, the CIS ID in the
other hand is up to 0xef.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This introduces hci_conn_set_handle which takes care of verifying the
conditions where the hci_conn handle can be modified, including when
hci_conn_abort has been called and also checks that the handles is
valid as well.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Valid range of CIG/CIS are 0x00 to 0xEF, so this checks they are
properly checked before attempting to use HCI_OP_LE_SET_CIG_PARAMS.
Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Connections may be cleanup while waiting for the commands to complete so
this attempts to check if the connection handle remains valid in case of
errors that would lead to call hci_conn_failed:
BUG: KASAN: slab-use-after-free in hci_conn_failed+0x1f/0x160
Read of size 8 at addr ffff888001376958 by task kworker/u3:0/52
CPU: 0 PID: 52 Comm: kworker/u3:0 Not tainted
6.5.0-rc1-00527-g2dfe76d58d3a #5615
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.2-1.fc38 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
<TASK>
dump_stack_lvl+0x1d/0x70
print_report+0xce/0x620
? __virt_addr_valid+0xd4/0x150
? hci_conn_failed+0x1f/0x160
kasan_report+0xd1/0x100
? hci_conn_failed+0x1f/0x160
hci_conn_failed+0x1f/0x160
hci_abort_conn_sync+0x237/0x360
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When sending HCI_OP_CREATE_CONN_CANCEL it shall Wait for
HCI_EV_CONN_COMPLETE, not HCI_EV_CMD_STATUS, when the reason is
anything but HCI_ERROR_REMOTE_POWER_OFF. This reason is used when
suspending or powering off, where we don't want to wait for the peer's
response.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Dropped CIS that are in state BT_OPEN/BT_BOUND, and in state BT_CONNECT
with HCI_CONN_CREATE_CIS unset, should be cleaned up immediately.
Closing CIS ISO sockets should result to the hci_conn be deleted, so
that potentially pending CIG removal can run.
hci_abort_conn cannot refer to them by handle, since their handle is
still unset if Set CIG Parameters has not yet completed.
This fixes CIS not being terminated if the socket is shut down
immediately after connection, so that the hci_abort_conn runs before Set
CIG Parameters completes. See new BlueZ test "ISO Connect Close - Success"
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Calling hci_conn_del in __iso_sock_close is invalid. It needs
hdev->lock, but it cannot be acquired there due to lock ordering.
Fix this by doing cleanup via hci_conn_drop.
Return hci_conn with refcount 1 from hci_bind_cis and hci_connect_cis,
so that the iso_conn always holds one reference. This also fixes
refcounting when error handling.
Since hci_conn_abort shall handle termination of connections in any
state properly, we can handle BT_CONNECT socket state in the same way as
BT_CONNECTED.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This is introduced in commit 903e454110 but was never implemented.
Fixes: 903e454110 ("Bluetooth: AMP: Use HCI cmd to Read Loc AMP Assoc")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This adds support for BT_PKT_STATUS socketopt by setting
BT_SK_PKT_STATUS. Then upon receiving an ISO packet the code would
attempt to store the Packet_Status_Flag to hci_skb_pkt_status which
is then forward to userspace in the form of BT_SCM_PKT_STATUS whenever
BT_PKT_STATUS has been enabled/set.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This makes the handling of BT_PKT_STATUS more generic so it can be
reused by sockets other than SCO like BT_DEFER_SETUP, etc.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
If hci_unregister_dev() frees the hci_dev object but hci_suspend_notifier
may still be accessing it, it can cause the program to crash.
Here's the call trace:
<4>[102152.653246] Call Trace:
<4>[102152.653254] hci_suspend_sync+0x109/0x301 [bluetooth]
<4>[102152.653259] hci_suspend_dev+0x78/0xcd [bluetooth]
<4>[102152.653263] hci_suspend_notifier+0x42/0x7a [bluetooth]
<4>[102152.653268] notifier_call_chain+0x43/0x6b
<4>[102152.653271] __blocking_notifier_call_chain+0x48/0x69
<4>[102152.653273] __pm_notifier_call_chain+0x22/0x39
<4>[102152.653276] pm_suspend+0x287/0x57c
<4>[102152.653278] state_store+0xae/0xe5
<4>[102152.653281] kernfs_fop_write+0x109/0x173
<4>[102152.653284] __vfs_write+0x16f/0x1a2
<4>[102152.653287] ? selinux_file_permission+0xca/0x16f
<4>[102152.653289] ? security_file_permission+0x36/0x109
<4>[102152.653291] vfs_write+0x114/0x21d
<4>[102152.653293] __x64_sys_write+0x7b/0xdb
<4>[102152.653296] do_syscall_64+0x59/0x194
<4>[102152.653299] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
This patch holds the reference count of the hci_dev object while
processing it in hci_suspend_notifier to avoid potential crash
caused by the race condition.
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
HCI_MAX_AD_LENGTH shall only be used if the controller doesn't support
extended advertising, otherwise HCI_MAX_EXT_AD_LENGTH shall be used
instead.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The hci_add_adv_monitor() hci_remove_adv_monitor() functions call
bt_dev_dbg() to print some debug statements. The bt_dev_dbg() macro
automatically adds in the device's name. That means that we shouldn't
include the name in the bt_dev_dbg() calls.
Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Some use cases require the user to be informed if BIG synchronization
fails. This commit makes it so that even if the BIG sync established
event arrives with error status, a new hconn is added for each BIS,
and the iso layer is notified about the failed connections.
Unsuccesful bis connections will be marked using the
HCI_CONN_BIG_SYNC_FAILED flag. From the iso layer, the POLLERR event
is triggered on the newly allocated bis sockets, before adding them
to the accept list of the parent socket.
From user space, a new fd for each failed bis connection will be
obtained by calling accept. The user should check for the POLLERR
event on the new socket, to determine if the connection was successful
or not.
The HCI_CONN_BIG_SYNC flag has been added to mark whether the BIG sync
has been successfully established. This flag is checked at bis cleanup,
so the HCI LE BIG Terminate Sync command is only issued if needed.
The BT_SK_BIG_SYNC flag indicates if BIG create sync has been called
for a listening socket, to avoid issuing the command everytime a BIGInfo
advertising report is received.
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This attempts to always allocate a unique handle for connections so they
can be properly aborted by the likes of hci_abort_conn, so this uses the
invalid range as a pool of unset handles that way if userspace is trying
to create multiple connections at once each will be given a unique
handle which will be considered unset.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
ISO_LINK connections where not being handled properly on
hci_abort_conn_sync which sometimes resulted in sending the wrong
commands, or in case of having the reject command being sent by the
socket code (iso.c) which is sort of a layer violation.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This consolidates code for aborting connections using
hci_cmd_sync_queue so it is synchronized with other threads, but
because of the fact that some commands may block the cmd_sync_queue
while waiting specific events this attempt to cancel those requests by
using hci_cmd_sync_cancel.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In the case of a Synchronized Receiver capable device, enable at start-up the
events for PA reports, PA Sync Established and Big Info Adv reports.
Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This adds support for creating multiple BIGs. According to
spec, each BIG shall have an unique handle, and each BIG should
be associated with a different advertising handle. Otherwise,
the LE Create BIG command will fail, with error code
Command Disallowed (for reusing a BIG handle), or
Unknown Advertising Identifier (for reusing an advertising
handle).
The btmon snippet below shows an exercise for creating two BIGs
for the same controller, by opening two isotest instances with
the following command:
tools/isotest -i hci0 -s 00:00:00:00:00:00
< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068) plen 31
Handle: 0x00
Advertising Handle: 0x01
Number of BIS: 1
SDU Interval: 10000 us (0x002710)
Maximum SDU size: 40
Maximum Latency: 10 ms (0x000a)
RTN: 0x02
PHY: LE 2M (0x02)
Packing: Sequential (0x00)
Framing: Unframed (0x00)
Encryption: 0x00
Broadcast Code: 00000000000000000000000000000000
> HCI Event: Command Status (0x0f) plen 4
LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 21
LE Broadcast Isochronous Group Complete (0x1b)
Status: Success (0x00)
Handle: 0x00
BIG Synchronization Delay: 912 us (0x000390)
Transport Latency: 912 us (0x000390)
PHY: LE 2M (0x02)
NSE: 3
BN: 1
PTO: 1
IRC: 3
Maximum PDU: 40
ISO Interval: 10.00 msec (0x0008)
Connection Handle #0: 10
< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068)
Handle: 0x01
Advertising Handle: 0x02
Number of BIS: 1
SDU Interval: 10000 us (0x002710)
Maximum SDU size: 40
Maximum Latency: 10 ms (0x000a)
RTN: 0x02
PHY: LE 2M (0x02)
Packing: Sequential (0x00)
Framing: Unframed (0x00)
Encryption: 0x00
Broadcast Code: 00000000000000000000000000000000
> HCI Event: Command Status (0x0f) plen 4
LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
Status: Success (0x00)
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This stores scm_creds into hci_skb_cb so they can be properly forwarded
to the likes of btmon which is then able to print information about the
process who is originating the traffic:
bluetoothd[35]: @ MGMT Command: Rea.. (0x0001) plen 0 {0x0001}
@ MGMT Event: Command Complete (0x0001) plen 6 {0x0001}
Read Management Version Information (0x0001) plen 3
bluetoothd[35]: < ACL Data T.. flags 0x00 dlen 41
ATT: Write Command (0x52) len 36
Handle: 0x0043 Type: ASE Control Point (0x2bc6)
Data: 020203000110270000022800020a00409c0001000110270000022800020a00409c00
Opcode: QoS Configuration (0x02)
Number of ASE(s): 2
ASE: #0
ASE ID: 0x03
CIG ID: 0x00
CIS ID: 0x01
SDU Interval: 10000 usec
Framing: Unframed (0x00)
PHY: 0x02
LE 2M PHY (0x02)
Max SDU: 40
RTN: 2
Max Transport Latency: 10
Presentation Delay: 40000 us
ASE: #1
ASE ID: 0x01
CIG ID: 0x00
CIS ID: 0x01
SDU Interval: 10000 usec
Framing: Unframed (0x00)
PHY: 0x02
LE 2M PHY (0x02)
Max SDU: 40
RTN: 2
Max Transport Latency: 10
Presentation Delay: 40000 us
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This makes sure peer information is always available via sock when using
bt_sock_alloc.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This consolidates code around sk_alloc into bt_sock_alloc which does
take care of common initialization.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
LE Create CIS command shall not be sent before all CIS Established
events from its previous invocation have been processed. Currently it is
sent via hci_sync but that only waits for the first event, but there can
be multiple.
Make it wait for all events, and simplify the CIS creation as follows:
Add new flag HCI_CONN_CREATE_CIS, which is set if Create CIS has been
sent for the connection but it is not yet completed.
Make BT_CONNECT state to mean the connection wants Create CIS.
On events after which new Create CIS may need to be sent, send it if
possible and some connections need it. These events are:
hci_connect_cis, iso_connect_cfm, hci_cs_le_create_cis,
hci_le_cis_estabilished_evt.
The Create CIS status/completion events shall queue new Create CIS only
if at least one of the connections transitions away from BT_CONNECT, so
that we don't loop if controller is sending bogus events.
This fixes sending multiple CIS Create for the same CIS in the
"ISO AC 6(i) - Success" BlueZ test case:
< HCI Command: LE Create Co.. (0x08|0x0064) plen 9 #129 [hci0]
Number of CIS: 2
CIS Handle: 257
ACL Handle: 42
CIS Handle: 258
ACL Handle: 42
> HCI Event: Command Status (0x0f) plen 4 #130 [hci0]
LE Create Connected Isochronous Stream (0x08|0x0064) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 29 #131 [hci0]
LE Connected Isochronous Stream Established (0x19)
Status: Success (0x00)
Connection Handle: 257
...
< HCI Command: LE Setup Is.. (0x08|0x006e) plen 13 #132 [hci0]
...
> HCI Event: Command Complete (0x0e) plen 6 #133 [hci0]
LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
...
< HCI Command: LE Create Co.. (0x08|0x0064) plen 5 #134 [hci0]
Number of CIS: 1
CIS Handle: 258
ACL Handle: 42
> HCI Event: Command Status (0x0f) plen 4 #135 [hci0]
LE Create Connected Isochronous Stream (0x08|0x0064) ncmd 1
Status: ACL Connection Already Exists (0x0b)
> HCI Event: LE Meta Event (0x3e) plen 29 #136 [hci0]
LE Connected Isochronous Stream Established (0x19)
Status: Success (0x00)
Connection Handle: 258
...
Fixes: c09b80be6f ("Bluetooth: hci_conn: Fix not waiting for HCI_EVT_LE_CIS_ESTABLISHED")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
It is required for some configurations to have multiple BISes as part
of the same BIG.
Similar to the flow implemented for unicast, DEFER_SETUP will also be
used to bind multiple BISes for the same BIG, before starting Periodic
Advertising and creating the BIG.
The user will have to open a new socket for each BIS. By setting the
BT_DEFER_SETUP socket option and calling connect, a new connection
will be added for the BIG and advertising handle set by the socket
QoS parameters. Since all BISes will be bound for the same BIG and
advertising handle, the socket QoS options and base parameters should
match for all connections.
By calling connect on a socket that does not have the BT_DEFER_SETUP
option set, periodic advertising will be started and the BIG will
be created, with a BIS for each previously bound connection. Since
a BIG cannot be reconfigured with additional BISes after creation,
no more connections can be bound for the BIG after the start periodic
advertising and create BIG commands have been queued.
The bis_cleanup function has also been updated, so that the advertising
set and the BIG will not be terminated unless there are no more
bound or connected BISes.
The HCI_CONN_BIG_CREATED connection flag has been added to indicate
that the BIG has been successfully created. This flag is checked at
bis_cleanup, so that the BIG is only terminated if the
HCI_LE_Create_BIG_Complete has been received.
This implementation has been tested on hardware, using the "isotest"
tool with an additional command line option, to specify the number of
BISes to create as part of the desired BIG:
tools/isotest -i hci0 -s 00:00:00:00:00:00 -N 2 -G 1 -T 1
The btmon log shows that a BIG containing 2 BISes has been created:
< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068) plen 31
Handle: 0x01
Advertising Handle: 0x01
Number of BIS: 2
SDU Interval: 10000 us (0x002710)
Maximum SDU size: 40
Maximum Latency: 10 ms (0x000a)
RTN: 0x02
PHY: LE 2M (0x02)
Packing: Sequential (0x00)
Framing: Unframed (0x00)
Encryption: 0x00
Broadcast Code: 00000000000000000000000000000000
> HCI Event: Command Status (0x0f) plen 4
LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 23
LE Broadcast Isochronous Group Complete (0x1b)
Status: Success (0x00)
Handle: 0x01
BIG Synchronization Delay: 1974 us (0x0007b6)
Transport Latency: 1974 us (0x0007b6)
PHY: LE 2M (0x02)
NSE: 3
BN: 1
PTO: 1
IRC: 3
Maximum PDU: 40
ISO Interval: 10.00 msec (0x0008)
Connection Handle #0: 10
Connection Handle #1: 11
< HCI Command: LE Setup Isochronous Data Path (0x08|0x006e) plen 13
Handle: 10
Data Path Direction: Input (Host to Controller) (0x00)
Data Path: HCI (0x00)
Coding Format: Transparent (0x03)
Company Codec ID: Ericsson Technology Licensing (0)
Vendor Codec ID: 0
Controller Delay: 0 us (0x000000)
Codec Configuration Length: 0
Codec Configuration:
> HCI Event: Command Complete (0x0e) plen 6
LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
Status: Success (0x00)
Handle: 10
< HCI Command: LE Setup Isochronous Data Path (0x08|0x006e) plen 13
Handle: 11
Data Path Direction: Input (Host to Controller) (0x00)
Data Path: HCI (0x00)
Coding Format: Transparent (0x03)
Company Codec ID: Ericsson Technology Licensing (0)
Vendor Codec ID: 0
Controller Delay: 0 us (0x000000)
Codec Configuration Length: 0
Codec Configuration:
> HCI Event: Command Complete (0x0e) plen 6
LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
Status: Success (0x00)
Handle: 11
< ISO Data TX: Handle 10 flags 0x02 dlen 44
< ISO Data TX: Handle 11 flags 0x02 dlen 44
> HCI Event: Number of Completed Packets (0x13) plen 5
Num handles: 1
Handle: 10
Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5
Num handles: 1
Handle: 11
Count: 1
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This patch checks for ISO_BROADCASTER and ISO_SYNC_RECEIVER in
controller.
Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Still trending up in size but the good news is that the "current"
regressions are resolved, AFAIK.
We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
handling on embedded this week (most not merged yet), not sure why.
But those are all for older bugs.
Current release - regressions:
- tls: set MSG_SPLICE_PAGES consistently when handing encrypted
data over to TCP
Current release - new code bugs:
- eth: mlx5: correct IDs on VFs internal to the device (IPU)
Previous releases - regressions:
- phy: at803x: fix WoL support / reporting on AR8032
- bonding: fix incorrect deletion of ETH_P_8021AD protocol VID
from slaves, leading to BUG_ON()
- tun: prevent tun_build_skb() from exceeding the packet size limit
- wifi: rtw89: fix 8852AE disconnection caused by RX full flags
- eth/PCI: enetc: fix probing after 6fffbc7ae1 ("PCI: Honor
firmware's device disabled status"), keep PCI devices around
even if they are disabled / not going to be probed to be
able to apply quirks on them
- eth: prestera: fix handling IPv4 routes with nexthop IDs
Previous releases - always broken:
- netfilter: re-work garbage collection to avoid races between
user-facing API and timeouts
- tunnels: fix generating ipv4 PMTU error on non-linear skbs
- nexthop: fix infinite nexthop bucket dump when using maximum
nexthop ID
- wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
Misc:
- unix: use consistent error code in SO_PEERPIDFD
- ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO,
in prep for upcoming IETF RFC
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTVMSsACgkQMUZtbf5S
Irul3g//RlSANV/MWkiDmHIS5IhqkVWbvjGhFXFfdqZPH4gfgcX9VrsMuxgNM1Xu
YXGx+rIu408qNNkVG2hpFMxPerRiqVB/XsH1TxRr0Mi6AMFoKGXS+cGwzSOaoMQj
FYlcC6j2SnQ9N4I0qQuKOSOffyvyxrx/l9ozpVXsbGsOic1k6j1Ipwtf3+WP7dEe
kkAPUlsQPdCIhMyQdK3X4xI1PGLtAXFgY3VV9bZ7u99l7QBwmconkl3GHq/xnPa8
Uyll005ThyYce0c4EPVcrY1YBXyY0LjOBIRtiTFAk6CMWc0Su8Ug/i4+K2KTq0eh
yjqqHkpR//ruLgtAXBLLE9mxma8448vmmex/cSLIBaMAttlnj9n2LvCqvbzNfTZA
ssnKO4D3HhoQvHqbeOOW6VzVX7XyhomOvQXihfdLUs9u2tKE3nQoU+QCnrnIUXZO
VF5/ubCERRdZDPQ1SSAktimlC0R1qVL7JPMRaQF0aW5xByabbEWwMaNiwkYQOh2o
w2KsbhM/vWyd+5JB412LrNsEgK1BV6WjgwzC+27YQ7QD/JxUZBUghL0ps2jgU2Lu
d4YdbBOgYz+xyUBPByeYzcac0SIeMkB/UEcaO54ySWU8GcWYLt4KXwydUq/cXlw0
rUDCO5bikMxmygLKtnTSwmwvGbGByEXbGvVKwUwNPqTnR+vPIbM=
=NZgp
-----END PGP SIGNATURE-----
Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bpf.
Still trending up in size but the good news is that the "current"
regressions are resolved, AFAIK.
We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
handling on embedded this week (most not merged yet), not sure why.
But those are all for older bugs.
Current release - regressions:
- tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
over to TCP
Current release - new code bugs:
- eth: mlx5: correct IDs on VFs internal to the device (IPU)
Previous releases - regressions:
- phy: at803x: fix WoL support / reporting on AR8032
- bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
slaves, leading to BUG_ON()
- tun: prevent tun_build_skb() from exceeding the packet size limit
- wifi: rtw89: fix 8852AE disconnection caused by RX full flags
- eth/PCI: enetc: fix probing after 6fffbc7ae1 ("PCI: Honor
firmware's device disabled status"), keep PCI devices around even
if they are disabled / not going to be probed to be able to apply
quirks on them
- eth: prestera: fix handling IPv4 routes with nexthop IDs
Previous releases - always broken:
- netfilter: re-work garbage collection to avoid races between
user-facing API and timeouts
- tunnels: fix generating ipv4 PMTU error on non-linear skbs
- nexthop: fix infinite nexthop bucket dump when using maximum
nexthop ID
- wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
Misc:
- unix: use consistent error code in SO_PEERPIDFD
- ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
upcoming IETF RFC"
* tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
net: hns3: fix strscpy causing content truncation issue
net: tls: set MSG_SPLICE_PAGES consistently
ibmvnic: Ensure login failure recovery is safe from other resets
ibmvnic: Do partial reset on login failure
ibmvnic: Handle DMA unmapping of login buffs in release functions
ibmvnic: Unmap DMA login rsp buffer on send login fail
ibmvnic: Enforce stronger sanity checks on login response
net: mana: Fix MANA VF unload when hardware is unresponsive
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
ipv6: adjust ndisc_is_useropt() to also return true for PIO
selftests: forwarding: bridge_mdb: Make test more robust
selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
...
We used to change the flags for the last segment, because
non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
That flag is no longer a thing so remove the setting.
Since flags most likely don't have MSG_SPLICE_PAGES set
this avoids passing parts of the sg as splice and parts
as non-splice. Before commit under Fixes we'd have called
tcp_sendpage() which would add the MSG_SPLICE_PAGES.
Why this leads to trouble remains unclear but Tariq
reports hitting the WARN_ON(!sendpage_ok()) due to
page refcount of 0.
Fixes: e117dcfd64 ("tls: Inline do_tcp_sendpages()")
Reported-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/
Tested-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmTUhbUACgkQ1V2XiooU
IORz1Q//a2fDuMsK5iW1BlF4y0P9aQUSVV//r3DYaoYOspJhsB2yZu4HtL+XQJvY
yncwg+ub24yQh5sUNSJnZztQVTN+NPY9Vl2TkXXMx6Wxs2XenmgzZmDdghUDzhTd
DuOjIGVEJ2M6XpPAOub89sqL+E0K7J0/q0aIcV0K0/xKo7U/z3vgLv4aZx/ZjPCV
daj3gcGpYQ1JJ9pi2se2yh89dzT321U7yYde9ek0TUeKGdCFJkfHkqMurwbcgoJ8
jkx5NOtrp+GLbhd+ME86IUtD+Edm46+bJUxvG0My99CVlak7y5gJh/aPxpAPACuW
NhWWy26kivVRWyttLQk0ScZfbO1CIwvaPzQC+QdlFdNA1eWTMhEk6AG2dVaU9CNB
V9WKWv59CPaDwPCKhXiPLQ9J+Kds7oyHPXGlV2dDOuSmJ9QbOh/HBQGEm/mI93qX
Fr+qqP3A9/juXZ5FdSLT2pJPuVlXdhQdgyHgiunyDPHoL9q7GFn5aQL/BVKE23tc
bgMez0GKzBR0waS9cycFSVls1rQN1XUIdoD6SLaRYq9FkKcCx+YGn3LH44Y1feL/
UnLMFlt9xIG4dPbGcGGy4r7mB53JpglHEqJEftvsNcBEd/r/f+4JP+/fa9FJ70uZ
GpGmv7Wo5DZT5V8LaMeWDWpJl6G7UcxrFOyDTw27l2OOVNaD2Ic=
=KNf7
-----END PGP SIGNATURE-----
Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The existing attempt to resolve races between control plane and GC work
is error prone, as reported by Bien Pham <phamnnb@sea.com>, some places
forgot to call nft_set_elem_mark_busy(), leading to double-deactivation
of elements.
This series contains the following patches:
1) Do not skip expired elements during walk otherwise elements might
never decrement the reference counter on data, leading to memleak.
2) Add a GC transaction API to replace the former attempt to deal with
races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT
on elements and it creates a GC transaction to remove the expired
elements, GC transaction could abort in case of interference with
control plane and retried later (GC async). Set backends such as
rbtree and pipapo also perform GC from control plane (GC sync), in
such case, element deactivation and removal is safe because mutex
is held then collected elements are released via call_rcu().
3) Adapt existing set backends to use the GC transaction API.
4) Update rhash set backend to set on _DEAD bit to report deleted
elements from datapath for GC.
5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT.
* tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
netfilter: nf_tables: don't skip expired elements during walk
====================
Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZNRuIQAKCRBar9k/UBDW
4++9AP9ymOcPOKTKdQwZ6cnq3vkmvN37H6teufTyM8vsCha9NAD+OQE+vg1304RM
aETtG6d5Nb+byIHZGJrdUyT7g9jRzgw=
=qr/C
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-09
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 6 files changed, 102 insertions(+), 8 deletions(-).
The main changes are:
1) A bpf sockmap memleak fix and a fix in accessing the programs of
a sockmap under the incorrect map type from Xu Kuohai.
2) A refcount underflow fix in xsk from Magnus Karlsson.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
====================
Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set on the NFT_SET_ELEM_DEAD_BIT flag on this element, instead of
performing element removal which might race with an ongoing transaction.
Enable gc when dynamic flag is set on since dynset deletion requires
garbage collection after this patch.
Fixes: d0a8d877da ("netfilter: nft_dynset: support for element deletion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the GC transaction API to replace the old and buggy gc API and the
busy mark approach.
No set elements are removed from async garbage collection anymore,
instead the _DEAD bit is set on so the set element is not visible from
lookup path anymore. Async GC enqueues transaction work that might be
aborted and retried later.
rbtree and pipapo set backends does not set on the _DEAD bit from the
sync GC path since this runs in control plane path where mutex is held.
In this case, set elements are deactivated, removed and then released
via RCU callback, sync GC never fails.
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.
The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.
We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get following deadlock:
cpu 1 cpu2
GC work transaction comes in , lock nft mutex
`acquire nft mutex // BLOCKS
transaction asks to remove the set
set destruction calls cancel_work_sync()
cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.
This patch adds a new API that deals with garbage collection in two
steps:
1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
so they are not visible via lookup. Annotate current GC sequence in
the GC transaction. Enqueue GC transaction work as soon as it is
full. If ruleset is updated, then GC transaction is aborted and
retried later.
2) GC work grabs the mutex. If GC sequence has changed then this GC
transaction lost race with control plane, abort it as it contains
stale references to objects and let GC try again later. If the
ruleset is intact, then this GC transaction deactivates and removes
the elements and it uses call_rcu() to destroy elements.
Note that no elements are removed from GC lockless path, the _DEAD bit
is set and pointers are collected. GC catchall does not remove the
elements anymore too. There is a new set->dead flag that is set on to
abort the GC transaction to deal with set->ops->destroy() path which
removes the remaining elements in the set from commit_release, where no
mutex is held.
To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.
Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.
We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
its called with the transaction mutex held, but the aforementioned async
work queue might be blocked on the very mutex that nft_set_destroy()
callchain is sitting on.
This gives us the choice of ABBA deadlock or UaF.
To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.
Set backends are adapted to use the GC transaction API in a follow up
patch entitled:
("netfilter: nf_tables: use gc transaction API in set backends")
This is joint work with Florian Westphal.
Fixes: cfed7e1b1f ("netfilter: nf_tables: add set garbage collection helpers")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
strp_done is only called when psock->progs.stream_parser is not NULL,
but stream_parser was set to NULL by sk_psock_stop_strp(), called
by sk_psock_drop() earlier. So, strp_done can never be called.
Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
Change the condition for calling strp_done from judging whether
stream_parser is set to judging whether this flag is set. This flag is
only set once when strp_init() succeeds, and will never be cleared later.
Fixes: c0d95d3380 ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although
both types have member named "progs", the offset of "progs" member in
these two types is different, so "progs" should be accessed with the
real map type.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Fix a refcount underflow problem reported by syzbot that can happen
when a system is running out of memory. If xp_alloc_tx_descs() fails,
and it can only fail due to not having enough memory, then the error
path is triggered. In this error path, the refcount of the pool is
decremented as it has incremented before. However, the reference to
the pool in the socket was not nulled. This means that when the socket
is closed later, the socket teardown logic will think that there is a
pool attached to the socket and try to decrease the refcount again,
leading to a refcount underflow.
I chose this fix as it involved adding just a single line. Another
option would have been to move xp_get_pool() and the assignment of
xs->pool to after the if-statement and using xs_umem->pool instead of
xs->pool in the whole if-statement resulting in somewhat simpler code,
but this would have led to much more churn in the code base perhaps
making it harder to backport.
Fixes: ba3beec2ec ("xsk: Fix possible crash when multiple sockets are created")
Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This makes a difference for the software scheduling mode, where
dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
but when we're talking about what Qdisc and stats get reported for a
traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
is.
To understand the difference, I've attempted to send 100 packets in
software mode through class 8001:5, and recorded the stats before and
after the change.
Here is before:
$ tc -s class show dev eth0
class taprio 8001:1 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:2 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:3 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:4 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:5 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:6 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:7 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:8 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
and here is after:
class taprio 8001:1 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:2 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:3 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:4 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:5 root
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:6 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:7 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:8 root leaf 800d:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
The most glaring (and expected) difference is that before, all class
stats reported the global stats, whereas now, they really report just
the counters for that traffic class.
Finally, Pedro Tammela points out that there is a tc selftest which
checks specifically which handle do the child Qdiscs corresponding to
each class have. That's changing here - taprio no longer reports
tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0
(the handle of the default pfifo child Qdiscs). Since iproute2 does not
print a child Qdisc handle of 0, adjust the test's expected output.
Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in commit af7b29b1de ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike
mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport
between Qdisc_ops :: init() and Qdisc_ops :: attach().
Delete the comment, which is just stolen from mqprio, but there, the
usage patterns are a lot different, and this is nothing but confusing.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is another stab at commit 1461d212ab ("net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later
reverted in commit af7b29b1de ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").
I believe that the problems that caused the revert were fixed, and thus,
this change is identical to the original patch.
Its purpose is to properly reject attaching a software taprio child
qdisc to a software taprio parent. Because unoffloaded taprio currently
reports itself (the root Qdisc) as the return value from qdisc_leaf(),
then the process of attaching another taprio as child to a Qdisc class
of the root will just result in a Qdisc_ops :: change() call for the
root. Whereas that's not we want. We want Qdisc_ops :: init() to be
called for the taprio child, in order to give the taprio child a chance
to check whether its sch->parent is TC_H_ROOT or not (and reject this
configuration).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Normally, Qdiscs have one reference on them held by their owner and one
held for each TXQ to which they are attached, however this is not the
case with the children of an offloaded taprio. Instead, the taprio qdisc
currently lives in the following fragile equilibrium.
In the software scheduling case, taprio attaches itself (the root Qdisc)
to all TXQs, thus having a refcount of 1 + the number of TX queues. In
this mode, the q->qdiscs[] children are not visible directly to the
Qdisc API. The lifetime of the Qdiscs from this private array lasts
until qdisc_destroy() -> taprio_destroy().
In the fully offloaded case, the root taprio has a refcount of 1, and
all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
are attached to the netdev TXQs directly and thus are visible to the
Qdisc API, however taprio loses a reference to them very early - during
qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
the q->qdiscs[] array to not leak memory, but interestingly, it does not
release a reference on these qdiscs because it doesn't effectively own
them - they are created by taprio but owned by the Qdisc core, and will
be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
the Qdisc is deleted or when the child Qdisc is replaced with something
else.
My interest is to change this equilibrium such that taprio also owns a
reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
Qdisc, including in full offload mode. I want this because I would like
taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
insight into q->qdiscs[] for the software scheduling mode - currently
they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
as the root taprio.
The following set of changes is necessary:
- don't free q->qdiscs[] early in taprio_attach(), free it late in
taprio_destroy() for consistency with software mode. But:
- currently that's not possible, because taprio doesn't own a reference
on q->qdiscs[]. So hold that reference - once during the initial
attach() and once during subsequent graft() calls when the child is
changed.
- always keep track of the current child in q->qdiscs[], even for full
offload mode, so that we free in taprio_destroy() what we should, and
not something stale.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a simple code transformation with no intended behavior change,
just to make it absolutely clear that q->qdiscs[] is only attached to
the child taprio classes in full offload mode.
Right now we use the q->qdiscs[] variable in taprio_attach() for
software mode too, but that is quite confusing and avoidable. We use
it only to reach the netdev TX queue, but we could as well just use
netdev_get_tx_queue() for that.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The upcoming (and nearly finalized):
https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/
will update the IPv6 RA to include a new flag in the PIO field,
which will serve as a hint to perform DHCPv6-PD.
As we don't want DHCPv6 related logic inside the kernel, this piece of
information needs to be exposed to userspace. The simplest option is to
simply expose the entire PIO through the already existing mechanism.
Even without this new flag, the already existing PIO R (router address)
flag (from RFC6275) cannot AFAICT be handled entirely in kernel,
and provides useful information that should be exposed to userspace
(the router's global address, for use by Mobile IPv6).
Also cc'ing stable@ for inclusion in LTS, as while technically this is
not quite a bugfix, and instead more of a feature, it is absolutely
trivial and the alternative is manually cherrypicking into all Android
Common Kernel trees - and I know Greg will ask for it to be sent in via
LTS instead...
Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Cc: stable@vger.kernel.org
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20230807102533.1147559-1-maze@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I'm looking to enable -Wmissing-variable-declarations behind W=1. 0day
bot spotted the following instances:
net/llc/llc_conn.c:44:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_ack_timeout'
[-Wmissing-variable-declarations]
44 | int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ;
| ^
net/llc/llc_conn.c:44:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
44 | int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ;
| ^
net/llc/llc_conn.c:45:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_p_timeout'
[-Wmissing-variable-declarations]
45 | int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ;
| ^
net/llc/llc_conn.c:45:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
45 | int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ;
| ^
net/llc/llc_conn.c:46:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_rej_timeout'
[-Wmissing-variable-declarations]
46 | int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ;
| ^
net/llc/llc_conn.c:46:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
46 | int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ;
| ^
net/llc/llc_conn.c:47:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_busy_timeout'
[-Wmissing-variable-declarations]
47 | int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ;
| ^
net/llc/llc_conn.c:47:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
47 | int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ;
| ^
These symbols are referenced by more than one translation unit, so make
include the correct header for their declarations. Finally, sort the
list of includes to help keep them tidy.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/llvm/202308081000.tTL1ElTr-lkp@intel.com/
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808-llc_static-v1-1-c140c4c297e4@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPV6_ADDRFORM socket option is evil, because it can change sock->ops
while other threads might read it. Same issue for sk->sk_family
being set to AF_INET.
Adding READ_ONCE() over sock->ops reads is needed for sockets
that might be impacted by IPV6_ADDRFORM.
Note that mptcp_is_tcpsk() can also overwrite sock->ops.
Adding annotations for all sk->sk_family reads will require
more patches :/
BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt
write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0:
do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663
__sys_setsockopt+0x1c3/0x230 net/socket.c:2273
__do_sys_setsockopt net/socket.c:2284 [inline]
__se_sys_setsockopt net/socket.c:2281 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2281
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1:
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x349/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmmsg+0x263/0x500 net/socket.c:2643
__do_sys_sendmmsg net/socket.c:2672 [inline]
__se_sys_sendmmsg net/socket.c:2669 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffffffff850e32b8 -> 0xffffffff850da890
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTSNyYNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AOpyD/4pRjBvLgU0O+33x3l/X8z80QW6VMwq6PDUDmNA
t0AFnhIx0v6yky0abLzGGV9q2N9SLdNltTsqb0pem00f1TncKR1BlHNttfU+yjS+
qmkuUTJ8HixxkbBRKB9E7kA3IPM2aj1gxd3sji/QHKMT8XtcE5ufoad8/jcM9px2
FHNHJ8Onwl8ohjV4qNeuPe8XWm47pN/FeaxK7jRrKJaCal0P96sT8AGf/Rvx5VNY
jCysb2+fIKMzHssbLcRr1UDMJJFqtcQx0alnzwxh4sEPsmgYYR7UGmcDku4pSbtB
uJBrjMnLpORpw1l2syuYiiyEy+VRAAIjWAUxb6oTOvDhj1Yj2ki/915b/Hl/jnqa
q8EUm6i+B4CuiE8LCj0WLG2gKO7vRjFnDH/Li/qiFMUHzW/HmnLRituxTVTXwopC
1CxFkekNIklxLr+n21dP6f+NJU9hIs1nw+iy0JhLffTV8u6TosZ3Ve5ZInUQ4Bna
hZUyksy1s42fi0oGHN1Gi3AWPhUIlH69lKXTOher/9PvC+rL5/l8LFVywqtYxT4t
HWJwBtpm58IrsDttgPm6fnJqIgrm5mcW+FJYqP9Td6NjUvyfafhJsdhXJ9Bs1lJV
SfJp4iEUkOx4bfUi0rDFb/8NR2fwrCpbbgYx2xdS+tHQ2BkRlAOxYJC9Q8YP2vxH
Mjgc9w==
=XfrS
-----END PGP SIGNATURE-----
Merge tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
First 4 Patches, from Yue Haibing, remove unused prototypes in
various netfilter headers.
Last patch makes nfnetlink_log to always include a packet timestamp,
up to now it was only included if the skb had assigned previously.
From Maciej Żenczykowski.
* tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nfnetlink_log: always add a timestamp
netfilter: h323: Remove unused function declarations
netfilter: conntrack: Remove unused function declarations
netfilter: helper: Remove unused function declarations
netfilter: gre: Remove unused function declaration nf_ct_gre_keymap_flush()
====================
Link: https://lore.kernel.org/r/20230808124159.19046-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.
The nexthop bucket dump callback always returns a positive number if
nexthop buckets were filled in the provided skb, even if the dump is
complete. This means that a dump will span at least two recvmsg() calls
as long as nexthop buckets are present. In the last recvmsg() call the
dump callback will not fill in any nexthop buckets because the previous
call indicated that the dump should restart from the last dumped nexthop
ID plus one.
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 6.66 nhid 1
id 10 index 1 idle_time 6.66 nhid 1
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++
This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# ip nexthop bucket
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
[...]
Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
id 4294967295 index 0 idle_time 6.61 nhid 1
id 4294967295 index 1 idle_time 6.61 nhid 1
+++ exited with 0 +++
Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
Add a test that fails before the fix:
# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]
And passes after it:
# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]
Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging
to a specific resilient nexthop group. The function returns a positive
return code (the skb length) upon both success and failure.
The above behavior is problematic. When a complete nexthop bucket dump
is requested, the function that walks the different nexthops treats the
non-zero return code as an error. This causes buckets belonging to
different resilient nexthop groups to be dumped using different buffers
even if they can all fit in the same buffer:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 10 index 0 idle_time 10.27 nhid 1
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 20 index 0 idle_time 6.44 nhid 1
[...]
Fix by only returning a non-zero return code when an error occurred and
restarting the dump from the bucket index we failed to fill in. This
allows buckets belonging to different resilient nexthop groups to be
dumped using the same buffer:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 30.21 nhid 1
id 20 index 0 idle_time 26.7 nhid 1
[...]
While this change is more of a performance improvement change than an
actual bug fix, it is a prerequisite for a subsequent patch that does
fix a bug.
Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.
The nexthop dump callback always returns a positive number if nexthops
were filled in the provided skb, even if the dump is complete. This
means that a dump will span at least two recvmsg() calls as long as
nexthops are present. In the last recvmsg() call the dump callback will
not fill in any nexthops because the previous call indicated that the
dump should restart from the last dumped nexthop ID plus one.
# ip nexthop add id 1 blackhole
# strace -e sendto,recvmsg -s 5 ip nexthop
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
id 1 blackhole
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++
This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:
# ip nexthop add id $((2**32-1)) blackhole
# ip nexthop
id 4294967295 blackhole
id 4294967295 blackhole
[...]
Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:
# ip nexthop add id $((2**32-1)) blackhole
# strace -e sendto,recvmsg -s 5 ip nexthop
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56
id 4294967295 blackhole
+++ exited with 0 +++
Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
Add a test that fails before the fix:
# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]
And passes after it:
# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]
Fixes: ab84be7e54 ("net: Initial nexthop code")
Reported-by: Petr Machata <petrm@nvidia.com>
Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If we successfully parsed an interface mode with a legacy switch
driver, populate that mode into phylink's supported interfaces rather
than defaulting to the internal and gmii interfaces.
This hasn't caused an issue so far, because when the interface doesn't
match a supported one, phylink_validate() doesn't clear the supported
mask, but instead returns -EINVAL. phylink_parse_fixedlink() doesn't
check this return value, and merely relies on the supported ethtool
link modes mask being cleared. Therefore, the fixed link settings end
up being allowed despite validation failing.
Before this causes a problem, arrange for DSA to more accurately
populate phylink's supported interfaces mask so validation can
correctly succeed.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1qTKdM-003Cpx-Eh@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5266698661 ("tipc: let broadcast packet reception use new link receive function")
declared but never implemented this.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230807142926.45752-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enable io_uring commands on network sockets. Create two new
SOCKET_URING_OP commands that will operate on sockets.
In order to call ioctl on sockets, use the file_operations->io_uring_cmd
callbacks, and map it to a uring socket function, which handles the
SOCKET_URING_OP accordingly, and calls socket ioctls.
This patches was tested by creating a new test case in liburing.
Link: https://github.com/leitao/liburing/tree/io_uring_cmd
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the
number of MBSSID elements in the nested netlink attribute attrs, which can
lead to an integer overflow if a user of the nl80211 interface specifies
256 or more elements in the corresponding attribute in userspace. The
integer overflow can lead to a heap buffer overflow as num_elems determines
the size of the trailing array in elems, and this array is thereafter
written to for each element in attrs.
Note that this vulnerability only affects devices with the
wiphy->mbssid_max_interfaces member set for the wireless physical device
struct in the device driver, and can only be triggered by a process with
CAP_NET_ADMIN capabilities.
Fix this by checking for a maximum of 255 elements in attrs.
Cc: stable@vger.kernel.org
Fixes: dc1e3cb8da ("nl80211: MBSSID and EMA support in AP mode")
Signed-off-by: Keith Yeo <keithyjy@gmail.com>
Link: https://lore.kernel.org/r/20230731034719.77206-1-keithyjy@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There is an asymmetry between commit/abort and preparation phase if the
following conditions are met:
1. set is a verdict map ("1.2.3.4 : jump foo")
2. timeouts are enabled
In this case, following sequence is problematic:
1. element E in set S refers to chain C
2. userspace requests removal of set S
3. kernel does a set walk to decrement chain->use count for all elements
from preparation phase
4. kernel does another set walk to remove elements from the commit phase
(or another walk to do a chain->use increment for all elements from
abort phase)
If E has already expired in 1), it will be ignored during list walk, so its use count
won't have been changed.
Then, when set is culled, ->destroy callback will zap the element via
nf_tables_set_elem_destroy(), but this function is only safe for
elements that have been deactivated earlier from the preparation phase:
lack of earlier deactivate removes the element but leaks the chain use
count, which results in a WARN splat when the chain gets removed later,
plus a leak of the nft_chain structure.
Update pipapo_get() not to skip expired elements, otherwise flush
command reports bogus ENOENT errors.
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tuning of the effective buffer size through setsockopts was working for
SMC traffic only but not for TCP fall-back connections even before
commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and
make them tunable"). That change made it apparent that TCP fall-back
connections would use net.smc.[rw]mem as buffer size instead of
net.ipv4_tcp_[rw]mem.
Amend the code that copies attributes between the (TCP) clcsock and the
SMC socket and adjust buffer sizes appropriately:
- Copy over sk_userlocks so that both sockets agree on whether tuning
via setsockopt is active.
- When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
- Likewise, use either values from setsockopt or from sysctl for SMC
(duplicated) on successful SMC connect.
In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
is taken care of by the attribute copy.
Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.
Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.
Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.
While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.
Background:
Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as initial value to search for appropriate buffers. If the search
resorted to using a bigger buffer when all buffers of the specified
size were busy, the duplicate of the used effective buffer size is
stored back to sk_rcvbuf/sk_sndbuf.
When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:
[...]
SO_SNDBUF
Sets or gets the maximum socket send buffer in bytes.
The kernel doubles this value (to allow space for book‐
keeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2).
The default value is set by the
/proc/sys/net/core/wmem_default file and the maximum
allowed value is set by the /proc/sys/net/core/wmem_max
file. The minimum (doubled) value for this option is
2048.
[...]
Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Co-developed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
rather than ESRCH if a socket type does not support remote peer-PID
queries.
Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
not an AF_UNIX socket. This is quite unexpected, given that one would
assume ESRCH means the peer process already exited and thus cannot be
found. However, in that case the sockopt actually returns EINVAL (via
pidfd_prepare()). This is rather inconsistent with other syscalls, which
usually return ESRCH if a given PID refers to a non-existant process.
This changes SO_PEERPIDFD to return ENODATA instead. This is also what
SO_PEERGROUPS returns, and thus keeps a consistent behavior across
sockopts.
Note that this code is returned in 2 cases: First, if the socket type is
not AF_UNIX, and secondly if the socket was not yet connected. In both
cases ENODATA seems suitable.
Signed-off-by: David Rheinsberg <david@readahead.eu>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Fixes: 7b26952a91 ("net: core: add getsockopt SO_PEERPIDFD")
Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When flushing the backlog after decoding a record we don't really
know how much data the caller want us to evaluate, so use INT_MAX
and 0 as arguments to tls_read_flush_backlog() to ensure we flush
at 128k of data. Otherwise we might be reading too much data and
trigger a TCP window full.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20230807071022.10091-1-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Get a coccinelle warning as follows:
net/ipv6/exthdrs.c:800:29-30: WARNING opportunity for swap()
Use swap() to replace opencoded implementation.
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230807020947.1991716-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For now, No matter what error pointer ip_neigh_for_gw() returns,
ip_finish_output2() always return -EINVAL, which may mislead the upper
users.
For exemple, an application uses sendto to send an UDP packet, but when the
neighbor table overflows, sendto() will get a value of -EINVAL, and it will
cause users to waste a lot of time checking parameters for errors.
Return the real errno instead of -EINVAL.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Si Hao <si.hao@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20230807015408.248237-1-xu.xin16@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Compared to all the other work we're already doing to deliver
an skb to userspace this is very cheap - at worse an extra
call to ktime_get_real() - and very useful.
(and indeed it may even be cheaper if we're running from other hooks)
(background: Android occasionally logs packets which
caused wake from sleep/suspend and we'd like to have
timestamps reliably associated with these events)
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Syzkaller reported the following issue:
=======================================
Too BIG xdp->frame_sz = 131072
WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
____bpf_xdp_adjust_tail net/core/filter.c:4121 [inline]
WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121
bpf_xdp_adjust_tail+0x466/0xa10 net/core/filter.c:4103
...
Call Trace:
<TASK>
bpf_prog_4add87e5301a4105+0x1a/0x1c
__bpf_prog_run include/linux/filter.h:600 [inline]
bpf_prog_run_xdp include/linux/filter.h:775 [inline]
bpf_prog_run_generic_xdp+0x57e/0x11e0 net/core/dev.c:4721
netif_receive_generic_xdp net/core/dev.c:4807 [inline]
do_xdp_generic+0x35c/0x770 net/core/dev.c:4866
tun_get_user+0x2340/0x3ca0 drivers/net/tun.c:1919
tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2043
call_write_iter include/linux/fs.h:1871 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x650/0xe40 fs/read_write.c:584
ksys_write+0x12f/0x250 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
xdp->frame_sz > PAGE_SIZE check was introduced in commit c8741e2bfe
("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper
Dangaard Brouer <jbrouer@redhat.com> noted that after introducing the
xdp_init_buff() which all XDP driver use - it's safe to remove this
check. The original intend was to catch cases where XDP drivers have
not been updated to use xdp.frame_sz, but that is not longer a concern
(since xdp_init_buff).
Running the initial syzkaller repro it was discovered that the
contiguous physical memory allocation is used for both xdp paths in
tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also
stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can
work on higher order pages, as long as this is contiguous physical
memory (e.g. a page).
Reported-and-tested-by: syzbot+f817490f5bd20541b90a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000774b9205f1d8a80d@google.com/T/
Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a
Link: https://lore.kernel.org/all/20230725155403.796-1-andrew.kanner@gmail.com/T/
Fixes: 43b5169d83 ("net, xdp: Introduce xdp_init_buff utility routine")
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20230803190316.2380231-1-andrew.kanner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 8c48eea3ad ("page_pool: allow caching from safely localized
NAPI") allowed direct recycling of skb pages to their PP for some cases,
but unfortunately missed a couple of other majors.
For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
which unconditionally passes `false` as @napi_safe. Thus, all pages go
through ptr_ring and locks, although most of time we're actually inside
the NAPI polling this PP is linked with, so that it would be perfectly
safe to recycle pages directly.
Let's address such. If @napi_safe is true, we're fine, don't change
anything for this path. But if it's false, check whether we are in the
softirq context. It will most likely be so and then if ->list_owner
is our current CPU, we're good to use direct recycling, even though
@napi_safe is false -- concurrent access is excluded. in_softirq()
protection is needed mostly due to we can hit this place in the
process context (not the hardirq though).
For the mentioned xdp-drop-skb-mode case, the improvement I got is
3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and
alloc_slow counter doesn't change most of time, which means the
MM layer is not even called to allocate any new pages.
Suggested-by: Jakub Kicinski <kuba@kernel.org> # in_softirq()
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Page pool use in hardirq is prohibited, add debug checks
to catch misuses. IIRC we previously discussed using
DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns
that people will have DEBUG_NET enabled in perf testing.
I don't think anyone enables lockdep in perf testing,
so use lockdep to avoid pushback and arguing :)
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, pp->p.napi is always read, but the actual variable it gets
assigned to is read-only when @napi_safe is true. For the !napi_safe
cases, which yet is still a pack, it's an unneeded operation.
Moreover, it can lead to premature or even redundant page_pool
cacheline access. For example, when page_pool_is_last_frag() returns
false (with the recent frag improvements).
Thus, read it only when @napi_safe is true. This also allows moving
@napi inside the condition block itself. Constify it while we are
here, because why not.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, touching <net/page_pool/types.h> triggers a rebuild of more
than half of the kernel. That's because it's included in
<linux/skbuff.h>. And each new include to page_pool/types.h adds more
[useless] data for the toolchain to process per each source file from
that pile.
In commit 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB
recycling"), Matteo included it to be able to call a couple of functions
defined there. Then, in commit 57f05bc2ab ("page_pool: keep pp info as
long as page pool owns the page") one of the calls was removed, so only
one was left. It's the call to page_pool_return_skb_page() in
napi_frag_unref(). The function is external and doesn't have any
dependencies. Having very niche page_pool_types.h included only for that
looks like an overkill.
As %PP_SIGNATURE is not local to page_pool.c (was only in the
early submissions), nothing holds this function there. Teleport
page_pool_return_skb_page() to skbuff.c, just next to the main consumer,
skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't
work with skbs at all and the former name tells nothing. The #if guards
here are only to not compile and have it in the vmlinux when not needed
-- both call sites are already guarded.
Now, touching page_pool_types.h only triggers rebuilding of the drivers
using it and a couple of core networking files.
Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy
Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Split types and pure function declarations from page_pool.h
and add them in page_page/types.h, so that C sources can
include page_pool.h and headers should generally only include
page_pool/types.h as suggested by jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com
[Jakub: change microsoft/mana, fix kdoc paths in Documentation]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 3c4d755915 ("tls: kernel TLS support") declared but never implemented
these functions.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting dev->priv_flags & IFF_SEE_ALL_HWTSTAMP_REQUESTS is only legal
for drivers which were converted to ndo_hwtstamp_get() and
ndo_hwtstamp_set(), and it is only there that we call ndo_hwtstamp_set()
for a request that otherwise goes to phylib (for stuff like packet traps,
which need to be undone if phylib failed, hence the old_cfg logic).
The problem is that we end up calling ndo_hwtstamp_get() when we don't
need to (even if the SIOCSHWTSTAMP wasn't intended for phylib, or if it
was, but the driver didn't set IFF_SEE_ALL_HWTSTAMP_REQUESTS). For those
unnecessary conditions, we share a code path with virtual drivers (vlan,
macvlan, bonding) where ndo_hwtstamp_get() is implemented as
generic_hwtstamp_get_lower(), and may be resolved through
generic_hwtstamp_ioctl_lower() if the lower device is unconverted.
I.e. this situation:
$ ip link add link eno0 name eno0.100 type vlan id 100
$ hwstamp_ctl -i eno0.100 -t 1
We are unprepared to deal with this, because if ndo_hwtstamp_get() is
resolved through a legacy ndo_eth_ioctl(SIOCGHWTSTAMP) lower_dev
implementation, that needs a non-NULL old_cfg.ifr pointer, and we don't
have it.
But we don't even need to deal with it either. In the general case,
drivers may not even implement SIOCGHWTSTAMP handling, only SIOCSHWTSTAMP,
so it makes sense to completely avoid a SIOCGHWTSTAMP call if we can.
The solution is to split the single "if" condition into 3 smaller ones,
thus separating the decision to call ndo_hwtstamp_get() from the
decision to call ndo_hwtstamp_set(). The third "if" condition is
identical to the first one, and both are subsets of the second one.
Thus, the "cfg" argument of kernel_hwtstamp_config_changed() is always
valid.
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iLOspJsvjPj+y8jikg7erXDomWe8sqHMdfL_2LQSFrPAg@mail.gmail.com/
Fixes: fd770e856e ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TLS records end with a 16B tag. For TLS device offload we only
need to make space for this tag in the stream, the device will
generate and replace it with the actual calculated tag.
Long time ago the code would just re-reference the head frag
which mostly worked but was suboptimal because it prevented TCP
from combining the record into a single skb frag. I'm not sure
if it was correct as the first frag may be shorter than the tag.
The commit under fixes tried to replace that with using the page
frag and if the allocation failed rolling back the data, if record
was long enough. It achieves better fragment coalescing but is
also buggy.
We don't roll back the iterator, so unless we're at the end of
send we'll skip the data we designated as tag and start the
next record as if the rollback never happened.
There's also the possibility that the record was constructed
with MSG_MORE and the data came from a different syscall and
we already told the user space that we "got it".
Allocate a single dummy page and use it as fallback.
Found by code inspection, and proven by forcing allocation
failures.
Fixes: e7b159a48b ("net/tls: remove the record tail optimization")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
rskq_defer_accept field can be read/written without
the need of holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->linger2 can be set locklessly as long as readers
use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->keepalive_probes can be set locklessly, readers
are already taking care of this field being potentially
set by other threads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->keepalive_intvl can be set locklessly, readers
are already taking care of this field being potentially
set by other threads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icsk->icsk_user_timeout can be set locklessly,
if all read sides use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icsk->icsk_syn_retries can safely be set without locking the socket.
We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer()
and tcp_write_timeout().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a client roamed back to a node before it got time to destroy the
pending local entry (i.e. within the same originator interval) the old
global one is directly removed from hash table and left as such.
But because this entry had an extra reference taken at lookup (i.e using
batadv_tt_global_hash_find) there is no way its memory will be reclaimed
at any time causing the following memory leak:
unreferenced object 0xffff0000073c8000 (size 18560):
comm "softirq", pid 0, jiffies 4294907738 (age 228.644s)
hex dump (first 32 bytes):
06 31 ac 12 c7 7a 05 00 01 00 00 00 00 00 00 00 .1...z..........
2c ad be 08 00 80 ff ff 6c b6 be 08 00 80 ff ff ,.......l.......
backtrace:
[<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300
[<000000000ff2fdbc>] batadv_tt_global_add+0x700/0xe20
[<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790
[<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110
[<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10
[<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0
[<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4
[<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0
[<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90
[<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74
[<000000000f39a009>] __netif_receive_skb+0x48/0xe0
[<00000000f2cd8888>] process_backlog+0x174/0x344
[<00000000507d6564>] __napi_poll+0x58/0x1f4
[<00000000b64ef9eb>] net_rx_action+0x504/0x590
[<00000000056fa5e4>] _stext+0x1b8/0x418
[<00000000878879d6>] run_ksoftirqd+0x74/0xa4
unreferenced object 0xffff00000bae1a80 (size 56):
comm "softirq", pid 0, jiffies 4294910888 (age 216.092s)
hex dump (first 32 bytes):
00 78 b1 0b 00 00 ff ff 0d 50 00 00 00 00 00 00 .x.......P......
00 00 00 00 00 00 00 00 50 c8 3c 07 00 00 ff ff ........P.<.....
backtrace:
[<00000000ee6e0ffa>] kmem_cache_alloc+0x1b4/0x300
[<00000000d9aaa49e>] batadv_tt_global_add+0x53c/0xe20
[<00000000443897c7>] _batadv_tt_update_changes+0x21c/0x790
[<000000005dd90463>] batadv_tt_update_changes+0x3c/0x110
[<00000000a2d7fc57>] batadv_tt_tvlv_unicast_handler_v1+0xafc/0xe10
[<0000000011793f2a>] batadv_tvlv_containers_process+0x168/0x2b0
[<00000000b7cbe2ef>] batadv_recv_unicast_tvlv+0xec/0x1f4
[<0000000042aef1d8>] batadv_batman_skb_recv+0x25c/0x3a0
[<00000000bbd8b0a2>] __netif_receive_skb_core.isra.0+0x7a8/0xe90
[<000000004033d428>] __netif_receive_skb_one_core+0x64/0x74
[<000000000f39a009>] __netif_receive_skb+0x48/0xe0
[<00000000f2cd8888>] process_backlog+0x174/0x344
[<00000000507d6564>] __napi_poll+0x58/0x1f4
[<00000000b64ef9eb>] net_rx_action+0x504/0x590
[<00000000056fa5e4>] _stext+0x1b8/0x418
[<00000000878879d6>] run_ksoftirqd+0x74/0xa4
Releasing the extra reference from batadv_tt_global_hash_find even at
roam back when batadv_tt_global_free is called fixes this memory leak.
Cc: stable@vger.kernel.org
Fixes: 068ee6e204 ("batman-adv: roaming handling mechanism redesign")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by; Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Since commit 9ea88a1530 ("tcp: md5: check md5 signature without socket
lock"), the MD5 option is checked in tcp_v[46]_rcv().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803224552.69398-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP socket saves the minimum required header length in tcp_header_len
of struct tcp_sock, and later the value is used in __tcp_fast_path_on()
to generate a part of TCP header in tcp_sock(sk)->pred_flags.
In tcp_rcv_established(), if the incoming packet has the same pattern
with pred_flags, we enter the fast path and skip full option parsing.
The MD5 option is parsed in tcp_v[46]_rcv(), so we need not parse it
again later in tcp_rcv_established() unless other options exist. We
add TCPOLEN_MD5SIG_ALIGNED to tcp_header_len in two paths to avoid the
slow path.
For passive open connections with MD5, we add TCPOLEN_MD5SIG_ALIGNED
to tcp_header_len in tcp_create_openreq_child() after 3WHS.
On the other hand, we do it in tcp_connect_init() for active open
connections. However, the value is overwritten while processing
SYN+ACK or crossed SYN in tcp_rcv_synsent_state_process().
These two cases will have the wrong value in pred_flags and never go
into the fast path.
We could update tcp_header_len in tcp_rcv_synsent_state_process(), but
a test with slightly modified netperf which uses MD5 for each flow shows
that the slow path is actually a bit faster than the fast path.
On c5.4xlarge EC2 instance (16 vCPU, 32 GiB mem)
$ for i in {1..10}; do
./super_netperf $(nproc) -H localhost -l 10 -- -m 256 -M 256;
done
Avg of 10
* 36e68eadd3 : 10.376 Gbps
* all fast path : 10.374 Gbps (patch v2, See Link)
* all slow path : 10.394 Gbps
The header prediction is not worth adding complexity for MD5, so let's
disable it for MD5.
Link: https://lore.kernel.org/netdev/20230803042214.38309-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803224552.69398-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket.
Same thing in do_dccp_getsockopt().
Add READ_ONCE()/WRITE_ONCE() annotations,
and change dccp_sendmsg() to check again dccps_mss_cache
after socket is locked.
Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803163021.2958262-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Despite commit 0ad529d9fd ("mptcp: fix possible divide by zero in
recvmsg()"), the mptcp protocol is still prone to a race between
disconnect() (or shutdown) and accept.
The root cause is that the mentioned commit checks the msk-level
flag, but mptcp_stream_accept() does acquire the msk-level lock,
as it can rely directly on the first subflow lock.
As reported by Christoph than can lead to a race where an msk
socket is accepted after that mptcp_subflow_queue_clean() releases
the listener socket lock and just before it takes destructive
actions leading to the following splat:
BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 5a4ca067 P4D 5a4ca067 PUD 37d4c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 10955 Comm: syz-executor.5 Not tainted 6.5.0-rc1-gdc7b257ee5dd #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:mptcp_stream_accept+0x1ee/0x2f0 include/net/inet_sock.h:330
Code: 0a 09 00 48 8b 1b 4c 39 e3 74 07 e8 bc 7c 7f fe eb a1 e8 b5 7c 7f fe 4c 8b 6c 24 08 eb 05 e8 a9 7c 7f fe 49 8b 85 d8 09 00 00 <0f> b6 40 12 88 44 24 07 0f b6 6c 24 07 bf 07 00 00 00 89 ee e8 89
RSP: 0018:ffffc90000d07dc0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888037e8d020 RCX: ffff88803b093300
RDX: 0000000000000000 RSI: ffffffff833822c5 RDI: ffffffff8333896a
RBP: 0000607f82031520 R08: ffff88803b093300 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000003e83 R12: ffff888037e8d020
R13: ffff888037e8c680 R14: ffff888009af7900 R15: ffff888009af6880
FS: 00007fc26d708640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000012 CR3: 0000000066bc5001 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
do_accept+0x1ae/0x260 net/socket.c:1872
__sys_accept4+0x9b/0x110 net/socket.c:1913
__do_sys_accept4 net/socket.c:1954 [inline]
__se_sys_accept4 net/socket.c:1951 [inline]
__x64_sys_accept4+0x20/0x30 net/socket.c:1951
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Address the issue by temporary removing the pending request socket
from the accept queue, so that racing accept() can't touch them.
After depleting the msk - the ssk still exists, as plain TCP sockets,
re-insert them into the accept queue, so that later inet_csk_listen_stop()
will complete the tcp socket disposal.
Fixes: 2a6a870e44 ("mptcp: stops worker on unaccepted sockets at listener close")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/423
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-4-6671b1ab11cc@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the blamed commit, the MPTCP protocol unconditionally sends
TCP resets on all the subflows on disconnect().
That fits full-blown MPTCP sockets - to implement the fastclose
mechanism - but causes unexpected corruption of the data stream,
caught as sporadic self-tests failures.
Fixes: d21f834855 ("mptcp: use fastclose on more edge scenarios")
Cc: stable@vger.kernel.org
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/419
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-3-6671b1ab11cc@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If we try to emit an icmp error in response to a nonliner skb, we get
BUG: KASAN: slab-out-of-bounds in ip_compute_csum+0x134/0x220
Read of size 4 at addr ffff88811c50db00 by task iperf3/1691
CPU: 2 PID: 1691 Comm: iperf3 Not tainted 6.5.0-rc3+ #309
[..]
kasan_report+0x105/0x140
ip_compute_csum+0x134/0x220
iptunnel_pmtud_build_icmp+0x554/0x1020
skb_tunnel_check_pmtu+0x513/0xb80
vxlan_xmit_one+0x139e/0x2ef0
vxlan_xmit+0x1867/0x2760
dev_hard_start_xmit+0x1ee/0x4f0
br_dev_queue_push_xmit+0x4d1/0x660
[..]
ip_compute_csum() cannot deal with nonlinear skbs, so avoid it.
After this change, splat is gone and iperf3 is no longer stuck.
Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230803152653.29535-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 098a697b49 ("tcp_metrics: Use a single hash table
for all network namespaces.") we can avoid calling tcp_net_metrics_init()
for each new netns.
Instead, rename tcp_net_metrics_init() to tcp_metrics_hash_alloc(),
and move it to __init section.
Also move tcpmhash_entries to __initdata section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230803135417.2716879-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Coccicheck reports the error below:
net/mptcp/protocol.c:3330:15-28: ERROR: test of a variable/field address
Since the address of msk->cb_flags is used in __test_and_clear_bit, the
address should not be NULL. The judgment for if (unlikely(msk->cb_flags))
will always be true, we should check the real value of msk->cb_flags here.
Fixes: 65a569b03c ("mptcp: optimize release_cb for the common case")
Signed-off-by: Xiang Yang <xiangyang3@huawei.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230803072438.1847500-1-xiangyang3@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do the switch and use generated split ops for get and info_get commands.
Remove those from small ops array.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Put the newly added generated header to the include list. Remove the
duplicated temporary function prototypes.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Improve the existing devlink spec in order to serve as a source for
generation of valid devlink split ops for the existing commands.
Add the generated sources.
Node that the policies are narrowed down only to the attributes that
are actually parsed. The dont-validate-strict parsing policy makes sure
that other possibly passed garbage attributes from userspace are
ignored during validation.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To be prepared for the follow-up generated split ops addition,
make the functions devlink_nl_pre_doit() and devlink_nl_post_doit()
usable outside of netlink.c. Introduce temporary prototypes which are
going to be removed once the generated header will be included.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce couple of dumpit callbacks for generated split ops. Have them
as a thin wrapper around iteration function and allow to pass dump_one()
function pointer directly without need to store in devlink_cmd structs.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The generated names of the doit netlink callback are missing "cmd" in
their names. Change names to be ready to switch to generated split ops
header.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to avoid name collision with the generated split ops array
which is going to be introduced as a follow-up patch, rename
the existing ops array to devlink_nl_small_ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230803111340.1074067-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
osd_request_timeout option and another fix to reduce the potential for
erroneous blocklisting -- this time in CephFS. All going to stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTNFFUTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5I8B/9a8C5ed0XfTadHcHX5VQsY3b//4rgp
0VYkQbjYnSCwrYRIPsvnL8LeLHzbcPGLpFAQXg7uUlmJ5dpaOz303hKmKt5GdyOR
qvWka3K4zeG177b6yc1srqs0cEsCLpQrn+krnvOl5v87QdFsCP/bsJMOrJ9mlhdM
9GjkjDRn6jvNyOLGbn3kIvwCRF9NH6/nHzjBcTUzvS8fBUye02o9C1H6ZQ7sYjKH
sJnmQCNCFHEqdaVjDZ7mw/doIrAbmTV6sgusuPjiF5bHILzX4oWG4UJmRpHFV//S
JPQgMp2DNjP8tW9aCVLVVVV5t5AKBr84etF59DaFNflk27U3COJWkE0a
=gw7n
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two patches to improve RBD exclusive lock interaction with
osd_request_timeout option and another fix to reduce the potential for
erroneous blocklisting -- this time in CephFS. All going to stable"
* tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client:
libceph: fix potential hang in ceph_osdc_notify()
rbd: prevent busy loop when requesting exclusive lock
ceph: defer stopping mdsc delayed_work
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
=/kVE
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-03
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).
The main changes are:
1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
Daniel Borkmann
2) Support new insns from cpu v4 from Yonghong Song
3) Non-atomically allocate freelist during prefill from YiFei Zhu
4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu
5) Add tracepoint to xdp attaching failure from Leon Hwang
6) struct netdev_rx_queue and xdp.h reshuffling to reduce
rebuild time from Jakub Kicinski
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
net: invert the netdevice.h vs xdp.h dependency
net: move struct netdev_rx_queue out of netdevice.h
eth: add missing xdp.h includes in drivers
selftests/bpf: Add testcase for xdp attaching failure tracepoint
bpf, xdp: Add tracepoint to xdp attaching failure
selftests/bpf: fix static assert compilation issue for test_cls_*.c
bpf: fix bpf_probe_read_kernel prototype mismatch
riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
libbpf: fix typos in Makefile
tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
netfilter: bpf: Only define get_proto_defrag_hook() if necessary
bpf: Fix an array-index-out-of-bounds issue in disasm.c
net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
docs/bpf: Fix malformed documentation
bpf: selftests: Add defrag selftests
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Support not connecting client socket
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
...
====================
Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nothing scary here. Feels like the first wave of regressions
from v6.5 is addressed - one outstanding fix still to come
in TLS for the sendpage rework.
Current release - regressions:
- udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
- dsa: fix older DSA drivers using phylink
Previous releases - regressions:
- gro: fix misuse of CB in udp socket lookup
- mlx5: unregister devlink params in case interface is down
- Revert "wifi: ath11k: Enable threaded NAPI"
Previous releases - always broken:
- sched: cls_u32: fix match key mis-addressing
- sched: bind logic fixes for cls_fw, cls_u32 and cls_route
- add bound checks to a number of places which hand-parse netlink
- bpf: disable preemption in perf_event_output helpers code
- qed: fix scheduling in a tasklet while getting stats
- avoid using APIs which are not hardirq-safe in couple of drivers,
when we may be in a hard IRQ (netconsole)
- wifi: cfg80211: fix return value in scan logic, avoid page
allocator warning
- wifi: mt76: mt7615: do not advertise 5 GHz on first PHY
of MT7615D (DBDC)
Misc:
- drop handful of inactive maintainers, put some new in place
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTMCRwACgkQMUZtbf5S
Irv1tRAArN6rfYrr2ulaTOfMqhWb1Q+kAs00nBCKqC+OdWgT0hqw2QAuqTAVjhje
8HBYlNGyhJ10yp0Q5y4Fp9CsBDHDDNjIp/YGEbr0vC/9mUDOhYD8WV07SmZmzEJu
gmt4LeFPTk07yZy7VxMLY5XKuwce6MWGHArehZE7PSa9+07yY2Ov9X02ntr9hSdH
ih+VdDI12aTVSj208qb0qNb2JkefFHW9dntVxce4/mtYJE9+47KMR2aXDXtCh0C6
ECgx0LQkdEJ5vNSYfypww0SXIG5aj7sE6HMTdJkjKH7ws4xrW8H+P9co77Hb/DTH
TsRBS4SgB20hFNxz3OQwVmAvj+2qfQssL7SeIkRnaEWeTBuVqCwjLdoIzKXJxxq+
cvtUAAM8XUPqec5cPiHPkeAJV6aJhrdUdMjjbCI9uFYU32AWFBQEqvVGP9xdhXHK
QIpTLiy26Vw8PwiJdROuGiZJCXePqQRLDuMX1L43ZO1rwIrZcWGHjCNtsR9nXKgQ
apbbxb2/rq2FBMB+6obKeHzWDy3JraNCsUspmfleqdjQ2mpbRokd4Vw2564FJgaC
5OznPIX6OuoCY5sftLUcRcpH5ncNj01BvyqjWyCIfJdkCqCUL7HSAgxfm5AUnZip
ZIXOzZnZ6uTUQFptXdjey/jNEQ6qpV8RmwY0CMsmJoo88DXI34Y=
=HYkl
-----END PGP SIGNATURE-----
Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and wireless.
Nothing scary here. Feels like the first wave of regressions from v6.5
is addressed - one outstanding fix still to come in TLS for the
sendpage rework.
Current release - regressions:
- udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
- dsa: fix older DSA drivers using phylink
Previous releases - regressions:
- gro: fix misuse of CB in udp socket lookup
- mlx5: unregister devlink params in case interface is down
- Revert "wifi: ath11k: Enable threaded NAPI"
Previous releases - always broken:
- sched: cls_u32: fix match key mis-addressing
- sched: bind logic fixes for cls_fw, cls_u32 and cls_route
- add bound checks to a number of places which hand-parse netlink
- bpf: disable preemption in perf_event_output helpers code
- qed: fix scheduling in a tasklet while getting stats
- avoid using APIs which are not hardirq-safe in couple of drivers,
when we may be in a hard IRQ (netconsole)
- wifi: cfg80211: fix return value in scan logic, avoid page
allocator warning
- wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D
(DBDC)
Misc:
- drop handful of inactive maintainers, put some new in place"
* tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits)
MAINTAINERS: update TUN/TAP maintainers
test/vsock: remove vsock_perf executable on `make clean`
tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
tcp_metrics: annotate data-races around tm->tcpm_net
tcp_metrics: annotate data-races around tm->tcpm_vals[]
tcp_metrics: annotate data-races around tm->tcpm_lock
tcp_metrics: annotate data-races around tm->tcpm_stamp
tcp_metrics: fix addr_same() helper
prestera: fix fallback to previous version on same major version
udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
net/mlx5e: Set proper IPsec source port in L4 selector
net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
net/mlx5: fs_core: Make find_closest_ft more generic
wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
vxlan: Fix nexthop hash size
ip6mr: Fix skb_under_panic in ip6mr_cache_report()
s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
net: tap_open(): set sk_uid from current_fsuid()
net: tun_chr_open(): set sk_uid from current_fsuid()
net: dcb: choose correct policy to parse DCB_ATTR_BCN
...
If the MTU of the soft/mesh interface was already reduced (enough), it is
not necessary to print a warning about a hard interface not having a MTU to
transport ethernet payloads of 1500 bytes.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The header linux/if_ether.h already defines a constant for the minimum MTU.
So simply use it instead of having a magic constant in the code.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Since commit 335fbe0f5d ("batman-adv: tvlv - convert tt query packet to use tvlv unicast packets")
batadv_recv_tt_query() is not used.
And commit 122edaa059 ("batman-adv: tvlv - convert roaming adv packet to use tvlv unicast packets")
left behind batadv_recv_roam_adv().
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This version will contain all the (major or even only minor) changes for
Linux 6.6.
The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW
48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf
61d6hoiXj/sIibgMQT/ihODgeJ4wfQE=
=u7qn
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-03
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 3 files changed, 37 insertions(+), 20 deletions(-).
The main changes are:
1) Disable preemption in perf_event_output helpers code,
from Jiri Olsa
2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing,
from Lin Ma
3) Multiple warning splat fixes in cpumap from Hou Tao
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, cpumap: Handle skb as well when clean up ptr_ring
bpf, cpumap: Make sure kthread is running before map update returns
bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
bpf: Disable preemption in bpf_event_output
bpf: Disable preemption in bpf_perf_event_output
====================
Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We did some house cleaning in MAINTAINERS file so several patches
about that. Few regressions fixed and also fix some recently enabled
memcpy() warnings. Only small commits and nothing special standing
out.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmTLsrcRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZtn6gf/ZsEOZl98ZVbCoFB09t5/M2IgRdWzbv8C
vXyVoacrRaq80rzFQwGZqorEsnEdDXOIJI54VIqnT5avZbIIWIia4mFzBkHwPBef
TXcdL2k1KDd+ktPrw3GK8401iEMnWSHs2a/4ztx3x8CFCB47VhGT9DiaIWh6jg1J
FUvDhUK7BAk0dItgVjioL+0XKJ5vo4VLENiOCAVj4QJgShKIaq72j/WhKiI/W/+Q
8TBBUjydu0nx7MOM0tOcQlI0z6HXOB89RHj4GxOMA/wvEf+7PHhOE67RAgSAMHJM
R9TmeVvdub05Yppv33PUbbvK29McZEI+M+lHMZjLy5AYaXxyYJ+nhw==
=4o1a
-----END PGP SIGNATURE-----
Merge tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.5
We did some house cleaning in MAINTAINERS file so several patches
about that. Few regressions fixed and also fix some recently enabled
memcpy() warnings. Only small commits and nothing special standing
out.
* tag 'wireless-2023-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1()
wifi: ray_cs: Replace 1-element array with flexible array
MAINTAINERS: add Jeff as ath10k, ath11k and ath12k maintainer
MAINTAINERS: wifi: mark mlw8k as orphan
MAINTAINERS: wifi: mark b43 as orphan
MAINTAINERS: wifi: mark zd1211rw as orphan
MAINTAINERS: wifi: mark wl3501 as orphan
MAINTAINERS: wifi: mark rndis_wlan as orphan
MAINTAINERS: wifi: mark ar5523 as orphan
MAINTAINERS: wifi: mark cw1200 as orphan
MAINTAINERS: wifi: atmel: mark as orphan
MAINTAINERS: wifi: rtw88: change Ping as the maintainer
Revert "wifi: ath6k: silence false positive -Wno-dangling-pointer warning on GCC 12"
wifi: cfg80211: Fix return value in scan logic
Revert "wifi: ath11k: Enable threaded NAPI"
MAINTAINERS: Update mwifiex maintainer list
wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC)
====================
Link: https://lore.kernel.org/r/20230803140058.57476C433C9@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever tcpm_new() reclaims an old entry, tcpm_suck_dst()
would overwrite data that could be read from tcp_fastopen_cache_get()
or tcp_metrics_fill_info().
We need to acquire fastopen_seqlock to maintain consistency.
For newly allocated objects, tcpm_new() can switch to kzalloc()
to avoid an extra fastopen_seqlock acquisition.
Fixes: 1fe4c481ba ("net-tcp: Fast Open client - cookie cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tm->tcpm_net can be read or written locklessly.
Instead of changing write_pnet() and read_pnet() and potentially
hurt performance, add the needed READ_ONCE()/WRITE_ONCE()
in tm_net() and tcpm_new().
Fixes: 849e8a0ca8 ("tcp_metrics: Add a field tcpm_net and verify it matches on lookup")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tm->tcpm_vals[] values can be read or written locklessly.
Add needed READ_ONCE()/WRITE_ONCE() to document this,
and force use of tcp_metric_get() and tcp_metric_set()
Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tm->tcpm_lock can be read or written locklessly.
Add needed READ_ONCE()/WRITE_ONCE() to document this.
Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tm->tcpm_stamp can be read or written locklessly.
Add needed READ_ONCE()/WRITE_ONCE() to document this.
Also constify tcpm_check_stamp() dst argument.
Fixes: 51c5d0c4b1 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because v4 and v6 families use separate inetpeer trees (respectively
net->ipv4.peers and net->ipv6.peers), inetpeer_addr_cmp(a, b) assumes
a & b share the same family.
tcp_metrics use a common hash table, where entries can have different
families.
We must therefore make sure to not call inetpeer_addr_cmp()
if the families do not match.
Fixes: d39d14ffa2 ("net: Add helper function to compare inetpeer addresses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230802131500.1478140-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All struct members of the driver-facing APIs are documented twice,
in the code and under Documentation. This is a bit tedious.
I also get the feeling that a lot of developers will read the header
when coding, rather than the doc. Bring the two a little closer
together by using kdoc for structs and functions.
Using kdoc also gives us links (mentioning a function or struct
in the text gets replaced by a link to its doc).
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230802161821.3621985-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig builds size when
xdp.h is touched from 5947 to 662 objects.
Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.
The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since its embedded in struct netdevice.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.
In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.
We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
__ip6_append_data() can has a similar problem to __ip_append_data()[1] when
asked to splice into a partially-built UDP message that has more than the
frag-limit data and up to the MTU limit, but in the ipv6 case, it errors
out with EINVAL. This can be triggered with something like:
pipe(pfd);
sfd = socket(AF_INET6, SOCK_DGRAM, 0);
connect(sfd, ...);
send(sfd, buffer, 8137, MSG_CONFIRM|MSG_MORE);
write(pfd[1], buffer, 8);
splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);
where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).
The problem is that the calculation of the amount to copy in
__ip6_append_data() goes negative in two places, but a check has been put
in to give an error in this case.
This happens because when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), the terms in:
copy = datalen - transhdrlen - fraggap - pagedlen;
then mostly cancel when pagedlen is substituted for, leaving just -fraggap.
Fix this by:
(1) Insert a note about the dodgy calculation of 'copy'.
(2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
equation, so that 'offset' isn't regressed and 'length' isn't
increased, which will mean that length and thus copy should match the
amount left in the iterator.
(3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
we're asked to splice more than is in the iterator. It might be
better to not give the warning or even just give a 'short' write.
(4) If MSG_SPLICE_PAGES, override the copy<0 check.
[!] Note that this should also affect MSG_ZEROCOPY, but that will return
-EINVAL for the range of send sizes that requires the skbuff to be split.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ [1]
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1580952.1690961810@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit d50ccc2d39 ("tipc: add 128-bit node identifier") declared but never
implemented tipc_node_id2hash().
Also commit 5c216e1d28 ("tipc: Allow run-time alteration of default link settings")
never implemented tipc_media_set_priority() and tipc_media_set_window(),
commit cad2929dc4 ("tipc: update a binding service via broadcast") only declared
tipc_named_bcast().
Since commit be07f05639 ("tipc: simplify the finalize work queue")
tipc_sched_net_finalize() is removed and declaration is unused.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230802034659.39840-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tty->disc_data is 'void *', so there is no need to cast from that.
Therefore remove the casts and assign the pointer directly.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: linux-can@vger.kernel.org
Cc: netdev@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Max Staudt <max@enpas.org>
Link: https://lore.kernel.org/r/20230801062237.2687-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__ip_append_data() can get into an infinite loop when asked to splice into
a partially-built UDP message that has more than the frag-limit data and up
to the MTU limit. Something like:
pipe(pfd);
sfd = socket(AF_INET, SOCK_DGRAM, 0);
connect(sfd, ...);
send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE);
write(pfd[1], buffer, 8);
splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);
where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).
The problem is that the calculation of the amount to copy in
__ip_append_data() goes negative in two places, and, in the second place,
this gets subtracted from the length remaining, thereby increasing it.
This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), because the terms in:
copy = datalen - transhdrlen - fraggap - pagedlen;
then mostly cancel when pagedlen is substituted for, leaving just -fraggap.
This causes:
length -= copy + transhdrlen;
to increase the length to more than the amount of data in msg->msg_iter,
which causes skb_splice_from_iter() to be unable to fill the request and it
returns less than 'copied' - which means that length never gets to 0 and we
never exit the loop.
Fix this by:
(1) Insert a note about the dodgy calculation of 'copy'.
(2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
equation, so that 'offset' isn't regressed and 'length' isn't
increased, which will mean that length and thus copy should match the
amount left in the iterator.
(3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
we're asked to splice more than is in the iterator. It might be
better to not give the warning or even just give a 'short' write.
[!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY
avoids the problem by simply assuming that everything asked for got copied,
not just the amount that was in the iterator. This is a potential bug for
the future.
Fixes: 7ac7c98785 ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1420063.1690904933@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is desirable that the new .ndo_hwtstamp_set() API gives more
uniformity, less overhead and future flexibility w.r.t. the PHY
timestamping behavior.
Currently there are some drivers which allow PHY timestamping through
the procedure mentioned in Documentation/networking/timestamping.rst.
They don't do anything locally if phy_has_hwtstamp() is set, except for
lan966x which installs PTP packet traps.
Centralize that behavior in a new dev_set_hwtstamp_phylib() code
function, which calls either phy_mii_ioctl() for the phylib PHY,
or .ndo_hwtstamp_set() of the netdev, based on a single policy
(currently simplistic: phy_has_hwtstamp()).
Any driver converted to .ndo_hwtstamp_set() will automatically opt into
the centralized phylib timestamping policy. Unconverted drivers still
get to choose whether they let the PHY handle timestamping or not.
Netdev drivers with integrated PHY drivers that don't use phylib
presumably don't set dev->phydev, and those will always see
HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping
policy will remain 100% up to them.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
phy_init() and phy_exit() will have to do more stuff under rtnl_lock()
in a future change. Since rtnl_unlock() -> netdev_run_todo() does a lot
of stuff under the hood, it's a pity to lock and unlock the rtnetlink
mutex twice in a row.
Change the calling convention such that the only caller of
ethtool_set_ethtool_phy_ops(), phy_device.c, provides a context where
the rtnl_mutex is already acquired.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-11-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8021q is one of the stackable net devices which pass the hardware
timestamping ops to the real device through ndo_eth_ioctl(). This
prevents converting any device driver to the new hwtimestamping API
without regressions.
Remove that limitation in the vlan driver by using the newly introduced
helpers for timestamping through lower devices, that handle both the new
and the old driver API.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The stackable net devices with hwtstamping support (vlan, macvlan,
bonding) only pass the hwtstamping ops to the lower (real) device.
These drivers are the first that need to be converted to the new
timestamping API, because if they aren't prepared to handle that,
then no real device driver cannot be converted to the new API either.
After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and
bond_eth_ioctl() have in common, here we propose two generic
implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which
can be called by those 3 drivers, with "dev" being their lower device.
These helpers cover both cases, when the lower driver is converted to
the new API or unconverted.
We need some hacks in case of an unconverted driver, namely to stuff
some pointers in struct kernel_hwtstamp_config which shouldn't have
been there (since the new API isn't supposed to need it). These will
be removed when all drivers will have been converted to the new API.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current hardware timestamping API for NICs requires implementing
.ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP.
That API has some boilerplate such as request parameter translation
between user and kernel address spaces, handling possible translation
failures correctly, etc. Since it is the same all across the board, it
would be desirable to handle it through generic code.
Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which
implement that boilerplate and allow drivers to just act upon requests.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
packet_alloc_skb() is currently calling sock_alloc_send_pskb()
forcing order-0 page allocations.
Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max size by 8x.
Also add logic to increase the linear part if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When error happens in dev_xdp_attach(), it should have a way to tell
users the error message like the netlink approach.
To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is
an appropriate way to notify users the error message.
Hence, bpf libraries are able to retrieve the error message by this
tracepoint, and then report the error message to users.
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since commit b1edeb1023 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
this declaration is unused and can be removed.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20230801143453.24452-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch enables offload for TC classifier
flower rules which matches against SPI field.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tc flower rules support to classify ESP/AH
packets matching SPI field.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for dissecting IPSEC field SPI (which is
32bits in size) for ESP and AH packets.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the cluster becomes unavailable, ceph_osdc_notify() may hang even
with osd_request_timeout option set because linger_notify_finish_wait()
waits for MWatchNotify NOTIFY_COMPLETE message with no associated OSD
request in flight -- it's completely asynchronous.
Introduce an additional timeout, derived from the specified notify
timeout. While at it, switch both waits to killable which is more
correct.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
The dcbnl_bcn_setcfg uses erroneous policy to parse tb[DCB_ATTR_BCN],
which is introduced in commit 859ee3c438 ("DCB: Add support for DCB
BCN"). Please see the comment in below code
static int dcbnl_bcn_setcfg(...)
{
...
ret = nla_parse_nested_deprecated(..., dcbnl_pfc_up_nest, .. )
// !!! dcbnl_pfc_up_nest for attributes
// DCB_PFC_UP_ATTR_0 to DCB_PFC_UP_ATTR_ALL in enum dcbnl_pfc_up_attrs
...
for (i = DCB_BCN_ATTR_RP_0; i <= DCB_BCN_ATTR_RP_7; i++) {
// !!! DCB_BCN_ATTR_RP_0 to DCB_BCN_ATTR_RP_7 in enum dcbnl_bcn_attrs
...
value_byte = nla_get_u8(data[i]);
...
}
...
for (i = DCB_BCN_ATTR_BCNA_0; i <= DCB_BCN_ATTR_RI; i++) {
// !!! DCB_BCN_ATTR_BCNA_0 to DCB_BCN_ATTR_RI in enum dcbnl_bcn_attrs
...
value_int = nla_get_u32(data[i]);
...
}
...
}
That is, the nla_parse_nested_deprecated uses dcbnl_pfc_up_nest
attributes to parse nlattr defined in dcbnl_pfc_up_attrs. But the
following access code fetch each nlattr as dcbnl_bcn_attrs attributes.
By looking up the associated nla_policy for dcbnl_bcn_attrs. We can find
the beginning part of these two policies are "same".
static const struct nla_policy dcbnl_pfc_up_nest[...] = {
[DCB_PFC_UP_ATTR_0] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_1] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_2] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_3] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_4] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_5] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_6] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_7] = {.type = NLA_U8},
[DCB_PFC_UP_ATTR_ALL] = {.type = NLA_FLAG},
};
static const struct nla_policy dcbnl_bcn_nest[...] = {
[DCB_BCN_ATTR_RP_0] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_1] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_2] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_3] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_4] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_5] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_6] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_7] = {.type = NLA_U8},
[DCB_BCN_ATTR_RP_ALL] = {.type = NLA_FLAG},
// from here is somewhat different
[DCB_BCN_ATTR_BCNA_0] = {.type = NLA_U32},
...
[DCB_BCN_ATTR_ALL] = {.type = NLA_FLAG},
};
Therefore, the current code is buggy and this
nla_parse_nested_deprecated could overflow the dcbnl_pfc_up_nest and use
the adjacent nla_policy to parse attributes from DCB_BCN_ATTR_BCNA_0.
Hence use the correct policy dcbnl_bcn_nest to parse the nested
tb[DCB_ATTR_BCN] TLV.
Fixes: 859ee3c438 ("DCB: Add support for DCB BCN")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230801013248.87240-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In destruction flow, the assignment of NULL to xso->dev
caused to skip of xfrm_dev_state_free() call, which was
called in xfrm_state_put(to_put) routine.
Instead of open-coded variant of xfrm_dev_state_delete() and
xfrm_dev_state_free(), let's use them directly.
Fixes: f8a70afafc ("xfrm: add TX datapath support for IPsec packet offload mode")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The policy memory was released but not HW driver data. Add
call to xfrm_dev_policy_delete(), so drivers will have a chance
to release their resources.
Fixes: 919e43fad5 ("xfrm: add an interface to offload policy")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Currently checksum is recalculated and dsa tag stripped even if we later
don't find the dev.
To improve code, exit early if we don't find the dev and skip additional
operation on the skb since it will be freed anyway.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20230730074113.21889-1-ansuelsmth@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add extack to warn that delete was rejected because
the class is still in use
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add extack to warn that delete was rejected because
the class is still in use
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add extack to warn that delete was rejected because
the class is still in use
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add extack to warn that delete was rejected because
the class is still in use
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The 'filter_cnt' counter is used to control a Qdisc class lifetime.
Each filter referecing this class by its id will eventually
increment/decrement this counter in their respective
'add/update/delete' routines.
As these operations are always serialized under rtnl lock, we don't
need an atomic type like 'refcount_t'.
It also means that we lose the overflow/underflow checks already
present in refcount_t, which are valuable to hunt down bugs
where the unsigned counter wraps around as it aids automated tools
like syzkaller to scream in such situations.
Wrap the open coded increment/decrement into helper functions and
add overflow checks to the operations.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC
allocation later in sk_psock_init_link on PREEMPT_RT kernels, since
GFP_ATOMIC might sleep on RT (see bpf: Make BPF and PREEMPT_RT co-exist
patchset notes for details).
This causes calling bpf_map_update_elem on BPF_MAP_TYPE_SOCKMAP maps to
BUG (sleeping function called from invalid context) on RT kernels.
preempt_disable was introduced together with lock_sk and rcu_read_lock
in commit 99ba2b5aba ("bpf: sockhash, disallow bpf_tcp_close and update
in parallel"), probably to match disabled migration of BPF programs, and
is no longer necessary.
Remove preempt_disable to fix BUG in sock_map_update_common on RT.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/all/20200224140131.461979697@linutronix.de/
Fixes: 99ba2b5aba ("bpf: sockhash, disallow bpf_tcp_close and update in parallel")
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230728064411.305576-1-tglozar@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
commit f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
introducted these but never implemented.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230729123456.36340-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When route4_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.
This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.
Fix this by no longer copying the tcf_result struct from the old filter.
Fixes: 1109c00547 ("net: sched: RCU cls_route")
Reported-by: valis <sec@valis.email>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-4-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When fw_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.
This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.
Fix this by no longer copying the tcf_result struct from the old filter.
Fixes: e35a8ee599 ("net: sched: fw use RCU")
Reported-by: valis <sec@valis.email>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-3-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When u32_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.
This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.
Fix this by no longer copying the tcf_result struct from the old filter.
Fixes: de5df63228 ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Reported-by: valis <sec@valis.email>
Reported-by: M A Ramdhan <ramdhan@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These are never implemented since introduction in
commit d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230729122036.32988-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit f9aab6f2ce ("net/smc: immediate freeing in smc_lgr_cleanup_early()")
left behind smc_lgr_schedule_free_work_fast() declaration.
And since commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
smc_ib_modify_qp_reset() is not used anymore.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20230729121929.17180-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are already INDIRECT_CALLABLE_DECLARE in the hashtable
headers, no need to declare them again.
Fixes: 0f495f7617 ("net: remove duplicate reuseport_lookup functions")
Suggested-by: Martin Lau <martin.lau@linux.dev>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230731-indir-call-v1-1-4cd0aeaee64f@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
tty->driver_data is 'void *', so there is no need to cast from that.
Therefore remove the casts and assign the pointer directly.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: linux-bluetooth@vger.kernel.org
Link: https://lore.kernel.org/r/20230731080244.2698-3-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit df8fc4e934 ("kbuild: Enable -fstrict-flex-arrays=3") started
applying strict rules to standard string functions.
It does not work well with conventional socket code around each protocol-
specific sockaddr_XXX struct, which is cast from sockaddr_storage and has
a bigger size than fortified functions expect. See these commits:
commit 06d4c8a808 ("af_unix: Fix fortify_panic() in unix_bind_bsd().")
commit ecb4534b6a ("af_unix: Terminate sun_path when bind()ing pathname socket.")
commit a0ade8404c ("af_packet: Fix warning of fortified memcpy() in packet_getname().")
We must cast the protocol-specific address back to sockaddr_storage
to call such functions.
However, in the case of getsockaddr(SO_PEERNAME), the rationale is a bit
unclear as the buffer is defined by char[128] which is the same size as
sockaddr_storage.
Let's use sockaddr_storage explicitly.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller found zero division error [0] in div_s64_rem() called from
get_cycle_time_elapsed(), where sched->cycle_time is the divisor.
We have tests in parse_taprio_schedule() so that cycle_time will never
be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed().
The problem is that the types of divisor are different; cycle_time is
s64, but the argument of div_s64_rem() is s32.
syzkaller fed this input and 0x100000000 is cast to s32 to be 0.
@TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000}
We use s64 for cycle_time to cast it to ktime_t, so let's keep it and
set max for cycle_time.
While at it, we prevent overflow in setup_txtime() and add another
test in parse_taprio_schedule() to check if cycle_time overflows.
Also, we add a new tdc test case for this issue.
[0]:
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline]
RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline]
RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344
Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10
RSP: 0018:ffffc90000acf260 EFLAGS: 00010206
RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000
RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934
R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800
R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0edf84f0e8 CR3: 000000000d73c002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
get_packet_txtime net/sched/sch_taprio.c:508 [inline]
taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577
taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658
dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732
__dev_xmit_skb net/core/dev.c:3821 [inline]
__dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169
dev_queue_xmit include/linux/netdevice.h:3088 [inline]
neigh_resolve_output net/core/neighbour.c:1552 [inline]
neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:544 [inline]
ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135
__ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196
ip6_finish_output net/ipv6/ip6_output.c:207 [inline]
NF_HOOK_COND include/linux/netfilter.h:292 [inline]
ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228
dst_output include/net/dst.h:458 [inline]
NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303
ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508
ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666
addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175
process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597
worker_thread+0x60f/0x1240 kernel/workqueue.c:2748
kthread+0x2fe/0x3f0 kernel/kthread.c:389
ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308
</TASK>
Modules linked in:
Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Co-developed-by: Eric Dumazet <edumazet@google.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As 32bits of dissector->used_keys are exhausted,
increase the size to 64bits.
This is base change for ESP/AH flow dissector patch.
Please find patch and discussions at
https://lore.kernel.org/netdev/ZMDNjD46BvZ5zp5I@corigine.com/T/#t
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Tested-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous commit 4e484b3e96 ("xfrm: rate limit SA mapping change
message to user space") added one additional attribute named
XFRMA_MTIMER_THRESH and described its type at compat_policy
(net/xfrm/xfrm_compat.c).
However, the author forgot to also describe the nla_policy at
xfrma_policy (net/xfrm/xfrm_user.c). Hence, this suppose NLA_U32 (4
bytes) value can be faked as empty (0 bytes) by a malicious user, which
leads to 4 bytes overflow read and heap information leak when parsing
nlattrs.
To exploit this, one malicious user can spray the SLUB objects and then
leverage this 4 bytes OOB read to leak the heap data into
x->mapping_maxage (see xfrm_update_ae_params(...)), and leak it to
userspace via copy_to_user_state_extra(...).
The above bug is assigned CVE-2023-3773. To fix it, this commit just
completes the nla_policy description for XFRMA_MTIMER_THRESH, which
enforces the length check and avoids such OOB read.
Fixes: 4e484b3e96 ("xfrm: rate limit SA mapping change message to user space")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
sk_getsockopt() runs locklessly. This means sk->sk_priority
can be read while other threads are changing its value.
Other reads also happen without socket lock being held.
Add missing annotations where needed.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit I forgot that sk_getsockopt() reads
sk->sk_ll_usec without holding a lock.
Fixes: 0dbffbb533 ("net: annotate data race around sk_ll_usec")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly, thus we need to annotate the read
of sk->sk_peek_off.
While we are at it, add corresponding annotations to sk_set_peek_off()
and unix_set_peek_off().
Fixes: b9bb53f383 ("sock: convert sk_peek_offset functions to WRITE_ONCE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_mark is often read while another thread could change the value.
Fixes: 4a19ec5800 ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_rcvbuf locklessly.
Fixes: ebb3b78db7 ("tcp: annotate sk->sk_rcvbuf lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_sndbuf locklessly.
Fixes: e292f05e0d ("tcp: annotate sk->sk_sndbuf lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs without locks, we must add annotations
to sk->sk_rcvtimeo and sk->sk_sndtimeo.
In the future we might allow fetching these fields before
we lock the socket in TCP fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a prior commit, I forgot to change sk_getsockopt()
when reading sk->sk_rcvlowat locklessly.
Fixes: eac66402d1 ("net: annotate sk->sk_rcvlowat lockless reads")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_max_pacing_rate
can be read while other threads are changing its value.
Fixes: 62748f32d5 ("net: introduce SO_MAX_PACING_RATE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_txrehash
can be read while other threads are changing its value.
Other locations were handled in commit cb6cd2cec7
("tcp: Change SYN ACK retransmit behaviour to account for rehash")
Fixes: 26859240e4 ("txhash: Add socket option to control TX hash rethink behavior")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_getsockopt() runs locklessly. This means sk->sk_reserved_mem
can be read while other threads are changing its value.
Add missing annotations where they are needed.
Fixes: 2bb2f5fb21 ("net: add new socket option SO_RESERVE_MEM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a misuse of IP{6}CB(skb) in GRO, while calling to
`udp6_lib_lookup2` when handling udp tunnels. `udp6_lib_lookup2` fetch the
device from CB. The fix changes it to fetch the device from `skb->dev`.
l3mdev case requires special attention since it has a master and a slave
device.
Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A match entry is uniquely identified with an "address" or "path" in the
form of: hashtable ID(12b):bucketid(8b):nodeid(12b).
When creating table match entries all of hash table id, bucket id and
node (match entry id) are needed to be either specified by the user or
reasonable in-kernel defaults are used. The in-kernel default for a table id is
0x800(omnipresent root table); for bucketid it is 0x0. Prior to this fix there
was none for a nodeid i.e. the code assumed that the user passed the correct
nodeid and if the user passes a nodeid of 0 (as Mingi Cho did) then that is what
was used. But nodeid of 0 is reserved for identifying the table. This is not
a problem until we dump. The dump code notices that the nodeid is zero and
assumes it is referencing a table and therefore references table struct
tc_u_hnode instead of what was created i.e match entry struct tc_u_knode.
Ming does an equivalent of:
tc filter add dev dummy0 parent 10: prio 1 handle 0x1000 \
protocol ip u32 match ip src 10.0.0.1/32 classid 10:1 action ok
Essentially specifying a table id 0, bucketid 1 and nodeid of zero
Tableid 0 is remapped to the default of 0x800.
Bucketid 1 is ignored and defaults to 0x00.
Nodeid was assumed to be what Ming passed - 0x000
dumping before fix shows:
~$ tc filter ls dev dummy0 parent 10:
filter protocol ip pref 1 u32 chain 0
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor -30591
Note that the last line reports a table instead of a match entry
(you can tell this because it says "ht divisor...").
As a result of reporting the wrong data type (misinterpretting of struct
tc_u_knode as being struct tc_u_hnode) the divisor is reported with value
of -30591. Ming identified this as part of the heap address
(physmap_base is 0xffff8880 (-30591 - 1)).
The fix is to ensure that when table entry matches are added and no
nodeid is specified (i.e nodeid == 0) then we get the next available
nodeid from the table's pool.
After the fix, this is what the dump shows:
$ tc filter ls dev dummy0 parent 10:
filter protocol ip pref 1 u32 chain 0
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 1 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 10:1 not_in_hw
match 0a000001/ffffffff at 12
action order 1: gact action pass
random type none pass val 0
index 1 ref 1 bind 1
Reported-by: Mingi Cho <mgcho.minic@gmail.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230726135151.416917-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link the active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.
We also take care to avoid any issues w.r.t. module unloading -- while
defrag is active on a link, the module is prevented from unloading.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We want to be able to enable/disable IP packet defrag from core
bpf/netfilter code. In other words, execute code from core that could
possibly be built as a module.
To help avoid symbol resolution errors, use glue hooks that the modules
will register callbacks with during module init.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/f6a8824052441b72afe5285acedbd634bd3384c1.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Chuck Lever says:
====================
In-kernel support for the TLS Alert protocol
IMO the kernel doesn't need user space (ie, tlshd) to handle the TLS
Alert protocol. Instead, a set of small helper functions can be used
to handle sending and receiving TLS Alerts for in-kernel TLS
consumers.
====================
Merged on top of a tag in case it's needed in the NFS tree.
Link: https://lore.kernel.org/r/169047923706.5241.1181144206068116926.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kernel TLS consumers will need definitions of various parts of the
TLS protocol, but often do not need the function declarations and
other infrastructure provided in <net/tls.h>.
Break out existing standardized protocol elements into a separate
header, and make room for a few more elements in subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169047931374.5241.7713175865185969309.stgit@oracle-102.nfsv4bat.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
accept_ra_min_rtr_lft only considered the lifetime of the default route
and discarded entire RAs accordingly.
This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and
applies the value to individual RA sections; in particular, router
lifetime, PIO preferred lifetime, and RIO lifetime. If any of those
lifetimes are lower than the configured value, the specific RA section
is ignored.
In order for the sysctl to be useful to Android, it should really apply
to all lifetimes in the RA, since that is what determines the minimum
frequency at which RAs must be processed by the kernel. Android uses
hardware offloads to drop RAs for a fraction of the minimum of all
lifetimes present in the RA (some networks have very frequent RAs (5s)
with high lifetimes (2h)). Despite this, we have encountered networks
that set the router lifetime to 30s which results in very frequent CPU
wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the
WiFi firmware) entirely on such networks, it seems better to ignore the
misconfigured routers while still processing RAs from other IPv6 routers
on the same network (i.e. to support IoT applications).
The previous implementation dropped the entire RA based on router
lifetime. This turned out to be hard to expand to the other lifetimes
present in the RA in a consistent manner; dropping the entire RA based
on RIO/PIO lifetimes would essentially require parsing the whole thing
twice.
Fixes: 1671bcfd76 ("net: add sysctl accept_ra_min_rtr_lft")
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230726230701.919212-1-prohr@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reap the benefits of easier iteration thanks to the xarray.
Convert just the genetlink ones, those are easier to test.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks" so we need to save the position
after each chunk, so we know where to restart from. Because
netdevs are stored in a hash table we remember which bucket
we were in and how many devices we dumped.
Since we don't hold any locks across the "chunks" - devices may
come and go while we're dumping. If that happens we may miss
a device (if device is deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right..
To illustrate let's look at an example:
System state:
start: # [A, B, C]
del: B # [A, C]
with the hash table we may dump [A, B], missing C completely even
tho it existed both before and after the "del B".
Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:
System state:
start: # [A, B]
add: C # [A, C, B]
del: B # [A, C]
we may generate a dump of [A], if C got an index between A and B.
System has never been in such state. But I'm 90% sure that's perfectly
fine, important part is that we can't _miss_ devices which exist before
and after. User space which wants to mirror kernel's state subscribes
to notifications and does periodic dumps so it will know that C exists
from the notification about its creation or from the next dump
(next dump is _guaranteed_ to include C, if it doesn't get removed).
To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices and microbenchmarking 1M lookups
on Skylake I get the following results (not counting loopback
to number of devs):
#devs | hash | xa | delta
2 | 18.3 | 20.1 | + 9.8%
16 | 18.3 | 20.1 | + 9.5%
64 | 18.3 | 26.3 | +43.8%
128 | 20.4 | 26.3 | +28.6%
256 | 20.0 | 26.4 | +32.1%
1024 | 26.6 | 26.7 | + 0.2%
8192 |541.3 | 33.5 | -93.8%
No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order, if the pattern is more
random xa starts to win at 512 devices already. But that's a lot
of devices, in practice.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
blocklisting (fencing) with a couple of prerequisites and a fixup to
prevent metrics from being sent to the MDS even just once after that
has been disabled by the user. All marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTD7MgTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi3SBB/4nHdfSQwy0z2+PM766sUKxSlmaRw8X
4AJyGAIGj5BnHHhtluwLpEYfrh3wfyCRaNYgS64jdsudbUBxPKIIWn2lEFxtyWbC
w0R2uEc+NGJLJOYfJ+lBP06Q2r6qk7N6OGNy6qLaN+v6xJ8WPw7H3fJBLVhnPgMq
7lkACRN+0P5Xt6ZJ57kbWWiFQ+vjv7bbDa0P9zMl6uCgoYsIpvrskqygx+gHbdsq
IcnpsHu3F0ycYAJT5eJ5GcCcThvwbNjWdbJy1fERah7U/LNcX/S3To9V5LPmydOQ
tYAWMlC/1a99fr+jTYF0Pu5GLUdK0UMKeX04ZN3SKON4pORurpypw3n8
=rw72
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to reduce the potential for erroneous RBD exclusive lock
blocklisting (fencing) with a couple of prerequisites and a fixup to
prevent metrics from being sent to the MDS even just once after that
has been disabled by the user. All marked for stable"
* tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client:
rbd: retrieve and check lock owner twice before blocklisting
rbd: harden get_lock_owner_info() a bit
rbd: make get_lock_owner_info() return a single locker or NULL
ceph: never send metrics if disable_send_metrics is set
Most of these clean up warnings we've gotten out of compilation tools, but
several of them were from inspection while hunting down a couple of
regressions.
The most important one to pull is 75b396821c
(fs/9p: remove unnecessary and overrestrictive check)
which caused a regression for some folks by restricting mmap
in any case where writeback caches weren't enabled.
Most of the other bugs caught via inspection were type mismatches.
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmTDH8cACgkQiP/V+0pf
/5hSJQ//b59cDliC7Knf9B2Of1UsLJ2wYIbxWVYKLwYarKFn3tmtO5dPtWZrQzjB
Kz6fif5z1c0WdjNFLifs/XNqUq5znX/TY8bV/NmOg8VlaoJqmUQSSYnNQOWZCFKT
zwxC6BO6gPNNIkJN2xQ8oOq11Qon/nbZbuN9P2VDcT5Yr2KmFx6FHRcrBNRYAm3E
UzFdjkLrLef3VrvegJNGM3Wv2HqyNBA6QhifZBjDkydtDPMd9fRNns7Q60AARR9K
aqXV6SihE/Ox7sSmVNjTzYF67eq5Xjt+sSzo2SdfOaZxVIa6wf0UXQuFqmHts6Zs
QUCdXS5YbQAwQfdkm22rnTIxAwsbEpFOGGweUvMBXzZbl/sq/PK4Nt6DpCS8ZFi3
81Z5Ey+Q4yaxwdirP521M4ao2Ae2Fzg12bqDTNssZdOYGcXBqBfWiR5IfMbbkgWq
WzCVI3V/LshQ75pXQyS4BtW/29C2nN7g3jLrF3Q5OTe7XmHMCZFvtP4lKvY0piQy
++XoDs1LCJWSZebfkNa05L5nhQ1mYhwiZutHTtF3ejTTiJvcJXQ4xHYLzjOON+4i
blLTpgWLO0rIRAmX0I8GwPi6q0xL4rFP4XGGz/LDppRQkRa13vtzFtMvuldPyEq7
g4pcLkI3SPbL982qYbg8UO+GjO/Q9M/DafXQVlUyDw04TQbT8Jg=
=TAC1
-----END PGP SIGNATURE-----
Merge tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p fixes from Eric Van Hensbergen:
"Misc set of fixes for 9p.
Most of these clean up warnings we've gotten out of compilation tools,
but several of them were from inspection while hunting down a couple
of regressions.
The most important one is 75b396821c ("fs/9p: remove unnecessary and
overrestrictive check") which caused a regression for some folks by
restricting mmap in any case where writeback caches weren't enabled.
Most of the other bugs caught via inspection were type mismatches"
* tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Remove unused extern declaration
9p: remove dead stores (variable set again without being read)
9p: virtio: skip incrementing unused variable
9p: virtio: make sure 'offs' is initialized in zc_request
9p: virtio: fix unlikely null pointer deref in handle_rerror
9p: fix ignored return value in v9fs_dir_release
fs/9p: remove unnecessary invalidate_inode_pages2
fs/9p: fix type mismatch in file cache mode helper
fs/9p: fix typo in comparison logic for cache mode
fs/9p: remove unnecessary and overrestrictive check
fs/9p: Fix a datatype used with V9FS_DIRECT_IO
Add extack info for IPv6 address add/delete, which would be useful for
users to understand the problem without having to read kernel code.
Suggested-by: Beniamino Galvani <bgalvani@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ETHTOOL_GRXFH correctly copies in the full struct ethtool_rxnfc when
FLOW_RSS is set; ETHTOOL_SRXFH needs a similar code path to handle the
FLOW_RSS case so that ethtool can set the flow hash for custom RSS
contexts (if supported by the driver).
The copy code from ETHTOOL_GRXFH has been pulled out in to a helper so
that it can be called in both ETHTOOL_{G,S}RXFH code paths.
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it as merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.
Acked-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230727014944.3972546-1-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTCcgkNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AJmMD/9IPWnzSNLUgoAhSo0h2OkCKl2iIdRnkrPrruhE
Su8bD8ohmU100iN1DMXT2a7C9o0BTog4EB7WtF21z+06dUhROiZizrSt8bTk/rRi
0+Sm9xlDAdl3CZcU8fnVjwf6PLYgUv5zVjcQc4Ggf15MwEIdpviKCps2bbBtrozF
PJEK6+UwTU6+z4GSTc957nhFHstEcwktyxoaAote98CD78G2YCQT5yVbfctHgRm0
9qovT8S/zZmqHvqvUfrqJd+N5V/+40O7ZuFls93kYxK9Bttx9wRwEqALPldxXudU
o0kG4QZ8NAwiIVsGqPwKu/cKi9PF0z/PUXYgVdnkKK+XofBDHbHyfR+BJO1ejOdX
+ea9AoQ6lD6NVmvX01+lF9OI4D1zgc6pLGyjSsyVgv3x0iKJeZ8QOgb0DTGFiG1U
MnFIeckedrh/dt3NXLG/blZvuAzhofHqEhH/DlvbI/QBtN2zEgIMJKxRfBAMs3OO
WAIlaHASQFVbyrHOr/X3FoNDTsvZyrTppo9WwJVTj9F41lYXzWoiBY+nVj2brGDR
SMW1M13sufRBQlk0aTpPYPvcS5FhsMf6ggxygi2rNxX5/AdFE02nnEU9ybpHAqcy
NiZ8kCxJ2J9+aCj7yvJ7QQcAD7l2tAIeAZCKSlKteigqTI0PWoTUc0IYPT85URLm
cy/l4A==
=fgLz
-----END PGP SIGNATURE-----
Merge tag 'nf-next-23-07-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
1. silence a harmless warning for CONFIG_NF_CONNTRACK_PROCFS=n builds,
from Zhu Wang.
2, 3:
Allow NLA_POLICY_MASK to be used with BE16/BE32 types, and replace a few
manual checks with nla_policy based one in nf_tables, from myself.
4: cleanup in ctnetlink to validate while parsing rather than
using two steps, from Lin Ma.
5: refactor boyer-moore textsearch by moving a small chunk to
a helper function, rom Jeremy Sowden.
* tag 'nf-next-23-07-27' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
lib/ts_bm: add helper to reduce indentation and improve readability
netfilter: conntrack: validate cta_ip via parsing
netfilter: nf_tables: use NLA_POLICY_MASK to test for valid flag options
netlink: allow be16 and be32 types in all uint policy checks
nf_conntrack: fix -Wunused-const-variable=
====================
Link: https://lore.kernel.org/r/20230727133604.8275-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hannes Reinecke says:
====================
net/tls: fixes for NVMe-over-TLS
here are some small fixes to get NVMe-over-TLS up and running.
The first set are just minor modifications to have MSG_EOR handled
for TLS, but the second set implements the ->read_sock() callback
for tls_sw.
The ->read_sock() callbacks return -EIO when encountering any TLS
Alert message, but as that's the default behaviour anyway I guess
we can get away with it.
====================
Applied on top of the tag in case Sagi gets convinced to pull it.
Link: https://lore.kernel.org/r/20230726191556.41714-1-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement ->read_sock() function for use with nvme-tcp.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Cc: Boris Pismenny <boris.pismenny@gmail.com>
Link: https://lore.kernel.org/r/20230726191556.41714-7-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Split tls_rx_reader_{lock,unlock} into an 'acquire/release' and
the actual locking part.
With that we can use the tls_rx_reader_lock in situations where
the socket is already locked.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-6-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TLS resets the protocol operations, so the read_sock() callback might
be changed, too.
In this case using sock->ops->readsock() in tls_strp_read_copyin() will
enter an infinite recursion if the read_sock() callback is calling
tls_rx_rec_wait() which will call into sock->ops->readsock() via
tls_strp_read_copyin().
But as tls_strp_read_copyin() is supposed to produce data from the
consumed socket and that socket is always a TCP socket we can call
tcp_read_sock() directly without having to deal with callbacks.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-5-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_push_data() MSG_MORE, but bails out on MSG_EOR.
Seeing that MSG_EOR is basically the opposite of MSG_MORE
this patch adds handling MSG_EOR by treating it as the
absence of MSG_MORE.
Consequently we should return an error when both are set.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-3-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_sw_sendmsg() already handles MSG_MORE, but bails
out on MSG_EOR.
Seeing that MSG_EOR is basically the opposite of
MSG_MORE this patch adds handling MSG_EOR by treating
it as the negation of MSG_MORE.
And erroring out if MSG_EOR is specified with MSG_MORE.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230726191556.41714-2-hare@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Older DSA drivers that do not provide an dsa_ops adjust_link method end
up using phylink. Unfortunately, a recent phylink change that requires
its supported_interfaces bitmap to be filled breaks these drivers
because the bitmap remains empty.
Rather than fixing each driver individually, fix it in the core code so
we have a sensible set of defaults.
Reported-by: Sergei Antonov <saproj@gmail.com>
Fixes: de5c9bf40c ("net: phylink: require supported_interfaces to be filled")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com> # dsa_loop
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1qOflM-001AEz-D3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are totally 9 ndo_bridge_setlink handlers in the current kernel,
which are 1) bnxt_bridge_setlink, 2) be_ndo_bridge_setlink 3)
i40e_ndo_bridge_setlink 4) ice_bridge_setlink 5)
ixgbe_ndo_bridge_setlink 6) mlx5e_bridge_setlink 7)
nfp_net_bridge_setlink 8) qeth_l2_bridge_setlink 9) br_setlink.
By investigating the code, we find that 1-7 parse and use nlattr
IFLA_BRIDGE_MODE but 3 and 4 forget to do the nla_len check. This can
lead to an out-of-attribute read and allow a malformed nlattr (e.g.,
length 0) to be viewed as a 2 byte integer.
To avoid such issues, also for other ndo_bridge_setlink handlers in the
future. This patch adds the nla_len check in rtnl_bridge_setlink and
does an early error return if length mismatches. To make it works, the
break is removed from the parsing for IFLA_BRIDGE_FLAGS to make sure
this nla_for_each_nested iterates every attribute.
Fixes: b1edc14a3f ("ice: Implement ice_bridge_getlink and ice_bridge_setlink")
Fixes: 51616018dd ("i40e: Add support for getlink, setlink ndo ops")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230726075314.1059224-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 19e3a9c90c ("net: bridge: convert multicast to generic rhashtable")
this is not used, so can remove it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20230726143141.11704-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Removes superfluous (and misplaced) comment from ndisc_router_discovery.
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230726184742.342825-1-prohr@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The nla_for_each_nested parsing in function bpf_sk_storage_diag_alloc
does not check the length of the nested attribute. This can lead to an
out-of-attribute read and allow a malformed nlattr (e.g., length 0) to
be viewed as a 4 byte integer.
This patch adds an additional check when the nlattr is getting counted.
This makes sure the latter nla_get_u32 can access the attributes with
the correct length.
Fixes: 1ed4d92458 ("bpf: INET_DIAG support in bpf_sk_storage")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230725023330.422856-1-linma@zju.edu.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This adds support of MSG_PEEK flag for SOCK_SEQPACKET type of socket.
Difference with SOCK_STREAM is that this callback returns either length
of the message or error.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reworks current implementation of MSG_PEEK logic:
1) Replaces 'skb_queue_walk_safe()' with 'skb_queue_walk()'. There is
no need in the first one, as there are no removes of skb in loop.
2) Removes nested while loop - MSG_PEEK logic could be implemented
without it: just iterate over skbs without removing it and copy
data from each until destination buffer is not full.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In current ctnetlink_parse_tuple_ip() function, nested parsing and
validation is splitting as two parts, which could be cleanup to a
simplified form. As the nla_parse_nested_deprecated function
supports validation in the fly. These two finially reach same place
__nla_validate_parse with same validate flag.
nla_parse_nested_deprecated
__nla_parse(.., NL_VALIDATE_LIBERAL, ..)
__nla_validate_parse
nla_validate_nested_deprecated
__nla_validate_nested(.., NL_VALIDATE_LIBERAL, ..)
__nla_validate
__nla_validate_parse
This commit removes the call to nla_validate_nested_deprecated and pass
cta_ip_nla_policy when do parsing.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
nf_tables relies on manual test of netlink attributes coming from userspace
even in cases where this could be handled via netlink policy.
Convert a bunch of 'flag' attributes to use NLA_POLICY_MASK checks.
Signed-off-by: Florian Westphal <fw@strlen.de>
When building with W=1, the following warning occurs.
net/netfilter/nf_conntrack_proto_dccp.c:72:27: warning: ‘dccp_state_names’ defined but not used [-Wunused-const-variable=]
static const char * const dccp_state_names[] = {
We include dccp_state_names in the macro
CONFIG_NF_CONNTRACK_PROCFS, since it is only used in the place
which is included in the macro CONFIG_NF_CONNTRACK_PROCFS.
Fixes: 2bc780499a ("[NETFILTER]: nf_conntrack: add DCCP protocol support")
Signed-off-by: Zhu Wang <wangzhu9@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
If tipc_link_bc_create() fails inside tipc_node_create() for a newly
allocated tipc node then we should stop its tipc crypto and free the
resources allocated with a call to tipc_crypto_start().
As the node ref is initialized to one to that point, just put the ref on
tipc_link_bc_create() error case that would lead to tipc_node_free() be
eventually executed and properly clean the node and its crypto resources.
Found by Linux Verification Center (linuxtesting.org).
Fixes: cb8092d70a ("tipc: move bc link creation back to tipc_node_create")
Suggested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20230725214628.25246-1-pchelkin@ispras.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kernel test robot reported slab-out-of-bounds access in strlen(). [0]
Commit 06d4c8a808 ("af_unix: Fix fortify_panic() in unix_bind_bsd().")
removed unix_mkname_bsd() call in unix_bind_bsd().
If sunaddr->sun_path is not terminated by user and we don't enable
CONFIG_INIT_STACK_ALL_ZERO=y, strlen() will do the out-of-bounds access
during file creation.
Let's go back to strlen()-with-sockaddr_storage way and pack all 108
trickiness into unix_mkname_bsd() with bold comments.
[0]:
BUG: KASAN: slab-out-of-bounds in strlen (lib/string.c:?)
Read of size 1 at addr ffff000015492777 by task fortify_strlen_/168
CPU: 0 PID: 168 Comm: fortify_strlen_ Not tainted 6.5.0-rc1-00333-g3329b603ebba #16
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace (arch/arm64/kernel/stacktrace.c:235)
show_stack (arch/arm64/kernel/stacktrace.c:242)
dump_stack_lvl (lib/dump_stack.c:107)
print_report (mm/kasan/report.c:365 mm/kasan/report.c:475)
kasan_report (mm/kasan/report.c:590)
__asan_report_load1_noabort (mm/kasan/report_generic.c:378)
strlen (lib/string.c:?)
getname_kernel (./include/linux/fortify-string.h:? fs/namei.c:226)
kern_path_create (fs/namei.c:3926)
unix_bind (net/unix/af_unix.c:1221 net/unix/af_unix.c:1324)
__sys_bind (net/socket.c:1792)
__arm64_sys_bind (net/socket.c:1801)
invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52)
el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147)
do_el0_svc (arch/arm64/kernel/syscall.c:189)
el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?)
el0t_64_sync (arch/arm64/kernel/entry.S:591)
Allocated by task 168:
kasan_set_track (mm/kasan/common.c:45 mm/kasan/common.c:52)
kasan_save_alloc_info (mm/kasan/generic.c:512)
__kasan_kmalloc (mm/kasan/common.c:383)
__kmalloc (mm/slab_common.c:? mm/slab_common.c:998)
unix_bind (net/unix/af_unix.c:257 net/unix/af_unix.c:1213 net/unix/af_unix.c:1324)
__sys_bind (net/socket.c:1792)
__arm64_sys_bind (net/socket.c:1801)
invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52)
el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147)
do_el0_svc (arch/arm64/kernel/syscall.c:189)
el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?)
el0t_64_sync (arch/arm64/kernel/entry.S:591)
The buggy address belongs to the object at ffff000015492700
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 0 bytes to the right of
allocated 119-byte region [ffff000015492700, ffff000015492777)
The buggy address belongs to the physical page:
page:00000000aeab52ba refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x55492
anon flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff)
page_type: 0xffffffff()
raw: 03fffc0000000200 ffff0000084018c0 fffffc00003d0e00 0000000000000005
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff000015492600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff000015492680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff000015492700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 fc
^
ffff000015492780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff000015492800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 06d4c8a808 ("af_unix: Fix fortify_panic() in unix_bind_bsd().")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/netdev/202307262110.659e5e8-oliver.sang@intel.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230726190828.47874-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
goto free_skb if an unexpected result is returned by pskb_tirm()
in tipc_crypto_rcv_complete().
Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Signed-off-by: Yuanjun Gong <ruc_gongyuanjun@163.com>
Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Link: https://lore.kernel.org/r/20230725064810.5820-1-ruc_gongyuanjun@163.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTBN7QNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AMg4D/wLs+Nm4XZmvz3ZOtgbHrw3xSbkLgJ563cCGwHw
1k4/726FrfbVHqMvOWgHNVdmsVCcw0jeLVddlIN3QmbMlxn2YUwQ16nRg6SUMeKZ
u/9KsI0wLr9zGKJxiUDWhe+7An3oY5G/7uFngJZ1M/vbGTxrl1GqOEU7bMvkm/G/
08z8/ho9Qv8CbPUfn4ZcYfxDCPKTB74O0UZiItHDAp+tQcD729xvTZ+AGOPe+632
YifWe7FzY+SUNu/sesr8kyeMU0UPsxETO7pnvgn5PJ3osLDQB1mxj+Kpa5YSuQ4H
3gO2S6T07iQZ74//aUTzexEzss4HNsdXKDPcTfXyyiZPJsfVkZRuKTcuA16bGt0l
zCVzCz2Aj5brZEjYhFlmgCnWzlnWDNGHJF7WBxF9UAHoEqXQa3h4im2RTj9/8RvZ
ZRXCZnmL+UIr5k3m5NzXYDM7vWxHvavNbKRl594XVq1fI6GdSGGQ9SOyPkKFUYQt
BYxDsbg9RY/piZt7vH+YQfmjuQ8sCkxqPEqJvSzy6U/TVtTOdsjfCCeijSrEIlvg
i/B+P24GI2dZW09trHfVVcn5YQc8/HkzCr029BcmfSu9+4+GeTmjMArSJfO1hThd
d3IpsNye9OgdhxsR75kZusCoZpRfuEtySKdkfdzvvAWLqPh8nNzhIzGx6WHmcTt6
TYYquA==
=4wB/
-----END PGP SIGNATURE-----
Merge tag 'nf-23-07-26' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter fixes for net
1. On-demand overlap detection in 'rbtree' set can cause memory leaks.
This is broken since 6.2.
2. An earlier fix in 6.4 to address an imbalance in refcounts during
transaction error unwinding was incomplete, from Pablo Neira.
3. Disallow adding a rule to a deleted chain, also from Pablo.
Broken since 5.9.
* tag 'nf-23-07-26' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: disallow rule addition to bound chain via NFTA_RULE_CHAIN_ID
netfilter: nf_tables: skip immediate deactivate in _PREPARE_ERROR
netfilter: nft_set_rbtree: fix overlap expiration walk
====================
Link: https://lore.kernel.org/r/20230726152524.26268-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The nla_for_each_nested parsing in function mqprio_parse_nlattr() does
not check the length of the nested attribute. This can lead to an
out-of-attribute read and allow a malformed nlattr (e.g., length 0) to
be viewed as 8 byte integer and passed to priv->max_rate/min_rate.
This patch adds the check based on nla_len() when check the nla_type(),
which ensures that the length of these two attribute must equals
sizeof(u64).
Fixes: 4e8b86c062 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20230725024227.426561-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the mptcp code generate a "new listener" event even
if the actual listen() syscall fails. Address the issue moving
the event generation call under the successful branch.
Cc: stable@vger.kernel.org
Fixes: f8c9dfbd87 ("mptcp: add pm listener events")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230725-send-net-20230725-v1-2-6f60fe7137a9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bail out with EOPNOTSUPP when adding rule to bound chain via
NFTA_RULE_CHAIN_ID. The following warning splat is shown when
adding a rule to a deleted bound chain:
WARNING: CPU: 2 PID: 13692 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables]
CPU: 2 PID: 13692 Comm: chain-bound-rul Not tainted 6.1.39 #1
RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables]
Fixes: d0e2c7de92 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Reported-by: Kevin Rich <kevinrich1337@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
On error when building the rule, the immediate expression unbinds the
chain, hence objects can be deactivated by the transaction records.
Otherwise, it is possible to trigger the following warning:
WARNING: CPU: 3 PID: 915 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables]
CPU: 3 PID: 915 Comm: chain-bind-err- Not tainted 6.1.39 #1
RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables]
Fixes: 4bedf9eee0 ("netfilter: nf_tables: fix chain binding transaction logic")
Reported-by: Kevin Rich <kevinrich1337@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
The lazy gc on insert that should remove timed-out entries fails to release
the other half of the interval, if any.
Can be reproduced with tests/shell/testcases/sets/0044interval_overlap_0
in nftables.git and kmemleak enabled kernel.
Second bug is the use of rbe_prev vs. prev pointer.
If rbe_prev() returns NULL after at least one iteration, rbe_prev points
to element that is not an end interval, hence it should not be removed.
Lastly, check the genmask of the end interval if this is active in the
current generation.
Fixes: c9e6978e27 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Florian Westphal <fw@strlen.de>
- we want the exclusive lock type, so test for it directly
- use sscanf() to actually parse the lock cookie and avoid admitting
invalid handles
- bail if locker has a blank address
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
The reporter noticed a warning when running iwlwifi:
WARNING: CPU: 8 PID: 659 at mm/page_alloc.c:4453 __alloc_pages+0x329/0x340
As cfg80211_parse_colocated_ap() is not expected to return a negative
value return 0 and not a negative value if cfg80211_calc_short_ssid()
fails.
Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217675
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230723201043.3007430-1-ilan.peer@intel.com
syzkaller found a bug in unix_bind_bsd() [0]. We can reproduce it
by bind()ing a socket on a path with length 108.
108 is the size of sun_addr of struct sockaddr_un and is the maximum
valid length for the pathname socket. When calling bind(), we use
struct sockaddr_storage as the actual buffer size, so terminating
sun_addr[108] with null is legitimate as done in unix_mkname_bsd().
However, strlen(sunaddr) for such a case causes fortify_panic() if
CONFIG_FORTIFY_SOURCE=y. __fortify_strlen() has no idea about the
actual buffer size and see the string as unterminated.
Let's use strnlen() to allow sun_addr to be unterminated at 107.
[0]:
detected buffer overflow in __fortify_strlen
kernel BUG at lib/string_helpers.c:1031!
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 255 Comm: syz-executor296 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #4
Hardware name: linux,dummy-virt (DT)
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030
lr : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030
sp : ffff800089817af0
x29: ffff800089817af0 x28: ffff800089817b40 x27: 1ffff00011302f68
x26: 000000000000006e x25: 0000000000000012 x24: ffff800087e60140
x23: dfff800000000000 x22: ffff800089817c20 x21: ffff800089817c8e
x20: 000000000000006c x19: ffff00000c323900 x18: ffff800086ab1630
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001
x14: 1ffff00011302eb8 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 64a26b65474d2a00
x8 : 64a26b65474d2a00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff800089817438 x4 : ffff800086ac99e0 x3 : ffff800080f19e8c
x2 : 0000000000000001 x1 : 0000000100000000 x0 : 000000000000002c
Call trace:
fortify_panic+0x1c/0x20 lib/string_helpers.c:1030
_Z16__fortify_strlenPKcU25pass_dynamic_object_size1 include/linux/fortify-string.h:217 [inline]
unix_bind_bsd net/unix/af_unix.c:1212 [inline]
unix_bind+0xba8/0xc58 net/unix/af_unix.c:1326
__sys_bind+0x1ac/0x248 net/socket.c:1792
__do_sys_bind net/socket.c:1803 [inline]
__se_sys_bind net/socket.c:1801 [inline]
__arm64_sys_bind+0x7c/0x94 net/socket.c:1801
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52
el0_svc_common+0x134/0x240 arch/arm64/kernel/syscall.c:139
do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:188
el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:647
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
Code: aa0003e1 d0000e80 91030000 97ffc91a (d4210000)
Fixes: df8fc4e934 ("kbuild: Enable -fstrict-flex-arrays=3")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230724213425.22920-2-kuniyu@amazon.com
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are currently two paths that call remove_xps_queue():
1. __netif_set_xps_queue -> remove_xps_queue
2. clean_xps_maps -> remove_xps_queue_cpu -> remove_xps_queue
There is no need to check dev_maps in remove_xps_queue() because
dev_maps has been checked on these two paths.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20230724023735.2751602-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.
The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:
sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.
Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.
Fixes: 8e368dc72e ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Now that inet[6]_lookup_reuseport are parameterised on the ehashfn
we can remove two sk_lookup helpers.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-6-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The current implementation was extracted from inet[6]_lhash2_lookup
in commit 80b373f74f ("inet: Extract helper for selecting socket
from reuseport group") and commit 5df6531292 ("inet6: Extract helper
for selecting socket from reuseport group"). In the original context,
sk is always in TCP_LISTEN state and so did not have a separate check.
Add documentation that specifies which sk_state are valid to pass to
the function.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-5-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
There are currently four copies of reuseport_lookup: one each for
(TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of
those functions as well. This is already the case for sk_lookup
helpers (inet,inet6,udp4,udp6)_lookup_run_bpf.
There are two differences between the reuseport_lookup helpers:
1. They call different hash functions depending on protocol
2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED
Move the check for sk_state into the caller and use the INDIRECT_CALL
infrastructure to cut down the helpers to one per IP version.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-4-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Rename the existing reuseport helpers for IPv4 and IPv6 so that they
can be invoked in the follow up commit. Export them so that building
DCCP and IPv6 as a module works.
No change in functionality.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-3-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The semantics for bpf_sk_assign are as follows:
sk = some_lookup_func()
bpf_sk_assign(skb, sk)
bpf_sk_release(sk)
That is, the sk is not consumed by bpf_sk_assign. The function
therefore needs to make sure that sk lives long enough to be
consumed from __inet_lookup_skb. The path through the stack for a
TCPv4 packet is roughly:
netif_receive_skb_core: takes RCU read lock
__netif_receive_skb_core:
sch_handle_ingress:
tcf_classify:
bpf_sk_assign()
deliver_ptype_list_skb:
deliver_skb:
ip_packet_type->func == ip_rcv:
ip_rcv_core:
ip_rcv_finish_core:
dst_input:
ip_local_deliver:
ip_local_deliver_finish:
ip_protocol_deliver_rcu:
tcp_v4_rcv:
__inet_lookup_skb:
skb_steal_sock
The existing helper takes advantage of the fact that everything
happens in the same RCU critical section: for sockets with
SOCK_RCU_FREE set bpf_sk_assign never takes a reference.
skb_steal_sock then checks SOCK_RCU_FREE again and does sock_put
if necessary.
This approach assumes that SOCK_RCU_FREE is never set on a sk
between bpf_sk_assign and skb_steal_sock, but this invariant is
violated by unhashed UDP sockets. A new UDP socket is created
in TCP_CLOSE state but without SOCK_RCU_FREE set. That flag is only
added in udp_lib_get_port() which happens when a socket is bound.
When bpf_sk_assign was added it wasn't possible to access unhashed
UDP sockets from BPF, so this wasn't a problem. This changed
in commit 0c48eefae7 ("sock_map: Lift socket state restriction
for datagram sockets"), but the helper wasn't adjusted accordingly.
The following sequence of events will therefore lead to a refcount
leak:
1. Add socket(AF_INET, SOCK_DGRAM) to a sockmap.
2. Pull socket out of sockmap and bpf_sk_assign it. Since
SOCK_RCU_FREE is not set we increment the refcount.
3. bind() or connect() the socket, setting SOCK_RCU_FREE.
4. skb_steal_sock will now set refcounted = false due to
SOCK_RCU_FREE.
5. tcp_v4_rcv() skips sock_put().
Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
This matches the behaviour of __inet_lookup_skb which is ultimately
the goal of bpf_sk_assign().
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Cc: Joe Stringer <joe@cilium.io>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-2-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Contrary to TCP, UDP reuseport groups can contain TCP_ESTABLISHED
sockets. To support these properly we remember whether a group has
a connected socket and skip the fast reuseport early-return. In
effect we continue scoring all reuseport sockets and then choose the
one with the highest score.
The current code fails to re-calculate the score for the result of
lookup_reuseport. According to Kuniyuki Iwashima:
1) SO_INCOMING_CPU is set
-> selected sk might have +1 score
2) BPF prog returns ESTABLISHED and/or SO_INCOMING_CPU sk
-> selected sk will have more than 8
Using the old score could trigger more lookups depending on the
order that sockets are created.
sk -> sk (SO_INCOMING_CPU) -> sk (ESTABLISHED)
| |
`-> select the next SO_INCOMING_CPU sk
|
`-> select itself (We should save this lookup)
Fixes: efc6b6f6c3 ("udp: Improve load balancing for SO_REUSEPORT.")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-1-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Since mlx5 supports UDP encapsulation in packet offload, change the XFRM
core to allow users to configure it.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEDs2BvajyNKlf9TJQvlAcSiqKBOgFAmS+jLwTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRC+UBxKKooE6LMiB/wPcSaPk8b/Tkpwnf0R0yP36tP7YS0V
7SZldc2KRYeQb0Lk1gPTzWRGmddKl2kORh3Y3JkfanKNsiNfhYgUrxeLDkbeDolo
n1Io6fjhK1DzM5cx6Sn+spPpl4QGWV3YQ8PAJm6FjsH5+M5LPzZIK0GGakESCxU7
YDbq3wqglKeI2h1Ae3sQeBd7k26KQXupHfoCyYgUlmOF/u2PuKP0xmn2674QdlvR
m99iE5kqF/HKlrg1OlcOTa7KzknOYPjsVpJ1nTcqxKmUDEYq+niLflBV9Gw8VEyO
Y1naqUJz3TavDVgVIGkjUJEd/+sw3yI2LK7g3+DhugXvMLrqIPa6kmrM
=ayq3
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-6.5-20230724' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2023-07-24
The first patch is by me and adds a missing set of CAN state to
CAN_STATE_STOPPED on close in the gs_usb driver.
The last patch is by Eric Dumazet and fixes a lockdep issue in the CAN
raw protocol.
* tag 'linux-can-fixes-for-6.5-20230724' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: raw: fix lockdep issue in raw_release()
can: gs_usb: gs_can_close(): add missing set of CAN state to CAN_STATE_STOPPED
====================
Link: https://lore.kernel.org/r/20230724150141.766047-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MPTCP code uses the assumption that the tcp_win_from_space() helper
does not use any TCP-specific field, and thus works correctly operating
on an MPTCP socket.
The commit dfa2f04833 ("tcp: get rid of sysctl_tcp_adv_win_scale")
broke such assumption, and as a consequence most MPTCP connections stall
on zero-window event due to auto-tuning changing the rcv buffer size
quite randomly.
Address the issue syncing again the MPTCP auto-tuning code with the TCP
one. To achieve that, factor out the windows size logic in socket
independent helpers, and reuse them in mptcp_rcv_space_adjust(). The
MPTCP level scaling_ratio is selected as the minimum one from the all
the subflows, as a worst-case estimate.
Fixes: dfa2f04833 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20230720-upstream-net-next-20230720-mptcp-fix-rcv-buffer-auto-tuning-v1-1-175ef12b8380@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
currently on 6.4 net/main:
# ip link add dummy1 type dummy
# echo 1 > /proc/sys/net/ipv6/conf/dummy1/use_tempaddr
# ip link set dummy1 up
# ip -6 addr add 2000::1/64 mngtmpaddr dev dummy1
# ip -6 addr show dev dummy1
11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
inet6 2000::44f3:581c:8ca:3983/64 scope global temporary dynamic
valid_lft 604800sec preferred_lft 86172sec
inet6 2000::1/64 scope global mngtmpaddr
valid_lft forever preferred_lft forever
inet6 fe80::e8a8:a6ff:fed5:56d4/64 scope link
valid_lft forever preferred_lft forever
# ip -6 addr del 2000::44f3:581c:8ca:3983/64 dev dummy1
(can wait a few seconds if you want to, the above delete isn't [directly] the problem)
# ip -6 addr show dev dummy1
11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
inet6 2000::1/64 scope global mngtmpaddr
valid_lft forever preferred_lft forever
inet6 fe80::e8a8:a6ff:fed5:56d4/64 scope link
valid_lft forever preferred_lft forever
# ip -6 addr del 2000::1/64 mngtmpaddr dev dummy1
# ip -6 addr show dev dummy1
11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
inet6 2000::81c9:56b7:f51a:b98f/64 scope global temporary dynamic
valid_lft 604797sec preferred_lft 86169sec
inet6 fe80::e8a8:a6ff:fed5:56d4/64 scope link
valid_lft forever preferred_lft forever
This patch prevents this new 'global temporary dynamic' address from being
created by the deletion of the related (same subnet prefix) 'mngtmpaddr'
(which is triggered by there already being no temporary addresses).
Cc: Jiri Pirko <jiri@resnulli.us>
Fixes: 53bd674915 ("ipv6 addrconf: introduce IFA_F_MANAGETEMPADDR to tell kernel to manage temporary addresses")
Reported-by: Xiao Ma <xiaom@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230720160022.1887942-1-maze@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv6 inet sockets are supposed to have a "struct ipv6_pinfo"
field at the end of their definition, so that inet6_sk_generic()
can derive from socket size the offset of the "struct ipv6_pinfo".
This is very fragile, and prevents adding bigger alignment
in sockets, because inet6_sk_generic() does not work
if the compiler adds padding after the ipv6_pinfo component.
We are currently working on a patch series to reorganize
TCP structures for better data locality and found issues
similar to the one fixed in commit f5d547676c
("tcp: fix tcp_inet6_sk() for 32bit kernels")
Alternative would be to force an alignment on "struct ipv6_pinfo",
greater or equal to __alignof__(any ipv6 sock) to ensure there is
no padding. This does not look great.
v2: fix typo in mptcp_proto_v6_init() (Paolo)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Chao Wu <wwchao@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-86-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This change adds a new sysctl accept_ra_min_rtr_lft to specify the
minimum acceptable router lifetime in an RA. If the received RA router
lifetime is less than the configured value (and not 0), the RA is
ignored.
This is useful for mobile devices, whose battery life can be impacted
by networks that configure RAs with a short lifetime. On such networks,
the device should never gain IPv6 provisioning and should attempt to
drop RAs via hardware offload, if available.
Signed-off-by: Patrick Rohr <prohr@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
Even call sites utilizing length-bounded destination buffers should
switch over to using `strtomem` or `strtomem_pad`. In this case,
however, the compiler is unable to determine the size of the `data`
buffer which renders `strtomem` unusable. Due to this, `strscpy`
should be used.
It should be noted that most call sites already zero-initialize the
destination buffer. However, I've opted to use `strscpy_pad` to maintain
the same exact behavior that `strncpy` produced (zero-padded tail up to
`len`).
Also see [3].
[1]: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
[2]: elixir.bootlin.com/linux/v6.3/source/net/ethtool/ioctl.c#L1944
[3]: manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new function netlink_release is added in netlink_sock to store the
protocol's release function. This is called when the socket is deleted.
This can be supplied by the protocol via the release function in
netlink_kernel_cfg. This is being added for the NETLINK_CONNECTOR
protocol, so it can free it's data when socket is deleted.
Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To use filtering at the connector & cn_proc layers, we need to enable
filtering in the netlink layer. This reverses the patch which removed
netlink filtering - commit ID for that patch:
549017aa1b (netlink: remove netlink_broadcast_filtered).
Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that page_pool_release_page() is not exported we can
merge it with page_pool_return_page(). I believe that
the "Do not replace this with page_pool_return_page()"
comment was there in case page_pool_return_page() was
not inlined, to avoid two function calls.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/20230720010409.1967072-5-kuba@kernel.org
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There seems to be no user calling page_pool_release_page()
for legit reasons, all the users simply haven't been converted
to skb-based recycling, yet. Previous changes converted them.
Update the docs, and unexport the function.
Link: https://lore.kernel.org/r/20230720010409.1967072-4-kuba@kernel.org
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, if cmd in the split ops array is of lower value than the
previous one, genl_validate_ops() continues to do the checks as if
the values are equal. This may result in non-obvious WARN_ON() hit in
these check.
Instead, check the incorrect ordering explicitly and put a WARN_ON()
in case it is broken.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230720111354.562242-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current implementation of HTB offload returns the EINVAL error for
quantum parameter. This patch removes the error returning checks for
'quantum' parameter and populates its value to tc_htb_qopt_offload
structure such that driver can use the same.
Add quantum parameter check in mlx5 driver, as mlx5 devices are not capable
of supporting the quantum parameter when htb offload is used. Report error
if quantum parameter is set to a non-default value.
Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a front panel joins a bridge via another netdevice (typically a LAG),
the driver needs to learn about the objects configured on the bridge port.
When the bridge port is offloaded by the driver for the first time, this
can be achieved by passing a notifier to switchdev_bridge_port_offload().
The notifier is then invoked for the individual objects (such as VLANs)
configured on the bridge, and can look for the interesting ones.
Calling switchdev_bridge_port_offload() when the second port joins the
bridge lower is unnecessary, but the replay is still needed. To that end,
add a new function, switchdev_bridge_port_replay(), which does only the
replay part of the _offload() function in exactly the same way as that
function.
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Ivan Vecera <ivecera@redhat.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two kinds of MDB entries to be replayed: port MDB entries, and
host MDB entries. They are both replayed by br_switchdev_mdb_replay(). If
the driver supports one kind, but lacks the other, the first -EOPNOTSUPP
returned terminates the whole replay, including any further still-supported
objects in the list.
For this to cause issues, there must be MDB entries for both the host and
the port being replayed. In that case, if the driver bails out from
handling the host entry, the port entries are never replayed. However, the
replay is currently only done when a switchdev port joins a bridge. There
would be no port memberships at that point. Thus despite being erroneous,
the code does not cause observable bugs.
This is not an issue with other object kinds either, because there, each
function replays one object kind. If a driver does not support that kind,
it makes sense to bail out early. -EOPNOTSUPP is then ignored in
nbp_switchdev_sync_objs().
For MDB, suppress the -EOPNOTSUPP error code in br_switchdev_mdb_replay()
already, so that the whole list gets replayed.
The reason we need this patch is that a future patch will introduce a
replay that should be used when a front-panel port netdevice is enslaved to
a bridge lower, in particular a LAG. The LAG netdevice can already have
both host and port MDB entries. The port entries need to be replayed so
that they are offloaded on the port that joins the LAG.
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Ivan Vecera <ivecera@redhat.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With legacy nexthops, when net.ipv4.fib_multipath_use_neigh is set,
fib_select_multipath() will never set res->nhc to a nexthop that is not
good (as per fib_good_nh()). OTOH, with nexthop objects,
nexthop_select_path_hthr() may return a nexthop that failed the
nexthop_is_good_nh() test even if there was one that passed. Refactor
nexthop_select_path_hthr() to follow a selection logic more similar to
fib_select_multipath().
The issue can be demonstrated with the following sequence of commands. The
first block shows that things work as expected with legacy nexthops. The
last sequence of `ip rou get` in the second block shows the problem case -
some routes still use the .2 nexthop.
sysctl net.ipv4.fib_multipath_use_neigh=1
ip link add dummy1 up type dummy
ip rou add 198.51.100.0/24 nexthop via 192.0.2.1 dev dummy1 onlink nexthop via 192.0.2.2 dev dummy1 onlink
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh add 192.0.2.1 dev dummy1 nud failed
echo ".1 failed:" # results should not use .1
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh del 192.0.2.1 dev dummy1
ip neigh add 192.0.2.2 dev dummy1 nud failed
echo ".2 failed:" # results should not use .2
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip link del dummy1
ip link add dummy1 up type dummy
ip nexthop add id 1 via 192.0.2.1 dev dummy1 onlink
ip nexthop add id 2 via 192.0.2.2 dev dummy1 onlink
ip nexthop add id 1001 group 1/2
ip rou add 198.51.100.0/24 nhid 1001
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh add 192.0.2.1 dev dummy1 nud failed
echo ".1 failed:" # results should not use .1
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh del 192.0.2.1 dev dummy1
ip neigh add 192.0.2.2 dev dummy1 nud failed
echo ".2 failed:" # results should not use .2
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip link del dummy1
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-3-04383e89f868@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For legacy nexthops, there is fib_good_nh() to check the neighbor validity.
In order to make the nexthop object code more similar to the legacy nexthop
code, factor out the nexthop object neighbor validity check into its own
function.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-2-04383e89f868@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The loop in nexthop_select_path_hthr() includes code to check for neighbor
validity. Since this does not apply to fdb nexthops, simplify the loop by
moving the fdb nexthop selection to its own function.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-1-04383e89f868@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- eth: r8169: multiple fixes for PCIe ASPM-related problems
- vrf: fix RCU lockdep splat in output path
Previous releases - regressions:
- gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
- dsa: mv88e6xxx: do a final check before timing out when polling
- nf_tables: fix sleep in atomic in nft_chain_validate
Previous releases - always broken:
- sched: fix undoing tcf_bind_filter() in multiple classifiers
- bpf, arm64: fix BTI type used for freplace attached functions
- can: gs_usb: fix time stamp counter initialization
- nft_set_pipapo: fix improper element removal (leading to UAF)
Misc:
- net: support STP on bridge in non-root netns, STP prevents
packet loops so not supporting it results in freezing systems
of unsuspecting users, and in turn very upset noises being made
- fix kdoc warnings
- annotate various bits of TCP state to prevent data races
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmS5pp0ACgkQMUZtbf5S
IrtudA/9Ep+URprI3tpv+VHOQMWtMd7lzz+wwEUDQSo2T6xdMcYbd1E4ZWWOPw/y
jTIIVF3qde4nuI/MZtzGhvCD8v4bzhw10uRm4f4vhC2i+CzXr/UdOQSMqeZmJZgN
vndixvRjHJKYxogOa+DjXgOiuQTQfuSfSnaai0kvw3zZzi4tev/Bdj6KZmFW+UK+
Q7uQZ5n8tdE4UvUdj8Jek23SZ4kL+HtQOIdAAqyduQnYnax5L5sbep0TjuCjjkpK
26rvmwYFJmEab4mC2T3Y7VDaXYM9M2f/EuFBMBVEohE3KPTTdT12WzLfJv7TTKTl
hymfXgfmCXiZElzoQTJ69bFGbhqFaCJwhCUHFwYqkqj0bW9cXYJD2achpi3nVgnn
CV8vfqJtkzdgh2bV2faG+1wmAm1wzHSURmT5NlnFaX6a6BYypaN7CERn7BnIdLM/
YA2wud39bL0EJsic5e3gtlyJdfhtx7iqCMzE7S5FiUZvgOmUhBZ4IWkMs6Aq5PpL
FLLgBSHGEIAdLVQGvXLjfQ/LeSrW8JsiSy6deztzR+ZflvvaBIP5y8sC3+KdxAvN
3ybMsMEE5OK3i808aV3l6/8DLeAJ+DWuMc96Ix7Yyt2LXFnnV79DX49zJAEUWrc7
54FnNzkgAO/Q9aEFmmQoFt5qZmoFHuNwcHBOmXARAatQqNCwDqk=
=Xifr
-----END PGP SIGNATURE-----
Merge tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from BPF, netfilter, bluetooth and CAN.
Current release - regressions:
- eth: r8169: multiple fixes for PCIe ASPM-related problems
- vrf: fix RCU lockdep splat in output path
Previous releases - regressions:
- gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
- dsa: mv88e6xxx: do a final check before timing out when polling
- nf_tables: fix sleep in atomic in nft_chain_validate
Previous releases - always broken:
- sched: fix undoing tcf_bind_filter() in multiple classifiers
- bpf, arm64: fix BTI type used for freplace attached functions
- can: gs_usb: fix time stamp counter initialization
- nft_set_pipapo: fix improper element removal (leading to UAF)
Misc:
- net: support STP on bridge in non-root netns, STP prevents packet
loops so not supporting it results in freezing systems of
unsuspecting users, and in turn very upset noises being made
- fix kdoc warnings
- annotate various bits of TCP state to prevent data races"
* tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
net: phy: prevent stale pointer dereference in phy_init()
tcp: annotate data-races around fastopenq.max_qlen
tcp: annotate data-races around icsk->icsk_user_timeout
tcp: annotate data-races around tp->notsent_lowat
tcp: annotate data-races around rskq_defer_accept
tcp: annotate data-races around tp->linger2
tcp: annotate data-races around icsk->icsk_syn_retries
tcp: annotate data-races around tp->keepalive_probes
tcp: annotate data-races around tp->keepalive_intvl
tcp: annotate data-races around tp->keepalive_time
tcp: annotate data-races around tp->tsoffset
tcp: annotate data-races around tp->tcp_tx_delay
Bluetooth: MGMT: Use correct address for memcpy()
Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
Bluetooth: SCO: fix sco_conn related locking and validity issues
Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
Bluetooth: coredump: fix building with coredump disabled
Bluetooth: ISO: fix iso_conn related locking and validity issues
Bluetooth: hci_event: call disconnect callback before deleting conn
...
- Fix building with coredump disabled
- Fix use-after-free in hci_remove_adv_monitor
- Use RCU for hci_conn_params and iterate safely in hci_sync
- Fix locking issues on ISO and SCO
- Fix bluetooth on Intel Macbook 2014
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmS5gLEZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKd1BD/9nVq2/rC0l2j2RW6y/Mvym
kE4AglMzP1y0xd1xwjJsiHJdvT5D1cgoIAkn3kN0E/LwEvjUtKT4453w70F8ZEoR
reM98PJUIxvSMzP6S88BxAuDIcpeCs0Mu59cm+J50oC8cUNaX8vJr6QPUj30J3Tm
KFWh89/HAQr5sgfbszKHpSXpcfzlzqMFS/gWadT+vJPmLDipvkPAo3m4WdJe+z67
D4nRlAVas8VElv8UuFYGCHz4iRq+RUFYrSAfTRgQakfFIaFddnZT2+7UM262d3QF
tdmrGtrLZtyxr8N5zPU6yyrfsJTSRZlJ8tRBxff3qf/pDOSgsDsob3VbWiZCkbzy
WIAih8MxEvkzFoRYvL3jkgiGcjziW5uEC8XQW3PrcjA195Qb8Eyr8Xec5sh5ekIE
orSvlyvIXF+PgU1BPSS/UlMSSxgBqnF4Zt8i17zlXrTy3MR4GfpHXYATT51dPwjd
lLJ7Ec2D9XzQW77MS4o41wX13Y4ALMcoyHuABfAYIPG5DCg/m8gofzN8+zdOwpex
vFuYX0V29NxB4ovw9+9O+mnbhuip5LQBqI2DkTd8bPjrOPw6DzP5OxtXzrsgm1Is
d2FS+eOhh33+mEbmOe9BtK5lbUkEY+iKEQnrHW7jBbm2NwEBHac6ZVn6cjwlCU6g
SDKOqvbApbJMDfZSnB+m3g==
=s4eF
-----END PGP SIGNATURE-----
Merge tag 'for-net-2023-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix building with coredump disabled
- Fix use-after-free in hci_remove_adv_monitor
- Use RCU for hci_conn_params and iterate safely in hci_sync
- Fix locking issues on ISO and SCO
- Fix bluetooth on Intel Macbook 2014
* tag 'for-net-2023-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Use correct address for memcpy()
Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
Bluetooth: SCO: fix sco_conn related locking and validity issues
Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
Bluetooth: coredump: fix building with coredump disabled
Bluetooth: ISO: fix iso_conn related locking and validity issues
Bluetooth: hci_event: call disconnect callback before deleting conn
Bluetooth: use RCU for hci_conn_params and iterate safely in hci_sync
====================
Link: https://lore.kernel.org/r/20230720190201.446469-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmS5ZXQNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AArEEAC0M0dhcZIh91Z0l3kLl9fjcc9Ee6qowYo+srYK
N3H80JIinVOdzlAnqUVq3aiy+1RROqhX+t5WyYuKdHC1ujLzbcQT0LzDN1RiAjvl
57d49+jfFulRmbqerl22/u4bmOA3j7sdSQwztfmTx+dCR+ap2z7ghRlhwuOx0orq
JDl4f/UX8ZWCIeQfmq/FjtsZVNI1af6oKy6G+c9y0Xjj/1jUvF/y3Xe3/Xbn1H14
mkkonZPBnA0Xw9dAaaMmcM/JxhO1R4KEUVKxP7L78XtU/FmxbWP8K3xU91ExaR6q
MnzDkbL7Z+k68Jf2vuxwldwkcmCJDuCibBmbgWeDhW2qAutCQGraxewnp405zPmR
KzRVYtWxBbA5FFZWXSfHscURLG9pdzIMFruu5Vm8gzzefpQ0502IaQn1YRcYj10a
sMvCep2x79kUvoI4hwW07eo2y5+rBIRjPu5sCv5MWENxf75XLHjqJeje+q8NKhoD
+YGP2tWe7Pm0Ekr3Ju8TH8LDGBbmNt5FVdZCCIOOeNYp5cOLLNyupWyB86Vsjrn8
FLnxe8xb95+2EnfA5aj7VAJtdmtDvCESDRVp30PxWSNgeltV+hBJbAUdYsaBK607
/zOm18KreDUPNQrXTH81a9gu8tYF50zez5bTHjDOw7yatwN7etRuhNyKkOSiQoC8
OiGA6A==
=0qzt
-----END PGP SIGNATURE-----
Merge tag 'nf-23-07-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
Netfilter fixes for net:
The following patchset contains Netfilter fixes for net:
1. Fix spurious -EEXIST error from userspace due to
padding holes, this was broken since 4.9 days
when 'ignore duplicate entries on insert' feature was
added.
2. Fix a sched-while-atomic bug, present since 5.19.
3. Properly remove elements if they lack an "end range".
nft userspace always sets an end range attribute, even
when its the same as the start, but the abi doesn't
have such a restriction. Always broken since it was
added in 5.6, all three from myself.
4 + 5: Bound chain needs to be skipped in netns release
and on rule flush paths, from Pablo Neira.
* tag 'nf-23-07-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: skip bound chain on rule flush
netfilter: nf_tables: skip bound chain in netns release path
netfilter: nft_set_pipapo: fix improper element removal
netfilter: nf_tables: can't schedule in nft_chain_validate
netfilter: nf_tables: fix spurious set element insertion failure
====================
Link: https://lore.kernel.org/r/20230720165143.30208-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
do_tcp_getsockopt() and reqsk_timer_handler() read
icsk->icsk_syn_retries while another cpu might change its value.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
do_tcp_getsockopt() reads tp->tcp_tx_delay while another cpu
might change its value.
Fixes: a842fe1425 ("tcp: add optional per socket transmit delay")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 9p code for some reason used to initialize variables outside of the
declaration, e.g. instead of just initializing the variable like this:
int retval = 0
We would be doing this:
int retval;
retval = 0;
This is perfectly fine and the compiler will just optimize dead stores
anyway, but scan-build seems to think this is a problem and there are
many of these warnings making the output of scan-build full of such
warnings:
fs/9p/vfs_inode.c:916:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores]
retval = 0;
^ ~
I have no strong opinion here, but if we want to regularly run
scan-build we should fix these just to silence the messages.
I've confirmed these all are indeed ok to remove.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
Fix the following scan-build warning:
net/9p/trans_virtio.c:504:3: warning: Value stored to 'in' is never read [deadcode.DeadStores]
in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I'm honestly not 100% sure about this one; I'm tempted to think we
could (should?) just check the return value of pack_sg_list_p to skip
the in_sgs++ and setting sgs[] if it didn't process anything, but I'm
not sure it should ever happen so this is probably fine as is.
Just removing the assignment at least makes it clear the return value
isn't used, so it's an improvement in terms of readability.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
Similarly to the previous patch: offs can be used in handle_rerrors
without initializing on small payloads; in this case handle_rerrors will
not use it because of the size check, but it doesn't hurt to make sure
it is zero to please scan-build.
This fixes the following warning:
net/9p/trans_virtio.c:539:3: warning: 3rd function call argument is an uninitialized value [core.CallAndMessage]
handle_rerror(req, in_hdr_len, offs, in_pages);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
handle_rerror can dereference the pages pointer, but it is not
necessarily set for small payloads.
In practice these should be filtered out by the size check, but
might as well double-check explicitly.
This fixes the following scan-build warnings:
net/9p/trans_virtio.c:401:24: warning: Dereference of null pointer [core.NullDereference]
memcpy_from_page(to, *pages++, offs, n);
^~~~~~~~
net/9p/trans_virtio.c:406:23: warning: Dereference of null pointer (loaded from variable 'pages') [core.NullDereference]
memcpy_from_page(to, *pages, offs, size);
^~~~~~
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
In function ‘fortify_memcpy_chk’,
inlined from ‘get_conn_info_complete’ at net/bluetooth/mgmt.c:7281:2:
include/linux/fortify-string.h:592:25: error: call to
‘__read_overflow2_field’ declared with attribute warning: detected read
beyond size of field (2nd parameter); maybe use struct_group()?
[-Werror=attribute-warning]
592 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
This is due to the wrong member is used for memcpy(). Use correct one.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Operations that check/update sk_state and access conn should hold
lock_sock, otherwise they can race.
The order of taking locks is hci_dev_lock > lock_sock > sco_conn_lock,
which is how it is in connect/disconnect_cfm -> sco_conn_del ->
sco_chan_del.
Fix locking in sco_connect to take lock_sock around updating sk_state
and conn.
sco_conn_del must not occur during sco_connect, as it frees the
sco_conn. Hold hdev->lock longer to prevent that.
sco_conn_add shall return sco_conn with valid hcon. Make it so also when
reusing an old SCO connection waiting for disconnect timeout (see
__sco_sock_close where conn->hcon is set to NULL).
This should not reintroduce the issue fixed in the earlier
commit 9a8ec9e8eb ("Bluetooth: SCO: Fix possible circular locking
dependency on sco_connect_cfm"), the relevant fix of releasing lock_sock
in sco_sock_connect before acquiring hdev->lock is retained.
These changes mirror similar fixes earlier in ISO sockets.
Fixes: 9a8ec9e8eb ("Bluetooth: SCO: Fix possible circular locking dependency on sco_connect_cfm")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_connect_sco currently returns NULL when there is no link (i.e. when
hci_conn_link() returns NULL).
sco_connect() expects an ERR_PTR in case of any error (see line 266 in
sco.c). Thus, hcon set as NULL passes through to sco_conn_add(), which
tries to get hcon->hdev, resulting in dereferencing a NULL pointer as
reported by syzkaller.
The same issue exists for iso_connect_cis() calling hci_connect_cis().
Thus, make hci_connect_sco() and hci_connect_cis() return ERR_PTR
instead of NULL.
Reported-and-tested-by: syzbot+37acd5d80d00d609d233@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=37acd5d80d00d609d233
Fixes: 06149746e7 ("Bluetooth: hci_conn: Add support for linking multiple hcon")
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
KASAN reports that there's a use-after-free in
hci_remove_adv_monitor(). Trawling through the disassembly, you can
see that the complaint is from the access in bt_dev_dbg() under the
HCI_ADV_MONITOR_EXT_MSFT case. The problem case happens because
msft_remove_monitor() can end up freeing the monitor
structure. Specifically:
hci_remove_adv_monitor() ->
msft_remove_monitor() ->
msft_remove_monitor_sync() ->
msft_le_cancel_monitor_advertisement_cb() ->
hci_free_adv_monitor()
Let's fix the problem by just stashing the relevant data when it's
still valid.
Fixes: 7cf5c2978f ("Bluetooth: hci_sync: Refactor remove Adv Monitor")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sk->sk_state indicates whether iso_pi(sk)->conn is valid. Operations
that check/update sk_state and access conn should hold lock_sock,
otherwise they can race.
The order of taking locks is hci_dev_lock > lock_sock > iso_conn_lock,
which is how it is in connect/disconnect_cfm -> iso_conn_del ->
iso_chan_del.
Fix locking in iso_connect_cis/bis and sendmsg/recvmsg to take lock_sock
around updating sk_state and conn.
iso_conn_del must not occur during iso_connect_cis/bis, as it frees the
iso_conn. Hold hdev->lock longer to prevent that.
This should not reintroduce the issue fixed in commit 241f51931c
("Bluetooth: ISO: Avoid circular locking dependency"), since the we
acquire locks in order. We retain the fix in iso_sock_connect to release
lock_sock before iso_connect_* acquires hdev->lock.
Similarly for commit 6a5ad251b7 ("Bluetooth: ISO: Fix possible
circular locking dependency"). We retain the fix in iso_conn_ready to
not acquire iso_conn_lock before lock_sock.
iso_conn_add shall return iso_conn with valid hcon. Make it so also when
reusing an old CIS connection waiting for disconnect timeout (see
__iso_sock_close where conn->hcon is set to NULL).
Trace with iso_conn_del after iso_chan_add in iso_connect_cis:
===============================================================
iso_sock_create:771: sock 00000000be9b69b7
iso_sock_init:693: sk 000000004dff667e
iso_sock_bind:827: sk 000000004dff667e 70:1a:b8:98:ff:a2 type 1
iso_sock_setsockopt:1289: sk 000000004dff667e
iso_sock_setsockopt:1289: sk 000000004dff667e
iso_sock_setsockopt:1289: sk 000000004dff667e
iso_sock_connect:875: sk 000000004dff667e
iso_connect_cis:353: 70:1a:b8:98:ff:a2 -> 28:3d:c2:4a:7e:da
hci_get_route:1199: 70:1a:b8:98:ff:a2 -> 28:3d:c2:4a:7e:da
hci_conn_add:1005: hci0 dst 28:3d:c2:4a:7e:da
iso_conn_add:140: hcon 000000007b65d182 conn 00000000daf8625e
__iso_chan_add:214: conn 00000000daf8625e
iso_connect_cfm:1700: hcon 000000007b65d182 bdaddr 28:3d:c2:4a:7e:da status 12
iso_conn_del:187: hcon 000000007b65d182 conn 00000000daf8625e, err 16
iso_sock_clear_timer:117: sock 000000004dff667e state 3
<Note: sk_state is BT_BOUND (3), so iso_connect_cis is still
running at this point>
iso_chan_del:153: sk 000000004dff667e, conn 00000000daf8625e, err 16
hci_conn_del:1151: hci0 hcon 000000007b65d182 handle 65535
hci_conn_unlink:1102: hci0: hcon 000000007b65d182
hci_chan_list_flush:2780: hcon 000000007b65d182
iso_sock_getsockopt:1376: sk 000000004dff667e
iso_sock_getname:1070: sock 00000000be9b69b7, sk 000000004dff667e
iso_sock_getname:1070: sock 00000000be9b69b7, sk 000000004dff667e
iso_sock_getsockopt:1376: sk 000000004dff667e
iso_sock_getname:1070: sock 00000000be9b69b7, sk 000000004dff667e
iso_sock_getname:1070: sock 00000000be9b69b7, sk 000000004dff667e
iso_sock_shutdown:1434: sock 00000000be9b69b7, sk 000000004dff667e, how 1
__iso_sock_close:632: sk 000000004dff667e state 5 socket 00000000be9b69b7
<Note: sk_state is BT_CONNECT (5), even though iso_chan_del sets
BT_CLOSED (6). Only iso_connect_cis sets it to BT_CONNECT, so it
must be that iso_chan_del occurred between iso_chan_add and end of
iso_connect_cis.>
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 8000000006467067 P4D 8000000006467067 PUD 3f5f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
RIP: 0010:__iso_sock_close (net/bluetooth/iso.c:664) bluetooth
===============================================================
Trace with iso_conn_del before iso_chan_add in iso_connect_cis:
===============================================================
iso_connect_cis:356: 70:1a:b8:98:ff:a2 -> 28:3d:c2:4a:7e:da
...
iso_conn_add:140: hcon 0000000093bc551f conn 00000000768ae504
hci_dev_put:1487: hci0 orig refcnt 21
hci_event_packet:7607: hci0: event 0x0e
hci_cmd_complete_evt:4231: hci0: opcode 0x2062
hci_cc_le_set_cig_params:3846: hci0: status 0x07
hci_sent_cmd_data:3107: hci0 opcode 0x2062
iso_connect_cfm:1703: hcon 0000000093bc551f bdaddr 28:3d:c2:4a:7e:da status 7
iso_conn_del:187: hcon 0000000093bc551f conn 00000000768ae504, err 12
hci_conn_del:1151: hci0 hcon 0000000093bc551f handle 65535
hci_conn_unlink:1102: hci0: hcon 0000000093bc551f
hci_chan_list_flush:2780: hcon 0000000093bc551f
__iso_chan_add:214: conn 00000000768ae504
<Note: this conn was already freed in iso_conn_del above>
iso_sock_clear_timer:117: sock 0000000098323f95 state 3
general protection fault, probably for non-canonical address 0x30b29c630930aec8: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1920 Comm: bluetoothd Tainted: G E 6.3.0-rc7+ #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
RIP: 0010:detach_if_pending+0x28/0xd0
Code: 90 90 0f 1f 44 00 00 48 8b 47 08 48 85 c0 0f 84 ad 00 00 00 55 89 d5 53 48 83 3f 00 48 89 fb 74 7d 66 90 48 8b 03 48 8b 53 08 <>
RSP: 0018:ffffb90841a67d08 EFLAGS: 00010007
RAX: 0000000000000000 RBX: ffff9141bd5061b8 RCX: 0000000000000000
RDX: 30b29c630930aec8 RSI: ffff9141fdd21e80 RDI: ffff9141bd5061b8
RBP: 0000000000000001 R08: 0000000000000000 R09: ffffb90841a67b88
R10: 0000000000000003 R11: ffffffff8613f558 R12: ffff9141fdd21e80
R13: 0000000000000000 R14: ffff9141b5976010 R15: ffff914185755338
FS: 00007f45768bd840(0000) GS:ffff9141fdd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000619000424074 CR3: 0000000009f5e005 CR4: 0000000000170ee0
Call Trace:
<TASK>
timer_delete+0x48/0x80
try_to_grab_pending+0xdf/0x170
__cancel_work+0x37/0xb0
iso_connect_cis+0x141/0x400 [bluetooth]
===============================================================
Trace with NULL conn->hcon in state BT_CONNECT:
===============================================================
__iso_sock_close:619: sk 00000000f7c71fc5 state 1 socket 00000000d90c5fe5
...
__iso_sock_close:619: sk 00000000f7c71fc5 state 8 socket 00000000d90c5fe5
iso_chan_del:153: sk 00000000f7c71fc5, conn 0000000022c03a7e, err 104
...
iso_sock_connect:862: sk 00000000129b56c3
iso_connect_cis:348: 70:1a:b8:98:ff:a2 -> 28:3d:c2:4a:7d:2a
hci_get_route:1199: 70:1a:b8:98:ff:a2 -> 28:3d:c2:4a:7d:2a
hci_dev_hold:1495: hci0 orig refcnt 19
__iso_chan_add:214: conn 0000000022c03a7e
<Note: reusing old conn>
iso_sock_clear_timer:117: sock 00000000129b56c3 state 3
...
iso_sock_ready:1485: sk 00000000129b56c3
...
iso_sock_sendmsg:1077: sock 00000000e5013966, sk 00000000129b56c3
BUG: kernel NULL pointer dereference, address: 00000000000006a8
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1403 Comm: wireplumber Tainted: G E 6.3.0-rc7+ #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
RIP: 0010:iso_sock_sendmsg+0x63/0x2a0 [bluetooth]
===============================================================
Fixes: 241f51931c ("Bluetooth: ISO: Avoid circular locking dependency")
Fixes: 6a5ad251b7 ("Bluetooth: ISO: Fix possible circular locking dependency")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_update_accept_list_sync iterates over hdev->pend_le_conns and
hdev->pend_le_reports, and waits for controller events in the loop body,
without holding hdev lock.
Meanwhile, these lists and the items may be modified e.g. by
le_scan_cleanup. This can invalidate the list cursor or any other item
in the list, resulting to invalid behavior (eg use-after-free).
Use RCU for the hci_conn_params action lists. Since the loop bodies in
hci_sync block and we cannot use RCU or hdev->lock for the whole loop,
copy list items first and then iterate on the copy. Only the flags field
is written from elsewhere, so READ_ONCE/WRITE_ONCE should guarantee we
read valid values.
Free params everywhere with hci_conn_params_free so the cleanup is
guaranteed to be done properly.
This fixes the following, which can be triggered e.g. by BlueZ new
mgmt-tester case "Add + Remove Device Nowait - Success", or by changing
hci_le_set_cig_params to always return false, and running iso-tester:
==================================================================
BUG: KASAN: slab-use-after-free in hci_update_passive_scan_sync (net/bluetooth/hci_sync.c:2536 net/bluetooth/hci_sync.c:2723 net/bluetooth/hci_sync.c:2841)
Read of size 8 at addr ffff888001265018 by task kworker/u3:0/32
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
<TASK>
dump_stack_lvl (./arch/x86/include/asm/irqflags.h:134 lib/dump_stack.c:107)
print_report (mm/kasan/report.c:320 mm/kasan/report.c:430)
? __virt_addr_valid (./include/linux/mmzone.h:1915 ./include/linux/mmzone.h:2011 arch/x86/mm/physaddr.c:65)
? hci_update_passive_scan_sync (net/bluetooth/hci_sync.c:2536 net/bluetooth/hci_sync.c:2723 net/bluetooth/hci_sync.c:2841)
kasan_report (mm/kasan/report.c:538)
? hci_update_passive_scan_sync (net/bluetooth/hci_sync.c:2536 net/bluetooth/hci_sync.c:2723 net/bluetooth/hci_sync.c:2841)
hci_update_passive_scan_sync (net/bluetooth/hci_sync.c:2536 net/bluetooth/hci_sync.c:2723 net/bluetooth/hci_sync.c:2841)
? __pfx_hci_update_passive_scan_sync (net/bluetooth/hci_sync.c:2780)
? mutex_lock (kernel/locking/mutex.c:282)
? __pfx_mutex_lock (kernel/locking/mutex.c:282)
? __pfx_mutex_unlock (kernel/locking/mutex.c:538)
? __pfx_update_passive_scan_sync (net/bluetooth/hci_sync.c:2861)
hci_cmd_sync_work (net/bluetooth/hci_sync.c:306)
process_one_work (./arch/x86/include/asm/preempt.h:27 kernel/workqueue.c:2399)
worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2538)
? __pfx_worker_thread (kernel/workqueue.c:2480)
kthread (kernel/kthread.c:376)
? __pfx_kthread (kernel/kthread.c:331)
ret_from_fork (arch/x86/entry/entry_64.S:314)
</TASK>
Allocated by task 31:
kasan_save_stack (mm/kasan/common.c:46)
kasan_set_track (mm/kasan/common.c:52)
__kasan_kmalloc (mm/kasan/common.c:374 mm/kasan/common.c:383)
hci_conn_params_add (./include/linux/slab.h:580 ./include/linux/slab.h:720 net/bluetooth/hci_core.c:2277)
hci_connect_le_scan (net/bluetooth/hci_conn.c:1419 net/bluetooth/hci_conn.c:1589)
hci_connect_cis (net/bluetooth/hci_conn.c:2266)
iso_connect_cis (net/bluetooth/iso.c:390)
iso_sock_connect (net/bluetooth/iso.c:899)
__sys_connect (net/socket.c:2003 net/socket.c:2020)
__x64_sys_connect (net/socket.c:2027)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
Freed by task 15:
kasan_save_stack (mm/kasan/common.c:46)
kasan_set_track (mm/kasan/common.c:52)
kasan_save_free_info (mm/kasan/generic.c:523)
__kasan_slab_free (mm/kasan/common.c:238 mm/kasan/common.c:200 mm/kasan/common.c:244)
__kmem_cache_free (mm/slub.c:1807 mm/slub.c:3787 mm/slub.c:3800)
hci_conn_params_del (net/bluetooth/hci_core.c:2323)
le_scan_cleanup (net/bluetooth/hci_conn.c:202)
process_one_work (./arch/x86/include/asm/preempt.h:27 kernel/workqueue.c:2399)
worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2538)
kthread (kernel/kthread.c:376)
ret_from_fork (arch/x86/entry/entry_64.S:314)
==================================================================
Fixes: e8907f7654 ("Bluetooth: hci_sync: Make use of hci_cmd_sync_queue set 3")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
tcp_sequence() uses two conditions to decide to drop a packet,
and we currently report generic TCP_INVALID_SEQUENCE drop reason.
Duplicates are common, we need to distinguish them from
the other case.
I chose to not reuse TCP_OLD_DATA, and instead added
TCP_OLD_SEQUENCE drop reason.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719064754.2794106-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
end key should be equal to start unless NFT_SET_EXT_KEY_END is present.
Its possible to add elements that only have a start key
("{ 1.0.0.0 . 2.0.0.0 }") without an internval end.
Insertion treats this via:
if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
end = (const u8 *)nft_set_ext_key_end(ext)->data;
else
end = start;
but removal side always uses nft_set_ext_key_end().
This is wrong and leads to garbage remaining in the set after removal
next lookup/insert attempt will give:
BUG: KASAN: slab-use-after-free in pipapo_get+0x8eb/0xb90
Read of size 1 at addr ffff888100d50586 by task nft-pipapo_uaf_/1399
Call Trace:
kasan_report+0x105/0x140
pipapo_get+0x8eb/0xb90
nft_pipapo_insert+0x1dc/0x1710
nf_tables_newsetelem+0x31f5/0x4e00
..
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: lonial con <kongln9170@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Can be called via nft set element list iteration, which may acquire
rcu and/or bh read lock (depends on set type).
BUG: sleeping function called from invalid context at net/netfilter/nf_tables_api.c:3353
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1232, name: nft
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by nft/1232:
#0: ffff8881180e3ea8 (&nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid
#1: ffffffff83f5f540 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire
Call Trace:
nft_chain_validate
nft_lookup_validate_setelem
nft_pipapo_walk
nft_lookup_validate
nft_chain_validate
nft_immediate_validate
nft_chain_validate
nf_tables_validate
nf_tables_abort
No choice but to move it to nf_tables_validate().
Fixes: 81ea010667 ("netfilter: nf_tables: add rescheduling points during loop detection walks")
Signed-off-by: Florian Westphal <fw@strlen.de>
On some platforms there is a padding hole in the nft_verdict
structure, between the verdict code and the chain pointer.
On element insertion, if the new element clashes with an existing one and
NLM_F_EXCL flag isn't set, we want to ignore the -EEXIST error as long as
the data associated with duplicated element is the same as the existing
one. The data equality check uses memcmp.
For normal data (NFT_DATA_VALUE) this works fine, but for NFT_DATA_VERDICT
padding area leads to spurious failure even if the verdict data is the
same.
This then makes the insertion fail with 'already exists' error, even
though the new "key : data" matches an existing entry and userspace
told the kernel that it doesn't want to receive an error indication.
Fixes: c016c7e45d ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion")
Signed-off-by: Florian Westphal <fw@strlen.de>
This reverts commit 56a16035bb.
Since the previous commit, STP works on bridge in netns.
# unshare -n
# ip link add br0 type bridge
# ip link add veth0 type veth peer name veth1
# ip link set veth0 master br0 up
[ 50.558135] br0: port 1(veth0) entered blocking state
[ 50.558366] br0: port 1(veth0) entered disabled state
[ 50.558798] veth0: entered allmulticast mode
[ 50.564401] veth0: entered promiscuous mode
# ip link set veth1 master br0 up
[ 54.215487] br0: port 2(veth1) entered blocking state
[ 54.215657] br0: port 2(veth1) entered disabled state
[ 54.215848] veth1: entered allmulticast mode
[ 54.219577] veth1: entered promiscuous mode
# ip link set br0 type bridge stp_state 1
# ip link set br0 up
[ 61.960726] br0: port 2(veth1) entered blocking state
[ 61.961097] br0: port 2(veth1) entered listening state
[ 61.961495] br0: port 1(veth0) entered blocking state
[ 61.961653] br0: port 1(veth0) entered listening state
[ 63.998835] br0: port 2(veth1) entered blocking state
[ 77.437113] br0: port 1(veth0) entered learning state
[ 86.653501] br0: received packet on veth0 with own address as source address (addr:6e:0f:e7:6f:5f:5f, vlan:0)
[ 92.797095] br0: port 1(veth0) entered forwarding state
[ 92.797398] br0: topology change detected, propagating
Let's remove the warning.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now these upper layer protocol handlers can be called from llc_rcv()
as sap->rcv_func(), which is registered by llc_sap_open().
* function which is passed to register_8022_client()
-> no in-kernel user calls register_8022_client().
* snap_rcv()
`- proto->rcvfunc() : registered by register_snap_client()
-> aarp_rcv() and atalk_rcv() drop packets from non-root netns
* stp_pdu_rcv()
`- garp_protos[]->rcv() : registered by stp_proto_register()
-> garp_pdu_rcv() and br_stp_rcv() are netns-aware
So, we can safely remove the netns restriction in llc_rcv().
Fixes: e730c15519 ("[NET]: Make packet reception network namespace safe")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will remove this restriction in llc_rcv() in the following patch,
which means that the protocol handler must be aware of netns.
if (!net_eq(dev_net(dev), &init_net))
goto drop;
llc_rcv() fetches llc_type_handlers[llc_pdu_type(skb) - 1] and calls it
if not NULL.
If the PDU type is LLC_DEST_CONN, llc_conn_handler() is called to pass
skb to corresponding sockets. Then, we must look up a proper socket in
the same netns with skb->dev.
llc_conn_handler() calls __llc_lookup() to look up a established or
litening socket by __llc_lookup_established() and llc_lookup_listener().
Both functions iterate on a list and call llc_estab_match() or
llc_listener_match() to check if the socket is the correct destination.
However, these functions do not check netns.
Also, bind() and connect() call llc_establish_connection(), which
finally calls __llc_lookup_established(), to check if there is a
conflicting socket.
Let's test netns in llc_estab_match() and llc_listener_match().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will remove this restriction in llc_rcv() soon, which means that the
protocol handler must be aware of netns.
if (!net_eq(dev_net(dev), &init_net))
goto drop;
llc_rcv() fetches llc_type_handlers[llc_pdu_type(skb) - 1] and calls it
if not NULL.
If the PDU type is LLC_DEST_SAP, llc_sap_handler() is called to pass skb
to corresponding sockets. Then, we must look up a proper socket in the
same netns with skb->dev.
If the destination is a multicast address, llc_sap_handler() calls
llc_sap_mcast(). It calculates a hash based on DSAP and skb->dev->ifindex,
iterates on a socket list, and calls llc_mcast_match() to check if the
socket is the correct destination. Then, llc_mcast_match() checks if
skb->dev matches with llc_sk(sk)->dev. So, we need not check netns here.
OTOH, if the destination is a unicast address, llc_sap_handler() calls
llc_lookup_dgram() to look up a socket, but it does not check the netns.
Therefore, we need to add netns check in llc_lookup_dgram().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
By not setting IPS_CONFIRMED in tmpl that allows the exp not to be removed
from the hashtable when lookup, we can simplify the exp processing code a
lot in openvswitch conntrack.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With the following flows, the packets will be dropped if OVS TC offload is
enabled.
'ip,ct_state=-trk,in_port=1 actions=ct(zone=1)'
'ip,ct_state=+trk+new+rel,in_port=1 actions=ct(commit,zone=1)'
'ip,ct_state=+trk+new+rel,in_port=1 actions=ct(commit,zone=2),normal'
In the 1st flow, it finds the exp from the hashtable and removes it then
creates the ct with this exp in act_ct. However, in the 2nd flow it goes
to the OVS upcall at the 1st time. When the skb comes back from userspace,
it has to create the ct again without exp(the exp was removed last time).
With no 'rel' set in the ct, the 3rd flow can never get matched.
In OVS conntrack, it works around it by adding its own exp lookup function
ovs_ct_expect_find() where it doesn't remove the exp. Instead of creating
a real ct, it only updates its keys with the exp and its master info. So
when the skb comes back, the exp is still in the hashtable.
However, we can't do this trick in act_ct, as tc flower match is using a
real ct, and passing the exp and its master info to flower parsing via
tc_skb_cb is also not possible (tc_skb_cb size is not big enough).
The simple and clear fix is to not remove the exp at the 1st flow, namely,
not set IPS_CONFIRMED in tmpl when commit is not set in act_ct.
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently nf_conntrack_in() calling nf_ct_find_expectation() will
remove the exp from the hash table. However, in some scenario, we
expect the exp not to be removed when the created ct will not be
confirmed, like in OVS and TC conntrack in the following patches.
This patch allows exp not to be removed by setting IPS_CONFIRMED
in the status of the tmpl.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If the user set an MTU value, it usually means that there are special
requirements for the MTU. But if an interface gots activated, the MTU was
always recalculated and then the user set value was overwritten.
The only reason why this user set value has to be overwritten, is when the
MTU has to be decreased because batman-adv is not able to transfer packets
with the user specified size.
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Cc: stable@vger.kernel.org
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
If an interface changes the MTU, it is expected that an NETDEV_PRECHANGEMTU
and NETDEV_CHANGEMTU notification events is triggered. This worked fine for
.ndo_change_mtu based changes because core networking code took care of it.
But for auto-adjustments after hard-interfaces changes, these events were
simply missing.
Due to this problem, non-batman-adv components weren't aware of MTU changes
and thus couldn't perform their own tasks correctly.
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Cc: stable@vger.kernel.org
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
After commit d2ccd7bc8a ("tcp: avoid resetting ACK timer in DCTCP"),
tcp_enter_quickack_mode() is only used from net/ipv4/tcp_input.c.
Fixes: d2ccd7bc8a ("tcp: avoid resetting ACK timer in DCTCP")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20230718162049.1444938-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In most cases UDP sockets use the default data ready callback.
Leverage the indirect call wrapper for such callback to avoid an
indirect call in fastpath.
The above gives small but measurable performance gain under UDP flood.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/d47d53e6f8ee7a11228ca2f025d6243cc04b77f3.1689691004.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 3f4ca5fafc.
Commit 3f4ca5fafc ("tcp: avoid the lookup process failing to get sk in
ehash table") reversed the order in how a socket is inserted into ehash
to fix an issue that ehash-lookup could fail when reqsk/full sk/twsk are
swapped. However, it introduced another lookup failure.
The full socket in ehash is allocated from a slab with SLAB_TYPESAFE_BY_RCU
and does not have SOCK_RCU_FREE, so the socket could be reused even while
it is being referenced on another CPU doing RCU lookup.
Let's say a socket is reused and inserted into the same hash bucket during
lookup. After the blamed commit, a new socket is inserted at the end of
the list. If that happens, we will skip sockets placed after the previous
position of the reused socket, resulting in ehash lookup failure.
As described in Documentation/RCU/rculist_nulls.rst, we should insert a
new socket at the head of the list to avoid such an issue.
This issue, the swap-lookup-failure, and another variant reported in [0]
can all be handled properly by adding a locked ehash lookup suggested by
Eric Dumazet [1].
However, this issue could occur for every packet, thus more likely than
the other two races, so let's revert the change for now.
Link: https://lore.kernel.org/netdev/20230606064306.9192-1-duanmuquan@baidu.com/ [0]
Link: https://lore.kernel.org/netdev/CANn89iK8snOz8TYOhhwfimC7ykYA78GA3Nyv8x06SZYa1nKdyA@mail.gmail.com/ [1]
Fixes: 3f4ca5fafc ("tcp: avoid the lookup process failing to get sk in ehash table")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230717215918.15723-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmS3s90ACgkQrB3Eaf9P
W7dl5A/+N59cq4fk7k8V4KQ/MWIB42quzq6lem8nqSams7tGaug1AyJkr4urZdjg
DwlmpGHEw4yWmHiOdzbPwgqOsqYyhzbcwU2RPNuDgdtkJm6CigbddJbIFfx8HvMu
zKLtnKJ7QDGf0TdVjuTcx0DQjsK4rxPzr445iZmvZ3g9rcomkxQZ/LDozjH2QZhc
LYkLrjt09kqLpd5Dkco5/ZuAYGJF4ujAHNTDVkdCuRQFmb+WsfLpzcpT5AtJLGeU
Q/9TfZwI3YH0AbfdpBL8qhDAVYzwTcx6ldWZLbSJNSUUTswg0H7tF+JyBR3wt626
9B+qsSWEvLuA0WFIKdOz484Dt/h3ggTgoEEDVjDEaDLenx3ESx3A8jv4f7IL2EMs
edjTSphj8ZTQv4uw/1CBNIdTbujEtvf4FhkagSvrUoUYOGvobcvqvsvCGzac0d+K
5x7bMSJU9Cnv8YYAB8xbhIN5LqE1+1bPiNHk4xAxW5qfYvO0E2rjwd52DYgCeV2G
GRDbnhAmtrk9HHdtAelWSZyI6tJJNG0H43KSjBUNEdHcVB2bz8iGLHFr72+cAfKp
I0nR3EoGYKQGZHIXiAtqSZIbEHlpU080pclT5WDEJJG0yLR0PwnDiVTrEI2AngCP
dkSvJCi5wwMCx+HljBY0JyH8CDDu3YS1SpfXzEe5lqset4ai4CI=
=e8vA
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2023-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2023-07-19
Just a leftover from the last development cycle:
1) delete a clear to zero of encap_oa, it is not needed anymore.
From Leon Romanovsky.
* tag 'ipsec-next-2023-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: delete not-needed clear to zero of encap_oa
====================
Link: https://lore.kernel.org/r/20230719100228.373055-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmS4IUIACgkQ6rmadz2v
bTrVCw/9GG5A5ebqwoh/DrsFXEzpKDmZFIAWd5wB+Fx2i8y+6Jl/Fw6SjkkAtUnc
215T3YX2u3Xg1WFC5zxY9lYm2OeMq2lPHVwjlqgt/pHE8D6b8cZ44eyN+f0ZSiLy
wyx0wHLd3oP4KvMyiqm7/ZmhDjAtBpuqMjY5FNsbUxrIGUUI2ZLC4VFVWhnWmzRA
eEOQuUge4e1YD62kfkWlT/GEv710ysqFZD2zs4yhevDfmr/6DAIaA7dhfKMYsM/S
hCPoCuuXWVoHiqksm0U1BwpEiAQrqR91Sx8RCAakw5Pyp5hkj9dJc9sLwkgMH/k7
2352IIPXddH8cGKQM+hIBrc/io+6MxMbVk7Pe+1OUIBrvP//zQrHWk0zbssF3D8C
z6TbxBLdSzbDELPph3gZu5bNaLSkpuODhNjLcIVGSOeSJ5nsgATCQtXFAAPV0E/Q
v2O7Te5aTjTOpFMcIrIK1eWXUS56yRA+YwDa1VuWXAiLrr+Rq0tm4tBqxhof3KlH
bfCoqFNa12MfpCJURHICcV7DJo53rWbCtDSJPaYwZXb/jJPd3gPb8EVixoLN2A1M
dV/ou9rKEEkJXxsZ4Bctuh7t5YwpqxTq74YSdvnkOJ8P1lBDYST2SfHgQVOayQPv
XH9MlMO3Qtb9Sl0ZiI7gHbpK7h6v9RvRuHJcnN2e3wwMEx256xE=
=VRCb
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-07-19
We've added 45 non-merge commits during the last 3 day(s) which contain
a total of 71 files changed, 7808 insertions(+), 592 deletions(-).
The main changes are:
1) multi-buffer support in AF_XDP, from Maciej Fijalkowski,
Magnus Karlsson, Tirthendu Sarkar.
2) BPF link support for tc BPF programs, from Daniel Borkmann.
3) Enable bpf_map_sum_elem_count kfunc for all program types,
from Anton Protopopov.
4) Add 'owner' field to bpf_rb_node to fix races in shared ownership,
Dave Marchevsky.
5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
bpf, net: Introduce skb_pointer_if_linear().
bpf: sync tools/ uapi header with
selftests/bpf: Add mprog API tests for BPF tcx links
selftests/bpf: Add mprog API tests for BPF tcx opts
bpftool: Extend net dump with tcx progs
libbpf: Add helper macro to clear opts structs
libbpf: Add link-based API for tcx
libbpf: Add opts-based attach/detach/query API for tcx
bpf: Add fd-based tcx multi-prog infra with link support
bpf: Add generic attach/detach/query API for multi-progs
selftests/xsk: reset NIC settings to default after running test suite
selftests/xsk: add test for too many frags
selftests/xsk: add metadata copy test for multi-buff
selftests/xsk: add invalid descriptor test for multi-buffer
selftests/xsk: add unaligned mode test for multi-buffer
selftests/xsk: add basic multi-buffer test
selftests/xsk: transmit and receive multi-buffer packets
xsk: add multi-buffer documentation
i40e: xsk: add TX multi-buffer support
ice: xsk: Tx multi-buffer support
...
====================
Link: https://lore.kernel.org/r/20230719175424.75717-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Modify xskq_cons_read_desc_batch() in a way that each processed
descriptor will be checked if it is an EOP one or not and act
accordingly to that.
Change the behavior of mentioned function to break the processing when
stumbling upon invalid descriptor instead of skipping it. Furthermore,
let us give only full packets down to ZC driver.
With these two assumptions ZC drivers will not have to take care of an
intermediate state of incomplete frames, which will simplify its
implementations a lot.
Last but not least, stop processing when count of frags would exceed
max supported segments on underlying device.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-15-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Given that skb_shared_info relies on skb_frag_t, in order to support
xskb chaining, introduce xdp_buff_xsk::xskb_list_node and
xsk_buff_pool::xskb_list.
This is needed so ZC drivers can add frags as xskb nodes which will make
it possible to handle it both when producing AF_XDP Rx descriptors as
well as freeing/recycling all the frags that a single frame carries.
Speaking of latter, update xsk_buff_free() to take care of list nodes.
For the former (adding as frags), introduce xsk_buff_add_frag() for ZC
drivers usage that is going to be used to add a frag to xskb list from
pool.
xsk_buff_get_frag() will be utilized by XDP_TX and, on contrary, will
return xdp_buff.
One of the previous patches added a wrapper for ZC Rx so implement xskb
list walk and production of Rx descriptors there.
On bind() path, bail out if socket wants to use ZC multi-buffer but
underlying netdev does not support it.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-12-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
carry maximum fragments that underlying ZC driver is able to handle on
TX side. It is going to be included in netlink response only when driver
supports ZC. Any value higher than 1 implies multi-buffer ZC support on
underlying device.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Descriptors with zero length are not supported by many NICs. To preserve
uniform behavior discard any zero length desc as invvalid desc.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-10-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For transmitting an AF_XDP packet, allocate skb while processing the
first desc and copy data to it. The 'XDP_PKT_CONTD' flag in 'options'
field of the desc indicates the EOP status of the packet. If the current
desc is not EOP, store the skb, release the current desc and go
on to read the next descs.
Allocate a page for each subsequent desc, copy data to it and add it as
a frag in the skb stored in xsk. On processing EOP, transmit the skb
with frags. Addresses contained in descs have been already queued in
consumer queue and skb destructor updated the completion count.
On transmit failure cancel the releases, clear the descs from the
completion queue and consume the skb for retrying packet transmission.
For any invalid descriptor (invalid length/address/options) in the middle
of a packet, all pending descriptors will be dropped by xsk core along
with the invalid one and the next descriptor is treated as the start of
a new packet.
Maximum supported frames for a packet is MAX_SKB_FRAGS + 1. If it is
exceeded, all descriptors accumulated so far are dropped.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-9-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In Tx path, xsk core reserves space for each desc to be transmitted in
the completion queue and it's address contained in it is stored in the
skb destructor arg. After successful transmission the skb destructor
submits the addr marking completion.
To handle multiple descriptors per packet, now along with reserving
space for each descriptor, the corresponding address is also stored in
completion queue. The number of pending descriptors are stored in skb
destructor arg and is used by the skb destructor to update completions.
Introduce 'skb' in xdp_sock to store a partially built packet when
__xsk_generic_xmit() must return before it sees the EOP descriptor for
the current packet so that packet building can resume in next call of
__xsk_generic_xmit().
Helper functions are introduced to set and get the pending descriptors
in the skb destructor arg. Also, wrappers are introduced for storing
descriptor addresses, submitting and cancelling (for unsuccessful
transmissions) the number of completions.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-7-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add multi-buffer support for AF_XDP by extending the XDP multi-buffer
support to be reflected in user-space when a packet is redirected to
an AF_XDP socket.
In the XDP implementation, the NIC driver builds the xdp_buff from the
first frag of the packet and adds any subsequent frags in the skb_shinfo
area of the xdp_buff. In AF_XDP core, XDP buffers are allocated from
xdp_sock's pool and data is copied from the driver's xdp_buff and frags.
Once an allocated XDP buffer is full and there is still data to be
copied, the 'XDP_PKT_CONTD' flag in'options' field of the corresponding
xdp ring descriptor is set and passed to the application. When application
sees the aforementioned flag set it knows there is pending data for this
packet that will be carried in the following descriptors. If there is no
more data to be copied, the flag in 'options' field is cleared for that
descriptor signalling EOP to the application.
If application reads a batch of descriptors using for example the libxdp
interfaces, it is not guaranteed that the batch will end with a full
packet. It might end in the middle of a packet and the rest of the frames
of that packet will arrive at the beginning of the next batch.
AF_XDP ensures that only a complete packet (along with all its frags) is
sent to application.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-6-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>