This patch defines two packet scheduler wrappers, mptcp_sched_get_send()
and mptcp_sched_get_retrans(), which invoke the get_subflow() callback of
msk->sched.
data->reinject is set to true in mptcp_sched_get_retrans() and to false
in mptcp_sched_get_send().
If msk->sched is NULL, fall back to the default functions
mptcp_subflow_get_send() and mptcp_subflow_get_retrans() to pick the
subflow for sending data.
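A rough sketch of the send-side wrapper (return convention, locking and
the mptcp_sched_data setup are simplified; the retransmit variant only
differs in the reinject flag and in the default helper it falls back to):

  int mptcp_sched_get_send(struct mptcp_sock *msk)
  {
      struct mptcp_sched_data data = {
          .reinject = false,
      };

      /* no scheduler attached: keep using the in-kernel default picker */
      if (!msk->sched)
          return mptcp_subflow_get_send(msk) ? 0 : -EINVAL;

      return msk->sched->get_subflow(msk, &data);
  }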
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-7-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new member, 'scheduled', to struct mptcp_subflow_context.
It is set from the MPTCP scheduler context when the scheduler picks this
subflow to send data.
Add a new helper mptcp_subflow_set_scheduled() to set this flag using
WRITE_ONCE().
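The helper is essentially a one-liner; a sketch:

  void mptcp_subflow_set_scheduled(struct mptcp_subflow_context *subflow,
                                   bool scheduled)
  {
      /* WRITE_ONCE() pairs with lockless readers of subflow->scheduled */
      WRITE_ONCE(subflow->scheduled, scheduled);
  }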
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-6-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new member, 'sched', to struct mptcp_sock, together with
two helpers, mptcp_init_sched() and mptcp_release_sched(), to initialize
and release it.
Initialize it with the sysctl scheduler in mptcp_init_sock(), copy the
scheduler from the parent in mptcp_sk_clone(), and release it in
__mptcp_destroy_sock().
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-5-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new sysctl, named scheduler, to allow selection of a
different packet scheduler. Export the mptcp_get_scheduler() helper to
read this sysctl.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-4-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch defines struct mptcp_sched_ops, which has three struct members,
name, owner and list, and three function pointers: init(), release() and
get_subflow().
The scheduler function get_subflow() has a struct mptcp_sched_data
parameter, which contains a reinject flag (indicating a retransmission or
not), the number of subflows and an mptcp_subflow_context array.
Add the scheduler registering, unregistering and finding functions to add,
delete and find a packet scheduler on the global list mptcp_sched_list.
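A sketch of how the ops structure and the registration primitives described
above might look (the exact field set, constants and locking in the actual
patch may differ; this is only meant to illustrate the shape):

  struct mptcp_sched_ops {
      int (*get_subflow)(struct mptcp_sock *msk,
                         struct mptcp_sched_data *data);
      void (*init)(struct mptcp_sock *msk);
      void (*release)(struct mptcp_sock *msk);

      char             name[MPTCP_SCHED_NAME_MAX];
      struct module    *owner;
      struct list_head list;
  };

  static DEFINE_SPINLOCK(mptcp_sched_list_lock);
  static LIST_HEAD(mptcp_sched_list);

  /* caller holds mptcp_sched_list_lock (or otherwise keeps the list stable) */
  struct mptcp_sched_ops *mptcp_sched_find(const char *name)
  {
      struct mptcp_sched_ops *sched;

      list_for_each_entry(sched, &mptcp_sched_list, list) {
          if (!strcmp(sched->name, name))
              return sched;
      }
      return NULL;
  }

  int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
  {
      if (!sched->get_subflow)
          return -EINVAL;

      spin_lock(&mptcp_sched_list_lock);
      if (mptcp_sched_find(sched->name)) {
          spin_unlock(&mptcp_sched_list_lock);
          return -EEXIST;
      }
      list_add_tail(&sched->list, &mptcp_sched_list);
      spin_unlock(&mptcp_sched_list_lock);
      return 0;
  }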
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-3-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the burst check conditions have been moved out of the function
mptcp_subflow_get_send(), msk->last_snd is no longer needed.
This patch drops all its uses as well as the macro MPTCP_RESET_SCHEDULER.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-2-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To support redundant packet schedulers more easily, this patch refactors
the __mptcp_push_pending() logic from:
For each dfrag:
    While sends succeed:
        Call the scheduler (selects subflow and msk->snd_burst)
        Update subflow locks (push/release/acquire as needed)
        Send the dfrag data with mptcp_sendmsg_frag()
        Update already_sent, snd_nxt, snd_burst
    Update msk->first_pending
Push/release on final subflow

->

While first_pending isn't empty:
    Call the scheduler (selects subflow and msk->snd_burst)
    Update subflow locks (push/release/acquire as needed)
    For each pending dfrag:
        While sends succeed:
            Send the dfrag data with mptcp_sendmsg_frag()
            Update already_sent, snd_nxt, snd_burst
        Update msk->first_pending
        Break if required by msk->snd_burst / etc
    Push/release on final subflow
It also refactors the __mptcp_subflow_push_pending() logic from:
For each dfrag:
    While sends succeed:
        Call the scheduler (selects subflow and msk->snd_burst)
        Send the dfrag data with mptcp_subflow_delegate(), break
        Send the dfrag data with mptcp_sendmsg_frag()
        Update dfrag->already_sent, msk->snd_nxt, msk->snd_burst
    Update msk->first_pending

->

While first_pending isn't empty:
    Call the scheduler (selects subflow and msk->snd_burst)
    Send the dfrag data with mptcp_subflow_delegate(), break
    Send the dfrag data with mptcp_sendmsg_frag()
    For each pending dfrag:
        While sends succeed:
            Send the dfrag data with mptcp_sendmsg_frag()
            Update already_sent, snd_nxt, snd_burst
        Update msk->first_pending
        Break if required by msk->snd_burst / etc
Move the duplicate code from __mptcp_push_pending() and
__mptcp_subflow_push_pending() into a new helper function, named
__subflow_push_pending(). Simplify __mptcp_push_pending() and
__mptcp_subflow_push_pending() by invoking this helper.
Also move the burst check conditions out of the function
mptcp_subflow_get_send(), check them in __subflow_push_pending() in
the inner "for each pending dfrag" loop.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-1-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The automatic recalculation of the maximum allowed MTU is usually triggered
from code sections whose callers outside of batman-adv already hold the
rtnl lock. But when the fragmentation setting is changed via batman-adv's
own batadv genl family, the rtnl lock is not yet taken.
dev_set_mtu() requires that the caller holds the rtnl lock because it uses
netdevice notifiers, so this code path then fails the lock assertion:
RTNL: assertion failed at net/core/dev.c (1953)
Cc: stable@vger.kernel.org
Reported-by: syzbot+f8812454d9b3ac00d282@syzkaller.appspotmail.com
Fixes: c6a953cce8 ("batman-adv: Trigger events for auto adjusted MTU")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230821-batadv-missing-mtu-rtnl-lock-v1-1-1c5a7bfe861e@narfation.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit ee8b94c851 ("can: raw: fix receiver memory leak") introduced
a new reference to the CAN netdevice that has assigned CAN filters.
But this new ro->dev reference did not maintain its own refcount, which
led to another KASAN use-after-free splat found by Eric Dumazet.
This patch ensures a proper refcount for the CAN netdevice.
Fixes: ee8b94c851 ("can: raw: fix receiver memory leak")
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20230821144547.6658-3-socketcan@hartkopp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The original implementation had very simple handling for single frame
transmissions as it just sent the single frame without any timeout handling.
With the new echo frame handling, the echo frame was also introduced for
single frames, but the former exception ('simple without timers') was
maintained by accident. This leads to a 1 second timeout when closing the
socket and to an -ECOMM error when CAN_ISOTP_WAIT_TX_DONE is selected.
As the echo handling is always active (also for single frames) remove the
wrong extra condition for single frames.
Fixes: 9f39d36530 ("can: isotp: add support for transmission without flow control")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20230821144547.6658-2-socketcan@hartkopp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nl80211_pmsr_dump_results() has never been implemented since it was added
in commit 9bb7e0f24e ("cfg80211: add peer measurement with FTM initiator
API").
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230729121651.36836-1-yuehaibing@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit ccf80ddfe4 ("mac80211: mesh function and data structures
definitions") introduced this but never implemented it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230731140712.1204-1-yuehaibing@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The previous commit dd3e4fc75b ("nl80211/cfg80211: add BSS color to
NDP ranging parameters") added a parameter for NDP ranging by introducing
a new attribute type named NL80211_PMSR_FTM_REQ_ATTR_BSS_COLOR.
However, the author forgot to also describe the nla_policy at
nl80211_pmsr_ftm_req_attr_policy (net/wireless/nl80211.c). Add the missing
policy entry to avoid a malformed attribute causing an out-of-attribute
access.
Fixes: dd3e4fc75b ("nl80211/cfg80211: add BSS color to NDP ranging parameters")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230809033151.768910-1-linma@zju.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Code inspection reveals that we switch the puncturing bitmap
before the real channel switch, since that happens only in
the second round of the worker after the channel context is
switched by ieee80211_link_use_reserved_context().
Fixes: 2cc25e4b2a ("wifi: mac80211: configure puncturing bitmap")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When probing a client, first check if we have it, and then
check for the channel context, otherwise you can trigger
the warning there easily by probing when the AP isn't even
started yet. Since a client existing means the AP is also
operating, we can then keep the warning.
Also simplify the moved code a bit.
Reported-by: syzbot+999fac712d84878a7379@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If there's no OCB state, don't ask the driver/mac80211 to
leave, since that's just confusing. Since we set/clear the
chandef state, that's a simple check.
Reported-by: syzbot+09d1cd2f71e6dd3bfd2c@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the AP uses our own address as its MLD address or BSSID, then
clearly something's wrong. Reject such connections so we don't
try and fail later.
Reported-by: syzbot+2676771ed06a6df166ad@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Merge tag 'wireless-2023-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Two fixes:
- reorder buffer filter checks can cause bad shift/UBSAN
warning with newer HW, avoid the check (mac80211)
- add Kconfig dependency for iwlwifi for PTP clock usage
* tag 'wireless-2023-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: limit reorder_buf_filtered to avoid UBSAN warning
wifi: iwlwifi: mvm: add dependency for PTP clock
====================
Link: https://lore.kernel.org/r/20230822124206.43926-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Highlights include:
Stable fixes
- NFS: Fix a use after free in nfs_direct_join_group()
Bugfixes
- NFS: Fix a sysfs server name memory leak
- NFS: Fix a lock recovery hang in NFSv4.0
- NFS: Fix page free in the error path for nfs42_proc_getxattr
- NFS: Fix page free in the error path for __nfs4_get_acl_uncached
- SUNRPC/rdma: Fix receive buffer dma-mapping after a server disconnect
Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- fix a use after free in nfs_direct_join_group() (Cc: stable)
- fix sysfs server name memory leak
- fix lock recovery hang in NFSv4.0
- fix page free in the error path for nfs42_proc_getxattr() and
__nfs4_get_acl_uncached()
- SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect
* tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
xprtrdma: Remap Receive buffers after a reconnect
NFSv4: fix out path in __nfs4_get_acl_uncached
NFSv4.2: fix error handling in nfs42_proc_getxattr
NFS: Fix sysfs server name memory leak
NFS: Fix a use after free in nfs_direct_join_group()
NFSv4: Fix dropped lock for racing OPEN and delegation return
Abort early so the task can exit faster if a fatal signal is pending;
there is no need to continue validation in that case.
Signed-off-by: Florian Westphal <fw@strlen.de>
Prefer `strscpy_pad` as it's a more robust interface whilst maintaining
zero-padding behavior.
There may have existed a bug here due to both `tbl->repl.name` and
`info->name` having a size of 32 as defined below:
| #define XT_TABLE_MAXNAMELEN 32
This may lead to buffer overreads in some situations -- `strscpy` solves
this by guaranteeing NUL-termination of the dest buffer.
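For illustration, the change boils down to the following pattern (the names
come from the description above; the destination size is taken from the
destination buffer, the result is always NUL-terminated and the tail is
zero-padded):

  /* before: no termination guarantee when src fills all 32 bytes */
  strncpy(info->name, tbl->repl.name, XT_TABLE_MAXNAMELEN);

  /* after: truncates safely, terminates, and pads the remainder with NULs */
  strscpy_pad(info->name, tbl->repl.name, sizeof(info->name));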
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use `strscpy_pad` over `strncpy` for NUL-terminated strings.
We can also drop the + 1 from `NFT_OSF_MAXGENRELEN + 1` since `strscpy`
will guarantee NUL-termination.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
When compiling with gcc 13 and CONFIG_FORTIFY_SOURCE=y, the following
warning appears:
In function ‘fortify_memcpy_chk’,
inlined from ‘size_entry_mwt’ at net/bridge/netfilter/ebtables.c:2118:2:
./include/linux/fortify-string.h:592:25: error: call to ‘__read_overflow2_field’
declared with attribute warning: detected read beyond size of field (2nd parameter);
maybe use struct_group()? [-Werror=attribute-warning]
592 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The compiler is complaining:
memcpy(&offsets[1], &entry->watchers_offset,
sizeof(offsets) - sizeof(offsets[0]));
where memcpy reads beyond &entry->watchers_offset to copy
{watchers,target,next}_offset altogether into offsets[]. Silence the
warning by wrapping these three up via struct_group().
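A sketch of the change (member names taken from the warning above; the real
struct ebt_entry has more members and the grouped field may be named
differently):

  struct ebt_entry {
      unsigned int bitmask;
      /* grouped so a fortified memcpy() sees one field covering all three */
      struct_group(offsets,
          unsigned int watchers_offset;
          unsigned int target_offset;
          unsigned int next_offset;
      );
  };

  /* the copy then reads the whole group instead of past a single member */
  memcpy(&offsets[1], &entry->offsets, sizeof(entry->offsets));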
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
When the ifdown callback in the dst_ops structure is invoked, the input
parameter 'how' is always true. In the current implementations,
ip6_dst_ifdown does not use the parameter at all, while xfrm6_dst_ifdown
and xfrm4_dst_ifdown use it as 'unregister'; since 'unregister' can never
be false there, the corresponding branch is dead code. So remove the 'how'
input parameter from the ifdown callbacks.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230821084104.3812233-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct cfg80211_rnr_elems.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
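Schematically, the annotation and ordering requirement look like this
(member layout abbreviated from the cfg80211 struct named above; the
essential parts are the __counted_by(cnt) annotation and writing cnt before
touching the flexible array):

  struct cfg80211_rnr_elems {
      u8 cnt;
      struct {
          const u8 *data;
          size_t len;
      } elem[] __counted_by(cnt);  /* run-time bounds checks use ->cnt */
  };

  /* the element count must be set before the flexible array is accessed */
  elems->cnt = n;
  for (i = 0; i < n; i++)
      elems->elem[i].len = lens[i];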
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20230817211531.4193219-5-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct cfg80211_pmsr_request.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20230817211531.4193219-4-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct cfg80211_mbssid_elems.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20230817211531.4193219-3-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct cfg80211_cqm_config.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20230817211531.4193219-2-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct cfg80211_acl_data.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20230817211531.4193219-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an iov_iter to the unions in ceph_msg_data and ceph_msg_data_cursor.
Instead of requiring a list of pages or bvecs, we can just use an
iov_iter directly, and avoid extra allocations.
We assume that the pages represented by the iter are pinned such that
they shouldn't incur page faults, which is the case for the iov_iters
created by netfs.
While working on this, Al Viro informed me that he was going to change
iov_iter_get_pages to auto-advance the iterator as that pattern is more
or less required for ITER_PIPE anyway. We emulate that here for now by
advancing in the _next op and tracking that amount in the "lastlen"
field.
In the event that _next is called twice without an intervening
_advance, we revert the iov_iter by the remaining lastlen before
calling iov_iter_get_pages.
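Roughly, the cursor bookkeeping described above amounts to the following
(the cursor field names follow the description; only iov_iter_revert() is
an existing kernel API here, the rest is a sketch):

  /* _next advanced the iter by 'lastlen' last time; if no _advance() ran
   * in between, rewind before asking iov_iter_get_pages for pages again */
  if (cursor->lastlen)
      iov_iter_revert(&cursor->iter, cursor->lastlen);

  /* ... call iov_iter_get_pages here, then remember how far it advanced */
  cursor->lastlen = len;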
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Have get_reply check for the presence of sparse read ops in the
request and set the sparse_read boolean in the msg. That will cue the
messenger layer to use the sparse read codepath instead of the normal
data receive path.
Add a new sparse_read operation for the OSD client, driven by its own
state machine. The messenger will repeatedly call the sparse_read
operation, and it will pass back the necessary info to set up to read
the next extent of data, while zero-filling the sparse regions.
The state machine will stop at the end of the last extent, and will
attach the extent map buffer to the ceph_osd_req_op so that the caller
can use it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add 2 new fields to ceph_connection_v1_info to track the necessary info
in sparse reads. Skip initializing the cursor for a sparse read.
Break out read_partial_message_section into a wrapper around a new
read_partial_message_chunk function that doesn't zero out the crc first.
Add new helper functions to drive receiving into the destinations
provided by the sparse_read state machine.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add a new init_sgs_pages helper that populates the scatterlist from
an arbitrary point in an array of pages.
Change setup_message_sgs to take an optional pointer to an array of
pages. If that's set, then the scatterlist will be set using that
array instead of the cursor.
When given a sparse read on a secure connection, decrypt the data
in-place rather than into the final destination, by passing it the
in_enc_pages array.
After decrypting, run the sparse_read state machine in a loop, copying
data from the decrypted pages until it's complete.
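A hypothetical sketch of such a helper (name, signature and bounds handling
assumed, not taken from the patch; it only illustrates walking a page array
from an arbitrary byte position):

  static void init_sgs_pages(struct scatterlist *sgs, struct page **pages,
                             int dpos, int dlen)
  {
      int idx = dpos >> PAGE_SHIFT;
      int off = offset_in_page(dpos);

      while (dlen > 0) {
          int len = min_t(int, dlen, PAGE_SIZE - off);

          /* map the next chunk of the page array into the scatterlist */
          sg_set_page(sgs, pages[idx++], len, off);
          sgs = sg_next(sgs);
          off = 0;
          dlen -= len;
      }
  }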
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add support for a new sparse_read ceph_connection operation. The idea is
that the client driver can define this operation and use it to do special
handling for incoming reads.
The alloc_msg routine will look at the request and determine whether the
reply is expected to be sparse. If it is, then we'll dispatch to a
different set of state machine states that will repeatedly call the
driver's sparse_read op to get length and placement info for reading the
extent map, and the extents themselves.
This necessitates adding some new fields to other structs:
- The msg gets a new bool to track whether it's a sparse_read request.
- A new field is added to the cursor to track the amount remaining in the
current extent. This is used to cap the read from the socket into the
msg_data.
- Handling a revoke with all of this is particularly difficult, so I've
added a new data_len_remain field to the v2 connection info, and then
use that to skip that much on a revoke. We may want to expand the use of
that to the normal read path as well, just for consistency's sake.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When the OSD sends back a sparse read reply, it contains an array of
these structures. Define the structure and add a couple of helpers for
dealing with them.
Also add a place in struct ceph_osd_req_op to store the extent buffer,
and code to free it if it's populated when the req is torn down.
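The on-the-wire extent is simply an (offset, length) pair; a sketch of the
structure and one of the kind of small helpers mentioned above (field and
helper names assumed, not necessarily the upstream identifiers):

  struct ceph_sparse_extent {
      u64 off;
      u64 len;
  } __packed;

  /* sum of all extent lengths == amount of real data returned by the OSD */
  static u64 ceph_sparse_data_len(const struct ceph_sparse_extent *ext, u32 cnt)
  {
      u64 total = 0;
      u32 i;

      for (i = 0; i < cnt; i++)
          total += ext[i].len;
      return total;
  }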
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In a later patch, we're going to need to search for a request in
the rbtree, but taking the o_mutex is inconvenient as we already
hold the con mutex at the point where we need it.
Add a new spinlock that we take when inserting and erasing entries from
the o_requests tree. Search of the rbtree can be done with either the
mutex or the spinlock, but insertion and removal requires both.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The commit 06470f7468 ("mac80211: add API to allow filtering frames in BA sessions")
added reorder_buf_filtered to mark frames filtered by firmware, and it
can only work correctly if hw.max_rx_aggregation_subframes <= 64 since
it stores the bitmap in a u64 variable.
However, new HE or EHT devices can support a BlockAck number of up to 256
or 1024, and then using a higher subframe index leads to a UBSAN warning:
UBSAN: shift-out-of-bounds in net/mac80211/rx.c:1129:39
shift exponent 215 is too large for 64-bit type 'long long unsigned int'
Call Trace:
<IRQ>
dump_stack_lvl+0x48/0x70
dump_stack+0x10/0x20
__ubsan_handle_shift_out_of_bounds+0x1ac/0x360
ieee80211_release_reorder_frame.constprop.0.cold+0x64/0x69 [mac80211]
ieee80211_sta_reorder_release+0x9c/0x400 [mac80211]
ieee80211_prepare_and_rx_handle+0x1234/0x1420 [mac80211]
ieee80211_rx_list+0xaef/0xf60 [mac80211]
ieee80211_rx_napi+0x53/0xd0 [mac80211]
Since only old hardware that supports <= 64 BlockAck frames uses
ieee80211_mark_rx_ba_filtered_frames(), keep that limit as it is, but add a
WARN_ONCE() and a comment noting that this function must not be used if the
hardware capability exceeds it.
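The guard amounts to something like the following at the top of
ieee80211_mark_rx_ba_filtered_frames() (how the hw pointer is reached from
the function's arguments, and the exact message, are approximated here):

  /* reject hardware whose BlockAck window exceeds what the u64 bitmap
   * in reorder_buf_filtered can represent */
  if (WARN_ONCE(hw->max_rx_aggregation_subframes > 64,
                "RX BA marking only supports up to 64 subframes\n"))
      return;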
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Link: https://lore.kernel.org/r/20230818014004.16177-1-pkshih@realtek.com
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
sk_getsockopt() runs locklessly. This means sk->sk_lingertime
can be read while other threads are changing its value.
Other reads also happen without the socket lock being held,
and must be annotated.
Remove preprocessor logic using BITS_PER_LONG, compilers
are smart enough to figure this by themselves.
v2: fixed a clang W=1 (-Wtautological-constant-out-of-range-compare) warning
(Jakub)
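The annotation follows the usual pattern for lockless readers; roughly
(simplified, clamping and the exact getsockopt variable names omitted):

  /* setter, in sock_setsockopt() under lock_sock() */
  WRITE_ONCE(sk->sk_lingertime, (unsigned long)ling.l_linger * HZ);

  /* lockless getter, in sk_getsockopt() */
  ling.l_linger = READ_ONCE(sk->sk_lingertime) / HZ;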
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack info for IPv4 address add/delete, which would be useful for
users to understand the problem without having to read kernel code.
No extack message is added for the ifa_local check in __inet_insert_ifa(),
as it has already been checked in find_matching_ifa().
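Schematically, each validation failure now attaches a human-readable reason
to the netlink extack (the exact message strings in the patch may differ):

  if (ifm->ifa_prefixlen > 32) {
      NL_SET_ERR_MSG(extack, "ipv4: Invalid prefix length");
      return -EINVAL;
  }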
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After deleting an IPv6 address on an interface and cleaning up the
related preferred source entries, it is important to ensure that all
routes associated with the deleted address are properly cleared. The
current implementation of rt6_remove_prefsrc() only checks the preferred
source addresses bound to the current device. However, there may be
routes that are bound to other devices but still utilize the same
preferred source address.
To address this issue, it is necessary to also delete entries that are
bound to other interfaces but share the same source address with the
current device. Failure to delete these entries would leave routes that
are bound to the deleted address uncleared. Here is an example reproducer
(unrelated routes omitted):
+ ip link add dummy1 type dummy
+ ip link add dummy2 type dummy
+ ip link set dummy1 up
+ ip link set dummy2 up
+ ip addr add 1:2:3:4::5/64 dev dummy1
+ ip route add 7:7:7:0::1 dev dummy1 src 1:2:3:4::5
+ ip route add 7:7:7:0::2 dev dummy2 src 1:2:3:4::5
+ ip -6 route show
1:2:3:4::/64 dev dummy1 proto kernel metric 256 pref medium
7:7:7::1 dev dummy1 src 1:2:3:4::5 metric 1024 pref medium
7:7:7::2 dev dummy2 src 1:2:3:4::5 metric 1024 pref medium
+ ip addr del 1:2:3:4::5/64 dev dummy1
+ ip -6 route show
7:7:7::1 dev dummy1 metric 1024 pref medium
7:7:7::2 dev dummy2 src 1:2:3:4::5 metric 1024 pref medium
As Ido reminds us, in IPv6 the preferred source address is looked up in
the same VRF as the first nexthop device, which is different from IPv4.
So, while removing the device check, we also need to add an
ipv6_chk_addr() check to make sure the address does not exist on the other
devices of the route's nexthop device VRF.
After fix:
+ ip addr del 1:2:3:4::5/64 dev dummy1
+ ip -6 route show
7:7:7::1 dev dummy1 metric 1024 pref medium
7:7:7::2 dev dummy2 metric 1024 pref medium
Reported-by: Thomas Haller <thaller@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2170513
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
addrconf_prefix_rcv() returned early without releasing the inet6_dev
pointer when the PIO lifetime was less than accept_ra_min_lft.
Fixes: 5027d54a9c ("net: change accept_ra_min_rtr_lft to affect all RA lifetimes")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Simon Horman <horms@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting IP_RECVERR and IPV6_RECVERR options to zero currently
purges the socket error queue, which was probably not expected
for zerocopy and tx_timestamp users.
I discovered this issue while preparing commit 6b5f43ea08
("inet: move inet->recverr to inet->inet_flags"), I presume this
change does not need to be backported to stable kernels.
Add skb_errqueue_purge() helper to purge error messages only.
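A plausible shape of the helper (the set of origins preserved follows the
description above; treat this as a sketch rather than the upstream body):

  void skb_errqueue_purge(struct sk_buff_head *list)
  {
      struct sk_buff *skb, *next;
      unsigned long flags;

      spin_lock_irqsave(&list->lock, flags);
      skb_queue_walk_safe(list, skb, next) {
          /* keep zerocopy / tx timestamp notifications, drop real errors */
          if (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY ||
              SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_TIMESTAMPING)
              continue;
          __skb_unlink(skb, list);
          kfree_skb(skb);
      }
      spin_unlock_irqrestore(&list->lock, flags);
  }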
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sendmsg() is lockless, so ip_select_ident_segs()
can very well be run from multiple cpus [1].
Convert inet->inet_id to an atomic_t, but implement
a dedicated path for TCP, avoiding the cost of a locked
instruction (atomic_add_return()).
Note that this patch will cause a trivial merge conflict
because we added inet->flags in net-next tree.
v2: added missing change in
drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
(David Ahern)
[1]
BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb
read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1:
ip_select_ident_segs include/net/ip.h:542 [inline]
ip_select_ident include/net/ip.h:556 [inline]
__ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446
ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560
udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260
inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2634
__do_sys_sendmmsg net/socket.c:2663 [inline]
__se_sys_sendmmsg net/socket.c:2660 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0:
ip_select_ident_segs include/net/ip.h:541 [inline]
ip_select_ident include/net/ip.h:556 [inline]
__ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446
ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560
udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260
inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2634
__do_sys_sendmmsg net/socket.c:2663 [inline]
__se_sys_sendmmsg net/socket.c:2660 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x184d -> 0x184e
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
==================================================================
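A sketch of the resulting split in ip_select_ident_segs() (simplified;
iph is ip_hdr(skb), sk and segs are the function's existing parameters;
TCP can avoid the locked RMW because it always transmits under the socket
lock):

  if (sk_is_tcp(sk)) {
      /* socket lock held: a plain read/modify/write is race-free */
      int val = atomic_read(&inet_sk(sk)->inet_id);

      atomic_set(&inet_sk(sk)->inet_id, val + segs);
      iph->id = htons(val);
  } else {
      /* UDP & friends may run concurrently: pay for the atomic op */
      iph->id = htons(atomic_add_return(segs, &inet_sk(sk)->inet_id) - segs);
  }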
Fixes: 23f57406b8 ("ipv4: avoid using shared IP generator for connected sockets")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
veth and vxcan need to make sure the ifindexes of the peer
are not negative; the core does not validate this.
Using iproute2 with user-space-level checking removed:
Before:
# ./ip link add index 10 type veth peer index -1
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:74:b2:03 brd ff:ff:ff:ff:ff:ff
10: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8a:90:ff:57:6d:5d brd ff:ff:ff:ff:ff:ff
-1: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:ed:18:e6:fa:7f brd ff:ff:ff:ff:ff:ff
Now:
$ ./ip link add index 10 type veth peer index -1
Error: ifindex can't be negative.
This problem surfaced in net-next because an explicit WARN()
was added, the root cause is older.
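The added validation is essentially the following, where ifmp is the
peer's ifinfomsg parsed from the peer attribute (placement and variable
names are approximated; the error message matches the output above):

  if (ifmp && ifmp->ifi_index < 0) {
      NL_SET_ERR_MSG(extack, "ifindex can't be negative");
      return -EINVAL;
  }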
Fixes: e6f8f1a739 ("veth: Allow to create peer link with given ifindex")
Fixes: a8f820a380 ("can: add Virtual CAN Tunnel driver (vxcan)")
Reported-by: syzbot+5ba06978f34abb058571@syzkaller.appspotmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_queue_purge() and __skb_queue_purge() become wrappers
around the new generic functions.
New SKB_DROP_REASON_QUEUE_PURGE drop reason is added,
but users can start adding more specific reasons.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On server-initiated disconnect, rpcrdma_xprt_disconnect() was DMA-
unmapping the Receive buffers, but rpcrdma_post_recvs() neglected
to remap them after a new connection had been established. The
result was immediate failure of the new connection with the Receives
flushing with LOCAL_PROT_ERR.
Fixes: 671c450b6f ("xprtrdma: Fix oops in Receive handler after device removal")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add SMC_NLA_LGR_R_V2_MAX_CONNS and SMC_NLA_LGR_R_V2_MAX_LINKS
to the SMCR v2 link group netlink attribute SMC_NLA_LGR_R_V2 to expose
the link group's detailed information.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support max links per link group negotiation in the CLC handshake for
SMCR v2.1, which is one of the SMC v2.1 features. The server decides the
final value of max links based on the client's preferred max links and its
own preferred max links; here the minimum of the two values is used.
Client Server
Proposal(max links(client preferred))
-------------------------------------->
Accept(max links(accepted value))
accepted value=min(client preferred, server preferred)
<-------------------------------------
Confirm(max links(accepted value))
------------------------------------->
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support max connections per link group negotiation for SMCR v2.1,
which is one of the SMC v2.1 features. The server decides the final value
of max conns based on the client's preferred max conns and its own
preferred max conns; here the minimum of the two values is used.
Client Server
Proposal(max conns(client preferred))
------------------------------------>
Accept(max conns(accepted value))
accepted value=min(client preferred, server preferred)
<-----------------------------------
Confirm(max conns(accepted value))
----------------------------------->
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support SMC v2.x feature validation for SMC v2.1. This is the framework
code for SMC v2.x feature validation, and it takes effect only when the
negotiated release version is v2.1 or later.
For the server, v2.x feature validation, such as the max conns and max
links negotiation, should be done in smc_clc_srv_v2x_features_validate()
when receiving a v2.1 or later CLC Proposal Message; the decision on the
final values of max conns and max links is made in this function. The
final check on the server side when receiving a v2.1 or later CLC Confirm
Message should be done in smc_clc_v2x_features_confirm_check().
For the client, v2.x feature validation should be done in
smc_clc_clnt_v2x_features_validate() when receiving a v2.1 or later CLC
Accept Message; for example, the decision to accept the negotiated value
or to decline is made in this function.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a vendor-unique experimental options area to the CLC handshake. In the
CLC accept and confirm messages, the vendor-unique experimental options use
the 16-byte reserved field that was previously defined in struct
smc_clc_fce_gid_ext. Because struct smc_clc_first_contact_ext is widely
used, and to limit the scope of the modification, this patch moves the
16-byte reserved field out of struct smc_clc_fce_gid_ext and places it
right after struct smc_clc_first_contact_ext in a new struct named
smc_clc_first_contact_ext_v2x.
For an SMC-R first connection, previous versions already included both
struct smc_clc_first_contact_ext and the 16-byte reserved field in the CLC
accept and confirm messages. Thus, this patch uses struct
smc_clc_first_contact_ext_v2x instead, and the SMC-R CLC accept and confirm
messages stay compatible with previous versions.
For an SMC-D first connection, previous versions included only struct
smc_clc_first_contact_ext in the CLC accept and confirm messages, without
the 16-byte reserved field. Thus, when the negotiated SMC release version
is older than v2.1, struct smc_clc_first_contact_ext is still used for
compatibility. If the negotiated release version is v2.1 or later, struct
smc_clc_first_contact_ext_v2x is used instead.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support SMC release version negotiation in the CLC handshake based on
SMC v2, which so far only negotiates different versions, not different
releases. The latest SMC release version is updated to v2.1, so there are
currently two release versions of SMC v2: v2.0 and v2.1. In the release
version negotiation, the client sends its preferred release version in the
CLC Proposal Message; the server decides which release version to use based
on the client's preferred release version and its own supported release
version (choosing the minimum of the client's preferred version and the
server's latest supported version), and returns the decision to the client
in the CLC Accept Message. The client confirms the decision with the CLC
Confirm Message.
Client Server
Proposal(preferred release version)
------------------------------------>
Accept(accepted release version)
min(client preferred, server latest supported)
<------------------------------------
Confirm(accepted release version)
------------------------------------>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We changed tcp_poll() over time, but never updated dccp.
Note that we also could remove dccp instead of maintaining it.
Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230818015820.2701595-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enabling BIG TCP on a low end platform apparently increased
chances of getting flows locked on one busy TX queue.
A similar problem was handled in commit 9b462d02d6
("tcp: TCP Small Queues and strange attractors"),
but the strategy only worked for bulk flows
or 'large enough' RPCs. BIG TCP changed how large
an RPC needs to be for the workaround to kick in:
if the RPC fits in a single skb, TSQ never triggers.
The root cause of the problem is a busy TX queue
with delayed TX completions.
This patch changes how we set skb->ooo_okay to detect
the case where TX completion was not done, but an incoming ACK
was already processed and emptied the rtx queue.
Update the comment to explain the tricky details.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230817182353.2523746-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
*prot->memory_pressure is read/written locklessly; we need
to add proper annotations.
A recent commit added a new race; it is time to audit all accesses.
Fixes: 2d0c88e84e ("sock: Fix misuse of sk_under_memory_pressure()")
Fixes: 4d93df0abd ("[SCTP]: Rewrite of sctp buffer management code")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20230818015132.2699348-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit introduced linecard notifications for register;
however, it didn't add them for unregister. Fix that by adding them.
Fixes: c246f9b5fd ("devlink: add support to create line card and expose to user")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230817125240.2144794-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'batadv-next-pullrequest-20230816' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Remove unused declarations, by Yue Haibing
- Clean up MTU handling, by Sven Eckelmann (2 patches)
- Clean up/remove (obsolete) functions, by Sven Eckelmann (3 patches)
* tag 'batadv-next-pullrequest-20230816' of git://git.open-mesh.org/linux-merge:
batman-adv: Drop per algo GW section class code
batman-adv: Keep batadv_netlink_notify_* static
batman-adv: Drop unused function batadv_gw_bandwidth_set
batman-adv: Check hardif MTU against runtime MTU
batman-adv: Avoid magic value for minimum MTU
batman-adv: Remove unused declarations
batman-adv: Start new development cycle
====================
Link: https://lore.kernel.org/r/20230816164000.190884-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'batadv-net-pullrequest-20230816' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- Fix issues with adjusted MTUs (2 patches), by Sven Eckelmann
- Fix header access for memory reallocation case, by Remi Pommarel
- Fix two memory leaks (2 patches), by Remi Pommarel
* tag 'batadv-net-pullrequest-20230816' of git://git.open-mesh.org/linux-merge:
batman-adv: Fix batadv_v_ogm_aggr_send memory leak
batman-adv: Fix TT global entry leak when client roamed back
batman-adv: Do not get eth header before batadv_check_management_packet
batman-adv: Don't increase MTU when set by user
batman-adv: Trigger events for auto adjusted MTU
====================
Link: https://lore.kernel.org/r/20230816163318.189996-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the v6.5-rc1 MM tree has been merged and contains the new flag
SLAB_NO_MERGE from commit d0bf7d5759 ("mm/slab: introduce kmem_cache flag
SLAB_NO_MERGE"), now is the time to use this flag for networking, as
proposed earlier (see link).
The SKB (sk_buff) kmem_cache slab is critical for network performance.
Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
performance by amortising the alloc/free cost.
For the bulk API to perform efficiently the slub fragmentation needs to
be low. Especially for the SLUB allocator, the efficiency of the bulk free
API depends on objects belonging to the same slab (page).
When running different network performance microbenchmarks, I started
to notice that performance was reduced (slightly) when machines had
longer uptimes. I believe the cause was 'skbuff_head_cache' got
aliased/merged into the general slub for 256 bytes sized objects (with
my kernel config, without CONFIG_HARDENED_USERCOPY).
For the SKB kmem_cache, the network stack has various other reasons for
not merging, but they vary depending on the kernel config (e.g.
CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE
for this kmem_cache to get the most out of the kmem_cache_{alloc,free}_bulk APIs.
When CONFIG_SLUB_TINY is configured the bulk APIs are essentially
disabled. Thus, for this case drop the SLAB_NO_MERGE flag.
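The change boils down to adding the flag at cache creation time, gated on
CONFIG_SLUB_TINY as described (the helper macro name and the exact set of
other flags at the real call site are approximated here):

  /* bulk alloc/free is effectively disabled with SLUB_TINY, so merging
   * costs nothing there */
  #ifdef CONFIG_SLUB_TINY
  #define FLAG_SKB_NO_MERGE	0
  #else
  #define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
  #endif

  skbuff_cache = kmem_cache_create("skbuff_head_cache",
                                   sizeof(struct sk_buff), 0,
                                   SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                   FLAG_SKB_NO_MERGE,
                                   NULL);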
Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/169211265663.1491038.8580163757548985946.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c
fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
The fault path will immediately fail in handle_mm_fault(). There may be a
small performance reduction from this patch as a little unnecessary work
will be done on each page fault. See later patches for the improvement.
Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
LWTUNNEL_XMIT_CONTINUE is implicitly assumed in ip(6)_finish_output2,
such that any positive return value from a xmit hook could cause
unexpected continue behavior, despite that related skb may have been
freed. This could be error-prone for future xmit hook ops. One of the
possible errors is to return statuses of dst_output directly.
To make the code safer, redefine LWTUNNEL_XMIT_CONTINUE value to
distinguish from dst_output statuses and check the continue
condition explicitly.
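Schematically, the fix gives CONTINUE a value that cannot collide with
dst_output()/NET_XMIT_* style returns and tests for it explicitly (the
constant value shown here is illustrative):

  enum {
      LWTUNNEL_XMIT_DONE,
      LWTUNNEL_XMIT_CONTINUE = 0x100,  /* outside errno and NET_XMIT_* range */
  };

  /* ip(6)_finish_output2 */
  res = lwtunnel_xmit(skb);
  if (res != LWTUNNEL_XMIT_CONTINUE)
      return res;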
Fixes: 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure")
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/96b939b85eda00e8df4f7c080f770970a4c5f698.1692326837.git.yan@cloudflare.com
BPF encap ops can return different types of positive values, such as
NET_RX_DROP, NET_XMIT_CN, NETDEV_TX_BUSY, and so on, from the functions
skb_do_redirect and bpf_lwt_xmit_reroute. At the xmit hook, such return
values would be treated implicitly as LWTUNNEL_XMIT_CONTINUE in
ip(6)_finish_output2. When this happens, skbs that have been freed would
continue to the neighbor subsystem, causing use-after-free bugs and
kernel crashes.
To fix the incorrect behavior, skb_do_redirect return values can be
simply discarded, the same as tc-egress behavior. On the other hand,
bpf_lwt_xmit_reroute returns useful errors to local senders, e.g. PMTU
information. Thus convert its return values to avoid the conflict with
LWTUNNEL_XMIT_CONTINUE.
Fixes: 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure")
Reported-by: Jordan Griege <jgriege@cloudflare.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/0d2b878186cfe215fec6b45769c1cd0591d3628d.1692326837.git.yan@cloudflare.com
Merge tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec and netfilter.
No known outstanding regressions.
Fixes to fixes:
- virtio-net: set queues after driver_ok, avoid a potential race
added by recent fix
- Revert "vlan: Fix VLAN 0 memory leak", it may lead to a warning
when VLAN 0 is registered explicitly
- nf_tables:
- fix false-positive lockdep splat in recent fixes
- don't fail inserts if duplicate has expired (fix test failures)
- fix races between garbage collection and netns dismantle
Current release - new code bugs:
- mlx5: Fix mlx5_cmd_update_root_ft() error flow
Previous releases - regressions:
- phy: fix IRQ-based wake-on-lan over hibernate / power off
Previous releases - always broken:
- sock: fix misuse of sk_under_memory_pressure() preventing system
from exiting global TCP memory pressure if a single cgroup is under
pressure
- fix the RTO timer retransmitting skb every 1ms if linear option is
enabled
- af_key: fix sadb_x_filter validation, amend netlink policy
- ipsec: fix slab-use-after-free in decode_session6()
- macb: in ZynqMP resume always configure PS GTR for non-wakeup
source
Misc:
- netfilter: set default timeout to 3 secs for sctp shutdown send and
recv state (from 300ms), align with protocol timers"
* tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
ice: Block switchdev mode when ADQ is active and vice versa
qede: fix firmware halt over suspend and resume
net: do not allow gso_size to be set to GSO_BY_FRAGS
sock: Fix misuse of sk_under_memory_pressure()
sfc: don't fail probe if MAE/TC setup fails
sfc: don't unregister flow_indr if it was never registered
net: dsa: mv88e6xxx: Wait for EEPROM done before HW reset
net/mlx5: Fix mlx5_cmd_update_root_ft() error flow
net/mlx5e: XDP, Fix fifo overrun on XDP_REDIRECT
i40e: fix misleading debug logs
iavf: fix FDIR rule fields masks validation
ipv6: fix indentation of a config attribute
mailmap: add entries for Simon Horman
broadcom: b44: Use b44_writephy() return value
net: openvswitch: reject negative ifindex
team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
net: phy: broadcom: stub c45 read/write for 54810
netfilter: nft_dynset: disallow object maps
netfilter: nf_tables: GC transaction race with netns dismantle
netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path
...
Use prandom_u32_state() instead of get_random_u32() to generate
the correlated loss events of netem.
Signed-off-by: François Michel <francois.michel@uclouvain.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20230815092348.1449179-4-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use prandom_u32_state() instead of get_random_u32() to generate
the random loss events of netem. The state of the prng is part
of the prng attribute of struct netem_sched_data.
Signed-off-by: François Michel <francois.michel@uclouvain.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20230815092348.1449179-3-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a prng attribute to struct netem_sched_data and
allow setting the seed of the PRNG through netlink
using the new TCA_NETEM_PRNG_SEED attribute.
The PRNG attribute is not actually used yet.
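For illustration only (struct layout and helper names follow the changelog wording, not the exact patch), seeding and drawing from a private PRNG state looks roughly like:

struct netem_prng_sketch {
	u64 seed;			/* e.g. taken from TCA_NETEM_PRNG_SEED */
	struct rnd_state prng_state;
};

static void netem_prng_init_sketch(struct netem_prng_sketch *p, u64 seed)
{
	p->seed = seed;
	prandom_seed_state(&p->prng_state, seed);
}

/* loss decisions can then draw reproducible values from the private state:
 *	u32 r = prandom_u32_state(&p->prng_state);
 */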
Signed-off-by: François Michel <francois.michel@uclouvain.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20230815092348.1449179-2-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The status of global socket memory pressure is updated when:
a) __sk_mem_raise_allocated():
enter: sk_memory_allocated(sk) > sysctl_mem[1]
leave: sk_memory_allocated(sk) <= sysctl_mem[0]
b) __sk_mem_reduce_allocated():
leave: sk_under_memory_pressure(sk) &&
sk_memory_allocated(sk) < sysctl_mem[0]
So the conditions for leaving global pressure are inconsistent, which
may lead to the situation where one pressured net-memcg prevents the
global pressure from being cleared when there is indeed no global
pressure, thus the global constraints unexpectedly remain in effect
on the other sockets.
This patch fixes this by ignoring the net-memcg's pressure when
deciding whether to leave global memory pressure.
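A hedged sketch of the resulting check (the helper name follows the patch description; the body is an approximation, not the verbatim upstream code):

static inline bool sk_under_global_memory_pressure(const struct sock *sk)
{
	/* consult only the protocol-global pressure flag, never memcg state */
	return sk->sk_prot->memory_pressure &&
	       !!READ_ONCE(*sk->sk_prot->memory_pressure);
}

/* __sk_mem_reduce_allocated() then leaves global pressure based on this
 * helper instead of sk_under_memory_pressure():
 *
 *	if (sk_under_global_memory_pressure(sk) &&
 *	    sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0))
 *		sk_leave_memory_pressure(sk);
 */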
Fixes: e1aab161e0 ("socket: initial cgroup code.")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20230816091226.1542-1-wuyun.abel@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-08-16
We've added 17 non-merge commits during the last 6 day(s) which contain
a total of 20 files changed, 1179 insertions(+), 37 deletions(-).
The main changes are:
1) Add a BPF hook in sys_socket() to change the protocol ID
from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy
applications, from Geliang Tang.
2) Follow-up/fallout fix from the SO_REUSEPORT + bpf_sk_assign work
to fix a splat on non-fullsock sks in inet[6]_steal_sock,
from Lorenz Bauer.
3) Improvements to struct_ops links to avoid forcing presence of
update/validate callbacks. Also add bpf_struct_ops fields documentation,
from David Vernet.
4) Ensure libbpf sets close-on-exec flag on gzopen, from Marco Vedovati.
5) Several new tcx selftest additions and bpftool link show support for
tcx and xdp links, from Daniel Borkmann.
6) Fix a smatch warning on uninitialized symbol in
bpf_perf_link_fill_kprobe, from Yafang Shao.
7) BPF selftest fixes e.g. misplaced break in kfunc_call test,
from Yipeng Zou.
8) Small cleanup to remove unused declaration bpf_link_new_file,
from Yue Haibing.
9) Small typo fix to bpftool's perf help message, from Daniel T. Lee.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Add mptcpify test
selftests/bpf: Fix error checks of mptcp open_and_load
selftests/bpf: Add two mptcp netns helpers
bpf: Add update_socket_protocol hook
bpftool: Implement link show support for xdp
bpftool: Implement link show support for tcx
selftests/bpf: Add selftest for fill_link_info
bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()
net: Fix slab-out-of-bounds in inet[6]_steal_sock
bpf: Document struct bpf_struct_ops fields
bpf: Support default .validate() and .update() behavior for struct_ops links
selftests/bpf: Add various more tcx test cases
selftests/bpf: Clean up fmod_ret in bench_rename test script
selftests/bpf: Fix repeat option when kfunc_call verification fails
libbpf: Set close-on-exec flag on gzopen
bpftool: fix perf help message
bpf: Remove unused declaration bpf_link_new_file()
====================
Link: https://lore.kernel.org/r/20230816212840.1539-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a hook named update_socket_protocol in __sys_socket(), for bpf
progs to attach to and update the socket protocol. One use case is to
force legacy TCP apps to create and use MPTCP sockets instead of
TCP ones.
Define a fmod_ret set named bpf_mptcp_fmodret_ids, add the hook
update_socket_protocol into this set, and register it in
bpf_mptcp_kfunc_init().
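A small fmod_ret program in the spirit of the mptcpify selftest added by this series could look like the sketch below (header choices and protocol constants are assumptions made for the example):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define AF_INET		2
#define AF_INET6	10
#define SOCK_STREAM	1
#define IPPROTO_TCP	6
#define IPPROTO_MPTCP	262

char _license[] SEC("license") = "GPL";

SEC("fmod_ret/update_socket_protocol")
int BPF_PROG(mptcpify, int family, int type, int protocol)
{
	/* turn plain TCP stream sockets into MPTCP ones, leave the rest alone */
	if ((family == AF_INET || family == AF_INET6) &&
	    type == SOCK_STREAM &&
	    (!protocol || protocol == IPPROTO_TCP))
		return IPPROTO_MPTCP;

	return protocol;
}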
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree
can be expensive if the number of routes in a table is big, even if most of
them are permanent. Checking routes on a separate list of routes that have
an expiration will avoid this potential issue.
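A minimal sketch of the idea, with purely hypothetical names (the actual series defines its own fields and helpers): routes carrying an expiry are additionally linked on a per-table list, and GC walks only that list.

/* hypothetical layout, for illustration only */
struct fib6_table_sketch {
	spinlock_t		tb6_lock;
	struct hlist_head	tb6_gc_list;	/* only routes with RTF_EXPIRES */
};

/* when a route gains an expiry:
 *	hlist_add_head(&rt->gc_node, &table->tb6_gc_list);
 * fib6_run_gc() then iterates tb6_gc_list instead of walking the whole trie
 */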
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'nf-23-08-16' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
These are netfilter fixes for the *net* tree.
First patch resolves a false-positive lockdep splat:
rcu_dereference is used outside of rcu read lock. Let lockdep
validate that the transaction mutex is locked.
Second patch fixes a kdoc warning added in previous PR.
Third patch fixes a memory leak:
The catchall element isn't disabled correctly, this allows
userspace to deactivate the element again. This results in refcount
underflow which in turn prevents memory release. This was always
broken since the feature was added in 5.13.
Patch 4 fixes an incorrect change in the previous pull request:
Adding a duplicate key to a set should work if the duplicate key
has expired, restore this behaviour. All from myself.
Patch #5 resolves an old historic artifact in sctp conntrack:
a 300ms timeout for shutdown_ack. Increase this to 3s. From Xin Long.
Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the
sysctl value, from Sishuai Gong. This is a day-0 bug that predates git
history.
Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups
for the previous GC rework in nf_tables: The netlink notifier and the
netns exit path must both increment the gc worker seqcount, else the worker
may encounter stale (free'd) pointers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
inet->min_ttl is already read with READ_ONCE().
Implementing IP_MINTTL socket option set/read
without holding the socket lock is easy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_select_ttl() is racy, because it reads inet->uc_ttl
without proper locking.
Add READ_ONCE()/WRITE_ONCE() annotations while
allowing IP_TTL socket option to be set/read without
holding the socket lock.
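Roughly, the annotation pattern is the one below (a sketch, not the exact kernel diff):

static int ip_select_ttl_sketch(const struct inet_sock *inet,
				const struct dst_entry *dst)
{
	/* lockless read; pairs with WRITE_ONCE(inet->uc_ttl, val) done in
	 * the setsockopt(IP_TTL) path, which no longer takes lock_sock()
	 */
	int ttl = READ_ONCE(inet->uc_ttl);

	if (ttl < 0)
		ttl = ip4_dst_hoplimit(dst);	/* fall back to the route default */
	return ttl;
}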
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make room in struct inet_sock by removing this bit field,
using one available bit in inet_flags instead.
Also move local_port_range to fill the resulting hole,
saving 8 bytes on 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_BIND_ADDRESS_NO_PORT socket option can now be set/read
without locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_NODEFRAG socket option can now be set/read
without locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We move single bit fields to inet->inet_flags to avoid races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_TRANSPARENT socket option can now be set/read
without locking the socket.
v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent()
v4: rebased after commit 3f326a821b ("mptcp: change the mpc check helper to return a sk")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_MULTICAST_ALL socket option can now be set/read
without locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_MULTICAST_LOOP socket option can now be set/read
without locking the socket.
v3: fix build bot error reported in ipvs set_mcast_loop()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_HDRINCL socket option can now be set/read
without locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_FREEBIND socket option can now be set/read
without locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_RECVERR_RFC4884 socket option can now be set/read
without locking the socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_RECVERR socket option can now be set/get without locking the socket.
This patch potentially avoid data-races around inet->recverr.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have inet->inet_flags, we can set the following options
without having to hold the socket lock:
IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_RECVFRAGSIZE.
ip_sock_set_pktinfo() no longer holds the socket lock.
Similarly we can get the following options without holding
the socket lock:
IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS,
IP_PASSSEC, IP_RECVORIGDSTADDR, IP_CHECKSUM, IP_RECVFRAGSIZE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various inet fields are currently racy.
do_ip_setsockopt() and do_ip_getsockopt() are mostly holding
the socket lock, but some (fast) paths do not.
Use a new inet->inet_flags field to hold atomic bits in this series.
Remove inet->cmsg_flags and instead use 9 bits from inet_flags.
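As a sketch of the mechanism (enum and helper names approximate the series and should be treated as assumptions):

enum {
	INET_FLAGS_PKTINFO,
	INET_FLAGS_RECVERR,
	/* ... one bit per formerly racy bit field ... */
};

/* atomic bit ops on inet_sk(sk)->inet_flags replace the old bit fields,
 * so setsockopt()/getsockopt() no longer need lock_sock() for them
 */
#define inet_test_bit(nr, sk)	test_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags)
#define inet_set_bit(nr, sk)	set_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags)
#define inet_clear_bit(nr, sk)	clear_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags)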
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix indentation of a type attribute of IPV6_VTI config entry.
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'ipsec-2023-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
1) Fix a slab-out-of-bounds read in xfrm_address_filter.
From Lin Ma.
2) Fix the pfkey sadb_x_filter validation.
From Lin Ma.
3) Use the correct nla_policy structure for XFRMA_SEC_CTX.
From Lin Ma.
4) Fix warnings triggerable by bad packets in the encap functions.
From Herbert Xu.
5) Fix some slab-use-after-free in decode_session6.
From Zhengchao Shao.
6) Fix a possible NULL pointer dereference in xfrm_update_ae_params.
From Lin Ma.
7) Add a forgotten nla_policy for XFRMA_MTIMER_THRESH.
From Lin Ma.
8) Don't leak offloaded policies.
From Leon Romanovsky.
9) Delete also the offloading part of an acquire state.
From Leon Romanovsky.
Please pull or let me know if there are problems.
Since the xarray changes we mix returning valid ifindex and negative
errno in a single int returned from dev_index_reserve(). This depends
on the fact that ifindexes can't be negative. Otherwise we may insert
into the xarray and return a very large negative value. This in turn
may break ERR_PTR().
OvS is susceptible to this problem and lacking validation (fix posted
separately for net).
Reject negative ifindex explicitly. Add a warning because the input
validation is better handled by the caller.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recent changes in net-next (commit 759ab1edb5 ("net: store netdevs
in an xarray")) refactored the handling of pre-assigned ifindexes
and let syzbot surface a latent problem in ovs. ovs does not validate
ifindex, making it possible to create netdev ports with negative
ifindex values. It's easy to repro with YNL:
$ ./cli.py --spec netlink/specs/ovs_datapath.yaml \
--do new \
--json '{"upcall-pid": 1, "name":"my-dp"}'
$ ./cli.py --spec netlink/specs/ovs_vport.yaml \
--do new \
--json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}'
$ ip link show
-65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff
...
Validate the inputs. Now the second command correctly returns:
$ ./cli.py --spec netlink/specs/ovs_vport.yaml \
--do new \
--json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}'
lib.ynl.NlError: Netlink error: Numerical result out of range
nl_len = 108 (92) nl_flags = 0x300 nl_type = 2
error: -34 extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'}
Accept 0 since it used to be silently ignored.
Fixes: 54c4ef34c4 ("openvswitch: allow specifying ifindex of new interfaces")
Reported-by: syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The nexthop and nexthop bucket dump callbacks previously returned a
positive return code even when the dump was complete, prompting the core
netlink code to invoke the callback again, until returning zero.
Zero was only returned by these callbacks when no information was filled
in the provided skb, which was achieved by incrementing the dump
sentinel at the end of the dump beyond the ID of the last nexthop.
This is no longer necessary since these callbacks now return zero when
the dump is complete.
Remove the unnecessary increment.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before commit f10d3d9df4 ("nexthop: Make nexthop bucket dump more
efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return
code for each resilient nexthop group whose buckets it dumped,
regardless if it encountered an error or not.
This meant that the sentinel ('dd->ctx->nh.idx') used by the function
that walked the different nexthops could not be used as a sentinel for
the bucket dump, as otherwise buckets from the same group would be
dumped over and over again.
This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx')
that was incremented by rtm_dump_nexthop_bucket_nh() after successfully
dumping all the buckets from a given group.
After the previously mentioned commit this sentinel is no longer
necessary since the function no longer returns a non-zero return code
when successfully dumping all the buckets from a given group.
Remove this sentinel and simplify the code.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NEXT-C-SID mechanism described in [1] offers the possibility of
encoding several SRv6 segments within a single 128 bit SID address. Such
a SID address is called a Compressed SID (C-SID) container. In this way,
the length of the SID List can be drastically reduced.
A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
logically structured in three main blocks: i) Locator-Block; ii)
Locator-Node Function; iii) Argument.
C-SID container
+------------------------------------------------------------------+
| Locator-Block |Loc-Node| Argument |
| |Function| |
+------------------------------------------------------------------+
<--------- B -----------> <- NF -> <------------- A --------------->
(i) The Locator-Block can be any IPv6 prefix available to the provider;
(ii) The Locator-Node Function represents the node and the function to
be triggered when a packet is received on the node;
(iii) The Argument carries the remaining C-SIDs in the current C-SID
container.
This patch leverages the NEXT-C-SID mechanism previously introduced in the
Linux SRv6 subsystem [2] to support SID compression capabilities in the
SRv6 End.X behavior [3].
An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior
but it is capable of processing the compressed SID List encoded in C-SID
containers.
An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support
user-provided Locator-Block and Locator-Node Function lengths. In this
implementation, such lengths must be evenly divisible by 8 (i.e. must be
byte-aligned), otherwise the kernel informs the user about invalid
values with a meaningful error code and message through netlink_ext_ack.
If Locator-Block and/or Locator-Node Function lengths are not provided
by the user during configuration of an SRv6 End.X behavior instance with
NEXT-C-SID flavor, the kernel will choose their default values i.e.,
32-bit Locator-Block and 16-bit Locator-Node Function.
[1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
[2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/
[3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move from register_net_sysctl to register_net_sysctl_sz for all the
networking related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.
We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.
An additional size function was added to the following files in order to
calculate the size of an array that is defined in another file:
include/net/ipv6.h
net/ipv6/icmp.c
net/ipv6/route.c
net/ipv6/sysctl_net_ipv6.c
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Move from register_net_sysctl to register_net_sysctl_sz for all the
netfilter related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.
We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Move from register_net_sysctl to register_net_sysctl_sz and pass the
ARRAY_SIZE of the ctl_table array that was used to create the table
variable. We need to move to the new function in preparation for when we
change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro.
Failing to do so would erroneously allow ARRAY_SIZE() to be called on a
pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have
migrated all the relevant net sysctl registering functions to
register_net_sysctl_sz in subsequent commits.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This commit adds size to the register_net_sysctl indirection function to
facilitate the removal of the sentinel elements (last empty markers)
from the ctl_table arrays. Though we don't actually remove any sentinels
in this commit, register_net_sysctl* now has the capability of
forwarding table_size for when that happens.
We create a new function register_net_sysctl_sz with an extra size
argument. A macro replaces the existing register_net_sysctl. The size in
the macro is SIZE_MAX instead of ARRAY_SIZE to avoid compilation errors
while we systematically migrate to register_net_sysctl_sz. Will change
to ARRAY_SIZE in subsequent commits.
Care is taken to add table_size to the stopping criteria in such a way
that when we remove the empty sentinel element, it will continue
stopping in the last element of the ctl_table array.
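In other words, the new interface is roughly the following (a sketch based on the description above, not a verbatim copy of the headers):

struct ctl_table_header *register_net_sysctl_sz(struct net *net,
						const char *path,
						struct ctl_table *table,
						size_t table_size);

/* existing callers keep compiling unchanged; SIZE_MAX is only a stopgap
 * until all callers are migrated and the macro can use ARRAY_SIZE(table)
 */
#define register_net_sysctl(net, path, table) \
	register_net_sysctl_sz(net, path, table, SIZE_MAX)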
Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This commit adds table_size to register_sysctl in preparation for the
removal of the sentinel elements in the ctl_table arrays (last empty
markers). And though we do *not* remove any sentinels in this commit, we
set things up by either passing the table_size explicitly or using
ARRAY_SIZE on the ctl_table arrays.
We replace the register_sysctl function with a macro that will add the
ARRAY_SIZE to the new register_sysctl_sz function. In this way the
callers that are already using an array of ctl_table structs do not
change. For the callers that pass a ctl_table array pointer, we pass the
table_size to register_sysctl_sz instead of the macro.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
We make these changes in order to prepare __register_sysctl_table and
its callers for when we remove the sentinel element (empty element at
the end of ctl_table arrays). We don't actually remove any sentinels in
this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
is available when the removal occurs.
We add a table_size argument to __register_sysctl_table and adjust
callers, all of which pass ctl_table pointers and need an explicit call
to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
order to forward the size of the array pointer received from the network
register calls.
The new table_size argument does not yet have any effect in the
init_header call which is still dependent on the sentinel's presence.
table_size *does* however drive the `kzalloc` allocation in
__register_sysctl_table with no adverse effects as the allocated memory
is either one element greater than the calculated ctl_table array (for
the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
of the calculated ctl_table array (for the call from sysctl_net.c and
register_sysctl). This approach will allow us to "just" remove the
sentinel without further changes to __register_sysctl_table as
table_size will represent the exact size for all the callers at that
point.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Do not allow inserting elements from the datapath into object maps.
Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use maybe_get_net() since GC workqueue might race with netns exit path.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Netlink event path is missing a synchronization point with GC
transactions. Add GC sequence number update to netns release path and
netlink event path, any GC transaction losing race will be discarded.
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
When two threads run proc_do_sync_threshold() in parallel,
data races could happen between the two memcpy():
Thread-1                            Thread-2
memcpy(val, valp, sizeof(val));
                                    memcpy(valp, val, sizeof(val));
This race might mess up the (struct ctl_table *) table->data,
so we add a mutex lock to serialize them.
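Conceptually the serialization looks like the sketch below (the lock object and the exact validation in the real patch may differ):

static int proc_do_sync_threshold_sketch(struct ctl_table *table, int write,
					 void *buffer, size_t *lenp, loff_t *ppos)
{
	static DEFINE_MUTEX(sync_mutex);	/* illustrative lock placement */
	int *valp = table->data;
	int backup[2];
	int rc;

	mutex_lock(&sync_mutex);
	memcpy(backup, valp, sizeof(backup));		/* snapshot under the lock */
	rc = proc_dointvec(table, write, buffer, lenp, ppos);
	if (write && (valp[0] < 0 || valp[1] < 0 || valp[0] >= valp[1]))
		memcpy(valp, backup, sizeof(backup));	/* restore on bad input */
	mutex_unlock(&sync_mutex);
	return rc;
}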
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/
Signed-off-by: Sishuai Gong <sishuai.system@gmail.com>
Acked-by: Simon Horman <horms@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and
SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout
value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300
msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state.
As Paolo Valerio noticed, this might cause unwanted expiration of the ct
entry. In my test, with 1s tc netem delay set on the NAT path, after the
SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND
state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is
sent back from the peer, the sctp ct entry has expired and been deleted,
and then the SHUTDOWN_ACK has to be dropped.
Also, it is confusing that these two sysctl options always show 0 because all
timeout values use seconds as the unit:
net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0
net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0
This patch fixes it by also using 3 secs for sctp shutdown send and recv
state in sctp conntrack, which is also RTO.initial value in SCTP protocol.
Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV
was probably used for a rare scenario where SHUTDOWN is sent on 1st path
but SHUTDOWN_ACK is replied on 2nd path, then a new connection started
immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV
to CLOSE when receiving INIT in the ORIGINAL direction.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Reported-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
nftables selftests fail:
run-tests.sh testcases/sets/0044interval_overlap_0
Expected: 0-2 . 0-3, got:
W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1
Insertion must ignore duplicate but expired entries.
Moreover, there is a strange asymmetry in nft_pipapo_activate:
It refetches the current element, whereas the other ->activate callbacks
(bitmap, hash, rhash, rbtree) use elem->priv.
Same for .remove: other set implementations take elem->priv;
nft_pipapo_remove fetches elem->priv, then does a relookup;
remove this relookup.
I suspect this was the reason for the change that prompted the
removal of the expired check in pipapo_get() in the first place,
but skipping expired elements there makes no sense to me: this helper
is used for normal get requests, insertions (duplicate check)
and the deactivate callback.
In the first two cases expired elements must be skipped.
For ->deactivate(), this gets called for DELSETELEM, so it
seems to me that expired elements should be skipped as well, i.e. the
delete request should fail with -ENOENT.
Fixes: 24138933b9 ("netfilter: nf_tables: don't skip expired elements during walk")
Signed-off-by: Florian Westphal <fw@strlen.de>
When flushing, individual set elements are disabled in the next
generation via the ->flush callback.
Catchall elements are not disabled. This is incorrect and may lead to
double-deactivations of catchall elements which then results in memory
leaks:
WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730
CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60
RIP: 0010:nft_map_deactivate+0x549/0x730
[..]
? nft_map_deactivate+0x549/0x730
nf_tables_delset+0xb66/0xeb0
(the warn is due to nft_use_dec() detecting underflow).
Fixes: aaa31047a6 ("netfilter: nftables: add catch-all set element support")
Reported-by: lonial con <kongln9170@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Jakub Kicinski says:
We've got some new kdoc warnings here:
net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc'
net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc'
include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set'
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Fixes: f6c383b8c3 ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/
Signed-off-by: Florian Westphal <fw@strlen.de>
We had a number of bugs in the past because developers forgot
to fully test dumps, which pass NULL as info to .prepare_data.
.prepare_data implementations would try to access info->extack
leading to a null-deref.
Now that dumps and notifications can access struct genl_info
we can pass it in, and remove the info null checks.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Having family in struct genl_info is quite useful. It cuts
down the number of arguments which need to be passed to
helpers which already take struct genl_info.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since dumps carry struct genl_info now, use the attrs pointer
from genl_info and remove the one in struct genl_dumpit_info.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netlink GET implementations must currently juggle struct genl_info
and struct netlink_callback, depending on whether they were called
from doit or dumpit.
Add genl_info to the dump state and populate the fields.
This way implementations can simply pass struct genl_info around.
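Dumpit handlers can then fetch the populated info straight from the callback, roughly as below (the helper name follows the series description and is an assumption here):

static inline struct genl_info *genl_info_dump(struct netlink_callback *cb)
{
	/* the dump state now embeds a fully populated struct genl_info */
	return &genl_dumpit_info(cb)->info;
}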
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to the user header in struct genl_info
is overkill. Compute the header pointer at runtime.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct netlink_callback has a const nlh pointer, make the
pointer in struct genl_info const as well, to make copying
between the two easier.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add helpers which take/release the genl mutex based
on family->parallel_ops. Remove the separation between
handling of ops in locked and parallel families.
Future patches would make the duplicated code grow even more.
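A sketch of such helpers (names and placement are approximate):

static void genl_op_lock(const struct genl_family *family)
{
	/* non-parallel families are still serialized by the single genl mutex */
	if (!family->parallel_ops)
		genl_lock();
}

static void genl_op_unlock(const struct genl_family *family)
{
	if (!family->parallel_ops)
		genl_unlock();
}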
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a real workload, I encountered an issue which could cause the RTO
timer to retransmit the skb every 1ms with the linear option enabled. The
number of lost-retransmitted skbs can go up to 1000+ instantly.
The root cause is that if the icsk_rto happens to be zero in the 6th round
(which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero
due to the changed calculation method in tcp_retransmit_timer() as follows:
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
The above line then effectively becomes
icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0
Therefore, the timer unsurprisingly expires almost immediately.
I read through the RFC 6298 and found that the RTO value can be rounded
up to a certain value, in Linux, say TCP_RTO_MIN as default, which is
regarded as the lower bound in this patch as suggested by Eric.
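A tiny userspace demonstration of the arithmetic (constants are illustrative stand-ins in milliseconds, not the kernel's jiffies-based values):

#include <stdio.h>

#define RTO_MIN 200      /* stand-in for TCP_RTO_MIN */
#define RTO_MAX 120000   /* stand-in for TCP_RTO_MAX */

static unsigned int backoff_unbounded(unsigned int rto)
{
	return (rto << 1) < RTO_MAX ? rto << 1 : RTO_MAX;
}

static unsigned int backoff_bounded(unsigned int rto)
{
	unsigned int next = backoff_unbounded(rto);

	return next < RTO_MIN ? RTO_MIN : next;	/* enforce the lower bound */
}

int main(void)
{
	/* once rto has collapsed to 0, doubling alone never recovers it */
	printf("unbounded: 0 -> %u\n", backoff_unbounded(0));	/* prints 0   */
	printf("bounded:   0 -> %u\n", backoff_bounded(0));	/* prints 200 */
	return 0;
}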
Fixes: 36e31b0af5 ("net: TCP thin linear timeouts")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages,
but commit 5df5dd03a8 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather
then sendpage") dropped that part of the handling. Fix it by setting
the bv_offset of the first bvec.
Fixes: 5df5dd03a8 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Allow user to pass port index for health reporter dump request.
Re-generate the related code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend per-instance dump command definitions to accept instance
attributes. Allow parsing of devlink handle attributes so they could
be used for instance selection.
Re-generate the related code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know the port
handle for a specific SF, they need to dump all devlink ports on the host,
which does not scale well.
Allow user to pass devlink handle attributes alongside the dump command
and dump only objects which are under selected devlink instance.
Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As the commands are already defined in split ops, remove them
from small ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the duplicate temporary netlink callback prototype as the
generated ones are already in place.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the definitions for the commands that do per-instance dump
and re-generate the related code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the
flags as an arg of dump_one() callback. Currently, it is always
NLM_F_MULTI.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce dumpit callbacks for generated split ops. Have them
as thin wrappers around the iteration function and allow passing the dump_one()
function pointer directly without the need to store it in devlink_cmd structs.
Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename netlink doit callback functions for the commands that do
implement per-instance dump to match the generated names that are going
to be introduced in the follow-up patch.
Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Define port handling helpers that don't rely on internal_flags.
Have __devlink_nl_pre_doit() accept the flags as a function arg and
make devlink_nl_pre_doit() a wrapper helper function calling it.
Introduce new helpers devlink_nl_pre_doit_port() and
devlink_nl_pre_doit_port_optional() to be used by split ops in follow-up
patch.
Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No need to give the rate any special treatment in netlink attribute
parsing, as unlike for ports, there are only a couple of commands
benefiting from that.
Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler
by moving the rate attributes parsing to rate_*_doit() ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No need to give the linecards any special treatment in netlink attribute
parsing, as unlike for ports, there are only a couple of commands
benefiting from that.
Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler
by moving the linecard attribute parsing to linecard_[gs]et_doit() ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This code was only used in the past for the sysfs interface. But since
this was replaced with netlink, it was never executed. The function pointer
was only checked to figure out whether the limit 255 (B.A.T.M.A.N. IV) or
2**32-1 (B.A.T.M.A.N. V) should be used as the limit.
So instead of keeping the function pointer, just store the limits directly
in struct batadv_algo_gw_ops.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The batadv_netlink_notify_*() functions are not used by any other source
file. Just keep them local to netlink.c to get informed by the compiler
when they are not used anymore.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This function is no longer used since the sysfs support was removed from
batman-adv.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Use drop reasons from include/net/dropreason-core.h when a reasonable
candidate exists.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By using an independent drop reason it makes it easy to distinguish
between QoS-triggered or flow-triggered drop.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Eric Garver <eric@garver.life>
This adds an explicit drop action. This is used by OVS to drop packets
for which it cannot determine what to do. An explicit action in the
kernel allows passing the reason _why_ the packet is being dropped or
zero to indicate no particular error happened (i.e: OVS intentionally
dropped the packet).
Since the error codes coming from userspace mean nothing for the kernel,
we squash all of them into only two drop reasons:
- OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed
- OVS_DROP_EXPLICIT to indicate a zero value was passed (no error)
e.g. trace all OVS dropped skbs
# perf trace -e skb:kfree_skb --filter="reason >= 0x30000"
[..]
106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \
location:0xffffffffc0d9b462, protocol: 2048, reason: 196611)
reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT)
Also, this patch allows ovs-dpctl.py to add explicit drop actions as:
"drop" -> implicit empty-action drop
"drop(0)" -> explicit non-error action drop
"drop(42)" -> explicit error action drop
Signed-off-by: Eric Garver <eric@garver.life>
Co-developed-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a drop reason for packets that are dropped because an action
returns a non-zero error code.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new drop reason subsystem for openvswitch and add the first
drop reason to represent last-action drops.
Last-action drops happen when a flow has an empty action list or there
is no action that consumes the packet (output, userspace, recirc, etc).
It is the most common way in which OVS drops packets.
Implementation-wise, most of these skb-consuming actions already call
"consume_skb" internally and return directly from within the
do_execute_actions() loop so with minimal changes we can assume that
any skb that exits the loop normally is a packet drop.
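The subsystem wiring is roughly as follows (a sketch; the enum members and helper names approximate the series and are not verbatim):

enum ovs_drop_reason {
	/* OVS reasons live in their own SKB_DROP_REASON subsystem slot */
	OVS_DROP_LAST_ACTION = SKB_DROP_REASON_SUBSYS_OPENVSWITCH <<
			       SKB_DROP_REASON_SUBSYS_SHIFT,
};

/* at the end of do_execute_actions(), a packet that no action consumed is
 * freed with an OVS-specific reason instead of a bare consume_skb():
 *
 *	ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
 */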
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__mptcp_init_sock() always returns 0 because mptcp_init_sock() used
to return the value directly.
But after commit 18b683bff8 ("mptcp: queue data for mptcp level
retransmission"), __mptcp_init_sock() no longer needs to return a value.
Let's remove the unnecessary test for __mptcp_init_sock() and make
it return void.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Such field is now unused except as a flag to control the first subflow
deletion at close() time. Introduce a new bit flag for that and finally
drop the mentioned field.
As an intended side effect, now the first subflow sock is not freed
before close() even for passive sockets. The msk has no open/active
subflows if the first one is closed and the subflow list is singular;
update the state check in mptcp_stream_accept() accordingly.
Among other benefits, the subflow removal, reduces the amount of memory
used on the client side for each mptcp connection, allows passive sockets
to go through successful accept()/disconnect()/connect() and makes return
error code consistent for failing both passive and active sockets.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch the __mptcp_nmpc_socket helper is used
only to ensure that the MPTCP socket is in a suitable status - that
is, the MPTCP-capable handshake has not started yet.
Change the return value to the relevant subflow sock, to finally
remove the last references to first subflow socket in the MPTCP stack.
As a bonus, we can get rid of a few local variables in different
functions.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is one of the few remaining spots actually manipulating the
first subflow socket. We can leverage the recently introduced
inet helpers to get rid of ssock there.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp sockopt infrastructure needlessly uses the first subflow
socket struct in a few spots. We are going to remove such field
soon, so use directly the first subflow sock instead.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection at accept() time. Instead access
directly the first subflow sock, and update mptcp_accept() to
operate on it. This allows dropping a duplicated check in
mptcp_accept().
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection at poll() time. Instead access
directly the first subflow sock.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection at listen() time. Instead call
the recently introduced helper directly on the first subflow sock.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of them
is inet_listen().
Factor out a helper operating directly on the (locked) struct sock,
to allow getting rid of the above dependency in the next patch without
duplicating the existing code.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
the additional indirection at bind() time. Instead call
the recently introduced helpers directly on the first subflow sock.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of
them is bind().
Factor out the helpers operating directly on the struct sock, to
allow getting rid of the above dependency in the next patch without
duplicating the existing code.
No functional changes intended.
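As an illustration of the factoring pattern: the helper name and body
below are an approximation (error handling and the sk_prot->bind shortcut
are omitted), not necessarily the exact upstream helpers.

/* The socket-level op becomes a thin wrapper around a sock-level helper
 * that MPTCP can then call directly on the first subflow sock. */
int inet_bind_sk(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	/* body formerly living in inet_bind(), operating on sk only */
	return __inet_bind(sk, uaddr, addr_len, BIND_WITH_LOCK);
}

int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
	return inet_bind_sk(sock->sk, uaddr, addr_len);
}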
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to remove the first subflow socket soon, so avoid
accessing it in mptcp_get_port(). Instead, access directly the
first subflow sock.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp protocol maintains an additional socket just to easily
invoke a few stream operations on the first subflow. One of them is
__inet_stream_connect().
We are going to remove the first subflow socket soon, so avoid
the additional indirection at connect time, calling directly
into the sock-level connect() ops.
The sk-level connect never returns -EINPROGRESS, so clean up the error
path accordingly. Additionally, the ssk status on error is always
TCP_CLOSE. Avoid unneeded access to the subflow sk state.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP protocol currently clears the msk token both at connect() and
listen() time. That is needed to deal with failing connect() calls that
can create a new token while leaving the sk in TCP_CLOSE, SS_UNCONNECTED
status, thus allowing later connect() and/or listen() calls.
Let's deal with such failures explicitly, cleaning the token in a timely
manner and avoiding the confusing early mptcp_token_destroy().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'for-net-next-2023-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
bluetooth-next pull request for net-next:
- Add new VID/PID for Mediatek MT7922
- Add support for multiple BIS/BIG
- Add support for Intel Gale Peak
- Add support for Qualcomm WCN3988
- Add support for BT_PKT_STATUS for ISO sockets
- Various fixes for experimental ISO support
- Load FW v2 for RTL8852C
- Add support for NXP AW693 chipset
- Add support for Mediatek MT2925
Commit 39de828179 ("RDS: Main header file") declared but never implemented
rds_trans_init() and rds_trans_exit(), remove it.
Commit d37c935905 ("RDS: Move loop-only function to loop.c") removed the
implementation rds_message_inc_free() but not the declaration.
Since commit 55b7ed0b58 ("RDS: Common RDMA transport code")
rds_rdma_conn_connect() is never implemented and used.
rds_tcp_map_seq() is never implemented and used since
commit 70041088e3 ("RDS: Add TCP transport to RDS").
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_diag_put_flags(), netlink_setsockopt(), netlink_getsockopt()
and others use nlk->flags without correct locking.
Use set_bit(), clear_bit(), test_bit(), assign_bit() to remove
data-races.
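For illustration, the conversion boils down to something like the
following; upstream open-codes the bitops at the existing call sites,
with the NETLINK_F_* values becoming bit numbers and nlk->flags an
unsigned long:

/* before: racy read-modify-write on a u32 mask */
if (val)
	nlk->flags |= NETLINK_F_RECV_PKTINFO;
else
	nlk->flags &= ~NETLINK_F_RECV_PKTINFO;

/* after: NETLINK_F_RECV_PKTINFO becomes a bit number, nlk->flags an
 * unsigned long, and readers switch to test_bit() accordingly */
assign_bit(NETLINK_F_RECV_PKTINFO, &nlk->flags, !!val);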
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The debug message in tcp_retransmit_timer() is slightly wrong, because
it could be printed even if we did not receive a new ACK packet from
the remote peer.
Change it to 'probing zero-window', as that is an expected case now; the
old description may not be correct.
Also add the duration since the last ACK we received and the duration of
the retransmission, which are useful for debugging.
And the message now like this:
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_retransmit_timer(), a window-shrunk connection will be regarded
as timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is
not right all the time.
The retransmits will become zero-window probes in tcp_retransmit_timer()
if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
TCP_RTO_MAX sooner or later.
However, the timer can be delayed and triggered after 122877ms, not
TCP_RTO_MAX, as I tested.
Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO comes up to TCP_RTO_MAX, and the socket will die.
Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exactly the timestamp of the timeout.
However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp
could already be a long time (minutes or hours) in the past even on the
first RTO. So we double check the timeout with the duration of the
retransmission.
Meanwhile, making "2 * TCP_RTO_MAX" as the timeout to avoid the socket
dying too soon.
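Roughly, the resulting check has the following shape (a sketch based on
the description above, not necessarily the literal patch):

/* Consider the zero-window probes timed out only if both the time since
 * the last ACK (measured from the timer's own expiry time instead of
 * tcp_jiffies32) and the duration of the retransmission exceed
 * 2 * TCP_RTO_MAX. */
static bool tcp_rtx_probe0_timed_out(const struct sock *sk,
				     const struct sk_buff *skb)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	const int timeout = TCP_RTO_MAX * 2;
	u32 rcv_delta, rtx_delta;

	rcv_delta = (u32)inet_csk(sk)->icsk_timeout - tp->rcv_tstamp;
	if (rcv_delta <= timeout)
		return false;

	rtx_delta = (u32)msecs_to_jiffies(tcp_time_stamp(tp) -
			(tp->retrans_stamp ?: tcp_skb_timestamp(skb)));

	return rtx_delta > timeout;
}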
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For now, an ACK can update the window in the following cases, according to
tcp_may_update_window():
1. the ACK acknowledged new data
2. the ACK has new data
3. the ACK expands the window and its seq is valid
Now, we allow the ACK to update the window if the window is 0 and its
seq/ack is valid. This is for the case where the receiver replies with
a zero-window ACK when it is under memory stress and can't queue the new
data.
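Expressed as code, the acceptance check could look like this; the
"|| !nwin" style condition is an illustration of the described behaviour
rather than a quote of the final diff:

/* nwin is the window advertised by the incoming ACK. A window update is
 * accepted if the ACK acks new data, carries new data, or has a valid
 * seq and either grows the window or advertises a zero window. */
static inline bool tcp_may_update_window(const struct tcp_sock *tp,
					 const u32 ack, const u32 ack_seq,
					 const u32 nwin)
{
	return	after(ack, tp->snd_una) ||
		after(ack_seq, tp->snd_wl1) ||
		(ack_seq == tp->snd_wl1 &&
		 (nwin > tp->snd_wnd || !nwin));
}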
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For now, the skb will be dropped when there is no memory, which makes the
client keep retransmitting until timeout; this is not friendly to the users.
In this patch, we reply with a zero-window ACK in this case to update the
snd_wnd of the sender to 0. Therefore, the sender won't time out the
connection and will probe the zero-window with its retransmits.
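A rough sketch of the idea inside the receive path; the exact hook point
in tcp_data_queue() and the ACK scheduling mechanics in the real patch
may differ:

if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
	/* Instead of dropping silently, make sure an ACK goes out so the
	 * peer sees the (now zero) receive window and switches to
	 * zero-window probing rather than retransmitting until it times
	 * out. */
	tcp_send_ack(sk);
	goto drop;
}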
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unify the type of tty_operations::write() counters with the 'count'
parameter. I.e. use size_t for them.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Link: https://lore.kernel.org/r/20230810091510.13006-37-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some hooks in struct tty_ldisc_ops still reference buffers by 'unsigned
char'. Unify to 'u8' as the rest of the tty layer does.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230810091510.13006-32-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This makes all those 'char's an explicit 'u8'. This is part of the
continuing unification of chars and flags to be a consistent u8.
This approaches tty_port_default_receive_buf().
Note that we do not change signedness as we compile with
-funsigned-char.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Chris Brannon <chris@the-brannons.com>
Cc: Kirk Reiser <kirk@reisers.ca>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-18-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Count passed to tty_ldisc_ops::receive_buf*(), ::lookahead_buf(), and
returned from ::receive_buf2() is expected to be size_t. So set it to
size_t to unify with the rest of the code.
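Together with the 'u8' unification above, the affected hooks end up with
prototypes along these lines (a sketch; include/linux/tty_ldisc.h is
authoritative):

struct tty_ldisc_ops {
	/* ... */
	void	(*receive_buf)(struct tty_struct *tty, const u8 *cp,
			       const u8 *fp, size_t count);
	size_t	(*receive_buf2)(struct tty_struct *tty, const u8 *cp,
				const u8 *fp, size_t count);
	void	(*lookahead_buf)(struct tty_struct *tty, const u8 *cp,
				 const u8 *fp, size_t count);
	/* ... */
};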
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Chris Brannon <chris@the-brannons.com>
Cc: Kirk Reiser <kirk@reisers.ca>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Max Staudt <max@enpas.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230810091510.13006-16-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tty_ldisc_ops::poll() is optional and need not be provided; leaving it
out is equivalent to returning 0. So remove all those from the code.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230810091510.13006-4-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The CIS/CIG ids of ISO connections are defined only when the connection
is unicast.
Fix the lookup functions to check for unicast first. Ensure CIG/CIS
IDs have a valid value also in state BT_OPEN.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When the user tries to connect a new CIS while its CIG is not configurable,
that connection shall fail, but pre-existing connections shall not be
affected. However, currently hci_cc_le_set_cig_params deletes all CIS
of the CIG on error, so it doesn't work that way, even though the
controller shall not change the CIG/CIS configuration if the command fails.
Fix this by failing, on command error, only the connections that are not
yet bound, so that we keep the previous CIS configuration like the
controller does.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Remove unnecessary NULL check which causes coccinelle warning:
net/bluetooth/coredump.c:104:2-7: WARNING: NULL check before some
freeing functions is not needed.
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
KASAN reports a use-after-free in hci_add_adv_monitor().
While adding an adv monitor,
hci_add_adv_monitor() calls ->
msft_add_monitor_pattern() calls ->
msft_add_monitor_sync() calls ->
msft_le_monitor_advertisement_cb() calls in an error case ->
hci_free_adv_monitor() which frees the *monitor.
The freed monitor is then referenced by bt_dev_dbg() in hci_add_adv_monitor().
Fix the bt_dev_dbg() by using handle instead of monitor->handle.
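The shape of the fix, in simplified form (the surrounding code and the
debug string are illustrative only):

/* Save the handle before the call chain that can free *monitor on error,
 * and only log the saved copy afterwards. */
int handle = monitor->handle;
int status;

status = msft_add_monitor_pattern(hdev, monitor);
bt_dev_dbg(hdev, "add monitor %d msft status %d", handle, status);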
Fixes: b747a83690 ("Bluetooth: hci_sync: Refactor add Adv Monitor")
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Similar to commit c5d2b6fa26 ("Bluetooth: Fix use-after-free in
hci_remove_ltk/hci_remove_irk"). We can not access k after kfree_rcu()
call.
Fixes: d7d41682ef ("Bluetooth: Fix Suspicious RCU usage warnings")
Signed-off-by: Min Li <lm0963hack@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When running with concurrent tasks, only one CIS was being assigned, so
this attempts to rework the way the PDU is constructed so it is handled
later in the callback instead of in place.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This moves hci_is_le_conn_scanning to hci_core.h so it can be used by
different files without having to duplicate its code.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Only the number of CIS shall be limited to 0x1f; the CIS ID, on the
other hand, goes up to 0xef.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This introduces hci_conn_set_handle which takes care of verifying the
conditions where the hci_conn handle can be modified, including when
hci_conn_abort has been called, and also checks that the handle is
valid as well.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The valid range of CIG/CIS IDs is 0x00 to 0xEF, so this ensures they are
properly checked before attempting to use HCI_OP_LE_SET_CIG_PARAMS.
Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Connections may be cleaned up while waiting for the commands to complete,
so this attempts to check if the connection handle remains valid in case
of errors that would lead to calling hci_conn_failed:
BUG: KASAN: slab-use-after-free in hci_conn_failed+0x1f/0x160
Read of size 8 at addr ffff888001376958 by task kworker/u3:0/52
CPU: 0 PID: 52 Comm: kworker/u3:0 Not tainted
6.5.0-rc1-00527-g2dfe76d58d3a #5615
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.2-1.fc38 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
<TASK>
dump_stack_lvl+0x1d/0x70
print_report+0xce/0x620
? __virt_addr_valid+0xd4/0x150
? hci_conn_failed+0x1f/0x160
kasan_report+0xd1/0x100
? hci_conn_failed+0x1f/0x160
hci_conn_failed+0x1f/0x160
hci_abort_conn_sync+0x237/0x360
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When sending HCI_OP_CREATE_CONN_CANCEL it shall wait for
HCI_EV_CONN_COMPLETE, not HCI_EV_CMD_STATUS, when the reason is
anything but HCI_ERROR_REMOTE_POWER_OFF. This reason is used when
suspending or powering off, where we don't want to wait for the peer's
response.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Dropped CIS that are in state BT_OPEN/BT_BOUND, and in state BT_CONNECT
with HCI_CONN_CREATE_CIS unset, should be cleaned up immediately.
Closing CIS ISO sockets should result in the hci_conn being deleted, so
that potentially pending CIG removal can run.
hci_abort_conn cannot refer to them by handle, since their handle is
still unset if Set CIG Parameters has not yet completed.
This fixes CIS not being terminated if the socket is shut down
immediately after connection, so that the hci_abort_conn runs before Set
CIG Parameters completes. See new BlueZ test "ISO Connect Close - Success"
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Calling hci_conn_del in __iso_sock_close is invalid. It needs
hdev->lock, but it cannot be acquired there due to lock ordering.
Fix this by doing cleanup via hci_conn_drop.
Return hci_conn with refcount 1 from hci_bind_cis and hci_connect_cis,
so that the iso_conn always holds one reference. This also fixes
refcounting in error handling.
Since hci_conn_abort shall handle termination of connections in any
state properly, we can handle BT_CONNECT socket state in the same way as
BT_CONNECTED.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This was introduced in commit 903e454110 but never implemented.
Fixes: 903e454110 ("Bluetooth: AMP: Use HCI cmd to Read Loc AMP Assoc")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This adds support for the BT_PKT_STATUS sockopt by setting
BT_SK_PKT_STATUS. Then upon receiving an ISO packet the code would
attempt to store the Packet_Status_Flag in hci_skb_pkt_status, which
is then forwarded to userspace in the form of BT_SCM_PKT_STATUS whenever
BT_PKT_STATUS has been enabled/set.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This makes the handling of BT_PKT_STATUS more generic so it can be
reused by sockets other than SCO like BT_DEFER_SETUP, etc.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
If hci_unregister_dev() frees the hci_dev object while hci_suspend_notifier
may still be accessing it, it can cause the kernel to crash.
Here's the call trace:
<4>[102152.653246] Call Trace:
<4>[102152.653254] hci_suspend_sync+0x109/0x301 [bluetooth]
<4>[102152.653259] hci_suspend_dev+0x78/0xcd [bluetooth]
<4>[102152.653263] hci_suspend_notifier+0x42/0x7a [bluetooth]
<4>[102152.653268] notifier_call_chain+0x43/0x6b
<4>[102152.653271] __blocking_notifier_call_chain+0x48/0x69
<4>[102152.653273] __pm_notifier_call_chain+0x22/0x39
<4>[102152.653276] pm_suspend+0x287/0x57c
<4>[102152.653278] state_store+0xae/0xe5
<4>[102152.653281] kernfs_fop_write+0x109/0x173
<4>[102152.653284] __vfs_write+0x16f/0x1a2
<4>[102152.653287] ? selinux_file_permission+0xca/0x16f
<4>[102152.653289] ? security_file_permission+0x36/0x109
<4>[102152.653291] vfs_write+0x114/0x21d
<4>[102152.653293] __x64_sys_write+0x7b/0xdb
<4>[102152.653296] do_syscall_64+0x59/0x194
<4>[102152.653299] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
This patch holds the reference count of the hci_dev object while
processing it in hci_suspend_notifier to avoid potential crash
caused by the race condition.
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
HCI_MAX_AD_LENGTH shall only be used if the controller doesn't support
extended advertising, otherwise HCI_MAX_EXT_AD_LENGTH shall be used
instead.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The hci_add_adv_monitor() and hci_remove_adv_monitor() functions call
bt_dev_dbg() to print some debug statements. The bt_dev_dbg() macro
automatically adds in the device's name. That means that we shouldn't
include the name in the bt_dev_dbg() calls.
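For illustration (the format strings are made up, only the pattern
matters):

/* before: the device name appears twice in the output */
bt_dev_dbg(hdev, "%s add monitor %d", hdev->name, monitor->handle);

/* after: bt_dev_dbg() already prefixes the hdev name */
bt_dev_dbg(hdev, "add monitor %d", monitor->handle);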
Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Some use cases require the user to be informed if BIG synchronization
fails. This commit makes it so that even if the BIG sync established
event arrives with error status, a new hconn is added for each BIS,
and the iso layer is notified about the failed connections.
Unsuccessful bis connections will be marked using the
HCI_CONN_BIG_SYNC_FAILED flag. From the iso layer, the POLLERR event
is triggered on the newly allocated bis sockets, before adding them
to the accept list of the parent socket.
From user space, a new fd for each failed bis connection will be
obtained by calling accept. The user should check for the POLLERR
event on the new socket, to determine if the connection was successful
or not.
The HCI_CONN_BIG_SYNC flag has been added to mark whether the BIG sync
has been successfully established. This flag is checked at bis cleanup,
so the HCI LE BIG Terminate Sync command is only issued if needed.
The BT_SK_BIG_SYNC flag indicates if BIG create sync has been called
for a listening socket, to avoid issuing the command every time a BIGInfo
advertising report is received.
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This attempts to always allocate a unique handle for connections so they
can be properly aborted by the likes of hci_abort_conn. It uses the
invalid range as a pool of unset handles, so that if userspace is trying
to create multiple connections at once, each will be given a unique
handle which will still be considered unset.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
ISO_LINK connections were not being handled properly in
hci_abort_conn_sync, which sometimes resulted in sending the wrong
commands, or in the reject command being sent by the socket code
(iso.c), which is sort of a layer violation.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This consolidates code for aborting connections using
hci_cmd_sync_queue so it is synchronized with other threads, but
because some commands may block the cmd_sync_queue while waiting for
specific events, this attempts to cancel those requests by
using hci_cmd_sync_cancel.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In the case of a Synchronized Receiver capable device, enable at start-up the
events for PA reports, PA Sync Established and Big Info Adv reports.
Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This adds support for creating multiple BIGs. According to the
spec, each BIG shall have a unique handle, and each BIG should
be associated with a different advertising handle. Otherwise,
the LE Create BIG command will fail, with error code
Command Disallowed (for reusing a BIG handle), or
Unknown Advertising Identifier (for reusing an advertising
handle).
The btmon snippet below shows an exercise for creating two BIGs
for the same controller, by opening two isotest instances with
the following command:
tools/isotest -i hci0 -s 00:00:00:00:00:00
< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068) plen 31
Handle: 0x00
Advertising Handle: 0x01
Number of BIS: 1
SDU Interval: 10000 us (0x002710)
Maximum SDU size: 40
Maximum Latency: 10 ms (0x000a)
RTN: 0x02
PHY: LE 2M (0x02)
Packing: Sequential (0x00)
Framing: Unframed (0x00)
Encryption: 0x00
Broadcast Code: 00000000000000000000000000000000
> HCI Event: Command Status (0x0f) plen 4
LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 21
LE Broadcast Isochronous Group Complete (0x1b)
Status: Success (0x00)
Handle: 0x00
BIG Synchronization Delay: 912 us (0x000390)
Transport Latency: 912 us (0x000390)
PHY: LE 2M (0x02)
NSE: 3
BN: 1
PTO: 1
IRC: 3
Maximum PDU: 40
ISO Interval: 10.00 msec (0x0008)
Connection Handle #0: 10
< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068)
Handle: 0x01
Advertising Handle: 0x02
Number of BIS: 1
SDU Interval: 10000 us (0x002710)
Maximum SDU size: 40
Maximum Latency: 10 ms (0x000a)
RTN: 0x02
PHY: LE 2M (0x02)
Packing: Sequential (0x00)
Framing: Unframed (0x00)
Encryption: 0x00
Broadcast Code: 00000000000000000000000000000000
> HCI Event: Command Status (0x0f) plen 4
LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
Status: Success (0x00)
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This stores scm_creds into hci_skb_cb so they can be properly forwarded
to the likes of btmon, which is then able to print information about the
process that is originating the traffic:
bluetoothd[35]: @ MGMT Command: Rea.. (0x0001) plen 0 {0x0001}
@ MGMT Event: Command Complete (0x0001) plen 6 {0x0001}
Read Management Version Information (0x0001) plen 3
bluetoothd[35]: < ACL Data T.. flags 0x00 dlen 41
ATT: Write Command (0x52) len 36
Handle: 0x0043 Type: ASE Control Point (0x2bc6)
Data: 020203000110270000022800020a00409c0001000110270000022800020a00409c00
Opcode: QoS Configuration (0x02)
Number of ASE(s): 2
ASE: #0
ASE ID: 0x03
CIG ID: 0x00
CIS ID: 0x01
SDU Interval: 10000 usec
Framing: Unframed (0x00)
PHY: 0x02
LE 2M PHY (0x02)
Max SDU: 40
RTN: 2
Max Transport Latency: 10
Presentation Delay: 40000 us
ASE: #1
ASE ID: 0x01
CIG ID: 0x00
CIS ID: 0x01
SDU Interval: 10000 usec
Framing: Unframed (0x00)
PHY: 0x02
LE 2M PHY (0x02)
Max SDU: 40
RTN: 2
Max Transport Latency: 10
Presentation Delay: 40000 us
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This makes sure peer information is always available via sock when using
bt_sock_alloc.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This consolidates code around sk_alloc into bt_sock_alloc, which takes
care of common initialization.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
LE Create CIS command shall not be sent before all CIS Established
events from its previous invocation have been processed. Currently it is
sent via hci_sync, but that only waits for the first event, while there
can be multiple.
Make it wait for all events, and simplify the CIS creation as follows:
Add a new flag HCI_CONN_CREATE_CIS, which is set if Create CIS has been
sent for the connection but is not yet completed.
Make the BT_CONNECT state mean that the connection wants Create CIS.
On events after which new Create CIS may need to be sent, send it if
possible and some connections need it. These events are:
hci_connect_cis, iso_connect_cfm, hci_cs_le_create_cis,
hci_le_cis_estabilished_evt.
The Create CIS status/completion events shall queue new Create CIS only
if at least one of the connections transitions away from BT_CONNECT, so
that we don't loop if controller is sending bogus events.
This fixes sending multiple CIS Create for the same CIS in the
"ISO AC 6(i) - Success" BlueZ test case:
< HCI Command: LE Create Co.. (0x08|0x0064) plen 9 #129 [hci0]
Number of CIS: 2
CIS Handle: 257
ACL Handle: 42
CIS Handle: 258
ACL Handle: 42
> HCI Event: Command Status (0x0f) plen 4 #130 [hci0]
LE Create Connected Isochronous Stream (0x08|0x0064) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 29 #131 [hci0]
LE Connected Isochronous Stream Established (0x19)
Status: Success (0x00)
Connection Handle: 257
...
< HCI Command: LE Setup Is.. (0x08|0x006e) plen 13 #132 [hci0]
...
> HCI Event: Command Complete (0x0e) plen 6 #133 [hci0]
LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
...
< HCI Command: LE Create Co.. (0x08|0x0064) plen 5 #134 [hci0]
Number of CIS: 1
CIS Handle: 258
ACL Handle: 42
> HCI Event: Command Status (0x0f) plen 4 #135 [hci0]
LE Create Connected Isochronous Stream (0x08|0x0064) ncmd 1
Status: ACL Connection Already Exists (0x0b)
> HCI Event: LE Meta Event (0x3e) plen 29 #136 [hci0]
LE Connected Isochronous Stream Established (0x19)
Status: Success (0x00)
Connection Handle: 258
...
Fixes: c09b80be6f ("Bluetooth: hci_conn: Fix not waiting for HCI_EVT_LE_CIS_ESTABLISHED")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
It is required for some configurations to have multiple BISes as part
of the same BIG.
Similar to the flow implemented for unicast, DEFER_SETUP will also be
used to bind multiple BISes for the same BIG, before starting Periodic
Advertising and creating the BIG.
The user will have to open a new socket for each BIS. By setting the
BT_DEFER_SETUP socket option and calling connect, a new connection
will be added for the BIG and advertising handle set by the socket
QoS parameters. Since all BISes will be bound for the same BIG and
advertising handle, the socket QoS options and base parameters should
match for all connections.
By calling connect on a socket that does not have the BT_DEFER_SETUP
option set, periodic advertising will be started and the BIG will
be created, with a BIS for each previously bound connection. Since
a BIG cannot be reconfigured with additional BISes after creation,
no more connections can be bound for the BIG after the start periodic
advertising and create BIG commands have been queued.
The bis_cleanup function has also been updated, so that the advertising
set and the BIG will not be terminated unless there are no more
bound or connected BISes.
The HCI_CONN_BIG_CREATED connection flag has been added to indicate
that the BIG has been successfully created. This flag is checked at
bis_cleanup, so that the BIG is only terminated if the
HCI_LE_Create_BIG_Complete has been received.
This implementation has been tested on hardware, using the "isotest"
tool with an additional command line option, to specify the number of
BISes to create as part of the desired BIG:
tools/isotest -i hci0 -s 00:00:00:00:00:00 -N 2 -G 1 -T 1
The btmon log shows that a BIG containing 2 BISes has been created:
< HCI Command: LE Create Broadcast Isochronous Group (0x08|0x0068) plen 31
Handle: 0x01
Advertising Handle: 0x01
Number of BIS: 2
SDU Interval: 10000 us (0x002710)
Maximum SDU size: 40
Maximum Latency: 10 ms (0x000a)
RTN: 0x02
PHY: LE 2M (0x02)
Packing: Sequential (0x00)
Framing: Unframed (0x00)
Encryption: 0x00
Broadcast Code: 00000000000000000000000000000000
> HCI Event: Command Status (0x0f) plen 4
LE Create Broadcast Isochronous Group (0x08|0x0068) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 23
LE Broadcast Isochronous Group Complete (0x1b)
Status: Success (0x00)
Handle: 0x01
BIG Synchronization Delay: 1974 us (0x0007b6)
Transport Latency: 1974 us (0x0007b6)
PHY: LE 2M (0x02)
NSE: 3
BN: 1
PTO: 1
IRC: 3
Maximum PDU: 40
ISO Interval: 10.00 msec (0x0008)
Connection Handle #0: 10
Connection Handle #1: 11
< HCI Command: LE Setup Isochronous Data Path (0x08|0x006e) plen 13
Handle: 10
Data Path Direction: Input (Host to Controller) (0x00)
Data Path: HCI (0x00)
Coding Format: Transparent (0x03)
Company Codec ID: Ericsson Technology Licensing (0)
Vendor Codec ID: 0
Controller Delay: 0 us (0x000000)
Codec Configuration Length: 0
Codec Configuration:
> HCI Event: Command Complete (0x0e) plen 6
LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
Status: Success (0x00)
Handle: 10
< HCI Command: LE Setup Isochronous Data Path (0x08|0x006e) plen 13
Handle: 11
Data Path Direction: Input (Host to Controller) (0x00)
Data Path: HCI (0x00)
Coding Format: Transparent (0x03)
Company Codec ID: Ericsson Technology Licensing (0)
Vendor Codec ID: 0
Controller Delay: 0 us (0x000000)
Codec Configuration Length: 0
Codec Configuration:
> HCI Event: Command Complete (0x0e) plen 6
LE Setup Isochronous Data Path (0x08|0x006e) ncmd 1
Status: Success (0x00)
Handle: 11
< ISO Data TX: Handle 10 flags 0x02 dlen 44
< ISO Data TX: Handle 11 flags 0x02 dlen 44
> HCI Event: Number of Completed Packets (0x13) plen 5
Num handles: 1
Handle: 10
Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5
Num handles: 1
Handle: 11
Count: 1
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This patch checks for ISO_BROADCASTER and ISO_SYNC_RECEIVER in
controller.
Signed-off-by: Claudia Draghicescu <claudia.rosu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Still trending up in size but the good news is that the "current"
regressions are resolved, AFAIK.
We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
handling on embedded this week (most not merged yet), not sure why.
But those are all for older bugs.
Current release - regressions:
- tls: set MSG_SPLICE_PAGES consistently when handing encrypted
data over to TCP
Current release - new code bugs:
- eth: mlx5: correct IDs on VFs internal to the device (IPU)
Previous releases - regressions:
- phy: at803x: fix WoL support / reporting on AR8032
- bonding: fix incorrect deletion of ETH_P_8021AD protocol VID
from slaves, leading to BUG_ON()
- tun: prevent tun_build_skb() from exceeding the packet size limit
- wifi: rtw89: fix 8852AE disconnection caused by RX full flags
- eth/PCI: enetc: fix probing after 6fffbc7ae1 ("PCI: Honor
firmware's device disabled status"), keep PCI devices around
even if they are disabled / not going to be probed to be
able to apply quirks on them
- eth: prestera: fix handling IPv4 routes with nexthop IDs
Previous releases - always broken:
- netfilter: re-work garbage collection to avoid races between
user-facing API and timeouts
- tunnels: fix generating ipv4 PMTU error on non-linear skbs
- nexthop: fix infinite nexthop bucket dump when using maximum
nexthop ID
- wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
Misc:
- unix: use consistent error code in SO_PEERPIDFD
- ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO,
in prep for upcoming IETF RFC
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTVMSsACgkQMUZtbf5S
Irul3g//RlSANV/MWkiDmHIS5IhqkVWbvjGhFXFfdqZPH4gfgcX9VrsMuxgNM1Xu
YXGx+rIu408qNNkVG2hpFMxPerRiqVB/XsH1TxRr0Mi6AMFoKGXS+cGwzSOaoMQj
FYlcC6j2SnQ9N4I0qQuKOSOffyvyxrx/l9ozpVXsbGsOic1k6j1Ipwtf3+WP7dEe
kkAPUlsQPdCIhMyQdK3X4xI1PGLtAXFgY3VV9bZ7u99l7QBwmconkl3GHq/xnPa8
Uyll005ThyYce0c4EPVcrY1YBXyY0LjOBIRtiTFAk6CMWc0Su8Ug/i4+K2KTq0eh
yjqqHkpR//ruLgtAXBLLE9mxma8448vmmex/cSLIBaMAttlnj9n2LvCqvbzNfTZA
ssnKO4D3HhoQvHqbeOOW6VzVX7XyhomOvQXihfdLUs9u2tKE3nQoU+QCnrnIUXZO
VF5/ubCERRdZDPQ1SSAktimlC0R1qVL7JPMRaQF0aW5xByabbEWwMaNiwkYQOh2o
w2KsbhM/vWyd+5JB412LrNsEgK1BV6WjgwzC+27YQ7QD/JxUZBUghL0ps2jgU2Lu
d4YdbBOgYz+xyUBPByeYzcac0SIeMkB/UEcaO54ySWU8GcWYLt4KXwydUq/cXlw0
rUDCO5bikMxmygLKtnTSwmwvGbGByEXbGvVKwUwNPqTnR+vPIbM=
=NZgp
-----END PGP SIGNATURE-----
Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bpf.
Still trending up in size but the good news is that the "current"
regressions are resolved, AFAIK.
We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
handling on embedded this week (most not merged yet), not sure why.
But those are all for older bugs.
Current release - regressions:
- tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
over to TCP
Current release - new code bugs:
- eth: mlx5: correct IDs on VFs internal to the device (IPU)
Previous releases - regressions:
- phy: at803x: fix WoL support / reporting on AR8032
- bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
slaves, leading to BUG_ON()
- tun: prevent tun_build_skb() from exceeding the packet size limit
- wifi: rtw89: fix 8852AE disconnection caused by RX full flags
- eth/PCI: enetc: fix probing after 6fffbc7ae1 ("PCI: Honor
firmware's device disabled status"), keep PCI devices around even
if they are disabled / not going to be probed to be able to apply
quirks on them
- eth: prestera: fix handling IPv4 routes with nexthop IDs
Previous releases - always broken:
- netfilter: re-work garbage collection to avoid races between
user-facing API and timeouts
- tunnels: fix generating ipv4 PMTU error on non-linear skbs
- nexthop: fix infinite nexthop bucket dump when using maximum
nexthop ID
- wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
Misc:
- unix: use consistent error code in SO_PEERPIDFD
- ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
upcoming IETF RFC"
* tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
net: hns3: fix strscpy causing content truncation issue
net: tls: set MSG_SPLICE_PAGES consistently
ibmvnic: Ensure login failure recovery is safe from other resets
ibmvnic: Do partial reset on login failure
ibmvnic: Handle DMA unmapping of login buffs in release functions
ibmvnic: Unmap DMA login rsp buffer on send login fail
ibmvnic: Enforce stronger sanity checks on login response
net: mana: Fix MANA VF unload when hardware is unresponsive
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
ipv6: adjust ndisc_is_useropt() to also return true for PIO
selftests: forwarding: bridge_mdb: Make test more robust
selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
...
We used to change the flags for the last segment, because
non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
That flag is no longer a thing so remove the setting.
Since flags most likely don't have MSG_SPLICE_PAGES set
this avoids passing parts of the sg as splice and parts
as non-splice. Before commit under Fixes we'd have called
tcp_sendpage() which would add the MSG_SPLICE_PAGES.
Why this leads to trouble remains unclear but Tariq
reports hitting the WARN_ON(!sendpage_ok()) due to
page refcount of 0.
Fixes: e117dcfd64 ("tls: Inline do_tcp_sendpages()")
Reported-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/
Tested-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The existing attempt to resolve races between control plane and GC work
is error prone, as reported by Bien Pham <phamnnb@sea.com>: some places
forgot to call nft_set_elem_mark_busy(), leading to double-deactivation
of elements.
This series contains the following patches:
1) Do not skip expired elements during walk otherwise elements might
never decrement the reference counter on data, leading to memleak.
2) Add a GC transaction API to replace the former attempt to deal with
races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT
on elements and it creates a GC transaction to remove the expired
elements, GC transaction could abort in case of interference with
control plane and retried later (GC async). Set backends such as
rbtree and pipapo also perform GC from control plane (GC sync), in
such case, element deactivation and removal is safe because mutex
is held then collected elements are released via call_rcu().
3) Adapt existing set backends to use the GC transaction API.
4) Update rhash set backend to set on _DEAD bit to report deleted
elements from datapath for GC.
5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT.
* tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
netfilter: nf_tables: don't skip expired elements during walk
====================
Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-09
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 6 files changed, 102 insertions(+), 8 deletions(-).
The main changes are:
1) A bpf sockmap memleak fix and a fix in accessing the programs of
a sockmap under the incorrect map type from Xu Kuohai.
2) A refcount underflow fix in xsk from Magnus Karlsson.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
====================
Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set the NFT_SET_ELEM_DEAD_BIT flag on this element, instead of
performing element removal, which might race with an ongoing transaction.
Enable GC when the dynamic flag is set on, since dynset deletion requires
garbage collection after this patch.
Fixes: d0a8d877da ("netfilter: nft_dynset: support for element deletion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the GC transaction API to replace the old and buggy gc API and the
busy mark approach.
No set elements are removed by async garbage collection anymore;
instead the _DEAD bit is set, so the set element is no longer visible from
the lookup path. Async GC enqueues transaction work that might be
aborted and retried later.
The rbtree and pipapo set backends do not set the _DEAD bit from the
sync GC path, since this runs in the control plane path where the mutex
is held. In this case, set elements are deactivated, removed and then
released via an RCU callback; sync GC never fails.
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.
The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.
We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get the following deadlock:
cpu 1                                cpu 2
 GC work                             transaction comes in, lock nft mutex
  `acquire nft mutex // BLOCKS
                                     transaction asks to remove the set
                                     set destruction calls cancel_work_sync()
cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.
This patch adds a new API that deals with garbage collection in two
steps:
1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
so they are not visible via lookup. Annotate current GC sequence in
the GC transaction. Enqueue GC transaction work as soon as it is
full. If ruleset is updated, then GC transaction is aborted and
retried later.
2) GC work grabs the mutex. If GC sequence has changed then this GC
transaction lost race with control plane, abort it as it contains
stale references to objects and let GC try again later. If the
ruleset is intact, then this GC transaction deactivates and removes
the elements and it uses call_rcu() to destroy elements.
Note that no elements are removed from the GC lockless path; the _DEAD bit
is set and pointers are collected. GC catchall does not remove the
elements anymore either. There is a new set->dead flag that is set to
abort the GC transaction, to deal with the set->ops->destroy() path which
removes the remaining elements in the set from commit_release, where no
mutex is held.
To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.
Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.
We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
it's called with the transaction mutex held, but the aforementioned async
work queue might be blocked on the very mutex that nft_set_destroy()
callchain is sitting on.
This gives us the choice of ABBA deadlock or UaF.
To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.
Set backends are adapted to use the GC transaction API in a follow up
patch entitled:
("netfilter: nf_tables: use gc transaction API in set backends")
This is joint work with Florian Westphal.
Fixes: cfed7e1b1f ("netfilter: nf_tables: add set garbage collection helpers")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
strp_done is only called when psock->progs.stream_parser is not NULL,
but stream_parser was set to NULL by sk_psock_stop_strp(), called
by sk_psock_drop() earlier. So, strp_done can never be called.
Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
Change the condition for calling strp_done from judging whether
stream_parser is set to judging whether this flag is set. This flag is
only set once when strp_init() succeeds, and will never be cleared later.
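A sketch of the flag usage, using the names from this description; field
and flag names in the tree may differ slightly:

/* after a successful strp_init() on the psock (cb set up as before) */
ret = strp_init(&psock->strp, sk, &cb);
if (!ret)
	sk_psock_set_state(psock, SK_PSOCK_RX_ENABLED);

/* later, when tearing the psock down: key strp_done() off the flag
 * instead of the (already cleared) stream_parser pointer */
if (sk_psock_test_state(psock, SK_PSOCK_RX_ENABLED))
	strp_done(&psock->strp);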
Fixes: c0d95d3380 ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
sock_map_del_link() operates on both SOCKMAP and SOCKHASH. Although
both types have a member named "progs", the offset of the "progs" member
in these two types is different, so "progs" should be accessed according
to the real map type.
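A sketch of the idea; the container struct names (bpf_stab, bpf_shtab)
are taken from a reading of net/core/sock_map.c and should be treated as
assumptions here:

/* Pick the container that actually matches the map type instead of
 * unconditionally assuming the SOCKMAP layout. */
struct sk_psock_progs *progs;

if (link->map->map_type == BPF_MAP_TYPE_SOCKMAP)
	progs = &container_of(link->map, struct bpf_stab, map)->progs;
else /* BPF_MAP_TYPE_SOCKHASH */
	progs = &container_of(link->map, struct bpf_shtab, map)->progs;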
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Fix a refcount underflow problem reported by syzbot that can happen
when a system is running out of memory. If xp_alloc_tx_descs() fails,
which can only happen due to not having enough memory, the error
path is triggered. In this error path, the refcount of the pool is
decremented, as it was incremented before. However, the reference to
the pool in the socket was not nulled. This means that when the socket
is closed later, the socket teardown logic will think that there is a
pool attached to the socket and try to decrease the refcount again,
leading to a refcount underflow.
I chose this fix as it involved adding just a single line. Another
option would have been to move xp_get_pool() and the assignment of
xs->pool to after the if-statement and using xs_umem->pool instead of
xs->pool in the whole if-statement resulting in somewhat simpler code,
but this would have led to much more churn in the code base perhaps
making it harder to backport.
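In context, the one-line style fix looks roughly like this (simplified,
not the literal diff):

err = xp_alloc_tx_descs(xs->pool, xs);
if (err) {
	xp_put_pool(xs->pool);
	xs->pool = NULL;	/* drop the stale reference so teardown
				 * does not put the pool a second time */
	goto out_unlock;
}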
Fixes: ba3beec2ec ("xsk: Fix possible crash when multiple sockets are created")
Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This makes a difference for the software scheduling mode, where
dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
but when we're talking about what Qdisc and stats get reported for a
traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
is.
To understand the difference, I've attempted to send 100 packets in
software mode through class 8001:5, and recorded the stats before and
after the change.
Here is before:
$ tc -s class show dev eth0
class taprio 8001:1 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:2 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:3 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:4 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:5 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:6 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:7 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:8 root leaf 8001:
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
and here is after:
class taprio 8001:1 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:2 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:3 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:4 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:5 root
Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:6 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:7 root
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
class taprio 8001:8 root leaf 800d:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
window_drops 0
The most glaring (and expected) difference is that before, all class
stats reported the global stats, whereas now, they really report just
the counters for that traffic class.
Finally, Pedro Tammela points out that there is a tc selftest which
checks specifically which handle the child Qdiscs corresponding to
each class have. That's changing here - taprio no longer reports
tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0
(the handle of the default pfifo child Qdiscs). Since iproute2 does not
print a child Qdisc handle of 0, adjust the test's expected output.
Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As mentioned in commit af7b29b1de ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike
mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport
between Qdisc_ops :: init() and Qdisc_ops :: attach().
Delete the comment, which is just stolen from mqprio, but there, the
usage patterns are a lot different, and this is nothing but confusing.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is another stab at commit 1461d212ab ("net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later
reverted in commit af7b29b1de ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").
I believe that the problems that caused the revert were fixed, and thus,
this change is identical to the original patch.
Its purpose is to properly reject attaching a software taprio child
qdisc to a software taprio parent. Because unoffloaded taprio currently
reports itself (the root Qdisc) as the return value from qdisc_leaf(),
then the process of attaching another taprio as child to a Qdisc class
of the root will just result in a Qdisc_ops :: change() call for the
root. Whereas that's not what we want. We want Qdisc_ops :: init() to be
called for the taprio child, in order to give the taprio child a chance
to check whether its sch->parent is TC_H_ROOT or not (and reject this
configuration).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Normally, Qdiscs have one reference on them held by their owner and one
held for each TXQ to which they are attached, however this is not the
case with the children of an offloaded taprio. Instead, the taprio qdisc
currently lives in the following fragile equilibrium.
In the software scheduling case, taprio attaches itself (the root Qdisc)
to all TXQs, thus having a refcount of 1 + the number of TX queues. In
this mode, the q->qdiscs[] children are not visible directly to the
Qdisc API. The lifetime of the Qdiscs from this private array lasts
until qdisc_destroy() -> taprio_destroy().
In the fully offloaded case, the root taprio has a refcount of 1, and
all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
are attached to the netdev TXQs directly and thus are visible to the
Qdisc API, however taprio loses a reference to them very early - during
qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
the q->qdiscs[] array to not leak memory, but interestingly, it does not
release a reference on these qdiscs because it doesn't effectively own
them - they are created by taprio but owned by the Qdisc core, and will
be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
the Qdisc is deleted or when the child Qdisc is replaced with something
else.
My interest is to change this equilibrium such that taprio also owns a
reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
Qdisc, including in full offload mode. I want this because I would like
taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
insight into q->qdiscs[] for the software scheduling mode - currently
they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
as the root taprio.
The following set of changes is necessary:
- don't free q->qdiscs[] early in taprio_attach(), free it late in
taprio_destroy() for consistency with software mode. But:
- currently that's not possible, because taprio doesn't own a reference
on q->qdiscs[]. So hold that reference - once during the initial
attach() and once during subsequent graft() calls when the child is
changed.
- always keep track of the current child in q->qdiscs[], even for full
offload mode, so that we free in taprio_destroy() what we should, and
not something stale.
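A minimal sketch of that ownership model (simplified fragments, using
the q->qdiscs[] naming from sch_taprio.c; not the actual diff):
/* attach()/graft() path: taprio keeps its own reference on the child */
qdisc_refcount_inc(child);
q->qdiscs[ntx] = child;

/* taprio_destroy(): drop taprio's references in both software and full
 * offload mode, and only then free the array.
 */
if (q->qdiscs) {
	for (i = 0; i < dev->num_tx_queues; i++)
		qdisc_put(q->qdiscs[i]);
	kfree(q->qdiscs);
	q->qdiscs = NULL;
}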
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a simple code transformation with no intended behavior change,
just to make it absolutely clear that q->qdiscs[] is only attached to
the child taprio classes in full offload mode.
Right now we use the q->qdiscs[] variable in taprio_attach() for
software mode too, but that is quite confusing and avoidable. We use
it only to reach the netdev TX queue, and we could just as well use
netdev_get_tx_queue() for that.
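A sketch of the idea, assuming the existing taprio_attach() loop
structure (not the exact patch):
for (i = 0; i < dev->num_tx_queues; i++) {
	struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, i);
	struct Qdisc *old;

	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
		/* only full offload attaches the per-queue children */
		old = dev_graft_qdisc(dev_queue, q->qdiscs[i]);
	} else {
		/* software mode attaches the root taprio itself */
		old = dev_graft_qdisc(dev_queue, sch);
	}
	if (old)
		qdisc_put(old);
}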
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The upcoming (and nearly finalized):
https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/
will update the IPv6 RA to include a new flag in the PIO field,
which will serve as a hint to perform DHCPv6-PD.
As we don't want DHCPv6 related logic inside the kernel, this piece of
information needs to be exposed to userspace. The simplest option is to
simply expose the entire PIO through the already existing mechanism.
Even without this new flag, the already existing PIO R (router address)
flag (from RFC 6275) cannot, AFAICT, be handled entirely in the kernel,
and provides useful information that should be exposed to userspace
(the router's global address, for use by Mobile IPv6).
Also cc'ing stable@ for inclusion in LTS: while technically this is
not quite a bugfix, and instead more of a feature, it is absolutely
trivial, and the alternative is manually cherry-picking it into all
Android Common Kernel trees - and I know Greg will ask for it to be
sent in via LTS instead...
Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Cc: stable@vger.kernel.org
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20230807102533.1147559-1-maze@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I'm looking to enable -Wmissing-variable-declarations behind W=1. 0day
bot spotted the following instances:
net/llc/llc_conn.c:44:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_ack_timeout'
[-Wmissing-variable-declarations]
44 | int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ;
| ^
net/llc/llc_conn.c:44:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
44 | int sysctl_llc2_ack_timeout = LLC2_ACK_TIME * HZ;
| ^
net/llc/llc_conn.c:45:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_p_timeout'
[-Wmissing-variable-declarations]
45 | int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ;
| ^
net/llc/llc_conn.c:45:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
45 | int sysctl_llc2_p_timeout = LLC2_P_TIME * HZ;
| ^
net/llc/llc_conn.c:46:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_rej_timeout'
[-Wmissing-variable-declarations]
46 | int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ;
| ^
net/llc/llc_conn.c:46:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
46 | int sysctl_llc2_rej_timeout = LLC2_REJ_TIME * HZ;
| ^
net/llc/llc_conn.c:47:5: warning: no previous extern declaration for
non-static variable 'sysctl_llc2_busy_timeout'
[-Wmissing-variable-declarations]
47 | int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ;
| ^
net/llc/llc_conn.c:47:1: note: declare 'static' if the variable is not
intended to be used outside of this translation unit
47 | int sysctl_llc2_busy_timeout = LLC2_BUSY_TIME * HZ;
| ^
These symbols are referenced by more than one translation unit, so
include the correct header for their declarations. Finally, sort the
list of includes to help keep them tidy.
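The shape of the fix, shown with hypothetical file and variable names
(the real patch simply includes the existing llc header that already
carries the extern declarations):
/* example_sysctl.h (hypothetical) */
extern int sysctl_example_timeout;

/* example_sysctl.c (hypothetical) */
#include "example_sysctl.h"	/* declaration is visible before the definition */

int sysctl_example_timeout = 10;	/* no -Wmissing-variable-declarations */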
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/llvm/202308081000.tTL1ElTr-lkp@intel.com/
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808-llc_static-v1-1-c140c4c297e4@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPV6_ADDRFORM socket option is evil, because it can change sock->ops
while other threads might read it. Same issue for sk->sk_family
being set to AF_INET.
Adding READ_ONCE() over sock->ops reads is needed for sockets
that might be impacted by IPV6_ADDRFORM.
Note that mptcp_is_tcpsk() can also overwrite sock->ops.
Adding annotations for all sk->sk_family reads will require
more patches :/
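The reader-side annotation looks roughly like this (a simplified sketch
of the pattern, not the full patch, which also covers the writer side);
the KCSAN report below shows the racing accesses:
/* e.g. in the sendmsg path: load sock->ops once, via READ_ONCE(), so a
 * concurrent IPV6_ADDRFORM switch of the ops pointer is not a data race
 */
const struct proto_ops *ops = READ_ONCE(sock->ops);

return ops->sendmsg(sock, msg, msg_data_left(msg));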
BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt
write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0:
do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663
__sys_setsockopt+0x1c3/0x230 net/socket.c:2273
__do_sys_setsockopt net/socket.c:2284 [inline]
__se_sys_setsockopt net/socket.c:2281 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2281
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1:
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x349/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmmsg+0x263/0x500 net/socket.c:2643
__do_sys_sendmmsg net/socket.c:2672 [inline]
__se_sys_sendmmsg net/socket.c:2669 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffffffff850e32b8 -> 0xffffffff850da890
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
The first 4 patches, from Yue Haibing, remove unused prototypes in
various netfilter headers.
The last patch, from Maciej Żenczykowski, makes nfnetlink_log always
include a packet timestamp; up to now, one was only included if the skb
already had a timestamp assigned.
* tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nfnetlink_log: always add a timestamp
netfilter: h323: Remove unused function declarations
netfilter: conntrack: Remove unused function declarations
netfilter: helper: Remove unused function declarations
netfilter: gre: Remove unused function declaration nf_ct_gre_keymap_flush()
====================
Link: https://lore.kernel.org/r/20230808124159.19046-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.
The nexthop bucket dump callback always returns a positive number if
nexthop buckets were filled in the provided skb, even if the dump is
complete. This means that a dump will span at least two recvmsg() calls
as long as nexthop buckets are present. In the last recvmsg() call the
dump callback will not fill in any nexthop buckets because the previous
call indicated that the dump should restart from the last dumped nexthop
ID plus one.
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 6.66 nhid 1
id 10 index 1 idle_time 6.66 nhid 1
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++
This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# ip nexthop bucket
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
[...]
Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
id 4294967295 index 0 idle_time 6.61 nhid 1
id 4294967295 index 1 idle_time 6.61 nhid 1
+++ exited with 0 +++
Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
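For illustration only (hypothetical helpers, not the nexthop code), the
dump-callback contract that the fix relies on looks like this:
static int example_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	int err = example_fill(skb, cb);	/* hypothetical helper */

	if (err < 0 && err != -EMSGSIZE)
		return err;			/* hard error, abort the dump */
	if (example_dump_done(cb))		/* hypothetical helper */
		return 0;			/* done: core appends NLMSG_DONE */
	return skb->len;			/* more data: dump continues */
}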
Add a test that fails before the fix:
# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]
And passes after it:
# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]
Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging
to a specific resilient nexthop group. The function returns a positive
return code (the skb length) upon both success and failure.
The above behavior is problematic. When a complete nexthop bucket dump
is requested, the function that walks the different nexthops treats the
non-zero return code as an error. This causes buckets belonging to
different resilient nexthop groups to be dumped using different buffers
even if they can all fit in the same buffer:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 10 index 0 idle_time 10.27 nhid 1
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 20 index 0 idle_time 6.44 nhid 1
[...]
Fix by only returning a non-zero return code when an error occurred and
restarting the dump from the bucket index we failed to fill in. This
allows buckets belonging to different resilient nexthop groups to be
dumped using the same buffer:
# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 30.21 nhid 1
id 20 index 0 idle_time 26.7 nhid 1
[...]
While this change is more of a performance improvement than an actual
bug fix, it is a prerequisite for a subsequent patch that does fix a
bug.
Fixes: 8a1bbabb03 ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.
The nexthop dump callback always returns a positive number if nexthops
were filled in the provided skb, even if the dump is complete. This
means that a dump will span at least two recvmsg() calls as long as
nexthops are present. In the last recvmsg() call the dump callback will
not fill in any nexthops because the previous call indicated that the
dump should restart from the last dumped nexthop ID plus one.
# ip nexthop add id 1 blackhole
# strace -e sendto,recvmsg -s 5 ip nexthop
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
id 1 blackhole
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++
This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:
# ip nexthop add id $((2**32-1)) blackhole
# ip nexthop
id 4294967295 blackhole
id 4294967295 blackhole
[...]
Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:
# ip nexthop add id $((2**32-1)) blackhole
# strace -e sendto,recvmsg -s 5 ip nexthop
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56
id 4294967295 blackhole
+++ exited with 0 +++
Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.
Add a test that fails before the fix:
# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]
And passes after it:
# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]
Fixes: ab84be7e54 ("net: Initial nexthop code")
Reported-by: Petr Machata <petrm@nvidia.com>
Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If we successfully parsed an interface mode with a legacy switch
driver, populate that mode into phylink's supported interfaces rather
than defaulting to the internal and gmii interfaces.
This hasn't caused an issue so far, because when the interface doesn't
match a supported one, phylink_validate() doesn't clear the supported
mask, but instead returns -EINVAL. phylink_parse_fixedlink() doesn't
check this return value, and merely relies on the supported ethtool
link modes mask being cleared. Therefore, the fixed link settings end
up being allowed despite validation failing.
Before this causes a problem, arrange for DSA to more accurately
populate phylink's supported interfaces mask so validation can
correctly succeed.
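Sketched under assumptions (simplified from the fixed-link parsing path;
not the literal diff):
if (mode != PHY_INTERFACE_MODE_NA) {
	/* the legacy driver gave us a concrete mode: advertise exactly that */
	__set_bit(mode, config->supported_interfaces);
} else {
	/* previous behaviour: default to internal + gmii */
	__set_bit(PHY_INTERFACE_MODE_INTERNAL, config->supported_interfaces);
	__set_bit(PHY_INTERFACE_MODE_GMII, config->supported_interfaces);
}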
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1qTKdM-003Cpx-Eh@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5266698661 ("tipc: let broadcast packet reception use new link receive function")
declared this function but never implemented it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230807142926.45752-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enable io_uring commands on network sockets. Create two new
SOCKET_URING_OP commands that will operate on sockets.
In order to call ioctl on sockets, use the file_operations->io_uring_cmd
callback and map it to a uring socket function, which handles the
SOCKET_URING_OP accordingly and calls the socket ioctls.
This patch was tested by creating a new test case in liburing.
Link: https://github.com/leitao/liburing/tree/io_uring_cmd
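A rough userspace sketch of how such a command might be issued (assumed
liburing usage; only the SOCKET_URING_OP_SIOCOUTQ name comes from this
patch, the rest is illustrative):
#include <liburing.h>

int query_outq(int sockfd)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	io_uring_queue_init(8, &ring, 0);
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, sockfd, NULL, 0, 0);
	sqe->cmd_op = SOCKET_URING_OP_SIOCOUTQ;	/* bytes queued, not yet sent */
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	ret = cqe->res;				/* ioctl result or -errno */
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return ret;
}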
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the
number of MBSSID elements in the nested netlink attribute attrs, which can
lead to an integer overflow if a user of the nl80211 interface specifies
256 or more elements in the corresponding attribute in userspace. The
integer overflow can lead to a heap buffer overflow as num_elems determines
the size of the trailing array in elems, and this array is thereafter
written to for each element in attrs.
Note that this vulnerability only affects devices with the
wiphy->mbssid_max_interfaces member set for the wireless physical device
struct in the device driver, and can only be triggered by a process with
CAP_NET_ADMIN capabilities.
Fix this by checking for a maximum of 255 elements in attrs.
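Illustration of the overflow and the cap (simplified; the surrounding
parsing and error handling are assumed, not quoted from the patch):
u8 num_elems = 0;
struct nlattr *nl_elems;
int rem_elems;

nla_for_each_nested(nl_elems, attrs, rem_elems) {
	/* without this cap, the 256th element wraps num_elems back to 0,
	 * undersizing the allocation whose trailing array is written later
	 */
	if (num_elems >= 255)
		return ERR_PTR(-EINVAL);
	num_elems++;
}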
Cc: stable@vger.kernel.org
Fixes: dc1e3cb8da ("nl80211: MBSSID and EMA support in AP mode")
Signed-off-by: Keith Yeo <keithyjy@gmail.com>
Link: https://lore.kernel.org/r/20230731034719.77206-1-keithyjy@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There is an asymmetry between commit/abort and preparation phase if the
following conditions are met:
1. set is a verdict map ("1.2.3.4 : jump foo")
2. timeouts are enabled
In this case, the following sequence is problematic:
1. element E in set S refers to chain C
2. userspace requests removal of set S
3. kernel does a set walk to decrement chain->use count for all elements
from preparation phase
4. kernel does another set walk to remove elements from the commit phase
(or another walk to do a chain->use increment for all elements from
abort phase)
If E has already expired by the time of the walk in 3), it will be
ignored during the list walk, so its use count won't have been changed.
Then, when the set is culled, the ->destroy callback will zap the element
via nf_tables_set_elem_destroy(), but this function is only safe for
elements that have been deactivated earlier in the preparation phase:
the lack of an earlier deactivation removes the element but leaks the
chain use count, which results in a WARN splat when the chain gets
removed later, plus a leak of the nft_chain structure.
Update pipapo_get() not to skip expired elements, otherwise flush
command reports bogus ENOENT errors.
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tuning of the effective buffer size through setsockopts was working for
SMC traffic only but not for TCP fall-back connections even before
commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and
make them tunable"). That change made it apparent that TCP fall-back
connections would use net.smc.[rw]mem as buffer size instead of
net.ipv4.tcp_[rw]mem.
Amend the code that copies attributes between the (TCP) clcsock and the
SMC socket and adjust buffer sizes appropriately:
- Copy over sk_userlocks so that both sockets agree on whether tuning
via setsockopt is active.
- When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
- Likewise, use either values from setsockopt or from sysctl for SMC
(duplicated) on successful SMC connect.
In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
is taken care of by the attribute copy.
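A simplified sketch of the selection logic for the fall-back case
(structure assumed from the description above; not the actual hunk):
/* carry the user's setsockopt() intent over to the TCP clcsock */
clcsk->sk_userlocks = smc_sk->sk_userlocks;

if (smc_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
	clcsk->sk_sndbuf = smc_sk->sk_sndbuf;	/* user-tuned value */
else
	clcsk->sk_sndbuf = READ_ONCE(sock_net(clcsk)->ipv4.sysctl_tcp_wmem[1]);
/* and likewise for SOCK_RCVBUF_LOCK / sk_rcvbuf / tcp_rmem */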
Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.
Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.
Initialize net.smc.rmem and net.smc.wmem from their own constant of 64kB
each.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.
While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.
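A worked example of the intended arithmetic (illustrative values;
sysctl_smc_wmem is a hypothetical name for the net.smc.wmem value):
/* setsockopt() path: the user asks for 64 KiB, the core doubles it */
sk->sk_sndbuf = 2 * 65536;		/* 128 KiB */
sndbuf_request = sk->sk_sndbuf / 2;	/* SMC halves it again: 64 KiB */

/* sysctl path, no setsockopt(): duplicate net.smc.wmem the same way */
sk->sk_sndbuf = 2 * sysctl_smc_wmem;	/* with a 64 KiB default: 128 KiB */
Both paths therefore end up requesting a 64 KiB SMC send buffer.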
Background:
Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as the initial value to search for appropriate buffers. If the
search resorted to using a bigger buffer because all buffers of the
specified size were busy, twice the effective buffer size that was
actually used was stored back to sk_rcvbuf/sk_sndbuf.
When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:
[...]
SO_SNDBUF
Sets or gets the maximum socket send buffer in bytes.
The kernel doubles this value (to allow space for
bookkeeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2).
The default value is set by the
/proc/sys/net/core/wmem_default file and the maximum
allowed value is set by the /proc/sys/net/core/wmem_max
file. The minimum (doubled) value for this option is
2048.
[...]
Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Co-developed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
rather than ESRCH if a socket type does not support remote peer-PID
queries.
Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
not an AF_UNIX socket. This is quite unexpected, given that one would
assume ESRCH means the peer process already exited and thus cannot be
found. However, in that case the sockopt actually returns EINVAL (via
pidfd_prepare()). This is rather inconsistent with other syscalls, which
usually return ESRCH if a given PID refers to a non-existent process.
This changes SO_PEERPIDFD to return ENODATA instead. This is also what
SO_PEERGROUPS returns, and thus keeps a consistent behavior across
sockopts.
Note that this code is returned in 2 cases: first, if the socket type is
not AF_UNIX, and second, if the socket was not yet connected. In both
cases ENODATA seems suitable.
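A simplified sketch of the relevant branch in sk_getsockopt() (not the
literal diff):
struct pid *peer_pid;

spin_lock(&sk->sk_peer_lock);
peer_pid = get_pid(sk->sk_peer_pid);
spin_unlock(&sk->sk_peer_lock);

if (!peer_pid)
	return -ENODATA;	/* was -ESRCH: no peer PID recorded for this socket */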
Signed-off-by: David Rheinsberg <david@readahead.eu>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Fixes: 7b26952a91 ("net: core: add getsockopt SO_PEERPIDFD")
Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>