Detach the skb from ctx->recv_pkt after decryption is done,
even if we can't consume it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I thought that having the skb either always on the ctx->rx_list
or ctx->recv_pkt will simplify the handling, as we would not
have to remember to flip it from one to the other on exit paths.
This became a little harder to justify with the fix for BPF
sockmaps. Subsequent changes will make the situation even worse.
Queue the skbs only when really needed.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
recvmsg() in TLS gets data from the skb list (rx_list) or fresh
skbs we read from TCP via strparser. The former holds skbs which were
already decrypted for peek or decrypted and partially consumed.
tls_wait_data() only notices appearance of fresh skbs coming out
of TCP (or psock). It is possible, if there is a concurrent call
to peek() and recv() that the peek() will move the data from input
to rx_list without recv() noticing. recv() will then read data out
of order or never wake up.
This is not a practical use case/concern, but it makes the self
tests less reliable. This patch solves the problem by allowing
only one reader in.
Because having multiple processes calling read()/peek() is not
normal avoid adding a lock and try to fast-path the single reader
case.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
SMC-R link group.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On long-running enterprise production servers, high-order contiguous
memory pages are usually very rare and in most cases we can only get
fragmented pages.
When replacing TCP with SMC-R in such production scenarios, attempting
to allocate high-order physically contiguous sndbufs and RMBs may result
in frequent memory compaction, which will cause unexpected hung issue
and further stability risks.
So this patch is aimed to allow SMC-R link group to use virtually
contiguous sndbufs and RMBs to avoid potential issues mentioned above.
Whether to use physically or virtually contiguous buffers can be set
by sysctl smcr_buf_type.
Note that using virtually contiguous buffers will bring an acceptable
performance regression, which can be mainly divided into two parts:
1) regression in data path, which is brought by additional address
translation of sndbuf by RNIC in Tx. But in general, translating
address through MTT is fast.
Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
latency and bandwidth test with physically and virtually contiguous
buffers are as follows:
- client:
smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
-t 5 -vu tcp_{bw|lat}
- server:
smc_run taskset -c <cpu> qperf
[latency]
msgsize tcp smcr smcr-use-virt-buf
1 11.17 us 7.56 us 7.51 us (-0.67%)
2 10.65 us 7.74 us 7.56 us (-2.31%)
4 11.11 us 7.52 us 7.59 us ( 0.84%)
8 10.83 us 7.55 us 7.51 us (-0.48%)
16 11.21 us 7.46 us 7.51 us ( 0.71%)
32 10.65 us 7.53 us 7.58 us ( 0.61%)
64 10.95 us 7.74 us 7.80 us ( 0.76%)
128 11.14 us 7.83 us 7.87 us ( 0.47%)
256 10.97 us 7.94 us 7.92 us (-0.28%)
512 11.23 us 7.94 us 8.20 us ( 3.25%)
1024 11.60 us 8.12 us 8.20 us ( 0.96%)
2048 14.04 us 8.30 us 8.51 us ( 2.49%)
4096 16.88 us 9.13 us 9.07 us (-0.64%)
8192 22.50 us 10.56 us 11.22 us ( 6.26%)
16384 28.99 us 12.88 us 13.83 us ( 7.37%)
32768 40.13 us 16.76 us 16.95 us ( 1.16%)
65536 68.70 us 24.68 us 24.85 us ( 0.68%)
[bandwidth]
msgsize tcp smcr smcr-use-virt-buf
1 1.65 MB/s 1.59 MB/s 1.53 MB/s (-3.88%)
2 3.32 MB/s 3.17 MB/s 3.08 MB/s (-2.67%)
4 6.66 MB/s 6.33 MB/s 6.09 MB/s (-3.85%)
8 13.67 MB/s 13.45 MB/s 11.97 MB/s (-10.99%)
16 25.36 MB/s 27.15 MB/s 24.16 MB/s (-11.01%)
32 48.22 MB/s 54.24 MB/s 49.41 MB/s (-8.89%)
64 106.79 MB/s 107.32 MB/s 99.05 MB/s (-7.71%)
128 210.21 MB/s 202.46 MB/s 201.02 MB/s (-0.71%)
256 400.81 MB/s 416.81 MB/s 393.52 MB/s (-5.59%)
512 746.49 MB/s 834.12 MB/s 809.99 MB/s (-2.89%)
1024 1292.33 MB/s 1641.96 MB/s 1571.82 MB/s (-4.27%)
2048 2007.64 MB/s 2760.44 MB/s 2717.68 MB/s (-1.55%)
4096 2665.17 MB/s 4157.44 MB/s 4070.76 MB/s (-2.09%)
8192 3159.72 MB/s 4361.57 MB/s 4270.65 MB/s (-2.08%)
16384 4186.70 MB/s 4574.13 MB/s 4501.17 MB/s (-1.60%)
32768 4093.21 MB/s 4487.42 MB/s 4322.43 MB/s (-3.68%)
65536 4057.14 MB/s 4735.61 MB/s 4555.17 MB/s (-3.81%)
2) regression in buffer initialization and destruction path, which is
brought by additional MR operations of sndbufs. But thanks to link
group buffer reuse mechanism, the impact of this kind of regression
decreases as times of buffer reuse increases.
Taking 256KB sndbuf and RMB as an example, latency of some key SMC-R
buffer-related function obtained by bpftrace are as follows:
Function Phys-bufs Virt-bufs
smcr_new_buf_create() 67154 ns 79164 ns
smc_ib_buf_map_sg() 525 ns 928 ns
smc_ib_get_memory_region() 162294 ns 161191 ns
smc_wr_reg_send() 9957 ns 9635 ns
smc_ib_put_memory_region() 203548 ns 198374 ns
smc_ib_buf_unmap_sg() 508 ns 1158 ns
------------
Test environment notes:
1. Above tests run on 2 VMs within the same Host.
2. The NIC is ConnectX-4Lx, using SRIOV and passing through 2 VFs to
the each VM respectively.
3. VMs' vCPUs are binded to different physical CPUs, and the binded
physical CPUs are isolated by `isolcpus=xxx` cmdline.
4. NICs' queue number are set to 1.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a new SMC-R specific element buf_type
in struct smc_link_group, for recording the value of sysctl
smcr_buf_type when link group is created.
New created link group will create and reuse buffers of the
type specified by buf_type.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the sysctl smcr_buf_type for setting
the type of SMC-R sndbufs and RMBs.
Valid values includes:
- SMCR_PHYS_CONT_BUFS, which means use physically contiguous
buffers for better performance and is the default value.
- SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
buffers in case of physically contiguous memory is scarce.
- SMCR_MIXED_BUFS, which means first try to use physically
contiguous buffers. If not available, then use virtually
contiguous buffers.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some CPU, such as Xeon, can guarantee DMA cache coherency.
So it is no need to use dma sync APIs to flush cache on such CPUs.
In order to avoid calling dma sync APIs on the IO path, use the
dma_need_sync to check whether smc_buf_desc needs dma sync when
creating smc_buf_desc.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_ib_sync_sg_for_cpu/device are the ops used for dma memory cache
consistency. Smc sndbufs are dma buffers, where CPU writes data to
it and PCIE device reads data from it. So for sndbufs,
smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
redundant as PCIE device will not write the buffers. Smc rmbs
are dma buffers, where PCIE device write data to it and CPU read
data from it. So for rmbs, smc_ib_sync_sg_for_cpu is needed and
smc_ib_sync_sg_for_device is redundant as CPU will not write the buffers.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a third knob, '2', which extends the
accept_untracked_na option to learn a neighbor only if the src ip is
in the same subnet as an address configured on the interface that
received the neighbor advertisement. This is similar to the arp_accept
configuration for ipv4.
Signed-off-by: Jaehee Park <jhpark1013@gmail.com>
Suggested-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In many deployments, we want the option to not learn a neighbor from
garp if the src ip is not in the same subnet as an address configured
on the interface that received the garp message. net.ipv4.arp_accept
sysctl is currently used to control creation of a neigh from a
received garp packet. This patch adds a new option '2' to
net.ipv4.arp_accept which extends option '1' by including the subnet
check.
Signed-off-by: Jaehee Park <jhpark1013@gmail.com>
Suggested-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit e21145a987 ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis. Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb3 ("net: Add sysctl to toggle early demux for
tcp and udp"). However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic. Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline. So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.
Fixes: dddb64bcb3 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to some changes and rebasing between different patches
this fell through the cracks; we need to set sta.mlo if the
connection is using MLO.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Unfortunately, a printk snuck into a previous patch,
remove it.
Fixes: 81151ce462 ("wifi: mac80211: support MLO authentication/association with one link")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
While reading sysctl_tcp_probe_interval, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 05cbc0db03 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_probe_threshold, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 6b58e0a5f3 ("ipv4: Use binary search to choose tcp PMTU probe_size")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_mtu_probe_floor, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: c04b79b6cf ("tcp: add new tcp_mtu_probe_floor sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_min_snd_mss, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5f3e2bf008 ("tcp: add tcp_min_snd_mss sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_base_mss, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5d424d5a67 ("[TCP]: MTU probing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_mtu_probing, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5d424d5a67 ("[TCP]: MTU probing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_autobind_reuse, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 4b01a96742 ("tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_nonlocal_bind, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_fwd_update_priority, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 432e05d328 ("net: ipv4: Control SKB reprioritization after forwarding")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_fwd_use_pmtu, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: f87c10a8aa ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_no_pmtu_disc, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_default_ttl, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
delay_timer has been unused since commit c3498d34dd ("cbq: remove
TCA_CBQ_OVL_STRATEGY support"). Delete it.
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It might seem a bit pointless to do a multi-link operation
connection with just a single link, but this is already a
big change, so for now, limit MLO connections to a single
link.
Extending that to multiple links will require
* work on parsing the multi-link element with STA profile
properly, including element fragmentation;
* checking the per-link status in the multi-link element
* implementing logic to have active/inactive links to let
drivers decide which links should be active;
* implementing multicast RX deduplication;
* and likely more.
For now this is still useful since it lets us do multi-link
connections for the purposes of testing APIs and the higher
layers such as wpa_supplicant.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add the necessary API to parse the multi-link element in
the future. For now, link only to the element when found
so we can use it in the client-side code later.
Later, we'll need to fill this in to deal with element
fragmentation, parse the STA profile, etc.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In some cases, e.g. with Qualcomm devices and management
frames, or in hwsim, frames may be reported from the driver
with link addresses, but for decryption and matching needs
we really want to have them with MLD addresses. Support the
translation on RX.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When an MLO AP is transmitting to a non-MLO station, addr2 should be set
to a link address. This should be done before the frame is encrypted as
otherwise aad verification would fail. In case of software encryption
this can't be left for the device to handle, and should be done by
mac80211 when building the frame hdr.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we create a station with a non-default link, then
we should have a link address, and we definitely need
to insert it into the link hash table on insertion.
Split the API into with and without link creation and
if it has a link, insert the link into the link hash
table on sta_info_insert().
Fixes: ba6ddab94f ("wifi: mac80211: maintain link-sta hash table")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In AP/mesh where the stations are added by userspace, we
limit the number of A-MSDU subframes according to the
extended capabilities.
Refactor the code and extend that also to client-side.
Fixes: 506bcfa8ab ("mac80211: limit the A-MSDU Tx based on peer's capabilities")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Split out much of the code in ieee80211_set_associated()
into a new ieee80211_link_set_associated() which can be
called per link later for MLO.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a helper function cfg80211_get_iftype_ext_capa() to
look up interface type-specific (extended) capabilities.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If NEED_DTIM_BEFORE_ASSOC isn't set, then we don't need
to enter an RCU critical section and look up the beacon
elements.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Factor out the code to set up the assoc link into a
new function ieee80211_setup_assoc_link().
While at it, also modify the 'override' handling to
just take into account whether or not the conn_flags
were changed, which is what we need to setup again
the channel later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no need to pass the address, we can look at the auth_data
inside the function rather than outside.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Refactor the per-link setup out of ieee80211_assoc_success()
into a new function ieee80211_assoc_config_link().
It looks useless for now to parse the elements again inside
ieee80211_assoc_config_link(), but that will be done with
the link ID in the future.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Refactor ieee80211_prep_channel() to make the link argument
optional and add a conn_flags pointer argument instead, so
that we can later use this for links that don't exist yet
to build the right information for MLO.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For MLO, we will need to build these elements per link, so
factor out the code that does this, returning the capability,
to simplify building the multi-link element in the future.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With MLO, when we'll disconnect from an AP MLD, we'll just
destroy all the links. Therefore, the only thing we (may)
need to reset is the deflink data, so switch back to that
and adjust the comments accordingly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For MLO we'll need to read flags not directly from the link as
it may not even exist yet if we're just setting up flags for
a secondary link before sending the association request, so
pass the incoming conn_flags separately. Also, while at it,
pass the sdata/link separately as for non-tracking now the
link may be NULL.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We'll need ieee80211_prep_channel() in other code for MLO
later, so move the code up - unchanged for now - to avoid
forward declarations in the future.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Refactor the code here since we need to have it also for each
link station after association in MLO later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The flag here is currently per interface, but the way we
set and clear it means it should be per link, so change
it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When sending an authentication frame from an MLD, include
the multi-link element with the MLD address and use the
link address for transmission.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Clean up the code building the supported channels element
a little bit by using a local variable instead of the long
line.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For now, prohibit DEAUTH_NEED_MGD_TX_PREP since we can't
really transmit this on a specific link yet as we don't
know which links are active.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The new NL80211_CMD_ADD_LINK_STA and NL80211_CMD_MODIFY_LINK_STA
commands have strict policy validation, so fix the policy so it
can be validated correctly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The underlying mac80211 code cannot deal with fragmented
elements for purposes of sorting the elements into the
association frame, so reject those inside the link. We
might want to reject them inside the assoc frame, but
they're used today for FILS, so cannot do that.
The non-inheritance element inside the links similarly
cannot be handled by mac80211, and outside the links it
makes no sense.
Reject both since using them could lead to an incorrect
implementation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we associate, we'll include all the elements for the
link we're sending the association request on in the frame
and the specific ones for other links in the multi-link
element container. Prohibit adding link-specific elements
for the association link.
Fixes: d648c23024 ("wifi: nl80211: support MLO in auth/assoc")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The link loop will always have a valid link so that
it's always set, but static checkers don't always
see that, so set it to NULL explicitly.
Fixes: efbabc1165 ("cfg80211: Indicate MLO connection info in connect and roam callbacks")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since struct ieee80211_bss_conf already contains link_id,
passing link_id is not necessary.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since mac80211 already has a protected pointer to link_conf,
pass it to the driver to avoid additional RCU locking.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
At least while we don't have any more specific interface
combinations support, add a simple flag for MLO support,
we can keep this later based on something other than the
wiphy flag.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Recalculate min channel context for the given or all interface
links, depending on the caller. For a station state change, we
need to recalculate all of them since we don't know which link
(or multiple) it might be on.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We check here that we don't enable TX (netif_carrier_ok())
before we actually start using some channel context, but to
our knowledge this check has never triggered, and with MLO
it's just wrong since links can be added and removed much
more dynamically than before.
Simply remove the checks, there's no really good way to do
anything that would replace them.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This simplifies hostapd implementation, since it didn't
switch to NL80211_CMD_SET_CHANNEL.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow link source address on TX.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow transmitting EAPOL frames not only from the interface
address (which is the MLD address) but also any link addresses,
in order to support non-MLO stations on AP interfaces.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case of authentication with a legacy station, link addressed EAPOL
frames should be sent. Support it.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Set the MLD parameters in NL80211_CMD_SET_STATION handling
to be able to change an MLD station.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We should check that MLO connections are supported before
attempting to authenticate with MLO parameters, check that.
Fixes: d648c23024 ("wifi: nl80211: support MLO in auth/assoc")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The way this works is that you add all the element data,
keeping a pointer to the length field of the element.
Then call this helper function, which will fragment the
element if there was more than 255 bytes in the element,
memmove()ing the data back if needed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For now, skip rate statistics here to avoid warnings in
the called code, we'll need to adjust this to have all
the statistics for link stations.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the BSS lookup returned an error, set it to NULL so we
don't try to free it.
Fixes: d648c23024 ("wifi: nl80211: support MLO in auth/assoc")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We have the per-interface type capabilities, currently for
extended capabilities, add the EML/MLD capabilities there
to have this advertised by the driver.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we add a station on an MLD, we need a link ID to see
where it lives (by default). Validate the link ID against
the valid_links.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we add non-deflink pointers, we need to remove the
link[0] pointer to deflink in case link[0] is not valid
afterwards. Also, we need to add that back when there
are no more valid links. Reorg the code to fix that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we remove a link that doesn't have a channel context,
we don't really need the local->mtx locking. Tighten the
check here.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This was missing earlier, we need to remove links when
interfaces are being destroyed, and we also need to
stop (AP) operations when a link is being destroyed.
Address these issues to remove many warnings that will
otherwise appear in mac80211.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We need to consider the (maximum) size of the EHT element
we'll add for the association request, otherwise we may run
out of space.
Fixes: 820acc810f ("mac80211: Add EHT capabilities to association/probe request")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The functions currently take a link and check data
from it, but this needs to change for MLO. Simplify
the prototypes by passing only the needed arguments.
Remove the regulatory checks, the warnings shouldn't
trigger, and haven't as far as I know.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Rework the sorting of custom elements into the association
request by moving the elements before HT/VHT/HE to each
their own function. While at it, fix the placement of the
ones that should be between VHT and HE.
This doesn't fix the placement of elements that should be
between HE and EHT yet, a similar change might be needed
in the future.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's some awkward code that really only exists
because we want to optimize the allocation size,
but that's not really all that necessary.
Refactor the code that adds rates to the association
request frame to have a separate function, removing
the goto.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Here, ext_capa is checked and can only be non-NULL if
assoc_data->ie_len was set before, so the check here
is redundant.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We need to handle the link addresses for station differently,
they will be determined by the association code, stored, and
then applied when the links are actually created on success,
cfg80211 will fill in the right addresses per the data we're
sending back to it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When parsing a frame containing a multi-BSSID element, we
need to know both the transmitted and non-transmitted BSSID
so we can parse it correctly.
Unfortunately, in quite a number of cases, we got this wrong
and were passing the wrong BSSID or useless information:
* the mgmt->bssid from a frame is only the transmitted
BSSID if the frame is a beacon
* passing just one of the parameters as non-NULL isn't
useful and ignored
In those case where we need to parse for a specific BSS we
always have a BSS structure pointer, representing the BSS
we need, whether transmitted or not. Thus, pass that pointer
to the parsing function instead of the two BSSIDs.
Also fix two bugs:
* we need to re-parse all the elements for the other BSS
when iterating the non-transmitted BSSes in scan
* we need to parse for the correct BSS when setting up
the channel data in client code
Fixes: 78ac51f815 ("mac80211: support multi-bssid")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We're already passing the elems pointer, and have parsed
them from the same frame with exactly the same parameters,
so don't need to do that again.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When calling start/stop_ap(), mac80211 already has a protected
link_conf pointer. Pass it to the driver, so it shouldn't
handle RCU protection.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Refactor the element parsing into a version that has
a parameter struct so we can add more parameters more
easily in the future.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Extend the cfg80211_rx_assoc_resp() to cover multiple
BSSes, the AP MLD address and local link addresses
for MLO.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For MLO we'll need a lot more arguments, including all the
BSS pointers and link addresses, so move the data to a struct
to be able to extend it more easily later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We only report the BSSID to userspace, so change the
argument from BSS struct pointer to AP address, which
we'll use to carry either the BSSID or AP MLD address.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are a few cases where we send an event to cfg80211
manually, but ieee80211_destroy_assoc_data() also handles
the case of abandoning; some cases don't need an event
and success is handled yet differently.
Unify this by providing a single status argument to the
ieee80211_destroy_assoc_data() function and then handling
all the different cases of events (or no events) there.
This will help simplify the code when MLO support is
added.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For MLO, we need the ability to report back multiple BSS
structures to release, as well as the AP MLD address (if
attempting to make an MLO connection).
Unify cfg80211_assoc_timeout() and cfg80211_abandon_assoc()
into a new cfg80211_assoc_failure() that gets a structure
parameter with the necessary data.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The race described by the comment in mac80211 hasn't existed
since the locking rework to use the same lock and for MLO we
need to pass the AP MLD address, so just pass the BSSID or
AP MLD address instead of the BSS struct pointer, and adjust
all the code accordingly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For station capabilities, e.g. TWT, we need to use the correct
link station instead of deflink. Switch the code to do that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The argument is unused except for NULL checking, but we already
do that anyway, so it's not needed. Remove the argument.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This requires a few more changes.
While at it, also add a warning to ieee80211_get_sband()
to avoid it being used when there are multiple links.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we decide to stop tracking QoS/WMM parameters, then
this should be a per-link decision. Move the flag to
the link instead.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Do the first adjustments in the client-side code to pass
the link pointer (instead of sdata) to most places etc.
This is just preparation, so the real MLO patches become
smaller.
Note that this isn't complete, notably there are still
quite a few references to sta->deflink and sta->sta.deflink.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove the IEEE80211_STA_RESET_SIGNAL_AVE flag and use
a bool instead, but invert the polarity (now calling it
tracking_signal_avg) so we don't have to initialize it,
and put that into the link instead.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
To prepare a bit more for MLO in the client code,
track the AP's address (for now only the BSSID, but
will track the AP MLD's address later) separately
from the per-link BSSID.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In MLO, expect the driver fully handles powersave handling,
including tracking whether or not a beacon was received,
the DTIM period, etc.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This really shouldn't be in a per-link config, we don't want
to let anyone control it that way (if anything, link powersave
could be forced through APIs to activate/deactivate a link),
and we don't support powersave in software with devices that
can do MLO.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It might be useful to drivers to be able to pass only the
link_conf pointer, rather than both the pointer and the
link_id; add the link_id to the link_conf to facility that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In station/client mode, the link data needs a bit more
initialization and destruction than just zero-init and
kfree() respectively, implement that.
This required some shuffling of the link data handling
in general, as we should set it up in setup and do the
teardown in teardown, otherwise we're asymmetric in
case of interface type changes.
Also stop using kfree_rcu(), we cannot guarantee that
nothing is scheduling things that live within the link
(e.g. the u.mgd.request_smps_work) until we're sure it
cannot be referenced anymore, therefore synchronize
instead. This isn't very efficient, but we can always
optimize it later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
At least the quantenna driver calls wdev_chandef() here
which now requires the lock, so acquire it.
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With the split into keys[]/deflink.gtk[] arrays, WEP keys are
still installed into the keys[] array, but we didn't look them
up there. This meant they weren't deleted correctly.
Fix this by looking up the key there even if it's not pairwise
so we can be sure we don't have it.
Fixes: bfd8403add ("wifi: mac80211: reorg some iface data structs for MLD")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Separate out the connection downgrade flags from the ifmgd->flags
and put them into the link information instead. While at it, make
them a separate sparse type so we don't get confused about where
they belong and have static checking on correct handling.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Align the mac80211 implementation with P802.11be_D1.5.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are a few places that check ps_sdata and/or the dynamic
PS timeout, but they're erroneous in case SUPPORTS_DYNAMIC_PS
is set by the driver.
Skip the entire recalculation in this case so we cannot get
into those paths elsewhere, and so we simplify this for the
purpose of implementing MLO.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we don't really have multiple links, omit the link ID from
link debug prints, otherwise we change the format for all of
the existing drivers (most of which might never support MLO),
and also have extra noise in the logs.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For multi-link operation, this cannot work as the req->bss pointer
will be NULL, and we'll need to do more work on this to really add
tracing for the MLO case here. Drop the BSS elements for now as
they're not the most useful thing, and it's hard to size things
correctly for the MLO case (without adding a lot of code that's
also executed when tracing isn't enabled.)
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since for an MLD station the default link is added together
with the add station command, allow also setting the link
MAC address.
Otherwise, it is needed to use the modify link API only
for setting the link MAC address.
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since links can be added and removed dynamically, we need to
somehow protect the sdata->link[] and vif->link_conf[] array
pointers from disappearing when accessing them without locks.
RCU-ify the pointers to achieve this, which requires quite a
bit of rework.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since this will need to refer - at least in part - to the link
stations of an MLD, hold the wdev mutex for driver convenience.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since we deal with links, and that requires looking at wdev links,
we should hold the wdev mutex for driver convenience.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Put the link_station_parameters structure in the station_parameters
structure (and remove the station_parameters fields already existing
in link_station_parameters).
Now, for an MLD station, the default link is added together with
the station.
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an API for adding/modifying/removing a link of a station.
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Similar to ieee80211_get_sband but get the sband of the link_conf.
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
AP SMPS was removed and not needed anymore. Remove the leftovers.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Management frames are transmitted from link address and not device
address. Allow that.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Check all the MLO links to decide whether offchannel TX is needed.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When checking whether or not to accept a frame in RX,
take into account the configured link addresses. Also
look up the link station correctly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211s_update_metric function uses sta_set_rate_info_tx
function to get struct rate_info data from ieee80211_tx_rate
struct, present in ieee80211_sta->deflink.tx_stats. However,
drivers can skip tx rate calculation by setting rate idx as
-1. Such drivers provides rate_info directly and hence
ieee80211s metric is updated incorrectly since ieee80211_tx_rate
has inconsistent data.
Add fix to use rate_info directly if present instead of
sta_set_rate_info_tx for updating ieee80211s metric.
Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://lore.kernel.org/r/20220701133611.544-1-quic_adisi@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
WDS needs 4addr packets to trigger AP for wlan0.sta creation.
However, the 4addr null frame is sent at a high rate so that
sometimes the AP can't receive it. Switch to using min rate.
Signed-off-by: Lian Chen <lian.chen@mediatek.com>
Link: https://lore.kernel.org/r/20220714091636.59107-1-lian.chen@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In AP mode, multicast traffic is handled very differently from normal traffic,
especially if at least one client is in powersave mode.
This means that multicast packets can be buffered a lot longer than normal
unicast packets, and can eat up the AQL budget very quickly because of the low
data rate.
Along with the recent change to maintain a global PHY AQL limit, this can lead
to significant latency spikes for unicast traffic.
Since queueing multicast to hardware is currently not constrained by AQL limits
anyway, let's just exclude it from the AQL pending airtime calculation entirely.
Fixes: 8e4bac0671 ("wifi: mac80211: add a per-PHY AQL limit to improve fairness")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20220713083444.86129-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since the Linux RNG no longer uses sha1_transform(), the SHA-1 library
is no longer needed unconditionally. Make it possible to build the
Linux kernel without the SHA-1 library by putting it behind a kconfig
option, and selecting this new option from the kconfig options that gate
the remaining users: CRYPTO_SHA1 for crypto/sha1_generic.c, BPF for
kernel/bpf/core.c, and IPV6 for net/ipv6/addrconf.c.
Unfortunately, since BPF is selected by NET, for now this can only make
a difference for kernels built without networking support.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Return directly without intermediate value store at the end of
devlink_port_new_notify() function.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the typo in a name of devlink_port_new_notifiy() function.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The return value is not used, so change the return value type to void.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A couple of the syscalls which load values (bpf_skb_load_helper_16() and
bpf_skb_load_helper_32()) are using u16/u32 types which are triggering
warnings as they are then converted from big-endian to CPU-endian. Fix
these by making the types __be instead.
Fixes the following sparse warnings:
net/core/filter.c:246:32: warning: cast to restricted __be16
net/core/filter.c:246:32: warning: cast to restricted __be16
net/core/filter.c:246:32: warning: cast to restricted __be16
net/core/filter.c:246:32: warning: cast to restricted __be16
net/core/filter.c:273:32: warning: cast to restricted __be32
net/core/filter.c:273:32: warning: cast to restricted __be32
net/core/filter.c:273:32: warning: cast to restricted __be32
net/core/filter.c:273:32: warning: cast to restricted __be32
net/core/filter.c:273:32: warning: cast to restricted __be32
net/core/filter.c:273:32: warning: cast to restricted __be32
Signed-off-by: Ben Dooks <ben.dooks@sifive.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220714105101.297304-1-ben.dooks@sifive.com
When application runs in busy poll mode and does not receive a single
packet but only sends them, it is currently impossible to get into
napi_busy_loop() as napi_id is only marked on Rx side in xsk_rcv_check().
In there, napi_id is being taken from xdp_rxq_info carried by xdp_buff.
From Tx perspective, we do not have access to it. What we have handy is
the xsk pool.
Xsk pool works on a pool of internal xdp_buff wrappers called xdp_buff_xsk.
AF_XDP ZC enabled drivers call xp_set_rxq_info() so each of xdp_buff_xsk
has a valid pointer to xdp_rxq_info of underlying queue. Therefore, on Tx
side, napi_id can be pulled from xs->pool->heads[0].xdp.rxq->napi_id. Hide
this pointer chase under helper function, xsk_pool_get_napi_id().
Do this only for sockets working in ZC mode as otherwise rxq pointers would
not be initialized.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220707130842.49408-1-maciej.fijalkowski@intel.com
When a nexthop is added, without a gw address, the default scope was set
to 'host'. Thus, when a source address is selected, 127.0.0.1 may be chosen
but rejected when the route is used.
When using a route without a nexthop id, the scope can be configured in the
route, thus the problem doesn't exist.
To explain more deeply: when a user creates a nexthop, it cannot specify
the scope. To create it, the function nh_create_ipv4() calls fib_check_nh()
with scope set to 0. fib_check_nh() calls fib_check_nh_nongw() wich was
setting scope to 'host'. Then, nh_create_ipv4() calls
fib_info_update_nhc_saddr() with scope set to 'host'. The src addr is
chosen before the route is inserted.
When a 'standard' route (ie without a reference to a nexthop) is added,
fib_create_info() calls fib_info_update_nhc_saddr() with the scope set by
the user. iproute2 set the scope to 'link' by default.
Here is a way to reproduce the problem:
ip netns add foo
ip -n foo link set lo up
ip netns add bar
ip -n bar link set lo up
sleep 1
ip -n foo link add name eth0 type dummy
ip -n foo link set eth0 up
ip -n foo address add 192.168.0.1/24 dev eth0
ip -n foo link add name veth0 type veth peer name veth1 netns bar
ip -n foo link set veth0 up
ip -n bar link set veth1 up
ip -n bar address add 192.168.1.1/32 dev veth1
ip -n bar route add default dev veth1
ip -n foo nexthop add id 1 dev veth0
ip -n foo route add 192.168.1.1 nhid 1
Try to get/use the route:
> $ ip -n foo route get 192.168.1.1
> RTNETLINK answers: Invalid argument
> $ ip netns exec foo ping -c1 192.168.1.1
> ping: connect: Invalid argument
Try without nexthop group (iproute2 sets scope to 'link' by dflt):
ip -n foo route del 192.168.1.1
ip -n foo route add 192.168.1.1 dev veth0
Try to get/use the route:
> $ ip -n foo route get 192.168.1.1
> 192.168.1.1 dev veth0 src 192.168.0.1 uid 0
> cache
> $ ip netns exec foo ping -c1 192.168.1.1
> PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
> 64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.039 ms
>
> --- 192.168.1.1 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
CC: stable@vger.kernel.org
Fixes: 597cfe4fc3 ("nexthop: Add support for IPv4 nexthops")
Reported-by: Edwin Brossette <edwin.brossette@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20220713114853.29406-1-nicolas.dichtel@6wind.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Both helper functions bpf_lwt_seg6_action() and bpf_lwt_push_encap() use
the bpf_push_seg6_encap() to encapsulate the packet in an IPv6 with Segment
Routing Header (SRH) or insert an SRH between the IPv6 header and the
payload.
To achieve this result, such helper functions rely on bpf_push_seg6_encap()
which, in turn, leverages seg6_do_srh_{encap,inline}() to perform the
required operation (i.e. encap/inline).
This patch removes the initialization of the IPv6 header payload length
from bpf_push_seg6_encap(), as it is now handled properly by
seg6_do_srh_{encap,inline}() to prevent corruption of the skb checksum.
Fixes: fe94cc290f ("bpf: Add IPv6 Segment Routing helpers")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The SRv6 End.B6 and End.B6.Encaps behaviors rely on functions
seg6_do_srh_{encap,inline}() to, respectively: i) encapsulate the
packet within an outer IPv6 header with the specified Segment Routing
Header (SRH); ii) insert the specified SRH directly after the IPv6
header of the packet.
This patch removes the initialization of the IPv6 header payload length
from the input_action_end_b6{_encap}() functions, as it is now handled
properly by seg6_do_srh_{encap,inline}() to avoid corruption of the skb
checksum.
Fixes: 140f04c33b ("ipv6: sr: implement several seg6local actions")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Support for SRH encapsulation and insertion was introduced with
commit 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and
injection with lwtunnels"), through the seg6_do_srh_encap() and
seg6_do_srh_inline() functions, respectively.
The former encapsulates the packet in an outer IPv6 header along with
the SRH, while the latter inserts the SRH between the IPv6 header and
the payload. Then, the headers are initialized/updated according to the
operating mode (i.e., encap/inline).
Finally, the skb checksum is calculated to reflect the changes applied
to the headers.
The IPv6 payload length ('payload_len') is not initialized
within seg6_do_srh_{inline,encap}() but is deferred in seg6_do_srh(), i.e.
the caller of seg6_do_srh_{inline,encap}().
However, this operation invalidates the skb checksum, since the
'payload_len' is updated only after the checksum is evaluated.
To solve this issue, the initialization of the IPv6 payload length is
moved from seg6_do_srh() directly into the seg6_do_srh_{inline,encap}()
functions and before the skb checksum update takes place.
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/all/20220705190727.69d532417be7438b15404ee1@uniroma2.it
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Return value of unregister_tcf_proto_ops is unused, remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ath10k
* ethernet frame format support
rtw89
* TDLS support
cfg80211/mac80211
* airtime fairness fixes
* EHT support continued, especially in AP mode
* initial (and still major) rework for multi-link
operation (MLO) from 802.11be/wifi 7
As usual, also many small updates/cleanups/fixes/etc.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmLOca8ACgkQB8qZga/f
l8S2sQ//VyUyfPxKTnos4xLm9cZFYbP4/JAl+e1QwbYpa8TtQFMjyiDq+/mTiowA
gS5qdiAllS75MyxH5LuVJ1fSWe7DmSQ1A733gO4cQUxPUtaUrtXWZpsinYT+Vk4J
a20kOic/9KCD6j1JFLEFToaDBHxO6Rbqo1knnTuOpMXIV6H/ou0PNlj6Ys66oFLV
V5SvsoeIfCXsN3j/8JyGgjIC52LiNLam3VfdalParurY8yAxda0ub9IKvYqL/s3M
PZyuHUc0kJsL/2094sjmn6SKZobjTzrOQcLgq4nPXgspp+8YQ+CUf97QS8nH5rBV
AOlv7+WOiC9Ext/rBzxwZvjCmJUZSVn44mDMjafzIfTYDn0sB9m4CpqfQpgK5zvC
mf+jhvI99VuK3S4Zx/xRhNFZMAZZG65zkJKEACclBL2Bcs9A+z12CPIWvalEb3/k
Hk38VlUIMWPQlbcJW7oVTNH8HNpKIuOCecxKWZC+8MDDb/ZhIYhFqFNMb5TnbOBI
GMXIDBlfYZgvBKHgwcj9G24QGgm1P+yKGyDcnVH0KPismZwt0gm9R+VX2B4HyBnD
neT/7wx8yxsm7ujJIF28CM+BnF9vxZKVPGUS6XhS2aarOKanAalybsm9DKLwlArZ
Qlr2rwaTM+ZkHS82Yapv6At97IYvfiq+ju3b940aL3YrOmgHoqs=
=smwk
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2022-07-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
A fairly large set of updates for next, highlights:
ath10k
* ethernet frame format support
rtw89
* TDLS support
cfg80211/mac80211
* airtime fairness fixes
* EHT support continued, especially in AP mode
* initial (and still major) rework for multi-link
operation (MLO) from 802.11be/wifi 7
As usual, also many small updates/cleanups/fixes/etc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* queue selection in mesh/ocb
* queue handling on interface stop
* hwsim virtio device vs. some other virtio changes
* dt-bindings email addresses
* color collision memory allocation
* a const variable in rtw88
* shared SKB transmit in the ethernet format path
* P2P client port authorization
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmLOcFIACgkQB8qZga/f
l8TK6g//dM2kjGZhyDJUnUicUplN6m4sHLeVqqWCJiUaZepg0Zb3zwEhfEjXnYgn
nWfFCqyRYN2JgESKFG2LNliAUW954ccu5mAHNoR41SXjwPxPLZblYqdirdtMsbv3
VM6Ar7WKVWqIer103lUOmiH+tSMObuUhfESbFVByutJfRAcWOolEIJdoAQEmqoKt
BgU0frkZLGpX9PTzJaT5KmgOnXstrWqdTY1JzLPR93k+fN0kwsOcBtwipqYTombI
gcnIMb5eY16EHQES9Rf02PIGDe9Oka2+xr9gfOAwFE5JWgh6j6TwHnXBi6UM5mby
/i6owhSS9km1rwTzsqJnpC89zZ1E26e5W7i6tDdQ+70OorSgPjMOGiyPNP+1KX0x
P9CfFGV6c2CICCfylva7lQXoBkAUn9uQsimGBOzYY3eWt5gYZKrwNistLKlrZQca
qRMRCXApfPvcyPvkX4DEuiJDgi+74nUqm0okIHLVHN4QfAuoq22DzTlTlFiF6OCJ
Fj5URCCfwyuwNtaF0W6IH8PnhkD8VQjYHH0RqclQAUaS5yJxj4x///GTGPwYDCxe
JcbASQfDOK1QmN4C3vOweym9J5jUdJR4fbvuj2iJhL0qQLrQZrKHoPfu8J5G4EyC
rtHAVmz8eI+IQtYsppRpQbRpNtmcj773FXhQ2wNqkZ6Y7i/GtFE=
=GrDi
-----END PGP SIGNATURE-----
Merge tag 'wireless-2022-07-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
A small set of fixes for
* queue selection in mesh/ocb
* queue handling on interface stop
* hwsim virtio device vs. some other virtio changes
* dt-bindings email addresses
* color collision memory allocation
* a const variable in rtw88
* shared SKB transmit in the ethernet format path
* P2P client port authorization
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 multicast routing code previously implemented only the dump
variant of RTM_GETROUTE. Implement single MFC item retrieval by copying
and adapting the respective IPv4 code.
Tested against FRRouting's IPv6 PIM stack.
Signed-off-by: David Lamparter <equinox@diac24.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As far as the lock helpers exist as the drivers need to work with the
devlink->lock mutex, use the helpers internally in devlink.c in order to
be consistent.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be unified with the rest of the code, the unlocked version (devl_*)
of function should have the same description in documentation as the
locked one. Add the missing documentation. Also, add "Context"
annotation for the locked versions where it is missing.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading nexthop_compat_mode, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 4f80116d3d ("net: ipv4: add sysctl for nexthop api compatibility mode")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_ip_dynaddr, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_ecn_fallback, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 492135557d ("tcp: add rfc3168, section 6.1.1.1. fallback")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_ecn, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_ratemask, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_ratelimit, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_errors_use_inbound_ifaddr, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 1c2fb7f93c ("[IPV4]: Sysctl configurable icmp error source address.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_ignore_bogus_error_responses, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_echo_ignore_broadcasts, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_echo_enable_probe, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: d329ea5bd8 ("icmp: add response to RFC 8335 PROBE messages")
Fixes: 1fd07f33c3 ("ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_icmp_echo_ignore_all, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_max_tw_buckets, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So it can be used for port range filter offloading.
Co-developed-by: Volodymyr Mytnyk <volodymyr.mytnyk@plvision.eu>
Signed-off-by: Volodymyr Mytnyk <volodymyr.mytnyk@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code allows to inherit the TTL (hop_limit) from the
payload when skb->protocol is ETH_P_IP or ETH_P_IPV6.
However when the payload is VLAN encapsulated (e.g because the tunnel
is of type GRETAP), then this inheriting does not work, because the
visible skb->protocol is of type ETH_P_8021Q or ETH_P_8021AD.
Instead of skb->protocol, use skb_protocol().
Signed-off-by: Matthias May <matthias.may@westermo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the payload is a VLAN encapsulated IPv6/IPv6 frame, we can
skip the 802.1q/802.1ad ethertypes and jump to the actual protocol.
This way we treat IPv4/IPv6 frames as IP instead of as "other".
Signed-off-by: Matthias May <matthias.may@westermo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code always forces a dscp of 0 for all non-IP frames.
However when setting a specific TOS with the command
ip link add name tep0 type ip6gretap local fdd1:ced0:5d88:3fce::1
remote fdd1:ced0:5d88:3fce::2 tos 0xa0
one would expect all GRE encapsulated frames to have a TOS of 0xA0.
and not only when the payload is IPv4/IPv6.
Signed-off-by: Matthias May <matthias.may@westermo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code allows to inherit the TOS, TTL, DF from the payload
when skb->protocol is ETH_P_IP or ETH_P_IPV6.
However when the payload is VLAN encapsulated (e.g because the tunnel
is of type GRETAP), then this inheriting does not work, because the
visible skb->protocol is of type ETH_P_8021Q or ETH_P_8021AD.
Instead of skb->protocol, use skb_protocol().
Signed-off-by: Matthias May <matthias.may@westermo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the id accounting for the ID 0 subflow is not correct:
at creation time we mark (correctly) as unavailable the endpoint
id corresponding the MPC subflow source address, while at subflow
removal time set as available the id 0.
With this change we track explicitly the endpoint id corresponding
to the MPC subflow so that we can mark it as available at removal time.
Additionally this allow deleting the initial subflow via the NL PM
specifying the corresponding endpoint id.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Any local endpoints configured on the address matching the
MPC subflow are currently ignored.
Specifically, setting a backup flag on them has no effect
on the first subflow, as the MPC handshake can't carry such
info.
This change refactors the MPC endpoint id accounting to
additionally fetch the priority info from the relevant endpoint
and eventually trigger the MP_PRIO handshake as needed.
As a result, the MPC subflow now switches to backup priority
after that the MPTCP socket is fully established, according
to the local endpoint configuration.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When looking-up a socket address in the endpoint list, we
must prefer port-based matches over address only match.
Ensure that port-based endpoints are listed first, using
head insertion for them. Additionally be sure that only
port-based endpoints carry a non zero port number.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The in-kernel PM has a bit of duplicate code related to ack
generation. Create a new helper factoring out the PM-specific
needs and use it in a couple of places.
As a bonus, mptcp_subflow_send_ack() is not used anymore
outside its own compilation unit and can become static.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building with Clang we encounter these warnings:
| net/ipv4/ah4.c:513:4: error: format specifies type 'unsigned short' but
| the argument has type 'int' [-Werror,-Wformat]
| aalg_desc->uinfo.auth.icv_fullbits / 8);
-
| net/ipv4/esp4.c:1114:5: error: format specifies type 'unsigned short'
| but the argument has type 'int' [-Werror,-Wformat]
| aalg_desc->uinfo.auth.icv_fullbits / 8);
`aalg_desc->uinfo.auth.icv_fullbits` is a u16 but due to default
argument promotion becomes an int.
Variadic functions (printf-like) undergo default argument promotion.
Documentation/core-api/printk-formats.rst specifically recommends using
the promoted-to-type's format flag.
As per C11 6.3.1.1:
(https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf) `If an int
can represent all values of the original type ..., the value is
converted to an int; otherwise, it is converted to an unsigned int.
These are called the integer promotions.` Thus it makes sense to change
%hu to %d not only to follow this standard but to suppress the warning
as well.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Justin Stitt <justinstitt@google.com>
Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Let the core take the devlink instance lock around port_new and port_del
callbacks and remove the now redundant locking in the only driver that
currently use them.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The previous patch removed the last usage of the functions
devlink_rate_leaf_create() and devlink_rate_nodes_destroy(). Thus,
remove these function from devlink API.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The previous patch removed the last usage of the function
devlink_rate_nodes_destroy(). Thus, remove this function from devlink
API.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As discussed with Maxim add a counter for true NoPad violations.
This should help deployments catch unexpected padded records vs
just control records which always need re-encryption.
https: //lore.kernel.org/all/b111828e6ac34baad9f4e783127eba8344ac252d.camel@nvidia.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In sk_psock_skb_ingress_enqueue function, if the linear area + nr_frags +
frag_list of the SKB has NR_MSG_FRAG_IDS blocks in total, skb_to_sgvec
will return NR_MSG_FRAG_IDS, then msg->sg.end will be set to
NR_MSG_FRAG_IDS, and in addition, (NR_MSG_FRAG_IDS - 1) is set to the last
SG of msg. Recv the msg in sk_msg_recvmsg, when i is (NR_MSG_FRAG_IDS - 1),
the sk_msg_iter_var_next(i) will change i to 0 (not NR_MSG_FRAG_IDS), the
judgment condition "msg_rx->sg.start==msg_rx->sg.end" and
"i != msg_rx->sg.end" can not work.
As a result, the processed msg cannot be deleted from ingress_msg list.
But the length of all the sge of the msg has changed to 0. Then the next
recvmsg syscall will process the msg repeatedly, because the length of sge
is 0, the -EFAULT error is always returned.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220628123616.186950-1-liujian56@huawei.com
... and cast result to u32 so sparse won't complain anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sparse tool complains about mixing of different endianess
types, so use the correct ones.
Add type casts where needed.
objdiff shows no changes except in nft_tunnel (type is changed).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Same as the existing ones, no conversions. This is just for sparse sake
only so that we no longer mix be16/u16 and be32/u32 types.
Alternative is to add __force __beX in various places, but this
seems nicer.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Switch to be16/32 and u16/32 respectively. No code changes here,
the functions do the same thing, this is just for sparse checkers' sake.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sparse complains because __be32 and u32 are mixed without
conversions. Use the correct types, no code changes.
Furthermore, xt_DSCP generates a bit truncation warning:
"cast truncates bits from constant value (ffffff03 becomes 3)"
The truncation is fine (and wanted). Add a private definition and use that
instead.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sparse flags this as suspicious, because this compares
integer with a be16 with no conversion.
Its a compat check for old userspace that sends host byte order,
so force a be16 cast here.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
sparse complains about incorrect rcu usage.
Code uses the correct rcu access primitives, but the function pointers
lack rcu annotations.
Collapse all of them into a single structure, then annotate the pointer.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sparse complains about direct access to the 'helper' and timeout members.
Both have __rcu annotation, so use the accessors.
xt_CT is fine, accesses occur before the structure is visible to other
cpus. Switch to rcu accessors there as well to reduce noise.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Access to the hook pointers use correct helpers but the pointers lack
the needed __rcu annotation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
To improve hardware offload debuggability count pending 'add', 'del' and
'stats' flow_table offload workqueue tasks. Counters are incremented before
scheduling new task and decremented when workqueue handler finishes
executing. These counters allow user to diagnose congestion on hardware
offload workqueues that can happen when either CPU is starved and workqueue
jobs are executed at lower rate than new ones are added or when
hardware/driver can't keep up with the rate.
Implement the described counters as percpu counters inside new struct
netns_ft which is stored inside struct net. Expose them via new procfs file
'/proc/net/stats/nf_flowtable' that is similar to existing 'nf_conntrack'
file.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Following patches in series use the pointer to access flow table offload
debug variables.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When compiling with -Wformat, clang emits the following warnings:
net/netfilter/nf_conntrack_helper.c:168:18: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security]
request_module(mod_name);
^~~~~~~~
Use a string literal for the format string.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Bill Wendling <isanbard@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These cases all use the same function. we can simplify the code through
fallthrough.
$ size net/netfilter/nf_conntrack_core.o
text data bss dec hex filename
before 81601 81430 768 163799 27fd7 net/netfilter/nf_conntrack_core.o
after 80361 81430 768 162559 27aff net/netfilter/nf_conntrack_core.o
Arch: aarch64
Gcc : gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we set XFRM security policy by calling setsockopt with option
IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock'
struct. However tcp_v6_send_response doesn't look up dst_entry with the
actual socket but looks up with tcp control socket. This may cause a
problem that a RST packet is sent without ESP encryption & peer's TCP
socket can't receive it.
This patch will make the function look up dest_entry with actual socket,
if the socket has XFRM policy(sock_policy), so that the TCP response
packet via this function can be encrypted, & aligned on the encrypted
TCP socket.
Tested: We encountered this problem when a TCP socket which is encrypted
in ESP transport mode encryption, receives challenge ACK at SYN_SENT
state. After receiving challenge ACK, TCP needs to send RST to
establish the socket at next SYN try. But the RST was not encrypted &
peer TCP socket still remains on ESTABLISHED state.
So we verified this with test step as below.
[Test step]
1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED).
2. Client tries a new connection on the same TCP ports(src & dst).
3. Server will return challenge ACK instead of SYN,ACK.
4. Client will send RST to server to clear the SOCKET.
5. Client will retransmit SYN to server on the same TCP ports.
[Expected result]
The TCP connection should be established.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Sehee Lee <seheele@google.com>
Signed-off-by: Sewook Seo <sewookseo@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) refcount_inc_not_zero() is not semantically equivalent to
atomic_int_not_zero(), from Florian Westphal. My understanding was
that refcount_*() API provides a wrapper to easier debugging of
reference count leaks, however, there are semantic differences
between these two APIs, where refcount_inc_not_zero() needs a barrier.
Reason for this subtle difference to me is unknown.
2) packet logging is not correct for ARP and IP packets, from the
ARP family and netdev/egress respectively. Use skb_network_offset()
to reach the headers accordingly.
3) set element extension length have been growing over time, replace
a BUG_ON by EINVAL which might be triggerable from userspace.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
At disconnect time the MPTCP protocol traverse the subflows
list closing each of them. In some circumstances - MPJ subflow,
passive MPTCP socket, the latter operation can remove the
subflow from the list, invalidating the current iterator.
Address the issue using the safe list traversing helper
variant.
Reported-by: van fantasy <g1042620637@gmail.com>
Fixes: b29fcfb54c ("mptcp: full disconnect implementation")
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using iTXQ, the code assumes that there is only one vif queue for
broadcast packets, using the BE queue. Allowing non-BE queue marking
violates that assumption and txq->ac == skb_queue_mapping is no longer
guaranteed. This can cause issues with queue handling in the driver and
also causes issues with the recent ATF change, resulting in an AQL
underflow warning.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20220702145227.39356-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When mac80211 downgrades working bandwidth, the
center_freq and center_freq1 need to be recalculated.
There is a typo in the case of downgrading bandwidth from
320MHz to 160MHz which would cause a wrong frequency value.
Reviewed-by: Money Wang <Money.Wang@mediatek.com>
Signed-off-by: MeiChia Chiu <MeiChia.Chiu@mediatek.com>
Link: https://lore.kernel.org/r/20220708095823.12959-1-MeiChia.Chiu@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
commit dd374f84ba ("wifi: nl80211: expose link ID for associated
BSSes") used a top-level attribute to send link ID of the associated
BSS in the nested attribute NL80211_ATTR_BSS. But since NL80211_ATTR_BSS
is a nested attribute of the attributes defined in enum nl80211_bss,
define a new attribute in enum nl80211_bss and use it for sending the
link ID of the BSS.
Fixes: dd374f84ba ("wifi: nl80211: expose link ID for associated BSSes")
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Reviewed-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20220708122607.1836958-1-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A comment in cfg80211_mlme_mgmt_tx() is describing this API used only
for transmitting action frames. Fix the comment since
cfg80211_mlme_mgmt_tx() can be used to transmit any management frame.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20220708165545.2072999-1-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
nl80211_pre_doit() using nla_get_u16() to read u8 attribute
NL80211_ATTR_MLO_LINK_ID. Fix this by using nla_get_u8() to
read NL80211_ATTR_MLO_LINK_ID attribute.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/1657517683-5724-1-git-send-email-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-07-09
We've added 94 non-merge commits during the last 19 day(s) which contain
a total of 125 files changed, 5141 insertions(+), 6701 deletions(-).
The main changes are:
1) Add new way for performing BTF type queries to BPF, from Daniel Müller.
2) Add inlining of calls to bpf_loop() helper when its function callback is
statically known, from Eduard Zingerman.
3) Implement BPF TCP CC framework usability improvements, from Jörn-Thorben Hinz.
4) Add LSM flavor for attaching per-cgroup BPF programs to existing LSM
hooks, from Stanislav Fomichev.
5) Remove all deprecated libbpf APIs in prep for 1.0 release, from Andrii Nakryiko.
6) Add benchmarks around local_storage to BPF selftests, from Dave Marchevsky.
7) AF_XDP sample removal (given move to libxdp) and various improvements around AF_XDP
selftests, from Magnus Karlsson & Maciej Fijalkowski.
8) Add bpftool improvements for memcg probing and bash completion, from Quentin Monnet.
9) Add arm64 JIT support for BPF-2-BPF coupled with tail calls, from Jakub Sitnicki.
10) Sockmap optimizations around throughput of UDP transmissions which have been
improved by 61%, from Cong Wang.
11) Rework perf's BPF prologue code to remove deprecated functions, from Jiri Olsa.
12) Fix sockmap teardown path to avoid sleepable sk_psock_stop, from John Fastabend.
13) Fix libbpf's cleanup around legacy kprobe/uprobe on error case, from Chuang Wang.
14) Fix libbpf's bpf_helpers.h to work with gcc for the case of its sec/pragma
macro, from James Hilliard.
15) Fix libbpf's pt_regs macros for riscv to use a0 for RC register, from Yixun Lan.
16) Fix bpftool to show the name of type BPF_OBJ_LINK, from Yafang Shao.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (94 commits)
selftests/bpf: Fix xdp_synproxy build failure if CONFIG_NF_CONNTRACK=m/n
bpf: Correctly propagate errors up from bpf_core_composites_match
libbpf: Disable SEC pragma macro on GCC
bpf: Check attach_func_proto more carefully in check_return_code
selftests/bpf: Add test involving restrict type qualifier
bpftool: Add support for KIND_RESTRICT to gen min_core_btf command
MAINTAINERS: Add entry for AF_XDP selftests files
selftests, xsk: Rename AF_XDP testing app
bpf, docs: Remove deprecated xsk libbpf APIs description
selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usage
libbpf, riscv: Use a0 for RC register
libbpf: Remove unnecessary usdt_rel_ip assignments
selftests/bpf: Fix few more compiler warnings
selftests/bpf: Fix bogus uninitialized variable warning
bpftool: Remove zlib feature test from Makefile
libbpf: Cleanup the legacy uprobe_event on failed add/attach_event()
libbpf: Fix wrong variable used in perf_event_uprobe_open_legacy()
libbpf: Cleanup the legacy kprobe_event on failed add/attach_event()
selftests/bpf: Add type match test against kernel's task_struct
selftests/bpf: Add nested type to type based tests
...
====================
Link: https://lore.kernel.org/r/20220708233145.32365-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BUG_ON can be triggered from userspace with an element with a large
userdata area. Replace it by length check and return EINVAL instead.
Over time extensions have been growing in size.
Pick a sufficiently old Fixes: tag to propagate this fix.
Fixes: 7d7402642e ("netfilter: nf_tables: variable sized set element keys / data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We want to kfree(table) if @table has been kmalloced,
ie for non initial network namespace.
Fixes: 849d5aa3a1 ("af_unix: Do not call kmemdup() for init_net's sysctl table.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move macro MPTCPOPT_HMAC_LEN definition from net/mptcp/protocol.h to
include/net/mptcp.h.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFPROTO_ARP is expecting to find the ARP header at the network offset.
In the particular case of ARP, HTYPE= field shows the initial bytes of
the ethernet header destination MAC address.
netdev out: IN= OUT=bridge0 MACSRC=c2:76:e5:71:e1:de MACDST=36:b0:4a:e2:72:ea MACPROTO=0806 ARP HTYPE=14000 PTYPE=0x4ae2 OPCODE=49782
NFPROTO_NETDEV egress hook is also expecting to find the IP headers at
the network offset.
Fixes: 35b9395104 ("netfilter: add generic ARP packet logger")
Reported-by: Tom Yan <tom.ty89@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When building with Clang we encounter this warning:
| net/rxrpc/rxkad.c:434:33: error: format specifies type 'unsigned short'
| but the argument has type 'u32' (aka 'unsigned int') [-Werror,-Wformat]
| _leave(" = %d [set %hx]", ret, y);
y is a u32 but the format specifier is `%hx`. Going from unsigned int to
short int results in a loss of data. This is surely not intended
behavior. If it is intended, the warning should be suppressed through
other means.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220707182052.769989-1-justinstitt@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_wait_data() sets the return code as an output parameter
and always returns ctx->recv_pkt on success.
Return the error code directly and let the caller read the skb
from the context. Use positive return code to indicate ctx->recv_pkt
is ready.
While touching the definition of the function rename it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
include/net/tls.h is getting a little long, and is probably hard
for driver authors to navigate. Split out the internals into a
header which will live under net/tls/. While at it move some
static inlines with a single user into the source files, add
a few tls_ prefixes and fix spelling of 'proccess'.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The max size of iv + aad + tail is 22B. That's smaller
than a single sg entry (32B). Don't bother with the
memory packing, just create a struct which holds the
max size of those members.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
AAD size is either 5 or 13. Really no point complicating
the code for the 8B of difference. This will also let us
turn the chunked up buffer into a sane struct.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_skb_cb lives within skb->cb[]. skb->cb[] straddles
2 cache lines, each containing 24B of data.
The first cache line does not contain much interesting
information for users of strparser, so pad things a little.
Previously strp_msg->full_len would live in the first cache
line and strp_msg->offset in the second.
We need to reorder the 8 byte temp_reg with struct tls_msg
to prevent a 4B hole which would push the struct over 48B.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf 2022-07-08
We've added 3 non-merge commits during the last 2 day(s) which contain
a total of 7 files changed, 40 insertions(+), 24 deletions(-).
The main changes are:
1) Fix cBPF splat triggered by skb not having a mac header, from Eric Dumazet.
2) Fix spurious packet loss in generic XDP when pushing packets out (note
that native XDP is not affected by the issue), from Johan Almbladh.
3) Fix bpf_dynptr_{read,write}() helper signatures with flag argument before
its set in stone as UAPI, from Joanne Koong.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs
bpf: Make sure mac_header was set before using it
xdp: Fix spurious packet loss in generic XDP TX path
====================
Link: https://lore.kernel.org/r/20220708213418.19626-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP allocates 'fast clones' skbs for packets in tx queues.
Currently, __alloc_skb() initializes the companion fclone
field to SKB_FCLONE_CLONE, and leaves other fields untouched.
It makes sense to defer this init much later in skb_clone(),
because all fclone fields are copied and hot in cpu caches
at that time.
This removes one cache line miss in __alloc_skb(), cost seen
on an host with 256 cpus all competing on memory accesses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When building with Clang we encounter the following warnings:
| net/l2tp/l2tp_debugfs.c:187:40: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] seq_printf(m, " nr %hu, ns %hu\n", session->nr,
| session->ns);
-
| net/l2tp/l2tp_debugfs.c:196:32: error: format specifies type 'unsigned
| short' but the argument has type 'int' [-Werror,-Wformat]
| session->l2specific_type, l2tp_get_l2specific_len(session));
-
| net/l2tp/l2tp_debugfs.c:219:6: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] session->nr, session->ns,
Both session->nr and ->nc are of type `u32`. The currently used format
specifier is `%hu` which describes a `u16`. My proposed fix is to listen
to Clang and use the correct format specifier `%u`.
For the warning at line 196, l2tp_get_l2specific_len() returns an int
and should therefore be using the `%d` format specifier.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Justin Stitt <justinstitt@google.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_fib_sync_mem, it can be changed concurrently.
So, we need to add READ_ONCE() to avoid a data-race.
Fixes: 9ab948a91b ("ipv4: Allow amount of dirty memory from fib resizing to be controllable")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading icmp sysctl variables, they can be changed concurrently.
So, we need to add READ_ONCE() to avoid data-races.
Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading cipso sysctl variables, they can be changed concurrently.
So, we need to add READ_ONCE() to avoid data-races.
Fixes: 446fda4f26 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading inetpeer sysctl variables, they can be changed
concurrently. So, we need to add READ_ONCE() to avoid data-races.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_max_orphans, it can be changed concurrently.
So, we need to add READ_ONCE() to avoid a data-race.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When building with clang we encounter this warning:
| net/l2tp/l2tp_ppp.c:1557:6: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] session->nr, session->ns,
Both session->nr and session->ns are of type u32. The format specifier
previously used is `%hu` which would truncate our unsigned integer from
32 to 16 bits. This doesn't seem like intended behavior, if it is then
perhaps we need to consider suppressing the warning with pragma clauses.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Justin Stitt <justinstitt@google.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20220706230833.535238-1-justinstitt@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently NIC packet receiving performance based on page pool deteriorates
occasionally. To analysis the causes of this problem page allocation stats
are collected. Here are the stats when NIC rx performance deteriorates:
bandwidth(Gbits/s) 16.8 6.91
rx_pp_alloc_fast 13794308 21141869
rx_pp_alloc_slow 108625 166481
rx_pp_alloc_slow_h 0 0
rx_pp_alloc_empty 8192 8192
rx_pp_alloc_refill 0 0
rx_pp_alloc_waive 100433 158289
rx_pp_recycle_cached 0 0
rx_pp_recycle_cache_full 0 0
rx_pp_recycle_ring 362400 420281
rx_pp_recycle_ring_full 6064893 9709724
rx_pp_recycle_released_ref 0 0
The rx_pp_alloc_waive count indicates that a large number of pages' numa
node are inconsistent with the NIC device numa node. Therefore these pages
can't be reused by the page pool. As a result, many new pages would be
allocated by __page_pool_alloc_pages_slow which is time consuming. This
causes the NIC rx performance fluctuations.
The main reason of huge numa mismatch pages in page pool is that page pool
uses alloc_pages_bulk_array to allocate original pages. This function is
not suitable for page allocation in NUMA scenario. So this patch uses
alloc_pages_bulk_array_node which has a NUMA id input parameter to ensure
the NUMA consistent between NIC device and allocated pages.
Repeated NIC rx performance tests are performed 40 times. NIC rx bandwidth
is higher and more stable compared to the datas above. Here are three test
stats, the rx_pp_alloc_waive count is zero and rx_pp_alloc_slow which
indicates pages allocated from slow patch is relatively low.
bandwidth(Gbits/s) 93 93.9 93.8
rx_pp_alloc_fast 60066264 61266386 60938254
rx_pp_alloc_slow 16512 16517 16539
rx_pp_alloc_slow_ho 0 0 0
rx_pp_alloc_empty 16512 16517 16539
rx_pp_alloc_refill 473841 481910 481585
rx_pp_alloc_waive 0 0 0
rx_pp_recycle_cached 0 0 0
rx_pp_recycle_cache_full 0 0 0
rx_pp_recycle_ring 29754145 30358243 30194023
rx_pp_recycle_ring_full 0 0 0
rx_pp_recycle_released_ref 0 0 0
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20220705113515.54342-1-huangguangbin2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kajetan Puchalski reports crash on ARM, with backtrace of:
__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc
Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:
CPU1 CPU2
encounter 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
refcount_inc_not_zero(ct); /* might fail */
/* If set, ct is public/in the hash table */
test_bit(IPS_CONFIRMED_BIT, &ct->status);
In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.
The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:
1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.
2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.
If not, the object must be skipped; CPU2 calls nf_ct_put() to
un-do the refcount increment and moves to the next object.
Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.
Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:
1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
IPS_DYING_BIT is 0 (already cleared).
2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.
Add needed barriers when refcount_inc_not_zero() is successful.
It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.
Because other CPU might still see the object, refcount_set(1)
"resurrects" it, so we need to make sure that other CPUs will also observe
the right content. In particular, the CONFIRMED bit test must only pass
once the object is fully initialised and either in the hash or about to be
inserted (with locks held to delay possible unlink from early_drop or
gc worker).
I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.
v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
add comment in nf_conntrack_netlink, no control dependency there
due to locks.
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 7197743776 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Will Deacon <will@kernel.org>
Commit 6dd4142fb5 ("Merge branch 'af_unix-per-netns-socket-hash'") and
commit 51bae889fe ("af_unix: Put pathname sockets in the global hash
table.") changed a hash table layout.
Before:
unix_socket_table [0 - 255] : abstract & pathname sockets
[256 - 511] : unnamed sockets
After:
per-netns table [0 - 255] : abstract & pathname sockets
[256 - 511] : unnamed sockets
bsd_socket_table [0 - 255] : pathname sockets (sk_bind_node)
Now, while looking up sockets, we traverse the global table for the
pathname sockets and the first half of each per-netns hash table for
abstract sockets, where pathname sockets are also linked. Thus, the
more pathname sockets we have, the longer we take to look up abstract
sockets. This characteristic has been there before the layout change,
but we can improve it now.
This patch changes the per-netns hash table's layout so that sockets not
requiring lookup reside in the first half and do not impact the lookup of
abstract sockets.
per-netns table [0 - 255] : pathname & unnamed sockets
[256 - 511] : abstract sockets
bsd_socket_table [0 - 255] : pathname sockets (sk_bind_node)
We have run a test that bind()s 100,000 abstract/pathname sockets for
each, bind()s an abstract socket 100,000 times and measures the time
on __unix_find_socket_byname(). The result shows that the patch makes
each lookup faster.
Without this patch:
$ sudo ./funclatency -p 2278 --microseconds __unix_find_socket_byname.isra.44
usec : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 126 | |
16 -> 31 : 1438 |* |
32 -> 63 : 4150 |*** |
64 -> 127 : 9049 |******* |
128 -> 255 : 37704 |******************************* |
256 -> 511 : 47533 |****************************************|
With this patch:
$ sudo ./funclatency -p 3648 --microseconds __unix_find_socket_byname.isra.46
usec : count distribution
0 -> 1 : 109 | |
2 -> 3 : 318 | |
4 -> 7 : 725 | |
8 -> 15 : 2501 |* |
16 -> 31 : 3061 |** |
32 -> 63 : 4028 |*** |
64 -> 127 : 9312 |******* |
128 -> 255 : 51372 |****************************************|
256 -> 511 : 28574 |********************** |
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220705233715.759-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are UAF bugs caused by rose_t0timer_expiry(). The
root cause is that del_timer() could not stop the timer
handler that is running and there is no synchronization.
One of the race conditions is shown below:
(thread 1) | (thread 2)
| rose_device_event
| rose_rt_device_down
| rose_remove_neigh
rose_t0timer_expiry | rose_stop_t0timer(rose_neigh)
... | del_timer(&neigh->t0timer)
| kfree(rose_neigh) //[1]FREE
neigh->dce_mode //[2]USE |
The rose_neigh is deallocated in position [1] and use in
position [2].
The crash trace triggered by POC is like below:
BUG: KASAN: use-after-free in expire_timers+0x144/0x320
Write of size 8 at addr ffff888009b19658 by task swapper/0/0
...
Call Trace:
<IRQ>
dump_stack_lvl+0xbf/0xee
print_address_description+0x7b/0x440
print_report+0x101/0x230
? expire_timers+0x144/0x320
kasan_report+0xed/0x120
? expire_timers+0x144/0x320
expire_timers+0x144/0x320
__run_timers+0x3ff/0x4d0
run_timer_softirq+0x41/0x80
__do_softirq+0x233/0x544
...
This patch changes rose_stop_ftimer() and rose_stop_t0timer()
in rose_remove_neigh() to del_timer_sync() in order that the
timer handler could be finished before the resources such as
rose_neigh and so on are deallocated. As a result, the UAF
bugs could be mitigated.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Link: https://lore.kernel.org/r/20220705125610.77971-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The byte queue limits (BQL) mechanism is intended to move queuing from
the driver to the network stack in order to reduce latency caused by
excessive queuing in hardware. However, when transmitting or redirecting
a packet using generic XDP, the qdisc layer is bypassed and there are no
additional queues. Since netif_xmit_stopped() also takes BQL limits into
account, but without having any alternative queuing, packets are
silently dropped.
This patch modifies the drop condition to only consider cases when the
driver itself cannot accept any more packets. This is analogous to the
condition in __dev_direct_xmit(). Dropped packets are also counted on
the device.
Bypassing the qdisc layer in the generic XDP TX path means that XDP
packets are able to starve other packets going through a qdisc, and
DDOS attacks will be more effective. In-driver-XDP use dedicated TX
queues, so they do not have this starvation issue.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com
This reverts commit 284b4d93da.
When using TLS device offload and coming from tls_device_reencrypt()
flow, -EBADMSG error in tls_do_decryption() should not be counted
towards the TLSTlsDecryptError counter.
Move the counter increase back to the decrypt_internal() call site in
decrypt_skb_update().
This also fixes an issue where:
if (n_sgin < 1)
return -EBADMSG;
Errors in decrypt_internal() were not counted after the cited patch.
Fixes: 284b4d93da ("tls: rx: move counting TlsDecryptErrors for sync")
Cc: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We continuously hold the socket lock during large reads and writes.
This may inflate RTT and negatively impact TCP performance.
Flush the backlog periodically. I tried to pick a flush period (128kB)
which gives significant benefit but the max Bps rate is not yet visibly
impacted.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since optimisitic decrypt may add extra load in case of retries
require socket owner to explicitly opt-in.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently don't support decrypt to user buffer with TLS 1.3
because we don't know the record type and how much padding
record contains before decryption. In practice data records
are by far most common and padding gets used rarely so
we can assume data record, no padding, and if we find out
that wasn't the case - retry the crypto in place (decrypt
to skb).
To safeguard from user overwriting content type and padding
before we can check it attach a 1B sg entry where last byte
of the record will land.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make future patches easier to review make data_len
contain the length of the data, without the tail.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch increases MPTCP_MIB_RMSUBFLOW mib counter in userspace pm
destroy subflow function mptcp_nl_cmd_sf_destroy() when removing subflow.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In mptcp_pm_nl_rm_addr_or_subflow() we always mark as available
the id corresponding to the just removed address.
The used bitmap actually tracks only the local IDs: we must
restrict the operation when a (local) subflow is removed.
Fixes: a88c9e4969 ("mptcp: do not block subflows creation on errors")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change updates MPTCP_PM_CMD_SET_FLAGS to allow userspace PMs
to issue MP_PRIO signals over a specific subflow selected by
the connection token, local and remote address+port.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/286
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kishen Maloor <kishen.maloor@intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting up a subflow's flags for sending MP_PRIO MPTCP options, the
subflow socket lock was not held while reading and modifying several
struct members that are also read and modified in mptcp_write_options().
Acquire the subflow socket lock earlier and send the MP_PRIO ACK with
that lock already acquired. Add a new variant of the
mptcp_subflow_send_ack() helper to use with the subflow lock held.
Fixes: 067065422f ("mptcp: add the outgoing MP_PRIO support")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The in-kernel path manager code for changing subflow flags acquired both
the msk socket lock and the PM lock when possibly changing the "backup"
and "fullmesh" flags. mptcp_pm_nl_mp_prio_send_ack() does not access
anything protected by the PM lock, and it must release and reacquire
the PM lock.
By pushing the PM lock to where it is needed in mptcp_pm_nl_fullmesh(),
the lock is only acquired when the fullmesh flag is changed and the
backup flag code no longer has to release and reacquire the PM lock. The
change in locking context requires the MIB update to be modified - move
that to a better location instead.
This change also makes it possible to call
mptcp_pm_nl_mp_prio_send_ack() for the userspace PM commands without
manipulating the in-kernel PM lock.
Fixes: 0f9f696a50 ("mptcp: add set_flags command in PM netlink")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user-space PM subflow removal path uses a couple of helpers
that must be called under the msk socket lock and the current
code lacks such requirement.
Change the existing lock scope so that the relevant code is under
its protection.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/287
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offloading police with action TC_ACT_UNSPEC was erroneously disabled even
though it was supported by mlx5 matchall offload implementation, which
didn't verify the action type but instead assumed that any single police
action attached to matchall classifier is a 'continue' action. Lack of
action type check made it non-obvious what mlx5 matchall implementation
actually supports and caused implementers and reviewers of referenced
commits to disallow it as a part of improved validation code.
Fixes: b8cd5831c6 ("net: flow_offload: add tc police action parameters")
Fixes: b50e462bc2 ("net/sched: act_police: Add extack messages for offload failure")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
`cancel_work_sync(&hdev->power_on)` was moved to hci_dev_close_sync in
commit [1] to ensure that power_on work is canceled after HCI interface
down.
But, in certain cases power_on work function may call hci_dev_close_sync
itself: hci_power_on -> hci_dev_do_close -> hci_dev_close_sync ->
cancel_work_sync(&hdev->power_on), causing deadlock. In particular, this
happens when device is rfkilled on boot. To avoid deadlock, move
power_on work canceling out of hci_dev_do_close/hci_dev_close_sync.
Deadlock introduced by commit [1] was reported in [2,3] as broken
suspend. Suspend did not work because `hdev->req_lock` held as result of
`power_on` work deadlock. In fact, other BT features were not working.
It was not observed when testing [1] since it was verified without
rfkill in place.
NOTE: It is not needed to cancel power_on work from other places where
hci_dev_do_close/hci_dev_close_sync is called in case:
* Requests were serialized due to `hdev->req_workqueue`. The power_on
work is first in that workqueue.
* hci_rfkill_set_block which won't close device anyway until HCI_SETUP
is on.
* hci_sock_release which runs after hci_sock_bind which ensures
HCI_SETUP was cleared.
As result, behaviour is the same as in pre-dd06ed7 commit, except
power_on work cancel added to hci_dev_close.
[1]: commit ff7f292611 ("Bluetooth: core: Fix missing power_on work cancel on HCI close")
[2]: https://lore.kernel.org/lkml/20220614181706.26513-1-max.oss.09@gmail.com/
[2]: https://lore.kernel.org/lkml/1236061d-95dd-c3ad-a38f-2dae7aae51ef@o2.pl/
Fixes: ff7f292611 ("Bluetooth: core: Fix missing power_on work cancel on HCI close")
Signed-off-by: Vasyl Vavrychuk <vasyl.vavrychuk@opensynergy.com>
Reported-by: Max Krummenacher <max.krummenacher@toradex.com>
Reported-by: Mateusz Jonczyk <mat.jonczyk@o2.pl>
Tested-by: Max Krummenacher <max.krummenacher@toradex.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
family is only set to either AF_INET or AF_INET6 based on len. In all
other cases we return early. Thus the check against AF_UNSPEC can be
omitted.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220630082618.15649-1-tklauser@distanz.ch
Commit cf2f225e26 ("af_unix: Put a socket into a per-netns hash table.")
accidentally broke user API for pathname sockets. A socket was able to
connect() to a pathname socket whose file was visible even if they were in
different network namespaces.
The commit puts all sockets into a per-netns hash table. As a result,
connect() to a pathname socket in a different netns fails to find it in the
caller's per-netns hash table and returns -ECONNREFUSED even when the task
can view the peer socket file.
We can reproduce this issue by:
Console A:
# python3
>>> from socket import *
>>> s = socket(AF_UNIX, SOCK_STREAM, 0)
>>> s.bind('test')
>>> s.listen(32)
Console B:
# ip netns add test
# ip netns exec test sh
# python3
>>> from socket import *
>>> s = socket(AF_UNIX, SOCK_STREAM, 0)
>>> s.connect('test')
Note when dumping sockets by sock_diag, procfs, and bpf_iter, they are
filtered only by netns. In other words, even if they are visible and
connect()able, all sockets in different netns are skipped while iterating
sockets. Thus, we need a fix only for finding a peer pathname socket.
This patch adds a global hash table for pathname sockets, links them with
sk_bind_node, and uses it in unix_find_socket_byinode(). By doing so, we
can keep sockets in per-netns hash tables and dump them easily.
Thanks to Sachin Sant and Leonard Crestez for reports, logs and a reproducer.
Fixes: cf2f225e26 ("af_unix: Put a socket into a per-netns hash table.")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Reported-by: Leonard Crestez <cdleonard@gmail.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The strlcpy should not be used because it doesn't limit the source
length. Preferred is strscpy.
Signed-off-by: XueBing Chen <chenxuebing@jari.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit d5f9023fa6 ("can: bcm: delay release of struct bcm_op
after synchronize_rcu()") Thadeu Lima de Souza Cascardo introduced two
synchronize_rcu() calls in bcm_release() (only once at socket close)
and in bcm_delete_rx_op() (called on removal of each single bcm_op).
Unfortunately this slow removal of the bcm_op's affects user space
applications like cansniffer where the modification of a filter
removes 2048 bcm_op's which blocks the cansniffer application for
40(!) seconds.
In commit 181d444790 ("can: gw: use call_rcu() instead of costly
synchronize_rcu()") Eric Dumazet replaced the synchronize_rcu() calls
with several call_rcu()'s to safely remove the data structures after
the removal of CAN ID subscriptions with can_rx_unregister() calls.
This patch adopts Erics approach for the can-bcm which should be
applicable since the removal of tasklet_kill() in bcm_remove_op() and
the introduction of the HRTIMER_MODE_SOFT timer handling in Linux 5.4.
Fixes: d5f9023fa6 ("can: bcm: delay release of struct bcm_op after synchronize_rcu()") # >= 5.4
Link: https://lore.kernel.org/all/20220520183239.19111-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Cc: Norbert Slusarek <nslusarek@gmx.net>
Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Insufficient validation of element datatype and length in
nft_setelem_parse_data(). At least commit 7d7402642e updates
maximum element data area up to 64 bytes when only 16 bytes
where supported at the time. Support for larger element size
came later in fdb9c405e3 though. Picking this older commit
as Fixes: tag to be safe than sorry.
2) Memleak in pipapo destroy path, reproducible when transaction
in aborted. This is already triggering in the existing netfilter
test infrastructure since more recent new tests are covering this
path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
New elements that reside in the clone are not released in case that the
transaction is aborted.
[16302.231754] ------------[ cut here ]------------
[16302.231756] WARNING: CPU: 0 PID: 100509 at net/netfilter/nf_tables_api.c:1864 nf_tables_chain_destroy+0x26/0x127 [nf_tables]
[...]
[16302.231882] CPU: 0 PID: 100509 Comm: nft Tainted: G W 5.19.0-rc3+ #155
[...]
[16302.231887] RIP: 0010:nf_tables_chain_destroy+0x26/0x127 [nf_tables]
[16302.231899] Code: f3 fe ff ff 41 55 41 54 55 53 48 8b 6f 10 48 89 fb 48 c7 c7 82 96 d9 a0 8b 55 50 48 8b 75 58 e8 de f5 92 e0 83 7d 50 00 74 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8b 65 00 48 8b 7d 08 49 39 fc 74 05
[...]
[16302.231917] Call Trace:
[16302.231919] <TASK>
[16302.231921] __nf_tables_abort.cold+0x23/0x28 [nf_tables]
[16302.231934] nf_tables_abort+0x30/0x50 [nf_tables]
[16302.231946] nfnetlink_rcv_batch+0x41a/0x840 [nfnetlink]
[16302.231952] ? __nla_validate_parse+0x48/0x190
[16302.231959] nfnetlink_rcv+0x110/0x129 [nfnetlink]
[16302.231963] netlink_unicast+0x211/0x340
[16302.231969] netlink_sendmsg+0x21e/0x460
Add nft_set_pipapo_match_destroy() helper function to release the
elements in the lookup tables.
Stefano Brivio says: "We additionally look for elements pointers in the
cloned matching data if priv->dirty is set, because that means that
cloned data might point to additional elements we did not commit to the
working copy yet (such as the abort path case, but perhaps not limited
to it)."
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure element data type and length do not mismatch the one specified
by the set declaration.
Fixes: 7d7402642e ("netfilter: nf_tables: variable sized set element keys / data")
Reported-by: Hugues ANGUELKOV <hanguelkov@randorisec.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The Microchip LAN937X switches have a tagging protocol which is
very similar to KSZ tagging. So that the implementation is added to
tag_ksz.c and reused common APIs
Signed-off-by: Prasanna Vengateshan <prasanna.vengateshan@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most drivers use "skb_transport_offset(skb) + tcp_hdrlen(skb)"
to compute headers length for a TCP packet, but others
use more convoluted (but equivalent) ways.
Add skb_tcp_all_headers() and skb_inner_tcp_all_headers()
helpers to harmonize this a bit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2022-07-02
We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 6 files changed, 193 insertions(+), 86 deletions(-).
The main changes are:
1) Fix clearing of page contiguity when unmapping XSK pool, from Ivan Malov.
2) Two verifier fixes around bounds data propagation, from Daniel Borkmann.
3) Fix fprobe sample module's parameter descriptions, from Masami Hiramatsu.
4) General BPF maintainer entry revamp to better scale patch reviews.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, selftests: Add verifier test case for jmp32's jeq/jne
bpf, selftests: Add verifier test case for imm=0,umin=0,umax=1 scalar
bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals
bpf: Fix incorrect verifier simulation around jmp32's jeq/jne
xsk: Clear page contiguity bit when unmapping pool
bpf, docs: Better scale maintenance of BPF subsystem
fprobe, samples: Add module parameter descriptions
====================
Link: https://lore.kernel.org/r/20220701230121.10354-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to commit 7c80b038d2 ("net: fix sk_wmem_schedule() and
sk_rmem_schedule() errors"), let the MPTCP receive path schedule
exactly the required amount of memory.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
possible"), the MPTCP protocol is the last SK_RECLAIM_CHUNK and
SK_RECLAIM_THRESHOLD users.
Update the MPTCP reclaim schema to match the core/TCP one and drop the
mentioned macros. This additionally clean the MPTCP code a bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory accounting is broken in such exceptional code
path, and after commit 4890b686f4 ("net: keep sk->sk_forward_alloc
as small as possible") we can't find much help there.
Drop the broken code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to retrieve EHT capabilities and EHT operation elements
passed by the userspace in the beacon template and store the pointers
in struct cfg80211_ap_settings to be used by the drivers.
Co-developed-by: Vikram Kandukuri <quic_vikram@quicinc.com>
Signed-off-by: Vikram Kandukuri <quic_vikram@quicinc.com>
Co-developed-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Link: https://lore.kernel.org/r/20220523064904.28523-1-quic_alokad@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Increase akm_suites array size in struct cfg80211_crypto_settings to 10
and advertise the capability to userspace. This allows userspace to send
more than two AKMs to driver in netlink commands such as
NL80211_CMD_CONNECT.
This capability is needed for implementing WPA3-Personal transition mode
correctly with any driver that handles roaming internally. Currently,
the possible AKMs for multi-AKM connect can include PSK, PSK-SHA-256,
SAE, FT-PSK and FT-SAE. Since the count is already 5, increasing
the akm_suites array size to 10 should be reasonable for future
usecases.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/1653312358-12321-1-git-send-email-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The current check only worked for AP mode, but we can do
radar detection in mesh as well (for example). We could
try to check this using wdev_chandef(), but we also don't
really care since the chandef is passed in and we have no
need to use it anymore (since we added the argument in
commit d2859df5e7 ("cfg80211/mac80211: DFS setup chandef
for cac event")).
Change-Id: I856e4344d5e64ff4d2eead0b4c53b11f264be9b8
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In many cases we might get here from driver code that's
not really set up to care about the locking, and for the
non-MLO cases we really don't care so much about it. So
relax the checking here for now, perhaps we should even
remove it completely since we might not really care if
we point to an invalid link's chandef and can require
the caller to check the link validity first.
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We need to hold the wdev mutex already in order to call
nl80211_parse_tx_bitrate_mask(), so acquire it earlier.
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We need wdev_chandef() in this code, which now requires
the wdev mutex due to the per-link nature. Hold it here
to make sure we can access the link.
Reported-by: syzbot+b4e9aa0f32ffd9902442@syzkaller.appspotmail.com
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prior to commit 7b0a0e3c3a ("wifi: cfg80211: do some
rework towards MLO link APIs") the interface type didn't
really matter here, but now we need to handle all of the
possible cases. Add IBSS ("ADHOC") and handle it.
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Reported-by: syzbot+90d912872157e63589e4@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the interface isn't (yet) added to the driver, skip the
link info update. This was previously done for the BSS info
changes, but I forgot to copy the same check here.
Fixes: 7b7090b4c6 ("wifi: mac80211: split bss_info_changed method")
Reported-by: syzbot+bce2ca140cc00578ed07@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a client does not generate any local tx activity, accumulating airtime
deficit for the round-robin scheduler can be harmful. If this goes on for too
long, the deficit could grow quite large, which might cause unreasonable
initial latency once the client becomes active
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-7-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Now that the global pending airtime is more relevant for airtime fairness,
it makes sense to make it accessible via debugfs for debugging
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-6-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to maintain fairness, the amount of queueing needs to be limited
beyond the simple per-station AQL budget, otherwise the driver can simply
repeatedly do scheduling rounds until all queues that have not used their
AQL budget become eligble.
To be conservative, use the high AQL limit for the first txq and add half
of the low AQL for each subsequent queue.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-5-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows proper deficit accounting to ensure that they don't carry their
deficit until the next time they become active
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-4-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When queueing packets for a station, deficit only gets added once the packets
have been transmitted, which could be much later. During that time, a lot of
temporary unfairness could happen, which could lead to bursty behavior.
Fix this by subtracting the aql_tx_pending when checking the deficit in tx
scheduling.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
32 bit is more than enough range for the airtime deficit
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commits 6a789ba679 and
2433647bc8.
The virtual time scheduler code has a number of issues:
- queues slowed down by hardware/firmware powersave handling were not properly
handled.
- on ath10k in push-pull mode, tx queues that the driver tries to pull from
were starved, causing excessive latency
- delay between tx enqueue and reported airtime use were causing excessively
bursty tx behavior
The bursty behavior may also be present on the round-robin scheduler, but there
it is much easier to fix without introducing additional regressions
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220625212411.36675-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
struct link_sta_info has now a cur_max_bandwidth data:
net/mac80211/sta_info.h:569: warning: Function parameter or member 'cur_max_bandwidth' not described in 'link_sta_info'
Copy the meaning from struct sta_info, documenting it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lore.kernel.org/r/37d898634bb30776442a33833c48cbb21c90ecc6.1656409369.git.mchehab@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Time-sensitive networking code needs to work with PTP times expressed in
nanoseconds, and with packet transmission times expressed in
picoseconds, since those would be fractional at higher than gigabit
speed when expressed in nanoseconds.
Convert the existing uses in tc-taprio and the ocelot/felix DSA driver
to a PSEC_PER_NSEC macro. This macro is placed in include/linux/time64.h
as opposed to its relatives (PSEC_PER_SEC etc) from include/vdso/time64.h
because the vDSO library does not (yet) need/use it.
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for the vDSO parts
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
9c5de246c1 ("net: sparx5: mdb add/del handle non-sparx5 devices")
fbb89d02e3 ("net: sparx5: Allow mdb entries to both CPU and ports")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Looks like there are still cases when "space_left - frag1bytes" can
legitimately exceed PAGE_SIZE. Ensure that xdr->end always remains
within the current encode buffer.
Reported-by: Bruce Fields <bfields@fieldses.org>
Reported-by: Zorro Lang <zlang@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216151
Fixes: 6c254bf3b6 ("SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
commit ed6cd6a178 ("net, neigh: Set lower cap for neigh_managed_work rearming")
fixed a case when DELAY_PROBE_TIME is configured to 0, the processing of the
system work queue hog CPU to 100%, and further more we should introduce
a new option used by periodic probe
Signed-off-by: Yuwei Wang <wangyuweihx@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are UAF bugs in rose_heartbeat_expiry(), rose_timer_expiry()
and rose_idletimer_expiry(). The root cause is that del_timer()
could not stop the timer handler that is running and the refcount
of sock is not managed properly.
One of the UAF bugs is shown below:
(thread 1) | (thread 2)
| rose_bind
| rose_connect
| rose_start_heartbeat
rose_release | (wait a time)
case ROSE_STATE_0 |
rose_destroy_socket | rose_heartbeat_expiry
rose_stop_heartbeat |
sock_put(sk) | ...
sock_put(sk) // FREE |
| bh_lock_sock(sk) // USE
The sock is deallocated by sock_put() in rose_release() and
then used by bh_lock_sock() in rose_heartbeat_expiry().
Although rose_destroy_socket() calls rose_stop_heartbeat(),
it could not stop the timer that is running.
The KASAN report triggered by POC is shown below:
BUG: KASAN: use-after-free in _raw_spin_lock+0x5a/0x110
Write of size 4 at addr ffff88800ae59098 by task swapper/3/0
...
Call Trace:
<IRQ>
dump_stack_lvl+0xbf/0xee
print_address_description+0x7b/0x440
print_report+0x101/0x230
? irq_work_single+0xbb/0x140
? _raw_spin_lock+0x5a/0x110
kasan_report+0xed/0x120
? _raw_spin_lock+0x5a/0x110
kasan_check_range+0x2bd/0x2e0
_raw_spin_lock+0x5a/0x110
rose_heartbeat_expiry+0x39/0x370
? rose_start_heartbeat+0xb0/0xb0
call_timer_fn+0x2d/0x1c0
? rose_start_heartbeat+0xb0/0xb0
expire_timers+0x1f3/0x320
__run_timers+0x3ff/0x4d0
run_timer_softirq+0x41/0x80
__do_softirq+0x233/0x544
irq_exit_rcu+0x41/0xa0
sysvec_apic_timer_interrupt+0x8c/0xb0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:default_idle+0xb/0x10
RSP: 0018:ffffc9000012fea0 EFLAGS: 00000202
RAX: 000000000000bcae RBX: ffff888006660f00 RCX: 000000000000bcae
RDX: 0000000000000001 RSI: ffffffff843a11c0 RDI: ffffffff843a1180
RBP: dffffc0000000000 R08: dffffc0000000000 R09: ffffed100da36d46
R10: dfffe9100da36d47 R11: ffffffff83cf0950 R12: 0000000000000000
R13: 1ffff11000ccc1e0 R14: ffffffff8542af28 R15: dffffc0000000000
...
Allocated by task 146:
__kasan_kmalloc+0xc4/0xf0
sk_prot_alloc+0xdd/0x1a0
sk_alloc+0x2d/0x4e0
rose_create+0x7b/0x330
__sock_create+0x2dd/0x640
__sys_socket+0xc7/0x270
__x64_sys_socket+0x71/0x80
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 152:
kasan_set_track+0x4c/0x70
kasan_set_free_info+0x1f/0x40
____kasan_slab_free+0x124/0x190
kfree+0xd3/0x270
__sk_destruct+0x314/0x460
rose_release+0x2fa/0x3b0
sock_close+0xcb/0x230
__fput+0x2d9/0x650
task_work_run+0xd6/0x160
exit_to_user_mode_loop+0xc7/0xd0
exit_to_user_mode_prepare+0x4e/0x80
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x4f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This patch adds refcount of sock when we use functions
such as rose_start_heartbeat() and so on to start timer,
and decreases the refcount of sock when timer is finished
or deleted by functions such as rose_stop_heartbeat()
and so on. As a result, the UAF bugs could be mitigated.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Tested-by: Duoming Zhou <duoming@zju.edu.cn>
Link: https://lore.kernel.org/r/20220629002640.5693-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There is no need to store the result of the addition back to variable count
after the addition. The store is redundant, replace += with just +
Cleans up clang scan build warning:
warning: Although the value stored to 'count' is used in the enclosing
expression, the value is never actually read from 'count'
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220628145406.183527-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Restore set counter when one of the CPU loses race to add elements
to sets.
2) After NF_STOLEN, skb might be there no more, update nftables trace
infra to avoid access to skb in this case. From Florian Westphal.
3) nftables bridge might register a prerouting hook with zero priority,
br_netfilter incorrectly skips it. Also from Florian.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: br_netfilter: do not skip all hooks with 0 priority
netfilter: nf_tables: avoid skb access on nf_stolen
netfilter: nft_dynset: restore set element counter when failing to update
====================
Link: https://lore.kernel.org/r/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I don't see how to make it nice without introducing btf id lists
for the hooks where these helpers are allowed. Some LSM hooks
work on the locked sockets, some are triggering early and
don't grab any locks, so have two lists for now:
1. LSM hooks which trigger under socket lock - minority of the hooks,
but ideal case for us, we can expose existing BTF-based helpers
2. LSM hooks which trigger without socket lock, but they trigger
early in the socket creation path where it should be safe to
do setsockopt without any locks
3. The rest are prohibited. I'm thinking that this use-case might
be a good gateway to sleeping lsm cgroup hooks in the future.
We can either expose lock/unlock operations (and add tracking
to the verifier) or have another set of bpf_setsockopt
wrapper that grab the locks and might sleep.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Free sk in case tipc_sk_insert() fails.
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of 4way handshake offload, cfg80211_port_authorized
enables driver to indicate successful 4way handshake to cfg80211 layer.
Currently this path of port authorization is restricted to
interface type NL80211_IFTYPE_STATION. This patch extends
the use of port authorization API for P2P client as well.
Signed-off-by: Vinayak Yadawad <vinayak.yadawad@broadcom.com>
Link: https://lore.kernel.org/r/ef25cb49fcb921df2e5d99e574f65e8a009cc52c.1655905440.git.vinayak.yadawad@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a vif is being removed and sdata->bss is cleared, __ieee80211_wake_txqs
can still be called on it, which crashes as soon as sdata->bss is being
dereferenced.
To fix this properly, check for SDATA_STATE_RUNNING before waking queues,
and take the fq lock when setting it (to ensure that __ieee80211_wake_txqs
observes the change when running on a different CPU)
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@kernel.org>
Link: https://lore.kernel.org/r/20220531190824.60019-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introduce the capability to specify gfp_t parameter to
ieeee80211_obss_color_collision_notify routine since it runs in
interrupt context in ieee80211_rx_check_bss_color_collision().
Fixes: 6d945a33f2 ("mac80211: introduce BSS color collision detection")
Co-developed-by: Ryder Lee <ryder.lee@mediatek.com>
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/02c990fb3fbd929c8548a656477d20d6c0427a13.1655419135.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
As of commit 5801f064e3 ("net: ipv6: unexport __init-annotated seg6_hmac_init()"),
EXPORT_SYMBOL and __init is a bad combination because the .init.text
section is freed up after the initialization. Hence, modules cannot
use symbols annotated __init. The access to a freed symbol may end up
with kernel panic.
This remove the EXPORT_SYMBOL to fix modpost warning:
WARNING: modpost: vmlinux.o(___ksymtab+seg6_hmac_net_init+0x0): Section mismatch in reference from the variable __ksymtab_seg6_hmac_net_init to the function .init.text:seg6_hmac_net_init()
The symbol seg6_hmac_net_init is exported and annotated __init
Fix this by removing the __init annotation of seg6_hmac_net_init or drop the export.
Fixes: bf355b8d2c ("ipv6: sr: add core files for SR HMAC support")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20220628033134.21088-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When kcalloc fails, ipip6_tunnel_get_prl() should return -ENOMEM.
Move the position of label "out" to return correctly.
Addresses-Coverity: ("Unused value")
Fixes: 300aaeeaab ("[IPV6] SIT: Add SIOCGETPRL ioctl to get/dump PRL.")
Signed-off-by: katrinzhou <katrinzhou@tencent.com>
Reviewed-by: Eric Dumazet<edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220628035030.1039171-1-zys.zljxml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While setting up init_net's sysctl table, we need not duplicate the
global table and can use it directly as ipv4_sysctl_init_net() does.
Unlike IPv4, AF_UNIX does not have a huge sysctl table for now, so it
cannot be a problem, but this patch makes code consistent.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220627233627.51646-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the listener socket owning the relevant request is closed,
it frees the unaccepted subflows and that causes later deletion
of the paired MPTCP sockets.
The mptcp socket's worker can run in the time interval between such delete
operations. When that happens, any access to msk->first will cause an UaF
access, as the subflow cleanup did not cleared such field in the mptcp
socket.
Address the issue explicitly traversing the listener socket accept
queue at close time and performing the needed cleanup on the pending
msk.
Note that the locking is a bit tricky, as we need to acquire the msk
socket lock, while still owning the subflow socket one.
Fixes: 86e39e0448 ("mptcp: keep track of local endpoint still available for each msk")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the MPTCP receive path reach a non fatal fall-back condition, e.g.
when the MPC sockets must fall-back to TCP, the existing code is a little
self-inconsistent: it reports that new data is available - return true -
but sets the MPC flag to the opposite value.
As the consequence read operations in some exceptional scenario may block
unexpectedly.
Address the issue setting the correct MPC read status. Additionally avoid
some code duplication in the fatal fall-back scenario.
Fixes: 9c81be0dbc ("mptcp: add MP_FAIL response support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the MPTCP socket shutdown happens before a fallback
to TCP, and all the pending data have been already spooled,
we never close the TCP connection.
Address the issue explicitly checking for critical condition
at fallback time.
Fixes: 1e39e5a32a ("mptcp: infinite mapping sending")
Fixes: 0348c690ed ("mptcp: add the fallback check")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>