This reverts commit 9ea3e52c5b.
Cited commit added a check to make sure 'action' is not NULL, but
'action' is already dereferenced before the check, when calling
flow_offload_has_one_action().
Therefore, the check does not make any sense and results in a smatch
warning:
include/net/flow_offload.h:322 flow_action_mixed_hw_stats_check() warn:
variable dereferenced before check 'action' (see line 319)
Fix by reverting this commit.
Cc: gushengxian <gushengxian@yulong.com>
Fixes: 9ea3e52c5b ("flow_offload: action should not be NULL when it is referenced")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20210819105842.1315705-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, when creating an ingress qdisc on an indirect device before
the driver registered for callbacks, the driver will not have a chance
to register its filter configuration callbacks.
To fix that, modify the code such that it keeps track of all the ingress
qdiscs that call flow_indr_dev_setup_offload(). When a driver calls
flow_indr_dev_register(), go through the list of tracked ingress qdiscs
and call the driver callback entry point so as to give it a chance to
register its callback.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
to give more control to the networking stack and enable it to change
memcg charging behavior. In the future, the networking stack may decide
to avoid oom-kills when fallbacks are more appropriate.
One behavior change in mem_cgroup_charge_skmem() by this patch is to
avoid force charging by default and let the caller decide when and if
force charging is needed through the presence or absence of
__GFP_NOFAIL.
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The color change announcement is very similar to how CSA works where
we have an IE that includes a counter. When the counter hits 0, the new
color is applied via an updated beacon.
This patch makes the CSA counter functionality reusable, rather than
implementing it again. This also allows for future reuse incase support
for other counter IEs gets added.
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/057c1e67b82bee561ea44ce6a45a8462d3da6995.1625247619.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds support for BSS color collisions to the wireless subsystem.
Add the required functionality to nl80211 that will notify about color
collisions, triggering the color change and notifying when it is completed.
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/500b3582aec8fe2c42ef46f3117b148cb7cbceb5.1625247619.git.lorenzo@kernel.org
[remove unnecessary NULL initialisation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Previously, sockmap for AF_UNIX protocol only supports
dgram type. This patch add unix stream type support, which
is similar to unix_dgram_proto. To support sockmap, dgram
and stream cannot share the same unix_proto anymore, because
they have different implementations, such as unhash for stream
type (which will remove closed or disconnected sockets from the map),
so rename unix_proto to unix_dgram_proto and add a new
unix_stream_proto.
Also implement stream related sockmap functions.
And add dgram key words to those dgram specific functions.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-3-jiang.wang@bytedance.com
This stores the advertising handle/instance into hci_conn so it is
accessible when re-enabling the advertising once disconnected.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
commit <47595e32869f> ("<MAINTAINERS: Mark some staging directories>")
indicated the ipx network layer as obsolete in Jan 2018,
updated in the MAINTAINERS file
now, after being exposed for 3 years to refactoring, so to
delete uapi/linux/ipx.h and net/ipx.h header files for good.
additionally, there is no module that depends on ipx.h except
a broken staging driver(r8188eu)
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can use xarray instead of linearly organized linked lists for the
devlink instances. This will let us revise the locking scheme in favour
of internal xarray locking that protects database.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct devlink itself is protected by internal lock and doesn't
need global lock during operation. That global lock is used to protect
addition/removal new devlink instances from the global list in use by
all devlink consumers in the system.
The future conversion of linked list to be xarray will allow us to
actually delete that lock, but first we need to count all struct devlink
users.
The reference counting provides us a way to ensure that no new user
space commands success to grab devlink instance which is going to be
destroyed makes it is safe to access it without lock.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
The it_present member of struct ieee80211_radiotap_header is treated as a
flexible array (multiple u32s can be conditionally present). In order for
memcpy() to reason (or really, not reason) about the size of operations
against this struct, use of bytes beyond it_present need to be treated
as part of the flexible array. Add a trailing flexible array and
initialize its initial index via pointer arithmetic.
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210806215305.2875621-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This change adds a 'type' attribute to routes, which can be parsed from
a RTM_NEWROUTE message. This will help to distinguish local vs. peer
routes in a future change.
This means userspace will need to set a correct rtm_type in RTM_NEWROUTE
and RTM_DELROUTE messages; we currently only accept RTN_UNICAST.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20210810023834.2231088-1-jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All of the symbols either only exist in bond_options.c or nowhere at
all. These symbols were verified to not exist in the code base by
using `git grep` and their removal was verified by compiling bonding.ko.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Enable drivers to publish/unpublish individual parameter.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently device configuration parameters can be registered as an array.
Due to this a constant array must be registered. A single driver
supporting multiple devices each with different device capabilities end
up registering all parameters even if it doesn't support it.
One possible workaround a driver can do is, it registers multiple single
entry arrays to overcome such limitation.
Better is to provide a API that enables driver to register/unregister a
single parameter. This also further helps in two ways.
(1) to reduce the memory of devlink_param_entry by avoiding in registering
parameters which are not supported by the device.
(2) avoid generating multiple parameter add, delete, publish, unpublish,
init value notifications for such unsupported parameters
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new device generic parameter to enable/disable creation of
VDPA net auxiliary device and associated device functionality
in the devlink instance.
User who prefers to disable such functionality can disable it using below
example.
$ devlink dev param set pci/0000:06:00.0 \
name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point devlink instance do not create auxiliary device for the
VDPA net functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new device generic parameter to enable/disable creation of
RDMA auxiliary device and associated device functionality
in the devlink instance.
User who prefers to disable such functionality can disable it using below
example.
$ devlink dev param set pci/0000:06:00.0 \
name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point devlink instance do not create auxiliary device for the
RDMA functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new device generic parameter to enable/disable creation of
Ethernet auxiliary device and associated device functionality
in the devlink instance.
User who prefers to disable such functionality can disable it using below
example.
$ devlink dev param set pci/0000:06:00.0 \
name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point devlink instance do not create auxiliary device for the
Ethernet functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat.
2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout
fixup.
3) CLUSTERIP registers ARP hook on demand, from Florian.
4) Use clusterip_net to store pernet warning, also from Florian.
5) Remove struct netns_xt, from Florian Westphal.
6) Enable ebtables hooks in initns on demand, from Florian.
7) Allow to filter conntrack netlink dump per status bits,
from Florian Westphal.
8) Register x_tables hooks in initns on demand, from Florian.
9) Remove queue_handler from per-netns structure, again from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This was done to detect when the pernet->init() function was not called
yet, by checking if net->nf.queue_handler is NULL.
Once the nfnetlink_queue module is active, all struct net pointers
contain the same address. So place this back in nf_queue.c.
Handle the 'netns error unwind' test by checking nfnl_queue_net for a
NULL pointer and add a comment for this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently page pool only support page recycling when there
is only one user of the page, and the split page reusing
implemented in the most driver can not use the page pool as
bing-pong way of reusing requires the multi user support in
page pool.
Those reusing or recycling has below limitations:
1. page from page pool can only be used be one user in order
for the page recycling to happen.
2. Bing-pong way of reusing in most driver does not support
multi desc using different part of the same page in order
to save memory.
So add multi-users support and frag page recycling in page
pool to overcome the above limitation.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For 32 bit systems with 64 bit dma, dma_addr[1] is used to
store the upper 32 bit dma addr, those system should be rare
those days.
For normal system, the dma_addr[1] in 'struct page' is not
used, so we can reuse dma_addr[1] for storing frag count,
which means how many frags this page might be splited to.
In order to simplify the page frag support in the page pool,
the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
the 32 bit systems with 64 bit dma, and the page frag support
in page pool is disabled for such system.
The newly added page_pool_set_frag_count() is called to reserve
the maximum frag count before any page frag is passed to the
user. The page_pool_atomic_sub_frag_count_return() is called
when user is done with the page frag.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, page->pp is cleared and set everytime the page
is recycled, which is unnecessary.
So only set the page->pp when the page is added to the page
pool and only clear it when the page is released from the
page pool.
This is also a preparation to support allocating frag page
in page pool.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Without this there is a warning if source files include psample.h
before skbuff.h or doesn't include it at all.
Fixes: 6ae0a62861 ("net: Introduce psample, a new genetlink channel for packet sampling")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Link: https://lore.kernel.org/r/20210808065242.1522535-1-roid@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.
Semantics of XDP_TX when attached to a bond are equivalent in such
setting to the case when a tc/BPF program would be attached to the
bond, meaning transmitting the packet out of the bond itself using one
of the bond's configured xmit methods to select a slave device (rather
than XDP_TX on the slave itself). Handling of XDP_TX to transmit
using the configured bonding mechanism is therefore implemented by
rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
performance impact this check is guarded by a static key, which is
incremented when a XDP program is loaded onto a bond device. This
approach was chosen to avoid changes to drivers implementing XDP. If
the slave device does not match the receive device, then XDP_REDIRECT
is transparently used to perform the redirection in order to have
the network driver release the packet from its RX ring. The bonding
driver hashing functions have been refactored to allow reuse with
xdp_buff's to avoid code duplication.
The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters. An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.
Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally the bonding round robin algorithm was modified
to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
of cache misses were caused by the shared rr_tx_counter (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].
-----------------------| CPU |--| rxpck/s |--| txpck/s |----
without patch (1 dev):
XDP_DROP: 3.15% 48.6Mpps
XDP_TX: 3.12% 18.3Mpps 18.3Mpps
XDP_DROP (RSS): 9.47% 116.5Mpps
XDP_TX (RSS): 9.67% 25.3Mpps 24.2Mpps
-----------------------
with patch, bond (1 dev):
XDP_DROP: 3.14% 46.7Mpps
XDP_TX: 3.15% 13.9Mpps 13.9Mpps
XDP_DROP (RSS): 10.33% 117.2Mpps
XDP_TX (RSS): 10.64% 25.1Mpps 24.0Mpps
-----------------------
with patch, bond (2 devs):
XDP_DROP: 6.27% 92.7Mpps
XDP_TX: 6.26% 17.6Mpps 17.5Mpps
XDP_DROP (RSS): 11.38% 117.2Mpps
XDP_TX (RSS): 14.30% 28.7Mpps 27.4Mpps
--------------------------------------------------------------
RSS: Receive Side Scaling, e.g. the packets were sent to a range of
destination IPs.
[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
[4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
All kernel devlink implementations call to devlink_alloc() during
initialization routine for specific device which is used later as
a parent device for devlink_register().
Such late device assignment causes to the situation which requires us to
call to device_register() before setting other parameters, but that call
opens devlink to the world and makes accessible for the netlink users.
Any attempt to move devlink_register() to be the last call generates the
following error due to access to the devlink->dev pointer.
[ 8.758862] devlink_nl_param_fill+0x2e8/0xe50
[ 8.760305] devlink_param_notify+0x6d/0x180
[ 8.760435] __devlink_params_register+0x2f1/0x670
[ 8.760558] devlink_params_register+0x1e/0x20
The simple change of API to set devlink device in the devlink_alloc()
instead of devlink_register() fixes all this above and ensures that
prior to call to devlink_register() everything already set.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that DSA keeps track of the port learning state, it becomes
superfluous to keep an additional variable with this information in the
sja1105 driver. Remove it.
The DSA core's learning state is present in struct dsa_port *dp.
To avoid the antipattern where we iterate through a DSA switch's
ports and then call dsa_to_port to obtain the "dp" reference (which is
bad because dsa_to_port iterates through the DSA switch tree once
again), just iterate through the dst->ports and operate on those
directly.
The sja1105 had an extra use of priv->learn_ena on non-user ports. DSA
does not touch the learning state of those ports - drivers are free to
do what they wish on them. Mark that information with a comment in
struct dsa_port and let sja1105 set dp->learning for cascade ports.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently DSA leaves it down to device drivers to fast age the FDB on a
port when address learning is disabled on it. There are 2 reasons for
doing that in the first place:
- when address learning is disabled by user space, through
IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user
space typically wants to achieve is to operate in a mode with no
dynamic FDB entry on that port. But if the port is already up, some
addresses might have been already learned on it, and it seems silly to
wait for 5 minutes for them to expire until something useful can be
done.
- when a port leaves a bridge and becomes standalone, DSA turns off
address learning on it. This also has the nice side effect of flushing
the dynamically learned bridge FDB entries on it, which is a good idea
because standalone ports should not have bridge FDB entries on them.
We let drivers manage fast ageing under this condition because if DSA
were to do it, it would need to track each port's learning state, and
act upon the transition, which it currently doesn't.
But there are 2 reasons why doing it is better after all:
- drivers might get it wrong and not do it (see b53_port_set_learning)
- we would like to flush the dynamic entries from the software bridge
too, and letting drivers do that would be another pain point
So track the port learning state and trigger a fast age process
automatically within DSA.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink port already has pointer to the devlink instance and all API
calls that forward these devlink ports to the drivers perform same
"devlink_port->devlink" assignment before actual call.
This patch removes useless parameter and allows us in the future
to create specific devlink_port_ops to manage user space access with
reliable ops assignment.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Restrict range element expansion in ipset to avoid soft lockup,
from Jozsef Kadlecsik.
2) Memleak in error path for nf_conntrack_bridge for IPv4 packets,
from Yajun Deng.
3) Simplify conntrack garbage collection strategy to avoid frequent
wake-ups, from Florian Westphal.
4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.
5) Missing chain family netlink attribute in chain description
in nfnetlink_hook.
6) Incorrect sequence number on nfnetlink_hook dumps.
7) Use netlink request family in reply message for consistency.
8) Remove offload_pickup sysctl, use conntrack for established state
instead, from Florian Westphal.
9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since
NFPROTO_INET is not exposed through nfnetlink_hook.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nfnetlink_hook: translate inet ingress to netdev
netfilter: conntrack: remove offload_pickup sysctl again
netfilter: nfnetlink_hook: Use same family as request message
netfilter: nfnetlink_hook: use the sequence number of the request message
netfilter: nfnetlink_hook: missing chain family
netfilter: nfnetlink_hook: strip off module name from hookfn
netfilter: conntrack: collect all entries in one cycle
netfilter: nf_conntrack_bridge: Fix memory leak when error
netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
====================
Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These two sysctls were added because the hardcoded defaults (2 minutes,
tcp, 30 seconds, udp) turned out to be too low for some setups.
They appeared in 5.14-rc1 so it should be fine to remove it again.
Marcelo convinced me that there should be no difference between a flow
that was offloaded vs. a flow that was not wrt. timeout handling.
Thus the default is changed to those for TCP established and UDP stream,
5 days and 120 seconds, respectively.
Marcelo also suggested to account for the timeout value used for the
offloading, this avoids increase beyond the value in the conntrack-sysctl
and will also instantly expire the conntrack entry with altered sysctls.
Example:
nf_conntrack_udp_timeout_stream=60
nf_flowtable_udp_timeout=60
This will remove offloaded udp flows after one minute, rather than two.
An earlier version of this patch also cleared the ASSURED bit to
allow nf_conntrack to evict the entry via early_drop (i.e., table full).
However, it looks like we can safely assume that connection timed out
via HW is still in established state, so this isn't needed.
Quoting Oz:
[..] the hardware sends all packets with a set FIN flags to sw.
[..] Connections that are aged in hardware are expected to be in the
established state.
In case it turns out that back-to-sw-path transition can occur for
'dodgy' connections too (e.g., one side disappeared while software-path
would have been in RETRANS timeout), we can adjust this later.
Cc: Oz Shlomo <ozsh@nvidia.com>
Cc: Paul Blakey <paulb@nvidia.com>
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 08cc83cc7f ("net: dsa: add support for BRIDGE_MROUTER
attribute") added an option for users to turn off multicast flooding
towards the CPU if they turn off the IGMP querier on a bridge which
already has enslaved ports (echo 0 > /sys/class/net/br0/bridge/multicast_router).
And commit a8b659e7ff ("net: dsa: act as passthrough for bridge port flags")
simply papered over that issue, because it moved the decision to flood
the CPU with multicast (or not) from the DSA core down to individual drivers,
instead of taking a more radical position then.
The truth is that disabling multicast flooding to the CPU is simply
something we are not prepared to do now, if at all. Some reasons:
- ICMP6 neighbor solicitation messages are unregistered multicast
packets as far as the bridge is concerned. So if we stop flooding
multicast, the outside world cannot ping the bridge device's IPv6
link-local address.
- There might be foreign interfaces bridged with our DSA switch ports
(sending a packet towards the host does not necessarily equal
termination, but maybe software forwarding). So if there is no one
interested in that multicast traffic in the local network stack, that
doesn't mean nobody is.
- PTP over L4 (IPv4, IPv6) is multicast, but is unregistered as far as
the bridge is concerned. This should reach the CPU port.
- The switch driver might not do FDB partitioning. And since we don't
even bother to do more fine-grained flood disabling (such as "disable
flooding _from_port_N_ towards the CPU port" as opposed to "disable
flooding _from_any_port_ towards the CPU port"), this breaks standalone
ports, or even multiple bridges where one has an IGMP querier and one
doesn't.
Reverting the logic makes all of the above work.
Fixes: a8b659e7ff ("net: dsa: act as passthrough for bridge port flags")
Fixes: 08cc83cc7f ("net: dsa: add support for BRIDGE_MROUTER attribute")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).
Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
589918df93 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
0fac6aa098 ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")
Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df93 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- sched: taprio: fix init procedure to avoid inf loop when dumping
- sctp: move the active_key update after sh_keys is added
Current release - new code bugs:
- sparx5: fix build with old GCC & bitmask on 32-bit targets
Previous releases - regressions:
- xfrm: redo the PREEMPT_RT RCU vs hash_resize_mutex deadlock fix
- xfrm: fixes for the compat netlink attribute translator
- phy: micrel: Fix detection of ksz87xx switch
Previous releases - always broken:
- gro: set inner transport header offset in tcp/udp GRO hook to avoid
crashes when such packets reach GSO
- vsock: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST, as required by spec
- dsa: sja1105: fix static FDB entries on SJA1105P/Q/R/S and SJA1110
- bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry
- usb: lan78xx: don't modify phy_device state concurrently
- usb: pegasus: check for errors of IO routines
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEL/jwACgkQMUZtbf5S
IrsxmQ//Qbq6TluxzmFnWCjOGWo009GZUZ/PdlnhxnHRAk8BzzMM7209xGARTKod
t+bNl8ioDDxxiBlp57gtoe67nnnd4cwbPCGKpY49al8guvetsQq+0vlg0u3490X+
clY7Uz7G/8thf0JylhqQB1LrMXcNNHqB7ZV5CpM1cC+H/YxeHBv+LQy44S7Vz+LG
btHGQbbsnHrVF6WhohU8tr5AX7MdLaQvQ2aZ1XodEXRd9js4P4CP2Hn/cazZJBOT
rwxaFao2DWs6qaVYBFHtKyU1qvoxQ6Ngex/lMY0QQ9rOX/L+ha+ygzzUoqXjg7DX
jOFUeZIiGHcPe0a10NO8NkPCqn7bOBQ2h/BpJPF9b8VvQKbJAOV8kOdtTbGhMh28
vboensrppqW4qzWpgkoJaVbusvcNWibFspYFyrLjpKxpPmKuLJlli2mkyUbsUiCO
uxMN+IqisWiR379rWLX5tJQp6OIvWeQW3htD5ms7nIHpvL1pbRJnsekepkUjmTx9
DtvowHGpPSG4dPq7EP6LcE/1K0YQFjZQMsJkqTH7J4Qi+pmB2MoQyJzPraktiquT
2Qb/O2yZlng9sYYCs0P73TiVBef5KnIoIJXKvqkrmyN4QjyO+LDevGQyXV06B5VJ
a8duR+yWgPVQn+T7SMKhAOXoqXwSCbJlpXSG7iOFp4dQLCiBVzI=
=/rrJ
-----END PGP SIGNATURE-----
Merge tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec.
Current release - regressions:
- sched: taprio: fix init procedure to avoid inf loop when dumping
- sctp: move the active_key update after sh_keys is added
Current release - new code bugs:
- sparx5: fix build with old GCC & bitmask on 32-bit targets
Previous releases - regressions:
- xfrm: redo the PREEMPT_RT RCU vs hash_resize_mutex deadlock fix
- xfrm: fixes for the compat netlink attribute translator
- phy: micrel: Fix detection of ksz87xx switch
Previous releases - always broken:
- gro: set inner transport header offset in tcp/udp GRO hook to avoid
crashes when such packets reach GSO
- vsock: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST, as required by spec
- dsa: sja1105: fix static FDB entries on SJA1105P/Q/R/S and SJA1110
- bridge: validate the NUD_PERMANENT bit when adding an extern_learn
FDB entry
- usb: lan78xx: don't modify phy_device state concurrently
- usb: pegasus: check for errors of IO routines"
* tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
net: vxge: fix use-after-free in vxge_device_unregister
net: fec: fix use-after-free in fec_drv_remove
net: pegasus: fix uninit-value in get_interrupt_interval
net: ethernet: ti: am65-cpsw: fix crash in am65_cpsw_port_offload_fwd_mark_update()
bnx2x: fix an error code in bnx2x_nic_load()
net: wwan: iosm: fix recursive lock acquire in unregister
net: wwan: iosm: correct data protocol mask bit
net: wwan: iosm: endianness type correction
net: wwan: iosm: fix lkp buildbot warning
net: usb: lan78xx: don't modify phy_device state concurrently
docs: networking: netdevsim rules
net: usb: pegasus: Remove the changelog and DRIVER_VERSION.
net: usb: pegasus: Check the return value of get_geristers() and friends;
net/prestera: Fix devlink groups leakage in error flow
net: sched: fix lockdep_set_class() typo error for sch->seqlock
net: dsa: qca: ar9331: reorder MDIO write sequence
VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST
mptcp: drop unused rcu member in mptcp_pm_addr_entry
net: ipv6: fix returned variable type in ip6_skb_dst_mtu
nfp: update ethtool reporting of pauseframe control
...
syzbot is hitting might_sleep() warning at hci_sock_dev_event() due to
calling lock_sock() with rw spinlock held [1].
It seems that history of this locking problem is a trial and error.
Commit b40df5743e ("[PATCH] bluetooth: fix socket locking in
hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to
lock_sock() as an attempt to fix lockdep warning.
Then, commit 4ce61d1c7a ("[BLUETOOTH]: Fix locking in
hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to
local_bh_disable() + bh_lock_sock_nested() as an attempt to fix the
sleep in atomic context warning.
Then, commit 4b5dd696f8 ("Bluetooth: Remove local_bh_disable() from
hci_sock.c") in 3.3-rc1 removed local_bh_disable().
Then, commit e305509e67 ("Bluetooth: use correct lock to prevent UAF
of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to
lock_sock() as an attempt to fix CVE-2021-3573.
This difficulty comes from current implementation that
hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all
references from sockets because hci_unregister_dev() immediately
reclaims resources as soon as returning from
hci_sock_dev_event(HCI_DEV_UNREG).
But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not
doing what it should do.
Therefore, instead of trying to detach sockets from device, let's accept
not detaching sockets from device at hci_sock_dev_event(HCI_DEV_UNREG),
by moving actual cleanup of resources from hci_unregister_dev() to
hci_cleanup_dev() which is called by bt_host_release() when all
references to this unregistered device (which is a kobject) are gone.
Since hci_sock_dev_event(HCI_DEV_UNREG) no longer resets
hci_pi(sk)->hdev, we need to check whether this device was unregistered
and return an error based on HCI_UNREGISTER flag. There might be subtle
behavioral difference in "monitor the hdev" functionality; please report
if you found something went wrong due to this patch.
Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1]
Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: e305509e67 ("Bluetooth: use correct lock to prevent UAF of hdev object")
Acked-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds support hdev to allocate extra size for private data.
The size of private data is specified in the hdev_alloc_size(priv_size)
and the allocated buffer can be accessed with hci_get_priv(hdev).
Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Replace IP6_SFLSIZE() with struct_size() helper in order to avoid any
potential type mistakes or integer overflows that, in the worst
scenario, could lead to heap overflows.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket
buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and
tcp_sndbuf_expand()). If we've just created a new socket this adjustment
is enabled on it, but if one changes the socket buffer size by
setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.
CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
restore as it first needs to increase buffer sizes for packet queues
restore and second it needs to restore back original buffer sizes. So
after CRIU restore all sockets become non-auto-adjustable, which can
decrease network performance of restored applications significantly.
CRIU need to be able to restore sockets with enabled/disabled adjustment
to the same state it was before dump, so let's add special setsockopt
for it.
Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so
that using these interface one can reenable automatic socket buffer
adjustment on their sockets.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of explicit offloading API in switchdev in commit
2f5dc00f7a ("net: bridge: switchdev: let drivers inform which bridge
ports are offloaded"), we started having Ethernet switch drivers calling
directly into a function exported by net/bridge/br_switchdev.c, which is
a function exported by the bridge driver.
This means that drivers that did not have an explicit dependency on the
bridge before, like cpsw and am65-cpsw, now do - otherwise it is not
possible to call a symbol exported by a driver that can be built as
module unless you are a module too.
There was an attempt to solve the dependency issue in the form of commit
b0e8181762 ("net: build all switchdev drivers as modules when the
bridge is a module"). Grygorii Strashko, however, says about it:
| In my opinion, the problem is a bit bigger here than just fixing the
| build :(
|
| In case, of ^cpsw the switchdev mode is kinda optional and in many
| cases (especially for testing purposes, NFS) the multi-mac mode is
| still preferable mode.
|
| There were no such tight dependency between switchdev drivers and
| bridge core before and switchdev serviced as independent, notification
| based layer between them, so ^cpsw still can be "Y" and bridge can be
| "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE
| will need to be set as "Y", or we will have to update drivers to
| support build with BRIDGE=n and maintain separate builds for
| networking vs non-networking testing. But is this enough? Wouldn't
| it cause 'chain reaction' required to add more and more "Y" options
| (like CONFIG_VLAN_8021Q)?
|
| PS. Just to be sure we on the same page - ARM builds will be forced
| (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our
| automation testing will just fail with omap2plus_defconfig.
In the light of this, it would be desirable for some configurations to
avoid dependencies between switchdev drivers and the bridge, and have
the switchdev mode as completely optional within the driver.
Arnd Bergmann also tried to write a patch which better expressed the
build time dependency for Ethernet switch drivers where the switchdev
support is optional, like cpsw/am65-cpsw, and this made the drivers
follow the bridge (compile as module if the bridge is a module) only if
the optional switchdev support in the driver was enabled in the first
place:
https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/
but this still did not solve the fact that cpsw and am65-cpsw now must
be built as modules when the bridge is a module - it just expressed
correctly that optional dependency. But the new behavior is an apparent
regression from Grygorii's perspective.
So to support the use case where the Ethernet driver is built-in,
NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we
need a framework that can handle the possible absence of the bridge from
the running system, i.e. runtime bloatware as opposed to build-time
bloatware.
Luckily we already have this framework, since switchdev has been using
it extensively. Events from the bridge side are transmitted to the
driver side using notifier chains - this was originally done so that
unrelated drivers could snoop for events emitted by the bridge towards
ports that are implemented by other drivers (think of a switch driver
with LAG offload that listens for switchdev events on a bonding/team
interface that it offloads).
There are also events which are transmitted from the driver side to the
bridge side, which again are modeled using notifiers.
SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with
notifying the bridge that a MAC address has been dynamically learned.
So there is a precedent we can use for modeling the new framework.
The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work
that the bridge needs to do when a port becomes offloaded is blocking in
its nature: replay VLANs, MDBs etc. The calling context is indeed
blocking (we are under rtnl_mutex), but the existing switchdev
notification chain that the bridge is subscribed to is only the atomic
one. So we need to subscribe the bridge to the blocking switchdev
notification chain too.
This patch:
- keeps the driver-side perception of the switchdev_bridge_port_{,un}offload
unchanged
- moves the implementation of switchdev_bridge_port_{,un}offload from
the bridge module into the switchdev module.
- makes everybody that is subscribed to the switchdev blocking notifier
chain "hear" offload & unoffload events
- makes the bridge driver subscribe and handle those events
- moves the bridge driver's handling of those events into 2 new
functions called br_switchdev_port_{,un}offload. These functions
contain in fact the core of the logic that was previously in
switchdev_bridge_port_{,un}offload, just that now we go through an
extra indirection layer to reach them.
Unlike all the other switchdev notification structures, the structure
used to carry the bridge port information, struct
switchdev_notifier_brport_info, does not contain a "bool handled".
This is because in the current usage pattern, we always know that a
switchdev bridge port offloading event will be handled by the bridge,
because the switchdev_bridge_port_offload() call was initiated by a
NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a
bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event
couldn't have happened.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2021-08-04
1) Fix a sysbot reported memory leak in xfrm_user_rcv_msg.
From Pavel Skripkin.
2) Revert "xfrm: policy: Read seqcount outside of rcu-read side
in xfrm_policy_lookup_bytype". This commit tried to fix a
lockin bug, but only cured some of the symptoms. A proper
fix is applied on top of this revert.
3) Fix a locking bug on xfrm state hash resize. A recent change
on sequence counters accidentally repaced a spinlock by a mutex.
Fix from Frederic Weisbecker.
4) Fix possible user-memory-access in xfrm_user_rcv_msg_compat().
From Dmitry Safonov.
5) Add initialiation sefltest fot xfrm_spdattr_type_t.
From Dmitry Safonov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass extack arg to validate_linkmsg and validate_link_af callbacks.
If a netlink attribute has a reject_message, use the extended ack
mechanism to carry the message back to user space.
Signed-off-by: Rocco Yue <rocco.yue@mediatek.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds OOB support for AF_UNIX sockets.
The semantics is same as TCP.
The last byte of a message with the OOB flag is
treated as the OOB byte. The byte is separated into
a skb and a pointer to the skb is stored in unix_sock.
The pointer is used to enforce OOB semantics.
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Code was checking if random_addr and hdev->rpa match without first
checking if the RPA has not been set (BDADDR_ANY), furthermore it was
clearing HCI_RPA_EXPIRED before the command completes and the RPA is
actually programmed which in case of failure would leave the expired
RPA still set.
Since advertising instance have a similar problem the clearing of
HCI_RPA_EXPIRED has been moved to hci_event.c after checking the random
address is in fact the hdev->rap and then proceed to set the expire
timeout.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This adds a field to track if advertising instances are enabled or not
and only clear HCI_LE_ADV flag if there is no instance left advertising.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The driver was merged in 1999 and has only ever seen treewide cleanups
since then, with no indication whatsoever that anyone has actually
had access to hardware for testing the patches.
>From the information in the link below, it appears that the hardware
is for some leased line system in Russia that has since been
discontinued, and useless without any remote end to connect to.
As the driver still feels like a Linux-2.2 era artifact today, it
appears that the best way forward is to just delete it.
Link: https://www.tms.ru/%D0%90%D0%B4%D0%B0%D0%BF%D1%82%D0%B5%D1%80_%D0%B4%D0%BB%D1%8F_%D0%B2%D1%8B%D0%B4%D0%B5%D0%BB%D0%B5%D0%BD%D0%BD%D1%8B%D1%85_%D0%BB%D0%B8%D0%BD%D0%B8%D0%B9_Granch_SBNI12-10
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The block I/O code for the new X-Surf 100 ax88796 driver needs
ax_NS8390_init() for error fixup in its block_output function.
Export this static function through the ax_NS8390_reinit()
wrapper so we can lose the lib8380.c include in the X-Surf 100
driver.
[arnd: add the declaration in the header to avoid a
-Wmissing-prototypes warning]
Fixes: 861928f4e6 ("net-next: New ax88796 platform
driver for Amiga X-Surf 100 Zorro board (m68k)")
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are six m68k specific drivers that use the legacy probe method
in drivers/net/Space.c. However, all of these only support a single
device, and they completely ignore the command line settings from
netdev_boot_setup_check, so there is really no point at all.
Aside from sun3_82586, these already have a module_init function that
can be used for built-in mode as well, simply by removing the #ifdef.
Note that the 82596 driver was previously used on ISA as well, but
that got dropped long ago.
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This driver never relies on the netdev_boot_setup_check()
to get its configuration, so it can just as well do its
own probing all the time.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch fixing the returned value of ip6_skb_dst_mtu (int -> unsigned
int) was rebased between its initial review and the version applied. In
the meantime fade56410c was applied, which added a new variable (int)
used as the returned value. This lead to a mismatch between the function
prototype and the variable used as the return value.
Fixes: 40fc3054b4 ("net: ipv6: fix return value of ip6_skb_dst_mtu")
Cc: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide missing kdoc of fields of struct tcf_pkt_info and tcf_ematch_ops.
Found using ./scripts/kernel-doc -none -Werror include/net/pkt_cls.h
Signed-off-by: Bijie Xu <bijie.xu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct mismatch between the name of flow_offload_has_one_action()
and its kdoc entry.
Found using ./scripts/kernel-doc -Werror -none include/net/flow_offload.h
Signed-off-by: Bijie Xu <bijie.xu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option lacp_active, which is similar with team's runner.active.
This option specifies whether to send LACPDU frames periodically. If set
on, the LACPDU frames are sent along with the configured lacp_rate
setting. If set off, the LACPDU frames acts as "speak when spoken to".
Note, the LACPDU state frames still will be sent when init or unbind port.
v2: remove module parameter
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nci_request() receives a callback function and unsigned long data
argument "opt" which is passed to the callback. Almost all of the
nci_request() callers pass pointer to a stack variable as data argument.
Only few pass scalar value (e.g. u8).
All such callbacks do not modify passed data argument and in previous
commit they were made as const. However passing pointers via unsigned
long removes the const annotation. The callback could simply cast
unsigned long to a pointer to writeable memory.
Use "const void *" as type of this "opt" argument to solve this and
prevent modifying the pointed contents. This is also consistent with
generic pattern of passing data arguments - via "void *". In few places
which pass scalar values, use casts via "unsigned long" to suppress any
warnings.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC action ->init() API has 10 parameters, it becomes harder
to read. Some of them are just boolean and can be replaced
by flags. Similarly for the internal API tcf_action_init()
and tcf_exts_validate().
This patch converts them to flags and fold them into
the upper 16 bits of "flags", whose lower 16 bits are still
reserved for user-space. More specifically, the following
kernel flags are introduced:
TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to
distinguish whether it is compatible with policer.
TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
this action is bound to a filter.
TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts,
means we are replacing an existing action.
TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
opposite meaning, because we still hold RTNL in most
cases.
The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
untouched and still stored as before.
I have tested this patch with tdc and I do not see any
failure related to this patch.
Tested-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
clusterip is now handled via net_generic.
NOTRACK is tiny compared to rest of xt_CT feature set, even the existing
deprecation warning is bigger than the actual functionality.
Just remove the warning, its not worth keeping/adding a net_generic one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Andrii Nakryiko says:
====================
bpf-next 2021-07-30
We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).
The main changes are:
1) BTF-guided binary data dumping libbpf API, from Alan.
2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.
3) Ambient BPF run context and cgroup storage cleanup, from Andrii.
4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.
5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.
6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.
7) BPF map pinning improvements in libbpf, from Martynas.
8) Improved module BTF support in libbpf and bpftool, from Quentin.
9) Bpftool cleanups and documentation improvements, from Quentin.
10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.
11) Increased maximum cgroup storage size, from Stanislav.
12) Small fixes and improvements to BPF tests and samples, from various folks.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
tools: bpftool: Document and add bash completion for -L, -B options
selftests/bpf: Update bpftool's consistency script for checking options
tools: bpftool: Update and synchronise option list in doc and help msg
tools: bpftool: Complete and synchronise attach or map types
selftests/bpf: Check consistency between bpftool source, doc, completion
tools: bpftool: Slightly ease bash completion updates
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
tools: bpftool: Support dumping split BTF by id
libbpf: Add split BTF support for btf__load_from_kernel_by_id()
tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
tools: Free BTF objects at various locations
libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
libbpf: Rename btf__load() as btf__load_into_kernel()
libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
bpf: Increase supported cgroup storage value size
libbpf: Fix race when pinning maps in parallel
...
====================
Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need in extra call indirection and check from impossible
flow where someone tries to set namespace without prior call
to devlink_alloc().
Instead of this extra logic and additional EXPORT_SYMBOL, use specialized
devlink allocation function that receives net namespace as an argument.
Such specialized API allows clear view when devlink initialized in wrong
net namespace and/or kernel users don't try to change devlink namespace
under the hood.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Several functions receive pointers to u8, sk_buff or other structs but
do not modify the contents so make them const. This allows doing the
same for local variables and in total makes the code a little bit safer.
This makes const also data passed as "unsigned long opt" argument to
nci_request() function. Usual flow for such functions is:
1. Receive "u8 *" and store it (the pointer) in a structure
allocated on stack (e.g. struct nci_set_config_param),
2. Call nci_request() or __nci_request() passing a callback function an
the pointer to the structure via an "unsigned long opt",
3. nci_request() calls the callback which dereferences "unsigned long
opt" in a read-only way.
This converts all above paths to use proper pointer to const data, so
entire flow is safer.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Several functions receive pointers to u8, char or sk_buff but do not
modify the contents so make them const. This allows doing the same for
local variables and in total makes the code a little bit safer.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
refcount_t type should be used instead of int when fib_treeref is used as
a reference counter,and avoid use-after-free risks.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210729071350.28919-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain
id' in the skb extension. However, userspace programs (like ovs) are able
to setup egress rules, and datapath gets confused in case it doesn't find
the 'chain id' for a packet that's "recirculated" by tc.
Change tcf_classify() to have the same semantic as tcf_classify_ingress()
so that a single function can be called in ingress / egress, using the tc
ingress / egress block respectively.
Suggested-by: Alaa Hleilel <alaa@nvidia.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we have a compile-time default network
(MCTP_INITIAL_DEFAULT_NET). This change introduces a default_net field
on the net namespace, allowing future configuration for new interfaces.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change implements MCTP fragmentation (based on route & device MTU),
and corresponding reassembly.
The MCTP specification only allows for fragmentation on the originating
message endpoint, and reassembly on the destination endpoint -
intermediate nodes do not need to reassemble/refragment. Consequently,
we only fragment in the local transmit path, and reassemble
locally-bound packets. Messages are required to be in-order, so we
simply cancel reassembly on out-of-order or missing packets.
In the fragmentation path, we just break up the message into MTU-sized
fragments; the skb structure is a simple copy for now, which we can later
improve with a shared data implementation.
For reassembly, we keep track of incoming message fragments using the
existing tag infrastructure, allocating a key on the (src,dest,tag)
tuple, and reassembles matching fragments into a skb->frag_list.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Start filling-out the socket syscalls: bind, sendmsg & recvmsg.
This requires an input route implementation, so we add to
mctp_route_input, allowing lookups on binds & message tags. This just
handles single-packet messages at present, we will add fragmentation in
a future change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an initial neighbour table implementation, to be used in the route
output path.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds RTM_GETROUTE, RTM_NEWROUTE & RTM_DELROUTE handlers,
allowing management of the MCTP route table.
Includes changes from Jeremy Kerr <jk@codeconstruct.com.au>.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a simple routing table, and a couple of route output handlers, and
the mctp packet_type & handler.
Includes changes from Matt Johnston <matt@codeconstruct.com.au>.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds the infrastructure for managing MCTP netdevices; we add
a pointer to the AF_MCTP-specific data to struct netdevice, and hook up
the rtnetlink operations for adding and removing addresses.
Includes changes from Matt Johnston <matt@codeconstruct.com.au>.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple packet header format as defined by DMTF DSP0236.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct nfc_dev is not modified by nfc_get_drvdata() and
nfc_device_name() so it can be made a const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change leverages the infrastructure introduced by the previous
patches to allow soft devices passing to the GRO engine owned skbs
without impacting the fast-path.
It's up to the GRO caller ensuring the slow_gro bit validity before
invoking the GRO engine. The new helper skb_prepare_for_gro() is
introduced for that goal.
On slow_gro, skbs are aggregated only with equal sk.
Additionally, skb truesize on GRO recycle and free is correctly
updated so that sk wmem is not changed by the GRO processing.
rfc-> v1:
- fixed bad truesize on dev_gro_receive NAPI_FREE
- use the existing state bit
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the previous patch, but covering the dst field:
the slow_gro flag is additionally set when a dst is attached
to the skb
RFC -> v1:
- use the existing flag instead of adding a new one
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot is hitting might_sleep() warning at hci_sock_dev_event()
due to calling lock_sock() with rw spinlock held [1].
It seems that history of this locking problem is a trial and error.
Commit b40df5743e ("[PATCH] bluetooth: fix socket locking in
hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to lock_sock()
as an attempt to fix lockdep warning.
Then, commit 4ce61d1c7a ("[BLUETOOTH]: Fix locking in
hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to
local_bh_disable() + bh_lock_sock_nested() as an attempt to fix
sleep in atomic context warning.
Then, commit 4b5dd696f8 ("Bluetooth: Remove local_bh_disable() from
hci_sock.c") in 3.3-rc1 removed local_bh_disable().
Then, commit e305509e67 ("Bluetooth: use correct lock to prevent UAF
of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to
lock_sock() as an attempt to fix CVE-2021-3573.
This difficulty comes from current implementation that
hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all
references from sockets because hci_unregister_dev() immediately reclaims
resources as soon as returning from hci_sock_dev_event(HCI_DEV_UNREG).
But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not
doing what it should do.
Therefore, instead of trying to detach sockets from device, let's accept
not detaching sockets from device at hci_sock_dev_event(HCI_DEV_UNREG),
by moving actual cleanup of resources from hci_unregister_dev() to
hci_release_dev() which is called by bt_host_release when all references
to this unregistered device (which is a kobject) are gone.
Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1]
Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com>
Fixes: e305509e67 ("Bluetooth: use correct lock to prevent UAF of hdev object")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Both registered flag and devlink pointer are set at the same time
and indicate the same thing - devlink/devlink_port are ready. Instead
of checking ->registered use devlink pointer as an indication.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users of ndo_do_ioctl are ethernet drivers that implement
the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
Separate these from the few drivers that use ndo_do_ioctl to
implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
This is a purely cosmetic change intended to help readers find
their way through the implementation.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The various ipv4 and ipv6 tunnel drivers each implement a set
of 12 SIOCDEVPRIVATE commands for managing tunnels. These
all work correctly in compat mode.
Move them over to the new .ndo_siocdevprivate operation.
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported skb_over_panic() in llc_pdu_init_as_xid_cmd(). The
problem was in wrong LCC header manipulations.
Syzbot's reproducer tries to send XID packet. llc_ui_sendmsg() is
doing following steps:
1. skb allocation with size = len + header size
len is passed from userpace and header size
is 3 since addr->sllc_xid is set.
2. skb_reserve() for header_len = 3
3. filling all other space with memcpy_from_msg()
Ok, at this moment we have fully loaded skb, only headers needs to be
filled.
Then code comes to llc_sap_action_send_xid_c(). This function pushes 3
bytes for LLC PDU header and initializes it. Then comes
llc_pdu_init_as_xid_cmd(). It initalizes next 3 bytes *AFTER* LLC PDU
header and call skb_push(skb, 3). This looks wrong for 2 reasons:
1. Bytes rigth after LLC header are user data, so this function
was overwriting payload.
2. skb_push(skb, 3) call can cause skb_over_panic() since
all free space was filled in llc_ui_sendmsg(). (This can
happen is user passed 686 len: 686 + 14 (eth header) + 3 (LLC
header) = 703. SKB_DATA_ALIGN(703) = 704)
So, in this patch I added 2 new private constansts: LLC_PDU_TYPE_U_XID
and LLC_PDU_LEN_U_XID. LLC_PDU_LEN_U_XID is used to correctly reserve
header size to handle LLC + XID case. LLC_PDU_TYPE_U_XID is used by
llc_pdu_header_init() function to push 6 bytes instead of 3. And finally
I removed skb_push() call from llc_pdu_init_as_xid_cmd().
This changes should not affect other parts of LLC, since after
all steps we just transmit buffer.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+5e5a981ad7cc54c4b2b4@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the case when nlh is NULL in nlmsg_report(),
so that the caller doesn't need to deal with this case.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit cc1939e4b3.
Currently 2 classes of DSA drivers are able to send/receive packets
directly through the DSA master:
- drivers with DSA_TAG_PROTO_NONE
- sja1105
Now that sja1105 has gained the ability to perform traffic termination
even under the tricky case (VLAN-aware bridge), and that is much more
functional (we can perform VLAN-aware bridging with foreign interfaces),
there is no reason to keep this code in the receive path of the network
core. So delete it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to introduce last_rtx_chunks into sctp_transport to detect
if there's any packet retransmission/loss happened by checking against
asoc's rtx_data_chunks in sctp_transport_pl_send().
If there is, namely, transport->last_rtx_chunks != asoc->rtx_data_chunks,
the pmtu probe will be sent out. Otherwise, increment the pl.raise_count
and return when it's in Search Complete state.
With this patch, if in Search Complete state, which is a long period, it
doesn't need to keep probing the current pmtu unless there's data packet
loss. This will save quite some traffic.
v1->v2:
- add the missing Fixes tag.
Fixes: 0dac127c05 ("sctp: do black hole detection in search complete state")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does 3 things:
- make sctp_transport_pl_send() and sctp_transport_pl_recv()
return bool type to decide if more probe is needed to send.
- pr_debug() only when probe is really needed to send.
- count pl.raise_count in sctp_transport_pl_send() instead of
sctp_transport_pl_recv(), and it's only incremented for the
1st probe for the same size.
These are preparations for the next patch to make probes happen
only when there's packet loss in Search Complete state.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neither the core nor the drivers modify the passed pointer to struct
nfc_digital_ops, so make it a pointer to const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neither the core nor the drivers modify the passed pointer to struct
nfc_hci_ops, so make it a pointer to const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neither the core nor the drivers modify the passed pointer to struct
nfc_ops, so make it a pointer to const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neither the core nor the drivers modify the passed pointer to struct
nfc_vendor_cmd, so make it a pointer to const for correctness and
safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neither the core nor the drivers modify the passed pointer to struct
nci_driver_ops (consisting of function pointers), so make it a pointer
to const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct nci_ops is modified by NFC core in only one case:
nci_allocate_device() receives too many proprietary commands (prop_ops)
to configure. This is a build time known constrain, so a graceful
handling of such case is not necessary.
Instead, fail the nci_allocate_device() and add BUILD_BUG_ON() to places
which set these.
This allows to constify the struct nci_ops (consisting of function
pointers) for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nci_send_cmd() payload argument is passed directly to skb_put_data()
which already accepts a pointer to const, so make it const as well for
correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the tcp seq_file iteration on listeners
from the port only listening_hash to the port+addr lhash2.
When iterating from the bpf iter, the next patch will need to
lock the socket such that the bpf iter can call setsockopt (e.g. to
change the TCP_CONGESTION). To avoid locking the bucket and then locking
the sock, the bpf iter will first batch some sockets from the same bucket
and then unlock the bucket. If the bucket size is small (which
usually is), it is easier to batch the whole bucket such that it is less
likely to miss a setsockopt on a socket due to changes in the bucket.
However, the port only listening_hash could have many listeners
hashed to a bucket (e.g. many individual VIP(s):443 and also
multiple by the number of SO_REUSEPORT). We have seen bucket size in
tens of thousands range. Also, the chance of having changes
in some popular port buckets (e.g. 443) is also high.
The port+addr lhash2 was introduced to solve this large listener bucket
issue. Also, the listening_hash usage has already been replaced with
lhash2 in the fast path inet[6]_lookup_listener(). This patch follows
the same direction on moving to lhash2 and iterates the lhash2
instead of listening_hash.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200606.1035783-1-kafai@fb.com
A following patch will create a separate struct to store extra
bpf_iter state and it will embed the existing tcp_iter_state like this:
struct bpf_tcp_iter_state {
struct tcp_iter_state state;
/* More bpf_iter specific states here ... */
}
As a prep work, this patch removes the
"struct tcp_seq_afinfo *bpf_seq_afinfo" where its purpose is
to tell if it is iterating from bpf_iter instead of proc fs.
Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from
bpf_iter. The kernel should not filter by the addr family and
leave this filtering decision to the bpf prog.
Instead of adding a "*bpf_seq_afinfo" pointer, this patch uses the
"seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating
from the bpf iter.
The bpf_iter_(init|fini)_tcp() is left here to prepare for
the change of a following patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200554.1034982-1-kafai@fb.com
For a DSA switch, to offload the forwarding process of a bridge device
means to send the packets coming from the software bridge as data plane
packets. This is contrary to everything that DSA has done so far,
because the current taggers only know to send control packets (ones that
target a specific destination port), whereas data plane packets are
supposed to be forwarded according to the FDB lookup, much like packets
ingressing on any regular ingress port. If the FDB lookup process
returns multiple destination ports (flooding, multicast), then
replication is also handled by the switch hardware - the bridge only
sends a single packet and avoids the skb_clone().
DSA keeps for each bridge port a zero-based index (the number of the
bridge). Multiple ports performing TX forwarding offload to the same
bridge have the same dp->bridge_num value, and ports not offloading the
TX data plane of a bridge have dp->bridge_num = -1.
The tagger can check if the packet that is being transmitted on has
skb->offload_fwd_mark = true or not. If it does, it can be sure that the
packet belongs to the data plane of a bridge, further information about
which can be obtained based on dp->bridge_dev and dp->bridge_num.
It can then compose a DSA tag for injecting a data plane packet into
that bridge number.
For the switch driver side, we offer two new dsa_switch_ops methods,
called .port_bridge_fwd_offload_{add,del}, which are modeled after
.port_bridge_{join,leave}.
These methods are provided in case the driver needs to configure the
hardware to treat packets coming from that bridge software interface as
data plane packets. The switchdev <-> bridge interaction happens during
the netdev_master_upper_dev_link() call, so to switch drivers, the
effect is that the .port_bridge_fwd_offload_add() method is called
immediately after .port_bridge_join().
If the bridge number exceeds the number of bridges for which the switch
driver can offload the TX data plane (and this includes the case where
the driver can offload none), DSA falls back to simply returning
tx_fwd_offload = false in the switchdev_bridge_port_offload() call.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation of supporting data plane forwarding on behalf of a
software bridge, some drivers might need to view bridges as virtual
switches behind the CPU port in a cross-chip topology.
Give them some help and let them know how many physical switches there
are in the tree, so that they can count the virtual switches starting
from that number on.
Note that the first dsa_switch_ops method where this information is
reliably available is .setup(). This is because of how DSA works:
in a tree with 3 switches, each calling dsa_register_switch(), the first
2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table()
and exit with error code 0 because the topology is not complete. Since
probing is parallel at this point, one switch does not know about the
existence of the other. Then the third switch comes, and for it,
dsa_tree_setup_routing_table() returns complete = true. This switch goes
ahead and calls dsa_tree_setup_switches() for everybody else, calling
their .setup() methods too. This acts as the synchronization point.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate IPv4 MTU code the same way it is done in IPv6 to have code
aligned in both address families
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and
reuse this code in ip6_mtu. Actually these two functions were
almost duplicates, this change will simplify the maintaince of
mtu calculation code.
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the IOAM inline insertion (only for the host-to-host use case)
which is per-route configured with lightweight tunnels. The target is iproute2
and the patch is ready. It will be posted as soon as this patchset is merged.
Here is an overview:
$ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0
This example configures an IOAM Pre-allocated Trace option attached to the
fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated
trace data block is 12 octets (size) and only the first IOAM data (bit 0:
hop_limit + node id) is included in the trace (type) represented as a bitfield.
The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not
implemented is explained on the patchset cover.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for processing the IOAM Pre-allocated Trace with IPv6,
see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3].
A new per-interface sysctl is introduced. The value is a boolean to accept (=1)
or ignore (=0, by default) IPv6 IOAM options on ingress for an interface:
- net.ipv6.conf.XXX.ioam6_enabled
Two other sysctls are introduced to define IOAM IDs, represented by an integer.
They are respectively per-namespace and per-interface:
- net.ipv6.ioam6_id
- net.ipv6.conf.XXX.ioam6_id
The value of the first one represents the IOAM ID of the node itself (u32; max
and default value = U32_MAX>>8, due to hop limit concatenation) while the other
represents the IOAM ID of an interface (u16; max and default value = U16_MAX).
Each "ioam6_id" sysctl has a "_wide" equivalent:
- net.ipv6.ioam6_id_wide
- net.ipv6.conf.XXX.ioam6_id_wide
The value of the first one represents the wide IOAM ID of the node itself (u64;
max and default value = U64_MAX>>8, due to hop limit concatenation) while the
other represents the wide IOAM ID of an interface (u32; max and default value
= U32_MAX).
The use of short and wide equivalents is not exclusive, a deployment could
choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format)
could be an identifier for a physical interface, whereas
net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a
logical sub-interface. Documentation about new sysctls is provided at the end
of this patchset.
Two relativistic hash tables are used: one for IOAM namespaces, the other for
IOAM schemas. A namespace can only have a single active schema and a schema
can only be attached to a single namespace (1:1 relationship).
[1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
[2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
[3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the semicolon at the end, the compiler sees the shim function as a
declaration and not as a definition, and warns:
'switchdev_handle_fdb_del_to_device' declared 'static' but never defined
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8ca07176ab ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the default we assume the traffic to pass, if we have no
matching IPsec policy. With this patch, we have a possibility to
change this default from allow to block. It can be configured
via netlink. Each direction (input/output/forward) can be
configured separately. With the default to block configuered,
we need allow policies for all packet flows we accept.
We do not use default policy lookup for the loopback device.
v1->v2
- fix compiling when XFRM is disabled
- Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: Christian Langrock <christian.langrock@secunet.com>
Signed-off-by: Christian Langrock <christian.langrock@secunet.com>
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Currently DSA has an issue with FDB entries pointing towards the bridge
in the presence of br_fdb_replay() being called at port join and leave
time.
In particular, each bridge port will ask for a replay for the FDB
entries pointing towards the bridge when it joins, and for another
replay when it leaves.
This means that for example, a bridge with 4 switch ports will notify
DSA 4 times of the bridge MAC address.
But if the MAC address of the bridge changes during the normal runtime
of the system, the bridge notifies switchdev [ once ] of the deletion of
the old MAC address as a local FDB towards the bridge, and of the
insertion [ again once ] of the new MAC address as a local FDB.
This is a problem, because DSA keeps the old MAC address as a host FDB
entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the
old MAC address will not be deleted. Additionally, the new MAC address
will only be installed with refcount 1, and when the first switch port
leaves the bridge (leaving 3 others as still members), it will delete
with it the new MAC address of the bridge from the local FDB entries
kept by DSA (because the br_fdb_replay call on deletion will bring the
entry's refcount from 1 to 0).
So the problem, really, is that the number of br_fdb_replay() calls is
not matched with the refcount that a host FDB is offloaded to DSA during
normal runtime.
An elegant way to solve the problem would be to make the switchdev
notification emitted by br_fdb_change_mac_address() result in a host FDB
kept by DSA which has a refcount exactly equal to the number of ports
under that bridge. Then, no matter how many DSA ports join or leave that
bridge, the host FDB entry will always be deleted when there are exactly
zero remaining DSA switch ports members of the bridge.
To implement the proposed solution, we remember that the switchdev
objects and port attributes have some helpers provided by switchdev,
which can be optionally called by drivers:
switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set.
These helpers:
- fan out a switchdev object/attribute emitted for the bridge towards
all the lower interfaces that pass the check_cb().
- fan out a switchdev object/attribute emitted for a bridge port that is
a LAG towards all the lower interfaces that pass the check_cb().
In other words, this is the model we need for the FDB events too:
something that will keep an FDB entry emitted towards a physical port as
it is, but translate an FDB entry emitted towards the bridge into N FDB
entries, one per physical port.
Of course, there are many differences between fanning out a switchdev
object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB
entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards
a LAG should be treated specially, because FDB entries are unicast, we
can't just install the same address towards 3 destinations. It is
imaginable that drivers might want to treat this case specifically, so
create some methods for this case and do not recurse into the LAG lower
ports, just the bridge ports.
DSA also listens for FDB entries on "foreign" interfaces, aka interfaces
bridged with us which are not part of our hardware domain: think an
Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA
installs host FDB entries. However, there we have the same problem
(those host FDB entries are installed with a refcount of only 1) and an
even bigger one which we did not have with FDB entries towards the
bridge:
br_fdb_replay() is currently not called for FDB entries on foreign
interfaces, just for the physical port and for the bridge itself.
So when DSA sniffs an address learned by the software bridge towards a
foreign interface like an e1000 port, and then that e1000 leaves the
bridge, DSA remains with the dangling host FDB address. That will be
fixed separately by replaying all FDB entries and not just the ones
towards the port and the bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is a bit difficult to understand what DSA checks when it tries to
avoid installing dynamically learned addresses on foreign interfaces as
local host addresses, so create a generic switchdev helper that can be
reused and is generally more readable.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make tag_8021q a more central element of DSA and move the 2 driver
specific operations outside of struct dsa_8021q_context (which is
supposed to hold dynamic data and not really constant function
pointers).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The basic problem description is as follows:
Be there 3 switches in a daisy chain topology:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
The CPU will not be able to ping through the user ports of the
bottom-most switch (like for example sw2p0), simply because tag_8021q
was not coded up for this scenario - it has always assumed DSA switch
trees with a single switch.
To add support for the topology above, we must admit that the RX VLAN of
sw2p0 must be added on some ports of switches 0 and 1 as well. This is
in fact a textbook example of thing that can use the cross-chip notifier
framework that DSA has set up in switch.c.
There is only one problem: core DSA (switch.c) is not able right now to
make the connection between a struct dsa_switch *ds and a struct
dsa_8021q_context *ctx. Right now, it is drivers who call into
tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
methods.
But with cross-chip notifiers, it is possible for tag_8021q to call
drivers without drivers having ever asked for anything. A good example
is right above: when sw2p0 wants to set itself up for tag_8021q,
the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
so that they transport sw2p0's VLANs towards the CPU without dropping
them.
So instead of letting drivers manage the tag_8021q context, add a
tag_8021q_ctx pointer inside of struct dsa_switch, which will be
populated when dsa_tag_8021q_register() returns success.
The patch is fairly long-winded because we are partly reverting commit
5899ee367a ("net: dsa: tag_8021q: add a context structure") which made
the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we
can access "ctx" directly from "ds", this is no longer needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After cited commit, sysctl_tcp_fastopen_blackhole_timeout is no longer
a global variable.
Fixes: 3733be14a3 ("ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Link: https://lore.kernel.org/r/20210719092028.3016745-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-07-15
The following pull-request contains BPF updates for your *net-next* tree.
We've added 45 non-merge commits during the last 15 day(s) which contain
a total of 52 files changed, 3122 insertions(+), 384 deletions(-).
The main changes are:
1) Introduce bpf timers, from Alexei.
2) Add sockmap support for unix datagram socket, from Cong.
3) Fix potential memleak and UAF in the verifier, from He.
4) Add bpf_get_func_ip helper, from Jiri.
5) Improvements to generic XDP mode, from Kumar.
6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi.
===================
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to implement unix_dgram_bpf_recvmsg() to replace the
original ->recvmsg() to retrieve skmsg from ingress_msg.
AF_UNIX is again special here because the lack of
sk_prot->recvmsg(). I simply add a special case inside
unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
Current release - regressions:
- sock: fix parameter order in sock_setsockopt()
Current release - new code bugs:
- netfilter: nft_last:
- fix incorrect arithmetic when restoring last used
- honor NFTA_LAST_SET on restoration
Previous releases - regressions:
- udp: properly flush normal packet at GRO time
- sfc: ensure correct number of XDP queues; don't allow enabling the
feature if there isn't sufficient resources to Tx from any CPU
- dsa: sja1105: fix address learning getting disabled on the CPU port
- mptcp: addresses a rmem accounting issue that could keep packets
in subflow receive buffers longer than necessary, delaying
MPTCP-level ACKs
- ip_tunnel: fix mtu calculation for ETHER tunnel devices
- do not reuse skbs allocated from skbuff_fclone_cache in the napi
skb cache, we'd try to return them to the wrong slab cache
- tcp: consistently disable header prediction for mptcp
Previous releases - always broken:
- bpf: fix subprog poke descriptor tracking use-after-free
- ipv6:
- allocate enough headroom in ip6_finish_output2() in case
iptables TEE is used
- tcp: drop silly ICMPv6 packet too big messages to avoid
expensive and pointless lookups (which may serve as a DDOS
vector)
- make sure fwmark is copied in SYNACK packets
- fix 'disable_policy' for forwarded packets (align with IPv4)
- netfilter: conntrack: do not renew entry stuck in tcp SYN_SENT state
- netfilter: conntrack: do not mark RST in the reply direction coming
after SYN packet for an out-of-sync entry
- mptcp: cleanly handle error conditions with MP_JOIN and syncookies
- mptcp: fix double free when rejecting a join due to port mismatch
- validate lwtstate->data before returning from skb_tunnel_info()
- tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
- mt76: mt7921: continue to probe driver when fw already downloaded
- bonding: fix multiple issues with offloading IPsec to (thru?) bond
- stmmac: ptp: fix issues around Qbv support and setting time back
- bcmgenet: always clear wake-up based on energy detection
Misc:
- sctp: move 198 addresses from unusable to private scope
- ptp: support virtual clocks and timestamping
- openvswitch: optimize operation for key comparison
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDu3mMACgkQMUZtbf5S
Irsjxg//UwcPJMYFmXV+fGkEsWYe1Kf29FcUDEeANFtbltfAcIfZ0GoTbSDRnrVb
HcYAKcm4XRx5bWWdQrQsQq/yiLbnS/rSLc7VRB+uRHWRKl3eYcaUB2rnCXsxrjGw
wQJgOmztDCJS4BIky24iQpF/8lg7p/Gj2Ih532gh93XiYo612FrEJKkYb2/OQfYX
GkbnZ0kL2Y1SV+bhy6aT5azvhHKM4/3eA4fHeJ2p8e2gOZ5ni0vpX0xEzdzKOCd0
vwR/Wu3h/+2QuFYVcSsVguuM++JXACG8MAS/Tof78dtNM4a3kQxzqeh5Bv6IkfTu
rokENLq4pjNRy+nBAOeQZj8Jd0K0kkf/PN9WMdGQtplMoFhjjV25R6PeRrV9wwPo
peozIz2MuQo7Kfof1D+44h2foyLfdC28/Z0CvRbDpr5EHOfYynvBbrnhzIGdQp6V
xgftKTOdgz2Djgg8HiblZund1FA44OYerddVAASrIsnSFnIz1VLVQIsfV+GLBwwc
FawrIZ6WfIjzRSrDGOvDsbAQI47T/1jbaPJeK6XgjWkQmjEd6UtRWRZLYCxemQEw
4HP3sWC96BOehuD8ylipVE1oFqrxCiOB/fZxezXqjo8dSX3NLdak4cCHTHoW5SuZ
eEAxQRaBliKd+P7hoy9cZ57CAu3zUa8kijfM5QRlCAHF+zSxaPs=
=QFnb
-----END PGP SIGNATURE-----
Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski.
"Including fixes from bpf and netfilter.
Current release - regressions:
- sock: fix parameter order in sock_setsockopt()
Current release - new code bugs:
- netfilter: nft_last:
- fix incorrect arithmetic when restoring last used
- honor NFTA_LAST_SET on restoration
Previous releases - regressions:
- udp: properly flush normal packet at GRO time
- sfc: ensure correct number of XDP queues; don't allow enabling the
feature if there isn't sufficient resources to Tx from any CPU
- dsa: sja1105: fix address learning getting disabled on the CPU port
- mptcp: addresses a rmem accounting issue that could keep packets in
subflow receive buffers longer than necessary, delaying MPTCP-level
ACKs
- ip_tunnel: fix mtu calculation for ETHER tunnel devices
- do not reuse skbs allocated from skbuff_fclone_cache in the napi
skb cache, we'd try to return them to the wrong slab cache
- tcp: consistently disable header prediction for mptcp
Previous releases - always broken:
- bpf: fix subprog poke descriptor tracking use-after-free
- ipv6:
- allocate enough headroom in ip6_finish_output2() in case
iptables TEE is used
- tcp: drop silly ICMPv6 packet too big messages to avoid
expensive and pointless lookups (which may serve as a DDOS
vector)
- make sure fwmark is copied in SYNACK packets
- fix 'disable_policy' for forwarded packets (align with IPv4)
- netfilter: conntrack:
- do not renew entry stuck in tcp SYN_SENT state
- do not mark RST in the reply direction coming after SYN packet
for an out-of-sync entry
- mptcp: cleanly handle error conditions with MP_JOIN and syncookies
- mptcp: fix double free when rejecting a join due to port mismatch
- validate lwtstate->data before returning from skb_tunnel_info()
- tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
- mt76: mt7921: continue to probe driver when fw already downloaded
- bonding: fix multiple issues with offloading IPsec to (thru?) bond
- stmmac: ptp: fix issues around Qbv support and setting time back
- bcmgenet: always clear wake-up based on energy detection
Misc:
- sctp: move 198 addresses from unusable to private scope
- ptp: support virtual clocks and timestamping
- openvswitch: optimize operation for key comparison"
* tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
sfc: add logs explaining XDP_TX/REDIRECT is not available
sfc: ensure correct number of XDP queues
sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
net: fddi: fix UAF in fza_probe
net: dsa: sja1105: fix address learning getting disabled on the CPU port
net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
net: Use nlmsg_unicast() instead of netlink_unicast()
octeontx2-pf: Fix uninitialized boolean variable pps
ipv6: allocate enough headroom in ip6_finish_output2()
net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
net: bridge: multicast: fix MRD advertisement router port marking race
net: bridge: multicast: fix PIM hello router port marking race
net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
dsa: fix for_each_child.cocci warnings
virtio_net: check virtqueue_add_sgs() return value
mptcp: properly account bulk freed memory
selftests: mptcp: fix case multiple subflows limited by server
mptcp: avoid processing packet if a subflow reset
mptcp: fix syncookie process if mptcp can not_accept new subflow
...
If check_fully_established() causes a subflow reset, it should not
continue to process the packet in tcp_data_queue().
Add a return value to mptcp_incoming_options(), and return false if a
subflow has been reset, else return true. Then drop the packet in
tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options()
return false.
Fixes: d582484726 ("mptcp: fix fallback for MP_JOIN subflows")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_tunnel_info() returns pointer of lwtstate->data as ip_tunnel_info
type without validation. lwtstate->data can have various types such as
mpls_iptunnel_encap, etc and these are not compatible.
So skb_tunnel_info() should validate before returning that pointer.
Splat looks like:
BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan]
Read of size 2 at addr ffff888106ec2698 by task ping/811
CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195
Call Trace:
dump_stack_lvl+0x56/0x7b
print_address_description.constprop.8.cold.13+0x13/0x2ee
? vxlan_get_route+0x418/0x4b0 [vxlan]
? vxlan_get_route+0x418/0x4b0 [vxlan]
kasan_report.cold.14+0x83/0xdf
? vxlan_get_route+0x418/0x4b0 [vxlan]
vxlan_get_route+0x418/0x4b0 [vxlan]
[ ... ]
vxlan_xmit_one+0x148b/0x32b0 [vxlan]
[ ... ]
vxlan_xmit+0x25c5/0x4780 [vxlan]
[ ... ]
dev_hard_start_xmit+0x1ae/0x6e0
__dev_queue_xmit+0x1f39/0x31a0
[ ... ]
neigh_xmit+0x2f9/0x940
mpls_xmit+0x911/0x1600 [mpls_iptunnel]
lwtunnel_xmit+0x18f/0x450
ip_finish_output2+0x867/0x2040
[ ... ]
Fixes: 61adedf3e3 ("route: move lwtunnel state to dst_entry")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit prepares to use the XDP meta data length check in multiple
places by making it into a static inline function instead of a literal.
Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-2-zeffron@riotgames.com
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Do not refresh timeout in SYN_SENT for syn retransmissions.
Add selftest for unreplied TCP connection, from Florian Westphal.
2) Fix null dereference from error path with hardware offload
in nftables.
3) Remove useless nf_ct_gre_keymap_flush() from netns exit path,
from Vasily Averin.
4) Missing rcu read-lock side in ctnetlink helper info dump,
also from Vasily.
5) Do not mark RST in the reply direction coming after SYN packet
for an out-of-sync entry, from Ali Abdallah and Florian Westphal.
6) Add tcp_ignore_invalid_rst sysctl to allow to disable out of
segment RSTs, from Ali.
7) KCSAN fix for nf_conntrack_all_lock(), from Manfred Spraul.
8) Honor NFTA_LAST_SET in nft_last.
9) Fix incorrect arithmetics when restore last_jiffies in nft_last.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bonding has been supporting ipsec offload.
When SA is added, bonding just passes SA to its own active real interface.
But it doesn't manage SA.
So, when events(add/del real interface, active real interface change, etc)
occur, bonding can't handle that well because It doesn't manage SA.
So some problems(panic, UAF, refcnt leak)occur.
In order to make it stable, it should manage SA.
That's the reason why struct bond_ipsec is added.
When a new SA is added to bonding interface, it is stored in the
bond_ipsec list. And the SA is passed to a current active real interface.
If events occur, it uses bond_ipsec data to handle these events.
bond->ipsec_list is protected by bond->ipsec_lock.
If a current active real interface is changed, the following logic works.
1. delete all SAs from old active real interface
2. Add all SAs to the new active real interface.
3. If a new active real interface doesn't support ipsec offload or SA's
option, it sets real_dev to NULL.
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new sysctl tcp_ignore_invalid_rst to disable marking
out of segments RSTs as INVALID.
Signed-off-by: Ali Abdallah <aabdallah@suse.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull iov_iter updates from Al Viro:
"iov_iter cleanups and fixes.
There are followups, but this is what had sat in -next this cycle. IMO
the macro forest in there became much thinner and easier to follow..."
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
clean up copy_mc_pipe_to_iter()
pipe_zero(): we don't need no stinkin' kmap_atomic()...
iov_iter: clean csum_and_copy_...() primitives up a bit
copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
iterate_xarray(): only of the first iteration we might get offset != 0
pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
iov_iter: make iterator callbacks use base and len instead of iovec
iov_iter: make the amount already copied available to iterator callbacks
iov_iter: get rid of separate bvec and xarray callbacks
iov_iter: teach iterate_{bvec,xarray}() about possible short copies
iterate_bvec(): expand bvec.h macro forest, massage a bit
iov_iter: unify iterate_iovec and iterate_kvec
iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
iterate_and_advance(): get rid of magic in case when n is 0
csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
[xarray] iov_iter_npages(): just use DIV_ROUND_UP()
iov_iter_npages(): don't bother with iterate_all_kinds()
...
Commit 628a5c5618 ("[INET]: Add IP(V6)_PMTUDISC_RPOBE") introduced
ip6_skb_dst_mtu with return value of signed int which is inconsistent
with actually returned values. Also 2 users of this function actually
assign its value to unsigned int variable and only __xfrm6_output
assigns result of this function to signed variable but actually uses
as unsigned in further comparisons and calls. Change this function
to return unsigned int value.
Fixes: 628a5c5618 ("[INET]: Add IP(V6)_PMTUDISC_RPOBE")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm_bydst_resize() calls synchronize_rcu() while holding
hash_resize_mutex. But then on PREEMPT_RT configurations,
xfrm_policy_lookup_bytype() may acquire that mutex while running in an
RCU read side critical section. This results in a deadlock.
In fact the scope of hash_resize_mutex is way beyond the purpose of
xfrm_policy_lookup_bytype() to just fetch a coherent and stable policy
for a given destination/direction, along with other details.
The lower level net->xfrm.xfrm_policy_lock, which among other things
protects per destination/direction references to policy entries, is
enough to serialize and benefit from priority inheritance against the
write side. As a bonus, it makes it officially a per network namespace
synchronization business where a policy table resize on namespace A
shouldn't block a policy lookup on namespace B.
Fixes: 77cc278f7b (xfrm: policy: Use sequence counters with associated lock)
Cc: stable@vger.kernel.org
Cc: Ahmed S. Darwish <a.darwish@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Varad Gautam <varad.gautam@suse.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
nf_ct_gre_keymap_flush() is useless.
It is called from nf_conntrack_cleanup_net_list() only and tries to remove
nf_ct_gre_keymap entries from pernet gre keymap list. Though:
a) at this point the list should already be empty, all its entries were
deleted during the conntracks cleanup, because
nf_conntrack_cleanup_net_list() executes nf_ct_iterate_cleanup(kill_all)
before nf_conntrack_proto_pernet_fini():
nf_conntrack_cleanup_net_list
+- nf_ct_iterate_cleanup
| nf_ct_put
| nf_conntrack_put
| nf_conntrack_destroy
| destroy_conntrack
| destroy_gre_conntrack
| nf_ct_gre_keymap_destroy
`- nf_conntrack_proto_pernet_fini
nf_ct_gre_keymap_flush
b) Let's say we find that the keymap list is not empty. This means netns
still has a conntrack associated with gre, in which case we should not free
its memory, because this will lead to a double free and related crashes.
However I doubt it could have gone unnoticed for years, obviously
this does not happen in real life. So I think we can remove
both nf_ct_gre_keymap_flush() and nf_conntrack_proto_pernet_fini().
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The MPTCP receive path is hooked only into the TCP slow-path.
The DSS presence allows plain MPTCP traffic to hit that
consistently.
Since commit e1ff9e82e2 ("net: mptcp: improve fallback to TCP"),
when an MPTCP socket falls back to TCP, it can hit the TCP receive
fast-path, and delay or stop triggering the event notification.
Address the issue explicitly disabling the header prediction
for MPTCP sockets.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/200
Fixes: e1ff9e82e2 ("net: mptcp: improve fallback to TCP")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caif_hsi driver relies on a cfhsi_get_ops symbol using symbol_get,
but this symbol is not provided anywhere in the kernel tree. Remove
this driver given that it is dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since PTP virtual clock support is added, there can be
several PTP virtual clocks based on one PTP physical
clock for timestamping.
This patch is to extend SO_TIMESTAMPING API to support
PHC (PTP Hardware Clock) binding by adding a new flag
SOF_TIMESTAMPING_BIND_PHC. When PTP virtual clocks are
in use, user space can configure to bind one for
timestamping, but PTP physical clock is not supported
and not needed to bind.
This patch is preparation for timestamp conversion from
raw timestamp to a specific PTP virtual clock time in
core net.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The doc draft-stewart-tsvwg-sctp-ipv4-00 that restricts 198 addresses
was never published. These addresses as private addresses should be
allowed to use in SCTP.
As Michael Tuexen suggested, this patch is to move 198 addresses from
unusable to private scope.
Reported-by: Sérgio <surkamp@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_ll_usec is read locklessly from sk_can_busy_loop()
while another thread can change its value in sock_setsockopt()
This is correct but needs annotations.
BUG: KCSAN: data-race in __skb_try_recv_datagram / sock_setsockopt
write to 0xffff88814eb5f904 of 4 bytes by task 14011 on cpu 0:
sock_setsockopt+0x1287/0x2090 net/core/sock.c:1175
__sys_setsockopt+0x14f/0x200 net/socket.c:2100
__do_sys_setsockopt net/socket.c:2115 [inline]
__se_sys_setsockopt net/socket.c:2112 [inline]
__x64_sys_setsockopt+0x62/0x70 net/socket.c:2112
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88814eb5f904 of 4 bytes by task 14001 on cpu 1:
sk_can_busy_loop include/net/busy_poll.h:41 [inline]
__skb_try_recv_datagram+0x14f/0x320 net/core/datagram.c:273
unix_dgram_recvmsg+0x14c/0x870 net/unix/af_unix.c:2101
unix_seqpacket_recvmsg+0x5a/0x70 net/unix/af_unix.c:2067
____sys_recvmsg+0x15d/0x310 include/linux/uio.h:244
___sys_recvmsg net/socket.c:2598 [inline]
do_recvmmsg+0x35c/0x9f0 net/socket.c:2692
__sys_recvmmsg net/socket.c:2771 [inline]
__do_sys_recvmmsg net/socket.c:2794 [inline]
__se_sys_recvmmsg net/socket.c:2787 [inline]
__x64_sys_recvmmsg+0xcf/0x150 net/socket.c:2787
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0x00000101
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 14001 Comm: syz-executor.3 Not tainted 5.13.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of a magic number (13 currently) and having
to change it every other year, use sizeof_field() macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same concerns expressed for host MDB entries are valid for host FDBs
just as well:
- in the case of multiple bridges spanning the same switch chip, deleting
a host FDB entry that belongs to one bridge will result in breakage to
the other bridge
- not deleting FDB entries across DSA links means that the switch's
hardware tables will eventually run out, given enough wear&tear
So do the same thing and introduce reference counting for CPU ports and
DSA links using the same data structures as we have for MDB entries.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for the new cross-chip notifiers for host addresses,
let's introduce some more topology helpers which we are going to use to
discern switches that are in our path towards the dedicated CPU port
from switches that aren't.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add support for QCA_ROME device (0cf3:e500) and RTL8822CE
- Update management interface revision to 21
- Use of incluse language
- Proper handling of HCI_LE_Advertising_Set_Terminated event
- Recovery handing of HCI ncmd=0
- Various memory fixes
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmDaAwwZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKQwaD/4oPoEH/jdEhkvVUpHJebm2
GYNa+zvW5PwqgK1wvdg9Ou0fpgLuHMH+XPyhVMciG3N6kpbAGnyrGTlpKyu8uqh3
EDBPYoU3Rx/0+gUCrEXZrkSSyds8SuzD+ALkEzHtg2ISkUGff0FiFyDbc7TXG4ue
h1n87LgouCEq0kYEKZAbXnSKDt0aDBIeXGCMjZ/K4DctfeLZTTvzA8HMbX/ektzm
DOcZdHD1i47ijsu7/PZhesnqw/RGmMXaXfMvO9uuXM4N5qrZdDdFB9eDXRhHTwwF
79P3bYYweLvQloMXKQytnGUElPqceMNznCB1LsvxxDqTJuEwqF5nVzJ3TJJJHon2
Cq3JP3QthlpomRqaKlf+raQy618u5yF3APef9ZrUeLABfTjy2CIh0IhKixW7SsP0
VH3b4z8ClRodRTOYJpEx0Ncs9krccbmNf73FCf1kzjPEP7oKLuuA37WcolQ+7fjd
UiCLaKJ9flHFH3qeBzAii2vqe07Qy/l/SnG33nB1bDv4g0XnXI/Apb70z95jZcJL
jUBmlWbWcxKV1euTuIGpWiZPfhZyTVN0D3mcLcdCJhxHj79ZlUZcA1WdA699Crbo
Pc4icu/INzDybQ5oivqu78Ajdun3XGq/orlZfDtbjZNHoJpjNQTHhSNdPZhM9Bgr
uvNnaggWfjCXeSZ6ljY/Nw==
=2NdY
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for QCA_ROME device (0cf3:e500) and RTL8822CE
- Update management interface revision to 21
- Use of incluse language
- Proper handling of HCI_LE_Advertising_Set_Terminated event
- Recovery handing of HCI ncmd=0
- Various memory fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilja reported that, simply putting it, nothing was validating that
from_addr_param functions were operating on initialized memory. That is,
the parameter itself was being validated by sctp_walk_params, but it
doesn't check for types and their specific sizes and it could be a 0-length
one, causing from_addr_param to potentially work over the next parameter or
even uninitialized memory.
The fix here is to, in all calls to from_addr_param, check if enough space
is there for the wanted IP address type.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-28
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 12 day(s) which contain
a total of 56 files changed, 394 insertions(+), 380 deletions(-).
The main changes are:
1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney.
2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski.
3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev.
4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria.
5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards.
6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa.
7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi.
8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim.
9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young.
10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer.
11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski.
12) Fix up bpfilter startup log-level to info level, from Gary Lin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch builds off of commit 2b246b2569
and adds functionality to respond to ICMPV6 PROBE requests.
Add icmp_build_probe function to construct PROBE requests for both
ICMPV4 and ICMPV6.
Modify icmpv6_rcv to detect ICMPV6 PROBE messages and call the
icmpv6_echo_reply handler.
Modify icmpv6_echo_reply to build a PROBE response message based on the
queried interface.
This patch has been tested using a branch of the iputils git repo which can
be found here: https://github.com/Juniper-Clinic-2020/iputils/tree/probe-request
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"action" should not be NULL when it is referenced.
Signed-off-by: gushengxian <13145886936@163.com>
Signed-off-by: gushengxian <gushengxian@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case where the driver asks for a replay of a certain type of
event (port object or attribute) for a bridge port that is a LAG, it may
do so because this port has just joined the LAG.
But there might already be other switchdev ports in that LAG, and it is
preferable that those preexisting switchdev ports do not act upon the
replayed event.
The solution is to add a context to switchdev events, which is NULL most
of the time (when the bridge layer initiates the call) but which can be
set to a value controlled by the switchdev driver when a replay is
requested. The driver can then check the context to figure out if all
ports within the LAG should act upon the switchdev event, or just the
ones that match the context.
We have to modify all switchdev_handle_* helper functions as well as the
prototypes in the drivers that use these helpers too, because these
helpers hide the underlying struct switchdev_notifier_info from us and
there is no way to retrieve the context otherwise.
The context structure will be populated and used in later patches.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-06-28
1) Remove an unneeded error assignment in esp4_gro_receive().
From Yang Li.
2) Add a new byseq state hashtable to find acquire states faster.
From Sabrina Dubroca.
3) Remove some unnecessary variables in pfkey_create().
From zuoqilin.
4) Remove the unused description from xfrm_type struct.
From Florian Westphal.
5) Fix a spelling mistake in the comment of xfrm_state_ok().
From gushengxian.
6) Replace hdr_off indirections by a small helper function.
From Florian Westphal.
7) Remove xfrm4_output_finish and xfrm6_output_finish declarations,
they are not used anymore.From Antony Antony.
8) Remove xfrm replay indirections.
From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* aggregation handling improvements for some drivers
* hidden AP discovery on 6 GHz and other HE 6 GHz
improvements
* minstrel improvements for no-ack frames
* deferred rate control for TXQs to improve reaction
times
* virtual time-based airtime scheduler
* along with various little cleanups/fixups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmDWUHgACgkQB8qZga/f
l8RxXA/+KjkrFUGDHlMTIzS/i03EcB+idLYy+XBWYuO3PN4p/uCCXLzdGfiRKTxp
iCAjYp0ZP01FlN4sOIN+qqrUMoN0e8xqBEY1DFbJELBB1knV4VR0FeBPPcBMUEgE
xAjdZnIsfKO1yU2mrdQVDnkvXr4jHDALePvGwCQe+4KUbwCmVjo6nb534v17Ie3C
MDNnGq395fCvo9NW+Yvzw6s9ZLdhr28bGhSBgBrqmPBx/9vm3b7BA8qO+X6BzAny
87S2x+z40wabBujCGMvw9J5H6OJ0ZLfbT0TumNznqQbCbmwRdSUC+W/cjgRq/aQn
CA3E9T7tofIr1mk9EgcjE8ax91TB/TTX5Nh9Huki25B7rNRBM57VnuEuzmXP5qBS
xgsr60agQHp2ZAGRPBC9msPcPCbvRfdl9ckTAF8/4vI7uEXSyYViX92M0F1FEy60
Mw13B8WeRA4bH22xHpqdXsyl0x2OQHXHlLHfpzmhcUy0i2cQOQMsWbl3g6CJMTq+
/InNSaG+cqrCVse734KJLumRhOq4+MkedYJlqL7uHZJrXlLgQ9Z8ewANjJ/BjTsQ
d8I16XkKPyv8UiJEwHZ6NaHvU48LR8T/r4ym9c/rz0DTWHZyp87QMYa55iwuk4Zo
cW7G3f/N7dMk9fGRxgfIrnNwNUCMsGcBSJeJyr5qT7u8+RW0bRo=
=nFvj
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes berg says:
====================
Lots of changes:
* aggregation handling improvements for some drivers
* hidden AP discovery on 6 GHz and other HE 6 GHz
improvements
* minstrel improvements for no-ack frames
* deferred rate control for TXQs to improve reaction
times
* virtual time-based airtime scheduler
* along with various little cleanups/fixups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 14972cbd34 ("net: lwtunnel: Handle fragmentation") moved
fragmentation logic away from lwtunnel by carry encap headroom and
use it in output MTU calculation. But the forwarding part was not
covered and created difference in MTU for output and forwarding and
further to silent drops on ipv4 forwarding path. Fix it by taking
into account lwtunnel encap headroom.
The same commit also introduced difference in how to treat RTAX_MTU
in IPv4 and IPv6 where latter explicitly removes lwtunnel encap
headroom from route MTU. Make IPv4 version do the same.
Fixes: 14972cbd34 ("net: lwtunnel: Handle fragmentation")
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These command do have variable length and the length can go up to 251,
so this changes the struct to not use a fixed size and then when
creating the PDU only the actual length of the data send to the
controller.
Fixes: a0fb3726ba ("Bluetooth: Use Set ext adv/scan rsp data if controller supports")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf
Specifically, these terms are replaced:
blacklist -> reject list
whitelist -> accept list
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf
Specifically, these terms are replaced:
master -> central
slave -> peripheral
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf
Specifically, these terms are replaced:
master -> initiator
slave -> responder
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf
Specifically, these terms are replaced:
master -> central
slave -> peripheral
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf
Specifically, these terms are replaced when describing the
connectionless peripheral broadcast feature:
master -> central
slave -> peripheral
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf
Specifically, these terms are replaced:
master -> central
slave -> peripheral
These attributes are not used elsewhere in the code.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
During command status or command complete event, the controller may set
ncmd=0 indicating that it is not accepting any more commands. In such a
case, host holds off sending any more commands to the controller. If the
controller doesn't recover from such condition, host will wait forever,
until the user decides that the Bluetooth is broken and may power cycles
the Bluetooth.
This patch triggers the hardware error to reset the controller and
driver when it gets into such state as there is no other wat out.
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When an MGMT_EV_DEVICE_CONNECTED event is reported back to the user
space we will set the flags to tell if the established connection is
outbound or not. This is useful for the user space to log better metrics
and error messages.
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Yu Liu <yudiliu@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Second, and most likely the last, set of patches for v5.14. mt76 and
iwlwifi have most patches in this round, but rtw88 also has some new
features. Nothing special really standing out.
mt76
* mt7915 MSI support
* disable ASPM on mt7915
* mt7915 tx status reporting
* mt7921 decap offload
rtw88
* beacon filter support
* path diversity support
* firmware crash information via devcoredump
* quirks for disabling pci capabilities
mt7601u
* add USB ID for a XiaoDu WiFi Dongle
ath11k
* enable support for QCN9074 PCI devices
brcmfmac
* support parse country code map from DeviceTree
iwlwifi
* support for new hardware
* support for BIOS control of 11ax enablement in Russia
* support UNII4 band enablement from BIOS
-----BEGIN PGP SIGNATURE-----
iQFJBAABCgAzFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmDVyOMVHGt2YWxvQGNv
ZGVhdXJvcmEub3JnAAoJEG4XJFUm622bXtsH/RB2SKxPY4Rgb7+afjpJIeGDBq8T
7p5/Xe/PaKMIuusFuJ1gYbTShm5FjdA16F7ZMYSgzpf2b9XBv6hDw7xsPW3qPPtJ
MJynfnyvpNH9/NWcSrqOMLeFGv1ot5861BLESCXIaWNud+8OMUSw+qf8P6Z80RCD
tasOayIJVQ/6xLpzbqYoLXfXGeIwlWTiQnQROA8Joq7vQziHSnHTqC7RyTF6SrFn
AUnGJ10t62hsGt64tt1QeaCV65SCTPjxI2KKpUpvhD/yQtdcEq2f+1vCfnCk+NFz
VWxCAI6U7s6gf0Lu4/+cJhHt/ZCtLZ2d06g93sfplSWHBImYxG55cO3tHzU=
=pRFB
-----END PGP SIGNATURE-----
Merge tag 'wireless-drivers-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
wireless-drivers-next patches for v5.14
Second, and most likely the last, set of patches for v5.14. mt76 and
iwlwifi have most patches in this round, but rtw88 also has some new
features. Nothing special really standing out.
mt76
* mt7915 MSI support
* disable ASPM on mt7915
* mt7915 tx status reporting
* mt7921 decap offload
rtw88
* beacon filter support
* path diversity support
* firmware crash information via devcoredump
* quirks for disabling pci capabilities
mt7601u
* add USB ID for a XiaoDu WiFi Dongle
ath11k
* enable support for QCN9074 PCI devices
brcmfmac
* support parse country code map from DeviceTree
iwlwifi
* support for new hardware
* support for BIOS control of 11ax enablement in Russia
* support UNII4 band enablement from BIOS
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the PLPMUTD probe will stop for a long period (interval * 30)
after it enters search complete state. If there's a pmtu change on the
route path, it takes a long time to be aware if the ICMP TooBig packet
is lost or filtered.
As it says in rfc8899#section-4.3:
"A DPLPMTUD method MUST NOT rely solely on this method."
(ICMP PTB message).
This patch is to enable the other method for search complete state:
"A PL can use the DPLPMTUD probing mechanism to periodically
generate probe packets of the size of the current PLPMTU."
With this patch, the probe will continue with the current pmtu every
'interval' until the PMTU_RAISE_TIMER 'timeout', which we implement
by adding raise_count to raise the probe size when it counts to 30
and removing the SCTP_PL_COMPLETE check for PMTU_RAISE_TIMER.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The key length used when offloading macsec to Ethernet or PHY drivers
was set to MACSEC_KEYID_LEN (16), which is an issue as:
- This was never meant to be the key length.
- The key length can be > 16.
Fix this by using MACSEC_MAX_KEY_LEN to store the key (the max length
accepted in uAPI) and secy->key_len to copy it.
Fixes: 3cf3227a21 ("net: macsec: hardware offloading infrastructure")
Reported-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
XDP_REDIRECT works by a three-step process: the bpf_redirect() and
bpf_redirect_map() helpers will lookup the target of the redirect and store
it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
Next, when the program returns the XDP_REDIRECT return code, the driver
will call xdp_do_redirect() which will use the information thus stored to
actually enqueue the frame into a bulk queue structure (that differs
slightly by map type, but shares the same principle). Finally, before
exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
flush all the different bulk queues, thus completing the redirect.
Pointers to the map entries will be kept around for this whole sequence of
steps, protected by RCU. However, there is no top-level rcu_read_lock() in
the core code; instead drivers add their own rcu_read_lock() around the XDP
portions of the code, but somewhat inconsistently as Martin discovered[0].
However, things still work because everything happens inside a single NAPI
poll sequence, which means it's between a pair of calls to
local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
document this intention by using rcu_dereference_check() with
rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
lockdep to verify that everything is done correctly.
This patch does just that: we add an __rcu annotation to the map entry
pointers and remove the various comments explaining the NAPI poll assurance
strewn through devmap.c in favour of a longer explanation in filter.c. The
goal is to have one coherent documentation of the entire flow, and rely on
the RCU annotations as a "standard" way of communicating the flow in the
map code (which can additionally be understood by sparse and lockdep).
The RCU annotation replacements result in a fairly straight-forward
replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the
proper constructs to cast the pointer back and forth between __rcu and
__kernel address space (for the benefit of sparse). The one complication is
that xskmap has a few constructions where double-pointers are passed back
and forth; these simply all gain __rcu annotations, and only the final
reference/dereference to the inner-most pointer gets changed.
With this, everything can be run through sparse without eliciting
complaints, and lockdep can verify correctness even without the use of
rcu_read_lock() in the drivers. Subsequent patches will clean these up from
the drivers.
[0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
[1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
Daniel Borkmann says:
====================
pull-request: bpf 2021-06-23
The following pull-request contains BPF updates for your *net* tree.
We've added 14 non-merge commits during the last 6 day(s) which contain
a total of 13 files changed, 137 insertions(+), 64 deletions(-).
Note that when you merge net into net-next, there is a small merge conflict
between 9f2470fbc4 ("skmsg: Improve udp_bpf_recvmsg() accuracy") from bpf
with c49661aa6f ("skmsg: Remove unused parameters of sk_msg_wait_data()")
from net-next. Resolution is to: i) net/ipv4/udp_bpf.c: take udp_msg_wait_data()
and remove err parameter from the function, ii) net/ipv4/tcp_bpf.c: take
tcp_msg_wait_data() and remove err parameter from the function, iii) for
net/core/skmsg.c and include/linux/skmsg.h: remove the sk_msg_wait_data()
implementation and its prototype in header.
The main changes are:
1) Fix BPF poke descriptor adjustments after insn rewrite, from John Fastabend.
2) Fix regression when using BPF_OBJ_GET with non-O_RDWR flags, from Maciej Żenczykowski.
3) Various bug and error handling fixes for UDP-related sock_map, from Cong Wang.
4) Fix patching of vmlinux BTF IDs with correct endianness, from Tony Ambardar.
5) Two fixes for TX descriptor validation in AF_XDP, from Magnus Karlsson.
6) Fix overflow in size calculation for bpf_map_area_alloc(), from Bui Quang Minh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit d26b698dd3 ("net/tls: add skeleton of MIB statistics")
introduced __TLS_DEC_STATS(), but it is not used and __SNMP_DEC_STATS() is
not defined also. Let's remove it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Various minor cleanups and fixes from net-next branch
2) Optimize mlx5 feature check on tx and
a fix to allow Vxlan with Ipsec offloads
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmDSYyYACgkQSD+KveBX
+j45Bgf+M7Vg3OK4fA7ZURrbxu2EiCQB4XprO/it7YSzpiZx634DNNRzWXQ2mLJD
jx5TcmAFVUKkGmx/qrPYe/Y9c9l5s6JMjwACL5aEawXtvPzI/q1KBx/n5L+CoFYw
lO1IpBPpkwqIbeIl9cwata7IJ6aTeOGjqfQ//Fodfwga063Aaggig6sEcPkr0Ewe
wcuJmblnw/qOkSI2BlSOyixYiVjPRDF7cAVRTBK4/DCDFiGJTiaj8w0JgfdS2zVs
3xMrYajdz7qArMcqbuQe59KojeYj4hxALUSs+s9ks8qeIrbM+hZ9sH5m0dJgM4P6
7Pg87LD6PFDV31DWF0XM3KgdqfiZxw==
=aszX
-----END PGP SIGNATURE-----
Merge tag 'mlx5-net-next-2021-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-net-next-2021-06-22
1) Various minor cleanups and fixes from net-next branch
2) Optimize mlx5 feature check on tx and
a fix to allow Vxlan with Ipsec offloads
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2021-06-23
1) Don't return a mtu smaller than 1280 on IPv6 pmtu discovery.
From Sabrina Dubroca
2) Fix seqcount rcu-read side in xfrm_policy_lookup_bytype
for the PREEMPT_RT case. From Varad Gautam.
3) Remove a repeated declaration of xfrm_parse_spi.
From Shaokun Zhang.
4) IPv4 beet mode can't handle fragments, but IPv6 does.
commit 68dc022d04 ("xfrm: BEET mode doesn't support
fragments for inner packets") handled IPv4 and IPv6
the same way. Relax the check for IPv6 because fragments
are possible here. From Xin Long.
5) Memory allocation failures are not reported for
XFRMA_ENCAP and XFRMA_COADDR in xfrm_state_construct.
Fix this by moving both cases in front of the function.
6) Fix a missing initialization in the xfrm offload fallback
fail case for bonding devices. From Ayush Sawal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
from Phil Sutter.
2) Simplify TCP option sanity check for TCP packets, also from Phil.
3) Add a new expression to store when the rule has been used last time.
4) Pass the hook state object to log function, from Florian Westphal.
5) Document the new sysctl knobs to tune the flowtable timeouts,
from Oz Shlomo.
6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
from Dan Carpenter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As MISSED and DRAINING state are used to indicate a non-empty
qdisc, qdisc->empty is not longer needed, so remove it.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flag set, but queue discipline by-pass does not work for lockless
qdisc because skb is always enqueued to qdisc even when the qdisc
is empty, see __dev_xmit_skb().
This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for empty lockless qdisc, which aviod enqueuing
and dequeuing operation.
As qdisc->empty is not reliable to indicate a empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2ce ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicate there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid packet out of order
problem.
In order to make MISSED state reliable to indicate a empty qdisc,
we need to ensure that testing and clearing of MISSED state is
within the protection of qdisc->seqlock, only setting MISSED state
can be done without the protection of qdisc->seqlock. A MISSED
state testing is added without the protection of qdisc->seqlock to
aviod doing unnecessary spin_trylock() for contention case.
As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:
thread1 thread2 thread3
qdisc_run_begin() # true
qdisc_run_begin(q)
set(MISSED)
pfifo_fast_dequeue
clear(MISSED)
# recheck the queue
qdisc_run_end()
enqueue skb1
qdisc empty # true
qdisc_run_begin() # true
sch_direct_xmit() # skb2
qdisc_run_begin()
set(MISSED)
When above happens, skb1 enqueued by thread2 is transmited after
skb2 is transmited by thread3 because MISSED state setting and
enqueuing is not under the qdisc->seqlock. If qdisc bypass is
disabled, skb1 has better chance to be transmited quicker than
skb2.
This patch does not take care of the above data race, because we
view this as similar as below:
Even at the same time CPU1 and CPU2 write the skb to two socket
which both heading to the same qdisc, there is no guarantee that
which skb will hit the qdisc first, because there is a lot of
factor like interrupt/softirq/cache miss/scheduling afffecting
that.
There are below cases that need special handling:
1. When MISSED state is cleared before another round of dequeuing
in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
dequeue all skb in one round and call __netif_schedule(), which
might result in a non-empty qdisc without MISSED set. In order
to avoid this, the MISSED state is set for lockless qdisc and
__netif_schedule() will be called at the end of qdisc_run_end.
2. The MISSED state also need to be set for lockless qdisc instead
of calling __netif_schedule() directly when requeuing a skb for
a similar reason.
3. For netdev queue stopped case, the MISSED case need clearing
while the netdev queue is stopped, otherwise there may be
unnecessary __netif_schedule() calling. So a new DRAINING state
is added to indicate this case, which also indicate a non-empty
qdisc.
4. As there is already netif_xmit_frozen_or_stopped() checking in
dequeue_skb() and sch_direct_xmit(), which are both within the
protection of qdisc->seqlock, but the same checking in
__dev_xmit_skb() is without the protection, which might cause
empty indication of a lockless qdisc to be not reliable. So
remove the checking in __dev_xmit_skb(), and the checking in
the protection of qdisc->seqlock seems enough to avoid the cpu
consumption problem for netdev queue stopped case.
1. https://lkml.org/lkml/2021/5/29/215
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc->running seqcount operation is mainly used to do heuristic
locking on q->busylock for locked qdisc, see qdisc_is_running()
and __dev_xmit_skb().
So avoid doing seqcount operation for qdisc with TCQ_F_NOLOCK
flag.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add wiphy_info_once() helper that prints info message only once.
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20210511211549.30571-1-digetx@gmail.com
This switches the airtime scheduler in mac80211 to use a virtual
time-based scheduler instead of the round-robin scheduler used before.
This has a couple of advantages:
- No need to sync up the round-robin scheduler in firmware/hardware with
the round-robin airtime scheduler.
- If several stations are eligible for transmission we can schedule both
of them; no need to hard-block the scheduling rotation until the head
of the queue has used up its quantum.
- The check of whether a station is eligible for transmission becomes
simpler (in ieee80211_txq_may_transmit()).
The drawback is that scheduling becomes slightly more expensive, as we
need to maintain an rbtree of TXQs sorted by virtual time. This means
that ieee80211_register_airtime() becomes O(logN) in the number of
currently scheduled TXQs because it can change the order of the
scheduled stations. We mitigate this overhead by only resorting when a
station changes position in the tree, and hopefully N rarely grows too
big (it's only TXQs currently backlogged, not all associated stations),
so it shouldn't be too big of an issue.
To prevent divisions in the fast path, we maintain both station sums and
pre-computed reciprocals of the sums. This turns the fast-path operation
into a multiplication, with divisions only happening as the number of
active stations change (to re-compute the current sum of all active
station weights). To prevent this re-computation of the reciprocal from
happening too frequently, we use a time-based notion of station
activity, instead of updating the weight every time a station gets
scheduled or de-scheduled. As queues can oscillate between empty and
occupied quite frequently, this can significantly cut down on the number
of re-computations. It also has the added benefit of making the station
airtime calculation independent on whether the queue happened to have
drained at the time an airtime value was accounted.
Co-developed-by: Yibo Zhao <yiboz@codeaurora.org>
Signed-off-by: Yibo Zhao <yiboz@codeaurora.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20210623134755.235545-1-toke@redhat.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We have mgd_prepare_tx(), but sometimes drivers may want/need
to take action when the exchange finishes, whether successfully
or not.
Add a notification to the driver on completion, i.e. call the
new method mgd_complete_tx().
To unify the two scenarios, and to add more information, make
both of them take a struct that has the duration (prepare only),
subtype (both) and success (complete only).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210618133832.5d94e78f6230.I6dc979606b6f28701b740d7aab725f7853a5a155@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There may be cases where vendor-specific elements need to be
used over the air. Rather than have driver or firmware add
them and possibly cause problems that way, add them to the
iftype-data band capabilities. This way we can advertise to
userspace first, and use them in mac80211 next.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210618133832.e8c4f0347276.Iee5964682b3e9ec51fc1cd57a7c62383eaf6ddd7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This avoids calling back into tx handlers from within the rate control module.
Preparation for deferring rate control until tx dequeue
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210617163113.75815-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a helper function that checks if a frame is a data frame. Frames
with hardware encapsulation enabled are data frames.
Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de>
Link: https://lore.kernel.org/r/20210519122019.92359-2-borgers@mi.fu-berlin.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The inner_ipproto saves the inner IP protocol of the plain
text packet. This allows vendor's IPsec feature making offload
decision at skb's features_check and configuring hardware at
ndo_start_xmit.
For example, ConnectX6-DX IPsec device needs the plaintext's
IP protocol to support partial checksum offload on
VXLAN/GENEVE packet over IPsec transport mode tunnel.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This patch defined a new flag MPTCP_CAP_DENY_JOIN_ID0 for the third bit,
labeled "C" of the MP_CAPABLE option.
Add a new flag allow_join_id0 in struct mptcp_out_options. If this flag is
set, send out the MP_CAPABLE option with the flag MPTCP_CAP_DENY_JOIN_ID0.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, sctp over udp was using udp tunnel's icmp err process, which
only does sk lookup on sctp side. However for sctp's icmp error process,
there are more things to do, like syncing assoc pmtu/retransmit packets
for toobig type err, and starting proto_unreach_timer for unreach type
err etc.
Now after adding PLPMTUD, which also requires to process toobig type err
on sctp side. This patch is to process icmp err on sctp side by parsing
the type/code/info in .encap_err_lookup and call sctp's icmp processing
functions. Note as the 'redirect' err process needs to know the outer
ip(v6) header's, we have to leave it to udp(v6)_err to handle it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As described in rfc8899#section-5.2, when a probe succeeds, there might
be the following state transitions:
- Base -> Search, occurs when probe succeeds with BASE_PLPMTU,
pl.pmtu is not changing,
pl.probe_size increases by SCTP_PL_BIG_STEP,
- Error -> Search, occurs when probe succeeds with BASE_PLPMTU,
pl.pmtu is changed from SCTP_MIN_PLPMTU to SCTP_BASE_PLPMTU,
pl.probe_size increases by SCTP_PL_BIG_STEP.
- Search -> Search Complete, occurs when probe succeeds with the probe
size SCTP_MAX_PLPMTU less than pl.probe_high,
pl.pmtu is not changing, but update *pathmtu* with it,
pl.probe_size is set back to pl.pmtu to double check it.
- Search Complete -> Search, occurs when probe succeeds with the probe
size equal to pl.pmtu,
pl.pmtu is not changing,
pl.probe_size increases by SCTP_PL_MIN_STEP.
So search process can be described as:
1. When it just enters 'Search' state, *pathmtu* is not updated with
pl.pmtu, and probe_size increases by a big step (SCTP_PL_BIG_STEP)
each round.
2. Until pl.probe_high is set when a probe fails, and probe_size
decreases back to pl.pmtu, as described in the last patch.
3. When the probe with the new size succeeds, probe_size changes to
increase by a small step (SCTP_PL_MIN_STEP) due to pl.probe_high
is set.
4. Until probe_size is next to pl.probe_high, the searching finishes and
it goes to 'Complete' state and updates *pathmtu* with pl.pmtu, and
then probe_size is set to pl.pmtu to confirm by once more probe.
5. This probe occurs after "30 * probe_inteval", a much longer time than
that in Search state. Once it is done it goes to 'Search' state again
with probe_size increased by SCTP_PL_MIN_STEP.
As we can see above, during the searching, pl.pmtu changes while *pathmtu*
doesn't. *pathmtu* is only updated when the search finishes by which it
gets an optimal value for it. A big step is used at the beginning until
it gets close to the optimal value, then it changes to a small step until
it has this optimal value.
The small step is also used in 'Complete' until it goes to 'Search' state
again and the probe with 'pmtu + the small step' succeeds, which means a
higher size could be used. Then probe_size changes to increase by a big
step again until it gets close to the next optimal value.
Note that anytime when black hole is detected, it goes directly to 'Base'
state with pl.pmtu set to SCTP_BASE_PLPMTU, as described in the last patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The state transition is described in rfc8899#section-5.2,
PROBE_COUNT == MAX_PROBES means the probe fails for MAX times, and the
state transition includes:
- Base -> Error, occurs when BASE_PLPMTU Confirmation Fails,
pl.pmtu is set to SCTP_MIN_PLPMTU,
probe_size is still SCTP_BASE_PLPMTU;
- Search -> Base, occurs when Black Hole Detected,
pl.pmtu is set to SCTP_BASE_PLPMTU,
probe_size is set back to SCTP_BASE_PLPMTU;
- Search Complete -> Base, occurs when Black Hole Detected
pl.pmtu is set to SCTP_BASE_PLPMTU,
probe_size is set back to SCTP_BASE_PLPMTU;
Note a black hole is encountered when a sender is unaware that packets
are not being delivered to the destination endpoint. So it includes the
probe failures with equal probe_size to pl.pmtu, and definitely not
include that with greater probe_size than pl.pmtu. The later one is the
normal probe failure where probe_size should decrease back to pl.pmtu
and pl.probe_high is set. pl.probe_high would be used on HB ACK recv
path in the next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does exactly what rfc8899#section-6.2.1.2 says:
The SCTP sender needs to be able to determine the total size of a
probe packet. The HEARTBEAT chunk could carry a Heartbeat
Information parameter that includes, besides the information
suggested in [RFC4960], the probe size to help an implementation
associate a HEARTBEAT ACK with the size of probe that was sent. The
sender could also use other methods, such as sending a nonce and
verifying the information returned also contains the corresponding
nonce. The length of the PAD chunk is computed by reducing the
probing size by the size of the SCTP common header and the HEARTBEAT
chunk.
Note that HB ACK chunk will carry back whatever HB chunk carried, including
the probe_size we put it in; We also check hbinfo->probe_size in the HB ACK
against link->pl.probe_size to validate this HB ACK chunk.
v1->v2:
- Remove the unused 'sp' and add static for sctp_packet_bundle_pad().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are 3 timers described in rfc8899#section-5.1.1:
PROBE_TIMER, PMTU_RAISE_TIMER, CONFIRMATION_TIMER
This patches adds a 'probe_timer' in transport, and it works as either
PROBE_TIMER or PMTU_RAISE_TIMER. At most time, it works as PROBE_TIMER
and expires every a 'probe_interval' time to send the HB probe packet.
When transport pl enters COMPLETE state, it works as PMTU_RAISE_TIMER
and expires in 'probe_interval * 30' time to go back to SEARCH state
and do searching again.
SCTP HB is an acknowledged packet, CONFIRMATION_TIMER is not needed.
The timer will start when transport pl enters BASE state and stop
when it enters DISABLED state.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are 4 constants described in rfc8899#section-5.1.2:
MAX_PROBES, MIN_PLPMTU, MAX_PLPMTU, BASE_PLPMTU;
And 2 variables described in rfc8899#section-5.1.3:
PROBED_SIZE, PROBE_COUNT;
And 5 states described in rfc8899#section-5.2:
DISABLED, BASE, SEARCH, SEARCH_COMPLETE, ERROR;
And these 4 APIs are used to reset/update PLPMTUD, check if PLPMTUD is
enabled, and calculate the additional headers length for a transport.
Note the member 'probe_high' in transport will be set to the probe
size when a probe fails with this probe size in the next patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PLPMTUD can be enabled by doing 'sysctl -w net.sctp.probe_interval=n'.
'n' is the interval for PLPMTUD probe timer in milliseconds, and it
can't be less than 5000 if it's not 0.
All asoc/transport's PLPMTUD in a new socket will be enabled by default.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This chunk is defined in rfc4820#section-3, and used to pad an
SCTP packet. The receiver must discard this chunk and continue
processing the rest of the chunks in the packet.
Add it now, as it will be bundled with a heartbeat chunk to probe
pmtu in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The difference between dsa_is_user_port and dsa_port_is_user is that the
former needs to look up the list of ports of the DSA switch tree in
order to find the struct dsa_port, while the latter directly receives it
as an argument.
dsa_is_user_port is already in widespread use and has its place, so
there isn't any chance of converting all callers to a single form.
But being able to do:
dsa_port_is_user(dp)
instead of
dsa_is_user_port(dp->ds, dp->index)
is much more efficient too, especially when the "dp" comes from an
iterator over the DSA switch tree - this reduces the complexity from
quadratic to linear.
Move these helpers from dsa2.c to include/net/dsa.h so that others can
use them too.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spin_trylock() was assumed to contain the implicit
barrier needed to ensure the correct ordering between
STATE_MISSED setting/clearing and STATE_MISSED checking
in commit a90c57f2ce ("net: sched: fix packet stuck
problem for lockless qdisc").
But it turns out that spin_trylock() only has load-acquire
semantic, for strongly-ordered system(like x86), the compiler
barrier implicitly contained in spin_trylock() seems enough
to ensure the correct ordering. But for weakly-orderly system
(like arm64), the store-release semantic is needed to ensure
the correct ordering as clear_bit() and test_bit() is store
operation, see queued_spin_lock().
So add the explicit barrier to ensure the correct ordering
for the above case.
Fixes: a90c57f2ce ("net: sched: fix packet stuck problem for lockless qdisc")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces the overflow indirection with the new xfrm_replay_overflow
helper. After this, the 'repl' pointer in xfrm_state is no longer
needed and can be removed as well.
xfrm_replay_overflow() is added in two incarnations, one is used
when the kernel is compiled with xfrm hardware offload support enabled,
the other when its disabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adds new xfrm_replay_recheck() helper and calls it from
xfrm input path instead of the indirection.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Similar to other patches: add a new helper to avoid
an indirection.
v2: fix 'net/xfrm/xfrm_replay.c:519:13: warning: 'seq' may be used
uninitialized in this function' warning.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
replay protection is implemented using a callback structure and then
called via
x->repl->notify(), x->repl->recheck(), and so on.
all the differect functions are always built-in, so this could be direct
calls instead.
This first patch prepares for removal of the x->repl structure.
Add an enum with the three available replay modes to the xfrm_state
structure and then replace all x->repl->notify() calls by the new
xfrm_replay_notify() helper.
The helper checks the enum internally to adapt behaviour as needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh
scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added a new member named csum in struct mptcp_options_received.
When parsing the MP_CAPABLE with data, if the checksum is enabled,
adjust the expected_opsize. If the receiving option length matches the
length with the data checksum, get the checksum value and save it in
mp_opt->csum. And in mptcp_incoming_options, pass it to mpext->csum.
We always parse any csum/nocsum combination and delay the presence check
to later code, to allow reset if missing.
Additionally, in the TX path, use the newly introduce ext field to avoid
MPTCP csum recomputation on TCP retransmission and unneeded csum update
on when setting the data fin_flag.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new member csum_reqd in struct mptcp_out_options and
struct mptcp_subflow_request_sock. Initialized it with the helper
function mptcp_is_checksum_enabled().
In mptcp_write_options, if this field is enabled, send out the MP_CAPABLE
suboption with the MPTCP_CAP_CHECKSUM_REQD flag.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new member named csum in struct mptcp_ext, implemented
a new function named mptcp_generate_data_checksum().
Generate the data checksum in mptcp_sendmsg_frag, save it in mpext->csum.
Note that we must generate the csum for zero window probe, too.
Do the csum update incrementally, to avoid multiple csum computation
when the data is appended to existing skb.
Note that in a later patch we will skip unneeded csum related operation.
Changes not included here to keep the delta small.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a missing validation of a Tx descriptor when executing in skb mode
and the umem is in unaligned mode. A descriptor could point to a
buffer straddling the end of the umem, thus effectively tricking the
kernel to read outside the allowed umem region. This could lead to a
kernel crash if that part of memory is not mapped.
In zero-copy mode, the descriptor validation code rejects such
descriptors by checking a bit in the DMA address that tells us if the
next page is physically contiguous or not. For the last page in the
umem, this bit is not set, therefore any descriptor pointing to a
packet straddling this last page boundary will be rejected. However,
the skb path does not use this bit since it copies out data and can do
so to two different pages. (It also does not have the array of DMA
address, so it cannot even store this bit.) The code just returned
that the packet is always physically contiguous. But this is
unfortunately also returned for the last page in the umem, which means
that packets that cross the end of the umem are being allowed, which
they should not be.
Fix this by introducing a check for this in the SKB path only, not
penalizing the zero-copy path.
Fixes: 2b43470add ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20210617092255.3487-1-magnus.karlsson@gmail.com
The packet logger backend is unable to provide the incoming (or
outgoing) interface name because that information isn't available.
Pass the hook state, it contains the network namespace, the protocol
family, the network interfaces and other things.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-17
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).
The main changes are:
1) BPF infrastructure to migrate TCP child sockets from a listener to another
in the same reuseport group/map, from Kuniyuki Iwashima.
2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.
3) Streamline error reporting changes in libbpf as planned out in the
'libbpf: the road to v1.0' effort, from Andrii Nakryiko.
4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.
5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.
6) Support new LLVM relocations in libbpf to make them more linker friendly,
also add a doc to describe the BPF backend relocations, from Yonghong Song.
7) Silence long standing KUBSAN complaints on register-based shifts in
interpreter, from Daniel Borkmann and Eric Biggers.
8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
target arch cannot be determined, from Lorenz Bauer.
9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.
10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.
11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
programs to bpf_helpers.h header, from Florent Revest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new optional expression that tells you when last matching on a
given rule / set element element has happened.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make the gathered SMC statistics network namespace aware, for each
namespace collect an own set of statistic information.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These function declarations are not needed any more.
The definitions were deleted.
Fixes: 2ab6096db2 ("xfrm: remove output_finish indirection from xfrm_state_afinfo")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The round-robin rr_tx_counter was shared across CPUs leading to
significant cache thrashing at high packet rates. This patch switches
the round-robin packet counter to use a per-cpu variable to decide
the destination slave.
On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh
(-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after
this patch.
"perf top -e cache_misses" before:
12.31% [bonding] [k] bond_xmit_roundrobin_slave_get
10.59% [sch_fq_codel] [k] fq_codel_dequeue
9.34% [kernel] [k] skb_release_data
after:
15.42% [sch_fq_codel] [k] fq_codel_dequeue
10.06% [kernel] [k] __memset
9.12% [kernel] [k] skb_release_data
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function get_net_ns_by_fd() could be inlined when NET_NS is not
enabled.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
reuseport_migrate_sock() does the same check done in
reuseport_listen_stop_sock(). If the reuseport group is capable of
migration, reuseport_migrate_sock() selects a new listener by the child
socket hash and increments the listener's sk_refcnt beforehand. Thus, if we
fail in the migration, we have to decrement it later.
We will support migration by eBPF in the later commits.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.
The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.
Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.
This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.
By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.
When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:
- we cannot carry over the eBPF prog of shutdowned sockets
- we cannot attach/detach an eBPF prog to/from listening sockets via
shutdowned sockets
Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
As noted in the following commit, a closed listener has to hold the
reference to the reuseport group for socket migration. This patch adds a
field (num_closed_socks) to struct sock_reuseport to manage closed sockets
within the same reuseport group. Moreover, this and the following commits
introduce some helper functions to split socks[] into two sections and keep
TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended
queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE
sockets from the end.
TCP_LISTEN----------> <-------TCP_CLOSE
+---+---+ --- +---+ --- +---+ --- +---+
| 0 | 1 | ... | i | ... | j | ... | k |
+---+---+ --- +---+ --- +---+ --- +---+
i = num_socks - 1
j = max_socks - num_closed_socks
k = max_socks - 1
This patch also extends reuseport_add_sock() and reuseport_grow() to
support num_closed_socks.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled or eBPF program is attached, we will be able to migrate
child sockets from a listener to another in the same reuseport group after
close() or shutdown() syscalls.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
Whenever query statistics is issued for trap, devlink subsystem
would also fill-in statistics 'dropped' field. This field indicates
the number of packets HW dropped and failed to report to the device driver,
and thus - to the devlink subsystem itself.
In case if device driver didn't register callback for hard drop
statistics querying, 'dropped' field will be omitted and not filled.
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make rtnetlink ops that can create different
kinds of devices, like what we want to add to the WWAN
framework, the priv_size and setup parameters aren't quite
sufficient. Make this easier to manage by allowing ops to
allocate their own netdev via an @alloc method that gets
the tb netlink data.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled.
The reason is that nsfs tries to access ns->ops but the proc_ns_operations
is not implemented in this case.
[7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[7.670268] pgd = 32b54000
[7.670544] [00000010] *pgd=00000000
[7.671861] Internal error: Oops: 5 [#1] SMP ARM
[7.672315] Modules linked in:
[7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16
[7.673309] Hardware name: Generic DT based system
[7.673642] PC is at nsfs_evict+0x24/0x30
[7.674486] LR is at clear_inode+0x20/0x9c
The same to tun SIOCGSKNS command.
To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
disabled. Meanwhile move it to right place net/core/net_namespace.c.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Fixes: c62cce2cae ("net: add an ioctl to get a socket network namespace")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add socket ops for SEQPACKET type and .seqpacket_allow() callback
to query transports if they support SEQPACKET. Also split path
for data check for STREAM and SEQPACKET branches.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update current stream enqueue function for SEQPACKET
support:
1) Call transport's seqpacket enqueue callback.
2) Return value from enqueue function is whole record length or error
for SOCK_SEQPACKET.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add receive loop for SEQPACKET. It looks like receive loop for
STREAM, but there are differences:
1) It doesn't call notify callbacks.
2) It doesn't care about 'SO_SNDLOWAT' and 'SO_RCVLOWAT' values, because
there is no sense for these values in SEQPACKET case.
3) It waits until whole record is received.
4) It processes and sets 'MSG_TRUNC' flag.
So to avoid extra conditions for two types of socket inside one loop, two
independent functions were created.
Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SJA1110 has improved a few things compared to SJA1105:
- To send a control packet from the host port with SJA1105, one needed
to program a one-shot "management route" over SPI. This is no longer
true with SJA1110, you can actually send "in-band control extensions"
in the packets sent by DSA, these are in fact DSA tags which contain
the destination port and switch ID.
- When receiving a control packet from the switch with SJA1105, the
source port and switch ID were written in bytes 3 and 4 of the
destination MAC address of the frame (which was a very poor shot at a
DSA header). If the control packet also had an RX timestamp, that
timestamp was sent in an actual follow-up packet, so there were
reordering concerns on multi-core/multi-queue DSA masters, where the
metadata frame with the RX timestamp might get processed before the
actual packet to which that timestamp belonged (there is no way to
pair a packet to its timestamp other than the order in which they were
received). On SJA1110, this is no longer true, control packets have
the source port, switch ID and timestamp all in the DSA tags.
- Timestamps from the switch were partial: to get a 64-bit timestamp as
required by PTP stacks, one would need to take the partial 24-bit or
32-bit timestamp from the packet, then read the current PTP time very
quickly, and then patch in the high bits of the current PTP time into
the captured partial timestamp, to reconstruct what the full 64-bit
timestamp must have been. That is awful because packet processing is
done in NAPI context, but reading the current PTP time is done over
SPI and therefore needs sleepable context.
But it also aggravated a few things:
- Not only is there a DSA header in SJA1110, but there is a DSA trailer
in fact, too. So DSA needs to be extended to support taggers which
have both a header and a trailer. Very unconventional - my understanding
is that the trailer exists because the timestamps couldn't be prepared
in time for putting them in the header area.
- Like SJA1105, not all packets sent to the CPU have the DSA tag added
to them, only control packets do:
* the ones which match the destination MAC filters/traps in
MAC_FLTRES1 and MAC_FLTRES0
* the ones which match FDB entries which have TRAP or TAKETS bits set
So we could in theory hack something up to request the switch to take
timestamps for all packets that reach the CPU, and those would be
DSA-tagged and contain the source port / switch ID by virtue of the
fact that there needs to be a timestamp trailer provided. BUT:
- The SJA1110 does not parse its own DSA tags in a way that is useful
for routing in cross-chip topologies, a la Marvell. And the sja1105
driver already supports cross-chip bridging from the SJA1105 days.
It does that by automatically setting up the DSA links as VLAN trunks
which contain all the necessary tag_8021q RX VLANs that must be
communicated between the switches that span the same bridge. So when
using tag_8021q on sja1105, it is possible to have 2 switches with
ports sw0p0, sw0p1, sw1p0, sw1p1, and 2 VLAN-unaware bridges br0 and
br1, and br0 can take sw0p0 and sw1p0, and br1 can take sw0p1 and
sw1p1, and forwarding will happen according to the expected rules of
the Linux bridge.
We like that, and we don't want that to go away, so as a matter of
fact, the SJA1110 tagger still needs to support tag_8021q.
So the sja1110 tagger is a hybrid between tag_8021q for data packets,
and the native hardware support for control packets.
On RX, packets have a 13-byte trailer if they contain an RX timestamp.
That trailer is padded in such a way that its byte 8 (the start of the
"residence time" field - not parsed by Linux because we don't care) is
aligned on a 16 byte boundary. So the padding has a variable length
between 0 and 15 bytes. The DSA header contains the offset of the
beginning of the padding relative to the beginning of the frame (and the
end of the padding is obviously the end of the packet minus 13 bytes,
the length of the trailer). So we discard it.
Packets which don't have a trailer contain the source port and switch ID
information in the header (they are "trap-to-host" packets). Packets
which have a trailer contain the source port and switch ID in the trailer.
On TX, the destination port mask and switch ID is always in the trailer,
so we always need to say in the header that a trailer is present.
The header needs a custom EtherType and this was chosen as 0xdadc, after
0xdada which is for Marvell and 0xdadb which is for VLANs in
VLAN-unaware mode on SJA1105 (and SJA1110 in fact too).
Because we use tag_8021q in concert with the native tagging protocol,
control packets will have 2 DSA tags.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some really really weird switches just couldn't decide whether to use a
normal or a tail tagger, so they just did both.
This creates problems for DSA, because we only have the concept of an
'overhead' which can be applied to the headroom or to the tailroom of
the skb (like for example during the central TX reallocation procedure),
depending on the value of bool tail_tag, but not to both.
We need to generalize DSA to cater for these odd switches by
transforming the 'overhead / tail_tag' pair into 'needed_headroom /
needed_tailroom'.
The DSA master's MTU is increased to account for both.
The flow dissector code is modified such that it only calls the DSA
adjustment callback if the tagger has a non-zero header length.
Taggers are trivially modified to declare either needed_headroom or
needed_tailroom, based on the tail_tag value that they currently
declare.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After previous patches all remaining users set the function pointer to
the same function: xfrm6_find_1stfragopt.
So remove this function pointer and call ip6_find_1stfragopt directly.
Reduces size of xfrm_type to 64 bytes on 64bit platforms.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
UDP sendmsg() path can be lockless, it is possible for another
thread to re-connect an change sk->sk_txhash under us.
There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE()
pair to document the race.
BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w
write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1:
sk_set_txhash include/net/sock.h:1937 [inline]
__ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75
__ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580
__sys_connect_file net/socket.c:1837 [inline]
__sys_connect+0x245/0x280 net/socket.c:1854
__do_sys_connect net/socket.c:1864 [inline]
__se_sys_connect net/socket.c:1861 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1861
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0:
skb_set_hash_from_sk include/net/sock.h:2211 [inline]
skb_set_owner_w+0x118/0x220 net/core/sock.c:2101
sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359
sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373
__ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621
ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983
udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
__do_sys_sendmmsg net/socket.c:2519 [inline]
__se_sys_sendmmsg net/socket.c:2516 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xbca3c43d -> 0xfdb309e0
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Namely, have off counted starting from 0 rather than from csstate->off.
To compensate we need to shift the initial value (csstate->sum) (rotate
by 8 bits, as usual for csum) and do the same after we are finished adding
the pieces up.
What we get out of that is a bit more redundancy in our variables - from
is always equal to addr + off, which will be useful several commits down
the road.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add nfgenmsg field to nfnetlink's struct nfnl_info and use it.
2) Remove nft_ctx_init_from_elemattr() and nft_ctx_init_from_setattr()
helper functions.
3) Add the nf_ct_pernet() helper function to fetch the conntrack
pernetns data area.
4) Expose TCP and UDP flowtable offload timeouts through sysctl,
from Oz Shlomo.
5) Add nfnetlink_hook subsystem to fetch the netfilter hook
pipeline configuration, from Florian Westphal. This also includes
a new field to annotate the hook type as metadata.
6) Fix unsafe memory access to non-linear skbuff in the new SCTP
chunk support for nft_exthdr, from Phil Sutter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix more fallout from RTNL locking changes
* fixes for some of the bugs found by syzbot
* drop multicast fragments in mac80211 to align
with the spec and what drivers are doing now
* fix NULL-ptr deref in radiotap injection
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmDAz24ACgkQB8qZga/f
l8S1LQ/8CVe2fweF6mps0gktCgAAiLhCCpcqCiGqPFe6cmDfSJp7bvCj9YNL7LaG
YvCwXlBAN7xpBwnGAXSvBpYC2Ru7KNfzTSqFfThbzPh4DLKxUfKsOK6Yel3yx6B3
gWDjT4zKpZ93k7DO1wdgO/MOvaOVbTe0F+wcLCvcZ3dpqHZuFqAK5FWlHUtlM2c3
Uc08O2WN2DrQR/Qnw0ErXK6pd8N87bnrTNd7vYf69Cmcp53GC4rQGRATQxEtm8LC
DdlAQ4ensIfrexlFG+oCSISufwlKYBNW9PY0L10qNUzB6DJyRDVz1UybWEPcTZEy
sS8nK4O98bGALWMi98Dqf/s/mQMrjs6THJIyJUQi+p2pHimDH43qwfcIAqoMcw0g
37aG67dEZDXkSYx+CPloBFPgELfDP726BFcVkRyUzdHEIZyGvIIEnEfr6LsIKXNS
pDRrDyJOaNoHjGq0VzYvZ+7ETo8rqJHDWkNjEQX13jfa2r3kDTUAvauXkNTmez5N
xTNN5XttlfNXvUgb+QWp35ZgfvwimLzVKGfPGBNl8vKaFc5tOGVnzaHU3WahOa1d
ttzGRuiNuvb0OWZqIlxG8U8FPtXXpSy/+oKdP4ZbFOLeZXRqpJ85dMSpUAIOwYT5
E0bdOpgbx5C5LFhK4GXUT/Mx6nLBr3c3Jj5flhrGx2wg9+z+PVU=
=evzy
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2021-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes berg says:
====================
A fair number of fixes:
* fix more fallout from RTNL locking changes
* fixes for some of the bugs found by syzbot
* drop multicast fragments in mac80211 to align
with the spec and what drivers are doing now
* fix NULL-ptr deref in radiotap injection
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Its set but never read. Reduces size of xfrm_type to 64 bytes on 64bit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Up to now several high speed NICs have custom mechanisms of recycling
the allocated memory they use for their payloads.
Our page_pool API already has recycling capabilities that are always
used when we are running in 'XDP mode'. So let's tweak the API and the
kernel network stack slightly and allow the recycling to happen even
during the standard operation.
The API doesn't take into account 'split page' policies used by those
drivers currently, but can be extended once we have users for that.
The idea is to be able to intercept the packet on skb_release_data().
If it's a buffer coming from our page_pool API recycle it back to the
pool for further usage or just release the packet entirely.
To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
a field in struct page (page->pp) to store the page_pool pointer.
Storing the information in page->pp allows us to recycle both SKBs and
their fragments.
We could have skipped the skb bit entirely, since identical information
can bederived from struct page. However, in an effort to affect the free path
as less as possible, reading a single bit in the skb which is already
in cache, is better that trying to derive identical information for the
page stored data.
The driver or page_pool has to take care of the sync operations on it's own
during the buffer recycling since the buffer is, after opting-in to the
recycling, never unmapped.
Since the gain on the drivers depends on the architecture, we are not
enabling recycling by default if the page_pool API is used on a driver.
In order to enable recycling the driver must call skb_mark_for_recycle()
to store the information we need for recycling in page->pp and
enabling the recycling bit, or page_pool_store_mem_info() for a fragment.
Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the aging period for tcp/udp connections is hard coded to
30 seconds. Aged tcp/udp connections configure a hard coded 120/30
seconds pickup timeout for conntrack.
This configuration may be too aggressive or permissive for some users.
Dynamically configure the nf flow table GC timeout intervals according
to the user defined values.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
UDP connections may be offloaded from nf conntrack to nf flow table.
Offloaded connections are aged after 30 seconds of inactivity.
Once aged, ownership is returned to conntrack with a hard coded pickup
time of 30 seconds, after which the connection may be deleted.
eted. The current aging intervals may be too aggressive for some users.
Provide users with the ability to control the nf flow table offload
aging and pickup time intervals via sysctl parameter as a pre-step for
configuring the nf flow table GC timeout intervals.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP connections may be offloaded from nf conntrack to nf flow table.
Offloaded connections are aged after 30 seconds of inactivity.
Once aged, ownership is returned to conntrack with a hard coded pickup
time of 120 seconds, after which the connection may be deleted.
eted. The current aging intervals may be too aggressive for some users.
Provide users with the ability to control the nf flow table offload
aging and pickup time intervals via sysctl parameter as a pre-step for
configuring the nf flow table GC timeout intervals.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
MPTCP is builtin, so no need to add EXPORT_SYMBOL()s.
It will be used to support SO_TIMESTAMP(NS) ancillary
messages in the mptcp receive path.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to previous patch: expose SO_TIMESTAMPING helper so we do not
have to copy & paste this into the mptcp core.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This exports SO_TIMESTAMP_* function for re-use by MPTCP.
Without this there is too much copy & paste needed to support
this from mptcp setsockopt path.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
caif_enroll_dev() can fail in some cases. Ingnoring
these cases can lead to memory leak due to not assigning
link_support pointer to anywhere.
Fixes: 7c18d2205e ("caif: Restructure how link caif link layer enroll")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added cfserl_release() function.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor DEVLINK_CMD_RATE_{GET|SET} command handlers to support setting
a node as a parent for another rate object (leaf or node) by means of
new attribute DEVLINK_ATTR_RATE_PARENT_NODE_NAME. Extend devlink ops
with new callbacks rate_{leaf|node}_parent_set() to set node as a parent
for rate object to allow supporting drivers to implement rate grouping
through devlink. Driver implementations are allowed to support leafs
or node children only. Invoking callback with NULL as parent should be
threated by the driver as unset parent action.
Extend rate object struct with reference counter to disallow deleting a
node with any child pointing to it. User should unset parent for the
child explicitly.
Example:
$ devlink port function rate add netdevsim/netdevsim10/group1
$ devlink port function rate add netdevsim/netdevsim10/group2
$ devlink port function rate set netdevsim/netdevsim10/group1 parent group2
$ devlink port function rate show netdevsim/netdevsim10/group1
netdevsim/netdevsim10/group1: type node parent group2
$ devlink port function rate set netdevsim/netdevsim10/group1 noparent
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for DEVLINK_CMD_RATE_{NEW|DEL} commands that are used
to create and delete devlink rate nodes. Add new attribute
DEVLINK_ATTR_RATE_NODE_NAME that specify node name string. The node name
is an alphanumeric identifier. No valid node name can be a devlink port
index, eg. decimal number. Extend devlink ops with new callbacks
rate_node_{new|del}() and rate_node_tx_{share|max}_set() to allow
supporting drivers to implement ports rate grouping and setting tx rate
of rate nodes through devlink.
Expose devlink_rate_nodes_destroy() function to allow vendor driver do
proper cleanup of internally allocated resources for the nodes if the
driver goes down or due to any other reasons which requires nodes to be
destroyed.
Disallow moving device from switchdev to legacy mode if any node exists
on that device. User must explicitly delete nodes before switching mode.
Example:
$ devlink port function rate add netdevsim/netdevsim10/group1
$ devlink port function rate set netdevsim/netdevsim10/group1 \
tx_share 10mbit tx_max 100mbit
Add + set command can be combined:
$ devlink port function rate add netdevsim/netdevsim10/group1 \
tx_share 10mbit tx_max 100mbit
$ devlink port function rate show netdevsim/netdevsim10/group1
netdevsim/netdevsim10/group1: type node tx_share 10mbit tx_max 100mbit
$ devlink port function rate del netdevsim/netdevsim10/group1
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for DEVLINK_CMD_RATE_SET command with new attributes
DEVLINK_ATTR_RATE_TX_{SHARE|MAX} that are used to set devlink rate
shared/max tx rate values. Extend devlink ops with new callbacks
rate_leaf_tx_{share|max}_set() to allow supporting drivers to implement
rate control through devlink.
New attributes are optional. Driver implementations are allowed to
support either or both of them.
Shared rate example:
$ devlink port function rate set netdevsim/netdevsim10/0 tx_share 10mbit
$ devlink port function rate show netdevsim/netdevsim10/0
netdevsim/netdevsim10/0: type leaf tx_share 10mbit
Max rate example:
$ devlink port function rate set netdevsim/netdevsim10/0 tx_max 100mbit
$ devlink port function rate show netdevsim/netdevsim10/0
netdevsim/netdevsim10/0: type leaf tx_max 100mbit
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow registering rate object for devlink ports with dedicated
devlink_rate_leaf_{create|destroy}() API. Implement new netlink
DEVLINK_CMD_RATE_GET command that is used to retrieve rate object info.
Add new DEVLINK_CMD_RATE_{NEW|DEL} commands that are used for
notifications when creating/deleting leaf rate object.
Rate API is intended to be used for rate limiting of individual
devlink ports (leafs) and their aggregates (nodes).
Example:
$ devlink port show
pci/0000:03:00.0/0
pci/0000:03:00.0/1
$ devlink port function rate show
pci/0000:03:00.0/0: type leaf
pci/0000:03:00.0/1: type leaf
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for SCTP chunks matching on nf_tables, from Phil Sutter.
2) Skip LDMXCSR, we don't need a valid MXCSR state. From Stefano Brivio.
3) CONFIG_RETPOLINE for nf_tables set lookups, from Florian Westphal.
4) A few Kconfig leading spaces removal, from Juerg Haefliger.
5) Remove spinlock from xt_limit, from Jason Baron.
6) Remove useless initialization in xt_CT, oneliner from Yang Li.
7) Tree-wide replacement of netlink_unicast() by nfnetlink_unicast().
8) Reduce footprint of several structures: xt_action_param,
nft_pktinfo and nf_hook_state, from Florian.
10) Add nft_thoff() and nft_sk() helpers and use them, also from Florian.
11) Fix documentation in nf_tables pipapo avx2, from Florian Westphal.
12) Fix clang-12 fmt string warnings, also from Florian.
====================
Currently vlan modification action checks existence of vlan priority by
comparing it to 0. Therefore it is impossible to modify existing vlan
tag to have priority 0.
For example, the following tc command will change the vlan id but will
not affect vlan priority:
tc filter add dev eth1 ingress matchall action vlan modify id 300 \
priority 0 pipe mirred egress redirect dev eth2
The incoming packet on eth1:
ethertype 802.1Q (0x8100), vlan 200, p 4, ethertype IPv4
will be changed to:
ethertype 802.1Q (0x8100), vlan 300, p 4, ethertype IPv4
although the user has intended to have p == 0.
The fix is to add tcfv_push_prio_exists flag to struct tcf_vlan_params
and rely on it when deciding to set the priority.
Fixes: 45a497f2d1 (net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action)
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev with active TLS offload goes down, tls_device_down is
called to stop the offload and tear down the TLS context. However, the
socket stays alive, and it still points to the TLS context, which is now
deallocated. If a netdev goes up, while the connection is still active,
and the data flow resumes after a number of TCP retransmissions, it will
lead to a use-after-free of the TLS context.
This commit addresses this bug by keeping the context alive until its
normal destruction, and implements the necessary fallbacks, so that the
connection can resume in software (non-offloaded) kTLS mode.
On the TX side tls_sw_fallback is used to encrypt all packets. The RX
side already has all the necessary fallbacks, because receiving
non-decrypted packets is supported. The thing needed on the RX side is
to block resync requests, which are normally produced after receiving
non-decrypted packets.
The necessary synchronization is implemented for a graceful teardown:
first the fallbacks are deployed, then the driver resources are released
(it used to be possible to have a tls_dev_resync after tls_dev_del).
A new flag called TLS_RX_DEV_DEGRADED is added to indicate the fallback
mode. It's used to skip the RX resync logic completely, as it becomes
useless, and some objects may be released (for example, resync_async,
which is allocated and freed by the driver).
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU synchronization is guaranteed to finish in finite time, unlike a
busy loop that polls a flag. This patch is a preparation for the bugfix
in the next patch, where the same synchronize_net() call will also be
used to sync with the TX datapath.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit cb17ed29a7 ("mac80211: parse radiotap header when selecting Tx
queue") moved the code to validate the radiotap header from
ieee80211_monitor_start_xmit to ieee80211_parse_tx_radiotap. This made is
possible to share more code with the new Tx queue selection code for
injected frames. But at the same time, it now required the call of
ieee80211_parse_tx_radiotap at the beginning of functions which wanted to
handle the radiotap header. And this broke the rate parser for radiotap
header parser.
The radiotap parser for rates is operating most of the time only on the
data in the actual radiotap header. But for the 802.11a/b/g rates, it must
also know the selected band from the chandef information. But this
information is only written to the ieee80211_tx_info at the end of the
ieee80211_monitor_start_xmit - long after ieee80211_parse_tx_radiotap was
already called. The info->band information was therefore always 0
(NL80211_BAND_2GHZ) when the parser code tried to access it.
For a 5GHz only device, injecting a frame with 802.11a rates would cause a
NULL pointer dereference because local->hw.wiphy->bands[NL80211_BAND_2GHZ]
would most likely have been NULL when the radiotap parser searched for the
correct rate index of the driver.
Cc: stable@vger.kernel.org
Reported-by: Ben Greear <greearb@candelatech.com>
Fixes: cb17ed29a7 ("mac80211: parse radiotap header when selecting Tx queue")
Signed-off-by: Mathy Vanhoef <Mathy.Vanhoef@kuleuven.be>
[sven@narfation.org: added commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20210530133226.40587-1-sven@narfation.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit a05829a722 ("cfg80211: avoid holding the RTNL when calling the
driver") dropped usage of RTNL here and replaced it with
hw->wiphy->mutex. But we didn't update the comments.
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/r/20210505202829.1039400-1-briannorris@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Init it on demand in the nft_compat expression. This reduces size
of nft_pktinfo from 48 to 24 bytes on x86_64.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The functions pass extra skb arg, but either its not used or the helpers
can already access it via pkt->skb.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows to change storage placement later on without changing readers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>