The current phyrate conversion does not include extended MCS and provides
incorrect rates. Add a flag for extended MCS in DMG and add corresponding
phyrate table for the correct conversions using base MCS in DMG specs.
Signed-off-by: Max Chen <mxchen@codeaurora.org>
Link: https://lore.kernel.org/r/1609977050-7089-2-git-send-email-mxchen@codeaurora.org
[reduce data size, make a single WARN]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows drivers to pass 802.3 frames to mac80211, with some restrictions:
- the skb must be passed with a valid sta
- fast-rx needs to be active for the sta
- monitor mode needs to be disabled
mac80211 will tell the driver when it is safe to enable rx decap offload for
a particular station.
In order to implement support, a driver must:
- call ieee80211_hw_set(hw, SUPPORTS_RX_DECAP_OFFLOAD)
- implement ops->sta_set_decap_offload
- mark 802.3 frames with RX_FLAG_8023
If it doesn't want to enable offload for some vif types, it can mask out
IEEE80211_OFFLOAD_DECAP_ENABLED in vif->offload_flags from within the
.add_interface or .update_vif_offload driver ops
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-6-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A sorted flow list is only needed to drop packets in the biggest flow when
hitting the overmemory condition.
By scanning flows only when needed, we can avoid paying the cost of
maintaining the list under normal conditions
In order to avoid scanning lots of empty flows and touching too many cold
cache lines, a bitmap of flows with backlog is maintained
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Simplifies the code and prepares for a rework of scanning for flows on
overmemory drop.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is similar to what sch_fq_codel does. It also amortizes the worst
case cost of a follow-up patch that changes the selection of the biggest
flow for dropping packets
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.
Without this patch:
3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
|
--3.30%--__cgroup_bpf_run_filter_getsockopt
|
--0.81%--__kmalloc
With the patch applied:
0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern
Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
Introduce xdp_build_skb_from_frame utility routine to build the skb
from xdp_frame. Respect to __xdp_build_skb_from_frame,
xdp_build_skb_from_frame will allocate the skb object. Rely on
xdp_build_skb_from_frame in veth driver.
Introduce missing xdp metadata support in veth_xdp_rcv_one routine.
Add missing metadata support in veth_xdp_rcv_one().
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce __xdp_build_skb_from_frame utility routine to build
the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in
cpumap code.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Conflicts:
drivers/net/can/dev.c
commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Field rename.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous commit 32efcc06d2 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:
a. During handshake of an active open, only counts the first
SYN timeout
b. After handshake of passive and active open, stop updating
after (roughly) TCP_RETRIES1 recurring RTOs
c. After the socket aborts, over count timeout_rehash by 1
This patch fixes this by checking the rehash result from sk_rethink_txhash.
Fixes: 32efcc06d2 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following the description in previous patch (for TX):
As the bond interface is being bypassed by the TLS module, interacting
directly against the lower devs, there is no way for the bond interface
to disable its device offload capabilities, as long as the mode/policy
config allows it.
Hence, the feature flag is not directly controllable, but just reflects
the offload status based on the logic under bond_sk_check().
Here we just declare RX device offload support, and expose it via the
NETIF_F_HW_TLS_RX flag.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement TLS TX device offload for bonding interfaces.
This allows kTLS sockets running on a bond to benefit from the
device offload on capable lower devices.
To allow a simple and fast maintenance of the TLS context in SW and
lower devices, we bind the TLS socket to a specific lower dev.
To achieve a behavior similar to SW kTLS, we support only balance-xor
and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced
in bond_sk_check(), done in a previous patch.
For the above configuration, the SW implementation keeps picking the
same exact lower dev for all the socket's SKBs. The device offload
behaves similarly, making the decision once at the connection creation.
Per socket, the TLS module should work directly with the lowest netdev
in chain, to call the tls_dev_ops operations.
As the bond interface is being bypassed by the TLS module, interacting
directly against the lower devs, there is no way for the bond interface
to disable its device offload capabilities, as long as the mode/policy
config allows it.
Hence, the feature flag is not directly controllable, but just reflects
the current offload status based on the logic under bond_sk_check().
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add ndo_sk_get_lower_dev() implementation for bond interfaces.
Support only for the cases where the socket's and SKBs' hash
yields identical value for the whole connection lifetime.
Here we restrict it to L3+4 sockets only, with
xmit_hash_policy==LAYER34 and bond modes xor/802.3ad.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcf_action_init_1() loads tc action modules automatically with
request_module() after parsing the tc action names, and it drops RTNL
lock and re-holds it before and after request_module(). This causes a
lot of troubles, as discovered by syzbot, because we can be in the
middle of batch initializations when we create an array of tc actions.
One of the problem is deadlock:
CPU 0 CPU 1
rtnl_lock();
for (...) {
tcf_action_init_1();
-> rtnl_unlock();
-> request_module();
rtnl_lock();
for (...) {
tcf_action_init_1();
-> tcf_idr_check_alloc();
// Insert one action into idr,
// but it is not committed until
// tcf_idr_insert_many(), then drop
// the RTNL lock in the _next_
// iteration
-> rtnl_unlock();
-> rtnl_lock();
-> a_o->init();
-> tcf_idr_check_alloc();
// Now waiting for the same index
// to be committed
-> request_module();
-> rtnl_lock()
// Now waiting for RTNL lock
}
rtnl_unlock();
}
rtnl_unlock();
This is not easy to solve, we can move the request_module() before
this loop and pre-load all the modules we need for this netlink
message and then do the rest initializations. So the loop breaks down
to two now:
for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
struct tc_action_ops *a_o;
a_o = tc_action_load_ops(name, tb[i]...);
ops[i - 1] = a_o;
}
for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
act = tcf_action_init_1(ops[i - 1]...);
}
Although this looks serious, it only has been reported by syzbot, so it
seems hard to trigger this by humans. And given the size of this patch,
I'd suggest to make it to net-next and not to backport to stable.
This patch has been tested by syzbot and tested with tdc.py by me.
Fixes: 0fedc63fad ("net_sched: commit action insertions together")
Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com
Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.
The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():
RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
as long as the receiver continues to respond probes. We support
this by default and reset icsk_probes_out with incoming ACKs.
This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.
In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.
Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the set definition contains stateful expressions, allocate them for
the newly added entries from the packet path.
Fixes: 65038428b2 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Switches that care about QoS might have hardware support for reserving
buffer pools for individual ports or traffic classes, and configuring
their sizes and thresholds. Through devlink-sb (shared buffers), this is
all configurable, as well as their occupancy being viewable.
Add the plumbing in DSA for these operations.
Individual drivers still need to call devlink_sb_register() with the
shared buffers they want to expose. A helper was not created in DSA for
this purpose (unlike, say, dsa_devlink_params_register), since in my
opinion it does not bring any benefit over plainly calling
devlink_sb_register() directly.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-01-16
1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.
2) Add support for using kernel module global variables (__ksym externs in BPF
programs) retrieved via module's BTF, from Andrii Nakryiko.
3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
stored in mmap2 event for perf, from Jiri Olsa.
4) Various fixes for cross-building BPF sefltests out-of-tree which then will
unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.
5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.
6) Clean up driver's XDP buffer init and split into two helpers to init per-
descriptor and non-changing fields during processing, from Lorenzo Bianconi.
7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
perf: Add build id data in mmap2 event
bpf: Add size arg to build_id_parse function
bpf: Move stack_map_get_build_id into lib
bpf: Document new atomic instructions
bpf: Add tests for new BPF atomic operations
bpf: Add bitwise atomic instructions
bpf: Pull out a macro for interpreting atomic ALU operations
bpf: Add instructions for atomic_[cmp]xchg
bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
bpf: Move BPF_STX reserved field check into BPF_STX verifier code
bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
bpf: x86: Factor out a lookup table for some ALU opcodes
bpf: x86: Factor out emission of REX byte
bpf: x86: Factor out emission of ModR/M for *(reg + off)
tools/bpftool: Add -Wall when building BPF programs
bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
selftests/bpf: Install btf_dump test cases
selftests/bpf: Fix installation of urandom_read
selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
selftests/bpf: Fix out-of-tree build
...
====================
Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
is modeled on tag_trailer.c which works in a similar way.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Monitor the following events and notify the driver when:
- A DSA port joins/leaves a LAG.
- A LAG, made up of DSA ports, joins/leaves a bridge.
- A DSA port in a LAG is enabled/disabled (enabled meaning
"distributing" in 802.3ad LACP terms).
When a LAG joins a bridge, the DSA subsystem will treat that as each
individual port joining the bridge. The driver may look at the port's
LAG device pointer to see if it is associated with any LAG, if that is
required. This is analogue to how switchdev events are replicated out
to all lower devices when reaching e.g. a LAG.
Drivers can optionally request that DSA maintain a linear mapping from
a LAG ID to the corresponding netdev by setting ds->num_lag_ids to the
desired size.
In the event that the hardware is not capable of offloading a
particular LAG for any reason (the typical case being use of exotic
modes like broadcast), DSA will take a hands-off approach, allowing
the LAG to be formed as a pure software construct. This is reported
back through the extended ACK, but is otherwise transparent to the
user.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A function has a different name between their prototype
and its kernel-doc markup:
../include/net/cfg80211.h:1766: warning: expecting prototype for struct cfg80211_sar_chan_ranges. Prototype was for struct cfg80211_sar_freq_ranges instead
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/c7ed4bc4d9e992ead16d3d2df246f3b56dbfb1fb.1610610937.git.mchehab+huawei@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Now that all users of struct switchdev_trans have been modified to do
without it, we can remove this structure and the two helpers to determine
the phase.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It should be the driver's business to logically separate its VLAN
offloading into a preparation and a commit phase, and some drivers don't
need / can't do this.
So remove the transactional shim from DSA and let drivers propagate
errors directly from the .port_vlan_add callback.
It would appear that the code has worse error handling now than it had
before. DSA is the only in-kernel user of switchdev that offloads one
switchdev object to more than one port: for every VLAN object offloaded
to a user port, that VLAN is also offloaded to the CPU port. So the
"prepare for user port -> check for errors -> prepare for CPU port ->
check for errors -> commit for user port -> commit for CPU port"
sequence appears to make more sense than the one we are using now:
"offload to user port -> check for errors -> offload to CPU port ->
check for errors", but it is really a compromise. In the new way, we can
catch errors from the commit phase that we previously had to ignore.
But we have our hands tied and cannot do any rollback now: if we add a
VLAN on the CPU port and it fails, we can't do the rollback by simply
deleting it from the user port, because the switchdev API is not so nice
with us: it could have simply been there already, even with the same
flags. So we don't even attempt to rollback anything on addition error,
just leave whatever VLANs managed to get offloaded right where they are.
This should not be a problem at all in practice.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For many drivers, the .port_mdb_prepare callback was not a good opportunity
to avoid any error condition, and they would suppress errors found during
the actual commit phase.
Where a logical separation between the prepare and the commit phase
existed, the function that used to implement the .port_mdb_prepare
callback still exists, but now it is called directly from .port_mdb_add,
which was modified to return an int code.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Linus Wallei <linus.walleij@linaro.org> # RTL8366
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the introduction of the switchdev API, port attributes were
transmitted to drivers for offloading using a two-step transactional
model, with a prepare phase that was supposed to catch all errors, and a
commit phase that was supposed to never fail.
Some classes of failures can never be avoided, like hardware access, or
memory allocation. In the latter case, merely attempting to move the
memory allocation to the preparation phase makes it impossible to avoid
memory leaks, since commit 91cf8eceff ("switchdev: Remove unused
transaction item queue") which has removed the unused mechanism of
passing on the allocated memory between one phase and another.
It is time we admit that separating the preparation from the commit
phase is something that is best left for the driver to decide, and not
something that should be baked into the API, especially since there are
no switchdev callers that depend on this.
This patch removes the struct switchdev_trans member from switchdev port
attribute notifier structures, and converts drivers to not look at this
member.
In part, this patch contains a revert of my previous commit 2e554a7a5d
("net: dsa: propagate switchdev vlan_filtering prepare phase to
drivers").
For the most part, the conversion was trivial except for:
- Rocker's world implementation based on Broadcom OF-DPA had an odd
implementation of ofdpa_port_attr_bridge_flags_set. The conversion was
done mechanically, by pasting the implementation twice, then only
keeping the code that would get executed during prepare phase on top,
then only keeping the code that gets executed during the commit phase
on bottom, then simplifying the resulting code until this was obtained.
- DSA's offloading of STP state, bridge flags, VLAN filtering and
multicast router could be converted right away. But the ageing time
could not, so a shim was introduced and this was left for a further
commit.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the introduction of the switchdev API, port objects were
transmitted to drivers for offloading using a two-step transactional
model, with a prepare phase that was supposed to catch all errors, and a
commit phase that was supposed to never fail.
Some classes of failures can never be avoided, like hardware access, or
memory allocation. In the latter case, merely attempting to move the
memory allocation to the preparation phase makes it impossible to avoid
memory leaks, since commit 91cf8eceff ("switchdev: Remove unused
transaction item queue") which has removed the unused mechanism of
passing on the allocated memory between one phase and another.
It is time we admit that separating the preparation from the commit
phase is something that is best left for the driver to decide, and not
something that should be baked into the API, especially since there are
no switchdev callers that depend on this.
This patch removes the struct switchdev_trans member from switchdev port
object notifier structures, and converts drivers to not look at this
member.
Where driver conversion is trivial (like in the case of the Marvell
Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is
done in this patch.
Where driver conversion needs more attention (DSA, Mellanox Spectrum),
the conversion is left for subsequent patches and here we only fake the
prepare/commit phases at a lower level, just not in the switchdev
notifier itself.
Where the code has a natural structure that is best left alone as a
preparation and a commit phase (as in the case of the Ocelot switch),
that structure is left in place, just made to not depend upon the
switchdev transactional model.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The call path of a switchdev VLAN addition to the bridge looks something
like this today:
nbp_vlan_init
| __br_vlan_set_default_pvid
| | |
| | br_afspec |
| | | |
| | v |
| | br_process_vlan_info |
| | | |
| | v |
| | br_vlan_info |
| | / \ /
| | / \ /
| | / \ /
| | / \ /
v v v v v
nbp_vlan_add br_vlan_add ------+
| ^ ^ | |
| / | | |
| / / / |
\ br_vlan_get_master/ / v
\ ^ / / br_vlan_add_existing
\ | / / |
\ | / / /
\ | / / /
\ | / / /
\ | / / /
v | | v /
__vlan_add /
/ | /
/ | /
v | /
__vlan_vid_add | /
\ | /
v v v
br_switchdev_port_vlan_add
The ranges UAPI was introduced to the bridge in commit bdced7ef78
("bridge: support for multiple vlans and vlan ranges in setlink and
dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
have always been passed one by one, through struct bridge_vlan_info
tmp_vinfo, to br_vlan_info. So the range never went too far in depth.
Then Scott Feldman introduced the switchdev_port_bridge_setlink function
in commit 47f8328bb1 ("switchdev: add new switchdev bridge setlink").
That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
full use of the range. But switchdev_port_bridge_setlink was called like
this:
br_setlink
-> br_afspec
-> switchdev_port_bridge_setlink
Basically, the switchdev and the bridge code were not tightly integrated.
Then commit 41c498b935 ("bridge: restore br_setlink back to original")
came, and switchdev drivers were required to implement
.ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.
In the meantime, commits such as 0944d6b5a2 ("bridge: try switchdev op
first in __vlan_vid_add/del") finally made switchdev penetrate the
br_vlan_info() barrier and start to develop the call path we have today.
But remember, br_vlan_info() still receives VLANs one by one.
Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
29ab586c3d ("net: switchdev: Remove bridge bypass support from
switchdev") so that drivers would not implement .ndo_bridge_setlink any
longer. The switchdev_port_bridge_setlink also got deleted.
This refactoring removed the parallel bridge_setlink implementation from
switchdev, and left the only switchdev VLAN objects to be the ones
offloaded from __vlan_vid_add (basically RX filtering) and __vlan_add
(the latter coming from commit 9c86ce2c1a ("net: bridge: Notify about
bridge VLANs")).
That is to say, today the switchdev VLAN object ranges are not used in
the kernel. Refactoring the above call path is a bit complicated, when
the bridge VLAN call path is already a bit complicated.
Let's go off and finish the job of commit 29ab586c3d by deleting the
bogus iteration through the VLAN ranges from the drivers. Some aspects
of this feature never made too much sense in the first place. For
example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
flag supposed to mean, when a port can obviously have a single pvid?
This particular configuration _is_ denied as of commit 6623c60dc2
("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
perspective, the driver still has to play pretend, and only offload the
vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
modify the flags of another, completely unrelated, switchdev VLAN
object! (a VLAN that is PVID will invalidate the PVID flag from whatever
other VLAN had previously been offloaded with switchdev and had that
flag. Yet switchdev never notifies about that change, drivers are
supposed to guess).
Nonetheless, having a VLAN range in the API makes error handling look
scarier than it really is - unwinding on errors and all of that.
When in reality, no one really calls this API with more than one VLAN.
It is all unnecessary complexity.
And despite appearing pretentious (two-phase transactional model and
all), the switchdev API is really sloppy because the VLAN addition and
removal operations are not paired with one another (you can add a VLAN
100 times and delete it just once). The bridge notifies through
switchdev of a VLAN addition not only when the flags of an existing VLAN
change, but also when nothing changes. There are switchdev drivers out
there who don't like adding a VLAN that has already been added, and
those checks don't really belong at driver level. But the fact that the
API contains ranges is yet another factor that prevents this from being
addressed in the future.
Of the existing switchdev pieces of hardware, it appears that only
Mellanox Spectrum supports offloading more than one VLAN at a time,
through mlxsw_sp_port_vlan_set. I have kept that code internal to the
driver, because there is some more bookkeeping that makes use of it, but
I deleted it from the switchdev API. But since the switchdev support for
ranges has already been de facto deleted by a Mellanox employee and
nobody noticed for 4 years, I'm going to assume it's not a biggie.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A stray @ caused the kernel-doc parser to not understand
this, fix that. Also add some missing kernel-doc.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 6bdb68cef7 ("nl80211: add common API to configure SAR power limitations")
Fixes: c534e093d8 ("mac80211: add ieee80211_set_sar_specs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> # build only
Link: https://lore.kernel.org/r/20210106234740.96827c18f9bd.I8b9f0a9cbfe186931ef9640046f414371f216914@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This effectively reverts commit 60724d4bae ("net: dsa: Add support for
DSA specific notifiers"). The reason is that since commit 2f1e8ea726
("net: dsa: link interfaces with the DSA master to get rid of lockdep
warnings"), it appears that there is a generic way to achieve the same
purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was
converted to use the generic notifiers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when
they are enslaved to e.g. a bridge by calling netif_is_bridge_master().
Export this helper from DSA to get the equivalent functionality of
determining whether the upper interface of a CHANGEUPPER notifier is a
DSA switch interface or not.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is a bit strange to see something as specific as Broadcom SYSTEMPORT
bits in the main DSA include file. Move these away into a separate
header, and have the tagger and the SYSTEMPORT driver include them.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some DSA switches (and not only) cannot learn source MAC addresses from
packets injected from the CPU. They only perform hardware address
learning from inbound traffic.
This can be problematic when we have a bridge spanning some DSA switch
ports and some non-DSA ports (which we'll call "foreign interfaces" from
DSA's perspective).
There are 2 classes of problems created by the lack of learning on
CPU-injected traffic:
- excessive flooding, due to the fact that DSA treats those addresses as
unknown
- the risk of stale routes, which can lead to temporary packet loss
To illustrate the second class, consider the following situation, which
is common in production equipment (wireless access points, where there
is a WLAN interface and an Ethernet switch, and these form a single
bridging domain).
AP 1:
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
| ^ ^
| | |
| | |
| Client A Client B
|
|
|
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
AP 2
- br0 of AP 1 will know that Clients A and B are reachable via wlan0
- the hardware fdb of a DSA switch driver today is not kept in sync with
the software entries on other bridge ports, so it will not know that
clients A and B are reachable via the CPU port UNLESS the hardware
switch itself performs SA learning from traffic injected from the CPU.
Nonetheless, a substantial number of switches don't.
- the hardware fdb of the DSA switch on AP 2 may autonomously learn that
Client A and B are reachable through swp0. Therefore, the software br0
of AP 2 also may or may not learn this. In the example we're
illustrating, some Ethernet traffic has been going on, and br0 from AP
2 has indeed learnt that it can reach Client B through swp0.
One of the wireless clients, say Client B, disconnects from AP 1 and
roams to AP 2. The topology now looks like this:
AP 1:
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
| ^
| |
| Client A
|
|
| Client B
| |
| v
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
AP 2
- br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
- br0 of AP 1 will (possibly) know that Client B has left wlan0. There
are cases where it might never find out though. Either way, DSA today
does not process that notification in any way.
- the hardware FDB of the DSA switch on AP 1 may learn autonomously that
Client B can be reached via swp0, if it receives any packet with
Client 1's source MAC address over Ethernet.
- the hardware FDB of the DSA switch on AP 2 still thinks that Client B
can be reached via swp0. It does not know that it has roamed to wlan0,
because it doesn't perform SA learning from the CPU port.
Now Client A contacts Client B.
AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
segment.
AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
Hairpinning is disabled => drop.
This problem comes from the fact that these switches have a 'blind spot'
for addresses coming from software bridging. The generic solution is not
to assume that hardware learning can be enabled somehow, but to listen
to more bridge learning events. It turns out that the bridge driver does
learn in software from all inbound frames, in __br_handle_local_finish.
A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
addresses serviced by the bridge on 'foreign' interfaces. The software
bridge also does the right thing on migration, by notifying that the old
entry is deleted, so that does not need to be special-cased in DSA. When
it is deleted, we just need to delete our static FDB entry towards the
CPU too, and wait.
The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
events received on its own interfaces, such as static FDB entries.
Luckily we can change that, and DSA can listen to all switchdev FDB
add/del events in the system and figure out if those events were emitted
by a bridge that spans at least one of DSA's own ports. In case that is
true, DSA will also offload that address towards its own CPU port, in
the eventuality that there might be bridge clients attached to the DSA
switch who want to talk to the station connected to the foreign
interface.
In terms of implementation, we need to keep the fdb_info->added_by_user
check for the case where the switchdev event was targeted directly at a
DSA switch port. But we don't need to look at that flag for snooped
events. So the check is currently too late, we need to move it earlier.
This also simplifies the code a bit, since we avoid uselessly allocating
and freeing switchdev_work.
We could probably do some improvements in the future. For example,
multi-bridge support is rudimentary at the moment. If there are two
bridges spanning a DSA switch's ports, and both of them need to service
the same MAC address, then what will happen is that the migration of one
of those stations will trigger the deletion of the FDB entry from the
CPU port while it is still used by other bridge. That could be improved
with reference counting but is left for another time.
This behavior needs to be enabled at driver level by setting
ds->assisted_learning_on_cpu_port = true. This is because we don't want
to inflict a potential performance penalty (accesses through
MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
because address learning on the CPU port works there.
Reported-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into
udp_tunnel_nic_*_port() helpers, since they're always
done right before the call.
Add similar checks before calling the notifier.
udp_tunnel_nic invokes the notifier without checking
features which could result in some wasted cycles.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
and bpf trees.
Current release - regressions:
- mt76: - usb: fix NULL pointer dereference in mt76u_status_worker
- sdio: fix NULL pointer dereference in mt76s_process_tx_queue
- net: ipa: fix interconnect enable bug
Current release - always broken:
- netfilter: ipset: fixes possible oops in mtype_resize
- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages
Previous releases - regressions:
- e1000e: re-enable s0ix power saving flows for systems with
the Intel i219-LM Ethernet controllers to fix power
use regression
- virtio_net: fix recursive call to cpus_read_lock() to avoid
a deadlock
- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
- net-sysfs: take the rtnl lock around XPS configuration
- xsk: - fix memory leak for failed bind
- rollback reservation at NETDEV_TX_BUSY
- r8169: work around power-saving bug on some chip versions
Previous releases - always broken:
- dcb: validate netlink message in DCB handler
- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries
- vhost_net: fix ubuf refcount when sendmsg fails
- bpf: save correct stopping point in file seq iteration
- ncsi: use real net-device for response handler
- neighbor: fix div by zero caused by a data race (TOCTOU)
- bareudp: - fix use of incorrect min_headroom size
- fix false positive lockdep splat from the TX lock
- net: mvpp2: - clear force link UP during port init procedure
in case bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration
- xsk: fix race in SKB mode transmit with shared cq
- ionic: account for vlan tag len in rx buffer len
- net: stmmac: ignore the second clock input, current clock framework
does not handle exclusive clock use well, other drivers
may reconfigure the second clock
Misc:
- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/zsqQACgkQMUZtbf5S
IrvfqA/+MbjN9TRccZRgYVzPVzlP5jswi7VZIjikPrNxCdwgQd8bDMfeaD6I1PcX
WHf35vtD8zh729qz9DheWXFp7kDQ1fY0Z59KA25xf/ulFEkZPl3RBg70rSgv4rc+
T82dVo6x33DPe6NkspDC+Uhjz2IxcS/P7F9N7DtbavrfNuDyX8+0U/FFQIL0xOyG
DuhwecCh0vJFGcWXTWtK1vP1CPD98L28KS2Od+EZsUUZOKt1WMyGrAgNcT6uYXmO
NIYNy+FPyvvIwTLupoFE7oU4LA0sZozyvzcTDugXBF5EKoR8BwBFk0FfWzN9Oxge
LrmhNBSTeYyiw8XMOwSIfxwZnBm7mJFQqTHR1+Y83Qw1SR6PfSUZgkEkW2SYgprL
9CzE3O3P3Ci7TSx7fvZUn8B1q5J0DfZR6ZYyor9zl55e+ikraRYtXsk47bf9AGXl
owpHXEYWHFmgOP+LVdf1BUjuiE3vnCBJBsHlMbRkxiNPKravWtPSiM2yTu6fEbpT
pMXCgFQBL/IqwzX01zuw7teg40YLVaFnmFdQbYDwA5p9VODlQvHzn2K4GyuktswX
wxHYU5WRWtCkBfE+nbAROKzE7MuH9jtPtV1ZeuseTqYGBRuvEvudX8ypEvKS45pP
OWkzFsSXd9q7M6cxftipwjcyLiIO+UGdizNHvDUyEQOPAyYPKb4=
=N4/x
-----END PGP SIGNATURE-----
Merge tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, wireless and bpf
trees.
Current release - regressions:
- mt76: fix NULL pointer dereference in mt76u_status_worker and
mt76s_process_tx_queue
- net: ipa: fix interconnect enable bug
Current release - always broken:
- netfilter: fixes possible oops in mtype_resize in ipset
- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages
Previous releases - regressions:
- e1000e: re-enable s0ix power saving flows for systems with the
Intel i219-LM Ethernet controllers to fix power use regression
- virtio_net: fix recursive call to cpus_read_lock() to avoid a
deadlock
- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
- sysfs: take the rtnl lock around XPS configuration
- xsk: fix memory leak for failed bind and rollback reservation at
NETDEV_TX_BUSY
- r8169: work around power-saving bug on some chip versions
Previous releases - always broken:
- dcb: validate netlink message in DCB handler
- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries
- vhost_net: fix ubuf refcount when sendmsg fails
- bpf: save correct stopping point in file seq iteration
- ncsi: use real net-device for response handler
- neighbor: fix div by zero caused by a data race (TOCTOU)
- bareudp: fix use of incorrect min_headroom size and a false
positive lockdep splat from the TX lock
- mvpp2:
- clear force link UP during port init procedure in case
bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration
- xsk: fix race in SKB mode transmit with shared cq
- ionic: account for vlan tag len in rx buffer len
- stmmac: ignore the second clock input, current clock framework does
not handle exclusive clock use well, other drivers may reconfigure
the second clock
Misc:
- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme"
* tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
r8169: work around power-saving bug on some chip versions
net: usb: qmi_wwan: add Quectel EM160R-GL
selftests: mlxsw: Set headroom size of correct port
net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
ibmvnic: fix: NULL pointer dereference.
docs: networking: packet_mmap: fix old config reference
docs: networking: packet_mmap: fix formatting for C macros
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
bareudp: Fix use of incorrect min_headroom size
bareudp: set NETIF_F_LLTX flag
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
atlantic: remove architecture depends
erspan: fix version 1 check in gre_parse_header()
net: hns: fix return value check in __lb_other_process()
net: sched: prevent invalid Scell_log shift count
net: neighbor: fix a crash caused by mod zero
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
...
Daniel Borkmann says:
====================
pull-request: bpf 2020-12-28
The following pull-request contains BPF updates for your *net* tree.
There is a small merge conflict between bpf tree commit 69ca310f34
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:
[...]
curr_task = task_seq_get_next(ns, &curr_tid, true);
if (!curr_task) {
info->task = NULL;
info->tid = curr_tid;
return NULL;
}
/* set info->task and info->tid */
[...]
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).
The main changes are:
1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.
2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
point to BPF hashtab initialization, from Eric Dumazet.
3) Fix a splat in task iterator by saving the correct stopping point in the
seq file iteration, from Jonathan Lemon.
4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
errors on update/deletes, from Andrii Nakryiko.
5) Fix BPF selftest error reporting to something more user friendly if the
vmlinux BTF cannot be found, from Kamal Mostafa.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Check Scell_log shift size in red_check_params() and modify all callers
of red_check_params() to pass Scell_log.
This prevents a shift out-of-bounds as detected by UBSAN:
UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
shift exponent 72 is too large for 32-bit type 'int'
Fixes: 8afa10cbe2 ("net_sched: red: Avoid illegal values")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix long-standing limitation on open-unlink-fop pattern
- add refcount to p9_fid (fixes the above and will allow for more
cleanups and simplifications in the future)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl/gWOYACgkQq06b7GqY
5nBcHxAArtfxZ4wGa+OMoWt19UvF0bUeYUYdy75rp6awpXmQyMfMZY9oOrZ9L+St
mS3oLd8Mq7MeJQ+iGKPPOX+085aNhRtxjQOiHlv02fi09zVyNqj1vptq+rTRkKZL
8KTs+wYSIX2x1WensDBdohYFiWf87vCWyywpQ/1Vm0585mI8B88N3/H5EpkLbPZn
1bgXkakgd2hS+rKu6vJGV8lCJ8eKOSpQ3WUK0WQhY6ysJEL9dbl00qyNQ4kq2UgI
i3CpWC/86297KJOehPmegpVwqTIL5wrCkhQy7sNxDkR9E4Q3VoKTNSrnRDtBI0x/
mRCEnc/2Pf76Zv3LpGXnLFSnDLTQaO842Hla92RZJCcHaDtGDwV9Znq6sZgn+pdV
wlEpT383s+7AC/X1EFvLPMr5PHrEM1KW25YEYQI78X2hh8G+utW5p3+vu70b0PyT
Z7P9yBW4aU4cZb3XcZZihatZG2P1LxHyjNMmn4YBdUeQuitdiJGc2KwjVyKXi9PX
+t6xxI+YAIhu/5I9fRtOoIDQNTjsRRE+4FzRKd2kvxW5Z5sV/bZCf6CHnQ2uKYVP
fEMYd0GhjrUABbJPLLdFlBXaV+yigvwKtABDnJ5KSA9NlDSVBGewTRa2I+9DoclL
yQS4hELuvQsUNCo4AkICixo9vABt0j1OhXlo+B0WVk8Ilh9Wxpc=
=DKlc
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux
Pull 9p update from Dominique Martinet:
- fix long-standing limitation on open-unlink-fop pattern
- add refcount to p9_fid (fixes the above and will allow for more
cleanups and simplifications in the future)
* tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux:
9p: Remove unnecessary IS_ERR() check
9p: Uninitialized variable in v9fs_writeback_fid()
9p: Fix writeback fid incorrectly being attached to dentry
9p: apply review requests for fid refcounting
9p: add refcount to p9_fid struct
fs/9p: search open fids first
fs/9p: track open fids
fs/9p: fix create-unlink-getattr idiom
Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.
Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.
Fixes: 35fcde7f8d ("xsk: support for Tx")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl/YBtEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNnwA/9Ek8DG/1t8CEoJxpoRvwovQxNo+bi
0rCT9vqvx9PeCwoZi/0Vp6oKmpE1HADvbeB/+e00VrbLYnzE3oRY6VkpjoZRofKS
vc0/MzHSFxFUR1OTHwCefcXlPLK+bfitQbX5jEMeVyQCXNXXIrN7CnJf1LmCeLTR
kQBPlEN9lt7HyNVAi34FhOD/TQbWnFHgl2z5puffgri6cWnc+TALKMYytUZ+rYex
NYndDJW5b3g5kTat2eErn0FruxfzloGs0xMIiWb+z2i9kl41D+dkKPdAN7idqCSC
Jv0nJP/bDftzA0wOe9szmGaLQzu7YnCN5kiWcSspatZVnon42Cy/tp9tiuPGLRFU
XtelDfpyX6o3CLN0tX7LQEO+GYxPzvM6iaR2OrsChWPozUIIR3TLQg7jJN4bvNKl
TR6gCGZCoAeS5JLNGjzVKxT/oKQY+tCLLlYXQdQY6swNFi3EKmPr+K1D9lgm98fO
f3d1QmWiZZNmtxxoVogT0qoQYjkfgpnm3dVx813Vt+lwHlVpHGMEPpO27iD3/RYb
w2yWOJaGKwMD8iL0l+Cm6CPW0/nE5FFISQjWgC8b4Vgxlyan6+L9eViqGICkrUQ2
Edo0i1YFFZ4utHYkDf1VYBbJ+36KyCtdktgLAcbgnePiPB3E1XBsXTIIStSUIbVQ
iEbTkBlsCG4GIeU=
=6Cqb
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
"While we have a small number of SELinux patches for v5.11, there are a
few changes worth highlighting:
- Change the LSM network hooks to pass flowi_common structs instead
of the parent flowi struct as the LSMs do not currently need the
full flowi struct and they do not have enough information to use it
safely (missing information on the address family).
This patch was discussed both with Herbert Xu (representing team
netdev) and James Morris (representing team
LSMs-other-than-SELinux).
- Fix how we handle errors in inode_doinit_with_dentry() so that we
attempt to properly label the inode on following lookups instead of
continuing to treat it as unlabeled.
- Tweak the kernel logic around allowx, auditallowx, and dontauditx
SELinux policy statements such that the auditx/dontauditx are
effective even without the allowx statement.
Everything passes our test suite"
* tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
selinux: Fix fall-through warnings for Clang
selinux: drop super_block backpointer from superblock_security_struct
selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
selinux: allow dontauditx and auditallowx rules to take effect without allowx
selinux: fix error initialization in inode_doinit_with_dentry()
Core:
- support "prefer busy polling" NAPI operation mode, where we defer softirq
for some time expecting applications to periodically busy poll
- AF_XDP: improve efficiency by more batching and hindering
the adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or unaligned
reads making zero copy a performance win for much smaller messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined in
IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which
also allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to
a central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
+htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
=GXs1
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
Jakub pointed out that the IP_ECN_set* helpers basically open-code
csum16_add(), so let's switch them over to using the helper instead.
v2:
- Use __be16 for check_add stack variable in IP_ECN_set_ce() (kbot)
v3:
- Turns out we need __force casts to do arithmetic on __be16 types
Reported-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Jonathan Morton <chromatix99@gmail.com>
Tested-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20201211142638.154780-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For each TCP zero window probe, the icsk_backoff is increased by one and
its max value is tcp_retries2. If tcp_retries2 is greater than 63, the
probe0 timeout shift may exceed its max bits. On x86_64/ARMv8/MIPS, the
shift count would be masked to range 0 to 63. And on ARMv7 the result is
zero. If the shift count is masked, only several probes will be sent
with timeout shorter than TCP_RTO_MAX. But if the timeout is zero, it
needs tcp_retries2 times probes to end this false timeout. Besides,
bitwise shift greater than or equal to the width is an undefined
behavior.
This patch adds a limit to the backoff. The max value of max_when is
TCP_RTO_MAX and the min value of timeout base is TCP_RTO_MIN. The limit
is the backoff from TCP_RTO_MIN to TCP_RTO_MAX.
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Link: https://lore.kernel.org/r/20201208091910.37618-1-cambda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.
This is the 'MPTCP fast close option'. It can be carried on ack
segments or TCP resets. In the latter case, its needed to parse mptcp
options also for reset packets so that MPTCP can act accordingly.
Next patch will add receive side fastclose support in MPTCP.
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs.
- Lockdep-RCU updates reducing the need for __maybe_unused.
- Tasks-RCU updates.
- Miscellaneous fixes.
- Documentation updates.
- Torture-test updates.
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xon4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobXUD/92LJTI/TMgK6Z6EEQBiJZO/2mNKjK8
FEKc6AqTNMlZNsWCfQ5UgqtHpn+MkBZsX1x4u22gehE1qaCB8gnQ5wXgbXon8tQm
exxVk6vvQZjseeqCMqrsUYQlD7dNgHnf1qAmWXJvji4sA/1Opo6n2M74tqfE2ueV
S5hpQwSuK/6Zu2Hrr62HD8+Fx0in6ZuKRZxHGp1392l++DGbniJM3dzntRXB+JbZ
w3PDHFCQuGzTytyeKuQV48ot9IK+2YzmjIp/+4tHL6mvU38xeSu6gcYtqKPcfYWw
D6HXvDa965h5IrFdSA2JWSzjJ+VYgZVElk2HyXDNIae0fM/8GidgoIDQipT1WAur
sxW/Ke4U6Jm5MMqXqV8iMNduktkGD1/h6G/iB1Yis29xFdthorNpbHVAP+8cKXgf
1cR6RorOuBYv6XpyzygHtE7qfLY5ST352pJ4+UqNzboujOcuEnGaygttt0F/F8sA
ZH8NT5dyUfbGeqepdZWkbj116Hjeg3fyV3CZeyBhDeqpjf1Nn3nbJ1xRksPLfa3i
IKvN7HSzEg+vKnsJNnQeFlAmQ/W3n2bedzRqfaCg77pNhKI6jPuavY5f2YGFUj0y
yx0UzOYoI1Cln0keBMmynbyUKgJ7zstLkrt/JenjhtD3B+0df5BmYjkL+nqkP6ax
+XTCu7Xg+B061g==
=N/iO
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Thomas Gleixner:
"RCU, LKMM and KCSAN updates collected by Paul McKenney.
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs
- Lockdep-RCU updates reducing the need for __maybe_unused
- Tasks-RCU updates
- Miscellaneous fixes
- Documentation updates
- Torture-test updates
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests"
* tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
srcu: Take early exit on memory-allocation failure
rcu/tree: Defer kvfree_rcu() allocation to a clean context
rcu: Do not report strict GPs for outgoing CPUs
rcu: Fix a typo in rcu_blocking_is_gp() header comment
rcu: Prevent lockdep-RCU splats on lock acquisition/release
rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
rcu,ftrace: Fix ftrace recursion
rcu/tree: Make struct kernel_param_ops definitions const
rcu/tree: Add a warning if CPU being onlined did not report QS already
rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
rcu: Fix single-CPU check in rcu_blocking_is_gp()
rcu: Implement rcu_segcblist_is_offloaded() config dependent
list.h: Update comment to explicitly note circular lists
rcu: Panic after fixed number of stalls
x86/smpboot: Move rcu_cpu_starting() earlier
rcu: Allow rcu_irq_enter_check_tick() from NMI
tools/memory-model: Label MP tests' producers and consumers
tools/memory-model: Use "buf" and "flag" for message-passing tests
tools/memory-model: Add types to litmus tests
tools/memory-model: Add a glossary of LKMM terms
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc
ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF
GEP9def51HvXLpBV2BzNUGggo1SoGgQ=
=w/cO
-----END PGP SIGNATURE-----
Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull misc fixes from Christian Brauner:
"This contains several fixes which felt worth being combined into a
single branch:
- Use put_nsproxy() instead of open-coding it switch_task_namespaces()
- Kirill's work to unify lifecycle management for all namespaces. The
lifetime counters are used identically for all namespaces types.
Namespaces may of course have additional unrelated counters and
these are not altered. This work allows us to unify the type of the
counters and reduces maintenance cost by moving the counter in one
place and indicating that basic lifetime management is identical
for all namespaces.
- Peilin's fix adding three byte padding to Dmitry's
PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.
- Two smal patches to convert from the /* fall through */ comment
annotation to the fallthrough keyword annotation which I had taken
into my branch and into -next before df561f6688 ("treewide: Use
fallthrough pseudo-keyword") made it upstream which fixed this
tree-wide.
Since I didn't want to invalidate all testing for other commits I
didn't rebase and kept them"
* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
nsproxy: use put_nsproxy() in switch_task_namespaces()
sys: Convert to the new fallthrough notation
signal: Convert to the new fallthrough notation
time: Use generic ns_common::count
cgroup: Use generic ns_common::count
mnt: Use generic ns_common::count
user: Use generic ns_common::count
pid: Use generic ns_common::count
ipc: Use generic ns_common::count
uts: Use generic ns_common::count
net: Use generic ns_common::count
ns: Add a common refcount into ns_common
ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap.
2) Use atomic_inc_return() instead of atomic_add_return() in IPVS,
from Yejune Deng.
3) Simplify check for overquota in xt_nfacct, from Kaixu Xia.
4) Move nfnl_acct_list away from struct net, from Miao Wang.
5) Pass actual sk in reject actions, from Jan Engelhardt.
6) Add timeout and protoinfo to ctnetlink destroy events,
from Florian Westphal.
7) Four patches to generalize set infrastructure to support
for multiple expressions per set element.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: nftables: netlink support for several set element expressions
netfilter: nftables: generalize set extension to support for several expressions
netfilter: nftables: move nft_expr before nft_set
netfilter: nftables: generalize set expressions support
netfilter: ctnetlink: add timeout and protoinfo to destroy events
netfilter: use actual socket sk for REJECT action
netfilter: nfnl_acct: remove data from struct net
netfilter: Remove unnecessary conversion to bool
ipvs: replace atomic_add_return()
netfilter: nft_reject_bridge: fix build errors due to code movement
====================
Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase. As a result, memory pressure occurs.
In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
about 2 minutes. On 40 CPU cores / 70GB DRAM machine, the available
memory continuously reduced in a fast speed (about 120MB per second,
15GB in total within the 2 minutes). Note that the issue don't
reproduce on every machine. On my 6 CPU cores machine, the problem
didn't reproduce.
'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects. They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'. Therefore, 'fqdir_work_fn()' called frequently under the
workload and made the contention for 'rcu_barrier()' high. In more
detail, the global mutex, 'rcu_state.barrier_mutex' became the
bottleneck.
This commit avoids such contention by doing the 'rcu_barrier()' and
subsequent lightweight works in a batched manner, as similar to that of
'cleanup_net()'. The fqdir hashtable destruction, which is done before
the 'rcu_barrier()', is still allowed to run in parallel for fast
processing, but this commit makes it to use a dedicated work queue
instead of the 'system_wq', to make sure that the number of threads is
bounded.
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch replaces NFT_SET_EXPR by NFT_SET_EXT_EXPRESSIONS. This new
extension allows to attach several expressions to one set element (not
only one single expression as NFT_SET_EXPR provides). This patch
prepares for support for several expressions per set element in the
netlink userspace API.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* validate key indices for key deletion
* more preamble support in mac80211
* various 6 GHz scan fixes/improvements
* a common SAR power limitations API
* various small fixes & code improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl/TfYAACgkQB8qZga/f
l8ShjQ/9Hd6KjvA7keATtdjR7rDHo7H2nBKV/LukpuHsiTRrXTVOAfkcUTOb2hfR
7SJMzsUXdJGivbwm4lkx5TrIgiJm1hfW3zG0PFOs/bIuXs/KICrb+kLgQWiRIUfa
RIinf8BGPH3GgcCHDcWFUrnfnBVchrPUx2wIHoCQbCzLHIhB6q6x8jEJA67+smpv
57tDfUhm6pf6OYOqVN8HYlo0uRAIn1ImneplDelCmCI1dzlneEkMhqZuBXqWpD/I
C5vU+MjoOsJiW1XkmYOMe6VKQ/Bve06GUWs830S7aROOEfByv+ptlR9IjqvHvPIm
UI9NivfXQiZr6S7yD1m2xV7a14UMCIzYarwaM/I/NHAWF/Y4vzHFVzQjLfVKqMCV
dxxsWN+Yg7Gx3T5Fj3NNiQgnPF9ASVqgMrlC59ga+4If0y60V7dOSFuo9HF7AWgP
NIWKVI3He7Mb5TciM+BX5YQWkJiCSZXs427WLO6p0bp3kAgS6N6BThUraGCogXVF
1BT/y5G3QzZwg02vL3lxgWglXoH/e63UYCPt0r+i5c83Z+n4YnFSUZyRSViy9Elj
DkCgdxmP0OtM+FaHxLdYm+FL4GXaGWQVNORIDP0ViSrstPgSxhWIgVj/pKNcKC7g
bJI/IXm7eQkW5SXOafmhVV0TmvDOt/mM46E0CeWcPTqIhetk3lM=
=+qd6
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
A new set of wireless changes:
* validate key indices for key deletion
* more preamble support in mac80211
* various 6 GHz scan fixes/improvements
* a common SAR power limitations API
* various small fixes & code improvements
* tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (35 commits)
mac80211: add ieee80211_set_sar_specs
nl80211: add common API to configure SAR power limitations
mac80211: fix a mistake check for rx_stats update
mac80211: mlme: save ssid info to ieee80211_bss_conf while assoc
mac80211: Update rate control on channel change
mac80211: don't filter out beacons once we start CSA
mac80211: Fix calculation of minimal channel width
mac80211: ignore country element TX power on 6 GHz
mac80211: use bitfield helpers for BA session action frames
mac80211: support Rx timestamp calculation for all preamble types
mac80211: don't set set TDLS STA bandwidth wider than possible
mac80211: support driver-based disconnect with reconnect hint
cfg80211: support immediate reconnect request hint
mac80211: use struct assignment for he_obss_pd
cfg80211: remove struct ieee80211_he_bss_color
nl80211: validate key indexes for cfg80211_registered_device
cfg80211: include block-tx flag in channel switch started event
mac80211: disallow band-switch during CSA
ieee80211: update reduced neighbor report TBTT info length
cfg80211: Save the regulatory domain when setting custom regulatory
...
====================
Link: https://lore.kernel.org/r/20201211142552.209018-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the nft_expr structure definition before nft_set. Expressions are
used by rules and sets, remove unnecessary forward declarations. This
comes as preparation to support for multiple expressions per set element.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, the set infrastucture allows for one single expressions per
element. This patch extends the existing infrastructure to allow for up
to two expressions. This is not updating the netlink API yet, this is
coming as an initial preparation patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
DESTROY events do not include the remaining timeout.
Add the timeout if the entry was removed explicitly. This can happen
when a conntrack gets deleted prematurely, e.g. due to a tcp reset,
module removal, netdev notifier (nat/masquerade device went down),
ctnetlink and so on.
Add the protocol state too for the destroy message to check for abnormal
state on connection termination.
Joint work with Pablo.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This change registers ieee80211_set_sar_specs to
mac80211_config_ops, so cfg80211 can call it.
Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
Link: https://lore.kernel.org/r/20201203103728.3034-3-cjhuang@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
NL80211_CMD_SET_SAR_SPECS is added to configure SAR from
user space. NL80211_ATTR_SAR_SPEC is used to pass the SAR
power specification when used with NL80211_CMD_SET_SAR_SPECS.
Wireless driver needs to register SAR type, supported frequency
ranges to wiphy, so user space can query it. The index in
frequency range is used to specify which sub band the power
limitation applies to. The SAR type is for compatibility, so later
other SAR mechanism can be implemented without breaking the user
space SAR applications.
Normal process is user space queries the SAR capability, and
gets the index of supported frequency ranges and associates the
power limitation with this index and sends to kernel.
Here is an example of message send to kernel:
8c 00 00 00 08 00 01 00 00 00 00 00 38 00 2b 81
08 00 01 00 00 00 00 00 2c 00 02 80 14 00 00 80
08 00 02 00 00 00 00 00 08 00 01 00 38 00 00 00
14 00 01 80 08 00 02 00 01 00 00 00 08 00 01 00
48 00 00 00
NL80211_CMD_SET_SAR_SPECS: 0x8c
NL80211_ATTR_WIPHY: 0x01(phy idx is 0)
NL80211_ATTR_SAR_SPEC: 0x812b (NLA_NESTED)
NL80211_SAR_ATTR_TYPE: 0x00 (NL80211_SAR_TYPE_POWER)
NL80211_SAR_ATTR_SPECS: 0x8002 (NLA_NESTED)
freq range 0 power: 0x38 in 0.25dbm unit (14dbm)
freq range 1 power: 0x48 in 0.25dbm unit (18dbm)
Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
Link: https://lore.kernel.org/r/20201203103728.3034-2-cjhuang@codeaurora.org
[minor edits, NLA parse cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We don't really use this struct, we're now using
struct cfg80211_he_bss_color instead.
Change the one place in mac80211 that's using the old
name to use struct assignment instead of memcpy() and
thus remove the wrong sizeof while at it.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.f6698d97ae4e.Iba2dffcb79c4ab80bde7407609806010b55edfdf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Support the driver incrementing MIC error and replay detected
counters when having detected a bad frame, if it drops it directly
instead of relying on mac80211 to do the checks.
These are then exposed to userspace, though currently only in some
cases and in debugfs.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.fb59be9c6de8.Ife2260887366f585afadd78c983ebea93d2bb54b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Alexei Starovoitov says:
====================
pull-request: bpf 2020-12-10
The following pull-request contains BPF updates for your *net* tree.
We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).
The main changes are:
1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.
2) Fix ring_buffer__poll() return value, from Andrii.
3) Fix race in lwt_bpf, from Cong.
4) Fix test_offload, from Toke.
5) Various xsk fixes.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In rfc8684, the length of ADD_ADDR suboption with IPv4 address and port
is 18 octets, but mptcp_write_options is 32-bit aligned, so we need to
pad it to 20 octets. All the other port related option lengths need to
be added up 2 octets similarly.
This patch added a new field 'port' in mptcp_out_options. When this
field is set with a port number, we need to add up 4 octets for the
ADD_ADDR suboption, and put the port number into the suboption.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
from Subash Abhinov Kasiviswanathan.
2) Fix netlink dump of dynset timeouts later than 23 days.
3) Add comment for the indirect serialization of the nft commit mutex
with rtnl_mutex.
4) Remove bogus check for confirmed conntrack when matching on the
conntrack ID, from Brett Mastbergen.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:
- An XDP program loaded without specifying a mode (and which then ends up
in driver mode) cannot be unloaded if the program mode is specified on
unload.
- The dev_xdp_uninstall() hook always calls the driver callback with the
mode set to the type of the program but an empty flags argument, which
means the flags_ok() check prevents the program from being removed,
leading to bpf prog reference leaks.
The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.
Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
Johan Hedberg says:
====================
pull request: bluetooth-next 2020-12-07
Here's the main bluetooth-next pull request for the 5.11 kernel.
- Updated Bluetooth entries in MAINTAINERS to include Luiz von Dentz
- Added support for Realtek 8822CE and 8852A devices
- Added support for MediaTek MT7615E device
- Improved workarounds for fake CSR devices
- Fix Bluetooth qualification test case L2CAP/COS/CFD/BV-14-C
- Fixes for LL Privacy support
- Enforce 16 byte encryption key size for FIPS security level
- Added new mgmt commands for extended advertising support
- Multiple other smaller fixes & improvements
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by
8e1102d5a1 ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.
Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Don't try to adjust XFRM support flags if the bond device isn't yet
registered. Bad things can currently happen when netdev_change_features()
is called without having wanted_features fully filled in yet. This code
runs both on post-module-load mode changes, as well as at module init
time, and when run at module init time, it is before register_netdevice()
has been called and filled in wanted_features. The empty wanted_features
led to features also getting emptied out, which was definitely not the
intended behavior, so prevent that from happening.
Originally, I'd hoped to stop adjusting wanted_features at all in the
bonding driver, as it's documented as being something only the network
core should touch, but we actually do need to do this to properly update
both the features and wanted_features fields when changing the bond type,
or we get to a situation where ethtool sees:
esp-hw-offload: off [requested on]
I do think we should be using netdev_update_features instead of
netdev_change_features here though, so we only send notifiers when the
features actually changed.
Fixes: a3b658cfb6 ("bonding: allow xfrm offload setup post-module-load")
Reported-by: Ivan Vecera <ivecera@redhat.com>
Suggested-by: Ivan Vecera <ivecera@redhat.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Link: https://lore.kernel.org/r/20201205172229.576587-1-jarod@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For advertising, we wish to know the LE tx power capabilities of the
controller in userspace, so this patch edits the Security Info MGMT
command to be more generic, such that other various controller
capabilities can be included in the EIR data. This change also includes
the LE min and max tx power into this newly-named command.
The change was tested by manually verifying that the MGMT command
returns the tx power range as expected in userspace.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Queries tx power via HCI_LE_Read_Transmit_Power command when the hci
device is initialized, and stores resulting min/max LE power in hdev
struct. If command isn't available (< BT5 support), min/max values
both default to HCI_TX_POWER_INVALID.
This patch is manually verified by ensuring BT5 devices correctly query
and receive controller tx power range.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch takes the min/max intervals and tx power optionally provided
in mgmt interface, stores them in the advertisement struct, and uses
them when configuring the hci requests. While tx power is not used if
extended advertising is unavailable, software rotation will use the min
and max advertising intervals specified by the client.
This change is validated manually by ensuring the min/max intervals are
propagated to the controller on both hatch (extended advertising) and
kukui (no extended advertising) chromebooks, and that tx power is
propagated correctly on hatch. These tests are performed with multiple
advertisements simultaneously.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch adds support for the new advertising add interface, with the
first command setting advertising parameters and the second to set
advertising data. The set parameters command allows the caller to leave
some fields "unset", with a params bitfield defining which params were
purposefully set. Unset parameters will be given defaults when calling
hci_add_adv_instance. The data passed to the param mgmt command is
allowed to be flexible, so in the future if bluetoothd passes a larger
structure with new params, the mgmt command will ignore the unknown
members at the end.
This change has been validated on both hatch (extended advertising) and
kukui (no extended advertising) chromebooks running bluetoothd that
support this new interface. I ran the following manual tests:
- Set several (3) advertisements using modified test_advertisement.py
- For each, validate correct data and parameters in btmon trace
- Verified both for software rotation and extended adv
Automatic test suite also run, testing many (25) scenarios of single and
multi-advertising for data/parameter correctness.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
We wish to handle advertising data separately from advertising
parameters in our new MGMT requests. This change adds a helper that
allows the advertising data and scan response to be updated for an
existing advertising instance.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch add a configurable parameter to switch off the interleave
scan feature.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch implements the interleaving between allowlist scan and
no-filter scan. It'll be used to save power when at least one monitor is
registered and at least one pending connection or one device to be
scanned for.
The durations of the allowlist scan and the no-filter scan are
controlled by MGMT command: Set Default System Configuration. The
default values are set randomly for now.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
implement the NCI 2.x initial sequence to support NCI 2.x NFCC.
Since NCI 2.0, CORE_RESET and CORE_INIT sequence have been changed.
If NFCEE supports NCI 2.x, then NCI 2.x initial sequence will work.
In NCI 1.0, Initial sequence and payloads are as below:
(DH) (NFCC)
| -- CORE_RESET_CMD --> |
| <-- CORE_RESET_RSP -- |
| -- CORE_INIT_CMD --> |
| <-- CORE_INIT_RSP -- |
CORE_RESET_RSP payloads are Status, NCI version, Configuration Status.
CORE_INIT_CMD payloads are empty.
CORE_INIT_RSP payloads are Status, NFCC Features,
Number of Supported RF Interfaces, Supported RF Interface,
Max Logical Connections, Max Routing table Size,
Max Control Packet Payload Size, Max Size for Large Parameters,
Manufacturer ID, Manufacturer Specific Information.
In NCI 2.0, Initial Sequence and Parameters are as below:
(DH) (NFCC)
| -- CORE_RESET_CMD --> |
| <-- CORE_RESET_RSP -- |
| <-- CORE_RESET_NTF -- |
| -- CORE_INIT_CMD --> |
| <-- CORE_INIT_RSP -- |
CORE_RESET_RSP payloads are Status.
CORE_RESET_NTF payloads are Reset Trigger,
Configuration Status, NCI Version, Manufacturer ID,
Manufacturer Specific Information Length,
Manufacturer Specific Information.
CORE_INIT_CMD payloads are Feature1, Feature2.
CORE_INIT_RSP payloads are Status, NFCC Features,
Max Logical Connections, Max Routing Table Size,
Max Control Packet Payload Size,
Max Data Packet Payload Size of the Static HCI Connection,
Number of Credits of the Static HCI Connection,
Max NFC-V RF Frame Size, Number of Supported RF Interfaces,
Supported RF Interfaces.
Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20201202223147.3472-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-12-03
The main changes are:
1) Support BTF in kernel modules, from Andrii.
2) Introduce preferred busy-polling, from Björn.
3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.
4) Memcg-based memory accounting for bpf objects, from Roman.
5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
selftests/bpf: Fix invalid use of strncat in test_sockmap
libbpf: Use memcpy instead of strncpy to please GCC
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
selftests/bpf: Add tp_btf CO-RE reloc test for modules
libbpf: Support attachment of BPF tracing programs to kernel modules
libbpf: Factor out low-level BPF program loading helper
bpf: Allow to specify kernel module BTFs when attaching BPF programs
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
selftests/bpf: Add support for marking sub-tests as skipped
selftests/bpf: Add bpf_testmod kernel module for testing
libbpf: Add kernel module BTF support for CO-RE relocations
libbpf: Refactor CO-RE relocs to not assume a single BTF object
libbpf: Add internal helper to load BTF data by FD
bpf: Keep module's btf_data_size intact after load
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
bpf: Adds support for setting window clamp
samples/bpf: Fix spelling mistake "recieving" -> "receiving"
bpf: Fix cold build of test_progs-no_alu32
...
====================
Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting receiver window based on RTT.
Signed-off-by: Prankur gupta <prankgup@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.
At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards. There are two ways to allow MPTCP to send the
reset.
1. override 'send_synack' callback and emit the rst from there.
The drawback is that the request socket gets inserted into the
listeners queue just to get removed again right away.
2. Send the reset from the 'route_req' function instead.
This avoids the 'add&remove request socket', but route_req lacks the
skb that is required to send the TCP reset.
Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.
This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.
'send reset on unknown mptcp join token' is added in next patch.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When adding support for propagating ECT(1) marking in IP headers it seems I
suffered from endianness-confusion in the checksum update calculation: In
fact the ECN field is in the *lower* bits of the first 16-bit word of the
IP header when calculating in network byte order. This means that the
addition performed to update the checksum field was wrong; let's fix that.
Fixes: b723748750 ("tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040")
Reported-by: Jonathan Morton <chromatix99@gmail.com>
Tested-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20201130183705.17540-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that support bridge offload need to be notified about changes to
the bridge's VLAN protocol so that they could react accordingly and
potentially veto the change.
Add a new switchdev attribute to communicate the change to drivers.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Functions tfilter_notify_chain() and tcf_get_next_proto() are always called
with rtnl lock held in current implementation. Moreover, attempting to call
them without rtnl lock would cause a warning down the call chain in
function __tcf_get_next_proto() that requires the lock to be held by
callers. Remove the 'rtnl_held' argument in order to simplify the code and
make rtnl lock requirement explicit.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20201127151205.23492-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stephen reported the following build error for !CONFIG_NET_RX_BUSY_POLL
built kernels:
In file included from fs/select.c:32:
include/net/busy_poll.h: In function 'sk_mark_napi_id_once':
include/net/busy_poll.h:150:36: error: 'const struct sk_buff' has no member named 'napi_id'
150 | __sk_mark_napi_id_once_xdp(sk, skb->napi_id);
| ^~
Fix it by wrapping a CONFIG_NET_RX_BUSY_POLL around the helpers.
Fixes: b02e5a0ebb ("xsk: Propagate napi_id to XDP socket Rx path")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/linux-next/20201201190746.7d3357fb@canb.auug.org.au
True to the message of commit v5.10-rc1-105-g46d6c5ae953c, _do_
actually make use of state->sk when possible, such as in the REJECT
modules.
Reported-by: Minqiang Chen <ptpt52@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch removes nfnl_acct_list from struct net to reduce the default
memory footprint for the netns structure.
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows invoking an additional callback under the
socket spin lock.
Will be used by the next patches to avoid additional
spin lock contention.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket pick up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
This option lets a user set a per socket NAPI budget for
busy-polling. If the options is not set, it will use the default of 8.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
an opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix insufficient validation of IPSET_ATTR_IPADDR_IPV6 reported
by syzbot.
2) Remove spurious reports on nf_tables when lockdep gets disabled,
from Florian Westphal.
3) Fix memleak in the error path of error path of
ip_vs_control_net_init(), from Wang Hai.
4) Fix missing control data in flow dissector, otherwise IP address
matching in hardware offload infra does not work.
5) Fix hardware offload match on prefix IP address when userspace
does not send a bitwise expression to represent the prefix.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nftables_offload: build mask based from the matching bytes
netfilter: nftables_offload: set address type in control dissector
ipvs: fix possible memory leak in ip_vs_control_net_init
netfilter: nf_tables: avoid false-postive lockdep splat
netfilter: ipset: prevent uninit-value in hash_ip6_add
====================
Link: https://lore.kernel.org/r/20201127190313.24947-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2020-11-28
1) Do not reference the skb for xsk's generic TX side since when looped
back into RX it might crash in generic XDP, from Björn Töpel.
2) Fix umem cleanup on a partially set up xsk socket when being destroyed,
from Magnus Karlsson.
3) Fix an incorrect netdev reference count when failing xsk_bind() operation,
from Marek Majtyka.
4) Fix bpftool to set an error code on failed calloc() in build_btf_type_table(),
from Zhen Lei.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add MAINTAINERS entry for BPF LSM
bpftool: Fix error return value in build_btf_type_table
net, xsk: Avoid taking multiple skbuff references
xsk: Fix incorrect netdev reference count
xsk: Fix umem cleanup bug at socket destruct
MAINTAINERS: Update XDP and AF_XDP entries
====================
Link: https://lore.kernel.org/r/20201128005104.1205-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.
Conflicts:
drivers/net/can/usb/gs_usb.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently kernel tc subsystem can do conntrack in cat_ct. But when several
fragment packets go through the act_ct, function tcf_ct_handle_fragments
will defrag the packets to a big one. But the last action will redirect
mirred to a device which maybe lead the reassembly big packet over the mtu
of target device.
This patch add support for a xmit hook to mirred, that gets executed before
xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
The frag xmit hook maybe reused by other modules.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFC 7905 defines special behavior for ChaCha-Poly TLS sessions.
The differences are in the calculation of nonce and the absence
of explicit IV. This behavior is like TLSv1.3 partly.
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To provide support for ChaCha-Poly cipher we need to define
specific constants and structures.
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Inline functions defined in tls.h have a lot of AES-specific
constants. Remove these constants and change argument to struct
tls_prot_info to have an access to cipher type in later patches
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Userspace might match on prefix bytes of header fields if they are on
the byte boundary, this requires that the mask is adjusted accordingly.
Use NFT_OFFLOAD_MATCH_EXACT() for meta since prefix byte matching is not
allowed for this type of selector.
The bitwise expression might be optimized out by userspace, hence the
kernel needs to infer the prefix from the number of payload bytes to
match on. This patch adds nft_payload_offload_mask() to calculate the
bitmask to match on the prefix.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nft_flow_rule_set_addr_type() to set the address type
from the nft_payload expression accordingly.
If the address type is not set in the control dissector then a rule that
matches either on source or destination IP address does not work.
After this patch, nft hardware offload generates the flow dissector
configuration as tc-flower does to match on an IP address.
This patch has been also tested functionally to make sure packets are
filtered out by the NIC.
This is also getting the code aligned with the existing netfilter flow
offload infrastructure which is also setting the control dissector.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLX TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a
time frame when tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.
This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del the second time.
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a packet trap to report packets that were dropped due to a
blackhole nexthop.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.
Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.
Fix all the places which made use of this implicit include.
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.
This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.
Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
We're about to do reshuffling in networking headers and
eliminate some implicit includes. This results in:
In file included from ../net/ipv4/netfilter/arp_tables.c:26:
include/net/compat.h:60:40: error: unknown type name ‘compat_uptr_t’; did you mean ‘compat_ptr_ioctl’?
struct sockaddr __user **save_addr, compat_uptr_t *ptr,
^~~~~~~~~~~~~
compat_ptr_ioctl
include/net/compat.h:61:4: error: unknown type name ‘compat_size_t’; did you mean ‘compat_sigset_t’?
compat_size_t *len);
^~~~~~~~~~~~~
compat_sigset_t
Currently net/compat.h depends on linux/compat.h being included
first. After the upcoming changes this would break the 32bit build.
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20201121214844.1488283-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
OoO handling attempts to detect when packet is out-of-window by testing
current ack sequence and remaining space vs. sequence number.
This doesn't work reliably. Store the highest allowed sequence number
that we've announced and use it to detect oow packets.
Do this when mptcp options get written to the packet (wire format).
For this to work we need to move the write_options call until after
stack selected a new tcp window.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix build of net/core/stream.o when CONFIG_INET is not enabled.
Fixes these build errors (sample):
ld: net/core/stream.o: in function `sk_stream_write_space':
(.text+0x27e): undefined reference to `tcp_stream_memory_free'
ld: (.text+0x29c): undefined reference to `tcp_stream_memory_free'
ld: (.text+0x2ab): undefined reference to `tcp_stream_memory_free'
ld: net/core/stream.o: in function `sk_stream_wait_memory':
(.text+0x5a1): undefined reference to `tcp_stream_memory_free'
ld: (.text+0x5bf): undefined reference to `tcp_stream_memory_free'
Fixes: 1c5f2ced13 ("tcp: avoid indirect call to tcp_stream_memory_free()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20201118194438.674-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The static checker is fooled by the non-static locking scheme
implemented by the mentioned helpers.
Let's make its life easier adding some unconditional annotation
so that the helpers are now interpreted as a plain spinlock from
sparse.
v1 -> v2:
- add __releases() annotation to unlock_sock_fast()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no easy way to distinguish if a conntracked tcp packet is
marked invalid because of tcp_in_window() check error or because
it doesn't belong to an existing connection. With this patch,
openvswitch sets liberal tcp flag for the established sessions so
that out of window packets are not marked invalid.
A helper function - nf_ct_set_tcp_be_liberal(nf_conn) is added which
sets this flag for both the directions of the nf_conn.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201116130126.3065077-1-nusiddiq@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a bug that is triggered when a partially setup socket is
destroyed. For a fully setup socket, a socket that has been bound to a
device, the cleanup of the umem is performed at the end of the buffer
pool's cleanup work queue item. This has to be performed in a work
queue, and not in RCU cleanup, as it is doing a vunmap that cannot
execute in interrupt context. However, when a socket has only been
partially set up so that a umem has been created but the buffer pool
has not, the code erroneously directly calls the umem cleanup function
instead of using a work queue, and this leads to a BUG_ON() in
vunmap().
As there in this case is no buffer pool, we cannot use its work queue,
so we need to introduce a work queue for the umem and schedule this for
the cleanup. So in the case there is no pool, we are going to use the
umem's own work queue to schedule the cleanup. But if there is a
pool, the cleanup of the umem is still being performed by the pool's
work queue, as it is important that the umem is cleaned up after the
pool.
Fixes: e5e1a4bc91 ("xsk: Fix possible memory leak at socket close")
Reported-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Marek Majtyka <marekx.majtyka@intel.com>
Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com
When performing a flash update via devlink, device drivers may inform
user space of status updates via
devlink_flash_update_(begin|end|timeout|status)_notify functions.
It is expected that drivers do not send any status notifications unless
they send a begin and end message. If a driver sends a status
notification without sending the appropriate end notification upon
finishing (regardless of success or failure), the current implementation
of the devlink userspace program can get stuck endlessly waiting for the
end notification that will never come.
The current ice driver implementation may send such a status message
without the appropriate end notification in rare cases.
Fixing the ice driver is relatively simple: we just need to send the
begin_notify at the start of the function and always send an end_notify
no matter how the function exits.
Rather than assuming driver authors will always get this right in the
future, lets just fix the API so that it is not possible to get wrong.
Make devlink_flash_update_begin_notify and
devlink_flash_update_end_notify static, and call them in devlink.c core
code. Always send the begin_notify just before calling the driver's
flash_update routine. Always send the end_notify just after the routine
returns regardless of success or failure.
Doing this makes the status notification easier to use from the driver,
as it no longer needs to worry about catching failures and cleaning up
by calling devlink_flash_update_end_notify. It is now no longer possible
to do the wrong thing in this regard. We also save a couple of lines of
code in each driver.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All drivers which implement the devlink flash update support, with the
exception of netdevsim, use either request_firmware or
request_firmware_direct to locate the firmware file. Rather than having
each driver do this separately as part of its .flash_update
implementation, perform the request_firmware within net/core/devlink.c
Replace the file_name parameter in the struct devlink_flash_update_params
with a pointer to the fw object.
Use request_firmware rather than request_firmware_direct. Although most
Linux distributions today do not have the fallback mechanism
implemented, only about half the drivers used the _direct request, as
compared to the generic request_firmware. In the event that
a distribution does support the fallback mechanism, the devlink flash
update ought to be able to use it to provide the firmware contents. For
distributions which do not support the fallback userspace mechanism,
there should be essentially no difference between request_firmware and
request_firmware_direct.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Shannon Nelson <snelson@pensando.io>
Acked-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPV6=m
NF_DEFRAG_IPV6=y
ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
`nf_ct_frag6_gather':
net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
`ipv6_frag_thdr_truncated'
Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
dependency is forcing IPV6=y.
Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
is the same solution as used with a similar issues: Referring to
commit 70b095c843 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
module")
Fixes: 9d9e937b1c ("ipv6/netfilter: Discard first fragment not including all headers")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix race issue in fid contention.
Eric's and Greg's patch offer a mechanism to fix open-unlink-f*syscall
bug in 9p. But there is race issue in fid parallel accesses.
As Greg's patch stores all of fids from opened files into according inode,
so all the lookup fid ops can retrieve fid from inode preferentially. But
there is no mechanism to handle the fid contention issue. For example,
there are two threads get the same fid in the same time and one of them
clunk the fid before the other thread ready to discard the fid. In this
scenario, it will lead to some fatal problems, even kernel core dump.
I introduce a mechanism to fix this race issue. A counter field introduced
into p9_fid struct to store the reference counter to the fid. When a fid
is allocated from the inode or dentry, the counter will increase, and
will decrease at the end of its occupation. It is guaranteed that the
fid won't be clunked before the reference counter go down to 0, then
we can avoid the clunked fid to be used.
tests:
race issue test from the old test case:
for file in {01..50}; do touch f.${file}; done
seq 1 1000 | xargs -n 1 -P 50 -I{} cat f.* > /dev/null
open-unlink-f*syscall test:
I have tested for f*syscall include: ftruncate fstat fchown fchmod faccessat.
Link: http://lkml.kernel.org/r/20200923141146.90046-5-jianyong.wu@arm.com
Fixes: 478ba09edc ("fs/9p: search open fids first")
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
In async_resync mode, we log the TCP seq of records until the async request
is completed. Later, in case one of the logged seqs matches the resync
request, we return it, together with its record serial number. Before this
fix, we mistakenly returned the serial number of the current record
instead.
Fixes: ed9b7646b0 ("net/tls: Add asynchronous resync")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201115131448.2702-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce batched descriptor interfaces in the xsk core code for the
Tx path to be used in the driver to write a code path with higher
performance. This interface will be used by the i40e driver in the
next patch. Though other drivers would likely benefit from this new
interface too.
Note that batching is only implemented for the common case when
there is only one socket bound to the same device and queue id. When
this is not the case, we fall back to the old non-batched version of
the function.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/1605525167-14450-5-git-send-email-magnus.karlsson@gmail.com
unlocked version of protocol level close, will be used by
MPTCP to allow decouple orphaning and subflow level close.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Will be needed by the next patch, as MPTCP needs to handle
directly the error/memory-allocation-needed path.
No functional changes intended.
Additionally let MPTCP code access the tcp_remove_empty_skb()
helper.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Packets are processed even though the first fragment don't include all
headers through the upper layer header. This breaks TAHI IPv6 Core
Conformance Test v6LC.1.3.6.
Referring to RFC8200 SECTION 4.5: "If the first fragment does not include
all headers through an Upper-Layer header, then that fragment should be
discarded and an ICMP Parameter Problem, Code 3, message should be sent to
the source of the fragment, with the Pointer field set to zero."
The fragment needs to be validated the same way it is done in
commit 2efdaaaf88 ("IPv6: reply ICMP error if the first fragment don't
include all headers") for ipv6. Wrap the validation into a common function,
ipv6_frag_thdr_truncated() to check for truncation in the upper layer
header. This validation does not fullfill all aspects of RFC 8200,
section 4.5, but is at the moment sufficient to pass mentioned TAHI test.
In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
let ipv6_frag_thdr_truncated() start it's traverse from the fragment
header.
Return 0 to drop the fragment in the netfilter. This is the same behaviour
as used on other protocol errors in this function, e.g. when
nf_ct_frag6_queue() returns -EPROTO. The Fragment will later be picked up
by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
appropriate ICMP Parameter Problem message back to the source.
References commit 2efdaaaf88 ("IPv6: reply ICMP error if the first
fragment don't include all headers")
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new
name of this function.
Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nla_strlcpy now returns -E2BIG if src was truncated when written to dst.
It also returns this error value if dstsize is 0 or higher than INT_MAX.
For example, if src is "foo\0" and dst is 3 bytes long, the result will be:
1. "foG" after memcpy (G means garbage).
2. "fo\0" after memset.
3. -E2BIG is returned because src was not completely written into dst.
The callers of nla_strlcpy were modified to take into account this modification.
Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both IPv4 and IPv6 needs it via a function pointer.
Following patch will avoid the indirect call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-11-14
1) Add BTF generation for kernel modules and extend BTF infra in kernel
e.g. support for split BTF loading and validation, from Andrii Nakryiko.
2) Support for pointers beyond pkt_end to recognize LLVM generated patterns
on inlined branch conditions, from Alexei Starovoitov.
3) Implements bpf_local_storage for task_struct for BPF LSM, from KP Singh.
4) Enable FENTRY/FEXIT/RAW_TP tracing program to use the bpf_sk_storage
infra, from Martin KaFai Lau.
5) Add XDP bulk APIs that introduce a defer/flush mechanism to optimize the
XDP_REDIRECT path, from Lorenzo Bianconi.
6) Fix a potential (although rather theoretical) deadlock of hashtab in NMI
context, from Song Liu.
7) Fixes for cross and out-of-tree build of bpftool and runqslower allowing build
for different target archs on same source tree, from Jean-Philippe Brucker.
8) Fix error path in htab_map_alloc() triggered from syzbot, from Eric Dumazet.
9) Move functionality from test_tcpbpf_user into the test_progs framework so it
can run in BPF CI, from Alexander Duyck.
10) Lift hashtab key_size limit to be larger than MAX_BPF_STACK, from Florian Lehner.
Note that for the fix from Song we have seen a sparse report on context
imbalance which requires changes in sparse itself for proper annotation
detection where this is currently being discussed on linux-sparse among
developers [0]. Once we have more clarification/guidance after their fix,
Song will follow-up.
[0] https://lore.kernel.org/linux-sparse/CAHk-=wh4bx8A8dHnX612MsDO13st6uzAz1mJ1PaHHVevJx_ZCw@mail.gmail.com/T/https://lore.kernel.org/linux-sparse/20201109221345.uklbp3lzgq6g42zb@ltop.local/T/
* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (66 commits)
net: mlx5: Add xdp tx return bulking support
net: mvpp2: Add xdp tx return bulking support
net: mvneta: Add xdp tx return bulking support
net: page_pool: Add bulk support for ptr_ring
net: xdp: Introduce bulking for xdp tx return path
bpf: Expose bpf_d_path helper to sleepable LSM hooks
bpf: Augment the set of sleepable LSM hooks
bpf: selftest: Use bpf_sk_storage in FENTRY/FEXIT/RAW_TP
bpf: Allow using bpf_sk_storage in FENTRY/FEXIT/RAW_TP
bpf: Rename some functions in bpf_sk_storage
bpf: Folding omem_charge() into sk_storage_charge()
selftests/bpf: Add asm tests for pkt vs pkt_end comparison.
selftests/bpf: Add skb_pkt_end test
bpf: Support for pointers beyond pkt_end.
tools/bpf: Always run the *-clean recipes
tools/bpf: Add bootstrap/ to .gitignore
bpf: Fix NULL dereference in bpf_task_storage
tools/bpftool: Fix build slowdown
tools/runqslower: Build bpftool using HOSTCC
tools/runqslower: Enable out-of-tree build
...
====================
Link: https://lore.kernel.org/r/20201114020819.29584-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce the capability to batch page_pool ptr_ring refill since it is
usually run inside the driver NAPI tx completion loop.
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/bpf/08dd249c9522c001313f520796faa777c4089e1c.1605267335.git.lorenzo@kernel.org
XDP bulk APIs introduce a defer/flush mechanism to return
pages belonging to the same xdp_mem_allocator object
(identified via the mem.id field) in bulk to optimize
I-cache and D-cache since xdp_return_frame is usually run
inside the driver NAPI tx completion loop.
The bulk queue size is set to 16 to be aligned to how
XDP_REDIRECT bulking works. The bulk is flushed when
it is full or when mem.id changes.
xdp_frame_bulk is usually stored/allocated on the function
call-stack to avoid locking penalties.
Current implementation considers only page_pool memory model.
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/bpf/e190c03eac71b20c8407ae0fc2c399eda7835f49.1605267335.git.lorenzo@kernel.org
Commit 58956317c8 ("neighbor: Improve garbage collection")
guarantees neighbour table entries a five-second lifetime. Processes
which make heavy use of multicast can fill the neighour table with
multicast addresses in five seconds. At that point, neighbour entries
can't be GC-ed because they aren't five seconds old yet, the kernel
log starts to fill up with "neighbor table overflow!" messages, and
sends start to fail.
This patch allows multicast addresses to be thrown out before they've
lived out their five seconds. This makes room for non-multicast
addresses and makes messages to all addresses more reliable in these
circumstances.
Fixes: 58956317c8 ("neighbor: Improve garbage collection")
Signed-off-by: Jeff Dike <jdike@akamai.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* injection/radiotap updates for new test capabilities
* remove WDS support - even years ago when we turned
it off by default it was already basically unusable
* support for HE (802.11ax) rates for beacons
* support for some vendor-specific HE rates
* many other small features/cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl+uSqIACgkQB8qZga/f
l8QBOw/6AwlcQWMjqdb6H/QRORA81E4tX2+alHbeBai7KSI+9E1Jtakmn5qKQ4iH
IjpNWPsclj4zKhgbKaariIn/bZEk8OhzmDpssnHTMpuo3iuCmuzFaDdZd9Uun2Ad
tr3bqfHaom1MhWRF/FuBSHcnk599qRnsk+RY7/6dhjiPlWOWJvsfpuo1KblVoFWU
wYDX+W2oYDAx44O/6AGJ0Zctwf6m7Kyzb2aMIqv2fwacBoDvyVdTIT/4NroV9INI
QvIY4Gi8hoCDQX39zwaxSWOq7uFLYHwUozzZxktS5c4N3eSVFs80jmdiQiMKmKRQ
A+R+ZcuFBcC+6+Wt4x+20T2mF6pUvSaIDA4jegCbDL4jQlp+023XTMlV42cnpP0z
hFZgBWJszLnLtj4KW/v3sXefZ1Pxl0WD4BHNqz8SMzMUaWalrXP4Gt2bnjB7Bx1N
2M/DjW570eNZeZ9ZFcvkwHysCWMzHKmh5sPXnOitrs4s2hweIrO7wnMlYVLAGF1J
m8jUoqpI9Cc7dFEg0inaSIddcjobcx9i2eG14zaZnXj0t8WqAbQqI0Lw/mipWXFY
7DfdjFULI+Yru46TAFbiisFo/2dlijxrIr3d3QK21Cwklb3BPhpiDf83q6HYhNpB
xPs38OCZaNdSL7TwNRcuZ2jmBCf+48SYgse85HQOgdD2QzJv6dU=
=TGgF
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2020-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Some updates:
* injection/radiotap updates for new test capabilities
* remove WDS support - even years ago when we turned
it off by default it was already basically unusable
* support for HE (802.11ax) rates for beacons
* support for some vendor-specific HE rates
* many other small features/cleanups
* tag 'mac80211-next-for-net-next-2020-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (21 commits)
nl80211: fix kernel-doc warning in the new SAE attribute
cfg80211: remove WDS code
mac80211: remove WDS-related code
rt2x00: remove WDS code
b43legacy: remove WDS code
b43: remove WDS code
carl9170: remove WDS code
ath9k: remove WDS code
wireless: remove CONFIG_WIRELESS_WDS
mac80211: assure that certain drivers adhere to DONT_REORDER flag
mac80211: don't overwrite QoS TID of injected frames
mac80211: adhere to Tx control flag that prevents frame reordering
mac80211: add radiotap flag to assure frames are not reordered
mac80211: save HE oper info in BSS config for mesh
cfg80211: add support to configure HE MCS for beacon rate
nl80211: fix beacon tx rate mask validation
nl80211/cfg80211: fix potential infinite loop
cfg80211: Add support to calculate and report 4096-QAM HE rates
cfg80211: Add support to configure SAE PWE value to drivers
ieee80211: Add definition for WFA DPP
...
====================
Link: https://lore.kernel.org/r/20201113101148.25268-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch enables the FENTRY/FEXIT/RAW_TP tracing program to use
the bpf_sk_storage_(get|delete) helper, so those tracing programs
can access the sk's bpf_local_storage and the later selftest
will show some examples.
The bpf_sk_storage is currently used in bpf-tcp-cc, tc,
cg sockops...etc which is running either in softirq or
task context.
This patch adds bpf_sk_storage_get_tracing_proto and
bpf_sk_storage_delete_tracing_proto. They will check
in runtime that the helpers can only be called when serving
softirq or running in a task context. That should enable
most common tracing use cases on sk.
During the load time, the new tracing_allowed() function
will ensure the tracing prog using the bpf_sk_storage_(get|delete)
helper is not tracing any bpf_sk_storage*() function itself.
The sk is passed as "void *" when calling into bpf_local_storage.
This patch only allows tracing a kernel function.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201112211313.2587383-1-kafai@fb.com
The skb is needed only to fetch the keys for the lookup.
Both functions are used from GRO stack, we do not want
accidental modification of the skb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_sdif() does not modify the skb.
This will permit propagating the const qualifier in
udp{4|6}_lib_lookup_skb() functions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After having migrated all users remove ip_tunnel_get_stats64().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for unconditionally passing the
struct tasklet_struct pointer to all tasklet
callbacks, switch to using the new tasklet_setup()
and from_tasklet() to pass the tasklet pointer explicitly.
Signed-off-by: Romain Perier <romain.perier@gmail.com>
Signed-off-by: Allen Pais <apais@linux.microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf 2020-11-06
1) Pre-allocated per-cpu hashmap needs to zero-fill reused element, from David.
2) Tighten bpf_lsm function check, from KP.
3) Fix bpftool attaching to flow dissector, from Lorenz.
4) Use -fno-gcse for the whole kernel/bpf/core.c instead of function attribute, from Ard.
* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Update verification logic for LSM programs
bpf: Zero-fill re-used per-cpu map element
bpf: BPF_PRELOAD depends on BPF_SYSCALL
tools/bpftool: Fix attaching flow dissector
libbpf: Fix possible use after free in xsk_socket__delete
libbpf: Fix null dereference in xsk_socket__delete
libbpf, hashmap: Fix undefined behavior in hash_bits
bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
tools, bpftool: Remove two unused variables.
tools, bpftool: Avoid array index warnings.
xsk: Fix possible memory leak at socket close
bpf: Add struct bpf_redir_neigh forward declaration to BPF helper defs
samples/bpf: Set rlimit for memlock to infinity in all samples
bpf: Fix -Wshadow warnings
selftest/bpf: Fix profiler test using CO-RE relocation for enums
====================
Link: https://lore.kernel.org/r/20201106221759.24143-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will be used by the next patch which extends the function to replay
all the existing nexthops to the notifier block being registered.
Device drivers will be able to pass extack to the function since it is
passed to them upon reload from devlink.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>