I left one dev_put() in br_add_if() error path and sure enough
syzbot found its way.
As the tracker is allocated in new_nbp(), we must make sure
to properly free it.
We have to call dev_put_track(dev, &p->dev_tracker) before
@p object is freed, of course. This is not an issue because
br_add_if() owns a reference on @dev.
Fixes: b2dcdc7f73 ("net: bridge: add net device refcount tracker")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. This
includes one patch to update ovs and act_ct to use nf_ct_put() instead
of nf_conntrack_put().
1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet.
2) Remove redundant rcu read-size lock in nf_tables packet path.
3) Replace BUG() by WARN_ON_ONCE() in nft_payload.
4) Consolidate rule verdict tracing.
5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core.
6) Make counter support built-in in nf_tables.
7) Add new field to conntrack object to identify locally generated
traffic, from Florian Westphal.
8) Prevent NAT from shadowing well-known ports, from Florian Westphal.
9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from
Florian.
10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King.
11) Replace opencoded max() in conntrack, from Jiapeng Chong.
12) Update conntrack to use refcount_t API, from Florian Westphal.
13) Move ip_ct_attach indirection into the nf_ct_hook structure.
14) Constify several pointer object in the netfilter codebase,
from Florian Westphal.
15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also
from Florian.
16) Fix egress splat due to incorrect rcu notation, from Florian.
17) Move stateful fields of connlimit, last, quota, numgen and limit
out of the expression data area.
18) Build a blob to represent the ruleset in nf_tables, this is a
requirement of the new register tracking infrastructure.
19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers.
20) Add register tracking infrastructure to skip redundant
store-to-register operations, this includes support for payload,
meta and bitwise expresssions.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits)
netfilter: nft_meta: cancel register tracking after meta update
netfilter: nft_payload: cancel register tracking after payload update
netfilter: nft_bitwise: track register operations
netfilter: nft_meta: track register operations
netfilter: nft_payload: track register operations
netfilter: nf_tables: add register tracking infrastructure
netfilter: nf_tables: add NFT_REG32_NUM
netfilter: nf_tables: add rule blob layout
netfilter: nft_limit: move stateful fields out of expression data
netfilter: nft_limit: rename stateful structure
netfilter: nft_numgen: move stateful fields out of expression data
netfilter: nft_quota: move stateful fields out of expression data
netfilter: nft_last: move stateful fields out of expression data
netfilter: nft_connlimit: move stateful fields out of expression data
netfilter: egress: avoid a lockdep splat
net: prefer nf_ct_put instead of nf_conntrack_put
netfilter: conntrack: avoid useless indirection during conntrack destruction
netfilter: make function op structures const
netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
netfilter: conntrack: convert to refcount_t api
...
====================
Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The meta expression might mangle the packet metadata, cancel register
tracking since any metadata in the registers is stale.
Finer grain register tracking cancellation by inspecting the meta type
on the register is also possible.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
commit 077cdda764 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
commit 31108d142f ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
commit 4390c6edc0 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/
net/smc/smc_wr.c
commit 49dc9013e3 ("net/smc: Use the bitmap API when applicable")
commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
bitmap_zero()/memset() is removed by the fix
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to first check if the context is a vlan one, then we need to
check the global bridge multicast vlan snooping flag, and finally the
vlan's multicast flag, otherwise we will unnecessarily enable vlan mcast
processing (e.g. querier timers).
Fixes: 7b54aaaf53 ("net: bridge: multicast: add vlan state initialization and control")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20211228153142.536969-1-nikolay@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As reported[1] if startup query interval is set too low in combination with
large number of startup queries and we have multiple bridges or even a
single bridge with multiple querier vlans configured we can crash the
machine. Add a 1 second minimum which must be enforced by overwriting the
value if set lower (i.e. without returning an error) to avoid breaking
user-space. If that happens a log message is emitted to let the admin know
that the startup interval has been set to the minimum. It doesn't make
sense to make the startup interval lower than the normal query interval
so use the same value of 1 second. The issue has been present since these
intervals could be user-controlled.
[1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.com/
Fixes: d902eee43f ("bridge: Add multicast count/interval sysfs entries")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As reported[1] if query interval is set too low and we have multiple
bridges or even a single bridge with multiple querier vlans configured
we can crash the machine. Add a 1 second minimum which must be enforced
by overwriting the value if set lower (i.e. without returning an error) to
avoid breaking user-space. If that happens a log message is emitted to let
the administrator know that the interval has been set to the minimum.
The issue has been present since these intervals could be user-controlled.
[1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.com/
Fixes: d902eee43f ("bridge: Add multicast count/interval sysfs entries")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
In compat mode SIOC{G,S}IFBR ioctls were only supporting
BRCTL_GET_VERSION returning an artificially version to spur userland
tool to use SIOCDEVPRIVATE instead. But some userland tools ignore that
and use SIOC{G,S}IFBR unconditionally as seen with busybox's brctl.
Example of non working 32-bit brctl with CONFIG_COMPAT=y:
$ brctl show
brctl: SIOCGIFBR: Invalid argument
Example of fixed 32-bit brctl with CONFIG_COMPAT=y:
$ brctl show
bridge name bridge id STP enabled interfaces
br0
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 561d835281 ("bridge: use ndo_siocdevprivate") changed the
source and destination arguments of copy_{to,from}_user in bridge's
old_deviceless() from args[1] to uarg breaking SIOC{G,S}IFBR ioctls.
Commit cbd7ad29a5 ("net: bridge: fix ioctl old_deviceless bridge
argument") fixed only BRCTL_{ADD,DEL}_BRIDGES commands leaving
BRCTL_GET_BRIDGES one untouched.
The fixes BRCTL_GET_BRIDGES as well and has been tested with busybox's
brctl.
Example of broken brctl:
$ brctl show
bridge name bridge id STP enabled interfaces
brctl: can't get bridge name for index 0: No such device or address
Example of fixed brctl:
$ brctl show
bridge name bridge id STP enabled interfaces
br0 8000.000000000000 no
Fixes: 561d835281 ("bridge: use ndo_siocdevprivate")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/all/20211223153139.7661-2-repk@triplefau.lt/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, mostly
rather small housekeeping patches:
1) Remove unused variable in IPVS, from GuoYong Zheng.
2) Use memset_after in conntrack, from Kees Cook.
3) Remove leftover function in nfnetlink_queue, from Florian Westphal.
4) Remove redundant test on bool in conntrack, from Bernard Zhao.
5) egress support for nft_fwd, from Lukas Wunner.
6) Make pppoe work for br_netfilter, from Florian Westphal.
7) Remove unused variable in conntrack resize routine, from luo penghao.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: conntrack: Remove useless assignment statements
netfilter: bridge: add support for pppoe filtering
netfilter: nft_fwd_netdev: Support egress hook
netfilter: ctnetlink: remove useless type conversion to bool
netfilter: nf_queue: remove leftover synchronize_rcu
netfilter: conntrack: Use memset_startat() to zero struct nf_conn
ipvs: remove unused variable for ip_vs_new_dest
====================
Link: https://lore.kernel.org/r/20211215234911.170741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This makes 'bridge-nf-filter-pppoe-tagged' sysctl work for
bridged traffic.
Looking at the original commit it doesn't appear this ever worked:
static unsigned int br_nf_post_routing(unsigned int hook, struct sk_buff **pskb,
[..]
if (skb->protocol == htons(ETH_P_8021Q)) {
skb_pull(skb, VLAN_HLEN);
skb->network_header += VLAN_HLEN;
+ } else if (skb->protocol == htons(ETH_P_PPP_SES)) {
+ skb_pull(skb, PPPOE_SES_HLEN);
+ skb->network_header += PPPOE_SES_HLEN;
}
[..]
NF_HOOK(... POST_ROUTING, ...)
... but the adjusted offsets are never restored.
The alternative would be to rip this code out for good,
but otoh we'd have to keep this anyway for the vlan handling
(which works because vlan tag info is in the skb, not the packet
payload).
Reported-and-tested-by: Amish Chana <amish@3g.co.za>
Fixes: 516299d2f5 ("[NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cited commit converted simple_strtoul() to kstrtoul() as suggested by
the former's documentation. However, it also forced all the inputs to be
decimal resulting in user space breakage.
Fix by setting the base to '0' so that the base is automatically
detected.
Before:
# ip link add name br0 type bridge vlan_filtering 1
# echo "0x88a8" > /sys/class/net/br0/bridge/vlan_protocol
bash: echo: write error: Invalid argument
After:
# ip link add name br0 type bridge vlan_filtering 1
# echo "0x88a8" > /sys/class/net/br0/bridge/vlan_protocol
# echo $?
0
Fixes: 520fbdf7fb ("net/bridge: replace simple_strtoul to kstrtol")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20211124101122.3321496-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add netif_set_gso_max_segs() helper.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'inuse' bitmap is local to this function. So we can use the
non-atomic '__set_bit()' to save a few cycles.
While at it, also remove some useless {}.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Use array_size() in ebtables, from Gustavo A. R. Silva.
2) Attach IPS_ASSURED to internal UDP stream state, reported by
Maciej Zenczykowski.
3) Add NFT_META_IFTYPE to match on the interface type either
from ingress or egress.
4) Generalize pktinfo->tprot_set to flags field.
5) Allow to match on inner headers / payload data.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
br_switchdev_mdb_notify() is conditionally compiled only when
CONFIG_NET_SWITCHDEV=y and CONFIG_BRIDGE_IGMP_SNOOPING=y. It is called
from br_mdb.c, which is conditionally compiled only when
CONFIG_BRIDGE_IGMP_SNOOPING=y.
The shim definition of br_switchdev_mdb_notify() is therefore needed for
the case where CONFIG_NET_SWITCHDEV=n, however we mistakenly put it
there for the case where CONFIG_BRIDGE_IGMP_SNOOPING=n. This results in
build failures when CONFIG_BRIDGE_IGMP_SNOOPING=y and
CONFIG_NET_SWITCHDEV=n.
To fix this, put the shim definition right next to
br_switchdev_fdb_notify(), which is properly guarded by NET_SWITCHDEV=n.
Since this is called only from br_mdb.c, we need not take any extra
safety precautions, when NET_SWITCHDEV=n and BRIDGE_IGMP_SNOOPING=n,
this shim definition will be absent but nobody will be needing it.
Fixes: 9776457c78 ("net: bridge: mdb: move all switchdev logic to br_switchdev.c")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20211029223606.3450523-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Function br_get_link_af_size_filtered() calls br_cfm_{,peer}_mep_count()
that return a count. When BRIDGE_CFM is not enabled these functions
simply return -EOPNOTSUPP but do not modify count parameter and
calling function then works with uninitialized variables.
Modify these inline functions to return zero in count parameter.
Fixes: b6d0425b81 ("bridge: cfm: Netlink Notifications.")
Cc: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following functions:
br_mdb_complete
br_switchdev_mdb_populate
br_mdb_replay_one
br_mdb_queue_one
br_mdb_replay
br_mdb_switchdev_host_port
br_mdb_switchdev_host
br_switchdev_mdb_notify
are only accessible from code paths where CONFIG_NET_SWITCHDEV is
enabled. So move them to br_switchdev.c, in order for that code to be
compiled out if that config option is disabled.
Note that br_switchdev.c gets build regardless of whether
CONFIG_BRIDGE_IGMP_SNOOPING is enabled or not, whereas br_mdb.c only got
built when CONFIG_BRIDGE_IGMP_SNOOPING was enabled. So to preserve
correct compilation with CONFIG_BRIDGE_IGMP_SNOOPING being disabled, we
must now place an #ifdef around these functions in br_switchdev.c.
The offending bridge data structures that need this are
br->multicast_lock and br->mdb_list, these are also compiled out of
struct net_bridge when CONFIG_BRIDGE_IGMP_SNOOPING is turned off.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to fdb_notify() and br_switchdev_fdb_notify(), split the
switchdev specific logic from br_mdb_notify() into a different function.
This will be moved later in br_switchdev.c.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
br_vlan_replay() is relevant only if CONFIG_NET_SWITCHDEV is enabled, so
move it to br_switchdev.c.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
br_vlan_replay() needs this, and we're preparing to move it to
br_switchdev.c, which will be compiled regardless of whether or not
CONFIG_BRIDGE_VLAN_FILTERING is enabled.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two places where a switchdev FDB entry is constructed, one is
br_switchdev_fdb_notify() and the other is br_fdb_replay(). One uses a
struct initializer, and the other declares the structure as
uninitialized and populates the elements one by one.
One problem when introducing new members of struct
switchdev_notifier_fdb_info is that there is a risk for one of these
functions to run with an uninitialized value.
So centralize the logic of populating such structure into a dedicated
function. Being the primary location where these structures are created,
using an uninitialized variable and populating the members one by one
should be fine, since this one function is supposed to assign values to
all its members.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_replay is only called from switchdev code paths, so it makes
sense to be disabled if switchdev is not enabled in the first place.
As opposed to br_mdb_replay and br_vlan_replay which might be turned off
depending on bridge support for multicast and VLANs, FDB support is
always on. So moving br_mdb_replay and br_vlan_replay inside
br_switchdev.c would mean adding some #ifdef's in br_switchdev.c, so we
keep those where they are.
The reason for the movement is that in future changes there will be some
code reuse between br_switchdev_fdb_notify and br_fdb_replay.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can express the same logic without an "if" condition as big as the
function, just return early if the kmem_cache_alloc() call fails.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_insert() is a wrapper over fdb_insert() that also takes the
bridge hash_lock.
With fdb_insert() being renamed to fdb_add_local(), rename
br_fdb_insert() to br_fdb_add_local().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb_insert() is not a descriptive name for this function, and also easy
to confuse with __br_fdb_add(), fdb_add_entry(), br_fdb_update().
Even more confusingly, it is not even related in any way with those
functions, neither one calls the other.
Since fdb_insert() basically deals with the creation of a BR_FDB_LOCAL
entry and is called only from functions where that is the intention:
- br_fdb_changeaddr
- br_fdb_change_mac_address
- br_fdb_insert
then rename it to fdb_add_local(), because its removal counterpart is
called fdb_delete_local().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb_insert() has a forward declaration because its first caller,
br_fdb_changeaddr(), is declared before fdb_create(), a function which
fdb_insert() needs.
This patch moves the 2 functions above br_fdb_changeaddr() and deletes
the forward declaration for fdb_insert().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb_notify() has a forward declaration because its first caller,
fdb_delete(), is declared before 3 functions that fdb_notify() needs:
fdb_to_nud(), fdb_fill_info() and fdb_nlmsg_size().
This patch moves the aforementioned 4 functions above fdb_delete() and
deletes the forward declaration.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use array_size() helper instead of the open-coded version in
copy_{from,to}_user(). These sorts of multiplication factors
need to be wrapped in array_size().
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter fixes for net:
1) Crash due to missing initialization of timer data in
xt_IDLETIMER, from Juhee Kang.
2) NF_CONNTRACK_SECMARK should be bool in Kconfig, from Vegard Nossum.
3) Skip netdev events on netns removal, from Florian Westphal.
4) Add testcase to show port shadowing via UDP, also from Florian.
5) Remove pr_debug() code in ip6t_rt, this fixes a crash due to
unsafe access to non-linear skbuff, from Xin Long.
6) Make net/ipv4/vs/debug_level read-only from non-init netns,
from Antoine Tenart.
7) Remove bogus invocation to bash in selftests/netfilter/nft_flowtable.sh
also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS for net-next:
1) Add new run_estimation toggle to IPVS to stop the estimation_timer
logic, from Dust Li.
2) Relax superfluous dynset check on NFT_SET_TIMEOUT.
3) Add egress hook, from Lukas Wunner.
4) Nowadays, almost all hook functions in x_table land just call the hook
evaluation loop. Remove remaining hook wrappers from iptables and IPVS.
From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When I added IGMPv3 support I decided to follow the RFC for computing
the GMI dynamically:
" 8.4. Group Membership Interval
The Group Membership Interval is the amount of time that must pass
before a multicast router decides there are no more members of a
group or a particular source on a network.
This value MUST be ((the Robustness Variable) times (the Query
Interval)) plus (one Query Response Interval)."
But that actually is inconsistent with how the bridge used to compute it
for IGMPv2, where it was user-configurable that has a correct default value
but it is up to user-space to maintain it. This would make it consistent
with the other timer values which are also maintained correct by the user
instead of being dynamically computed. It also changes back to the previous
user-expected GMI behaviour for IGMPv3 queries which were supported before
IGMPv3 was added. Note that to properly compute it dynamically we would
need to add support for "Robustness Variable" which is currently missing.
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Fixes: 0436862e41 ("net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of netdev helper functions to improve code readability.
Replace 'dev->priv_flags & IFF_EBRIDGE' with netif_is_bridge_master(dev).
Signed-off-by: Kyungrok Chung <acadx0@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is possible now that the xt_table structure is passed via *priv.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit de1799667b ("net: bridge: add STP xstats")
added an additional nla_reserve_64bit() in br_fill_linkxstats(),
but forgot to update br_get_linkxstats_size() accordingly.
This can trigger the following in rtnl_stats_get()
WARN_ON(err == -EMSGSIZE);
Fixes: de1799667b ("net: bridge: add STP xstats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge_fill_linkxstats() is using nla_reserve_64bit().
We must use nla_total_size_64bit() instead of nla_total_size()
for corresponding data structure.
Fixes: 1080ab95e3 ("net: bridge: add support for IGMP/MLD stats and export them via netlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert sw drivers from memcpy(... ETH_ADDR) to eth_hw_addr_set():
@@
expression dev, np;
@@
- memcpy(dev->dev_addr, np, ETH_ALEN)
+ eth_hw_addr_set(dev, np)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use array_size() helper instead of the open-coded version in
copy_to_user(). These sorts of multiplication factors need
to be wrapped in array_size().
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sequence count bridge_mcast_querier::seq is protected by
net_bridge::multicast_lock but seqcount_init() does not associate the
seqcount with the lock. This leads to a warning on PREEMPT_RT because
preemption is still enabled.
Let seqcount_init() associate the seqcount with lock that protects the
write section. Remove lockdep_assert_held_once() because lockdep already checks
whether the associated lock is held.
Fixes: 67b746f94f ("net: bridge: mcast: make sure querier port/address updates are consistent")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210928141049.593833-1-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before vlan/port mcast router support was added
br_multicast_set_port_router was used only with bh already disabled due
to the bridge port lock, but that is no longer the case and when it is
called to configure a vlan/port mcast router we can deadlock with the
timer, so always disable bh to make sure it can be called from contexts
with both enabled and disabled bh.
Fixes: 2796d846d7 ("net: bridge: vlan: convert mcast router global option to per-vlan entry")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_ip6_multicast_mld2_report function uses icmp6h
to parse mld2_report packet.
mld2r_ngrec defines mld2r_hdr.icmp6_dataun.un_data16[1]
in include/net/mld.h.
So, it is more compact to use mld2r rather than icmp6h.
By doing printk test, it is confirmed that
icmp6h->icmp6_dataun.un_data16[1] and mld2r->mld2r_ngrec are
indeed equivalent.
Also, sizeof(*mld2r) and sizeof(*icmp6h) are equivalent, too.
Signed-off-by: MichelleJin <shjy180909@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_handle_ingress_vlan_tunnel() is only referenced in
br_handle_frame(). If br_handle_ingress_vlan_tunnel() is called and
return non-zero value, goto drop in br_handle_frame().
But, br_handle_ingress_vlan_tunnel() always return 0. So, the
routines that check the return value and goto drop has no meaning.
Therefore, change return type of br_handle_ingress_vlan_tunnel() to
void and remove if statement of br_handle_frame().
Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210823102118.17966-1-l4stpr0gr4m@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The per-vlan router option controls the port/vlan and host vlan entries'
mcast router config. The global option controlled only the host vlan
config, but that is unnecessary and incosistent as it's not really a
global vlan option, but rather bridge option to control host router
config, so convert BRIDGE_VLANDB_GOPTS_MCAST_ROUTER to
BRIDGE_VLANDB_ENTRY_MCAST_ROUTER which can be used to control both host
vlan and port vlan mcast router config.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change br_multicast_set_port_router to take port multicast context as
its first argument so we can later use it to control port/vlan mcast
router option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When changing vlan mcast state by br_multicast_toggle_vlan it iterates
over all ports and enables/disables the port mcast ctx based on the new
state, but I forgot to update the host vlan (bridge master vlan entry)
with the new state so it will be left out. Also that function is not
used outside of br_multicast.c, so make it static.
Fixes: f4b7002a70 ("net: bridge: add vlan mcast snooping knob")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending a global vlan notification we should account for the number
of router ports when allocating the skb, otherwise we might end up
losing notifications.
Fixes: dc002875c2 ("net: bridge: vlan: use br_rports_fill_info() to export mcast router ports")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We always create a vlan with enabled mcast snooping, so when the user
turns on per-vlan mcast contexts they'll get consistent behaviour with
the current situation, but one place wasn't updated when a bridge/master
vlan which already exists (created due to port vlans) is being added as
real bridge vlan (BRIDGE_VLAN_INFO_BRENTRY). We need to enable mcast
snooping for that vlan when that happens.
Fixes: 7b54aaaf53 ("net: bridge: multicast: add vlan state initialization and control")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to account for the IPv6 attributes when dumping querier state.
Fixes: 5e924fe6ccfd ("net: bridge: mcast: dump ipv6 querier state")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was a dumb error I made instead of writing nla_total_size(0)
for a nest attribute, I wrote nla_total_size(sizeof(0)).
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 606433fe3e11 ("net: bridge: mcast: dump ipv4 querier state")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A minor improvement to avoid dumping mcast ctx querier state if snooping
is disabled for that context (either bridge or vlan).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new mcast querier state dump infrastructure and export vlans'
mcast context querier state embedded in attribute
BRIDGE_VLANDB_GOPTS_MCAST_QUERIER_STATE.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for dumping global IPv6 querier state, we dump the state
only if our own querier is enabled or there has been another external
querier which has won the election. For the bridge global state we use
a new attribute IFLA_BR_MCAST_QUERIER_STATE and embed the state inside.
The structure is:
[IFLA_BR_MCAST_QUERIER_STATE]
`[BRIDGE_QUERIER_IPV6_ADDRESS] - ip address of the querier
`[BRIDGE_QUERIER_IPV6_PORT] - bridge port ifindex where the querier
was seen (set only if external querier)
`[BRIDGE_QUERIER_IPV6_OTHER_TIMER] - other querier timeout
IPv4 and IPv6 attributes are embedded at the same level of
IFLA_BR_MCAST_QUERIER_STATE. If we didn't dump anything we cancel the nest
and return.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for dumping global IPv4 querier state, we dump the state
only if our own querier is enabled or there has been another external
querier which has won the election. For the bridge global state we use
a new attribute IFLA_BR_MCAST_QUERIER_STATE and embed the state inside.
The structure is:
[IFLA_BR_MCAST_QUERIER_STATE]
`[BRIDGE_QUERIER_IP_ADDRESS] - ip address of the querier
`[BRIDGE_QUERIER_IP_PORT] - bridge port ifindex where the querier was
seen (set only if external querier)
`[BRIDGE_QUERIER_IP_OTHER_TIMER] - other querier timeout
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can consolidate both functions as they share almost the same logic.
This is easier to maintain and we have a single querier update function.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a sequence counter to make sure port/address updates can be read
consistently without requiring the bridge multicast_lock. We need to
zero out the port and address when the other querier has expired and
we're about to select ourselves as querier. br_multicast_read_querier
will be used later when dumping querier state. Updates are done only
with the multicast spinlock and softirqs disabled, while reads are done
from process context and from softirqs (due to notifications).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a querier port is detected its net_bridge_port pointer is
recorded, but it's used only for comparisons so it's fine to have stale
pointer, in order to dereference and use the port pointer a proper
accounting of its usage must be implemented adding unnecessary
complexity. To solve the problem we can just store the netdevice ifindex
instead of the port pointer and retrieve the bridge port. It is a best
effort and the device needs to be validated that is still part of that
bridge before use, but that is small price to pay for avoiding querier
reference counting for each port/vlan.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When global vlan options are equal sequentially we compress them in a
range to save space and reduce processing time. In order to have the
proper range end id we need to update range_end if the options are equal
otherwise we get ranges with the same end vlan id as the start.
Fixes: 743a53d963 ("net: bridge: vlan: add support for dumping global vlan options")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210810092139.11700-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Embed the standard multicast router port export by br_rports_fill_info()
into a new global vlan attribute BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS.
In order to have the same format for the global bridge mcast context and
the per-vlan mcast context we need a double-nesting:
- BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS
- MDBA_ROUTER
Currently we don't compare router lists, if any router port exists in
the bridge mcast contexts we consider their option sets as different and
export them separately.
In addition we export the router port vlan id when dumping similar to
the router port notification format.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are dumping the router ports of a vlan mcast context we need to
use the bridge/vlan and port/vlan's multicast contexts to check if
IPv4/IPv6 router port is present and later to dump the vlan id.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast router state
which is used for the bridge itself. We just need to pass multicast context
to br_multicast_set_router instead of bridge device and the rest of the
logic remains the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast querier state.
We just need to pass multicast context to br_multicast_set_querier
instead of bridge device and the rest of the logic remains the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is a minor optimization and better behaviour to make sure querier and
query sending routines affect only the matching multicast context
depending if vlan snooping is enabled (vlan ctx vs bridge ctx).
It also avoids sending unnecessary extra query packets.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to have the querier state per multicast context in order to have
per-vlan control, so remove the internal option bit and move it to the
multicast context. Also annotate the lockless reads of the new variable.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast startup query
interval option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast query response
interval option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast query interval
option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast querier interval
option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast membership
interval option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast last member
interval option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast startup query
count option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan multicast last member
count option.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to change and retrieve global vlan IGMP/MLD versions.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat.
2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout
fixup.
3) CLUSTERIP registers ARP hook on demand, from Florian.
4) Use clusterip_net to store pernet warning, also from Florian.
5) Remove struct netns_xt, from Florian Westphal.
6) Enable ebtables hooks in initns on demand, from Florian.
7) Allow to filter conntrack netlink dump per status bits,
from Florian Westphal.
8) Register x_tables hooks in initns on demand, from Florian.
9) Remove queue_handler from per-netns structure, again from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ignore fdb flags when adding port extern learn entries and always set
BR_FDB_LOCAL flag when adding bridge extern learn entries. This is
closest to the behaviour we had before and avoids breaking any use cases
which were allowed.
This patch fixes iproute2 calls which assume NUD_PERMANENT and were
allowed before, example:
$ bridge fdb add 00:11:22:33:44:55 dev swp1 extern_learn
Extern learn entries are allowed to roam, but do not expire, so static
or dynamic flags make no sense for them.
Also add a comment for future reference.
Fixes: eb100e0e24 ("net: bridge: allow to add externally learned entries from user-space")
Fixes: 0541a62932 ("net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210810110010.43859-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Restrict range element expansion in ipset to avoid soft lockup,
from Jozsef Kadlecsik.
2) Memleak in error path for nf_conntrack_bridge for IPv4 packets,
from Yajun Deng.
3) Simplify conntrack garbage collection strategy to avoid frequent
wake-ups, from Florian Westphal.
4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.
5) Missing chain family netlink attribute in chain description
in nfnetlink_hook.
6) Incorrect sequence number on nfnetlink_hook dumps.
7) Use netlink request family in reply message for consistency.
8) Remove offload_pickup sysctl, use conntrack for established state
instead, from Florian Westphal.
9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since
NFPROTO_INET is not exposed through nfnetlink_hook.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nfnetlink_hook: translate inet ingress to netdev
netfilter: conntrack: remove offload_pickup sysctl again
netfilter: nfnetlink_hook: Use same family as request message
netfilter: nfnetlink_hook: use the sequence number of the request message
netfilter: nfnetlink_hook: missing chain family
netfilter: nfnetlink_hook: strip off module name from hookfn
netfilter: conntrack: collect all entries in one cycle
netfilter: nf_conntrack_bridge: Fix memory leak when error
netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
====================
Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).
Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
589918df93 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
0fac6aa098 ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")
Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df93 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit ad2f99aedf ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
changed the source of the argument copy in bridge's old_deviceless() from
args[1] (user ptr to device name) to uarg (ptr to ioctl arguments) causing
wrong device name to be used.
Example (broken, bridge exists but is up):
$ brctl delbr bridge
bridge bridge doesn't exist; can't delete it
Example (working):
$ brctl delbr bridge
bridge bridge is still up; can't delete it
Fixes: ad2f99aedf ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before commit ad2f99aedf ("net: bridge: move bridge ioctls out of
.ndo_do_ioctl") the bridge ioctl calls were divided in two parts:
one was deviceless called by sock_ioctl and didn't expect rtnl to be held,
the other was with a device called by dev_ifsioc() and expected rtnl to be
held. After the commit above they were united in a single ioctl stub, but
it didn't take care of the locking expectations.
For sock_ioctl now we acquire (1) br_ioctl_mutex, (2) rtnl
and for dev_ifsioc we acquire (1) rtnl, (2) br_ioctl_mutex
The fix is to get a refcnt on the netdev for dev_ifsioc calls and drop rtnl
then to reacquire it in the bridge ioctl stub after br_ioctl_mutex has
been acquired. That will avoid playing locking games and make the rules
straight-forward: we always take br_ioctl_mutex first, and then rtnl.
Reported-by: syzbot+34fe5894623c4ab1b379@syzkaller.appspotmail.com
Fixes: ad2f99aedf ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of explicit offloading API in switchdev in commit
2f5dc00f7a ("net: bridge: switchdev: let drivers inform which bridge
ports are offloaded"), we started having Ethernet switch drivers calling
directly into a function exported by net/bridge/br_switchdev.c, which is
a function exported by the bridge driver.
This means that drivers that did not have an explicit dependency on the
bridge before, like cpsw and am65-cpsw, now do - otherwise it is not
possible to call a symbol exported by a driver that can be built as
module unless you are a module too.
There was an attempt to solve the dependency issue in the form of commit
b0e8181762 ("net: build all switchdev drivers as modules when the
bridge is a module"). Grygorii Strashko, however, says about it:
| In my opinion, the problem is a bit bigger here than just fixing the
| build :(
|
| In case, of ^cpsw the switchdev mode is kinda optional and in many
| cases (especially for testing purposes, NFS) the multi-mac mode is
| still preferable mode.
|
| There were no such tight dependency between switchdev drivers and
| bridge core before and switchdev serviced as independent, notification
| based layer between them, so ^cpsw still can be "Y" and bridge can be
| "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE
| will need to be set as "Y", or we will have to update drivers to
| support build with BRIDGE=n and maintain separate builds for
| networking vs non-networking testing. But is this enough? Wouldn't
| it cause 'chain reaction' required to add more and more "Y" options
| (like CONFIG_VLAN_8021Q)?
|
| PS. Just to be sure we on the same page - ARM builds will be forced
| (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our
| automation testing will just fail with omap2plus_defconfig.
In the light of this, it would be desirable for some configurations to
avoid dependencies between switchdev drivers and the bridge, and have
the switchdev mode as completely optional within the driver.
Arnd Bergmann also tried to write a patch which better expressed the
build time dependency for Ethernet switch drivers where the switchdev
support is optional, like cpsw/am65-cpsw, and this made the drivers
follow the bridge (compile as module if the bridge is a module) only if
the optional switchdev support in the driver was enabled in the first
place:
https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/
but this still did not solve the fact that cpsw and am65-cpsw now must
be built as modules when the bridge is a module - it just expressed
correctly that optional dependency. But the new behavior is an apparent
regression from Grygorii's perspective.
So to support the use case where the Ethernet driver is built-in,
NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we
need a framework that can handle the possible absence of the bridge from
the running system, i.e. runtime bloatware as opposed to build-time
bloatware.
Luckily we already have this framework, since switchdev has been using
it extensively. Events from the bridge side are transmitted to the
driver side using notifier chains - this was originally done so that
unrelated drivers could snoop for events emitted by the bridge towards
ports that are implemented by other drivers (think of a switch driver
with LAG offload that listens for switchdev events on a bonding/team
interface that it offloads).
There are also events which are transmitted from the driver side to the
bridge side, which again are modeled using notifiers.
SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with
notifying the bridge that a MAC address has been dynamically learned.
So there is a precedent we can use for modeling the new framework.
The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work
that the bridge needs to do when a port becomes offloaded is blocking in
its nature: replay VLANs, MDBs etc. The calling context is indeed
blocking (we are under rtnl_mutex), but the existing switchdev
notification chain that the bridge is subscribed to is only the atomic
one. So we need to subscribe the bridge to the blocking switchdev
notification chain too.
This patch:
- keeps the driver-side perception of the switchdev_bridge_port_{,un}offload
unchanged
- moves the implementation of switchdev_bridge_port_{,un}offload from
the bridge module into the switchdev module.
- makes everybody that is subscribed to the switchdev blocking notifier
chain "hear" offload & unoffload events
- makes the bridge driver subscribe and handle those events
- moves the bridge driver's handling of those events into 2 new
functions called br_switchdev_port_{,un}offload. These functions
contain in fact the core of the logic that was previously in
switchdev_bridge_port_{,un}offload, just that now we go through an
extra indirection layer to reach them.
Unlike all the other switchdev notification structures, the structure
used to carry the bridge port information, struct
switchdev_notifier_brport_info, does not contain a "bool handled".
This is because in the current usage pattern, we always know that a
switchdev bridge port offloading event will be handled by the bridge,
because the switchdev_bridge_port_offload() call was initiated by a
NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a
bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event
couldn't have happened.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should be added kfree_skb_list() when err is not equal to zero
in nf_br_ip_fragment().
v2: keep this aligned with IPv6.
v3: modify iter.frag_list to iter.frag.
Fixes: 3c171f496e ("netfilter: bridge: add connection tracking system")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Nikolay points out that it is incorrect to assume that it is impossible
to have an fdb entry with fdb->dst == NULL and the BR_FDB_LOCAL bit in
fdb->flags not set. This is because there are reader-side places that
test_bit(BR_FDB_LOCAL, &fdb->flags) without the br->hash_lock, and if
the updating of the FDB entry happens on another CPU, there are no
memory barriers at writer or reader side which would ensure that the
reader sees the updates to both fdb->flags and fdb->dst in the same
order, i.e. the reader will not see an inconsistent FDB entry.
So we must be prepared to deal with FDB entries where fdb->dst and
fdb->flags are in a potentially inconsistent state, and that means that
fdb->dst == NULL should remain a condition to pick the net_device that
we report to switchdev as being the bridge device, which is what the
code did prior to the blamed patch.
Fixes: 52e4bec155 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge")
Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210802113633.189831-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently it is possible to add broken extern_learn FDB entries to the
bridge in two ways:
1. Entries pointing towards the bridge device that are not local/permanent:
ip link add br0 type bridge
bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static
2. Entries pointing towards the bridge device or towards a port that
are marked as local/permanent, however the bridge does not process the
'permanent' bit in any way, therefore they are recorded as though they
aren't permanent:
ip link add br0 type bridge
bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent
Since commit 52e4bec155 ("net: bridge: switchdev: treat local FDBs the
same as entries towards the bridge"), these incorrect FDB entries can
even trigger NULL pointer dereferences inside the kernel.
This is because that commit made the assumption that all FDB entries
that are not local/permanent have a valid destination port. For context,
local / permanent FDB entries either have fdb->dst == NULL, and these
point towards the bridge device and are therefore local and not to be
used for forwarding, or have fdb->dst == a net_bridge_port structure
(but are to be treated in the same way, i.e. not for forwarding).
That assumption _is_ correct as long as things are working correctly in
the bridge driver, i.e. we cannot logically have fdb->dst == NULL under
any circumstance for FDB entries that are not local. However, the
extern_learn code path where FDB entries are managed by a user space
controller show that it is possible for the bridge kernel driver to
misinterpret the NUD flags of an entry transmitted by user space, and
end up having fdb->dst == NULL while not being a local entry. This is
invalid and should be rejected.
Before, the two commands listed above both crashed the kernel in this
check from br_switchdev_fdb_notify:
struct net_device *dev = info.is_local ? br->dev : dst->dev;
info.is_local == false, dst == NULL.
After this patch, the invalid entry added by the first command is
rejected:
ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static; ip link del br0
Error: bridge: FDB entry towards bridge must be permanent.
and the valid entry added by the second command is properly treated as a
local address and does not crash br_switchdev_fdb_notify anymore:
ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent; ip link del br0
Fixes: eb100e0e24 ("net: bridge: allow to add externally learned entries from user-space")
Reported-by: syzbot+9ba1174359adba5a5b7c@syzkaller.appspotmail.com
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210801231730.7493-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If any of these modules is loaded, hooks get registered in all netns:
Before: 'unshare -n nft list hooks' shows:
family bridge hook prerouting {
-2147483648 ebt_broute
-0000000300 ebt_nat_hook
}
family bridge hook input {
-0000000200 ebt_filter_hook
}
family bridge hook forward {
-0000000200 ebt_filter_hook
}
family bridge hook output {
+0000000100 ebt_nat_hook
+0000000200 ebt_filter_hook
}
family bridge hook postrouting {
+0000000300 ebt_nat_hook
}
This adds 'template 'tables' for ebtables.
Each ebtable_foo registers the table as a template, with an init function
that gets called once the first get/setsockopt call is made.
ebtables core then searches the (per netns) list of tables.
If no table is found, it searches the list of templates instead.
If a template entry exists, the init function is called which will
enable the table and register the hooks (so packets are diverted
to the table).
If no entry is found in the template list, request_module is called.
After this, hook registration is delayed until the 'ebtables'
(set/getsockopt) request is made for a given table and will only
happen in the specific namespace.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently the following script:
1. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
2. ip link set swp2 up && ip link set swp2 master br0
3. ip link set swp3 up && ip link set swp3 master br0
4. ip link set swp4 up && ip link set swp4 master br0
5. bridge vlan del dev swp2 vid 1
6. bridge vlan del dev swp3 vid 1
7. ip link set swp4 nomaster
8. ip link set swp3 nomaster
produces the following output:
[ 641.010738] sja1105 spi0.1: port 2 failed to delete 00:1f:7b:63:02:48 vid 1 from fdb: -2
[ swp2, swp3 and br0 all have the same MAC address, the one listed above ]
In short, this happens because the number of FDB entry additions
notified to switchdev is unbalanced with the number of deletions.
At step 1, the bridge has a random MAC address. At step 2, the
br_fdb_replay of swp2 receives this initial MAC address. Then the bridge
inherits the MAC address of swp2 via br_fdb_change_mac_address(), and it
notifies switchdev (only swp2 at this point) of the deletion of the
random MAC address and the addition of 00:1f:7b:63:02:48 as a local FDB
entry with fdb->dst == swp2, in VLANs 0 and the default_pvid (1).
During step 7:
del_nbp
-> br_fdb_delete_by_port(br, p, vid=0, do_all=1);
-> fdb_delete_local(br, p, f);
br_fdb_delete_by_port() deletes all entries towards the ports,
regardless of vid, because do_all is 1.
fdb_delete_local() has logic to migrate local FDB entries deleted from
one port to another port which shares the same MAC address and is in the
same VLAN, or to the bridge device itself. This migration happens
without notifying switchdev of the deletion on the old port and the
addition on the new one, just fdb->dst is changed and the added_by_user
flag is cleared.
In the example above, the del_nbp(swp4) causes the
"addr 00:1f:7b:63:02:48 vid 1" local FDB entry with fdb->dst == swp4
that existed up until then to be migrated directly towards the bridge
(fdb->dst == NULL). This is because it cannot be migrated to any of the
other ports (swp2 and swp3 are not in VLAN 1).
After the migration to br0 takes place, swp4 requests a deletion replay
of all FDB entries. Since the "addr 00:1f:7b:63:02:48 vid 1" entry now
point towards the bridge, a deletion of it is replayed. There was just
a prior addition of this address, so the switchdev driver deletes this
entry.
Then, the del_nbp(swp3) at step 8 triggers another br_fdb_replay, and
switchdev is notified again to delete "addr 00:1f:7b:63:02:48 vid 1".
But it can't because it no longer has it, so it returns -ENOENT.
There are other possibilities to trigger this issue, but this is by far
the simplest to explain.
To fix this, we must avoid the situation where the addition of an FDB
entry is notified to switchdev as a local entry on a port, and the
deletion is notified on the bridge itself.
Considering that the 2 types of FDB entries are completely equivalent
and we cannot have the same MAC address as a local entry on 2 bridge
ports, or on a bridge port and pointing towards the bridge at the same
time, it makes sense to hide away from switchdev completely the fact
that a local FDB entry is associated with a given bridge port at all.
Just say that it points towards the bridge, it should make no difference
whatsoever to the switchdev driver and should even lead to a simpler
overall implementation, will less cases to handle.
This also avoids any modification at all to the core bridge driver, just
what is reported to switchdev changes. With the local/permanent entries
on bridge ports being already reported to user space, it is hard to
believe that the bridge behavior can change in any backwards-incompatible
way such as making all local FDB entries point towards the bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>