OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Vladimir Oltean	4e51bf44a0	net: bridge: move the switchdev object replay helpers to "push" mode Starting with commit `4f2673b3a2` ("net: bridge: add helper to replay port and host-joined mdb entries"), DSA has introduced some bridge helpers that replay switchdev events (FDB/MDB/VLAN additions and deletions) that can be lost by the switchdev drivers in a variety of circumstances: - an IP multicast group was host-joined on the bridge itself before any switchdev port joined the bridge, leading to the host MDB entries missing in the hardware database. - during the bridge creation process, the MAC address of the bridge was added to the FDB as an entry pointing towards the bridge device itself, but with no switchdev ports being part of the bridge yet, this local FDB entry would remain unknown to the switchdev hardware database. - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface, before any switchdev port joined that LAG, leading to the hardware database missing those entries. - a switchdev port left a LAG that is a bridge port, while the LAG remained part of the bridge, and all FDB/MDB/VLAN entries remained installed in the hardware database of the switchdev port. Also, since commit `0d2cfbd41c` ("net: bridge: ignore switchdev events for LAG ports which didn't request replay"), DSA introduced a method, based on a const void ctx, to ensure that two switchdev ports under the same LAG that is a bridge port do not see the same MDB/VLAN entry being replayed twice by the bridge, once for every bridge port that joins the LAG. With so many ordering corner cases being possible, it seems unreasonable to expect a switchdev driver writer to get it right from the first try. Therefore, now that DSA has experimented with the bridge replay helpers for a little bit, we can move the code to the bridge driver where it is more readily available to all switchdev drivers. To convert the switchdev object replay helpers from "pull mode" (where the driver asks for them) to a "push mode" (where the bridge offers them automatically), the biggest problem is that the bridge needs to be aware when a switchdev port joins and leaves, even when the switchdev is only indirectly a bridge port (for example when the bridge port is a LAG upper of the switchdev). Luckily, we already have a hook for that, in the form of the newly introduced switchdev_bridge_port_offload() and switchdev_bridge_port_unoffload() calls. These offer a natural place for hooking the object addition and deletion replays. Extend the above 2 functions with: - pointers to the switchdev atomic notifier (for FDB replays) and the blocking notifier (for MDB and VLAN replays). - the "const void ctx" argument required for drivers to be able to disambiguate between which port is targeted, when multiple ports are lowers of the same LAG that is a bridge port. Most of the drivers pass NULL to this argument, except the ones that support LAG offload and have the proper context check already in place in the switchdev blocking notifier handler. Also unexport the replay helpers, since nobody except the bridge calls them directly now. Note that: (a) we abuse the terminology slightly, because FDB entries are not "switchdev objects", but we count them as objects nonetheless. With no direct way to prove it, I think they are not modeled as switchdev objects because those can only be installed by the bridge to the hardware (as opposed to FDB entries which can be propagated in the other direction too). This is merely an abuse of terms, FDB entries are replayed too, despite not being objects. (b) the bridge does not attempt to sync port attributes to newly joined ports, just the countable stuff (the objects). The reason for this is simple: no universal and symmetric way to sync and unsync them is known. For example, VLAN filtering: what to do on unsync, disable or leave it enabled? Similarly, STP state, ageing timer, etc etc. What a switchdev port does when it becomes standalone again is not really up to the bridge's competence, and the driver should deal with it. On the other hand, replaying deletions of switchdev objects can be seen a matter of cleanup and therefore be treated by the bridge, hence this patch. We make the replay helpers opt-in for drivers, because they might not bring immediate benefits for them: - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(), so br_vlan_replay() should not do anything for the new drivers on which we call it. The existing drivers where there was even a slight possibility for there to exist a VLAN on a bridge port before they join it are already guarded against this: mlxsw and prestera deny joining LAG interfaces that are members of a bridge. - br_fdb_replay() should now notify of local FDB entries, but I patched all drivers except DSA to ignore these new entries in commit `2c4eca3ef7` ("net: bridge: switchdev: include local flag in FDB notifications"). Driver authors can lift this restriction as they wish, and when they do, they can also opt into the FDB replay functionality. - br_mdb_replay() should fix a real issue which is described in commit `4f2673b3a2` ("net: bridge: add helper to replay port and host-joined mdb entries"). However most drivers do not offload the SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw offload this switchdev object, and I don't completely understand the way in which they offload this switchdev object anyway. So I'll leave it up to these drivers' respective maintainers to opt into br_mdb_replay(). So most of the drivers pass NULL notifier blocks for the replay helpers, except: - dpaa2-switch which was already acked/regression-tested with the helpers enabled (and there isn't much of a downside in having them) - ocelot which already had replay logic in "pull" mode - DSA which already had replay logic in "pull" mode An important observation is that the drivers which don't currently request bridge event replays don't even have the switchdev_bridge_port_{offload,unoffload} calls placed in proper places right now. This was done to avoid unnecessary rework for drivers which might never even add support for this. For driver writers who wish to add replay support, this can be used as a tentative placement guide: https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/ Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Vladimir Oltean	7105b50b7e	net: bridge: guard the switchdev replay helpers against a NULL notifier block There is a desire to make the object and FDB replay helpers optional when moving them inside the bridge driver. For example a certain driver might not offload host MDBs and there is no case where the replay helpers would be of immediate use to it. So it would be nice if we could allow drivers to pass NULL pointers for the atomic and blocking notifier blocks, and the replay helpers to do nothing in that case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Vladimir Oltean	2f5dc00f7a	net: bridge: switchdev: let drivers inform which bridge ports are offloaded On reception of an skb, the bridge checks if it was marked as 'already forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it is, it assigns the source hardware domain of that skb based on the hardware domain of the ingress port. Then during forwarding, it enforces that the egress port must have a different hardware domain than the ingress one (this is done in nbp_switchdev_allowed_egress). Non-switchdev drivers don't report any physical switch id (neither through devlink nor .ndo_get_port_parent_id), therefore the bridge assigns them a hardware domain of 0, and packets coming from them will always have skb->offload_fwd_mark = 0. So there aren't any restrictions. Problems appear due to the fact that DSA would like to perform software fallback for bonding and team interfaces that the physical switch cannot offload. +-- br0 ---+ / / \| \ / / \| \ / \| \| bond0 / \| \| / \ swp0 swp1 swp2 swp3 swp4 There, it is desirable that the presence of swp3 and swp4 under a non-offloaded LAG does not preclude us from doing hardware bridging beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high enough that software bridging between {swp0,swp1,swp2} and bond0 is not impractical. But this creates an impossible paradox given the current way in which port hardware domains are assigned. When the driver receives a packet from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to something. - If we set it to 0, then the bridge will forward it towards swp1, swp2 and bond0. But the switch has already forwarded it towards swp1 and swp2 (not to bond0, remember, that isn't offloaded, so as far as the switch is concerned, ports swp3 and swp4 are not looking up the FDB, and the entire bond0 is a destination that is strictly behind the CPU). But we don't want duplicated traffic towards swp1 and swp2, so it's not ok to set skb->offload_fwd_mark = 0. - If we set it to 1, then the bridge will not forward the skb towards the ports with the same switchdev mark, i.e. not to swp1, swp2 and bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should have forwarded the skb there. So the real issue is that bond0 will be assigned the same hardware domain as {swp0,swp1,swp2}, because the function that assigns hardware domains to bridge ports, nbp_switchdev_add(), recurses through bond0's lower interfaces until it finds something that implements devlink (calls dev_get_port_parent_id with bool recurse = true). This is a problem because the fact that bond0 can be offloaded by swp3 and swp4 in our example is merely an assumption. A solution is to give the bridge explicit hints as to what hardware domain it should use for each port. Currently, the bridging offload is very 'silent': a driver registers a netdevice notifier, which is put on the netns's notifier chain, and which sniffs around for NETDEV_CHANGEUPPER events where the upper is a bridge, and the lower is an interface it knows about (one registered by this driver, normally). Then, from within that notifier, it does a bunch of stuff behind the bridge's back, without the bridge necessarily knowing that there's somebody offloading that port. It looks like this: ip link set swp0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v call_netdevice_notifiers \| v dsa_slave_netdevice_event \| v oh, hey! it's for me! \| v .port_bridge_join What we do to solve the conundrum is to be less silent, and change the switchdev drivers to present themselves to the bridge. Something like this: ip link set swp0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the \| \| hardware domain for v \| this port, and zero dsa_slave_netdevice_event \| if I got nothing. \| \| v \| oh, hey! it's for me! \| \| \| v \| .port_bridge_join \| \| \| +------------------------+ switchdev_bridge_port_offload(swp0, swp0) Then stacked interfaces (like bond0 on top of swp3/swp4) would be treated differently in DSA, depending on whether we can or cannot offload them. The offload case: ip link set bond0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the \| \| switchdev mark for v \| bond0. dsa_slave_netdevice_event \| Coincidentally (or not), \| \| bond0 and swp0, swp1, swp2 v \| all have the same switchdev hmm, it's not quite for me, \| mark now, since the ASIC but my driver has already \| is able to forward towards called .port_lag_join \| all these ports in hw. for it, because I have \| a port with dp->lag_dev == bond0. \| \| \| v \| .port_bridge_join \| for swp3 and swp4 \| \| \| +------------------------+ switchdev_bridge_port_offload(bond0, swp3) switchdev_bridge_port_offload(bond0, swp4) And the non-offload case: ip link set bond0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge waiting: call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload \| \| wasn't called, okay, I'll use a v \| hwdom of zero for this one. dsa_slave_netdevice_event : Then packets received on swp0 will \| : not be software-forwarded towards v : swp1, but they will towards bond0. it's not for me, but bond0 is an upper of swp3 and swp4, but their dp->lag_dev is NULL because they couldn't offload it. Basically we can draw the conclusion that the lowers of a bridge port can come and go, so depending on the configuration of lowers for a bridge port, it can dynamically toggle between offloaded and unoffloaded. Therefore, we need an equivalent switchdev_bridge_port_unoffload too. This patch changes the way any switchdev driver interacts with the bridge. From now on, everybody needs to call switchdev_bridge_port_offload and switchdev_bridge_port_unoffload, otherwise the bridge will treat the port as non-offloaded and allow software flooding to other ports from the same ASIC. Note that these functions lay the ground for a more complex handshake between switchdev drivers and the bridge in the future. For drivers that will request a replay of the switchdev objects when they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we place the call to switchdev_bridge_port_unoffload() strategically inside the NETDEV_PRECHANGEUPPER notifier's code path, and not inside NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers need the netdev adjacency lists to be valid, and that is only true in NETDEV_PRECHANGEUPPER. Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Tobias Waldekranz	8582661048	net: bridge: switchdev: recycle unused hwdoms Since hwdoms have only been used thus far for equality comparisons, the bridge has used the simplest possible assignment policy; using a counter to keep track of the last value handed out. With the upcoming transmit offloading, we need to perform set operations efficiently based on hwdoms, e.g. we want to answer questions like "has this skb been forwarded to any port within this hwdom?" Move to a bitmap-based allocation scheme that recycles hwdoms once all members leaves the bridge. This means that we can use a single unsigned long to keep track of the hwdoms that have received an skb. v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX) into a plain unsigned long. v2->v6: none Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Tobias Waldekranz	f7cf972f93	net: bridge: disambiguate offload_fwd_mark Before this change, four related - but distinct - concepts where named offload_fwd_mark: - skb->offload_fwd_mark: Set by the switchdev driver if the underlying hardware has already forwarded this frame to the other ports in the same hardware domain. - nbp->offload_fwd_mark: An idetifier used to group ports that share the same hardware forwarding domain. - br->offload_fwd_mark: Counter used to make sure that unique IDs are used in cases where a bridge contains ports from multiple hardware domains. - skb->cb->offload_fwd_mark: The hardware domain on which the frame ingressed and was forwarded. Introduce the term "hardware forwarding domain" ("hwdom") in the bridge to denote a set of ports with the following property: If an skb with skb->offload_fwd_mark set, is received on a port belonging to hwdom N, that frame has already been forwarded to all other ports in hwdom N. By decoupling the name from "offload_fwd_mark", we can extend the term's definition in the future - e.g. to add constraints that describe expected egress behavior - without overloading the meaning of "offload_fwd_mark". - nbp->offload_fwd_mark thus becomes nbp->hwdom. - br->offload_fwd_mark becomes br->last_hwdom. - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight change in naming here mandates a slight change in behavior of the nbp_switchdev_frame_mark() function. Previously, it only set this value in skb->cb for packets with skb->offload_fwd_mark true (ones which were forwarded in hardware). Whereas now we always track the incoming hwdom for all packets coming from a switchdev (even for the packets which weren't forwarded in hardware, such as STP BPDUs, IGMP reports etc). As all uses of skb->cb->offload_fwd_mark were already gated behind checks of skb->offload_fwd_mark, this will not introduce any functional change, but it paves the way for future changes where the ingressing hwdom must be known for frames coming from a switchdev regardless of whether they were forwarded in hardware or not (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now always tracks which one). A typical example where this is relevant: the switchdev has a fixed configuration to trap STP BPDUs, but STP is not running on the bridge and the group_fwd_mask allows them to be forwarded. Say we have this setup: br0 / \| \ / \| \ swp0 swp1 swp2 A BPDU comes in on swp0 and is trapped to the CPU; the driver does not set skb->offload_fwd_mark. The bridge determines that the frame should be forwarded to swp{1,2}. It is imperative that forward offloading is _not_ allowed in this case, as the source hwdom is already "poisoned". Recording the source hwdom allows this case to be handled properly. v2->v3: added code comments v3->v6: none Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Nikolay Aleksandrov	58d913a326	net: bridge: multicast: add context support for host-joined groups Adding bridge multicast context support for host-joined groups is easy because we only need the proper timer value. We pass the already chosen context and use its timer value. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 14:34:47 -07:00
Nikolay Aleksandrov	6567cb438a	net: bridge: multicast: add mdb context support Choose the proper bridge multicast context when user-spaces is adding mdb entries. Currently we require the vlan to be configured on at least one device (port or bridge) in order to add an mdb entry if vlan mcast snooping is enabled (vlan snooping implies vlan filtering). Note that we always allow deleting an entry, regardless of the vlan state. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 14:34:47 -07:00
Eric Dumazet	240bfd134c	tcp: tweak len/truesize ratio for coalesce candidates tcp_grow_window() is using skb->len/skb->truesize to increase tp->rcv_ssthresh which has a direct impact on advertized window sizes. We added TCP coalescing in linux-3.4 & linux-3.5: Instead of storing skbs with one or two MSS in receive queue (or OFO queue), we try to append segments together to reduce memory overhead. High performance network drivers tend to cook skb with 3 parts : 1) sk_buff structure (256 bytes) 2) skb->head contains room to copy headers as needed, and skb_shared_info 3) page fragment(s) containing the ~1514 bytes frame (or more depending on MTU) Once coalesced into a previous skb, 1) and 2) are freed. We can therefore tweak the way we compute len/truesize ratio knowing that skb->truesize is inflated by 1) and 2) soon to be freed. This is done only for in-order skb, or skb coalesced into OFO queue. The result is that low rate flows no longer pay the memory price of having low GRO aggregation factor. Same result for drivers not using GRO. This is critical to allow a big enough receiver window, typically tcp_rmem[2] / 2. We have been using this at Google for about 5 years, it is due time to make it upstream. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 09:05:57 -07:00
Nikolay Aleksandrov	54cb43199e	net: bridge: multicast: fix igmp/mld port context null pointer dereferences With the recent change to use bridge/port multicast context pointers instead of bridge/port I missed to convert two locations which pass the port pointer as-is, but with the new model we need to verify the port context is non-NULL first and retrieve the port from it. The first location is when doing querier selection when a query is received, the second location is when leaving a group. The port context will be null if the packets originated from the bridge device (i.e. from the host). The fix is simple just check if the port context exists and retrieve the port pointer from it. Fixes: `adc47037a7` ("net: bridge: multicast: use multicast contexts instead of bridge or port") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 09:04:03 -07:00
Eric Dumazet	739b2adf99	tcp: avoid indirect call in tcp_new_space() For tcp sockets, sk->sk_write_space is most probably sk_stream_write_space(). Other sk->sk_write_space() calls in TCP are slow path and do not deserve any change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 09:02:11 -07:00
Vadim Fedorenko	ac6627a28d	net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward Consolidate IPv4 MTU code the same way it is done in IPv6 to have code aligned in both address families Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 08:22:03 -07:00
Vadim Fedorenko	427faee167	net: ipv6: introduce ip6_dst_mtu_maybe_forward Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and reuse this code in ip6_mtu. Actually these two functions were almost duplicates, this change will simplify the maintaince of mtu calculation code. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 08:22:02 -07:00
Justin Iurman	3edede08ff	ipv6: ioam: Support for IOAM injection with lwtunnels Add support for the IOAM inline insertion (only for the host-to-host use case) which is per-route configured with lightweight tunnels. The target is iproute2 and the patch is ready. It will be posted as soon as this patchset is merged. Here is an overview: $ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0 This example configures an IOAM Pre-allocated Trace option attached to the fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated trace data block is 12 octets (size) and only the first IOAM data (bit 0: hop_limit + node id) is included in the trace (type) represented as a bitfield. The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not implemented is explained on the patchset cover. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 08:14:33 -07:00
Justin Iurman	8c6f6fa677	ipv6: ioam: IOAM Generic Netlink API Add Generic Netlink commands to allow userspace to configure IOAM namespaces and schemas. The target is iproute2 and the patch is ready. It will be posted as soon as this patchset is merged. Here is an overview: $ ip ioam Usage: ip ioam { COMMAND \| help } ip ioam namespace show ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ] ip ioam namespace del ID ip ioam schema show ip ioam schema add ID DATA ip ioam schema del ID ip ioam namespace set ID schema { ID \| none } Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 08:14:33 -07:00
Justin Iurman	9ee11f0fff	ipv6: ioam: Data plane support for Pre-allocated Trace Implement support for processing the IOAM Pre-allocated Trace with IPv6, see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3]. A new per-interface sysctl is introduced. The value is a boolean to accept (=1) or ignore (=0, by default) IPv6 IOAM options on ingress for an interface: - net.ipv6.conf.XXX.ioam6_enabled Two other sysctls are introduced to define IOAM IDs, represented by an integer. They are respectively per-namespace and per-interface: - net.ipv6.ioam6_id - net.ipv6.conf.XXX.ioam6_id The value of the first one represents the IOAM ID of the node itself (u32; max and default value = U32_MAX>>8, due to hop limit concatenation) while the other represents the IOAM ID of an interface (u16; max and default value = U16_MAX). Each "ioam6_id" sysctl has a "_wide" equivalent: - net.ipv6.ioam6_id_wide - net.ipv6.conf.XXX.ioam6_id_wide The value of the first one represents the wide IOAM ID of the node itself (u64; max and default value = U64_MAX>>8, due to hop limit concatenation) while the other represents the wide IOAM ID of an interface (u32; max and default value = U32_MAX). The use of short and wide equivalents is not exclusive, a deployment could choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format) could be an identifier for a physical interface, whereas net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a logical sub-interface. Documentation about new sysctls is provided at the end of this patchset. Two relativistic hash tables are used: one for IOAM namespaces, the other for IOAM schemas. A namespace can only have a single active schema and a schema can only be attached to a single namespace (1:1 relationship). [1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options [2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data [3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 08:14:33 -07:00
Vladimir Oltean	71f4f89a03	net: switchdev: recurse into __switchdev_handle_fdb_del_to_device The difference between __switchdev_handle_fdb_del_to_device and switchdev_handle_del_to_device is that the former takes an extra orig_dev argument, while the latter starts with dev == orig_dev. We should recurse into the variant that does not lose the orig_dev along the way. This is relevant when deleting FDB entries pointing towards a bridge (dev changes to the lower interfaces, but orig_dev shouldn't). The addition helper already recurses properly, just the deletion one doesn't. Fixes: `8ca07176ab` ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 08:00:16 -07:00
Yang Yang	8292d7f6e8	net: ipv4: add capability check for net administration Root in init user namespace can modify /proc/sys/net/ipv4/ip_forward without CAP_NET_ADMIN, this doesn't follow the principle of capabilities. For example, let's take a look at netdev_store(), root can't modify netdev attribute without CAP_NET_ADMIN. So let's keep the consistency of permission check logic. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 07:15:22 -07:00
Vladimir Oltean	b94dc99c0d	net: dsa: use switchdev_handle_fdb_{add,del}_to_device Using the new fan-out helper for FDB entries installed on the software bridge, we can install host addresses with the proper refcount on the CPU port, such that this case: ip link set swp0 master br0 ip link set swp1 master br0 ip link set swp2 master br0 ip link set swp3 master br0 ip link set br0 address 00:01:02:03:04:05 ip link set swp3 nomaster works properly and the br0 address remains installed as a host entry with refcount 3 instead of getting deleted. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 07:04:27 -07:00
Vladimir Oltean	8ca07176ab	net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE Currently DSA has an issue with FDB entries pointing towards the bridge in the presence of br_fdb_replay() being called at port join and leave time. In particular, each bridge port will ask for a replay for the FDB entries pointing towards the bridge when it joins, and for another replay when it leaves. This means that for example, a bridge with 4 switch ports will notify DSA 4 times of the bridge MAC address. But if the MAC address of the bridge changes during the normal runtime of the system, the bridge notifies switchdev [ once ] of the deletion of the old MAC address as a local FDB towards the bridge, and of the insertion [ again once ] of the new MAC address as a local FDB. This is a problem, because DSA keeps the old MAC address as a host FDB entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the old MAC address will not be deleted. Additionally, the new MAC address will only be installed with refcount 1, and when the first switch port leaves the bridge (leaving 3 others as still members), it will delete with it the new MAC address of the bridge from the local FDB entries kept by DSA (because the br_fdb_replay call on deletion will bring the entry's refcount from 1 to 0). So the problem, really, is that the number of br_fdb_replay() calls is not matched with the refcount that a host FDB is offloaded to DSA during normal runtime. An elegant way to solve the problem would be to make the switchdev notification emitted by br_fdb_change_mac_address() result in a host FDB kept by DSA which has a refcount exactly equal to the number of ports under that bridge. Then, no matter how many DSA ports join or leave that bridge, the host FDB entry will always be deleted when there are exactly zero remaining DSA switch ports members of the bridge. To implement the proposed solution, we remember that the switchdev objects and port attributes have some helpers provided by switchdev, which can be optionally called by drivers: switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set. These helpers: - fan out a switchdev object/attribute emitted for the bridge towards all the lower interfaces that pass the check_cb(). - fan out a switchdev object/attribute emitted for a bridge port that is a LAG towards all the lower interfaces that pass the check_cb(). In other words, this is the model we need for the FDB events too: something that will keep an FDB entry emitted towards a physical port as it is, but translate an FDB entry emitted towards the bridge into N FDB entries, one per physical port. Of course, there are many differences between fanning out a switchdev object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards a LAG should be treated specially, because FDB entries are unicast, we can't just install the same address towards 3 destinations. It is imaginable that drivers might want to treat this case specifically, so create some methods for this case and do not recurse into the LAG lower ports, just the bridge ports. DSA also listens for FDB entries on "foreign" interfaces, aka interfaces bridged with us which are not part of our hardware domain: think an Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA installs host FDB entries. However, there we have the same problem (those host FDB entries are installed with a refcount of only 1) and an even bigger one which we did not have with FDB entries towards the bridge: br_fdb_replay() is currently not called for FDB entries on foreign interfaces, just for the physical port and for the bridge itself. So when DSA sniffs an address learned by the software bridge towards a foreign interface like an e1000 port, and then that e1000 leaves the bridge, DSA remains with the dangling host FDB address. That will be fixed separately by replaying all FDB entries and not just the ones towards the port and the bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 07:04:27 -07:00
Vladimir Oltean	c6451cda10	net: switchdev: introduce helper for checking dynamically learned FDB entries It is a bit difficult to understand what DSA checks when it tries to avoid installing dynamically learned addresses on foreign interfaces as local host addresses, so create a generic switchdev helper that can be reused and is generally more readable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 07:04:27 -07:00
Vladimir Oltean	c64b9c0504	net: dsa: tag_8021q: add proper cross-chip notifier support The big problem which mandates cross-chip notifiers for tag_8021q is this: \| sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] \| +---------+ \| sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] \| +---------+ \| sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] When the user runs: ip link add br0 type bridge ip link set sw0p0 master br0 ip link set sw2p0 master br0 It doesn't work. This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and "other_ds" are at most 1 hop away from each other, so it is sufficient to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice versa and presto, the cross-chip link works. When there is another switch in the middle, such as in this case switch 1 with its DSA links sw1p3 and sw1p4, somebody needs to tell it about these VLANs too. Which is exactly why the problem is quadratic: when a port joins a bridge, for each port in the tree that's already in that same bridge we notify a tag_8021q VLAN addition of that port's RX VLAN to the entire tree. It is a very complicated web of VLANs. It must be mentioned that currently we install tag_8021q VLANs on too many ports (DSA links - to be precise, on all of them). For example, when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there isn't any port of switch 0 that is a member of br0 (at least yet). In theory we could notify only the switches which sit in between the port joining the bridge and the port reacting to that bridge_join event. But in practice that is impossible, because of the way 'link' properties are described in the device tree. The DSA bindings require DT writers to list out not only the real/physical DSA links, but in fact the entire routing table, like for example switch 0 above will have: sw0p3: port@3 { link = <&sw1p4 &sw2p4>; }; This was done because: /* TODO: ideally DSA ports would have a single dp->link_dp member, * and no dst->rtable nor this struct dsa_link would be needed, * but this would require some more complex tree walking, * so keep it stupid at the moment and list them all. */ but it is a perfect example of a situation where too much information is actively detrimential, because we are now in the position where we cannot distinguish a real DSA link from one that is put there to avoid the 'complex tree walking'. And because DT is ABI, there is not much we can change. And because we do not know which DSA links are real and which ones aren't, we can't really know if DSA switch A is in the data path between switches B and C, in the general case. So this is why tag_8021q RX VLANs are added on all DSA links, and probably why it will never change. On the other hand, at least the number of additions/deletions is well balanced, and this means that once we implement reference counting at the cross-chip notifier level a la fdb/mdb, there is absolutely zero need for a struct dsa_8021q_crosschip_link, it's all self-managing. In fact, with the tag_8021q notifiers emitted from the bridge join notifiers, it becomes so generic that sja1105 does not need to do anything anymore, we can just delete its implementation of the .crosschip_bridge_{join,leave} methods. Among other things we can simply delete is the home-grown implementation of sja1105_notify_crosschip_switches(). The reason why that is wrong is because it is not quadratic - it only covers remote switches to which we have a cross-chip bridging link and that does not cover in-between switches. This deletion is part of the same patch because sja1105 used to poke deep inside the guts of the tag_8021q context in order to do that. Because the cross-chip links went away, so needs the sja1105 code. Last but not least, dsa_8021q_setup_port() is simplified (and also renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react on the CPU port too, the four dsa_8021q_vid_apply() calls: - 1 for RX VLAN on user port - 1 for the user port's RX VLAN on the CPU port - 1 for TX VLAN on user port - 1 for the user port's TX VLAN on the CPU port now get squashed into only 2 notifier calls via dsa_port_tag_8021q_vlan_add. And because the notifiers to add and to delete a tag_8021q VLAN are distinct, now we finally break up the port setup and teardown into separate functions instead of relying on a "bool enabled" flag which tells us what to do. Arguably it should have been this way from the get go. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	e19cc13c9c	net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time There has been at least one wasted opportunity for tag_8021q to be used by a driver: https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272 because of a design decision: the declared purpose of tag_8021q is to offer source port/switch identification for a tagging driver for packets coming from a switch with no hardware DSA tagging support. It is not intended to provide VLAN-based port isolation, because its first user, sja1105, had another mechanism for bridging domain isolation, the L2 Forwarding Table. So even if 2 ports are in the same VLAN but they are separated via the L2 Forwarding Table, they will not communicate with one another. The L2 Forwarding Table is managed by the sja1105_bridge_join() and sja1105_bridge_leave() methods. As a consequence, today tag_8021q does not bother too much with hooking into .port_bridge_join() and .port_bridge_leave() because that would introduce yet another degree of freedom, it just iterates statically through all ports of a switch and adds the RX VLAN of one port to all the others. In this way, whenever .port_bridge_join() is called, bridging will magically work because the RX VLANs are already installed everywhere they need to be. This is not to say that the reason for the change in this patch is to satisfy the hellcreek and similar use cases, that is merely a nice side effect. Instead it is to make sja1105 cross-chip links work properly over a DSA link. For context, sja1105 today supports a degenerate form of cross-chip bridging, where the switches are interconnected through their CPU ports ("disjoint trees" topology). There is some code which has been generalized into dsa_8021q_crosschip_link_{add,del}, but it is not enough, and frankly it is impossible to build upon that. Real multi-switch DSA trees, like daisy chains or H trees, which have actual DSA links, do not work. The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT for cross-chip bridging, which is a table by which the local switch can select the forwarding domain for packets from a certain ingress switch ID and source port. The sja1105 switches cannot parse their own DSA tags, because, well, they don't really have support for DSA tags, it's all VLANs. So to make something like cross-chip bridging between sw0p0 and sw1p0 to work over the sw0p3/sw1p3 DSA link to work with sja1105 in the topology below: \| \| sw0p0 sw0p1 sw0p2 sw0p3 sw1p3 sw1p2 sw1p1 sw1p0 [ user ] [ user ] [ cpu ] [ dsa ] ---- [ dsa ] [ cpu ] [ user ] [ user ] we need to ask ourselves 2 questions: (1) how should the L2 Forwarding Table be managed? (2) how should the VLAN Lookup Table be managed? i.e. what should prevent packets from going to unwanted ports? Since as mentioned, there is no PVT, the L2 Forwarding Table only contains forwarding rules for local ports. So we can say "all user ports are allowed to forward to all CPU ports and all DSA links". If we allow forwarding to DSA links unconditionally, this means we must prevent forwarding using the VLAN Lookup Table. This is in fact asymmetric with what we do for tag_8021q on ports local to the same switch, and it matters because now that we are making tag_8021q a core DSA feature, we need to hook into .crosschip_bridge_join() to add/remove the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs for local forwarding in the same way as cross-chip forwarding. Note that there is a very precise reason why tag_8021q hooks into dsa_switch_bridge_join() which acts at the cross-chip notifier level, and not at a higher level such as dsa_port_bridge_join(). We need to install the RX VLAN of the newly joining port into the VLAN table of all the existing ports across the tree that are part of the same bridge, and the notifier already does the iteration through the switches for us. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	328621f613	net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register Right now, setting up tag_8021q is a 2-step operation for a driver, first the context structure needs to be created, then the VLANs need to be installed on the ports. A similar thing is true for teardown. Merge the 2 steps into the register/unregister methods, to be as transparent as possible for the driver as to what tag_8021q does behind the scenes. This also gets rid of the funny "bool setup == true means setup, == false means teardown" API that tag_8021q used to expose. Note that dsa_tag_8021q_register() must be called at least in the .setup() driver method and never earlier (like in the driver probe function). This is because the DSA switch tree is not initialized at probe time, and the cross-chip notifiers will not work. For symmetry with .setup(), the unregister method should be put in .teardown(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	5da11eb407	net: dsa: make tag_8021q operations part of the core Make tag_8021q a more central element of DSA and move the 2 driver specific operations outside of struct dsa_8021q_context (which is supposed to hold dynamic data and not really constant function pointers). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	d7b1fd520d	net: dsa: let the core manage the tag_8021q context The basic problem description is as follows: Be there 3 switches in a daisy chain topology: \| sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] \| +---------+ \| sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] \| +---------+ \| sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] The CPU will not be able to ping through the user ports of the bottom-most switch (like for example sw2p0), simply because tag_8021q was not coded up for this scenario - it has always assumed DSA switch trees with a single switch. To add support for the topology above, we must admit that the RX VLAN of sw2p0 must be added on some ports of switches 0 and 1 as well. This is in fact a textbook example of thing that can use the cross-chip notifier framework that DSA has set up in switch.c. There is only one problem: core DSA (switch.c) is not able right now to make the connection between a struct dsa_switch ds and a struct dsa_8021q_context ctx. Right now, it is drivers who call into tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer, and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del} methods. But with cross-chip notifiers, it is possible for tag_8021q to call drivers without drivers having ever asked for anything. A good example is right above: when sw2p0 wants to set itself up for tag_8021q, the .tag_8021q_vlan_add method needs to be called for switches 1 and 0, so that they transport sw2p0's VLANs towards the CPU without dropping them. So instead of letting drivers manage the tag_8021q context, add a tag_8021q_ctx pointer inside of struct dsa_switch, which will be populated when dsa_tag_8021q_register() returns success. The patch is fairly long-winded because we are partly reverting commit `5899ee367a` ("net: dsa: tag_8021q: add a context structure") which made the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we can access "ctx" directly from "ds", this is no longer needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	8b6e638b4b	net: dsa: build tag_8021q.c as part of DSA core Upcoming patches will add tag_8021q related logic to switch.c and port.c, in order to allow it to make use of cross-chip notifiers. In addition, a struct dsa_8021q_context ctx pointer will be added to struct dsa_switch. It seems fairly low-reward to #ifdef the ctx from struct dsa_switch and to provide shim implementations of the entire tag_8021q.c calling surface (not even clear what to do about the tag_8021q cross-chip notifiers to avoid compiling them). The runtime overhead for switches which don't use tag_8021q is fairly small because all helpers will check for ds->tag_8021q_ctx being a NULL pointer and stop there. So let's make it part of dsa_core.o. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	cedf467064	net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpers In preparation of moving tag_8021q to core DSA, move all initialization and teardown related to tag_8021q which is currently done by drivers in 2 functions called "register" and "unregister". These will gather more functionality in future patches, which will better justify the chosen naming scheme. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	69ebb37064	net: dsa: tag_8021q: use symbolic error names Use %pe to give the user a string holding the error code instead of just a number. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	a81a45744b	net: dsa: tag_8021q: use "err" consistently instead of "rc" Some of the tag_8021q code has been taken out of sja1105, which uses "rc" for its return code variables, whereas the DSA core uses "err". Change tag_8021q for consistency. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
Vladimir Oltean	0fac6aa098	net: dsa: sja1105: delete the best_effort_vlan_filtering mode Simply put, the best-effort VLAN filtering mode relied on VLAN retagging from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to decode the source port in the tagger, but the VLAN retagging implementation inside the sja1105 chips is not the best and we were relying on marginal operating conditions. The most notable limitation of the best-effort VLAN filtering mode is its incapacity to treat this case properly: ip link add br0 type bridge vlan_filtering 1 ip link set swp2 master br0 ip link set swp4 master br0 bridge vlan del dev swp4 vid 1 bridge vlan add dev swp4 vid 1 pvid When sending an untagged packet through swp2, the expectation is for it to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1 on egress). But the switch will send it as egress-untagged. There was an attempt to fix this here: https://patchwork.kernel.org/project/netdevbpf/patch/20210407201452.1703261-2-olteanv@gmail.com/ but it failed miserably because it broke PTP RX timestamping, in a way that cannot be corrected due to hardware issues related to VLAN retagging. So with either PTP broken or pushing VLAN headers on egress for untagged packets being broken, the sad reality is that the best-effort VLAN filtering code is broken. Delete it. Note that this means there will be a temporary loss of functionality in this driver until it is replaced with something better (network stack RX/TX capability for "mode 2" as described in Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware bridge" case). We simply cannot keep this code until that driver rework is done, it is super bloated and tangled with tag_8021q. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:36:42 -07:00
David S. Miller	542bb39651	Merge branch 'veth-flexible-channel-numbers' Paolo Abeni says: ==================== veth: more flexible channels number configuration XDP setups can benefit from multiple veth RX/TX queues. Currently veth allow setting such number only at creation time via the 'numrxqueues' and 'numtxqueues' parameters. This series introduces support for the ethtool set_channel operation and allows configuring the queue number via a new module parameter. The veth default configuration is not changed. Finally self-tests are updated to check the new features, with both valid and invalid arguments. This iteration is a rebase of the most recent RFC, it does not provide a module parameter to configure the default number of queues, but I think could be worthy RFC v1 -> RFC v2: - report more consistent 'combined' count - make set_channel as resilient as possible to errors - drop module parameter - but I would still consider it. - more self-tests ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:23:13 -07:00
David S. Miller	2c08040447	Merge branch 'bridge-vlan-multicast' Nikolay Aleksandrov says: ==================== net: bridge: multicast: add vlan support This patchset adds initial per-vlan multicast support, most of the code deals with moving to multicast context pointers from bridge/port pointers. That allows us to switch them with the per-vlan contexts when a multicast packet is being processed and vlan multicast snooping has been enabled. That is controlled by a global bridge option added in patch 06 which is off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note that this option can change only under RTNL and doesn't require multicast_lock, so we need to be careful when retrieving mcast contexts in parallel. For packet processing they are switched only once in br_multicast_rcv() and then used until the packet has been processed. For the most part we need these contexts only to read config values and check if they are disabled. The global mcast state which is maintained consists of querier and router timers, the rest are config options. The port mcast state which is maintained consists of query timer and link to router port list if it's ever marked as a router port. Port multicast contexts _must_ be used only with their respective global contexts, that is a bridge port's mcast context must be used only with bridge's global mcast context and a vlan/port's mcast context must be used only with that vlan's global mcast context due to the router port lists. This way a bridge port can be marked as a router in multiple vlans, but might not be a router in some other vlan. Also this allows us to have per-vlan querier elections, per-vlan queries and basically the whole multicast state becomes per-vlan when the option is enabled. One of the hardest parts is synchronization with vlan's memory management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED which is changed only under multicast_lock. When a vlan is being destroyed first that flag is removed under the lock, then the multicast context is torn down which includes waiting for any outstanding context timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED it must be checked first if the contexts are vlan and the multicast_lock has been acquired. That is done by all IGMP/MLD packet processing functions and timers. When processing a packet we have RCU so the vlan memory won't be freed, but if the flag is missing we must not process it. The timers are synchronized in the same way with the addition of waiting for them to finish in case they are running after removing the flag under multicast_lock (i.e. they were waiting for the lock). Multicast vlan snooping requires vlan filtering to be enabled, if it's disabled then snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all vlan disabled checks. We need both flags because one is controlled by user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED). Since the latter is also used for synchronization between the multicast and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we rely on it when checking if a vlan context is disabled. The multicast fast-path has 3 new bit tests on the cache-hot bridge flags field, I didn't observe any measurable difference. I haven't forced either context options to be always disabled when the other type is enabled because the state consists of timers which either expire (router) or don't affect the normal operation. Some options, like the mcast querier one, won't be allowed to change for the disabled context type, that will come with a future patch-set which adds per-vlan querier control. Another important addition is the global vlan options, so far we had only per bridge/port vlan options but in order to control vlan multicast snooping globally we need to add a new type of global vlan options. They can be changed only on the bridge device and are dumped only when a special flag is set in the dump request. The first global option is vlan mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED private flag. It can be set only on master vlan entries. There will be many more global vlan options in the future both for multicast config and other per-vlan options (e.g. STP). There's a lot of room for improvements, I'll do some of the initial ones but splitting the state to different contexts opens the door for a lot more. Also any new multicast options become vlan-supported with very little to no effort by using the same contexts. Short patch description: patches 01-04: initial mcast context add, no functional changes patch 05: adds vlan mcast init and control helpers and uses them on vlan create/destroy patch 06: adds a global bridge mcast vlan snooping knob (default off) patches 07-08: add a helper for users which must derive the contexts based on current bridge and vlan options (e.g. timers) patch 09: adds checks for disabled vlan contexts in packet processing and timers patch 10: adds support for per-vlan querier and tagged queries patch 11: adds router port vlan id in the notifications patches 12-14: add global vlan options support (change, dump, notify) patch 15: adds per-vlan global mcast snooping control Future patch-sets which build on this one (in order): - vlan state mcast handling - user-space mdb contexts (currently only the bridge contexts are used there) - all bridge multicast config options added per-vlan global and per vlan/port - iproute2 support for all the new uAPIs - selftests This set has been stress-tested (deleting/adding ports/vlans while changing vlan mcast snooping while processing IGMP/MLD packets), and also has passed all bridge self-tests. I'm sending this set as early as possible since there're a few more related sets that should go in the same release to get proper and full mcast vlan snooping support. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:22:14 -07:00
Vasily Averin	2c6ad20b58	memcg: enable accounting for scm_fp_list objects unix sockets allows to send file descriptors via SCM_RIGHTS type messages. Each such send call forces kernel to allocate up to 2Kb memory for struct scm_fp_list. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:00:38 -07:00
Vasily Averin	1b51d82719	memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Author: Andrey Ryabinin <aryabinin@virtuozzo.com> The size of the ip_tunnel_prl structs allocation is controllable from user-space, thus it's better to avoid spam in dmesg if allocation failed. Also add __GFP_ACCOUNT as this is a good candidate for per-memcg accounting. Allocation is temporary and limited by 4GB. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:00:38 -07:00
Vasily Averin	a89893dd7b	memcg: enable accounting for VLAN group array vlan array consume up to 8 pages of memory per net device. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:00:38 -07:00
Vasily Averin	990c74e3f4	memcg: enable accounting for inet_bin_bucket cache net namespace can create up to 64K tcp and dccp ports and force kernel to allocate up to several megabytes of memory per netns for inet_bind_bucket objects. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:00:38 -07:00
Vasily Averin	6126891c6d	memcg: enable accounting for IP address and routing-related objects An netadmin inside container can use 'ip a a' and 'ip r a' to assign a large number of ipv4/ipv6 addresses and routing entries and force kernel to allocate megabytes of unaccounted memory for long-lived per-netdevice related kernel objects: 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node', 'struct rt6_info', 'struct fib_rules' and ip_fib caches. These objects can be manually removed, though usually they lives in memory till destroy of its net namespace. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. One of such objects is the 'struct fib6_node' mostly allocated in net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section: write_lock_bh(&table->tb6_lock); err = fib6_add(&table->tb6_root, rt, info, mxc); write_unlock_bh(&table->tb6_lock); In this case it is not enough to simply add SLAB_ACCOUNT to corresponding kmem cache. The proper memory cgroup still cannot be found due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass(). Obsoleted in_interrupt() does not describe real execution context properly. >From include/linux/preempt.h: The following macros are deprecated and should not be used in new code: in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled To verify the current execution context new macro should be used instead: in_task() - We're in task context Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:00:38 -07:00
Vasily Averin	c948f51c16	memcg: enable accounting for net_device and Tx/Rx queues Container netadmin can create a lot of fake net devices, then create a new net namespace and repeat it again and again. Net device can request the creation of up to 4096 tx and rx queues, and force kernel to allocate up to several tens of megabytes memory per net device. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 06:00:38 -07:00
Nikolay Aleksandrov	9dee572c38	net: bridge: vlan: add mcast snooping control Add a new global vlan option which controls whether multicast snooping is enabled or disabled for a single vlan. It controls the vlan private flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	9aba624d7c	net: bridge: vlan: notify when global options change Add support for global options notifications. They use only RTM_NEWVLAN since global options can only be set and are contained in a separate vlan global options attribute. Notifications are compressed in ranges where possible, i.e. the sequential vlan options are equal. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	743a53d963	net: bridge: vlan: add support for dumping global vlan options Add a new vlan options dump flag which causes only global vlan options to be dumped. The dumps are done only with bridge devices, ports are ignored. They support vlan compression if the options in sequential vlans are equal (currently always true). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	47ecd2dbd8	net: bridge: vlan: add support for global options We can have two types of vlan options depending on context: - per-device vlan options (split in per-bridge and per-port) - global vlan options The second type wasn't supported in the bridge until now, but we need them for per-vlan multicast support, per-vlan STP support and other options which require global vlan context. They are contained in the global bridge vlan context even if the vlan is not configured on the bridge device itself. This patch adds initial netlink attributes and support for setting these global vlan options, they can only be set (RTM_NEWVLAN) and the operation must use the bridge device. Since there are no such options yet it shouldn't have any functional effect. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	1e9ca45662	net: bridge: multicast: include router port vlan id in notifications Use the port multicast context to check if the router port is a vlan and in case it is include its vlan id in the notification. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	615cc23e62	net: bridge: multicast: add vlan querier and query support Add basic vlan context querier support, if the contexts passed to multicast_alloc_query are vlan then the query will be tagged. Also handle querier start/stop of vlan contexts. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	4cdd0d10f3	net: bridge: multicast: check if should use vlan mcast ctx Add helpers which check if the current bridge/port multicast context should be used (i.e. they're not disabled) and use them for Rx IGMP/MLD processing, timers and new group addition. It is important for vlans to disable processing of timer/packet after the multicast_lock is obtained if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two cases when that flag is missing: - if the vlan is getting destroyed it will be removed and timers will be stopped - if the vlan mcast snooping is being disabled Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	eb1593a0b4	net: bridge: multicast: use the port group to port context helper We need to use the new port group to port context helper in places where we cannot pass down the proper context (i.e. functions that can be called by timers or outside the packet snooping paths). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	74edfd483d	net: bridge: multicast: add helper to get port mcast context from port group Add br_multicast_pg_to_port_ctx() which returns the proper port multicast context from either port or vlan based on bridge option and vlan flags. As the comment inside explains the locking is a bit tricky, we rely on the fact that BR_VLFLAG_MCAST_ENABLED requires multicast_lock to change and we also require it to be held to call that helper. If we find the vlan under rcu and it still has the flag then we can be sure it will be alive until we unlock multicast_lock which should be enough. Note that the context might change from vlan to bridge between different calls to this helper as the mcast vlan knob requires only rtnl so it should be used carefully and for read-only/check purposes. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	f4b7002a70	net: bridge: add vlan mcast snooping knob Add a global knob that controls if vlan multicast snooping is enabled. The proper contexts (vlan or bridge-wide) will be chosen based on the knob when processing packets and changing bridge device state. Note that vlans have their individual mcast snooping enabled by default, but this knob is needed to turn on bridge vlan snooping. It is disabled by default. To enable the knob vlan filtering must also be enabled, it doesn't make sense to have vlan mcast snooping without vlan filtering since that would lead to inconsistencies. Disabling vlan filtering will also automatically disable vlan mcast snooping. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	7b54aaaf53	net: bridge: multicast: add vlan state initialization and control Add helpers to enable/disable vlan multicast based on its flags, we need two flags because we need to know if the vlan has multicast enabled globally (user-controlled) and if it has it enabled on the specific device (bridge or port). The new private vlan flags are: - BR_VLFLAG_MCAST_ENABLED: locally enabled multicast on the device, used when removing a vlan, toggling vlan mcast snooping and controlling single vlan (kernel-controlled, valid under RTNL and multicast_lock) - BR_VLFLAG_GLOBAL_MCAST_ENABLED: globally enabled multicast for the vlan, used to control the bridge-wide vlan mcast snooping for a single vlan (user-controlled, can be checked under any context) Bridge vlan contexts are created with multicast snooping enabled by default to be in line with the current bridge snooping defaults. In order to actually activate per vlan snooping and context usage a bridge-wide knob will be added later which will default to disabled. If that knob is enabled then automatically all vlan snooping will be enabled. All vlan contexts are initialized with the current bridge multicast context defaults. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov	613d61dbef	net: bridge: vlan: add global and per-port multicast context Add global and per-port vlan multicast context, only initialized but still not used. No functional changes intended. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00

1 2 3 4 5 ...

65905 Commits