OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Ido Schimmel	07e1a5809b	psample: Add additional metadata attributes Extend psample to report the following attributes when available: * Output traffic class as a 16-bit value * Output traffic class occupancy in bytes as a 64-bit value * End-to-end latency of the packet in nanoseconds resolution * Software timestamp in nanoseconds resolution (always available) * Packet's protocol. Needed for packet dissection in user space (always available) Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 15:00:43 -07:00
Ido Schimmel	a03e99d39f	psample: Encapsulate packet metadata in a struct Currently, callers of psample_sample_packet() pass three metadata attributes: Ingress port, egress port and truncated size. Subsequent patches are going to add more attributes (e.g., egress queue occupancy), which also need an indication whether they are valid or not. Encapsulate packet metadata in a struct in order to keep the number of arguments reasonable. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 15:00:43 -07:00
Alexander Lobakin	59753ce8b1	ethernet: constify eth_get_headlen()'s data argument It's used only for flow dissection, which now takes constant data pointers. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:48:02 -07:00
Alexander Lobakin	f96533cded	flow_dissector: constify raw input data argument Flow Dissector code never modifies the input buffer, neither skb nor raw data. Make 'data' argument const for all of the Flow dissector's functions. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:46:32 -07:00
Alexander Lobakin	d0eed5c325	gro: give 'hash' variable in dev_gro_receive() a less confusing name 'hash' stores not the flow hash, but the index of the GRO bucket corresponding to it. Change its name to 'bucket' to avoid confusion while reading lines like '__set_bit(hash, &napi->gro_bitmask)'. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:41:09 -07:00
Alexander Lobakin	9dc2c31337	gro: consistentify napi->gro_hash[x] access in dev_gro_receive() GRO bucket index doesn't change through the entire function. Store a pointer to the corresponding bucket instead of its member and use it consistently through the function. It is performance-safe since &gro_list->list == gro_list. Misc: remove superfluous braces around single-line branches. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:41:08 -07:00
Alexander Lobakin	0ccf4d50d1	gro: simplify gro_list_prepare() gro_list_prepare() always returns &napi->gro_hash[bucket].list, without any variations. Moreover, it uses 'napi' argument only to have access to this list, and calculates the bucket index for the second time (firstly it happens at the beginning of dev_gro_receive()) to do that. Given that dev_gro_receive() already has an index to the needed list, just pass it as the first argument to eliminate redundant calculations, and make gro_list_prepare() return void. Also, both arguments of gro_list_prepare() can be constified since this function can only modify the skbs from the bucket list. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:41:08 -07:00
David S. Miller	ebc71a3804	There is only a single patch this time: - Use netif_rx_any_context(), by Sebastian Andrzej Siewior -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmBLjJUWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSb6EACfWUz38XgTB/qpmwzFLVj5ho7Z NhkULQqwL8iGo/nOEImQP51wTXF4latraAUy2S188ZhOyZyHdpATL0u66/jIWblE zsxxNulKWBycK641F1HMgd2NpgmdCDE7FWEh4IdOdAM8IoY36j2qCwDr1beK5/fN N6OuJeELaXkCjCNzdTMaEO10yFujqNZ/eRGaC30UMBlqjBNsmzUlGNuMcdr54beq lLJ4MRA4pX95v2iIzCECYOciw4kfZLtBgTKjq1dVXdDnUgSkAtbzEFDQDWok/8Ue dwiGqZeyIXZgGdpiqftsaxEiTmSC9XsZNKU5gOhK0fXCW1blXtvEvcd76HkKvsTl QyTZ9I2oPjkgki8x7t5EvBGyIbz+RW9p9LjfWtOjg12+VNzhQM0I+7H5hlVqeE9p 1owK2c5G9p25WwLC9lihAdtG2PStchx4Uv0JIAYyVzBi73nbKf2awArVU1X6iYFK j/mAKNSVjXgbPDrjc9AATe+rtDud4e7qDr9VHJgwuW73LbifWM+d5bC5ile4aTxf rpJzOg+Lo7bdEo3i3Zdzqii1Y1+4jh7diH4rhTr/pIbtrWd+y27c7tEOVsEkvIIn Ty3frqi3ourKvR7rS21CaOn3srRgn8JUzhJrOnEjqYNxtcDmIrV3hMM1hN5aN8c5 nWKODAZQORbi1zAqyQ== =HAr3 -----END PGP SIGNATURE----- Merge tag 'batadv-next-pullrequest-20210312' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== There is only a single patch this time: - Use netif_rx_any_context(), by Sebastian Andrzej Siewior ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-13 14:27:56 -08:00
Baowen Zheng	2ffe039528	net/sched: act_police: add support for packet-per-second policing Allow a policer action to enforce a rate-limit based on packets-per-second, configurable using a packet-per-second rate and burst parameters. e.g. tc filter add dev tap1 parent ffff: u32 match \ u32 0 0 police pkts_rate 3000 pkts_burst 1000 Testing was unable to uncover a performance impact of this change on existing features. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-13 14:18:09 -08:00
Xingfeng Hu	25660156f4	flow_offload: add support for packet-per-second policing Allow flow_offload API to configure packet-per-second policing using rate and burst parameters. Dummy implementations of tcf_police_rate_pkt_ps() and tcf_police_burst_pkt() are supplied which return 0, the unconfigured state. This is to facilitate splitting the offload, driver, and TC code portion of this feature into separate patches with the aim of providing a logical flow for review. And the implementation of these helpers will be filled out by a follow-up patch. Signed-off-by: Xingfeng Hu <xingfeng.hu@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-13 14:18:09 -08:00
Geliang Tang	0e4a3e6886	mptcp: remove a list of addrs when flushing This patch invoked mptcp_nl_remove_addrs_list to remove a list of addresses when the netlink flushes addresses, instead of using mptcp_nl_remove_subflow_and_signal_addr to remove them one by one. And dropped the unused parameter net in __flush_addrs too. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	06faa22710	mptcp: remove multi addresses and subflows in PM This patch implemented the function to remove a list of addresses and subflows, named mptcp_nl_remove_addrs_list, which had a input parameter rm_list as the removing addresses list. In mptcp_nl_remove_addrs_list, traverse all the existing msk sockets to invoke mptcp_pm_remove_addrs_and_subflows to remove a list of addresses for each msk socket. In mptcp_pm_remove_addrs_and_subflows, traverse all the addresses in the removing addresses list, to find whether this address is in the conn_list or anno_list. If it is, put the address ID into the removing address list or the removing subflow list, and pass the two lists to mptcp_pm_remove_addr and mptcp_pm_remove_subflow. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	ddd14bb85d	mptcp: remove multi subflows in PM This patch dealt with removing multi subflows in PM: In mptcp_pm_remove_subflow, changed the input parameter local_id as an list of removing address ids, and passed the list to mptcp_pm_nl_rm_subflow_received. In mptcp_pm_nl_rm_subflow_received, iterated each address id from the received ids list. Then shut down and closed each address id's subsocket. In mptcp_nl_remove_subflow_and_signal_addr, put the single address id into an ids list, and passed it to mptcp_pm_remove_subflow. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	d0b698ca9a	mptcp: remove multi addresses in PM This patch dropped the member rm_id of struct mptcp_pm_data. Use rm_list_rx in mptcp_pm_nl_rm_addr_received instead of using rm_id. In mptcp_pm_nl_rm_addr_received, iterated each address id from pm.rm_list_rx, then shut down and closed each address id's subsocket. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	b5c55f334c	mptcp: add rm_list_rx in mptcp_pm_data This patch added a new member rm_list_rx for struct mptcp_pm_data as an list of the removing address ids on the incoming direction. Initialized its nr field to zero in mptcp_pm_data_init. In mptcp_pm_rm_addr_received, set it as the input rm_list. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	5c4a824dcb	mptcp: add rm_list in mptcp_options_received This patch changed the member rm_id in struct mptcp_options_received as a list of the removing address ids, and renamed it to rm_list. In mptcp_parse_option, parsed the RM_ADDR suboption and filled them into the rm_list in struct mptcp_options_received. In mptcp_incoming_options, passed this rm_list to the function mptcp_pm_rm_addr_received. It also changed the parameter type of mptcp_pm_rm_addr_received. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	cbde278718	mptcp: add rm_list_tx in mptcp_pm_data This patch added a new member rm_list_tx for struct mptcp_pm_data as the removing address list on the outgoing direction. Initialize its nr field to zero in mptcp_pm_data_init. In mptcp_pm_remove_anno_addr, put the single address id into an removing list, and passed it to mptcp_pm_remove_addr. In mptcp_pm_remove_addr, save the input rm_list to rm_list_tx in struct mptcp_pm_data. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Geliang Tang	6445e17af7	mptcp: add rm_list in mptcp_out_options This patch defined a new struct mptcp_rm_list, the ids field was an array of the removing address ids, the nr field was the valid number of removing address ids in the array. The array size was definced as a new macro MPTCP_RM_IDS_MAX. Changed the member rm_id of struct mptcp_out_options to rm_list. In mptcp_established_options_rm_addr, invoked mptcp_pm_rm_addr_signal to get the rm_list. According the number of addresses in it, calculated the padded RM_ADDR suboption length. And saved the ids array in struct mptcp_out_options's rm_list member. In mptcp_write_options, iterated each address id from struct mptcp_out_options's rm_list member, set the invalid ones as TCPOPT_NOP, then filled them into the RM_ADDR suboption. Changed TCPOLEN_MPTCP_RM_ADDR_BASE from 4 to 3. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:47:45 -08:00
Shubhankar Kuranagatti	6ad086009f	net: ipv4: route.c: Fix indentation of multi line comment. All comment lines inside the comment block have been aligned. Every line of comment starts with a * (uniformity in code). Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-12 17:02:30 -08:00
Eric Dumazet	ac3959fd0d	tcp: remove obsolete check in __tcp_retransmit_skb() TSQ provides a nice way to avoid bufferbloat on individual socket, including retransmit packets. We can get rid of the old heuristic: /* Do not sent more than we queued. 1/4 is reserved for possible * copying overhead: fragmentation, tunneling, mangling etc. */ if (refcount_read(&sk->sk_wmem_alloc) > min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) return -EAGAIN; This heuristic was giving false positives according to Jakub, whenever TX completions are delayed above RTT. (Ack packets are processed by TCP stack before clones are orphaned/freed) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Eric Dumazet	a7abf3cd76	tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack() Jakub reported Data included in a Fastopen SYN that had to be retransmit would have to wait for an RTO if TX completions are slow, even with prior fix. This is because tcp_rcv_fastopen_synack() does not use standard rtx logic, meaning TSQ handler exits early in tcp_tsq_write() because tp->lost_out == tp->retrans_out Lets make tcp_rcv_fastopen_synack() use standard rtx logic, by using tcp_mark_skb_lost() on the skb thats needs to be sent again. Not this raised a warning in tcp_fastretrans_alert() during my tests since we consider the data not being aknowledged by the receiver does not mean packet was lost on the network. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Eric Dumazet	f4dae54e48	tcp: plug skb_still_in_host_queue() to TSQ Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters) Main issue is that TCP stack has a logic preventing a packet being retransmit if the prior clone has not yet been orphaned or freed. This logic came with commit `1f3279ae0c` ("tcp: avoid retransmits of TCP packets hanging in host queues") Thankfully, in the case skb_still_in_host_queue() detects the initial clone is still in flight, it can use TSQ logic that will eventually retry later, at the moment the clone is freed or orphaned. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Neil Spring <ntspring@fb.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:35:31 -08:00
Hoang Huu Le	97bc84bbd4	tipc: clean up warnings detected by sparse This patch fixes the following warning from sparse: net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types) net/tipc/monitor.c:263:35: expected unsigned int net/tipc/monitor.c:263:35: got restricted __be32 [usertype] [...] net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock [...] net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces) net/tipc/crypto.c:1201:9: expected struct tipc_aead [noderef] __rcu __tmp net/tipc/crypto.c:1201:9: got struct tipc_aead [...] Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:06:54 -08:00
Hoang Le	1980d37565	tipc: convert dest node's address to network order (struct tipc_link_info)->dest is in network order (__be32), so we must convert the value to network order before assigning. The problem detected by sparse: net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types) net/tipc/netlink_compat.c:699:24: expected restricted __be32 [usertype] dest net/tipc/netlink_compat.c:699:24: got int Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 18:06:54 -08:00
Petr Machata	15e1dd5703	nexthop: Enable resilient next-hop groups Now that all the code is in place, stop rejecting requests to create resilient next-hop groups. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	0b4818aabc	nexthop: Notify userspace about bucket migrations Nexthop replacements et.al. are notified through netlink, but if a delayed work migrates buckets on the background, userspace will stay oblivious. Notify these as RTM_NEWNEXTHOPBUCKET events. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	187d4c6b97	nexthop: Add netlink handlers for bucket get Allow getting (but not setting) individual buckets to inspect the next hop mapped therein, idle time, and flags. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	8a1bbabb03	nexthop: Add netlink handlers for bucket dump Add a dump handler for resilient next hop buckets. When next-hop group ID is given, it walks buckets of that group, otherwise it walks buckets of all groups. It then dumps the buckets whose next hops match the given filtering criteria. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Petr Machata	a2601e2b1e	nexthop: Add netlink handlers for resilient nexthop groups Implement the netlink messages that allow creation and dumping of resilient nexthop groups. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Ido Schimmel	cfc15c1dbb	nexthop: Allow reporting activity of nexthop buckets The kernel periodically checks the idle time of nexthop buckets to determine if they are idle and can be re-populated with a new nexthop. When the resilient nexthop group is offloaded to hardware, the kernel will not see activity on nexthop buckets unless it is reported from hardware. Add a function that can be periodically called by device drivers to report activity on nexthop buckets after querying it from the underlying device. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:13:00 -08:00
Ido Schimmel	56ad5ba344	nexthop: Allow setting "offload" and "trap" indication of nexthop buckets Add a function that can be called by device drivers to set "offload" or "trap" indication on nexthop buckets following nexthop notifications and other changes such as a neighbour becoming invalid. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	7c37c7e004	nexthop: Implement notifiers for resilient nexthop groups Implement the following notifications towards drivers: - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created. - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of next hops to hash table buckets. That includes replacements, deletions, and delayed upkeep cycles. Some bucket notifications can be vetoed by the driver, to make it possible to propagate bucket busy-ness flags from the HW back to the algorithm. Some are however forced, e.g. if a next hop is deleted, all buckets that use this next hop simply must be migrated, whether the HW wishes so or not. - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is replaced. Usually the driver will get the bucket notifications as well, and could veto those. But in some cases, a bucket may not be migrated immediately, but during delayed upkeep, and that is too late to roll the transaction back. This notification allows the driver to take a look and veto the new proposed group up front, before anything is committed. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	283a72a559	nexthop: Add implementation of resilient next-hop groups At this moment, there is only one type of next-hop group: an mpath group, which implements the hash-threshold algorithm. To select a next hop, hash-threshold algorithm first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of hash space from one next hop to another. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to. That causes problems e.g. as established TCP connections are reset, because the traffic is forwarded to a server that is not familiar with the connection. Resilient hashing is a technique to address the above problem. Resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation to choose a hash bucket, and then reads the next hop that this bucket contains, and forwards traffic there. This indirection brings an important feature. In the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous. With a hash table, mapping between the hash table buckets and the individual next hops is arbitrary. Therefore when a next hop is deleted the buckets that held it are simply reassigned to other next hops. When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows are ideally kept being forwarded to the same endpoints through the same paths as before the next-hop group change. In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that has fewer buckets than it wants (it is "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled to a future time. Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and for whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle buckets. There are three users of the resilient data structures. - The forwarding code accesses them under RCU, and does not modify them except for updating the time a selected bucket was last used. - Netlink code, running under RTNL, which may modify the data. - The delayed upkeep code, which may modify the data. This runs unlocked, and mutual exclusion between the RTNL code and the delayed upkeep is maintained by canceling the delayed work synchronously before the RTNL code touches anything. Later it restarts the delayed work if necessary. The RTNL code has to implement next-hop group replacement, next hop removal, etc. For removal, the mpath code uses a neat trick of having a backup next hop group structure, doing the necessary changes offline, and then RCU-swapping them in. However, the hash tables for resilient hashing are about an order of magnitude larger than the groups themselves (the size might be e.g. 4K entries), and it was felt that keeping two of them is an overkill. Both the primary next-hop group and the spare therefore use the same resilient table, and writers are careful to keep all references valid for the forwarding code. The hash table references next-hop group entries from the next-hop group that is currently in the primary role (i.e. not spare). During the transition from primary to spare, the table references a mix of both the primary group and the spare. When a next hop is deleted, the corresponding buckets are not set to NULL, but instead marked as empty, so that the pointer is valid and can be used by the forwarding code. The buckets are then migrated to a new next-hop group entry during upkeep. The only times that the hash table is invalid is the very beginning and very end of its lifetime. Between those points, it is always kept valid. This patch introduces the core support code itself. It does not handle notifications towards drivers, which are kept as if the group were an mpath one. It does not handle netlink either. The only bit currently exposed to user space is the new next-hop group type, and that is currently bounced. There is therefore no way to actually access this code. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Ido Schimmel	710ec56223	nexthop: Add netlink defines and enumerators for resilient NH groups - RTM_NEWNEXTHOP et.al. that handle resilient groups will have a new nested attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_. - RTM_NEWNEXTHOPBUCKET et.al. is a suite of new messages that will currently serve only for dumping of individual buckets of resilient next hop groups. For nexthop group buckets, these messages will carry a nested attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_. There are several reasons why a new suite of messages is created for nexthop buckets instead of overloading the information on the existing RTM_{NEW,DEL,GET}NEXTHOP messages. First, a nexthop group can contain a large number of nexthop buckets (4k is not unheard of). This imposes limits on the amount of information that can be encoded for each nexthop bucket given a netlink message is limited to 64k bytes. Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this point, in the future it can be extended to provide user space with control over nexthop buckets configuration. - The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is adjusted to bounce groups with that type for now. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	90e1a9e213	nexthop: Add a dedicated flag for multipath next-hop groups With the introduction of resilient nexthop groups, there will be two types of multipath groups: the current hash-threshold "mpath" ones, and resilient groups. Both are multipath, but to determine the fact, the system needs to consider two flags. This might prove costly in the datapath. Therefore, introduce a new flag, that should be set for next-hop groups that have more than one nexthop, and should be considered multipath. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	96a856256a	nexthop: __nh_notifier_single_info_init(): Make nh_info an argument The cited function currently uses rtnl_dereference() to get nh_info from a handed-in nexthop. However, under the resilient hashing scheme, this function will not always be called under RTNL, sometimes the mutual exclusion will be achieved differently. Therefore move the nh_info extraction from the function to its callers to make it possible to use a different synchronization guarantee. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Petr Machata	597f48e46b	nexthop: Pass nh_config to replace_nexthop() Currently, replace assumes that the new group that is given is a fully-formed object. But mpath groups really only have one attribute, and that is the constituent next hop configuration. This may not be universally true. From the usability perspective, it is desirable to allow the replace operation to adjust just the constituent next hop configuration and leave the group attributes as such intact. But the object that keeps track of whether an attribute was or was not given is the nh_config object, not the next hop or next-hop group. To allow (selective) attribute updates during NH group replacement, propagate `cfg' to replace_nexthop() and further to replace_nexthop_grp(). Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:12:59 -08:00
Julien Massonneau	fbbc5bc2ab	seg6: ignore routing header with segments left equal to 0 When there are 2 segments routing header, after an End.B6 action for example, the second SRH will never be handled by an action, packet will be dropped when the first SRH has segments left equal to 0. For actions that doesn't perform decapsulation (currently: End, End.X, End.T, End.B6, End.B6.Encaps), this patch adds the IP6_FH_F_SKIP_RH flag in arguments for ipv6_find_hdr(). Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:09:21 -08:00
Julien Massonneau	ee90c6ba34	seg6: add support for IPv4 decapsulation in ipv6_srh_rcv() As specified in IETF RFC 8754, section 4.3.1.2, if the upper layer header is IPv4 or IPv6, perform IPv6 decapsulation and resubmit the decapsulated packet to the IPv4 or IPv6 module. Only IPv6 decapsulation was implemented. This patch adds support for IPv4 decapsulation. Link: https://tools.ietf.org/html/rfc8754#section-4.3.1.2 Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-11 16:09:21 -08:00
Shubhankar Kuranagatti	6b9c8f46af	net: ipv4: route.c: fix space before tab The extra space before tab space has been removed. Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 15:37:19 -08:00
Shubhankar Kuranagatti	13fdb9403d	net: ipv6: route.c:fix indentation The series of space has been replaced by tab space wherever required. Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:16 -08:00
Ido Schimmel	58c04397f7	sched: act_sample: Implement stats_update callback Implement this callback in order to get the offloaded stats added to the kernel stats. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Yunsheng Lin	1ddc3229ad	skbuff: remove some unnecessary operation in skb_segment_list() gro list uses skb_shinfo(skb)->frag_list to link two skb together, and NAPI_GRO_CB(p)->last->next is used when there are more skb, see skb_gro_receive_list(). gso expects that each segmented skb is linked together using skb->next, so only the first skb->next need to set to skb_shinfo(skb)-> frag_list when doing gso list segment. It is the same reason that nskb->next does not need to be set to list_skb before goto the error handling, because nskb->next already pointers to list_skb. And nskb is also the last skb at the end of loop, so remove tail variable and use nskb instead. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Gustavo A. R. Silva	90d181ca48	net: rose: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Gustavo A. R. Silva	b1866bfff9	net: core: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Gustavo A. R. Silva	ecd1c6a51f	net: bridge: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Gustavo A. R. Silva	5646fba6ea	net: ax25: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Gustavo A. R. Silva	4cdbe58b4b	decnet: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Yejune Deng	3e6f20e09a	net/rds: Drop duplicate sin and sin6 assignments There is no need to assign the msg->msg_name to sin or sin6, because there is DECLARE_SOCKADDR statement. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
David S. Miller	c1acda9807	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-09 The following pull-request contains BPF updates for your net-next tree. We've added 90 non-merge commits during the last 17 day(s) which contain a total of 114 files changed, 5158 insertions(+), 1288 deletions(-). The main changes are: 1) Faster bpf_redirect_map(), from Björn. 2) skmsg cleanup, from Cong. 3) Support for floating point types in BTF, from Ilya. 4) Documentation for sys_bpf commands, from Joe. 5) Support for sk_lookup in bpf_prog_test_run, form Lorenz. 6) Enable task local storage for tracing programs, from Song. 7) bpf_for_each_map_elem() helper, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-09 18:07:05 -08:00

1 2 3 4 5 ...

64232 Commits