OpenCloudOS-Kernel/drivers/net
Eric Dumazet 19757cebf0 tcp: switch orphan_count to bare per-cpu counters
Using a percpu_counter structure to track the count of orphaned
sockets causes problems on modern hosts with 256 or more cpus.

Stefan Bach reported serious spinlock contention in real workloads,
which I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), the default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().
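
A rough worst-case calculation illustrates why the cheap estimate can be far off on large hosts. This is only a sketch: it assumes the default percpu_counter batch is max(32, 2 * nr_cpus), and the helper names default_batch() and max_drift() are hypothetical, not kernel functions.

```c
#include <assert.h>

/* Assumption: default percpu_counter batch is max(32, 2 * nr_cpus). */
static long default_batch(long nr_cpus)
{
    long batch = 2 * nr_cpus;

    assert(nr_cpus > 0);
    return batch < 32 ? 32 : batch;
}

/*
 * Each cpu can hold a local delta of magnitude up to (batch - 1)
 * before it is folded into the shared count, so the cheap estimate
 * can be off by this much in the worst case.
 */
static long max_drift(long nr_cpus)
{
    return nr_cpus * (default_batch(nr_cpus) - 1);
}
```

Under that assumption, 256 cpus give a batch of 512 and a worst-case drift of 130816, roughly half of the default tcp_max_orphans (262144).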

One solution is to use plain per-cpu counters, and have
a timer periodically refresh a cached sum of them.

Updating this cache every 100ms seems about right; the tcp pressure
state does not change radically over shorter periods.
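
The idea of plain per-cpu counters with a periodically refreshed cache can be sketched in userspace C as follows. This is a simplified illustration, not the code from the patch: the array stands in for real per-cpu variables, NR_SLOTS and every function name here are hypothetical, and the 100ms timer is represented by an explicit call to orphan_cache_refresh().

```c
#include <assert.h>

#define NR_SLOTS 4  /* hypothetical stand-in for the number of cpus */

/* One slot per cpu; each cpu only ever touches its own slot. */
static int orphan_count[NR_SLOTS];

/* Last computed total; readers use this instead of summing. */
static int orphan_cache;

static void orphan_inc(int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    orphan_count[cpu]++;    /* no lock, no shared cache line */
}

static void orphan_dec(int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    orphan_count[cpu]--;    /* a single slot may go negative */
}

/* Expensive path: walk every slot. Only the timer would call this. */
static int orphan_count_sum(void)
{
    int i, total = 0;

    for (i = 0; i < NR_SLOTS; i++)
        total += orphan_count[i];
    return total > 0 ? total : 0;
}

/* Would run from a timer about every 100ms in the scheme described. */
static void orphan_cache_refresh(void)
{
    orphan_cache = orphan_count_sum();
}

/* Cheap path: compare the cached value against the limit. */
static int too_many_orphans(int shift, int max_orphans)
{
    return (orphan_cache << shift) > max_orphans;
}
```

The check may be stale by up to one refresh period, which is the trade-off accepted in exchange for never taking a lock on the hot path.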

percpu_counter was nice 15 years ago, when hosts had fewer
than 16 cpus; it no longer is by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:28:34 +01:00
..
appletalk net: remove single-byte netdev->dev_addr writes 2021-10-13 10:03:59 -07:00
arcnet net: remove single-byte netdev->dev_addr writes 2021-10-13 10:03:59 -07:00
bonding net: use dev_addr_set() 2021-10-09 11:55:01 +01:00
caif
can can: c_can: fix null-ptr-deref on ioctl() 2021-09-07 08:46:58 +02:00
dsa net: dsa: qca8k: move port config to dedicated struct 2021-10-15 11:06:38 +01:00
ethernet tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
fddi fddi: use eth_hw_addr_set() 2021-10-02 14:18:26 +01:00
fjes
hamradio hamradio: use dev_addr_set() for setting device address 2021-10-13 09:41:37 -07:00
hippi
hyperv hv_netvsc: Add comment of netvsc_xdp_xmit() 2021-10-14 19:17:57 -07:00
ieee802154
ipa asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
ipvlan net: use eth_hw_addr_set() instead of ether_addr_copy() 2021-10-02 14:18:25 +01:00
mctp
mdio net: mdio-ipq4019: Fix the error for an optional regs resource 2021-09-28 17:28:54 -07:00
netdevsim devlink: Delete reload enable/disable interface 2021-10-12 16:29:17 -07:00
pcs net: pcs: xpcs: fix incorrect steps on disable EEE 2021-10-06 11:18:27 +01:00
phy net: phy: dp83867: introduce critical chip default init for non-of platform 2021-10-14 19:10:47 -07:00
plip
ppp ppp: use the correct function to check if a netdev name is in use 2021-10-08 17:02:35 +01:00
slip
team net: use dev_addr_set() 2021-10-09 11:55:01 +01:00
usb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
vmxnet3 net: use dev_addr_set() 2021-10-09 11:55:01 +01:00
wan net: use dev_addr_set() 2021-10-09 11:55:01 +01:00
wireguard
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
wwan net: wwan: iosm: correct devlink extra params 2021-10-02 16:05:20 +01:00
xen-netback xen-netback: Remove redundant initialization of variable err 2021-10-15 11:02:36 +01:00
Kconfig
LICENSE.SRC
Makefile
Space.c
bareudp.c
dummy.c
eql.c
geneve.c
gtp.c gtp: use skb_dst_update_pmtu_no_confirm() instead of direct call 2021-10-06 15:19:37 +01:00
ifb.c
loopback.c
macsec.c net: use eth_hw_addr_set() instead of ether_addr_copy() 2021-10-02 14:18:25 +01:00
macvlan.c net: use eth_hw_addr_set() instead of ether_addr_copy() 2021-10-02 14:18:25 +01:00
macvtap.c
mdio.c
mhi_net.c drivers: net: mhi: fix error path in mhi_net_newlink 2021-09-24 14:25:05 +01:00
mii.c
net_failover.c net: use dev_addr_set() 2021-10-09 11:55:01 +01:00
netconsole.c
nlmon.c
ntb_netdev.c net: use dev_addr_set() 2021-10-09 11:55:01 +01:00
rionet.c
sb1000.c
sungem_phy.c
tap.c
thunderbolt.c
tun.c ethtool: extend coalesce setting uAPI with CQE mode 2021-08-24 07:38:29 -07:00
veth.c
virtio_net.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-10-14 16:50:14 -07:00
vrf.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-08-19 18:09:18 -07:00
vsockmon.c
vxlan.c nexthop: Fix memory leaks in nexthop notification chain listeners 2021-09-23 12:33:22 +01:00
xen-netfront.c xen/netfront: don't trust the backend response data blindly 2021-08-25 10:43:21 +01:00