linux-sg2042/net
Greg Banks 59a252ff8c knfsd: avoid overloading the CPU scheduler with enormous load averages
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
   to mount an NFS filesystem timeout at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:41 -04:00
..
9p 9p: fix endian issues [attempt 3] 2009-02-06 22:07:41 -08:00
802 net: fix tokenring license 2009-03-03 23:48:50 -08:00
8021q vlan: Fix vlan-in-vlan crashes. 2009-03-04 23:46:25 -08:00
appletalk appletalk: convert aarp to net_device_ops 2009-01-07 17:21:44 -08:00
atm netdevice: Kill netdev->priv 2008-12-08 01:14:16 -08:00
ax25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2008-12-28 12:49:40 -08:00
bluetooth bluetooth: driver API update 2009-01-07 17:23:17 -08:00
bridge bridge: Fix LRO crash with tun 2009-02-09 15:07:18 -08:00
can can: fix slowpath issue in hrtimer callback function 2009-01-14 21:06:55 -08:00
core vlan: Fix vlan-in-vlan crashes. 2009-03-04 23:46:25 -08:00
dcb DCB: fix kfree(skb) 2009-01-04 17:29:21 -08:00
dccp dccp ccid-3: Fix RFC reference 2009-01-11 00:17:22 -08:00
decnet net: Make static 2008-12-10 15:18:31 -08:00
dsa dsa: convert to net_device_ops (v2) 2009-01-06 16:45:26 -08:00
econet netns: Use net_eq() to compare net-namespaces for optimization. 2008-07-19 22:34:43 -07:00
ethernet eth: Declare an optimized compare_ether_addr_64bits() function 2008-11-23 23:24:32 -08:00
ipv4 tcp: Like icmp use register_pernet_subsys 2009-03-03 01:14:21 -08:00
ipv6 ipv6: Fix BUG when disabled ipv6 module is unloaded 2009-03-11 09:22:51 -07:00
ipx net: '&' redux 2008-11-03 18:21:05 -08:00
irda tty: Fix an ircomm warning and note another bug 2009-01-02 10:19:43 -08:00
iucv s390: remove s390_root_dev_*() 2009-01-06 10:44:34 -08:00
key af_key: initialize xfrm encap_oa 2009-01-25 20:49:14 -08:00
lapb
llc net: remove redundant argument comments 2008-11-21 17:15:03 -08:00
mac80211 mac80211: restrict to AP in outgoing interface heuristic 2009-02-11 11:27:17 -05:00
netfilter netfilter: xt_recent: fix proc-file addition/removal of IPv4 addresses 2009-02-24 14:53:12 +01:00
netlabel netlabel: Update kernel configuration API 2008-12-31 12:54:11 -05:00
netlink netlink: invert error code in netlink_set_err() 2009-03-03 23:37:30 -08:00
netrom Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2008-12-28 12:49:40 -08:00
packet net: packet socket packet_lookup_frame fix 2009-02-01 01:53:29 -08:00
phonet Phonet: do not compute unused value 2009-02-10 17:14:50 -08:00
rfkill net/rfkill/rfkill.c: fix unused rfkill_led_trigger() warning 2009-01-04 17:11:24 -08:00
rose Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2008-12-28 12:49:40 -08:00
rxrpc RxRPC: Fix a potential NULL dereference 2009-02-06 21:50:52 -08:00
sched pkt_sched: act_police: Fix a rate estimator test. 2009-03-04 17:38:10 -08:00
sctp SCTP: change sctp_ctl_sock_init() to try IPv4 if IPv6 fails 2009-03-04 03:20:26 -08:00
sunrpc knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-03-18 17:38:41 -04:00
tipc net/tipc/bcast.h: use ARRAY_SIZE 2009-01-11 00:06:33 -08:00
unix introduce new LSM hooks where vfsmount is available. 2008-12-31 18:07:37 -05:00
wanrouter netdevice wanrouter: Convert directly reference of netdev->priv 2008-11-20 04:26:21 -08:00
wimax wimax: fix oops in wimax_dev_get_by_genl_info() when looking up non-wimax iface 2009-02-12 17:00:20 -08:00
wireless cfg80211: test before subtraction on unsigned 2009-03-06 15:54:32 -05:00
x25 net: '&' redux 2008-11-03 18:21:05 -08:00
xfrm xfrm: Fix xfrm_state_find() wrt. wildcard source address. 2009-03-13 14:22:40 -07:00
Kconfig net: Move config NET_NS to from net/Kconfig to init/Kconfig 2009-01-26 12:25:55 -08:00
Makefile wimax: Makefile, Kconfig and docbook linkage for the stack 2009-01-07 10:00:17 -08:00
TUNABLE
compat.c reintroduce accept4 2008-11-19 18:49:57 -08:00
nonet.c
socket.c [CVE-2009-0029] System call wrappers part 22 2009-01-14 14:15:27 +01:00
sysctl_net.c missing bits of net-namespace / sysctl 2008-07-27 09:45:34 -07:00