OpenCloudOS-Kernel/net
Eric Dumazet 39e6c8208d net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is
available.
                                          device hard irq

                                          // does nothing because
NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 09:50:58 -08:00
..
6lowpan 6lowpan: use rb_entry() 2017-01-22 16:46:13 -05:00
9p IB/core: add support to create a unsafe global rkey to ib_create_pd 2016-09-23 13:47:44 -04:00
802 Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
8021q net: remove ndo_neigh_{construct, destroy} from stacked devices 2017-02-06 11:25:57 -05:00
appletalk lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
atm lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
ax25 ax25: Fix segfault after sock connection timeout 2017-01-16 14:39:58 -05:00
batman-adv Here are two fixes for batman-adv for net-next: 2017-01-29 19:21:26 -05:00
bluetooth scripts/spelling.txt: add "an user" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
bridge lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
caif net: caif: Remove unused stats member from struct chnl_net 2017-01-19 11:45:21 -05:00
can can: bcm: fix hrtimer/tasklet termination in bcm op removal 2017-01-30 11:05:04 +01:00
ceph This time around we have: 2017-02-28 15:36:09 -08:00
core net: solve a NAPI race 2017-03-01 09:50:58 -08:00
dcb net: dcb: set error code on failures 2016-12-03 23:54:25 -05:00
dccp net/dccp: fix use after free in tw_timer_handler() 2017-02-22 16:14:07 -05:00
decnet Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
dns_resolver
dsa Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
ethernet Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2017-02-16 21:25:49 -05:00
hsr net/hsr: use eth_hw_addr_random() 2017-02-21 13:25:22 -05:00
ieee802154 lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
ife net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-28 10:00:39 -08:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-28 10:00:39 -08:00
ipx ktime: Get rid of the union 2016-12-25 17:21:22 +01:00
irda lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
iucv net/af_iucv: don't use paged skbs for TX on HiperSockets 2017-01-10 21:08:29 -05:00
kcm kcm: fix a null pointer dereference in kcm_sendmsg() 2017-02-14 13:06:37 -05:00
key netns: make struct pernet_operations::id unsigned int 2016-11-18 10:59:15 -05:00
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-28 10:00:39 -08:00
l3mdev
lapb Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
llc net/llc: avoid BUG_ON() in skb_orphan() 2017-02-12 22:14:49 -05:00
mac80211 Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax 2017-02-28 20:29:41 -08:00
mac802154 ktime: Cleanup ktime_set() usage 2016-12-25 17:21:22 +01:00
mpls net: mpls: Add support for netconf 2017-02-20 11:13:37 -05:00
ncsi net/ncsi: Improve HNCDSC AEN handler 2016-10-20 11:23:08 -04:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-28 10:00:39 -08:00
netlabel netlabel: add CALIPSO to the list of built-in protocols 2017-01-06 22:20:45 -05:00
netlink net: adjust skb->truesize in pskb_expand_head() 2017-01-27 12:03:29 -05:00
netrom
nfc genetlink: mark families as __ro_after_init 2016-10-27 16:16:09 -04:00
openvswitch openvswitch: Set event bit after initializing labels. 2017-02-20 10:18:43 -05:00
packet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-19 11:18:46 -05:00
phonet netns: make struct pernet_operations::id unsigned int 2016-11-18 10:59:15 -05:00
psample net: Introduce psample, a new genetlink channel for packet sampling 2017-01-24 13:44:28 -05:00
qrtr net: qrtr: Mark 'buf' as little endian 2017-01-10 20:45:04 -05:00
rds rds: ib: add the static type to the variables 2017-03-01 09:50:58 -08:00
rfkill rfkill: remove rfkill-regulator 2017-01-24 11:07:35 +01:00
rose Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
rxrpc rxrpc: Fix deadlock between call creation and sendmsg/recvmsg 2017-03-01 09:50:58 -08:00
sched net sched actions: do not overwrite status of action creation. 2017-02-26 21:31:32 -05:00
sctp sctp: call rcu_read_lock before checking for duplicate transport nodes 2017-03-01 09:50:58 -08:00
smc smc: some potential use after free bugs 2017-01-30 16:37:55 -05:00
strparser strparser: Propagate correct error code in strp_recv() 2016-10-12 01:51:49 -04:00
sunrpc The nfsd update this round is mainly a lot of miscellaneous cleanups and 2017-02-28 15:39:09 -08:00
switchdev Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-10-30 12:42:58 -04:00
tipc tipc: move premature initilalization of stack variables 2017-02-24 11:42:54 -05:00
unix unix: add ioctl to open a unix socket file with O_PATH 2017-02-02 21:58:02 -05:00
vmw_vsock Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-17 20:17:04 -08:00
wimax genetlink: mark families as __ro_after_init 2016-10-27 16:16:09 -04:00
wireless Some more updates: 2017-02-10 14:31:51 -05:00
x25 Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
xfrm xfrm: provide correct dst in xfrm_neigh_lookup 2017-02-26 21:35:24 -05:00
Kconfig bpf: make jited programs visible in traces 2017-02-17 13:40:05 -05:00
Makefile net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
compat.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-02-22 10:15:09 -08:00
socket.c net: socket: fix recvmmsg not returning error from sock_error 2017-02-21 13:35:25 -05:00
sysctl_net.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2016-10-06 09:52:23 -07:00