OpenCloudOS-Kernel/net/ipv4
Nikolay Aleksandrov 24b9bf43e9 net: fix for a race condition in the inet frag code
I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).

What happens is that after a fragment has been added to the hash chain
but before it's been added to the lru_list (inet_frag_lru_add) in
inet_frag_intern, it may get deleted (either by an expired timer if
the system load is high or the timer sufficiently low, or by the
fraq_queue function for different reasons) before it's added to the
lru_list, then after it gets added it's a matter of time for the
evictor to get to a piece of memory which has been freed leading to a
number of different bugs depending on what's left there.

I've been able to trigger this on both IPv4 and IPv6 (which is normal
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences about how fragments
are treated.

The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
in a RR bond, so the same flow can be seen on multiple cards at the
same time. Then I used multiple instances of ping/ping6 to generate
fragmented packets and flood the machines with them while running
other processes to load the attacked machine.

*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes, if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.

An important point to make is that any caller (frag_queue or timer) of
inet_frag_kill will remove both the timer refcount and the
original/guarding refcount thus removing everything that's keeping the
frag from being freed at the next inet_frag_put.  All of this could
happen before the frag was ever added to the LRU list, then it gets
added and the evictor uses a freed fragment.

An example for IPv6 would be if a fragment is being added and is at
the stage of being inserted in the hash after the hash lock is
released, but before inet_frag_lru_add executes (or is able to obtain
the lru lock) another overlapping fragment for the same flow arrives
at a different CPU which finds it in the hash, but since it's
overlapping it drops it invoking inet_frag_kill and thus removing all
guarding refcounts, and afterwards freeing it by invoking
inet_frag_put which removes the last refcount added previously by
inet_frag_find, then inet_frag_lru_add gets executed by
inet_frag_intern and we have a freed fragment in the lru_list.

The fix is simple, just move the lru_add under the hash chain locked
region so when a removing function is called it'll have to wait for
the fragment to be added to the lru_list, and then it'll remove it (it
works because the hash chain removal is done before the lru_list one
and there's no window between the two list adds when the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.

Fixes: 3ef0eb0db4 ("net: frag, move LRU list maintenance outside of
rwlock")

CC: Florian Westphal <fw@strlen.de>
CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:31:42 -05:00
..
netfilter netfilter: nf_nat_snmp_basic: fix duplicates in if/else branches 2014-02-14 11:37:36 +01:00
Kconfig net: neighbour: Remove CONFIG_ARPD 2013-09-03 21:41:43 -04:00
Makefile gre_offload: statically build GRE offloading support 2014-01-06 20:28:34 -05:00
af_inet.c ipv4: ipv6: better estimate tunnel header cut for correct ufo handling 2014-02-25 18:27:06 -05:00
ah4.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
arp.c ipv4: arp: update neighbour address when a gratuitous arp is received and arp_accept is set 2014-01-02 00:08:38 -05:00
cipso_ipv4.c ipv4: ERROR: code indent should use tabs where possible 2013-12-26 13:43:21 -05:00
datagram.c net: Remove FLOWI_FLAG_CAN_SLEEP 2013-12-06 07:24:39 +01:00
devinet.c ipv4: Fix runtime WARNING in rtmsg_ifa() 2014-02-06 20:02:15 -08:00
esp4.c net: esp{4,6}: get rid of struct esp_data 2013-10-29 06:39:42 +01:00
fib_frontend.c fib_frontend: fix possible NULL pointer dereference 2014-01-24 15:51:26 -08:00
fib_lookup.h ipv4: make fib_detect_death static 2013-12-28 17:01:46 -05:00
fib_rules.c inet: fix NULL pointer Oops in fib(6)_rule_suppress 2013-12-10 17:54:23 -05:00
fib_semantics.c ipv4: make fib_detect_death static 2013-12-28 17:01:46 -05:00
fib_trie.c seq_file: remove "%n" usage from seq_file users 2013-11-15 09:32:20 +09:00
gre_demux.c gre_offload: statically build GRE offloading support 2014-01-06 20:28:34 -05:00
gre_offload.c net/ipv4: don't use module_init in non-modular gre_offload 2014-01-16 16:08:27 -08:00
icmp.c ipv4: introduce hardened ip_no_pmtu_disc mode 2014-01-13 11:22:55 -08:00
igmp.c net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
inet_connection_sock.c net: replace macros net_random and net_srandom with direct calls to prandom 2014-01-14 15:15:25 -08:00
inet_diag.c inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets 2014-01-13 22:35:46 -08:00
inet_fragment.c net: fix for a race condition in the inet frag code 2014-03-05 20:31:42 -05:00
inet_hashtables.c inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once 2013-10-19 19:45:35 -04:00
inet_lro.c lro: remove dead code 2013-12-29 16:34:25 -05:00
inet_timewait_sock.c tcp/dccp: remove twchain 2013-10-08 23:19:24 -04:00
inetpeer.c ipv4: remove unused function 2013-12-28 17:03:20 -05:00
ip_forward.c net: ip, ipv6: handle gso skbs in forwarding path 2014-02-13 17:17:02 -05:00
ip_fragment.c net: Add utility functions to clear rxhash 2013-12-17 16:36:21 -05:00
ip_gre.c net: gre: use icmp_hdr() to get inner ip header 2014-01-27 20:38:26 -08:00
ip_input.c net: Fix memory leak if TPROXY used with TCP early demux 2014-01-27 16:22:11 -08:00
ip_options.c ipv4: switch and case should be at the same indent 2014-01-02 03:30:36 -05:00
ip_output.c netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n 2014-02-17 11:20:12 +01:00
ip_sockglue.c ipv6: make IPV6_RECVPKTINFO work for ipv4 datagrams 2014-01-19 19:53:18 -08:00
ip_tunnel.c sit: fix panic with route cache in ip tunnels 2014-02-20 13:13:50 -05:00
ip_tunnel_core.c ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer 2014-03-03 15:56:40 -05:00
ip_vti.c ipv4: be friend with drop monitor 2014-01-18 23:08:02 -08:00
ipcomp.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
ipconfig.c ipv4: ipconfig.c: add parentheses in an if statement 2014-02-14 00:14:23 -05:00
ipip.c ipv4: be friend with drop monitor 2014-01-18 23:08:02 -08:00
ipmr.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-18 00:55:41 -08:00
netfilter.c netfilter: add my copyright statements 2013-04-18 20:27:55 +02:00
ping.c ipv6: protect protocols not handling ipv4 from v4 connection/bind attempts 2014-01-21 16:59:19 -08:00
proc.c ipv4: spaces required around that '=' 2014-01-02 03:30:36 -05:00
protocol.c net: remove outdated comment for ipv4 and ipv6 protocol handler 2013-11-28 18:47:51 -05:00
raw.c net: add build-time checks for msg->msg_name size 2014-01-18 23:04:16 -08:00
route.c ipv4: fix counter in_slow_tot 2014-02-17 16:54:42 -05:00
syncookies.c ipv4: fix checkpatch error "space prohibited" 2013-12-26 13:43:21 -05:00
sysctl_net_ipv4.c ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing 2014-01-13 11:22:54 -08:00
tcp.c net-tcp: fastopen: fix high order allocations 2014-02-22 00:05:21 -05:00
tcp_bic.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_cong.c tcp: reduce the bloat caused by tcp_is_cwnd_limited() 2014-02-24 19:13:38 -05:00
tcp_cubic.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_diag.c inet_diag: Rename inet_diag_req into inet_diag_req_v2 2012-01-11 12:56:06 -08:00
tcp_fastopen.c tcp: enable sockets to use MSG_FASTOPEN by default 2013-11-04 19:57:47 -05:00
tcp_highspeed.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_htcp.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_hybla.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_illinois.c remove extra definitions of U32_MAX 2014-01-23 16:36:55 -08:00
tcp_input.c tcp: fix bogus RTT on special retransmission 2014-03-03 15:33:02 -05:00
tcp_ipv4.c tcp: delete redundant calls of tcp_mtup_init() 2014-01-21 16:52:31 -08:00
tcp_lp.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_memcontrol.c tcp_memcontrol: Cleanup/fix cg_proto->memory_pressure handling. 2013-12-05 21:01:01 -05:00
tcp_metrics.c tcp: metrics: Handle v6/v4-mapped sockets in tcp-metrics 2014-01-23 12:48:28 -08:00
tcp_minisocks.c ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT 2014-01-17 17:56:33 -08:00
tcp_offload.c tcp: do not export tcp_gso_segment() and tcp_gro_receive() 2014-01-14 18:53:48 -08:00
tcp_output.c tcp: fix bogus RTT on special retransmission 2014-03-03 15:33:02 -05:00
tcp_probe.c ipv4: ERROR: do not initialise globals to 0 or NULL 2013-12-26 13:43:21 -05:00
tcp_scalable.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_timer.c tcp: temporarily disable Fast Open on SYN timeout 2013-10-29 22:50:41 -04:00
tcp_vegas.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_vegas.h net: ipv4/ipv6: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
tcp_veno.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_westwood.c tcp: refactor F-RTO 2013-03-21 11:47:50 -04:00
tcp_yeah.c ipv4: ipv4: Cleanup the comments in tcp_yeah.c 2013-12-26 13:43:55 -05:00
tunnel4.c net: Convert printks to pr_<level> 2012-03-11 23:42:51 -07:00
udp.c net: add build-time checks for msg->msg_name size 2014-01-18 23:04:16 -08:00
udp_diag.c netlink: rename ssk to sk in struct netlink_skb_params 2013-04-19 14:57:56 -04:00
udp_impl.h net: ipv4/ipv6: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
udp_offload.c net/ipv4: Use proper RCU APIs for writer-side in udp_offload.c 2014-02-04 20:01:55 -08:00
udplite.c net: ipv4: Standardize prefixes for message logging 2012-03-12 17:05:21 -07:00
xfrm4_input.c net: Add skb_unclone() helper function. 2013-02-15 15:10:37 -05:00
xfrm4_mode_beet.c ipv4: ERROR: code indent should use tabs where possible 2013-12-26 13:43:21 -05:00
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2013-09-30 15:24:57 -04:00
xfrm4_output.c xfrm: revert ipv4 mtu determination to dst_mtu 2013-08-26 12:40:53 +02:00
xfrm4_policy.c xfrm: Fix null pointer dereference when decoding sessions 2013-11-01 07:08:46 +01:00
xfrm4_state.c inet: make no_pmtu_disc per namespace and kill ipv4_config 2013-12-18 16:58:20 -05:00
xfrm4_tunnel.c sit: add IPv4 over IPv4 support 2013-05-31 17:19:05 -07:00