linux-sg2042/include/net
Ilpo Järvinen 832d11c5cd tcp: Try to restore large SKBs while SACK processing
During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
Then we're in problems when cleanup work for them has to be done
when a large cumulative ACK comes. Try to return back to pre-split
state already while more and more SACK info gets discovered by
combining newly discovered SACK areas with the previous skb if
that's SACKed as well.

This approach has a number of benefits:

1) The processing overhead is spread more equally over the RTT
2) Write queue has less skbs to process (affect everything
   which has to walk in the queue past the sacked areas)
3) Write queue is consistent whole the time, so no other parts
   of TCP has to be aware of this (this was not the case with
   some other approach that was, well, quite intrusive all
   around).
4) Clean_rtx_queue can release most of the pages using single
   put_page instead of previous PAGE_SIZE/mss+1 calls

In case a hole is fully filled by the new SACK block, we attempt
to combine the next skb too which allows construction of skbs
that are even larger than what tso split them to and it handles
hole per on every nth patterns that often occur during slow start
overshoot pretty nicely. Though this to be really useful also
a retransmission would have to get lost since cumulative ACKs
advance one hole at a time in the most typical case.

TODO: handle upwards only merging. That should be rather easy
when segment is fully sacked but I'm leaving that as future
work item (it won't make very large difference anyway since
this current approach already covers quite a lot of normal
cases).

I was earlier thinking of some sophisticated way of tracking
timestamps of the first and the last segment but later on
realized that it won't be that necessary at all to store the
timestamp of the last segment. The cases that can occur are
basically either:
  1) ambiguous => no sensible measurement can be taken anyway
  2) non-ambiguous is due to reordering => having the timestamp
     of the last segment there is just skewing things more off
     than does some good since the ack got triggered by one of
     the holes (besides some substle issues that would make
     determining right hole/skb even harder problem). Anyway,
     it has nothing to do with this change then.

I choose to route some abnormal looking cases with goto noop,
some could be handled differently (eg., by stopping the
walking at that skb but again). In general, they either
shouldn't happen at all or are rare enough to make no difference
in practice.

In theory this change (as whole) could cause some macroscale
regression (global) because of cache misses that are taken over
the round-trip time but it gets very likely better because of much
less (local) cache misses per other write queue walkers and the
big recovery clearing cumulative ack.

Worth to note that these benefits would be very easy to get also
without TSO/GSO being on as long as the data is in pages so that
we can merge them. Currently I won't let that happen because
DSACK splitting at fragment that would mess up pcounts due to
sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
avoided, we have some conditions that can be made less strict.

TODO: I will probably have to convert the excessive pointer
passing to struct sacktag_state... :-)

My testing revealed that considerable amount of skbs couldn't
be shifted because they were cloned (most likely still awaiting
tx reclaim)...

[The rest is considering future work instead since I got
repeatably EFAULT to tcpdump's recvfrom when I added
pskb_expand_head to deal with clones, so I separated that
into another, later patch]

...To counter that, I gave up on the fifth advantage:

5) When growing previous SACK block, less allocs for new skbs
   are done, basically a new alloc is needed only when new hole
   is detected and when the previous skb runs out of frags space

...which now only happens of if reclaim is fast enough to dispose
the clone before the SACK block comes in (the window is RTT long),
otherwise we'll have to alloc some.

With clones being handled I got these numbers (will be somewhat
worse without that), taken with fine-grained mibs:

                  TCPSackShifted 398
                   TCPSackMerged 877
            TCPSackShiftFallback 320
      TCPSACKCOLLAPSEFALLBACKGSO 0
  TCPSACKCOLLAPSEFALLBACKSKBBITS 0
  TCPSACKCOLLAPSEFALLBACKSKBDATA 0
    TCPSACKCOLLAPSEFALLBACKBELOW 0
    TCPSACKCOLLAPSEFALLBACKFIRST 1
 TCPSACKCOLLAPSEFALLBACKPREVBITS 318
      TCPSACKCOLLAPSEFALLBACKMSS 1
   TCPSACKCOLLAPSEFALLBACKNOHEAD 0
    TCPSACKCOLLAPSEFALLBACKSHIFT 0
          TCPSACKCOLLAPSENOOPSEQ 0
  TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
     TCPSACKCOLLAPSENOOPSMALLLEN 0
             TCPSACKCOLLAPSEHOLE 12

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:20:15 -08:00
..
9p 9p: fix sparse warnings 2008-10-22 18:54:47 -05:00
bluetooth include: replace __FUNCTION__ with __func__ 2008-10-16 11:21:30 -07:00
irda include: replace __FUNCTION__ with __func__ 2008-10-16 11:21:30 -07:00
iucv [AF_IUCV]: postpone receival of iucv-packets 2007-10-10 16:54:51 -07:00
netfilter misc: replace NIPQUAD() 2008-10-31 00:56:49 -07:00
netns net: implement emergency route cache rebulds when gc_elasticity is exceeded 2008-10-27 17:06:14 -07:00
phonet Phonet: include generic link-layer header size in MAX_PHONET_HEADER 2008-10-26 23:06:31 -07:00
sctp misc: replace NIPQUAD() 2008-10-31 00:56:49 -07:00
tc_act pkt_action: add new action skbedit 2008-09-12 16:30:20 -07:00
tipc tipc: Remove unneeded parameter to tipc_createport_raw() 2008-07-14 22:42:19 -07:00
act_api.h [NET_SCHED]: act_api: use PTR_ERR in tcf_action_init/tcf_action_get 2008-01-28 15:11:17 -08:00
addrconf.h netns: Add network namespace argument to rt6_fill_node() and ipv6_dev_get_saddr() 2008-08-14 15:33:21 -07:00
af_rxrpc.h
af_unix.h net: unix: fix inflight counting bug in garbage collector 2008-11-09 11:17:33 -08:00
ah.h [IPSEC]: Get rid of ipv6_{auth,esp,comp}_hdr 2007-10-10 16:55:55 -07:00
arp.h [NETFILTER]: ebtables: remove casts, use consts 2008-01-31 19:27:33 -08:00
atmclip.h
ax25.h [AX25] ax25_ds_timer: use mod_timer instead of add_timer 2008-02-12 17:53:34 -08:00
ax88796.h ax88796: add 93cx6 eeprom support 2007-10-10 16:53:56 -07:00
cfg80211.h cfg80211: make use of reg macros on REG_RULE 2008-11-10 15:17:41 -05:00
checksum.h include/net net/ - csum_partial - remove unnecessary casts 2008-11-19 15:44:53 -08:00
cipso_ipv4.h cipso: Add support for native local labeling and fixup mapping names 2008-10-10 10:16:34 -04:00
compat.h net: Use standard structures for generic socket address structures. 2008-07-19 22:35:47 -07:00
datalink.h
dcbnl.h DCB: Add support for DCB BCN 2008-11-20 21:10:23 -08:00
dn.h [DECNET]: Another unnecessary net/tcp.h inclusion in net/dn.h 2007-07-10 23:02:12 -07:00
dn_dev.h
dn_fib.h
dn_neigh.h
dn_nsp.h
dn_route.h [NET]: Wrap netdevice hardware header creation. 2007-10-10 16:52:50 -07:00
dsa.h dsa: add support for Trailer tagging format 2008-10-08 17:24:16 -07:00
dsfield.h [NET]: Constify include/net/dsfield.h 2008-01-28 14:55:58 -08:00
dst.h net: make sure struct dst_entry refcount is aligned on 64 bytes 2008-11-16 19:46:36 -08:00
esp.h [IPSEC]: Use crypto_aead and authenc in ESP 2008-01-31 19:27:02 -08:00
fib_rules.h net: add fib_rules_ops to flush_cache method 2008-07-05 19:01:28 -07:00
flow.h ipv4: Loosen source address check on IPv4 output 2008-10-01 07:28:28 -07:00
garp.h vlan: Add GVRP support 2008-07-05 21:26:57 -07:00
gen_stats.h [NET_SCHED]: Convert packet schedulers from rtnetlink to new netlink API 2008-01-28 15:11:10 -08:00
genetlink.h netlink: Improve returned error codes 2008-06-03 16:36:54 -07:00
icmp.h mib: put icmpmsg statistics on struct net 2008-07-18 04:04:22 -07:00
ieee80211.h lib80211: absorb crypto bits from net/ieee80211 2008-11-21 11:08:17 -05:00
ieee80211_radiotap.h include: use get/put_unaligned_* helpers 2008-07-25 10:53:26 -07:00
if_inet6.h ipv6: make struct ipv6_devconf static 2008-07-22 14:21:58 -07:00
inet6_connection_sock.h
inet6_hashtables.h inet: Don't lookup the socket if there's a socket attached to the skb 2008-10-07 12:41:01 -07:00
inet_common.h [NETNS]: Inet control socket should not hold a namespace. 2008-04-03 14:28:30 -07:00
inet_connection_sock.h net: more #ifdef CONFIG_COMPAT 2008-08-28 02:53:51 -07:00
inet_ecn.h [IPV6]: Use appropriate sock tclass setting for routing lookup. 2008-04-13 23:40:51 -07:00
inet_frag.h [NET]: Rename inet_frag.h identifiers COMPLETE, FIRST_IN, LAST_IN to INET_FRAG_* 2008-03-28 16:35:27 -07:00
inet_hashtables.h net: Convert TCP/DCCP listening hash tables to use RCU 2008-11-23 17:22:55 -08:00
inet_sock.h tcp: Port redirection support for TCP 2008-10-01 07:46:49 -07:00
inet_timewait_sock.h net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls 2008-11-16 19:40:17 -08:00
inetpeer.h net: remove CVS keywords 2008-06-11 21:00:38 -07:00
ip.h net: avoid a pair of dst_hold()/dst_release() in ip_append_data() 2008-11-24 15:52:46 -08:00
ip6_checksum.h
ip6_fib.h [NETNS][IPV6] rt6_info - move rt6_info structure inside the namespace 2008-03-04 13:48:30 -08:00
ip6_route.h netns: Add network namespace argument to rt6_fill_node() and ipv6_dev_get_saddr() 2008-08-14 15:33:21 -07:00
ip6_tunnel.h net: remove CVS keywords 2008-06-11 21:00:38 -07:00
ip_fib.h [IPV4]: Fix compile error building without CONFIG_FS_PROC 2008-02-05 02:54:16 -08:00
ip_vs.h include/net net/ - csum_partial - remove unnecessary casts 2008-11-19 15:44:53 -08:00
ipcomp.h ipsec: ipcomp - Merge IPComp implementations 2008-07-25 02:54:40 -07:00
ipconfig.h net: remove CVS keywords 2008-06-11 21:00:38 -07:00
ipip.h inet: Make tunnel RX/TX byte counters more consistent 2008-10-09 12:03:17 -07:00
ipv6.h ipv6: making ip and icmp statistics per/namespace 2008-10-08 11:16:45 -07:00
ipx.h
iw_handler.h wext: Emit event stream entries correctly when compat. 2008-06-16 18:50:49 -07:00
lapb.h
lib80211.h wireless: missing include in lib80211.h 2008-11-21 11:42:55 -05:00
llc.h [LLC]: station source mac address 2008-03-28 16:28:36 -07:00
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h [NET]: Make socket creation namespace safe. 2007-10-10 16:49:07 -07:00
llc_if.h [LLC]: Kill static inline llc_addrany 2008-02-29 11:46:17 -08:00
llc_pdu.h [LLC]: skb allocation size for responses 2008-03-31 21:02:47 -07:00
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h [LLC]: skb allocation size for responses 2008-03-31 21:02:47 -07:00
mac80211.h mac80211: add explicit padding in struct ieee80211_tx_info 2008-11-21 11:08:18 -05:00
mip6.h [IPV6] MIP6: Use our standard definitions for paddings. 2008-04-12 13:43:22 +09:00
ndisc.h bonding: send IPv6 neighbor advertisement on failover 2008-11-06 00:49:37 -05:00
neighbour.h net: Cleanup of neighbour code 2008-11-12 00:54:54 -08:00
net_namespace.h net: Introduce read_pnet() and write_pnet() helpers 2008-11-12 00:53:30 -08:00
netdma.h
netevent.h [NET]: Remove unnecessary inclusion of dst.h 2008-01-28 14:53:38 -08:00
netlabel.h netlabel: Add configuration support for local labeling 2008-10-10 10:16:34 -04:00
netlink.h netlink: constify struct nlattr * arg to parsing functions 2008-10-28 11:59:11 -07:00
netrom.h
nexthop.h
p8022.h
pkt_cls.h ematch: simpler tcf_em_unregister() 2008-11-16 23:01:49 -08:00
pkt_sched.h pkt_sched: Remove the tx queue state check in qdisc_run() 2008-09-23 01:05:56 -07:00
protocol.h [NETNS]: Drop packets in the non-initial namespace on the per/protocol basis. 2008-03-24 15:33:00 -07:00
psnap.h
raw.h [RAW]: Add raw_hashinfo member on struct proto. 2008-03-22 16:56:51 -07:00
rawv6.h [IPv6] RAW: Compact the API for the kernel 2008-01-28 14:54:29 -08:00
red.h
request_sock.h tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup 2008-08-06 23:50:04 -07:00
rose.h rose: improving AX25 routing frames via ROSE network 2008-06-17 17:08:32 -07:00
route.h ipv4: Conditionally enable transparent flow flag when connecting 2008-10-01 07:35:39 -07:00
rtnetlink.h [RTNL]: Introduce the rtnl_kill_links helper. 2008-04-16 00:46:52 -07:00
sch_generic.h pkt_sched: Remove qdisc->ops->requeue() etc. 2008-11-13 22:56:30 -08:00
scm.h net: Fix recursive descent in __scm_destroy(). 2008-11-06 13:51:50 -08:00
slhc_vj.h
snmp.h net: remove CVS keywords 2008-06-11 21:00:38 -07:00
sock.h Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2008-11-18 23:38:23 -08:00
stp.h net: Add STP demux layer 2008-07-05 21:25:39 -07:00
tcp.h tcp: Try to restore large SKBs while SACK processing 2008-11-24 21:20:15 -08:00
tcp_states.h
timewait_sock.h
transp_v6.h net: change proto destroy method to return void 2008-06-14 17:04:49 -07:00
udp.h udp: Use hlist_nulls in UDP RCU code 2008-11-16 19:39:21 -08:00
udplite.h udp: introduce struct udp_table and multiple spinlocks 2008-10-29 01:41:45 -07:00
wext.h wext: Dispatch and handle compat ioctls entirely in net/wireless/wext.c 2008-06-16 18:32:46 -07:00
wireless.h net: struct device - replace bus_id with dev_name(), dev_set_name() 2008-11-10 13:55:14 -08:00
x25.h
x25device.h
xfrm.h xfrm: remove unused struct xfrm_policy::next 2008-10-31 00:42:25 -07:00