OpenCloudOS-Kernel/net
Michal Soltys 678a6241c6 net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug
This patch simplifies how we update fsc and calculate vt from it - while
keeping the expected functionality identical with how hfsc behaves
curently. It also fixes a certain issue introduced with
a very old patch.

The idea is, that instead of correcting cl_vt before fsc curve update
(rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
cl_vt local to the current period - we can simply rely on virtual times
and curve values always being in sync - analogously to how rsc and usc
function, except that we use virtual time here.

Why hasn't it been done since the beginning this way ? The likely scenario
(basing on the code trying to correct curves whenever possible) was to
keep the virtual times as small as possible - as they have tendency to
"gallop" forward whenever their siblings and other fair sharing
subtrees are idling. On top of that, current code is subtly bugged, so
cumulative time (without any corrections) is always kept and used in
init_vf() when a new backlog period begins (using cl_cvtoff).

Is cumulative value safe ? Generally yes, though corner cases are easy
to create. For example consider:

1gbit interface
some 100kbit leaf, everything else idle

With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
it's virtual time, so in reality it's 10000 times more. ITOW 38 bits are
needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
logarithms rounded up). It's getting somewhat dangerous, but also
requires setup excusing this kind of values not mentioning permanently
backlogged class for a year. In near most extreme case (10gbit, 10kbit
leaf), we have "enough" to hold ~13.6 days in 64 bits.

Well, the issue remains mostly theoretical and cl_cvtoff has been
working fine for all those years. Sensible configuration are de-facto
immune to this issue, and not so sensible can solve it with a cronjob
and its period inversely proportional to the insanity of such setup =)

Now let's explain the subtle bug mentioned earlier.

The issue is related to how offsets are kept and how we calculate
virtual times and update fair service curve(s). The issue itself is
subtle, but easy to observe with long m1 segments. It was introduced in
rather old patch:

Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
in HFSC scheduler"

(available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)

Originally when a new backlog period was started, cl_vtoff of each
sibling was updated with cl_cvtmax from past period - naturally moving
all cl_vt to proper starting point. That patch adjusted it so cumulative
offset is kept in the parent, and there is no need for traversing the
list (as any subsequent child activation derives new vt from already
active sibling(s)).

But with this change, cl_vtoff (of each sibling) is no longer persistent
across the inactivity periods, as it's calculated from parent's
cl_cvtoff on a new backlog period, conflicting with the following curve
correction from the previous period:

if (cl->cl_virtual.x == vt) {
        cl->cl_virtual.x -= cl->cl_vtoff;
	cl->cl_vtoff = 0;
}

This essentially tries to keep curve as if it was local to the period
and resets cl_vtoff (cumulative vt offset of the class) to 0 when
possible (read: when we have an intersection or if a new curve is below
the old one). But then it's recalculated from cl_cvtoff on next active
period.  Then rtsc_min() call preceding the above if() doesn't really
do what we expect it to do in such scenario - as it calculates the
minimum of corrected curve (from the previous backlog period) and the
new uncorrected curve (with offset derived from cl_cvtoff).

Example:

tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit

start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
start A, let it run 10s
pause A briefly to force rtsc_min()

At this point we would expect A to continue at 20mbit after a brief
moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
the effect of first correcting A (during 'start A'), and then - after
unpausing - calculating rtsc_min() from old corrected and new uncorrected
curve.

The patch fixes this bug and keepis vt and fsc in sync (virtual times
are cumulative, not local to the backlog period).

Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-08 16:06:47 -07:00
..
6lowpan 6lowpan: ndisc: set invalid unicast short addr to unspec 2016-07-08 13:23:12 +02:00
9p remove lots of IS_ERR_VALUE abuses 2016-05-27 15:26:11 -07:00
802
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-07-24 00:53:32 -04:00
appletalk appletalk: fix erroneous return value 2016-02-18 14:59:34 -05:00
atm net: add dev arg to ndo_neigh_construct/destroy 2016-07-05 09:06:28 -07:00
ax25 AX.25: Close socket connection on session completion 2016-06-18 20:55:34 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-07-24 00:53:32 -04:00
bluetooth Bluetooth: Add debugfs fields for hardware and firmware info 2016-07-18 09:33:28 +03:00
bridge bridge: Fix incorrect re-injection of LLDP packets 2016-07-25 10:53:34 -07:00
caif caif: Remove unneeded header file 2016-06-28 05:26:14 -04:00
can can: only call can_stat_update with procfs 2016-06-23 11:23:49 +02:00
ceph libceph: fsmap.user subscription support 2016-07-28 03:00:40 +02:00
core neigh: allow admin to set NUD_STALE 2016-08-08 15:36:38 -07:00
dcb
dccp Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2016-07-29 17:38:46 -07:00
decnet net: fix decnet rtnexthop parsing 2016-07-05 14:08:47 -07:00
dns_resolver KEYS: Add a facility to restrict new links into a keyring 2016-04-11 22:37:37 +01:00
dsa net: dsa: support switchdev ageing time attr 2016-07-19 19:42:01 -07:00
ethernet eth: Pull header from first fragment via eth_get_headlen 2016-02-24 13:58:05 -05:00
hsr net/hsr: Use setup_timer and mod_timer. 2016-05-16 14:00:43 -04:00
ieee802154 ieee802154: 6lowpan: fix intra pan id check 2016-07-08 13:23:12 +02:00
ipv4 net/multicast: should not send source list records when have filter mode change 2016-08-08 16:04:39 -07:00
ipv6 net/multicast: should not send source list records when have filter mode change 2016-08-08 16:04:39 -07:00
ipx
irda Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-07-27 12:03:20 -07:00
iucv Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2016-07-29 17:38:46 -07:00
kcm kcm: remove redundant -ve error check and return path 2016-07-25 11:17:16 -07:00
key
l2tp l2tp: Correctly return -EBADF from pppol2tp_getname. 2016-07-26 15:19:46 -07:00
l3mdev net: vrf: Implement get_saddr for IPv6 2016-06-17 21:25:29 -07:00
lapb net/lapb: tuse %*ph to dump buffers 2016-05-29 22:33:25 -07:00
llc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
mac80211 mac80211: Add ieee80211_hw pointer to get_expected_throughput 2016-08-05 14:23:25 +02:00
mac802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-03-19 10:05:34 -07:00
mpls mpls: allow routes on ipip and sit devices 2016-07-09 17:45:56 -04:00
ncsi net/ncsi: avoid maybe-uninitialized warning 2016-07-25 10:32:59 -07:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-07-27 12:03:20 -07:00
netlabel netlabel: Implement CALIPSO config functions for SMACK. 2016-06-27 15:06:18 -04:00
netlink net/netlink/af_netlink.h: Remove unused structure. 2016-06-09 22:26:24 -07:00
netrom
nfc NFC: digital: Fix RTOX supervisor PDU handling 2016-07-11 02:02:03 +02:00
openvswitch OVS: Ignore negative headroom value 2016-08-06 00:06:11 -04:00
packet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-07-24 00:53:32 -04:00
phonet sock: struct proto hash function may error 2016-02-11 03:54:14 -05:00
qrtr Merge tag 'qcom-soc-for-4.7-2' into net-next 2016-05-17 14:11:19 -04:00
rds RDS: TCP: Enable multipath RDS for TCP 2016-07-15 11:36:58 -07:00
rfkill rfkill: Use switch to demux userspace operations 2016-04-05 10:48:53 +02:00
rose rose: limit sk_filter trim to payload 2016-07-13 11:53:40 -07:00
rxrpc rxrpc: Fix races between skb free, ACK generation and replying 2016-08-06 00:08:40 -04:00
sched net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug 2016-08-08 16:06:47 -07:00
sctp sctp: use event->chunk when it's valid 2016-08-08 14:31:23 -07:00
sunrpc NFS client updates for Linux 4.8 2016-07-30 16:33:25 -07:00
switchdev net/switchdev: Export the same parent ID service function 2016-07-14 13:34:29 -07:00
tipc tipc: fix imbalance read_unlock_bh in __tipc_nl_add_monitor() 2016-07-30 20:38:22 -07:00
unix af_unix: charge buffers to kmemcg 2016-07-26 16:19:19 -07:00
vmw_vsock vsock: make listener child lock ordering explicit 2016-06-27 10:44:46 -04:00
wimax
wireless nl80211: correct checks for NL80211_MESHCONF_HT_OPMODE value 2016-08-05 14:14:54 +02:00
x25 net: fix a kernel infoleak in x25 module 2016-05-09 22:45:33 -04:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
Kconfig net/ncsi: Resource management 2016-07-19 20:49:16 -07:00
Makefile net/ncsi: Resource management 2016-07-19 20:49:16 -07:00
compat.c packet: compat support for sock_fprog 2016-06-09 23:41:03 -07:00
socket.c fs: poll/select/recvmmsg: use timespec64 for timeout events 2016-05-19 19:12:14 -07:00
sysctl_net.c net: Use ns_capable_noaudit() when determining net sysctl permissions 2016-06-06 20:16:22 +10:00