Created attachment 192252 [details]
sender rwnd constrained, client opened rwnd twice leading to excessive traffic burst
While verifying the reason for huge SACK scoreboard overflow counters, I found that the TCP stack will burst massive amounts of data when a client (with receive window autosizing enabled) increases its receive window.
Suffice it to say that bursting ~300 kB at 10G towards a 100 Mbps client in 0.5 ms, twice, is a recipe for carnage.
While investigating, I found that this appears to be a zero-day issue.
Since "forever", cwnd has been increased by max(smss*smss/cwnd, 1) for as long as the connection is in the congestion avoidance phase, that is, as long as no loss happened.
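For reference, a minimal sketch of the per-ACK increment described above. ca_increment() is a hypothetical helper written for illustration, not a function from the stack, and the zero-cwnd guard is an assumption added only to keep the example safe to run.

```c
#include <stdint.h>

/*
 * Hypothetical helper: per-ACK congestion avoidance increment,
 * max(smss*smss/cwnd, 1), all values in bytes.
 */
static uint32_t
ca_increment(uint32_t cwnd, uint32_t smss)
{
	uint64_t incr;

	/* Assumption: treat a zero cwnd as "grow by one segment". */
	if (cwnd == 0)
		return (smss);

	/* 64-bit intermediate so that smss * smss cannot overflow. */
	incr = (uint64_t)smss * smss / cwnd;

	/* Always grow by at least one byte per ACK. */
	return (incr > 0 ? (uint32_t)incr : 1);
}
```

With smss = 1460 and cwnd = 14600 this yields 146 bytes per ACK; once cwnd exceeds smss*smss the increment bottoms out at 1 byte, but nothing in this formula stops the growth.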
However, this behavior is clearly not intended by the design of TCP, even though no RFC clearly spells it out (with the exception of RFC2861, section 2, paragraph 6 perhaps, but that is not the normative text therein).
When cwnd cannot be verified, that is, when the sender is limited by either rwnd (easy to detect) or the application (see RFC7661), cwnd *must not* grow without bounds.
This probably passed undetected for so long, as receive windows traditionally were static (or at least didn't jump up by magnitudes, when TCP flow control is actively used).
However, many clients now feature receive window autoscale, starting off at rather small rwnd, and probing if increasing rwnd has a positive effect on goodput later in the connection - often when the sender was rwnd bound for a significant number of RTTs already (and cwnd grew unchecked in the background).
This bug report is not about RFC7661, but only about clamping the growth of cwnd to no more than rwnd (or perhaps rwnd+2*smss, like what Linux did in the past). This should prevent the observed traffic bursts when receive window autoscale on a client decides to suddenly increase the receive window after a phase when the sender was rwnd constrained.
See attached tcptrace graphs, with two line-rate bursts of traffic (the first one is arguably not preventable, except for its magnitude; the 2nd one leads to significant burst loss and impacts other traffic, not shown).
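A rough sketch of the proposed growth clamp, assuming all quantities are in bytes. clamp_cwnd_growth() and its parameters are hypothetical names for illustration only. Note that it only stops cwnd from growing past rwnd+2*smss; it deliberately never reduces a cwnd that is already above that limit.

```c
#include <stdint.h>

/*
 * Hypothetical helper: apply a growth increment to cwnd, but never let
 * the result grow beyond rwnd + 2*smss. A cwnd already above the limit
 * is returned unchanged (growth is clamped, cwnd is not pulled down).
 */
static uint32_t
clamp_cwnd_growth(uint32_t cwnd, uint32_t incr, uint32_t rwnd, uint32_t smss)
{
	uint64_t limit = (uint64_t)rwnd + 2 * (uint64_t)smss;
	uint64_t grown = (uint64_t)cwnd + incr;

	if (cwnd >= limit)
		return (cwnd);		/* already at or above the limit */
	if (grown > limit)
		return ((uint32_t)limit);	/* cap growth at the limit */
	return ((uint32_t)grown);
}
```

For example, with rwnd = 65535 and smss = 1460 the limit is 68455 bytes, so a cwnd of 68000 growing by 1460 ends up at 68455 rather than 69460.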
Created attachment 192253 [details]
details of the 2nd burst, limited by the cwnd which grew in the background unchecked
After further investigation, this issue is more complex. cwnd does in fact not grow when the transmission is rwnd limited. However, 20 sec prior to these two burst events, during slow start, the client already signaled a large rwnd (at least as large as when the bursts happen). No losses occurred, so cwnd probably grew to that large rwnd (or very close to it) during the initial slow start. Then the client reduced rwnd to 1/2 or 1/3 of that initial value, and cwnd never decayed (RTT is 1 ms with empty buffers, and 40 ms with full buffers; 20 sec are eons in any case).
So at the time of these graphs, cwnd no longer holds any valid information about the state of the network, but is nevertheless used as if it did.
I guess RFC7661 (replacing "no transmissions" with "transmission rate < cwnd"), with its many-RTT-long decay, would address this particular issue.
Can provide xpl file (but not original trace) of the entire session.
Immediately clamping cwnd down to rwnd is not a viable solution, as TCP flow control might be in actual use by the client (dynamically adjusting rwnd within <10 RTTs, depending on processing state of the received data). Thus a slow decay as mentioned in 7661 is probably the correct course....
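To illustrate what a slow decay could look like, here is a minimal sketch of a gradual multiplicative decay of an unvalidated cwnd. It is only loosely inspired by RFC7661 and is not the procedure from that RFC: the halving factor, the notion of a "period", and the floor value are all assumptions chosen for the example, and decay_cwnd() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Hypothetical helper: halve cwnd once per elapsed period in which it
 * was not validated, but never go below floor_wnd. A cwnd that is
 * already at or below the floor is left alone.
 */
static uint32_t
decay_cwnd(uint32_t cwnd, uint32_t floor_wnd, uint32_t periods)
{
	while (periods-- > 0 && cwnd > floor_wnd) {
		cwnd /= 2;
		if (cwnd < floor_wnd)
			cwnd = floor_wnd;
	}
	return (cwnd);
}
```

Because the reduction is spread over several periods, a client that opens and closes rwnd within a few RTTs would still find most of cwnd intact, while a cwnd left unused for a long time shrinks towards the floor.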
IIRC, Netflix has new-cwv implemented in their not-yet-upstreamed codebase.
(In reply to Hiren Panchasara from comment #4)
Just found that Lars submitted a patch including NewCWV some time ago:
The first patch is about PRR. The core mechanism looks good; cache line alignment of the added variables might need some review. The simplistic way to calculate the delta bytes of the scoreboard, by scanning the entire scoreboard twice, can probably be made more efficient by proper accounting during the scoreboard updates.
The second patch is about this bug report;
I tried to Cc/Assign to the transport mailing list so more people interested in TCP would see it; bugzilla did not let me; I sent an email to admins; might be worth forwarding the PR number (or a link) to that list manually for now.
We have a partial fix (to not grow cwnd excessively when app limited) now.
The partial fix is D21798 / rS355269. That takes care of "unbounded" growth of cwnd when the transmit side is application limited rather than cwnd limited. So cwnd only grows if it is actually restricting the transmission rate.
For cwnd to decay over time when a session becomes app-limited, and the app keeps sending small amounts of data so that after_idle is never entered, New Congestion Window Validation (RFC7661) will need to be implemented.
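The idea of growing cwnd only while it is the limiting factor can be sketched as follows. This is not the code from D21798; grow_if_cwnd_limited() is a hypothetical helper, and the test used here for "cwnd limited" (one more segment would not fit into cwnd) is an assumption made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: apply the increment only when the sender is
 * cwnd limited, approximated here as "bytes in flight plus one more
 * segment would exceed cwnd". Otherwise cwnd is returned unchanged.
 */
static uint32_t
grow_if_cwnd_limited(uint32_t cwnd, uint32_t incr, uint32_t in_flight,
    uint32_t smss)
{
	uint64_t needed = (uint64_t)in_flight + smss;
	uint64_t grown = (uint64_t)cwnd + incr;

	if (needed <= cwnd)
		return (cwnd);		/* app or rwnd limited: no growth */
	if (grown > UINT32_MAX)
		return (UINT32_MAX);	/* saturate instead of wrapping */
	return ((uint32_t)grown);
}
```

With cwnd = 100000 and only 20000 bytes in flight, the window stays at 100000; with 99000 bytes in flight it grows by the increment.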