Bug 227303 - TCP: huge cwnd does not slowly decay while app/rwnd limited, interacts badly with rwnd autosize
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-transport mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-05 18:09 UTC by Richard Scheffenegger
Modified: 2020-01-25 15:47 UTC (History)
2 users

See Also:


Attachments
sender rwnd constrained, client opened rwnd twice leading to excessive traffic burst (23.39 KB, image/png)
2018-04-05 18:09 UTC, Richard Scheffenegger
details of the 2nd burst, limited by the cwnd which grew in the background unchecked (28.46 KB, image/png)
2018-04-05 18:11 UTC, Richard Scheffenegger

Description Richard Scheffenegger 2018-04-05 18:09:53 UTC
Created attachment 192252 [details]
sender rwnd constrained, client opened rwnd twice leading to excessive traffic burst

While verifying the reason for huge SACK scoreboard overflow counters, I found that the TCP stack will burst massive amounts of data when a client (with receive window autosizing enabled) increases its receive window.

Suffice it to say that bursting ~300 kB twice, at 10 Gbps towards a 100 Mbps client, within 0.5 ms is a recipe for carnage.
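To put rough numbers on that claim, here is a back-of-the-envelope serialization-time helper (figures and function name are illustrative, not from the trace analysis tooling):

```c
#include <stdint.h>

/*
 * Time (in ms) needed to serialize a burst of the given size at a
 * given link rate.  ~300 kB leaves a 10 Gbps interface in ~0.24 ms,
 * but a 100 Mbps client link needs ~24 ms to drain it, so nearly the
 * whole burst must either queue somewhere in between or be dropped.
 */
static double
serialize_ms(double bytes, double bits_per_sec)
{
	return (bytes * 8.0 / bits_per_sec * 1000.0);
}
```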

While investigating, I found that this appears to be a zero-day issue.

Since "forever", cwnd would be increased by max(smss*smss/cwnd, 1) as long as the connection is in the congestion avoidance phase - that is, as long as no loss happened.
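As a sketch, the increase rule looks like this (an illustration of the formula above, not the actual FreeBSD cc(4) module code):

```c
#include <stdint.h>

/*
 * Classic congestion-avoidance increase: per ACK, cwnd grows by
 * max(smss*smss/cwnd, 1), i.e. roughly one SMSS per RTT.  Note that
 * nothing here bounds the growth by rwnd or by the amount of data
 * actually in flight.
 */
static uint32_t
ca_cwnd_increase(uint32_t cwnd, uint32_t smss)
{
	uint32_t incr = (uint32_t)(((uint64_t)smss * (uint64_t)smss) / cwnd);

	if (incr < 1)
		incr = 1;	/* grow by at least one byte per ACK */
	return (cwnd + incr);
}
```

Once cwnd exceeds smss*smss, the increment degenerates to one byte per ACK, but it never stops.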

However, this behavior is clearly not intended by the design of TCP, even though no RFC spells it out explicitly (with the possible exception of RFC 2861, section 2, paragraph 6, although that is not the normative text therein).

When cwnd cannot be validated - that is, when the sender is limited by either rwnd (easy to detect) or the application (see RFC 7661) - cwnd *must not* grow without bound.

This probably went undetected for so long because receive windows were traditionally static (or at least did not jump up by orders of magnitude when TCP flow control was actively used).

However, many clients now feature receive window autoscaling, starting off with a rather small rwnd and probing later in the connection whether increasing rwnd improves goodput - often after the sender has already been rwnd-bound for a significant number of RTTs (during which cwnd grew unchecked in the background).

This bug report is not about RFC 7661, but only about clamping the growth of cwnd to no more than rwnd (or perhaps rwnd + 2*smss, as Linux did in the past). This should prevent the observed traffic bursts when receive window autoscaling on a client suddenly increases the receive window after a phase in which the sender was rwnd-constrained.
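The proposed clamp could be sketched as follows (function and parameter names are hypothetical; this is not an existing FreeBSD interface):

```c
#include <stdint.h>

/*
 * Cap cwnd at the peer's receive window plus a small slack of 2*SMSS
 * (similar to what Linux reportedly did), so cwnd cannot grow far
 * beyond what the receiver has ever allowed the sender to use.
 */
static uint32_t
clamp_cwnd(uint32_t cwnd, uint32_t rwnd, uint32_t smss)
{
	uint32_t limit = rwnd + 2 * smss;

	return (cwnd < limit ? cwnd : limit);
}
```

Applied on each cwnd update, this bounds the burst after an rwnd jump to roughly the previous rwnd plus two segments.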

See the attached tcptrace graphs, showing two line-rate bursts of traffic (the first is arguably not preventable, except for its magnitude; the second leads to significant burst loss and impacts other traffic, not shown).
Comment 1 Richard Scheffenegger 2018-04-05 18:11:12 UTC
Created attachment 192253 [details]
details of the 2nd burst, limited by the cwnd which grew in the background unchecked
Comment 2 Richard Scheffenegger 2018-04-05 20:46:23 UTC
After further investigation, this issue is more complex. cwnd does in fact not grow while the transmission is rwnd-limited. However, 20 seconds prior to these two burst events, during slow start, the client already signaled a large rwnd (at least as large as when the bursts happen). No losses occurred, so cwnd probably grew to that large rwnd (or very close to it) during the initial slow start. Then the client reduced rwnd to 1/2 or 1/3 of that initial value, and cwnd never decays (RTT is 1 ms with empty buffers and 40 ms with full buffers; 20 seconds are eons in either case).

So at the time of these graphs, cwnd no longer carries any valid information about the state of the network, but is nevertheless used as if it did.

I guess RFC 7661 (substituting "no transmissions" with "transmission rate < cwnd"), with its many-RTT-long decay, would address this particular issue.

I can provide an xpl file (but not the original trace) of the entire session.
Comment 3 Richard Scheffenegger 2018-04-05 20:48:55 UTC
Immediately clamping cwnd down to rwnd is not a viable solution, as TCP flow control might be in active use by the client (dynamically adjusting rwnd within <10 RTTs, depending on the processing state of the received data). Thus a slow decay, as described in RFC 7661, is probably the correct course of action.
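A slow decay in that spirit might look like the following sketch (a loose illustration under assumed names; RFC 7661 defines the precise state machine and timing):

```c
#include <stdint.h>

/*
 * While the sender remains rate-limited (bytes in flight well below
 * cwnd), move cwnd halfway toward the amount actually in flight once
 * per decay interval (several RTTs), never dropping below a floor
 * such as the initial window.  Over a few intervals this converges on
 * what the connection actually uses, instead of clamping at once.
 */
static uint32_t
cwnd_decay_step(uint32_t cwnd, uint32_t pipe, uint32_t floor)
{
	uint32_t next;

	if (pipe >= cwnd)
		return (cwnd);	/* cwnd fully used: nothing to decay */
	next = pipe + (cwnd - pipe) / 2;
	return (next > floor ? next : floor);
}
```

If the client later raises rwnd again, cwnd restarts its probing from a value near the recent sending rate rather than from a stale 20-second-old maximum.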
Comment 4 Hiren Panchasara freebsd_committer 2018-04-05 21:34:28 UTC
IIRC, Netflix has new-cwv implemented in their not-yet-upstreamed codebase.
Comment 5 Richard Scheffenegger 2018-10-18 16:44:56 UTC
(In reply to Hiren Panchasara from comment #4)
Hi Hiren,

I just found that Lars submitted a patch including NewCWV some time ago:

https://lists.freebsd.org/pipermail/freebsd-net/2014-February/037771.html

The first patch is about PRR (the core mechanism looks good; the cache line alignment of the added variables might need some review). The simplistic way of calculating the delta bytes of the scoreboard, by scanning the entire scoreboard twice, could probably be made more efficient with proper accounting during scoreboard updates.

The second patch addresses this bug report.


Best regards,
   Richard
Comment 6 Bjoern A. Zeeb freebsd_committer 2018-10-18 16:59:20 UTC
I tried to Cc/assign this to the transport mailing list so more people interested in TCP would see it; Bugzilla did not let me. I have sent an email to the admins; it might be worth forwarding the PR number (or a link) to that list manually for now.
Comment 7 Richard Scheffenegger 2020-01-25 15:47:31 UTC
We now have a partial fix (to not grow cwnd excessively when app-limited).