Bug 276890 - Getting fq_codel correct on inbound shaping
Summary: Getting fq_codel correct on inbound shaping
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Many People
Assignee: freebsd-net (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-08 11:14 UTC by Dave Taht
Modified: 2024-04-05 01:28 UTC (History)
4 users (show)

See Also:


Attachments

Description Dave Taht 2024-02-08 11:14:12 UTC
Over the years, particularly in the OPNsense universe, there has been much angst and hair-pulling, with a variety of claims that "fq_codel" does not work properly for inbound shaping, while others who moved to the Linux version on the same hardware said it worked much better.

To my eye, the paper and code looked pretty close to correct, but they were indeed very different from the Linux version, *and* the paper did not test inbound shaping: https://www.researchgate.net/publication/322784709_Dummynet_AQM_v02_--_CoDel_FQ-CoDel_PIE_and_FQ-PIE_for_FreeBSD%27s_ipfwdummynet_framework


So anyway, a reddit thread came up from which I concluded there is something weird about how the FreeBSD inbound path actually functions in some cases:

https://www.reddit.com/r/opnsense/comments/18z68ec/anyone_who_hasnt_configured_codel_limiters_do_it/

I am not a FreeBSD developer, and the effort required for me to "get in there" and test (or worse, patch) is a bit much (I live on a boat nowadays). What I would like is to find someone who can run three simple tests on my behalf and send me the packet captures, so I can verify correctness. Ideally, driving more complex tests through flent to my cloud would be great, but I would settle for the following test at some RTT (ideally 10-40ms, not LOCAL[1], ideally over a GPON fiber network to the cloud):

Set up inbound shaping at 100Mbit, 300Mbit, and 800Mbit (on a link capable of those speeds, and on x86 or ARM hardware also capable of them).

Run iperf or netperf for one minute on a TCP download while capturing:

tcpdump -i the_interface -s 128 -w 100mbit.cap (etc.)

And send me the capture, please?  [2]

There are plenty of reasons outside of the fq_codel implementation why inbound shaping can have problems, like batchiness or timing, and people set their expectations for it too high (particularly against torrent); I have tried (with LibreQoS, etc.) to make egress shaping far more the default across the internet. But I should be able to verify basic correctness with just those three captures. [3]

Bonus points would be enabling RFC3168-style ECN across the test (which shows up really nicely in a packet capture) and capturing from the server rather than the client.

Thanks for any help you can offer. 

[1] Longer RTTs engage the AQM component of fq_codel more visibly
[2] Or plot yourself with wireshark and/or xplot.org
[3] and gain a reference plot for future work, if needed 

PS The OpenBSD version of fq_codel is *definitely* broken, though its codebase is entirely different. It caps the "count" (drop) variable at 400 when it should run freely. I am working on fixing that too.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2024-02-09 19:49:09 UTC
> I am not a freebsd developer, and the effort required for me to "get in there" and test (or worse, patch) is a bit much (I live on a boat nowadays).  What I would like is to find someone that can run 3 simple tests on my behalf, and give me the packet captures, so I can verify correctness.

I suspect you'll have more luck by asking for help on the FreeBSD-net mailing list.  That'll get you much more access to users who would be willing to run your test.
Comment 2 Dave Taht 2024-02-13 12:50:50 UTC
I got some packet captures from an OPNsense user, who, on my behalf, also turned LRO on and off. One of the test runs was decidedly odd in that it missed the target for far, far too long, which could be just about anything, including a path change or a TCP bug. I will plot it later and poke deeper into the packet captures.

http://london.starlink.taht.net/freebsd/ecn/

The ECN-enabled tests also showed a known issue with coping with slow start on short RTTs.
Comment 3 Dave Taht 2024-02-24 17:09:32 UTC
One of the issues with the FreeBSD implementation appears to be too much logging:

https://forum.opnsense.org/index.php?topic=39046.0
Comment 4 Dave Taht 2024-02-24 23:24:22 UTC
Assuming I am looking at a correct source tree:

http://fxr.watson.org/fxr/source/netpfil/ipfw/dn_sched_fq_codel.c#L330

Logging an overflow event twice, and to the console, while under stress, is not a good idea.

  if (mainq->ni.length > schk->cfg.limit) {
          D("over limit");
          /* find first active flow */
          for (maxidx = 0; maxidx < schk->cfg.flows_cnt; maxidx++)
                  if (si->flows[maxidx].active)
                          break;
          if (maxidx < schk->cfg.flows_cnt) {
                  /* find the largest sub-queue */
                  for (i = maxidx + 1; i < schk->cfg.flows_cnt; i++)
                          if (si->flows[i].active && si->flows[i].stats.length >
                              si->flows[maxidx].stats.length)
                                  maxidx = i;
                  codel_drop_head(&si->flows[maxidx], si);
                  D("maxidx = %d", maxidx);
                  drop = 1;
          }
  }

I would delete both Ds here. Even then, there are several things we ended up doing in the Linux version:

1) We added the drop_batch facility (and optimized it) to drop up to 64 packets at a time from the head - all but the last; it helps to always deliver at least the last packet in the queue. It was *very* expensive under a packet flood to hit this limit, search the flow list, and then drop only a single packet.

2) Merely dropping the packet, without also telling the AQM to drop harder on its own, led to persistently hitting this spot. So we also incremented the codel count variable on this drop, in the expectation that the AQM would eventually catch up.

3) It is possible to keep extra state around to always track the fattest queue (see fq_codel_fast) and eliminate this non-O(1) search, at the cost of tracking the fattest queue elsewhere. The expectation is that in normal use this routine is rarely hit.
Comment 5 Dave Taht 2024-02-25 01:18:05 UTC
I am a little dubious about the "active" idea.

http://fxr.watson.org/fxr/source/netpfil/ipfw/dn_sched_fq_codel.c#L413

Also.

  if (!si->flows[idx].active) {
          STAILQ_INSERT_TAIL(&si->newflows, &si->flows[idx], flowchain);
          si->flows[idx].deficit = param->quantum;
          si->flows[idx].cst.dropping = false;
          si->flows[idx].cst.first_above_time = 0;
          si->flows[idx].active = 1;
          //D("activate %d",idx);
  }

In the Linux version we do not touch the codel state variables at this point at all, but retain the previous settings. It may be that dropping and first_above_time get set in roughly the same way in my version, but perhaps if I describe the intent of what should happen, I too will understand this code better.

The idea of a flow going out of an "active" state is not that any of its state needs to be reset. The overall target of fq_codel is to reduce the total delay in all the queues to the target (usually 5ms). It maintains a cache of the last "good" drop rate.

If a single flow, out of dozens, has an arrival rate like this:

A                      A                    A                 A

And that is still too much relative to the other flows, it needs to get more drops. 

Anyway, instead of checking for or maintaining an active or inactive "state", we just check whether the queue length > 0.

Just saving my state here on this subtlety. Also, we use the global queue length, not the per-queue length, to turn off the global dropper; I have to go looking through this code to see whether it does the same.