Bug 239112 - [LACP] Latency problem when reconfiguring LACP links
Summary: [LACP] Latency problem when reconfiguring LACP links
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2019-07-10 13:25 UTC by Masse Nicolas
Modified: 2019-07-11 11:29 UTC

See Also:


Attachments
proposed fix (1.70 KB, patch)
2019-07-10 13:25 UTC, Masse Nicolas

Description Masse Nicolas 2019-07-10 13:25:59 UTC
Created attachment 205657
proposed fix

Recently, my company and I came across a problem with LACP:
Since we migrated from FreeBSD 10.3 to FreeBSD 11.2, the LACP links sometimes take several seconds (roughly 6 seconds) to get configured properly.
We observed this when re-creating the LACP links from scratch using the following commands:
ifconfig lagg0 inet 192.168.29.131/24 delete
ifconfig igb1 down
ifconfig lagg0 -laggport igb1
ifconfig lagg0 ether 00:00:00:00:00:00
ifconfig igb1 mtu 1500
ifconfig igb1 media autoselect
ifconfig lagg0 laggproto lacp
ifconfig lagg0 laggport igb1
ifconfig lagg0 mtu 1500
ifconfig lagg0 ether 0:d:b4:e:ba:e1
ifconfig lagg0 inet 192.168.29.131/24
ifconfig igb1 up
ifconfig lagg0 up

After some research, we found that the problem comes from the commit https://svnweb.freebsd.org/base?view=revision&revision=332834.
From what I understand, here is what happens:
- lacp_select is called. If we haven't seen our peer yet, it does nothing.
- Since it does nothing, the flag LACP_SELECTED isn't set. As a consequence, the timer LACP_TIMER_WAIT_WHILE isn't armed.
- Since this timer isn't armed, we have to wait for the timer LACP_TIMER_CURRENT_WHILE to fire instead, which adds extra latency we didn't observe before (up to 6 seconds).
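
To make the sequence above concrete, here is a minimal toy model in C. It is not the kernel code: the struct, the function names and the timer lengths (2 and 6 seconds) are hypothetical, and "the first timer that fires" stands in for the point where the link gets configured. It only illustrates why skipping the selection step leaves the shorter timer unarmed.

```c
#include <stdbool.h>

/* Hypothetical timer lengths in seconds, chosen only for illustration. */
enum { TOY_WAIT_WHILE = 2, TOY_CURRENT_WHILE = 6 };

struct toy_port {
    bool selected;      /* stand-in for LACP_SELECTED */
    int  wait_while;    /* seconds left; 0 means not armed */
    int  current_while; /* seconds left; 0 means not armed */
};

/* Toy selection step. With skip_if_peer_unseen set, it returns early
 * when the peer has not been seen, as described for r332834. */
void toy_select(struct toy_port *p, bool peer_seen, bool skip_if_peer_unseen)
{
    if (p->selected)
        return;
    if (!peer_seen && skip_if_peer_unseen)
        return;                      /* nothing selected, nothing armed */
    p->selected = true;
    p->wait_while = TOY_WAIT_WHILE;  /* selection arms the wait-while timer */
}

/* Returns the second at which the first armed timer fires,
 * or -1 if none fires within a minute. */
int toy_seconds_until_timer_fires(bool skip_if_peer_unseen)
{
    struct toy_port p = { false, 0, TOY_CURRENT_WHILE };

    /* First call happens before anything was received from the peer. */
    toy_select(&p, false, skip_if_peer_unseen);

    for (int t = 1; t <= 60; t++) {
        if (p.wait_while > 0 && --p.wait_while == 0)
            return t;
        if (p.current_while > 0 && --p.current_while == 0)
            return t;
    }
    return -1;
}
```

In this model, calling toy_seconds_until_timer_fires(false) returns the wait-while length, while toy_seconds_until_timer_fires(true) falls back to the longer current-while length.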

This extra latency is a problem for us since we have a lot of automated regression tests, and it makes them take twice as long to run as before, because we have to wait to be sure that the link is created properly.
So I looked into whether I could solve this, and came up with the following fix (see the attached patch):
- In lacp_select, if we haven't seen our peer yet, I still create the aggregator if it doesn't exist yet and set LACP_SELECTED, but I don't fill in the aggregator id.
- In the next call to lacp_select, I test whether the aggregator id is filled in by checking the LACP_STATE_AGGREGATION flag. If this flag isn't set, the aggregator id is filled in and the flag is set.
Since LACP_SELECTED is once again set in the first call to lacp_select, LACP_TIMER_WAIT_WHILE is armed and triggered as it was before revision 332834.
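
The two-step idea can be sketched as follows. This is only an illustration of the logic described above, not the attached patch: the struct, its fields, toy_select_patched and the peer_id parameter are all hypothetical, and the assumption that the id is filled in only once the peer has been seen is mine.

```c
#include <stdbool.h>

/* Hypothetical, simplified port state; not the kernel's struct lacp_port. */
struct toy_port {
    bool has_aggregator;   /* an aggregator has been created for the port */
    bool selected;         /* stand-in for LACP_SELECTED */
    bool aggregation_flag; /* stand-in for the LACP_STATE_AGGREGATION check */
    int  aggregator_id;    /* 0 means "not filled in yet" */
};

/* Toy version of the patched selection step. */
void toy_select_patched(struct toy_port *p, bool peer_seen, int peer_id)
{
    /* Step 1: always create the aggregator and mark the port selected,
     * so that the caller can arm the wait-while timer right away. */
    if (!p->has_aggregator) {
        p->has_aggregator = true;
        p->selected = true;
    }

    /* Assumption: without peer information the id cannot be filled in. */
    if (!peer_seen)
        return;

    /* Step 2: on a later call, fill in the id exactly once. */
    if (!p->aggregation_flag) {
        p->aggregator_id = peer_id;
        p->aggregation_flag = true;
    }
}
```

Starting from a zero-initialized toy_port, a first call with peer_seen set to false leaves the port selected with an empty id, and a later call with peer_seen set to true fills the id in once.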

Early testing of this patch in our environment shows that the extra latency is gone and things seem to work properly.