Bug 185967 - [lagg] [patch] Link Aggregation LAGG: LACP not working in 10.0
Summary: [lagg] [patch] Link Aggregation LAGG: LACP not working in 10.0
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.0-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-net mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-01-21 15:20 UTC by Michael Rebele
Modified: 2018-05-28 19:49 UTC
CC List: 7 users

See Also:


Attachments
smime.p7s (4.68 KB, application/pkcs7-signature)
2014-01-24 08:21 UTC, info
Output of ifconfig -a (1.58 KB, text/plain)
2014-09-29 09:16 UTC, Jeroen van Heugten
LACP settings in /etc/rc.conf (134 bytes, text/plain)
2014-09-29 09:18 UTC, Jeroen van Heugten

Description Michael Rebele 2014-01-21 15:20:00 UTC
After upgrading from 9.2-RELEASE-p3, link aggregation with LACP no longer works.
The problem does not seem to depend on the network hardware used.
The machine has the following configuration:
a quad-port Broadcom card:
bge0: <Broadcom NetXtreme Gigabit Ethernet, ASIC rev. 0x5720000> mem 0xd90a0000-0xd90affff,0xd90b0000-0xd90bffff,0xd90c0000-0xd90cffff irq 34 at device 0.0 on pci2

and a dual-port Intel card:
igb0: <Intel(R) PRO/1000 Network Connection version - 2.4.0> mem 0xd4d00000-0xd4dfffff,0xd4ff8000-0xd4ffbfff irq 82 at device 0.0 on pci68

On 9.2 the lagg0 configuration was as follows (from /etc/rc.conf):
ifconfig_igb0="UP"
ifconfig_igb1="UP"
ifconfig_bge2="UP"
ifconfig_bge3="UP"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 laggport bge2 laggport bge3 10.50.1.154 netmask 255.255.0.0"
defaultrouter="10.50.0.1"
 ifconfig_lagg0_alias0="inet 10.50.1.155 netmask 255.255.0.0"
 ifconfig_lagg0_alias1="inet 10.50.1.165 netmask 255.255.0.0"

All of this works as expected under 9.2.

After upgrading to 10.0-RC5, the machine can no longer be reached over the network. The lagg0 interface comes up, but no network connection is possible. Moving to 10.0-RELEASE did not solve the problem.
I tried several configurations under 10.0 (with LACP as the protocol):
lagg interface with the bge2 and bge3 network interfaces: not working
lagg interface with the igb0 and igb1 network interfaces: not working

ifconfig shows the interface as up and active, and netstat -r shows the assigned IP addresses and the correct default route entry. Yet, for example, the gateway cannot be pinged (100% packet loss).

If I change the lagg protocol to failover, the lagg interface works. It seems to me that the problem is specific to LACP in 10.0.

Fix: 

Don't use laggproto lacp; use laggproto failover instead (which is not really the same thing).
How-To-Repeat: With the help of Google, I found this thread, where the problem has been discussed before:
http://forums.freebsd.org/viewtopic.php?f=7&t=43665
Comment 1 Brad Davis freebsd_committer 2014-01-23 16:30:45 UTC
Hi,

I have been using LACP interfaces on 10.0 just fine using the em driver
with a Cisco 3750X switch.

What brand/model switch are you using on the remote side?


Regards,
Brad Davis
Comment 2 info 2014-01-24 08:21:48 UTC
Hi,

I'm using the igb driver and it doesn't work either.

The switch is a Juniper EX3300-48T.

I tried switching to failover, but that doesn't seem to work either.

Best regards
Ben
Comment 3 Michael Rebele 2014-01-24 08:54:35 UTC
Hi,

On 23.01.14 17:30, Brad Davis wrote:
> 
> I have been using LACP interfaces on 10.0 just fine using the em driver
> with a Cisco 3750X switch.
> 
> What brand/model switch are you using on the remote side?
> 

I'm using an HP ProCurve 6600ml-24G (J9263A).

Kind regards,

Michael
Comment 4 info 2014-02-03 10:24:10 UTC
Scott Long helped to fix this issue for me:

"The difference between FreeBSD 9.x and 10 is that in 9.x, it ran in
“optimistic” mode, meaning that it didn't rely on getting receive
messages from the switch, and only took a channel down if the link state
went down. In strict mode, it looks for the receive messages and only
transitions to a full operational state if it gets them. So while I know
it's easy to point at the problem being FreeBSD 10, seeing as FreeBSD 9
worked for you, please check to make sure that your switch is set up
correctly."

After I set my Juniper switch to "active" LACP mode, it worked again:

"The LACP mode can be active or passive. If the actor and partner are
both in passive mode, they do not exchange LACP packets, which results
in the aggregated Ethernet links not coming up. If either the actor or
partner is active, they do exchange LACP packets. By default, LACP is
turned off on aggregated Ethernet interfaces. If LACP is configured, it
is in passive mode by default. To initiate transmission of LACP packets
and response to LACP packets, you must configure LACP in active mode."

If you have this issue, please check whether your switch has a similar setting.

Best regards
Ben
Comment 5 info 2014-02-03 15:13:21 UTC
Comment 6 Michael Rebele 2014-02-10 11:01:29 UTC
Scott Long provided the following patch for the issue here (thanks to Ben Niessen for opening the thread there):
http://www.opendevs.org/mhthu/kern-185967-link-aggregation-lagg-lacp-not-working-in-10.html

Index: ieee8023ad_lacp.c
===================================================================
--- ieee8023ad_lacp.c	(revision 261432)
+++ ieee8023ad_lacp.c	(working copy)
@@ -192,6 +192,11 @@
 SYSCTL_INT(_net_link_lagg_lacp, OID_AUTO, debug, CTLFLAG_RW | CTLFLAG_TUN,
     &lacp_debug, 0, "Enable LACP debug logging (1=debug, 2=trace)");
 TUNABLE_INT("net.link.lagg.lacp.debug", &lacp_debug);
+static int lacp_strict = 0;
+SYSCTL_INT(_net_link_lagg_lacp, OID_AUTO, lacp_strict_mode,
+    CTLFLAG_RW | CTLFLAG_TUN, &lacp_strict, 0,
+    "Enable LACP strict protocol compliance");
+TUNABLE_INT("net.link.lagg.lacp.lacp_strict_mode", &lacp_strict);

 #define LACP_DPRINTF(a) if (lacp_debug & 0x01) { lacp_dprintf a ; }
 #define LACP_TRACE(a) if (lacp_debug & 0x02) { lacp_dprintf(a,"%s\n",__func__); }
@@ -791,7 +796,7 @@

 	lsc->lsc_hashkey = arc4random();
 	lsc->lsc_active_aggregator = NULL;
-	lsc->lsc_strict_mode = 1;
+	lsc->lsc_strict_mode = lacp_strict;
 	LACP_LOCK_INIT(lsc);
 	TAILQ_INIT(&lsc->lsc_aggregators);
 	LIST_INIT(&lsc->lsc_ports);



I've applied this patch and can confirm that it restores the LACP behaviour of FreeBSD 10.0
to that of 9.2. Thus, with this patch, you can upgrade your OS without changing your
configuration, either on the switch or on the OS.
I've tested this across different types of interfaces (igb/bge) with up to four interfaces, and also
with configurations of two interfaces of the same type (bge0/1 and igb0/1).
You can connect FreeBSD to the switch without having to configure LACP on the switch side. This may
not strictly conform to the 802.3ad standard, but it works (just for the record: bonding mode
4 under Linux, which also claims 802.3ad conformance, behaves like this).

From my point of view, the patch should be applied and merged into the FreeBSD kernel ASAP, and should
become the default behaviour for the next release.

Kind regards

Michael
Comment 7 Mark Linimon freebsd_committer freebsd_triage 2014-04-20 04:22:00 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 8 olivier 2014-06-11 08:31:36 UTC
The patch didn't solve the lagg regression between 9.2 and 10-STABLE (r267244).

My configuration only uses one lagg member (the second will be added later):

ifconfig_em0="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport em0 SYNCDHCP"
ifconfig_lagg0_ipv6="inet6 accept_rtadv"

And it works great on 9.2:

[root@R1]~# uname -a
FreeBSD R1 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r255918M: Sat Oct 26 22:41:39 CEST 2013     root@orange.bsdrp.net:/usr/obj/BSDRP.amd64/usr/local/BSDRP/BSDRP/FreeBSD/src/sys/amd64  amd64
[root@R1]~# ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        ether aa:aa:00:01:01:01
        inet6 fe80::a8aa:ff:fe01:101%lagg0 prefixlen 64 scopeid 0x8
        inet 10.0.12.1 netmask 0xffffff00 broadcast 10.0.12.255
        inet6 2001:db8:12:0:a8aa:ff:fe01:101 prefixlen 64 autoconf
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: em0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

But once upgraded to 10-STABLE, even with strict mode DISABLED, it no longer works:

[root@R1]~# uname -a
FreeBSD R1 10.0-STABLE FreeBSD 10.0-STABLE #0 r267244M: Mon Jun  9 03:57:44 CEST 2014     root@orange.bsdrp.net:/usr/obj/BSDRP.amd64/usr/local/BSDRP/BSDRP/FreeBSD/src/sys/amd64  amd64

[root@R1]~# echo "net.link.lagg.0.lacp.lacp_strict_mode=0" >> /etc/sysctl.conf

[root@R1]~# ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        ether aa:aa:00:01:01:01
        inet6 fe80::a8aa:ff:fe01:101%lagg0 prefixlen 64 scopeid 0x7
        inet 0.0.0.0 netmask 0xff000000 broadcast 255.255.255.255
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        laggproto lacp lagghash l2,l3,l4
        laggport: em0 flags=0<>
[root@R1]~# sysctl net.link.lagg.0.
net.link.lagg.0.use_flowid: 1
net.link.lagg.0.flowid_shift: 16
net.link.lagg.0.count: 1
net.link.lagg.0.active: 0
net.link.lagg.0.flapping: 0
net.link.lagg.0.lacp.lacp_strict_mode: 0
net.link.lagg.0.lacp.debug.rx_test: 0
net.link.lagg.0.lacp.debug.tx_test: 0
 
In debug mode, I get this output:
lacp_select_tx_port: no active aggregator
Comment 9 Michael Moll freebsd_committer 2014-06-20 21:09:36 UTC
This patch did solve my occasional problems with FreeBSD 10-STABLE and a Netgear GS110TP, so somebody should at least have a look at it (or scottl could just commit it).
Comment 10 Michael Moll freebsd_committer 2014-06-20 21:14:19 UTC
Hi Oliver,

(In reply to olivier from comment #8)
> The patch didn't solve Lagg regression between 9.2 and 10-stable (r267244).
> 
> My configuration only use one lagg member (the second will be added later):

_Maybe_ bug #179926 is your bug, and you could try the patch there... (although that problem seems to have existed since at least 9.1)
Comment 11 olivier 2014-09-13 21:16:22 UTC
(In reply to kvedulv from comment #10)
> Hi Oliver,
> 
> (In reply to olivier from comment #8)
> > The patch didn't solve Lagg regression between 9.2 and 10-stable (r267244).
> > 
> > My configuration only use one lagg member (the second will be added later):
> 
> _Maybe_ #179926 is your Bug and you could try the patch over there...
> (Although that problem seems to have existed at least in 9.1)

I believe it's not a regression in my case:
On FreeBSD 9.2, lagg's LACP mode, by not implementing a 'strict' mode, was equivalent to a 'static + optional LACP' mode.
The behavior of lagg LACP on 9.2 seems to be:
- If no LACP is detected, allow at least 1 lagg member to transmit.
The behavior on 10 seems to be:
- If no LACP is detected, don't allow any lagg member to transmit (even with strict_mode disabled).
And in my case, I didn't have an LACP device in front of my FreeBSD box.
So I simply need to use a 'static' mode (like loadbalance or roundrobin).
Comment 12 Jeroen van Heugten 2014-09-29 09:16:11 UTC
Created attachment 147795 [details]
Output of ifconfig -a
Comment 13 Jeroen van Heugten 2014-09-29 09:16:46 UTC
We are facing the same problem, running FreeBSD 10.0-p9 connected via LACP to two Juniper EX4550 switches (virtual chassis/stacked). In our case the Junipers are already in active mode, but we still have connection issues: we are experiencing frequent outages (every few minutes).

dmesg shows (on multiple FreeBSD servers):

ix1: Interface stopped DISTRIBUTING, possible flapping
ix1: Interface stopped DISTRIBUTING, possible flapping
ix1: Interface stopped DISTRIBUTING, possible flapping

Juniper configuration:

    ae21 {
        description "serverX (xe-0/0/17 en xe-1/0/17)";
        aggregated-ether-options {
            link-speed 10g;
            lacp {
                active;
                periodic fast;
            }
        }
        unit 0 {
            family ethernet-switching {
                port-mode access;
                vlan {
                    members S1_SERVERS;
                }
            }
        }
    }


- attached output of "ifconfig -a"
- attached output of "/etc/rc.conf"
Comment 14 Jeroen van Heugten 2014-09-29 09:18:44 UTC
Created attachment 147796 [details]
LACP settings in /etc/rc.conf
Comment 15 Jan Jurkus 2014-09-30 22:57:54 UTC
Is the patch supposed to be included in 10.1-BETA3?

There does not seem to be a sysctl entry called net.link.lagg.lacp.lacp_strict_mode
Comment 16 Joseph Mingrone freebsd_committer 2014-11-25 00:42:30 UTC
I just upgraded our storage server from 9.3 to 10.1 and we're now seeing this problem as well.

This is what we had in /etc/rc.conf:

cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 laggport em2 laggport em3 192.168.0.102/24"

I'll also attach a screenshot of ifconfig -a.
Comment 18 Jeroen van Heugten 2015-04-01 14:24:06 UTC
In our case, the problem was caused by a bug in TSO (TCP segmentation offload). This has been resolved in 10.1. You can easily rule out TSO issues by temporarily turning it off on your interfaces (ifconfig ix# -tso).
Comment 19 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:49:50 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.