Hello,

I had FreeBSD 11.0-RELEASE with a lagg interface configured like this:

# LACP
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 192.168.202.254 netmask 255.255.255.0"

After upgrading to:

FreeBSD Gagarin 11.1-STABLE FreeBSD 11.1-STABLE #0 r321860

I have a problem with the second laggport:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=0<>

Note flags=0 on ix1. rc.conf is the same as it was for FreeBSD 11.0-RELEASE.

ifconfig shows no carrier on ix1, although the link is connected correctly:

ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        hwaddr 0c:c4:7a:bd:65:58
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        hwaddr 0c:c4:7a:bd:65:59
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier

When I removed laggport ix1 from rc.conf and rebooted the machine, ix1 worked normally:

ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:59
        hwaddr 0c:c4:7a:bd:65:59
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

and I can add ix1 to lagg0:

# ifconfig lagg0 laggport ix1

and all laggports work:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
root@Gagarin:~ #
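For anyone checking whether a machine is in the state described above, the symptom can be spotted mechanically: a laggport line whose flags do not include ACTIVE (here, "laggport: ix1 flags=0<>"). The following is only a minimal sketch; the function name inactive_laggports is made up for illustration, and it assumes the ifconfig output format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): read `ifconfig laggN` output
# on stdin and print the name of every laggport whose flags field does
# not contain ACTIVE, e.g. "laggport: ix1 flags=0<>".
inactive_laggports() {
    awk '$1 == "laggport:" && $3 !~ /ACTIVE/ { print $2 }'
}

# Usage on a live system (lagg0 is the interface name from the report):
#   ifconfig lagg0 | inactive_laggports
```

An empty result would suggest that all member ports joined the aggregate; it says nothing about why a port failed to join.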
We have also noticed some instability with the recent 3.2.12-k driver and 82599EB cards (chip=0x10fb8086). Sometimes it can't detect media after a reboot. Usually a power reset solves the problem. We use the "unsupported_sfp" tunable, but we never saw this with the old 2.5.15 driver.
This does not seem to be a lagg(4) problem but rather an ix(4) link instability problem.
I have some Intel XL710 cards running a failover lagg and I had an issue not totally unlike this one. One of the ports in the lagg (always the same one, unless the ports were added in a different order, in which case both would always work) would only send frames but never receive them, so the router would become master on carp on the vlans on this lagg, never seeing the carp advertisements of the primary router. That was on FreeBSD 11.0, though. I used ixgbe-2.5.15 from Intel instead of the driver that comes with the kernel and the issue was gone.
Sorry, I meant ixl-1.7.12 from Intel.
Hi guys, this is a bug introduced by commit r320897. I have the same issue running an ix interface in netmap mode after this commit. I had to downgrade the ixgbe driver to make it work again. It looks like a bug that came with the driver update to 3.2.12-k. I'm copying the author of the commit.
We're seeing this on cxgbe(4) as well, so I think it's related to if_lagg locking changes and the LACP init code.
Do we know who to assign this to, then? If it is actually a lagg problem and not a driver-specific one.
(In reply to Eric Joyner from comment #7) So should I open a new PR for the driver update problem with netmap?
I set up LAGG on igb0 and igb1 and it seems to be OK:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:1e:67:27:1d:5e
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

So when the ix ports are configured as lagg1, all laggports work properly.

When I switch the ix ports to lagg0, listed first in cloned_interfaces, the problem appears:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:bd:65:58
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=0<>
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:1e:67:27:1d:5e
        inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
(In reply to Konrad from comment #9)
> I set LAGG on igb0 and igb1 and seems to be ok:
> but when ix's ports are configured as lagg1, all laggports work properly.
> When I switch ix's ports as lagg0 and first in cloned_interfaces the
> problem appears:

Is it reproducible? Are you doing this manually at runtime, without a reboot?
Yes, it's reproducible. I change the lagg configuration in rc.conf and reboot the machine. If lagg0 starts with "ix1 flags=0", ix1 will always show "no carrier", even if I remove it from the lagg. Manually removing ix1 and restarting netif does not fix the problem (ix1 still shows "no carrier"); only a reboot does.
Does this only happen with LACP or does it happen with any lagg type (e.g. failover)?
(In reply to Cassiano Peixoto from comment #8) For NetMap, yes. That would be a different issue. Which driver?
(In reply to Sean Bruno from comment #13) Dear Sean, I've just opened PR 221317 regarding the ixgbe driver update and the netmap issue. Thanks.
Our cxgbe issue was not related and has been fixed by the vendor.
We are unable to reproduce this over 10 reboots:

FreeBSD u1015 11.1-STABLE FreeBSD 11.1-STABLE #0 d89b4f5935a(stable/11): Sat Nov 4 04:15:08 UTC 2017

[root@u1015 ~]# sysctl dev.ix|grep Version
dev.ix.1.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k
dev.ix.0.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k

Relevant rc.conf entries:

ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 190.2.10.15 netmask 255.255.0.0"

[root@u1015 ~]# ifconfig -vvvv lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:d5:bf:40
        inet 190.2.10.15 netmask 0xffff0000 broadcast 190.2.255.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        lagg options:
                flags=11<USE_FLOWID,LACP_STRICT>
                flowid_shift: 16
        lagg statistics:
                active ports: 2
                flapping: 0
        lag id: [(8000,90-E2-BA-D5-BF-40,00D2,0000,0000),
                 (0000,00-04-96-9B-9B-35,03E9,0000,0000)]
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,90-E2-BA-D5-BF-40,00D2,8000,0003),
                 (0000,00-04-96-9B-9B-35,03E9,0000,03E9)]
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,90-E2-BA-D5-BF-40,00D2,8000,0004),
                 (0000,00-04-96-9B-9B-35,03E9,0000,03EA)]

[root@u1015 ~]# ifconfig -vvvvv ix1
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:d5:bf:40
        hwaddr 90:e2:ba:d5:bf:41
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
        plugged: SFP/SFP+/SFP28 10G Base-LR (LC)
        vendor: Intel Corp PN: FTLX1471D3BCV-I3 SN: ATS0LVZ DATE: 2015-07-06
        module temperature: 42.12 C Voltage: 3.30 Volts
        RX: 0.43 mW (-3.64 dBm) TX: 0.76 mW (-1.17 dBm)

        SFF8472 DUMP (0xA0 0..127 range):
        03 04 07 20 00 00 02 00 00 00 00 06 67 02 0A 64
        00 00 00 00 49 6E 74 65 6C 20 43 6F 72 70 20 20
        20 20 20 20 00 00 1B 21 46 54 4C 58 31 34 37 31
        44 33 42 43 56 2D 49 33 41 20 20 20 05 1E 00 83
        00 3A 00 00 41 54 53 30 4C 56 5A 20 20 20 20 20
        20 20 20 20 31 35 30 37 30 36 20 20 68 FA 02 45
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[root@u1015 ~]# ifconfig -vvvvv ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:d5:bf:40
        hwaddr 90:e2:ba:d5:bf:40
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
        plugged: SFP/SFP+/SFP28 10G Base-LR (LC)
        vendor: Intel Corp PN: FTLX1471D3BCV-I3 SN: AU20U3K DATE: 2015-07-15
        module temperature: 41.68 C Voltage: 3.30 Volts
        RX: 0.34 mW (-4.64 dBm) TX: 0.72 mW (-1.37 dBm)

        SFF8472 DUMP (0xA0 0..127 range):
        03 04 07 20 00 00 02 00 00 00 00 06 67 02 0A 64
        00 00 00 00 49 6E 74 65 6C 20 43 6F 72 70 20 20
        20 20 20 20 00 00 1B 21 46 54 4C 58 31 34 37 31
        44 33 42 43 56 2D 49 33 41 20 20 20 05 1E 00 83
        00 3A 00 00 41 55 32 30 55 33 4B 20 20 20 20 20
        20 20 20 20 31 35 30 37 31 35 20 20 68 FA 02 FC
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Can you give us the ifconfig -vvvvv output for each ix interface?
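Since several reports in this thread hinge on which ix(4) driver version is loaded (3.2.12-k versus older ones), it may help to pull the version out of the sysctl description string in a uniform way. This is a rough sketch only: ix_driver_versions is a hypothetical name, and the pattern assumes the "%desc ... Version - X" format shown in the sysctl output above.

```shell
#!/bin/sh
# Hypothetical helper: read `sysctl dev.ix` output on stdin and print one
# "ixN VERSION" pair per adapter, taken from the dev.ix.N.%desc line.
ix_driver_versions() {
    sed -n 's/^dev\.ix\.\([0-9][0-9]*\)\.%desc: .*Version - \(.*\)$/ix\1 \2/p'
}

# Usage on a live system:
#   sysctl dev.ix | ix_driver_versions
```

Attaching this output together with the ifconfig -vvvvv output for each port makes reports easier to compare.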
Created attachment 187911 Move hardware notification of driver readiness to end of attach. Current.
Created attachment 187912 Move hardware notification of driver readiness to end of attach. Stable/11.
I'm having a heck of a time reproducing this bug, as it seems to be a race in the attach startup. If you are seeing this problem on Current or Stable/11, I've attached an attempt at a fix. Please report back whether it helps or does nothing.
(In reply to Sean Bruno from comment #19) I'm not saying it won't help, but it also opens the door for the reasons we moved it to the beginning of attach() in the first place. The firmware tends to not honor the synchronization bits until the driver has taken over (via the DRV_LOAD bit).
(In reply to Jeb Cramer from comment #20) Until I get confirmation from a failure case, I'm not going to do anything with this at this time.
Hello,

I would like to add a little piece to this as well. We can reproduce the same issue with driver 3.2.12-k:

Jan 30 12:59:18 localhost ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xecc0-0xecdf mem 0xd9d00000-0xd9dfffff,0xd9ff8000-0xd9ffbfff irq 40 at device 0.0 numa-domain 0 on pci2

with lagg1 configured like this:

lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether a0:36:9f:3e:57:18
        inet 61.0.0.24 netmask 0xff000000 broadcast 61.255.255.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto failover lagghash l2,l3,l4
        laggport: ix0 flags=5<MASTER,ACTIVE>
        laggport: ix1 flags=0<>

We use a simple script to reproduce this:

# more a.sh
#!/bin/sh
while true; do
ifconfig $1 down
echo next $1
ifconfig $1 up
#sleep 1
done

# sh a.sh ix0

After about 30-50 loops, we find the ix0 interface with the UP flag set but with 'no carrier' status:

ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=9400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,NETMAP>
        ether a0:36:9f:3e:57:18
        hwaddr a0:36:9f:3e:57:18
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        plugged: SFP/SFP+/SFP28 10G Base-SR (LC)
        vendor: OEM PN: SFP-10G-SR SN: IN140317016 DATE: 2014-03-20
        module temperature: 30.60 C Voltage: 3.29 Volts
        RX: 0.38 mW (-4.10 dBm) TX: 0.41 mW (-3.85 dBm)

        SFF8472 DUMP (0xA0 0..127 range):
        03 04 07 10 00 00 50 FF 00 00 00 06 67 02 00 00
        08 03 00 1E 4F 45 4D 20 20 20 20 20 20 20 20 20
        20 20 20 20 00 00 1B 21 53 46 50 2D 31 30 47 2D
        53 52 20 20 20 20 20 20 41 20 20 20 03 52 00 BA
        00 3A 00 00 49 4E 31 34 30 33 31 37 30 31 36 20
        20 20 20 20 31 34 30 33 32 30 20 20 68 FA 03 07
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

==============

The same result occurs when netmap is not involved; the script takes ix0 down in that case too:

# ifconfig ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether a0:36:9f:3e:57:18
        hwaddr a0:36:9f:3e:57:18
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier

FYI, we have temporarily downgraded the driver to 3.1.13-k, where the issue is not present.

If I can help with any additional testing, let me know.

Best Regards,
Peter
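For others who want to try the reproduction above, a bounded variant that stops by itself once the link is stuck may be more convenient than an endless loop. The sketch below is an assumption-laden rewrite, not the script that was actually run: the IFCONFIG and SETTLE variables, the function names, and the idea of checking after a burst of rapid toggles are all made up here, and the 5 second settle time is a guess at how long the link needs to come back.

```shell
#!/bin/sh
# Hypothetical knobs for this sketch; neither exists in the original a.sh.
IFCONFIG=${IFCONFIG:-ifconfig}   # command used to drive the interface
SETTLE=${SETTLE:-5}              # seconds to wait for link-up (a guess)

# Succeeds if the interface currently reports "status: no carrier".
no_carrier() {
    $IFCONFIG "$1" 2>/dev/null | grep -q 'status: no carrier'
}

# flap IFACE ROUNDS BURST
# Each round toggles IFACE down/up BURST times with no delay (as a.sh
# does), then waits SETTLE seconds and checks the carrier state.
# Returns 0 as soon as the link is stuck, 1 if it survived all rounds.
flap() {
    iface=$1
    rounds=$2
    burst=$3
    round=1
    while [ "$round" -le "$rounds" ]; do
        n=0
        while [ "$n" -lt "$burst" ]; do
            $IFCONFIG "$iface" down
            $IFCONFIG "$iface" up
            n=$((n + 1))
        done
        sleep "$SETTLE"
        if no_carrier "$iface"; then
            echo "$iface: no carrier after round $round"
            return 0
        fi
        round=$((round + 1))
    done
    echo "$iface: carrier still present after $rounds rounds"
    return 1
}

# Example (as root): flap ix0 10 50
```

Whether the settle delay hides or exposes the race is unknown; if the link never comes back within SETTLE seconds on healthy hardware, the value is simply too small.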
Same problem here on a Dell PowerEdge R740xd running 11.2-RELEASE-p4, with sources modified to include the latest mrsas drivers (DELL PERC H740P).

root@host:~ # uname -a
FreeBSD host 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0 r338999M: Fri Sep 28 19:27:30 UTC 2018 root@host:/usr/obj/usr/src/sys/GENERIC amd64

root@host:~ # dmesg -a| grep '^ix[01]'
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0x4020-0x403f mem 0x9d900000-0x9d9fffff,0x9da04000-0x9da07fff at device 0.0 numa-domain 0 on pci5
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: xx:xx:xx:xx:xx:98
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0x4000-0x401f mem 0x9d800000-0x9d8fffff,0x9da00000-0x9da03fff at device 0.1 numa-domain 0 on pci5
ix1: Using MSI-X interrupts with 9 vectors
ix1: Ethernet address: xx:xx:xx:xx:xx:9a
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 8/2048, RX 8/2048
ix0: link state changed to UP
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: link state changed to UP
ix1: link state changed to DOWN
ix1: link state changed to UP

I can attach ix1 to lagg0 later, after system boot. Doing this within rc.conf (paired with ix0) does not work: ix1 shows 'no carrier', and its state doesn't change after removing it from lagg0 or taking the interface down/up.

root@host:~ # ifconfig ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether xx:xx:xx:xx:xx:98
        hwaddr xx:xx:xx:xx:xx:98
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active
root@host:~ # ifconfig ix1
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether xx:xx:xx:xx:xx:98
        hwaddr xx:xx:xx:xx:xx:9a
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active
root@host:~ # ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether xx:xx:xx:xx:xx:98
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
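Two reporters in this thread found that ix1 joins the lagg fine when it is added after boot instead of from rc.conf. Until the root cause is known, that observation could be turned into a stopgap along the following lines. This is an untested sketch, not a recommended configuration: the interface names are the ones from the reports, the inet address is left out, and where to run the final command (by hand, or from a local startup script such as /etc/rc.local) is an assumption.

```shell
# /etc/rc.conf (sketch): bring both ports up but create lagg0 with ix0 only
ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0"

# After boot, as root, add the second port (the command the reporters ran):
# ifconfig lagg0 laggport ix1
```

This only sidesteps the boot-time ordering; it does not address the 'no carrier' state once a port is already stuck, which per the reports above needed a reboot.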
Hello guys, I'm wondering whether this may have the same origin as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317, with commit https://reviews.freebsd.org/D18468. Peter
Please do not put bugs on stable@, current@, hackers@, etc
Hi, not sure if this is fully related, but I've had issues with carp and lagg too.

When changing the carp status, i.e. plugging in one of the configured interfaces (that was how I first noticed it), the lagg0 interface went down and up, but carp failed to catch up on this. This happens on reboot too (but I'm not sure it has happened *every time*; this is a new setup).

The net.inet.carp.demotion counter went to 2160 ( /240 = 9, which is the number of VLANs with CARP on the lagg), but got stuck there and never came back down to 0:

Apr 10 20:23:00 gw1 kernel: ix1: link state changed to UP
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 240 (send error 50 on vlan14)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 480 (send error 50 on vlan11)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 720 (send error 50 on vlan17)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 960 (send error 50 on vlan16)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1200 (send error 50 on vlan13)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1440 (send error 50 on vlan15)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1680 (send error 50 on vlan10)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1920 (send error 50 on vlan1)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 2160 (send error 50 on vlan18)
Apr 10 20:23:08 gw1 kernel: carp: 10@vlan10: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan10: 3
Apr 10 20:23:08 gw1 kernel: carp: 13@vlan13: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan13: 3
Apr 10 20:23:08 gw1 kernel: carp: 5@vlan15: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan15: 3
Apr 10 20:23:08 gw1 kernel: carp: 18@vlan18: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan18: 3
Apr 10 20:23:08 gw1 kernel: carp: 11@vlan11: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: arp: 172.28.2.1 moved from 00:22:4d:6b:b1:5b to 00:00:5e:00:01:0b on vlan11
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan11: 3
Apr 10 20:23:08 gw1 kernel: carp: 17@vlan17: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan17: 3
Apr 10 20:23:08 gw1 kernel: carp: 1@vlan1: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan1: 3
Apr 10 20:23:08 gw1 kernel: carp: 14@vlan14: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan14: 3
Apr 10 20:23:08 gw1 kernel: carp: 16@vlan16: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan16: 3

Then I manually removed the demotion via sysctl:

Apr 10 20:23:21 gw1 kernel: carp: demoted by -2160 to 0 (sysctl)
Apr 10 20:23:23 gw1 kernel: carp: 11@vlan11: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 17@vlan17: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 14@vlan14: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 1@vlan1: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 13@vlan13: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 10@vlan10: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: arp: 172.28.4.129 moved from 00:00:5e:00:01:11 to 00:22:4d:6b:b1:5b on vlan17
Apr 10 20:23:23 gw1 kernel: carp: 5@vlan15: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 16@vlan16: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 18@vlan18: BACKUP -> MASTER (preempting a slower master)

(It's also a bit interesting that it mentions those ARP changes. Why would either of the nodes announce the CARPed IP on the NIC MAC rather than the CARP ip, at any time?)

The "other" carp node (not using lagg) is 11.2-RELEASE-p7; this node with lagg is 11.2-RELEASE-p8. The lagg'ed NICs are ix0-ix4 "<Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> mem 0xddc00000-0xdddfffff,0xdde04000-0xdde07fff at device 0.0 on pci6" on a Supermicro A2SDi-4C-HLN4F. On both nodes, net.inet.carp.preempt=1 and advbase 1, with advskew 100 on this node and 200 on the other.

To add another dimension to this: if I set net.inet.carp.preempt=0 (which I had initially), I cannot get the interfaces out of BACKUP at all:

...
Apr 10 20:45:23 gw1 kernel: carp: 1@vlan1: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:45:23 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan1: 3
Apr 10 20:45:36 gw1 kernel: carp: demoted by -2160 to 0 (sysctl)

and then nothing more. Setting it to 1 again immediately makes it master.

Anyway, I'm not sure if this is related to the ixgbe 3.2.12-k driver, or lagg, or carp, but I thought I'd write it down here anyway.
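The arithmetic in the comment above (nine send-error demotions of 240 each, ending at 2160) can be checked straight from the kernel log. The helper below is just a sketch with a made-up name, carp_demotion_summary; it assumes the "carp: demoted by X to Y (...)" message format quoted in the log excerpt.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin, count the
# "carp: demoted by X to Y" events and report the last counter value Y.
carp_demotion_summary() {
    awk '/carp: demoted by/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "to") { last = $(i + 1); break }
        }
        n++
    }
    END { printf "events=%d final=%d\n", n, last }'
}

# Usage on a live system (the log path is an assumption):
#   carp_demotion_summary < /var/log/messages
```

A final value that stays above 0 after the link has recovered matches the "stuck" counter described above; a later "(sysctl)" line resetting it would show up as one more event with final=0.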
No, this is not related to the main issue. When you see send errors, you can disable demotion on send errors. That will not destabilize the cluster; it will still work as expected. Nevertheless, you should find the root cause, which may be something spanning-tree related, or switch firmware that builds the LACP aggregate too slowly (it could be anything).
Michael, thanks for the quick pointer! Setting net.inet.carp.senderr_demotion_factor=0 makes CARP no longer react this way to individual up/down events of the underlying lagg interfaces. Instead I just get "carp: demoted by 0 to 0 (send error 50 on vlan18)", and no CARP change.

As for this "send error 50", I assume 50 is ENETDOWN, so for some reason the lagg driver signals this for a brief moment while changing its setup. I'm not sure how to proceed in hunting down the root cause for that, but this is not really a super critical setup; I'm mostly doing this for experimenting.

For the record, the other end is an HP 1820-24G (J9980A) and the lagg config is "laggproto lacp laggport ix0 laggport ix1 laggport ix2 laggport ix3".

(unrelated carp rambling below:)

Regarding why it did not automatically come back when net.inet.carp.preempt=0: it's of course by design. From the carp(4) DESCRIPTION section for this sysctl (and after googling it, reading up on it a bit, and studying the carp source), it's clear that it is actually a way to make carp more aggressive in enforcing that the node with the lower advskew is always MASTER. With =0 it won't take over until it stops seeing the current master, regardless of advskew.

In the EXAMPLE section of carp(4) the same option is described as a way to make multiple interfaces change state at the same time, so I had missed the above fact:

"When one of the physical interfaces of host A fails, advskew is demoted to a configured value on all its carp vhids. Due to the preempt option, host B would start announcing itself, and thus preempt host A on both interfaces instead of just the failed one."

One thing about the text above almost bit me (as in thinking it was not behaving as expected): if I (on the current master with the lower advskew) deconfigured a VLAN from the LACP VLAN trunk of the current master, the other node no longer saw the master's announcements and took over as master. This only happened for that particular VLAN though, and that confused me. Doesn't the EXAMPLE say that it should affect all interfaces when preempt=1? Yes, for *ifdown* events. In this scenario it just stopped seeing any incoming traffic in that VLAN, so of course it did not demote the other interfaces, which it would have done by the value of net.inet.carp.ifdown_demotion_factor if the interface had actually gone down.

Sorry for my rambling, but perhaps someone else (or me, when I have forgotten the details) might stumble upon it in the future and benefit from it!
In my case the problem seems to be gone. I suspect the Juniper switch OS was behaving badly. In the meantime my machine has been upgraded to 11.2-RELEASE-p9, but my bet is on the switch. It's a Juniper ex3400-24t running Junos 18.2R1.9. Right now the machine boots fine and both LACP links are UP.