221146 – [ixgbe] Problem with second laggport

Status:	Closed Overcome By Events

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	11.1-STABLE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	Kevin Bowling

URL:
Keywords:	IntelNetworking, regression

Depends on:
Blocks:

Reported:	2017-08-01 12:08 UTC by Konrad
Modified:	2023-08-13 16:17 UTC
CC List:	18 users

See Also:

Flags:	koobs: mfc-stable11?

Attachments
Move hardware notification of driver readiness to end of attach. Current. (790 bytes, patch) 2017-11-10 21:16 UTC, Sean Bruno
Move hardware notification of driver readiness to end of attach. Stable/11. (807 bytes, patch) 2017-11-10 21:16 UTC, Sean Bruno

Description Konrad 2017-08-01 12:08:45 UTC

Hello,

I had FreeBSD 11.0-RELEASE with a lagg interface configured like this:

# LACP
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 192.168.202.254 netmask 255.255.255.0"

After upgrading to:
FreeBSD Gagarin 11.1-STABLE FreeBSD 11.1-STABLE #0 r321860

I have a problem with the second laggport:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: ix1 flags=0<>

Note that ix1 shows flags=0<>.

rc.conf is the same as it was on FreeBSD 11.0-RELEASE.

ifconfig shows "no carrier" for ix1, although the physical connection is correct:

ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	hwaddr 0c:c4:7a:bd:65:58
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
	status: active
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	hwaddr 0c:c4:7a:bd:65:59
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier


When I remove laggport ix1 from rc.conf and reboot the machine, ix1 works normally:

ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:59
	hwaddr 0c:c4:7a:bd:65:59
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
	status: active

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>


and I can add ix1 to lagg0:
# ifconfig lagg0 laggport ix1

and all laggports work:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
root@Gagarin:~ #

Comment 1 Andrey V. Elsukov freebsd_committer

2017-08-01 14:23:14 UTC

We also noticed some instability with the recent 3.2.12-k driver and 82599EB cards (chip=0x10fb8086). Sometimes it can't detect media after a reboot. Usually a power reset solves the problem. We use the "unsupported_sfp" tunable, but with the old 2.5.15 driver we never saw that.

Comment 2 Eugene Grosbein freebsd_committer

2017-08-01 19:36:49 UTC

This does not seem to be a lagg(4) problem, but rather an ix(4) link instability problem.

Comment 3 Kajetan Staszkiewicz 2017-08-01 20:39:49 UTC

I have some Intel XL710 cards running failover lagg, and I had an issue not totally unlike this one. One of the ports in the lagg (always the same one, unless they were added in a different order, in which case both would always work) would only send frames but never receive them, so the router would become master on carp on the VLANs on this lagg, never seeing the carp advertisements of the primary router. That was on FreeBSD 11.0, though. I used ixgbe-2.5.15 from Intel instead of the one shipped with the kernel and the issue was gone.

Comment 4 Kajetan Staszkiewicz 2017-08-01 20:40:45 UTC

Sorry, I meant ixl-1.7.12 from Intel.

Comment 5 Cassiano Peixoto 2017-08-01 21:02:41 UTC

Hi guys,

It's a bug after commit r320897.

I have the same issue running an ix interface in netmap mode after this commit. I had to downgrade the ixgbe driver to make it work again.

It looks like a bug introduced by the driver update to 3.2.12-k.

I'm copying the author of the commit.

Comment 6 Kevin Bowling freebsd_committer

2017-08-02 08:39:28 UTC

We're seeing this on cxgbe(4) as well, so I think it's related to if_lagg locking changes and the LACP init code.

Comment 7 Eric Joyner freebsd_committer

2017-08-03 20:50:19 UTC

Do we know who to assign this to, then? If it is actually a lagg problem and not a driver-specific one.

Comment 8 Cassiano Peixoto 2017-08-03 21:01:22 UTC

(In reply to Eric Joyner from comment #7)
So should I open a new PR for the driver update problem with netmap?

Comment 9 Konrad 2017-08-04 10:45:48 UTC

I set LAGG on igb0 and igb1 and seems to be ok:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 00:1e:67:27:1d:5e
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>




As shown above, when the ix ports are configured as lagg1, all laggports work properly. When I switch the ix ports to lagg0, listed first in cloned_interfaces, the problem appears:



lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 0c:c4:7a:bd:65:58
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: ix1 flags=0<>
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 00:1e:67:27:1d:5e
	inet 192.168.202.254 netmask 0xffffff00 broadcast 192.168.202.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	groups: lagg 
	laggproto lacp lagghash l2,l3,l4
	laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

Comment 10 Andrey V. Elsukov freebsd_committer

2017-08-04 11:26:19 UTC

(In reply to Konrad from comment #9)
> I set LAGG on igb0 and igb1 and seems to be ok:
> but when ix's ports are configured as lagg1, all laggports work properly.
> When I  switch ix's ports as lagg0 and first in cloned_interfaces the
> problem appears:

Is it reproducible? Are you doing this in runtime manually without reboot?

Comment 11 Konrad 2017-08-04 11:49:13 UTC

Yes, it's reproducible. I change the lagg configuration in rc.conf and reboot the machine.

If lagg0 starts with "ix1 flags=0", ix1 will always have "no carrier", even if I remove it from the lagg. Manually removing ix1 and restarting netif does not fix the problem (ix1 still has "no carrier"); only a reboot does.
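
To check for the symptom after each reboot without reading the full output by eye, something like the following could be used. This is only a sketch: the function name inactive_laggports is made up for illustration, and it assumes the "laggport: <name> flags=..." line format shown in the ifconfig output in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): print the name of every
# laggport whose flags field is empty, i.e. "flags=0<>".
# Input: the text produced by "ifconfig lagg0", read from stdin.
inactive_laggports() {
    awk '$1 == "laggport:" && $3 == "flags=0<>" { print $2 }'
}

# Example on FreeBSD: ifconfig lagg0 | inactive_laggports
```

With the failing lagg0 output from the description, this would print ix1; with a healthy lagg it prints nothing.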

Comment 12 Matt Joras freebsd_committer

2017-08-04 16:25:18 UTC

Does this only happen with LACP or does it happen with any lagg type (e.g. failover)?

Comment 13 Sean Bruno freebsd_committer

2017-08-07 16:10:22 UTC

(In reply to Cassiano Peixoto from comment #8)
For NetMap, yes.  That would be a different issue.  Which driver?

Comment 14 Cassiano Peixoto 2017-08-07 17:20:29 UTC

(In reply to Sean Bruno from comment #13)
Dear Sean,

I've just opened PR 221317 regarding the ixgbe driver update and the netmap issue.

Thanks.

Comment 15 Kevin Bowling freebsd_committer

2017-08-11 03:40:09 UTC

Our cxgbe issue was not related and has been fixed by the vendor.

Comment 16 Jeff Pieper 2017-11-06 22:51:14 UTC

We are unable to reproduce this over 10 reboots:

FreeBSD u1015 11.1-STABLE FreeBSD 11.1-STABLE #0 d89b4f5935a(stable/11): Sat Nov  4 04:15:08 UTC 2017

[root@u1015 ~]# sysctl dev.ix|grep Version
dev.ix.1.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k
dev.ix.0.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k

Relevant rc.conf entries:

ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 190.2.10.15 netmask 255.255.0.0"

[root@u1015 ~]# ifconfig -vvvv lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:d5:bf:40
        inet 190.2.10.15 netmask 0xffff0000 broadcast 190.2.255.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        lagg options:
                flags=11<USE_FLOWID,LACP_STRICT>
                flowid_shift: 16
        lagg statistics:
                active ports: 2
                flapping: 0
        lag id: [(8000,90-E2-BA-D5-BF-40,00D2,0000,0000),
                 (0000,00-04-96-9B-9B-35,03E9,0000,0000)]
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,90-E2-BA-D5-BF-40,00D2,8000,0003),
                 (0000,00-04-96-9B-9B-35,03E9,0000,03E9)]
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,90-E2-BA-D5-BF-40,00D2,8000,0004),
                 (0000,00-04-96-9B-9B-35,03E9,0000,03EA)]

[root@u1015 ~]# ifconfig -vvvvv ix1
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:d5:bf:40
        hwaddr 90:e2:ba:d5:bf:41
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
        plugged: SFP/SFP+/SFP28 10G Base-LR (LC)
        vendor: Intel Corp PN: FTLX1471D3BCV-I3 SN: ATS0LVZ DATE: 2015-07-06
        module temperature: 42.12 C Voltage: 3.30 Volts
        RX: 0.43 mW (-3.64 dBm) TX: 0.76 mW (-1.17 dBm)
 
        SFF8472 DUMP (0xA0 0..127 range):
        03 04 07 20 00 00 02 00 00 00 00 06 67 02 0A 64
        00 00 00 00 49 6E 74 65 6C 20 43 6F 72 70 20 20
        20 20 20 20 00 00 1B 21 46 54 4C 58 31 34 37 31
        44 33 42 43 56 2D 49 33 41 20 20 20 05 1E 00 83
        00 3A 00 00 41 54 53 30 4C 56 5A 20 20 20 20 20
        20 20 20 20 31 35 30 37 30 36 20 20 68 FA 02 45
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[root@u1015 ~]# ifconfig -vvvvv ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:d5:bf:40
        hwaddr 90:e2:ba:d5:bf:40
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
        plugged: SFP/SFP+/SFP28 10G Base-LR (LC)
        vendor: Intel Corp PN: FTLX1471D3BCV-I3 SN: AU20U3K DATE: 2015-07-15
        module temperature: 41.68 C Voltage: 3.30 Volts
        RX: 0.34 mW (-4.64 dBm) TX: 0.72 mW (-1.37 dBm)
 
        SFF8472 DUMP (0xA0 0..127 range):
        03 04 07 20 00 00 02 00 00 00 00 06 67 02 0A 64
        00 00 00 00 49 6E 74 65 6C 20 43 6F 72 70 20 20
        20 20 20 20 00 00 1B 21 46 54 4C 58 31 34 37 31
        44 33 42 43 56 2D 49 33 41 20 20 20 05 1E 00 83
        00 3A 00 00 41 55 32 30 55 33 4B 20 20 20 20 20
        20 20 20 20 31 35 30 37 31 35 20 20 68 FA 02 FC
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Can you give us the ifconfig -vvvvv output for each ix interface?

Comment 17 Sean Bruno freebsd_committer

2017-11-10 21:16:07 UTC

Created attachment 187911
Move hardware notification of driver readiness to end of attach.  Current.

Comment 18 Sean Bruno freebsd_committer

2017-11-10 21:16:49 UTC

Created attachment 187912
Move hardware notification of driver readiness to end of attach. Stable/11.

Comment 19 Sean Bruno freebsd_committer

2017-11-10 21:17:59 UTC

I'm having a heck of a time reproducing this bug, as it seems like a race in the attach startup.

If you are seeing this problem on Current or Stable/11, I've attached an attempt to fix this, please report back if it helps or does nothing.

Comment 20 Jeb Cramer 2017-11-13 17:29:02 UTC

(In reply to Sean Bruno from comment #19)
I'm not saying it won't help, but it also opens the door for the reasons we moved it to the beginning of attach() in the first place.  The firmware tends to not honor the synchronization bits until the driver has taken over (via the DRV_LOAD bit).

Comment 21 Sean Bruno freebsd_committer

2017-11-13 18:58:23 UTC

(In reply to Jeb Cramer from comment #20)
Until I get confirmation from a failure case, I'm not going to do anything with this at this time.

Comment 22 Peter Vanek 2018-01-30 13:32:57 UTC

Hello,

I would like to add a little piece to this as well.
We can reproduce the same issue with driver 3.2.12-k:

Jan 30 12:59:18 localhost ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xecc0-0xecdf mem 0xd9d00000-0xd9dfffff,0xd9ff8000-0xd9ffbfff irq 40 at device 0.0 numa-domain 0 on pci2

with lagg1 configured as follows:

lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether a0:36:9f:3e:57:18
        inet 61.0.0.24 netmask 0xff000000 broadcast 61.255.255.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto failover lagghash l2,l3,l4
        laggport: ix0 flags=5<MASTER,ACTIVE>
        laggport: ix1 flags=0<>


We use a simple script to reproduce this:

# more a.sh
#!/bin/sh
# Toggle the interface given as $1 down and up in a tight loop.

while true; do
    ifconfig "$1" down
    echo "next $1"
    ifconfig "$1" up
    #sleep 1
done


# sh a.sh ix0


After about 30-50 iterations, ix0 ends up with the UP flag set but with 'no carrier' status:

ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=9400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,NETMAP>
        ether a0:36:9f:3e:57:18
        hwaddr a0:36:9f:3e:57:18
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        plugged: SFP/SFP+/SFP28 10G Base-SR (LC)
        vendor: OEM PN: SFP-10G-SR SN: IN140317016 DATE: 2014-03-20
        module temperature: 30.60 C Voltage: 3.29 Volts
        RX: 0.38 mW (-4.10 dBm) TX: 0.41 mW (-3.85 dBm)

        SFF8472 DUMP (0xA0 0..127 range):
        03 04 07 10 00 00 50 FF 00 00 00 06 67 02 00 00
        08 03 00 1E 4F 45 4D 20 20 20 20 20 20 20 20 20
        20 20 20 20 00 00 1B 21 53 46 50 2D 31 30 47 2D
        53 52 20 20 20 20 20 20 41 20 20 20 03 52 00 BA
        00 3A 00 00 49 4E 31 34 30 33 31 37 30 31 36 20
        20 20 20 20 31 34 30 33 32 30 20 20 68 FA 03 07
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
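
A variant of the loop that stops at the first failure, so the broken state can be inspected, might look like the sketch below. The function names and the settle delay are assumptions made for illustration and are not part of a.sh as run here. The delay is needed because the link takes a moment to come back after "up", but it also changes the timing, so it may make the race harder to hit than the tight loop does.

```shell
#!/bin/sh
# Hypothetical variant of a.sh: stop at the first iteration that leaves the
# interface without carrier, so the failing state can be inspected.

# Succeeds if the ifconfig output on stdin reports "status: no carrier".
has_no_carrier() {
    grep -q 'status: no carrier'
}

# Toggle interface $1 until it loses carrier; $2 is an assumed settle time
# in seconds (default 5) that lets the link come back before checking.
flap_until_no_carrier() {
    ifname=$1
    settle=${2:-5}
    count=0
    while true; do
        ifconfig "$ifname" down
        ifconfig "$ifname" up
        count=$((count + 1))
        sleep "$settle"
        if ifconfig "$ifname" | has_no_carrier; then
            echo "no carrier on $ifname after $count iterations"
            return 0
        fi
    done
}

# Example on FreeBSD (as root): flap_until_no_carrier ix0
```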

==============

The same result occurs when netmap is not involved; the script takes ix0 down too:

# ifconfig  ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether a0:36:9f:3e:57:18
        hwaddr a0:36:9f:3e:57:18
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier


FYI, we have temporarily downgraded the driver to 3.1.13-k, where the issue is not present.

If I can help with any additional testing, let me know.

Best Regards,
Peter

Comment 23 Maciej Suszko 2018-10-12 09:50:41 UTC

Same problem here on a Dell PowerEdge R740xd running 11.2-RELEASE-p4, with sources modified to include the latest mrsas drivers (Dell PERC H740P).

root@host:~ # uname -a
FreeBSD host 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0 r338999M: Fri Sep 28 19:27:30 UTC 2018     root@host:/usr/obj/usr/src/sys/GENERIC  amd64

root@host:~ # dmesg -a| grep '^ix[01]'
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0x4020-0x403f mem 0x9d900000-0x9d9fffff,0x9da04000-0x9da07fff at device 0.0 numa-domain 0 on pci5
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: xx:xx:xx:xx:xx:98
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0x4000-0x401f mem 0x9d800000-0x9d8fffff,0x9da00000-0x9da03fff at device 0.1 numa-domain 0 on pci5
ix1: Using MSI-X interrupts with 9 vectors
ix1: Ethernet address: xx:xx:xx:xx:xx:9a
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 8/2048, RX 8/2048
ix0: link state changed to UP
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: link state changed to UP
ix1: link state changed to DOWN
ix1: link state changed to UP

I can attach ix1 to lagg0 later, after system boot. Doing this within rc.conf (paired with ix0) does not work: ix1 shows 'no carrier', and it doesn't change its state after being removed from lagg0 or after an interface down/up.

root@host:~ # ifconfig ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether xx:xx:xx:xx:xx:98
        hwaddr xx:xx:xx:xx:xx:98
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active

root@host:~ # ifconfig ix1
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether xx:xx:xx:xx:xx:98
        hwaddr xx:xx:xx:xx:xx:9a
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active

root@host:~ # ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
     options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether xx:xx:xx:xx:xx:98
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg 
        laggproto lacp lagghash l2,l3,l4
        laggport: ix0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: ix1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

Comment 24 Peter Vanek 2018-12-09 19:52:19 UTC

Hello guys,

I wonder whether this may have the same origin as the issue in

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

with commit

https://reviews.freebsd.org/D18468

Peter

Comment 25 Rodney W. Grimes freebsd_committer

2019-02-13 01:55:11 UTC

Please do not put bugs on stable@, current@, hackers@, etc

Comment 26 Johan Ström 2019-04-10 18:49:43 UTC

Hi,

Not sure if this is fully related, but I've had issues with carp and lagg too.
When changing the carp status, i.e. plugging in one of the configured interfaces (that was how I first noticed it), the lagg0 interface went down and up, but carp failed to catch up. This happens on reboot too (but I'm not sure it has happened *every time*; this is a new setup).

The net.inet.carp.demotion counter went to 2160 (2160 / 240 = 9, which is the number of VLANs with CARP on the lagg), but got stuck there and never came back down to 0:

Apr 10 20:23:00 gw1 kernel: ix1: link state changed to UP
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 240 (send error 50 on vlan14)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 480 (send error 50 on vlan11)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 720 (send error 50 on vlan17)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 960 (send error 50 on vlan16)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1200 (send error 50 on vlan13)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1440 (send error 50 on vlan15)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1680 (send error 50 on vlan10)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 1920 (send error 50 on vlan1)
Apr 10 20:23:07 gw1 kernel: carp: demoted by 240 to 2160 (send error 50 on vlan18)
Apr 10 20:23:08 gw1 kernel: carp: 10@vlan10: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan10: 3
Apr 10 20:23:08 gw1 kernel: carp: 13@vlan13: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan13: 3
Apr 10 20:23:08 gw1 kernel: carp: 5@vlan15: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan15: 3
Apr 10 20:23:08 gw1 kernel: carp: 18@vlan18: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan18: 3
Apr 10 20:23:08 gw1 kernel: carp: 11@vlan11: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: arp: 172.28.2.1 moved from 00:22:4d:6b:b1:5b to 00:00:5e:00:01:0b on vlan11
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan11: 3
Apr 10 20:23:08 gw1 kernel: carp: 17@vlan17: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan17: 3
Apr 10 20:23:08 gw1 kernel: carp: 1@vlan1: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan1: 3
Apr 10 20:23:08 gw1 kernel: carp: 14@vlan14: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan14: 3
Apr 10 20:23:08 gw1 kernel: carp: 16@vlan16: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:23:08 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan16: 3
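
The 9 x 240 = 2160 arithmetic can be cross-checked against the log itself. The following is a rough sketch, with a made-up function name, that assumes the "carp: demoted by N to M" message format seen above and sums the N values from log lines on stdin:

```shell
#!/bin/sh
# Hypothetical helper: sum the "demoted by N" adjustments found in kernel
# log lines on stdin. Negative adjustments (e.g. from sysctl) are included.
sum_carp_demotions() {
    awk '/carp: demoted by/ {
        for (i = 1; i < NF; i++)
            if ($i == "by") { total += $(i + 1); break }
    }
    END { print total + 0 }'
}

# Example: grep 'carp: demoted' /var/log/messages | sum_carp_demotions
```

Fed the nine "demoted by 240" lines above, the sum comes to 2160, which matches the value the counter got stuck at.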


Then I manually removed the demote via sysctl:

Apr 10 20:23:21 gw1 kernel: carp: demoted by -2160 to 0 (sysctl)
Apr 10 20:23:23 gw1 kernel: carp: 11@vlan11: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 17@vlan17: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 14@vlan14: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 1@vlan1: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 13@vlan13: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 10@vlan10: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: arp: 172.28.4.129 moved from 00:00:5e:00:01:11 to 00:22:4d:6b:b1:5b on vlan17
Apr 10 20:23:23 gw1 kernel: carp: 5@vlan15: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 16@vlan16: BACKUP -> MASTER (preempting a slower master)
Apr 10 20:23:23 gw1 kernel: carp: 18@vlan18: BACKUP -> MASTER (preempting a slower master)


(It's also a bit interesting that it mentions those ARP changes. Why would either of the nodes announce the CARPed IP with the NIC MAC rather than the CARP virtual MAC, at any time?)

The "other" carp node (not using lagg) is 11.2-RELEASE-p7, this node with lagg is 11.2-RELEASE-p8.
The lagg'ed nic's are ix0-ix4 "<Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> mem 0xddc00000-0xdddfffff,0xdde04000-0xdde07fff at device 0.0 on pci6" on a Supermicro A2SDi-4C-HLN4F.

On both nodes, net.inet.carp.preempt=1, and advbase 1, advskew 100 on this node, 200 on the other.


To add another dimension to this. If I set net.inet.carp.preempt=0 (which I had initially), I cannot get the interfaces out of BACKUP at all:
...
Apr 10 20:45:23 gw1 kernel: carp: 1@vlan1: MASTER -> BACKUP (more frequent advertisement received)
Apr 10 20:45:23 gw1 kernel: ifa_maintain_loopback_route: deletion failed for interface vlan1: 3
Apr 10 20:45:36 gw1 kernel: carp: demoted by -2160 to 0 (sysctl)

and then nothing more.. Setting it to 1 again, immediately makes it master.


Anyway, not sure if this is related to the ixgbe 3.2.12-k driver, or lagg, or carp..  But I thought I'd write it down here anyway.

Comment 27 Michael Muenz 2019-04-10 19:55:08 UTC

No, this is not related to the main issue. When you see send errors, you can disable demotion on send errors. This will not destabilize the cluster; it will work as expected. Nevertheless, you should find the root cause, maybe something spanning-tree related, or switch firmware that builds the LACP bundle too slowly (it could be anything).

Comment 28 Johan Ström 2019-04-11 05:17:11 UTC

Michael, thanks for the quick pointer! Setting net.inet.carp.senderr_demotion_factor=0 makes CARP no longer react this way to individual up/down of underlying lagg interfaces. Instead I just get "carp: demoted by 0 to 0 (send error 50 on vlan18)", and no CARP change.

As for this "send error 50", I assume 50 is ENETDOWN, so for some reason the lagg driver signals this for a brief moment while changing its setup. I'm not sure how to proceed in hunting down the root cause for that, but this is not really a super critical setup; I'm mostly doing this for experimenting.
For the record, the other end is a HP 1820-24G (J9980A) and the lagg config is "laggproto lacp laggport ix0 laggport ix1 laggport ix2 laggport ix3"

(unrelated carp rambling below:)

Regarding why it did not automatically come back when net.inet.carp.preempt=0: it's of course by design. From the carp(4) DESCRIPTION section for this sysctl (and after googling it, reading up on it a bit, and studying the carp source), it's clear that it is actually a way to make it more aggressive in enforcing that the node with the lower advskew is always MASTER. With =0 it won't go in until it stops seeing the current master, regardless of advskew.

In the EXAMPLE section of carp(4) the same option is described as a way to make multiple interfaces change state at the same time, so I had missed the above fact:

"When one of the physical interfaces of host A fails, advskew is demoted
to a configured value on all its carp vhids. Due to the preempt option,
host B would start announcing itself, and thus preempt host A on both
interfaces instead of just the failed one."

One thing about the text above which almost bit me (as in thinking it was not behaving as expected) was this:
If I (on the current master, with the lower advskew) deconfigured a VLAN from the LACP VLAN trunk of the current master, the other node no longer saw the master's announcements and took over as master. This only happened for that particular VLAN though, and that confused me. Doesn't the EXAMPLE say that it should affect all interfaces when preempt=1?
Yes, for *ifdown* events. In this scenario it just stopped seeing any incoming traffic in that VLAN, so of course it did not demote the other interfaces, which it would have done by the value of net.inet.carp.ifdown_demotion_factor if the interface had actually gone down.

Sorry for my rambling, but perhaps someone else (or me, when I have forgotten the details) might stumble upon it in the future and can benefit from it!

Comment 29 Maciej Suszko 2019-04-12 08:40:49 UTC

In my case the problem seems to be gone. I suspect the Juniper switch OS was behaving badly; in the meantime my machine has been upgraded to 11.2-RELEASE-p9, but my bet is on the switch. It's a Juniper ex3400-24t running Junos 18.2R1.9 right now; the machine boots fine and both LACP links are UP.