Bug 258850 - lagg(4): interface vanishes when both member interfaces are inactive/unassociated, and members cannot be reactivated
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
Keywords: needs-qa
Reported: 2021-10-02 00:10 UTC by John Westbrook
Modified: 2021-10-04 18:58 UTC

See Also:
koobs: mfc-stable13?


Attachments
uname (174 bytes, text/plain)
2021-10-02 15:27 UTC, John Westbrook
pciconf output (5.23 KB, text/plain)
2021-10-02 15:29 UTC, John Westbrook
script from first comment (1.10 KB, application/x-shellscript)
2021-10-02 15:29 UTC, John Westbrook
example1 dmesg transcript (13.81 KB, text/plain)
2021-10-02 15:30 UTC, John Westbrook
example1 ifconfig output before symptom (1.49 KB, text/plain)
2021-10-02 15:31 UTC, John Westbrook
example1 ifconfig output after symptom (1.17 KB, text/plain)
2021-10-02 15:31 UTC, John Westbrook
example1 ifconfig output after recovery (1.47 KB, text/plain)
2021-10-02 15:32 UTC, John Westbrook
example2 dmesg transcript (13.66 KB, text/plain)
2021-10-02 15:32 UTC, John Westbrook
example2 ifconfig output before symptom (1.51 KB, text/plain)
2021-10-02 15:32 UTC, John Westbrook
example2 ifconfig output after symptom (1.50 KB, text/plain)
2021-10-02 15:33 UTC, John Westbrook
example2 ifconfig output unrecovered (1.50 KB, text/plain)
2021-10-02 15:34 UTC, John Westbrook
example3 dmesg transcript (13.59 KB, text/plain)
2021-10-02 15:34 UTC, John Westbrook
example3 ifconfig output before symptom (1.17 KB, text/plain)
2021-10-02 15:35 UTC, John Westbrook
example1 ifconfig output after symptom 1 (1.28 KB, text/plain)
2021-10-02 15:35 UTC, John Westbrook
example3 ifconfig output after symptom 1 (1.28 KB, text/plain)
2021-10-02 15:36 UTC, John Westbrook
example3 ifconfig output after symptom 2 (1.17 KB, text/plain)
2021-10-02 15:37 UTC, John Westbrook

Description John Westbrook 2021-10-02 00:10:26 UTC
I am having significant problems on FreeBSD 13.0 using lagg-failover with em0 and wlan0/ath0 on both my ThinkPad X220 and X230. Both laptops are running Coreboot, with a Dell 7WCGT Bigfoot Killer Wireless (AR5BHB112; AR9380 chipset). Both em0 and wlan0/ath0 work fine when not used with lagg.

This problem has some similarities to bug #226549 but can't be recovered in the same way.

The basic symptom is that the lagg0 interface often vanishes when both laggport interfaces are inactive/unassociated--for example, (1) when not connected to wired ethernet and the WiFi interface loses its association with the WiFi access point, or (2) when unplugging from the wired network. This also often happens at boot, when the lagg0 interface comes up but WiFi hasn't established an association with the WiFi access point. Looking in dmesg after boot doesn't shed much light:

lagg0: link state changed to DOWN
lagg0: link state changed to UP
lagg0: link state changed to DOWN

However, the problem isn't limited to WiFi. It also occurs when failing over from wired. Once em0 goes down (e.g. cable unplugged, or ifconfig em0 down), it can't be brought back up, even independently of lagg0. Note that the UP flag never reappears after "ifconfig em0 up":

# ifconfig em0
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=800000<>
	ether XX:XX:XX:XX:XX:XX
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# ifconfig em0 up
# ifconfig em0
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=800000<>
	ether XX:XX:XX:XX:XX:XX
	media: Ethernet autoselect
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# ifconfig em0
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=800000<>
	ether XX:XX:XX:XX:XX:XX
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
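
For anyone trying to reproduce this, the missing UP flag in the transcript above can be checked mechanically. The following is only a minimal sketch: has_flag is a hypothetical helper name, not part of any FreeBSD tool, and it assumes the usual "name: flags=HEX<A,B,C> ..." header line that ifconfig prints.

```shell
#!/bin/sh
# has_flag LINE FLAG (hypothetical helper): succeed if FLAG is listed between
# '<' and '>' in an ifconfig header line such as
#   em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
has_flag() {
    flags=${1#*<}        # drop everything up to and including the first '<'
    flags=${flags%%>*}   # drop everything from the first '>' onward
    case ",$flags," in
        *",$2,"*) return 0 ;;   # exact match on a comma-delimited flag name
        *)        return 1 ;;
    esac
}
```

On an affected machine it could be used as: has_flag "$(ifconfig em0 | head -n 1)" UP || echo "em0 is still not UP". Matching on comma-delimited names avoids treating OACTIVE as a hit for a shorter name.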

Here's my lagg configuration in rc.conf--almost identical to the man page:

wlans_ath0="wlan0"
ifconfig_wlan0="WPA"
ifconfig_em0="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="up laggproto failover laggport em0 laggport wlan0 DHCP"

except that I'm setting the MAC address via a hint in /boot/loader.conf:

hint.ath.0.macaddr="XX:XX:XX:XX:XX:XX"

I used the hint based on past threads discussing problems associated with setting the MAC address on Atheros devices. However, it doesn't seem to make a difference with the problem if I instead override the MAC address on em0 with the MAC address from the Atheros card. Also, the problem with lagg0 happens both when using DHCP and when configured to use a static IP address.
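
For reference, the em0-override variant mentioned above would look roughly like the fragment below. This is a sketch under assumptions, not a copy of the configuration actually tested: it assumes the loader.conf hint is removed and that the placeholder is replaced with the Atheros card's real MAC address.

```shell
# rc.conf sketch (assumed variant): em0 takes the Atheros card's MAC address
# instead of setting hint.ath.0.macaddr in /boot/loader.conf.
ifconfig_em0="ether XX:XX:XX:XX:XX:XX up"   # placeholder: MAC of the Atheros card
wlans_ath0="wlan0"
ifconfig_wlan0="WPA"
cloned_interfaces="lagg0"
ifconfig_lagg0="up laggproto failover laggport em0 laggport wlan0 DHCP"
```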

When not connected to wired ethernet, and when the WiFi interface stabilizes/associates, reconfiguring lagg0 from the command line is flaky. Sometimes it works, sometimes not. Sometimes ifconfig shows lagg0 along with a device-not-configured error, followed by lagg0 vanishing:

# ifconfig wlan0 down
# ifconfig
em0: flags=8c23<UP,BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=481249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER,NOMAP>
	ether XX:XX:XX:XX:XX:XX
	media: Ethernet autoselect
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
	inet 127.0.0.1 netmask 0xff000000
	groups: lo
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wlan0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether XX:XX:XX:XX:XX:XX
	groups: wlan
	ssid "" channel 1 (2412 MHz 11g ht/20)
	regdomain 106 indoor ecm authmode WPA2/802.11i privacy ON
	deftxkey UNDEF AES-CCM 2:128-bit txpower 20 bmiss 7 scanvalid 60
	protmode CTS ampdulimit 64k ampdudensity 8 shortgi -uapsd wme burst
	roaming MANUAL
	parent interface: ath0
	media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
	groups: pflog
lagg0: flags=8802<BROADCAST,SIMPLEX,MULTICAST>
	ether XX:XX:XX:XX:XX:XX
ifconfig: SIOCGIFGROUP: Device not configured

# ifconfig lagg0 create
# ifconfig lagg0 up laggproto failover laggport wlan0 laggport em0
# ifconfig
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=481249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER,NOMAP>
	ether XX:XX:XX:XX:XX:XX
	media: Ethernet autoselect
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
	inet 127.0.0.1 netmask 0xff000000
	groups: lo
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wlan0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether XX:XX:XX:XX:XX:XX
	groups: wlan
	ssid "" channel 1 (2412 MHz 11g ht/20)
	regdomain 106 indoor ecm authmode WPA2/802.11i privacy ON
	deftxkey UNDEF AES-CCM 2:128-bit txpower 20 bmiss 7 scanvalid 60
	protmode CTS ampdulimit 64k ampdudensity 8 shortgi -uapsd wme burst
	roaming MANUAL
	parent interface: ath0
	media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
	groups: pflog
lagg0: flags=8802<BROADCAST,SIMPLEX,MULTICAST>
	ether XX:XX:XX:XX:XX:XX
ifconfig: SIOCGIFGROUP: Device not configured

Repeating the same operations sometimes yields success. I wrote a script that helps make sense of the sequence in /var/log/messages:

#!/bin/sh

tag=`basename "$0"`

logger -t "$tag" "Checking lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 exists."
    exit 0
fi

logger -t "$tag" "Creating lagg0 ..."
if ifconfig lagg0 create; then
    logger -t "$tag" "lagg0 create success."
else
    logger -t "$tag" "lagg0 create failed."
    exit 1
fi

logger -t "$tag" "Configuring lagg0 ..."
params=`sysrc -n ifconfig_lagg0 | sed s/DHCP/up/`
if ifconfig lagg0 $params; then
    logger -t "$tag" "lagg0 config success."
else
    logger -t "$tag" "lagg0 config failed: $params"
    exit 2
fi

logger -t "$tag" "Postcheck(0) lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 postcheck success."
else
    logger -t "$tag" "lagg0 postcheck failed."
    exit 3
fi

sleep 10
logger -t "$tag" "Postcheck(1) lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 postcheck success."
else
    logger -t "$tag" "lagg0 postcheck failed."
    exit 4
fi

sleep 20
logger -t "$tag" "Postcheck(2) lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 postcheck success."
else
    logger -t "$tag" "lagg0 postcheck failed."
    exit 5
fi
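
Since repeating the same operations sometimes succeeds, the script could be driven by a bounded retry loop instead of being rerun by hand. The following is a minimal sketch only; retry is a hypothetical helper that is not part of the attached script, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD [ARGS...] (hypothetical helper): run CMD until it
# succeeds, at most ATTEMPTS times, sleeping DELAY seconds between attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0                      # stop at the first success
        n=$((n + 1))
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1                                  # every attempt failed
}
```

For example, "retry 3 5 ./fix-lagg0" would run the script above (assuming it is saved as fix-lagg0) up to three times, five seconds apart. This only works around the symptom and says nothing about why lagg0 disappears.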

Here's an example of when the script succeeds:

Oct  1 10:27:08 x220a fix-lagg0[6783]: Checking lagg0 ...
Oct  1 10:27:08 x220a fix-lagg0[6788]: Creating lagg0 ...
Oct  1 10:27:08 x220a fix-lagg0[6793]: lagg0 create success.
Oct  1 10:27:08 x220a fix-lagg0[6797]: Configuring lagg0 ...
Oct  1 10:27:09 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-DISCONNECTED bssid=AA:AA:AA:AA:AA:AA reason=3 locally_generated=1
Oct  1 10:27:09 x220a kernel: lagg0: link state changed to DOWN
Oct  1 10:27:09 x220a kernel: wlan0: link state changed to DOWN
Oct  1 10:27:10 x220a fix-lagg0[6822]: lagg0 config success.
Oct  1 10:27:10 x220a fix-lagg0[6826]: Postcheck(0) lagg0 ...
Oct  1 10:27:10 x220a fix-lagg0[6831]: lagg0 postcheck success.
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: Trying to associate with AA:AA:AA:AA:AA:AA (SSID='FiOS-YLLQU-5G' freq=5765 MHz)
Oct  1 10:27:16 x220a kernel: ath0: ath_edma_recv_tasklet: sc_inreset_cnt > 0; skipping
Oct  1 10:27:16 x220a wpa_supplicant[347]: Failed to add supported operating classes IE
Oct  1 10:27:16 x220a wpa_supplicant[347]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: Associated with AA:AA:AA:AA:AA:AA
Oct  1 10:27:16 x220a kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> ASSOC transition lost
Oct  1 10:27:16 x220a kernel: wlan0: ieee80211_new_state_locked: pending ASSOC -> RUN transition lost
Oct  1 10:27:16 x220a kernel: wlan0: link state changed to UP
Oct  1 10:27:16 x220a kernel: lagg0: link state changed to UP
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: WPA: Key negotiation completed with AA:AA:AA:AA:AA:AA [PTK=CCMP GTK=CCMP]
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-CONNECTED - Connection to AA:AA:AA:AA:AA:AA completed [id=0 id_str=]
Oct  1 10:27:20 x220a fix-lagg0[6852]: Postcheck(1) lagg0 ...
Oct  1 10:27:20 x220a fix-lagg0[6857]: lagg0 postcheck success.
Oct  1 10:27:50 x220a fix-lagg0[6878]: Postcheck(2) lagg0 ...
Oct  1 10:27:50 x220a fix-lagg0[6883]: lagg0 postcheck success.
Oct  1 10:27:51 x220a dhclient[6935]: New IP Address (lagg0): 192.168.1.86
Oct  1 10:27:52 x220a dhclient[6939]: New Subnet Mask (lagg0): 255.255.255.0
Oct  1 10:27:52 x220a dhclient[6943]: New Broadcast Address (lagg0): 192.168.1.255
Oct  1 10:27:52 x220a dhclient[6947]: New Routers (lagg0): 192.168.1.1

Notice that adding wlan0 as a laggport brings wlan0 down and triggers a reassociation. Destroying lagg0 also takes down wlan0 and triggers a reassociation:

Oct  1 10:32:30 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-DISCONNECTED bssid=AA:AA:AA:AA:AA:AA reason=3 locally_generated=1
Oct  1 10:32:33 x220a kernel: wlan0: link state changed to DOWN
Oct  1 10:32:33 x220a kernel: lagg0: link state changed to DOWN
Oct  1 10:32:33 x220a dhclient[6925]: Interface lagg0 is down, dhclient exiting
Oct  1 10:32:33 x220a dhclient[6925]: connection closed
Oct  1 10:32:33 x220a dhclient[6925]: exiting.
Oct  1 10:32:33 x220a root[7331]: /etc/rc.d/netif: WARNING: lagg0 does not exist.  Skipped.
Oct  1 10:32:40 x220a wpa_supplicant[347]: wlan0: Trying to associate with AA:AA:AA:AA:AA:AA (SSID='FiOS-YLLQU-5G' freq=5765 MHz)
Oct  1 10:32:40 x220a wpa_supplicant[347]: Failed to add supported operating classes IE
Oct  1 10:32:40 x220a wpa_supplicant[347]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
Oct  1 10:32:50 x220a wpa_supplicant[347]: wlan0: Authentication with AA:AA:AA:AA:AA:AA timed out.
Oct  1 10:32:50 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-DISCONNECTED bssid=AA:AA:AA:AA:AA:AA reason=3 locally_generated=1
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: Trying to associate with AA:AA:AA:AA:AA:AA (SSID='FiOS-YLLQU-5G' freq=5765 MHz)
Oct  1 10:32:57 x220a wpa_supplicant[347]: Failed to add supported operating classes IE
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: Associated with AA:AA:AA:AA:AA:AA
Oct  1 10:32:57 x220a kernel: wlan0: link state changed to UP
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: WPA: Key negotiation completed with AA:AA:AA:AA:AA:AA [PTK=CCMP GTK=CCMP]
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-CONNECTED - Connection to AA:AA:AA:AA:AA:AA completed [id=0 id_str=]
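
To make the ordering of link-state changes in logs like the ones above easier to see, the kernel lines can be filtered out of the wpa_supplicant and dhclient noise. A minimal sketch, assuming the syslog line layout shown in these transcripts (month, day, time, host, "kernel:", interface); link_events is a hypothetical helper name.

```shell
#!/bin/sh
# link_events (hypothetical helper): read syslog-format lines on stdin and
# print "TIME IFACE STATE" for each kernel link-state message.
link_events() {
    awk '$5 == "kernel:" && /link state changed to/ {
        iface = $6
        sub(/:$/, "", iface)    # strip the trailing colon from "wlan0:"
        print $3, iface, $NF    # time, interface, UP or DOWN
    }'
}
```

Typical use would be "link_events < /var/log/messages", which reduces each transcript to one short line per transition.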

The transcripts above are from my X220, but I've had the same symptoms on my X230. Given that the problem happens on two machines and impacts both laggport interfaces (em0 and WiFi), it seems like a lagg-related issue.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2021-10-02 00:36:32 UTC
Thank you for your report John. Could you please add:

- full uname -a output (as an attachment)
- pciconf -lv output (as an attachment)
- /var/run/dmesg.boot (as an attachment)
- include your test script (as an attachment)

Confirm/clarify:

 - the state of lagg0 (ifconfig -a) before and after the symptom
 - the state of em0 (ifconfig -a) before and after the symptom
 - the state of wlan0 (ifconfig -a) before and after the symptom

 - the minimum necessary steps and configuration to reproduce the issue

 - Whether this is a recent change in behaviour (upgrade or similar), and if so, 
   the previous version and behaviour
Comment 2 John Westbrook 2021-10-02 15:27:34 UTC
Created attachment 228363
uname
Comment 3 John Westbrook 2021-10-02 15:29:10 UTC
Created attachment 228364
pciconf output
Comment 4 John Westbrook 2021-10-02 15:29:50 UTC
Created attachment 228365
script from first comment
Comment 5 John Westbrook 2021-10-02 15:30:24 UTC
Created attachment 228366
example1 dmesg transcript
Comment 6 John Westbrook 2021-10-02 15:31:05 UTC
Created attachment 228367
example1 ifconfig output before symptom
Comment 7 John Westbrook 2021-10-02 15:31:33 UTC
Created attachment 228368
example1 ifconfig output after symptom
Comment 8 John Westbrook 2021-10-02 15:32:13 UTC
Created attachment 228369
example1 ifconfig output after recovery
Comment 9 John Westbrook 2021-10-02 15:32:33 UTC
Created attachment 228370
example2 dmesg transcript
Comment 10 John Westbrook 2021-10-02 15:32:58 UTC
Created attachment 228371
example2 ifconfig output before symptom
Comment 11 John Westbrook 2021-10-02 15:33:33 UTC
Created attachment 228372
example2 ifconfig output after symptom
Comment 12 John Westbrook 2021-10-02 15:34:24 UTC
Created attachment 228373
example2 ifconfig output unrecovered
Comment 13 John Westbrook 2021-10-02 15:34:43 UTC
Created attachment 228374
example3 dmesg transcript
Comment 14 John Westbrook 2021-10-02 15:35:09 UTC
Created attachment 228375
example3 ifconfig output before symptom
Comment 15 John Westbrook 2021-10-02 15:35:52 UTC
Created attachment 228376
example1 ifconfig output after symptom 1
Comment 16 John Westbrook 2021-10-02 15:36:48 UTC
Created attachment 228377
example3 ifconfig output after symptom 1
Comment 17 John Westbrook 2021-10-02 15:37:23 UTC
Created attachment 228378
example3 ifconfig output after symptom 2
Comment 18 John Westbrook 2021-10-02 15:38:30 UTC
Example 1: WiFi associated, wired disconnected; recovered
# ifconfig -a > example1.ifconfig.before.txt
# ifconfig wlan0 down
# ifconfig -a > example1.ifconfig.after.txt
# ifconfig wlan0 up
# ifconfig lagg0 create
# ifconfig lagg0 up laggproto failover laggport wlan0 laggport em0
# ifconfig -a > example1.ifconfig.recovered.txt

Example 2: WiFi associated, wired connected; unrecovered
# ifconfig -a > example2.ifconfig.before.txt
# ifconfig em0 down
# ifconfig -a > example2.ifconfig.after.txt
# ifconfig em0 up
# ifconfig -a > example2.ifconfig.unrecovered.txt

Example 3: WiFi down, wired disconnected; error message
# ifconfig wlan0 down
# ifconfig > example3.ifconfig.before.txt
# ifconfig lagg0 create
# ifconfig lagg0 up laggproto failover laggport wlan0 laggport em0
# ifconfig > example3.ifconfig.after1.txt
# ifconfig -a > example3.ifconfig.after2.txt
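
The before/after snapshots captured in the examples above can be compared automatically to see which interfaces disappeared. This is a minimal sketch, not something that was run for this report: list_ifaces and vanished are hypothetical helper names, and the parsing assumes each interface starts with a "name: flags=" header line as in the ifconfig output quoted earlier.

```shell
#!/bin/sh
# list_ifaces FILE (hypothetical helper): print the interface names found in
# saved ifconfig output, one per line.
list_ifaces() {
    sed -n 's/^\([a-z][a-z0-9]*\): flags=.*/\1/p' "$1"
}

# vanished BEFORE AFTER (hypothetical helper): print every interface that is
# listed in BEFORE but missing from AFTER.
vanished() {
    list_ifaces "$1" | while read -r ifn; do
        list_ifaces "$2" | grep -qxF "$ifn" || echo "$ifn"
    done
}
```

For Example 1 this would be invoked as "vanished example1.ifconfig.before.txt example1.ifconfig.after.txt". One caveat: when ifconfig still prints a lagg0 header followed by the SIOCGIFGROUP error, the interface is counted as present, so a later snapshot may be needed to catch the disappearance.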