Bug 243319 - em(4): Panic results in Intel 82579LM taking down local switch
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2020-01-13 14:14 UTC by Kyle Evans
Modified: 2020-02-25 02:58 UTC

Description Kyle Evans freebsd_committer freebsd_triage 2020-01-13 14:14:11 UTC
Hi,

On an X220 with em0 (Intel 82579LM), the laptop panicking results in an absolute flood of ARP requests:

07:55:53.025687 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025708 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025729 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025750 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025783 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025825 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025850 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025872 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025894 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025915 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025937 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025959 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025980 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.026001 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.026022 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.026044 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.026065 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46

root@gemini:~# tcpdump -r re0.pcap | grep -i 'arp, request' | grep '7:55:53' | wc -l
reading from file re0.pcap, link-type EN10MB (Ethernet)
   17424

That's over 17,000 in that particular second alone; all of the boxes on the local network segment end up with networking hosed until I reboot or remove the offending laptop.
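
For anyone wanting to get the per-second count without chaining greps, here is a minimal Python sketch. It assumes tcpdump's default one-line text output, as in the excerpt above; the name `arp_requests_per_second` is made up for illustration and is not part of any tool.

```python
import re
from collections import Counter

# Matches tcpdump's default one-line ARP request output, e.g.
# "07:55:53.025687 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46"
# (assumption: timestamps are printed as HH:MM:SS.micros, the tcpdump default).
ARP_REQUEST = re.compile(
    r"^(?P<second>\d{2}:\d{2}:\d{2})\.\d+ ARP, Request who-has \S+ tell [^,\s]+,"
)


def arp_requests_per_second(lines):
    """Count ARP request lines per HH:MM:SS bucket (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ARP_REQUEST.match(line)
        if match:
            counts[match.group("second")] += 1
    return counts
```

Feeding it the text output of `tcpdump -r re0.pcap` line by line and looking at `counts.most_common(5)` should surface the worst seconds of the flood.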
Comment 1 Kyle Evans freebsd_committer freebsd_triage 2020-01-13 14:16:13 UTC
CC'ing -net@ and cem/markj; the latter two because this happens in a panic context, which leads me to believe it's perhaps debugnet-related, though I've not configured debugnet/netdump at all.
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2020-01-13 14:18:31 UTC
(In reply to Kyle Evans from comment #1)
What happens after the panic?  Does the system attempt to dump core?  Do you perhaps have a DDB script that attempts to trigger a netdump?
Comment 3 Kyle Evans freebsd_committer freebsd_triage 2020-01-13 14:26:03 UTC
(In reply to Mark Johnston from comment #2)

Yeah, so from the system's viewpoint it looks like an absolutely normal panic: I'm at the ddb prompt and able to dump (and did, because I needed to check later whether the VM panic it hit that time has since been resolved).

This system doesn't use DDB scripts, but I double-checked here:

# sysrc ddb_enable
ddb_enable: NO

and it's otherwise configured like so:

# grep 'dump' rc.conf
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
dumpon_flags="-vZ"
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2020-01-13 14:39:58 UTC
You could try repro'ing the problem with net.debugnet.debug=1 or =2 to verify that the debugnet code isn't actually running somehow.
Comment 5 Kyle Evans freebsd_committer freebsd_triage 2020-01-13 19:10:45 UTC
(In reply to Mark Johnston from comment #4)

Doing this resulted in no activity from debugnet, at least.

It might be worth noting that it does take a while after the initial panic for this misbehavior to begin, and the machine is unattended and sitting idle at the ddb prompt the entire time.
Comment 6 Conrad Meyer freebsd_committer freebsd_triage 2020-01-13 20:02:15 UTC
Are these different requesters?  It would help to spell out which IP(s) belong to the panic'd laptop.

07:55:53.025959 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025980 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
                                                              ^^
If you don't have debugnet enabled I don't see any obvious reason debugnet would be ARPing.  Plus, debugnet isn't *that* spammy and the number of ARPs it sends is bounded; it gives up after a few tries (like, 3?).  You would see obvious prints as a side effect of debugnet-enabled dump being attempted after the panic, and an obvious print when the ARP request failed.

If anything, I suspect this is some NIC internal firmware/hardware behavior due to the panic'd machine not processing RX queues or something.
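
A quick way to check the "different requesters" question is to tally the requests by the address after `tell`. A minimal Python sketch, assuming the same tcpdump text format as the two lines quoted above; `arp_requests_by_sender` is a hypothetical helper name, not an existing tool.

```python
import re
from collections import Counter

# Captures the requesting address from a tcpdump ARP request line, e.g.
# "... ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46" -> "10.6.112.16"
# (assumption: default tcpdump text output, one packet per line).
ARP_SENDER = re.compile(r"ARP, Request who-has \S+ tell (?P<sender>[^,\s]+),")


def arp_requests_by_sender(lines):
    """Count ARP requests per requesting address (hypothetical helper)."""
    senders = Counter()
    for line in lines:
        match = ARP_SENDER.search(line)
        if match:
            senders[match.group("sender")] += 1
    return senders
```

If the panic'd laptop's own address never shows up as a key in the result, the flood is coming from other hosts on the segment rather than from the laptop.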
Comment 7 Kyle Evans freebsd_committer freebsd_triage 2020-01-13 20:31:52 UTC
(In reply to Conrad Meyer from comment #6)

Hmm... yeah, good point. I misread '18' as '16'; those are actually the two Windows boxen on the local segment, and 10.6.112.1 is the gateway for this VLAN.

I'll work on another repro and see if I can't get more context. The flood seems to just be a side-effect of whatever's cutting off the local network, rather than the cause. This still has to be the result of something this NIC is doing periodically -- disconnecting it immediately remedies the situation and local connectivity is restored, and the behavior is consistent but not immediately triggered upon panic. Nagios lets us know quickly when this laptop's taken down the Windows machines.

This is the context leading up to that particular flood:

07:55:35.211083 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:35.650045 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:36.650033 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:37.211468 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:37.650026 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:38.650003 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:39.186264 IP 10.6.112.1 > ospf-all.mcast.net: OSPFv2, Hello, length 56
07:55:39.209654 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:39.649990 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:40.649980 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:41.211537 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:41.649960 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:42.649947 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:43.210181 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:43.649929 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:44.649936 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:45.208168 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:45.649903 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:46.649907 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:47.229691 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:47.649898 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:48.216500 IP 10.6.112.1 > ospf-all.mcast.net: OSPFv2, Hello, length 56
07:55:48.649860 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:49.255548 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:49.649850 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:50.649836 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:51.227859 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:51.649821 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:52.649815 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply

68:1c:a2:10:41:10 is the unmanaged switch immediately upstream from the laptop. That unmanaged switch currently has yet another unmanaged switch of the same model upstream from it, which I set up ~5 months ago to try to isolate the problem; this has been ongoing and consistent for at least the last 6+ months (I don't panic it that frequently). Immediately upstream from that one is a managed switch. The Windows boxen sit on the most-upstream switch, while this laptop and another FreeBSD laptop are on the lowest switch.
Comment 8 Kyle Evans freebsd_committer freebsd_triage 2020-01-13 20:56:01 UTC
I've uploaded a fairly noisy pcap file: https://people.freebsd.org/~kevans/re0.pcap

The panicked laptop was plugged back in within a minute of starting this dump. Around 7 minutes in, at *:45/*:46, is when the local network goes away; I then yanked the network cable out of the laptop again at about *:46:44 and local traffic was restored.
Comment 9 Kubilay Kocak freebsd_committer freebsd_triage 2020-01-25 02:53:15 UTC
Is the panic potentially relevant, or has the issue been observed with unrelated/multiple/different panics?
Comment 10 Kyle Evans freebsd_committer freebsd_triage 2020-01-25 03:12:41 UTC
I've changed the title to reflect what has been revealed: the ARP flood is from the local Windows boxes that can't find their gateway because the Intel NIC has done something to kill the switch's uplink.

I'll grab a live Linux image and double-check there, but I suspect this is not FreeBSD specific.
Comment 11 Kyle Evans freebsd_committer freebsd_triage 2020-02-21 13:52:46 UTC
This is (apparently) a firmware bug; I have other laptops on the same revision that do not kill the local switch. In any event, not a FreeBSD issue.