Hi,
Running on an X220 with em0 (Intel 82579LM), a panic on the laptop results in an absolute flood of ARP requests:
07:55:53.025687 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025708 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025729 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025750 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025783 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025825 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025850 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025872 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025894 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025915 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025937 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.025959 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025980 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.026001 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.026022 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
07:55:53.026044 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.026065 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
root@gemini:~# tcpdump -r re0.pcap | grep -i 'arp, request' | grep '7:55:53' | wc -l
reading from file re0.pcap, link-type EN10MB (Ethernet)
17424
That's over 17,000 in that particular second alone; all of the boxes on the local network segment end up with networking hosed until I reboot/remove the offending laptop.
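For anyone wanting to break that count down further, here is a minimal sketch (not something that was run as part of this report) that tallies ARP requests per second and per requester from tcpdump's text output. The function name `count_arp_requests` and the regular expression are my own assumptions, written against the line format shown above; other tcpdump versions or verbosity flags may print ARP lines differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the capture above:
# "07:55:53.025687 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46"
# The optional "(...)" group tolerates a MAC address printed after the target.
ARP_REQUEST = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\.\d+ ARP, Request who-has (\S+)"
    r"(?: \([^)]*\))? tell ([^\s,]+),"
)


def count_arp_requests(lines):
    """Count ARP requests keyed by (second, requester, target).

    Lines that are not ARP requests in the assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = ARP_REQUEST.match(line)
        if match:
            second, target, requester = match.groups()
            counts[(second, requester, target)] += 1
    return counts
```

Feeding it the lines printed by `tcpdump -n -r re0.pcap` gives one counter entry per second, requester, and target, which makes it easy to see which hosts are actually sending the requests.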
CC'ing -net@ and cem/markj, the latter since it's in a panic context which leads me to believe it's perhaps debugnet related, but I've not configured debugnet/netdump at all.
(In reply to Kyle Evans from comment #1) What happens after the panic? Does the system attempt to dump core? Do you perhaps have a DDB script that attempts to trigger a netdump?
(In reply to Mark Johnston from comment #2)
Yeah, so from the system's viewpoint it looks like an absolutely normal panic and I'm at the ddb prompt and able to dump (and did, because I needed to examine later whether the VM panic it hit that time has been resolved since). This system doesn't use DDB scripts, but I double-checked here:
# sysrc ddb_enable
ddb_enable: NO
and it's otherwise configured like so:
# grep 'dump' rc.conf
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
dumpon_flags="-vZ"
You could try repro'ing the problem with net.debugnet.debug=1 or =2 to verify that the debugnet code isn't actually running somehow.
(In reply to Mark Johnston from comment #4) Doing this resulted in no activity from debugnet, at least. It might be worth noting that it does take a while after the initial panic for this misbehavior to begin, and the machine is unattended and sitting idle at the ddb prompt the entire time.
These are different requesters? It would help to spell out which IP(s) belong to the panic'd laptop.
07:55:53.025959 ARP, Request who-has 10.6.112.1 tell 10.6.112.16, length 46
07:55:53.025980 ARP, Request who-has 10.6.112.1 tell 10.6.112.18, length 46
                                                              ^^
If you don't have debugnet enabled I don't see any obvious reason debugnet would be ARPing. Plus, debugnet isn't *that* spammy and the number of ARPs it sends is bounded; it gives up after a few tries (like, 3?). You would see obvious prints as a side effect of a debugnet-enabled dump being attempted after the panic, and an obvious print when the ARP request failed. If anything, I suspect this is some NIC internal firmware/hardware behavior due to the panic'd machine not processing RX queues or something.
(In reply to Conrad Meyer from comment #6)
Hmm... yeah, good point: I misread '18' as '16', and those are actually the two Windows boxen on the local segment, with 10.6.112.1 being the gateway for this vlan. I'll work on another repro and see if I can't get more context. The flood seems to just be a side-effect of whatever's cutting off the local network, rather than the cause. This still has to be the result of something this NIC is doing periodically: disconnecting it immediately remedies the situation and local connectivity is restored, and the behavior is consistent but not immediately triggered upon panic. Nagios lets us know quickly when this laptop's taken down the Windows machines. This is the context leading up to that particular flood:
07:55:35.211083 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:35.650045 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:36.650033 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:37.211468 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:37.650026 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:38.650003 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:39.186264 IP 10.6.112.1 > ospf-all.mcast.net: OSPFv2, Hello, length 56
07:55:39.209654 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:39.649990 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:40.649980 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:41.211537 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:41.649960 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:42.649947 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:43.210181 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:43.649929 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:44.649936 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:45.208168 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:45.649903 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:46.649907 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:47.229691 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:47.649898 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 query
07:55:48.216500 IP 10.6.112.1 > ospf-all.mcast.net: OSPFv2, Hello, length 56
07:55:48.649860 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:49.255548 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:49.649850 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:50.649836 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:51.227859 STP 802.1d, Config, Flags [none], bridge-id 8070.04:c5:a4:5e:0d:80.8098, length 43
07:55:51.649821 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
07:55:52.649815 68:1c:a2:10:41:10 (oui Unknown) > Broadcast, RRCP-0x23 reply
68:1c:a2:10:41:10 is the unmanaged switch immediately upstream from the laptop. That unmanaged switch currently has yet another unmanaged switch of the same model upstream from it, which I set up ~5 months ago to try to isolate the problem, as this has been ongoing and consistent over the last 6+ months at least (I don't panic it that frequently). Immediately upstream from that one is a managed switch. The Windows boxen sit on the most-upstream switch, while this laptop and another FreeBSD laptop are on the lowest switch.
I've uploaded a fairly noisy pcap file:
https://people.freebsd.org/~kevans/re0.pcap
The panicked laptop was plugged back in within a minute of starting this dump. Around 7 minutes in, at *:45/*:46, is when the local network goes away; I then yanked the network cable out of the laptop again at about *:46:44 and local traffic was restored.
Is the panic potentially relevant, or has the issue been observed with unrelated/multiple/different panics?
I've changed the title to reflect what has been revealed: the ARP flood is from the local Windows boxes that can't find their gateway because the Intel NIC has done something to kill the switch's uplink. I'll grab a live Linux image and double-check there, but I suspect this is not FreeBSD specific.
This is (apparently) a firmware bug; I have other laptops on the same revision that do not kill the local switch. In any event, not a FreeBSD issue.