Bug 287337 - LAN interface gets totally wedged, unkillable processes, no packets received
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.4-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2025-06-06 07:27 UTC by Gert Doering
Modified: 2025-06-09 00:58 UTC

Description Gert Doering 2025-06-06 07:27:14 UTC
Hi,

so, this is a machine with multiple NICs, and occasionally one of them gets "wedged" or "stuck", with no packets being received at all (outgoing packets do seem to get out), and things like "ifconfig $nic down" resulting in an unkillable "ifconfig" process.  Only reboot helps.

This used to happen on an old HP DL360G8 "once or twice a year", so we assumed "it's dying hardware", and eventually moved the whole FreeBSD system into a Proxmox VM cluster.  It was running perfectly fine there for about 8 months, and yesterday the "interface wedged" problem occurred *4 times*, which makes me think we might actually have hit a FreeBSD bug there.

At this point, the VM has 3 NICs, vtnet0..vtnet2.

vtnet0 is the one that does "all the production traffic", and that's the one that gets stuck.  vtnet1/vtnet2 connect to two isolated network segments, and keep working perfectly fine - so when the problem happens, I can still ssh in from another machine connected to vtnet1/vtnet2 and run diagnostics.

At this point, I would primarily ask for "what sort of information should I gather, and what should I test, if it happens again?"

What we did

 - tcpdump -n -s0 -i vtnet0 --> claims "we send packets, we do not receive packets, at all"
 - flap the virtual NIC link on the hypervisor --> shows up in "dmesg", but does not change anything
 - move the VM to a different cluster node (see if it's something on the KVM side) --> no change
 - try "ifconfig vtnet0 down" --> makes "ifconfig" unkillable, and the interface is still displayed as "up"
 - run "ping 141.1.1.1" --> makes "ping" unkillable
 - on "shutdown -r" it tries to kill dhcp6d for 90 seconds, which refuses to die, then complains about "90 seconds watchdog timeout" and proceeds to "flushing disks" and gets stuck there - so a press of the (virtual) reset button is needed to un-stick things
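
The checks above could be captured in one pass the next time vtnet0 wedges.  What follows is a minimal sketch, assuming a root shell reached via vtnet1/vtnet2; the script itself, the IFACE/OUTDIR/WAIT variables and the choice of commands are illustrative, not something we have run on the affected machine.  Each command is started in the background because, as with "ifconfig" and "ping" above, anything that touches the wedged interface may block in the kernel and never return.

```sh
#!/bin/sh
# Sketch only: snapshot process and network state while the NIC is wedged.
# IFACE, OUTDIR and WAIT are illustrative names, not from the original report.
IFACE="${IFACE:-vtnet0}"
OUTDIR="${OUTDIR:-/tmp/wedge-$(date +%Y%m%d-%H%M%S)}"
WAIT="${WAIT:-5}"
mkdir -p "$OUTDIR"

# collect NAME CMD [ARGS...]
# Runs CMD in the background and writes its output to $OUTDIR/NAME.txt,
# so that one command blocking in the kernel cannot stall the others.
# Commands that are not installed are noted in the file and skipped.
collect() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$OUTDIR/$name.txt" 2>&1 &
    else
        echo "skipped: $1 not found" > "$OUTDIR/$name.txt"
    fi
}

collect ps        ps -axlww          # MWCHAN column: what each process sleeps on
collect kstacks   procstat -kk -a    # kernel stacks of all threads
collect ifconfig  ifconfig "$IFACE"  # flags and link state as the kernel sees them
collect netstat-i netstat -i -n -d   # per-interface packet, error and drop counters
collect mbufs     netstat -m         # mbuf usage and denied requests
collect intr      vmstat -i          # interrupt counters
collect sysctl    sysctl dev.vtnet   # vtnet driver statistics
collect dmesg     dmesg

# Give the commands a moment, then list the results.  A file that is still
# empty belongs to a command that has produced no output yet (possibly stuck).
sleep "$WAIT"
ls -l "$OUTDIR"
```

Running it twice a minute or so apart would also show whether any counters on vtnet0 still move.  The kernel stacks of the stuck "ifconfig" and "ping" processes are probably the most useful part, since they should show where in the kernel those threads are blocked.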

We have first seen this on 13.2-RELEASE (2 times over 9 months), and yesterday 4x on 13.4-RELEASE-p5.

Right now my suspect is the IPv6 DHCP server from "isc-dhcp44-server-4.4.3P1_2", because this is really the only thing that makes this machine unique - we have some 20+ more FreeBSD machines on 13.4-RELEASE, some hardware, some VMs on Proxmox, and no other machine has ever exhibited this.  Some have way more network traffic and sessions.


What we did in the meantime is to upgrade the kernel to 14.2-RELEASE (first half of "freebsd-update -r 14.2-RELEASE upgrade") to see if that will help - that machine needs to get work done.