My machine uses an igb interface for PPPoE to a DSL modem. After some load on the interface (max 16 Mbit) the link state changes to DOWN.
Sometimes running "ifconfig pppoe0 down" and then "ifconfig pppoe0 up" helps, but mostly it does not, and the only solution is to reboot the machine.
Every piece of the hardware involved has been replaced over the years, but the problem persists.
The funny thing is that the link goes active if I run "ifconfig pppoe0 down", but as soon as I run "ifconfig pppoe0 up" the link state goes to inactive.
First messages in /var/log/messages when this happens:
Feb 4 16:20:14 monolith kernel: igb3: TX(0) desc avail = 42, pidx = 876
Feb 4 16:20:15 monolith kernel: pppoe0: link state changed to DOWN
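For anyone who wants to watch their own logs for this signature, the two messages above can be matched with a pair of regular expressions. This is only a minimal sketch: the function name `scan_messages` is made up for illustration, and the patterns are derived solely from the sample lines in this report, so other driver versions may word the messages differently.

```python
import re

# Patterns derived only from the sample log lines in this report;
# other driver versions may word these messages differently.
TX_RE = re.compile(
    r"kernel: (?P<ifname>\w+): TX\((?P<queue>\d+)\) "
    r"desc avail = (?P<avail>\d+), pidx = (?P<pidx>\d+)"
)
DOWN_RE = re.compile(r"kernel: (?P<ifname>\w+): link state changed to DOWN")


def scan_messages(lines):
    """Collect TX descriptor reports and link-down events from syslog lines.

    Returns (tx_reports, link_downs). tx_reports is a list of
    (ifname, queue, avail, pidx) tuples; link_downs is a list of
    interface names. Both keep the order in which the lines appeared.
    """
    tx_reports = []
    link_downs = []
    for line in lines:
        m = TX_RE.search(line)
        if m:
            tx_reports.append(
                (m["ifname"], int(m["queue"]), int(m["avail"]), int(m["pidx"]))
            )
            continue
        m = DOWN_RE.search(line)
        if m:
            link_downs.append(m["ifname"])
    return tx_reports, link_downs
```

Feeding it the lines of /var/log/messages (for example `scan_messages(open("/var/log/messages"))`) gives a quick list of which interfaces logged a TX report and which went down.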
I'm running into this as well, no pppoe involved. Everything is up and running just fine until (on Friday evening of Memorial Day weekend, of course) ...
May 24 21:28:03 <host> kernel: igb0: TX(0) desc avail = 42, pidx = 574
May 24 21:28:03 <host> kernel: igb0: link state changed to DOWN
May 24 21:28:05 <host> kernel: igb0: TX(3) desc avail = 1024, pidx = 0
May 24 21:28:35 <host> syslogd: last message repeated 18 times
Can't try net/intel-em-kmod as this is an I350 card:
igb0@pci0:4:0:0: class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = 'I350 Gigabit Network Connection'
class = network
subclass = ethernet
After it decides there is no carrier (lights go out on physical card, too) rebooting is the only way I've found to wake it back up.
Running amd64 12.0-p3 at the time. Custom kernel config, but comes up and is stable until this occurs. Doesn't *seem* to be during high traffic, but I suppose there could have been an undetected (looking at monitoring data) spike right before it happened. (This was the interface the monitoring data is sent out on, so...)
It seems that r353778 - head/sys/dev/e1000 fixed my problem.
Running now for 36h without a problem.
Will close this as fixed in a week.
Assign committer of base r353778 for comment/confirmation that the change is relevant for this issue.
The commit was not marked for MFC, but an MFC may be indicated; setting mfc-stable-* flags accordingly, in case they're needed.
It happened again, but I still think it has to do with the handling of the interface state.
(In reply to Kubilay Kocak from comment #3)
r353778 is orthogonal to this bug
Some patch earlier this year (2020) fixed something, so at least it is now possible (again) to reanimate the interface with an "ifconfig igbX down" / "ifconfig igbX up" combo. The problem remains that the interface is unusable when this happens, and with no other access to the machine it is offline.
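Since the machine is unreachable while the interface is wedged, the down/up sequence would have to be run locally. A minimal sketch of that sequence follows. It is a workaround only, not a fix, and other reports in this thread indicate it does not always help. The function names and the pause length are assumptions made for illustration; only the two ifconfig invocations come from this report.

```python
import subprocess
import time


def bounce_commands(ifname):
    """Return the two ifconfig(8) invocations that take ifname down, then up."""
    return [["ifconfig", ifname, "down"], ["ifconfig", ifname, "up"]]


def bounce(ifname, pause=2.0, run=subprocess.run, sleep=time.sleep):
    """Run `ifconfig <ifname> down`, wait `pause` seconds, then bring it up.

    `run` and `sleep` are injectable so the sequence can be dry-run
    without touching a real interface. The pause length is an arbitrary
    assumption. Running this for real requires root.
    """
    down, up = bounce_commands(ifname)
    run(down, check=True)  # raises CalledProcessError if ifconfig fails
    sleep(pause)
    run(up, check=True)
```

Calling `bounce("igb0")` as root from a local cron job or console session performs the same two steps as typing the commands by hand.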
I believe this is still recurring, even on 12.1-R, and that pf is involved somehow. Every few weeks, a server will go offline. When I bring up the management console, the link light is off and the console is full of errors like this:
igb1: Watchdog timeout (TX: 2 desc avail: 42 pidx: 504) -- resetting
igb1: Watchdog timeout (TX: 2 desc avail: 1024 pidx: 0) -- resetting
The first occurs once, then the second repeats every 2-3 seconds.
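The repeating reset pattern described above could be detected from captured console or log output along these lines. This is a rough sketch: the regular expression is derived only from the two lines quoted here, and the function names and the threshold of three consecutive resets are arbitrary assumptions, not values taken from the driver.

```python
import re

# Pattern derived from the two console lines quoted in this comment.
WATCHDOG_RE = re.compile(
    r"(?P<ifname>\w+): Watchdog timeout "
    r"\(TX: (?P<queue>\d+) desc avail: (?P<avail>\d+) pidx: (?P<pidx>\d+)\)"
    r" -- resetting"
)


def longest_reset_run(lines, ifname):
    """Return the longest run of consecutive watchdog resets for ifname.

    Any line that is not a watchdog message for ifname ends the current run.
    """
    longest = current = 0
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m and m["ifname"] == ifname:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_wedged(lines, ifname, threshold=3):
    """Heuristic: True if ifname logged at least `threshold` resets in a row.

    The threshold is an assumption chosen for illustration.
    """
    return longest_reset_run(lines, ifname) >= threshold
```

A single isolated timeout stays below the threshold, while the every-few-seconds loop crosses it quickly.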
If I ifconfig down the interface, the watchdog timeout errors cease and the link light comes on. If I then ifconfig up the interface the link light goes off again and the watchdog timeout errors resume.
Thus far, only rebooting fixes this state.
Anecdote: this behaviour seems to correlate with whether or not pf is in use. I have several machines with this exact hardware (Supermicro X11SSH-LN4F); some run pf, some don't. The behaviour occurs on every machine running pf, and on none of the ones not running pf. It wasn't occurring on two machines for months, then pf was enabled, and the behaviour started. It was occurring on one of the machines running pf, then pf was disabled, and after months it has not recurred.
I'm posting this here because the observation--the paradoxical if up/down behaviour fixed only with a reboot--is the same, and I have machines running 12.x that do it, suggesting this has only been partially masked.
On my machine pf is not enabled. Also, it seems that incoming traffic is triggering the problem.
This Supermicro motherboard has a manufacturing defect, which caused the problem.