Bug 235524

Summary: igb(4): Ethernet interface loses active link state
Product: Base System
Reporter: Denis Ahrens <denis>
Component: kern
Assignee: freebsd-net (Nobody) <net>
Status: Closed Not A Bug
Severity: Affects Only Me
CC: eborisch+FreeBSD, kbowling, kirill, mack, ncrogers, net, ports.maintainer
Priority: ---
Keywords: IntelNetworking
Version: 12.1-STABLE
Flags: koobs: mfc-stable12?, koobs: mfc-stable11?
Hardware: amd64
OS: Any

Description Denis Ahrens 2019-02-05 13:54:02 UTC
my machine uses an igb interface for pppoe to a dsl-modem. after some load on the interface (max 16mbit) the link state changes to DOWN.

sometimes using "ifconfig pppoe0 down" and then "ifconfig pppoe0 up" helps but mostly not and the only solution is to reboot the machine.

every piece of the involved hardware has been changed over the years but the problem persists.

funny thing is that the link goes active if I run "ifconfig pppoe0 down". but as soon as I run "ifconfig pppoe0 up" the link state goes to inactive.

first messages in /var/log/messages when this happens:

Feb  4 16:20:14 monolith kernel: igb3: TX(0) desc avail = 42, pidx = 876
Feb  4 16:20:15 monolith kernel: pppoe0: link state changed to DOWN
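
A minimal sketch for pulling the fields out of the TX descriptor message quoted above, for anyone who wants to collect these events from /var/log/messages. The regular expression is derived only from the one message format shown in this report; the function name parse_tx_line is hypothetical and other driver versions may log a different format.

```python
import re

# Matches the message format quoted in this report, e.g.
#   "igb3: TX(0) desc avail = 42, pidx = 876"
# Assumption: interface names are lowercase letters followed by a unit number.
TX_RE = re.compile(
    r"(?P<ifname>[a-z]+\d+): TX\((?P<queue>\d+)\) "
    r"desc avail = (?P<avail>\d+), pidx = (?P<pidx>\d+)"
)


def parse_tx_line(line):
    """Return the fields of a TX descriptor message as a dict, or None."""
    m = TX_RE.search(line)
    if m is None:
        return None
    return {
        "ifname": m.group("ifname"),
        "queue": int(m.group("queue")),
        "avail": int(m.group("avail")),
        "pidx": int(m.group("pidx")),
    }


if __name__ == "__main__":
    sample = ("Feb  4 16:20:14 monolith kernel: "
              "igb3: TX(0) desc avail = 42, pidx = 876")
    print(parse_tx_line(sample))
```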
Comment 1 eborisch+FreeBSD 2019-06-03 16:30:01 UTC
I'm running into this as well, no pppoe involved. Everything is up and running just fine until (on Friday evening of Memorial day weekend, of course) ...

May 24 21:28:03 <host> kernel: igb0: TX(0) desc avail = 42, pidx = 574
May 24 21:28:03 <host> kernel: igb0: link state changed to DOWN
May 24 21:28:05 <host> kernel: igb0: TX(3) desc avail = 1024, pidx = 0
May 24 21:28:35 <host> syslogd: last message repeated 18 times

Can't try net/intel-em-kmod as this is an I350 card:

igb0@pci0:4:0:0:	class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet

After it decides there is no carrier (lights go out on the physical card, too), rebooting is the only way I've found to wake it back up.

Running amd64 12.0-p3 at the time. Custom kernel config, but comes up and is stable until this occurs. Doesn't *seem* to be during high traffic, but I suppose there could have been an undetected (looking at monitoring data) spike right before it happened. (This was the interface the monitoring data is sent out, so....)
Comment 2 Denis Ahrens 2019-10-24 04:08:21 UTC
it seems that r353778 - head/sys/dev/e1000 fixed my problem.

running now for 36h without a problem

will close this as fixed in a week

denis
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2019-10-24 07:39:41 UTC
Assign committer of base r353778 for comment/confirmation that the change is relevant for this issue

The commit was not marked for MFC, but it may be indicated, setting mfc-stable-* flags accordingly, in case they're needed
Comment 4 Denis Ahrens 2019-10-30 12:49:29 UTC
it happened again. but I still think it has to do with the handling of the interface state.
Comment 5 Marius Strobl freebsd_committer freebsd_triage 2019-10-30 20:52:06 UTC
(In reply to Kubilay Kocak from comment #3)

r353778 is orthogonal to this bug
Comment 6 Denis Ahrens 2020-04-27 15:32:36 UTC
some patch earlier this year (2020) fixed something. So at least it is now possible (again) to reanimate the interface with an ifconfig igbX down/up combo. But the problem remains that the interface is not usable when it happens, and with no other access to the machine it is offline.
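
A rough sketch of how the down/up combo could be automated for a machine with no other access. It assumes the "status:" line that FreeBSD's ifconfig prints for Ethernet interfaces ("active" or "no carrier"); the function names, the 30 second poll interval and the 2 second pause are arbitrary assumptions, and the script must run as root. It will also bounce the interface when the cable is simply unplugged, so treat it as a stopgap rather than a fix.

```python
import re
import subprocess
import sys
import time


def link_status(ifconfig_output):
    """Return the text after 'status:' in ifconfig output, or None."""
    m = re.search(r"^\s*status:\s*(.+?)\s*$", ifconfig_output, re.MULTILINE)
    return m.group(1) if m else None


def bounce(ifname):
    """Run 'ifconfig <if> down' followed by 'ifconfig <if> up'."""
    subprocess.run(["ifconfig", ifname, "down"], check=True)
    time.sleep(2)  # arbitrary pause between down and up (assumption)
    subprocess.run(["ifconfig", ifname, "up"], check=True)


def watch(ifname, interval=30):
    """Poll the interface and bounce it whenever it reports no carrier."""
    while True:
        out = subprocess.run(["ifconfig", ifname], capture_output=True,
                             text=True, check=True).stdout
        if link_status(out) == "no carrier":
            bounce(ifname)
        time.sleep(interval)


if __name__ == "__main__":
    # Usage: python3 linkwatch.py igb0   (script name is hypothetical)
    watch(sys.argv[1])
```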
Comment 7 ports.maintainer 2020-04-28 01:46:49 UTC
I believe this is still recurring, even on 12.1-R, and that pf is involved somehow. Every few weeks, a server will go offline.  When I bring up the management console, the link light is off and the console is full of errors like this:

igb1: Watchdog timeout (TX: 2 desc avail: 42 pidx: 504) -- resetting
igb1: Watchdog timeout (TX: 2 desc avail: 1024 pidx: 0) -- resetting

With the first occurring once, then the second repeating every 2-3 seconds.

If I ifconfig down the interface, the watchdog timeout errors cease and the link light comes on.  If I then ifconfig up the interface the link light goes off again and the watchdog timeout errors resume.

Thus far, only rebooting fixes this state.
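
To make the "first message once, second message repeating" pattern easier to spot in a long console log, here is a small sketch that counts watchdog messages per descriptor state. The regular expression is based only on the two lines quoted above; the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Matches the 12.1-style message quoted above, e.g.
#   "igb1: Watchdog timeout (TX: 2 desc avail: 42 pidx: 504) -- resetting"
WATCHDOG_RE = re.compile(
    r"(?P<ifname>[a-z]+\d+): Watchdog timeout "
    r"\(TX: (?P<queue>\d+) desc avail: (?P<avail>\d+) pidx: (?P<pidx>\d+)\)"
)


def summarize(lines):
    """Count watchdog messages per (ifname, queue, avail, pidx) tuple."""
    counts = Counter()
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            key = (m.group("ifname"), int(m.group("queue")),
                   int(m.group("avail")), int(m.group("pidx")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python3 watchdog_summary.py < console.log   (names hypothetical)
    for key, n in summarize(sys.stdin).items():
        print(key, n)
```

A single entry with a partly used ring followed by a large count for avail 1024 / pidx 0 would match the behaviour described here.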

Anecdote: this behaviour seems to correlate with whether or not pf is in use.  I have several machines with this exact hardware (Supermicro X11SSH-LN4F); some run pf, some don't. The behaviour occurs on every machine running pf, and on none of the ones not running pf.  It wasn't occurring on two machines for months, then pf was enabled, and the behaviour started.  It was occurring on one of the machines running pf, then pf was disabled, and after months it has not recurred.

I'm posting this here because the observation--the paradoxical if up/down behaviour fixed only with a reboot--is the same, and I have machines running 12.x that do it, suggesting this has only been partially masked.
Comment 8 Denis Ahrens 2020-07-24 18:47:52 UTC
on my machine pf is not enabled. also it seems that incoming traffic is triggering the problem.
Comment 9 Denis Ahrens 2021-02-17 06:20:05 UTC
this supermicro motherboard has a production error which caused the problem
Comment 10 mack 2021-04-21 18:04:23 UTC
Same problem here from time to time on a four port realtek card (re0, re1 currently used).

The machine is used as a router and it has now happened three times that the interfaces somehow get shut off. I have not seen what the real problem is, as we had to get it up and running again very quickly; an immediate reboot helped in all three cases.

link state changes to DOWN / UP happens once or twice every minute, when we cannot receive any packets on both interfaces, even though only one interface is logged with this problem.

This switching down and up again happens every now and then without further impact. The time between the DOWN and UP messages is quite consistently four seconds. It seems to be triggered by a watchdog timeout (kernel: re0: watchdog timeout).
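
A small sketch for measuring the time between DOWN and UP messages from a syslog file, in case someone wants to check the four second figure on their own machine. It assumes the standard "link state changed to UP/DOWN" kernel lines as they appear elsewhere in this report; the function name outage_seconds is hypothetical, and since syslog timestamps carry no year, a placeholder year is used (so a log spanning New Year is not handled).

```python
import re
from datetime import datetime

# Example line: "Feb  4 16:20:15 monolith kernel: pppoe0: link state changed to DOWN"
LINK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<ifname>\w+): link state changed to (?P<state>UP|DOWN)"
)


def outage_seconds(lines, ifname):
    """Return the duration in seconds of each DOWN -> UP pair for ifname."""
    down_at = None
    outages = []
    for line in lines:
        m = LINK_RE.match(line)
        if not m or m.group("ifname") != ifname:
            continue
        # Placeholder year 2000: syslog lines do not include the year.
        when = datetime.strptime("2000 " + m.group("stamp"),
                                 "%Y %b %d %H:%M:%S")
        if m.group("state") == "DOWN":
            down_at = when
        elif down_at is not None:
            outages.append((when - down_at).total_seconds())
            down_at = None
    return outages


if __name__ == "__main__":
    import sys
    # Usage: python3 flaps.py re0 < /var/log/messages   (script name hypothetical)
    print(outage_seconds(sys.stdin, sys.argv[1]))
```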

Maybe the problem gets triggered by (too?) much network traffic, as the network traffic was probably higher during those three cases.

Btw.: someone mentioned similar problems when pf was enabled and no problems when pf was disabled. We are running ipfilter on that machine, so it might be possible that filtering and high traffic promote these problems, regardless of the network card used.
Comment 11 Kevin Bowling freebsd_committer freebsd_triage 2021-04-21 19:35:18 UTC
(In reply to mack from comment #10)
This is a different issue.  The realtek driver in FreeBSD is missing a lot of critical workarounds for hardware issues, which is why you see the watchdog.  You can try the "net/realtek-re-kmod" driver in ports which contains some of the workarounds needed and is tested by the vendor.

I am interested in improving re(4), but to set expectations: these are colossal tasks that cover 60 chipset variations, none of which are flawless; I don't have docs, and I have to sneak this in around a day job.  If you can provide more info, like the pciconf output, that would make for a useful bug report (in a new PR, not this one, which is not related).

If you rely on the system in a material way I would switch to a vendor supported NIC, like Chelsio T4+, Mellanox CX4+, or intel ixgbe+.  Chelsio has twisted pair options.
Comment 12 mack 2021-04-22 08:23:03 UTC
Created a new bug report, see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255319