Bug 235524 - igb(4): Ethernet interface loses active link state
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2019-02-05 13:54 UTC by Denis Ahrens
Modified: 2020-07-24 18:47 UTC
CC List: 5 users

See Also:
koobs: mfc-stable12?
koobs: mfc-stable11?


Description Denis Ahrens 2019-02-05 13:54:02 UTC
my machine uses an igb interface for pppoe to a dsl-modem. after some load on the interface (max 16 Mbit/s) the link state changes to DOWN.

sometimes using "ifconfig pppoe0 down" and then "ifconfig pppoe0 up" helps but mostly not and the only solution is to reboot the machine.

Every piece of the hardware involved has changed over the years, but the problem persists.

funny thing is that the link goes active if I run "ifconfig pppoe0 down". but as soon as I run "ifconfig pppoe0 up" the link state goes to inactive.

first messages in /var/log/messages when this happens:

Feb  4 16:20:14 monolith kernel: igb3: TX(0) desc avail = 42, pidx = 876
Feb  4 16:20:15 monolith kernel: pppoe0: link state changed to DOWN
Comment 1 eborisch+FreeBSD 2019-06-03 16:30:01 UTC
I'm running into this as well, no pppoe involved. Everything is up and running just fine until (on Friday evening of Memorial day weekend, of course) ...

May 24 21:28:03 <host> kernel: igb0: TX(0) desc avail = 42, pidx = 574
May 24 21:28:03 <host> kernel: igb0: link state changed to DOWN
May 24 21:28:05 <host> kernel: igb0: TX(3) desc avail = 1024, pidx = 0
May 24 21:28:35 <host> syslogd: last message repeated 18 times

Can't try net/intel-em-kmod as this is an I350 card:

igb0@pci0:4:0:0:	class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet

After it decides there is no carrier (lights go out on physical card, too) rebooting is the only way I've found to wake it back up.

Running amd64 12.0-p3 at the time. Custom kernel config, but it comes up and is stable until this occurs. It doesn't *seem* to happen during high traffic, but I suppose there could have been a spike that went undetected in the monitoring data right before it happened. (This was the interface the monitoring data is sent out on, so...)
Comment 2 Denis Ahrens 2019-10-24 04:08:21 UTC
it seems that r353778 - head/sys/dev/e1000 fixed my problem.

running now for 36h without a problem

will close this as fixed in a week

denis
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2019-10-24 07:39:41 UTC
Assign committer of base r353778 for comment/confirmation that the change is relevant for this issue

The commit was not marked for MFC, but an MFC may be indicated. Setting mfc-stable-* flags accordingly, in case they're needed
Comment 4 Denis Ahrens 2019-10-30 12:49:29 UTC
it happened again. but I still think it has to do with the handling of the interface state.
Comment 5 Marius Strobl freebsd_committer 2019-10-30 20:52:06 UTC
(In reply to Kubilay Kocak from comment #3)

r353778 is orthogonal to this bug
Comment 6 Denis Ahrens 2020-04-27 15:32:36 UTC
some patch earlier this year (2020) fixed something. So at least it is now possible (again) to reanimate the interface with an ifconfig igbX down/up combo. But the problem remains that the interface is unusable when it happens, and with no other way into the machine it stays offline.
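
As a stopgap for the offline-machine case, the down/up combo could be run automatically, for example from cron. A rough Python sketch, assuming the stock ifconfig output contains a "status:" line that reads "active" when the link is up. The function names and the igb0 interface name are hypothetical, it needs root, and other reports in this bug say that down/up does not recover the link on their systems, so this is not a fix.

```python
import re
import subprocess


def link_is_active(ifconfig_output):
    """Return True if the ifconfig text has a 'status: active' line."""
    m = re.search(r"^\s*status:\s*(.+)$", ifconfig_output, re.MULTILINE)
    return m is not None and m.group(1).strip() == "active"


def bounce(ifname):
    """Run 'ifconfig <ifname> down' followed by 'ifconfig <ifname> up'."""
    subprocess.run(["ifconfig", ifname, "down"], check=True)
    subprocess.run(["ifconfig", ifname, "up"], check=True)


def check_and_bounce(ifname):
    """Bounce the interface if its link is not active.

    Returns True if a bounce was attempted, False if the link was fine.
    """
    out = subprocess.run(
        ["ifconfig", ifname], capture_output=True, text=True, check=True
    ).stdout
    if link_is_active(out):
        return False
    bounce(ifname)
    return True


# Example (hypothetical interface name, run as root):
#   check_and_bounce("igb0")
```

Only link_is_active() is free of side effects; check_and_bounce() really does take the interface down, so it should not be run over the same link you are logged in through.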
Comment 7 Rin Morningstar 2020-04-28 01:46:49 UTC
I believe this is still recurring, even on 12.1-R, and that pf is involved somehow. Every few weeks, a server will go offline.  When I bring up the management console, the link light is off and the console is full of errors like this:

igb1: Watchdog timeout (TX: 2 desc avail: 42 pidx: 504) -- resetting
igb1: Watchdog timeout (TX: 2 desc avail: 1024 pidx: 0) -- resetting

With the first occurring once, then the second repeating every 2-3 seconds.
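
For anyone grepping logs for this, here is a minimal Python sketch that pulls the interface, TX queue, free descriptor count and pidx out of both message formats quoted in this bug. The regular expressions are written only from the sample lines in this report and may not match other driver versions; parse_tx_hang is a made-up helper name.

```python
import re

# Older wording, e.g. "igb3: TX(0) desc avail = 42, pidx = 876"
OLD_FORMAT = re.compile(
    r"(?P<ifname>igb\d+): TX\((?P<queue>\d+)\) "
    r"desc avail = (?P<avail>\d+), pidx = (?P<pidx>\d+)"
)

# Newer wording, e.g.
# "igb1: Watchdog timeout (TX: 2 desc avail: 42 pidx: 504) -- resetting"
NEW_FORMAT = re.compile(
    r"(?P<ifname>igb\d+): Watchdog timeout \(TX: (?P<queue>\d+) "
    r"desc avail: (?P<avail>\d+) pidx: (?P<pidx>\d+)\)"
)


def parse_tx_hang(line):
    """Return a dict describing a TX hang message, or None if no match."""
    for pattern in (OLD_FORMAT, NEW_FORMAT):
        m = pattern.search(line)
        if m:
            return {
                "ifname": m.group("ifname"),
                "queue": int(m.group("queue")),
                "avail": int(m.group("avail")),
                "pidx": int(m.group("pidx")),
            }
    return None
```

Feeding it the log lines one at a time makes it easy to spot the pattern described here: one message with a partly used ring, followed by repeats with avail = 1024 and pidx = 0.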

If I ifconfig down the interface, the watchdog timeout errors cease and the link light comes on.  If I then ifconfig up the interface the link light goes off again and the watchdog timeout errors resume.

Thus far, only rebooting fixes this state.

Anecdote: this behaviour seems to correlate with whether or not pf is in use.  I have several machines with this exact hardware (Supermicro X11SSH-LN4F); some run pf, some don't. The behaviour occurs on every machine running pf, and on none of the ones not running pf.  It wasn't occurring on two machines for months, then pf was enabled, and the behaviour started.  It was occurring on one of the machines running pf, then pf was disabled, and after months it has not recurred.

I'm posting this here because the observation--the paradoxical if up/down behaviour fixed only with a reboot--is the same, and I have machines running 12.x that do it, suggesting this has only been partially masked.
Comment 8 Denis Ahrens 2020-07-24 18:47:52 UTC
on my machine pf is not enabled. also it seems that incoming traffic is triggering the problem.