Summary: | igb(4): Ethernet interface loses active link state | ||
---|---|---|---|
Product: | Base System | Reporter: | Denis Ahrens <denis> |
Component: | kern | Assignee: | freebsd-net (Nobody) <net> |
Status: | Closed Not A Bug | ||
Severity: | Affects Only Me | CC: | eborisch+FreeBSD, kbowling, kirill, mack, ncrogers, net, ports.maintainer |
Priority: | --- | Keywords: | IntelNetworking |
Version: | 12.1-STABLE | Flags: | koobs: mfc-stable12?, koobs: mfc-stable11? |
Hardware: | amd64 | ||
OS: | Any |
Description
Denis Ahrens
2019-02-05 13:54:02 UTC
I'm running into this as well, no PPPoE involved. Everything is up and running just fine until (on Friday evening of Memorial Day weekend, of course) ...

May 24 21:28:03 <host> kernel: igb0: TX(0) desc avail = 42, pidx = 574
May 24 21:28:03 <host> kernel: igb0: link state changed to DOWN
May 24 21:28:05 <host> kernel: igb0: TX(3) desc avail = 1024, pidx = 0
May 24 21:28:35 <host> syslogd: last message repeated 18 times

Can't try net/intel-em-kmod as this is an I350 card:

igb0@pci0:4:0:0: class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor = 'Intel Corporation'
    device = 'I350 Gigabit Network Connection'
    class = network
    subclass = ethernet

After it decides there is no carrier (the lights go out on the physical card, too), rebooting is the only way I've found to wake it back up. Running amd64 12.0-p3 at the time. Custom kernel config, but it comes up and is stable until this occurs. It doesn't *seem* to happen during high traffic, but I suppose there could have been an undetected (looking at monitoring data) spike right before it happened. (This was the interface the monitoring data is sent out on, so....)

It seems that r353778 - head/sys/dev/e1000 fixed my problem. It has been running for 36h now without a problem. I will close this as fixed in a week.

denis

Assigning the committer of base r353778 for comment/confirmation that the change is relevant to this issue. The commit was not marked for MFC, but one may be indicated; setting mfc-stable-* flags accordingly, in case they're needed.

It happened again, but I still think it has to do with the handling of the interface state.

(In reply to Kubilay Kocak from comment #3) r353778 is orthogonal to this bug.

Some patch earlier this year (2020) fixed something. So at least it is now possible (again) to reanimate the interface with an ifconfig igbx down/up combo. But the problem remains that the interface is not usable when this happens, and with no other access to it the machine is offline.
I believe this is still recurring, even on 12.1-R, and that pf is involved somehow. Every few weeks, a server will go offline. When I bring up the management console, the link light is off and the console is full of errors like this:

igb1: Watchdog timeout (TX: 2 desc avail: 42 pidx: 504) -- resetting
igb1: Watchdog timeout (TX: 2 desc avail: 1024 pidx: 0) -- resetting

The first occurs once, then the second repeats every 2-3 seconds. If I ifconfig down the interface, the watchdog timeout errors cease and the link light comes on. If I then ifconfig up the interface, the link light goes off again and the watchdog timeout errors resume. Thus far, only rebooting fixes this state.

Anecdote: this behaviour seems to correlate with whether or not pf is in use. I have several machines with this exact hardware (Supermicro X11SSH-LN4F); some run pf, some don't. The behaviour occurs on every machine running pf, and on none of the ones not running pf. It wasn't occurring on two machines for months, then pf was enabled, and the behaviour started. It was occurring on one of the machines running pf, then pf was disabled, and after months it has not recurred.

I'm posting this here because the observation (the paradoxical ifconfig up/down behaviour, fixed only with a reboot) is the same, and I have machines running 12.x that do it, suggesting this has only been partially masked.

On my machine pf is not enabled. Also, it seems that incoming traffic is triggering the problem.

This Supermicro motherboard has a production error which caused the problem.

Same problem here from time to time on a four-port Realtek card (re0 and re1 currently in use). The machine is used as a router, and it has now happened three times that the interfaces somehow get shut off. We have not seen what the real problem is, as we had to get it up and running again very quickly, so an immediate reboot helped in all three cases.
When we cannot receive any packets on either interface, the link state changes to DOWN / UP once or twice every minute, even though only one interface is logged with this problem. This switching down and up also happens every now and then without further impact. The time between the DOWN and UP messages is quite consistently four seconds. It seems to be triggered by a watchdog timeout (kernel: re0: watchdog timeout). Maybe the problem gets triggered by (too?) much network traffic, as the network traffic was probably higher during those three cases. Btw.: someone mentioned similar problems when pf was enabled and no problems when pf was disabled. We are running ipfilter on that machine, so it might be possible that filtering and high traffic promote these problems, regardless of the network card used.

(In reply to mack from comment #10) This is a different issue. The Realtek driver in FreeBSD is missing a lot of critical workarounds for hardware issues, which is why you see the watchdog. You can try the "net/realtek-re-kmod" driver in ports, which contains some of the workarounds needed and is tested by the vendor. I am interested in improving re(4), but to set expectations: these are colossal tasks that cover 60 chipset variations, none of them are flawless, I don't have docs, and I have to sneak this in around a day job. If you can provide more info, like the pciconf output, that would make for a useful bug report (in a new PR, not this one, which is not related). If you rely on the system in a material way, I would switch to a vendor-supported NIC, like Chelsio T4+, Mellanox CX4+, or Intel ixgbe+. Chelsio has twisted-pair options.

Created a new bug report, see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255319 |
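For anyone trying to correlate these events with traffic, pf, or ipfilter activity, it may help to pull the relevant kernel messages out of the system log and count them per interface. The following is only a rough sketch, not part of any FreeBSD tool: the regular expressions are modelled on the igb(4) and re(4) messages quoted in this thread and may not match other driver versions, the function name `summarize` is made up for illustration, and the log location `/var/log/messages` is an assumption about the local syslog setup.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this thread.
# They are assumptions about the message format, not a driver-defined API.
PATTERNS = {
    "link_down": re.compile(r"\b([a-z]+\d+): link state changed to DOWN"),
    "link_up": re.compile(r"\b([a-z]+\d+): link state changed to UP"),
    "watchdog": re.compile(r"\b([a-z]+\d+): watchdog timeout", re.IGNORECASE),
    "tx_desc": re.compile(r"\b([a-z]+\d+): TX\(\d+\) desc avail = \d+, pidx = \d+"),
}


def summarize(lines):
    """Count link-state, watchdog, and TX descriptor messages per interface.

    Returns a Counter keyed by (interface, event_kind).
    """
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), kind)] += 1
                break  # each line is counted under at most one kind
    return counts


if __name__ == "__main__":
    # Assumed log location; adjust to the local syslog configuration.
    with open("/var/log/messages", errors="replace") as log:
        for (iface, kind), n in sorted(summarize(log).items()):
            print(f"{iface}\t{kind}\t{n}")
```

Note that syslogd collapses repeats into "last message repeated N times" lines, which this sketch does not expand, so the counts are a lower bound.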