Summary: | igb(4): Interfaces fail to switch active to inactive state | ||
---|---|---|---|
Product: | Base System | Reporter: | ncrogers |
Component: | kern | Assignee: | Marius Strobl <marius> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | CC: | afedorov, bugzilla.freebsd, egypcio, emaste, erj, freebsd, jboman, julien, karl, krzysztof.galazka, ltning-freebsd, marius.halden, natalino.picone, ncrogers, net, pi, rafael.faria, shurd, sigsys, smh, snow, webmaster |
Priority: | --- | Keywords: | IntelNetworking, regression |
Version: | 12.1-RELEASE | Flags: | koobs: mfc-stable12+ |
Hardware: | amd64 | ||
OS: | Any | ||
URL: | https://reviews.freebsd.org/D21769 | ||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=228556 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240 | ||
Bug Depends on: | |||
Bug Blocks: | 240700 |
Description
ncrogers
2019-03-22 21:50:11 UTC
Is there anything else I can do to make squashing this bug easier? I seem to recall a way to enable debug mode on an interface, but I can't figure out how to do this post-iflib.

FWIW this is also broken in the latest 13-CURRENT snapshot. FreeBSD 13.0-CURRENT r345355 GENERIC

Set version to the earliest version the issue was observed in.

Curious if anyone else on the CC list is having the same problem or not?

(In reply to ncrogers from comment #4) Maybe. I had a very odd thing happen the other day; my PCEngines gateway/firewall machine, which has two of these interfaces in it, "disappeared" off the net. I wasn't where the box was, so I couldn't physically check it from the console. I was forced to have an untrained person reset it, and it came right back up. BUT -- if the interface flapped and the upper levels got it wrong, well..... guess what -- no packets for you, which is exactly what it looked like.

(In reply to ncrogers from comment #4) Yes, seeing the problem in FreeBSD 12.0-RELEASE-p5 (amd64) w/ an I210 interface.

BTW I've just updated the box (source build) to a pretty-current rev so we'll see if the problem is gone.....

Stumbled my way here because I've also just encountered this in 12.0-Rp7, also on PCEngines hardware. Happy to test fixes or provide additional information.

This looks similar to what I reported in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240. After unplugging the LAN cable, ifconfig shows no notification of the link dropping. Furthermore, leaving the cable unplugged for any extended time makes the interface unusable, throwing "igb0: TX(0) desc avail = 1024, pidx = 0" or "ix0: TX(0) desc avail = 1024, pidx = 0". The interface will not resume working until a reboot.

Should be fixed with this patch: https://reviews.freebsd.org/D21769

D21769 appears to fix this for me. Thanks!

(In reply to Krzysztof Galazka from comment #10) D21769 fixes this for me as well. Thank you!

@Krzysztof Could you please add re@f.o to the review as a blocking reviewer to get approval to merge this to releng/12.1 after the stable/12 merge, so that this makes it to 12.1-RELEASE? Thanks

*** Bug 240658 has been marked as a duplicate of this bug. ***

(In reply to Kubilay Kocak from comment #13) Do you mean "releng"? I don't see an re@f.o option in Phabricator when I look at editing the reviewers list.

(In reply to Eric Joyner from comment #15) Apologies Eric, ignore the request to add re@f.o to reviews, as they shouldn't be involved until a change at least makes it to head. The re@ approval process is currently entirely *after* commit/merge. The reason I had suggested that originally was that there was no attachment (patch) here to add re@f.o to for approval (we can do that with attachment flags). Given this issue blocks bug 240700, it is now on re's radar, so I'm less worried about an important issue missing 12.1-R, so.. Once this issue is committed (to head) and merged (to stable/*), you can ask re@f.o for explicit merge approval to releng/12.1 for 12.1-R inclusion.

I'd like to add that marius@'s approach in https://reviews.freebsd.org/D21924 has the same effect, from the operator's view, as the original tested D21769. Once the interface was "up", link state change is correctly detected (again tested with 82574L (em) and igb(4)s 82576, i210, i350). If the interface wasn't configured/brought up, link state changes to "active" but never back, which seems to be by design, according to that report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240818

Thanks,
-harry

A commit references this bug:

Author: marius
Date: Sun Oct 20 17:40:50 UTC 2019
New revision: 353778
URL: https://svnweb.freebsd.org/changeset/base/353778

Log:
  - In em_intr(), just call em_handle_link() instead of duplicating it.
  - In em_msix_link(), properly handle IGB-class devices after the iflib(4) conversion again by only setting EM_MSIX_LINK for the EM-class 82574 and by re-arming link interrupts unconditionally, i. e. not only in case of spurious interrupts. This fixes the interface link state change detection for the IGB-class. [1]
  - In em_if_update_admin_status(), only re-arm the link state change interrupt for 82574 and also only if such a device uses MSI-X, i. e. takes advantage of autoclearing. In case of INTx and MSI as well as for LEM- and IGB-class devices, re-arming isn't appropriate here and setting EM_MSIX_LINK isn't either. While at it, consistently take advantage of the hw variable.

  PR: 236724 [1]
  Differential Revision: https://reviews.freebsd.org/D21924

Changes:
  head/sys/dev/e1000/if_em.c

A commit references this bug:

Author: marius
Date: Thu Oct 24 14:18:06 UTC 2019
New revision: 354021
URL: https://svnweb.freebsd.org/changeset/base/354021

Log:
  MFC: r353778
  - In em_intr(), just call em_handle_link() instead of duplicating it.
  - In em_msix_link(), properly handle IGB-class devices after the iflib(4) conversion again by only setting EM_MSIX_LINK for the EM-class 82574 and by re-arming link interrupts unconditionally, i. e. not only in case of spurious interrupts. This fixes the interface link state change detection for the IGB-class. [1]
  - In em_if_update_admin_status(), only re-arm the link state change interrupt for 82574 and also only if such a device uses MSI-X, i. e. takes advantage of autoclearing. In case of INTx and MSI as well as for LEM- and IGB-class devices, re-arming isn't appropriate here and setting EM_MSIX_LINK isn't either. While at it, consistently take advantage of the hw variable.

  PR: 236724 [1]
  Differential Revision: https://reviews.freebsd.org/D21924

Changes:
  _U stable/12/
  stable/12/sys/dev/e1000/if_em.c

I did see it happening in HEAD (r355121/amd64) during a silly test I was conducting while trying to reproduce bug #239240. Should we reopen this issue?

Components used to reproduce it:
- physical hardware (https://www.dell.com/en-us/work/shop/povw/poweredge-r440);
- Intel i350 gigabit network card [igb] (class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x1521 subvendor=0x8086 subdevice=0x0002);
- live image (https://download.freebsd.org/ftp/snapshots/amd64/amd64/ISO-IMAGES/13.0/FreeBSD-13.0-CURRENT-amd64-20191127-r355121-mini-memstick.img).

How to reproduce:
- leave NO ethernet cables plugged into the network card's ports;
- boot the live image, and choose the 'live cd' option (log in as root, of course);
- perform an `ifconfig` to get the actual status/options of all igb interfaces;
- plug a cable into the igb interface of your choice (the other end must be connected to a switch or anything that can trigger layer1 activity);
- perform an `ifconfig` to get the actual status/options of all igb interfaces;
- unplug the cable;
- perform an `ifconfig` to get the actual status/options of all igb interfaces.

NOTE: after performing an `ifconfig igb0 up` and testing it all again, the physical link status is reported correctly.

* igb0 is the only one showing 'WOL_MAGIC', but that may be a BIOS setting I should check shortly.

=====
igb0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether b4:96:91:62:ef:92
	media: Ethernet autoselect
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
igb1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether b4:96:91:62:ef:93
	media: Ethernet autoselect
	status: no carrier
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
=====

MORE ABOUT IT: the odd thing is that a "status:" line is shown for all Intel interfaces even before the very first `ifconfig igb0 up`. For the onboard Broadcom interfaces there is no such thing (all of them stay DOWN and display no "status:" line before `ifconfig bge0 up` is performed).

=====
bge0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
	ether 4c:d9:8f:8f:11:9a
	media: Ethernet autoselect
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
	ether 4c:d9:8f:8f:11:9b
	media: Ethernet autoselect
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
=====

BTW, I also tested stable/12 and releng/12.1 live images! Same behavior.

(In reply to Vinícius Zavam from comment #21) Should this be tagged to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240700 as well? PS: my bad on the 'BIOS settings' comment - the Intel card is not an onboard one, so there are no options in the BIOS to play with for wake-on-lan.

I am reopening this one and adding it to the META PR; please check the "See Also" reports as well.

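The plug/unplug check in the reproduction steps above compares the "status:" line of `ifconfig` output by eye. A minimal sketch of pulling that value out in a script follows; the function name `link_status` is hypothetical, and it assumes the usual `ifconfig` layout with one indented "status:" line per interface.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first "status:" line found in
# ifconfig-style output read from stdin. Prints "none" when the output has no
# status line at all, as seen for bge0/bge1 above while they are down.
link_status() {
    awk '
        /^[ \t]*status:/ {
            sub(/^[ \t]*status:[ \t]*/, "")   # strip the label, keep the value
            print
            found = 1
            exit
        }
        END { if (!found) print "none" }
    '
}
```

It would be used as `ifconfig igb0 | link_status` after each plug or unplug step; on an affected system the value would be expected to stay the same after the cable is pulled.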
(In reply to Vinícius Zavam from comment #21) Same thing also when using stable/11 or releng/11.3. Was this thing *ALWAYS* behaving like this? As mentioned before: when setting up the interface with a regular 'ifconfig_igb0="up"' in rc.conf, its state changes behave just fine. Still looks odd.

(In reply to Vinícius Zavam from comment #24) FWIW, my systems were on 11.1 for a while where it did not happen, and then I noticed it when switching many systems over to RELEASE-12.0. I don't think it always happened, but perhaps it started somewhere in 11/stable. I am still running 12.0 with the D21769 patch, which fixed the problem for me, but it looks like some different fixes went into 12.1?

We are seeing what seems to be the same problem as Vinícius described on FreeBSD 12.1-RELEASE-p1.

I truly hope I am not testing it wrong, because after trying the same steps with a 10.4-RELEASE I got the same results. I used the very same hardware as described in 'Comment 20'. The image that I used? http://ftp-archive.freebsd.org/pub/FreeBSD-Archive/old-releases/amd64/amd64/ISO-IMAGES/10.4/FreeBSD-10.4-RELEASE-amd64-uefi-mini-memstick.img.xz [decompressed and written with 'dd' to a USB stick, of course]

From what I can see the patch from marius@ was never merged into 12.1, is that correct?

(In reply to Marius Halden from comment #28) I tried rebuilding the releng/12.1 kernel with the patch from D21924 applied. With the patch applied I've so far been unable to reproduce the issues we've been having. Maybe there should be an erratum for this?

Strongly support getting this fix out there as quickly as possible. There is very significant fallout from this, with interface failover and all sorts of other things depending on link detection being rendered useless.

(In reply to Marius Halden from comment #29) It actually looks like disabling MSI-X for the interface with a loader tunable mitigates the (most obvious) issues we have been having without patching the 12.1 kernel:

dev.igb.0.iflib.disable_msix=1

(In reply to Marius Halden from comment #31) It did not work for me, following the steps I reported in Comment 20. D21712 also did not fix it (for me, again, following the same testing steps). I used HEAD@r356310 and created a USB bootable image with the following release.conf:

#!/bin/sh
######################################################################
# WITH_DVD=
CHROOTDIR="/builder2/freebsd/scratch/head"
DOC_UPDATE_SKIP=1
KERNEL="GENERIC"
MAKE_FLAGS="-s -j4"
NODOC=
NOPORTS=
PORTS_UPDATE_SKIP=1
SRCBRANCH="base/head@rHEAD"
SVNROOT="https://svn.freebsd.org"
TARGET="amd64"
TARGET_ARCH="amd64"

(In reply to Vinícius Zavam from comment #32) The comment from me earlier is only a reference to 12.1-RELEASE. I haven't tested on anything else. From what I can see the patch I referenced is in HEAD and 12-STABLE, but has not yet been merged to releng/12.1.

(In reply to Vinícius Zavam from comment #32) The fix for this PR, i. e. link state change detection for interfaces in the up state, didn't make it into 12.1 as RC3 was cancelled, unfortunately. Disabling the use of MSI-X as described in comment 32 is a viable workaround, though. Comment 20 describes an orthogonal bug consisting in link status being reported for interfaces in the down state, while the expected behavior for an interface in this state is that no link status is reported and that - unless WOL is enabled - its PHY(s) is/are shut down. I'm closing this PR again as the regression it's about has been fixed and I won't file an EN request for the fix.

Sorry, typo; the workaround actually has been described in comment 31.

Won't disabling MSI-X destroy the performance? If so, it feels like this might be worth an EN.

(In reply to Steven Hartland from comment #36) Strongly agree.

+1 for an EN, as lagg is unable to detect when the interface is down (and thus unusable..)

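The MSI-X workaround discussed above is a loader tunable, so it would normally be set in /boot/loader.conf and take effect on the next boot. A minimal sketch of adding it without duplicating the line on repeated runs follows; the helper name `add_tunable` is hypothetical, the tunable name is the one quoted in this thread, and unit 0 (igb0) is an assumption that has to be adjusted per interface.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a loader.conf-style file unless an
# identical line is already present.
#   $1 = path to the config file (e.g. /boot/loader.conf)
#   $2 = the exact line to add
add_tunable() {
    conf=$1
    line=$2
    # grep -x matches whole lines only, -F treats the pattern as a fixed
    # string, -q suppresses output; a missing file counts as "not present".
    if ! grep -qxF -- "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Example invocation (as root; unit number 0 is an assumption):
#   add_tunable /boot/loader.conf 'dev.igb.0.iflib.disable_msix="1"'
```

As raised in the comments above, turning MSI-X off may cost performance, so this is a stopgap until a fixed kernel is installed rather than a permanent setting.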
12.1 EN is planned for the next batch, queued as EN-20:09.

It looks like this patch has not yet been included in the 12.1 releng release. Any idea when it will be included in 12.1 releng?

(In reply to Natalino Picone from comment #41) It's this patch: https://www.freebsd.org/security/advisories/FreeBSD-EN-20:09.igb.asc

I got the same problem with FreeBSD 13.0p3. device = 'I350 Gigabit Network Connection'. It's a 4-port network card, igb0 to igb3. After unplugging the network cable, the interface does not change its status. Was this patch applied to 13.0 too?