Bug 211219 - NIC status does not pass into a state of "no carrier" after disconnecting the cable.
Summary: NIC status does not pass into a state of "no carrier" after disconnecting the...
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 10.3-RELEASE
Hardware: i386 Any
: --- Affects Some People
Assignee: freebsd-net
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2016-07-19 10:19 UTC by Alexey
Modified: 2017-06-19 16:28 UTC (History)
12 users (show)

See Also:


Attachments
link state fix for 10.3 / 11.0 (1.01 KB, patch)
2017-02-16 14:56 UTC, Franco Fichtner
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey 2016-07-19 10:19:02 UTC
NIC status does not pass into a state of "no carrier" after disconnecting the cable. 

Built into the motherboard cards change their state is normal, but the cards connected through PCI-Express go to a state "active" when connecting the cable, but do not go into a state of "no carrier" when the cable is disconnected.

Motherboard:  Supermicro X10SRi-F
NICs: Intel EXPI9301CT Gigabit Desktop Adapter 10/100/1000Mbps, PCIEx1

uname -a:
FreeBSD gw3 10.3-RELEASE FreeBSD 10.3-RELEASE #0: Thu Jul 14 17:35:30 NOVT 2016     avs@gw3:/usr/obj/usr/src/sys/pfaltq  i386
Comment 1 Alexey 2016-07-19 10:23:11 UTC
This commands turn status to "no carrier" after disconnecting cable: 
ifconfig em0 down & ifconfig em0 up
Comment 2 Michał Jędrzejczak 2016-09-12 05:54:54 UTC
(In reply to Alexey from comment #1)
Have you configured this card in rc.conf ? 

If no please check that with configured card ( ip, netmask for this interface in rc.conf ). I saw this same problem but on unconfigured interface.
Comment 3 Alexey 2016-09-12 15:02:27 UTC
Yes, this card has been configured in the rc.conf
Comment 4 Jeff Pieper 2016-09-15 22:50:29 UTC
I can reproduce this on 10.3-RELEASE (em-7.6.1-k). There is probably a link interrupt check that is missing. Currently investigating.
Comment 5 Eric Joyner freebsd_committer 2016-09-21 18:25:54 UTC
(In reply to Jeff Pieper from comment #4)

I think this is an issue only in MSI-X mode; I tried an 82754L in MSI mode and unplugging/plugging in the cable behaved as expected.

We should check the IVAR, IMS, and other interrupt-cause-related registers, since it appears the link interrupt isn't even being generated.
Comment 6 CS 2016-09-28 20:13:28 UTC
I think I experience the same using pfsense 2.3.2-RELEASE FreeBSD 10.3-RELEASE-p5 on a Soekris 6501 board (i386). In addition to that, after connecting the cable back after a few hours or in general when the link state changes to UP the device connected to that port has no network connectivity until I reboot pfSense. More details can be found here: hxxps://redmine.pfsense.org/issues/6823
Comment 7 Franco Fichtner 2016-09-28 20:21:59 UTC
We had the same issue in OPNsense, so we created a FreeBSD port of the stock Intel driver version 7.6.2, which works fine.

It would help others to mitigate the issue by getting into the tree.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212828


Cheers,
Franco
Comment 8 Franco Fichtner 2016-12-05 07:23:01 UTC
Hi all,

Last time we checked this also affected 11.0.  Unfortunately, there multiple issues at play here:

The FreeBSD em(4) driver under 10.3+ has this bug, and the Intel 7.6.2 driver fixes this on 10.3.  But the Intel driver does not fully work under 11.0 -- it has issues with netmap(4) (e.g. using Suricata) there.

The link issue in particular also manifests in several VM deployments that we heard of through OPNsense, where widespread emulation of e1000 is an integral part of daily operation and required to be reliable.  The only fix there was to install the Intel driver, but obviously this doesn't fully work on 11.0 anymore.

So we were hoping to see some progress in either:

Getting the FreeBSD 10.3+ regression ironed out,

and/or

Seeing an Intel driver bump for reliable netmap(4) operation under 11.0.

Bottom line is can we help in any way to speed this up?


Thanks,
Franco
Comment 9 Franco Fichtner 2016-12-15 07:36:49 UTC
To reiterate... setting the following in loader.conf works as previously suggested:

hw.em.enable_msix=0

But since this has a performance impact, is there any news here?


Thanks,
Franco
Comment 10 Daniel G 2017-01-05 14:31:57 UTC
We're trying to triage a simmilar problem, where after a (panic) reboot the NIC remains in "no carrier" state, while the switch thinks the link is online. Our nic's are using lagg with active/passive failover

Running the 10.3-RELEASE kernel (on 10,1 and 10.2 kernels we never had these issues). (Our) issue also appears to be present in the 10.3-stable branch.

FreeBSD storage1.c1 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

These are the nic's we're using:

ix0@pci0:1:0:0:	class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet

We'll try the hw.em.enable_msix=0 option and report back!
Comment 11 Franco Fichtner 2017-01-05 14:34:41 UTC
Hi Daniel,

If this is ix(4) as per your indication, you need:

hw.ix.enable_msix=0 


Cheers,
Franco
Comment 12 Daniel G 2017-01-06 13:52:24 UTC
Our issue is unrelated to this one apparently :sadpanda:
Perhaps it's best if my comments are removedd.
Comment 13 Julien Cigar 2017-01-06 14:32:19 UTC
I *think* I experienced something similar on a Soekris 6501 (amd64) recently. We have two of those boxes which serve as a redundant firewall, dhcp server, local DNS server, etc..

The DHCP server is configured in "balancing mode", with something like:

load balance max seconds 3;
mclt 1800;
split 128;

Which means (more of less) that client DHCP requests are balanced between the two. Recently the BACKUP died (still powered on, but unreachable), but for some unknown reason the clients which got their IPs from the BACKUP DHCP server just "lost" it after the lease time passed.. and didn't get a new IP from the MASTER node.. I suspect also that's because the interfaces didn't go into the "no carrier" state.

This is on 10.2-RELEASE with 82574L cards.
Comment 14 Franco Fichtner 2017-02-16 14:56:17 UTC
Created attachment 180048 [details]
link state fix for 10.3 / 11.0

Yes, the issue is also present on soekris boxes, but not on 10.2.  The bug was introduced in 10.3, now present in 11.0 and 12-CURRENT.

We tried to fix it up for 12-CURRENT but the fix isn't complete.

I have attached a working patch as used on OPNsense 17.1.
Comment 15 Sean Bruno freebsd_committer 2017-04-10 16:06:23 UTC
As far as I can tell, this bug is fixed in 10-stable and 11-stable.  Since the code bases in 12-current and XX-stable are quite a bit different, I'm going to close this ticket out as fixed.

I don't want to lose tracking open issues here, and it sure looks like there's somewhere around 3 different issues being discussed here.  If there is still a problem with the hardware you are using, please indicate your issue in a *seperate* bug and I'll respond.
Comment 16 Franco Fichtner 2017-04-10 16:50:16 UTC
I'm not so sure.  All branches received a modified patch that came up inconclusive in testing, overshadowed by the iflib conversion and multiple build and subsystem failures due to it.  We circled around a minimal patch (note the attached patch reverts to the full Intel driver change), but the weirdness that is "adapter->ims" for just this one chipset code path still makes no sense in the grand scheme of things.

I asked for a full revert to the working Intel state via e-mail, but that didn't happen.

I don't think this is sensible for the simple reason that this was introduced in stable/10, went into 10.3 unnoticed, breaking *a lot* of embedded hardware with regard to e.g. CARP, then went into 11.0.  It took a long time to identify the commit that was responsible and unwind it (it was a batch MFC after all).

Settling for anything other than what is attached to this ticket is risky, especially if there is nobody who can test and verify this in over one and a half years.


Cheers,
Franco
Comment 17 Sean Bruno freebsd_committer 2017-04-10 22:04:54 UTC
(In reply to Franco Fichtner from comment #16)
Perhaps I'm missing something.

stable/10 and stable/11 current match the code specified in the patch requested in this bugzilla ticket.  Was there something else?
Comment 18 Franco Fichtner 2017-04-16 11:11:49 UTC
The second line in the if statement still differs in the way introduced by the original commit causing the regression:

Original Intel driver and requested in this this PR/attached commit:

ims_mask |= EM_MSIX_MASK;

Current state on all branches:

ims_mask |= adapter->ims;

In our conversations you asked me which of the two lines were needed, because the chip documentation wasn't clear.

The testing result for a good result (for two distinct devices I have) was:

if (hw->mac.type == e1000_82574) {
    E1000_WRITE_REG(hw, EM_EIAC, adapter->ims);
} 

The current FreeBSD state was changed to read this:

if (hw->mac.type == e1000_82574) {
    E1000_WRITE_REG(hw, EM_EIAC, adapter->ims);
    ims_mask |= adapter->ims;
}

Which still differs from the good tested result or the original Intel state.

Either the second line should be dropped or changed to how it reads in the Intel driver.


Cheers,
Franco
Comment 19 Sean Bruno freebsd_committer 2017-04-16 19:45:28 UTC
(In reply to Franco Fichtner from comment #18)
Is it possible that you are looking at a different source tree than I?

stable/11 currently has the following:

static void
em_enable_intr(struct adapter *adapter)
{
        struct e1000_hw *hw = &adapter->hw;
        u32 ims_mask = IMS_ENABLE_MASK;

        if (hw->mac.type == e1000_82574) {
                E1000_WRITE_REG(hw, EM_EIAC, EM_MSIX_MASK);
                ims_mask |= adapter->ims;
        }
        E1000_WRITE_REG(hw, E1000_IMS, ims_mask);
}
Comment 20 Franco Fichtner 2017-04-26 07:05:36 UTC
It's easy, look here: https://bugs.freebsd.org/bugzilla/attachment.cgi?id=180048&action=diff

The second line before:

ims_mask |= adapter->ims;

The second line after:

ims_mask |= EM_MSIX_MASK;

The regression in 10.3 and 11.0:

ims_mask |= adapter->ims;

The correct code in 10.2 and the Intel driver:

ims_mask |= EM_MSIX_MASK;


Cheers,
Franco
Comment 21 Franco Fichtner 2017-05-11 22:07:33 UTC
Did I not explain this right?  Forgive my persistence.  ;)
Comment 22 Franco Fichtner 2017-06-09 16:50:05 UTC
Dear all,

The code is still incorrect, could somebody please look at this?

The "delay" in addressing this issue is extreme given that a patch exists since February. If somebody would let me know which mistake I made I would like to be able to avoid it in the future.


Thank you,
Franco
Comment 23 Eugene Grosbein freebsd_committer 2017-06-10 15:58:39 UTC
Apparently, closed by mistake: no track of fixing commit and there is a report the problem's still here.
Comment 24 Eric Joyner freebsd_committer 2017-06-10 16:44:00 UTC
Since it appears Sean is dead/busy, I'll look at this.
Comment 25 Sean Bruno freebsd_committer 2017-06-11 18:41:59 UTC
(In reply to Eric Joyner from comment #24)
I think more dead than busy ... but perhaps both.

This definitely should be committed now that I know what Franco was trying to get me to do.
Comment 26 Eric Joyner freebsd_committer 2017-06-11 19:23:49 UTC
I don't think the posted patch applies as-is to HEAD, but I think the change is still straightforward. I can commit it.
Comment 27 Eric Joyner freebsd_committer 2017-06-11 19:43:23 UTC
Actually, looking at em_if_enable_intr() in CURRENT, do we need to have the IMS_ENABLE_MASK in 82574 in MSI-X mode? I think all of the causes that 82574 needs are covered in the Tx0/Rx0/Tx1/Rx1/Other causes that are only in 82574.
Comment 28 Eric Joyner freebsd_committer 2017-06-11 20:04:39 UTC
And sorry about the comment spam, but I think a better fix (at least on CURRENT), is to set bit 24 in adapter->ims in em_if_msix_intr_assign. It looks like that doesn't get set in IMS, so the "other" cause (that includes link state changes) never fires.
Comment 29 Eric Joyner freebsd_committer 2017-06-11 20:05:59 UTC
(I think this is the bit that EM_MSIX_MASK has that adapter->ims in the current version does not.)
Comment 30 Sean Bruno freebsd_committer 2017-06-11 21:04:42 UTC
(In reply to Eric Joyner from comment #29)
Franco is specifically requesting an update to stable/11 and the release.

Specifically, look at Comment #20
Comment 31 commit-hook freebsd_committer 2017-06-19 15:04:02 UTC
A commit references this bug:

Author: sbruno
Date: Mon Jun 19 15:03:48 UTC 2017
New revision: 320102
URL: https://svnweb.freebsd.org/changeset/base/320102

Log:
  Direct commit to stable/11 to correctly setting the EIAC and IMS
  registers to the same values when processing interrupts.  This reverts a
  change made in r286831 that was not fully reverted in r311979

  This resolves PR https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211219

  PR:		211219
  Submitted by:	Franco Fitchner <franco@opnsense.org>
  Approved by:	re (marius)

Changes:
  stable/11/sys/dev/e1000/if_em.c
Comment 32 pete 2017-06-19 15:55:02 UTC
This patch as commited to STABLE stops my em0 working entirely. hardware details:

em0@pci0:3:2:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82541PI Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
Comment 33 commit-hook freebsd_committer 2017-06-19 15:56:52 UTC
A commit references this bug:

Author: sbruno
Date: Mon Jun 19 15:56:03 UTC 2017
New revision: 320117
URL: https://svnweb.freebsd.org/changeset/base/320117

Log:
  Direct commit to stable/10 to correctly setting the EIAC and IMS
  registers to the same values when processing interrupts.

  This resolves PR https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211219

  PR:		211219
  Submitted by:	Franco Fitchner <franco@opnsense.org>

Changes:
  stable/10/sys/dev/e1000/if_em.c
Comment 34 pete 2017-06-19 16:23:58 UTC
Actually, not quyite that simple. Sometimes it works, sometimes it does not. If it does not then ifconfig em0 down and ifconfig em0 up will get it working again. The only chnage to em0 is the application of this patch, but I hadnt rebooted the machine this aftrnoon, so am going to check elsewhere for other issues.
Comment 35 Eric Joyner freebsd_committer 2017-06-19 16:28:51 UTC
(In reply to pete from comment #34)

I don't think this change is causing your problem. It only changes something in an 82574-specific conditional.