211219 – [e1000] NIC status does not pass into a state of "no carrier" after disconnecting the cable.

Bug 211219 - [e1000] NIC status does not pass into a state of "no carrier" after disconnecting the cable.

Summary: [e1000] NIC status does not pass into a state of "no carrier" after disconnec...

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin (show other bugs)
Version:	10.3-RELEASE
Hardware:	i386 Any

Importance:	--- Affects Some People
Assignee:	Kevin Bowling

URL:
Keywords:	IntelNetworking, patch

Depends on:
Blocks:

Reported:	2016-07-19 10:19 UTC by Alexey
Modified:	2021-05-28 06:42 UTC (History)
CC List:	12 users (show)

See Also:

Flags:	kbowling: mfc-stable13+ kbowling: mfc-stable12+ kbowling: mfc-stable11-

Attachments
link state fix for 10.3 / 11.0 (1.01 KB, patch) 2017-02-16 14:56 UTC, Franco Fichtner	no flags	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Alexey 2016-07-19 10:19:02 UTC

NIC status does not pass into a state of "no carrier" after disconnecting the cable. 

Built into the motherboard cards change their state is normal, but the cards connected through PCI-Express go to a state "active" when connecting the cable, but do not go into a state of "no carrier" when the cable is disconnected.

Motherboard:  Supermicro X10SRi-F
NICs: Intel EXPI9301CT Gigabit Desktop Adapter 10/100/1000Mbps, PCIEx1

uname -a:
FreeBSD gw3 10.3-RELEASE FreeBSD 10.3-RELEASE #0: Thu Jul 14 17:35:30 NOVT 2016     avs@gw3:/usr/obj/usr/src/sys/pfaltq  i386

Comment 1 Alexey 2016-07-19 10:23:11 UTC

This commands turn status to "no carrier" after disconnecting cable: 
ifconfig em0 down & ifconfig em0 up

Comment 2 Michał Jędrzejczak 2016-09-12 05:54:54 UTC

(In reply to Alexey from comment #1)
Have you configured this card in rc.conf ? 

If no please check that with configured card ( ip, netmask for this interface in rc.conf ). I saw this same problem but on unconfigured interface.

Comment 3 Alexey 2016-09-12 15:02:27 UTC

Yes, this card has been configured in the rc.conf

Comment 4 Jeff Pieper 2016-09-15 22:50:29 UTC

I can reproduce this on 10.3-RELEASE (em-7.6.1-k). There is probably a link interrupt check that is missing. Currently investigating.

Comment 5 Eric Joyner freebsd_committer

2016-09-21 18:25:54 UTC

(In reply to Jeff Pieper from comment #4)

I think this is an issue only in MSI-X mode; I tried an 82754L in MSI mode and unplugging/plugging in the cable behaved as expected.

We should check the IVAR, IMS, and other interrupt-cause-related registers, since it appears the link interrupt isn't even being generated.

Comment 6 CS 2016-09-28 20:13:28 UTC

I think I experience the same using pfsense 2.3.2-RELEASE FreeBSD 10.3-RELEASE-p5 on a Soekris 6501 board (i386). In addition to that, after connecting the cable back after a few hours or in general when the link state changes to UP the device connected to that port has no network connectivity until I reboot pfSense. More details can be found here: hxxps://redmine.pfsense.org/issues/6823

Comment 7 Franco Fichtner 2016-09-28 20:21:59 UTC

We had the same issue in OPNsense, so we created a FreeBSD port of the stock Intel driver version 7.6.2, which works fine.

It would help others to mitigate the issue by getting into the tree.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212828


Cheers,
Franco

Comment 8 Franco Fichtner 2016-12-05 07:23:01 UTC

Hi all,

Last time we checked this also affected 11.0.  Unfortunately, there multiple issues at play here:

The FreeBSD em(4) driver under 10.3+ has this bug, and the Intel 7.6.2 driver fixes this on 10.3.  But the Intel driver does not fully work under 11.0 -- it has issues with netmap(4) (e.g. using Suricata) there.

The link issue in particular also manifests in several VM deployments that we heard of through OPNsense, where widespread emulation of e1000 is an integral part of daily operation and required to be reliable.  The only fix there was to install the Intel driver, but obviously this doesn't fully work on 11.0 anymore.

So we were hoping to see some progress in either:

Getting the FreeBSD 10.3+ regression ironed out,

and/or

Seeing an Intel driver bump for reliable netmap(4) operation under 11.0.

Bottom line is can we help in any way to speed this up?


Thanks,
Franco

Comment 9 Franco Fichtner 2016-12-15 07:36:49 UTC

To reiterate... setting the following in loader.conf works as previously suggested:

hw.em.enable_msix=0

But since this has a performance impact, is there any news here?


Thanks,
Franco

Comment 10 Daniel G 2017-01-05 14:31:57 UTC

We're trying to triage a simmilar problem, where after a (panic) reboot the NIC remains in "no carrier" state, while the switch thinks the link is online. Our nic's are using lagg with active/passive failover

Running the 10.3-RELEASE kernel (on 10,1 and 10.2 kernels we never had these issues). (Our) issue also appears to be present in the 10.3-stable branch.

FreeBSD storage1.c1 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

These are the nic's we're using:

ix0@pci0:1:0:0:	class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet

We'll try the hw.em.enable_msix=0 option and report back!

Comment 11 Franco Fichtner 2017-01-05 14:34:41 UTC

Hi Daniel,

If this is ix(4) as per your indication, you need:

hw.ix.enable_msix=0 


Cheers,
Franco

Comment 12 Daniel G 2017-01-06 13:52:24 UTC

Our issue is unrelated to this one apparently :sadpanda:
Perhaps it's best if my comments are removedd.

Comment 13 Julien Cigar 2017-01-06 14:32:19 UTC

I *think* I experienced something similar on a Soekris 6501 (amd64) recently. We have two of those boxes which serve as a redundant firewall, dhcp server, local DNS server, etc..

The DHCP server is configured in "balancing mode", with something like:

load balance max seconds 3;
mclt 1800;
split 128;

Which means (more of less) that client DHCP requests are balanced between the two. Recently the BACKUP died (still powered on, but unreachable), but for some unknown reason the clients which got their IPs from the BACKUP DHCP server just "lost" it after the lease time passed.. and didn't get a new IP from the MASTER node.. I suspect also that's because the interfaces didn't go into the "no carrier" state.

This is on 10.2-RELEASE with 82574L cards.

Comment 14 Franco Fichtner 2017-02-16 14:56:17 UTC

Created attachment 180048 [details]
link state fix for 10.3 / 11.0

Yes, the issue is also present on soekris boxes, but not on 10.2.  The bug was introduced in 10.3, now present in 11.0 and 12-CURRENT.

We tried to fix it up for 12-CURRENT but the fix isn't complete.

I have attached a working patch as used on OPNsense 17.1.

Comment 15 Sean Bruno freebsd_committer

2017-04-10 16:06:23 UTC

As far as I can tell, this bug is fixed in 10-stable and 11-stable.  Since the code bases in 12-current and XX-stable are quite a bit different, I'm going to close this ticket out as fixed.

I don't want to lose tracking open issues here, and it sure looks like there's somewhere around 3 different issues being discussed here.  If there is still a problem with the hardware you are using, please indicate your issue in a *seperate* bug and I'll respond.

Comment 16 Franco Fichtner 2017-04-10 16:50:16 UTC

I'm not so sure.  All branches received a modified patch that came up inconclusive in testing, overshadowed by the iflib conversion and multiple build and subsystem failures due to it.  We circled around a minimal patch (note the attached patch reverts to the full Intel driver change), but the weirdness that is "adapter->ims" for just this one chipset code path still makes no sense in the grand scheme of things.

I asked for a full revert to the working Intel state via e-mail, but that didn't happen.

I don't think this is sensible for the simple reason that this was introduced in stable/10, went into 10.3 unnoticed, breaking *a lot* of embedded hardware with regard to e.g. CARP, then went into 11.0.  It took a long time to identify the commit that was responsible and unwind it (it was a batch MFC after all).

Settling for anything other than what is attached to this ticket is risky, especially if there is nobody who can test and verify this in over one and a half years.


Cheers,
Franco

Comment 17 Sean Bruno freebsd_committer

2017-04-10 22:04:54 UTC

(In reply to Franco Fichtner from comment #16)
Perhaps I'm missing something.

stable/10 and stable/11 current match the code specified in the patch requested in this bugzilla ticket.  Was there something else?

Comment 18 Franco Fichtner 2017-04-16 11:11:49 UTC

The second line in the if statement still differs in the way introduced by the original commit causing the regression:

Original Intel driver and requested in this this PR/attached commit:

ims_mask |= EM_MSIX_MASK;

Current state on all branches:

ims_mask |= adapter->ims;

In our conversations you asked me which of the two lines were needed, because the chip documentation wasn't clear.

The testing result for a good result (for two distinct devices I have) was:

if (hw->mac.type == e1000_82574) {
    E1000_WRITE_REG(hw, EM_EIAC, adapter->ims);
} 

The current FreeBSD state was changed to read this:

if (hw->mac.type == e1000_82574) {
    E1000_WRITE_REG(hw, EM_EIAC, adapter->ims);
    ims_mask |= adapter->ims;
}

Which still differs from the good tested result or the original Intel state.

Either the second line should be dropped or changed to how it reads in the Intel driver.


Cheers,
Franco

Comment 19 Sean Bruno freebsd_committer

2017-04-16 19:45:28 UTC

(In reply to Franco Fichtner from comment #18)
Is it possible that you are looking at a different source tree than I?

stable/11 currently has the following:

static void
em_enable_intr(struct adapter *adapter)
{
        struct e1000_hw *hw = &adapter->hw;
        u32 ims_mask = IMS_ENABLE_MASK;

        if (hw->mac.type == e1000_82574) {
                E1000_WRITE_REG(hw, EM_EIAC, EM_MSIX_MASK);
                ims_mask |= adapter->ims;
        }
        E1000_WRITE_REG(hw, E1000_IMS, ims_mask);
}

Comment 20 Franco Fichtner 2017-04-26 07:05:36 UTC

It's easy, look here: https://bugs.freebsd.org/bugzilla/attachment.cgi?id=180048&action=diff

The second line before:

ims_mask |= adapter->ims;

The second line after:

ims_mask |= EM_MSIX_MASK;

The regression in 10.3 and 11.0:

ims_mask |= adapter->ims;

The correct code in 10.2 and the Intel driver:

ims_mask |= EM_MSIX_MASK;


Cheers,
Franco

Comment 21 Franco Fichtner 2017-05-11 22:07:33 UTC

Did I not explain this right?  Forgive my persistence.  ;)

Comment 22 Franco Fichtner 2017-06-09 16:50:05 UTC

Dear all,

The code is still incorrect, could somebody please look at this?

The "delay" in addressing this issue is extreme given that a patch exists since February. If somebody would let me know which mistake I made I would like to be able to avoid it in the future.


Thank you,
Franco

Comment 23 Eugene Grosbein freebsd_committer

2017-06-10 15:58:39 UTC

Apparently, closed by mistake: no track of fixing commit and there is a report the problem's still here.

Comment 24 Eric Joyner freebsd_committer

2017-06-10 16:44:00 UTC

Since it appears Sean is dead/busy, I'll look at this.

Comment 25 Sean Bruno freebsd_committer

2017-06-11 18:41:59 UTC

(In reply to Eric Joyner from comment #24)
I think more dead than busy ... but perhaps both.

This definitely should be committed now that I know what Franco was trying to get me to do.

Comment 26 Eric Joyner freebsd_committer

2017-06-11 19:23:49 UTC

I don't think the posted patch applies as-is to HEAD, but I think the change is still straightforward. I can commit it.

Comment 27 Eric Joyner freebsd_committer

2017-06-11 19:43:23 UTC

Actually, looking at em_if_enable_intr() in CURRENT, do we need to have the IMS_ENABLE_MASK in 82574 in MSI-X mode? I think all of the causes that 82574 needs are covered in the Tx0/Rx0/Tx1/Rx1/Other causes that are only in 82574.

Comment 28 Eric Joyner freebsd_committer

2017-06-11 20:04:39 UTC

And sorry about the comment spam, but I think a better fix (at least on CURRENT), is to set bit 24 in adapter->ims in em_if_msix_intr_assign. It looks like that doesn't get set in IMS, so the "other" cause (that includes link state changes) never fires.

Comment 29 Eric Joyner freebsd_committer

2017-06-11 20:05:59 UTC

(I think this is the bit that EM_MSIX_MASK has that adapter->ims in the current version does not.)

Comment 30 Sean Bruno freebsd_committer

2017-06-11 21:04:42 UTC

(In reply to Eric Joyner from comment #29)
Franco is specifically requesting an update to stable/11 and the release.

Specifically, look at Comment #20

Comment 31 commit-hook freebsd_committer

2017-06-19 15:04:02 UTC

A commit references this bug:

Author: sbruno
Date: Mon Jun 19 15:03:48 UTC 2017
New revision: 320102
URL: https://svnweb.freebsd.org/changeset/base/320102

Log:
  Direct commit to stable/11 to correctly setting the EIAC and IMS
  registers to the same values when processing interrupts.  This reverts a
  change made in r286831 that was not fully reverted in r311979

  This resolves PR https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211219

  PR:		211219
  Submitted by:	Franco Fitchner <franco@opnsense.org>
  Approved by:	re (marius)

Changes:
  stable/11/sys/dev/e1000/if_em.c

Comment 32 Pete French 2017-06-19 15:55:02 UTC

This patch as commited to STABLE stops my em0 working entirely. hardware details:

em0@pci0:3:2:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82541PI Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

Comment 33 commit-hook freebsd_committer

2017-06-19 15:56:52 UTC

A commit references this bug:

Author: sbruno
Date: Mon Jun 19 15:56:03 UTC 2017
New revision: 320117
URL: https://svnweb.freebsd.org/changeset/base/320117

Log:
  Direct commit to stable/10 to correctly setting the EIAC and IMS
  registers to the same values when processing interrupts.

  This resolves PR https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211219

  PR:		211219
  Submitted by:	Franco Fitchner <franco@opnsense.org>

Changes:
  stable/10/sys/dev/e1000/if_em.c

Comment 34 Pete French 2017-06-19 16:23:58 UTC

Actually, not quyite that simple. Sometimes it works, sometimes it does not. If it does not then ifconfig em0 down and ifconfig em0 up will get it working again. The only chnage to em0 is the application of this patch, but I hadnt rebooted the machine this aftrnoon, so am going to check elsewhere for other issues.

Comment 35 Eric Joyner freebsd_committer

2017-06-19 16:28:51 UTC

(In reply to pete from comment #34)

I don't think this change is causing your problem. It only changes something in an 82574-specific conditional.

Comment 36 Eitan Adler freebsd_committer

2018-05-28 19:48:27 UTC

batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.

Comment 37 Kevin Bowling freebsd_committer

2021-04-23 02:03:34 UTC

(In reply to Franco Fichtner from comment #22)
Franco, can you test this https://reviews.freebsd.org/D29943?

Comment 38 commit-hook freebsd_committer

2021-04-27 22:30:48 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=eea55de7b10808b86277d7fdbed2d05d3c6db1b2

commit eea55de7b10808b86277d7fdbed2d05d3c6db1b2
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-04-25 08:22:23 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-04-27 22:29:39 +0000

    e1000: Rework em_msi_link interrupt filter

    * Fix 82574 Link Status Changes, carrying the OTHER mask bit around as
      needed.
    * Move igb-class LSC re-arming out of FAST back into the handler.
    * Clarify spurious/other interrupt re-arms in FAST.

    In MSI-X mode, 82574 and igb-class devices use an interrupt filter to
    handle Link Status Changes. We want to do LSC re-arms in the handler
    to take advantage of autoclear (EIAC) single shot behavior.

    82574 uses 'Other' in ICR and IMS for LSC interrupt types when in MSI-X
    mode, so we need to set and re-arm the 'Other' bit during attach and
    after ICR reads in the FAST handler if not an LSC or after handling on
    LSC due to autoclearing.

    This work was primarily done to address the referenced PR, but inspired
    some clarification and improvement for igb-class devices once the
    intentions of previous bug fix attempts became clearer.

    PR:             211219
    Reported by:    Alexey <aserp3@gmail.com>
    Tested by:      kbowling (I210 lagg), markj (I210)
    Approved by:    markj
    MFC after:      1 month
    Differential Revision:  https://reviews.freebsd.org/D29943

 sys/dev/e1000/if_em.c | 40 ++++++++++++++++++++++++----------------
 sys/dev/e1000/if_em.h |  2 --
 2 files changed, 24 insertions(+), 18 deletions(-)

Comment 39 commit-hook freebsd_committer

2021-05-28 05:51:27 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1fb96c59b4ce265ea94eddef5a97c7c075ceaec5

commit 1fb96c59b4ce265ea94eddef5a97c7c075ceaec5
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-04-25 08:22:23 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-05-28 05:50:07 +0000

    e1000: Rework em_msi_link interrupt filter

    * Fix 82574 Link Status Changes, carrying the OTHER mask bit around as
      needed.
    * Move igb-class LSC re-arming out of FAST back into the handler.
    * Clarify spurious/other interrupt re-arms in FAST.

    In MSI-X mode, 82574 and igb-class devices use an interrupt filter to
    handle Link Status Changes. We want to do LSC re-arms in the handler
    to take advantage of autoclear (EIAC) single shot behavior.

    82574 uses 'Other' in ICR and IMS for LSC interrupt types when in MSI-X
    mode, so we need to set and re-arm the 'Other' bit during attach and
    after ICR reads in the FAST handler if not an LSC or after handling on
    LSC due to autoclearing.

    This work was primarily done to address the referenced PR, but inspired
    some clarification and improvement for igb-class devices once the
    intentions of previous bug fix attempts became clearer.

    PR:             211219
    Reported by:    Alexey <aserp3@gmail.com>
    Tested by:      kbowling (I210 lagg), markj (I210)
    Approved by:    markj
    MFC after:      1 month
    Differential Revision:  https://reviews.freebsd.org/D29943

    (cherry picked from commit eea55de7b10808b86277d7fdbed2d05d3c6db1b2)

 sys/dev/e1000/if_em.c | 41 ++++++++++++++++++++++++-----------------
 sys/dev/e1000/if_em.h |  2 --
 2 files changed, 24 insertions(+), 19 deletions(-)

Comment 40 commit-hook freebsd_committer

2021-05-28 05:52:31 UTC

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=166ad86c33ba4ca1a1b235913a9ae553ebf6a425

commit 166ad86c33ba4ca1a1b235913a9ae553ebf6a425
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-04-25 08:22:23 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-05-28 05:51:46 +0000

    e1000: Rework em_msi_link interrupt filter

    * Fix 82574 Link Status Changes, carrying the OTHER mask bit around as
      needed.
    * Move igb-class LSC re-arming out of FAST back into the handler.
    * Clarify spurious/other interrupt re-arms in FAST.

    In MSI-X mode, 82574 and igb-class devices use an interrupt filter to
    handle Link Status Changes. We want to do LSC re-arms in the handler
    to take advantage of autoclear (EIAC) single shot behavior.

    82574 uses 'Other' in ICR and IMS for LSC interrupt types when in MSI-X
    mode, so we need to set and re-arm the 'Other' bit during attach and
    after ICR reads in the FAST handler if not an LSC or after handling on
    LSC due to autoclearing.

    This work was primarily done to address the referenced PR, but inspired
    some clarification and improvement for igb-class devices once the
    intentions of previous bug fix attempts became clearer.

    PR:             211219
    Reported by:    Alexey <aserp3@gmail.com>
    Tested by:      kbowling (I210 lagg), markj (I210)
    Approved by:    markj
    MFC after:      1 month
    Differential Revision:  https://reviews.freebsd.org/D29943

    (cherry picked from commit eea55de7b10808b86277d7fdbed2d05d3c6db1b2)

 sys/dev/e1000/if_em.c | 41 ++++++++++++++++++++++++-----------------
 sys/dev/e1000/if_em.h |  2 --
 2 files changed, 24 insertions(+), 19 deletions(-)

Comment 41 Kevin Bowling freebsd_committer

2021-05-28 06:42:51 UTC

No MFC to stable11 it uses a different driver.