Bug 219428 - em network driver broken in current
Summary: em network driver broken in current
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-net mailing list
URL:
Keywords: regression
Depends on:
Blocks: 220004
Reported: 2017-05-20 21:14 UTC by andy
Modified: 2019-07-14 12:57 UTC
CC List: 20 users

Attachments
dmesg from affected system (14.28 KB, text/plain)
2017-06-03 20:02 UTC, gitdev
debug output (6.60 KB, text/plain)
2017-06-04 01:07 UTC, gitdev
boot msg, panic and backtrace (6.43 KB, text/plain)
2017-06-18 03:43 UTC, gitdev

Description andy 2017-05-20 21:14:18 UTC
em network driver is broken on 12-current

The problem occurs trying to set the address via DHCP or manually.

pciconf -lv shows:
em0 device = 82583v Gigabit Network Connection


The problem has been discussed recently on the CURRENT mailing list.
Comment 1 Kevin Bowling freebsd_committer 2017-05-25 00:41:26 UTC
This report should be self-contained; it is difficult to dig up and align mailing-list threads.
Comment 2 andy 2017-05-27 01:15:45 UTC
There is no need to look through the mailing-list threads; I only mentioned them to show that there are more reports than just mine. There is not much to report: install CURRENT on a physical machine with an Intel gigabit NIC. Both DHCP and static address assignment fail to configure the interface, and pinging another machine does not succeed. I am not sure what I can add. Did you try to install on a physical machine? I have no other way to tell you how to reproduce the problem.
Comment 3 O. Hartmann 2017-05-27 07:34:25 UTC
This is what I recently sent to the mailing list; I have simply copied and pasted it here for convenience to document the problem, which is still present and serious. Since the introduction of IFLIB the problem has both eased and worsened: recent CURRENT now recovers by itself after being "dead" for more than a minute (but then loses connections, e.g. ssh, due to timeouts), while in increasingly frequent cases it gets worse and the device is lost entirely. I know of no method to revive the NIC other than rebooting, which is disastrous in some situations.

[...]
Since the introduction of IFLIB, I have had serious trouble with one type of
NIC in particular, namely those formerly known as igb and em.

The worst device is an Intel NIC known as i217-LM

em0@pci0:0:25:0:        class=0x020000 card=0x11ed1734 chip=0x153a8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection I217-LM'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfb300000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfb339000, size 4096, enabled
    bar   [18] = type I/O Port, range 32, base 0xf020, size 32, enabled

This NIC is widely used in Fujitsu's CELSIUS M740 workstations and, as fate
would have it, I have to use one of these.

When syncing data over the network from the workstation to an older C2D/bce
based server via NFSv4, the connection to the NFS server has been getting stuck
since the introduction of IFLIB, and I receive console messages like

em0: TX(0) desc avail = 1024, pidx = 0
em0: TX(0) desc avail = 42, pidx = 985

When I hit "Ctrl-T" on the terminal running the sync via "rsync", I then see
this message (just for the record):

load: 0.01  cmd: rsync 68868 [nfsaio] 395.68r 4.68u 4.64s 0% 3288k

Server and client(s) are on 12-CURRENT (roughly FreeBSD 12.0-CURRENT #38
r318285: Mon May 15 12:27:29 CEST 2017 amd64) with customised kernels and
"netmap" enabled, in case that matters.

In the past, I was able to revive the connection by simply taking the NIC down
and bringing it up again. While I had a ping running as an indicator of the
state of the NIC, I very often got

ping: sendto: No buffer space available

Well, today I checked the dmesg output to gather those messages again and
realised that the dmesg is garbled:

[...]
nfs nfs servnnfs servefs r server19 2.19162n.fs snerver fs1 s9nfs s2er.nfs
server er192.168.0.31:/pool/packages: not responding v
er 192.168.0.31ver :/po1ol/packages9: 2.168.0.31:/pool/packagesn: noot
responding t
<6>n fs serverespondinngf
s
 server 192.168.1rn nfs server 192.168.0.31:/pool/packages: not1 responding
 9
 2.168.1f7s 0.31:/pool/packagenfs sesrver 19serv2er .168.0.31:/poo: not
respolnding /
 packages: not responding
 nfs server 19192.168.0.31:/pool/pa2c.k168.0.31:a/gpserver
ne1s92.168.0.31:/pool/pac: knot respaof1s68 gs.e17rve8r.2
3192.168.0.31:/pool/packa1:/pool/packages: not responding o goes: nl/packages:
not responding o
 t responding
 nfs server 192.168.0.31:/poes: ol/packages: nfns server
192.168.0.31:/pool/paot responding c
 kages: not respondinnfs server n192.1f68.0.31:/pool/packagess: ndi server
 192.168.0.31:/pool/packages: not responding
[...]

Earlier this year, after the introduction of IFLIB, I checked servers equipped
with Intel's very popular i350T2v2 NIC and had similar problems when dd'ing
large files over NFSv4 (ZFS-backed) from a client (em0, an older consumer-grade
NIC from 2010 whose ID I forgot) to a server with an i350. There it was the
server side that got stuck, with messages similar to those reported for the
i217-LM. Since my department uses lots of those server-grade NICs, I will swap
the i217 for an i350T2 and check again.
Comment 4 gitdev 2017-06-03 02:05:07 UTC
I confirm the problem with em on FreeBSD 12-CURRENT r319167. The board is an Atom D525 with an ICH8M system chip. pciconf shows:

em5@pci0:7:0:0: class=0x020000 card=0x00008086 chip=0x150c8086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82583V Gigabit Network Connection'
    class      = network
    subclass   = ethernet
Comment 5 gitdev 2017-06-03 20:02:54 UTC
Created attachment 183184
dmesg from affected system

Boot output from the affected system, amd64 with an Atom D525 booting 12-CURRENT r319167. The Ethernet devices em0-em5 should be recognized, with em3-em5 configured as a bridge.
Comment 6 gitdev 2017-06-04 01:07:06 UTC
Created attachment 183187
debug output

Debug output for r319481 on amd64

panic: Assertion adapter->tx_num_queues > 0 failed at /usr/src/sys/dev/e1000/if_em.c:2664
Comment 7 gitdev 2017-06-18 03:43:19 UTC
Created attachment 183588
boot msg, panic and backtrace

FreeBSD 12.0-CURRENT #0 r319859 generates panic:

em1: Using MSIX interrupts with 1 vectors
panic: Assertion adapter->tx_num_queues > 0 failed at /usr/src/sys/dev/e1000/if_em.c:2664
Comment 8 Kaho Toshikazu 2017-08-25 02:51:33 UTC
(In reply to gitdev from comment #7)

The panic you hit is unrelated to the original report.
Please try this patch.

Index: sys/dev/e1000/if_em.c
===================================================================
--- sys/dev/e1000/if_em.c	(revision 322833)
+++ sys/dev/e1000/if_em.c	(working copy)
@@ -797,6 +797,8 @@
 		scctx->isc_txrx = &em_txrx;
 		scctx->isc_capenable = EM_CAPS;
 		scctx->isc_tx_csum_flags = CSUM_TCP | CSUM_UDP | CSUM_IP_TSO;
+		if (adapter->hw.mac.type != e1000_82574)
+			scctx->isc_msix_bar = 0;
 	} else {
 		scctx->isc_txqsizes[0] = roundup2((scctx->isc_ntxd[0] + 1) * sizeof(struct e1000_tx_desc), EM_DBA_ALIGN);
 		scctx->isc_rxqsizes[0] = roundup2((scctx->isc_nrxd[0] + 1) * sizeof(struct e1000_rx_desc), EM_DBA_ALIGN);
Comment 9 Sergey V. Dyatko 2017-11-16 13:16:41 UTC
Hi, I have a Supermicro server
smbios.planar.product="X9DRW-3LN4F+/X9DRW-3TF+"
running
FreeBSD st3.domain.tld 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r325556: Sun Nov 12 22:39:29 MSK 2017     root@st3.domain.tld:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG  amd64
with igb(4):
igb0@pci0:4:0:0:        class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
igb1@pci0:4:0:1:        class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
igb2@pci0:129:0:0:      class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
igb3@pci0:129:0:3:      class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet

# grep lagg /etc/rc.conf
cloned_interfaces="lagg0 vlan2"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 laggport igb2 laggport igb3 62.x.x.x netmask 255.255.255.224"
ifconfig_vlan2="vlan 2 vlandev lagg0 192.168.2.3/24"

After a reboot everything works fine:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e505bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:4c:11:d2
        inet 62.x.x.x netmask 0xffffffe0 broadcast 62.x.x.x 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg 
        laggproto lacp lagghash l2,l3,l4
        laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

But after a while I see something like this in the messages log:
Nov 16 09:35:30 st3 kernel: igb1: Interface stopped DISTRIBUTING, possible flapping

(always igb1)
Then, after a while, the server becomes unavailable over the network. If I open the console via IPMI, I see the following:

igb1: TX(3) desc avail = 1024, pidx = 0
igb1: TX(3) desc avail = 1024, pidx = 0
igb1: TX(3) desc avail = 1024, pidx = 0
igb1: TX(3) desc avail = 1024, pidx = 0

After a reboot everything works fine again...
Comment 10 Guido Falsi freebsd_committer 2017-11-16 14:38:59 UTC
I'm also seeing similar errors. I have this hardware:

em0@pci0:0:31:6:	class=0x020000 card=0x86721043 chip=0x15b88086 rev=0x31 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection (2) I219-V'
    class      = network
    subclass   = ethernet


Running "ifconfig em0 -tso4" makes the machine stable for me.
Comment 11 tom 2018-06-29 04:41:13 UTC
This started happening on a couple of my boxes as well (Dell OptiPlex 7010, Lenovo ThinkCentre M90p).

em0: TX(0) desc avail = 41, pidx = 788
link state changed to down
em0: link state changed to DOWN
em0: TX(0) desc avail = 1024, pidx = 0
em0: TX(0) desc avail = 1024, pidx = 0

dell -
em0@pci0:0:25:0:        class=0x020000 card=0x052c1028 chip=0x15028086 rev=0x04 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82579LM Gigabit Network Connection (Lewisville)'
    class      = network
    subclass   = ethernet

thinkcentre -
em0@pci0:0:25:0:        class=0x020000 card=0x306017aa chip=0x10ef8086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82578DM Gigabit Network Connection'
    class      = network
    subclass   = ethernet


No panic and no messages other than "kernel out of bufferspace".
I've added -tso -lro -rxcsum -txcsum to the ifconfig settings and we'll see whether that stops the failures. It usually takes a couple of days to trigger.

It didn't start happening on the Dell until last week.
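
The workaround tried in comments 10 and 11 amounts to switching off offload capabilities with ifconfig(8). Below is a minimal sketch, in Python, of how such a command line could be assembled and, on an affected FreeBSD host, executed. The helper names `offload_disable_cmd` and `apply` are hypothetical and not part of any FreeBSD tool; whether disabling these capabilities actually helps depends on the particular MAC.

```python
import subprocess

# Capability names taken from comments 10 and 11; in ifconfig(8) a leading
# "-" switches the capability off.
TSO4_ONLY = ("tso4",)
ALL_REPORTED = ("tso", "lro", "rxcsum", "txcsum")


def offload_disable_cmd(ifname, caps=TSO4_ONLY):
    """Return the ifconfig argv that disables the given capabilities."""
    if not ifname or not ifname.isalnum():
        raise ValueError(f"unexpected interface name: {ifname!r}")
    return ["ifconfig", ifname] + [f"-{cap}" for cap in caps]


def apply(ifname, caps=TSO4_ONLY, dry_run=True):
    """Print the command; run it only when dry_run is False (needs root)."""
    cmd = offload_disable_cmd(ifname, caps)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    apply("em0")                # prints: ifconfig em0 -tso4
    apply("em0", ALL_REPORTED)  # prints: ifconfig em0 -tso -lro -rxcsum -txcsum
```

The dry-run default is deliberate: the sketch only prints the command unless it is explicitly told to change the interface.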
Comment 12 Matthew Macy 2018-06-29 05:38:24 UTC
The only recent change was r335303. Can you try r335302 or prior?

-M
Comment 13 tom 2018-06-29 13:53:48 UTC
(In reply to Matthew Macy from comment #12)

I can confirm that I had put a PCI rl card in the Lenovo box before this, as I suspected at the time that the problem was hardware-related.
Comment 14 commit-hook freebsd_committer 2018-07-15 19:04:39 UTC
A commit references this bug:

Author: marius
Date: Sun Jul 15 19:04:26 UTC 2018
New revision: 336313
URL: https://svnweb.freebsd.org/changeset/base/336313

Log:
  Assorted TSO fixes for em(4)/iflib(9) and dead code removal:
  - Ever since the workaround for the silicon bug of TSO4 causing MAC hangs
    was committed in r295133, CSUM_TSO always got disabled unconditionally
    by em(4) on the first invocation of em_init_locked(). However, even with
    that problem fixed, it turned out that for at least e. g. 82579 not all
    necessary TSO workarounds are in place, still causing MAC hangs even at
    Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled
    in r323292 (r323293 for stable/10) for the EM-class by default, allowing
    users to turn it on if it happens to work with their particular EM MAC
    in a Gigabit-only environment.
    In head, the TSO workaround for speeds other than Gigabit was lost with
    the conversion to iflib(9) in r311849 (possibly along with another one
    or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4
    got enabled by default again, causing device hangs. Therefore, change the
    default for this hardware class back to have TSO4 off, allowing users
    to turn it on manually if it happens to work in their environment as
    we do in stable/{10,11}. An alternative would be to add a whitelist of
    EM-class devices where TSO4 actually is reliable with the workarounds in
    place, but given that the advantage of TSO at Gigabit speed is rather
    limited - especially with the overhead of these workarounds -, that's
    really not worth it. [1]
    This change includes the addition of an isc_capabilities to struct
    if_softc_ctx so iflib(9) can also handle interface capabilities that
    shouldn't be enabled by default which is used to handle the default-off
    capabilities of e1000 as suggested by shurd@ and moving their handling
    from em_setup_interface() to em_if_attach_pre() accordingly.
  - Although 82543 support TSO4 in theory, the former lem(4) didn't have
    support for TSO4, presumably because TSO4 is even more broken in the
    LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class
    devices was enabled as part of the conversion to iflib(9) in r311849,
    causing device hangs. So revert back to the pre-r311849 behavior of
    not supporting TSO4 for LEM-class at all, which includes not creating
    a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2]
  - In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET
    (65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO
    DMA must have a maxsize of the maximum TSO size plus the size of a
    VLAN header for software VLAN tagging. The iflib(9) converted em(4),
    thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE
    in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET
    in em_setup_interface() (apparently, left-over from pre-iflib(9)
    times). So remove the later and correct iflib(9) to correctly cap
    the maximum TSO size reported to the stack at IP_MAXPACKET. While at
    it, let iflib(9) use if_sethwtsomax*().
    This change includes the addition of isc_tso_max{seg,}size DMA engine
    constraints for the TSO DMA tag to struct if_shared_ctx and letting
    iflib_txsd_alloc() automatically adjust the maxsize of that tag in case
    IFCAP_VLAN_MTU is supported as requested by shurd@.
  - Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet
    header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9)
    so adjustment is automatically done in case IFCAP_VLAN_MTU is supported.
    As a consequence, this adjustment now is also done in case of bnxt(4)
    which missed it previously.
  - Move the reduction of the maximum TSO segment count reported to the
    stack by the number of m_pullup(9) calls (which in the worst case,
    can add another mbuf and, thus, the requirement for another DMA
    segment each) in the transmit path for performance reasons from
    em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now
    done in iflib_parse_header() rather than in the no longer existing
    em_xmit(). Moreover, this optimization applies to all drivers using
    iflib(9) and not just em(4); all in-tree iflib(9) consumers still
    have enough room to handle full size TSO packets. Also, reduce the
    adjustment to the maximum number of m_pullup(9)'s now performed in
    iflib_parse_header().
  - Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9)
    in r311849 and r335338 respectively, these drivers didn't enable
    IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed
    through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on
    by default but also lagg(4) was fixed in that regard in r203548. So
    just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling
    in {em,ixl,ixlv}_setup_interface().
  - Nuke other redundant IFCAP_* setting in {em,ixl,ixlv}_setup_interface()
    which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre()
    now.
  - Remove some redundant/dead setting of scctx->isc_tx_csum_flags in
    em_if_attach_pre().
  - Remove some IFCAP_* duplicated either directly or indirectly (e. g.
    via IFCAP_HWCSUM) in {EM,IGB,IXL}_CAPS.
  - Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as
    iflib(9) adds that capability unconditionally.
  - Remove some unused macros from em(4).
  - Bump __FreeBSD_version as some of the above changes require the modules
    of drivers using iflib(9) to be recompiled.

  Okayed by:	sbruno@ at 201806 DevSummit Transport Working Group [1]
  Reviewed by:	sbruno (earlier version), erj
  PR:	219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2]
  Differential Revision:	https://reviews.freebsd.org/D15720

Changes:
  head/sys/dev/bnxt/if_bnxt.c
  head/sys/dev/e1000/if_em.c
  head/sys/dev/e1000/if_em.h
  head/sys/dev/ixgbe/if_ix.c
  head/sys/dev/ixgbe/if_ixv.c
  head/sys/dev/ixgbe/ixgbe.h
  head/sys/dev/ixl/if_ixl.c
  head/sys/dev/ixl/if_ixlv.c
  head/sys/dev/ixl/ixl_pf_main.c
  head/sys/net/iflib.c
  head/sys/net/iflib.h
  head/sys/sys/param.h
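
The TSO sizing rule described in the log above can be illustrated with a small calculation. This is only a sketch of the arithmetic in Python, not the iflib(9) code: `IP_MAXPACKET` (65535) comes from the log, while the 18-byte figure for a VLAN-tagged Ethernet header and the function names are assumptions made for illustration.

```python
IP_MAXPACKET = 65535   # maximum TSO size the TCP stack can handle (from the log)
VLAN_HEADER_LEN = 18   # ASSUMPTION: Ethernet header including an 802.1Q tag


def tso_size_for_stack(driver_tso_size_max):
    """Cap the TSO size reported to the stack at IP_MAXPACKET."""
    return min(driver_tso_size_max, IP_MAXPACKET)


def tso_dma_maxsize(driver_tso_size_max, vlan_mtu_supported):
    """Size of the TSO DMA tag: room for a VLAN header is added only when
    the interface supports VLAN_MTU (software VLAN tagging)."""
    extra = VLAN_HEADER_LEN if vlan_mtu_supported else 0
    return driver_tso_size_max + extra


if __name__ == "__main__":
    print(tso_size_for_stack(IP_MAXPACKET + VLAN_HEADER_LEN))  # 65535
    print(tso_dma_maxsize(IP_MAXPACKET, True))                 # 65553
    print(tso_dma_maxsize(IP_MAXPACKET, False))                # 65535
```

The point of keeping the two values separate is that the stack never sees more than IP_MAXPACKET, while the DMA tag still has room for the header added by software VLAN tagging.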
Comment 15 commit-hook freebsd_committer 2019-02-09 11:59:40 UTC
A commit references this bug:

Author: marius
Date: Sat Feb  9 11:58:41 UTC 2019
New revision: 343934
URL: https://svnweb.freebsd.org/changeset/base/343934

Log:
  - Remove the redundant device disabled hint handling; ever since
    r241119 that's performed globally by device_attach(9).
  - As for the EM-class of devices, em(4) supports multiple queues
    and MSI-X respectively only with 82574 devices. However, since
    the conversion to iflib(4), em(4) relies on the interrupt type
    fallback mechanism, i. e. MSI-X -> MSI -> INTx, of iflib(4) to
    figure out the interrupt type to use for the EM-class (as well
    as the IGB-class) of MACs. Moreover, despite the datasheet for
    82583V not mentioning any support of MSI-X, there actually are
    82583V devices out there that report a varying number of MSI-X
    messages as supported. The interrupt type fallback of iflib(4)
    is causing two failure modes depending on the actual number of
    MSI-X messages supported for such instances of 82583V:
    1) With only one MSI-X message supported, none is left for the
       RX/TX queues as that one message gets assigned to the admin
       interrupt. Worse, later on - which will be addressed with a
       separate fix - iflib(4) interprets that one messages as MSI
       or INTx to be set up, but fails to actually do so as it has
       previously called pci_alloc_msix(9). [1, 2]
    2) With more message supported, their distribution is okay but
       then em_if_msix_intr_assign() doesn't work for 82583V, with
       the interface being left in a non-working state, too. [3]
    Thus, let em_if_attach_pre() indicate to iflib(4) to try MSI-X
    with 82574 only, and at most MSI for the remainder of EM-class
    devices.
    While at it, remove "try_second_bar" as it's polarity inverted
    and not actually needed.
  - Remove code from em_if_timer() that effectively is a NOP since
    the conversion to iflib(4) ("trigger" is no longer read).
    While at it, let the comment for em_if_timer() reflect reality
    after said conversion.
  - Implement an ifdi_watchdog_reset method which only updates the
    em(4) "watchdog_events" counter but doesn't perform any reset,
    so that the em(4) "watchdog_timeouts" SYSCTL (iflib(4) doesn't
    provide a counterpart) reflects reality and these timeouts add
    to IFCOUNTER_OERRORS again after the iflib(4) conversion.
  - Remove the "mbuf_defrag_fail" and "tx_dma_fail" SYSCTLS; since
    the iflib(4) conversion, associated counters are disconnected,
    but iflib(4) provides "mbuf_defrag_failed" and "tx_map_failed"
    respectively as equivalents.
  - Move the description preceding lem_smartspeed() to the correct
    spot before em_reset() and bring back appropriate comments for
    {igb,em}_initialize_rss_mapping() and lem_smartspeed() lost in
    the iflib(4) conversion.
  - Adapt some other function descriptions and INIT_DEBUGOUT() use
    to match reality after the iflib(4) conversion.
  - Put the debugging message of em_enable_vectors_82574() (missed
    in r343578) under bootverbose, too.

  PR:		219428 [1], 235246 [2], 235147 [3]
  Reviewed by:	erj (previous version)
  Differential Revision:	https://reviews.freebsd.org/D19108

Changes:
  head/sys/dev/e1000/if_em.c
  head/sys/dev/e1000/if_em.h
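
The first failure mode in the log above comes down to simple counting, which the following Python sketch models. It is not the iflib(4) or em(4) code; the function names and the string MAC types are hypothetical. With a single MSI-X message, the admin interrupt consumes it and no vector is left for the RX/TX queues, which would be consistent with the `adapter->tx_num_queues > 0` assertion reported in comments 6 and 7.

```python
ADMIN_VECTORS = 1  # one message is assigned to the admin interrupt (per the log)


def queue_vectors(msix_messages):
    """MSI-X messages left for RX/TX queues after the admin interrupt."""
    return max(msix_messages - ADMIN_VECTORS, 0)


def allowed_interrupt_types(mac_type):
    """Interrupt types tried for an EM-class MAC after the fix:
    MSI-X for 82574 only, at most MSI for the rest."""
    if mac_type == "82574":
        return ["MSI-X", "MSI", "INTx"]
    return ["MSI", "INTx"]


if __name__ == "__main__":
    # An 82583V that reports a single MSI-X message: nothing is left for queues.
    print(queue_vectors(1))                   # 0
    print(queue_vectors(3))                   # 2
    print(allowed_interrupt_types("82583V"))  # ['MSI', 'INTx']
    print(allowed_interrupt_types("82574"))   # ['MSI-X', 'MSI', 'INTx']
```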
Comment 16 commit-hook freebsd_committer 2019-02-13 14:39:54 UTC
A commit references this bug:

Author: marius
Date: Wed Feb 13 14:39:17 UTC 2019
New revision: 344098
URL: https://svnweb.freebsd.org/changeset/base/344098

Log:
  MFC: r343934

  - Remove the redundant device disabled hint handling; ever since
    r241119 that's performed globally by device_attach(9).
  - As for the EM-class of devices, em(4) supports multiple queues
    and MSI-X respectively only with 82574 devices. However, since
    the conversion to iflib(4), em(4) relies on the interrupt type
    fallback mechanism, i. e. MSI-X -> MSI -> INTx, of iflib(4) to
    figure out the interrupt type to use for the EM-class (as well
    as the IGB-class) of MACs. Moreover, despite the datasheet for
    82583V not mentioning any support of MSI-X, there actually are
    82583V devices out there that report a varying number of MSI-X
    messages as supported. The interrupt type fallback of iflib(4)
    is causing two failure modes depending on the actual number of
    MSI-X messages supported for such instances of 82583V:
    1) With only one MSI-X message supported, none is left for the
       RX/TX queues as that one message gets assigned to the admin
       interrupt. Worse, later on - which will be addressed with a
       separate fix - iflib(4) interprets that one messages as MSI
       or INTx to be set up, but fails to actually do so as it has
       previously called pci_alloc_msix(9). [1, 2]
    2) With more message supported, their distribution is okay but
       then em_if_msix_intr_assign() doesn't work for 82583V, with
       the interface being left in a non-working state, too. [3]
    Thus, let em_if_attach_pre() indicate to iflib(4) to try MSI-X
    with 82574 only, and at most MSI for the remainder of EM-class
    devices.
    While at it, remove "try_second_bar" as it's polarity inverted
    and not actually needed.
  - Remove code from em_if_timer() that effectively is a NOP since
    the conversion to iflib(4) ("trigger" is no longer read).
    While at it, let the comment for em_if_timer() reflect reality
    after said conversion.
  - Implement an ifdi_watchdog_reset method which only updates the
    em(4) "watchdog_events" counter but doesn't perform any reset,
    so that the em(4) "watchdog_timeouts" SYSCTL (iflib(4) doesn't
    provide a counterpart) reflects reality and these timeouts add
    to IFCOUNTER_OERRORS again after the iflib(4) conversion.
  - Remove the "mbuf_defrag_fail" and "tx_dma_fail" SYSCTLS; since
    the iflib(4) conversion, associated counters are disconnected,
    but iflib(4) provides "mbuf_defrag_failed" and "tx_map_failed"
    respectively as equivalents.
  - Move the description preceding lem_smartspeed() to the correct
    spot before em_reset() and bring back appropriate comments for
    {igb,em}_initialize_rss_mapping() and lem_smartspeed() lost in
    the iflib(4) conversion.
  - Adapt some other function descriptions and INIT_DEBUGOUT() use
    to match reality after the iflib(4) conversion.
  - Put the debugging message of em_enable_vectors_82574() (missed
    in r343578) under bootverbose, too.

  PR:		219428 [1], 235246 [2], 235147 [3]
  Reviewed by:	erj (previous version)
  Differential Revision:	https://reviews.freebsd.org/D19108

Changes:
_U  stable/12/
  stable/12/sys/dev/e1000/if_em.c
  stable/12/sys/dev/e1000/if_em.h
Comment 17 patpro 2019-02-19 20:40:42 UTC
Hi,
I'm experiencing this bug on:

FreeBSD 12.0-RELEASE-p3 GENERIC amd64
CPU: Intel(R) Core(TM) i3-3220T CPU @ 2.80GHz (2793.72-MHz K8-class CPU)
em0: <Intel(R) PRO/1000 Network Connection> port 0xf080-0xf09f mem 0xf7e00000-0xf7e1ffff,0xf7e39000-0xf7e39fff irq 20 at device 25.0 on pci0
em1: <Intel(R) PRO/1000 Network Connection> port 0xe000-0xe01f mem 0xf7c00000-0xf7c1ffff,0xf7c20000-0xf7c23fff irq 18 at device 0.0 on pci3

em1: attach_pre capping queues at 2
Current cap: 0x460b
em1: using 1024 tx descriptors and 1024 rx descriptors
em1: msix_init qsets capped at 2
em1: pxm cpus: 2 queue msgs: 4 admincnt: 1
em1: using 2 rx queues 2 tx queues 
em1: Using MSIX interrupts with 3 vectors
em1: allocated for 2 tx_queues
em1: allocated for 2 rx_queues
em1: Ethernet address: 70:54:d2:45:71:41
em1: netmap queues/slots: TX 2/1024, RX 2/1024

em1: TX(1) desc avail = 41, pidx = 179
em1: link state changed to DOWN
em1: TX(1) desc avail = 1024, pidx = 0
em1: TX(1) desc avail = 1024, pidx = 0
em1: TX(1) desc avail = 1024, pidx = 0
...

It's only fixed by a reboot.

How can I tell whether the fixes mentioned here are present in 12.0-RELEASE-p3, or whether I'll have to wait for p4 or later?

Thanks
Comment 18 Robert 2019-02-24 17:19:12 UTC
I can confirm this is still a problem on 12.0-RELEASE-p3. Of note, it only happens on em0 on my system; I use em1 for a cross-connect and NFS. Both are built into the motherboard.



em0@pci0:0:25:0: class=0x020000 card=0x20378086 chip=0x15038086 rev=0x04 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82579V Gigabit Network Connection'
    class      = network
    subclass   = ethernet


em1@pci0:3:0:0: class=0x020000 card=0x20378086 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82574L Gigabit Network Connection'
    class      = network
    subclass   = ethernet
Comment 19 Nick 2019-03-02 20:14:31 UTC
Hi, I believe I am also affected by this bug in FreeBSD-12.0-RELEASE-p3.

When uploading/downloading at the same time, em0 loses speed and will eventually restart with:

Mar  2 14:45:40 kernel: em0: TX(0) desc avail = 41, pidx = 583
Mar  2 14:45:41 kernel: em0: link state changed to DOWN
Mar  2 14:45:42 kernel: em0: link state changed to UP
Mar  2 14:45:43 kernel: em0: TX(0) desc avail = 1024, pidx = 0
Mar  2 14:45:44 kernel: em0: link state changed to DOWN
Mar  2 14:45:45 kernel: em0: link state changed to UP

em0@pci0:0:25:0:	class=0x020000 card=0x20138086 chip=0x15038086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82579V Gigabit Network Connection'
    class      = network
    subclass   = ethernet

This does not happen when using Linux (i.e. Debian 9.8).

Thank you.
Comment 20 Stefan Thurner 2019-03-09 09:24:53 UTC
Hi,

I'm facing this problem on 1 of 2 installed em interfaces, too.

System:
FreeBSD 12.0-STABLE r344823 amd64

NICs:
em0@pci0:0:25:0:        class=0x020000 card=0x102e17aa chip=0x15028086 rev=0x04 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82579LM Gigabit Network Connection (Lewisville)'
    class      = network
    subclass   = ethernet

em1@pci0:2:10:0:        class=0x020000 card=0x002e8086 chip=0x100e8086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82540EM Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

Problem:
Mar  7 20:49:06 hydrogen kernel: em1: link state changed to DOWN
Mar  7 20:49:07 hydrogen kernel: em1: TX(0) desc avail = 1024, pidx = 0
Mar  7 20:49:12 hydrogen syslogd: last message repeated 3 times
Mar  7 20:49:14 hydrogen kernel: em1: link state changed to UP
Mar  7 20:49:15 hydrogen kernel: em1: TX(0) desc avail = 1020, pidx = 4
Mar  7 20:49:15 hydrogen kernel: em1: link state changed to DOWN
Mar  7 20:49:16 hydrogen kernel: em1: link state changed to UP
Mar  7 20:49:24 hydrogen kernel: em1: link state changed to DOWN
Mar  7 20:49:26 hydrogen kernel: em1: link state changed to UP
Mar  7 20:49:27 hydrogen kernel: em1: link state changed to DOWN
Mar  7 20:49:28 hydrogen kernel: em1: TX(0) desc avail = 1024, pidx = 0
Mar  7 20:49:30 hydrogen syslogd: last message repeated 1 times
Mar  7 20:49:32 hydrogen kernel: em1: link state changed to UP
Mar  7 20:49:34 hydrogen kernel: em1: link state changed to DOWN
Mar  7 20:49:36 hydrogen kernel: em1: TX(0) desc avail = 1024, pidx = 0
Mar  7 20:49:50 hydrogen syslogd: last message repeated 8 times
Mar  7 20:49:52 hydrogen kernel: em1: link state changed to UP
Mar  7 20:49:52 hydrogen kernel: em1: TX(0) desc avail = 1024, pidx = 0
Mar  7 20:49:52 hydrogen kernel: em1: link state changed to DOWN
Mar  7 20:49:54 hydrogen kernel: em1: link state changed to UP

em0 (82579LM) works without problems.
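
The "TX(n) desc avail = ..., pidx = ..." lines that recur throughout this report can be pulled out of a log with a short script, which makes it easier to see which interface and queue are affected. The following Python sketch assumes only the message format shown in the comments above; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "em1: TX(0) desc avail = 1024, pidx = 0",
# with or without a syslog prefix in front of it.
TX_LINE = re.compile(
    r"(?P<ifname>[a-z]+\d+): TX\((?P<queue>\d+)\) "
    r"desc avail = (?P<avail>\d+), pidx = (?P<pidx>\d+)"
)


def parse_tx_reports(lines):
    """Yield (ifname, queue, avail, pidx) for every matching log line."""
    for line in lines:
        m = TX_LINE.search(line)
        if m:
            yield (m["ifname"], int(m["queue"]), int(m["avail"]), int(m["pidx"]))


def count_per_queue(lines):
    """Count reports per (interface, queue) pair."""
    return Counter((ifname, queue) for ifname, queue, _, _ in parse_tx_reports(lines))


if __name__ == "__main__":
    sample = [
        "Mar  7 20:49:06 hydrogen kernel: em1: link state changed to DOWN",
        "Mar  7 20:49:07 hydrogen kernel: em1: TX(0) desc avail = 1024, pidx = 0",
        "Mar  7 20:49:15 hydrogen kernel: em1: TX(0) desc avail = 1020, pidx = 4",
    ]
    for report in parse_tx_reports(sample):
        print(report)
    print(count_per_queue(sample))
```

Note that syslogd's "last message repeated N times" lines are not expanded by this sketch, so the counts are a lower bound.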
Comment 21 IPTRACE 2019-04-01 18:19:46 UTC
Hello!

I'm affected as well.

kernel: igb0: TX(0) desc avail = 1024, pidx = 0
...
kernel: igb2: TX(5) desc avail = 1024, pidx = 0


igb0@pci0:129:0:0:      class=0x020000 card=0x00001458 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet


igb2@pci0:132:0:0:      class=0x020000 card=0xa02c8086 chip=0x10e88086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82576 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
Comment 22 IPTRACE 2019-07-06 05:29:23 UTC
Please reopen the bug; the problem still occurs.

FreeBSD 12.0-RELEASE-p7

Use case:
1. Everything was working.
2. I detached the Ethernet cable on the modem side.
3. I connected it again and saw "no carrier" on the igb0 interface.
4. The system logged the following:
      Jul  5 22:31:35 hpv kernel: igb0: TX(0) desc avail = 1024, pidx = 0
      Jul  5 22:31:38 hpv kernel: igb0: TX(0) desc avail = 1024, pidx = 0
      Jul  5 22:31:40 hpv kernel: igb0: TX(0) desc avail = 1024, pidx = 0
5. I tried "server:# service netif restart igb0 && service routing restart" without success.
6. I restarted the OS to get back to normal.
Comment 23 IPTRACE 2019-07-13 19:53:35 UTC
Is there any workaround other than restarting the system?
I also tried the following, without success:

# ifconfig igb0 down
# ifconfig igb0 up

I think the problem arises when the network card loses the Ethernet link; errors then occur and the interface stops working.
Comment 24 IPTRACE 2019-07-14 12:57:36 UTC
My sysctl.conf file:

net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.icmp.log_redirect=1
net.inet.icmp.drop_redirect=1
net.inet.ip.random_id=1
net.link.tap.up_on_open=1
net.inet.tcp.mssdflt=1440
net.inet.tcp.nolocaltimewait=1
net.inet.ip.check_interface=1
net.inet.ip.redirect=0
net.inet.tcp.drop_synfin=1
net.inet.tcp.msl=15000
net.inet.tcp.icmp_may_rst=0
net.inet.tcp.path_mtu_discovery=0
net.inet6.icmp6.rediraccept=0
net.inet6.ip6.redirect=0
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_inc=524288
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.cc.algorithm=cubic
net.inet.tcp.tso=0
net.inet.tcp.rexmit_slop=50
net.inet.tcp.msl=5000
net.inet.tcp.keepinit=5000
net.inet.tcp.finwait2_timeout=5000
net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.always_keepalive=0
net.route.netisr_maxqlen=2048
net.inet.ip.process_options=0
net.inet.sctp.blackhole=2
net.inet.tcp.abc_l_var=44