Bug 220997 - Broken watchdog after iflib update (em0)
Summary: Broken watchdog after iflib update (em0)
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Mark Linimon
Keywords: IntelNetworking, regression
Depends on:
Reported: 2017-07-25 09:18 UTC by Alexander Leidinger
Modified: 2021-04-17 09:47 UTC (History)
6 users (show)

See Also:


Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Leidinger freebsd_committer 2017-07-25 09:18:24 UTC

em0@pci0:2:6:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82541PI Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

head as of r321365

Before the iflib update, this device showed several watchdog and NIC resets per day. After the iflib update the system loses network connection from time to time and only a reboot helps getting back the network connection.

The system has several VIMAGE jails.

I have no second way to access the system, so any debugging would need to happen in a scripted way. My current workaround is this (most of the time it reboots, sometimes the ifconfig down/up helps to get back a network connection):


(echo uptime:; /usr/bin/uptime; echo; /usr/bin/netstat -m; echo; echo dmesg:; /sbin/dmesg | /usr/bin/tail -50; echo ping:; /sbin/ping -nc 1 $GW_IP; ifconfig em0 down; sleep 1; ifconfig em0 up; echo; echo ping again after ifconfig down/up; /sbin/ping -nc 1 $GW_IP; ) | /usr/bin/mail -s "$(hostname): no gateway" root
/bin/sleep 60
/sbin/ping -nc 1 $GW_IP >/dev/null 2>&1 || /sbin/shutdown -r now >/dev/null 2>&1

Here an example of the output when it needs to reboot:
 5:45PM  up  9:30, 1 users, load averages: 0.14, 0.25, 0.58

1283/3277/4560 mbufs in use (current/cache/total)
1024/1818/2842/500678 mbuf clusters in use (current/cache/total/max)
0/1518 mbuf+clusters out of packet secondary zone in use (current/cache)
0/40/40/250338 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/74174 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41723 16k jumbo clusters in use (current/cache/total/max)
2368K/4615K/6984K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
14237 sendfile syscalls
819 sendfile syscalls completed without I/O request
14426 requests for I/O initiated by sendfile
80695 pages read by sendfile as part of a request
11663 pages were valid at time of a sendfile request
0 pages were valid and substituted to bogus page
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed

epair3a: Ethernet address: 02:ff:80:00:09:0a
epair3b: Ethernet address: 02:ff:d0:00:0a:0b
epair3a: link state changed to UP
epair3b: link state changed to UP
epair3a: changing name to 'vnet1:2'
epair3b: changing name to 'vnet1'
vnet1:2: promiscuous mode enabled
epair4a: Ethernet address: 02:ff:80:00:0a:0a
epair4b: Ethernet address: 02:ff:d0:00:0b:0b
epair4a: link state changed to UP
epair4b: link state changed to UP
epair4a: changing name to 'vnet0:3'
epair4b: changing name to 'vnet0'
vnet0:3: promiscuous mode enabled
epair5a: Ethernet address: 02:ff:80:00:0b:0a
epair5b: Ethernet address: 02:ff:d0:00:0c:0b
epair5a: link state changed to UP
epair5b: link state changed to UP
epair5a: changing name to 'vnet1:3'
epair5b: changing name to 'vnet1'
vnet1:3: promiscuous mode enabled
epair6a: Ethernet address: 02:ff:80:00:0c:0a
epair6b: Ethernet address: 02:ff:d0:00:0d:0b
epair6a: link state changed to UP
epair6b: link state changed to UP
epair6a: changing name to 'vnet0:4'
epair6b: changing name to 'vnet0'
vnet0:4: promiscuous mode enabled
epair7a: Ethernet address: 02:ff:80:00:0d:0a
epair7b: Ethernet address: 02:ff:d0:00:0e:0b
epair7a: link state changed to UP
epair7b: link state changed to UP
epair7a: changing name to 'vnet1:4'
epair7b: changing name to 'vnet1'
vnet1:4: promiscuous mode enabled
epair8a: Ethernet address: 02:ff:80:00:0e:0a
epair8b: Ethernet address: 02:ff:d0:00:0f:0b
epair8a: link state changed to UP
epair8b: link state changed to UP
epair8a: changing name to 'vnet0:5'
epair8b: changing name to 'vnet0'
vnet0:5: promiscuous mode enabled
epair9a: Ethernet address: 02:ff:80:00:0f:0a
epair9b: Ethernet address: 02:ff:d0:00:10:0b
epair9a: link state changed to UP
epair9b: link state changed to UP
epair9a: changing name to 'vnet1:5'
epair9b: changing name to 'vnet1'
vnet1:5: promiscuous mode enabled
sonewconn: pcb 0xfffff8009e728848: Listen queue overflow: 5 already in queue awaiting acceptance (1 occurrences)
PING $GW_IP ($GW_IP): 56 data bytes

--- $GW_IP ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss

ping again after ifconfig down/up
PING $GW_IP ($GW_IP): 56 data bytes
64 bytes from $GW_IP: icmp_seq=0 ttl=64 time=6457.660 ms

--- $GW_IP ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 6457.660/6457.660/6457.660/0.000 ms

Comment 1 Alexander Leidinger freebsd_committer 2017-07-25 09:19:33 UTC
Assign to sbruno.
Comment 2 Kevin Bowling freebsd_committer 2017-09-03 06:08:38 UTC
We aren't sure why the watchdog would not function on this HW type and don't have access to it.  If you set sysctl net.inet.tcp.tso=0 does the NIC still trip?
Comment 3 Alexander Leidinger freebsd_committer 2017-09-06 19:25:04 UTC
Normally I have several reboots per day, with TSO disabled it survived 2.5 days before a reboot. It looked like it rebooted because it didn't reach the gateway in my "self-healing" script. I will keep an eye on it and report back next week.
Comment 4 Alexander Leidinger freebsd_committer 2017-09-09 13:19:47 UTC
Disabling TSO doesn't really help:

2017-09-07 -  4:15am  uptime ~9.5h  -> reboot
2017-09-07 - 12:55pm  uptime ~8.5h  -> reboot
2017-09-07 -  8:25pm  uptime ~7.5h  -> no reboot
2017-09-07 -  8:40pm  uptime ~7.5h  -> reboot
2017-09-07 - 10:40pm  uptime ~2h    -> reboot
2017-09-08 - 11:40am  uptime ~15h   -> no reboot
2017-09-08 -  9:05pm  uptime ~1 day -> reboot
2017-09-08 - 10:35pm  uptime ~1.5h  -> reboot
2017-09-09 -  9:05am  uptime ~10.5h -> reboot
2017-09-09 - 12:40pm  uptime ~3h    -> no reboot
2017-09-09 -  2:45pm  uptime ~5.5h  -> reboot

Any command I should have automatically run when my watchdog script detects the situation and reboots?
Comment 5 Alexander Leidinger freebsd_committer 2017-09-27 19:00:10 UTC
Data point:
I updated from r321365 to r323637, and additionally I have this patch:
Index: sys/dev/e1000/em_txrx.c
--- sys/dev/e1000/em_txrx.c     (revision 323637)
+++ sys/dev/e1000/em_txrx.c     (working copy)
@@ -703,8 +703,12 @@

                /* Make sure bad packets are discarded */
                if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
+#if 0
                        return EBADMSG;
+                       printf("em: frame error: ignored -- local patch from bde@\n");

                ri->iri_frags[i].irf_flid = 0;

The change of behavior I see now is, that I have an uptime of more than 10 days. I don't have any line with "frame error" in dmesg or the console.log. Instead of having several reboots per day, I only get 1-2 mails from my watchdog script telling me it has done a down+up of the interface only (gateway was reachable afterwards and no need to reboot).

So... something has changed... in a positive direction... but not yet as it shall be.
Comment 6 Alexander Leidinger freebsd_committer 2018-02-25 15:38:44 UTC

Today I noticed that this behavior is either limited to IPv6, or at least much more obvious for IPv6.

A ping from home to the IPv6 address was giving me ~ 10 responses, then ~10 timeouts/not_reachable then ~10 responses and so on. The ~10 means somewhere around 5 to 18, with a high amount really at 10 or 8-12.

A ping to the IPv4 address didn't show any timeout.

Based upon this I made an experiment on the system:
 - svn update in a loop (from svn.FreeBSD.org)
 - default settings in rc.conf
 --> frequent timeouts from svn update (either connection timeout (most of the time) or aborted data transfer after a few files) for hours
 - change rc.conf while svn update runs in the loop: ip6addrctl_policy="ipv4_prefer"
 - service ip6addrctl restart
 --> next svn update in the loop does data transfer as expected, no timeout, not even after transferring a few files, first svn update finishes successful after catching up with changes from about 2-3 months.

Currently I have the system in ipv4_prefer and it's not reachable via ipv6.

Any ideas?
Comment 7 Alexander Leidinger freebsd_committer 2018-03-16 21:03:59 UTC
(In reply to Alexander Leidinger from comment #6)
Ignore this comment about IPv6, a different issue not related to the connection loss.

Still, sometimes several reboots a day due to my issue-detection script, sometimes a day or two without.
Comment 8 commit-hook freebsd_committer 2018-07-15 19:04:46 UTC
A commit references this bug:

Author: marius
Date: Sun Jul 15 19:04:26 UTC 2018
New revision: 336313
URL: https://svnweb.freebsd.org/changeset/base/336313

  Assorted TSO fixes for em(4)/iflib(9) and dead code removal:
  - Ever since the workaround for the silicon bug of TSO4 causing MAC hangs
    was committed in r295133, CSUM_TSO always got disabled unconditionally
    by em(4) on the first invocation of em_init_locked(). However, even with
    that problem fixed, it turned out that for at least e. g. 82579 not all
    necessary TSO workarounds are in place, still causing MAC hangs even at
    Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled
    in r323292 (r323293 for stable/10) for the EM-class by default, allowing
    users to turn it on if it happens to work with their particular EM MAC
    in a Gigabit-only environment.
    In head, the TSO workaround for speeds other than Gigabit was lost with
    the conversion to iflib(9) in r311849 (possibly along with another one
    or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4
    got enabled by default again, causing device hangs. Therefore, change the
    default for this hardware class back to have TSO4 off, allowing users
    to turn it on manually if it happens to work in their environment as
    we do in stable/{10,11}. An alternative would be to add a whitelist of
    EM-class devices where TSO4 actually is reliable with the workarounds in
    place, but given that the advantage of TSO at Gigabit speed is rather
    limited - especially with the overhead of these workarounds -, that's
    really not worth it. [1]
    This change includes the addition of an isc_capabilities to struct
    if_softc_ctx so iflib(9) can also handle interface capabilities that
    shouldn't be enabled by default which is used to handle the default-off
    capabilities of e1000 as suggested by shurd@ and moving their handling
    from em_setup_interface() to em_if_attach_pre() accordingly.
  - Although 82543 support TSO4 in theory, the former lem(4) didn't have
    support for TSO4, presumably because TSO4 is even more broken in the
    LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class
    devices was enabled as part of the conversion to iflib(9) in r311849,
    causing device hangs. So revert back to the pre-r311849 behavior of
    not supporting TSO4 for LEM-class at all, which includes not creating
    a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2]
  - In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET
    (65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO
    DMA must have a maxsize of the maximum TSO size plus the size of a
    VLAN header for software VLAN tagging. The iflib(9) converted em(4),
    thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE
    in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET
    in em_setup_interface() (apparently, left-over from pre-iflib(9)
    times). So remove the later and correct iflib(9) to correctly cap
    the maximum TSO size reported to the stack at IP_MAXPACKET. While at
    it, let iflib(9) use if_sethwtsomax*().
    This change includes the addition of isc_tso_max{seg,}size DMA engine
    constraints for the TSO DMA tag to struct if_shared_ctx and letting
    iflib_txsd_alloc() automatically adjust the maxsize of that tag in case
    IFCAP_VLAN_MTU is supported as requested by shurd@.
  - Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet
    header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9)
    so adjustment is automatically done in case IFCAP_VLAN_MTU is supported.
    As a consequence, this adjustment now is also done in case of bnxt(4)
    which missed it previously.
  - Move the reduction of the maximum TSO segment count reported to the
    stack by the number of m_pullup(9) calls (which in the worst case,
    can add another mbuf and, thus, the requirement for another DMA
    segment each) in the transmit path for performance reasons from
    em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now
    done in iflib_parse_header() rather than in the no longer existing
    em_xmit(). Moreover, this optimization applies to all drivers using
    iflib(9) and not just em(4); all in-tree iflib(9) consumers still
    have enough room to handle full size TSO packets. Also, reduce the
    adjustment to the maximum number of m_pullup(9)'s now performed in
  - Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9)
    in r311849 and r335338 respectively, these drivers didn't enable
    IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed
    through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on
    by default but also lagg(4) was fixed in that regard in r203548. So
    just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling
    in {em,ixl,ixlv}_setup_interface().
  - Nuke other redundant IFCAP_* setting in {em,ixl,ixlv}_setup_interface()
    which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre()
  - Remove some redundant/dead setting of scctx->isc_tx_csum_flags in
  - Remove some IFCAP_* duplicated either directly or indirectly (e. g.
  - Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as
    iflib(9) adds that capability unconditionally.
  - Remove some unused macros from em(4).
  - Bump __FreeBSD_version as some of the above changes require the modules
    of drivers using iflib(9) to be recompiled.

  Okayed by:	sbruno@ at 201806 DevSummit Transport Working Group [1]
  Reviewed by:	sbruno (earlier version), erj
  PR:	219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2]
  Differential Revision:	https://reviews.freebsd.org/D15720

Comment 9 Alexander Leidinger freebsd_committer 2019-05-12 19:17:46 UTC
This is not fixed. Even not with r347222.

I now see this from time to time:
em0: link state changed to DOWN
em0: link state changed to UP
em0: Watchdog timeout (TX: 0 desc avail: 1024 pidx: 0) -- resetting
em0: link state changed to DOWN
em0: link state changed to UP

This is an improvement, as now I see at least a watchdog timeout, but it still happens, and sometimes the script to down/up the interface doesn't help and only a reboot (when the script detects that the down/up doesn't work) helps.
Comment 10 Alexander Leidinger freebsd_committer 2020-06-06 19:13:40 UTC
Hi Eric,

as this is an Intel NIC I assigned this to you (as someone from IIRC Intel). If you are the wrong person to assign this to feel free to suggest a better alternative or to assign it to someone you think is better suited (or to tell if is out of support from Intel).

Comment 11 Eric Joyner freebsd_committer 2020-06-11 16:04:31 UTC
(In reply to Alexander Leidinger from comment #10)

Hi Alexander,

It's awkward, but we don't really support 1G cards anymore outside of things like adding support for new devices.

Would you want to ask marius@ for help? This definitely seems like an iflib issue.
Comment 12 Mark Linimon freebsd_committer freebsd_triage 2020-06-11 22:08:49 UTC
Comment 13 Kevin Bowling freebsd_committer 2021-04-16 20:35:55 UTC
(In reply to Alexander Leidinger from comment #10)
Do you still have the ability to test this NIC?  I'd be curious if you can run main after my recent commits around spb and promiscuous/multicast filtering.  But a report from 12.2 or 13.0 would also be very helpful.

Another thing to investigate, is it using MSI?  Take a look at #7 in https://opendevices.ru/wp-content/uploads/2012/11/82541.pdf it seems it may be a good idea to force 82541 to use legacy interrupts.  I can provide a patch if you can test.
Comment 14 Alexander Leidinger freebsd_committer 2021-04-17 09:47:45 UTC
(In reply to Kevin Bowling from comment #13)

Yes I can test this. The system is running -current as of 2021-04-12-093914. I installed this version yesterday. It still has this issue. Does this contain your changes?

I do not think it is using MSI..., does it? See below.

pcib2: <ACPI PCI-PCI bridge> at device 20.4 on pci0
pcib0: allocated type 4 (0xd000-0xdfff) for rid 1c of pcib2
pcib0: allocated type 3 (0xfdd00000-0xfddfffff) for rid 20 of pcib2
pcib0: allocated type 3 (0xfdc00000-0xfdcfffff) for rid 24 of pcib2
pcib2:   domain            0
pcib2:   secondary bus     2
pcib2:   subordinate bus   2
pcib2:   I/O decode        0xd000-0xdfff
pcib2:   memory decode     0xfdd00000-0xfddfffff
pcib2:   prefetched decode 0xfdc00000-0xfdcfffff
pcib2:   special decode    subtractive
pci2: <ACPI PCI bus> on pcib2
pcib2: allocated bus range (2-2) for rid 0 of pci2
pci2: domain=0, physical bus=2
found-> vendor=0x8086, dev=0x107c, revid=0x05
        domain=0, bus=2, slot=6, func=0
        class=02-00-00, hdrtype=0x00, mfdev=0
        cmdreg=0x0007, statreg=0x0230, cachelnsz=1 (dwords)
        lattimer=0x20 (960 ns), mingnt=0xff (63750 ns), maxlat=0x00 (0 ns)
        intpin=a, irq=10
        powerspec 2  supports D0 D3  current D0
        map[10]: type Memory, range 32, base rxfddc0000, size 17, enabled
pcib2: allocated memory range (0xfddc0000-0xfdddffff) for rid 10 of pci0:2:6:0
        map[14]: type Memory, range 32, base rxfdda0000, size 17, enabled
pcib2: allocated memory range (0xfdda0000-0xfddbffff) for rid 14 of pci0:2:6:0
        map[18]: type I/O Port, range 32, base rxdf00, size  6, enabled
pcib2: allocated I/O port range (0xdf00-0xdf3f) for rid 18 of pci0:2:6:0
pcib2: matched entry for 2.6.INTA
pcib2: slot 6 INTA hardwired to IRQ 20
em0: <Intel(R) PRO/1000 Network Connection> port 0xdf00-0xdf3f mem 0xfddc0000-0xfdddffff,0xfdda0000-0xfddbffff irq 20 at device 6.0 on pci2
em0: attach_pre capping queues at 1
em0: Using 1024 TX descriptors and 1024 RX descriptors
em0: allocated for 1 tx_queues
em0: allocated for 1 rx_queues
ioapic0: routing intpin 20 (PCI IRQ 20) to lapic 0 vector 53
ioapic0: routing intpin 20 (PCI IRQ 20) to lapic 3 vector 49
ioapic0: routing intpin 20 (PCI IRQ 20) to lapic 2 vector 49