Bug 200420 - [igb] igb0: Watchdog timeout -- resetting
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net mailing list
Keywords: IntelNetworking

Reported: 2015-05-23 22:37 UTC by gcognault
Modified: 2017-11-06 17:12 UTC

Description gcognault 2015-05-23 22:37:18 UTC
I can easily reproduce the bug:
Using a Samba share, I calculate an MD5 checksum on 4000 pictures.
The network freezes around the 3000th picture.

The hardware:
ASRock C2550D4i
http://www.asrockrack.com/general/productdetail.asp?Model=C2550D4I#Specifications

FreeBSD 10.1-RELEASE-p10 FreeBSD 10.1-RELEASE-p10 #0 r282897M: Thu May 14 15:53:59 CEST 2015     root@dev.nas4free.org:/usr/obj/nas4free/usr/src/sys/NAS4FREE-amd64  amd64

dev.igb.0.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0
dev.igb.0.%driver: igb
dev.igb.0.%location: slot=0 function=0
dev.igb.0.%pnpinfo: vendor=0x8086 device=0x1533 subvendor=0x1849 subdevice=0x1533 class=0x020000
dev.igb.0.%parent: pci7
dev.igb.0.nvm: -1
dev.igb.0.enable_aim: 1
dev.igb.0.fc: 3
dev.igb.0.rx_processing_limit: 100
dev.igb.0.dmac: 0
dev.igb.0.eee_disabled: 0
dev.igb.0.link_irq: 4
dev.igb.0.dropped: 0
dev.igb.0.tx_dma_fail: 0
dev.igb.0.rx_overruns: 0
dev.igb.0.watchdog_timeouts: 1
dev.igb.0.device_control: 1075577409
dev.igb.0.rx_control: 71335938
dev.igb.0.interrupt_mask: 4
dev.igb.0.extended_int_mask: 2147483679
dev.igb.0.tx_buf_alloc: 0
dev.igb.0.rx_buf_alloc: 0
dev.igb.0.fc_high_water: 31328
dev.igb.0.fc_low_water: 31312
dev.igb.0.queue0.no_desc_avail: 0
dev.igb.0.queue0.tx_packets: 943452
dev.igb.0.queue0.rx_packets: 1019555
dev.igb.0.queue0.rx_bytes: 1047787
dev.igb.0.queue0.lro_queued: 0
dev.igb.0.queue0.lro_flushed: 0
dev.igb.0.queue1.no_desc_avail: 0
dev.igb.0.queue1.tx_packets: 2967552
dev.igb.0.queue1.rx_packets: 3052154
dev.igb.0.queue1.rx_bytes: 990821619
dev.igb.0.queue1.lro_queued: 0
dev.igb.0.queue1.lro_flushed: 0
dev.igb.0.queue2.no_desc_avail: 0
dev.igb.0.queue2.tx_packets: 1444357
dev.igb.0.queue2.rx_packets: 1521528
dev.igb.0.queue2.rx_bytes: 581413
dev.igb.0.queue2.lro_queued: 0
dev.igb.0.queue2.lro_flushed: 0
dev.igb.0.queue3.no_desc_avail: 0
dev.igb.0.queue3.tx_packets: 6118751
dev.igb.0.queue3.rx_packets: 6677679
dev.igb.0.queue3.rx_bytes: 579126
dev.igb.0.queue3.lro_queued: 0
dev.igb.0.queue3.lro_flushed: 0
dev.igb.0.mac_stats.excess_coll: 0
dev.igb.0.mac_stats.single_coll: 0
dev.igb.0.mac_stats.multiple_coll: 0
dev.igb.0.mac_stats.late_coll: 0
dev.igb.0.mac_stats.collision_count: 0
dev.igb.0.mac_stats.symbol_errors: 0
dev.igb.0.mac_stats.sequence_errors: 0
dev.igb.0.mac_stats.defer_count: 0
dev.igb.0.mac_stats.missed_packets: 0
dev.igb.0.mac_stats.recv_no_buff: 0
dev.igb.0.mac_stats.recv_undersize: 0
dev.igb.0.mac_stats.recv_fragmented: 0
dev.igb.0.mac_stats.recv_oversize: 0
dev.igb.0.mac_stats.recv_jabber: 0
dev.igb.0.mac_stats.recv_errs: 0
dev.igb.0.mac_stats.crc_errs: 0
dev.igb.0.mac_stats.alignment_errs: 0
dev.igb.0.mac_stats.coll_ext_errs: 0
dev.igb.0.mac_stats.xon_recvd: 0
dev.igb.0.mac_stats.xon_txd: 0
dev.igb.0.mac_stats.xoff_recvd: 0
dev.igb.0.mac_stats.xoff_txd: 0
dev.igb.0.mac_stats.total_pkts_recvd: 12320459
dev.igb.0.mac_stats.good_pkts_recvd: 12320241
dev.igb.0.mac_stats.bcast_pkts_recvd: 106770
dev.igb.0.mac_stats.mcast_pkts_recvd: 74526
dev.igb.0.mac_stats.rx_frames_64: 89382
dev.igb.0.mac_stats.rx_frames_65_127: 4036945
dev.igb.0.mac_stats.rx_frames_128_255: 478553
dev.igb.0.mac_stats.rx_frames_256_511: 56903
dev.igb.0.mac_stats.rx_frames_512_1023: 64832
dev.igb.0.mac_stats.rx_frames_1024_1522: 7593626
dev.igb.0.mac_stats.good_octets_recvd: 11970320160
dev.igb.0.mac_stats.good_octets_txd: 12279791794
dev.igb.0.mac_stats.total_pkts_txd: 18512524
dev.igb.0.mac_stats.good_pkts_txd: 18512524
dev.igb.0.mac_stats.bcast_pkts_txd: 1043
dev.igb.0.mac_stats.mcast_pkts_txd: 4064
dev.igb.0.mac_stats.tx_frames_64: 7877
dev.igb.0.mac_stats.tx_frames_65_127: 10769674
dev.igb.0.mac_stats.tx_frames_128_255: 63733
dev.igb.0.mac_stats.tx_frames_256_511: 29070
dev.igb.0.mac_stats.tx_frames_512_1023: 54471
dev.igb.0.mac_stats.tx_frames_1024_1522: 7587699
dev.igb.0.mac_stats.tso_txd: 647779
dev.igb.0.mac_stats.tso_ctx_fail: 0
dev.igb.0.interrupts.asserts: 8356963
dev.igb.0.interrupts.rx_pkt_timer: 12269756
dev.igb.0.interrupts.rx_abs_timer: 0
dev.igb.0.interrupts.tx_pkt_timer: 0
dev.igb.0.interrupts.tx_abs_timer: 0
dev.igb.0.interrupts.tx_queue_empty: 18492713
dev.igb.0.interrupts.tx_queue_min_thresh: 12270651
dev.igb.0.interrupts.rx_desc_min_thresh: 0
dev.igb.0.interrupts.rx_overrun: 0
dev.igb.0.host.breaker_tx_pkt: 0
dev.igb.0.host.host_tx_pkt_discard: 0
dev.igb.0.host.rx_pkt: 895
dev.igb.0.host.breaker_rx_pkts: 0
dev.igb.0.host.breaker_rx_pkt_drop: 0
dev.igb.0.host.tx_good_pkt: 571
dev.igb.0.host.breaker_tx_pkt_drop: 0
dev.igb.0.host.rx_good_bytes: 11959919100
dev.igb.0.host.tx_good_bytes: 12277353640
dev.igb.0.host.length_errors: 0
dev.igb.0.host.serdes_violation_pkt: 0
dev.igb.0.host.header_redir_missed: 0
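
For anyone comparing dumps like the one above, here is a minimal sketch that parses `sysctl dev.igb.0` output and lists the non-zero counters that look like trouble indicators. The sample is a handful of lines copied from the dump. The list of "trouble" substrings and the function names are my own assumptions, not anything defined by the driver.

```python
# Lines copied from the sysctl dump in the description (subset only).
SAMPLE = """\
dev.igb.0.%driver: igb
dev.igb.0.dropped: 0
dev.igb.0.tx_dma_fail: 0
dev.igb.0.rx_overruns: 0
dev.igb.0.watchdog_timeouts: 1
dev.igb.0.queue0.no_desc_avail: 0
dev.igb.0.mac_stats.missed_packets: 0
dev.igb.0.mac_stats.recv_no_buff: 0
dev.igb.0.mac_stats.crc_errs: 0
dev.igb.0.mac_stats.good_pkts_recvd: 12320241
"""

# Assumption: counter names containing these substrings indicate problems.
TROUBLE_HINTS = ("drop", "discard", "fail", "overrun", "watchdog",
                 "no_desc_avail", "missed", "no_buff", "errs", "errors", "coll")


def parse_sysctl(text):
    """Turn 'name: value' lines into a dict, converting integers where possible."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # not a 'name: value' line
        try:
            counters[name] = int(value)
        except ValueError:
            counters[name] = value  # keep descriptive strings as-is
    return counters


def nonzero_trouble(counters):
    """Return the integer counters that are non-zero and match a trouble hint."""
    return {
        name: value
        for name, value in counters.items()
        if isinstance(value, int) and value != 0
        and any(hint in name for hint in TROUBLE_HINTS)
    }


if __name__ == "__main__":
    print(nonzero_trouble(parse_sysctl(SAMPLE)))
```

On the lines sampled here, the only counter this flags is `dev.igb.0.watchdog_timeouts`; the error, drop and overrun counters are all zero.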
Comment 1 Arcadiy Ivanov 2016-06-16 17:26:51 UTC
I can report the same for 10.3-RELEASE-p4 and stock kernel.

# cat /boot/loader.conf:
kern.geom.label.gptid.enable="0"
kern.ipc.nmbclusters="1000000"
hw.pci.enable_msi="1"
hw.pci.enable_msix="0"
zfs_load="YES"
aesni_load="YES"
ichsmb_load="YES"
ipmi_load="YES"
beastie_disable="YES"
autoboot_delay="3"

# pciconf -lcv
igb0@pci0:2:0:0:        class=0x020000 card=0x153315d9 chip=0x15338086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks enabled with 1 message
    cap 11[70] = MSI-X supports 5 messages
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR RO NS link x1(x1)
                 speed 2.5(2.5) ASPM L1(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 002590ffffxxxxx2
    ecap 0017[1a0] = TPH Requester 1

igb1@pci0:5:0:0:        class=0x020000 card=0x153315d9 chip=0x15338086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks enabled with 1 message
    cap 11[70] = MSI-X supports 5 messages
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 128(512) FLR RO NS link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 002590ffffxxxxx3
    ecap 0017[1a0] = TPH Requester 1

This configuration is fairly stable (1h so far, no watchdog timeouts on igb1).

Disabling MSI altogether leaves igb1 unable to pass any traffic; it doesn't even come up.

Disabling MSI-X seems to help.

This is a SuperMicro X11SBA-F board (http://www.supermicro.com/products/motherboard/X11/X11SBA-F.cfm) with BIOS 1.0b (latest).
Comment 2 Arcadiy Ivanov 2016-06-16 17:37:49 UTC
Spoke too soon. Disabling MSI-X seems to help only marginally:

Jun 16 13:29:49 <kern.crit> fw1 kernel: igb1: Watchdog timeout -- resetting
Jun 16 13:29:49 <kern.crit> fw1 kernel: igb1: Queue(846295657) tdh = -1249464976, hw tdt = 589458993
Jun 16 13:29:49 <kern.crit> fw1 kernel: igb1: TX(846295657) desc avail = 0,Next TX to Clean = 0
Jun 16 13:29:49 <kern.notice> fw1 kernel: igb1: link state changed to DOWN
Jun 16 13:29:53 <kern.notice> fw1 kernel: igb1: link state changed to UP
Jun 16 13:29:53 <user.notice> fw1 devd: Executing '/etc/rc.d/dhclient quietstart igb1'
Jun 16 13:34:26 <kern.crit> fw1 kernel: igb1: Watchdog timeout -- resetting
Jun 16 13:34:26 <kern.crit> fw1 kernel: igb1: Queue(846295657) tdh = -1249464976, hw tdt = 589458993
Jun 16 13:34:26 <kern.crit> fw1 kernel: igb1: TX(846295657) desc avail = 0,Next TX to Clean = 0
Jun 16 13:34:26 <kern.notice> fw1 kernel: igb1: link state changed to DOWN
Jun 16 13:34:31 <kern.notice> fw1 kernel: igb1: link state changed to UP
Jun 16 13:34:31 <user.notice> fw1 devd: Executing '/etc/rc.d/dhclient quietstart igb1'
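
To put numbers on "only marginally", here is a rough sketch that pulls the watchdog resets and link flaps out of log lines like the ones above. It reports the time between resets and how long the link stayed down each time. The year is not in the syslog lines, so 2016 is assumed from the comment date. The regular expression and function names are hypothetical helpers for this illustration.

```python
import re
from datetime import datetime

# Subset of the log lines quoted above.
LOG = """\
Jun 16 13:29:49 <kern.crit> fw1 kernel: igb1: Watchdog timeout -- resetting
Jun 16 13:29:49 <kern.notice> fw1 kernel: igb1: link state changed to DOWN
Jun 16 13:29:53 <kern.notice> fw1 kernel: igb1: link state changed to UP
Jun 16 13:34:26 <kern.crit> fw1 kernel: igb1: Watchdog timeout -- resetting
Jun 16 13:34:26 <kern.notice> fw1 kernel: igb1: link state changed to DOWN
Jun 16 13:34:31 <kern.notice> fw1 kernel: igb1: link state changed to UP
"""

# timestamp, then anything up to "kernel: ", then "<nic>: <message>"
LINE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*kernel: (\w+): (.*)$")


def events(text, year=2016):
    """Yield (timestamp, nic, message) for each kernel line; year is an assumption."""
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            yield stamp, m.group(2), m.group(3)


def summarize(text):
    """Return (seconds between consecutive resets, seconds of each link outage)."""
    resets, outages, down_at = [], [], None
    for stamp, _nic, msg in events(text):
        if msg.startswith("Watchdog timeout"):
            resets.append(stamp)
        elif msg.endswith("DOWN"):
            down_at = stamp
        elif msg.endswith("UP") and down_at is not None:
            outages.append((stamp - down_at).total_seconds())
            down_at = None
    gaps = [(b - a).total_seconds() for a, b in zip(resets, resets[1:])]
    return gaps, outages


if __name__ == "__main__":
    print(summarize(LOG))
```

For the excerpt above, the two resets are 277 seconds apart and the link was down for 4 and 5 seconds respectively.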
Comment 3 Arcadiy Ivanov 2016-06-16 17:43:19 UTC
Possibly related to #200221
Comment 4 Arcadiy Ivanov 2016-06-16 17:44:32 UTC
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>

igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>

I have TSO, VLAN_HWTSO and LRO disabled.
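
As a cross-check, the hex value in `options=bb` can be decoded to confirm that the TSO and LRO bits are clear. The sketch below uses the capability bit values as I recall them from FreeBSD's `net/if.h`. Treat the table as an assumption and verify it against the header for your release. The first six entries are at least consistent with the names ifconfig printed above.

```python
# Capability bits, assumed from FreeBSD's net/if.h (IFCAP_*); partial table.
IFCAP = {
    0x00001: "RXCSUM",
    0x00002: "TXCSUM",
    0x00008: "VLAN_MTU",
    0x00010: "VLAN_HWTAGGING",
    0x00020: "JUMBO_MTU",
    0x00080: "VLAN_HWCSUM",
    0x00100: "TSO4",
    0x00200: "TSO6",
    0x00400: "LRO",
    0x40000: "VLAN_HWTSO",
}


def decode_options(mask):
    """Return the names of the known capability bits set in mask, lowest bit first."""
    return [name for bit, name in sorted(IFCAP.items()) if mask & bit]


if __name__ == "__main__":
    enabled = decode_options(int("bb", 16))  # ifconfig prints options in hex
    print(enabled)
    print("offloads still on:",
          [n for n in enabled if n in ("TSO4", "TSO6", "LRO", "VLAN_HWTSO")])
```

Decoding `0xbb` gives exactly the six names shown in the ifconfig output, with none of TSO4, TSO6, LRO or VLAN_HWTSO set.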
Comment 5 Arcadiy Ivanov 2016-06-16 18:41:53 UTC
Some PCI-E errors:

pcib4@pci0:3:0:0:       class=0x060400 card=0x00000000 chip=0x260812d8 rev=0x00 hdr=0x01
    vendor     = 'Pericom Semiconductor'
    class      = bridge
    subclass   = PCI-PCI
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Receiver Error
                 Bad TLP
                 Bad DLLP
                 REPLAY_NUM Rollover
                 Replay Timer Timeout
                 Advisory Non-Fatal Error

igb0@pci0:2:0:0:        class=0x020000 card=0x153315d9 chip=0x15338086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
     Corrected = Advisory Non-Fatal Error

igb1@pci0:5:0:0:        class=0x020000 card=0x153315d9 chip=0x15338086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I210 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
  PCI-e errors = Correctable Error Detected
     Corrected = Advisory Non-Fatal Error
Comment 6 Arcadiy Ivanov 2016-06-16 19:02:32 UTC
# netstat -m
2048/3772/5820 mbufs in use (current/cache/total)
2046/2514/4560/1000000 mbuf clusters in use (current/cache/total/max)
2046/2508 mbuf+clusters out of packet secondary zone in use (current/cache)
0/4/4/250101 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/74104 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41683 16k jumbo clusters in use (current/cache/total/max)
4604K/5987K/10591K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
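
The point of the output above is that mbuf exhaustion does not look like the cause. A small sketch that extracts the relevant figures from `netstat -m` text follows. The sample lines are copied from the output above, and the function names are hypothetical.

```python
import re

# Lines copied from the netstat -m output above (subset only).
SAMPLE = """\
2046/2514/4560/1000000 mbuf clusters in use (current/cache/total/max)
0/4/4/250101 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
"""


def cluster_usage(text):
    """Fraction of the mbuf cluster limit in use (total / max)."""
    m = re.search(r"^(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", text, re.M)
    if m is None:
        raise ValueError("no 'mbuf clusters in use' line found")
    _current, _cache, total, limit = map(int, m.groups())
    return total / limit


def denied_requests(text):
    """Denied request counts as (mbufs, clusters, mbuf+clusters)."""
    m = re.search(r"^(\d+)/(\d+)/(\d+) requests for mbufs denied", text, re.M)
    if m is None:
        raise ValueError("no 'requests for mbufs denied' line found")
    return tuple(map(int, m.groups()))


if __name__ == "__main__":
    print(f"cluster usage: {cluster_usage(SAMPLE):.3%}")
    print("denied:", denied_requests(SAMPLE))
```

With 4560 of 1000000 clusters allocated (under half a percent) and no denied requests, the pool is nowhere near its limit.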
Comment 7 Arcadiy Ivanov 2016-10-19 08:46:16 UTC
There appears to be some indication that the X11SBA series has a hardware bug in a PCI-E controller supplying igb1-4 (not affecting igb1). 

See https://sourceforge.net/p/e1000/bugs/502/
Comment 8 Arcadiy Ivanov 2016-10-19 08:47:04 UTC
*not affecting igb0
Comment 9 Arcadiy Ivanov 2016-12-15 20:49:44 UTC
With respect to the X11SBA-F board, I can confirm that the issue arises with hardware version 1.01 of the board and is gone with 1.02. The issue with 1.01 cannot be fixed by any EEPROM, BIOS or other firmware update.
Comment 10 Kevin Bowling freebsd_committer 2017-01-10 11:28:47 UTC
For the original submitter with the ASRock board, can you retry with -CURRENT?