Bug 218653

Summary: Intel e1000 network link drops under high network load
Product: Base System
Component: kern
Status: Closed Feedback Timeout
Severity: Affects Only Me
Priority: ---
Version: 11.0-RELEASE
Hardware: amd64
OS: Any
Reporter: Naveen Nathan <freebsd>
Assignee: freebsd-net mailing list <net>
CC: freebsd, kaho, marius, sbruno
Keywords: IntelNetworking, needs-qa
Flags: koobs: mfc-stable11?, koobs: mfc-stable10?
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200221

Description Naveen Nathan 2017-04-14 10:46:17 UTC
This is a little difficult to articulate, so at a high level this is the issue I'm seeing:

1. initiate a freebsd-update to upgrade from 10.3-RELEASE to 11.0-RELEASE
2. when metadata files are fetched (using phttpget), the network link completely drops out
3. the ethernet watchdog timer (I assume) detects that activity has stalled, drops the queue, and takes the link down and back up
4. when the link is restored, gunzip fails to decompress the metadata file, the file is deemed corrupt, and freebsd-update fails.

The hardware I'm running is dated: a Supermicro server with a PDSMi motherboard and 2x onboard Intel gigabit NICs.

# pciconf -lv | grep em0 -A4
em0@pci0:13:0:0:        class=0x020000 card=0x108c15d9 chip=0x108c8086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82573E Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet

# dmesg | grep -i em0
em0: <Intel(R) PRO/1000 Network Connection 7.6.1-k> port 0x4000-0x401f mem 0xe0200000-0xe021ffff irq 16 at device 0.0 on pci13
em0: Using an MSI interrupt
em0: Ethernet address: 00:30:48:8b:55:de

When I run freebsd-update to upgrade from 10.3-RELEASE-p18 to 11.0-RELEASE the network link drops. This happens specifically when the metadata files are being fetched. I have also removed /var/db/freebsd-update/*.gz to see if that would fix it, but that didn't do much.

I recall also having the same network link drops when I was previously on 9.1 and upgrading to 10.3. During the "fetching files" phase, it would simply drop the network link; freebsd-update was however resilient enough to continue trying until it received the files and performed the upgrade.

# freebsd-update upgrade -r 11.0-RELEASE                                                                              
Looking up update.FreeBSD.org mirrors... 4 mirrors found.
Fetching metadata signature for 10.3-RELEASE from update5.freebsd.org... done.
Fetching metadata index... done.
Inspecting system... done.

The following components of FreeBSD seem to be installed:
kernel/generic world/base world/doc world/games world/lib32

The following components of FreeBSD do not seem to be installed:

Does this look reasonable (y/n)? y

Fetching metadata signature for 11.0-RELEASE from update5.freebsd.org... done.
Fetching metadata index... done.
Fetching 1 metadata patches. done.
Applying metadata patches... done.
Fetching 1 metadata files... gunzip: (stdin): unexpected end of file
metadata is corrupt.

# dmesg | tail -n 14
em0: Watchdog timeout Queue[0]-- resetting
Interface is RUNNING and ACTIVE
em0: TX Queue 0 ------
em0: hw tdh = 381, hw tdt = 387
em0: Tx Queue Status = -2147483648
em0: TX descriptors avail = 1018
em0: Tx Descriptors avail failure = 0
em0: RX Queue 0 ------
em0: hw rdh = 94, hw rdt = 93
em0: RX discarded packets = 0
em0: RX Next to Check = 94
em0: RX Next to Refresh = 93
em0: link state changed to DOWN
em0: link state changed to UP
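
A note on reading the dump above, offered as a sketch, not a diagnosis: the Tx Queue Status value is printed as a signed integer, and the gap between tdh and tdt is the number of TX descriptors still owned by the hardware. The helper names to_hex32 and tx_pending are made up for illustration, and what the set bit means to the driver is not stated anywhere in this report.

```sh
#!/bin/sh
# Reinterpret a signed 32-bit value as unsigned and print it in hex.
to_hex32() {
    printf '0x%08x\n' $(( $1 & 0xffffffff ))
}

# Number of descriptors between the hardware head (tdh) and tail (tdt).
# Assumes the ring has not wrapped between the two indices.
tx_pending() {
    echo $(( $2 - $1 ))
}

# Values copied from the dmesg output above.
to_hex32 -2147483648
tx_pending 381 387
```

With the values from the log this prints 0x80000000 (only the top bit set) and 6 pending descriptors.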
Comment 1 Naveen Nathan 2017-04-14 13:41:31 UTC
Further investigation shows it happens under any kind of network load, usually when traffic goes beyond 10 Mbps.

So this also happens when using portsnap, pkg install, etc.

I have also disabled tso4 and vlanhwtso. I think it made things a little more bearable but the issue still persists.

# ifconfig em0
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 00:30:48:8b:55:de
        inet netmask 0xfffffff0 broadcast 
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active

[root@armakuni ~]# netstat -I em0
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:30:48:8b:55:de    88189   476     0    49789     5     0
em0       - 104.x.x.16/ xxx.xxx                86581     -     -    50046     -     -
Comment 2 Naveen Nathan 2017-04-14 13:45:30 UTC
Apologies, I forgot to mention: I was able to upgrade to 11.0-RELEASE after running freebsd-update about 30 or so times. I eventually got lucky, the network connection didn't drop, and I was able to continue with the upgrade.

The above comments about disabling tso4/vlanhwtso apply to 11.0-RELEASE.

Therefore the em0 watchdog timeout under network load seems to persist in 11.0, even though bug 200221 resolved it for 10.3.
Comment 3 Kaho Toshikazu 2017-04-17 02:22:35 UTC
(In reply to nn from comment #1)
I think the link drop itself is caused by a Tx error, but you also have
many Rx errors, as shown by the Ierrs column of the netstat output, and
you should first investigate which errors are occurring.

Can you post the output of `sysctl dev.em.0`? Can you tell which
error-related counters are increasing?
For example, do rx_overrun or crc_errs have a non-zero value?
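
One way to see which counters are increasing is to save the `sysctl dev.em.0` output before and after reproducing the drop and compare the two. The following is a minimal sketch under that assumption; the function name em_counter_growth and the /tmp file names are hypothetical, and it only assumes the usual "name: value" layout of sysctl output.

```sh
#!/bin/sh
# Print every numeric "name: value" entry whose value grew between two
# saved snapshots of `sysctl dev.em.0` output.
# Usage: em_counter_growth BEFORE_FILE AFTER_FILE
em_counter_growth() {
    awk -F': ' '
        # First file: remember each value by its sysctl name.
        NR == FNR { before[$1] = $2; next }
        # Second file: report integer values that increased.
        ($1 in before) && $2 ~ /^[0-9]+$/ && before[$1] ~ /^[0-9]+$/ && ($2 + 0 > before[$1] + 0) {
            printf "%s: %s -> %s (+%d)\n", $1, before[$1], $2, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Intended use on the affected machine:
#   sysctl dev.em.0 > /tmp/em0.before
#   (reproduce the link drop, e.g. with freebsd-update or pkg install)
#   sysctl dev.em.0 > /tmp/em0.after
#   em_counter_growth /tmp/em0.before /tmp/em0.after
```

Only entries that changed are printed, so the error counters that grew during the drop stand out from the rest of the output.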
Comment 4 Kubilay Kocak freebsd_committer freebsd_triage 2017-07-05 11:53:21 UTC
Feedback timeout (2 months)

@nn please re-open this issue if you can provide additional or updated information, isolation or reproduction steps.