|Summary:||Intel e1000 network link drops under high network load|
|Product:||Base System||Reporter:||Naveen Nathan <freebsd>|
|Component:||kern||Assignee:||freebsd-net (Nobody) <net>|
|Status:||Closed Feedback Timeout|
|Severity:||Affects Only Me||CC:||freebsd, kaho, marius, sbruno|
Description Naveen Nathan 2017-04-14 10:46:17 UTC
This is a little difficult to articulate. So at a high-level this is the issue I'm seeing:

1. initiate a freebsd-update to upgrade from 10.3-RELEASE to 11.0-RELEASE
2. when metadata files are fetched (using phttpget), the network link completely drops out
3. the ethernet watchdog timer (i assume) detects activity has stalled and drops the queue, disables the link, and enables the link
4. when the link is restored, gunzip fails decompressing the metadata file and is deemed corrupt, and freebsd-update fails.

The hardware I'm running is dated, specifically a supermicro server with a PDSMi motherboard with 2x onboard Intel gigabit NICs.

# pciconf -lv | grep em0 -A4
em0@pci0:13:0:0: class=0x020000 card=0x108c15d9 chip=0x108c8086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82573E Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet
[...]

# dmesg | grep -i em0
em0: <Intel(R) PRO/1000 Network Connection 7.6.1-k> port 0x4000-0x401f mem 0xe0200000-0xe021ffff irq 16 at device 0.0 on pci13
em0: Using an MSI interrupt
em0: Ethernet address: 00:30:48:8b:55:de
[...]

When I run freebsd-update to upgrade from 10.3-RELEASE-p18 to 11.0-RELEASE the network link drops. This happens specifically when the metadata files are being fetched. I have also removed /var/db/freebsd-update/*.gz to see if that would fix it, but that didn't do much.

I recall also having the same network link drops when I was previously on 9.1 and upgrading to 10.3. During the "fetching files" phase, it would simply drop the network link; freebsd-update was however resilient enough to continue trying until it received the files and performed the upgrade.

# freebsd-update upgrade -r 11.0-RELEASE
Looking up update.FreeBSD.org mirrors... 4 mirrors found.
Fetching metadata signature for 10.3-RELEASE from update5.freebsd.org... done.
Fetching metadata index... done.
Inspecting system... done.

The following components of FreeBSD seem to be installed:
kernel/generic world/base world/doc world/games world/lib32

The following components of FreeBSD do not seem to be installed:

Does this look reasonable (y/n)? y

Fetching metadata signature for 11.0-RELEASE from update5.freebsd.org... done.
Fetching metadata index... done.
Fetching 1 metadata patches. done.
Applying metadata patches... done.
Fetching 1 metadata files... gunzip: (stdin): unexpected end of file
metadata is corrupt.

# dmesg | tail -n 14
em0: Watchdog timeout Queue-- resetting
Interface is RUNNING and ACTIVE
em0: TX Queue 0 ------
em0: hw tdh = 381, hw tdt = 387
em0: Tx Queue Status = -2147483648
em0: TX descriptors avail = 1018
em0: Tx Descriptors avail failure = 0
em0: RX Queue 0 ------
em0: hw rdh = 94, hw rdt = 93
em0: RX discarded packets = 0
em0: RX Next to Check = 94
em0: RX Next to Refresh = 93
em0: link state changed to DOWN
em0: link state changed to UP
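As a reading aid for the watchdog dump in the description, here is a small sketch of the arithmetic behind the TX numbers. It assumes a TX ring of 1024 descriptors, which is only inferred from the dump (1018 available plus the gap between tdh and tdt), not confirmed anywhere in this report. The function names are hypothetical helpers, not part of any FreeBSD tool.

```shell
#!/bin/sh
# pending_descriptors HEAD TAIL RING_SIZE
# Number of descriptors between the hardware head and tail pointers
# on a circular ring, allowing for wrap-around.
pending_descriptors() {
    echo $(( ($2 - $1 + $3) % $3 ))
}

# to_u32_hex VALUE
# Show a value that was printed as a signed 32-bit integer as unsigned hex.
to_u32_hex() {
    printf '0x%08x\n' $(( $1 & 0xffffffff ))
}

# Values taken from the dmesg output; the ring size 1024 is an assumption.
pending_descriptors 381 387 1024
to_u32_hex -2147483648
```

With the values from the dump this gives 6 pending descriptors, which is consistent with 1018 available on a 1024-entry ring, and shows the "Tx Queue Status" value as 0x80000000, i.e. only the top bit set.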
Comment 1 Naveen Nathan 2017-04-14 13:41:31 UTC
Further investigation shows it happens during any kind of network load, usually when traffic goes beyond 10Mbps. So this happens when using portsnap, pkg install, etc.

I have also disabled tso4 and vlanhwtso. I think it made things a little more bearable but the issue still persists.

# ifconfig em0
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:30:48:8b:55:de
        inet 18.104.22.168 netmask 0xfffffff0 broadcast 22.214.171.124
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active

[root@armakuni ~]# netstat -I em0
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:30:48:8b:55:de    88189   476     0    49789     5     0
em0       - 104.x.x.16/   xxx.xxx              86581     -     -    50046     -     -
Comment 2 Naveen Nathan 2017-04-14 13:45:30 UTC
Apologies, I forgot to mention. I was able to upgrade to 11.0-RELEASE after running freebsd-update about 30 or so times -- I ended up getting lucky where the network connection didn't drop, and was able to continue with the upgrade.

The comment above about disabling tso4/vlanhwtso applies to 11.0-RELEASE. So the em0 watchdog timeout under network load seems to persist on 11.0, even though bug 200221 resolved it for 10.3.
Comment 3 Kaho Toshikazu 2017-04-17 02:22:35 UTC
(In reply to nn from comment #1)

I think the link drop itself is caused by a Tx error, but the Ierrs column of your netstat output shows many Rx errors, and you should first investigate which errors occur. Can you show the `sysctl dev.em.0` result? Can you find out which error-related counters are increasing? For example, does rx_overrun or crc_errs have a non-zero value?
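One way to answer the question about which counters are increasing is to save the `sysctl dev.em.0` output before and after reproducing the load and compare the two. The sketch below assumes the usual "name: value" output format of sysctl; the function name and the /tmp file names are hypothetical.

```shell
#!/bin/sh
# changed_counters BEFORE_FILE AFTER_FILE
# Each file holds "name: value" lines as printed by `sysctl dev.em.0`.
# Prints "name before -> after" for every numeric value that changed;
# non-numeric entries (descriptions, version strings) are skipped.
changed_counters() {
    awk -F': ' '
        NR == FNR { before[$1] = $2; next }
        ($1 in before) && $2 ~ /^[0-9]+$/ && before[$1] ~ /^[0-9]+$/ && $2 != before[$1] {
            printf "%s %s -> %s\n", $1, before[$1], $2
        }
    ' "$1" "$2"
}

# Intended use on the affected machine (file names are placeholders):
#   sysctl dev.em.0 > /tmp/em0.before
#   ... reproduce the load, e.g. run freebsd-update or portsnap ...
#   sysctl dev.em.0 > /tmp/em0.after
#   changed_counters /tmp/em0.before /tmp/em0.after
```

Ordinary packet and byte counters will also show up in the output, so the error-related entries still need to be picked out by eye.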
Comment 4 Kubilay Kocak 2017-07-05 11:53:21 UTC
Feedback timeout (2 months).

@nn please re-open this issue if you can provide additional or updated information, isolation or reproduction steps.