Bug 246003 - em(4): Intel I219-V6 on NUC8i5BEH randomly loses carrier or fails over to 100Mbit
Summary: em(4): Intel I219-V6 on NUC8i5BEH randomly loses carrier or fails over to 100Mbit
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Kevin Bowling
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2020-04-28 16:34 UTC by Joshua Kinard
Modified: 2022-02-17 17:50 UTC
CC List: 3 users

See Also:


Attachments

Description Joshua Kinard 2020-04-28 16:34:11 UTC
Running an Intel NUC here (NUC8i5BEH), which has an Intel I219-V6 chip that the em(4) driver handles.  Very sporadically, the machine will either lose network connectivity outright or start to experience significantly increased latency when responding to network traffic.  It feels almost like the way a traffic jam builds up.

The machine is a Squid proxy server for my home network, and accessing websites with large numbers of resources to fetch seems to trigger the bug most easily, though I have also triggered it with something as simple as editing a text file in nano over SSH.

When the bug happens, I sometimes see my switch, a Netgear GS324T (S350 series), renegotiate the port from 1Gbps down to 100Mbps (the LED changes from green to orange).  If I ping the device from another machine on the network, the ping is either lost, the host is reported as down, or the reply comes back upwards of a few thousand milliseconds later.

Recovering from the issue usually requires rebooting the machine.  Sometimes, if I just wait several minutes, the machine will eventually respond and behave normally again.  This feels to me like a buffer being flooded too quickly.

I have been using jumbo frames with an MTU of 9000.  As a test, I have lowered that to 1500 to see if the issue remains.  This MIGHT be tied to Bug #218894, per the last comment there in 2018.  If it is, lowering the MTU to 1500, or staying under ~6k per packet, might avoid the issue; it smells like a buffer in em(4) is not sized correctly on I219 chips to handle 9k jumbo frames.
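For reference, the MTU change for this test is just the standard ifconfig/rc.conf usage (interface name is em0 per the dmesg below; address redacted like elsewhere in this report):

# ifconfig em0 mtu 1500

takes effect immediately on the running interface, and making it persistent is just the mtu keyword on the interface line in /etc/rc.conf, e.g.:

ifconfig_em0="inet 192.168.x.yyy/24 mtu 1500"

(that line previously carried "mtu 9000").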

I am experiencing this issue with both the base em(4) driver (7.6.1-k) and the latest intel-em-kmod driver from ports (7.7.5).
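In case the driver setup matters: the ports/Intel if_em_updated.ko module gets loaded ahead of the in-tree em(4) from /boot/loader.conf, roughly like this (knob name follows the usual <module>_load convention; the module file name is visible in the dmesg below, where the in-tree pci/em then fails to register with error 17, as expected):

if_em_updated_load="YES"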


Some technical info (IP/DNS info removed):

dmesg, showing the device going up and down a few times, including when I tried unplugging its cable from the switch:
---<<BOOT>>---
Copyright (c) 1992-2019 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.1-RELEASE-p4 CUSTOM-12_1 amd64
FreeBSD clang version 8.0.1 (tags/RELEASE_801/final 366581) (based on LLVM 8.0.1)
VT(efifb): resolution 1024x768
module zfsctrl already present!
module_register: cannot register pci/em from kernel; already loaded from if_em_updated.ko
Module pci/em failed to register: 17
CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz (2304.11-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x806ea  Family=0x6  Model=0x8e  Stepping=10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c67af<FSGSBASE,TSCADJ,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0x9c002400<MD_CLEAR,TSXFA,IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 17179869184 (16384 MB)
avail memory = 16487288832 (15723 MB)
CPU microcode: updated from 0xc6 to 0xca
Event timer "LAPIC" quality 600
ACPI APIC Table: <INTEL  NUC8i5BE>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: unblocking device.
ioapic0 <Version 2.0> irqs 0-119 on motherboard
Launching APs: 3 2 1
Timecounter "TSC-low" frequency 1152052507 Hz quality 1000
random: entropy device external interface
module_register_init: MOD_LOAD (vesa, 0xffffffff80b2e120, 0) error 19
kbd0 at kbdmux0
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
nexus0
efirtc0: <EFI Realtime Clock> on motherboard
efirtc0: registered as a time-of-day clock, resolution 1.000000s
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS> on motherboard
acpi0: <INTEL NUC8i5BE> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 24000000 Hz quality 950
Event timer "HPET" frequency 24000000 Hz quality 550
Event timer "HPET1" frequency 24000000 Hz quality 440
Event timer "HPET2" frequency 24000000 Hz quality 440
Event timer "HPET3" frequency 24000000 Hz quality 440
Event timer "HPET4" frequency 24000000 Hz quality 440
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1808-0x180b on acpi0
acpi_ec0: <Embedded Controller: GPE 0x14> port 0x62,0x66 on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0x3000-0x303f mem 0xbf000000-0xbfffffff,0x80000000-0x8fffffff at device 2.0 on pci0
vgapci0: Boot video device
xhci0: <XHCI (generic) USB 3.0 controller> mem 0x404a000000-0x404a00ffff at device 20.0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
usbus0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
pci0: <memory, RAM> at device 20.2 (no driver attached)
pci0: <simple comms> at device 22.0 (no driver attached)
ahci0: <AHCI SATA controller> port 0x3090-0x3097,0x3080-0x3083,0x3060-0x307f mem 0xc0120000-0xc0121fff,0xc0123000-0xc01230ff,0xc0122000-0xc01227ff at device 23.0 on pci0
ahci0: AHCI v1.31 with 1 6Gbps ports, Port Multiplier not supported
ahcich2: <AHCI channel> at channel 2 on ahci0
pcib1: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 28.4 on pci0
pcib2: [GIANT-LOCKED]
pcib3: <ACPI PCI-PCI bridge> at device 29.0 on pci0
pci2: <ACPI PCI bus> on pcib3
nvme0: <Generic NVMe Device> mem 0xc0000000-0xc0003fff at device 0.0 on pci2
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus> at device 31.5 (no driver attached)
em0: <Intel(R) PRO/1000 Network Connection 7.7.5> mem 0xc0100000-0xc011ffff at device 31.6 on pci0
em0: Using an MSI interrupt
em0: Ethernet address: 1c:69:7a:08:74:7e
acpi_button0: <Sleep Button> on acpi0
acpi_button1: <Power Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_syscontainer0: <System Container> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
acpi_tz1: _HOT value is absurd, ignored (-73.1C)
atrtc0: <AT realtime clock> at port 0x70 irq 8 on isa0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
uart0: <Non-standard ns8250 class UART with FIFOs> at port 0x3f8 irq 4 flags 0x10 on isa0
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 10.000 msec
acpi_tz1: _TMP value is absurd, ignored (-263.1C)
ugen0.1: <0x8086 XHCI root HUB> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
nvd0: <Samsung SSD 970 EVO 1TB> NVMe namespace
nvd0: 953869MB (1953525168 512 byte sectors)
Trying to mount root from zfs:core/env/fbsd_12.1-20200422 []...
Root mount waiting for: usbus0
Root mount waiting for: usbus0
uhub0: 18 ports with 18 removable, self powered
ugen0.2: <CHESEN PS2 to USB Converter> at usbus0
ukbd0 on uhub0
ukbd0: <CHESEN PS2 to USB Converter, class 0/0, rev 1.10/0.10, addr 1> on usbus0
kbd1 at ukbd0
GEOM_ELI: Device nvd0p2.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware
lo0: link state changed to UP
em0: link state changed to UP
ums0 on uhub0
ums0: <CHESEN PS2 to USB Converter, class 0/0, rev 1.10/0.10, addr 1> on usbus0
ums0: 5 buttons and [XYZ] coordinates ID=1
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to deny, logging disabled
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP

netstat -i shows a handful of Ierrs:
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      1c:69:7a:xx:xx:xx  5959360     4     0  5229960     0     0
em0       - 192.168.x.0/2 xxxxxx             5896418     -     -  5201176     -     -
em0       - fe80::%em0/64 fe80::1e69:7aff:f        0     -     -        0     -     -
em0       - fdxx::xxxx:xx xxxxxx               28862     -     -    28369     -     -
lo0   16384 <Link#2>      lo0                   1384     0     0     1384     0     0
lo0       - localhost     localhost                0     -     -      215     -     -
lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
lo0       - your-net      localhost               42     -     -     1384     -     -
ipfw0     - <Link#3>      ipfw0                    0     0     0        0     0     0
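
If it helps, I can also sample those counters once a second while a stall is in progress, using netstat's standard interval mode (nothing em(4)-specific here):

# netstat -I em0 -w 1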


pciconf:
em0@pci0:0:31:6:        class=0x020000 card=0x20748086 chip=0x15be8086 rev=0x30 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection (6) I219-V'
    class      = network
    subclass   = ethernet


sysctl -a info for em0:
# sysctl -a | grep "\.em\."
hw.em.max_interrupt_rate: 8000
hw.em.eee_setting: 1
hw.em.rx_process_limit: -1
hw.em.sbp: 1
hw.em.smart_pwr_down: 0
hw.em.rx_abs_int_delay: 66
hw.em.tx_abs_int_delay: 66
hw.em.rx_int_delay: 0
hw.em.tx_int_delay: 66
hw.em.disable_crc_stripping: 0
dev.em.0.wake: 0
dev.em.0.interrupts.rx_overrun: 0
dev.em.0.interrupts.rx_desc_min_thresh: 0
dev.em.0.interrupts.tx_queue_min_thresh: 0
dev.em.0.interrupts.tx_queue_empty: 0
dev.em.0.interrupts.tx_abs_timer: 0
dev.em.0.interrupts.tx_pkt_timer: 0
dev.em.0.interrupts.rx_abs_timer: 0
dev.em.0.interrupts.rx_pkt_timer: 0
dev.em.0.interrupts.asserts: 4163967
dev.em.0.mac_stats.tx_frames_1024_1522: -1
dev.em.0.mac_stats.tx_frames_512_1023: -1
dev.em.0.mac_stats.tx_frames_256_511: -1
dev.em.0.mac_stats.tx_frames_128_255: -1
dev.em.0.mac_stats.tx_frames_65_127: -1
dev.em.0.mac_stats.tx_frames_64: -1
dev.em.0.mac_stats.rx_frames_1024_1522: -1
dev.em.0.mac_stats.rx_frames_512_1023: -1
dev.em.0.mac_stats.rx_frames_256_511: -1
dev.em.0.mac_stats.rx_frames_128_255: -1
dev.em.0.mac_stats.rx_frames_65_127: -1
dev.em.0.mac_stats.rx_frames_64: -1
dev.em.0.mac_stats.tso_ctx_fail: 0
dev.em.0.mac_stats.tso_txd: 0
dev.em.0.mac_stats.mcast_pkts_txd: 24
dev.em.0.mac_stats.bcast_pkts_txd: 124
dev.em.0.mac_stats.good_pkts_txd: 5231594
dev.em.0.mac_stats.total_pkts_txd: 5231594
dev.em.0.mac_stats.good_octets_txd: 6647896611
dev.em.0.mac_stats.good_octets_recvd: 6724467487
dev.em.0.mac_stats.mcast_pkts_recvd: 273
dev.em.0.mac_stats.bcast_pkts_recvd: 32687
dev.em.0.mac_stats.good_pkts_recvd: 5960900
dev.em.0.mac_stats.total_pkts_recvd: 5960904
dev.em.0.mac_stats.xoff_txd: 0
dev.em.0.mac_stats.xoff_recvd: 0
dev.em.0.mac_stats.xon_txd: 0
dev.em.0.mac_stats.xon_recvd: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_no_buff: 0
dev.em.0.mac_stats.missed_packets: 4
dev.em.0.mac_stats.defer_count: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.queue_rx_0.rx_irq: 0
dev.em.0.queue_rx_0.rxd_tail: 1001
dev.em.0.queue_rx_0.rxd_head: 1003
dev.em.0.queue_tx_0.no_desc_avail: 0
dev.em.0.queue_tx_0.tx_irq: 0
dev.em.0.queue_tx_0.txd_tail: 537
dev.em.0.queue_tx_0.txd_head: 538
dev.em.0.fc_low_water: 20552
dev.em.0.fc_high_water: 23584
dev.em.0.rx_control: 67141634
dev.em.0.device_control: 1573440
dev.em.0.watchdog_timeouts: 0
dev.em.0.rx_overruns: 4
dev.em.0.tx_dma_fail: 0
dev.em.0.dropped: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.link_irq: 0
dev.em.0.eee_control: 1
dev.em.0.rx_processing_limit: -1
dev.em.0.itr: 488
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_int_delay: 66
dev.em.0.rx_int_delay: 0
dev.em.0.fc: 0
dev.em.0.debug: -1
dev.em.0.nvm: -1
dev.em.0.%parent: pci0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x15be subvendor=0x8086 subdevice=0x2074 class=0x020000
dev.em.0.%location: slot=31 function=6 dbsf=pci0:0:31:6 handle=\_SB_.PCI0.GLAN
dev.em.0.%driver: em
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.7.5
dev.em.%parent:


ping output from another device on the network:
# ping xxxxxx
PING xxxxxx (192.168.x.yyy): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
64 bytes from 192.168.x.yyy: icmp_seq=17 ttl=64 time=4227.244 ms
64 bytes from 192.168.x.yyy: icmp_seq=18 ttl=64 time=3154.970 ms
64 bytes from 192.168.x.yyy: icmp_seq=19 ttl=64 time=2143.813 ms
64 bytes from 192.168.x.yyy: icmp_seq=20 ttl=64 time=1071.436 ms
64 bytes from 192.168.x.yyy: icmp_seq=21 ttl=64 time=1.375 ms
64 bytes from 192.168.x.yyy: icmp_seq=79 ttl=64 time=0.578 ms
64 bytes from 192.168.x.yyy: icmp_seq=284 ttl=64 time=0.958 ms
64 bytes from 192.168.x.yyy: icmp_seq=293 ttl=64 time=0.580 ms
64 bytes from 192.168.x.yyy: icmp_seq=334 ttl=64 time=0.496 ms
64 bytes from 192.168.x.yyy: icmp_seq=335 ttl=64 time=0.454 ms
64 bytes from 192.168.x.yyy: icmp_seq=336 ttl=64 time=0.473 ms
64 bytes from 192.168.x.yyy: icmp_seq=337 ttl=64 time=0.457 ms
64 bytes from 192.168.x.yyy: icmp_seq=338 ttl=64 time=0.459 ms
64 bytes from 192.168.x.yyy: icmp_seq=339 ttl=64 time=0.442 ms
64 bytes from 192.168.x.yyy: icmp_seq=340 ttl=64 time=0.447 ms
64 bytes from 192.168.x.yyy: icmp_seq=341 ttl=64 time=0.452 ms
64 bytes from 192.168.x.yyy: icmp_seq=342 ttl=64 time=0.437 ms
64 bytes from 192.168.x.yyy: icmp_seq=343 ttl=64 time=0.462 ms
Comment 1 Kevin Bowling freebsd_committer freebsd_triage 2020-08-28 23:39:21 UTC
Can you test stable/12 snapshots?
Comment 2 Joshua Kinard 2020-08-29 00:15:58 UTC
(In reply to Kevin Bowling from comment #1)
> Can you test stable/12 snapshots?

Not easily.  The system is currently running 12.1-RELEASE-p8.  I can, however, test patches if you have a specific commit that may address the issue and applies (or mostly applies) to the 12.1-RELEASE source.

Currently using the out-of-tree intel-em-kmod driver on this platform, version 7.7.8 from Intel's download center.  The latest version in the ports tree is 7.7.5, and that one also exhibited issues with jumbo frames.  I ended up giving up and going back to MTU 1500, and have not attempted jumbo frames on 7.7.8 yet.
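
For reference, the driver revision actually attached to the NIC can be double-checked via the description sysctl shown in the dump above:

# sysctl dev.em.0.%desc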
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2021-04-15 02:46:01 UTC
(In reply to Joshua Kinard from comment #2)

Thanks Joshua

To clarify, 7.7.8 from upstream does *not* exhibit the behaviour, but 7.7.5 does?
Comment 4 Joshua Kinard 2021-04-15 16:40:12 UTC
(In reply to Kubilay Kocak from comment #3)

I believe this was a problem in 12.1-RELEASE-pX.  The problem went away after the upgrade to 12.2-RELEASE, both with the in-tree driver (7.7.5) and the external em-7.7.8 from Intel upstream.  I've since upgraded the device to 13.0-RELEASE (and also ran several of the RCs) and haven't had any issues with jumbo frames since.  I even ported em-7.7.8 to compile on 13.0-RELEASE, and that works without issue thus far.
Comment 5 Kevin Bowling freebsd_committer freebsd_triage 2021-04-21 02:59:51 UTC
Per the submitter, this works as intended in 12.2-RELEASE and 13.0-RELEASE.  There is nothing to MFC to stable/11 because it uses a different driver.
Comment 6 Kubilay Kocak freebsd_committer freebsd_triage 2021-04-21 03:56:27 UTC
^Triage: Correct resolution.  Without identified/referenced specific commits or committers, OBE is more appropriate.  mfc-* flags are not appropriate without identified commits (cancelled accordingly).
Comment 7 Kubilay Kocak freebsd_committer freebsd_triage 2021-04-21 03:56:52 UTC
^Triage: Assign to committer resolving (OBE).
Comment 8 Joshua Kinard 2022-02-17 17:50:33 UTC
Just to add a final resolution and to mark this as a hardware issue rather than a bug in FreeBSD: the problem ultimately turned out to be the 1ft CAT6A Monoprice SlimRun cable I was using to connect the NUC to my 24-port switch.

I learned a while ago that swapping to a different cable made the problem go away, but I didn't know why.  After finally finding where my cable tester was hidden, I was able to determine that the SlimRun cable had a faulty ground connection between the two ends, which was probably creating a ground loop between the switch and the NUC.  That caused the switch to overreact and either disable the port or drop it to 100Mbps, which the em(4) driver on the NUC probably didn't handle well.
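
For anyone who lands here with similar symptoms: the negotiated link speed can also be checked from the FreeBSD side instead of relying on the switch LEDs, e.g.:

# ifconfig em0 | grep media

which should normally show 1000baseT <full-duplex>; seeing 100baseTX there during an episode points at the cabling or link rather than the driver.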