Hi,

While sending moderate NFS traffic (< 2 MB/s), the interface suddenly stopped transmitting and receiving. However, the interface seemed fine:

$ ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
	options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
	ether 00:25:90:34:5d:44
	inet YYYY netmask 0xffffff00 broadcast YYY.255
	inet6 fe80::225:90ff:fe34:5d44%em0 prefixlen 64 scopeid 0x1
	inet6 XXXX prefixlen 64 autoconf
	nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active

Pinging the gateway didn't work:

$ ping ZZZZ
PING ZZZZ (ZZZZ): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down

But the driver seemed happy with the card, as no particular message was printed.

# tcpdump -ni em0
-> No rx traffic, only tx.

Printing the em driver's internal variables was more interesting:

$ sysctl dev.em.0.debug=1
Interface is RUNNING and ACTIVE
em0: hw tdh = 325, hw tdt = 166
em0: hw rdh = 688, hw rdt = 687
em0: Tx Queue Status = 1
em0: TX descriptors avail = 150
em0: Tx Descriptors avail failure = 0
em0: RX discarded packets = 0
em0: RX Next to Check = 688
em0: RX Next to Refresh = 687

Sending a ping request incremented hw tdt as expected. Wondering what would happen when it reached the tdh value, I ping-flooded the gateway. The driver figured out something was going wrong and reset the card:

# ping -f ZZZZ
em0: Watchdog timeout -- resetting
em0: Queue(0) tdh = 325, hw tdt = 285
em0: TX(0) desc avail = 31,Next TX to Clean = 316
em0: link state changed to DOWN
em0: link state changed to UP
Interface is RUNNING and ACTIVE
em0: hw tdh = 113, hw tdt = 113
em0: hw rdh = 36, hw rdt = 35
em0: Tx Queue Status = 0
em0: TX descriptors avail = 1024
em0: Tx Descriptors avail failure = 0
em0: RX discarded packets = 0
em0: RX Next to Check = 36
em0: RX Next to Refresh = 35

From here, the interface was working as usual.
$ ping ZZZZ
PING ZZZZ (ZZZZ): 56 data bytes
64 bytes from ZZZZ: icmp_seq=0 ttl=255 time=0.241 ms

$ dmesg
FreeBSD 10.1-RELEASE-p6 #0: Tue Feb 24 19:00:21 UTC 2015
[...]
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xdc00-0xdc1f mem 0xfe9e0000-0xfe9fffff,0xfe9dc000-0xfe9dffff irq 16 at device 0.0 on pci2
em0: Using MSIX interrupts with 3 vectors
em0: Ethernet address: 00:25:90:34:5d:44
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.5 on pci0
pci3: <ACPI PCI bus> on pcib3
em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xec00-0xec1f mem 0xfeae0000-0xfeafffff,0xfeadc000-0xfeadffff irq 17 at device 0.0 on pci3
em1: Using MSIX interrupts with 3 vectors
em1: Ethernet address: 00:25:90:34:5d:45

$ pciconf -elv
[...]
em0@pci0:2:0:0: class=0x020000 card=0x060a15d9 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82574L Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Receiver Error
                   Bad TLP
                   Bad DLLP
                   REPLAY_NUM Rollover
                   Replay Timer Timeout
                   Advisory Non-Fatal Error
em1@pci0:3:0:0: class=0x020000 card=0x060a15d9 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82574L Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Receiver Error
                   Bad TLP
                   Bad DLLP
                   Replay Timer Timeout
                   Advisory Non-Fatal Error

The port is connected to a GS108 switch. Link was up the whole time and no transmit error was detected.

The motherboard is a Supermicro X7SPA-HF with the latest BIOS. On this board, a BMC shares the em0 port. The BMC was not responding either. Hence my guess is that the fault may lie with the card rather than the driver, since the BMC suffered too.

This also happens on an OpenBSD em0 with the same motherboard (but not connected to the same switch).
The card is definitely not sending:

1) dev.em.0.queue0.tx_irq is *not* increasing, so there is no signal from the card that data has been sent.
2) Hence sysctl dev.em.0.debug=1 shows that TX descriptors avail is decreasing.
3) When it drops below EM_MAX_SCATTER, em_txeof() is finally called:
   https://svnweb.freebsd.org/base/releng/10.1/sys/dev/e1000/if_em.c?view=markup#l962
4) em_txeof() examines the descriptors, detects that the card has not processed any of them, and marks the queue as EM_QUEUE_HUNG:
   https://svnweb.freebsd.org/base/releng/10.1/sys/dev/e1000/if_em.c?view=markup#l3925
5) The queue status is read and the card reset is performed:
   https://svnweb.freebsd.org/base/releng/10.1/sys/dev/e1000/if_em.c?view=markup#l2297
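The "avail" figures in the debug output are consistent with plain ring arithmetic. Below is a minimal sketch, assuming that Next TX to Clean was already 316 at the first dump (it is only printed in the watchdog output) and that the ring holds 1024 descriptors (the value reported after the reset). tx_ring_free() is a hypothetical helper for illustration, not a driver function; the driver keeps a running counter rather than recomputing this.

```c
#include <stdint.h>

/*
 * Free TX descriptors between the producer index (tail, where the driver
 * queues the next packet) and the consumer index (next_to_clean, the oldest
 * descriptor not yet reclaimed). Hypothetical helper, illustration only.
 */
uint32_t
tx_ring_free(uint32_t next_to_clean, uint32_t tail, uint32_t ring_size)
{
	/* Equal indices are treated here as an empty ring: every slot is free. */
	if (next_to_clean == tail)
		return ring_size;
	/* Distance from tail forward to next_to_clean, wrapping at ring_size. */
	return (next_to_clean + ring_size - tail) % ring_size;
}
```

With these assumptions, tx_ring_free(316, 166, 1024) gives the 150 seen in the first dump and tx_ring_free(316, 285, 1024) gives the 31 seen in the watchdog output, which fits a card that stopped completing descriptors while the driver kept queueing.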
Created attachment 155199 [details] sysctl dev.em.0
Regarding RX:

sysctl dev.em.0.mac_stats.total_pkts_recvd *is* increasing
sysctl dev.em.0.mac_stats.good_pkts_recvd *isn't*
While the interface is not responding, dev.em.0.link_irq is incremented every second.
Let's keep digging.

When hung, the 82574 generates link interrupts at a rate of ~1/s. The first ICR value read by the link interrupt routine is 0x800000c5 <INT_ASSERTED,RXT0,RXO,LSC,TXDW>. After that, *every* ICR value read is 0x40 <RXO>.

TXDW: Transmit Descriptor Written Back
  Set when hardware processes a descriptor with RS set. If using delayed interrupts (IDE set), the interrupt is delayed until after one of the delayed-timers (TIDV or TADV) expires.

LSC: Link Status Change
  This bit is set whenever the link status changes (either from up to down, or from down to up). This bit is affected by the link indication from the PHY.

RXO: Receiver Overrun
  Set on receive data FIFO overrun. Could be caused either because there are no available buffers or because PCIe receive bandwidth is inadequate.

RXT0: Receiver Timer Interrupt
  Set when the timer expires.

INT_ASSERTED: Interrupt Asserted
  This bit is set when the LAN port has a pending interrupt. If the interrupt is enabled in the PCI configuration space, an interrupt is asserted.
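For anyone wanting to decode such ICR values by hand, here is a small sketch. It covers only the five bits named in this comment, with masks chosen so that they add up to the 0x800000c5 value quoted earlier in it. icr_decode() and the table are hypothetical names for illustration, not part of the driver.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct icr_bit {
	uint32_t	mask;
	const char	*name;
};

/* Only the bits discussed in this comment; the register has more. */
static const struct icr_bit icr_bits[] = {
	{ 0x80000000u, "INT_ASSERTED" },
	{ 0x00000080u, "RXT0" },
	{ 0x00000040u, "RXO" },
	{ 0x00000004u, "LSC" },
	{ 0x00000001u, "TXDW" },
};

/* Write the comma-separated names of the known bits set in icr into buf. */
char *
icr_decode(uint32_t icr, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(icr_bits) / sizeof(icr_bits[0]); i++) {
		int n;

		if ((icr & icr_bits[i].mask) == 0)
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
		    used ? "," : "", icr_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)n;
	}
	return buf;
}
```

Decoding 0x800000c5 this way yields the five flags listed for the first read, and 0x40 yields RXO alone, i.e. the card keeps reporting a receive FIFO overrun and nothing else.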
Just out of curiosity, try disabling TSO on the interface as a workaround. The symptoms are similar to what I saw in https://lists.freebsd.org/pipermail/freebsd-stable/2014-September/080081.html
I disabled TSO4 and VLAN_HWTSO 16 days ago when I saw your reply. The card used to hang, especially when I was watching crappy movies. Hence I have watched Fast And Furious 1 to 7 since then to stress the chip. No hang so far. So yes, it seems related to HWTSO.
Note that the current driver doesn't increment adapter->rx_overruns in em_msix_link().
(In reply to david.keller from comment #8)
Can you please open a separate issue and cc me, jfv@freebsd.org and erj@freebsd.org if you feel that this can cause fewer rx_overruns errors to be reported?
Intel device Errata 17 (http://www.intel.fr/content/dam/www/public/us/en/documents/specification-updates/82574-gbe-controller-spec-update.pdf) says:

> When using TSO, a situation can occur where a PCIe MRd request is repeated with
> the same address, resulting in data corruption. At the end of the TCP packet,
> the Tx DMA hangs because the length doesn't match. This can only occur when
> the following are true:
> * The first buffer of the packet is larger than [3 * (max_read_request - 4)].
> * There is a 4 KB boundary within 64 bytes following the end of the header bytes in
>   the buffer

On my device, PCIe max_read_request is 512. Hence the first buffer has to be <= 3 * (512 - 4) = 1524 bytes. The interface MTU is 9000, so it seems to me that the first buffer can be larger than 1524 bytes if the FreeBSD stack doesn't split the headers into a dedicated mbuf.

The DMA alignment requirement is not met in bus_dma_tag_create(), where the alignment argument is 1 (https://svnweb.freebsd.org/base/head/sys/dev/e1000/if_em.c?view=markup#l3260). Maybe bus_dma_tag_create() should use an alignment of 128 bytes, as explained in the errata:

> The alignment of the buffer containing the headers should be such that there is no
> 4 KB boundary within 64 bytes following the end of the header bytes. Assuming
> standard Ethernet/IP/TCP headers of 54 bytes, this means that the buffer should
> not start 54-118 bytes before a 4 KB boundary. For example, 128-byte alignment
> for this buffer could be used to fulfill this condition
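To make the two errata conditions concrete, here is a rough sketch of both checks. It assumes the 54-byte Ethernet/IP/TCP header case the errata uses and takes its "54-118 bytes before a 4 KB boundary" range literally. The function names are hypothetical and do not exist in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOUNDARY_4K	4096u

/* Largest first-buffer size that stays outside the first errata condition. */
uint32_t
errata17_first_buf_limit(uint32_t max_read_request)
{
	return 3u * (max_read_request - 4u);
}

/*
 * True if a header buffer starting at addr begins 54 to 118 bytes before
 * the next 4 KB boundary (the range quoted in the errata for 54-byte headers).
 */
bool
errata17_bad_start(uint64_t addr)
{
	uint32_t to_boundary = BOUNDARY_4K - (uint32_t)(addr % BOUNDARY_4K);

	return to_boundary >= 54u && to_boundary <= 118u;
}
```

errata17_first_buf_limit(512) returns 1524, matching the figure in my comment. For any 128-byte aligned start address, the distance to the next 4 KB boundary is a multiple of 128, so it can never fall in the 54 to 118 range; that is why the errata suggests 128-byte alignment.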
Did a couple of tests via iperf.

test 1
#1 set hw.em.txd="4096"
#2 set hw.em.rxd="4096"
#3 ran with TSO enabled
-- ran into hangs and debug output from the watchdog, as the reporter stated.
-- seems to happen less frequently, once in a 24-hour test

test 2
#1 setup as test 1
#2 disable TSO (ifconfig em0 -tso)
-- no hangs in a 24-hour test

test 3
#1 set hw.em.txd="256"
#2 set hw.em.rxd="256"
#3 reboot and leave TSO enabled
-- ran into hangs and debug output from the watchdog within minutes

Hypothesis:
-- TSO implementation isn't taking ring wrap events into consideration?
Could you give the parameters that you used for the NFS mount?
(In reply to Eric Joyner from comment #12)
Sure Eric.

On server (FreeBSD 10.1):
data on /data (zfs, NFS exported, local, nfsv4acls)

On client (Fedora 21):
X.X.X.X:/data on /media/nas type nfs4 (rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=Y.Y.Y.Y,local_lock=none,addr=X.X.X.X)
(In reply to Sean Bruno from comment #11)
Sean, can you provide the iperf command lines you used?
(In reply to david.keller from comment #14)
Nothing fancy here.

Server runs "iperf -p 8000 -s" (8-core AMD box)

Client under test runs this forever:

#!/bin/sh
FILE=test.out
if [ -f ${FILE} ]; then
	rm $FILE
fi

while [ 1 ]; do
	date
	iperf -p 8000 -c 192.168.100.1 -t 600 -P ${1} >> $FILE
done
Since freebsd-update 10.1-RELEASE-p6 -> p9, no freeze happened. Even after hours running iperf. That's quite strange as nothing new happened on e1000 since p6.
(In reply to david.keller from comment #16)
It's still happening on -current from what I can see in my testing, at least with MSI enabled. I'll push a patch that the Intel devs and I are working on after some verification.
(In reply to Sean Bruno from comment #17) Just happened again.
(In reply to david.keller from comment #18) ok. What version of FreeBSD is failing?
(In reply to Sean Bruno from comment #19) 10.1-RELEASE-p10
(In reply to david.keller from comment #20)
oh, I see. Changes to em(4) have not propagated to the releng/10.1 branch and won't appear in an installable version until 10.2r:

https://svnweb.freebsd.org/base/releng/10.1/?view=log
vs
https://svnweb.freebsd.org/base/stable/10/sys/dev/e1000/?view=log

So until 10.2r is out, you could apply r283504 and r284444 to see if the problem goes away for you, and if not, apply r284522 and test as well.

Is this something you can do?
(In reply to Sean Bruno from comment #21)
Sure, I'm now running a releng 10.1 kernel with dev/e1000/ from r284444.
Still hanging with r284444.
Assign to Sean so the issue doesn't remain unassigned with no clear owner.

Sean, can you clarify in which branches the fix has been committed? I can't see an automated commit reference message (with a PR: 199174 line). From my reading of the comments, HEAD and stable/10 have the change so far. Can it also be MFC'd to stable/9?

I've set MFC flags; please set them to + (after MFC'ing) or - as necessary, with comments.
(In reply to david.keller from comment #23) Ah, is this an NFS test or a raw iperf test? I still need to take care of something that Rick M. asked me to do.
(In reply to Sean Bruno from comment #25)
So far, it has only happened when the NFS mount was stressed (torrent traffic). I didn't manage to reproduce it using iperf. It doesn't seem to be directly related to bandwidth, as it occurs with less than 4 MB/s of read/write traffic.
(In reply to david.keller from comment #26) ok, good to know. I have an idea of what to do thanks to Rick. Let me test some things and try to reproduce fail/fix.
(In reply to Sean Bruno from comment #27)
I can try to help. Could you elaborate on what Rick M. told you?
(In reply to david.keller from comment #28)
Rick has suggested this. It seems to work, but I'm unsure of the impact. Can you try the following patch to if_em.c and see if it helps? I'm trying to diagnose this here.

Index: dev/e1000/if_em.c
===================================================================
--- dev/e1000/if_em.c	(revision 285481)
+++ dev/e1000/if_em.c	(working copy)
@@ -3034,6 +3034,9 @@
 	if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST);
 	if_setioctlfn(ifp, em_ioctl);
 	if_setgetcounterfn(ifp, em_get_counter);
+	ifp->if_hw_tsomax = 65518; /* 32*MCLBYTES - max_mac_hdr_len*/
+	ifp->if_hw_tsomaxsegcount = EM_MAX_SCATTER;
+	ifp->if_hw_tsomaxsegsize = 65536;
 #ifdef EM_MULTIQUEUE
 	/* Multiqueue stack interface */
 	if_settransmitfn(ifp, em_mq_start);
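For reference, the 65518 in the patch follows from its own comment. A minimal sketch of the arithmetic, assuming a 2048-byte mbuf cluster and an 18-byte maximum MAC header (14-byte Ethernet header plus a 4-byte VLAN tag); both constants and the function name are assumptions made for this illustration and are not taken from the patch.

```c
#include <stdint.h>

#define SKETCH_MCLBYTES		2048u	/* assumed mbuf cluster size */
#define SKETCH_MAX_MAC_HDR	18u	/* assumed: 14 (Ethernet) + 4 (VLAN tag) */

/* Largest TSO burst that fits in nclusters clusters, minus the MAC header. */
uint32_t
tso_max_bytes(uint32_t nclusters)
{
	return nclusters * SKETCH_MCLBYTES - SKETCH_MAX_MAC_HDR;
}
```

Under those assumptions, tso_max_bytes(32) is 32 * 2048 - 18 = 65518, the value assigned to if_hw_tsomax.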
I've only set ifp->if_hw_tsomax=65518, as the other settings depend on attributes not available in releng/10.1.
=> It crashed after 1h of movie playing over NFS.

I'll wait for 10.2 to perform the full configuration test, as this server is in use and I can't switch to -current :-)

What about comment 10?
(In reply to david.keller from comment #30)
I just tested an alignment of 128 and I don't see any change in behavior. I still get a lockup with TSO enabled.

@@ -3350,13 +3356,13 @@
 	 * Setup DMA descriptor areas.
 	 */
 	if ((error = bus_dma_tag_create(bus_get_dma_tag(dev),
-			       1, 0,			/* alignment, bounds */
+			       128, 0,			/* alignment, bounds */
 			       BUS_SPACE_MAXADDR,	/* lowaddr */
 			       BUS_SPACE_MAXADDR,	/* highaddr */
 			       NULL, NULL,		/* filter, filterarg */
Ok, I think I've got this bug cornered. Can you test the patch at https://reviews.freebsd.org/D3192 to see if it fixes your testcase?
A commit references this bug:

Author: sbruno
Date: Sun Aug 16 19:43:45 UTC 2015
New revision: 286831
URL: https://svnweb.freebsd.org/changeset/base/286831

Log:
  Increase EM_MAX_SCATTER to 64 such that the size of em_xmit()::segs[EM_MAX_SCATTER] doesn't get overrun by things like NFS that can and do shove more than 32 segs when being used with em(4) and TSO4.

  Update tso handling code in em_xmit() with update from jhb@ in email thread:
  https://lists.freebsd.org/pipermail/freebsd-net/2014-July/039306.html

  set ifp->if_hw_tsomax, ifp->if_hw_tsomaxsegcount & ifp->if_hw_tsomaxsegsize to appropriate values.

  Define a TSO workaround "magic" number of 4 that is used to avoid an alignment issue in hardware.

  Change a couple of integer values that were used as booleans to actual bool types.

  Ensure that em_enable_intr() enables the appropriate mask of interrupts and not just a hardcoded define of values.

  PR: 200221 199174 195078
  Differential Revision: https://reviews.freebsd.org/D3192
  Reviewed by: erj jhb hiren
  MFC after: 2 weeks
  Sponsored by: Limelight Networks

Changes:
  head/sys/dev/e1000/if_em.c
  head/sys/dev/e1000/if_em.h
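As a rough illustration of how a single TSO send can exceed 32 segments, here is a sketch that counts DMA segments. It assumes the payload sits in fixed-size chunks plus a separate header mbuf; real NFS mbuf chains may be laid out differently. segs_needed() is a hypothetical helper, not driver code.

```c
#include <stdint.h>

/*
 * Segments needed for `payload` bytes split into seg_size-byte chunks,
 * plus hdr_segs separate segments holding the protocol headers.
 */
uint32_t
segs_needed(uint32_t payload, uint32_t seg_size, uint32_t hdr_segs)
{
	/* Round the payload up to a whole number of chunks. */
	return hdr_segs + (payload + seg_size - 1u) / seg_size;
}
```

Under these assumptions, a 65518-byte payload in 2048-byte chunks with one header segment needs 33 segments, which is over the old limit of 32 and within the new limit of 64.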
A commit references this bug:

Author: marius
Date: Wed Jan 27 22:31:10 UTC 2016
New revision: 294958
URL: https://svnweb.freebsd.org/changeset/base/294958

Log:
  Sync the e1000 drivers with what's in head as of r294327, modulo parts that don't apply to stable/10 (driver API, if_inc_counter(), RSS changes etc.) and modulo r287465 (which reportedly breaks igb(4)), i. e. assorted fixes and improvements only:

  o MFC r267385 (partial):
    - Don't compare bus_dma map pointers for static DMA allocations against NULL to determine if bus_dmamap_unload() or bus_dmamem_free() should be called. Instead, check the associated bus and virtual addresses.
    - Don't clear static DMA maps to NULL.

  o MFC r284933:
    Delete the refernce to VLAN handling being disabled by default. This is no longer the case. [1]

  o MFC r285639:
    Add an adapter CORE lock in the DDB hook em_dump_queue to avoid WITNESS panic in em_init_locked() while debugging.

  o MFC r285879:
    - Remove unused txd_saved.
    - Intialize txd_upper, txd_lower and txd_used at declaration.

  o MFC r286162:
    Free mbufs when busdma loading fails.

  o MFC r286829:
    Add capability to disable CRC stripping as it breaks IPMI/BMC capabilities on certain adatpers. [2]

  o MFC r286831: [3]
    - Increase EM_MAX_SCATTER to 64 such that the size of em_xmit():: segs[EM_MAX_SCATTER] doesn't get overrun by things like NFS that can and do shove more than 32 segs when being used with em(4) and TSO4.
    - Update tso handling code in em_xmit() with update from jhb@
    - Set if_hw_tsomax, if_hw_tsomaxsegcount and if_hw_tsomaxsegsize to appropriate values.
    - Define a TSO workaround "magic" number of 4 that is used to avoid an alignment issue in hardware.
    - Change a couple of integer values that were used as booleans to actual bool types.
    - Ensure that em_enable_intr() enables the appropriate mask of interrupts and not just a hardcoded define of values.

  o MFC r286832:
    e1000/if_lem.c bump to 1.1.0

  o MFC r286833:
    Bump all copywrite dates to 2015.

  o MFC r287112:
    Style/whitespace cleanup in shared/common code.

  o MFC r293331:
    - Switch em(4) to the extended RX descriptor format.
    - Split rxbuffer and txbuffer apart to support the new RX descriptor format structures. Move rxbuffer manipulation to em_setup_rxdesc() to unify the new behavior changes.
    - Add a RSSKEYLEN macro for help in generating the RSSKEY data structures in the card.
    - Change em_receive_checksum() to process the new rxdescriptor format status bit.

  o MFC r293332:
    Disable the reuse of checksum offload context descriptors in the case of multiple queues in em(4). Document errata in the code.

  o MFC r293854:
    Given that em(4), lem(4) and igb(4) hardware doesn't require the alignment guarantees provided by m_defrag(9), use m_collapse(9) instead for performance reasons. While at it, sanitize the statistics softc members, i. e. retire unused ones and add SYSCTL nodes missing for actually used ones.

  PR: 118693 [1], 161277 [2], 195078 [3], 199174 [3], 200221 [3]

Changes:
_U  stable/10/
  stable/10/share/man/man4/em.4
  stable/10/sys/dev/e1000/e1000_80003es2lan.c
  stable/10/sys/dev/e1000/e1000_80003es2lan.h
  stable/10/sys/dev/e1000/e1000_82540.c
  stable/10/sys/dev/e1000/e1000_82541.c
  stable/10/sys/dev/e1000/e1000_82541.h
  stable/10/sys/dev/e1000/e1000_82542.c
  stable/10/sys/dev/e1000/e1000_82543.c
  stable/10/sys/dev/e1000/e1000_82543.h
  stable/10/sys/dev/e1000/e1000_82571.c
  stable/10/sys/dev/e1000/e1000_82571.h
  stable/10/sys/dev/e1000/e1000_82575.c
  stable/10/sys/dev/e1000/e1000_82575.h
  stable/10/sys/dev/e1000/e1000_api.c
  stable/10/sys/dev/e1000/e1000_api.h
  stable/10/sys/dev/e1000/e1000_defines.h
  stable/10/sys/dev/e1000/e1000_hw.h
  stable/10/sys/dev/e1000/e1000_i210.c
  stable/10/sys/dev/e1000/e1000_i210.h
  stable/10/sys/dev/e1000/e1000_ich8lan.c
  stable/10/sys/dev/e1000/e1000_ich8lan.h
  stable/10/sys/dev/e1000/e1000_mac.c
  stable/10/sys/dev/e1000/e1000_mac.h
  stable/10/sys/dev/e1000/e1000_manage.c
  stable/10/sys/dev/e1000/e1000_manage.h
  stable/10/sys/dev/e1000/e1000_mbx.c
  stable/10/sys/dev/e1000/e1000_mbx.h
  stable/10/sys/dev/e1000/e1000_nvm.c
  stable/10/sys/dev/e1000/e1000_nvm.h
  stable/10/sys/dev/e1000/e1000_osdep.c
  stable/10/sys/dev/e1000/e1000_osdep.h
  stable/10/sys/dev/e1000/e1000_phy.c
  stable/10/sys/dev/e1000/e1000_phy.h
  stable/10/sys/dev/e1000/e1000_regs.h
  stable/10/sys/dev/e1000/e1000_vf.c
  stable/10/sys/dev/e1000/e1000_vf.h
  stable/10/sys/dev/e1000/if_em.c
  stable/10/sys/dev/e1000/if_em.h
  stable/10/sys/dev/e1000/if_igb.c
  stable/10/sys/dev/e1000/if_igb.h
  stable/10/sys/dev/e1000/if_lem.c
  stable/10/sys/dev/e1000/if_lem.h
  stable/10/sys/dev/ixgb/if_ixgb.c
  stable/10/sys/dev/netmap/if_em_netmap.h