Bug 221919

Summary: ixl: TX queue hang when using TSO and having a high and mixed network load
Product: Base System
Reporter: Nikita Kozlov <nikita>
Component: kern
Assignee: Eric Joyner <erj>
Status: Closed FIXED
Severity: Affects Only Me
CC: bapt, ed, erj, haron86, incin, ixbug, jason, jeffrey.e.pieper, krzysztof.galazka, kurt, net, peter.x.eriksson, rstone, sbruno, smh, wollman
Priority: ---
Keywords: IntelNetworking
Version: 11.1-STABLE
Flags: koobs: mfc-stable11+
Hardware: amd64
OS: Any
URL: https://reviews.freebsd.org/D14985

Description Nikita Kozlov 2017-08-29 20:06:31 UTC
In my scenario, with about a hundred iSCSI sessions generating around 10 Gbit/s of mixed RX/TX load (it also happens under smaller loads, though less often), my XL710-DA2 cards start to flap, with the following output in dmesg:

```
...
kernel: ixl0: WARNING: queue 4 appears to be hung!
kernel: ixl0: WARNING: Resetting!
kernel: ixl0: Malicious Driver Detection event 2 on TX queue 0, pf number 0
kernel: ixl0: MDD TX event is for this function! 
...
kernel: ixl0: Interface stopped DISTRIBUTING, possible flapping
...
```
(and then a lot of iSCSI sessions drop because of the flap)


Under this network load it happens roughly every 5 minutes.

It seems to be related to TSO: when I disabled it, the bug went away and I could no longer reproduce it. Re-enabling TSO makes the bug appear again.

My FW : dev.ixl.0.fw_version: fw 5.0.40043 api 1.5 nvm 5.05 etid 800028a6 oem 1.262.0
But I also reproduced it with fw 4.33.31377 api 1.2 nvm 4.42
Comment 1 Eric Joyner freebsd_committer freebsd_triage 2017-08-31 00:47:18 UTC
Try removing the ixl_init_locked() in ixl_local_timer(), right after it prints the "WARNING: Resetting!" message -- the queues might actually not be hung and don't need to be reinitialized.
Comment 2 Peter Eriksson 2017-10-31 15:13:51 UTC
This is a really annoying bug that we've also seen. I do not think it's related to iSCSI though (since we aren't using it). Disabling TSO seems to help, but it also severely reduces transmission speed: in our case it drops from around 10Gbps to 3Gbps without TSO.

Our servers are SMB (and NFS, but not much yet) servers: Dell PowerEdge 730xd.

> FreeBSD 11.1
> ixl2: fw 5.40.47690 api 1.5 nvm 5.40 etid 80002d35 oem 18.4608.16

ixl0: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl0: link state changed to UP
ixl2: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl2: link state changed to UP

ixl0: <Intel(R) Ethernet Connection XL710/X722 Driver, Version - 1.7.12-k> mem 0xc9000000-0xc9ffffff,0xca008000-0xca00ffff at device 0.0 numa-domain 1 on pci15
ixl0: Using MSIX interrupts with 9 vectors
ixl0: fw 5.40.47690 api 1.5 nvm 5.40 etid 80002d35 oem 18.4608.16
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: Ethernet address: 3c:fd:fe:24:e7:e0
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: Failed to initialize SR-IOV (error=2)
ixl0: netmap queues/slots: TX 8/1024, RX 8/1024

ixl2: <Intel(R) Ethernet Connection XL710/X722 Driver, Version - 1.7.12-k> mem 0xcc000000-0xccffffff,0xcd008000-0xcd00ffff at device 0.0 numa-domain 1 on pci16
ixl2: Using MSIX interrupts with 9 vectors
ixl2: fw 5.40.47690 api 1.5 nvm 5.40 etid 80002d35 oem 18.4608.16
ixl2: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl2: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl2: Ethernet address: 3c:fd:fe:24:d6:a0
ixl2: PCI Express Bus: Speed 8.0GT/s Width x8
ixl2: Failed to initialize SR-IOV (error=2)
ixl2: netmap queues/slots: TX 8/1024, RX 8/1024

ixl0: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl0: link state changed to UP
ixl2: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl2: link state changed to UP
ixl2: link state changed to DOWN
ixl0: link state changed to DOWN
ixl0: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl0: link state changed to UP
ixl2: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl2: link state changed to UP
ixl2: Malicious Driver Detection event 2 on TX queue 0, pf number 0
ixl2: MDD TX event is for this function!ixl2: Interface stopped DISTRIBUTING, possible flapping
ixl2: Interface stopped DISTRIBUTING, possible flapping
ixl2: Interface stopped DISTRIBUTING, possible flapping
...repeat...
ixl2: WARNING: queue 0 appears to be hung!
ixl2: WARNING: Resetting!

I managed to log in to the server after a while and disable TSO, and then things started working again.


Would using the Intel-provided (instead of the 11.1 one) driver and firmware (from their web site) help with this issue?
Comment 3 Ryan Stone freebsd_committer freebsd_triage 2017-10-31 17:41:44 UTC
How reproducible is the hang?  Could you please try this patch and confirm whether it fixes your issue?

https://people.freebsd.org/~rstone/patches/ixl_tsosegpermss.diff
Comment 4 Eric Joyner freebsd_committer freebsd_triage 2017-12-07 18:09:34 UTC
We did find another bug in the function to detect packets that violate the HW restriction on how many buffers each segment in a TSO can span (and the fix will be in the next update to the driver in 12), but Ryan's patch should ensure packets like those don't reach the driver.

Could you report if it works for you guys, Nikita and Peter?
Comment 5 Peter Eriksson 2017-12-07 20:59:21 UTC
I haven't had time to test the patch yet (started on it but got side-tracked with other bugs), but I'll make another attempt. Might not happen until this weekend or early next week though.

One problem is that the issue takes some time to pop up. I tried creating a test setup that would force it on our test system, but so far it has only shown itself on our production systems :-(

It would be nice to be able to trigger the bug on the non-production system :-)
Comment 6 Garrett Wollman freebsd_committer freebsd_triage 2017-12-12 21:46:59 UTC
Applied the patch from #c3 to my 11.1 source tree and found that it did not improve matters.  It would be better if this "feature" could simply be disabled, as the Linux drivers (apparently?) allow.
Comment 7 Ryan Stone freebsd_committer freebsd_triage 2017-12-12 22:40:42 UTC
Sorry, there was a mistake in the patch.  I think that something got lost in translation when I ported it forward.  I've regenerated the patch at the same location, or you can replace this line in ixl_pf_main.c:

	ifp->if_hw_tsomaxsegpermss = IXL_MAX_TX_SEGS;

with

	ifp->if_hw_tsomaxsegpermss = IXL_SPARSE_CHAIN;


Sorry for the confusion.
Comment 8 KurtC 2017-12-28 18:43:16 UTC
I am running into this exact Malicious Driver Detection event under high load on an X710-DA2 running driver 1.7.12.  Disabling TSO does not fix the problem for me.
Comment 9 Wallace 2017-12-28 20:28:34 UTC
We are having the same issue on a new Supermicro server purchased this month.

ixl0@pci0:26:0:0:	class=0x020000 card=0x37d215d9 chip=0x37d28086 rev=0x09 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection X722 for 10GBASE-T'
    class      = network
    subclass   = ethernet

Dec 27 10:23:03 hostname kernel: ixl0: WARNING: queue 1 appears to be hung!
Dec 27 10:23:03 hostname kernel: ixl0: WARNING: Resetting!
Dec 27 10:23:10 hostname kernel: ixl0: Malicious Driver Detection event 14 on TX queue 1, pf number 0
Dec 27 10:23:10 hostname kernel: ixl0: MDD TX event is for this function!

After playing with LRO and TSO, things seemed to be better: no more errors in the log files, and NFS shares seemed more stable. Over the past week it did not seem to matter whether there was light or heavy traffic.


Errors: 

ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether ac:1f:6b:61:a3:80
        hwaddr ac:1f:6b:61:a3:80
        inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x
        inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-T <full-duplex>)
        status: active

No Errors:

[root@backup0 ~]# ifconfig ixl0 -lro -tso
[root@backup0 ~]# ifconfig ixl0
ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether ac:1f:6b:61:a3:80
        hwaddr ac:1f:6b:61:a3:80
        inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x 
        inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet
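
For reference, the two `options=` masks above (6407bb with errors, 6400bb without) can be compared bit by bit. The sketch below XORs two masks and names the three offload bits relevant here; `options_diff` is a name made up for this example, and the bit values are the `IFCAP_TSO4`, `IFCAP_TSO6` and `IFCAP_LRO` constants from FreeBSD's `<net/if.h>`. Any other differing bits still show up in the hex value but are not named.

```shell
#!/bin/sh
# options_diff (hypothetical helper): print the bits that differ between
# two ifconfig "options=" masks, given as hex without the 0x prefix.
options_diff() {
    diff=$(( 0x$1 ^ 0x$2 ))
    names=""
    # Only the three offload bits discussed in this report are named.
    [ $(( diff & 0x100 )) -ne 0 ] && names="$names TSO4"
    [ $(( diff & 0x200 )) -ne 0 ] && names="$names TSO6"
    [ $(( diff & 0x400 )) -ne 0 ] && names="$names LRO"
    printf '0x%x%s\n' "$diff" "$names"
}

# Masks taken from the two ifconfig outputs above.
options_diff 6407bb 6400bb    # prints: 0x700 TSO4 TSO6 LRO
```

So `ifconfig ixl0 -lro -tso` cleared exactly TSO4, TSO6 and LRO, and nothing else changed between the two states.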

[root@hostname /var/log]# freebsd-version -k
11.1-RELEASE-p4
[root@hostname /var/log]# freebsd-version -u
11.1-RELEASE-p6

I can provide more info if anyone needs it or help debug the issue more.

Thanks!
Comment 10 Garrett Wollman freebsd_committer freebsd_triage 2017-12-29 01:56:04 UTC
(In reply to Ryan Stone from comment #7)
Doesn't seem to have made any difference.  (Had to wait for the post-Xmas outage window now that this server is in production.)
Comment 11 Eric Joyner freebsd_committer freebsd_triage 2018-01-04 19:44:15 UTC
This should be fixed in ixl-1.9.5.

We're working on getting that upstreamed, but in the meantime, you can download it from the Intel download center.

https://downloadcenter.intel.com/download/25160/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connection-under-FreeBSD-?product=36773
Comment 12 Ed Schouten freebsd_committer freebsd_triage 2018-02-12 09:04:06 UTC
Hi Eric,

We're also experiencing these issues on a SuperMicro system having these NICs:

# pciconf -l | grep ixl
ixl0@pci0:2:0:0:        class=0x020000 card=0x089e15d9 chip=0x15728086 rev=0x02 hdr=0x00
ixl1@pci0:2:0:1:        class=0x020000 card=0x000015d9 chip=0x15728086 rev=0x02 hdr=0x00

I'll integrate the ixl-1.9.5 driver into our own codebase to work around this issue, but it would be nice if this driver got upstreamed instead. Are there any concrete plans for doing this?

Thanks,
Ed
Comment 13 Jason Tubnor 2018-02-15 02:12:23 UTC
I am also seeing this on our Lenovo SR650 7x06 servers.  We too are using 10GbE XL710 cards:

Intel(R) Ethernet Controller X710 for 10GbE SFP+

# pciconf -l | grep ixl
ixl0@pci0:10:0:0:       class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl1@pci0:10:0:1:       class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl2@pci0:10:0:2:       class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl3@pci0:10:0:3:       class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl4@pci0:174:0:0:      class=0x020000 card=0x000a8086 chip=0x15728086 rev=0x01 hdr=0x00
ixl5@pci0:174:0:1:      class=0x020000 card=0x00008086 chip=0x15728086 rev=0x01 hdr=0x00

snip from /var/log/messages:

Feb 15 09:50:53 server01 kernel: ixl5: Malicious Driver Detection event 2 on TX queue 769, pf number 1
Feb 15 09:50:53 server01 kernel: ixl5: MDD TX event is for this function!
Feb 15 09:50:54 server01 kernel: ixl5: WARNING: queue 0 appears to be hung!
Feb 15 09:50:54 server01 kernel: ixl5: WARNING: Resetting!
Feb 15 09:50:57 server01 kernel: WARNING: 192.168.1.14 (iqn.1998-01.com.vmware:HOST-00000000): no ping reply (NOP-Out) after 5 seconds; dropping connection
Feb 15 09:51:25 server01 kernel: ixl5: Malicious Driver Detection event 2 on TX queue 775, pf number 1
Feb 15 09:51:25 server01 kernel: ixl5: MDD TX event is for this function!
Feb 15 09:51:29 server01 kernel: WARNING: 192.168.1.14 (iqn.1998-01.com.vmware:HOST-00000000): no ping reply (NOP-Out) after 5 seconds; dropping connection
Feb 15 09:51:53 server01 kernel: ixl5: WARNING: queue 7 appears to be hung!
Feb 15 09:51:53 server01 kernel: ixl5: WARNING: Resetting!
Feb 15 09:51:55 server01 kernel: ixl5: Malicious Driver Detection event 2 on TX queue 768, pf number 1
Feb 15 09:51:55 server01 kernel: ixl5: MDD TX event is for this function!

This is easy to reproduce when connecting 10GbE VMware ESXi hosts to these storage servers via iSCSI.  We could trigger it by performing a vMotion move from one datastore to another.

I do not have a server on which I can test patches: three of these are in production running 11.1-RELEASE, and we cannot afford to take them offline or deviate from the standard supported freebsd-update mechanism.

I hope something can be worked out pretty soon and rolled into update as this issue for us can't wait for 11.2 or 12.

I will be trying out -tso, but was trying to avoid that for performance reasons.

Thanks!
Comment 14 Jason Tubnor 2018-03-29 01:00:05 UTC
Has the Intel driver been upstreamed yet, in time to make 11.2-RELEASE? re@ have just sent a reminder of the release schedule.  If the updated vendor driver works for others here who run their own build service, can it be merged in time for 11.2 for those of us who follow supported updates?

Thanks.
Comment 15 Jeff Pieper 2018-03-29 10:46:13 UTC
We will have a patch ready for Phabricator soon. It should be committed before code freeze.
Comment 17 Krzysztof Galazka 2018-04-06 10:54:39 UTC
The Phabricator review which should fix this issue: https://reviews.freebsd.org/D14985
Comment 18 Steven Hartland freebsd_committer freebsd_triage 2018-04-17 10:00:20 UTC
That review is still pending.

We've just had what appears to be an RX hang with TSO disabled. Could it be related?

There were no messages in /var/log/messages; tcpdump still showed outbound traffic but no inbound, and we had to reboot to recover.
Comment 19 commit-hook freebsd_committer freebsd_triage 2018-05-01 18:50:42 UTC
A commit references this bug:

Author: erj
Date: Tue May  1 18:50:13 UTC 2018
New revision: 333149
URL: https://svnweb.freebsd.org/changeset/base/333149

Log:
  ixl(4): Update to 1.9.9-k

  Refresh upstream driver before impending conversion to iflib.

  Major changes:

  - Support for descriptor writeback mode (required by ixlv(4) for AVF support)
  - Ability to disable firmware LLDP agent by user (PR 221530)
  - Fix for TX queue hang when using TSO (PR 221919)
  - Separate descriptor ring sizes for TX and RX rings

  PR:		221530, 221919
  Submitted by:	Krzysztof Galazka <krzysztof.galazka@intel.com>
  Reviewed by:	#IntelNetworking
  MFC after:	1 day
  Relnotes:	Yes
  Sponsored by:	Intel Corporation
  Differential Revision:	https://reviews.freebsd.org/D14985

Changes:
  head/sys/conf/files.amd64
  head/sys/dev/ixl/i40e_adminq.c
  head/sys/dev/ixl/i40e_adminq.h
  head/sys/dev/ixl/i40e_adminq_cmd.h
  head/sys/dev/ixl/i40e_alloc.h
  head/sys/dev/ixl/i40e_common.c
  head/sys/dev/ixl/i40e_dcb.c
  head/sys/dev/ixl/i40e_dcb.h
  head/sys/dev/ixl/i40e_devids.h
  head/sys/dev/ixl/i40e_hmc.c
  head/sys/dev/ixl/i40e_hmc.h
  head/sys/dev/ixl/i40e_lan_hmc.c
  head/sys/dev/ixl/i40e_lan_hmc.h
  head/sys/dev/ixl/i40e_nvm.c
  head/sys/dev/ixl/i40e_osdep.c
  head/sys/dev/ixl/i40e_osdep.h
  head/sys/dev/ixl/i40e_prototype.h
  head/sys/dev/ixl/i40e_register.h
  head/sys/dev/ixl/i40e_status.h
  head/sys/dev/ixl/i40e_type.h
  head/sys/dev/ixl/i40e_virtchnl.h
  head/sys/dev/ixl/if_ixl.c
  head/sys/dev/ixl/if_ixlv.c
  head/sys/dev/ixl/ixl.h
  head/sys/dev/ixl/ixl_iw.c
  head/sys/dev/ixl/ixl_iw.h
  head/sys/dev/ixl/ixl_iw_int.h
  head/sys/dev/ixl/ixl_pf.h
  head/sys/dev/ixl/ixl_pf_i2c.c
  head/sys/dev/ixl/ixl_pf_iov.c
  head/sys/dev/ixl/ixl_pf_iov.h
  head/sys/dev/ixl/ixl_pf_main.c
  head/sys/dev/ixl/ixl_pf_qmgr.c
  head/sys/dev/ixl/ixl_pf_qmgr.h
  head/sys/dev/ixl/ixl_txrx.c
  head/sys/dev/ixl/ixlv.h
  head/sys/dev/ixl/ixlv_vc_mgr.h
  head/sys/dev/ixl/ixlvc.c
  head/sys/dev/ixl/virtchnl.h
  head/sys/modules/ixl/Makefile
Comment 20 Peter Eriksson 2018-12-15 14:18:58 UTC
Just a quick note that we're still seeing the same problem on our production servers if we enable "tso" on the 10G interfaces. FreeBSD 11.2-RELEASE-p6. So far I haven't been able to reproduce it on the test servers (identical hardware) running 11.2-RELEASE-p5 (and 12.0-RELEASE), but they don't see any traffic...

Driver version:
> dev.ixl.0.%desc: Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.9.9-k

Firmware:
> dev.ixl.0.fw_version: fw 6.80.48812 api 1.7 nvm 6.00 etid 80003751 oem 18.4608.17

Watchdog events in the output from sysctl -a:
> dev.ixl.0.watchdog_events: 4

Dmesg errors:
> ixl0: WARNING: queue 3 appears to be hung!
> ixl0: WARNING: queue 2 appears to be hung!
> ixl2: WARNING: queue 2 appears to be hung!
> ixl2: WARNING: queue 4 appears to be hung!
> ixl2: WARNING: queue 7 appears to be hung!
> ixl2: WARNING: queue 3 appears to be hung!
> ixl0: WARNING: queue 7 appears to be hung!
> ixl2: WARNING: queue 3 appears to be hung!
> ixl0: WARNING: queue 4 appears to be hung!

(Output from ifconfig with TSO disabled)
> # ifconfig lagg0
> lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> 	options=6404bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
> 	ether 3c:fd:fe:25:47:a0
> 	inet6 fe80::3efd:feff:fe25:47a0%lagg0 prefixlen 64 scopeid 0xa
> 	inet6 2001:6b0:17:2400::8:43 prefixlen 64
> 	inet 130.236.8.43 netmask 0xffffffe0 broadcast 130.236.8.63
> 	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
> 	media: Ethernet autoselect
> 	status: active
> 	groups: lagg
> 	laggproto lacp lagghash l2,l3,l4
> 	laggport: ixl0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
> 	laggport: ixl2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

iperf3 output with TSO disabled:
> # iperf3 -c filur00 -t4
> Connecting to host filur00, port 5201
> [  5] local 2001:6b0:17:2400::8:43 port 51226 connected to 2001:6b0:17:2400::8:40 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   318 MBytes  2.66 Gbits/sec    0    561 KBytes
> [  5]   1.00-2.00   sec   350 MBytes  2.94 Gbits/sec    0   1.11 MBytes
> [  5]   2.00-3.00   sec   392 MBytes  3.28 Gbits/sec    0   1.67 MBytes
> [  5]   3.00-4.00   sec   351 MBytes  2.94 Gbits/sec    0   1.77 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-4.00   sec  1.38 GBytes  2.95 Gbits/sec    0             sender
> [  5]   0.00-4.00   sec  1.38 GBytes  2.95 Gbits/sec                  receiver
> 
> iperf Done.


With TSO enabled (when things work):

> # ifconfig lagg0 tso ; iperf3 -c filur00 -t4
> Connecting to host filur00, port 5201
> [  5] local 2001:6b0:17:2400::8:43 port 51237 connected to 2001:6b0:17:2400::8:40 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   976 MBytes  8.19 Gbits/sec    0    492 KBytes
> [  5]   1.00-2.00   sec  1.08 GBytes  9.29 Gbits/sec    0   1021 KBytes
> [  5]   2.00-3.00   sec  1.08 GBytes  9.29 Gbits/sec    0   1.50 MBytes
> [  5]   3.00-4.00   sec  1.08 GBytes  9.28 Gbits/sec    0   1.75 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-4.00   sec  4.20 GBytes  9.01 Gbits/sec    0             sender
> [  5]   0.00-4.00   sec  4.19 GBytes  9.01 Gbits/sec                  receiver
> 
> iperf Done.

But often the queues get stuck and freeze. Hmm... I just noticed that it was IPv6 that stopped working when I enabled TSO on a production server and ran iperf3 on it; IPv4 traffic was still passing through.

Can it be that there still are IPv6 (TSO6)-related bugs and that the IPv4 ones are solved? Too bad I can't find a way to force it to happen on the test servers...
Comment 21 Finn 2018-12-16 15:31:04 UTC
Hi,
have you tried the recently released version 1.10.4?
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233531)

Also, FreeBSD 12.0 comes with ixl version 2.1.0-k.
Comment 22 Peter Eriksson 2018-12-16 20:23:13 UTC
I'm running 1.10.4 on one of our test servers (running 11.2-RELEASE-p5), and 2.1.0-k on FreeBSD 12.0 on another. Both work fine. But then again, so does the same version (1.9.9-k) that fails on the production servers... I'm a bit hesitant to experiment on the production systems.

I've been trying a bit today to provoke the problem on the test servers (without having to set up Samba servers and a couple of hundred Windows clients to simulate the users), but so far no real luck... Sigh.
Comment 23 Eric Joyner freebsd_committer freebsd_triage 2018-12-17 18:50:58 UTC
(In reply to Peter Eriksson from comment #22)

Your issue doesn't look like the original bug's because your logs don't mention MDD events.

Maybe there's some interaction with the driver and lagg?
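
A quick way to check a saved log for this distinction is to count hung-queue warnings and MDD events per interface. This is a minimal sketch that matches only the two message texts quoted in this report; `ixl_log_summary` is a name invented for the example, not an existing tool.

```shell
#!/bin/sh
# ixl_log_summary (hypothetical helper): count "appears to be hung!" and
# "Malicious Driver Detection event" lines per ixlN interface.
# Reads the files given as arguments, or standard input if none are given.
ixl_log_summary() {
    awk '
        /appears to be hung!/ || /Malicious Driver Detection event/ {
            if (match($0, /ixl[0-9]+:/)) {
                ifname = substr($0, RSTART, RLENGTH - 1)   # drop the ":"
                if ($0 ~ /appears to be hung!/) hung[ifname]++
                else mdd[ifname]++
                seen[ifname] = 1
            }
        }
        END {
            for (ifname in seen)
                printf "%s hung=%d mdd=%d\n", ifname, hung[ifname], mdd[ifname]
        }
    ' "$@" | sort
}

# Sample lines copied from comments 2 and 20 of this report.
ixl_log_summary <<'EOF'
ixl2: Malicious Driver Detection event 2 on TX queue 0, pf number 0
ixl2: WARNING: queue 0 appears to be hung!
ixl0: WARNING: queue 3 appears to be hung!
ixl0: WARNING: queue 2 appears to be hung!
EOF
# prints:
#   ixl0 hung=2 mdd=0
#   ixl2 hung=1 mdd=1
```

An interface that reports `hung` warnings with `mdd=0` matches the pattern described above rather than the original report.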
Comment 24 Aleksandr 2018-12-17 20:49:58 UTC
(In reply to Peter Eriksson from comment #22)

Do you have any scripts that reconfigure the interface (ifconfig ixl0 down up/mtu/tso) while it receives/transmits traffic?
Comment 25 Peter Eriksson 2018-12-18 09:53:04 UTC
> Do you have any scripts that reconfigure the interface (ifconfig ixl0 down 
> up/mtu/tso) while it receives/transmits traffic?

Hmm... Aha! Bingo!

When testing on the production servers (that always receive SMB/NFS/SSH traffic) I just did an "ifconfig lagg0 tso" to enable it and then started my iperf3 testing (toggling it off and on a few times to get test data).

I can now reliably reproduce this if I start an "iperf3" test session between two servers and while it is running disable / enable tso "on the fly".

I can now provoke the "hang" on:
FreeBSD 11.2-RELEASE-p6 with ixl driver 1.9.9-k and firmware 5.60
FreeBSD 11.2-RELEASE-p6 with ixl driver 1.10.4 and firmware 6.80

I've not (so far) been able to provoke it to occur on:
FreeBSD 12.0-RELEASE-p0 with ixl driver 2.1.0-k and firmware 6.80

Sometimes it self-heals after a while, but most often I have to do an "ifconfig lagg0 down ; ifconfig lagg0 up" to get it to recover.
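
The reproduction steps above can be sketched as a small script. The variable names (`IFACE`, `PEER`, `DURATION`, `DRY_RUN`), the one-second pauses and the cycle count are assumptions made for this example; only the `ifconfig ... tso` / `-tso` and `iperf3 -c` commands come from this report. `DRY_RUN` defaults to 1, so the commands are printed rather than executed.

```shell
#!/bin/sh
# Toggle TSO on an interface while an iperf3 session is running.
# All variable names and timings here are illustrative assumptions.
IFACE=${IFACE:-lagg0}
PEER=${PEER:-filur00}
DURATION=${DURATION:-60}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = number of off/on cycles
toggle_tso() {
    n=0
    while [ "$n" -lt "$1" ]; do
        run ifconfig "$IFACE" -tso
        run sleep 1
        run ifconfig "$IFACE" tso
        run sleep 1
        n=$((n + 1))
    done
}

run iperf3 -c "$PEER" -t "$DURATION" &   # traffic in the background
toggle_tso 3                             # flip TSO while it runs
wait
```

Running it for real (`DRY_RUN=0`, as root) changes live interface settings, so it belongs on a test machine; if the interface hangs, the "ifconfig lagg0 down ; ifconfig lagg0 up" recovery mentioned above applies.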
Comment 26 Kubilay Kocak freebsd_committer freebsd_triage 2020-01-15 04:30:03 UTC
^Triage: 

 - Close (appears resolved)
 - Track MFC
    - head was 12.x in base r333149
    - MFC'd to stable/11 in base r333343