In my scenario, with about a hundred iSCSI sessions carrying a load of around 10Gib of mixed RX/TX (it also happens with a smaller load, but less often), my XL710-DA2 cards start to flap with the following output in dmesg:

```
...
kernel: ixl0: WARNING: queue 4 appears to be hung!
kernel: ixl0: WARNING: Resetting!
kernel: ixl0: Malicious Driver Detection event 2 on TX queue 0, pf number 0
kernel: ixl0: MDD TX event is for this function!
...
kernel: ixl0: Interface stopped DISTRIBUTING, possible flapping
...
```

(and then a lot of iSCSI sessions drop because of the flap)

Under network load it happens roughly every 5 minutes. It seems to be related to TSO: when I disabled it, the bug went away and I could no longer reproduce it. Re-enabling TSO makes the bug appear again.

My FW: dev.ixl.0.fw_version: fw 5.0.40043 api 1.5 nvm 5.05 etid 800028a6 oem 1.262.0

But I also reproduced it with fw 4.33.31377 api 1.2 nvm 4.42
Try removing the ixl_init_locked() in ixl_local_timer(), right after it prints the "WARNING: Resetting!" message -- the queues might actually not be hung and don't need to be reinitialized.
This is a really annoying bug that we've also seen. I do not think it's related to iSCSI though (since we aren't using it). Disabling TSO seems to help (but also severely reduces transmission speed - in our case it drops from around 10Gbps to 3Gbps without TSO).

Our servers are SMB (and NFS, but not much yet) servers. Dell PowerEdge 730xd.

> FreeBSD 11.1
> ixl2: fw 5.40.47690 api 1.5 nvm 5.40 etid 80002d35 oem 18.4608.16

ixl0: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl0: link state changed to UP
ixl2: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl2: link state changed to UP
ixl0: <Intel(R) Ethernet Connection XL710/X722 Driver, Version - 1.7.12-k> mem 0xc9000000-0xc9ffffff,0xca008000-0xca00ffff at device 0.0 numa-domain 1 on pci15
ixl0: Using MSIX interrupts with 9 vectors
ixl0: fw 5.40.47690 api 1.5 nvm 5.40 etid 80002d35 oem 18.4608.16
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: Ethernet address: 3c:fd:fe:24:e7:e0
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: Failed to initialize SR-IOV (error=2)
ixl0: netmap queues/slots: TX 8/1024, RX 8/1024
ixl2: <Intel(R) Ethernet Connection XL710/X722 Driver, Version - 1.7.12-k> mem 0xcc000000-0xccffffff,0xcd008000-0xcd00ffff at device 0.0 numa-domain 1 on pci16
ixl2: Using MSIX interrupts with 9 vectors
ixl2: fw 5.40.47690 api 1.5 nvm 5.40 etid 80002d35 oem 18.4608.16
ixl2: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl2: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl2: Ethernet address: 3c:fd:fe:24:d6:a0
ixl2: PCI Express Bus: Speed 8.0GT/s Width x8
ixl2: Failed to initialize SR-IOV (error=2)
ixl2: netmap queues/slots: TX 8/1024, RX 8/1024
ixl0: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl0: link state changed to UP
ixl2: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl2: link state changed to UP
ixl2: link state changed to DOWN
ixl0: link state changed to DOWN
ixl0: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl0: link state changed to UP
ixl2: Link is up, 10 Gbps Full Duplex, FEC: None, Autoneg: False, Flow Control: Full
ixl2: link state changed to UP
ixl2: Malicious Driver Detection event 2 on TX queue 0, pf number 0
ixl2: MDD TX event is for this function!
ixl2: Interface stopped DISTRIBUTING, possible flapping
ixl2: Interface stopped DISTRIBUTING, possible flapping
ixl2: Interface stopped DISTRIBUTING, possible flapping
...repeat...
ixl2: WARNING: queue 0 appears to be hung!
ixl2: WARNING: Resetting!

I managed to log in to the server after a while and disable TSO, and then things started working again.

Would using the Intel-provided driver and firmware from their web site (instead of the ones shipped with 11.1) help with this issue?
How reproducible is the hang? Could you please try this patch and confirm whether it fixes your issue? https://people.freebsd.org/~rstone/patches/ixl_tsosegpermss.diff
We did find another bug in the function to detect packets that violate the HW restriction on how many buffers each segment in a TSO can span (and the fix will be in the next update to the driver in 12), but Ryan's patch should ensure packets like those don't reach the driver. Could you report if it works for you guys, Nikita and Peter?
I haven't had time to test the patch yet (started on it but got side-tracked with other bugs), but I'll make another attempt. Might not happen until this weekend or early next week though. One problem is that the issue takes some time to pop up. I tried creating a test setup that would force it on our test system, but so far it has only shown itself on our production systems :-( It would be nice to be able to trigger the bug on the non-production system :-)
Applied the patch from #c3 to my 11.1 source tree and found that it did not improve matters. It would be better if this "feature" could simply be disabled, as the Linux drivers (apparently?) allow.
Sorry, there was a mistake in the patch. I think that something got lost in translation when I ported it forward. I've regenerated the patch at the same location, or you can replace this line in ixl_pf_main.c:

    ifp->if_hw_tsomaxsegpermss = IXL_MAX_TX_SEGS;

with

    ifp->if_hw_tsomaxsegpermss = IXL_SPARSE_CHAIN;

Sorry for the confusion.
I am running into this exact Malicious Driver Detection event under high load on an X710-DA2 running driver 1.7.12. Disabling TSO does not fix the problem for me.
We are having the same issue on a new Supermicro server purchased this month.

ixl0@pci0:26:0:0: class=0x020000 card=0x37d215d9 chip=0x37d28086 rev=0x09 hdr=0x00
    vendor = 'Intel Corporation'
    device = 'Ethernet Connection X722 for 10GBASE-T'
    class = network
    subclass = ethernet

Dec 27 10:23:03 hostname kernel: ixl0: WARNING: queue 1 appears to be hung!
Dec 27 10:23:03 hostname kernel: ixl0: WARNING: Resetting!
Dec 27 10:23:10 hostname kernel: ixl0: Malicious Driver Detection event 14 on TX queue 1, pf number 0
Dec 27 10:23:10 hostname kernel: ixl0: MDD TX event is for this function!

After playing with lro and tso things seemed to be better. No more errors in the log files and NFS shares seemed more stable. Over the past week it seemed it didn't matter if there was light or heavy traffic.

Errors:

ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=6407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
    ether ac:1f:6b:61:a3:80
    hwaddr ac:1f:6b:61:a3:80
    inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x
    inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-T <full-duplex>)
    status: active

No Errors:

[root@backup0 ~]# ifconfig ixl0 -lro -tso
[root@backup0 ~]# ifconfig ixl0
ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=6400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
    ether ac:1f:6b:61:a3:80
    hwaddr ac:1f:6b:61:a3:80
    inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x
    inet x.x.x.x netmask 0xfffff800 broadcast x.x.x.x
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet

[root@hostname /var/log]# freebsd-version -k
11.1-RELEASE-p4
[root@hostname /var/log]# freebsd-version -u
11.1-RELEASE-p6

I can provide more info if anyone needs it, or help debug the issue further. Thanks!
(In reply to Ryan Stone from comment #7) Doesn't seem to have made any difference. (Had to wait for the post-Xmas outage window now that this server is in production.)
This should be fixed in ixl-1.9.5. We're working on getting that upstreamed, but in the meantime, you can download it from the Intel download center. https://downloadcenter.intel.com/download/25160/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connection-under-FreeBSD-?product=36773
Hi Eric,

We're also experiencing these issues on a SuperMicro system having these NICs:

# pciconf -l | grep ixl
ixl0@pci0:2:0:0: class=0x020000 card=0x089e15d9 chip=0x15728086 rev=0x02 hdr=0x00
ixl1@pci0:2:0:1: class=0x020000 card=0x000015d9 chip=0x15728086 rev=0x02 hdr=0x00

I'll integrate the ixl-1.9.5 driver into our own codebase to work around this issue, but it would be nice if this driver got upstreamed instead. Are there any concrete plans for doing this?

Thanks,
Ed
I am also seeing this on our Lenovo SR650 7x06 servers. We too are using 10GbE XL710 cards: Intel(R) Ethernet Controller X710 for 10GbE SFP+

# pciconf -l | grep ixl
ixl0@pci0:10:0:0: class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl1@pci0:10:0:1: class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl2@pci0:10:0:2: class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl3@pci0:10:0:3: class=0x020000 card=0x402117aa chip=0x37d18086 rev=0x09 hdr=0x00
ixl4@pci0:174:0:0: class=0x020000 card=0x000a8086 chip=0x15728086 rev=0x01 hdr=0x00
ixl5@pci0:174:0:1: class=0x020000 card=0x00008086 chip=0x15728086 rev=0x01 hdr=0x00

Snippet from /var/log/messages:

Feb 15 09:50:53 server01 kernel: ixl5: Malicious Driver Detection event 2 on TX queue 769, pf number 1
Feb 15 09:50:53 server01 kernel: ixl5: MDD TX event is for this function!
Feb 15 09:50:54 server01 kernel: ixl5: WARNING: queue 0 appears to be hung!
Feb 15 09:50:54 server01 kernel: ixl5: WARNING: Resetting!
Feb 15 09:50:57 server01 kernel: WARNING: 192.168.1.14 (iqn.1998-01.com.vmware:HOST-00000000): no ping reply (NOP-Out) after 5 seconds; dropping connection
Feb 15 09:51:25 server01 kernel: ixl5: Malicious Driver Detection event 2 on TX queue 775, pf number 1
Feb 15 09:51:25 server01 kernel: ixl5: MDD TX event is for this function!
Feb 15 09:51:29 server01 kernel: WARNING: 192.168.1.14 (iqn.1998-01.com.vmware:HOST-00000000): no ping reply (NOP-Out) after 5 seconds; dropping connection
Feb 15 09:51:53 server01 kernel: ixl5: WARNING: queue 7 appears to be hung!
Feb 15 09:51:53 server01 kernel: ixl5: WARNING: Resetting!
Feb 15 09:51:55 server01 kernel: ixl5: Malicious Driver Detection event 2 on TX queue 768, pf number 1
Feb 15 09:51:55 server01 kernel: ixl5: MDD TX event is for this function!

This is easily reproduced when hooking 10GbE VMware ESXi hosts up to these storage servers via iSCSI. We could trigger it by performing a vMotion move from one datastore to another.
I do not have a test server that I can test any patches on: the 3 servers of this kind that we have are all in production running 11.1-RELEASE, and we cannot afford to have them off-line or to deviate from the standard supported freebsd-update mechanism. I hope something can be worked out pretty soon and rolled into an update, as for us this issue can't wait for 11.2 or 12. I will be trying out -tso, but was trying to avoid that for performance reasons. Thanks!
Has the Intel driver been upstreamed yet, in time to make it into 11.2-RELEASE? re@ have just sent a reminder of the release schedule. If the updated vendor driver works for others here who run their own build service, can it be merged in time for 11.2 for those who follow supported updates? Thanks.
We will have a patch ready for Phabricator soon. It should be committed before code freeze.
The Phabricator review which should fix this issue: https://reviews.freebsd.org/D14985
That review is still pending. We've just had what appears to be an RX hang with TSO disabled. Is it related? There were no messages in /var/log/messages, and tcpdump still showed outbound traffic but no inbound. We had to reboot to recover.
A commit references this bug:

Author: erj
Date: Tue May 1 18:50:13 UTC 2018
New revision: 333149
URL: https://svnweb.freebsd.org/changeset/base/333149

Log:
  ixl(4): Update to 1.9.9-k

  Refresh upstream driver before impending conversion to iflib.

  Major changes:

  - Support for descriptor writeback mode (required by ixlv(4) for AVF support)
  - Ability to disable firmware LLDP agent by user (PR 221530)
  - Fix for TX queue hang when using TSO (PR 221919)
  - Separate descriptor ring sizes for TX and RX rings

  PR: 221530, 221919
  Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
  Reviewed by: #IntelNetworking
  MFC after: 1 day
  Relnotes: Yes
  Sponsored by: Intel Corporation
  Differential Revision: https://reviews.freebsd.org/D14985

Changes:
  head/sys/conf/files.amd64
  head/sys/dev/ixl/i40e_adminq.c
  head/sys/dev/ixl/i40e_adminq.h
  head/sys/dev/ixl/i40e_adminq_cmd.h
  head/sys/dev/ixl/i40e_alloc.h
  head/sys/dev/ixl/i40e_common.c
  head/sys/dev/ixl/i40e_dcb.c
  head/sys/dev/ixl/i40e_dcb.h
  head/sys/dev/ixl/i40e_devids.h
  head/sys/dev/ixl/i40e_hmc.c
  head/sys/dev/ixl/i40e_hmc.h
  head/sys/dev/ixl/i40e_lan_hmc.c
  head/sys/dev/ixl/i40e_lan_hmc.h
  head/sys/dev/ixl/i40e_nvm.c
  head/sys/dev/ixl/i40e_osdep.c
  head/sys/dev/ixl/i40e_osdep.h
  head/sys/dev/ixl/i40e_prototype.h
  head/sys/dev/ixl/i40e_register.h
  head/sys/dev/ixl/i40e_status.h
  head/sys/dev/ixl/i40e_type.h
  head/sys/dev/ixl/i40e_virtchnl.h
  head/sys/dev/ixl/if_ixl.c
  head/sys/dev/ixl/if_ixlv.c
  head/sys/dev/ixl/ixl.h
  head/sys/dev/ixl/ixl_iw.c
  head/sys/dev/ixl/ixl_iw.h
  head/sys/dev/ixl/ixl_iw_int.h
  head/sys/dev/ixl/ixl_pf.h
  head/sys/dev/ixl/ixl_pf_i2c.c
  head/sys/dev/ixl/ixl_pf_iov.c
  head/sys/dev/ixl/ixl_pf_iov.h
  head/sys/dev/ixl/ixl_pf_main.c
  head/sys/dev/ixl/ixl_pf_qmgr.c
  head/sys/dev/ixl/ixl_pf_qmgr.h
  head/sys/dev/ixl/ixl_txrx.c
  head/sys/dev/ixl/ixlv.h
  head/sys/dev/ixl/ixlv_vc_mgr.h
  head/sys/dev/ixl/ixlvc.c
  head/sys/dev/ixl/virtchnl.h
  head/sys/modules/ixl/Makefile
Just a quick note that we're still seeing the same problem on our production servers if we enable "tso" on the 10G interfaces. FreeBSD 11.2-RELEASE-p6. Haven't been able to reproduce it on the test servers (identical hardware) running 11.2-RELEASE-p5 (and 12.0-RELEASE) so far though (but they don't see any traffic)...

Driver version:
> dev.ixl.0.%desc: Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.9.9-k

Firmware:
> dev.ixl.0.fw_version: fw 6.80.48812 api 1.7 nvm 6.00 etid 80003751 oem 18.4608.17

Watchdog events in the output from sysctl -a:
> dev.ixl.0.watchdog_events: 4

Dmesg errors:
> ixl0: WARNING: queue 3 appears to be hung!
> ixl0: WARNING: queue 2 appears to be hung!
> ixl2: WARNING: queue 2 appears to be hung!
> ixl2: WARNING: queue 4 appears to be hung!
> ixl2: WARNING: queue 7 appears to be hung!
> ixl2: WARNING: queue 3 appears to be hung!
> ixl0: WARNING: queue 7 appears to be hung!
> ixl2: WARNING: queue 3 appears to be hung!
> ixl0: WARNING: queue 4 appears to be hung!

(Output from ifconfig with TSO disabled)
> # ifconfig lagg0
> lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     options=6404bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
>     ether 3c:fd:fe:25:47:a0
>     inet6 fe80::3efd:feff:fe25:47a0%lagg0 prefixlen 64 scopeid 0xa
>     inet6 2001:6b0:17:2400::8:43 prefixlen 64
>     inet 130.236.8.43 netmask 0xffffffe0 broadcast 130.236.8.63
>     nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
>     media: Ethernet autoselect
>     status: active
>     groups: lagg
>     laggproto lacp lagghash l2,l3,l4
>     laggport: ixl0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
>     laggport: ixl2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

iperf3 output with TSO disabled:
> # iperf3 -c filur00 -t4
> Connecting to host filur00, port 5201
> [ 5] local 2001:6b0:17:2400::8:43 port 51226 connected to 2001:6b0:17:2400::8:40 port 5201
> [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
> [ 5] 0.00-1.00 sec    318 MBytes   2.66 Gbits/sec  0     561 KBytes
> [ 5] 1.00-2.00 sec    350 MBytes   2.94 Gbits/sec  0     1.11 MBytes
> [ 5] 2.00-3.00 sec    392 MBytes   3.28 Gbits/sec  0     1.67 MBytes
> [ 5] 3.00-4.00 sec    351 MBytes   2.94 Gbits/sec  0     1.77 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bitrate         Retr
> [ 5] 0.00-4.00 sec    1.38 GBytes  2.95 Gbits/sec  0     sender
> [ 5] 0.00-4.00 sec    1.38 GBytes  2.95 Gbits/sec        receiver
>
> iperf Done.

With TSO enabled (when things work):
> # ifconfig lagg0 tso ; iperf3 -c filur00 -t4
> Connecting to host filur00, port 5201
> [ 5] local 2001:6b0:17:2400::8:43 port 51237 connected to 2001:6b0:17:2400::8:40 port 5201
> [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
> [ 5] 0.00-1.00 sec    976 MBytes   8.19 Gbits/sec  0     492 KBytes
> [ 5] 1.00-2.00 sec    1.08 GBytes  9.29 Gbits/sec  0     1021 KBytes
> [ 5] 2.00-3.00 sec    1.08 GBytes  9.29 Gbits/sec  0     1.50 MBytes
> [ 5] 3.00-4.00 sec    1.08 GBytes  9.28 Gbits/sec  0     1.75 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bitrate         Retr
> [ 5] 0.00-4.00 sec    4.20 GBytes  9.01 Gbits/sec  0     sender
> [ 5] 0.00-4.00 sec    4.19 GBytes  9.01 Gbits/sec        receiver
>
> iperf Done.

But often the queues get stuck and traffic freezes.

Hmm.. I just noticed that it was IPv6 that stopped working when I tried to enable TSO on a production server and ran iperf3 on it - IPv4 traffic was still passing through. Could it be that there are still IPv6 (TSO6)-related bugs and that the IPv4 ones are solved?

Too bad I can't find a way to force it to happen on the test servers...
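Since the hangs above show up in dev.ixl.0.watchdog_events before anyone notices the stall, it may help to log when that counter moves. The following is only a rough sketch: the function names, the unit argument and the 10-second default interval are made up for illustration, and the only thing taken from this thread is the sysctl name.

```sh
#!/bin/sh
# Print a line whenever a counter grows between two samples.
# Usage: counter_delta NAME PREVIOUS CURRENT
counter_delta() {
    if [ "$3" -gt "$2" ]; then
        echo "$1: +$(( $3 - $2 )) (now $3)"
    fi
}

# Poll dev.ixl.UNIT.watchdog_events every INTERVAL seconds.
# UNIT and INTERVAL are hypothetical parameters of this sketch.
watch_ixl_watchdog() {
    unit=${1:-0}
    interval=${2:-10}
    oid="dev.ixl.$unit.watchdog_events"
    prev=$(sysctl -n "$oid") || return 1
    while sleep "$interval"; do
        cur=$(sysctl -n "$oid") || return 1
        counter_delta "$oid" "$prev" "$cur"
        prev=$cur
    done
}
```

Running `watch_ixl_watchdog 0` in one terminal while enabling TSO in another should show whether each stall coincides with a watchdog event on that port.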
Hi, have you tried the recently released version 1.10.4 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233531)? Also, FreeBSD 12.0 comes with ixl version 2.1.0-k.
I'm running 1.10.4 on one of our test servers (running 11.2-RELEASE-p5), and 2.1.0-k on FreeBSD 12.0 on another. Both work fine. But then again, so does the same version (1.9.9-k) that fails on the production servers... I hesitate a bit to experiment on the production systems. I've been trying today to provoke the problem on the test servers (without having to set up Samba servers and a couple of hundred Windows clients to connect and simulate the users), but so far no real luck... Sigh.
(In reply to Peter Eriksson from comment #22) Your issue doesn't look like the original bug's because your logs don't mention MDD events. Maybe there's some interaction with the driver and lagg?
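To check quickly which of the two symptoms a given log contains, the messages can be counted separately. This is a small sketch, not a diagnostic tool: the function name is invented here, and the two patterns are simply the message texts quoted earlier in this thread.

```sh
#!/bin/sh
# Summarize ixl symptoms in a kernel log read from stdin.
# Both patterns are copied from messages quoted in this thread.
ixl_log_summary() {
    awk '
        /Malicious Driver Detection event/ { mdd++ }
        /appears to be hung!/              { hung++ }
        END { printf "mdd=%d hung=%d\n", mdd, hung }
    '
}
```

For example, `ixl_log_summary < /var/log/messages` or `dmesg | ixl_log_summary`. A log with hung-queue warnings but `mdd=0` would match the pattern described in this comment rather than the original report.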
(In reply to Peter Eriksson from comment #22) Do you have any scripts that reconfigure the interface (ifconfig ixl0 down up/mtu/tso) while it receives/transmits traffic?
> Do you have any scripts that reconfigure the interface (ifconfig ixl0 down
> up/mtu/tso) while it receives/transmits traffic?

Hmm... Aha! Bingo! When testing on the production servers (that always receive SMB/NFS/SSH traffic) I just did an "ifconfig lagg0 tso" to enable it and then started my iperf3 testing (and some off and on to get test data).

I can now reliably reproduce this if I start an "iperf3" test session between two servers and, while it is running, disable / enable tso "on the fly".

I can now provoke the "hang" on:

FreeBSD 11.2-RELEASE-p6 with ixl driver 1.9.9-k and firmware 5.60
FreeBSD 11.2-RELEASE-p6 with ixl driver 1.10.4 and firmware 6.80

I've not (so far) been able to provoke it to occur on:

FreeBSD 12.0-RELEASE-p0 with ixl driver 2.1.0-k and firmware 6.80

Sometimes it self-heals after a while, but most often I have to do an "ifconfig lagg0 down ; ifconfig lagg0 up" to get it to recover.
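The reproduction described above could be scripted roughly as follows. This is a sketch under assumptions: IFACE, PEER, CYCLES and RUN are hypothetical knobs (the defaults lagg0 and filur00 are the names used earlier in this thread), and the 2-second pause and 60-second iperf3 run are arbitrary choices. By default it only prints the commands; set RUN=1 (as root) to actually execute them, and do not do that on a machine that must stay reachable.

```sh
#!/bin/sh
# Toggle TSO on an interface while iperf3 traffic is flowing.
# IFACE, PEER, CYCLES and RUN are hypothetical knobs for this sketch.
IFACE=${IFACE:-lagg0}
PEER=${PEER:-filur00}
CYCLES=${CYCLES:-5}

# Print each command; execute it only when RUN=1 (ifconfig needs root).
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Disable and re-enable TSO CYCLES times, pausing between changes.
toggle_tso() {
    i=0
    while [ "$i" -lt "$CYCLES" ]; do
        run ifconfig "$IFACE" -tso
        run sleep 2
        run ifconfig "$IFACE" tso
        run sleep 2
        i=$((i + 1))
    done
}

# Start the load in the background, then flip TSO underneath it.
reproduce() {
    run iperf3 -c "$PEER" -t 60 &
    toggle_tso
    wait
}
```

Calling `reproduce` with RUN unset gives a dry run that lists the commands; with RUN=1 a stall should show up as the iperf3 bitrate dropping to zero. Recovery, as noted above, may need "ifconfig lagg0 down ; ifconfig lagg0 up".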
^Triage: - Close (appears resolved) - Track MFC - head was 12.x in base r333149 - MFC'd to stable/11 in base r333343