Bug 243463 - ix0: Watchdog timeout
Summary: ix0: Watchdog timeout
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: IntelNetworking, needs-qa
Depends on:
Blocks:
 
Reported: 2020-01-20 08:47 UTC by Jiri
Modified: 2020-04-30 09:11 UTC
CC List: 3 users

See Also:
koobs: mfc-stable12?
koobs: mfc-stable11?


Description Jiri 2020-01-20 08:47:59 UTC
Hi all,
I am observing some strange behavior on my NIC, a dual-port Intel X520 with only one port connected.
Real traffic is about 400 Mbit/s RX / 100 Mbit/s TX. The machine is a Supermicro X11SCL-F / Xeon E-2176G / 2x8GB RAM with the latest BIOS (R1.2), acting as a pf firewall/router. It is a new install, running its first day.

ix0: Watchdog timeout (TX: 0 desc avail: 34 pidx: 1455) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 0 desc avail: 34 pidx: 1885) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 0 desc avail: 34 pidx: 1062) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 1 desc avail: 34 pidx: 177) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 0 desc avail: 33 pidx: 1275) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 0 desc avail: 34 pidx: 2014) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 0 desc avail: 34 pidx: 707) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
ix0: Watchdog timeout (TX: 0 desc avail: 34 pidx: 653) -- resetting
ix0: link state changed to DOWN
ix0: link state changed to UP
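
For triage, the watchdog messages above can be tallied per TX queue. The following is a minimal Python sketch, not part of the ix driver or of any FreeBSD tool. The regular expression is inferred only from the message format shown in this report, and the names parse_watchdog and queue_counts are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this report (an assumption,
# not taken from the driver source).
WATCHDOG_RE = re.compile(
    r"^(?P<ifname>\w+): Watchdog timeout \(TX: (?P<queue>\d+) "
    r"desc avail: (?P<avail>\d+) pidx: (?P<pidx>\d+)\)"
)


def parse_watchdog(lines):
    """Return one dict per watchdog timeout line; other lines are skipped."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            events.append({
                "ifname": match["ifname"],
                "queue": int(match["queue"]),
                "avail": int(match["avail"]),
                "pidx": int(match["pidx"]),
            })
    return events


def queue_counts(events):
    """Count watchdog events per TX queue index."""
    return Counter(event["queue"] for event in events)
```

Fed the eight watchdog messages above, queue_counts reports seven events on TX queue 0 and one on TX queue 1.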

ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8138b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER>
        ether a0:36:9f:26:fb:b8
        inet x.x.x.x netmask 0xfffffff8 broadcast y.y.y.y
        media: Ethernet autoselect (10Gbase-LR <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        plugged: SFP/SFP+/SFP28 10G Base-LR (LC)
        vendor: Intel Corp PN: SFP-10G13-LR SN: IB81220374 DATE: 2018-12-20
        module temperature: 36.87 C Voltage: 3.28 Volts
        RX: 0.54 mW (-2.64 dBm) TX: 0.71 mW (-1.43 dBm)

ix0@pci0:1:0:0: class=0x020000 card=0x7b118086 chip=0x154d8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet 10G 2P X520 Adapter'
    class      = network
    subclass   = ethernet

sysctl.conf
kern.ipc.maxsockbuf=16777216
net.inet.tcp.mssdflt=1460
net.inet.tcp.minmss=536

loader.conf
cc_htcp_load="YES"
machdep.hyperthreading_allowed="0"
net.inet.tcp.soreceive_stream="1"
net.isr.maxthreads="-1"
net.isr.bindthreads="-1"
net.pf.source_nodes_hashsize="1048576"

Jiri
Comment 1 Krzysztof Galazka 2020-01-21 09:34:11 UTC
(In reply to Jiri from comment #0)

Could you please check whether applying this patch https://reviews.freebsd.org/D21712 makes any difference? I would like to rule out the possibility that the watchdog timeouts are false positives.
Comment 2 Jiri 2020-01-21 09:58:07 UTC
(In reply to Krzysztof Galazka from comment #1)

Thank you.

OK, I'll do it at night. The router has now been running for two days under the same traffic conditions (no reboots, no config changes) and no ix0 timeouts have appeared.

Jiri
Comment 3 Jiri 2020-01-23 14:54:02 UTC
I have applied the patch. No timeouts or messages like "queue can't be marked as hung if interface is down" have appeared.
Next I'll try shutting down the switch port and stressing the link with traffic. I'll let you know if anything happens.
Jiri
Comment 4 Jiri 2020-01-27 15:42:57 UTC
I tried switching the optical link to my ix0 off and on by manually removing the fibers. The kernel did not detect any outage: there was no ix0 down/up message in the log (the outage lasted about 7 seconds according to the switch).
No errors have appeared in the log, and the system has been running for about 5 days since I applied the recommended patch.
Strange, but it looks fully operational.
Comment 5 Jiri 2020-03-10 08:18:15 UTC
Two outages were observed. No kernel message, no log event. ix0 stopped communicating while still appearing to be up. Bringing the interface down/up with ifconfig resolved the issue.
Maybe it is a bad network card, if nobody else has this problem?
Comment 6 Denis Ahrens 2020-03-27 02:55:43 UTC
Looks like bug 235524 to me.

The igb interface does not survive iperf3 -t 300 for me.
Comment 7 Jiri 2020-03-27 09:19:00 UTC
Yes, I agree. I have added a second X520 card; it carries very low traffic (a few Mbit/s) and no problem has been observed there. This timeout behavior probably depends on traffic, i.e. high traffic = problem.
Comment 8 Jiri 2020-04-30 09:11:17 UTC
Some new information.
The kernel is patched to P3. The problem may not be in the Intel driver.
I replaced the Intel NIC with a Mellanox ConnectX-4 NIC, hoping to solve the traffic outages. A cron script is running that tests connectivity to the gateway and brings the interface down/up when the test fails.
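
The script itself is not included in this report. As an illustration only, a check of this kind could look roughly like the Python sketch below. Everything in it is an assumption: the function names, the gateway address 192.0.2.1, the log path, and the ping options (`-c` count and `-t` overall timeout, as on FreeBSD's ping). The log line imitates the format of the entries quoted in this comment.

```python
import subprocess
import time


def gateway_reachable(gateway, run=subprocess.run):
    """Return True if ping to the gateway exits successfully."""
    # FreeBSD ping: -c is the packet count, -t the overall timeout in seconds.
    result = run(["ping", "-c", "3", "-t", "5", gateway],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def restart_interface(ifname, run=subprocess.run, pause=1.0):
    """Bring the interface down, wait briefly, and bring it up again."""
    run(["ifconfig", ifname, "down"], check=True)
    time.sleep(pause)
    run(["ifconfig", ifname, "up"], check=True)


def check_once(ifname, gateway, log_path, run=subprocess.run, pause=1.0):
    """Restart ifname and log the event if the gateway is unreachable.

    Returns True when a restart was performed, False otherwise.
    """
    if gateway_reachable(gateway, run):
        return False
    restart_interface(ifname, run, pause)
    stamp = time.strftime("%a %b %d %H:%M:%S %Z %Y")
    with open(log_path, "a") as log:
        log.write(f"{stamp} interface {ifname} restart\n")
    return True


if __name__ == "__main__":
    # Hypothetical values; the real gateway and log path are not in the report.
    check_once("ix0", "192.0.2.1", "/var/log/ifrestart.log")
```

The `run` parameter exists only so the commands can be replaced with a stub when trying the logic without touching a real interface.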

Log attached:
Fri Apr 10 19:57:13 CEST 2020 interface ix0 restart
Fri Apr 10 20:38:13 CEST 2020 interface ix0 restart
Sat Apr 11 17:45:13 CEST 2020 interface ix0 restart
Sat Apr 11 20:00:13 CEST 2020 interface ix0 restart
Sat Apr 11 20:30:13 CEST 2020 interface ix0 restart
Sun Apr 12 19:16:13 CEST 2020 interface ix0 restart
Sun Apr 12 20:30:13 CEST 2020 interface ix0 restart
Sun Apr 26 00:27:13 CEST 2020 interface mce0 restart
Sun Apr 26 04:48:13 CEST 2020 interface mce0 restart
Sun Apr 26 11:12:13 CEST 2020 interface mce0 restart
Wed Apr 29 21:27:13 CEST 2020 interface mce0 restart
Wed Apr 29 21:33:13 CEST 2020 interface mce0 restart
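
To compare the two cards, the restart log above can be summarized per interface. This is a small Python sketch under the assumption that every entry ends in "interface <name> restart", as in the lines shown; the function name restarts_per_interface is hypothetical.

```python
from collections import Counter


def restarts_per_interface(lines):
    """Count log entries of the form '... interface <ifname> restart'."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Only the last three fields matter; the timestamp is ignored.
        if len(parts) >= 3 and parts[-3] == "interface" and parts[-1] == "restart":
            counts[parts[-2]] += 1
    return counts
```

For the twelve entries above this gives 7 restarts for ix0 and 5 for mce0.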

After changing the network card, no problems appeared from Apr 12 to Apr 26. But since Apr 26, traffic on the Mellanox card does not stop completely; instead about 80% packet loss appears. Bringing the interface down/up resolves the issue, as in the Intel case.
Other servers in my network have working connectivity to the gateway, so there are no connectivity/switch issues.
Maximum traffic on the NIC is about 1.3 Gbit/s.
On the Mellanox NIC this log event appears: "arpresolve: can't allocate llinfo for xxx.xxx.xxx.xxx on mce0"

Jiri