Bug 254341 - igb hung every 5-20 hours. ifconfig down+up solves the problem for next N hours
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2021-03-16 18:29 UTC by Vladislav Shabanov
Modified: 2021-10-05 20:54 UTC

See Also:


Attachments
full dmesg.boot (11.28 KB, text/plain)
2021-03-16 18:29 UTC, Vladislav Shabanov
watchdog dump just before the network hung (16.72 KB, text/plain)
2021-03-16 18:30 UTC, Vladislav Shabanov
watchdog dump right after the hung (before ifconfig down/up) (16.72 KB, text/plain)
2021-03-16 18:31 UTC, Vladislav Shabanov
watchdog dump before the hung, 2021.09.29 (22.92 KB, text/plain)
2021-10-05 14:45 UTC, Vladislav Shabanov
watchdog dump after the hung, 2021.09.29 (22.91 KB, text/plain)
2021-10-05 14:46 UTC, Vladislav Shabanov

Description Vladislav Shabanov 2021-03-16 18:29:24 UTC
Created attachment 223330
full dmesg.boot

I have a problem with FreeBSD 12.2p1. The problem started on FreeBSD 12.0 and persists to this day.

The network hangs every 5-20 hours. Running ifconfig down+up fixes it for the next 5-20 hours. I don't know how to force the NIC to hang immediately; I just run ifconfig down/up and wait for the next N hours.

MBoard: SuperMicro X10SRI-F
NIC: <Intel(R) PRO/1000 PCI-Express Network Driver> port 0xe020-0xe03f mem 0xfb120000-0xfb13ffff,0xfb144000-0xfb147fff irq 43 at device 0.0 on pci5
firewall: pf 
There is no vlan, only jails working on 10.0.0.1/16

For now I use a simple watchdog script that runs every minute. The script pings the gateway and does ifconfig down/up on error.
The script dumps these values every minute:
        /sbin/pfctl -si > /root/WATCHDOG/failure.txt
        /usr/bin/netstat -m >> /root/WATCHDOG/failure.txt
        /sbin/sysctl -a | /usr/bin/grep dev.igb >> /root/WATCHDOG/failure.txt
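
A minimal sketch of such a watchdog, not the reporter's actual script. The gateway address, the `DRY_RUN` switch, and the function names are assumptions made for illustration; the three dump commands and the dump path are the ones listed above.

```shell
#!/bin/sh
# Watchdog sketch: snapshot driver state, ping the gateway, and bounce the
# interface when the ping fails. Meant to be started once a minute.
# Assumptions (not from the report): GATEWAY, DRY_RUN, all function names.

IFACE="${IFACE:-igb0}"
GATEWAY="${GATEWAY:-192.0.2.1}"            # hypothetical gateway address
DUMP="${DUMP:-/root/WATCHDOG/failure.txt}"
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print the commands

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

dump_state() {
    # The same three commands as in the report, written to one file.
    {
        /sbin/pfctl -si
        /usr/bin/netstat -m
        /sbin/sysctl -a | /usr/bin/grep dev.igb
    } > "$DUMP"
}

link_ok() {
    # FreeBSD ping: two probes, 2-second overall timeout.
    /sbin/ping -c 2 -t 2 "$GATEWAY" > /dev/null 2>&1
}

bounce() {
    run /sbin/ifconfig "$IFACE" down
    run /sbin/ifconfig "$IFACE" up
}

watchdog() {
    dump_state              # snapshot on every run
    link_ok && return 0     # gateway answers: nothing to do
    bounce                  # gateway silent: ifconfig down/up
}
```

To use it, append a final `watchdog` line, set `DRY_RUN=0`, and start the script from cron every minute. The dry-run default is a deliberate choice so that a copy-pasted sketch does not take an interface down by accident.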

I have two files: one from before the hang, the other from just after it. I hope some of the values inside can help.


———————————————————————————————
- from dmesg.boot
igb0: <Intel(R) PRO/1000 PCI-Express Network Driver> port 0xe020-0xe03f mem 0xfb120000-0xfb13ffff,0xfb144000-0xfb147fff 
igb0: Using 1024 TX descriptors and 1024 RX descriptors
igb0: Using an MSI interrupt
igb0: Ethernet address: ac:1f:6b:02:8a:c4
igb0: netmap queues/slots: TX 1/1024, RX 1/1024

———————————————————————————————
$ ifconfig igb0
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether ac:1f:6b:02:8a:c4
        inet IP-ADDR netmask 0xffffff00 broadcast IP-ADDR
        media: Ethernet 100baseTX <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

———————————————————————————————
- /etc/rc.conf:
-
zfs_enable="YES"
sshd_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
local_unbound_enable="YES"
hostname="..."
ifconfig_igb0="inet IP-ADDR netmask 255.255.255.0"
defaultrouter="IP-ADDR"
ifconfig_lo0_alias0="inet 10.0.0.1 netmask 0xffff0000"
pf_enable="YES"
pflog_enable="YES"
Comment 1 Vladislav Shabanov 2021-03-16 18:30:33 UTC
Created attachment 223331
watchdog dump just before the network hung
Comment 2 Vladislav Shabanov 2021-03-16 18:31:19 UTC
Created attachment 223332
watchdog dump right after the hung (before ifconfig down/up)
Comment 3 Vladimir Druzenko freebsd_committer freebsd_triage 2021-03-16 19:44:25 UTC
Where did you get igb in 12.2?

HISTORY
     The em device driver first appeared in FreeBSD 4.4.  em was merged with
     the igb device driver and converted to the iflib framework in
     FreeBSD 12.0.
Comment 4 Vladislav Shabanov 2021-03-16 22:14:15 UTC
$ ifconfig -a 
igb0: flags=...
   .....
igb1: flags=
   .....
lo0: flags=
   .....
pflog0: flags=0
   .....

I know that em and igb are now merged, but the system still names these devices igb0, igb1, ...
Technically it's an I350.
I used 'igb' in the bug report title because other users with this problem will probably search for that word.

$ pciconf -lBbceVv | grep igb0
igb0@pci0:4:0:0:        class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfb120000, size 131072, enabled
    bar   [18] = type I/O Port, range 32, base 0xe020, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0xfb144000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks enabled with 1 message
    cap 11[70] = MSI-X supports 10 messages
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 ac1f6bffff028ac4
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 8 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0004
                     VF Device ID 0x1520
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
    ecap 0017[1a0] = TPH Requester 1
    ecap 0018[1c0] = LTR 1
    ecap 000d[1d0] = ACS 1
Comment 5 Vladimir Druzenko freebsd_committer freebsd_triage 2021-03-17 07:42:08 UTC
(In reply to Vladislav Shabanov from comment #4)
Sometimes something like this can help (workaround):
ifconfig igb0 -rxcsum -rxcsum6 -txcsum -txcsum6 -tso -lro -vlanmtu -vlanhwtag -vlanhwfilter -vlanhwtso
Comment 6 Vladislav Shabanov 2021-03-17 12:04:53 UTC
(In reply to VVD from comment #5)
Thanks. I applied all these -xxx options.

Interestingly, the NIC hangs right after I execute this command line. Then I executed
$ ifconfig igb0 -rxcsum -rxcsum6 -txcsum -txcsum6 -tso -lro -vlanmtu -vlanhwtag -vlanhwfilter vlanhwtso

and it hung again. I tried one more time with '-vlanhwtso' and it hung once more.

Thanks to VVD, I can now reproduce the bug manually. Thank you :)
Comment 7 Vladimir Druzenko freebsd_committer freebsd_triage 2021-03-17 13:22:16 UTC
(In reply to Vladislav Shabanov from comment #6)
Try to run ifconfig with each parameter:
ifconfig igb0 -rxcsum
ifconfig igb0 rxcsum
ifconfig igb0 -rxcsum6
ifconfig igb0 rxcsum6
ifconfig igb0 -txcsum
ifconfig igb0 txcsum
ifconfig igb0 -txcsum6
ifconfig igb0 txcsum6
ifconfig igb0 -tso
ifconfig igb0 tso
ifconfig igb0 -lro
ifconfig igb0 lro
ifconfig igb0 -vlanmtu
ifconfig igb0 vlanmtu
ifconfig igb0 -vlanhwtag
ifconfig igb0 vlanhwtag
ifconfig igb0 -vlanhwfilter
ifconfig igb0 vlanhwfilter
ifconfig igb0 -vlanhwtso
ifconfig igb0 vlanhwtso
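
The same twenty commands can be generated with a short loop. This is only a sketch: the `FLAGS` list is copied from the commands above, and the function name `toggle_cmds` is made up for illustration. It prints the commands rather than running them, so the output can be reviewed first.

```shell
#!/bin/sh
# Print one "disable" and one "enable" ifconfig command per offload flag.
# Nothing is executed here; pipe the output to sh to actually run it.

IFACE="${IFACE:-igb0}"
FLAGS="rxcsum rxcsum6 txcsum txcsum6 tso lro vlanmtu vlanhwtag vlanhwfilter vlanhwtso"

toggle_cmds() {
    for flag in $FLAGS; do
        echo "ifconfig $IFACE -$flag"   # turn the feature off
        echo "ifconfig $IFACE $flag"    # turn it back on
    done
}

toggle_cmds
```

Running the commands one at a time, and checking connectivity after each, shows which single flag change (if any) is enough to trigger the hang.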

You're welcome!
Comment 8 Vladislav Shabanov 2021-03-17 21:36:24 UTC
Bad news:
At 15:00 I turned off all options (-rxcsum -rxcsum6 ...) and left the machine alone. At 20:05 the NIC hung again. 

Now I can reproduce the hang using a sequence of
   ifconfig bge0 txcsum6 ; ifconfig bge0 -txcsum6
   ifconfig bge0 something-else; ifconfig bge0 -something-else;

After 5-6 calls the network dies.
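
A sketch of how this reproduction could be scripted so the number of toggles before the hang is recorded. It assumes the interface is igb0, as in the rest of the report; the gateway address, the flag list, `MAX_TOGGLES`, and the `DRY_RUN` switch are assumptions for illustration.

```shell
#!/bin/sh
# Repro sketch: enable and disable offload flags until the gateway stops
# answering, then report how many toggles it took.
# Assumptions (not from the report): GATEWAY, FLAGS, MAX_TOGGLES, DRY_RUN.

IFACE="${IFACE:-igb0}"
GATEWAY="${GATEWAY:-192.0.2.1}"      # hypothetical gateway address
FLAGS="${FLAGS:-txcsum6 rxcsum6 txcsum rxcsum}"
MAX_TOGGLES="${MAX_TOGGLES:-20}"
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

link_ok() {
    # FreeBSD ping: two probes, 2-second overall timeout.
    /sbin/ping -c 2 -t 2 "$GATEWAY" > /dev/null 2>&1
}

repro() {
    n=0
    # The limit is checked once per pass over FLAGS.
    while [ "$n" -lt "$MAX_TOGGLES" ]; do
        for flag in $FLAGS; do
            run /sbin/ifconfig "$IFACE" "$flag"
            run /sbin/ifconfig "$IFACE" "-$flag"
            n=$((n + 1))
            if ! link_ok; then
                echo "link lost after $n toggles"
                return 1
            fi
        done
    done
    echo "link still up after $n toggles"
}
```

Call `repro` with `DRY_RUN=0` from the console rather than over ssh, since the session would be lost when the hang occurs.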

Alas, I can reproduce the problem, but I still have no workaround. The only solution is a watchdog script that pings another host and runs ifconfig down/up.
Comment 9 Vladimir Druzenko freebsd_committer freebsd_triage 2021-03-17 22:02:37 UTC
(In reply to Vladislav Shabanov from comment #8)
bge or igb???
Comment 10 Eric Joyner freebsd_committer freebsd_triage 2021-03-17 22:03:02 UTC
Also, why is igb0 using MSI, but igb1 is using MSI-X?
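
One way to compare the interrupt mode of the two ports is to filter the boot messages. The sketch below assumes the driver prints a line containing "Using ... interrupt" for each port, as it does for igb0 in the dmesg excerpt in the description; the function name `intr_mode` is made up for illustration.

```shell
#!/bin/sh
# Print the interrupt mode each igb port reported at attach time.
# Reads dmesg-style text on stdin.

intr_mode() {
    grep -E '^igb[0-9]+: Using .*interrupt'
}

# Example with the two "Using" lines for igb0 quoted in this report;
# only the interrupt line passes the filter.
printf '%s\n' \
    'igb0: Using 1024 TX descriptors and 1024 RX descriptors' \
    'igb0: Using an MSI interrupt' | intr_mode
```

On the affected machine the same filter can be fed from the saved boot log, for example `intr_mode < /var/run/dmesg.boot`.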
Comment 11 Vladislav Shabanov 2021-03-18 08:56:10 UTC
(In reply to Eric Joyner from comment #10)
I experimented with /boot/loader.conf:

    hw.igb.num_queues="4"
    dev.igb.0.iflib.disable_msix="1"

No luck.

Sorry for the typo in #8; it should be igb0.
Comment 12 Kevin Bowling freebsd_committer freebsd_triage 2021-09-17 23:43:00 UTC
I am wondering if there is something electronically wrong or a negative interaction with NC-SI (BMC). You have a fair number of CRC errors (dev.igb.0.mac_stats.crc_errs: 35330623), and Supermicro gets things wrong.

In your watchdog, dumping dev.igb.0.reg_dump as well would be interesting. It won't be dumped by default when you query a higher-level sysctl OID.
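
A sketch of the extra watchdog step, assuming the same dump file as in the description. The `SYSCTL`, `DUMP`, and `UNIT` variables and the function name `dump_regs` are illustrative, not part of the reporter's script.

```shell
#!/bin/sh
# Append the register dump to the watchdog file. The OID is queried by its
# full name because a higher-level query does not include it.

SYSCTL="${SYSCTL:-/sbin/sysctl}"
DUMP="${DUMP:-/root/WATCHDOG/failure.txt}"
UNIT="${UNIT:-0}"                      # 0 selects igb0

dump_regs() {
    "$SYSCTL" "dev.igb.${UNIT}.reg_dump" >> "$DUMP"
}
```

Calling `dump_regs` right after the existing `sysctl -a | grep dev.igb` line keeps the register state in the same before/after snapshots.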

Try updating the BIOS and BMC https://www.supermicro.com/en/products/motherboard/X10SRi-F and if you are able, try FreeBSD 12-STABLE kernel.
Comment 13 Vladislav Shabanov 2021-10-05 14:43:02 UTC
Kevin, thank you for your reply.
Changes on this server from March 2021 until now (October 2021):
1. It now runs FreeBSD 13.0.
2. Both ends of the cable are now configured with media 100baseTX mediaopt full-duplex. There are no CRC errors now, but the problem persists.
3. The BIOS version is from 2019.11.02. As far as I understand, all newer versions patch the CPU microcode but change nothing in other parts of the BIOS. However, I will update the BIOS in the coming weeks.
4. The problem persists; a hang occurs every couple of days.
5. I added a dump of the RX registers; a new pair of files is attached.

With best regards,
    Vlad
Comment 14 Vladislav Shabanov 2021-10-05 14:45:47 UTC
Created attachment 228464
watchdog dump before the hung, 2021.09.29
Comment 15 Vladislav Shabanov 2021-10-05 14:46:43 UTC
Created attachment 228465
watchdog dump after the hung, 2021.09.29
Comment 16 Kevin Bowling freebsd_committer freebsd_triage 2021-10-05 20:54:42 UTC
(In reply to Vladislav Shabanov from comment #13)
There are a lot more changes, so it is worth a shot: https://www.supermicro.com/Bios/softfiles/14136/X10SRi-F_BIOS_3.4_release_notes.pdf. Supermicro is not good about changelogs in general.

There is nothing in the provided data that stands out to me so far.  What is on the other side, and why forcing 100 FDX?  Is NC-SI (shared BMC port) in use or are you using the dedicated port? 

I will work on a patch to log the EEPROM/NVM version so we can correlate issues with Intel's public specification update document: http://iommu.com/datasheets/e1000-datasheets/333066%20-%20I350_SpecUpdate_Rev3.1.pdf. I don't have any other clues or suggestions at the moment.