Created attachment 223330: full dmesg.boot

I have a problem with FreeBSD 12.2p1. It started on FreeBSD 12.0 and persists to this day.

The network hangs shortly after boot. Ifconfig down+up solves the problem for the next 5-20 hours. I don't know how to force the NIC to hang immediately; all I can do is run ifconfig down/up and wait another N hours.

MBoard: SuperMicro X10SRI-F
NIC: <Intel(R) PRO/1000 PCI-Express Network Driver> port 0xe020-0xe03f mem 0xfb120000-0xfb13ffff,0xfb144000-0xfb147fff irq 43 at device 0.0 on pci5
firewall: pf

There are no VLANs, only jails on 10.0.0.1/16.

For now I use a simple watchdog script that runs every minute. It pings the gateway and does ifconfig down/up on failure. It also dumps these values every minute:

/sbin/pfctl -si > /root/WATCHDOG/failure.txt
/usr/bin/netstat -m >> /root/WATCHDOG/failure.txt
/sbin/sysctl -a | /usr/bin/grep dev.igb >> /root/WATCHDOG/failure.txt

I have two files: one from just before the hang, the other from just after it. I hope some of the values in them help.

From dmesg.boot:

igb0: <Intel(R) PRO/1000 PCI-Express Network Driver> port 0xe020-0xe03f mem 0xfb120000-0xfb13ffff,0xfb144000-0xfb147fff
igb0: Using 1024 TX descriptors and 1024 RX descriptors
igb0: Using an MSI interrupt
igb0: Ethernet address: ac:1f:6b:02:8a:c4
igb0: netmap queues/slots: TX 1/1024, RX 1/1024

$ ifconfig igb0
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether ac:1f:6b:02:8a:c4
	inet IP-ADDR netmask 0xffffff00 broadcast IP-ADDR
	media: Ethernet 100baseTX <full-duplex>
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

/etc/rc.conf:

zfs_enable="YES"
sshd_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
local_unbound_enable="YES"
hostname="..."
ifconfig_igb0="inet IP-ADDR netmask 255.255.255.0"
defaultrouter="IP-ADDR"
ifconfig_lo0_alias0="inet 10.0.0.1 netmask 0xffff0000"
pf_enable="YES"
pflog_enable="YES"
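For readers who want to reproduce the setup, the watchdog described in this report could look roughly like the sketch below. It is not the reporter's actual script, which was not posted. The variable names (`IFACE`, `GATEWAY`, `DUMP_FILE`, `BOUNCE_DELAY`, `WATCHDOG_RUN`, and the `*_CMD` overrides) and the placeholder gateway address are assumptions made for illustration; the three dump commands are the ones quoted in the report, and `ping -t` is used with its FreeBSD meaning (overall timeout in seconds).

```shell
#!/bin/sh
# Sketch of a per-minute NIC watchdog. Every name here is illustrative.

IFACE="${IFACE:-igb0}"
GATEWAY="${GATEWAY:-192.0.2.1}"            # placeholder address, replace it
DUMP_FILE="${DUMP_FILE:-/root/WATCHDOG/failure.txt}"
PING_CMD="${PING_CMD:-/sbin/ping}"
IFCONFIG_CMD="${IFCONFIG_CMD:-/sbin/ifconfig}"

# Write the three diagnostics listed in the report into DUMP_FILE.
dump_state() {
    /sbin/pfctl -si > "$DUMP_FILE"
    /usr/bin/netstat -m >> "$DUMP_FILE"
    /sbin/sysctl -a | /usr/bin/grep dev.igb >> "$DUMP_FILE"
}

# Succeed if the gateway answers at least one of three pings.
gateway_alive() {
    "$PING_CMD" -c 3 -t 5 "$GATEWAY" > /dev/null 2>&1
}

# Bounce the interface: down, short pause, up.
bounce_iface() {
    "$IFCONFIG_CMD" "$IFACE" down
    sleep "${BOUNCE_DELAY:-2}"
    "$IFCONFIG_CMD" "$IFACE" up
}

# One watchdog pass: bounce only when the gateway is unreachable.
watchdog_once() {
    if gateway_alive; then
        echo "ok"
    else
        echo "bounce"
        bounce_iface
    fi
}

# Run a pass only when explicitly asked, so sourcing has no side effects.
if [ "${WATCHDOG_RUN:-0}" = "1" ]; then
    dump_state
    watchdog_once
fi
```

Run from cron once a minute with `WATCHDOG_RUN=1` set, this gives the same ping-then-bounce behaviour. Keeping the previous dump under a different name before overwriting it would preserve a "before" and an "after" file.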
Created attachment 223331: watchdog dump just before the network hang
Created attachment 223332: watchdog dump right after the hang (before ifconfig down/up)
Where did you get igb in 12.2?

HISTORY
     The em device driver first appeared in FreeBSD 4.4. em was merged with the igb device driver and converted to the iflib framework in FreeBSD 12.0.
$ ifconfig -a
igb0: flags=... .....
igb1: flags= .....
lo0: flags= .....
pflog0: flags=0 .....

I know that em and igb are now merged, but the system still names these devices igb0, igb1, ... Technically it's an I350. I used 'igb' in the bug report title because other users with this problem will probably search for that word.

$ pciconf -lBbceVv | grep igb0
igb0@pci0:4:0:0: class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'I350 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfb120000, size 131072, enabled
    bar   [18] = type I/O Port, range 32, base 0xe020, size 32, enabled
    bar   [1c] = type Memory, range 32, base 0xfb144000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks enabled with 1 message
    cap 11[70] = MSI-X supports 10 messages
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 ac1f6bffff028ac4
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 8 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0004
                     VF Device ID 0x1520
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
    ecap 0017[1a0] = TPH Requester 1
    ecap 0018[1c0] = LTR 1
    ecap 000d[1d0] = ACS 1
(In reply to Vladislav Shabanov from comment #4)
Sometimes something like this can help (as a workaround):

ifconfig igb0 -rxcsum -rxcsum6 -txcsum -txcsum6 -tso -lro -vlanmtu -vlanhwtag -vlanhwfilter -vlanhwtso
(In reply to VVD from comment #5)
Thanks. I applied all these -xxx options. Interestingly, the NIC hung right after I ran that command line. Then I executed

$ ifconfig igb0 -rxcsum -rxcsum6 -txcsum -txcsum6 -tso -lro -vlanmtu -vlanhwtag -vlanhwfilter vlanhwtso

and it hung again. I tried once more with '-vlanhwtso' and it hung yet again.

Thanks to VVD, I can now reproduce the bug manually. Thank you :)
(In reply to Vladislav Shabanov from comment #6)
Try running ifconfig with each parameter separately:

ifconfig igb0 -rxcsum
ifconfig igb0 rxcsum
ifconfig igb0 -rxcsum6
ifconfig igb0 rxcsum6
ifconfig igb0 -txcsum
ifconfig igb0 txcsum
ifconfig igb0 -txcsum6
ifconfig igb0 txcsum6
ifconfig igb0 -tso
ifconfig igb0 tso
ifconfig igb0 -lro
ifconfig igb0 lro
ifconfig igb0 -vlanmtu
ifconfig igb0 vlanmtu
ifconfig igb0 -vlanhwtag
ifconfig igb0 vlanhwtag
ifconfig igb0 -vlanhwfilter
ifconfig igb0 vlanhwfilter
ifconfig igb0 -vlanhwtso
ifconfig igb0 vlanhwtso

You're welcome!
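The per-option test suggested here can be scripted so that connectivity is checked after every single change, which makes it easier to see which toggle (if any particular one) precedes the hang. The following is only a sketch: `IFACE`, `TARGET`, `SETTLE_DELAY`, the `*_CMD` overrides, and the function names are assumptions for illustration, and `ping -t` is used with its FreeBSD meaning (overall timeout in seconds). The option names are exactly those listed in the comment.

```shell
#!/bin/sh
# Sketch: toggle each offload option off and on, pinging after every change.

IFACE="${IFACE:-igb0}"
TARGET="${TARGET:-192.0.2.1}"              # placeholder, use your gateway
PING_CMD="${PING_CMD:-/sbin/ping}"
IFCONFIG_CMD="${IFCONFIG_CMD:-/sbin/ifconfig}"

# Option names taken from the list in the comment.
OPTIONS="rxcsum rxcsum6 txcsum txcsum6 tso lro vlanmtu vlanhwtag vlanhwfilter vlanhwtso"

# Succeed if the target answers at least one of two pings.
link_ok() {
    "$PING_CMD" -c 2 -t 5 "$TARGET" > /dev/null 2>&1
}

# Apply one ifconfig argument, wait briefly, and report whether the
# target still answers. Returns non-zero if ifconfig fails or the link is dead.
try_option() {
    "$IFCONFIG_CMD" "$IFACE" "$1" || return 2
    sleep "${SETTLE_DELAY:-3}"
    if link_ok; then
        echo "ok after: $1"
    else
        echo "HANG after: $1"
        return 1
    fi
}

# Walk the option list, disabling then re-enabling each one, and stop at
# the first change after which the target no longer answers.
toggle_all() {
    for opt in $OPTIONS; do
        try_option "-$opt" || return 1
        try_option "$opt" || return 1
    done
}
```

Source the file and call `toggle_all` from a local console or the BMC, not over the interface under test, since the session would otherwise die together with the link.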
Bad news: at 15:00 I turned off all the options (-rxcsum -rxcsum6 ...) and left the machine alone. At 20:05 the NIC hung again.

Now I can reproduce the hang with a sequence like

ifconfig bge0 txcsum6 ; ifconfig bge0 -txcsum6
ifconfig bge0 something-else; ifconfig bge0 -something-else;

After 5-6 calls the network dies.

Alas, I can reproduce the problem but still have no workaround. The only solution is the watchdog script that pings another host and runs ifconfig down/up.
(In reply to Vladislav Shabanov from comment #8) bge or igb???
Also, why is igb0 using MSI, but igb1 is using MSI-X?
(In reply to Eric Joyner from comment #10)
I experimented with /boot/loader.conf:

hw.igb.num_queues="4"
dev.igb.0.iflib.disable_msix="1"

No luck.

Sorry for the typo in #8, it should be igb0.
I am wondering if there is something electrically wrong, or a negative interaction with NC-SI (BMC). You have a fair number of CRC errors (dev.igb.0.mac_stats.crc_errs: 35330623), and Supermicro does get things wrong.

In your watchdog, dumping dev.igb.0.reg_dump as well would be interesting. It is not dumped by default when you query a higher-level sysctl OID.

Try updating the BIOS and BMC (https://www.supermicro.com/en/products/motherboard/X10SRi-F) and, if you are able, try a FreeBSD 12-STABLE kernel.
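Since the watchdog in the report collects its data with `sysctl -a | grep dev.igb`, the register dump has to be requested by its full name. A minimal sketch of such an addition follows; the function name `dump_regs` and the `SYSCTL_CMD` and `DUMP_FILE` variables are assumptions for illustration, while the OID `dev.igb.0.reg_dump` is the one named in this comment.

```shell
#!/bin/sh
# Sketch: append an OID that a higher-level query skips by naming it explicitly.

SYSCTL_CMD="${SYSCTL_CMD:-/sbin/sysctl}"
DUMP_FILE="${DUMP_FILE:-/root/WATCHDOG/failure.txt}"

# Append the register dump for one igb unit (default: unit 0) to DUMP_FILE.
dump_regs() {
    unit="${1:-0}"
    "$SYSCTL_CMD" "dev.igb.${unit}.reg_dump" >> "$DUMP_FILE"
}
```

Calling `dump_regs` right after the existing three dump commands keeps everything in the same file; `dump_regs 1` would do the same for igb1.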
Kevin, thank you for your reply.

Changes on this server from March 2021 until now (October 2021):

1. It now runs FreeBSD 13.0.
2. Both ends of the cable are configured to media 100baseTX mediaopt full-duplex. There are no CRC errors now, but the problem persists.
3. The BIOS version is dated 2019.11.02. As far as I understand, all newer versions only patch the CPU microcode and change nothing in other parts of the BIOS. However, I will update the BIOS in the coming weeks.
4. The problem persists; a hang occurs every couple of days.
5. I added a dump of the RX registers; a new pair of files is attached.

With best regards,
Vlad
Created attachment 228464: watchdog dump before the hang, 2021.09.29
Created attachment 228465: watchdog dump after the hang, 2021.09.29
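A pair of dumps like these is easier to read when only the counters that changed are shown. The helper below is a sketch of one way to do that; the function name `diff_counters` is hypothetical, and it assumes the `name: value` line format that sysctl prints. Lines whose value is not a plain integer (including the pfctl and netstat parts of the dump) are ignored.

```shell
#!/bin/sh
# Sketch: list integer sysctl values that differ between two dump files.
# Usage: diff_counters BEFORE_FILE AFTER_FILE
diff_counters() {
    awk -F': ' '
        # First file: remember every line of the form "name: integer".
        NR == FNR {
            if ($2 ~ /^-?[0-9]+$/) before[$1] = $2
            next
        }
        # Second file: print names whose integer value changed, with the delta.
        ($1 in before) && $2 ~ /^-?[0-9]+$/ && ($2 + 0) != (before[$1] + 0) {
            printf "%s: %s -> %s (delta %d)\n", $1, before[$1], $2, $2 - before[$1]
        }
    ' "$1" "$2"
}
```

For example, `diff_counters before.txt after.txt` (file names are placeholders) prints one line per changed counter, which makes it quick to see whether error or drop counters moved across the hang.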
(In reply to Vladislav Shabanov from comment #13)
There are a lot more changes than that, so it is worth a shot: https://www.supermicro.com/Bios/softfiles/14136/X10SRi-F_BIOS_3.4_release_notes.pdf. Supermicro is not good about changelogs in general.

Nothing in the provided data stands out to me so far. What is on the other side of the link, and why force 100 FDX? Is NC-SI (the shared BMC port) in use, or are you using the dedicated port?

I will work on a patch to log the EEPROM/NVM version so we can correlate issues with Intel's public specification update document (http://iommu.com/datasheets/e1000-datasheets/333066%20-%20I350_SpecUpdate_Rev3.1.pdf). I don't have any other clues or suggestions at the moment.