When using an I226 under OpnSense (FreeBSD 13.2-RELEASE kernel; I also tried FreeBSD 14.0-RELEASE), I experience connection hangups about once per day under no specific circumstances (the maximum was 3 times within one hour; I also had none for three days). The problem manifests as a dead connection (no packets are received, none are sent), but the low-level counters (dev.igc.0.mac_stats) still increase. The condition can be cleared by bringing the interface down and up again, or by briefly disconnecting the cable.

There are reports of this and other related problems all over the internet for different OSes, see:

Windows: https://forums.evga.com/PSA-Intel-I226V-25GbE-on-Raptor-Lake-Motherboards-Has-a-Connection-Drop-Issue-No-Fix-m3595279.aspx
OpnSense (FreeBSD): https://forum.opnsense.org/index.php?topic=40404.msg199288#msg199288
pfSense (FreeBSD): https://forum.netgate.com/topic/181571/chinese-i226-v-on-23-05-1-problems

My specific variant is an I226-V, rev. 4, built into a Minisforum MS-01:

igc0@pci0:87:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet

However, there are reports of the I226-LM in the same machine showing the same behaviour, see: https://forum.opnsense.org/index.php?topic=40556

igc1@pci0:88:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-LM'
    class      = network
    subclass   = ethernet

This seems to indicate that at least the whole I226 family (the successor to the problem-ridden I225, which uses the same driver module) is affected. I tried every setting I could think of to make the problem go away, such as reducing the speed from 2.5 to 1 Gbps and disabling EEE (which is off by default anyway), to no avail.
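Until there is a proper driver fix, the manual down/up workaround described above can be automated. The following is a minimal sketch of my own (not an official tool from this thread); the interface name, gateway address and function names are placeholders for illustration, and the check could be run periodically, e.g. from cron.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: if the gateway stops answering pings
# while the interface is nominally up, bounce the interface, which
# clears the hang the same way a manual down/up does.
# IFACE and GATEWAY are placeholder values, adjust for your setup.
IFACE="${IFACE:-igc0}"
GATEWAY="${GATEWAY:-192.168.1.1}"

link_dead() {
    # Declare the link dead only if three quick pings all fail
    ! ping -c 3 -t 2 -q "$GATEWAY" >/dev/null 2>&1
}

bounce_iface() {
    # Same action as the manual workaround described above
    ifconfig "$IFACE" down
    sleep 2
    ifconfig "$IFACE" up
}

# Example invocation (e.g. from a cron job running every minute):
# link_dead && bounce_iface
```

This only papers over the hang after the fact, of course; it does nothing about the underlying condition.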
Interestingly, the Minisforum MS-01 has gained much interest in the last few months, and there was a specific review on Youtube where the creator states in a comment that he is not seeing this problem (https://www.youtube.com/watch?v=_wgX1sDab-M). However, he uses OpnSense under a Proxmox hypervisor, thus using the Linux driver modules (OpnSense itself uses the virtualized virtio NICs). This, and the reports of gamers stating they had "micro-hangs" manifesting as short lags in online games, got me thinking.

So I compared the Linux and FreeBSD drivers and found that the Linux driver has a specific routine to catch, log and clear "TX hang" conditions, see from line 3150 here: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/igc/igc_main.c, which reads:

if (test_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
	struct igc_hw *hw = &adapter->hw;

	/* Detect a transmit hang in hardware, this serializes the
	 * check with the clearing of time_stamp and movement of i
	 */
	clear_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
	if (tx_buffer->next_to_watch &&
	    time_after(jiffies, tx_buffer->time_stamp +
	    (adapter->tx_timeout_factor * HZ)) &&
	    !(rd32(IGC_STATUS) & IGC_STATUS_TXOFF) &&
	    (rd32(IGC_TDH(tx_ring->reg_idx)) != readl(tx_ring->tail)) &&
	    !tx_ring->oper_gate_closed) {
		/* detected Tx unit hang */
		netdev_err(tx_ring->netdev, "Detected Tx Unit Hang\n"
			   "  Tx Queue             <%d>\n"
			   "  TDH                  <%x>\n"
			   "  TDT                  <%x>\n"
			   "  next_to_use          <%x>\n"
			   "  next_to_clean        <%x>\n"
			   "buffer_info[next_to_clean]\n"
			   "  time_stamp           <%lx>\n"
			   "  next_to_watch        <%p>\n"
			   "  jiffies              <%lx>\n"
			   "  desc.status          <%x>\n",
			   tx_ring->queue_index,
			   rd32(IGC_TDH(tx_ring->reg_idx)),
			   readl(tx_ring->tail),
			   tx_ring->next_to_use,
			   tx_ring->next_to_clean,
			   tx_buffer->time_stamp,
			   tx_buffer->next_to_watch,
			   jiffies,
			   tx_buffer->next_to_watch->wb.status);
		netif_stop_subqueue(tx_ring->netdev,
				    tx_ring->queue_index);

		/* we are about to reset, no point in enabling stuff */
		return true;
	}
}

There is also a routine to reset the adapter:
/**
 * igc_tx_timeout - Respond to a Tx Hang
 * @netdev: network interface device structure
 * @txqueue: queue number that timed out
 **/
static void igc_tx_timeout(struct net_device *netdev,
			   unsigned int __always_unused txqueue)
{
	struct igc_adapter *adapter = netdev_priv(netdev);
	struct igc_hw *hw = &adapter->hw;

	/* Do the reset outside of interrupt context */
	adapter->tx_timeout_count++;
	schedule_work(&adapter->reset_task);
	wr32(IGC_EICS,
	     (adapter->eims_enable_mask & ~adapter->eims_other));
}

I did not see anything to this effect in the FreeBSD igc driver module. Intel itself does not offer an OEM driver for FreeBSD in its Intel Network Connections 29.1 package.

So, my theory is that there is a hardware idiosyncrasy in this Intel adapter family which sometimes causes packet flow to stop. The Linux driver module handles this by testing whether packets have stopped being processed for a short period. That detection and handling would not be there if there were no problem, so we can take the problem as a fact. I suspect that the same handling is contained in the Windows drivers, too - which I cannot ascertain because I cannot look at the source code. However, it would be in line with the "micro-hangs" other users have observed under Windows. Alas, under FreeBSD there is no handling of this condition, which would explain the total packet loss after it occurs.

If it were fixed in FreeBSD, it would be a great benefit for applications like pfSense and OpnSense, since as things stand these adapters are essentially unusable. A potential fix would still produce occasional "micro-hangs", but that is far better than losing the connection completely.
A factor in some of the i225/i226 issues seems to be the NVM version installed on the card (see FreeBSD PR 265714 for a patch to display the version). My understanding is that the i225 NVM goes up to 1.94 and the i226 NVM up to 2.25. As noted on PR 265714, I've experienced problems similar to yours using an i225 with NVM version 1.79; I have an i226 on order with NVM version 2.17. It might be interesting to know the NVM version of your i226 NIC.
(In reply to john from comment #1) Interesting. I thought there was no firmware for these adapters at all and that the revision was all that separated generations. Do you happen to know whether the NIC firmware is contained in the motherboard BIOS, or not changeable at all? Alas, I cannot try the patch and check the version, since I lack the means to build my own kernel.

The reason I am asking is that the first link I referenced says something about Asus having released both a BIOS and a driver update for their Z790 Kingpin boards which seems to fix the issue (in that case the micro-hangs, indicating that not only the full hang but also the underlying hardware problem may have been fixed). If this is true, a NIC firmware update may have been contained in the new BIOS.

I have found that for my hardware, Minisforum has just released a BIOS update from version 1.17 (late 2023) to 1.22 (03/12/2024). I installed it and have been using it for two days now, so far without a hangup.
Intel, in https://community.intel.com/t5/Ethernet-Products/Intel-Ethernet-Controller-3-I225-V-Connection-Drop/td-p/1482427, directs people to flash their motherboard firmware in order to receive the new NIC NVM version. Standalone NICs can also be flashed, depending on the design. The general problem is that some OEMs either don't supply updates or are slow to provide them; e.g., IOCrest never responded to my request for an updated i225 NVM.
1. I now know that the OEMs should provide the NVM update; I have requested it, but not received it yet. Considering the neglect by OEMs, it is a shame that Intel provides updates for the I225, but not for the I226.
2. I had a hangup once again, so the Minisforum MS-01 BIOS update does nothing with respect to this problem.
3. As for the driver shortcoming addressed in this bug report: fixing it would still be a band-aid for people like me who have no remedy for the underlying hardware bug.
As it turns out, I could fix my specific problem by disabling ASPM support in the BIOS. No more hangups since then. So there is an alternative fix, at least for me.
I just experienced the same issue, and it was a PAIN to find the real cause. https://forum.opnsense.org/index.php?topic=42368

TL;DR: Mini-PC by Shuttle with Intel I226-LM 2.5 Gb ports. Under heavy load, sometimes every 6 hours, I needed to reboot my OpnSense box. The network behaved really strangely; devices became inaccessible, OpnSense too. A reboot usually fixed the issue, but only for several hours. Funnily enough, a reboot of my switch also seemed to fix it for the moment. Luckily my OEM uploaded a new BIOS this month, and it actually fixed my issues; no more reboots have been needed since the BIOS update. https://global.shuttle.com/products/productsDownload?pn=DL30N%20SERIES
Side note: In the absence of a BIOS setting, you can also try setting hw.pci.enable_aspm=0, at the expense of reduced power savings for all PCIe devices.
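For reference, since hw.pci.enable_aspm is a loader tunable, it belongs in /boot/loader.conf; a sketch (the comment wording is mine):

```
# /boot/loader.conf
# Disable ASPM negotiation for all PCIe devices at boot.
# Trades some power savings for stability, as noted above.
hw.pci.enable_aspm="0"
```

After adding the line, reboot for the tunable to take effect.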
(In reply to Dr. Uwe Meyer-Gruhl from comment #7) I finally figured out the problem on my ASUS PN65 with I226-V. It isn't solved by any of the suggestions so far, but it is related. All this is on 14.3 with the igc driver.

Setting hw.pci.enable_aspm=0 in loader.conf did not work. Updating the I226-V firmware from V2.17 to V2.32 using nvmupdate64e, as described at https://forum.opnsense.org/index.php?topic=48695.0, did not work either, though it may have been necessary. I'm not going to revert to 2.17 to test that.

I discovered that the igc device's ASPM capability in the PCIe Link Control register remained enabled even with hw.pci.enable_aspm cleared. I see no setting in this ASUS BIOS to disable ASPM for the device. There are configurations for the CPUs, but I don't see other PCIe settings; the BIOS for this machine is dumbed down.

Shouldn't the kernel or igc driver clear the ASPM enable bits in the PCIe Link Control register when enable_aspm is 0? Given that the universal solution to igc problems is disabling ASPM, should the igc driver have a default config that forces the Link Control register to disable ASPM?

I wrote a script that walks the PCIe config space, finds the control register and clears the bits; it now runs during boot. So far the machine is both stable and finally seeing the expected throughput.
For me, updating the card firmware did not fix the problem, either. It looks like a hardware idiosyncrasy. The real fix would be to implement code to detect and clear the hang condition, as is evidently done in other OSes. Disabling ASPM seems to be a mitigation, but not a real fix.
I've seen cases where the bug was in the hardware's power management, and the real solution, short of modifying the hardware, was working around the power-management bug. In that case it was a PCIe core purchased from a major IP vendor that would go into micro-sleep states based on the traffic pattern sent to it from the device; the time required to wake up caused a major loss of bandwidth. There was nothing the kernel or driver could do to fix it short of preventing the mode that allowed sleep. Given that everyone reporting problems fixes them with the same solution, and Intel hasn't worked around the issue even in the latest firmware, doesn't it seem reasonable to conclude that the driver should deal with this quirk/bug and that Intel won't?
Yes, that is why I re-opened the bug. The proposal for a FreeBSD driver fix already is in the bug report itself.
Great. Thank you -- I didn't notice. I'm going to attach the scripts that fixed the problem for me in case they help anyone else for now.
Created attachment 265786 [details] rc.d script to find I226-V adapters and invoke a script to disable ASPM. Put it in /usr/local/etc/rc.d. It runs at boot, finds I226-V adapters and invokes a script to turn off ASPM in the PCIe Link Control register. It also requires the attached aspm_disable script. For machines that have an option in the BIOS, you might prefer the BIOS switch to this script.
Created attachment 265787 [details] Walk the PCIe config space of a device and disable ASPM. Companion to the eth_aspm_disable attachment. Put this script in /usr/local/sbin. It requires the pciutils package for lspci and setpci. The script finds the Link Control register for a PCIe device and clears the two ASPM bits.
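For readers who cannot retrieve the attachments, the core of the approach can be sketched as follows. This is my own minimal reconstruction, not the attached script itself; the BDF selector and helper names are invented for illustration. It relies on setpci's CAP_EXP shortcut to locate the PCI Express capability: the Link Control register sits at offset 0x10 within that capability, and its two low bits form the ASPM Control field.

```shell
#!/bin/sh
# Hypothetical sketch (not the attached script): clear the two ASPM
# bits in the PCIe Link Control register of one device.
# Requires setpci from the pciutils package.
# BDF is a placeholder bus:device.function selector, e.g. "02:00.0".
BDF="$1"

clear_aspm_bits() {
    # Pure helper: given the current 16-bit Link Control value as hex
    # (as printed by setpci), return it with ASPM bits 1:0 cleared.
    printf '%04x\n' $(( 0x$1 & ~0x3 ))
}

disable_aspm() {
    # CAP_EXP+0x10 is the Link Control register within the PCI
    # Express capability; write back only if something changes.
    cur=$(setpci -s "$BDF" CAP_EXP+0x10.w)
    new=$(clear_aspm_bits "$cur")
    [ "$new" = "$cur" ] || setpci -s "$BDF" CAP_EXP+0x10.w="$new"
}

# Example invocation (requires root):
# disable_aspm
```

Note that a device reset or power-state transition may re-enable ASPM, which is presumably why the attached version is run from an rc.d script at every boot.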