Bug 279245 - igc(4) I226 (and I225) TX hangups
Summary: igc(4) I226 (and I225) TX hangups
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2024-05-23 09:12 UTC by Dr. Uwe Meyer-Gruhl
Modified: 2024-05-26 06:05 UTC

Description Dr. Uwe Meyer-Gruhl 2024-05-23 09:12:16 UTC
When using an I226 under OpnSense (FreeBSD 13.2-RELEASE kernel; I also tried FreeBSD 14.0-RELEASE), I experience connection hangups about once per day with no identifiable trigger (the maximum was three within one hour, but I have also gone three days without one).

This problem manifests as a dead connection (no packets are received, none are sent), but the low-level counters (dev.igc.0.mac_stats) still increase.
The condition can be cleared by bringing the interface down and up again or by briefly disconnecting the cable.

There are reports on this and other related problems all over the internet for different OSes, see:

Windows: https://forums.evga.com/PSA-Intel-I226V-25GbE-on-Raptor-Lake-Motherboards-Has-a-Connection-Drop-Issue-No-Fix-m3595279.aspx
OpnSense (FreeBSD): https://forum.opnsense.org/index.php?topic=40404.msg199288#msg199288
pfSense (FreeBSD): https://forum.netgate.com/topic/181571/chinese-i226-v-on-23-05-1-problems

My specific variant is an I226-V, rev.4, built into a Minisforum MS-01:

igc0@pci0:87:0:0:       class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet


However, there are reports of the I226-LM connected to the same machine showing the same behaviour, see: https://forum.opnsense.org/index.php?topic=40556

igc1@pci0:88:0:0:       class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-LM'
    class      = network
    subclass   = ethernet

This seems to indicate that at least the I226 family (which is a successor to the problem-ridden I225 using the same driver module) is affected by this problem.
I tried every setting I could think of to make this go away, such as reducing the speed from 2.5 to 1 Gbps and disabling EEE (which is off by default anyway), to no avail.

Interestingly, the Minisforum MS-01 has gained much interest in the last few months, and there is a review on YouTube where the creator states in a comment that he is not seeing this problem (https://www.youtube.com/watch?v=_wgX1sDab-M). However, he runs OpnSense under a Proxmox hypervisor and thus uses the Linux driver modules (OpnSense itself only sees the virtualized virtio NICs).

This and the reports of gamers stating they had "micro-hangs" manifesting as short lags in online games got me thinking.
So I compared the Linux and FreeBSD drivers and found that the Linux driver has a specific routine to detect, log and clear "TX hang" conditions, see from line 3150 here: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/igc/igc_main.c, which reads:

	if (test_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
		struct igc_hw *hw = &adapter->hw;

		/* Detect a transmit hang in hardware, this serializes the
		* check with the clearing of time_stamp and movement of i
		*/
		clear_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
		if (tx_buffer->next_to_watch &&
		    time_after(jiffies, tx_buffer->time_stamp +
		    (adapter->tx_timeout_factor * HZ)) &&
		    !(rd32(IGC_STATUS) & IGC_STATUS_TXOFF) &&
		    (rd32(IGC_TDH(tx_ring->reg_idx)) != readl(tx_ring->tail)) &&
		    !tx_ring->oper_gate_closed) {
			/* detected Tx unit hang */
			netdev_err(tx_ring->netdev,
				   "Detected Tx Unit Hang\n"
				   "  Tx Queue             <%d>\n"
				   "  TDH                  <%x>\n"
				   "  TDT                  <%x>\n"
				   "  next_to_use          <%x>\n"
				   "  next_to_clean        <%x>\n"
				   "buffer_info[next_to_clean]\n"
				   "  time_stamp           <%lx>\n"
				   "  next_to_watch        <%p>\n"
				   "  jiffies              <%lx>\n"
				   "  desc.status          <%x>\n",
				   tx_ring->queue_index,
				   rd32(IGC_TDH(tx_ring->reg_idx)),
				   readl(tx_ring->tail),
				   tx_ring->next_to_use,
				   tx_ring->next_to_clean,
				   tx_buffer->time_stamp,
				   tx_buffer->next_to_watch,
				   jiffies,
				   tx_buffer->next_to_watch->wb.status);
			netif_stop_subqueue(tx_ring->netdev,
					    tx_ring->queue_index);

			/* we are about to reset, no point in enabling stuff */
			return true;
		}
	}
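
Stripped of the logging, the check above is a five-part predicate. The following is only a minimal, self-contained sketch of that predicate in plain C: `struct tx_snapshot` and `tx_hang_suspected()` are hypothetical names invented for illustration and exist in neither the Linux nor the FreeBSD driver, and the plain comparison ignores the counter wrap-around that `time_after()` handles.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state the Linux check inspects.
 * None of these names exist in either driver. */
struct tx_snapshot {
	bool     work_pending;  /* stands in for tx_buffer->next_to_watch != NULL */
	uint64_t now;           /* current tick count (jiffies in the excerpt) */
	uint64_t time_stamp;    /* stands in for tx_buffer->time_stamp */
	uint64_t timeout_ticks; /* stands in for tx_timeout_factor * HZ */
	bool     tx_paused;     /* stands in for the IGC_STATUS_TXOFF bit */
	uint32_t head;          /* value read from TDH */
	uint32_t tail;          /* value read from TDT */
	bool     gate_closed;   /* stands in for tx_ring->oper_gate_closed */
};

/* Returns true when all five conditions of the excerpt hold.
 * Simplification: no wrap-around handling for the tick counters. */
static bool tx_hang_suspected(const struct tx_snapshot *s)
{
	return s->work_pending &&
	    s->now > s->time_stamp + s->timeout_ticks &&
	    !s->tx_paused &&
	    s->head != s->tail &&
	    !s->gate_closed;
}
```

The point of the sketch is that a hang is only reported when work is outstanding, the timeout has expired, transmission is not paused, and the head and tail registers still differ, so an idle or paused queue does not trigger it.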
    
There is also a routine to reset the adapter:

/**
 * igc_tx_timeout - Respond to a Tx Hang
 * @netdev: network interface device structure
 * @txqueue: queue number that timed out
 **/
static void igc_tx_timeout(struct net_device *netdev,
			   unsigned int __always_unused txqueue)
{
	struct igc_adapter *adapter = netdev_priv(netdev);
	struct igc_hw *hw = &adapter->hw;

	/* Do the reset outside of interrupt context */
	adapter->tx_timeout_count++;
	schedule_work(&adapter->reset_task);
	wr32(IGC_EICS,
	     (adapter->eims_enable_mask & ~adapter->eims_other));
}
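
The routine above does not reset the adapter itself; it counts the event and defers the reset to a worker. A minimal sketch of that "count, flag, reset later" pattern in plain C follows. Everything in it (`struct nic_model`, `on_tx_timeout()`, `run_deferred_work()`) is a hypothetical model for illustration, not driver code, and it leaves out the `IGC_EICS` register write entirely.

```c
#include <stdbool.h>

/* Hypothetical model of the adapter state touched by the timeout path. */
struct nic_model {
	unsigned int tx_timeout_count; /* mirrors adapter->tx_timeout_count */
	bool         reset_pending;    /* stands in for the scheduled reset_task */
	bool         link_usable;      /* false while the queue is hung */
	unsigned int resets_done;      /* number of resets actually performed */
};

/* Timeout handler: only records the event and requests a reset,
 * because the real handler must not reset in interrupt context. */
static void on_tx_timeout(struct nic_model *n)
{
	n->tx_timeout_count++;
	n->reset_pending = true;
}

/* Deferred worker: performs the reset if one was requested.
 * In this model the reset simply marks the link usable again. */
static void run_deferred_work(struct nic_model *n)
{
	if (!n->reset_pending)
		return;
	n->reset_pending = false;
	n->link_usable = true;
	n->resets_done++;
}
```

In this model the reset plays the role that bringing the interface down and up manually plays in the workaround described earlier.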

I did not see anything to this effect in the FreeBSD igc driver module.

Intel themselves do not offer an OEM driver for FreeBSD in their Intel Network Connections 29.1 package.

So, my theory is that there is a hardware idiosyncrasy in this Intel adapter family which sometimes causes packet flow to stop.
This is handled in the Linux driver module by testing if no packets are processed for a short period.
That detection and handling would not be there if there was no problem, so we can take this for a fact.

I suspect that the same handling is contained in the Windows drivers, too - which I cannot ascertain because I cannot look at the source code.
However, this would be in line with the observed "micro-hangs" under Windows from other users.

Alas, under FreeBSD, there is no handling of this condition, which might explain the total packet loss after it occurs.
If it were fixed in FreeBSD, it would be a great benefit for applications like pfSense and OpnSense, since these adapters are currently essentially unusable.
A potential fix would still produce "micro-hangs" once in a while, but this is far better than losing the connection completely.
Comment 1 john 2024-05-25 08:22:01 UTC
A factor in some of the i225 / i226 issues seems to be the NVM version installed in the card (see FreeBSD PR 265714 for a patch to display the version).  My understanding is the i225 NVM is up to 1.94 and the i226 NVM is up to 2.25.

As noted on PR 265714, I've experienced problems similar to yours using an i225 with NVM version 1.79. I have an i226 on order with NVM version 2.17.

It might be interesting to know the NVM version of your i226 NIC.
Comment 2 Dr. Uwe Meyer-Gruhl 2024-05-25 08:45:10 UTC
(In reply to john from comment #1)
Interesting. I thought that there was no firmware for these adapters at all and that the revision was all that separated generations. Do you happen to know if the NIC firmware is contained in the motherboard BIOS or not changeable at all?

Alas, I cannot try the patch and check the version, since I lack the means to build my own kernel.

The reason I am asking is that the first link I referenced says something about Asus having released both a BIOS and a driver update for their Z790 Kingpin boards, which seem to fix the issue. In that case the issue was the micro-hangs, which suggests that not only the full hang but also the underlying hardware problem may have been fixed. If this is true, there may have been a NIC firmware update contained in the new BIOS.

I have found that for my hardware, Minisforum has just released a BIOS update from version 1.17 (late 2023) to 1.22 (03/12/2024). I installed it and have been using it for two days now, so far without a hangup.
Comment 3 john 2024-05-25 16:54:20 UTC
Intel in:

  https://community.intel.com/t5/Ethernet-Products/Intel-Ethernet-Controller-3-I225-V-Connection-Drop/td-p/1482427

directs people to flash their motherboard firmware in order to receive the new NIC NVM version.

Standalone NICs can also be flashed depending on the design.

The general problem is that some OEMs either don't supply updates or are slow to provide them.  For example, IOCrest never responded to my request for an updated i225 NVM.
Comment 4 Dr. Uwe Meyer-Gruhl 2024-05-26 06:05:53 UTC
1. I now know that the OEMs should provide the NVM update and I have requested, but not received it yet. Considering the neglect by OEMs, it is a shame that Intel provides updates for the I225, but not for the I226.

2. I had a hangup once again, so the Minisforum MS-01 BIOS update does nothing with regard to this problem.

3. As for the driver shortcoming addressed with this bug report: Fixing it would still be a band-aid for people like me who have no remedy for the underlying hardware bug.