Bug 280074 - em(4) temporarily hangs and re-enables TX CSUM if a bridge it's a member of is modified
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.1-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL: https://new.reddit.com/r/freebsd/comm...
Keywords: IntelNetworking, regression
Depends on:
Blocks:
 
Reported: 2024-07-01 16:23 UTC by Joshua Kinard
Modified: 2024-07-02 15:38 UTC (History)
Description Joshua Kinard 2024-07-01 16:23:00 UTC
There seems to be a regression in the base system em(4) driver in FreeBSD 14.x: when an em(4) interface is a member of a bridge alongside other interfaces (specifically epair(4)), removing another interface from that bridge causes the em(4) interface to briefly become unresponsive, and at least TX checksum offloading gets turned back on.

On one of my FreeBSD appliances, an Intel NUC8i5BEH, there is a single em(4) interface 'em0' (Intel I219-V CNP(6), devid 0x15be), and this system runs a single jail for squid.  The em0 interface is assigned at system start to a bridge(4) interface 'bridge0', and it has lro, tso, rxcsum, rxcsum6, txcsum, txcsum6, and all vlan flags disabled.  The 'ifconfig em0' output looks like this (some data is redacted for privacy):

> em0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 9000
>         options=88<VLAN_MTU,VLAN_HWCSUM>
>         ether aa:bb:cc:dd:ee:ff
>         inet x.x.x.x netmask 0xffffff00 broadcast x.x.x.255
>         inet6 fe80::xxxx:xxxx:xxxx:xxxx%em0 prefixlen 64 scopeid 0x1
>         media: Ethernet autoselect (1000baseT <full-duplex>)
>         status: active
>         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
If, from the above initial state, I start the squid jail manually, everything works fine, and the SSH connection remains responsive while the jail init routines are run, including adding the jail's epair(4) interface, 'epair0a', to the bridge0 membership.

It is when shutting that jail down and removing epair0a from the bridge0 membership that I notice that the SSH connection will become unresponsive for ~10s or more.  Most of the time, it will recover, but sometimes, the connection will get dropped and I have to reconnect.  I also notice that ONLY txcsum and txcsum6 get re-enabled on the em0 interface, even though ONLY the epair0a interface gets removed from the bridge membership.  Here's what 'ifconfig em0' looks like after the fact:

> em0: flags=1008943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 9000
>         options=4c00022<TXCSUM,JUMBO_MTU,TXCSUM_IPV6,HWSTATS,MEXTPG>
>         ether aa:bb:cc:dd:ee:ff
>         inet x.x.x.x netmask 0xffffff00 broadcast x.x.x.255
>         media: Ethernet autoselect (1000baseT <full-duplex>)
>         status: active
>         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
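
To make the change between the two outputs easier to see, here is a minimal sh sketch that compares the capability names in two 'options=' lines. The function names caps and caps_diff are hypothetical helpers made up for illustration, not part of ifconfig(8), and the sketch only compares the names printed between the angle brackets:

```sh
#!/bin/sh
# Hypothetical helper: print the capability names from an ifconfig
# "options=<hex><A,B,C>" line, one per line, in their original order.
caps() {
    printf '%s\n' "$1" |
        sed -n 's/.*options=[0-9a-f]*<\(.*\)>.*/\1/p' |
        tr ',' '\n'
}

# Hypothetical helper: print "+FLAG" for names only in the second line
# and "-FLAG" for names only in the first line.
caps_diff() {
    _old=$(caps "$1")
    _new=$(caps "$2")
    for _f in $_new; do
        printf '%s\n' "$_old" | grep -qxF -- "$_f" || echo "+$_f"
    done
    for _f in $_old; do
        printf '%s\n' "$_new" | grep -qxF -- "$_f" || echo "-$_f"
    done
}
```

Given the two 'options=' lines quoted above, this reports TXCSUM, JUMBO_MTU, TXCSUM_IPV6, HWSTATS, and MEXTPG as added, and VLAN_MTU and VLAN_HWCSUM as removed.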
I have manually confirmed that the interface becomes unresponsive and txcsum/txcsum6 are re-enabled when issuing this command:
> ifconfig bridge0 deletem epair0a
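
For anyone who wants to try reproducing this, a minimal sh sketch along these lines could record the 'options=' line of the watched interface before and after the deletem. The function names options_line and check_deletem are hypothetical, the sketch assumes it is run as root on FreeBSD, and it does not try to measure the hang itself:

```sh
#!/bin/sh
# Hypothetical helper: print only the interface-level "options=" line
# (the "nd6 options=" line does not start with "options=" and is skipped).
options_line() {
    ifconfig "$1" | sed -n 's/^[[:space:]]*\(options=.*\)/\1/p'
}

# Hypothetical helper: remove MEMBER from BRIDGE, then report whether the
# options line of WATCHED changed.
# Returns 0 if unchanged, 1 if changed, 2 if deletem itself failed.
check_deletem() {
    _bridge=$1
    _member=$2
    _watched=$3
    _pre=$(options_line "$_watched")
    ifconfig "$_bridge" deletem "$_member" || return 2
    _post=$(options_line "$_watched")
    echo "before: $_pre"
    echo "after:  $_post"
    [ "$_pre" = "$_post" ]
}
```

With the names from this report, the call would be 'check_deletem bridge0 epair0a em0'.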
I'm not sure why modifying the bridge and removing an epair(4) member affects the em(4) member.  I think it may be doing a short HW reset of the interface or something, and that's why TX checksum offloading gets turned back on.  I am not sure why the other flags flip state as well, especially JUMBO_MTU (system MTU is set to 9000 at boot, so ::shrug::).

I also compiled net/intel-em-kmod from ports and rebooted the system so it'd load that driver instead of the base system em(4) driver.  Using that module, I can add the jail's epair(4) interface to bridge0 and remove it again without consequence.  SSH remains responsive in both cases, and checksum offloading stays disabled.  So it seems like it's some kind of an issue with the base system driver only.  That driver uses iflib(4) now, while the out-of-tree driver doesn't, so it could be something related to iflib causing the problem?

I don't know how long this issue may have existed.  Prior to FreeBSD 14.0-RELEASE, I used net/intel-em-kmod almost exclusively on this system, and only switched to the base system driver when upgrading to 14.0, since that driver will be more up-to-date now that Intel itself no longer develops the out-of-tree driver.  This is when I started noticing this issue on this system.  I also noticed the issue on another appliance I have that used the em(4) driver and ran a jail, but had an Intel 82583V chipset instead, so I don't think it's a hardware erratum issue w/ the I219-V.  That other system has since been upgraded to something that uses igb(4) now, and no longer has this problem.

In any event, I am going to look at replacing this NUC appliance (it's almost 5 years old) with something newer that uses better GbE hardware.  If someone can figure out possible fixes to em(4) before then, I can try them out, but no promises once I update the affected hardware to something else.
Comment 1 Joshua Kinard 2024-07-02 15:38:41 UTC
This was also discussed on Reddit in r/FreeBSD a few months ago:
https://new.reddit.com/r/freebsd/comments/1bfvof2/em0_disconnects_when_added_toremoved_from_bridge/

Consensus there was that the problem was "solved" by disabling RX/TX checksum offloading on the interface, but I don't know if anyone noticed that modifying the bridge membership was re-enabling txcsum/txcsum6 on em(4).
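
As a stopgap only, and not a fix for the underlying problem, something like the following sh sketch could be run after each bridge membership change to turn TX checksum offloading back off if it has reappeared. The function names txcsum_enabled and fix_txcsum are hypothetical, and the sketch assumes root on FreeBSD:

```sh
#!/bin/sh
# Hypothetical helper: return 0 if the interface currently lists TXCSUM
# or TXCSUM_IPV6 in its "options=" line.
txcsum_enabled() {
    ifconfig "$1" |
        sed -n 's/^[[:space:]]*options=[0-9a-f]*<\(.*\)>.*/\1/p' |
        tr ',' '\n' |
        grep -qE '^TXCSUM(_IPV6)?$'
}

# Hypothetical helper: disable TX checksum offloading again, but only
# if it is currently enabled.
fix_txcsum() {
    if txcsum_enabled "$1"; then
        ifconfig "$1" -txcsum -txcsum6
    fi
}
```

For the system in this report, that would be 'fix_txcsum em0' after the 'ifconfig bridge0 deletem epair0a' step.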