Bug 221122

Summary:	Attaching interface to a bridge stops all traffic on uplink NIC for a few seconds
Product:	Base System	Reporter:	Heinz N. Gies <heinz>
Component:	kern	Assignee:	freebsd-net (Nobody) <net>
Status:	Closed FIXED
Severity:	Affects Many People	CC:	adrian, bz, dereks, eugen, julian, kbowling, kevans, mason, mav, olevole, spork, zarychtam
Priority:	---
Version:	11.1-RELEASE
Hardware:	amd64
OS:	Any

Description Heinz N. Gies 2017-07-31 13:57:20 UTC

Running the following test case causes bridge0 to become unresponsive (this includes other interfaces on the bridge):

while true
do
   PAIR=`ifconfig epair create | sed 's/a\$//'`
   ifconfig bridge0 addm ${PAIR}a
   jail -i -c name=crash persist vnet=new vnet.interface=${PAIR}b exec.start="/sbin/ifconfig ${PAIR}b name net0p"
   jail -r crash
   ifconfig ${PAIR}a destroy
done


The test setup:

                               ┌───────────────────────────────────┐
                               │             BSD Box               │
                  ┌──────┐     │     ┌──────────────┬─────────────┐│
┌───────────┐     │      │     │     │      em0     │             ││
│ ping host │────▶│switch│─────┼───▶│ 192.168.1.22 │             ││
└───────────┘     │      │     │     └──────────────┤             ││
                  └──────┘     │                    │             ││
                               │                    │   bridge0   ││
                               │                    │             ││
                               │                    │             ││
                               │                    │             ││
                               │                    │             ││
                               │                    └─────────────┘│
                               └───────────────────────────────────┘

The kernel is the 11.1-RELEASE kernel with the following patch applied: https://reviews.freebsd.org/D11782

The compiling config is:

include GENERIC
ident FIFOKERNEL

nooptions       SCTP   # Stream Control Transmission Protocol
options         VIMAGE # VNET/Vimage support
options         RACCT  # Resource containers
options         RCTL   # same as above


The system is a Supermicro X9SCL/X9SCM with an Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz CPU and an Intel 1G network card.

Comment 1 Alexander Motin freebsd_committer

2017-10-16 10:16:18 UTC

I've tried to reproduce this, and all I see is an uplink interface flap lasting several seconds, caused by the bridge needing to disable/restore interface offload flags.  After the NIC reinitializes the link, operation is restored.  Have I reproduced your issue, or do you mean something different?

Comment 2 Heinz N. Gies 2017-10-16 10:54:34 UTC

Hi, first of all, thanks for looking into this! It does sound like an explanation for what I'm seeing. I sadly know little about the internals of the network stack, but the symptoms seem to fit. Adding an interface leads to a reproducible drop of connectivity/delay for a few seconds.

Comment 3 Alexander Motin freebsd_committer

2017-10-16 11:00:18 UTC

Then I tend to say that it behaves correctly, even though not very nicely.  If you wish to avoid the flaps on bridge reconfiguration, you may explicitly disable some capabilities of the uplink interface before bridge configuration, so that the bridge does not modify them later on epair interface addition/removal.

Comment 4 Heinz N. Gies 2017-10-16 11:24:23 UTC

I understand that it acts as implemented, i.e. it is not a code bug. Before we close this I'd like to make the case that it is not working as intended but rather working as accepted.

The VNET system is rather new in FreeBSD; bridges, on the other hand, have existed for a lot longer.

Historically bridges were used in a rather static manner, to bridge physical interfaces (they don't change often), or to bridge between physical interfaces and tunnels or other virtual but likewise rather static interfaces.

This kind of use is often a one-time configuration that happens on system startup or, in the case of tunnels, on an incredibly rare basis. At those times the loss of connectivity for a few seconds either has no impact (during startup), or the impact is negligible (e.g. when adding tunnel interfaces, as no one is connected to a nonexistent interface anyway).

I suspect that when the decision was made to implement it this way, all of that was taken into consideration and (rightfully so) it wasn't worth the work of finding an alternative, as it was working well enough for its use.

VNET, and more so VNET jails, change things a bit: they make network configuration more dynamic. It becomes necessary to add and remove interfaces on a bridge dynamically - something that I suspect wasn't foreseen.

Features do not exist in a void, they exist in relation to their environment. The environment for bridges changed, and while the behavior was fine before, it becomes problematic in this changed environment.

I agree it's not a 'bug' in the bridge driver. But we cannot look at a single component in isolation, and on a system level I'm sure that 'starting/stopping a vnet jail means all other vnet jails lose connectivity' is not intended behavior.

Comment 5 Alexander Motin freebsd_committer

2017-10-16 11:32:02 UTC

OK, we can call it any way you like, but it does not change the facts: to be able to bridge interfaces with different hardware capabilities, some of those capabilities have to be disabled, and changing capabilities on Intel NICs ends up in a NIC reinit, which takes time and is invasive.  Before this was introduced, bridging was just not working correctly in a number of scenarios, including VNET jails, especially for modern NICs with more offload capabilities.  If somebody sees an alternative way to handle that -- be my guest.

Comment 6 Eugene Grosbein freebsd_committer

2017-10-16 11:40:00 UTC

(In reply to Heinz N. Gies from comment #4)

Addition of the first member to the bridge is quite different from addition of others. Why do you think it interferes with traffic flow every time?

Also, you did not show your actions (commands) and have not been quite specific in describing what ill effects those actions bring thereafter.

Comment 7 Heinz N. Gies 2017-10-16 11:55:55 UTC

(In reply to Eugene Grosbein from comment #6)

> Addition of first member to the bridge is quite different from addition of others. Why do you think it interferes with traffic flow every time?

Mostly because I could not find any documentation regarding this, so all I had to go by was what I observed, and it never occurred to me to try a second or third interface after seeing the problem with the first.

The actions/commands are in the initial bug report, along with a diagram of the setup and hardware specifications.

The ill effect is losing network connectivity for a few seconds; for a server that can be quite problematic.

Perhaps I'm approaching this all wrong and trying to squeeze a square peg through a round hole. Are bridge/epairs the wrong tools for vnet jails? Is there a better alternative?

Comment 8 Alexander Motin freebsd_committer

2017-10-16 12:06:53 UTC

(In reply to Heinz N. Gies from comment #7)
Bridge+epair are the right tools, unless you wish to dedicate one NIC completely to specific VNET Jail.

I've already told you how to work around the problem:  when configuring the uplink interface, you can explicitly disable the capabilities that the bridge otherwise tries to disable (TSO, LRO, TOE, TXCSUM, TXCSUM6).  In that case the bridge should be happy from the beginning and not modify capabilities any more.
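
A minimal sketch of that workaround as an /etc/rc.conf entry, assuming the uplink is em0 with the address from the original report. The switches are standard ifconfig(8) capability flags; which ones a given NIC actually needs is an assumption to verify against its own ifconfig output. TOE is left out here because the em0 output in this report does not list it.

```sh
# /etc/rc.conf (sketch, not a tested configuration)
# Disable up front the offloads that bridge(4) would otherwise strip from the
# uplink when a less capable member such as an epair is added.
ifconfig_em0="inet 192.168.1.22 netmask 0xffffff00 -tso -lro -txcsum -txcsum6"
```

With the capabilities already off at boot, adding a member later should leave the uplink's options unchanged, so the driver has no reason to reinitialize the NIC.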

Comment 9 commit-hook freebsd_committer

2017-10-16 12:33:54 UTC

A commit references this bug:

Author: mav
Date: Mon Oct 16 12:32:57 UTC 2017
New revision: 324659
URL: https://svnweb.freebsd.org/changeset/base/324659

Log:
  Update details of interface capabilities changed by bridge(4).

  PR:		221122
  MFC after:	1 week

Changes:
  head/share/man/man4/bridge.4

Comment 10 Eugene Grosbein freebsd_committer

2017-10-16 12:38:58 UTC

(In reply to Heinz N. Gies from comment #7)

Please repeat your tests more thoroughly:

1. Verify whether you still have the problem when adding a second and subsequent bridge members after the uplink interface has already been added as the first bridge member.

2. Compare the output of ifconfig $uplink before and after it is added to the bridge. Then destroy the bridge and use ifconfig on the uplink to disable the features that the bridge disables automatically. Then recreate the bridge and verify whether adding the uplink as the first bridge member still leads to an uplink reset.

Comment 11 Heinz N. Gies 2017-10-16 12:41:07 UTC

Yes, I read that, and I've been going through the man pages trying to figure out which those are; is there a list of settings supported by epairs? I just saw the updated info in bridge(4); I think that's what I was looking for.

I was worried that the delta (RXCSUM, TXCSUM, TSO4) is not exhaustive - and it seems it wasn't.

Weeding through ifconfig(8), will LRO also be affected?

I'm not trying to be dense. I've spent quite some time building tooling around jails and am trying to understand this well enough to write up the steps for someone (like me) who doesn't know how bridges are implemented, so they can get things working in a way that can be used in a production environment without unpleasant surprises.

Comment 12 Eugene Grosbein freebsd_committer

2017-10-16 12:42:44 UTC

(In reply to Heinz N. Gies from comment #7)

> 2. Compare output of ifconfig $uplink before and after it added to the bridge.

... after it AND other members added to the bridge.

Comment 13 Heinz N. Gies 2017-10-16 12:49:18 UTC

(In reply to Eugene Grosbein from comment #12)


ifconfig em0 (no bridge interfaces)
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
	ether 00:25:90:a6:3b:c7
	hwaddr 00:25:90:a6:3b:c7
	inet 192.168.1.22 netmask 0xffffff00 broadcast 192.168.1.255
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active

adding first bridge interface:

64 bytes from 192.168.1.22: icmp_seq=22 ttl=64 time=1.325 ms
Request timeout for icmp_seq 23
Request timeout for icmp_seq 24
Request timeout for icmp_seq 25
Request timeout for icmp_seq 26
Request timeout for icmp_seq 27
Request timeout for icmp_seq 28
Request timeout for icmp_seq 29
64 bytes from 192.168.1.22: icmp_seq=30 ttl=64 time=1.261 ms

ifconfig em0 (after adding bridge interface) 
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=42098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO>
	ether 00:25:90:a6:3b:c7
	hwaddr 00:25:90:a6:3b:c7
	inet 192.168.1.22 netmask 0xffffff00 broadcast 192.168.1.255
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active


adding second interface:

64 bytes from 192.168.1.22: icmp_seq=132 ttl=64 time=1.432 ms
64 bytes from 192.168.1.22: icmp_seq=133 ttl=64 time=1.332 ms
64 bytes from 192.168.1.22: icmp_seq=134 ttl=64 time=1.146 ms

(no drops)
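
The two options= lines above can be compared mechanically to see which capabilities the bridge removed from em0. The following is a small sketch in plain sh; the function names caps_of and caps_removed are hypothetical helpers, not part of any FreeBSD tool, and the input strings are copied from the ifconfig output in this comment.

```sh
#!/bin/sh
# Print the capability names from an ifconfig "options=" line, one per line.
caps_of() {
    printf '%s\n' "$1" |
        sed -n 's/.*options=[0-9a-f]*<\(.*\)>.*/\1/p' |
        tr ',' '\n'
}

# Print capabilities present in the first options line but absent in the second.
caps_removed() {
    old_caps=$(caps_of "$1")
    new_caps=$(caps_of "$2")
    for cap in $old_caps; do
        # -x matches the whole line, so TSO4 does not match VLAN_HWTSO
        printf '%s\n' "$new_caps" | grep -qxF "$cap" || printf '%s\n' "$cap"
    done
}

opts_before='options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>'
opts_after='options=42098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO>'

caps_removed "$opts_before" "$opts_after"
```

For the two lines shown, this prints RXCSUM, TXCSUM and TSO4, which is the set of capabilities to consider disabling on the uplink ahead of time.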

Comment 14 Eugene Grosbein freebsd_committer

2017-10-17 08:05:43 UTC

(In reply to Heinz N. Gies from comment #13)

Have you tried using /etc/rc.conf to disable the em0 features that get disabled by the bridge anyway? Then create the bridge and add members to it, to confirm that traffic is not affected this way.

Comment 15 Heinz N. Gies 2017-10-17 16:17:17 UTC

(In reply to Eugene Grosbein from comment #14)

Yes, I did remove the features Alexander recommended, and that did solve the downtime issue. He also submitted a patch to document the behavior. While it isn't ideal, we don't live in a perfect world, and having it documented is probably as good as it gets.

Comment 16 Julian Elischer freebsd_committer

2017-10-18 05:49:17 UTC

The earlier comment that epair and bridge were the way to go was correct but incomplete.  You can also use netgraph to plumb the jails (this was how vimage was originally done). See the examples in /usr/share/examples.

Comment 17 commit-hook freebsd_committer

2017-10-23 07:39:35 UTC

A commit references this bug:

Author: mav
Date: Mon Oct 23 07:39:05 UTC 2017
New revision: 324908
URL: https://svnweb.freebsd.org/changeset/base/324908

Log:
  MFC r324659: Update details of interface capabilities changed by bridge(4).

  PR:		221122

Changes:
_U  stable/11/
  stable/11/share/man/man4/bridge.4

Comment 18 Mason Loring Bliss freebsd_triage

2021-12-07 03:46:54 UTC

Re-opening as this is still an issue, and shouldn't be. Details to follow.

Comment 19 Mason Loring Bliss freebsd_triage

2021-12-07 03:52:06 UTC

I saw this issue on FreeBSD 13.0-RELEASE, and following kbowling's
recommendation, also tried the most recent 13-STABLE images. This latter
is where I've gathered data.

Same issue: Add an epair half to a bridge and things go away for several
seconds. The delay is quite possibly longer in -STABLE but I might be
imagining it. Either way, documented below. Note that on literally the
same hardware, the same operations cause no delay under Debian Bullseye:
Have a bridge, add a vnet device to it, and everything keeps flowing
without interruption, which is useful since these boxes are hypervisors
and running a variety of generally network-oriented tasks.

# freebsd-version -ku ; uname -a
13.0-STABLE
13.0-STABLE
FreeBSD amazon.int.blisses.org 13.0-STABLE FreeBSD 13.0-STABLE #0 stable/13-n248302-2cd26a286a9: Thu Dec  2 02:40:58 UTC 2021 root@releng3.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
# dmesg | tail
uhid0 on uhub3
uhid0: <Logitech USB Keyboard, class 0/0, rev 1.10/79.00, addr 3> on usbus1
Security policy loaded: MAC/ntpd (mac_ntpd)
epair0a: Ethernet address: 02:52:8f:32:b1:0a
epair0b: Ethernet address: 02:52:8f:32:b1:0b
epair0a: link state changed to UP
epair0b: link state changed to UP
igb0: link state changed to DOWN
epair0a: promiscuous mode enabled
igb0: link state changed to UP
# dmesg | grep igb0
igb0: <Intel(R) I350 (Copper)> mem 0xfb720000-0xfb73ffff,0xfb7c4000-0xfb7c7fff irq 40 at device 0.0 on pci3
igb0: EEPROM V0.93-0 eTrack 0x800006b2
igb0: Using 1024 TX descriptors and 1024 RX descriptors
igb0: Using 6 RX queues 6 TX queues
igb0: Using MSI-X interrupts with 7 vectors
igb0: Ethernet address: 00:25:90:a6:a5:60
igb0: netmap queues/slots: TX 6/1024, RX 6/1024
igb0: promiscuous mode enabled
igb0: link state changed to UP
igb0: link state changed to DOWN
igb0: link state changed to UP

/home/mason$ date ; ssh root@amazon ifconfig bridge0 addm epair0a ; date
Mon 06 Dec 2021 10:36:37 PM EST
Mon 06 Dec 2021 10:36:41 PM EST
/home/mason$ date ; ssh root@amazon ifconfig bridge0 deletem epair0a ; date
Mon 06 Dec 2021 10:37:00 PM EST
Mon 06 Dec 2021 10:37:05 PM EST
/home/mason$ date ; ssh root@amazon date ; date
Mon 06 Dec 2021 10:38:14 PM EST
Mon Dec  6 22:38:14 EST 2021
Mon 06 Dec 2021 10:38:14 PM EST

Comment 20 Mason Loring Bliss freebsd_triage

2021-12-08 04:11:47 UTC

The bridge is set up per:

    https://wiki.freebsd.org/MasonLoringBliss/JailsEpair

...albeit with igb0 rather than em0 in this case.

So:

cloned_interfaces="bridge0"
ifconfig_bridge0="inet 10.0.0.2 netmask 0xffffff00 addm igb0"
ifconfig_igb0="up"

Comment 21 Alexander Motin freebsd_committer

2021-12-08 04:18:25 UTC

Mason,  I don't see how this can be fixed without either significantly complicating the bridge driver to handle TSO/LRO/etc offload in software, or making the Intel drivers somehow avoid a chip reset on interface capability changes (if that is even possible).

In TrueNAS we've worked around this problem by adding a UI checkbox to preemptively disable interface offload on boot.  Done early, it avoids the interface flap later.

Comment 22 Mason Loring Bliss freebsd_triage

2021-12-08 04:35:25 UTC

Linux manages the trick on this same box, so the hardware can manage it 
unless there's some critical difference I'm missing. I'd be happy to 
explore from either side to shed more light on it. And sure, I can change 
my model a bit and add a pool of epairs at boot and assign them 
programmatically instead of using per-jail numbering and dynamic spin-up as
I do today, but my interest in this started when I realized the delay was
there in FreeBSD. Seems unfortunate for FreeBSD's handling to be less 
capable than what Linux can do. That said, if I'm missing some concept that
is different and matters, I'm eager to learn about it.

Comment 23 Eugene Grosbein freebsd_committer

2021-12-08 06:06:33 UTC

(In reply to Mason Loring Bliss from comment #22)

Please provide output of "ifconfig igb0" and "ifconfig bridge0" just after boot when bridge has only single igb0 member.

Then, attach new epair to the bridge as you generally do and show output of ifconfig again for igb0, bridge0 and epair at host.

I'm sure the problem should be solved by replacing ifconfig_igb0="up" with a setting that also disables the offloads not supported by epair.

Comment 24 Marek Zarychta 2021-12-08 08:20:08 UTC

At least one epair(4) can be created and added to the bridge early, so that the member capabilities are reconciled at boot:

ifconfig_oce3="up mtu 9000"
cloned_interfaces="bridge0 epair0 ..."
create_args_epair0="mtu 9000 up"
ifconfig_bridge0="addm oce3 addm epair0a ..."

The bridge will then not suffer when epair(4) interfaces are added later.

Comment 25 Eugene Grosbein freebsd_committer

2021-12-08 09:12:19 UTC

(In reply to Marek Zarychta from comment #24)

Capabilities can be changed at bridge creation time, so the first epair won't be affected, either. That's why I've asked for ifconfig output to be exact.

Comment 26 Mason Loring Bliss freebsd_triage

2022-03-12 21:23:47 UTC

I just ran into this again today and that reminded me of the bug.

This is from a different box, but the symptoms are the same:

# ifconfig
em0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4812099<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER,NOMAP>
        ether elided
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether elided
        inet elided.2 netmask 0xffffff00 broadcast elided.255
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: epair43a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 4 priority 128 path cost 2000
        member: em0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 1 priority 128 path cost 2000000
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
epair43a: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500 
        options=8<VLAN_MTU>
        ether elided
        groups: epair
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

Comment 27 Mason Loring Bliss freebsd_triage

2022-03-13 03:19:02 UTC

Marek,

I just realized that your comment #24 applies to my situation. The first 
epair added incurs the penalty, but further epairs do not. I think this 
will be sufficient for my purposes. Thank you!

Comment 28 Eugene Grosbein freebsd_committer

2022-06-27 22:32:41 UTC

Should be fixed in stable/12 and stable/13 now.

Comment 29 Mason Loring Bliss freebsd_triage

2022-06-27 22:52:05 UTC

If you could include a link to commits that'd be appreciated!

Thank you.

Comment 30 Derek Schrock 2023-04-20 23:33:42 UTC

(In reply to Eugene Grosbein from comment #28)
Do you have a link to this commit?

Comment 31 spork 2023-08-31 03:50:49 UTC

I burned a few hours on this last night, first thinking something was amiss with iocage (fair assumption, as it seems to be another abandoned project). Then while troubleshooting, I started running the bridge creation and interface additions by hand and noticed my prompt was hanging for a few seconds. Then I found the link flaps in the logs:

Aug 29 20:42:56 clweb5 kernel: ext0: link state changed to DOWN
Aug 29 20:43:01 clweb5 kernel: ext0: Link is up, 1 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
Aug 29 20:43:01 clweb5 kernel: ext0: link state changed to UP
Aug 29 20:45:53 clweb5 kernel: ext0: link state changed to DOWN
Aug 29 20:45:57 clweb5 kernel: ext0: Link is up, 1 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
Aug 29 20:45:57 clweb5 kernel: ext0: link state changed to UP
Aug 29 20:48:10 clweb5 kernel: ext0: link state changed to DOWN
Aug 29 20:48:15 clweb5 kernel: ext0: Link is up, 1 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
Aug 29 20:48:15 clweb5 kernel: ext0: link state changed to UP

Seems to take about 5 seconds for it to recover, which is kind of rough on a box that will be hosting multiple jails.
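
The outage length for each flap can be read off the log lines above mechanically. This is a rough sketch, assuming syslog-style lines with the time in the third field and DOWN/UP events that occur on the same day; flap_durations is a hypothetical helper name, and the sample input is an excerpt of the DOWN/UP lines quoted above.

```sh
#!/bin/sh
# Print the seconds between each "DOWN" event and the following "UP" event.
# Assumes HH:MM:SS in field 3 and no midnight rollover between the two events.
flap_durations() {
    awk '
        function secs(hms,    t) {
            split(hms, t, ":")
            return t[1] * 3600 + t[2] * 60 + t[3]
        }
        /link state changed to DOWN/          { down = secs($3); pending = 1 }
        /link state changed to UP/ && pending { print secs($3) - down; pending = 0 }
    '
}

flap_durations <<'EOF'
Aug 29 20:42:56 clweb5 kernel: ext0: link state changed to DOWN
Aug 29 20:43:01 clweb5 kernel: ext0: link state changed to UP
Aug 29 20:45:53 clweb5 kernel: ext0: link state changed to DOWN
Aug 29 20:45:57 clweb5 kernel: ext0: link state changed to UP
Aug 29 20:48:10 clweb5 kernel: ext0: link state changed to DOWN
Aug 29 20:48:15 clweb5 kernel: ext0: link state changed to UP
EOF
```

For these three flaps the script prints 5, 4 and 5. On a live host the same function could read the system log instead, for example with `flap_durations < /var/log/messages`, assuming link events are logged there.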

I understand there were workarounds posted, but I'm curious about the fix mentioned here and under what conditions this should not happen?

NICs are ixl(4)
OS is: 13.2-RELEASE-p2 FreeBSD 13.2-RELEASE-p2 GENERIC amd64

I did dig through the manpage for if_bridge(4), and I'm sure I saw the note about matching capabilities, but it didn't really jump out as a cause. Maybe a note that specifically calls out the most common use case (bridging with epair(4) for jails, bhyve or other virtualization methods) would be a good idea? Or even something in epair(4)'s manpage?

Comment 32 spork 2023-08-31 04:00:16 UTC

(In reply to spork from comment #31)

Sorry forgot to show my diffs for the interface options between bridged and not-bridged:

[root@clweb5 /home/spork]# diff /tmp/options-ixl-nobridge /tmp/options-ixl-bridge
1c1
< options=4e503bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
---
> options=4a500b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,NOMAP>
3c3
< options=4e503bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
---
> options=4a500b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,NOMAP>

Comment 33 spork 2023-08-31 20:18:15 UTC

Some additional testing here...

There are two workarounds presented in this thread:

- Add "-txcsum -tso4 -tso6 -txcsum6" (or whatever your NIC requires) to the ifconfig statement for your interface(s) in rc.conf. This requires knowing what you need to disable to make sure your NIC and epair have equal capabilities so that when the epair interface is added to the bridge, there's no need to reinit the NIC to make the capabilities match, and therefore, no connectivity loss.

- Pre-plumb the bridge and epair interfaces by adding them to rc.conf's cloned_interfaces and add the epair to the "addm" ifconfig line. On boot, the "addm" runs and we don't care about the reinit of the NIC because it's during boot. This method does not require knowing what capabilities need to be disabled on the NIC.
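
For the first workaround, the list of capabilities that disappear when the NIC joins the bridge has to be turned into ifconfig(8) switches. A small sketch of that translation follows; caps_to_flags is a hypothetical helper, and the name mapping covers only the common checksum, TSO and LRO capabilities, so anything else is reported rather than guessed.

```sh
#!/bin/sh
# Translate ifconfig capability names into the switches that disable them.
# The mapping is an assumption limited to common offload capabilities.
caps_to_flags() {
    flags=""
    for cap in "$@"; do
        case "$cap" in
            RXCSUM)      flag="-rxcsum" ;;
            TXCSUM)      flag="-txcsum" ;;
            TSO4)        flag="-tso4" ;;
            TSO6)        flag="-tso6" ;;
            LRO)         flag="-lro" ;;
            RXCSUM_IPV6) flag="-rxcsum6" ;;
            TXCSUM_IPV6) flag="-txcsum6" ;;
            *)           echo "unmapped capability: $cap" >&2; continue ;;
        esac
        flags="${flags:+$flags }$flag"
    done
    printf '%s\n' "$flags"
}

# Capabilities that differ between the bridged and non-bridged ixl options
caps_to_flags TXCSUM TSO4 TSO6 TXCSUM_IPV6
```

The output for the ixl case is `-txcsum -tso4 -tso6 -txcsum6`, which is the string that would be appended to the interface's ifconfig line in rc.conf.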

I'm finding that neither of these actually works as a workaround: on 13.2 with my ixl NICs, both iocage (a jail shutdown or restart) and manual ifconfig commands (removing a vtnet interface from a bridge) cause the NIC to reinit. In other words, removing an epair/vtnet interface from a bridge seems to put the offloading capabilities back in place, rendering either workaround useless.

Again, I'm not clear on what the fix was that was mentioned in comment #28, so if I'm way off base here, let me know!

Example follows...

We have a bridge containing my external ixl interface and an epair/vtnet interface from a jail:

[root@clweb5 /home/spork]# ifconfig bridge0
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether 58:9c:fc:10:ff:d9
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: vnet0.10 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 7 priority 128 path cost 2000
	member: ext0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 1 priority 128 path cost 55
	groups: bridge
	nd6 options=9<PERFORMNUD,IFDISABLED>

The ext0 (ixl) interface was already a member of the bridge when the jail started, so there was NO NIC reinit/loss of connectivity when the jail started (good!).

ext0 options look like this while a member of bridge0 (i.e. txcsum and tso for v4 and v6 are disabled):

ext0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4a500b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,NOMAP>

Now I manually pull vnet0.10 from the above bridge:

[root@clweb5 /home/spork]# ifconfig bridge0 deletem vnet0.10

And we see connectivity drop for 5 seconds:

Aug 31 15:32:57 clweb5 kernel: vnet0.10: promiscuous mode disabled
Aug 31 15:32:57 clweb5 kernel: ext0: link state changed to DOWN
Aug 31 15:33:02 clweb5 kernel: ext0: Link is up, 1 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
Aug 31 15:33:02 clweb5 kernel: ext0: link state changed to UP

And we see why - removing the vtnet bridge member causes something(?) to put all the flags I'd removed from ext0 back in place (txcsum, txcsum6, tso4, tso6):

[root@clweb5 /home/spork]# ifconfig ext0
ext0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4e503bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>

Again, this is me manually removing the interface from the bridge, not iocage.

Standard jails and iocage jails both call a "destroy" on the vtnet/epair interface, so this isn't just an iocage issue.

Sorry this is so long... anyhow the questions again:

- Did the prior workarounds "work" and then stop working later?
- Did the behavior of bringing explicitly-removed flags back to an interface when members are removed from a bridge change at some point?
- What was the fix in comment #28?

Comment 34 spork 2023-09-01 02:40:09 UTC

The answer to "when did interface capabilities get restored when a member is removed" is "back in 2008".

This commit altered how interface flags were dealt with: 

https://cgit.freebsd.org/src/commit/sys/net/if_bridge.c?id=ec29c623005ca6a32d44fb59bc2a759a96dc75e4

You can see a variable "bif_savedcaps" was added so that the bridge now tracks what the original interface flags were.

Then when a member is removed, it looks like all of a bridge's interfaces are looped through and the original flags are restored (in bridge_delete_member()):

+		/* reneable any interface capabilities */
+		bridge_set_ifcap(sc, bif, bif->bif_savedcaps);

Not sure where, but this kind of feels like it could be a tunable, like "net.link.bridge.restore_caps" or similar, given that a) jails will trigger this with lots of NICs, b) these days 5 seconds of downtime is actually not a minor issue in many environments, c) it need not change any defaults, but rc.d/jail and 3rd party jail scripts could opt to set it, and d) jails are kind of a big reason people come to FreeBSD.

I'm not much of a coder, but I could get that sysctl like 80% there I think after looking at the other "net.link.bridge" tunables... any takers on helping? Any thoughts on whether this makes sense?

Comment 35 spork 2023-09-01 23:07:01 UTC

OK, really done for now... :)

I'm trying this out for a bit.

[root@clweb5 /usr/src/sys/net]# diff -u if_bridge.c.dist if_bridge.c.caps
--- if_bridge.c.dist	2023-08-31 22:47:16.758453000 -0400
+++ if_bridge.c.caps	2023-09-01 19:05:41.724323000 -0400
@@ -452,6 +452,13 @@
     CTLFLAG_RWTUN | CTLFLAG_VNET, &VNET_NAME(log_stp), 0,
     "Log STP state changes");

+/* restore member if capabilities */
+VNET_DEFINE_STATIC(int, restore_caps) = 1;
+#define	V_restore_caps	VNET(restore_caps)
+SYSCTL_INT(_net_link_bridge, OID_AUTO, restore_caps,
+    CTLFLAG_RWTUN | CTLFLAG_VNET, &VNET_NAME(restore_caps), 0,
+    "Restore member interface flags on reinit");
+
 /* share MAC with first bridge member */
 VNET_DEFINE_STATIC(int, bridge_inherit_mac);
 #define	V_bridge_inherit_mac	VNET(bridge_inherit_mac)
@@ -1151,7 +1158,8 @@
 #endif
 			break;
 		}
-		/* reneable any interface capabilities */
+		/* reneable any interface capabilities if restore_caps is set */
+		if (V_restore_caps)
 		bridge_set_ifcap(sc, bif, bif->bif_savedcaps);
 	}
 	bstp_destroy(&bif->bif_stp);	/* prepare to free */
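
A sketch of how the knob from this patch might be used at runtime. The sysctl net.link.bridge.restore_caps exists only on a kernel built with the patch above, so the hypothetical helper below checks for it first and otherwise just reports that it is missing.

```sh
#!/bin/sh
# Hypothetical helper: set restore_caps if this kernel provides it.
# net.link.bridge.restore_caps is not a stock sysctl; it comes from the patch.
set_restore_caps() {
    if sysctl -n net.link.bridge.restore_caps >/dev/null 2>&1; then
        # 0 = keep member capabilities as they are when a member is removed
        sysctl net.link.bridge.restore_caps="$1"
    else
        echo "net.link.bridge.restore_caps not available on this kernel"
    fi
}

set_restore_caps 0
```

To make the setting persistent on a patched system, the same assignment could go in /etc/sysctl.conf.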