Bug 240106 - VNET issue with ARP and routing sockets in jails
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-jail (Nobody)
Reported: 2019-08-25 19:37 UTC by John Westbrook
Modified: 2024-02-11 01:48 UTC
Description John Westbrook 2019-08-25 19:37:31 UTC
I'm experiencing an intermittent connectivity issue on FreeBSD 12.0 with jails using VNET, which appears to be related to lost ARP replies.

There are several discussion threads on forums that appear related:

https://forums.freebsd.org/threads/vnet-arp-replies-are-lost.71082
https://www.ixsystems.com/community/threads/arp-replies-loss-in-vnet.77027
https://www.ixsystems.com/community/threads/jails-eero.59477

One insightful comment from the first thread:

"""On step #2 the reply is mistakenly padded with 14 bytes which is exactly the number of bytes beyond the 18 bytes in the request (the request was padded with 32 bytes). I bet this is part of the bug. By looking at FreeBSD ARP reply code it actually creates the reply by editing the request bytes in place. For some reason it removes only 18 bytes from the request padding. However, this happens only on VNET interface as noted above."""

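The arithmetic in the quoted comment can be checked with a short sketch. It assumes standard Ethernet II figures that the thread itself does not spell out: a 14-byte header, a 28-byte ARP payload for IPv4 over Ethernet, and a 60-byte minimum frame length excluding the FCS. The function and constant names are illustrative only.

```python
# Sketch of the padding arithmetic from the quoted forum comment.
# Assumed figures (standard Ethernet II, not stated in the thread):
ETHER_HDR_LEN = 14   # dst MAC + src MAC + ethertype
ARP_IPV4_LEN = 28    # ARP payload for IPv4 over Ethernet
ETHER_MIN_LEN = 60   # minimum frame length, excluding the 4-byte FCS


def minimum_padding(payload_len):
    """Bytes of padding needed to reach the minimum Ethernet frame length."""
    return max(0, ETHER_MIN_LEN - (ETHER_HDR_LEN + payload_len))


def leftover_padding(received_padding, removed):
    """Padding that remains if only `removed` bytes are trimmed."""
    return max(0, received_padding - removed)


if __name__ == "__main__":
    # Padding a bare ARP frame needs to reach the minimum length.
    print(minimum_padding(ARP_IPV4_LEN))
    # Request padded with 32 bytes, only 18 removed when building the reply.
    print(leftover_padding(32, 18))
```

Under these assumptions, 18 bytes is exactly the minimum padding of an ARP frame, which would be consistent with the reply code trimming a fixed 18 bytes and leaving the other 14 behind. This is only a plausibility check of the numbers, not a confirmation of the root cause.
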
I was able to see ARP traffic using tcpdump, but (arp -a) doesn't contain updated ARP entries. Also, in an affected jail, I can't add static arp entries:

# arp -s 10.0.0.1 XX:XX:XX:XX:XX:XX
arp: writing to routing socket: Cannot allocate memory

whereas in an unaffected jail the arp command succeeds. Jails should have access to routing sockets by default, so perhaps the problem is related to accessing routing sockets in VNET jails?

The test setup where I'm observing this uses an SR-IOV VF (Chelsio cxlv0) passed into the jail (via vnet.interface in jail.conf). It has two jails on each of two directly attached hosts. I observe the problem on both hosts, but it comes and goes with reboots.
Comment 1 Andrey V. Elsukov freebsd_committer freebsd_triage 2019-08-27 10:51:18 UTC
Can you describe the steps required to reproduce the problem on the 12.0/13.0 system?
Comment 2 John Westbrook 2019-08-29 16:54:20 UTC
I have SR-IOV configured as described in this thread:

https://forums.freebsd.org/threads/sr-iov-chelsio-error-in-guest.70653

such that cxlv[0-3] are shown in ifconfig. The jail.conf is:

vnet;
vnet.interface = "vnet0";
exec.prestart  = "ifconfig ${vnet0} name vnet0";
exec.poststop  = "ifconfig vnet0 name ${vnet0}";

exec.start += "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.consolelog = "/var/log/${name}.log";
host.hostname = "${name}";
path = "/jail/${name}";

j1 {
   $vnet0 = "cxlv1";
}

j2 {
   $vnet0 = "cxlv2";
}

There are two hosts directly connected via cxl0. The problem is visible when pinging (1) between jails on the same host and (2) from an affected jail on host 1 to host 2. On an unaffected host both of these operations succeed.

Using tcpdump on the physical (cxl) and virtual (cxlv) interfaces shows the ARP requests and responses, but in an affected jail the ARP tables aren't updated.
Comment 3 Alexander Lunev 2019-10-09 11:43:04 UTC
I think the bug I wanted to report is somewhat similar: all the main actors (VNET, jails and ARP) are the same.

So I have a problem with network connectivity between jails and host when using jails with VNET and VLANs. 

I've written about it to freebsd-net@ mailing list: 

threads: 
https://lists.freebsd.org/pipermail/freebsd-net/2019-September/054391.html
https://lists.freebsd.org/pipermail/freebsd-net/2019-October/054437.html

There's a topic on the FreeBSD forums that confirms this and explains, in great detail, the configuration in which the problem occurs. However, the author "solved" his problem by simply avoiding that configuration: he no longer bridges the physical interface with the jail's VNET interface and no longer uses VLANs on the jail's VNET interface.

https://forums.freebsd.org/threads/bridge-epair-not-passing-through-tagged-vlan-traffic-between-host-and-vnet-jail.71646/

I'll add some more observations here. I recreated the configuration in a virtual machine, as I wrote in my last message to freebsd-net@ here: https://lists.freebsd.org/pipermail/freebsd-net/2019-October/054475.html. The jail's vlan interface IP is 10.15.15.2 and the host's vlan interface IP is 10.15.15.1. Neither the jail nor the host has an ARP entry for the other's address.

So I ping from 10.15.15.2 to 10.15.15.1. 

1. In the initial configuration, I see this on em0:

HOST# tcpdump -i em0 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:57:52.051429 02:95:ce:33:dc:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.1 tell 10.15.15.2, length 28
08:57:53.071451 02:95:ce:33:dc:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.1 tell 10.15.15.2, length 28
08:57:54.101515 02:95:ce:33:dc:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.1 tell 10.15.15.2, length 28

2. Then I added an ARP entry in the jail:

JAIL# arp -s 10.15.15.1 00:0c:29:2f:6c:08

HOST# tcpdump -i em0 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:07:10.321257 00:0c:29:2f:6c:08 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.2 tell 10.15.15.1, length 28
09:07:11.391300 00:0c:29:2f:6c:08 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.2 tell 10.15.15.1, length 28
09:07:12.415232 00:0c:29:2f:6c:08 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.2 tell 10.15.15.1, length 28

3. Then I added the jail's ARP entry on the host:

HOST# arp -s 10.15.15.2 02:95:ce:33:dc:0b

and ICMP requests started to pass from jail to host, with the vlan22 interface on the host receiving packets and sending replies:

HOST# tcpdump -i vlan22 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan22, link-type EN10MB (Ethernet), capture size 262144 bytes
09:37:11.517054 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype IPv4 (0x0800), length 98: 10.15.15.2 > 10.15.15.1: ICMP echo request, id 25864, seq 0, length 64
09:37:11.517063 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype IPv4 (0x0800), length 98: 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 25864, seq 0, length 64

but I don't see them on the host's epair0a interface, which is bridged with em0 in bridge0; there are only requests on epair0a:

HOST# tcpdump -i epair0a -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on epair0a, link-type EN10MB (Ethernet), capture size 262144 bytes
09:40:44.178363 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.2 > 10.15.15.1: ICMP echo request, id 32264, seq 0, length 64
09:40:45.221713 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.2 > 10.15.15.1: ICMP echo request, id 32264, seq 1, length 64
09:40:46.253079 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.2 > 10.15.15.1: ICMP echo request, id 32264, seq 2, length 64

and on em0 I see only replies:

HOST# tcpdump -i em0 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:11.092092 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 0, length 64
09:41:12.096310 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 1, length 64
09:41:13.121890 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 2, length 64

and on the bridge interface neither requests nor replies are shown.

HOST# tcpdump -i bridge0 -e | grep 10.15.15
... silence ...
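
When comparing captures like the ones above, it helps to see at a glance which interface shows frames with the 802.1Q tag and which shows them untagged. The following is a minimal sketch that pulls the VLAN ID out of a `tcpdump -e` line; the regular expression is an assumption based only on the output format shown in this report, and the two sample lines are copied from the captures above.

```python
import re

# Matches the 802.1Q portion that `tcpdump -e` prints for a tagged frame.
VLAN_RE = re.compile(r"ethertype 802\.1Q \(0x8100\), length \d+: vlan (\d+),")


def vlan_of(line):
    """Return the VLAN ID shown on a tcpdump -e line, or None if untagged."""
    match = VLAN_RE.search(line)
    return int(match.group(1)) if match else None


# Two lines copied from the captures above.
EM0_LINE = (
    "09:41:11.092092 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), "
    "ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, "
    "10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 0, length 64"
)
VLAN22_LINE = (
    "09:37:11.517063 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), "
    "ethertype IPv4 (0x0800), length 98: "
    "10.15.15.1 > 10.15.15.2: ICMP echo reply, id 25864, seq 0, length 64"
)

if __name__ == "__main__":
    for name, line in (("em0", EM0_LINE), ("vlan22", VLAN22_LINE)):
        print(name, vlan_of(line))
```

The 4-byte difference between the two frame lengths (102 on em0, 98 on vlan22) is the 802.1Q tag itself, which is present on the parent interface and already removed on the vlan interface.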

Is this normal, or am I doing something wrong?
I wanted to make jails act like a normal FreeBSD host with one dedicated VNET interface carrying VLANs.
Comment 4 Ryan Moeller freebsd_committer freebsd_triage 2020-07-15 03:58:53 UTC
When I followed the reproduction steps described in the linked threads with a debug kernel I hit the following assert:

 panic: m_dup: bogus m_pkthdr.len
 cpuid = 1
 KDB: stack backtrace:
 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe2eca39e700
 vpanic() at vpanic+0x177/frame 0xfffffe2eca39e760
 doadump() at doadump/frame 0xfffffe2eca39e7e0
 m_dup() at m_dup+0x376/frame 0xfffffe2eca39e860
 bridge_broadcast() at bridge_broadcast+0x1bf/frame 0xfffffe2eca39e8c0
 bridge_forward() at bridge_forward+0x222/frame 0xfffffe2eca39e920
 bridge_input() at bridge_input+0x3d5/frame 0xfffffe2eca39e990
 ether_nh_input() at ether_nh_input+0x2a6/frame 0xfffffe2eca39e9e0
 netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe2eca39ea40
 ether_input() at ether_input+0x8f/frame 0xfffffe2eca39ea80
 epair_nh_sintr() at epair_nh_sintr+0x1a/frame 0xfffffe2eca39eaa0
 swi_net() at swi_net+0x1b9/frame 0xfffffe2eca39eb20
 intr_event_execute_handlers() at intr_event_execute_handlers+0x99/frame 0xfffffe2eca39eb60
 ithread_loop() at ithread_loop+0xb7/frame 0xfffffe2eca39ebb0
 fork_exit() at fork_exit+0x84/frame 0xfffffe2eca39ebf0
 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe2eca39ebf0
 --- trap 0, rip = 0, rsp = 0, rbp = 0 ---


The offending KASSERT is still there:

                /* Check correct total mbuf length */
                KASSERT((remain > 0 && m != NULL) || (remain == 0 && m == NULL),
                        ("%s: bogus m_pkthdr.len", __func__));
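
For readers unfamiliar with what this assertion guards, here is a simplified model of the invariant, not the kernel's implementation: the total length recorded in the packet header must be consumed exactly when the mbuf chain ends. The function name and the list-of-lengths representation of a chain are illustrative assumptions.

```python
# Simplified model of the invariant behind the KASSERT in m_dup():
# m_pkthdr.len must be used up exactly when the mbuf chain ends.
def pkthdr_len_is_consistent(pkthdr_len, mbuf_lens):
    """pkthdr_len: claimed total length; mbuf_lens: m_len of each mbuf in the chain."""
    remain = pkthdr_len
    index = 0  # stands in for the `m` pointer walking the chain
    while remain > 0 and index < len(mbuf_lens):
        remain -= mbuf_lens[index]
        index += 1
    chain_exhausted = index == len(mbuf_lens)  # corresponds to m == NULL
    # Once the walk can make no further progress, only
    # (remain == 0 && m == NULL) is a consistent end state.
    return remain == 0 and chain_exhausted


if __name__ == "__main__":
    print(pkthdr_len_is_consistent(60, [42, 18]))  # header and chain agree
    print(pkthdr_len_is_consistent(60, [42]))      # chain shorter than claimed
    print(pkthdr_len_is_consistent(42, [42, 18]))  # chain longer than claimed
```

In this model, a frame whose padding was added to or removed from the chain without a matching update of the header length (or the reverse) would fail the check, which is one way the padding oddities discussed earlier could plausibly lead to this panic. That link is a hypothesis, not something established in this report.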
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2020-07-15 14:23:25 UTC
(In reply to Ryan Moeller from comment #4)
I haven't been able to reproduce this.  Did you do so on -CURRENT?
Comment 6 Ryan Moeller freebsd_committer freebsd_triage 2020-07-15 16:03:09 UTC
(In reply to Mark Johnston from comment #5)

This stack was from approximately stable/11 a few months ago.

I just tried on -CURRENT and the ARP reply does make it back and there is no panic (tested in a jail with epair on a bridge). I will check stable/12 and stable/11 again, and 12.1-Rel to be sure.
Comment 7 Aaron 2020-11-18 20:25:28 UTC
I'm trying to figure out an issue which seems similar or the same as this issue. DHCP Response traffic is going to the untagged interface & bridge, rather than the VLAN interface, and then to the associated bridge. 

I have a forum post up at https://forums.freebsd.org/threads/vlans-with-bhyve-guests-not-getting-dhcp.77647/ and otherwise can't see any way forward with debugging. I've been poking at some of the net.link.bridge sysctl tunables, but nothing seems to change.
Comment 8 Sebastian Stroniewski-Wojtczak 2020-11-19 07:29:04 UTC
Try this one:

ifconfig lo0 -rxcsum -txcsum

Afterward you shouldn't see any "cksum incorrect" messages. For some reason, even if the VLAN/bridge is attached to a real device, the loopback interface interferes with it.
Comment 9 Aaron 2020-11-19 08:08:40 UTC
@Sebastian
Not sure what that setting has to do with anything. I'm not seeing any checksum errors. The issue is that traffic for the VLAN is coming in on the main, untagged interface rather than going to the VLAN interface and then to the VLAN bridge.
Comment 10 johan 2020-11-21 14:56:19 UTC
I had this exact problem with the same setup. My jails are on a VM on an ESXi 6.5 host, with the port group on promiscuous mode. I lost some hairs to this issue. Tried FreeBSD 12.1, 12.2, disabled TSO, cksum, dug through kernel sources...

All that in vain: the problem disappeared the minute I enabled this fling on ESXi: https://flings.vmware.com/learnswitch

I hope this helps.
Comment 11 Gabor ADORJANI 2022-03-30 12:38:48 UTC
I believe I ran into the same issue today on 13.1-BETA3.

Setup: I use a NUC for virtualisation host with a single NIC: em0. It has vPro (poor man's service processor), which shares the NIC with the OS and communicates on the native VLAN (VLAN1). Because of this I put the OS to a tagged one.

I set up several tagged VLANs: 2, 4, 6, 8. The host OS uses em0.2 on VLAN2.

I set up a bridge for each VLAN interface, as well as for the physical:

em0 -> vm-sw1
em0.2 -> vm-sw2
em0.4 -> vm-sw4
em0.6 -> vm-sw6
em0.8 -> vm-sw8

Then I created a jail with Bastille, assigning it to VLAN2/vm-sw2 using VNET, with an IP from the subnet also used on the host.

I could ping the host from the jail and vice versa, but could not reach the external world from the jail, nor could ping the jail from the router in the same subnet.

After 'ifconfig vm-sw1 destroy' it suddenly started working and the jail now has full IP4/6 connectivity.
Comment 12 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-03-30 15:17:50 UTC
I believe I ran into similar problems with main last year and I am still seeing occasional "blackouts"; I believe back then I could trigger traffic by sending packets from one specific part of jail / host / remote to another and that would hold until the entry expired but I have no more notes on this.

I am also adding kp@ given bridge ...
Comment 13 Kristof Provost freebsd_committer freebsd_triage 2022-03-30 15:32:39 UTC
(In reply to Bjoern A. Zeeb from comment #12)
Note that the issue described in #10 is a configuration problem more than a bug.

In this configuration the bridge will grab all packets, including those with a vlan tag and nothing will be passed to the vlan interfaces. That's expected. After all, the system has been configured to bridge all packets arriving on em0 to the members of vm-sw1, and that includes those with ETHERTYPE_VLAN.

This patch should make it do what the user wants, but I'm not convinced that's actually appropriate:

diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 12c807fe2009..98c79764bc69 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -2467,6 +2467,11 @@ bridge_input(struct ifnet *ifp, struct mbuf *m)

        eh = mtod(m, struct ether_header *);

+       if (ntohs(eh->ether_type) == ETHERTYPE_VLAN ||
+           ntohs(eh->ether_type) == ETHERTYPE_QINQ) {
+               return (m);
+       }
+
        bridge_span(sc, m);

        if (m->m_flags & (M_BCAST|M_MCAST)) {
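
The comparison the patch adds can be illustrated outside the kernel on raw frame bytes. This is a sketch only: it mirrors the ethertype test, not the rest of bridge_input(), and it ignores cases where a driver has already moved the tag out of the header. The function name is hypothetical, the two ethertype values are the standard 802.1Q and 802.1ad numbers, and the sample frames are the untagged and tagged ARP requests from the captures in comment 26.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # 802.1Q tag
ETHERTYPE_QINQ = 0x88A8  # 802.1ad service tag


def has_vlan_ethertype(frame):
    """True if the Ethernet header's type field is 802.1Q or QinQ."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    (ether_type,) = struct.unpack_from("!H", frame, 12)
    return ether_type in (ETHERTYPE_VLAN, ETHERTYPE_QINQ)


# ARP request as seen on epair0a (untagged) and on cc0 (tagged, vlan 2020).
UNTAGGED = bytes.fromhex(
    "ffffffffffff0207f080de0b08060001"
    "0800060400010207f080de0b0a141406"
    "0000000000000a1414fe"
)
TAGGED = bytes.fromhex(
    "ffffffffffff0207f080de0b810007e4"
    "080600010800060400010207f080de0b"
    "0a1414060000000000000a1414fe"
)

if __name__ == "__main__":
    print(has_vlan_ethertype(UNTAGGED))
    print(has_vlan_ethertype(TAGGED))
```

With the patch applied, a frame for which this test is true would be handed back untouched by the bridge instead of being forwarded to the bridge members.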
Comment 14 Gabor ADORJANI 2022-03-30 16:07:37 UTC
(In reply to Kristof Provost from comment #13)
Thanks Kristof, you are right, I didn't see the forest for the trees. It's not a bug, but a feature.
Comment 15 crest 2022-03-30 17:22:42 UTC
You probably want to create the vlan interfaces on the physical interface and add them as member interfaces to the bridges. All IP addresses belong on the bridge interface; don't put IP addresses on bridge member interfaces.
Comment 16 Gabor ADORJANI 2022-03-30 18:03:05 UTC
Thanks, I'll try - I've been doing this on Linux for many years. Ironically, some years ago I read the opposite in some FreeBSD-related doc: assigning the addresses to the parent interface was the recommended way.
Comment 17 Peter Eriksson 2022-04-21 12:42:21 UTC
I'm seeing the same (or similar issue) on 12.3-RELEASE-p5 when trying to bridge a vlan interface into a jail:


# egrep 'ifconfig|cloned' rc.conf
ifconfig_ixl0="up"
ifconfig_ixl2="up"
cloned_interfaces="lagg0 vlan1601 bridge0"
ifconfig_lagg0="laggproto lacp laggport ixl0 laggport ixl2 130.236.8.40 netmask 255.255.255.224 lacp_fast_timeout"
ifconfig_lagg0_ipv6="inet6 2001:6b0:17:2400::8:40/64 lacp_fast_timeout"
ifconfig_vlan1601="vlandev lagg0 vlan 1601 up"
ifconfig_bridge0="addm vlan1601 up"


# ifconfig bridge0
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether 02:90:7b:7b:f5:00
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 14 priority 128 path cost 2000
	member: lagg0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 10 priority 128 path cost 1000
	member: vlan1601 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 11 priority 128 path cost 2000000
	groups: bridge
	nd6 options=9<PERFORMNUD,IFDISABLED>



root@filur00:/etc # iocage console test
Last login: Thu Apr 21 14:27:18 on pts/0
FreeBSD 12.3-RELEASE-p5 GENERIC --
...
root@test:~ # ping 130.236.8.65
PING 130.236.8.65 (130.236.8.65): 56 data bytes
^C
--- 130.236.8.65 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss


If I now manually remove the "lagg0" member from the bridge0 interface then things start to work fine. It would be nice if it didn't add it automatically :-)


root@filur00:/etc # ifconfig bridge0 deletem lagg0

root@filur00:/etc # iocage console test
Last login: Thu Apr 21 14:38:34 on pts/0
FreeBSD 12.3-RELEASE-p5 GENERIC 
....
root@test:~ # ping 130.236.8.65
PING 130.236.8.65 (130.236.8.65): 56 data bytes
64 bytes from 130.236.8.65: icmp_seq=0 ttl=255 time=0.249 ms
^C
--- 130.236.8.65 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.249/0.249/0.249/0.000 ms
Comment 18 Peter Eriksson 2022-04-21 13:07:25 UTC
> root@filur00:/etc # ifconfig bridge0 deletem lagg0

Easy solution... Remove from rc.conf:

  ifconfig_bridge0="addm vlan1601 up"


and then tell iocage to not add lagg0 automatically to the jail's bridge:

  # iocage set vnet_default_interface=vlan1601 test
Comment 19 O. Hartmann 2022-05-07 08:13:44 UTC
Hello.
We also have a similar issue on FreeBSD 12.3-RELEASE-p2 (XigmaNAS, stuck at -p2 for the moment). The boxes in question have two NICs: one (em0) is intended for management access and the other (igb0) is bound to the offered services. Additionally, the second NIC (igb0) has an IP address AND serves as the physical member of a bridge for vnet jails, which have epair interfaces (created in XigmaNAS via the FreeBSD in-tree tool "jib").
Binding provided services such as Samba and NFS to the second NIC (igb0) works as expected; ping and ssh are no problem either.

The base host's IPs (both NICs) and those of the jails are within the same network.

Trouble begins with the vnet jails on the bridge of which igb0 is a member.
We use several jails on those boxes. Pinging those jails from outside the campus network works only sporadically for some IPs, and it takes a long time until a jail starts responding. The behaviour is the same within the LAN.

We also already disabled pfil on the bridges as suggested:

device	if_bridge
net.link.bridge.ipfw: 0
net.link.bridge.allow_llz_overlap: 0
net.link.bridge.inherit_mac: 0
net.link.bridge.log_stp: 0
net.link.bridge.pfil_local_phys: 0
net.link.bridge.pfil_member: 0
net.link.bridge.ipfw_arp: 0
net.link.bridge.pfil_bridge: 0
net.link.bridge.pfil_onlyip: 0

A curiosity: if one or two of the five jails on the host answer pings, then in a later attempt one or at most two different jails answer and the ones that responded before no longer do. It is like gambling.

We also run another host with the very same XigmaNAS version; in that case, the second NIC is configured to be part of another network and attached to another switch. No problem there!

In the problematic cases described above, we do not have direct access to the department's backend switches, so I can't tell whether I'm the culprit (misconfiguration, misunderstanding of network technology, et cetera).

I hope the problem can be solved within FreeBSD 12.3 anyway.
Comment 20 Andriy Gapon freebsd_committer freebsd_triage 2023-01-27 11:06:12 UTC
(In reply to Kristof Provost from comment #13)
Perhaps we could create a special vlan "sub-interface" that sees only untagged traffic on input and does not add any tag on output (just like the parent interface).  We could use some reserved VLAN ID to mark the special interface. E.g., the currently prohibited VLAN ID 0.
Comment 21 Nikita Olenets 2023-01-27 17:28:46 UTC
In the meantime you can try a workaround: create an ng_bridge interface on your parent and then use that newly created interface as a member of your management bridge.

Assuming you have em0 as your parent interface

em0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 58:9c:fc:10:f1:16
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

/bin/sh /usr/share/examples/jails/jng bridge main em0
Will create ng0_main interface:

ng0_main: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=28<VLAN_MTU,JUMBO_MTU>
        ether 02:60:c8:08:84:9b
        hwaddr 58:9c:fc:10:ff:ff
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

Now you can add the ng0_main interface instead of em0. It will work like a charm.
Comment 22 Kristof Provost freebsd_committer freebsd_triage 2023-01-28 01:18:43 UTC
(In reply to Andriy Gapon from comment #20)
I'm not sure how that'd fix this issue, but I believe that's something we already support. It's possible to set Priority code point (PCP) on a regular interface, which should insert a vlan header with VID 0.
Comment 23 Andriy Gapon freebsd_committer freebsd_triage 2023-01-29 10:48:35 UTC
(In reply to Kristof Provost from comment #22)
Just to be sure that we talk about the same thing (and I feel like we are not), I am not suggesting any modification to what's going on the wire. Just a new virtual interface that captures only untagged packets.
To be more clear:
- igb0: receives all arriving packets, sends packets without inserting any VLAN tag
- igb0.1: receives arriving packets with VLAN tag 1, adds VLAN tag 1 when sending
- igb0.0: [proposed] receives only packets without any VLAN tag, sends packets without inserting any VLAN tag
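
The three cases listed above can be written down as a small model. This sketches the proposal as described in this comment, not existing FreeBSD behaviour for igb0.0; the function name and the use of VID 0 to mark the untagged-only sub-interface are taken from the suggestion in comment 20 and are otherwise hypothetical.

```python
# Model of which igb0-based interfaces would see an arriving frame.
def receiving_interfaces(vlan_tag, configured_vids):
    """vlan_tag is None for an untagged frame.

    A 0 in configured_vids stands for the proposed untagged-only
    sub-interface igb0.0.
    """
    seen = ["igb0"]  # the parent sees every arriving frame
    for vid in configured_vids:
        if vid == 0 and vlan_tag is None:
            seen.append("igb0.0")       # proposed: untagged traffic only
        elif vid != 0 and vlan_tag == vid:
            seen.append(f"igb0.{vid}")  # existing: matching tag only
    return seen


if __name__ == "__main__":
    print(receiving_interfaces(None, [0, 1]))  # untagged frame
    print(receiving_interfaces(1, [0, 1]))     # frame tagged with VLAN 1
    print(receiving_interfaces(2, [0, 1]))     # frame tagged with an unconfigured VLAN
```

In this model a bridge attached to igb0.0 instead of igb0 would no longer be offered tagged frames, which is the effect the proposal aims for.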
Comment 24 Kristof Provost freebsd_committer freebsd_triage 2023-01-29 19:52:17 UTC
(In reply to Andriy Gapon from comment #23)
Ah, I see. I did indeed misunderstand.

However, I don't think that'd fix the issue of VLAN on if_bridge interfaces. The problem is that the bridge checks if it needs to grab the packet before vlan_input() gets its turn.
Comment 25 Andriy Gapon freebsd_committer freebsd_triage 2023-01-29 20:32:07 UTC
(In reply to Kristof Provost from comment #24)
I think I will need to look at the code. I thought that a bridge would see packets only from a bridged virtual/vlan interface (such as the proposed igb0.0), but it looks like the actual ethernet input processing has a different flow.
Comment 26 kvs 2023-03-06 22:51:19 UTC
Hello Everyone!

I believe I have hit the same bug, though I believe my issue is specifically related to lagg/lacp.  I can confirm this problem affects tap as well as epair interfaces on a bridge when attempting to send over a vlan interface that has a lagg parent.


System Description: FreeBSD 13.1 w/ Chelsio T6225-SO-CR NIC, identified by cc0 / cc1 (confirmed up and operational), host25 is the system name.  Network is 10.20.20.0/24, gateway is 10.20.20.254 (mac: 02:11:22:33:44:55), host is assigned 10.20.20.5, epair0 is assigned to jail-10-20-20-6 (with matching IP of 10.20.20.6 on epair0b).  Switch is set to accept tagged frames only for vlan 2020.  All mtu's 1500.

When adding a vlan interface child of cc0 to the bridge, I do not have any trouble passing data over the lagg.

host25# ifconfig cc0.2020 create up
host25# ifconfig bridge2020 create up
host25# ifconfig bridge2020 addm cc0.2020
host25# ifconfig bridge2020 addm epair0a
host25# ifconfig bridge2020 inet 10.20.20.25/24
	
(pings from host -> gateway works fine)
host25# ping 10.20.20.254
success!

(pings from jail -> gateway also work)
host25# jexec jail-10-20-20-6 sh
jail-10-20-20-6# ping 10.20.20.254
success!

(I now reset bridge2020 to use a lagg interface.)
host25# ifconfig bridge2020 destroy
host25# ifconfig cc0.2020 destroy

host25# ifconfig lagg0 create laggproto lacp laggport cc0 laggport cc1 up
host25# ifconfig lagg0.2020 create up
host25# ifconfig bridge2020 create up
host25# ifconfig bridge2020 addm lagg0.2020 addm epair0a
host25# ifconfig bridge2020 inet 10.20.20.25/24

(pings from host -> gateway work fine)
host25# ping 10.20.20.254
success!

(pings from jail -> gateway timeout)
host25# jexec jail-10-20-20-6 sh
jail-10-20-20-6# ping 10.20.20.254
ping: sendto: Host is down


(arp cache from jail appears to not include gateway mac)
jail-10-20-20-6# arp -an
? (10.20.20.6) at 02:07:f0:80:de:0b on epair0b permanent [ethernet]
? (10.20.20.254) at (incomplete) on epair0b expired [ethernet]

(I assign mac statically.)
jail-10-20-20-6# arp -s 10.20.20.254 02:11:22:33:44:55
jail-10-20-20-6# arp -an
? (10.20.20.6) at 02:07:f0:80:de:0b on epair0b permanent [ethernet]
? (10.20.20.254) at 02:11:22:33:44:55 on epair0b permanent [ethernet]

(attempt ping again after static arp assignment)
jail-10-20-20-6# ping 10.20.20.254
success!

What comes next is a reasonably big presumption on my part, so hopefully someone more educated on the topic will kindly correct me where I'm wrong.  Seeing that the vlan interface cc0.2020 works in the bridge when lagg0.2020 is removed/destroyed, I believe it's possible that the issue is related to arp responses being sent down one of the two lagg members and the host OS not being aware of that.  Although the reply does come inbound on one of the host OS interfaces, it isn't propagated across the epair / tap.  The VM/Jail then never sees the arp reply, and keeps the arp entry as "(incomplete)" in its cache.  When using a single interface, or a lagg with only a single interface active, arp appears to work as expected.

To help observe this, I did the following:

1) From host25, I watched epair0a, cc0, and cc1 using
host25# tcpdump -e -vvv -XX -i [interface]

2) inside jail-10-20-20-6, I attempted to ping the gateway to generate the arp traffic:
ping -c 1 -t 1 -q 10.20.20.254
PING 10.20.20.254 (10.20.20.254): 56 data bytes

--- 10.20.20.254 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss



3) Results follow:
# tcpdump -e -vvv -XX -i epair0a
tcpdump: listening on epair0a, link-type EN10MB (Ethernet), capture size 262144 bytes
01:43:54.768801 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 28
		0x0000:  ffff ffff ffff 0207 f080 de0b 0806 0001  ................
		0x0010:  0800 0604 0001 0207 f080 de0b 0a14 1406  ................
		0x0020:  0000 0000 0000 0a14 14fe                 ..........
01:43:54.768936 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 42
		0x0000:  ffff ffff ffff 0207 f080 de0b 0806 0001  ................
		0x0010:  0800 0604 0001 0207 f080 de0b 0a14 1406  ................
		0x0020:  0000 0000 0000 0a14 14fe 0000 0000 0000  ................
		0x0030:  0000 0000 0000 0000                      ........
01:43:54.768969 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 46
		0x0000:  ffff ffff ffff 0207 f080 de0b 0806 0001  ................
		0x0010:  0800 0604 0001 0207 f080 de0b 0a14 1406  ................
		0x0020:  0000 0000 0000 0a14 14fe 0000 0000 0000  ................
		0x0030:  0000 0000 0000 0000 0000 0000            ............
		
		
# tcpdump -e -vvv -XX -i cc0
tcpdump: listening on cc0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:43:54.768822 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 28
	0x0000:  ffff ffff ffff 0207 f080 de0b 8100 07e4  ................
	0x0010:  0806 0001 0800 0604 0001 0207 f080 de0b  ................
	0x0020:  0a14 1406 0000 0000 0000 0a14 14fe       ..............
01:43:54.769126 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000:  0207 f080 de0b 0211 2233 4455 8100 07e4  ........"3DU....
	0x0010:  0806 0001 0800 0604 0002 0211 2233 4455  ............"3DU
	0x0020:  0a14 14fe 0207 f080 de0b 0a14 1406 0000  ................
	0x0030:  0000 0000 0000 0000 0000 0000 0000 0000  ................
01:43:54.769171 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000:  0207 f080 de0b 0211 2233 4455 8100 07e4  ........"3DU....
	0x0010:  0806 0001 0800 0604 0002 0211 2233 4455  ............"3DU
	0x0020:  0a14 14fe 0207 f080 de0b 0a14 1406 0000  ................
	0x0030:  0000 0000 0000 0000 0000 0000 0000 0000  ................
01:43:54.769221 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000:  0207 f080 de0b 0211 2233 4455 8100 07e4  ........"3DU....
	0x0010:  0806 0001 0800 0604 0002 0211 2233 4455  ............"3DU
	0x0020:  0a14 14fe 0207 f080 de0b 0a14 1406 0000  ................
	0x0030:  0000 0000 0000 0000 0000 0000 0000 0000  ................



# tcpdump -e -vvv -XX -i cc1
tcpdump: listening on cc1, link-type EN10MB (Ethernet), capture size 262144 bytes
01:43:54.768876 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 60: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 42
	0x0000:  ffff ffff ffff 0207 f080 de0b 8100 07e4  ................
	0x0010:  0806 0001 0800 0604 0001 0207 f080 de0b  ................
	0x0020:  0a14 1406 0000 0000 0000 0a14 14fe 0000  ................
	0x0030:  0000 0000 0000 0000 0000 0000            ............
01:43:54.768965 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 46
	0x0000:  ffff ffff ffff 0207 f080 de0b 8100 07e4  ................
	0x0010:  0806 0001 0800 0604 0001 0207 f080 de0b  ................
	0x0020:  0a14 1406 0000 0000 0000 0a14 14fe 0000  ................
	0x0030:  0000 0000 0000 0000 0000 0000 0000 0000  ................



Apparently 1 arp request is sent over cc0 and 2 over cc1, and all 3 replies come back over cc0.  None of them appear to enter epair0a.  I've not had any luck changing lagg hashes at this stage to try to force requests down one of the two lagg members, so instead I downed one of the interfaces in the lagg.
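
To check the hex dumps above by hand, here is a minimal Python sketch that decodes one of the tagged ARP replies. The bytes are copied from a reply dump above; the helper names (`decode_arp_frame`, `mac`, `ip`) are made up for illustration, and the decoder assumes Ethernet/IPv4 ARP (hardware length 6, protocol length 4).

```python
import struct

# Bytes copied from one of the 64-byte tagged ARP reply dumps above.
HEX_DUMP = """
0207 f080 de0b 0211 2233 4455 8100 07e4
0806 0001 0800 0604 0002 0211 2233 4455
0a14 14fe 0207 f080 de0b 0a14 1406 0000
0000 0000 0000 0000 0000 0000 0000 0000
"""


def mac(b):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{x:02x}" for x in b)


def ip(b):
    """Format 4 raw bytes as a dotted-quad IPv4 address."""
    return ".".join(str(x) for x in b)


def decode_arp_frame(frame):
    """Decode an Ethernet (optionally 802.1Q-tagged) ARP frame.

    Assumes Ethernet/IPv4 ARP, i.e. a 28-byte ARP payload.
    """
    dst, src = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    vlan = None
    if ethertype == 0x8100:  # 802.1Q tag: TCI followed by the inner ethertype
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan = tci & 0x0FFF  # low 12 bits of the TCI are the VLAN ID
        offset = 18
    if ethertype != 0x0806:
        raise ValueError("not an ARP frame")
    htype, ptype, hlen, plen, op = struct.unpack("!HHBBH", frame[offset:offset + 8])
    body = frame[offset + 8:offset + 28]
    return {
        "dst": mac(dst),
        "src": mac(src),
        "vlan": vlan,
        "op": op,  # 1 = request, 2 = reply
        "sha": mac(body[0:6]),
        "spa": ip(body[6:10]),
        "tha": mac(body[10:16]),
        "tpa": ip(body[16:20]),
        # Everything after the 28-byte ARP payload is trailing padding.
        "padding": len(frame) - (offset + 28),
    }


frame = bytes.fromhex("".join(HEX_DUMP.split()))
info = decode_arp_frame(frame)
for key, value in info.items():
    print(f"{key}: {value}")
```

The "length 46" that tcpdump prints for this frame is the 28-byte ARP payload plus the trailing zero padding.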

(bridge2020 is still up with epair0a and lagg0.2020 (lagg0 contains cc0+cc1 both up))

jail-10-20-20-6# ping 10.20.20.254
ping: sendto: Host is down

host25# ifconfig cc1 down

(confirm arp cache is empty in jail)
jail-10-20-20-6# arp -da
jail-10-20-20-6# ping 10.20.20.254
success!


(using tcpdump, epair0a now sees the arp replies as well (I excluded the tcpdump for cc0 here because it's largely identical))
# tcpdump -e -vvv -XX -i epair0a
15:23:10.623560 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 28
	0x0000:  0001 0800 0604 0001 0207 f080 de0b 0a14  ................
	0x0010:  1406 0000 0000 0000 0a14 14fe            ............
15:23:10.623916 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000:  0001 0800 0604 0002 0211 2233 4455 0a14  .........."3DU..
	0x0010:  14fe 0207 f080 de0b 0a14 1406 0000 0000  ................
	0x0020:  0000 0000 0000 0000 0000 0000 0000       ..............
15:23:10.623924 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000:  0001 0800 0604 0002 0211 2233 4455 0a14  .........."3DU..
	0x0010:  14fe 0207 f080 de0b 0a14 1406 0000 0000  ................
	0x0020:  0000 0000 0000 0000 0000 0000 0000       ..............
15:23:10.623926 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000:  0001 0800 0604 0002 0211 2233 4455 0a14  .........."3DU..
	0x0010:  14fe 0207 f080 de0b 0a14 1406 0000 0000  ................
	0x0020:  0000 0000 0000 0000 0000 0000 0000       ..............
15:23:10.623943 02:07:f0:80:de:0b (oui Unknown) > 02:11:22:33:44:55 (oui Unknown), ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 56841, offset 0, flags [none], proto ICMP (1), length 84)
	10.20.20.6 > 10.20.20.254: ICMP echo request, id 22927, seq 0, length 64
	0x0000:  4500 0054 de09 0000 4001 5f74 0a14 1406  E..T....@._t....
	0x0010:  0a14 14fe 0800 8750 598f 0000 0006 2ec0  .......PY.......
	0x0020:  15c1 e795 0809 0a0b 0c0d 0e0f 1011 1213  ................
	0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
	0x0040:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
	0x0050:  3435 3637                                4567
15:23:10.624147 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 54016, offset 0, flags [none], proto ICMP (1), length 84)
	10.20.20.254 > 10.20.20.6: ICMP echo reply, id 22927, seq 0, length 64
	0x0000:  4500 0054 d300 0000 4001 6a7d 0a14 14fe  E..T....@.j}....
	0x0010:  0a14 1406 0000 8f50 598f 0000 0006 2ec0  .......PY.......
	0x0020:  15c1 e795 0809 0a0b 0c0d 0e0f 1011 1213  ................
	0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
	0x0040:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
	0x0050:  3435 3637                                4567
	
	
(arp cache seems valid as well)
jail-10-20-20-6# arp -na
? (10.20.20.6) at 02:07:f0:80:de:0b on epair0b permanent [ethernet]
? (10.20.20.254) at 02:11:22:33:44:55 on epair0b expires in 1085 seconds [ethernet]





Additional thoughts:
1) With lagg0, cc0, and cc1 up, I created a second jail on host25 using 10.20.20.7 (epair1).  I add epair1a to bridge2020 (now including epair0a, epair1a and lagg0.2020).

When I attempt to ping from jail-10-20-20-6 to .254 I get a timeout as previously experienced.

Pinging from .6 to .7 appears to work without any trouble, whether lagg0's cc0/cc1 members are up or down.  This was expected, as packets should never traverse lagg0.2020, but I did want to test/confirm.
 
2) I did run some ping tests with untagged lagg0 in the bridge, and it does appear it's working without trouble.  I removed lagg0.2020 from bridge2020, then added lagg0 to bridge2020, and set the switch ports as untagged in the switch.  The packets appear to move without trouble even with both cc0+cc1 up.  I need to further test this to be conclusive, but this felt less important to perform at this time as it doesn't solve the requirement I need of tagged ports.

3) I have added a few bhyve VMs (tap0, tap1, etc.) to bridge2020 as tests.  The results seem to be largely consistent with jails.  You could replace jail-10-20-20-6 with vm-10-20-20-11 (tested freebsd / openbsd / windows), for instance, and the same results appear.  Packets fail when originating from tap/vnet and traversing lagg0.2020.

(again, lagg0/lacp is up, includes cc0+cc1, bridge2020 includes lagg0.2020, tap0, and epair0a devices)
host25# ping 10.20.20.254
success!

vm-10-20-20-11# arp -da
(attempt traverse lagg0.2020)
vm-10-20-20-11# ping 10.20.20.254
ping: sendto: Host is down

(try tap0 -> epair0)
vm-10-20-20-11# ping 10.20.20.6
success!

(try tests again with lagg0 member cc1 down)
host25# ifconfig cc1 down

(tap0 -> lagg0.2020 -> 10.20.20.254)
vm-10-20-20-11# ping 10.20.20.254
success!

(again tap0 -> epair0, works as expected)
vm-10-20-20-11# ping 10.20.20.6
success!

(turn cc1 back up, wait about 10 seconds for both laggports to be distributing)
host25# ifconfig cc1 up
vm-10-20-20-11# arp -da
vm-10-20-20-11# ping 10.20.20.254
ping: sendto: Host is down

(again, only lagg is preventing arp, tap <-> epair in bridge still works fine)
vm-10-20-20-11# ping 10.20.20.6
success!
jail-10-20-20-6# ping 10.20.20.11
success!

Conclusion: When bridging a vnet/tap interface with a lagg.vlan interface (vlan interface with lagg [laggproto lacp] parent) arp replies do not enter the vnet/tap interface on the bridge when *both* lagg members are up.  By downing one of the two interfaces in the lagg group, arp replies enter the vnet/tap interface as expected.


Final notes:
I've not included it in this post, but I've attempted to remove all the hardware offloading features from the interfaces lagg0/lagg0.2020/cc0/cc1 as well as toggled lagg0 lagghash, toggled sysctls net.link.lagg.* and net.link.bridge.*, as well as upgraded to 13-STABLE.  No luck moving data over the lagg until I down one of the two lagg0 interfaces.  For brevity, I used the command 'ping host-ip' in the examples above, and only displayed a simple response of success/fail.  In testing I mostly performed pings for reasonably long periods (ex: -c 10 -t 2), to confirm the above examples.

I'd be happy to help test further if anyone has any suggestions.

Thank you!

-kvs
Comment 27 Slawomir Wojciech Wojtczak 2023-03-21 23:41:08 UTC
Hi,

any progress on this one?

Will it be fixed in 13.2-RELEASE or in the slightly later 14.0-RELEASE?

I ask because my buddy just hit it again with 13.1-RELEASE today ...

Regards,
vermaden
Comment 28 kvs 2023-03-22 03:23:58 UTC
(In reply to Slawomir Wojciech Wojtczak from comment #27)

I have some headway on my end, though I don't know how much it's related to the earlier bugs at this point.

After further testing, vlans apparently aren't related to my problem.  The problem occurs on lagg without vlan interfaces.  

When a jail+VNET (on bridge) sends an ARP request it traverses the bridge and exits both interfaces in the host lagg group.  When the ARP reply comes back, it appears it will only ever enter the host bridge if it comes in on the primary lagg member.  I'm not certain this is exclusive to vnets; possibly this is also normal operation for laggs using lacp?

Lab test:
lagg0 (ports cc0 + cc1), bridge2020 (members epair0a & lagg0)
ping from jail+VNET to switch (10.20.20.254), using source epair0b (10.20.20.77)


(epair0b -> epair0a -> bridge2020 -> lagg0 -> cc0/cc1 -> switch)

tcpdump -i epair0a
10:00:17.981011 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28

tcpdump -i bridge2020
10:00:17.981051 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28

tcpdump -i lagg0
10:00:17.981030 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:00:17.981282 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i cc0:
10:00:17.981050 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 42

tcpdump -i cc1:
10:00:17.981041 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:00:17.981282 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

The ARP table is not populated on the VM, as bridge2020 and epair0a/b never see the ARP reply that comes in over cc1.  I believe in my case specifically the switch is seeing cc1 as the primary lagg member while the FreeBSD server sees cc0 as the primary lagg member.
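
The hop-by-hop comparison above can also be done mechanically. A minimal Python sketch, using the tcpdump lines from this lab test as input: the function names and the `INBOUND_PATH` ordering (wire towards jail, for a reply arriving on cc1) are assumptions made for illustration, not part of any tool.

```python
REQUEST = "ARP, Request who-has 10.20.20.254 tell 10.20.20.77"
REPLY = "ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46"

# Per-interface tcpdump lines copied from the lab test (both lagg members up).
CAPTURES = {
    "epair0a": [f"10:00:17.981011 {REQUEST}, length 28"],
    "bridge2020": [f"10:00:17.981051 {REQUEST}, length 28"],
    "lagg0": [
        f"10:00:17.981030 {REQUEST}, length 28",
        f"10:00:17.981282 {REPLY}",
    ],
    "cc0": [f"10:00:17.981050 {REQUEST}, length 42"],
    "cc1": [
        f"10:00:17.981041 {REQUEST}, length 28",
        f"10:00:17.981282 {REPLY}",
    ],
}

# Assumed inbound path for a reply arriving on cc1, from the wire to the jail.
INBOUND_PATH = ["cc1", "lagg0", "bridge2020", "epair0a"]


def saw_reply(lines, ip):
    """True if any captured line is an ARP reply for the given IP."""
    return any(f"ARP, Reply {ip} is-at" in line for line in lines)


def first_hop_missing_reply(captures, path, ip):
    """Return the first interface on the path that never saw the reply."""
    for ifname in path:
        if not saw_reply(captures.get(ifname, []), ip):
            return ifname
    return None  # the reply was seen on every hop


lost_at = first_hop_missing_reply(CAPTURES, INBOUND_PATH, "10.20.20.254")
print(f"reply last missing at: {lost_at}")
```

With the captures above, the reply is present on cc1 and lagg0 and first goes missing at bridge2020.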

When ARP replies manage to come in over cc0, they make it to the vnet interface and the jail populates its ARP table.  I can force this by downing cc1 or shutting down the cc1 switch port (in both cases it appears the switch then identifies cc0 as the primary lagg member over which it sends ARP replies).  Alternatively, if both cc0 and cc1 are up and the switch sends an ARP reply over cc0 (this has happened randomly), the ARP reply does make it through the bridge/epair and populates the ARP cache on the VM.

Example after ifconfig cc1 down:

tcpdump -i epair0a
10:48:18.949695 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i bridge2020
10:48:18.949731 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i lagg0
10:48:18.949711 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i cc0
10:48:18.949722 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46


ARP table on VM is now populated with switch address, and everything appears to work as normal over lagg0 (with cc0 up / cc1 down).  

In the meantime I've managed to get the switch configured to send L2/ARP over both lagg members, which has fixed the immediate problem.  Though I do think it's strange that FreeBSD populates the ARP table just fine on the host over cc1, but just won't send that ARP reply over the bridge interface unless it comes in on cc0.  That *feels* like a bug, as it only seems to affect the second interface on a lagg that's in a bridge, and quite possibly only for layer 2 (L2/3 needs further testing - I've not lost packets once the arp table is populated, but it's possible the switch was handling layer 3 differently and always using the cc0 port, in which case FreeBSD would probably send over the bridge without trouble).

Testing has been performed on 14-CURRENT and 13-STABLE with identical results.
Comment 29 Zhenlei Huang freebsd_committer freebsd_triage 2023-03-22 09:44:08 UTC
(In reply to kvs from comment #28)
I think you should open a separate PR, as your setup differs from that of the original PR by John Westbrook. He has SR-IOV configured.

I managed to reproduce this with cxl / lagg / bridge / epair (vnet) on 13.2-RC3. Also tried re / ue.

> tcpdump -i cc0:
> 10:00:17.981050 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 42

> tcpdump -i cc1:
> 10:00:17.981041 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
> 10:00:17.981282 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

You might want to tcpdump on cc0 with `--direction=in` to isolate ARP requests that were sent out from cc1 and then came back in on cc0 (forwarded by the switch).

IF_BRIDGE(4) seems to hide some things to protect itself from getting confused.

If you can confirm this, then please configure your switch properly. The two switch ports connected to cc0 and cc1 should be in the same link aggregation group.

I'll see if I can teach IF_BRIDGE(4) to emit warnings when it receives an ARP request packet that it sent itself.
Comment 30 Zhenlei Huang freebsd_committer freebsd_triage 2023-03-31 09:49:39 UTC
(In reply to Kristof Provost from comment #13)
Letting bridge(4) ignore all packets with a vlan tag might be too aggressive, since every tagged packet would be ignored.
I'd propose to make bridge(4) decide by configuration, similar to how hardware switches work.

Some syntax like this:
```
# ifconfig bridge0 vlan 10,20,100-200
# ifconfig bridge0 addm em0 link-type trunk
# ifconfig bridge0 addm em1 link-type hybrid
# ifconfig bridge0 addm em2 link-type access
# ifconfig bridge0 addm em0 trunk vlan 10,100-110
# ifconfig bridge0 addm em1 hybrid vlan all
# ifconfig bridge0 addm em2 access vlan 20
```

Then bridge(4) decides whether to accept tagged / untagged packets by checking the configuration of the member port.

For example, with the syntax above, bridge0 is interested in vlans 10,20,100-200, so any packet received on em1 without a vlan tag in 10,20,100-200 will be ignored by the bridge and returned for local processing.
As for em2, tagged packets are ignored, and untagged packets will have vlan tag 20 added and then be processed normally (by bridge0).
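
To make the proposed semantics concrete, here is a rough Python sketch of the per-port decision. It is only a model of the proposal, not bridge(4) code: the `classify` function and the `PORTS` table are hypothetical. It also assumes that a trunk port accepts only tags listed for both the port and the bridge, and that "ignored" on an access port means the packet is handed back for local processing; the proposal does not spell out either point.

```python
def parse_vlans(spec):
    """Expand a spec such as '10,20,100-200' into a set of VLAN IDs."""
    vids = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            vids.update(range(int(lo), int(hi) + 1))
        else:
            vids.add(int(part))
    return vids


# "ifconfig bridge0 vlan 10,20,100-200" from the proposed syntax.
BRIDGE_VLANS = parse_vlans("10,20,100-200")

# Member ports from the proposed syntax; "vlan all" is modelled as
# "every VLAN the bridge is interested in".
PORTS = {
    "em0": {"type": "trunk", "vlans": parse_vlans("10,100-110")},
    "em1": {"type": "hybrid", "vlans": set(BRIDGE_VLANS)},
    "em2": {"type": "access", "vlans": {20}},
}


def classify(port, tag):
    """Decide what happens to a packet received on a member port.

    tag is the VLAN ID of the packet, or None if it is untagged.
    Returns ("bridge", vid) if the bridge processes it in VLAN vid,
    or ("local", None) if it is left for local processing.
    """
    cfg = PORTS[port]
    if cfg["type"] == "access":
        if tag is not None:
            return ("local", None)  # tagged packets are ignored on access ports
        (vid,) = cfg["vlans"]  # an access port carries exactly one VLAN
        return ("bridge", vid)  # untagged packets get the access VLAN
    # trunk / hybrid: accept only tags configured on the port and the bridge
    if tag is not None and tag in cfg["vlans"] and tag in BRIDGE_VLANS:
        return ("bridge", tag)
    return ("local", None)


print(classify("em1", 150))
print(classify("em1", 300))
print(classify("em2", None))
```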
Comment 31 Zhenlei Huang freebsd_committer freebsd_triage 2023-03-31 09:58:21 UTC
As a workaround, if a setup like the one in comment #11 is mandatory:

em0 -- vm-sw1 -- epair0b -- epair0a(connected to host)

epair0a.2 -- vm-sw2 -- jails vlan 2
epair0a.4 -- vm-sw4 -- jails vlan 4
epair0a.6 -- vm-sw6 -- jails vlan 6
epair0a.8 -- vm-sw8 -- jails vlan 8

Let em0, vm-sw1 and epair0b be pure layer 2 interfaces. Set IP/IPv6 addresses on epair0a as required.