I'm experiencing an intermittent connectivity issue on FreeBSD 12.0 with jails using VNET, which appears to be related to lost ARP replies. Several forum threads appear related:

https://forums.freebsd.org/threads/vnet-arp-replies-are-lost.71082
https://www.ixsystems.com/community/threads/arp-replies-loss-in-vnet.77027
https://www.ixsystems.com/community/threads/jails-eero.59477

One insightful comment from the first thread:

"""On step #2 the reply is mistakenly padded with 14 bytes which is exactly the number of bytes beyond the 18 bytes in the request (the request was padded with 32 bytes). I bet this is part of the bug. By looking at FreeBSD ARP reply code it actually creates the reply by editing the request bytes in place. For some reason it removes only 18 bytes from the request padding. However, this happens only on VNET interface as noted above."""

I can see the ARP traffic using tcpdump, but arp -a doesn't show updated ARP entries. Also, in an affected jail, I can't add static ARP entries:

# arp -s 10.0.0.1 XX:XX:XX:XX:XX:XX
arp: writing to routing socket: Cannot allocate memory

whereas in an unaffected jail the arp command succeeds. Jails should have access to routing sockets by default, so perhaps the problem is related to accessing routing sockets in VNET jails?

The test setup where I'm observing this uses an SR-IOV VF (Chelsio cxlv0) passed into the jail (via vnet.interface in jail.conf). It consists of two directly attached hosts, each running two jails. I observe the problem on both hosts, but it comes and goes across reboots.
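For anyone checking the numbers in the quoted comment: an IPv4-over-Ethernet ARP payload is 28 bytes and the minimum Ethernet frame is 60 bytes excluding the FCS, so a minimum-size ARP frame needs 60 - 14 - 28 = 18 bytes of padding. The sketch below encodes that arithmetic plus one possible reading of the suspected bug (the reply is built in place and only those 18 bytes of request padding are trimmed). The macro and function names are hypothetical; this models the quoted hypothesis, it is not the kernel's ARP code.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes for IPv4-over-Ethernet ARP, excluding the 4-byte FCS. */
#define ETHER_HDR_LEN_MODEL   14  /* dst MAC + src MAC + ethertype */
#define ARP_PAYLOAD_LEN_MODEL 28  /* fixed header + 2 MACs + 2 IPv4 addresses */
#define ETHER_MIN_FRAME_MODEL 60  /* 64-byte minimum frame minus FCS */

/* Padding needed to bring an untagged ARP frame up to the minimum size. */
static size_t arp_min_padding(void)
{
    return ETHER_MIN_FRAME_MODEL - ETHER_HDR_LEN_MODEL - ARP_PAYLOAD_LEN_MODEL;
}

/*
 * Hypothetical model of the suspected bug: the reply reuses the request
 * buffer, but only the minimum padding is trimmed, so any extra request
 * padding is carried over into the reply.
 */
static size_t leaked_reply_padding(size_t request_padding)
{
    size_t min_pad = arp_min_padding();

    return request_padding > min_pad ? request_padding - min_pad : 0;
}
```

With the numbers from the quote, a request padded with 32 bytes leaves 32 - 18 = 14 bytes of padding in the reply under this model.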
Can you describe the steps required to reproduce the problem on the 12.0/13.0 system?
I have SR-IOV configured as described in this thread: https://forums.freebsd.org/threads/sr-iov-chelsio-error-in-guest.70653 such that cxlv[0-3] are shown in ifconfig. The jail.conf is:

vnet;
vnet.interface = "vnet0";
exec.prestart = "ifconfig ${vnet0} name vnet0";
exec.poststop = "ifconfig vnet0 name ${vnet0}";
exec.start += "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.consolelog = "/var/log/${name}.log";
host.hostname = "${name}";
path = "/jail/${name}";

j1 {
    $vnet0 = "cxlv1";
}

j2 {
    $vnet0 = "cxlv2";
}

The two hosts are directly connected via cxl0. The problem is visible when pinging (1) between jails on the same host and (2) from an affected jail on host 1 to host 2. On an unaffected host both of these operations succeed. Running tcpdump on the physical (cxl) and virtual (cxlv) interfaces shows the ARP requests and responses, but in an affected jail the ARP tables aren't updated.
I think the bug I wanted to report is similar: all the main actors (VNET, jails and ARP) are the same. I have a problem with network connectivity between jails and the host when using jails with VNET and VLANs. I've written about it to the freebsd-net@ mailing list in these threads:

https://lists.freebsd.org/pipermail/freebsd-net/2019-September/054391.html
https://lists.freebsd.org/pipermail/freebsd-net/2019-October/054437.html

There's a topic on the FreeBSD forums which confirms this and explains, in great detail, the configuration in which this problem occurs, but the author "solved" his problem by simply not bridging the physical interface with the jail's VNET interface and not using VLANs on the jail's VNET interface:

https://forums.freebsd.org/threads/bridge-epair-not-passing-through-tagged-vlan-traffic-between-host-and-vnet-jail.71646/

I'll add some more observations here. I recreated the configuration in a virtual machine, as I wrote in my last message to freebsd-net@: https://lists.freebsd.org/pipermail/freebsd-net/2019-October/054475.html. The jail's vlan interface IP is 10.15.15.2 and the host's vlan interface IP is 10.15.15.1. Neither the jail nor the host has an ARP entry for the other's address. I ping from 10.15.15.2 to 10.15.15.1.

1. In the initial configuration, I see this on em0:

HOST# tcpdump -i em0 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:57:52.051429 02:95:ce:33:dc:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.1 tell 10.15.15.2, length 28
08:57:53.071451 02:95:ce:33:dc:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.1 tell 10.15.15.2, length 28
08:57:54.101515 02:95:ce:33:dc:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.1 tell 10.15.15.2, length 28

2. Then I added an ARP entry in the jail:

JAIL# arp -s 10.15.15.1 00:0c:29:2f:6c:08

HOST# tcpdump -i em0 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:07:10.321257 00:0c:29:2f:6c:08 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.2 tell 10.15.15.1, length 28
09:07:11.391300 00:0c:29:2f:6c:08 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.2 tell 10.15.15.1, length 28
09:07:12.415232 00:0c:29:2f:6c:08 (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 22, p 0, ethertype ARP, Request who-has 10.15.15.2 tell 10.15.15.1, length 28

3. Then I added the jail's ARP entry on the host:

HOST# arp -s 10.15.15.2 02:95:ce:33:dc:0b

ICMP requests started to pass from the jail to the host, and the vlan22 interface on the host is receiving packets and sending replies:

HOST# tcpdump -i vlan22 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan22, link-type EN10MB (Ethernet), capture size 262144 bytes
09:37:11.517054 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype IPv4 (0x0800), length 98: 10.15.15.2 > 10.15.15.1: ICMP echo request, id 25864, seq 0, length 64
09:37:11.517063 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype IPv4 (0x0800), length 98: 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 25864, seq 0, length 64

but I don't see the replies on the host's epair0a interface (bridged with em0 in bridge0); there are only requests on epair0a:

HOST# tcpdump -i epair0a -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on epair0a, link-type EN10MB (Ethernet), capture size 262144 bytes
09:40:44.178363 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.2 > 10.15.15.1: ICMP echo request, id 32264, seq 0, length 64
09:40:45.221713 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.2 > 10.15.15.1: ICMP echo request, id 32264, seq 1, length 64
09:40:46.253079 02:95:ce:33:dc:0b (oui Unknown) > 00:0c:29:2f:6c:08 (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.2 > 10.15.15.1: ICMP echo request, id 32264, seq 2, length 64

and on em0 I see only replies:

HOST# tcpdump -i em0 -e | grep 10.15.15
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:11.092092 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 0, length 64
09:41:12.096310 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 1, length 64
09:41:13.121890 00:0c:29:2f:6c:08 (oui Unknown) > 02:95:ce:33:dc:0b (oui Unknown), ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4, 10.15.15.1 > 10.15.15.2: ICMP echo reply, id 34568, seq 2, length 64

On the bridge interface, neither requests nor replies are shown:

HOST# tcpdump -i bridge0 -e | grep 10.15.15
... silence ...

Is this normal, and am I doing something wrong? I wanted to make jails act like a normal FreeBSD host with one dedicated VNET interface with VLANs.
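The frame lengths in these captures are consistent with the 4-byte 802.1Q tag being present on em0 and epair0a but already removed on vlan22: the tagged ARP request is 14 + 4 + 28 = 46 bytes, and the ICMP echo frames are 14 + 84 = 98 bytes untagged versus 102 bytes tagged. A minimal sketch of that arithmetic, assuming plain IPv4 headers without options and ping's default 56 data bytes (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ETHER_HDR_BYTES 14  /* dst + src + ethertype */
#define VLAN_TAG_BYTES   4  /* 802.1Q TPID (0x8100) + TCI */
#define ARP_IPV4_BYTES  28  /* ARP payload for IPv4 over Ethernet */
#define IPV4_HDR_BYTES  20  /* IPv4 header without options */
#define ICMP_HDR_BYTES   8  /* ICMP echo header */

/* Frame length (no FCS, no minimum-size padding) for `l3_len` bytes of payload. */
static size_t frame_len(size_t l3_len, bool tagged)
{
    return ETHER_HDR_BYTES + (tagged ? VLAN_TAG_BYTES : 0) + l3_len;
}

/* IPv4 total length of an ICMP echo carrying `data` bytes. */
static size_t icmp_echo_ip_len(size_t data)
{
    return IPV4_HDR_BYTES + ICMP_HDR_BYTES + data;
}
```

For example, frame_len(icmp_echo_ip_len(56), true) gives 102, matching the lengths seen on epair0a and em0.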
When I followed the reproduction steps described in the linked threads with a debug kernel, I hit the following assertion:

panic: m_dup: bogus m_pkthdr.len
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe2eca39e700
vpanic() at vpanic+0x177/frame 0xfffffe2eca39e760
doadump() at doadump/frame 0xfffffe2eca39e7e0
m_dup() at m_dup+0x376/frame 0xfffffe2eca39e860
bridge_broadcast() at bridge_broadcast+0x1bf/frame 0xfffffe2eca39e8c0
bridge_forward() at bridge_forward+0x222/frame 0xfffffe2eca39e920
bridge_input() at bridge_input+0x3d5/frame 0xfffffe2eca39e990
ether_nh_input() at ether_nh_input+0x2a6/frame 0xfffffe2eca39e9e0
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe2eca39ea40
ether_input() at ether_input+0x8f/frame 0xfffffe2eca39ea80
epair_nh_sintr() at epair_nh_sintr+0x1a/frame 0xfffffe2eca39eaa0
swi_net() at swi_net+0x1b9/frame 0xfffffe2eca39eb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x99/frame 0xfffffe2eca39eb60
ithread_loop() at ithread_loop+0xb7/frame 0xfffffe2eca39ebb0
fork_exit() at fork_exit+0x84/frame 0xfffffe2eca39ebf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe2eca39ebf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

The offending KASSERT is still there:

	/* Check correct total mbuf length */
	KASSERT((remain > 0 && m != NULL) || (remain == 0 && m == NULL),
	    ("%s: bogus m_pkthdr.len", __func__));
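The assertion guards a simple invariant: the m_len values along an mbuf chain must add up to m_pkthdr.len recorded in the first mbuf. Below is a minimal user-space model of that condition. struct mbuf_model and its fields are simplified, hypothetical stand-ins for the kernel's struct mbuf. The real check runs incrementally inside m_dup() while copying, so it is stricter than this model; for example, it can also trip on trailing zero-length mbufs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an mbuf: only the fields the check looks at. */
struct mbuf_model {
    struct mbuf_model *next; /* like m_next */
    int len;                 /* like m_len: bytes held in this buffer */
};

/*
 * True when the per-buffer lengths add up to the packet length recorded
 * in the header (like m_pkthdr.len). A mismatch is the situation m_dup()
 * reports as "bogus m_pkthdr.len".
 */
static bool chain_len_consistent(const struct mbuf_model *m, int pkthdr_len)
{
    int total = 0;

    for (; m != NULL; m = m->next)
        total += m->len;
    return total == pkthdr_len;
}
```

As an illustrative scenario, not a diagnosis of this panic: code that shortened one buffer's m_len (for example, when trimming padding) without adjusting m_pkthdr.len would make this check fail.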
(In reply to Ryan Moeller from comment #4) I haven't been able to reproduce this. Did you do so on -CURRENT?
(In reply to Mark Johnston from comment #5) This stack was from approximately stable/11 a few months ago. I just tried on -CURRENT and the ARP reply does make it back and there is no panic (tested in a jail with epair on a bridge). I will check stable/12 and stable/11 again, and 12.1-Rel to be sure.
I'm trying to figure out an issue which seems similar to, or the same as, this one. DHCP response traffic is going to the untagged interface and its bridge, rather than to the VLAN interface and then to the associated bridge. I have a forum post up at https://forums.freebsd.org/threads/vlans-with-bhyve-guests-not-getting-dhcp.77647/ and otherwise can't see any other way forward for debugging. I've been poking at some of the net.link.bridge sysctl tunables, but nothing seems to change.
Try this one:

ifconfig lo0 -rxcsum -txcsum

Afterward you shouldn't see any "cksum incorrect" messages. For some reason, even if the VLAN/bridge is attached to a real device, the loopback interferes with it.
@Sebastian Not sure what that setting has to do with anything. I'm not seeing any checksum errors. The traffic for the VLAN is coming in on the main, untagged interface rather than going to the VLAN interface and then to the VLAN bridge.
I had this exact problem with the same setup. My jails are on a VM on an ESXi 6.5 host, with the port group in promiscuous mode. I lost some hair over this issue. I tried FreeBSD 12.1 and 12.2, disabled TSO and checksum offloading, and dug through the kernel sources, all in vain. The problem disappeared the minute I enabled this fling on ESXi: https://flings.vmware.com/learnswitch

I hope this helps.
I believe I ran into the same issue today on 13.1-BETA3.

Setup: I use a NUC as a virtualisation host with a single NIC, em0. It has vPro (a poor man's service processor), which shares the NIC with the OS and communicates on the native VLAN (VLAN1). Because of this, I put the OS on a tagged VLAN. I set up several tagged VLANs: 2, 4, 6, 8. The host OS uses em0.2 on VLAN2. I set up a bridge for each VLAN interface, as well as one for the physical interface:

em0   -> vm-sw1
em0.2 -> vm-sw2
em0.4 -> vm-sw4
em0.6 -> vm-sw6
em0.8 -> vm-sw8

Then I created a jail with Bastille, assigning it to VLAN2/vm-sw2 using VNET, with an IP from the subnet also used on the host. I could ping the host from the jail and vice versa, but I could not reach the outside world from the jail, nor could I ping the jail from the router in the same subnet. After 'ifconfig vm-sw1 destroy' it suddenly started working, and the jail now has full IPv4/IPv6 connectivity.
I believe I ran into similar problems with main last year, and I am still seeing occasional "blackouts". I believe back then I could trigger traffic by sending packets from one specific part of jail / host / remote to another, and that would hold until the entry expired, but I have no more notes on this. I am also adding kp@ given bridge ...
(In reply to Bjoern A. Zeeb from comment #12) Note that the issue described in #10 is a configuration problem more than a bug. In this configuration the bridge will grab all packets, including those with a vlan tag, and nothing will be passed to the vlan interfaces. That's expected. After all, the system has been configured to bridge all packets arriving on em0 to the members of vm-sw1, and that includes those with ETHERTYPE_VLAN.

This patch should make it do what the user wants, but I'm not convinced that's actually appropriate:

diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 12c807fe2009..98c79764bc69 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -2467,6 +2467,11 @@ bridge_input(struct ifnet *ifp, struct mbuf *m)
 
 	eh = mtod(m, struct ether_header *);
 
+	if (ntohs(eh->ether_type) == ETHERTYPE_VLAN ||
+	    ntohs(eh->ether_type) == ETHERTYPE_QINQ) {
+		return (m);
+	}
+
 	bridge_span(sc, m);
 
 	if (m->m_flags & (M_BCAST|M_MCAST)) {
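The patch keys off the outer ethertype of the frame. Here is a user-space sketch of the same test on a raw frame buffer, assuming the standard values 0x8100 (802.1Q) and 0x88a8 (802.1ad/QinQ), which are the values of ETHERTYPE_VLAN and ETHERTYPE_QINQ in <net/ethernet.h>. The function names are hypothetical. The sketch only inspects raw bytes and does not model frames whose tag was already stripped by the NIC (VLAN_HWTAGGING).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_VLAN_MODEL 0x8100 /* 802.1Q tag */
#define ETHERTYPE_QINQ_MODEL 0x88a8 /* 802.1ad outer tag */

/* Ethertype at bytes 12-13 (network byte order), or 0 if the frame is too short. */
static uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
    if (len < 14)
        return 0;
    return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Same condition as the patch: true means "hand it back, don't bridge it". */
static bool is_vlan_tagged(const uint8_t *frame, size_t len)
{
    uint16_t type = frame_ethertype(frame, len);

    return type == ETHERTYPE_VLAN_MODEL || type == ETHERTYPE_QINQ_MODEL;
}
```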
(In reply to Kristof Provost from comment #13) Thanks Kristof, you are right, I didn't see the forest for the trees. It's not a bug, but a feature.
You probably want to create the vlan interfaces on the physical interface and add them as member interfaces to the bridges and all IP interfaces belong on the bridge interface. Don't put IP addresses on bridge member interfaces.
Thanks, I'll try - I've been doing this on Linux for many years. Ironically, some years ago I read the opposite in some FreeBSD-related doc: assigning the addresses to the parent interface was the recommended way.
I'm seeing the same (or a similar) issue on 12.3-RELEASE-p5 when trying to bridge a vlan interface into a jail:

# egrep 'ifconfig|cloned' rc.conf
ifconfig_ixl0="up"
ifconfig_ixl2="up"
cloned_interfaces="lagg0 vlan1601 bridge0"
ifconfig_lagg0="laggproto lacp laggport ixl0 laggport ixl2 130.236.8.40 netmask 255.255.255.224 lacp_fast_timeout"
ifconfig_lagg0_ipv6="inet6 2001:6b0:17:2400::8:40/64 lacp_fast_timeout"
ifconfig_vlan1601="vlandev lagg0 vlan 1601 up"
ifconfig_bridge0="addm vlan1601 up"

# ifconfig bridge0
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether 02:90:7b:7b:f5:00
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 14 priority 128 path cost 2000
	member: lagg0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 10 priority 128 path cost 1000
	member: vlan1601 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 11 priority 128 path cost 2000000
	groups: bridge
	nd6 options=9<PERFORMNUD,IFDISABLED>

root@filur00:/etc # iocage console test
Last login: Thu Apr 21 14:27:18 on pts/0
FreeBSD 12.3-RELEASE-p5 GENERIC
...
root@test:~ # ping 130.236.8.65
PING 130.236.8.65 (130.236.8.65): 56 data bytes
^C
--- 130.236.8.65 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss

If I now manually remove the "lagg0" member from the bridge0 interface, things start to work fine. It would be nice if it didn't add it automatically :-)

root@filur00:/etc # ifconfig bridge0 deletem lagg0
root@filur00:/etc # iocage console test
Last login: Thu Apr 21 14:38:34 on pts/0
FreeBSD 12.3-RELEASE-p5 GENERIC
....
root@test:~ # ping 130.236.8.65
PING 130.236.8.65 (130.236.8.65): 56 data bytes
64 bytes from 130.236.8.65: icmp_seq=0 ttl=255 time=0.249 ms
^C
--- 130.236.8.65 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.249/0.249/0.249/0.000 ms
> root@filur00:/etc # ifconfig bridge0 deletem lagg0

Easy solution... Remove this from rc.conf:

ifconfig_bridge0="addm vlan1601 up"

and then tell iocage not to add lagg0 automatically to the jail's bridge:

# iocage set vnet_default_interface=vlan1601 test
Hello. We also have a similar issue, as described, on FreeBSD 12.3-RELEASE-p2 (XigmaNAS, stuck at -p2 for the moment). The boxes in question have two NICs: one is meant for management access (em0) and the other is meant to be bound to the offered services. Additionally, the second NIC (igb0) is reachable via an IP AND serves as the physical NIC that is a member of a bridge for VNET jails, which have epair interfaces (created in XigmaNAS via the FreeBSD in-tree tool "jib").

Binding provided services such as Samba and NFS to the second NIC (igb0) works as expected; ping and ssh are also no problem. The base host's IPs (both NICs) and those of the jails are within the same network.

When it comes to the VNET jails on the bridge of which the igb0 NIC is a member, the trouble begins. We use several jails on those boxes. Pinging those jails from outside the campus network works only sporadically for some IPs, and it takes a long time until the jail starts responding. The behaviour is the same within the LAN.

We have also already disabled pfil on the bridges, as suggested:

device if_bridge
net.link.bridge.ipfw: 0
net.link.bridge.allow_llz_overlap: 0
net.link.bridge.inherit_mac: 0
net.link.bridge.log_stp: 0
net.link.bridge.pfil_local_phys: 0
net.link.bridge.pfil_member: 0
net.link.bridge.ipfw_arp: 0
net.link.bridge.pfil_bridge: 0
net.link.bridge.pfil_onlyip: 0

A curiosity is that if one can ping one or two out of the five jails on the host, on another attempt one, at most two, different hosts will answer the ping, and the previously working hosts no longer do. It is like gambling.

We also run another host with the very same XigmaNAS version; in that case, the second NIC is configured to be part of another network and attached to another switch, and there is no problem there!
In the problematic cases described above, we do not have direct access to the department's backend switches, so I can't tell whether I'm the culprit (misconfiguration, misunderstanding of the network technology, et cetera). I hope the problem can be solved anyway within FreeBSD 12.3.
(In reply to Kristof Provost from comment #13) Perhaps we could create a special vlan "sub-interface" that sees only untagged traffic on input and does not add any tag on output (just like the parent interface). We could use some reserved VLAN ID to mark the special interface. E.g., the currently prohibited VLAN ID 0.
In the meantime you can try a "workaround": create an ng_bridge interface on your parent interface and then use that newly created interface as a member of your management bridge. Assuming you have em0 as your parent interface:

em0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether 58:9c:fc:10:f1:16
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

Running

/bin/sh /usr/share/examples/jails/jng bridge main em0

will create the ng0_main interface:

ng0_main: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=28<VLAN_MTU,JUMBO_MTU>
	ether 02:60:c8:08:84:9b
	hwaddr 58:9c:fc:10:ff:ff
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

Now you can add the ng0_main interface instead of em0. It works like a charm.
(In reply to Andriy Gapon from comment #20) I'm not sure how that'd fix this issue, but I believe that's something we already support. It's possible to set Priority code point (PCP) on a regular interface, which should insert a vlan header with VID 0.
(In reply to Kristof Provost from comment #22) Just to be sure that we are talking about the same thing (and I feel like we are not), I am not suggesting any modification to what goes on the wire. Just a new virtual interface that captures only untagged packets. To be more clear:
- igb0: receives all arriving packets, sends packets without inserting any VLAN tag
- igb0.1: receives arriving packets with VLAN tag 1, adds VLAN tag 1 when sending
- igb0.0: [proposed] receives only packets without any VLAN tag, sends packets without inserting any VLAN tag
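The receive side of the proposal can be written down as a small predicate. This is only a sketch of the semantics listed above, not an existing FreeBSD interface: struct rx_if_model, its fields and rx_if_accepts() are hypothetical, and the transmit side (never adding a tag on igb0.0) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a receiving interface, per the proposal. */
struct rx_if_model {
    const char *name;
    bool is_parent;     /* igb0: sees every arriving frame */
    bool untagged_only; /* proposed igb0.0: only frames without a tag */
    uint16_t vid;       /* igb0.N: frames tagged with VLAN N */
};

/* Would `ifm` receive a frame that is `tagged` with VLAN ID `vid`? */
static bool rx_if_accepts(const struct rx_if_model *ifm, bool tagged, uint16_t vid)
{
    if (ifm->is_parent)
        return true;
    if (ifm->untagged_only)
        return !tagged;
    return tagged && vid == ifm->vid;
}
```

Using VLAN ID 0 to mark the untagged-only interface follows the suggestion in comment #20; here it is represented by a separate flag so that it cannot be confused with a real tag value.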
(In reply to Andriy Gapon from comment #23) Ah, I see. I did indeed misunderstand. However, I don't think that'd fix the issue of VLAN on if_bridge interfaces. The problem is that the bridge checks if it needs to grab the packet before vlan_input() gets its turn.
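To make the ordering concrete, here is a deliberately simplified, hypothetical model of the receive decision described above. If the parent interface is a bridge member, the bridge sees the frame first, and only frames it does not claim reach VLAN demultiplexing. The enum, the function and the bridge_passes_tagged flag are illustrative only; the flag stands in for a change like the patch in comment #13.

```c
#include <assert.h>
#include <stdbool.h>

enum rx_dest { RX_BRIDGE, RX_VLAN_IF, RX_LOCAL };

/*
 * Simplified model: the bridge hook runs before VLAN demultiplexing, so a
 * bridged parent swallows tagged frames unless the bridge hands them back.
 */
static enum rx_dest classify_rx(bool parent_in_bridge, bool tagged,
                                bool bridge_passes_tagged)
{
    if (parent_in_bridge && !(tagged && bridge_passes_tagged))
        return RX_BRIDGE;  /* bridge claims the frame, tag and all */
    if (tagged)
        return RX_VLAN_IF; /* e.g. em0.2 or vlan22 gets its turn */
    return RX_LOCAL;       /* untagged, unbridged: parent's own stack */
}
```

In the comment #10 setup, em0 is a member of vm-sw1, so in this model classify_rx(true, true, false) returns RX_BRIDGE and em0.2 never sees the tagged frames.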
(In reply to Kristof Provost from comment #24) I think I will need to look at the code. I thought that a bridge would see packets only from a bridged virtual/vlan interface (such as the proposed igb0.0), but it looks like the actual Ethernet input processing has a different flow.
Hello everyone!

I believe I have hit the same bug, though I believe my issue is specifically related to lagg/LACP. I can confirm this problem affects tap as well as epair interfaces on a bridge when attempting to send over a vlan interface that has a lagg parent.

System description: FreeBSD 13.1 with a Chelsio T6225-SO-CR NIC, identified as cc0 / cc1 (confirmed up and operational); host25 is the system name. The network is 10.20.20.0/24, the gateway is 10.20.20.254 (MAC 02:11:22:33:44:55), the host is assigned 10.20.20.5, and epair0 is assigned to jail-10-20-20-6 (with the matching IP 10.20.20.6 on epair0b). The switch is set to accept only tagged frames for vlan 2020. All MTUs are 1500.

When adding a vlan interface that is a child of cc0 to the bridge, I do not have any trouble passing data.

host25# ifconfig cc0.2020 create up
host25# ifconfig bridge2020 create up
host25# ifconfig bridge2020 addm cc0.2020
host25# ifconfig bridge2020 addm epair0a
host25# ifconfig bridge2020 inet 10.20.20.25/24

(pings from host -> gateway work fine)
host25# ping 10.20.20.254
success!

(pings from jail -> gateway also work)
host25# jexec jail-10-20-20-6 sh
jail-10-20-20-6# ping 10.20.20.254
success!

(I now reset bridge2020 to use a lagg interface.)
host25# ifconfig bridge2020 destroy
host25# ifconfig cc0.2020 destroy
host25# ifconfig lagg0 create laggproto lacp laggport cc0 laggport cc1 up
host25# ifconfig lagg0.2020 create up
host25# ifconfig bridge2020 create up
host25# ifconfig bridge2020 addm lagg0.2020 addm epair0a
host25# ifconfig bridge2020 inet 10.20.20.25/24

(pings from host -> gateway work fine)
host25# ping 10.20.20.254
success!

(pings from jail -> gateway time out)
host25# jexec jail-10-20-20-6 sh
jail-10-20-20-6# ping 10.20.20.254
ping: sendto: Host is down

(the jail's ARP cache does not appear to include the gateway's MAC)
jail-10-20-20-6# arp -an
? (10.20.20.6) at 02:07:f0:80:de:0b on epair0b permanent [ethernet]
? (10.20.20.254) at (incomplete) on epair0b expired [ethernet]

(I assign the MAC statically.)
jail-10-20-20-6# arp -s 10.20.20.254 02:11:22:33:44:55
jail-10-20-20-6# arp -an
? (10.20.20.6) at 02:07:f0:80:de:0b on epair0b permanent [ethernet]
? (10.20.20.254) at 02:11:22:33:44:55 on epair0b permanent [ethernet]

(attempt ping again after static ARP assignment)
jail-10-20-20-6# ping 10.20.20.254
success!

What comes next is a reasonably big presumption on my part, so hopefully someone more educated on the topic will kindly correct me where I'm wrong. Since the vlan interface cc0.2020 works in the bridge when lagg0.2020 is removed/destroyed, I believe the issue may be related to ARP responses being sent down one of the two lagg members and the host OS not being aware of that. Although the reply does come inbound on one of the host OS interfaces, it doesn't propagate down across the epair / tap. The VM/jail then never sees the ARP reply and keeps the entry as "(incomplete)" in its cache. When using a single interface, or a lagg with only a single interface active, ARP appears to work as expected.

To help observe this, I did the following:

1) From host25, I watched epair0a, cc0, and cc1 using:
   host25# tcpdump -e -vvv -XX -i [interface]
2) Inside jail-10-20-20-6, I attempted to ping the gateway to generate the ARP traffic:
   ping -c 1 -t 1 -q 10.20.20.254
   PING 10.20.20.254 (10.20.20.254): 56 data bytes
   --- 10.20.20.254 ping statistics ---
   1 packets transmitted, 0 packets received, 100.0% packet loss
3) Results follow:

# tcpdump -e -vvv -XX -i epair0a
tcpdump: listening on epair0a, link-type EN10MB (Ethernet), capture size 262144 bytes
01:43:54.768801 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 28
	0x0000: ffff ffff ffff 0207 f080 de0b 0806 0001 ................
	0x0010: 0800 0604 0001 0207 f080 de0b 0a14 1406 ................
	0x0020: 0000 0000 0000 0a14 14fe ..........
01:43:54.768936 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 42
	0x0000: ffff ffff ffff 0207 f080 de0b 0806 0001 ................
	0x0010: 0800 0604 0001 0207 f080 de0b 0a14 1406 ................
	0x0020: 0000 0000 0000 0a14 14fe 0000 0000 0000 ................
	0x0030: 0000 0000 0000 0000 ........
01:43:54.768969 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 46
	0x0000: ffff ffff ffff 0207 f080 de0b 0806 0001 ................
	0x0010: 0800 0604 0001 0207 f080 de0b 0a14 1406 ................
	0x0020: 0000 0000 0000 0a14 14fe 0000 0000 0000 ................
	0x0030: 0000 0000 0000 0000 0000 0000 ............

# tcpdump -e -vvv -XX -i cc0
tcpdump: listening on cc0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:43:54.768822 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 46: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 28
	0x0000: ffff ffff ffff 0207 f080 de0b 8100 07e4 ................
	0x0010: 0806 0001 0800 0604 0001 0207 f080 de0b ................
	0x0020: 0a14 1406 0000 0000 0000 0a14 14fe ..............
01:43:54.769126 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000: 0207 f080 de0b 0211 2233 4455 8100 07e4 ........"3DU....
	0x0010: 0806 0001 0800 0604 0002 0211 2233 4455 ............"3DU
	0x0020: 0a14 14fe 0207 f080 de0b 0a14 1406 0000 ................
	0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01:43:54.769171 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000: 0207 f080 de0b 0211 2233 4455 8100 07e4 ........"3DU....
	0x0010: 0806 0001 0800 0604 0002 0211 2233 4455 ............"3DU
	0x0020: 0a14 14fe 0207 f080 de0b 0a14 1406 0000 ................
	0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01:43:54.769221 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000: 0207 f080 de0b 0211 2233 4455 8100 07e4 ........"3DU....
	0x0010: 0806 0001 0800 0604 0002 0211 2233 4455 ............"3DU
	0x0020: 0a14 14fe 0207 f080 de0b 0a14 1406 0000 ................
	0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................

# tcpdump -e -vvv -XX -i cc1
tcpdump: listening on cc1, link-type EN10MB (Ethernet), capture size 262144 bytes
01:43:54.768876 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 60: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 42
	0x0000: ffff ffff ffff 0207 f080 de0b 8100 07e4 ................
	0x0010: 0806 0001 0800 0604 0001 0207 f080 de0b ................
	0x0020: 0a14 1406 0000 0000 0000 0a14 14fe 0000 ................
	0x0030: 0000 0000 0000 0000 0000 0000 ............
01:43:54.768965 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype 802.1Q (0x8100), length 64: vlan 2020, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 46
	0x0000: ffff ffff ffff 0207 f080 de0b 8100 07e4 ................
	0x0010: 0806 0001 0800 0604 0001 0207 f080 de0b ................
	0x0020: 0a14 1406 0000 0000 0000 0a14 14fe 0000 ................
	0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................

Apparently one ARP request is sent over cc0 and two over cc1, while all three replies come back over cc0. None of them appear to enter epair0a.

I've not had any luck changing lagg hashes at this stage to try to force requests down one of the two lagg members, so instead I downed one of the interfaces in the lagg.

(bridge2020 is still up with epair0a and lagg0.2020 (lagg0 contains cc0+cc1, both up))
jail-10-20-20-6# ping 10.20.20.254
ping: sendto: Host is down
host25# ifconfig cc1 down

(confirm the ARP cache is empty in the jail)
jail-10-20-20-6# arp -da
jail-10-20-20-6# ping 10.20.20.254
success!

(using tcpdump, epair0a now sees the ARP replies as well (I excluded the tcpdump for cc0 here because it's largely identical))
# tcpdump -e -vvv -XX -i epair0a
15:23:10.623560 02:07:f0:80:de:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 10.20.20.254 tell 10.20.20.6, length 28
	0x0000: 0001 0800 0604 0001 0207 f080 de0b 0a14 ................
	0x0010: 1406 0000 0000 0000 0a14 14fe ............
15:23:10.623916 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000: 0001 0800 0604 0002 0211 2233 4455 0a14 .........."3DU..
	0x0010: 14fe 0207 f080 de0b 0a14 1406 0000 0000 ................
	0x0020: 0000 0000 0000 0000 0000 0000 0000 ..............
15:23:10.623924 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46
	0x0000: 0001 0800 0604 0002 0211 2233 4455 0a14 .........."3DU..
	0x0010: 14fe 0207 f080 de0b 0a14 1406 0000 0000 ................
	0x0020: 0000 0000 0000 0000 0000 0000 0000 ..............
15:23:10.623926 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46 0x0000: 0001 0800 0604 0002 0211 2233 4455 0a14 .........."3DU.. 0x0010: 14fe 0207 f080 de0b 0a14 1406 0000 0000 ................ 0x0020: 0000 0000 0000 0000 0000 0000 0000 .............. 15:23:10.623943 02:07:f0:80:de:0b (oui Unknown) > 02:11:22:33:44:55 (oui Unknown), ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 56841, offset 0, flags [none], proto ICMP (1), length 84) 10.20.20.6 > 10.20.20.254: ICMP echo request, id 22927, seq 0, length 64 0x0000: 4500 0054 de09 0000 4001 5f74 0a14 1406 E..T....@._t.... 0x0010: 0a14 14fe 0800 8750 598f 0000 0006 2ec0 .......PY....... 0x0020: 15c1 e795 0809 0a0b 0c0d 0e0f 1011 1213 ................ 0x0030: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 .............!"# 0x0040: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 $%&'()*+,-./0123 0x0050: 3435 3637 4567 15:23:10.624147 02:11:22:33:44:55 (oui Unknown) > 02:07:f0:80:de:0b (oui Unknown), ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 54016, offset 0, flags [none], proto ICMP (1), length 84) 10.20.20.254 > 10.20.20.6: ICMP echo reply, id 22927, seq 0, length 64 0x0000: 4500 0054 d300 0000 4001 6a7d 0a14 14fe E..T....@.j}.... 0x0010: 0a14 1406 0000 8f50 598f 0000 0006 2ec0 .......PY....... 0x0020: 15c1 e795 0809 0a0b 0c0d 0e0f 1011 1213 ................ 0x0030: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 .............!"# 0x0040: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 $%&'()*+,-./0123 0x0050: 3435 3637 4567 (arp cache seems valid as well) jail-10-20-20-6# arp -na ? (10.20.20.6) at 02:07:f0:80:de:0b on epair0b permanent [ethernet] ? (10.20.20.254) at 02:11:22:33:44:55 on epair0b expires in 1085 seconds [ethernet] Additional thoughts: 1) With lagg0, cc0, and cc1 up, I created a second jail on host25 using 10.20.20.7 (epair1). 
I add epair1a to bridge2020 (now including epair0a, epair1a and lagg0.2020). When I attempt to ping from jail-10-20-20-6 to .254, I get a timeout as previously experienced. Pinging from .6 to .7 appears to work without any trouble, whether lagg0's cc0/cc1 members are up or down. This was expected, as these packets should never traverse lagg0.2020, but I did want to test/confirm.

2) I did run some ping tests with an untagged lagg0 in the bridge, and it does appear to work without trouble. I removed lagg0.2020 from bridge2020, added lagg0 to bridge2020, and set the ports as untagged in the switch. The packets appear to move without trouble even with both cc0+cc1 up. I need to test this further to be conclusive, but it felt less important at this time, as it doesn't meet my requirement for tagged ports.

3) I have a few bhyve VMs that I've added as tests (tap0, tap1, etc.) to bridge2020. The results seem to be largely consistent with jails. You could replace jail-10-20-20-6 with vm-10-20-20-11 (tested FreeBSD / OpenBSD / Windows), for instance, and the same results appear: packets fail when originating from tap/vnet and traversing lagg0.2020.

(again, lagg0/lacp is up and includes cc0+cc1; bridge2020 includes lagg0.2020, tap0, and epair0a devices)
host25# ping 10.20.20.254
success!
vm-10-20-20-11# arp -da

(attempt to traverse lagg0.2020)
vm-10-20-20-11# ping 10.20.20.254
ping: sendto: Host is down

(try tap0 -> epair0)
vm-10-20-20-11# ping 10.20.20.6
success!

(try tests again with lagg0 member cc1 down)
host25# cc1 down

(tap0 -> lagg0.2020 -> 10.20.20.254)
vm-10-20-20-11# ping 10.20.20.254
success!

(again tap0 -> epair0, works as expected)
vm-10-20-20-11# ping 10.20.20.6
success!
(turn cc1 back up, wait about 10 seconds for both laggports to be distributing)
host25# cc1 up
vm-10-20-20-11# arp -da
vm-10-20-20-11# ping 10.20.20.254
ping: sendto: Host is down

(again, only the lagg is preventing arp; tap <-> epair in the bridge still works fine)
vm-10-20-20-11# ping 10.20.20.6
success!
jail-10-20-20-6# ping 10.20.20.11
success!

Conclusion: When bridging a vnet/tap interface with a lagg.vlan interface (a vlan interface with a lagg [laggproto lacp] parent), ARP replies do not enter the vnet/tap interface on the bridge when *both* lagg members are up. After downing one of the two interfaces in the lagg group, ARP replies enter the vnet/tap interface as expected.

Final notes: I've not included it in this post, but I've tried removing all the hardware offloading features from lagg0/lagg0.2020/cc0/cc1, toggling the lagg0 lagghash, toggling the net.link.lagg.* and net.link.bridge.* sysctls, and upgrading to 13-STABLE. No luck moving data over the lagg until I down one of the two lagg0 interfaces.

For brevity, I used the command 'ping host-ip' in the examples above and only displayed a simple success/fail response. In testing I mostly ran pings for reasonably long periods (e.g. -c 10 -t 2) to confirm the above examples.

I'd be happy to help test further if anyone has any suggestions.

Thank you!
-kvs
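The epair0a hex dumps above can be checked by hand. The following Python sketch (not part of the original report; `decode_arp` is a hypothetical helper) decodes an ARP body as tcpdump printed it here, starting at the hardware-type field, using the 28-byte Ethernet/IPv4 layout from RFC 826. Anything past those 28 bytes is treated as trailing padding, which is why the replies show "length 46" while the request shows "length 28": 18 zero bytes bring the 14-byte Ethernet header plus body up to the 60-byte frame tcpdump reports.

```python
import ipaddress
import struct

ETH_HDR_LEN = 14  # destination MAC + source MAC + ethertype
ARP_LEN = 28      # Ethernet/IPv4 ARP body (RFC 826): 8-byte header + 6+4+6+4 address bytes

# Hex words copied from the epair0a capture above (ASCII column dropped).
REQUEST = bytes.fromhex(
    "0001 0800 0604 0001 0207 f080 de0b 0a14 "
    "1406 0000 0000 0000 0a14 14fe"
)
REPLY = bytes.fromhex(
    "0001 0800 0604 0002 0211 2233 4455 0a14 "
    "14fe 0207 f080 de0b 0a14 1406 0000 0000 "
    "0000 0000 0000 0000 0000 0000 0000"
)


def decode_arp(payload: bytes) -> dict:
    """Decode an Ethernet/IPv4 ARP body; bytes past the first 28 are padding."""
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not an Ethernet/IPv4 ARP packet")
    sha, spa, tha, tpa = struct.unpack("!6s4s6s4s", payload[8:ARP_LEN])
    return {
        "op": {1: "request", 2: "reply"}.get(oper, str(oper)),
        "sender_mac": sha.hex(":"),
        "sender_ip": str(ipaddress.IPv4Address(spa)),
        "target_mac": tha.hex(":"),
        "target_ip": str(ipaddress.IPv4Address(tpa)),
        "padding": len(payload) - ARP_LEN,
    }


if __name__ == "__main__":
    for name, pkt in (("request", REQUEST), ("reply", REPLY)):
        info = decode_arp(pkt)
        # tcpdump's "length 42"/"length 60" is the frame including the Ethernet header.
        print(name, info, "frame length:", ETH_HDR_LEN + len(pkt))
```

The request decodes as 10.20.20.6 asking for 10.20.20.254 with no padding (a 42-byte frame), and each reply as 10.20.20.254 at 02:11:22:33:44:55 with 18 bytes of padding (a 60-byte frame, the Ethernet minimum without the FCS), so the replies that do reach epair0a look well formed.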
Hi, any progress on this one? Will it be fixed in 13.2-RELEASE or in the slightly later 14.0-RELEASE? I ask because my buddy just hit it again with 13.1-RELEASE today.

Regards,
vermaden
(In reply to Slawomir Wojciech Wojtczak from comment #27)
I've made some headway on my end, though I don't know how much it's related to the earlier bugs at this point. After further testing, VLANs apparently aren't related to my problem: it also occurs on a lagg without vlan interfaces.

When a jail+VNET (on a bridge) sends an ARP request, it traverses the bridge and exits both interfaces in the host lagg group. When the ARP reply comes back, it appears it will only ever enter the host bridge if it comes in on the primary lagg member. I'm not certain this is exclusive to vnets; possibly this is normal operation for laggs using LACP?

Lab test:
lagg0 (ports cc0 + cc1), bridge2020 (members epair0a & lagg0)
ping from jail+VNET to switch (10.20.20.254), using source epair0b (10.20.20.77)
(epair0b -> epair0a -> bridge2020 -> lagg0 -> cc0/cc1 -> switch)

tcpdump -i epair0a
10:00:17.981011 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28

tcpdump -i bridge2020
10:00:17.981051 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28

tcpdump -i lagg0
10:00:17.981030 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:00:17.981282 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i cc0:
10:00:17.981050 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 42

tcpdump -i cc1:
10:00:17.981041 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:00:17.981282 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

The ARP table is not populated on the VM, as bridge2020 and epair0a/b never see the ARP reply that comes in over cc1. I believe that in my case specifically, the switch sees cc1 as the primary lagg member while the FreeBSD server sees cc0 as the primary lagg member. When ARP replies manage to come in over cc0, they make it to the vnet interface and the jail populates its ARP table.
I can force this event by downing cc1 or shutting down the cc1 switch port (in both cases it appears the switch then identifies cc0 as the primary lagg member, over which it sends ARP replies). Alternatively, if both cc0 and cc1 are up and the switch sends an ARP reply over cc0 (this has happened randomly), the ARP reply does make it through the bridge/epair and populates the ARP cache on the VM.

Example after ifconfig cc1 down:

tcpdump -i epair0a
10:48:18.949695 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i bridge2020
10:48:18.949731 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i lagg0
10:48:18.949711 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

tcpdump -i cc0
10:48:18.949722 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
10:48:18.950041 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

The ARP table on the VM is now populated with the switch address, and everything appears to work as normal over lagg0 (with cc0 up / cc1 down).

In the meantime I've managed to get the switch configured to send L2/ARP over both lagg members, which has fixed the immediate problem. Still, I think it's strange that FreeBSD populates the ARP table just fine on the host over cc1, but won't pass that ARP reply over the bridge interface unless it comes in on cc0.
That *feels* like a bug, as it only seems to affect the second interface on a lagg that's in a bridge, and quite possibly only for layer 2 (L2/3 needs further testing - I've not lost packets once the arp table is populated, but it's possible the switch was handling layer 3 differently and always using the cc0 port, in which case FreeBSD would probably send over the bridge without trouble). Testing has been performed on 14-CURRENT and 13-STABLE with identical results.
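Comparing per-interface captures like these is easy to get wrong by eye. The Python sketch below is a hypothetical helper, not a FreeBSD tool: it scans tcpdump one-line summaries keyed by interface name, walks the inbound path lagg0 -> bridge2020 -> epair0a to report the first hop that never saw the ARP reply, and lists which lagg member carried it. The capture lines are transcribed from the two runs above; the names `first_missing_hop` and `reply_members` are illustrative.

```python
import re

# Inbound path of the ARP reply in the lab test, from the lagg towards the jail
# (the reverse of epair0b -> epair0a -> bridge2020 -> lagg0 -> cc0/cc1).
INBOUND_PATH = ["lagg0", "bridge2020", "epair0a"]
LAGG_MEMBERS = ["cc0", "cc1"]
REPLY_RE = re.compile(r"ARP, Reply (\S+) is-at (\S+)")

REQ = "ARP, Request who-has 10.20.20.254 tell 10.20.20.77"
REP = "ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46"

# One-line tcpdump summaries per interface, transcribed from the two runs above.
FAILING = {  # cc0 and cc1 both up
    "epair0a": [f"10:00:17.981011 {REQ}, length 28"],
    "bridge2020": [f"10:00:17.981051 {REQ}, length 28"],
    "lagg0": [f"10:00:17.981030 {REQ}, length 28", f"10:00:17.981282 {REP}"],
    "cc0": [f"10:00:17.981050 {REQ}, length 42"],
    "cc1": [f"10:00:17.981041 {REQ}, length 28", f"10:00:17.981282 {REP}"],
}
WORKING = {  # after ifconfig cc1 down
    "epair0a": [f"10:48:18.949695 {REQ}, length 28", f"10:48:18.950041 {REP}"],
    "bridge2020": [f"10:48:18.949731 {REQ}, length 28", f"10:48:18.950041 {REP}"],
    "lagg0": [f"10:48:18.949711 {REQ}, length 28", f"10:48:18.950041 {REP}"],
    "cc0": [f"10:48:18.949722 {REQ}, length 28", f"10:48:18.950041 {REP}"],
}


def saw_reply(lines, ip):
    """True if any captured line is an ARP reply announcing `ip`."""
    for line in lines:
        m = REPLY_RE.search(line)
        if m and m.group(1) == ip:
            return True
    return False


def first_missing_hop(captures, path, ip):
    """First interface on `path` whose capture lacks the reply, or None if all saw it."""
    for ifname in path:
        if not saw_reply(captures.get(ifname, []), ip):
            return ifname
    return None


def reply_members(captures, members, ip):
    """Lagg member ports on which the reply was captured."""
    return [m for m in members if saw_reply(captures.get(m, []), ip)]


if __name__ == "__main__":
    for label, caps in (("both up", FAILING), ("cc1 down", WORKING)):
        print(label, reply_members(caps, LAGG_MEMBERS, "10.20.20.254"),
              first_missing_hop(caps, INBOUND_PATH, "10.20.20.254"))
```

On these captures the helper finds the reply arriving only on cc1 and going missing at bridge2020 when both members are up, and arriving on cc0 with no missing hop once cc1 is down, which matches the reading in the comment above.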
(In reply to kvs from comment #28)
I think you should open a separate PR, as your setup differs from that of the original PR by John Westbrook, who has SR-IOV configured.

I managed to reproduce this with cxl / lagg / bridge / epair (vnet) on 13.2-RC3. I also tried re / ue.

> tcpdump -i cc0:
> 10:00:17.981050 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 42
> tcpdump -i cc1:
> 10:00:17.981041 ARP, Request who-has 10.20.20.254 tell 10.20.20.77, length 28
> 10:00:17.981282 ARP, Reply 10.20.20.254 is-at 02:11:22:33:44:55 (oui Unknown), length 46

You might want to run tcpdump on cc0 with `--direction=in` to filter for an ARP request that was sent out from cc1 and then came back in on cc0 (the switch forwarded it). IF_BRIDGE(4) seems to hide something to protect itself from getting confused. If you can confirm this, then please configure your switch properly: the switch ports connected to cc0 and cc1 should be in the same link aggregation group.

I'll see if I can teach IF_BRIDGE(4) to emit warnings in case it gets an ARP request packet sent from itself.
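The warning proposed above comes down to one check: an inbound ARP request whose sender hardware address belongs to the host itself. A minimal Python sketch of that check over already-decoded frames follows. It makes no claim about how if_bridge(4) is or would be implemented; `ArpFrame`, `looped_requests` and `JAIL_MAC` are hypothetical, and the sample frames are invented to mirror the suspected scenario (request sent out cc1, flooded back in on cc0 by a switch that does not treat both ports as one LAG).

```python
from dataclasses import dataclass
from typing import Iterable, List

# Placeholder addresses: JAIL_MAC is hypothetical; SWITCH_MAC is the switch MAC seen above.
JAIL_MAC = "02:00:00:00:00:77"
SWITCH_MAC = "02:11:22:33:44:55"


@dataclass
class ArpFrame:
    ifname: str      # lagg member the frame was captured on
    direction: str   # "in" or "out", as selected with tcpdump --direction
    op: str          # "request" or "reply"
    sender_mac: str  # ARP sender hardware address


def looped_requests(frames: Iterable[ArpFrame], local_macs: Iterable[str]) -> List[ArpFrame]:
    """Inbound ARP requests whose sender MAC is one of ours.

    One plausible cause is a request sent out one lagg member and flooded
    back in on the other member by the switch.
    """
    local = {mac.lower() for mac in local_macs}
    return [
        f for f in frames
        if f.direction == "in" and f.op == "request" and f.sender_mac.lower() in local
    ]


# Hypothetical capture, invented for illustration only.
FRAMES = [
    ArpFrame("cc1", "out", "request", JAIL_MAC),
    ArpFrame("cc0", "in", "request", JAIL_MAC),   # our own request coming back
    ArpFrame("cc1", "in", "reply", SWITCH_MAC),
]

if __name__ == "__main__":
    for f in looped_requests(FRAMES, [JAIL_MAC]):
        print(f"warning: own ARP request received back on {f.ifname}")
```

Comparing on the ARP sender hardware address rather than the Ethernet source is a design choice of this sketch; either field would identify the frame as locally originated in this scenario.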
(In reply to Kristof Provost from comment #13)
Letting bridge(4) ignore all packets with a VLAN tag might be too aggressive, since all tagged packets would then be ignored. I'd propose making bridge(4) decide based on configuration, similar to what hardware switches do. Some syntax like this:

```
# ifconfig bridge0 vlan 10,20,100-200
# ifconfig bridge0 addm em0 link-type trunk
# ifconfig bridge0 addm em1 link-type hybrid
# ifconfig bridge0 addm em2 link-type access
# ifconfig bridge0 addm em0 trunk vlan 10,100-110
# ifconfig bridge0 addm em1 hybrid vlan all
# ifconfig bridge0 addm em2 access vlan 20
```

Then bridge(4) decides whether to accept tagged / untagged packets by checking the configuration of the member port. For example, with the syntax above, bridge0 is interested in VLANs 10,20,100-200, so any packets received on em1 without a VLAN tag in 10,20,100-200 will be ignored and returned for local processing. As for em2, tagged packets are ignored, and untagged packets will have VLAN tag 20 added and be processed normally (by bridge0).
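The accept/ignore rule proposed above can be checked case by case with a small model. The Python sketch below is not bridge(4) code, and the `vlan`/`link-type` ifconfig syntax is a proposal rather than an existing option; `parse_vlans`, `Port` and `classify` are hypothetical names. Hybrid ports are treated like trunks for tagged frames. How trunk or hybrid ports should handle untagged frames (e.g. a native VLAN) is not specified in the proposal, so this sketch simply returns them for local processing.

```python
from dataclasses import dataclass
from typing import Optional, Set


def parse_vlans(spec: str) -> Optional[Set[int]]:
    """Parse a list such as '10,20,100-200' into VLAN IDs; 'all' -> None (unrestricted)."""
    if spec.strip() == "all":
        return None
    vids: Set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        first, last = int(lo), int(hi or lo)
        if not 1 <= first <= last <= 4094:
            raise ValueError(f"bad VLAN range: {part!r}")
        vids.update(range(first, last + 1))
    return vids


@dataclass
class Port:
    link_type: str             # "trunk", "hybrid" or "access"
    vlans: Optional[Set[int]]  # VIDs allowed on this member; None means all

    def __post_init__(self) -> None:
        if self.link_type == "access" and (self.vlans is None or len(self.vlans) != 1):
            raise ValueError("an access port carries exactly one VLAN")


def classify(bridge_vlans: Optional[Set[int]], port: Port, tag: Optional[int]) -> Optional[int]:
    """VLAN the bridge would process the frame in, or None to return it for local processing.

    `tag` is the frame's 802.1Q VID, or None for an untagged frame.
    """
    if port.link_type == "access":
        if tag is not None:
            return None                  # tagged frames are ignored on access ports
        vid = next(iter(port.vlans))     # untagged frames get the port's VLAN tag
    else:                                # trunk and hybrid
        if tag is None:
            return None                  # assumption: no native VLAN in this sketch
        if port.vlans is not None and tag not in port.vlans:
            return None
        vid = tag
    if bridge_vlans is not None and vid not in bridge_vlans:
        return None                      # bridge0 is not interested in this VLAN
    return vid


if __name__ == "__main__":
    bridge0 = parse_vlans("10,20,100-200")
    em0 = Port("trunk", parse_vlans("10,100-110"))
    em1 = Port("hybrid", parse_vlans("all"))
    em2 = Port("access", parse_vlans("20"))
    print(classify(bridge0, em1, 300), classify(bridge0, em2, None), classify(bridge0, em2, 20))
```

With the example configuration, a frame tagged 300 on em1 is returned for local processing because bridge0 is not interested in VLAN 300, an untagged frame on em2 is handled in VLAN 20, and a tagged frame on em2 is ignored, as the comment describes.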
As a workaround, if a setup such as the one in comment #11 is mandatory:

em0 -- vm-sw1 -- epair0b -- epair0a (connected to host)
epair0a.2 -- vm-sw2 -- jails vlan 2
epair0a.4 -- vm-sw4 -- jails vlan 4
epair0a.6 -- vm-sw6 -- jails vlan 6
epair0a.8 -- vm-sw8 -- jails vlan 8

Let em0, vm-sw1 and epair0b be pure layer 2 interfaces. Set IP/IPv6 addresses on epair0a as required.