Bug 247912

Summary: [if_bridge] IPv6 ndp does not work across local bridge members
Product: Base System
Reporter: Martin Birgmeier <d8zNeCFG>
Component: kern
Assignee: freebsd-net (Nobody) <net>
Status: Open
Severity: Affects Only Me
CC: dgeo, donner, kp, lwhsu, melifaro, philip, pmh, qingli
Priority: ---
Keywords: ipv6
Version: 12.2-RELEASE
Hardware: Any
OS: Any

Description Martin Birgmeier 2020-07-11 14:30:50 UTC
Scenario:
- FreeBSD 12.1 release patch level 6 acting as bhyve host
- The host has a local Ethernet interface em0 with IPv4 and IPv6 addresses assigned; all these addresses are announced via DNS and /etc/hosts
- Via em0, the host sees several other machines on the network; all have IPv4 and IPv6 addresses assigned, as well as DNS and /etc/hosts entries
- Using bhyve to run guests (FreeBSD 12.1 amd64 and i386, and head amd64)
- In order to use bhyve, create bridge and tap interfaces as follows:

# sysctl net.link.tap.up_on_open=1
# ifconfig bridge0 create && ifconfig bridge0 addm em0 && ifconfig bridge0 up
# ifconfig tap905 create && ifconfig bridge0 addm tap905
# sh /usr/share/examples/bhyve/vmrun.sh -u -c 4 -m 3G -t tap905 -d <disk device> <vm name>

Result:
- When using "ndp -a" in the bhyve client, entries for all remote machines exist correctly.
- However, there is no entry for the IPv6 address associated with the bridged-to interface em0
- As a result, it is not possible to reach services on the host system from the bhyve client via IPv6 (IPv4 is working)

Scenario (continued):
- Manually add ndp entries in the client:

# ndp -s <IPv6 address of host's em0> <Ethernet address of host's em0>

Result:
- It is now possible to reach services on the host system from the client system via IPv6
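
A small sketch of how the missing entry could be detected before applying the workaround. The helper name has_ndp_entry is hypothetical; it only assumes the column layout of "ndp -a" shown in this report (one header line, neighbor in the first column).

```shell
#!/bin/sh
# Hypothetical helper: read neighbor-table text on stdin and succeed if
# some entry has the given address or host name in its first column.
# The first line is skipped because it is the column header.
has_ndp_entry() {
    awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Intended use inside the guest (not executed here, needs root):
#   ndp -an | has_ndp_entry "$host_addr" || ndp -s "$host_addr" "$host_mac"
```

With "ndp -an" the addresses are printed numerically, so the comparison does not depend on DNS or /etc/hosts. Link-local entries carry a %interface suffix and would have to be passed in that form.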

Expected result:
- NDP should be working also for the host's interface em0 which is bridged to bridge0, and not only for interfaces of remote machines

Note:
- Exactly the same issue is seen on another bhyve host with re0 as the physical interface

-- Martin
Comment 1 Qing Li freebsd_committer freebsd_triage 2020-08-17 20:47:01 UTC
(In reply to Martin Birgmeier from comment #0)

Just to be sure, could you please provide the "ndp -a" output for both before and after the bridge0 creation?
Comment 2 Martin Birgmeier 2020-08-18 15:34:48 UTC
Hi Li,

Since you want it "before and after the creation of bridge0", the following is from the host; but the issue actually occurs on the client - I'll provide the output for that, too.

Host before "bridge0 create" and "tap904 create":

[0]# ndp -a
Neighbor                             Linklayer Address  Netif Expire    S Flags
2002:b2bf:ee7e:4d42:22cf:30ff:fe55:5cb6 20:cf:30:55:5c:b6 re0 permanent R 
fec0::4d42:22cf:30ff:fe55:5cb6       20:cf:30:55:5c:b6    re0 permanent R 
fec0:0:0:4d42::e1                    20:cf:30:55:5c:b6    re0 permanent R 
fe80::22cf:30ff:fe55:5cb6%re0        20:cf:30:55:5c:b6    re0 permanent R 
gandalf.xyzzy                        00:03:0d:4f:f3:a7    re0 23h57m34s S R
fe80::203:dff:fe4f:f3a7%re0          00:03:0d:4f:f3:a7    re0 23h55m33s S R
fe80::218:e7ff:fee0:807b%re0         00:18:e7:e0:80:7b    re0 23h55m33s S R
hal.xyzzy                            20:cf:30:55:5c:b6    re0 permanent R 
mizar.xyzzy                          f0:de:f1:98:86:a9    re0 23h58m35s S 
[0]# 

After "ifconfig bridge0 create && ifconfig bridge0 addm re0 && ifconfig bridge0 up":

[0]# ndp -a                             
Neighbor                             Linklayer Address  Netif Expire    S Flags
2002:b2bf:ee7e:4d42:22cf:30ff:fe55:5cb6 20:cf:30:55:5c:b6 re0 permanent R 
fec0::4d42:22cf:30ff:fe55:5cb6       20:cf:30:55:5c:b6    re0 permanent R 
fec0:0:0:4d42::e1                    20:cf:30:55:5c:b6    re0 permanent R 
fe80::22cf:30ff:fe55:5cb6%re0        20:cf:30:55:5c:b6    re0 permanent R 
gandalf.xyzzy                        00:03:0d:4f:f3:a7    re0 23h58m48s S R
fe80::203:dff:fe4f:f3a7%re0          00:03:0d:4f:f3:a7    re0 23h51m46s S R
fe80::218:e7ff:fee0:807b%re0         00:18:e7:e0:80:7b    re0 23h51m46s S R
hal.xyzzy                            20:cf:30:55:5c:b6    re0 permanent R 
mizar.xyzzy                          f0:de:f1:98:86:a9    re0 23h59m48s S 
[0]# 

After "ifconfig tap904 create && ifconfig bridge0 addm tap904":

[0]# ndp -a                                                
Neighbor                             Linklayer Address  Netif Expire    S Flags
2002:b2bf:ee7e:4d42:22cf:30ff:fe55:5cb6 20:cf:30:55:5c:b6 re0 permanent R 
fec0::4d42:22cf:30ff:fe55:5cb6       20:cf:30:55:5c:b6    re0 permanent R 
fec0:0:0:4d42::e1                    20:cf:30:55:5c:b6    re0 permanent R 
fe80::22cf:30ff:fe55:5cb6%re0        20:cf:30:55:5c:b6    re0 permanent R 
gandalf.xyzzy                        00:03:0d:4f:f3:a7    re0 23h58m2s  S R
fe80::203:dff:fe4f:f3a7%re0          00:03:0d:4f:f3:a7    re0 23h51m0s  S R
fe80::218:e7ff:fee0:807b%re0         00:18:e7:e0:80:7b    re0 23h51m0s  S R
hal.xyzzy                            20:cf:30:55:5c:b6    re0 permanent R 
mizar.xyzzy                          f0:de:f1:98:86:a9    re0 23h59m2s  S 
[0]# 

Now starting the bhyve VM; the rest is from inside the VM.

Before manually adding ndp entries:

[0]# ndp -a
Neighbor                             Linklayer Address  Netif Expire    S Flags
v904.xyzzy                           00:a0:98:50:35:17 vtnet0 permanent R 
gandalf.xyzzy                        00:03:0d:4f:f3:a7 vtnet0 23h59m57s S R
fe80::203:dff:fe4f:f3a7%vtnet0       00:03:0d:4f:f3:a7 vtnet0 23h59m2s  S R
fe80::218:e7ff:fee0:807b%vtnet0      00:18:e7:e0:80:7b vtnet0 23h59m2s  S R
2002:b2bf:ee7e:4d42:2a0:98ff:fe50:3517 00:a0:98:50:35:17 vtnet0 permanent R 
fec0::4d42:2a0:98ff:fe50:3517        00:a0:98:50:35:17 vtnet0 permanent R 
fe80::2a0:98ff:fe50:3517%vtnet0      00:a0:98:50:35:17 vtnet0 permanent R 
mizar.xyzzy                          f0:de:f1:98:86:a9 vtnet0 23h59m57s S 
[0]# 

After "ndp -s fec0:0:0:4d42::e 20:cf:30:55:5c:b6 && ndp -s fec0:0:0:4d42::e1 20:cf:30:55:5c:b6" (the host has two IPv6 addresses assigned to its interface; fec0:0:0:4d42::e resolves to hal.xyzzy):

[0]# ndp -a
Neighbor                             Linklayer Address  Netif Expire    S Flags
fec0:0:0:4d42::e1                    20:cf:30:55:5c:b6 vtnet0 permanent R 
v904.xyzzy                           00:a0:98:50:35:17 vtnet0 permanent R 
gandalf.xyzzy                        00:03:0d:4f:f3:a7 vtnet0 23h58m54s S R
fe80::203:dff:fe4f:f3a7%vtnet0       00:03:0d:4f:f3:a7 vtnet0 23h57m59s S R
fe80::218:e7ff:fee0:807b%vtnet0      00:18:e7:e0:80:7b vtnet0 23h57m59s S R
2002:b2bf:ee7e:4d42:2a0:98ff:fe50:3517 00:a0:98:50:35:17 vtnet0 permanent R 
fec0::4d42:2a0:98ff:fe50:3517        00:a0:98:50:35:17 vtnet0 permanent R 
fe80::2a0:98ff:fe50:3517%vtnet0      00:a0:98:50:35:17 vtnet0 permanent R 
hal.xyzzy                            20:cf:30:55:5c:b6 vtnet0 permanent R 
mizar.xyzzy                          f0:de:f1:98:86:a9 vtnet0 23h58m54s S 
[0]# 

-- Martin
Comment 3 Patrick M. Hausen 2020-10-29 08:15:22 UTC
Isn't the IP configuration (both v4 and v6) supposed to go on the bridge interface instead of em0?

There should be a message upon inserting em0 as a member:

"IPv6 addresses on em0 have been removed before adding it as a member to prevent IPv6 address scope violation."
Comment 4 Philip Paeps freebsd_committer freebsd_triage 2020-12-30 11:33:28 UTC
To clarify the bhyve use case a little further:

Setup:

bhyve (tap0) - bridge - vlan0

```
root@host:~ # ifconfig vm-service
vm-service: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 76:92:90:55:ad:c5
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto stp-rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 15 priority 128 path cost 2000000
        member: vlan_service flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 9 priority 128 path cost 20000
        groups: bridge vm-switch viid-aaabf@
        nd6 options=41<PERFORMNUD,NO_RADR>

root@host:~ # ifconfig vlan-service
vlan_service: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=200401<RXCSUM,LRO,RXCSUM_IPV6>
        ether 40:62:31:11:af:6f
        inet 172.24.0.254 netmask 0xffffff00 broadcast 172.24.0.255
        inet 172.24.0.1 netmask 0xffffffff broadcast 172.24.0.1
        inet 172.24.0.153 netmask 0xffffffff broadcast 172.24.0.153
        inet6 fe80::4262:31ff:fe11:af6f%vlan_service prefixlen 64 scopeid 0x9
        inet6 fd55:3904:d01f:0:4262:31ff:fe11:af6f prefixlen 64
        inet6 fd55:3904:d01f::153 prefixlen 64
        inet6 fd55:3904:d01f::1 prefixlen 64
        inet6 2404:c804:1637:4c00:4262:31ff:fe11:af6f prefixlen 64
        groups: vlan
        vlan: 2750 vlanpcp: 0 parent interface: igb0
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=61<PERFORMNUD,AUTO_LINKLOCAL,NO_RADR>

root@linux:~# ip neighbour ls
172.24.0.153 dev enp0s5 lladdr 58:9c:fc:00:41:7a REACHABLE
172.24.0.254 dev enp0s5 lladdr 40:62:31:11:af:6f REACHABLE
fd55:3904:d01f::1 dev enp0s5 lladdr 40:62:31:11:af:6f router REACHABLE
fe80::4262:31ff:fe11:af6f dev enp0s5 lladdr 40:62:31:11:af:6f router STALE
fd55:3904:d01f::153 dev enp0s5  FAILED
2404:c804:1637:4c00:4262:31ff:fe11:af6f dev enp0s5 lladdr 40:62:31:11:af:6f router STALE
```

Linux running in bhyve is unable to communicate with fd55:3904:d01f::153 on the host because we never respond to the NDP packets.  If fd55:3904:d01f::153 is configured on the bridge rather than on the vlan_service interface, everything works normally.
Comment 5 Patrick M. Hausen 2020-12-30 12:25:27 UTC
Again: this is how it is supposed to work. You *must* configure IPv4 and IPv6 addresses on the bridge interface and not on the VLAN member.

See FreeBSD Handbook on bridging.
Comment 6 Kristof Provost freebsd_committer freebsd_triage 2020-12-30 12:38:25 UTC
Yes, the problem is indeed that the addresses should be set on the bridge interface, not the member interfaces. It mostly works if you don't, but only mostly. Multicast is broken in that setup.

That's because in bridge_input() we special-case multicast and broadcast traffic. It gets forwarded *out* of all of the member interfaces and injected into the bridge interface. Member interfaces do not get to see it. The bridge interface is not subscribed to the expected multicast group (because the address is not set on it, but on a member interface) and the packet gets ignored.
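
A minimal sketch of what moving an address to the bridge could look like. The function name move_inet6_to_bridge and the address 2001:db8::e are placeholders, not taken from this report, and the function only prints the ifconfig commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Print (do not run) the commands that would move one IPv6 address from a
# bridge member interface to the bridge interface itself.
# Interface names, address and prefix length are example values.
move_inet6_to_bridge() {
    member=$1
    bridge=$2
    addr=$3
    plen=$4
    # Remove the address from the member interface ...
    echo "ifconfig $member inet6 $addr -alias"
    # ... and configure it on the bridge instead.
    echo "ifconfig $bridge inet6 $addr prefixlen $plen alias"
}

move_inet6_to_bridge em0 bridge0 2001:db8::e 64
```

Piping the output into sh as root would apply it. With the address on bridge0, the bridge interface itself joins the multicast groups that neighbor solicitations for that address are sent to.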
Comment 7 Patrick M. Hausen 2020-12-30 12:45:52 UTC
Kristof, possibly we should make that paragraph more prominent in the bridging chapter? Can I clone the docs repo now and submit a pull request instead of the awkward "could some committer please look at XY" procedure of the past - now that we are using git?
Comment 8 Kristof Provost freebsd_committer freebsd_triage 2020-12-30 12:49:05 UTC
(In reply to Patrick M. Hausen from comment #7)
I'm not a doc committer myself, but I'd hope that patches are welcome in just about any form. Maybe send it to bcr@? He's always helpful.

(Off topic: I don't think that we've fully embraced the GitHub PR workflow yet. It's likely easier to get those PRs in than it was in the past though.)
Comment 9 Martin Birgmeier 2020-12-30 12:54:22 UTC
Spinning this a little further, shouldn't the VM's (tapXXX) IP address also be assigned to the bridge?

Obviously this would not be a good idea because then the host gets the client's traffic...

So why is bridging used in the first place in this scenario? - It seems because it is the only way for the client to be connected to an outside ("real") network.

It would probably be more correct for bhyve to use an internal virtual network which is then routed (layer 3) to the external network by the host.

Is this analysis correct?

-- Martin
Comment 10 Patrick M. Hausen 2020-12-30 12:57:14 UTC
Of course not.

A tap interface (and an epair for jails) has got two ends. One on the host side, member of the bridge, and *without* any IP address.
The other end *in* the guest VM or jail with the VM's/jail's IP address.

So that's a different matter.
Comment 11 Patrick M. Hausen 2020-12-30 13:02:44 UTC
And you can build an internal network and route if you prefer.

- create a bridge without a physical member interface
- assign a suitable IP address to that bridge; this will be the default gateway in your network
- enable forwarding on the host
- make all your VM taps members of that bridge
- take care of external routing so other systems know how that network can be reached - or use ipfw/pf and NAT

We do that all the time. Again IP addresses *of the host* go on the bridge.
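
The steps above might be sketched as follows. The names bridge1 and tap905 and the prefix 2001:db8:1::/64 are placeholders chosen for this example, and the function routed_vm_net is hypothetical. It prints the commands rather than running them and leaves out the external routing or NAT step.

```shell
#!/bin/sh
# Print the commands for a routed VM network: a bridge with no physical
# member, a gateway address on the bridge, forwarding enabled, taps added.
# Usage: routed_vm_net <bridge> <gateway-ipv6-address> <tap>...
routed_vm_net() {
    bridge=$1
    gw6=$2
    shift 2
    echo "ifconfig $bridge create"
    # The host's address goes on the bridge, never on a member.
    echo "ifconfig $bridge inet6 $gw6 prefixlen 64 up"
    echo "sysctl net.inet.ip.forwarding=1"
    echo "sysctl net.inet6.ip6.forwarding=1"
    for tap in "$@"; do
        echo "ifconfig $tap create"
        echo "ifconfig $bridge addm $tap"
    done
}

routed_vm_net bridge1 2001:db8:1::1 tap905
```

The guests would then use 2001:db8:1::1 as their default router. Making that prefix reachable from other systems is site specific and not shown here.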
Comment 12 Philip Paeps freebsd_committer freebsd_triage 2020-12-30 13:12:43 UTC
(In reply to Patrick M. Hausen from comment #7)
> Kristof, possibly we should make that paragraph more prominent in the bridging
> chapter? Can I clone the docs repo now and submit a pull request instead of the
> awkward "could some committer please look at XY" procedure of the past - now that we 
> are using git?

You can also attach your pull request to this bug.  We can Cc: it to a doc committer for review.  (Assuming that "pull request" is newspeak for "patch" and not some Git magic I'm not familiar with.)

I think this limitation on multicast should be documented in the if_bridge(4) manual page as well as in the Handbook.

Given that this setup "mostly" works (except for multicast), there's probably no point in having ifconfig complain if you assign an address to an interface that's a member in a bridge.  It would also open up cans of worms about attaching interfaces that have addresses on them to bridges.

Conceptually, this limitation makes sense: you really should put addresses on the bridge interface rather than the VLAN interface.  In this setup, the VLAN interface becomes morally equivalent to a "link" rather than an "interface", much as the host side of the tap to bhyve is.
Comment 13 Martin Birgmeier 2020-12-30 13:25:05 UTC
(In reply to Patrick M. Hausen from comment #11)

Thanks for setting me straight on this.

It is of course the client's interface that gets the client IP address, in this case vtnet0. And the tapXXX in the host just sees the layer 2 traffic as part of the bridge.

One more question: Is the handling of the re0 and tap0 interfaces (both members of bridge0) different? - Because tap0 seems to see all the NDP traffic in both directions but re0 does not?

-- Martin
Comment 14 Patrick M. Hausen 2020-12-30 13:29:03 UTC
The difference is that there is an IP address on re0 and none on tap0 in the setup that is almost but not quite working.

If you really want to know what that does to multicast in the kernel, I have to refer to Kristof ;-)
Comment 15 Patrick M. Hausen 2020-12-30 13:32:32 UTC
(In reply to Philip Paeps from comment #12)
Philip, a pull request is how almost all of the open source world, with the notable exception of FreeBSD, handles contributions.

The workflow is:

- clone repository into my own working copy
- make, debug, commit, push changes to my heart's content
- click on "send pull request" in browser
- the upstream project receives that, whoever is in charge reviews and hopefully accepts it, and
- clicks on "merge pull request"

Two clicks instead of manually creating a diff, creating a ticket, etc. ...

I was hoping one of the reasons for moving to git *was* easier contribution.
Comment 16 Martin Birgmeier 2020-12-30 14:04:28 UTC
(In reply to Patrick M. Hausen from comment #3)

This is a reply to an older comment #3...

There never was such a message.

I also tried to find "prevent IPv6" in head and releng/12.2, to no avail (using find ... -exec grep ...).

Where should that message come from?

-- Martin
Comment 17 Patrick M. Hausen 2020-12-30 14:25:06 UTC
(In reply to Martin Birgmeier from comment #16)
https://svnweb.freebsd.org/base/releng/12.2/sys/net/if_bridge.c

line 1207 ff.
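
One possible reason why the search in comment #16 found nothing, shown as a sketch: grep matches line by line, so a phrase is missed if the message is written as several adjacent string literals and the phrase spans the break. The fragment below is a made-up layout for illustration, not a copy of if_bridge.c, and count_matches is a hypothetical helper.

```shell
#!/bin/sh
# Count the lines of a file that contain a fixed string.
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_matches() {
    grep -c -F -- "$1" "$2" || true
}

# Hypothetical source fragment: one message split over two literals.
sample=$(mktemp)
cat > "$sample" <<'EOF'
    "before adding it as a member to prevent "
    "IPv6 address scope violation.\n",
EOF

count_matches 'prevent IPv6' "$sample"     # prints 0: phrase spans the line break
count_matches 'scope violation' "$sample"  # prints 1: phrase sits inside one literal
rm -f "$sample"
```

Searching for a shorter fragment that is unlikely to be split, such as "scope violation", avoids the problem.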
Comment 18 Martin Birgmeier 2020-12-30 14:31:25 UTC
(In reply to Patrick M. Hausen from comment #17)

Thank you.

Which leaves the question as to why such a message never shows up in my setup.

-- Martin
Comment 19 Martin Birgmeier 2021-04-23 05:57:11 UTC
I'd like to come back to this issue.

Basically, I am (still :-)) not assigning the IP addresses to the bridge interface. The major reason for this is that I am assembling/disassembling the bridge and its member interfaces as needed, and I do not want to always have to fiddle with reassigning IP addresses from the member interfaces to the bridge and vice versa.

Which brings me to my point: In normal networking parlance a bridge knows nothing about ISO layer 3 and therefore nothing about IP. Much less does it get an IP address assigned (let us not digress to smart managed devices). So I believe that we have a design issue here: In FreeBSD we are talking about a "bridge" but in reality it is a kludge used to tie some interfaces together. Or at least it is not a bridge in the traditional networking sense.

How difficult would it be to redesign the bridge abstraction in FreeBSD to more closely resemble a real layer 2 bridge?

-- Martin
Comment 20 Lutz Donnerhacke freebsd_committer freebsd_triage 2021-04-23 10:35:12 UTC
There are two bridge implementations in FreeBSD: the classical one you are using, and the netgraph one, ng_bridge, which is much simpler. If you have a problem with the classical one, would you mind giving ng_bridge a try? You may attach ng_eiface virtual interfaces to it, if necessary.

I'm just curious to know which part of the classical bridge is the problematic part.
Comment 21 Martin Birgmeier 2021-04-23 13:45:24 UTC
Hi Lutz,

It seems I would need FreeBSD 13 for this, right? - I am still at 12.2

-- Martin
Comment 22 Lutz Donnerhacke freebsd_committer freebsd_triage 2021-04-23 13:57:35 UTC
The ng_bridge(4) node type was implemented in FreeBSD 4.2.

I bet it will work in 12.x, too.
Comment 23 Martin Birgmeier 2021-04-23 14:24:59 UTC
Isn't https://reviews.freebsd.org/D24620 needed for this?
Comment 24 Lutz Donnerhacke freebsd_committer freebsd_triage 2021-04-23 14:33:57 UTC
Oh, it might be necessary for bhyve VMs. I'm not familiar with this part. I thought about connecting "real" interfaces like eiface inside the VM.