Bug 273557 - Regression preventing bhyve from running inside a jail without IP after f74147e26999, breaks support for jailing bhyve with IPv4 and IPv6 disabled
Summary: Regression preventing bhyve from running inside a jail without IP after f7414...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bhyve (show other bugs)
Version: 14.0-STABLE
Hardware: Any Any
: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords: bhyve, easy, regression
Depends on:
Blocks: 14.0r
  Show dependency treegraph
 
Reported: 2023-09-04 08:53 UTC by crest
Modified: 2024-04-16 00:35 UTC (History)
12 users (show)

See Also:
freebsd: mfc-stable14?


Attachments
Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS (2.18 KB, patch)
2023-09-04 08:53 UTC, crest
no flags Details | Diff
Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS on FreeBSD 14-stable as well (2.15 KB, patch)
2023-09-06 19:14 UTC, crest
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description crest 2023-09-04 08:53:49 UTC
Created attachment 244627 [details]
Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS

Bhyve used to require either the sysctl net.link.tap.up_on_open=1 or an external wrapper to set the tap/vmnet interfaces link state after the device has been opened. Bjoern A. Zeeb's solution to this uses an IP socket (trying both IPv4 and IPv6). The code as shipped in FreeBSD 13.2 refuses to start bhyve if it can't create an IP socket to set the link state of the tap/vmnet interface.

It turns out there is a better way to set the link state on tap interfaces since there is an equivalent ioctl() available directly on the tap/vmnet device.

The included patch against FreeBSD 13.2 removes the unused variables (ifrq and s) replaces ioctl(s, SIOCGIFFLAGS) on the socket with ioctl(be->fd, VMIO_SIOCSIFFLAGS) on the tap/vmnet device.

The patch restores the ability to run bhyve inside a jail with ip4=disable and ip6=disable. The guest running inside bhyve accesses the network through the tap device without using IP sockets inside the bhyve process. This previously supported configuration provides defense in depth against guest escapes.
Comment 1 crest 2023-09-06 19:14:00 UTC
Created attachment 244684 [details]
Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS on FreeBSD 14-stable as well

Here is the same patch forward ported for FreeBSD 14-stable
Comment 2 Mina Galić freebsd_triage 2023-09-07 12:14:39 UTC
hey re@ folks
crest fixed a regression in bhyve and would like to MFC it to stable/14
Comment 3 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-09-07 14:33:17 UTC
The patch seems to work more by accident because the kernel side adds an IFF_UP and IFF_UP is in the TUN_VMIO_FLAG_MASK set:
{{{
               case VMIO_SIOCSIFFLAGS: /* VMware/VMnet SIOCSIFFLAGS */
                        iflags = *(int *)data;
                        iflags &= TUN_VMIO_FLAG_MASK;
                        iflags &= ~IFF_CANTCHANGE;
                        iflags |= IFF_UP;
}}}

What's the actual error you get?  "Could open socket"?

I think the real solution is to make the code "advisory" and not to error out in case it cannot UP the interface in that case?
Comment 4 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-09-07 14:35:48 UTC
Untested but can you try:

{{{
diff --git usr.sbin/bhyve/net_backends.c usr.sbin/bhyve/net_backends.c
index fa7cd9c81f46..9b4ca9e42c99 100644
--- usr.sbin/bhyve/net_backends.c
+++ usr.sbin/bhyve/net_backends.c
@@ -270,21 +270,22 @@ tap_init(struct net_backend *be, const char *devname,
                s = socket(pf_list[i], SOCK_DGRAM, 0);
        if (s == -1) {
                WPRINTF(("Could open socket"));
-               goto error;
+               goto skip;
        }
 
        if (ioctl(s, SIOCGIFFLAGS, &ifrq) < 0) {
                (void)close(s);
                WPRINTF(("Could not get interface flags"));
-               goto error;
+               goto skip;
        }
        ifrq.ifr_flags |= IFF_UP;
        if (ioctl(s, SIOCSIFFLAGS, &ifrq) < 0) {
                (void)close(s);
                WPRINTF(("Could not set interface flags"));
-               goto error;
+               goto skip;
        }
        (void)close(s);
+skip:
 #endif
 
 #ifndef WITHOUT_CAPSICUM
}}}
Comment 5 crest 2023-09-07 15:50:20 UTC
(In reply to Bjoern A. Zeeb from comment #3)

I've seen that the ioctl() does implicitly bitwise-or in the IFF_UP flag, but it felt cleaner to explicitly request this instead of relying on this undocumented implementation detail. I confirmed it even brings the interface up if a NULL pointer is passed as argument to the ioctl(), but again that would be an ugly workaround. I agree that it would've been better to make this optional or even check the interface link state before overwriting it and only refusing to start the guest if the link can't be brought up, but I don't see any situation where the current implementation of the VMIO_SIOCSIFFLAGS ioctl() would fail on an opened tap device so the error handling code is unreachable on FreeBSD releng/13.2 and stable/14.
Comment 6 crest 2023-09-07 17:04:54 UTC
(In reply to Bjoern A. Zeeb from comment #4)
The untested patch would still attempt to create a socket (which can fail), closing the socket could be factored out (keep only the last close(s), but put a label before and after it). It will allow bhyve to start, but bhyve will fail to bring up the tap/vmnet interface on its own if it can't create an IP socket (either IPv4 or IPv6). For example if bhyve is started inside very restrictive jail with ip=disabled and ip6=disabled. In this case bhyve will start but the tap/vmnet device will stay in its default link state forcing users to either set sysctl net.link.tap.up_on_open=1 to change the default link state for all tap/vmnet interface from DOWN to UP or have a wrapper poll the tap/vmnet device aggressive enough until bhyve has opened it to bring the link state up before the guest notices it. This is complicated by the fact that bhyve(8) exists if the guest requests a reboot. While I've always run my bhyve hosts with net.link.tap.up_on_open=1 bhyve users shouldn't be forced to do change their global system configuration. The corner case I ran into is clearly a bug that should be fixed. I see no advantage to addressing only one half of the bug by still using an IP socket, but continuing if the socket creation or ioctl()s fail.

Processes running inside a jail with both IPv4 and IPv6 disabled can't be create new IP sockets of either IP version. The only way for bhyve to bring up the tap/vmnet device from inside such a jail is to use the ioctl(VMIO_SIOCSIFFLAGS) on the tap/vmnet device. It's also the only place in bhyve that I found with a quick `grep -r` which creates IP sockets. From my understanding ioctl(VMIO_SIOCSIFFLAGS) covers all cases (inside a jail and outside), avoids creating and destroying a socket, saves a few syscalls, removes a (theoretical) race condition, removes failure cases that have to be handled (even if only to report and ignore them).

Am I overlooking a downside to using ioctl(VMIO_SIOCSIFFLAGS) on the tap/vmnet device and getting rid of the IP socket and the variables holding the socket file descriptor number and struct passed to ioctl()s on the socket?
Comment 7 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-09-07 17:58:18 UTC
You can probably do this using a netlink socket these days but the issue stays the same.  There's certain ways to configure interfaces;  taking a short cut now will break eventually...

May I ask how you are connecting the tap to anything?  Are you simply jailing bhyve and creating a tap interface in the base system a along this way?  Makes me wonder why a non-vnet jail allows this in first place...
Comment 8 crest 2023-09-08 08:43:13 UTC
I create the tap interface on the jail host and apply a jail specific devfs ruleset to it allowing only access to those devices bhyve needs e.g. vmm/$name vmm.io/$name.bootrom, tap$n and symlink tap-$name pointing to the renamed tap interface, nmdm devices matching certain patterns, one CTL port for virtio-scsi etc.

The bhyve tap device is a member of a bridge on the jail host. 

The jail isn't vnet enabled because it doesn't require IP sockets at all except for the current code to set the tap interface state to UP. Bhyve doesn't need sockets to read/write Ethernet frames on tap devices. Having an extra vnet would require the jail to also contain an extra bridge with exactly two members (one half of an epair and the tap). The other half of the epair would take the place of the tap device on the host bridge. Such a configuration would be **noticeable** slower, harder to configure, and provide a larger attack surface.
Comment 9 crest 2023-09-12 11:06:00 UTC
(In reply to Bjoern A. Zeeb from comment #7)
I get the impression you don't use the VMIO_SIOCIFFLAGS ioctl if possible and the existing code works if an IP socket can be created to invoke the SIOCGIFFLAGS ioctl on.

Treating the inability to create a IP new socket as non-fatal would allow deploying bhyve in a jail, but it would still depend on the net.link.tap.up_on_open sysctl. I've looked for alternatives, but the VMIO_SIOCIFFLAGS ioctl is only API I've found that's useable from inside jail a without access to an IP stack. It would be possible to fall back to VMIO_SIOCIFFLAGS only if the IP socket creation fails and keep using an IP socket if possible.

I would like bhyve to always use the ioctl directly on the tap device instead of bringing the interface up indirectly by name through a socket ioctl since it covers all cases with a single code path, but any fix that allows an unpatched bhyve to be started inside a jail with both IPv4 and IPv6 disabled would help.

How would you like to proceed? I'll be at EuroBSDcon later this week. Is there anything I can do to help get a fix for this regression into FreeBSD 14.0 (and errata into the older releases if release engineering agrees)?
Comment 10 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-09-12 17:19:55 UTC
(In reply to crest from comment #9)

I wont' be at Euro.  You can find some bhyve or networking or jails people and see. I see talks on all thee subjects.

The real question however (and I hinted at that when I asked) is why a privileged operation on a networking device is allowed in first place inside an IP-based (or a non-IP) jail.   I assume for tun/tap the idea was that you need the device node and that needs manual intervention already?

Another thing I hinted before was netlink;  one should check how ifconfig currently does and IFF_UP maybe in a new netlink world.  That may possibly avoid the IP sockets proper.  I believe you'll find the netlink expert at Euro as well.
Comment 11 crest 2023-09-15 09:22:23 UTC
In my opinion it makes are lot of sense to be able to set the interface link state through the tap device file descriptor, because the process containing it has performed **the** single opening allowed per tap device at a time. The process is trusted to process all Ethernet frames to and from the tap device. Why would you want to have it jump through hoops to set the link state instead of making it available through the capability represented by the file descriptor?
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2023-10-11 18:45:16 UTC
I think I agree with the proposed patch to use VMIO_SIOCSIFFLAGS.  Even if IP is disabled on the jail, tap is an L2 device.  The fact that we have to use an IP socket to configure the interface is bizarre; this overloading of sockets is also problematic for capsicum.

OTOH, I think we can configure IFF_UP with any kind of socket, no?  We could use a PF_ROUTE or PF_NETLINK socket instead.  I don't think netlink itself buys us anything here, ifconfig still uses socket ioctls to set interface flags.  One other reason to avoid VMIO_SIOCSIFFLAGS is that it's really just there for vmnet(4) compatibility, I believe.
Comment 13 crest 2023-10-11 20:09:39 UTC
The patch changes the bhyve tap network backend code invoked to open devices with names starting with "tap" or "vmnet". Since the code is already exclusive to the if_tuntap driver in tap or vmnet mode I can't see any problem with relying on one *more* driver specific ioctl(). As far as I can tell there is no possible configuration where the existing code works as intended, but the patch wouldn't also work.

Netlink sockets shouldn't offer any way for a bhyve process attached to a vnet enabled jail to even address network interface in its parent jail (normally the unjailed host) to bring it up (or down). Anything else would be a security problem. The device file descriptor is the natural API endpoint to bring an interface up. It is the capability to manipulate a tap device through the single opening allowed per tap device (at a time). Having to query the tap device for its interface name or index to reference the interface in a request to change interface state by name or index is also a race condition since the interface could've been destroyed and its name or index reused. Using the file descriptor precludes any race condition similar to funlinkat() vs unlinkat().
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2023-10-12 12:44:17 UTC
(In reply to crest from comment #13)
> Netlink sockets shouldn't offer any way for a bhyve process attached to a vnet enabled jail to even address network interface in its parent jail (normally the unjailed host) to bring it up (or down).

I don't believe I claimed otherwise?  I thought this bug report was about classic jails.

I'm ok with committing a patch to use the /dev/tap ioctl.  Let's wait for a bit to see if Bjoern has any further comments.
Comment 15 crest 2023-10-12 17:13:59 UTC
The bug report is for jails with IP disabled, but the patch cause any regression for bhyve in vnet jails.

Mark Johnston: I didn't mean to imply you said Netlink could do this, but the ioctl() should work even across vnet boundaries.
Comment 16 crest 2023-10-12 18:10:22 UTC
(In reply to crest from comment #15)

*The bug report is for jails with IP disabled, but the patch shouldn't cause any regression for bhyve in vnet jails.
Comment 17 Mark Johnston freebsd_committer freebsd_triage 2023-10-13 15:57:12 UTC
Regarding the patch itself, "up" should be declared at the beginning of the function with the rest of the local variables per style(9).
Comment 18 commit-hook freebsd_committer freebsd_triage 2023-10-17 15:56:22 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=fd8b9c73a5a63a7aa438a73951d7a535b4f25d9a

commit fd8b9c73a5a63a7aa438a73951d7a535b4f25d9a
Author:     Jan Bramkamp <crest+freebsd@rlwinm.de>
AuthorDate: 2023-09-04 08:38:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-17 15:24:11 +0000

    bhyve: Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS

    Creating an IP socket to invoke the SIOCGIFFLAGS ioctl on is the only
    thing preventing bhyve from working inside a bhyve jail with IPv4 and
    IPv6 disabled restricting the jailed bhyve process to only access the
    host network via a tap/vmnet device node.

    PR:             273557
    Fixes:          56be282bc999 ("bhyve: net_backends, automatically IFF_UP tap devices")
    Reviewed by:    markj
    MFC after:      1 week

 usr.sbin/bhyve/net_backends.c | 52 ++++---------------------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)
Comment 19 commit-hook freebsd_committer freebsd_triage 2023-10-24 13:38:17 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=bc6372602a0042e5496d9ab6637558f98af0deef

commit bc6372602a0042e5496d9ab6637558f98af0deef
Author:     Jan Bramkamp <crest+freebsd@rlwinm.de>
AuthorDate: 2023-09-04 08:38:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-24 13:21:26 +0000

    bhyve: Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS

    Creating an IP socket to invoke the SIOCGIFFLAGS ioctl on is the only
    thing preventing bhyve from working inside a bhyve jail with IPv4 and
    IPv6 disabled restricting the jailed bhyve process to only access the
    host network via a tap/vmnet device node.

    PR:             273557
    Fixes:          56be282bc999 ("bhyve: net_backends, automatically IFF_UP tap devices")
    Reviewed by:    markj
    MFC after:      1 week

    (cherry picked from commit fd8b9c73a5a63a7aa438a73951d7a535b4f25d9a)

 usr.sbin/bhyve/net_backends.c | 52 ++++---------------------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)
Comment 20 commit-hook freebsd_committer freebsd_triage 2023-10-24 13:39:20 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=700689bc2abaf860801b3896ceae86b0072f406c

commit 700689bc2abaf860801b3896ceae86b0072f406c
Author:     Jan Bramkamp <crest+freebsd@rlwinm.de>
AuthorDate: 2023-09-04 08:38:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-24 13:21:17 +0000

    bhyve: Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS

    Creating an IP socket to invoke the SIOCGIFFLAGS ioctl on is the only
    thing preventing bhyve from working inside a bhyve jail with IPv4 and
    IPv6 disabled restricting the jailed bhyve process to only access the
    host network via a tap/vmnet device node.

    PR:             273557
    Fixes:          56be282bc999 ("bhyve: net_backends, automatically IFF_UP tap devices")
    Reviewed by:    markj
    MFC after:      1 week

    (cherry picked from commit fd8b9c73a5a63a7aa438a73951d7a535b4f25d9a)

 usr.sbin/bhyve/net_backends.c | 52 ++++---------------------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)
Comment 21 commit-hook freebsd_committer freebsd_triage 2023-10-25 16:57:01 UTC
A commit in branch releng/14.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=952196961fe7849e6b4d517bc2aaa562eaa4b9c9

commit 952196961fe7849e6b4d517bc2aaa562eaa4b9c9
Author:     Jan Bramkamp <crest+freebsd@rlwinm.de>
AuthorDate: 2023-09-04 08:38:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-25 16:52:37 +0000

    bhyve: Use VMIO_SIOCSIFFLAGS instead of SIOCGIFFLAGS

    Creating an IP socket to invoke the SIOCGIFFLAGS ioctl on is the only
    thing preventing bhyve from working inside a bhyve jail with IPv4 and
    IPv6 disabled restricting the jailed bhyve process to only access the
    host network via a tap/vmnet device node.

    Approved by:    re (gjb)
    PR:             273557
    Fixes:          56be282bc999 ("bhyve: net_backends, automatically IFF_UP tap devices")
    Reviewed by:    markj
    MFC after:      1 week

    (cherry picked from commit fd8b9c73a5a63a7aa438a73951d7a535b4f25d9a)
    (cherry picked from commit 700689bc2abaf860801b3896ceae86b0072f406c)

 usr.sbin/bhyve/net_backends.c | 52 ++++---------------------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)
Comment 22 crest 2023-10-27 17:15:55 UTC
It's not fixed because the VMIO_SIOCSIFFLAGS ioctl() takes the argument by value instead of by reference (which is what I assumed).
Comment 23 crest 2023-10-27 17:18:23 UTC
It's not fixed because the VMIO_SIOCSIFFLAGS ioctl() takes the argument by value instead of by reference (which is what I assumed): https://reviews.freebsd.org/D42366. It was just (bad) luck that it still worked by accident.
Comment 24 Mark Johnston freebsd_committer freebsd_triage 2023-10-27 17:34:08 UTC
(In reply to crest from comment #23)
That patch was committed and merged to all affected branches, so there is nothing left to do.
Comment 25 Peter Much 2024-04-13 03:19:09 UTC
(In reply to Mark Johnston from comment #24)

Yes, there seems something to do. After upgrade 13.2 -> 13.3 my guests can no longer reboot, because the ip adresses on the tap are removed when bhyve terminates.
This is because bhyve does now delete the LINK0 flag during startup. (Setting it again after startup helps only for the next reboot.)

Since this change is the only one that seemed somehow related, I reverted it,
and voila, stuff works again. 
Now filed as bug #278338 (This is already the third whopper in 13.3, after 
bug #276862 and the notorious arc_prune malady).
Comment 26 crest 2024-04-16 00:35:06 UTC
(In reply to Peter Much from comment #25)
Can you try again with vmnet interface (same driver accessed through a different name to get slightly different defaults) if the different default link flag changes the behaviour?