Bug 262179 - Prevent jail escaping via shared nullfs; option to disable UNIX domain socket binding
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Security Team
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-24 22:35 UTC by firk
Modified: 2024-03-19 18:49 UTC

Attachments
PoC source (4.57 KB, text/plain)
2022-02-24 22:35 UTC, firk
patch for HEAD (5.36 KB, patch)
2022-02-24 22:37 UTC, firk
patch for 13-stable (5.35 KB, patch)
2022-02-24 22:37 UTC, firk
patch for 12.3 (5.41 KB, patch)
2022-02-24 22:38 UTC, firk
patch for 11.4 (5.37 KB, patch)
2022-02-24 22:38 UTC, firk
patch for 10.4 (5.29 KB, patch)
2022-02-24 22:39 UTC, firk

Description firk 2022-02-24 22:35:53 UTC
Created attachment 232085 [details]
PoC source

==== Background: possible security hole with jail & shared nullfs ====

When two independent jails have the same directory shared via nullfs, cooperating processes in those jails can escape the chroot-ed filesystem.

How this is done:
1) jail A creates a listening UNIX domain socket in the shared directory.
2) jail B connects to that listening socket.
3) jail A does open("/",O_DIRECTORY) and sends this fd over the socket using SCM_RIGHTS.
4) jail B receives the fd and is now able to go through ".." into the host filesystem.
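
The descriptor transfer in steps 3 and 4 is ordinary SCM_RIGHTS passing. The following is a minimal sketch of it and is not the attached sendfd.c; the helper names send_fd() and recv_fd() are made up for illustration, and the sketch assumes an already connected UNIX domain socket.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Step 3: send descriptor `fd` over the connected socket `sock`. */
int
send_fd(int sock, int fd)
{
	char byte = 'F';	/* one byte of ordinary data carries the message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cm;

	assert(fd >= 0);
	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Step 4: receive one descriptor; returns the new fd or -1. */
int
recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cm;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cm = CMSG_FIRSTHDR(&msg);
	if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
	    cm->cmsg_type != SCM_RIGHTS ||
	    cm->cmsg_len != CMSG_LEN(sizeof(int)))
		return (-1);
	memcpy(&fd, CMSG_DATA(cm), sizeof(int));
	return (fd);
}
```

The received descriptor refers to the same open file as the sender's. If it is a directory, the receiver can use it as the starting point for fchdir() or openat(), which is what step 4 relies on.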

Quick guide on how to reproduce (sendfd.c is in the attachment):

prepare:
> gcc -o sendfd sendfd.c
> mkdir /j /j/1 /j/2
> tar -c -f - /bin /lib /libexec | tar -x -f - -C /j/1
> tar -c -f - /bin /lib /libexec | tar -x -f - -C /j/2
> cp sendfd /j/1/bin
> cp sendfd /j/2/bin
> mkdir /j/1/shared /j/2/shared
> mount -t nullfs /j/2/shared /j/1/shared

first console:
> jail /j/2 x 127.0.0.1 /bin/sendfd listen /shared/2.sock /bin

second console:
> jail /j/1 x 127.0.0.1 /bin/sendfd sh /shared/2.sock
> pwd

and you'll see that pwd prints "/j/2", and you may explore the system root via ../../

==== Proposed fix ====
A new mount flag "nosockbind" means "do not allow UNIX domain bind()/bindat() to paths on this filesystem". This flag is not transparent over nullfs, so it is possible to mount a bindable nullfs over a non-bindable base partition.

Note that connecting to a UNIX domain socket on such a filesystem is still possible.

There are patches for the 10.4, 11.4, 12.3 and 13-stable branches. At least the 12.3 version seems to work fine, but the VFS subsystem is quite complicated and I'm new to it, so there are many things (almost everywhere) that I'm unsure about:

=== vfs_mount.c ===
1) I'm not sure if I should add the option just to global_opts[] in vfs_mount.c or also to fs-specific lists (at least I've seen noexec and nosuid duplicated from global_opts[] to ffs_opts[])

2) I'm not sure about the old sys_mount() API; there is some manual handling for ro/nosuid/noexec there.

3) I think user mounts (they automatically get MNT_NOSUID) are not an issue here, am I right?

=== uipc_usrreq.c ===
1) It seems that the mp from vn_start_write is not always the direct mp for the specified vnode (e.g. for nullfs), so I read nd.ni_dvp->v_mount. Is that correct? It seems that ni_dvp can't be NULL here, and also that it is locked and so can't disappear.

2) Could ni_dvp->v_mount be NULL or spontaneously disappear in the middle? I've added a check against NULL, but maybe it is not needed.

=== mount.h ===
Using the unused flag 0x0000020000000000ULL for MNT_NOSOCKBIND.
Somewhere in CURRENT, MNT_RECURSE=0x0000100000000000ULL was added, which is larger, but the bits 0x00000E0000000000ULL still seem unused.
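
The bit arithmetic above can be checked with a small standalone sketch. The three constants are copied from this report and defined locally rather than taken from sys/mount.h; MNT_UNUSED_MASK and lowest_bit_index() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted from the report; defined locally, not from sys/mount.h. */
#define MNT_NOSOCKBIND	0x0000020000000000ULL	/* proposed by the patch */
#define MNT_RECURSE	0x0000100000000000ULL	/* present in CURRENT */
#define MNT_UNUSED_MASK	0x00000E0000000000ULL	/* hypothetical name */

/* The proposed flag is a single bit inside the range said to be unused. */
static_assert((MNT_NOSOCKBIND & (MNT_NOSOCKBIND - 1)) == 0,
    "MNT_NOSOCKBIND must be a single bit");
static_assert((MNT_NOSOCKBIND & MNT_UNUSED_MASK) == MNT_NOSOCKBIND,
    "MNT_NOSOCKBIND must lie inside the unused range");
static_assert((MNT_NOSOCKBIND & MNT_RECURSE) == 0,
    "MNT_NOSOCKBIND must not collide with MNT_RECURSE");

/* Index of the lowest set bit, or -1 if no bit is set. */
int
lowest_bit_index(uint64_t v)
{
	int i = 0;

	if (v == 0)
		return (-1);
	while ((v & 1) == 0) {
		v >>= 1;
		i++;
	}
	return (i);
}
```

This only confirms that the quoted values do not overlap each other; whether other in-tree flags use the same bits still has to be checked against the actual headers of each branch.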


==== What was not done ====
Since nullfs is marked as jail-friendly, it seems that it is still possible to do all this when a jail is created with the allow_mount flag, by mounting an unrestricted nullfs over any place. Possible fixes for this:
1) force all jailed nullfs mounts to inherit "nosockbind" from the underlying fs
2) disallow a jailed update of a mount from the "nosockbind" to the "nonosockbind" state
3) make "nosockbind" transparent over nullfs (maybe optionally, via sysctl)
4) workaround: do not make cross-jail shared nullfs accessible to allow_mount jails
Comment 1 firk 2022-02-24 22:37:07 UTC
Created attachment 232086 [details]
patch for HEAD
Comment 2 firk 2022-02-24 22:37:43 UTC
Created attachment 232087 [details]
patch for 13-stable
Comment 3 firk 2022-02-24 22:38:17 UTC
Created attachment 232088 [details]
patch for 12.3
Comment 4 firk 2022-02-24 22:38:43 UTC
Created attachment 232089 [details]
patch for 11.4
Comment 5 firk 2022-02-24 22:39:09 UTC
Created attachment 232090 [details]
patch for 10.4
Comment 6 firk 2022-02-24 23:15:24 UTC
It should be noted that there are other ways to escape a jail via shared nullfs when it is set up badly. I did not try to fix them in any way; the patch fixes jail escaping only for a properly configured nullfs:

1) both shared mounts have exactly the same underlying base directory
2) the underlying base directory for nullfs is placed on a separate filesystem that is otherwise invisible to the jails.
Comment 7 firk 2022-03-16 11:20:12 UTC
Added revisions for this and for another way to prevent escaping via shared nullfs UNIX socket

https://reviews.freebsd.org/D34560 - nosockbind option

https://reviews.freebsd.org/D34579 - verify dirfds when receiving them via SCM_RIGHTS
Comment 8 crest 2023-07-25 10:30:19 UTC
Having jails exchange file descriptors over unix sockets using SCM_RIGHTS is a very powerful feature that should be safe to use. If I understand the reported issue correctly, the protection against leaving a jail root using the equivalent of `cd ..` is circumvented by sending a file descriptor for the (root) directory of one jail to another jail. If so, thanks for coming up with this clever idea.
 
This sounds like a nasty problem to solve because it requires the file descriptor to reference the prison it was created in. I suspect the current kernel code limits relative path lookups by comparing the file descriptor relative to which the lookup is requested against the root directory of the process's prison instead of the root directory of the prison the file descriptor was created in.

While a shared nullfs containing unix sockets is the most likely way to encounter the reported issue in the wild, it's not the only way processes can have connected unix sockets across jail boundaries. For example, a common parent process could use socketpair() to create a pair of connected unix sockets without going through the file system namespace at all, and have a pair of child processes each inherit one end of the connection before attaching them to different jails. I can see this happening if some application attempts to use jails for very tight sandboxing and the sockets are intended for IPC messages.
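
The socketpair() scenario can be sketched as follows. This is only an illustration of the inheritance pattern: the function name relay_between_children() is made up, and the jail_attach() calls that a real sandbox would make in each child are left out, since they need root privileges and existing jails.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a socketpair in the parent and let two children inherit one end
 * each. Child A sends one byte, child B checks that it arrives.
 * Returns 0 if both children succeeded, -1 otherwise.
 */
int
relay_between_children(void)
{
	int sv[2], sa = 0, sb = 0;
	pid_t a, b;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	assert(sv[0] >= 0 && sv[1] >= 0);

	a = fork();
	if (a == 0) {
		/* Child A: a real sandbox would attach to its jail here. */
		close(sv[1]);
		_exit(write(sv[0], "A", 1) == 1 ? 0 : 1);
	}

	b = (a > 0) ? fork() : -1;
	if (b == 0) {
		/* Child B: would attach to a different jail here. */
		char c = 0;

		close(sv[0]);
		_exit(read(sv[1], &c, 1) == 1 && c == 'A' ? 0 : 1);
	}

	/* The parent keeps neither end; no filesystem path was ever used. */
	close(sv[0]);
	close(sv[1]);

	if (a > 0 && waitpid(a, &sa, 0) != a)
		return (-1);
	if (b > 0 && waitpid(b, &sb, 0) != b)
		return (-1);
	if (a < 0 || b < 0)
		return (-1);
	return (WIFEXITED(sa) && WEXITSTATUS(sa) == 0 &&
	    WIFEXITED(sb) && WEXITSTATUS(sb) == 0 ? 0 : -1);
}
```

Since no bind() to a filesystem path takes place in this pattern, a mount flag that only restricts bind() would not come into play for such a connection.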

What is the expected behaviour for a file descriptor created in a jail when the jail is removed?
* Should the file descriptor get revoked?
* Should the jail be stuck dying until all file descriptors created inside it are closed, and if so, how would an operator find out who holds the remaining file descriptors?
* Something else?
Comment 9 firk 2023-11-23 21:26:14 UTC
> If I understand the reported issue correctly the protection against leaving a jail root using the equivalent to `cd ..` is circumvented by sending a file descriptor to (the root) directory of one jail to an other jail. 

Yes.

> This sounds like a nasty problem to solve because it requires the file descriptor to reference the prison it was created.

While binding file descriptors to the jail where they were created is interesting, this is not required to prevent 'cd ..'. I have already created a patch that verifies all sent directory descriptors against the chroot of the receiver: https://reviews.freebsd.org/D34579

> limits relative path lookups by comparing the file descriptor relative to which the lookup is requested against the root directory of the processes prison instead of the root directory of the prison the file descriptor was created in. 

No, it doesn't. It only checks against 'cd ..' across the jail root directory. If we already have access to some directory fd outside OUR chroot (including a directory from some other jail's root), we can do cd .. from it without restrictions. The patch I mentioned above fixes the problem.
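
The behaviour described here can be illustrated with a toy userland model. This is not kernel code and not the D34579 patch: struct node, dotdot(), climb() and is_beneath() are hypothetical names, and is_beneath() is only one possible reading of "verify against the chroot of the receiver".

```c
#include <assert.h>
#include <stddef.h>

/* Toy directory: parent is NULL for the host root. */
struct node {
	const char *name;
	struct node *parent;
};

/* ".." is refused only when the current directory IS the caller's root. */
struct node *
dotdot(struct node *cur, struct node *proc_root)
{
	assert(cur != NULL);
	if (cur == proc_root || cur->parent == NULL)
		return (cur);
	return (cur->parent);
}

/* Apply ".." `steps` times starting from `start`. */
struct node *
climb(struct node *start, struct node *proc_root, int steps)
{
	struct node *cur = start;

	while (steps-- > 0)
		cur = dotdot(cur, proc_root);
	return (cur);
}

/* Returns 1 if `d` is `root` itself or lies somewhere below it. */
int
is_beneath(const struct node *d, const struct node *root)
{
	for (; d != NULL; d = d->parent)
		if (d == root)
			return (1);
	return (0);
}
```

In this model, a process rooted at /j/1 that starts from its own root never leaves it, while the same process starting from a received descriptor for /j/2 reaches the host root in two steps because it never passes through /j/1. A receive-time check such as is_beneath() rejects the /j/2 descriptor before it can be used.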
Comment 10 crest 2024-03-05 20:03:01 UTC
If I understand your description correctly, the security fix as proposed in the patches wouldn't just fix the exploitable case of using the directory descriptor to climb up the file system tree (e.g. "../../../../../../../etc/master.passwd"), but would also prevent intentionally sharing a directory and anything underneath it. I started looking into this in order to write a wrapper that creates a unix socketpair, forks a child process, attaches it to a jail, and sends the child process the runtime linker, an executable, and the library directories, so that a dynamically linked executable installed on the host can be linked and run inside a jail without having to pollute the jail's file system subtree with nullfs mounts.

Such a command would be a generic alternative to adding a -j option to each command you might want to run in jail, but isn't available inside the jail because the jail is too stripped down (single executable jails, maybe even an empty read-only mounted directory) or incompatible (e.g. Linux branded, new kernel features with older jails).

Would it be possible, instead of preventing the passed directory descriptor from being used at all, to only force the O_RESOLVE_BENEATH flag onto all operations on it? The jail_attach() and jail_set(JAIL_ATTACH) system calls already prevent processes from accidentally bringing directory descriptors into a jail, so the only way it can happen is by passing the descriptor over a socket. In this case I don't consider walking the directory contents and deriving file descriptors for them an exploitable security bug, but a useful feature.
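
For reference, this is roughly what the O_RESOLVE_BENEATH semantics look like when a process opts in voluntarily from userland. The sketch does not show the kernel-enforced variant asked about above; open_beneath() is a hypothetical helper, and the #else branch only exists so that the sketch still compiles on systems whose fcntl.h lacks the flag.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `relpath` relative to `dirfd`, refusing any lookup that would
 * leave the directory tree below `dirfd`. Returns a new fd or -1.
 */
int
open_beneath(int dirfd, const char *relpath)
{
	assert(relpath != NULL);
#ifdef O_RESOLVE_BENEATH
	return (openat(dirfd, relpath, O_RDONLY | O_RESOLVE_BENEATH));
#else
	/* Flag not available here: fail closed instead of opening unsafely. */
	(void)dirfd;
	errno = ENOSYS;
	return (-1);
#endif
}
```

With the flag in effect, a lookup such as "../x" or an absolute path is rejected, while paths that stay below the directory keep working. That matches the "share a directory and everything underneath it" use case.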
Comment 11 firk 2024-03-19 18:49:28 UTC
(In reply to crest from comment #10)

> would also prevent intentionally sharing a directory and anything underneath it

Yes. Such sharing also leads to some vulnerabilities: not as serious as jail escaping, but privilege escalation nonetheless. For example, a non-root daemon in jail A may gain root privileges inside jail A if a root daemon in jail B assists it via such a shared directory (for example, by creating a setuid-root binary in it).

So forcing O_RESOLVE_BENEATH is not really enough, but I can imagine that it may be useful in some cases. It is possible anyway, but currently I don't know how many kernel code changes would be needed to do that.