Created attachment 232085 [details] PoC source

==== Background: possible security hole with jail & shared nullfs ====

When two independent jails have the same directory shared via nullfs, they can escape the chroot-ed filesystem.

How this is done:
1) jail A creates a listening UNIX domain socket in the shared directory.
2) jail B connects to that listening socket.
3) jail A does open("/",O_DIRECTORY) and sends this fd over the socket using SCM_RIGHTS.
4) jail B receives the fd and is now able to go through ".." into the host filesystem.

Quick guide how to reproduce (sendfd.c in attachment):

prepare:
> gcc -o sendfd sendfd.c
> mkdir /j /j/1 /j/2
> tar -c -f - /bin /lib /libexec | tar -x -f - -C /j/1
> tar -c -f - /bin /lib /libexec | tar -x -f - -C /j/2
> cp sendfd /j/1/bin
> cp sendfd /j/2/bin
> mkdir /j/1/shared /j/2/shared
> mount -t nullfs /j/2/shared /j/1/shared

first console:
> jail /j/2 x 127.0.0.1 /bin/sendfd listen /shared/2.sock /bin

second console:
> jail /j/1 x 127.0.0.1 /bin/sendfd sh /shared/2.sock
> pwd

and you'll see pwd report "/j/2" and may explore the system root via ../../.

==== Proposed fix ====

A new mount flag "nosockbind" means "do not allow UNIX domain bind()/bindat() to paths on this filesystem". This flag is not transparent over nullfs, so it is possible to mount a bindable nullfs over a non-bindable base partition. Note that connecting to a UNIX domain socket on such a filesystem is still possible.

There are patches for the 10.4, 11.4, 12.3 and 13-stable branches. At least the 12.3 version seems to work fine, but the VFS subsystem is quite complicated and I'm new to it, so there are many things (almost everywhere) that I'm unsure about:

=== vfs_mount.c ===
1) I'm not sure if I should add the option just to global_opts[] in vfs_mount.c or also to the fs-specific lists (at least I've seen noexec and nosuid duplicated from global_opts[] to ffs_opts[]).
2) I'm not sure about the old sys_mount() API; there is some manual handling for ro/nosuid/noexec there.
3) I think user mounts (they automatically get MNT_NOSUID) are not an issue here, am I right?

=== uipc_usrreq.c ===
1) It seems that the mp from vn_start_write is not always the direct mp for the specified vnode (e.g. for nullfs), so I read nd.ni_dvp->v_mount. Is that correct? It seems that ni_dvp can't be NULL here, and also that it is locked and so can't disappear.
2) Could ni_dvp->v_mount be NULL or spontaneously disappear in the middle? I've added a check against NULL, but maybe it is not needed.

=== mount.h ===
Using the unused flag 0x0000020000000000ULL for MNT_NOSOCKBIND. Somewhere in CURRENT, MNT_RECURSE=0x0000100000000000ULL was added, which is larger, but bits 0x00000E0000000000ULL still seem unused.

==== What is not done ====

Since nullfs is marked as jail-friendly, it seems that all of this is still possible when a jail is created with the allow_mount flag, by mounting an unrestricted nullfs over any place. Possible fixes for this:
1) force all jailed nullfs mounts to inherit "nosockbind" from the underlying fs
2) disallow a jailed update of a mount from the "nosockbind" to the "nonosockbind" state
3) make "nosockbind" transparent over nullfs (maybe optional via sysctl)
4) workaround: do not make cross-jail shared nullfs mounts accessible to allow_mount jails
Created attachment 232086 [details] patch for HEAD
Created attachment 232087 [details] patch for 13-stable
Created attachment 232088 [details] patch for 12.3
Created attachment 232089 [details] patch for 11.4
Created attachment 232090 [details] patch for 10.4
It should be noted that there are other ways to escape a jail via shared nullfs when it is set up badly. I did not try to fix them in any way; the patch fixes jail escaping only for properly configured nullfs:
1) both shared mounts have exactly the same underlying base directory
2) the underlying base directory for nullfs is placed on a separate filesystem, otherwise invisible to the jails.
Added revisions for this and for another way to prevent escaping via shared nullfs UNIX socket:
https://reviews.freebsd.org/D34560 - nosockbind option
https://reviews.freebsd.org/D34579 - verify dirfds when receiving them via SCM_RIGHTS
Having jails exchange file descriptors over UNIX sockets using SCM_RIGHTS is a very powerful feature that should be safe to use. If I understand the reported issue correctly, the protection against leaving a jail root using the equivalent of `cd ..` is circumvented by sending a file descriptor for (the root) directory of one jail to another jail. If so, thanks for coming up with this clever idea.

This sounds like a nasty problem to solve because it requires the file descriptor to reference the prison it was created in. I suspect the current kernel code limits relative path lookups by comparing the file descriptor relative to which the lookup is requested against the root directory of the process's prison instead of the root directory of the prison the file descriptor was created in.

While a shared nullfs containing UNIX sockets is the most likely way to encounter the reported issue in the wild, it's not the only way processes can have connected UNIX sockets across jail boundaries. For example, a common parent process could use socketpair() to create a pair of connected UNIX sockets without going through the file system namespace at all, and let each of a pair of child processes inherit one end of the connection before attaching them to different jails. I can see this happening if some application attempts to use jails for very tight sandboxing and the sockets are intended for IPC messages.

What is the expected behaviour for a file descriptor created in a jail when the jail is removed?
* Should the file descriptor get revoked?
* Should the jail be stuck dying until all file descriptors created inside it are closed, and if so, how would an operator find out who holds the remaining file descriptors?
* Something else?
> If I understand the reported issue correctly the protection against leaving a jail root using the equivalent to `cd ..` is circumvented by sending a file descriptor to (the root) directory of one jail to an other jail.

Yes.

> This sounds like a nasty problem to solve because it requires the file descriptor to reference the prison it was created.

While binding file descriptors to the jail where they were created is interesting, it is not required to prevent 'cd ..'. I have already created a patch that verifies all sent directory descriptors against the chroot of the receiver: https://reviews.freebsd.org/D34579

> limits relative path lookups by comparing the file descriptor relative to which the lookup is requested against the root directory of the processes prison instead of the root directory of the prison the file descriptor was created in.

No, it doesn't. It only checks against 'cd ..' across the jail-root directory. If we already have access to some directory fd outside OUR chroot (including a directory from some other jail's root), we can do cd .. from it without restrictions. The patch I mentioned above fixes the problem.
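The ".." behaviour described in this reply can be illustrated with a toy user-space model. This is a sketch under heavy simplification, not kernel code: struct dir, dotdot(), walk_up() and demo_escape() are hypothetical names, and a single root pointer stands in for whatever roots the real lookup compares against.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a directory: only the parent link matters here. */
struct dir {
    const char *name;
    struct dir *parent;     /* NULL for the host's "/" */
};

/*
 * One ".." step. The step is refused only when the current directory IS
 * the caller's root; nothing checks whether cur lies below that root.
 */
struct dir *
dotdot(struct dir *cur, const struct dir *proc_root)
{
    assert(cur != NULL);
    if (cur == proc_root || cur->parent == NULL)
        return (cur);
    return (cur->parent);
}

/* Apply ".." the given number of times. */
struct dir *
walk_up(struct dir *start, const struct dir *proc_root, int steps)
{
    struct dir *cur = start;

    while (steps-- > 0)
        cur = dotdot(cur, proc_root);
    return (cur);
}

/*
 * Layout from the reproduction steps: /j/1 and /j/2 are jail roots.
 * Returns 1 if a process rooted at /j/1 reaches the host "/" when it
 * starts from a descriptor for /j/2, 0 if it does not, and -1 if the
 * model fails to stop at the process's own root.
 */
int
demo_escape(void)
{
    struct dir host = { "/", NULL };
    struct dir j = { "j", &host };
    struct dir j1 = { "1", &j };
    struct dir j2 = { "2", &j };

    /* Starting at our own root, ".." goes nowhere. */
    if (walk_up(&j1, &j1, 2) != &j1)
        return (-1);
    /* Starting at the received /j/2 directory, "../.." is the host root. */
    return (walk_up(&j2, &j1, 2) == &host);
}
```

In this model the check in dotdot() never fires for the received directory because the walk never passes through the receiver's own root, which is the gap the reply describes.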
If I understand your description correctly, the security fix as proposed in the patches wouldn't just fix the exploitable case of using the directory descriptor to dig up the file system tree (e.g. "../../../../../../../etc/master.passwd"), but would also prevent intentionally sharing a directory and anything underneath it.

I started looking into this to write a wrapper that creates a UNIX socketpair, forks a child process, attaches it to a jail and has the child process send it the runtime linker, an executable and the library directories, in order to link and run a dynamically linked executable installed on the host inside a jail without having to pollute the jail's file system subtree with nullfs mounts. Such a command would be a generic alternative to adding a -j option to each command you might want to run in a jail but which isn't available inside the jail, because the jail is too stripped down (single-executable jails, maybe even an empty read-only mounted directory) or incompatible (e.g. Linux branded, new kernel features with older jails).

Would it be possible, instead of preventing the passed directory descriptor from being used at all, to only force the O_RESOLVE_BENEATH flag to be applied to all operations on it? The jail_attach() and jail_set(JAIL_ATTACH) system calls already prevent processes from accidentally bringing directory descriptors into a jail, so the only way it can happen is by passing the descriptor over a socket. In this case I don't consider walking the directory content and deriving file descriptors to the content an exploitable security bug, but a useful feature.
(In reply to crest from comment #10)
> would also prevent intentionally sharing a directory and anything underneath it

Yes. Such sharing also leads to some vulnerabilities. They are not as serious as a jail escape, but they are still privilege escalation. For example, a non-root daemon in jail A may gain root privileges inside jail A if a root daemon in jail B assists it via such a shared directory (for example, by creating a setuid-root binary in it).

So forcing O_RESOLVE_BENEATH is not really enough, but I can imagine that in some cases it may be useful. It is possible anyway, but currently I don't know how many kernel code changes it would take.
Created attachment 257783 [details] patch for 13-STABLE ^Triage: rebase patch.
Created attachment 257784 [details] patch for 14-STABLE ^Triage: rebase patch.
Comment on attachment 232086 [details] patch for HEAD ^Triage: this patch no longer applies to 15-CURRENT.
Created attachment 259060 [details] Security patch for FreeBSD 14.2 Long overdue patch for the security vulnerability. Turns out nullfs is innocent. The problem is that file descriptor passing via UNIX domain sockets violates the implicit invariant that jailed processes contain only directory file descriptors from inside the jail path. The kernel already tracks the credentials used to open a file which includes the jail. Match those against the receiving process's jail.
(In reply to crest from comment #15) I think that the main idea in the proposed patch is right, but it is simpler and more bullet-proof to claim that only (grand-)child jail can send directories to the parent. There, the child/parent relation is as provided by pr_children from struct prison, instead of the path comparison.
My PoC exploit does *not* depend on the parent<->child relationship between jails, only on the receiving jailed process gaining access to a directory file descriptor for a directory outside of its path. The vfs_lookup() function only compares the currently traversed directory against the pr_root directory vnodes of all jails the process belongs to, but *not* against the jail roots of the credentials the directory was opened with.

Take the following example jail configuration:
```
foo {
    path = "/jails/foo";
    bar {
        path = "/jails/bar";
    }
}
```
The jail hierarchy doesn't have to reflect the file system hierarchy. Sending a descriptor for a directory opened inside the "foo.bar" jail to the "foo" jail would allow the receiving process to fchdir() to the received directory, then chdir("../../") into the real host "/" directory, followed by a chroot(".") to let it access the host file systems unrestricted. Without the chroot(2), absolute paths would still be looked up in the jail, which makes running dynamically linked executables a pain if the jail userland is too trimmed down or out of sync with the host. With it you can just use at(1) to have cron(8) run your next exploit stage.
(In reply to Konstantin Belousov from comment #16) Thank you for reviewing my patch. I sent you my PoC exploit (which doesn't involve nullfs) to your @FreeBSD.org address for further testing. I don't feel comfortable attaching it to a public Bugzilla issue before a proper security fix has been released through official channels. I still would like to see a better fix which preserves the safe subset of this (too) powerful IPC building block. I encountered this problem while writing a jexec like command (ab-)using UNIX sockets to fexecve(2) a command installed on the *host* inside a jail (e.g. to configure an empty vnet jail) by having a child process pass its parent the host's runtime linker, library directory and the executable. Such a command would remove the need to add a -j flag to every tool you could possibly need to configure a jail from the host.
(In reply to crest from comment #17)
Doesn't bar, as a child of foo, have its path specified relative to the root of foo?
Also, the main point why pr_path should not be used is that it is invalid right after the moment the reference on the jail root vnode is taken. For instance, the root point can be renamed.
You're right. I didn't consider renaming (or unlinking) the jail root because my jails normally always have a mountpoint as root, which blocks renaming. I guess a simplified, jail-root-only file system even simpler than nullfs would be possible, but mounting filesystems can get expensive if you have lots of processes with lots of file descriptors (to patch the overmounted directory). It would be ugly and a massive POLA violation to have jails implicitly mount a pseudo-fs in their jail root. It would also have to be reference counted if multiple jails share a jail root.

So we can't rely on pr_path for the fix. How about having a fast path if the opening credentials of the file descriptor and the receiving thread reference the same struct prison (or are both NULL), but otherwise scanning upwards along the parent directory chain toward the root until hitting the first jail root (no looping through nested jails)? If it's not inside the jail's root, reject the message with EPERM?
This also brings up the question of whether it's better to have a "brittle" protection in unp_externalize() or to have VFS lookups handle this case gracefully (e.g. by detecting it and forcing resolve-beneath semantics).
Created attachment 259551 [details] Security patch for FreeBSD 15-CURRENT

As Konstantin Belousov pointed out, the jail path is only the path the jail was created with, but the jail root directory could have been renamed, so the jail path is unsuitable.

The latest patch instead adds the prison_cancontain() function to sys/kern/kern_jail.c, which checks if a jail can safely contain a file. It considers a file safe if it was opened by the jail, is not a directory, or is a directory inside the jail. The test of whether a vnode is inside a jail is implemented by the new prision_contains() (also in sys/kern/kern_jail.c). It uses namei(9) to search for a root directory among the vnode's ancestor directories. All traversed directories are locked shared and pushed onto a stack to be released in reverse order.
If a passed file descriptor would be unsafe to deliver to the receiving process, the recvmsg(2) call now fails with EPERM and the references to the files are released.
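The decision rule described for this patch can be written down as a small user-space model. This is only a sketch of the three conditions as stated above, not the patch itself: the model_* names and structs are invented for illustration, and plain parent pointers replace the namei(9) lookups and vnode locking the real code has to do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy vnode: parent link plus a directory flag. */
struct vnode_model {
    struct vnode_model *parent;     /* NULL at the host's "/" */
    int is_dir;
};

/* Toy jail: identified by its root directory vnode, not by a path. */
struct prison_model {
    const struct vnode_model *root;
};

/* Toy open file: the vnode plus the jail whose credentials opened it. */
struct file_model {
    const struct vnode_model *vp;
    const struct prison_model *opened_in;
};

/* Is vp the jail root or one of its descendants? */
int
model_contains(const struct prison_model *pr, const struct vnode_model *vp)
{
    for (; vp != NULL; vp = vp->parent)
        if (vp == pr->root)
            return (1);
    return (0);
}

/* Safe if opened by this jail, not a directory, or a directory inside it. */
int
model_cancontain(const struct prison_model *pr, const struct file_model *fp)
{
    assert(pr != NULL && fp != NULL && fp->vp != NULL);
    if (fp->opened_in == pr)
        return (1);
    if (!fp->vp->is_dir)
        return (1);
    return (model_contains(pr, fp->vp));
}

/* What the receiver would see: 0 if delivered, EPERM if rejected. */
int
model_deliver(const struct prison_model *receiver,
    const struct file_model *fp)
{
    return (model_cancontain(receiver, fp) ? 0 : EPERM);
}
```

With two sibling jail roots, a directory descriptor opened in one jail and sent to the other is rejected in this model, while a regular file, or a directory that lies below the receiver's root, is delivered.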