Created attachment 232085 [details] PoC source

==== Background: possible security hole with jail & shared nullfs ====

When two independent jails have the same directory shared via nullfs, they can escape the chroot-ed filesystem. How this is done:
1) jail A creates a listening UNIX domain socket in the shared directory.
2) jail B connects to that listening socket.
3) jail A does open("/",O_DIRECTORY) and sends this fd over the socket using SCM_RIGHTS.
4) jail B receives the fd and is now able to go through ".." into the host filesystem.

Quick guide how to reproduce (sendfd.c in attachment):

prepare:
> gcc -o sendfd sendfd.c
> mkdir /j /j/1 /j/2
> tar -c -f - /bin /lib /libexec | tar -x -f - -C /j/1
> tar -c -f - /bin /lib /libexec | tar -x -f - -C /j/2
> cp sendfd /j/1/bin
> cp sendfd /j/2/bin
> mkdir /j/1/shared /j/2/shared
> mount -t nullfs /j/2/shared /j/1/shared

first console:
> jail /j/2 x 127.0.0.1 /bin/sendfd listen /shared/2.sock /bin

second console:
> jail /j/1 x 127.0.0.1 /bin/sendfd sh /shared/2.sock
> pwd

and you'll see pwd "/j/2" and may explore ../../ system root.

==== Proposed fix ====

New mount flag "nosockbind" means "do not allow UNIX domain bind()/bindat() to paths on this filesystem". This flag is not transparent over nullfs, so it is possible to mount a bindable nullfs over a non-bindable base partition. Note that connecting to a UNIX domain socket on such a filesystem is still possible.

There are patches for the 10.4, 11.4, 12.3 and 13-stable branches. At least the 12.3 version looks to be working fine, but the VFS subsystem is quite complicated, and I'm new to it, so there are many things (almost everywhere) that I'm unsure about:

=== vfs_mount.c ===
1) I'm not sure if I should add the option just to global_opts[] in vfs_mount.c or also to fs-specific lists (at least I've seen noexec and nosuid duplicated from global_opts[] to ffs_opts[]).
2) I'm not sure about the old sys_mount() API; there is some manual handling for ro/nosuid/noexec there.
3) I think user mounts (they automatically get MNT_NOSUID) are not an issue here, am I right?

=== uipc_usrreq.c ===
1) It seems that the mp from vn_start_write is not always the direct mp for the specified vnode (e.g. for nullfs), so I read nd.ni_dvp->v_mount. Is that correct? It seems that ni_dvp can't be NULL here, and also that it is locked and so can't disappear.
2) Could ni_dvp->v_mount be NULL or spontaneously disappear in the middle? I've added a check against NULL but maybe it is not needed.

=== mount.h ===
Using unused flag 0x0000020000000000ULL for MNT_NOSOCKBIND. Somewhere in CURRENT MNT_RECURSE=0x0000100000000000ULL was added, which is larger, but bits 0x00000E0000000000ULL still seem unused.

==== What was not done ====

Since nullfs is marked as jail-friendly, it seems that it is still possible to do all this when a jail is created with the allow_mount flag, by mounting an unrestricted nullfs over any place. Possible fixes for this:
1) force all jailed nullfs mounts to inherit "nosockbind" from the underlying fs
2) disallow a jailed update of a mount from the "nosockbind" to the "nonosockbind" state
3) make "nosockbind" transparent over nullfs (maybe optionally, via sysctl)
4) workaround: do not make crossjail-shared nullfs accessible to allow_mount jails
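For readers who have not used descriptor passing, steps 3) and 4) of the background above boil down to one sendmsg()/recvmsg() pair carrying an SCM_RIGHTS control message. The following is a minimal sketch of that mechanism only, not the attached sendfd.c; the helper names send_fd() and recv_fd() are made up for illustration, and sending one byte of ordinary data alongside the descriptor is common practice rather than something this thread specifies.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized for exactly one descriptor, aligned for cmsghdr. */
union fd_ctl {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one descriptor over a connected UNIX socket.
 * Returns 0 on success, -1 on failure. */
static int
send_fd(int sock, int fd)
{
	char data = 'F';	/* one byte of ordinary payload */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd >= 0);
	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Hypothetical helper: receive one descriptor.
 * Returns the new descriptor number, or -1 on failure. */
static int
recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return (-1);
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return (fd);
}
```

The receiver gets a new descriptor number that refers to the same open file as the sender's. When that file is a directory outside the receiver's root, nothing in this exchange by itself limits later ".." lookups relative to it, which is the behaviour the report describes.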
Created attachment 232086 [details] patch for HEAD
Created attachment 232087 [details] patch for 13-stable
Created attachment 232088 [details] patch for 12.3
Created attachment 232089 [details] patch for 11.4
Created attachment 232090 [details] patch for 10.4
It should be noted that there are other ways to escape a jail via shared nullfs when it is set up badly. I did not try to fix them in any way; the patch fixes jail escaping only for a properly configured nullfs:
1) both shared mounts have exactly the same underlying base directory
2) the underlying base directory for nullfs is placed on a separate filesystem, otherwise invisible to the jails.
Added revisions for this and for another way to prevent escaping via a shared nullfs UNIX socket:
https://reviews.freebsd.org/D34560 - nosockbind option
https://reviews.freebsd.org/D34579 - verify dirfds when receiving them via SCM_RIGHTS
Having jails exchange file descriptors over unix sockets using SCM_RIGHTS is a very powerful feature that should be safe to use. If I understand the reported issue correctly, the protection against leaving a jail root using the equivalent of `cd ..` is circumvented by sending a file descriptor for (the root) directory of one jail to another jail. If so, thanks for coming up with this clever idea.

This sounds like a nasty problem to solve because it requires the file descriptor to reference the prison it was created in. I suspect the current kernel code limits relative path lookups by comparing the file descriptor relative to which the lookup is requested against the root directory of the process's prison instead of the root directory of the prison the file descriptor was created in.

While a shared nullfs containing unix sockets is the most likely way to encounter what has been reported in the wild, it's not the only way processes can have connected unix sockets across jail boundaries. For example, a common parent process could use socketpair() to create a pair of connected unix sockets without going through the file system namespace at all, and have a pair of child processes each inherit one end of the connection before attaching them to different jails. I can see this happening if some application attempts to use jails for very tight sandboxing and the sockets are intended for IPC messages.

What is the expected behaviour for a file descriptor created in a jail when the jail is removed?
* Should the file descriptor get revoked?
* Should the jail be stuck dying until all file descriptors created inside it are closed, and if so, how would an operator find out who holds the remaining file descriptors?
* Something else?
> If I understand the reported issue correctly the protection against leaving a jail root using the equivalent to `cd ..` is circumvented by sending a file descriptor to (the root) directory of one jail to an other jail.

Yes.

> This sounds like a nasty problem to solve because it requires the file descriptor to reference the prison it was created.

While binding file descriptors to the jail where they were created is interesting, this is not required to prevent 'cd ..'. I already created a patch that verifies all sent directory descriptors against the chroot of the receiver: https://reviews.freebsd.org/D34579

> limits relative path lookups by comparing the file descriptor relative to which the lookup is requested against the root directory of the processes prison instead of the root directory of the prison the file descriptor was created in.

No, it doesn't. It only checks against 'cd ..' across the jail-root directory. If we already have access to some directory fd outside OUR chroot (including a directory from some other jail root), we can do cd .. from it without restrictions. The patch I mentioned above fixes the problem.
If I understand your description correctly, the security fix as proposed in the patches wouldn't just fix the exploitable case of using the directory descriptor to dig up the file system tree (e.g. "../../../../../../../etc/master.passwd"), but would also prevent intentionally sharing a directory and anything underneath it.

I started looking into this to write a wrapper that creates a unix socketpair, forks a child process, attaches it to a jail and has the child process send it the runtime linker, an executable and the library directories to link and run a dynamically linked executable installed on the host inside a jail without having to pollute the jail file system subtree with nullfs mounts. Such a command would be a generic alternative to adding a -j option to each command you might want to run in a jail but which isn't available inside the jail because the jail is too stripped down (single-executable jails, maybe even an empty read-only mounted directory) or incompatible (e.g. Linux branded, new kernel features with older jails).

Would it be possible, instead of preventing the passed directory descriptor from being used at all, to only force the O_RESOLVE_BENEATH flag to be applied to all operations on it? The jail_attach() and jail_set(JAIL_ATTACH) system calls already prevent processes from accidentally bringing directory descriptors into a jail, so the only way it can happen is by passing the descriptor over a socket. In this case I don't consider walking the directory content and deriving file descriptors for the content an exploitable security bug, but a useful feature.
(In reply to crest from comment #10)

> would also prevent intentionally sharing a directory and anything underneath it

Yes. Such sharing also leads to some vulnerabilities. They are not as serious as jail escaping, but they are still privilege escalation. For example, a non-root daemon in jail A may gain root privileges inside jail A if a root daemon from jail B assists it via such a shared directory (for example, by creating a setuid-root binary in it). So, forcing O_RESOLVE_BENEATH is not really enough, but I can imagine that in some cases it may be useful. It is possible anyway, but currently I don't know how many kernel code changes would be needed to do that.
Created attachment 257783 [details] patch for 13-STABLE ^Triage: rebase patch.
Created attachment 257784 [details] patch for 14-STABLE ^Triage: rebase patch.
Comment on attachment 232086 [details] patch for HEAD ^Triage: this patch no longer applies to 15-CURRENT.
Created attachment 259060 [details] Security patch for FreeBSD 14.2

Long overdue patch for the security vulnerability. Turns out nullfs is innocent. The problem is that file descriptor passing via UNIX domain sockets violates the implicit invariant that jailed processes contain only directory file descriptors from inside the jail path. The kernel already tracks the credentials used to open a file, which include the jail. Match those against the receiving process's jail.
(In reply to crest from comment #15) I think that the main idea in the proposed patch is right, but it is simpler and more bullet-proof to claim that only (grand-)child jail can send directories to the parent. There, the child/parent relation is as provided by pr_children from struct prison, instead of the path comparison.
My PoC exploit does *not* depend on the parent<->child relationship between jails, only on the receiving jailed process gaining access to a directory file descriptor for a directory outside of its path. The vfs_lookup() function only compares the currently traversed directory against the pr_root directory vnodes of all jails the process belongs to, but *not* against the jail roots of the credentials the directory was opened with.

Take the following example jail configuration:

```
foo {
	path = "/jails/foo";
	bar {
		path = "/jails/bar";
	}
}
```

The jail hierarchy doesn't have to reflect the file system hierarchy. Sending a descriptor for a directory opened inside the "foo.bar" jail to the "foo" jail would allow the receiving process to fchdir() to the received directory, then chdir("../../") into the real host "/" directory, followed by a chroot(".") to let it access the host file systems unrestricted. Without the chroot(2), absolute paths would still be looked up in the jail, which makes running dynamically linked executables a pain if the jail userland is too trimmed down or out of sync with the host. With it you can just use at(1) to have cron(8) run your next exploit stage.
(In reply to Konstantin Belousov from comment #16) Thank you for reviewing my patch. I sent you my PoC exploit (which doesn't involve nullfs) to your @FreeBSD.org address for further testing. I don't feel comfortable attaching it to a public Bugzilla issue before a proper security fix has been released through official channels. I still would like to see a better fix which preserves the safe subset of this (too) powerful IPC building block. I encountered this problem while writing a jexec like command (ab-)using UNIX sockets to fexecve(2) a command installed on the *host* inside a jail (e.g. to configure an empty vnet jail) by having a child process pass its parent the host's runtime linker, library directory and the executable. Such a command would remove the need to add a -j flag to every tool you could possibly need to configure a jail from the host.
(In reply to crest from comment #17) Isn't bar, as a child of foo, given a path specified relative to the root of foo?
Also, the main point why pr_path should not be used is that it is invalid right after the moment the reference on the jail root vnode is taken. For instance, the root point can be renamed.
You're right. I didn't consider renaming (or unlinking) the jail root because my jails normally have a mountpoint as root, which blocks renaming. I guess having a simplified, jail-root-only, even simpler nullfs would be possible, but mounting filesystems can get expensive if you have lots of processes with lots of file descriptors (to patch the overmounted directory). It would be ugly and a massive POLA violation to have jails implicitly mount a pseudo-fs in their jail root. It would also have to be reference counted if multiple jails share a jail root. So we can't rely on pr_path for the fix.

How about having a fast path if the opening credentials of the file descriptor and the receiving thread reference the same struct prison (or are both NULL), but otherwise scanning upwards along the parent directory chain toward the root until hitting the first jail root (no looping through nested jails)? If it's not inside the jail's root, reject the message with EPERM.
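The proposal above can be modelled in userspace to make the decision rule concrete. This is only a sketch under stated assumptions: struct jail_model, struct dir_model and check_received_dir() are hypothetical names, directories are a toy tree with parent pointers rather than vnodes, the host is treated as a jail rooted at the top of the tree, and the model ignores locking and any rename that happens after the check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct prison. */
struct jail_model {
	int id;
};

/* Hypothetical stand-in for a directory vnode. */
struct dir_model {
	const struct dir_model *parent;		/* NULL at the top of the tree */
	const struct jail_model *root_of;	/* jail rooted here, or NULL */
};

/*
 * Returns 0 if the receiver may keep the directory, EPERM otherwise.
 * Fast path: opener and receiver are the same jail.
 * Slow path: walk toward the top and stop at the first jail root found;
 * accept only if that root belongs to the receiver.
 */
static int
check_received_dir(const struct dir_model *dir,
    const struct jail_model *opener, const struct jail_model *receiver)
{
	const struct dir_model *d;

	assert(receiver != NULL);
	if (opener == receiver)
		return (0);
	for (d = dir; d != NULL; d = d->parent) {
		if (d->root_of != NULL)
			return (d->root_of == receiver ? 0 : EPERM);
	}
	return (EPERM);
}
```

With a tree such as /jails/foo and /jails/bar, a directory under /jails/bar opened in "bar" and delivered to "foo" is rejected, while a directory under /jails/foo is accepted regardless of who opened it.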
This also brings up the question if it's better to have a "brittle" protection in unp_externalize() or if it would be better to have VFS lookups handle this case gracefully (e.g. by detecting it and forcing a resolve beneath semantic)?
Created attachment 259551 [details] Security patch for FreeBSD 15-CURRENT

As Konstantin Belousov pointed out, the jail path is only the path the jail was created with, and the jail root directory could have been renamed since, so the jail path is unsuitable. The latest patch instead adds the prison_cancontain() function to sys/kern/kern_jail.c, which checks if a jail can safely contain a file. It considers a file safe if it was opened by the jail, is not a directory, or is a directory inside the jail. The test for whether a vnode is inside a jail is implemented by the new prison_contains() (also in sys/kern/kern_jail.c). It uses namei(9) to search for a root directory among the vnode's ancestor directories. All traversed directories are locked shared and pushed to a stack to be released in reverse order.
If a passed file descriptor would be unsafe to deliver into the receiving process, the recvmsg(2) call now fails with EPERM and the references to the files are released.
I proposed a patch [1] for this exploitable bug almost a month ago. Is there anything I can do to get it reviewed or committed as is? [1]: https://bz-attachments.freebsd.org/attachment.cgi?id=259551
^Triage: to firk@cantconnect.ru: do you agree that the patch from crest@ is a better solution than your two patches? If so I will go ahead and mark them obsolete.
(In reply to Mark Linimon from comment #26) Neither patch is a solution. In fact, there is no agreement that there is a problem to solve, to start with.
@Konstantin Belousov: Are you arguing that sharing a directory between two jails via nullfs (to intentionally allow unix socket communication) should result in a jail escape vulnerability? Did you look at my 2nd patch?
Having multiple jails share a nullfs mount isn't uncommon. I think it's fairly surprising that two cooperating jails with a writeable shared unix socket can potentially break out of the chroot. It makes some sense that the parent jail is allowed to pass arbitrary fds to a child jail, but sibling jails should be firewalled.

What about the following semantics?
- When transmitting SCM_RIGHTS, the kernel additionally passes a ucred pointer obtained from the transmitting thread.
- When externalizing SCM_RIGHTS, the kernel compares the passed ucred with that of the receiving process. If the transmitter and receiver belong to the same prison, or the receiver's prison is a descendant of the transmitter's prison, or the receiver and transmitter prisons have the same root vnode, then file handles for directory vnodes may be passed. Otherwise any such file handles are replaced with some tombstone.

I suspect this would be good enough to prevent the escape, without breaking compatibility too badly.
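The three conditions in the second bullet can be written down as a small predicate. This is a userspace sketch, not kernel code: struct prison_model and may_pass_dirfd() are hypothetical names, an integer stands in for the identity of a root vnode, and the tombstone replacement itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct prison. */
struct prison_model {
	const struct prison_model *parent;	/* NULL for prison0 */
	int root_vnode;				/* stand-in for the root vnode identity */
};

/* True if "child" is a strict descendant of "ancestor". */
static bool
is_descendant(const struct prison_model *child,
    const struct prison_model *ancestor)
{
	const struct prison_model *p;

	for (p = child->parent; p != NULL; p = p->parent) {
		if (p == ancestor)
			return (true);
	}
	return (false);
}

/*
 * Directory descriptors may be passed if sender and receiver are the
 * same prison, the receiver is a descendant of the sender, or both
 * prisons share a root vnode. Otherwise the handle would be replaced
 * with a tombstone (not modelled here).
 */
static bool
may_pass_dirfd(const struct prison_model *sender,
    const struct prison_model *receiver)
{
	assert(sender != NULL && receiver != NULL);
	return (sender == receiver ||
	    is_descendant(receiver, sender) ||
	    sender->root_vnode == receiver->root_vnode);
}
```

Under this rule the host may pass directories to any jail and a parent may pass to its children, while siblings with different roots and child-to-parent transfers are refused.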
(In reply to Mark Johnston from comment #29) Then we can just disable passing VDIRs if the destination prison is not the same as the sender's. Maybe with a knob.
(In reply to Konstantin Belousov from comment #30) Maybe this is too blunt a hammer. I suspect there are cases where it's useful for the host to pass fds to a jailed process, even directory fds. Maybe some tri-state sysctl, implementing both suggestions, is sufficient.
(In reply to Mark Johnston from comment #29) Mark's proposed semantics are still insecure, because the jail parent<->child relationship doesn't have to exist in the file system namespace. Take the following valid (and not even unreasonable) jail.conf snippet:

    path = "/jails/$name";
    foo {
        ...
        bar {
            ...
        }
    }

Since nested jail names are joined with "." (not "/"), this would result in two jails named "foo" and "foo.bar" with corresponding jail roots of "/jails/foo" and "/jails/foo.bar". The child jail "foo.bar" would be a sibling in the file system namespace, yet under the proposed semantics processes in the parent jail "foo" would be allowed to receive directory file descriptors sent by processes in the child "foo.bar". That would be exploitable, as the jail implementation assumes that a jail will only contain directory file descriptors for directories under the jail root. A possible exploit would follow these steps:

* A process in the child jail "foo.bar" opens a descriptor for its root directory ("/jails/foo.bar" on the host) and sends it to a receiving process in the "foo" jail.
* The receiver would be able to fchdir(2) via the received descriptor to a directory outside of its jail root (jail = "/jails/foo", cwd = "/jails/foo.bar"), then chdir("../..") from "/jails/foo.bar" to the true root directory, and finally chroot(".") to its working directory (jail = "/jails/foo", cwd = "/", root = "/").
* The receiver has escaped the jail root directory as if the jail was configured with path = "/". If the receiver was root in its jail it can now use at(1), modify authorized_keys files, etc.

Mark Johnston: If you haven't gotten a copy, I can send you my thoroughly commented exploit code shared with the security team to save you the time of writing your own PoC exploit from the discussion. My exploit doesn't even rely on nullfs. Instead I use socketpair() before attaching newly created jails, only to leave them again (effectively undoing jail_attach(2)).
I'll join today's jail production user and developers call in 20 minutes if you want to talk after the recorded part.
(In reply to Mark Johnston from comment #31) Wait, isn't exactly this action, passing a dirfd from the parent to the subordinate jail, the problem? Unless we can guarantee that the dirfd is under the hierarchy of the jail rootdir, which we cannot.
(In reply to Mark Johnston from comment #31) My patch is intended to preserve the ability to pass exactly those directory file descriptors to jailed processes that they can safely contain, by searching for the jail's root directory vnode among each directory file descriptor's ancestors when the receiving process is in a different jail than the opening process. If a jail tries to receive a descriptor it can't safely contain, the descriptors are closed and the recvmsg(2) call fails with EPERM.

I just noticed that I haven't considered what happens if the sender uses multiple SCM_RIGHTS ancillary messages attached to a single message. The security check has to be performed on all messages before the first descriptor is externalized. Annoying, but not too hard to fix.
(In reply to Konstantin Belousov from comment #33) My second patch does (try to) verify that each received descriptor was either opened inside the receiving jail or is a descendant of the jail root vnode.
(In reply to crest from comment #34) Your patch is creating a large collection of LoRs, it locks vnodes in the wrong order. But it does not make sense to fix it, because what you are trying to do cannot be done. Even if there is a (complicated) code to check that dirvp2 is under the dirvp1 hierarchy, this check is invalidated right after the moment the check is done and vnodes are unlocked. Imagine malicious sender doing mv(1) out of hierarchy after the malicious receiver acked that fd is obtained and whatever checks at the file externalization were passed.
(In reply to crest from comment #32) That is surprising and contradicts the documentation in jail(8):

    Like the names, a child jail's path appears relative to its creator's own path. This is by virtue of the child jail being created in the chrooted environment of the first jail.

But yes, that seems to be the case:

markj@xinde> jls -n jid parent path
jid=6 parent=0 path=/jails/foo
jid=7 parent=6 path=/jails/foo.bar

How does jail(8) even create such an arrangement?

(In reply to Konstantin Belousov from comment #33) Passing a dirfd from a parent jail to the child lets the child break out into the parent, yes, same as with a plain chroot. But the same problem lets, e.g., sibling jails with different roots but with a shared channel escape, which is more surprising.

Maybe another solution is to automatically apply O_RESOLVE_BENEATH to dirfds passed between jails. That is, O_RESOLVE_BENEATH can become a sticky flag, applied to namei(), and then the receiver cannot use it to perform ".." lookups.
(In reply to Mark Johnston from comment #37) Sorry, I missed the earlier discussion of O_RESOLVE_BENEATH. I still think that's potentially a reasonable mitigation for part of this problem.
(In reply to Konstantin Belousov from comment #36) It has already been brought up that, as long as the jail root is not a mountpoint, a user on the host with permission to navigate into a jail and write permission to a directory on the same file system can move a directory from inside a jail to outside the jail while it's someone's current working directory.

If you're convinced that it's not feasible to validate that each directory about to be received is under the receiving jail's root directory until all file descriptors are externalized, then more overblocking is the only easy-to-implement option I see. In that case I propose to fail the recvmsg() call if any attached file descriptor is a directory that wasn't opened by the receiving jail. An exception could be made for unjailed processes.

It would be easy to implement, but passing directories between jails (or host and jail) is such a neat IPC primitive. I intended to (ab-)use it to write a tool that sends the runtime linker, a set of library directories and the executable from the host to a jail, to configure minimal jails without having to pollute the mount namespace with several nullfs mounts, but that would rely on injecting exactly these dangerous directory file descriptors for host directories into the jails via fd passing. *sigh*

Konstantin Belousov: Do you want me to write up a patch to implement the proposed semantics and would you be willing to review it?
(In reply to Mark Johnston from comment #38) It would require adding a new flag to file descriptors (afaik we only have close-on-exec). I would love to see this implemented in time for 15.0, but in my opinion quickly fixing the security of a common way to deploy jails at the cost of ripping out a so far (almost?) unused kernel feature is the right trade-off.

Transitively restricting the receiver to only derive directory file descriptors under the received directory wouldn't fully secure file descriptor passing. In the worst case the "nobody" user in one jail (or the host) passes its root directory to the "root" user in a malicious jail. To preserve sane capability semantics the receiver would have to use some combination (probably the minimum, for some definition of minimum) of the opening ucred and its own. Do we have any precedent for something like that in the kernel?
(In reply to crest from comment #39) Are you proposing to implement the following:

> I propose to fail the recvmsg() call if any attached file descriptor is
> directory that wasn't opened by the receiving jail. An exception could
> be made for unjailed processes.

If yes, I agree. Also I propose to slightly weaken the test, allowing dirfds opened by the jail itself or any of its descendants. This would also handle the 'unjailed' case automatically.
(In reply to crest from comment #40)

> In the worst case the "nobody" user in one jail (or the host) passes it's root directory to the "root" user in a malicous jail.

If this channel exists, then what stops the jailed root user from creating a setuid shell, opening it, passing the fd to the nobody user, who fexecve()s it? This general scenario is described in the NOTES section of jail(8).
(In reply to Konstantin Belousov from comment #41) The jail parent<->child relationship isn't necessarily preserved in the filesystem. This means a parent jail can't be trusted to receive directory file descriptors from its descendants.
(In reply to crest from comment #43) So I also tripped over it. Ok.
A POC patch for discussion: https://reviews.freebsd.org/D50371
As long as we work to fix nullfs problems for 262179 that don't also fix 262180, we haven't fixed anything. And that means that for the purposes of this bug, restrictions on descriptor passing aren't the answer. I have a solution to descriptor passing in mind that doesn't involve restricting the passing but instead makes dotdot lookup more robust. But that also fails in the face of 262180. Mark Johnston's comment in 262180 makes it clear: the problem is in nullfs, and the fix has to involve nullfs.
(In reply to Jamie Gritton from comment #46) I don't quite follow: these two bugs seem distinct to me. Here, the role of nullfs is just in giving two jails a shared channel to exchange fds. The problem arises in any context where two jailed processes in different jails can communicate over a unix socket. The problem in 262180 is truly a bug in nullfs/vfs_lookup().
(In reply to Mark Johnston from comment #47) With nullfs as part of this bug, the conclusion is "passing directory descriptors can't be made safe anywhere except to/from prison0." Without nullfs it reduces to "nslookup has to be a little more clever in noticing jail escapes." That makes both these bugs seem like two different views of the same nullfs problem: you can't reliably blacklist certain directories from dotdot traversal if nullfs is involved. Note that the trick in 262180 works equally well for escaping chroot as it does full jails.
(In reply to Jamie Gritton from comment #48) This bug has nothing to do with nullfs. The nullfs issue is different and in fact not related to jails.
Created attachment 260491 [details] Proof of concept to disallow any jail dotdot escape

(In reply to Konstantin Belousov from comment #49) Yeah, thanks for pulling me away from that red herring. So then, back to "nslookup has to be a little more clever in noticing jail escapes." This can be solved in the dotdot handling, without having to make any restrictions on descriptor passing.

Currently, we check for traversing up from prison root directories, checking the current prison and its parents, up to prison0. But with passed descriptors, or with the possibility of sub-jails outside of a parent's file hierarchy, that's not sufficient. What we should be preventing is traversal upward from any prison root, with the sole exception of those that are within the current prison's file tree (so using the prison_contains strategy).

A quick test run with a simpler strategy solves the test case: first skip the check for prison0 (common-case performance hack), then block dotdot traversal at any prison root regardless of path containment. With that, the process in j1 won't get outside of j2's root. The quick version only changed a few lines in vfs_lookup and matching lines in cache_fplookup_dotdot.

A full version of this would add proper locking (maybe just allprison_lock, but I don't know about LOR offhand), and the file containment test. In the meantime, this patch shows what I intend.
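The "quick version" rule described above can be stated as a tiny userspace model. This is not the attached patch: struct lookup_prison and dotdot_blocked() are hypothetical names, integers stand in for vnode identities, and the list of all prison roots is passed in as a plain array instead of being found under any kernel lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the prison of the looking-up process. */
struct lookup_prison {
	int root_vnode;		/* stand-in for the prison's root vnode */
	bool is_prison0;	/* true for the unjailed host */
};

/*
 * Returns true if a ".." step out of dir_vnode must be refused.
 * prison0 skips the check entirely; any other prison is stopped at
 * every prison root, whether or not it is its own.
 */
static bool
dotdot_blocked(int dir_vnode, const struct lookup_prison *curr,
    const int *all_roots, size_t nroots)
{
	size_t i;

	assert(curr != NULL);
	if (curr->is_prison0)
		return (false);
	for (i = 0; i < nroots; i++) {
		if (all_roots[i] == dir_vnode)
			return (true);
	}
	return (false);
}
```

In this model a process in j1 that holds a descriptor somewhere below j2's root can walk up to j2's root but not past it, matching the outcome described for the test case. The model does not stop a walk that starts in a directory with no prison root above it.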
(In reply to Jamie Gritton from comment #50) I do not understand how this patch would help. The problem is that passing an unrelated dirfd might allow the dotdot lookups to climb up arbitrarily, e.g. up to the prison0 rootvp, without walking over any jail root.
(In reply to Konstantin Belousov from comment #51)

> Problem is that passing unrelated dirfd might allow the dotdot lookups to
> climb up arbitrary, e.g. up to the prison0 rootvp, without walking over any
> jail root.

That could only be the case if a dirfd was passed from a source that was able to open a directory outside of any jail root, i.e. from prison0. Presumably if a system is set up so a non-jailed process can send descriptors to a jailed process, that's what the administrator wants.
(In reply to Jamie Gritton from comment #52) Yes, it is assisted escape. No, it is not guaranteed that we find any prison root during lookups initiated using that dirfd.
If you add a new flag on file descriptors to prevent directory travels, consider copying OpenBSD's O_BELOW proposal: https://undeadly.org/cgi?action=article;sid=20250529080623