Created attachment 252746
Documents this behavior in lib/libc/sys/jail.2

In jail_attach(2) the only reason listed for EPERM is that the user is not root while attempting to attach to a jail. When I ran into that errno in my program I was puzzled, because the uid was 0, so that could not have been the problem. With help from other people I found out that there may be other, undocumented reasons for getting that errno. Specifically, the issue was that the process was holding a directory FD open while trying to attach, which is prohibited.

A sentence describing this behavior should be added to the "[EPERM]" entry in the "ERRORS" section of jail_attach(2).
Funnily, I stumbled into the exact same problem today in a different context...

That behavior is very indirectly documented through the paragraph:

> Further jail(), jail_set(), and jail_attach() call chroot(2) internally,
> so they can fail for all the same reasons. Please consult the chroot(2)
> manual page for details.

I agree that the jail_attach(2) manpage should mention the other "common" reason for getting EPERM (another reason is for the attach to be rejected by the MAC framework, and there are still potentially others).
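For anyone else hitting this EPERM: a quick way to check whether the process is holding a directory descriptor before it attaches is to fstat() every descriptor and look for S_ISDIR. The following is only a diagnostic sketch, not anything from the attachment; the function name count_directory_fds and the scan limit are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: count open descriptors in [0, max_fd) that refer
 * to directories. If first_dir is non-NULL it receives the lowest such
 * descriptor, or -1 when none was found.
 */
int
count_directory_fds(int max_fd, int *first_dir)
{
	struct stat sb;
	int count = 0;

	assert(max_fd >= 0);
	if (first_dir != NULL)
		*first_dir = -1;
	for (int fd = 0; fd < max_fd; fd++) {
		if (fstat(fd, &sb) != 0)
			continue;	/* typically EBADF: descriptor not open */
		if (S_ISDIR(sb.st_mode)) {
			if (count == 0 && first_dir != NULL)
				*first_dir = fd;
			count++;
		}
	}
	return (count);
}
```

Calling this right before jail_attach() and logging a non-zero result should point at the offending descriptor. The scan limit is the caller's choice; descriptors at or above it are not examined.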
(Just to avoid giving you false hopes with my prompt response, I'll be AFK for the next two weeks, so don't expect a prompt response from me. But if nobody beats me to it, I'll handle it in September.)
There is the kern.chroot_allow_open_directories sysctl to allow chroot() with open directory file descriptors, but no such "backdoor" exists for jail_attach(), because jails are supposed to provide secure namespace isolation. Removing this restriction would allow trivial jail escapes through fchdir() and openat().

Allowing jailed processes to safely share directory file descriptors, either through inheritance or through file descriptor passing, could be very useful, but is totally unsafe right now. Sharing directory file descriptors through inheritance across jails is (and always has been) impossible because jail_attach() doesn't allow a process holding one or more such file descriptors to attach to a jail. Despite this, it's possible to share file descriptors across jails via file descriptor passing over unix domain sockets. Passing a file descriptor for a directory outside the receiver's jail root into a jail is an instant file system escape, because the jail root is enforced by comparing vnodes, and if you `parent_fd = openat(dir_fd, "..", O_DIRECTORY)` you won't encounter the jail root vnode. By looping until "." == ".." the jailed process can find the host's root filesystem. While the unix socket could be brought into a jail via jail_attach(), the more likely real-world situation would be intentionally sharing a nullfs mount across jails (e.g. a PostgreSQL jail <-> a FastCGI jail).

One way I can think of to make it safe would be to add an O_RESOLVE_BENEATH-like write-once flag to file descriptors and allow only directory file descriptors with this flag to be shared across jails. It would have to be inherited too (e.g. by openat(dir_fd, "<subdir>", O_DIRECTORY) on a sub-directory). If the flag could be set automatically by jail_attach() and by file descriptor passing across jails, this would work without extra syscalls to set the new flag, with the caveat that ".." is inaccessible on such file descriptors and those derived from them.
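To make the "loop until . == .." step concrete, here is a rough sketch of the walk in plain POSIX C. It treats two directories as the same when st_dev and st_ino match, which is a userspace stand-in for the vnode comparison mentioned above. The function name ascend_to_top is hypothetical. Run from a descriptor that lives inside the process's own root, it simply stops at that root; the escape only arises when the descriptor refers to a directory outside it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: starting from dir_fd, repeatedly open ".." until
 * the parent is the same directory as the current one (same st_dev and
 * st_ino). Returns a new descriptor for that topmost directory, or -1 on
 * error. dir_fd itself is left open.
 */
int
ascend_to_top(int dir_fd)
{
	struct stat cur, par;
	int fd, parent;

	assert(dir_fd >= 0);
	fd = openat(dir_fd, ".", O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return (-1);
	for (;;) {
		if (fstat(fd, &cur) != 0)
			break;
		parent = openat(fd, "..", O_RDONLY | O_DIRECTORY);
		if (parent < 0)
			break;
		if (fstat(parent, &par) != 0) {
			close(parent);
			break;
		}
		if (par.st_dev == cur.st_dev && par.st_ino == cur.st_ino) {
			close(parent);
			return (fd);	/* "." == "..": nothing above this */
		}
		close(fd);
		fd = parent;		/* one level up, try again */
	}
	close(fd);
	return (-1);
}
```

Note that O_RDONLY needs read permission on every directory along the way, so the walk can fail for unprivileged processes.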
(In reply to Olivier Certner from comment #1)
Ah! I totally missed that paragraph. I guess I should read more carefully.

(In reply to crest from comment #3)
Could you add a "root vnode pointer" to every directory FD to limit its scope? That way, when you reference ".." you would check whether the directory FD refers to the vnode behind that pointer and, if so, not go above it. Also, O_RESOLVE_BENEATH would then just mean that when you open that directory FD, the directory itself becomes the "root vnode pointer" instead of its jail/chroot root directory. Opening directories with openat() copies the "root vnode pointer" unless overridden by something like O_RESOLVE_BENEATH. (I have no experience with VFS code, so take this idea with a grain of salt!) This reminds me slightly of how sockets have vnet pointers.

Should I close this issue or does someone else do that? The reason is documented already, as stated by Olivier Certner.
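To illustrate the "root vnode pointer" idea, here is a toy userspace model. It is not kernel code and not an existing API: struct scoped_dir, scoped_open_root() and scoped_openat() are all hypothetical names, the root is identified by st_dev/st_ino rather than by a vnode pointer, and only single path components are accepted to keep the model small. A ".." lookup at the scope root resolves to the root itself, the way ".." behaves at a chroot root.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical model: a directory descriptor tagged with its own root. */
struct scoped_dir {
	int	fd;		/* open directory descriptor, or -1 */
	dev_t	root_dev;	/* identity of the directory that ".." */
	ino_t	root_ino;	/* may not climb above */
};

/* Open path and make that directory its own scope root. */
int
scoped_open_root(const char *path, struct scoped_dir *out)
{
	struct stat sb;

	assert(path != NULL && out != NULL);
	out->fd = open(path, O_RDONLY | O_DIRECTORY);
	if (out->fd < 0)
		return (-1);
	if (fstat(out->fd, &sb) != 0) {
		close(out->fd);
		out->fd = -1;
		return (-1);
	}
	out->root_dev = sb.st_dev;
	out->root_ino = sb.st_ino;
	return (0);
}

/*
 * Open one path component relative to parent. The result copies the
 * parent's root; ".." at the root resolves to the root itself.
 */
int
scoped_openat(const struct scoped_dir *parent, const char *name,
    struct scoped_dir *out)
{
	struct stat sb;
	const char *target = name;

	assert(parent != NULL && name != NULL && out != NULL);
	out->fd = -1;
	if (name[0] == '\0' || strchr(name, '/') != NULL) {
		errno = EINVAL;	/* model handles single components only */
		return (-1);
	}
	if (strcmp(name, "..") == 0) {
		if (fstat(parent->fd, &sb) != 0)
			return (-1);
		if (sb.st_dev == parent->root_dev &&
		    sb.st_ino == parent->root_ino)
			target = ".";	/* at the scope root: stay put */
	}
	/* O_NOFOLLOW so a symlink cannot redirect the lookup elsewhere. */
	out->fd = openat(parent->fd, target,
	    O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
	if (out->fd < 0)
		return (-1);
	out->root_dev = parent->root_dev;
	out->root_ino = parent->root_ino;
	return (0);
}
```

A real implementation would have to do the check during name lookup inside the kernel, where multi-component paths, races and mount crossings all need handling; none of that is modeled here.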