Bug 251433 - unionfs copymode transparent creates files with current euid instead of lower layer file owner uid.
Summary: unionfs copymode transparent creates files with current euid instead of lower...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.2-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-28 05:26 UTC by Gunther Schadow
Modified: 2020-12-23 17:16 UTC (History)
0 users

See Also:


Attachments
The final patch of this work, it includes the copypolicy=always patch. The two could be separated easily if you want to adopt only the fix and not the new feature. (7.34 KB, patch)
2020-11-29 21:21 UTC, Gunther Schadow
no flags Details | Diff
The final (corrected VOP_ACCESS) patch of this work, it includes the copypolicy=always patch. The two could be separated easily if you want to adopt only the fix and not the new feature. (7.59 KB, patch)
2020-11-29 22:10 UTC, Gunther Schadow
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gunther Schadow 2020-11-28 05:26:48 UTC
Test:
-1. as root -------------------------------------------------------

cd /tmp
mkdir test-upper
mkdir test-union
echo "I'm root!" > test-union/data
chmod 755 test-*
chown -R root:wheel test-*
chmod 664 test-*

mount -t unionfs -o copymode=transpaent test-upper test-union

-2. as another user in group wheel ---------------------------------

cd /tmp
cat test-union/data
# I'm root!
ls -l test-union/data
# -rw-rw-r-- root wheel 10 test-union/data
echo "But I'm not root!" > test-union/data
cat test-union/data
# But I'm not root!
ls -l test-union/data
# -rw-rw-r-- user wheel 18 test-union/data

rm test-upper/data
cat test-union/data
# I'm root!
ls -l test-union/data
# -rw-rw-r-- root wheel 10 test-union/data
----------------------------------------------------------------------

You see, if the copymode=transparent was working right, it would create that copy of test-union/data on the upper layer using the owner uid "root" not "user". But somehow it is not doing that. 

And just to be clear, a simple overwrite of an existing file by an euid different from the owener uid does not change the owner uid. 

So this is a bug, but it's complicated, because the code says very clearly 

Out of unionfs_subr.c

unionfs_copyfile(...) calls
unionfs_vn_create_on_upper(...) calls
unionfs_create_uppervattr_core(...) does

	case UNIONFS_TRANSPARENT:
		uva->va_mode = lva->va_mode;
		uva->va_uid = lva->va_uid;
		uva->va_gid = lva->va_gid;
		break;

which uva is then used in the VOP_CREATE call.

What could the problem be here? Something is apparently "correcting" the uva->va_unid setting between here and the final effect of VOP_CREATE.
Comment 1 Gunther Schadow 2020-11-28 05:47:45 UTC
Note that it seems to work right with the shadow directories. So that might be a clue.
Comment 2 Conrad Meyer freebsd_committer 2020-11-28 16:42:50 UTC
First, it's important to determine what syscalls the shell is issuing.  In sh(1), 'echo "hello" > foobar.dat' does this:

openat(AT_FDCWD,"foobar.dat",O_WRONLY|O_CREAT|O_TRUNC,0666) = 5 (0x5)
dup2(5,1)			                                 = 1 (0x1)
close(5)			                                 = 0 (0x0)
write(1,"hello\n",6)		                         = 6 (0x6)

I suspect the rest of the observed behavior shakes out of O_CREAT, although I'm not sure of the exact mechanism.  In vn_open, if namei() did not locate a vnode, we call VOP_CREATE which in unionfs_create invokes VOP_CREATE on the upper; this would explain the observed owner.  However, I'm not sure if that code path is actually hit; intuitively, I'd guess that namei() on the unionfs mount might locate the existing file in the lower.
Comment 3 Gunther Schadow 2020-11-28 17:23:02 UTC
As for what system calls, there is no magic here, this is what I get for

echo "hello" > one

no matter if the file "one" was already there or not:

write(2,"\r\^[[17Cecho  "hello" > one",25)   = 25 (0x19)
read(0,"\n",1)                                   = 1 (0x1)

write(2,"\n",1)                                  = 1 (0x1)
ioctl(0,TIOCSETAW,0x800723238)                   = 0 (0x0)
fcntl(1,F_DUPFD_CLOEXEC,0xa)                     = 11 (0xb)
openat(AT_FDCWD,"one",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
dup2(3,1)                                        = 1 (0x1)
close(3)                                         = 0 (0x0)
write(1,"hello\n",6)                             = 6 (0x6)
dup2(11,1)                                       = 1 (0x1)
close(11)                                        = 0 (0x0)

What I will look at now is why the creation of the shadow directories works correctly but not the unionfs_copyfile(...).
Comment 4 Gunther Schadow 2020-11-28 17:53:48 UTC
I'm badly handicapped with using debuggers, and I focus on static code analysis. 

So here is the shadow directory creation which works well

unionfs_mkshadowdir:

	struct vattr	va;
	struct vattr	lva;

	...

	if ((error = VOP_GETATTR(lvp, &lva, cnp->cn_cred)))
		goto unionfs_mkshadowdir_abort;

	...

	unionfs_create_uppervattr_core(ump, &lva, &va, td);

	error = VOP_MKDIR(udvp, &uvp, &cn, &va);
 
cutting out stuff that isn't about any "va" and that I don't understand. Now here is the corresponding sequence for unionfs_copyfile, calls 

unionfs_vn_create_on_upper(..., struct vattr *uvap, ...):

	struct vattr	lva;

	...

	if ((error = VOP_GETATTR(lvp, &lva, cred)) != 0)
		return (error);
	unionfs_create_uppervattr_core(ump, &lva, uvap, td);

	...

	if ((error = VOP_CREATE(udvp, &vp, &cn, uvap)) != 0)
		goto unionfs_vn_create_on_upper_free_out1;


so this looks completely fine and parallel and because unionfs_create_uppervattr_core is working in the mkshadowdir case it is probably working fine in the create_on_upper case.

This leaves our problem inside VOP_CREATE. 

So, how do I find the specific source code of VOP_CREATE? I'm sure this is some macro which uses the struct vnode *udvp. Where should I look to figure out what is happening in VOP_CREATE?
Comment 5 Gunther Schadow 2020-11-28 18:07:44 UTC
Brute force search for this VOP_CREATE:

fgrep -R 'VOP_CREATE' /usr/include /usr/src

/usr/src/sys/kern/uipc_usrreq.c:                error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr);
/usr/src/sys/kern/vfs_vnops.c: * Lookup the vnode and invoke VOP_CREATE if needed.
/usr/src/sys/kern/vfs_vnops.c: * Check permissions, and call the VOP_OPEN or VOP_CREATE routine.
/usr/src/sys/kern/vfs_vnops.c:                          error = VOP_CREATE(ndp->ni_dvp, &ndp->ni_vp,
/usr/src/sys/fs/nfsserver/nfs_nfsdport.c:                       error = VOP_CREATE(ndp->ni_dvp,
/usr/src/sys/fs/nfsserver/nfs_nfsdport.c:               error = VOP_CREATE(ndp->ni_dvp, &ndp->ni_vp,
/usr/src/sys/fs/nfsserver/nfs_nfsdport.c:                       nd->nd_repstat = VOP_CREATE(ndp->ni_dvp,
/usr/src/sys/fs/nfsserver/nfs_nfsdport.c:               error = VOP_CREATE(dvp, &nvp, &named.ni_cnd, vap);
/usr/src/sys/fs/fuse/fuse_vnops.c:       * VOP_CREATE doesn't tell us the open(2) flags, so we guess.  Only a
/usr/src/sys/fs/fuse/fuse_vnops.c:                               * pathname for use later in VOP_CREATE or
/usr/src/sys/fs/unionfs/union_vnops.c:          error = VOP_CREATE(udvp, &vp, cnp, ap->a_vap);
/usr/src/sys/fs/unionfs/union_subr.c:   if ((error = VOP_CREATE(udvp, &vp, &cn, uvap)) != 0)
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:            error = VOP_CREATE(ZTOV(sharedir), zc->zc_string,
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_replay.c:           error = VOP_CREATE(ZTOV(dzp), name, &xva.xva_vattr,
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_replay.c:           error = VOP_CREATE(ZTOV(dzp), &vp, &cn, &xva.xva_vattr /*,vflg*/);
/usr/src/sys/ufs/ffs/ffs_snapshot.c:    error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vat);

so this is all just calling, not actually defining the VOP_CREATE symbol. What mysterious trickery is going on here?
Comment 6 Gunther Schadow 2020-11-29 03:56:22 UTC
OK, so I am on my own here still as nobody wants to give me even a slightest hint.

As I repeat the same fgrep -Ri 'VOP_CREATE' looking for lower case, assuming that the upper case is some macro (though I can not find any #define VOP_CREATE) but I am assuming it is some kind of macro that resolves to a function pointer of the underlying struct pointer in the first argument. Like poor-man's object oriented polymorphism. I am sure.

And as I do that I come across

/usr/src/sys/sys/vnode.h:void   vop_create_post(void *a, int rc);
/usr/src/sys/ufs/ffs/ffs_snapshot.c:    error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vat);
/usr/src/sys/ufs/ufs/ufs_vnops.c:static vop_create_t    ufs_create;
/usr/src/sys/ufs/ufs/ufs_vnops.c:       struct vop_create_args /* {
/usr/src/sys/ufs/ufs/ufs_vnops.c:       .vop_create =           ufs_create,

and it dawns on me that ultimately the implementation of VOP_CREATE is dependent on the file system, and in my case it's good old UFS/FFS. So I better get loking in ufs_vnops.c ufs_create:

/*
 * Create a regular file
 */
static int
ufs_create(ap)
	struct vop_create_args /* {
		struct vnode *a_dvp;
		struct vnode **a_vpp;
		struct componentname *a_cnp;
		struct vattr *a_vap;
	} */ *ap;
{
	int error;

	error =
	    ufs_makeinode(MAKEIMODE(ap->a_vap->va_type, ap->a_vap->va_mode),
	    ap->a_dvp, ap->a_vpp, ap->a_cnp, "ufs_create");
	if (error != 0)
		return (error);
	if ((ap->a_cnp->cn_flags & MAKEENTRY) != 0)
		cache_enter(ap->a_dvp, *ap->a_vpp, ap->a_cnp);
	return (0);
}

now that's simple enough isn't it? I will trace through the struct vattr *a_vap. So this is my line of focus:

ufs_makeinode(MAKEIMODE(ap->a_vap->va_type, ap->a_vap->va_mode),
	    ap->a_dvp, ap->a_vpp, ap->a_cnp, "ufs_create");

and where the heck is uid and gid? Remember in unionfs_subr.c we had:

/*
 * Create upper node attr.
 */
void
unionfs_create_uppervattr_core(struct unionfs_mount *ump,
			       struct vattr *lva,
			       struct vattr *uva,
			       struct thread *td)
{
	VATTR_NULL(uva);
	uva->va_type = lva->va_type;
	uva->va_atime = lva->va_atime;
	uva->va_mtime = lva->va_mtime;
	uva->va_ctime = lva->va_ctime;

	switch (ump->um_copymode) {
	case UNIONFS_TRANSPARENT:
		uva->va_mode = lva->va_mode;
		uva->va_uid = lva->va_uid;
		uva->va_gid = lva->va_gid;
		break;

va_uid and va_gid. They had been set. But now in ufs_create we have:

ufs_makeinode(
  MAKEIMODE(
    ap->a_vap->va_type, 
    ap->a_vap->va_mode),
    ap->a_dvp, 
    ap->a_vpp, 
    ap->a_cnp, "ufs_create");

where from our fine ap->a_vap we take only va_type and va_mode but just ignore the rest, throwing away va_uid and va_gid. I assume that this is then taken somewhere from the process euid and egid (although I don't remember how the group was decided from the many groups a user can belong to). 

This is probably the heart of the issue. And now, for contrast, I look at mkdir, since the mkshadowdir stuff worked fine. How much will I bet that it's not throwing away the va_uid and va_gid? Actually, I can't. The ufs_mkdir makes even less use of the ap->a_vap stuff.

/*
 * Mkdir system call
 */
static int
ufs_mkdir(ap)
	struct vop_mkdir_args /* {
		struct vnode *a_dvp;
		struct vnode **a_vpp;
		struct componentname *a_cnp;
		struct vattr *a_vap;
	} */ *ap;
{
	struct vnode *dvp = ap->a_dvp;
	struct vattr *vap = ap->a_vap;
	struct componentname *cnp = ap->a_cnp;
	struct inode *ip, *dp;
	struct vnode *tvp;
	...
	dmode = vap->va_mode & 0777;
	...
	ip->i_gid = dp->i_gid;
	DIP_SET(ip, i_gid, dp->i_gid);
#ifdef SUIDDIR
	{
#ifdef QUOTA
		struct ucred ucred, *ucp;
		gid_t ucred_group;
		ucp = cnp->cn_cred;
#endif
		/*
		 * If we are hacking owners here, (only do this where told to)
		 * and we are not giving it TO root, (would subvert quotas)
		 * then go ahead and give it to the other user.
		 * The new directory also inherits the SUID bit.
		 * If user's UID and dir UID are the same,
		 * 'give it away' so that the SUID is still forced on.
		 */
		if ((dvp->v_mount->mnt_flag & MNT_SUIDDIR) &&
		    (dp->i_mode & ISUID) && dp->i_uid) {
			dmode |= ISUID;
			ip->i_uid = dp->i_uid;
			DIP_SET(ip, i_uid, dp->i_uid);
#ifdef QUOTA
			if (dp->i_uid != cnp->cn_cred->cr_uid) {
				/*
				 * Make sure the correct user gets charged
				 * for the space.
				 * Make a dummy credential for the victim.
				 * XXX This seems to never be accessed out of
				 * our context so a stack variable is ok.
				 */
				refcount_init(&ucred.cr_ref, 1);
				ucred.cr_uid = ip->i_uid;
				ucred.cr_ngroups = 1;
				ucred.cr_groups = &ucred_group;
				ucred.cr_groups[0] = dp->i_gid;
				ucp = &ucred;
			}
#endif
		} else {
			ip->i_uid = cnp->cn_cred->cr_uid;
			DIP_SET(ip, i_uid, ip->i_uid);
		}

So now I have to check what this dp->i_uid is and where it comes from, and whether cnp->cn_cred->cr_uid might be the real source of the uid that I want. So many questions. Very complex code here, but I do know that ufs_mkdir works better than ufs_create.

Well, cnp is from ap->a_cnp, 3rd argument. And this dp thing comes from the first argument, probably the parent directory (?).

Here is some uid magic in unionfs_mkshadowdir:

int
unionfs_mkshadowdir(struct unionfs_mount *ump, struct vnode *udvp,
		    struct unionfs_node *unp, struct componentname *cnp,
		    struct thread *td)
{
	int		error;
	struct vnode   *lvp;
	struct vnode   *uvp;
	struct vattr	va;
	struct vattr	lva;
	struct componentname cn;
	struct mount   *mp;
	struct ucred   *cred;
	struct ucred   *credbk;
	struct uidinfo *rootinfo;

	if (unp->un_uppervp != NULLVP)
		return (EEXIST);

	lvp = unp->un_lowervp;
	uvp = NULLVP;
	credbk = cnp->cn_cred;

	/* Authority change to root */
	rootinfo = uifind((uid_t)0);
	cred = crdup(cnp->cn_cred);
	/*
	 * The calls to chgproccnt() are needed to compensate for change_ruid()
	 * calling chgproccnt().
	 */
	chgproccnt(cred->cr_ruidinfo, 1, 0);
	change_euid(cred, rootinfo);
	change_ruid(cred, rootinfo);
	change_svuid(cred, (uid_t)0);
	uifree(rootinfo);
	cnp->cn_cred = cred;
        ...
        

My theory is now that NEITHER unionfs_mkshadowdir nor unionfs_copyfile actually copies the owner uid of the vnode in the lower layer. Instead, 

1. unionfs_copyfile just leaves it up to ufs_create (provided that the upper layer is a UFS) and that uses the current process euid.
2. unionfs_mkshadowdir instead of using the lower layer's directory node's owner uid, will do an authority change to root!

Let's test this suspicion.

# mkdir test-upper
# mkdir test-union
# mkdir -p test-union/foo/bar
# chown -R user:www test-union/foo
# mount -t unionfs test-upper test-union
# su user
$ touch test-union/foo/bar/data
$ exit
# find test-upper -ls
403605        8 drwxr-xr-x    3 root wheel 512 Nov 29 03:30 test-upper
403613        8 drwxr-xr-x    3 user www   512 Nov 29 03:30 test-upper/foo
403614        8 drwxr-xr-x    2 user www   512 Nov 29 03:30 test-upper/foo/bar
403615        0 -rw-r--r--    1 user www     0 Nov 29 03:30 test-upper/foo/bar/data

So, no, my suspicion is proven wrong! The unionfs_mkshadowdir creates indeed a perfect copy of the owner uid and gid of the directory, not root.

But then there must be something about our authority switch to root in the code above!

Ah! NOW I HAVE IT! It's here. Continuing from ... above:

	unionfs_create_uppervattr_core(ump, &lva, &va, td);

	error = VOP_MKDIR(udvp, &uvp, &cn, &va);

	if (!error) {
		unionfs_node_update(unp, uvp, td);

		/*
		 * XXX The bug which cannot set uid/gid was corrected.
		 * Ignore errors.
		 */
		va.va_type = VNON;
		VOP_SETATTR(uvp, &va, cn.cn_cred);
	}

There it is VOP_SETATTR(uvp, $va, cn.cn_cred) which we can do because we have changed our cred to root.

So, and that is the part that's missing in the unionfs_copyfile code. No VOP_SETATTR (?) Actually again NO! It's actually there:

	if (error == 0) {
		/* Reset the attributes. Ignore errors. */
		uva.va_type = VNON;
		VOP_SETATTR(uvp, &uva, cred);
	}

But it's in a weird place! 

int
unionfs_copyfile(struct unionfs_node *unp, int docopy, struct ucred *cred,
		 struct thread *td)
{
        ...

	error = unionfs_vn_create_on_upper(&uvp, udvp, unp, &uva, td);
	if (error != 0) {
		vn_finished_write(mp);
		return (error);
	}

	if (docopy != 0) {
		error = VOP_OPEN(lvp, FREAD, cred, td, NULL);
		if (error == 0) {
			error = unionfs_copyfile_core(lvp, uvp, cred, td);
			VOP_CLOSE(lvp, FREAD, cred, td);
		}
	}
	VOP_CLOSE(uvp, FWRITE, cred, td);
	VOP_ADD_WRITECOUNT_CHECKED(uvp, -1);
	CTR3(KTR_VFS, "%s: vp %p v_writecount decreased to %d", __func__, uvp,
	    uvp->v_writecount);

	vn_finished_write(mp);

	if (error == 0) {
		/* Reset the attributes. Ignore errors. */
		uva.va_type = VNON;
		VOP_SETATTR(uvp, &uva, cred);
	}

	unionfs_node_update(unp, uvp, td);


I can trace this &uva through, it was populated from unionfs_vn_create_on_upper which traces all the way through to the assignment of the lva->va_uid, so it comes out just fine.

What I suspect, then, as the last resort, is that the VOP_SETATTR is throwing an error actually (which is why we ignore that error!!) because our effective cred isn't that of root.

What do we bet? How could I test the error returned from VOP_SETATTR?

Ha! It gets even better! As I am checking in the actual source code on my system, I see this:

                /*
                 * XXX The bug which cannot set uid/gid was corrected.
                 * Ignore errors.
                 */
                va.va_type = VNON;
                VOP_SETATTR(uvp, &va, cn.cn_cred);

do you note this XXX comment? It speaks EXACTLY to my problem. Only that we are calling the VOP_SETATTR with the proper root creds! 

So why don't I catch this error here:

        if (error == 0) {
                /* Reset the attributes. Ignore errors. */
                uva.va_type = VNON;
                error = VOP_SETATTR(uvp, &uva, cred);
        }

and it would then be returned. This would be proof that I am over the target.

And I am so excited, because I can already see why my original task the unionfs -o copypolicy=always doesn't work properly, because we must change cred to root in order to create a copy of the lower layer vnode on the upper layer! Haha, so I can kill two birds with one stone! Love this. Let's try...
Comment 7 Gunther Schadow 2020-11-29 04:19:00 UTC
# umount test-union
# kldunload unionfs.ko
# rm -rf test-upper/foo
# chown -R root:wheel test-union/*
# echo "I'm root" > test-union/foo/bar/data
# chmod g+w test-union/foo/bar/data
# mount -t unionfs test-upper test-union
# su user
$ echo "I'm not root!" > test-union/foo/bar/data
su: test-union/foo/bar/data: Permission denied

Ha! That's probably exactly the permission denied error from the last VOP_SETATTR. Let's do a quick baseline test. Removing the error = VOP_SETATTR again, make modules co unionfs.ko to /boot/kernel ... and going through the same motions above and BINGO!

The only additional problem I found -- YET ANOTHER BUG REALLY! -- is that if I had the immediate directory of the data file not also with g+w I would get permission denied with either version of unionfs.ko.

But that's part of the same bug really. Since unionfs_copyfile creates the file with the creds of the current user, it can't do it on the upper layer shadow directory which has been created with the perfect transparent owner uid and gid. 

This means, to fix this bug, all we need to do is make sure the copyfile runs in the root creds already at VOP_CREATE, and also at VOP_SETATTR.

IMO this could use some re-factoring, because there is no reason why we have to do VOP_SETATTR after we already copied the file. We should do it while we are creating the node on upper! Anyone would object?
Comment 8 Gunther Schadow 2020-11-29 14:05:23 UTC
I need to do some re-factoring with the cred stuff in copyfile and that's not trivial.

In union_vnops.c unionfs_open the argument pointer ap has ap->a_cred and we set the local struct ucred *cred = ap->a_cred. So it's a pointer. That is, if anything below in the call graph was to change the ucred structure it would reflect outside, we don't want a permanent change of cred.

Secondly, the unionfs_open relies on the current cred and copyfile to check the current call's permission to proceed with the write. And that I want to avoid, not only because I want to implement copypolicy=always (i.e., a cache over the lower layer) but also because it is incorrect to require directory write permission only to create the upper layer copy of a file that the current user has permission to write.

--> But that's OK, for now despite the default copypolicy=onwrite we would allow the copyfile to occur and only after that VOP_OPEN, using the original creds, would find that it cannot write anyway. This could possibly be fixed by doing a VOP_OPEN check before the copyfile.

Now, currently we are passing that cred into unionfs_copyfile. And actually that's what VOP_ACCESS is about, and indeed copyfile does that inside:

	error = VOP_ACCESS(lvp, VREAD, cred, td);
	if (error != 0)
		return (error);

so we might just as well do that outside. As soon as we do copyfile, we will change cred to root. So no need to do VOP_ACCESS inside. I'm going to move that out right here (please don't worry about my mixing this up with my copypolicy=always feature, it could still be removed if its not wanted, but it's cool)

static int
unionfs_open(struct vop_open_args *ap)
{
        ...
        if (targetvp == NULLVP) {
                if (uvp == NULLVP) {
                        if (((ap->a_mode & FWRITE) || (ump->um_copypolicy == UNIONFS_COPY_ALWAYS)) && lvp->v_type == VREG) {
                                error = VOP_ACCESS(lvp, ap->a_mode, cred, td);
                                if (error != 0)
                                        goto unionfs_open_abort;
                                error = unionfs_copyfile(unp,
                                    !(ap->a_mode & O_TRUNC), cred, td);
                                if (error != 0)
                                        goto unionfs_open_abort;

now some more re-factoring about this cred stuff. I will switch to root right inside the copyfile function, so that the create on upper needs that root cred passed to it:

static int
unionfs_vn_create_on_upper(struct vnode **vpp, struct vnode *udvp,
			   struct unionfs_node *unp, struct vattr *uvap,
			   struct thread *td)
{
	...
	struct ucred   *cred;
	...
	struct componentname cn;
	...
	cred = td->td_ucred;
	...
	if ((error = VOP_GETATTR(lvp, &lva, cred)) != 0)
		return (error);
	...
	cn.cn_thread = td;
	cn.cn_cred = cred;

I still don't know what this struct componentname is really about, it seems to duplicate things like cred and td that are also passed alongside it, but anyway, I am certain that the cred = td->td_ucred is really a cop-out due to the lack of struct ucred *cred argument passed into this function from above. I regard this as a bug waiting to be exposed.

So, we need to change the signature of this method to allow the cred to be passed in. And the standard convention seems to be cred before td. So I will insert it there:

static int
unionfs_vn_create_on_upper(struct vnode **vpp, struct vnode *udvp,
                           struct unionfs_node *unp, struct vattr *uvap,
                           struct ucred *cred, struct thread *td)
{
        ...

        if ((error = VOP_GETATTR(lvp, &lva, cred)) != 0)
                return (error);
        unionfs_create_uppervattr_core(ump, &lva, uvap, td);

        cn.cn_namelen = strlen(unp->un_path);
        cn.cn_pnbuf = uma_zalloc(namei_zone, M_WAITOK);
        bcopy(unp->un_path, cn.cn_pnbuf, cn.cn_namelen + 1);
        cn.cn_nameiop = CREATE;
        cn.cn_flags = (LOCKPARENT | LOCKLEAF | HASBUF | SAVENAME | ISLASTCN);
        cn.cn_lkflags = LK_EXCLUSIVE;
        cn.cn_thread = td;
        cn.cn_cred = cred;
        cn.cn_nameptr = cn.cn_pnbuf;

        ...

        if ((error = VOP_CREATE(udvp, &vp, &cn, uvap)) != 0)
                goto unionfs_vn_create_on_upper_free_out1;

        /* this is from where VOP_SETATTR was previously done, not sure this is needed
           lva.va_type = VNON;
        */
        if ((error = VOP_SETATTR(vp, &lva, cred)) != 0)
                goto unionfs_vn_create_on_upper_free_out1;

        if ((error = VOP_OPEN(vp, fmode, cred, td, NULL)) != 0) {
                vput(vp);
                goto unionfs_vn_create_on_upper_free_out1;
        }

Note I already moved the VOP_SETATTR right after VOP_CREATE because there is no reason to have a lot of activity going until we set those attributes correctly.

Good, so now all I need to do is the authority switch to root in the main unionfs_copyfile function. And how do do that I learn from the unionfs_mkshadowdir function.

int
unionfs_copyfile(struct unionfs_node *unp, int docopy, struct ucred *cred,
		 struct thread *td)
{
	int		error;
	struct mount   *mp;
	struct vnode   *udvp;
	struct vnode   *lvp;
	struct vnode   *uvp;
	struct vattr	uva;
	struct uidinfo *rootinfo;

	lvp = unp->un_lowervp;
	uvp = NULLVP;

	if ((UNIONFSTOV(unp)->v_mount->mnt_flag & MNT_RDONLY))
		return (EROFS);
	if (unp->un_dvp == NULLVP)
		return (EINVAL);
	if (unp->un_uppervp != NULLVP)
		return (EEXIST);
	udvp = VTOUNIONFS(unp->un_dvp)->un_uppervp;
	if (udvp == NULLVP)
		return (EROFS);
	if ((udvp->v_mount->mnt_flag & MNT_RDONLY))
		return (EROFS);

        /* Authority change to root */
        rootinfo = uifind((uid_t)0);
        cred = crdup(cred);
        /*
         * The calls to chgproccnt() are needed to compensate for change_ruid()
         * calling chgproccnt().
         */
        chgproccnt(cred->cr_ruidinfo, 1, 0);
        change_euid(cred, rootinfo);
        change_ruid(cred, rootinfo);
        change_svuid(cred, (uid_t)0);
        uifree(rootinfo);

        /* we changed to root here, so no need to check again
        error = VOP_ACCESS(lvp, VREAD, cred, td);
        if (error != 0)
                goto unionfs_copyfile_abort;
        */

        if ((error = vn_start_write(udvp, &mp, V_WAIT | PCATCH)) != 0)
                goto unionfs_copyfile_abort;
        error = unionfs_vn_create_on_upper(&uvp, udvp, unp, &uva, cred, td);
        if (error != 0) {
                vn_finished_write(mp);
                goto unionfs_copyfile_abort;
        }

        ...

	vn_finished_write(mp);

	if (error == 0) {
		/* Reset the attributes. Ignore errors. */
		uva.va_type = VNON;
		VOP_SETATTR(uvp, &uva, cred);
	}

	unionfs_node_update(unp, uvp, td);

 unionfs_copyfile_abort:
        chgproccnt(cred->cr_ruidinfo, -1, 0);
        crfree(cred);

        return (error);
}

Now make modules ... how I wish I understood where the Makefiles for each module are so I could just re-make that one unionfs.ko module. And ... drumroll ... it doesn't work. "Permission denied". 

Let's investigate ...
Comment 9 Gunther Schadow 2020-11-29 15:53:11 UTC
To find out where it fails, I am going to put error += 1; error += 2; ... in the various places to see the errno returned.

And this way I find out that what the VOP_ACCESS check is to blame:

static int
unionfs_open(struct vop_open_args *ap)
{
        ...
        if (targetvp == NULLVP) {
                if (uvp == NULLVP) {
                        if (((ap->a_mode & FWRITE) || (ump->um_copypolicy == UNIONFS_COPY_ALWAYS)) && lvp->v_type == VREG) {
                                error = VOP_ACCESS(lvp, ap->a_mode, cred, td);
                                if (error != 0) {
                                        error += 10;
                                        goto unionfs_open_abort;
                                }
                                error = unionfs_copyfile(unp,
                                    !(ap->a_mode & O_TRUNC), cred, td);
                                if (error != 0)
                                        goto unionfs_open_abort;

so why would that be? No clue. But I will for now just comment this out. 

Nope, it's worse than that even. I now have find on test-upper hang, probably because the lock of the newly created file has not been released when this error happened.

Need to also reconcile the errors. 

First error was: 
   test-union/foo/bar/data: Too many open files in system -- ENFILE -- 23
then after commenting it out it was
   test-union/foo/bar/data: Read-only file system -- EROFS -- 30

This is strange! Why are those numbers so high? I just added 10 and I am sure it was out of VOP_ACCESS because before I added this error += 10 I got a simple EPERM = 1. So now it should be 11, EDEADLK / EAGAIN, not 23. And 23 - 10 is 13, which is EACCES, ah! OK, no, it's cool. EPERM is "operation not permited", but "permission denied" is EACCES. So we are good, spot on.

Then after my commenting out, I have 30 - 13 = 17, that's way too big of a number. Apparently there were two increments of the error, or a different error. So I change my strategy, instead of adding increments to the errno I will bitwise-or a high byte. Then I can see. It's probably a different error. And I probably found yet another bug in the unionfs implementation, where when that error is hit, the file system or file remains locked up and is not properly freed. We'll get to the bottom of this. Unfortunately now I need to reboot first. And oh my, what a hard reboot this needs to be, because the lock prevents a normal reboot. Force stopping it...
Comment 10 Gunther Schadow 2020-11-29 16:30:42 UTC
And after the reboot I see that the file has been created in the upper layer and has all the right owner, group and mode bits.

But now I do it over.

 Unknown error: 0x1716, 

perfect! This points me right to here:

        if ((error = VOP_SETATTR(vp, &lva, cred)) != 0) {
                error |= 0x0700;
                goto unionfs_vn_create_on_upper_free_out1;
        }

(as later I am still or-ing the 0x1000 on top of that, and so the errno here was 22 - EINVAL, invalid argument. Interesting. Could it be that I must set va_type = VNON?

This might explain why the VOP_SETATTR was done until after the copy was completed. But I can just save the va_type ...

        enum   vtype    lva_type_bck;

        ...

        lva_type_bck =  lva.va_type;
        lva.va_type = VNON;
        error = VOP_SETATTR(vp, &lva, cred);
        lva.va_type = lva_type_bck;
        if(error != 0) {
                error |= 0x0700;
                goto unionfs_vn_create_on_upper_free_out1;
        }

now to reboot again. But also to wonder, meanwhile, why the upper file would remain locked so that find would hang just as it would try to access that file's directory entry on the upper layer?

I guess the goto here is the best we can do, it's going to free_out1, which will VOP_UNLOCK the udvp first, and then continue to unwind everything else.  

As I have it now, after this error it does

                vn_finished_write(mp);

but then 

                goto unionfs_copyfile_abort;

skipping over

	if (error == 0) {
		/* Reset the attributes. Ignore errors. */
		uva.va_type = VNON;
		VOP_SETATTR(uvp, &uva, cred);
	}

which  I should delete as it's nonsense here now, but skipping over:

	unionfs_node_update(unp, uvp, td);

which might be necessary now. Who knows.

Anyway, do I have the VNON issue guessed right? Is the va_type != VNON the illegal argument? ... Drumroll ...

No, it did not help. It's still the same error. :( 

What is the illegal argument here?
Comment 11 Gunther Schadow 2020-11-29 17:05:09 UTC
Oh! I get it! I shouldn't have used the &lva but the newly set-up uvap for the VOP_SETATTR. Sure, that makes sense.

        enum   vtype    uva_type_bck;
        ...
        if ((error = VOP_GETATTR(lvp, &lva, cred)) != 0)
                return (error);
        unionfs_create_uppervattr_core(ump, &lva, uvap, td);
        ...
        if ((error = VOP_CREATE(udvp, &vp, &cn, uvap)) != 0) {
                error |= 0x0600;
                goto unionfs_vn_create_on_upper_free_out1;
        }

        /* this is from where VOP_SETATTR was previously done, not sure this is needed
           uva.va_type = VNON;
        */
        uva_type_bck =  uvap->va_type;
        uvap->va_type = VNON;
        error = VOP_SETATTR(vp, uvap, cred);
        uvap->va_type = uva_type_bck;
        if(error != 0) {
                error |= 0x0700;
                goto unionfs_vn_create_on_upper_free_out1;
        }

Note: there is something I don't understand in this unionfs_vn_create_on_upper, it has VOP_OPEN without a VOP_CLOSE and in the caller there is another VOP_OPEN. Strange.

OK, now to reboot again ...

YES! VICTORY! That was it.

Now to clean up and produce one neat diff.
Comment 12 Gunther Schadow 2020-11-29 19:56:39 UTC
Yes, not quite full victory. 

Firstly, now that I am switching to root I do not even need the VOP_SETATTR call any more as the VOP_CREATE now properly applies owner uid and gid from the lower vnode. 

But the problem is that the root cred continues on after the unionfs_copyfile so I can update the data file as non root user even if I have no group write permissions! Yes, if I was able to make that gating VOP_ACCESS work, then it wouldn't matter that much, but IMO it is a security risk that somehow we can write to a vnode that we have no write permission to. 

So somehow we kept this file open and continue using it with the root creds.

By contrast, hen I re-do the operation now that the file is created in the upper layer, I can no longer write to it. So clearly this is a side-effect of the unionfs_copyfile.

It's very puzzling, but I think the problem is here:

        unionfs_node_update(unp, uvp, td);

we are passing the uvp which was obtained while we had root creds. Even though it was VOP_CLOSE I still think there is something hung over in this vnode pointer. I think this is true because we can use VOP_OPEN with the fp / componentname being NULL after the VOP_CREATE where the fp / componentname was set, and where we had saved the root cred as cn.cn_cred = cred.

I think that keeps lingering. But it is strange because the struct componentname cn was created on the stack deep inside the call tree, not allocated on dynamic memory, and not created by the unionfs_open function. 

I think somehow the original componentname is still in effect. After the VOP_CREATE we call VOP_OPEN several times with NULL cnp. And that uvp also  gets copied onto the unp in the unionfs_node_update() function:

unp->un_uppervp = uvp;

so we continue holding on to this vnode pointer. 

Ideally I would do some kind of relookup after the copyfile was done, using a cn that is using the original creds, not the root creds, no? But there is some magic going on with locking and reference counting that I feel uncertain about.
Comment 13 Gunther Schadow 2020-11-29 21:15:48 UTC
Well, I can't figure this one out. Really need some help with that. I am afraid I will just mess things up and in the end we'll have resource leaks. Now that said, there are already some problems when certain (rare) error conditions occur. The locking up issue was not caused by me, it was only triggered by my messing around. 

Anyway, I have found out why my VOP_ACCESS check didn't work. Apparently I need to filter out stuff from ap->ap_mode & (VREAD|VWRITE|VEXEC) is what I use now, and that works:

        if (targetvp == NULLVP) {
                if (uvp == NULLVP) {
                        if (((ap->a_mode & FWRITE) || (ump->um_copypolicy == UNIONFS_COPY_ALWAYS)) && lvp->v_type == VREG) {
                                error = VOP_ACCESS(lvp, ap->a_mode & (VREAD|VWRITE|VEXEC), cred, td);
                                if (error != 0)
                                        goto unionfs_open_abort;


fine. Now I will submit my full diff here.
Comment 14 Gunther Schadow 2020-11-29 21:21:54 UTC
Created attachment 220081 [details]
The final patch of this work, it includes the copypolicy=always patch. The two could be separated easily if you want to adopt only the fix and not the new feature.
Comment 15 Gunther Schadow 2020-11-29 22:07:18 UTC
No, the VOP_ACCESS check does not help! I can still write as normal user without group write permission.

But I'm totally confused about vnode.h's VREAD, VWRITE, VEXEC and the fcntl.h FREAD, FWRITE.

It doesn't make any sense!

static int
unionfs_open(struct vop_open_args *ap)
{
        ...
	if (targetvp == NULLVP) {
		if (uvp == NULLVP) {
			if ((ap->a_mode & FWRITE) && lvp->v_type == VREG) {
				error = unionfs_copyfile(unp,
				    !(ap->a_mode & O_TRUNC), cred, td);


how can this make sense to check a_mode & FWRITE instead of a_mode & VWRITE?

This is a holy mess! And the ap->a_mode & FWRITE didn't come from me! And that doesn't help either. It's a total mess and now that's making me mad.

I will just try to return the ap->a_mode as the error number.

And with this I find that for read only I get 1
While for piping into it I get 0x402.

fcntl.h:

#define FREAD           0x0001
#define FWRITE          0x0002
#define O_CREAT         0x0200          /* create if nonexistent */
#define O_TRUNC         0x0400          /* truncate to zero length */
#define O_EXCL          0x0800          /* error if already exists */

vnode.h:

/*
 * Flags for accmode_t.
 */
#define VEXEC                   000000000100 /* execute/search permission */
#define VWRITE                  000000000200 /* write permission */
#define VREAD                   000000000400 /* read permission */
#define VADMIN                  000000010000 /* being the file owner */
#define VAPPEND                 000000040000 /* permission to write/append */

So this is completely screwed up!

struct vop_open_args should have accmode_t a_mode, not fcntl.h's mode_t.

What I need to do to test this is to say

    accmode_t accmode = 0;
    if(ap->a_mode & FREAD)
      accmode |= VREAD;
    if(ap->a_mode & FWRITE)
      accmode |= VWRITE;

all together now:

                        if (((ap->a_mode & FWRITE) || (ump->um_copypolicy == UNIONFS_COPY_ALWAYS)) && lvp->v_type == VREG) {
                                accmode_t accmode = 0;
                                if(ap->a_mode & FREAD)
                                     accmode |= VREAD;
                                if(ap->a_mode & FWRITE)
                                     accmode |= VWRITE;
                                error = VOP_ACCESS(lvp, accmode, cred, td);
                                if (error != 0)
                                        goto unionfs_open_abort;


but that's really dumb! But that does the trick.

I consider that a bug. The VOP_ACCESS and VOP_OPEN are supposed to both use accmode_t with VREAD, VWRITE, etc. Or what am I missing?
Comment 16 Gunther Schadow 2020-11-29 22:10:04 UTC
Created attachment 220085 [details]
The final (corrected VOP_ACCESS) patch of this work, it includes the copypolicy=always patch. The two could be separated easily if you want to adopt only the fix and not the new feature.
Comment 17 Gunther Schadow 2020-12-23 17:16:57 UTC
Bump. Any comments? Any reaction? Can this go into a future release?