Bug 245688

Summary: nullfs: Doesn't send release for a file read with 'cat' command (while mounted over a FUSE fs).
Product:   Base System
Reporter:  MooseFS FreeBSD Team <freebsd>
Component: kern
Assignee:  Alan Somers <asomers>
Status:    Closed FIXED
Severity:  Affects Some People
CC:        asomers, chris, jSML4ThWwBID69YC, kib, mjg
Priority:  ---
Keywords:  needs-patch, needs-qa
Version:   12.1-RELEASE
Flags:     koobs: maintainer-feedback? (mjg)
           koobs: maintainer-feedback? (asomers)
           koobs: maintainer-feedback? (kib)
           koobs: mfc-stable12?
Hardware:  amd64
OS:        Any
URL: https://github.com/moosefs/moosefs/issues/347#issuecomment-604460874
Attachments:
- How to install a minimal test instance of MooseFS on FreeBSD
- Disable nullfs cacheing on top of fusefs

Description MooseFS FreeBSD Team 2020-04-17 10:12:50 UTC
Scenario:

A MooseFS filesystem (via FUSE ver. 3) is mounted on a FreeBSD 12.1-p2 system, and a nullfs mount is layered over the MooseFS mount (nullfs is used for jails). We perform the following operations:

1) Read a file (using "cat [path/filename]") directly on MooseFS and no nullfs mounted at all:

03.26 15:10:08.014666: uid:0 gid:0 pid:4252 cmd:access (1,0x1): OK
03.26 15:10:08.015150: uid:0 gid:0 pid:4252 cmd:lookup (1,opentest): OK (1.0,1670,1.0,[drwxr-xr-x:0040755,2,0,0,1585231804,1585231099,1585231099,77100])
03.26 15:10:08.015191: uid:0 gid:0 pid:4252 cmd:access (1670,0x1): OK
03.26 15:10:08.015791: uid:0 gid:0 pid:4252 cmd:lookup (1670,test.sh): OK (0.0,49302,1.0,[-rwxr-xr-x:0100755,1,0,0,1585231653,1585230529,1585230533,771])
03.26 15:10:08.015825: uid:0 gid:0 pid:4252 cmd:access (49302,0x4): OK
03.26 15:10:08.016033: uid:0 gid:0 pid:4252 cmd:open (49302) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:0400000C]
03.26 15:10:08.017385: uid:0 gid:0 pid:4252 cmd:read (49302,771,0) [handle:0400000C]: OK (771)
03.26 15:10:08.017645: uid:0 gid:0 pid:4252 cmd:flush (49302) [handle:0400000C,uselocks:0,lock_owner:000000000000109C]: OK
03.26 15:10:08.017866: uid:0 gid:0 pid:4252 cmd:release (49302) [handle:0400000C,uselocks:0,lock_owner:000000000000109C]: OK

2) Read the same file directly on MooseFS with nullfs already mounted:

03.26 15:11:11.219917: uid:0 gid:0 pid:4266 cmd:access (1,0x1): OK
03.26 15:11:11.220632: uid:0 gid:0 pid:4266 cmd:lookup (1,opentest): OK (1.0,1670,1.0,[drwxr-xr-x:0040755,2,0,0,1585231804,1585231099,1585231099,77100])
03.26 15:11:11.220689: uid:0 gid:0 pid:4266 cmd:access (1670,0x1): OK
03.26 15:11:11.221475: uid:0 gid:0 pid:4266 cmd:lookup (1670,test.sh): OK (0.0,49302,1.0,[-rwxr-xr-x:0100755,1,0,0,1585231808,1585230529,1585230533,771])
03.26 15:11:11.221511: uid:0 gid:0 pid:4266 cmd:access (49302,0x4): OK
03.26 15:11:11.221749: uid:0 gid:0 pid:4266 cmd:open (49302) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:0500000C]
03.26 15:11:11.223064: uid:0 gid:0 pid:4266 cmd:read (49302,771,0) [handle:0500000C]: OK (771)
03.26 15:11:11.223348: uid:0 gid:0 pid:4266 cmd:flush (49302) [handle:0500000C,uselocks:0,lock_owner:00000000000010AA]: OK
03.26 15:11:11.223582: uid:0 gid:0 pid:4266 cmd:release (49302) [handle:0500000C,uselocks:0,lock_owner:00000000000010AA]: OK

3) Read the same file via nullfs mount:

03.26 15:11:26.925276: uid:0 gid:0 pid:4105 cmd:access (1670,0x1): OK
03.26 15:11:26.925315: uid:0 gid:0 pid:4105 cmd:lookup (1670,test.sh) (using open dir cache): OK (0.0,49302,1.0,[-rwxr-xr-x:0100755,1,0,0,1585231871,1585230529,1585230533,771])
03.26 15:11:27.507740: uid:0 gid:0 pid:4267 cmd:access (1670,0x1): OK
03.26 15:11:27.508235: uid:0 gid:0 pid:4267 cmd:lookup (1670,test.sh): OK (0.0,49302,1.0,[-rwxr-xr-x:0100755,1,0,0,1585231871,1585230529,1585230533,771])
03.26 15:11:27.508377: uid:0 gid:0 pid:4267 cmd:access (49302,0x4): OK
03.26 15:11:27.508656: uid:0 gid:0 pid:4267 cmd:open (49302) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:0600000C]
03.26 15:11:27.510012: uid:0 gid:0 pid:4267 cmd:read (49302,771,0) [handle:0600000C]: OK (771)
03.26 15:11:27.510273: uid:0 gid:0 pid:4267 cmd:flush (49302) [handle:0600000C,uselocks:0,lock_owner:00000000000010AB]: OK

*There is no release after flush!*

4) Read the same file again directly on MooseFS:

03.26 15:11:35.955262: uid:0 gid:0 pid:4268 cmd:access (1,0x1): OK
03.26 15:11:35.955730: uid:0 gid:0 pid:4268 cmd:lookup (1,opentest): OK (1.0,1670,1.0,[drwxr-xr-x:0040755,2,0,0,1585231886,1585231099,1585231099,77100])
03.26 15:11:35.955771: uid:0 gid:0 pid:4268 cmd:access (1670,0x1): OK
03.26 15:11:35.956399: uid:0 gid:0 pid:4268 cmd:lookup (1670,test.sh): OK (0.0,49302,1.0,[-rwxr-xr-x:0100755,1,0,0,1585231887,1585230529,1585230533,771])
03.26 15:11:35.956432: uid:0 gid:0 pid:4268 cmd:access (49302,0x4): OK
03.26 15:11:35.956692: uid:0 gid:0 pid:4268 cmd:open (49302) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:03000008]
03.26 15:11:35.957949: uid:0 gid:0 pid:4268 cmd:read (49302,771,0) [handle:03000008]: OK (771)
03.26 15:11:35.958217: uid:0 gid:0 pid:4268 cmd:flush (49302) [handle:03000008,uselocks:0,lock_owner:00000000000010AC]: OK

*Still no release!*

So it looks like, starting with the first time the file is read via nullfs, the kernel no longer sends release to the filesystem, so the filesystem keeps the file open. When you perform a lot of reads on files this way, eventually you'll run out of file descriptors...

Now, since the underlying filesystem is a FUSE filesystem, we don't know who is at fault here and neglecting to send the release: either nullfs itself, or for some reason FUSE does it when the operations come through nullfs. The operations listed above are what actually reaches the filesystem on the other side of FUSE.
Comment 1 Alan Somers freebsd_committer freebsd_triage 2020-04-17 12:58:22 UTC
It's hard to say where the bug really lies. But I bet dtrace would reveal it. Could you please post full steps to reproduce, including how to install moosefs and the exact mount commands used?  I might be able to look at this sometime next week.
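Once there's a repro, a starting point might be to stack-trace the likely suspects on both layers. A hedged sketch, assuming fbt probes exist for these symbols (i.e. the functions are not inlined) and that fuse_filehandle_close() is where fusefs emits FUSE_RELEASE:

# dtrace -n 'fbt::null_inactive:entry { stack(); }' \
         -n 'fbt::fuse_filehandle_close:entry { stack(); }'

Comparing the stacks for a read through nullfs against a direct read should show where the release path diverges.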
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2020-04-17 16:49:44 UTC
Caveat up front: nullfs is kinda crap.

Userspace close (cat exiting) of ordinary files translates to vn_close1() -> VOP_CLOSE(9), then vput(9).  This is all generic and not specific to nullfs.  vput() -> vinactive(), then vdropl().

Nullfs does not register a CLOSE vnop, but does have vop_inactive.

/*
 * Do not allow the VOP_INACTIVE to be passed to the lower layer,
 * since the reference count on the lower vnode is not related to
 * ours.
 */
static int
null_want_recycle(struct vnode *vp)
{
        struct vnode *lvp;
        struct null_node *xp;
        struct mount *mp;
        struct null_mount *xmp;

        xp = VTONULL(vp);
        lvp = NULLVPTOLOWERVP(vp);
        mp = vp->v_mount;
        xmp = MOUNTTONULLMOUNT(mp);
        if ((xmp->nullm_flags & NULLM_CACHE) == 0 ||
            (xp->null_flags & NULLV_DROP) != 0 ||
            (lvp->v_vflag & VV_NOSYNC) != 0) {
                /*
                 * If this is the last reference and caching of the
                 * nullfs vnodes is not enabled, or the lower vnode is
                 * deleted, then free up the vnode so as not to tie up
                 * the lower vnodes.
                 */
                return (1);
        }
        return (0);
}

static int
null_inactive(struct vop_inactive_args *ap)
{
        struct vnode *vp;

        vp = ap->a_vp;
        if (null_want_recycle(vp)) {
                vp->v_object = NULL;
                vrecycle(vp);
        }
        return (0);
}
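For reference, the asymmetry is visible in nullfs' vnode op table. An abridged sketch (not a verbatim excerpt) of sys/fs/nullfs/null_vnops.c:

struct vop_vector null_vnodeops = {
        .vop_bypass =   null_bypass,    /* VOPs without an explicit nullfs
                                           handler, VOP_CLOSE included, are
                                           forwarded to the lower vnode */
        /* ... explicitly handled VOPs elided ... */
        .vop_inactive = null_inactive,  /* handled here; deliberately not
                                           passed to the lower vnode */
};

So VOP_CLOSE reaches the lower (fusefs) vnode via the bypass, while VOP_INACTIVE stops at null_inactive() unless the vnode gets recycled.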

So, quick question: is nullfs caching disabled? It is enabled by default and
can be disabled with the mount option 'nocache'.

(This option is not documented in nullfs(5), mount_nullfs(8), or mount(8),
which is... bad.  But it exists in the kernel.)
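
For the curious, the wiring lives in nullfs' mount path; roughly (a paraphrased sketch of sys/fs/nullfs/null_vfsops.c, not verbatim):

        /* Vnode caching is on by default... */
        xmp->nullm_flags |= NULLM_CACHE;
        /*
         * ...and turned off by the 'nocache' mount option, or when the
         * lower filesystem requests it via MNTK_NULL_NOCACHE.
         */
        if (vfs_getopt(mp->mnt_optnew, "nocache", NULL, NULL) == 0 ||
            (xmp->nullm_vfs->mnt_kern_flag & MNTK_NULL_NOCACHE) != 0)
                xmp->nullm_flags &= ~NULLM_CACHE;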
Comment 3 jSML4ThWwBID69YC 2020-04-17 18:02:21 UTC
Hello, 

I've tested both ways. Setting -o nocache seems to allow cmd:release to happen. Here are the tests, each run twice.

The '/storage/chunk/' directory is a MooseFS mount. 

Test 1: nocache
mount -t nullfs -o nocache /storage/chunk /root/test
cat /root/test/test/test.txt

Log output:
04.17 17:40:57.903154: uid:0 gid:0 pid:18295 cmd:access (811230,0x1): OK
04.17 17:40:57.903371: uid:0 gid:0 pid:18295 cmd:lookup (811230,test.txt): OK (0.0,23078,1.0,[-r--r--r--:0100444,1,1001,1001,1586303169,1586303169,1586303175,0])
04.17 17:40:57.903434: uid:0 gid:0 pid:18295 cmd:access (23078,0x4): OK
04.17 17:40:57.903561: uid:0 gid:0 pid:18295 cmd:open (23078) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:01000001]
04.17 17:40:57.903843: uid:0 gid:0 pid:18295 cmd:flush (23078) [handle:01000001,uselocks:0,lock_owner:0000000000004777]: OK
04.17 17:40:57.903885: uid:0 gid:0 pid:18295 cmd:release (23078) [handle:01000001,uselocks:0,lock_owner:0000000000004777]: OK

Test 2: default
umount /root/test
mount -t nullfs /storage/chunk /root/test
cat /root/test/test/test.txt

Log output:
04.17 17:42:21.779677: uid:0 gid:0 pid:18302 cmd:access (811230,0x1): OK
04.17 17:42:21.779848: uid:0 gid:0 pid:18302 cmd:lookup (811230,test.txt): OK (0.0,23078,1.0,[-r--r--r--:0100444,1,1001,1001,1586303169,1586303169,1586303175,0])
04.17 17:42:21.779893: uid:0 gid:0 pid:18302 cmd:access (23078,0x4): OK
04.17 17:42:21.779920: uid:0 gid:0 pid:18302 cmd:open (23078) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:02000001]
04.17 17:42:21.780024: uid:0 gid:0 pid:18302 cmd:flush (23078) [handle:02000001,uselocks:0,lock_owner:000000000000477E]: OK

Test 3: nocache
umount /root/test
mount -t nullfs -o nocache /storage/chunk /root/test
cat /root/test/test/test.txt

Log output:
04.17 17:45:25.955375: uid:0 gid:0 pid:18337 cmd:access (811230,0x1): OK
04.17 17:45:25.955582: uid:0 gid:0 pid:18337 cmd:lookup (811230,test.txt): OK (0.0,23078,1.0,[-r--r--r--:0100444,1,1001,1001,1586303169,1586303169,1586303175,0])
04.17 17:45:25.955626: uid:0 gid:0 pid:18337 cmd:access (23078,0x4): OK
04.17 17:45:25.955655: uid:0 gid:0 pid:18337 cmd:open (23078) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:03000001]
04.17 17:45:25.955762: uid:0 gid:0 pid:18337 cmd:flush (23078) [handle:03000001,uselocks:0,lock_owner:00000000000047A1]: OK
04.17 17:45:25.955800: uid:0 gid:0 pid:18337 cmd:release (23078) [handle:03000001,uselocks:0,lock_owner:00000000000047A1]: OK

Test 4: default
umount /root/test
mount -t nullfs /storage/chunk /root/test
cat /root/test/test/test.txt

Log output:
04.17 17:48:40.735042: uid:0 gid:0 pid:18343 cmd:access (811230,0x1): OK
04.17 17:48:40.735266: uid:0 gid:0 pid:18343 cmd:lookup (811230,test.txt): OK (0.0,23078,1.0,[-r--r--r--:0100444,1,1001,1001,1586303169,1586303169,1586303175,0])
04.17 17:48:40.735312: uid:0 gid:0 pid:18343 cmd:access (23078,0x4): OK
04.17 17:48:40.735343: uid:0 gid:0 pid:18343 cmd:open (23078) (using cached data from lookup): OK (direct_io:0,keep_cache:0) [handle:04000001]
04.17 17:48:40.735473: uid:0 gid:0 pid:18343 cmd:flush (23078) [handle:04000001,uselocks:0,lock_owner:00000000000047A7]: OK

From what I can tell, the cmd:release does not happen unless the nullfs mount includes -o nocache.
Comment 4 Konstantin Belousov freebsd_committer freebsd_triage 2020-04-18 00:32:41 UTC
The 'release' call seems to happen from the fuse VOP_INACTIVE().  If this is true, there are two things broken:
1. With nullfs caching enabled, the cached nullfs vnode keeps a use reference on the lower vnode,
which prevents inactivation of the lower vp.  This can be easily worked around by e.g.
adding a flag to the fuse mount that instructs nullfs not to enable caching over it.

But there is a second issue, which is potentially more serious and which,
if resolved, makes the first issue moot.

2. VFS does not guarantee that VOP_INACTIVE() is called timely.  It can be
missed due to races that make the locking incompatible with the VOP_INACTIVE()
call requirements, in which case the call is not made until the next time the
use count goes to zero or the vnode is reclaimed.  So if 'release' must be
called on time, VOP_INACTIVE() does not guarantee that it happens, even
without nullfs.

WRT VOP_CLOSE(), it is automatically bypassed to the lower vp; no special code
is needed for this to happen.
Comment 5 MooseFS FreeBSD Team 2020-04-20 11:03:30 UTC
Created attachment 213600 [details]
How to install a minimal test instance of MooseFS on FreeBSD
Comment 6 MooseFS FreeBSD Team 2020-04-20 11:20:00 UTC
How to reproduce:

Install a test environment as per the attached instructions, then do the following:

1) Create test directories and files:
# cd /mnt/mfs
# mkdir nulltest
# mkdir /mnt/nullfs
# echo "foo bar" > nulltest/test.txt

2) Open operation log on another console:
# cat /mnt/mfs/.oplog

3) Perform these four sets of operations, checking the operation log on the other console in between:

First test:
# cat /mnt/mfs/nulltest/test.txt
(There should be a release operation in the log)

Second test:
# mount_nullfs /mnt/mfs/nulltest /mnt/nullfs
# cat /mnt/mfs/nulltest/test.txt
(So nullfs is already mounted, but we read directly via MooseFS and the release is still there)

Third test:
# cat /mnt/nullfs/test.txt
(File was read via nullfs and there is no release in the operation log)

Fourth test:
# cat /mnt/mfs/nulltest/test.txt
(File was once again read directly via MooseFS, but the release operation is still missing)
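
(Optional) To watch the leaked handles pile up, repeat the third test in a loop while the oplog runs on the other console; every iteration logs an open with a fresh handle and no matching release:

# for i in $(jot 100); do cat /mnt/nullfs/test.txt > /dev/null; done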
Comment 7 Alan Somers freebsd_committer freebsd_triage 2020-05-09 02:24:50 UTC
Created attachment 214296 [details]
Disable nullfs cacheing on top of fusefs

Could you please retest after applying this patch?
Comment 8 MooseFS FreeBSD Team 2020-05-22 09:46:39 UTC
It works with the patch, thank you!
Comment 9 commit-hook freebsd_committer freebsd_triage 2020-05-22 18:04:06 UTC
A commit references this bug:

Author: asomers
Date: Fri May 22 18:03:15 UTC 2020
New revision: 361399
URL: https://svnweb.freebsd.org/changeset/base/361399

Log:
  Disable nullfs cacheing on top of fusefs

  Nullfs cacheing can keep a large number of vnodes active.  That results in
  more active FUSE file handles, causing some FUSE servers to use extra
  resources.  Disable nullfs cacheing for fusefs, just like we already do for
  NFSv4.

  PR:		245688
  Reported by:	MooseFS FreeBSD Team <freebsd@moosefs.pro>
  MFC after:	2 weeks

Changes:
  head/sys/fs/fuse/fuse_vfsops.c
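
For reference, the change amounts to setting one mount flag so that nullfs declines to cache vnodes on top of fusefs. A paraphrased sketch, assuming the same MNTK_NULL_NOCACHE mechanism the NFSv4 client already uses (see sys/fs/fuse/fuse_vfsops.c for the real change):

        /* In fuse_vfsop_mount(), once the mount is set up: */
        MNT_ILOCK(mp);
        /*
         * Ask nullfs not to cache vnodes over this mount, as for NFSv4;
         * cached nullfs vnodes would pin FUSE file handles open.
         */
        mp->mnt_kern_flag |= MNTK_NULL_NOCACHE;
        MNT_IUNLOCK(mp);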
Comment 10 commit-hook freebsd_committer freebsd_triage 2020-06-12 20:27:52 UTC
A commit references this bug:

Author: asomers
Date: Fri Jun 12 20:27:38 UTC 2020
New revision: 362115
URL: https://svnweb.freebsd.org/changeset/base/362115

Log:
  MFC r361399:

  Disable nullfs cacheing on top of fusefs

  Nullfs cacheing can keep a large number of vnodes active.  That results in
  more active FUSE file handles, causing some FUSE servers to use extra
  resources.  Disable nullfs cacheing for fusefs, just like we already do for
  NFSv4.

  PR:		245688
  Reported by:	MooseFS FreeBSD Team <freebsd@moosefs.pro>

Changes:
_U  stable/12/
  stable/12/sys/fs/fuse/fuse_vfsops.c