Bug 258022

Summary: [FUSEFS] Inode attributes are cached unnecessarily/for too long
Product: Base System
Component: kern
Status: Closed FIXED
Severity: Affects Some People
Priority: ---
Version: 13.0-RELEASE
Hardware: Any
OS: Any
Reporter: Agata <chogata>
Assignee: Alan Somers <asomers>
CC: asomers, freebsd, jSML4ThWwBID69YC, piotr.konopelko
Flags: asomers: mfc-stable13+, asomers: mfc-stable12+
URL: https://reviews.freebsd.org/D33283

Description Agata 2021-08-24 11:30:38 UTC
This problem was reported by a MooseFS user. Under some circumstances, creating a new file system entry (of any type: directory, regular file, special file) on a MooseFS mount fails with "Resource temporarily unavailable", and any subsequent operation on that inode (ls -al, rm or rmdir) fails with the same message. In addition, MooseFS can no longer be unmounted; the system reports "Device busy". Only a restart of the whole machine helps.

This looked somewhat similar to a problem that some versions of the Linux kernel had, where a process on one machine deleted the CWD of a process on a different machine. So at first we thought it only concerned CWDs, and we introduced some safeguards in the MooseFS client for FreeBSD. But recent findings show that the problem is much more serious and lies on the FreeBSD side.

A simple test: we take two machines and mount MooseFS on both.
On the FreeBSD machine (13.0-RELEASE-p3) we use these mount options:

mfsmount -o mfsattrcacheto=0 -o mfsxattrcacheto=0 -o mfsentrycacheto=0 -o mfsdirentrycacheto=0 -o mfssymlinkcacheto=0 -o mfsgroupscacheto=0 /mnt/mfs

All of the -o options are there to disable any attribute caches that may exist (lookup, access, mkdir and similar operations will all return a cache time of 0 seconds).

We also have a second machine with the same MooseFS instance mounted. The operating system on the second machine is irrelevant.

Then we perform these steps, exactly in the order shown below:

***FreeBSD machine***
~# cd /mnt/mfs/testdir
/mnt/mfs/testdir# ls -al
total 2932
drwxr-xr-x   2 root  wheel        1 Aug 23 12:41 .
drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
/mnt/mfs/testdir#
******
(FreeBSD "sees" that "testdir" is empty)

***OTHER machine***
~# cd /mnt/mfs/testdir
/mnt/mfs/testdir# mkdir dir
/mnt/mfs/testdir#
******
(Other machine creates a directory named "dir" inside "testdir")

***FreeBSD machine***
/mnt/mfs/testdir# ls -al
total 2933
drwxr-xr-x   3 root  wheel        1 Aug 23 12:59 .
drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
drwxr-xr-x   2 root  wheel        1 Aug 23 12:59 dir
/mnt/mfs/testdir#
******
(FreeBSD "sees" that there is now "dir" inside "testdir")

***OTHER machine***
/mnt/mfs/testdir# ls -ali
total 2933
8 drwxr-xr-x   3 root  wheel        1 Aug 23 12:59 .
1 drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
9 drwxr-xr-x   2 root  wheel        1 Aug 23 12:59 dir
/mnt/mfs/testdir# rmdir dir
/mnt/mfs/testdir# ls -al
total 2932
drwxr-xr-x   2 root  wheel        1 Aug 23 13:00 .
drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
/mnt/mfs/testdir#
******
(We check the inode number of "dir" on the other machine and delete "dir")

***FreeBSD machine***
/mnt/mfs/testdir# ls -al
total 2932
drwxr-xr-x   2 root  wheel        1 Aug 23 13:00 .
drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
/mnt/mfs/testdir#
******
(FreeBSD again "sees" that "testdir" is empty)

Now we wait at least 5 minutes; the timing is explained in the notes below.

***FreeBSD machine***
/mnt/mfs/testdir# echo "foo" > file.txt
-bash: file.txt: Resource temporarily unavailable
/mnt/mfs/testdir# ls -al
ls: file.txt: Resource temporarily unavailable
total 2932
drwxr-xr-x   2 root  wheel        1 Aug 23 13:17 .
drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
/mnt/mfs/testdir#
******
Ooops?!

***OTHER machine***
/mnt/mfs/testdir# ls -ali
total 2932
8 drwxr-xr-x   2 root  wheel        1 Aug 23 13:17 .
1 drwxrwxrwx  43 root  wheel  3001433 Aug 23 12:28 ..
9 -rw-r--r--   1 root  wheel        0 Aug 23 13:17 file.txt
/mnt/mfs/testdir#
******
The newly created file got the same inode number as the recently deleted directory "dir"...

Notes:
1) The effect is not limited to a former directory's inode number being reused for a file. It happens whenever the new object is of a different type than the old one (an ex-directory inode number reused for a file, ex-file for a fifo, ex-fifo for a device or directory, etc.). The "ls -al" scenario is not the only trigger either: the same happens if objects are created on the FreeBSD machine and then deleted from another machine, which is of course a normal occurrence in a network file system.
2) The default inode reuse delay in MooseFS is 24 hours. It was set to 5 minutes for testing purposes only. The person who first reported the problem (others followed) used the default 24 hours. Only inodes that are truly "free" are reused: an inode that is a CWD on any MooseFS client connected to the instance, or that belongs to a sustained (deleted but still open) file, is not. The 24-hour delay is counted from the moment an inode is considered free, so if a file stays in the sustained state for, let's say, 24 hours after deletion (until whatever process had a hold on it finally finishes), its inode number is still not reused for another 24 hours.
3) Default cache timeouts in MooseFS: file attributes - 1 second; extended attributes (xattr) - 30 seconds; directory entries - 1 second; negative entries - 0 seconds (no negative cache by default); symbolic links - 300 seconds; supplementary groups - 300 seconds.
4) Caches in the above experiment were ALL set to 0.
5) The problem was first reported on FreeBSD 12.1.
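
The reuse timing from note 2 can be sketched as follows. This is only an illustration of the rule as described above, not MooseFS code; the function name, its parameters and the REUSE_DELAY constant are made up for the example.

```python
from datetime import datetime, timedelta

# Assumption: the reuse delay is a single duration (24 hours by default,
# per note 2). The name REUSE_DELAY is hypothetical.
REUSE_DELAY = timedelta(hours=24)


def earliest_reuse(deleted_at, last_release_at=None, delay=REUSE_DELAY):
    """Return the earliest time an inode number may be handed out again.

    deleted_at      -- when the object was unlinked
    last_release_at -- when the last holder (open handle or CWD) let go,
                       or None if nothing held the inode after deletion
    """
    # The inode only counts as "free" once it is deleted AND nobody holds it.
    if last_release_at is None:
        freed_at = deleted_at
    else:
        freed_at = max(deleted_at, last_release_at)
    # The delay starts from the moment the inode became free.
    return freed_at + delay


deleted = datetime(2021, 8, 23, 13, 0)
# A file kept open for 24 hours after deletion: reusable 48 hours after deletion.
print(earliest_reuse(deleted, deleted + timedelta(hours=24)))
# The test setup: 5 minute delay, nothing holding the inode.
print(earliest_reuse(deleted, delay=timedelta(minutes=5)))
```

With the 5 minute delay used in the test, inode 9 deleted around 13:00 becomes reusable around 13:05, which fits the file created at 13:17 getting that number.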

So, to sum it up: we say "don't cache anything at all" (or, with the defaults, "nothing longer than 300 seconds"), yet FreeBSD caches inode attributes (we don't know which ones, but at least the type) for longer than 24 hours. This causes a serious problem, because a new inode with a reused inode number is basically unusable in the file system.
Comment 1 Alan Somers freebsd_committer freebsd_triage 2021-08-24 13:44:51 UTC
I'll look into this when I have some time.  But could you please tell me what FUSE protocol version MooseFS is mounting with?  Earlier protocol versions had very limited ability to specify cache retention times.
Comment 2 Agata 2021-08-25 08:19:07 UTC
This is directly from the test machine I did this particular test on:

compiled_with_fuse: 3.2
kernel_fuse_protocol: 7.28

I don't know about our users, but I assume they mount with the latest they get with the system.
Comment 3 jSML4ThWwBID69YC 2021-08-25 13:34:36 UTC
As a user, here are my settings.

compiled_with_fuse: 31.0
kernel_fuse_protocol: 7.28
PKG: fusefs-libs3-3.10.4
PKG: moosefs3-client-3.0.116
Comment 4 Alan Somers freebsd_committer freebsd_triage 2021-12-06 05:57:24 UTC
Patch in review.  Note that this bug actually didn't have anything to do with inode attributes.  The file type isn't considered an attribute, because it must remain the same throughout a file's lifetime.  So the entry cache is more relevant than the attribute cache.  But as it turns out, the best way to handle this situation is the same regardless of whether the entry cache has expired or not.
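
To illustrate the distinction between file type and attributes, here is a small userspace sketch using the type bits and permission bits of st_mode. It is not how the kernel tracks vnode types; the helper name same_object and the example mode values are made up, and only Python's standard stat module is used.

```python
import stat


def same_object(old_mode, new_mode):
    """Two st_mode values seen for one inode number can only describe the
    same file if their type bits match; permission bits may change freely."""
    return stat.S_IFMT(old_mode) == stat.S_IFMT(new_mode)


# Inode 9 as the directory "dir", then reused for the regular file "file.txt".
old_mode = stat.S_IFDIR | 0o755
new_mode = stat.S_IFREG | 0o644

print(same_object(old_mode, new_mode))   # type changed: a different object
print(same_object(stat.S_IFREG | 0o644, stat.S_IFREG | 0o600))  # chmod only
```

A permission change alone leaves the type bits intact, so it is an ordinary attribute update. A change in the type bits means the inode number now refers to a different object.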
Comment 5 commit-hook freebsd_committer freebsd_triage 2021-12-07 04:37:53 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=25927e068fcbcac0a5111a881de723bd984b04b3

commit 25927e068fcbcac0a5111a881de723bd984b04b3
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2021-12-06 05:43:17 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2021-12-07 04:36:46 +0000

    fusefs: correctly handle an inode that changes file types

    Correctly handle the situation where a FUSE server unlinks a file, then
    creates a new file of a different type but with the same inode number.
    Previously fuse_vnop_lookup in this situation would return EAGAIN.  But
    since it didn't call vgone(), the vnode couldn't be reused right away.
    Fix this by immediately calling vgone() and reallocating a new vnode.

    This problem can occur in three code paths, during VOP_LOOKUP,
    VOP_SETATTR, or following FUSE_GETATTR, which usually happens during
    VOP_GETATTR but can occur during other vops, too.  Note that the correct
    response actually doesn't depend on whether the entry cache has expired.
    In fact, during VOP_LOOKUP, we can't even tell.  Either it has expired
    already, or else the vnode got reclaimed by vnlru.

    Also, correct the error code during the VOP_SETATTR path.

    PR:             258022
    Reported by:    chogata@moosefs.pro
    MFC after:      2 weeks
    Reviewed by:    pfg
    Differential Revision: https://reviews.freebsd.org/D33283

 sys/fs/fuse/fuse_internal.c    |  9 +++++---
 sys/fs/fuse/fuse_node.c        | 25 +++++++++++----------
 tests/sys/fs/fusefs/getattr.cc | 50 ++++++++++++++++++++++++++++++++++++++++++
 tests/sys/fs/fusefs/lookup.cc  | 32 +++++++++++++++++++++------
 tests/sys/fs/fusefs/setattr.cc | 47 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 142 insertions(+), 21 deletions(-)
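
The behaviour change described in the commit message can be modelled with a toy cache keyed by inode number. This is a deliberately simplified sketch under stated assumptions, not the fusefs implementation: the VnodeCache class, its fixed flag and the string type names are invented for illustration, and deleting the dictionary entry merely stands in for vgone().

```python
import errno


class VnodeCache:
    """Toy model: inode number -> cached file type (hypothetical structure)."""

    def __init__(self, fixed):
        self.fixed = fixed      # False: behaviour before the fix, True: after
        self.vnodes = {}

    def lookup(self, ino, vtype):
        cached = self.vnodes.get(ino)
        if cached is not None and cached != vtype:
            if not self.fixed:
                # Before the fix: report EAGAIN but leave the stale vnode in
                # place, so every later lookup fails the same way.
                raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
            # After the fix: discard the stale vnode (stand-in for vgone())
            # and fall through to allocate a fresh one.
            del self.vnodes[ino]
        self.vnodes[ino] = vtype
        return vtype


for fixed in (False, True):
    cache = VnodeCache(fixed)
    cache.lookup(9, "VDIR")             # "dir" is seen with inode 9
    try:
        print(fixed, cache.lookup(9, "VREG"))   # "file.txt" reuses inode 9
    except OSError as e:
        print(fixed, "error:", e.strerror)
```

In the unfixed model the mismatch persists until the stale entry goes away by some other means, which matches the reported symptom of every operation on the inode failing until reboot.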
Comment 6 commit-hook freebsd_committer freebsd_triage 2022-01-03 02:52:11 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=139764c4613cde14c97ff8dd8007eb0f27f5fb9f

commit 139764c4613cde14c97ff8dd8007eb0f27f5fb9f
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2021-12-06 05:43:17 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-01-03 02:36:38 +0000

    fusefs: correctly handle an inode that changes file types

    Correctly handle the situation where a FUSE server unlinks a file, then
    creates a new file of a different type but with the same inode number.
    Previously fuse_vnop_lookup in this situation would return EAGAIN.  But
    since it didn't call vgone(), the vnode couldn't be reused right away.
    Fix this by immediately calling vgone() and reallocating a new vnode.

    This problem can occur in three code paths, during VOP_LOOKUP,
    VOP_SETATTR, or following FUSE_GETATTR, which usually happens during
    VOP_GETATTR but can occur during other vops, too.  Note that the correct
    response actually doesn't depend on whether the entry cache has expired.
    In fact, during VOP_LOOKUP, we can't even tell.  Either it has expired
    already, or else the vnode got reclaimed by vnlru.

    Also, correct the error code during the VOP_SETATTR path.

    PR:             258022
    Reported by:    chogata@moosefs.pro
    Reviewed by:    pfg
    Differential Revision: https://reviews.freebsd.org/D33283

    (cherry picked from commit 25927e068fcbcac0a5111a881de723bd984b04b3)

 sys/fs/fuse/fuse_internal.c    |  9 +++++---
 sys/fs/fuse/fuse_node.c        | 25 +++++++++++----------
 tests/sys/fs/fusefs/getattr.cc | 50 ++++++++++++++++++++++++++++++++++++++++++
 tests/sys/fs/fusefs/lookup.cc  | 32 +++++++++++++++++++++------
 tests/sys/fs/fusefs/setattr.cc | 47 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 142 insertions(+), 21 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2022-01-03 05:32:40 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f09df4b05cf4b9f065e3db642666355a95c036e4

commit f09df4b05cf4b9f065e3db642666355a95c036e4
Author:     Alan Somers <asomers@FreeBSD.org>
AuthorDate: 2021-12-06 05:43:17 +0000
Commit:     Alan Somers <asomers@FreeBSD.org>
CommitDate: 2022-01-03 05:15:38 +0000

    fusefs: correctly handle an inode that changes file types

    Correctly handle the situation where a FUSE server unlinks a file, then
    creates a new file of a different type but with the same inode number.
    Previously fuse_vnop_lookup in this situation would return EAGAIN.  But
    since it didn't call vgone(), the vnode couldn't be reused right away.
    Fix this by immediately calling vgone() and reallocating a new vnode.

    This problem can occur in three code paths, during VOP_LOOKUP,
    VOP_SETATTR, or following FUSE_GETATTR, which usually happens during
    VOP_GETATTR but can occur during other vops, too.  Note that the correct
    response actually doesn't depend on whether the entry cache has expired.
    In fact, during VOP_LOOKUP, we can't even tell.  Either it has expired
    already, or else the vnode got reclaimed by vnlru.

    Also, correct the error code during the VOP_SETATTR path.

    PR:             258022
    Reported by:    chogata@moosefs.pro
    Reviewed by:    pfg
    Differential Revision: https://reviews.freebsd.org/D33283

    (cherry picked from commit 25927e068fcbcac0a5111a881de723bd984b04b3)

 sys/fs/fuse/fuse_internal.c    |  9 +++++---
 sys/fs/fuse/fuse_node.c        | 25 +++++++++++----------
 tests/sys/fs/fusefs/getattr.cc | 50 ++++++++++++++++++++++++++++++++++++++++++
 tests/sys/fs/fusefs/lookup.cc  | 32 +++++++++++++++++++++------
 tests/sys/fs/fusefs/setattr.cc | 47 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 142 insertions(+), 21 deletions(-)