Bug 230258 - [FUSE] [BUG]: Attributes caching issue
Summary: [FUSE] [BUG]: Attributes caching issue
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: Conrad Meyer
URL: https://robo.moosefs.com/support/fuse...
: 218101 (view as bug list)
Depends on:
Blocks: 235773 235774 235775
  Show dependency treegraph
Reported: 2018-08-01 12:35 UTC by MooseFS FreeBSD Team
Modified: 2019-02-16 03:45 UTC (History)
6 users (show)

See Also:

adding filesize handling to lookup (fusefs) (1.01 KB, patch)
2018-08-10 08:13 UTC, Jakub Kruszona-Zawadzki
no flags Details | Diff
changed filesize handling in fuse file system (4.16 KB, patch)
2018-08-14 08:30 UTC, Jakub Kruszona-Zawadzki
no flags Details | Diff
changed filesize handling in fuse file system (4.46 KB, patch)
2018-08-14 12:16 UTC, Jakub Kruszona-Zawadzki
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description MooseFS FreeBSD Team 2018-08-01 12:35:36 UTC
This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All four issues can be replicated using this simple test script:

FreeBSD FUSE is keeping attributes it should not keep. Example: when kernel invokes lookup, one can say how long the result is kept in cache. We return 0 (0.0) timeout, but we still get results that suggest file length is kept in cache.

What happens: we have two clients using our distributed file system. Each client has access to the same file. For convienience sake, let's say those files are named /mnt/llfuse/hello1 on client one and /mnt/llfuse/hello2 on client two. But both files point to the SAME OBJECT in the filesystem.
When client one reads /mnt/llfuse/hello1, a minute later client two writes something to /mnt/llfuse/hello2, and yet a minute later client one reads /mnt/llfuse/hello1, client one does NOT see the data appended by client two.

After we perform this set of operations:

truncate -s 0 /mnt/llfuse/hello1
truncate -s 0 /mnt/llfuse/hello2
sleep 0.1
echo "t1" >> /mnt/llfuse/hello1
sleep 0.1
echo "t2" >> /mnt/llfuse/hello2
sleep 0.1
echo "t3" >> /mnt/llfuse/hello1
sleep 0.1
echo "t4" >> /mnt/llfuse/hello2
sleep 0.1
echo "t5" >> /mnt/llfuse/hello1
sleep 0.1
echo "t6" >> /mnt/llfuse/hello2
sleep 0.1

We get in result:

cat /mnt/llfuse/hello1

cat /mnt/llfuse/hello2

Which is, of course, not the correct content of the file we just wrote to.

We tried also to use fuse_lowlevel_notify_inval_entry function to get rid of the problem, but it returns EINVAL and writes "fuse: writing device: Invalid argument" to screen.

Best regards,
Peter / MooseFS Team
Comment 1 Conrad Meyer freebsd_committer 2018-08-02 00:15:19 UTC
FreeBSD's bufcache caches file contents by default.  This can be worked around by having userspace open files with the O_DIRECT option, or by having your FUSE filesystem's FUSE_OPEN handler return FOPEN_DIRECT_IO in open_flags (i.e. set the 'direct_io' flag in fuse_file_info).

Your test 'hello' filesystem does not appear set the direct_io flag on hello1 and hello2 (and of course 'echo >>' is unlikely to), so that may explain corrupt results.

Given FreeBSD's local cache cannot possibly be consistent with your distributed filesystem (at least without sophisticated cache invalidation that FUSE probably lacks), you likely want to always operate in DIRECT mode.
Comment 2 Conrad Meyer freebsd_committer 2018-08-02 00:22:12 UTC
Re: lowlevel_notify_inval_entry, note that this was added in protocol 7.12 and FreeBSD only supports 7.8:

Comment 3 Conrad Meyer freebsd_committer 2018-08-02 00:23:50 UTC
Here's a changelog of protocol additions we don't have: https://github.com/libfuse/libfuse/blob/master/include/fuse_kernel.h
Comment 4 Jakub Kruszona-Zawadzki 2018-08-02 06:41:14 UTC
I've checked it using "DIRECT" flag - result is almost the same:



Still no 't2' and no 't4'. Our problem is not in data caching - this can be fixed using 'direct_io' (not vary good solution, but it works). The problem is connected with dentry cache - i.e. file length. Because of this cache operation "echo 't3' >> /mnt/llfuse/hello1" uses old file length and overwrites 't2'.

In Linux and OS X this works fine. On FreeBSD it looks like 'timeout' parameter returned with attributes (getattr) and entry data (lookup) is totally ignored.

For example - output from OS X (original version - without direct):



Still no 't6' in 'hello1' - probably they had similar issue and 'fixed' this by invoking 'getattr' whenever file is opened with O_APPEND, so data are appended properly, but then when 'hello1' is read without 'O_APPEND' it also uses "old" file length.

On Linux there is 't6' in 'hello1' (proper timeouting attributes).

As for now there is no work-around for this issue. We've mentioned 'fuse_lowlevel_notify_inval_entry' only because it could be used as a work-around. We do not use this function in our current implementation.
Comment 5 Conrad Meyer freebsd_committer 2018-08-02 19:46:50 UTC
(In reply to Jakub Kruszona-Zawadzki from comment #4)
> We've mentioned 'fuse_lowlevel_notify_inval_entry' only because it could be
> used as a work-around. We do not use this function in our current
> implementation.

Understood, just wanted to clarify why you saw the error message.

It's interesting you mention O_APPEND.  I forgot about it, but that was totally broken (even on individual files) in 11.0 and I fixed it in CURRENT (r320451) June 2017.  It looks like MarkJ MFC'd the commit to 11 July 2017 (r320684) and also to 11.1 (r320689).  So if you are actually running 11.0 and not 11.1 per the PR Version field, I'd encourage you to try 11.1 :-).

I can't do it now, but I'll look into the timeout value and our file size caching a little more later (maybe tonight).

Thanks for the detailed report with repro, your help with investigating, and of course, patience.
Comment 6 Conrad Meyer freebsd_committer 2018-08-04 03:34:35 UTC
Ok, dirent timeouts (entry_timeout*), as well as attr timeouts (attr_timeout) are encoded in fuse_entry_out via e.g., fuse_reply_entry().

Attribute timeout data is also encoded in fuse_attr_out (attr_valid) via e.g. fuse_reply_attr().

There also seems to be some fuse_config entry, attr, and negative timeouts?

I'm not sure how all of these are supposed to interact; are implementations supposed to cache the attr_timeout value from lookup?  Any documentation of the protocol expectations here would be very welcome.

Timeout values don't look optional.  I.e., technically we aren't allowed to cache indefinitely.  But it's probably safe enough to treat any ULONG_MAX timeout value as an infinity sentinel (if I did the math right, it represents about 600 billion years).

It's easy to say what reasonable behavior should be at the limits:

1. Use our normal caching if attr_valid is the maximal value.
2. Don't cache if attr_valid is zero.  (Meets your needs.)

But that leaves undefined the entire (0, ULONG_MAX) range.  I'm confident that actually implementing some sort of timer based cache invalidation would be fairly complicated code-wise, and seems like it would have limited utility vs on-demand cache invalidation.

The safe thing is to not cache, but I'm not sure if real-world filesystems actually pass the ULONG_MAX "infinity" timeout value.  I guess I can try to dig into the ones I'm familiar with (probably ntfs-3g being most popular) and see what they do.
Comment 7 Conrad Meyer freebsd_committer 2018-08-04 03:39:49 UTC
Some libfuse example filesystems define a magical "NO_TIMEOUT" of 500000, which is neither ULONG_MAX nor 0.  In fact, it's just shy of six days.  I don't think we should special case that magical value unless real FUSE FS use it.
Comment 8 Conrad Meyer freebsd_committer 2018-08-04 03:53:25 UTC
(In reply to Conrad Meyer from comment #6)
> There also seems to be some fuse_config entry, attr, and negative timeouts?

I see.  This is configured as a global default by high level API consumers, and the global value is passed into the LL routines.  We (kernel) can ignore that since we only interact with the LL interface.  It seems the highlevel defaults are 1 second each for entry/attr, and 0 for negative entry.

I'm leaning towards keeping our current indefinite caching for any non-zero timeout value, and not caching when given a zero timeout/valid duration.  That would at least allow for correct operation of your filesystem.  (Performance may suffer due to excessive IPC until the more sophisticated notify_inval_entry is supported.)
Comment 9 Conrad Meyer freebsd_committer 2018-08-04 04:13:48 UTC
ntfs-3g appears to use the highlevel API only.  It seems like attribute caching is enabled or disabled with a compile-time option ("CACHEING" macro) based on some platform-specific definitions, and nothing else explicitly configures it.  The values used are either 0.0, 1.0, or 10.0.  So, I'm still leaning towards 0 = do not cache, non-zero = cache indefinitely.
Comment 10 Conrad Meyer freebsd_committer 2018-08-04 05:36:04 UTC
From the FreeBSD side of things, here's the status quo (to the best of my understanding):

1. On create (open + O_CREAT), we:
  a. cache_enter() (add an entry to the generic namecache which sits between generic namei()/lookup and VFS specific VOP_LOOKUP operations) — ignoring any entry_valid value (bogus).
  b. Arbitrarily nuke all negative cache entries below the parent directory, for unclear reasons (maybe pessimistic, but orthogonal to this issue).
  c. Totally ignore any returned attribute and attribute timeout (safe, I think, but maybe pessimistic).

2. In lookup, we:
  a. Skip right past fuse_entry_out or fuse_attr_out and go directly to the attributes (ignoring cache timeout value).
  b. And of course cache them indefinitely (cache_attrs()) (bogus).

3. For link, we check link count validity with our cached va_nlink attribute, assuming our attribute cache is valid (bogus).
  a. Then we ignore the entry_out result entirely (but don't cache the newly created link vnode, which is safe, if pessimistic).
Comment 11 Conrad Meyer freebsd_committer 2018-08-04 06:08:54 UTC
Ah, also anything using fuse_internal_newentry{,_core}: mkdir and symlink.
Comment 12 Conrad Meyer freebsd_committer 2018-08-04 08:45:20 UTC
Please try this patch for the attribute caching issue:

It compiles (on CURRENT) but I have not had a chance to actually test it out yet.  Notably, it does *not* fix the lookup/entry caching.
Comment 13 Conrad Meyer freebsd_committer 2018-08-08 04:22:45 UTC
(In reply to Jakub Kruszona-Zawadzki from comment #4)
Please try https://people.freebsd.org/~cem/pr230258.2of2.1.patch for entry data (lookup) caching.
Comment 14 Jakub Kruszona-Zawadzki 2018-08-08 09:55:45 UTC
I've just checked it and it doesn't work.



I applied both patches.

BTW. Timeouts other than 0.0 and "INFINITY" are very useful - especially in case of directories. In files we have one crucial attribute - size - which shouldn't be cached in any circumstances, but in case of directories attributes are less important and accessing files using long paths leads to lot of lookups and caching them (even for a second) can significantly improve efficiency.

In MooseFS this is the default strategy - for files timeout is 0.0, but for directories default is 1.0.
Comment 15 Conrad Meyer freebsd_committer 2018-08-08 16:33:09 UTC
(In reply to Jakub Kruszona-Zawadzki from comment #14)
Yeah, we do some special size attribute handling now.  Anyway, I am not very familiar with our FUSE implementation and must be missing something!  Thanks for testing it and reporting back.  I will have to test and debug further at some point, but I don't know when that will be.

(Any other FreeBSD contributor: consider this PR poachable.)
Comment 16 Jakub Kruszona-Zawadzki 2018-08-10 06:33:59 UTC
More info.

1. I've checked 12.0 and result is the same

2. I've found sysctl option "vfs.fuse.refresh_size" which is set by default to 0. Setting it to 1 resolves the problem partially (but probably at the cost of efficiency). Output:



3. I've made simple patch myself and it seems that it resolves the issue. In file "fuse_vnops.c" I've copied filesize handling from GETATTR to LOOKUP. In function "fuse_vnop_lookup" just after:

                if (op == FUSE_GETATTR) {
                        cache_attrs(*vpp, (struct fuse_attr_out *)fdi.answ);
                } else {
                        cache_attrs(*vpp, (struct fuse_entry_out *)fdi.answ);

I've added this code:

                        struct fuse_vnode_data *fvdat = VTOFUD(*vpp);

                        if (vnode_isreg(*vpp) && (fvdat->flag & FN_SIZECHANGE) == 0) {
                                 * This is for those cases when the file size changed without us
                                 * knowing, and we want to catch up.
                                off_t new_filesize;
                                if (op == FUSE_GETATTR) {
                                        new_filesize = ((struct fuse_attr_out *)fdi.answ)->attr.size;
                                } else {
                                        new_filesize = ((struct fuse_entry_out *)fdi.answ)->attr.size;

                                if (fvdat->filesize != new_filesize) {
                                        fuse_vnode_setsize(*vpp, cred, new_filesize);

And now result is correct:



I do not know how to make nice patch - this is why I'am sending it like that. Later I'll try to create proper patch from this.
Comment 17 Jakub Kruszona-Zawadzki 2018-08-10 08:13:13 UTC
Created attachment 196048 [details]
adding filesize handling to lookup (fusefs)

This is patch I've promised
Comment 18 Conrad Meyer freebsd_committer 2018-08-10 08:31:13 UTC
*** Bug 218101 has been marked as a duplicate of this bug. ***
Comment 19 Conrad Meyer freebsd_committer 2018-08-10 09:14:42 UTC
(In reply to Jakub Kruszona-Zawadzki from comment #17)

I think you can simplify somewhat using the existing 'fattr' pointer.  new_filesize = fattr->size;

Other than that, it looks ok to me.  I can't say I totally understand the FN_SIZECHANGE / fuse_vnode_setsize mechanism, but this copy seems to match GETATTR.  It might also be nice to refactor it out into a subroutine, but it's fairly small anyway.
Comment 20 Ben RUBSON 2018-08-10 10:03:43 UTC
What about adding support for attr_timeout and entry_timeout options as proposed in #218101 ?
Comment 21 Conrad Meyer freebsd_committer 2018-08-10 10:44:37 UTC
(In reply to Ben RUBSON from comment #20)
Please scroll up to comments #12-13.
Comment 22 Ben RUBSON 2018-08-10 11:01:41 UTC
Sorry Conrad, yes your patch handles these options (at least the entry_timeout if I red correctly).
But has Jakub did not succeed testing it, and proposed a lighter version which does not seem to take care of the options value, I wondered :)
Anyway, I'll be glad to test something about to be merged (as I also reported this bug in #218101).
Comment 23 Jakub Kruszona-Zawadzki 2018-08-13 07:38:39 UTC
Ok. As I see there are two different problems. One is entry/attr cache timeout and the other is file size handling. In Linux file size is strictly connected with attribute cache and therefore proper handling of attribute timeouts is important there. In FreeBSD it looks that it works differently. My small patch only fixed file size handling (which is very important - much more important than handling other attributes).

I've checked the code and what I see is that entry cache is now totally disabled (which is good, because it is better not to cache at all than cache without timeout). Attribute cache is there only in theory ("cached_attrs" field in "struct fuse_vnode_data"). In theory, because in the code attributes are set but ... never used. Only field va_nlink is used in "link" operation and only to check if it reaches FUSE_LINK_MAX.

This is why Conrad's patch didn't change anything.

To add full attr_timeout support we should add "struct timespec attrtimeout" field to "struct fuse_vnode_data", fill it correctly in "cache_attrs" and use it - mainly in "fuse_vnop_getattr" - check "attrtimeout" before dispatching FUSE_GETATTR and fill "vap" field using cached attributes instead.

To add full entry_timeout support maybe we can use "cache_enter_time" instead of "cache_enter" and store timeout in entry cache and then use it in lookup.

I might try to create a patch at least for attr_timeout. The only problem is that I'm unfamiliar with writing code on the kernel level. In this case I don't know how to obtain "monotonic" clock in the kernel.
Comment 24 Jakub Kruszona-Zawadzki 2018-08-13 08:07:47 UTC
Maybe using "ticks" for timeouts should be enough? What do you think?
Comment 25 Ben RUBSON 2018-08-13 10:17:55 UTC
Thank you for your explanation Jakub.
I did not understand that there were 2 issues.
I thought that the wrong returned size was due to the fact that attributes / entries were cached, as on Linux when we do not use attr_timeout=0 / entry_timeout=0.
Good news then to know that cache is for now disabled on FreeBSD, and that your patch solves the size issue (I'll give it a try).

Not sure attr_timeout / entry_timeout support is then really needed, it has now a more low priority, as you stated.
Perhaps it could help with performance in some cases (I can personally live without).
Comment 26 Ben RUBSON 2018-08-13 10:55:07 UTC
Jakub I applied your patch and unfortunately it does not correct the bug I opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218101 for :
if the underlying file size is changed, it's not reflected at fuse side.
This is why I thought caching may be enabled somewhere in the path and pointed to attr_timeout/entry_timeout options.
There may then be more than one bug.
Comment 27 Jakub Kruszona-Zawadzki 2018-08-14 07:09:42 UTC
(In reply to Ben RUBSON from comment #26)

Yes. I've just checked it. My patch works fine with "APPEND" to file, but simple "stat" shows "old" file size. I'll try to find better solution.
Comment 28 Jakub Kruszona-Zawadzki 2018-08-14 08:30:43 UTC
Created attachment 196191 [details]
changed filesize handling in fuse file system

Ok. I've created new patch.

1. I've removed "FN_SIZECHANGED" flag which in my opinion is not necessary.

2. I've removed "fuse_vnode_savesize" function - this caused unnecessary SETATTR to userspace - fuse filesystems have to update filesize correctly themselves - this SETATTR is not needed and even it can lead to small problems (unexpected change of mtime).

3. I've removed "fuse_vnode_refreshsize" function. Because we set correctly filesize on GETATTR and LOOKUP it seems unnecessary (I'm not sure here, but everything seems to work very well without it).

4. I've removed using cached filesize On GETATTR (fuse filesystems have to track filesize correctly themselves, so size returned by GETATTR should be always correct).

5. I've fixed previous patch (update filesize on LOOKUP)
Comment 29 Ben RUBSON 2018-08-14 11:06:40 UTC
This new patch helps yes ; now, when underlying file size has changed, stat() returns correct size at fuse side. Perfect, thank you Jakub !

However I found another use case where we still face an issue.
Let's assume raw file is the underlying file.
And fuse file the corresponding fuse file.

1. create raw file with some data
2. open fuse file
3. verify fuse data : correct
4. close fuse file
5. add some data to raw file
6. open fuse file
7. verify fuse data : garbage

I've found 2 tricks which may help to understand the bug :
- add a 5b step which stat() fuse file.
- use fuse direct_io mount option.

We worked on this last year with Conrad :
But does not seem to help here.
Comment 30 Jakub Kruszona-Zawadzki 2018-08-14 11:24:34 UTC
(In reply to Ben RUBSON from comment #29)

I see that still file size is sometimes not refreshed correctly:

root@fbsdcurr:/mnt/llfuse # truncate -s0 hello1
root@fbsdcurr:/mnt/llfuse # cat hello1
root@fbsdcurr:/mnt/llfuse # cat hello2
root@fbsdcurr:/mnt/llfuse # echo "ala" > hello1
root@fbsdcurr:/mnt/llfuse # cat hello2
root@fbsdcurr:/mnt/llfuse # echo "ola" >> hello1
root@fbsdcurr:/mnt/llfuse # cat hello2
root@fbsdcurr:/mnt/llfuse # echo "ela" >> hello1
root@fbsdcurr:/mnt/llfuse # cat hello2
root@fbsdcurr:/mnt/llfuse # stat hello2
18446744073390849794 3 -rw-rw-rw- 1 root wheel 0 12 "Aug 14 13:17:38 2018" "Aug 14 13:17:37 2018" "Aug 14 13:17:37 2018" "Jan  1 00:59:59 1970" 4096 0 0 hello2
root@fbsdcurr:/mnt/llfuse # cat hello2
root@fbsdcurr:/mnt/llfuse #

after 'echo "ela" >> hello1' first read (cat hello2) reads file using old length, then after 'stat' it seems that file size is correctly refreshed and then it reads correct data with correct size. I'll check it.

I don't see any "garbage" - only size is wrong.
Comment 31 Ben RUBSON 2018-08-14 11:28:36 UTC
Yes, sorry about the "garbage" shortcut.
I also think file may be red with the wrong size.
Glad you managed to reproduce !
Comment 32 Jakub Kruszona-Zawadzki 2018-08-14 11:47:29 UTC
(In reply to Ben RUBSON from comment #31)

Ok. It seems that lookup cache is on by default and it causes troubles (because timeout is not implemented).

Try to turn off lookup cache using command:

sysctl -w vfs.fuse.lookup_cache_enable=0

In my opinion, for now it should be turned off by default.

It should be turned on only after proper timeouts implementation. I'll add this change to my patch. In the meantime please check if the above command helps.
Comment 33 Jakub Kruszona-Zawadzki 2018-08-14 12:16:48 UTC
Created attachment 196192 [details]
changed filesize handling in fuse file system

Ok. I've created another patch.

1. I've removed "FN_SIZECHANGED" flag which in my opinion is not necessary.

2. I've removed "fuse_vnode_savesize" function - this caused unnecessary SETATTR to userspace - fuse filesystems have to update filesize correctly themselves - this SETATTR is not needed and even it can lead to small problems (unexpected change of mtime).

3. I've removed "fuse_vnode_refreshsize" function. Because we set correctly filesize on GETATTR and LOOKUP it seems unnecessary (I'm not sure here, but everything seems to work very well without it).

4. I've removed using cached filesize on GETATTR (fuse filesystems have to track filesize correctly themselves, so size returned by GETATTR should be always correct).

5. I've fixed previous patch (update filesize on LOOKUP)

6. I've turned off lookup cache by default (it should be turned off as long as timeouts are not taken into account)
Comment 34 Jakub Kruszona-Zawadzki 2018-08-14 12:19:03 UTC
I'm going on long vacation now, so I won't be able to contribute for the next three weeks.
Comment 35 Ben RUBSON 2018-08-14 13:46:29 UTC
Yes, turning "vfs.fuse.lookup_cache_enable" off helps, perfect.

There's still one side effect of caching I think though.

If you keep the fuse file opened, while the underlying one is modified :
- you read the fuse file up to the expected size, perfect ;
- but you read garbage (this time :).

Mount option -o direct_io helps as a workaround.
Comment 36 Conrad Meyer freebsd_committer 2018-08-14 16:47:44 UTC
Proposed patch looks like it fixes at least one issue, but there is a lot going on in our FUSE impl. that looks kind of dubious :-(.

(In reply to Jakub Kruszona-Zawadzki from comment #28)
If FUSE filesystems must track filesize themselves, I'm really not sure what the goal of our FN_SIZECHANGE flag is.  :-(

Pretty much all of this code dates to the initial import in 2012.  I guess more history may be discoverable from the old project branch (I have not looked yet, but will sometime): https://svnweb.freebsd.org/base/projects/fuse/sys/fs/fuse

Are there any sort of unit tests associated with libfuse or something I could try to run against our implementation to verify large-scale changes?

Thanks again Piotr, Jakub, and Ben for your investigation and discussion.
Comment 38 Ben RUBSON 2018-08-15 07:33:14 UTC
Conrad, I'm not sure about a test suite, but perhaps Nikolaus Rath (libfuse maintainer) who worked on libfuse3 FreeBSD support a few months ago could have some info ?
EncFS also has some integration tests which widely validate cache etc... and which fail for now (but with current patch here + direct_io workaround). If you want to give them a try and need assistance, feel free to contact me !
Comment 39 Conrad Meyer freebsd_committer 2018-08-15 16:35:28 UTC
(In reply to Ben RUBSON from comment #38)
Thanks Ben, these are good resources.  I think there are some nice cleanups we could do to our fuse but they are not especially targeted, which makes them risky.  I am hesitant to do that without some test coverage.

As discussed on this bug it seems like our internal generic attribute cache is entirely unused, except for VOP_LINK and VOP_REMOVE/VOP_RMDIR.  In VOP_LINK it is used to protect (?)naive userspace filesystems from a new link exceeded the protocol maximum).  However I don't know why kernel fuse cares at all if the link limit is exceeded.  And in REMOVE/RMDIR it is dead code!  Ugh.  While I would like to remove it as misleading and unused, it may make sense for the kernel to cache results on some local-only filesystems for improved performance.  So this may be worth fixing.

I think I understand our size caching protocol now (FN_SIZECHANGE, fvdat->filesize, etc).  When data caching is enabled (vfs.fuse.data_cache_enable), we use it to maintain our idea of cached data size when e.g. non-DIRECT writes extend the file.  I think the idea here is that we maintain dirty cached data locally that the upper filesystem does not know about.  So we override GETATTR because we may have local file extensions that have not yet been pushed to the user filesystem.  I agree that SETATTR (fuse_vnode_savesize()) appears to be the wrong way to push that information into the userspace filesystem.  We should instead just flush dirty buffers and the write path will extend the userspace file size by itself.

Maybe we need an additional flag on top of data_cache_enable for dirty_cache_enable -- it may be safe (or acceptable) to cache clean data, but not dirty data?  That would fix the append-to-two-files example without totally disabling data cache.

At the very least we should add documentation strings/comments to the various sysctls, and perhaps rename internal variables for clarity.  Maybe they should be mountpoint options eventually but for now that is not our biggest concern.

I still don't really like the cache timeout mechanism -- it is a poor substitute for real invalidation, which seems to have been added in a later FUSE protocol anyway.  But we can definitely agree on fixing the ignorance-of-zero-TTL value, and I am willing to be persuaded that we should match Linux on observing the actual timeout.
Comment 40 Conrad Meyer freebsd_committer 2018-08-15 17:39:37 UTC
Hm, seems like at least a subset of the sysctl options are already available per-mountpoint, but their use is not documented.
Comment 41 Conrad Meyer freebsd_committer 2018-08-15 17:42:39 UTC
Hm, and actually I don't believe we cache dirty data, so now I really do not understand the function of savesize.
Comment 42 jSML4ThWwBID69YC 2018-11-01 13:52:43 UTC
Does the patch from comment 33, along with direct_io resolve the issue? If so, what needs to be done to have the patch pushed in FreeBSD 12? Is there anything a non-programmer can do to help with this? Funding perhaps?
Comment 43 Conrad Meyer freebsd_committer 2018-11-01 15:19:03 UTC
(In reply to jSML4ThWwBID69YC from comment #42)
It's not directly commitable as-is, sorry.

Obviously I haven't had time to investigate this recently -- sorry about that.  12.0 timeframe is very very close, I would not expect anything to land by then.
Comment 44 jSML4ThWwBID69YC 2018-11-01 19:02:57 UTC
Hello Conrad, 

I understand, there is never enough time. Is this something that can be resolved in a patch release (11.2-p NUM), or would it have to wait for FreeBSD 13? As it stands, this bug prevents using MooseFS on FreeBSD in many work loads. 

Either way, what can be done to help?

Comment 45 jSML4ThWwBID69YC 2018-11-01 19:04:59 UTC
EDIT: I mean 12-Patch number, or 12.1, etc.
Comment 46 MooseFS FreeBSD Team 2019-02-05 10:59:00 UTC
Hello all involved in this thread,

Thank you for all your contributions done so far, we all appreciate them.

Do you think it would be possible to incorporate the patch from Jakub's comment #33 or modify it in order to make it compliant with everything (Conrad mentioned it cannot be committed "as is")?

The key thing is the patch actually solves the issue (maybe not in a very beautiful way, but it does indeed) so it would make FreeBSD better.

Any help here is appreciated. Maybe you know somebody from kernel/fs Team who could contribute some of their precious time in order to make it happen? The issue does not affect MooseFS only, but also any other file system which uses FUSE library.

We love FreeBSD and support it and its Users. We see that they have major data-consistency issues related to this issue and we would like to help them.

Many thanks for all your input!

Comment 47 Conrad Meyer freebsd_committer 2019-02-05 20:14:59 UTC
I had some work in progress patches I didn't post to phabricator for some reason.   I don't recall if they just didn't fix the symptom mentioned here or if I didn't feel satisfied with my testing of them.  I'll go ahead and put those up now, anyway.
Comment 49 Conrad Meyer freebsd_committer 2019-02-05 21:07:41 UTC
(In reply to Conrad Meyer from comment #48)
(If anyone is willing to review or test this, please feel free to add yourself to the phabricator review and leave a note.  Or comment on this bug.)
Comment 50 Conrad Meyer freebsd_committer 2019-02-06 06:14:09 UTC
(In reply to Conrad Meyer from comment #47)
Ah, they were already posted in comments #12 and #13, and don't fix the issue (noted in #14) because we still cache file *size* incorrectly.
Comment 51 Ben RUBSON 2019-02-06 15:30:17 UTC
(In reply to Piotr Robert Konopelko (MooseFS) from comment #46)

Yes patch from Jakub in comment #33 works, however there's still one issue according to comment #35.
Comment 52 jSML4ThWwBID69YC 2019-02-06 17:08:17 UTC
Is the patch in comment 33 the right one to test, or the patches in comment 48?
Comment 53 Conrad Meyer freebsd_committer 2019-02-06 17:13:11 UTC
(In reply to Ben RUBSON from comment #51)
Thanks Ben.

Ok, I think I have refreshed my memory somewhat.

This bug has focused on file size, but I think there's really two lower-level fundamental issues for distributed filesystems or filesystems like hello_ll that present the same file as two vnodes.  First:

a. FUSE expects write-through caching, i.e., no cached dirty data.  That makes a lot of sense, IMO — you still get the benefits of a clean read cache but don't have the consistency problems of eventual flush.

Wrong file size is just a side effect of the dirty data cache.  Removing it (or disabling it by default) solves one problem, but not other data consistency issues.


b. FreeBSD tries to speed up local FUSE filesystems by putting a buf data cache in front of FUSE, to avoid some upcalls into userspace for every IO operation.  The cache's lifetime isn't tied to open fds.

c. FUSE filesystems don't have a way to invalidate FreeBSD's cache.

What is protecting the integrity of the data?  FUSE read operation doesn't set some cache timeout, right?

It seems the answer is that FUSE kernel implementations are allowed to cache for performance reasons (and Linux does), but the FUSE *open* operation has a keep_cache flag that tells consumers to throw away the cached data.  Actually, it seems we (attempt to) honor this flag.

So why is the data still cached?  Maybe we aren't sending new opens all the way into FUSE?  Maybe the flag has the wrong bit representation?

I found this thread, which discusses similar issues for a time-traveling filesystem, very helpful: https://sourceforge.net/p/fuse/mailman/message/8902254/

Plan of action:

1. Disable dangerous dirty cache by default (but leave it around, in case it is useful for someone)

2. Investigate why data (incl. size) is cached in spite of keep_cache=0 in hello_ll.
Comment 54 Conrad Meyer freebsd_committer 2019-02-06 17:13:48 UTC
(In reply to jSML4ThWwBID69YC from comment #52)
I think both have known deficiencies; no need to test either at this time.
Comment 55 Conrad Meyer freebsd_committer 2019-02-07 03:31:19 UTC
(In reply to Piotr Robert Konopelko (MooseFS) from comment #0)
With vfs.fuse.refresh_size=1 (and with my two existing patches), it seems like this produces the correct result.  I don't think it indicates we have achieved perfect cache coherency, but it's a datapoint.
Comment 56 jSML4ThWwBID69YC 2019-02-10 20:06:42 UTC
(In reply to Conrad Meyer from comment #55)

It seems to work here on a single server. At least I've not been able to produce any errors while running those patches and having vfs.fuse.refresh_size=1 set. Testing was done on FreeBSD 12p3.

I should be able to do multi client testing in a couple of weeks.
Comment 57 Piotr Robert Konopelko (MooseFS) 2019-02-11 13:23:12 UTC
Hi all,
many thanks in such a positive and dynamic feedback!

@Jakub – would you have a moment to take a look at recent progress on this issue and give your valuable input / comment on this from MooseFS perspective? Do you think we would be able to test it sometime in coming weeks (I mean Conrad's patches from comment #48 with 'vfs.fuse.refresh_size=1')?

Comment 58 Conrad Meyer freebsd_committer 2019-02-11 17:24:03 UTC
(In reply to Piotr Robert Konopelko (MooseFS) from comment #57)
No need for that; #48 is incomplete.  I have some other patches in progress.
Comment 59 Piotr Robert Konopelko (MooseFS) 2019-02-11 17:25:54 UTC
(In reply to Conrad Meyer from comment #58)

Sure, thanks for the update!

All the best,
Comment 60 Conrad Meyer freebsd_committer 2019-02-12 09:34:46 UTC
Ok, there were a few confusing things to tease apart.

1. The buf cache in front of FUSE can easily be made a write-around/write-through (clean-only cache, which versions of FUSE prior to 7.23 expect).  I've got a patch to add this option and make it the default.  Note that this kind of already happens in the dueling append case, as bash opens '>>' redirects O_WRONLY|O_APPEND.

2. I agree fuse_vnode_savesize() is basically harmful, and fuse_vnode_refreshsize() basically unnecessary.  Both should be removed, but are not the cause of this bug.  I don't have these patches prepared yet.

3. Setsize and cached filesize is needed to maintain even a clean-only cache.  FN_SIZECHANGE is only used for dirty data caching and specifically handling VOP_GETATTR with dirty append data.  The scope it is set can be reduced somewhat and I have a patch to do this.

4. Dueling appends were observing the correct new size in LOOKUP (attrs returned with successful LOOKUP) but ignored that stat information (other than to cache it in the unused fvdat->cached_attrs, if the attr_valid time was non-zero).  In particular, for the dueling append test code, attr_valid was zero and the stats were entirely discarded.

By the time the appends got to OPEN, they observed !KEEP_CACHE and (appropriately) destroyed the cache contents, but (!!!) this does not reset or refresh the cache size from FUSE.  So the fvdat->filesize remained stale and the next IO_APPEND write simply overwrote the prior append.

(FreeBSD performs the FUSE_LOOKUP, FUSE_OPEN, and FUSE_WRITE in separate Vnode operations ("VOPs") from the core kernel, and it wasn't entirely clear to me what should happen where.)

I'll have a small patch for (4) to try shortly.  It works in my testing.  I can't promise it applies cleanly without the rest of the series, but I'll provide all of the patches.
Comment 61 Conrad Meyer freebsd_committer 2019-02-12 09:46:54 UTC
Here is the series:

1. https://reviews.freebsd.org/D19088 (add binary attr_valid cache, no timeouts yet)
2. https://reviews.freebsd.org/D19089 (add binary dentry valid cache, no timeouts yet)
3. https://reviews.freebsd.org/D19161 (reduce scope of FN_SIZECHANGED)
4. https://reviews.freebsd.org/D19162 (add clean-only cache mode and default to it)
5. https://reviews.freebsd.org/D19163 (refresh fvdat->filesize in LOOKUP, very nearly the same thing as Jakub's patch from comment #16 or 17)

With #5 in particular, the dueling appends on hello1/hello2 "do the right thing."  I think some of the other issues brought up in this bug with various patch incarnations are resolved by the series.  There may be remaining issues.

When testing, please remember to remove/disable any vfs.fuse sysctl.conf knobs you might have enabled to work issues in the past, e.g., vfs.fuse.data_cache_enable, vfs.fuse.data_cache_invalidate, vfs.fuse.refresh_size, and vfs.fuse.sync_resize should be left default.

Comment 62 Conrad Meyer freebsd_committer 2019-02-12 09:50:35 UTC
(Actual time-based cache timeout is a known TODO.  Removing some of these hacky workarounds for WB caching and some stylistic cleanups are other known TODOs.  What else?)
Comment 63 commit-hook freebsd_committer 2019-02-15 22:49:33 UTC
A commit references this bug:

Author: cem
Date: Fri Feb 15 22:49:16 UTC 2019
New revision: 344183
URL: https://svnweb.freebsd.org/changeset/base/344183

  FUSE: Respect userspace FS "do-not-cache" of file attributes

  The FUSE protocol demands that kernel implementations cache user filesystem
  file attributes (vattr data) for a maximum period of time in the range of
  [0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or 10
  seconds; or "a long time" to represent indefinite caching.

  Historically, FreeBSD FUSE has ignored this client directive entirely.  This
  works fine for local-only filesystems, but causes consistency issues with
  multi-writer network filesystems.

  For now, respect 0 second cache TTLs and do not cache such metadata.
  Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
  are still cached indefinitely, because it is unclear how a userspace
  filesystem could do anything sensible with those semantics even if

  In the future, as an optimization, we should implement notify_inval_entry,
  etc, which provide userspace filesystems a way of evicting the kernel cache.

  One potentially bogus access to invalid cached attribute data was left in
  fuse_io_strategy.  It is restricted behind the undocumented and non-default
  "vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
  deadcode and can be eliminated?

  Some minor APIs changed to facilitate this:

  1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").

  2. cache_attrs() respects the provided TTL and only caches in the FUSE
  inode if TTL > 0.  It also grows an "out" argument, which, if non-NULL,
  stores the translated fuse_attr (even if not suitable for caching).

  3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
  avoid programming mistakes.

  4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
  link op was weakened (only performed when we have a valid attr cache).  The
  check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
  We have to trust any userspace filesystem that rejects local caching to
  account for it correctly.

  PR:		230258 (inspired by; does not fix)

Comment 64 commit-hook freebsd_committer 2019-02-15 22:50:38 UTC
A commit references this bug:

Author: cem
Date: Fri Feb 15 22:50:32 UTC 2019
New revision: 344184
URL: https://svnweb.freebsd.org/changeset/base/344184

  FUSE: Respect userspace FS "do-not-cache" of path components

  The FUSE protocol demands that kernel implementations cache user filesystem
  path components (lookup/cnp data) for a maximum period of time in the range
  of [0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or
  10 seconds; or "a long time" to represent indefinite caching.

  Historically, FreeBSD FUSE has ignored this client directive entirely.  This
  works fine for local-only filesystems, but causes consistency issues with
  multi-writer network filesystems.

  For now, respect 0 second cache TTLs and do not cache such metadata.
  Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
  are still cached indefinitely, because it is unclear how a userspace
  filesystem could do anything sensible with those semantics even if

  Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
  if the user filesystem did not set a zero second TTL.

  PR:		230258 (inspired by; does not fix)

Comment 65 commit-hook freebsd_committer 2019-02-15 22:51:41 UTC
A commit references this bug:

Author: cem
Date: Fri Feb 15 22:51:10 UTC 2019
New revision: 344185
URL: https://svnweb.freebsd.org/changeset/base/344185

  FUSE: Only "dirty" cached file size when data is dirty

  Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
  the buf cache bounds as a result of either a read from the underlying FUSE
  filesystem, or as part of a write-through type operation (like truncate =>
  VOP_SETATTR).  In these cases, do not set the FN_SIZECHANGE flag, which
  indicates that an inode's data is dirty (in particular, that the local buf
  cache and fvdat->filesize have dirty extended data).

  PR:		230258 (related)

Comment 66 commit-hook freebsd_committer 2019-02-15 22:53:45 UTC
A commit references this bug:

Author: cem
Date: Fri Feb 15 22:52:50 UTC 2019
New revision: 344186
URL: https://svnweb.freebsd.org/changeset/base/344186

  FUSE: The FUSE design expects writethrough caching

  At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
  specifies only clean data to be cached.

  Prior to this change, we implement and default to writeback caching.  This
  is ok enough for local only filesystems without hardlinks, but violates the
  general design contract with FUSE and breaks distributed filesystems or
  concurrent access to hardlinks of the same inode.

  In this change, add cache mode as an extension of cache enable/disable.  The
  new modes are UC (was: cache disabled), WT (default), and WB (was: cache

  For now, WT caching is implemented as write-around, which meets the goal of
  only caching clean data.  WT can be better than WA for workloads that
  frequently read data that was recently written, but WA is trivial to
  implement.  Note that this has no effect on O_WRONLY-opened files, which
  were already coerced to write-around.

    * https://sourceforge.net/p/fuse/mailman/message/8902254/
    * https://github.com/vgough/encfs/issues/315

  PR:		230258 (inspired by)

Comment 67 commit-hook freebsd_committer 2019-02-15 22:55:50 UTC
A commit references this bug:

Author: cem
Date: Fri Feb 15 22:55:13 UTC 2019
New revision: 344187
URL: https://svnweb.freebsd.org/changeset/base/344187

  FUSE: Refresh cached file size when it changes (lookup)

  The cached fvdat->filesize is indepedent of the (mostly unused)
  cached_attrs, and we failed to update it when a cached (but perhaps
  inactive) vnode was found during VOP_LOOKUP to have a different size than

  As noted in the code comment, this can occur in distributed filesystems or
  with other kinds of irregular file behavior (anything is possible in FUSE).

  We do something similar in fuse_vnop_getattr already.

  PR:		230258 (as reported in description; other issues explored in
  			comments are not all resolved)
  Reported by:	MooseFS FreeBSD Team <freebsd AT moosefs.com>
  Submitted by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)

Comment 68 Conrad Meyer freebsd_committer 2019-02-15 23:58:05 UTC
I'm going to mark this bug as closed, and then clone off related bugs for the additional known issues identified in comments on this bug.  We want to continue to track issues, but I think this PR is "long enough" and we can track other outstanding issues more clearly in new bugs.  I hope that's ok.
Comment 69 Piotr Robert Konopelko (MooseFS) 2019-02-16 01:49:21 UTC
(In reply to Conrad Meyer from comment #68)

Thanks, Conrad! We unfortunately couldn't test it in past few days, but I hope it will be possible soon.

Also adding corresponding MooseFS GitHub issue for reference:

We appreciate your work and time spent on this, thanks again!

Comment 70 Conrad Meyer freebsd_committer 2019-02-16 03:45:40 UTC
(In reply to Piotr Robert Konopelko (MooseFS) from comment #69)
No worries.  When you get a chance, please try it and let us know your results.  Thanks!