Bug 236340 - [FUSEFS] fuse(4) should pass through most open flags to the filesystem daemon
Summary: [FUSEFS] fuse(4) should pass through most open flags to the filesystem daemon
Status: Closed Not Accepted
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Many People
Assignee: Alan Somers
URL:
Keywords:
Depends on: 236844
Blocks: 236329
  Show dependency treegraph
 
Reported: 2019-03-06 23:07 UTC by Alan Somers
Modified: 2019-06-29 03:06 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Somers freebsd_committer freebsd_triage 2019-03-06 23:07:50 UTC
Currently when you open a fuse-backed file with O_APPEND, the kernel hides the O_APPEND flag from the fuse filesystem and computes write offsets in fuse_write_directbackend.  However, that behavior doesn't work for networked fuse filesystems.  The problem is that two clients both writing to the same file with O_APPEND wouldn't cooperate to update the file size.  Or to put it another way, implementing O_APPEND in-kernel requires cacheing the file size attribute, and cacheing attributes is not generally allowed for networked file systems.

The libfuse documentation makes it explicit that filesystems must handle the O_APPEND flag as of protocol 7.23.  Implicitly, filesystems should handle it on older protocols as well.

There are two possible approaches we could take:
1) Pass O_APPEND to FUSE_OPEN and set the offset of all FUSE_WRITE operations to 0
2) Pass O_APPEND to FUSE_OPEN but still compute the offset of all FUSE_WRITE operations as we do currently, expecting that compliant fuse filesystems will ignore it.

I don't know how many fuse filesystems correctly handle O_APPEND, but I bet it's less than all of them.
Comment 1 Alan Somers freebsd_committer freebsd_triage 2019-03-07 00:18:56 UTC
Actually, the problem is wider than just O_APPEND.  O_TRUNC must also be passed through to the daemon, according to the libfuse docs.  The docs don't state exactly which flags should be passed through and which shouldn't, but for maximum compatibility we should probably do what Linux does.  If we're going to pass O_APPEND and O_TRUNC, then it would be easy to pass them all.  From experiment Linux 4.9.0's behavior is:

flag		create	open
====		======	====
O_CREAT		yes	no
O_EXCL		yes	no
O_NOCTTY	no	no
O_TRUNC		yes	yes
O_APPEND	yes	yes
O_NONBLOCK	yes	yes
O_SYNC	yes	yes
O_ASYNC	yes	yes
O_LARGEFILE	yes, even if I don't ask for it!
O_DIRECTORY	N/A	N/A.  open is translated to OPENDIR
O_NOFOLLOW	yes	yes
O_CLOEXEC	no	no
O_DIRECT	yes	yes
O_NOATIME	yes	yes
O_PATH	N/A	N/A doesn't actually open anything
O_DSYNC	yes	yes
O_TMPFILE	N/A	N/A includes O_DIRECTORY
O_EXEC		Not implemented on Linux
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2019-03-07 00:35:47 UTC
O_LARGEFILE is just Linux for 'off_t is 64-bits'.  Handling double O_APPENDing clients (without reopen) is difficult, and there's plenty of low-hanging fruit.  I would prioritize other bugs over this one.
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2019-03-07 00:41:05 UTC
I guess if filesystems actually implement O_APPEND correctly and we can always pass 0 (and use direct writes), that's not too difficult.  But any time we have to emulate it, even if we refresh filesize every operation, there's potentially a race with a remote client.
Comment 4 Alan Somers freebsd_committer freebsd_triage 2019-03-07 00:48:42 UTC
The way I see it is like this:

* If the filesystem isn't networked and we handle O_APPEND in the kernel, then we're fine whether or not the filesystem understands O_APPEND.
* If the filesystem is networked, understands O_APPEND, and we pass O_APPEND to FUSE_OPEN, then we're fine whether or not we also handle O_APPEND in the kernel.
* If the filesystem is networked and doesn't understand O_APPEND, then we're screwed no matter what we do

So we should pass through O_APPEND, but keep handling it in the kernel, too.
Comment 5 Conrad Meyer freebsd_committer freebsd_triage 2019-03-07 01:14:46 UTC
(In reply to Alan Somers from comment #4)
> * If the filesystem is networked, understands O_APPEND, and we pass O_APPEND
> to FUSE_OPEN, then we're fine whether or not we also handle O_APPEND in the
> kernel.

As in, O_APPEND-grokking filesystems are expected to ignore any given write offset when in append mode?  Or just 0?

We definitely can't send O_APPEND writes through the buf cache given we don't know what offset they'll end up at.

> * If the filesystem is networked and doesn't understand O_APPEND, then we're
> screwed no matter what we do

Right.  We could very pessimistically getattr before every APPEND write to determine the current filesize, but we wouldn't want to do this for non-broken filesystems.

> So we should pass through O_APPEND, but keep handling it in the kernel, too.

Ok.
Comment 6 Alan Somers freebsd_committer freebsd_triage 2019-03-07 01:46:00 UTC
(In reply to Conrad Meyer from comment #5)
> As in, O_APPEND-grokking filesystems are expected to ignore any given write
> offset when in append mode?  Or just 0?

I assume that an O_APPEND-grokking filesystem would ignore the write offset.

> We definitely can't send O_APPEND writes through the buf cache given we don't
> know what offset they'll end up at.

Right.  libfuse addresses this point.  A later protocol revision gives the filesystem the ability to request for the kernel to cache writes.  If the filesystem is networked, it probably won't set that.  The filesystem can also disable the cache on a per-file basis by setting the FOPEN_DIRECT_IO flag in the FUSE_OPEN response.

> Right.  We could very pessimistically getattr before every APPEND write to
> determine the current filesize, but we wouldn't want to do this for non-broken
> filesystems.

> Ok.
Comment 7 Conrad Meyer freebsd_committer freebsd_triage 2019-03-07 01:54:28 UTC
(In reply to Alan Somers from comment #6)
> (In reply to Conrad Meyer from comment #5)
> > We definitely can't send O_APPEND writes through the buf cache given we don't
> > know what offset they'll end up at.
> 
> Right.  libfuse addresses this point.  A later protocol revision gives the
> filesystem the ability to request for the kernel to cache writes.  If the
> filesystem is networked, it probably won't set that.

That's for caching dirty data; 7.8 FUSE implementations are allowed to cache clean data, we just can't do it for O_APPEND mode because we don't know the right offset to cache it at.

> The filesystem can
> also disable the cache on a per-file basis by setting the FOPEN_DIRECT_IO
> flag in the FUSE_OPEN response.

Right.  It can also notify the kernel of remote modification (on a new open()) via FOPEN_KEEP_CACHE (or its opposite).

Most importantly, later versions of the fuse protocol gain cache eviction callbacks, which we sorely need for accurate caching.
Comment 8 commit-hook freebsd_committer freebsd_triage 2019-03-31 03:20:02 UTC
A commit references this bug:

Author: asomers
Date: Sun Mar 31 03:19:11 UTC 2019
New revision: 345742
URL: https://svnweb.freebsd.org/changeset/base/345742

Log:
  fusefs: replace the fufh table with a linked list

  The FUSE protocol allows each open file descriptor to have a unique file
  handle.  On FreeBSD, these file handles must all be stored in the vnode.
  The old method (also used by OSX and OpenBSD) is to store them all in a
  small array.  But that limits the total number that can be stored.  This
  commit replaces the array with a linked list (a technique also used by
  Illumos).  There is not yet any change in functionality, but this is the
  first step to fixing several bugs.

  PR:		236329, 236340, 236381, 236560, 236844
  Discussed with:	cem
  Sponsored by:	The FreeBSD Foundation

Changes:
  projects/fuse2/sys/fs/fuse/fuse_file.c
  projects/fuse2/sys/fs/fuse/fuse_file.h
  projects/fuse2/sys/fs/fuse/fuse_internal.c
  projects/fuse2/sys/fs/fuse/fuse_node.c
  projects/fuse2/sys/fs/fuse/fuse_node.h
  projects/fuse2/sys/fs/fuse/fuse_vnops.c
Comment 9 Alan Somers freebsd_committer freebsd_triage 2019-03-31 23:31:47 UTC
Sigh, the problem is more complicated than it first sounded.  Firstly, VOP_CREATE doesn't get the open(2) flags.  So even though we can pass O_APPEND if it comes from open(..., O_RDWR | O_APPEND), we can't if it comes from open(..., O_RDWR | O_APPEND | O_CREAT).  Secondly, since we can't store a file handle in a struct file and expect it to be available when we need it (since the VOPs don't get a struct file* argument, and many operations bypass the fileops layer entirely) we must frequently guess what file handle to use.  Opening a separate file handle for O_RDWR vs O_RDWR | O_APPEND just makes the guessing process harder.

Furthermore, it's kind of dumb for the FUSE protocol to support O_APPEND in FUSE_OPEN and FUSE_CREATE, even though there's no support for getting, setting, or clearing that flag with fcntl.  It's kind of a half-baked feature.

There are _some_ file systems that care about O_APPEND.  fusefs-sshfs and fusefs-ifuse for two.  But I'm wondering if it's worth doing if we can't do it right.
Comment 10 commit-hook freebsd_committer freebsd_triage 2019-04-02 18:10:31 UTC
A commit references this bug:

Author: asomers
Date: Tue Apr  2 18:09:41 UTC 2019
New revision: 345808
URL: https://svnweb.freebsd.org/changeset/base/345808

Log:
  fusefs: cleanup and refactor some recent commits

  This commit cleans up after recent commits, especially 345766, 345768, and
  345781.  There is no functional change.  The most important change is to add
  comments documenting why we can't send flags like O_APPEND in
  FUSE_WRITE_OPEN.

  PR:		236340
  Sponsored by:	The FreeBSD Foundation

Changes:
  projects/fuse2/sys/fs/fuse/fuse_file.c
  projects/fuse2/sys/fs/fuse/fuse_file.h
  projects/fuse2/sys/fs/fuse/fuse_vnops.c
Comment 11 Alan Somers freebsd_committer freebsd_triage 2019-06-29 03:06:24 UTC
I'm going to close this.  Given the design of our VFS, we just can't support this (neither does OSX, NetBSD, OpenBSD, or Illumos).  Passing all of the open flags to fuse operations is a product of Linux's design that doesn't really have a vnode layer.