Currently when you open a fuse-backed file with O_APPEND, the kernel hides the O_APPEND flag from the fuse filesystem and computes write offsets in fuse_write_directbackend. However, that behavior doesn't work for networked fuse filesystems. The problem is that two clients both writing to the same file with O_APPEND wouldn't cooperate to update the file size. Or to put it another way, implementing O_APPEND in-kernel requires cacheing the file size attribute, and cacheing attributes is not generally allowed for networked file systems. The libfuse documentation makes it explicit that filesystems must handle the O_APPEND flag as of protocol 7.23. Implicitly, filesystems should handle it on older protocols as well. There are two possible approaches we could take: 1) Pass O_APPEND to FUSE_OPEN and set the offset of all FUSE_WRITE operations to 0 2) Pass O_APPEND to FUSE_OPEN but still compute the offset of all FUSE_WRITE operations as we do currently, expecting that compliant fuse filesystems will ignore it. I don't know how many fuse filesystems correctly handle O_APPEND, but I bet it's less than all of them.
Actually, the problem is wider than just O_APPEND. O_TRUNC must also be passed through to the daemon, according to the libfuse docs. The docs don't state exactly which flags should be passed through and which shouldn't, but for maximum compatibility we should probably do what Linux does. If we're going to pass O_APPEND and O_TRUNC, then it would be easy to pass them all. From experiment Linux 4.9.0's behavior is: flag create open ==== ====== ==== O_CREAT yes no O_EXCL yes no O_NOCTTY no no O_TRUNC yes yes O_APPEND yes yes O_NONBLOCK yes yes O_SYNC yes yes O_ASYNC yes yes O_LARGEFILE yes, even if I don't ask for it! O_DIRECTORY N/A N/A. open is translated to OPENDIR O_NOFOLLOW yes yes O_CLOEXEC no no O_DIRECT yes yes O_NOATIME yes yes O_PATH N/A N/A doesn't actually open anything O_DSYNC yes yes O_TMPFILE N/A N/A includes O_DIRECTORY O_EXEC Not implemented on Linux
O_LARGEFILE is just Linux for 'off_t is 64-bits'. Handling double O_APPENDing clients (without reopen) is difficult, and there's plenty of low-hanging fruit. I would prioritize other bugs over this one.
I guess if filesystems actually implement O_APPEND correctly and we can always pass 0 (and use direct writes), that's not too difficult. But any time we have to emulate it, even if we refresh filesize every operation, there's potentially a race with a remote client.
The way I see it is like this: * If the filesystem isn't networked and we handle O_APPEND in the kernel, then we're fine whether or not the filesystem understands O_APPEND. * If the filesystem is networked, understands O_APPEND, and we pass O_APPEND to FUSE_OPEN, then we're fine whether or not we also handle O_APPEND in the kernel. * If the filesystem is networked and doesn't understand O_APPEND, then we're screwed no matter what we do So we should pass through O_APPEND, but keep handling it in the kernel, too.
(In reply to Alan Somers from comment #4) > * If the filesystem is networked, understands O_APPEND, and we pass O_APPEND > to FUSE_OPEN, then we're fine whether or not we also handle O_APPEND in the > kernel. As in, O_APPEND-grokking filesystems are expected to ignore any given write offset when in append mode? Or just 0? We definitely can't send O_APPEND writes through the buf cache given we don't know what offset they'll end up at. > * If the filesystem is networked and doesn't understand O_APPEND, then we're > screwed no matter what we do Right. We could very pessimistically getattr before every APPEND write to determine the current filesize, but we wouldn't want to do this for non-broken filesystems. > So we should pass through O_APPEND, but keep handling it in the kernel, too. Ok.
(In reply to Conrad Meyer from comment #5) > As in, O_APPEND-grokking filesystems are expected to ignore any given write > offset when in append mode? Or just 0? I assume that an O_APPEND-grokking filesystem would ignore the write offset. > We definitely can't send O_APPEND writes through the buf cache given we don't > know what offset they'll end up at. Right. libfuse addresses this point. A later protocol revision gives the filesystem the ability to request for the kernel to cache writes. If the filesystem is networked, it probably won't set that. The filesystem can also disable the cache on a per-file basis by setting the FOPEN_DIRECT_IO flag in the FUSE_OPEN response. > Right. We could very pessimistically getattr before every APPEND write to > determine the current filesize, but we wouldn't want to do this for non-broken > filesystems. > Ok.
(In reply to Alan Somers from comment #6) > (In reply to Conrad Meyer from comment #5) > > We definitely can't send O_APPEND writes through the buf cache given we don't > > know what offset they'll end up at. > > Right. libfuse addresses this point. A later protocol revision gives the > filesystem the ability to request for the kernel to cache writes. If the > filesystem is networked, it probably won't set that. That's for caching dirty data; 7.8 FUSE implementations are allowed to cache clean data, we just can't do it for O_APPEND mode because we don't know the right offset to cache it at. > The filesystem can > also disable the cache on a per-file basis by setting the FOPEN_DIRECT_IO > flag in the FUSE_OPEN response. Right. It can also notify the kernel of remote modification (on a new open()) via FOPEN_KEEP_CACHE (or its opposite). Most importantly, later versions of the fuse protocol gain cache eviction callbacks, which we sorely need for accurate caching.
A commit references this bug: Author: asomers Date: Sun Mar 31 03:19:11 UTC 2019 New revision: 345742 URL: https://svnweb.freebsd.org/changeset/base/345742 Log: fusefs: replace the fufh table with a linked list The FUSE protocol allows each open file descriptor to have a unique file handle. On FreeBSD, these file handles must all be stored in the vnode. The old method (also used by OSX and OpenBSD) is to store them all in a small array. But that limits the total number that can be stored. This commit replaces the array with a linked list (a technique also used by Illumos). There is not yet any change in functionality, but this is the first step to fixing several bugs. PR: 236329, 236340, 236381, 236560, 236844 Discussed with: cem Sponsored by: The FreeBSD Foundation Changes: projects/fuse2/sys/fs/fuse/fuse_file.c projects/fuse2/sys/fs/fuse/fuse_file.h projects/fuse2/sys/fs/fuse/fuse_internal.c projects/fuse2/sys/fs/fuse/fuse_node.c projects/fuse2/sys/fs/fuse/fuse_node.h projects/fuse2/sys/fs/fuse/fuse_vnops.c
Sigh, the problem is more complicated than it first sounded. Firstly, VOP_CREATE doesn't get the open(2) flags. So even though we can pass O_APPEND if it comes from open(..., O_RDWR | O_APPEND), we can't if it comes from open(..., O_RDWR | O_APPEND | O_CREAT). Secondly, since we can't store a file handle in a struct file and expect it to be available when we need it (since the VOPs don't get a struct file* argument, and many operations bypass the fileops layer entirely) we must frequently guess what file handle to use. Opening a separate file handle for O_RDWR vs O_RDWR | O_APPEND just makes the guessing process harder. Furthermore, it's kind of dumb for the FUSE protocol to support O_APPEND in FUSE_OPEN and FUSE_CREATE, even though there's no support for getting, setting, or clearing that flag with fcntl. It's kind of a half-baked feature. There are _some_ file systems that care about O_APPEND. fusefs-sshfs and fusefs-ifuse for two. But I'm wondering if it's worth doing if we can't do it right.
A commit references this bug: Author: asomers Date: Tue Apr 2 18:09:41 UTC 2019 New revision: 345808 URL: https://svnweb.freebsd.org/changeset/base/345808 Log: fusefs: cleanup and refactor some recent commits This commit cleans up after recent commits, especially 345766, 345768, and 345781. There is no functional change. The most important change is to add comments documenting why we can't send flags like O_APPEND in FUSE_WRITE_OPEN. PR: 236340 Sponsored by: The FreeBSD Foundation Changes: projects/fuse2/sys/fs/fuse/fuse_file.c projects/fuse2/sys/fs/fuse/fuse_file.h projects/fuse2/sys/fs/fuse/fuse_vnops.c
I'm going to close this. Given the design of our VFS, we just can't support this (neither does OSX, NetBSD, OpenBSD, or Illumos). Passing all of the open flags to fuse operations is a product of Linux's design that doesn't really have a vnode layer.