This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All four issues can be replicated using this simple test script:
When an I/O operation (for example read) takes a bit long and the process that performed the read operation is killed (i.e. with 9 signal), kernel does not wait for the operation to complete but immediately sends release. Since FUSE is multithreaded, one's program can get the release before the read (race condition), effectively ending up reading data from an already closed descriptor into no-longer-existing structures.
We handled that issue in our system by making reference counters and delaying release (lots of FreeBSD specific code), but since FUSE behaviour on Linux is completely different and FUSE was first developed on Linux, we feel some consitency is warranted.
Peter / MooseFS Team
I haven't gotten to this one yet but I have my eye on it.
I'm not sure I understand this scenario. FreeBSD VFS layer requires some vnode lock over read VOP and exclusive vnode lock over inactive (or reclaim).
I guess read IPC upcall to userspace daemon is subject to both timeout (maybe what you mean by "take a bit long?") and interrupt by signal (PCATCH):
fuse_read_directbackend -> fdisp_wait_answ -> fticket_wait_answer -> msleep(, PCATCH, ..., daemon_timeout * hz)
Still, FreeBSD FUSE IPC seems to be serialized on the kernel side through a single IPC queue. fuse_lck_mtx_lock, fuse_ms_push, etc. I don't understand how the READ would be reordered with RELEASE. If you can provide more detail, that would be appreciated.
Even if RELEASE is sent just after READ then on the user space there are separate threads that handle those requests. Thread which handles RELEASE can be "faster" than thread which handles READ.
On Linux RELEASE is just sent only after all I/O ops returned with their statuses. This also means that process that is in the middle of I/O and got terminal signal waits for end of all I/O.
In attached program you can check it. We have very slow READ (using simple sleep). Then process that is in then middle of this 10s-long READ is killed. On FreeBSD process is terminated instantly (without waiting) and RELEASE is send in the middle of reading, on Linux process waits 10s (end of long read) and only then is finished, and only then RELEASE is sent to the filesystem.
(In reply to Jakub Kruszona-Zawadzki from comment #3)
Hm. I'm not sure if there is a great way to do this.
And what's the point? If we have sent RELEASE, the read is not going to be used anyway.
It seems like userspace can relatively easily force strict ordering of operations on the same filehandle (fixed number of worker threads + hash based dispatch), or at least serialize releases with other IO (ref count solution you have now).
In contrast a kernel approach has to sort of bend over backwards to fix libfuse's naive threading.
We could refcount FUSE filehandles in the kernel and instead of immediate RELEASE / RELEASEDIR / FORGET, tag the internal inode with a deferred release/forget flag. When outstanding IOs are completed, the handler decrements the reference count and checks for these flags when it drops to zero. Sounds like a pain :-).
(In reply to Conrad Meyer from comment #4)
> And what's the point? If we have sent RELEASE, the read is not going to be used anyway.
Yes. The only problem is that on the client side you usually use some data structures that you create on OPEN/CREATE and free on RELEASE. READ/WRITE/FSYNC are ops that use those data structures, so they can't be freed in the middle of operation. Formally there is INTERRUPT operation that should be send in such case to inform "long" I/O that the result is no longer needed. Problem number one - INTERRUPT is optional (and rarely implemented), problem number two - even with INTERRUPT kernel still should send RELEASE only after succesful interruption.
> It seems like userspace can relatively easily force strict ordering of operations on the same filehandle (fixed number of worker threads + hash based dispatch), or at least serialize releases with other IO (ref count solution you have now).
Because we use libfuse (external code for you and me) we can't force libfuse to do the job for us. Libfuse has dynamic pool of worker threads, so all ops from the kernel can be reordered. Also usually we want to dispatch simultaneously I/O on the same descriptor (efficiency).
Solution with ref count and delayed RELEASE is of course good enough. This is absolutelly not a problem for me - I've mentioned it only because other fuse-based filesystems can have the same problem and we all want to have better FreeBSD :-)
> We could refcount FUSE filehandles in the kernel and instead of immediate RELEASE / RELEASEDIR / FORGET, tag the internal inode with a deferred release/forget flag. When outstanding IOs are completed, the handler decrements the reference count and checks for these flags when it drops to zero. Sounds like a pain :-).
Exactly :-) Such approach should work similar to Linux version.
I don't think the "ref count and delayed RELEASE" solution is ok. The problem is that a misbehaving daemon might never respond to a READ or WRITE request. If the kernel waits for all I/O to complete before RELEASEing a file handle, then a misbehaving daemon could DOS the kernel. That's not ok. I think the best we can do would be to send an INTERRUPT to all outstanding I/O operations before sending RELEASE. File systems that don't implement INTERRUPT would have to reference-count their structures, just as you do.
I am very happy that recent changes in FUSEFS solved most of the issues we have discovered some time ago. As of now we removed most of workaround in MooseFS code since they are not necessary anymore (BTW sent a PR 243376 in order to update ports to 3.0.109). I appreciate your engagement and thank you for a precious work you have done on it so far. Just wanted to ask if there is any chance to proceed with this, last issue?
Piotr / MooseFS Team
As I understand it:
The original scenario is:
* Userspace process P is waiting for a response to some IO, such as read(2), via fuse(4) -> libfuse -> some multithreaded filesystem process F.
* Then, for any reason, P is signaled, interrupting the IO†, and that signal causes P to exit (e.g., SIGTERM).
* Because P exits, internal to FreeBSD, the references on the vnodes it has opened are dropped. As a side effect, if they drop to zero, fuse(4) in FreeBSD sends FUSE_RELEASE on the no-longer referenced vnodes.
* Because F is multithreaded, it may process RELEASE concurrently with the interrupted IO.
[†]: If F supports it, fuse(4) sends FUSE_INTERRUPT here! And as long as the signal was not fatal, not-blockable, SIGKILL, fuse(4) waits for it to complete before returning to userspace and thus exiting the process.
The problem statement:
* F can't queue-steer same-file requests to the same queue, which would prevent reordering problems because libfuse actually doesn't allow this decision (per comment #5).
* F can't or doesn't want to manage file lifetimes itself? It's less clear to me why it is valid for a multithreaded FUSE filesystem not to implement its own file coherency scheme between threads.
How do you manage concurrent attempts to modify the same data range in a file? Do you not have byte-range locking anyway?
You mentioned that: "On Linux ... that process [P] that is in the middle of I/O and got terminal signal _waits for end of all I/O_." As Alan mentioned in comment #6, that is unacceptable for FreeBSD. I'm not sure why it's even acceptable for Linux; it seems like a terrible bug.
* F chooses not to implement FUSE_INTERRUPT, or is concerned about the possible race in handling FUSE_INTERRUPT vs subsequent FUSE_RELEASE on SIGKILL.
* (FUSE Filesystem side) Implement FUSE_INTERRUPT, if you do not already. (This is only a partial solution, as it does not cover SIGKILL.)
* (FreeBSD side) For SIGKILL:
- In fticket_wait_answer(), in this case, set a special flag on the fuse_ticket, such as FT_KILLED;
- In fdisp_wait_answ() (the only caller), IF fticket_wait_answer errors AND has the fuse_ticket::tk_flag FT_KILLED set AND a fuse_filehandle was associated with the upcall,†† THEN set a similar flag on the fuse_filehandle (e.g., FUFH_KILLED) and stash the ticket's 'unique' value somewhere associated with the fuse_filehandle::'fh_id' (such as in a global 'deferred_release' queue on the mountpoint).†††
- In fuse_filehandle_close(), if FUFH_KILLED is set, don't send FUSE_RELEASE.
- In fuse_device_write(), if a response is received from the FUSE filesystem where the 'ohead.unique' is not in the pending list, nor the special zero value, consult the mountpoint list to see if there is a (fh_id, unique) pair that matches this response††††. If there is, remove the association from the global queue and submit a FUSE_RELEASE for the fh_id.
- fuse_vnop_inactive / fuse_vnop_reclaim may need to be modified slightly to skip other processing on FUFH_KILLED filehandles.
[††]: This would require modifying the fdisp_wait_answ() API to take a fufh parameter (NULL for operations not associated with filehandles). This would require refactoring in many callers, but it is not intractable.
[†††]: Yes, this is potentially a small leak of kernel resources after the associated process exits. We can mitigate that in a variety of easy ways. One example would be to simply cap the number of entries at some fixed number and replace the oldest entry when inserting a new entry. But it is much smaller than holding on to an entire process, or even an entire vnode, and SIGKILL'd FUSE consumers are probably relatively uncommon in non-adversarial conditions.
Unfortunately, we really need that list for this scheme to work, as fuse_out_header doesn't contain _any_ useful information other than the *ticket* response number.
Alternatively, we could just never send FUSE_RELEASE for filehandles like this. I'm not sure if that would cause resource leaks in FUSE filesystem implementations; probably.
[††††]: A problem with this approach would be that ticket numbers are recycled, except (1) we don't appear to do anything to verify the uniqueness of allocated ticket numbers today anyway, and (2) except for on 32-bit platforms, the ticket number is a 64-bit counter (i.e., recycling will essentially never happen). Even on 32-bit platforms, rolling a 32-bit number will likely take some time, and we could also just replace the unique counter with a uint64_t instead of a long and get the full 64-bit period on 32-bit platforms, obviating the problem entirely.
Sorry for the wall of text. Alan, Piotr & Jakub, do you have any thoughts?
I still don't see this as a kernel bug at all. It still seems to me like it needs to be solved in userland. Whether that should happen in libfuse or in moosefs, I'm not sure.
Yeah, this is easier and less fraught to fix in userspace. Do you think the libfuse author would go for a patch to do so there? (Addresses both concerns: fixing it in userspace, and fixing it for all fuse filesystem implementations.)
I think the current libfuse maintainer would probably accept such a patch, if it is backwards-compatible with existing file systems. But he doesn't accept new features for fusefs-libs2, so moosefs would have to upgrade to fusefs-libs3.
I guess I'm unfamiliar with what the migration pain between fusefs 2 and 3 looks like for filesystems. Is that challenging?
(In reply to Conrad Meyer from comment #12)
I've never tried it. But the fact that so few file system authors have done it is not encouraging.
(In reply to Alan Somers from comment #11)
> so moosefs would have to upgrade to fusefs-libs3.
We did so :)