This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All four issues can be replicated using this simple test script:
When an I/O operation (for example read) takes a bit long and the process that performed the read operation is killed (i.e. with 9 signal), kernel does not wait for the operation to complete but immediately sends release. Since FUSE is multithreaded, one's program can get the release before the read (race condition), effectively ending up reading data from an already closed descriptor into no-longer-existing structures.
We handled that issue in our system by making reference counters and delaying release (lots of FreeBSD specific code), but since FUSE behaviour on Linux is completely different and FUSE was first developed on Linux, we feel some consitency is warranted.
Peter / MooseFS Team
I haven't gotten to this one yet but I have my eye on it.
I'm not sure I understand this scenario. FreeBSD VFS layer requires some vnode lock over read VOP and exclusive vnode lock over inactive (or reclaim).
I guess read IPC upcall to userspace daemon is subject to both timeout (maybe what you mean by "take a bit long?") and interrupt by signal (PCATCH):
fuse_read_directbackend -> fdisp_wait_answ -> fticket_wait_answer -> msleep(, PCATCH, ..., daemon_timeout * hz)
Still, FreeBSD FUSE IPC seems to be serialized on the kernel side through a single IPC queue. fuse_lck_mtx_lock, fuse_ms_push, etc. I don't understand how the READ would be reordered with RELEASE. If you can provide more detail, that would be appreciated.
Even if RELEASE is sent just after READ then on the user space there are separate threads that handle those requests. Thread which handles RELEASE can be "faster" than thread which handles READ.
On Linux RELEASE is just sent only after all I/O ops returned with their statuses. This also means that process that is in the middle of I/O and got terminal signal waits for end of all I/O.
In attached program you can check it. We have very slow READ (using simple sleep). Then process that is in then middle of this 10s-long READ is killed. On FreeBSD process is terminated instantly (without waiting) and RELEASE is send in the middle of reading, on Linux process waits 10s (end of long read) and only then is finished, and only then RELEASE is sent to the filesystem.
(In reply to Jakub Kruszona-Zawadzki from comment #3)
Hm. I'm not sure if there is a great way to do this.
And what's the point? If we have sent RELEASE, the read is not going to be used anyway.
It seems like userspace can relatively easily force strict ordering of operations on the same filehandle (fixed number of worker threads + hash based dispatch), or at least serialize releases with other IO (ref count solution you have now).
In contrast a kernel approach has to sort of bend over backwards to fix libfuse's naive threading.
We could refcount FUSE filehandles in the kernel and instead of immediate RELEASE / RELEASEDIR / FORGET, tag the internal inode with a deferred release/forget flag. When outstanding IOs are completed, the handler decrements the reference count and checks for these flags when it drops to zero. Sounds like a pain :-).
(In reply to Conrad Meyer from comment #4)
> And what's the point? If we have sent RELEASE, the read is not going to be used anyway.
Yes. The only problem is that on the client side you usually use some data structures that you create on OPEN/CREATE and free on RELEASE. READ/WRITE/FSYNC are ops that use those data structures, so they can't be freed in the middle of operation. Formally there is INTERRUPT operation that should be send in such case to inform "long" I/O that the result is no longer needed. Problem number one - INTERRUPT is optional (and rarely implemented), problem number two - even with INTERRUPT kernel still should send RELEASE only after succesful interruption.
> It seems like userspace can relatively easily force strict ordering of operations on the same filehandle (fixed number of worker threads + hash based dispatch), or at least serialize releases with other IO (ref count solution you have now).
Because we use libfuse (external code for you and me) we can't force libfuse to do the job for us. Libfuse has dynamic pool of worker threads, so all ops from the kernel can be reordered. Also usually we want to dispatch simultaneously I/O on the same descriptor (efficiency).
Solution with ref count and delayed RELEASE is of course good enough. This is absolutelly not a problem for me - I've mentioned it only because other fuse-based filesystems can have the same problem and we all want to have better FreeBSD :-)
> We could refcount FUSE filehandles in the kernel and instead of immediate RELEASE / RELEASEDIR / FORGET, tag the internal inode with a deferred release/forget flag. When outstanding IOs are completed, the handler decrements the reference count and checks for these flags when it drops to zero. Sounds like a pain :-).
Exactly :-) Such approach should work similar to Linux version.
I don't think the "ref count and delayed RELEASE" solution is ok. The problem is that a misbehaving daemon might never respond to a READ or WRITE request. If the kernel waits for all I/O to complete before RELEASEing a file handle, then a misbehaving daemon could DOS the kernel. That's not ok. I think the best we can do would be to send an INTERRUPT to all outstanding I/O operations before sending RELEASE. File systems that don't implement INTERRUPT would have to reference-count their structures, just as you do.