|Summary:||[FUSEFS] [BUG]: File close (release) issue|
|Product:||Base System||Reporter:||MooseFS FreeBSD Team <freebsd>|
|Component:||kern||Assignee:||Conrad Meyer <cem>|
|Severity:||Affects Some People||CC:||acid, asomers, cem, piotr.konopelko|
Description MooseFS FreeBSD Team 2018-08-01 12:37:46 UTC
This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All four issues can be replicated using this simple test script: https://robo.moosefs.com/support/fuse_helloworld.tgz When an I/O operation (for example read) takes a bit long and the process that performed the read operation is killed (i.e. with 9 signal), kernel does not wait for the operation to complete but immediately sends release. Since FUSE is multithreaded, one's program can get the release before the read (race condition), effectively ending up reading data from an already closed descriptor into no-longer-existing structures. We handled that issue in our system by making reference counters and delaying release (lots of FreeBSD specific code), but since FUSE behaviour on Linux is completely different and FUSE was first developed on Linux, we feel some consitency is warranted. Best regards, Peter / MooseFS Team
Comment 1 Conrad Meyer 2018-08-02 19:47:26 UTC
I haven't gotten to this one yet but I have my eye on it.
Comment 2 Conrad Meyer 2018-08-08 05:07:41 UTC
I'm not sure I understand this scenario. FreeBSD VFS layer requires some vnode lock over read VOP and exclusive vnode lock over inactive (or reclaim). I guess read IPC upcall to userspace daemon is subject to both timeout (maybe what you mean by "take a bit long?") and interrupt by signal (PCATCH): fuse_read_directbackend -> fdisp_wait_answ -> fticket_wait_answer -> msleep(, PCATCH, ..., daemon_timeout * hz) Still, FreeBSD FUSE IPC seems to be serialized on the kernel side through a single IPC queue. fuse_lck_mtx_lock, fuse_ms_push, etc. I don't understand how the READ would be reordered with RELEASE. If you can provide more detail, that would be appreciated.
Comment 3 Jakub Kruszona-Zawadzki 2018-08-08 05:41:57 UTC
Even if RELEASE is sent just after READ then on the user space there are separate threads that handle those requests. Thread which handles RELEASE can be "faster" than thread which handles READ. On Linux RELEASE is just sent only after all I/O ops returned with their statuses. This also means that process that is in the middle of I/O and got terminal signal waits for end of all I/O. In attached program you can check it. We have very slow READ (using simple sleep). Then process that is in then middle of this 10s-long READ is killed. On FreeBSD process is terminated instantly (without waiting) and RELEASE is send in the middle of reading, on Linux process waits 10s (end of long read) and only then is finished, and only then RELEASE is sent to the filesystem.
Comment 4 Conrad Meyer 2018-08-08 16:28:52 UTC
(In reply to Jakub Kruszona-Zawadzki from comment #3) Hm. I'm not sure if there is a great way to do this. And what's the point? If we have sent RELEASE, the read is not going to be used anyway. It seems like userspace can relatively easily force strict ordering of operations on the same filehandle (fixed number of worker threads + hash based dispatch), or at least serialize releases with other IO (ref count solution you have now). In contrast a kernel approach has to sort of bend over backwards to fix libfuse's naive threading. We could refcount FUSE filehandles in the kernel and instead of immediate RELEASE / RELEASEDIR / FORGET, tag the internal inode with a deferred release/forget flag. When outstanding IOs are completed, the handler decrements the reference count and checks for these flags when it drops to zero. Sounds like a pain :-).
Comment 5 Jakub Kruszona-Zawadzki 2018-08-09 05:49:20 UTC
(In reply to Conrad Meyer from comment #4) > And what's the point? If we have sent RELEASE, the read is not going to be used anyway. Yes. The only problem is that on the client side you usually use some data structures that you create on OPEN/CREATE and free on RELEASE. READ/WRITE/FSYNC are ops that use those data structures, so they can't be freed in the middle of operation. Formally there is INTERRUPT operation that should be send in such case to inform "long" I/O that the result is no longer needed. Problem number one - INTERRUPT is optional (and rarely implemented), problem number two - even with INTERRUPT kernel still should send RELEASE only after succesful interruption. > It seems like userspace can relatively easily force strict ordering of operations on the same filehandle (fixed number of worker threads + hash based dispatch), or at least serialize releases with other IO (ref count solution you have now). Because we use libfuse (external code for you and me) we can't force libfuse to do the job for us. Libfuse has dynamic pool of worker threads, so all ops from the kernel can be reordered. Also usually we want to dispatch simultaneously I/O on the same descriptor (efficiency). Solution with ref count and delayed RELEASE is of course good enough. This is absolutelly not a problem for me - I've mentioned it only because other fuse-based filesystems can have the same problem and we all want to have better FreeBSD :-) > We could refcount FUSE filehandles in the kernel and instead of immediate RELEASE / RELEASEDIR / FORGET, tag the internal inode with a deferred release/forget flag. When outstanding IOs are completed, the handler decrements the reference count and checks for these flags when it drops to zero. Sounds like a pain :-). Exactly :-) Such approach should work similar to Linux version.