Bug 237588 - [fusefs] data corruption when mixing normal writes and mmap writes with Write-through cache mode
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Many People
Assignee: Alan Somers
Depends on:
Reported: 2019-04-26 20:47 UTC by Alan Somers
Modified: 2019-06-14 19:48 UTC
Description Alan Somers freebsd_committer 2019-04-26 20:47:59 UTC
fusefs supports three cache modes: uncached, write-through, and writeback.  Write-through is the default.  However, as currently implemented it's more like "write-around" than true "write-through".  That is, writes go directly to the fuse daemon and invalidate the cache, rather than fill it.

This is merely a performance bug when using normal writes (write(2), pwrite(2), aio_write(2), etc.).  But when mixing normal writes with mmap()'ed writes, it causes data corruption.  The sequence goes like this:
1) A process mmaps a fusefs file
2) That process writes some data through the mapping, but does not msync() it.
3) A process writes directly, such as with write(2), to a region of the file that overlaps what was written in step 2.  fusefs invalidates all cached pages that were part of the write.
4) Any bytes that were written in step 2 and invalidated in step 3, but not overwritten by step 3 itself, are lost.
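
The sequence above can be sketched as a small userspace check.  This is only an illustration, not the fsx test case from this report: the helper name mmap_then_pwrite(), the file size, and the offsets are all made up for the example, and whether this exact sequence triggers the bug on a given fusefs build has not been verified.  On a filesystem with a coherent cache it should report that no data was lost.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the bug report).  Returns 0 if bytes
 * written through a mapping survive an overlapping pwrite(2), 1 if any were
 * lost, and -1 on a setup or I/O error.  The file at `path` is created and
 * removed again.
 */
int
mmap_then_pwrite(const char *path)
{
	enum { FILESIZE = 4096, PATCH_OFF = 1024, PATCH_LEN = 512 };
	char readback[FILESIZE], patch[PATCH_LEN];
	char *map;
	int fd, i, result = -1;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);
	if (ftruncate(fd, FILESIZE) != 0)
		goto out;

	/* Step 1: map the file. */
	map = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* Step 2: dirty the whole range through the mapping; no msync(). */
	memset(map, 'A', FILESIZE);

	/* Step 3: overwrite part of the same range with a normal write. */
	memset(patch, 'B', sizeof(patch));
	if (pwrite(fd, patch, sizeof(patch), PATCH_OFF) ==
	    (ssize_t)sizeof(patch) &&
	    pread(fd, readback, sizeof(readback), 0) ==
	    (ssize_t)sizeof(readback)) {
		/* Step 4: bytes outside the patch must still be 'A'. */
		result = 0;
		for (i = 0; i < FILESIZE; i++) {
			char want = (i >= PATCH_OFF &&
			    i < PATCH_OFF + PATCH_LEN) ? 'B' : 'A';
			if (readback[i] != want)
				result = 1;
		}
	}
	munmap(map, FILESIZE);
out:
	close(fd);
	unlink(path);
	return (result);
}
```

Calling mmap_then_pwrite() with a path under the fusefs mount point and checking for a return value of 1 would indicate lost bytes.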

Steps to reproduce:
1) Build the passthrough example from sysutils/libfuse and fsx from tools/regression/fsx
2) mkdir -p /tmp/mnt
3) /path/to/libfuse/build/example/passthrough -d /tmp/mnt
4) /path/to/fsx/fsx -P /tmp -S1333 -b5 -N 15 -U fsx.bin
Comment 1 Alan Somers freebsd_committer 2019-04-26 20:49:16 UTC
The best solution would probably be to implement true write-through caching.  Here are cem's notes on the subject:

For WT, you'd want to go back to the biobackend for writes, but be
sure they are flushed synchronously with the VOP_WRITE, and probably
in-order.  A simple way to do that would be to bwrite() each buf as we
go.  That makes for a lot of serialized IPC latency, though.  With our
fairly large buf block size (and we pick the maximum — 64kB by default
on amd64 GENERIC), it may not be awful.  But it probably isn't great.

It might be worthwhile to use bawrite() / bufwrite + B_ASYNC
(immediate, but async completion writes) to at least pipeline multiple
bufs at a time.  We would want to hook b_iodone to override the
default bufdone() handling of async buf writes.  We would also want to
wait for all async writes we issued to complete before returning.  I'm
not sure how we would handle signals.  If /any/ buf write fails, we
would return failure (or short write) depending on if it was the first
buf or not.  We probably want to evict cache contents of any buf write
that fails, and probably also any after it (userspace can't know those
were written, and will likely try to rewrite them anyway).

The next step performance wise beyond bawrite pipelining is attempting
to do delayed (bdwrite) / clustered (vfs_bio_awrite) writes, but I
don't think the locking works there such that we can reliably do that
without allowing dirty data caching, so maybe that's pending 7.23.
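
As a rough illustration of the simplest approach in the notes above (one block at a time, in order, each write completed before the next is issued) and of the proposed error convention, here is a userspace sketch.  It is not kernel code and does not use the buf(9) API; write_through_blocks(), block_write_fn, and the demo backend are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical backend callback: returns 0 on success, nonzero on failure. */
typedef int (*block_write_fn)(size_t off, const char *data, size_t len,
    void *ctx);

/*
 * Write `len` bytes at file offset `off`, split at block boundaries, waiting
 * for each piece before issuing the next.  Returns the number of bytes
 * written (a short count if a later block fails) or -1 if the very first
 * block fails.
 */
ssize_t
write_through_blocks(size_t off, const char *data, size_t len, size_t bsize,
    block_write_fn fn, void *ctx)
{
	size_t done = 0;

	assert(bsize > 0);
	while (done < len) {
		size_t pos = off + done;
		/* Bytes remaining up to the next block boundary. */
		size_t chunk = bsize - (pos % bsize);

		if (chunk > len - done)
			chunk = len - done;
		if (fn(pos, data + done, chunk, ctx) != 0)
			return (done == 0 ? -1 : (ssize_t)done);
		done += chunk;
	}
	return ((ssize_t)done);
}

/* Demo backend for the example: accepts `remaining_ok` writes, then fails. */
struct demo_backend {
	int remaining_ok;
};

int
demo_write(size_t off, const char *data, size_t len, void *ctx)
{
	struct demo_backend *b = ctx;

	(void)off; (void)data; (void)len;
	if (b->remaining_ok == 0)
		return (-1);
	b->remaining_ok--;
	return (0);
}
```

With a block size of 8, a 10-byte write at offset 10 is split into a 6-byte piece and a 4-byte piece; if the second piece fails, the caller sees a short write of 6 bytes.  The pipelined variant discussed in the notes would additionally have to wait for all outstanding writes before computing the return value.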
Comment 2 commit-hook freebsd_committer 2019-06-14 19:48:20 UTC
A commit references this bug:

Author: asomers
Date: Fri Jun 14 19:47:49 UTC 2019
New revision: 349038
URL: https://svnweb.freebsd.org/changeset/base/349038

  fusefs: fix the "write-through" of write-through cacheing

  Our fusefs(5) module supports three cache modes: uncached, write-through,
  and write-back.  However, the write-through mode (which is the default) has
  never actually worked as its name suggests.  Rather, it's always been more
  like "write-around".  It wrote directly, bypassing the cache.  The cache
  would only be populated by a subsequent read of the same data.

  This commit fixes that problem.  Now the write-through mode works as one
  would expect: write(2) immediately adds data to the cache and then blocks
  while the daemon processes the write operation.

  A side effect of this change is that non-cache-block-aligned writes will now
  incur a read-modify-write cycle of the cache block.  The old behavior
  (bypassing write cache entirely) can still be achieved by opening a file
  with O_DIRECT.

  PR:		237588
  Sponsored by:	The FreeBSD Foundation
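
The read-modify-write side effect mentioned in the commit message can be pictured with a toy model of a single cache block.  This is only a sketch under simplifying assumptions, not the fusefs implementation: struct toy_block, toy_write_through(), and the 16-byte block size are invented for the example, and the "backing" array merely stands in for the daemon's copy of the data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { BLKSZ = 16 };	/* toy block size; real cache blocks are far larger */

/* Toy model of one cache block in front of a backing store. */
struct toy_block {
	char backing[BLKSZ];	/* stands in for the daemon's copy */
	char cache[BLKSZ];	/* stands in for the kernel's cached copy */
	bool valid;		/* does the cache currently hold this block? */
	int reads;		/* how many times the block was fetched */
};

/*
 * Write-through into a single block.  A write covering the whole block can
 * overwrite the cache directly; a partial write to an uncached block must
 * first fetch the block so that the untouched bytes are correct.
 */
void
toy_write_through(struct toy_block *b, size_t off, const char *src,
    size_t len)
{
	assert(off <= BLKSZ && len <= BLKSZ - off);
	if (!b->valid && len < BLKSZ) {
		memcpy(b->cache, b->backing, BLKSZ);	/* read */
		b->reads++;
	}
	memcpy(b->cache + off, src, len);		/* modify */
	b->valid = true;
	memcpy(b->backing + off, src, len);		/* write through */
}
```

In this model a full-block write never increments the read counter, while a 2-byte write into an uncached block costs one block fetch first; that extra fetch is the cost the commit message describes for non-aligned writes.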