Bug 230260 - [FUSEFS] [PERFORMANCE]: Performance issue (I/O block size)
Summary: [FUSEFS] [PERFORMANCE]: Performance issue (I/O block size)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.1-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-fs mailing list
URL: https://robo.moosefs.com/support/fuse...
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-01 12:39 UTC by MooseFS FreeBSD Team
Modified: 2019-09-06 17:59 UTC (History)
6 users (show)

See Also:
asomers: mfc-stable11-
asomers: mfc-stable12+


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description MooseFS FreeBSD Team 2018-08-01 12:39:01 UTC
This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All four issues can be replicated using this simple test script:
https://robo.moosefs.com/support/fuse_helloworld.tgz

Performance issue in FUSE: if a program uses FUSE without the "direct" option, any I/O is always performed in 4k blocks. Maximum I/O speed we managed to get was 600MB/s (no physical I/O, just sending zeros from a RAM buffer).

With "direct" it's fast, 5GB/s, but "direct" is not the best solution: no cache, read operation has no limit on block size and if one uses extremely big block size, the read speed drastically drops again (we performed dd with bs=1G and the speed was only 40MB/s). Generally, "direct" is geared toward stream-like data (character devices) and should not be used for disk-like I/O.

Other FUSE implementations (Linux, MacOS) use 64k block.

Best regards,
Peter / MooseFS Team
Comment 1 Conrad Meyer freebsd_committer 2018-08-02 00:31:30 UTC
Hm, my reading of fuse_read_biobackend suggests we should be bound by MAXBSIZE, which is 64k on head.  Is it smaller on 11?  Nope, it's been 64k since ~2000.  I wonder where the 4k is coming from.
Comment 2 Conrad Meyer freebsd_committer 2018-08-02 00:36:47 UTC
libfuse itself has a crappy splice(2) reimplementation that hardcodes a 4k buffer size.  I'm not sure that's being used here, but it's a possibility.
Comment 3 Conrad Meyer freebsd_committer 2018-08-02 00:38:57 UTC
You might try bumping the size of 'buf' in fuse_buf_fd_to_fd in libfuse and seeing if that changes the size of IO submitted to your filesystem.
Comment 4 Conrad Meyer freebsd_committer 2018-08-02 05:01:04 UTC
Doh, I totally missed some other hardcoded 4096 in the FUSE kernel support code.  I see two instances:

1. fuse_internal_init_callback: if the fuse_init_out version is below 7.5, we explicitly limit max_write to 4k.  Otherwise we take max_write from the fiio.

2. FUSE_DEFAULT_BLOCKSIZE , FUSE_DEFAULT_IOSIZE are both 4k.  Supposedly both can be overridden per mount, but given they are macro defines, I don't see how that is possible.


DEFAULT_IOSIZE is used to limit the size of FUSE_READDIR requests.  It is also a factor in limiting max_readahead (16x).

DEFAULT_BLOCKSIZE is seemingly unused, except to fake FUSE_STATFS responses (f_bsize) if the user filesystem seems dead.

I don't see either plumbed into ordinary read/write IO, though.


PAGE_SIZE (also 4k) shows up a few times as well:

1. Internal caching of fuse file (vnode) attributes hardcodes va_blocksize as PAGE_SIZE.  I don't know the ramifications of this, if any.

2. Something around FUSE_READLINK operations restricts length to PAGE_SIZE.  This doesn't seem like a problem.

3. fuse_vfsop_mount (VFS = operations on a filesystem, rather than on a specific file (vnode)) hardcodes mnt_stat.f_iosize as PAGE_SIZE.  This may matter.  It is the output of the fuse_iosize() function.

4. fuse_vnop_getpages walks through pages as part of the mmap interface.  I don't think this is a problem.  Ditto fuse_vnop_putpages.


So yeah, (3) looks like it.  fuse_iosize() is used as a direct limit in fuse_{read,write}_biobackend as well as fuse_io_strategy, and f_iosize is widely used throughout the generic kernel code.

(It's also worth exploring bumping up the max READDIR size.  That's probably not the problem you reported, but 4k is still anemic there.)
Comment 5 Conrad Meyer freebsd_committer 2018-08-02 05:02:44 UTC
If you're able, please try this patch and report if the performance is improved:

--- a/sys/fs/fuse/fuse_vfsops.c
+++ b/sys/fs/fuse/fuse_vfsops.c
@@ -341,7 +341,7 @@ fuse_vfsop_mount(struct mount *mp)
        mp->mnt_kern_flag |= MNTK_USES_BCACHE;
        MNT_IUNLOCK(mp);
        /* We need this here as this slot is used by getnewvnode() */
-       mp->mnt_stat.f_iosize = PAGE_SIZE;
+       mp->mnt_stat.f_iosize = DFLTPHYS;
        if (subtype) {
                strlcat(mp->mnt_stat.f_fstypename, ".", MFSNAMELEN);
                strlcat(mp->mnt_stat.f_fstypename, subtype, MFSNAMELEN);
Comment 6 Jakub Kruszona-Zawadzki 2018-08-02 08:07:34 UTC
It works !!!

read in classic (cached) mode:
256+0 records in
256+0 records out
268435456 bytes transferred in 0.099339 secs (2702208293 bytes/sec)
bsize: 65536 ; count: 4096

read in direct (not cached and usually not recommended) mode:
256+0 records in
256+0 records out
268435456 bytes transferred in 0.024716 secs (10860808757 bytes/sec)
bsize: 1048576 ; count: 257

Performance is significantly improved. Now block size is 64k - perfect.

We hope that this patch will find its way to the release version soon :)
Comment 7 commit-hook freebsd_committer 2018-08-02 19:26:10 UTC
A commit references this bug:

Author: cem
Date: Thu Aug  2 19:25:43 UTC 2018
New revision: 337165
URL: https://svnweb.freebsd.org/changeset/base/337165

Log:
  FUSE: Bump maximum IO size to enable more performant operation

  Various components restrict size of IO passed up to the userspace filesystem
  based on the mount's f_iosize value.  The previous default of PAGE_SIZE
  is anemic, even for normal filesystems, but especially considering every
  FUSE operation involves a kernel <-> userspace IPC upcall.

  Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.

  Anecdotally, Jakub reports IO read performance increased from 600 MB/s ->
  2700 MB/s with a basic RAM-backed FUSE filesystem.

  PR:		230260
  Reported by:	Peter (MooseFS) <freebsd AT moosefs.com>
  Tested by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com>
  MFC after:	3 days

Changes:
  head/sys/fs/fuse/fuse_vfsops.c
Comment 8 Conrad Meyer freebsd_committer 2018-08-02 19:27:36 UTC
This fix should land in 12.0.  I don't work on stable branches like 11.x myself, but maybe you can find someone who is interested in stable/11 to backport the patch, if you need it in 11.x.
Comment 9 Kenneth D. Merry freebsd_committer 2019-02-19 22:25:30 UTC
This breaks LTFS (https://github.com/LinearTapeFileSystem/ltfs) at least.  For example:

sm4u-12:/mnt:!:0} dd if=/dev/zero of=foo bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 5.713798 secs (187920873 bytes/sec)
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
{sm4u-12:/mnt:!:0} dd if=foo of=/dev/null bs=1m
0+1 records in
0+1 records out
65536 bytes transferred in 0.000146 secs (449912127 bytes/sec)
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo

Reverting the change back to 4K makes reads work normally again.

I wonder how many other filesystems were broken by this change?
Comment 10 Conrad Meyer freebsd_committer 2019-02-19 23:08:26 UTC
(In reply to Kenneth D. Merry from comment #9)
Hi Ken,

I'm having some trouble understanding your reproduction steps.  This is inside an LTFS mount?

1. You read 1GB of /dev/zero in 1MB chunks and write it to foo;
2. ls -l foo reports the expected 1GB size;
3. reading foo with a 1MB request size returns 64k, and unexpected end of file.

I'm really curious how this change breaks LTFS, given a 4k iosize "works."

Thanks!
Comment 11 Kenneth D. Merry freebsd_committer 2019-02-20 14:25:35 UTC
Yes, this is inside an LTFS mount.  Reading a just-written file does result in only 64K getting read and an unexpected EOF.  Here is a dd read of the same file I created in the previous step.

The only difference between this and the previous version is that I built the fuse.ko module with change 337165 reverted:

{sm4u-12:/root:!:0} cd /mnt
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
{sm4u-12:/mnt:!:0} dd if=foo of=/dev/null bs=1m
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 11.281752 secs (95175096 bytes/sec)

So, normal result, no problems.  What this tells me is that somehow, changing f_iosize from 4K to DFLTPHYS (which is set to 512K on this particular system) messes up reads (but not writes) for LTFS.

In looking at the LTFS FUSE read code, I don't see any hardcoding of the I/O size:

https://github.com/LinearTapeFileSystem/ltfs/blob/master/src/ltfs_fuse.c

So, perhaps there is a place in the FUSE libraries that is hard-coding the size to 4K?
Comment 12 Conrad Meyer freebsd_committer 2019-02-20 23:10:24 UTC
(In reply to Kenneth D. Merry from comment #11)
I think you're on to something.  Are you on a recent head, or a stable branch?

It looks like fuse_write_biobackend directly uses f_iosize:

625         const int biosize = fuse_iosize(vp);

but fuse_read_biobackend clamps the buf block len to MAXBSIZE:

191         const int biosize = fuse_iosize(vp);
...
201         bcount = MIN(MAXBSIZE, biosize);

Which is defined as 64kB on CURRENT (i.e., the block size is not truncated on bio read when DFLTPHYS <= MAXBSIZE).

fuse's directio read path doesn't care about the freebsd block size or phys size and just issues maximal reads per the mount point.  Ditto io_strategy, directio & bio write.

So it's just the bio read path that is artificially truncating 512kB phys to 64kB MAXBSIZE.  I think maxbcachebuf must be >= 512k on your system, too, or else we'd trip this panic in getblk on write:

  3883         if (size > maxbcachebuf)
  3884                 panic("getblk: size(%d) > maxbcachebuf(%d)\n", size,
  3885                     maxbcachebuf);

(But perhaps the writes are hitting the directio write path that avoids the large getblk.)

Anyway, once the blocks are truncated, the LBNs used by the cached read path are nonsensical relative to what was written, and we end up discarding the last (DFLTPHYS - MAXBSIZE) bytes of every DFLTPHYS-sized block.

I'm not exactly sure why the bioread loop aborts after only a single truncated block.  (I would guess either getblk() returning NULL on 2nd block, or getblk marking the 2nd block !CACHEd and fuse_io_strategy() producing an error for some reason.)

I think we have a clear culprit here.  Please try replacing f_iosize = DFLTPHYS with f_iosize = MAXBSIZE or maxbcachebuf; or increasing MAXBSIZE to match DFLTPHYS.
Comment 13 commit-hook freebsd_committer 2019-02-21 02:42:57 UTC
A commit references this bug:

Author: cem
Date: Thu Feb 21 02:41:58 UTC 2019
New revision: 344407
URL: https://svnweb.freebsd.org/changeset/base/344407

Log:
  fuse: Fix a regression introduced in r337165

  On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
  use a buf cache block size in excess of permitted size.  This did not affect
  most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
  The issue was discovered and reported using a custom kernel with a DFLTPHYS
  of 512kB.

  PR:		230260 (comment #9)
  Reported by:	ken@
  MFC after:	?/? weeks

Changes:
  head/sys/fs/fuse/fuse_vfsops.c
Comment 14 Kenneth D. Merry freebsd_committer 2019-02-21 14:57:44 UTC
(In reply to Conrad Meyer from comment #12)

Your changed fixed the problem, thanks!

You are correct that MAXPHYS is larger than the default:

options         DFLTPHYS=(512*1024)
options         MAXPHYS=(1024*1056)

Could you merge this to stable/12?  This will likely break LTFS for most people using it.

Since tape drives don't do tagged queueing, the common way to get better performance is to use a larger block size.  LTFS supports up to 1MB block sizes, and in order to read tapes from other systems and get better performance, we set MAXPHYS to over 1MB.  (So we can get 1MB I/O regardless of alignment.)  DFLTPHYS goes along with that.
Comment 15 Conrad Meyer freebsd_committer 2019-02-21 16:44:22 UTC
(In reply to Kenneth D. Merry from comment #14)
I don't do stable/, but anyone is free to MFC it themselves.  It shouldn't conflict.

> Since tape drives don't do tagged queueing, the common way to get better
> performance is to use a larger block size.  LTFS supports up to 1MB block
> sizes, and in order to read tapes from other systems and get better
> performance, we set MAXPHYS to over 1MB.  (So we can get 1MB I/O regardless
> of alignment.)  DFLTPHYS goes along with that.

Yeah, that makes a lot of sense.

(I think it is probable that FUSE should move to the tunable maxbcachebuf instead of MAXBSIZE; MAXBSIZE is nearly orphaned in base, and can probably be removed.  But that is somewhat orthogonal.)

Thank you for reporting this and especially mentioning the non-default DFLTPHYS.  I did not realize it was a value people changed in their own kernels. :-)
Comment 16 Kenneth D. Merry freebsd_committer 2019-02-21 20:52:11 UTC
(In reply to Conrad Meyer from comment #15)

It would be nice if FUSE moved to the maxbcachebuf tunable at some point.

Thank you for fixing this quickly!

As for not putting things in stable, ever, that is unfortunate.

I try to merge my changes back to the active stable branches when it makes sense to do so.  Otherwise, end users won't get the benefit of the change until years down the road.  Or, someone else has to see and understand my changes well enough to merge them.
Comment 17 commit-hook freebsd_committer 2019-09-06 17:56:47 UTC
A commit references this bug:

Author: asomers
Date: Fri Sep  6 17:56:27 UTC 2019
New revision: 351943
URL: https://svnweb.freebsd.org/changeset/base/351943

Log:
  MFC r344183-r344187, r344333-r344334, r344407, r344857, r344865 (by cem)

  r344183:
  FUSE: Respect userspace FS "do-not-cache" of file attributes

  The FUSE protocol demands that kernel implementations cache user filesystem
  file attributes (vattr data) for a maximum period of time in the range of
  [0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or 10
  seconds; or "a long time" to represent indefinite caching.

  Historically, FreeBSD FUSE has ignored this client directive entirely.  This
  works fine for local-only filesystems, but causes consistency issues with
  multi-writer network filesystems.

  For now, respect 0 second cache TTLs and do not cache such metadata.
  Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
  are still cached indefinitely, because it is unclear how a userspace
  filesystem could do anything sensible with those semantics even if
  implemented.

  In the future, as an optimization, we should implement notify_inval_entry,
  etc, which provide userspace filesystems a way of evicting the kernel cache.

  One potentially bogus access to invalid cached attribute data was left in
  fuse_io_strategy.  It is restricted behind the undocumented and non-default
  "vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
  deadcode and can be eliminated?

  Some minor APIs changed to facilitate this:

  1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").

  2. cache_attrs() respects the provided TTL and only caches in the FUSE
  inode if TTL > 0.  It also grows an "out" argument, which, if non-NULL,
  stores the translated fuse_attr (even if not suitable for caching).

  3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
  avoid programming mistakes.

  4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
  link op was weakened (only performed when we have a valid attr cache).  The
  check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
  We have to trust any userspace filesystem that rejects local caching to
  account for it correctly.

  PR:		230258 (inspired by; does not fix)

  r344184:
  FUSE: Respect userspace FS "do-not-cache" of path components

  The FUSE protocol demands that kernel implementations cache user filesystem
  path components (lookup/cnp data) for a maximum period of time in the range
  of [0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or
  10 seconds; or "a long time" to represent indefinite caching.

  Historically, FreeBSD FUSE has ignored this client directive entirely.  This
  works fine for local-only filesystems, but causes consistency issues with
  multi-writer network filesystems.

  For now, respect 0 second cache TTLs and do not cache such metadata.
  Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
  are still cached indefinitely, because it is unclear how a userspace
  filesystem could do anything sensible with those semantics even if
  implemented.

  Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
  if the user filesystem did not set a zero second TTL.

  PR:		230258 (inspired by; does not fix)

  r344185:
  FUSE: Only "dirty" cached file size when data is dirty

  Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
  the buf cache bounds as a result of either a read from the underlying FUSE
  filesystem, or as part of a write-through type operation (like truncate =>
  VOP_SETATTR).  In these cases, do not set the FN_SIZECHANGE flag, which
  indicates that an inode's data is dirty (in particular, that the local buf
  cache and fvdat->filesize have dirty extended data).

  PR:		230258 (related)

  r344186:
  FUSE: The FUSE design expects writethrough caching

  At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
  specifies only clean data to be cached.

  Prior to this change, we implement and default to writeback caching.  This
  is ok enough for local only filesystems without hardlinks, but violates the
  general design contract with FUSE and breaks distributed filesystems or
  concurrent access to hardlinks of the same inode.

  In this change, add cache mode as an extension of cache enable/disable.  The
  new modes are UC (was: cache disabled), WT (default), and WB (was: cache
  enabled).

  For now, WT caching is implemented as write-around, which meets the goal of
  only caching clean data.  WT can be better than WA for workloads that
  frequently read data that was recently written, but WA is trivial to
  implement.  Note that this has no effect on O_WRONLY-opened files, which
  were already coerced to write-around.

  Refs:
    * https://sourceforge.net/p/fuse/mailman/message/8902254/
    * https://github.com/vgough/encfs/issues/315

  PR:		230258 (inspired by)

  r344187:
  FUSE: Refresh cached file size when it changes (lookup)

  The cached fvdat->filesize is indepedent of the (mostly unused)
  cached_attrs, and we failed to update it when a cached (but perhaps
  inactive) vnode was found during VOP_LOOKUP to have a different size than
  cached.

  As noted in the code comment, this can occur in distributed filesystems or
  with other kinds of irregular file behavior (anything is possible in FUSE).

  We do something similar in fuse_vnop_getattr already.

  PR:		230258 (as reported in description; other issues explored in
  			comments are not all resolved)
  Reported by:	MooseFS FreeBSD Team <freebsd AT moosefs.com>
  Submitted by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)

  r344333:
  fuse: add descriptions for remaining sysctls

  (Except reclaim revoked; I don't know what that goal of that one is.)

  r344334:
  Fuse: whitespace and style(9) cleanup

  Take a pass through fixing some of the most egregious whitespace issues in
  fs/fuse.  Also fix some style(9) warts while here.  Not 100% cleaned up, but
  somewhat less painful to look at and edit.

  No functional change.

  r344407:
  fuse: Fix a regression introduced in r337165

  On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
  use a buf cache block size in excess of permitted size.  This did not affect
  most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
  The issue was discovered and reported using a custom kernel with a DFLTPHYS
  of 512kB.

  PR:		230260 (comment #9)
  Reported by:	ken@

  r344857:
  FUSE: Prevent trivial panic

  When open(2) was invoked against a FUSE filesystem with an unexpected flags
  value (no O_RDONLY / O_RDWR / O_WRONLY), an assertion fired, causing panic.

  For now, prevent the panic by rejecting such VOP_OPENs with EINVAL.

  This is not considered the correct long term fix, but does prevent an
  unprivileged denial-of-service.

  PR:		236329
  Reported by:	asomers
  Reviewed by:	asomers
  Sponsored by:	Dell EMC Isilon

  r344865:
  fuse: switch from DFLTPHYS/MAXBSIZE to maxcachebuf

  On GENERIC kernels with empty loader.conf, there is no functional change.
  DFLTPHYS and MAXBSIZE are both 64kB at the moment.  This change allows
  larger bufcache block sizes to be used when either MAXBSIZE (custom kernel)
  or the loader.conf tunable vfs.maxbcachebuf (GENERIC) is adjusted higher
  than the default.

  Suggested by:	ken@

Changes:
_U  stable/12/
  stable/12/sys/fs/fuse/fuse.h
  stable/12/sys/fs/fuse/fuse_device.c
  stable/12/sys/fs/fuse/fuse_file.c
  stable/12/sys/fs/fuse/fuse_file.h
  stable/12/sys/fs/fuse/fuse_internal.c
  stable/12/sys/fs/fuse/fuse_internal.h
  stable/12/sys/fs/fuse/fuse_io.c
  stable/12/sys/fs/fuse/fuse_ipc.c
  stable/12/sys/fs/fuse/fuse_ipc.h
  stable/12/sys/fs/fuse/fuse_node.c
  stable/12/sys/fs/fuse/fuse_node.h
  stable/12/sys/fs/fuse/fuse_vfsops.c
  stable/12/sys/fs/fuse/fuse_vnops.c