Bug 230260 - [FUSEFS] [PERFORMANCE]: Performance issue (I/O block size)
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs mailing list
URL: https://robo.moosefs.com/support/fuse...
Reported: 2018-08-01 12:39 UTC by MooseFS FreeBSD Team
Modified: 2019-04-09 14:29 UTC

Description MooseFS FreeBSD Team 2018-08-01 12:39:01 UTC
This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All three issues can be replicated using this simple test script:
https://robo.moosefs.com/support/fuse_helloworld.tgz

Performance issue in FUSE: if a program uses FUSE without the "direct" option, any I/O is always performed in 4k blocks. Maximum I/O speed we managed to get was 600MB/s (no physical I/O, just sending zeros from a RAM buffer).

With "direct" it's fast (5GB/s), but "direct" is not the best solution: there is no cache, and read operations have no limit on block size, so with an extremely large block size the read speed drops drastically again (we ran dd with bs=1G and the speed was only 40MB/s). Generally, "direct" is geared toward stream-like data (character devices) and should not be used for disk-like I/O.

Other FUSE implementations (Linux, macOS) use 64k blocks.
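
The dd-style measurements above can be approximated with a small helper. This is a minimal sketch, not the MooseFS test script: read_blocks is a hypothetical name, and the function simply issues up to `count` sequential read(2) calls of `bs` bytes each and reports how many bytes came back.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read up to `count` blocks of `bs` bytes from `path`, roughly the way
 * `dd if=path of=/dev/null bs=... count=...` would.  Returns the total
 * number of bytes read, or -1 on error.  (Hypothetical helper.)
 */
long long
read_blocks(const char *path, size_t bs, size_t count)
{
	char *buf;
	long long total = 0;
	int fd;

	if (bs == 0 || (buf = malloc(bs)) == NULL)
		return (-1);
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return (-1);
	}
	for (size_t i = 0; i < count; i++) {
		ssize_t n = read(fd, buf, bs);

		if (n < 0) {		/* read error */
			total = -1;
			break;
		}
		if (n == 0)		/* end of file */
			break;
		total += n;
	}
	close(fd);
	free(buf);
	return (total);
}
```

Timing a call such as read_blocks("/mnt/file", 65536, 4096) for several values of `bs` on a FUSE mount would be one way to compare block sizes; the path here is only an example.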

Best regards,
Peter / MooseFS Team
Comment 1 Conrad Meyer freebsd_committer 2018-08-02 00:31:30 UTC
Hm, my reading of fuse_read_biobackend suggests we should be bound by MAXBSIZE, which is 64k on head.  Is it smaller on 11?  Nope, it's been 64k since ~2000.  I wonder where the 4k is coming from.
Comment 2 Conrad Meyer freebsd_committer 2018-08-02 00:36:47 UTC
libfuse itself has a crappy splice(2) reimplementation that hardcodes a 4k buffer size.  I'm not sure that's being used here, but it's a possibility.
Comment 3 Conrad Meyer freebsd_committer 2018-08-02 00:38:57 UTC
You might try bumping the size of 'buf' in fuse_buf_fd_to_fd in libfuse and seeing if that changes the size of IO submitted to your filesystem.
Comment 4 Conrad Meyer freebsd_committer 2018-08-02 05:01:04 UTC
Doh, I totally missed some other hardcoded 4096 in the FUSE kernel support code.  I see two instances:

1. fuse_internal_init_callback: if the fuse_init_out version is below 7.5, we explicitly limit max_write to 4k.  Otherwise we take max_write from the fiio.

2. FUSE_DEFAULT_BLOCKSIZE and FUSE_DEFAULT_IOSIZE are both 4k.  Supposedly both can be overridden per mount, but given they are macro defines, I don't see how that is possible.


DEFAULT_IOSIZE is used to limit the size of FUSE_READDIR requests.  It is also a factor in limiting max_readahead (16x).

DEFAULT_BLOCKSIZE is seemingly unused, except to fake FUSE_STATFS responses (f_bsize) if the user filesystem seems dead.

I don't see either plumbed into ordinary read/write IO, though.


PAGE_SIZE (also 4k) shows up a few times as well:

1. Internal caching of fuse file (vnode) attributes hardcodes va_blocksize as PAGE_SIZE.  I don't know the ramifications of this, if any.

2. Something around FUSE_READLINK operations restricts length to PAGE_SIZE.  This doesn't seem like a problem.

3. fuse_vfsop_mount (VFS = operations on a filesystem, rather than on a specific file (vnode)) hardcodes mnt_stat.f_iosize as PAGE_SIZE.  This may matter.  It is the output of the fuse_iosize() function.

4. fuse_vnop_getpages walks through pages as part of the mmap interface.  I don't think this is a problem.  Ditto fuse_vnop_putpages.


So yeah, (3) looks like it.  fuse_iosize() is used as a direct limit in fuse_{read,write}_biobackend as well as fuse_io_strategy, and f_iosize is widely used throughout the generic kernel code.

(It's also worth exploring bumping up the max READDIR size.  That's probably not the problem you reported, but 4k is still anemic there.)
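
One way to see what block size a filesystem advertises to userspace is stat(2). The sketch below is an illustration only: preferred_io_size is a hypothetical helper, and whether st_blksize on a FUSE mount follows va_blocksize, f_iosize, or something else is an assumption to verify on the system in question.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Return the preferred I/O block size that stat(2) reports for `path`
 * (st_blksize), or -1 if stat() fails.  (Hypothetical helper.)
 */
long
preferred_io_size(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) != 0)
		return (-1);
	return ((long)sb.st_blksize);
}
```

Comparing the value for a file on a FUSE mount before and after changing f_iosize would show whether the change is visible through this interface.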
Comment 5 Conrad Meyer freebsd_committer 2018-08-02 05:02:44 UTC
If you're able, please try this patch and report if the performance is improved:

--- a/sys/fs/fuse/fuse_vfsops.c
+++ b/sys/fs/fuse/fuse_vfsops.c
@@ -341,7 +341,7 @@ fuse_vfsop_mount(struct mount *mp)
        mp->mnt_kern_flag |= MNTK_USES_BCACHE;
        MNT_IUNLOCK(mp);
        /* We need this here as this slot is used by getnewvnode() */
-       mp->mnt_stat.f_iosize = PAGE_SIZE;
+       mp->mnt_stat.f_iosize = DFLTPHYS;
        if (subtype) {
                strlcat(mp->mnt_stat.f_fstypename, ".", MFSNAMELEN);
                strlcat(mp->mnt_stat.f_fstypename, subtype, MFSNAMELEN);
Comment 6 Jakub Kruszona-Zawadzki 2018-08-02 08:07:34 UTC
It works !!!

read in classic (cached) mode:
256+0 records in
256+0 records out
268435456 bytes transferred in 0.099339 secs (2702208293 bytes/sec)
bsize: 65536 ; count: 4096

read in direct (not cached and usually not recommended) mode:
256+0 records in
256+0 records out
268435456 bytes transferred in 0.024716 secs (10860808757 bytes/sec)
bsize: 1048576 ; count: 257

Performance is significantly improved. Now block size is 64k - perfect.

We hope that this patch will find its way to the release version soon :)
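
The "bsize: 65536 ; count: 4096" line is consistent with simple ceiling division, assuming "count" is the number of read requests that reached the filesystem: 268435456 bytes at 64k per request is 4096 requests, while the same transfer at 4k would take 65536. A minimal sketch of that arithmetic (requests_needed is a hypothetical name, not a kernel function):

```c
#include <stdint.h>

/*
 * Number of requests needed to move `total` bytes when each request
 * carries at most `iosize` bytes (ceiling division).  `iosize` must be
 * nonzero.  (Hypothetical helper.)
 */
uint64_t
requests_needed(uint64_t total, uint64_t iosize)
{
	return ((total + iosize - 1) / iosize);
}
```

Since every request is a kernel <-> userspace round trip, cutting the request count by a factor of 16 is a plausible explanation for most of the speedup.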
Comment 7 commit-hook freebsd_committer 2018-08-02 19:26:10 UTC
A commit references this bug:

Author: cem
Date: Thu Aug  2 19:25:43 UTC 2018
New revision: 337165
URL: https://svnweb.freebsd.org/changeset/base/337165

Log:
  FUSE: Bump maximum IO size to enable more performant operation

  Various components restrict size of IO passed up to the userspace filesystem
  based on the mount's f_iosize value.  The previous default of PAGE_SIZE
  is anemic, even for normal filesystems, but especially considering every
  FUSE operation involves a kernel <-> userspace IPC upcall.

  Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.

  Anecdotally, Jakub reports IO read performance increased from 600 MB/s ->
  2700 MB/s with a basic RAM-backed FUSE filesystem.

  PR:		230260
  Reported by:	Peter (MooseFS) <freebsd AT moosefs.com>
  Tested by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com>
  MFC after:	3 days

Changes:
  head/sys/fs/fuse/fuse_vfsops.c
Comment 8 Conrad Meyer freebsd_committer 2018-08-02 19:27:36 UTC
This fix should land in 12.0.  I don't work on stable branches like 11.x myself, but maybe you can find someone who is interested in stable/11 to backport the patch, if you need it in 11.x.
Comment 9 Kenneth D. Merry freebsd_committer 2019-02-19 22:25:30 UTC
This breaks LTFS (https://github.com/LinearTapeFileSystem/ltfs) at least.  For example:

{sm4u-12:/mnt:!:0} dd if=/dev/zero of=foo bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 5.713798 secs (187920873 bytes/sec)
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
{sm4u-12:/mnt:!:0} dd if=foo of=/dev/null bs=1m
0+1 records in
0+1 records out
65536 bytes transferred in 0.000146 secs (449912127 bytes/sec)
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo

Reverting the change back to 4K makes reads work normally again.

I wonder how many other filesystems were broken by this change?
Comment 10 Conrad Meyer freebsd_committer 2019-02-19 23:08:26 UTC
(In reply to Kenneth D. Merry from comment #9)
Hi Ken,

I'm having some trouble understanding your reproduction steps.  This is inside an LTFS mount?

1. You read 1GB of /dev/zero in 1MB chunks and write it to foo;
2. ls -l foo reports the expected 1GB size;
3. reading foo with a 1MB request size returns 64k, and unexpected end of file.

I'm really curious how this change breaks LTFS, given a 4k iosize "works."

Thanks!
Comment 11 Kenneth D. Merry freebsd_committer 2019-02-20 14:25:35 UTC
Yes, this is inside an LTFS mount.  Reading a just-written file does result in only 64K getting read and an unexpected EOF.  Here is a dd read of the same file I created in the previous step.

The only difference between this and the previous version is that I built the fuse.ko module with change 337165 reverted:

{sm4u-12:/root:!:0} cd /mnt
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
{sm4u-12:/mnt:!:0} dd if=foo of=/dev/null bs=1m
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 11.281752 secs (95175096 bytes/sec)

So, normal result, no problems.  What this tells me is that somehow, changing f_iosize from 4K to DFLTPHYS (which is set to 512K on this particular system) messes up reads (but not writes) for LTFS.

In looking at the LTFS FUSE read code, I don't see any hardcoding of the I/O size:

https://github.com/LinearTapeFileSystem/ltfs/blob/master/src/ltfs_fuse.c

So, perhaps there is a place in the FUSE libraries that is hard-coding the size to 4K?
Comment 12 Conrad Meyer freebsd_committer 2019-02-20 23:10:24 UTC
(In reply to Kenneth D. Merry from comment #11)
I think you're on to something.  Are you on a recent head, or a stable branch?

It looks like fuse_write_biobackend directly uses f_iosize:

        const int biosize = fuse_iosize(vp);

but fuse_read_biobackend clamps the buf block len to MAXBSIZE:

        const int biosize = fuse_iosize(vp);
...
        bcount = MIN(MAXBSIZE, biosize);

Which is defined as 64kB on CURRENT (i.e., the block size is not truncated on bio read when DFLTPHYS <= MAXBSIZE).

fuse's directio read path doesn't care about the freebsd block size or phys size and just issues maximal reads per the mount point.  Ditto io_strategy, directio & bio write.

So it's just the bio read path that is artificially truncating 512kB phys to 64kB MAXBSIZE.  I think maxbcachebuf must be >= 512k on your system, too, or else we'd trip this panic in getblk on write:

        if (size > maxbcachebuf)
                panic("getblk: size(%d) > maxbcachebuf(%d)\n", size,
                    maxbcachebuf);

(But perhaps the writes are hitting the directio write path that avoids the large getblk.)

Anyway, once the blocks are truncated, the LBNs used by the cached read path are nonsensical relative to what was written, and we end up discarding the last (DFLTPHYS - MAXBSIZE) bytes of every DFLTPHYS-sized block.

I'm not exactly sure why the bioread loop aborts after only a single truncated block.  (I would guess either getblk() returning NULL on 2nd block, or getblk marking the 2nd block !CACHEd and fuse_io_strategy() producing an error for some reason.)

I think we have a clear culprit here.  Please try replacing f_iosize = DFLTPHYS with f_iosize = MAXBSIZE or maxbcachebuf; or increasing MAXBSIZE to match DFLTPHYS.
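
To make the mismatch concrete, here is a small model of the arithmetic described above. It is not the kernel code: bio_read_len and bytes_dropped_per_block are hypothetical names, and the 512kB and 64kB figures are the DFLTPHYS and MAXBSIZE values mentioned in this thread.

```c
#include <stddef.h>

/*
 * Block length the cached read path would use: the mount's iosize
 * clamped to MAXBSIZE, mirroring bcount = MIN(MAXBSIZE, biosize).
 * (Hypothetical model, not kernel code.)
 */
size_t
bio_read_len(size_t iosize, size_t maxbsize)
{
	return (iosize < maxbsize ? iosize : maxbsize);
}

/*
 * Bytes of each iosize-sized block that a clamped read never covers.
 * Zero whenever iosize <= maxbsize.
 */
size_t
bytes_dropped_per_block(size_t iosize, size_t maxbsize)
{
	return (iosize - bio_read_len(iosize, maxbsize));
}
```

With iosize = 512kB and MAXBSIZE = 64kB the read length is 65536 bytes, matching the single 65536-byte transfer in comment #9, and 458752 bytes of each block are never covered. With both at 64kB nothing is dropped.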
Comment 13 commit-hook freebsd_committer 2019-02-21 02:42:57 UTC
A commit references this bug:

Author: cem
Date: Thu Feb 21 02:41:58 UTC 2019
New revision: 344407
URL: https://svnweb.freebsd.org/changeset/base/344407

Log:
  fuse: Fix a regression introduced in r337165

  On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to
  use a buf cache block size in excess of permitted size.  This did not affect
  most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB.
  The issue was discovered and reported using a custom kernel with a DFLTPHYS
  of 512kB.

  PR:		230260 (comment #9)
  Reported by:	ken@
  MFC after:	?/? weeks

Changes:
  head/sys/fs/fuse/fuse_vfsops.c
Comment 14 Kenneth D. Merry freebsd_committer 2019-02-21 14:57:44 UTC
(In reply to Conrad Meyer from comment #12)

Your change fixed the problem, thanks!

You are correct that MAXPHYS is larger than the default:

options         DFLTPHYS=(512*1024)
options         MAXPHYS=(1024*1056)

Could you merge this to stable/12?  Otherwise the regression will likely break LTFS for most people using it.

Since tape drives don't do tagged queueing, the common way to get better performance is to use a larger block size.  LTFS supports up to 1MB block sizes, and in order to read tapes from other systems and get better performance, we set MAXPHYS to over 1MB.  (So we can get 1MB I/O regardless of alignment.)  DFLTPHYS goes along with that.
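
The "regardless of alignment" point can be illustrated with page arithmetic. This is a sketch under stated assumptions: 4k pages, and a transfer limit that has to cover every page the buffer touches. pages_spanned is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Number of pages touched by a buffer of `len` bytes whose start is
 * `offset` bytes past some page boundary.  `page_size` must be nonzero.
 * (Hypothetical helper.)
 */
uint64_t
pages_spanned(uint64_t offset, uint64_t len, uint64_t page_size)
{
	uint64_t start = offset % page_size;

	return ((start + len + page_size - 1) / page_size);
}
```

Under those assumptions a page-aligned 1MB buffer touches 256 pages, while an unaligned one touches 257 (1052672 bytes of page span), which still fits within MAXPHYS=(1024*1056), i.e. 1081344 bytes.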
Comment 15 Conrad Meyer freebsd_committer 2019-02-21 16:44:22 UTC
(In reply to Kenneth D. Merry from comment #14)
I don't do stable/, but anyone is free to MFC it themselves.  It shouldn't conflict.

> Since tape drives don't do tagged queueing, the common way to get better
> performance is to use a larger block size.  LTFS supports up to 1MB block
> sizes, and in order to read tapes from other systems and get better
> performance, we set MAXPHYS to over 1MB.  (So we can get 1MB I/O regardless
> of alignment.)  DFLTPHYS goes along with that.

Yeah, that makes a lot of sense.

(I think it is probable that FUSE should move to the tunable maxbcachebuf instead of MAXBSIZE; MAXBSIZE is nearly orphaned in base, and can probably be removed.  But that is somewhat orthogonal.)

Thank you for reporting this and especially mentioning the non-default DFLTPHYS.  I did not realize it was a value people changed in their own kernels. :-)
Comment 16 Kenneth D. Merry freebsd_committer 2019-02-21 20:52:11 UTC
(In reply to Conrad Meyer from comment #15)

It would be nice if FUSE moved to the maxbcachebuf tunable at some point.

Thank you for fixing this quickly!

As for not putting things in stable, ever, that is unfortunate.

I try to merge my changes back to the active stable branches when it makes sense to do so.  Otherwise, end users won't get the benefit of the change until years down the road.  Or, someone else has to see and understand my changes well enough to merge them.