This is one of three issues we detected in FreeBSD FUSE while developing our distributed file system. All three issues can be replicated using this simple test script: https://robo.moosefs.com/support/fuse_helloworld.tgz Performance issue in FUSE: if a program uses FUSE without the "direct" option, all I/O is performed in 4k blocks. The maximum I/O speed we managed to get was 600MB/s (no physical I/O, just sending zeros from a RAM buffer). With "direct" it is fast, 5GB/s, but "direct" is not the best solution: there is no cache, the read operation has no limit on block size, and with an extremely big block size the read speed drops drastically again (we performed dd with bs=1G and the speed was only 40MB/s). Generally, "direct" is geared toward stream-like data (character devices) and should not be used for disk-like I/O. Other FUSE implementations (Linux, MacOS) use 64k blocks. Best regards, Peter / MooseFS Team
Hm, my reading of fuse_read_biobackend suggests we should be bound by MAXBSIZE, which is 64k on head. Is it smaller on 11? Nope, it's been 64k since ~2000. I wonder where the 4k is coming from.
libfuse itself has a crappy splice(2) reimplementation that hardcodes a 4k buffer size. I'm not sure that's being used here, but it's a possibility.
You might try bumping the size of 'buf' in fuse_buf_fd_to_fd in libfuse and seeing if that changes the size of IO submitted to your filesystem.
Doh, I totally missed some other hardcoded 4096 in the FUSE kernel support code. I see two instances:
1. fuse_internal_init_callback: if the fuse_init_out version is below 7.5, we explicitly limit max_write to 4k. Otherwise we take max_write from the fiio.
2. FUSE_DEFAULT_BLOCKSIZE, FUSE_DEFAULT_IOSIZE are both 4k. Supposedly both can be overridden per mount, but given they are macro defines, I don't see how that is possible. DEFAULT_IOSIZE is used to limit the size of FUSE_READDIR requests. It is also a factor in limiting max_readahead (16x). DEFAULT_BLOCKSIZE is seemingly unused, except to fake FUSE_STATFS responses (f_bsize) if the user filesystem seems dead.
I don't see either plumbed into ordinary read/write IO, though.
PAGE_SIZE (also 4k) shows up a few times as well:
1. Internal caching of fuse file (vnode) attributes hardcodes va_blocksize as PAGE_SIZE. I don't know the ramifications of this, if any.
2. Something around FUSE_READLINK operations restricts length to PAGE_SIZE. This doesn't seem like a problem.
3. fuse_vfsop_mount (VFS = operations on a filesystem, rather than on a specific file (vnode)) hardcodes mnt_stat.f_iosize as PAGE_SIZE. This may matter. It is the output of the fuse_iosize() function.
4. fuse_vnop_getpages walks through pages as part of the mmap interface. I don't think this is a problem. Ditto fuse_vnop_putpages.
So yeah, (3) looks like it. fuse_iosize() is used as a direct limit in fuse_{read,write}_biobackend as well as fuse_io_strategy, and f_iosize is widely used throughout the generic kernel code. (It's also worth exploring bumping up the max READDIR size. That's probably not the problem you reported, but 4k is still anemic there.)
If you're able, please try this patch and report if the performance is improved:
--- a/sys/fs/fuse/fuse_vfsops.c
+++ b/sys/fs/fuse/fuse_vfsops.c
@@ -341,7 +341,7 @@ fuse_vfsop_mount(struct mount *mp)
 	mp->mnt_kern_flag |= MNTK_USES_BCACHE;
 	MNT_IUNLOCK(mp);
 	/* We need this here as this slot is used by getnewvnode() */
-	mp->mnt_stat.f_iosize = PAGE_SIZE;
+	mp->mnt_stat.f_iosize = DFLTPHYS;
 	if (subtype) {
 		strlcat(mp->mnt_stat.f_fstypename, ".", MFSNAMELEN);
 		strlcat(mp->mnt_stat.f_fstypename, subtype, MFSNAMELEN);
It works!
Read in classic (cached) mode:
256+0 records in
256+0 records out
268435456 bytes transferred in 0.099339 secs (2702208293 bytes/sec)
bsize: 65536 ; count: 4096
Read in direct (not cached and usually not recommended) mode:
256+0 records in
256+0 records out
268435456 bytes transferred in 0.024716 secs (10860808757 bytes/sec)
bsize: 1048576 ; count: 257
Performance is significantly improved. Now the block size is 64k - perfect. We hope that this patch will find its way into a release soon :)
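The cached-mode figures above are consistent with simple division, assuming "count" is the number of read requests the test filesystem received and each request carries exactly one block. A minimal sketch of that arithmetic follows; request_count is a hypothetical helper written for illustration, not a function from the FUSE kernel code or the test script.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the FUSE sources): number of
 * requests needed to move `total` bytes when each request carries
 * at most `iosize` bytes. A final partial block counts as one request.
 */
size_t
request_count(size_t total, size_t iosize)
{
	assert(iosize > 0);
	return (total / iosize + (total % iosize != 0));
}
```

Under this model, request_count(268435456, 65536) is 4096, matching the reported count, while the same transfer at the old 4k size would need 65536 requests, each one a kernel to userspace round trip.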
A commit references this bug:
Author: cem
Date: Thu Aug 2 19:25:43 UTC 2018
New revision: 337165
URL: https://svnweb.freebsd.org/changeset/base/337165
Log:
FUSE: Bump maximum IO size to enable more performant operation
Various components restrict size of IO passed up to the userspace filesystem based on the mount's f_iosize value. The previous default of PAGE_SIZE is anemic, even for normal filesystems, but especially considering every FUSE operation involves a kernel <-> userspace IPC upcall. Bump to DFLTPHYS (currently 64kB) to match other FUSE implementations.
Anecdotally, Jakub reports IO read performance increased from 600 MB/s -> 2700 MB/s with a basic RAM-backed FUSE filesystem.
PR: 230260
Reported by: Peter (MooseFS) <freebsd AT moosefs.com>
Tested by: Jakub Kruszona-Zawadzki <acid AT moosefs.com>
MFC after: 3 days
Changes:
head/sys/fs/fuse/fuse_vfsops.c
This fix should land in 12.0. I don't work on stable branches like 11.x myself, but maybe you can find someone who is interested in stable/11 to backport the patch, if you need it in 11.x.
This breaks LTFS (https://github.com/LinearTapeFileSystem/ltfs) at least. For example:
{sm4u-12:/mnt:!:0} dd if=/dev/zero of=foo bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 5.713798 secs (187920873 bytes/sec)
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
{sm4u-12:/mnt:!:0} dd if=foo of=/dev/null bs=1m
0+1 records in
0+1 records out
65536 bytes transferred in 0.000146 secs (449912127 bytes/sec)
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
Reverting the change back to 4K makes reads work normally again. I wonder how many other filesystems were broken by this change?
(In reply to Kenneth D. Merry from comment #9) Hi Ken, I'm having some trouble understanding your reproduction steps. This is inside an LTFS mount? 1. You read 1GB of /dev/zero in 1MB chunks and write it to foo; 2. ls -l foo reports the expected 1GB size; 3. reading foo with a 1MB request size returns 64k, and unexpected end of file. I'm really curious how this change breaks LTFS, given a 4k iosize "works." Thanks!
Yes, this is inside an LTFS mount. Reading a just-written file does result in only 64K getting read and an unexpected EOF. Here is a dd read of the same file I created in the previous step. The only difference between this and the previous version is that I built the fuse.ko module with change 337165 reverted:
{sm4u-12:/root:!:0} cd /mnt
{sm4u-12:/mnt:!:0} ls -la
total 1048578
drwxrwxrwx   2 root  wheel           0 Feb 19 22:20 .
drwxr-xr-x  30 root  wheel          35 Jan 30 18:45 ..
-rwxrwxrwx   1 root  wheel  1073741824 Feb 19 22:20 foo
{sm4u-12:/mnt:!:0} dd if=foo of=/dev/null bs=1m
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 11.281752 secs (95175096 bytes/sec)
So, normal result, no problems. What this tells me is that somehow, changing f_iosize from 4K to DFLTPHYS (which is set to 512K on this particular system) messes up reads (but not writes) for LTFS. In looking at the LTFS FUSE read code, I don't see any hardcoding of the I/O size: https://github.com/LinearTapeFileSystem/ltfs/blob/master/src/ltfs_fuse.c So, perhaps there is a place in the FUSE libraries that is hard-coding the size to 4K?
(In reply to Kenneth D. Merry from comment #11) I think you're on to something. Are you on a recent head, or a stable branch? It looks like fuse_write_biobackend directly uses f_iosize:
	const int biosize = fuse_iosize(vp);
but fuse_read_biobackend clamps the buf block len to MAXBSIZE:
	const int biosize = fuse_iosize(vp);
	...
	bcount = MIN(MAXBSIZE, biosize);
MAXBSIZE is defined as 64kB on CURRENT (i.e., the block size is not truncated on bio read when DFLTPHYS <= MAXBSIZE). fuse's directio read path doesn't care about the freebsd block size or phys size and just issues maximal reads per the mount point. Ditto io_strategy, directio & bio write. So it's just the bio read path that is artificially truncating 512kB phys to 64kB MAXBSIZE.
I think maxbcachebuf must be >= 512k on your system, too, or else we'd trip this panic in getblk on write:
	if (size > maxbcachebuf)
		panic("getblk: size(%d) > maxbcachebuf(%d)\n", size,
		    maxbcachebuf);
(But perhaps the writes are hitting the directio write path that avoids the large getblk.)
Anyway, once the blocks are truncated, the LBNs used by the cached read path are nonsensical relative to what was written, and we end up discarding the last (DFLTPHYS - MAXBSIZE) bytes of every DFLTPHYS-sized block. I'm not exactly sure why the bioread loop aborts after only a single truncated block. (I would guess either getblk() returning NULL on 2nd block, or getblk marking the 2nd block !CACHEd and fuse_io_strategy() producing an error for some reason.)
I think we have a clear culprit here. Please try replacing f_iosize = DFLTPHYS with f_iosize = MAXBSIZE or maxbcachebuf; or increasing MAXBSIZE to match DFLTPHYS.
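The mismatch described in this comment can be modelled in a few lines. This is an illustrative sketch only: struct bio_read_plan and plan_bio_read are hypothetical names, and the model assumes the logical block number is derived from the full I/O size while the buffer length is clamped, as the comment describes. It is not the kernel's actual read loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one cached-read step; not kernel code. */
struct bio_read_plan {
	size_t lbn;     /* logical block number, derived from biosize */
	size_t bcount;  /* bytes actually requested for that block */
	size_t dropped; /* bytes of the block the clamped buffer cannot hold */
};

struct bio_read_plan
plan_bio_read(size_t offset, size_t biosize, size_t maxbsize)
{
	struct bio_read_plan p;

	assert(biosize > 0);
	p.lbn = offset / biosize;
	/* Same clamp as bcount = MIN(MAXBSIZE, biosize) quoted above. */
	p.bcount = biosize < maxbsize ? biosize : maxbsize;
	p.dropped = biosize - p.bcount;
	return (p);
}
```

With biosize = 512 kB and maxbsize = 64 kB, the first block yields bcount = 65536 and dropped = 458752, which lines up with the 65536 bytes dd reported before hitting the unexpected EOF. When both sizes are 64 kB, nothing is dropped.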
A commit references this bug:
Author: cem
Date: Thu Feb 21 02:41:58 UTC 2019
New revision: 344407
URL: https://svnweb.freebsd.org/changeset/base/344407
Log:
fuse: Fix a regression introduced in r337165
On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to use a buf cache block size in excess of permitted size. This did not affect most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB. The issue was discovered and reported using a custom kernel with a DFLTPHYS of 512kB.
PR: 230260 (comment #9)
Reported by: ken@
MFC after: ?/? weeks
Changes:
head/sys/fs/fuse/fuse_vfsops.c
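The commit notification does not include the diff, so the following is only a sketch of the constraint the log describes: the I/O size advertised for the mount must not exceed the largest block the buffer cache accepts. clamp_iosize and its parameter names are hypothetical and are not taken from r344407.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: pick an I/O size no larger than the biggest
 * buffer cache block. `dfltphys` and `max_buf_block` stand in for the
 * kernel's DFLTPHYS and MAXBSIZE (or maxbcachebuf) values.
 */
size_t
clamp_iosize(size_t dfltphys, size_t max_buf_block)
{
	assert(dfltphys > 0 && max_buf_block > 0);
	return (dfltphys < max_buf_block ? dfltphys : max_buf_block);
}
```

On a default configuration both inputs are 64 kB and the result is unchanged; on the reporter's kernel, clamp_iosize(512 * 1024, 64 * 1024) gives 65536, so the read and write paths would agree on the block size.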
(In reply to Conrad Meyer from comment #12) Your change fixed the problem, thanks! You are correct that MAXPHYS is larger than the default:
options DFLTPHYS=(512*1024)
options MAXPHYS=(1024*1056)
Could you merge this to stable/12? The regression will likely break LTFS for most people using it. Since tape drives don't do tagged queueing, the common way to get better performance is to use a larger block size. LTFS supports up to 1MB block sizes, and in order to read tapes from other systems and get better performance, we set MAXPHYS to over 1MB. (So we can get 1MB I/O regardless of alignment.) DFLTPHYS goes along with that.
(In reply to Kenneth D. Merry from comment #14) I don't do stable/, but anyone is free to MFC it themselves. It shouldn't conflict.
> Since tape drives don't do tagged queueing, the common way to get better
> performance is to use a larger block size. LTFS supports up to 1MB block
> sizes, and in order to read tapes from other systems and get better
> performance, we set MAXPHYS to over 1MB. (So we can get 1MB I/O regardless
> of alignment.) DFLTPHYS goes along with that.
Yeah, that makes a lot of sense. (I think it is probable that FUSE should move to the tunable maxbcachebuf instead of MAXBSIZE; MAXBSIZE is nearly orphaned in base, and can probably be removed. But that is somewhat orthogonal.) Thank you for reporting this and especially mentioning the non-default DFLTPHYS. I did not realize it was a value people changed in their own kernels. :-)
(In reply to Conrad Meyer from comment #15) It would be nice if FUSE moved to the maxbcachebuf tunable at some point. Thank you for fixing this quickly! As for not putting things in stable, ever, that is unfortunate. I try to merge my changes back to the active stable branches when it makes sense to do so. Otherwise, end users won't get the benefit of the change until years down the road. Or, someone else has to see and understand my changes well enough to merge them.
A commit references this bug:
Author: asomers
Date: Fri Sep 6 17:56:27 UTC 2019
New revision: 351943
URL: https://svnweb.freebsd.org/changeset/base/351943
Log:
MFC r344183-r344187, r344333-r344334, r344407, r344857, r344865 (by cem)
r344183: FUSE: Respect userspace FS "do-not-cache" of file attributes
The FUSE protocol demands that kernel implementations cache user filesystem file attributes (vattr data) for a maximum period of time in the range of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This works fine for local-only filesystems, but causes consistency issues with multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata. Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds are still cached indefinitely, because it is unclear how a userspace filesystem could do anything sensible with those semantics even if implemented.
In the future, as an optimization, we should implement notify_inval_entry, etc, which provide userspace filesystems a way of evicting the kernel cache.
One potentially bogus access to invalid cached attribute data was left in fuse_io_strategy. It is restricted behind the undocumented and non-default "vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are deadcode and can be eliminated?
Some minor APIs changed to facilitate this:
1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").
2. cache_attrs() respects the provided TTL and only caches in the FUSE inode if TTL > 0. It also grows an "out" argument, which, if non-NULL, stores the translated fuse_attr (even if not suitable for caching).
3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help avoid programming mistakes.
4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE link op was weakened (only performed when we have a valid attr cache). The check is racy in a multi-writer network filesystem anyway -- classic TOCTOU. We have to trust any userspace filesystem that rejects local caching to account for it correctly.
PR: 230258 (inspired by; does not fix)
r344184: FUSE: Respect userspace FS "do-not-cache" of path components
The FUSE protocol demands that kernel implementations cache user filesystem path components (lookup/cnp data) for a maximum period of time in the range of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This works fine for local-only filesystems, but causes consistency issues with multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata. Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds are still cached indefinitely, because it is unclear how a userspace filesystem could do anything sensible with those semantics even if implemented.
Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup if the user filesystem did not set a zero second TTL.
PR: 230258 (inspired by; does not fix)
r344185: FUSE: Only "dirty" cached file size when data is dirty
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update the buf cache bounds as a result of either a read from the underlying FUSE filesystem, or as part of a write-through type operation (like truncate => VOP_SETATTR). In these cases, do not set the FN_SIZECHANGE flag, which indicates that an inode's data is dirty (in particular, that the local buf cache and fvdat->filesize have dirty extended data).
PR: 230258 (related)
r344186: FUSE: The FUSE design expects writethrough caching
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol specifies only clean data to be cached.
Prior to this change, we implement and default to writeback caching. This is ok enough for local only filesystems without hardlinks, but violates the general design contract with FUSE and breaks distributed filesystems or concurrent access to hardlinks of the same inode.
In this change, add cache mode as an extension of cache enable/disable. The new modes are UC (was: cache disabled), WT (default), and WB (was: cache enabled).
For now, WT caching is implemented as write-around, which meets the goal of only caching clean data. WT can be better than WA for workloads that frequently read data that was recently written, but WA is trivial to implement. Note that this has no effect on O_WRONLY-opened files, which were already coerced to write-around.
Refs:
* https://sourceforge.net/p/fuse/mailman/message/8902254/
* https://github.com/vgough/encfs/issues/315
PR: 230258 (inspired by)
r344187: FUSE: Refresh cached file size when it changes (lookup)
The cached fvdat->filesize is independent of the (mostly unused) cached_attrs, and we failed to update it when a cached (but perhaps inactive) vnode was found during VOP_LOOKUP to have a different size than cached.
As noted in the code comment, this can occur in distributed filesystems or with other kinds of irregular file behavior (anything is possible in FUSE).
We do something similar in fuse_vnop_getattr already.
PR: 230258 (as reported in description; other issues explored in comments are not all resolved)
Reported by: MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by: Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
r344333: fuse: add descriptions for remaining sysctls
(Except reclaim revoked; I don't know what the goal of that one is.)
r344334: Fuse: whitespace and style(9) cleanup
Take a pass through fixing some of the most egregious whitespace issues in fs/fuse. Also fix some style(9) warts while here. Not 100% cleaned up, but somewhat less painful to look at and edit. No functional change.
r344407: fuse: Fix a regression introduced in r337165
On systems with non-default DFLTPHYS and/or MAXBSIZE, FUSE would attempt to use a buf cache block size in excess of permitted size. This did not affect most configurations, since DFLTPHYS and MAXBSIZE both default to 64kB. The issue was discovered and reported using a custom kernel with a DFLTPHYS of 512kB.
PR: 230260 (comment #9)
Reported by: ken@
r344857: FUSE: Prevent trivial panic
When open(2) was invoked against a FUSE filesystem with an unexpected flags value (no O_RDONLY / O_RDWR / O_WRONLY), an assertion fired, causing panic.
For now, prevent the panic by rejecting such VOP_OPENs with EINVAL. This is not considered the correct long term fix, but does prevent an unprivileged denial-of-service.
PR: 236329
Reported by: asomers
Reviewed by: asomers
Sponsored by: Dell EMC Isilon
r344865: fuse: switch from DFLTPHYS/MAXBSIZE to maxcachebuf
On GENERIC kernels with empty loader.conf, there is no functional change. DFLTPHYS and MAXBSIZE are both 64kB at the moment. This change allows larger bufcache block sizes to be used when either MAXBSIZE (custom kernel) or the loader.conf tunable vfs.maxbcachebuf (GENERIC) is adjusted higher than the default.
Suggested by: ken@
Changes:
_U stable/12/
stable/12/sys/fs/fuse/fuse.h
stable/12/sys/fs/fuse/fuse_device.c
stable/12/sys/fs/fuse/fuse_file.c
stable/12/sys/fs/fuse/fuse_file.h
stable/12/sys/fs/fuse/fuse_internal.c
stable/12/sys/fs/fuse/fuse_internal.h
stable/12/sys/fs/fuse/fuse_io.c
stable/12/sys/fs/fuse/fuse_ipc.c
stable/12/sys/fs/fuse/fuse_ipc.h
stable/12/sys/fs/fuse/fuse_node.c
stable/12/sys/fs/fuse/fuse_node.h
stable/12/sys/fs/fuse/fuse_vfsops.c
stable/12/sys/fs/fuse/fuse_vnops.c
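The TTL rule stated in the r344183 and r344184 logs above (a zero TTL means do not cache, any non-zero TTL is still cached indefinitely) can be sketched as a small decision function. The enum, the function name, and the split of the TTL into seconds and nanoseconds are assumptions made for illustration; this is not the code from those revisions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical policy values; not identifiers from the FUSE sources. */
enum cache_policy {
	CACHE_NONE,       /* filesystem asked for no caching (TTL == 0) */
	CACHE_INDEFINITE  /* any non-zero TTL is kept until invalidated */
};

enum cache_policy
metadata_cache_policy(uint64_t ttl_sec, uint32_t ttl_nsec)
{
	/* Assumption: the nanosecond part is normalized below one second. */
	assert(ttl_nsec < 1000000000u);
	if (ttl_sec == 0 && ttl_nsec == 0)
		return (CACHE_NONE);
	return (CACHE_INDEFINITE);
}
```

Under this model a filesystem that needs multi-writer consistency would reply with a zero TTL and get CACHE_NONE, while a reply of 1 or 10 seconds is treated the same as "a long time".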