Bug 276299 - Write performance to NFS share is ~4x slower than on 13.2
Summary: Write performance to NFS share is ~4x slower than on 13.2
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-RELEASE
Hardware: arm64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2024-01-13 15:41 UTC by dmilith
Modified: 2024-01-15 20:31 UTC
CC: 4 users

See Also:


Attachments

Description dmilith 2024-01-13 15:41:02 UTC
I've upgraded my aarch64 box (a RockPro64 with 4 GB of RAM) which serves a ZFS pool over NFS.

Before the upgrade (on FreeBSD 13.2) my upload speed to the share was around 50-70 MiB/s. After the upgrade to FreeBSD 14.0, it struggles to reach even 10 MiB/s on writes.

I've tried switching from NFS v3 to v4. It's even worse (6 MiB/s write).

The disk is a WD Gold, 10 TiB, so I'm quite sure it's not about the disk speed or the network speed (my router has 1G ports).

My /etc/rc.conf:

# NFS
hostid_enable="YES"
nfscbd_enable="YES"
rpcbind_enable="YES"
nfs_server_enable="YES"
nfsv4_server_only="NO"
nfsv4_server_enable="YES"
nfsuserd_enable="YES"
mountd_enable="YES"
mountd_flags="-r"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"
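
For completeness: after editing rc.conf I restart the daemons (or just reboot) with something like:

service nfsd restart
service mountd restart
service nfsuserd restart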


I've bumped /etc/sysctl.conf settings to huge values:

net.inet.raw.maxdgram=262144
net.inet.raw.recvspace=1048576
net.inet.tcp.sendspace=1048576
vfs.nfsd.srvmaxio=1048576
vfs.nfsd.maxthreads=128
net.inet.tcp.rfc1323=1
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216

But this improved nothing. Maybe it's even worse…


My NFS client (macOS 13+) uses these options:

mount_nfs \
  -o \
  rw,vers=4,deadtimeout=0,readahead=6,noatime,sync,async,hard,bg,intr,inet,tcp,nfc,rsize=1048576,wsize=1048576,dsize=1048576 \
  vks4.home:/Copies/VMs \
  /Users/Shared/NFS/VMs


My sequential read from NFS is the same as before (~50-70 MiB/s), which is "okay" for that hardware. But what can I do to bring the write speed back to 50 MiB/s? Did I do something wrong?

Thanks
Comment 1 Mina Galić freebsd_triage 2024-01-13 17:46:01 UTC
cc'ing rmacklem@
Comment 2 Vladimir Druzenko freebsd_committer freebsd_triage 2024-01-13 18:04:50 UTC
Check on server:
sysctl vfs.nfsd.srvmaxio
sysctl vfs.nfsd.maxthreads

On clients readahead=16.

> sync,async
?
Comment 3 dmilith 2024-01-13 18:43:47 UTC
(In reply to Vladimir Druzenko from comment #2)

From the server:

vfs.nfsd.srvmaxio: 1048576
vfs.nfsd.maxthreads: 64

I've tried with 128 threads before, but that changed nothing. The server is… close to being idle during the transfer.

In btop I only see the "intr" and "nfsd: server" processes… but both use up to 15% CPU per process. Load is 0.7.

Dropped the "sync" option on clients, and added "readahead=16". That caused the write to speed up from 6MiB/s to 8MiB/s.

Should I consider adding "vfs.maxbcachebuf=1048576" to /boot/loader.conf? It's mentioned in the 14.0 release docs.
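
If it helps, I assume the loader.conf entry would simply be the following (loader tunables only take effect after a reboot):

# /boot/loader.conf
vfs.maxbcachebuf="1048576"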
Comment 4 Vladimir Druzenko freebsd_committer freebsd_triage 2024-01-13 18:53:11 UTC
Try on client:
dd if=/dev/zero of=/Users/Shared/NFS/VMs/ZERO bs=1M count=16384 status=progress
(don't know the correct command line for dd on macOS)
Comment 5 Rick Macklem freebsd_committer freebsd_triage 2024-01-13 19:08:39 UTC
There is very little difference in the NFS
server for 13.2 vs 14.0.
As such, the hit is most likely a network
fabric issue or a ZFS issue.

The only thing I can suggest to try is:
rsize=131072,wsize=131072

It should perform about as well as 1Mbyte, but???
If it does help a lot, there is something in the
network fabric (most likely the NIC/driver) that
cannot handle the burst of TCP segments well.
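
Something like this on the macOS side, i.e. your
mount command from the description with "sync"
dropped and the smaller rsize/wsize (untested by me):

mount_nfs \
  -o \
  rw,vers=4,deadtimeout=0,readahead=6,noatime,async,hard,bg,intr,inet,tcp,nfc,rsize=131072,wsize=131072,dsize=1048576 \
  vks4.home:/Copies/VMs \
  /Users/Shared/NFS/VMs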

Did you happen to have "sync=disabled" set on your
13.2 ZFS?  Setting this runs the risk of data loss
when the NFS server crashes/reboots, but will help
w.r.t. write performance.
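
If you want to check/try that, I believe the ZFS
commands are just the following (the pool/dataset
name is a placeholder, substitute your own):

zfs get sync <pool>/<dataset>
zfs set sync=disabled <pool>/<dataset>
zfs set sync=standard <pool>/<dataset>   # to put it back afterwards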
Comment 6 Rick Macklem freebsd_committer freebsd_triage 2024-01-13 19:24:40 UTC
Oh, and it might be worth capturing packets
while writes are slow and taking a look at
them in wireshark. (Unlike tcpdump, wireshark knows
NFS.)

Something like:
# tcpdump -s 0 -w out.pcap host <nfs-client-host>
on the NFS server and then look at out.pcap
in wireshark. (I just install wireshark on my
Windows laptop. No need to bother with X windows.)

You might see error replies for NFS RPCs or TCP
timeouts/retransmits that would explain the slowdown.
(Or TCP reconnects. I once saw a case where the
network switch would decide to inject an RST in the
TCP stream forcing the NFS client to create a new
connection. Why did it do this? No idea.)
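
A few wireshark display filters that can help narrow
it down (from memory, so double-check the field names):

nfs.status != 0                 # NFS replies with an error status
tcp.analysis.retransmission     # retransmitted segments
tcp.analysis.duplicate_ack      # duplicate ACKs
tcp.flags.reset == 1            # RSTs / forced reconnects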
Comment 7 dmilith 2024-01-13 19:46:24 UTC
(In reply to Vladimir Druzenko from comment #4)

dd if=/dev/zero of=/Users/Shared/NFS/VMs/ZERO bs=1M count=16384 status=progress
  179306496 bytes (179 MB, 171 MiB) transferred 15.001s, 12 MB/s

It starts at around 20 MiB/s, then slows down to ~11 MiB/s, and that's after some more tweaks on my side.
Comment 8 dmilith 2024-01-13 19:49:06 UTC
(In reply to Rick Macklem from comment #6)

It can't be a network thing. When I download VM images from that NFS, the transfer is stable around ~55MiB/s.

The issue is only when I upload/write to the share. Then it's ~5x slower.
Comment 9 dmilith 2024-01-13 20:18:00 UTC
(In reply to Rick Macklem from comment #6)

I double-checked with Wireshark; there are some "TCP Dup ACK" events during the upload process. An example line:

802281	62.076919	192.168.0.34	192.168.0.60	TCP	78	[TCP Dup ACK 802084#97] 2049 → 54276 [ACK] Seq=442949 Ack=757712677 Win=28968 Len=0 TSval=1156306898 TSecr=465886986 SLE=757724261 SRE=757864717
Comment 10 Rick Macklem freebsd_committer freebsd_triage 2024-01-14 03:30:30 UTC
By network fabric I mean everything
from the TCP stack down, at both ends.

A problem can easily manifest itself as
only a problem during writing. Writing to
an NFS server is very different traffic than
reading from an NFS server.
I am not saying that it is a network fabric
problem, just that good read performance does
not imply it is not a network fabric problem.

I once saw a case where everything worked fine
over NFS (where I worked as a sysadmin) until
one specific NFS RPC was done. That NFS RPC
(and only that NFS RPC) would fail.
It turned out to be a hardware bug in a
network switch. Move the machine to a port
on another switch and the problem went away.
Move it onto the problem switch and the issue
showed up again. There were no detectable other
problems with this switch and the manufacturer
returned it after a maintenance cycle claiming
it was fixed. It still had the problem, so it
went in the trash. (It probably had a memory
problem that flipped a bit for this specific case
or some such.)

Two examples of how a network problem might affect
NFS write performance, but not read performance.
Write requests are the only large RPC messages
sent from client->server. With a 1 Mbyte write size,
each write results in about 700 1500-byte TCP segments
(1048576 bytes / ~1448 bytes of TCP payload per
ordinary 1500-byte ethernet frame ≈ 724 segments).
-> If the burst of 700 packets causes one to be dropped
   on the server (receive) end sometimes...
   (Found by seeing an improvement with a smaller wsize.)
-> If the client/sender has a TSO bug (the most common problem
   is mishandling a TSO segment that is slightly less than 64Kbytes).
   (Found by disabling TSO in the client; see the sketch after
    this list. Disabling TSO also
    changes the timing of the TCP segments and this can sometimes
    avoid bugs.)
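
For the TSO experiment, on a FreeBSD box it would be
something like this ("dwc0" is only a guess at the
interface name, use whatever ifconfig shows; on the
macOS client I believe there is a net.inet.tcp.tso
sysctl, but treat that as a guess):

ifconfig dwc0 | grep -i options   # check whether TSO/LRO are enabled
ifconfig dwc0 -tso -lro           # turn them off while testing
# to persist it: ifconfig_dwc0="DHCP -tso -lro" in rc.conf
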
Have you tried a smaller rsize/wsize yet, as I suggested?

NFS traffic is also very different than typical
TCP traffic. For example, both 13.0 and 13.1 shipped
with bugs in the TCP stack that affected the NFS
server (intermittent hangs in these cases).

If it isn't a network fabric problem it is probably
something related to ZFS. I know nothing about ZFS,
so I can't even suggest anything beyond "sync=disabled".

Since an NFS server uses both storage (hardware + ZFS)
and networking, any breakage anywhere in these can
cause a big performance hit.
NFS itself just translates between the NFS RPC message
and VFS/VOP calls. It is conceivable that some change
in the NFS server is causing this, but these changes
are few and others have not reported similar write
performance problems for 14.0, so it seems unlikely.
Comment 11 dmilith 2024-01-14 10:28:07 UTC
(In reply to Rick Macklem from comment #10)

Yes, I've tried both sync=disabled (it changed nothing) and a smaller r/wsize (~256K offers the best throughput in my tests).
After a Saturday of hacking, I've managed to reach ~20 MiB/s write and 43-50 MiB/s read. It's not terrible, but I will try some more tricks and will let you know if I achieve anything.

The router issue makes more sense the more I think about it. I got mine from my ISP and indeed I sometimes have weird network problems, so maybe that's related. I will also take a closer look at what I can do to improve this.

Thanks for your ideas :) Much appreciated.
Comment 12 Rick Macklem freebsd_committer freebsd_triage 2024-01-14 14:30:44 UTC
If you are playing with network related stuff,
here's a bit more (no pun intended;-).

Look at any stats generated by both server
and client NIC drivers for errors, etc.
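For example (the driver/interface name is a guess):

netstat -i           # look for non-zero Ierrs/Oerrs
netstat -s -p tcp    # retransmits, out-of-order segments, etc.
sysctl dev.dwc.0     # per-device counters, if the driver exports any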

If you have a different NIC lying about, particularly
if it has a different chipset in it, try it.

Look for a tunable in the NIC driver that adjusts
interrupt moderation. Interrupt moderation is good
for streaming traffic, not so good for NFS.  Once
an NFS client sends an RPC message, it waits for a
response. Any delay in the reply slows it down and
interrupt moderation can delay the interrupt and,
therefore, the RPC reply.

And don't forget simple stuff like cables. They can
get damaged at any time.

Good luck with it, rick
Comment 13 Rick Macklem freebsd_committer freebsd_triage 2024-01-14 15:31:23 UTC
I just played around on my old dell laptop
(which is running something close to 14.0).

I mounted it locally (so it is using lo0) and
I see a reasonable write rate when I do:
# dd if=/dev/zero of=/mnt/xxxx bs=1M count=1000
(about 200Mbytes/sec)

but if I do:
# dd if=/tmp/somefile of=/mnt/xxxx bs=1M
I see much slower writing (about 30Mbytes/sec).
I'll try UFS and see if I see the slow writing there as well.

I am wondering if ZFS has changed the way it
does compression? (I know so little about ZFS,
I don't even know how to turn compression on/off
on ZFS.)
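
In case you want to check on your end, the zfsprops
man page suggests it is just (dataset name is a
placeholder):

zfs get compression <pool>/<dataset>
zfs set compression=off <pool>/<dataset>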

Btw, you could try using /dev/zero for input.
You could also try doing a local mount on the
NFS server (which gets the network out of the
picture and only uses lo0).
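
Something like this, run on the server itself (the
export path is copied from your client mount; adjust
it if your NFSv4 root makes the path different):

mount -t nfs -o nfsv4,tcp 127.0.0.1:/Copies/VMs /mnt
dd if=/dev/zero of=/mnt/ZERO bs=1M count=4096
umount /mnt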