Bug 277197 - NFS is much too slow at 10GbaseT
Summary: NFS is much too slow at 10GbaseT
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2024-02-20 16:34 UTC by Hannes Hauswedell
Modified: 2024-04-20 10:54 UTC
CC List: 5 users

See Also:


Attachments

Description Hannes Hauswedell 2024-02-20 16:34:45 UTC
I have a FreeBSD 14 server and client. Both have Intel X540 10GBase-T adapters and are connected via CAT7 cabling and a Netgear switch with 10GBase-T ports.

Via iperf3, I measure 1233 MB/s (9.87 Gbit/s) throughput.
Via nc, I measure 1160 MiB/s throughput.
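
For reference, this is roughly how I measured the raw TCP throughput (the server address is a placeholder and the exact flags are from memory):
```
# iperf3: plain TCP throughput, client pushing to the server
server# iperf3 -s
client# iperf3 -c 192.168.3.1 -t 30

# nc: stream zeros over a raw TCP connection; dd reports the rate when it finishes
server# nc -l 12345 > /dev/null
client# dd if=/dev/zero bs=1M count=16384 | nc 192.168.3.1 12345
```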

Via NFS, I get around 190-250 MiB/s. I did not expect to get the full 1100 MiB/s with NFS, but I did hope for at least 600-800 MiB/s.

Various guides suggest tinkering with TCP-related sysctls, but I haven't had any luck improving performance that way. And since nc also manages to push over 1 GB/s over TCP, this doesn't seem to be the core of the problem.

I have replaced the base system's ix driver with the one from ports, but there was no change. Again, I don't think the driver or the network stack have an issue per se; it seems to be NFS-related.

I have used the default options for the mounts. This is what nfsstat shows for the NFSv3 mount:
```
nfsv3,tcp,resvport,nconnect=1,hard,cto,lockd,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2
```

and for the NFSv4 mount:
```
nfsv4,minorversion=2,tcp,resvport,nconnect=1,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647
```
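
For completeness, the mounts themselves were done with the default options, roughly like this (hostname and export path are placeholders):
```
# NFSv3 mount with default options
mount -t nfs -o vers=3 server:/export /mnt/nfs3

# NFSv4.2 mount with default options
mount -t nfs -o vers=4,minorversion=2 server:/export /mnt/nfs4

# show the effective options (the output quoted above)
nfsstat -m
```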

Am I missing something? Is this a bug or a configuration problem?

I will try to set up a Linux NFS client to see whether the issue is client- or server-related.

Thanks for your help!

P.S.: The server has an NVMe raidz pool and can sustain throughput above 900 MiB/s while reading and writing hundreds of gigabytes from/to different datasets of the pool, even with encryption and compression enabled. So I don't think the disks are the limiting factor.
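
For reference, the local numbers above come from tests on the server roughly like this (paths are placeholders; because compression is enabled, I used incompressible data rather than /dev/zero for the real runs):
```
# sequential write test against the pool, bypassing NFS entirely
dd if=/dev/zero of=/pool/dataset/testfile bs=1M count=102400 status=progress

# sequential read test of a large existing file
dd if=/pool/dataset/bigfile of=/dev/null bs=1M status=progress
```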
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2024-02-21 14:02:10 UTC
You could try these mount options:
nconnect=4 (or 8) on the NFSv4 mount only
  (doesn't work for NFSv3)
readahead=4 (or 8)

You can also try bumping up the rsize/wsize.
For the server, set
nfs_server_maxio=1048576
in its /etc/rc.conf.

For the client, set
vfs.maxbcachebuf=1048576
in its /boot/loader.conf.

A mount done after these changes should
default to rsize=1048576,wsize=1048576
(you can then try 256K using the rsize and wsize
 mount options).
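
Putting that together, roughly (an untested sketch; hostname and export path are placeholders):
```
# server: add to /etc/rc.conf, then restart nfsd
nfs_server_maxio="1048576"

# client: add to /boot/loader.conf, then reboot
vfs.maxbcachebuf=1048576

# client: remount (nconnect only applies to NFSv4)
mount -t nfs -o vers=4,minorversion=2,nconnect=8,readahead=8 server:/export /mnt
```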
Comment 2 Hannes Hauswedell 2024-02-21 20:16:21 UTC
Thanks for the reply!

With rsize=262144,wsize=262144 I get 293 MiB/s.
With rsize=1048576,wsize=1048576 I get 395 MiB/s. This is already an improvement, but still not where I would like it to be.

nconnect and readahead don't seem to make much of a difference, and neither does NFSv3 vs. NFSv4.
 
Interestingly, if I boot Linux on the client and perform a regular NFS mount with no options supplied, I get almost 600 MiB/s. This was measured before I even changed the server setting. It seems to indicate that the problem is client-side.

These are the options that Linux reports as used:
```
rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1.3.1.2,mountvers=3,mountport=820,mountproto=udp,local_lock=none,addr=1.3.1.2
```

So the Linux client gets much higher throughput even with lower rsize and wsize. I will try Linux with higher rsize and wsize next.
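
Roughly what I plan to run on the Linux side (hostname and export path are placeholders; the effective rsize/wsize will be capped by whatever the server allows):
```
# Linux client: request 1 MiB transfers explicitly
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576 server:/export /mnt

# check what was actually negotiated
grep ' nfs ' /proc/mounts
```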
Comment 3 Hannes Hauswedell 2024-02-23 09:39:29 UTC
I can confirm that rsize and wsize do not strongly influence the performance of the Linux NFSv3 client; with values between 128 KiB and 1 MiB, I always get 550-600 MiB/s.

The Linux client is, however, strongly influenced by the device MTU. I am currently operating both server and client at an MTU of 9000. If I lower the MTU to 1500, the Linux client performs *very similarly* to the FreeBSD client (at any MTU), delivering 200-250 MiB/s.

I find it strange that the FreeBSD NFS client does not benefit from an MTU higher than 1500. Could this be a hint at what's going wrong?
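
For reference, the MTU was switched like this on both ends between runs (interface names are from my setup, and the switch ports need to allow jumbo frames as well):
```
# FreeBSD (server and client)
ifconfig ix0 mtu 9000

# Linux client (interface name is a placeholder)
ip link set dev enp1s0 mtu 9000
```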
Comment 4 Rick Macklem freebsd_committer freebsd_triage 2024-02-23 14:11:24 UTC
I'd recommend that you start a discussion
on a mailing list (freebsd-current@ or
freebsd-stable@), since others may have
insight related to tuning and the Intel NIC/driver.

On the mailing list, I suggest you ask others what
performance they get when using other non-Intel NICs.

One well known problem (that may never be fixed) is
that use of 9K jumbo mbufs can cause fragmentation
of the mbuf pool. (A NIC driver does not need to use
9K jumbo mbufs for 9K mtu packets, but some do.)
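
A rough way to check whether the driver is using 9k jumbo clusters and whether the pool is under pressure (zone names may differ slightly between releases):
```
# overall mbuf usage, including denied/delayed allocations
netstat -m

# per-zone view; look at the mbuf_jumbo_9k zone in particular
vmstat -z | egrep 'mbuf|jumbo'
```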
Comment 5 Hannes Hauswedell 2024-02-24 13:51:54 UTC
Thanks for your reply! I will ask on the mailing lists, although I still consider the current behaviour a bug: by default, a crucial networking component in the base system achieves less than 20% of the theoretical throughput and only about a third of what the equivalent component on Linux delivers by default.
Comment 6 Peter Eriksson 2024-03-21 22:37:56 UTC
(In reply to Rick Macklem from comment #1)

A quick test between two FreeBSD 13.2 servers with 10G Intel X710 Ethernet (writing to a ZFS zpool with spinning disks, log on SSD, cache on SSD):

# mount -t nfs -o sec=sys,vers=4 dedur01:/test /mnt
# cd /mnt

# dd if=/dev/zero of=16G bs=1M count=16384 status=progress
  16982736896 bytes (17 GB, 16 GiB) transferred 61.065s, 278 MB/s   

# dd of=/dev/null if=16G bs=1M count=16384 status=progress
  17151557632 bytes (17 GB, 16 GiB) transferred 49.015s, 350 MB/s   
 

# cd /
# umount /mnt

# mount -t nfs -o sec=sys,vers=4,readahead=8 dedur01:/test /mnt
# cd /mnt

# dd of=/dev/null if=16G bs=1M count=16384 status=progress
  16490954752 bytes (16 GB, 15 GiB) transferred 16.019s, 1029 MB/s  

# nfsstat -m
dedur01:/test on /mnt
nfsv4,minorversion=2,tcp,resvport,nconnect=1,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=16777216,timeout=120,retrans=2147483647

# ifconfig ixl0
ixl0: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=4e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether 3c:fd:fe:24:e7:e0
	media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>


If I redo the read test without unmounting and remounting, I get some silly number like 4 GB/s or so (cached locally in RAM, I suppose :-)
Comment 7 Rick Macklem freebsd_committer freebsd_triage 2024-03-21 23:31:57 UTC
(In reply to Peter Eriksson from comment #6)
Yes. If you do not umount/mount, the file (or at least part of it) will be in the client's buffer cache (kernel RAM).

Does anyone know what differs between the X540 and X710?
Comment 8 Peter Eriksson 2024-03-22 09:35:18 UTC
(In reply to Rick Macklem from comment #7)

The X540 is a much older network card, roughly five years older than the X710. From the manual pages:


X540
Bus:      PCIe 2.1 x8
Chipset:  Intel 82598EB
Driver:   ixgbe
Max MTU:  16144
Features: Jumbo Frames, MSIX, TSO & RSS

X710
Bus:      PCIe 3.0 x8
Chipset:  Intel 700-series
Driver:   ixl
Max MTU:  9706
Features: Jumbo Frames, TX/RX checksum offload,
          TSO (TCP Segmentation offload),
          LRO (Large Receive Offload),
          VLAN tag insertion/extraction, VLAN checksum offload,
          VLAN TSO, RSS (Receive Side Scaling)

I don't have any X540/ixgbe boards here, so I unfortunately can't test that combo, but looking at the feature sets it seems the X540 lacks the LRO feature...
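
If you want to rule LRO in or out on your side, it can be toggled at runtime and the read test re-run (a quick check, not conclusive):
```
# disable LRO on the client interface, redo the dd read test, then re-enable it
ifconfig ix0 -lro
ifconfig ix0 lro
```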
Comment 9 Hannes Hauswedell 2024-04-19 09:13:54 UTC
Thank you for the replies, and sorry for my late reply (I've been travelling a lot for work). I am still very interested in debugging this further!

@Peter Eriksson
It's encouraging to see higher speeds with FreeBSD on other hardware, at least.

> The X540 is a much older network card - around 5 years older than the X710. 
> I don't have any X540/ixgbe boards here so can't test that combo unfortunately, but looking at the feature sets it seems the X540 lacks the LRO feature...

Hm, but I do get much higher speeds on Linux, so how can this be a hardware issue? I am not sure whether LRO is the crucial factor here, but at least `ifconfig ix0` on the client does list LRO among its features:

# ifconfig -v ix0
ix0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 9000
        options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 98:b7:85:1f:2e:72
        inet 192.168.3.80 netmask 0xffffff00 broadcast 192.168.3.255
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Comment 10 Hannes Hauswedell 2024-04-19 09:51:28 UTC
Ok, wow. I have just moved the client to a different room/RJ45 wall socket, and now I am getting NFS speeds of around 950 MiB/s with default settings and 1150 MiB/s with nconnect=8,readahead=8.

This would suggest some SNAFU with both CAT7 cables going to the other outlet. I am really puzzled why that would affect NFS on FreeBSD more strongly than NFS on Linux or nc on FreeBSD, but that's what it looks like right now :o
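
For what it's worth, this is roughly how I would check whether the old cabling was producing link-level errors (the sysctl names may differ between driver versions):
```
# per-interface error counters
netstat -i -I ix0

# driver/MAC statistics from the ix driver, e.g. CRC errors
sysctl dev.ix.0 | grep -iE 'err|crc'
```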
Comment 11 Rick Macklem freebsd_committer freebsd_triage 2024-04-20 02:57:22 UTC
(In reply to Hannes Hauswedell from comment #10)
I assume we can close this PR then?
Comment 12 Hannes Hauswedell 2024-04-20 10:54:38 UTC
(In reply to Rick Macklem from comment #11)

Yes, I am closing it. I am still puzzled by some aspects, but I don't think anyone here can clear this up.

Thank you for your comments!