I am observing frequent NFS mount hangs after upgrading my NAS to 13-stable. The NAS runs FreeBSD and exports a ZFS pool via nfsd. A Gentoo Linux box is connected to this NAS via a 10GbE fiber link. Once in a while, perhaps when the ZFS load gets high (afpd is also running), NFS access from the Linux side hangs, then recovers after a few minutes. The following message is printed in Linux's dmesg:
May 30 22:04:34 mayhome kernel: nfs: server 192.168.3.51 not responding, still trying
But after a while, a few minutes or so, the access recovers:
May 30 22:06:35 mayhome kernel: nfs: server 192.168.3.51 OK
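The length of each stall can be read off the two client messages quoted above. The following is a minimal sketch, not part of the original report, that pairs each "not responding" line with the next "OK" line for the same server. The function name `stall_durations` and the `year` parameter are illustrative assumptions (syslog timestamps carry no year).

```python
import re
from datetime import datetime

# Matches the two Linux client messages quoted in this report.
LOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+nfs: server\s+"
    r"(?P<server>\S+)\s+(?P<state>not responding, still trying|OK)\s*$"
)


def stall_durations(lines, year=2021):
    """Return [(server, seconds)] for each 'not responding' -> 'OK' pair.

    The year is an assumption supplied by the caller; stalls that span a
    year boundary are not handled.
    """
    started = {}  # server -> time of the first unanswered message
    stalls = []
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        server = m["server"]
        if m["state"] == "OK":
            if server in started:
                elapsed = (when - started.pop(server)).total_seconds()
                stalls.append((server, elapsed))
        else:
            # Keep the earliest timestamp if the message repeats.
            started.setdefault(server, when)
    return stalls


log = [
    "May 30 22:04:34 mayhome kernel: nfs: server 192.168.3.51 not responding, still trying",
    "May 30 22:06:35 mayhome kernel: nfs: server 192.168.3.51 OK",
]
for server, seconds in stall_durations(log):
    print(f"{server}: stalled for {seconds:.0f} s")  # 192.168.3.51: stalled for 121 s
```

For the two lines above, the stall lasted 121 seconds, consistent with "a few minutes or so".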
This behavior has only been observed since updating the NAS to 13-stable via the buildworld/buildkernel procedure. The Linux side is unchanged and no hardware changed on either side. 12-stable did not exhibit any of this.
The NAS's NIC serving nfsd is
t5nex0: <Chelsio T520-SO> mem 0xdd300000-0xdd37ffff,0xdc000000-0xdcffffff,0xdd884000-0xdd885fff irq 16 at device 0.4 on pci1
cxl0: <port 0> on t5nex0
cxl0: Ethernet address: 00:07:43:31:9c:80
cxl0: 8 txq, 8 rxq (NIC); 8 txq (TOE), 2 rxq (TOE)
cxl1: <port 1> on t5nex0
cxl1: Ethernet address: 00:07:43:31:9c:88
cxl1: 8 txq, 8 rxq (NIC); 8 txq (TOE), 2 rxq (TOE)
t5nex0: PCIe gen2 x8, 2 ports, 22 MSI-X interrupts, 54 eq, 21 iq
Linux nfsmount flags:
/home from 192.168.3.51:/mnt/nashome
Created attachment 225416
reverts commit r367492
r367492 causes NFS clients (mostly Linux ones) to hang
This patch reverts r367492.
This is most likely caused by r367492, which shipped in 13.0.
If this is the cause you will observe the following when a Linux
client is hung.
# netstat -a
will show the client TCP connection as established and Recv-Q
will be non-zero and probably getting larger.
If this is the case you need to do one of:
- revert r367492 by applying the patch in the attachment.
- apply the patch in D29690 to your kernel, which is believed
to fix the problem
- wait until commit 032bf749fd44 is MFC'd to stable/13 and then
upgrade to stable/13 (MFC should happen in early June)
However, if the
# netstat -a
shows the TCP connection as CLOSE_WAIT, then you need the patch
which is the first attachment on PR#254590 and is already in stable/13.
(MFC'd Apr. 27)
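The check described above can be sketched as a small classifier over `netstat -a` output lines from the server. This is an illustrative sketch only: the function name and return labels are made up here, and the column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, state) is assumed to match FreeBSD's `netstat -a` TCP output. A single snapshot can only show that Recv-Q is non-zero, so it would need to be run repeatedly to see whether the value keeps growing.

```python
def classify_nfs_connection(netstat_line):
    """Classify one line of server-side `netstat -a` output.

    Returns None for lines that are not TCP connections on the NFS port,
    otherwise one of the (hypothetical) labels:
      "recv-q-backlog": ESTABLISHED with non-zero Recv-Q (the r367492 symptom)
      "close-wait":     CLOSE_WAIT (the case covered by the PR#254590 patch)
      "ok":             neither symptom visible in this snapshot
    """
    fields = netstat_line.split()
    if len(fields) < 6 or not fields[0].startswith("tcp"):
        return None
    _proto, recv_q, _send_q, local, _foreign, state = fields[:6]
    # netstat prints the service name unless -n is given, then the port number.
    if not (local.endswith(".nfsd") or local.endswith(".2049")):
        return None
    if state == "ESTABLISHED" and int(recv_q) > 0:
        return "recv-q-backlog"
    if state == "CLOSE_WAIT":
        return "close-wait"
    return "ok"


sample = "tcp4 230920 0 192.168.3.51.nfsd 192.168.3.1.760 ESTABLISHED"
print(classify_nfs_connection(sample))  # recv-q-backlog
```

Feeding it every line of `netstat -a` output taken while a client is hung, and again a few seconds later, would show which of the two cases applies.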
I can confirm that netstat -a shows the client TCP connection as ESTABLISHED, and that Recv-Q went very high during a stall:
tcp4 230920 0 192.168.3.51.nfsd 192.168.3.1.760 ESTABLISHED
I've applied the patch that reverts r367492. I'll report back if I see a stall again.
I've set the MFC-stable13 flag, to refer to commit 032bf749fd44
and not the attachment patch, which is just a workaround until
commit 032bf749fd44 is in your FreeBSD-13 kernel.
*** Bug 255083 has been marked as a duplicate of this bug. ***
I am seeing similar symptoms in 13.0-RELEASE, but can't confirm with netstat until off-work hours. Can someone confirm from svn / git that this bug is in fact affecting 13.0-RELEASE?
And, if so, can we get a backport to RELEASE so that we don't have to build a custom kernel as a workaround or update to -STABLE? This is a pretty brutal bug; at the very least, a workaround that doesn't require me to follow STABLE or temporarily build a custom kernel would be nice. I'm just a home user, but I imagine that enterprise installations following -RELEASE will be extremely annoyed by this bug.
Ah, looks as though it's in there. The SVN revision number should have probably been a big enough hint that it's a commit that came in a long time ago, I guess:
Someone should change Version: in the PR to 13.0-RELEASE.
Change to 13.0-RELEASE since a fix is now in 13-STABLE.
Any news on this getting into releng? I saw that on stable/13 there's a bunch of traffic on rack.c, some of which refers to leaked mbufs due to long-dormant bugs.
What do I lose by applying this patch that reverts the changes to the loss recovery?
Ahh, sorry for the double comment post, looks like I can't delete it. I pressed submit twice in succession and didn't read the message carefully enough that followed.
The patch may not apply cleanly against rack.c / bbr.c when coming directly from 13.0-RELEASE.
Unless you are actually using the RACK or BBR TCP stack, you can (locally) skip the parts of the patch that touch these files and keep the pre-patch rack.c source. If you do use RACK or BBR, you want the fixes that are in 13-STABLE rather than running 13.0-RELEASE. Note that if you skip those parts and later start using RACK, you will run into locking issues which will trigger a core.
Producing a "clean" patch against 13.0-RELEASE would mean bundling all the commits to rack.c into it as well, or else risking the source files getting out of sync, so it is more involved to provide directly.
I hope this comment is helpful to you.
(In reply to Richard Scheffenegger from comment #12)
Nope, just the default loss control algorithm.