Bug 256280 - FreeBSD nfsd serving zfs pool, linux nfsclient, often hangs (not observed in 12-stable)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 13.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Rick Macklem
URL:
Keywords: regression
Duplicates: 255083
Depends on:
Blocks:
 
Reported: 2021-05-31 05:20 UTC by yuan.mei
Modified: 2024-04-21 22:30 UTC
CC: 21 users

See Also:
Flags: rmacklem: mfc-stable13?


Attachments
reverts commit r367492 (8.74 KB, patch)
2021-05-31 16:51 UTC, Rick Macklem

Description yuan.mei 2021-05-31 05:20:04 UTC
I am observing frequent NFS mount hang-ups after upgrading my NAS to 13-stable.  The FreeBSD box here runs nfsd serving a ZFS pool.  A Gentoo Linux box is connected to this NAS via a 10GbE fiber link.  Once in a while, perhaps when the ZFS load gets high (afpd is running), the Linux side's access to NFS hangs, then recovers after a few minutes.  The following messages are printed in the Linux dmesg:

May 30 22:04:34 mayhome kernel: nfs: server 192.168.3.51 not responding, still trying

But after a while, a few minutes or so, the access recovers:

May 30 22:06:35 mayhome kernel: nfs: server 192.168.3.51 OK

This behavior is only observed after updating the NAS to 13-stable via the buildworld/buildkernel procedure.  The Linux side remains the same and no hardware changed on either side.  12-stable did not exhibit any of this.

The NAS's NIC serving nfsd is

t5nex0: <Chelsio T520-SO> mem 0xdd300000-0xdd37ffff,0xdc000000-0xdcffffff,0xdd884000-0xdd885fff irq 16 at device 0.4 on pci1
cxl0: <port 0> on t5nex0
cxl0: Ethernet address: 00:07:43:31:9c:80
cxl0: 8 txq, 8 rxq (NIC); 8 txq (TOE), 2 rxq (TOE)
cxl1: <port 1> on t5nex0
cxl1: Ethernet address: 00:07:43:31:9c:88
cxl1: 8 txq, 8 rxq (NIC); 8 txq (TOE), 2 rxq (TOE)
t5nex0: PCIe gen2 x8, 2 ports, 22 MSI-X interrupts, 54 eq, 21 iq

Linux nfsmount flags:

/home from 192.168.3.51:/mnt/nashome
 Flags:	rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.51,mountvers=3,mountport=855,mountproto=udp,local_lock=all,addr=192.168.3.51
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2021-05-31 16:51:29 UTC
Created attachment 225416 [details]
reverts commit r367492

r367492 causes NFS clients (mostly Linux ones) to hang
intermittently.

This patch reverts r367492.
Comment 2 Rick Macklem freebsd_committer freebsd_triage 2021-05-31 17:05:12 UTC
This is most likely caused by r367492, which shipped in 13.0.
If this is the cause, you will observe the following when a Linux
client is hung:
# netstat -a
will show the client TCP connection as ESTABLISHED, and Recv-Q
will be non-zero and probably getting larger.

If this is the case you need to do one of:
- revert r367492 by applying the patch in the attachment.
- apply the patch in D29690 to your kernel, which is believed
  to fix the problem
- wait until commit 032bf749fd44 is MFC'd to stable/13 and then
  upgrade to stable/13 (MFC should happen in early June)

However, if
# netstat -a
shows the TCP connection as CLOSE_WAIT, then you need the patch
that is the first attachment on PR#254590 and is already in stable/13.
(MFC'd Apr. 27)
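
A minimal sketch of watching for that condition on the server, assuming
the stock netstat output format (the 5-second interval and the grep
pattern are arbitrary choices, not something from this PR):

#!/bin/sh
# Print the state of TCP connections to the local nfsd every 5 seconds;
# a client hit by this bug shows ESTABLISHED with a growing Recv-Q
# (the second column of netstat's output).
while true; do
    date
    netstat -a | grep '\.nfsd '
    sleep 5
done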
Comment 3 yuan.mei 2021-06-01 01:43:52 UTC
I confirm that # netstat -a shows the client TCP connection as ESTABLISHED, and Recv-Q went very high during a stall:

tcp4  230920      0 192.168.3.51.nfsd      192.168.3.1.760        ESTABLISHED

I've applied the patch that reverts r367492.  I'll report back if I see a stall again.
Comment 4 Rick Macklem freebsd_committer freebsd_triage 2021-06-01 01:54:50 UTC
I've set the MFC-stable13 flag, to refer to commit 032bf749fd44
and not the attachment patch, which is just a workaround until
commit 032bf749fd44 is in your FreeBSD-13 kernel.
Comment 5 Rick Macklem freebsd_committer freebsd_triage 2021-06-30 03:03:59 UTC
*** Bug 255083 has been marked as a duplicate of this bug. ***
Comment 6 Adam Stylinski 2021-07-06 15:05:12 UTC
I am seeing similar symptoms in 13.0-RELEASE, but can't confirm yet with netstat until some off-work hours.  Can someone confirm from svn/git that this bug does in fact affect 13.0-RELEASE?

And, if so, can we get a backport into RELEASE so that we don't have to compile a custom kernel workaround or update to -STABLE?  This is a pretty brutal bug; at the very least a workaround would be nice that does not require me to follow STABLE or build a custom kernel temporarily.  I'm just a home user, but I imagine that enterprise installations following -RELEASE will be extremely annoyed by this bug.
Comment 7 Adam Stylinski 2021-07-06 15:17:25 UTC
Ah, looks as though it's in there.  The SVN revision number should have probably been a big enough hint that it's a commit that came in a long time ago, I guess:

https://cgit.freebsd.org/src/commit/?h=releng/13.0&id=4d0770f1725f84e8bcd059e6094b6bd29bed6cc3

Someone should change Version: in the PR to 13.0-RELEASE.
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2021-07-06 22:53:59 UTC
Changed the Version to 13.0-RELEASE, since a fix is now in 13-STABLE.
Comment 9 Adam Stylinski 2021-07-10 01:17:49 UTC
Any news on this getting into releng?  I saw on stable/13 there's a bunch of traffic on rack.c, some of which refer to leaked mbufs due to long dormant bugs.

What do I lose by applying this patch that reverts the changes to the loss recovery?
Comment 11 Adam Stylinski 2021-07-10 01:21:48 UTC
Ahh, sorry for the double comment post, looks like I can't delete it.  I pressed submit twice in succession and didn't read the message carefully enough that followed.
Comment 12 Richard Scheffenegger freebsd_committer freebsd_triage 2021-07-12 10:29:47 UTC
The patch may not apply cleanly against rack.c / bbr.c when coming directly from 13.0-RELEASE.

Unless you are actually using the RACK or BBR TCP stack (and if you are, you want the fixes that are in stable/13, not 13.0-RELEASE), you can locally skip the patch hunks for those files and keep the rack.c source as it was before the patch.  Note, however, that if you later start using RACK, you will run into locking issues which will trigger a core dump.

Getting a "clean" patch against 13.0-RELEASE would mean bundling all the commits to rack.c into it as well, or risk the source files getting out of sync, so providing one directly is more involved.

I hope this comment is helpful to you.
Comment 13 Adam Stylinski 2021-07-12 12:58:26 UTC
(In reply to Richard Scheffenegger from comment #12)

Nope, just the default loss control algorithm.
Comment 14 adamz 2021-12-27 05:16:30 UTC
I have a server I just upgraded from 12.2-RELEASE to 13.0-RELEASE and am now experiencing this issue.  Reading through the comments here, what are the options to resolve this?  I have a server that I can't easily downgrade now and would really prefer not to have to completely rebuild it on 12.2-RELEASE via a fresh install.

I'm shocked this bug was reported in May and yet it's now the end of December and 13.0-RELEASE still has this issue?
Comment 15 Rick Macklem freebsd_committer freebsd_triage 2021-12-27 14:48:06 UTC
To fix the problem, you have two options other than
reverting back to FreeBSD 12:
- Apply the patch in the attachment here to your kernel sources
  and build/install that kernel.
- Upgrade to stable/13.
The fix will also be in 13.1.

The patch that fixed this in stable/13 was not errata'd by
the committer, so it is not available via freebsd-update.

It was a bug introduced into the TCP stack that did not
appear to have anything to do with NFS.  That is why it
took so long to resolve.
Here's the commit log message for r367492 (just fyi):

  Prevent premature SACK block transmission during loss recovery

  Under specific conditions, a window update can be sent with
  outdated SACK information. Some clients react to this by
  subsequently delaying loss recovery, making TCP perform very
  poorly.

That commit actually changed the timing and locking rules for
socket receive upcalls, which broke the NFS server under certain
conditions.
Comment 16 adamz 2021-12-27 18:58:24 UTC
If it matters at all, I disabled LRO and TSO on all NICs per the thread below and that seems to stabilize things without additional changes for the moment.  Not ideal, but it at least got the system operational quickly.

https://muc.lists.freebsd.current.narkive.com/bqlZ5JRb/nfs-issues-since-upgrading-to-13-release

I don't see any details on the website about when 13.1-RELEASE is slated, so I assume that's still a ways out at this point.  I would rather apply a kernel patch than move to stable for a production box, so I'll need to figure out how to apply the patch.  I have built/installed custom kernels in the past, just not applied patches like this.
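
For anyone else trying the same workaround, a sketch of disabling
TSO/LRO (the interface names and address below are only examples;
adjust them to your hardware):

# disable immediately on the running system
ifconfig ix0 -tso -lro
ifconfig igb0 -tso -lro
# make it persistent by appending the options to the interface's
# line in /etc/rc.conf, e.g.:
ifconfig_ix0="inet 192.0.2.10/24 -tso -lro"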
Comment 17 Rick Macklem freebsd_committer freebsd_triage 2021-12-27 21:22:35 UTC
Yep. Disabling TSO has worked for others, too.

If you do decide to patch your kernel, the patch in
the attachment should apply to a 13.0 kernel and it
puts the code back the way it has been for a long time.
So, I think it is safe to apply and fixes the problem,
if it persists with TSO disabled.

Many NICs/drivers don't get TSO right, but beyond that,
disabling it also changes the timing.  I was never able
to reproduce the problem, which consists of a receive
socket upcall being missed.
Comment 18 adamz 2021-12-27 21:25:46 UTC
Is there anything I can do to capture data for you, since it's 100% reproducible here?  Otherwise, so far disabling TSO/LRO seems to have it stable at the moment.  The system uses Intel NICs (ix and igb drivers) if that matters regarding TSO/LRO.

Personally I would love to wait until 13.1-RELEASE, but assuming that won't be for a while and I need to keep this box stable, the kernel patch is likely the route I have to go, as it's the least complicated.
Comment 19 Rick Macklem freebsd_committer freebsd_triage 2021-12-27 21:30:01 UTC
If disabling TSO works, great.

If not, patch your kernel.
Use the patch command to apply the patch and then build, etc.;
the procedure is explained in comment #9 of PR#254590.
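
For reference, a rough sketch of that procedure (the patch file name
is just an example; use -p1 instead of -p0 if the diff paths carry
a/ and b/ prefixes):

cd /usr/src
patch -p0 < /tmp/revert-r367492.patch
make buildkernel
make installkernel
shutdown -r now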
Comment 20 Riccardo Torrini 2022-01-06 12:31:51 UTC
I was probably bitten by the same bug, but my setup is slightly different:
- server 13.0-RELEASE-p4/p5, UFS on system, ZFS on /home (also NFS exported)
- client 13.0-RELEASE-p4/p5, /home mounted by the server

Under heavy load (find on a dir tree with millions of files[1], md5 and/or mv them elsewhere) the server hangs (the CPU is not responding, Caps Lock does not work, and the server needs a power cycle).
I can't run any debug command on the client or server because the CPU hangs.

Also note that I'm fairly new to ZFS; the server machine was full-UFS until last week (no NFS crashes nor any other UFS issues).
So IMHO it is definitely something related to the NFS + ZFS mix.

I just tried disabling TSO on the Broadcom NIC on the server (no TSO available on the client); I'll let you know if it's enough (hopefully) while waiting for 13.1.

(both client and server)# freebsd-version -kru
13.0-RELEASE-p4
13.0-RELEASE-p4
13.0-RELEASE-p5

(server)# ifconfig | grep -iB1 tso
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>

# grep ^bge /var/run/dmesg.boot
bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xfe9f0000-0xfe9fffff irq 18 at device 0.0 on pci3
bge0: CHIP ID 0x05784100; ASIC REV 0x5784; CHIP REV 0x57841; PCI-E
bge0: Using defaults for TSO: 65518/35/2048

Should I also disable HWTSO?

[1] something like /var/db/freebsd-update/files/{long_hex_name_like_md5_or_sha} to be reorganized under 00-ff subdirs
Comment 21 Rick Macklem freebsd_committer freebsd_triage 2022-01-06 16:02:38 UTC
If the server was hung, it is not this bug.

If disabling TSO does not help, you'll need
to get a dump when it hangs again and then either:
1 - Create another bug report, noting a hang and ZFS
or
2 - Post the information to freebsd-current@ (technically
    it would be freebsd-stable@, but all the developers
    tend to read freebsd-current@).

I know nothing about ZFS, but it can be a resource hog
and this sounds like a ZFS tuning problem.
You might need to limit your ARC size, or something like that.
Comment 22 Adam Stylinski 2022-01-06 16:26:43 UTC
ARC tuning is _rarely_ needed these days, but if you have an extremely memory limited machine, maybe?

It doesn't sound like this bug if you never saw it with UFS and it hard locks your machine (I certainly never observed that).  I will say that despondent disks with buggy firmware or AHCI controllers that aren't quite up to snuff can make ZFS hang.  SAS controllers that end up having to setup watchdogs and reset the firmware can make ZFS have a fit, since ZFS can't really handle a case where suddenly all of the drives in a pool disappear.
Comment 23 Riccardo Torrini 2022-01-06 17:44:20 UTC
(In reply to Rick Macklem from comment #21)
(In reply to Adam Stylinski from comment #22)

Thanks for the follow-up.  My hang is probably not related to this bug, but this was the only recent thread with ZFS+NFS :(

The server machine is probably not enough for ZFS; it is an HP MicroServer gen7 N40L with 2GB of RAM (I have another MicroServer, a gen8 with 8GB of RAM, a UFS system boot and 4 disks in raidz1, but it is not NFS exported).  I'm waiting for a 2nd disk to set up mirroring of the /home partition.

Should I revert to UFS for the /home partition?
Expand the RAM?  Replace the gen7 with another gen8?  Switch to gmirror?
Anyway, with -tso4 it is still running (crossing fingers).

Thanks for your time and any suggestion.
Comment 24 Adam Stylinski 2022-01-06 20:46:51 UTC
Expanding memory, if viable, is something I'd do anyway with 2GB.  You can probably shrink the ARC max to something pretty low, though.  I once managed, on a Pentium III laptop, to lower the ARC max to something small enough that it worked within 768MB of memory.  It's doable, at a penalty for ARC misses.
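
A sketch of capping the ARC on 13.x (the 512MB value is only an
example; vfs.zfs.arc_max is the long-standing name, and newer OpenZFS
also accepts vfs.zfs.arc.max):

# /boot/loader.conf -- cap the ARC at 512MB (example value)
vfs.zfs.arc_max="536870912"
# it can also be changed at runtime:
sysctl vfs.zfs.arc_max=536870912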
Comment 25 Kurt Jaeger freebsd_committer freebsd_triage 2022-03-18 13:44:33 UTC
(In reply to Rick Macklem from comment #2)
We see this or a similar problem between FreeBSD 13.0-p8 clients and FreeBSD 13.0-p8 servers.
Comment 26 iio7 2022-04-12 04:16:18 UTC
(In reply to Rick Macklem from comment #15)

> The patch that fixed this in stable/13 was not errata'd by
> the committer, so it is not available via freebsd upgrade.

I don't understand this part.  Why isn't it simply errata'd
by someone from the FreeBSD team then, so that it goes into
freebsd-update?
Comment 27 Rick Macklem freebsd_committer freebsd_triage 2022-04-12 14:26:37 UTC
In this case the patch was a somewhat involved change
in the TCP stack. The author (rscheff@, not me) chose
not to do it as an errata.

You would need to ask the author why it was not done as
an errata, but it is possible that it was tied to
other changes not done as errata either.

There is also the question of resources.
Remember that most FreeBSD committers do so as volunteers.
(For example, I have never been paid anything by anyone
 for doing work on FreeBSD.)
Comment 28 Adam Stylinski 2022-04-12 14:58:40 UTC
Will this change at least be present in 13.1?  It ought to be, right?  Those spin off of stable.
Comment 29 Rick Macklem freebsd_committer freebsd_triage 2022-04-12 22:50:54 UTC
rscheff@'s patch (not the same as reverting r367492),
which is believed to fix the problem dealt with by this PR,
is in 13.1.

Note that the key characteristic w.r.t. these hangs is
# netstat -a
showing the TCP connection for the hung client in the
ESTABLISHED state with an increasing Recv-Q, while
other clients continue to function.

If you do not observe the above, then your hangs are
not this bug.
Comment 30 Mateusz Piotrowski freebsd_committer freebsd_triage 2022-04-21 10:53:33 UTC
For reference:

- The commit hash of the fix in the releng/13.1 branch: 55cc0a478506ee1c2db7b2f9aadb9855e5490af3
- Phabricator link: https://reviews.freebsd.org/D29690
Comment 31 Bren 2022-08-19 14:06:58 UTC
We started experiencing this bug after upgrading to 13.0-RELEASE. We have a large file server that serves files over NFS to several Linux based Apache / PHP servers which serve a fairly high traffic website. NFS mounts would frequently hang and then recover after a few minutes throughout the day just like others were seeing.

This stopped after upgrading to 13.1-RELEASE; however, now we're seeing a similar but more serious issue.  About once a week the NFS mounts on one of the web servers will completely hang and never recover.  I have to reset the web server in order to recover from this.  It's never the same one, and luckily it's only been one at a time up to this point.

The last time this happened I forgot to look at "netstat -a" but will do next time to see if RecvQ increments like it did before. I wanted to post here ahead of time to see if there is anything else I can look at when this happens. It seems like the exact same issue as before but now the mounts never recover.
Comment 32 Bren 2022-08-22 15:22:52 UTC
OK, I had two web servers' mounts hang this morning.  "netstat -a" on the server showed a Recv-Q and Send-Q of 0 for the hung client.

# freebsd-version -kru
13.1-RELEASE
13.1-RELEASE
13.1-RELEASE

Linux mount options:

proto=tcp,rw,nosuid,nodev,noexec,hard,timeo=600,retrans=2,vers=3,rsize=1048576,wsize=1048576,actimeo=5,noatime,nodiratime,x-systemd.requires=network-online.target

These are the NICs we're using:

ixl1: <Intel(R) Ethernet Connection X722 for 10GBASE-T - 2.3.1-k> mem 0x38bffd000000-0x38bffdffffff,0x38bfff800000-0x38bfff807fff irq 40 at device 0.1 on pci7
ixl1: fw 3.1.54559 api 1.5 nvm 3.2d etid 80000b4b oem 1.262.0
ixl1: PF-ID[1]: VFs 32, MSI-X 129, VF MSI-X 5, QPs 768, MDIO shared
ixl1: Using 1024 TX descriptors and 1024 RX descriptors
ixl1: Using 8 RX queues 8 TX queues
ixl1: Using MSI-X interrupts with 9 vectors
ixl1: Ethernet address: d8:c4:97:d1:1a:1f
ixl1: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl1: SR-IOV ready
ixl1: netmap queues/slots: TX 8/1024, RX 8/1024
ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
ixl1: link state changed to UP

Following this thread, I collected tcpdumps on both the server and the client if anyone wants to see that.

https://www.mail-archive.com/freebsd-current@freebsd.org/msg184141.html

I might try disabling TSO and LRO to see if that helps. I'm also considering rolling back to 13.0-RELEASE because at least then the mounts would recover after a few minutes.
Comment 33 Rick Macklem freebsd_committer freebsd_triage 2022-08-22 19:51:13 UTC
I can look at the tcpdumps, if you let me know where
they are.
Comment 34 Rick Macklem freebsd_committer freebsd_triage 2022-08-22 19:53:27 UTC
Btw, since the Recv Q was 0, it probably is not the bug
fixed by reverting r367492.
Comment 35 Rick Macklem freebsd_committer freebsd_triage 2022-08-22 21:18:15 UTC
I took a look at the packet trace and it does not
appear that the server is hung.
I suspect that, if you did:
# nfsstat -s -E
repeatedly, you would see the RPC counts increasing.

What the packet trace shows is the client doing a
Write RPC repeatedly, getting a NFS3ERR_STALE reply
each time. (First reply at packet#1252, followed by
many more Write RPC attempts for the same file.)

NFS3ERR_STALE - Means that the file no longer exists
  on the server. I do not know why it no longer exists,
  since there are no Remove RPCs in the trace.
  I would consider it a client bug to keep retrying a
  Write RPC after a NFS3ERR_STALE reply, since the error
  is fatal and cannot recover upon retries.

You might look for some "cleanup" process/thread that
might delete a file before the NFS client would expect that.
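
For anyone digging into a similar capture, a sketch of checking for
both of these with tshark (the field names are from memory, so verify
them with "tshark -G fields"; NFS3ERR_STALE is status 70 and REMOVE is
NFSv3 procedure 12):

# count the NFS3ERR_STALE replies in the capture
tshark -r out.pcap -Y "nfs.nfsstat3 == 70" | wc -l
# look for any REMOVE calls that would explain the stale file handle
tshark -r out.pcap -Y "nfs.procedure_v3 == 12"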
Comment 36 Bren 2022-08-23 17:39:20 UTC
Thanks for looking at the tcpdumps.  The server indeed doesn't hang for all clients.  Usually only one of the three Apache/PHP servers' mounts hangs at a time.  I had two hang yesterday while I was collecting data and troubleshooting the one.  It seems random.

We have three nginx servers serving out static files which never do this. We also have various other servers using NFS that never do this. It's only happening with the Apache/PHP servers.

If it's related to some sort of write issue that would make sense because the Apache / PHP servers are very busy doing reads and writes. Files are being uploaded, renamed, deleted, etc. 1000s of times a day. We also have a script server running backend processes which modifies files all the time.

We've been using NFS like this for over twenty years and with this particular setup for at least three now without issue. We only started seeing NFS mounts hang when we upgraded to 13.0-RELEASE. I'm very skeptical that this is simply a client bug but I could see a case where maybe something with the server changed which has uncovered this or something like that.

Usually when we see a stale file handle the logs report that. We have one process where this happens many times a day and the mounts never hang. All we're seeing from the client in this situation is:

kernel: [1261497.154370] nfs: server 10.10.1.20 not responding, still trying

Would it be helpful to have tcpdumps covering the time the hang occurs? I think I may have a way to grab 2-3 minute continuous dumps on both the file server and the web servers and then save the dumps when a hang occurs. This will take some work but I think it's doable. I'm also going to see if I can recreate this issue somehow.
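
One way to keep a rolling capture without filling the disk is
tcpdump's rotation options; a sketch (the interface, client address,
file size, and file count are placeholders to adjust):

# rotate at 100MB per file, keep at most 20 files, overwriting the oldest
tcpdump -i ixl1 -s 0 -C 100 -W 20 -w /var/tmp/nfs-client.pcap host 10.10.1.30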
Comment 37 Rick Macklem freebsd_committer freebsd_triage 2022-08-25 00:19:15 UTC
Ok, I had assumed that the packet trace was taken when
a client was "hung".  If not, the packet trace is irrelevant.

I did take another look and each of the Write RPCs that
get the NFS3ERR_STALE reply are for different offsets,
so the client is only attempting each write once, but is
doing it for a very large file.
- The largest offset in the packet trace is 1738260480.
- You can look at the packet trace by pulling it into
  wireshark. The first of these Writes is at packet#1250
  and the last at packet# 55040 in the server's capture.

If the client was "hung" at the time this capture was taken,
the server is not "hung". It is simply replying to RPCs.

You mention you have been using this for 20 years, but I
suspect that you have upgraded the Linux client at various
times. If this packet capture is for a "hung" client, I
suspect a change in the Linux client has resulted in each
write being tried instead of a failure after one write fails
with NFS3ERR_STALE.

In general, you should try and avoid removing a file on one
client (or locally on the NFS server) while it is still being
accessed, to avoid NFS3ERR_STALE errors.
Comment 38 Bren 2022-08-25 22:46:39 UTC
I guess I misunderstood your last message. The packet traces were indeed taken while all six mounts of one of the web servers were hung/unresponsive.

This could definitely be a client issue but like I said we've never had an issue like this until we upgraded to 13.0-RELEASE.

If this is a separate issue then wouldn't we have seen both the ephemeral hangs as well as the permanent hangs in 13.0? We were only seeing ephemeral hangs like others were until we upgraded to 13.1.

There was a change in the Linux client recently (upgrade from Debian 10 to 11) but as I recall we were seeing hangs immediately after upgrading the file server to 13.0 when the web servers were still on Debian 10. The permanent hangs only started after upgrading to 13.1.

As far as I know, we only modify large files on one web server at a time. They mainly handle file uploads which can sometimes be quite large. We've never had an issue with this before and I don't think anything has changed there in a long time.

Would gathering packet traces while the hang occurs help? Is there any other data I can gather?
Comment 39 Rick Macklem freebsd_committer freebsd_triage 2022-08-25 23:39:43 UTC
I doubt capturing packets while the hang occurs will
make much difference. There might be a Remove RPC for
the file in the capture, which would clarify why the
file went away.

As I said, if you look at the capture in wireshark,
you will see that the server is just replying to RPCs.
It does not appear to be hung. The client may appear to
be hung (if you waited long enough it might eventually
recover after it has attempted all the writes).
I have no idea why moving to FreeBSD13 would trigger
this. Go back to FreeBSD12 on the server, if you'd like,
since the NFS server code is essentially the same for NFSv3.
Comment 40 Bren 2022-10-16 20:37:17 UTC
I've been troubleshooting this on and off for months without any luck. Thankfully this is only happening a couple of times a week and usually on one server at a time (the Apache/PHP servers).

Today, however, we had all NFS mounts hang on several servers for some reason (which took the site down and had me out of bed at 0630) including a script server, a couple of nginx servers (where the NFS mounts are readonly), and a couple of Apache/PHP servers.

There is nothing in the logs, the traffic graphs, or anywhere really to indicate why this happened. It seems to have just happened out of the blue as usual. One server did show this:

2022-10-16T07:06:02.629829-04:00 scripts01 kernel: [7664840.363767] rpc_check_timeout: 14 callbacks suppressed
2022-10-16T07:06:02.631007-04:00 scripts01 kernel: [7664840.363768] nfs: server 10.10.1.20 not responding, still trying

Otherwise the rest of them only logged the second line.

We upgraded our web servers to Debian 11 around May 24th of this year. At that time we were running 13.0-p11 on the storage server and experiencing the bug first reported here. The NFS mounts would hang but always come back after a few minutes.

It was only when we upgraded to 13.1-p0 around July 18 that we started seeing hung NFS mounts that would never recover. The NFS client didn't change during this period of time. Nothing else changed to my knowledge. It was only directly after we upgraded to FreeBSD 13.1-p0 that we started experiencing this issue so I don't think this is a bug with Debian/nfs-common.

Unfortunately we seem to be the only ones experiencing this (that I can find anyway). If anyone could provide any pointers as to what we could possibly do to get this fixed I'd really appreciate it. I've searched high and low for a solution to this but haven't been able to get anywhere with it.

At this point I'm considering rolling back to 13.0-p11 because at least then the mounts would recover. Or, worst case scenario, I might have to move us to a different platform which I really don't want to do.
Comment 41 Rick Macklem freebsd_committer freebsd_triage 2022-10-16 21:52:28 UTC
If you are running 13.1, your bug is not the one
that this PR originally dealt with, as far as I know.

Here's the kind of information you need to collect
when a client is hung, to try and isolate the problem.
(For this, "client" refers to an NFS client, even though you
 may call those machines servers too, and "server" refers
 to the NFS server.)
# netstat -a
- on both client and server (you are looking
  for the TCP connection NFS is using between them,
  what state it is in, and whether or not the Send or
  Recv Qs are non-zero).
# ping the server from the client to make sure basic
  network connectivity exists.

# ps axHl - on the NFS server, to see what the nfsd
  threads are up to.
# tcpdump -s 0 -w out.pcap host <nfs-client>
- Run it for a few minutes on the NFS server, then look
  at out.pcap in wireshark to see what, if any, NFS
  traffic is happening between them.
- On the NFS server, access the file system that is exported
  to the NFS client locally on the server, to ensure that
  the file system seems to be working.
# vmstat -m
and
# vmstat -z
- on the NFS server, to see if some resource (like mbufs
  or mbuf clusters) seems to be exhausted.

The above information might tell you what is going on.
If you cannot interpret it, post it.
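
A rough collection script based on the list above, to make it easier
to run while a client is hung (the client address and output directory
are placeholders):

#!/bin/sh
# Collect NFS-hang diagnostics on the FreeBSD NFS server.
CLIENT=10.10.1.30                 # address of the hung NFS client
OUT=/var/tmp/nfs-hang.$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
netstat -a > "$OUT/netstat.txt"
ping -c 3 "$CLIENT" > "$OUT/ping.txt" 2>&1
ps axHl > "$OUT/ps-axHl.txt"
nfsstat -s -E > "$OUT/nfsstat.txt"
vmstat -m > "$OUT/vmstat-m.txt"
vmstat -z > "$OUT/vmstat-z.txt"
# capture a few minutes of traffic to/from the hung client
tcpdump -s 0 -w "$OUT/out.pcap" host "$CLIENT" &
sleep 180; kill $!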
Comment 42 Bren 2022-10-16 22:15:24 UTC
Rick,

I believe I have collected almost all of this information for you already in previous responses including a tcpdump. I can collect it all over again if you'd like.

The conclusion was that this was a bug in the Debian NFS client.  After reviewing all of the information I'm still not convinced this is a client bug since, as I've said a few times, this only started when we upgraded to 13.1.  Nothing else changed as far as I know.

Please let me know if there is anything else I can do to help determine why this is happening. At this point it's looking like we are stuck with a broken NFS server implementation without many options.
Comment 43 Rick Macklem freebsd_committer freebsd_triage 2022-10-16 23:42:27 UTC
Please post my email response that indicated a
Linux NFS client problem here, if you still have it.

If you don't have it, I might be able to find it
buried deeply in my deleted email, but that is a
last resort.

This will give others a base from which to work on
possible solutions.
Comment 44 Bren 2022-10-17 02:06:55 UTC
"What the packet trace shows is the client doing a
Write RPC repeatedly, getting a NFS3ERR_STALE reply
each time. (First reply at packet#1252, followed by
many more Write RPC attempts for the same file.)

NFS3ERR_STALE - Means that the file no longer exists
  on the server. I do not know why it no longer exists,
  since there are no Remove RPCs in the trace.
  I would consider it a client bug to keep retrying a
  Write RPC after a NFS3ERR_STALE reply, since the error
  is fatal and cannot recover upon retries.

You might look for some "cleanup" process/thread that
might delete a file before the NFS client would expect that."

No cleanup process runs to account for this. I've tried to recreate this manually numerous times but haven't been able to. The server is up, pingable, and still serving other clients. Once the client that's hung is reset it works fine until the next hang. This seems to happen randomly.

Then this morning we had several other servers hang, two of which only had read-only mounts, so they couldn't possibly have been trying to write to a file.
Comment 45 Rick Macklem freebsd_committer freebsd_triage 2022-10-21 15:31:32 UTC
PR# 265588 might be relevant. It describes another
TCP stack issue that was detected on a 13.1 NFS server.

If you read through it, the reporter notes that disabling
SACK and DSACK fixed the problem.
This might be worth a try, Bren.
I don't know, but I'd guess...
# sysctl net.inet.tcp.sack.enable=0
on the NFS server does the trick.

If you try this and it helps, please comment both
here and on PR#265588.
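
For reference, a sketch of applying and persisting that setting (the
sysctl.conf line follows the usual convention; nothing here is specific
to this PR):

# apply immediately
sysctl net.inet.tcp.sack.enable=0
# keep it across reboots
echo 'net.inet.tcp.sack.enable=0' >> /etc/sysctl.conf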
Comment 46 Bren 2022-10-27 02:27:43 UTC
This definitely looks like it could be what we've been dealing with. Oddly enough, we haven't had any NFS hangs since the last one that took the site down.

If this happens again I'll try disabling SACK and we'll see what happens. Thanks for bringing this to my attention.
Comment 47 Richard Scheffenegger freebsd_committer freebsd_triage 2022-10-27 08:16:49 UTC
13.1 was found to have a regression and a day-one defect. D36626 addresses the day-one problem, and D36046 the specific regression exposing the former. 

13.0 has the day-one defect, but it is very unlikely to be hit without the regression introduced in 13.1.

The upcoming 13.2 (and stable/13) has both fixes, and there may be a patch available soon specifically for 13.1.
Comment 48 Bren 2022-11-29 21:45:34 UTC
Just wanted to post an update.  I disabled SACK but still had a couple of NFS hangs after that.  I didn't remount the NFS mounts after disabling SACK, so I think that might have been why.

Shortly after that I noticed an errata notice for TCP / SACK so I upgraded to 13.1-RELEASE-p3 to get that patch. We haven't had any NFS hangs since. I don't know if the NFS hangs were being caused by this SACK bug or something else but it appears to be fixed in p3!
Comment 49 Rick Macklem freebsd_committer freebsd_triage 2022-11-29 23:22:34 UTC
Setting the sysctl to disable SACK only affects
new TCP connections done after that, as far as I
understand it.

Since NFS normally only creates a new TCP connection
at mount time (the other case is where the old
TCP connection is broken by something like a
network partition), the setting probably
did not take effect on the extant mounts.
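
So to make the sysctl apply to an existing NFSv3 mount, the client has
to be forced to open a new TCP connection, e.g. by remounting; a sketch
from the Linux side, using the mount from the original report (adjust
the paths and options to your setup):

umount /home        # add -l (lazy) if processes are still blocked in NFS
mount -t nfs -o vers=3,proto=tcp,hard 192.168.3.51:/mnt/nashome /home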
Comment 50 Rick Macklem freebsd_committer freebsd_triage 2024-04-21 22:30:22 UTC
The fix for this is now in 13.1 and
newer, so I think this can now be closed.