I am observing frequent NFS mount hang-ups after upgrading my NAS to 13-stable. The FreeBSD box runs nfsd serving a ZFS pool, and a Gentoo Linux box is connected to it via a 10GbE fiber link. Once in a while, perhaps when the ZFS load gets high (afpd is also running), NFS access from the Linux side hangs and then recovers after a few minutes. The following message is printed in the Linux dmesg:

May 30 22:04:34 mayhome kernel: nfs: server 192.168.3.51 not responding, still trying

After a while, a few minutes or so, access recovers:

May 30 22:06:35 mayhome kernel: nfs: server 192.168.3.51 OK

This behavior is only observed after updating the NAS to 13-stable via the usual buildworld/buildkernel procedure. The Linux side remains the same and no hardware changed on either side; 12-stable did not exhibit any of this.

The NAS's NIC serving nfsd:

t5nex0: <Chelsio T520-SO> mem 0xdd300000-0xdd37ffff,0xdc000000-0xdcffffff,0xdd884000-0xdd885fff irq 16 at device 0.4 on pci1
cxl0: <port 0> on t5nex0
cxl0: Ethernet address: 00:07:43:31:9c:80
cxl0: 8 txq, 8 rxq (NIC); 8 txq (TOE), 2 rxq (TOE)
cxl1: <port 1> on t5nex0
cxl1: Ethernet address: 00:07:43:31:9c:88
cxl1: 8 txq, 8 rxq (NIC); 8 txq (TOE), 2 rxq (TOE)
t5nex0: PCIe gen2 x8, 2 ports, 22 MSI-X interrupts, 54 eq, 21 iq

Linux NFS mount flags:

/home from 192.168.3.51:/mnt/nashome
Flags: rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.51,mountvers=3,mountport=855,mountproto=udp,local_lock=all,addr=192.168.3.51
Created attachment 225416 [details]
reverts commit r367492

r367492 causes NFS clients (mostly Linux ones) to hang intermittently. This patch reverts r367492.
This is most likely caused by r367492, which shipped in 13.0. If that is the cause, you will observe the following when a Linux client is hung:

# netstat -a

will show the client's TCP connection as ESTABLISHED, with a Recv-Q that is non-zero and probably getting larger.

If that is the case, you need to do one of:
- revert r367492 by applying the patch in the attachment;
- apply the patch in D29690 to your kernel, which is believed to fix the problem;
- wait until commit 032bf749fd44 is MFC'd to stable/13 and then upgrade to stable/13 (the MFC should happen in early June).

However, if "netstat -a" shows the TCP connection in CLOSE_WAIT, then you need the patch that is the first attachment on PR#254590, which is already in stable/13 (MFC'd Apr. 27).
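A quick way to watch for this on the server is a loop along these lines (192.168.3.1 is just an example client address):

while true; do
    # Show the NFS client's connection; a stuck one stays ESTABLISHED
    # while its Recv-Q keeps climbing.
    netstat -an | grep 192.168.3.1
    sleep 10
done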
I confirm that "netstat -a" shows the client TCP connection as ESTABLISHED and the Recv-Q went very high during a stall:

tcp4  230920      0  192.168.3.51.nfsd      192.168.3.1.760        ESTABLISHED

I've applied the revert-r367492 patch. I'll report back if I see a stall again.
I've set the MFC-stable13 flag to refer to commit 032bf749fd44, not to the attachment patch, which is just a workaround until commit 032bf749fd44 is in your FreeBSD 13 kernel.
*** Bug 255083 has been marked as a duplicate of this bug. ***
I am seeing similar symptoms on 13.0-RELEASE, but can't confirm with netstat until some off-work hours. Can someone confirm from svn/git that this bug does in fact affect 13.0-RELEASE? And, if so, can we get a backport into RELEASE so that we don't have to compile a custom kernel workaround or update to -STABLE? This is a pretty brutal bug; at the very least a workaround that doesn't require following STABLE or temporarily building a custom kernel would be nice. I'm just a home user, but I imagine that enterprise installations following -RELEASE will be extremely annoyed by this bug.
Ah, looks as though it's in there. The SVN revision number should probably have been a big enough hint that it's a commit that came in a long time ago, I guess:

https://cgit.freebsd.org/src/commit/?h=releng/13.0&id=4d0770f1725f84e8bcd059e6094b6bd29bed6cc3

Someone should change "Version:" in the PR to 13.0-RELEASE.
Change to 13.0-RELEASE since a fix is now in 13-STABLE.
Any news on this getting into releng? I saw that on stable/13 there's a bunch of traffic on rack.c, some of which refers to leaked mbufs due to long-dormant bugs. What do I lose by applying this patch that reverts the changes to the loss recovery?
Ahh, sorry for the double comment post; it looks like I can't delete it. I pressed submit twice in succession and didn't read the message that followed carefully enough.
The patch may not apply cleanly against rack.c / bbr.c when coming directly from 13.0-RELEASE. Unless you are actually using the RACK or BBR TCP stack, you can (locally) skip patching those files and keep the rack.c source as it was before the patch. (If you do use RACK or BBR, you want the fixes that are in stable/13 and should not be running 13.0-RELEASE; and if you later switch to RACK with the unpatched file, you will run into locking issues that trigger a core.) Getting a "clean" patch against 13.0-RELEASE would involve bundling all of the rack.c commits into it as well - or risk the source files getting out of sync - so it is more involved to provide directly. I hope this comment is helpful to you.
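If you are unsure which TCP stack a box is running, the available and default stacks can be checked with something like:

# Shows the loaded TCP stacks and which one new connections use by default.
sysctl net.inet.tcp.functions_available net.inet.tcp.functions_default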
(In reply to Richard Scheffenegger from comment #12)
Nope, just the default loss control algorithm.
I have a server I just upgraded from 12.2-RELEASE to 13.0-RELEASE and am now experiencing this issue. Reading through the comments here, what are the options to resolve this? I have a server that I can't easily downgrade now, and I would really prefer not to have to completely rebuild it on 12.2-RELEASE via a new install. I'm shocked this bug was reported in May and yet it's now the end of December and 13.0-RELEASE still has this issue.
To fix the problem, you have the following options other than reverting back to FreeBSD 12:
- Apply the patch in the attachment here to your kernel sources and build/install that kernel.
- Upgrade to stable/13.
- The fix will be in 13.1.

The patch that fixed this in stable/13 was not errata'd by the committer, so it is not available via freebsd-update. It was a bug introduced into the TCP stack that did not appear to have anything to do with NFS; that is why it took so long to resolve. Here's the commit log message for r367492 (just fyi):

    Prevent premature SACK block transmission during loss recovery

    Under specific conditions, a window update can be sent with outdated
    SACK information. Some clients react to this by subsequently delaying
    loss recovery, making TCP perform very poorly.

What the commit actually changed was the timing and locking rules for socket receive upcalls, which broke the NFS server under certain conditions.
If it matters at all, I disabled LRO and TSO on all NICs per the thread below, and that seems to stabilize things without additional changes for the moment. Not ideal, but it at least got the system operational quickly.

https://muc.lists.freebsd.current.narkive.com/bqlZ5JRb/nfs-issues-since-upgrading-to-13-release

I don't see any details on the website about when 13.1-RELEASE is slated, so I assume that's still a ways out at this point. I would rather apply a kernel patch than move a production box to stable, so I'll need to figure out how to apply the patch. I have built/installed custom kernels in the past, just never applied patches like this.
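For reference, disabling them on the fly amounts to roughly this (the ix/igb interface names are just examples; to make it persistent, add "-tso -lro" to the matching ifconfig_* lines in /etc/rc.conf):

# Disable TSO and LRO on the interfaces serving NFS.
ifconfig ix0 -tso -lro
ifconfig igb0 -tso -lro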
Yep. Disabling TSO has worked for others, too. If you do decide to patch your kernel, the patch in the attachment should apply to a 13.0 kernel, and it puts the code back the way it had been for a long time. So I think it is safe to apply, and it fixes the problem if the problem persists with TSO disabled. Many NICs/drivers don't get TSO right, but other than that, disabling it would only affect timing. I was never able to reproduce the problem, which consists of a receive socket upcall being missed.
Is there anything I can do to capture data for you, since it's 100% reproducible here? Otherwise, disabling TSO/LRO seems to have it stable at the moment. The system uses Intel NICs (ix and igb drivers), if that matters regarding TSO/LRO. Personally, I would love to wait for 13.1-RELEASE, but I assume that won't be for a while and I need to keep this box stable, so the kernel patch is likely the route I have to go as it's the least complicated.
If disabling TSO works, great. If not, patch your kernel. Use the patch command to apply the patch; the build, etc. is explained in comment #9 of PR#254590.
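Roughly, the steps look something like this (the patch path and kernel config name are only examples; see the Handbook chapter on building kernels for details):

cd /usr/src
patch < /path/to/revert-r367492.patch    # adjust the -p level to match how the patch was generated
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now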
I was probably bitten by the same bug, but my setup is slightly different:
- server: 13.0-RELEASE-p4/p5, UFS for the system, ZFS on /home (also NFS exported)
- client: 13.0-RELEASE-p4/p5, /home mounted from the server

Under heavy load (find over a directory tree with millions of files[1], md5 and/or mv them elsewhere) the server hangs: the CPU is not responding, caps lock does not work, and the server needs a power cycle. I can't run any debug command on client or server because of the hang. Also note that I'm fairly new to ZFS; the server machine was full-UFS until last week (no NFS crashes nor any other UFS issues), so IMHO it is definitely something related to the NFS + ZFS mix.

I just tried to disable TSO on the Broadcom NIC on the server (no TSO available on the client); I'll let you know if it's enough (hopefully) while waiting for 13.1 (both client and server).

# freebsd-version -kru
13.0-RELEASE-p4
13.0-RELEASE-p4
13.0-RELEASE-p5

(server)# ifconfig | grep -iB1 tso
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>

# grep ^bge /var/run/dmesg.boot
bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xfe9f0000-0xfe9fffff irq 18 at device 0.0 on pci3
bge0: CHIP ID 0x05784100; ASIC REV 0x5784; CHIP REV 0x57841; PCI-E
bge0: Using defaults for TSO: 65518/35/2048

Should I also disable HWTSO?

[1] something like /var/db/freebsd-update/files/{long_hex_name_like_md5_or_sha} to be reorganized under 00-ff subdirs
If the server was hung, it is not this bug. If disabling TSO does not help, you'll need to get a dump when it hangs again and then either:
1 - create another bug report, noting a hang and ZFS, or
2 - post the information to freebsd-current@ (technically it would be freebsd-stable@, but all the developers tend to read freebsd-current@).

I know nothing about ZFS, but it can be a resource hog and this sounds like a ZFS tuning problem. You might need to limit your ARC size, or something like that.
ARC tuning is _rarely_ needed these days, but if you have an extremely memory-limited machine, maybe? It doesn't sound like this bug if you never saw it with UFS and it hard-locks your machine (I certainly never observed that). I will say that unresponsive disks with buggy firmware, or AHCI controllers that aren't quite up to snuff, can make ZFS hang. SAS controllers that end up having to set up watchdogs and reset their firmware can make ZFS have a fit, since ZFS can't really handle a case where suddenly all of the drives in a pool disappear.
(In reply to Rick Macklem from comment #21)
(In reply to Adam Stylinski from comment #22)

Thanks for the follow-up. My hang is probably not related to this bug, but this was the only recent thread with ZFS+NFS :(

The server machine is probably not enough for ZFS: it is an HP MicroServer gen7 N40L with 2GB of RAM (I have another MicroServer, gen8, with 8GB of RAM, a UFS system boot, and 4 disks in raidz1, but not NFS exported). I'm waiting for a second disk to set up mirroring of the /home partition.

Should I revert to UFS for the /home partition? Expand the RAM? Swap the gen7 for another gen8? Switch to gmirror? Anyway, with -tso4 it is still running (crossing fingers). Thanks for your time and any suggestions.
Expanding memory, if viable, is something I'd do anyway with 2GB. You can probably shrink the ARC max to something pretty low, though. Once upon a time, on a Pentium III laptop, I managed to lower the ARC max to something small enough that it worked within 768MB of memory. It's doable, at a penalty for the extra ARC misses.
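For example, the cap can be set with a loader tunable; the value below is only an illustration for a small-memory machine:

# /boot/loader.conf
vfs.zfs.arc_max="512M"

On 13.x the same knob can also be adjusted at runtime via sysctl vfs.zfs.arc_max (value in bytes).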
(In reply to Rick Macklem from comment #2)
We see this or a similar problem between FreeBSD 13.0-p8 clients and FreeBSD 13.0-p8 servers.
(In reply to Rick Macklem from comment #15)
> The patch that fixed this in stable/13 was not errata'd by
> the committer, so it is not available via freebsd-update.

I don't understand this part. Why isn't it simply errata'd by someone from the FreeBSD team then, so that it goes into freebsd-update?
In this case the patch was a somewhat involved change in the TCP stack. The author (rscheff@, not me) chose not to do it as an errata. You would need to ask the author why it was not done as an errata, but it is possible that it was tied to other changes that were not done as errata either. There is also the question of resources. Remember that most FreeBSD committers work on FreeBSD as volunteers. (For example, I have never been paid anything by anyone for doing work on FreeBSD.)
Will this change at least be present in 13.1? It ought to be, right? Releases spin off of stable.
rscheff@'s patch (not the same as reverting r367492), which is believed to fix the problem dealt with by this PR, is in 13.1.

Note that the key characteristic of these hangs is "netstat -a" showing the TCP connection for the hung client in the ESTABLISHED state with an increasing Recv-Q, while other clients continue to function. If you do not observe the above, then your hangs are not this bug.
For reference:
- The commit hash of the fix in the releng/13.1 branch: 55cc0a478506ee1c2db7b2f9aadb9855e5490af3
- Phabricator link: https://reviews.freebsd.org/D29690
We started experiencing this bug after upgrading to 13.0-RELEASE. We have a large file server that serves files over NFS to several Linux-based Apache/PHP servers, which serve a fairly high-traffic website. NFS mounts would frequently hang and then recover after a few minutes throughout the day, just like others were seeing. This stopped after upgrading to 13.1-RELEASE; however, now we're seeing a similar but more serious issue. About once a week the NFS mounts on one of the web servers will completely hang and never recover. I have to reset the web server in order to recover from this. It's never the same one, and luckily it's only been one at a time up to this point. The last time this happened I forgot to look at "netstat -a", but I will next time to see if the Recv-Q increments like it did before. I wanted to post here ahead of time to see if there is anything else I can look at when this happens. It seems like the exact same issue as before, but now the mounts never recover.
OK, two web servers' mounts hung this morning. "netstat -a" on the server showed a Recv-Q and Send-Q of 0 for the hung client.

# freebsd-version -kru
13.1-RELEASE
13.1-RELEASE
13.1-RELEASE

Linux mount options:
proto=tcp,rw,nosuid,nodev,noexec,hard,timeo=600,retrans=2,vers=3,rsize=1048576,wsize=1048576,actimeo=5,noatime,nodiratime,x-systemd.requires=network-online.target

These are the NICs we're using:

ixl1: <Intel(R) Ethernet Connection X722 for 10GBASE-T - 2.3.1-k> mem 0x38bffd000000-0x38bffdffffff,0x38bfff800000-0x38bfff807fff irq 40 at device 0.1 on pci7
ixl1: fw 3.1.54559 api 1.5 nvm 3.2d etid 80000b4b oem 1.262.0
ixl1: PF-ID[1]: VFs 32, MSI-X 129, VF MSI-X 5, QPs 768, MDIO shared
ixl1: Using 1024 TX descriptors and 1024 RX descriptors
ixl1: Using 8 RX queues 8 TX queues
ixl1: Using MSI-X interrupts with 9 vectors
ixl1: Ethernet address: d8:c4:97:d1:1a:1f
ixl1: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl1: SR-IOV ready
ixl1: netmap queues/slots: TX 8/1024, RX 8/1024
ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None
ixl1: link state changed to UP

Following this thread, I collected tcpdumps on both the server and the client if anyone wants to see them:
https://www.mail-archive.com/freebsd-current@freebsd.org/msg184141.html

I might try disabling TSO and LRO to see if that helps. I'm also considering rolling back to 13.0-RELEASE, because at least then the mounts would recover after a few minutes.
I can look at the tcpdumps, if you let me know where they are.
Btw, since the Recv-Q was 0, it probably is not the bug fixed by reverting r367492.
I took a look at the packet trace, and it does not appear that the server is hung. I suspect that if you ran

# nfsstat -s -E

repeatedly, you would see the RPC counts increasing.

What the packet trace shows is the client doing a Write RPC repeatedly, getting an NFS3ERR_STALE reply each time (first reply at packet #1252, followed by many more Write RPC attempts for the same file). NFS3ERR_STALE means that the file no longer exists on the server. I do not know why it no longer exists, since there are no Remove RPCs in the trace.

I would consider it a client bug to keep retrying a Write RPC after an NFS3ERR_STALE reply, since the error is fatal and cannot be recovered from by retrying. You might look for some "cleanup" process/thread that might be deleting files before the NFS client expects them to go away.
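Something like this would show whether the counters keep moving while a client looks hung:

# Print the extended server-side RPC counters every 5 seconds.
while true; do
    date
    nfsstat -s -E
    sleep 5
done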
Thanks for looking at the tcpdumps. The server indeed doesn't hang for all clients; usually the mounts of only one of the three Apache/PHP servers hang at a time. I had two hang yesterday while I was collecting data and troubleshooting the one. It seems random. We have three nginx servers serving out static files which never do this, and various other servers using NFS that never do this either. It's only happening with the Apache/PHP servers.

If it's related to some sort of write issue, that would make sense, because the Apache/PHP servers are very busy doing reads and writes. Files are being uploaded, renamed, deleted, etc. thousands of times a day. We also have a script server running backend processes which modifies files all the time.

We've been using NFS like this for over twenty years, and with this particular setup for at least three now, without issue. We only started seeing NFS mounts hang when we upgraded to 13.0-RELEASE. I'm very skeptical that this is simply a client bug, but I could see a case where something in the server changed which has uncovered it, or something like that. Usually when we see a stale file handle, the logs report it. We have one process where this happens many times a day and the mounts never hang. All we're seeing from the client in this situation is:

kernel: [1261497.154370] nfs: server 10.10.1.20 not responding, still trying

Would it be helpful to have tcpdumps covering the time the hang occurs? I think I may have a way to grab 2-3 minute continuous dumps on both the file server and the web servers and then save the dumps when a hang occurs. This will take some work but I think it's doable. I'm also going to see if I can recreate this issue somehow.
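For the rotating captures, one approach is to let tcpdump cycle through files on its own (the interface name, host address, and path below are only examples):

# Rotate the capture every 180 seconds; with -W 20 tcpdump stops after
# 20 files, so restart it periodically and save the most recent files
# when a hang is noticed.
tcpdump -i ixl1 -s 0 -G 180 -W 20 \
    -w '/var/tmp/nfs-%Y%m%d-%H%M%S.pcap' host 10.10.1.20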
Ok, I had assumed that the packet trace was taken when a client was "hung". If not, the packet trace is irrelevant.

I did take another look, and each of the Write RPCs that gets the NFS3ERR_STALE reply is for a different offset, so the client is only attempting each write once, but it is doing so for a very large file.
- The largest offset in the packet trace is 1738260480.
- You can look at the packet trace by pulling it into wireshark. The first of these Writes is at packet #1250 and the last at packet #55040 in the server's capture.

If the client was "hung" at the time this capture was taken, the server is not "hung"; it is simply replying to RPCs.

You mention you have been using this setup for 20 years, but I suspect that you have upgraded the Linux client at various times. If this packet capture is from a "hung" client, I suspect a change in the Linux client has resulted in every write being attempted instead of failing after the first write fails with NFS3ERR_STALE.

In general, you should try to avoid removing a file on one client (or locally on the NFS server) while it is still being accessed, to avoid NFS3ERR_STALE errors.
I guess I misunderstood your last message. The packet traces were indeed taken while all six mounts of one of the web servers were hung/unresponsive. This could definitely be a client issue, but like I said, we never had an issue like this until we upgraded to 13.0-RELEASE. If this is a separate issue, then wouldn't we have seen both the ephemeral hangs and the permanent hangs in 13.0? We were only seeing ephemeral hangs, like others were, until we upgraded to 13.1.

There was a change in the Linux client recently (an upgrade from Debian 10 to 11), but as I recall we were seeing hangs immediately after upgrading the file server to 13.0, when the web servers were still on Debian 10. The permanent hangs only started after upgrading to 13.1.

As far as I know, we only modify large files on one web server at a time. They mainly handle file uploads, which can sometimes be quite large. We've never had an issue with this before and I don't think anything has changed there in a long time.

Would gathering packet traces while the hang occurs help? Is there any other data I can gather?
I doubt capturing packets while the hang occurs will make much difference. There might be a Remove RPC for the file in the capture, which would clarify why the file went away. As I said, if you look at the capture in wireshark, you will see that the server is just replying to RPCs; it does not appear to be hung. The client may appear to be hung (if you waited long enough, it might eventually recover after it has attempted all the writes). I have no idea why moving to FreeBSD 13 would trigger this. Go back to FreeBSD 12 on the server, if you'd like, since the NFS server code is essentially the same for NFSv3.
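If digging through the capture by hand is tedious, a display filter can narrow it down; for example, something like this should list only the NFSv3 REMOVE calls (exact field names can vary between Wireshark versions):

# NFSv3 procedure 12 is REMOVE; out.pcap is the capture discussed above.
tshark -r out.pcap -Y 'nfs.procedure_v3 == 12'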
I've been troubleshooting this on and off for months without any luck. Thankfully this is only happening a couple of times a week and usually on one server at a time (the Apache/PHP servers). Today, however, we had all NFS mounts hang on several servers for some reason (which took the site down and had me out of bed at 0630), including a script server, a couple of nginx servers (where the NFS mounts are read-only), and a couple of Apache/PHP servers. There is nothing in the logs, the traffic graphs, or anywhere really to indicate why this happened. It seems to have just happened out of the blue as usual. One server did show this:

2022-10-16T07:06:02.629829-04:00 scripts01 kernel: [7664840.363767] rpc_check_timeout: 14 callbacks suppressed
2022-10-16T07:06:02.631007-04:00 scripts01 kernel: [7664840.363768] nfs: server 10.10.1.20 not responding, still trying

Otherwise the rest of them only logged the second line.

We upgraded our web servers to Debian 11 around May 24th of this year. At that time we were running 13.0-p11 on the storage server and experiencing the bug first reported here: the NFS mounts would hang but always come back after a few minutes. It was only when we upgraded to 13.1-p0 around July 18 that we started seeing hung NFS mounts that would never recover. The NFS client didn't change during this period of time, and nothing else changed to my knowledge. It was only directly after we upgraded to FreeBSD 13.1-p0 that we started experiencing this issue, so I don't think this is a bug in Debian/nfs-common. Unfortunately we seem to be the only ones experiencing this (that I can find, anyway).

If anyone could provide any pointers as to what we could possibly do to get this fixed, I'd really appreciate it. I've searched high and low for a solution but haven't been able to get anywhere with it. At this point I'm considering rolling back to 13.0-p11, because at least then the mounts would recover. Or, worst case scenario, I might have to move us to a different platform, which I really don't want to do.
If you are running 13.1, your bug is not the one that this PR originally dealt with, as far as I know. Here's the kind of information you need to collect when a client is hung, to try and isolate the problem. (For this, "client" refers to an NFS client, although you may use the term "server" for those machines, and "server" refers to the NFS server.)

# netstat -a
- On both client and server. You are looking for the TCP connection NFS is using between them, what state it is in, and whether or not the Send or Recv Qs are non-empty (non-0).

# ping
- Ping the server from the client to make sure basic network connectivity exists.

# ps axHl
- On the NFS server, to see what the nfsd threads are up to.

# tcpdump -s 0 -w out.pcap host <nfs-client>
- Run it for a few minutes on the NFS server, then look at out.pcap in wireshark to see what, if any, NFS traffic is happening between them.

- On the NFS server, access the exported file system locally, to ensure that the file system itself seems to be working.

# vmstat -m and # vmstat -z
- On the NFS server, to see if some resource (like mbufs or mbuf clusters) seems to be exhausted.

The above information might tell you what is going on. If you cannot interpret it, post it.
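A rough sketch of gathering most of this in one go on the NFS server (the client address, output directory, and packet count are only examples):

#!/bin/sh
# Collect NFS hang diagnostics on the server while a client is hung.
CLIENT=10.10.1.21          # example client address
OUT=/var/tmp/nfs-debug     # example output directory
mkdir -p "$OUT"
netstat -a > "$OUT/netstat.txt"
ping -c 3 "$CLIENT" > "$OUT/ping.txt" 2>&1
ps axHl > "$OUT/ps-axHl.txt"
vmstat -m > "$OUT/vmstat-m.txt"
vmstat -z > "$OUT/vmstat-z.txt"
# Capture a few thousand packets to/from the hung client.
tcpdump -s 0 -c 5000 -w "$OUT/out.pcap" host "$CLIENT"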
Rick, I believe I have already collected almost all of this information for you in previous responses, including a tcpdump. I can collect it all over again if you'd like. The conclusion was that this was a bug in the Debian NFS client. After reviewing all of the information, I'm still not convinced this is a client bug since, like I've said a few times, this only started when we upgraded to 13.1. Nothing else changed as far as I know. Please let me know if there is anything else I can do to help determine why this is happening. At this point it looks like we are stuck with a broken NFS server implementation without many options.
Please post my email response that indicated a Linux NFS client problem here, if you still have it. If you don't have it, I might be able to find it buried deeply in my deleted email, but that is a last resort. This will give others a base from which to work on possible solutions.
"What the packet trace shows is the client doing a Write RPC repeatedly, getting a NFS3ERR_STALE reply each time. (First reply at packet#1252, followed by many more Write RPC attempts for the same file.) NFS3ERR_STALE - Means that the file no longer exists on the server. I do not know why it no longer exists, since there are no Remove RPCs in the trace. I would consider it a client bug to keep retrying a Write RPC after a NFS3ERR_STALE reply, since the error is fatal and cannot recover upon retries. You might look for some "cleanup" process/thread that might delete a file before the NFS client would expect that." No cleanup process runs to account for this. I've tried to recreate this manually numerous times but haven't been able to. The server is up, pingable, and still serving other clients. Once the client that's hung is reset it works fine until the next hang. This seems to happen randomly. Then this morning we had several other servers hang, two of which only had read only mounts so they couldn't have possibly been trying to write to a file.
PR#265588 might be relevant. It describes another TCP stack issue that was detected on a 13.1 NFS server. If you read through it, the reporter notes that disabling SACK and DSACK fixed the problem. This might be worth a try, Bren. I don't know, but I'd guess

# sysctl net.inet.tcp.sack.enable=0

on the NFS server does the trick. If you try this and it helps, please comment both here and on PR#265588.
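To keep the setting across reboots, the same line can also go into /etc/sysctl.conf:

# /etc/sysctl.conf
net.inet.tcp.sack.enable=0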
This definitely looks like it could be what we've been dealing with. Oddly enough, we haven't had any NFS hangs since the last one that took the site down. If this happens again I'll try disabling SACK and we'll see what happens. Thanks for bringing this to my attention.
13.1 was found to have a regression and a day-one defect. D36626 addresses the day-one problem, and D36046 the specific regression that exposes it. 13.0 has the day-one defect, but you are very unlikely to hit it without the regression introduced in 13.1. The upcoming 13.2 (and stable/13) have both fixes, and there may be a patch available soon for 13.1 specifically.
Just wanted to post an update. I disabled SACK but still had a couple of NFS hangs after that. I didn't remount the NFS mounts after disabling SACK, so I think that might have been why. Shortly after that I noticed an errata notice for TCP/SACK, so I upgraded to 13.1-RELEASE-p3 to get that patch. We haven't had any NFS hangs since. I don't know if the NFS hangs were being caused by this SACK bug or something else, but it appears to be fixed in p3!
Setting the sysctl to disable SACK only affects new TCP connections created after the change, as far as I understand it. Since NFS normally only creates a new TCP connection at mount time (the other case being where the old TCP connection is broken by something like a network partition), the setting probably did not take effect on the extant mounts.
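So, for the change to reach an existing mount, the mount generally has to be torn down and re-established on the client side, e.g. on a Linux client (the mount point is only an example):

client# umount /home && mount /home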