About once per week a VMware ESX client hangs on its NFS 4.1 mount from my FreeBSD server, which is running a build based on stable/13 at 5dc6fedeb40 (16-Sep-2021). I have a packet trace from the last time I observed the problem. It shows normal NFS traffic culminating in a successful FSSTAT operation. Then the client sends alternating EXCHANGE_ID and CREATE_SESSION operations, but the server never replies at the NFS level; it does send TCP ACKs. The server continues to service other clients normally. I am unsure how to debug this further, but I can probably conduct any experiment you suggest. I can also provide the pcap files if necessary.
Either email me the packet trace as an attachment or put it here and I'll take a look. --> Normally ExchangeID/CreateSession should only happen once, either when the mount is done or after a server reboot. However, the server should reply with an error if, for some reason, the client has done it again. Do you use "trunking"? The ESX client had a serious bug where it would try to do an ExchangeID/CreateSession on each TCP connection being trunked, instead of on just one of them and then sharing the ClientID and Session between the trunks. (This bug normally only showed up when the mount was first attempted, and it eventually got resolved once the ClientID and Session were acquired successfully for one of the connections. That was for 6.7, I think?)
Oh. You say 13.0-stable. I assume that means you have upgraded it to stable/13 recently? I ask because 13.0 had the TCP problem that was fixed by reverting r367492. stable/13 has an alternate fix that works ok, as far as I know. The characteristic of this problem is "netstat -a" showing a TCP connection in ESTABLISHED state with a non-zero Recv-Q.
Our stable/13 branch was last September, so we already have 032bf749fd4 (the permanent fix for the problem that reverting 367492 temporarily fixed). Is "trunking" an ESX term? That could be it, though we don't see this problem at mount time. It only shows up later, after the load gets high. I've uploaded the pcap file to freefall. It's in my home directory, named "slc-rb-nesx4.pcap8560".
Hmm. Took a look and it looks like a variant of the TCP bug. You'll notice that the last NFS reply on the NFSv4 connection (port #805 on the client) shows up at packet #653. After that, all there is from the FreeBSD end are ACKs, as you said. The NFSv3 RPCs near the end of the trace are done on other TCP connections (ports #800 and #804) and they work.

Can you roll back to before rscheff@'s fix and then revert r367492? (See PR#256280.) In other words, get that code back to its pre-r367492 state. --> The pre-r367492 code has worked ok, literally for decades. We believe that rscheff@'s fix is ok because otis@ did not observe a hang during two weeks of testing, but that doesn't guarantee it fixed the problem.

Another possibility is that the nfsd threads are getting hung trying to do some RPC around packet #653, but I would have expected that to eventually result in all nfsd threads being hung (and the server obviously is not in that state, since RPCs on other connections are still working).

If it happens again, do these commands on the FreeBSD server:
# ps axHl    <-- to look for "hung" nfsd threads
# netstat -a <-- to look at the TCP connection for the broken client
(If it is in ESTABLISHED state with a non-0 Recv-Q, the rscheff@ patch has not fixed the problem.)
Oh, and you can try disabling TSO. And add "procstat -a -kk" to the list of commands to do, to see what locks the nfsd threads are using.
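For what it's worth, the "netstat -a" check above can be scripted. Here's a hypothetical little Python helper (not part of FreeBSD; the column layout it assumes is the typical FreeBSD netstat output) that flags the symptom being described: a connection sitting in ESTABLISHED with a non-zero Recv-Q.

```python
# Hypothetical helper: scan `netstat -a`-style output for TCP connections
# in ESTABLISHED state whose Recv-Q is stuck above zero (the signature of
# the PR#256280 breakage).  Column positions are an assumption based on
# typical FreeBSD netstat output.
def stuck_connections(netstat_output):
    """Return (local, foreign) address pairs that look stuck."""
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local-Addr Foreign-Addr (state)
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            recv_q, state = fields[1], fields[5]
            if state == "ESTABLISHED" and recv_q.isdigit() and int(recv_q) > 0:
                stuck.append((fields[3], fields[4]))
    return stuck

# Sample output with one healthy and one stuck connection (made-up data).
sample = """\
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 172.30.99.32.2049      172.30.156.243.805     ESTABLISHED
tcp4   73728      0 172.30.99.32.2049      172.30.156.244.806     ESTABLISHED
"""
print(stuck_connections(sample))
```

In a real check you would feed it the output of `netstat -a` directly; an empty list would mean the rscheff@ fix is holding up.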
Ok, I'll try rolling back to 13.0-RELEASE with r367492 reverted on Monday. But since the bug only reproduces about once per week, I'll have to run this way for a long time to know if it made a difference.
So rolling back didn't work: I wasn't able to boot, perhaps because I had upgraded my zpools, and I didn't have enough time to debug it. I was able to reproduce the issue, though. tcpdump again showed a stream of EXCHANGE_ID/CREATE_SESSION messages, netstat did not show any hosts with a Recv-Q stuck above zero, and procstat did not show any nfsd threads blocked. Do you have any other ideas?
When I looked at the packet trace, I noted where the problem occurred, which was long before the ExchangeID repeats. (I think the client is somehow trying to recover, but in a weird way.) In the packet trace, there was no reply to the request in packet #644. It is a packet with two compounds in it, but Linux does that to the server without any problems, from time to time. If there is no "stuck" TCP connection and no hung nfsd threads, then hmm... I'll take another look at the packet trace (I think I still have it), focusing on the session (sequence ops) around where the request does not get a reply. --> Is it possible for your server's net interface to drop a reply such that it doesn't show up in the packet trace? --> Did you try disabling TSO? It is the most common breakage for net interfaces and their device drivers. Actually, the fact that there is no non-zero Recv-Q is good news, since it indicates that the PR#256280 breakage is not recurring.
I took another look at the packet trace and I do not see any problems w.r.t. the session (the sequence ops are using different slots). However, I noticed that the request in packet #644 seems to be created from 4 large TCP segment fragments. (I'm not sure what Wireshark means by TCP segment fragments, since each TCP segment is a "segment". Maybe it's just its way of saying the 4 TCP segments make up the two RPC requests?) Are you using jumbo frames on a net interface that will use jumbo mbuf clusters? I think you probably already know that jumbo mbuf clusters are an accident looking for a place to happen, due to fragmentation of the mbuf cluster pool. If this is the case, try either a network interface that can do jumbo frames without using jumbo mbuf clusters or turn off the jumbo frames. The lack of a reply to the requests in packet #644 is what causes the mount to mess up, and it looks like some sort of network fabric problem, possibly tickled by having a large Write and a Read lumped together in the same TCP segments?
So the bug just happened again, and this time it coincided with a loss of LACP sync. One of the two lagged interfaces dropped from ACTIVE,COLLECTING,DISTRIBUTING to just ACTIVE, and it logged "Interface stopped DISTRIBUTING, possible flapping". At that exact time, the client's mount hung. That's highly suspicious. I can't find any evidence for that happening during previous events, however. But it does suggest that something in the network stack is not handling packet loss well.
I took another look at the packet trace, focusing on the TCP stuff starting around packet #644. (You should look too, because my TCP knowledge is minimal and rusty.) Starting at packets #657, #659, ... there are ACKs from the VMware client to the FreeBSD NFS server, and they all list "Win=2048". Does that mean that FreeBSD can only send 2048 bytes? The window does not seem to open up. Please take a look, and maybe get someone conversant with TCP to look as well.
I reproduced the issue again, deliberately this time, by downing one leg of the LAGG during high traffic. This time I tailored the packet capture more narrowly, so it didn't drop any packets (the original pcap file contained omissions simply because tcpdump couldn't write to disk fast enough). Crucially, it shows that the client sent a DESTROY_SESSION RPC which didn't show up in the original pcap file. The sequence looks like this:
1) The client's (172.30.156.243) last regular NFS call is packet 84.
2) After that are a ton of TCP segment reassemblies, probably related to the lagg interruption.
3) In packet 472, the client sends DESTROY_SESSION.
4) The server (172.30.99.32) replies NFS4_OK in packet 474.
5) The client sends EXCHANGE_ID in packet 475.
6) The server responds with NFS4_OK and clientid 0xd9e0ee6135000000 in packet 477.
7) The client sends CREATE_SESSION in packet 478 with clientid 0xd9e0ee6135000000.
8) The server replies NFS4ERR_STALE_CLIENTID in packet 480.
9) Go back to step 5 and loop.
Could there be a problem in how we handle the DESTROY_SESSION RPC? If you want to look, I uploaded the new packet trace to my home directory on freefall, named "slc-rb-nesx4-7-feb.create-session.pcap".
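The loop described above can be sketched as a toy model (hypothetical Python, not FreeBSD code): the client gets a fresh client ID and seqid from EXCHANGE_ID, but then sends a CREATE_SESSION seqid of its own choosing, so every iteration fails the same way and the loop never terminates.

```python
# Toy model of the observed recovery loop.  All names and values are
# illustrative; the seqid values (server expects 0, client sends 39)
# come from the packet trace described in this thread.

class Server:
    """Minimal stand-in for the NFSv4.1 server's client-ID state."""
    def __init__(self):
        self.confirmed_seqid = None

    def exchange_id(self):
        # A fresh EXCHANGE_ID reply carries the seqid the client must
        # echo back in CREATE_SESSION's csa_sequence.
        self.confirmed_seqid = 0
        return ("NFS4_OK", 0xd9e0ee6135000000, self.confirmed_seqid)

    def create_session(self, clientid, csa_sequence):
        if csa_sequence != self.confirmed_seqid:
            return "NFS4ERR_STALE_CLIENTID"  # the server's reply in the trace
        return "NFS4_OK"

def esxi_recovery(server, attempts):
    """Model the client bug: ignore the seqid returned by EXCHANGE_ID."""
    replies = []
    for _ in range(attempts):
        status, clientid, seqid = server.exchange_id()
        # Bug: the client reuses its own stale seqid (39) instead of `seqid`.
        replies.append(server.create_session(clientid, csa_sequence=39))
    return replies

print(esxi_recovery(Server(), 3))
```

If the client echoed the returned seqid instead of its own stale one, the first CREATE_SESSION would succeed and the loop would end.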
Ok, I looked, and this time the packet trace makes it clear: the ESXi client is broken. If you look at the seqid after the clientid in packet #477 (the EXCHANGE_ID reply), you will see it is 0. If you look at the seqid in the CREATE_SESSION request in packet #478, you will see it is 39. They must be the same, as noted by the RFC:

   csa_sequence: Each client ID serializes CREATE_SESSION via a
   per-client ID sequence number (see Section 18.36.4). The
   corresponding result is csr_sequence, which MUST be equal to
   csa_sequence.

The RFC does not seem to make it crystal clear what the error reply should be. The FreeBSD server uses NFS4ERR_STALE_CLIENTID. If the VMware engineers argued that it should be a different error, that could be changed in the FreeBSD server code. Since they send a bogus seqid argument, I doubt it matters which error the FreeBSD server sends in the reply.

This is basically the same bug that shows up when multiple trunks (multiple TCP connections) are used by the client. In that case, it usually succeeded after several hundred tries, when the seqid values happened to be the same. If you have a maintenance contract with VMware, you can try reporting the bug to them; I have no way of doing so.

Also, doing the DESTROY_SESSION, EXCHANGE_ID, CREATE_SESSION makes no sense to me. If there is a problem with a TCP connection, other clients (Linux and FreeBSD, plus others) create a new TCP connection. They only do the above operations if the server replies NFS4ERR_EXPIRED.
Great find! We are following up with VMware now. About that error code, though: if I'm reading the RFC correctly, it seems like we should send NFS4ERR_SEQ_MISORDERED instead of NFS4ERR_STALE_CLIENTID, according to the first paragraph on page 521. Do you agree? I'm getting set up to reproduce this problem on demand now, so hopefully I'll be able to test any fixes to either side. Also, FTR, I discovered that nfsrevoke can snap the client out of its infinite loop.
Well, deciding what the correct error is can be fun. Here's the definition of NFS4ERR_SEQ_MISORDERED in RFC5661:

   15.1.11.8. NFS4ERR_SEQ_MISORDERED (Error Code 10063)
   The requester sent a Sequence operation with an invalid sequence ID.

It seems to indicate the error is specifically for the Sequence operation. But it does appear that you are correct, since the description of CREATE_SESSION talks about a single-slot session used just for EXCHANGE_ID/CREATE_SESSION and returning NFS4ERR_SEQ_MISORDERED when the sequence ID is not correct. I'll come up with a patch for this and attach it to this PR, but I doubt it will fix this problem.

nfsrevoke causes the server to reply NFS4ERR_ADMIN_REVOKED when the client ID is referenced by the client. This probably causes the client to throw away its notion of the client ID, and then I suspect it gets the seqid in CREATE_SESSION correct. (Looking at the packet trace for this case could be interesting. ;-)
Created attachment 231662 [details]
return NFSERR_SEQMISORDERED for CREATE_SESSION with wrong seqid

This patch changes the NFSv4.1/4.2 server to return NFSERR_SEQMISORDERED for a Create_session with a wrong seqid, instead of NFS4ERR_STALE_CLIENTID. Alan, if you can test this patch and confirm the server now replies NFS4ERR_SEQ_MISORDERED for the Create_session, that would be appreciated, since I have no easy way to test it.
1) After nfsrevoke, the server does reply NFS4ERR_ADMIN_REVOKED to the next CREATE_SESSION. Then the client tries a new EXCHANGE_ID, and this time the server replies with a different client_id. The next CREATE_SESSION is successful. 2) With your patch, the server does reply NFS4ERR_SEQ_MISORDERED. 3) However, the ESXi client doesn't treat NFS4ERR_SEQ_MISORDERED any differently. It still gets stuck in an infinite loop until nfsrevoke fixes it.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=17a56f3fabdfacb62f6d8858643cdb1861c8c1b5

commit 17a56f3fabdfacb62f6d8858643cdb1861c8c1b5
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2022-02-09 23:17:50 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2022-02-09 23:17:50 +0000

    nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

    The ESXi NFSv4.1 client bogusly sends the wrong value for the
    csa_sequence argument for a Create_session operation. RFC8881
    requires this value to be the same as the sequence reply from the
    ExchangeID operation most recently done for the client ID.

    Without this patch, the server replies NFSERR_STALECLIENTID, which
    is the correct response for an NFSv4.0 SetClientIDConfirm but is not
    the correct error for NFSv4.1/4.2, which is specified as
    NFSERR_SEQMISORDERED in RFC8881. This patch fixes this.

    This change does not fix the issue reported in the PR, where the
    ESXi client loops, attempting ExchangeID/Create_session repeatedly.

    Reported by:    asomers
    Tested by:      asomers
    PR:             261291
    MFC after:      1 week

 sys/fs/nfsserver/nfs_nfsdstate.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
Created attachment 231733 [details]
admin revoke client when create_session seqid not equal exchange_id's

You could try this patch, which essentially does an nfsrevoke(8) when the seqid error is received. It probably should not go into FreeBSD, since it pretty clearly violates RFC8881.
Created attachment 231735 [details]
Workaround ESXi's CREATE_SESSION sequence_id bug

Thanks for that idea. Attached is what I'm testing right now. It works for my simple reproduction case, and I'll find out by Monday if it works for real. What it does is check the nii_domain of the client: if it's "vmware.com", then it simply ignores sequence_id during CREATE_SESSION. It also adds a sysctl to switch that behavior on or off.
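The decision logic in that workaround can be sketched roughly like this (hypothetical Python model only; the actual patch is C in sys/fs/nfsserver/nfs_nfsdstate.c, and the function and variable names here are made up):

```python
# Model of the attachment's workaround: tolerate a wrong CREATE_SESSION
# seqid, but only for clients whose nii_domain is "vmware.com" and only
# when the (modeled) sysctl knob is enabled.  Names are illustrative.

ignore_esxi_seqid = True  # stands in for the new sysctl toggle

def check_create_session_seqid(nii_domain, csa_sequence, expected_seqid):
    if csa_sequence == expected_seqid:
        return "NFS4_OK"                  # normal, RFC-conformant case
    if ignore_esxi_seqid and nii_domain == "vmware.com":
        return "NFS4_OK"                  # workaround: ignore the bogus seqid
    return "NFS4ERR_SEQ_MISORDERED"       # behavior after commit 17a56f3

print(check_create_session_seqid("vmware.com", 39, 0))
print(check_create_session_seqid("example.org", 39, 0))
```

Keying the exception off nii_domain keeps conformant clients on the strict RFC8881 path, which is also why it reads as a hack: it special-cases one vendor's implementation identity.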
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7deb47265d12caf83fb3d0ed047fc291da187937

commit 7deb47265d12caf83fb3d0ed047fc291da187937
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2022-02-09 23:17:50 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2022-02-16 01:02:28 +0000

    nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

    The ESXi NFSv4.1 client bogusly sends the wrong value for the
    csa_sequence argument for a Create_session operation. RFC8881
    requires this value to be the same as the sequence reply from the
    ExchangeID operation most recently done for the client ID.

    Without this patch, the server replies NFSERR_STALECLIENTID, which
    is the correct response for an NFSv4.0 SetClientIDConfirm but is not
    the correct error for NFSv4.1/4.2, which is specified as
    NFSERR_SEQMISORDERED in RFC8881. This patch fixes this.

    This change does not fix the issue reported in the PR, where the
    ESXi client loops, attempting ExchangeID/Create_session repeatedly.

    PR:             261291

    (cherry picked from commit 17a56f3fabdfacb62f6d8858643cdb1861c8c1b5)

 sys/fs/nfsserver/nfs_nfsdstate.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=95e95764918f97112bb05282d62d9777b706c2fb

commit 95e95764918f97112bb05282d62d9777b706c2fb
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2022-02-09 23:17:50 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2022-02-16 22:59:41 +0000

    nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

    The ESXi NFSv4.1 client bogusly sends the wrong value for the
    csa_sequence argument for a Create_session operation. RFC8881
    requires this value to be the same as the sequence reply from the
    ExchangeID operation most recently done for the client ID.

    Without this patch, the server replies NFSERR_STALECLIENTID, which
    is the correct response for an NFSv4.0 SetClientIDConfirm but is not
    the correct error for NFSv4.1/4.2, which is specified as
    NFSERR_SEQMISORDERED in RFC8881. This patch fixes this.

    This change does not fix the issue reported in the PR, where the
    ESXi client loops, attempting ExchangeID/Create_session repeatedly.

    PR:             261291

    (cherry picked from commit 17a56f3fabdfacb62f6d8858643cdb1861c8c1b5)

 sys/fs/nfsserver/nfs_nfsdstate.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
Let's put this bug to rest. The commit that rmacklem made fixes a minor protocol nit, but isn't really the cause of the problem. The root cause is an ESXi bug, not a FreeBSD bug. Attachment 231735 [details] works around the problem, but it's such a hack that I don't want to commit it to FreeBSD. Hopefully VMware will fix their bug, and if not, then hopefully Google will direct their users here.