Bug 261291 - ESX NFS4.1 client hangs, server never responds to EXCHANGE_ID/CREATE_SESSION
Summary: ESX NFS4.1 client hangs, server never responds to EXCHANGE_ID/CREATE_SESSION
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.0-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-01-18 00:40 UTC by Alan Somers
Modified: 2022-05-04 20:53 UTC (History)
1 user (show)

See Also:


Attachments
return NFSERR_SEQMISORDERED for CREATE_SESSION with wrong seqid (837 bytes, patch)
2022-02-09 00:28 UTC, Rick Macklem
no flags Details | Diff
admin revoke client when create_session seqid not equal exchange_id's (1.85 KB, patch)
2022-02-10 23:10 UTC, Rick Macklem
no flags Details | Diff
Workaround ESXi's CREATE_SESSION sequence_id bug (3.85 KB, patch)
2022-02-10 23:28 UTC, Alan Somers
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Somers freebsd_committer 2022-01-18 00:40:10 UTC
About once per week a VMWare ESX client hangs on its NFS 4.1 mount from my FreeBSD server, which is running a build based on stable/13 at 5dc6fedeb40 (16-Sep-2021).  I have a packet trace from the last time I observed the problem.  It shows normal NFS traffic culminating with a successful FSSTAT operation.  Then, the client sends alternating EXCHANGE_ID and CREATE_SESSION operations, but the server never replies.  It does send TCP ACKs, but there is no NFS-level response.  It does continue to service other clients normally.

I am unsure how to debug this further, but I can probably conduct any experiment you suggest.  I can also provide the pcap files if necessary.
Comment 1 Rick Macklem freebsd_committer 2022-01-18 02:19:43 UTC
Either email me the packet trace as an attachment or put it here
and I'll take a look.
--> Normally ExchangeID/CreateSession should only happen once,
    either when the mount is done or after a server reboot.
    However, the server should reply with an error if, for
    some reason, the client has done it again.

Do you use "trunking"? The ESX client had a serious bug, where
it would try to do a ExchangeID/CreateSession on each TCP connection
being trunked, instead of just on one of them and then sharing
the ClientID and Session between the trunks.
(This bug normally only showed up when the mount was first attempted
 and eventually got resolved once the ClientID and Session was
 acquired successfully for one of the connections. That was for 6.7
 I think?)
Comment 2 Rick Macklem freebsd_committer 2022-01-18 02:33:46 UTC
Oh. You say 13.0-stable. I assume that means you have upgraded
it to stable/13 recently?
I ask, because 13.0 had the TCP problem that was fixed by
reverting r367492. stable/13 has an alternate fix that works ok,
as fa as I know.

The characteristic of this problem is "netstat -a" showing
a TCP connection in ESTABLISHED state, with a non-zero Recv-Q.
Comment 3 Alan Somers freebsd_committer 2022-01-18 02:40:43 UTC
Our stable/13 branch was last September, so we already have 032bf749fd4 (the permanent fix for the problem that reverting 367492 temporarily fixed).

Is "trunking" an ESX term?  That could be it, though we don't see this problem at mount time.  It only shows up later, after the load gets high.

I've uploaded the pcap file to freefall.  It's in my home directory, named "slc-rb-nesx4.pcap8560".
Comment 4 Rick Macklem freebsd_committer 2022-01-18 15:26:49 UTC
Hmm. Took a look and it looks like a variant of the TCP bug.
You'll notice that the last NFS reply for the NFSv4 connection
(port #805 on the client) shows up at packet #653.
After that, all there is from the FreeBSD end are ACks, as you
said.

The NFSv3 RPCs near the end of the trace are done on other TCP
connections (port#800 and #804) and they work.

Can you roll back to before rscheff@'s fix and then revert
r367492? (See PR#256280.) In other words, get that code back
to its pre-r367492 state.
--> The pre-r367492 code has worked ok, literally for decades.
We believe that rscheff@'s fix is ok because otis@ did not observe
a hang during two weeks of testing, but that doesn't guarantee it
fixed the problem.

Other possibilities are that the nfsd threads are getting hung
trying to do some RPC around packet #653, but I would have expected
that to result in all nfsd threads hung eventually (and the server
obviously is not in that state, since RPCs on other connections
are still working).

If it happens again, do these commands on the FreeBSD server:
# ps axHl  <-- to look for "hung" nfsd threads
# netstat -a <-- to look at the TCP connection for the broken client.
  (If it is in ESTABLISHED state with a non-0 Recv-Q, the rscheff@
   patch has not fixed the problem.)
Comment 5 Rick Macklem freebsd_committer 2022-01-19 03:43:49 UTC
Oh, and you can try disabling TSO.

And add "procstat -a -kk" to the list of commands to do,
to see what locks the nfsd threads are using.
Comment 6 Alan Somers freebsd_committer 2022-01-21 21:36:39 UTC
Ok, I'll try rolling back to 13.0-RELEASE with r367492 reverted on Monday.  But since the bug only reproduces about once per week, I'll have to run this way for a long time to know if it made a difference.
Comment 7 Alan Somers freebsd_committer 2022-01-27 15:51:20 UTC
So rolling back didn't work.  I wasn't able to boot for some reason, perhaps because I had upgraded my zpools.  And I didn't have enough time to debug it.

I was able to reproduce the issue.  tcpdump again showed a stream of ECHANGE ID/CREATE SESSION messages, netstat did not show any hosts with a RECV-Q stuck above zero, and procstat did not show any nfsd threads blocked.  Do you have any other ideas?
Comment 8 Rick Macklem freebsd_committer 2022-01-27 22:43:30 UTC
When I looked at the packet trace, I noted where the problem
occurred, which was long before the ExchangeID repeats.
(I think the client is somehow trying to recover, but in a
 weird way.)

In the packet trace, there was no reply to the request in
packet #644. It is a packet with two compounds in it, but
Linux does that to the server without any problems, from time
to time.

If there is no "stuck" TCP connection and no hung nfsd threads,
then hmm... I'll take another look at the packet trace (I think
I still have it), focusing on the session (sequence ops) around
where the request does not get a reply.
--> Is it possible for your server's net interface to drop a reply
    such that it doesn't show up in the packet trace?
--> Did you try disabling TSO, which is the most common breakage for
    net interfaces and their device drivers.

Actually, the fact that there is no non-zero RecvQ is good news, since
that indicates that the PR256280 breakage isn't still broken.
Comment 9 Rick Macklem freebsd_committer 2022-01-27 23:12:51 UTC
I took another look at the packet trace and I do not see
any problems w.r.t. the session (sequence ops are using
different slots).

However, I noticed that the request it packet #644 seems
to be created from 4 large TCP segment fragments.
(I'm not sure what Wireshark means by TCP segment fragments,
 since each TCP segment is a "segment". Maybe just its way of
 saying the 4 TCP segments make up the two RPC requests?)

Are you using jumbo frames for a net interface that will
use jumbo mbuf clusters?
- I think you probably already know that jumbo mbuf clusters are
  an accident looking for a place to happen, due to fragmentation
  of the mbuf cluster pool.

If this is the case, try either a network interface that can do
jumbo frames without using jumbo mbuf clusters or turn off the
jumbo frames.

The lack of a reply to the requests in packet #644 is what causes
the mount to mess up and it looks like some sort of network fabric
problem, possibly tickled by having a large Write and a Read lumped
together in the same TCP segments?
Comment 10 Alan Somers freebsd_committer 2022-01-27 23:57:45 UTC
So the bug just happened again, and this time it coincided with a loss of LACP sync.  One of the two lagged interfaces dropped from ACTIVE,COLLECTING,DISTRIBUTING to just ACTIVE, and it logged "Interface stopped DISTRIBUTING, possible flapping".  At that exact time, the client's mount hung. That's highly suspicious.  I can't find any evidence for that happening during previous events, however.  But it does suggest that something in the network stack is not handling packet loss well.
Comment 11 Rick Macklem freebsd_committer 2022-01-28 22:32:46 UTC
I took another look at the packet trace, focusing on
the TCP stuff starting around packet #644.
(You should look too, because my TCP is minimal and rusty.)

Starting at packet# 657, 659,... there are ACKs from the
VMware client to the FreeBSD NFS server and they all list
a "Win=2048".
Does that mean that FreeBSD can only send 2048 bytes?
- This window does not seem to open up.

Please take a look and maybe get someone conversant with
TCP to look as well.
Comment 12 Alan Somers freebsd_committer 2022-02-08 16:24:56 UTC
I reproduced the issue again.  Deliberately this time, by downing one leg of the LAGG during high traffic.  This time I tailored the packet capture more narrowly, so it didn't drop any packets (the original pcap file contained omissions just because tcpdump couldn't write to disk fast enough).  Crucially, it shows that the client sent a DESTROY_SESSION rpc which didn't show up in the original pcap file.  The sequence looks like this:

1) The client's (172.30.156.243) last regular NFS call is packet 84
2) After that are a ton of TCP segment reassemblies.  Probably related to the lagg interruption
3) In packet 472, the client sends DESTROY_SESSION
4) The server (172.30.99.32) replies NFS4_OK in packet 474
5) The client sends EXCHANGE_ID in packet 475
6) The server responds with NFS4_OK and clientid 0xd9e0ee6135000000 in packet 477
7) The client sends CREATE_SESSION in packet 478 with clientid 0xd9e0ee6135000000
8) The server replies NFS4ERR_STALE_CLIENTID in packet 480
9) Go back to step 5 and loop

Could there be a problem in how we handle the DESTROY_SESSION rpc?  If you want to look, I uploaded the new packet trace to my home directory on freefall, named "slc-rb-nesx4-7-feb.create-session.pcap".
Comment 13 Rick Macklem freebsd_committer 2022-02-08 22:43:15 UTC
Ok, I looked and this time the packet trace makes it clear.
The ESXi client is broken.

If you look at the seqid after the clientid in packet #477
(the Exchange_ID reply), you will see it is 0.

If you look at the seqid in the CREATE_SESSION request in packet #478,
you will see it is 39.

They must be the same, as noted by the RFC...
csa_sequence:

      Each client ID serializes CREATE_SESSION via a per-client ID
      sequence number (see Section 18.36.4).  The corresponding result
      is csr_sequence, which MUST be equal to csa_sequence.

The RFC does not seem to make it crystal clear what the error reply
should be. The FreeBSD server uses NFS4ERR_STALE_CLIENTID.
If the VMware engineers argued that it should be a different error,
that could be changed in the FreeBSD server code. Since they send a
bogus seqid argument, I doubt it matters which error the FreeBSD
server sends in the reply.

This is basically the same bug that shows up when multiple trunks
(multiple TCP connections) are used by the client. In that case,
it usually succeeded after several hundred tries, when the seqid values
happen to be the same.

If you have a maintenance contract with VMware, you can try reporting
the bug to them. I have no way of doing so.

Also, doing the DESTROY_SESSSION, EXCHANGE_ID, CREATE_SESSION makes no
sense to me. If there is a problem with a TCP connection, other clients
(Linux and FreeBSD, plus others) create a new TCP connection.
They only do the above commands if the server replies NFS4ERR_EXPIRED.
Comment 14 Alan Somers freebsd_committer 2022-02-08 23:19:09 UTC
Great find!  We are following up with VMWare now.  About that error code though, if I'm reading the RFC correctly, it seems like we should send NFS4ERR_SEQ_MISORDERED instead of NFS4ERR_STALE_CLIENTID, according to the first paragraph on page 521.  Do you agree?  I'm getting set up to reproduce this problem on demand now.  Hopefully I'll be able to test any fixes to either side.

Also, FTR I discovered that nfsrevoke can snap the client out of its infinite loop.
Comment 15 Rick Macklem freebsd_committer 2022-02-08 23:47:46 UTC
Well, deciding what the correct error is can be fun.
Here's the definition of NFS4ERR_SEQ_MISORDERED
in RFC5661:
15.1.11.8.  NFS4ERR_SEQ_MISORDERED (Error Code 10063)

   The requester sent a Sequence operation with an invalid sequence ID.

It seems to indicate the error is specifically for the Sequence operation.

But, it does appear that you are correct, since in the implementation
of CREATE_SESSION, it talks about a single slot session just for
EXCHANGE_ID/CREATE_SESSION and returning NFS4ERR_SEQ_MISORDERED when
it is not correct.

I'll come up with a patch for this and attach it to this PR, but I
doubt it will fix this problem.

nfsrevoke causes the sever to reply NFS4ERR_ADMIN_REVOKED when
the client ID is referenced by the client. This probably causes the
client to throw away its notion of the client ID and then I suspect
it gets the seqid in CREATE_SESSION correct.
(Looking at the packet trace for this case could be interesting.;-)
Comment 16 Rick Macklem freebsd_committer 2022-02-09 00:28:32 UTC
Created attachment 231662 [details]
return NFSERR_SEQMISORDERED for CREATE_SESSION with wrong seqid

This patch changes the NFSv4.1/4.2 server to return NFSERR_SEQMISORDERED
for Create_session with a wrong seqid, instead of NFS4ERR_STALE_CLIENTID.

Alan, if you can test this patch and confirm the server now replies
NFS4ERR_SEQ_MISORDERED for the Create_session, that would be appreciated,
since I have no easy way to test it.
Comment 17 Alan Somers freebsd_committer 2022-02-09 20:24:31 UTC
1) After nfsrevoke, the server does reply NFS4ERR_ADMIN_REVOKED to the next CREATE_SESSION.  Then the client tries a new EXCHANGE_ID, and this time the server replies with a different client_id.  The next CREATE_SESSION is successful.
2) With your patch, the server does reply NFS4ERR_SEQ_MISORDERED.
3) However, the ESXi client doesn't treat NFS4ERR_SEQ_MISORDERED any differently.  It still gets stuck in an infinite loop until nfsrevoke fixes it.
Comment 18 commit-hook freebsd_committer 2022-02-09 23:19:13 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=17a56f3fabdfacb62f6d8858643cdb1861c8c1b5

commit 17a56f3fabdfacb62f6d8858643cdb1861c8c1b5
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2022-02-09 23:17:50 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2022-02-09 23:17:50 +0000

    nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

    The ESXi NFSv4.1 client bogusly sends the wrong value
    for the csa_sequence argument for a Create_session operation.
    RFC8881 requires this value to be the same as the sequence
    reply from the ExchangeID operation most recently done for
    the client ID.

    Without this patch, the server replies NFSERR_STALECLIENTID,
    which is the correct response for an NFSv4.0 SetClientIDConfirm
    but is not the correct error for NFSv4.1/4.2, which is
    specified as NFSERR_SEQMISORDERED in RFC8881.
    This patch fixes this.

    This change does not fix the issue reported in the PR, where
    the ESXi client loops, attempting ExchangeID/Create_session
    repeatedly.

    Reported by:    asomers
    Tested by:      asomers
    PR:     261291
    MFC after:      1 week

 sys/fs/nfsserver/nfs_nfsdstate.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
Comment 19 Rick Macklem freebsd_committer 2022-02-10 23:10:47 UTC
Created attachment 231733 [details]
admin revoke client when create_session seqid not equal exchange_id's

You could try this patch, which essentially does an nfsrevoke(8) when
the seqid error is received.

It probably should not go in FreeBSD, since it pretty clearly violates
RFC8881.
Comment 20 Alan Somers freebsd_committer 2022-02-10 23:28:46 UTC
Created attachment 231735 [details]
Workaround ESXi's CREATE_SESSION sequence_id bug

Thanks for that idea.  Attached is what I'm testing right now.  It works for my simple reproduction case, and I'll find out by Monday if it works for real.

What it does is to check the nii_domain of the client.  If it's "vmware.com", then it simply ignores sequence_id during CREATE_SESSION.  It also adds a sysctl to switch that behavior on or off.
Comment 21 commit-hook freebsd_committer 2022-02-16 01:03:33 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7deb47265d12caf83fb3d0ed047fc291da187937

commit 7deb47265d12caf83fb3d0ed047fc291da187937
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2022-02-09 23:17:50 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2022-02-16 01:02:28 +0000

    nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

    The ESXi NFSv4.1 client bogusly sends the wrong value
    for the csa_sequence argument for a Create_session operation.
    RFC8881 requires this value to be the same as the sequence
    reply from the ExchangeID operation most recently done for
    the client ID.

    Without this patch, the server replies NFSERR_STALECLIENTID,
    which is the correct response for an NFSv4.0 SetClientIDConfirm
    but is not the correct error for NFSv4.1/4.2, which is
    specified as NFSERR_SEQMISORDERED in RFC8881.
    This patch fixes this.

    This change does not fix the issue reported in the PR, where
    the ESXi client loops, attempting ExchangeID/Create_session
    repeatedly.

    PR:     261291

    (cherry picked from commit 17a56f3fabdfacb62f6d8858643cdb1861c8c1b5)

 sys/fs/nfsserver/nfs_nfsdstate.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
Comment 22 commit-hook freebsd_committer 2022-02-16 23:01:07 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=95e95764918f97112bb05282d62d9777b706c2fb

commit 95e95764918f97112bb05282d62d9777b706c2fb
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2022-02-09 23:17:50 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2022-02-16 22:59:41 +0000

    nfsd: Reply NFSERR_SEQMISORDERED for bogus seqid argument

    The ESXi NFSv4.1 client bogusly sends the wrong value
    for the csa_sequence argument for a Create_session operation.
    RFC8881 requires this value to be the same as the sequence
    reply from the ExchangeID operation most recently done for
    the client ID.

    Without this patch, the server replies NFSERR_STALECLIENTID,
    which is the correct response for an NFSv4.0 SetClientIDConfirm
    but is not the correct error for NFSv4.1/4.2, which is
    specified as NFSERR_SEQMISORDERED in RFC8881.
    This patch fixes this.

    This change does not fix the issue reported in the PR, where
    the ESXi client loops, attempting ExchangeID/Create_session
    repeatedly.

    PR:     261291

    (cherry picked from commit 17a56f3fabdfacb62f6d8858643cdb1861c8c1b5)

 sys/fs/nfsserver/nfs_nfsdstate.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
Comment 23 Alan Somers freebsd_committer 2022-05-04 20:53:48 UTC
Let's put this bug to rest.  The commit that rmacklem made fixes a minor protocol nit, but isn't really the cause of the problem.  The root cause is an ESXi bug, not a FreeBSD bug.  Attachment 231735 [details] works around the problem.  But it's such a hack that I don't want to commit it to FreeBSD.  Hopefully VMWare will fix their bug, and if not then hopefully Google will direct their users here.