Bug 249567

Summary: NFSv4 server sometimes responds with NFSERR_INVAL to LOCK from Linux clients
Product: Base System Reporter: Björn Fischer <bf>
Component: kernAssignee: Rick Macklem <rmacklem>
Status: Closed FIXED    
Severity: Affects Some People CC: rmacklem
Priority: --- Flags: rmacklem: mfc-stable12+
rmacklem: mfc-stable11+
Version: 11.4-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
fcntl F_SETLK test case
none
allow NFSv4.1 clients to use an open_to_lock_owner when lock_owner exists none

Description Björn Fischer 2020-09-24 09:39:08 UTC
Created attachment 218237 [details]
fcntl F_SETLK test case

We run an NFSv4 server with ZFS as backing store based on 11.4-RELEASE-p3. Clients are Solaris and Linux mostly. Sometimes "svn checkout" or "svn status" fails with a disk I/O error on NFS volumes on Linux clients. We tracked down the problem to sqlite which quite intensively uses fcntl locking. From this, we created a small test case which simply does

  res = fcntl(fd, F_SETLK, &lock);
  [...]
  res = fcntl(fd, F_SETLK, &unlock);

in a loop (see attachment). This reliably triggers the problem:

  [me@linuxhost:~]$ fcntl_setlk
  F_SETLK F_RDLCK res = -1 (Invalid argument)
  Successful cycles: 431

So here fcntl fails after 431 succesful lock-unlock cycles. On the wire, we can see the LOCK request as:

  Opcode: LOCK (12)
      locktype: READ_LT (1)
      reclaim?: No
      offset: 0
      length: 1
      new lock owner?: Yes
      seqid: 0x00000000
      StateID
          [StateID Hash: 0xdf2d]
          StateID seqid: 1
          StateID Other: d92b565f140000008c0c0000
          [StateID Other hash: 0x15f7]
      lock_seqid: 0x00000000
      Owner
          clientid: 0xd92b565f14000000
          owner: <DATA>
              length: 20
              contents: <DATA>

Lock requests that succeed exactly look the same. On a fail case the FreeBSD NFS server replies:

  Opcode: LOCK (12)
      Status: NFS4ERR_INVAL (22)

Using DTrace, we found the source of the NFS4ERR_INVL in nfsrv_lockctrl() at nfs_nfsdstate.c:1810:

    if (!error)
        nfsrv_getowner(&stp->ls_open, new_stp, &lckstp);
    if (lckstp)
        /*  
         * I believe this should be an error, but it
         * isn't obvious what NFSERR_xxx would be
         * appropriate, so I'll use NFSERR_INVAL for now.
         */  
        error = NFSERR_INVAL;
    else
        lckstp = new_stp;

As a workaround we tried to simply comment out the setting of "error". With this change, the test case no longer triggers the problem:

--- nfs_nfsdstate.c	2020/09/23 12:58:37	1.1
+++ nfs_nfsdstate.c	2020/09/23 14:16:19
@@ -1802,12 +1802,17 @@
 			if (!error)
 			   nfsrv_getowner(&stp->ls_open, new_stp, &lckstp);
 			if (lckstp)
+#ifdef DIAGNOSTIC
+			   printf("nfs_nfsdstate.c:1805: I believe this should be an error\n");
+#else
+			   ;
+#endif
 			   /*
 			    * I believe this should be an error, but it
 			    * isn't obvious what NFSERR_xxx would be
 			    * appropriate, so I'll use NFSERR_INVAL for now.
-			    */
 			   error = NFSERR_INVAL;
+			    */
 			else
 			   lckstp = new_stp;
 		   } else if (new_stp->ls_flags&(NFSLCK_LOCK|NFSLCK_UNLOCK)) {

While this seems to work, I have a gut feeling that lckstp should be new_stp (unconditionally) instead of what nfsrv_getowner returns. Someone with a deeper understandig of the NFS specification should look into this.
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2020-09-25 01:43:29 UTC
Interesting...
First off, let me say "good job diagnosing this".

open_owner4 and lock_owner4 refer to strings that an NFSv4
client uses to refer to a set of opens (a form of Windows lock)
or locks (byte range locks).
Although it is really up to the client implementor, you can think
of them as referring to a "process on the client".

NFSv4 has two variants of Lock operation:
- open_to_lock_owner4
   - the first time a given lock_owner4 is being
     used for a given open_owner4
or
- exist_lock_owner4
   - used for subsequent lock operations on the lock_owner4

What is happening in this case is that the Linux client is
using an open_to_lock_owner4 when that lock_owner4 has already
been created by a previous open_to_lock_owner4.
Is this a bug?
- For NFSv4.0 I would definitely say "yes". Why?
  Because there is a "seqid" field in the lock_owner4 that is
  used to serial order the lock operations and having two
  instances of the same lock_owner4 would break that.
- For NFSv4.1, NFSv4.2, I'm not so sure.
  (You can read the paragraph on Pg. 428 of RFC5661 for what
   it says. It starts with "The locker argument..".)
  I think it says that the client should only use
  open_to_lock_owner4 for the first lock op, but to be
  honest, Linux is the de-facto standard these days and
  if it is not doing this, the question is more "can the
  server live with this?.
  For NFSv4.1,4.2 sessions replaces the "seqid" in the lock_owner4
  structure for serializing lock ops, so the lock_owner4 is really
  just used to decide if two lock operations conflict (same
  lock_owner4-->no conflict).

What your change does is ignore the error and result in two
lock_owners being created for the same lock_owner "string",
which doesn't seem correct, but might be ok for NFSv4.1, 4.2.

If you could test this little change instead, it seems like
a better workaround.

      if (lckstp) {
              if (nd->nd_flag & ND_NFSV41)
                       new_stp->ls_flags &= ~NFSLCK_OPENTOLOCK;
              else
                       error = NFSERR_INVAL;
      } else
              lckstp = new_stp;

By clearing NFSLCK_OPENTOLOCK, it makes the code not allocate a
new lock_owner, since one already exists. I can't tell for sure
if this will break anything, but I think it might be ok?

One thing I couldn't see in your info was whether your mounts
were NFSv4.0 or NFSv4.1,4.2.
You can use "nfsstat -m" on the Linux client to find out.
(If it's NFSv4.0, forget the above patch and all I can say
 is that I think the Linux client is broken.)
If it is NFSv4.1,4.2 and you can test the above little patch,
please let me know how it goes?

Thanks for reporting this, rick
Comment 2 Björn Fischer 2020-09-25 09:12:46 UTC
Hello Rick,

thank you for your detailed explanation.

The mount is NFSv4.1, so that is why all the seqids are zero in wireshark, right?

I ran my test case with my inappropriate fix overnight, and yes: The lock_owner allocations resulted in a leak on the server (about 52G wired this morning) _and_ something bad happened on the Linux client, too.

I tried your patch and it seems to work flawlessly. It solved the problem, or at least worked around the bug/misbehaviour of the Linux client if it is one.

Cheers

Björn
Comment 3 Rick Macklem freebsd_committer freebsd_triage 2020-09-25 15:56:46 UTC
Sounds good.

If it is convenient for you to do, one
additional test you could try is forcing
the mount to use NFSv4.0 via minorversion=0
on the mount command and then run your test
against the server with my patch on it.

If the client generates the multiple
open_to_lock_owners, then it will still
fail with NFSERR_INVAL.

Btw, I read the appropriate section of
RFC7530 (the new RFC for NFSv4.0) and
it does state that the server should
return NFSERR_BADSEQID for this case.
I'll change the error return when I
patch it. (You might change NFSERR_INVAL
to NFSERR_BADSEQID, but I doubt the
client will care which error is returned.)

Thanks for doing the testing. I'll look
more closely at the code (I wrote it 20years
ago) and commit the patch if it looks safe.
I'll also fire up the one Linux client
I have and see if I can reproduce the
problem on it.
Comment 4 Rick Macklem freebsd_committer freebsd_triage 2020-09-26 04:19:10 UTC
Created attachment 218307 [details]
allow NFSv4.1 clients to use an open_to_lock_owner when lock_owner exists

This is a cleaned up version of the patch I suggested in
the previous comment, that seems to have resolved the bug.
Comment 5 Björn Fischer 2020-09-26 13:45:56 UTC
Ok, I tested the Linux client with NFSv4.0:

The problem does not occur at all. Not with your patch (from the attachment) nor without. I cannot detect any multiple open_to_lock requests.

If you have problems to reproduce the problem by setting up your own Linux client, here is what hardware we use. The double open_to_lock only occurs all few hundred lock requests, so maybe it is some race condition on the Linux client and timing may be crucial for successful reproduction:

FreeBSD NFS server:
 - AMD EPYC 7551P 32-Core
 - 512GB RAM
 - 40GE network

Linux client:
 - Ubuntu 20.04 5.3.0-51-generic
 - 2x Intel(R) Xeon(R) Gold 613
 - 384GB RAM
 - 10GE network

Meanwhile I applied your patch to our production server and everything seems fine.
Excellent work, thank you.
Comment 6 commit-hook freebsd_committer freebsd_triage 2020-09-26 23:05:52 UTC
A commit references this bug:

Author: rmacklem
Date: Sat Sep 26 23:05:39 UTC 2020
New revision: 366189
URL: https://svnweb.freebsd.org/changeset/base/366189

Log:
  Bjorn reported a problem where the Linux NFSv4.1 client is
  using an open_to_lock_owner4 when that lock_owner4 has already
  been created by a previous open_to_lock_owner4. This caused the NFS server
  to reply NFSERR_INVAL.

  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530 notes
  that the correct error reply is NFSERR_BADSEQID (RFC3530 did not specify
  what error to return).

  For NFSv4.1, it is not obvious whether or not this is allowed by RFC5661,
  but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle multiple
  uses of the same lock_owner in open_to_lock_owner so that it now correctly
  interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.

  Thanks go to Bjorn for diagnosing this and testing the patch.
  He also provided a program that I could use to reproduce the problem.

  Tested by:	bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
  PR:		249567
  Reported by:	bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
  MFC after:	3 days

Changes:
  head/sys/fs/nfsserver/nfs_nfsdstate.c
Comment 7 Björn Fischer 2020-09-27 10:18:40 UTC
Thank you for committing this.

Just saw that there was a typo in my email address: it is bf@, not bj@.

I saw the MFC, will this be merged into 11-STABLE as well?
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2020-09-27 15:49:34 UTC
Yes, it will be MFC'd to stable/11. I set the MFC
flags, but they got cleared when I clicked something
else. They're set now.
Comment 9 commit-hook freebsd_committer freebsd_triage 2020-09-29 01:53:48 UTC
A commit references this bug:

Author: rmacklem
Date: Tue Sep 29 01:52:54 UTC 2020
New revision: 366238
URL: https://svnweb.freebsd.org/changeset/base/366238

Log:
  MFC: r366189
  Bjorn reported a problem where the Linux NFSv4.1 client is
  using an open_to_lock_owner4 when that lock_owner4 has already
  been created by a previous open_to_lock_owner4. This caused the NFS server
  to reply NFSERR_INVAL.

  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530 notes
  that the correct error reply is NFSERR_BADSEQID (RFC3530 did not specify
  what error to return).

  For NFSv4.1, it is not obvious whether or not this is allowed by RFC5661,
  but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle multiple
  uses of the same lock_owner in open_to_lock_owner so that it now correctly
  interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.

  Thanks go to Bjorn for diagnosing this and testing the patch.
  He also provided a program that I could use to reproduce the problem.

  PR:		249567
  Reported by:	bf@cebitec.uni-bielefeld.de

Changes:
_U  stable/12/
  stable/12/sys/fs/nfsserver/nfs_nfsdstate.c
Comment 10 commit-hook freebsd_committer freebsd_triage 2020-09-29 02:03:52 UTC
A commit references this bug:

Author: rmacklem
Date: Tue Sep 29 02:03:25 UTC 2020
New revision: 366241
URL: https://svnweb.freebsd.org/changeset/base/366241

Log:
  MFC: r366189
  Bjorn reported a problem where the Linux NFSv4.1 client is
  using an open_to_lock_owner4 when that lock_owner4 has already
  been created by a previous open_to_lock_owner4. This caused the NFS server
  to reply NFSERR_INVAL.

  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530 notes
  that the correct error reply is NFSERR_BADSEQID (RFC3530 did not specify
  what error to return).

  For NFSv4.1, it is not obvious whether or not this is allowed by RFC5661,
  but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle multiple
  uses of the same lock_owner in open_to_lock_owner so that it now correctly
  interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.

  Thanks go to Bjorn for diagnosing this and testing the patch.
  He also provided a program that I could use to reproduce the problem.

  PR:		249567
  Reported by:	bf@cebitec.uni-bielefeld.de

Changes:
_U  stable/11/
  stable/11/sys/fs/nfsserver/nfs_nfsdstate.c
Comment 11 Rick Macklem freebsd_committer freebsd_triage 2020-09-29 02:31:25 UTC
Patch has been MFC'd to stable/11 and stable/11.
A request for MFS to releng/12.2 has been made.
Comment 12 commit-hook freebsd_committer freebsd_triage 2020-09-29 15:09:44 UTC
A commit references this bug:

Author: rmacklem
Date: Tue Sep 29 15:09:38 UTC 2020
New revision: 366256
URL: https://svnweb.freebsd.org/changeset/base/366256

Log:
  MFS: r366238
  Bjorn reported a problem where the Linux NFSv4.1 client is
  using an open_to_lock_owner4 when that lock_owner4 has already
  been created by a previous open_to_lock_owner4. This caused the NFS server
  to reply NFSERR_INVAL.

  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530 notes
  that the correct error reply is NFSERR_BADSEQID (RFC3530 did not specify
  what error to return).

  For NFSv4.1, it is not obvious whether or not this is allowed by RFC5661,
  but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle multiple
  uses of the same lock_owner in open_to_lock_owner so that it now correctly
  interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.

  Thanks go to Bjorn for diagnosing this and testing the patch.
  He also provided a program that I could use to reproduce the problem.

  PR:		249567
  Approved by:	re (gjb)

Changes:
_U  releng/12.2/
  releng/12.2/sys/fs/nfsserver/nfs_nfsdstate.c
Comment 13 Rick Macklem freebsd_committer freebsd_triage 2020-09-29 15:12:43 UTC
Patch has been MFC'd and MFS'd to releng/12.2.