Created attachment 218237 [details]
fcntl F_SETLK test case

We run an NFSv4 server with ZFS as backing store, based on 11.4-RELEASE-p3. The clients are mostly Solaris and Linux. Sometimes "svn checkout" or "svn status" fails with a disk I/O error on NFS volumes on Linux clients. We tracked the problem down to sqlite, which uses fcntl locking quite intensively.

From this, we created a small test case which simply does

    res = fcntl(fd, F_SETLK, &lock);
    [...]
    res = fcntl(fd, F_SETLK, &unlock);

in a loop (see attachment). This reliably triggers the problem:

    [me@linuxhost:~]$ fcntl_setlk
    F_SETLK F_RDLCK res = -1 (Invalid argument)
    Successful cycles: 431

So here fcntl fails after 431 successful lock-unlock cycles. On the wire, we can see the LOCK request as:

    Opcode: LOCK (12)
        locktype: READ_LT (1)
        reclaim?: No
        offset: 0
        length: 1
        new lock owner?: Yes
        seqid: 0x00000000
        StateID [StateID Hash: 0xdf2d]
            StateID seqid: 1
            StateID Other: d92b565f140000008c0c0000
            [StateID Other hash: 0x15f7]
        lock_seqid: 0x00000000
        Owner
            clientid: 0xd92b565f14000000
            owner: <DATA>
                length: 20
                contents: <DATA>

Lock requests that succeed look exactly the same. In the failure case, the FreeBSD NFS server replies:

    Opcode: LOCK (12)
    Status: NFS4ERR_INVAL (22)

Using DTrace, we found the source of the NFS4ERR_INVAL in nfsrv_lockctrl() at nfs_nfsdstate.c:1810:

    if (!error)
        nfsrv_getowner(&stp->ls_open, new_stp, &lckstp);
    if (lckstp)
        /*
         * I believe this should be an error, but it
         * isn't obvious what NFSERR_xxx would be
         * appropriate, so I'll use NFSERR_INVAL for now.
         */
        error = NFSERR_INVAL;
    else
        lckstp = new_stp;

As a workaround we tried to simply comment out the setting of "error".
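The attachment itself is not reproduced here, but a minimal self-contained sketch of the same lock/unlock loop might look like the following (the function name, cycle count, and local temp-file path are illustrative, not taken from the attached program; to reproduce the bug the file must of course live on the affected NFSv4.1 mount):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Repeatedly acquire and release a one-byte shared lock at offset 0,
 * mirroring sqlite's fcntl locking pattern (READ_LT, offset 0, length 1,
 * as seen in the wire trace).  Returns the number of successful
 * lock/unlock cycles; equals n when no fcntl call failed.
 */
int
run_lock_cycles(int fd, int n)
{
	struct flock lk;
	int i;

	memset(&lk, 0, sizeof(lk));
	lk.l_whence = SEEK_SET;
	lk.l_start = 0;		/* offset 0, length 1: matches the */
	lk.l_len = 1;		/* LOCK op seen on the wire */

	for (i = 0; i < n; i++) {
		lk.l_type = F_RDLCK;
		if (fcntl(fd, F_SETLK, &lk) == -1) {
			printf("F_SETLK F_RDLCK res = -1 (%s)\n",
			    strerror(errno));
			break;
		}
		lk.l_type = F_UNLCK;
		if (fcntl(fd, F_SETLK, &lk) == -1) {
			printf("F_SETLK F_UNLCK res = -1 (%s)\n",
			    strerror(errno));
			break;
		}
	}
	return (i);
}
```

On a local filesystem every cycle succeeds; against the affected server, the loop breaks out after a few hundred cycles with EINVAL, as shown in the output above.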
With this change, the test case no longer triggers the problem:

--- nfs_nfsdstate.c	2020/09/23 12:58:37	1.1
+++ nfs_nfsdstate.c	2020/09/23 14:16:19
@@ -1802,12 +1802,17 @@
 		if (!error)
 			nfsrv_getowner(&stp->ls_open, new_stp, &lckstp);
 		if (lckstp)
+#ifdef DIAGNOSTIC
+			printf("nfs_nfsdstate.c:1805: I believe this should be an error\n");
+#else
+			;
+#endif
 			/*
 			 * I believe this should be an error, but it
 			 * isn't obvious what NFSERR_xxx would be
 			 * appropriate, so I'll use NFSERR_INVAL for now.
-			 */
 			error = NFSERR_INVAL;
+			 */
 		else
 			lckstp = new_stp;
 	} else if (new_stp->ls_flags&(NFSLCK_LOCK|NFSLCK_UNLOCK)) {

While this seems to work, I have a gut feeling that lckstp should be new_stp (unconditionally) instead of what nfsrv_getowner returns. Someone with a deeper understanding of the NFS specification should look into this.
Interesting... First off, let me say "good job diagnosing this".

open_owner4 and lock_owner4 refer to strings that an NFSv4 client uses to refer to a set of opens (a form of Windows lock) or locks (byte-range locks). Although it is really up to the client implementor, you can think of them as referring to a "process on the client".

NFSv4 has two variants of the Lock operation:
- open_to_lock_owner4 - used the first time a given lock_owner4 is being used for a given open_owner4, or
- exist_lock_owner4 - used for subsequent lock operations by that lock_owner4

What is happening in this case is that the Linux client is using an open_to_lock_owner4 when that lock_owner4 has already been created by a previous open_to_lock_owner4. Is this a bug?

- For NFSv4.0 I would definitely say "yes". Why? Because there is a "seqid" field in the lock_owner4 that is used to serially order the lock operations, and having two instances of the same lock_owner4 would break that.
- For NFSv4.1, NFSv4.2, I'm not so sure. (You can read the paragraph on pg. 428 of RFC 5661 for what it says. It starts with "The locker argument...".) I think it says that the client should only use open_to_lock_owner4 for the first lock op, but to be honest, Linux is the de facto standard these days, and if it is not doing this, the question is more "can the server live with this?". For NFSv4.1/4.2, sessions replace the "seqid" in the lock_owner4 structure for serializing lock ops, so the lock_owner4 is really just used to decide whether two lock operations conflict (same lock_owner4 --> no conflict).

What your change does is ignore the error, resulting in two lock_owners being created for the same lock_owner "string". That doesn't seem correct, but might be ok for NFSv4.1/4.2. If you could test this little change instead, it seems like a better workaround.
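For reference, the relevant XDR from RFC 5661 (RFC 7530 carries the same definitions for NFSv4.0): the new_lock_owner discriminant in the LOCK arguments selects between the two variants, and is what Wireshark displays as "new lock owner?" in the trace above.

```
struct open_to_lock_owner4 {
        seqid4          open_seqid;
        stateid4        open_stateid;
        seqid4          lock_seqid;
        lock_owner4     lock_owner;
};

struct exist_lock_owner4 {
        stateid4        lock_stateid;
        seqid4          lock_seqid;
};

union locker4 switch (bool new_lock_owner) {
 case TRUE:
        open_to_lock_owner4     open_owner;
 case FALSE:
        exist_lock_owner4       lock_owner;
};
```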
	if (lckstp) {
		if (nd->nd_flag & ND_NFSV41)
			new_stp->ls_flags &= ~NFSLCK_OPENTOLOCK;
		else
			error = NFSERR_INVAL;
	} else
		lckstp = new_stp;

By clearing NFSLCK_OPENTOLOCK, the code no longer allocates a new lock_owner, since one already exists. I can't tell for sure if this will break anything, but I think it might be ok?

One thing I couldn't see in your info was whether your mounts are NFSv4.0 or NFSv4.1/4.2. You can use "nfsstat -m" on the Linux client to find out. (If it's NFSv4.0, forget the above patch, and all I can say is that I think the Linux client is broken.) If it is NFSv4.1/4.2 and you can test the above little patch, please let me know how it goes.

Thanks for reporting this, rick
Hello Rick,

thank you for your detailed explanation. The mount is NFSv4.1, so that is why all the seqids are zero in Wireshark, right?

I ran my test case with my inappropriate fix overnight, and yes: the lock_owner allocations resulted in a leak on the server (about 52G wired this morning) _and_ something bad happened on the Linux client, too.

I tried your patch and it seems to work flawlessly. It solved the problem, or at least worked around the bug/misbehaviour of the Linux client, if it is one.

Cheers

Björn
Sounds good. If it is convenient for you to do, one additional test you could try is forcing the mount to use NFSv4.0 via "minorversion=0" on the mount command, and then running your test against the server with my patch on it. If the client generates the multiple open_to_lock_owners, it will still fail with NFSERR_INVAL.

Btw, I read the appropriate section of RFC 7530 (the new RFC for NFSv4.0) and it does state that the server should return NFSERR_BADSEQID for this case. I'll change the error return when I commit the patch. (You might change NFSERR_INVAL to NFSERR_BADSEQID, but I doubt the client will care which error is returned.)

Thanks for doing the testing. I'll look more closely at the code (I wrote it 20 years ago) and commit the patch if it looks safe. I'll also fire up the one Linux client I have and see if I can reproduce the problem on it.
Created attachment 218307 [details]
allow NFSv4.1 clients to use an open_to_lock_owner when the lock_owner exists

This is a cleaned-up version of the patch I suggested in the previous comment, which seems to have resolved the bug.
Ok, I tested the Linux client with NFSv4.0: the problem does not occur at all, neither with your patch (from the attachment) nor without. I cannot detect any multiple open_to_lock requests.

If you have trouble reproducing the problem with your own Linux client, here is the hardware we use. The double open_to_lock only occurs every few hundred lock requests, so maybe it is some race condition on the Linux client, and timing may be crucial for successful reproduction:

FreeBSD NFS server:
- AMD EPYC 7551P 32-Core
- 512GB RAM
- 40GE network

Linux client:
- Ubuntu 20.04, 5.3.0-51-generic
- 2x Intel(R) Xeon(R) Gold 613
- 384GB RAM
- 10GE network

Meanwhile I have applied your patch to our production server and everything seems fine. Excellent work, thank you.
A commit references this bug:

Author: rmacklem
Date: Sat Sep 26 23:05:39 UTC 2020
New revision: 366189
URL: https://svnweb.freebsd.org/changeset/base/366189

Log:
  Bjorn reported a problem where the Linux NFSv4.1 client is using an
  open_to_lock_owner4 when that lock_owner4 has already been created by
  a previous open_to_lock_owner4. This caused the NFS server to reply
  NFSERR_INVAL.
  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530
  notes that the correct error reply is NFSERR_BADSEQID (RFC3530 did
  not specify what error to return).
  For NFSv4.1, it is not obvious whether or not this is allowed by
  RFC5661, but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle
  multiple uses of the same lock_owner in open_to_lock_owner so that it
  now correctly interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.
  Thanks go to Bjorn for diagnosing this and testing the patch. He also
  provided a program that I could use to reproduce the problem.

  Tested by:	bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
  PR:		249567
  Reported by:	bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
  MFC after:	3 days

Changes:
  head/sys/fs/nfsserver/nfs_nfsdstate.c
Thank you for committing this. I just saw that there was a typo in my email address: it is bf@, not bj@. I also saw the MFC note; will this be merged into 11-STABLE as well?
Yes, it will be MFC'd to stable/11. I set the MFC flags, but they got cleared when I clicked something else. They're set now.
A commit references this bug:

Author: rmacklem
Date: Tue Sep 29 01:52:54 UTC 2020
New revision: 366238
URL: https://svnweb.freebsd.org/changeset/base/366238

Log:
  MFC: r366189
  Bjorn reported a problem where the Linux NFSv4.1 client is using an
  open_to_lock_owner4 when that lock_owner4 has already been created by
  a previous open_to_lock_owner4. This caused the NFS server to reply
  NFSERR_INVAL.
  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530
  notes that the correct error reply is NFSERR_BADSEQID (RFC3530 did
  not specify what error to return).
  For NFSv4.1, it is not obvious whether or not this is allowed by
  RFC5661, but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle
  multiple uses of the same lock_owner in open_to_lock_owner so that it
  now correctly interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.
  Thanks go to Bjorn for diagnosing this and testing the patch. He also
  provided a program that I could use to reproduce the problem.

  PR:		249567
  Reported by:	bf@cebitec.uni-bielefeld.de

Changes:
  _U  stable/12/
  stable/12/sys/fs/nfsserver/nfs_nfsdstate.c
A commit references this bug:

Author: rmacklem
Date: Tue Sep 29 02:03:25 UTC 2020
New revision: 366241
URL: https://svnweb.freebsd.org/changeset/base/366241

Log:
  MFC: r366189
  Bjorn reported a problem where the Linux NFSv4.1 client is using an
  open_to_lock_owner4 when that lock_owner4 has already been created by
  a previous open_to_lock_owner4. This caused the NFS server to reply
  NFSERR_INVAL.
  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530
  notes that the correct error reply is NFSERR_BADSEQID (RFC3530 did
  not specify what error to return).
  For NFSv4.1, it is not obvious whether or not this is allowed by
  RFC5661, but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle
  multiple uses of the same lock_owner in open_to_lock_owner so that it
  now correctly interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.
  Thanks go to Bjorn for diagnosing this and testing the patch. He also
  provided a program that I could use to reproduce the problem.

  PR:		249567
  Reported by:	bf@cebitec.uni-bielefeld.de

Changes:
  _U  stable/11/
  stable/11/sys/fs/nfsserver/nfs_nfsdstate.c
Patch has been MFC'd to stable/12 and stable/11. A request for MFS to releng/12.2 has been made.
A commit references this bug:

Author: rmacklem
Date: Tue Sep 29 15:09:38 UTC 2020
New revision: 366256
URL: https://svnweb.freebsd.org/changeset/base/366256

Log:
  MFS: r366238
  Bjorn reported a problem where the Linux NFSv4.1 client is using an
  open_to_lock_owner4 when that lock_owner4 has already been created by
  a previous open_to_lock_owner4. This caused the NFS server to reply
  NFSERR_INVAL.
  For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530
  notes that the correct error reply is NFSERR_BADSEQID (RFC3530 did
  not specify what error to return).
  For NFSv4.1, it is not obvious whether or not this is allowed by
  RFC5661, but the NFSv4.1 server can handle this case without error.
  This patch changes the NFSv4.1 (and NFSv4.2) server to handle
  multiple uses of the same lock_owner in open_to_lock_owner so that it
  now correctly interoperates with the Linux NFS client.
  It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.
  Thanks go to Bjorn for diagnosing this and testing the patch. He also
  provided a program that I could use to reproduce the problem.

  PR:		249567
  Approved by:	re (gjb)

Changes:
  _U  releng/12.2/
  releng/12.2/sys/fs/nfsserver/nfs_nfsdstate.c
Patch has been MFC'd and MFS'd to releng/12.2.