Created attachment 212663
create a tunable to set the NLM client to use TCP

After upgrading the software on a Netapp Filer, serious interoperability problems were observed between the FreeBSD NLM client and the Netapp server. (The NLM protocol is a separate protocol from NFSv3 that provides file locking; it is also known as rpc.lockd.)
Reported via email by Daniel Braniss (danny@cs.huji.ac.il).

Although the problem was not completely diagnosed, it appeared to be related to reuse of the xid in the RPC over UDP header. Adam McDougall (mcdouga9@egr.msu.edu) mentioned resolving a similar issue between the FreeBSD NLM client and a Netapp Filer by switching the NLM to use TCP instead of UDP.

I have created this bug report to track this and have attached two simple patches that might resolve the problem or make it easier to deal with.
Created attachment 212664
modify kernel UDP client to use global xid

This patch modifies the kernel RPC UDP client so that it uses a single global xid instead of one per "connection". I couldn't see exactly how the per-"connection" xid could end up reusing the same value, but since a "connection" is a sketchy concept for UDP anyhow and a global xid will not repeat for 4 billion RPCs, this should avoid any reuse of the same xid value.
(I suspect the per-"connection" xid code was inherited from userland RPC library code, where a global value is not practical.)
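The difference between the two schemes can be sketched in a few lines of userland C. This is only an illustration, not the attached patch: the names `struct conn`, `next_xid_per_conn`, `next_xid_global` and `xid_demo` are made up for the example, C11 atomics stand in for whatever the kernel code actually uses, and the identical seed of 1000 is an assumption chosen to show what reuse would look like if two "connections" ever started from the same value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-"connection" handle: each one carries its own xid. */
struct conn {
	uint32_t xid;
};

/* Per-connection scheme: the result depends only on this handle's seed. */
static uint32_t
next_xid_per_conn(struct conn *c)
{
	return (++c->xid);
}

/* Global scheme: a single counter shared by every handle. */
static _Atomic uint32_t global_xid;

static uint32_t
next_xid_global(void)
{
	/* atomic_fetch_add() returns the old value, so add 1 for the new one. */
	return (atomic_fetch_add(&global_xid, 1) + 1);
}

/* Illustration only: two handles that happen to start from the same seed. */
static void
xid_demo(void)
{
	struct conn a = { .xid = 1000 };	/* assumed seed */
	struct conn b = { .xid = 1000 };	/* assumed identical seed */

	/* Per-connection counters hand out the same xid twice. */
	assert(next_xid_per_conn(&a) == next_xid_per_conn(&b));

	/* The shared counter gives the two calls different values. */
	assert(next_xid_global() != next_xid_global());
}
```

Calling `xid_demo()` from a `main()` exercises both cases. With a 32-bit counter the global value only wraps after 2^32 requests, which is where the "4 billion RPCs" figure comes from.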
A variant of the second patch has been committed to head and a variant of the first one will be committed to head later today.
A commit references this bug:

Author: rmacklem
Date: Sun Apr 5 21:08:17 UTC 2020
New revision: 359643
URL: https://svnweb.freebsd.org/changeset/base/359643

Log:
  Change the xid for client side krpc over UDP to a global value.

  Without this patch, the xid used for the client side krpc requests over UDP was initialized for each "connection". A "connection" for UDP is rather sketchy and for the kernel NLM a new one is created every 2minutes.

  A problem with client side interoperability with a Netapp server for the NLM was reported and it is believed to be caused by reuse of the same xid. Although this was never completely diagnosed by the reporter, I could see how the same xid might get reused, since it is initialized to a value based on the TOD clock every two minutes.

  I suspect initializing the value for every "connection" was inherited from userland library code, where having a global xid was not practical. However, implementing a global "xid" for the kernel rpc is straightforward and will ensure that an xid value is not reused for a long time. This patch does that and is hoped it will fix the Netapp interoperability problem.

  PR: 245022
  Reported by: danny@cs.huji.ac.il
  MFC after: 2 weeks

Changes:
  head/sys/rpc/clnt_dg.c
I won't be committing the first patch for now. Although it sets the client side of the NLM to use TCP and someone reported that this helped them when dealing with a Netapp server, the FreeBSD server does not handle TCP when the client tries to use it. I am not sure if I will investigate the NLM over TCP problem, since I think the NLM should be deprecated in favour of using NFSv4.1 (or NFSv4.2 when available) for cases where distributed locking for NFS is required.
I played around with the NLM tunable to set use of TCP. It appears that rpcbind always replies with the UDP port#, so it doesn't work. I think setting a fixed port# via "-p" for both rpc.statd and rpc.lockd might make it work. I am hoping that patch#2 will resolve the problem, so I don't need to bother trying to fix the rpcbind problem.
A commit references this bug:

Author: rmacklem
Date: Mon Apr 20 01:17:00 UTC 2020
New revision: 360109
URL: https://svnweb.freebsd.org/changeset/base/360109

Log:
  MFC: r359643
  Change the xid for client side krpc over UDP to a global value.

  Without this patch, the xid used for the client side krpc requests over UDP was initialized for each "connection". A "connection" for UDP is rather sketchy and for the kernel NLM a new one is created every 2minutes.

  A problem with client side interoperability with a Netapp server for the NLM was reported and it is believed to be caused by reuse of the same xid. Although this was never completely diagnosed by the reporter, I could see how the same xid might get reused, since it is initialized to a value based on the TOD clock every two minutes.

  I suspect initializing the value for every "connection" was inherited from userland library code, where having a global xid was not practical. However, implementing a global "xid" for the kernel rpc is straightforward and will ensure that an xid value is not reused for a long time. This patch does that and is hoped it will fix the Netapp interoperability problem.

  PR: 245022

Changes:
_U stable/12/
  stable/12/sys/rpc/clnt_dg.c
A commit references this bug:

Author: rmacklem
Date: Mon Apr 20 01:26:18 UTC 2020
New revision: 360110
URL: https://svnweb.freebsd.org/changeset/base/360110

Log:
  MFC: r359643
  Change the xid for client side krpc over UDP to a global value.

  Without this patch, the xid used for the client side krpc requests over UDP was initialized for each "connection". A "connection" for UDP is rather sketchy and for the kernel NLM a new one is created every 2minutes.

  A problem with client side interoperability with a Netapp server for the NLM was reported and it is believed to be caused by reuse of the same xid. Although this was never completely diagnosed by the reporter, I could see how the same xid might get reused, since it is initialized to a value based on the TOD clock every two minutes.

  I suspect initializing the value for every "connection" was inherited from userland library code, where having a global xid was not practical. However, implementing a global "xid" for the kernel rpc is straightforward and will ensure that an xid value is not reused for a long time. This patch does that and is hoped it will fix the Netapp interoperability problem.

  PR: 245022

Changes:
_U stable/11/
  stable/11/sys/rpc/clnt_dg.c
The second patch has been committed and MFC'd. Enabling TCP using the first patch doesn't work correctly, because the call to rpcbind returns the UDP port# instead of the TCP port#. If the second patch does not resolve the interoperability problem with the Netapp filer, this PR can be reopened and I will work on fixing the first patch.