When the NFSv3 kernel lock manager times out trying to call back to a client, it emits an unhelpful error message on the console:

NLM: failed to contact remote rpcbind, stat = 5, port = 28416

There are three things wrong with this message: the error is in decimal, requiring users to read the source code to figure out which particular set of error codes it's taken from, never mind what it means; the port number is byte-swapped; and, most importantly, it does not identify the client the NLM is trying to communicate with.
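To illustrate the byte-swap complaint: swapping the two bytes of the reported port 28416 (0x6F00) gives 111 (0x006F), the well-known rpcbind port. The following is only a minimal sketch of that arithmetic; the helper name `swap16` is made up for this example and is not taken from the kernel source.

```python
import struct


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit unsigned integer (hypothetical helper)."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


reported = 28416                 # port value as printed in the console message
print(hex(reported))             # 0x6f00
print(swap16(reported))          # 111, the rpcbind/portmap port

# Cross-check with struct: pack as big-endian, reinterpret as little-endian.
assert struct.unpack("<H", struct.pack(">H", reported))[0] == swap16(reported)
```

The same swap applied twice returns the original value, which is why converting a port between network and host byte order one time too many (or too few) produces exactly this kind of number on a little-endian machine.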
Hey Garrett, I've got a patch that improves the error message a bit, but I'm having a hard time triggering it. Do you know how to reproduce it? Thanks!
(In reply to Mateusz Piotrowski from comment #1) I *think* the way to trigger this error is to run rpcbind/statd/lockd on both client and server, but configure a firewall rule on the client that blocks incoming portmap RPCs (local port 111 on both UDP and TCP), then try to take a network lock on the client side (using, e.g., lockf(1)).
(In reply to Garrett Wollman from comment #2) Worked perfectly, thanks! I'll push the patch to Phabricator soon.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=14105aae555cc22554d87ab041ee736c086f5ef1

commit 14105aae555cc22554d87ab041ee736c086f5ef1
Author:     Tom Jones <tom.jones@klarasystems.com>
AuthorDate: 2023-09-25 18:33:45 +0000
Commit:     Mateusz Piotrowski <0mp@FreeBSD.org>
CommitDate: 2023-11-09 20:54:28 +0000

    nlm: Fix error messages for failed remote rpcbind contact

    In case of a remote rpcbind connection timeout, the NFS kernel lock
    manager emits an error message along the lines of:

        NLM: failed to contact remote rpcbind, stat = 5, port = 28416

    In the Bugzilla PR, Garrett Wollman identified the following problems
    with that error message:

    - The error is in decimal, which can only be deciphered by reading the
      source code.
    - The port number is byte-swapped.
    - The error message does not identify the client the NLM is trying to
      communicate with.

    Fix the shortcomings of the current error message by:

    - Printing out the port number correctly.
    - Mentioning the remote client.

    The low-level decimal error remains an outstanding issue though. It
    seems like the error strings describing the error codes live outside of
    the kernel code currently.

    PR:             244698
    Reported by:    wollman
    Approved by:    allanjude
    Sponsored by:   National Bureau of Economic Research
    Sponsored by:   Klara, Inc.
    Co-authored-by: Mateusz Piotrowski <0mp@FreeBSD.org>

 sys/nlm/nlm_prot_impl.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)
(In reply to commit-hook from comment #4) The remaining bit here is to actually print a human-friendly error description instead of just a number. For that, it seems like we need to port the error strings from userspace to the kernel.
(In reply to Mateusz Piotrowski from comment #5) I think as a stopgap, just identifying what kind of value it is (errno or protocol error or some internal status enumeration) is helpful, then at least you know what document to look at. "stat" really doesn't tell the user anything actionable. The rest of the change looks good.
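A rough sketch of what such a stopgap could look like, written in Python purely for illustration. It assumes the "stat" value is an ONC RPC `enum clnt_stat`, which this thread does not confirm (though 5 being RPC_TIMEDOUT would match the timeout described in the report). The table is a deliberately partial subset, and the function name, message wording, and host name are all hypothetical rather than what the committed patch prints.

```python
# Assumption: "stat" is an ONC RPC enum clnt_stat value. Only the first few
# well-known values are listed; this is not the complete enumeration.
CLNT_STAT_NAMES = {
    0: "RPC_SUCCESS",
    1: "RPC_CANTENCODEARGS",
    2: "RPC_CANTDECODERES",
    3: "RPC_CANTSEND",
    4: "RPC_CANTRECV",
    5: "RPC_TIMEDOUT",
}


def format_nlm_error(stat: int, port: int, host: str) -> str:
    """Build a hypothetical message that names the kind of status value."""
    name = CLNT_STAT_NAMES.get(stat, "unknown")
    return (
        f"NLM: failed to contact remote rpcbind on {host}, "
        f"clnt_stat = {stat} ({name}), port = {port}"
    )


# Hypothetical client name, for illustration only.
print(format_nlm_error(5, 111, "client.example.org"))
```

Even without symbolic names, labelling the number as "clnt_stat" rather than "stat" would tell the reader which enumeration to look up.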