Bug 244698 - NFSv3 lock manager unhelpful error message
Summary: NFSv3 lock manager unhelpful error message
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Mateusz Piotrowski
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-09 20:18 UTC by Garrett Wollman
Modified: 2023-11-09 21:31 UTC
CC List: 2 users

See Also:


Description Garrett Wollman freebsd_committer freebsd_triage 2020-03-09 20:18:02 UTC
When the NFSv3 kernel lock manager times out trying to call back to a client, it emits an unhelpful error message on the console:

NLM: failed to contact remote rpcbind, stat = 5, port = 28416

There are three things wrong with this message: the error is in decimal, requiring users to read the source code to figure out which particular set of error codes it is taken from, never mind what it means; the port number is byte-swapped; and, most importantly, it does not identify the client the NLM is trying to communicate with.
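For illustration, the byte swap is easy to confirm by hand: 28416 is 0x6F00, and exchanging its two bytes gives 0x006F, i.e. 111, the well-known rpcbind port. The small helper below assumes the message was printed on a little-endian host from a port still in network byte order (no ntohs(3) conversion); the name `swap16` is ours, not something in the NLM code.

```c
#include <stdint.h>

/*
 * Exchange the two bytes of a 16-bit value.  On a little-endian host
 * this is what ntohs(3) does, so applying it to the logged port
 * recovers the real port number.
 */
static uint16_t
swap16(uint16_t v)
{
	return ((uint16_t)((v << 8) | (v >> 8)));
}

/* Logged value 28416 == 0x6F00; swap16(28416) == 0x006F == 111. */
```

Port 111 is what one would expect for a failed call to the remote rpcbind.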
Comment 1 Mateusz Piotrowski freebsd_committer freebsd_triage 2023-10-26 15:08:43 UTC
Hey Garrett,

I've got a patch that improves the error message a bit, but I'm having a hard time triggering the error message. Do you know how to reproduce it? Thanks!
Comment 2 Garrett Wollman freebsd_committer freebsd_triage 2023-10-29 20:04:25 UTC
(In reply to Mateusz Piotrowski from comment #1)
I *think* the way to trigger this error is to run rpcbind/statd/lockd on both client and server, but configure a firewall rule on the client that blocks incoming portmap RPCs (local port 111 on both UDP and TCP), then try to take a network lock on the client side (using, e.g., lockf(1)).
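For anyone who prefers to take the lock from a program rather than with lockf(1), a minimal C sketch follows. It assumes the file lives on an NFSv3 mount made without the nolockd option, so that fcntl(2) locks are forwarded to the lock manager, and it uses a POSIX byte-range lock; we have not confirmed that it follows exactly the same path as lockf(1). The function name `try_nfs_lock` is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Take an exclusive, blocking POSIX lock on the whole file at `path`,
 * then release it.  On an NFSv3 mount (without nolockd) the request
 * goes to the lock manager.  Returns 0 on success, -1 on failure.
 */
static int
try_nfs_lock(const char *path)
{
	struct flock fl = {
		.l_type = F_WRLCK,	/* exclusive (write) lock */
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,		/* 0 means "through end of file" */
	};
	int fd;

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd == -1)
		return (-1);
	/* F_SETLKW waits until the lock is granted or the call fails. */
	if (fcntl(fd, F_SETLKW, &fl) == -1) {
		close(fd);
		return (-1);
	}
	/* Drop the lock again before closing. */
	fl.l_type = F_UNLCK;
	(void)fcntl(fd, F_SETLK, &fl);
	close(fd);
	return (0);
}
```

Usage would be something like `try_nfs_lock("/mnt/nfs/locktest")`, where `/mnt/nfs` is a hypothetical mount point; with the client-side firewall rule in place, the message would be expected on the server's console.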
Comment 3 Mateusz Piotrowski freebsd_committer freebsd_triage 2023-11-09 17:02:18 UTC
(In reply to Garrett Wollman from comment #2)
Worked perfectly, thanks!

I'll push the patch to Phabricator soon.
Comment 4 commit-hook freebsd_committer freebsd_triage 2023-11-09 21:02:19 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=14105aae555cc22554d87ab041ee736c086f5ef1

commit 14105aae555cc22554d87ab041ee736c086f5ef1
Author:     Tom Jones <tom.jones@klarasystems.com>
AuthorDate: 2023-09-25 18:33:45 +0000
Commit:     Mateusz Piotrowski <0mp@FreeBSD.org>
CommitDate: 2023-11-09 20:54:28 +0000

    nlm: Fix error messages for failed remote rpcbind contact

    In case of a remote rpcbind connection timeout,
    the NFS kernel lock manager emits an error message
    along the lines of:

        NLM: failed to contact remote rpcbind, stat = 5, port = 28416

    In the Bugzilla PR, Garrett Wollman identified the following problems
    with that error message:

    - The error is in decimal, which can only be deciphered by reading the
      source code.
    - The port number is byte-swapped.
    - The error message does not identify the client the NLM is trying to
      communicate with.

    Fix the shortcomings of the current error message by:

    - Printing out the port number correctly.
    - Mentioning the remote client.

    The low-level decimal error remains an outstanding issue though.
    It seems like the error strings describing the error codes live outside
    of the kernel code currently.

    PR:             244698
    Reported by:    wollman
    Approved by:    allanjude
    Sponsored by:   National Bureau of Economic Research
    Sponsored by:   Klara, Inc.
    Co-authored-by: Mateusz Piotrowski <0mp@FreeBSD.org>

 sys/nlm/nlm_prot_impl.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)
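The diff itself is not reproduced on this page. As a rough illustration of the two fixes listed in the commit message (naming the remote client and printing the port in host byte order), a formatter might look like the sketch below. The function name `format_rpcbind_failure`, the IPv4-only handling and the exact wording are our assumptions, not the committed code.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter for a failed rpcbind call to the client at
 * `sin` (IPv4 only).  The address is rendered with inet_ntop(3) and
 * the port is converted to host byte order with ntohs(3).
 * Returns the snprintf(3) result.
 */
static int
format_rpcbind_failure(char *buf, size_t len,
    const struct sockaddr_in *sin, int stat)
{
	char addr[INET_ADDRSTRLEN];

	if (inet_ntop(AF_INET, &sin->sin_addr, addr, sizeof(addr)) == NULL)
		strcpy(addr, "?");	/* fall back if the address is unprintable */
	return (snprintf(buf, len,
	    "NLM: failed to contact remote rpcbind at %s, stat = %d, port = %u",
	    addr, stat, (unsigned)ntohs(sin->sin_port)));
}
```

Converting with ntohs(3) at the point of printing keeps the stored sockaddr in network order, which is what the socket and RPC layers expect.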
Comment 5 Mateusz Piotrowski freebsd_committer freebsd_triage 2023-11-09 21:04:58 UTC
(In reply to commit-hook from comment #4)

The remaining bit here is to actually print a human-friendly error description instead of just a number. For that, it seems we need to port the error strings from userspace to the kernel.
Comment 6 Garrett Wollman freebsd_committer freebsd_triage 2023-11-09 21:31:04 UTC
(In reply to Mateusz Piotrowski from comment #5)
I think as a stopgap, just identifying what kind of value it is (errno or protocol error or some internal status enumeration) is helpful, then at least you know what document to look at. "stat" really doesn't tell the user anything actionable. The rest of the change looks good.
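A minimal sketch of that stopgap, assuming (as we believe, but have not re-checked against the source) that `stat` here is an `enum clnt_stat` returned by the rpcbind call: printing the symbolic name tells the reader which header to consult, and value 5 would then be RPC_TIMEDOUT, consistent with the timeout in the original report. The table covers only values 0 through 12; the helper names are hypothetical, and the short descriptions are our own paraphrases rather than the strings clnt_sperrno(3) returns in userspace. It is written as ordinary C for easy testing; kernel code would use the kernel's own printf.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Names for enum clnt_stat values 0-12; descriptions are paraphrases. */
static const struct {
	const char	*name;
	const char	*desc;
} rpc_stat_info[] = {
	{ "RPC_SUCCESS",		"call succeeded" },		/* 0 */
	{ "RPC_CANTENCODEARGS",		"cannot encode arguments" },	/* 1 */
	{ "RPC_CANTDECODERES",		"cannot decode result" },	/* 2 */
	{ "RPC_CANTSEND",		"failure sending call" },	/* 3 */
	{ "RPC_CANTRECV",		"failure receiving reply" },	/* 4 */
	{ "RPC_TIMEDOUT",		"call timed out" },		/* 5 */
	{ "RPC_VERSMISMATCH",		"RPC version mismatch" },	/* 6 */
	{ "RPC_AUTHERROR",		"authentication error" },	/* 7 */
	{ "RPC_PROGUNAVAIL",		"program unavailable" },	/* 8 */
	{ "RPC_PROGVERSMISMATCH",	"program version mismatch" },	/* 9 */
	{ "RPC_PROCUNAVAIL",		"procedure unavailable" },	/* 10 */
	{ "RPC_CANTDECODEARGS",		"server cannot decode arguments" }, /* 11 */
	{ "RPC_SYSTEMERROR",		"remote system error" },	/* 12 */
};

#define	RPC_STAT_COUNT	(sizeof(rpc_stat_info) / sizeof(rpc_stat_info[0]))

/* Return the symbolic name for `stat`, or NULL if it is not in the table. */
static const char *
rpc_stat_name(int stat)
{
	if (stat < 0 || (size_t)stat >= RPC_STAT_COUNT)
		return (NULL);
	return (rpc_stat_info[stat].name);
}

/*
 * Label the value as an RPC client status rather than an errno,
 * e.g. "RPC status RPC_TIMEDOUT (5): call timed out".
 */
static int
format_rpc_stat(char *buf, size_t len, int stat)
{
	const char *name = rpc_stat_name(stat);

	if (name == NULL)
		return (snprintf(buf, len, "RPC status %d (unrecognized)", stat));
	return (snprintf(buf, len, "RPC status %s (%d): %s",
	    name, stat, rpc_stat_info[stat].desc));
}
```

Keeping the numeric value alongside the name means the message stays useful even for codes the table does not cover.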