Summary: | nfs root mount may loop endlessly without hint on console | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Bjoern A. Zeeb <bz> | ||||
Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> | ||||
Status: | Open --- | ||||||
Severity: | Affects Some People | CC: | bz, michael.osipov, rmacklem | ||||
Priority: | --- | ||||||
Version: | CURRENT | ||||||
Hardware: | Any | ||||||
OS: | Any | ||||||
Description
Bjoern A. Zeeb
2019-11-21 19:10:23 UTC
Well, the problem with setting a limit is "how long"? I know of a FreeBSD NFS server that exports over 72,000 file systems. I suspect that startup of a server like this can take a while and some systems would simply want to retry until the server is up, I think?

I also don't see much use in a panic(), since a dump or stack trace isn't useful and another reboot cycle will presumably end up in the same state.

I can see that spitting out a single message to the console along the lines of "Can't connect to NFS server" would be useful, so that sysadmins would know why the boot has wedged.

I'll have to take a look at the code to see if the mount root case can be identified where it loops attempting reconnects, so a message can be generated for that case. I think newnfs_connect() does the socreate() and returns an error when it fails. However, newnfs_request() ignores any error return, so a message could possibly be generated there.

All of the above is just mho. I think you should ask on a mailing list (FreeBSD-fs@ maybe?) to see what others think is the correct behaviour for this case.

Actually, when I look at the code, it seems that there is a call to nfs_mountdiskless()->mountnfs() and mountnfs() should fail when newnfs_connect() fails. This looks like it should result in a message like:
    nfs_mountroot: mount <path> on /: <errno>
being generated. I have no idea why your case does not do this?

Created attachment 209371
generate a console message when RPC reconnect fails
This trivial patch might generate a console message when the reconnect fails.
Since it only happens on the first retry, hopefully it works, but is not
too noisy.
Untested at this time.
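
To illustrate the idea of reporting only the first failed reconnect, here is a minimal userland sketch in C. It is not the attached patch and does not use the kernel RPC code: reconnect_with_notice(), fake_connect(), struct reconnect_stats and the message text are all hypothetical names made up for this example, and the attempt limit exists only so the demonstration terminates (the case discussed in this report retries indefinitely).

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical connect callback: returns 0 on success, an errno value on failure. */
typedef int (*connect_fn)(void *arg);

/* Counters kept only so the behaviour of the sketch can be checked. */
struct reconnect_stats {
	int attempts;	/* connect attempts made */
	int messages;	/* console messages printed */
};

/*
 * Retry try_connect() up to max_attempts times (assumed to be > 0).
 * A message is written to "console" only after the first failure, so a
 * long outage produces one hint instead of a flood of identical lines.
 * Returns 0 on success or the last error seen.
 */
static int
reconnect_with_notice(connect_fn try_connect, void *arg, int max_attempts,
    FILE *console, struct reconnect_stats *stats)
{
	int attempt, error = 0;

	stats->attempts = 0;
	stats->messages = 0;
	for (attempt = 0; attempt < max_attempts; attempt++) {
		error = try_connect(arg);
		stats->attempts++;
		if (error == 0)
			return (0);
		if (attempt == 0) {
			/* Report once, then keep retrying silently. */
			fprintf(console,
			    "rpc: can't connect to server (NFS?), error %d; retrying\n",
			    error);
			stats->messages++;
		}
	}
	return (error);
}

/* Demo stand-in for a server that refuses a few connections, then comes up. */
static int
fake_connect(void *arg)
{
	int *failures_left = arg;

	if (*failures_left > 0) {
		(*failures_left)--;
		return (ECONNREFUSED);
	}
	return (0);
}
```

Calling reconnect_with_notice(fake_connect, &failures, 10, stderr, &stats) with failures set to 3 makes four attempts and prints a single line. Where exactly such a message belongs in the kernel (newnfs_request(), the krpc reconnect path, or elsewhere) is the open question in this report, and the sketch does not settle it.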
Sorry for no earlier feedback. Seems I got into this condition while patching things for IPv6 support and after fixing this I cannot easily reproduce it. The functional part of the patch seems fine to me. I keep wondering about the printf including the (NFS?) .. could the "?" be confusing to people? I'd be happy if you'd go ahead and commit it.

Feel free to commit it (or any variant of the patch that you prefer). I put in the "(NFS?)" since the krpc is theoretically not NFS specific, although it happens to only be used by NFS and the NLM (rpc.lockd, which is thought of as part of NFS by many).