If we try to mount the root file system over NFS and cannot establish a connection, we never give up. TCP/2049 packets arrive at the server and an RST comes back.

The reason is in newnfs_request(), which seems to jump back to tryagain; at least that is my guess for the loop, as I couldn't spot it earlier and the socreate() is part of the loop:

    1599 XXX-BZ socreate:553
    1600 XXX-BZ tcp_usr_attach:155 fff 4
    1601 XXX-BZ tcp_usr_attach:161
    1602 XXX-BZ tcp_usr_attach:171 error 0
    1603 XXX-BZ socreate:553
    1604 XXX-BZ tcp_usr_attach:155 fff 5
    1605 XXX-BZ tcp_usr_attach:161
    1606 XXX-BZ tcp_usr_attach:171 error 0
    1607 XXX-BZ socreate:553
    1608 XXX-BZ tcp_usr_attach:155 fff 6
    1609 XXX-BZ tcp_usr_attach:161
    1610 XXX-BZ tcp_usr_attach:171 error 0
    1611 XXX-BZ socreate:553
    1612 XXX-BZ tcp_usr_attach:155 fff 7
    1613 XXX-BZ tcp_usr_attach:161
    1614 XXX-BZ tcp_usr_attach:171 error 0

I added a panic that fires once I have come by 15 times, to get a backtrace:

    panic() at panic+0x43/frame 0xfffffe00acf58b70
    tcp_usr_attach() at tcp_usr_attach+0x2b7/frame 0xfffffe00acf58be0
    socreate() at socreate+0x1ce/frame 0xfffffe00acf58c30
    __rpc_nconf2socket() at __rpc_nconf2socket+0x3f/frame 0xfffffe00acf58c60
    clnt_reconnect_call() at clnt_reconnect_call+0x3b6/frame 0xfffffe00acf58d10
    newnfs_request() at newnfs_request+0x90b/frame 0xfffffe00acf58e80
    nfsrpc_getattrnovp() at nfsrpc_getattrnovp+0xeb/frame 0xfffffe00acf59020
    mountnfs() at mountnfs+0x6b6/frame 0xfffffe00acf591c0
    nfs_mount() at nfs_mount+0x11d3/frame 0xfffffe00acf59500
    vfs_mount_sigdefer() at vfs_mount_sigdefer+0x24/frame 0xfffffe00acf59520
    vfs_domount() at vfs_domount+0x7f9/frame 0xfffffe00acf59750
    vfs_donmount() at vfs_donmount+0x911/frame 0xfffffe00acf597f0
    kernel_mount() at kernel_mount+0x57/frame 0xfffffe00acf59840
    parse_mount() at parse_mount+0x4a1/frame 0xfffffe00acf59990
    vfs_mountroot() at vfs_mountroot+0x53b/frame 0xfffffe00acf59b10
    start_init() at start_init+0x28/frame 0xfffffe00acf59bb0

I would suggest that we really time out / error after <n> retries and possibly reboot or fail mountroot or whatever it'll be, rather than being stuck in a loop without letting the user know that "NFS server not reachable: connection refused".
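For illustration only, the bounded-retry behaviour suggested in this comment could be sketched in user-space C roughly as below. This is not kernel code and not taken from newnfs_request() or the krpc; connect_fn, connect_bounded() and the two stub callbacks are hypothetical names invented for the sketch, and the retry limit is simply a parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback: returns 0 on success or an errno value on failure. */
typedef int (*connect_fn)(void *arg);

/*
 * Try to connect at most max_tries times.  Returns 0 on success, otherwise
 * the errno value of the last failed attempt (EINVAL if max_tries <= 0).
 * A single message is printed once the limit has been reached.
 */
int
connect_bounded(connect_fn try_connect, void *arg, int max_tries)
{
	int error = EINVAL;
	int i;

	assert(try_connect != NULL);
	for (i = 0; i < max_tries; i++) {
		error = try_connect(arg);
		if (error == 0)
			return (0);
	}
	if (error == ECONNREFUSED)
		fprintf(stderr,
		    "NFS server not reachable: connection refused\n");
	else
		fprintf(stderr, "NFS server not reachable: error %d\n", error);
	return (error);
}

/* Stub standing in for a server that always answers with an RST. */
int
always_refused(void *arg)
{
	int *calls = arg;

	(*calls)++;
	return (ECONNREFUSED);
}

/* Stub standing in for a server that comes up on the third attempt. */
int
succeeds_on_third(void *arg)
{
	int *calls = arg;

	return (++(*calls) >= 3 ? 0 : ECONNREFUSED);
}
```

With a limit of 5, the always_refused stub is called five times and ECONNREFUSED is handed back to the caller, which could then fail mountroot; the succeeds_on_third stub returns 0 after three calls without printing anything.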
Well, the problem with setting a limit is "how long"? I know of a FreeBSD NFS server that exports over 72,000 file systems. I suspect that startup of a server like this can take a while, and I think some systems would simply want to retry until the server is up.

I also don't see much use in a panic(), since a dump or stack trace isn't useful and another reboot cycle will presumably end up in the same state.

I can see that spitting out a single message to the console along the lines of "Can't connect to NFS server" would be useful, so that sysadmins would know why the boot has wedged. I'll have to take a look at the code to see if the mount root case can be identified where it loops attempting reconnects, so that a message can be generated for that case. I think newnfs_connect() does the socreate() and returns an error when it fails. However, newnfs_request() ignores any error return, so a message could possibly be generated there.

All of the above is just my humble opinion. I think you should ask on a mailing list (FreeBSD-fs@ maybe?) to see what others think is the correct behaviour for this case.
Actually, when I look at the code, it seems that there is a call to nfs_mountdiskless()->mountnfs(), and mountnfs() should fail when newnfs_connect() fails. This looks like it should result in a message like:

    nfs_mountroot: mount <path> on /: <errno>

being generated. I have no idea why your case does not do this.
Created attachment 209371: generate a console message when RPC reconnect fails

This trivial patch might generate a console message when the reconnect fails. Since the message is only printed on the first retry, hopefully it works without being too noisy. Untested at this time.
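The "retry until the server is up, but say so once" idea described in this comment can be sketched in user-space C as follows. This is an assumption-laden illustration, not the attached patch: reconnect_until_up(), struct reconnect_stats, the stub callback and the message text are all hypothetical, and real kernel code would not use sleep() or stderr.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical callback: returns 0 on success or an errno value on failure. */
typedef int (*connect_fn)(void *arg);

/* Hypothetical bookkeeping so a caller can see what the loop did. */
struct reconnect_stats {
	unsigned int attempts;	/* connect attempts made */
	unsigned int messages;	/* console messages printed */
};

/*
 * Keep retrying until the connect succeeds.  Only the first failure
 * produces a message, so a server that stays down for a long time
 * does not flood the console.
 */
void
reconnect_until_up(connect_fn try_connect, void *arg, unsigned int delay_sec,
    struct reconnect_stats *st)
{
	int error;

	assert(try_connect != NULL && st != NULL);
	st->attempts = 0;
	st->messages = 0;
	for (;;) {
		st->attempts++;
		error = try_connect(arg);
		if (error == 0)
			return;
		if (st->messages == 0) {
			/* Illustrative wording only, not the patch's text. */
			fprintf(stderr,
			    "cannot connect to server, error %d; retrying\n",
			    error);
			st->messages++;
		}
		if (delay_sec > 0)
			sleep(delay_sec);
	}
}

/* Stub standing in for a server that comes up on the third attempt. */
int
up_on_third_try(void *arg)
{
	int *calls = arg;

	return (++(*calls) >= 3 ? 0 : ECONNREFUSED);
}
```

Run against the up_on_third_try stub with a zero delay, the loop makes three attempts and prints exactly one message. Unlike a bounded retry, it never returns while the server stays down, which matches the wish to keep retrying against a slow-starting server.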
Sorry for the late feedback. It seems I got into this condition while patching things for IPv6 support, and after fixing that I cannot easily reproduce it. The functional part of the patch seems fine to me. I keep wondering about the printf including the "(NFS?)": could the "?" be confusing to people? I'd be happy if you'd go ahead and commit it.
Feel free to commit it (or any variant of the patch that you prefer). I put in the "(NFS?)" since the krpc is theoretically not NFS specific, although it happens to only be used by NFS and the NLM (rpc.lockd, which is thought of as part of NFS by many).