Created attachment 239339 [details]
replace desired_name with GSS_C_NO_NAME so gss_acquire_cred() works

If you attempt a Kerberized NFS mount with the gssname option, such as:

# mount -t nfs -o nfsv4,sec=krb5,gssname=host nfs-server:/ /mnt

the gssd daemon gets stuck in the gss_acquire_cred() library call for several seconds. It then returns success, but the credentials are bogus.

A workaround is:

# kinit -k host/nfs-client.domain
# mount -t nfs -o nfsv4,sec=krb5 nfs-server:/ /mnt

The one-line patch in the attachment seems to fix the problem.

I have no idea how long this bug has existed, but I suspect it has been broken for quite a while, due to some change in the Heimdal GSSAPI library.
The patch in the attachment has been committed to main.
Created attachment 239406 [details]
Increase timeout for upcalls to the gssd(8) daemon

It turns out that the underlying problem that broke Kerberized NFS mounts using gssname was a 25sec timeout on the kernel GSSAPI upcall. For some reason, gss_acquire_cred() with a principal name argument now takes about 28sec to complete. The upcall would time out. The kernel code would assume the gssd had died and, as such, would close the socket. Ironically, this does cause the gssd daemon to terminate via a SIGPIPE signal.

This patch increases the timeout. With this patch, but not the patch in attachment #239339 [details], the mount works, but takes almost 30sec to complete, so I think applying both patches is appropriate.

NB: The timeout increase is needed when a user's TGT has expired, since gss_init_sec_context() takes over 25sec in that case, as well.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e3c26ce5cb410e4e58e131dfea7054e0bf11e3ca

commit e3c26ce5cb410e4e58e131dfea7054e0bf11e3ca
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-01-11 21:20:31 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-01-11 21:20:31 +0000

    kgssapi: Increase timeout for kernel to gssd(8) upcalls

    It turns out that the underlying problem that caused a Kerberized NFS mount with the "gssname" option to fail was that the kernel upcall to the gssd(8) daemon would time out prematurely after 25 seconds. The gss_acquire_cred() GSSAPI library call takes about 27 seconds for the case where a desired_name argument is specified. A similarly long delay occurs when the gss_init_sec_context() call is made and the user principal's TGT has expired.

    Once the upcall timed out, the kernel code assumed that the gssd(8) daemon had died and closed the socket. Ironically, closing the socket did cause the gssd(8) daemon to terminate via a SIGPIPE signal.

    This patch increases the timeout to 5 minutes. Since a timeout should only occur when the gssd(8) daemon has died, a long timeout should be ok and seems to fix this problem.

    I still think that commit c33509d49a should remain in the system, since it allows the mount to complete quickly and not take nearly 30 seconds.

    PR: 268823
    MFC after: 2 weeks

 sys/kgssapi/gss_impl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
Interesting! I recently ran into a similar bug in a Linux client (one of 10 identical clients we use to measure NFS mount latency and run general functionality tests from various places around the university) where a sec=krb5 nfsv4 mount (from a FreeBSD server) would take a long time (the testing system terminated the mount attempt after 5s) instead of the normal sub-second mounts. And, in typical Kerberos style, few usable error messages were logged anywhere, so I ended up having to snoop network traffic and trace system calls to find out what was going on... sigh.

There the problem turned out to be an expired /tmp/krb5cc_0 ticket. On that Linux client this would normally have been managed by the "gssproxy" daemon, which maintains and refreshes that ticket, but while diagnosing a problem I had manually done a:

kinit -k 'HOSTNAME$'

to test whether the host ticket/principal was valid (I had renamed the client).

This caused the upcall from the kernel to gssproxy to take a very long time, and it would finally return an invalid/expired ticket to the kernel, which caused timeouts/problems further down the chain.

(The problem was solved by deleting my manually "installed" ticket and letting gssproxy handle it :-)
It turned out that the long delay was caused by a misconfigured DNS. Although I did not have "dns" in the /etc/nsswitch.conf, I did have a /etc/resolv.conf on the machine and, because of that, it tried to contact the DNS server for something like 27 seconds. Deleting /etc/resolv.conf fixed the delay. So, the first patch has now been reverted from "main" and the second patch has been committed to "main" so that the gssd(8) daemon will not terminate if the GSSAPI library is slow to return. With the second patch applied to your kernel, the mount will succeed if your DNS is misconfigured, although it may take close to 30 seconds.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=08b2c77707036768099e7df66222f75da877ebb7

commit 08b2c77707036768099e7df66222f75da877ebb7
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-01-11 21:20:31 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-01-26 03:02:18 +0000

    kgssapi: Increase timeout for kernel to gssd(8) upcalls

    It turns out that the underlying problem that caused a Kerberized NFS mount with the "gssname" option to fail was that the kernel upcall to the gssd(8) daemon would time out prematurely after 25 seconds. The gss_acquire_cred() GSSAPI library call takes about 27 seconds for the case where a desired_name argument is specified. A similarly long delay occurs when the gss_init_sec_context() call is made and the user principal's TGT has expired.

    Once the upcall timed out, the kernel code assumed that the gssd(8) daemon had died and closed the socket. Ironically, closing the socket did cause the gssd(8) daemon to terminate via a SIGPIPE signal.

    This patch increases the timeout to 5 minutes. Since a timeout should only occur when the gssd(8) daemon has died, a long timeout should be ok and seems to fix this problem.

    I still think that commit c33509d49a should remain in the system, since it allows the mount to complete quickly and not take nearly 30 seconds.

    PR: 268823

    (cherry picked from commit e3c26ce5cb410e4e58e131dfea7054e0bf11e3ca)

 sys/kgssapi/gss_impl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9d35b4b38e1a466371c551568538488b6f873d07

commit 9d35b4b38e1a466371c551568538488b6f873d07
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-01-11 21:20:31 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-01-26 03:07:41 +0000

    kgssapi: Increase timeout for kernel to gssd(8) upcalls

    It turns out that the underlying problem that caused a Kerberized NFS mount with the "gssname" option to fail was that the kernel upcall to the gssd(8) daemon would time out prematurely after 25 seconds. The gss_acquire_cred() GSSAPI library call takes about 27 seconds for the case where a desired_name argument is specified. A similarly long delay occurs when the gss_init_sec_context() call is made and the user principal's TGT has expired.

    Once the upcall timed out, the kernel code assumed that the gssd(8) daemon had died and closed the socket. Ironically, closing the socket did cause the gssd(8) daemon to terminate via a SIGPIPE signal.

    This patch increases the timeout to 5 minutes. Since a timeout should only occur when the gssd(8) daemon has died, a long timeout should be ok and seems to fix this problem.

    I still think that commit c33509d49a should remain in the system, since it allows the mount to complete quickly and not take nearly 30 seconds.

    PR: 268823

    (cherry picked from commit e3c26ce5cb410e4e58e131dfea7054e0bf11e3ca)

 sys/kgssapi/gss_impl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
The second attachment has now been committed and MFC'd.

Note that the long (30 sec) delay in gss_acquire_cred() and gss_init_sec_context() was caused by a misconfigured DNS setup. (Apparently the library tries to use DNS when a /etc/resolv.conf file exists, even if use of DNS for host lookup is not specified in /etc/nsswitch.conf.) This is the case where this patch is needed to prevent the gssd(8) daemon from being terminated.