275905 – nfs client: mount becomes unresponsive

Bug 275905 - nfs client: mount becomes unresponsive

Summary: nfs client: mount becomes unresponsive

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	14.0-RELEASE
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	Rick Macklem

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-12-24 03:30 UTC by Lexi Winter
Modified:	2024-01-02 02:02 UTC (History)
CC List:	2 users (show)

See Also:

Flags:	rmacklem: mfc-stable14+ rmacklem: mfc-stable13+ rmacklem: mfc-stable12+

Attachments
Free a session slot when a server replies RPC_AUTHERR (932 bytes, patch) 2023-12-25 00:16 UTC, Rick Macklem	no flags	Details \| Diff
patch (2.13 KB, patch) 2023-12-26 07:24 UTC, Lexi Winter	no flags	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Lexi Winter freebsd_committer

2023-12-24 03:30:54 UTC

FreeBSD ilythia.eden.le-fay.org 14.0-RELEASE-p3 FreeBSD 14.0-RELEASE-p3 #0: Mon Dec 11 04:56:01 UTC 2023     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

this system has an NFSv4.2 mount from a FreeBSD 14.0 server using sec=krb5p.  after a few hours of activity (mostly low-volume random reads at ~5-10Mbps throughput), the NFS mount has hung and cannot be accessed; 'df' and 'nfsstat -m' also hang.  there's no NFS network traffic and no indication in the logs on either system of a problem.  the Kerberos ticket isn't expired, and renewing the ticket anyway made no difference.

according to kgdb, nfscl seems to be stuck in nfsv4_sequencelookup:

(kgdb) where
#0  sched_switch (td=td@entry=0xfffffe017c43a740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2297
#1  0xffffffff80b5028b in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80ba001b in sleepq_switch (wchan=wchan@entry=0xfffff800754bd9c0, pri=pri@entry=68) at /usr/src/sys/kern/subr_sleepqueue.c:607
#3  0xffffffff80ba05ef in sleepq_timedwait (wchan=wchan@entry=0xfffff800754bd9c0, pri=68) at /usr/src/sys/kern/subr_sleepqueue.c:689
#4  0xffffffff80b4f9e0 in _sleep (ident=ident@entry=0xfffff800754bd9c0, lock=lock@entry=0xfffff800754bd810, priority=priority@entry=68, wmesg=0xffffffff81195590 "nfsclseq", sbt=4294967000, pr=pr@entry=0, flags=256) at /usr/src/sys/kern/kern_synch.c:219
#5  0xffffffff809ed270 in nfsv4_sequencelookup (nmp=nmp@entry=0xfffff802f5506000, sep=sep@entry=0xfffff800754bd810, slotposp=slotposp@entry=0xfffffe017b087b38, maxslotp=maxslotp@entry=0xfffffe017b087b34, slotseqp=slotseqp@entry=0xfffffe017b087b3c,
    sessionid=sessionid@entry=0xfffffe017b087b40 "k\214\020\201\377\377\377\377\377\242\352\200\377\377\377\377\n\340h D\257\353\v ", fnd_init=false) at /usr/src/sys/fs/nfs/nfs_commonsubs.c:4989
#6  0xffffffff809e1c39 in nfsv4_setsequence (nmp=0xfffff802f5506000, nd=0xfffffe017b087c38, sep=0xfffff800754bd810, dont_replycache=0, cred=0xfffff800087f3c00) at /usr/src/sys/fs/nfs/nfs_commonsubs.c:4892
#7  0xffffffff809e12f1 in nfscl_reqstart (nd=0xfffffe017b087c38, procnum=32, nmp=0xfffff802f5506000, nfhp=0x0, fhlen=0, opcntpp=<optimized out>, sep=<optimized out>, vers=0, minorvers=0, cred=0xfffff800087f3c00) at /usr/src/sys/fs/nfs/nfs_commonsubs.c:433
#8  0xffffffff80a11e77 in nfsrpc_renew (clp=<optimized out>, dsp=dsp@entry=0x0, cred=cred@entry=0xfffff800087f3c00, p=p@entry=0xfffffe017c43a740) at /usr/src/sys/fs/nfsclient/nfs_clrpcops.c:4874
#9  0xffffffff809f7ae5 in nfscl_renewthread (clp=clp@entry=0xfffffe0180c75000, p=0xfffffe017c43a740) at /usr/src/sys/fs/nfsclient/nfs_clstate.c:2769
#10 0xffffffff80a2bde4 in start_nfscl (arg=<unavailable>, arg@entry=0xfffffe0180c75000) at /usr/src/sys/fs/nfsclient/nfs_clport.c:782
#11 0xffffffff80afdb7f in fork_exit (callout=0xffffffff80a2bdd0 <start_nfscl>, arg=0xfffffe0180c75000, frame=0xfffffe017b087f40) at /usr/src/sys/kern/kern_fork.c:1160
#12 <signal handler called>
#13 0x00000328911e38ca in ?? ()
Backtrace stopped: Cannot access memory at address 0x3288e984798
(kgdb) frame 5
#5  0xffffffff809ed270 in nfsv4_sequencelookup (nmp=nmp@entry=0xfffff802f5506000, sep=sep@entry=0xfffff800754bd810, slotposp=slotposp@entry=0xfffffe017b087b38, maxslotp=maxslotp@entry=0xfffffe017b087b34, slotseqp=slotseqp@entry=0xfffffe017b087b3c,
    sessionid=sessionid@entry=0xfffffe017b087b40 "k\214\020\201\377\377\377\377\377\242\352\200\377\377\377\377\n\340h D\257\353\v ", fnd_init=false) at /usr/src/sys/fs/nfs/nfs_commonsubs.c:4989
warning: Source file is more recent than executable.
4989                                    mtx_sleep(&sep->nfsess_slots, &sep->nfsess_mtx,

Comment 1 Lexi Winter freebsd_committer

2023-12-24 14:25:36 UTC

this seems to be connected to Kerberos ticket expiry: the problem recurred just now when my ticket expired.  when this happens, klist shows:

% klist
Credentials cache: FILE:/tmp/krb5cc_10006
        Principal: lexi@EDEN.LE-FAY.ORG

  Issued                Expires        Principal
Dec 24 04:13:34 2023  >>>Expired<<<  krbtgt/EDEN.LE-FAY.ORG@EDEN.LE-FAY.ORG
Dec 24 06:58:04 2023  >>>Expired<<<  host/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 13:37:42 2023  >>>Expired<<<  nfs/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 13:37:42 2023  >>>Expired<<<  nfs/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 13:37:42 2023  >>>Expired<<<  nfs/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 13:37:42 2023  >>>Expired<<<  nfs/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
[... many identical entries elided...]
Dec 24 13:37:42 2023  >>>Expired<<<  nfs/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 13:37:42 2023  >>>Expired<<<  nfs/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 06:58:04 2023  >>>Expired<<<  host/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG
Dec 24 06:58:04 2023  >>>Expired<<<  host/hemlock.eden.le-fay.org@EDEN.LE-FAY.ORG

in additional, the 'Invalid' counter in nfsstat -cE increases to 64 (it was previously 0):

Rpc Info:
     TimedOut      Invalid    X Replies      Retries     Requests
            0           64            0            0      4832814

however, the mount never recovers even after renewing the tickets.

Comment 2 Rick Macklem freebsd_committer

2023-12-24 15:24:52 UTC

You haven't used the "intr" and/or "soft"
mount options by any chance?

These are guaranteed to break NFSv4 mounts.

The renewthread is stuck in nfsv4_sequencelookup()
because there are no session slots available.
- This can happen for "soft" or "intr" mounts
  because the RPC reply does not get processed
  properly to release the session slot.
- If the mount does not use "soft" nor "intr",
  then there is some other reason that the replies
  are not being processed.
  --> I'll think about this. It might be a side effect
      of a RPCSEC_GSS session being expired so the client
      cannot de-encrypt the RPC reply, or something like
      that. Typically, the RPCSEC_GSS session (not the same
      thing as a NFSv4 session, just to make the terminology
      confusing) expires when the TGT used to acquire it
      expires. It will be re-created upon the next RPC
      if there is a valid TGT, but that might be too
      late for processing of an RPC reply. (With sec=krb5p
      the reply is encrypted, so it cannot be processed
      without a valid RPCSEC_GSS session.


Once all session slots are in use and no RPC reply is
processed to release one, the mount is hung and the
only fix is to "umount -N <mnt-path>" or reboot the
client machine.

Comment 3 Lexi Winter freebsd_committer

2023-12-24 15:36:46 UTC

no, i'm not using soft or intr (assuming the default is hard,nointr):

hemlock.eden.le-fay.org:/public /data/public nfs rw,nfsv4,minorversion=2,sec=krb5p,gssname=host,proto=tcp6 0 0

in case it makes a difference, the max lifetime on my user TGT (lexi@EDEN.LE-FAY.ORG) is very long, about 30 days.  in the second comment, i'd only requested a normal 10 hour ticket.  in the original report, i think the lifetime may have been much longer - which is why i thought the ticket hadn't expired.  i'm not sure if this is relevant (i'm sure it *shouldn't* be, but perhaps this configuration hasn't been tested as much).

Comment 4 Rick Macklem freebsd_committer

2023-12-24 15:50:38 UTC

The renew thread RPC (the one that your backtrace shows
as hung in nfsv4_sequencelookup()) would be using the
host/FQDN@REALM entry in the default /etc/krb5.keytab.
(ie. Your TGT would not be used for this, since the
 renew thread runs as "root".)

Could the client's keytab entry be broken somehow?
(As you probably are aware, when you create a new
 keytab for a principal, old keytabs no longer work.)

You could go onto the client and do
# kinit -k host/FQDN
to see that the keytab entry for "host" works ok.
(I am still wondering what will happen if the TGT
 for the keytab expires just after the RPC request
 is sent to the server. I'll look into that.)

There could also be a problem where other RPCs were
using all slots, so the renew thread slept in
nfsv4_sequencelookup() and then another thread needed
the renew thread to do some work.
--> The renew thread should really have a dedicated
    slot in the NFSv4 session, so that no other thread
    can cause it to block.
    --> I might work on a patch for that.

Does this hang occur frequently or was this a "one time"
hang on a mount that usually behaves ok?

Comment 5 Lexi Winter freebsd_committer

2023-12-24 16:01:02 UTC

the client keytab should be fine, since i only created it about 2 days ago when i installed this host and there was no previous principal for this fqdn.

'kinit -k host/fqdn' on the client seems to work:

# kinit -k host/ilythia.eden.le-fay.org
# klist
Credentials cache: FILE:/tmp/krb5cc_0
        Principal: host/ilythia.eden.le-fay.org@EDEN.LE-FAY.ORG

  Issued                Expires               Principal
Dec 24 15:54:16 2023  Dec 25 01:54:16 2023  krbtgt/EDEN.LE-FAY.ORG@EDEN.LE-FAY.ORG

> Does this hang occur frequently or was this a "one time" hang on a mount that usually behaves ok?

as i mentioned, i only installed this host 2 days ago, so i can't say for sure, but so far it has reliably occurred twice about 10 hours after booting so it seems to be 100% reproducible.  (iow, it never *hasn't* occurred at that time.)

in the mean time, i've temporarily switched the mount from sec=krb5p to sec=krb5 to see if this fixes the issue; aiui, this means no GSS should be involved after the initial mount, so i expect it will, but i'm happy to do any other testing you need.

Comment 6 Rick Macklem freebsd_committer

2023-12-24 23:22:32 UTC

Not exactly. sec=krb5 says that a RPCSEC_GSS (think Kerberos) session
is used to identify the user for all RPCs. The difference w.r.t. krb5p is that
it does not encrypt the NFS payload.

I doubt it will make any difference, but it sounds like you'll know soon enough.

I now have a hunch w.r.t. what might be broken, but I need to look at the code
(and if my hunch seems correct) and maybe come up with a patch.

You could try the "syskrb5" mount option, which avoids use of the keytab and
allows "system operations that maintain the state" to use AUTH_SYS, while the
rest (all involving file data) use Kerberos.
(If my hunch is correct, this will not fix the problem, but might
cause it to behave
 better.)

rick
ps: I accidentally did this as an email reply...

Comment 7 Rick Macklem freebsd_committer

2023-12-25 00:16:54 UTC

Created attachment 247232 [details]
Free a session slot when a server replies RPC_AUTHERR

This patch might fix the problem. At least it
will give us a little more information w.r.t.
what is causing the hang.

If the reporter can test it on their client,
that could be useful.

Comment 8 Lexi Winter freebsd_committer

2023-12-25 09:11:20 UTC

sec=krb5 didn't fix the problem.  i didn't have a chance to test syskrb5; instead i've applied the patch, and i'll let you know if 10 hours if that's made a difference.

Comment 9 Lexi Winter freebsd_committer

2023-12-25 17:52:32 UTC

unfortunately the patch doesn't seem to make any difference.

when the mount is idle, the behaviour is as expected:

- ticket expires
- access to the mount returns EACCES
- obtaining a new ticket (kinit) restores access to the mount
- this is probably the same behaviour as without the patch, but i didn't think to test it

when the mount is busy (~20Mbps read traffic in this case), the behaviour is the same as before: instead of returning EACCESS, the mount simply hangs.  the printf in the patch is *not* triggered, and obtaining a new ticket doesn't fix the mount.

this is on releng/14.0; i'm about to upgrade this system to stable/14 (for unrelated reasons) so i'll test it there as well just in case there's a difference.

Comment 10 Lexi Winter freebsd_committer

2023-12-25 17:55:44 UTC

in case this makes a difference, the vast majority of the I/O is not on the mounted filesystem itself, but a child filesystem mounted as a mirror mount (which doesn't appear in df, so i forgot to mention it).

Comment 11 Rick Macklem freebsd_committer

2023-12-25 20:18:53 UTC

Hmm. I am not sure what you mean by a mirror
mount under the NFS mount?
Maybe you can explain that more.
(It is probably a part of this, since no one
 else has been reporting similar hangs.)

Try "syskrb5" on the mount, it should at least
avoid the renewthread from getting hung.

Next time you have the hang, please collect
the following information on the NFS client:
# ps axHl
# procstat -a -kk
which should give us some idea what is hanging.

There is really no diffeence in the NFS code between
releng/14.0 and stable/14, but maybe some other
part of the system that is involved in the hang
has changed?

Comment 12 Rick Macklem freebsd_committer

2023-12-25 20:29:59 UTC

Btw, I am not sure what your goals are for using
sec=krb5[p], but if they are "on the wire data encryption"
and client machine identification, the using NFS-over-TLS
is an alternative.

You can see:
https://people.freebsd.org/~rmacklem/nfs-over-tls-setup.txt

If you are using sec=krb5[p] so that users are identified via
their kerberos principal (instead of uid), then you are stuck
with Kerberos and the hassles of TGT expiration.
You are probably aware that there are utilities out there
that run as daemons and renew TGTs if they are renewable.

Comment 13 Lexi Winter freebsd_committer

2023-12-25 20:44:26 UTC

by mirror mount, i mean that on the NFS server, the filesystem has several sub-filesystems:

Filesystem                  Size    Used   Avail Capacity  Mounted on
data/public                 28T     45M     28T     0%    /data/public
data/public/Books           28T     48M     28T     0%    /data/public/Books
data/public/Miscellaneous   28T    178M     28T     0%    /data/public/Miscellaneous
data/public/Software        28T    166G     28T     1%    /data/public/Software
[...]

which are individually exported in /etc/exports:

/data/public       -sec=krb5:krb5i:krb5p -network 2001:8b0:aab5::/48
/data/public/Books -sec=krb5:krb5i:krb5p -network 2001:8b0:aab5::/48
[...]
V4: /data          -sec=sys:krb5:krb5i:krb5p -network 2001:8b0:aab5::/48

but on the client, only the root filesystem, /data/public, is mounted in /etc/fstab.  when something accesses a sub-filesystem, like /data/public/Software, the client automatically mounts it in the appropriate place.

possibly FreeBSD calls this something else; i think "mirror mount" might be the Solaris or Linux term for it (e.g., https://docs.oracle.com/cd/E26502_01/html/E28997/ggcvr.html).

Comment 14 Rick Macklem freebsd_committer

2023-12-25 21:17:49 UTC

Ok. I didn't realize you were referring to the
NFS server when you said "mirror mount".
Doing this on the server is fine and should not
affect the "hang".
(And, yes, the term is used in FreeBSD, although
I'll admit it has always seemed a weird term for
exported subtrees.)

Btw, the client does not construct separate mounts
for these sub-filesystems on the server, it simply
crosses the NFS server mount point. (Something that
did not work for NFSv3.)

Comment 15 Lexi Winter freebsd_committer

2023-12-26 07:24:21 UTC

Created attachment 247257 [details]
patch

i see - on some other systems these do appear as separate mounts, but the FreeBSD method seems cleaner.  in any case, this doesn't seem to be related: i tested with a direct mount of the subfilesystem and the behaviour was the same.

syskrb5 made no difference.

however, i did some more debugging and i think the reason the suggested patch doesn't work is because it should use RPC_AUTHERROR, not RPC_AUTHERR.  after changing that, it *does* seem to work: ticket expiry causes the application to get EACCES, but the mount *doesn't* hang and i can access it with ls (also returning EACCES); renewing the ticket allows the mount to work again.

i've attached the full patch i'm currently running.

this is on: FreeBSD 14.0-STABLE #5 stable/14-n266050-ac83f3fa44f3-dirty: Tue Dec 26 07:15:04 GMT 2023     lexi@ilythia.eden.le-fay.org:/src/stable/14/sys/amd64/compile/ILYTHIA

Comment 16 Rick Macklem freebsd_committer

2023-12-26 16:24:41 UTC

Good news (and good catch). I never even noticed
there were 2 different macros with similar names.
And, yes, I had meant RPC_AUTHERROR when I did
the patch.

As an aside, great minds think alike. I had added
printf()s almost identical to yours, but I had not
been able to reproduce the problem yet, so I didn't
see the error code.

Anyhow, looks like it is fixed. This bug has been
in the code ever since NFSv4.1 was added to the client.
I guess it shows how few users use sec=krb5[ip].

Thanks for reporting this and with your help debugging it.

Comment 17 commit-hook freebsd_committer

2023-12-26 22:36:51 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a558130881e9d574dc5f37827fe2284667d5aba8

commit a558130881e9d574dc5f37827fe2284667d5aba8
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-26 22:33:39 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-12-26 22:33:39 +0000

    nfscl: Fix handling of expired Kerberos credentials (NFSv4.1/4.2)

    If the NFS server detects that the Kerberos credentials provided
    by a NFSv4.1/4.2 mount using sec=krb5[ip] have expired, the NFS
    server replies with a krpc layer error of RPC_AUTHERROR.
    When this happened, the client erroneously left the NFSv4.1/4.2
    session slot busy, so that it could not be used by other RPCs.
    If this happened for all session slots, the mount point would
    hang.

    This patch fixes the problem by releasing the session slot
    and resetting its sequence# upon receiving a RPC_AUTHERROR
    reply.

    This bug only affects NFSv4.1/4.2 mounts using sec=krb5[ip],
    but has existed since NFSv4.1 client support was added to
    FreeBSD.

    So, why has the bug remained undetected for so long?
    I cannot be sure, but I suspect that, often, the client detected
    the Kerberos credential expiration before attempting the RPC.
    For this case, the client would not do the RPC and, as such,
    there would be no busy session slot.  Also, no hang would
    occur until all session slots are busied (64 for a FreeBSD
    client/server), so many cases of the bug probably went undetected?
    Also, use of sec=krb5[ip] mounts are not that common.

    PR:     275905
    Tested by:      Lexi <lexi.freebsd@le-fay.org>
    MFC after:      1 week

 sys/fs/nfs/nfs_commonkrpc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comment 18 Rick Macklem freebsd_committer

2023-12-26 23:06:42 UTC

I was able to reproduce the problem and see that
the patch (the second variant, but without the printfs)
worked, so I have committed it to main.

It will be MFC'd in a week.

Thanks for reporting it and for fixing/testing the patch.

Comment 19 Lexi Winter freebsd_committer

2023-12-27 05:57:28 UTC

thanks for the fix!  this was much more pleasant than getting NFS bugs fixed in some other systems i use...

Comment 20 commit-hook freebsd_committer

2024-01-02 01:23:05 UTC

A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=dabf1797a76a7c87b8f38fb300fb03f20cf6f973

commit dabf1797a76a7c87b8f38fb300fb03f20cf6f973
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-26 22:33:39 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2024-01-02 01:20:17 +0000

    nfscl: Fix handling of expired Kerberos credentials (NFSv4.1/4.2)

    If the NFS server detects that the Kerberos credentials provided
    by a NFSv4.1/4.2 mount using sec=krb5[ip] have expired, the NFS
    server replies with a krpc layer error of RPC_AUTHERROR.
    When this happened, the client erroneously left the NFSv4.1/4.2
    session slot busy, so that it could not be used by other RPCs.
    If this happened for all session slots, the mount point would
    hang.

    This patch fixes the problem by releasing the session slot
    and resetting its sequence# upon receiving a RPC_AUTHERROR
    reply.

    This bug only affects NFSv4.1/4.2 mounts using sec=krb5[ip],
    but has existed since NFSv4.1 client support was added to
    FreeBSD.

    So, why has the bug remained undetected for so long?
    I cannot be sure, but I suspect that, often, the client detected
    the Kerberos credential expiration before attempting the RPC.
    For this case, the client would not do the RPC and, as such,
    there would be no busy session slot.  Also, no hang would
    occur until all session slots are busied (64 for a FreeBSD
    client/server), so many cases of the bug probably went undetected?
    Also, use of sec=krb5[ip] mounts are not that common.

    PR:     275905

    (cherry picked from commit a558130881e9d574dc5f37827fe2284667d5aba8)

 sys/fs/nfs/nfs_commonkrpc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comment 21 commit-hook freebsd_committer

2024-01-02 01:45:08 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c85ff48a092e4c5989f91a4a1ccf99441ffb3170

commit c85ff48a092e4c5989f91a4a1ccf99441ffb3170
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-26 22:33:39 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2024-01-02 01:41:19 +0000

    nfscl: Fix handling of expired Kerberos credentials (NFSv4.1/4.2)

    If the NFS server detects that the Kerberos credentials provided
    by a NFSv4.1/4.2 mount using sec=krb5[ip] have expired, the NFS
    server replies with a krpc layer error of RPC_AUTHERROR.
    When this happened, the client erroneously left the NFSv4.1/4.2
    session slot busy, so that it could not be used by other RPCs.
    If this happened for all session slots, the mount point would
    hang.

    This patch fixes the problem by releasing the session slot
    and resetting its sequence# upon receiving a RPC_AUTHERROR
    reply.

    This bug only affects NFSv4.1/4.2 mounts using sec=krb5[ip],
    but has existed since NFSv4.1 client support was added to
    FreeBSD.

    So, why has the bug remained undetected for so long?
    I cannot be sure, but I suspect that, often, the client detected
    the Kerberos credential expiration before attempting the RPC.
    For this case, the client would not do the RPC and, as such,
    there would be no busy session slot.  Also, no hang would
    occur until all session slots are busied (64 for a FreeBSD
    client/server), so many cases of the bug probably went undetected?
    Also, use of sec=krb5[ip] mounts are not that common.

    PR:     275905

    (cherry picked from commit a558130881e9d574dc5f37827fe2284667d5aba8)

 sys/fs/nfs/nfs_commonkrpc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comment 22 commit-hook freebsd_committer

2024-01-02 01:57:12 UTC

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=263ed87413aece1786df7275cc541dad324c76a5

commit 263ed87413aece1786df7275cc541dad324c76a5
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-26 22:33:39 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2024-01-02 01:55:14 +0000

    nfscl: Fix handling of expired Kerberos credentials (NFSv4.1/4.2)

    If the NFS server detects that the Kerberos credentials provided
    by a NFSv4.1/4.2 mount using sec=krb5[ip] have expired, the NFS
    server replies with a krpc layer error of RPC_AUTHERROR.
    When this happened, the client erroneously left the NFSv4.1/4.2
    session slot busy, so that it could not be used by other RPCs.
    If this happened for all session slots, the mount point would
    hang.

    This patch fixes the problem by releasing the session slot
    and resetting its sequence# upon receiving a RPC_AUTHERROR
    reply.

    This bug only affects NFSv4.1/4.2 mounts using sec=krb5[ip],
    but has existed since NFSv4.1 client support was added to
    FreeBSD.

    So, why has the bug remained undetected for so long?
    I cannot be sure, but I suspect that, often, the client detected
    the Kerberos credential expiration before attempting the RPC.
    For this case, the client would not do the RPC and, as such,
    there would be no busy session slot.  Also, no hang would
    occur until all session slots are busied (64 for a FreeBSD
    client/server), so many cases of the bug probably went undetected?
    Also, use of sec=krb5[ip] mounts are not that common.

    PR:     275905

    (cherry picked from commit a558130881e9d574dc5f37827fe2284667d5aba8)

 sys/fs/nfs/nfs_commonkrpc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comment 23 Rick Macklem freebsd_committer

2024-01-02 02:02:48 UTC

Patch has been MFC'd.