Bug 274346 - kernel panic/page fault in nfs_commonkrpc.c::newnfs_request(), due to duplicate hostid's
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Rick Macklem
Keywords: crash
 
Reported: 2023-10-08 02:27 UTC by Joshua Kinard
Modified: 2023-11-02 23:51 UTC (History)
rmacklem: mfc-stable14+
rmacklem: mfc-stable13+


Attachments
add checks for no valid session (1.36 KB, patch)
2023-10-10 00:06 UTC, Rick Macklem

Description Joshua Kinard 2023-10-08 02:27:19 UTC
I have managed to trigger a kernel panic in the NFS subsystem on 14.0-BETA5, partly because of my own mistake: I did not change the hostid of a system that is a clone of another active system.  The first system is running 13.2-RELEASE-p4 and the cloned system is running 14.0-BETA5, newly upgraded.

There are several elements that lead to this panic:
  - Both systems have the same hostid
  - Both systems mount the same remote NFS share from a third system
  - Have the 13.2-RELEASE-p4 system start doing a job on the remote share, like compiling code (e.g., /usr/ports is on this share)
  - Have the cloned system running 14.0-BETA5 attempt to unmount the remote share
  - The 14.0-BETA5 system will crash

I know it's due to duplicate hostids, because the message below is printed on the console immediately before the kernel crashes:
> 
> Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
> 

The printf() for that exact string is in the crashing function, right where GDB says the crash happens: nfs_commonkrpc.c, function newnfs_request(), line 1212.  I'm just not sure whether it's the if statement immediately preceding the printf() call or the if statement that comes after it.  The next call in the machine code is memcmp(), so I am assuming a NULL deref of some kind.

My kernel is a custom build, but this can be triggered on a GENERIC kernel as well, as my first crash happened on GENERIC right before I was set to reboot into my rebuilt custom kernel after doing the second `freebsd-update install` phase to upgrade to 14.0-BETA5.  At that time, I had crashdumps disabled.  So the below crash info is from that custom kernel, after I enabled crashdumps and re-triggered the crash (it's at least reproducible...):

> Unread portion of the kernel message buffer:
> [179]
> [179]
> [179] Fatal trap 12: page fault while in kernel mode
> [179] cpuid = 0; apic id = 00
> [179] fault virtual address     = 0x4
> [179] fault code                = supervisor read data, page not present
> [179] instruction pointer       = 0x20:0xffffffff809e9893
> [179] stack pointer             = 0x28:0xfffffe00a233e800
> [179] frame pointer             = 0x28:0xfffffe00a233e800
> [179] code segment              = base rx0, limit 0xfffff, type 0x1b
> [179]                   = DPL 0, pres 1, long 1, def32 0, gran 1
> [179] processor eflags  = interrupt enabled, resume, IOPL = 0
> [179] current process           = 87256 (umount)
> [179] rdi: fffff800077761e4 rsi: 0000000000000004 rdx: 0000000000000010
> [179] rcx: 0000000000000000  r8: 0000000000000024  r9: fffffe00a233f000
> [179] rax: 0000000000000000 rbx: fffffe00a251b020 rbp: fffffe00a233e800
> [179] r10: 0000000000000585 r11: 000000007ff9687f r12: fffff80007776010
> [180] r13: fffff80003abb800 r14: fffffe00a233ea18 r15: fffff80007776000
> [180] trap number               = 12
> [180] panic: page fault
> [180] cpuid = 0
> [180] time = 1696723338
> [180] KDB: stack backtrace:
> [180] #0 0xffffffff806b5edd at kdb_backtrace+0x5d
> [180] #1 0xffffffff8066aa20 at vpanic+0x130
> [180] #2 0xffffffff8066a8e3 at panic+0x43
> [180] #3 0xffffffff809ee34c at trap_fatal+0x40c
> [180] #4 0xffffffff809ee39e at trap_pfault+0x4e
> [180] #5 0xffffffff809c6288 at calltrap+0x8
> [180] #6 0xffffffff8053f804 at newnfs_request+0x10a4
> [180] #7 0xffffffff8054dbad at nfsrpc_destroysession+0x11d
> [180] #8 0xffffffff80557252 at nfscl_umount+0x312
> [180] #9 0xffffffff80589470 at nfs_unmount+0x70
> [180] #10 0xffffffff8073c4ad at vfs_unmount_sigdefer+0x2d
> [180] #11 0xffffffff80741e37 at dounmount+0x787
> [180] #12 0xffffffff80741645 at kern_unmount+0x2f5
> [180] #13 0xffffffff809eeaf9 at amd64_syscall+0x109
> [180] #14 0xffffffff809c6b9b at fast_syscall_common+0xf8
> [180] Timeout initializing vt_vga
> [180] Uptime: 3m0s
> [180] Dumping 447 out of 8077 MB:..4%..11%..22%..33%..43%..51%..61%..72%..83%..93%
> 
> __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
> 57      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
> (kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
> #1  doadump (textdump=<optimized out>) at ../../../kern/kern_shutdown.c:405
> #2  0xffffffff8066a5b7 in kern_reboot (howto=260)
>     at ../../../kern/kern_shutdown.c:526
> #3  0xffffffff8066aa8d in vpanic (fmt=0xffffffff80a3bcd1 "%s",
>     ap=ap@entry=0xfffffe00a233e680) at ../../../kern/kern_shutdown.c:970
> #4  0xffffffff8066a8e3 in panic (fmt=<unavailable>)
>     at ../../../kern/kern_shutdown.c:894
> #5  0xffffffff809ee34c in trap_fatal (frame=0xfffffe00a233e740, eva=4)
>     at ../../../amd64/amd64/trap.c:952
> #6  0xffffffff809ee39e in trap_pfault (frame=0xfffffe00a233e740,
>     usermode=false, signo=<optimized out>, ucode=<optimized out>)
>     at ../../../amd64/amd64/trap.c:760
> #7  <signal handler called>
> #8  memcmp () at ../../../amd64/amd64/support.S:115
> #9  0xffffffff8053f804 in newnfs_request (nd=nd@entry=0xfffffe00a233ea18,
>     nmp=nmp@entry=0xfffff80003abb800, clp=clp@entry=0x0,
>     nrp=nrp@entry=0xfffff80003abbcd8, vp=vp@entry=0x0,
>     td=td@entry=0xfffffe00a251b020, cred=0xfffff8000765aa00, prog=100003,
>     vers=4, retsum=0x0, toplevel=1, xidp=0x0, dssep=0x0)
>     at ../../../fs/nfs/nfs_commonkrpc.c:1212
> #10 0xffffffff8054dbad in nfsrpc_destroysession (
>     nmp=nmp@entry=0xfffff80003abb800, tsep=0xfffff80007776010,
>     tsep@entry=0x0, cred=cred@entry=0xfffff8000765aa00,
>     p=p@entry=0xfffffe00a251b020) at ../../../fs/nfs/nfs_commonsubs.c:5151
> #11 0xffffffff80557252 in nfscl_umount (nmp=nmp@entry=0xfffff80003abb800,
>     p=p@entry=0xfffffe00a251b020, dhp=dhp@entry=0x0)
>     at ../../../fs/nfsclient/nfs_clstate.c:2094
> #12 0xffffffff80589470 in nfs_unmount (mp=0xfffffe00a4058000,
>     mntflags=<optimized out>) at ../../../fs/nfsclient/nfs_clvfsops.c:1903
> #13 0xffffffff8073c4ad in vfs_unmount_sigdefer (mp=0xfffffe00a4058000,
>     mntflags=134217728) at ../../../kern/vfs_init.c:185
> #14 0xffffffff80741e37 in dounmount (mp=0xfffff800077761e4,
>     mp@entry=0xfffffe00a4058000, flags=flags@entry=134217728,
>     td=td@entry=0xfffffe00a251b020) at ../../../kern/vfs_mount.c:2327
> #15 0xffffffff80741645 in kern_unmount (td=0xfffffe00a251b020,
>     path=<optimized out>, flags=134217728) at ../../../kern/vfs_mount.c:1785
> #16 0xffffffff809eeaf9 in syscallenter (td=0xfffffe00a251b020)
>     at ../../../amd64/amd64/../../kern/subr_syscall.c:187
> #17 amd64_syscall (td=0xfffffe00a251b020, traced=0)
>     at ../../../amd64/amd64/trap.c:1197
> #18 <signal handler called>
> #19 0x0000244bc41489ba in ?? ()
> Backtrace stopped: Cannot access memory at address 0x244bc20f4c18
> (kgdb)
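
A note on reading the dump above: under the amd64 System V calling convention the first three integer arguments are passed in rdi, rsi and rdx. The registers at the fault (rsi = 0x4, rdx = 0x10, fault virtual address = 0x4) are therefore consistent with memcmp() being asked to compare 16 bytes with its second pointer equal to 0x4. One way such a value can arise is taking the address of a field 4 bytes into a structure through a NULL pointer. The sketch below only illustrates that arithmetic; `struct hypo_seq`, its fields and `hypo_sessionid_addr()` are hypothetical and are not the kernel's actual layout or code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HYPO_SESSIONID_LEN 16	/* same length as rdx = 0x10 in the dump */

/* Hypothetical layout for illustration only; NOT the real kernel structure. */
struct hypo_seq {
	uint32_t hs_word;				/* 4 bytes ahead of the id */
	uint8_t  hs_sessionid[HYPO_SESSIONID_LEN];
};

/* In this assumed layout the id sits 4 bytes into the structure. */
static_assert(offsetof(struct hypo_seq, hs_sessionid) == 4,
    "session id expected at offset 4");

/*
 * Address that would be handed to memcmp() for the session id if the
 * structure lived at `base`.  Pure integer arithmetic: nothing is
 * dereferenced, so passing 0 (a NULL base) is safe here.
 */
static uintptr_t
hypo_sessionid_addr(uintptr_t base)
{
	return (base + offsetof(struct hypo_seq, hs_sessionid));
}
```

With a NULL base, `hypo_sessionid_addr(0)` evaluates to 0x4, the same value as the reported fault address.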
Comment 1 Graham Perrin 2023-10-09 17:26:20 UTC
^Triage: make former assignee fs@ a CC recipient.
Comment 2 Rick Macklem freebsd_committer freebsd_triage 2023-10-10 00:06:38 UTC
Created attachment 245539
add checks for no valid session

This patch adds checks in two places for the case
where there is no session on the mount.

Since I think that the crash was caused by the
mount not having a session, I think this patch
might avoid the crash.

Hopefully the reporter can test this patch?
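
As a rough illustration of the kind of check described in this comment (not the attached patch itself), a comparison against a mount's default session can be guarded so that a missing session is treated as "no match" instead of being dereferenced. All types and names below (`hypo_mount`, `hypo_session`, `hypo_is_default_session`) are hypothetical stand-ins, not the structures or functions in nfs_commonkrpc.c or nfs_commonsubs.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HYPO_SESSIONID_LEN 16

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct hypo_session {
	uint8_t hs_id[HYPO_SESSIONID_LEN];
};

struct hypo_mount {
	struct hypo_session *hm_defsess;	/* NULL when the mount has no session */
};

/*
 * Return true only when the mount has a default session and its id
 * matches `wire_id`.  Every pointer is checked before memcmp() runs.
 */
static bool
hypo_is_default_session(const struct hypo_mount *mnt, const uint8_t *wire_id)
{
	if (mnt == NULL || mnt->hm_defsess == NULL || wire_id == NULL)
		return (false);
	return (memcmp(mnt->hm_defsess->hs_id, wire_id,
	    HYPO_SESSIONID_LEN) == 0);
}
```

In this sketch a mount without a session simply fails the comparison, so the caller can skip any session-specific handling rather than fault.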
Comment 3 Joshua Kinard 2023-10-10 05:46:27 UTC
(In reply to Rick Macklem from comment #2)
> 
> Hopefully the reporter can test this patch?
I think this works!  Upon trying to unmount the remote share while the other system w/ the same hostid was doing work in it, the 14.0-BETA5 system had a ~0.5s "pause", then completed the unmount and returned to the prompt.  No crash.  Also no visible message output anywhere.  I'd suggest at least still printing the "Initiate recovery" message somewhere so that a user can be made aware that they still have a problem to fix.  But the crash appears to be avoided now.
Comment 4 commit-hook freebsd_committer freebsd_triage 2023-10-18 02:44:24 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=db7257ef972ed75e33929d39fd791d3699b53c63

commit db7257ef972ed75e33929d39fd791d3699b53c63
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-10-18 02:40:23 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-10-18 02:43:25 +0000

    nfsd: Fix a server crash

    PR#274346 reports a crash which appears to be caused by a NULL default session
    being destroyed.  This patch should avoid the crash.

    Tested by:      Joshua Kinard <freebsd@kumba.dev>
    PR:     274346
    MFC after:      2 weeks

 sys/fs/nfs/nfs_commonkrpc.c | 9 +++++++++
 sys/fs/nfs/nfs_commonsubs.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
Comment 5 Rick Macklem freebsd_committer freebsd_triage 2023-10-18 02:47:26 UTC
The patch has been committed and will be MFC'd.

I did add a printf() for the case where the NFSERR_BADSESSION
error is returned and the default session is NULL, as suggested
by the reporter.
Comment 6 commit-hook freebsd_committer freebsd_triage 2023-11-02 22:04:40 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=098273e649c647d5472d518c5023477ad15b7c3f

commit 098273e649c647d5472d518c5023477ad15b7c3f
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-10-18 02:40:23 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-11-02 22:02:22 +0000

    nfsd: Fix a server crash

    PR#274346 reports a crash which appears to be caused by a NULL default session
    being destroyed.  This patch should avoid the crash.

    PR:     274346

    (cherry picked from commit db7257ef972ed75e33929d39fd791d3699b53c63)

 sys/fs/nfs/nfs_commonkrpc.c | 9 +++++++++
 sys/fs/nfs/nfs_commonsubs.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2023-11-02 23:36:57 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=18d51c3c305f233a75fc64f8e5711306dd05a8fc

commit 18d51c3c305f233a75fc64f8e5711306dd05a8fc
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-10-18 02:40:23 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-11-02 23:35:25 +0000

    nfsd: Fix a server crash

    PR#274346 reports a crash which appears to be caused by a NULL default session
    being destroyed.  This patch should avoid the crash.

    PR:     274346

    (cherry picked from commit db7257ef972ed75e33929d39fd791d3699b53c63)

 sys/fs/nfs/nfs_commonkrpc.c | 9 +++++++++
 sys/fs/nfs/nfs_commonsubs.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2023-11-02 23:51:38 UTC
The patch has been MFC'd.