So I have managed to trigger a kernel panic in the NFS subsystem on 14.0-BETA5, though it is partly due to my own mistake: I did not change the hostid of a system that is a clone of another active system. The first system is running 13.2-RELEASE-p4 and the cloned system is running 14.0-BETA5, newly upgraded.

There are several elements that lead to this panic:
- Both systems have the same hostid
- Both systems mount the same remote NFS share from a third system
- Have the 13.2-RELEASE-p4 system start doing a job on the remote share, like compiling code (e.g., /usr/ports is on this share)
- Have the cloned system running 14.0-BETA5 attempt to unmount the remote share
- The 14.0-BETA5 system will crash

I know it's due to duplicate hostids, because the below message is printed on the console immediately before the kernel crashes:

> Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's

The printf() for that exact string is in the crashing function right where GDB says the crash happens: nfs_commonkrpc.c, function newnfs_request(), line 1212. I'm just not sure whether the fault is in the if statement immediately preceding the printf() call or in the if statement that follows it. The next call in the machine code is memcmp(), so I am assuming a NULL dereference of some kind.

My kernel is a custom build, but this can be triggered on a GENERIC kernel as well: my first crash happened on GENERIC, right before I was set to reboot into my rebuilt custom kernel after doing the second `freebsd-update install` phase of the upgrade to 14.0-BETA5. At that time, I had crashdumps disabled.
So the below crash info is from that custom kernel, after I enabled crashdumps and re-triggered the crash (it's at least reproducible...):

> Unread portion of the kernel message buffer:
> [179]
> [179]
> [179] Fatal trap 12: page fault while in kernel mode
> [179] cpuid = 0; apic id = 00
> [179] fault virtual address = 0x4
> [179] fault code = supervisor read data, page not present
> [179] instruction pointer = 0x20:0xffffffff809e9893
> [179] stack pointer = 0x28:0xfffffe00a233e800
> [179] frame pointer = 0x28:0xfffffe00a233e800
> [179] code segment = base rx0, limit 0xfffff, type 0x1b
> [179] = DPL 0, pres 1, long 1, def32 0, gran 1
> [179] processor eflags = interrupt enabled, resume, IOPL = 0
> [179] current process = 87256 (umount)
> [179] rdi: fffff800077761e4 rsi: 0000000000000004 rdx: 0000000000000010
> [179] rcx: 0000000000000000 r8: 0000000000000024 r9: fffffe00a233f000
> [179] rax: 0000000000000000 rbx: fffffe00a251b020 rbp: fffffe00a233e800
> [179] r10: 0000000000000585 r11: 000000007ff9687f r12: fffff80007776010
> [180] r13: fffff80003abb800 r14: fffffe00a233ea18 r15: fffff80007776000
> [180] trap number = 12
> [180] panic: page fault
> [180] cpuid = 0
> [180] time = 1696723338
> [180] KDB: stack backtrace:
> [180] #0 0xffffffff806b5edd at kdb_backtrace+0x5d
> [180] #1 0xffffffff8066aa20 at vpanic+0x130
> [180] #2 0xffffffff8066a8e3 at panic+0x43
> [180] #3 0xffffffff809ee34c at trap_fatal+0x40c
> [180] #4 0xffffffff809ee39e at trap_pfault+0x4e
> [180] #5 0xffffffff809c6288 at calltrap+0x8
> [180] #6 0xffffffff8053f804 at newnfs_request+0x10a4
> [180] #7 0xffffffff8054dbad at nfsrpc_destroysession+0x11d
> [180] #8 0xffffffff80557252 at nfscl_umount+0x312
> [180] #9 0xffffffff80589470 at nfs_unmount+0x70
> [180] #10 0xffffffff8073c4ad at vfs_unmount_sigdefer+0x2d
> [180] #11 0xffffffff80741e37 at dounmount+0x787
> [180] #12 0xffffffff80741645 at kern_unmount+0x2f5
> [180] #13 0xffffffff809eeaf9 at amd64_syscall+0x109
> [180] #14 0xffffffff809c6b9b at fast_syscall_common+0xf8
> [180] Timeout initializing vt_vga
> [180] Uptime: 3m0s
> [180] Dumping 447 out of 8077 MB:..4%..11%..22%..33%..43%..51%..61%..72%..83%..93%
>
> __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
> 57 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
> (kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
> #1 doadump (textdump=<optimized out>) at ../../../kern/kern_shutdown.c:405
> #2 0xffffffff8066a5b7 in kern_reboot (howto=260)
>     at ../../../kern/kern_shutdown.c:526
> #3 0xffffffff8066aa8d in vpanic (fmt=0xffffffff80a3bcd1 "%s",
>     ap=ap@entry=0xfffffe00a233e680) at ../../../kern/kern_shutdown.c:970
> #4 0xffffffff8066a8e3 in panic (fmt=<unavailable>)
>     at ../../../kern/kern_shutdown.c:894
> #5 0xffffffff809ee34c in trap_fatal (frame=0xfffffe00a233e740, eva=4)
>     at ../../../amd64/amd64/trap.c:952
> #6 0xffffffff809ee39e in trap_pfault (frame=0xfffffe00a233e740,
>     usermode=false, signo=<optimized out>, ucode=<optimized out>)
>     at ../../../amd64/amd64/trap.c:760
> #7 <signal handler called>
> #8 memcmp () at ../../../amd64/amd64/support.S:115
> #9 0xffffffff8053f804 in newnfs_request (nd=nd@entry=0xfffffe00a233ea18,
>     nmp=nmp@entry=0xfffff80003abb800, clp=clp@entry=0x0,
>     nrp=nrp@entry=0xfffff80003abbcd8, vp=vp@entry=0x0,
>     td=td@entry=0xfffffe00a251b020, cred=0xfffff8000765aa00, prog=100003,
>     vers=4, retsum=0x0, toplevel=1, xidp=0x0, dssep=0x0)
>     at ../../../fs/nfs/nfs_commonkrpc.c:1212
> #10 0xffffffff8054dbad in nfsrpc_destroysession (
>     nmp=nmp@entry=0xfffff80003abb800, tsep=0xfffff80007776010,
>     tsep@entry=0x0, cred=cred@entry=0xfffff8000765aa00,
>     p=p@entry=0xfffffe00a251b020) at ../../../fs/nfs/nfs_commonsubs.c:5151
> #11 0xffffffff80557252 in nfscl_umount (nmp=nmp@entry=0xfffff80003abb800,
>     p=p@entry=0xfffffe00a251b020, dhp=dhp@entry=0x0)
>     at ../../../fs/nfsclient/nfs_clstate.c:2094
> #12 0xffffffff80589470 in nfs_unmount (mp=0xfffffe00a4058000,
>     mntflags=<optimized out>) at ../../../fs/nfsclient/nfs_clvfsops.c:1903
> #13 0xffffffff8073c4ad in vfs_unmount_sigdefer (mp=0xfffffe00a4058000,
>     mntflags=134217728) at ../../../kern/vfs_init.c:185
> #14 0xffffffff80741e37 in dounmount (mp=0xfffff800077761e4,
>     mp@entry=0xfffffe00a4058000, flags=flags@entry=134217728,
>     td=td@entry=0xfffffe00a251b020) at ../../../kern/vfs_mount.c:2327
> #15 0xffffffff80741645 in kern_unmount (td=0xfffffe00a251b020,
>     path=<optimized out>, flags=134217728) at ../../../kern/vfs_mount.c:1785
> #16 0xffffffff809eeaf9 in syscallenter (td=0xfffffe00a251b020)
>     at ../../../amd64/amd64/../../kern/subr_syscall.c:187
> #17 amd64_syscall (td=0xfffffe00a251b020, traced=0)
>     at ../../../amd64/amd64/trap.c:1197
> #18 <signal handler called>
> #19 0x0000244bc41489ba in ?? ()
> Backtrace stopped: Cannot access memory at address 0x244bc20f4c18
> (kgdb)
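One reading of the dump that fits the reporter's NULL-dereference guess: the fault virtual address is 0x4, rsi (the second memcmp() argument) is also 0x4, and rdx (the length) is 0x10. A tiny address like that is what you get when code takes the address of a field inside a structure whose base pointer is NULL. The sketch below only illustrates that arithmetic. The `demo_session` layout and all `demo_*` names are hypothetical and are not the real kernel structures; the field offset of 4 was chosen to match the dump, not taken from the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SESSIONID_LEN 16	/* same value as rdx = 0x10 in the dump */

/* Hypothetical layout for illustration only, NOT the kernel's struct. */
struct demo_session {
	uint32_t flags;					/* offsets 0..3 */
	uint8_t  sessionid[DEMO_SESSIONID_LEN];	/* starts at offset 4 */
};

/*
 * Compute the address that an access to the sessionid field would touch,
 * given the address of the enclosing structure.
 */
static uintptr_t
demo_field_address(uintptr_t base)
{
	return (base + offsetof(struct demo_session, sessionid));
}

/* With a NULL (zero) base, the access lands at the field offset itself. */
static void
demo_selfcheck(void)
{
	assert(demo_field_address(0) == 0x4);
}
```

Under these assumptions, passing such a field address to memcmp() would fault on the first read, which is where the backtrace places the trap.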
^Triage: make former assignee fs@ a CC recipient.
Created attachment 245539
add checks for no valid session

This patch adds checks in two places for the case where there is no session on the mount. Since I think that the crash was caused by the mount not having a session, I think this patch might avoid the crash.

Hopefully the reporter can test this patch?
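The general shape of a "check for no valid session" guard can be sketched in plain C as below. This is not the attached patch and does not use the kernel's types: `demo_mount`, `demo_session`, `demo_session_matches()` and the choice of `-ENOENT` as the "no session" result are all hypothetical, picked only to show the pattern of testing the session pointer before comparing session IDs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SESSIONID_LEN 16

/* Hypothetical stand-ins for a session and a mount, for illustration. */
struct demo_session {
	unsigned char sessionid[DEMO_SESSIONID_LEN];
};

struct demo_mount {
	struct demo_session *default_session;	/* NULL when no session */
};

/*
 * Compare the mount's default session ID with the one from a reply.
 * Returns 1 on match, 0 on mismatch, and -ENOENT when there is no
 * session to compare against, instead of dereferencing NULL.
 */
static int
demo_session_matches(const struct demo_mount *mp,
    const unsigned char *reply_id)
{
	if (mp == NULL || mp->default_session == NULL || reply_id == NULL)
		return (-ENOENT);
	return (memcmp(mp->default_session->sessionid, reply_id,
	    DEMO_SESSIONID_LEN) == 0);
}
```

A caller would treat the negative return as "nothing to recover or destroy" and skip the comparison path entirely.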
(In reply to Rick Macklem from comment #2)
> Hopefully the reporter can test this patch?

I think this works! Upon trying to unmount the remote share while the other system w/ the same hostid was doing work in it, the 14.0-BETA5 system had a ~0.5s "pause", then completed the unmount and returned to the prompt. No crash. Also no visible message output anywhere.

I'd suggest at least still printing the "Initiate recovery" message somewhere so that a user can be made aware that they still have a problem to fix. But the crash appears to be avoided now.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=db7257ef972ed75e33929d39fd791d3699b53c63

commit db7257ef972ed75e33929d39fd791d3699b53c63
Author: Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-10-18 02:40:23 +0000
Commit: Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-10-18 02:43:25 +0000

    nfsd: Fix a server crash

    PR#274346 reports a crash which appears to be caused
    by a NULL default session being destroyed. This patch
    should avoid the crash.

    Tested by: Joshua Kinard <freebsd@kumba.dev>
    PR: 274346
    MFC after: 2 weeks

 sys/fs/nfs/nfs_commonkrpc.c | 9 +++++++++
 sys/fs/nfs/nfs_commonsubs.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
The patch has been committed and will be MFC'd. I did add a printf() for the case where the NFSERR_BADSESSION error is returned and the default session is NULL, as suggested by the reporter.
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=098273e649c647d5472d518c5023477ad15b7c3f

commit 098273e649c647d5472d518c5023477ad15b7c3f
Author: Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-10-18 02:40:23 +0000
Commit: Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-11-02 22:02:22 +0000

    nfsd: Fix a server crash

    PR#274346 reports a crash which appears to be caused
    by a NULL default session being destroyed. This patch
    should avoid the crash.

    PR: 274346

    (cherry picked from commit db7257ef972ed75e33929d39fd791d3699b53c63)

 sys/fs/nfs/nfs_commonkrpc.c | 9 +++++++++
 sys/fs/nfs/nfs_commonsubs.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=18d51c3c305f233a75fc64f8e5711306dd05a8fc

commit 18d51c3c305f233a75fc64f8e5711306dd05a8fc
Author: Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-10-18 02:40:23 +0000
Commit: Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-11-02 23:35:25 +0000

    nfsd: Fix a server crash

    PR#274346 reports a crash which appears to be caused
    by a NULL default session being destroyed. This patch
    should avoid the crash.

    PR: 274346

    (cherry picked from commit db7257ef972ed75e33929d39fd791d3699b53c63)

 sys/fs/nfs/nfs_commonkrpc.c | 9 +++++++++
 sys/fs/nfs/nfs_commonsubs.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
The patch has been MFC'd.