OS: 12.1-STABLE r355967
Dec 21 16:06:22 v1 kernel: Fatal trap 12: page fault while in kernel mode
Dec 21 16:06:22 v1 kernel: cpuid = 27; apic id = 23
Dec 21 16:06:22 v1 kernel: fault virtual address = 0x8
Dec 21 16:06:22 v1 kernel: fault code = supervisor read data, page not present
Dec 21 16:06:22 v1 kernel: instruction pointer = 0x20:0xffffffff8045534c
Dec 21 16:06:22 v1 kernel: stack pointer = 0x0:0xfffffe0183ff6d80
Dec 21 16:06:22 v1 kernel: frame pointer = 0x0:0xfffffe0183ff6e10
Dec 21 16:06:22 v1 kernel: code segment = base rx0, limit 0xfffff, type 0x1b
Dec 21 16:06:22 v1 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 21 16:06:22 v1 kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Dec 21 16:06:22 v1 kernel: current process = 2694 (nfsd: service)
Dec 21 16:06:22 v1 kernel: trap number = 12
Dec 21 16:06:22 v1 kernel: panic: page fault
Dec 21 16:06:22 v1 kernel: cpuid = 27
Dec 21 16:06:22 v1 kernel: time = 1576936619
On this server we share static files via NFS4 share.
Are you able to get a backtrace from the crash?
(In reply to Mark Johnston from comment #1)
I guess panic occurred when I tried to remove directory on nfs4 server in time when other process on nfs client was trying to write files or append data to existing files in mounted share.
(In reply to iron.udjin from comment #2)
I think it will be difficult to resolve this without at least a stack trace.
If you are still using the same kernel, "addr2line -e /boot/kernel/kernel 0xffffffff8045534c" might give us a clue.
(In reply to Mark Johnston from comment #3)
Yes, I'm using the same kernel.
# addr2line -e /boot/kernel/kernel 0xffffffff8045534c
addr2line: dwarf_init: Debug info NULL [_dwarf_consumer_init(66)]
(In reply to iron.udjin from comment #4)
Are you building with WITHOUT_KERNEL_SYMBOLS set?
You can try finding the function where we crashed by running objdump -d /boot/kernel/kernel and finding the symbol enclosing the address 0xffffffff8045534c.
(In reply to Mark Johnston from comment #5)
I don't have WITHOUT_KERNEL_SYMBOLS=YES in my /etc/src.conf
Here is objdump -d:
ffffffff80455310: 0f 84 d6 01 00 00 je ffffffff804554ec <nfsrv_freelockowner+0x2ec>
ffffffff80455316: 41 be 01 00 00 00 mov $0x1,%r14d
ffffffff8045531c: 80 7d d7 00 cmpb $0x0,-0x29(%rbp)
ffffffff80455320: 0f 84 ce 01 00 00 je ffffffff804554f4 <nfsrv_freelockowner+0x2f4>
ffffffff80455326: 49 8d 7d 38 lea 0x38(%r13),%rdi
ffffffff8045532a: e8 51 bf 01 00 callq ffffffff80471280 <nfsvno_getvp>
ffffffff8045532f: 48 c7 85 70 ff ff ff movq $0xffffffff80ab52f8,-0x90(%rbp)
ffffffff80455336: f8 52 ab 80
ffffffff8045533a: 48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
ffffffff80455341: c7 45 80 00 00 00 00 movl $0x0,-0x80(%rbp)
ffffffff80455348: 48 89 45 c8 mov %rax,-0x38(%rbp)
ffffffff8045534c: 48 8b 78 08 mov 0x8(%rax),%rdi
ffffffff80455350: 48 8d b5 70 ff ff ff lea -0x90(%rbp),%rsi
ffffffff80455357: e8 94 0c 34 00 callq ffffffff80795ff0 <VOP_UNLOCK_APV>
ffffffff8045535c: 48 83 7d c8 00 cmpq $0x0,-0x38(%rbp)
ffffffff80455361: 0f 85 4e ff ff ff jne ffffffff804552b5 <nfsrv_freelockowner+0xb5>
ffffffff80455367: e9 a4 01 00 00 jmpq ffffffff80455510 <nfsrv_freelockowner+0x310>
ffffffff8045536c: 31 c9 xor %ecx,%ecx
ffffffff8045536e: 48 89 08 mov %rcx,(%rax)
ffffffff80455371: 48 83 04 25 40 e6 cd addq $0xffffffffffffffff,0xffffffff80cde640
ffffffff80455378: 80 ff
ffffffff8045537a: 83 04 25 d0 20 c5 80 addl $0xffffffffffffffff,0xffffffff80c520d0
ffffffff80455382: 48 8b 03 mov (%rbx),%rax
It looks like we crashed while dereferencing the vnode returned by nfsvno_getvp() in nfsrv_freeallnfslocks(). Since cansleep == 1 we must have been called via nfsrv_freeopen(). I don't know the NFS stack well enough to say anything else of use, but Rick (cc'ed) might have some idea. I suspect that you will need to collect a kernel dump in order to make much progress.
Currently I've set to stop all services before cleanup cache folder which is shared by NFS and didn't expiriance panic yet. In case of panic I'll rebuild kernel with debug things enabled and will collect a kernel dump.
Well, if you look at nfsvno_getvp(), it simply makes the file system
busy and then does a VFS_FHTOVP(..LK_EXCLUSIVE..);
--> It seems that the file system doesn't do the above VFS call correctly,
since it then crashes doing a VOP_UNLOCK() on it, after it returning
success for the call.
What kind of file system is exported?
Created attachment 210429 [details]
check for a non-NULL tvp returned by nfsvno_getvp()
Duh. It's the obvious ones that are hard to spot.
I took another look and the bug is obvious.
nfsvno_getvp() returns NULL if it can't get the
vnode and this code didn't check for non-NULL
before doing the unlock.
The attached patch should avoid the crash.
I'll take this, since the fix is obvious.
I won't be able to commit/MFC this for a while.
If iron.udjin can test the patch, that would
(In reply to Rick Macklem from comment #12)
Yes, I will test it. But only if I have a luck to reproduce panic.
Sure. There is a fairly narrow window between when the
vnode is unlocked and the nfsvno_getvp() call is done,
where another thread (typically for another NFS client)
would do a remove of the file.
So long as the patch doesn't cause a regression (I
do not think it will), it should be fine.
I've got a panic again with the same instruction pointer. Could you please add some code for this patch for print some dmesg message which would indicate that the same situation happened and correctly handled? It will make us sure that the patch solves discussed issue.
Created attachment 210503 [details]
check for a non-NULL tvp returned by nfsvno_getvp()
This version of the patch will print out a message
(console etc) when the condition that would have
caused the crash without this patch occurs.
(In reply to Rick Macklem from comment #16)
Jan 8 14:53:26 v1 kernel: Would panic without patch!
Please commit this patch.
Thanks for testing it. I'll be able to commit this in a couple of weeks.
(Unless you ask me not to, I'll list you as "Tested by:" in the
Ok, let's test it for a few weeks. I'll let you know in case something will be wrong.