OS: 12.1-STABLE r355967 Dec 21 16:06:22 v1 kernel: Fatal trap 12: page fault while in kernel mode Dec 21 16:06:22 v1 kernel: cpuid = 27; apic id = 23 Dec 21 16:06:22 v1 kernel: fault virtual address = 0x8 Dec 21 16:06:22 v1 kernel: fault code = supervisor read data, page not present Dec 21 16:06:22 v1 kernel: instruction pointer = 0x20:0xffffffff8045534c Dec 21 16:06:22 v1 kernel: stack pointer = 0x0:0xfffffe0183ff6d80 Dec 21 16:06:22 v1 kernel: frame pointer = 0x0:0xfffffe0183ff6e10 Dec 21 16:06:22 v1 kernel: code segment = base rx0, limit 0xfffff, type 0x1b Dec 21 16:06:22 v1 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Dec 21 16:06:22 v1 kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Dec 21 16:06:22 v1 kernel: current process = 2694 (nfsd: service) Dec 21 16:06:22 v1 kernel: trap number = 12 Dec 21 16:06:22 v1 kernel: panic: page fault Dec 21 16:06:22 v1 kernel: cpuid = 27 Dec 21 16:06:22 v1 kernel: time = 1576936619 On this server we share static files via NFS4 share.
Are you able to get a backtrace from the crash?
(In reply to Mark Johnston from comment #1) Unfortunately no. I guess panic occurred when I tried to remove directory on nfs4 server in time when other process on nfs client was trying to write files or append data to existing files in mounted share.
(In reply to iron.udjin from comment #2) I think it will be difficult to resolve this without at least a stack trace. If you are still using the same kernel, "addr2line -e /boot/kernel/kernel 0xffffffff8045534c" might give us a clue.
(In reply to Mark Johnston from comment #3) Yes, I'm using the same kernel. # addr2line -e /boot/kernel/kernel 0xffffffff8045534c addr2line: dwarf_init: Debug info NULL [_dwarf_consumer_init(66)]
(In reply to iron.udjin from comment #4) Are you building with WITHOUT_KERNEL_SYMBOLS set? You can try finding the function where we crashed by running objdump -d /boot/kernel/kernel and finding the symbol enclosing the address 0xffffffff8045534c.
(In reply to Mark Johnston from comment #5) I don't have WITHOUT_KERNEL_SYMBOLS=YES in my /etc/src.conf Here is objdump -d: ffffffff80455310: 0f 84 d6 01 00 00 je ffffffff804554ec <nfsrv_freelockowner+0x2ec> ffffffff80455316: 41 be 01 00 00 00 mov $0x1,%r14d ffffffff8045531c: 80 7d d7 00 cmpb $0x0,-0x29(%rbp) ffffffff80455320: 0f 84 ce 01 00 00 je ffffffff804554f4 <nfsrv_freelockowner+0x2f4> ffffffff80455326: 49 8d 7d 38 lea 0x38(%r13),%rdi ffffffff8045532a: e8 51 bf 01 00 callq ffffffff80471280 <nfsvno_getvp> ffffffff8045532f: 48 c7 85 70 ff ff ff movq $0xffffffff80ab52f8,-0x90(%rbp) ffffffff80455336: f8 52 ab 80 ffffffff8045533a: 48 89 85 78 ff ff ff mov %rax,-0x88(%rbp) ffffffff80455341: c7 45 80 00 00 00 00 movl $0x0,-0x80(%rbp) ffffffff80455348: 48 89 45 c8 mov %rax,-0x38(%rbp) ffffffff8045534c: 48 8b 78 08 mov 0x8(%rax),%rdi ffffffff80455350: 48 8d b5 70 ff ff ff lea -0x90(%rbp),%rsi ffffffff80455357: e8 94 0c 34 00 callq ffffffff80795ff0 <VOP_UNLOCK_APV> ffffffff8045535c: 48 83 7d c8 00 cmpq $0x0,-0x38(%rbp) ffffffff80455361: 0f 85 4e ff ff ff jne ffffffff804552b5 <nfsrv_freelockowner+0xb5> ffffffff80455367: e9 a4 01 00 00 jmpq ffffffff80455510 <nfsrv_freelockowner+0x310> ffffffff8045536c: 31 c9 xor %ecx,%ecx ffffffff8045536e: 48 89 08 mov %rcx,(%rax) ffffffff80455371: 48 83 04 25 40 e6 cd addq $0xffffffffffffffff,0xffffffff80cde640 ffffffff80455378: 80 ff ffffffff8045537a: 83 04 25 d0 20 c5 80 addl $0xffffffffffffffff,0xffffffff80c520d0 ffffffff80455381: ff ffffffff80455382: 48 8b 03 mov (%rbx),%rax
It looks like we crashed while dereferencing the vnode returned by nfsvno_getvp() in nfsrv_freeallnfslocks(). Since cansleep == 1 we must have been called via nfsrv_freeopen(). I don't know the NFS stack well enough to say anything else of use, but Rick (cc'ed) might have some idea. I suspect that you will need to collect a kernel dump in order to make much progress.
Currently I've set to stop all services before cleanup cache folder which is shared by NFS and didn't expiriance panic yet. In case of panic I'll rebuild kernel with debug things enabled and will collect a kernel dump.
Well, if you look at nfsvno_getvp(), it simply makes the file system busy and then does a VFS_FHTOVP(..LK_EXCLUSIVE..); --> It seems that the file system doesn't do the above VFS call correctly, since it then crashes doing a VOP_UNLOCK() on it, after it returning success for the call. What kind of file system is exported?
ZFS dataset
Created attachment 210429 [details] check for a non-NULL tvp returned by nfsvno_getvp() Duh. It's the obvious ones that are hard to spot. I took another look and the bug is obvious. nfsvno_getvp() returns NULL if it can't get the vnode and this code didn't check for non-NULL before doing the unlock. The attached patch should avoid the crash.
I'll take this, since the fix is obvious. I won't be able to commit/MFC this for a while. If iron.udjin can test the patch, that would be appreciated.
(In reply to Rick Macklem from comment #12) Yes, I will test it. But only if I have a luck to reproduce panic.
Sure. There is a fairly narrow window between when the vnode is unlocked and the nfsvno_getvp() call is done, where another thread (typically for another NFS client) would do a remove of the file. So long as the patch doesn't cause a regression (I do not think it will), it should be fine.
Hello, I've got a panic again with the same instruction pointer. Could you please add some code for this patch for print some dmesg message which would indicate that the same situation happened and correctly handled? It will make us sure that the patch solves discussed issue. Thank you.
Created attachment 210503 [details] check for a non-NULL tvp returned by nfsvno_getvp() This version of the patch will print out a message (console etc) when the condition that would have caused the crash without this patch occurs.
(In reply to Rick Macklem from comment #16) Works! Jan 8 14:53:26 v1 kernel: Would panic without patch! Please commit this patch. Thank you!
Thanks for testing it. I'll be able to commit this in a couple of weeks. (Unless you ask me not to, I'll list you as "Tested by:" in the commit.)
Ok, let's test it for a few weeks. I'll let you know in case something will be wrong.
A commit references this bug: Author: rmacklem Date: Sun Jan 26 17:59:06 UTC 2020 New revision: 357149 URL: https://svnweb.freebsd.org/changeset/base/357149 Log: Fix a crash in the NFSv4 server. The PR reported a crash that occurred when a file was removed while client(s) were actively doing lock operations on it. Since nfsvno_getvp() will return NULL when the file does not exist, the bug was obvious and easy to fix via this patch. It is a little surprising that this wasn't found sooner, but I guess the above case rarely occurs. Tested by: iron.udjin@gmail.com PR: 242768 Reported by: iron.udjin@gmail.com MFC after: 2 weeks Changes: head/sys/fs/nfsserver/nfs_nfsdstate.c
Patch has been committed to head. I will close this once patch is MFC'd.
A commit references this bug: Author: rmacklem Date: Mon Feb 17 19:31:35 UTC 2020 New revision: 358035 URL: https://svnweb.freebsd.org/changeset/base/358035 Log: MFC: r357149 Fix a crash in the NFSv4 server. The PR reported a crash that occurred when a file was removed while client(s) were actively doing lock operations on it. Since nfsvno_getvp() will return NULL when the file does not exist, the bug was obvious and easy to fix via this patch. It is a little surprising that this wasn't found sooner, but I guess the above case rarely occurs. PR: 242768 Changes: _U stable/10/ stable/10/sys/fs/nfsserver/nfs_nfsdstate.c
A commit references this bug: Author: rmacklem Date: Mon Feb 17 19:32:54 UTC 2020 New revision: 358036 URL: https://svnweb.freebsd.org/changeset/base/358036 Log: MFC: r357149 Fix a crash in the NFSv4 server. The PR reported a crash that occurred when a file was removed while client(s) were actively doing lock operations on it. Since nfsvno_getvp() will return NULL when the file does not exist, the bug was obvious and easy to fix via this patch. It is a little surprising that this wasn't found sooner, but I guess the above case rarely occurs. PR: 242768 Changes: _U stable/12/ stable/12/sys/fs/nfsserver/nfs_nfsdstate.c
A commit references this bug: Author: rmacklem Date: Mon Feb 17 19:40:26 UTC 2020 New revision: 358037 URL: https://svnweb.freebsd.org/changeset/base/358037 Log: MFC: r357149 Fix a crash in the NFSv4 server. The PR reported a crash that occurred when a file was removed while client(s) were actively doing lock operations on it. Since nfsvno_getvp() will return NULL when the file does not exist, the bug was obvious and easy to fix via this patch. It is a little surprising that this wasn't found sooner, but I guess the above case rarely occurs. PR: 242768 Changes: _U stable/11/ stable/11/sys/fs/nfsserver/nfs_nfsdstate.c
Patch has been MFC'd.