Bug 242768 - [panic] Fatal trap 12: current process (nfsd: service)
Summary: [panic] Fatal trap 12: current process (nfsd: service)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.1-STABLE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: Rick Macklem
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-21 14:28 UTC by iron.udjin
Modified: 2020-01-08 22:55 UTC (History)
2 users (show)

See Also:
rmacklem: mfc-stable12?
rmacklem: mfc-stable11?


Attachments
check for a non-NULL tvp returned by nfsvno_getvp() (437 bytes, patch)
2020-01-04 01:34 UTC, Rick Macklem
no flags Details | Diff
check for a non-NULL tvp returned by nfsvno_getvp() (494 bytes, patch)
2020-01-07 14:09 UTC, Rick Macklem
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description iron.udjin 2019-12-21 14:28:08 UTC
OS: 12.1-STABLE r355967

Dec 21 16:06:22 v1 kernel: Fatal trap 12: page fault while in kernel mode
Dec 21 16:06:22 v1 kernel: cpuid = 27; apic id = 23
Dec 21 16:06:22 v1 kernel: fault virtual address	= 0x8
Dec 21 16:06:22 v1 kernel: fault code		= supervisor read data, page not present
Dec 21 16:06:22 v1 kernel: instruction pointer	= 0x20:0xffffffff8045534c
Dec 21 16:06:22 v1 kernel: stack pointer	        = 0x0:0xfffffe0183ff6d80
Dec 21 16:06:22 v1 kernel: frame pointer	        = 0x0:0xfffffe0183ff6e10
Dec 21 16:06:22 v1 kernel: code segment		= base rx0, limit 0xfffff, type 0x1b
Dec 21 16:06:22 v1 kernel: 			= DPL 0, pres 1, long 1, def32 0, gran 1
Dec 21 16:06:22 v1 kernel: processor eflags	= interrupt enabled, resume, IOPL = 0
Dec 21 16:06:22 v1 kernel: current process		= 2694 (nfsd: service)
Dec 21 16:06:22 v1 kernel: trap number		= 12
Dec 21 16:06:22 v1 kernel: panic: page fault
Dec 21 16:06:22 v1 kernel: cpuid = 27
Dec 21 16:06:22 v1 kernel: time = 1576936619

On this server we share static files via NFS4 share.
Comment 1 Mark Johnston freebsd_committer 2020-01-02 14:19:46 UTC
Are you able to get a backtrace from the crash?
Comment 2 iron.udjin 2020-01-02 15:26:36 UTC
(In reply to Mark Johnston from comment #1)

Unfortunately no.

I guess panic occurred when I tried to remove directory on nfs4 server in time when other process on nfs client was trying to write files or append data to existing files in mounted share.
Comment 3 Mark Johnston freebsd_committer 2020-01-02 15:28:45 UTC
(In reply to iron.udjin from comment #2)
I think it will be difficult to resolve this without at least a stack trace.

If you are still using the same kernel, "addr2line -e /boot/kernel/kernel 0xffffffff8045534c" might give us a clue.
Comment 4 iron.udjin 2020-01-02 16:04:08 UTC
(In reply to Mark Johnston from comment #3)
Yes, I'm using the same kernel.

# addr2line -e /boot/kernel/kernel 0xffffffff8045534c
addr2line: dwarf_init: Debug info NULL [_dwarf_consumer_init(66)]
Comment 5 Mark Johnston freebsd_committer 2020-01-02 17:28:58 UTC
(In reply to iron.udjin from comment #4)
Are you building with WITHOUT_KERNEL_SYMBOLS set?

You can try finding the function where we crashed by running objdump -d /boot/kernel/kernel and finding the symbol enclosing the address 0xffffffff8045534c.
Comment 6 iron.udjin 2020-01-02 18:18:51 UTC
(In reply to Mark Johnston from comment #5)
I don't have WITHOUT_KERNEL_SYMBOLS=YES in my /etc/src.conf

Here is objdump -d:

ffffffff80455310:       0f 84 d6 01 00 00       je     ffffffff804554ec <nfsrv_freelockowner+0x2ec>
ffffffff80455316:       41 be 01 00 00 00       mov    $0x1,%r14d
ffffffff8045531c:       80 7d d7 00             cmpb   $0x0,-0x29(%rbp)
ffffffff80455320:       0f 84 ce 01 00 00       je     ffffffff804554f4 <nfsrv_freelockowner+0x2f4>
ffffffff80455326:       49 8d 7d 38             lea    0x38(%r13),%rdi
ffffffff8045532a:       e8 51 bf 01 00          callq  ffffffff80471280 <nfsvno_getvp>
ffffffff8045532f:       48 c7 85 70 ff ff ff    movq   $0xffffffff80ab52f8,-0x90(%rbp)
ffffffff80455336:       f8 52 ab 80
ffffffff8045533a:       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
ffffffff80455341:       c7 45 80 00 00 00 00    movl   $0x0,-0x80(%rbp)
ffffffff80455348:       48 89 45 c8             mov    %rax,-0x38(%rbp)
ffffffff8045534c:       48 8b 78 08             mov    0x8(%rax),%rdi
ffffffff80455350:       48 8d b5 70 ff ff ff    lea    -0x90(%rbp),%rsi
ffffffff80455357:       e8 94 0c 34 00          callq  ffffffff80795ff0 <VOP_UNLOCK_APV>
ffffffff8045535c:       48 83 7d c8 00          cmpq   $0x0,-0x38(%rbp)
ffffffff80455361:       0f 85 4e ff ff ff       jne    ffffffff804552b5 <nfsrv_freelockowner+0xb5>
ffffffff80455367:       e9 a4 01 00 00          jmpq   ffffffff80455510 <nfsrv_freelockowner+0x310>
ffffffff8045536c:       31 c9                   xor    %ecx,%ecx
ffffffff8045536e:       48 89 08                mov    %rcx,(%rax)
ffffffff80455371:       48 83 04 25 40 e6 cd    addq   $0xffffffffffffffff,0xffffffff80cde640
ffffffff80455378:       80 ff
ffffffff8045537a:       83 04 25 d0 20 c5 80    addl   $0xffffffffffffffff,0xffffffff80c520d0
ffffffff80455381:       ff
ffffffff80455382:       48 8b 03                mov    (%rbx),%rax
Comment 7 Mark Johnston freebsd_committer 2020-01-03 22:20:22 UTC
It looks like we crashed while dereferencing the vnode returned by nfsvno_getvp() in nfsrv_freeallnfslocks().  Since cansleep == 1 we must have been called via nfsrv_freeopen().  I don't know the NFS stack well enough to say anything else of use, but Rick (cc'ed) might have some idea.  I suspect that you will need to collect a kernel dump in order to make much progress.
Comment 8 iron.udjin 2020-01-03 23:25:12 UTC
Currently I've set to stop all services before cleanup cache folder which is shared by NFS and didn't expiriance panic yet. In case of panic I'll rebuild kernel with debug things enabled and will collect a kernel dump.
Comment 9 Rick Macklem freebsd_committer 2020-01-04 01:04:12 UTC
Well, if you look at nfsvno_getvp(), it simply makes the file system
busy and then does a VFS_FHTOVP(..LK_EXCLUSIVE..);
--> It seems that the file system doesn't do the above VFS call correctly,
    since it then crashes doing a VOP_UNLOCK() on it, after it returning
    success for the call.

What kind of file system is exported?
Comment 10 iron.udjin 2020-01-04 01:16:13 UTC
ZFS dataset
Comment 11 Rick Macklem freebsd_committer 2020-01-04 01:34:00 UTC
Created attachment 210429 [details]
check for a non-NULL tvp returned by nfsvno_getvp()

Duh. It's the obvious ones that are hard to spot.
I took another look and the bug is obvious.
nfsvno_getvp() returns NULL if it can't get the
vnode and this code didn't check for non-NULL
before doing the unlock.

The attached patch should avoid the crash.
Comment 12 Rick Macklem freebsd_committer 2020-01-04 01:36:03 UTC
I'll take this, since the fix is obvious.
I won't be able to commit/MFC this for a while.

If iron.udjin can test the patch, that would
be appreciated.
Comment 13 iron.udjin 2020-01-04 03:33:50 UTC
(In reply to Rick Macklem from comment #12)
Yes, I will test it. But only if I have a luck to reproduce panic.
Comment 14 Rick Macklem freebsd_committer 2020-01-04 14:14:28 UTC
Sure. There is a fairly narrow window between when the
vnode is unlocked and the nfsvno_getvp() call is done,
where another thread (typically for another NFS client)
would do a remove of the file.

So long as the patch doesn't cause a regression (I
do not think it will), it should be fine.
Comment 15 iron.udjin 2020-01-07 11:59:08 UTC
Hello,

I've got a panic again with the same instruction pointer. Could you please add some code for this patch for print some dmesg message which would indicate that the same situation happened and correctly handled? It will make us sure that the patch solves discussed issue.

Thank you.
Comment 16 Rick Macklem freebsd_committer 2020-01-07 14:09:46 UTC
Created attachment 210503 [details]
check for a non-NULL tvp returned by nfsvno_getvp()

This version of the patch will print out a message
(console etc) when the condition that would have
caused the crash without this patch occurs.
Comment 17 iron.udjin 2020-01-08 14:34:41 UTC
(In reply to Rick Macklem from comment #16)

Works!

Jan  8 14:53:26 v1 kernel: Would panic without patch!

Please commit this patch.

Thank you!
Comment 18 Rick Macklem freebsd_committer 2020-01-08 22:51:40 UTC
Thanks for testing it. I'll be able to commit this in a couple of weeks.
(Unless you ask me not to, I'll list you as "Tested by:" in the
commit.)
Comment 19 iron.udjin 2020-01-08 22:55:33 UTC
Ok, let's test it for a few weeks. I'll let you know in case something will be wrong.