Bug 206634

Summary: "panic: ncllock1" from FreeBSD client after NFSv4 server was taken offline and brought back to life; lots of spam about "protocol prob err=10006"
Product: Base System
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Status: Closed FIXED
Severity: Affects Only Me
Priority: ---
Reporter: Enji Cooper <ngie>
Assignee: Rick Macklem <rmacklem>
CC: kib, lev, rmacklem
Attachments:
- Panic screenshot from IPMI (flags: none)
- Patch to delete panic that no longer applies to current/head (flags: none)

Description Enji Cooper freebsd_committer freebsd_triage 2016-01-26 05:27:03 UTC
Created attachment 166130
Panic screenshot from IPMI

A FreeBSD 11.0-CURRENT NFSv4 client with a GENERIC-NODEBUG kernel panicked with `panic: ncllock1` after the NFSv4 server it was connected to was brought offline, then brought back to life.

I've attached some screenshots. Unfortunately I can't get a dump because my swap device is too small :(...

The last built kernel/world was from this github revision:

commit ec77f0bef381d18a7cb6847d3e0f02c0f4087f05
Author: imp <imp@FreeBSD.org>
Date:   Tue Jan 5 21:20:47 2016 +0000

    Use the more proper -f. Leave /bin/rm in place since that's what
    other rc scripts have, though it isn't strictly necessary.

Notes:
    svn path=/head/; revision=293227

The machine's still up -- please let me know if there's anything I can grab to help with debugging this issue.
Comment 1 rmacklem 2016-01-28 00:20:49 UTC
I will take a look at the panic someday, to see if it can
be avoided.
I will note that 10006 is NFSERR_SERVERFAULT, which is a "catch-all"
for "horrible bad things are happening that you want to avoid like
the plague".
(As such, a panic is probably a good thing to have happen.)
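
For anyone trying to read the "protocol prob err=NNNNN" console spam, a small decoder can help. This is only a sketch: the table is a partial subset of the NFSv4 status codes from RFC 7530 (where 10006 is spelled NFS4ERR_SERVERFAULT), and the helper name `decode_nfs_errors` and the regular expression are illustrative assumptions, not anything from the FreeBSD tree.

```python
import re

# Partial subset of NFSv4 status codes (nfsstat4) from RFC 7530.
# This is not the full list; unknown values are reported as such.
NFS4_STATUS = {
    10001: "NFS4ERR_BADHANDLE",
    10006: "NFS4ERR_SERVERFAULT",
    10008: "NFS4ERR_DELAY",
    10011: "NFS4ERR_EXPIRED",
    10013: "NFS4ERR_GRACE",
    10022: "NFS4ERR_STALE_CLIENTID",
    10023: "NFS4ERR_STALE_STATEID",
}

# Assumption: the number of interest follows "err=" in the log line.
ERR_RE = re.compile(r"err=(\d+)")


def decode_nfs_errors(log_text):
    """Return a symbolic name for each err=NNN value found in log_text."""
    names = []
    for value in ERR_RE.findall(log_text):
        names.append(NFS4_STATUS.get(int(value), f"unknown ({value})"))
    return names


if __name__ == "__main__":
    print(decode_nfs_errors("protocol prob err=10006"))
```

Feeding it saved console output gives a quick list of which status codes the client was seeing, which is easier to scan than raw numbers.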

NFSv4 servers are not like NFSv3 ones. They are stateful. As such,
you always need to unmount all client mounts before shutting down,
rebooting, or otherwise taking the server offline.
If the server crashes, then there is a recovery protocol after the
server reboots, but the server must crash/reboot to invoke this.
(If a crash/reboot happened, it is too late to figure out what went
 wrong, because you'd need a packet trace from the point where the
 server reboots to figure out what happened w.r.t. this recovery protocol.
 Btw, this recovery protocol is more complex than the entire NFSv3
 protocol, so I wouldn't be surprised if the code does have problems.)

You said "brought offline". If that didn't mean "reboot" when it came
back online, then you definitely would break any NFSv4 mounts badly.

The only exception would be a case where the server is offline (due to
network shutdown, killing/restarting the nfsd threads, or something
similar) for less than 1-2 minutes. The FreeBSD server uses a 2-minute
lease duration, and that is the upper bound on how long the server can
be unavailable without bad things happening.
The only use for this I can think of is:
- You are running a kernel without "options NFSD", so nfsd.ko is
  loaded as a module.
- You want to replace nfsd.ko with one that has been patched.
--> You could kill the nfsd threads, kldunload nfsd.ko, kldload the
    new nfsd.ko and restart the nfsd threads within 1 minute.

There probably isn't anywhere in the man pages that emphasizes that
you can't take an NFSv4 server "offline" without unmounting all the
client mounts and this should be fixed.
Comment 2 rmacklem 2016-01-28 00:22:32 UTC
Just to clarify, to make the recovery protocol happen, the
server must either be rebooted or the nfsd.ko must be unloaded
and reloaded. (A network partitioning or stopping/restarting
the nfsd threads will not start the recovery protocol.)
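
The rules described in comments 1 and 2 can be summarized as a small decision function. This is a simplified model for illustration only: the names `Outage`, `STARTS_RECOVERY`, and `expected_outcome` are hypothetical, the 120-second value is the 2-minute lease quoted above, and real behaviour depends on details (grace period, client retries) that this sketch ignores.

```python
from enum import Enum

# 2-minute lease duration quoted above for the FreeBSD server.
LEASE_SECONDS = 120


class Outage(Enum):
    REBOOT = "server crashed or rebooted"
    MODULE_RELOAD = "nfsd.ko unloaded and reloaded"
    NFSD_RESTART = "nfsd threads stopped and restarted"
    NETWORK_PARTITION = "network partition"


# Events that, per the comments above, start the recovery protocol.
STARTS_RECOVERY = {Outage.REBOOT, Outage.MODULE_RELOAD}


def expected_outcome(kind, seconds_unavailable, lease=LEASE_SECONDS):
    """Classify an outage according to the simplified rules above."""
    if kind in STARTS_RECOVERY:
        return "recovery protocol runs"
    if seconds_unavailable < lease:
        return "lease still valid, state survives"
    return "lease expired without recovery, mounts likely broken"


if __name__ == "__main__":
    for kind, secs in [(Outage.REBOOT, 600),
                       (Outage.NFSD_RESTART, 45),
                       (Outage.NETWORK_PARTITION, 300)]:
        print(f"{kind.value}, {secs}s: {expected_outcome(kind, secs)}")
```

The point of the model is the asymmetry: a long outage is survivable only if it is one of the events that starts recovery, while a partition or nfsd restart is safe only if it stays under the lease duration.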
Comment 3 Lev A. Serebryakov freebsd_committer freebsd_triage 2016-02-19 17:32:59 UTC
I got the same panic on a FreeBSD build from r295696, in VirtualBox. My NFS server was online, and the panic occurred while sources were being updated from my local SVN mirror via NFS (file:// protocol).

I have a core dump. The backtrace is:

#0  doadump (textdump=0) at pcpu.h:221
#1  0xffffffff802f230b in db_dump (dummy=<value optimized out>, dummy2=false, dummy3=0, dummy4=0x0) at /data/src/sys/ddb/db_command.c:533
#2  0xffffffff802f20fe in db_command (cmd_table=0x0) at /data/src/sys/ddb/db_command.c:440
#3  0xffffffff802f1e94 in db_command_loop () at /data/src/sys/ddb/db_command.c:493
#4  0xffffffff802f492b in db_trap (type=<value optimized out>, code=0) at /data/src/sys/ddb/db_main.c:251
#5  0xffffffff804a7c43 in kdb_trap (type=3, code=0, tf=<value optimized out>) at /data/src/sys/kern/subr_kdb.c:654
#6  0xffffffff806e0510 in trap (frame=0xfffffe023b618210) at /data/src/sys/amd64/amd64/trap.c:556
#7  0xffffffff806c21a7 in calltrap () at /data/src/sys/amd64/amd64/exception.S:234
#8  0xffffffff804a732b in kdb_enter (why=0xffffffff80795899 "panic", msg=0x400 <Address 0x400 out of bounds>) at cpufunc.h:63
#9  0xffffffff8046c06f in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:750
#10 0xffffffff8046c0d3 in panic (fmt=0xffffffff80b29c40 "\004") at /data/src/sys/kern/kern_shutdown.c:688
#11 0xffffffff803aeb37 in nfs_lock1 (ap=<value optimized out>) at /data/src/sys/fs/nfsclient/nfs_clvnops.c:3372
#12 0xffffffff8073e020 in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2083
#13 0xffffffff8052887a in _vn_lock (vp=0xfffff800ba28bb10, flags=<value optimized out>, file=0xffffffff807a7901 "/data/src/sys/kern/vfs_subr.c", line=2476)
    at vnode_if.h:859
#14 0xffffffff80519953 in vget (vp=0xfffff800ba28bb10, flags=279040, td=0x0) at /data/src/sys/kern/vfs_subr.c:2476
#15 0xffffffff8050c51c in vfs_hash_get (mp=0xfffff800077e6990, hash=790540428, flags=<value optimized out>, td=0x0, vpp=0xfffffe023b618568,
    fn=0xffffffff803b66e0 <newnfs_vncmpf>) at /data/src/sys/kern/vfs_hash.c:89
#16 0xffffffff803b6f81 in nfscl_ngetreopen (mntp=0xfffff800077e6990, fhp=<value optimized out>, fhsize=<value optimized out>, td=0x0, npp=0xfffffe023b618670)
    at /data/src/sys/fs/nfsclient/nfs_clport.c:347
#17 0xffffffff80392bcd in nfscl_hasexpired (clp=0xfffffe00039d2000, clidrev=<value optimized out>, p=0x0) at /data/src/sys/fs/nfsclient/nfs_clstate.c:4052
#18 0xffffffff803a0d0a in nfsrpc_read (vp=0xfffff800937fd760, uiop=0xfffffe023b6189d0, cred=0xfffff800b3286600, p=0x0, nap=0xfffffe023b618890,
    attrflagp=0xfffffe023b61895c) at /data/src/sys/fs/nfsclient/nfs_clrpcops.c:1381
#19 0xffffffff803af9fa in ncl_readrpc (vp=0xfffff800937fd760, uiop=0xfffffe023b6189d0, cred=0xfffff800b3195d00)
    at /data/src/sys/fs/nfsclient/nfs_clvnops.c:1381
#20 0xffffffff803ba638 in ncl_doio (vp=0xfffff800937fd760, bp=0xfffffe01f168aeb0, cr=0xfffffe023b617e40, td=<value optimized out>, called_from_strategy=-510)
    at /data/src/sys/fs/nfsclient/nfs_clbio.c:1610
#21 0xffffffff803bc694 in nfssvc_iod (instance=<value optimized out>) at /data/src/sys/fs/nfsclient/nfs_clnfsiod.c:302
#22 0xffffffff80436544 in fork_exit (callout=0xffffffff803bc420 <nfssvc_iod>, arg=0xffffffff80b28ec0, frame=0xfffffe023b618ac0)
    at /data/src/sys/kern/kern_fork.c:1034
#23 0xffffffff806c266e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:609
Comment 4 Rick Macklem freebsd_committer freebsd_triage 2016-02-20 02:36:13 UTC
Created attachment 167209
Patch to delete panic that no longer applies to current/head

r285632 changed vfs_hash_get() so that it no longer calls vget()
with VI_LOCK() held. As such, this panic() and the VI_UNLOCK() should
not be done. The attached patch (not yet tested by me) makes this change.
(Note that this panic and patch only apply to head/current and not
 stable/10 or earlier.)
Please let me know if you have an NFSv4 server crash/reboot after applying
the patch and whether or not it seems to work.

I will try and test the patch. I cannot commit it to head/current until
mid-April.
Comment 5 Rick Macklem freebsd_committer freebsd_triage 2016-02-20 02:40:31 UTC
Re: comment 3. The backtrace (nfscl_hasexpired) shows that a
lease-expired situation was being handled. This should only happen
if a client is network partitioned from the server for more than
2 minutes, and it is about the most serious problem that can happen
to NFSv4.
(If you don't know how the client got network partitioned from
 the server for over 2 minutes, you need to look hard for this
 and try hard to fix it.)
Comment 6 Rick Macklem freebsd_committer freebsd_triage 2016-05-11 21:14:55 UTC
Thanks to kib@, the recent commits r299412 and r299413 fix head
so that this panic no longer occurs. Since stable/10 doesn't panic,
I am closing this PR.