Bug 227631 - panic: vputx: usecount not zero for vnode
Summary: panic: vputx: usecount not zero for vnode
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.1-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-19 08:24 UTC by emz
Modified: 2018-12-06 13:41 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description emz 2018-04-19 08:24:36 UTC
(Started at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227465)

Got another panic on r332466, so it's reproducible.

vputx: usecount not zero for vnode 0xfffff803852a1760: tag devfs, type VCHR
    usecount -1, writecount -1, refcount 0 rdev 0xfffff80145cf1a00
    flags (VI_FREE)
 VI_LOCKed    lock type devfs: EXCL by thread 0xfffff804ec5b1620 (pid 59162, ctladm, tid 109582)
        dev zvol/data/kvm/desktop/games-vm401

Now with backtrace:

===Cut===
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:229
#1  0xffffffff80af3beb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:383
#2  0xffffffff80af4011 in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:776
#3  0xffffffff80af3e53 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:707
#4  0xffffffff80bb799e in vputx (vp=<value optimized out>, func=<value optimized out>)
    at /usr/src/sys/kern/vfs_subr.c:2686
#5  0xffffffff80bc6885 in vn_close1 (vp=0xfffff803852a1760, flags=3, file_cred=0x0, td=<value optimized out>, 
    keep_ref=false) at /usr/src/sys/kern/vfs_vnops.c:459
#6  0xffffffff82844554 in ctl_be_block_close (be_lun=0xfffff80a6c8f9000)
    at /usr/src/sys/cam/ctl/ctl_backend_block.c:2100
#7  0xffffffff8284888c in ctl_be_block_rm (softc=0xffffffff82870298, req=0xfffff807b9afbc00)
    at /usr/src/sys/cam/ctl/ctl_backend_block.c:2504
#8  0xffffffff8284399d in ctl_be_block_ioctl (dev=0xfffff8045f074600, cmd=3243303201, addr=0xfffff807b9afbc00 "block", 
    flag=3, td=0xfffff804ec5b1620) at /usr/src/sys/cam/ctl/ctl_backend_block.c:1792
#9  0xffffffff82835e5f in ctl_ioctl (dev=0xfffff8045f074600, cmd=3243303201, addr=0xfffff807b9afbc00 "block", flag=3, 
    td=0xfffff804ec5b1620) at /usr/src/sys/cam/ctl/ctl.c:3065
#10 0xffffffff809c2458 in devfs_ioctl_f (fp=0xfffff803bc3f5640, com=3243303201, data=0xfffff807b9afbc00, 
    cred=0xfffff80175ad3800, td=0xfffff804ec5b1620) at /usr/src/sys/fs/devfs/devfs_vnops.c:791
#11 0xffffffff80b5636d in kern_ioctl (td=0xfffff804ec5b1620, fd=3, com=3243303201, data=<value optimized out>)
    at file.h:323
#12 0xffffffff80b5608c in sys_ioctl (td=0xfffff804ec5b1620, uap=0xfffff804ec5b1b58)
    at /usr/src/sys/kern/sys_generic.c:745
#13 0xffffffff80f72dc8 in amd64_syscall (td=0xfffff804ec5b1620, traced=0) at subr_syscall.c:132
#14 0xffffffff80f5255d in fast_syscall_common () at /usr/src/sys/amd64/amd64/exception.S:480
#15 0x0000000801202bfa in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
===Cut===

Kernel binaries and core files are avaiable at http://san1.linx.playkey.net/r332466M/

I'm downgrading to r332465 to see if this would help.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2018-04-20 03:43:04 UTC
A quick check shows that r332466 is not likely to be responsible, nor were the changes immediately preceding it.  I would start by rolling back to around r332399 and seeing what happens.
Comment 2 emz 2018-04-20 09:07:45 UTC
My bad, I mistyped the revision version (and really downgraded to a wrong version), I suspect this changeset is responsible: https://svnweb.freebsd.org/base?view=revision&revision=331666 

So I really should roll back to 331665, not 332465.
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2018-04-20 09:36:02 UTC
I'm still confused.  r331666 was a change to -head, not -stable.  That change was MFCed to stable/11 as r332755.

The very latest commit to stable/11 was r332785.

If you did specify r331666, you are now running -head, not -stable.
Comment 4 emz 2018-04-20 09:55:44 UTC
Thank you a lot.
I always forget to check the branch.

So you're right then, 332465 was a way behind this change I'm pointing at.
I will try to downgrade to 332096 as avg@ advised.
Comment 5 emz 2018-04-23 13:24:22 UTC
Reproducible on r332096 with similar symptoms:

Unread portion of the kernel message buffer:
        dev zvol/data/kvm/desktop/desktop-vm393
panic: vputx: usecount not zero
cpuid = 14
KDB: stack backtrace:
#0 0xffffffff80b3a417 at kdb_backtrace+0x67
#1 0xffffffff80af4027 at vpanic+0x177
#2 0xffffffff80af3ea3 at panic+0x43
#3 0xffffffff80bb79ee at vputx+0x33e
#4 0xffffffff80bc68d5 at vn_close1+0x115
#5 0xffffffff82845554 at ctl_be_block_close+0x74
#6 0xffffffff8284988c at ctl_be_block_rm+0x14c
#7 0xffffffff8284499d at ctl_be_block_ioctl+0x7d
#8 0xffffffff82836e5f at ctl_ioctl+0x124f
#9 0xffffffff809c24a8 at devfs_ioctl_f+0x128
#10 0xffffffff80b563bd at kern_ioctl+0x26d
#11 0xffffffff80b560dc at sys_ioctl+0x15c
#12 0xffffffff80f72dc8 at amd64_syscall+0xa38
#13 0xffffffff80f5260d at fast_syscall_common+0x101
Uptime: 5m7s
Dumping 5642 out of 65385 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Comment 6 Edward Tomasz Napierala freebsd_committer 2018-04-23 15:38:47 UTC
The crash seems to be triggered by ctladm(8) trying to remove a LUN, and looking at core.0, it seems there are several ctladm instances running.  Is it possible that two instances are trying to manipulate the same LUN?  If so, this might be what triggers the problem.  There might be unsufficient locking in the CTL block backend.
Comment 7 emz 2018-04-23 18:53:01 UTC
Yes, this is pretty much possible.
I'll try to outrun this in our API.

Could this be solved from the CTL layer point of view ?
Comment 8 Edward Tomasz Napierala freebsd_committer 2018-04-23 18:59:23 UTC
Sure it could; what's needed is fixing the locking for the block backend IOCTL API.  Or perhaps removing the locking that's currently there - it seems a bit... weird - and replacing the "softc->lock" with an SX, and locking it higher up in the call stack.

But I'm guessing it's much easier to just avoid doing the concurrent manipulation.