(Started at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227465)
Got another panic on r332466, so it's reproducible.
vputx: usecount not zero for vnode 0xfffff803852a1760: tag devfs, type VCHR
usecount -1, writecount -1, refcount 0 rdev 0xfffff80145cf1a00
VI_LOCKed lock type devfs: EXCL by thread 0xfffff804ec5b1620 (pid 59162, ctladm, tid 109582)
Now with backtrace:
#0 doadump (textdump=<value optimized out>) at pcpu.h:229
#1 0xffffffff80af3beb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:383
#2 0xffffffff80af4011 in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
#3 0xffffffff80af3e53 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:707
#4 0xffffffff80bb799e in vputx (vp=<value optimized out>, func=<value optimized out>)
#5 0xffffffff80bc6885 in vn_close1 (vp=0xfffff803852a1760, flags=3, file_cred=0x0, td=<value optimized out>,
keep_ref=false) at /usr/src/sys/kern/vfs_vnops.c:459
#6 0xffffffff82844554 in ctl_be_block_close (be_lun=0xfffff80a6c8f9000)
#7 0xffffffff8284888c in ctl_be_block_rm (softc=0xffffffff82870298, req=0xfffff807b9afbc00)
#8 0xffffffff8284399d in ctl_be_block_ioctl (dev=0xfffff8045f074600, cmd=3243303201, addr=0xfffff807b9afbc00 "block",
flag=3, td=0xfffff804ec5b1620) at /usr/src/sys/cam/ctl/ctl_backend_block.c:1792
#9 0xffffffff82835e5f in ctl_ioctl (dev=0xfffff8045f074600, cmd=3243303201, addr=0xfffff807b9afbc00 "block", flag=3,
td=0xfffff804ec5b1620) at /usr/src/sys/cam/ctl/ctl.c:3065
#10 0xffffffff809c2458 in devfs_ioctl_f (fp=0xfffff803bc3f5640, com=3243303201, data=0xfffff807b9afbc00,
cred=0xfffff80175ad3800, td=0xfffff804ec5b1620) at /usr/src/sys/fs/devfs/devfs_vnops.c:791
#11 0xffffffff80b5636d in kern_ioctl (td=0xfffff804ec5b1620, fd=3, com=3243303201, data=<value optimized out>)
#12 0xffffffff80b5608c in sys_ioctl (td=0xfffff804ec5b1620, uap=0xfffff804ec5b1b58)
#13 0xffffffff80f72dc8 in amd64_syscall (td=0xfffff804ec5b1620, traced=0) at subr_syscall.c:132
#14 0xffffffff80f5255d in fast_syscall_common () at /usr/src/sys/amd64/amd64/exception.S:480
#15 0x0000000801202bfa in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal
Kernel binaries and core files are avaiable at http://san1.linx.playkey.net/r332466M/
I'm downgrading to r332465 to see if this would help.
A quick check shows that r332466 is not likely to be responsible, nor were the changes immediately preceding it. I would start by rolling back to around r332399 and seeing what happens.
My bad, I mistyped the revision version (and really downgraded to a wrong version), I suspect this changeset is responsible: https://svnweb.freebsd.org/base?view=revision&revision=331666
So I really should roll back to 331665, not 332465.
I'm still confused. r331666 was a change to -head, not -stable. That change was MFCed to stable/11 as r332755.
The very latest commit to stable/11 was r332785.
If you did specify r331666, you are now running -head, not -stable.
Thank you a lot.
I always forget to check the branch.
So you're right then, 332465 was a way behind this change I'm pointing at.
I will try to downgrade to 332096 as avg@ advised.
Reproducible on r332096 with similar symptoms:
Unread portion of the kernel message buffer:
panic: vputx: usecount not zero
cpuid = 14
KDB: stack backtrace:
#0 0xffffffff80b3a417 at kdb_backtrace+0x67
#1 0xffffffff80af4027 at vpanic+0x177
#2 0xffffffff80af3ea3 at panic+0x43
#3 0xffffffff80bb79ee at vputx+0x33e
#4 0xffffffff80bc68d5 at vn_close1+0x115
#5 0xffffffff82845554 at ctl_be_block_close+0x74
#6 0xffffffff8284988c at ctl_be_block_rm+0x14c
#7 0xffffffff8284499d at ctl_be_block_ioctl+0x7d
#8 0xffffffff82836e5f at ctl_ioctl+0x124f
#9 0xffffffff809c24a8 at devfs_ioctl_f+0x128
#10 0xffffffff80b563bd at kern_ioctl+0x26d
#11 0xffffffff80b560dc at sys_ioctl+0x15c
#12 0xffffffff80f72dc8 at amd64_syscall+0xa38
#13 0xffffffff80f5260d at fast_syscall_common+0x101
Dumping 5642 out of 65385 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
The crash seems to be triggered by ctladm(8) trying to remove a LUN, and looking at core.0, it seems there are several ctladm instances running. Is it possible that two instances are trying to manipulate the same LUN? If so, this might be what triggers the problem. There might be unsufficient locking in the CTL block backend.
Yes, this is pretty much possible.
I'll try to outrun this in our API.
Could this be solved from the CTL layer point of view ?
Sure it could; what's needed is fixing the locking for the block backend IOCTL API. Or perhaps removing the locking that's currently there - it seems a bit... weird - and replacing the "softc->lock" with an SX, and locking it higher up in the call stack.
But I'm guessing it's much easier to just avoid doing the concurrent manipulation.