Hello, I'm getting a very similar panic on 12.1-BETA1 like this one:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226578

panic: _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock @ /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/scsi/scsi_da.c:2128
cpuid = 0
time = 1569751253
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0041383610
vpanic() at vpanic+0x19d/frame 0xfffffe0041383660
panic() at panic+0x43/frame 0xfffffe00413836c0
__mtx_lock_sleep() at __mtx_lock_sleep+0x4e1/frame 0xfffffe0041383750
__mtx_lock_flags() at __mtx_lock_flags+0xee/frame 0xfffffe00413837a0
daasync() at daasync+0x187/frame 0xfffffe00413837f0
xpt_async_process_dev() at xpt_async_process_dev+0x152/frame 0xfffffe0041383840
xptdevicetraverse() at xptdevicetraverse+0x13f/frame 0xfffffe0041383890
xpttargettraverse() at xpttargettraverse+0x6b/frame 0xfffffe00413838d0
xpt_async_process() at xpt_async_process+0x2d4/frame 0xfffffe00413839e0
xpt_done_process() at xpt_done_process+0x388/frame 0xfffffe0041383a20
xpt_done_td() at xpt_done_td+0xf6/frame 0xfffffe0041383a70
fork_exit() at fork_exit+0x84/frame 0xfffffe0041383ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0041383ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

#0 doadump (textdump=0) at RELENG_12_1/src/sys/amd64/include/pcpu.h:234
:
:
:
#9 0xffffffff805cf53a in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_shutdown.c:869
#10 0xffffffff805cf2e3 in panic (fmt=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_shutdown.c:807
#11 0xffffffff805b52d1 in __mtx_lock_sleep (c=<value optimized out>, v=<value optimized out>, opts=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_mutex.c:523
#12 0xffffffff805b4d7e in __mtx_lock_flags (c=0xfffff8000296ece8, opts=0, file=0xffffffff80a68048 "/usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/scsi/scsi_da.c", line=2128) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_mutex.c:255
#13 0xffffffff8033b947 in daasync (callback_arg=0xfffff80003af2400, code=16384, path=0xfffff8000276ab80, arg=0xfffff800037ea000) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/scsi/scsi_da.c:2128
#14 0xffffffff802e48a2 in xpt_async_process_dev (device=0xfffff8000296e800, arg=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:4426
#15 0xffffffff802e37bf in xptdevicetraverse (target=<value optimized out>, start_device=<value optimized out>, tr_func=0xfffff80048287000, arg=0xfffff80048287000) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:2355
#16 0xffffffff802e34ab in xpttargettraverse (bus=0xfffff8000234ac00, start_target=<value optimized out>, tr_func=0xffffffff802e46f0 <xpt_async_process_tgt>, arg=0xfffff80048287000) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:2316
#17 0xffffffff802e02c4 in xpt_async_process (periph=<value optimized out>, ccb=0xfffff80048287000) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:4382
#18 0xffffffff802e0a88 in xpt_done_process (ccb_h=0xfffff80048287000) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:5516
#19 0xffffffff802e2ba6 in xpt_done_td (arg=0xffffffff80d45300) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:5543
#20 0xffffffff805962e4 in fork_exit (callout=0xffffffff802e2ab0 <xpt_done_td>, arg=0xffffffff80d45300, frame=0xfffffe0041383ac0) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_fork.c:1065
#21 0xffffffff80912e9e in fork_trampoline () at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/amd64/amd64/exception.S:1077
#22 0x0000000000000000 in ?? ()

This bug report doesn't reference any commit.
https://svnweb.freebsd.org/base?view=revision&revision=331097 seems to be the corresponding commit, and looking at the code, the proposed fix isn't applicable anymore.

The panic was caused by "camcontrol rescan 40" (scbus40 is on isp0 / FC).

Thanks in advance for taking care of this.

-Harry
I'm hitting this panic reliably in both 14.0-CURRENT and stable/13 (when the latter is built with WITNESS enabled). I can easily trigger it by attaching an iSCSI disk, changing the iSCSI disk's size on the server, then doing "camcontrol reprobe da0" on the client.
What about accessing softc->flags with atomic instructions instead of locks? Would that be a valid solution?
(In reply to Alan Somers from comment #2)
No. Not a valid way to fix this. This is the periph lock, it seems, and we're locking it too much.

I can't for the life of me recreate this on any of the SIMs I have locally, so it has languished. If you have some way to reproduce this, I'll look at it ASAP. Otherwise, it will continue to languish. It may be a bug in the iSCSI code since that's the only SIM reported to create this issue. If you can recreate it with an instance or two of bhyve, then I'd love to know that so I can look at it.
I can't reproduce it on bhyve using virtual block devices, simply because I can't make any virtual block device whose size I can change and can reprobe:

* virtio-blk: not a CAM device, so I can't "camcontrol reprobe" it
* nvme: not a CAM device, so I can't "camcontrol reprobe" it
* virtio-scsi: not supported by vm-bhyve.
* ahci-hd: bhyve doesn't seem to notice when the zvol gets expanded

Instead, I reproduced it with iSCSI. The iSCSI server is physical (but probably could be a VM):

$ sudo zfs create -V 1g -o volmode=dev zroot/test/disk0
$ # write the following to /etc/ctl.conf
auth-group {
	disk {
		auth-type = none
		initiator-portal = [ 192.168.0.0/24 ]
	}
}
portal-group {
	pg0 {
		discovery-auth-group no-authentication
		listen 0.0.0.0
		listen [::]
	}
}
lun {
	"disk0" {
		blocksize = 4096
		device-id = "disk0"
		path = "/dev/zvol/zroot/test/disk0"
	}
}
target {
	"iqn.2018-10.mydomain.myhost:disk0" {
		auth-group = disk
		portal-group {
			name = pg0
		}
		lun = [
			{ number = 0, name = disk0 },
		]
	}
}
$ sudo sysrc ctld_flags="-u"
$ sudo service ctld onestart

Then do the following on the client:

$ sudo service iscsid onestart
$ sudo iscsictl -A -d myhost.mydomain
$ sudo iscsictl -L # to see the device name

Back on the server, do the following:

$ sudo zfs set volsize=2g zroot/test/disk0
$ sudo service ctld onereload

Then do the following on the client:

$ sudo camcontrol reprobe da0
Great, I'll try that...

nvme is a CAM device, but you need to set hw.nvme.use_nvd=0 in loader.conf :). However, it doesn't notice resizes that well either.

There's some code in dadone that might be the issue too... I'll look at it in detail when I'm back at work next week... (unless things slow down on my vacation enough for me to look at it before then).
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25

commit bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25
Author:     Robert Wing <rew@FreeBSD.org>
AuthorDate: 2022-01-04 01:21:58 +0000
Commit:     Robert Wing <rew@FreeBSD.org>
CommitDate: 2022-01-04 01:56:48 +0000

    cam: don't lock while handling an AC_UNIT_ATTENTION

    Don't take the device_mtx lock in daasync() when handling an
    AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
    periph's softc flags.

    The device_mtx lock is taken in xptdevicetraverse() before daasync() is
    eventually called in xpt_async_bcast().

    PR:             240917, 226510, 226578
    Reviewed by:    imp
    MFC after:      3 weeks
    Differential Revision:  https://reviews.freebsd.org/D27735

 sys/cam/scsi/scsi_da.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)
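The pattern the commit message describes (the caller takes the lock, the callback asserts ownership instead of locking again) can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the code from the commit: struct owned_mtx, the om_*() helpers, struct softc_sketch, FLAG_UNIT_ATTENTION, handle_async() and traverse() are all hypothetical names, and the ownership tracking is only safe to query from a thread that may itself hold the lock. In the kernel, mtx_assert(9) with MA_OWNED plays a comparable role.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical mutex wrapper that remembers which thread holds it. */
struct owned_mtx {
	pthread_mutex_t mtx;
	pthread_t owner;
	bool held;
};

void
om_init(struct owned_mtx *m)
{
	pthread_mutex_init(&m->mtx, NULL);
	m->held = false;
}

void
om_lock(struct owned_mtx *m)
{
	pthread_mutex_lock(&m->mtx);
	m->owner = pthread_self();
	m->held = true;
}

void
om_unlock(struct owned_mtx *m)
{
	m->held = false;
	pthread_mutex_unlock(&m->mtx);
}

/* True if the calling thread currently holds the mutex. */
bool
om_owned(struct owned_mtx *m)
{
	return (m->held && pthread_equal(m->owner, pthread_self()));
}

#define FLAG_UNIT_ATTENTION 0x1u	/* hypothetical flag bit */

struct softc_sketch {
	struct owned_mtx lock;
	unsigned flags;
};

/* Callback: relies on the caller's lock and asserts it, never re-locks. */
void
handle_async(struct softc_sketch *sc)
{
	assert(om_owned(&sc->lock));
	sc->flags |= FLAG_UNIT_ATTENTION;
}

/* Caller: takes the lock once, runs the callback, returns the flags seen. */
unsigned
traverse(struct softc_sketch *sc)
{
	unsigned flags;

	om_lock(&sc->lock);
	handle_async(sc);
	flags = sc->flags;
	om_unlock(&sc->lock);
	return (flags);
}
```

The assertion documents the locking contract at the point where the shared flags are modified, so a future caller that forgets to take the lock fails loudly in debug builds instead of racing silently.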
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=583480174ee7f4af92f0c5302884a7eece5b12f3

commit 583480174ee7f4af92f0c5302884a7eece5b12f3
Author:     Robert Wing <rew@FreeBSD.org>
AuthorDate: 2022-01-04 01:21:58 +0000
Commit:     Robert Wing <rew@FreeBSD.org>
CommitDate: 2022-02-10 19:43:18 +0000

    cam: don't lock while handling an AC_UNIT_ATTENTION

    Don't take the device_mtx lock in daasync() when handling an
    AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
    periph's softc flags.

    The device_mtx lock is taken in xptdevicetraverse() before daasync() is
    eventually called in xpt_async_bcast().

    PR:             240917, 226510, 226578
    Reviewed by:    imp
    MFC after:      3 weeks
    Differential Revision:  https://reviews.freebsd.org/D27735

    (cherry picked from commit bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25)

 sys/cam/scsi/scsi_da.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)
^Triage: Assign to committer that resolved (last reference) and track stable/* merge (so far).

Does this need to go to stable/12? This issue was a report against 12.0 (CURRENT). Will leave this issue closed, but if/when merged, please set mfc-stable12 flag to + and reference this issue in merge commit log so the merge is tracked in all issues.
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1987ff8abca2c9bdff7f385ea2fd1c60cf5b3aeb

commit 1987ff8abca2c9bdff7f385ea2fd1c60cf5b3aeb
Author:     Robert Wing <rew@FreeBSD.org>
AuthorDate: 2022-01-04 01:21:58 +0000
Commit:     Robert Wing <rew@FreeBSD.org>
CommitDate: 2022-03-08 07:07:46 +0000

    cam: don't lock while handling an AC_UNIT_ATTENTION

    Don't take the device_mtx lock in daasync() when handling an
    AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
    periph's softc flags.

    The device_mtx lock is taken in xptdevicetraverse() before daasync() is
    eventually called in xpt_async_bcast().

    PR:             240917, 226510, 226578
    Reviewed by:    imp
    Differential Revision:  https://reviews.freebsd.org/D27735

    (cherry picked from commit bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25)

 sys/cam/scsi/scsi_da.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)