240917 – panic: (scsi_da.c:2128, 12.1-BETA1) _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	12.1-RELEASE
Hardware:	Any Any

Importance:	--- Affects Some People
Assignee:	Robert Wing

URL:
Keywords:	crash

Depends on:
Blocks:

Reported:	2019-09-29 10:33 UTC by Harald Schmalzbauer
Modified:	2022-03-08 07:21 UTC
CC List:	5 users

See Also:	226578 226510

Flags:	koobs: mfc-stable13+ rew: mfc-stable12+

Description Harald Schmalzbauer 2019-09-29 10:33:40 UTC

Hello,

I'm getting a panic on 12.1-BETA1 that is very similar to this one:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226578

panic: _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock @ /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/scsi/scsi_da.c:2128
                                                  
cpuid = 0                               
time = 1569751253                                  
KDB: stack backtrace:                    
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0041383610                                          
vpanic() at vpanic+0x19d/frame 0xfffffe0041383660
panic() at panic+0x43/frame 0xfffffe00413836c0
__mtx_lock_sleep() at __mtx_lock_sleep+0x4e1/frame 0xfffffe0041383750
__mtx_lock_flags() at __mtx_lock_flags+0xee/frame 0xfffffe00413837a0
daasync() at daasync+0x187/frame 0xfffffe00413837f0
xpt_async_process_dev() at xpt_async_process_dev+0x152/frame 0xfffffe0041383840    
xptdevicetraverse() at xptdevicetraverse+0x13f/frame 0xfffffe0041383890
xpttargettraverse() at xpttargettraverse+0x6b/frame 0xfffffe00413838d0           
xpt_async_process() at xpt_async_process+0x2d4/frame 0xfffffe00413839e0
xpt_done_process() at xpt_done_process+0x388/frame 0xfffffe0041383a20             
xpt_done_td() at xpt_done_td+0xf6/frame 0xfffffe0041383a70
fork_exit() at fork_exit+0x84/frame 0xfffffe0041383ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0041383ab0       
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---                                   
KDB: enter: panic


#0  doadump (textdump=0) at RELENG_12_1/src/sys/amd64/include/pcpu.h:234
:
:
:
#9  0xffffffff805cf53a in vpanic (fmt=<value optimized out>, ap=<value optimized out>)                                                 
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_shutdown.c:869                                                      
#10 0xffffffff805cf2e3 in panic (fmt=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_shutdown.c:807                                                      
#11 0xffffffff805b52d1 in __mtx_lock_sleep (c=<value optimized out>, v=<value optimized out>, opts=<value optimized out>,              
    file=<value optimized out>, line=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_mutex.c:523 
#12 0xffffffff805b4d7e in __mtx_lock_flags (c=0xfffff8000296ece8, opts=0,                                                              
    file=0xffffffff80a68048 "/usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/scsi/scsi_da.c", line=2128)                         
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_mutex.c:255                                                         
#13 0xffffffff8033b947 in daasync (callback_arg=0xfffff80003af2400, code=16384, path=0xfffff8000276ab80, arg=0xfffff800037ea000)       
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/scsi/scsi_da.c:2128                                                       
#14 0xffffffff802e48a2 in xpt_async_process_dev (device=0xfffff8000296e800, arg=<value optimized out>)                                 
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:4426                                                            
#15 0xffffffff802e37bf in xptdevicetraverse (target=<value optimized out>, start_device=<value optimized out>,                         
    tr_func=0xfffff80048287000, arg=0xfffff80048287000) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:2355        
#16 0xffffffff802e34ab in xpttargettraverse (bus=0xfffff8000234ac00, start_target=<value optimized out>,                               
    tr_func=0xffffffff802e46f0 <xpt_async_process_tgt>, arg=0xfffff80048287000)                                                        
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:2316                                                            
#17 0xffffffff802e02c4 in xpt_async_process (periph=<value optimized out>, ccb=0xfffff80048287000)                                     
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:4382                                                            
#18 0xffffffff802e0a88 in xpt_done_process (ccb_h=0xfffff80048287000)                                                                  
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:5516                                                            
#19 0xffffffff802e2ba6 in xpt_done_td (arg=0xffffffff80d45300) at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/cam/cam_xpt.c:5543 
#20 0xffffffff805962e4 in fork_exit (callout=0xffffffff802e2ab0 <xpt_done_td>, arg=0xffffffff80d45300, frame=0xfffffe0041383ac0)       
    at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/kern/kern_fork.c:1065                                                         
#21 0xffffffff80912e9e in fork_trampoline () at /usr/local/share/deploy-tools/RELENG_12_1/src/sys/amd64/amd64/exception.S:1077
#22 0x0000000000000000 in ?? ()

This bug report doesn't reference any commit.
https://svnweb.freebsd.org/base?view=revision&revision=331097 seems to be the corresponding commit, and looking at the code, the proposed fix isn't applicable anymore.

The panic was caused by "camcontrol rescan 40" (scbus40 is on isp0 / FC).

Thanks in advance for taking care of this.

-Harry
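
For readers unfamiliar with this class of panic, here is a minimal userland sketch of the failure mode, using a POSIX error-checking mutex as a stand-in for the kernel's non-recursive mutex. It is an analogy only, not kernel code: dev_lock, async_callback() and traverse_and_notify() are hypothetical names, and the assumption that the caller already holds the lock when the callback runs is taken from the fix commit's description later in this report. Where the kernel panics, the pthread version returns EDEADLK.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the CAM device lock; this is not kernel code. */
static pthread_mutex_t dev_lock;

/* Initialize dev_lock as an error-checking mutex, which refuses recursion. */
static void
dev_lock_init(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&dev_lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

/*
 * Hypothetical callback that takes the lock itself, the way daasync()
 * does at scsi_da.c:2128 in the backtrace above.
 */
static int
async_callback(void)
{
	int error;

	error = pthread_mutex_lock(&dev_lock);
	if (error == 0)
		pthread_mutex_unlock(&dev_lock);
	return (error);
}

/*
 * Hypothetical traversal that already holds the lock when it invokes
 * the callback, so the callback's lock attempt is a recursion.
 */
static int
traverse_and_notify(void)
{
	int error;

	pthread_mutex_lock(&dev_lock);
	error = async_callback();
	pthread_mutex_unlock(&dev_lock);
	return (error);
}
```

Calling dev_lock_init() and then traverse_and_notify() yields EDEADLK, because the same thread tries to take a mutex it already owns.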

Comment 1 Alan Somers freebsd_committer

2021-12-28 23:30:46 UTC

I'm hitting this panic reliably in both 14.0-CURRENT and stable/13 (when the latter is built with WITNESS enabled).  I can easily trigger it by attaching an iscsi disk, changing the iscsi disk's size on the server, then doing "camcontrol reprobe da0" on the client.

Comment 2 Alan Somers freebsd_committer

2021-12-28 23:31:56 UTC

What about accessing softc->flags with atomic instructions instead of locks?  Would that be a valid solution?

Comment 3 Warner Losh freebsd_committer

2021-12-29 00:05:54 UTC

(In reply to Alan Somers from comment #2)

No. Not a valid way to fix this. This is the periph lock, it seems,
and we're locking it too much.

I can't for the life of me recreate this on any of the SIMs I have locally, so it has languished. If you have some way to reproduce this, I'll look at it ASAP. Otherwise, it will continue to languish. It may be a bug in the iSCSI code since that's the only SIM reported to create this issue. If you can recreate it with an instance or two of bhyve, then I'd love to know that so I can look at it.

Comment 4 Alan Somers freebsd_committer

2021-12-29 00:49:15 UTC

I can't reproduce it on bhyve using virtual block devices, simply because I can't create any virtual block device that I can both resize and reprobe:
* virtio-blk: not a CAM device, so I can't "camcontrol reprobe" it
* nvme: not a CAM device, so I can't "camcontrol reprobe" it
* virtio-scsi: not supported by vm-bhyve.
* ahci-hd: bhyve doesn't seem to notice when the zvol gets expanded

Instead, I reproduced it with iSCSI.  The iSCSI server is physical (but probably could be a VM):

$ sudo zfs create -V 1g -o volmode=dev zroot/test/disk0
$ # write the following to /etc/ctl.conf
auth-group {
    disk {
        auth-type = none
        initiator-portal = [ 192.168.0.0/24 ]
    }
}

portal-group {
    pg0 {
        discovery-auth-group no-authentication
        listen 0.0.0.0
        listen [::]
    }
}

lun {
    "disk0" {
        blocksize = 4096
        device-id = "disk0"
        path = "/dev/zvol/zroot/test/disk0"
    }
}

target {
    "iqn.2018-10.mydomain.myhost:disk0" {
        auth-group = disk
        portal-group { name = pg0 }
        lun = [
            { number = 0, name = disk0 },
        ]
    }
}

$ sudo sysrc ctld_flags="-u"
$ sudo service ctld onestart

Then do the following on the client
$ sudo service iscsid onestart
$ sudo iscsictl -A -d myhost.mydomain
$ sudo iscsictl -L   # to see the device name

Back on the server, do the following:
$ sudo zfs set volsize=2g zroot/test/disk0
$ sudo service ctld onereload

Then do the following on the client
$ sudo camcontrol reprobe da0

Comment 5 Warner Losh freebsd_committer

2021-12-29 01:38:06 UTC

Great, I'll try that...

nvme is a CAM device, but you need to set

hw.nvme.use_nvd=0

in loader.conf :). However, it doesn't notice resizes that well either.

There's some code in dadone that might also be the issue...  I'll look at it in detail when I'm back at work next week (unless things slow down on my vacation enough for me to look at it before then).

Comment 6 commit-hook freebsd_committer

2022-01-04 02:02:34 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25

commit bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25
Author:     Robert Wing <rew@FreeBSD.org>
AuthorDate: 2022-01-04 01:21:58 +0000
Commit:     Robert Wing <rew@FreeBSD.org>
CommitDate: 2022-01-04 01:56:48 +0000

    cam: don't lock while handling an AC_UNIT_ATTENTION

    Don't take the device_mtx lock in daasync() when handling an
    AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
    periph's softc flags.

    The device_mtx lock is taken in xptdevicetraverse() before daasync()
    is eventually called in xpt_async_bcast().

    PR:             240917, 226510, 226578
    Reviewed by:    imp
    MFC after:      3 weeks
    Differential Revision: https://reviews.freebsd.org/D27735

 sys/cam/scsi/scsi_da.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)
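
The pattern described in this commit message (the callback asserts that its caller holds the lock instead of taking it again) can be sketched in userland C as follows. This is an illustration under stated assumptions, not the commit's diff: struct dev, DEV_FLAG_ATTENTION, and every function name here are hypothetical, and the owner tracking is hand-rolled because pthreads has no ownership assertion. In the kernel the analogous check would be an mtx_assert(9)-style assertion with MA_OWNED.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEV_FLAG_ATTENTION 0x1u	/* hypothetical softc-style flag */

/* Hypothetical device with a lock, manual owner tracking, and flags. */
struct dev {
	pthread_mutex_t lock;
	pthread_t owner;	/* valid only while owned is true */
	bool owned;
	unsigned int flags;
};

static void
dev_init(struct dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->owned = false;
	d->flags = 0;
}

static void
dev_lock(struct dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->owner = pthread_self();
	d->owned = true;
}

static void
dev_unlock(struct dev *d)
{
	d->owned = false;
	pthread_mutex_unlock(&d->lock);
}

/* Assertion helper; only meaningful when called by the would-be owner. */
static bool
dev_lock_owned(struct dev *d)
{
	return (d->owned && pthread_equal(d->owner, pthread_self()));
}

/* The callback no longer locks; it asserts the caller already did. */
static void
on_unit_attention(struct dev *d)
{
	assert(dev_lock_owned(d));
	d->flags |= DEV_FLAG_ATTENTION;
}

/* The traversal takes the lock exactly once around the callback. */
static void
broadcast_unit_attention(struct dev *d)
{
	dev_lock(d);
	on_unit_attention(d);
	dev_unlock(d);
}
```

After dev_init(&d) and broadcast_unit_attention(&d), the flag is set and the lock is released, with no second acquisition along the way. The assertion also documents the locking contract: a caller that forgets to take the lock fails loudly in debug builds instead of silently racing.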

Comment 7 commit-hook freebsd_committer

2022-02-10 19:43:50 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=583480174ee7f4af92f0c5302884a7eece5b12f3

commit 583480174ee7f4af92f0c5302884a7eece5b12f3
Author:     Robert Wing <rew@FreeBSD.org>
AuthorDate: 2022-01-04 01:21:58 +0000
Commit:     Robert Wing <rew@FreeBSD.org>
CommitDate: 2022-02-10 19:43:18 +0000

    cam: don't lock while handling an AC_UNIT_ATTENTION

    Don't take the device_mtx lock in daasync() when handling an
    AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
    periph's softc flags.

    The device_mtx lock is taken in xptdevicetraverse() before daasync()
    is eventually called in xpt_async_bcast().

    PR:             240917, 226510, 226578
    Reviewed by:    imp
    MFC after:      3 weeks
    Differential Revision: https://reviews.freebsd.org/D27735

    (cherry picked from commit bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25)

 sys/cam/scsi/scsi_da.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

Comment 8 Kubilay Kocak freebsd_committer

2022-02-10 23:23:41 UTC

^Triage: Assign to committer that resolved (last reference) and track stable/* merge (so far).

Does this need to go to stable/12? This issue was a report against 12.0 (CURRENT). 

Will leave this issue closed, but if/when merged, please set the mfc-stable12 flag to + and reference this issue in the merge commit log so the merge is tracked in all issues.

Comment 9 commit-hook freebsd_committer

2022-03-08 07:12:35 UTC

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1987ff8abca2c9bdff7f385ea2fd1c60cf5b3aeb

commit 1987ff8abca2c9bdff7f385ea2fd1c60cf5b3aeb
Author:     Robert Wing <rew@FreeBSD.org>
AuthorDate: 2022-01-04 01:21:58 +0000
Commit:     Robert Wing <rew@FreeBSD.org>
CommitDate: 2022-03-08 07:07:46 +0000

    cam: don't lock while handling an AC_UNIT_ATTENTION

    Don't take the device_mtx lock in daasync() when handling an
    AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
    periph's softc flags.

    The device_mtx lock is taken in xptdevicetraverse() before daasync()
    is eventually called in xpt_async_bcast().

    PR:             240917, 226510, 226578
    Reviewed by:    imp
    Differential Revision: https://reviews.freebsd.org/D27735

    (cherry picked from commit bb8441184bab60cd8a07c2b94bd6c4ae8b56ec25)

 sys/cam/scsi/scsi_da.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)