Bug 252575

Summary: panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue
Product: Base System Reporter: Peter Eriksson <pen>
Component: kernAssignee: freebsd-scsi (Nobody) <scsi>
Status: New ---    
Severity: Affects Only Me CC: ae, asomers
Priority: --- Keywords: crash
Version: 12.2-RELEASE   
Hardware: amd64   
OS: Any   

Description Peter Eriksson 2021-01-11 14:28:35 UTC
Just had a server crash on me:

Hardware: HP ProLiant DL380G9 with HP H241 HBA:s (ciss) with a couple of external D6020 JBOD SAS boxes filled with 10TB drives. The server crashes at night when backups (writing to it) are running. Stays up fairly well during daytime.


ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue 0xfffff8010745e838 of size 5
cpuid = 3
time = 1610229478
KDB: stack backtrace:
#0 0xffffffff80c1e9a7 at kdb_backtrace+0x67
#1 0xffffffff80bd171d at vpanic+0x19d
#2 0xffffffff80bd1573 at panic+0x43
#3 0xffffffff80370f76 at camq_remove+0xf6
#4 0xffffffff803742e8 at xpt_run_devq+0x168
#5 0xffffffff803738f6 at xpt_action_default+0x836
#6 0xffffffff803a50b9 at dastart+0x2f9
#7 0xffffffff80375cf2 at xpt_run_allocq+0x172
#8 0xffffffff803a6d68 at dastrategy+0x88
#9 0xffffffff80b176e6 at g_disk_start+0x336
#10 0xffffffff80b1ab0c at g_io_request+0x27c
#11 0xffffffff80b1ab0c at g_io_request+0x27c
#12 0xffffffff82567b37 at zio_vdev_io_start+0x2a7
#13 0xffffffff82563f1c at zio_execute+0xac
#14 0xffffffff8256384a at zio_nowait+0xca
#15 0xffffffff8253e678 at vdev_queue_io_done+0x148
#16 0xffffffff82567e01 at zio_vdev_io_done+0x151
#17 0xffffffff82563f1c at zio_execute+0xac

I'm guessing this is due to some drive having problems. Currently unclear which though. Trying to upgrade to 12.2 and we'll see if it behaves better...
Comment 1 Peter Eriksson 2021-01-11 15:13:53 UTC
Hmm

"panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue"

-2 might be "CAM_ACTIVE_INDEX" which is set into send_ccb->ccb_h.pinfo.index in cam_ccbq_send_ccb() (sys/cam/cam_queue.h) which is called from xpt_run_devq() around line 3492 in cam/cam_xpt.c in the xpt_run_devq() loop that calls camq_remove and panics...

                cam_ccbq_remove_ccb(&device->ccbq, work_ccb);                                                
                cam_ccbq_send_ccb(&device->ccbq, work_ccb);
Comment 2 Peter Eriksson 2021-01-12 17:33:08 UTC
With FreeBSD 12.2-STABLE I get a gazillion "ciss0: WARNING: completing non-busy request" but no panic (yet).

I managed to get some SCSI errors too which probably is related:

(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 10 a6 1a 70 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure)
(da11:ciss0:32:27:0): Info: 0x310a61b48
(da11:ciss0:32:27:0): Field Replaceable Unit: 67
(da11:ciss0:32:27:0): Descriptor 0x80: f7 43
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cd 21 01 00 d0
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 d2 4d 60 00 00 00 e8 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure)
(da11:ciss0:32:27:0): Info: 0x311d24dc0
(da11:ciss0:32:27:0): Field Replaceable Unit: 67
(da11:ciss0:32:27:0): Descriptor 0x80: f7 43
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cf 0f 01 00 21
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 df 48 78 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x311df4898
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 ce 0c 01 00 60
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 e4 6e f8 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x311e46f20
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cd a6 01 00 15
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 07 b2 28 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x31307b2d8
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 d0 09 01 01 26
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 09 ec 68 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x31309ec98
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cf dd 01 01 5d
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 0e dc 80 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x3130edcd0
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cf 7b 01 00 72
(da11:ciss0:32:27:0): Error 5, Unretryable error

I've now zpool offline'd that disk from the zpool in question so hopefully this night things will behave better. Still think that panic shouldn't have happened.

And perhaps one should have some kind of throttle on the kernel printed WARNINGS - then machine because more or less unusable while it was printing them (was watching them via ipmi/serial).
Comment 3 Alan Somers freebsd_committer freebsd_triage 2021-03-30 23:13:13 UTC
I can trigger this panic pretty reliably on 12.2 by doing a firmware update of an mps card.  Updating a Broadcom 9207-8e card from firmware 19.00.00.00 to 20.00.07.00 succeeds, but when resetting the card it hits this panic.  I haven't yet tried to do the update on 13.0, nor have I tried updating from 20.00.07.00 to the same version, 20.00.07.00.  The stack trace is a little bit different:

panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue 0xfffff80104dc0038 of size 1
cpuid = 14
time = 1617145521
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01b8b08860
vpanic() at vpanic+0x17b/frame 0xfffffe01b8b088b0
panic() at panic+0x43/frame 0xfffffe01b8b08910
camq_remove() at camq_remove+0xf6/frame 0xfffffe01b8b08930
xpt_run_devq() at xpt_run_devq+0x158/frame 0xfffffe01b8b08990
xpt_action_default() at xpt_action_default+0x7fd/frame 0xfffffe01b8b089e0
cam_periph_runccb() at cam_periph_runccb+0xe2/frame 0xfffffe01b8b08b30
enc_daemon() at enc_daemon+0x2d6/frame 0xfffffe01b8b08bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe01b8b08bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01b8b08bf0

and the last entries in dmesg are the anodyne
mps2: Reinitializing controller
mps2: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps2: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps2: Calling Reinit from mps_wait_command, timeout=60, elapsed=60
mps2: Reinitializing controller
mps2: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps2: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
Comment 4 Andrey V. Elsukov freebsd_committer freebsd_triage 2022-04-13 20:07:24 UTC
We have panic with similar trace on 13.0.
It seems it can happen when some CAM device disappears.

Unread portion of the kernel message buffer:
(cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 
(cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(cd0:umass-sim0:0:0:0): Retrying command, 3 more tries remain
(cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 
(cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(cd0:umass-sim0:0:0:0): Retrying command, 2 more tries remain
(cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 
(cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(cd0:umass-sim0:0:0:0): Retrying command, 1 more tries remain
(cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 
(cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(cd0:umass-sim0:0:0:0): Retrying command, 0 more tries remain
(cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 
(cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(cd0:umass-sim0:0:0:0): Error 5, Retries exhausted
(cd0:umass-sim0:0:0:0): cddone: got error 0x5 back
cd0 at umass-sim0 bus 0 scbus3 target 0 lun 0
cd0: <MP EMS Virtual Media 0326>  detached
da0 at umass-sim0 bus 0 scbus3 target 0 lun 1
da0: <MP EMS Virtual Media 0326>  detached
da1 at umass-sim0 bus 0 scbus3 target 0 lun 2
da1: <MP EMS Virtual Media 0326>  detached
(cd0:umass-sim0:0:0:0): Periph destroyed


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address	= 0x100000070
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8034e3f0
stack pointer	        = 0x28:0xfffffe0037796990
(da1:umass-sim0:0:0:2): Periph destroyed
frame pointer	        = 0x28:0xfffffe00377969e0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 7 (doneq0)
trap number		= 12
panic: page fault
cpuid = 2
time = 1645903760
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0037796650
vpanic() at vpanic+0x182/frame 0xfffffe00377966a0
panic() at panic+0x43/frame 0xfffffe0037796700
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0037796760
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00377967b0
trap() at trap+0x271/frame 0xfffffe00377968c0
calltrap() at calltrap+0x8/frame 0xfffffe00377968c0
--- trap 0xc, rip = 0xffffffff8034e3f0, rsp = 0xfffffe0037796990, rbp = 0xfffffe00377969e0 ---
xpt_run_devq() at xpt_run_devq+0x90/frame 0xfffffe00377969e0
xpt_release_simq() at xpt_release_simq+0x6d/frame 0xfffffe0037796a10
xpt_async_process() at xpt_async_process+0x37a/frame 0xfffffe0037796b20
xpt_done_process() at xpt_done_process+0x3da/frame 0xfffffe0037796b60
xpt_done_td() at xpt_done_td+0xf5/frame 0xfffffe0037796bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0037796bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0037796bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 81d8h53m59s
Dumping 9884 out of 32621 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55	/usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) macro define offsetof(t, f) &((t *) 0)->f
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:394
#2  0xffffffff80b377eb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:481
#3  0xffffffff80b37c6a in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:913
#4  0xffffffff80b37a83 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:839
#5  0xffffffff80efab67 in trap_fatal (frame=0xfffffe00377968d0, eva=4294967408) at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff80efabbf in trap_pfault (frame=0xfffffe00377968d0, usermode=<optimized out>, signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff80efa221 in trap (frame=0xfffffe00377968d0) at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  xpt_run_devq (devq=0xfffff806f74a1180) at /usr/src/sys/cam/cam_xpt.c:3448
#10 0xffffffff80351a2d in xpt_release_simq (sim=0xfffff80724a6d700, run_queue=1) at /usr/src/sys/cam/cam_xpt.c:4617
#11 0xffffffff8035182a in xpt_async_process (periph=<optimized out>, ccb=0xfffff8045f5c9800) at /usr/src/sys/cam/cam_xpt.c:4363
#12 0xffffffff80351eaa in xpt_done_process (ccb_h=0xfffff8045f5c9800) at /usr/src/sys/cam/cam_xpt.c:5490
#13 0xffffffff803540c5 in xpt_done_td (arg=0xffffffff81586a80 <cam_doneqs>) at /usr/src/sys/cam/cam_xpt.c:5517
#14 0xffffffff80af5a3e in fork_exit (callout=0xffffffff80353fd0 <xpt_done_td>, arg=0xffffffff81586a80 <cam_doneqs>, frame=0xfffffe0037796c00)
    at /usr/src/sys/kern/kern_fork.c:1052
#15 <signal handler called>
(kgdb) f 9
#9  xpt_run_devq (devq=0xfffff806f74a1180) at /usr/src/sys/cam/cam_xpt.c:3448
3448	/usr/src/sys/cam/cam_xpt.c: No such file or directory.
(kgdb) p work_ccb
$1 = (union ccb *) 0xffffffff
(kgdb) p offsetof(union ccb, ccb_h.flags)
$2 = (u_int32_t *) 0x70
(kgdb) l
0xffffffff8034f5e0 is in xpt_run_devq (/usr/src/sys/cam/cam_xpt.c:3448).
3443			if (work_ccb == NULL) {
3444				printf("device on run queue with no ccbs???\n");
3445				continue;
3446			}
3447	
3448			if ((work_ccb->ccb_h.flags & CAM_HIGH_POWER) != 0) {
3449	
3450				mtx_lock(&xsoftc.xpt_highpower_lock);
3451			 	if (xsoftc.num_highpower <= 0) {
3452					/*