Just had a server crash on me: Hardware: HP ProLiant DL380G9 with HP H241 HBA:s (ciss) with a couple of external D6020 JBOD SAS boxes filled with 10TB drives. The server crashes at night when backups (writing to it) are running. Stays up fairly well during daytime. ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request ciss0: WARNING: completing non-busy request panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue 0xfffff8010745e838 of size 5 cpuid = 3 time = 1610229478 KDB: stack backtrace: #0 0xffffffff80c1e9a7 at kdb_backtrace+0x67 #1 0xffffffff80bd171d at vpanic+0x19d #2 0xffffffff80bd1573 at panic+0x43 #3 0xffffffff80370f76 at camq_remove+0xf6 #4 0xffffffff803742e8 at xpt_run_devq+0x168 #5 0xffffffff803738f6 at xpt_action_default+0x836 #6 0xffffffff803a50b9 at dastart+0x2f9 #7 0xffffffff80375cf2 at xpt_run_allocq+0x172 #8 0xffffffff803a6d68 at dastrategy+0x88 #9 0xffffffff80b176e6 at g_disk_start+0x336 #10 0xffffffff80b1ab0c at g_io_request+0x27c #11 0xffffffff80b1ab0c at g_io_request+0x27c #12 0xffffffff82567b37 at zio_vdev_io_start+0x2a7 #13 0xffffffff82563f1c at zio_execute+0xac #14 0xffffffff8256384a at zio_nowait+0xca #15 0xffffffff8253e678 at vdev_queue_io_done+0x148 #16 0xffffffff82567e01 at zio_vdev_io_done+0x151 #17 0xffffffff82563f1c at zio_execute+0xac I'm guessing this is due to some drive having problems. Currently unclear which though. Trying to upgrade to 12.2 and we'll see if it behaves better...
Hmm "panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue" -2 might be "CAM_ACTIVE_INDEX" which is set into send_ccb->ccb_h.pinfo.index in cam_ccbq_send_ccb() (sys/cam/cam_queue.h) which is called from xpt_run_devq() around line 3492 in cam/cam_xpt.c in the xpt_run_devq() loop that calls camq_remove and panics... cam_ccbq_remove_ccb(&device->ccbq, work_ccb); cam_ccbq_send_ccb(&device->ccbq, work_ccb);
With FreeBSD 12.2-STABLE I get a gazillion "ciss0: WARNING: completing non-busy request" but no panic (yet). I managed to get some SCSI errors too which probably is related: (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 10 a6 1a 70 00 00 01 00 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure) (da11:ciss0:32:27:0): Info: 0x310a61b48 (da11:ciss0:32:27:0): Field Replaceable Unit: 67 (da11:ciss0:32:27:0): Descriptor 0x80: f7 43 (da11:ciss0:32:27:0): Descriptor 0x81: 03 cd 21 01 00 d0 (da11:ciss0:32:27:0): Error 5, Unretryable error (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 d2 4d 60 00 00 00 e8 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure) (da11:ciss0:32:27:0): Info: 0x311d24dc0 (da11:ciss0:32:27:0): Field Replaceable Unit: 67 (da11:ciss0:32:27:0): Descriptor 0x80: f7 43 (da11:ciss0:32:27:0): Descriptor 0x81: 03 cf 0f 01 00 21 (da11:ciss0:32:27:0): Error 5, Unretryable error (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 df 48 78 00 00 01 00 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors) (da11:ciss0:32:27:0): Info: 0x311df4898 (da11:ciss0:32:27:0): Field Replaceable Unit: 66 (da11:ciss0:32:27:0): Descriptor 0x80: f7 42 (da11:ciss0:32:27:0): Descriptor 0x81: 03 ce 0c 01 00 60 (da11:ciss0:32:27:0): Error 5, Unretryable error (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 e4 6e f8 00 00 01 00 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors) (da11:ciss0:32:27:0): Info: 0x311e46f20 (da11:ciss0:32:27:0): Field Replaceable Unit: 66 (da11:ciss0:32:27:0): Descriptor 0x80: f7 42 (da11:ciss0:32:27:0): Descriptor 0x81: 03 cd a6 01 00 15 (da11:ciss0:32:27:0): Error 5, Unretryable error (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 07 b2 28 00 00 01 00 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors) (da11:ciss0:32:27:0): Info: 0x31307b2d8 (da11:ciss0:32:27:0): Field Replaceable Unit: 66 (da11:ciss0:32:27:0): Descriptor 0x80: f7 42 (da11:ciss0:32:27:0): Descriptor 0x81: 03 d0 09 01 01 26 (da11:ciss0:32:27:0): Error 5, Unretryable error (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 09 ec 68 00 00 01 00 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors) (da11:ciss0:32:27:0): Info: 0x31309ec98 (da11:ciss0:32:27:0): Field Replaceable Unit: 66 (da11:ciss0:32:27:0): Descriptor 0x80: f7 42 (da11:ciss0:32:27:0): Descriptor 0x81: 03 cf dd 01 01 5d (da11:ciss0:32:27:0): Error 5, Unretryable error (da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 0e dc 80 00 00 01 00 00 00 (da11:ciss0:32:27:0): CAM status: SCSI Status Error (da11:ciss0:32:27:0): SCSI status: Check Condition (da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors) (da11:ciss0:32:27:0): Info: 0x3130edcd0 (da11:ciss0:32:27:0): Field Replaceable Unit: 66 (da11:ciss0:32:27:0): Descriptor 0x80: f7 42 (da11:ciss0:32:27:0): Descriptor 0x81: 03 cf 7b 01 00 72 (da11:ciss0:32:27:0): Error 5, Unretryable error I've now zpool offline'd that disk from the zpool in question so hopefully this night things will behave better. Still think that panic shouldn't have happened. And perhaps one should have some kind of throttle on the kernel printed WARNINGS - then machine because more or less unusable while it was printing them (was watching them via ipmi/serial).
I can trigger this panic pretty reliably on 12.2 by doing a firmware update of an mps card. Updating a Broadcom 9207-8e card from firmware 19.00.00.00 to 20.00.07.00 succeeds, but when resetting the card it hits this panic. I haven't yet tried to do the update on 13.0, nor have I tried updating from 20.00.07.00 to the same version, 20.00.07.00. The stack trace is a little bit different: panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue 0xfffff80104dc0038 of size 1 cpuid = 14 time = 1617145521 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01b8b08860 vpanic() at vpanic+0x17b/frame 0xfffffe01b8b088b0 panic() at panic+0x43/frame 0xfffffe01b8b08910 camq_remove() at camq_remove+0xf6/frame 0xfffffe01b8b08930 xpt_run_devq() at xpt_run_devq+0x158/frame 0xfffffe01b8b08990 xpt_action_default() at xpt_action_default+0x7fd/frame 0xfffffe01b8b089e0 cam_periph_runccb() at cam_periph_runccb+0xe2/frame 0xfffffe01b8b08b30 enc_daemon() at enc_daemon+0x2d6/frame 0xfffffe01b8b08bb0 fork_exit() at fork_exit+0x7e/frame 0xfffffe01b8b08bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01b8b08bf0 and the last entries in dmesg are the anodyne mps2: Reinitializing controller mps2: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd mps2: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc> mps2: Calling Reinit from mps_wait_command, timeout=60, elapsed=60 mps2: Reinitializing controller mps2: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd mps2: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
We have panic with similar trace on 13.0. It seems it can happen when some CAM device disappears. Unread portion of the kernel message buffer: (cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 (cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error (cd0:umass-sim0:0:0:0): Retrying command, 3 more tries remain (cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 (cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error (cd0:umass-sim0:0:0:0): Retrying command, 2 more tries remain (cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 (cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error (cd0:umass-sim0:0:0:0): Retrying command, 1 more tries remain (cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 (cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error (cd0:umass-sim0:0:0:0): Retrying command, 0 more tries remain (cd0:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 (cd0:umass-sim0:0:0:0): CAM status: CCB request completed with an error (cd0:umass-sim0:0:0:0): Error 5, Retries exhausted (cd0:umass-sim0:0:0:0): cddone: got error 0x5 back cd0 at umass-sim0 bus 0 scbus3 target 0 lun 0 cd0: <MP EMS Virtual Media 0326> detached da0 at umass-sim0 bus 0 scbus3 target 0 lun 1 da0: <MP EMS Virtual Media 0326> detached da1 at umass-sim0 bus 0 scbus3 target 0 lun 2 da1: <MP EMS Virtual Media 0326> detached (cd0:umass-sim0:0:0:0): Periph destroyed Fatal trap 12: page fault while in kernel mode cpuid = 2; apic id = 04 fault virtual address = 0x100000070 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff8034e3f0 stack pointer = 0x28:0xfffffe0037796990 (da1:umass-sim0:0:0:2): Periph destroyed frame pointer = 0x28:0xfffffe00377969e0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 7 (doneq0) trap number = 12 panic: page fault cpuid = 2 time = 1645903760 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0037796650 vpanic() at vpanic+0x182/frame 0xfffffe00377966a0 panic() at panic+0x43/frame 0xfffffe0037796700 trap_fatal() at trap_fatal+0x387/frame 0xfffffe0037796760 trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00377967b0 trap() at trap+0x271/frame 0xfffffe00377968c0 calltrap() at calltrap+0x8/frame 0xfffffe00377968c0 --- trap 0xc, rip = 0xffffffff8034e3f0, rsp = 0xfffffe0037796990, rbp = 0xfffffe00377969e0 --- xpt_run_devq() at xpt_run_devq+0x90/frame 0xfffffe00377969e0 xpt_release_simq() at xpt_release_simq+0x6d/frame 0xfffffe0037796a10 xpt_async_process() at xpt_async_process+0x37a/frame 0xfffffe0037796b20 xpt_done_process() at xpt_done_process+0x3da/frame 0xfffffe0037796b60 xpt_done_td() at xpt_done_td+0xf5/frame 0xfffffe0037796bb0 fork_exit() at fork_exit+0x7e/frame 0xfffffe0037796bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0037796bf0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- Uptime: 81d8h53m59s Dumping 9884 out of 32621 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory. (kgdb) macro define offsetof(t, f) &((t *) 0)->f (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 #1 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:394 #2 0xffffffff80b377eb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:481 #3 0xffffffff80b37c6a in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:913 #4 0xffffffff80b37a83 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:839 #5 0xffffffff80efab67 in trap_fatal (frame=0xfffffe00377968d0, eva=4294967408) at /usr/src/sys/amd64/amd64/trap.c:915 #6 0xffffffff80efabbf in trap_pfault (frame=0xfffffe00377968d0, usermode=<optimized out>, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:732 #7 0xffffffff80efa221 in trap (frame=0xfffffe00377968d0) at /usr/src/sys/amd64/amd64/trap.c:398 #8 <signal handler called> #9 xpt_run_devq (devq=0xfffff806f74a1180) at /usr/src/sys/cam/cam_xpt.c:3448 #10 0xffffffff80351a2d in xpt_release_simq (sim=0xfffff80724a6d700, run_queue=1) at /usr/src/sys/cam/cam_xpt.c:4617 #11 0xffffffff8035182a in xpt_async_process (periph=<optimized out>, ccb=0xfffff8045f5c9800) at /usr/src/sys/cam/cam_xpt.c:4363 #12 0xffffffff80351eaa in xpt_done_process (ccb_h=0xfffff8045f5c9800) at /usr/src/sys/cam/cam_xpt.c:5490 #13 0xffffffff803540c5 in xpt_done_td (arg=0xffffffff81586a80 <cam_doneqs>) at /usr/src/sys/cam/cam_xpt.c:5517 #14 0xffffffff80af5a3e in fork_exit (callout=0xffffffff80353fd0 <xpt_done_td>, arg=0xffffffff81586a80 <cam_doneqs>, frame=0xfffffe0037796c00) at /usr/src/sys/kern/kern_fork.c:1052 #15 <signal handler called> (kgdb) f 9 #9 xpt_run_devq (devq=0xfffff806f74a1180) at /usr/src/sys/cam/cam_xpt.c:3448 3448 /usr/src/sys/cam/cam_xpt.c: No such file or directory. (kgdb) p work_ccb $1 = (union ccb *) 0xffffffff (kgdb) p offsetof(union ccb, ccb_h.flags) $2 = (u_int32_t *) 0x70 (kgdb) l 0xffffffff8034f5e0 is in xpt_run_devq (/usr/src/sys/cam/cam_xpt.c:3448). 3443 if (work_ccb == NULL) { 3444 printf("device on run queue with no ccbs???\n"); 3445 continue; 3446 } 3447 3448 if ((work_ccb->ccb_h.flags & CAM_HIGH_POWER) != 0) { 3449 3450 mtx_lock(&xsoftc.xpt_highpower_lock); 3451 if (xsoftc.num_highpower <= 0) { 3452 /*