Bug 252575 - panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue
Summary: panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2021-01-11 14:28 UTC by Peter Eriksson
Modified: 2021-01-12 17:33 UTC
Description Peter Eriksson 2021-01-11 14:28:35 UTC
Just had a server crash on me.

Hardware: HP ProLiant DL380G9 with HP H241 HBAs (ciss) and a couple of external D6020 JBOD SAS enclosures filled with 10TB drives. The server crashes at night while backups (writing to it) are running, but stays up fairly well during the day. Console output:


ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
ciss0: WARNING: completing non-busy request
panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue 0xfffff8010745e838 of size 5
cpuid = 3
time = 1610229478
KDB: stack backtrace:
#0 0xffffffff80c1e9a7 at kdb_backtrace+0x67
#1 0xffffffff80bd171d at vpanic+0x19d
#2 0xffffffff80bd1573 at panic+0x43
#3 0xffffffff80370f76 at camq_remove+0xf6
#4 0xffffffff803742e8 at xpt_run_devq+0x168
#5 0xffffffff803738f6 at xpt_action_default+0x836
#6 0xffffffff803a50b9 at dastart+0x2f9
#7 0xffffffff80375cf2 at xpt_run_allocq+0x172
#8 0xffffffff803a6d68 at dastrategy+0x88
#9 0xffffffff80b176e6 at g_disk_start+0x336
#10 0xffffffff80b1ab0c at g_io_request+0x27c
#11 0xffffffff80b1ab0c at g_io_request+0x27c
#12 0xffffffff82567b37 at zio_vdev_io_start+0x2a7
#13 0xffffffff82563f1c at zio_execute+0xac
#14 0xffffffff8256384a at zio_nowait+0xca
#15 0xffffffff8253e678 at vdev_queue_io_done+0x148
#16 0xffffffff82567e01 at zio_vdev_io_done+0x151
#17 0xffffffff82563f1c at zio_execute+0xac

I'm guessing this is caused by a drive having problems, though it's currently unclear which one. I'm upgrading to 12.2 to see if it behaves better...
Comment 1 Peter Eriksson 2021-01-11 15:13:53 UTC
Hmm

"panic: camq_remove: Attempt to remove out-of-bounds index -2 from queue"

-2 might be CAM_ACTIVE_INDEX, which cam_ccbq_send_ccb() (sys/cam/cam_queue.h) stores in send_ccb->ccb_h.pinfo.index. cam_ccbq_send_ccb() is called from xpt_run_devq() (around line 3492 in sys/cam/cam_xpt.c), in the same loop that calls camq_remove() and panics:

                cam_ccbq_remove_ccb(&device->ccbq, work_ccb);                                                
                cam_ccbq_send_ccb(&device->ccbq, work_ccb);
Comment 2 Peter Eriksson 2021-01-12 17:33:08 UTC
With FreeBSD 12.2-STABLE I get a gazillion "ciss0: WARNING: completing non-busy request" but no panic (yet).

I also got some SCSI errors, which are probably related:

(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 10 a6 1a 70 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure)
(da11:ciss0:32:27:0): Info: 0x310a61b48
(da11:ciss0:32:27:0): Field Replaceable Unit: 67
(da11:ciss0:32:27:0): Descriptor 0x80: f7 43
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cd 21 01 00 d0
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 d2 4d 60 00 00 00 e8 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:44,0 (Internal target failure)
(da11:ciss0:32:27:0): Info: 0x311d24dc0
(da11:ciss0:32:27:0): Field Replaceable Unit: 67
(da11:ciss0:32:27:0): Descriptor 0x80: f7 43
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cf 0f 01 00 21
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 df 48 78 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x311df4898
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 ce 0c 01 00 60
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 11 e4 6e f8 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x311e46f20
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cd a6 01 00 15
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 07 b2 28 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x31307b2d8
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 d0 09 01 01 26
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 09 ec 68 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x31309ec98
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cf dd 01 01 5d
(da11:ciss0:32:27:0): Error 5, Unretryable error
(da11:ciss0:32:27:0): READ(16). CDB: 88 00 00 00 00 03 13 0e dc 80 00 00 01 00 00 00 
(da11:ciss0:32:27:0): CAM status: SCSI Status Error
(da11:ciss0:32:27:0): SCSI status: Check Condition
(da11:ciss0:32:27:0): SCSI sense: ABORTED COMMAND asc:11,3 (Multiple read errors)
(da11:ciss0:32:27:0): Info: 0x3130edcd0
(da11:ciss0:32:27:0): Field Replaceable Unit: 66
(da11:ciss0:32:27:0): Descriptor 0x80: f7 42
(da11:ciss0:32:27:0): Descriptor 0x81: 03 cf 7b 01 00 72
(da11:ciss0:32:27:0): Error 5, Unretryable error

I've now taken that disk offline (zpool offline) in the affected pool, so hopefully things will behave better tonight. I still think that panic shouldn't have happened, though.

Perhaps there should also be some kind of throttle on these kernel WARNING messages: the machine became more or less unusable while it was printing them (I was watching them via the IPMI/serial console).