Created attachment 248946 [details] detailed kgdb trace log The only active geom there is the single vdev zfs partition (without a replication). gmirror "efi" one is idle and such usually do not cause any problems, only active zfs matters FreeBSD 15.0-CURRENT #2 main-n268534-32fdcff8703d: Mon Feb 26 01:08:14 EET 2024 (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 88 e0 81 40 57 00 00 07 00 00 (ada3:ahcich3:0:0:0): CAM status: Command timeout (ada3:ahcich3:0:0:0): Retrying command, 2 more tries remain ahcich3: SATA connect time=100us status=00000133 ahcich3: AHCI reset: device found ahcich3: AHCI reset: device ready after 100ms ahcich3: Timeout on slot 26 port 0 ahcich3: is 00000000 cs 00000000 ss 3c000000 rs 3c000000 tfd 50 serr 00000000 cmd 0000dd17 ahcich3: AHCI reset... (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 88 e0 81 40 57 00 00 07 00 00 (ada3:ahcich3:0:0:0): CAM status: Command timeout (ada3:ahcich3:0:0:0): Retrying command, 1 more tries remain ahcich3: SATA connect time=100us status=00000133 ahcich3: AHCI reset: device found ahcich3: AHCI reset: device ready after 100ms ipfw: 24111 Nat TCP 10.0.11.17:39346 10.0.6.178:10050 in via em0 ahcich3: Timeout on slot 2 port 0 ahcich3: is 00000000 cs 00000000 ss 0000003c rs 0000003c tfd 50 serr 00000000 cmd 0000c517 ahcich3: AHCI reset... (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 88 e0 81 40 57 00 00 07 00 00 (ada3:ahcich3:0:0:0): CAM status: Command timeout (ada3:ahcich3:0:0:0): Retrying command, 0 more tries remain ahcich3: SATA connect time=100us status=00000133 ahcich3: AHCI reset: device found ahcich3: AHCI reset: device ready after 100ms (ada3:ahcich3:0:0:0): Periph destroyed GEOM_MIRROR: Device efi: provider ada3p99 disconnected. Fatal trap 12: page fault while in kernel mode cpuid = 1; apic id = 02 fault virtual address = 0x18 fault code = supervisor write data, page not present instruction pointer = 0x20:0xffffffff80325322 stack pointer = 0x28:0xfffffe0121bf5e80 frame pointer = 0x28:0xfffffe0121bf5ea0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 4 (doneq0) rdi: fffff80001e2b660 rsi: 0000000000000001 rdx: 00000000000013f6 rcx: 0000000000000000 r8: ffffffff80c87ff0 r9: 0000000000526415 rax: 0000000000000000 rbx: fffff80001e2b660 rbp: fffffe0121bf5ea0 r10: 0000000000001388 r11: 000000008059e074 r12: 0000000000000000 r13: fffffe0121bf5eb8 r14: 0000000000000000 r15: fffff80001f42740 trap number = 12 panic: page fault cpuid = 1 time = 1709613907 KDB: stack backtrace: db_trace_self_wrapper() at 0xffffffff8045142b = db_trace_self_wrapper+0x2b/frame 0xfffffe0121bf5b60 vpanic() at 0xffffffff8074c855 = vpanic+0x135/frame 0xfffffe0121bf5c90 panic() at 0xffffffff8074c713 = panic+0x43/frame 0xfffffe0121bf5cf0 trap_fatal() at 0xffffffff80bb43dc = trap_fatal+0x40c/frame 0xfffffe0121bf5d50 trap_pfault() at 0xffffffff80bb442f = trap_pfault+0x4f/frame 0xfffffe0121bf5db0 calltrap() at 0xffffffff80b8c238 = calltrap+0x8/frame 0xfffffe0121bf5db0 --- trap 0xc, rip = 0xffffffff80325322, rsp = 0xfffffe0121bf5e80, rbp = 0xfffffe0121bf5ea0 --- xpt_done_process() at 0xffffffff80325322 = xpt_done_process+0x222/frame 0xfffffe0121bf5ea0 xpt_done_td() at 0xffffffff80327485 = xpt_done_td+0xf5/frame 0xfffffe0121bf5ef0 fork_exit() at 0xffffffff80705acf = fork_exit+0x7f/frame 0xfffffe0121bf5f30 fork_trampoline() at 0xffffffff80b8d29e = fork_trampoline+0xe/frame 0xfffffe0121bf5f30 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic
This looks like an old bug that I killed 3 or 4 years ago. The root cause is that we're leaking a reference, so the periph is torn down with outstanding devices. if ((ccb_h->func_code & XPT_FC_USER_CCB) == 0) { struct cam_ed *dev = ccb_h->path->device; if (sim) devq = sim->devq; KASSERT(devq, ("Periph disappeared with CCB %p %s request pending.", ccb_h, xpt_action_name(ccb_h->func_code))); mtx_lock(&devq->send_mtx); devq is NULL here, according to the traceback... The hard part is figuring out how this came to be.
... outstanding requests.... We should destroy the periph after the last release. Last time it turned out to be a priority problem in probe that caused us schedule I/Os unconditionally which were the ones that were left in the queue after the periph was destroyed and they eventually were completed somehow...
commit 6c8ab086fed37a6b44fa84377e48c499f223ae80 Author: Warner Losh <imp@FreeBSD.org> Date: Sun May 1 10:39:04 2022 -0600 ada: Retry commands with retries left on CAM_SEL_TIMEOUT The AHCI and ATA SIMs will return CAM_SEL_TIMEOUT when an underlying device has stopped responding. This is usually seen after a timeouted out command and can be a transient event. Rather than fail the peripheral immediately after seeing this, queue a retry. For transient events, this allows drives to continue to provide data, though with some added latency, just like we do when we have some other kind of retriable error. If the error isn't transient (the drive is truly gone), then we'll discover that eventually and fail the transaction and invalidate the drive like we do today. This helps us avoid a panic at the end of camperiphfree when CAM_PERIPH_NEW_DEV_FOUND is set. However, the deferred callback should be queued to xpt_async_td instead of being made inline there. This issue will be solved in a different patch that does that. PR 263703. This also helps us avoid another bug where we can drop all references to the device (causing us to go through camperiphfree and destroy the path) while we have an I/O pending in the ata_da state machine (usually in state ADA_STATE_RAHEAD with ATA_SETFEATURES ATA_SF_ENAB_RCACHE command). It's not clear why the reference that we take out to do the reprobe isn't effective at blocking this. By retrying this condition, though we avoid this bug (at least more often, I don't have a good reproduction test case, I just see this panic a few times a month at work on systems that have transient disk errors on ahci connected SATA SSDs). PR 263704. It's too soon to know how much this helps us avoid this bug. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34977
How easy is it for you to recreate this? I have some experimental dtrace scripts that we might be able use to get visibility into what's going on... thought the panic makes it a little harder.
Sorry was busy fighting with the HDD problems, finally replaced it. And the hdd falling off the bus problem seems to be linked with the partially faulted power supply, which I have changed today either. Probably I can assemble the testing machine for this using the problematic HDDs to cause more problems. Not sure if it is possible to "remove" the drive inside of bhyve VM. Or need testing with "devctl disable ahcichX" instead of physically disconnecting the drives. There is also another related problem btw - "ada0" gets disconnected (power failure) and found again after the bus reset, but cant get back into the system - pass0 and /dev/ada0 completely disappear and visible only from the "camcontrol devlist" as ada0.
found that part of log when ada0 also failed (no panic though, but found no way to make it work again): ahcich0: DISCONNECT requested ahcich0: AHCI reset... ahcich0: SATA connect timeout time=10000us status=00000000 ahcich0: AHCI reset: device not found (aprobe0:ahcich0:0:0:0): Periph created (aprobe0:ahcich0:0:0:0): Probe started (aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_IDENTIFY (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 (aprobe0:ahcich0:0:0:0): CAM status: Selection Timeout (aprobe0:ahcich0:0:0:0): Error 6, Unretryable error (aprobe0:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_INVALID (aprobe0:ahcich0:0:0:0): Probe completed (aprobe0:ahcich0:0:0:0): Periph invalidated (aprobe0:ahcich0:0:0:0): Periph destroyed (pass0:ahcich0:0:0:0): Periph invalidated pass0 at ahcich0 bus 0 scbus0 target 0 lun 0 pass0: <WDC WD10EZRX-00L4HB0 01.01A01> s/n WD-WCC4J83F9JVX detached (ada0:ahcich0:0:0:0): Periph invalidated ada0 at ahcich0 bus 0 scbus0 target 0 lun 0 ada0: <WDC WD10EZRX-00L4HB0 01.01A01> s/n WD-WCC4J83F9JVX detached (noperiph:ahcich0:0:0:0): xpt_async(AC_LOST_DEVICE) (ada0:ahcich0:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00 (ada0:ahcich0:0:0:0): CAM status: Selection Timeout (ada0:ahcich0:0:0:0): Error 6, Unretryable error (aprobe0:ahcich0:0:0:0): Periph created GEOM_MIRROR: Device efi: provider ada0p99 disconnected. (aprobe0:ahcich0:0:0:0): Probe started (aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_RESET (ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00 (ada0:ahcich0:0:0:0): CAM status: Selection Timeout (ada0:ahcich0:0:0:0): Error 6, Unretryable error GEOM_MIRROR: Device bak: provider ada0p11 disconnected. (aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich0:0:0:0): CAM status: Selection Timeout (aprobe0:ahcich0:0:0:0): Error 6, Unretryable error (aprobe0:ahcich0:0:0:0): Probe PROBE_RESET to PROBE_INVALID (aprobe0:ahcich0:0:0:0): Probe completed (aprobe0:ahcich0:0:0:0): Periph invalidated (aprobe0:ahcich0:0:0:0): Periph destroyed GEOM_MIRROR: Device lenovo2: provider ada0p3 disconnected. (pass0:ahcich0:0:0:0): Periph destroyed No more /dev/ada0 exists ahcich0: CONNECT requested ahcich0: AHCI reset... ahcich0: SATA connect time=100us status=00000133 ahcich0: AHCI reset: device found (aprobe1:ahcich0:0:0:0): Periph created (aprobe1:ahcich0:0:0:0): Probe started (aprobe1:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_RESET ahcich0: AHCI reset: device ready after 4400ms (aprobe1:ahcich0:0:0:0): Probe PROBE_RESET to PROBE_RESET (aprobe1:ahcich0:0:0:0): SIGNATURE: 0000 (aprobe1:ahcich0:0:0:0): Probe PROBE_RESET to PROBE_IDENTIFY (aprobe1:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_SETMODE (aprobe1:ahcich0:0:0:0): Probe PROBE_SETMODE to PROBE_SETDMAAA (aprobe1:ahcich0:0:0:0): Probe PROBE_SETDMAAA to PROBE_SET_MULTI (aprobe1:ahcich0:0:0:0): Probe PROBE_SET_MULTI to PROBE_DONE (noperiph:ahcich0:0:0:0): xpt_async(AC_FOUND_DEVICE) (aprobe1:ahcich0:0:0:0): Probe completed (aprobe1:ahcich0:0:0:0): Periph invalidated (aprobe1:ahcich0:0:0:0): Periph destroyed (pass3:ahcich0:0:0:0): Periph created Solaris: WARNING: Pool 'ZBIG' has encountered an uncorrectable I/O failure and has been suspended. (pass3:ahcich0:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00 (pass3:ahcich0:0:0:0): CAM status: CCB request was invalid (pass3:ahcich0:0:0:0): Error 22, Unretryable error (pass3:ahcich0:0:0:0): READ DEFECT DATA(10). CDB: 37 00 15 00 00 00 00 00 04 00 (pass3:ahcich0:0:0:0): CAM status: CCB request was invalid (pass3:ahcich0:0:0:0): Error 22, Unretryable error (pass3:ahcich0:0:0:0): READ DEFECT DATA(10). CDB: 37 00 0d 00 00 00 00 00 04 00 (pass3:ahcich0:0:0:0): CAM status: CCB request was invalid (pass3:ahcich0:0:0:0): Error 22, Unretryable error I was trying to disconnect the sata cable and connect it back. But /dev/ada0 no more appears: ahcich0: AHCI reset... ahcich0: SATA connect time=100us status=00000133 ahcich0: AHCI reset: device found (aprobe0:ahcich0:0:0:0): Periph created (aprobe0:ahcich0:0:0:0): Probe started (aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_IDENTIFY ahcich0: AHCI reset: device ready after 100ms (noperiph:ahcich0:0:0:0): xpt_async(AC_GETDEV_CHANGED) (aprobe0:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_SETMODE (aprobe0:ahcich0:0:0:0): Probe PROBE_SETMODE to PROBE_SETDMAAA (aprobe0:ahcich0:0:0:0): Probe PROBE_SETDMAAA to PROBE_SET_MULTI (aprobe0:ahcich0:0:0:0): Probe PROBE_SET_MULTI to PROBE_DONE (aprobe0:ahcich0:0:0:0): Probe completed (aprobe0:ahcich0:0:0:0): Periph invalidated (aprobe0:ahcich0:0:0:0): Periph destroyed ahcich0: AHCI reset... ahcich0: SATA connect time=100us status=00000133 ahcich0: AHCI reset: device found (aprobe0:ahcich0:0:0:0): Periph created (aprobe0:ahcich0:0:0:0): Probe started (aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_IDENTIFY ahcich0: AHCI reset: device ready after 100ms (aprobe0:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_SETMODE (aprobe0:ahcich0:0:0:0): Probe PROBE_SETMODE to PROBE_SETDMAAA (aprobe0:ahcich0:0:0:0): Probe PROBE_SETDMAAA to PROBE_SET_MULTI (aprobe0:ahcich0:0:0:0): Probe PROBE_SET_MULTI to PROBE_DONE (aprobe0:ahcich0:0:0:0): Probe completed (aprobe0:ahcich0:0:0:0): Periph invalidated (aprobe0:ahcich0:0:0:0): Periph destroyed