Bug 277499 - panic in doneq0 xpt_done_td xpt_done_process after HDD falling off the bus (Periph destroyed)
Summary: panic in doneq0 xpt_done_td xpt_done_process after HDD falling off the bus (P...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 15.0-CURRENT
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-05 09:59 UTC by bsd
Modified: 2024-03-08 07:29 UTC (History)
1 user (show)

See Also:


Attachments
detailed kgdb trace log (10.52 KB, text/plain)
2024-03-05 09:59 UTC, bsd
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description bsd 2024-03-05 09:59:02 UTC
Created attachment 248946 [details]
detailed kgdb trace log

The only active geom there is the single vdev zfs partition (without a replication).
gmirror "efi" one is idle and such usually do not cause any problems, only active zfs matters

FreeBSD 15.0-CURRENT #2 main-n268534-32fdcff8703d: Mon Feb 26 01:08:14 EET 2024

(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 88 e0 81 40 57 00 00 07 00 00
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada3:ahcich3:0:0:0): Retrying command, 2 more tries remain
ahcich3: SATA connect time=100us status=00000133
ahcich3: AHCI reset: device found
ahcich3: AHCI reset: device ready after 100ms
ahcich3: Timeout on slot 26 port 0
ahcich3: is 00000000 cs 00000000 ss 3c000000 rs 3c000000 tfd 50 serr 00000000 cmd 0000dd17
ahcich3: AHCI reset...
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 88 e0 81 40 57 00 00 07 00 00
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada3:ahcich3:0:0:0): Retrying command, 1 more tries remain
ahcich3: SATA connect time=100us status=00000133
ahcich3: AHCI reset: device found
ahcich3: AHCI reset: device ready after 100ms
ipfw: 24111 Nat TCP 10.0.11.17:39346 10.0.6.178:10050 in via em0
ahcich3: Timeout on slot 2 port 0
ahcich3: is 00000000 cs 00000000 ss 0000003c rs 0000003c tfd 50 serr 00000000 cmd 0000c517
ahcich3: AHCI reset...
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 88 e0 81 40 57 00 00 07 00 00
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada3:ahcich3:0:0:0): Retrying command, 0 more tries remain
ahcich3: SATA connect time=100us status=00000133
ahcich3: AHCI reset: device found
ahcich3: AHCI reset: device ready after 100ms
(ada3:ahcich3:0:0:0): Periph destroyed
GEOM_MIRROR: Device efi: provider ada3p99 disconnected.


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x18
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80325322
stack pointer           = 0x28:0xfffffe0121bf5e80
frame pointer           = 0x28:0xfffffe0121bf5ea0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 4 (doneq0)
rdi: fffff80001e2b660 rsi: 0000000000000001 rdx: 00000000000013f6
rcx: 0000000000000000  r8: ffffffff80c87ff0  r9: 0000000000526415
rax: 0000000000000000 rbx: fffff80001e2b660 rbp: fffffe0121bf5ea0
r10: 0000000000001388 r11: 000000008059e074 r12: 0000000000000000
r13: fffffe0121bf5eb8 r14: 0000000000000000 r15: fffff80001f42740
trap number             = 12
panic: page fault
cpuid = 1
time = 1709613907
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff8045142b = db_trace_self_wrapper+0x2b/frame 0xfffffe0121bf5b60
vpanic() at 0xffffffff8074c855 = vpanic+0x135/frame 0xfffffe0121bf5c90
panic() at 0xffffffff8074c713 = panic+0x43/frame 0xfffffe0121bf5cf0
trap_fatal() at 0xffffffff80bb43dc = trap_fatal+0x40c/frame 0xfffffe0121bf5d50
trap_pfault() at 0xffffffff80bb442f = trap_pfault+0x4f/frame 0xfffffe0121bf5db0
calltrap() at 0xffffffff80b8c238 = calltrap+0x8/frame 0xfffffe0121bf5db0
--- trap 0xc, rip = 0xffffffff80325322, rsp = 0xfffffe0121bf5e80, rbp = 0xfffffe0121bf5ea0 ---
xpt_done_process() at 0xffffffff80325322 = xpt_done_process+0x222/frame 0xfffffe0121bf5ea0
xpt_done_td() at 0xffffffff80327485 = xpt_done_td+0xf5/frame 0xfffffe0121bf5ef0
fork_exit() at 0xffffffff80705acf = fork_exit+0x7f/frame 0xfffffe0121bf5f30
fork_trampoline() at 0xffffffff80b8d29e = fork_trampoline+0xe/frame 0xfffffe0121bf5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Comment 1 Warner Losh freebsd_committer freebsd_triage 2024-03-05 18:47:27 UTC
This looks like an old bug that I killed 3 or 4 years ago.

The root cause is that we're leaking a reference, so the periph is torn down with outstanding devices.

       if ((ccb_h->func_code & XPT_FC_USER_CCB) == 0) {
                struct cam_ed *dev = ccb_h->path->device;

                if (sim)
                        devq = sim->devq;
                KASSERT(devq, ("Periph disappeared with CCB %p %s request pending.",
                        ccb_h, xpt_action_name(ccb_h->func_code)));

                mtx_lock(&devq->send_mtx);

devq is NULL here, according to the traceback...

The hard part is figuring out how this came to be.
Comment 2 Warner Losh freebsd_committer freebsd_triage 2024-03-05 18:56:10 UTC
... outstanding requests....  We should destroy the periph after the last release.  Last time it turned out to be a priority problem in probe that caused us schedule I/Os unconditionally which were the ones that were left in the queue after the periph was destroyed and they eventually were completed somehow...
Comment 3 Warner Losh freebsd_committer freebsd_triage 2024-03-05 19:30:50 UTC
commit 6c8ab086fed37a6b44fa84377e48c499f223ae80
Author: Warner Losh <imp@FreeBSD.org>
Date:   Sun May 1 10:39:04 2022 -0600

    ada: Retry commands with retries left on CAM_SEL_TIMEOUT

    The AHCI and ATA SIMs will return CAM_SEL_TIMEOUT when an underlying
    device has stopped responding. This is usually seen after a timeouted
    out command and can be a transient event. Rather than fail the
    peripheral immediately after seeing this, queue a retry. For transient
    events, this allows drives to continue to provide data, though with some
    added latency, just like we do when we have some other kind of retriable
    error. If the error isn't transient (the drive is truly gone), then
    we'll discover that eventually and fail the transaction and invalidate
    the drive like we do today.

    This helps us avoid a panic at the end of camperiphfree when
    CAM_PERIPH_NEW_DEV_FOUND is set. However, the deferred callback should
    be queued to xpt_async_td instead of being made inline there. This issue
    will be solved in a different patch that does that. PR 263703.

    This also helps us avoid another bug where we can drop all references to
    the device (causing us to go through camperiphfree and destroy the path)
    while we have an I/O pending in the ata_da state machine (usually in
    state ADA_STATE_RAHEAD with ATA_SETFEATURES ATA_SF_ENAB_RCACHE
    command). It's not clear why the reference that we take out to do the
    reprobe isn't effective at blocking this. By retrying this condition,
    though we avoid this bug (at least more often, I don't have a good
    reproduction test case, I just see this panic a few times a month at
    work on systems that have transient disk errors on ahci connected SATA
    SSDs). PR 263704. It's too soon to know how much this helps us avoid
    this bug.

    Sponsored by:           Netflix
    Differential Revision:  https://reviews.freebsd.org/D34977
Comment 4 Warner Losh freebsd_committer freebsd_triage 2024-03-05 19:36:12 UTC
How easy is it for you to recreate this? I have some experimental dtrace scripts that we might be able use to get visibility into what's going on... thought the panic makes it a little harder.
Comment 5 bsd 2024-03-07 20:52:45 UTC
Sorry was busy fighting with the HDD problems, finally replaced it.
And the hdd falling off the bus problem seems to be linked with the partially faulted power supply, which I have changed today either.
Probably I can assemble the testing machine for this using the problematic HDDs to cause more problems.
Not sure if it is possible to "remove" the drive inside of bhyve VM. Or need testing with "devctl disable ahcichX" instead of physically disconnecting the drives.
There is also another related problem btw - "ada0" gets disconnected (power failure) and found again after the bus reset, but cant get back into the system - pass0 and /dev/ada0 completely disappear and visible only from the "camcontrol devlist" as ada0.
Comment 6 bsd 2024-03-08 07:29:06 UTC
found that part of log when ada0 also failed (no panic though, but found no way to make it work again):

ahcich0: DISCONNECT requested
ahcich0: AHCI reset...
ahcich0: SATA connect timeout time=10000us status=00000000
ahcich0: AHCI reset: device not found
(aprobe0:ahcich0:0:0:0): Periph created
(aprobe0:ahcich0:0:0:0): Probe started
(aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_IDENTIFY
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Selection Timeout
(aprobe0:ahcich0:0:0:0): Error 6, Unretryable error
(aprobe0:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_INVALID
(aprobe0:ahcich0:0:0:0): Probe completed
(aprobe0:ahcich0:0:0:0): Periph invalidated
(aprobe0:ahcich0:0:0:0): Periph destroyed
(pass0:ahcich0:0:0:0): Periph invalidated
pass0 at ahcich0 bus 0 scbus0 target 0 lun 0
pass0: <WDC WD10EZRX-00L4HB0 01.01A01> s/n WD-WCC4J83F9JVX detached
(ada0:ahcich0:0:0:0): Periph invalidated
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD10EZRX-00L4HB0 01.01A01> s/n WD-WCC4J83F9JVX detached
(noperiph:ahcich0:0:0:0): xpt_async(AC_LOST_DEVICE)
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Selection Timeout
(ada0:ahcich0:0:0:0): Error 6, Unretryable error
(aprobe0:ahcich0:0:0:0): Periph created
GEOM_MIRROR: Device efi: provider ada0p99 disconnected.
(aprobe0:ahcich0:0:0:0): Probe started
(aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_RESET
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Selection Timeout
(ada0:ahcich0:0:0:0): Error 6, Unretryable error
GEOM_MIRROR: Device bak: provider ada0p11 disconnected.
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Selection Timeout
(aprobe0:ahcich0:0:0:0): Error 6, Unretryable error
(aprobe0:ahcich0:0:0:0): Probe PROBE_RESET to PROBE_INVALID
(aprobe0:ahcich0:0:0:0): Probe completed
(aprobe0:ahcich0:0:0:0): Periph invalidated
(aprobe0:ahcich0:0:0:0): Periph destroyed
GEOM_MIRROR: Device lenovo2: provider ada0p3 disconnected.
(pass0:ahcich0:0:0:0): Periph destroyed

No more /dev/ada0 exists

ahcich0: CONNECT requested
ahcich0: AHCI reset...
ahcich0: SATA connect time=100us status=00000133
ahcich0: AHCI reset: device found
(aprobe1:ahcich0:0:0:0): Periph created
(aprobe1:ahcich0:0:0:0): Probe started
(aprobe1:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_RESET

ahcich0: AHCI reset: device ready after 4400ms
(aprobe1:ahcich0:0:0:0): Probe PROBE_RESET to PROBE_RESET
(aprobe1:ahcich0:0:0:0): SIGNATURE: 0000
(aprobe1:ahcich0:0:0:0): Probe PROBE_RESET to PROBE_IDENTIFY
(aprobe1:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_SETMODE
(aprobe1:ahcich0:0:0:0): Probe PROBE_SETMODE to PROBE_SETDMAAA
(aprobe1:ahcich0:0:0:0): Probe PROBE_SETDMAAA to PROBE_SET_MULTI
(aprobe1:ahcich0:0:0:0): Probe PROBE_SET_MULTI to PROBE_DONE
(noperiph:ahcich0:0:0:0): xpt_async(AC_FOUND_DEVICE)
(aprobe1:ahcich0:0:0:0): Probe completed
(aprobe1:ahcich0:0:0:0): Periph invalidated
(aprobe1:ahcich0:0:0:0): Periph destroyed
(pass3:ahcich0:0:0:0): Periph created

Solaris: WARNING: Pool 'ZBIG' has encountered an uncorrectable I/O failure and has been suspended.
(pass3:ahcich0:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(pass3:ahcich0:0:0:0): CAM status: CCB request was invalid
(pass3:ahcich0:0:0:0): Error 22, Unretryable error
(pass3:ahcich0:0:0:0): READ DEFECT DATA(10). CDB: 37 00 15 00 00 00 00 00 04 00
(pass3:ahcich0:0:0:0): CAM status: CCB request was invalid
(pass3:ahcich0:0:0:0): Error 22, Unretryable error
(pass3:ahcich0:0:0:0): READ DEFECT DATA(10). CDB: 37 00 0d 00 00 00 00 00 04 00
(pass3:ahcich0:0:0:0): CAM status: CCB request was invalid
(pass3:ahcich0:0:0:0): Error 22, Unretryable error

I was trying to disconnect the sata cable and connect it back.
But /dev/ada0 no more appears:

ahcich0: AHCI reset...
ahcich0: SATA connect time=100us status=00000133
ahcich0: AHCI reset: device found
(aprobe0:ahcich0:0:0:0): Periph created
(aprobe0:ahcich0:0:0:0): Probe started
(aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_IDENTIFY
ahcich0: AHCI reset: device ready after 100ms
(noperiph:ahcich0:0:0:0): xpt_async(AC_GETDEV_CHANGED)
(aprobe0:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_SETMODE
(aprobe0:ahcich0:0:0:0): Probe PROBE_SETMODE to PROBE_SETDMAAA
(aprobe0:ahcich0:0:0:0): Probe PROBE_SETDMAAA to PROBE_SET_MULTI
(aprobe0:ahcich0:0:0:0): Probe PROBE_SET_MULTI to PROBE_DONE
(aprobe0:ahcich0:0:0:0): Probe completed
(aprobe0:ahcich0:0:0:0): Periph invalidated
(aprobe0:ahcich0:0:0:0): Periph destroyed
ahcich0: AHCI reset...
ahcich0: SATA connect time=100us status=00000133
ahcich0: AHCI reset: device found
(aprobe0:ahcich0:0:0:0): Periph created
(aprobe0:ahcich0:0:0:0): Probe started
(aprobe0:ahcich0:0:0:0): Probe PROBE_INVALID to PROBE_IDENTIFY
ahcich0: AHCI reset: device ready after 100ms
(aprobe0:ahcich0:0:0:0): Probe PROBE_IDENTIFY to PROBE_SETMODE
(aprobe0:ahcich0:0:0:0): Probe PROBE_SETMODE to PROBE_SETDMAAA
(aprobe0:ahcich0:0:0:0): Probe PROBE_SETDMAAA to PROBE_SET_MULTI
(aprobe0:ahcich0:0:0:0): Probe PROBE_SET_MULTI to PROBE_DONE
(aprobe0:ahcich0:0:0:0): Probe completed
(aprobe0:ahcich0:0:0:0): Periph invalidated
(aprobe0:ahcich0:0:0:0): Periph destroyed