263704 – Panic because device went away with XPT_ATA_IO pending

Status:	In Progress

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	Warner Losh

URL:
Keywords:

Depends on:
Blocks:

Reported:	2022-05-01 17:07 UTC by Warner Losh
Modified:	2022-05-04 03:23 UTC
CC List:	0 users

See Also:

Description Warner Losh freebsd_committer

2022-05-01 17:07:15 UTC

We can drop all references to the device (causing us to go through camperiphfree and destroy the path) while an I/O is still pending in the ata_da state machine (usually in state ADA_STATE_RAHEAD, with an ATA_SETFEATURES ATA_SF_ENAB_RCACHE command outstanding). It's not clear why the reference that we take out to do the reprobe isn't effective at blocking this. By retrying in this condition, though, we avoid this bug (at least more often). I don't have a good reproduction test case; I just see this panic a few times a month at work on systems that have transient disk errors on AHCI-connected SATA SSDs.

Comment 1 Warner Losh freebsd_committer

2022-05-04 03:23:44 UTC

On further examination, this appears to be a bug in the dynamic I/O
scheduler. We can call xpt_schedule(CAM_PRI_NORMAL) during recovery,
which causes all existing periphs that use it to schedule their recovery
operation a second time, but at a bad priority. The panic has nothing to
do with dropping references to the device. Instead, extra I/O gets
scheduled that can persist after the periph is invalidated (because the
periph driver knows nothing of the extra CCBs). That leads to accessing
the path after it's been freed, which in turn leads to a number of
different pathologies depending on where in the CCB lifecycle we wake up.
It only affects the dynamic scheduler, though it affects it all the time.
That points out another bug: the timeout ticker shouldn't run all the
time if there's nothing to be controlled, like bandwidth or IOPS.
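
To make the failure mode concrete, here is a rough userspace sketch of the accounting mismatch as I understand it. It is not CAM code: every name in it (toy_periph, toy_sched_extra, and so on) is made up for illustration, and the teardown rule (free the periph once the CCBs it knows about have drained after invalidation) is a simplifying assumption, not a description of what camperiphfree actually does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names are real CAM interfaces. */
struct toy_periph {
	bool valid;	/* cleared on invalidation */
	bool freed;	/* set once the periph (and its path) is torn down */
	int tracked;	/* CCBs the periph driver knows are outstanding */
	int in_flight;	/* CCBs actually outstanding, tracked or not */
};

/* The periph driver issues a CCB itself and accounts for it. */
static void
toy_periph_start(struct toy_periph *p)
{
	p->tracked++;
	p->in_flight++;
}

/* A scheduler-side reschedule issues a CCB the driver never sees. */
static void
toy_sched_extra(struct toy_periph *p)
{
	p->in_flight++;		/* p->tracked is deliberately not updated */
}

/* Invalidation: tear down as soon as the driver's own count is zero. */
static void
toy_periph_invalidate(struct toy_periph *p)
{
	p->valid = false;
	if (p->tracked == 0)
		p->freed = true;
}

/* CCB completion. Returns true if it would touch a freed periph. */
static bool
toy_ccb_done(struct toy_periph *p, bool was_tracked)
{
	bool use_after_free = p->freed;

	p->in_flight--;
	if (was_tracked) {
		p->tracked--;
		if (!p->valid && p->tracked == 0)
			p->freed = true;
	}
	return (use_after_free);
}

/*
 * Run one invalidation sequence, optionally with an extra untracked CCB.
 * Returns true if any completion landed after the periph was freed.
 */
bool
toy_scenario(bool extra)
{
	struct toy_periph p = { .valid = true };
	bool bad;

	toy_periph_start(&p);
	if (extra)
		toy_sched_extra(&p);
	toy_periph_invalidate(&p);	/* tracked == 1, so not freed yet */
	bad = toy_ccb_done(&p, true);	/* last tracked CCB: periph freed */
	if (extra)
		bad = toy_ccb_done(&p, false) || bad;
	assert(p.in_flight == 0);
	return (bad);
}
```

In this model toy_scenario(false) completes cleanly, while toy_scenario(true) reports a completion against a freed periph. The driver's count reaches zero while a CCB it never issued is still in flight, which is the shape of the problem described above.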