Bug 235559 - 12.0-STABLE panics on mps drive problem (regression from 11.2 and double-regression from 11.1)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Scott Long
URL:
Keywords: panic, regression
Depends on:
Blocks:
 
Reported: 2019-02-06 18:27 UTC by karl
Modified: 2019-06-26 13:52 UTC

Attachments
Core from latest kernel panic (245.02 KB, text/plain)
2019-02-06 18:27 UTC, karl

Description karl 2019-02-06 18:27:43 UTC
Created attachment 201796
Core from latest kernel panic

On 11.1, this system was completely stable.

I upgraded to 11.2 and started getting CAM timeouts / retries, about which I started a thread at https://lists.freebsd.org/pipermail/freebsd-stable/2019-February/090520.html

Note that the card firmware is 19.00.00.00.  With 20.00.07.00 (the latest available), instead of CAM problems with individual drives I get controller resets, which are *far* worse as the impact is not local.  In no case, however, has data been corrupted -- ZFS is happy with the data and shows no pack errors of any sort, and the disks themselves report none via smartctl.  The retries are successful.

The configuration is an LSI 8-port HBA with a Lenovo 24-port expander attached to one of the LSI connectors; the other has the boot drives on it, as the system and card firmware cannot boot from the expander.  This configuration has been stable for the last several years and was flawless up to and including 11.1-STABLE.  The drives themselves, the backplanes to which they attach, the power supply, HBA, SAS expander and cables have all been swapped out with spares here without any change in behavior.  The motherboard itself is a Xeon board with ECC, and no RAM errors are being logged.  (It's thus reasonable to assume this isn't a hardware problem....)

The stall and retry itself looks an awful lot like a queued command being missed or an interrupt being lost under very heavy load.  It typically occurs only when the drives in question are slammed at 100% utilization or nearly so for an extended period of time (e.g. during a scrub or resilver).  I have seen it on both HGST and Seagate drives of differing capacities, models and firmware revisions; it does not appear to be related to the disk model or firmware itself.

In an attempt to see if this was related to something in 11.2, I rolled the machine forward to 12.0-STABLE.  On 12.0-STABLE, r343809, this same condition results in a kernel panic in the driver rather than console logs and a successful retry.  The disk I/O in progress at the time is a ZFS scrub and the drive in question is pure data -- it has no executables on it, and in fact the pool has no mounted filesystems at the time of the panic (it's a backup pool that is imported to serve as a destination for zfs sends used as a means of backup).

I have ordered a pair of HBA 16i cards in order to get the expander out of the case, in the hope that will stop the detach events.  I am, however, completely lost as to why 11.2 and 12.0 will not run with this configuration when it was entirely stable over the last several releases up through 11.1, with uptimes measured in months; until 11.2 I had never seen even a single panic out of the disk subsystem on this configuration.

Note that if you have all disks attached to the mps driver you can't take a kernel core dump when it happens; any attempt to do so results in a double-panic out of the driver.  I have temporarily attached a drive to the onboard SATA ports and set it as dumpdev so as to be able to get the core file.
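
For reference, the workaround above can be sketched as an /etc/rc.conf fragment.  This is only a minimal sketch, assuming the temporary SATA disk appears as ada0 with a suitably sized partition ada0p2; that device name is hypothetical and not taken from this system.

```shell
# /etc/rc.conf fragment (sketch; /dev/ada0p2 is a hypothetical device name)
# Write kernel dumps to a partition on the onboard SATA controller,
# so the dump path does not go through the mps driver.
dumpdev="/dev/ada0p2"
# Directory where savecore(8) stores the extracted dump on the next boot.
dumpdir="/var/crash"
```

The same device can also be selected on a running system with dumpon(8), e.g. "dumpon /dev/ada0p2", without waiting for a reboot.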

The panic itself bodes poorly for what happens with real disk problems, where a drive attached to the mps driver goes offline under 12.0; hence this bug report, in an attempt to figure out the regression.
Comment 1 Scott Long freebsd_committer 2019-02-07 06:31:30 UTC
Thanks for the detailed bug report.  I don't have any immediate ideas, but I'm actively looking into it.
Comment 2 harrison 2019-06-26 11:11:12 UTC
I think this is the same issue: bug 237937