Created attachment 201796 [details]
Core from latest kernel panic
On 11.1, this system was completely stable.
I upgraded to 11.2 and started getting CAM timeouts / retries, which I started a thread on at https://lists.freebsd.org/pipermail/freebsd-stable/2019-February/090520.html
Note that the card firmware is 19.00.00.00; running 20.00.07.00 (latest available) instead of CAM problems with individual drives I get controller resets, which are *far* worse as the impact is not local. In no case, however, has data been corrupted -- ZFS is happy with the data and shows no pack errors of any sort, nor do the disks themselves using smartctl. The retries are successful.
The configuration is a LSI 8-port HBA with a Lenovo 24-port expander attached to one of the LSI connectors; the other has the boot drives on it, as the system and card firmware cannot boot from the expander. This configuration has been stable for the last several years and up to 11.1-STABLE was flawless. The drives themselves, backplanes to which they attach, power supply, HBA, SAS expander and cables have all been swapped out with spares here without any change in behavior. The motherboard itself is a XEON with ECC and no RAM errors are being logged. (It's thus reasonable to assume this isn't a hardware problem....)
The stall and retry itself looks an awful lot like a queued command is being missed or an interrupt lost, both under very heavy load. This typically occurs only when the drives in question are slammed at 100% utilization or nearly so for an extended period of time (e.g. during a scrub or resilver.) I have seen it on both HGST and Seagate drives of differing capacities, model and firmware revision numbers; it does not appear to be related to the disk model or firmware itself.
In an attempt to see if this was related to something in 11.2 I rolled the machine forward to 12.0-STABLE. On 12.0-STABLE, r343809, this same condition rather than producing console logs and a successful retry instead results in a kernel panic in the driver. The disk I/O in process at the time is a ZFS scrub and the drive in question is pure data -- it has no executables on it, and in fact the pool has no mounted filesystems at the time of the panic (it's a backup pool that is imported to serve as a destination for zfs sends used as a means of backup.)
I have ordered a pair of HBA 16i cards in order to get the expander out of the case in the hope that will stop the detach events, although I am completely lost in terms of why 11.2 and 12.0 will not run with that configuration where it was entirely stable over the last several releases up through 11.1 with uptimes measured in months; until 11.2 I had never seen even a single panic out of the disk subsystem on this configuration.
Note that if you have all disks attached to the mps driver you can't take a kernel core dump when it happens; any attempt to do so results in a double-panic out of the driver. I have temporarily attached a drive to the onboard SATA ports and set it as dumpdev so as to be able to get the core file.
The panic itself bodes poorly for the impact of potential disk problems (real ones) where a drive goes offline when attached to the mps driver in 12.0, thus this bug report in an attempt to figure out this regression.
Thanks for the detailed bug report. I don't have any immediate ideas, but I'm actively looked into it.
I think this is the same issue:
when this problem happens, what's the latency on the drives looking like? You could set the default timeout from 30s to something stupidly large (like 120s) and see if the timeout problems go away. Alternatively, would it be possible to compile in the IOSCHED so we can get latency histograms to see if we're running up against insane latency at high load or some other sets of issues. IOSCHED has the added advantage that you can configure it on a per-drive basis to rate limit IOPS or bandwidth to see if there's some threshold that triggers this behavior or not. I tend to agree it does seem like some kind of missed message scenario, but want to preclude it. gstat with a very fast polling might catch this as well if you can capture a trace through the event.
The reason I'm asking is that the number of read recovery attempts has increased. Sometimes these can starve I/O behind the I/O that's being recovered when the data is bad enough to need to take the 'slow path' in the drives. The counter isn't nuanced enough for me to know if the bump in the counter indicates some super-long recovery, or if it is of the more average, mundane kind that happens from time to time, but doesn't spike the latency.
Just something to eliminate as a possibility before we get too far off into the weeds. If it is excessive drive latency, then we need to make the recovery path more robust. If we're missing interrupts / messages that we need to figure that out. If it's something else, that needs to be diagnosed.
I am working on getting compile of IOSCHED but I would like to point out that recently we installed the seagate ironwolf drives and we have backed off these drives (replacing with sas drives) and with the drive replacements our kernel panics have gone down as a function of the number of seagate ironwolf drives we are using.
Additionally, our setup is fairly vanilla in that we have a hba connected to sas expander jbods (Supermicro). Once we installed two zpool vdevs of 8 10TB sata ironwolf drives we experienced very frequent kernel panics (every two hours, sigh). Currently, we are down to 1 drive per vdev and our panics are down to less then once every 5 days (still way to high).
I hope to have more info via the iosched data.