| Summary: | 12.0-STABLE panics on mps drive problem (regression from 11.2 and double-regression from 11.1) |
|---|---|
| Product: | Base System |
| Reporter: | karl |
| Component: | kern |
| Assignee: | Bugmeister <bugmeister> |
| Status: | Closed Feedback Timeout |
| Severity: | Affects Some People |
| CC: | cem, freebsd, harrison, imp, scottl |
| Priority: | --- |
| Keywords: | crash, regression |
| Version: | 12.0-STABLE |
| Hardware: | amd64 |
| OS: | Any |
Description
karl, 2019-02-06 18:27:43 UTC
Thanks for the detailed bug report. I don't have any immediate ideas, but I'm actively looking into it. I think this is the same issue: 237937. When this problem happens, what does the latency on the drives look like? You could raise the default timeout from 30s to something stupidly large (like 120s) and see if the timeout problems go away. Alternatively, would it be possible to compile in the IOSCHED so we can get latency histograms, to see whether we're running up against insane latency at high load or some other set of issues? IOSCHED has the added advantage that you can configure it on a per-drive basis to rate-limit IOPS or bandwidth, to see whether there's some threshold that triggers this behavior or not. I tend to agree it does seem like some kind of missed-message scenario, but I want to rule it out. gstat with very fast polling might catch this as well, if you can capture a trace through the event.

The reason I'm asking is that the number of read recovery attempts has increased. Sometimes these can starve the I/O queued behind the I/O being recovered, when the data is bad enough to need the 'slow path' in the drives. The counter isn't nuanced enough for me to know whether the bump indicates some super-long recovery, or the more average, mundane kind that happens from time to time without spiking latency. It's just something to eliminate as a possibility before we get too far off into the weeds. If it is excessive drive latency, then we need to make the recovery path more robust. If we're missing interrupts / messages, we need to figure that out. If it's something else, that needs to be diagnosed.

I am working on getting a compile of IOSCHED, but I would like to point out that we recently installed the Seagate IronWolf drives and have been backing them off (replacing them with SAS drives); with the drive replacements, our kernel panics have gone down as a function of the number of Seagate IronWolf drives we are still using.
Additionally, our setup is fairly vanilla: an HBA connected to SAS expander JBODs (Supermicro). Once we installed two zpool vdevs of eight 10 TB SATA IronWolf drives each, we experienced very frequent kernel panics (every two hours, sigh). Currently, we are down to one such drive per vdev, and our panics are down to less than once every 5 days (still way too high). I hope to have more info via the IOSCHED data.

To submitters: is this aging PR still relevant?

Triage: submitter timeout (> 2 months).
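The timeout, IOSCHED, and gstat suggestions in the thread can be sketched as follows. This is a hedged sketch, not a verified recipe: the sysctl name `kern.cam.da.default_timeout`, the kernel option `CAM_IOSCHED_DYNAMIC`, and the `kern.cam.da.<unit>.iosched` sysctl subtree are my recollection of stock FreeBSD 12 knobs and should be checked against the installed release (e.g. `sysctl -a | grep cam`) before use.

```shell
# Raise the CAM da(4) default command timeout from 30s to 120s
# (assumed sysctl name; verify it exists on your release first):
sysctl kern.cam.da.default_timeout=120

# Build a kernel with the dynamic I/O scheduler to get per-drive
# latency histograms and rate-limiting knobs. Assumed kernel config:
#   include GENERIC
#   ident   IOSCHED
#   options CAM_IOSCHED_DYNAMIC

# After booting that kernel, dump the per-drive scheduler tunables
# (assumed location) for drive da0:
sysctl kern.cam.da.0.iosched

# Poll gstat very fast (every 100 ms), batch mode, da devices only,
# logged so a trace through a timeout/panic event can be captured:
gstat -b -I 100ms -f '^da[0-9]+$' > /var/tmp/gstat.log
```

These are one-shot diagnostics; to make the timeout change persistent it would go in /etc/sysctl.conf, and the IOSCHED kernel would be built and installed with the usual buildkernel/installkernel cycle.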