Bug 237701

Summary: sysutils/smartmontools: Causing controller resets
Product: Ports & Packages
Component: Individual Port(s)
Status: Closed Works As Intended
Severity: Affects Only Me
Priority: ---
Version: Latest
Hardware: amd64
OS: Any
Reporter: Danny McGrath <danmcgrath.ca>
Assignee: freebsd-ports-bugs (Nobody) <ports-bugs>
CC: fernape, samm
Flags: bugzilla: maintainer-feedback? (samm)

Description Danny McGrath 2019-05-02 03:38:55 UTC
After updating to 2019Q2 (v7.0), I noticed that a pair of Dell PowerEdge 410s that have smartd running have now started adding some smartctl status to the daily logs. Unfortunately, it also seems that they have been generating a bunch of dmesg errors recently:

--
+[6527952] mpt0: request 0xfffffe00012a2610:49251 timed out for ccb 0xfffff8074e2ea000 (req->ccb 0xfffff8074e2ea000)
+[6527952] mpt0: attempting to abort req 0xfffffe00012a2610:49251 function 0
+[6527952] mpt0: request 0xfffffe00012a5888:49256 timed out for ccb 0xfffff800133f2800 (req->ccb 0xfffff800133f2800)
+[6527952] mpt0: request 0xfffffe00012b0d08:49257 timed out for ccb 0xfffff80013452000 (req->ccb 0xfffff80013452000)
+[6527952] mpt0: request 0xfffffe00012aeb30:49262 timed out for ccb 0xfffff802a97d2000 (req->ccb 0xfffff802a97d2000)
+[6527953] mpt0: mpt_wait_req(1) timed out
+[6527953] mpt0: mpt_recover_commands: abort timed-out. Resetting controller
+[6527953] mpt0: EvtLogData: IOCLogInfo: 0x00000000
+[6527953] mpt0:        EvtLogData: Event Data:
+[6527953] mpt0: mpt_cam_event: 0x1
+[6527953] mpt0: completing timedout/aborted req 0xfffffe00012a2610:49251
+[6527953] mpt0: completing timedout/aborted req 0xfffffe00012a5888:49256
+[6527953] mpt0: completing timedout/aborted req 0xfffffe00012b0d08:49257
+[6527953] mpt0: completing timedout/aborted req 0xfffffe00012aeb30:49262
---

Smart itself shows:
  SMART overall-health self-assessment test result: PASSED

but there is an error in the output:
  Read SMART Thresholds failed: Input/output error

I can only assume that whatever runs to add this info to the daily report is querying the system in a way that appears to be resetting the disk controller.

Any ideas or suggestions? Thought that I would point it out in case you weren't aware of the new change affecting older systems.
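
To check whether the timeouts and resets line up with the daily periodic run, the dmesg lines quoted above could be tallied per controller. The following is a minimal sketch in Python, assuming the excerpt has been saved as plain text. The function name `summarize_mpt_events` and both regular expressions are illustrative only; they are derived from the message shapes in this report, not from the mpt driver source.

```python
import re
from collections import Counter

# Patterns match only the two message shapes quoted in this report
# (assumption: other mpt messages are not of interest and are ignored).
TIMEOUT_RE = re.compile(r"\[(\d+)\] (mpt\d+): request \S+ timed out for ccb")
RESET_RE = re.compile(
    r"\[(\d+)\] (mpt\d+): mpt_recover_commands: abort timed-out\. Resetting controller"
)


def summarize_mpt_events(log_text):
    """Count request timeouts and controller resets per mpt device."""
    timeouts = Counter()
    resets = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(2)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(2)] += 1
    return {"timeouts": dict(timeouts), "resets": dict(resets)}


# Three lines copied from the excerpt in the description.
sample = """\
+[6527952] mpt0: request 0xfffffe00012a2610:49251 timed out for ccb 0xfffff8074e2ea000 (req->ccb 0xfffff8074e2ea000)
+[6527952] mpt0: attempting to abort req 0xfffffe00012a2610:49251 function 0
+[6527953] mpt0: mpt_recover_commands: abort timed-out. Resetting controller
"""

print(summarize_mpt_events(sample))
```

Running this against each day's log and comparing the counts with the time the daily periodic job runs would show whether the resets cluster around the SMART status report.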
Comment 1 Danny McGrath 2019-05-16 11:42:39 UTC
It seems that the system has been much better since removing the line:
  daily_status_smart_devices="...."

from /etc/periodic.conf. I suspect that past issues with these systems (Zabbix alerts during maintenance times) may actually have come from I/O stalls during the periodic runs that were invoking the SMART status.
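
For anyone who would rather comment the setting out than delete it, the change can be sketched as a small text transformation. This is only an illustration in Python: the helper name `comment_out_setting` is hypothetical, it works on a string rather than touching /etc/periodic.conf directly, and it comments the line out instead of removing it as described above. The elided value `"...."` is kept exactly as written.

```python
def comment_out_setting(conf_text, name="daily_status_smart_devices"):
    """Return conf_text with any active assignment of `name` commented out."""
    out = []
    for line in conf_text.splitlines():
        # Only lines that begin with NAME= are touched; lines that are
        # already commented out or set other variables pass through unchanged.
        if line.lstrip().startswith(name + "="):
            out.append("#" + line)
        else:
            out.append(line)
    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


# Illustrative input; the value is left elided as in the comment above.
example = '# local overrides\ndaily_status_smart_devices="...."\n'
print(comment_out_setting(example), end="")
```

Writing the result back to the file, and keeping a backup first, is left to the administrator.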

Interestingly enough, running `smartctl -i /dev/da#` alone doesn't cause the problems. It was only a recent update to smartmontools that started populating the daily logs, which in turn triggered the errors and revealed a potentially old bug.

For some extra background, these R410s have hardware-RAID-capable HBA cards in them, but they were configured as non-RAID, which appeared to pass the disks through fine (at least enough to give me output in smartctl).

My guess is that smartctl doesn't like something about these particular devices. They are technically in IR mode, not IT, yet they allow SMART to be queried. Maybe there is some issue with this? I don't know, you guys are the experts, so I can only give some historical insights and technical info.

Hope it helps!
Comment 2 Fernando Apesteguía freebsd_committer freebsd_triage 2019-06-05 18:13:11 UTC
(In reply to Dan McGrath from comment #1)
Hi Dan,

Can I close this PR then?
Comment 3 Danny McGrath 2019-06-05 19:57:47 UTC
(In reply to Fernando Apesteguía from comment #2)
Hi,

Honestly, it's up to you. I was able to just disable the periodic SMART reporting on those hosts to avoid the issue, but there is clearly some non-ideal behavior that may need to be evaluated. That said, it seems like more of an upstream problem than a FreeBSD one, so perhaps closing would be best. Your call.
Comment 4 Oleksii Samorukov freebsd_committer freebsd_triage 2019-07-14 11:29:49 UTC
I would recommend closing this PR. Problems with Dell controllers are 100% caused by buggy firmware in the mpt device. Some of the firmware versions are affected, and most of them do not implement SCSI->SATA tunneling as they should. We do have a few workarounds in the code; however, it is possible that they are not sufficient. My recommendation is to use the latest firmware version to see if the problem persists.

Anyway, there is nothing we can do in smartmontools itself.
Comment 5 Fernando Apesteguía freebsd_committer freebsd_triage 2019-07-14 16:03:47 UTC
OK, closing PR as requested by maintainer.