The CAM_IO_STATS kernel config option adds some sysctls to every disk device that track the number of errors, timeouts, and pack invalidations it has had. However, some slip through the cracks. I consistently see dmesg report MEDIUM ERRORs such as the below, while the stats sysctls report nothing. There are clearly errors, because gmultipath reports one path as DEGRADED and ZFS reports faults on the drive. It seems that some error path isn't updating the stats like it should. (da424:mpr3:0:115:0): READ(10). CDB: 28 00 06 3c 3a 68 00 00 d7 00 (da424:mpr3:0:115:0): CAM status: SCSI Status Error (da424:mpr3:0:115:0): SCSI status: Check Condition (da424:mpr3:0:115:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error) (da424:mpr3:0:115:0): Info: 0x63c3a7b (da424:mpr3:0:115:0): Field Replaceable Unit: 134 (da424:mpr3:0:115:0): Command Specific Info: 0x81037000 (da424:mpr3:0:115:0): Actual Retry Count: 206 (da424:mpr3:0:115:0): Error 5, Unretryable error GEOM_MULTIPATH: Error 5, da424 in mp_JB2_S57 marked FAIL GEOM_MULTIPATH: all paths in mp_JB2_S57 were marked FAIL, restore da244 $ sysctl kern.cam.da.424.stats kern.cam.da.424.stats.pack_invalidations: 0 kern.cam.da.424.stats.timeouts: 0 kern.cam.da.424.stats.errors: 0
These counters, for better or worse, track failures. Since this I/O succeeded on retry, the counters weren't bumped. So works as designed, with a hefty helping of 'the design could widen to report this' ACK is my response :). I'll see what I can do about the latter.
(In reply to Warner Losh from comment #1) Oh, wait, it did fail. I'll check to see why that didn't bump the counters...
Yes, it did fail. 59 errors got through to the ZFS level. $ zpool status | grep -C1 JB2_S57 spare-1 DEGRADED 0 0 0 multipath/mp_JB2_S57.eli FAULTED 59 0 0 too many errors multipath/mp_JB1_S90.eli ONLINE 0 0 0 (resilvering)
Yea, I saw that too late. I looked at this bug on my phone first, thought it was retriable, and it wasn't.... too much pattern recognition... I'll check into why the counters aren't bumped. At least some counter should be 59, or the sum of counters 59 (or the drive is gone forever with no counters :). Sorry about the initial misshoot.
No problem. BTW, these happen pretty frequently, so I can probably do some experiments if that would help you. I could run a dtrace script, for example.
I saw this again. A drive returned a lot of "Head select fault"s, but the errors counter is still 0, even though errors got reported to ZFS. (da292:mpr2:0:214:0): READ(6). CDB: 08 00 00 00 08 00 (da292:mpr2:0:214:0): CAM status: SCSI Status Error (da292:mpr2:0:214:0): SCSI status: Check Condition (da292:mpr2:0:214:0): SCSI sense: ABORTED COMMAND asc:9,4 (Head select fault) (da292:mpr2:0:214:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 (da292:mpr2:0:214:0): Retrying command (per sense data) $ sysctl kern.cam.da.292.stats kern.cam.da.292.stats.pack_invalidations: 0 kern.cam.da.292.stats.timeouts: 10461 kern.cam.da.292.stats.errors: 0 $ zpool status | grep mp_JB1_S15 multipath/mp_JB1_S15.eli FAULTED 7.90K 11.0K 35 too many errors
Wonder if the counter should be bumped when CAM_SCSI_STATUS_ERROR? something for context: diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c index df8791e4b23e..2d78c00dbb60 100644 --- a/sys/cam/scsi/scsi_da.c +++ b/sys/cam/scsi/scsi_da.c @@ -5994,20 +5994,21 @@ daerror(union ccb *ccb, u_int32_t cam_flags, u_int32_t sense_flags) #ifdef CAM_IO_STATS switch (ccb->ccb_h.status & CAM_STATUS_MASK) { case CAM_CMD_TIMEOUT: softc->timeouts++; break; case CAM_REQ_ABORTED: case CAM_REQ_CMP_ERR: case CAM_REQ_TERMIO: case CAM_UNREC_HBA_ERROR: case CAM_DATA_RUN_ERR: + case CAM_SCSI_STATUS_ERROR: softc->errors++; break; default: break; } #endif
(In reply to Robert Wing from comment #7) So far this patch seems to be working. I've installed it on one system that's experienced two MEDIUM ERRORs since. I'll check on it daily.
https://reviews.freebsd.org/D33783 lists CAM_SCSI_STATUS_ERROR and other errors that also indicate problems with I/O.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=85056a56f3bbd26fdbaf0f045b527acac7199589 commit 85056a56f3bbd26fdbaf0f045b527acac7199589 Author: Warner Losh <imp@FreeBSD.org> AuthorDate: 2022-01-09 17:13:05 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2022-01-09 17:13:05 +0000 cam: Include more statuses as errors for CAM_IO_STATS Tag more status return values as an error for the I/O. CAM_SCSI_STATUS_ERROR is returned for medium errors, for example, but the counts weren't increased. The added errors all indicate a problem with the device request. Sponsored by: Netflix PR: 260257 Feedback from: ken Reviewed by: asomers Differential Revision: https://reviews.freebsd.org/D33783 sys/cam/scsi/scsi_da.c | 2 ++ 1 file changed, 2 insertions(+)
MARKED AS SPAM