| Summary: | CAM_IO_STATS does not track MEDIUM ERRORs | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Alan Somers <asomers> |
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Closed FIXED | ||
| Severity: | Affects Some People | CC: | imp, rew |
| Priority: | --- | ||
| Version: | 13.0-RELEASE | ||
| Hardware: | Any | ||
| OS: | Any | ||
|
Description
Alan Somers
2021-12-06 18:55:17 UTC
These counters, for better or worse, track failures. Since this I/O succeeded on retry, the counters weren't bumped. So works as designed, with a hefty helping of 'the design could widen to report this' ACK is my response :). I'll see what I can do about the latter.

(In reply to Warner Losh from comment #1)
Oh, wait, it did fail. I'll check to see why that didn't bump the counters...

Yes, it did fail. 59 errors got through to the ZFS level.

$ zpool status | grep -C1 JB2_S57
spare-1 DEGRADED 0 0 0
multipath/mp_JB2_S57.eli FAULTED 59 0 0 too many errors
multipath/mp_JB1_S90.eli ONLINE 0 0 0 (resilvering)

Yea, I saw that too late. I looked at this bug on my phone first, thought it was retriable, and it wasn't.... too much pattern recognition... I'll check into why the counters aren't bumped. At least some counter should be 59, or the sum of counters 59 (or the drive is gone forever with no counters :). Sorry about the initial misshoot.

No problem. BTW, these happen pretty frequently, so I can probably do some experiments if that would help you. I could run a dtrace script, for example.

I saw this again. A drive returned a lot of "Head select fault"s, but the errors counter is still 0, even though errors got reported to ZFS.

(da292:mpr2:0:214:0): READ(6). CDB: 08 00 00 00 08 00
(da292:mpr2:0:214:0): CAM status: SCSI Status Error
(da292:mpr2:0:214:0): SCSI status: Check Condition
(da292:mpr2:0:214:0): SCSI sense: ABORTED COMMAND asc:9,4 (Head select fault)
(da292:mpr2:0:214:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
(da292:mpr2:0:214:0): Retrying command (per sense data)

$ sysctl kern.cam.da.292.stats
kern.cam.da.292.stats.pack_invalidations: 0
kern.cam.da.292.stats.timeouts: 10461
kern.cam.da.292.stats.errors: 0
$ zpool status | grep mp_JB1_S15
multipath/mp_JB1_S15.eli FAULTED 7.90K 11.0K 35 too many errors

Wonder if the counter should be bumped when CAM_SCSI_STATUS_ERROR?
Something for context:
diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c
index df8791e4b23e..2d78c00dbb60 100644
--- a/sys/cam/scsi/scsi_da.c
+++ b/sys/cam/scsi/scsi_da.c
@@ -5994,20 +5994,21 @@ daerror(union ccb *ccb, u_int32_t cam_flags, u_int32_t sense_flags)
 #ifdef CAM_IO_STATS
 	switch (ccb->ccb_h.status & CAM_STATUS_MASK) {
 	case CAM_CMD_TIMEOUT:
 		softc->timeouts++;
 		break;
 	case CAM_REQ_ABORTED:
 	case CAM_REQ_CMP_ERR:
 	case CAM_REQ_TERMIO:
 	case CAM_UNREC_HBA_ERROR:
 	case CAM_DATA_RUN_ERR:
+	case CAM_SCSI_STATUS_ERROR:
 		softc->errors++;
 		break;
 	default:
 		break;
 	}
 #endif
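
To make the effect of the proposed change concrete, here is a minimal userland sketch of the same classification logic. It is not kernel code: the `sketch_status` enum, `sketch_stats` struct, `sketch_count()` function, and `with_patch` flag are hypothetical stand-ins invented for illustration, and the enum values are not the ones defined in cam.h. It only models how a status that falls through to `default` leaves both counters untouched, which would match the `errors: 0` reading above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a few CAM status codes (values are illustrative). */
enum sketch_status {
	SK_REQ_CMP,
	SK_CMD_TIMEOUT,
	SK_REQ_ABORTED,
	SK_REQ_CMP_ERR,
	SK_REQ_TERMIO,
	SK_UNREC_HBA_ERROR,
	SK_DATA_RUN_ERR,
	SK_SCSI_STATUS_ERROR
};

/* Hypothetical mirror of the two counters touched by the switch in daerror(). */
struct sketch_stats {
	unsigned int timeouts;
	unsigned int errors;
};

/*
 * Classify one failed request. with_patch == 0 models the switch before
 * the diff (SCSI status errors are not counted); with_patch != 0 models
 * it after the diff (they are counted as errors).
 */
void
sketch_count(struct sketch_stats *st, enum sketch_status status, int with_patch)
{
	assert(st != NULL);

	switch (status) {
	case SK_CMD_TIMEOUT:
		st->timeouts++;
		break;
	case SK_SCSI_STATUS_ERROR:
		if (with_patch)
			st->errors++;
		break;
	case SK_REQ_ABORTED:
	case SK_REQ_CMP_ERR:
	case SK_REQ_TERMIO:
	case SK_UNREC_HBA_ERROR:
	case SK_DATA_RUN_ERR:
		st->errors++;
		break;
	default:
		break;
	}
}
```

Calling `sketch_count(&st, SK_SCSI_STATUS_ERROR, 0)` leaves `st.errors` unchanged, while the same call with `with_patch` set to 1 increments it. Timeouts are counted the same way in both cases.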
(In reply to Robert Wing from comment #7)
So far this patch seems to be working. I've installed it on one system that's experienced two MEDIUM ERRORs since. I'll check on it daily.

https://reviews.freebsd.org/D33783 lists CAM_SCSI_STATUS_ERROR and other errors that also indicate problems with I/O.

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=85056a56f3bbd26fdbaf0f045b527acac7199589

commit 85056a56f3bbd26fdbaf0f045b527acac7199589
Author: Warner Losh <imp@FreeBSD.org>
AuthorDate: 2022-01-09 17:13:05 +0000
Commit: Warner Losh <imp@FreeBSD.org>
CommitDate: 2022-01-09 17:13:05 +0000

    cam: Include more statuses as errors for CAM_IO_STATS

    Tag more status return values as an error for the I/O. CAM_SCSI_STATUS_ERROR is returned for medium errors, for example, but the counts weren't increased. The added errors all indicate a problem with the device request.

    Sponsored by: Netflix
    PR: 260257
    Feedback from: ken
    Reviewed by: asomers
    Differential Revision: https://reviews.freebsd.org/D33783

 sys/cam/scsi/scsi_da.c | 2 ++
 1 file changed, 2 insertions(+)
|