Bug 260257 - CAM_IO_STATS does not track MEDIUM ERRORs
Summary: CAM_IO_STATS does not track MEDIUM ERRORs
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.0-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-06 18:55 UTC by Alan Somers
Modified: 2024-04-15 11:39 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Somers freebsd_committer freebsd_triage 2021-12-06 18:55:17 UTC
The CAM_IO_STATS kernel config option adds some sysctls to every disk device that track the number of errors, timeouts, and pack invalidations it has had.  However, some slip through the cracks.  I consistently see dmesg report MEDIUM ERRORs such as the below, while the stats sysctls report nothing.  There are clearly errors, because gmultipath reports one path as DEGRADED and ZFS reports faults on the drive.  It seems that some error path isn't updating the stats like it should.

(da424:mpr3:0:115:0): READ(10). CDB: 28 00 06 3c 3a 68 00 00 d7 00 
(da424:mpr3:0:115:0): CAM status: SCSI Status Error
(da424:mpr3:0:115:0): SCSI status: Check Condition
(da424:mpr3:0:115:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da424:mpr3:0:115:0): Info: 0x63c3a7b
(da424:mpr3:0:115:0): Field Replaceable Unit: 134
(da424:mpr3:0:115:0): Command Specific Info: 0x81037000
(da424:mpr3:0:115:0): Actual Retry Count: 206
(da424:mpr3:0:115:0): Error 5, Unretryable error
GEOM_MULTIPATH: Error 5, da424 in mp_JB2_S57 marked FAIL
GEOM_MULTIPATH: all paths in mp_JB2_S57 were marked FAIL, restore da244

$ sysctl kern.cam.da.424.stats
kern.cam.da.424.stats.pack_invalidations: 0
kern.cam.da.424.stats.timeouts: 0
kern.cam.da.424.stats.errors: 0
Comment 1 Warner Losh freebsd_committer freebsd_triage 2021-12-06 19:10:37 UTC
These counters, for better or worse, track failures. Since this I/O succeeded on retry, the counters weren't bumped. So works as designed, with a hefty helping of 'the design could widen to report this' ACK is my response :). I'll see what I can do about the latter.
Comment 2 Warner Losh freebsd_committer freebsd_triage 2021-12-06 19:11:27 UTC
(In reply to Warner Losh from comment #1)
Oh, wait, it did fail. I'll check to see why that didn't bump the counters...
Comment 3 Alan Somers freebsd_committer freebsd_triage 2021-12-06 19:13:56 UTC
Yes, it did fail.  59 errors got through to the ZFS level.

$ zpool status | grep -C1 JB2_S57
	    spare-1                     DEGRADED     0     0     0
	      multipath/mp_JB2_S57.eli  FAULTED     59     0     0  too many errors
	      multipath/mp_JB1_S90.eli  ONLINE       0     0     0  (resilvering)
Comment 4 Warner Losh freebsd_committer freebsd_triage 2021-12-06 19:16:09 UTC
Yea, I saw that too late. I looked at this bug on my phone first, thought it was retriable, and it wasn't.... too much pattern recognition...
I'll check into why the counters aren't bumped. At least some counter should be 59, or the sum of counters 59 (or the drive is gone forever with no counters :).

Sorry about the initial misshoot.
Comment 5 Alan Somers freebsd_committer freebsd_triage 2021-12-06 19:50:08 UTC
No problem.  BTW, these happen pretty frequently, so I can probably do some experiments if that would help you.  I could run a dtrace script, for example.
Comment 6 Alan Somers freebsd_committer freebsd_triage 2022-01-01 20:25:49 UTC
I saw this again.  A drive returned a lot of "Head select fault"s, but the errors counter is still 0, even though errors got reported to ZFS.

(da292:mpr2:0:214:0): READ(6). CDB: 08 00 00 00 08 00 
(da292:mpr2:0:214:0): CAM status: SCSI Status Error
(da292:mpr2:0:214:0): SCSI status: Check Condition
(da292:mpr2:0:214:0): SCSI sense: ABORTED COMMAND asc:9,4 (Head select fault)
(da292:mpr2:0:214:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
(da292:mpr2:0:214:0): Retrying command (per sense data)

$ sysctl kern.cam.da.292.stats
kern.cam.da.292.stats.pack_invalidations: 0
kern.cam.da.292.stats.timeouts: 10461
kern.cam.da.292.stats.errors: 0

$ zpool status | grep mp_JB1_S15
	      multipath/mp_JB1_S15.eli  FAULTED  7.90K 11.0K    35  too many errors
Comment 7 Robert Wing freebsd_committer freebsd_triage 2022-01-05 20:00:03 UTC
Wonder if the counter should be bumped when CAM_SCSI_STATUS_ERROR?

something for context:

diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c
index df8791e4b23e..2d78c00dbb60 100644
--- a/sys/cam/scsi/scsi_da.c
+++ b/sys/cam/scsi/scsi_da.c
@@ -5994,20 +5994,21 @@ daerror(union ccb *ccb, u_int32_t cam_flags, u_int32_t sense_flags)
 #ifdef CAM_IO_STATS
        switch (ccb->ccb_h.status & CAM_STATUS_MASK) {
        case CAM_CMD_TIMEOUT:
                softc->timeouts++;
                break;
        case CAM_REQ_ABORTED:
        case CAM_REQ_CMP_ERR:
        case CAM_REQ_TERMIO:
        case CAM_UNREC_HBA_ERROR:
        case CAM_DATA_RUN_ERR:
+       case CAM_SCSI_STATUS_ERROR:
                softc->errors++;
                break;
        default:
                break;
        }
 #endif
Comment 8 Alan Somers freebsd_committer freebsd_triage 2022-01-06 21:55:55 UTC
(In reply to Robert Wing from comment #7)
So far this patch seems to be working.  I've installed it on one system that's experienced two MEDIUM ERRORs since.  I'll check on it daily.
Comment 9 Warner Losh freebsd_committer freebsd_triage 2022-01-07 18:16:48 UTC
https://reviews.freebsd.org/D33783 lists CAM_SCSI_STATUS_ERROR and other errors that also indicate problems with I/O.
Comment 10 commit-hook freebsd_committer freebsd_triage 2022-01-09 17:18:16 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=85056a56f3bbd26fdbaf0f045b527acac7199589

commit 85056a56f3bbd26fdbaf0f045b527acac7199589
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2022-01-09 17:13:05 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2022-01-09 17:13:05 +0000

    cam: Include more statuses as errors for CAM_IO_STATS

    Tag more status return values as an error for the
    I/O. CAM_SCSI_STATUS_ERROR is returned for medium errors, for example,
    but the counts weren't increased. The added errors all indicate a
    problem with the device request.

    Sponsored by:           Netflix
    PR:                     260257
    Feedback from:          ken
    Reviewed by:            asomers
    Differential Revision:  https://reviews.freebsd.org/D33783

 sys/cam/scsi/scsi_da.c | 2 ++
 1 file changed, 2 insertions(+)
Comment 11 Marry Joy 2024-04-15 07:59:29 UTC
MARKED AS SPAM