Bug 279978 - After commit 25375b1415, any errors in device connected to ahci etc. results in Unretryable error
Summary: After commit 25375b1415, any errors in device connected to ahci etc. results ...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.1-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: Alexander Motin
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2024-06-25 04:04 UTC by Tomoki AONO
Modified: 2024-11-14 08:18 UTC (History)
3 users (show)

See Also:


Attachments
Patch to not use wiped func_code (1.48 KB, patch)
2024-06-26 15:27 UTC, Alexander Motin
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tomoki AONO 2024-06-25 04:04:06 UTC
I have a (half-broken) HDD (ada2, connected to ahci1) with a FreeBSD-14.1 (p0)
server in my office.

> kernel: CPU: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (2672.84-MHz K8-class CPU)
> kernel:   Origin="GenuineIntel"  Id=0x106a4  Family=0x6  Model=0x1a  Stepping=4
> kernel: ahci1: <Intel ICH10 AHCI SATA controller> port 0x7c00-0x7c07,0x7880-0x7883,0x7800-0x7807,0x7480-0x7483,0x7400-0x741f mem 0xf7ffc000-0xf7ffc7ff irq 20 at device 31.2 on pci0
> kernel: ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
> kernel: ahcich4: <AHCI channel> at channel 2 on ahci1
> kernel: ahciem0: <AHCI enclosure management bridge> on ahci1
> kernel: ses0 at ahciem0 bus 0 scbus9 target 0 lun 0
> kernel: ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device
> kernel: ses0: SEMB SES Device
> kernel: ses0: ada2,pass2 in 'Slot 02', SATA Slot: scbus5 target 0
> kernel: ada2 at ahcich4 bus 0 scbus5 target 0 lun 0
> kernel: ada2: <WDC WD60EFRX-68L0BN1 82.00A82> ACS-2 ATA SATA 3.x device
> kernel: ada2: Serial Number WD-WX41DA5LVRR4
> kernel: ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> kernel: ada2: Command Queueing enabled
> kernel: ada2: 5723166MB (11721045168 512 byte sectors)
> kernel: ada2: quirks=0x1<4K>

When running read/write bad sector using dd (with 'sysctl kern.geom.debugflags=16'),
Unretryable error occurs and cannot access to ada2 until I use
'camcontrol reset ada2'.

> kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 68 da 57 40 b3 00 00 00 00 00
> kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
> kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error

When on FreeBSD-13.x, this error is retryable. (Following entries are
past logs, sector/ACB differs.)

> kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 e8 1b df 40 1f 01 00 08 00 00
> kernel: (ada2:ahcich4:0:0:0): CAM status: ATA Status Error
> kernel: (ada2:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> kernel: (ada2:ahcich4:0:0:0): RES: 41 40 b0 1c df 00 1f 01 00 00 00
> kernel: (ada2:ahcich4:0:0:0): Retrying command, 3 more tries remain

In commit 25375b1415, we changed as follows (/sys/dev/ahci/ahci.c only,
probably this also affects to siis/mvs):

diff --git a/sys/dev/ahci/ahci.c b/sys/dev/ahci/ahci.c
index 12e6ee8102da..d62a043eb2ab 100644
--- a/sys/dev/ahci/ahci.c
+++ b/sys/dev/ahci/ahci.c
@@ -2178,7 +2178,8 @@ completeall:
 		ahci_reset(ch);
 		return;
 	}
-	ccb->ccb_h = ch->hold[i]->ccb_h;	/* Reuse old header. */
+	xpt_setup_ccb(&ccb->ccb_h, ch->hold[i]->ccb_h.path,
+	    ch->hold[i]->ccb_h.pinfo.priority);
 	if (ccb->ccb_h.func_code == XPT_ATA_IO) {
 		/* READ LOG */
 		ccb->ccb_h.recovery_type = RECOVERY_READ_LOG;

Commit message say 'only field I see used from all the header is target_id.'
But we need func_code in 'if' statement in NEXT line.
func_code is always same value (probably 0), so 'if' statement
never match condition (XPT_ATA_IO in above code), we always do
'REQUEST SENSE' in 'else' block. This is problematic.

Copying more CCB header (at least func_code) or 'if' condition change
(ex. 'if(ch->hold[i]->ccb.h.func_code == XPT_ATA_IO) { ...')
would solve this issue. I modified adding xpt_merge_ccb()
after xpt_setup_ccb() (booting with modified kernel seems to work fine),
but I'm not sure if this is a right code.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2024-06-26 14:45:02 UTC
^Triage: over to committer of 25375b1415 .
Comment 2 Alexander Motin freebsd_committer freebsd_triage 2024-06-26 15:27:23 UTC
Created attachment 251709 [details]
Patch to not use wiped func_code

Thank you for reporting the issue.  I haven't tried to reproduce it, but it clearly the code looks wrong.  Could you test this patch to see whether it solves your problem, if it is reproducible?  I've just checked that it builds.
Comment 3 Tomoki AONO 2024-06-27 11:33:08 UTC
I'm not familiar with reporting, I'll try.

ada2 (I first reported) has recovered (bad sector replaced) with dd, so I tried another disk in same server. (Scanning is done with 'smartctl -t long /dev/ada1')

> kernel: ahci1: <Intel ICH10 AHCI SATA controller> port 0x7c00-0x7c07,0x7880-0x7883,0x7800-0x7807,0x7480-0x7483,0x7400-0x741f mem 0xf7ffc000-0xf7ffc7ff irq 20 at device 31.2 on pci0
> kernel: ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
> kernel: ahcich3: <AHCI channel> at channel 1 on ahci1
> kernel: ada1 at ahcich3 bus 0 scbus4 target 0 lun 0
> kernel: ada1: <WDC WD60EFRX-68MYMN1 82.00A82> ACS-2 ATA SATA 3.x device
> kernel: ada1: Serial Number WD-WXB1HB4LVJXX
> kernel: ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> kernel: ada1: Command Queueing enabled
> kernel: ada1: 5723166MB (11721045168 512 byte sectors)
> kernel: ada1: quirks=0x1<4K>
> kernel: ses0: ada1,pass1 in 'Slot 01', SATA Slot: scbus4 target 0

kernel is 14.1-RELEASE-p1 . On both unpatched kernel (rebuild by myself) and kernel with Alexander's patch (in comment #2), I tested with dd. (WITHOUT 'sysctl kern.geom.debugflags=16')

> dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync

'29487216' is LBA that SMART long test failed in read. 'count=320' would be too large. Just to let you know, the patch worked for me.

With unpatched kernel:
> # dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync
> dd: /dev/ada1: Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes transferred in 3.915904 secs (0 bytes/sec)
> dd: /dev/ada1: Input/output error
> (snipped)
> dd: /dev/ada1: Input/output error
> 320+0 records in
> 320+0 records out
> 163840 bytes transferred in 4.007715 secs (40881 bytes/sec)
I feel it very fast ... ;-)

Unpatched kernel message in syslog:
> Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00
> Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): CAM status: Auto-Sense Retrieval Failed
> Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Unretryable error
> Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 71 f0 c1 40 01 00 00 00 00 00
> Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): CAM status: Auto-Sense Retrieval Failed
> Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Unretryable error


With patched kernel:
> # dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync
> dd: /dev/ada1: Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes transferred in 19.538302 secs (0 bytes/sec)
> dd: /dev/ada1: Input/output error
> dd: /dev/ada1: Input/output error
> 1+0 records in
> 1+0 records out
> 512 bytes transferred in 38.362942 secs (13 bytes/sec)
> dd: /dev/ada1: Input/output error
> ^C (I stopped.)
Too slow, but expected behavior. As we'll see later, retries are indeed happening.

Patched kernel message in syslog:
> Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00
> Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): CAM status: ATA Status Error
> Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): RES: 41 40 70 f0 c1 00 01 00 00 00 00
> Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): Retrying command, 3 more tries remain
> (snipped)
> Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00
> Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): CAM status: ATA Status Error
> Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): RES: 41 40 70 f0 c1 00 01 00 00 00 00
> Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Retries exhausted

Retry has happen. I think this is the intended behavior.
Comment 4 commit-hook freebsd_committer freebsd_triage 2024-06-27 13:38:13 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=87085c12ba8fa51f777bc636df67008b45e20d1c

commit 87085c12ba8fa51f777bc636df67008b45e20d1c
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2024-06-27 13:29:23 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2024-06-27 13:29:23 +0000

    Fix SATA NCQ error recovery after 25375b1415

    Since that commit ahci(4), siis(4) and mvs(4) drivers ended up
    using wrong command to fetch error information for NCQ commands.
    Since ATA errors are not very informative to begin with, the only
    noticeable effect is a lack of retries on those errors by CAM.

    MFC after:      1 week
    PR:             279978

 sys/dev/ahci/ahci.c | 2 +-
 sys/dev/mvs/mvs.c   | 2 +-
 sys/dev/siis/siis.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
Comment 5 Alexander Motin freebsd_committer freebsd_triage 2024-06-27 13:39:38 UTC
Thank you.  The primary result of the fix is proper error status reporting now.  Retries is a consequence, but the most important one indeed.
Comment 6 Tomoki AONO 2024-06-29 06:24:41 UTC
Thank you for your cooperation, Alexander! I hope this change gets merged to 14.1 (/14.0).
Comment 7 eborisch+FreeBSD 2024-08-27 19:59:36 UTC
+1 to get 14.1 patched for this; this is causing crashes on VMs for me. Looking back in older logs, I see occasional retries which successfully resolve. (Perhaps during a live migration? I don't have visibility into what is going on within the VM hosting side.)
Comment 8 Tomoki AONO 2024-11-14 08:18:50 UTC
Since releng/14.2 branch created (and beta releases of 14.2 appears),
I would like someone to MFC this change to this branch (and stable/14).
What should I do for this?