I have a (half-broken) HDD (ada2, connected to ahci1) with a FreeBSD-14.1 (p0) server in my office. > kernel: CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (2672.84-MHz K8-class CPU) > kernel: Origin="GenuineIntel" Id=0x106a4 Family=0x6 Model=0x1a Stepping=4 > kernel: ahci1: <Intel ICH10 AHCI SATA controller> port 0x7c00-0x7c07,0x7880-0x7883,0x7800-0x7807,0x7480-0x7483,0x7400-0x741f mem 0xf7ffc000-0xf7ffc7ff irq 20 at device 31.2 on pci0 > kernel: ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported > kernel: ahcich4: <AHCI channel> at channel 2 on ahci1 > kernel: ahciem0: <AHCI enclosure management bridge> on ahci1 > kernel: ses0 at ahciem0 bus 0 scbus9 target 0 lun 0 > kernel: ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device > kernel: ses0: SEMB SES Device > kernel: ses0: ada2,pass2 in 'Slot 02', SATA Slot: scbus5 target 0 > kernel: ada2 at ahcich4 bus 0 scbus5 target 0 lun 0 > kernel: ada2: <WDC WD60EFRX-68L0BN1 82.00A82> ACS-2 ATA SATA 3.x device > kernel: ada2: Serial Number WD-WX41DA5LVRR4 > kernel: ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > kernel: ada2: Command Queueing enabled > kernel: ada2: 5723166MB (11721045168 512 byte sectors) > kernel: ada2: quirks=0x1<4K> When running read/write bad sector using dd (with 'sysctl kern.geom.debugflags=16'), Unretryable error occurs and cannot access to ada2 until I use 'camcontrol reset ada2'. > kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 68 da 57 40 b3 00 00 00 00 00 > kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed > kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error When on FreeBSD-13.x, this error is retryable. (Following entries are past logs, sector/ACB differs.) > kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 e8 1b df 40 1f 01 00 08 00 00 > kernel: (ada2:ahcich4:0:0:0): CAM status: ATA Status Error > kernel: (ada2:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) > kernel: (ada2:ahcich4:0:0:0): RES: 41 40 b0 1c df 00 1f 01 00 00 00 > kernel: (ada2:ahcich4:0:0:0): Retrying command, 3 more tries remain In commit 25375b1415, we changed as follows (/sys/dev/ahci/ahci.c only, probably this also affects to siis/mvs): diff --git a/sys/dev/ahci/ahci.c b/sys/dev/ahci/ahci.c index 12e6ee8102da..d62a043eb2ab 100644 --- a/sys/dev/ahci/ahci.c +++ b/sys/dev/ahci/ahci.c @@ -2178,7 +2178,8 @@ completeall: ahci_reset(ch); return; } - ccb->ccb_h = ch->hold[i]->ccb_h; /* Reuse old header. */ + xpt_setup_ccb(&ccb->ccb_h, ch->hold[i]->ccb_h.path, + ch->hold[i]->ccb_h.pinfo.priority); if (ccb->ccb_h.func_code == XPT_ATA_IO) { /* READ LOG */ ccb->ccb_h.recovery_type = RECOVERY_READ_LOG; Commit message say 'only field I see used from all the header is target_id.' But we need func_code in 'if' statement in NEXT line. func_code is always same value (probably 0), so 'if' statement never match condition (XPT_ATA_IO in above code), we always do 'REQUEST SENSE' in 'else' block. This is problematic. Copying more CCB header (at least func_code) or 'if' condition change (ex. 'if(ch->hold[i]->ccb.h.func_code == XPT_ATA_IO) { ...') would solve this issue. I modified adding xpt_merge_ccb() after xpt_setup_ccb() (booting with modified kernel seems to work fine), but I'm not sure if this is a right code.
^Triage: over to committer of 25375b1415 .
Created attachment 251709 [details] Patch to not use wiped func_code Thank you for reporting the issue. I haven't tried to reproduce it, but it clearly the code looks wrong. Could you test this patch to see whether it solves your problem, if it is reproducible? I've just checked that it builds.
I'm not familiar with reporting, I'll try. ada2 (I first reported) has recovered (bad sector replaced) with dd, so I tried another disk in same server. (Scanning is done with 'smartctl -t long /dev/ada1') > kernel: ahci1: <Intel ICH10 AHCI SATA controller> port 0x7c00-0x7c07,0x7880-0x7883,0x7800-0x7807,0x7480-0x7483,0x7400-0x741f mem 0xf7ffc000-0xf7ffc7ff irq 20 at device 31.2 on pci0 > kernel: ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported > kernel: ahcich3: <AHCI channel> at channel 1 on ahci1 > kernel: ada1 at ahcich3 bus 0 scbus4 target 0 lun 0 > kernel: ada1: <WDC WD60EFRX-68MYMN1 82.00A82> ACS-2 ATA SATA 3.x device > kernel: ada1: Serial Number WD-WXB1HB4LVJXX > kernel: ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > kernel: ada1: Command Queueing enabled > kernel: ada1: 5723166MB (11721045168 512 byte sectors) > kernel: ada1: quirks=0x1<4K> > kernel: ses0: ada1,pass1 in 'Slot 01', SATA Slot: scbus4 target 0 kernel is 14.1-RELEASE-p1 . On both unpatched kernel (rebuild by myself) and kernel with Alexander's patch (in comment #2), I tested with dd. (WITHOUT 'sysctl kern.geom.debugflags=16') > dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync '29487216' is LBA that SMART long test failed in read. 'count=320' would be too large. Just to let you know, the patch worked for me. With unpatched kernel: > # dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync > dd: /dev/ada1: Input/output error > 0+0 records in > 0+0 records out > 0 bytes transferred in 3.915904 secs (0 bytes/sec) > dd: /dev/ada1: Input/output error > (snipped) > dd: /dev/ada1: Input/output error > 320+0 records in > 320+0 records out > 163840 bytes transferred in 4.007715 secs (40881 bytes/sec) I feel it very fast ... ;-) Unpatched kernel message in syslog: > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): CAM status: Auto-Sense Retrieval Failed > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Unretryable error > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 71 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): CAM status: Auto-Sense Retrieval Failed > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Unretryable error With patched kernel: > # dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync > dd: /dev/ada1: Input/output error > 0+0 records in > 0+0 records out > 0 bytes transferred in 19.538302 secs (0 bytes/sec) > dd: /dev/ada1: Input/output error > dd: /dev/ada1: Input/output error > 1+0 records in > 1+0 records out > 512 bytes transferred in 38.362942 secs (13 bytes/sec) > dd: /dev/ada1: Input/output error > ^C (I stopped.) Too slow, but expected behavior. As we'll see later, retries are indeed happening. Patched kernel message in syslog: > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): CAM status: ATA Status Error > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): RES: 41 40 70 f0 c1 00 01 00 00 00 00 > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): Retrying command, 3 more tries remain > (snipped) > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): CAM status: ATA Status Error > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): RES: 41 40 70 f0 c1 00 01 00 00 00 00 > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Retries exhausted Retry has happen. I think this is the intended behavior.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=87085c12ba8fa51f777bc636df67008b45e20d1c commit 87085c12ba8fa51f777bc636df67008b45e20d1c Author: Alexander Motin <mav@FreeBSD.org> AuthorDate: 2024-06-27 13:29:23 +0000 Commit: Alexander Motin <mav@FreeBSD.org> CommitDate: 2024-06-27 13:29:23 +0000 Fix SATA NCQ error recovery after 25375b1415 Since that commit ahci(4), siis(4) and mvs(4) drivers ended up using wrong command to fetch error information for NCQ commands. Since ATA errors are not very informative to begin with, the only noticeable effect is a lack of retries on those errors by CAM. MFC after: 1 week PR: 279978 sys/dev/ahci/ahci.c | 2 +- sys/dev/mvs/mvs.c | 2 +- sys/dev/siis/siis.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
Thank you. The primary result of the fix is proper error status reporting now. Retries is a consequence, but the most important one indeed.
Thank you for your cooperation, Alexander! I hope this change gets merged to 14.1 (/14.0).
+1 to get 14.1 patched for this; this is causing crashes on VMs for me. Looking back in older logs, I see occasional retries which successfully resolve. (Perhaps during a live migration? I don't have visibility into what is going on within the VM hosting side.)
Since releng/14.2 branch created (and beta releases of 14.2 appears), I would like someone to MFC this change to this branch (and stable/14). What should I do for this?