Summary: | After commit 25375b1415, any errors in device connected to ahci etc. results in Unretryable error | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Tomoki AONO <aono> | ||||
Component: | kern | Assignee: | Alexander Motin <mav> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Some People | CC: | aono, eborisch+FreeBSD, mav | ||||
Priority: | --- | Keywords: | regression | ||||
Version: | 14.1-RELEASE | ||||||
Hardware: | Any | ||||||
OS: | Any | ||||||
Attachments: |
|
Description
Tomoki AONO
2024-06-25 04:04:06 UTC
^Triage: over to committer of 25375b1415 . Created attachment 251709 [details]
Patch to not use wiped func_code
Thank you for reporting the issue. I haven't tried to reproduce it, but it clearly the code looks wrong. Could you test this patch to see whether it solves your problem, if it is reproducible? I've just checked that it builds.
I'm not familiar with reporting, I'll try. ada2 (I first reported) has recovered (bad sector replaced) with dd, so I tried another disk in same server. (Scanning is done with 'smartctl -t long /dev/ada1') > kernel: ahci1: <Intel ICH10 AHCI SATA controller> port 0x7c00-0x7c07,0x7880-0x7883,0x7800-0x7807,0x7480-0x7483,0x7400-0x741f mem 0xf7ffc000-0xf7ffc7ff irq 20 at device 31.2 on pci0 > kernel: ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported > kernel: ahcich3: <AHCI channel> at channel 1 on ahci1 > kernel: ada1 at ahcich3 bus 0 scbus4 target 0 lun 0 > kernel: ada1: <WDC WD60EFRX-68MYMN1 82.00A82> ACS-2 ATA SATA 3.x device > kernel: ada1: Serial Number WD-WXB1HB4LVJXX > kernel: ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) > kernel: ada1: Command Queueing enabled > kernel: ada1: 5723166MB (11721045168 512 byte sectors) > kernel: ada1: quirks=0x1<4K> > kernel: ses0: ada1,pass1 in 'Slot 01', SATA Slot: scbus4 target 0 kernel is 14.1-RELEASE-p1 . On both unpatched kernel (rebuild by myself) and kernel with Alexander's patch (in comment #2), I tested with dd. (WITHOUT 'sysctl kern.geom.debugflags=16') > dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync '29487216' is LBA that SMART long test failed in read. 'count=320' would be too large. Just to let you know, the patch worked for me. With unpatched kernel: > # dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync > dd: /dev/ada1: Input/output error > 0+0 records in > 0+0 records out > 0 bytes transferred in 3.915904 secs (0 bytes/sec) > dd: /dev/ada1: Input/output error > (snipped) > dd: /dev/ada1: Input/output error > 320+0 records in > 320+0 records out > 163840 bytes transferred in 4.007715 secs (40881 bytes/sec) I feel it very fast ... ;-) Unpatched kernel message in syslog: > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): CAM status: Auto-Sense Retrieval Failed > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Unretryable error > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 71 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): CAM status: Auto-Sense Retrieval Failed > Jun 27 18:45:04 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Unretryable error With patched kernel: > # dd if=/dev/ada1 of=/dev/null iseek=29487216 bs=512 count=320 conv=noerror,sync > dd: /dev/ada1: Input/output error > 0+0 records in > 0+0 records out > 0 bytes transferred in 19.538302 secs (0 bytes/sec) > dd: /dev/ada1: Input/output error > dd: /dev/ada1: Input/output error > 1+0 records in > 1+0 records out > 512 bytes transferred in 38.362942 secs (13 bytes/sec) > dd: /dev/ada1: Input/output error > ^C (I stopped.) Too slow, but expected behavior. As we'll see later, retries are indeed happening. Patched kernel message in syslog: > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): CAM status: ATA Status Error > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): RES: 41 40 70 f0 c1 00 01 00 00 00 00 > Jun 27 18:51:55 sv kernel: (ada1:ahcich3:0:0:0): Retrying command, 3 more tries remain > (snipped) > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 70 f0 c1 40 01 00 00 00 00 00 > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): CAM status: ATA Status Error > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): RES: 41 40 70 f0 c1 00 01 00 00 00 00 > Jun 27 18:52:10 sv kernel: (ada1:ahcich3:0:0:0): Error 5, Retries exhausted Retry has happen. I think this is the intended behavior. A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=87085c12ba8fa51f777bc636df67008b45e20d1c commit 87085c12ba8fa51f777bc636df67008b45e20d1c Author: Alexander Motin <mav@FreeBSD.org> AuthorDate: 2024-06-27 13:29:23 +0000 Commit: Alexander Motin <mav@FreeBSD.org> CommitDate: 2024-06-27 13:29:23 +0000 Fix SATA NCQ error recovery after 25375b1415 Since that commit ahci(4), siis(4) and mvs(4) drivers ended up using wrong command to fetch error information for NCQ commands. Since ATA errors are not very informative to begin with, the only noticeable effect is a lack of retries on those errors by CAM. MFC after: 1 week PR: 279978 sys/dev/ahci/ahci.c | 2 +- sys/dev/mvs/mvs.c | 2 +- sys/dev/siis/siis.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) Thank you. The primary result of the fix is proper error status reporting now. Retries is a consequence, but the most important one indeed. Thank you for your cooperation, Alexander! I hope this change gets merged to 14.1 (/14.0). +1 to get 14.1 patched for this; this is causing crashes on VMs for me. Looking back in older logs, I see occasional retries which successfully resolve. (Perhaps during a live migration? I don't have visibility into what is going on within the VM hosting side.) Since releng/14.2 branch created (and beta releases of 14.2 appears), I would like someone to MFC this change to this branch (and stable/14). What should I do for this? |