I have a StarTech SATA->dual M.2 SATA adapter https://www.startech.com/en-us/hdd/25s22m2ngffr that has worked well for many releases, but is now failing in 14-CURRENT. I originally encountered the problem upgrading to pfSense 23.01 on a Lanner FW-7575 white-box. I have since obtained another adapter and encountered similar problems on a totally different box (ASUS ROG Z690 Formula) using a plain 14-CURRENT build (20230323 -b5d43972e394). Failing dmesg from pfSense 23.01: pmp0 at ahcich4 bus 0 scbus0 target 15 lun 0 pmp0: <Port Multiplier 10921b21 0008> ATA device pmp0: 300.000MB/s transfers (SATA 2.x, NONE, PIO 8192bytes) pmp0: 2 fan-out ports (aprobe0:ahcich4:0:0:0): SET_MULTI. ACB: c6 00 00 00 00 40 00 00 00 00 01 00 (aprobe0:ahcich4:0:0:0): CAM status: ATA Status Error (aprobe0:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT ) (aprobe0:ahcich4:0:0:0): RES: 51 04 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich4:0:0:0): Retrying command, 0 more tries remain (aprobe0:ahcich4:0:0:0): SET_MULTI. ACB: c6 00 00 00 00 40 00 00 00 00 01 00 (aprobe0:ahcich4:0:0:0): CAM status: ATA Status Error (aprobe0:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT ) (aprobe0:ahcich4:0:0:0): RES: 51 04 00 00 00 00 00 00 00 00 00 (aprobe0:ahcich4:0:0:0): Error 5, Retries exhausted The Z690 board gets a different SET_MULTI error and then spins at "Root mount waiting for: CAM" indefinitely, so the above log is from the firewall hardware since it completes booting without mounting ada1. I couldn't obtain the exact same M.2 SATA drives again, so the different SET_MULTI error may be due to the different drives. In pfSense 22.05 (the last working build not based on 14-CURRENT), the relevant dmesg: pmp0 at ahcich4 bus 0 scbus0 target 15 lun 0 pmp0: <Port Multiplier 10921b21 0008> ATA device pmp0: 300.000MB/s transfers (SATA 2.x, NONE, PIO 8192bytes) pmp0: 2 fan-out ports ... ses0 at ahciem0 bus 0 scbus1 target 0 lun 0 ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device ses0: SEMB SES Device ada0 at ahcich4 bus 0 scbus0 target 0 lun 0 ada0: <WDC WDS500G1R0B-68A4Z0 411000WR> ACS-4 ATA SATA 3.x device ada0: Serial Number 21456H800061 ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes) ada0: Command Queueing enabled ada0: 476940MB (976773168 512 byte sectors) ada1 at ahcich4 bus 0 scbus0 target 1 lun 0 ada1: <WDC WDS500G1R0B-68A4Z0 411000WR> ACS-4 ATA SATA 3.x device ada1: Serial Number 21456H800025 ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes) ada1: Command Queueing enabled ada1: 476940MB (976773168 512 byte sectors) Note that ada1 is found and that ada0 and ada1 are reported as SATA 3.x 600.00MB/s and not SATA 2.x 300.00MB/s as 14-CURRENT is reporting. Happy to help debug on the Z690 since taking the firewall out of operation is family-incompatible.
I set kern.cam.dflags=127 and kern.cam.debug_delay=5000 from the loader prompt. The delay doesn't seem to meaningfully slow things down, but when the Z690 is looping, the following sequence repeats indefinitely: (pmp0:ahcich8:0:15:0): WRITE_BUFFER/PM. ACB: e8 02 00 00 00 01 00 00 00 00 01 00 (pmp0:ahcich8:0:15:0): CAM status: Unconditionally Re-queue Request (pmp0:ahcich8:0:15:0): Retrying command, 1 more tries remain (pmp0:ahcich8:0:15:0): WRITE_BUFFER/PM. ACB: e8 02 00 00 00 01 00 00 00 00 01 00 (aprobe0:ahcich8:0:0:0): SET_MULTI. ACB: c6 00 00 00 00 40 00 00 00 00 01 00 (aprobe0:ahcich8:0:0:0): CAM status: Unconditionally Re-queue Request (aprobe0:ahcich8:0:0:0): Retrying command, 1 more tries remain (aprobe0:ahcich8:0:0:0): SET_MULTI. ACB: c6 00 00 00 00 40 00 00 00 00 01 00 I played a bit with debug_delay and got different results with higher values, but that's probably breaking timings too much.
13.1-RELEASE also works properly. (I had only tried up to 12.3 via pfSense prior.)
Bisected this to 4afa62be71674b76c31ea513d6c470089a6c1848 (cam: Set the CAM_QOS_VALID when valid). For those following along, I had to cherry-pick in two patches during bisection: 1. 02732f945ed2 -- This allows the console to work properly on the ASUS ROG Z690 Formula motherboard 2. 1e4c802913af -- This builds lldb-tblgen, which was necessary because I was building inside a fresh jail
At a guess, someone code is not checking the return status quite right (checking the value vs masking off the status code part of the status). I have a vague memory of fixing a bug like this somewhere... I'll have to check my notes as to where to make sure that got into the tree and didn't drop on the floor.
At a guess, it's one of these needs to change to add parens and & CAM_STATUS_MASK to compare against CAM_REQ_CMP. sys/cam/ata/ata_xpt.c: if (cpi.ccb_h.status == CAM_REQ_CMP && sys/cam/ata/ata_xpt.c: if (work_ccb->ccb_h.status == CAM_REQ_CMP) { sys/cam/cam_periph.c: if (cgd->ccb_h.status == CAM_REQ_CMP) sys/cam/cam_xpt.c: if (cpi.ccb_h.status == CAM_REQ_CMP) { sys/cam/nvme/nvme_da.c: error = (ccb->ccb_h.status == CAM_REQ_CMP) ? 0 : EIO; sys/cam/scsi/scsi_cd.c: if (cpi.ccb_h.status == CAM_REQ_CMP && (cpi.hba_misc & PIM_NO_6_BYTE)) sys/cam/scsi/scsi_da.c: if (cpi.ccb_h.status == CAM_REQ_CMP && (cpi.hba_misc & PIM_NO_6_BYTE)) sys/cam/scsi/scsi_enc_ses.c: if (cgd.ccb_h.status == CAM_REQ_CMP) sys/cam/scsi/scsi_enc_ses.c: if (cdai.ccb_h.status == CAM_REQ_CMP) sys/cam/scsi/scsi_sa.c: if ((cdai.ccb_h.status == CAM_REQ_CMP) sys/cam/scsi/scsi_target.c: if (cab.ccb_h.status == CAM_REQ_CMP) { sys/cam/scsi/scsi_xpt.c: if (cts->ccb_h.status == CAM_REQ_CMP && cts->transport == XPORT_SPI) { sys/cam/scsi/scsi_xpt.c: if (cts->ccb_h.status == CAM_REQ_CMP && cts->transport == XPORT_FC) { sys/cam/scsi/scsi_xpt.c: if (cts->ccb_h.status == CAM_REQ_CMP && cts->transport == XPORT_SAS) { and I think https://reviews.freebsd.org/D37696 was the bug that I was thinking about. sys/cam/scsi/scsi_xpt.c: if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_SPI) { sys/cam/scsi/scsi_xpt.c: if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_FC) { sys/cam/scsi/scsi_xpt.c: if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_SPI) { sys/cam/scsi/scsi_xpt.c: if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_FC) {
Created attachment 241466 [details] proposed patch Does this fix the issue?
Yes, that patch does fix it. Many thanks! There may be other issues given the prevalence of the older pattern: freebsd-src git:(main) git grep "ccb_h.status [^\&]" | wc -l 1224
(In reply to Alex Rosenberg from comment #7) Things aren't that bad. A large number of what you found was setting the value And your re was a bit off, so there's a number of places that do mask correctly. And there's a large number of places where other bits are checked. I think my patch caught all the places, though I agree it would be safer to uninline them. I think this also hits 13.2... we may want a EN
I don't see any commits referencing this bug nor anything relevant in sys/cam, most notably the proposed patch is not committed. why is this closed?
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=fd02926a68c9570f00f7e50d64c1d0673581fb35 commit fd02926a68c9570f00f7e50d64c1d0673581fb35 Author: Warner Losh <imp@FreeBSD.org> AuthorDate: 2023-04-14 14:32:20 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2023-04-15 22:32:41 +0000 cam: Properly mask out the status bits to get completion code ccb_h.status has two parts: the actual status and some addition bits to indicate additional information. It must be masked before comparing against completion codes. Add new inline function cam_ccb_success to simplify this to test whether or not the request succeeded. Most of the code already does this, but a few places don't (the rest likely should be converted to use cam_ccb_status and/or cam_ccb_success, but that's for another day). This caused at least one bug in recognizing devices behind a SATA port multiplexer, though some of these checks were fine with the special knowledge of the code paths involved. PR: 270459 Sponsored by: Netflix MFC After: 1 week (and maybe a EN requst) Reviewed by: ken, mav Differential Revision: https://reviews.freebsd.org/D39572 sys/cam/ata/ata_xpt.c | 4 ++-- sys/cam/cam_ccb.h | 6 ++++++ sys/cam/cam_xpt.c | 2 +- sys/cam/nvme/nvme_da.c | 3 +-- sys/cam/scsi/scsi_enc_ses.c | 4 ++-- 5 files changed, 12 insertions(+), 7 deletions(-)
Ah a push failed that I thought had succeeded. Fixed.