Bug 270459 - SATA Port Multiplier failure
Summary: SATA Port Multiplier failure
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Warner Losh
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2023-03-26 03:55 UTC by Alex Rosenberg
Modified: 2023-04-15 22:34 UTC (History)
3 users (show)

See Also:


Attachments
proposed patch (3.22 KB, patch)
2023-04-14 02:27 UTC, Warner Losh
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Rosenberg 2023-03-26 03:55:41 UTC
I have a StarTech SATA->dual M.2 SATA adapter https://www.startech.com/en-us/hdd/25s22m2ngffr that has worked well for many releases, but is now failing in 14-CURRENT.

I originally encountered the problem upgrading to pfSense 23.01 on a Lanner FW-7575 white-box. I have since obtained another adapter and encountered similar problems on a totally different box (ASUS ROG Z690 Formula) using a plain 14-CURRENT build (20230323 -b5d43972e394).

Failing dmesg from pfSense 23.01:

pmp0 at ahcich4 bus 0 scbus0 target 15 lun 0
pmp0: <Port Multiplier 10921b21 0008> ATA device
pmp0: 300.000MB/s transfers (SATA 2.x, NONE, PIO 8192bytes)
pmp0: 2 fan-out ports
(aprobe0:ahcich4:0:0:0): SET_MULTI. ACB: c6 00 00 00 00 40 00 00 00 00 01 00
(aprobe0:ahcich4:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(aprobe0:ahcich4:0:0:0): RES: 51 04 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich4:0:0:0): Retrying command, 0 more tries remain
(aprobe0:ahcich4:0:0:0): SET_MULTI. ACB: c6 00 00 00 00 40 00 00 00 00 01 00
(aprobe0:ahcich4:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(aprobe0:ahcich4:0:0:0): RES: 51 04 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich4:0:0:0): Error 5, Retries exhausted

The Z690 board gets a different SET_MULTI error and then spins at "Root mount waiting for: CAM" indefinitely, so the above log is from the firewall hardware since it completes booting without mounting ada1. I couldn't obtain the exact same M.2 SATA drives again, so the different SET_MULTI error may be due to the different drives.

In pfSense 22.05 (the last working build not based on 14-CURRENT), the relevant dmesg:

pmp0 at ahcich4 bus 0 scbus0 target 15 lun 0
pmp0: <Port Multiplier 10921b21 0008> ATA device
pmp0: 300.000MB/s transfers (SATA 2.x, NONE, PIO 8192bytes)
pmp0: 2 fan-out ports
...
ses0 at ahciem0 bus 0 scbus1 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ada0 at ahcich4 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS500G1R0B-68A4Z0 411000WR> ACS-4 ATA SATA 3.x device
ada0: Serial Number 21456H800061
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 476940MB (976773168 512 byte sectors)
ada1 at ahcich4 bus 0 scbus0 target 1 lun 0
ada1: <WDC WDS500G1R0B-68A4Z0 411000WR> ACS-4 ATA SATA 3.x device
ada1: Serial Number 21456H800025
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada1: Command Queueing enabled
ada1: 476940MB (976773168 512 byte sectors)

Note that ada1 is found and that ada0 and ada1 are reported as SATA 3.x 600.00MB/s and not SATA 2.x 300.00MB/s as 14-CURRENT is reporting.

Happy to help debug on the Z690 since taking the firewall out of operation is family-incompatible.
Comment 1 Alex Rosenberg 2023-03-26 19:45:42 UTC
I set kern.cam.dflags=127 and kern.cam.debug_delay=5000 from the loader prompt. The delay doesn't seem to meaningfully slow things down, but when the Z690 is looping, the following sequence repeats indefinitely:

(pmp0:ahcich8:0:15:0): WRITE_BUFFER/PM. ACB: e8 02 00 00 00 01 00 00 00 00 01 00
(pmp0:ahcich8:0:15:0): CAM status: Unconditionally Re-queue Request
(pmp0:ahcich8:0:15:0): Retrying command, 1 more tries remain
(pmp0:ahcich8:0:15:0): WRITE_BUFFER/PM. ACB: e8 02 00 00 00 01 00 00 00 00 01 00
(aprobe0:ahcich8:0:0:0): SET_MULTI. ACB: c6 00 00  00 00 40 00 00 00 00 01 00
(aprobe0:ahcich8:0:0:0): CAM status: Unconditionally Re-queue Request
(aprobe0:ahcich8:0:0:0): Retrying command, 1 more tries remain
(aprobe0:ahcich8:0:0:0): SET_MULTI. ACB: c6 00 00  00 00 40 00 00 00 00 01 00

I played a bit with debug_delay and got different results with higher values, but that's probably breaking timings too much.
Comment 2 Alex Rosenberg 2023-03-26 20:05:19 UTC
13.1-RELEASE also works properly. (I had only tried up to 12.3 via pfSense prior.)
Comment 3 Alex Rosenberg 2023-04-13 23:55:15 UTC
Bisected this to 4afa62be71674b76c31ea513d6c470089a6c1848 (cam: Set the CAM_QOS_VALID when valid).

For those following along, I had to cherry-pick in two patches during bisection:

1. 02732f945ed2 -- This allows the console to work properly on the ASUS ROG Z690 Formula motherboard
2. 1e4c802913af -- This builds lldb-tblgen, which was necessary because I was building inside a fresh jail
Comment 4 Warner Losh freebsd_committer freebsd_triage 2023-04-14 01:09:20 UTC
At a guess, someone code is not checking the return status quite right (checking the value vs masking off the status code part of the status).

I have a vague memory of fixing a bug like this somewhere... I'll have to check my notes as to where to make sure that got into the tree and didn't drop on the floor.
Comment 5 Warner Losh freebsd_committer freebsd_triage 2023-04-14 02:06:49 UTC
At a guess, it's one of these needs to change to add parens and & CAM_STATUS_MASK to compare against CAM_REQ_CMP.

sys/cam/ata/ata_xpt.c:			if (cpi.ccb_h.status == CAM_REQ_CMP &&
sys/cam/ata/ata_xpt.c:			if (work_ccb->ccb_h.status == CAM_REQ_CMP) {
sys/cam/cam_periph.c:		if (cgd->ccb_h.status == CAM_REQ_CMP)
sys/cam/cam_xpt.c:	if (cpi.ccb_h.status == CAM_REQ_CMP) {
sys/cam/nvme/nvme_da.c:		error = (ccb->ccb_h.status == CAM_REQ_CMP) ? 0 : EIO;
sys/cam/scsi/scsi_cd.c:	if (cpi.ccb_h.status == CAM_REQ_CMP && (cpi.hba_misc & PIM_NO_6_BYTE))
sys/cam/scsi/scsi_da.c:	if (cpi.ccb_h.status == CAM_REQ_CMP && (cpi.hba_misc & PIM_NO_6_BYTE))
sys/cam/scsi/scsi_enc_ses.c:			if (cgd.ccb_h.status == CAM_REQ_CMP)
sys/cam/scsi/scsi_enc_ses.c:		if (cdai.ccb_h.status == CAM_REQ_CMP)
sys/cam/scsi/scsi_sa.c:		if ((cdai.ccb_h.status == CAM_REQ_CMP)
sys/cam/scsi/scsi_target.c:	if (cab.ccb_h.status == CAM_REQ_CMP) {
sys/cam/scsi/scsi_xpt.c:	if (cts->ccb_h.status == CAM_REQ_CMP && cts->transport == XPORT_SPI) {
sys/cam/scsi/scsi_xpt.c:	if (cts->ccb_h.status == CAM_REQ_CMP && cts->transport == XPORT_FC) {
sys/cam/scsi/scsi_xpt.c:	if (cts->ccb_h.status == CAM_REQ_CMP && cts->transport == XPORT_SAS) {

and I think https://reviews.freebsd.org/D37696 was the bug that I was thinking about.
sys/cam/scsi/scsi_xpt.c:	if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_SPI) {
sys/cam/scsi/scsi_xpt.c:	if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_FC) {
sys/cam/scsi/scsi_xpt.c:	if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_SPI) {
sys/cam/scsi/scsi_xpt.c:	if (cts.ccb_h.status == CAM_REQ_CMP && cts.transport == XPORT_FC) {
Comment 6 Warner Losh freebsd_committer freebsd_triage 2023-04-14 02:27:39 UTC
Created attachment 241466 [details]
proposed patch

Does this fix the issue?
Comment 7 Alex Rosenberg 2023-04-14 05:03:08 UTC
Yes, that patch does fix it. Many thanks!

There may be other issues given the prevalence of the older pattern:

freebsd-src git:(main) git grep "ccb_h.status [^\&]" | wc -l
    1224
Comment 8 Warner Losh freebsd_committer freebsd_triage 2023-04-14 17:01:41 UTC
(In reply to Alex Rosenberg from comment #7)
Things aren't that bad.

A large number of what you found was setting the value
And your re was a bit off, so there's a number of places that do mask correctly.
And there's a large number of places where other bits are checked.

I think my patch caught all the places, though I agree it would be safer to uninline them.

I think this also hits 13.2... we may want a EN
Comment 9 Mateusz Guzik freebsd_committer freebsd_triage 2023-04-15 22:16:47 UTC
I don't see any commits referencing this bug nor anything relevant in sys/cam, most notably the proposed patch is not committed. why is this closed?
Comment 10 commit-hook freebsd_committer freebsd_triage 2023-04-15 22:33:40 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=fd02926a68c9570f00f7e50d64c1d0673581fb35

commit fd02926a68c9570f00f7e50d64c1d0673581fb35
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2023-04-14 14:32:20 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2023-04-15 22:32:41 +0000

    cam: Properly mask out the status bits to get completion code

    ccb_h.status has two parts: the actual status and some addition bits to
    indicate additional information. It must be masked before comparing
    against completion codes. Add new inline function cam_ccb_success to
    simplify this to test whether or not the request succeeded. Most of the
    code already does this, but a few places don't (the rest likely should
    be converted to use cam_ccb_status and/or cam_ccb_success, but that's
    for another day). This caused at least one bug in recognizing devices
    behind a SATA port multiplexer, though some of these checks were
    fine with the special knowledge of the code paths involved.

    PR:                     270459
    Sponsored by:           Netflix
    MFC After:              1 week (and maybe a EN requst)
    Reviewed by:            ken, mav
    Differential Revision:  https://reviews.freebsd.org/D39572

 sys/cam/ata/ata_xpt.c       | 4 ++--
 sys/cam/cam_ccb.h           | 6 ++++++
 sys/cam/cam_xpt.c           | 2 +-
 sys/cam/nvme/nvme_da.c      | 3 +--
 sys/cam/scsi/scsi_enc_ses.c | 4 ++--
 5 files changed, 12 insertions(+), 7 deletions(-)
Comment 11 Warner Losh freebsd_committer freebsd_triage 2023-04-15 22:34:36 UTC
Ah a push failed that I thought had succeeded. Fixed.