Bug 243401 - [patch] ahci driver problems with Marvell 88SE9230 (Dell BOSS-S1)
Summary: [patch] ahci driver problems with Marvell 88SE9230 (Dell BOSS-S1)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Dave Cottlehuber
URL: https://reviews.freebsd.org/D37585
Keywords: patch
Depends on:
Blocks:
 
Reported: 2020-01-16 21:32 UTC by Peter Eriksson
Modified: 2023-02-26 10:33 UTC
CC List: 11 users

See Also:


Attachments
Patch for AHCI driver to make Dell BOSS-S1 detect unconfigure disks (5.01 KB, patch)
2020-12-21 23:41 UTC, Peter Eriksson
dmesg.boot (54.99 KB, text/plain)
2021-01-12 18:02 UTC, Peter Eriksson
Version 2 of patch (with debugging printfs) (17.60 KB, patch)
2021-01-12 18:06 UTC, Peter Eriksson
updated patch compatible with stable/13 (18.93 KB, patch)
2022-11-10 09:32 UTC, Lorenzo Perone
committed to address above issue, tested on a variety of Dell h/w and firmwares (2.84 KB, patch)
2023-02-10 21:11 UTC, Dave Cottlehuber
dch: maintainer-approval+

Description Peter Eriksson 2020-01-16 21:32:33 UTC
This feels more like a firmware problem than a driver problem, but since it apparently works in Windows and Linux but not in FreeBSD, I figured I'd report it here anyway.

(Probably not meaningful to try to report it to Dell since FreeBSD isn't officially supported by them)


Dell BOSS-S1 (Marvell 88SE9230 based) M.2 "RAID" cards running Dell's latest firmware (2.5.13.3022 A06 or 2.5.13.3020 A05) do something strange: once the kernel has loaded (from the drives on this card), it fails to detect the disks ("unconfigured" disks, non-RAID setup) and mounting the root fs then fails.

(We have two M.2 SSDs connected to that controller)


With firmware 2.5.13.3016 A04 it gives a couple of errors at kernel boot time, but does detect the disks and the system boots.

With firmware 2.5.13.3011 A03, 2.5.13.2009 A02 or 2.5.13.2008 A01 no errors are printed and the disks are found just fine.

(But there are bugs fixed in the later releases that would probably be nice to have. I have had an M.2 drive go "offline" on the 2008/A01 firmware, which is why I tried the later versions.)


A summary of the (Dell) firmware fixes and my test results:

2.5.13.3022 A06
  Fixes: None
  Enhancement: Added support for 15G platforms

2.5.13.3020 A05
  Status:
  - Does not work, gives errors:
    - 'ahcich16: stopping AHCI engine failed'
  - Detects a 'pass23', but no disk devices:
      pass23 at ahcich16 bus 0 scbus19 target 0 lun 0
      pass23: <Marvell Console 1.01> Removable Processor SCSI device
      pass23: Serial Number HKDP221516WL
      pass23: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)
  Fixes:
  - Fixed an issue where system will hang during
    Boot when PERC is in HBA mode with BOSS-S1
  - When CLI is running, default temporary file
    directory & permission in Linux and ESXi Operating
    systems are changed as appropriate
  Enhancement: N/A

2.5.13.3016 A04
  Status:
  - Works and detects all disks, but gives errors about:
    - 'ahcich14: stopping AHCI engine failed'
    - 'ahcich15: stopping AHCI engine failed'
    - 'ahcich16: stopping AHCI engine failed'
  Fixes:
  - Fixed a behavior of BOSS-S1 firmware incorrectly marking M.2 drive offline/failed
  - Fixed a behavior where ESXi Host goes unresponsive
  - Fixed a behavior where BOSS-S1 Management path will not respond to Management commands
  - Fixed a behavior where BOSS-S1 boot partition becomes inaccessible
  - Fixed a behavior where ESXi host results in PSOD due to unexpected I/O timeout
  - Fixed a behavior where rebuild will not proceed during an error handling condition
  Enhancement:
  - Enhanced/ Added MVCLI events for command timeout
  - Added SLES15 Support

2.5.13.3011 A03
  Status:
  - Works
  Fixes:
  - Fixed M.2 disk failure when medium error is present
  Enhancement:
  - Enhanced medium error handling

2.5.13.2009 A02
  Status:
  - Works
  Fixes:
  - Fixed Sideband functionality issue
  Enhancement:
  - Added support for Rollback of Controller Firmware through iDRAC/LC

2.5.13.2008 A01
  Status:
  - Works
  Initial release


Kernel boot output (the relevant parts) from a firmware 3016 A04 boot:

ahci2: <Marvell 88SE9230 AHCI SATA controller> port 0x8028-0x802f,0x8034-0x8037,0x8020-0x8027,0x8030-0x8033,0x8000-0x801f mem 0xb8800000-0xb88007ff at device 0.0 numa-domain 0 on pci9
ahci2: AHCI v1.20 with 3 6Gbps ports, Port Multiplier not supported
ahci2: quirks=0x900<NOBSYRES,ALTSIG>
ahcich14: <AHCI channel> at channel 0 on ahci2
ahcich15: <AHCI channel> at channel 1 on ahci2
ahcich16: <AHCI channel> at channel 2 on ahci2
...
ahcich16: stopping AHCI engine failed
ahcich16: stopping AHCI engine failed
...
ahcich16: stopping AHCI engine failed
ahcich15: stopping AHCI engine failed
ada0 at ahcich14 bus 0 scbus17 target 0 lun 0
ada0: <SSDSCKJB120G7R N201DL43> ACS-3 ATA SATA 3.x device
ada0: Serial Number PHDW817002Z4150A
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 114473MB (234441648 512 byte sectors)
ada1 at ahcich15 bus 0 scbus18 target 0 lun 0
ada1: <SSDSCKJB120G7R N201DL43> ACS-3 ATA SATA 3.x device
ada1: Serial Number PHDW817002WC150A
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada1: Command Queueing enabled
ada1: 114473MB (234441648 512 byte sectors)
pass25 at ahcich16 bus 0 scbus19 target 0 lun 0
pass25: <Marvell Console 1.01> Removable Processor SCSI device
pass25: Serial Number HKDP221516WL
pass25: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)


On 3022 the ada0 and ada1 devices never get detected, and it only complains about not being able to stop ahcich16, nothing about 14 & 15.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2020-01-17 13:00:28 UTC
Cc'ing the committers most involved with AHCI to ask for a quick review.
Comment 2 F Sidoli 2020-05-28 07:29:04 UTC
Hi Peter,

FWIW, I'm also having this same issue on a PE R740xd2 server with the BOSS-S1. 

In my particular case, I have the BOSS (in JBOD) and an H730P PERC (in HBA mode) installed. I wanted to replace the latter with an HBA330, as I can't encrypt the disks at the disk level (they're SEDs) without the PERC locking them out and not passing them through to the OS on power cycle.

Anyway, with the HBA installed the system partially boots and then fails, but it does just fine with the RAID card put back in. 

If I try to do a fresh install with the HBA in then I can't see the BOSS cards at all, so can't install to them. With the RAID card in there's no problem. 

If I firmware patch the BOSS to latest the system just doesn't boot at all regardless of which card I have in. 

Quite what this all means I don't know. I'm not sure if it's a driver issue in FreeNAS or if Dell are breaking things. I suspect a little bit of A and a little bit of B.

That being said, I have been told by Dell that one of their engineers has managed to get a test system of theirs set up running the latest BOSS firmware and FreeNAS 11.3. Seems to me they have the BOSS in a RAID 1 and the legacy BIOS option set. 

Once verified I will repost here.
Comment 3 Peter Eriksson 2020-12-17 18:44:06 UTC
Just a quick note that this issue is still present in FreeBSD 12.2 and with the latest Dell BOSS-S1 firmware (A07 / 2.5.13.3024)

With a configured RAID volume it works - but not with "raw" disks - they just don't show up.

This makes it impossible to use the BOSS disks for ZFS-mirrored boot/root pools 

(Since it only supports one RAID volume that can be RAID 0 or RAID 1).

- Peter
Comment 4 Peter Eriksson 2020-12-21 23:41:04 UTC
Created attachment 220793 [details]
Patch for AHCI driver to make Dell BOSS-S1 detect unconfigure disks

Please find enclosed a patch that makes (at least on my systems) FreeBSD 12.2 detect unconfigured disks on a Dell BOSS-S1 card running the latest Dell firmware (v7).

The patch basically increases the time limit for the loop when initializing/probing the card for devices. It seems with firmware v5 and later the card takes a lot longer to detect disks after a reset.

The patch also adds a "debug.ahci_verbose" flag and adds some more verbose prints so one can "follow" what happens at probe time. 

With firmware v4 (and an older version of the patch without modified timeouts) the probing looks like this:

ahcich14: AHCI reset...
ahcich14: SATA status changed 00000133
ahcich14: SATA connect time=0us status=00000133
ahcich14: AHCI reset: device found
ahcich14: AHCI reset: device ready after 0ms
ahcich15: AHCI reset...
ahcich15: SATA status changed 00000133
ahcich15: SATA connect time=0us status=00000133
ahcich15: AHCI reset: device found
ahcich15: AHCI reset: device ready after 0ms
ahcich16: AHCI reset...
ahcich16: SATA status changed 00000113
ahcich16: SATA connect time=0us status=00000113
ahcich16: AHCI reset: device found
ahcich16: AHCI reset: device ready after 0ms

With the latest firmware and this patch in use:

ahci2: <Marvell 88SE9230 AHCI SATA controller> port 0x7028-0x702f,0x7034-0x7037,0x7020-0x7027,0x7030-0x7033,0x7000-0x701f mem 0xab200000-0xab2007ff at device 0.0 numa-domain 0 on pci6
ahci2: AHCI v1.20 with 3 6Gbps ports, Port Multiplier not supported
ahci2: quirks=0x200900<NOBSYRES,ALTSIG,MRVL_SR_DEL>
ahci2: Caps: 64bit NCQ 6Gbps PMD 32cmd 3ports
ahci2: Caps2:

ahcich14: <AHCI channel> at channel 0 on ahci2
ahcich14: Caps: CPD
ahcich15: <AHCI channel> at channel 1 on ahci2
ahcich15: Caps: CPD
ahcich16: <AHCI channel> at channel 2 on ahci2
ahcich16: Caps: CPD

ahcich14: AHCI reset...
ahcich14: SATA status changed 00000000
ahcich14: SATA status changed 00000001
ahcich14: SATA status changed 00000133
ahcich14: SATA connect timeout time=212300us status=00000133
ahcich14: AHCI reset: device not found

ahcich15: AHCI reset...
ahcich15: SATA status changed 00000000
ahcich15: SATA status changed 00000001
ahcich15: SATA status changed 00000133
ahcich15: SATA connect timeout time=212000us status=00000133
ahcich15: AHCI reset: device not found

ahcich16: AHCI reset...
ahcich16: SATA status changed 00000000
ahcich16: SATA status changed 00000113
ahcich16: SATA connect time=100us status=00000113
ahcich16: AHCI reset: device found
ahcich16: AHCI reset: device ready after 0ms
ahcich16: stopping AHCI engine failed

pass2 at ahcich16 bus 0 scbus18 target 0 lun 0
pass2: <Marvell Console 1.01> Removable Processor SCSI device
pass2: Serial Number HKDP221516WL
pass2: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)

ada0 at ahcich14 bus 0 scbus16 target 0 lun 0
ada0: <MTFDDAV480TDS D3DJ004> ACS-4 ATA SATA 3.x device
ada0: Serial Number 202729652D1E
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 457862MB (937703088 512 byte sectors)

ada1 at ahcich15 bus 0 scbus17 target 0 lun 0
ada1: <MTFDDAV480TDS D3DJ004> ACS-4 ATA SATA 3.x device
ada1: Serial Number 202729652D52
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 457862MB (937703088 512 byte sectors)

pass4 at ahcich16 bus 0 scbus18 target 0 lun 0
pass4: <Marvell Console 1.01> Removable Processor SCSI device
pass4: Serial Number HKDP221516WL
pass4: 150.000MB/s transfers (SATA 1.x, UDMA4, ATAPI 12bytes, PIO 8192bytes)

(It still claims no device found, but the disks do show up anyway, so the patch probably needs some more fine-tuning. At least one can access the disks now...)

Note the: "time=212300us"
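
For anyone reading along: the status= values in these traces are raw SATA SStatus register contents. As a reading aid, here is a small user-space sketch that splits such a value into the DET, SPD and IPM fields as laid out in the AHCI specification. The struct and function names are made up for illustration and are not taken from ahci.c.

```c
#include <stdint.h>

/* Illustrative helper, not driver code: field layout of the SStatus register. */
struct sstatus_fields {
	unsigned det;	/* bits 3:0  - device detection state */
	unsigned spd;	/* bits 7:4  - negotiated interface speed (generation) */
	unsigned ipm;	/* bits 11:8 - interface power management state */
};

static struct sstatus_fields
sstatus_decode(uint32_t ssts)
{
	struct sstatus_fields f;

	f.det = ssts & 0xf;
	f.spd = (ssts >> 4) & 0xf;
	f.ipm = (ssts >> 8) & 0xf;
	return (f);
}

/* Human-readable name for the DET field. */
static const char *
sstatus_det_name(unsigned det)
{
	switch (det) {
	case 0x0:
		return ("no device");
	case 0x1:
		return ("device present, no PHY communication");
	case 0x3:
		return ("device present, PHY communication established");
	case 0x4:
		return ("PHY offline");
	default:
		return ("reserved");
	}
}
```

Read this way, 0x00000001 is DET=1 (device detected, link not yet up), 0x00000133 is DET=3 at Gen3 speed, and 0x00000113 is DET=3 at Gen1 speed, which fits the 150MB/s "Marvell Console" device on ahcich16.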
Comment 5 Alexander Motin freebsd_committer freebsd_triage 2020-12-30 17:40:34 UTC
Peter, your patch is inconsistent in the timeout values.  You are using 10000 on line 2614, but 5000 on lines 2634 and 2638.  That discrepancy may potentially cause random weird effects, though it does not explain what I see in the provided messages.  I cannot match them to the patch provided.  I guess you were running something different.

Also, a synchronous wait of half a second for every empty port in a system is not great. JFYI, IIRC VMware emulates AHCI with 31 ports.  Have you tried to measure the time between the "SATA status changed 00000000" and "SATA status changed 00000001" messages?  If that happens quickly, it would allow us to leave the timeout on line 2634 unchanged and avoid waiting for devices on completely empty ports.
Comment 6 Peter Eriksson 2021-01-01 22:06:00 UTC
Yes, I've since changed my patch a bit so that it:

Only sets the timeout to 5000 (from 1000) if:
  1) quirk AHCI_Q_SLOWDEV (a new one) is set - only set on the Marvell 88SE9230
  2) only does this _after_ the first status change occurs (0x00000000 -> 0x00000001)
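
The rule above can be sketched in plain C roughly as follows. This only illustrates the idea and is not the code from the patch: the flag's bit value, the constants and the function names are placeholders, and the 100us poll step is an assumption chosen so that 1000 and 5000 iterations correspond to roughly 100ms and 500ms.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_Q_SLOWDEV		0x1u	/* placeholder bit, not the real AHCI_Q_SLOWDEV value */
#define EX_POLL_STEP_US		100	/* assumed delay per poll iteration */
#define EX_POLLS_DEFAULT	1000	/* about 100 ms */
#define EX_POLLS_SLOWDEV	5000	/* about 500 ms */

/*
 * Pick the number of poll iterations for the connect loop.
 * The longer limit applies only when the controller carries the
 * slow-device quirk AND the port has already shown a status change.
 */
static int
connect_poll_limit(uint32_t quirks, bool seen_status_change)
{
	if ((quirks & EX_Q_SLOWDEV) != 0 && seen_status_change)
		return (EX_POLLS_SLOWDEV);
	return (EX_POLLS_DEFAULT);
}

/* Convert an iteration count into the resulting wait in milliseconds. */
static int
poll_limit_ms(int polls)
{
	return (polls * EX_POLL_STEP_US / 1000);
}
```

With these numbers an empty port, or any controller without the quirk, still gives up after about 100ms, while a BOSS port that has reported the first status change gets up to about 500ms, enough for the ~212ms seen in the traces.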

Now the trace looks something like this (some more debugging prints added):

ahcich14: AHCI reset...
ahcich14: AHCI engine: stopping
ahcich14: stopping AHCI engine: ci: -1 -> 0 at time 10us
ahcich14: stopping AHCI engine: sact: -1 -> 0 at time 10us
ahcich14: stopping AHCI engine: ccs: -1 -> 0 at time 10us
ahcich14: stopping AHCI engine: cr: -1 -> 0 at time 10us
ahcich14: AHCI engine stopped at time 10us
ahcich14: SATA changed status 0x00000000 -> 0x00000001 at time=100us
ahcich14: SATA changed status 0x00000001 -> 0x00000133 at time=212500us
ahcich14: SATA connect status 0x00000133 at time=212500us
ahcich14: AHCI reset: device found
ahcich14: AHCI reset: device ready after 0ms
ahcich14: AHCI engine(fbs=1): starting

ahcich15: AHCI reset...
ahcich15: AHCI engine: stopping
ahcich15: stopping AHCI engine: ci: -1 -> 0 at time 10us
ahcich15: stopping AHCI engine: sact: -1 -> 0 at time 10us
ahcich15: stopping AHCI engine: ccs: -1 -> 0 at time 10us
ahcich15: stopping AHCI engine: cr: -1 -> 0 at time 10us
ahcich15: AHCI engine stopped at time 10us
ahcich15: SATA changed status 0x00000000 -> 0x00000001 at time=100us
ahcich15: SATA changed status 0x00000001 -> 0x00000133 at time=221400us
ahcich15: SATA connect status 0x00000133 at time=221400us
ahcich15: AHCI reset: device found
ahcich15: AHCI reset: device ready after 0ms
ahcich15: AHCI engine(fbs=1): starting

ahcich16: AHCI reset...
ahcich16: AHCI engine: stopping
ahcich16: stopping AHCI engine: ci: -1 -> 0 at time 10us
ahcich16: stopping AHCI engine: sact: -1 -> 0 at time 10us
ahcich16: stopping AHCI engine: ccs: -1 -> 0 at time 10us
ahcich16: stopping AHCI engine: cr: -1 -> 0 at time 10us
ahcich16: AHCI engine stopped at time 10us
ahcich16: SATA changed status 0x00000000 -> 0x00000113 at time=100us
ahcich16: SATA connect status 0x00000113 at time=100us
ahcich16: AHCI reset: device found
ahcich16: AHCI reset: device ready after 0ms
ahcich16: AHCI engine(fbs=1): starting

Btw,
I've been testing some different variants of settings for this controller - for example I removed the quirk (ALTSIG) to see if that would make any difference but it doesn't seem to matter if it's set or not. Anyone know where that quirk comes from?

I'll upload a cleaned up version of an improved patch soon.
Comment 7 Peter Eriksson 2021-01-01 22:14:07 UTC
Btw, I noticed one other little thing while reading the source code for ahci.c:

ahci_start(ch, fbs) gets called with fbs=1 in all spots in the code, except for one spot in ahci_execute_transaction() around line 1650 or so when it receives an ATA_A_RESET - then it sets fbs to 0 (zero). If I'm reading the code correctly then this would cause "FIS-based switching" to be disabled from that time on - if that ever happens?

Just for testing I changed that to 1 - but I don't see much of a difference in behaviour. Granted I haven't tested _that_ much. Probably unrelated to this issue anyway.
Comment 8 Alexander Motin freebsd_committer freebsd_triage 2021-01-04 17:22:33 UTC
I still see the patch from December 21.  Where is the updated version?

It is good that DEV_PRESENT is reported fast enough.  Since that means the timeout need not be increased when no device is present, I am thinking about just increasing the timeout slightly instead of adding the quirk.

ALTSIG quirk was required by some early Marvell controllers or firmware versions. I haven't retested it on recent ones, it may no longer be required.

I don't remember why I disabled FBS there; it's been a while. My guess is that it was to keep other commands from trying to execute during soft reset.  If you look lower, you'll see the "Kick controller into sane state and enable FBS." comment, where ahci_start(ch, 1) should be called on line 2121 after the soft reset completes.

PS: Why do you need AHCI-specific verbosity tunable?  Why not just enable it globally?
Comment 9 Peter Eriksson 2021-01-12 18:02:20 UTC
Sorry, have been busy with another problem (an HP server with a lot of disks panicking due to bugs in other parts of the kernel - sigh). I'll get back with a better patch soon.

Anyway, the way I changed it was basically to change the loop:

1. DELAY(10) instead of DELAY(100) - it takes about 30us for status to change from 0 -> 1, so no need to wait the full 100us :-)

2. Only change the "full timeout" from 100ms to 500ms after the ATA_SS_DET_MASK field has changed from ATA_SS_DET_NO_DEVICE.
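
As a rough illustration of those two changes, here is a user-space simulation of such a polling loop. It is a sketch, not the patch itself: the "hardware" is a fake port that flips its status at given times, all names and constants are invented for the example, and the 100ms/500ms limits simply follow the description above.

```c
#include <stdint.h>

#define SIM_DET_MASK		0x0000000fu
#define SIM_DET_NO_DEVICE	0x0u
#define SIM_DET_ONLINE		0x3u	/* device present, link established */

#define SIM_STEP_US		10	/* stands in for DELAY(10) */
#define SIM_SHORT_TIMEOUT_US	100000	/* 100 ms while nothing is detected */
#define SIM_LONG_TIMEOUT_US	500000	/* 500 ms once DET has left "no device" */

/* Simulated port: when (in us after reset) each status transition happens. */
struct sim_port {
	long present_at_us;	/* status becomes 0x001; negative = never */
	long online_at_us;	/* status becomes 0x133; negative = never */
};

/* Fake register read: returns the status the port would show at now_us. */
static uint32_t
sim_read_status(const struct sim_port *p, long now_us)
{
	if (p->online_at_us >= 0 && now_us >= p->online_at_us)
		return (0x00000133);
	if (p->present_at_us >= 0 && now_us >= p->present_at_us)
		return (0x00000001);
	return (0x00000000);
}

/*
 * Poll until the link is up or the timeout expires.
 * Returns 1 if a device connected, 0 otherwise; *elapsed_us gets the wait time.
 */
static int
sim_sata_connect(const struct sim_port *p, long *elapsed_us)
{
	long limit = SIM_SHORT_TIMEOUT_US;
	long t;
	uint32_t det;

	for (t = 0; t < limit; t += SIM_STEP_US) {
		det = sim_read_status(p, t) & SIM_DET_MASK;
		if (det == SIM_DET_ONLINE) {
			*elapsed_us = t;
			return (1);
		}
		/* Extend the deadline only after something showed up. */
		if (det != SIM_DET_NO_DEVICE)
			limit = SIM_LONG_TIMEOUT_US;
	}
	*elapsed_us = t;
	return (0);
}
```

Feeding it a port that goes 0 -> 1 at 30us and 1 -> 0x133 at 211790us (the ahcich14 numbers below) reports a connect after 211790us, while a port that never leaves status 0 is abandoned after the short 100ms limit.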


My current version of the patch contains a lot of debugging printouts, so it's probably not really suitable for production use (it makes it easier to watch what's happening though :-)

A sample of the dmesg output:


First, an unused AHCI channel/controller without devices:

ahcich13: AHCI engine: stopping
ahcich13: stopping AHCI engine: cr: 1 -> 0 at time 10 us
ahcich13: AHCI engine stopped at time 10 us
ahcich13: ahci_sata_phy_reset: Start
ahcich13: ahci_sata_connect: Start
ahcich13: SATA connect timeout status 0x00000000 at time=10000us
ahcich13: ahci_sata_connect: Done (0)
ahcich13: ahci_sata_phy_reset: Done (0)
ahcich13: AHCI reset: device not found
ahcich13: ahci_reset: Done (ahci_sata_phy_reset failed)


First port on the BOSS:

ahcich14: ahciaction: Calling ahci_reset (XPT_RESET_BUS)
ahcich14: ahci_reset: Start
ahcich14: AHCI reset...
ahcich14: AHCI engine: stopping
ahcich14: stopping AHCI engine: cr: 1 -> 0 at time 10 us
ahcich14: AHCI engine stopped at time 10 us
ahcich14: ahci_sata_phy_reset: Start
ahcich14: ahci_sata_connect: Start
ahcich14: SATA changed status 0x00000000 -> 0x00000001 at time=30us
ahcich14: SATA changed status 0x00000001 -> 0x00000133 at time=211790us
ahcich14: SATA connect status 0x00000133 at time=211790us
ahcich14: ahci_sata_connect: Done (1)
ahcich14: ahci_sata_phy_reset: Done (1)
ahcich14: AHCI reset: device found
ahcich14: AHCI reset: device ready after 0ms
ahcich14: AHCI engine(fbs=1): starting
ahcich14: ahci_start: Done
ahcich14: ahci_reset: Done


Then it resets the second port/channel (and starts doing stuff on the first port at the same time):

ahcich15: ahciaction: Calling ahci_reset (XPT_RESET_BUS)
ahcich15: ahci_reset: Start
ahcich15: AHCI reset...
ahcich15: AHCI engine: stopping
ahcich15: stopping AHCI engine: cr: 1 -> 0 at time 10 us
ahcich15: AHCI engine stopped at time 10 us
ahcich15: ahci_sata_phy_reset: Start
ahcich14: ahci_execute_transaction: Kicking controller into sane state
ahcich14: AHCI engine: stopping
ahcich14: stopping AHCI engine: cr: 1 -> 0 at time 10 us
ahcich14: AHCI engine stopped at time 10 us
ahcich14: ahci_clo: Start
ahcich14: ahci_clo: Done
ahcich14: AHCI engine(fbs=0): starting
ahcich14: ahci_start: Done
ahcich15: ahci_sata_connect: Start
ahcich15: SATA changed status 0x00000000 -> 0x00000001 at time=30us
ahcich14: ahci_end_transaction: Reinit port (eslots=00000004)
ahcich14: AHCI engine: stopping
ahcich15: SATA changed status 0x00000001 -> 0x00000133 at time=220650us
ahcich15: SATA connect status 0x00000133 at time=220650us
ahcich15: ahci_sata_connect: Done (1)
ahcich15: ahci_sata_phy_reset: Done (1)
ahcich15: AHCI reset: device found
ahcich15: AHCI reset: device ready after 0ms
ahcich15: AHCI engine(fbs=1): starting
ahcich15: ahci_start: Done
ahcich15: ahci_reset: Done


And then the third (ses?) port - there are only two ports on this controller:

ahcich16: ahciaction: Calling ahci_reset (XPT_RESET_BUS)
ahcich16: ahci_reset: Start
ahcich16: AHCI reset...
ahcich16: AHCI engine: stopping
ahcich16: stopping AHCI engine: cr: 1 -> 0 at time 10 us
ahcich16: AHCI engine stopped at time 10 us
ahcich16: ahci_sata_phy_reset: Start
ahcich16: ahci_sata_connect: Start
ahcich16: SATA changed status 0x00000000 -> 0x00000113 at time=70us
ahcich16: SATA connect status 0x00000113 at time=70us
ahcich16: ahci_sata_connect: Done (1)
ahcich16: ahci_sata_phy_reset: Done (1)
ahcich16: AHCI reset: device found
ahcich16: AHCI reset: device ready after 0ms
ahcich16: AHCI engine(fbs=1): starting
ahcich16: ahci_start: Done
ahcich16: ahci_reset: Done


But then things are a bit strange - notice the 1s timeouts (I increased the max timeout to 1s in this test boot):

Root mount waiting for: CAM usbus0
uhub0: 26 ports with 26 removable, self powered
ahcich14: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich14: ahci_clo: Start
ahcich14: ahci_clo: Done
ahcich14: AHCI engine(fbs=1): starting
ahcich14: ahci_start: Done
ahcich15: ahci_execute_transaction: Kicking controller into sane state
ahcich15: AHCI engine: stopping
ugen0.2: <Kingston DataTraveler 2.0> at usbus0
umass0 numa-domain 0 on uhub0
umass0: <Kingston DataTraveler 2.0, class 0/0, rev 2.00/1.00, addr 1> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0xc000
umass0:20:0: Attached to scbus20
Root mount waiting for: CAM usbus0
ahcich15: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich15: ahci_clo: Start
ahcich15: ahci_clo: Done
ahcich15: AHCI engine(fbs=0): starting
ahcich15: ahci_start: Done
ahcich15: ahci_end_transaction: Reinit port (eslots=00000004)
ahcich15: AHCI engine: stopping
ugen0.3: <vendor 0x1604 product 0x10c0> at usbus0
uhub1 numa-domain 0 on uhub0
uhub1: <vendor 0x1604 product 0x10c0, class 9/0, rev 2.00/0.00, addr 2> on usbus0
Root mount waiting for: CAM usbus0
ahcich15: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich15: ahci_clo: Start
ahcich15: ahci_clo: Done
ahcich15: AHCI engine(fbs=1): starting
ahcich15: ahci_start: Done
ahcich15: ahci_execute_transaction: Kicking controller into sane state
ahcich15: AHCI engine: stopping
Root mount waiting for: CAM usbus0
ahcich15: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich15: ahci_clo: Start
ahcich15: ahci_clo: Done
ahcich15: AHCI engine(fbs=0): starting
ahcich15: ahci_start: Done
uhub1: 4 ports with 4 removable, self powered
Root mount waiting for: CAM usbus0
ugen0.4: <vendor 0x1604 product 0x10c0> at usbus0
uhub2 numa-domain 0 on uhub1
uhub2: <vendor 0x1604 product 0x10c0, class 9/0, rev 2.00/0.00, addr 3> on usbus0
Root mount waiting for: CAM usbus0
ahcich15: ahci_end_transaction: Reinit port (eslots=00000010)
ahcich15: AHCI engine: stopping
Root mount waiting for: CAM usbus0
ahcich15: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich15: ahci_clo: Start
ahcich15: ahci_clo: Done
ahcich15: AHCI engine(fbs=1): starting
ahcich15: ahci_start: Done
ahcich16: ahci_execute_transaction: Kicking controller into sane state
ahcich16: AHCI engine: stopping
ahcich16: stopping AHCI engine: cr: 1 -> 0 at time 10 us
ahcich16: AHCI engine stopped at time 10 us
ahcich16: ahci_clo: Start
ahcich16: ahci_clo: Done
ahcich16: AHCI engine(fbs=0): starting
ahcich16: ahci_start: Done
ahcich16: ahci_end_transaction: Reinit port (eslots=00000004)
ahcich16: AHCI engine: stopping
uhub2: 4 ports with 4 removable, self powered
Root mount waiting for: CAM usbus0
ugen0.5: <vendor 0x1604 product 0x10c0> at usbus0
uhub3 numa-domain 0 on uhub1
uhub3: <vendor 0x1604 product 0x10c0, class 9/0, rev 2.00/0.00, addr 4> on usbus0
ahcich16: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich16: ahci_clo: Start
ahcich16: ahci_clo: Done
ahcich16: AHCI engine(fbs=1): starting
ahcich16: ahci_start: Done
ahcich16: ahci_execute_transaction: Kicking controller into sane state
ahcich16: AHCI engine: stopping
Root mount waiting for: CAM usbus0
ahcich16: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich16: ahci_clo: Start
ahcich16: ahci_clo: Done
ahcich16: AHCI engine(fbs=0): starting
ahcich16: ahci_start: Done
Root mount waiting for: CAM usbus0
uhub3: 4 ports with 4 removable, self powered
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ahcich16: ahci_end_transaction: Reinit port (eslots=00000010)
ahcich16: AHCI engine: stopping
Root mount waiting for: CAM
ahcich16: stopping AHCI engine: timeout at 1000000 us (cr=1, ccs=0, ci=0, sact=0)
ahcich16: ahci_clo: Start
ahcich16: ahci_clo: Done
ahcich16: AHCI engine(fbs=1): starting
ahcich16: ahci_start: Done


However, eventually things seem to work anyway. I've attached the full dmesg.boot file
Comment 10 Peter Eriksson 2021-01-12 18:02:56 UTC
Created attachment 221500 [details]
dmesg.boot
Comment 11 Peter Eriksson 2021-01-12 18:06:55 UTC
Created attachment 221502 [details]
Version 2 of patch (with debugging printfs)

A new version of the patch, still with a lot of debugging printfs. I'll see if I can find time to create a "cleaner" version with just the DELAY(10) and "long timeout, but only after state 0 -> 1" fixes.
Comment 12 Jason Mader 2021-09-22 19:30:08 UTC
Still seeing this problem with the latest BOSS-S1 firmware version 2.5.13.3024, A07 on FreeBSD 13.0 (and a Dell HBA330 in non-RAID mode).
Comment 13 Lorenzo Perone 2022-11-10 09:32:32 UTC
Created attachment 237998 [details]
updated patch compatible with stable/13

The problem still exists on FreeBSD 13.1 with firmware version 2.15.0. The previously posted patch applies cleanly to stable/12, but not to stable/13.
I made a best effort to clean up the conflicts when cherry-picking the change back onto stable/13 and am attaching the diff here.
The built kernel works and the disks are seen.
Is there any chance that someone might pick this up, review the merged patch, and merge it back into 13 (and into 12 using the previous version)?
Best Regards, Lorenzo
Comment 14 Lorenzo Perone 2022-11-16 10:25:50 UTC
Hi all again, particularly committers. Apologies upfront for adding another comment here to draw attention. I'm not sure how much amendment and review the merged patch needs (and I'm sure there are other bugs which are much more important than this one). However, the controller in question is not some exotic hardware used by "some", but one of the standard options from Dell, arguably one of the vendors most used with our beloved FreeBSD. In the short term (before deploying machines), I'd be available for further tests. In the meantime, however, I am resorting to USB NVMe for boot (as I cannot deploy a custom kernel on those machines).
Thanks in advance for attention on this. Best Regards.
Comment 15 Dave Cottlehuber freebsd_committer freebsd_triage 2022-12-01 15:52:01 UTC
I have split out the functional change into https://reviews.freebsd.org/D37585
to make review easier.

The Dell hardware mentioned: https://www.dell.com/support/manuals/de-at/boss-s-1/boss_s1_ug_publication/overview
Comment 16 Ed Maste freebsd_committer freebsd_triage 2022-12-16 14:56:21 UTC
> Probably not meaningful to try to report it to Dell since FreeBSD isn't officially supported by them

As an aside, if you have a support contract/channel with Dell, please do report to them - it's important that they are aware that people are using FreeBSD on their hardware, and experiencing trouble.

Thanks Dave for putting the review into Phabricator. I suspect things will be slow over the holidays, but I hope this will get picked up early in the new year.
Comment 17 Mike Dentifrice 2022-12-19 02:58:55 UTC
For other people stumbling upon that issue, I had to revert all the way back to firmware A03 (version 2.5.13.3011) for both M.2 SSDs to be detected. A04 did show one disk, but not both.

Looking forward to seeing the workaround committed so we can flash back to A07. Thanks a bunch in advance to those involved!
Comment 18 tcs 2023-01-26 09:20:29 UTC
The BOSS-S2 (firmware 2.5.13.4008, PowerEdge T350) has the same problem in non-RAID mode.
With the latest patch the disks are detected, but ZFS checksum errors still occur in some cases.
Please let me know how I can help.
Comment 19 commit-hook freebsd_committer freebsd_triage 2023-02-10 16:10:06 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=346483b1f10454c5617a25d5e136829f60fb1184

commit 346483b1f10454c5617a25d5e136829f60fb1184
Author:     Mariusz Zaborski <oshogbo@FreeBSD.org>
AuthorDate: 2023-02-10 15:56:04 +0000
Commit:     Mariusz Zaborski <oshogbo@FreeBSD.org>
CommitDate: 2023-02-10 16:10:04 +0000

    ahci: increase timout

    For some devices, like Marvell 88SE9230, it takes more time
    to connect to the device. This patch introduces a special flag
    that extends the timeout from around 100ms to around 500ms.

    This change is based on the work of: Peter Eriksson <pen@lysator.liu.se>

    PR:             243401
    Reviewed by:    imp
    Tested by:      dch
    MFC after:      3 days
    Sponsored by:   Equinix
    Sponsored by:   SkunkWerks, GmbH
    Sponsored by:   Klara, Inc.
    Differential Revision:  https://reviews.freebsd.org/D38413

 sys/dev/ahci/ahci.c     | 12 ++++++++----
 sys/dev/ahci/ahci.h     |  4 +++-
 sys/dev/ahci/ahci_pci.c |  2 +-
 3 files changed, 12 insertions(+), 6 deletions(-)
Comment 20 Dave Cottlehuber freebsd_committer freebsd_triage 2023-02-10 21:11:02 UTC
Created attachment 240058 [details]
committed to address above issue, tested on a variety of Dell h/w and firmwares

This patch specifically addresses the Marvell 88SE9230:

        {0x92301b4b, 0x00, "Marvell 88SE9230",  AHCI_Q_ALTSIG |
-           AHCI_Q_IOMMU_BUSWIDE},
+           AHCI_Q_IOMMU_BUSWIDE | AHCI_Q_SLOWDEV},

If you're experiencing this on other h/w please let us know
so we can add further device quirks if needed.
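
For readers unfamiliar with this kind of table: the first field packs the PCI device ID into the upper 16 bits and the vendor ID into the lower 16, so 0x92301b4b is device 0x9230 from vendor 0x1b4b (Marvell). Below is a minimal, self-contained sketch of how such a quirk table could be searched. The struct, the function and the quirk bit values are placeholders for illustration (the real definitions live in ahci.h and ahci_pci.c), and treating the second field as a minimum revision is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; see ahci.h for the real ones. */
#define EXQ_ALTSIG		0x1u
#define EXQ_IOMMU_BUSWIDE	0x2u
#define EXQ_SLOWDEV		0x4u

struct ex_ahci_id {
	uint32_t	id;	/* (device << 16) | vendor */
	uint8_t		rev;	/* assumed: minimum revision, 0x00 matches any */
	const char	*name;
	uint32_t	quirks;
};

static const struct ex_ahci_id ex_ids[] = {
	{0x92301b4b, 0x00, "Marvell 88SE9230",
	    EXQ_ALTSIG | EXQ_IOMMU_BUSWIDE | EXQ_SLOWDEV},
	{0, 0x00, NULL, 0}	/* terminator */
};

/* Return the matching table entry, or NULL if the device is not listed. */
static const struct ex_ahci_id *
ex_lookup(uint16_t vendor, uint16_t device, uint8_t rev)
{
	uint32_t id = ((uint32_t)device << 16) | vendor;
	size_t i;

	for (i = 0; ex_ids[i].id != 0; i++) {
		if (ex_ids[i].id == id && rev >= ex_ids[i].rev)
			return (&ex_ids[i]);
	}
	return (NULL);
}
```

Adding a quirk for another slow controller would then amount to one more table row carrying the slow-device flag, which is why reports with the exact chip are useful.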
Comment 21 commit-hook freebsd_committer freebsd_triage 2023-02-13 23:45:25 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=65bab39e140f97cace92a2923e50c6b654b02e22

commit 65bab39e140f97cace92a2923e50c6b654b02e22
Author:     Mariusz Zaborski <oshogbo@FreeBSD.org>
AuthorDate: 2023-02-10 15:56:04 +0000
Commit:     Mariusz Zaborski <oshogbo@FreeBSD.org>
CommitDate: 2023-02-13 23:45:01 +0000

    ahci: increase timout

    For some devices, like Marvell 88SE9230, it takes more time
    to connect to the device. This patch introduces a special flag
    that extends the timeout from around 100ms to around 500ms.

    This change is based on the work of: Peter Eriksson <pen@lysator.liu.se>

    PR:             243401
    Reviewed by:    imp
    Tested by:      dch
    MFC after:      3 days
    Sponsored by:   Equinix
    Sponsored by:   SkunkWerks, GmbH
    Sponsored by:   Klara, Inc.
    Differential Revision:  https://reviews.freebsd.org/D38413

    (cherry picked from commit f08ac4cb14c1c0740346a4363f82e1e1367c2bad)

 sys/dev/ahci/ahci.c     | 12 ++++++++----
 sys/dev/ahci/ahci.h     |  4 +++-
 sys/dev/ahci/ahci_pci.c |  2 +-
 3 files changed, 12 insertions(+), 6 deletions(-)
Comment 22 commit-hook freebsd_committer freebsd_triage 2023-02-14 23:19:00 UTC
A commit in branch releng/13.2 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f09f828a41d8eaf6456d2894a97a2a10f8979ea1

commit f09f828a41d8eaf6456d2894a97a2a10f8979ea1
Author:     Mariusz Zaborski <oshogbo@FreeBSD.org>
AuthorDate: 2023-02-10 15:56:04 +0000
Commit:     Mariusz Zaborski <oshogbo@FreeBSD.org>
CommitDate: 2023-02-14 23:17:57 +0000

    ahci: increase timout

    For some devices, like Marvell 88SE9230, it takes more time
    to connect to the device. This patch introduces a special flag
    that extends the timeout from around 100ms to around 500ms.

    This change is based on the work of: Peter Eriksson <pen@lysator.liu.se>

    Approved by:    re (cperciva)
    PR:             243401
    Reviewed by:    imp
    Tested by:      dch
    MFC after:      3 days
    Sponsored by:   Equinix
    Sponsored by:   SkunkWerks, GmbH
    Sponsored by:   Klara, Inc.
    Differential Revision:  https://reviews.freebsd.org/D38413

    (cherry picked from commit f08ac4cb14c1c0740346a4363f82e1e1367c2bad)
    (cherry picked from commit 65bab39e140f97cace92a2923e50c6b654b02e22)

 sys/dev/ahci/ahci.c     | 12 ++++++++----
 sys/dev/ahci/ahci.h     |  4 +++-
 sys/dev/ahci/ahci_pci.c |  2 +-
 3 files changed, 12 insertions(+), 6 deletions(-)