Having previously used an M.2 disk in a simple pass-through adapter card, I bought a Sonnet McFiver card to plug in a 2nd disk. However, when the card is used, simply mounting any partition on any of the disks plugged into the card results in:

nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: resetting controller
nvme1: failing outstanding i/o
nvme1: READ sqid:5 cid:124 nsid:1 lba:73 len:8
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:124 cdw0:0
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1000000 prp1=0 prp2=0 cdw=49000000 0 7000000 0 0 0
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
g_vfs_done():nda1p1[READ(offset=4608, length=4096)]error = 5
nda1 at nvme1 bus 0 scbus5 target 0 lun 1
nda1: <Samsung SSD 980 1TB 3B4QFXO7 S649NS0W619970N> s/n S649NS0W619970N detached
(nda1:nvme1:0:0:1): Periph destroyed

On Linux accessing both drives works fine. The card and its devices are detected on Linux as:

0000:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
0000:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
0000:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
0000:02:08.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
0000:02:09.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
0000:02:0a.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
0000:03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
0000:04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980
0000:0a:00.0 USB controller: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller
0000:0b:00.0 Ethernet controller: Aquantia Corp. AQC113CS NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 03)
So, the unknown status is my fault: I forgot to add something to a table. Ignore that.

But second, this error message happens when we post a transaction to the card, then don't get an interrupt and time out. When we time out, we go look at the status registers for the card and find that the card is gone (it reads as 0xffffffff); that's the "possible hot unplug" part. With the card reading that, there's no hope: it's game over.

So the question is what changed, since presumably it wasn't like that when we booted the system. We had to have read these registers, and a lot of others, to boot and find the card, and later we had to do I/Os to the card when CAM started up to get the card to attach (though these are trivial, they'd freak out if they read back all ff's). So what happened between probe time and now to get it into this state? Did a bridge go away, get renumbered, or move its memory windows? Was something else mapped in conflict, so that the fight over the decoding results in ff's? Why didn't the interrupt happen, so that we got into the timeout path?
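To make the "reads as 0xffffffff" check concrete, here is a minimal sketch of the decision described above. It is not the nvme driver's code: the function name and the returned strings are hypothetical, and the only assumption carried over from the explanation is that a 32-bit status read of all ones is taken to mean nothing is answering at that address any more.

```python
# Illustrative only: models the decision made after a command times out.
# The name classify_timeout and the result strings are hypothetical.

ALL_ONES_32 = 0xFFFFFFFF  # value seen when a 32-bit read gets no response


def classify_timeout(status_value: int) -> str:
    """Classify a timed-out command from one 32-bit status register read."""
    if (status_value & ALL_ONES_32) == ALL_ONES_32:
        # Every bit set: treated as "device is gone" (possible hot unplug).
        return "device-gone"
    # Any other value means the device still decodes its registers.
    return "device-present"


print(classify_timeout(0xFFFFFFFF))
print(classify_timeout(0x00000001))
```

The point of the sketch is that the check cannot tell why the read failed: a real unplug, a bridge that stopped forwarding, or a conflicting memory mapping would all look the same from here.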
Please attach a dmesg. Can you interact with the card with nvmecontrol before you try to boot it (I suspect not)? Are there one or two nvme cards behind that bridge?
Created attachment 246227
dmesg.boot

Here's a full dmesg from booting.
Trying to mount on CURRENT:

nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: resetting controller
nvme1: failing outstanding i/o
nvme1: READ sqid:5 cid:127 nsid:1 lba:73 len:8
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:127 cdw0:0
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=49 0 7 0 0 0
(nda1:nvme1:0:0:1): CAM status: NVME Status Error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
g_vfs_done():nda1p1[READ(offset=4608, length=4096)]error = 5
nda1 at nvme1 bus 0 scbus5 target 0 lun 1
nda1: <Samsung SSD 980 1TB 3B4QFXO7 S649NS0W619970N> s/n S649NS0W619970N detached
mount_msdosfs: /dev/nda1p1: (Input/output errnor da1:nvme1root@:~ # :0:0:1): Periph destroyed
nvme1: Failed controller, stopping watchdog timeout.

root@:~ # uname -a
FreeBSD 15.0-CURRENT FreeBSD 15.0-CURRENT #0 main-n266315-b2b381d365fc: Thu Nov 9 04:12:28 UTC 2023 root@releng3.nyi.freebsd.org:/usr/obj/usr/src/powerpc.powerpc64le/sys/GENERIC64LE powerpc
As far as I can see, this Sonnet McFiver card has its own PCIe bridge. I wonder whether it (probably falsely) supports PCIe hot-plug on the M.2 slots (and maybe falsely triggers it). Since the system seems to be PowerPC, I wonder what the status of PCIe hot-plug support is there.
(In reply to Alexander Motin from comment #5)
> Since the system seems to be PowerPC, I wonder what is the status of the PCIe hot-plug support there.

GENERIC has it compiled in... Can't comment on whether or not it is working.
I wonder if a verbose dmesg could say more about hot-plug. The same goes for `pciconf -lvcb` on the PCIe bridge ports that are parents of the nvme devices.
Disks are on pci3 and pci4:

nvme0: <Generic NVMe Device> mem 0x80000000-0x80003fff irq 1044473 at device 0.0 numa-domain 0 on pci3
nvme1: <Generic NVMe Device> mem 0x80400000-0x80403fff irq 1044474 at device 0.0 numa-domain 0 on pci4

pci3 and pci4 are on pcib3 and pcib4, respectively:

pci3: <OFW PCI bus> numa-domain 0 on pcib3
pci4: <OFW PCI bus> numa-domain 0 on pcib4

Indeed one of those has HotPlug, but the other does not.

pcib3@pci0:2:1:0: class=0x060400 rev=0xca hdr=0x01 vendor=0x10b5 device=0x8724 subvendor=0x16b8 subdevice=0x7404
    vendor = 'PLX Technology, Inc.'
    device = 'PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA'
    class = bridge
    subclass = PCI-PCI
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[48] = MSI supports 8 messages, 64 bit, vector masks
    cap 10[68] = PCI-Express 2 downstream port max data 256(1024) RO NS
                 ARI disabled
                 max read 128
                 link x4(x4) speed 8.0(8.0) ASPM disabled(L1)
                 slot 1 power limit 25000 mW HotPlug(present) Attn Button PC(off) MRL(open)
    cap 0d[a4] = PCI Bridge subvendor=0x16b8 subdevice=0x7404
    ecap 0003[100] = Serial 1 ca870010b5df0e00
    ecap 0001[fb4] = AER 1 0 fatal 0 non-fatal 0 corrected
    ecap 0004[138] = Power Budgeting 1
    ecap 0019[10c] = PCIe Sec 1 lane errors 0
    ecap 0002[148] = VC 1 max VC0
    ecap 0012[e00] = Multicast 1
    ecap 000d[f24] = ACS 1 Source Validation disabled, Translation Blocking disabled
                     P2P Req Redirect disabled, P2P Cmpl Redirect disabled
                     P2P Upstream Forwarding disabled, P2P Egress Control disabled
                     P2P Direct Translated disabled, Enhanced Capability unavailable
    ecap 000b[b70] = Vendor [1] ID 0001 Rev 0 Length 16
pcib4@pci0:2:2:0: class=0x060400 rev=0xca hdr=0x01 vendor=0x10b5 device=0x8724 subvendor=0x16b8 subdevice=0x7404
    vendor = 'PLX Technology, Inc.'
    device = 'PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA'
    class = bridge
    subclass = PCI-PCI
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[48] = MSI supports 8 messages, 64 bit, vector masks
    cap 10[68] = PCI-Express 2 downstream port max data 256(1024) RO NS
                 ARI disabled
                 max read 128
                 link x4(x4) speed 8.0(8.0) ASPM disabled(L1)
                 slot 2 power limit 25000 mW
    cap 0d[a4] = PCI Bridge subvendor=0x16b8 subdevice=0x7404
    ecap 0003[100] = Serial 1 ca870010b5df0e00
    ecap 0001[fb4] = AER 1 0 fatal 0 non-fatal 0 corrected
    ecap 0004[138] = Power Budgeting 1
    ecap 0019[10c] = PCIe Sec 1 lane errors 0
    ecap 0002[148] = VC 1 max VC0
    ecap 0012[e00] = Multicast 1
    ecap 000d[f24] = ACS 1 Source Validation disabled, Translation Blocking disabled
                     P2P Req Redirect disabled, P2P Cmpl Redirect disabled
                     P2P Upstream Forwarding disabled, P2P Egress Control disabled
                     P2P Direct Translated disabled, Enhanced Capability unavailable
    ecap 000b[b70] = Vendor [1] ID 0001 Rev 0 Length 16

What's stranger, I can now mount the nvme0 drive on pci3 (which has HotPlug), even though it didn't work before. Attempting to mount nvme1 still doesn't work.
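For comparing the remaining bridge ports without reading the output by eye, a rough sketch like the following could flag which ports report HotPlug. It assumes one output line per text line, with each device starting on an unindented `name@pciN:...:` header as in the listing above. The helper name `hotplug_ports` and the shortened `SAMPLE` text are made up for illustration; this is not part of pciconf.

```python
import re

# A device header looks like "pcib3@pci0:2:1:0: class=0x060400 ..."
HEADER = re.compile(r"^(\S+@pci[\d:]+):\s")


def hotplug_ports(pciconf_text):
    """Map each device selector to True if any of its lines mention HotPlug."""
    ports = {}
    current = None
    for line in pciconf_text.splitlines():
        match = HEADER.match(line)
        if match:
            # New device block starts here; assume no HotPlug until seen.
            current = match.group(1)
            ports[current] = False
        elif current is not None and "HotPlug" in line:
            ports[current] = True
    return ports


# Shortened excerpt of the two bridge ports quoted above (illustrative).
SAMPLE = """\
pcib3@pci0:2:1:0: class=0x060400 rev=0xca hdr=0x01 vendor=0x10b5 device=0x8724
    cap 10[68] = PCI-Express 2 downstream port max data 256(1024) RO NS
                 slot 1 power limit 25000 mW HotPlug(present) Attn Button PC(off) MRL(open)
pcib4@pci0:2:2:0: class=0x060400 rev=0xca hdr=0x01 vendor=0x10b5 device=0x8724
    cap 10[68] = PCI-Express 2 downstream port max data 256(1024) RO NS
                 slot 2 power limit 25000 mW
"""

for selector, has_hotplug in hotplug_ports(SAMPLE).items():
    print(selector, "HotPlug" if has_hotplug else "no HotPlug")
```

Feeding it the saved output of `pciconf -lvcb` instead of `SAMPLE` would cover the other downstream ports of the switch as well.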