Bug 264172 - nvme(4): PM9A1 NVMe Samsung 512GB no longer available with 13.1-RELEASE (was working with 13.0)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Alexander Motin
URL:
Keywords: needs-qa, regression
Depends on:
Blocks: 264030
Reported: 2022-05-23 09:41 UTC by Matthieu Volat
Modified: 2023-12-12 09:32 UTC
CC List: 7 users

See Also:


Attachments
dmesg.boot file with 13.1 (49.24 KB, text/plain)
2022-05-23 09:41 UTC, Matthieu Volat
dmesg.boot file with 13.0 (31.40 KB, text/plain)
2022-05-23 09:42 UTC, Matthieu Volat
pciconf -lv under 13.0 (29.54 KB, text/plain)
2022-05-24 07:39 UTC, Matthieu Volat
pciconf -lv under 13.1 (29.23 KB, text/plain)
2022-05-24 07:40 UTC, Matthieu Volat
pciconf -lvcb with 13.0 (45.15 KB, text/plain)
2022-05-31 09:12 UTC, Matthieu Volat
pciconf -lvcb with 13.1 (44.06 KB, text/plain)
2022-05-31 09:13 UTC, Matthieu Volat

Description Matthieu Volat 2022-05-23 09:41:17 UTC
Created attachment 234139 [details]
dmesg.boot file with 13.1

Hi,

I'm using a Dell Precision 5820 workstation with an NVMe SSD (/dev/nvd0). This setup worked out of the box with FreeBSD 13.0, but upgrading to 13.1 results in the drive not being available -- and the kernel being unable to mount its ZFS pool.

I can confirm this is also the case using the memstick live image: successful boot, but no /dev/nvd0 -- so it's not something resulting from my current installation.

I disabled all RAID options in the UEFI settings, but to no avail (the 13.0 kernel still works), and I'm not sure what more I can investigate by myself. I'll attach the dmesg.boot files for both 13.0 and 13.1...

Thanks a lot,

-- Matthieu
Comment 1 Matthieu Volat 2022-05-23 09:42:02 UTC
Created attachment 234140 [details]
dmesg.boot file with 13.0
Comment 2 Warner Losh 2022-05-24 01:11:57 UTC
Pciconf -lv and dmesg please...
Comment 3 Warner Losh 2022-05-24 02:22:01 UTC
(oh, there is a dmesg.boot)

nvme0: <Generic NVMe Device> at device 0.0 numa-domain 0 on pci9
pcib9: Power Fault Detected
pcib9: Power Fault Detected
nvd0: <PM9A1 NVMe Samsung 512GB> NVMe namespace
nvd0: 488386MB (1000215216 512 byte sectors)
nvd0: detached
g_dev_taste: g_dev_taste(nvd0) failed to g_attach, error=6
Root mount waiting for: usbus0
nvme0: detached
pci9: detached

is the bit in question. What's this "Power Fault Detected" about?

pcib8: <PCI-PCI bridge> port 0xc000 mem 0xf8000000-0xf9ffffff irq 0 at device 2.0 numa-domain 0 on vmd_bus0
pcib8: failed to allocate initial I/O port window: 0-0xafff
pci8: <PCI bus> numa-domain 0 on pcib8
nvme0: <Generic NVMe Device> at device 0.0 numa-domain 0 on pci8
pcib9: <PCI-PCI bridge> port 0xc000 mem 0xf8000000-0xf9ffffff irq 0 at device 3.0 numa-domain 0 on vmd_bus0
pcib9: failed to allocate initial I/O port window: 0-0xbfff
pci9: <PCI bus> numa-domain 0 on pcib9

is from 13.0, but it's talking about vmd_bus0, not pcibX.

I don't really have the ability to test vmd stuff, but I think this might be related to that and there's some mismatch between the drive and the BIOS. But since I don't mess with this stuff often, I'm not sure how much more help I can be.
Comment 4 Matthieu Volat 2022-05-24 07:39:31 UTC
Created attachment 234169 [details]
pciconf -lv under 13.0
Comment 5 Matthieu Volat 2022-05-24 07:40:01 UTC
Created attachment 234170 [details]
pciconf -lv under 13.1
Comment 6 Matthieu Volat 2022-05-24 07:42:26 UTC
Thanks for taking the time to try to understand what is going on.

I've added the pciconf -lv outputs in case they help. This PCI power fault looks weird, but I've had this desktop for a few weeks now and have not seen this error with 13.0, while it is quite systematic with 13.1.

I will also check today whether there are any UEFI updates; I have known more than one board that would only behave correctly once patched...
Comment 7 Matthieu Volat 2022-05-24 10:03:22 UTC
I updated the UEFI firmware to the latest version (2.19.1), to no avail. I also tried resetting all settings and playing with some PCIe options, with the same results.
Comment 8 Matthieu Volat 2022-05-30 12:19:30 UTC
I've git-bisected the stable/13 branch; it seems the issue was introduced in:

- https://cgit.freebsd.org/src/commit/sys/dev/vmd/vmd.c?h=stable/13&id=1d8cda26cad2b1c2613edad2f3be0222c5691ad1 
- https://cgit.freebsd.org/src/commit/sys/dev/vmd/vmd.h?h=stable/13&id=53ec9de7b6bf9b5a837d4af876ba3124d0482e25

The second commit only adds a missing member to complete the first. Given this, it seems the new vmd version is the reason for this issue, but I'm not familiar enough with vmd, nvme, or the code to analyze further what is going wrong...
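
For reference, a bisect along these lines should converge on the first of those commits; the endpoints shown here (the 13.0 and 13.1 release tags) are an assumption, since the report doesn't say which revisions were used as good/bad:

    git clone https://git.freebsd.org/src.git && cd src
    git bisect start
    git bisect bad  release/13.1.0   # 13.1: drive disappears
    git bisect good release/13.0.0   # 13.0: drive still works
    # at each step git checks out a revision: build and boot that kernel,
    # check whether /dev/nvd0 shows up, then mark it:
    git bisect good                  # or: git bisect bad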
Comment 9 Warner Losh 2022-05-30 14:14:03 UTC
Does a -current snapshot have the same issue?

vmd is this "thing" Intel invented. It's designed to be RAID without the hassles, but it falls well short of that design goal. Basically, the NVMe drives need to be accessed via the AHCI controller, and vmd.c provides that glue in what's supposed to be a seamless way.

If it works with a -current snapshot, then vmd isn't broken there and we're missing some key bit of code in 13.1. If it is broken there, then that's a whole different problem to solve (a currently unknown issue with vmd). I disabled VMD on my laptop years ago, and am reluctant to re-enable it since I'll have to back up/restore Windows in the process, which was the better part of a day for me the last time I fought with it...  Though if I do that, I can upgrade my BIOS and maybe run Windows 11 (if only I had a day to burn on this...)
Comment 10 Warner Losh 2022-05-30 14:15:12 UTC
(adding CC of Alexander Motin (mav) since it's his code/MFC)
Comment 11 Alexander Motin 2022-05-30 20:18:34 UTC
With my vmd(4) driver rework in 13.1 I tried to turn it into a more or less reasonable PCIe bridge.  I suspect that it may just have fixed things enough to expose some existing issue(s) that were otherwise hidden.  The "Power Fault Detected" messages you see come from the PCIe hot-plug code and are not directly related to vmd(4).  You may try setting the hw.pci.enable_pcie_hp=0 loader tunable to disable the PCIe hot-plug support.  You may also check your UEFI settings for any PCI/PCIe hot-plug options.

Unfortunately I can't say more with only the information available.  For `dmesg` I'd prefer to see verbose messages enabled, while for `pciconf` I'd use the `-lvcb` arguments, or maybe `lspci -vvvv`, if it can be installed, for maximum information.
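
A minimal sketch of the suggested test and data collection, assuming a stock 13.1 install; the output file names are just placeholders, and lspci comes from the sysutils/pciutils package:

    (one-off test of the workaround, at the loader "OK" prompt)
    OK set hw.pci.enable_pcie_hp=0
    OK boot

    (on the running system, as root, to gather the requested details)
    # pciconf -lvcb > pciconf-lvcb.txt
    # pkg install pciutils
    # lspci -vvvv > lspci-vvvv.txt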
Comment 12 Matthieu Volat 2022-05-31 09:12:00 UTC
Thanks for the explanation -- I'm really not familiar with the more complex stuff vendors are putting on motherboards. It also means I have the option of disabling VMD? -- and surely reinstalling the system, but I'm willing to push this a little further to help improve things.

Booting the system with -current shows the same issue.

I'm also updating the pciconf outputs with Alexander Motin's suggested flags -- I'm not sure how to get a "verbose dmesg", however?
Comment 13 Matthieu Volat 2022-05-31 09:12:36 UTC
Created attachment 234345 [details]
pciconf -lvcb with 13.0
Comment 14 Matthieu Volat 2022-05-31 09:13:01 UTC
Created attachment 234346 [details]
pciconf -lvcb with 13.1
Comment 15 Alexander Motin 2022-05-31 14:18:24 UTC
For "verbose dmesg" you should enable it during boot, either via option in loader menu or via setting boot_verbose="YES" in loader.conf.  After that regular `dmesg` will be much more verbose.

Looking at the new `pciconf` output, I see in 13.0:
        cap 05[60] = MSI supports 2 messages, vector masks
, but in 13.1:
        cap 05[60] = MSI supports 2 messages, vector masks enabled with 1 message
, which means the newly introduced PCIe hot-plug support is active, just as I guessed.  You may try to block it with the loader tunable I posted before.  Otherwise I am not sure what the OS should do about it, since it just handles what the hardware reports.
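
Putting the two suggestions together, a /boot/loader.conf sketch might look like the following (the comments are only illustrative); boot_verbose can be removed once the debugging is done:

    # boot verbosely so `dmesg` carries the full probe/attach detail
    boot_verbose="YES"
    # work around the nvd0 detach by disabling PCIe hot-plug support
    hw.pci.enable_pcie_hp="0"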
Comment 16 Matthieu Volat 2022-06-03 06:16:57 UTC
Sorry for the delay, I was working out of the office for a few days and had no access to the machine.

Yes! Disabling hotplug in /boot/loader.conf worked!
Comment 17 Motomichi Matsuzaki 2022-06-17 13:43:57 UTC
I believe the same applies to upgrading from 12.2 to 12.3. In my case, nvd0 was lost after the 12.2->12.3 upgrade on my Precision 5820. In panic mode, I upgraded further to 13.1, and then found this PR. The loader tunable (hw.pci.enable_pcie_hp=0) also works fine in my 13.1 environment.
Comment 18 Xin LI 2023-09-12 16:30:27 UTC
(In reply to Matthieu Volat from comment #16)
Does the same issue still exist with a 14.0 image?
Comment 19 Matthieu Volat 2023-12-12 09:32:06 UTC
(In reply to Xin LI from comment #18)

Sorry, got a bit busy and installing a beta wasn't possible (this is my work workstation).

I started the upgrade and tested: hw.pci.enable_pcie_hp is still needed.

But I'm not sure this entry should be left open: apparently this is a known possible issue, and the tunable exists to work around it?