Installing CentOS or Rocky Linux 8.4 results in a failed boot. The initial install works, but on reboot I get this while loading:

Starting webhost04a
  * found guest in /storage/vm/webhost04a
  * booting...
BdsDxe: failed to load Boot0001 "UEFI bhyve-NVMe NVME-4-0" from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,01-00-68-C1-20-FC-9C-58): Not Found

Logging from vm-bhyve:

Jun 04 17:18:00: booting
Jun 04 17:18:00: [bhyve options: -c 8 -m 16G -Hwl bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd -U 62ff48d0-c58d-11eb-9187-f8bc1251963e]
Jun 04 17:18:00: [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,nvme,/dev/zvol/storage/vm/webhost04a/disk0 -s 5:0,virtio-net,tap0,mac=58:9c:fc:07:6d:b7 -s 6:0,fbuf,tcp=192.168.1.150:5900 -s 7:0,xhci,tablet]

Note: Rocky Linux 8.3 and CentOS 8.3 both install and boot fine with exactly the same vm-bhyve configuration.
Does the guest boot if you change the device from nvme to ahci?
Installing the guest again using virtio-blk instead of nvme results in a working guest. The same guest with nvme changed to ahci-hd also works. It appears something changed in RHEL 8.4 and its derivatives that trips up the NVMe support in the UEFI firmware.
I was able to reproduce this with AlmaLinux 8.3/8.4 (effectively identical to CentOS 8.3/8.4). With a file-backed image on ZFS, I tried forcing the sectorsize parameter to both 4K and 512, with no difference in getting the system to boot. The error appears to be in the EFI loader on CentOS:

  BdsDxe: loading Boot0001 "UEFI bhyve-NVMe NVME-4-0" from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,01-00-DD-44-20-FC-9C-58)
  BdsDxe: starting Boot0001 "UEFI bhyve-NVMe NVME-4-0" from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,01-00-DD-44-20-FC-9C-58)
  Unexpected return from initial read: Device Error, buffersize 0
  Failed to load image \EFI\almalinux\grubx64.efi: Device Error
  start_image() returned Device Error
  StartImage failed: Device Error
Has this been tested against bare metal that has UEFI and NVMe? I got the same result as grehan@ when testing with both CentOS 8.4 and CentOS Stream. Observations suggest there is something up with the CentOS EFI shim for GRUB.

I have done testing against the following, fully updated as of 20210609 12:10 +10UTC:

  openSUSE Tumbleweed
  Ubuntu impish 21.10 nightly
  Artix (Arch), GRUB 2.04-10, Linux 5.12.8

None of these experienced any issues with the NVMe device presented by bhyve.
I successfully updated CentOS from 8.3 to 8.4 and it is running fine on an NVMe bhyve device. This is looking more like an issue with how the installer determines the boot device and then writes it and the GRUB components to storage.
Is it possible to recompile pci_nvme.c with debugging enabled in the failing case? I.e. change the code to:

  static int nvme_debug = 1;
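For context, nvme_debug gates the emulation's DPRINTF output; the surrounding code in pci_nvme.c looks roughly like this (paraphrased from memory, the exact macro bodies may differ slightly):

  static int nvme_debug = 1;	/* flip from 0 to 1 to enable DPRINTF output */
  #define DPRINTF(fmt, args...) if (nvme_debug) PRINTLN(fmt, ##args)
  #define WPRINTF(fmt, args...) PRINTLN(fmt, ##args)

With bhyve rebuilt like that, the extra output should appear on the bhyve process's console output, as in the log further below.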
This looks to be an edge condition in the EFI NVMe driver, caused by the large maximum data transfer size advertised by bhyve NVMe (2MB) and the increase in size of grubx64.efi from 1.9MB in CentOS 8.3 to 2.3MB in CentOS 8.4.

In 8.4, EFI attempts to read 2MB of grubx64.efi. However, the buffer starts at a non page-aligned address, so PRP1 in the command descriptor is used with an offset. PRP2 points to a PRP list, but with a 2MB transfer size all 512 PRP entries in a page will be used. Since the first buffer was unaligned, there is a small amount left over at the end, and EFI is putting garbage into that entry. (Copying the smaller 8.3 grubx64.efi to an 8.4 system resulted in a successful boot.)

A suggested fix is to drop the advertised MDTS to something that isn't right on the verge of requiring a chained PRP list. QEMU defaults to 512KB, and h/w I've looked at advertises 256K. e.g.

--- a/usr.sbin/bhyve/pci_nvme.c
+++ b/usr.sbin/bhyve/pci_nvme.c
@@ -106,7 +106,7 @@ static int nvme_debug = 0;
 #define NVME_MPSMIN_BYTES   (1 << (12 + NVME_MPSMIN))
 #define NVME_PRP2_ITEMS     (PAGE_SIZE/sizeof(uint64_t))
-#define NVME_MDTS           9
+#define NVME_MDTS           7

(or 8)

8.4 boots fine with this change.
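For anyone following along, the "right on the verge" arithmetic can be checked with a small standalone program. This is illustration only, not bhyve code; the helper name is made up, and the MDTS conversion assumes bhyve's 4 KiB minimum memory page size:

/*
 * Standalone arithmetic sketch (not bhyve code): count the PRP data
 * pointers an NVMe transfer needs, given the transfer size and the
 * offset of the buffer within its first 4 KiB page.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096ULL

static uint64_t
prp_data_entries(uint64_t bytes, uint64_t offset_in_page)
{
	uint64_t first = PAGE_SIZE - offset_in_page;	/* covered by PRP1 */

	if (bytes <= first)
		return (1);
	/* PRP1 plus one entry per remaining (possibly partial) page */
	return (1 + (bytes - first + PAGE_SIZE - 1) / PAGE_SIZE);
}

int
main(void)
{
	/* The failing read: 2 MiB of grubx64.efi into a buffer at page offset 0x18 */
	uint64_t n = prp_data_entries(2 * 1024 * 1024, 0x18);

	/*
	 * Prints 513: PRP1 plus exactly 512 list entries, i.e. the PRP list
	 * pointed to by PRP2 is completely full but does NOT chain to a
	 * second list -- the edge case described above.
	 */
	printf("data pointers needed: %ju (PRP1 + %ju list entries)\n",
	    (uintmax_t)n, (uintmax_t)(n - 1));

	/* MDTS is a power of two in units of the 4 KiB minimum page size */
	for (int mdts = 7; mdts <= 9; mdts++)
		printf("MDTS %d -> max transfer %llu KiB\n",
		    mdts, (PAGE_SIZE << mdts) / 1024);
	return (0);
}

With MDTS dropped to 7 (512 KiB) or 8 (1 MiB), a 2 MiB read is split into smaller commands that never land exactly on the 512-entry boundary.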
I can confirm the patch from grehan@ works as described. Tested against:

  CentOS 8.4
  Windows Server 2022
  OpenBSD 6.9

No regressions were introduced in the latter two, which are existing guests on our system.
To document this a bit more, the error in pci_nvme.c is:

...
nvme doorbell 1, SQ, val 0x1
nvme_handle_io qid 1 head 0 tail 1 cmdlist 0x8c1885000
pci_nvme_io_done error 14 Bad address
pci_nvme_set_completion sqid 1 cqid 1 cid 116 status: 0x0 0x4
pci_nvme_set_completion: CQ1 interrupt disabled
nvme doorbell 1, CQ, val 0x1

blockif_write() asynchronously returns EFAULT (a.k.a. "Bad address"). Adding some debugging to pci_nvme_io_done() shows the following:

Fault on SQ[1] opc=0x2 cid=0x74 bytes=2097152
  [ 0] 0x8c097a018 : 2097128
  [ 1] 0x0 : 24

The debug statement displays the contents of the pci_nvme_ioreq structure for the failing I/O, including the valid iovec entries. This is from an I/O Submission Queue (i.e. queue ID != 0) for a Read (opcode 0x2) of 2097152 bytes. My suspicion is that the second iovec entry, with its NULL iov_base value, causes the EFAULT.
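As a quick sanity check of that suspicion, a NULL iov_base with a non-zero length is enough to make a vectored I/O syscall fail with EFAULT. A minimal standalone demo of the errno (unrelated to bhyve's blockif code):

/*
 * Minimal demo (not bhyve code): an iovec entry with a NULL iov_base and a
 * non-zero iov_len makes readv(2) fault with EFAULT ("Bad address"),
 * mirroring the asynchronous blockif error seen above.
 */
#include <sys/uio.h>

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[2048];
	struct iovec iov[2] = {
		{ .iov_base = buf,  .iov_len = sizeof(buf) },
		{ .iov_base = NULL, .iov_len = 24 },	/* like iovec [1] above */
	};
	int fd = open("/dev/zero", O_RDONLY);
	ssize_t n;

	if (fd == -1)
		return (1);
	n = readv(fd, iov, 2);
	if (n == -1)
		printf("readv: %s (errno %d)\n", strerror(errno), errno);
	else
		printf("readv returned %zd (short read; the NULL entry faulted)\n", n);
	close(fd);
	return (0);
}

On FreeBSD this prints "readv: Bad address (errno 14)", the same errno 14 shown in the pci_nvme_io_done log line.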
I think nvme_write_read_blockif() has a bug:

 }
 
 static void
@@ -1978,7 +1984,7 @@ nvme_write_read_blockif(struct pci_nvme_softc *sc,
 
 	/* PRP2 is pointer to a physical region page list */
 	while (bytes) {
 		/* Last entry in list points to the next list */
-		if (prp_list == last) {
+		if ((prp_list == last) && (bytes > PAGE_SIZE)) {
 			uint64_t prp = *prp_list;
 
 			prp_list = paddr_guest2host(vmctx, prp,

Note that I cleaned up some additional things and your line numbers won't quite match up, but I believe this is the crux of the necessary fix.
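To illustrate why the extra condition matters, here is a simplified, self-contained sketch of a PRP-list walk. It is not the actual bhyve code: gpa_to_host() and the flat guest_mem[] array are stand-ins invented for the example.

/*
 * Simplified sketch of a PRP-list walk (not bhyve's nvme_write_read_blockif).
 * Guest physical addresses are simulated as offsets into a flat buffer.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096u
#define PRP_ITEMS	(PAGE_SIZE / sizeof(uint64_t))	/* 512 entries per list page */

static uint8_t guest_mem[16 * 1024 * 1024];	/* pretend guest physical memory */

/* stand-in for paddr_guest2host(): a gpa is simply an offset here */
static void *
gpa_to_host(uint64_t gpa)
{
	return (&guest_mem[gpa]);
}

int
main(void)
{
	uint64_t list_gpa = 0x100000;		/* where the PRP list lives */
	uint64_t data_gpa = 0x200000;		/* first data page after PRP1 */
	uint64_t *prp_list, *last;
	uint64_t bytes = 511 * PAGE_SIZE + 24;	/* left after PRP1: exactly 512 entries */
	unsigned int entries = 0;

	/* Build a list whose 512 entries all point at data pages. */
	prp_list = gpa_to_host(list_gpa);
	for (unsigned int i = 0; i < PRP_ITEMS; i++)
		prp_list[i] = data_gpa + (uint64_t)i * PAGE_SIZE;

	last = prp_list + (PRP_ITEMS - 1);
	while (bytes) {
		/* Last entry chains to the next list ONLY if more data follows. */
		if ((prp_list == last) && (bytes > PAGE_SIZE)) {
			prp_list = gpa_to_host(*prp_list);
			last = prp_list + (PRP_ITEMS - 1);
			continue;
		}
		/* Otherwise the entry describes up to one page of data. */
		uint64_t chunk = bytes < PAGE_SIZE ? bytes : PAGE_SIZE;
		(void)gpa_to_host(*prp_list);	/* would become an iovec entry */
		entries++;
		bytes -= chunk;
		prp_list++;
	}

	/*
	 * Prints 512: the final entry was treated as data.  Without the
	 * "bytes > PAGE_SIZE" guard, entry 511 would have been dereferenced
	 * as a pointer to another list, yielding a bogus iovec like the one
	 * in the debug output above.
	 */
	printf("iovec entries built: %u\n", entries);
	return (0);
}

With exactly 512 data entries remaining (the grubx64.efi case), the guard is what keeps the final entry from being misread as a chain pointer.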
I reverted the grehan@ patch, applied the chuck@ patch, and tested against:

  CentOS 8.4
  Windows Server 2022
  OpenBSD 6.9-snapshot

All of the above installed, rebooted, and booted correctly.
Chuck's fix is the correct one; mine was just a bandaid.
Actually, Peter's analysis helped me immediately zero in on the problem :)
Change is up for review: https://reviews.freebsd.org/D30897
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=91064841d72b285a146a3f1c32cb447251e062ea

commit 91064841d72b285a146a3f1c32cb447251e062ea
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-06-27 22:14:52 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-06-27 22:14:52 +0000

    bhyve: Fix NVMe iovec construction for large IOs

    The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
    in the NVMe emulation's construction of iovec's.

    By default, NVMe data transfer operations use a scatter-gather list in
    which all entries point to a fixed size memory region. For example, if
    the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
    themselves are also fixed size (default is 512 entries).

    Because the list size is fixed, the last entry is special. If the IO
    requires more than 512 entries, the last entry in the list contains the
    address of the next list of entries. But if the IO requires exactly 512
    entries, the last entry points to data.

    The NVMe emulation missed this logic and unconditionally treated the
    last entry as a pointer to the next list. Fix is to check if the
    remaining data is greater than the page size before using the last
    entry as a pointer to the next list.

    PR:             256422
    Reported by:    dave@syix.com
    Tested by:      jason@tubnor.net
    MFC after:      5 days
    Relnotes:       yes
    Reviewed by:    imp, grehan
    Differential Revision:  https://reviews.freebsd.org/D30897

 usr.sbin/bhyve/pci_nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a7761d19dacd414c8b8269a6cf909ab4528783dc

commit a7761d19dacd414c8b8269a6cf909ab4528783dc
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-06-27 22:14:52 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-07-09 14:24:14 +0000

    bhyve: Fix NVMe iovec construction for large IOs

    The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
    in the NVMe emulation's construction of iovec's.

    By default, NVMe data transfer operations use a scatter-gather list in
    which all entries point to a fixed size memory region. For example, if
    the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
    themselves are also fixed size (default is 512 entries).

    Because the list size is fixed, the last entry is special. If the IO
    requires more than 512 entries, the last entry in the list contains the
    address of the next list of entries. But if the IO requires exactly 512
    entries, the last entry points to data.

    The NVMe emulation missed this logic and unconditionally treated the
    last entry as a pointer to the next list. Fix is to check if the
    remaining data is greater than the page size before using the last
    entry as a pointer to the next list.

    PR:             256422
    Reported by:    dave@syix.com
    Tested by:      jason@tubnor.net
    Relnotes:       yes

    (cherry picked from commit 91064841d72b285a146a3f1c32cb447251e062ea)

 usr.sbin/bhyve/pci_nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cd8a5d2316a12a8abca458c31467dc9dcf8361ce

commit cd8a5d2316a12a8abca458c31467dc9dcf8361ce
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-06-27 22:14:52 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-07-09 14:25:45 +0000

    bhyve: Fix NVMe iovec construction for large IOs

    The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
    in the NVMe emulation's construction of iovec's.

    By default, NVMe data transfer operations use a scatter-gather list in
    which all entries point to a fixed size memory region. For example, if
    the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
    themselves are also fixed size (default is 512 entries).

    Because the list size is fixed, the last entry is special. If the IO
    requires more than 512 entries, the last entry in the list contains the
    address of the next list of entries. But if the IO requires exactly 512
    entries, the last entry points to data.

    The NVMe emulation missed this logic and unconditionally treated the
    last entry as a pointer to the next list. Fix is to check if the
    remaining data is greater than the page size before using the last
    entry as a pointer to the next list.

    PR:             256422
    Reported by:    dave@syix.com
    Tested by:      jason@tubnor.net
    Relnotes:       yes

    (cherry picked from commit 91064841d72b285a146a3f1c32cb447251e062ea)

 usr.sbin/bhyve/pci_nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Anything more to do here? Thanks

(In reply to commit-hook from comment #16)
> Relnotes: yes

The <https://www.freebsd.org/releases/13.1R/> release notes included:

> NVMe iovec construction for large IOs in bhyve(8) has been fixed. The
> problem was exposed by the UEFI driver included with Rocky Linux 8.4.
> a7761d19dacd

(In reply to commit-hook from comment #17)
> Relnotes: yes

The <https://www.freebsd.org/releases/12.3R/> release notes referred to FreeBSD-EN-21:25.bhyve, which referred to this bug, 256422.
I think this can get closed now. Thank you for the report.