Bug 256422 - bhyve and Centos/Rocky 8.4 no boot after install
Summary: bhyve and Centos/Rocky 8.4 no boot after install
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bhyve (show other bugs)
Version: 13.0-STABLE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-05 00:31 UTC by dave
Modified: 2021-07-09 14:27 UTC (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description dave 2021-06-05 00:31:47 UTC
Installing Centos or Rocky 8.4 results in a failed boot.  The initial install works, but on reboot I get this while loading:  

Starting webhost04a
  * found guest in /storage/vm/webhost04a
  * booting...
BdsDxe: failed to load Boot0001 "UEFI bhyve-NVMe NVME-4-0" from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,01-00-68-C1-20-FC-9C-58): Not Found  

Logging from vm-bhyve:  

Jun 04 17:18:00: booting 
Jun 04 17:18:00:  [bhyve options: -c 8 -m 16G -Hwl bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd -U 62ff48d0-c58d-11eb-9187-f8bc1251963e]
Jun 04 17:18:00:  [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,nvme,/dev/zvol/storage/vm/webhost04a/disk0 -s 5:0,virtio-net,tap0,mac=58:9c:fc:07:6d:b7 -s 6:0,fbuf,tcp=192.168.1.150:5900 -s 7:0,xhci,tablet]  


Note, Rocky 8.3 and Centos 8.3 both install and boot fine. with exactly the same configs in vm-bhyve
Comment 1 Chuck Tuffli freebsd_committer 2021-06-07 14:44:53 UTC
Does the guest boot if you change the device from nvme to ahci?
Comment 2 dave 2021-06-07 15:56:09 UTC
Installing guest again, using virtio-blk instead of nvme results in a working client.  

Same guest, just change nvme to ahci-hd, works.

Appears nvme support in UEFI bios has a change in the new RHEL 8.4 and successors.
Comment 3 Peter Grehan freebsd_committer 2021-06-07 20:28:25 UTC
I was able to repro this with Alma 8.3/8.4 (identical to Centos 8.3/8.4).

With a file-backed image on ZFS, the sectorsize parameter was forced to 4K and 512, with no difference in getting the system to boot.

The error appears to be in the EFI loader on Centos:

"
BdsDxe: loading Boot0001 "UEFI bhyve-NVMe NVME-4-0" from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,01-00-DD-44-20-FC-9C-58)
BdsDxe: starting Boot0001 "UEFI bhyve-NVMe NVME-4-0" from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,01-00-DD-44-20-FC-9C-58)
Unexpected return from initial read: Device Error, buffersize 0
Failed to load image \EFI\almalinux\grubx64.efi: Device Error
start_image() returned Device Error
StartImage failed: Device Error
"
Comment 4 Jason Tubnor 2021-06-09 02:18:58 UTC
Has this been tested against bare metal that has UEFI and NVMe?

I got the same as grehan@ when testing with both CentOS 8.4 and Stream. Observations suggest there is something up with the CentOS EFI shim for GRUB.

I have done testing against the following, fully updated as of 20210609 12:10 +10UTC:

openSUSE Tumbleweed
Ubuntu impish 21.10 nightly
Artix (Arch) GRUB 2.04-10 Linux 5.12.8

None of these experienced any issues with the NVMe device presented by bhyve.
Comment 5 Jason Tubnor 2021-06-09 03:41:28 UTC
I successfully updated CentOS from 8.3 to 8.4 and it is running fine on a NVMe bhyve device.

It is looking more like how the installer deals with determining the boot device and then writing this and the GRUB components to the storage.
Comment 6 Chuck Tuffli freebsd_committer 2021-06-09 23:28:11 UTC
Is it possible to recompile pci_nvme.c and enable debug in the failing case? I.e. change the code to:

    static int nvme_debug = 1;
Comment 7 Peter Grehan freebsd_committer 2021-06-10 15:24:59 UTC
This looks to be an edge condition in the EFI NVMe driver, caused by the large maximum data transfer size advertised by bhyve NVMe (2MB), and the increase in size of grubx64.efi from 1.9MB in centos 8.3, to 2.3MB in centos 8.4.

In 8.4, EFI attempts to read 2MB of grubx64.efi. However, the buffer starts at a non page-aligned address, using PRP1 in the command descriptor with an offset. PRP2 points to a PRP list, but with a 2MB transfer size, all 512 PRP entries in a page will be used. Since the first buffer was unaligned, there is a small amount left at the end, and EFI is putting garbage into that entry.

(Copying the smaller 8.3 grubx64.efi to an 8.4 system resulted in a successful boot).

A suggested fix is to drop the advertised mdts to something that isn't right on the verge of requiring a chained PRP list. Qemu defaults to 512KB, and h/w I've looked at advertises 256K. e.g.

--- a/usr.sbin/bhyve/pci_nvme.c
+++ b/usr.sbin/bhyve/pci_nvme.c
@@ -106,7 +106,7 @@ static int nvme_debug = 0;
 #define        NVME_MPSMIN_BYTES       (1 << (12 + NVME_MPSMIN))
 
 #define        NVME_PRP2_ITEMS         (PAGE_SIZE/sizeof(uint64_t))
-#define        NVME_MDTS               9
+#define        NVME_MDTS               7

(or 8)

8.4 boots fine with this change.
Comment 8 Jason Tubnor 2021-06-10 16:38:03 UTC
I can confirm the patch from grehan@ works as described. Tested against:

CentOS 8.4
Windows Server 2022
OpenBSD 6.9

No regression was introduced into the latter two existing operating systems on our system.
Comment 9 Chuck Tuffli freebsd_committer 2021-06-24 00:49:23 UTC
To document this a bit more, the error in pci_nvme.c is:
...
nvme doorbell 1, SQ, val 0x1
nvme_handle_io qid 1 head 0 tail 1 cmdlist 0x8c1885000
pci_nvme_io_done error 14 Bad address
pci_nvme_set_completion sqid 1 cqid 1 cid 116 status: 0x0 0x4
pci_nvme_set_completion: CQ1 interrupt disabled
nvme doorbell 1, CQ, val 0x1

blockif_write() asynchronously returns EFAULT (a.k.a. "Bad address"). Adding some debugging to pci_nvme_io_done() shows the following:

Fault on SQ[1] opc=0x2 cid=0x74 bytes=2097152
[  0] 0x8c097a018 : 2097128
[  1] 0x0 : 24

The debug statement displays the contents of the pci_nvme_ioreq structure for the failing IO, including valid iovec entries. This is from an I/O Submission Queue (i.e. queue ID != 0) for a Read (opcode 0x2) of 2097152 bytes.

My suspicion is the second iovec with the NULL iov_base value causes the EFAULT.
Comment 10 Chuck Tuffli freebsd_committer 2021-06-24 00:55:05 UTC
I think nvme_write_read_blockif() has a bug:

 }

 static void
@@ -1978,7 +1984,7 @@ nvme_write_read_blockif(struct pci_nvme_softc *sc,
                /* PRP2 is pointer to a physical region page list */
                while (bytes) {
                        /* Last entry in list points to the next list */
-                       if (prp_list == last) {
+                       if ((prp_list == last) && (bytes > PAGE_SIZE)) {
                                uint64_t prp = *prp_list;

                                prp_list = paddr_guest2host(vmctx, prp,

Note that I cleaned up some additional things and your line numbers won't quite match up. But I believe this is the crux of the fix necessary.
Comment 11 Jason Tubnor 2021-06-24 03:07:45 UTC
I reverted the grehan@ patch and applied chuck@ patch and tested against:

CentOS 8.4
Windows Server 2022
OpenBSD 6.9-snapshot

and all of the above installed, rebooted and booted correctly.
Comment 12 Peter Grehan freebsd_committer 2021-06-24 04:29:36 UTC
Chuck's fix is the correct one; mine was just a bandaid.
Comment 13 Chuck Tuffli freebsd_committer 2021-06-24 16:25:57 UTC
Actually, Peter's analysis helped me immediately zero in on the problem :)
Comment 14 Chuck Tuffli freebsd_committer 2021-06-25 15:33:50 UTC
Change up for review https://reviews.freebsd.org/D30897
Comment 15 commit-hook freebsd_committer 2021-06-27 22:23:10 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=91064841d72b285a146a3f1c32cb447251e062ea

commit 91064841d72b285a146a3f1c32cb447251e062ea
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-06-27 22:14:52 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-06-27 22:14:52 +0000

    bhyve: Fix NVMe iovec construction for large IOs

    The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
    in the NVMe emulation's construction of iovec's.

    By default, NVMe data transfer operations use a scatter-gather list in
    which all entries point to a fixed size memory region. For example, if
    the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
    themselves are also fixed size (default is 512 entries).

    Because the list size is fixed, the last entry is special. If the IO
    requires more than 512 entries, the last entry in the list contains the
    address of the next list of entries. But if the IO requires exactly 512
    entries, the last entry points to data.

    The NVMe emulation missed this logic and unconditionally treated the
    last entry as a pointer to the next list. Fix is to check if the
    remaining data is greater than the page size before using the last entry
    as a pointer to the next list.

    PR:             256422
    Reported by:    dave@syix.com
    Tested by:      jason@tubnor.net
    MFC after:      5 days
    Relnotes:       yes
    Reviewed by:    imp, grehan
    Differential Revision:  https://reviews.freebsd.org/D30897

 usr.sbin/bhyve/pci_nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 16 commit-hook freebsd_committer 2021-07-09 14:26:44 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a7761d19dacd414c8b8269a6cf909ab4528783dc

commit a7761d19dacd414c8b8269a6cf909ab4528783dc
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-06-27 22:14:52 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-07-09 14:24:14 +0000

    bhyve: Fix NVMe iovec construction for large IOs

    The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
    in the NVMe emulation's construction of iovec's.

    By default, NVMe data transfer operations use a scatter-gather list in
    which all entries point to a fixed size memory region. For example, if
    the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
    themselves are also fixed size (default is 512 entries).

    Because the list size is fixed, the last entry is special. If the IO
    requires more than 512 entries, the last entry in the list contains the
    address of the next list of entries. But if the IO requires exactly 512
    entries, the last entry points to data.

    The NVMe emulation missed this logic and unconditionally treated the
    last entry as a pointer to the next list. Fix is to check if the
    remaining data is greater than the page size before using the last entry
    as a pointer to the next list.

    PR:             256422
    Reported by:    dave@syix.com
    Tested by:      jason@tubnor.net
    Relnotes:       yes

    (cherry picked from commit 91064841d72b285a146a3f1c32cb447251e062ea)

 usr.sbin/bhyve/pci_nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 17 commit-hook freebsd_committer 2021-07-09 14:27:46 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cd8a5d2316a12a8abca458c31467dc9dcf8361ce

commit cd8a5d2316a12a8abca458c31467dc9dcf8361ce
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-06-27 22:14:52 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-07-09 14:25:45 +0000

    bhyve: Fix NVMe iovec construction for large IOs

    The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
    in the NVMe emulation's construction of iovec's.

    By default, NVMe data transfer operations use a scatter-gather list in
    which all entries point to a fixed size memory region. For example, if
    the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
    themselves are also fixed size (default is 512 entries).

    Because the list size is fixed, the last entry is special. If the IO
    requires more than 512 entries, the last entry in the list contains the
    address of the next list of entries. But if the IO requires exactly 512
    entries, the last entry points to data.

    The NVMe emulation missed this logic and unconditionally treated the
    last entry as a pointer to the next list. Fix is to check if the
    remaining data is greater than the page size before using the last entry
    as a pointer to the next list.

    PR:             256422
    Reported by:    dave@syix.com
    Tested by:      jason@tubnor.net
    Relnotes:       yes

    (cherry picked from commit 91064841d72b285a146a3f1c32cb447251e062ea)

 usr.sbin/bhyve/pci_nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)