Bug 250580

Summary: VMware UEFI guests crash in virtual hardware after r366691
Product: Base System
Reporter: Phillip R. Jaenke <prj>
Component: kern
Assignee: freebsd-virtualization (Nobody) <virtualization>
Status: New
Severity: Affects Many People
CC: chris, dan.kotowski, daniel, emaste, freebsd, freebsd, garga, hehongbo, imp, martinp, ncrogers, richard, rleigh, ruben, sak
Priority: ---
Version: Unspecified
Hardware: amd64
OS: Any
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=244906
Attachments:
- Screenshot
- VMware workstation log, slightly edited
- Sample VMX file

Description Phillip R. Jaenke 2020-10-24 16:51:27 UTC
Confirmed using multiple guests, all amd64. The breaking commit was the MFS12 of r366422 and r366588. UEFI guests prior to that MFS boot reliably. No extra crap. It Just Worked(TM).

Any boot attempt after this results in an immediate crash and full power-off of the VM (AHCI signaled) at the point of switching to the UEFI console, AFTER it reports which console is being switched to.
Splitting the MFS back into the original commits did not help either. The only fix that worked was backing out all of r366691.
Comment 1 Warner Losh freebsd_committer freebsd_triage 2020-10-25 21:28:53 UTC
Any chance I could get remote access to the environment that causes problems?
Comment 2 Phillip R. Jaenke 2020-10-28 01:53:31 UTC
(In reply to Warner Losh from comment #1)
Unfortunately, this is in a restricted environment, so no luck this time. *Fortunately*, the reproduction is relatively easy.

1) Needs AMD Ryzen, Threadripper, or EPYC in non-EVC with or without SEV
2) Needs vSphere 7.0 build 16324942 or later (7.0b or 7.0U1) - 16321839 and lower will not boot UEFI cleanly. 
3) Create a guest with the following properties:
- Hardware Version 17 (must be 17; 16 and below will not boot UEFI)
- Any number of CPUs
- At least 1GB of RAM; 512MB didn't behave consistently
- Any disk configuration
- 1 or more VMXnet3 adapters
- Boot Options must be EFI, Secure Boot disabled
- Do NOT disable acceleration or enable debugging under the advanced options
4) Install FreeBSD 12.1-RELEASE amd64 from ISO. Don't make any adjustments; just leave everything at the defaults. open-vm-tools-nox11 is optional but recommended for snapshots.
5) Confirm reboot. Update to 12.1-p10, confirm reboot. SNAPSHOT HERE!
6) Perform a `freebsd-update -r 12.2-RELEASE upgrade` and reboot (see the command sketch after these steps)
7) The guest will now crash as described, including the power-off behavior.
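For reference, step 6 expands to the standard freebsd-update binary upgrade, roughly the following (a sketch of the usual procedure, run as root; the crash happens on the reboot into the 12.2 kernel):

freebsd-update -r 12.2-RELEASE upgrade
freebsd-update install
shutdown -r now
# after the reboot (the boot that crashes on affected VMware versions):
freebsd-update install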

The snapshot in step 5 is critical. If you revert to this snapshot, the guest will go right back to working normally; either a live or a powered-down snapshot works. The base and snapshot are both clone- and template-safe as well, for spinning up more VMs if needed.
I forgot to note that this also reproduced using -CURRENT after the r366422 and r366588 commits. -CURRENT VMs do NOT appear recoverable with live snapshots, though.

You'll see the behavior through the initial branch of 12.2 until you hit r366691. Every revision from the initial 12.2 branch point up to that commit works just fine. After that commit, you will see the EFI frame buffer loading message and then an immediate console disconnect (within probably 500ms; it's fast).
The only event you will get is a guest error, "The firmware encountered an unexpected exception. The virtual machine cannot boot. An error message listing a collection of observations has been reported by the virtual machine", which indicates that the kernel reported an EFI error upward.
Comment 3 Roger Leigh 2020-10-31 15:43:49 UTC
Note this isn't AMD-specific.  I've also reproduced on a Mac Mini (Intel) with VMware Fusion.  Screenshot and detailed logs are attached to https://communities.vmware.com/thread/643481
Comment 4 Phillip R. Jaenke 2020-10-31 16:40:49 UTC
(In reply to Roger Leigh from comment #3)
Yep, confirmed now this is not AMD-specific. Reproduced on a BabyDragon Gen.5 and a BabyDragon Gen.3.
Looking over things more closely, I am far more confident that imp@'s fix for bhyve is what broke VMware. I think the PCI probe is what's causing it. However, that leaves the open question of whether FreeBSD or VMware is at fault. If VMware's response is malformed, well, boom. But if FreeBSD's probe is malformed, also boom.

We need to get a VMware engineer involved here to sort it out. I think we're running afoul of assumptions about behavior made on both sides. FreeBSD previously assumed where a video device would be and assumed a reasonable response to a probe, while VMware may have assumed FreeBSD wouldn't probe the video device and may not be answering in a sane fashion.
Comment 5 Hongbo He 2020-11-02 06:25:40 UTC
Yes, this bug is not AMD-specific. I'm running FreeBSD 12.1 on an ESXi server with an Intel Xeon E3-1230v3 CPU, and after upgrading to 12.2 it failed to boot.

Another virtual machine running FreeNAS 11.3 also breaks after upgrading to TrueNAS 12, which under the hood is also FreeBSD 12.2.

Recently I found I can boot these two virtual machines by entering the EFI shell of the VMware UEFI firmware before booting FreeBSD (a possible way to automate this is sketched after the steps):

i. Power ON the VM and immediately press the ESC key.
ii. On the boot menu screen of the VMware VM's firmware, use the arrow keys to select "EFI Internal Shell (Unsupported option)" and press Enter to confirm.
iii. After seeing "Press ESC in 5 seconds to skip startup.nsh, any other key to continue." press a key other than ESC or just wait for the countdown.
iv. Continue to boot FreeBSD as normal.
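(Untested idea: it might be possible to automate this by dropping a startup.nsh at the root of the EFI system partition that launches the FreeBSD loader directly; fs0: and the loader path below are assumptions based on a default install, not something I have verified on an affected VM.)

# \startup.nsh at the root of the ESP
fs0:
\EFI\BOOT\BOOTx64.efi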

Maybe this detail helps to investigate the issue.
Comment 6 Dan Kotowski 2020-11-05 12:29:49 UTC
Able to replicate the same behavior on

- 12.2-STABLE r367189 GENERIC
- Workstation 15.5.6 build-16341506
- i7-3770K

Can confirm that Hongbo's workaround works for me as well.
Comment 7 Richard Wai 2020-11-21 02:11:09 UTC
This issue appears to have been fixed with the most recent release of VMware Workstation that was just released today (15.5.7 build-17171714).

Running 12.2-RELEASE, booted fine in EFI.
Comment 8 Christian Ullrich 2020-11-21 08:26:32 UTC
It still fails with the same message on Workstation 16.1.0 build-17198959, released on the same day (2020-11-19) as 15.5.7 mentioned above.
Comment 9 Warner Losh freebsd_committer freebsd_triage 2020-11-21 08:28:59 UTC
Can someone share the error messages and/or a screenshot of the failure?
Comment 10 Christian Ullrich 2020-11-21 08:38:11 UTC
Created attachment 219850 [details]
Screenshot
Comment 11 Christian Ullrich 2020-11-21 08:38:54 UTC
Created attachment 219851 [details]
VMware workstation log, slightly edited
Comment 12 Richard Wai 2020-11-23 04:41:07 UTC
I just happened on something very interesting.

My previous post about it working since the latest Workstation 15 update is misleading.

It turns out that if I attempt to boot from an (emulated) NVMe drive, it crashes as others have seen. However, if I boot from an (emulated) SCSI drive, it boots normally in EFI.
Comment 13 Christian Ullrich 2020-11-23 06:53:10 UTC
I cannot reproduce the SCSI/NVMe behavior in 16.1; it fails with either configuration. Which type of SCSI controller did you use?

However, I noticed something else potentially interesting.

It has been mentioned elsewhere (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=244906#c8) that booting from the ISO works, but then the reboot into the installed system fails.

What I have not seen described yet is that rebooting into the ISO _also_ fails, even on a VM that does not include any hard disk at all.

Steps to reproduce:

1. Create VM.
2. Remove hard disk from VM.
3. Boot into 12.2-RELEASE disc1 ISO.
4. Select "Shell" at the first prompt.
5. # reboot

The VM fails with the same firmware error message upon reboot.

6. Delete the VM's .nvram file.
7. Power on VM.

The VM successfully boots into the installer image.

I previously assumed there was a difference between the loader in the ISO and what ends up on the virtual disk, but instead the problem seems to be something persisted in EFI variables or elsewhere in the NVRAM.
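For anyone scripting the workaround: with the VM powered off, the file can simply be removed from the host. The paths and names below are examples only, not taken from this report.

# Workstation/Fusion host (example path):
rm "/path/to/VMs/FreeBSD 12.2/FreeBSD 12.2.nvram"
# ESXi host shell (example datastore and VM directory):
rm /vmfs/volumes/datastore1/freebsd12/freebsd12.nvram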
Comment 14 Richard Wai 2020-11-23 06:59:10 UTC
I am using the "lsilogic" SCSI controller.

I wonder if the very act of making any configuration change at all is what makes the difference...
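For anyone trying to reproduce the NVMe vs. SCSI split, the disk controller is chosen by .vmx entries roughly like the following. The keys are the usual Workstation ones, written from memory as an illustration rather than copied from the attached sample VMX, so double-check them against a VM created in the UI. The SCSI variant that boots for me:

scsi0.present = "TRUE"
scsi0.virtualDev = "lsilogic"
scsi0:0.fileName = "disk.vmdk"

and the NVMe variant that crashes:

nvme0.present = "TRUE"
nvme0:0.fileName = "disk.vmdk"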
Comment 15 Christian Ullrich 2020-11-23 07:14:50 UTC
A hard disk installation also boots successfully after deleting the .nvram file, but the system clock setting is lost, i.e. unless the host is in UTC, the VM will be in the wrong time zone after booting. This can be worked around by adding

rtc.diffFromUTC = 0

to the .vmx file. Source: https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf, page 10.

There does not appear to be a lot of other important data in the EFI NVRAM on VMware. In cases where the automatically selected hard disk boot option is the right one, deleting the .nvram file may be a usable workaround.
Comment 16 Yuri Pankov 2020-11-23 21:27:12 UTC
(In reply to Christian Ullrich from comment #15)
Thanks a lot for "rtc.diffFromUTC = 0", it's something I was desperately looking for and my google-fu was failing me for *years*.
Comment 17 Hongbo He 2020-11-26 02:05:42 UTC
(In reply to Christian Ullrich from comment #13)

> 6. Delete the VM's .nvram file.

That's interesting!

Besides the trick of entering the EFI shell before boot, I've tried removing the VM's .nvram file in ESXi's datastore file browser, which also results in a successful boot, but only once.

After a reboot (including both cold boots and hot reboots) it will fail again and needs another purge of its NVRAM file, or entering the EFI shell as I mentioned in comment #5.
Comment 18 Hongbo He 2020-11-26 02:14:54 UTC
In addition to comment #17, I've just tried purging NVRAM with 12.2-RC3 (which TrueNAS 12.0 ships as its kernel) with no luck.

12.2-RELEASE will boot as normal, while 12.2-RC3 results in a reboot (not a shutdown of the VM) and then a firmware crash that leads to the VM being shut down.

The "entering the EFI shell" trick works for both, however.
Comment 19 Daniel Morante 2021-01-28 02:42:10 UTC
Here's what I'm observing:

VMware Workstation: 15.5.7 build-17171714
Host: Windows 7 Enterprise, 64-bit 6.1.7601, Service Pack 1
Guest OS ISO: FreeBSD-12.2-RELEASE-amd64-bootonly.iso

1) Create a new VM, set firmware to EFI (sample .vmx attached)
2) Boot off the ISO image.

FreeBSD boots off the CD with no problem and I am able to install the OS properly.  Not sure if it matters, but I am using UFS on GPT.

When you reboot, it fails with the described issue.
Deleting the VM's .nvram file lets the machine boot properly. 
Rebooting results in the same problem.
Deleting the VM's .nvram file again lets the machine boot properly.
Comment 20 Daniel Morante 2021-01-28 02:43:04 UTC
Created attachment 221977 [details]
Sample VMX file
Comment 21 Dan Kotowski 2021-01-28 11:49:25 UTC
UEFI boot is now functioning normally with 12.2-STABLE r369049 on VMware Workstation 16.1.0 build-17198959.
Comment 22 ruben 2021-03-22 19:27:59 UTC
Chiming in with a similar observation (that removing the VM's nvram file alleviates the issue).

Installed FreeBSD 11.2 with Auto ZFS in VMware Fusion 12.1.0 (as RAID1).
Upgraded it to 12.2. gpart bootcode was used to update both boot1.efifat and gptzfsboot.

UEFI boot doesn't work anymore, unless the system is reset from the boot loader and "boot from file" is used, choosing BOOTx64.efi from either EFI partition.

Creating it as a boot entry using the VMware EFI setup screen made no difference for automatic booting, but it would work when dropping into the VMware EFI firmware again and selecting one of the entries that were created.

If not, any automated (re)boot makes VMware stop with "The firmware encountered an unexpected exception."

Removing the nvram file lets the system boot normally once.

After that, it will run into "The firmware encountered an unexpected exception" again.

However, replacing BOOTx64.efi with /boot/loader.efi instead of /boot/boot1.efi lets the system reliably reboot into multi-user mode (see the sketch below).

Keeping BOOTx64.efi as boot1.efi and creating boot entries for \EFISYS\FreeBSD\loader.efi doesn't work.
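A rough sketch of that replacement from the running system (the device name and mount point are assumptions; use gpart show to find the efi partitions, and remember the RAID1 layout above has two of them):

# mount the EFI system partition (adjust the device to your layout)
mount -t msdosfs /dev/ada0p1 /mnt
# keep a copy of the old boot1-based loader, then install loader.efi in its place
cp /mnt/EFI/BOOT/BOOTx64.efi /mnt/EFI/BOOT/BOOTx64.efi.bak
cp /boot/loader.efi /mnt/EFI/BOOT/BOOTx64.efi
umount /mnt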
Comment 23 Phillip R. Jaenke 2021-04-25 17:50:07 UTC
This continues to reproduce on 12.2-RELEASE but does not reproduce on 13.0-RELEASE with ESXi build 17551050. Note that it WILL reproduce on 13 with some ESXi builds below 17551050; I'm not sure which ones.
Comment 24 Mark Linimon freebsd_committer freebsd_triage 2021-05-14 12:00:28 UTC
^Triage: correct assignment.  Discussed with: koobs@.