Bug 250580 - VMware UEFI guests crash in virtual hardware after r366691
Summary: VMware UEFI guests crash in virtual hardware after r366691
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-24 16:51 UTC by Phillip R. Jaenke
Modified: 2020-11-26 02:14 UTC (History)
10 users (show)

See Also:


Attachments
Screenshot (20.12 KB, image/png), 2020-11-21 08:38 UTC, Christian Ullrich
VMware workstation log, slightly edited (134.51 KB, text/plain), 2020-11-21 08:38 UTC, Christian Ullrich

Description Phillip R. Jaenke 2020-10-24 16:51:27 UTC
Confirmed using multiple guests, all amd64. The breaking commit was the merge to stable/12 of r366422 and r366588. UEFI guests prior to that merge boot reliably, with no extra tweaks needed. It Just Worked(TM).

Any attempt after this results in an immediate crash and full power off of the VM (AHCI signaled) at the point of switching to the UEFI console, AFTER it reports which console is being switched to.
Splitting the merge back into its original commits did not help either. The only fix that worked was backing out all of r366691.
Comment 1 Warner Losh freebsd_committer 2020-10-25 21:28:53 UTC
Any chance I could get remote access to the environment that causes problems?
Comment 2 Phillip R. Jaenke 2020-10-28 01:53:31 UTC
(In reply to Warner Losh from comment #1)
Unfortunately, this is in a restricted environment, so no luck this time. *Fortunately*, the reproduction is relatively easy.

1) Needs AMD Ryzen, Threadripper, or EPYC in non-EVC with or without SEV
2) Needs vSphere 7.0 build 16324942 or later (7.0b or 7.0U1) - 16321839 and lower will not boot UEFI cleanly. 
3) Create a guest with the following properties:
- Hardware Version 17 (must be 17; 16 and below will not boot UEFI)
- Any number of CPUs
- At least 1GB of RAM; 512MB didn't behave consistently
- Any disk configuration
- 1 or more VMXnet3 adapters
- Boot Options must be EFI, Secure Boot disabled
- Do NOT disable acceleration or enable debug under advanced
4) Install FreeBSD 12.1-RELEASE amd64 from ISO. Don't make any adjustments, just leave everything defaults. open-vm-tools-nox11 is optional but recommended for snapshots.
5) Confirm reboot. Update to 12.1-p10, confirm reboot. SNAPSHOT HERE!
6) Perform `freebsd-update -r 12.2-RELEASE upgrade` (followed by `freebsd-update install`) and reboot
7) Guest will now crash as described including power off behavior.
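For reference, the in-guest portion of steps 4 through 6 can be sketched as the commands below. This is a hedged sketch, not a verbatim transcript from the report; note that a major-version upgrade with freebsd-update also requires the `upgrade` and `install` subcommands.

```shell
# Sketch of the in-guest upgrade path (run as root inside the FreeBSD guest,
# starting from a stock 12.1-RELEASE install).
freebsd-update fetch install             # bring 12.1-RELEASE to the latest patch level
shutdown -r now                          # confirm the reboot, then take the VM snapshot here

freebsd-update -r 12.2-RELEASE upgrade   # stage the major-version upgrade
freebsd-update install                   # install the new kernel and world
shutdown -r now                          # this reboot triggers the firmware crash
```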

The snapshot in step 5 is critical. If you revert to it, the guest goes right back to working normally; either a live or a powered-down snapshot works. The base and snapshot are both clone- and template-safe as well, for spinning up more VMs if needed.
I forgot to note that this also reproduces on -CURRENT after the r366422 and r366588 commits, though -CURRENT VMs do NOT appear recoverable with live snapshots.

You can walk every revision from the initial 12.2 branch point up to r366691: each one works just fine until that commit. After it, you receive the EFI frame buffer loading message and then an immediate console disconnect (within probably 500 ms; it's fast).
The only event you get is a guest error: "The firmware encountered an unexpected exception. The virtual machine cannot boot. An error message listing a collection of observations has been reported by the virtual machine." This indicates that the kernel reported an EFI error upward.
Comment 3 Roger Leigh 2020-10-31 15:43:49 UTC
Note this isn't AMD-specific.  I've also reproduced on a Mac Mini (Intel) with VMware Fusion.  Screenshot and detailed logs are attached to https://communities.vmware.com/thread/643481
Comment 4 Phillip R. Jaenke 2020-10-31 16:40:49 UTC
(In reply to Roger Leigh from comment #3)
Yep, confirmed now this is not AMD-specific. Reproduced on a BabyDragon Gen.5 and a BabyDragon Gen.3.
Looking things over more closely, I am far more confident that imp@'s fix for bhyve is what broke VMware; I think the PCI probe is what's causing it. That leaves an open question, though: is FreeBSD at fault or is VMware? If VMware's response is malformed, boom. But if FreeBSD's probe is malformed, also boom.

We need to get a VMware engineer involved here to sort it out. I think we're running afoul of assumptions about behavior made on both sides. FreeBSD assumed where a video device would be previously, and assumed a reasonable response to a probe, while VMware may have assumed FreeBSD wouldn't probe the video device, and may not be answering in a sane fashion.
Comment 5 Hongbo He 2020-11-02 06:25:40 UTC
Yes, this bug is not AMD-specific. I'm running FreeBSD 12.1 on an ESXi server with an Intel Xeon E3-1230v3 CPU, and after upgrading to 12.2 it failed to boot.

Also, another virtual machine running FreeNAS 11.3 breaks after upgrading to TrueNAS 12, which under the hood is also FreeBSD 12.2.

Recently I found I can boot these two virtual machines by entering the EFI shell of the VMware UEFI firmware before booting FreeBSD:

i. Power on the VM and immediately press ESC.
ii. On the firmware's boot menu screen, use the arrow keys to select "EFI Internal Shell (Unsupported option)" and press Enter.
iii. When "Press ESC in 5 seconds to skip startup.nsh, any other key to continue." appears, press any key other than ESC, or just wait out the countdown.
iv. Continue to boot FreeBSD as normal.

I hope this detail helps in investigating the issue.
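If entering the shell by hand on every boot becomes tedious, the same detour might be scriptable with a startup.nsh on the EFI system partition. This is an untested sketch: it assumes the shell maps the FreeBSD ESP as fs0: and that the loader sits at the standard removable-media fallback path, neither of which is confirmed in this report.

```
# startup.nsh -- untested sketch; fs0: and the loader path are assumptions
fs0:
\EFI\BOOT\BOOTX64.EFI
```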
Comment 6 Dan Kotowski 2020-11-05 12:29:49 UTC
Able to replicate the same behavior on

- 12.2-STABLE r367189 GENERIC
- Workstation 15.5.6 build-16341506
- i7-3770K

Can confirm that Hongbo's workaround works for me as well
Comment 7 Richard Wai 2020-11-21 02:11:09 UTC
This issue appears to have been fixed with the most recent release of VMware Workstation, released today (15.5.7 build-17171714).

Running 12.2-RELEASE, it booted fine in EFI.
Comment 8 Christian Ullrich 2020-11-21 08:26:32 UTC
It still fails with the same message on Workstation 16.1.0 build-17198959, released on the same day (2020-11-19) as 15.5.7 mentioned above.
Comment 9 Warner Losh freebsd_committer 2020-11-21 08:28:59 UTC
Can someone share the error messages and/or a screenshot of the failure?
Comment 10 Christian Ullrich 2020-11-21 08:38:11 UTC
Created attachment 219850 [details]
Screenshot
Comment 11 Christian Ullrich 2020-11-21 08:38:54 UTC
Created attachment 219851 [details]
VMware workstation log, slightly edited
Comment 12 Richard Wai 2020-11-23 04:41:07 UTC
I just happened on something very interesting.

My previous post about it working since the latest Workstation 15 update is misleading.

It turns out that if I attempt to boot from an (emulated) NVMe drive, it crashes as others have seen. However, if I boot from an (emulated) SCSI drive, it boots normally in EFI.
Comment 13 Christian Ullrich 2020-11-23 06:53:10 UTC
I cannot reproduce the SCSI/NVMe behavior in 16.1; it fails with either configuration. Which type of SCSI controller did you use?

However, I noticed something else potentially interesting.

It has been mentioned elsewhere (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=244906#c8) that booting from the ISO works, but then the reboot into the installed system fails.

What I have not seen described yet is that rebooting into the ISO _also_ fails, even on a VM that does not include any hard disk at all.

Steps to reproduce:

1. Create VM.
2. Remove hard disk from VM.
3. Boot into 12.2-RELEASE disc1 ISO.
4. Select "Shell" at the first prompt.
5. # reboot

The VM fails with the same firmware error message upon reboot.

6. Delete the VM's .nvram file.
7. Power on VM.

The VM successfully boots into the installer image.

I previously assumed there was a difference between the loader in the ISO and what ends up on the virtual disk, but instead, the problem seems to persist in EFI variables or something else in the NVRAM.
Comment 14 Richard Wai 2020-11-23 06:59:10 UTC
I am using the "lsilogic" SCSI controller.

I wonder if the very act of making any configuration change at all is what makes the difference.
Comment 15 Christian Ullrich 2020-11-23 07:14:50 UTC
A hard disk installation also boots successfully after deleting the .nvram file, but the system clock setting is lost, i.e. unless the host is in UTC, the VM will be in the wrong time zone after booting. This can be worked around by adding

rtc.diffFromUTC = 0

to the .vmx file. Source: https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf, page 10.

There does not appear to be a lot of other important data in the EFI NVRAM on VMware. In cases where the automatically selected hard disk boot option is the right one, deleting the .nvram file may be a usable workaround.
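Putting comments 13 and 15 together, the host-side workaround can be sketched as below. This is a hedged sketch demonstrated on a scratch directory; in real use, point VMDIR at the VM's actual directory and substitute the VM's own .nvram/.vmx file names (all names here are illustrative, not from the report), with the VM powered off.

```shell
# Demonstration on a throwaway directory standing in for a VM folder.
VMDIR=$(mktemp -d)
touch "$VMDIR/freebsd.nvram"
printf 'guestOS = "freebsd-64"\n' > "$VMDIR/freebsd.vmx"

# 1) Delete the NVRAM file so the firmware regenerates it on next power-on.
rm -f "$VMDIR/freebsd.nvram"

# 2) Pin the virtual RTC to UTC so the regenerated NVRAM does not shift
#    the guest clock (per the VMware timekeeping paper cited above).
grep -q '^rtc.diffFromUTC' "$VMDIR/freebsd.vmx" \
  || echo 'rtc.diffFromUTC = 0' >> "$VMDIR/freebsd.vmx"
```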
Comment 16 Yuri Pankov 2020-11-23 21:27:12 UTC
(In reply to Christian Ullrich from comment #15)
Thanks a lot for "rtc.diffFromUTC = 0", it's something I was desperately looking for and my google-fu was failing me for *years*.
Comment 17 Hongbo He 2020-11-26 02:05:42 UTC
(In reply to Christian Ullrich from comment #13)

> 6. Delete the VM's .nvram file.

That's interesting!

Besides the trick of entering the EFI shell before boot, I've tried removing the VM's .nvram file in ESXi's datastore file browser, which also results in a successful boot, but only once.

After a reboot (whether cold boot or warm reboot) it fails again and needs another purge of its NVRAM file, or entering the EFI shell as I mentioned in comment #5.
Comment 18 Hongbo He 2020-11-26 02:14:54 UTC
In addition to comment #17, I've just tried purging NVRAM with 12.2-RC3 (which TrueNAS 12.0 ships as its kernel), with no luck.

12.2-RELEASE boots as normal, while 12.2-RC3 results in a reboot (not a shutdown of the VM) and then a firmware crash that leads to the VM being shut down.

The "enter the EFI shell" trick works for both, however.