Bug 274810 - [regression] FreeBSD 14.0-RC3 crash during early boot on Vultr with custom ISO
Summary: [regression] FreeBSD 14.0-RC3 crash during early boot on Vultr with custom ISO
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Zhenlei Huang
URL: https://reviews.freebsd.org/D42414
Keywords: regression
Depends on:
Blocks: 14.0r
Reported: 2023-10-30 10:24 UTC by Zhenlei Huang
Modified: 2024-02-26 09:58 UTC
CC List: 6 users

Flags: zlei: mfc-stable14+


Attachments
Vultr FreeBSD 14.0-RC3 crash (158.19 KB, image/jpeg), 2023-10-30 10:24 UTC, Zhenlei Huang
dmesg of affected vm (33.91 KB, text/plain), 2023-10-31 01:49 UTC, Zhenlei Huang
Patch against releng/14.0 (695 bytes, patch), 2023-10-31 03:07 UTC, Zhenlei Huang
acpidump from vultr VM (55.82 KB, text/plain), 2023-10-31 09:37 UTC, Zhenlei Huang

Description Zhenlei Huang freebsd_committer freebsd_triage 2023-10-30 10:24:19 UTC
Created attachment 245990
Vultr FreeBSD 14.0-RC3 crash

I noticed this while upgrading one of my Vultr VMs from 13.2 to 14.0-RC3.
I managed to reproduce it with a new VM and FreeBSD-14.0-RC3-amd64-bootonly.iso from FreeBSD's public FTP site.

A text version of the page fault (transcribed from the screenshot as best I could):

current process = 0 (swapper )
rdi: 0000000000000000 rsi: 0000000000000000 rdx: 0000000000000000
rcx: 0000000000000000  r8: 0000000000000000  r9: fffffe00463a1000
rax: 0000000000000000 rbx: fffff80003610900 rbp: ffffffff828b6e00
r10: 0000000000000000 r11: fffff8000302ed10 r12: fffffe0044f51000
r13: 0000000000000000 r14: 0000000000000000 r15: fffffe0044f51000
trap number = 12
panic: page fault
cpuid = 0
time = 2
DB: stack backtrace:
#0 0xffffffff80b9002d at kdb_backtrace+0x5d
#1 0xffffffff80b43132 at vpanic+0x132
#2 0xffffffff80b42ff3 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x48c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe3818 at calltrap+0x8
#6 0xffffffff80f89a5b at vmbus_intrhook+0x27b
#7 0xffffffff80b1afe1 at run_interrupt_driven_config_hooks+0xd1
#8 0xffffffff88670431 at boot_run_interrupt_driven_config_hooks+0x21
#9 0xffffffff80acc3c5 at mi_startup+0xb5
#10 0xffffffff80376023 at btext+0x23
Uptime: 25
Automatic reboot in 15 seconds - press a key on the console to abort
Comment 1 Mitchell Horne freebsd_committer freebsd_triage 2023-10-30 19:34:26 UTC
Hi,

I looked at the reported faulting address in objdump/addr2line. It seems to be the result of a bad call to acpi_get_handle(), whose definition is expanded from line 280 of acpivar.h.

Consider the following two lines in vmbus_doattach(), added in e7a9817b8d32 (Sept 2023):

	dev_res =  devclass_get_device(devclass_find("vmbus_res"), 0);
	handle = acpi_get_handle(dev_res);

There is no NULL check for dev_res, which means that if the vmbus_res0 device is not found (i.e. it never attached), we get a page fault in the subsequent call to acpi_get_handle().

Now, _why_ vmbus_res0 can't be found, I cannot guess. It has similar attachment criteria to vmbus0.

Strangely, my Vultr VM doesn't run on Hyper-V; the kern.vm_guest sysctl reports "kvm" instead. So this is all I can do when it comes to testing/debugging. Let me tag the maintainers.
Comment 2 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 01:38:40 UTC
(In reply to Mitchell Horne from comment #1)
Based on your analysis, I initially thought Vultr had incorrectly included Hyper-V devices in the VM's config. Later I found this on the affected VM:

```
# sysctl kern.vm_guest
kern.vm_guest: hv
```

I can confirm that from the dmesg log of a verbose boot (FreeBSD 13.2).

This is interesting. Not all hypervisors of Vultr are KVM.
Comment 3 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 01:49:37 UTC
Created attachment 246010
dmesg of affected vm
Comment 4 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 03:07:07 UTC
Created attachment 246012
Patch against releng/14.0

Good news,

Based on Mitchell's analysis, I made this patch (against releng/14.0).
Now the affected VM boots fine!
Comment 5 schakrabarti@microsoft.com 2023-10-31 05:35:13 UTC
(In reply to Zhenlei Huang from comment #0)
Can you please collect the acpidump -dt output and share it here?
This code path should only be hit if the environment is Hyper-V based.
Comment 6 schakrabarti@microsoft.com 2023-10-31 05:38:52 UTC
Most likely the system is on gen1 Hyper-V, but we can confirm after checking the acpidump -dt output.
Comment 7 schakrabarti@microsoft.com 2023-10-31 09:05:48 UTC
Please also share the dmesg output, since gen1 VMs also have vmbus_res0.
From the dmesg of a gen1 VM in Azure:

vmbus_res0: <Hyper-V Vmbus Resource> irq 5,7 on acpi0
Comment 8 schakrabarti@microsoft.com 2023-10-31 09:21:34 UTC
In gen1, this is the device tree:

  acpi0
    pcib0
      vmbus0
        hvet0
        storvsc0
        storvsc1
        hvheartbeat0
        hvkvp0
        hvshutdown0
        hvtimesync0
        storvsc2
        storvsc3
        hvkbd0
        hn0
        pcib1
          pci1
            mlx5_core0
      pci0
        hostb0
        isab0
          isa0
            orm0
            vga0
        atapci0
          ata0
          ata1
        vgapci0
    atdma0
    attimer0
    atrtc0
    atkbdc0
      atkbd0
      psm0
    psmcpnp0
    fpupnp0
    uart0
    uart1
    fdc0
      fd0
    acpi_sysresource0
    acpi_sysresource1
    vmbus_res0

And in gen2:

nexus0
  acpi0
    acpi_syscontainer0
      vmbus0
        hvhid0
          hidbus0
            hms0
        hvkbd0
        hvheartbeat0
        hvkvp0
        hvshutdown0
        hvtimesync0
        hn0
        storvsc0
        storvsc1
        pcib0
          pci0
            mlx5_core0
    uart0
    uart1
    vmbus_res0

So in both cases vmbus_res0 is present, as it is a pseudo device that has been made a child of acpi and owns the resources of vmbus.
Comment 9 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 09:37:18 UTC
Created attachment 246019
acpidump from vultr VM

(In reply to schakrabarti@microsoft.com from comment #6)
> Most likely the system is on gen1 Hyper-V, but we can confirm after
> checking the acpidump -dt output.

See the attachment "acpidump from vultr VM"
Comment 10 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 09:38:53 UTC
(In reply to schakrabarti@microsoft.com from comment #7)

> Also please share the dmesg output, as in gen1 also we have vmbus_res0.
> From dmesg in gen1 VM in Azure:
> vmbus_res0: <Hyper-V Vmbus Resource> irq 5,7 on acpi0

I've uploaded the dmesg.
There are no vmbus_res devices in the dmesg output.
Comment 11 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 09:49:08 UTC
*** UPDATE ***

To be clear, the regression happens only on Vultr VMs installed from a custom ISO. A normal installation, i.e. selecting the FreeBSD server image during VM creation, is not affected.
Comment 12 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 09:53:38 UTC
I have contacted Vultr and the system admin Albert has confirmed that they are using QEMU.

I believe that for custom ISO installations Hyper-V is emulated [1]. Probably it is not fully emulated, hence the guest VM lacks the vmbus_res devices.

The patch is then still applicable as a fix for this corner case.

1. https://fuchsia.googlesource.com/third_party/qemu/+/refs/tags/v7.0.0-rc0/docs/hyperv.txt
Comment 13 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-31 10:03:09 UTC
See also "Hyper-V Enlightenments" in the QEMU documentation [2].

2. https://www.qemu.org/docs/master/system/i386/hyperv.html
Comment 14 Mitchell Horne freebsd_committer freebsd_triage 2023-10-31 13:58:19 UTC
(In reply to Zhenlei Huang from comment #12)

Oh interesting... "what could go wrong?" :D

Anyway, it is good if the scope of the problem is limited to custom ISO installations only, but it is still undesirable.

You can consider your patch 'Reviewed by: mhorne'. Let's see what Souradeep says, but if you can sneak the fix into 14.0-RELEASE that would be excellent. Otherwise it could be distributed as an Errata Notice after the fact.
Comment 15 Zhenlei Huang freebsd_committer freebsd_triage 2023-11-01 09:18:41 UTC
I managed to reproduce this with QEMU 7.2.5 on a Debian 12.2.0 host.

```
# uname -a
Linux debian 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
# qemu-system-x86_64 --version
QEMU emulator version 7.2.5 (Debian 1:7.2+dfsg-7+deb12u2)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers 
```

A minimal script to reproduce it (be sure to load the kvm / kvm_intel modules first; AMD has not been tested yet):

```
#!/bin/sh

qemu-system-x86_64 \
        -vnc 0.0.0.0:1,password=on \
        -monitor stdio \
        --enable-kvm \
        --cpu host,hv-vpindex,hv-synic \
        --smp 1 \
        --m 512M \
        --cdrom FreeBSD-14.0-RC3-amd64-bootonly.iso
```

The feature flags that Vultr enables should be equivalent to:
```
--enable-kvm \
--cpu host,hv-relaxed,hv-vapic,hv-vpindex,hv-synic,hv-time,hv-stimer,hv-xmm-input
```

I've tested the patch with QEMU; it still works :)
Comment 16 commit-hook freebsd_committer freebsd_triage 2023-11-02 09:10:07 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=63bf943d4af17799cef21e2bb78dd28003ce1ce5

commit 63bf943d4af17799cef21e2bb78dd28003ce1ce5
Author:     Zhenlei Huang <zlei@FreeBSD.org>
AuthorDate: 2023-11-02 09:07:11 +0000
Commit:     Zhenlei Huang <zlei@FreeBSD.org>
CommitDate: 2023-11-02 09:07:11 +0000

    Hyper-V: vmbus: Add NULL check for vmbus_res

    QEMU emulates Hyper-V [1] but lacks the emulation for vmbus_res, thus no
    coherence information is available. Add NULL check for it and fallback
    to no coherence. This will prevent FreeBSD guests from panic on QEMU
    with the Hyper-V enlightenment hv-synic enabled.

    For real Hyper-V, both gen1 and gen2 have vmbus_res then they are not
    affected by this change.

    1. https://www.qemu.org/docs/master/system/i386/hyperv.html

    PR:             274810
    Reviewed by:    mhorne, emaste, delphij, whu
    Diagnosed by:   mhorne
    Fixes:          e7a9817b8d32 Hyper-V: vmbus: implementat bus_get_dma_tag in vmbus
    Insta-MFC approved by:  re (delphij) for 14.0-RC4
    Differential Revision:  https://reviews.freebsd.org/D42414

 sys/dev/hyperv/vmbus/vmbus.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
Comment 17 commit-hook freebsd_committer freebsd_triage 2023-11-02 09:12:09 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1969d82fcf62f80c2047a53b42f501680b140b0d

commit 1969d82fcf62f80c2047a53b42f501680b140b0d
Author:     Zhenlei Huang <zlei@FreeBSD.org>
AuthorDate: 2023-11-02 09:07:11 +0000
Commit:     Zhenlei Huang <zlei@FreeBSD.org>
CommitDate: 2023-11-02 09:10:03 +0000

    Hyper-V: vmbus: Add NULL check for vmbus_res

    QEMU emulates Hyper-V [1] but lacks the emulation for vmbus_res, thus no
    coherence information is available. Add NULL check for it and fallback
    to no coherence. This will prevent FreeBSD guests from panic on QEMU
    with the Hyper-V enlightenment hv-synic enabled.

    For real Hyper-V, both gen1 and gen2 have vmbus_res then they are not
    affected by this change.

    1. https://www.qemu.org/docs/master/system/i386/hyperv.html

    PR:             274810
    Reviewed by:    mhorne, emaste, delphij, whu
    Diagnosed by:   mhorne
    Fixes:          e7a9817b8d32 Hyper-V: vmbus: implementat bus_get_dma_tag in vmbus
    Insta-MFC approved by:  re (delphij) for 14.0-RC4
    Differential Revision:  https://reviews.freebsd.org/D42414

    (cherry picked from commit 63bf943d4af17799cef21e2bb78dd28003ce1ce5)

 sys/dev/hyperv/vmbus/vmbus.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
Comment 18 commit-hook freebsd_committer freebsd_triage 2023-11-02 09:17:11 UTC
A commit in branch releng/14.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=52dbe7401fba923bc18124190029e65b491a756e

commit 52dbe7401fba923bc18124190029e65b491a756e
Author:     Zhenlei Huang <zlei@FreeBSD.org>
AuthorDate: 2023-11-02 09:07:11 +0000
Commit:     Zhenlei Huang <zlei@FreeBSD.org>
CommitDate: 2023-11-02 09:13:18 +0000

    Hyper-V: vmbus: Add NULL check for vmbus_res

    QEMU emulates Hyper-V [1] but lacks the emulation for vmbus_res, thus no
    coherence information is available. Add NULL check for it and fallback
    to no coherence. This will prevent FreeBSD guests from panic on QEMU
    with the Hyper-V enlightenment hv-synic enabled.

    For real Hyper-V, both gen1 and gen2 have vmbus_res then they are not
    affected by this change.

    1. https://www.qemu.org/docs/master/system/i386/hyperv.html

    PR:             274810
    Reviewed by:    mhorne, emaste, delphij, whu
    Approved by:    re (gjb)
    Diagnosed by:   mhorne
    Fixes:          e7a9817b8d32 Hyper-V: vmbus: implementat bus_get_dma_tag in vmbus
    Insta-MFC approved by:  re (delphij) for 14.0-RC4
    Differential Revision:  https://reviews.freebsd.org/D42414

    (cherry picked from commit 63bf943d4af17799cef21e2bb78dd28003ce1ce5)
    (cherry picked from commit 1969d82fcf62f80c2047a53b42f501680b140b0d)

 sys/dev/hyperv/vmbus/vmbus.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
Comment 19 Zhenlei Huang freebsd_committer freebsd_triage 2023-11-02 09:33:49 UTC
The fix will be in 14.0-RC4.