Bug 279410 - gve(4) initialization panics kernel on arm64
Summary: gve(4) initialization panics kernel on arm64
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: Unspecified
Hardware: arm64 Any
Importance: --- Affects Only Me
Assignee: Xin LI
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-30 20:11 UTC by Li-Wen Hsu
Modified: 2024-09-30 05:29 UTC
CC List: 7 users

Description Li-Wen Hsu freebsd_committer freebsd_triage 2024-05-30 20:11:46 UTC
Boot log from an arm64 instance on GCE:

Autoloading module: if_gve
Autoloading module: virtio_random
gve0: <gVNIC> mem 0x10203000-0x10203fff,0x10202000-0x1020203f,0x10100000-0x101fffff at device 0.0 on pci0
gve0: Unrecognized device option 0x5 not enabled.
gve0: Unrecognized device option 0x6 not enabled.
gve0: Failed to acquire any msix vectors
gve0: No irq table, nothing to free
gve0: No irq table, nothing to free
gve0: No irq table, nothing to free
Fatal data abort:
  x0: 0xffffa000ea784ba0
  x1: 0xffffa000ea784ba0
  x2: 0x000000000000000a
  x3: 0x000000000000000a
  x4: 0xffff00000088eff4
  x5: 0x0000000000000041
  x6: 0xffff00000052cfdc
  x7: 0xffff0000aca301d0
  x8: 0x0000000000000000
  x9: 0x0000000000000000
 x10: 0x0000000000000001
 x11: 0xfefefefefefefeff
 x12: 0xffff00000a656572
 x13: 0x0000feff01000001
 x14: 0x0000000000000000
 x15: 0x0000000000000002
 x16: 0xffff0000ade2cdc0
 x17: 0xffff00000051bb70
 x18: 0xffff0000aca30330
 x19: 0xffffa000ea510018
 x20: 0x0000000000000000
 x21: 0x0000000000000006
 x22: 0x0000000080040003
 x23: 0xffff000000a22944
 x24: 0xffff000000a899c0
 x25: 0xffff000000a08e29
 x26: 0xffffa000e7a49470
 x27: 0x000000006097de09
 x28: 0x0000000000000000
 x29: 0xffff0000aca30330
  sp: 0xffff0000aca30330
  lr: 0xffff0000ade176fc
 elr: 0x0000000000000000
spsr: 0x0000000060400045
 far: 0x0000000000000000
 esr: 0x0000000086000004
panic: vm_fault failed: 0x0 error 1
cpuid = 0
time = 1716533241
KDB: stack backtrace:
#0 0xffff000000525e30 at kdb_backtrace+0x58
#1 0xffff0000004d0d4c at vpanic+0x198
#2 0xffff0000004d0bb0 at panic+0x44
#3 0xffff0000008b795c at data_abort+0x2cc
#4 0xffff000000893814 at handle_el1h_sync+0x14
Uptime: 23s
Comment 1 Kyle Evans freebsd_committer freebsd_triage 2024-05-30 20:13:53 UTC
The panic is a red herring; it's just a bad error path that we should fix.  The real problem is back here:

gve0: Failed to acquire any msix vectors

MSI-X interrupts in virtio_pci seem to be borked completely for whatever reason, and gve(4) can't cope with that like, e.g., nvme(4) can.
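
For illustration only, one way a driver might cope with a failed MSI-X allocation is to try interrupt types in a preferred order and return a clean error when none is available. This is a Python sketch, not driver code: `choose_interrupt_mode` and the allocator callables are hypothetical names, and the MSI-X, MSI, INTx ordering is an assumed fallback policy rather than a description of what gve(4) or nvme(4) actually implement.

```python
def choose_interrupt_mode(allocators):
    """Try each (name, allocate) pair in order and return the first success.

    `allocate` is a hypothetical callable that returns the number of
    vectors granted, or 0 on failure. Returns (name, count), or (None, 0)
    when every allocator fails so the caller can bail out cleanly.
    """
    for name, allocate in allocators:
        count = allocate()
        if count > 0:
            return name, count
    return None, 0


# Simulated platform where MSI-X and MSI are unavailable (assumed behaviour).
allocators = [
    ("msix", lambda: 0),
    ("msi", lambda: 0),
    ("intx", lambda: 1),
]
mode, vectors = choose_interrupt_mode(allocators)
print(mode, vectors)
```

The important part for this bug is the `(None, 0)` case: when nothing can be allocated, the attach path has to unwind without touching state that was never set up.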
Comment 2 Andrew Turner freebsd_committer freebsd_triage 2024-05-31 12:32:06 UTC
The panic looks like it is caused by the driver calling a NULL function pointer. To track that down, it would be useful to know which function the address in the lr register belongs to.

Can you get the full FreeBSD boot log?
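
The register dump is consistent with a jump through a NULL pointer: elr and far are both 0, and esr decodes to an instruction abort. Below is a small sketch of that decoding, assuming the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, fault status code in the low six bits of the ISS). The helper names are made up for this example, and only a handful of EC and fault status values are covered.

```python
# Subset of exception classes (EC) relevant to aborts.
EC_NAMES = {
    0x20: "instruction abort from a lower exception level",
    0x21: "instruction abort without a change in exception level",
    0x24: "data abort from a lower exception level",
    0x25: "data abort without a change in exception level",
}

# Upper four bits of the fault status code; the low two bits are the level.
FAULT_KINDS = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split an ESR_ELx value into the fields used here."""
    ec = (esr >> 26) & 0x3F   # exception class
    il = (esr >> 25) & 0x1    # instruction length bit
    fsc = esr & 0x3F          # fault status code (low bits of the ISS)
    kind = FAULT_KINDS.get(fsc >> 2, "other")
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": il,
        "fsc": fsc,
        "fault": f"{kind}, level {fsc & 0x3}",
    }


info = decode_esr(0x86000004)  # esr value from the panic above
print(f"ec={info['ec']:#04x} ({info['ec_name']}); {info['fault']}")
```

For the esr value in this report, that gives EC 0x21 (an instruction abort taken without a change in exception level) and a level 0 translation fault, which fits a branch to address 0.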
Comment 3 Ilya Bakulin freebsd_committer freebsd_triage 2024-05-31 21:43:56 UTC
Hi Andrew, sure! The full boot log:

UEFI firmware (version  built at 15:54:05 on Apr  2 2024)
EMU Variable FVB Started
EMU Variable invalid PCD sizes
Found PL031 RTC @ 0x9010000
InitializeRealTimeClock: using default timezone/daylight settings
BdsDxe: loading Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)
BdsDxe: starting Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)

UEFI: Attempting to start image.
Description: UEFI Misc Device
FilePath: PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)
OptionNumber: 1.

Consoles: EFI console
    Reading loader env vars from /efi/freebsd/loader.env
Setting currdev to disk0p1:
FreeBSD/arm64 EFI loader, Revision 1.1

   Command line arguments: loader.efi
   Image base: 0x13c4c0000
   EFI version: 2.70
   EFI Firmware: EDK II (rev 1.00)
   Console: efi (0x1000)
   Load Path: \EFI\BOOT\BOOTAA64.EFI
   Load Device: PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)/HD(1,GPT,FE6C0EE6-1911-11EF-8459-0CC47ADA5F32,0x22,0x10418)
   BootCurrent: 0001
   BootOrder: 0001[*] 0000
   BootInfo Path: PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)
Ignoring Boot0001: Only one DP found
Trying ESP: PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)/HD(1,GPT,FE6C0EE6-1911-11EF-8459-0CC47ADA5F32,0x22,0x10418)
Setting currdev to disk0p1:
Trying: PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)/HD(2,GPT,FE6C0EED-1911-11EF-8459-0CC47ADA5F32,0x1043A,0x200000)
Setting currdev to disk0p2:
Trying: PciRoot(0x0)/Pci(0x2,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)/HD(3,GPT,FE6C0EF0-1911-11EF-8459-0CC47ADA5F32,0x21043A,0x800000)
Setting currdev to zfs:zroot/ROOT/default:
Loading /boot/defaults/loader.conf
Loading /boot/defaults/loader.conf
Loading /boot/device.hints
Loading /boot/loader.conf
console vidconsole is unavailable
console vidconsole is unavailable
Loading /boot/loader.conf.local
Loading kernel...
/boot/kernel/kernel text=0x2a8 text=0x9db150 text=0x261054 data=0x150cb8 data=0x0+0x2bc000 0x8+0x1516b0+0x8+0x17a5c2
Loading configured modules...
can't find '/boot/entropy'
can't find 'aesni'
/boot/kernel/zfs.ko text=0xacd30 text=0x207b90 data=0x2ce30+0xaabe4 0x8+0x34db8+0x8+0x2e521
can't find '/etc/hostid'

Booting [/boot/kernel/kernel]...

No valid device tree blob found!

WARNING! Trying to fire up the kernel, but no device tree blob found!

---<<BOOT>>---
Copyright (c) 1992-2023 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 14.1-BETA1 releng/14.1-n267604-25c2d762af7a GENERIC arm64
FreeBSD clang version 18.1.4 (https://github.com/llvm/llvm-project.git llvmorg-18.1.4-0-ge6c3289804a6)
SRAT: Ignoring memory at addr 0x0
VT: init without driver.
module scmi already present!
real memory  = 4294967296 (4096 MB)
avail memory = 4154679296 (3962 MB)
FreeBSD/SMP: Multiprocessor System Detected: 1 CPUs
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and the knob 'bypass_before_seeding' was enabled.
random: entropy device external interface
kbd0 at kbdmux0
acpi0: <Google GOOGFACP>
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
acpi0: Could not update all GPEs: AE_NOT_CONFIGURED
psci0: <ARM Power State Co-ordination Interface Driver> on acpi0
gic0: <ARM Generic Interrupt Controller v3.0> iomem 0x8000000-0x800ffff,0x80a0000-0x8ffffff on acpi0
its0: <ARM GIC Interrupt Translation Service> mem 0x8080000-0x809ffff on gic0
generic_timer0: <ARM Generic Timer> irq 5,6,7 on acpi0
Timecounter "ARM MPCore Timecounter" frequency 25000000 Hz quality 1000
Event timer "ARM MPCore Eventtimer" frequency 25000000 Hz quality 1000
efirtc0: <EFI Realtime Clock>
efirtc0: registered as a time-of-day clock, resolution 1.000000s
pmu0: <Performance Monitoring Unit> on acpi0
cpu0: <ACPI CPU> on acpi0
uart0: <PrimeCell UART (PL011)> iomem 0x9000000-0x9000fff irq 0 on acpi0
uart0: console (9600,n,8,1)
uart1: <PrimeCell UART (PL011)> iomem 0x9001000-0x9001fff irq 1 on acpi0
uart2: <PrimeCell UART (PL011)> iomem 0x9002000-0x9002fff irq 2 on acpi0
uart3: <PrimeCell UART (PL011)> iomem 0x9003000-0x9003fff irq 3 on acpi0
acpi_ged0: <Generic Event Device> irq 4 on acpi0
acpi_ged0: Raw IRQ 50
acpi_button0: <Power Button> on acpi0
acpi_button1: <Sleep Button> on acpi0
pcib0: <Generic PCI host controller> on acpi0
pci0: <PCI bus> on pcib0
pci0: <network, ethernet> at device 0.0 (no driver attached)
virtio_pci0: <VirtIO PCI (legacy) Entropy adapter> mem 0x10201000-0x1020103f at device 1.0 on pci0
nvme0: <Generic NVMe Device> mem 0x10000000-0x10003fff,0x10200000-0x1020003f at device 2.0 on pci0
nvme0: unable to allocate MSI-X
armv8crypto0: <AES-CBC,AES-XTS,AES-GCM>
Timecounters tick every 1.000 msec
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
usb_needs_explore_all: no devclass
nvme0: temperature threshold not supported
CPU  0: ARM Neoverse-N1 r3p1 affinity:  0
                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG,IDC>
 Instruction Set Attributes 0 = <DP,RDM,Atomic,CRC32,SHA2,SHA1,AES+PMULL>
 Instruction Set Attributes 1 = <RCPC-8.3,DCPoP>
 Instruction Set Attributes 2 = <>
         Processor Features 0 = <CSV3,CSV2,GIC,AdvSIMD+HP,FP+HP,EL3,EL2,EL1,EL0>
         Processor Features 1 = <PSTATE.SSBS MSR>
      Memory Model Features 0 = <TGran4,SNSMem,16bit ASID,256TB PA>
      Memory Model Features 1 = <PAN+ATS1E1,LO,HPD,16bit VMID,HAF+DS>
      Memory Model Features 2 = <32bit CCIDX,48bit VA,UAO,CnP>
             Debug Features 0 = <MTPMU res0,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,Debugv8>
             Debug Features 1 = <>
         Auxiliary Features 0 = <>
         Auxiliary Features 1 = <>
AArch32 Instruction Set Attributes 5 = <>
AArch32 Media and VFP Features 0 = <>
AArch32 Media and VFP Features 1 = <>
TCP_ratelimit: Is now initialized
Trying to mount root from zfs:zroot/ROOT/default []...
nda0 at nvme0 bus 0 scbus0 target 0 lun 1
nda0: <nvme_card-pd 2 nvme_card-pd>
nda0: Serial Number nvme_card-pd
nda0: nvme version 1.0
nda0: 10240MB (20971520 512 byte sectors)
GEOM: nda0: the secondary GPT header is not in the last LBA.
Setting hostuuid: 8c7b2107-d74c-a0ba-4e31-5747c0e41d46.
Setting hostid: 0x51ecc71c.
This system supports ZFS pool feature flags.

Enabled the following features on 'zroot':
  async_destroy
  empty_bpobj
  lz4_compress
  multi_vdev_crash_dump
  spacemap_histogram
  enabled_txg
  hole_birth
  extensible_dataset
  embedded_data
  bookmarks
  filesystem_limits
  large_blocks
  large_dnode
  sha512
  skein
  edonr
  userobj_accounting
  encryption
  project_quota
  device_removal
  obsolete_counts
  zpool_checkpoint
  spacemap_v2
  allocation_classes
  resilver_defer
  bookmark_v2
  redaction_bookmarks
  redacted_datasets
  bookmark_written
  log_spacemap
  livelist
  device_rebuild
  zstd_compress
  draid
  zilsaxattr
  head_errlog
  blake3
  block_cloning
  vdev_zaps_v2

Pool 'zroot' has the bootfs property set, you might need to update
the boot code. See gptzfsboot(8) and loader.efi(8) for details.
Starting file system checks:
/dev/gpt/efiesp: FILESYSTEM CLEAN; SKIPPING CHECKS
Growing root partition to fill device
nda0 recovered
random: randomdev_wait_until_seeded unblock wait
random: randomdev_wait_until_seeded unblock wait
random: unblocking device.
nda0p3 resized
Mounting local filesystems:.
Autoloading module: if_gve
Autoloading module: virtio_random
gve0: <gVNIC> mem 0x10203000-0x10203fff,0x10202000-0x1020203f,0x10100000-0x101fffff at device 0.0 on pci0
gve0: Unrecognized device option 0x5 not enabled.
gve0: Unrecognized device option 0x6 not enabled.
gve0: Failed to acquire any msix vectors
gve0: No irq table, nothing to free
gve0: No irq table, nothing to free
gve0: No irq table, nothing to free
Fatal data abort:
  x0: 0xffffa000ea784ba0
  x1: 0xffffa000ea784ba0
  x2: 0x000000000000000a
  x3: 0x000000000000000a
  x4: 0xffff00000088eff4
  x5: 0x0000000000000041
  x6: 0xffff00000052cfdc
  x7: 0xffff0000aca301d0
  x8: 0x0000000000000000
  x9: 0x0000000000000000
 x10: 0x0000000000000001
 x11: 0xfefefefefefefeff
 x12: 0xffff00000a656572
 x13: 0x0000feff01000001
 x14: 0x0000000000000000
 x15: 0x0000000000000002
 x16: 0xffff0000ade2cdc0
 x17: 0xffff00000051bb70
 x18: 0xffff0000aca30330
 x19: 0xffffa000ea510018
 x20: 0x0000000000000000
 x21: 0x0000000000000006
 x22: 0x0000000080040003
 x23: 0xffff000000a22944
 x24: 0xffff000000a899c0
 x25: 0xffff000000a08e29
 x26: 0xffffa000e7a49470
 x27: 0x000000006097de09
 x28: 0x0000000000000000
 x29: 0xffff0000aca30330
  sp: 0xffff0000aca30330
  lr: 0xffff0000ade176fc
 elr: 0x0000000000000000
spsr: 0x0000000060400045
 far: 0x0000000000000000
 esr: 0x0000000086000004
panic: vm_fault failed: 0x0 error 1
cpuid = 0
time = 1716533241
KDB: stack backtrace:
#0 0xffff000000525e30 at kdb_backtrace+0x58
#1 0xffff0000004d0d4c at vpanic+0x198
#2 0xffff0000004d0bb0 at panic+0x44
#3 0xffff0000008b795c at data_abort+0x2cc
#4 0xffff000000893814 at handle_el1h_sync+0x14
Uptime: 23s
Dumping 247 out of 4064 MB:..2%..12%..22%..31%..41%..51%..62%..72%..81%..91%
Dump complete
Automatic reboot in 15 seconds - press a key on the console to abort



================================================
The original boot log and the reproduction steps are also kept here: https://docs.google.com/document/d/1iAVx83Hhb7jS9Q1goxZg8TflwLm2ZE2yagMdEalUkqw/edit
Comment 4 shailend 2024-06-04 20:54:55 UTC
https://reviews.freebsd.org/D45489 fixes the error path panic. 

As for the inability to allocate msix vectors, I wonder if `hw.pci.honor_msi_blacklist` is relevant.
Comment 5 Li-Wen Hsu freebsd_committer freebsd_triage 2024-06-04 21:31:57 UTC
(In reply to shailend from comment #4)
Thanks for the patch!

I've built a new image based on 14.1-R
https://people.freebsd.org/~lwhsu/tmp/gce/gce.raw.zst

Ilya, can you help test it?
Comment 6 Xin LI freebsd_committer freebsd_triage 2024-06-05 15:59:05 UTC
(In reply to Li-Wen Hsu from comment #5)
Thanks!  Could you please also create an image with the patch plus:

hw.pci.honor_msi_blacklist="0" 

in /boot/loader.conf?
Comment 7 commit-hook freebsd_committer freebsd_triage 2024-06-18 06:09:51 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b81cbb12410b000074483899e61e9e767ba3ec1d

commit b81cbb12410b000074483899e61e9e767ba3ec1d
Author:     Shailend Chand <shailend@google.com>
AuthorDate: 2024-06-05 05:31:46 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-06-18 06:08:31 +0000

    gve: Make gve_free_qpls idempotent

    This fixes a panic caused by double free.

    PR:     kern/279410
    MFC after:      3 days
    Differential Revision: https://reviews.freebsd.org/D45489

 sys/dev/gve/gve_qpl.c | 1 +
 1 file changed, 1 insertion(+)
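
The one-line diff itself is not reproduced in this report. As a generic illustration of what making a free routine idempotent means (not the actual change to gve_qpl.c), a common pattern is to clear the owning reference after releasing it, so that a second call from an error path becomes a no-op. The class and method names below are hypothetical, and Python stands in for the driver's C.

```python
class Device:
    """Hypothetical stand-in for a driver's private state."""

    def __init__(self, num_qpls):
        self.qpls = [f"qpl{i}" for i in range(num_qpls)]
        self.released = []  # records every release, to make double frees visible

    def free_qpls(self):
        # Already freed (or never allocated): nothing to do.
        if self.qpls is None:
            return
        for qpl in self.qpls:
            self.released.append(qpl)
        # Clearing the reference is what makes a repeated call harmless.
        self.qpls = None


dev = Device(3)
dev.free_qpls()
dev.free_qpls()  # second call, e.g. from a failed-attach cleanup path
print(len(dev.released))
```

Without the final assignment, the second call would walk the same list again and release each entry twice, which is the double free the commit message describes.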
Comment 8 Andrew Turner freebsd_committer freebsd_triage 2024-06-18 13:11:00 UTC
Can someone get a verbose boot log? pci_alloc_msix can fail in a few places, and a verbose log would narrow down which one.
Comment 9 commit-hook freebsd_committer freebsd_triage 2024-06-21 05:45:57 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=224e20ceb1212579887397b67c43b42d41108c62

commit 224e20ceb1212579887397b67c43b42d41108c62
Author:     Shailend Chand <shailend@google.com>
AuthorDate: 2024-06-05 05:31:46 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-06-21 05:44:34 +0000

    gve: Make gve_free_qpls idempotent

    This fixes a panic caused by double free.

    PR:     kern/279410
    Differential Revision: https://reviews.freebsd.org/D45489

    (cherry picked from commit b81cbb12410b000074483899e61e9e767ba3ec1d)

 sys/dev/gve/gve_qpl.c | 1 +
 1 file changed, 1 insertion(+)
Comment 10 commit-hook freebsd_committer freebsd_triage 2024-06-21 05:46:59 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=14454f417201a6c1075768c1a571b22c6d4c57d2

commit 14454f417201a6c1075768c1a571b22c6d4c57d2
Author:     Shailend Chand <shailend@google.com>
AuthorDate: 2024-06-05 05:31:46 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-06-21 05:45:58 +0000

    gve: Make gve_free_qpls idempotent

    This fixes a panic caused by double free.

    PR:     kern/279410
    Differential Revision: https://reviews.freebsd.org/D45489

    (cherry picked from commit b81cbb12410b000074483899e61e9e767ba3ec1d)

 sys/dev/gve/gve_qpl.c | 1 +
 1 file changed, 1 insertion(+)
Comment 11 Mark Linimon freebsd_committer freebsd_triage 2024-09-30 05:29:32 UTC
^Triage: committed and MFCed.