Bug 252445 - panics on ESXi triggered by update making devd(8) to load vmci(4) module
Summary: panics on ESXi triggered by update making devd(8) to load vmci(4) module
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords: panic, regression
Depends on:
Blocks:
 
Reported: 2021-01-05 19:46 UTC by Marek Zarychta
Modified: 2021-04-10 18:22 UTC
CC List: 7 users

See Also:


Attachments
core.txt (163.20 KB, text/plain)
2021-01-05 19:46 UTC, Marek Zarychta
Untested patch for vmci_qp_guest_endpoints_exit (456 bytes, patch)
2021-01-06 18:07 UTC, Mark Peek
Proposed patch for handling vmci pci errors (4.04 KB, patch)
2021-01-07 21:30 UTC, Mark Peek

Description Marek Zarychta 2021-01-05 19:46:56 UTC
Created attachment 221304 [details]
core.txt

After an upgrade to the most recent 12.2-STABLE, a system installed on ESXi panicked with:

[9] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
<3>[9] vmci: Could not map: BAR1
<3>[9] vmci: Failed to map PCI BARs.
<4>[9] vmci: Failed to unsubscribe to event (type=0) with subscriber (ID=0xffffffff).
[9] device_attach: vmci0 attach returned 6
[9] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
<3>[9] vmci: Could not map: BAR1
<3>[9] vmci: Failed to map PCI BARs.
[9] 
[9] 
[9] Fatal trap 12: page fault while in kernel mode
[9] cpuid = 3; apic id = 03
[9] fault virtual address       = 0x410
[9] fault code          = supervisor read data, page not present
[9] instruction pointer = 0x20:0xffffffff80b822a6
[9] stack pointer               = 0x28:0xfffffe0035170600
[9] frame pointer               = 0x28:0xfffffe0035170680
[9] code segment                = base rx0, limit 0xfffff, type 0x1b
[9]                     = DPL 0, pres 1, long 1, def32 0, gran 1
[9] processor eflags    = interrupt enabled, resume, IOPL = 0
[9] current process             = 198 (devctl)
[9] trap number         = 12
[9] panic: page fault

The kernel from 12.2-STABLE r367922, built in November 2020, works fine.
Comment 1 Marek Zarychta 2021-01-05 20:18:41 UTC
After reverting the suspicious commit 6338833c50a7566d006b722c791a6a92071309b8, everything works fine again.

https://cgit.freebsd.org/src/commit/sys/dev/vmware/vmci/vmci.c?h=stable/12&id=6338833c50a7566d006b722c791a6a92071309b8

I am not able to report the exact revision, but it is still the latest 12.2-STABLE.
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2021-01-05 22:29:42 UTC
^Triage: assign, but notify reviewer of DR as well.
Comment 3 Mark Johnston freebsd_committer 2021-01-06 15:00:21 UTC
Could you please show the backtrace?  Presumably that commit is triggering the problem because it's causing a driver to be loaded when it was not being loaded before, so it's just exposing a driver bug.
Comment 4 Marek Zarychta 2021-01-06 17:07:07 UTC
(In reply to Mark Johnston from comment #3)
Doesn't the attached core.txt include a backtrace? It's the standard postmortem analysis done by the /etc/rc.d/savecore script.

If it's not enough, then please let me know and I will try to examine the core file with lldb.
Comment 5 Mark Peek freebsd_committer 2021-01-06 18:07:55 UTC
Created attachment 221331 [details]
Untested patch for vmci_qp_guest_endpoints_exit

Based on the core.txt stack trace, try the (untested) attached patch. Note: there may be secondary cleanup issues past this one. Also, ESXi 4.1.0 has been EOL for some time so debugging the actual PCI mapping issue is likely not as useful.
Comment 6 Marek Zarychta 2021-01-06 19:35:17 UTC
(In reply to Mark Peek from comment #5)
Thanks for the patch, but I am not able to build the kernel with it applied:

Building /usr/obj/usr/src/amd64.amd64/sys/VBSD/modules/usr/src/sys/modules/vmware/vmci/vmci_queue_pair.o
/usr/src/sys/dev/vmware/vmci/vmci_queue_pair.c:341:6: error: invalid argument type 'vmci_mutex' (aka 'struct mtx') to unary expression
        if (!qp_guest_endpoints.mutex)
            ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
*** Error code 1

Stop.
make[5]: stopped in /usr/src/sys/modules/vmware/vmci
Comment 7 VVD 2021-01-07 14:19:39 UTC
(In reply to Marek Zarychta from comment #6)
> if (!qp_guest_endpoints.mutex)
Maybe:
if (qp_guest_endpoints.mutex == NULL)
Comment 8 Mark Peek freebsd_committer 2021-01-07 21:30:09 UTC
Created attachment 221369 [details]
Proposed patch for handling vmci pci errors

Loaded up stable/12, simulated the error, and worked through the error cases. Give this latest patch a try.
Comment 9 Marek Zarychta 2021-01-07 22:43:38 UTC
(In reply to Mark Peek from comment #8)
Thank you for the patch. It solves the issue; the panics are gone.

Here is the relevant excerpt of the dmesg:

[10] VMware memory control driver initialized
[10] intsmb0: <Intel PIIX4 SMBUS Interface> port 0x1040-0x104f at device 7.3 on pci0
[10] intsmb0: intr SMI disabled revision 0
[10] smbus0: <System Management Bus> on intsmb0
[10] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
[10] vmci: Could not map: BAR1
[10] vmci: Failed to map PCI BARs.
[10] vmci: Failed to unsubscribe to event (type=0) with subscriber (ID=0xffffffff).
[10] device_attach: vmci0 attach returned 6
[10] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
[10] vmci: Could not map: BAR1
[10] vmci: Failed to map PCI BARs.
[10] vmci: Failed to unsubscribe to event (type=0) with subscriber (ID=0xffffffff).
[10] device_attach: vmci0 attach returned 6
Comment 10 Marek Zarychta 2021-01-11 19:29:29 UTC
So far the patch seems to be a reliable solution. Is there anything that prevents it from being committed to HEAD? Should we expect it to be MFCed to stable/12, committed directly to that branch, or should we cope with it on our own?

Anyway, thanks again for providing this patch, which solves the issue perfectly.
Comment 11 Mark Johnston freebsd_committer 2021-01-21 15:22:23 UTC
(In reply to Mark Peek from comment #8)
Mark, is there any reason not to commit this?  13.0 is going to be branched in the next day or so, so it'd be nice to get this in.
Comment 12 Guy Helmer freebsd_committer 2021-02-13 16:49:51 UTC
I encountered this in 12.2-STABLE on ESXi 6.5.0. Applying the patch resolved the issue.
 
FreeBSD xxx@yyy.com 12.2-STABLE FreeBSD 12.2-STABLE #6 r369256M: Fri Feb 12 17:17:35 CST 2021     root@yyy.com:/usr/obj/usr/src/i386.i386/sys/GENERIC  i386
Comment 13 Marek Zarychta 2021-02-13 17:07:20 UTC
If the patch is not applied, a quick workaround is to do some cleanup after make installkernel:

rm /boot/kernel/vmci.ko

Adding "WITHOUT_MODULES = vmci" to /etc/make.conf doesn't help much.
Comment 14 Joe Marcus Clarke freebsd_committer 2021-04-10 17:59:17 UTC
Confirmed this patch worked for me on stable/12 on ESXi...so old I don't really want to tell you (hint: it's still called ESX).  Thanks for running this down.
Comment 15 Marek Zarychta 2021-04-10 18:22:14 UTC
I won't be able to help with this case anymore, since I moved the affected machine from EOLed ESXi to bhyve. After the upgrade to stable/13, even more issues arose, affecting both the stability and the reliability of the VM (mostly due to problems with timecounter keeping and the PF state table overflowing as a result).

The workaround for this bug is very simple, but the bug persists, so keeping the PR open might still be useful. I leave the decision about closing it to the committers/triagers team.