Bug 252445 - panics on ESXi triggered by update making devd(8) to load vmci(4) module
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords: crash, regression
Depends on:
Blocks:
 
Reported: 2021-01-05 19:46 UTC by Marek Zarychta
Modified: 2022-10-12 00:50 UTC

See Also:
zarychtam: maintainer-feedback+
zarychtam: mfc-stable13?
zarychtam: mfc-stable12?


Attachments
core.txt (163.20 KB, text/plain)
2021-01-05 19:46 UTC, Marek Zarychta
Untested patch for vmci_qp_guest_endpoints_exit (456 bytes, patch)
2021-01-06 18:07 UTC, Mark Peek
Proposed patch for handling vmci pci errors (4.04 KB, patch)
2021-01-07 21:30 UTC, Mark Peek

Description Marek Zarychta 2021-01-05 19:46:56 UTC
Created attachment 221304 [details]
core.txt

After an upgrade to the most recent 12.2-STABLE, a system running on ESXi panicked with:

[9] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
<3>[9] vmci: Could not map: BAR1
<3>[9] vmci: Failed to map PCI BARs.
<4>[9] vmci: Failed to unsubscribe to event (type=0) with subscriber (ID=0xffffffff).
[9] device_attach: vmci0 attach returned 6
[9] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
<3>[9] vmci: Could not map: BAR1
<3>[9] vmci: Failed to map PCI BARs.
[9] 
[9] 
[9] Fatal trap 12: page fault while in kernel mode
[9] cpuid = 3; apic id = 03
[9] fault virtual address       = 0x410
[9] fault code          = supervisor read data, page not present
[9] instruction pointer = 0x20:0xffffffff80b822a6
[9] stack pointer               = 0x28:0xfffffe0035170600
[9] frame pointer               = 0x28:0xfffffe0035170680
[9] code segment                = base rx0, limit 0xfffff, type 0x1b
[9]                     = DPL 0, pres 1, long 1, def32 0, gran 1
[9] processor eflags    = interrupt enabled, resume, IOPL = 0
[9] current process             = 198 (devctl)
[9] trap number         = 12
[9] panic: page fault

The kernel from 12.2-STABLE r367922, built in November 2020, works fine.
Comment 1 Marek Zarychta 2021-01-05 20:18:41 UTC
After reverting the suspicious commit 6338833c50a7566d006b722c791a6a92071309b8, everything works fine again.

https://cgit.freebsd.org/src/commit/sys/dev/vmware/vmci/vmci.c?h=stable/12&id=6338833c50a7566d006b722c791a6a92071309b8

I am not able to report the revision, but it is still the latest 12.2-STABLE.
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2021-01-05 22:29:42 UTC
^Triage: assign, but notify reviewer of DR as well.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2021-01-06 15:00:21 UTC
Could you please show the backtrace?  Presumably that commit is triggering the problem because it's causing a driver to be loaded when it was not being loaded before, so it's just exposing a driver bug.
Comment 4 Marek Zarychta 2021-01-06 17:07:07 UTC
(In reply to Mark Johnston from comment #3)
Doesn't the attached core.txt include a backtrace? It's the standard postmortem analysis done by the /etc/rc.d/savecore script.

If it's not enough, then please let me know and I will try to examine the core file with lldb.
Comment 5 Mark Peek freebsd_committer freebsd_triage 2021-01-06 18:07:55 UTC
Created attachment 221331 [details]
Untested patch for vmci_qp_guest_endpoints_exit

Based on the core.txt stack trace, try the (untested) attached patch. Note: there may be secondary cleanup issues past this one. Also, ESXi 4.1.0 has been EOL for some time so debugging the actual PCI mapping issue is likely not as useful.
Comment 6 Marek Zarychta 2021-01-06 19:35:17 UTC
(In reply to Mark Peek from comment #5)
Thanks for the patch, but I am not able to build the kernel with it applied:

Building /usr/obj/usr/src/amd64.amd64/sys/VBSD/modules/usr/src/sys/modules/vmware/vmci/vmci_queue_pair.o
/usr/src/sys/dev/vmware/vmci/vmci_queue_pair.c:341:6: error: invalid argument type 'vmci_mutex' (aka 'struct mtx') to unary expression
        if (!qp_guest_endpoints.mutex)
            ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
*** Error code 1

Stop.
make[5]: stopped in /usr/src/sys/modules/vmware/vmci
Comment 7 Vladimir Druzenko freebsd_committer freebsd_triage 2021-01-07 14:19:39 UTC
(In reply to Marek Zarychta from comment #6)
> if (!qp_guest_endpoints.mutex)
Maybe:
if (qp_guest_endpoints.mutex == NULL)
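As the error in comment #6 shows, vmci_mutex is a struct mtx rather than a pointer, which is why the compiler rejects `!qp_guest_endpoints.mutex`; a comparison against NULL would be rejected for the same reason. A minimal userspace sketch of one alternative, tracking initialization explicitly, might look like the following. The names fake_mtx, qp_list, qp_list_init and qp_list_exit are hypothetical stand-ins and not the driver's code; in the kernel, mutex(9) provides mtx_initialized() for this kind of check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct mtx (assumption). */
struct fake_mtx {
	int locked;
};

/* Hypothetical stand-in for the driver's endpoint list (assumption). */
struct qp_list {
	struct fake_mtx mutex;	/* embedded struct, not a pointer */
	bool initialized;	/* explicit "init has run" marker */
};

/* Set up the list; the mutex may only be used after this has run. */
void
qp_list_init(struct qp_list *l)
{
	assert(l != NULL);
	l->mutex.locked = 0;
	l->initialized = true;
}

/*
 * Tear down the list. Returns true if cleanup ran, false if it was
 * skipped because init never happened (for example, attach failed early).
 */
bool
qp_list_exit(struct qp_list *l)
{
	assert(l != NULL);
	if (!l->initialized)
		return (false);
	l->mutex.locked = 0;
	l->initialized = false;
	return (true);
}
```

Calling qp_list_exit() on a list that was never initialized is then a harmless no-op instead of a touch of uninitialized state.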
Comment 8 Mark Peek freebsd_committer freebsd_triage 2021-01-07 21:30:09 UTC
Created attachment 221369 [details]
Proposed patch for handling vmci pci errors

Loaded up stable/12, simulated the error, and worked through the error cases. Give this latest patch a try.
Comment 9 Marek Zarychta 2021-01-07 22:43:38 UTC
(In reply to Mark Peek from comment #8)
Thank you for the patch. It solves the issue; the panics are gone.

Here is the relevant excerpt from dmesg:

[10] VMware memory control driver initialized
[10] intsmb0: <Intel PIIX4 SMBUS Interface> port 0x1040-0x104f at device 7.3 on pci0
[10] intsmb0: intr SMI disabled revision 0
[10] smbus0: <System Management Bus> on intsmb0
[10] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
[10] vmci: Could not map: BAR1
[10] vmci: Failed to map PCI BARs.
[10] vmci: Failed to unsubscribe to event (type=0) with subscriber (ID=0xffffffff).
[10] device_attach: vmci0 attach returned 6
[10] vmci0: <VMware Virtual Machine Communication Interface> port 0x1080-0x10bf irq 16 at device 7.7 on pci0
[10] vmci: Could not map: BAR1
[10] vmci: Failed to map PCI BARs.
[10] vmci: Failed to unsubscribe to event (type=0) with subscriber (ID=0xffffffff).
[10] device_attach: vmci0 attach returned 6
Comment 10 Marek Zarychta 2021-01-11 19:29:29 UTC
So far the patch seems to be a reliable solution. Is there anything that prevents it from being committed to HEAD? Should we expect it to be MFCed to stable/12, or committed directly to that branch, or should we carry it on our own?

Anyway thanks again for providing us with this patch which perfectly solves the issue.
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2021-01-21 15:22:23 UTC
(In reply to Mark Peek from comment #8)
Mark, is there any reason not to commit this?  13.0 is going to be branched in the next day or so, so it'd be nice to get this in.
Comment 12 Guy Helmer freebsd_committer freebsd_triage 2021-02-13 16:49:51 UTC
I encountered this in 12.2-STABLE on ESXi 6.5.0. Patch applied and resolved the issue.
 
FreeBSD xxx@yyy.com 12.2-STABLE FreeBSD 12.2-STABLE #6 r369256M: Fri Feb 12 17:17:35 CST 2021     root@yyy.com:/usr/obj/usr/src/i386.i386/sys/GENERIC  i386
Comment 13 Marek Zarychta 2021-02-13 17:07:20 UTC
If the patch has not been applied, a quick workaround is to clean up after make installkernel:

rm /boot/kernel/vmci.ko

Adding "WITHOUT_MODULES = vmci" to /etc/make.conf doesn't help much.
Comment 14 Joe Marcus Clarke freebsd_committer freebsd_triage 2021-04-10 17:59:17 UTC
Confirmed this patch worked for me on stable/12 on ESXi...so old I don't really want to tell you (hint: it's still called ESX).  Thanks for running this down.
Comment 15 Marek Zarychta 2021-04-10 18:22:14 UTC
I won't be able to help with this anymore, since I moved the affected machine from the EOLed ESXi to bhyve. After the upgrade to stable/13, even more issues arose, affecting both the stability and the reliability of the VM (mostly due to timecounter problems and the PF state table overflowing as a result).

The workaround for this bug is very simple, but the bug persists, so keeping the PR open may still be useful. I leave the decision about closing it to the Committers/Triagers team.
Comment 16 Gleb Popov freebsd_committer freebsd_triage 2021-06-08 14:50:08 UTC
Bumped into this problem after upgrading to 13.0, which is quite unpleasant.

Any progress on this?
Comment 17 Gleb Popov freebsd_committer freebsd_triage 2021-09-16 14:24:41 UTC
Bumping the PR again. What prevents getting it in?
Comment 18 Mark Peek freebsd_committer freebsd_triage 2021-09-16 15:01:21 UTC
(In reply to Gleb Popov from comment #17)

This is my fault. It dropped off my radar. I have an updated patch for -current (really for INVARIANTS turned on) that I need to validate on a new build and then put it out for a quick review. I'll get that prioritized in the next day or two.
Comment 19 Martin Pola 2021-10-02 20:18:53 UTC
I'm having the same issue when upgrading from FreeBSD 12.2 to 13.0 on VMware ESXi 6.0.0. My workaround was:
# mv /boot/kernel/vmci.ko /boot/kernel/vmci.koNOTUSED

Upgrading from FreeBSD 12.2 to 13.0 on VMware ESXi 6.5.0 has not been a problem.
Comment 20 commit-hook freebsd_committer freebsd_triage 2021-10-09 21:30:23 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0f14bcbe384091c729464cb770372aeb79061070

commit 0f14bcbe384091c729464cb770372aeb79061070
Author:     Mark Peek <mp@FreeBSD.org>
AuthorDate: 2021-10-09 21:21:16 +0000
Commit:     Mark Peek <mp@FreeBSD.org>
CommitDate: 2021-10-09 21:21:16 +0000

    vmci: fix panic due to freeing unallocated resources

    Summary:
    An error mapping PCI resources results in a panic due to unallocated
    resources being freed up. This change puts the appropriate checks in
    place to prevent the panic.

    PR:             252445
    Reported by:    Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
    Tested by:      marcus
    MFC after:      1 week
    Sponsored by:   VMware

    Test Plan:
    Along with user testing, also simulated error by inserting a ENXIO
    return in vmci_map_bars().

    Reviewed by:    marcus
    Subscribers:    imp
    Differential Revision: https://reviews.freebsd.org/D32016

 sys/dev/vmware/vmci/vmci.c            |  9 ++++---
 sys/dev/vmware/vmci/vmci_event.c      |  3 +++
 sys/dev/vmware/vmci/vmci_kernel_if.c  | 48 ++++++++++++++++++++++++++++++++++-
 sys/dev/vmware/vmci/vmci_kernel_if.h  |  2 ++
 sys/dev/vmware/vmci/vmci_queue_pair.c |  3 +++
 5 files changed, 61 insertions(+), 4 deletions(-)
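
The failure mode described in the commit message, an attach error path releasing resources that were never allocated, can be illustrated with a small userspace sketch. This is not the committed change: fake_softc, fake_map_bars, fake_attach and fake_detach are hypothetical names, and the forced ENXIO mirrors the simulated failure from the test plan (the "attach returned 6" in the logs is consistent with ENXIO).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-device state; NULL means "never allocated". */
struct fake_softc {
	void *bar0;
	void *bar1;
	int releases;	/* counts resources actually released */
};

/* Map two BARs; fail_bar1 simulates the "Could not map: BAR1" case. */
int
fake_map_bars(struct fake_softc *sc, bool fail_bar1)
{
	sc->bar0 = malloc(16);
	if (sc->bar0 == NULL)
		return (ENOMEM);
	if (fail_bar1)
		return (ENXIO);	/* bar1 stays NULL */
	sc->bar1 = malloc(16);
	if (sc->bar1 == NULL)
		return (ENOMEM);
	return (0);
}

/* Release only what was allocated, so a partial attach is safe. */
void
fake_detach(struct fake_softc *sc)
{
	if (sc->bar1 != NULL) {
		free(sc->bar1);
		sc->bar1 = NULL;
		sc->releases++;
	}
	if (sc->bar0 != NULL) {
		free(sc->bar0);
		sc->bar0 = NULL;
		sc->releases++;
	}
}

/* Attach; on error, unwind through the guarded detach path. */
int
fake_attach(struct fake_softc *sc, bool fail_bar1)
{
	int error;

	assert(sc != NULL);
	sc->bar0 = sc->bar1 = NULL;
	sc->releases = 0;
	error = fake_map_bars(sc, fail_bar1);
	if (error != 0)
		fake_detach(sc);
	return (error);
}
```

With the NULL checks in fake_detach(), a failed attach followed by a second teardown releases each resource at most once, which is the property the unguarded path lacked.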
Comment 21 commit-hook freebsd_committer freebsd_triage 2021-10-16 22:30:43 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4e5c1be4202a141b7a15c505848abcbea535912f

commit 4e5c1be4202a141b7a15c505848abcbea535912f
Author:     Mark Peek <mp@FreeBSD.org>
AuthorDate: 2021-10-09 21:21:16 +0000
Commit:     Mark Peek <mp@FreeBSD.org>
CommitDate: 2021-10-16 18:22:43 +0000

    vmci: fix panic due to freeing unallocated resources

    Summary:
    An error mapping PCI resources results in a panic due to unallocated
    resources being freed up. This change puts the appropriate checks in
    place to prevent the panic.

    PR:             252445
    Reported by:    Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
    Tested by:      marcus
    MFC after:      1 week
    Sponsored by:   VMware

    Test Plan:
    Along with user testing, also simulated error by inserting a ENXIO
    return in vmci_map_bars().

    Reviewed by:    marcus
    Subscribers:    imp
    Differential Revision: https://reviews.freebsd.org/D32016

    (cherry picked from commit 0f14bcbe384091c729464cb770372aeb79061070)

 sys/dev/vmware/vmci/vmci.c            |  9 ++++---
 sys/dev/vmware/vmci/vmci_event.c      |  3 +++
 sys/dev/vmware/vmci/vmci_kernel_if.c  | 48 ++++++++++++++++++++++++++++++++++-
 sys/dev/vmware/vmci/vmci_kernel_if.h  |  2 ++
 sys/dev/vmware/vmci/vmci_queue_pair.c |  3 +++
 5 files changed, 61 insertions(+), 4 deletions(-)
Comment 22 Gleb Popov freebsd_committer freebsd_triage 2021-10-17 09:23:36 UTC
Will this change end up in a security update, so that `freebsd-update upgrade -r 13.0-RELEASE` would work?
Comment 23 commit-hook freebsd_committer freebsd_triage 2021-10-17 15:32:46 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b5d236785dc352a65bc29d97c8a89b40387eb7a0

commit b5d236785dc352a65bc29d97c8a89b40387eb7a0
Author:     Mark Peek <mp@FreeBSD.org>
AuthorDate: 2021-10-09 21:21:16 +0000
Commit:     Mark Peek <mp@FreeBSD.org>
CommitDate: 2021-10-17 15:31:53 +0000

    vmci: fix panic due to freeing unallocated resources

    Summary:
    An error mapping PCI resources results in a panic due to unallocated
    resources being freed up. This change puts the appropriate checks in
    place to prevent the panic.

    PR:             252445
    Reported by:    Marek Zarychta <zarychtam@plan-b.pwste.edu.pl>
    Tested by:      marcus
    MFC after:      1 week
    Sponsored by:   VMware

    Test Plan:
    Along with user testing, also simulated error by inserting a ENXIO
    return in vmci_map_bars().

    Reviewed by:    marcus
    Subscribers:    imp
    Differential Revision: https://reviews.freebsd.org/D32016

    (cherry picked from commit 0f14bcbe384091c729464cb770372aeb79061070)

 sys/dev/vmware/vmci/vmci.c            |  9 ++++---
 sys/dev/vmware/vmci/vmci_event.c      |  3 +++
 sys/dev/vmware/vmci/vmci_kernel_if.c  | 48 ++++++++++++++++++++++++++++++++++-
 sys/dev/vmware/vmci/vmci_kernel_if.h  |  2 ++
 sys/dev/vmware/vmci/vmci_queue_pair.c |  3 +++
 5 files changed, 61 insertions(+), 4 deletions(-)
Comment 24 Mark Johnston freebsd_committer freebsd_triage 2021-10-19 16:26:41 UTC
(In reply to Gleb Popov from comment #22)
Could you drop a note to secteam@ requesting this?  Ideally a copy of https://www.freebsd.org/security/errata-template.txt would be filled out with some details of the problem but a pointer to this PR is probably sufficient.
Comment 25 Marek Zarychta 2021-11-04 20:03:09 UTC
Finally, it even found its way to errata notices! Thank you again for fixing this nasty bug. I am closing as completely resolved.