Bug 253272 - Page fault in _mca_init during boot
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: Mark Johnston
Reported: 2021-02-05 16:23 UTC by Alan Somers
Modified: 2021-02-16 17:08 UTC

Attachments
Unconditionally allocate the cmci memory (502 bytes, patch)
2021-02-05 16:27 UTC, Alan Somers

Description Alan Somers freebsd_committer freebsd_triage 2021-02-05 16:23:11 UTC
I saw the following panic during boot on a system running something close to 12.2-RELEASE. It doesn't happen every time. However, I suspect I've hit the same bug a few other times without noticing, because the kernel normally reboots immediately, since swap is not yet configured at this point in boot.

Fatal trap 12: page fault while in kernel mode
cpuid = 26; apic id = 34
fault virtual address = 0xd0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8125a009
stack pointer = 0x28:0xfffffe0000b65f20
frame pointer = 0x28:0xfffffe0000b65f50
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (idle: cpu26)
trap number = 12
panic: page fault
cpuid = 26
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000b65be0
vpanic() at vpanic+0x17b/frame 0xfffffe0000b65c30
panic() at panic+0x43/frame 0xfffffe0000b65c90
trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000b65cf0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000b65d40
trap() at trap+0x286/frame 0xfffffe0000b65e50
calltrap() at calltrap+0x8/frame 0xfffffe0000b65e50
--- trap 0xc, rip = 0xffffffff8125a009, rsp = 0xfffffe0000b65f20, rbp = 0xfffffe0000b65f50 ---
_mca_init() at _mca_init+0x5d9/frame 0xfffffe0000b65f50
init_secondary_tail() at init_secondary_tail+0xfd/frame 0xfffffe0000b65f80
init_secondary() at init_secondary+0x2d1/frame 0xfffffe0000b65ff0
KDB: enter: panic
[ thread pid 11 tid 100029 ]
Stopped at kdb_enter+0x37: movq $0,0x12bc1f6(%rip)

The panic occurs because only one of my two CPUs reports support for the MCG_CMCI_P bit. Which CPU the kernel queries for support during boot is random. If it queries the one that lacks support, it doesn't allocate memory for cmc_state, but it later calls cmci_setup() for the CPU that does report that bit. The following command shows the asymmetry between the CPUs:

$ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
16 MSR 0x179: 0x00000000 0x0f000c14
16 MSR 0x179: 0x00000000 0x0f000814
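
For readers comparing the two MSR values above, here is a minimal C sketch of how the low word of IA32_MCG_CAP (MSR 0x179) can be decoded. It assumes the standard layout in which bits 7:0 hold the bank count and bit 10 is the CMCI-present flag. The helper names are illustrative and are not taken from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed IA32_MCG_CAP (MSR 0x179) field layout. */
#define MCG_CAP_COUNT_MASK 0x000000ffULL /* bits 7:0: number of MCA banks */
#define MCG_CAP_CMCI_P     0x00000400ULL /* bit 10: CMCI is supported */

/* Hypothetical helper: number of machine-check banks the CPU reports. */
static inline unsigned
mcg_cap_bank_count(uint64_t cap)
{
	return (unsigned)(cap & MCG_CAP_COUNT_MASK);
}

/* Hypothetical helper: true if the CPU advertises CMCI support. */
static inline bool
mcg_cap_has_cmci(uint64_t cap)
{
	return (cap & MCG_CAP_CMCI_P) != 0;
}
```

Applied to the output above, 0x0f000c14 has bit 10 set while 0x0f000814 does not, and both values report 0x14 (20) banks.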
Comment 1 Alan Somers freebsd_committer freebsd_triage 2021-02-05 16:27:29 UTC
Created attachment 222184
Unconditionally allocate the cmci memory

This patch from kib@FreeBSD.org attempts to fix the problem by unconditionally allocating memory for cmc_state, regardless of the MCG_CAP_CMCI_P bit.
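
As a rough illustration of the idea only (this is not the attached patch), the following userspace sketch allocates per-CPU, per-bank state without consulting the capability bit at all. The struct layout, the function names, and the use of calloc() in place of the kernel allocator are all assumptions made for this example.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's per-bank CMCI bookkeeping. */
struct cmc_state_sketch {
	int max_threshold;
	long last_intr;
};

/* Release the rows allocated by alloc_cmc_state_sketch(). */
static void
free_cmc_state_sketch(struct cmc_state_sketch **rows, size_t ncpus)
{
	if (rows == NULL)
		return;
	for (size_t i = 0; i < ncpus; i++)
		free(rows[i]);
	free(rows);
}

/*
 * Allocate zeroed state for every CPU and bank, regardless of whether the
 * CPU running this code advertises CMCI. Returns NULL on failure.
 */
static struct cmc_state_sketch **
alloc_cmc_state_sketch(size_t ncpus, size_t nbanks)
{
	struct cmc_state_sketch **rows = calloc(ncpus, sizeof(*rows));

	if (rows == NULL)
		return NULL;
	for (size_t i = 0; i < ncpus; i++) {
		rows[i] = calloc(nbanks, sizeof(**rows));
		if (rows[i] == NULL) {
			/* Unallocated rows are NULL, so a full free is safe. */
			free_cmc_state_sketch(rows, ncpus);
			return NULL;
		}
	}
	return rows;
}
```

The trade-off is a small amount of memory that is never used on machines where no CPU supports CMCI.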
Comment 2 Alan Somers freebsd_committer freebsd_triage 2021-02-07 21:31:42 UTC
I updated the BIOS from version 5.12, aka 2/24/2018 Rev 2.0b, to 5.14, aka 10/30/2020 Rev 3.4.  That fixed the problem.  Now all CPUs show the MCG_CMCI_P bit disabled.

$ for i in `seq 0 31`; do sudo cpucontrol -m 0x179 /dev/cpuctl${i}; done | uniq -c
  32 MSR 0x179: 0x00000000 0x0f000814
Comment 3 commit-hook freebsd_committer freebsd_triage 2021-02-08 19:47:06 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b5770470276268acef21368b3e77a325df883500

commit b5770470276268acef21368b3e77a325df883500
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-02-08 19:42:54 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-02-08 19:42:54 +0000

    mca: Handle inconsistent CMCI capability reporting

    A BIOS bug may apparently cause the BSP to report that it does not
    implement CMCI, with some APs reporting that they do.  In this scenario,
    avoid a NULL pointer dereference that occurs in cmci_monitor() because
    cmc_state was not allocated by the BSP.

    PR:             253272
    Reported by:    asomers, mmacy
    Reviewed by:    kib (previous version)
    MFC after:      1 week

 sys/x86/x86/mca.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
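
The commit message describes skipping CMCI setup on an AP when the BSP never allocated cmc_state. The sketch below is a hypothetical userspace model of that kind of guard, not the committed code: the names, the flat state array, and the boolean return value are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MCG_CAP_CMCI_P 0x00000400ULL /* bit 10 of IA32_MCG_CAP */

/* Hypothetical per-bank state. */
struct bank_state_sketch {
	int threshold;
};

/* NULL models "the BSP did not report CMCI, so nothing was allocated". */
static struct bank_state_sketch *sketch_state = NULL;

/*
 * Try to enable CMCI monitoring for one bank on the calling CPU.
 * Returns true if the bank was configured, false if it was skipped.
 */
static bool
monitor_bank_sketch(uint64_t ap_mcg_cap, size_t bank)
{
	/* This CPU does not advertise CMCI: nothing to do. */
	if ((ap_mcg_cap & SKETCH_MCG_CAP_CMCI_P) == 0)
		return false;

	/*
	 * This CPU advertises CMCI but the shared state was never
	 * allocated. Skip instead of dereferencing a NULL pointer.
	 */
	if (sketch_state == NULL)
		return false;

	sketch_state[bank].threshold = 1;
	return true;
}
```

With the values from this report, a CPU reporting 0x0f000c14 would be skipped while the state pointer is NULL, which avoids the fault seen in the backtrace at the cost of leaving CMCI disabled on that CPU.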
Comment 4 commit-hook freebsd_committer freebsd_triage 2021-02-15 19:25:22 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8eebd9592e3daf80c2c743666614119d6c862186

commit 8eebd9592e3daf80c2c743666614119d6c862186
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-02-08 19:42:54 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-02-15 19:12:41 +0000

    mca: Handle inconsistent CMCI capability reporting

    A BIOS bug may apparently cause the BSP to report that it does not
    implement CMCI, with some APs reporting that they do.  In this scenario,
    avoid a NULL pointer dereference that occurs in cmci_monitor() because
    cmc_state was not allocated by the BSP.

    PR:             253272
    Reported by:    asomers, mmacy
    Reviewed by:    kib (previous version)

    (cherry picked from commit b5770470276268acef21368b3e77a325df883500)

 sys/x86/x86/mca.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
Comment 5 commit-hook freebsd_committer freebsd_triage 2021-02-15 19:47:27 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=dadf603f0f7b54c65fa5f16f552ae6da12f8210b

commit dadf603f0f7b54c65fa5f16f552ae6da12f8210b
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-02-08 19:42:54 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-02-15 19:47:04 +0000

    mca: Handle inconsistent CMCI capability reporting

    A BIOS bug may apparently cause the BSP to report that it does not
    implement CMCI, with some APs reporting that they do.  In this scenario,
    avoid a NULL pointer dereference that occurs in cmci_monitor() because
    cmc_state was not allocated by the BSP.

    PR:             253272
    Reported by:    asomers, mmacy
    Reviewed by:    kib (previous version)

    (cherry picked from commit b5770470276268acef21368b3e77a325df883500)

 sys/x86/x86/mca.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
Comment 6 commit-hook freebsd_committer freebsd_triage 2021-02-16 17:08:30 UTC
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f560a8b1a4edd1b8a9f110ae2edaa7a3307e9034

commit f560a8b1a4edd1b8a9f110ae2edaa7a3307e9034
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-02-16 17:07:43 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-02-16 17:07:43 +0000

    mca: Handle inconsistent CMCI capability reporting

    A BIOS bug may apparently cause the BSP to report that it does not
    implement CMCI, with some APs reporting that they do.  In this scenario,
    avoid a NULL pointer dereference that occurs in cmci_monitor() because
    cmc_state was not allocated by the BSP.

    Approved by:    re (gjb)
    PR:             253272
    Reported by:    asomers, mmacy
    Reviewed by:    kib (previous version)

    (cherry picked from commit b5770470276268acef21368b3e77a325df883500)
    (cherry picked from commit 8eebd9592e3daf80c2c743666614119d6c862186)

 sys/x86/x86/mca.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)