Bug 239678

Summary: panic: Unrecoverable machine check exception (MCA: CPU 2 UNCOR DCACHE L1 DWR error)
Product: Base System
Reporter: umproko5 <jprokopowich>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Open
Severity: Affects Only Me
CC: kib, mav
Priority: ---
Keywords: crash, needs-qa
Version: 11.2-STABLE   
Hardware: amd64   
OS: Any   

Description umproko5 2019-08-06 18:38:30 UTC
As per FreeNAS support:

In the provided debug I see several identical core dumps, each reporting a crash on a Machine Check Exception:

MCA: Bank 0, Status 0xb4002000c0000145                                                                                                  
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000007                                                                           
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 2                                                                                      
MCA: CPU 2 UNCOR DCACHE L1 DWR error                                                                                                    
MCA: Address 0x3840b0600                                                                                                                
panic: Unrecoverable machine check exception                                                                                            
cpuid = 2                                                                                                                               
KDB: stack backtrace:                                                                                                                   
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe03d6da8e40                                                          
vpanic() at vpanic+0x177/frame 0xfffffe03d6da8ea0                                                                                       
panic() at panic+0x43/frame 0xfffffe03d6da8f00                                                                                          
mca_intr() at mca_intr+0x9b/frame 0xfffffe03d6da8f20                                                                                    
mchk_calltrap() at mchk_calltrap+0x8/frame 0xfffffe03d6da8f20                                                                           
--- trap 0x1c, rip = 0xffffffff82a3c1db, rsp = 0xfffffe045bb246d0, rbp = 0xfffffe045bb247a0 ---                                         
svm_vmrun() at svm_vmrun+0x99b/frame 0xfffffe045bb247a0                                                                                 
vm_run() at vm_run+0x1fc/frame 0xfffffe045bb24880                                                                                       
vmmdev_ioctl() at vmmdev_ioctl+0x85f/frame 0xfffffe045bb24920                                                                           
devfs_ioctl_f() at devfs_ioctl_f+0x128/frame 0xfffffe045bb24980                                                                         
kern_ioctl() at kern_ioctl+0x26d/frame 0xfffffe045bb249f0                                                                               
sys_ioctl() at sys_ioctl+0x15c/frame 0xfffffe045bb24ac0                                                                                 
amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe045bb24bf0                                                                         
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe045bb24bf0                                                             
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8017f281a, rsp = 0x7fffde7f1e28, rbp = 0x7fffde7f1ee0 ---                           
KDB: enter: panic          

MCA panics are usually the result of hardware issues. Considering that you are running desktop hardware, I am not exactly surprised. What worries me is that in all cases the exception happened while the system was executing a virtual machine, so virtualization may be either a trigger (I don't know much about AMD MCA) or just an unrelated witness, simply because this system spends most of its CPU time running VMs.

You may try to report this issue to FreeBSD in case somebody there really knows how to decode AMD MCA. But what I see now looks like a checksum error in the CPU L1 cache, which is a hardware problem.
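
For anyone who wants to check the decoding by hand, the following is a minimal illustrative sketch of how the Bank 0 Status value maps to the "UNCOR DCACHE L1 DWR" line in the log. It assumes only the architectural x86 MCi_STATUS layout: flag bits 63 through 57, plus a 16-bit error code in the memory-hierarchy format 0000 0001 RRRR TTLL. The function name decode_mca_status and the lookup tables are hypothetical names chosen for this example, and the vendor-specific fields (bits 31:16 and the AMD-specific bits above bit 32) are deliberately left undecoded.

```python
# Architectural MCi_STATUS flag bits (common to x86 machine check banks).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}

# Field encodings of the memory-hierarchy error code 0000 0001 RRRR TTLL.
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["I", "D", "G", "?"]        # instruction, data, generic, reserved
LEVELS = ["L0", "L1", "L2", "LG"]   # cache level, LG = generic


def decode_mca_status(status):
    """Decode the architectural part of a 64-bit MCi_STATUS value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code lives in bits 15:0
    description = None
    # Bit 12 is ignored here; some CPUs use it as a filtering flag.
    if code & 0xEF00 == 0x0100:
        rrrr = (code >> 4) & 0xF
        request = REQUESTS[rrrr] if rrrr < len(REQUESTS) else "?"
        ttype = TYPES[(code >> 2) & 0x3]
        level = LEVELS[code & 0x3]
        description = f"{ttype}CACHE {level} {request}"
    return {"flags": flags, "error_code": code, "description": description}


if __name__ == "__main__":
    # Status value taken from the panic log in this report.
    print(decode_mca_status(0xB4002000C0000145))
```

For the value in this report, the sketch yields the flags VAL, UC, EN and ADDRV, error code 0x0145, and the description "DCACHE L1 DWR", which agrees with the kernel's own decoding: an uncorrected error on a data write to the L1 data cache, with a valid address recorded.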
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2019-08-07 16:27:54 UTC
I don't do unpaid support work for iX.  If you can reproduce in FreeBSD, I'm interested.
Comment 2 Alexander Motin freebsd_committer freebsd_triage 2019-08-07 17:10:29 UTC
Conrad, your comment sounds unfair and impolite to me. We at iX do all we can for FreeBSD. We upstream everything possible. But we cannot maintain the whole OS ourselves. You can consider the system in question as FreeBSD 11.2 with some backports, though none in the MCA code. If you think the problem may already be fixed in a newer FreeBSD version, that is fine; otherwise, a bit of shared expertise would be nice. And this is not a paid customer, so we are in the same situation as the FreeBSD project.