Bug 226930

Summary: panic: Unrecoverable machine check exception
Product: Base System
Reporter: vidwer+fbsdbugs
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New ---
Severity: Affects Only Me
CC: cem
Priority: ---    
Version: CURRENT   
Hardware: amd64   
OS: Any   

Description vidwer+fbsdbugs 2018-03-25 23:38:38 UTC
Running FreeBSD 12.0-CURRENT r331531, the machine panics after decompressing about 100 GB of a very large (>850 GB) gzip file with gzip/unpigz.

st_mtime and st_birthtime of the output file are, respectively:
"Mar 25 22:20:22 2018" "Mar 25 21:34:22 2018"

Unread portion of the kernel message buffer:
MCA: Bank 1, Status 0xcc00001000010151
MCA: Global Cap 0x0000000000000107, Status 0x0000000000000004
MCA: Vendor "AuthenticAMD", ID 0x610f01, APIC ID 18
MCA: CPU 2 COR OVER ICACHE L1 IRD error
MCA: Address 0xffff80f8aa21
MCA: Misc 0xc01b0fff00000000
MCA: Bank 5, Status 0xb0800000000c0e0f
MCA: Global Cap 0x0000000000000107, Status 0x0000000000000004
MCA: Vendor "AuthenticAMD", ID 0x610f01, APIC ID 18
MCA: CPU 2 UNCOR BUSLG ??? ERR Other
panic: Unrecoverable machine check exception
cpuid = 2
time = 1522009224
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0003581e40
vpanic() at vpanic+0x18d/frame 0xfffffe0003581ea0
panic() at panic+0x43/frame 0xfffffe0003581f00
mca_intr() at mca_intr+0x9b/frame 0xfffffe0003581f20
mchk_calltrap() at mchk_calltrap+0x8/frame 0xfffffe0003581f20
--- trap 0x1c, rip = 0xffffffff80f8aa33, rsp = 0xfffffe003dab07d0, rbp = 0xfffffe003dab0890 ---
apic_isr1_u() at apic_isr1_u+0xa9/frame 0xfffffe003dab0890
acpi_cpu_idle() at acpi_cpu_idle+0x2ee/frame 0xfffffe003dab08e0
cpu_idle_acpi() at cpu_idle_acpi+0x3f/frame 0xfffffe003dab0900
cpu_idle() at cpu_idle+0x8f/frame 0xfffffe003dab0920
sched_idletd() at sched_idletd+0x517/frame 0xfffffe003dab09f0
fork_exit() at fork_exit+0x84/frame 0xfffffe003dab0a30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003dab0a30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

__curthread () at ./machine/pcpu.h:230
230		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) list *0xffffffff80f8aa33
0xffffffff80f8aa33 is at /usr/src/sys/amd64/amd64/apic_vector.S:118.
113		SUPERALIGN_TEXT
114	IDTVEC(spuriousint)
115		/* No EOI cycle used here */
116		jmp	doreti_iret
117	
118		ISR_VEC	1, apic_isr1
119		ISR_VEC	2, apic_isr2
120		ISR_VEC	3, apic_isr3
121		ISR_VEC	4, apic_isr4
122		ISR_VEC	5, apic_isr5
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:230
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:361
#2  0xffffffff80424aeb in db_dump (dummy=<optimized out>, dummy2=<unavailable>, dummy3=<unavailable>, dummy4=<unavailable>)
    at /usr/src/sys/ddb/db_command.c:574
#3  0xffffffff804248b9 in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=<optimized out>)
    at /usr/src/sys/ddb/db_command.c:481
#4  0xffffffff80424634 in db_command_loop () at /usr/src/sys/ddb/db_command.c:534
#5  0xffffffff8042785f in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:250
#6  0xffffffff80b35273 in kdb_trap (type=3, code=-61456, tf=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:697
#7  0xffffffff80fad8a8 in trap (frame=0xfffffe0003581d70) at /usr/src/sys/amd64/amd64/trap.c:547
#8  0xffffffff80f89a8c in alltraps_pushregs_no_rax () at /usr/src/sys/amd64/amd64/exception.S:223
#9  0xffffffff81be9b78 in ?? ()
#10 0x0000000000000080 in ?? ()
#11 0xfffffe0003581d30 in ?? ()
#12 0x0000000000000080 in ?? ()
#13 0x0000000000000278 in ?? ()
#14 0x00000000000001d0 in ?? ()
#15 0x0000000000000012 in ?? ()
#16 0xffffffff8121d87b in ?? ()
#17 0xfffffe0003581e40 in ?? ()
#18 0xffffffff81adb7c0 in local_info ()
#19 0x0000000000000010 in ?? ()
#20 0xffffffff81e10901 in __pcpu ()
#21 0x0000000002814000 in ?? ()
#22 0xfffffe0003581ee0 in ?? ()
#23 0xfffff80003346000 in ?? ()
#24 0x001b001300000003 in ?? ()
#25 0x0000000000000000 in ?? ()
Comment 1 Andriy Gapon freebsd_committer freebsd_triage 2018-03-26 07:49:23 UTC
This seems like a problem with the hardware.
Either it is failing, or it is running outside its proper operating conditions, or there is some other kind of problem.
Maybe it's an early Ryzen?  There have been numerous reports of problems with those.

Also, mcelog can help produce more human-readable information about MCA / MCE conditions.
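
For readers without mcelog at hand, here is a minimal sketch of how the two Status values from the panic output can be split into their flag bits. It assumes the architectural MCi_STATUS layout (VAL in bit 63 down to PCC in bit 57, MCA error code in bits 15:0); the names `MC_STATUS_FLAGS` and `decode_mc_status` are made up for illustration and are not part of mcelog or the kernel.

```python
# Flag bits in the upper part of an x86 MCi_STATUS register
# (assumed architectural layout; names are illustrative only).
MC_STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error record
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting of this error was enabled
    (59, "MISCV"),  # the bank's MISC register is valid
    (58, "ADDRV"),  # the bank's ADDR register is valid
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mc_status(status):
    """Return (list of set flag names, 16-bit MCA error code)."""
    flags = [name for bit, name in MC_STATUS_FLAGS if status & (1 << bit)]
    return flags, status & 0xFFFF


# The two Status values reported in the description.
for bank, status in ((1, 0xcc00001000010151), (5, 0xb0800000000c0e0f)):
    flags, code = decode_mc_status(status)
    print(f"Bank {bank}: flags={','.join(flags)} error_code={code:#06x}")
```

Under these assumptions, Bank 1 decodes without UC and with OVER, ADDRV and MISCV set, which agrees with the "COR OVER" line and the printed Address and Misc values, while Bank 5 has UC set, which agrees with "UNCOR".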
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2018-03-26 09:25:00 UTC
Not really about the automounter daemon.
Comment 3 vidwer+fbsdbugs 2018-03-26 11:42:05 UTC
(In reply to Andriy Gapon from comment #1)
This is Hudson-D4 hardware.

(In reply to Mark Linimon from comment #2)
That should have read AMD, as in the hardware manufacturer.
Comment 4 Conrad Meyer freebsd_committer freebsd_triage 2018-03-26 15:35:25 UTC
(In reply to Andriy Gapon from comment #1)

The family mask is 0x0ff00f00 and the printed CPUID is 0x610f01; the base family (0xf) and extended family (0x06) fields are summed.  So this is Family 15h, I think.
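
The arithmetic can be checked with a short sketch. It assumes the usual CPUID signature layout (stepping in bits 3:0, base model in 7:4, base family in 11:8, extended model in 19:16, extended family in 27:20) and the AMD convention that the extended fields only apply when the base family is 0xf; `decode_cpuid_signature` is a hypothetical helper, not a kernel function.

```python
def decode_cpuid_signature(sig):
    """Split a CPUID signature into (family, model, stepping).

    Assumes the AMD convention: extended family/model are only
    combined with the base fields when the base family is 0xf.
    """
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF      # bits covered by mask 0x00000f00
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF     # bits covered by mask 0x0ff00000

    if base_family == 0xF:
        family = base_family + ext_family      # the two fields are summed
        model = (ext_model << 4) | base_model  # extended model is the high nibble
    else:
        family = base_family
        model = base_model
    return family, model, stepping


family, model, stepping = decode_cpuid_signature(0x610f01)
print(f"Family {family:x}h, model {model:x}h, stepping {stepping}")
```

For the ID in the panic output this gives 0xf + 0x06 = 0x15, that is, Family 15h.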
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2018-03-29 04:51:32 UTC
(In reply to vidwer+fbsdbugs from comment #3)

I agree that the token is overloaded, but back when we used GNATS, the [amd] notation indicated a problem with the automounter daemon.

These days, setting 'Hardware: amd64' is sufficient to indicate that the problem is architecture-specific.