Not a lot I can get from a ppc64 crash other than the backtrace:
panic: FPU already enabled for thread
cpuid = 23
time = 1546311842
KDB: stack backtrace:
0xe000000201c47330: at .kdb_backtrace+0x5c
0xe000000201c47460: at .vpanic+0x1b4
0xe000000201c47520: at .panic+0x38
0xe000000201c475b0: at .trap+0xb64
0xe000000201c47770: at .powerpc_interrupt+0x290
0xe000000201c47810: user FPU trap by 0x81086e528: srr1=0x900000000000d032
r1=0x3fffffffdfdfaea0 cr=0x22040024 xer=0 ctr=0x81086e51c r2=0x810891818
KDB: enter: panic
[ thread pid 17895 tid 101554 ]
Stopped at .kdb_enter+0x60: ld r2, r1, 0x28
The system was in the middle of a package build run.
Firmware that is running:
bmc-firmware-version = "1.10";
buildroot = "2018.05.1-114-g1822255eab";
capp-ucode = "p9-dd2-v4";
hostboot = "p8-30b88ed-p580ec27";
hostboot-binaries = "hw091818a.930";
linux = "4.18.6-openpower1-p40b056c";
machine-xml = "e0fae90-p90e7e34";
occ = "p8-28f2cec";
open-power = "palmetto-v2.1-134-g1ad4886";
petitboot = "1.9.1";
phandle = <0x10000087>;
skiboot = "v6.1-124-g7dbf80d1db45";
I'm analyzing this issue. It was reproduced on a machine with 32G of RAM, but the same steps on a machine with 256G do not trigger it.
I'm reducing the OS memory and adding a trap in the code to identify divergences between the PCB flags and the SRR1 flags when leaving kernel space.
I'm also building on the 32G machine with the trap change in order to identify the code being executed.
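For anyone following along, the divergence check mentioned above could look roughly like the sketch below. This is a user-space model, not the actual trap that was added to the kernel: the helper name fpu_state_diverges is hypothetical, and the bit values for PSL_FP and PCB_FPU are assumptions that should be checked against the powerpc headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only; verify against the kernel headers. */
#define PSL_FP   UINT64_C(0x2000) /* MSR/SRR1: floating point available */
#define PCB_FPU  0x1              /* pcb_flags: thread is marked as using the FPU */

/*
 * Hypothetical helper: returns true when the PCB and the saved SRR1
 * disagree about whether the FPU is enabled for the thread.
 */
static bool
fpu_state_diverges(int pcb_flags, uint64_t srr1)
{
	bool pcb_on = (pcb_flags & PCB_FPU) != 0;
	bool msr_on = (srr1 & PSL_FP) != 0;

	return (pcb_on != msr_on);
}
```

Under these assumed values, the srr1 from the panic above (0x900000000000d032) has the FP bit clear, so the check reports a divergence whenever pcb_flags still has PCB_FPU set.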
(In reply to Leonardo Bianconi from comment #2)
It does seem that the issue takes longer to occur with more RAM in the machine. I was able to get a panic after 30 hours of package building.
The change in https://reviews.freebsd.org/D19166 fixed the issue for Leonardo and me.
A commit references this bug:
Date: Thu Feb 14 15:15:32 UTC 2019
New revision: 344123
[PPC64] Fix mismatch between thread flags and MSR
When sigreturn() restored a thread's context, SRR1 was being restored
to its previous value, but pcb_flags was not being touched.
This could cause a mismatch between the thread's MSR and its pcb_flags.
For instance, when the thread used the FPU for the first time inside
the signal handler, sigreturn() would clear SRR1, but not pcb_flags.
Then, the thread would return with the FPU bit cleared in MSR and,
the next time it tried to use the FPU, it would fail on a KASSERT
that checked if the FPU was disabled.
This change clears the FPU bit in both pcb_flags and frame->srr1,
as the code that restores the context expects to use the FPU trap
to re-enable it.
Reported by: sbruno
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D19166
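To illustrate the sequence described in the commit message, here is a minimal user-space model of it. It is only a sketch under stated assumptions, not the code from D19166: struct thread_model and the three functions are hypothetical stand-ins, and the PSL_FP and PCB_FPU values are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only; verify against the kernel headers. */
#define PSL_FP   UINT64_C(0x2000) /* MSR/SRR1: floating point available */
#define PCB_FPU  0x1              /* pcb_flags: thread is marked as using the FPU */

/* Hypothetical stand-in for the per-thread state involved in the bug. */
struct thread_model {
	int      pcb_flags;
	uint64_t srr1;
};

/*
 * Models the FPU-unavailable trap. Returns false where the real kernel
 * would panic because the PCB already claims the FPU is enabled.
 */
static bool
fpu_trap(struct thread_model *td)
{
	if (td->pcb_flags & PCB_FPU)
		return (false);
	td->pcb_flags |= PCB_FPU;
	td->srr1 |= PSL_FP;
	return (true);
}

/* Behaviour before the fix: only SRR1 is restored from the saved context. */
static void
sigreturn_before_fix(struct thread_model *td, uint64_t saved_srr1)
{
	td->srr1 = saved_srr1;
}

/* Behaviour after the fix: the FPU bit is cleared in both places. */
static void
sigreturn_after_fix(struct thread_model *td, uint64_t saved_srr1)
{
	td->srr1 = saved_srr1 & ~PSL_FP;
	td->pcb_flags &= ~PCB_FPU;
}
```

Starting from a thread that has never used the FPU, calling fpu_trap() inside the signal handler sets both bits. With sigreturn_before_fix() the PCB flag survives while SRR1 loses the FP bit, so the next fpu_trap() fails. With sigreturn_after_fix() both are cleared and the next fpu_trap() re-enables the FPU normally.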
Fantastic work. This definitely is fixed now.