Not a lot I can get from a ppc64 crash other than the backtrace:

panic: FPU already enabled for thread
cpuid = 23
time = 1546311842
KDB: stack backtrace:
0xe000000201c47330: at .kdb_backtrace+0x5c
0xe000000201c47460: at .vpanic+0x1b4
0xe000000201c47520: at .panic+0x38
0xe000000201c475b0: at .trap+0xb64
0xe000000201c47770: at .powerpc_interrupt+0x290
0xe000000201c47810: user FPU trap by 0x81086e528: srr1=0x900000000000d032
            r1=0x3fffffffdfdfaea0 cr=0x22040024 xer=0 ctr=0x81086e51c r2=0x810891818
KDB: enter: panic
[ thread pid 17895 tid 101554 ]
Stopped at .kdb_enter+0x60: ld r2, r1, 0x28

The system was in the middle of a package build run.
Firmware that is running:

ibm,firmware-versions {
        bmc-firmware-version = "1.10";
        buildroot = "2018.05.1-114-g1822255eab";
        capp-ucode = "p9-dd2-v4";
        hostboot = "p8-30b88ed-p580ec27";
        hostboot-binaries = "hw091818a.930";
        linux = "4.18.6-openpower1-p40b056c";
        machine-xml = "e0fae90-p90e7e34";
        occ = "p8-28f2cec";
        open-power = "palmetto-v2.1-134-g1ad4886";
        petitboot = "1.9.1";
        phandle = <0x10000087>;
        skiboot = "v6.1-124-g7dbf80d1db45";
};
I'm analyzing this issue. It was reproduced on a machine with 32G of RAM, but following the same steps on a machine with 256G does not trigger it. I'm reducing the memory available to the OS and adding a trap in the code to identify divergences between PCB flags and SRR1 flags when leaving kernel space. I'm also building on the 32G machine with the trap change in order to identify the code being executed.
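For anyone following along, the kind of divergence check described above can be sketched in userland C roughly as follows. This is only a model, not the actual debug patch: the struct, the function names, and the MODEL_* constants are made up for illustration (the constants are meant to stand in for the MSR FP-available bit and the PCB "FPU enabled" flag), and the strict "both set or both clear" invariant is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-ins (assumptions, not copied from kernel headers):
 * MODEL_PSL_FP models the FP-available bit in MSR/SRR1,
 * MODEL_PCB_FPU models the per-thread "FPU enabled" flag in pcb_flags.
 */
#define MODEL_PSL_FP   0x2000ULL
#define MODEL_PCB_FPU  0x1U

/* Hypothetical, minimal view of the state being compared. */
struct model_thread {
    uint64_t srr1;       /* MSR image that will be restored on return to user */
    uint32_t pcb_flags;  /* per-thread flags kept by the kernel */
};

/* True when the FP bit in srr1 and the FPU flag in pcb_flags agree. */
static bool fpu_state_consistent(const struct model_thread *td)
{
    bool msr_fp = (td->srr1 & MODEL_PSL_FP) != 0;
    bool pcb_fp = (td->pcb_flags & MODEL_PCB_FPU) != 0;
    return msr_fp == pcb_fp;
}

/* Models the "trap": stop as soon as the two views diverge. */
static void check_on_user_return(const struct model_thread *td)
{
    assert(fpu_state_consistent(td));
}
```

Feeding it the srr1 value from the panic above (0x900000000000d032, where the modeled FP bit is clear) together with the FPU flag set makes fpu_state_consistent() return false, which is the kind of mismatch the trap is meant to catch before the thread gets back to userland.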
(In reply to Leonardo Bianconi from comment #2)
It does seem that the issue takes longer to occur with more RAM in the machine. I was able to get a panic after 30 hours of package building.
The change in https://reviews.freebsd.org/D19166 fixed the issue for Leonardo and me.
A commit references this bug:

Author: luporl
Date: Thu Feb 14 15:15:32 UTC 2019
New revision: 344123
URL: https://svnweb.freebsd.org/changeset/base/344123

Log:
  [PPC64] Fix mismatch between thread flags and MSR

  When sigreturn() restored a thread's context, SRR1 was being restored
  to its previous value, but pcb_flags was not being touched.

  This could cause a mismatch between the thread's MSR and its pcb_flags.
  For instance, when the thread used the FPU for the first time inside
  the signal handler, sigreturn() would clear SRR1, but not pcb_flags.
  Then, the thread would return with the FPU bit cleared in MSR and,
  the next time it tried to use the FPU, it would fail on a KASSERT
  that checked if the FPU was disabled.

  This change clears the FPU bit in both pcb_flags and frame->srr1,
  as the code that restores the context expects to use the FPU trap
  to re-enable it.

  PR:		234539
  Reported by:	sbruno
  Reviewed by:	jhibbits, sbruno
  Differential Revision:	https://reviews.freebsd.org/D19166

Changes:
  head/sys/powerpc/powerpc/exec_machdep.c
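The sequence in the commit log can be walked through with a small userland model. To be clear, this is a sketch and not the code from exec_machdep.c: the struct, every model_* function, and the MODEL_* constants are hypothetical stand-ins, and the "panic" is modeled as a false return value instead of a KASSERT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the MSR FP bit and the PCB FPU flag (assumed values). */
#define MODEL_PSL_FP   0x2000ULL
#define MODEL_PCB_FPU  0x1U

struct model_thread {
    uint64_t srr1;       /* MSR image restored on return to userland */
    uint32_t pcb_flags;  /* per-thread flags kept by the kernel */
};

/*
 * Models the FPU-unavailable trap. Returns false where the real kernel
 * would hit the "FPU already enabled for thread" assertion.
 */
static bool model_fpu_trap(struct model_thread *td)
{
    if (td->pcb_flags & MODEL_PCB_FPU)
        return false;                 /* flags say enabled, yet we trapped */
    td->pcb_flags |= MODEL_PCB_FPU;   /* enable the FPU for this thread */
    td->srr1 |= MODEL_PSL_FP;
    return true;
}

/* Old behavior as described in the log: restore srr1, leave pcb_flags alone. */
static void model_sigreturn_old(struct model_thread *td, uint64_t saved_srr1)
{
    td->srr1 = saved_srr1;
}

/* Fixed behavior as described in the log: clear the FPU bit in both places. */
static void model_sigreturn_fixed(struct model_thread *td, uint64_t saved_srr1)
{
    td->srr1 = saved_srr1 & ~MODEL_PSL_FP;
    td->pcb_flags &= ~MODEL_PCB_FPU;
}

/*
 * Scenario from the log: the thread first touches the FPU inside a signal
 * handler, returns through sigreturn, then uses the FPU again.
 * Returns true if the second FPU use succeeds.
 */
static bool first_fpu_use_in_handler(void (*sigret)(struct model_thread *, uint64_t))
{
    struct model_thread td = { 0, 0 };
    uint64_t saved_srr1 = td.srr1;    /* context captured at signal delivery */

    if (!model_fpu_trap(&td))         /* handler uses the FPU for the first time */
        return false;
    sigret(&td, saved_srr1);          /* handler returns */
    return model_fpu_trap(&td);       /* next FPU use traps again */
}

/* Self-check: the old path fails on the second trap, the fixed path does not. */
void model_self_check(void)
{
    assert(!first_fpu_use_in_handler(model_sigreturn_old));
    assert(first_fpu_use_in_handler(model_sigreturn_fixed));
}
```

With model_sigreturn_old the thread comes back with the FP bit clear in srr1 but the flag still set, so the second trap fails the check. With model_sigreturn_fixed both are clear and the trap simply re-enables the FPU.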
Fantastic work. This definitely is fixed now.