As an unprivileged user I ran a multithreaded program with the command $ pmcstat -U -O perf.outu -P ls_not_halted_cyc -P instructions -P ls_refills_from_sys.ls_mabresp_rmt_dram ./run 24 for The kernel crashed with the message [pmc,4811] Retrieving callchain for thread that doesn't want it The message is from a KASSERT in pmc_capture_user_callchain. My kernel configuration has INVARIANTS. My hardware is CPU: AMD EPYC 7402P 24-Core Processor (2794.90-MHz K8-class CPU) Origin="AuthenticAMD" Id=0x830f10 Family=0x17 Model=0x31 Stepping=0 My software is 13.1-STABLE as of mid-December (17333d92643d998d1c6a2dc5f6b1508b6507ad31).
(In reply to John F. Carr from comment #0) Hi, thanks for the report. I have a couple of questions. First, how easily is this crash reproduced? Does it trigger consistently every time, after the program has run a while, or very rarely? What is the threading model of the program? Does it consist of several long-lived threads or does it continuously spawn many threads with short lifespans? If you are able to share the referenced program, or another reproducer, that would go a long way to assist with isolating the issue. My guess as to how this assertion could occur is that one of the relevant fields in the thread are not being cleared/initialized properly, and re-use of the thread structure results in an unintended call to pmc_capture_user_callchain().
I ran the same command again and the system crashed immediately. The program only runs for a second or a few seconds. It did not crash when I did not use pmcstat's "-U" option. The program "run" starts 48 long-lived threads on this 24x2 system. Most of the threads are spending a lot of time in a loop trying to acquire a spinlock. This is the issue that drove me to try performance tools. I will try to assemble a reproducing program within the attachment size limit.
Created attachment 239457 [details] Reproduction program and required libraries I attached a tar file with the executable "run" and the necessary shared libraries. You will need to set LD_LIBRARY_PATH due to the shared libraries, e.g. $ LD_LIBRARY_PATH=. ./run 23 for I ran the program in a bhyve guest VM (CURRENT as of last week) and it crashed differently when I forgot to set LD_LIBRARY_PATH: $ pmcstat -U -O perf.outu -P ls_not_halted_cyc -P instructions -P ls_refills_from_sys.ls_mabresp_rmt_dram ./run 24 for ld-elf.so.1: Shared object "libopencilk-personality-cpanic: [amd,770] PMC0, CPU4 "K8-0" already stopped cpuid = 4 time = 1673640312 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00855d1c20 vpanic() at vpanic+0x151/frame 0xfffffe00855d1c70 panic() at panic+0x43/frame 0xfffffe00855d1cd0 amd_stop_pmc() at amd_stop_pmc+0x12e/frame 0xfffffe00855d1cf0 pmc_process_exit() at pmc_process_exit+0x26d/frame 0xfffffe00855d1d80 exit1() at exit1+0x36e/frame 0xfffffe00855d1df0 sys_exit() at sys_exit+0xd/frame 0xfffffe00855d1e00 amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00855d1f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00855d1f30 --- syscall (1, FreeBSD ELF64, exit), rip = 0xbd6753917aa, rsp = 0x8203e4b68, rbp = 0x8203e4b80 --- KDB: enter: panic [ thread pid 765 tid 100129 ] Stopped at kdb_enter+0x32: movq $0,0x129dc83(%rip)
The PMC crash on CURRENT in a VM is unrelated to my test program. It happens with /bin/ls: # pmcstat -p instructions ls > /dev/null 2> /tmp/2 panic: [amd,770] PMC0, CPU0 "K8-0" already stopped cpuid = 0 time = 1673651198 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c326dc20 vpanic() at vpanic+0x151/frame 0xfffffe00c326dc70 panic() at panic+0x43/frame 0xfffffe00c326dcd0 amd_stop_pmc() at amd_stop_pmc+0x12e/frame 0xfffffe00c326dcf0 pmc_process_exit() at pmc_process_exit+0x26d/frame 0xfffffe00c326dd80 exit1() at exit1+0x36e/frame 0xfffffe00c326ddf0 sys_exit() at sys_exit+0xd/frame 0xfffffe00c326de00 amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00c326df30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c326df30 --- syscall (1, FreeBSD ELF64, exit), rip = 0x2ee9ac03458a, rsp = 0x2ee9a9c13488, rbp = 0x2ee9a9c134a0 --- KDB: enter: panic [ thread pid 62603 tid 106652 ] Stopped at kdb_enter+0x32: movq $0,0x129dc83(%rip)
I opened bug 268943 to track the second issue, pmcstat always causes a panic on a bhyve guest running CURRENT. It looks different. Let's consider this bug about the original "Retrieving callchain for thread that doesn't want it" crash.
The multithreaded angle is a red herring. You don't need the program I attached. On an ordinary desktop class Zen processor ("Ryzen 5 PRO 2400GE") running CURRENT I ran $ pmcstat -U -P instructions echo -n and the kernel crashed the same way. You do need to get root to kldload hwpmc so it's not quite as bad as it sounds.