Summary: | panic: [pmc,4811] Retrieving callchain for thread that doesn't want it | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | John F. Carr <jfc> | ||||
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> | ||||
Status: | Open --- | ||||||
Severity: | Affects Only Me | CC: | grahamperrin, mhorne | ||||
Priority: | --- | Keywords: | crash | ||||
Version: | 13.1-STABLE | ||||||
Hardware: | Any | ||||||
OS: | Any | ||||||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268943 | ||||||
Attachments: |
|
Description
John F. Carr
2023-01-11 23:26:17 UTC
(In reply to John F. Carr from comment #0) Hi, thanks for the report. I have a couple of questions. First, how easily is this crash reproduced? Does it trigger consistently every time, after the program has run a while, or very rarely? What is the threading model of the program? Does it consist of several long-lived threads or does it continuously spawn many threads with short lifespans? If you are able to share the referenced program, or another reproducer, that would go a long way to assist with isolating the issue. My guess as to how this assertion could occur is that one of the relevant fields in the thread are not being cleared/initialized properly, and re-use of the thread structure results in an unintended call to pmc_capture_user_callchain(). I ran the same command again and the system crashed immediately. The program only runs for a second or a few seconds. It did not crash when I did not use pmcstat's "-U" option. The program "run" starts 48 long-lived threads on this 24x2 system. Most of the threads are spending a lot of time in a loop trying to acquire a spinlock. This is the issue that drove me to try performance tools. I will try to assemble a reproducing program within the attachment size limit. Created attachment 239457 [details]
Reproduction program and required libraries
I attached a tar file with the executable "run" and the necessary shared libraries. You will need to set LD_LIBRARY_PATH due to the shared libraries, e.g.
$ LD_LIBRARY_PATH=. ./run 23 for
I ran the program in a bhyve guest VM (CURRENT as of last week) and it crashed differently when I forgot to set LD_LIBRARY_PATH:
$ pmcstat -U -O perf.outu -P ls_not_halted_cyc -P instructions -P ls_refills_from_sys.ls_mabresp_rmt_dram ./run 24 for
ld-elf.so.1: Shared object "libopencilk-personality-cpanic: [amd,770] PMC0, CPU4 "K8-0" already stopped
cpuid = 4
time = 1673640312
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00855d1c20
vpanic() at vpanic+0x151/frame 0xfffffe00855d1c70
panic() at panic+0x43/frame 0xfffffe00855d1cd0
amd_stop_pmc() at amd_stop_pmc+0x12e/frame 0xfffffe00855d1cf0
pmc_process_exit() at pmc_process_exit+0x26d/frame 0xfffffe00855d1d80
exit1() at exit1+0x36e/frame 0xfffffe00855d1df0
sys_exit() at sys_exit+0xd/frame 0xfffffe00855d1e00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00855d1f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00855d1f30
--- syscall (1, FreeBSD ELF64, exit), rip = 0xbd6753917aa, rsp = 0x8203e4b68, rbp = 0x8203e4b80 ---
KDB: enter: panic
[ thread pid 765 tid 100129 ]
Stopped at kdb_enter+0x32: movq $0,0x129dc83(%rip)
The PMC crash on CURRENT in a VM is unrelated to my test program. It happens with /bin/ls: # pmcstat -p instructions ls > /dev/null 2> /tmp/2 panic: [amd,770] PMC0, CPU0 "K8-0" already stopped cpuid = 0 time = 1673651198 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c326dc20 vpanic() at vpanic+0x151/frame 0xfffffe00c326dc70 panic() at panic+0x43/frame 0xfffffe00c326dcd0 amd_stop_pmc() at amd_stop_pmc+0x12e/frame 0xfffffe00c326dcf0 pmc_process_exit() at pmc_process_exit+0x26d/frame 0xfffffe00c326dd80 exit1() at exit1+0x36e/frame 0xfffffe00c326ddf0 sys_exit() at sys_exit+0xd/frame 0xfffffe00c326de00 amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00c326df30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c326df30 --- syscall (1, FreeBSD ELF64, exit), rip = 0x2ee9ac03458a, rsp = 0x2ee9a9c13488, rbp = 0x2ee9a9c134a0 --- KDB: enter: panic [ thread pid 62603 tid 106652 ] Stopped at kdb_enter+0x32: movq $0,0x129dc83(%rip) I opened bug 268943 to track the second issue, pmcstat always causes a panic on a bhyve guest running CURRENT. It looks different. Let's consider this bug about the original "Retrieving callchain for thread that doesn't want it" crash. The multithreaded angle is a red herring. You don't need the program I attached. On an ordinary desktop class Zen processor ("Ryzen 5 PRO 2400GE") running CURRENT I ran $ pmcstat -U -P instructions echo -n and the kernel crashed the same way. You do need to get root to kldload hwpmc so it's not quite as bad as it sounds. |