Bug 268897 - panic: [pmc,4811] Retrieving callchain for thread that doesn't want it
Summary: panic: [pmc,4811] Retrieving callchain for thread that doesn't want it
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.1-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2023-01-11 23:26 UTC by John F. Carr
Modified: 2023-01-14 01:23 UTC (History)
2 users (show)

See Also:


Attachments
Reproduction program and required libraries (257.72 KB, application/x-gzip)
2023-01-13 20:21 UTC, John F. Carr
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description John F. Carr 2023-01-11 23:26:17 UTC
As an unprivileged user I ran a multithreaded program with the command

$ pmcstat -U -O perf.outu -P ls_not_halted_cyc -P instructions -P ls_refills_from_sys.ls_mabresp_rmt_dram ./run 24 for

The kernel crashed with the message

[pmc,4811] Retrieving callchain for thread that doesn't want it

The message is from a KASSERT in pmc_capture_user_callchain.  
My kernel configuration has INVARIANTS.

My hardware is

CPU: AMD EPYC 7402P 24-Core Processor                (2794.90-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x830f10  Family=0x17  Model=0x31  Stepping=0

My software is 13.1-STABLE as of mid-December (17333d92643d998d1c6a2dc5f6b1508b6507ad31).
Comment 1 Mitchell Horne freebsd_committer freebsd_triage 2023-01-13 17:39:18 UTC
(In reply to John F. Carr from comment #0)

Hi, thanks for the report. I have a couple of questions.

First, how easily is this crash reproduced? Does it trigger consistently every time, after the program has run a while, or very rarely?

What is the threading model of the program? Does it consist of several long-lived threads or does it continuously spawn many threads with short lifespans?

If you are able to share the referenced program, or another reproducer, that would go a long way to assist with isolating the issue.

My guess as to how this assertion could occur is that one of the relevant fields in the thread are not being cleared/initialized properly, and re-use of the thread structure results in an unintended call to pmc_capture_user_callchain().
Comment 2 John F. Carr 2023-01-13 17:57:50 UTC
I ran the same command again and the system crashed immediately.  The program only runs for a second or a few seconds.  It did not crash when I did not use pmcstat's "-U" option.

The program "run" starts 48 long-lived threads on this 24x2 system.  Most of the threads are spending a lot of time in a loop trying to acquire a spinlock.  This is the issue that drove me to try performance tools.

I will try to assemble a reproducing program within the attachment size limit.
Comment 3 John F. Carr 2023-01-13 20:21:47 UTC
Created attachment 239457 [details]
Reproduction program and required libraries

I attached a tar file with the executable "run" and the necessary shared libraries.  You will need to set LD_LIBRARY_PATH due to the shared libraries, e.g.

$ LD_LIBRARY_PATH=. ./run 23 for

I ran the program in a bhyve guest VM (CURRENT as of last week) and it crashed differently when I forgot to set LD_LIBRARY_PATH:

$ pmcstat -U -O perf.outu -P ls_not_halted_cyc -P instructions -P ls_refills_from_sys.ls_mabresp_rmt_dram ./run 24 for
ld-elf.so.1: Shared object "libopencilk-personality-cpanic: [amd,770] PMC0, CPU4 "K8-0" already stopped
cpuid = 4
time = 1673640312
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00855d1c20
vpanic() at vpanic+0x151/frame 0xfffffe00855d1c70
panic() at panic+0x43/frame 0xfffffe00855d1cd0
amd_stop_pmc() at amd_stop_pmc+0x12e/frame 0xfffffe00855d1cf0
pmc_process_exit() at pmc_process_exit+0x26d/frame 0xfffffe00855d1d80
exit1() at exit1+0x36e/frame 0xfffffe00855d1df0
sys_exit() at sys_exit+0xd/frame 0xfffffe00855d1e00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00855d1f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00855d1f30
--- syscall (1, FreeBSD ELF64, exit), rip = 0xbd6753917aa, rsp = 0x8203e4b68, rbp = 0x8203e4b80 ---
KDB: enter: panic
[ thread pid 765 tid 100129 ]
Stopped at      kdb_enter+0x32: movq    $0,0x129dc83(%rip)
Comment 4 John F. Carr 2023-01-13 23:09:34 UTC
The PMC crash on CURRENT in a VM is unrelated to my test program.  It happens with /bin/ls:

# pmcstat -p instructions ls > /dev/null 2> /tmp/2
panic: [amd,770] PMC0, CPU0 "K8-0" already stopped
cpuid = 0
time = 1673651198
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c326dc20
vpanic() at vpanic+0x151/frame 0xfffffe00c326dc70
panic() at panic+0x43/frame 0xfffffe00c326dcd0
amd_stop_pmc() at amd_stop_pmc+0x12e/frame 0xfffffe00c326dcf0
pmc_process_exit() at pmc_process_exit+0x26d/frame 0xfffffe00c326dd80
exit1() at exit1+0x36e/frame 0xfffffe00c326ddf0
sys_exit() at sys_exit+0xd/frame 0xfffffe00c326de00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00c326df30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c326df30
--- syscall (1, FreeBSD ELF64, exit), rip = 0x2ee9ac03458a, rsp = 0x2ee9a9c13488, rbp = 0x2ee9a9c134a0 ---
KDB: enter: panic
[ thread pid 62603 tid 106652 ]
Stopped at      kdb_enter+0x32: movq    $0,0x129dc83(%rip)
Comment 5 John F. Carr 2023-01-14 00:08:04 UTC
I opened bug 268943 to track the second issue, pmcstat always causes a panic on a bhyve guest running CURRENT.  It looks different.  Let's consider this bug about the original "Retrieving callchain for thread that doesn't want it" crash.
Comment 6 John F. Carr 2023-01-14 01:23:18 UTC
The multithreaded angle is a red herring.  You don't need the program I attached.  On an ordinary desktop class Zen processor ("Ryzen 5 PRO 2400GE") running CURRENT I ran

$ pmcstat -U -P instructions echo -n

and the kernel crashed the same way.

You do need to get root to kldload hwpmc so it's not quite as bad as it sounds.