Created attachment 197581: core.txt

I can get an INVARIANTS kernel to crash reliably by running

# pmcstat -S inst_retired.any_p -T

and starting a -j32 buildkernel from a different shell. It takes a minute or so. This is on ALPHA7.
I realized that my kernel was somewhat newer than the world. After rebuilding everything I can't repro the problem anymore. Will reopen if it resurfaces.
Never mind, the problem's still there with an updated world.
I see at least one problem that can probably cause this: pmc_capture_user_callchain() processes samples with nsamples == PMC_SAMPLE_INUSE, but there doesn't seem to be anything preventing pmc_add_sample() from overwriting such a sample during that processing.
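To make the suspected window concrete, here is a minimal single-threaded model of it in C. This is only a sketch under stated assumptions, not the hwpmc code: `struct ring`, `ring_add_unchecked()`, `demo_overwrite()`, the ring size, and the numeric value of `SAMPLE_INUSE` are all hypothetical stand-ins. It shows a producer that never checks whether a slot is still marked in use wrapping around and clobbering the slot a slow consumer is still holding.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS   4        /* hypothetical ring size for this sketch */
#define SAMPLE_INUSE 0xFFFFu  /* stand-in marker; not the real PMC_SAMPLE_INUSE value */

struct sample {
    uint32_t nsamples;  /* SAMPLE_INUSE while a user callchain is still pending */
    uint32_t owner_id;  /* identifies the event that filled this slot */
};

struct ring {
    struct sample slots[RING_SLOTS];
    unsigned write_idx;  /* next slot the producer will fill */
};

/*
 * Producer, modelling the NMI-side path: fills the next slot without
 * checking whether the consumer is still working on it.
 */
static void ring_add_unchecked(struct ring *r, uint32_t id)
{
    assert(r->write_idx < RING_SLOTS);
    struct sample *s = &r->slots[r->write_idx];

    s->owner_id = id;
    s->nsamples = SAMPLE_INUSE;
    r->write_idx = (r->write_idx + 1) % RING_SLOTS;
}

/*
 * Replays the interleaving in a fixed order: the consumer picks up slot 0,
 * then the producer adds RING_SLOTS more samples before the consumer is done.
 * Returns 1 if the slot changed underneath the consumer.
 */
static int demo_overwrite(void)
{
    struct ring r = {0};

    ring_add_unchecked(&r, 1);
    struct sample *pending = &r.slots[0];  /* consumer starts processing */
    uint32_t seen = pending->owner_id;

    for (uint32_t id = 2; id <= RING_SLOTS + 1; id++)
        ring_add_unchecked(&r, id);        /* producer wraps around */

    return pending->owner_id != seen;      /* slot 0 now belongs to event 5 */
}
```

In the model, `demo_overwrite()` returns 1: the in-use marker alone does nothing unless the producer also honors it, which is the gap described above.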
The patch I have in review largely addresses the races that you see there. I don't have time at this instant to vet it further - but I think that's the path we should be on.
A commit references this bug:

Author: mmacy
Date: Fri Oct 5 05:55:57 UTC 2018
New revision: 339188
URL: https://svnweb.freebsd.org/changeset/base/339188

Log:
  hwpmc: Refactor sample ring buffer handling to fix races

  Refactor sample ring buffer ring handling to make it more robust to
  long running callchain collection handling

  r338112 introduced a (now fixed) regression that exposed a number of
  race conditions within the management of the sample buffers. This
  simplifies the handling and moves the decision to overwrite a
  callchain sample that has taken too long out of the NMI in to the
  hardlock handler. With this change the problem no longer shows up as
  a ring corruption but as the code spending all of its time in
  callchain collection.

  - Makes the producer / consumer index incrementing monotonic, making
    it easier (for me at least) to reason about.
  - Moves the decision to overwrite a sample from NMI context to
    interrupt context where we can enforce serialization.
  - Puts a time limit on waiting to collect a user callchain - putting
    a bound on head-of-line blocking causing samples to be dropped
  - Removes the flush routine which was previously needed to purge
    dangling references to the pmc from the sample buffers but now is
    only a source of a race condition on unload.

  Previously one could lock up or crash HEAD by running:
  pmcstat -S inst_retired.any_p -T and then hitting ^C

  After this change it is no longer possible.

  PR: 231793
  Reviewed by: markj@
  Approved by: re (gjb@)
  Differential Revision: https://reviews.freebsd.org/D17011

Changes:
  head/sys/dev/hwpmc/hwpmc_logging.c
  head/sys/dev/hwpmc/hwpmc_mod.c
  head/sys/sys/pmc.h
  head/sys/sys/pmckern.h
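For readers unfamiliar with the "monotonic index" idea mentioned in the log, the following C sketch shows the general technique. It is an illustration under assumptions, not the code from r339188: `struct mring`, `mring_put()`, `mring_get()`, `mring_pending()`, `mring_demo()`, and the slot count are hypothetical names and values, and the sketch simply drops a sample when the ring is full rather than modelling the overwrite and timeout decisions the commit describes. It is also single-threaded, so it shows the index arithmetic only, not the serialization.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 8u  /* power of two, so (index & (NSLOTS - 1)) selects a slot */

struct mring {
    uint64_t prodidx;        /* total samples ever produced; only increases */
    uint64_t considx;        /* total samples ever consumed; only increases */
    uint32_t slots[NSLOTS];
};

/* Number of produced-but-not-yet-consumed samples. */
static unsigned mring_pending(const struct mring *r)
{
    return (unsigned)(r->prodidx - r->considx);
}

/* Producer side: returns 0 (sample dropped) when every slot is pending. */
static int mring_put(struct mring *r, uint32_t v)
{
    assert(r->prodidx - r->considx <= NSLOTS);  /* ring invariant */
    if (r->prodidx - r->considx == NSLOTS)
        return 0;
    r->slots[r->prodidx & (NSLOTS - 1)] = v;
    r->prodidx++;
    return 1;
}

/* Consumer side: returns 0 when nothing is pending. */
static int mring_get(struct mring *r, uint32_t *out)
{
    if (r->considx == r->prodidx)
        return 0;
    *out = r->slots[r->considx & (NSLOTS - 1)];
    r->considx++;
    return 1;
}

/* Small walk-through; returns the pending count, or a negative step code. */
static int mring_demo(void)
{
    struct mring r = {0};
    uint32_t v = 0;

    for (uint32_t i = 0; i < NSLOTS; i++)
        if (!mring_put(&r, i))
            return -1;
    if (mring_put(&r, 99))                 /* full: must be rejected */
        return -2;
    if (!mring_get(&r, &v) || v != 0)      /* oldest sample comes out first */
        return -3;
    if (!mring_put(&r, 100))               /* one slot was freed */
        return -4;
    return (int)mring_pending(&r);
}
```

Because the indices never wrap back, "full" and "empty" are distinguishable by plain subtraction (`prodidx - considx`), and the slot is derived from the index only at the point of access. That is the property that makes this style of ring easier to reason about than one that stores wrapped read and write positions.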