Bug 231793 - panic: [pmc,4965] pm=0xfffff80480ff2400 runcount 0
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Reported: 2018-09-28 22:45 UTC by Mark Johnston
Modified: 2018-10-05 05:59 UTC

Attachments
core.txt (278.81 KB, text/plain)
2018-09-28 22:45 UTC, Mark Johnston

Description Mark Johnston freebsd_committer freebsd_triage 2018-09-28 22:45:31 UTC
Created attachment 197581: core.txt

I can get an INVARIANTS kernel to crash reliably by running

# pmcstat -S inst_retired.any_p -T

and starting a -j32 buildkernel from a different shell. It takes a minute or so. This is on ALPHA7.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2018-09-28 23:27:14 UTC
I realized that my kernel was somewhat newer than the world.  After rebuilding everything I can't repro the problem anymore.  Will reopen if it resurfaces.
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2018-09-29 15:45:50 UTC
Never mind, the problem's still there with an updated world.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2018-09-29 16:12:44 UTC
I see at least one problem that can probably cause this: pmc_capture_user_callchain() processes samples with nsamples == PMC_SAMPLE_INUSE, but there doesn't seem to be anything preventing pmc_add_sample() from overwriting such a sample during that processing.
Comment 4 Matt Macy freebsd_committer freebsd_triage 2018-09-29 19:45:26 UTC
The patch I have in review largely addresses the races that you see there. I don't have time at this instant to vet it further - but I think that's the path we should be on.
Comment 5 commit-hook freebsd_committer freebsd_triage 2018-10-05 05:56:22 UTC
A commit references this bug:

Author: mmacy
Date: Fri Oct  5 05:55:57 UTC 2018
New revision: 339188
URL: https://svnweb.freebsd.org/changeset/base/339188

Log:
  hwpmc: Refactor sample ring buffer handling to fix races

  Refactor sample ring buffer handling to make it more robust to
  long-running callchain collection.

  r338112 introduced a (now fixed) regression that exposed a number of race
  conditions within the management of the sample buffers. This change
  simplifies the handling and moves the decision to overwrite a
  callchain sample that has taken too long out of the NMI handler and into
  the hardclock handler. With this change the problem no longer shows up as
  ring corruption but as the code spending all of its time in callchain
  collection.

  - Makes the producer / consumer index incrementing monotonic, making it
    easier (for me at least) to reason about.
  - Moves the decision to overwrite a sample from NMI context to interrupt
    context where we can enforce serialization.
  - Puts a time limit on waiting to collect a user callchain, bounding the
    head-of-line blocking that causes samples to be dropped.
  - Removes the flush routine which was previously needed to purge
    dangling references to the pmc from the sample buffers but now is only
    a source of a race condition on unload.
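
The first item in the list above, monotonic producer / consumer indices, can be illustrated with a minimal sketch. This is not the hwpmc code: `struct sample_ring`, `ring_put()`, `ring_get()`, and `RING_SIZE` are hypothetical names, the payload is a plain `int`, and the atomics and memory barriers that a real NMI producer would need are left out. It only shows the indexing idea: both counters only ever increase, the slot is the counter modulo the ring size, and the fill level is the difference between the counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring size; a power of two so the slot mapping stays
 * consistent when the 32-bit counters wrap around. */
#define RING_SIZE 8u

struct sample_ring {
	uint32_t prod;			/* samples ever produced; never reset */
	uint32_t cons;			/* samples ever consumed; never reset */
	int	 slots[RING_SIZE];	/* stand-in for real sample records */
};

/* Producer side: store a sample, or drop it if the ring is full. */
static bool
ring_put(struct sample_ring *r, int v)
{
	/* Unsigned subtraction gives the fill level even across wraparound. */
	assert(r->prod - r->cons <= RING_SIZE);
	if (r->prod - r->cons == RING_SIZE)
		return (false);		/* full: drop rather than overwrite */
	r->slots[r->prod % RING_SIZE] = v;
	r->prod++;
	return (true);
}

/* Consumer side: fetch the oldest unconsumed sample, if any. */
static bool
ring_get(struct sample_ring *r, int *v)
{
	if (r->prod == r->cons)
		return (false);		/* empty */
	*v = r->slots[r->cons % RING_SIZE];
	r->cons++;
	return (true);
}
```

Because the counters never move backwards, "full" and "empty" are distinguished by a single subtraction rather than by separate read and write pointers that wrap independently. In this sketch a full ring simply drops the new sample; the commit itself describes a different policy, in which the decision to overwrite a stalled sample is made in interrupt context.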

  Previously one could lock up or crash HEAD by running:
  pmcstat -S inst_retired.any_p -T and then hitting ^C

  After this change it is no longer possible.

  PR:	231793
  Reviewed by:	markj@
  Approved by:	re (gjb@)
  Differential Revision:	https://reviews.freebsd.org/D17011

Changes:
  head/sys/dev/hwpmc/hwpmc_logging.c
  head/sys/dev/hwpmc/hwpmc_mod.c
  head/sys/sys/pmc.h
  head/sys/sys/pmckern.h