Bug 207085 - pmc assertion failure: pmc %p non-NULL
Summary: pmc assertion failure: pmc %p non-NULL
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.2-RELEASE
Hardware: amd64 Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-10 17:00 UTC by joss.upton
Modified: 2016-02-15 19:09 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description joss.upton 2016-02-10 17:00:09 UTC
I see these panics with INVARIANTS enabled on 10.2-R on an SMP machine running many processes using hwpmc in parallel.

[core,2376] PHW pmc non-NULL
cpuid = 1

pmc_process_exit() does the following:
	sx_xlock();
	critical_enter();
	free_attached_pmcs...
	critical_exit();
	free_owner_pmcs...
	sx_unlock();

and the various drivers do *_release_pmc():
	... KASSERT(phw->phw_pmc == NULL) ...

What I'm seeing is on an SMP machine, from time to time a process which
owns PMCs races with a process that has attached PMCs.  free_owner_pmcs()
calls *_release_pmc() which checks to make sure the per-cpu counter
isn't in use.  A process with attached pmcs will have freed them with
both the pmc_lock and scheduler lock held.  But, a process being scheduled
onto the pmc can sneak in between the critical_exit() and the sx_unlock(),
but before the free_owner_pmcs() resulting in an assertion failure.

free_owner_pmcs can't be protected by the scheduler lock because it
may need to sleep waiting on sampling counters to come off the cpu.  I'm
not sure how to approach a fix other than disabling the KASSERT, and I'm not sure if that's safe.
Comment 1 Konstantin Belousov freebsd_committer 2016-02-12 10:33:15 UTC
(In reply to joss.upton from comment #0)
So the question is how could pc_corepmcs[idx].phw_pmc become non-NULL legitimately there ? Context switch hooks should manage cpu slots, so I see at least two possibilities:

- cpu != curcpu and something else context-switched on cpu which brought its process-private pcms

- global pcm was activated which took the slot.  In fact, this is unlikely since sx lock is still owned.

Could you look what pcm was in the slot for fired assert, and who is the owner of it ?
Comment 2 joss.upton 2016-02-13 03:18:33 UTC
My current workaround is to make sure the owner process and the attached process are pinned to the same cpu.  Since doing this, I've been able to run for several days gathering performance measurements with no crashes.  I believe I still have some crash dumps from when I wasn't pinning and will try to grab the owner/attached info.
Comment 3 joss.upton 2016-02-15 19:09:23 UTC
(In reply to Konstantin Belousov from comment #1)
Here's the state of the machine during one of these panics (back trace at the bottom).

There are four instances of "monitor" and four instances of "target" in the system.  Monitor aloocates/attaches pmc's to target and then waits for target to exit. It then reads the pmcs, releases them, and exits.

panicstr: [soft,327] PHW pmc 0xfffff800b411ec80 non-NULL
cpuid = 0

which corresponds to:

static int soft_release_pmc(int cpu, int ri, struct pmc *pmc) {
 ...
 KASSERT(phw->phw_pmc == NULL,
            ("[soft,%d] PHW pmc %p non-NULL", __LINE__, phw->phw_pmc));
 ...
}


pmc->owner is "monitor" (pid=19085, td_oncpu=0)

pmc->pm_targets list is empty.

pmc->owner->children has one entry "target" (pid=19086, td_oncpu=0xff, td_lastcpu=0).  This child is a zombie.

Looking at phw->phw_pmc:
phw->phw_pmc->pm_phw: owner is "monitor" (pid=19087, td_oncpu=0xff, td_lastcpu=0)

phw->phw_pmc->pm_targets = "target" (pid=19088, td_oncpu=0xff, td_lastcpu=1).

phw->phw_pmc->pm_owner->children: "target" (pid=19088, td_oncpu=0xff, td_lastcpu=1).

panic: [soft,327] PHW pmc 0xfffff800b411ec80 non-NULL
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0239efe650
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0239efe700
vpanic() at vpanic+0x126/frame 0xfffffe0239efe740
kassert_panic() at kassert_panic+0x195/frame 0xfffffe0239efe7d0
soft_release_pmc() at soft_release_pmc+0xb9/frame 0xfffffe0239efe800
pmc_release_pmc_descriptor() at pmc_release_pmc_descriptor+0x33b/frame 0xfffffe0239efe850
pmc_syscall_handler() at pmc_syscall_handler+0x158b/frame 0xfffffe0239efeac0
amd64_syscall() at amd64_syscall+0x2b3/frame 0xfffffe0239efebf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0239efebf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x802699a2a, rsp = 0x7fffffffe468, rbp = 0x7fffffffe480 ---
KDB: enter: panic

#0  doadump (textdump=0) at pcpu.h:220
220     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump (textdump=0) at pcpu.h:220
#1  0xffffffff802f3a0e in db_dump (dummy=<value optimized out>, dummy2=0,
    dummy3=0, dummy4=0x0) at /vagrant/usr/src/sys/ddb/db_command.c:533
#2  0xffffffff802f34ad in db_command (cmd_table=0x0)
    at /vagrant/usr/src/sys/ddb/db_command.c:440
#3  0xffffffff802f778f in db_script_exec (
    scriptname=0xfffffe0239efe300 "kdb.enter.panic",
    warnifnotfound=<value optimized out>)
    at /vagrant/usr/src/sys/ddb/db_script.c:302
#4  0xffffffff802f75d3 in db_script_kdbenter (eventname=0x0)
    at /vagrant/usr/src/sys/ddb/db_script.c:324
#5  0xffffffff802f5c9b in db_trap (type=<value optimized out>, code=0)
    at /vagrant/usr/src/sys/ddb/db_main.c:230
#6  0xffffffff805a0529 in kdb_trap (type=3, code=0, tf=<value optimized out>)
    at /vagrant/usr/src/sys/kern/subr_kdb.c:656
#7  0xffffffff8079709e in trap (frame=0xfffffe0239efe630)
    at /vagrant/usr/src/sys/amd64/amd64/trap.c:564
#8  0xffffffff8077ac72 in calltrap ()
    at /vagrant/usr/src/sys/amd64/amd64/exception.S:236
#9  0xffffffff8059fc8e in kdb_enter (why=0xffffffff808949a1 "panic",
    msg=<value optimized out>) at cpufunc.h:63
#10 0xffffffff80563326 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /vagrant/usr/src/sys/kern/kern_shutdown.c:751
#11 0xffffffff805631d5 in kassert_panic (fmt=<value optimized out>)
    at /vagrant/usr/src/sys/kern/kern_shutdown.c:646
#12 0xffffffff8038d9d9 in soft_release_pmc (cpu=<value optimized out>,
    ri=<value optimized out>, pmc=0xfffff800b411c180)
    at /vagrant/usr/src/sys/dev/hwpmc/hwpmc_soft.c:326
#13 0xffffffff8038a4bb in pmc_release_pmc_descriptor (pm=0xfffff800b411c180)
    at /vagrant/usr/src/sys/dev/hwpmc/hwpmc_mod.c:2360
#14 0xffffffff8038853b in pmc_syscall_handler (td=<value optimized out>,
    syscall_args=<value optimized out>)
    at /vagrant/usr/src/sys/dev/hwpmc/hwpmc_mod.c:3685
#15 0xffffffff80797f83 in amd64_syscall (td=0xfffff800b4d3c000, traced=0)
    at subr_syscall.c:134
#16 0xffffffff8077af5b in Xfast_syscall ()
    at /vagrant/usr/src/sys/amd64/amd64/exception.S:396
#17 0x0000000802699a2a in ?? ()