I see these panics with INVARIANTS enabled on 10.2-R on an SMP machine running many processes using hwpmc in parallel. [core,2376] PHW pmc non-NULL cpuid = 1 pmc_process_exit() does the following: sx_xlock(); critical_enter(); free_attached_pmcs... critical_exit(); free_owner_pmcs... sx_unlock(); and the various drivers do *_release_pmc(): ... KASSERT(phw->phw_pmc == NULL) ... What I'm seeing is on an SMP machine, from time to time a process which owns PMCs races with a process that has attached PMCs. free_owner_pmcs() calls *_release_pmc() which checks to make sure the per-cpu counter isn't in use. A process with attached pmcs will have freed them with both the pmc_lock and scheduler lock held. But, a process being scheduled onto the pmc can sneak in between the critical_exit() and the sx_unlock(), but before the free_owner_pmcs() resulting in an assertion failure. free_owner_pmcs can't be protected by the scheduler lock because it may need to sleep waiting on sampling counters to come off the cpu. I'm not sure how to approach a fix other than disabling the KASSERT, and I'm not sure if that's safe.
(In reply to joss.upton from comment #0) So the question is how could pc_corepmcs[idx].phw_pmc become non-NULL legitimately there ? Context switch hooks should manage cpu slots, so I see at least two possibilities: - cpu != curcpu and something else context-switched on cpu which brought its process-private pcms - global pcm was activated which took the slot. In fact, this is unlikely since sx lock is still owned. Could you look what pcm was in the slot for fired assert, and who is the owner of it ?
My current workaround is to make sure the owner process and the attached process are pinned to the same cpu. Since doing this, I've been able to run for several days gathering performance measurements with no crashes. I believe I still have some crash dumps from when I wasn't pinning and will try to grab the owner/attached info.
(In reply to Konstantin Belousov from comment #1) Here's the state of the machine during one of these panics (back trace at the bottom). There are four instances of "monitor" and four instances of "target" in the system. Monitor aloocates/attaches pmc's to target and then waits for target to exit. It then reads the pmcs, releases them, and exits. panicstr: [soft,327] PHW pmc 0xfffff800b411ec80 non-NULL cpuid = 0 which corresponds to: static int soft_release_pmc(int cpu, int ri, struct pmc *pmc) { ... KASSERT(phw->phw_pmc == NULL, ("[soft,%d] PHW pmc %p non-NULL", __LINE__, phw->phw_pmc)); ... } pmc->owner is "monitor" (pid=19085, td_oncpu=0) pmc->pm_targets list is empty. pmc->owner->children has one entry "target" (pid=19086, td_oncpu=0xff, td_lastcpu=0). This child is a zombie. Looking at phw->phw_pmc: phw->phw_pmc->pm_phw: owner is "monitor" (pid=19087, td_oncpu=0xff, td_lastcpu=0) phw->phw_pmc->pm_targets = "target" (pid=19088, td_oncpu=0xff, td_lastcpu=1). phw->phw_pmc->pm_owner->children: "target" (pid=19088, td_oncpu=0xff, td_lastcpu=1). panic: [soft,327] PHW pmc 0xfffff800b411ec80 non-NULL cpuid = 0 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0239efe650 kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0239efe700 vpanic() at vpanic+0x126/frame 0xfffffe0239efe740 kassert_panic() at kassert_panic+0x195/frame 0xfffffe0239efe7d0 soft_release_pmc() at soft_release_pmc+0xb9/frame 0xfffffe0239efe800 pmc_release_pmc_descriptor() at pmc_release_pmc_descriptor+0x33b/frame 0xfffffe0239efe850 pmc_syscall_handler() at pmc_syscall_handler+0x158b/frame 0xfffffe0239efeac0 amd64_syscall() at amd64_syscall+0x2b3/frame 0xfffffe0239efebf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0239efebf0 --- syscall (0, FreeBSD ELF64, nosys), rip = 0x802699a2a, rsp = 0x7fffffffe468, rbp = 0x7fffffffe480 --- KDB: enter: panic #0 doadump (textdump=0) at pcpu.h:220 220 pcpu.h: No such file or directory. in pcpu.h (kgdb) #0 doadump (textdump=0) at pcpu.h:220 #1 0xffffffff802f3a0e in db_dump (dummy=<value optimized out>, dummy2=0, dummy3=0, dummy4=0x0) at /vagrant/usr/src/sys/ddb/db_command.c:533 #2 0xffffffff802f34ad in db_command (cmd_table=0x0) at /vagrant/usr/src/sys/ddb/db_command.c:440 #3 0xffffffff802f778f in db_script_exec ( scriptname=0xfffffe0239efe300 "kdb.enter.panic", warnifnotfound=<value optimized out>) at /vagrant/usr/src/sys/ddb/db_script.c:302 #4 0xffffffff802f75d3 in db_script_kdbenter (eventname=0x0) at /vagrant/usr/src/sys/ddb/db_script.c:324 #5 0xffffffff802f5c9b in db_trap (type=<value optimized out>, code=0) at /vagrant/usr/src/sys/ddb/db_main.c:230 #6 0xffffffff805a0529 in kdb_trap (type=3, code=0, tf=<value optimized out>) at /vagrant/usr/src/sys/kern/subr_kdb.c:656 #7 0xffffffff8079709e in trap (frame=0xfffffe0239efe630) at /vagrant/usr/src/sys/amd64/amd64/trap.c:564 #8 0xffffffff8077ac72 in calltrap () at /vagrant/usr/src/sys/amd64/amd64/exception.S:236 #9 0xffffffff8059fc8e in kdb_enter (why=0xffffffff808949a1 "panic", msg=<value optimized out>) at cpufunc.h:63 #10 0xffffffff80563326 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /vagrant/usr/src/sys/kern/kern_shutdown.c:751 #11 0xffffffff805631d5 in kassert_panic (fmt=<value optimized out>) at /vagrant/usr/src/sys/kern/kern_shutdown.c:646 #12 0xffffffff8038d9d9 in soft_release_pmc (cpu=<value optimized out>, ri=<value optimized out>, pmc=0xfffff800b411c180) at /vagrant/usr/src/sys/dev/hwpmc/hwpmc_soft.c:326 #13 0xffffffff8038a4bb in pmc_release_pmc_descriptor (pm=0xfffff800b411c180) at /vagrant/usr/src/sys/dev/hwpmc/hwpmc_mod.c:2360 #14 0xffffffff8038853b in pmc_syscall_handler (td=<value optimized out>, syscall_args=<value optimized out>) at /vagrant/usr/src/sys/dev/hwpmc/hwpmc_mod.c:3685 #15 0xffffffff80797f83 in amd64_syscall (td=0xfffff800b4d3c000, traced=0) at subr_syscall.c:134 #16 0xffffffff8077af5b in Xfast_syscall () at /vagrant/usr/src/sys/amd64/amd64/exception.S:396 #17 0x0000000802699a2a in ?? ()