Hi,

Using pmcstat in top mode with system-level sampling works fine, but process-level sampling stops reporting events after a while. "A while" is typically a couple of seconds if you're lucky. This is reproducible on -10 and -HEAD.

To reproduce:

* run something that's very busy - say, nginx with multiple worker processes
* run pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -t <nginx pid>
* wait until it stops providing events
Maybe it is related to VM. When I run pmcstat (top mode) in a VM, it behaves the same as Adrian observed. However, when I tried pmcstat on real hardware, sampling seems to keep working and providing events. I tried both commands:

pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -t 1030
pmcstat -P INSTR_RETIRED_ANY -T -w 1 -t 1030

where 1030 is an nginx worker process pid. I'm running FreeBSD 11.0-CURRENT #4 r281075M: Sat Apr 4 22:27:29 ULAST 2015.
Hm, how many CPUs are in the real box?
% sysctl -a | grep ncpu
hw.ncpu: 2
% sysctl -a | grep smp
kern.smp.forward_signal_enabled: 1
kern.smp.topology: 0
kern.smp.cpus: 2
kern.smp.disabled: 0
kern.smp.active: 1
kern.smp.maxcpus: 32
kern.smp.maxid: 1
Do you have any boxes with >2 CPUs?
Works fine here on a box where:

FreeBSD 10.1-STABLE #0 r281519: Tue Apr 14 18:34:17 ULAT 2015
    tsgan@usec1:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
CPU: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (2926.06-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206c2  Family=0x6  Model=0x2c  Stepping=2
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8257982464 (7875 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <DELL   PE_SC3  >
FreeBSD/SMP: Multiprocessor System Detected: 12 CPUs
FreeBSD/SMP: 1 package(s) x 6 core(s) x 2 SMT threads
 cpu0 (BSP): APIC ID: 32
 cpu1 (AP): APIC ID: 33
 cpu2 (AP): APIC ID: 34
 cpu3 (AP): APIC ID: 35
 cpu4 (AP): APIC ID: 36
 cpu5 (AP): APIC ID: 37
 cpu6 (AP): APIC ID: 48
 cpu7 (AP): APIC ID: 49
 cpu8 (AP): APIC ID: 50
 cpu9 (AP): APIC ID: 51
 cpu10 (AP): APIC ID: 52
 cpu11 (AP): APIC ID: 53
OK, try something multi-threaded that's doing system calls. I just tried with nginx and it's okay - it's all one thread per process.
I tried to reproduce it with asterisk on the same multicore machine and put some load on asterisk:

last pid:  1296;  load averages:  7.44,  5.22,  3.17   up 0+00:28:00  12:01:12
51 processes:  2 running, 49 sleeping
CPU: 31.9% user,  0.0% nice, 21.8% system,  1.8% interrupt, 44.5% idle
Mem: 300M Active, 443M Inact, 252M Wired, 90M Buf, 6933M Free
Swap: 16G Total, 16G Free

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 1275 root      1059  20    0  1472M   584M select  3  14:30 217.48% asterisk
 1280 root         1  24    0 33100K 12692K CPU3    3   0:06   1.27% pmcstat
  721 tsgan        1  20    0 86488K  7060K select  7   0:03   0.00% sshd

And pmcstat still shows samples in top mode:

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 24904 (100.0%) , 2906 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  7.0 kernel     _mtx_lock_spin_cooki pmclog_reserve:6.4 sched_add:0.6
  3.4 kernel     __mtx_lock_sleep     umtxq_sleep:1.3 __umtx_op_wake2_umutex:1.2 do_lock_umutex:0.6
  3.3 libgsm.so. 0xad72               Gsm_Short_Term_Synthesis_Filt
  2.4 kernel     cpu_search_lowest    cpu_search_lowest
  2.3 libgsm.so. 0xad37               Gsm_Short_Term_Synthesis_Filt
  1.8 libgsm.so. 0xad19               Gsm_Short_Term_Synthesis_Filt
  1.6 libgsm.so. 0xad30               Gsm_Short_Term_Synthesis_Filt
  1.6 libgsm.so. 0xad69               Gsm_Short_Term_Synthesis_Filt
  1.6 libgsm.so. 0xace7               Gsm_Short_Term_Synthesis_Filt
  1.4 libgsm.so. Gsm_Decoder          gsm_decode
  1.3 libgsm.so. 0xad53               Gsm_Short_Term_Synthesis_Filt
  1.3 libgsm.so. Gsm_Long_Term_Synthe Gsm_Decoder
  1.2 hwpmc.ko   pmc_hook_handler     sched_switch
  1.1 libgsm.so. 0xad4b               Gsm_Short_Term_Synthesis_Filt
  1.1 codec_ulaw lintoulaw_framein
  1.0 libgsm.so. Gsm_Short_Term_Synth Gsm_Decoder
  1.0 kernel     Xrendezvous
  1.0 libgsm.so. 0xad33               Gsm_Short_Term_Synthesis_Filt
  1.0 libgsm.so. 0xad3f               Gsm_Short_Term_Synthesis_Filt
  1.0 libgsm.so. 0xace0               Gsm_Short_Term_Synthesis_Filt
  1.0 libgsm.so. 0xacb0               Gsm_Short_Term_Synthesis_Filt
  0.9 libgsm.so. 0xacca               Gsm_Short_Term_Synthesis_Filt
  0.9 kernel     do_lock_umutex       __umtx_op_wait_umutex
  0.9 libgsm.so. 0xacef               Gsm_Short_Term_Synthesis_Filt
  0.9 libgsm.so. 0xad11               Gsm_Short_Term_Synthesis_Filt
  0.8 libgsm.so. 0xad00               Gsm_Short_Term_Synthesis_Filt
  0.8 kernel     amd64_syscall
  0.8 kernel     in_pcbconnect_setup  udp_send
  0.8 libgsm.so. 0xacfa               Gsm_Short_Term_Synthesis_Filt
  0.8 libgsm.so. 0xad06               Gsm_Short_Term_Synthesis_Filt
  0.7 libgsm.so. Gsm_RPE_Decoding     Gsm_Decoder
  0.7 kernel     thread_lock_flags_
  0.7 libc.so.7  memcpy
  0.6 kernel     __umtx_op_wake2_umut amd64_syscall
  0.6 kernel     __rw_rlock
  0.6 libgsm.so. 0xad43               Gsm_Short_Term_Synthesis_Filt
...
I can reproduce this trivially on real hardware. I use a dummy test app (threadspin.c, attached below). On my x220 laptop, this usually breaks in a few seconds. Same on a server (fox2 in the netperf lab). If I run it in a reduced cpuset to limit the number of CPUs, pmcstat runs longer before it stops working. Pinning the threads to specific CPUs does not fix the issue either.

I looked via kgdb at the software state of hwpmc when pmcstat stops reporting samples, and it seems to think that the PMC is enabled and should be working. However, the PMC is apparently no longer generating interrupts.

I've ported all of the PMC debug stuff over to KTR and added a hack so that KTR auto-disables once a polling sample request fails. When run with all CPUs there is simply too much noise in the ktrdump and I can't find the last sample in my dump (it seems that the dump only contains events logged after interrupts stopped working). I tried restricting to a smaller set of CPUs, but then it failed to break. My plan is to find the smallest set of CPUs that does break while KTR is active and then see if I can tease anything out of the ktrdump.

The changes to pmc to use KTR can be found at https://github.com/freebsd/freebsd/compare/master...bsdjhb:pr_198149
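The shape of the auto-disable hack is just this (a sketch, not the literal diff - the failure predicate here is hypothetical): KTR gates every CTRx() tracepoint on the global ktr_mask, so clearing it freezes the circular trace buffer and preserves the interesting tail.

/* kernel-context sketch: freeze KTR once sampling stops delivering */
if (sample_read_failed)         /* hypothetical failure predicate */
        ktr_mask = 0;           /* no further CTRx() events are logged */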
Created attachment 156213 [details]
threadspin.c
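For readers without attachment access, a minimal stand-in with the same shape - several threads spinning and crossing the syscall boundary - might look like the following. This is a guess at the test, not the actual attachment; the thread count and the choice of getpid() are assumptions.

/* threadspin-like load generator: N busy threads making system calls */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 8              /* assumed; adjust to the machine */

static void *
spin(void *arg)
{
        (void)arg;
        for (;;)
                (void)getpid(); /* burn cycles and enter the kernel */
}

int
main(void)
{
        pthread_t td[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                if (pthread_create(&td[i], NULL, spin, NULL) != 0)
                        exit(1);
        pause();                /* park main until the process is killed */
        return (0);
}

Build with "cc -o threadspin threadspin.c -lpthread" and point pmcstat -t at the resulting pid.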
So I think I somewhat understand what is wrong. I'm not yet sure how to fix it. What seems to happen is that on a context switch out, the read_pmc operation returns a very large value. The result is that the PMC gets set to a value large enough that it won't expire during the next slice. This error is compounded on every switch out/in, and the PMC stops firing as a result. Some snippets of KTR traces show the error in action:

238280 1 268934513552654 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=5dd31
238271 1 268934513429846 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=4fffffff855b6 iafctrl=0 pmc=fffffff855b6
238262 1 268934513342102 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=2f2c
238247 1 268934510388202 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=2fffffffc9345 iafctrl=0 pmc=fffffffc9345
238238 1 268934510294742 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=4125
238229 1 268934510220562 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=1fffffffea910 iafctrl=0 pmc=fffffffea910
238220 1 268934510132922 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=fffffffffffe
238211 1 268934510048862 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffffd489 iafctrl=0 pmc=ffffffffd489
238202 1 268934509967030 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=b34d
238193 1 268934509880238 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffff109e iafctrl=0 pmc=ffffffff109e
238184 1 268934509789534 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=e848
238175 1 268934509749902 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffff942b iafctrl=0 pmc=ffffffff942b
238166 1 268934509673986 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=62dd
238157 1 268934508267090 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffff18a7 iafctrl=0 pmc=ffffffff18a7
238148 1 268934508103386 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=6d25

The error occurs at event 238220, where "-2" is converted to a large unsigned value. After this point, the PMC is programmed with progressively larger and larger values on each switch in and never fires again. By the end of the trace, when I killed my test program, it was quite far off:

116541 1 268945752955406 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffed2e3cfd33 iafctrl=0 pmc=ffed2e3cfd33
116448 1 268945715324794 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffef3dd2e030 iafctrl=0 pmc=ffef3dd2e030
116337 1 268945421271906 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=fff2cff21be4 iafctrl=0 pmc=fff2cff21be4
116321 1 268945168850926 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=fff2defc2fec iafctrl=0 pmc=fff2defc2fec
116276 1 268944964260070 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=fff3210cd42b iafctrl=0 pmc=fff3210cd42b
116241 1 268944442945530 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=fff353fadde0 iafctrl=0 pmc=fff353fadde0
116207 1 268944442823210 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=fff3fa4fc2e3 iafctrl=0 pmc=fff3fa4fc2e3
...

I'm not really sure where the error is. I think it might be that iap_perfctr_value_to_reload_count needs to sign-extend its return value so it can return -2 as the value of the PMC in this case instead of what it returned. Note that this seems specific to hwpmc_core.c; hwpmc_amd.c uses a different approach. It sign-extends the value it reads from the PMC first and then negates it (which would have returned -2 in this case).
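To illustrate the sign-extension idea in isolation, here is a userland sketch of just the arithmetic (not the hwpmc code itself; the 48-bit counter width is an assumption): sign-extending around bit 47 before negating makes a wrapped counter convert to a small negative reload count instead of a huge positive one.

#include <stdint.h>
#include <stdio.h>

#define PMC_WIDTH 48            /* assumed counter width */

/* hwpmc_amd.c-style conversion: sign-extend the raw counter value,
 * then negate it to get the events remaining until overflow. */
static int64_t
reload_count(uint64_t raw)
{
        int64_t v;

        /* shift up, then arithmetic-shift back down to sign-extend */
        v = (int64_t)(raw << (64 - PMC_WIDTH)) >> (64 - PMC_WIDTH);
        return (-v);
}

int
main(void)
{
        /* counter just short of overflow: 2094 events remaining */
        printf("%jd\n", (intmax_t)reload_count(0xfffffffff7d2ULL));
        /* counter already wrapped by 2: -2, not 2^48 - 2 */
        printf("%jd\n", (intmax_t)reload_count(0x2ULL));
        return (0);
}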
Created attachment 156477 [details]
first_hack.patch

I tried this change, which causes iaf_read_pmc() to return -2 in this case instead of a large number. I added some additional debugging so I could verify this was being triggered in my testing. It did seem to make things last longer, but in at least one of my test runs I still saw pmcstat eventually stop getting samples.
Finally snagged the hang with the first hack in place. By the end of the trace all the PMCs have super high thresholds again, but here is where it went wrong:

211497 1 305651558868602 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffdaa2a0 iafctrl=0 pmc=ffffffdaa2a0
211490 1 305651558791430 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=25ec32
211481 1 305651558772962 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffd9db37 iafctrl=0 pmc=ffffffd9db37
211474 1 305651558684430 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=ffffffffffd8db37
211465 1 305651556116174 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=1000000000000 iafctrl=0 pmc=0
211458 1 305651555926610 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=c220
211449 1 305651555915314 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffff1ef0 iafctrl=0 pmc=ffffffff1ef0
211442 1 305651555647630 MDP:REA:1: iaf-read cpu=1 ri=2 msr=0x40000002 -> v=97d8
211433 1 305651555563074 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=ffffffff438c iafctrl=0 pmc=ffffffff438c

In particular, after writing 0 to the PMC in 211465, things went downhill. I'm still looking through the debug logs, but one thing is clear: we definitely can't put a value of 0 into a PMC. We should possibly force it to -1 instead, so that the interrupt triggers on the next event?
Created attachment 156483 [details]
second_hack.patch

This prevents the PMC value from being saved as zero. When it hits zero during a switch out, this will now wrap back up to ask for a full batch of events before the next interrupt.
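The guts of the idea, sketched rather than quoted from the patch (variable names are approximate, not the literal hwpmc_mod.c code):

/* at switch out, after folding the delta into the saved reload count:
 * the old code only wrapped negative values, but zero must wrap too,
 * because a PMC programmed with 0 has a full 2^48 events to count
 * before it fires again. */
if ((int64_t)newvalue <= 0)             /* was: < 0 */
        newvalue += reloadcount;        /* rearm for a full sampling period */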
This survived an overnight run, with pmcstat still getting samples this morning. I added debugging to print a message each time one of these fixes was applied (and in the case of the first patch, I output the raw value of the PMC). Both of these conditions fired fairly consistently during the test (once every few seconds or so). In addition, when I had run with just the first patch, I had seen raw PMC counter values in my debug messages that could be a bit large, for example:

CPU 1: counter overflowed: 87516
CPU 1: counter overflowed: 22
CPU 12: counter overflowed: 2
CPU 1: counter overflowed: 2
CPU 1: counter overflowed: 13629
CPU 1: counter overflowed: 2
CPU 1: counter overflowed: 2

With both patches applied I do not see "large" values, only small ones:

CPU 5: fixing zero PMC
CPU 1: counter overflowed: 20
CPU 1: fixing zero PMC
CPU 1: counter overflowed: 2
CPU 1: fixing zero PMC
CPU 1: fixing zero PMC
CPU 15: fixing zero PMC
CPU 1: fixing zero PMC
CPU 1: counter overflowed: 4
CPU 1: counter overflowed: 2
CPU 1: counter overflowed: 2
CPU 1: fixing zero PMC
CPU 1: counter overflowed: 2
CPU 1: fixing zero PMC
CPU 1: counter overflowed: 2
CPU 8: fixing zero PMC
CPU 1: counter overflowed: 2
CPU 1: counter overflowed: 2
CPU 1: counter overflowed: 2
CPU 2: counter overflowed: 2
CPU 9: counter overflowed: 2
CPU 3: fixing zero PMC
CPU 8: fixing zero PMC
CPU 1: counter overflowed: 22
CPU 1: fixing zero PMC
CPU 1: fixing zero PMC
CPU 1: counter overflowed: 2
CPU 1: counter overflowed: 27
CPU 1: counter overflowed: 5
CPU 1: fixing zero PMC
.....

I had wondered if the second bug (writing a PMC value of zero) could have been the source of the first bug (you can see how it would easily trigger it: if you write a PMC value of zero and the event happens 2 times before your next context switch, you would have a raw value of "2" when you switched out). However, while it seems to have fixed some of the overflows (the "large" ones), it does not seem to have fixed all of them.

I definitely think the second fix is legit (and the bug has been present since sampling was added to PMC). I think the first change is also technically correct, but I'm not sure why we are seeing those values.
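To make the interplay concrete, here is a tiny userland walkthrough of that chain (again assuming the 48-bit width; the subtraction below models the unfixed conversion, not the real hwpmc code):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t pmc, reload;

        pmc = 0;        /* bug 2: zero is saved and written back to the PMC */
        pmc += 2;       /* two events land before the next switch out */

        /* bug 1: converting without sign extension reads this as
         * "2^48 - 2 events still to go" rather than "overflowed by 2" */
        reload = (1ULL << 48) - pmc;
        printf("0x%jx\n", (uintmax_t)reload);   /* 0xfffffffffffe */
        return (0);
}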
So I don't have a good explanation of why the counters overflow without generating an interrupt. It does seem like the interrupt is lost, or perhaps there is a race where the counter can be disabled right around the time it should fire, but still counts a few events after it has been disabled. In particular, I captured this sequence of events (filtered to only show CPU 1) where I got an overflow with a value of 2:

72890 1 563885152232313 MDP:REA:2: iaf-read counter overflowed: 2
 2264 1 563885043095469 MDP:STO:1: iafctrl=0(0) globalctrl=0(0)
 2263 1 563885043094005 MDP:STO:1: iaf-stop iafctrl=0
 2262 1 563885043093777 MDP:STO:1: iaf-stop cpu=1 ri=2
 2256 1 563885043091809 CSW:SWO:1: cpu=1 proc=0xfffff8005698c538 (5697, threadspin) pp=0xfffff80052fea600
 2188 1 563885043029837 MDP:INT:1: cpu=1 fixedctrl=b00 globalctrl=400000000 status=0 ovf=0
 2187 1 563885043028445 MDP:INT:1: cpu=1 intrenable=400000000
 2186 1 563885043028165 MDP:INT:1: iaf-intr cpu=1 error=0 v=ffffffff0000(ffffffff0000)
 2185 1 563885043026045 MDP:INT:1: cpu=1 intrstatus=400000000
 2184 1 563885043025621 MDP:INT:1: cpu=1 tf=0x0xfffffe083a551f30 um=1
 2113 1 563885042960229 MDP:INT:1: cpu=1 fixedctrl=b00 globalctrl=400000000 status=0 ovf=0
 2112 1 563885042958865 MDP:INT:1: cpu=1 intrenable=400000000
 2111 1 563885042958665 MDP:INT:1: iaf-intr cpu=1 error=0 v=ffffffff0000(ffffffff0000)
 2110 1 563885042957205 MDP:INT:1: cpu=1 intrstatus=400000000
 2109 1 563885042956765 MDP:INT:1: cpu=1 tf=0x0xfffffe083a551f30 um=0
 2108 1 563885042955049 MDP:SWI:1: cr4=0x406e0
 2107 1 563885042954861 MDP:SWI:1: pc=0xfffff8014d0e5300 pp=0xfffff80052fea600 enable-msr=0
 2106 1 563885042954569 MDP:STA:1: iafctrl=b00(b00) globalctrl=400000000(400000000)
 2103 1 563885042952493 MDP:STA:1: iaf-start cpu=1 ri=2
 2102 1 563885042952109 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002 v=fffffffff7d2 iafctrl=0 pmc=fffffffff7d2
 2094 1 563885042949821 CSW:SWI:1: cpu=1 ri=23 new=2094
 2087 1 563885042947433 CSW:SWI:1: cpu=1 proc=0xfffff8005698c538 (5697, threadspin) pp=0xfffff80052fea600

Here you can see that the interrupts are occurring about every 75 KTR events, so the next one would have fired at around 2250, which is when the switch out begins. Note that there is a long gap between the call to iaf_stop() from pmc_process_csw_out() and the later call to iaf_read(), not just in KTR events, but also in terms of clock cycles (109136844 cycles on a 3.3 GHz CPU, so ~33 milliseconds assuming my math is correct). All of the intervening KTR events are PMC interrupts on other CPUs during that time. I'm actually not sure what CPU 1 could have been spinning on for that duration aside from the KTR index itself. I don't see anything that should be blocking the PMC interrupt (an NMI) either, so I wonder if it's possible for the PMC to increment a few times after it has been disabled. I do find that if I change my command to only monitor user events, this goes away, so presumably the counter was still "running" and counting clock cycles in the PMC code itself when it overflowed. Given that, I'm inclined to apply the first hack patch as-is (though also fix the similar function for IAP).
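Putting both observations together, the workaround direction - which is what the commit below implements - is to clamp: if the raw value shows the counter has already wrapped without an interrupt, report a remaining reload count of zero so the counter gets rearmed with a full period at the next switch in. A sketch with a hypothetical helper name, not the literal hwpmc_core.c change:

#include <stdint.h>

#define PMC_WIDTH 48            /* assumed counter width */

static uint64_t
value_to_reload_count(uint64_t raw)
{
        int64_t v;

        v = (int64_t)(raw << (64 - PMC_WIDTH)) >> (64 - PMC_WIDTH);
        if (v >= 0)             /* already wrapped: overflowed by v events */
                return (0);     /* rearm with a full reload interval */
        return ((uint64_t)-v);  /* events remaining until overflow */
}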
A commit references this bug:

Author: jhb
Date: Tue May 19 19:15:20 UTC 2015
New revision: 283123
URL: https://svnweb.freebsd.org/changeset/base/283123

Log:
  Fix two bugs that could result in PMC sampling effectively stopping.
  In both cases, the effect of the bug was that a very small positive
  number was written to the counter. This means that a large number of
  events needed to occur before the next sampling interrupt would
  trigger. Even with very frequently occurring events like clock cycles,
  wrapping all the way around could take a long time. Both bugs occurred
  when updating the saved reload count for an outgoing thread on a
  context switch.

  First, the counter-independent code compares the current reload count
  against the count set when the thread switched in and generates a
  delta to apply to the saved count. If this delta causes the reload
  counter to go negative, it would add a full reload interval to wrap it
  around to a positive value. The fix is to add the full reload interval
  if the resulting counter is zero.

  Second, occasionally the raw counter value read during a context
  switch has actually wrapped, but an interrupt has not yet triggered.
  In this case the existing logic would return a very large reload count
  (e.g. 2^48 - 2 if the counter had overflowed by a count of 2). This
  was seen both for fixed-function and programmable counters on an
  E5-2643. Work around this case by returning a reload count of zero.

  PR:           198149
  Differential Revision: https://reviews.freebsd.org/D2557
  Reviewed by:  emaste
  MFC after:    1 week
  Sponsored by: Norse Corp, Inc.

Changes:
  head/sys/dev/hwpmc/hwpmc_core.c
  head/sys/dev/hwpmc/hwpmc_mod.c
A commit references this bug:

Author: jhb
Date: Mon Jun 1 18:08:58 UTC 2015
New revision: 283886
URL: https://svnweb.freebsd.org/changeset/base/283886

Log:
  MFC 283123:

  Fix two bugs that could result in PMC sampling effectively stopping.
  In both cases, the effect of the bug was that a very small positive
  number was written to the counter. This means that a large number of
  events needed to occur before the next sampling interrupt would
  trigger. Even with very frequently occurring events like clock cycles,
  wrapping all the way around could take a long time. Both bugs occurred
  when updating the saved reload count for an outgoing thread on a
  context switch.

  First, the counter-independent code compares the current reload count
  against the count set when the thread switched in and generates a
  delta to apply to the saved count. If this delta causes the reload
  counter to go negative, it would add a full reload interval to wrap it
  around to a positive value. The fix is to add the full reload interval
  if the resulting counter is zero.

  Second, occasionally the raw counter value read during a context
  switch has actually wrapped, but an interrupt has not yet triggered.
  In this case the existing logic would return a very large reload count
  (e.g. 2^48 - 2 if the counter had overflowed by a count of 2). This
  was seen both for fixed-function and programmable counters on an
  E5-2643. Work around this case by returning a reload count of zero.

  PR:           198149
  Sponsored by: Norse Corp, Inc.

Changes:
  _U  stable/10/
  stable/10/sys/dev/hwpmc/hwpmc_core.c
  stable/10/sys/dev/hwpmc/hwpmc_mod.c
  _U  stable/9/sys/
  _U  stable/9/sys/dev/
  stable/9/sys/dev/hwpmc/hwpmc_core.c
  stable/9/sys/dev/hwpmc/hwpmc_mod.c