Here's a fun one:

cpuset -l 2 pmcstat -w 1 -p CPU_CLK_UNHALTED_CORE -p PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW -p PARTIAL_RAT_STALLS.MUL_SINGLE_UOP ./himenobmtxpa M 384
mimax = 128 mjmax = 128 mkmax = 256
imax = 127 jmax = 127 kmax =255
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 1487.222542 time(s): 0.271122 1.733593e-03
Now, start the actual measurement process.
The loop will be excuted in 663 times
This will take about one minute.
Wait for a while
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
2391336164 2808870 468916620
2391116254 2841466 580442618
2391286584 2855097 580087732
2391228957 2900509 578770832
2391232010 2840303 281471262184436
2391175692 2841502 580400260
2391261851 2844140 580443655
2391267208 2845348 18446462599313378116
2391258793 2844923 580426258
2391283394 2844559 580358142
2391251898 2844617 580418628
2391287371 2844894 281471262213505
2391220843 2844253 580344542
2391210632 2844542 580359758
2391286696 2844145 18446462599313346923
2391222766 2843902 580428539
2391301558 2845050 580449769
2391220296 2845291 580429033
2391229903 2843954 281471262221908
2391151757 2844754 580463504
2391264515 2844853 580580231
2390984664 2846816 580756378
2391363113 2848440 18446462599314405744
2391327705 2844575 580393026
2391327491 2845494 580487580
2391338337 2844307 580304875
2391360320 2881127 281471261156617
2391356197 2880822 579204685
2391272958 2845046 580382413
2391316177 2843457 18446462599313098960
2391336593 2845009 580494044
2391314220 2844990 580520104
2391321150 2844087 580434830
2391344912 2843830 281471262041250
2391335165 2844355 580422841
^Croot@bruce:/home/adrian #

Notice how some of these event counters are huge? They shouldn't be; they're cycle counters.
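A quick way to look at those outliers is to split each one into bit fields. The sketch below is only an illustration: the sample values are copied from the PARTIAL_RAT_STALLS.MUL_SINGLE_UOP column above, while the helper names (`split_sample`, `describe`) and the choice of 16/16/32-bit fields are assumptions made for this example, not anything taken from pmcstat or hwpmc.

```python
# Outlier samples copied from the PARTIAL_RAT_STALLS.MUL_SINGLE_UOP column above.
OUTLIERS = [
    281471262184436,
    18446462599313378116,
    281471262213505,
    18446462599313346923,
    281471262221908,
    18446462599314405744,
    281471261156617,
    18446462599313098960,
    281471262041250,
]

LOW32_MASK = 0xFFFFFFFF


def split_sample(value):
    """Split a 64-bit sample into (bits 48..63, bits 32..47, low 32 bits).

    The field boundaries are an assumption chosen for illustration.
    """
    high16 = (value >> 48) & 0xFFFF
    mid16 = (value >> 32) & 0xFFFF
    low32 = value & LOW32_MASK
    return high16, mid16, low32


def describe(samples):
    """Return one formatted line per sample showing its bit fields."""
    lines = []
    for value in samples:
        high16, mid16, low32 = split_sample(value)
        lines.append(
            f"{value:>20d}  0x{value:016x}  "
            f"high16=0x{high16:04x}  mid16=0x{mid16:04x}  low32={low32}"
        )
    return lines


if __name__ == "__main__":
    for line in describe(OUTLIERS):
        print(line)
```

Read this way, each outlier has 0xffff in either bits 32..47 or bits 48..63, and its low 32 bits land between roughly 579M and 582M, close to the neighbouring samples of about 580M. That pattern looks like some kind of sign-extension or counter-width problem, but the arithmetic only shows the pattern; it does not identify the cause.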
It seems like PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW and PARTIAL_RAT_STALLS.MUL_SINGLE_UOP are related to Sandy Bridge CPUs. Since I don't have real hardware, I tried it in a VM:

root@bsd:/home/tsgan/himenobmtxpa/src # cpuset -l 0 pmcstat -w 1 -p CPU_CLK_UNHALTED_CORE -p PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW -p PARTIAL_RAT_STALLS.MUL_SINGLE_UOP ./himenobmtxpa M 384
For example:
Grid-size= XS (32x32x64)
S (64x64x128)
M (128x128x256)
L (256x256x512)
XL (512x512x1024)
Grid-size =
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
3026657 17021 0
0 0 0
0 0 0
0 0 0
...

Not sure whether my setup is correct or not, but I don't see such big numbers. How many CPUs do you have, and what type are they? What else is running on your system? How can I test it on a Xeon system?
Hi,

It's not running the benchmark - it's waiting for you to select the kind of benchmark to run. That's why you're seeing zero values.
Ok, my case is probably different, so I see different numbers, but not huge:

cpuset -l 0 pmcstat -w 1 -p CPU_CLK_UNHALTED_CORE -p PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW -p PARTIAL_RAT_STALLS.MUL_SINGLE_UOP ./himenobmtxpa XL
mimax = 512 mjmax = 512 mkmax = 1024
imax = 511 jmax = 511 kmax =1023
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
870033399 3725668 0
245841351 908762 0
149768752 495087 0
92044851 314835 0
172232985 555547 0
130470473 415529 0
97294500 324799 0
224264591 730483 0
323564237 1001842 0
274181332 793169 0
203753726 565633 0
65682400 185914 0
305112357 883635 0
58018371 175088 0
57542509 165927 0
56139142 164077 0
114569102 328945 0
49034541 139120 0
87673805 246503 0
78197785 216160 0
125396191 347830 0
60803608 169127 0
58624781 158633 0
158752072 431521 0
126637934 347436 0
75105123 200448 0
75871442 165376 0
121684025 326831 0
63415999 164887 0
65051834 165931 0
172723309 437796 0
55178114 134277 0
129564415 321174 0
67139171 161328 0
129583559 302727 0
65823717 159945 0
100021756 239138 0
248424650 604507 0
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
2157395 5581 0
266630444 648471 0
265551569 601376 0
350420437 805538 0
274975297 568795 0
204726842 445208 0
145150448 312924 0
263387441 546046 0
174172408 317262 0
170038378 331495 0
190559076 369572 0
0 0 0
0 0 0
65105706 137757 0
0 0 0
83246288 172311 0
0 0 0
83317266 171992 0
64043046 131977 0
73166853 148560 0
0 0 0
82940705 167883 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
359667865 717901 0
570575714 1085051 0
336720952 634324 0
357310518 663736 0
176594285 311291 0
113486543 211149 0
229862414 416510 0
185818110 320432 0
253051214 440886 0
167883757 285382 0
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
94516905 176444 0
0 0 0
98904136 171988 0
0 0 0
91853341 170419 0
0 0 0
72928488 127483 0
0 0 0
93832604 171930 0
95043153 171647 0
0 0 0
100036516 175539 0
81642449 146707 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
122492641 215274 0
432573271 723654 0
672475908 1137167 0
164955154 275113 0
75885192 127743 0
362878684 597953 0
106611797 165063 0
96271731 159800 0
160547540 238537 0
104997341 166820 0
99251489 158100 0
677889774 1069483 0
109420715 171744 0
92841357 147517 0
17494074 26594 0
0 0 0
109178407 173350 0
0 0 0
89909372 148525 0
14054265 22545 0
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
109666162 151431 0
19816973 29435 0
0 0 0
123641867 186868 0
74710420 117138 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
241415658 318707 0
413566724 628075 0
340955679 519405 0
289339024 438947 0
141001617 218872 0
141498097 213880 0
272845051 424913 0
16543212 23563 0
207416368 316845 0
118682821 166716 0
321315802 474006 0
106027959 159611 0
347174670 484656 0
0 0 0
154530996 170777 0
119466402 174679 0
266077428 326838 0
464666247 629234 0
508973836 728643 0
226376736 289833 0
240968127 325854 0
161590314 181870 0
152003550 172463 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
144675563 195757 0
0 0 0
35247263 46692 0
root@bsd:/home/tsgan/himenobmtxpa/src #
Some more output:

root@bsd:/home/tsgan/himenobmtxpa/src # cpuset -l 1 pmcstat -w 1 -p CPU_CLK_UNHALTED_CORE -p PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW -p PARTIAL_RAT_STALLS.MUL_SINGLE_UOP ./himenobmtxpa M
mimax = 128 mjmax = 128 mkmax = 256
imax = 127 jmax = 127 kmax =255
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
496973198 1891482 0
0 0 0
0 0 0
6211441 6475 0
67851413 228420 0
167437929 578476 0
58202990 263315 0
159091555 615841 0
199474522 914360 0
435143068 2102652 0
Start rehearsal measurement process.
Measure the performance in 3 times.
1046313910 3638978 9088916
661072248 1262611 12287166
1181939944 2818232 22741743
571538306 1247954 8365446
1123442862 3240698 18681005
1356224563 3224598 22727923
1147070767 3143202 18853622
1291379900 3264098 22911348
640597242 1432676 10454925
469944137 1173440 8993387
388538712 875062 6810780
MFLOPS: 38.013623 time(s): 10.607217 1.733593e-03
Now, start the actual measurement process.
The loop will be excuted in 16 times
This will take about one minute.
Wait for a while
165134408 621783 851017
239418647 639771 3972871
1002366660 2269801 16204933
1125772300 2705173 20909153
1073364783 2946491 16979374
1199833427 2934209 20613526
1270493498 3131262 22138008
1108155249 3002679 17521465
1314310219 3156869 22821855
1140823030 2842658 20023985
1126332577 2883871 18672363
1162000580 2832338 21202038
1109106518 3066909 16946660
1296149672 3304390 21986759
1179265638 2807189 21081432
1068573361 2922111 16053932
1198262630 2887323 21327538
# p/CPU_CLK_UNHALTED_CORE p/PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW p/PARTIAL_RAT_STALLS.MUL_SINGLE_UOP
1175241231 2829240 20693991
1096268514 3018859 16739105
1185463271 2963422 21308158
1178471187 2863822 20017627
1148132142 3233218 17586444
1144689029 2976330 20739875
1094868768 2882188 17116037
1269267205 3162007 21789166
1145196524 2810384 19985030
1210390022 3289607 16888847
1201586511 2994667 21764537
1146922422 2767725 19709885
1170310109 3175550 16395202
1289971174 3180738 22527756
1057194734 2602396 19377512
1157004463 3193955 17467630
1235730044 3065464 21453972
1081579818 2756388 17322508
1196944986 3144452 20926928
1095767389 2631318 18833998
1056892461 2789426 16491839
1262274645 3215379 21593331
1193887452 3066186 21195414
1077392053 3120552 17142660
1199816969 2999705 21635489
1064253746 2552890 18452838
1103708407 2965405 16170094
1244782428 2882083 21250280
1199552244 2798772 19013893
Loop executed for 16 times
Gosa : 1.604793e-03
MFLOPS measured : 45.999056 cpu : 46.750959
Score based on Pentium III 600MHz using Fortran 77: 0.560964
558500599 2034859 3314459
root@bsd:/home/tsgan/himenobmtxpa/src #
Is anyone working on this? I'm seeing similar symptoms with multithreaded code.