Created attachment 155167
diff to sys/amd64/amd64/support.S - pagezero function

The attached patch gives a consistent 1% decrease in system time when doing something that forks a bunch of jobs, like buildworld or buildkernel, when running with modestly high -j values (-j 24 or -j 32). We have been running this in production for a couple of months.
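For illustration, here is a rough userland C sketch of the idea behind the patch: zero a 4 KiB page with 8-byte non-temporal stores, doing either 32 bytes per loop iteration (the stock arrangement) or 64 bytes per iteration (the patched arrangement). This is a hypothetical sketch using SSE2 intrinsics, not the actual assembly in sys/amd64/amd64/support.S; see the attached diff for the real change.

/*
 * Hypothetical illustration only -- not the actual support.S code.
 * Zero a 4 KiB page with 8-byte non-temporal (movnti-style) stores.
 */
#include <emmintrin.h>		/* _mm_stream_si64 (SSE2), _mm_sfence */
#include <stddef.h>

#define PAGE_SIZE	4096

/* Stock arrangement: four stores, 32 bytes per iteration. */
void
pagezero_32(void *page)
{
	long long *p = page;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(*p); i += 4) {
		_mm_stream_si64(&p[i + 0], 0);
		_mm_stream_si64(&p[i + 1], 0);
		_mm_stream_si64(&p[i + 2], 0);
		_mm_stream_si64(&p[i + 3], 0);
	}
	_mm_sfence();		/* order the weakly-ordered stores */
}

/* Patched arrangement: eight stores, 64 bytes per iteration. */
void
pagezero_64(void *page)
{
	long long *p = page;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(*p); i += 8) {
		_mm_stream_si64(&p[i + 0], 0);
		_mm_stream_si64(&p[i + 1], 0);
		_mm_stream_si64(&p[i + 2], 0);
		_mm_stream_si64(&p[i + 3], 0);
		_mm_stream_si64(&p[i + 4], 0);
		_mm_stream_si64(&p[i + 5], 0);
		_mm_stream_si64(&p[i + 6], 0);
		_mm_stream_si64(&p[i + 7], 0);
	}
	_mm_sfence();
}

The unrolled variant halves the number of loop-control instructions (increment and branch) executed per page, which is presumably where the small system-time win comes from.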
A commit references this bug:

Author: eadler
Date: Sun Apr 5 05:07:25 UTC 2015
New revision: 281100
URL: https://svnweb.freebsd.org/changeset/base/281100

Log:
  head/sys/amd64/amd64/support.S: unroll loop

  Unroll the loop in ENTRY(pagezero). According to the submitter, this
  results in a reproducible 1% performance improvement under a
  buildworld-like workload. I validated correctness and did run-testing,
  but did not measure the performance impact myself.

  Submitted by: lidl@pix.net
  Reviewed by:  adrian
  PR:           199151
  MFC After:    1 month

Changes:
  head/sys/amd64/amd64/support.S
Reverted in r281103 per adrian.
Here are some runs for various configurations on my machine. In all cases, the machine was updated to r281146 of the SVN tree, either with or without the patch in place on the running kernel. The test builds were 'make -j 24 buildworld', 'make -j 24 buildkernel' and 'make installkernel', so I have three different numbers for each test, and I ran each test three times. There are 4 configurations, covering the permutations of cold/warm disk caches and 32/64 byte loop unrolling. The 32 byte loop unrolling is the "stock" code; the 64 byte loop unrolling is the patched code. The "cold" disk cache runs are when /usr/obj is deleted, the machine is rebooted, and then 'make -j 24 buildworld' is started. The "warm" disk cache runs are when /usr/obj is deleted after a 'make -j 24 buildworld', the machine is NOT rebooted, and another 'make -j 24 buildworld' is then performed.

Results follow:

build.log.20150406.32byte.cold:
 2412.39 real 23203.80 user 2637.86 sys   <-- buildworld - run 0
  344.52 real  2777.46 user  760.44 sys   <-- buildkernel - run 0
   15.74 real    46.43 user   52.23 sys   <-- installkernel - run 0
 2409.76 real 23192.80 user 2643.00 sys   <-- buildworld - run 1
  343.96 real  2777.47 user  758.21 sys   <-- buildkernel - run 1
   14.90 real    45.76 user   53.20 sys   <-- installkernel - run 1
 2414.65 real 23201.35 user 2643.17 sys
  344.24 real  2777.27 user  756.57 sys
   15.65 real    45.74 user   53.33 sys

build.log.20150406.64byte.cold:
 2412.51 real 23211.03 user 2660.45 sys
  345.04 real  2774.33 user  758.16 sys
   15.09 real    46.12 user   52.69 sys
 2411.58 real 23214.80 user 2638.70 sys
  344.32 real  2773.58 user  754.47 sys
   15.40 real    46.17 user   52.98 sys
 2411.78 real 23197.62 user 2651.97 sys
  345.65 real  2772.83 user  762.16 sys
   14.77 real    46.51 user   52.88 sys

build.log.20150406.32byte.warm:
 2391.97 real 23195.38 user 2667.92 sys
  341.43 real  2777.19 user  766.67 sys
   12.11 real    45.55 user   53.88 sys
 2389.78 real 23188.92 user 2672.51 sys
  341.27 real  2783.25 user  763.09 sys
   12.15 real    45.68 user   53.96 sys
 2398.17 real 23196.31 user 2674.64 sys
  341.52 real  2779.43 user  766.27 sys
   12.16 real    45.34 user   53.95 sys

build.log.20150406.64byte.warm:
 2349.55 real 22999.66 user 2550.08 sys
  341.16 real  2776.62 user  766.24 sys
   12.05 real    45.24 user   54.03 sys
 2348.86 real 23004.25 user 2561.76 sys
  340.82 real  2778.00 user  765.37 sys
   12.08 real    46.03 user   53.18 sys
 2393.53 real 23200.97 user 2673.28 sys
  340.74 real  2778.71 user  763.86 sys
   12.09 real    46.20 user   53.29 sys

When the disk caches are cold, there is almost no difference. When the disk caches are warm, the 64 byte version of the pagezero assembly turns in the best time (well, in 2 out of 3 runs). So, best case, you have 2668 seconds of system time with the stock code and warm caches, vs 2550 seconds of system time with the patched code and warm caches:

  (2668 - 2550) = 118 seconds of difference
  2550 / 2668 = 0.956, i.e. about a 4 percent reduction in system time.

NOTE: All I ever claimed was a 1% decrease in system time, not an overall 1% decrease in total time, as was construed in comment 1.

The real time difference between the best 32 byte warm-cache run and the best 64 byte warm-cache run is 2349 / 2390 = 0.983, or roughly a 1.7% decrease in real time.

Personally, I find that adding 4 instructions to the kernel to get a > 1% speedup is worth it. YMMV.
I would take any user vs. system times with a large grain of salt. The total runtime of threads on x86 is fairly precise, as it is based on TSC samples at each context switch. However, the user vs. system breakdown is very imprecise: each statclock tick bumps either a user or a system counter, and calcru() then uses these counters to subdivide the total thread runtime. This can be very inaccurate (e.g. if the thread mostly runs in short bursts in between statclock ticks, you can end up with a very skewed distribution). For that reason, I only ever compare real time and the sum of user + system.

At some point I might add precise timing where we sample the TSC on each state transition (e.g. kernel -> user and vice versa) and keep separate broken-out counters. It would need to be optional, under some sort of #ifdef, since it would be very expensive on smaller embedded boards without a cheap CPU ticker. OTOH, it probably could be enabled by default on amd64 at least.

I did write various SSE/AVX versions of memcpy/memset/strlen last summer. My microbenchmarks did seem to show a noticeable difference when I unrolled the SSE to do 4 moves at a time. While there was certainly a bit of noise, the minimum runtimes for my tests seemed to be fairly reliable, and those changed fairly deterministically going from 1 to 2 to 4 moves at a time.
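For what it's worth, here is a minimal sketch of that "minimum of many runs" microbenchmark style. It is entirely hypothetical: the buffer size, run count, and the memset() stand-in for the routine under test are made up, and this is not the actual benchmark code referred to above.

/* Hypothetical min-of-N timing sketch; the routine under test is a stand-in. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE	4096
#define RUNS		1000

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
static uint64_t
elapsed_ns(const struct timespec *a, const struct timespec *b)
{
	return ((uint64_t)(b->tv_sec - a->tv_sec) * 1000000000ULL +
	    (uint64_t)(b->tv_nsec - a->tv_nsec));
}

int
main(void)
{
	static char buf[BUF_SIZE];
	volatile char sink;
	struct timespec start, end;
	uint64_t ns, best = UINT64_MAX;
	int i;

	for (i = 0; i < RUNS; i++) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		memset(buf, 0, sizeof(buf));	/* routine under test */
		clock_gettime(CLOCK_MONOTONIC, &end);
		sink = buf[0];	/* keep the stores from being optimized away */
		ns = elapsed_ns(&start, &end);
		if (ns < best)
			best = ns;
	}
	printf("best of %d runs: %ju ns\n", RUNS, (uintmax_t)best);
	return (0);
}

Taking the minimum filters out scheduler and interrupt noise in a way that averaging does not, which matches the observation above that the minimum runtimes were the more reliable basis for comparison.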
There is a commit referencing this PR, but it's still not closed and has been inactive for some time. Closing the PR as fixed, but feel free to re-open it if the issue hasn't been completely resolved. Thanks.