Created attachment 155167 [details]
diff to sys/amd64/amd64/support.S - pagezero function
The attached patch gives a consistant 1% decrease in system time when
doing something that forks a bunch of jobs, like buildworld or buildkernel,
when running with modestly high -j values (-j 24 or -j 32).
We have been running this in production for a couple of months.
A commit references this bug:
Date: Sun Apr 5 05:07:25 UTC 2015
New revision: 281100
head/sys/amd64/amd64/support.S: unroll loop
unroll the loop in ENTRY(pagezero)
acc' to the submitter this results in a reproducible 1% perf
improvement under buildworld like workload
I validated correctness and run-testing, but not performance impact
Submitted by: email@example.com
Reviewed by: adrian
MFC After: 1 month
reverted in r281103 per adrian
Here are some runs for various configurations on my machine. In all cases,
the machine was updated to r281146 of the SVN tree, either with or without
the patch in place on the running kernel.
The test builds I did were running 'make -j 24 buildworld',
'make -j 24 buildkernel' and 'make installkernel'. So, I
have three different numbers for each test. I ran each test three
I have 4 different configurations, covering the permutations of
cold/warm disk caches, and 32/64 byte loop unrolling. The 32 byte
loop unrolling is the "stock" code, the 64 byte loop unrolling is
the patched code.
The "cold" disk caches are when the /usr/obj is deleted, the machine
is rebooted, and the 'make -j 24 buildworld" is started.
The "warm" disk caches are when the /usr/obj is deleted after a
'make -j 24 buildworld', the machine is NOT rebooted, and another
'make -j 24 buildworld' is then performed.
2412.39 real 23203.80 user 2637.86 sys <-- buildworld - run 0
344.52 real 2777.46 user 760.44 sys <-- buildkernel - run 0
15.74 real 46.43 user 52.23 sys <-- installkernel - run 0
2409.76 real 23192.80 user 2643.00 sys <-- buildworld - run 1
343.96 real 2777.47 user 758.21 sys <-- buildkernel - run 1
14.90 real 45.76 user 53.20 sys <-- installkernel - run 1
2414.65 real 23201.35 user 2643.17 sys
344.24 real 2777.27 user 756.57 sys
15.65 real 45.74 user 53.33 sys
2412.51 real 23211.03 user 2660.45 sys
345.04 real 2774.33 user 758.16 sys
15.09 real 46.12 user 52.69 sys
2411.58 real 23214.80 user 2638.70 sys
344.32 real 2773.58 user 754.47 sys
15.40 real 46.17 user 52.98 sys
2411.78 real 23197.62 user 2651.97 sys
345.65 real 2772.83 user 762.16 sys
14.77 real 46.51 user 52.88 sys
2391.97 real 23195.38 user 2667.92 sys
341.43 real 2777.19 user 766.67 sys
12.11 real 45.55 user 53.88 sys
2389.78 real 23188.92 user 2672.51 sys
341.27 real 2783.25 user 763.09 sys
12.15 real 45.68 user 53.96 sys
2398.17 real 23196.31 user 2674.64 sys
341.52 real 2779.43 user 766.27 sys
12.16 real 45.34 user 53.95 sys
2349.55 real 22999.66 user 2550.08 sys
341.16 real 2776.62 user 766.24 sys
12.05 real 45.24 user 54.03 sys
2348.86 real 23004.25 user 2561.76 sys
340.82 real 2778.00 user 765.37 sys
12.08 real 46.03 user 53.18 sys
2393.53 real 23200.97 user 2673.28 sys
340.74 real 2778.71 user 763.86 sys
12.09 real 46.20 user 53.29 sys
When the disk caches are cold, there is almost no difference.
When the disk caches are warm, the 64byte version of the
zeropage assembly turns in the best time (well, in 2 out of 3 runs).
So, best case, you have 2668 seconds of system time with the
stock code and warm caches, vs 2550 seconds of system time with
the patched code and warm caches.
(2668 - 2550) = 118 second difference
2550/2668 * 100 = .956, or about 4 percent reduction in system time.
NOTE: All I ever claimed was a 1% decrease in system time, not an
overall 1% decrease in time, as was construed in comment 1.
The real time differences between the best 32byte warm disk cache
and the 64byte warm disk cache is 2349/2390 = .983, or a little
more than 1% decrease in time.
Personally, I find that adding 4 instructions to the kernel to get
a > 1% speedup is worth it. YMMV.
I would take any user vs system times with a large grain of salt. The total runtime of threads on x86 is fairly precise as it is based on TSC samples at each context switch. However, the user vs system breakdown is very imprecise. Each statclock tick bumps either a user or system counter. Then calcru() uses these counters to subdivide the total thread runtime. However, this can be very imprecise (e.g. if the thread mostly runs in short bursts in between statclock ticks then you can end up with a very skewed distribution). For that reason, I only ever compare real time and the sum of user + system.
At some point I might add precise timing where we sample the TSC on each state transition (e.g. kernel -> user and vice versa) and keep separate broken out counters. It would need to be an optional thing under some sort of #ifdef since it would be very expensive on smaller embedded boards without a cheap CPU ticker. OTOH, it probably could be enabled by default on amd64 at least.
I did write various SSE/AVX versions of memcpy/memset/strlen last summer. My microbenchmarks did seem to show a noticable difference when I unrolled the SSE to do 4 moves at a time. While there was certainly a bit of noise, the minimum runtimes for my tests seemed to be fairly reliable, and those changed fairly deterministically going from 1 to 2 to 4 moves at a time.
There is a commit referencing this PR, but it's still not closed and has been inactive for some time. Closing the PR as fixed but feel free to re-open it if the issue hasn't been completely resolved.