Bug 199151 - speed up pagezero assembly on Xeon
Summary: speed up pagezero assembly on Xeon
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Many People
Assignee: FreeBSD bugs mailing list
Keywords: patch
Depends on:
Reported: 2015-04-03 19:55 UTC by Kurt Lidl
Modified: 2016-03-21 22:55 UTC (History)
4 users (show)

See Also:

diff to sys/amd64/amd64/support.S - pagezero function (872 bytes, patch)
2015-04-03 19:55 UTC, Kurt Lidl
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Kurt Lidl freebsd_committer 2015-04-03 19:55:38 UTC
Created attachment 155167 [details]
diff to sys/amd64/amd64/support.S - pagezero function

The attached patch gives a consistant 1% decrease in system time when
doing something that forks a bunch of jobs, like buildworld or buildkernel,
when running with modestly high -j values (-j 24 or -j 32).

We have been running this in production for a couple of months.
Comment 1 commit-hook freebsd_committer 2015-04-05 05:07:57 UTC
A commit references this bug:

Author: eadler
Date: Sun Apr  5 05:07:25 UTC 2015
New revision: 281100
URL: https://svnweb.freebsd.org/changeset/base/281100

  head/sys/amd64/amd64/support.S: unroll loop
  	unroll the loop in ENTRY(pagezero)
  	acc' to the submitter this results in a reproducible 1% perf
  	improvement under buildworld like workload

  	I validated correctness and run-testing, but not performance impact

  Submitted by:	lidl@pix.net
  Reviewed by:	adrian
  PR:		199151
  MFC After:	1 month

Comment 2 Eitan Adler freebsd_committer freebsd_triage 2015-04-05 05:21:42 UTC
reverted in r281103 per adrian
Comment 3 Kurt Lidl freebsd_committer 2015-05-26 23:09:23 UTC
Here are some runs for various configurations on my machine.  In all cases,
the machine was updated to r281146 of the SVN tree, either with or without
the patch in place on the running kernel.

The test builds I did were running 'make -j 24 buildworld',
'make -j 24 buildkernel' and 'make installkernel'.  So, I
have three different numbers for each test.  I ran each test three

I have 4 different configurations, covering the permutations of
cold/warm disk caches, and 32/64 byte loop unrolling.  The 32 byte
loop unrolling is the "stock" code, the 64 byte loop unrolling is
the patched code.

The "cold" disk caches are when the /usr/obj is deleted, the machine
is rebooted, and the 'make -j 24 buildworld" is started.
The "warm" disk caches are when the /usr/obj is deleted after a
'make -j 24 buildworld', the machine is NOT rebooted, and another
'make -j 24 buildworld' is then performed.

Results follow:

     2412.39 real     23203.80 user      2637.86 sys   <-- buildworld    - run 0
      344.52 real      2777.46 user       760.44 sys   <-- buildkernel   - run 0
       15.74 real        46.43 user        52.23 sys   <-- installkernel - run 0
     2409.76 real     23192.80 user      2643.00 sys   <-- buildworld    - run 1
      343.96 real      2777.47 user       758.21 sys   <-- buildkernel   - run 1
       14.90 real        45.76 user        53.20 sys   <-- installkernel - run 1
     2414.65 real     23201.35 user      2643.17 sys
      344.24 real      2777.27 user       756.57 sys
       15.65 real        45.74 user        53.33 sys
     2412.51 real     23211.03 user      2660.45 sys
      345.04 real      2774.33 user       758.16 sys
       15.09 real        46.12 user        52.69 sys
     2411.58 real     23214.80 user      2638.70 sys
      344.32 real      2773.58 user       754.47 sys
       15.40 real        46.17 user        52.98 sys
     2411.78 real     23197.62 user      2651.97 sys
      345.65 real      2772.83 user       762.16 sys
       14.77 real        46.51 user        52.88 sys
     2391.97 real     23195.38 user      2667.92 sys
      341.43 real      2777.19 user       766.67 sys
       12.11 real        45.55 user        53.88 sys
     2389.78 real     23188.92 user      2672.51 sys
      341.27 real      2783.25 user       763.09 sys
       12.15 real        45.68 user        53.96 sys
     2398.17 real     23196.31 user      2674.64 sys
      341.52 real      2779.43 user       766.27 sys
       12.16 real        45.34 user        53.95 sys
     2349.55 real     22999.66 user      2550.08 sys
      341.16 real      2776.62 user       766.24 sys
       12.05 real        45.24 user        54.03 sys
     2348.86 real     23004.25 user      2561.76 sys
      340.82 real      2778.00 user       765.37 sys
       12.08 real        46.03 user        53.18 sys
     2393.53 real     23200.97 user      2673.28 sys
      340.74 real      2778.71 user       763.86 sys
       12.09 real        46.20 user        53.29 sys

When the disk caches are cold, there is almost no difference.
When the disk caches are warm, the 64byte version of the
zeropage assembly turns in the best time (well, in 2 out of 3 runs).

So, best case, you have 2668 seconds of system time with the
stock code and warm caches, vs 2550 seconds of system time with
the patched code and warm caches.

(2668 - 2550) = 118 second difference

2550/2668 * 100 = .956, or about 4 percent reduction in system time.

NOTE:  All I ever claimed was a 1% decrease in system time, not an
overall 1% decrease in time, as was construed in comment 1.

The real time differences between the best 32byte warm disk cache
and the 64byte warm disk cache is 2349/2390 = .983, or a little
more than 1% decrease in time.

Personally, I find that adding 4 instructions to the kernel to get
a > 1% speedup is worth it.  YMMV.
Comment 4 John Baldwin freebsd_committer freebsd_triage 2015-05-27 17:41:53 UTC
I would take any user vs system times with a large grain of salt.  The total runtime of threads on x86 is fairly precise as it is based on TSC samples at each context switch.  However, the user vs system breakdown is very imprecise.  Each statclock tick bumps either a user or system counter.  Then calcru() uses these counters to subdivide the total thread runtime.  However, this can be very imprecise (e.g. if the thread mostly runs in short bursts in between statclock ticks then you can end up with a very skewed distribution).  For that reason, I only ever compare real time and the sum of user + system.

At some point I might add precise timing where we sample the TSC on each state transition (e.g. kernel -> user and vice versa) and keep separate broken out counters.  It would need to be an optional thing under some sort of #ifdef since it would be very expensive on smaller embedded boards without a cheap CPU ticker.  OTOH, it probably could be enabled by default on amd64 at least.

I did write various SSE/AVX versions of memcpy/memset/strlen last summer.  My microbenchmarks did seem to show a noticable difference when I unrolled the SSE to do 4 moves at a time.  While there was certainly a bit of noise, the minimum runtimes for my tests seemed to be fairly reliable, and those changed fairly deterministically going from 1 to 2 to 4 moves at a time.