Bug 216759 - [kern] Memory speed with small blocks (1K) up to 35 times slower than host system (under QEMU emulation, but not only)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-virtualization (Nobody)
URL: https://www.reddit.com/r/freebsd/comm...
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-03 19:49 UTC by andrew
Modified: 2020-07-12 16:39 UTC
CC List: 7 users

See Also:


Attachments
tsc.c patch to allow KVM hypervisor with host cpu to pass through the good TSC (1.08 KB, patch)
2017-03-22 11:28 UTC, andrew
andrew: maintainer-approval+

Description andrew 2017-02-03 19:49:53 UTC
FreeBSD 11.0-RELEASE and 10.3-RELEASE show much lower memory throughput under QEMU/KVM than on bare metal or VMware, according to the sysbench benchmark:

Bare Metal run:
# uname -a
FreeBSD backup 10.3-RELEASE-p11 FreeBSD 10.3-RELEASE-p11 #0: Mon Oct 24 18:49:24 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
# sysbench --num-threads=1 --test=memory --memory-total-size=1G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (2183178.34 ops/sec)

1024.00 MB transferred (2132.01 MB/sec)


Test execution summary:
    total time:                          0.4803s
    total number of events:              1048576
    total time taken by event execution: 0.3527
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  7.56ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           1048576.0000/0.00
    execution time (avg/stddev):   0.3527/0.00


QEMU KVM emulation:
# uname -a
FreeBSD dev 11.0-RELEASE-p2 FreeBSD 11.0-RELEASE-p2 #0: Mon Oct 24 06:55:27 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
# sysbench --num-threads=1 --test=memory --memory-total-size=1G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (69497.13 ops/sec)

1024.00 MB transferred (67.87 MB/sec)


Test execution summary:
    total time:                          15.0880s
    total number of events:              1048576
    total time taken by event execution: 11.1440
    per-request statistics:
         min:                                  0.01ms
         avg:                                  0.01ms
         max:                                  7.32ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           1048576.0000/0.00
    execution time (avg/stddev):   11.1440/0.00

For comparison, VMware:

# uname -a
FreeBSD ns3 10.2-RELEASE-p7 FreeBSD 10.2-RELEASE-p7 #0: Mon Nov  2 14:19:39 UTC 2015     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
# sysbench --num-threads=1 --test=memory --memory-total-size=1G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (2234641.77 ops/sec)

1024.00 MB transferred (2182.27 MB/sec)


Test execution summary:
    total time:                          0.4692s
    total number of events:              1048576
    total time taken by event execution: 0.3437
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.09ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           1048576.0000/0.00
    execution time (avg/stddev):   0.3437/0.00



This is not an 11-only problem: the VPSs I tested with 10.3 show the same behaviour.

I haven't found any information about this on the net, possibly because nobody benchmarks RAM.
Sysbench itself starts a thread and runs the allocation code. I couldn't trace the thread, though.
Maybe it is down to old code in sysbench.



Additional information and reports:
https://www.reddit.com/r/freebsd/comments/5rtf05/abysmal_memory_perfomance_witch_freebsd_under/
Comment 1 andrew 2017-02-04 14:12:26 UTC
Writes to a memory disk are perfectly fine.

KVM:

# mdconfig -a -t swap -s 128m -u 1
# dd if=/dev/urandom of=/root/test bs=1M count=128
# dd if=/root/test of=/dev/md1 bs=1M
128+0 records in
128+0 records out
134217728 bytes transferred in 0.089190 secs (1504858419 bytes/sec)
# dd if=/root/test of=/dev/md1 bs=1M
128+0 records in
128+0 records out
134217728 bytes transferred in 0.056503 secs (2375411572 bytes/sec)

So it seems some specific sysbench call ruins everything. Should we add [benchmarks/sysbench]?
Comment 2 andrew 2017-02-04 18:13:24 UTC
(In reply to andrew from comment #1)
# dd if=/root/test of=/dev/md1 bs=512
262144+0 records in
262144+0 records out
134217728 bytes transferred in 4.829553 secs (27790924 bytes/sec)

# dd if=/root/test of=/dev/md1 bs=1K
131072+0 records in
131072+0 records out
134217728 bytes transferred in 2.478277 secs (54157674 bytes/sec)

Apparently the bug is only visible with block sizes smaller than 1M.
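One way to read the dd figures above is to convert them into time per write. The sketch below does only that arithmetic; `usec_per_op` is a hypothetical helper name, and its inputs are meant to be the byte counts, block sizes and elapsed times that dd printed in comments #1 and #2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: average time per write, in microseconds, for a dd
 * run that moved total_bytes in block_size chunks over elapsed_secs seconds.
 */
double
usec_per_op(uint64_t total_bytes, uint64_t block_size, double elapsed_secs)
{
	assert(block_size > 0 && total_bytes >= block_size);

	uint64_t ops = total_bytes / block_size;	/* number of write calls */
	return (elapsed_secs / (double)ops * 1e6);
}
```

With the figures above, the 512-byte and 1K runs both come out at roughly 18 to 19 microseconds per write, while the 1M run moves each 1 MiB block in roughly 440 microseconds. A per-call cost that stays about the same when the block size halves would suggest a fixed overhead on every call rather than a memory bandwidth limit.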
Comment 3 andrew 2017-02-05 17:30:01 UTC
Without KVM extensions, QEMU is slightly faster: 94 MB/s instead of the previous 67 MB/s.
Comment 4 andrew 2017-02-20 07:19:47 UTC
Changed the Importance because this is not a random bug: it reproduces 100% of the time on any QEMU platform.
Comment 5 Bartek Rutkowski freebsd_committer 2017-02-22 09:05:27 UTC
Here are some results from tests on a 12.0-CURRENT VM under XenServer:

VM: FreeBSD poudriere 12.0-CURRENT FreeBSD 12.0-CURRENT #3 r314028: Tue Feb 21 08:07:02 CET 2017     root@pd.valinor.palantiri.org:/usr/obj/usr/src/sys/POUDRIERE  amd64

XenServer: Linux xenserver 3.10.0+10 #1 SMP Thu Sep 22 12:31:44 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux, Xen version: 4.6.1-xs133690

dd on an 8G memory disk:

# dd if=/dev/zero of=/dev/md1 bs=1M
dd: /dev/md1: end of device
8193+0 records in
8192+0 records out
8589934592 bytes transferred in 15.565112 secs (551871047 bytes/sec)

# dd if=/dev/zero of=/dev/md1 bs=1K
dd: /dev/md1: end of device
8388609+0 records in
8388608+0 records out
8589934592 bytes transferred in 232.354641 secs (36969068 bytes/sec)

Sysbench:

# sysbench --num-threads=1 --test=memory --memory-total-size=8G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 8192M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 8388608 (1169942.57 ops/sec)

8192.00 MB transferred (1142.52 MB/sec)


Test execution summary:
    total time:                          7.1701s
    total number of events:              8388608
    total time taken by event execution: 5.2679
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  3.95ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           8388608.0000/0.00
    execution time (avg/stddev):   5.2679/0.00



# sysbench --num-threads=1 --test=memory --memory-total-size=8G --memory-block-size=1M run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 8192M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 8192 (46849.56 ops/sec)

8192.00 MB transferred (46849.56 MB/sec)


Test execution summary:
    total time:                          0.1749s
    total number of events:              8192
    total time taken by event execution: 0.1727
    per-request statistics:
         min:                                  0.02ms
         avg:                                  0.02ms
         max:                                  0.08ms
         approx.  95 percentile:               0.03ms

Threads fairness:
    events (avg/stddev):           8192.0000/0.00
    execution time (avg/stddev):   0.1727/0.00
Comment 6 Roger Pau Monné freebsd_committer 2017-02-22 10:23:00 UTC
These are the results of the tests on bare metal using FreeBSD 12 (a build less than 1 month old):

# sysbench --num-threads=1 --test=memory --memory-total-size=4G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 4096M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 4194304 (1122695.18 ops/sec)

4096.00 MB transferred (1096.38 MB/sec)


Test execution summary:
    total time:                          3.7359s
    total number of events:              4194304
    total time taken by event execution: 2.6089
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.22ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           4194304.0000/0.00
    execution time (avg/stddev):   2.6089/0.00

# sysbench --num-threads=1 --test=memory --memory-total-size=4G --memory-block-size=1M run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 4096M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 4096 (10462.26 ops/sec)

4096.00 MB transferred (10462.26 MB/sec)


Test execution summary:
    total time:                          0.3915s
    total number of events:              4096
    total time taken by event execution: 0.3898
    per-request statistics:
         min:                                  0.06ms
         avg:                                  0.10ms
         max:                                  0.17ms
         approx.  95 percentile:               0.10ms

Threads fairness:
    events (avg/stddev):           4096.0000/0.00
    execution time (avg/stddev):   0.3898/0.00
Comment 7 andrew 2017-02-23 20:13:29 UTC
A VPS with the same configuration on QEMU, running CentOS 7:

# sysbench --num-threads=1 --test=memory --memory-total-size=1G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (1491466.64 ops/sec)

1024.00 MB transferred (1456.51 MB/sec)


Test execution summary:
    total time:                          0.7031s
    total number of events:              1048576
    total time taken by event execution: 0.5486
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                 18.49ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           1048576.0000/0.00
    execution time (avg/stddev):   0.5486/0.00


The node running it:

# sysbench --num-threads=1 --test=memory --memory-total-size=1G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (3862725.85 ops/sec)

1024.00 MB transferred (3772.19 MB/sec)


Test execution summary:
    total time:                          0.2715s
    total number of events:              1048576
    total time taken by event execution: 0.2213
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.04ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           1048576.0000/0.00
    execution time (avg/stddev):   0.2213/0.00
Comment 8 andrew 2017-02-23 20:17:19 UTC
In the last post the memory performance is fine: it is about twice as slow as the host only because the VPS is CPU-limited to 1.7 GHz while the host is not.
Comment 9 kvanbiesen 2017-02-23 20:28:06 UTC
Strange. I installed CentOS 7 on my server under QEMU and got the same lousy performance as I had with Proxmox and BSD.
Comment 10 andrew 2017-02-28 15:55:16 UTC
About Linux: it seems kvm-clock is broken somehow. I switched the clocksource to TSC and got the host's 3700 MB/s.


The same seems to hold for FreeBSD, except that BSD forces HPET or ACPI on any virtualised platform. So the only way is to force TSC-low manually, which brings us back to 3700 MB/s.

It seems the kernel timecounter code was written when invariant_tsc had only just appeared, and it knows nothing about CPU flags such as constant_tsc, nonstop_tsc, tsc_deadline_timer, etc.

The code is really simple. If the TSC is invariant or SMP, it's boosted, and that's it:

https://github.com/freebsd/freebsd/blob/21c11d113415f2c87107b6735407b147fae0b851/sys/x86/x86/tsc.c
Comment 11 andrew 2017-02-28 19:46:21 UTC
Here are the latest results supporting my previous post:

root@debian8-test:~# sysbench --num-threads=1 --test=memory --memory-total-size=512M --memory-block-size=1K --debug run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Debug mode enabled.


Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 512M

Memory operations type: write
Memory scope type: global
Threads started!
DEBUG: Runner thread started (0)!
Done.

Operations performed: 524288 (3877252.90 ops/sec)

512.00 MB transferred (3786.38 MB/sec)


Test execution summary:
    total time:                          0.1352s
    total number of events:              524288
    total time taken by event execution: 0.1093
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.14ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           524288.0000/0.00
    execution time (avg/stddev):   0.1093/0.00

DEBUG: Verbose per-thread statistics:

DEBUG:     thread #  0: min: 0.0000s  avg: 0.0000s  max: 0.0001s  events: 524288
DEBUG:                  total time taken by even execution: 0.1093s

root@debian8-test:~# cat /sys/bus/clocksource/devices/clocksource0/current_clocksource
tsc
root@debian8-test:~# cat /sys/bus/clocksource/devices/clocksource0/available_clocksource
tsc hpet acpi_pm




root@dev:~ # dmesg | grep -i "TSC"
Calibrating TSC clock ... TSC clock: 3400129027 Hz
  Features=0xf83fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,SS>
  Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
TSC timecounter discards lower 1 bit(s)
Timecounter "TSC-low" frequency 1700064513 Hz quality -100
root@dev:~ # sysbench --num-threads=1 --test=memory --memory-total-size=512M --memory-block-size=1K --debug run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Debug mode enabled.


Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 512M

Memory operations type: write
Memory scope type: global
Threads started!
DEBUG: Runner thread started (0)!
Done.

Operations performed: 524288 (73267.52 ops/sec)

512.00 MB transferred (71.55 MB/sec)


Test execution summary:
    total time:                          7.1558s
    total number of events:              524288
    total time taken by event execution: 5.2780
    per-request statistics:
         min:                                  0.01ms
         avg:                                  0.01ms
         max:                                  0.72ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           524288.0000/0.00
    execution time (avg/stddev):   5.2780/0.00

DEBUG: Verbose per-thread statistics:

DEBUG:     thread #  0: min: 0.0000s  avg: 0.0000s  max: 0.0007s  events: 524288
DEBUG:                  total time taken by even execution: 5.2780s

root@dev:~ # sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: HPET -> TSC-low
root@dev:~ # sysbench --num-threads=1 --test=memory --memory-total-size=512M --memory-block-size=1K --debug run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Debug mode enabled.


Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 512M

Memory operations type: write
Memory scope type: global
Threads started!
DEBUG: Runner thread started (0)!
Done.

Operations performed: 524288 (3408336.84 ops/sec)

512.00 MB transferred (3328.45 MB/sec)


Test execution summary:
    total time:                          0.1538s
    total number of events:              524288
    total time taken by event execution: 0.1102
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.24ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           524288.0000/0.00
    execution time (avg/stddev):   0.1102/0.00

DEBUG: Verbose per-thread statistics:

DEBUG:     thread #  0: min: 0.0000s  avg: 0.0000s  max: 0.0002s  events: 524288
DEBUG:                  total time taken by even execution: 0.1102s




So it's even worse than memory lag: it's gettimeofday() lag, and basically all software relies on that function one way or another.
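To check the gettimeofday() conclusion without sysbench, a small timing loop is enough. This is a minimal sketch, not anything taken from sysbench itself; `gettimeofday_cost_ns` is a hypothetical helper name, and the call count is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical helper: average cost of one gettimeofday() call, in
 * nanoseconds, measured over `calls` back-to-back invocations.
 */
double
gettimeofday_cost_ns(long calls)
{
	struct timeval start, end, scratch;

	assert(calls > 0);

	gettimeofday(&start, NULL);
	for (long i = 0; i < calls; i++)
		gettimeofday(&scratch, NULL);	/* the call being measured */
	gettimeofday(&end, NULL);

	/* Elapsed wall time in microseconds, then averaged per call. */
	double elapsed_us = (double)(end.tv_sec - start.tv_sec) * 1e6 +
	    (double)(end.tv_usec - start.tv_usec);
	return (elapsed_us * 1000.0 / (double)calls);
}
```

Running this once with kern.timecounter.hardware set to HPET and once with TSC-low, as in the sysctl session above, should show whether the per-call cost alone accounts for the gap. If the benchmark takes a timestamp around every 1K event, which is what the conclusion above assumes, that per-call cost would dominate the measured "memory" speed.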
Comment 12 andrew 2017-02-28 20:22:30 UTC
OK, I tried removing the "hypervisor" feature from the CPU, and the VPS then reached 1390 MB/s with TSC-low selected by default (the same as kvm-clock, I must note). So it seems FreeBSD does something to avoid running slowly when the "hypervisor" flag is present, and that is turned off when you disable the flag.
Networking also stopped working without that flag for some reason; routing stopped working.

So in the end, the only fix I currently see is forcing TSC-low manually in FreeBSD until the code is fixed so that it no longer penalizes TSC on all new platforms.

Linux is fixed by disabling kvmclock (it seems the Debian 7 backport from 8 is not the latest, so that might be the trouble).
Comment 13 bob.cauthen@gmail.com 2017-03-03 15:40:55 UTC
As an interested party to this bug I have to raise an issue with this potential workaround.

First though... thanks everybody who discovered and tested this... BUT

According to timecounters(4):

     kern.timecounter.tc.X.quality is an integral value, defining the quality
     of this time counter compared to others.  A negative value means this
     time counter is broken and should not be used.


Andrew's test output showed this line:

Timecounter "TSC-low" frequency 1700064513 Hz quality -100

If the workaround forces the use of TSC-low, and its kern.timecounter.tc.X.quality is negative, are we not advocating a workaround with a broken timecounter as measured by the OS?
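The role of the quality value can be illustrated with a simplified model. This is a sketch under an assumption, not the kernel's code: it assumes the highest-quality counter is selected automatically and that a negative-quality counter is never selected unless forced by hand, which is my reading of the man page text quoted above. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one entry in kern.timecounter.choice. */
struct tc_choice {
	const char *name;
	int quality;
};

/*
 * Simplified model, not the kernel's code: return the entry with the
 * highest quality, skipping negative-quality entries, or NULL if no
 * entry qualifies.
 */
const struct tc_choice *
pick_timecounter(const struct tc_choice *choices, size_t n)
{
	const struct tc_choice *best = NULL;

	assert(choices != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (choices[i].quality < 0)
			continue;	/* assumed: never chosen automatically */
		if (best == NULL || choices[i].quality > best->quality)
			best = &choices[i];
	}
	return (best);
}
```

Fed the list reported in comment #16 (TSC-low -100, i8254 0, ACPI-fast 900, HPET 950, dummy -1000000), this model picks HPET, which matches the "HPET -> TSC-low" transition printed by sysctl in comment #11. Under this model, forcing TSC-low overrides the quality ranking rather than changing it.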

If the answer is yes (to my rhetorical question) possible follow-up questions might then be:

- Should we trust the negative "quality" measurement? (if not, maybe it's easier to mod the timecounter measurement code??)

- Has anyone done any longer-term testing with the TSC-low timer in this configuration to see if using that time counter affects anything else in a running system?

Sorry to be the opposing voice here (especially because this bug affects me too).
Comment 14 andrew 2017-03-03 18:42:36 UTC
Bob, this issue is a major one. I had to delve into almost academic work on all the available timers and how they interact with virtualization.

It seems the only "correct" work is done by VMware, which applies corrections to a broken TSC. I say "correct" because it still requires a special reliable_tsc flag on the virtual CPU so that the OS believes it. The reason? Because virtualization software does not provide direct access, there may be pauses (say, while the host is doing something), which lead to time skew or even a negative timer.

Here's the VMware "fix":

static void
tsc_freq_vmware(void)
{
	u_int regs[4];

	if (hv_high >= 0x40000010) {
		do_cpuid(0x40000010, regs);
		tsc_freq = regs[0] * 1000;
	} else {
		vmware_hvcall(VMW_HVCMD_GETHZ, regs);
		if (regs[1] != UINT_MAX)
			tsc_freq = regs[0] | ((uint64_t)regs[1] << 32);
	}
	tsc_is_invariant = 1;
}

static void
probe_tsc_freq(void)
{

...

	if (vm_guest == VM_GUEST_VMWARE) {
		tsc_freq_vmware();
		return;
	}
...

So this issue arrived on KVM. Also, as migration became easy, it came to attention that non-standardized timers lead to migration and suspend/resume FAILURE.

KVM created kvmclock, which is a monstrosity, is still twice as slow as native TSC, and works only on Linux.

FreeBSD has no support for kvmclock. It has native support for VMware in the code; however, it does not recognize the constant_tsc flag for some reason. This leads to the issue where an OS under virtualization uses HPET or ACPI-PM, which are slower on some systems and are serial (why make them faster when you have TSC?).

Here's the code which kills the quality of TSC under KVM in FreeBSD:

static int
test_tsc(void)
{
	uint64_t *data, *tsc;
	u_int i, size, adj;

	if ((!smp_tsc && !tsc_is_invariant) || vm_guest)
		return (-100);

...

As you can see, KVM has no way around this check. What sets the vm_guest variable? It's the 'hypervisor' flag on the CPU.

Disabling that flag opens a new can of worms (virtio drivers stop working). And I think it then hits the SMP test, which never happens (KVM gives 1 core).
Comment 15 deJong 2017-03-15 08:43:48 UTC
I managed to find a patch which adds KVMCLOCK support; however, it's pretty old and I have not got around to testing it.

I believe it has not been implemented in the -CURRENT branch yet.

Also, I read here:
http://markmail.org/message/pjtay3dghuqpv4hg

that using TSC-low is not entirely foolproof (whatever that may mean, I am more of a sysadmin) so there are probably some issues with using it.

Mailing list:
https://lists.freebsd.org/pipermail/freebsd-arch/2015-January/016587.html

Patch:
https://reviews.freebsd.org/D1435#inline-51273

Maybe somebody could take a look at it and test it to see if it resolves this issue.
Comment 16 deJong 2017-03-21 13:55:21 UTC
I have tested the patch set and unfortunately it made no difference:

root@testbsd11:~ # sysctl kern.timecounter.choice
kern.timecounter.choice: TSC-low(-100) i8254(0) ACPI-fast(900) HPET(950) dummy(-1000000)

root@testbsd11:~ # sysbench --num-threads=1 --test=memory --memory-total-size=1G --memory-block-size=1K run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 1024M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 1048576 (34976.80 ops/sec)

1024.00 MB transferred (34.16 MB/sec)


Test execution summary:
    total time:                          29.9792s
    total number of events:              1048576
    total time taken by event execution: 21.9685
    per-request statistics:
         min:                                  0.02ms
         avg:                                  0.02ms
         max:                                 22.21ms
         approx.  95 percentile:               0.01ms

Threads fairness:
    events (avg/stddev):           1048576.0000/0.00
    execution time (avg/stddev):   21.9685/0.00

So either the problem is somewhere else or these patches do not apply to this situation. It would be good if someone could confirm or deny this.
Comment 17 andrew 2017-03-21 19:04:40 UTC
(In reply to deJong from comment #16)
It doesn't seem like the patchset is applied at all.

> kern.timecounter.choice: TSC-low(-100) i8254(0) ACPI-fast(900) HPET(950) dummy(-1000000)

It should add kvmclock to that list, and kvmclock is obviously not there.

Forcing TSC-low on a KVM VM when the host trusts TSC more than HPET and ACPI-PM is foolproof, IMO.

It's obvious that this selection logic was designed a long time ago and was written this way because back then TSC really sucked. Any hypervisor other than VMware, which gets its own hard-coded loophole, is penalised (I don't know exactly which ones, though, because Xen requires more workarounds in the kernel and I don't really know how bhyve handles it).

So obviously this part of the code should be reworked:
if ((!smp_tsc && !tsc_is_invariant) || vm_guest)
	return (-100);

X86_FEATURE_CONSTANT_TSC is not checked at all (the name is from Linux, but BSD does not check anything like it anywhere in its TSC code, unless I am misreading the magic tsc_is_invariant variable). Most of all, the vm_guest check should go away. It's a terrible hack nowadays, as this bug shows.

I have not tried this yet, but maybe something like this will help:

if ((!smp_tsc && !tsc_is_invariant) || (vm_guest && vm_guest != VM_GUEST_KVM))
	return (-100);
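To make the difference between the two conditions concrete, here is a standalone sketch that models both as plain functions. It is not kernel code: `enum guest_kind` and its values are hypothetical stand-ins for the kernel's vm_guest values, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's vm_guest values. */
enum guest_kind { GUEST_NONE = 0, GUEST_KVM, GUEST_OTHER };

/* The original check tests vm_guest for truthiness, so bare metal must be 0. */
static_assert(GUEST_NONE == 0, "bare-metal value must be zero");

/* Existing condition: true means test_tsc() returns -100 straight away. */
bool
tsc_penalized_current(bool smp_tsc, bool tsc_is_invariant,
    enum guest_kind vm_guest)
{
	return ((!smp_tsc && !tsc_is_invariant) || vm_guest != GUEST_NONE);
}

/* Proposed condition: a KVM guest is no longer penalized outright. */
bool
tsc_penalized_proposed(bool smp_tsc, bool tsc_is_invariant,
    enum guest_kind vm_guest)
{
	return ((!smp_tsc && !tsc_is_invariant) ||
	    (vm_guest != GUEST_NONE && vm_guest != GUEST_KVM));
}
```

In this model a KVM guest with an invariant TSC is rejected by the current condition but passes the proposed one, while a KVM guest whose TSC is neither invariant nor SMP-safe is still rejected by both. Other hypervisors keep the -100 penalty in either version.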
Comment 18 andrew 2017-03-22 11:28:15 UTC
Created attachment 181048 [details]
tsc.c patch to allow KVM hypervisor with host cpu to pass through the good TSC

This patch is not "correct" because it just whitelists generic virtualization (VM_GUEST_VM); however, I don't see how it can be done correctly given the current overall state of the virtualization handling in tsc.c.

I guess a CPU_VENDOR_VIRTUAL should be considered as an option, so that a host pass-through CPU is still treated as a real host CPU, while virtual CPUs like QEMU_X86_64 or whatever are treated separately, with all the ifs and switches testing the stability of the CPU's TSC (that seems to be the Linux approach).