After I updated our Jenkins slaves from 11.0 to 11.2, running the LLVM regression test suite takes over 2 hours instead of 5-10 minutes.
The test suite is driven by llvm-lit (a Python script) that makes heavy use of Python's multiprocessing module.
On 11.0 the test output prints status updates continuously, whereas on 11.2 it prints N (or N-1, N-2) status updates (with N being the number of parallel jobs) followed by a noticeable pause. During that pause the individual Python processes appear to spin at 100% CPU.
I tried updating one of the slaves to 12.0, but it is still just as slow.
I also tried updating python27 to the latest 11.2 package (2.7.15), but that still runs fast, so Python itself does not seem to be the cause. This looks to me like a regression involving semaphores or umtx.
I am also not sure whether the regression is in the kernel or in libc/libthr.
Downgrading to 11.0 fixes the issue.
I can confirm that this is a kernel regression between 11.0 and 11.2.
I updated the kernel on one of the build machines from 11.0 to 11.2 (same userspace) and now the testsuite is much slower.
Running `./bin/llvm-lit -j32 test/CodeGen/Mips` from the LLVM build directory now takes 121 seconds instead of 29, i.e. 4x slower.
On a Linux build slave with the same hardware running the test/CodeGen/Mips tests only takes 6.8 seconds.
Once I have some more time I can try bisecting, but that will probably take a while.
I finished bisecting and the following commit is causing the slowdown:
6dd025b40ee6870bea6ba670f30dcf684edc3f6c is the first bad commit
Author: kib <kib@FreeBSD.org>
Date: Sat Feb 17 18:00:01 2018 +0000
Meltdown mitigation by PTI, PCID optimization of PTI, and kernel use of IBRS
for some mitigations of Spectre.
Tested by: emaste, Arshan Khanifar <email@example.com>
Discussed with: jkim
Sponsored by: The FreeBSD Foundation
I also checked that IBRS was off, so this seems to be caused by something else in this massive commit:
$ sysctl hw.ibrs_active
$ sysctl hw.ibrs_disable
The time it took to run the test command was very consistent:
Working commit: r313908+bc19f9718e7(stable/11) -> 26.8s
Known bad commit: a190b45c053(stable/11) -> 121.12s
1st step: fd638422ccfbbe7a11e751ed5a7a237f2dc8d004 -> 133.14s
2nd step: 4ebdf0a463e767672045047ec82c75bf545e9a7c -> 25.48s
3rd step: 483c180d2dece828b1800896e642bcb806075c1f -> 133.49s
4th step: e38b1b844d045fb5dcf4db6f2795f7797022ca17 -> 24.97s
5th step (141 to go): 97ca83c5b618a025fdac8e62db23afa7abd198c8 -> 25.27s
6th step (70 to go): cdf8c6144ce326759989ebcbc0d9b5195bd4f811 -> 133.64s
7th step (35 to go): e2f594b309b8fd60b6ab39982b9ba0f6da798f38 -> 134.01s
8th step (17 to go): 195e563d5c9edb667d92e2480a07080ae6ba40ac -> 24.80s
9th step (8 to go): 61669078139eef2cb35756674ab6e925e97ab5ba -> 134.34s
10th step (4 to go): af7d202e1dcdeda020ee1fdb96e2b1eb4dbbad39 -> 133.04s
11th step (1 to go): 6dd025b40ee6870bea6ba670f30dcf684edc3f6c -> 133.92s
12th step (0 to go): ebe022b0c9f14dc22edbe0308795d42c9f93fd39 -> 24.86s
This does not look too informative. Can you infer which syscalls take longer? Or is it some different timing for synchronization? Or it might be something completely different.
Also, what is the hardware where this happens?
It is a virtual machine with an Intel CPU. CPU info from dmesg:
Hyper-V Version: 0.0.0 [SP0]
PM Features=0x0 [C0]
XEN: Hypervisor version 4.7 detected.
CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (2593.81-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x206d6 Family=0x6 Model=0x2d Stepping=6
Hypervisor: Origin = "Microsoft Hv"
real memory = 118107406336 (112636 MB)
avail memory = 114828369920 (109508 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <Xen HVM>
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 2 package(s) x 16 core(s)
It is kind of difficult to see which syscalls take longer before and after, but my guess is that it is related to semaphores, since lit collects data from the subprocesses and uses semaphores for synchronization.
I'm not sure if there is an easy way to see what is taking longer. I'll see if I can get that information using DTrace.
If you disable KPTI by a knob, does the speed loss recover?
Can you try on bare metal?
There is no PCID feature advertised by the virtualized CPU, so we flush the TLB on each syscall entry and exit. I suspect Xen might have additional pessimisations there compared to bare hardware.
For now, setting vm.pmap.pti=0 in /boot/loader.conf brings the testing time back down to 25.89s. Meltdown mitigation should not be too important for this server, so I will disable it for now.
I guess switching to separate page tables causes a lot of overhead under nested paging, and maybe the changed timing also results in more syscalls, adding further overhead.
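The per-syscall-cost theory is easy to check directly: time a syscall that does almost no kernel work, before and after toggling vm.pmap.pti. With PTI plus a full TLB flush on every kernel entry/exit (no PCID under this Xen guest), the per-call cost should jump noticeably. A sketch using getppid() as the cheap syscall:

```python
import os
import time

def syscall_cost(n=100000):
    # getppid() does almost no work in the kernel, so this loop mostly
    # measures kernel entry/exit cost, including any PTI page-table
    # switch and TLB flush.
    start = time.time()
    for _ in range(n):
        os.getppid()
    return (time.time() - start) / n

if __name__ == "__main__":
    print("%.0f ns per getppid()" % (syscall_cost() * 1e9))
```

Comparing the per-call figure with vm.pmap.pti set to 0 and 1 would confirm how much of the 4x slowdown is plain syscall overhead.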