Bug 236233 - [Meltdown mitigations under Xen, emulated TLB] Python multiprocessing is 4x slower with 11.2 compared to 11.0 kernel
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-03-04 22:07 UTC by Alex Richardson
Modified: 2019-12-22 05:29 UTC
CC List: 3 users

See Also:


Attachments

Description Alex Richardson freebsd_committer 2019-03-04 22:07:54 UTC
After I updated our Jenkins slaves from 11.0 to 11.2, running the LLVM regression test suite takes over 2 hours instead of 5-10 minutes.
The testsuite is driven by llvm-lit, a Python script that makes heavy use of the Python multiprocessing module.

On 11.0 the test output prints status updates continuously, whereas on 11.2 it prints N (or N-1, N-2) status updates (with N being the number of parallel jobs) and then pauses noticeably. During that pause the individual Python processes appear to be using 100% CPU.

I tried updating one of the slaves to 12.0, but it is still just as slow.

I also tried updating python27 to the latest 11.2 package (2.7.15), but that still runs fast, so the Python version itself does not appear to be the cause. It therefore seems to me like this could be a regression in semaphores or umtx.
I am also not sure whether this is a regression in the kernel or in libc/libthr.
 
Downgrading to 11.0 fixes the issue.
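
For reference, the multiprocessing pattern involved is roughly the following. This is only a hypothetical standalone sketch I wrote for illustration (the file name repro_pool.py and the task count are made up; the real driver is llvm-lit): a multiprocessing.Pool dispatching many short external commands, which exercises the Pool's internal semaphore/queue machinery.

# repro_pool.py - hypothetical reproducer, not llvm-lit itself; a
# multiprocessing.Pool running many short external commands, similar in
# spirit to how llvm-lit dispatches individual test invocations.
import multiprocessing
import subprocess
import time

def run_one(i):
    # Stand-in for a single test invocation: spawn a short-lived process.
    subprocess.check_call(["/usr/bin/true"])
    return i

if __name__ == "__main__":
    start = time.time()
    pool = multiprocessing.Pool(processes=32)
    for _ in pool.imap_unordered(run_one, range(2000)):
        pass
    pool.close()
    pool.join()
    print("elapsed: %.2fs" % (time.time() - start))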
Comment 1 Alex Richardson freebsd_committer 2019-03-05 13:19:31 UTC
I can confirm that this is a kernel regression between 11.0 and 11.2.


I updated the kernel on one of the build machines from 11.0 to 11.2 (same userspace), and now the testsuite is much slower.

Running `./bin/llvm-lit -j32 test/CodeGen/Mips` from the LLVM build directory now takes 121 seconds instead of 29, i.e. 4x slower.

On a Linux build slave with the same hardware, running the test/CodeGen/Mips tests takes only 6.8 seconds.


Once I have some more time I can try bisecting, but that will probably take a while.
Comment 2 Alex Richardson freebsd_committer 2019-03-06 10:09:14 UTC
I finished bisecting, and the following commit is causing the slowdown:

6dd025b40ee6870bea6ba670f30dcf684edc3f6c is the first bad commit
commit 6dd025b40ee6870bea6ba670f30dcf684edc3f6c
Author: kib <kib@FreeBSD.org>
Date:   Sat Feb 17 18:00:01 2018 +0000

    MFC r328083,328096,328116,328119,328120,328128,328135,328153,328157,
    328166,328177,328199,328202,328205,328468,328470,328624,328625,328627,
    328628,329214,329297,329365:

    Meltdown mitigation by PTI, PCID optimization of PTI, and kernel use of IBRS
    for some mitigations of Spectre.

    Tested by:      emaste, Arshan Khanifar <arshankhanifar@gmail.com>
    Discussed with: jkim
    Sponsored by:   The FreeBSD Foundation



I also checked that IBRS was off, so this seems to be caused by something else in this massive commit.
$ sysctl hw.ibrs_active
hw.ibrs_active: 0
$ sysctl hw.ibrs_disable
hw.ibrs_disable: 1


The time it took to run the test command was very consistent:
Working commit: r313908+bc19f9718e7 (stable/11) -> 26.8s
Known bad commit: a190b45c053 (stable/11) -> 121.12s

1st step: fd638422ccfbbe7a11e751ed5a7a237f2dc8d004 -> 133.14s
2nd step: 4ebdf0a463e767672045047ec82c75bf545e9a7c -> 25.48s
3rd step: 483c180d2dece828b1800896e642bcb806075c1f -> 133.49s
4th step: e38b1b844d045fb5dcf4db6f2795f7797022ca17 -> 24.97s
5th step (141 to go): 97ca83c5b618a025fdac8e62db23afa7abd198c8 -> 25.27s
6th step (70 to go): cdf8c6144ce326759989ebcbc0d9b5195bd4f811 -> 133.64s
7th step (35 to go): e2f594b309b8fd60b6ab39982b9ba0f6da798f38 -> 134.01s
8th step (17 to go): 195e563d5c9edb667d92e2480a07080ae6ba40ac -> 24.80s
9th step (8 to go): 61669078139eef2cb35756674ab6e925e97ab5ba -> 134.34s
10th step (4 to go): af7d202e1dcdeda020ee1fdb96e2b1eb4dbbad39 -> 133.04s
11th step (1 to go): 6dd025b40ee6870bea6ba670f30dcf684edc3f6c -> 133.92s
12th step (0 to go): ebe022b0c9f14dc22edbe0308795d42c9f93fd39 -> 24.86s
Comment 3 Konstantin Belousov freebsd_committer 2019-03-06 10:20:16 UTC
This does not look too informative.  Can you infer which syscalls take longer?  Or is it some different timing for synchronization?  Or it might be something completely different.

Also, what is the hardware where this happens?
Comment 4 Alex Richardson freebsd_committer 2019-03-06 14:35:31 UTC
It is a virtual machine with an Intel CPU. CPU info from dmesg:

Hyper-V Version: 0.0.0 [SP0]
  Features=0x870<APIC,HYPERCALL,VPINDEX,TMFREQ>
  PM Features=0x0 [C0]
  Features3=0x0
XEN: Hypervisor version 4.7 detected.
CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (2593.81-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206d6  Family=0x6  Model=0x2d  Stepping=6
  Features=0x17c3fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,ACPI,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x80b82201<SSE3,SSSE3,CX16,SSE4.1,SSE4.2,x2APIC,POPCNT,HV>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
Hypervisor: Origin = "Microsoft Hv"
real memory  = 118107406336 (112636 MB)
avail memory = 114828369920 (109508 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <Xen HVM>
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 2 package(s) x 16 core(s)


It is kind of difficult to see which syscalls take longer before and after, but my guess is that it is related to semaphores, since the driver is getting data back from the subprocesses and uses semaphores for synchronization.

I'm not sure if there is an easy way to see what is taking longer. I'll see if I can somehow get the information using DTrace.
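
In the meantime, one rough way to isolate the synchronization cost (a sketch only, under the assumption that the blocking semaphore wake-ups are where the time goes; sem_pingpong.py is a name I made up) would be to time blocking semaphore round trips between two processes on both kernels and compare the per-operation cost:

# sem_pingpong.py - hypothetical microbenchmark, not part of llvm-lit;
# times blocking semaphore round trips between a parent and a child
# process, so every iteration has to enter the kernel to sleep/wake.
import multiprocessing
import time

def child(ping, pong, n):
    for _ in range(n):
        ping.acquire()   # wait for the parent
        pong.release()   # wake the parent

def main(n=20000):
    ping = multiprocessing.Semaphore(0)
    pong = multiprocessing.Semaphore(0)
    p = multiprocessing.Process(target=child, args=(ping, pong, n))
    p.start()
    start = time.time()
    for _ in range(n):
        ping.release()   # wake the child
        pong.acquire()   # wait for the child
    elapsed = time.time() - start
    p.join()
    print("%d round trips in %.3fs (%.1f us each)"
          % (n, elapsed, elapsed / n * 1e6))

if __name__ == "__main__":
    main()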
Comment 5 Konstantin Belousov freebsd_committer 2019-03-06 14:56:27 UTC
If you disable KPTI with the knob, does the performance recover?
Can you try on bare metal?

There is no PCID feature advertised by the virtualized CPU, so we flush the TLB on each syscall entry and exit.  I suspect Xen might have additional pessimisations there compared to bare hardware.
Comment 6 Alex Richardson freebsd_committer 2019-03-06 14:58:42 UTC
For now, setting vm.pmap.pti=0 in /boot/loader.conf brings the testing time back down to 25.89s. The Meltdown mitigation should not be too important for this server, so I will disable it for now.
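
For anyone else who hits this, the workaround is just the loader tunable (it only takes effect after a reboot); the sysctl afterwards is just to verify the setting took:

# /boot/loader.conf: disable kernel page-table isolation (Meltdown mitigation)
vm.pmap.pti=0

$ sysctl vm.pmap.pti
vm.pmap.pti: 0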

I guess switching to separate page tables causes a lot of overhead with nested paging, and maybe, due to the timing changes, there are also more syscalls, so again more overhead.