We are facing a minor but disturbing regression after moving from FreeBSD 11.3 to 12.1 on our load balancers. We have two identical machines (dual CPU, 4 cores per CPU), one running FreeBSD 11.3 and one running 12.1, with identical packages and identical configuration.

We run a haproxy instance and, for performance reasons, we pin its threads to different CPUs:

    nbproc 1
    nbthread 6
    cpu-map auto:1/1-6 0-5

The result of this configuration on 11.3 is:

    # cpuset -g -p 35799
    pid 35799 mask: 0, 1, 2, 3, 4, 5

while on 12.1 it is:

    # cpuset -g -p 86498
    pid 86498 mask: 0, 1, 2, 3, 4, 5
    pid 86498 domain policy: first-touch mask: 0, 1

What we have seen is increased CPU usage (2x) and almost doubled system load on FreeBSD 12.1. We didn't measure a potential impact on network latency. One way to work around the issue is to decrease the number of threads to 4.

I guess that this regression has something to do with NUMA: maybe threads 4 and 5 are forced to use memory domain 0, causing the additional CPU usage. Is this intended? If so, why don't we see this high CPU usage on FreeBSD 11.3?

Thanks in advance for the help. I report some hardware information below.
hw.model: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
hw.ncpu: 8
kern.sched.topology_spec:
<groups>
 <group level="1" cache-level="0">
  <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <children>
   <group level="2" cache-level="3">
    <cpu count="4" mask="f,0,0,0">0, 1, 2, 3</cpu>
    <children>
     <group level="3" cache-level="2">
      <cpu count="1" mask="1,0,0,0">0</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="2,0,0,0">1</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="4,0,0,0">2</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="8,0,0,0">3</cpu>
     </group>
    </children>
   </group>
   <group level="2" cache-level="3">
    <cpu count="4" mask="f0,0,0,0">4, 5, 6, 7</cpu>
    <children>
     <group level="3" cache-level="2">
      <cpu count="1" mask="10,0,0,0">4</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="20,0,0,0">5</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="40,0,0,0">6</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="80,0,0,0">7</cpu>
     </group>
    </children>
   </group>
  </children>
 </group>
</groups>
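For reference, the 4-thread workaround we are running now looks like this in the haproxy global section (a sketch; the cpu-map line is simply the 4-thread analogue of our original 6-thread mapping, other settings omitted):

    global
        nbproc 1
        nbthread 4
        cpu-map auto:1/1-4 0-3

With this mapping all threads stay on CPUs 0-3, which the topology below shows share one L3 cache (one package, hence one memory domain).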
Created attachment 210564 [details]
Graph showing the drop in CPU usage

The image shows the CPU usage when we went from 6 to 4 threads on FreeBSD 12.1. The CPU usage with 4 threads on FreeBSD 12.1 is comparable with the CPU usage with 6 threads on FreeBSD 11.3.

Green line: CPU user time
Yellow line: CPU system time
Cyan line: CPU irq time
> I guess that this regression has something to do with NUMA, maybe threads 4,5 are forced to use memory domain 0, causing the additional CPU usage.

The policy reported by cpuset is first-touch, meaning that threads will attempt to allocate memory from the local domain first. Some things you could try to help narrow the problem down:

- Look at memory utilization. Do you have lots of free memory in both domains?
- Collect a flamegraph using https://github.com/brendangregg/FlameGraph to see where the high system CPU time is coming from.
- Try other domain allocation policies. round-robin will cause threads to alternate between the two domains. You can also try forcing all allocations to come from domain 0, where most of the threads are running.
- Try disabling thread pinning.
- Try setting the vm.numa.disabled tunable to 1. I think this will force the page allocator to behave the same as it would in 11.3, so you can rule out other differences between 11.3 and 12.1 that might be causing a problem.
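The experiments above could be sketched roughly as follows. This is a hedged sketch, not a recipe: check cpuset(1) and dtrace(1) on your 12.1 build for exact syntax, and note that the haproxy path, config path, and PID below are placeholders.

```shell
# 1. Per-domain free memory.
sysctl vm.ndomains
sysctl vm.phys_free          # free page lists, broken out per domain

# 2. Kernel stack profile for a flamegraph (the standard FlameGraph
#    DTrace recipe; sample for ~60 seconds under load).
dtrace -x stackframes=100 \
    -n 'profile-997 /arg0/ { @[stack()] = count(); }' \
    -o out.kern_stacks -c 'sleep 60'
stackcollapse.pl out.kern_stacks | flamegraph.pl > haproxy.svg

# 3. Other domain policies, applied at launch (-n takes policy:domain-list;
#    paths are placeholders):
cpuset -n round-robin:0-1 /usr/local/sbin/haproxy -f /usr/local/etc/haproxy.conf
# or force domain 0 on the already-running process (PID is an example):
cpuset -n prefer:0 -p 86498

# 4. vm.numa.disabled is a boot-time tunable, so set it in loader.conf
#    and reboot rather than via sysctl at runtime:
echo 'vm.numa.disabled=1' >> /boot/loader.conf
```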
Any updates on this?
Unfortunately I don't have the setup to reproduce the issue, so I'm going to close it.