Bug 243212 - High CPU usage when set affinity on multiple CPU
Summary: High CPU usage when set affinity on multiple CPU
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords: performance, regression
Depends on:
Blocks:
 
Reported: 2020-01-09 11:00 UTC by Luca Pizzamiglio
Modified: 2020-03-30 18:09 UTC (History)

See Also:


Attachments
The graph showing the drop in cpu usage (12.97 KB, image/png)
2020-01-09 12:35 UTC, Luca Pizzamiglio

Description Luca Pizzamiglio freebsd_committer 2020-01-09 11:00:00 UTC
We have been seeing a minor but disturbing regression since we moved our load balancers from FreeBSD 11.3 to 12.1.

We have 2 identical machines (dual CPU, 4 cores per CPU), one running FreeBSD 11.3 and one running 12.1.
Identical packages and identical configuration.

We run a haproxy instance and, for performance reasons, we pin its threads to different CPUs:
  nbproc 1
  nbthread 6
  cpu-map auto:1/1-6 0-5

The result of this configuration is (on 11.3):
# cpuset -g -p 35799
pid 35799 mask: 0, 1, 2, 3, 4, 5

while on 12.1 it is:
# cpuset -g -p 86498
pid 86498 mask: 0, 1, 2, 3, 4, 5
pid 86498 domain policy: first-touch mask: 0, 1

What we see on FreeBSD 12.1 is increased CPU usage (2x) and system load (almost doubled). We did not measure the potential impact on network latency.
One way to work around the issue is to decrease the number of threads to 4.

I guess that this regression has something to do with NUMA, maybe threads 4,5 are forced to use memory domain 0, causing the additional CPU usage.

Is this intended? If so, why don't we see this high CPU usage on FreeBSD 11.3?
Thanks in advance for the help.

Some hardware information follows.

hw.model: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
hw.ncpu: 8
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <children>
   <group level="2" cache-level="3">
    <cpu count="4" mask="f,0,0,0">0, 1, 2, 3</cpu>
    <children>
     <group level="3" cache-level="2">
      <cpu count="1" mask="1,0,0,0">0</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="2,0,0,0">1</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="4,0,0,0">2</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="8,0,0,0">3</cpu>
     </group>
    </children>
   </group>
   <group level="2" cache-level="3">
    <cpu count="4" mask="f0,0,0,0">4, 5, 6, 7</cpu>
    <children>
     <group level="3" cache-level="2">
      <cpu count="1" mask="10,0,0,0">4</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="20,0,0,0">5</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="40,0,0,0">6</cpu>
     </group>
     <group level="3" cache-level="2">
      <cpu count="1" mask="80,0,0,0">7</cpu>
     </group>
    </children>
   </group>
  </children>
 </group>
</groups>
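
To make the layout concrete, the sketch below counts how many of the pinned CPUs land in each level-2 group of the topology above. It assumes that each level-2 group corresponds to one socket and one NUMA domain, which the output does not confirm; PINNED and count_per_group are illustrative names only.

```sh
#!/bin/sh
# CPUs haproxy is pinned to, taken from the "cpuset -g -p" output in this report.
PINNED="0 1 2 3 4 5"

# Print how many pinned CPUs fall in CPUs 0-3 and how many in CPUs 4-7.
# Assumption: each level-2 group in kern.sched.topology_spec is one socket
# and one NUMA domain.
count_per_group() {
    low=0
    high=0
    for cpu in $PINNED; do
        if [ "$cpu" -le 3 ]; then
            low=$((low + 1))
        else
            high=$((high + 1))
        fi
    done
    echo "cpus 0-3: $low pinned, cpus 4-7: $high pinned"
}

count_per_group
```

Under that assumption, four threads share one socket and two run on the other, which may be why dropping to 4 threads changes the picture.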
Comment 1 Luca Pizzamiglio freebsd_committer 2020-01-09 12:35:30 UTC
Created attachment 210564
The graph showing the drop in cpu usage

The image shows the CPU usage when we went from 6 to 4 threads on FreeBSD 12.1.
The CPU usage with 4 threads on FreeBSD 12.1 is comparable to the CPU usage with 6 threads on FreeBSD 11.3.

Green line  : cpu user time
Yellow line : cpu system time
Cyan line   : cpu irq time
Comment 2 Mark Johnston freebsd_committer 2020-01-09 15:56:49 UTC
> I guess that this regression has something to do with NUMA, maybe threads 4,5 are forced to use memory domain 0, causing the additional CPU usage.

The policy as reported by cpuset is first-touch, meaning that threads will attempt to allocate memory from the local domain first.

Some things you could try to help narrow the problem down:
- Look at memory utilization.  Do you have lots of free memory in both domains?
- Collect a flamegraph using https://github.com/brendangregg/FlameGraph to see where the high system CPU time is coming from.
- Try other domain allocation policies.  round-robin will cause threads to alternate between the two domains.  You can also try forcing all allocations to come from domain 0, where most of the threads are running.
- Try disabling thread pinning.
- Try setting the vm.numa.disabled tunable to 1.  I think this will force the page allocator to behave the same as it would in 11.3, so you can rule out other differences between 11.3 and 12.1 that might be causing a problem.
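
A rough sketch of how some of the checks above could be run from a shell follows. It assumes that cpuset(1) on 12.1 accepts -n policy:domain-list together with -p, as its manual page describes, and that the vm.ndomains and vm.phys_free sysctls are available. The pid 86498 is the one from the report, and the run helper is hypothetical. By default the script only prints the commands; set RUN=1 to execute them.

```sh
#!/bin/sh
# Sketch only: prints each command unless RUN=1 is set in the environment.
# 86498 is the haproxy pid from the report; substitute the real one.
PID=86498

# Hypothetical helper: execute the command when RUN=1, otherwise just print it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Number of NUMA domains and free pages per domain (memory utilization check).
show_domains() { run sysctl vm.ndomains vm.phys_free; }

# Current CPU mask and domain policy of the process.
show_policy() { run cpuset -g -p "$PID"; }

# Alternate page allocations between domains 0 and 1.
try_round_robin() { run cpuset -n round-robin:0-1 -p "$PID"; }

# Prefer domain 0, where most of the threads are assumed to run.
try_prefer_domain0() { run cpuset -n prefer:0 -p "$PID"; }

# vm.numa.disabled is a tunable, so it is assumed to be set at boot time:
# add vm.numa.disabled=1 to /boot/loader.conf and reboot.

show_domains
show_policy
try_round_robin
try_prefer_domain0
```

Changing one policy at a time and comparing system CPU time after each change should make it easier to tell which setting, if any, matters.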
Comment 3 Mark Johnston freebsd_committer 2020-03-30 18:09:47 UTC
Any updates on this?