Bug 254040 - AMD 5950X hyperthreading strange performance swings
Summary: AMD 5950X hyperthreading strange performance swings
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-03-05 15:45 UTC by Dennis Noordsij
Modified: 2023-09-07 03:36 UTC (History)
CC List: 4 users

See Also:


Description Dennis Noordsij 2021-03-05 15:45:51 UTC
I plan to upgrade our server to a Ryzen 9 5950X system, 16 cores, 3400MHz base frequency, 128GB RAM, and ran into an issue while testing.

For reference I will use a very simple command:

  dd if=/dev/zero bs=1M count=1000 | bzip2 - | wc

which, using a Linux rescue system (i.e. nothing else running) consistently and repeatedly completes in about:

  1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.67561 s, 224 MB/s



Now for FreeBSD 12.2-RELEASE, on a basic boot not running anything except ssh:

if I _disable_ hyperthreading using machdep.hyperthreading_allowed=0, I get the following approximate result consistently and repeatedly:

  dd if=/dev/zero bs=1M count=1000 | bzip2 - | wc
  1048576000 bytes transferred in 4.874335 secs (215121876 bytes/sec)
 
Slightly slower than Linux, not sure if this is in how bzip2 is compiled etc, but nothing that worries me. 
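
For anyone reproducing the comparison: a minimal sketch of how the tunable above might be made persistent, assuming it is set as a loader tunable and takes effect on the next boot (the report only names the variable, not where it was set):

```
# /boot/loader.conf (assumed location for the tunable)
machdep.hyperthreading_allowed="0"
```

With that line removed (or set to "1") the default of running with both hardware threads per core applies again.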



However, if I _enable_ hyperthreading, i.e. the default I started with, then I will get:

  dd if=/dev/zero bs=1M count=1000 | bzip2 - | wc
  1048576000 bytes transferred in 4.887522 secs (214541450 bytes/sec)
  1048576000 bytes transferred in 7.507138 secs (139677190 bytes/sec)
  1048576000 bytes transferred in 6.227179 secs (168386989 bytes/sec)
  1048576000 bytes transferred in 7.590263 secs (138147516 bytes/sec)
  1048576000 bytes transferred in 7.421037 secs (141297776 bytes/sec)
  1048576000 bytes transferred in 4.922986 secs (212995935 bytes/sec)
  1048576000 bytes transferred in 4.945138 secs (212041827 bytes/sec)
  1048576000 bytes transferred in 7.671600 secs (136682828 bytes/sec)
  1048576000 bytes transferred in 7.673428 secs (136650273 bytes/sec)

i.e. the run time keeps jumping between roughly 4.9 and 7.7 seconds, a relatively large difference for commands executed immediately after one another (with no other load whatsoever).
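
To collect such a series without copying numbers by hand, something along these lines may help. It is only a sketch: the helper names dd_secs and repeat_bench are made up for illustration, and the parsing assumes the FreeBSD dd summary format shown above.

```shell
# Print the elapsed seconds from FreeBSD-style dd summary lines read on stdin.
# Matches lines like: "1048576000 bytes transferred in 4.887522 secs (...)".
dd_secs() {
    awk '/bytes transferred in/ { print $5 }'
}

# Run the pipeline from this report N times (default 5) and print one
# elapsed time per run. repeat_bench is a hypothetical helper name.
repeat_bench() {
    n=${1:-5}
    i=0
    while [ "$i" -lt "$n" ]; do
        # dd prints its summary on stderr; merge it into stdout for dd_secs.
        { dd if=/dev/zero bs=1M count=1000 | bzip2 - | wc > /dev/null; } 2>&1 | dd_secs
        i=$((i + 1))
    done
}
```

Calling `repeat_bench 9` would then print nine elapsed times, one per line, which makes the two clusters easy to spot.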


I'm curious why this is happening.

I am not running powerd and have not touched any of the CPU settings.



Booting _without_ hyperthreading:

CPU: AMD Ryzen 9 5950X 16-Core Processor             (3393.70-MHz K8-class CPU)                                                          
  Origin="AuthenticAMD"  Id=0xa20f10  Family=0x19  Model=0x21  Stepping=0                                                                
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>             
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>                 
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>                                                                       
  AMD Features2=0x75c237ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX,<b30>>                                                                                                                                      
  Structured Extended Features=0x219c97a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA>         
  Structured Extended Features2=0x40068c<UMIP,PKU,VAES,VPCLMULQDQ,RDPID>                                                                 
  Structured Extended Features3=0x10                                                                                                     
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>                                                                                      
  AMD Extended Feature Extensions ID EBX=0x111ef657<CLZERO,IRPerf,XSaveErPtr>                                                            
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768                                                                                        
  TSC: P-state invariant, performance statistics                                                                                         
real memory  = 137434759168 (131068 MB)                                                                                                  
avail memory = 133793423360 (127595 MB)                                                                                                  
Event timer "LAPIC" quality 600                                                                                                          
ACPI APIC Table: <ALASKA A M I >                                                                                                         
FreeBSD/SMP: Multiprocessor System Detected: 16 CPUs                                                                                     
FreeBSD/SMP: 1 package(s) x 2 cache groups x 8 core(s) x 2 hardware threads                                                              
FreeBSD/SMP Online: 1 package(s) x 2 cache groups x 8 core(s)                                                                            
                                                                                                                                         
                                                                                                                                         
                                                                                                                                         
# sysctl dev.cpu.0
dev.cpu.0.cx_method: C1/hlt C2/io
dev.cpu.0.cx_usage_counters: 11922 0                                                                                                     
dev.cpu.0.cx_usage: 100.00% 0.00% last 43430us                                                                                           
dev.cpu.0.cx_lowest: C1                                                                                                                  
dev.cpu.0.cx_supported: C1/1/1 C2/2/18                                                                                                   
dev.cpu.0.freq_levels: 3400/3740 2800/2800 2200/1980                                                                                     
dev.cpu.0.freq: 3400                                                                                                                     
dev.cpu.0.%parent: acpi0                                                                                                                 
dev.cpu.0.%pnpinfo: _HID=ACPI0007 _UID=0                                                                                                 
dev.cpu.0.%location: handle=\_SB_.PLTF.C000                                                                                              
dev.cpu.0.%driver: cpu                                                                                                                   
dev.cpu.0.%desc: ACPI CPU                                                                                                                
                                           

Booting _with_ hyperthreading:

CPU: AMD Ryzen 9 5950X 16-Core Processor             (3393.69-MHz K8-class CPU)                                                          
  Origin="AuthenticAMD"  Id=0xa20f10  Family=0x19  Model=0x21  Stepping=0                                                                
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>             
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>                 
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>                                                                       
  AMD Features2=0x75c237ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX,<b30>>                                                                                                                                      
  Structured Extended Features=0x219c97a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA>         
  Structured Extended Features2=0x40068c<UMIP,PKU,VAES,VPCLMULQDQ,RDPID>                                                                 
  Structured Extended Features3=0x10                                                                                                     
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>                                                                                      
  AMD Extended Feature Extensions ID EBX=0x111ef657<CLZERO,IRPerf,XSaveErPtr>                                                            
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768                                                                                        
  TSC: P-state invariant, performance statistics                                                                                         
real memory  = 137434759168 (131068 MB)                                                                                                  
avail memory = 133793423360 (127595 MB)                                                                                                  
Event timer "LAPIC" quality 600                                                                                                          
ACPI APIC Table: <ALASKA A M I >                                                                                                         
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs                                                                                     
FreeBSD/SMP: 1 package(s) x 2 cache groups x 8 core(s) x 2 hardware threads                                                              
                                                                                                                                         
                                                                                                                                         
# sysctl dev.cpu.0                                                                                                                       
dev.cpu.0.cx_method: C1/hlt C2/io                                                                                                        
dev.cpu.0.cx_usage_counters: 3232 0                                                                                                      
dev.cpu.0.cx_usage: 100.00% 0.00% last 65311us                                                                                           
dev.cpu.0.cx_lowest: C1                                                                                                                  
dev.cpu.0.cx_supported: C1/1/1 C2/2/18                                                                                                   
dev.cpu.0.freq_levels: 3400/3740 2800/2800 2200/1980                                                                                     
dev.cpu.0.freq: 3400                                                                                                                     
dev.cpu.0.%parent: acpi0                                                                                                                 
dev.cpu.0.%pnpinfo: _HID=ACPI0007 _UID=0                                                                                                 
dev.cpu.0.%location: handle=\_SB_.PLTF.C000                                                                                              
dev.cpu.0.%driver: cpu                                                                                                                   
dev.cpu.0.%desc: ACPI CPU 



zenstates.py reports:

# ./zenstates.py -l
P0 - Enabled - FID = 88 - DID = 8 - VID = 48 - IDD = 22( / 1 )
             - Ratio = 34.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - IDD = 1C( / 1 )
             - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - IDD = 16( / 1 )
             - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
Core Performance Boost - Enabled
C6 State - Package - Disabled
C6 State - Core - Enabled


FWIW if I disable core performance boost the varying execution times shift from ~4.8 and ~7.8 to ~6.5 and ~10.8 seconds respectively, i.e. same behaviour just slower.

I was hoping someone could explain why this is happening (note it doesn't happen on Linux), if it is expected, and/or how it can be worked around or fixed, or where the problem would be (p-states?). 

Happy to test anything.


PS - I tried a 13-BETA2 rescue boot which has HT enabled and it behaves exactly the same.
Comment 1 Andriy Gapon freebsd_committer freebsd_triage 2021-12-19 09:51:08 UTC
Do you think that this is the same issue as bug 256594?
Also, did you try disabling PBO (Precision Boost Overdrive) as opposed to CPB?
Comment 2 Stefan Eßer freebsd_committer freebsd_triage 2021-12-19 14:48:29 UTC
Interesting result, which I could reproduce.
But I'd be very surprised if this was connected to bug 256594.

I have performed a few tests on -CURRENT, with different numbers of processes running in parallel:

$ t () { 
    for i in $(jot ${1:-1});
    do
        dd if=/dev/zero bs=1M count=1000 | bzip2 - | wc > /dev/null &
    done 2>&1 | grep transferred;
    wait
}

$ t 4
1048576000 bytes transferred in 5.601800 secs (187185555 bytes/sec)
1048576000 bytes transferred in 5.649712 secs (185598117 bytes/sec)
1048576000 bytes transferred in 5.695707 secs (184099356 bytes/sec)
1048576000 bytes transferred in 9.145955 secs (114649153 bytes/sec)
$ t 4
1048576000 bytes transferred in 5.615530 secs (186727884 bytes/sec)
1048576000 bytes transferred in 6.000599 secs (174745209 bytes/sec)
1048576000 bytes transferred in 8.281161 secs (126621862 bytes/sec)
1048576000 bytes transferred in 8.982560 secs (116734646 bytes/sec)
$ t 4
1048576000 bytes transferred in 5.597056 secs (187344204 bytes/sec)
1048576000 bytes transferred in 8.940248 secs (117287131 bytes/sec)
1048576000 bytes transferred in 8.962013 secs (117002282 bytes/sec)
1048576000 bytes transferred in 8.975112 secs (116831521 bytes/sec)

There are only two typical throughput value ranges: 175 to 188 MB/s and 115 to 125 MB/s (in powers of 10, not 2). This is roughly a factor of 3/2 ...
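
The factor can be checked directly from the dd lines. A small sketch, assuming the FreeBSD dd summary format above (throughput_spread is a hypothetical helper name):

```shell
# Read dd summary lines on stdin and print the ratio between the highest
# and the lowest bytes/sec value, rounded to two decimals.
throughput_spread() {
    # Field 7 looks like "(187185555"; strip the parenthesis before parsing.
    tr -d '(' | awk '
        /bytes transferred in/ {
            v = $7 + 0
            if (n == 0 || v > max) max = v
            if (n == 0 || v < min) min = v
            n++
        }
        END { if (n > 0) printf "%.2f\n", max / min }'
}
```

Fed with the fastest and slowest lines of the second "t 4" run above (186727884 and 116734646 bytes/sec) it prints 1.60, in line with the rough 3/2 estimate.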

$ t 16
1048576000 bytes transferred in 7.537053 secs (139122806 bytes/sec)
1048576000 bytes transferred in 7.643938 secs (137177468 bytes/sec)
1048576000 bytes transferred in 7.658221 secs (136921619 bytes/sec)
1048576000 bytes transferred in 7.676633 secs (136593217 bytes/sec)
1048576000 bytes transferred in 7.684927 secs (136445807 bytes/sec)
1048576000 bytes transferred in 7.692365 secs (136313868 bytes/sec)
1048576000 bytes transferred in 7.785566 secs (134682056 bytes/sec)
1048576000 bytes transferred in 7.869853 secs (133239594 bytes/sec)
1048576000 bytes transferred in 7.887814 secs (132936190 bytes/sec)
1048576000 bytes transferred in 7.902913 secs (132682214 bytes/sec)
1048576000 bytes transferred in 7.901557 secs (132704990 bytes/sec)
1048576000 bytes transferred in 7.918014 secs (132429169 bytes/sec)
1048576000 bytes transferred in 7.964384 secs (131658150 bytes/sec)
1048576000 bytes transferred in 7.973078 secs (131514575 bytes/sec)
1048576000 bytes transferred in 7.992037 secs (131202601 bytes/sec)
1048576000 bytes transferred in 8.074766 secs (129858370 bytes/sec)

Now all results are between 130 and 140 MB/s. And this outcome is stable over multiple runs.

$ t 32
1048576000 bytes transferred in 11.279196 secs (92965495 bytes/sec)
1048576000 bytes transferred in 11.343222 secs (92440755 bytes/sec)
1048576000 bytes transferred in 11.345478 secs (92422376 bytes/sec)
1048576000 bytes transferred in 11.422671 secs (91797797 bytes/sec)
1048576000 bytes transferred in 11.522082 secs (91005777 bytes/sec)
1048576000 bytes transferred in 11.757213 secs (89185763 bytes/sec)
1048576000 bytes transferred in 11.796787 secs (88886578 bytes/sec)
1048576000 bytes transferred in 11.787529 secs (88956389 bytes/sec)
1048576000 bytes transferred in 11.830471 secs (88633499 bytes/sec)
1048576000 bytes transferred in 11.866944 secs (88361080 bytes/sec)
1048576000 bytes transferred in 11.901904 secs (88101537 bytes/sec)
1048576000 bytes transferred in 11.956605 secs (87698475 bytes/sec)
1048576000 bytes transferred in 11.952918 secs (87725524 bytes/sec)
1048576000 bytes transferred in 11.955508 secs (87706519 bytes/sec)
1048576000 bytes transferred in 11.961946 secs (87659316 bytes/sec)
1048576000 bytes transferred in 11.992837 secs (87433521 bytes/sec)
1048576000 bytes transferred in 12.017736 secs (87252376 bytes/sec)
1048576000 bytes transferred in 12.023212 secs (87212632 bytes/sec)
1048576000 bytes transferred in 12.014854 secs (87273302 bytes/sec)
1048576000 bytes transferred in 12.054915 secs (86983277 bytes/sec)
1048576000 bytes transferred in 12.149618 secs (86305266 bytes/sec)
1048576000 bytes transferred in 12.179530 secs (86093302 bytes/sec)
1048576000 bytes transferred in 12.260039 secs (85527952 bytes/sec)
1048576000 bytes transferred in 12.261602 secs (85517046 bytes/sec)
1048576000 bytes transferred in 12.260685 secs (85523445 bytes/sec)
1048576000 bytes transferred in 12.386748 secs (84653048 bytes/sec)
1048576000 bytes transferred in 12.415505 secs (84456972 bytes/sec)
1048576000 bytes transferred in 12.487385 secs (83970822 bytes/sec)
1048576000 bytes transferred in 12.527210 secs (83703871 bytes/sec)
1048576000 bytes transferred in 12.602776 secs (83201986 bytes/sec)
1048576000 bytes transferred in 12.618314 secs (83099532 bytes/sec)
1048576000 bytes transferred in 12.725476 secs (82399749 bytes/sec)

Again similar results on multiple runs, always between 80 and 93 MB/s.

The big variations exist if not all cores are busy, and this might be due to non-optimal scheduling performed by SCHED_ULE. It would be very interesting to repeat this test with SCHED_4BSD, instead.
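
For that experiment, a custom kernel configuration along the following lines should be enough. This is a sketch, assuming an amd64 system built from the GENERIC configuration; the file name and ident GENERIC-4BSD are only illustrative.

```
# /usr/src/sys/amd64/conf/GENERIC-4BSD (hypothetical file name)
include     GENERIC
ident       GENERIC-4BSD
nooptions   SCHED_ULE       # drop the default scheduler
options     SCHED_4BSD      # select the traditional 4BSD scheduler
```

After booting the resulting kernel, "sysctl kern.sched.name" should report which scheduler is active.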

Funny detail: my CPU is reported to have an upper CPU clock of 4000 MHz, not 3400 MHz (no overclocking, but I had powerd running before, it has been stopped for these measurements).

And I'm quite sure that I had once seen C2 statistics in the sysctl output, which are missing now:

$  sysctl dev.cpu.0     
dev.cpu.0.temperature: 48,6C
dev.cpu.0.cx_method: C1/hlt
dev.cpu.0.cx_usage_counters: 175874
dev.cpu.0.cx_usage: 100.00% last 6552us
dev.cpu.0.cx_lowest: C8
dev.cpu.0.cx_supported: C1/1/0
dev.cpu.0.freq_levels: 4000/3740 2800/2800 2200/1980
dev.cpu.0.freq: 4000
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=ACPI0007 _UID=0 _CID=none
dev.cpu.0.%location: handle=\_SB_.PLTF.C000
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU

Maybe I need to check the energy efficiency settings in the BIOS, but I thought I had enabled all of them again, after the last BIOS update ...
Comment 3 Stefan Eßer freebsd_committer freebsd_triage 2021-12-19 16:07:34 UTC
Update after booting a kernel with SCHED_4BSD:

This really appears to be a SCHED_ULE issue, as expected. With SCHED_4BSD the results are quite homogeneous for each run:

$ t # repeated runs ...
1048576000 bytes transferred in 6.720101 secs (156035746 bytes/sec)
1048576000 bytes transferred in 7.037620 secs (148995835 bytes/sec)
1048576000 bytes transferred in 7.031251 secs (149130790 bytes/sec)
1048576000 bytes transferred in 7.032412 secs (149106159 bytes/sec)
1048576000 bytes transferred in 7.202687 secs (145581229 bytes/sec)
1048576000 bytes transferred in 6.918272 secs (151566177 bytes/sec)
1048576000 bytes transferred in 6.391244 secs (164064463 bytes/sec)
1048576000 bytes transferred in 6.686209 secs (156826683 bytes/sec)
1048576000 bytes transferred in 6.668778 secs (157236600 bytes/sec)
1048576000 bytes transferred in 6.914906 secs (151639959 bytes/sec)
1048576000 bytes transferred in 6.835760 secs (153395678 bytes/sec)
1048576000 bytes transferred in 6.755348 secs (155221619 bytes/sec)

$ t 4
1048576000 bytes transferred in 7.574533 secs (138434415 bytes/sec)
1048576000 bytes transferred in 7.776473 secs (134839540 bytes/sec)
1048576000 bytes transferred in 7.839487 secs (133755689 bytes/sec)
1048576000 bytes transferred in 7.856730 secs (133462132 bytes/sec)
$ t 4
1048576000 bytes transferred in 7.481412 secs (140157499 bytes/sec)
1048576000 bytes transferred in 7.557077 secs (138754182 bytes/sec)
1048576000 bytes transferred in 7.676920 secs (136588110 bytes/sec)
1048576000 bytes transferred in 8.040430 secs (130412922 bytes/sec)
$ t 4
1048576000 bytes transferred in 7.484386 secs (140101810 bytes/sec)
1048576000 bytes transferred in 7.581198 secs (138312698 bytes/sec)
1048576000 bytes transferred in 7.710614 secs (135991248 bytes/sec)
1048576000 bytes transferred in 7.736921 secs (135528856 bytes/sec)

$ t 16
1048576000 bytes transferred in 9.879846 secs (106132833 bytes/sec)
1048576000 bytes transferred in 10.087562 secs (103947418 bytes/sec)
1048576000 bytes transferred in 10.232516 secs (102474894 bytes/sec)
1048576000 bytes transferred in 10.237664 secs (102423370 bytes/sec)
1048576000 bytes transferred in 10.320563 secs (101600658 bytes/sec)
1048576000 bytes transferred in 10.375758 secs (101060179 bytes/sec)
1048576000 bytes transferred in 10.443859 secs (100401199 bytes/sec)
1048576000 bytes transferred in 10.481844 secs (100037355 bytes/sec)
1048576000 bytes transferred in 10.494019 secs (99921294 bytes/sec)
1048576000 bytes transferred in 10.510178 secs (99767674 bytes/sec)
1048576000 bytes transferred in 10.559435 secs (99302286 bytes/sec)
1048576000 bytes transferred in 10.638978 secs (98559840 bytes/sec)
1048576000 bytes transferred in 10.678294 secs (98196963 bytes/sec)
1048576000 bytes transferred in 10.808183 secs (97016860 bytes/sec)
1048576000 bytes transferred in 11.084706 secs (94596647 bytes/sec)
1048576000 bytes transferred in 11.231343 secs (93361587 bytes/sec)

Seems that my BIOS settings were not optimal during the tests documented in Comment 2. After loading "Optimal Default" settings and re-activation of the energy efficiency options I do get the following sysctl output now:

$ sysctl dev.cpu.0
dev.cpu.0.temperature: 29,6C
dev.cpu.0.cx_method: C1/hlt C2/io
dev.cpu.0.cx_usage_counters: 180 70723
dev.cpu.0.cx_usage: 0.25% 99.74% last 9261us
dev.cpu.0.cx_lowest: C8
dev.cpu.0.cx_supported: C1/1/1 C2/2/18
dev.cpu.0.freq_levels: 3400/3740 2800/2800 2200/1980
dev.cpu.0.freq: 3400
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=ACPI0007 _UID=0 _CID=none
dev.cpu.0.%location: handle=\_SB_.PLTF.C000
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU

Anyway, after another reboot back to SCHED_ULE and repeating the tests with new BIOS settings I see:

$ t 1 # multiple runs again ...
1048576000 bytes transferred in 7.640816 secs (137233502 bytes/sec)
1048576000 bytes transferred in 6.225996 secs (168418995 bytes/sec)
1048576000 bytes transferred in 4.852763 secs (216078118 bytes/sec)
1048576000 bytes transferred in 4.832574 secs (216980866 bytes/sec)
1048576000 bytes transferred in 4.819031 secs (217590617 bytes/sec)

$ t 4
1048576000 bytes transferred in 4.956440 secs (211558309 bytes/sec)
1048576000 bytes transferred in 7.634614 secs (137344998 bytes/sec)
1048576000 bytes transferred in 7.788965 secs (134623280 bytes/sec)
1048576000 bytes transferred in 7.819442 secs (134098574 bytes/sec)
$ t 4
1048576000 bytes transferred in 4.853663 secs (216038083 bytes/sec)
1048576000 bytes transferred in 4.857143 secs (215883287 bytes/sec)
1048576000 bytes transferred in 7.721846 secs (135793430 bytes/sec)
1048576000 bytes transferred in 7.792580 secs (134560821 bytes/sec)
$ t 4
1048576000 bytes transferred in 4.908570 secs (213621479 bytes/sec)
1048576000 bytes transferred in 7.742029 secs (135439433 bytes/sec)
1048576000 bytes transferred in 7.784341 secs (134703245 bytes/sec)
1048576000 bytes transferred in 7.794096 secs (134534649 bytes/sec)

Very similar to the results in the initial bug report ...

And it really shows that SCHED_ULE causes the inconsistent performance.
But the throughput with SCHED_ULE is a lot higher than with SCHED_4BSD, probably due to the immensely higher system overhead of the latter with a large number of cores.

While the CPU% with SCHED_ULE is in the order of 2%, with SCHED_4BSD I have seen values up to 60% with 32 parallel tasks.

Another observation: WCPU of the bzip2 processes on the kernel with SCHED_4BSD was displayed as way beyond 100% (in the order of 300% in one run I remember). Since bzip2 is not multi-threaded (AFAIK) this seems to be a wrong measurement.
The CPU% for the bzip2 processes is always near 100% with SCHED_ULE.
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2021-12-21 10:49:12 UTC
(In reply to Stefan Eßer from comment #3)
Thank you for the tests!

Another observation is that if your t function is modified to pin bzip2 processes to CPUs (e.g. cpuset -l $((2 * i)) bzip2 ...), then the results are much more consistent.  In fact, they match the best results from the original test:
$ ~/test.sh 4         
1048576000 bytes transferred in 4.903489 secs (213842854 bytes/sec)
1048576000 bytes transferred in 4.904297 secs (213807597 bytes/sec)
1048576000 bytes transferred in 4.908786 secs (213612071 bytes/sec)
1048576000 bytes transferred in 4.915402 secs (213324567 bytes/sec)
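
For reference, the pinned variant of the t function could look roughly like this. It is a sketch, not the exact test.sh used above: the names cpu_for and t_pinned and the optional offset argument are assumptions added for illustration, while the "2 * i" mapping and the use of cpuset -l follow the description above.

```shell
# Logical CPU id for the i-th worker (i starts at 1, as with jot).
# The optional second argument is an offset: 0 selects even ids (default),
# 1 selects odd ids. Both names here are hypothetical.
cpu_for() {
    echo $(( 2 * $1 + ${2:-0} ))
}

# Like t, but pins each bzip2 process to its own logical CPU via cpuset(1).
t_pinned() {
    n=${1:-1}
    offset=${2:-0}
    i=1
    while [ "$i" -le "$n" ]; do
        dd if=/dev/zero bs=1M count=1000 |
            cpuset -l "$(cpu_for "$i" "$offset")" bzip2 - |
            wc > /dev/null &
        i=$((i + 1))
    done 2>&1 | grep transferred
    wait
}
```

"t_pinned 4" pins the four bzip2 processes to logical CPUs 2, 4, 6 and 8; "t_pinned 4 1" would use 3, 5, 7 and 9 instead.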
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2021-12-21 10:56:35 UTC
And a possible "eureka moment" that's consistent with the original description: if I choose odd logical CPUs (cpuset -l $((2 * i + 1))), then I consistently get the worst results:
$ ~/test.sh 4
1048576000 bytes transferred in 7.800822 secs (134418651 bytes/sec)
1048576000 bytes transferred in 7.803095 secs (134379498 bytes/sec)
1048576000 bytes transferred in 7.809778 secs (134264513 bytes/sec)
1048576000 bytes transferred in 7.810560 secs (134251064 bytes/sec)

If this is indeed what it is, then several conclusions can be drawn:
- hardware threads within a core are not born equal on this hardware
- "primary" threads should be preferred
- ULE does not do that
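
Until the scheduler does this by itself, a workload could be restricted to the "primary" threads by hand. A minimal sketch, assuming (as the measurements above suggest, but do not prove in general) that the even-numbered logical CPUs are the primary threads; primary_cpu_list is a hypothetical helper name:

```shell
# Print a comma-separated list of the even logical CPU ids below NCPU,
# e.g. "0,2,4,6" for NCPU=8. Assumes even ids are the primary SMT threads.
primary_cpu_list() {
    ncpu=$1
    list=""
    cpu=0
    while [ "$cpu" -lt "$ncpu" ]; do
        list="${list:+$list,}$cpu"
        cpu=$((cpu + 2))
    done
    echo "$list"
}
```

On FreeBSD this could then be used as cpuset -l "$(primary_cpu_list "$(sysctl -n hw.ncpu)")" followed by the command to run, which keeps the command off the odd-numbered logical CPUs.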