Bug 223914

Summary: [PATCH] Fix bug in which the long term ULE load balancer is executed only once
Product: Base System Reporter: Justinien Bouron <justinien.bouron>
Component: kernAssignee: Don Lewis <truckman>
Status: Closed FIXED    
Severity: Affects Many People CC: Andrew, avg, cem, danilo, emaste, feld, fidaj, hselasky, jeffr, kbowling, kib, markj, mjg, netchild, o.hushchenkov, pi, truckman
Priority: --- Keywords: patch
Version: CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
Patch for the ULE load balancer solving the issue of this bug report.
none
truckman's proposed patch to ULE load balancer none

Description Justinien Bouron 2017-11-27 17:51:02 UTC
Created attachment 188323 [details]
Patch for the ULE load balancer solving the issue of this bug report.

The load balancer of ULE has a bug which cause it to be executed only once at startup, and it will *never* be executed again.

The reason is simple, when the sched_balance function is called for the first time, the smp is not yet setup, and the function returns.
However the balance_tick variable is not resetted before the return and thus will remains 0 for ever.

Because the condition in sched_clock to trigger the load balancer is :
" balance_ticks && --balance_ticks == 0 ", once balance_ticks is 0, this condition will never be met, and thus the load balancer will never be executed again.

To convince yourself you can even put a panic right before the call to sched_balance_group in sched_balance, it will never panic ...

The fix is rather simple, just move the line resetting the balance_ticks value *before* the if branch "if (smp_started == 0 || rebalance == 0)". That way we are sure that the load balancer is re-scheduled.
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2017-11-28 18:36:27 UTC
I think the analysis is correct.

Might be, sched_initticks() should set the balance_ticks as well, after the balance_interval is initialized ?
Comment 2 Justinien Bouron 2017-11-28 20:51:23 UTC
sched_setup calls sched_setup_smp which in turns calls sched_balance that will set the balance_ticks.
Comment 3 Ivan Klymenko 2017-12-12 11:47:46 UTC
Hi.
Maybe, this version use less overhead charges for not SMP and given the sched_random() function.

--- sched_ule.c.orig    2017-11-29 09:23:11.503091000 +0200
+++ sched_ule.c.my1     2017-12-12 13:40:46.330506000 +0200
@@ -880,11 +880,15 @@
 {
        struct tdq *tdq;
 
-       if (smp_started == 0 || rebalance == 0)
+       if (smp_started == 0)
                return;
 
        balance_ticks = max(balance_interval / 2, 1) +
            (sched_random() % balance_interval);
+
+       if (rebalance == 0)
+               return;
+
        tdq = TDQ_SELF();
        TDQ_UNLOCK(tdq);
        sched_balance_group(cpu_top);
Comment 4 Ivan Klymenko 2017-12-12 20:53:13 UTC
(In reply to Ivan Klymenko from comment #3)
Sorry, I think I said foolishness.
Comment 5 Mark Felder freebsd_committer freebsd_triage 2018-07-13 11:28:22 UTC
This is getting attention due to the recent FreeBSD ULE vs Linux CFS research paper which is what caused this bug to be discovered. Is there further analysis needed here?
Comment 6 Ben Woods freebsd_committer freebsd_triage 2018-07-15 00:08:39 UTC
Cc’ing a few people who have made recent changes to the ULE scheduler, and therefore might feel confident reviewing this patch.
Comment 7 Don Lewis freebsd_committer freebsd_triage 2018-07-18 05:56:17 UTC
Created attachment 195229 [details]
truckman's proposed patch to ULE load balancer

(In reply to Konstantin Belousov from comment #1)

Yes, that sounds reasonable.  I also don't see any great need to randomize balance ticks for the first iteration.

Then the call to sched_balance() from sched_setup_smp() is useless, so it can be nuked.

Since there is now only one call to sched_balance(), we can hoist the tests at the top of this function out to the caller and avoid the overhead of the function call
when running a SMP kernel on UP hardware.

Patch attached.
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2018-07-18 11:12:30 UTC
(In reply to Don Lewis from comment #7)
Looks fine to me.
Comment 9 Alexander Leidinger freebsd_committer freebsd_triage 2018-07-21 17:02:41 UTC
What's the status of this? It's not in -current and there are no objections to committing this.
Comment 10 Danilo Egea Gondolfo freebsd_committer freebsd_triage 2018-07-21 22:17:13 UTC
Could these results be a symptom of this issue? 

sudo dtrace -n 'sched:::on-cpu {@[cpu] = count();}'

1           322199
3           368011
0           538659
2          1750862

I was running a make -j4 buildworld while using some desktop apps.
Comment 11 Mark Felder freebsd_committer freebsd_triage 2018-07-22 03:13:11 UTC
(In reply to Danilo Egea Gondolfo from comment #10)

With the patch applied and your dtrace command I have seen results like this:

        2             5869
        7             5968
        3             6077
        6             6677
        9             6854
        5             7180
       10             7221
        0             7232
        1             7919
       11             8171
        8            10777
        4            32122
Comment 12 Don Lewis freebsd_committer freebsd_triage 2018-07-22 07:35:59 UTC
If I start three CPU-bound processes on two CPUs, I see roughly equal sharing.  Without the patch I would expect to see two processes each getting 50% of a CPU while a third would get 100% of its CPU.

%cpuset -l 1-2 /bin/sh
$ while true; do done &
$ while true; do done &
$ while true; do done &

CPU: 12.6% user,  0.0% nice,  0.1% system,  0.0% interrupt, 87.4% idle
Mem: 57M Active, 125M Inact, 996M Wired, 14K Buf, 61G Free
ARC: 423M Total, 139M MFU, 268M MRU, 32K Anon, 3M Header, 12M Other
     123M Compressed, 300M Uncompressed, 2.44:1 Ratio
Swap: 40G Total, 40G Free

  PID USERNAME    THR PRI NICE  SIZE   RES STATE    C   TIME    WCPU COMMAND
 2851 dl            1  92    0   12M    3M CPU2     2   2:00  84.05% /bin/sh
 2852 dl            1  93    0   12M    3M RUN      1   1:57  65.82% /bin/sh
 2850 dl            1  93    0   12M    3M CPU1     1   2:07  52.66% /bin/sh

If someone wants to commit the patch, feel free.  I will only have a limited amount of time for FreeBSD for most of the next week.
Comment 13 commit-hook freebsd_committer freebsd_triage 2018-07-29 00:30:51 UTC
A commit references this bug:

Author: truckman
Date: Sun Jul 29 00:30:07 UTC 2018
New revision: 336855
URL: https://svnweb.freebsd.org/changeset/base/336855

Log:
  Fix the long term ULE load balancer so that it actually works.  The
  initial call to sched_balance() during startup is meant to initialize
  balance_ticks, but does not actually do that since smp_started is
  still zero at that time.  Since balance_ticks does not get set,
  there are no further calls to sched_balance().  Fix this by setting
  balance_ticks in sched_initticks() since we know the value of
  balance_interval at that time, and eliminate the useless startup
  call to sched_balance().  We don't need to randomize the intial
  value of balance_ticks.

  Since there is now only one call to sched_balance(), we can hoist
  the tests at the top of this function out to the caller and avoid
  the overhead of the function call when running a SMP kernel on UP
  hardware.

  PR:		223914
  Reviewed by:	kib
  MFC after:	2 weeks

Changes:
  head/sys/kern/sched_ule.c
Comment 14 commit-hook freebsd_committer freebsd_triage 2018-08-12 03:23:28 UTC
A commit references this bug:

Author: truckman
Date: Sun Aug 12 03:22:28 UTC 2018
New revision: 337678
URL: https://svnweb.freebsd.org/changeset/base/337678

Log:
  MFC r336855

  Fix the long term ULE load balancer so that it actually works.  The
  initial call to sched_balance() during startup is meant to initialize
  balance_ticks, but does not actually do that since smp_started is
  still zero at that time.  Since balance_ticks does not get set,
  there are no further calls to sched_balance().  Fix this by setting
  balance_ticks in sched_initticks() since we know the value of
  balance_interval at that time, and eliminate the useless startup
  call to sched_balance().  We don't need to randomize the intial
  value of balance_ticks.

  Since there is now only one call to sched_balance(), we can hoist
  the tests at the top of this function out to the caller and avoid
  the overhead of the function call when running a SMP kernel on UP
  hardware.

  PR:		223914
  Reviewed by:	kib

Changes:
_U  stable/11/
  stable/11/sys/kern/sched_ule.c
Comment 15 Don Lewis freebsd_committer freebsd_triage 2018-08-12 03:49:29 UTC
FreeBSD 10.x is not affected by this bug.