Created attachment 188323 [details]
Patch for the ULE load balancer solving the issue of this bug report.

The ULE load balancer has a bug that causes it to be executed only once at startup; it is *never* executed again.

The reason is simple: when sched_balance() is called for the first time, SMP is not yet set up, and the function returns early. However, the balance_ticks variable is not reset before that return, and thus remains 0 forever. Because the condition in sched_clock() that triggers the load balancer is "balance_ticks && --balance_ticks == 0", once balance_ticks is 0 this condition can never be met, so the load balancer is never executed again. To convince yourself, you can even put a panic right before the call to sched_balance_group() in sched_balance(); it will never fire.

The fix is rather simple: just move the line resetting balance_ticks *before* the "if (smp_started == 0 || rebalance == 0)" branch. That way we are sure the load balancer is re-armed.
I think the analysis is correct. Perhaps sched_initticks() should set balance_ticks as well, after balance_interval is initialized?
sched_setup() calls sched_setup_smp(), which in turn calls sched_balance(), which sets balance_ticks.
Hi. Maybe this version has less overhead for the non-SMP case, given the sched_random() function.

--- sched_ule.c.orig	2017-11-29 09:23:11.503091000 +0200
+++ sched_ule.c.my1	2017-12-12 13:40:46.330506000 +0200
@@ -880,11 +880,15 @@
 {
 	struct tdq *tdq;
 
-	if (smp_started == 0 || rebalance == 0)
+	if (smp_started == 0)
 		return;
+
 	balance_ticks = max(balance_interval / 2, 1) +
 	    (sched_random() % balance_interval);
+
+	if (rebalance == 0)
+		return;
+
 	tdq = TDQ_SELF();
 	TDQ_UNLOCK(tdq);
 	sched_balance_group(cpu_top);
(In reply to Ivan Klymenko from comment #3)

Sorry, I think I said something foolish.
This is getting attention due to the recent FreeBSD ULE vs Linux CFS research paper which is what caused this bug to be discovered. Is there further analysis needed here?
Cc’ing a few people who have made recent changes to the ULE scheduler, and therefore might feel confident reviewing this patch.
Created attachment 195229 [details]
truckman's proposed patch to ULE load balancer

(In reply to Konstantin Belousov from comment #1)

Yes, that sounds reasonable. I also don't see any great need to randomize balance_ticks for the first iteration. The call to sched_balance() from sched_setup_smp() is then useless, so it can be nuked. Since there is now only one call to sched_balance(), we can hoist the tests at the top of that function out to the caller and avoid the overhead of the function call when running an SMP kernel on UP hardware. Patch attached.
(In reply to Don Lewis from comment #7) Looks fine to me.
What's the status of this? It's not in -current and there are no objections to committing this.
Could these results be a symptom of this issue?

sudo dtrace -n 'sched:::on-cpu {@[cpu] = count();}'

        1           322199
        3           368011
        0           538659
        2          1750862

I was running a make -j4 buildworld while using some desktop apps.
(In reply to Danilo Egea Gondolfo from comment #10)

With the patch applied and your dtrace command I have seen results like this:

        2             5869
        7             5968
        3             6077
        6             6677
        9             6854
        5             7180
       10             7221
        0             7232
        1             7919
       11             8171
        8            10777
        4            32122
If I start three CPU-bound processes on two CPUs, I see roughly equal sharing. Without the patch I would expect to see two processes each getting 50% of a CPU while a third would get 100% of its CPU.

% cpuset -l 1-2 /bin/sh
$ while true; do done &
$ while true; do done &
$ while true; do done &

CPU: 12.6% user, 0.0% nice, 0.1% system, 0.0% interrupt, 87.4% idle
Mem: 57M Active, 125M Inact, 996M Wired, 14K Buf, 61G Free
ARC: 423M Total, 139M MFU, 268M MRU, 32K Anon, 3M Header, 12M Other
     123M Compressed, 300M Uncompressed, 2.44:1 Ratio
Swap: 40G Total, 40G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 2851 dl            1  92    0    12M     3M CPU2     2   2:00  84.05% /bin/sh
 2852 dl            1  93    0    12M     3M RUN      1   1:57  65.82% /bin/sh
 2850 dl            1  93    0    12M     3M CPU1     1   2:07  52.66% /bin/sh

If someone wants to commit the patch, feel free. I will only have a limited amount of time for FreeBSD for most of the next week.
A commit references this bug:

Author: truckman
Date: Sun Jul 29 00:30:07 UTC 2018
New revision: 336855
URL: https://svnweb.freebsd.org/changeset/base/336855

Log:
  Fix the long term ULE load balancer so that it actually works.

  The initial call to sched_balance() during startup is meant to
  initialize balance_ticks, but does not actually do that since
  smp_started is still zero at that time.  Since balance_ticks does not
  get set, there are no further calls to sched_balance().  Fix this by
  setting balance_ticks in sched_initticks() since we know the value of
  balance_interval at that time, and eliminate the useless startup call
  to sched_balance().  We don't need to randomize the initial value of
  balance_ticks.

  Since there is now only one call to sched_balance(), we can hoist the
  tests at the top of this function out to the caller and avoid the
  overhead of the function call when running a SMP kernel on UP hardware.

  PR:		223914
  Reviewed by:	kib
  MFC after:	2 weeks

Changes:
  head/sys/kern/sched_ule.c
A commit references this bug:

Author: truckman
Date: Sun Aug 12 03:22:28 UTC 2018
New revision: 337678
URL: https://svnweb.freebsd.org/changeset/base/337678

Log:
  MFC r336855

  Fix the long term ULE load balancer so that it actually works.

  The initial call to sched_balance() during startup is meant to
  initialize balance_ticks, but does not actually do that since
  smp_started is still zero at that time.  Since balance_ticks does not
  get set, there are no further calls to sched_balance().  Fix this by
  setting balance_ticks in sched_initticks() since we know the value of
  balance_interval at that time, and eliminate the useless startup call
  to sched_balance().  We don't need to randomize the initial value of
  balance_ticks.

  Since there is now only one call to sched_balance(), we can hoist the
  tests at the top of this function out to the caller and avoid the
  overhead of the function call when running a SMP kernel on UP hardware.

  PR:		223914
  Reviewed by:	kib

Changes:
  _U  stable/11/
  stable/11/sys/kern/sched_ule.c
FreeBSD 10.x is not affected by this bug.