Bug 173541 - High (0.60+, 1.00) idle load averages
Summary: High (0.60+, 1.00) idle load averages
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-acpi (Nobody)
URL:
Keywords: needs-qa, regression
Depends on:
Blocks:
 
Reported: 2012-11-10 23:50 UTC by Viktor Štujber
Modified: 2023-01-25 21:05 UTC
13 users

See Also:
koobs: maintainer-feedback? (jkim)
koobs: maintainer-feedback? (kib)
koobs: mfc-stable13?
koobs: mfc-stable12?


Attachments
messages.diff (3.11 KB, text/plain), 2012-11-11 19:26 UTC, Viktor Štujber
messages.diff (3.11 KB, text/plain), 2012-11-13 15:58 UTC, Viktor Štujber
broken loadavgs, a fix, then fixed loadavgs (57.58 KB, text/plain), 2016-03-03 14:45 UTC, Ian Smith
verbose boot Lenovo X200 (53.06 KB, text/plain), 2016-03-03 14:47 UTC, Ian Smith
ps shows idle 1000% cpu (20.48 KB, text/plain), 2022-04-10 00:33 UTC, Bluey
dmesg.boot (10.50 KB, text/plain), 2022-04-10 00:35 UTC, Bluey

Description Viktor Štujber 2012-11-10 23:50:00 UTC
I run FreeBSD on an Intel Atom D525MWV board. On recent FreeBSD versions, I am observing unusual load averages reported by 'top' even when the system is 100% idle. It happens on a normal system, in single-user mode, and on the official USB install images. I have not yet confirmed whether the load is real or whether the kernel is just reporting the values incorrectly.

Here are the kernels that I have tried. Four of them are from the currently available USB memstick installers; the 2011 one is the version I had been using successfully for a year until downgrading recently.

8.3-RELEASE  amd64 2012-04-09           -> load 0.02 after 5 minutes idle
9.0-RELEASE  amd64 2012-01-03           -> load 0.02 after 5 minutes idle
10.0-CURRENT amd64 2011-11-12 (cvs)     -> load 0.02 after 5 minutes idle
9.1-RC3      amd64 2012-10-30 (r242324) -> load 0.60 after 5 minutes idle
10.0-CURRENT amd64 2012-11-02 (r242464) -> load 0.60 after 5 minutes idle

How-To-Repeat: 1) boot system on affected hardware(?)
2) leave 'top' running for a few minutes
3) observe load averages
Comment 1 Viktor Štujber 2012-11-11 05:50:10 UTC
A night without sleep and a chain of manual bisection steps later:
stable/9 r232058 2012-02-23 - pass
stable/9 r232129 2012-02-24 - fail
(will continue to narrow it down later)
Comment 2 Viktor Štujber 2012-11-11 19:26:25 UTC
The problem first occurs on 232086 (original head revision 231161)
http://svnweb.freebsd.org/base?view=revision&revision=232086

It has some ACPI-related and tick-source-related changes regarding their
probing order/priority. I observed that this does indeed change the way
they are reported in /var/log/messages. In particular, the order changes
and they are initialized much earlier in the startup process. Diff attached.
Comment 3 Viktor Štujber 2012-11-13 15:58:03 UTC
The offending revision where the issue can first be observed in stable/9
is r232086 (MFC of r231161):
http://svnweb.freebsd.org/base?view=revision&revision=232086
The consequence of that change can be observed in /var/log/messages:
the timer sources get initialized much earlier and in a different order.
Diff attached.
Comment 4 je.broman 2013-01-03 22:19:00 UTC
Hi!

I'm getting the exact same issue. Upgraded from FreeBSD 9.0 to FreeBSD 9.1
and my load avg jumped from 0.01 to 0.60, while the server is doing pretty
much nothing at all.

I'm also seeing the timer differences in dmesg as described earlier in the
bug report.

# sysctl -a | egrep "kern.event|clock"
kern.clockrate: { hz = 1000, tick = 1000, profhz = 8126, stathz = 127 }
kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) LAPIC(400)
i8254(100) RTC(0)
kern.eventtimer.et.LAPIC.flags: 15
kern.eventtimer.et.LAPIC.frequency: 0
kern.eventtimer.et.LAPIC.quality: 400
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.quality: 450
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.quality: 440
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.quality: 440
kern.eventtimer.periodic: 0
kern.eventtimer.timer: HPET
kern.eventtimer.activetick: 1
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2
debug.acpi.reset_clock: 1
debug.clocktime: 0
hw.clockrate: 2133
machdep.wall_cmos_clock: 0
dev.atrtc.0.%desc: AT realtime clock


# dmesg | more
Copyright (c) 1992-2012 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012
    root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
CPU: Intel(R) Core(TM)2 CPU          6400  @ 2.13GHz (2133.45-MHz K8-class
CPU)
  Origin = "GenuineIntel"  Id = 0x6f2  Family = 6  Model = f  Stepping = 2

Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
real memory  = 4296015872 (4097 MB)
avail memory = 4101558272 (3911 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <COMPAQ GLENWOOD>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
ioapic0: Changing APIC ID to 1
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <HPQOEM SLIC-WKS> on motherboard
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, dff00000 (3) failed
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
hpet1: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
device_attach: hpet0 attach returned 12
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0xf808-0xf80b on acpi0

Regards
Johan
Comment 5 Viktor Štujber 2013-01-03 22:44:46 UTC
For my system's kernel I simply reverted that one revision I mentioned
earlier, with no ill effects observed. Since the change is so low-level and
there is no rationale provided in the commit message, I do not know what
improvement it was supposed to achieve. I tried e-mailing the author of
that commit, but got no response.
Also, due to the low-power nature of my system, I can't tell whether the
0.60 load is real or just a by-product of broken time accounting. Either
way it might be affecting the process scheduler.
Comment 6 je.broman 2013-01-03 23:06:16 UTC
Hi!

I see. I got around the problem in FreeBSD 9.1 by changing the clock
source. Like this:

# sysctl -w kern.eventtimer.timer=LAPIC

The load avg dropped to 0.00 again. You can try the different clock sources
available to you and see what works best. Some sources might increase the
number of interrupts (like the RTC) and some might cause more context
switching or CPU load. For me the LAPIC works best. You can list your clock
sources using:

# sysctl kern.eventtimer.choice
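
For completeness, the same knob can also be read and changed programmatically via sysctlbyname(3). A minimal sketch of my own (untested here; changing the timer requires root, and the name must be one of those listed in kern.eventtimer.choice):

/* Sketch: read and switch the event timer, equivalent to the sysctl(8)
 * commands above. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char cur[32];
    size_t len = sizeof(cur);
    const char *want = "LAPIC";

    /* Read kern.eventtimer.timer (the currently selected event timer). */
    if (sysctlbyname("kern.eventtimer.timer", cur, &len, NULL, 0) == 0)
        printf("current event timer: %s\n", cur);

    /* Switch to LAPIC; fails with EPERM unless run as root. */
    if (sysctlbyname("kern.eventtimer.timer", NULL, NULL,
        want, strlen(want) + 1) == -1)
        perror("setting kern.eventtimer.timer");
    return 0;
}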

From what I understand, measuring time can be tricky because new systems
can regulate core frequency on the fly (in HW) and virtualization also
increases the difficulty...

It seems the order and/or weight of the clock sources has changed. When I
have a chance I will reboot into the old kernel and see what has been
changed. I'm totally new to FreeBSD so I don't know the normal handling of
these bugs but I'll put the author of the patch on cc as well :)

Good luck!
Johan
Comment 7 frederique 2014-01-31 06:14:14 UTC
I'm seeing the same issues in FreeBSD 10-RELEASE.

A VM has a load average of about 0.5, and a physical install shows a load
average of ~0.3. My tests were even done in single-user mode.

https://forums.freebsd.org/viewtopic.php?f=4&t=44600
Comment 8 Jeremy Chadwick 2015-10-27 04:44:43 UTC
(Warning: below is a long comment with a large amount of detail.  I am CC'ing mav@ because I believe this may have to do with his eventtimer changes made sometime in 2012[1].  I'll provide some information that may greatly help narrow down the root cause near the end).

I reported this problem, specific to stable/10 (i.e. the issue I see may be different than the one reported here), back in June 2014[2].  A visual representation of the difference between stable/10 and stable/9 on said hardware (bare-metal, not virtualised) is available[3].  The system in question still runs stable/9 (r289992) today without any load oddities.

All of my KTR details and so on from that system are available here[4], where Adrian Chadd attempted to help me figure out what was going on.  The mail thread (see [1] again) consisted of many people chiming in with useful details.  At some point during this thread, maybe not publicly (I would need to review my mail archives), someone (John Baldwin?  I'm not certain, my memory is not as good as it used to be) informed me that stable/10 had its load calculation algorithm changed.

Since July 2014, I have been able to reproduce this problem with stable/10 on several different types of systems:

- Supermicro X7SBA (bare metal)
- VMware Workstation 11.x (FreeBSD stable/10 is a guest)
- VPS provider Vultr.com (using QEMU) (FreeBSD stable/10 is a guest)

There are, however, some individuals running stable/10 (e.g. a friend of mine using FreeNAS) on bare-metal hardware that do not experience this problem.

Today I spent some time on a stable/10 (r289827) guest under VMware Workstation 11.x where I was able to reproduce the issue and found something surprising:

kern.eventtimer.periodic=1 completely alleviates the problem -- the instant this sysctl is set to 1 ("periodic mode") the load begins to plummet and eventually hits 0.00, only increasing when there is "real" load.

You can even set kern.eventtimer.periodic=0 ("one-shot") maybe 5-6 minutes after that and still see the system load at 0.00, although over the next 10+ minutes it will begin to gradually increase again.

The downside/problem of using "periodic mode": vmstat -i shows a cpuX:timer rate of almost 2000 on both CPU cores, which, if I understand eventtimers(4) correctly, would indicate extremely high interrupt usage (in my case on the LAPIC, i.e. the local APIC).

I just find it very interesting that somehow kern.eventtimer.periodic=1 is able to alleviate the issue.  This leads me to wonder if the issue is with eventtimers(4) changes, or with the load calculation algorithm (or some "intermediary" piece that isn't taking into consideration certain things).

On the VM in question, the eventtimer choices are:

root@test-freebsd:~ # sysctl -a | grep kern.eventtimer
kern.eventtimer.periodic: 0
kern.eventtimer.timer: LAPIC
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2
kern.eventtimer.choice: LAPIC(600) i8254(100) RTC(0)
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.LAPIC.quality: 600
kern.eventtimer.et.LAPIC.frequency: 33000101
kern.eventtimer.et.LAPIC.flags: 7

On a Vultr.com VPS (QEMU), the eventtimer choices are (according to stable/9, which I use there):

kern.eventtimer.singlemul: 2
kern.eventtimer.idletick: 0
kern.eventtimer.activetick: 1
kern.eventtimer.timer: LAPIC
kern.eventtimer.periodic: 0
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.LAPIC.quality: 400
kern.eventtimer.et.LAPIC.frequency: 500017353
kern.eventtimer.et.LAPIC.flags: 15
kern.eventtimer.choice: LAPIC(400) i8254(100) RTC(0)

And finally on the aforementioned Supermicro X7SBA box, the eventtimer choices (per stable/9) are:

kern.eventtimer.singlemul: 2
kern.eventtimer.idletick: 0
kern.eventtimer.activetick: 1
kern.eventtimer.timer: LAPIC
kern.eventtimer.periodic: 0
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.LAPIC.quality: 400
kern.eventtimer.et.LAPIC.frequency: 166681842
kern.eventtimer.et.LAPIC.flags: 15
kern.eventtimer.choice: LAPIC(400) i8254(100) RTC(0)

Per hpet(4) I have tried the following in /boot/loader.conf on all of the above VM systems to no avail (probably because they do not use HPET):

hint.hpet.X.legacy_route=1
hint.attimer.0.clock=0
hint.atrtc.0.clock=0

Another thing I found interesting on stable/10 under the VMware VM was that this message is printed:

Event timer "HPET" frequency 14318180 Hz quality 550

Yet that event timer is nowhere to be found.

Anyway, this problem has come up *repeatedly* on multiple FreeBSD mailing lists and on the forums yet has gotten no traction[5][6][7][8][9][10].  I feel, respectfully, that this has gone on long enough.  This problem greatly diminishes the feasibility of using FreeBSD as a server operating system (when you can no longer trust or rely on the load average, what can you believe?).  End users and even skilled/senior administrators do not know how to diagnose, troubleshoot, and rectify this problem given where it lies in kernel space.  Please remember: the problem is 100% reproducible.

I am happy to pay for a low-end Vultr.com dedicated VPS (includes console access as long as you have a web browser w/ Javascript) and/or set up a dedicated VMware Workstation VM (w/ remote VNC console) for any developer which wants to take it on, on my own dime.  I cannot provide the aforementioned Supermicro X7SBA box for testing (it's my FreeBSD box and must remain usable).

Alexander (mav@) -- what can we provide you that can help narrow this down?

P.S. -- The Importance field needs to be changed from Affects Only Me to Affects Some People, as this is definitely not specific to just Viktor.  Also, I would strongly suggest changing Depends On to reference bug 192315, which references eventtimer issues (but may be a different problem -- I'll let mav@ decide).

[1]: https://lists.freebsd.org/pipermail/freebsd-net/2012-April/031893.html
[2]: https://lists.freebsd.org/pipermail/freebsd-stable/2014-July/079386.html
[3]: http://jdc.koitsu.org/freebsd/releng10_perf_issue/loadavg_comparison.png
[4]: http://jdc.koitsu.org/freebsd/releng10_perf_issue/
[5]: https://lists.freebsd.org/pipermail/freebsd-virtualization/2014-October/002835.html
[6]: https://forums.freebsd.org/threads/high-idle-load-in-vps.52688/
[7]: https://forums.freebsd.org/threads/high-cpu-utilization-by-average-load-values-while-no-processes-in-top-output-have-high-cpu-load.48933/
[8]: https://lists.freebsd.org/pipermail/freebsd-bugs/2013-January/051385.html
[9]: https://lists.freebsd.org/pipermail/freebsd-xen/2014-January/002006.html
[10]: https://forums.freebsd.org/threads/high-load-average-with-idle-state.38757/
Comment 9 Alexander Motin freebsd_committer freebsd_triage 2015-10-27 08:33:02 UTC
I don't know whether it is the only cause of such behavior, but I have one logical explanation.  The problem stems from the sampled nature of our load average calculation.  To reduce power usage, the new clock code tries to aggregate events into bursts to reduce the number of CPU wakeups.  As a result, if the system is idle, a CPU may wake up as few as two times per second.  And if that CPU has some other events waking it up, it is quite likely they will be aggregated with the load average sampling.  As a result, there is a fairly high probability that the sampler will see some "load" to account, with no way of knowing that the load will disappear a microsecond later for the rest of the second.  Enabling periodic interrupts breaks the event aliasing -- the load average sampling is more likely to fire at moments when the system is really idle.  But do we want to consume several extra watts of power just for one good-looking number?  To solve this we should probably change our mechanism of load average calculation somehow.  If anybody has ideas -- I am all ears.
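
To illustrate the aliasing described above, here is a toy user-space sketch (my own construction, not kernel code; the two sample streams are invented for illustration). It feeds the same exponential decay used for the kernel's 1-minute figure with two synthetic inputs: one where the sampling callout always lands in the same wakeup burst as some short-lived work, and one where it almost never does. The CPU time actually consumed is negligible in both cases, yet the reported figures differ by nearly two orders of magnitude.

/* Toy model of load average sampling aliasing; not kernel code. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double cexp1 = exp(-1.0 / 12.0); /* 1-minute decay per 5 s sample */
    double aliased = 0.0, decorrelated = 0.0;
    int sample;

    for (sample = 0; sample < 120; sample++) {      /* ~10 minutes of samples */
        int nrun_aliased = 1;                       /* sampler always sees the burst */
        int nrun_decorrelated = (sample % 50 == 0); /* sampler rarely catches work */

        aliased      = cexp1 * aliased      + (1.0 - cexp1) * nrun_aliased;
        decorrelated = cexp1 * decorrelated + (1.0 - cexp1) * nrun_decorrelated;
    }
    printf("aliased sampling:      ~%.2f\n", aliased);      /* approaches 1.00 */
    printf("decorrelated sampling: ~%.2f\n", decorrelated); /* stays around 0.02 */
    return 0;
}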
Comment 10 Viktor Štujber 2015-10-27 08:49:59 UTC
On 10.2-RELEASE, my long-term load average is around 0.10, which may be closer to reality because of powerd (225-1800MHz frequency scaling range). The currently chosen timecounter is TSC and the eventtimer is HPET. Unfortunately I forgot to note down the previous values. I can try an older kernel if needed.
Comment 11 amvandemore 2015-10-29 13:13:36 UTC
Load Average has been known to be broken for a long time, pre-dating this by some time I think:

https://lists.freebsd.org/pipermail/freebsd-current/2011-September/027138.html

I think mav's observation is basically what's happening above.
Comment 12 Ian Smith 2016-03-03 14:45:47 UTC
Created attachment 167688 [details]
broken loadavgs, a fix, then fixed loadavgs
Comment 13 Ian Smith 2016-03-03 14:47:56 UTC
Created attachment 167689 [details]
verbose boot Lenovo X200
Comment 14 Ian Smith 2016-03-03 14:51:45 UTC
The first 40% of the attachment is a study of the "0.6" erroneous loadavgs
experienced also by others, shown in some detail over long periods at
virtually complete idle, and also with a number of 100% CPU-bound tasks running.

This shows that the oft-mentioned "0.6" figure (seemingly randomly variable
over a range of ~0.45 to ~0.75, with some outliers even beyond that) is an
ADDITIVE increment to the real loadavg, for real 100% CPU loads of up to 11 here.

At the 40% mark of the report a pair of cpuset(1) commands are applied, which
on this machine completely fix the broken loadavgs, permanently until reboot.

I'll refrain from further speculation on what might cause this to happen.

The rest of the report illustrates the well-behaved nature of HPET-derived
loadavgs (both one-shot and periodic) at idle and at real loads of 1, 2, 4
and 11 CPU-bound tasks, after this particular magic spell has been cast.

Verbose dmesg is also attached.
Comment 15 Jeremy Chadwick 2016-08-15 18:34:32 UTC
Thought I'd add that I checked 11.0-BETA4 for this problem: it still exists.
Comment 16 Aleksander Alekseev 2016-08-15 19:13:58 UTC
Jeremy, you probably shouldn't test LA on a -BETA since it's built with INVARIANTS and WITNESS enabled. I'm not an expert, but I wouldn't be surprised if these options affect LA.

I've just checked 11.0-RC1 (whose kernel is built without these options) as described above and got an LA of 0.03. Since the system is never doing literally nothing (there is the system clock, the process scheduler, etc.), I would say this is pretty much expected behavior. At least that's what I see on my hardware (Fujitsu Lifebook E733).
Comment 17 Jeremy Chadwick 2016-08-15 19:30:49 UTC
The problem varies per hardware/system (see comment #8), including multiple types/implementations of VMs.  Ian Smith (smithi@nimnet.asn.au) already covered part of this in his "broken loadavgs, a fix, then fixed loadavgs" attachment (it has a lengthy explanation).  The issue almost certainly has something to do with eventtimers or timecounters.  Ian and I chatted for a bit off-ticket about some of the possibilities, but neither of us could narrow down the root cause.

I'm happy to wait for 11.0-RELEASE and try that, but I'm inclined to believe the issue is still present unless a committer can point to a commit that actually fixed it; the only commits I follow presently are to stable/9, and a search of stable/10 for the word "load" turns up no commits that look relevant.  So, call it sysadmin + FreeBSD-user-since-1997 gut instinct.  :-)
Comment 18 Jeremy Chadwick 2017-01-20 00:48:34 UTC
The problem described continues to exist on the following present-day FreeBSD releases (I've tested all of them; I did not test x86):

* 10.3-RELEASE amd64
* 11.0-RELEASE amd64
* 12.0-CURRENT amd64 (r309773)

Getting traction on this is important due to the EoL of the 9.x series as of January 1st 2017, compounded by the ports infrastructure strongly advocating ceasing use of 9.x (see ports r431746 for proof).
Comment 19 Mark Linimon freebsd_committer freebsd_triage 2017-01-27 08:21:14 UTC
Apparently ACPI-related.  Notifying the committer of r232086, although this commit was made several years ago.
Comment 20 Jeremy Chadwick 2017-01-27 18:15:37 UTC
For what it's worth: below is a spreadsheet of all the relevant systems I've tested, including OS version, timecounters available (and chosen), eventtimers available (and chosen), etc.

https://docs.google.com/spreadsheets/d/1xsKZ6yNMWW6TA_ZaYMwzZ-vI3C-pSQ9QAJk_yjPBOIQ/edit?usp=sharing
Comment 21 Jeremy Chadwick 2017-01-28 10:38:08 UTC
I wanted to add this piece of data (reflected in spreadsheet):

I was able to reproduce this problem on Amazon EC2 using official FreeBSD AMIs (both 10.3-STABLE and 11.0-RELEASE) on a t2.micro instance.  Launch, wait several minutes, then look at the system.  There, timecounter is XENTIMER, and eventtimer is XENTIMER.
Comment 22 Nick 2017-02-19 22:47:56 UTC
Hi,

I have the same problem on pretty much all of the VPSes I use at different providers. On a brand new FreeBSD 11.0 amd64 install at Leaseweb, VULTR, Liteserver.nl, Amazon, etc I see high loads (sometimes hitting up to 2.00) on idle virtual servers with nothing installed.

I've seen this high load issue reported many times on various sites and forums. It makes it hard to accurately see just how much cpu or load my virtual servers are using since I can't rely on FreeBSD's load report.

Thank you,
- Nick
Comment 23 Kubilay Kocak freebsd_committer freebsd_triage 2022-02-09 00:19:17 UTC
^Triage: Update to latest -RELEASE version referenced as affected. A user on IRC just reported 1.00 LAs on an Oracle (13.0-p7 amd64, micro, free) VPS.

Default timecounter: HPET; no change with ACPI-fast, but LAPIC increases the idle load average to 2.00 (from 1.00).
Comment 24 Kubilay Kocak freebsd_committer freebsd_triage 2022-02-09 00:21:53 UTC
(In reply to Kubilay Kocak from comment #23)

timecounter: i8254 dropped idle load average from 1.00 to 0.1-0.3 ("normal")
Comment 25 Kubilay Kocak freebsd_committer freebsd_triage 2022-02-09 00:28:26 UTC
^Triage: Re-triage given the latest report on 13.0.

Note also (explicitly) the original bisection to base r232086, an MFC of base r231161.

I cannot identify anyone substantially involved in timecounters since then apart from jkim (the original committer); requesting feedback from them and kib@ for advice and potential resolution, given that an original regression commit was identified.
Comment 26 Kubilay Kocak freebsd_committer freebsd_triage 2022-02-09 00:30:04 UTC
For anyone (still) affected or able to reproduce this issue, please include:

- /var/run/dmesg.boot output (as attachments)
Comment 27 Alexander Motin freebsd_committer freebsd_triage 2022-02-09 01:17:05 UTC
@Kubilay Kocak, I don't think this should be related to timecounters.  It must be a coincidence.  As I wrote 7 years ago in comment 9, I believe it is a result of some synchronization between the load average callouts and some other events.  Since you've mentioned HPET, I'd think not of HPET as a timecounter, but of HPET as an eventtimer.  Depending on the hardware and the number of CPUs, the kernel may have to use a single HPET interrupt to deliver timer interrupts to multiple CPUs.  As a result, it is very easy for two or more CPUs to wake up from a long sleep at the same time (the one the HPET interrupt is routed to and another where the timer event is actually scheduled), which may cause the load average callout to sample simultaneous threads.
Comment 28 Bluey 2022-04-10 00:33:29 UTC
Created attachment 233113 [details]
ps shows idle 1000% cpu
Comment 29 Bluey 2022-04-10 00:35:49 UTC
Created attachment 233114 [details]
dmesg.boot

as requested
13.1-RC2
Comment 30 Bluey 2022-04-10 00:46:46 UTC
HP Z2 Mini, 6-core Xeon.  The X system monitor shows 1 of 12 CPUs at 60-100%.  It didn't do this a week or two ago, on the previous 13.1-BETA3 and 13.1-RC1.  Possibly caused by the kde-plasma5 install?  Today is the first time I have heard the CPU fan spin up, when the all-core average appeared to be >60%.  Maybe it got worse since I enabled KDE HighDPI; I will disable that.
Comment 31 Bluey 2022-04-10 00:59:28 UTC
So, having turned off sddm/KDE HighDPI and rebooted, ps still shows the idle thread with high CPU.  The KDE system monitor reports one CPU at 60-100%, while its process list doesn't show idle and the sorted list shows the worst performer at 1% CPU.  The GNOME system monitor shows the 100% idle CPU split across several CPUs on the graph, each between 0-10%.
Comment 32 Stavros Filargyropoulos 2023-01-21 19:34:54 UTC
Is there any interest in debugging this further?

I can reproduce this on my home PC, Supermicro X12STL-IF with Xeon E-2314. Freshly installed with nothing running and 100% idle CPU, the load is hovering between 0.20-0.40.
Comment 33 Gordon Bergling freebsd_committer freebsd_triage 2023-01-22 00:45:08 UTC
It was fixed with timer updates in stable/12 and stable/13 a few months ago.
Comment 34 Stavros Filargyropoulos 2023-01-22 01:08:54 UTC
@Gordon (In reply to Gordon Bergling from comment #33)

Cool, do you have the revision handy by any chance?
Comment 35 Stavros Filargyropoulos 2023-01-22 19:30:20 UTC
Does anyone know the commit that fixed this, and what the root cause was?

I still see the problem with:

$ uname -a
FreeBSD nas 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n253535-4a8af507ebe5: Fri Jan 20 03:21:10 UTC 2023     root@releng3.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64


last pid:  1545;  load averages:  0.31,  0.11,  0.04                                                                                                                                       up 0+00:02:34  11:30:02
128 threads:   5 running, 103 sleeping, 20 waiting
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 15M Active, 2260K Inact, 229M Wired, 22M Buf, 31G Free
Swap: 3962M Total, 3962M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31     0B    64K RUN      3   2:33 100.00% idle{idle: cpu3}
   11 root        155 ki31     0B    64K CPU2     2   2:33  99.95% idle{idle: cpu2}
   11 root        155 ki31     0B    64K CPU0     0   2:33  99.53% idle{idle: cpu0}
   11 root        155 ki31     0B    64K CPU1     1   2:34  97.14% idle{idle: cpu1}
    0 root        -76    -     0B   640K -        1   0:00   0.04% kernel{if_config_tqg_0}
Comment 36 Mark Millard 2023-01-22 23:37:20 UTC
Context: aarch64 HoneyComb (so: 16 cores), main non-debug build,
no video hardware, serial console and ssh access in use.

I see the likes of: load averages:   0.20,   0.22,   0.25
in top -SIaocpu . The list of non-idle, non-zero WCPU% columns
is usually something like:

0.05% top -SIaocpu
0.03% [clock]
0.01% [intr]
0.01% [usb]

Sometimes there is also something like:

0.03% [acpi_thermal]
or:
0.01% [kernel]

Less often (varies more widely but is non-zero less often):

0.15% [zfskern]

There are also a number of processes listed with 0.00%,
such as (expect some overlaps with the above):

0.00% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
0.00% sshd: markmi@pts/1 (sshd)
0.00% /usr/sbin/ntpd -p /var/db/ntp/ntpd.pid -c /etc/ntp.conf -f /var/db/ntp/ntpd.drift -g
0.00% [intr]
0.00% [pagedaemon]
0.00% [bufdaemon]
0.00% [rand_harvestq]
0.00% [zfskern]
0.00% [syncer]
0.00% [vnlru]
0.00% [acpi_cooling0]

Truly zero CPU usage processes are not listed.

Definitely doing somewhat more than nothing, as is normal for
so-called "idle" as far as I know.

What about this would make the load averages unreasonable?
(I'm also not aware of any sort of technical accuracy
criteria being claimed for the load averages; more general
judgments about "good enough" are involved instead.)

For reference:

# uname -apKU # long output line split for readability
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #75
main-n259950-5723e5ac6d76-dirty: Fri Jan  6 01:50:53 PST 2023
root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
arm64 aarch64 1400077 1400077
Comment 37 Stavros Filargyropoulos 2023-01-23 00:05:09 UTC
(In reply to Mark Millard from comment #36)

https://svnweb.freebsd.org/base/head/sys/kern/kern_synch.c?view=markup#l87

As far as I understand, the OS looks at the run queue every 5 seconds and then averages it out over 1, 5 and 15 minutes.

A load of 0.20 for the 1-minute counter would mean that, out of the 6 times it looked at the run queue, there was 1 job there in 2-3 of them.
Comment 38 Mark Millard 2023-01-23 02:16:45 UTC
(In reply to Stavros Filargyropoulos from comment #37)

With 5-second sampling intervals, sampling 6 intervals
spans 30 seconds, not 60 seconds.

For an arithmetic mean of 0.2 over 12 values, that would mean that the
total of the 12 values would be 2.4.  (But the code does not use
the arithmetic mean, so that is only suggestive.)

I still do not see how this is to be classified as unreasonable
for the background activity present, even if it was an arithmetic
mean.


You may want to ignore the following . . .

But it does not keep track of the 12 values for a minute
and then find the arithmetic mean: "average" here is more
complicated.

For i==0 (i.e., the 1-minute context), first setting up
context for the code quoted below:

cexp[i] is listed as: 0.9200444146293232 * FSCALE,    /* exp(-1/12) */
The 0.9200444146293232 is exp(-1/12), and the 12
looks to come from 5 sec * 12 == 60 sec.  FSCALE is 2048;
2048 in turn is 1<<FSHIFT, with FSHIFT==11.

Given all that, the new-sample update looks like (i==0 for
the 1 minute context):
avg->ldavg[i] = (cexp[i] * avg->ldavg[i] +
                 nrun * FSCALE * (FSCALE - cexp[i]))
                >> FSHIFT;

nrun is the new sample.

It does not look like a uniform 1-minute average.  It is more like
a figure in which the contributions of the nrun values from
more than 12 steps back have only a small influence on the
value.  Even the 12 more recent nrun values do not
look to have a uniform influence on the result, just a larger
one than the older values.  (Not a very detailed classification,
sorry.)
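
To make that concrete, here is a minimal user-space replica of the quoted update (my own sketch, assuming FSHIFT == 11, FSCALE == 1 << FSHIFT, and the 1-minute cexp constant shown above). A single sample with one runnable thread bumps the figure by roughly (FSCALE - cexp)/FSCALE, about 0.08, and that contribution then decays by a factor of exp(-1/12) at each subsequent 5-second step, so recent samples dominate and older ones fade rather than being averaged uniformly:

/* Sketch of the fixed-point loadav update quoted above; illustration only. */
#include <stdio.h>
#include <stdint.h>

#define FSHIFT  11
#define FSCALE  (1 << FSHIFT)

int main(void)
{
    const uint64_t cexp = (uint64_t)(0.9200444146293232 * FSCALE);
    uint64_t ldavg = 0;
    int step;

    for (step = 0; step < 24; step++) {
        /* One runnable thread seen on the first sample only, idle afterwards. */
        uint64_t nrun = (step == 0) ? 1 : 0;

        ldavg = (cexp * ldavg + nrun * FSCALE * (FSCALE - cexp)) >> FSHIFT;
        if (step % 6 == 0)
            printf("step %2d: ldavg = %.2f\n", step, (double)ldavg / FSCALE);
    }
    return 0;
}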
Comment 39 Stavros Filargyropoulos 2023-01-23 04:57:04 UTC
(In reply to Mark Millard from comment #38)

> I still do not see how this is to be classified as unreasonable
for the background activity present, even if it was an arithmetic
mean.

One argument is to look at what 9.0-RELEASE used to report, which would be an order of magnitude less.

Another argument is to try to visualize what a load of 0.20 would mean if we had a single process and a very simple kernel that wouldn't run anything other than idle/scheduling and that process.

Ignoring the more complex computation and taking just the arithmetic mean, a 0.20 load would mean that we would have a loop running, since it's the only process, for 20% of a minute.  That means it would run for 12 seconds and sleep for 48.

I don't think the background activity is worth 12 seconds per minute. FreeBSD would have serious performance issues if it were.

Yet another argument is to run something equivalent to the background activity. If things were fine and it's just our perception that is wrong, then I would expect the load to double. It doesn't.
Comment 40 Stavros Filargyropoulos 2023-01-23 05:14:25 UTC
Just some more observations.

OpenBSD uses the same formula:
https://github.com/openbsd/src/blob/2207c4325726fdc5c4bcd0011af0fdf7d3dab137/sys/uvm/uvm_meter.c

And load reported under similar usage is:

load averages:  0.00,  0.01,  0.00                                                                                                                                                                openbsd 05:14:10
51 threads: 50 idle, 1 on processor                                                                                                                                                                       up  0:03
CPU states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
Memory: Real: 9012K/723M act/tot Free: 246M Cache: 194M Swap: 0K/256M

  PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
96848   330739 -22    0    0K 1080K sleep     -         3:13  0.00% idle0
    0   100000 -18    0    0K 1080K sleep     schedul   0:01  0.00% swapper
 5169   407016  10    0    0K 1080K sleep     bored     0:01  0.00% systqmp
Comment 41 Mark Millard 2023-01-23 06:30:07 UTC
(In reply to Stavros Filargyropoulos from comment #39)

I've not looked at anything approaching a single
hardware thread environment to observe what the
load averages would be like (for the same type
of cores run at the same sort of rates and other
things being comparable). The closest I could
test is 4 core instead of 16 core (MACCHIATObin
Double Shot vs. the HoneyComb, same boot media
used would be possible, still 4 GiBytes of RAM
per core: 16 GiBytes).

It is not obvious to me that varying just the
hardware thread count leaves the load average as
it is in my context. The above test would vary
somewhat more than just the hardware thread count.
But it would be fairly close as far as I know.

I can say that a full load (no backlog) on the
HoneyComb would be load average 16: the core
count (nothing analogous to hyperthreading
is involved).

0.20/16 == 0.0125 == 1/80, so 79 of 80 "units" are
available for other uses in such a crude comparison.
I'm never likely to notice the difference vs. 80 out
of 80 "units" being available for use.

If 4 cores also got 0.20 for the load average:
0.20/4 == 0.05 == 1/20 == 4/80, so 76 of 80 "units"
left. That would be less left for other activity,
but I'm still not likely to significantly notice vs.
80 out of 80.

Of course, if 4 cores got, say, 0.80 for a load
average when "idle", that would be different. But
I do not remember noticing such load averages on
idle 4-core systems that are similar to the
HoneyComb in other ways. Maybe I'll test at some
point so that general memory from activity not
focused on the issue is not involved.

I'm not claiming 0.20 or 0.80 would be good
approximations: illustrations only.

Another point is: if the machine is not near or
beyond being fully in use (all hardware threads),
I'm even less likely to notice the activity that
is associated with idle: the parallelism avoids
most direct-tradeoff time conflicts.
Comment 42 Stavros Filargyropoulos 2023-01-23 06:43:44 UTC
(In reply to Mark Millard from comment #41)

The only thing I can advise for your case is to run NetBSD/OpenBSD/Linux on the same setup and see what load is reported.

A 0.40 load for the usage you posted is not correct. There is a bug. It may be a benign one.

I am not impressed, however, that this was closed without any further information.
Comment 43 Stavros Filargyropoulos 2023-01-23 06:45:15 UTC
0.20, not 0.40. Sorry, typing on phone.
Comment 44 Mark Millard 2023-01-23 06:48:48 UTC
(In reply to Stavros Filargyropoulos from comment #40)

Previously you listed (FreeBSD context):

128 threads:   5 running, 103 sleeping, 20 waiting

vs. for the OpenBSD context now:

51 threads: 50 idle, 1 on processor

Focusing just on the thread counts (the other figures
are not likely to be as stable) . . .

I cannot tell how much of 128 vs. 50 is a mismatched
context/configuration vs. how much is differences in
software design/implementation. Whatever the mix, the
128 vs. 50 is notable all by itself.

That actually points out an issue for the 4 core
test: some types of thread counts do scale with core
counts in order to be able to put more cores to use
if needed. (It does not mean that they are busy.)
I'm not sure how similar the thread counts would
be. I might need to stop nfs or some such to have
a more similar context relative to thread counts.
Comment 45 Viktor Štujber 2023-01-23 11:26:46 UTC
FWIW, the dual-core HT Atom mini-PC I originally reported this for has been showing realistic averages (0.01 to 0.10) since my last post around 10.2 in late 2015, and has remained that way until now, running 13.1. Here are the system vars that might be affecting processor time accounting.
kern.timecounter.choice: TSC(1000) ACPI-fast(900) HPET(950) i8254(0) 
kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) LAPIC(100) i8254(100) RTC(0)
dev.cpufreq.0.freq_driver: p4tcc0
dev.cpu.0.freq_levels: 1800/-1 1575/-1 1350/-1 1125/-1 900/-1 675/-1 450/-1 225/-1
dev.cpu.0.cx_method: C1/mwait/hwc
dev.cpu.0.cx_supported: C1/1/3
Comment 46 Alexander Motin freebsd_committer freebsd_triage 2023-01-23 13:56:35 UTC
It is a long-known problem of very idle systems.  To save power, the kernel tries to group different timer events together to reduce the number of CPU wakeups from the idle state.  The problem is that load average accounting is also based on timer events, and if the system is idle, each time the sampler fires there is often something else to do, which gets accounted into the load average.
Comment 47 Stavros Filargyropoulos 2023-01-23 19:03:17 UTC
(In reply to Alexander Motin from comment #46)

Alexander, in the comment above gbe@FreeBSD.org suggested that this has been fixed. I tried to have a look through the recent commits in ACPI but can't seem to find anything that would have fixed this. Do you happen to know which commit Gordon was referring to?
Comment 48 Alexander Motin freebsd_committer freebsd_triage 2023-01-23 19:59:40 UTC
I am not sure what he means.  There could be some optimizations, but I am not sure it can be easily fixed, unless we implement some mechanism to strictly prevent other timer events from grouping with the load average ones.
Comment 49 Stavros Filargyropoulos 2023-01-23 20:09:57 UTC
(In reply to Alexander Motin from comment #48)
Thanks, Alexander. It gives me some more confidence to know the root cause and to understand that it shouldn't functionally affect the system in any other way.
Comment 50 Mark Millard 2023-01-23 20:25:40 UTC
(In reply to Alexander Motin from comment #46)

Quoting your comment #9:

. . . if the system is idle, a CPU may wake up as few as two times per second.

But, the load average code is based on fixed 5 second sampling and
then a small variability is tolerated/introduced to avoid matching
a (nearly) fixed frequency activity accidentally matching. The
"fixed" turns into a fixed number of adjustments, such as 12, 60,
or 180.

Having anything like 30sec between steppings of the load average
adjustment, spreads the 12 steps case (60 sec @5 sec each intended)
out to something like 360 sec total for the 12 steps. (Also has
implications for widely variable time interval sizes.)

If that is really the case, not much else can matter as long as
the 5 sec interval is not well approximated and calling the
average a "1 minute" average can be highly misleading.
Descriptions that would be accurate are "12 step load figure",
"60 step load figure", and "180 step load figure". No fixed
conversion to time intervals available --and not really
arithmetic means.

Sound right?

Note:
For what I use the "load averages" for, the "N step load figure"
interpretation works fine without knowing specific times.
Inaccuracies in small figures (relative to the hardware
thread count involved) make no noticeable difference. Nor do
small percentage errors in the larger figures: Rule of thumb
material for making some judgments that are not highly sensitive
to the details.
Comment 51 Stavros Filargyropoulos 2023-01-23 20:48:55 UTC
(In reply to Mark Millard from comment #50)

> 30sec

Where did you get the 30secs from?

He said twice per second. Not per minute?
Comment 52 Mark Millard 2023-01-23 20:56:24 UTC
(In reply to Stavros Filargyropoulos from comment #51)

Good question: bad substitution in my head that I did not notice,
despite "second" being in what I quoted.

So ignore my #50. Sorry for the noise.

Too bad I cannot delete it to avoid confusing anyone besides
myself . . .
Comment 53 Ian Smith 2023-01-24 16:45:31 UTC
(In reply to Gordon Bergling from comment #33)

Gordon, could you please provide a date and a link for the revision you refer to?

I updated to 12.4 yesterday, assuming that "updates in stable/12 and stable/13 a few months ago" should mean something preceding 12.4 being spun off from stable/12 on 20th October, but I can see no change at all in the idle behaviour of the load averages.

(Alexander, more when I know more ...)
Comment 54 Mark Millard 2023-01-25 20:34:36 UTC
(In reply to Stavros Filargyropoulos from comment #42)

FYI for comparison/contrast:

I booted a new USB3 installation of OpenBSD 7.2 on the HoneyComb
and let it sit for over 20 minutes with top -1SCHIocpu
running. A sequence of occasional extractions from after that
point is:

load averages:  0.50,  0.35,  0.23

load averages:  0.26,  0.31,  0.22

load averages:  0.39,  0.34,  0.24

load averages:  0.24,  0.30,  0.23

load averages:  0.41,  0.33,  0.26

load averages:  0.21,  0.30,  0.25

That gives an idea of the variability.

The list of non-idle processes is almost always empty but
on occasion shows an acpi0 line (and never any others that
I've observed).

Looks to me like it reports generally larger load averages
than FreeBSD on the specific platform. (I've no clue what
all contributes to that status.)
Comment 55 Mark Millard 2023-01-25 21:05:16 UTC
(In reply to Stavros Filargyropoulos from comment #42)

FYI: a booted-NetBSD-on-HoneyComb comparison/contrast done
in a similar way to the earlier OpenBSD example:

Very different results, always . . .

load averages:  0.00,  0.00,  0.00