Bug 246940

Summary: [wishlist/enhancement, patch incl.]: idle user tasks should be charged as "nice" or "idle" CPU time
Product: Base System Reporter: Walter von Entferndt <walter.von.entferndt>
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: New ---    
Severity: Affects Many People CC: walter.von.entferndt
Priority: --- Keywords: patch
Version: Unspecified   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
Patch for FreeBSD-12: sys.kern.kern_clock.c.diff
none
for FreeBSD-10/11: sys.kern.kern_clock.c.diff
none
expose add. cp_time(s) (realtime/idle) via sysctl cp_time(s)_ext
none
expose add. cp_time(s) (realtime/user_idle) via new sysctl cp_time(s)_ext
none
expose add. cp_time(s) (realtime+user_idle) via new sysctl cp_time(s)_ext
none
Patch for 12.1-REL to expose extended CPUTIMES via new sysctl knob
none
Patch for 12.1-REL powerd(8) to weight (scale) nice & user idle load
none
Patch for 12-STABLE to expose extended CPUTIMES via new sysctl knob
none
Patch for 12-STABLE powerd(8) to weight (scale) nice and user idle CPU load none

Description Walter von Entferndt 2020-06-02 16:58:48 UTC
Created attachment 215172 [details]
Patch for FreeBSD-12: sys.kern.kern_clock.c.diff

Currently idle user tasks are charged as "user" CPU time.

1. This is counter-intuitive.  Instead, one would expect the requested behaviour.
2. The chargement of user tasks affects the decisions taken by a CPU freq manager (i.e. powerd(8), sysutils/powerdxx(8) and the according kernel modules) and thus can result in great power savings for mobile users, and/or a less noisier system for desktop users, provided the user does not care about the runtime of background tasks, since power managers do not increase CPU freq on behalve of idle user tasks.
3. The patch might affect the CPU scheduler, please check.  I have no clue about this.
4. The patch introduces a new global integer sysctl "kern.cp_time_mode":
    [value: behaviour; use-case]
    1: charge idle threads as "nice" CPU time; suggested as server mode
    2:           "            "idle" CPU time; suggested as mobile/desktop mode
    0/any other: "            "user" CPU time; (default); current behaviour

I would be interested in how such "static" sysctl's are handled on a SMP system.  If "OID_AUTO" results in a per-CPU var, a system could be partitioned to have "interesting" behaviour :)
Onother question I have is if kernel tasks (other than the idle task itself) could ever be in the idle class.

I ran this trivial patch on two laptops for several years w/o problems.
There might be implicit assumptions on the CPU scheduling class beeing "user" or "nice" in some base system tasks or scripts, though.

Suggested application is to switch between "server" and "mobile" mode from /etc/power_profile (I'm willing to patch that, it's trivial) and to allow security.bsd.unprivileged_idprio=1.  Then a desktop/mobile user can choose to run cron(8), sysutils/anacron(8) and his own user background tasks as "idle", and enjoy increased battery runtime and a quiter system.

How can I add more than one attachment?  The patch for FreeBSD 10/11 is slightly different, statclock() was named statclock_cnt() and did not have a comment.

With kind regards, Torsten
Comment 1 Walter von Entferndt 2020-06-02 17:18:33 UTC
Created attachment 215173 [details]
for FreeBSD-10/11: sys.kern.kern_clock.c.diff
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2020-06-02 17:19:17 UTC
Hi Torsten,

Why is it counter-intuitive that user processes are counted in user CPU time?  "Idle" is for kernel idle threads, nothing else.
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2020-06-02 17:20:14 UTC
Low-priority user programs can be charged as "nice" instead of "user" already by running them with nice: 'nice -n 1 my_program'.
Comment 4 Walter von Entferndt 2020-06-02 19:20:44 UTC
> Why is it counter-intuitive that user processes are counted in user CPU time?
When they are neither "nice" nor "idle", it's not.
But when they are in the idle class, IMHO a mortal user would expect them not to be charged as user CPU time, since their scheduling priority is even lower than all nice tasks.

> "Idle" is for kernel idle threads, nothing else.
Then what's the idle _user_class_ for?

> Low-priority user programs can be charged as "nice"
So this means I have to apply both "nice" and "idprio".  OK, but IMHO one would expect that idle user tasks are implicitely also (at least) nice, same argument: because their scheduling priority is lower than all nice tasks.

I totally agree to a bit of resistance to introduce new sysctl's and in general, accepting patches, no matter how trivial they are.

In this case, however, I'd kindly suggest the issue that currently, idle user tasks do affect the CPU freq scaling (becasause they are charged as "user" time) and with this patch, have an effect that I (and presumably many others) consider useful, i.e. one can run tasks as idle and know that they do not affect the CPU freq scaling - like the idle kernel task(s).

This is at least useful for mobile devices running AC-offline (or solar-powered).  It's useful for desktops, one can switch between fullpower(noisy) and quite mode for background tasks.
It can be even useful for servers to do such a switch on power-outage.
The patch is non-intrusive, since the default is to retain the current behaviour.
Comment 5 Walter von Entferndt 2020-06-02 20:39:18 UTC
(In reply to Conrad Meyer from comment #2)
> Why is it counter-intuitive that user processes are counted in user CPU time?
Because the charge classes supplied by cp_times (roughly) reflect the cpu scheduling classes and these are ordered, thus I would assume the cp_times are implicitely ordered as well: irq, sys, user, nice, idle.

The point is that by allowing idle user tasks to be charged as idle in the load values, they are completely "invisible" (like the kernel idle tasks).

To achieve the effect mentioned above w/o this patch (do not scale cpu freq up on idle user load), a power manager would have to iterate through all idle user tasks, sum up their cpu times, and compute sythetic load values himself.  I consider this counter-effective: consume more user-cpu cycles and - even worse - context switches, to save energy (and noise pollution).
Comment 6 Conrad Meyer freebsd_committer freebsd_triage 2020-06-03 00:27:39 UTC
(In reply to t.eichstaedt from comment #4)
> Then what's the idle _user_class_ for?

Scheduler prioritization.

> IMHO one would expect that idle user tasks are implicitely also (at least) nice, same argument: because their scheduling priority is lower than all nice tasks.

This is a misunderstanding.  Nice and priority are orthogonal.

(In reply to t.eichstaedt from comment #5)
> the charge classes supplied by cp_times (roughly) reflect the cpu scheduling classes

This is also a misunderstanding; it isn't the categorization used by cp_times.  Nice is user threads with nice > 0; user is all other user threads.  Intr is kernel threads running ithreads.  Idle is kernel threads running idle.  Sys is all other kernel threads.  That's it.

I think it would be reasonable to collect and expose the data you want, and have a power manager (useless as they may be)† consume that data instead of cp_times.  But I don't like changing the historical behavior of cp_times to charge user CPU as kernel CPU, nor do I like adding yet another sysctl knob for this behavior.  (Approximately zero users are going to find and enable this knob.  If we want to provide a better laptop experience, it needs to work out of the box.)


(†): User-driven frequency scaling is kind of a losing game at this point, especially with powerd.  Idle C-states consume almost nothing regardless of frequency.  Powerd doesn't know how to manage frequency on multiple independent CPUs.  Intel is moving away from OS-driven p-states entirely; future CPUs will simply not support it.  (Instead, cores can be set to an energy efficiency profile on some 0-100 percentage scale.)
Comment 7 Walter von Entferndt 2020-06-03 08:51:03 UTC
(In reply to Conrad Meyer from comment #6)
Good morning, and thx for your time.

>> [mixed up nice & priority]
> This is a misunderstanding.  Nice and priority are orthogonal.
OK I'll try harder to be more precise and choose the right verbs.

>> [my assumption: load values ~ runtime sched classes]
> This is also a misunderstanding; it isn't the categorization used by cp_times.
> [...] That's it.
OK.  Got that.  Wrong assumption/association.

> I think it would be reasonable to collect and expose the data you want, [...]
I'm glad to hear that. Do you mean to weave in the computations to update a modern clone of cp_time, with additional fields for the non-traditional (soft) realtime & user_idle classes, into the "if(usermode)"-branch of statclock()? And clone the resp. SYSCTL_PROCs for the overall and per-cpu cp_time(s)?
Is it possible to simply add a flag to the existing one and use the very same function for both sysctl's? Or do we need a wrapper function? These functions seem to be so simple, I'd resist to copy&paste when all that's different is the number of fields in an array. Is there an implicit mutex or lock set by one of these SYSCTL_XXX macros? I have no clue about these SYSCTL things... if you want do that because it's trivial and useful anyway, I'm happy.  If you want to let me do it instead, because I'm the only one asking for it, I'll dive into it and come up with a suggestion.  Would that need to be against the FBSD-13_CURRENT src?

> If we want to provide a better laptop experience, it needs to work out of the
> box.
+++ IMHO we can assume laptop users enable powerd or such, and others do not.

> (†): User-driven frequency scaling is kind of a losing game at this point,
> especially with powerd.  Idle C-states consume almost nothing regardless of
> frequency.
When any other except the kernel idle task is running, the cpu is in the C0 state, right?  So the states >=C1 are not reached as long there's some task ready to run, because the scheduler picks it up to let it run, right?  Thus, to achieve what I consider ("cool" user_idle tasks), the runtime scheduler had to provide a mechanism to artificially "retard" an (or all) idle user task, i.e. give it less time slices (less often), which it currently does not do on an otherwise idle system.  I'm not aware of such a mechanism.  The implicit assumption of the scheduler is to get all work done, optimized and parameterized for a mix of speed, throughput and (by architecture) "fairness", and it has no notion neither for ernergy efficiency nor absolute power consumption.  If you suggest the better solution would be to update the scheduler instead, I'd be interested as well, but probably that's beyond my current skills.  More likely than not.  Anyway, it'd be much more complicated and - foremost - dangerous.

Regarding power consumption, the cpu freq is the dominating factor (in C0 state).  That's basic physics.  Am I wrong here?  Freq scaling based on load values is what exists and will be in use for the lifetime of today's machines.  If you can point me to what powerd does methodically wrong (other than the mentioned handling of independant cores), feel free to do so.  Likewise, any other better solution - I welcome your hints.

> Powerd doesn't know how to manage frequency on multiple independent CPUs.
I'll have a  look at that.  Is this possible on a one-package two-core SMT system?  I have a Core i7-5600U, that's what I can test on bare metal.  These things never interested me more than to survive a test at the univ.  Obviously two virtual SMT cores on one physical core can not have different freqs.  The rest is what I "just use", but if I have to dive into that, I'll do so. 

> Intel is moving away from OS-driven p-states entirely; future CPUs will simply 
> not support it.  (Instead, cores can be set to an energy efficiency
> profile on some 0-100 percentage scale.)
Sounds like KISS.  It's important to support new hardware, but it's also fair to support existing hw during it's lifetime.  Let's say, two decades.

Sorry for the lengthy mail - I want to have this going (i.e. "cool" user_idle).
Comment 8 Walter von Entferndt 2020-06-05 03:44:07 UTC
Created attachment 215241 [details]
expose add. cp_time(s) (realtime/idle) via sysctl cp_time(s)_ext

Now guess what, that was so simple that it succeeded to compile on the 3rd try (only typos), and it runs on bare metal on the 1st -- I'm writing this with the patch applied.  I was too lazy to beadm or set up a VM to test.

If I violated any written or unspoken guidelines/rules, please let me know (PLMK).
The Dev Handbook is on top of my list to read, but for now I just followed what I saw in the src code.

I took the freedom to add some comments and documentation which might write out what is already clear to the avg audience.  If so, PLMK, and I'll fix all that.  I want to do it the BSD way, and no quick hack.

In pcpu.h:struct pcpu{}, I added "long pc_cp_time_ext[CPUSTATES_EXT]" analogue to "pc_cp_time[CPUSTATES]", and I'm not comfortable with that.  It will break some base and port apps, right? top(1) runs... but others?

Strictly speaking, a new sysctl is not neccessary, as I also added resource.h:read_cpu_time_ext() analogue to read_cpu_time().  IMHO it's kind to have a sysctl, because curious folk who will make use out of it, will stumble over this more likely than if it's hidden in a header file.

Now if you approve this patch, I can start patching powerd(8) to make use of it, but I fear to start an avalanche, esp. concerning supporting freq setting on independant cpu.  It also seems fairly simple, though.

thx again for your time.
Comment 9 Walter von Entferndt 2020-06-05 07:07:54 UTC
Strike out the paragraph "Strictly speaking...", as it's kernel internal.

- The suffix "_ext" could be just "_x"? (read: extra/extended)
- I used "bool", should be "boolean_t"?
- sys/kern/kern_clock.c:read_cpu_time() moved up because I needed to call the
  common body _read_cpu_time() from _sysctl_kern_cp_time().
  I should instead add a forward decl, so read_cpu_time() does not change it's 
  occurence in the source file?  The patch would be "cleaner" then.
Comment 10 Walter von Entferndt 2020-06-05 08:05:18 UTC
- missed sys/sys/priority.h:PRI_FIFO.  Added that, since they would not be
  included by checking for td->td_pri_class == PRI_REALTIME (PRI_FIFO is a flag).

I do not update the suggested patch because I assume you'll have some more
important advice for changes.
Comment 11 Walter von Entferndt 2020-06-06 03:23:28 UTC
Created attachment 215284 [details]
expose add. cp_time(s) (realtime/user_idle) via new sysctl cp_time(s)_ext

I feel ashamed that I did not sent you the identical patch that I ran successfully.  My editor (kate) included a BOM that I discovered when I was checking the patch visually, so I applied "recode(1)" to remove that.  But recode(1) mangled a character in sys/sys/pcpu.h that I missed to discover.  I apologize for any inconvenience.

Now this is the very same same patch that is running on my machine while I'm writing this, and it has the fixes applied that I mentioned in the previous posts.

Should I install the FBSD-13 src tree and weave it in there, too?

With kind regards, Torsten
Comment 12 Walter von Entferndt 2020-06-06 03:31:53 UTC
Created attachment 215285 [details]
expose add. cp_time(s) (realtime+user_idle) via new sysctl cp_time(s)_ext
Comment 13 Walter von Entferndt 2021-01-28 17:15:46 UTC
Created attachment 221995 [details]
Patch for 12.1-REL to expose extended CPUTIMES via new sysctl knob
Comment 14 Walter von Entferndt 2021-01-28 17:30:32 UTC
Created attachment 221997 [details]
Patch for 12.1-REL powerd(8) to weight (scale) nice & user idle load

tested on stock 12.1-p12 GENERIC amd64 (release)
not tested on i386
Comment 15 Walter von Entferndt 2021-01-28 17:40:08 UTC
Created attachment 221998 [details]
Patch for 12-STABLE to expose extended CPUTIMES via new sysctl knob

tested on 12-STABLE amd64.
The patch also went through w/o errors on a few weeks old 13-CURRENT (it has 7 lines of context), but I did neither try to compile that, nor run it on 13-CURRENT.
Comment 16 Walter von Entferndt 2021-01-28 17:43:54 UTC
Created attachment 221999 [details]
Patch for 12-STABLE powerd(8) to weight (scale) nice and user idle CPU load

tested on 12-STABLE amd64.
The patch also went through w/o errors on a few weeks old 13-CURRENT (it has 7 lines of context), but I did neither try to compile that, nor run it on 13-CURRENT.
Comment 17 Walter von Entferndt 2021-02-03 00:19:42 UTC
Note that if you apply the patch it is necessary to recompile all kernel modules _including_external_ones_ (those in /boot/modules, e.g. GPU drivers)... since an internal kernel structure changed; IIRC it is struct pcpu in sys/sys/pcpu.h?  Likewise, do not forget to update /usr/include/sys.