Bug 219213 - powerd causing problems with ryzen
Summary: powerd causing problems with ryzen
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: Jung-uk Kim
URL:
Keywords:
Depends on: 221621
Blocks:
Reported: 2017-05-11 08:19 UTC by SF
Modified: 2017-12-06 21:48 UTC

See Also:
koobs: mfc-stable11?
koobs: mfc-stable10?


Attachments
x86/cpufreq/hwpstate.c patch for powerd++ (not tested) (1.35 KB, patch)
2017-08-19 09:32 UTC, mailsed

Description SF 2017-05-11 08:19:39 UTC
I have been running FreeBSD 11.0 on a Ryzen R7 1700X processor since March. The first thing I noticed after switching to Ryzen was significantly better performance than with my Phenom X6 1100T. Power consumption has also become very low.

I struggled with a lot of problems in wine, which performed very poorly even on Ryzen. Last week I decided to overclock my CPU. Overclocking turned off all the power-saving features, and the first thing I noticed was that performance was many times higher than before. Previously my computer crashed several times a day; only rarely could I run it for 2 days without having to restart because the machine had frozen completely. Since overclocking, the machine keeps running with no problems.

I was also constantly getting error messages from powerd in my output; they are gone now. I had already suspected that powerd might be causing the freezes, but I did not think it was also keeping my machine at its lowest p-state.

Ryzen support should be added, along with more control over the power-saving features, to avoid performance holes.
Comment 1 SF 2017-05-11 18:01:26 UTC
To be more specific about power-saving features:

CPUs have had multiple p-states for a long time now, but powerd only knows 3 modes:

adaptive, maximum, minimum

For adaptive mode I have only 2 options, which set the percentages for minimum and maximum.

I don't know how many p-states each CPU has, but there are definitely more than just maximum and minimum. I remember 4-5 different p-states on my old Phenom X6 1100T (a CPU from 2010).

Your powerd is very primitive, and I think it doesn't meet today's demands for flexible power management and accurate delivery of performance.

With 4-5 p-states I can set 4-5 thresholds to move my CPU more precisely to the level it needs.

With today's CPU performance I can do more math to calculate the needed p-state. A Ryzen R7 1700X presents 16 logical CPUs (8 cores, 16 threads), so there is easily some spare compute power to do the math for adjusting the p-states more accurately.

e.g.:
1 primary interval (long time period)
1 secondary interval (short time period), which is launched from the primary interval after the specified time
1 counter for each p-state limit

The primary interval kicks in every 15 seconds.
The secondary interval checks the average CPU load every 40 ms for a duration of 1 second. At each check it increments the counter of the matching p-state, until the 1 second has elapsed.

It selects the p-state with the highest counter and priority.

Shouldn't be too hard?
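
A minimal sketch of the counting scheme described above, written in Python for brevity. The bucket boundaries (20/40/60/80%), the function names, and the rule that ties go to the higher p-state are assumptions made for illustration; the comment does not specify them. It follows the comment's numbering, where a higher p-state number means a faster clock.

```python
from collections import Counter

# Hypothetical lower load bounds (percent) for p-states 0..4.
THRESHOLDS = [0, 20, 40, 60, 80]

def pstate_for_load(load_percent):
    """Return the highest p-state whose lower bound the load has reached."""
    state = 0
    for index, lower_bound in enumerate(THRESHOLDS):
        if load_percent >= lower_bound:
            state = index
    return state

def vote(samples):
    """Count one vote per load sample and return the p-state with most votes.

    Ties go to the higher p-state (an assumed reading of "priority").
    """
    counters = Counter(pstate_for_load(s) for s in samples)
    return max(counters, key=lambda state: (counters[state], state))

# One second of probing at 40 ms intervals yields 25 load samples.
samples = [10] * 10 + [85] * 12 + [45] * 3
print(vote(samples))
```

In a real daemon the samples would come from the secondary interval's timer; here they are a fixed list so the voting logic can be seen in isolation.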
Comment 2 Conrad Meyer freebsd_committer 2017-05-11 18:11:42 UTC
Does powerd++ (available as a port "powerdxx") work any better?
Comment 3 SF 2017-05-11 18:37:37 UTC
Compared to powerd? Much better.

Compared to my solution? It is something different: powerd++ addresses the problem another way and also has temperature management. It is primitive in how it calculates the needed p-state from the average.

My example is much more robust against performance spikes because of the counters and probing.

I have to say I didn't know powerd++ until now, and all I know about it comes from the overview on their site.
Comment 4 SF 2017-05-11 19:44:05 UTC
This made me think about it. My first example was good but still has problems. This one is closer to final.

e.g.:
1 primary interval (long time period) to reset the counters
1 secondary interval (short time period) to probe and count
1 counter for each p-state limit and core

The primary interval resets the counters to 0 every second.
The secondary interval probes the current load of each core every 10 ms (?) and increments that core's counter for the matching p-state limit by 1.

p-state 0: <20%
p-state 1: >20%
p-state 2: >40%
p-state 3: >60%
p-state 4: >80%

After 1 second it selects the cores with the highest reached limits.

Let's say cores 1, 2, 3, 4 exceeded 80% and cores 5 to 16 stayed at 20% at most. If the 20% cores were taken into account, they would outweigh cores 1, 2, 3, 4 and the CPU would stay at a low p-state, so they are not. Only the cores that reached the maximum p-state are counted together.

The counters of cores 1, 2, 3, 4 are summed up, and the p-state with the highest count and priority is selected.
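
The selection step above can be sketched roughly as follows. This is only an illustration in Python: the data layout (one list of five counters per core) and the function name are hypothetical, and resolving equal counts toward the higher p-state is an assumption.

```python
def select_pstate(core_counters):
    """Pick a p-state from per-core counters.

    core_counters holds one list per core; entry i is the number of
    samples that core spent in the load range of p-state i.
    """
    # Highest p-state each core reached at least once in the window.
    reached = [
        max((i for i, n in enumerate(counts) if n > 0), default=0)
        for counts in core_counters
    ]
    top = max(reached)
    # Only cores that reached the top limit take part in the decision.
    busy = [c for c, r in zip(core_counters, reached) if r == top]
    # Sum their counters per p-state.
    totals = [sum(column) for column in zip(*busy)]
    # Highest count wins; ties go to the higher p-state (assumption).
    return max(range(len(totals)), key=lambda i: (totals[i], i))

# 4 busy cores and 12 nearly idle ones, 100 samples each (10 ms over 1 s).
cores = [[0, 10, 20, 30, 40]] * 4 + [[100, 0, 0, 0, 0]] * 12
print(select_pstate(cores))
```

Without the filtering step, the 1200 idle samples in p-state 0 would dominate the 160 samples the busy cores spent in p-state 4.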

The priority is needed because 80% spikes may occur less often than samples above 40% or 60%, which could keep the CPU at too low a p-state. P-state 4 therefore needs a higher priority, meaning that a count of 20 on p-state 4 weighs more than a count of 50 on p-state 0 or 2.
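
One possible reading of this priority rule is a per-p-state weight applied to the counters before they are compared. The weights in this Python sketch are made up; they are chosen only so that a count of 20 on p-state 4 outweighs a count of 50 on p-state 0 or 2, as in the example.

```python
# Hypothetical weights for p-states 0..4; higher states count for more.
WEIGHTS = [1, 1, 2, 3, 6]

def weighted_choice(totals):
    """Return the p-state whose weighted sample count is largest."""
    scores = [count * weight for count, weight in zip(totals, WEIGHTS)]
    # Ties go to the higher p-state (assumption).
    return max(range(len(scores)), key=lambda i: (scores[i], i))

print(weighted_choice([50, 0, 0, 0, 20]))
print(weighted_choice([0, 0, 50, 0, 20]))
```

With these weights a handful of p-state 4 samples still loses against a large majority of low-load samples, so a single spike does not force the highest state.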

Finally, there needs to be a timer that counts down before throttling to a lower p-state is allowed, once the previously high p-state is no longer reached.

For example: p-state 4 has not been reached for 30 seconds, so the CPU is now allowed to power down to the newly calculated value. If the calculation hits p-state 4 again while the 30 seconds are counting down, the timer is reset to 30 seconds and counts down again. This function can be optional and/or only kick in once a specified p-state is reached. It is meant for people running programs with many high spikes and short periods of low load in between, to avoid clocking down and hitting the cap when the next spike occurs.
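
The countdown described here might look like the following Python sketch. The class and parameter names are hypothetical, and the current time is passed in explicitly (for example from `time.monotonic()`) so the behaviour is easy to follow.

```python
class HoldDown:
    """Keep the boost p-state until it has gone unused for hold_seconds."""

    def __init__(self, boost_state=4, hold_seconds=30.0):
        self.boost_state = boost_state
        self.hold_seconds = hold_seconds
        self.last_hit = None  # time the boost state was last requested

    def apply(self, wanted_state, now):
        """Return the p-state to set, given the newly calculated one."""
        if wanted_state >= self.boost_state:
            self.last_hit = now  # (re)start the countdown
            return wanted_state
        if self.last_hit is not None and now - self.last_hit < self.hold_seconds:
            return self.boost_state  # countdown still running: stay high
        return wanted_state  # countdown expired or never started

hold = HoldDown()
print([hold.apply(state, t) for state, t in [(4, 0), (1, 10), (4, 20), (1, 45), (1, 50)]])
```

The request at t=20 restarts the countdown, so the lower state requested at t=45 is still held back and only the one at t=50 goes through.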

IMHO this is much more advanced than what powerd++ does and avoids some of the problems mentioned. I think this is very interesting for servers.
Comment 5 SF 2017-05-12 06:56:55 UTC
After sleeping on it and reading the powerd++ material over and over to understand it, I see just one problem with my idea, and that is responsiveness.

I don't know how the load is calculated while the CPU is throttled down to p-state 0. Can they still calculate what the load would be if the CPU ran at the maximum p-state, or can they only derive the load from the current clock?

-powerd++ sets the clock frequency according to a load target, i.e. it jumps right to the clock rate it will stay in if the load does not change.

In my example it will take 4 seconds to push the CPU from p-state 0 to p-state 4. Halving the primary interval to 500 ms brings that down to 2 s, which is still far too slow.

The advantage of mine is that it will be much more stable and accurate in setting the truly needed p-state.

Powerd++ will skip high spikes, always averaging the p-state down to a lower one. It doesn't respond to high spikes, or it goes bungee.

Neither approach solves this. On servers and in gaming there are circumstances where you need the highest p-state to avoid hitting the cap while such spikes occur.

You could add an option to boost the CPU to p-state 4 in under 500 ms and then use the other calculations to throttle it down slowly. This will increase responsiveness in such circumstances, but it might also cause the CPU to get stuck in a high p-state because of a few spikes each second.

ATM I have problems solving this, because reacting to high spikes might make the CPU never leave a high p-state. If there is just one high spike per second that needs p-state 4, it is completely unnecessary to boost the CPU into this state. Exceptions are servers at specific times when users often cause such spikes; there may be reasons to deliver a high-quality experience to them. In games there are also reasons to react to such spikes, but in that case the gamer will turn off the power-saving features. A server is different: the admin cannot always interact with the machine to match the performance currently required.

Adding another option to set specific times for enabling the boost function might be useful. You could also have the computer observe this over a long period, like one month, so that it decides by itself when to enable the boost. There is no other way to solve this dilemma.

Increasing responsiveness will always lead to periods with higher p-states than might be needed.

The powerd++ solution is faster but cannot react to spikes; it flattens your performance. There are no priorities to set and no counters to distinguish what is needed, and it always reacts to only one core, the core with the highest load.

-powerd++ supports taking the average load over more than two samples, this makes it more robust against small load spikes, but sacrifices less responsiveness than just increasing the polling interval would. Because only the oldest and the newest sample are required for calculating the average, this approach does not even cause additional runtime cost!
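
The point about needing only the oldest and the newest sample holds when the samples are cumulative counters, because the differences between intermediate samples cancel out. The Python sketch below illustrates the idea only; it is not powerd++'s code, and the class name and the (busy, total) tick layout are assumptions.

```python
from collections import deque

class LoadWindow:
    """Average load over the last `size` intervals of cumulative counters."""

    def __init__(self, size):
        # Covering `size` intervals requires `size + 1` samples.
        self.samples = deque(maxlen=size + 1)

    def add(self, busy_ticks, total_ticks):
        """Record one cumulative (busy, total) reading."""
        self.samples.append((busy_ticks, total_ticks))

    def load(self):
        """Return the busy fraction between the oldest and newest sample."""
        if len(self.samples) < 2:
            return 0.0
        old_busy, old_total = self.samples[0]
        new_busy, new_total = self.samples[-1]
        elapsed = new_total - old_total
        return (new_busy - old_busy) / elapsed if elapsed else 0.0

window = LoadWindow(4)
for reading in [(0, 0), (10, 100), (100, 200), (110, 300), (120, 400)]:
    window.add(*reading)
print(window.load())
```

The 90% spike in the second interval is averaged over four intervals, and each update costs one append and one subtraction regardless of the window size.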

I think powerd++ is something for old and slow CPUs. On a Ryzen R7 1700X I have 16 logical CPUs to take care of, and there is no problem handing such more complicated calculations to one spare core.
Comment 6 mailsed 2017-08-19 09:32:06 UTC
Created attachment 185570
x86/cpufreq/hwpstate.c patch for powerd++ (not tested)

Hi,

I think Ryzen has so many cores that they should be able to run at different frequencies.
However, x86/cpufreq/hwpstate.c is written as a module that sets a single frequency for ALL cores, as in "sysctl dev.cpu.0.freq=XXXX".
So the sysctl nodes dev.cpu.1.freq through dev.cpu.15.freq (on a system with 16 CPU threads) are not generated, and those CPUs are assumed to run at the same frequency as dev.cpu.0.freq.
So I made this patch (it removes some lines from hwpstate.c). It generates the nodes dev.cpu.1.freq through dev.cpu.15.freq, for example on a system with 16 CPUs, so that you can set each CPU's frequency independently of the others.
I think powerd++ (sysutils/powerdxx) expects this situation, where each CPU's frequency can be set differently from the others.
I have not tested it because I don't have a Ryzen, but it should compile.
I think this patch is not enough, because 8 cores with 16 threads is a problem: what happens when two threads of the same core are set to different frequencies?
Different threads on the same core would have to be handled in kern_cpu.c, but I'm not sure what to do there, sorry. Disabling SMT in the BIOS is one possible workaround, I hope.
Comment 7 Conrad Meyer freebsd_committer 2017-08-19 16:41:30 UTC
See also PR 221621.
Comment 8 Conrad Meyer freebsd_committer 2017-11-30 20:33:03 UTC
Resolved by r326383 + r326407.  Now we simply do not verify if the value set reads back from MSR.
Comment 9 Kubilay Kocak freebsd_committer freebsd_triage 2017-12-06 02:14:21 UTC
@Conrad Can these be MFC'd to stable/{11,10}, noting that the original report (comment 0) references 11.0-RELEASE (implying at least a stable/11 merge)?

If so, please reference this PR in the merge commit log messages. If not, please set the mfc-* flags to - with a comment.
Comment 10 Conrad Meyer freebsd_committer 2017-12-06 02:29:04 UTC
I don't know if they can or can't be, but I don't MFC.
Comment 11 Kubilay Kocak freebsd_committer freebsd_triage 2017-12-06 02:34:41 UTC
Reset assignee. Hopefully someone else can MFC/finish. jkim/emaste?
Comment 12 Jung-uk Kim freebsd_committer 2017-12-06 18:37:00 UTC
(In reply to Kubilay Kocak from comment #11)
I'll take care of the MFC.
Comment 13 Jung-uk Kim freebsd_committer 2017-12-06 21:48:50 UTC
(In reply to Jung-uk Kim from comment #12)
I took the liberty of syncing hwpstate.c with head.

https://svnweb.freebsd.org/changeset/base/326637
https://svnweb.freebsd.org/changeset/base/326638

Please let me know if it does not work for you.