Bug 234733 - Setting CPU frequency with sysctl dev.cpu.0.freq slows a Ryzen 2700X down
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2019-01-08 07:17 UTC by Erich Dollansky
Modified: 2020-10-29 19:13 UTC

Attachments
acpidump -dt (787.24 KB, text/plain)
2020-01-22 18:02 UTC, sigsys
P3.50 acpidump -dt (790.69 KB, text/plain)
2020-01-27 05:45 UTC, sigsys

Description Erich Dollansky 2019-01-08 07:17:11 UTC
Issuing

sysctl dev.cpu.0.freq=3700

or any other value for frequency slows the system down by nearly 50%.

System has a Ryzen 2700X as a CPU.

How to repeat:

Reboot the system with all power saving tools disabled.
Build world; it takes around 2h30min on the system mentioned above if that sysctl is never set.

Now issue the command

sysctl dev.cpu.0.freq=3700

which would bring a Ryzen 2700X to its nominal operating frequency.

Build world again; it now takes around 4h30min.

I could reproduce this several times; I originally thought it was a powerd problem.

An e-mail by Andrea Venturoli (21 Dec 2018 17:40:49) made me look a bit into powerd.
Comment 1 sigsys 2019-07-10 21:53:59 UTC
Same thing happens for me on 12-STABLE on a Ryzen 2700.  The CPU frequency isn't going up once it has been lowered (even though it looks like it does according to that sysctl, the CPU actually remains much slower).  I haven't found any BIOS setting that fixes it.  It's on a B450 Pro4 motherboard with firmware P3.20.
Comment 2 Conrad Meyer freebsd_committer 2020-01-21 20:59:00 UTC
Can both of you try:

Adding 'debug.hwpstate_verbose="1"' in /boot/loader.conf and checking dmesg for boot-time messages about hwpstate?  This can also be sysctl'd at runtime to see what gets logged when 'sysctl dev.cpu.0.freq=foo' is invoked, for example.

Second, if possible can you share the output of 'acpidump -dt'?  It will be fairly large and you might have to compress it to attach it to bugzilla.  It should not contain especially sensitive information — it's BIOS data and code provided to the operating system to understand system devices.

I'll add: the method hwpstate(4) uses to set the current P-state is documented in the Zen PPR as:

***********************
"Writes to this field cause the core to change to the indicated __non-boosted__ P-state number…" (emphasis added)
***********************

So, e.g., writing P0 (unlimited) still disables boost, I guess.

Interestingly, the documentation on the boost enable/disable bit (HWCR::CpbDis) also mentions boosted / non-boosted P-states:

"If core performance boost is disabled while a core is in a boosted P-state, the core automatically transitions to the highest performance non-boosted P-state."

So... perhaps hwpstate(4) should explicitly check and enable CPB (boost) when "P0" is selected.  Or we could synthesize an extra P-state, e.g., "3701" which when selected sets P0 and also boost.  IMO, that's more effort than it's worth because manual P-state setting is silly on these CPUs.

Here's a test you could do to confirm.  Set dev.cpu.0.freq=3700, or whatever the maximum is.  Then run (needs root): 'cpucontrol -m 0xc0010015 /dev/cpuctl0'.  This reads the HWCR MSR from CPU0.  The output will look something like:

MSR 0xc0010015: 0x00000000 0x09000011
If the bit:                0x02000000 is set, it indicates that CPB is *disabled*.

If that bit is set, you could try:
"cpucontrol -m '0xc0010015&=~0x02000000' /dev/cpuctl0" and see if it restores boost-level performance.  (I don't know if you would have to clear it on all CPUs, or if it is globally cleared by the cpu0 command.)
Comment 3 Conrad Meyer freebsd_committer 2020-01-21 21:10:58 UTC
I tried the following test:

$ cpuset -l 0 time sha1 <~200MBfile>
(burn one result for caching, then repeat 3 trials)
I got average 0.24 real, 0.22 user.  You might pick a slightly larger file for bigger and more obvious differences at different steppings.

Then I set CPBDis: cpucontrol -m '0xc0010015|=0x02000000' /dev/cpuctl0
and repeated the tests; I got average 0.29 real, 0.26 user.

I ran "cpucontrol -m '0xc0010015&=~0x02000000' /dev/cpuctl0" and got 0.24/0.22 again.

I also tried sysctl dev.cpu.0.freq=3400 (on my system, dev.cpu.0.freq_levels: 3400/3825 2800/2765 2200/1952).

cpucontrol -m '0xc0010015' did not show CPBDis.  Also, SHA1 timings remained ~0.24/0.22.

I tried dev.cpu.0.freq=2800, waited a few seconds, and restored dev.cpu.0.freq=3400.

cpucontrol -m '0xc0010015' still did not show CPBDis.  SHA1 timings seemed to show the expected boosted performance.

My CPU is a Zen1 Threadripper 1950X; perhaps this behavior is different on Zen+ (Ryzen 2xxx), or there is some difference due to BIOSes.  But you might be interested in running the same tests and seeing what you find.
Comment 4 sigsys 2020-01-22 18:02:05 UTC
Created attachment 210970 [details]
acpidump -dt

(In reply to Conrad Meyer from comment #3)
Alright I tested a bunch of things based on that.

It doesn't seem to be related to CPB in this case.  The 0x02000000 bit never gets set (it always shows "MSR 0xc0010015: 0x00000000 0x09000011").  I tried to toggle it on and off again anyway in case it would reset something but it doesn't seem like it does.
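The reading quoted above can be checked mechanically rather than by eye. This is only a minimal sketch: the function name hwcr_cpb_state is made up for illustration, and it assumes its argument is the low 32-bit word, i.e. the last value on the cpucontrol output line. The 0x02000000 mask is the CpbDis bit described in comment #2.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): report the boost state from
# the low 32-bit word of HWCR (MSR 0xc0010015).
# 0x02000000 is the CpbDis bit; when set, core performance boost is disabled.
hwcr_cpb_state() {
    if [ $(( $1 & 0x02000000 )) -ne 0 ]; then
        echo "CPB disabled"
    else
        echo "CPB enabled"
    fi
}

# Low word reported in this comment.
hwcr_cpb_state 0x09000011   # prints: CPB enabled
```

With the output format shown above, the low word is the fourth whitespace-separated field, so something like `hwcr_cpb_state "$(cpucontrol -m 0xc0010015 /dev/cpuctl0 | awk '{print $4}')"` should work as root.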

The CPU just seems to get stuck at the lowest frequency it has been set to.

I tested by timing this:
dd if=/dev/zero bs=32k count=20000 | cpuset -l 0 time sha1

And by setting the frequency to 3200 (maximum), 2800 and 1550:
3200: 1.34 real         1.15 user         0.17 sys
2800: 1.78 real         1.47 user         0.30 sys
1550: 3.18 real         2.78 user         0.39 sys

After lowering the frequency to a given value, there's no way to get the performance back (but it lets you set the frequency higher with sysctl without any apparent errors).

With debug.hwpstate_verbose set, this shows up during boot:
hwpstate0: going to fetch info from acpi_perf
hwpstate0: <Cool`n'Quiet 2.0> on cpu0

And a bunch of messages like these when setting the frequency with sysctl:
hwpstate0: setting P1-state on cpu0
hwpstate0: setting P1-state on cpu1
hwpstate0: setting P1-state on cpu2
hwpstate0: setting P1-state on cpu3
hwpstate0: setting P1-state on cpu4
hwpstate0: setting P1-state on cpu5
hwpstate0: setting P1-state on cpu6
hwpstate0: setting P1-state on cpu7
hwpstate0: setting P1-state on cpu8
hwpstate0: setting P1-state on cpu9
hwpstate0: setting P1-state on cpu10
hwpstate0: setting P1-state on cpu11
hwpstate0: setting P1-state on cpu12
hwpstate0: setting P1-state on cpu13
hwpstate0: setting P1-state on cpu14
hwpstate0: setting P1-state on cpu15
hwpstate0: setting P2-state on cpu0
hwpstate0: setting P2-state on cpu1
hwpstate0: setting P2-state on cpu2
hwpstate0: setting P2-state on cpu3
hwpstate0: setting P2-state on cpu4
hwpstate0: setting P2-state on cpu5
hwpstate0: setting P2-state on cpu6
hwpstate0: setting P2-state on cpu7
hwpstate0: setting P2-state on cpu8
hwpstate0: setting P2-state on cpu9
hwpstate0: setting P2-state on cpu10
hwpstate0: setting P2-state on cpu11
hwpstate0: setting P2-state on cpu12
hwpstate0: setting P2-state on cpu13
hwpstate0: setting P2-state on cpu14
hwpstate0: setting P2-state on cpu15

Never anything other than P1 and P2.

I tested after a reboot and an update to recent 12-STABLE r356965 and the results were all pretty much the same.

That's a B450 Pro4 still with BIOS version P3.20.  Using non-UEFI boot.  And I reset it to default settings before the last test.  At the time I first reported this it had the latest BIOS version but not anymore.  So maybe a BIOS update would fix it.  I'm gonna hold off on updating it in case more testing could be helpful.  Thanks for looking into it!

No other problems apart from that on this computer.  And yeah after that I figured that you don't really need to run powerd on these CPUs since they'll throttle themselves automatically anyway and they seem to be doing a good job of it.  But if someone does run powerd (or mess with the frequency directly) the permanent and silent slowdown can be a pretty bad surprise.
Comment 5 Conrad Meyer freebsd_committer 2020-01-22 19:29:57 UTC
(In reply to sigsys from comment #4)
Thanks for the testing, that’s really helpful!

I totally agree this is a bug that needs fixing; disabling powerd / avoiding manually reducing the clock is just a workaround until we can fix the root cause.
Comment 6 Conrad Meyer freebsd_committer 2020-01-25 07:21:55 UTC
Over on bug 234455 comment #9, schaiba@gmail reports tuning debug.acpi.disabled="thermal" to fix Ryzen getting stuck in low P-states.  I think that suggests buggy BIOS, since the ACPI thermal implementation is supplied by the firmware.
Comment 7 schaiba 2020-01-25 08:21:09 UTC
If it helps, indeed powerd is disabled, and powerd++ from pkg is used instead on my system.
Comment 8 sigsys 2020-01-25 22:37:59 UTC
No luck.  I retried with debug.acpi.disabled="thermal" in my loader.conf and redid the tests and didn't notice the behavior changing.  Still gets stuck on the lowest frequency until a reboot.
Comment 9 Conrad Meyer freebsd_committer 2020-01-27 02:49:38 UTC
Another test to try (maybe after entering a non-P0 p-state, not sure if that matters) if you'd like:

$ cpucontrol -m '0xc0010061' /dev/cpuctl0

This is the PStateCurLim register, and the low 3 bits are 'CurPstateLimit'.  It represents the highest performance P-state the processor is (currently) allowed to enter.  I don't know why it would be non-zero if the processor is in P0 at boot, but I suppose it could be a BIOS issue.

Finally, you could try just manually checking the last set P-state:

$ cpucontrol -m '0xc0010062' /dev/cpuctl0

(Last-set P-state is the low 3 bits of that register.)  Or setting P-state 0 across all cpus manually, bypassing hwpstate(4):

$ for i in $(jot 16 0) ; do cpucontrol -m '0xc0010062=0x0' /dev/cpuctl$i ; done

(Bourne sh; I don't know if that works in csh or anything exotic.)
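The loop above hard-codes 16, which matches the 16 hardware threads of a 2700/2700X. A variant that takes the CPU count as a parameter might look like the sketch below. The function name pstate0_cmds is hypothetical, and it deliberately only prints the cpucontrol commands so they can be reviewed before anything is written to an MSR.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one cpucontrol command per CPU
# that writes P-state 0 to MSR 0xc0010062, for CPUs 0 .. ncpu-1.
pstate0_cmds() {
    ncpu=$1
    i=0
    while [ "$i" -lt "$ncpu" ]; do
        echo "cpucontrol -m '0xc0010062=0x0' /dev/cpuctl$i"
        i=$(( i + 1 ))
    done
}

# Dry run for a 16-thread CPU.
pstate0_cmds 16
```

On FreeBSD the thread count could be taken from `sysctl -n hw.ncpu`, and once the printed commands look right they can be piped to sh as root: `pstate0_cmds "$(sysctl -n hw.ncpu)" | sh`.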
Comment 10 sigsys 2020-01-27 03:11:04 UTC
(In reply to Conrad Meyer from comment #9)

It's showing some interesting behavior I think.  The low 3 bits of 0xc0010061 and 0xc0010062 go up but don't go down.  And they go up together.


At boot:

# cpucontrol -m '0xc0010061' /dev/cpuctl0
MSR 0xc0010061: 0x00000000 0x00000020
# cpucontrol -m '0xc0010062' /dev/cpuctl0
MSR 0xc0010062: 0x00000000 0x00000000


After dev.cpu.0.freq=3200 (max):

# cpucontrol -m '0xc0010061' /dev/cpuctl0
MSR 0xc0010061: 0x00000000 0x00000020
# cpucontrol -m '0xc0010062' /dev/cpuctl0
MSR 0xc0010062: 0x00000000 0x00000000

(No change there.)


After dev.cpu.0.freq=2800:

# cpucontrol -m '0xc0010061' /dev/cpuctl0
MSR 0xc0010061: 0x00000000 0x00000021
# cpucontrol -m '0xc0010062' /dev/cpuctl0
MSR 0xc0010062: 0x00000000 0x00000001


After dev.cpu.0.freq=1500:

# cpucontrol -m '0xc0010061' /dev/cpuctl0
MSR 0xc0010061: 0x00000000 0x00000022
# cpucontrol -m '0xc0010062' /dev/cpuctl0
MSR 0xc0010062: 0x00000000 0x00000002


After dev.cpu.0.freq=3200 (trying to set it back to max):

# cpucontrol -m '0xc0010061' /dev/cpuctl0
MSR 0xc0010061: 0x00000000 0x00000022
# cpucontrol -m '0xc0010062' /dev/cpuctl0
MSR 0xc0010062: 0x00000000 0x00000002

And performance is still as if it were at 1500.


After cpucontrol -m '0xc0010062=0x0' on all CPUs:

# cpucontrol -m '0xc0010061' /dev/cpuctl0
MSR 0xc0010061: 0x00000000 0x00000020
# cpucontrol -m '0xc0010062' /dev/cpuctl0
MSR 0xc0010062: 0x00000000 0x00000000

And now performance is back to what it was at 3200!  I didn't manually try to reset 0xc0010061, it already reverted back to its initial value after changing 0xc0010062.
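The readings above are easier to compare once the fields are pulled out. The following is a rough sketch with made-up function names. The low-3-bit layout comes from comment #9; placing PstateMaxVal in bits 6:4 of 0xc0010061 is an assumption based on AMD's register documentation and is not stated anywhere in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; each takes the low 32-bit word printed by cpucontrol.

# Low 3 bits: CurPstateLimit in 0xc0010061, last-set P-state in 0xc0010062.
low3() {
    echo $(( $1 & 0x7 ))
}

# Assumption: PstateMaxVal occupies bits 6:4 of 0xc0010061.
pstate_max_val() {
    echo $(( ($1 >> 4) & 0x7 ))
}

# 0xc0010061 values observed above: at boot, after 2800, after 1500.
for v in 0x00000020 0x00000021 0x00000022; do
    echo "$v: CurPstateLimit=$(low3 "$v") PstateMaxVal=$(pstate_max_val "$v")"
done
```

Under that assumption PstateMaxVal stays at 2 in every reading, and only CurPstateLimit moves, in step with the last P-state written to 0xc0010062.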
Comment 11 Conrad Meyer freebsd_committer 2020-01-27 04:52:00 UTC
Very interesting, thanks!  It's the low bits of 0xc0010061 following 0xc0010062 up that is surprising / causing hwpstate(4) to ignore the user's supplied configuration (side note: we should really produce at least a *debug* log message when we ignore the user-supplied P-state!).  I don't have an explanation for why 0xc0010061 low bits are following 0xc0010062.  Nothing in the kernel writes that MSR, as far as I can tell.

The documentation I have for c0010061 says it's error-on-write.  But it also suggests that the value may be changed: "Attempts to change the CurPstateLimit to a value greater (lower performance) than PStateCurLim[PstateMaxVal] leaves CurPstateLimit unchanged."  So I'm not really sure if software/firmware can write it.  Maybe the CPU's power-governor is misconfigured and is attempting to limit itself?  I don't have any explanation for why it would "follow" c0010062 stepwise, though.  Any chance other people with the same motherboard have reported similar problems with Linux/Windows?  Any chance there is a BIOS update available?
Comment 12 Conrad Meyer freebsd_committer 2020-01-27 04:55:33 UTC
(It's also surprising that writing c0010062 *works* and overwrites c0010061 — the docs for c0010062 say: "P-state limits are applied to any P-state requests made through this register. Reads from this field return the last value written, regardless of whether any limits are applied.")
Comment 13 Conrad Meyer freebsd_committer 2020-01-27 05:14:27 UTC
Note that there are several newer BIOSes available for that board: https://www.asrock.com/mb/AMD/B450%20Pro4/#BIOS

The only remark that really jumps out is for 3.31, "Change AMD PBO rule with Pinnacle CPU."  PBO is "precision boost overdrive" and is something involving power; Pinnacle is the Ryzen 2xxx family.  That said, I'd just skip straight to 3.90 as long as you're taking the time to flash a new BIOS.
Comment 14 Conrad Meyer freebsd_committer 2020-01-27 05:15:43 UTC
Sorry, not 3.90, only go to 3.50 maybe?

> ASRock do NOT recommend updating this BIOS if Pinnacle, Raven, Summit or Bristol Ridge CPU is being used on your system.

I don't have a clue why they'd release a BIOS that doesn't support older CPUs, but I'm not ASRock.
Comment 15 sigsys 2020-01-27 05:22:21 UTC
It would already be pretty good if the kernel could detect the situation by double checking on one of those registers and log a warning.  Assuming it wouldn't risk causing even more problems on some other systems.  Or maybe report it as another sysctl variable so that the mismatch could be seen more easily. I only noticed that something was off when make world took much more time than expected.

But if that only happens with certain BIOS versions of one particular motherboard I guess it might not be worth trying to deal with it at all. And that might be the case here. Alright I'm gonna try 3.50 and see if the weirdness goes away. I haven't tried that motherboard with another OS yet. I could try to boot a linux USB stick and try to see how it deals with it if upgrading the BIOS doesn't fix it.
Comment 16 sigsys 2020-01-27 05:45:29 UTC
Created attachment 211097 [details]
P3.50 acpidump -dt

Upgraded to 3.50 and I'm seeing the *exact* same behavior doing the same tests. The BIOS setup screen says 3.50 and dmidecode sees 3.50 too, so the upgrade must have gone fine. The ACPI dump changed a little too. But it didn't fix it.
Comment 17 commit-hook freebsd_committer 2020-01-27 06:05:14 UTC
A commit references this bug:

Author: cem
Date: Mon Jan 27 06:04:33 UTC 2020
New revision: 357165
URL: https://svnweb.freebsd.org/changeset/base/357165

Log:
  hwpstate(4): Log a debug line when throttled

  If we're going to throttle user requested P-states, we should at least produce
  a debug log line indicating the surprising behavior.

  PR:		inspired by 234733

Changes:
  head/sys/x86/cpufreq/hwpstate_amd.c
Comment 18 Conrad Meyer freebsd_committer 2020-01-27 07:31:03 UTC
(In reply to sigsys from comment #15)
> It would already be pretty good if the kernel could detect the situation by
> double checking on one of those registers and log a warning.  Assuming it
> wouldn't risk causing even more problems on some other systems.

We used to verify the set P-state on all cpus prior to r326383.  It gets pretty inefficient to check all cores.  Checking just one only tells you one made it to the configured P-state.  (AMD P-states are independent across each core; SMT threads share a P-state domain.)

But that's not really the situation here; here we're restricting P-state ourselves in software due to the (seemingly bogus) c0010061 limit.  In this case, I added a debug log in the commit referenced in comment #17.

I am curious if Linux does any better.  Maybe they just ignore c0010061.

(In reply to sigsys from comment #16)
That's unfortunate :-(.
Comment 19 Conrad Meyer freebsd_committer 2020-01-27 07:58:12 UTC
(In reply to Conrad Meyer from comment #18)
> I am curious if Linux does any better.  Maybe they just ignore c0010061.

Yeah they just ignore c0010061 entirely.  Probably we can too.
Comment 20 sigsys 2020-01-27 16:20:06 UTC
Maybe it could do some double checking only once to "learn" about those registers' behavior (and maybe have a sysctl to reset that to make it check again)?  But if Linux ignores it entirely then you'd think it's probably safe to do that everywhere in practice.  Maybe a sysctl/tunable to enable/disable the check just in case?

On this one maybe that's not even an "implemented" register (if that makes sense), it just follows the other one and can't really be set independently (and it "follows" it wrong according to the specs it seems).  Really not sure I understand most of this though.  If Linux (and maybe Windows too) ignores it maybe the register is nonsense on other machines too and it goes unnoticed.
Comment 21 Conrad Meyer freebsd_committer 2020-01-27 19:15:45 UTC
(In reply to sigsys from comment #20)
c0010061 doesn't follow c0010062 on my Zen1 + ASRock X370 system, for example. :-)  Maybe it's a Zen+ thing, maybe it's a BIOS thing, who knows.

In any case I agree we don't really understand what it does.  It doesn't seem to line up with the documentation, and I don't expect ignoring it to be harmful to modern AMD hardware for power/thermal reasons.  (I expect HW made in the last ten years to thermal throttle regardless of SW-set P-state.  My concern around removing it entirely, as Linux seems to, is mostly that doing so might be harmful for the older CPUs this driver supports (k10, k11).  But -- clearly not a problem for Linux.)

I like your suggestion of a sysctl/tunable.  I propose defaulting to "ignore c0010061."   I am having trouble coming up with an accurate and descriptive docstring, though.  "If set, limit requested P-states to MSR c0010061[0:2]" isn't exactly useful to end users, probably.
Comment 22 sigsys 2020-01-27 20:41:59 UTC
(In reply to Conrad Meyer from comment #21)
Hard to imagine a description that makes sense to random users, this is just so damn technical.  But if there's a pointer to it in some manpage people are likely to check if there's a problem then at least they might be able to find it and know that this is one thing they can try messing with.  Could be with the other sysctls shown in cpufreq(4) (it's linked to from powerd(8) so people can find it).  Maybe with a note like "This MSR field must be ignored on certain hardware to allow raising back the power level after lowering it.".  Just enough to give some vague idea of what's going on with this.
Comment 23 Conrad Meyer freebsd_committer 2020-01-31 17:40:39 UTC
I committed a fix to switch the default, add the knob, and document it in cpufreq.4 — closing.
Comment 24 commit-hook freebsd_committer 2020-01-31 17:40:47 UTC
A commit references this bug:

Author: cem
Date: Fri Jan 31 17:40:42 UTC 2020
New revision: 357336
URL: https://svnweb.freebsd.org/changeset/base/357336

Log:
  hwpstate(4): Ignore CurPstateLimit by default

  Add a sysctl knob to allow users to re-enable it, and document the knob and
  default in cpufreq.4.  (While here, add a few unrelated updates to
  cpufreq.4.)

  It seems that the register value in some hardware simply reflects the
  configured P-state.  This results in an inadvertent and unintended outcome
  where the P-state can only walk down, and then the driver becomes "stuck" in
  the slowest possible P-state.

  The Linux driver never consults this register, so that's some evidence that
  ignoring the contents are relatively harmless.

  PR:		234733
  Reported by:	sigsys AT gmail.com, Erich Dollanksy <freebsd.ed.lists AT
  		sumeritec.com>

Changes:
  head/share/man/man4/cpufreq.4
  head/sys/x86/cpufreq/hwpstate_amd.c
Comment 25 eborisch+FreeBSD 2020-10-29 19:13:25 UTC
Just adding a "me too" on a Ryzen 5 2600 / 12.2-RELEASE; changes to dev.cpu.0.freq only work going down (slower), not coming back up.

(I am on custom kernel and local compile, but based on others having the issue, I think that's not critical here.)

Commenting out the:

  if (limit > id)
       id = limit;

check in sys/x86/cpufreq/hwpstate.c@176 fixes this. (Following the patch in base r357336 with much less finesse.)
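To illustrate why that check makes the P-state only walk down on affected hardware, here is a toy model in sh. It is not the driver code: all names are hypothetical, and it assumes, per the observations in comment #10, that the limit field simply mirrors the last P-state written.

```shell
#!/bin/sh
# Toy model, not driver code. Assumption (from comment #10): on affected
# hardware the CurPstateLimit field mirrors the last P-state written.
limit=0          # simulated low bits of MSR 0xc0010061
honor_limit=1    # 1 = apply the check (old behavior), 0 = ignore the limit

set_pstate() {
    id=$1
    # Same comparison as the driver check: a higher number is a slower P-state.
    if [ "$honor_limit" -eq 1 ] && [ "$limit" -gt "$id" ]; then
        id=$limit
    fi
    limit=$id    # simulated hardware quirk: the limit follows the written value
    echo "requested P$1, got P$id"
}

set_pstate 0   # requested P0, got P0
set_pstate 2   # requested P2, got P2
set_pstate 0   # requested P0, got P2 (stuck)

honor_limit=0
set_pstate 0   # requested P0, got P0
```

In this model, once a slower P-state has been set, every later request for a faster one is clamped back to it until the check is skipped, which matches the "only goes down" behavior reported here.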

I see this has been closed for nine months, but the fix did not make it into 12.2. Anyone running powerd on an impacted system will see FreeBSD as being very slow; I feel this would be good to fix, or at least call out to end users before 13.0.