Bug 209202 - [PATCH] [powermac_thermal] In-Kernel PowerMac fan control does not effectively regulate some quad core models
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: powerpc Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2016-05-02 14:44 UTC by gmbroome
Modified: 2016-05-27 22:38 UTC
CC List: 2 users

See Also:


Attachments
Patch to add additional temp level and longer hysteresis to PowerMac thermal management (2.71 KB, patch)
2016-05-02 14:44 UTC, gmbroome

Description gmbroome 2016-05-02 14:44:13 UTC
Created attachment 169882
Patch to add additional temp level and longer hysteresis to PowerMac thermal management

The in-kernel fan control for PowerMac systems works reasonably well on single- and dual-core systems, but gives mixed results on quad-core systems, depending on the condition of the system and the CPU assembly revision.

For "revision 1" CPU assemblies (one shared radiator and pump) in quad-core PowerMac G5 units, the system runs extremely hot, hitting critical temperature and forcibly shutting down during even moderate CPU utilization.

For "revision 2" CPU assemblies (a discrete radiator and pump for each CPU package), the fans run at high baseline levels, though the system does not typically reach critical temperature or forcibly shut down.

The attached patch, which includes changes to powermac_thermal.[c|h], smu.c, and smusat.c, adds an intermediate temperature target above which fans and pumps run at full speed, but which is still well below the critical temperature that forces a shutdown.
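
As a rough illustration of the idea (not the code in the attached patch), the three temperature levels could be sketched as below. The struct names, field names, and the linear ramp between the target and the intermediate level are all assumptions made for this example; the actual logic lives in powermac_thermal.c and may differ.

```c
#include <stdbool.h>

/*
 * Hypothetical per-sensor thresholds, in tenths of a degree Celsius.
 * These names are illustrative only and are not taken from the patch.
 * The sketch assumes target_temp < high_temp < critical_temp.
 */
struct thermal_limits {
	int target_temp;	/* at or below this, fans run at minimum speed */
	int high_temp;		/* intermediate level: at or above, run at full speed */
	int critical_temp;	/* at or above, the system should shut down */
};

/* Hypothetical fan or pump speed range, in RPM. */
struct fan_range {
	int min_rpm;
	int max_rpm;
};

/* Map a temperature reading to a fan or pump speed. */
static int
fan_rpm_for_temp(const struct thermal_limits *lim,
    const struct fan_range *fan, int temp)
{
	if (temp <= lim->target_temp)
		return (fan->min_rpm);
	if (temp >= lim->high_temp)
		return (fan->max_rpm);

	/* Between the two levels, ramp linearly (an assumed policy). */
	return (fan->min_rpm + (fan->max_rpm - fan->min_rpm) *
	    (temp - lim->target_temp) / (lim->high_temp - lim->target_temp));
}

/* Report whether a reading has reached the shutdown threshold. */
static bool
temp_is_critical(const struct thermal_limits *lim, int temp)
{
	return (temp >= lim->critical_temp);
}
```

The point of the intermediate level is that full cooling is reached well before the critical threshold, so there is headroom for the fans and pumps to pull the temperature back down before a shutdown is forced.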

In testing, this leaves the "revision 1" quad-core systems usable, albeit noisy.  Quad-core systems using the "revision 2" CPU assembly scale fan and pump speed more aggressively, but run them at lower levels during idling.  The dual-core system tested remained largely unaffected by the changes.  The patch is not expected to affect anything but the last generation of PowerMac models, i.e. only ones with an SMU.
Comment 1 gmbroome 2016-05-02 14:58:42 UTC
For another user with a related issue, see the freebsd-ppc mailing list archive from January 2015, subject lines beginning with "PowerMac G5 quad-core, CPU A1 DIODE TEMP".
Comment 2 Mark Millard 2016-05-02 19:28:25 UTC
(In reply to gmbroome from comment #1)

I'm the author of "PowerMac G5 quad-core, CPU A1 DIODE TEMP" from back in 2015-Jan.

At the time I had side-by-side comparison machines that were a mix of working well vs. overheating. My conclusion was that the cooling systems in the overheating machines were in a failure mode, since other matching machines were doing fine.

As I remember, Mac OS X had fancier handling of the issue that prevented crashing by forcing idle time in addition to fan-speed control, but the result was a large difference in performance between "works well" and "overheating" machines for the same load.

At the time, Justin Hibbits made some FreeBSD changes that improved things somewhat for the problem machines. But the overheating machines were pretty far gone and I gave up on them.

I returned those that were overheating. So I no longer have a test-context for overheating.
Comment 3 gmbroome 2016-05-02 21:07:05 UTC
(In reply to Mark Millard from comment #2)

The history on my work with this is that I grabbed two of these systems on the cheap from "the Bay."

Both had the "revision 2" CPU assembly, and both worked fine w/ OS X, Linux, and FreeBSD.

Six months on, one of the cooling pumps in a system died.  The only replacement that I could get was a "revision 1" CPU assembly.  After doing the replacement, both OS X and Linux continued to work fine (albeit with the same reduced performance that you noted) but FreeBSD showed the precise behavior that you described in the mailing list posts.

Through a series of CPU assembly swaps, and after ordering a second r.1 purely for testing, I'm reasonably certain that the r.1 CPU assembly is itself the culprit.  I certainly understand the notion that installing known-defective hardware (why else would there have been a rev?) will cause problems.  Even so, it makes sense to accommodate the r.1 CPU assembly in the driver: it is by far the more common replacement part these days, and the r.2 is getting very difficult to obtain, even if one has bottomless pockets.  Based on my own testing, the provided patch allows the r.1 assemblies to work and has a negligible impact on r.2 assemblies.
Comment 4 Mark Millard 2016-05-02 21:45:59 UTC
(In reply to gmbroome from comment #3)

Do not interpret my notes as an attempt to block the changes. The notes are just general background. Keeping a failure-mode machine operational can be worthwhile in the right kind of context. I just did not want the performance tradeoff and I did not want to ask Justin Hibbits to do more than he already had done.

I have access to an example "quad core Powermac G5" of each type relative to one pump vs. two, where both work without overheating problems and always have:

# sysctl -a | grep pump
dev.smu.0.fans.cpu_a_pump.rpm: 1716
dev.smu.0.fans.cpu_a_pump.maxrpm: 3600
dev.smu.0.fans.cpu_a_pump.minrpm: 1250

(So only one pump.)

$ sysctl -a | grep pump
dev.smu.0.fans.cpu_b_pump.rpm: 1485
dev.smu.0.fans.cpu_b_pump.maxrpm: 3600
dev.smu.0.fans.cpu_b_pump.minrpm: 1250
dev.smu.0.fans.cpu_a_pump.rpm: 1480
dev.smu.0.fans.cpu_a_pump.maxrpm: 3600
dev.smu.0.fans.cpu_a_pump.minrpm: 1250

(So two pumps.)

I put the same kinds of loads on each, although one has 12 GBytes of RAM and the other has 16 GBytes.

As far as I can tell, if a PowerMac G5 Quad Core has an overheating problem then something is in a failure mode, usually the cooling system.
Comment 5 Mark Millard 2016-05-27 22:38:24 UTC
(In reply to Mark Millard from comment #4)

Such timing. . .

Looks like the single pump cooling system started dying as of a couple of days ago (May 25).

Interesting oddities compared to my other experiences:

A) The problem is intermittent.

B) While it is working normally, the CPU diodes for CPUs B0 and B1 are the cooler 2 of 4 according to sysctl results, but it is B0 and/or B1 that start getting readings in the 90+ degC range when it fails.

C) I've had complete buildworlds with no sign of the problem. Other times it shuts down fairly soon after starting a buildworld or other, much smaller builds. (6 or more 90+ degC readings in a row initiate shutdown.)
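
For reference, the shutdown rule mentioned in (C) can be sketched as a simple consecutive-reading counter. This is a minimal sketch based only on the behavior described above (6 or more readings of 90 degC or higher in a row); the names are hypothetical and the kernel's actual bookkeeping may differ.

```c
#include <stdbool.h>

/* Values taken from the observation in (C); names are hypothetical. */
#define CRITICAL_TEMP_C		90	/* a reading at or above this counts */
#define CRITICAL_READINGS	6	/* this many in a row trigger shutdown */

/* Tracks how many consecutive readings have been at or above the limit. */
struct overheat_tracker {
	int consecutive;
};

/*
 * Feed one temperature reading (whole degrees Celsius) into the tracker.
 * Returns true once the run of hot readings is long enough to shut down.
 */
static bool
overheat_update(struct overheat_tracker *t, int temp_c)
{
	if (temp_c >= CRITICAL_TEMP_C)
		t->consecutive++;
	else
		t->consecutive = 0;	/* any cooler reading resets the run */

	return (t->consecutive >= CRITICAL_READINGS);
}
```

Under a rule like this, an intermittent fault that produces short bursts of hot readings would not trigger a shutdown, while a sustained run would, which is consistent with some builds completing and others ending early.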

Its boot drive is at -r298990 (as of May 5) and it was off from May 6 through May 21, so it is not a great test of whether the -r298990 update included a software change that might be contributing.