Bug 264775 - [PATCH] acpi_thermal: passive cooling only throttles cpu0
Summary: [PATCH] acpi_thermal: passive cooling only throttles cpu0
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-acpi (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2022-06-19 20:25 UTC by crahman
Modified: 2023-04-09 21:10 UTC
CC List: 1 user

See Also:


Attachments
Set all the CPUs in acpi_thermal, supporting more processors. (4.03 KB, patch)
2022-06-26 23:47 UTC, crahman

Description crahman 2022-06-19 20:25:19 UTC
Back when all the cpu frequencies were locked to cpu0's frequency, this worked fine.  However, now that this is no longer true, reducing the frequency of cpu0 while leaving the others at their maximum frequency has only a minor effect on temperatures, leading to overheating and shutdowns when passive cooling is needed.
Comment 1 rkoberman 2022-06-21 06:15:12 UTC
I am seeing a slightly different case that MAY be a similar issue.

During a big build, I will often see all CPUs except cpu0 drop from full speed (2101) to minimum (400). This includes the second thread on CPU0, which is clearly nonsensical. The system then cools to about 45C and the fan speed drops to 0. At that point, the system stays at 400. I can set CPU0 to 400 and then to 2010 and all CPUs show 2101, but often the system does not speed up.

I test by the trivial technique of building a small port, audio/taglib. It should build in about 15 seconds, but when the system is running slow it takes about 2:08.

If the system remains slow, after some long time (many minutes) it will return to full speed.

Clearly, something is really wrong. This has been happening since at least 13.0 and continues to the current 13.1 stable. The system is a Lenovo L15 which has the problem of locking up when P-States are not disabled. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253288.
This may be related.
Comment 2 crahman 2022-06-26 23:47:03 UTC
Created attachment 234960
Set all the CPUs in acpi_thermal, supporting more processors.
Comment 3 crahman 2022-06-26 23:49:46 UTC
I've made a patch to set all the processors when cpu0 is set, following the example of kern_cpu.c.  This solves the problem I had with the Celeron N4120 Quad Processor in which passive cooling only affected cpu0, making passive cooling ineffective.
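
As a rough illustration of the difference, here is a minimal user-space model in Python. It is not the attached patch, which is kernel C code in acpi_thermal; the function name `throttle` and the frequency values (borrowed from the numbers quoted in comment 1) are assumptions made only for this sketch.

```python
def throttle(freqs, target_mhz, all_cpus):
    """Return new per-CPU frequencies (MHz) after a passive-cooling request.

    freqs    -- current frequency of each CPU, indexed by CPU number
    all_cpus -- False models the old behaviour (cpu0 only),
                True models the behaviour the patch aims for
    """
    if all_cpus:
        # Every CPU is lowered to at most the requested frequency.
        return [min(f, target_mhz) for f in freqs]
    # Old behaviour: only cpu0 is lowered; the others keep running as before.
    return [min(freqs[0], target_mhz)] + list(freqs[1:])


before = [2101, 2101, 2101, 2101]   # hypothetical quad-core at full speed
cpu0_only = throttle(before, 400, all_cpus=False)
everyone = throttle(before, 400, all_cpus=True)
```

In the cpu0-only case three of the four cores keep producing heat at full speed, which is why passive cooling was ineffective.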
Comment 4 crahman 2022-08-06 10:58:45 UTC
(In reply to rkoberman from comment #1)
Looking at your problem, I do not think it is related - but I have not inspected the device-specific code involved, and I can think of one way it could be related.

So perhaps you should just test this patch and see if it helps.
Comment 5 rkoberman 2022-08-12 01:11:35 UTC
(In reply to crahman from comment #4)
While I can't reliably trigger the problem, after installing the patch and rebooting I was unable to trigger it, even by building firefox. In the past, a compile of that size would trigger the issue.

I don't see the issue of "sysctl dev.cpu | grep freq:" reporting 2101 for CPU0 and 400 for all others. This was clearly bogus. I now see all CPUs at 2101. Unfortunately, this appears to have been bogus as well, as the system appeared to be running at a lower speed during the firefox compile. I may build it again and perform a test to confirm this. (That involves pausing the compile of firefox and timing a small compile of audio/taglib. At 2101 it should compile in under 15 seconds. At 400, it takes over 2 minutes.) If I get a chance to do this, I will report my findings.

In any case, this patch clearly fixes a problem and looks like it should be committed.
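
For anyone repeating this check, a small Python sketch can flag CPUs whose reported frequency differs from cpu0's. It assumes the output of "sysctl dev.cpu | grep freq:" consists of lines of the form "dev.cpu.N.freq: VALUE"; the helper names and the sample text are hypothetical, and the script only inspects what sysctl reports, not the speed the hardware is actually running at.

```python
import re

# Matches lines such as "dev.cpu.3.freq: 400" (not "freq_levels").
FREQ_LINE = re.compile(r"^dev\.cpu\.(\d+)\.freq:\s*(\d+)\s*$")


def parse_freqs(sysctl_output):
    """Map CPU number -> reported frequency in MHz."""
    freqs = {}
    for line in sysctl_output.splitlines():
        match = FREQ_LINE.match(line.strip())
        if match:
            freqs[int(match.group(1))] = int(match.group(2))
    return freqs


def mismatched(freqs):
    """Return the CPUs whose reported frequency differs from cpu0's."""
    reference = freqs[0]
    return sorted(cpu for cpu, mhz in freqs.items() if mhz != reference)


# Hypothetical sample resembling the situation described above.
sample = """\
dev.cpu.0.freq: 2101
dev.cpu.1.freq: 400
dev.cpu.2.freq: 400
dev.cpu.3.freq: 400
"""
report = parse_freqs(sample)
odd_ones = mismatched(report)
```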
Comment 6 rkoberman 2022-08-28 18:51:19 UTC
(In reply to crahman from comment #4)
You are correct in that I see little change to my problem with this patch. I've now been running it for about 3 weeks.

One thing that really surprised me is that the weird case still occurs where I see cpu0 at 2101 and all others at a reduced speed, usually 400 (the lowest speed) but occasionally somewhere between 2100 and 400. I didn't think that this should be possible! Obviously a CPU cannot really be running its two threads at different speeds. I suspect that this is related to the issue I am having.

I can say that the sysctl lies. I can see all eight "CPUs" reporting 2101 and a compile of audio/taglib taking over 2 minutes to build while, when really running at 2101, it takes 15 seconds. Clearly something is broken, but this is not it.

This may be going away as I am about to install CURRENT on my new laptop. (Just partitioned the SSD.) Hopefully, it will be better behaved, but I am concerned as it has cores running at different speeds and with different numbers of hyper-threads (Alder Lake), and FreeBSD will not really know how to deal with this.
Comment 7 crahman 2022-08-28 23:07:56 UTC
(In reply to rkoberman from comment #6)
This all sounds vaguely familiar, as if I once had to deal with it years ago.  Probably some sleepless night needing to fix a server before morning.

Anyway, let me know what speed control module is involved and I'll look at it.  If you're not sure send a dmesg and I'll try to find it.
Comment 8 Graham Perrin freebsd_committer freebsd_triage 2022-10-17 12:39:17 UTC
Keyword: 

    patch
or  patch-ready

– in lieu of summary line prefix: 

    [patch]

* bulk change for the keyword
* summary lines may be edited manually (not in bulk). 

Keyword descriptions and search interface: 

    <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>
Comment 9 rkoberman 2022-10-17 17:21:18 UTC
(In reply to crahman from comment #7)
Thanks, but that system was retired for several reasons two weeks ago. The thermal and performance issues were secondary to the mechanical failure of the case (multiple cracks, most notably at the hinge attachments). The issues I had with the old unit are not present with the replacement. The P-states work correctly without the fix recently committed to CURRENT. Temperatures remain reasonable for even very large builds (rust, world).

I might mention that performance and thermal management is very different on the Alder Lake processor running CURRENT. At this time there is no support for cores with different clock rates and I run with only the 4 P threads to prevent frequent panics. It's painful to have 8 cores doing nothing.
Comment 10 crahman 2023-04-09 10:39:03 UTC
I didn't have time to address this last year, but the problem described in the comments of this bug is related to a failure of the save/restore mechanism of the cpu frequency.  I've described it in more detail in differential D35704.  In short, the cpu frequency is saved and restored as thermal control is activated and deactivated.  But the restore mechanism breaks, and so one ends up with a low cpu frequency until a reboot.

It would be very useful to finally fix this bug.

Instead of maintaining this patch in multiple places, the latest revision is at https://reviews.freebsd.org/D35704 along with code to allow active/passive cooling to be controlled by power_profile.
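
To make the save/restore idea concrete, here is a toy Python model. It is only a sketch: the class and method names are hypothetical, and the failure it guards against (the saved value being overwritten by a second activation while already throttled) is just one way such a mechanism could break. The actual cause is the one described in D35704 and may differ.

```python
class PassiveCooling:
    """Toy model: save the CPU frequency on activation, restore on deactivation."""

    def __init__(self, current_mhz):
        self.current_mhz = current_mhz
        self.saved_mhz = None  # frequency to return to once cooling ends

    def activate(self, throttled_mhz):
        # Save only on the first activation. Saving again while already
        # throttled would replace the original value with the low one.
        if self.saved_mhz is None:
            self.saved_mhz = self.current_mhz
        self.current_mhz = min(self.current_mhz, throttled_mhz)

    def deactivate(self):
        # Restore the saved frequency, if any, and clear the saved slot.
        if self.saved_mhz is not None:
            self.current_mhz = self.saved_mhz
            self.saved_mhz = None


cpu = PassiveCooling(current_mhz=2101)  # hypothetical starting frequency
cpu.activate(800)
cpu.activate(400)   # second activation while still throttled
cpu.deactivate()
```

If `activate` saved unconditionally, the second call would store 800 and the CPU would stay slow after `deactivate`, which resembles the symptom of a low frequency persisting until reboot.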
Comment 11 rkoberman 2023-04-09 21:10:14 UTC
I'll apply this patch and see how it does, but the issue I saw, which led to many over-temp warnings with the fan speed remaining low, has stopped being an issue since the kernel was modified to allow all 12 threads to run without panicking.

Typically, when not busy, the performance CPUs are running at a slightly lower temperature than the others, but all seem to adjust up and down as they should and I have not seen a temperature alert since 2022-11-20.

Of course, my problem may not be the same as yours. I'm running HEAD and try to update it every 2 or 3 weeks.