I have a PowerMac G5 (dual PPC 970FX 2 GHz, 6 GB of RAM) on which I'm running FreeBSD 10.0-CURRENT (the same happened with 9.1-RELEASE). The machine is otherwise rock-solid: I have been able to build a variety of ports, and I can even do a buildworld -j3 without issues.

However, whenever I try to build pcre, the machine shuts down with this message:

WARNING: Current temperature (U3 HEATSINK: 82.5 C) exceeds critical temperature (80.0 C)! Shutting down!

The file that always causes this is libpcre16_la-pcre16_exec.lo. Pausing the compiler with Ctrl+Z every 1-3 seconds and resuming it after 5 seconds allows the compilation of this file to succeed without shutting the machine down.

nwhitehorn@ mentioned that this could be a problematic fan, but I haven't been able to find it in the machine. Because the machine has been stable otherwise, I don't think this is a hardware issue; it seems to me that this is a problem with the fcu driver and the way it manages this particular fan.

See this thread for the original discussion: http://lists.freebsd.org/pipermail/freebsd-ppc/2013-March/006207.html

Fix: Unknown, but this particular problem can be worked around as described above.

How-To-Repeat: Get FreeBSD running on a PowerMac G5, build pcre, and watch the machine shut down (assuming this is not a hardware problem).
It seems to me that the powermac_thermal driver should cope with possibly-faulty sensors (or just with bad readings from them) by not trusting a single reading before taking an action as drastic as shutting the machine down. The attached patch makes the driver consider several readings in a row before shutting off.

With this patch, building pcre on my machine results in the following log:

WARNING: Current temperature (U3 HEATSINK: 84.3 C) exceeds critical temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 84.3 C) exceeds critical temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 121.5 C) exceeds critical temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 121.5 C) exceeds critical temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 82.0 C) exceeds critical temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 82.0 C) exceeds critical temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 91.8 C) exceeds critical temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 91.8 C) exceeds critical temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 91.8 C) exceeds critical temperature (80.0 C); count=3

Note the big jumps from previously-good temperatures to supposedly-bad temperatures (80 C to 121.5 C) and how quickly they go back down (2-3 readings with a period of hz). I don't know if this is caused by a bad sensor or just by bad individual readings.

-- Julio Merino / @jmmv
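As an illustration of the idea only (this is not the attached patch, which is not reproduced in this thread), a minimal userspace sketch of "act only after several readings in a row" could look like the following. The names streak_filter, streak_filter_update and CRITICAL_STREAK are hypothetical, and the threshold of 3 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold for this sketch; not a value from the patch. */
#define CRITICAL_STREAK 3

/* Hypothetical filter state; not a kernel structure. */
struct streak_filter {
	int limit;	/* critical threshold, same unit as the readings */
	int streak;	/* consecutive over-limit readings seen so far */
};

/*
 * Feed one reading.  Returns true only once CRITICAL_STREAK readings
 * in a row have exceeded the limit.
 */
static bool
streak_filter_update(struct streak_filter *f, int reading)
{
	assert(f != NULL);
	if (reading > f->limit)
		f->streak++;
	else
		f->streak = 0;	/* any good reading breaks the streak */
	return (f->streak >= CRITICAL_STREAK);
}
```

With readings in tenths of a degree and a limit of 800, two spikes followed by a normal reading never trigger, while three over-limit readings in succession do.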
Author: nwhitehorn
Date: Fri Oct 25 03:55:52 2013
New Revision: 257093
URL: http://svnweb.freebsd.org/changeset/base/257093

Log:
  Be a little more suspicious of thermal sensors, which can have single
  crazy readings occasionally. One wild reading should not be enough to
  trigger a shutdown, so instead wait for several concerning readings in
  a row.

  PR:		powerpc/180593
  Submitted by:	Julio Merino
  MFC after:	1 week

Modified:
  head/sys/powerpc/powermac/powermac_thermal.c

Modified: head/sys/powerpc/powermac/powermac_thermal.c
==============================================================================
--- head/sys/powerpc/powermac/powermac_thermal.c	Fri Oct 25 03:18:56 2013	(r257092)
+++ head/sys/powerpc/powermac/powermac_thermal.c	Fri Oct 25 03:55:52 2013	(r257093)
@@ -68,6 +68,8 @@ struct pmac_fan_le {
 struct pmac_sens_le {
 	struct pmac_therm		*sensor;
 	int				last_val;
+#define MAX_CRITICAL_COUNT 6
+	int				critical_count;
 	SLIST_ENTRY(pmac_sens_le)	entries;
 };
 static SLIST_HEAD(pmac_fans, pmac_fan_le) fans = SLIST_HEAD_INITIALIZER(fans);
@@ -106,14 +108,27 @@ pmac_therm_manage_fans(void)
 		sensor->last_val = temp;
 
 		if (sensor->last_val > sensor->sensor->max_temp) {
+			sensor->critical_count++;
 			printf("WARNING: Current temperature (%s: %d.%d C) "
-			    "exceeds critical temperature (%d.%d C)! "
-			    "Shutting down!\n", sensor->sensor->name,
-			    (sensor->last_val - ZERO_C_TO_K) / 10,
-			    (sensor->last_val - ZERO_C_TO_K) % 10,
-			    (sensor->sensor->max_temp - ZERO_C_TO_K) / 10,
-			    (sensor->sensor->max_temp - ZERO_C_TO_K) % 10);
-			shutdown_nice(RB_POWEROFF);
+			    "exceeds critical temperature (%d.%d C); "
+			    "count=%d\n",
+			    sensor->sensor->name,
+			    (sensor->last_val - ZERO_C_TO_K) / 10,
+			    (sensor->last_val - ZERO_C_TO_K) % 10,
+			    (sensor->sensor->max_temp - ZERO_C_TO_K) / 10,
+			    (sensor->sensor->max_temp - ZERO_C_TO_K) % 10,
+			    sensor->critical_count);
+			if (sensor->critical_count >= MAX_CRITICAL_COUNT) {
+				printf("WARNING: %s temperature exceeded "
+				    "critical temperature %d times in a row; "
+				    "shutting down!\n",
+				    sensor->sensor->name,
+				    sensor->critical_count);
+				shutdown_nice(RB_POWEROFF);
+			}
+		} else {
+			if (sensor->critical_count > 0)
+				sensor->critical_count--;
 		}
 	}
 
@@ -177,6 +192,8 @@ pmac_thermal_sensor_register(struct pmac
 	list_entry = malloc(sizeof(struct pmac_sens_le), M_PMACTHERM,
 	    M_ZERO | M_WAITOK);
 	list_entry->sensor = sensor;
+	list_entry->last_val = 0;
+	list_entry->critical_count = 0;
 
 	SLIST_INSERT_HEAD(&sensors, list_entry, entries);
 }
Is this still the case? I know the quads can have problems if the LCS starts deteriorating (I have 2 with that problem), but I think the changes Nathan committed (referenced here) and the fan control changes I made a few months ago should mitigate this problem in most cases where the LCS is not bad.
The changes I submitted are enough to keep my PowerMac G5 up and running. I haven't checked recently whether the invalid readings still appear, but I think it's OK to close this bug.
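For anyone revisiting this PR, the counting logic committed in r257093 can be modeled outside the kernel. The sketch below is a simplified userspace stand-in, not the driver itself: sim_sensor, sim_sensor_step and format_celsius are hypothetical names, shutdown_nice() is replaced by a boolean return value, and the value 2732 for ZERO_C_TO_K (0 C in tenths of a kelvin) is an assumption, since the diff does not show its definition. MAX_CRITICAL_COUNT is the value from the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define ZERO_C_TO_K		2732	/* assumed: 0 C in tenths of a kelvin */
#define MAX_CRITICAL_COUNT	6	/* value from r257093 */

/* Simplified stand-in for struct pmac_sens_le; not the kernel structure. */
struct sim_sensor {
	int max_temp;		/* critical threshold, tenths of a kelvin */
	int critical_count;	/* over-threshold tally, decays on good readings */
};

/* Render tenths of a kelvin as "NN.N C" with the same arithmetic as the driver's printf. */
static void
format_celsius(int decikelvin, char *buf, size_t len)
{
	snprintf(buf, len, "%d.%d C", (decikelvin - ZERO_C_TO_K) / 10,
	    (decikelvin - ZERO_C_TO_K) % 10);
}

/* One polling step; returns true where the driver would call shutdown_nice(). */
static bool
sim_sensor_step(struct sim_sensor *s, int temp)
{
	assert(s != NULL);
	if (temp > s->max_temp) {
		s->critical_count++;
		return (s->critical_count >= MAX_CRITICAL_COUNT);
	}
	if (s->critical_count > 0)
		s->critical_count--;	/* decays by one; does not reset to zero */
	return (false);
}
```

One property this makes visible: because a good reading decrements the counter rather than clearing it, the shutdown condition is "six more bad readings than good ones since the counter was last zero" rather than strictly six in a row. Five bad readings, one good reading, and two more bad readings are enough to trip it.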