Bug 180593 - PowerMac G5 shuts down when building pcre
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: powerpc
Version: 10.0-CURRENT
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-ppc (Nobody)
Reported: 2013-07-16 12:40 UTC by Julio Merino
Modified: 2015-07-10 02:24 UTC

Attachments
powermac_thermal.diff (1.92 KB, patch)
2013-09-15 16:45 UTC, Julio Merino

Description Julio Merino 2013-07-16 12:40:00 UTC
	I have a PowerMac G5 (dual PPC 970FX 2 GHz, 6 GB of RAM) on which
	I'm running FreeBSD 10.0-CURRENT (but the same happened with
	9.1-RELEASE).  The machine is pretty much rock-solid in all
	cases: I have been able to build a variety of ports, and I can
	even do a buildworld -j3 without issues.

	Whenever I try to build pcre, the machine shuts down with this
	message:

	WARNING: Current temperature (U3 HEATSINK: 82.5 C) exceeds
	critical temperature (80.0 C)! Shutting down!

	The file that always causes this is libpcre16_la-pcre16_exec.lo.
	Pausing the compiler with Ctrl+Z every 1-3 seconds and then
	resuming it after 5 seconds allows the compilation of this file
	to succeed without shutting the machine down.

	nwhitehorn@ mentioned that this could be a problematic fan, but
	I haven't been able to find it in the machine.

	Because the machine has been stable otherwise, I don't think
	this is a hardware issue and it seems to me that this is just a
	problem with the fcu driver and the way it manages this
	particular fan.

	See this thread
	http://lists.freebsd.org/pipermail/freebsd-ppc/2013-March/006207.html
	for the original discussion.

Fix:
	Unknown, but this particular problem can be worked around as
	described above.

How-To-Repeat:
	Get FreeBSD running on a PowerMac G5, build pcre, and watch
	the machine shut down (assuming this is not a hardware problem).
Comment 1 Julio Merino 2013-09-15 16:45:21 UTC
It seems to me that the powermac_thermal driver should cope with
possibly-faulty sensors (or just with bad readings from them) by not
trusting a single reading when deciding on an action as drastic as
shutting the machine down.

The attached patch makes the driver consider several readings in a row
before shutting off.

With this patch, building pcre in the machine I have results in the
following log:

WARNING: Current temperature (U3 HEATSINK: 84.3 C) exceeds critical
temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 84.3 C) exceeds critical
temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 121.5 C) exceeds critical
temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 121.5 C) exceeds critical
temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 82.0 C) exceeds critical
temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 82.0 C) exceeds critical
temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 91.8 C) exceeds critical
temperature (80.0 C); count=1
WARNING: Current temperature (U3 HEATSINK: 91.8 C) exceeds critical
temperature (80.0 C); count=2
WARNING: Current temperature (U3 HEATSINK: 91.8 C) exceeds critical
temperature (80.0 C); count=3

Note the big jumps from previously-good temperatures to supposedly-bad
temperatures (80C to 121.5C) and how quickly (2-3 readings with a
period of hz) they go down. I don't know if this is caused by a bad
sensor or just by bad individual readings.

-- 
Julio Merino / @jmmv
Comment 2 dfilter service freebsd_committer freebsd_triage 2013-10-25 04:55:59 UTC
Author: nwhitehorn
Date: Fri Oct 25 03:55:52 2013
New Revision: 257093
URL: http://svnweb.freebsd.org/changeset/base/257093

Log:
  Be a little more suspicious of thermal sensors, which can have single
  crazy readings occasionally. One wild reading should not be enough to
  trigger a shutdown, so instead wait for several concerning readings in
  a row.
  
  PR:		powerpc/180593
  Submitted by:	Julio Merino
  MFC after:	1 week

Modified:
  head/sys/powerpc/powermac/powermac_thermal.c

Modified: head/sys/powerpc/powermac/powermac_thermal.c
==============================================================================
--- head/sys/powerpc/powermac/powermac_thermal.c	Fri Oct 25 03:18:56 2013	(r257092)
+++ head/sys/powerpc/powermac/powermac_thermal.c	Fri Oct 25 03:55:52 2013	(r257093)
@@ -68,6 +68,8 @@ struct pmac_fan_le {
 struct pmac_sens_le {
 	struct pmac_therm		*sensor;
 	int				last_val;
+#define MAX_CRITICAL_COUNT 6
+	int				critical_count;
 	SLIST_ENTRY(pmac_sens_le)	entries;
 };
 static SLIST_HEAD(pmac_fans, pmac_fan_le) fans = SLIST_HEAD_INITIALIZER(fans);
@@ -106,14 +108,27 @@ pmac_therm_manage_fans(void)
 			sensor->last_val = temp;
 
 		if (sensor->last_val > sensor->sensor->max_temp) {
+			sensor->critical_count++;
 			printf("WARNING: Current temperature (%s: %d.%d C) "
-			    "exceeds critical temperature (%d.%d C)! "
-			    "Shutting down!\n", sensor->sensor->name,
-			       (sensor->last_val - ZERO_C_TO_K) / 10,
-			       (sensor->last_val - ZERO_C_TO_K) % 10,
-			       (sensor->sensor->max_temp - ZERO_C_TO_K) / 10,
-			       (sensor->sensor->max_temp - ZERO_C_TO_K) % 10);
-			shutdown_nice(RB_POWEROFF);
+			    "exceeds critical temperature (%d.%d C); "
+			    "count=%d\n",
+			    sensor->sensor->name,
+			    (sensor->last_val - ZERO_C_TO_K) / 10,
+			    (sensor->last_val - ZERO_C_TO_K) % 10,
+			    (sensor->sensor->max_temp - ZERO_C_TO_K) / 10,
+			    (sensor->sensor->max_temp - ZERO_C_TO_K) % 10,
+			    sensor->critical_count);
+			if (sensor->critical_count >= MAX_CRITICAL_COUNT) {
+				printf("WARNING: %s temperature exceeded "
+				    "critical temperature %d times in a row; "
+				    "shutting down!\n",
+				    sensor->sensor->name,
+				    sensor->critical_count);
+				shutdown_nice(RB_POWEROFF);
+			}
+		} else {
+			if (sensor->critical_count > 0)
+				sensor->critical_count--;
 		}
 	}
 
@@ -177,6 +192,8 @@ pmac_thermal_sensor_register(struct pmac
 	list_entry = malloc(sizeof(struct pmac_sens_le), M_PMACTHERM,
 	    M_ZERO | M_WAITOK);
 	list_entry->sensor = sensor;
+	list_entry->last_val = 0;
+	list_entry->critical_count = 0;
 
 	SLIST_INSERT_HEAD(&sensors, list_entry, entries);
 }
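
The counter logic in the diff above can be exercised in isolation. The following is only a minimal sketch, not the driver code: struct sensor_state, sensor_update() and decicelsius_to_decikelvin() are hypothetical names introduced for illustration, and the value of ZERO_C_TO_K (temperatures stored in tenths of a kelvin) is an assumption, since the diff uses the macro without showing its definition. Only MAX_CRITICAL_COUNT and the increment/decrement behaviour are taken from the diff.

```c
#include <stdbool.h>

/* Assumption: readings are in tenths of a kelvin, so 0 C is 2732. */
#define ZERO_C_TO_K		2732
/* Same threshold as in the committed diff. */
#define MAX_CRITICAL_COUNT	6

/* Hypothetical stand-in for the relevant fields of struct pmac_sens_le. */
struct sensor_state {
	int	max_temp;	/* critical threshold, tenths of a kelvin */
	int	critical_count;	/* running count of over-threshold readings */
};

/* Convert tenths of a degree Celsius to tenths of a kelvin. */
static int
decicelsius_to_decikelvin(int dc)
{
	return (dc + ZERO_C_TO_K);
}

/*
 * Feed one reading to the sensor state.  Returns true when the driver
 * would request a shutdown, i.e. once the counter reaches
 * MAX_CRITICAL_COUNT.  A reading at or below the threshold lowers the
 * counter by one instead of clearing it, mirroring the else branch of
 * the diff.
 */
static bool
sensor_update(struct sensor_state *s, int temp)
{
	if (temp > s->max_temp) {
		s->critical_count++;
		return (s->critical_count >= MAX_CRITICAL_COUNT);
	}
	if (s->critical_count > 0)
		s->critical_count--;
	return (false);
}
```

With an 80.0 C threshold, six over-threshold readings with no good reading in between request a shutdown, while isolated spikes separated by good readings never do. Because a good reading only decrements the counter, "in a row" in the log message is approximate: a mostly-hot sequence with an occasional good reading can still reach the limit.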
Comment 3 Justin Hibbits freebsd_committer freebsd_triage 2015-07-08 23:44:42 UTC
Is this still the case?  I know the quads can have problems if the LCS (liquid cooling system) starts deteriorating (I have 2 with that problem), but I think the changes Nathan committed (referenced here) and the fan control changes I made a few months ago should mitigate this problem in most cases where the LCS is not failing.
Comment 4 Julio Merino freebsd_committer freebsd_triage 2015-07-10 02:24:07 UTC
The changes I submitted are enough to keep my PowerMac G5 up and running.  I haven't checked recently whether the invalid readings still appear, but I think it's OK to close this bug.