Bug 233764 - [amdtemp] does not know correct offset for AMD Family 15h (A8-7600, FX-8300, etc) Tctl
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
Reported: 2018-12-04 07:02 UTC by gosha-necr
Modified: 2018-12-07 20:45 UTC

Description gosha-necr 2018-12-04 07:02:19 UTC
Good day!

I have a PC based on CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G   (3094.28-MHz K8-class CPU)
uname -a : FreeBSD MAIN-GATE 11.2-STABLE FreeBSD 11.2-STABLE #0 r340490: Sat Nov 17 15:18:12 +08 2018

And it shows incorrect temperature:
--------------------------------------
dev.amdtemp.0.core0.sensor0: -0,0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: -0,0C
dev.cpu.2.temperature: -0,0C
dev.cpu.1.temperature: -0,0C
dev.cpu.0.temperature: -0,0C
Comment 1 Conrad Meyer freebsd_committer 2018-12-04 18:20:49 UTC
That's a family 15h, model 30h (I think).  Relevant BKDG is here:

https://www.amd.com/system/files/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

Should be D18F3xA4 ("Reported Temperature Control"), CurTmp[31:21] like other models.  0xA4 matches our AMDTEMP_REPTMP_CTRL.

Maybe this model uses CurTmpTjSel and an adjusted range like the Family 17h 2990WX models.  Support for that was added in 12.x, for family 17h only.  That would explain a reading of 0.0 if the real temperature were *exactly* 49.0°C, but that seems unlikely.

Interestingly, the D18F3x00 deviceid is documented as 141D, which we define as MISC17.

So as far as I can tell, we should be doing the right thing on that CPU.

What hostbridge device did amdtemp attach to on your system?  Well, "dev.amdtemp.0.%parent: hostb7".  What's that?  Can you please provide "devinfo -v |grep hostb7"?  And maybe "devinfo -v | grep device=0x141d" as well.

Thanks.
Comment 2 gosha-necr 2018-12-04 18:30:25 UTC
Conrad Meyer,
Here is the requested information:

devinfo -v | grep hostb7
        hostb7 pnpinfo vendor=0x1022 device=0x141d subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

devinfo -v | grep device=0x141d
        hostb7 pnpinfo vendor=0x1022 device=0x141d subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3
Comment 3 Conrad Meyer freebsd_committer 2018-12-04 18:36:19 UTC
(In reply to gosha-necr from comment #2)
Thanks, it looks like we're attaching to the right device.  Can you provide:

  pciconf -r pci0:0:24:3 0xa4

Thanks!
Comment 4 gosha-necr 2018-12-04 18:46:55 UTC
Here it is:

pciconf -r pci0:0:24:3 0xa4
18400fef
Comment 5 Conrad Meyer freebsd_committer 2018-12-04 19:55:46 UTC
(In reply to gosha-necr from comment #4)
Thanks.  My read of that is:

0x   1    8    4    0    0    f    e    f = 
0b0001_1000_0100_0000_0000_1111_1110_1111

So CurTmp is 194 and CurTmpTjSel is 0.

RangeUnadjusted = (D18F3xA4[CurTmpTjSel]!=11b)
                = (0                    != 11b)
                = 1

So we shouldn't be subtracting 49.0°, and 194 represents a temperature of 24.25°C, which we should report correctly (at least on HEAD).  We don't have a path that subtracts 49.0° on Fam <17h.

So I'm not sure why we report zero.  The other bits are:

PerStepTimeDn: 15
TmpSlewDnEn: 1
TmpMaxDiffUp: 11b => 9.0
PerStepTimeUp: 15
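
The decode above can be reproduced with a short Python sketch. The helper name `decode_reptmp_ctrl` is made up for illustration, and the bit positions of the slew-control fields are an assumption inferred from the values listed here; they should be checked against the BKDG. The -49.0 adjustment follows the RangeUnadjusted rule quoted above and is not something the driver does on Fam <17h.

```python
def decode_reptmp_ctrl(reg):
    """Split a raw D18F3xA4 value into its fields (illustrative helper)."""
    cur_tmp = (reg >> 21) & 0x7ff      # CurTmp, bits 31:21, 0.125 degC per unit
    tj_sel = (reg >> 16) & 0x3         # CurTmpTjSel, bits 17:16
    celsius = cur_tmp * 0.125
    if tj_sel == 0x3:                  # adjusted range: subtract 49.0
        celsius -= 49.0
    return {
        "CurTmp": cur_tmp,
        "CurTmpTjSel": tj_sel,
        "RangeUnadjusted": int(tj_sel != 0x3),
        "nominal_celsius": celsius,
        "PerStepTimeDn": (reg >> 8) & 0x1f,   # assumed bits 12:8
        "TmpSlewDnEn": (reg >> 7) & 0x1,      # assumed bit 7
        "TmpMaxDiffUp": (reg >> 5) & 0x3,     # assumed bits 6:5
        "PerStepTimeUp": reg & 0x1f,          # assumed bits 4:0
    }


# Value read from the A8-7600 with "pciconf -r pci0:0:24:3 0xa4".
for name, value in decode_reptmp_ctrl(0x18400fef).items():
    print(f"{name}: {value}")
```

For 0x18400fef this gives CurTmp 194 and a nominal 24.25, matching the hand decode.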

Maybe something is broken in 11.2, I don't know.  Out of curiosity, can you provide "AMDTEMP_CPUID":

  pciconf -r pci0:0:24:3 0xfc

Thanks.
Comment 6 gosha-necr 2018-12-04 20:06:32 UTC
(In reply to Conrad Meyer from comment #3)
(In reply to Conrad Meyer from comment #5)

Conrad, here is the requested info:
pciconf -r pci0:0:24:3 0xfc
00630f01

Some additional information:
The room where the PC is located is at about 18C.

powerd is running on that PC, and when the PC is idle I get these temperature readings:
-----------------------------------
dev.amdtemp.0.core0.sensor0: -0,0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: -0,0C
dev.cpu.2.temperature: -0,0C
dev.cpu.1.temperature: -0,0C
dev.cpu.0.temperature: -0,0C

If I kill powerd and the PC is idle:
-----------------------------------
dev.amdtemp.0.core0.sensor0: 7,1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: 7,1C
dev.cpu.2.temperature: 7,1C
dev.cpu.1.temperature: 7,1C
dev.cpu.0.temperature: 7,1C

And without powerd, while running make -j8 buildworld:
-----------------------------------
dev.amdtemp.0.core0.sensor0: 22,6C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: 22,6C
dev.cpu.2.temperature: 22,6C
dev.cpu.1.temperature: 22,6C
dev.cpu.0.temperature: 22,6C

Maybe that info helps :)
Comment 7 Conrad Meyer freebsd_committer 2018-12-05 01:37:00 UTC
(In reply to gosha-necr from comment #6)
Yes, that's very interesting, thanks.
Comment 8 Conrad Meyer freebsd_committer 2018-12-06 23:38:21 UTC
I don't have any good explanation for why the values seem too low.  The 11.2 source tree for amdtemp looks basically the same for that model/family of CPU as in head.
Comment 9 gosha-necr 2018-12-07 16:03:06 UTC
(In reply to Conrad Meyer from comment #8)
On a PC with a different CPU in the same room:
AMD FX-8300 with powerd -m 1400 -M 1400 and without any services running:
------------------------------------------
last pid: 97542;  load averages:  0.06,  0.13,  0.18    up 12+23:52:32  21:00:28
44 processes:  1 running, 43 sleeping

dev.amdtemp.0.core0.sensor0: 10,1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb4
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.7.temperature: 10,1C
dev.cpu.6.temperature: 10,1C
dev.cpu.5.temperature: 10,1C
dev.cpu.4.temperature: 10,1C
dev.cpu.3.temperature: 10,1C
dev.cpu.2.temperature: 10,1C
dev.cpu.1.temperature: 10,1C
dev.cpu.0.temperature: 10,1C
dev.hwpstate.0.freq_settings: 3200/11137 2800/8835 2300/6268 1900/4641 1400/3060
dev.cpu.0.freq_levels: 3200/11137 2800/8835 2300/6268 1900/4641 1400/3060
dev.cpu.0.freq: 1400

and at the same time, the PC with the AMD A8-7600, with powerd -m 1400 -M 1400 and working as a router:
------------------------------------------
last pid: 46962;  load averages:  0.44,  0.38,  0.28    up 1+20:14:25  21:01:33
107 processes: 1 running, 106 sleeping

dev.amdtemp.0.core0.sensor0: -0,0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: -0,0C
dev.cpu.2.temperature: -0,0C
dev.cpu.1.temperature: -0,0C
dev.cpu.0.temperature: -0,0C
dev.hwpstate.0.freq_settings: 3100/11750 2800/9570 2400/7380 1900/5775 1400/4162
dev.cpu.0.freq_levels: 3100/11750 2800/9570 2400/7380 1900/5775 1400/4162
dev.cpu.0.freq: 1400


I think the values from the A8-7600 CPU are the incorrect ones.
Comment 10 Conrad Meyer freebsd_committer 2018-12-07 16:24:58 UTC
(In reply to gosha-necr from comment #9)
Even 10,1C seems unlikely, given you've said the room is about 18C ambient?  Looks like they're both wrong.
Comment 11 gosha-necr 2018-12-07 19:17:32 UTC
The 18C is just my impression; I may be wrong about the room temperature :)

But these 2 PCs are placed right next to each other, so they are in equal conditions.

For comparison, here is smartctl -a /dev/ada0 | grep -i temp
From A8-7600
----------------
194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (255 254 0 0 0)

From FX-8300
----------------
194 Temperature_Celsius     0x0022   119   101   000    Old_age   Always       -       28

So both HDDs (different models, but both idle now and both 3.5-inch, 1 TB) have nearly identical temperatures, so the in-case conditions seem equal.

Conrad, can I help diagnose this somehow?
Comment 12 Conrad Meyer freebsd_committer 2018-12-07 19:35:16 UTC
(In reply to gosha-necr from comment #11)
It seems like probably the HDD temperature can be inferred to be a floor.  The CPU is unlikely to be cooler.

As far as diagnosing, I am kind of stuck.  Maybe these CPUs run with Tctl at some offset.  But I don't know what the offset is.  One more PCI read for you:

  pciconf -r pci0:0:24:3 0x64

This is "Hardware Thermal Control" and bits [22:16] tell us the "HtcTmpLimit."

As the BKDG notes, the CurTmp value does not reflect an actual temperature, but rather:

> Tctl is a temperature on its own scale aligned to the processors cooling
> requirements.  Therefore Tctl does not represent a temperature which could be
> measured on the die or the case of the processor. Instead, it specifies the
> processor temperature relative to the maximum operating temperature, Tctl,max.
> Tctl,max is specified in the power and thermal data sheet.

So maybe there is some relationship between Tctl,max and HtcTmpLimit.  Or maybe we can find a power and thermal data sheet for 15h and find the correct offsets for A8-7600 / FX-8300.

I bumped bug version to "CURRENT" because I don't believe there are any significant differences between amdtemp on 15h in 11.2 and on CURRENT.
Comment 13 gosha-necr 2018-12-07 19:41:26 UTC
(In reply to Conrad Meyer from comment #12)
Conrad, here it is:
pciconf -r pci0:0:24:3 0x64
72240005

Also, tomorrow I will go there and check the ambient temperature, so we know what values to compare against.

Thanks for spending your time on this question :)
Comment 14 gosha-necr 2018-12-07 19:54:05 UTC
(In reply to Conrad Meyer from comment #12)
Also, just in case, here is the info from the FX-8300:

uname -a
FreeBSD BSD-MAIN 11.2-STABLE FreeBSD 11.2-STABLE #0 r340490: Sat Nov 24 15:50:22 +08 2018     root@BSD-MAIN:/usr/obj/usr/src/sys/BSDSERV  amd64

devinfo -v | grep hostb4
        hostb4 pnpinfo vendor=0x1022 device=0x1603 subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

devinfo -v | grep device=0x1603
        hostb4 pnpinfo vendor=0x1022 device=0x1603 subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

pciconf -r pci0:0:24:3 0xa4
0b600fef

pciconf -r pci0:0:24:3 0xfc
00600f12

pciconf -r pci0:0:24:3 0x64
664c0005

Hope it helps!
Comment 15 Conrad Meyer freebsd_committer 2018-12-07 20:14:52 UTC
(In reply to gosha-necr from comment #13)
> pciconf -r pci0:0:24:3 0x64
> 72240005

HTC_TMP_LMT:

  python3 -c 'print(((0x72240005 >> 16) & 0x7f) * 0.5 + 52)'
  70.0

I.e., 70 "°C" in whatever scale Tctl is on is "max," I guess.  Given that seems low, there is probably an offset on that scale.  I don't know what it is, and the Power and Thermal Document does not seem to be published :(.

(In reply to gosha-necr from comment #14)
> pciconf -r pci0:0:24:3 0xa4
> 0b600fef

CurTmpTjSel:

  python3 -c 'print((0x0b600fef >> 16) & 0x3)'
  0

(I.e., RangeUnadjusted=1)

CurTmp:

  python3 -c 'print(((0x0b600fef >> 21) & 0x7ff) * 0.125)'
  11.375

(°C, nominal)

> pciconf -r pci0:0:24:3 0x64
> 664c0005

HTC_TMP_LMT:

  python3 -c 'print(((0x664c0005 >> 16) & 0x7f) * 0.5 + 52)'
  90.0

So this one is maybe 20° less offset than the other one, although I'm not sure of that — maybe the other one just throttles more aggressively.  It seems like the total offset can't be much more than 5-10° since 100°C is quite hot for a CPU.  But given HDDtemp of 28°C, I don't know.  That'd suggest 17°+ offset, which makes for a throttle at 107°+.  Extremely hot.
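
The arithmetic behind those estimates, as a rough Python sketch. It assumes Tctl differs from real °C by a constant offset only (which may well not hold) and treats the 28°C drive reading as a lower bound for the FX-8300 die temperature. The helper names are made up for illustration.

```python
def htc_tmp_limit(reg):
    """HtcTmpLimit from D18F3x64 bits 22:16, in 0.5 degree steps above 52."""
    return ((reg >> 16) & 0x7f) * 0.5 + 52


def cur_tmp(reg):
    """Nominal Tctl from D18F3xA4 bits 31:21, in 0.125 degree steps."""
    return ((reg >> 21) & 0x7ff) * 0.125


a8_limit = htc_tmp_limit(0x72240005)   # A8-7600: 70.0
fx_limit = htc_tmp_limit(0x664c0005)   # FX-8300: 90.0
fx_tctl = cur_tmp(0x0b600fef)          # FX-8300: 11.375

# Assumption: the CPU is at least as warm as the idle HDD next to it.
hdd_floor = 28.0

min_offset = hdd_floor - fx_tctl           # smallest offset consistent with the floor
implied_throttle = fx_limit + min_offset   # HTC limit shifted by that offset

print(f"limit difference: {fx_limit - a8_limit} degrees")
print(f"minimum offset:   {min_offset} degrees")
print(f"implied throttle: {implied_throttle} degrees")
```

With the values from this bug, the minimum offset comes out at 16.625 and the implied throttle point at 106.625, which is where the "17°+" and "107°+" figures come from.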
Comment 16 Conrad Meyer freebsd_committer 2018-12-07 20:17:06 UTC
FWIW, cpu-world.com claims both parts have a maximum operating temperature of 70.5-71°C.
Comment 17 Conrad Meyer freebsd_committer 2018-12-07 20:18:51 UTC
The AMD OverDrive Windows utility knows the magic Tctl offsets to make the results sensible.
Comment 18 Conrad Meyer freebsd_committer 2018-12-07 20:28:59 UTC
Ok, one more observation.  It seems Fam 15h models 0x60-0x7f relocated the actual sensor to a different PCI device and offset.  Your CPUs are models:

pciconf -r pci0:0:24:3 0xfc
00630f01

  python3 -c 'print("0x%x" % (((0x00630f01 & 0xf0000) >> 12) | ((0x00630f01 & 0xf0) >> 4)))'

=> 0x30

pciconf -r pci0:0:24:3 0xfc
00600f12

=> 0x01

So it doesn't apply.
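
For reference, a small Python sketch of the same check that decodes family, model, and stepping together. It assumes D18F3xFC uses the standard CPUID Fn0000_0001 EAX layout, with the extended family and model fields applied when the base family is 0xF. The names `decode_cpuid` and `sensor_relocated` are made up for illustration.

```python
def decode_cpuid(reg):
    """Return (family, model, stepping) from a D18F3xFC / CPUID EAX value."""
    stepping = reg & 0xf
    base_model = (reg >> 4) & 0xf
    base_family = (reg >> 8) & 0xf
    ext_model = (reg >> 16) & 0xf
    ext_family = (reg >> 20) & 0xff
    if base_family == 0xf:
        # Extended fields only apply when the base family is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


def sensor_relocated(reg):
    """True for Fam 15h models 0x60-0x7f, per the observation above."""
    family, model, _ = decode_cpuid(reg)
    return family == 0x15 and 0x60 <= model <= 0x7f


# Values read with "pciconf -r pci0:0:24:3 0xfc" on the two machines.
for reg in (0x00630f01, 0x00600f12):
    family, model, stepping = decode_cpuid(reg)
    print(f"{reg:#010x}: family {family:#x}, model {model:#04x}, "
          f"stepping {stepping}, relocated sensor: {sensor_relocated(reg)}")
```

Both values decode to family 0x15, with models 0x30 and 0x01, so neither falls in the 0x60-0x7f range.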
Comment 19 gosha-necr 2018-12-07 20:32:58 UTC
Conrad, would it be useful if I checked the CPU temperature in the BIOS?
Comment 20 Conrad Meyer freebsd_committer 2018-12-07 20:37:30 UTC
From the googling I've done, I think it's basically a known issue that family 15h underreports or reports nonsensically low values at idle :-/.  It should report closer to reality values under load, i.e., as it approaches 70.0°C (or maybe 90°C for the FX-8300).  So the scale isn't exactly 0.125°C per unit as the BKDG suggests — there is both some offset and some slope to the relationship with °C.  That's unfortunate.  I think this is a Closed:Overcome by events, sorry.
Comment 21 Conrad Meyer freebsd_committer 2018-12-07 20:38:18 UTC
(In reply to gosha-necr from comment #19)
Perhaps, although who knows if the BIOS is idle or spinning the CPU at 100%?  Or if it calculates it any more accurately.  Worth a shot.  If you have Windows dual-boot, try AMD Overdrive too.
Comment 22 gosha-necr 2018-12-07 20:45:03 UTC
(In reply to Conrad Meyer from comment #21)
No, I don't have Windows on those PCs.
Thanks for getting so deeply involved in this problem, Conrad :)

Of course it would be good if FreeBSD worked ideally in all cases, including the correct interpretation of various sensors, but there aren't enough human resources... Thanks once more!