Bug 233764 - [amdtemp] does not know correct offset for AMD Family 15h (A8-7600, FX-8300, etc) Tctl
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Reported: 2018-12-04 07:02 UTC by George
Modified: 2021-07-30 13:08 UTC (History)
Description George 2018-12-04 07:02:19 UTC
Good day!

I have a PC based on CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G   (3094.28-MHz K8-class CPU)
uname -a : FreeBSD MAIN-GATE 11.2-STABLE FreeBSD 11.2-STABLE #0 r340490: Sat Nov 17 15:18:12 +08 2018

It reports an incorrect temperature:
--------------------------------------
dev.amdtemp.0.core0.sensor0: -0,0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: -0,0C
dev.cpu.2.temperature: -0,0C
dev.cpu.1.temperature: -0,0C
dev.cpu.0.temperature: -0,0C
Comment 1 Conrad Meyer freebsd_committer 2018-12-04 18:20:49 UTC
That's a family 15h, model 30h (I think).  Relevant BKDG is here:

https://www.amd.com/system/files/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

Should be D18F3xA4 ("Reported Temperature Control"), CurTmp[31:21] like other models.  0xA4 matches our AMDTEMP_REPTMP_CTRL.

Maybe this model uses CurTmpTjSel and adjusted range like the Family 17h 2990WX models.  Support for that was added in 12.x, for family 17h only.  That would make sense if the real temperature was *exactly* 49.0°C.  But 0.0 seems suspicious — it seems unlikely the true temperature is exactly 49.0.

Interestingly, the D18F3x00 deviceid is documented as 141D, which we define as MISC17.

So as far as I can tell, we should be doing the right thing on that CPU.

What hostbridge device did amdtemp attach to on your system?  Well, "dev.amdtemp.0.%parent: hostb7".  What's that?  Can you please provide "devinfo -v |grep hostb7"?  And maybe "devinfo -v | grep device=0x141d" as well.

Thanks.
Comment 2 George 2018-12-04 18:30:25 UTC
Conrad Meyer,
Here is the requested information:

devinfo -v | grep hostb7
        hostb7 pnpinfo vendor=0x1022 device=0x141d subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

devinfo -v | grep device=0x141d
        hostb7 pnpinfo vendor=0x1022 device=0x141d subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3
Comment 3 Conrad Meyer freebsd_committer 2018-12-04 18:36:19 UTC
(In reply to gosha-necr from comment #2)
Thanks, it looks like we're attaching to the right device.  Can you provide:

  pciconf -r pci0:0:24:3 0xa4

Thanks!
Comment 4 George 2018-12-04 18:46:55 UTC
Here it is:

pciconf -r pci0:0:24:3 0xa4
18400fef
Comment 5 Conrad Meyer freebsd_committer 2018-12-04 19:55:46 UTC
(In reply to gosha-necr from comment #4)
Thanks.  My read of that is:

0x   1    8    4    0    0    f    e    f = 
0b0001_1000_0100_0000_0000_1111_1110_1111

So CurTmp is 194 and CurTmpTjSel is 0.

RangeUnadjusted = (D18F3xA4[CurTmpTjSel]!=11b)
                = (0                    != 11b)
                = 1

So we shouldn't be subtracting 49.0°, and 194 represents a temperature of 24.25°C, which we should report correctly (at least on HEAD).  We don't have a path that subtracts 49.0° on Fam <17h.

So I'm not sure why we report zero.  The other bits are:

PerStepTimeDn: 15
TmpSlewDnEn: 1
TmpMaxDiffUp: 11b => 9.0
PerStepTimeUp: 15
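
For anyone who wants to repeat this decode on another register dump, here is a minimal Python sketch. The field positions are the ones implied by the reading above (CurTmp in bits [31:21], CurTmpTjSel in [17:16], the slew fields in the low 13 bits) and should be treated as assumptions, not as something checked against every BKDG revision. The function name `decode_reptmp_ctrl` is made up for illustration, and the 49.0° adjustment is included only to mirror the RangeUnadjusted rule quoted above.

```python
def decode_reptmp_ctrl(reg):
    """Split a D18F3xA4 value into the fields discussed in this comment."""
    cur_tmp = (reg >> 21) & 0x7FF        # CurTmp, 0.125 degree steps
    tj_sel = (reg >> 16) & 0x3           # CurTmpTjSel
    range_unadjusted = tj_sel != 0b11    # RangeUnadjusted as quoted above
    temp_c = cur_tmp * 0.125
    if not range_unadjusted:
        temp_c -= 49.0                   # adjusted range (assumption from the BKDG text)
    return {
        "CurTmp": cur_tmp,
        "CurTmpTjSel": tj_sel,
        "RangeUnadjusted": range_unadjusted,
        "temp_c": temp_c,
        "PerStepTimeDn": (reg >> 8) & 0x1F,
        "TmpSlewDnEn": (reg >> 7) & 0x1,
        "TmpMaxDiffUp": (reg >> 5) & 0x3,
        "PerStepTimeUp": reg & 0x1F,
    }


# Value reported in comment 4 for the A8-7600.
fields = decode_reptmp_ctrl(0x18400FEF)
for name, value in fields.items():
    print(f"{name}: {value}")
```

With 0x18400fef this gives CurTmp 194 and a nominal 24.25, matching the hand decode.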

Maybe something is broken in 11.2, I don't know.  Out of curiosity, can you provide "AMDTEMP_CPUID":

  pciconf -r pci0:0:24:3 0xfc

Thanks.
Comment 6 George 2018-12-04 20:06:32 UTC
(In reply to Conrad Meyer from comment #3)
(In reply to Conrad Meyer from comment #5)

Conrad, here is the requested info:
pciconf -r pci0:0:24:3 0xfc
00630f01

I also have this info:
The room where the PC is placed has a temperature of about 18C.

powerd runs on that PC, and when the PC is idle I get this temperature data:
-----------------------------------
dev.amdtemp.0.core0.sensor0: -0,0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: -0,0C
dev.cpu.2.temperature: -0,0C
dev.cpu.1.temperature: -0,0C
dev.cpu.0.temperature: -0,0C

If I kill powerd and the PC is idle:
-----------------------------------
dev.amdtemp.0.core0.sensor0: 7,1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: 7,1C
dev.cpu.2.temperature: 7,1C
dev.cpu.1.temperature: 7,1C
dev.cpu.0.temperature: 7,1C

And without powerd, while running make -j8 buildworld:
-----------------------------------
dev.amdtemp.0.core0.sensor0: 22,6C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: 22,6C
dev.cpu.2.temperature: 22,6C
dev.cpu.1.temperature: 22,6C
dev.cpu.0.temperature: 22,6C

Maybe that info helps :)
Comment 7 Conrad Meyer freebsd_committer 2018-12-05 01:37:00 UTC
(In reply to gosha-necr from comment #6)
Yes, that's very interesting, thanks.
Comment 8 Conrad Meyer freebsd_committer 2018-12-06 23:38:21 UTC
I don't have any good explanation for why the values seem too low.  The 11.2 source tree for amdtemp looks basically the same for that model/family of CPU as in head.
Comment 9 George 2018-12-07 16:03:06 UTC
(In reply to Conrad Meyer from comment #8)
On another PC with a different CPU, in the same room:
AMD FX-8300 with powerd -m 1400 -M 1400 and without any services:
------------------------------------------
last pid: 97542;  load averages:  0.06,  0.13,  0.18    up 12+23:52:32  21:00:28
44 processes:  1 running, 43 sleeping

dev.amdtemp.0.core0.sensor0: 10,1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb4
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.7.temperature: 10,1C
dev.cpu.6.temperature: 10,1C
dev.cpu.5.temperature: 10,1C
dev.cpu.4.temperature: 10,1C
dev.cpu.3.temperature: 10,1C
dev.cpu.2.temperature: 10,1C
dev.cpu.1.temperature: 10,1C
dev.cpu.0.temperature: 10,1C
dev.hwpstate.0.freq_settings: 3200/11137 2800/8835 2300/6268 1900/4641 1400/3060
dev.cpu.0.freq_levels: 3200/11137 2800/8835 2300/6268 1900/4641 1400/3060
dev.cpu.0.freq: 1400

and at the same time, the PC with the AMD A8-7600, with powerd -m 1400 -M 1400 and working as a router:
------------------------------------------
last pid: 46962;  load averages:  0.44,  0.38,  0.28    up 1+20:14:25  21:01:33
107 processes: 1 running, 106 sleeping

dev.amdtemp.0.core0.sensor0: -0,0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb7
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.3.temperature: -0,0C
dev.cpu.2.temperature: -0,0C
dev.cpu.1.temperature: -0,0C
dev.cpu.0.temperature: -0,0C
dev.hwpstate.0.freq_settings: 3100/11750 2800/9570 2400/7380 1900/5775 1400/4162
dev.cpu.0.freq_levels: 3100/11750 2800/9570 2400/7380 1900/5775 1400/4162
dev.cpu.0.freq: 1400


I think these are incorrect values coming from the A8-7600 CPU.
Comment 10 Conrad Meyer freebsd_committer 2018-12-07 16:24:58 UTC
(In reply to gosha-necr from comment #9)
Even 10,1C seems unlikely, given you've said the room is about 18C ambient?  Looks like they're both wrong.
Comment 11 George 2018-12-07 19:17:32 UTC
18C is just my feeling; maybe I'm wrong about the temperature in the room :)

But these two PCs are placed right next to each other, so they are in equal conditions.

For comparison, here is smartctl -a /dev/ada0 | grep -i temp
From A8-7600
----------------
194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (255 254 0 0 0)

From FX-8300
----------------
194 Temperature_Celsius     0x0022   119   101   000    Old_age   Always       -       28

So both HDDs (different models, but both idle now and both 3.5'' 1 TB) have nearly identical temperatures, so in-case conditions seem equal.

Conrad, can I somehow help to diagnose this?
Comment 12 Conrad Meyer freebsd_committer 2018-12-07 19:35:16 UTC
(In reply to gosha-necr from comment #11)
It seems like probably the HDD temperature can be inferred to be a floor.  The CPU is unlikely to be cooler.

As far as diagnosing, I am kind of stuck.  Maybe these CPUs run with Tctl at some offset.  But I don't know what the offset is.  One more PCI read for you:

  pciconf -r pci0:0:24:3 0x64

This is "Hardware Thermal Control" and bits [22:16] tell us the "HtcTmpLimit."

As the BKDG notes, the CurTmp value does not reflect an actual temperature, but rather:

> Tctl is a temperature on its own scale aligned to the processors cooling
> requirements.  Therefore Tctl does not represent a temperature which could be
> measured on the die or the case of the processor. Instead, it specifies the
> processor temperature relative to the maximum operating temperature, Tctl,max.
> Tctl,max is specified in the power and thermal data sheet.

So maybe there is some relationship between Tctl,max and HtcTmpLimit.  Or maybe we can find a power and thermal data sheet for 15h and find the correct offsets for A8-7600 / FX-8300.

I bumped bug version to "CURRENT" because I don't believe there are any significant differences between amdtemp on 15h in 11.2 and on CURRENT.
Comment 13 George 2018-12-07 19:41:26 UTC
(In reply to Conrad Meyer from comment #12)
Conrad, here it is:
pciconf -r pci0:0:24:3 0x64
72240005

Also, tomorrow I'll go to that place and check the ambient temperature, so we know which values to compare against.

Thanks for spending your time on this question :)
Comment 14 George 2018-12-07 19:54:05 UTC
(In reply to Conrad Meyer from comment #12)
Also, just in case, here is the info from the FX-8300:

uname -a
FreeBSD BSD-MAIN 11.2-STABLE FreeBSD 11.2-STABLE #0 r340490: Sat Nov 24 15:50:22 +08 2018     root@BSD-MAIN:/usr/obj/usr/src/sys/BSDSERV  amd64

devinfo -v | grep hostb4
        hostb4 pnpinfo vendor=0x1022 device=0x1603 subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

devinfo -v | grep device=0x1603
        hostb4 pnpinfo vendor=0x1022 device=0x1603 subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

pciconf -r pci0:0:24:3 0xa4
0b600fef

pciconf -r pci0:0:24:3 0xfc
00600f12

pciconf -r pci0:0:24:3 0x64
664c0005

Hope it helps!
Comment 15 Conrad Meyer freebsd_committer 2018-12-07 20:14:52 UTC
(In reply to gosha-necr from comment #13)
> pciconf -r pci0:0:24:3 0x64
> 72240005

HTC_TMP_LMT:

  python3 -c 'print(((0x72240005 >> 16) & 0x7f) * 0.5 + 52)'
  70.0

I.e., 70 "°C" in whatever scale Tctl is on is "max," I guess.  Given that seems low, there is probably an offset on that scale.  I don't know what it is, and the Power and Thermal Document does not seem to be published :(.

(In reply to gosha-necr from comment #14)
> pciconf -r pci0:0:24:3 0xa4
> 0b600fef

CurTmpTjSel:

  python3 -c 'print((0x0b600fef >> 16) & 0x3)'
  0

(I.e., RangeUnadjusted=1)

CurTmp:

  python3 -c 'print(((0x0b600fef >> 21) & 0x7ff) * 0.125)'
  11.375

(°C, nominal)

> pciconf -r pci0:0:24:3 0x64
> 664c0005

HTC_TMP_LMT:

  python3 -c 'print(((0x664c0005 >> 16) & 0x7f) * 0.5 + 52)'
  90.0

So this one is maybe 20° less offset than the other one, although I'm not sure of that — maybe the other one just throttles more aggressively.  It seems like the total offset can't be much more than 5-10° since 100°C is quite hot for a CPU.  But given HDDtemp of 28°C, I don't know.  That'd suggest 17°+ offset, which makes for a throttle at 107°+.  Extremely hot.
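
The arithmetic behind that estimate can be written out as a small sketch. It assumes, as suggested in comment 12, that the 28°C HDD reading is a floor for the real CPU temperature; that is a guess, not a measurement. The helper names are illustrative only.

```python
def htc_tmp_limit(htc_reg):
    """HtcTmpLimit from D18F3x64 bits [22:16]: 0.5 degree steps above 52."""
    return ((htc_reg >> 16) & 0x7F) * 0.5 + 52


def cur_tmp(reptmp_reg):
    """Nominal CurTmp from D18F3xA4 bits [31:21]: 0.125 degree steps."""
    return ((reptmp_reg >> 21) & 0x7FF) * 0.125


# FX-8300 register values from comment 14.
fx_cur = cur_tmp(0x0B600FEF)            # 11.375
fx_limit = htc_tmp_limit(0x664C0005)    # 90.0

# Assumption: the CPU is at least as warm as the 28 C HDD next to it.
hdd_floor = 28.0
implied_offset = hdd_floor - fx_cur             # 16.625
implied_throttle = fx_limit + implied_offset    # 106.625

print(f"implied offset   >= {implied_offset}")
print(f"implied throttle >= {implied_throttle}")
```

That is where the roughly 17° offset and 107° throttle point come from.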
Comment 16 Conrad Meyer freebsd_committer 2018-12-07 20:17:06 UTC
FWIW, cpu-world.com claims both parts have a maximum operating temperature of 70.5-71°C.
Comment 17 Conrad Meyer freebsd_committer 2018-12-07 20:18:51 UTC
The AMD OverDrive Windows utility knows the magic Tctl offsets needed to make the results sensible.
Comment 18 Conrad Meyer freebsd_committer 2018-12-07 20:28:59 UTC
Ok, one more observation.  It seems Fam 15h models 0x60-0x7f relocated the actual sensor to a different PCI device and offset.  Your CPUs are models:

pciconf -r pci0:0:24:3 0xfc
00630f01

  python3 -c 'print("0x%x" % (((0x00630f01 & 0xf0000) >> 12) | ((0x00630f01 & 0xf0) >> 4)))'

=> 0x30

pciconf -r pci0:0:24:3 0xfc
00600f12

=> 0x01

So it doesn't apply.
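
The same check as a reusable sketch, for anyone comparing other parts. It assumes the value at 0xfc follows the usual CPUID family/model/stepping layout (extended family in bits [27:20], extended model in [19:16], base family in [11:8], base model in [7:4], stepping in [3:0]). `decode_cpuid` is a hypothetical helper name, not something from the driver.

```python
def decode_cpuid(value):
    """Return (family, model, stepping) from a CPUID-style signature."""
    base_family = (value >> 8) & 0xF
    ext_family = (value >> 20) & 0xFF
    base_model = (value >> 4) & 0xF
    ext_model = (value >> 16) & 0xF
    stepping = value & 0xF

    family = base_family
    model = base_model
    if base_family == 0xF:
        # Extended fields only apply when the base family is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    return family, model, stepping


# Values read from pci0:0:24:3 offset 0xfc in this bug.
for name, value in [("A8-7600", 0x00630F01), ("FX-8300", 0x00600F12)]:
    family, model, stepping = decode_cpuid(value)
    relocated = family == 0x15 and 0x60 <= model <= 0x7F
    print(f"{name}: family 0x{family:x} model 0x{model:02x} "
          f"stepping {stepping} relocated sensor: {relocated}")
```

Both parts come out as family 0x15 with models 0x30 and 0x01, so neither falls in the 0x60-0x7f range.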
Comment 19 George 2018-12-07 20:32:58 UTC
Conrad, would it be useful if I checked the CPU temperature in the BIOS?
Comment 20 Conrad Meyer freebsd_committer 2018-12-07 20:37:30 UTC
From the googling I've done, I think it's basically a known issue that family 15h underreports or reports nonsensically low values at idle :-/.  It should report closer to reality values under load, i.e., as it approaches 70.0°C (or maybe 90°C for the FX-8300).  So the scale isn't exactly 0.125°C per unit as the BKDG suggests — there is both some offset and some slope to the relationship with °C.  That's unfortunate.  I think this is a Closed:Overcome by events, sorry.
Comment 21 Conrad Meyer freebsd_committer 2018-12-07 20:38:18 UTC
(In reply to gosha-necr from comment #19)
Perhaps, although who knows if the BIOS is idle or spinning the CPU at 100%?  Or if it calculates it any more accurately.  Worth a shot.  If you have Windows dual-boot, try AMD Overdrive too.
Comment 22 George 2018-12-07 20:45:03 UTC
(In reply to Conrad Meyer from comment #21)
No, I don't have Windows on those PCs.
Thanks for getting so deeply involved in this problem, Conrad :)

Of course it would be good if FreeBSD worked ideally in all cases, including correct interpretation of the various sensors, but there are not enough human resources... Thanks one more time!
Comment 23 Alexey Dokuchaev freebsd_committer 2019-12-20 08:09:23 UTC
Unfortunately, our amdtemp(4) driver is doing the best it can according to the spec.  Starting with the Phenoms, AMD's digital sensor no longer reports an absolute temperature value, but a reading with a certain offset, which isn't really known; it might not even be constant per CPU type.

I believe that some proprietary tools employ certain tricks or use undocumented pieces of knowledge to make up for this, but Open Hardware Monitor, for example, uses the same formula as FreeBSD, ((Tctl >> 21) & 0x7FF) / 8.0f, and additionally lets the user specify a configurable "offset", which is zero by default.
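
As a rough illustration, the formula quoted above can be applied to the 0xa4 values posted in this bug. This is only a sketch: `tctl_celsius` and its `sensor_offset` parameter are hypothetical names, and the 17° correction in the last line is just the rough guess from comment 15, not a known offset.

```python
def tctl_celsius(reptmp_ctrl, sensor_offset=0.0):
    """Nominal Tctl in degrees C plus an optional user-supplied offset."""
    # CurTmp is bits [31:21] of the register, 1/8 degree per step.
    return ((reptmp_ctrl >> 21) & 0x7FF) / 8.0 + sensor_offset


# 0xa4 register values reported in this bug.
readings = {
    "A8-7600": 0x18400FEF,
    "FX-8300": 0x0B600FEF,
    "Athlon II X4 620": 0x000C1880,
}

for cpu, reg in readings.items():
    print(cpu, tctl_celsius(reg))
# A8-7600 24.25
# FX-8300 11.375
# Athlon II X4 620 0.0

# Hypothetical correction using the guessed offset from comment 15.
print(tctl_celsius(0x0B600FEF, sensor_offset=17.0))  # 28.375
```

With a zero offset, the raw register contents alone give these low idle values, so the numbers come from the hardware rather than from a decoding mistake.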

That is, Tctl is a non-physical temperature on an arbitrary scale (confusingly) measured in degrees Celsius with a resolution of 1/8th degree.  AMD designed this equation to accurately read load temperatures (45°C+).  It has an equational offset to determine them which equalizes at 45°C.  Since it's designed for peak values and is a non-physical temperature it cannot read idle temperatures or account for ambient temperature correctly.

That's why popular tools like HWiNFO64, HWMonitor, or AIDA64 usually report two values: one for the socket and another for the core temperature.  The socket value is what you should look at if you want an idea of idle temperature and the core one for the CPU temperature under load.

I understand that it's somewhat frustrating to see BIOS and AMD Overdrive reporting seemingly sane temperatures across the entire spectrum, but it is most likely a cumulative reading from a number of different sensors, most of which are out of scope of amdtemp(4) or even undocumented at all.
Comment 24 James 2020-03-23 21:10:34 UTC
Not sure if it's a solution to the OP's case, but I just ran into this issue with my AMD Phenom. Some digging on the web revealed that the core unlock feature in some BIOSes could cause the temperatures to not be properly reported. Disabling Core Unlock, in my case, corrected the issue. HTH.
Comment 25 Vincent Bentley 2021-07-30 13:08:51 UTC
I think this issue affects more AMD CPUs. I repurposed an old PC today that was running Linux with lm-sensors reporting 45 DegC cpu temp at idle yesterday.

After installing FreeBSD 12.2, amdtemp is reporting -0.0C. It is summer and the ambient room temperature at the moment is an uncomfortable 27 DegC.

The CPU is Family 0x10, a 2.6 GHz Athlon II quad core.

I read through other bug reports and concluded that temperature measurement on AMD is proprietary, requiring knowledge of too many undocumented features. I am not expecting a fix; I will just assume that CPU temps are Intel-only on FreeBSD.

I don't have time to see how lm-sensors does it, so in case anyone is interested in my system info, here it is.

Base Board Information
        Manufacturer: Gigabyte Technology Co., Ltd.
        Product Name: GA-MA785GMT-UD2H
https://www.gigabyte.com/Motherboard/GA-MA785GMT-UD2H-rev-10#ov

Without opening it up, I am not sure what revision the board is. So assuming 1.0
No core unlock feature on this board!

# sysctl -a | grep temperature
dev.cpu.3.temperature: -0.0C
dev.cpu.2.temperature: -0.0C
dev.cpu.1.temperature: -0.0C
dev.cpu.0.temperature: -0.0C

# sysctl -a | grep amdtemp
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb4
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 

# devinfo -v |grep hostb4
        hostb4 pnpinfo vendor=0x1022 device=0x1203 subvendor=0x0000 subdevice=0x0000 class=0x060000 at slot=24 function=3 dbsf=pci0:0:24:3

# pciconf -l | grep 0x1203
hostb4@pci0:0:24:3:	class=0x060000 card=0x00000000 chip=0x12031022 rev=0x00 hdr=0x00

# pciconf -r pci0:0:24:3 0xa4
000c1880

# pciconf -r pci0:0:24:3 0xfc
00100f52 

# pciconf -r pci0:0:24:3 0x64
34280005 


From dmesg...

FreeBSD 12.2-RELEASE-p7 GENERIC amd64
FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2)
VT(vga): resolution 640x480
CPU: AMD Athlon(tm) II X4 620 Processor (2611.85-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x100f52  Family=0x10  Model=0x5  Stepping=2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x37ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT>
  SVM: NP,NRIP,NAsids=64
  TSC: P-state invariant
real memory  = 8589934592 (8192 MB)
avail memory = 7737597952 (7379 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <GBT    GBTUACPI>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)