Bug 234657 - AMD Opteron X3000 series CPU temperature sensor support
Summary: AMD Opteron X3000 series CPU temperature sensor support
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Conrad Meyer
Keywords: feature
 
Reported: 2019-01-06 09:37 UTC by D.C.
Modified: 2019-01-22 21:35 UTC

Description D.C. 2019-01-06 09:37:06 UTC
I am using FreeNAS 11.2 and as it turns out, I cannot read out the CPU temperature for my AMD Opteron X3216 CPU in my HPE Microserver Gen10. I filed this at the FreeNAS Redmine: https://redmine.ixsystems.com/issues/64077

I have the following tunables:

amdtemp_load="YES" # for reading AMD CPU temperatures
hint.acpi_throttle.0.disabled="YES" # cool'n'quiet
hw.pci.realloc_bars="1" # needed for screen output on Microserver gen10

Running sysctl -a | grep -i temp does not yield any temperatures. I also tried grepping for other terms such as cpu, but none of them gave me the CPU temperatures.

I submitted FreeNAS debug output to the FreeNAS developers. They replied: "As I can see, there are several variations of that CPU generation, and while FreeBSD supports temperature reporting for one of them, two others are different. As I understand, amdtemp driver would have to attach to device "hostb7 pnpinfo vendor=0x1022 device=0x1573", but that ID is not in the code." They also suggested contacting Conrad Meyer about this.

I contacted Conrad, who asked me for the family/model numbers of my CPU. The relevant dmesg output:

CPU: AMD Opteron(tm) X3216 APU                       (1597.04-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x660f01  Family=0x15  Model=0x60  Stepping=1
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x2febbfff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,XOP,SKINIT,WDT,LWP,FMA4,TCE,NodeId,TBM,Topology,PCXC,PNXC,<b25>,DBE,PTSC,MWAITX>
  Structured Extended Features=0x1a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2>
  XSAVE Features=0x1<XSAVEOPT>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics
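
For reference, the Family, Model and Stepping values above follow from the Id word via the usual CPUID signature encoding as AMD applies it. A minimal Python sketch of that decoding is shown here; the function name decode_cpuid_signature is made up for illustration and is not part of FreeBSD or of this bug's patch.

```python
def decode_cpuid_signature(sig):
    """Split a CPUID Fn0000_0001 EAX signature into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    family = base_family
    model = base_model
    if base_family == 0xF:
        # AMD rule: the extended fields only apply when the base family is 0xF.
        family += ext_family
        model |= ext_model << 4
    return family, model, stepping


# Id value taken from the dmesg output in this report.
family, model, stepping = decode_cpuid_signature(0x660F01)
print(f"Family={family:#x} Model={model:#x} Stepping={stepping}")
# prints "Family=0x15 Model=0x60 Stepping=1"
```

This is why Id=0x660f01 is reported as Family 15h, Model 60h: the base family nibble is 0xF, so the extended family (0x06) and extended model (0x6) fields are folded in.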

Conrad also asked me how the FreeNAS dev arrived at this device information. I do not know, but I think they could explain this and are better qualified to do so. I also don't mind (re)generating any diagnostic information if need be.

I am willing to test patches for this issue on my hardware.
Comment 1 Conrad Meyer freebsd_committer 2019-01-06 19:18:14 UTC
I think I understand where the device=0x1573 guess is coming from.  Family 15h Model 60h BKDG documents 0x1573 as the deviceid for PCI D18F3.  On the earlier Model *00h*, which we currently support, the PCI D18F3 device register 0xa4 is the Reported Temperature.

However, on Model *60h*, Reported Temperature lives in a different register, "D0F0xBC_xD820_0CA4" (page 238 of the relevant BKDG below).  So I suspect 0x1573 is wrong.

https://www.amd.com/system/files/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

D0F0 (PCI Root complex) on 60h has the vendor:device id 0x1022:0x1576.

It seems like Family 15h Model 60h also requires indirect SMU / SMN access (p. 233), like on Family 17h:

> D0F0xB8 SMU Index Address
> The index/data pair registers, D0F0xB8 and D0F0xBC, are used to access the registers at
> D0F0xBC_x[FFFF_FFFF:0000_0000]. To access any of these registers, the address is first written into the
> index register, D0F0xB8, and then the data is read from or written to the data register, D0F0xBC.

0xB8 is the 32-bit register NbSmuIndAddr.
0xBC is the 32-bit register NbSmuIndData.

So we must indirectly access the index "0xD820_0CA4" to read the temperature on 15h, model 60h.
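
The index/data access pattern quoted above can be sketched as follows. This is only a model of the sequence in Python, not driver code: FakeRootComplex stands in for the D0F0 PCI config space, and the names write_config, read_config and smu_read are hypothetical. The register offsets and the 0xD820_0CA4 index are the ones cited from the BKDG in this comment; the value stored in the fake register is an arbitrary placeholder.

```python
SMU_INDEX_ADDR = 0xB8        # D0F0xB8, NbSmuIndAddr
SMU_INDEX_DATA = 0xBC        # D0F0xBC, NbSmuIndData
REPORTED_TEMP = 0xD8200CA4   # D0F0xBC_xD820_0CA4


class FakeRootComplex:
    """Stand-in for the D0F0 config space; this does NOT touch real hardware."""

    def __init__(self, smu_registers):
        self._smu = dict(smu_registers)
        self._index = 0

    def write_config(self, offset, value):
        if offset == SMU_INDEX_ADDR:
            # Writing the index register selects the SMU register.
            self._index = value & 0xFFFFFFFF
        elif offset == SMU_INDEX_DATA:
            self._smu[self._index] = value & 0xFFFFFFFF
        else:
            raise ValueError(f"unmodelled config offset {offset:#x}")

    def read_config(self, offset):
        if offset == SMU_INDEX_ADDR:
            return self._index
        if offset == SMU_INDEX_DATA:
            # Reading the data register returns the selected SMU register.
            return self._smu.get(self._index, 0)
        raise ValueError(f"unmodelled config offset {offset:#x}")


def smu_read(dev, reg):
    """Indirect read: write the address to 0xB8, then read the data from 0xBC."""
    dev.write_config(SMU_INDEX_ADDR, reg)
    return dev.read_config(SMU_INDEX_DATA)


dev = FakeRootComplex({REPORTED_TEMP: 0x12345678})  # placeholder contents
print(hex(smu_read(dev, REPORTED_TEMP)))
```

Because the read is a two-step sequence through shared registers, a real driver would presumably also need to serialize it so that concurrent accesses cannot interleave between the index write and the data read.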

I'm going to do a quick spot check of other 15h BKDGs to see what other models specify.
Comment 2 Conrad Meyer freebsd_committer 2019-01-06 19:42:32 UTC
Fam15h M10h: D18F3xA4, like M00h.  D18F3 has deviceid 0x1022:0x1403.
(There is no M20h model, or it is not documented publicly on amd.com.)
Fam15h M30h: D18F3xA4, like M00h.  D18F3 has deviceid 0x1022:0x141d.
Fam15h M70h: D0F0xBC_xD820_0CA4, like M60h.  D0F0 has deviceid 0x1022:0x1576.

It looks like we support Fam15h M10h by the name "MISC12" and M30h by the name "MISC17".

"MISC12" was a mistake — it was added for the AMD A10-5700 APU, which is Family 15h M10h.  Ditto "MISC17", which was added for an AMD Kaveri APU (Family 15h; presumably model 30h).  I'll go ahead and fix those names while here.

M70h defines 0xB8 and 0xBC identically to M60h, as far as I can tell.  Obviously, we don't support M70h yet for the same reason as M60h.  Work is in progress.
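
The survey in comments 1 and 2 can be summarized as a lookup table. The Python sketch below only restates what is listed in this thread; the names TempAccess, FAM15H_ACCESS and fam15h_temp_access are invented for illustration and do not correspond to identifiers in amdtemp(4). Two assumptions are made: models are grouped in blocks of 16 (as in the "60h-6Fh" BKDG title), and the device ID for Model 00h is left as None because this thread does not state it.

```python
from typing import NamedTuple, Optional


class TempAccess(NamedTuple):
    pci_function: str          # PCI device/function holding the register
    device_id: Optional[int]   # None where the thread does not state it
    register: str              # BKDG name of the Reported Temperature register
    indirect: bool             # True if read via the SMU index/data pair


# Family 15h only, keyed by the first model of each 16-model block.
FAM15H_ACCESS = {
    0x00: TempAccess("D18F3", None, "D18F3xA4", False),
    0x10: TempAccess("D18F3", 0x1403, "D18F3xA4", False),
    0x30: TempAccess("D18F3", 0x141D, "D18F3xA4", False),
    0x60: TempAccess("D0F0", 0x1576, "D0F0xBC_xD820_0CA4", True),
    0x70: TempAccess("D0F0", 0x1576, "D0F0xBC_xD820_0CA4", True),
}


def fam15h_temp_access(model):
    """Return the access description for a Family 15h model, or None if unknown."""
    # Assumption: every model in a block (e.g. 60h-6Fh) behaves like the first.
    return FAM15H_ACCESS.get(model & 0xF0)


for m in (0x10, 0x20, 0x60):
    print(f"Model {m:#04x}: {fam15h_temp_access(m)}")
```

Model 20h returns None here, matching the note above that no public documentation for it was found.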
Comment 3 Conrad Meyer freebsd_committer 2019-01-06 23:10:42 UTC
(In reply to D.C. from comment #0)
If possible, please test the following patch against CURRENT:

  https://people.freebsd.org/~cem/pr234657.patch
Comment 4 Conrad Meyer freebsd_committer 2019-01-08 21:56:34 UTC
If I don't hear back in a few days, I'll go ahead and commit this, but it'd be good to get confirmation it works before I do :-).
Comment 5 D.C. 2019-01-09 08:14:49 UTC
(In reply to Conrad Meyer from comment #4)
I have a pretty busy schedule, but I'll see if I can squeeze it in somewhere this weekend. Creating a bootable USB drive in VirtualBox seems to work, so I should be able to create a low-impact test bench for testing the patch.
Comment 6 Conrad Meyer freebsd_committer 2019-01-09 20:48:25 UTC
(In reply to D.C. from comment #5)
Awesome, I appreciate it!
Comment 7 D.C. 2019-01-11 15:50:26 UTC
(In reply to Conrad Meyer from comment #6)

I managed to create a bootable USB stick with CURRENT. However, apparently I need to check out the entire FreeBSD source tree to build the kernel. Not a problem per se, but svn update /usr/src seems to be going very slowly and I need to re-issue this command every couple of minutes due to truncated HTTP responses. No fun, and also kind of unexpected.

I don't have a CD drive in my server (I used VirtualBox and the installer disc image to install onto the stick), so I guess my best bet is to reinstall my stick and use the installer to install the source? I take it that any version of CURRENT from the last couple of days will do for testing the patch?
Comment 8 Conrad Meyer freebsd_committer 2019-01-11 19:52:55 UTC
(In reply to D.C. from comment #7)
If SVN is slow/unreliable, it might be possible to fetch a sources tarball with a flaky-network aware downloader like aria2c?  (or even wget -c.)  There appears to be a src tarball from today published here:

https://download.freebsd.org/ftp/development/tarballs/src_current.tar.gz

Yes, any source tree from the last few days should be just fine.  The last commits to touch this code were in mid-November.

Thanks again for testing it out.
Comment 9 Conrad Meyer freebsd_committer 2019-01-11 20:13:42 UTC
(In reply to D.C. from comment #7)
Sorry, missed this part in my earlier response and then bugzilla was down for a few minutes:

> so I guess my best bet is to reinstall my stick and use the installer to
> install the source? I take it that any version of CURRENT from the last couple
> of days will do for testing the patch?

Yes -- totally fine, as long as the installer is from late November or more recent.  Thanks!
Comment 10 D.C. 2019-01-12 19:11:15 UTC
(In reply to Conrad Meyer from comment #9)

I just tested the patch. Output from sysctl -a | grep temp:

net.inet6.ip6.use_tempaddr: 0
net.inet6.ip6.temppltime: 86400
net.inet6.ip6.tempvltime: 604800
net.inet6.ip6.prefer_tempaddr: 0
hw.usb.template: -1
dev.amdtemp.0.core0.sensor0: 22.6C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb0
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 
dev.cpu.1.temperature: 22.6C
dev.cpu.0.temperature: 22.6C
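
Since grepping for "temp" also matches unrelated sysctls, the actual readings can be picked out by matching on the trailing "C". The following Python sketch does that for text in the format shown above; parse_temperatures is a hypothetical helper, and the sample is a subset of the output in this comment.

```python
import re

# Matches lines such as "dev.cpu.0.temperature: 22.6C".
_TEMP_LINE = re.compile(r"^(?P<name>\S+):\s*(?P<value>-?\d+(?:\.\d+)?)C\s*$")


def parse_temperatures(sysctl_output):
    """Return {sysctl name: degrees Celsius} for lines that look like temperatures."""
    readings = {}
    for line in sysctl_output.splitlines():
        match = _TEMP_LINE.match(line.strip())
        if match:
            readings[match.group("name")] = float(match.group("value"))
    return readings


sample = """\
net.inet6.ip6.temppltime: 86400
hw.usb.template: -1
dev.amdtemp.0.core0.sensor0: 22.6C
dev.amdtemp.0.sensor_offset: 0
dev.cpu.1.temperature: 22.6C
dev.cpu.0.temperature: 22.6C
"""

for name, celsius in parse_temperatures(sample).items():
    print(f"{name}: {celsius:.1f} C")
```

To use it on a live system, the text could be obtained by running sysctl -a and passing its standard output to the function.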

Before and after testing, the BIOS reported temperatures of around 27 to 28C. However, I did have powerd and cool'n'quiet running (just like during normal operation), so this may account for the temperature difference? That said, the BIOS also reports ambient and DIMM-area temperatures of 27 to 28C.

But all in all the patch seems to be doing what it's supposed to :) .
Comment 11 Conrad Meyer freebsd_committer 2019-01-12 20:05:38 UTC
(In reply to D.C. from comment #10)
Thanks, that's excellent!

> Prior and after testing the bios reported temperatures around 27 to 28C.
> However, I did have powerd and cool'n'quiet running (just like during normal
> operation), so this may account for the temperature difference? 

Maybe that, or maybe the temperature sensor on Family 15h is not actually in units of 0.125°C.  There was another bug related to that, and I found (but cannot recall the links offhand) some discussion on the internet about accuracy as you get further below the rated thermal throttling temperature.
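
To make the unit question concrete, here is a small Python sketch of the conversion under discussion. It assumes the raw reading is an 11-bit count in steps of 0.125°C and models the -49°C range adjustment mentioned in the commit log for this bug as a simple flag; the function name, the field width and the way the condition is passed in are all assumptions for illustration, not taken from amdtemp(4).

```python
def raw_to_celsius(cur_tmp, range_adjusted=False):
    """Convert a raw temperature count to degrees Celsius.

    cur_tmp:        raw count, assumed to be an 11-bit field (0..0x7FF)
    range_adjusted: True if the -49 C range adjustment applies (assumed flag)
    """
    if not 0 <= cur_tmp <= 0x7FF:
        raise ValueError("raw value does not fit in the assumed 11-bit field")
    celsius = cur_tmp * 0.125   # one count = 0.125 C, per the discussion above
    if range_adjusted:
        celsius -= 49.0
    return celsius


print(raw_to_celsius(181))   # 22.625
print(raw_to_celsius(224))   # 28.0
```

Under these assumptions a raw count of 181 corresponds to 22.625°C, close to the 22.6C reading in comment 10, while the 27 to 28C shown by the BIOS would need a count of roughly 216 to 224. The actual raw value was not recorded in this thread, so this only illustrates the size of the gap.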

I guess another interesting calibration test would be to apply various loads (100% on one core, 100% on two cores, etc.) and see whether the temperature tracks that load in a reasonable way.

Either way, I think your testing is sufficient to show the patch functions as intended and can be committed, thank you!
Comment 12 D.C. 2019-01-12 21:29:13 UTC
(In reply to Conrad Meyer from comment #11)

No problem, I'm glad I could help out a fellow developer this way :) .
Comment 13 commit-hook freebsd_committer 2019-01-12 22:37:21 UTC
A commit references this bug:

Author: cem
Date: Sat Jan 12 22:36:33 UTC 2019
New revision: 342977
URL: https://svnweb.freebsd.org/changeset/base/342977

Log:
  amdtemp(4): Add support for Family 15h, Model >=60h

  Family 15h is a bit of an oddball.  Early models used the same temperature
  register and spec (mostly[1]) as earlier CPU families.

  Model 60h-6Fh and 70-7Fh use something more like Family 17h's Service
  Management Network, communicating with it in a similar fashion.  To support
  them, add support for their version of SMU indirection to amdsmn(4) and use
  it in amdtemp(4) on these models.

  While here, clarify some of the deviceid macros in amdtemp(4) that were
  added with arbitrary, incorrect family numbers, and remove ones that were
  not used.  Additionally, clarify intent and condition of heterogenous
  multi-socket system detection.

  [1]: 15h adds the "adjust range by -49°C if a certain condition is met,"
  which previous families did not have.

  Reported by:	D. C. <tjoard AT gmail.com>
  PR:		234657
  Tested by:	D. C. <tjoard AT gmail.com>

Changes:
  head/sys/dev/amdsmn/amdsmn.c
  head/sys/dev/amdtemp/amdtemp.c
Comment 14 commit-hook freebsd_committer 2019-01-22 21:05:09 UTC
A commit references this bug:

Author: mav
Date: Tue Jan 22 21:04:04 UTC 2019
New revision: 343322
URL: https://svnweb.freebsd.org/changeset/base/343322

Log:
  MFC r342977 (by cem): amdtemp(4): Add support for Family 15h, Model >=60h

  Family 15h is a bit of an oddball.  Early models used the same temperature
  register and spec (mostly[1]) as earlier CPU families.

  Model 60h-6Fh and 70-7Fh use something more like Family 17h's Service
  Management Network, communicating with it in a similar fashion.  To support
  them, add support for their version of SMU indirection to amdsmn(4) and use
  it in amdtemp(4) on these models.

  While here, clarify some of the deviceid macros in amdtemp(4) that were
  added with arbitrary, incorrect family numbers, and remove ones that were
  not used.  Additionally, clarify intent and condition of heterogenous
  multi-socket system detection.

  [1]: 15h adds the "adjust range by -49°C if a certain condition is met,"
  which previous families did not have.

  Reported by:    D. C. <tjoard AT gmail.com>
  PR:             234657
  Tested by:      D. C. <tjoard AT gmail.com>

Changes:
_U  stable/12/
  stable/12/sys/dev/amdsmn/amdsmn.c
  stable/12/sys/dev/amdtemp/amdtemp.c
Comment 15 commit-hook freebsd_committer 2019-01-22 21:35:35 UTC
A commit references this bug:

Author: mav
Date: Tue Jan 22 21:35:25 UTC 2019
New revision: 343325
URL: https://svnweb.freebsd.org/changeset/base/343325

Log:
  MFC r342977 (by cem): amdtemp(4): Add support for Family 15h, Model >=60h

  Family 15h is a bit of an oddball.  Early models used the same temperature
  register and spec (mostly[1]) as earlier CPU families.

  Model 60h-6Fh and 70-7Fh use something more like Family 17h's Service
  Management Network, communicating with it in a similar fashion.  To support
  them, add support for their version of SMU indirection to amdsmn(4) and use
  it in amdtemp(4) on these models.

  While here, clarify some of the deviceid macros in amdtemp(4) that were
  added with arbitrary, incorrect family numbers, and remove ones that were
  not used.  Additionally, clarify intent and condition of heterogenous
  multi-socket system detection.

  [1]: 15h adds the "adjust range by -49°C if a certain condition is met,"
  which previous families did not have.

  Reported by:    D. C. <tjoard AT gmail.com>
  PR:             234657
  Tested by:      D. C. <tjoard AT gmail.com>

Changes:
_U  stable/11/
  stable/11/sys/dev/amdsmn/amdsmn.c
  stable/11/sys/dev/amdtemp/amdtemp.c