I am using FreeNAS 11.2 and, as it turns out, I cannot read the CPU temperature of the AMD Opteron X3216 CPU in my HPE MicroServer Gen10. I filed this at the FreeNAS Redmine: https://redmine.ixsystems.com/issues/64077

I have the following tunables:

amdtemp_load="YES" # for reading AMD CPU temperatures
hint.acpi_throttle.0.disabled="YES" # cool'n'quiet
hw.pci.realloc_bars="1" # needed for screen output on Microserver gen10

Running sysctl -a | grep -i temp does not yield any temperatures. I also tried grepping for other terms, such as cpu, none of which gave me the CPU temperatures.

I submitted FreeNAS debug output to the FreeNAS developers. They replied with the following: "As I can see, there are several variations of that CPU generation, and while FreeBSD supports temperature reporting for one of them, two others are different. As I understand, amdtemp driver would have to attach to device "hostb7 pnpinfo vendor=0x1022 device=0x1573", but that ID is not in the code.", along with the suggestion to contact Conrad Meyer about this.

I contacted Conrad, and he asked me what the family/model numbers for my CPU are.
The relevant dmesg output:

CPU: AMD Opteron(tm) X3216 APU (1597.04-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x660f01  Family=0x15  Model=0x60  Stepping=1
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x2febbfff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,XOP,SKINIT,WDT,LWP,FMA4,TCE,NodeId,TBM,Topology,PCXC,PNXC,<b25>,DBE,PTSC,MWAITX>
  Structured Extended Features=0x1a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2>
  XSAVE Features=0x1<XSAVEOPT>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics

Conrad also asked me how the FreeNAS dev arrived at this device information. I do not know, but I think they could explain this themselves (and are better qualified to do so).

Also, I don't mind (re)generating any diagnostic information if need be. I am willing to test patches for this issue on my hardware.
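For readers wondering how Id=0x660f01 relates to Family=0x15 and Model=0x60 in the dmesg output above, here is a minimal sketch of the usual CPUID signature decoding. The function name is made up for illustration, and the bit layout is the standard CPUID Fn0000_0001 EAX layout as I understand it, not something stated in this thread.

```python
def decode_cpuid_signature(sig: int) -> dict:
    """Split a CPUID family/model/stepping signature into its fields.

    Assumed layout: stepping 3:0, base model 7:4, base family 11:8,
    extended model 19:16, extended family 27:20.
    """
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    if base_family == 0xF:
        # Extended fields are folded in only when the base family is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    info = decode_cpuid_signature(0x660F01)
    # prints {'family': '0x15', 'model': '0x60', 'stepping': '0x1'}
    print({name: hex(value) for name, value in info.items()})
```

The result agrees with the Family, Model and Stepping values that dmesg prints next to the Id.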
I think I understand where the device=0x1573 guess is coming from. Family 15h Model 60h BKDG documents 0x1573 as the deviceid for PCI D18F3. On the earlier Model *00h*, which we currently support, the PCI D18F3 device register 0xa4 is the Reported Temperature.

However, on Model *60h*, Reported Temperature lives in a different register, "D0F0xBC_xD820_0CA4" (page 238 of the relevant BKDG below). So I suspect 0x1573 is wrong.

https://www.amd.com/system/files/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

D0F0 (PCI Root complex) on 60h has the vendor:device id 0x1022:0x1576.

It seems like Family 15h Model 60h also requires indirect SMU / SMN access (p. 233), like on Family 17h:

> D0F0xB8 SMU Index Address
> The index/data pair registers, D0F0xB8 and D0F0xBC, are used to access the registers at
> D0F0xBC_x[FFFF_FFFF:0000_0000]. To access any of these registers, the address is first written into the
> index register, D0F0xB8, and then the data is read from or written to the data register, D0F0xBC.

0xB8 is the 32-bit register NbSmuIndAddr.
0xBC is the 32-bit register NbSmuIndData.

So we must indirectly access the index "0xD820_0CA4" to read the temperature on 15h, model 60h.

I'm going to do a quick spot check of other 15h BKDGs to see what other models specify.
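To make the index/data access pattern quoted from the BKDG concrete, here is a small Python model of it. This is only a sketch: FakeRootComplex and smu_read are hypothetical names standing in for the D0F0 PCI config space, and none of this is the driver's actual code. Only the offsets 0xB8 and 0xBC and the index 0xD820_0CA4 come from the comment above.

```python
SMU_INDEX_ADDR = 0xB8  # D0F0xB8, NbSmuIndAddr
SMU_INDEX_DATA = 0xBC  # D0F0xBC, NbSmuIndData
REPORTED_TEMP_INDEX = 0xD8200CA4  # D0F0xBC_xD820_0CA4


class FakeRootComplex:
    """Hypothetical stand-in for the D0F0 config space (illustration only)."""

    def __init__(self, smu_registers):
        self._smu = dict(smu_registers)      # indirect register space
        self._config = {SMU_INDEX_ADDR: 0}   # plain config registers

    def write_config(self, offset, value):
        value &= 0xFFFFFFFF
        if offset == SMU_INDEX_DATA:
            # Writes to the data register land at the currently selected index.
            self._smu[self._config[SMU_INDEX_ADDR]] = value
        else:
            self._config[offset] = value

    def read_config(self, offset):
        if offset == SMU_INDEX_DATA:
            # Reads from the data register return the currently selected index.
            return self._smu.get(self._config[SMU_INDEX_ADDR], 0)
        return self._config.get(offset, 0)


def smu_read(dev, index):
    """Select `index` via the index register, then read the data register."""
    dev.write_config(SMU_INDEX_ADDR, index)
    return dev.read_config(SMU_INDEX_DATA)


if __name__ == "__main__":
    dev = FakeRootComplex({REPORTED_TEMP_INDEX: 0x12345678})  # made-up value
    print(hex(smu_read(dev, REPORTED_TEMP_INDEX)))
```

Because the access takes two steps, a real driver would presumably have to serialize the index write and the data read so that two callers cannot interleave them.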
Fam15h M10h: D18F3xA4, like M00h. D18F3 has deviceid 0x1022:0x1403.

(There is no M20h model, or it is not documented publicly on amd.com.)

Fam15h M30h: D18F3xA4, like M00h. D18F3 has deviceid 0x1022:0x141d.

Fam15h M70h: D0F0xBC_xD820_0CA4, like M60h. D0F0 has deviceid 0x1022:0x1576.

It looks like we support Fam15h M10h by the name "MISC12" and M30h by the name "MISC17". "MISC12" was a mistake: it was added for the AMD A10-5700 APU, which is Family 15h M10h. Ditto "MISC17", which was added for an AMD Kaveri APU (Family 15h; presumably model 30h). I'll go ahead and fix those names while here.

M70h defines 0xB8 and 0xBC identically to M60h, as far as I can tell. Obviously, we don't support M70h yet for the same reason as M60h. Work is in progress.
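The survey above can be summarized as a lookup table. The sketch below is illustrative only: the names TempAccess, FAM15H_TEMP_ACCESS and lookup_temp_access are invented here, and the 16-model ranges (for example 10h-1Fh for "M10h") are my assumption based on how the BKDGs are titled. The device IDs and register locations are the ones quoted in this thread; the M00h device ID is not given here, so it is left as None.

```python
from typing import NamedTuple, Optional


class TempAccess(NamedTuple):
    method: str               # "direct" = D18F3xA4, "smu-indirect" = D0F0xBC_xD820_0CA4
    pci_function: str         # PCI device/function carrying the register
    device_id: Optional[int]  # None where the thread does not state one


# (first model, last model, access description); ranges are assumed.
FAM15H_TEMP_ACCESS = [
    (0x00, 0x0F, TempAccess("direct", "D18F3", None)),
    (0x10, 0x1F, TempAccess("direct", "D18F3", 0x1403)),
    (0x30, 0x3F, TempAccess("direct", "D18F3", 0x141D)),
    (0x60, 0x6F, TempAccess("smu-indirect", "D0F0", 0x1576)),
    (0x70, 0x7F, TempAccess("smu-indirect", "D0F0", 0x1576)),
]


def lookup_temp_access(model: int) -> Optional[TempAccess]:
    """Return how to reach Reported Temperature on a Family 15h model."""
    for first, last, access in FAM15H_TEMP_ACCESS:
        if first <= model <= last:
            return access
    return None  # e.g. M20h, which has no public BKDG


if __name__ == "__main__":
    for model in (0x10, 0x20, 0x60):
        print(hex(model), lookup_temp_access(model))
```

Laid out this way, it is easy to see that the reporter's Model 60h falls in the indirect-access group and so needs a different code path from M00h, M10h and M30h.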
(In reply to D.C. from comment #0) If possible, please test the following patch against CURRENT: https://people.freebsd.org/~cem/pr234657.patch
If I don't hear back in a few days, I'll go ahead and commit this, but it'd be good to get confirmation it works before I do :-).
(In reply to Conrad Meyer from comment #4) I have a pretty busy schedule, but I'll see if I can squeeze it in somewhere this weekend. Creating a bootable USB drive in VirtualBox seems to work, so I should be able to create a low-impact test bench for testing the patch.
(In reply to D.C. from comment #5) Awesome, I appreciate it!
(In reply to Conrad Meyer from comment #6) I managed to create a bootable USB stick with CURRENT. However, apparently I need to check out the entire FreeBSD source tree to build the kernel. Not a problem per se, but svn update /usr/src is going very slowly, and I need to re-issue the command every couple of minutes due to truncated HTTP responses. No fun, and also kind of unexpected. I don't have a CD drive in my server (I used VirtualBox and the installer disc image to install onto the stick), so I guess my best bet is to reinstall my stick and use the installer to install the source? I take it that any version of CURRENT from the last couple of days will do for testing the patch?
(In reply to D.C. from comment #7) If SVN is slow/unreliable, it might be possible to fetch a sources tarball with a flaky-network aware downloader like aria2c? (or even wget -c.) There appears to be a src tarball from today published here: https://download.freebsd.org/ftp/development/tarballs/src_current.tar.gz Yes, any source tree from the last few days should be just fine. The last commits to touch this code were in mid-November. Thanks again for testing it out.
(In reply to D.C. from comment #7) Sorry, missed this part in my earlier response and then bugzilla was down for a few minutes:

> so I guess my best bet is to reinstall my stick and use the installer to
> install the source? I take it that any version of CURRENT from the last couple
> of days will do for testing the patch?

Yes -- totally fine, as long as the installer is from late November or more recent. Thanks!
(In reply to Conrad Meyer from comment #9) I just tested the patch. Output from sysctl -a | grep temp:

net.inet6.ip6.use_tempaddr: 0
net.inet6.ip6.temppltime: 86400
net.inet6.ip6.tempvltime: 604800
net.inet6.ip6.prefer_tempaddr: 0
hw.usb.template: -1
dev.amdtemp.0.core0.sensor0: 22.6C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb0
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:
dev.cpu.1.temperature: 22.6C
dev.cpu.0.temperature: 22.6C

Before and after testing, the BIOS reported temperatures of around 27 to 28C. However, I did have powerd and cool'n'quiet running (just like during normal operation), so this may account for the temperature difference? That said, the BIOS also reports ambient and DIMM area temperatures, which are 27 to 28C as well. All in all, the patch seems to be doing what it's supposed to :) .
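Since grep temp also matches unrelated sysctls such as the IPv6 temporary address settings, a small filter can pick out only the real temperature readings. The following is a minimal sketch that assumes readings always look like "name: 22.6C", as in the output above. The function name parse_temperatures is made up for illustration.

```python
import re

# A temperature line is "<sysctl name>: <number>C", e.g. "dev.cpu.0.temperature: 22.6C".
_TEMP_LINE = re.compile(r"^(?P<name>\S+):\s*(?P<value>-?\d+(?:\.\d+)?)C$")


def parse_temperatures(sysctl_output: str) -> dict:
    """Map sysctl names to Celsius values, skipping non-temperature lines."""
    temps = {}
    for line in sysctl_output.splitlines():
        match = _TEMP_LINE.match(line.strip())
        if match:
            temps[match.group("name")] = float(match.group("value"))
    return temps


if __name__ == "__main__":
    sample = """\
net.inet6.ip6.temppltime: 86400
hw.usb.template: -1
dev.amdtemp.0.core0.sensor0: 22.6C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.cpu.1.temperature: 22.6C
dev.cpu.0.temperature: 22.6C
"""
    for name, celsius in parse_temperatures(sample).items():
        print(f"{name} = {celsius} C")
```

Feeding it the output of sysctl -a (for example through subprocess.run) would give the same three entries on the patched system.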
(In reply to D.C. from comment #10) Thanks, that's excellent!

> Prior and after testing the bios reported temperatures around 27 to 28C.
> However, I did have powerd and cool'n'quiet running (just like during normal
> operation), so this may account for the temperature difference?

Maybe that, or maybe the temperature sensor on Family 15h is not actually in units of 0.125°C. There was another bug related to that, and I found some discussion on the internet (though I cannot recall the links offhand) about accuracy degrading as you get further below the rated thermal throttling temperature.

I guess another interesting calibration test would be to apply various loads (100% on one core, 100% on two cores, etc.) and see whether the temperature tracks that load in a reasonable way.

Either way, I think your testing is sufficient to show the patch functions as intended and can be committed, thank you!
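As a worked example of the unit question raised above, here is a tiny sketch of the arithmetic, assuming the raw sensor count really is in steps of 0.125°C. The function name and the extended_range flag are hypothetical. The flag stands in for the "-49°C if a certain condition is met" adjustment mentioned in the commit log, whose exact condition is not spelled out in this thread.

```python
def raw_to_celsius(raw_count: int, extended_range: bool = False) -> float:
    """Convert a raw sensor count to degrees Celsius.

    Assumes one count equals 0.125 degrees C. `extended_range` models the
    -49 degree range adjustment; when it applies is not covered here.
    """
    celsius = raw_count * 0.125
    if extended_range:
        celsius -= 49.0
    return celsius


if __name__ == "__main__":
    print(raw_to_celsius(181))                       # 22.625
    print(raw_to_celsius(181, extended_range=True))  # -26.375
```

If the true step size differed from 0.125°C, every reading would be scaled by the same factor, which is why watching how the value responds to load is a useful sanity check.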
(In reply to Conrad Meyer from comment #11) No problem, I'm glad I could help out a fellow developer this way :) .
A commit references this bug:

Author: cem
Date: Sat Jan 12 22:36:33 UTC 2019
New revision: 342977
URL: https://svnweb.freebsd.org/changeset/base/342977

Log:
  amdtemp(4): Add support for Family 15h, Model >=60h

  Family 15h is a bit of an oddball. Early models used the same temperature register and spec (mostly[1]) as earlier CPU families. Model 60h-6Fh and 70-7Fh use something more like Family 17h's Service Management Network, communicating with it in a similar fashion. To support them, add support for their version of SMU indirection to amdsmn(4) and use it in amdtemp(4) on these models.

  While here, clarify some of the deviceid macros in amdtemp(4) that were added with arbitrary, incorrect family numbers, and remove ones that were not used. Additionally, clarify intent and condition of heterogenous multi-socket system detection.

  [1]: 15h adds the "adjust range by -49°C if a certain condition is met," which previous families did not have.

  Reported by: D. C. <tjoard AT gmail.com>
  PR: 234657
  Tested by: D. C. <tjoard AT gmail.com>

Changes:
  head/sys/dev/amdsmn/amdsmn.c
  head/sys/dev/amdtemp/amdtemp.c
A commit references this bug:

Author: mav
Date: Tue Jan 22 21:04:04 UTC 2019
New revision: 343322
URL: https://svnweb.freebsd.org/changeset/base/343322

Log:
  MFC r342977 (by cem): amdtemp(4): Add support for Family 15h, Model >=60h

  Family 15h is a bit of an oddball. Early models used the same temperature register and spec (mostly[1]) as earlier CPU families. Model 60h-6Fh and 70-7Fh use something more like Family 17h's Service Management Network, communicating with it in a similar fashion. To support them, add support for their version of SMU indirection to amdsmn(4) and use it in amdtemp(4) on these models.

  While here, clarify some of the deviceid macros in amdtemp(4) that were added with arbitrary, incorrect family numbers, and remove ones that were not used. Additionally, clarify intent and condition of heterogenous multi-socket system detection.

  [1]: 15h adds the "adjust range by -49°C if a certain condition is met," which previous families did not have.

  Reported by: D. C. <tjoard AT gmail.com>
  PR: 234657
  Tested by: D. C. <tjoard AT gmail.com>

Changes:
  _U stable/12/
  stable/12/sys/dev/amdsmn/amdsmn.c
  stable/12/sys/dev/amdtemp/amdtemp.c
A commit references this bug:

Author: mav
Date: Tue Jan 22 21:35:25 UTC 2019
New revision: 343325
URL: https://svnweb.freebsd.org/changeset/base/343325

Log:
  MFC r342977 (by cem): amdtemp(4): Add support for Family 15h, Model >=60h

  Family 15h is a bit of an oddball. Early models used the same temperature register and spec (mostly[1]) as earlier CPU families. Model 60h-6Fh and 70-7Fh use something more like Family 17h's Service Management Network, communicating with it in a similar fashion. To support them, add support for their version of SMU indirection to amdsmn(4) and use it in amdtemp(4) on these models.

  While here, clarify some of the deviceid macros in amdtemp(4) that were added with arbitrary, incorrect family numbers, and remove ones that were not used. Additionally, clarify intent and condition of heterogenous multi-socket system detection.

  [1]: 15h adds the "adjust range by -49°C if a certain condition is met," which previous families did not have.

  Reported by: D. C. <tjoard AT gmail.com>
  PR: 234657
  Tested by: D. C. <tjoard AT gmail.com>

Changes:
  _U stable/11/
  stable/11/sys/dev/amdsmn/amdsmn.c
  stable/11/sys/dev/amdtemp/amdtemp.c