amdtemp does not seem to be able to report CPU temperatures for the AMD Ryzen Threadripper 7960X CPU as of stable/14 b556c37f83b03432af6dd9af1a4e143fc8b2e100.

# dmidecode -t processor
# dmidecode 3.5
# SMBIOS entry point at 0x6e2a8000
Found SMBIOS entry point in EFI, reading table from /dev/mem.
SMBIOS 3.6 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.

Handle 0x0010, DMI type 4, 48 bytes
Processor Information
	Socket Designation: SP6
	Type: Central Processor
	Family: Zen
	Manufacturer: Advanced Micro Devices, Inc.
	ID: 81 0F A1 00 FF FB 8B 17
	Signature: Family 25, Model 24, Stepping 1
	Flags:
		FPU (Floating-point unit on-chip)
		VME (Virtual mode extension)
		DE (Debugging extension)
		PSE (Page size extension)
		TSC (Time stamp counter)
		MSR (Model specific registers)
		PAE (Physical address extension)
		MCE (Machine check exception)
		CX8 (CMPXCHG8 instruction supported)
		APIC (On-chip APIC hardware supported)
		SEP (Fast system call)
		MTRR (Memory type range registers)
		PGE (Page global enable)
		MCA (Machine check architecture)
		CMOV (Conditional move instruction supported)
		PAT (Page attribute table)
		PSE-36 (36-bit page size extension)
		CLFSH (CLFLUSH instruction supported)
		MMX (MMX technology supported)
		FXSR (FXSAVE and FXSTOR instructions supported)
		SSE (Streaming SIMD extensions)
		SSE2 (Streaming SIMD extensions 2)
		HTT (Multi-threading)
	Version: AMD Ryzen Threadripper 7960X 24-Cores
	Voltage: 1.2 V
	External Clock: 100 MHz
	Max Speed: 5650 MHz
	Current Speed: 4200 MHz
	Status: Populated, Enabled
	Upgrade: <OUT OF SPEC>
	L1 Cache Handle: 0x000D
	L2 Cache Handle: 0x000E
	L3 Cache Handle: 0x000F
	Serial Number: Unknown
	Asset Tag: Unknown
	Part Number: Unknown
	Core Count: 24
	Core Enabled: 24
	Thread Count: 48
	Characteristics:
		64-bit capable
		Multi-Core
		Hardware Thread
		Execute Protection
		Enhanced Virtualization
		Power/Performance Control
Could you please provide dmesg (/var/run/dmesg.boot is likely to be fine) and output from pciconf -lv?
Created attachment 249915 [details] pciconf
Created attachment 249916 [details] dmesg.boot
Created attachment 249918 [details]
Add support for F19 M10 (0x10..0x1f)

Could you please try this patch?
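For context on why a patch for Family 19h Models 10h-1Fh is the relevant one here: the dmidecode output in the original report lists the processor ID bytes `81 0F A1 00` and a decoded signature of Family 25, Model 24, Stepping 1. The sketch below reproduces that decoding with the usual CPUID family/model rules. The function name `decode_signature` is made up for this illustration and is not part of any driver or tool.

```python
def decode_signature(id_bytes):
    """Decode CPUID leaf 1 EAX from the first four ID bytes shown by dmidecode."""
    # dmidecode prints the register bytes in little-endian order.
    eax = int.from_bytes(bytes(id_bytes), "little")
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    family = base_family
    model = base_model
    # The extended family is only added when the base family is 0xF.
    if base_family == 0xF:
        family += ext_family
    # The extended model supplies the high nibble for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model |= ext_model << 4
    return family, model, stepping


family, model, stepping = decode_signature([0x81, 0x0F, 0xA1, 0x00])
print(f"Family {family:#x}, Model {model:#x}, Stepping {stepping}")
# Family 0x19, Model 0x18, Stepping 1
```

Model 0x18 falls inside the 0x10..0x1f range named in the patch title.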
Does not seem to work yet. Still not getting any temperatures (via sysctl or sysutils/hwstat).

Looking through /var/log/messages I see:

7306 Apr 12 00:37:15 hedt1 kernel: amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24
7307 Apr 12 00:37:15 hedt1 kernel: device_attach: amdtemp0 attach returned 6
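The value in "attach returned 6" is an errno code; on FreeBSD, 6 is ENXIO ("Device not configured"). A quick way to look up the symbolic name, here using Python's standard `errno` module rather than anything specific to the driver:

```python
import errno
import os

code = 6  # value from "amdtemp0 attach returned 6"

# Symbolic name of the error code.
print(errno.errorcode[code])

# Human-readable message; the exact wording depends on the platform.
print(os.strerror(code))
```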
I see this in your patch:
> DEVICEID_AMD_HOSTB19H_M10H_ROOT

and this in /var/log/messages:
> amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24

Not knowing anything about the internal workings of amdtemp, I'd say this looks fishy.
Created attachment 249919 [details]
Revised patch to also add support in amdsmn(4)

Ah, I forgot to add the support into amdsmn(4), which amdtemp(4) depends on. Could you revert that one, apply this one, and let me know if it helps?
After applying your latest patch I am getting corresponding entries as dev.cpu.*.temperature. However, all of them report -0.0.

Nothing shows up in dmesg.boot that would indicate anything helpful.

hedt1% sysctl -a | grep amdtemp
kern.version: FreeBSD 14.0-STABLE #2 feature/amdtemp-n267180-b556c37f83b0-dirty: Fri Apr 12 17:28:54 UTC 2024
FreeBSD 14.0-STABLE #2 feature/amdtemp-n267180-b556c37f83b0-dirty: Fri Apr 12 17:28:54 UTC 2024
amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb24
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:
(In reply to Joel Bodenmann from comment #8)

Can you share the dmesg output (ideally with only sensitive information redacted) and the `sysctl dev.amdtemp` output?
You have me slightly concerned now... could you elaborate what kind of information would show up in dmesg that would be considered sensitive? I guess one could argue about MAC addresses if being extra paranoid - is there anything else I'm unaware of?
(In reply to Joel Bodenmann from comment #10)
> could you elaborate what kind of information would show up in dmesg that would be considered sensitive?

Well, I'm just reminding you to filter out what _you_ would consider sensitive: the kernel message buffer (`dmesg`) can contain a lot of things, and Bugzilla data is generally accessible to anyone.

Examples of "sensitive data" include the build host name, Ethernet addresses, hard drive serial numbers, mounted file systems, etc. Some people feel uncomfortable sharing this information with the Internet, which is totally understandable and should always be respected, and this kind of PII is usually not useful for debugging anyway.

On the other hand, information such as which other drivers were loaded, or what drivers printed without identifying themselves on the same line (especially in a verbose boot), is very important for debugging, so I generally attach a full dmesg output with only minimal redaction when reporting driver issues.
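As a rough illustration of the "minimal redaction" idea, the sketch below masks Ethernet (MAC) addresses and a caller-supplied list of host names while leaving everything else, including driver attach messages, untouched. The function name `redact_dmesg`, the placeholder strings, and the sample lines are all made up for this example; serial numbers and mount points have no single recognisable format, so they would still need a manual pass.

```python
import re

# Six colon-separated hex octets, e.g. 00:11:22:33:44:55.
MAC_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b")


def redact_dmesg(text, hostnames=()):
    """Mask MAC addresses and the given host names in dmesg-style text."""
    text = MAC_RE.sub("xx:xx:xx:xx:xx:xx", text)
    for name in hostnames:
        # Whole-word match so that substrings of other words are left alone.
        text = re.sub(r"\b" + re.escape(name) + r"\b", "<host>", text)
    return text


# Made-up sample lines, not taken from the attached dmesg.
sample = (
    "em0: Ethernet address: 00:11:22:33:44:55\n"
    "root@hedt1:/usr/obj/usr/src\n"
    "amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24\n"
)
print(redact_dmesg(sample, hostnames=["hedt1"]))
```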
Created attachment 249938 [details] dmesg
Thank you for the explanation. That is pretty much what I expected/understood, but I wasn't sure whether you were hinting at something I'm unaware of.

dmesg (attached)

hedt1% sysctl dev.amdtemp
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb24
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:
Created attachment 249959 [details]
Corrected Root Complex PCI ID (0x14b0 -> 0x14a4)

Hi, could you please try this patch? (Basically, it replaces the 0x14b0 PCI ID, which belongs to function 3, with 0x14a4, the root complex.)
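To check which devices on a given machine carry the root complex ID from this patch, the `pciconf -l` output can be filtered by vendor and device ID. The sketch below assumes the `vendor=0x.... device=0x....` field layout printed by recent FreeBSD releases; the helper name `matching_devices` and the sample lines are made up for illustration and are not taken from the attached pciconf output. 0x1022 is AMD's PCI vendor ID.

```python
import re

AMD_VENDOR_ID = 0x1022
ROOT_COMPLEX_DEVICE_ID = 0x14A4  # ID used by the corrected patch

# Assumed line layout: "<name>@<selector> ... vendor=0x1022 device=0x14a4 ..."
LINE_RE = re.compile(
    r"^(\w+)@\S+\s.*\bvendor=(0x[0-9a-fA-F]+)\s+device=(0x[0-9a-fA-F]+)"
)


def matching_devices(pciconf_output, vendor, device):
    """Return the device names whose vendor/device IDs match."""
    names = []
    for line in pciconf_output.splitlines():
        m = LINE_RE.match(line)
        if m and int(m.group(2), 16) == vendor and int(m.group(3), 16) == device:
            names.append(m.group(1))
    return names


# Hypothetical sample lines for demonstration only.
sample = """\
hostb0@pci0:0:0:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14a4
hostb1@pci0:0:1:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14b0
"""
print(matching_devices(sample, AMD_VENDOR_ID, ROOT_COMPLEX_DEVICE_ID))
# ['hostb0']
```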
I'm getting temperature readouts now. I can't easily check whether they are accurate, but the values appear to be within reasonable expectations (34C idle, 91C during `stress -c 48` on a water-cooled system).

Thank you for handling this so quickly. May we expect a commit and MFC to stable soon? :)
(In reply to Joel Bodenmann from comment #15)

Glad to hear! Just wanted to double check -- is the driver fully functional now? E.g. what do `sysctl dev.amdtemp` and `sysctl dev.cpu | grep temp` show now?
hedt1% sysctl dev.amdtemp
dev.amdtemp.3.ccd3: 32.6C
dev.amdtemp.3.ccd2: 30.6C
dev.amdtemp.3.ccd1: 31.6C
dev.amdtemp.3.ccd0: 32.0C
dev.amdtemp.3.core0.sensor0: 44.1C
dev.amdtemp.3.sensor_offset: 0
dev.amdtemp.3.%parent: hostb29
dev.amdtemp.3.%pnpinfo:
dev.amdtemp.3.%location:
dev.amdtemp.3.%driver: amdtemp
dev.amdtemp.3.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.2.ccd3: 32.6C
dev.amdtemp.2.ccd2: 30.6C
dev.amdtemp.2.ccd1: 31.6C
dev.amdtemp.2.ccd0: 32.0C
dev.amdtemp.2.core0.sensor0: 44.1C
dev.amdtemp.2.sensor_offset: 0
dev.amdtemp.2.%parent: hostb14
dev.amdtemp.2.%pnpinfo:
dev.amdtemp.2.%location:
dev.amdtemp.2.%driver: amdtemp
dev.amdtemp.2.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.1.ccd3: 32.6C
dev.amdtemp.1.ccd2: 30.6C
dev.amdtemp.1.ccd1: 31.6C
dev.amdtemp.1.ccd0: 32.0C
dev.amdtemp.1.core0.sensor0: 44.1C
dev.amdtemp.1.sensor_offset: 0
dev.amdtemp.1.%parent: hostb7
dev.amdtemp.1.%pnpinfo:
dev.amdtemp.1.%location:
dev.amdtemp.1.%driver: amdtemp
dev.amdtemp.1.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.0.ccd3: 32.6C
dev.amdtemp.0.ccd2: 30.6C
dev.amdtemp.0.ccd1: 31.6C
dev.amdtemp.0.ccd0: 32.0C
dev.amdtemp.0.core0.sensor0: 44.1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb0
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:

hedt1% sysctl dev.cpu | grep temp
dev.cpu.47.temperature: 36.6C
dev.cpu.46.temperature: 36.6C
dev.cpu.45.temperature: 36.6C
dev.cpu.44.temperature: 36.6C
dev.cpu.43.temperature: 36.6C
dev.cpu.42.temperature: 36.6C
dev.cpu.41.temperature: 36.6C
dev.cpu.40.temperature: 36.6C
dev.cpu.39.temperature: 36.6C
dev.cpu.38.temperature: 36.6C
dev.cpu.37.temperature: 36.6C
dev.cpu.36.temperature: 36.6C
dev.cpu.35.temperature: 36.6C
dev.cpu.34.temperature: 36.6C
dev.cpu.33.temperature: 36.6C
dev.cpu.32.temperature: 36.6C
dev.cpu.31.temperature: 36.6C
dev.cpu.30.temperature: 36.6C
dev.cpu.29.temperature: 36.6C
dev.cpu.28.temperature: 36.6C
dev.cpu.27.temperature: 36.6C
dev.cpu.26.temperature: 36.6C
dev.cpu.25.temperature: 36.6C
dev.cpu.24.temperature: 36.6C
dev.cpu.23.temperature: 36.6C
dev.cpu.22.temperature: 36.6C
dev.cpu.21.temperature: 36.6C
dev.cpu.20.temperature: 36.6C
dev.cpu.19.temperature: 36.6C
dev.cpu.18.temperature: 36.6C
dev.cpu.17.temperature: 36.6C
dev.cpu.16.temperature: 36.6C
dev.cpu.15.temperature: 36.6C
dev.cpu.14.temperature: 36.6C
dev.cpu.13.temperature: 36.6C
dev.cpu.12.temperature: 36.6C
dev.cpu.11.temperature: 36.6C
dev.cpu.10.temperature: 36.6C
dev.cpu.9.temperature: 36.6C
dev.cpu.8.temperature: 36.6C
dev.cpu.7.temperature: 36.6C
dev.cpu.6.temperature: 36.6C
dev.cpu.5.temperature: 36.6C
dev.cpu.4.temperature: 36.6C
dev.cpu.3.temperature: 36.6C
dev.cpu.2.temperature: 36.6C
dev.cpu.1.temperature: 36.6C
dev.cpu.0.temperature: 36.6C
dev.cpu.*.temperature seems to consistently report the exact same temperature for all cores. Is that expected? This is my first AMD build; I have been a team blue player until now. On Intel platforms, different cores tend to report different temperatures:

sysctl -a | grep temperature
hw.acpi.thermal.tz1.temperature: 29.9C
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.11.temperature: 59.0C
dev.cpu.9.temperature: 58.0C
dev.cpu.7.temperature: 59.0C
dev.cpu.5.temperature: 61.0C
dev.cpu.3.temperature: 60.0C
dev.cpu.1.temperature: 60.0C
dev.cpu.10.temperature: 58.0C
dev.cpu.8.temperature: 58.0C
dev.cpu.6.temperature: 58.0C
dev.cpu.4.temperature: 60.0C
dev.cpu.2.temperature: 61.0C
dev.cpu.0.temperature: 59.0C
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4

commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-14 07:52:08 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR:             kern/278311
    Tested by:      jbo
    MFC after:      1 week

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
(In reply to Joel Bodenmann from comment #18)
> dev.cpu.*.temperature seems to consistently report the exact same temperature for all cores. Is that expected?

Yes, for these newer AMD processors there is only one temperature sensor (`sc_ntemps`) per package. If you boot with verbose you should see something like "Found 48 cores and 1 sensors".
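One way to see this from userland is to count the distinct values among the dev.cpu.*.temperature entries: a single per-package sensor yields one value, while per-core sensors typically yield several. The sketch below parses `sysctl` output passed in as text; the helper names are made up, and the two samples are excerpts of the readings posted earlier in this thread.

```python
import re

TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*(-?\d+(?:\.\d+)?)C$")


def cpu_temperatures(sysctl_output):
    """Map CPU index to temperature in degrees Celsius."""
    temps = {}
    for line in sysctl_output.splitlines():
        m = TEMP_RE.match(line.strip())
        if m:
            temps[int(m.group(1))] = float(m.group(2))
    return temps


def distinct_readings(temps):
    """Return the sorted set of distinct temperature values."""
    return sorted(set(temps.values()))


# Excerpt of the Threadripper 7960X output posted above.
amd_sample = """\
dev.cpu.2.temperature: 36.6C
dev.cpu.1.temperature: 36.6C
dev.cpu.0.temperature: 36.6C
"""

# Excerpt of the Intel output posted above.
intel_sample = """\
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.2.temperature: 61.0C
dev.cpu.1.temperature: 60.0C
dev.cpu.0.temperature: 59.0C
"""

print(distinct_readings(cpu_temperatures(amd_sample)))    # [36.6]
print(distinct_readings(cpu_temperatures(intel_sample)))  # [59.0, 60.0, 61.0]
```

Identical values are only a hint, since independent sensors can briefly agree; the verbose boot message remains the authoritative source.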
Hi, I tested this patch on a dual-socket Dell server with AMD EPYC 9174F 16-Core processors. While I can now see reasonable temperatures, it looks like all dev.cpu.x.temperature values are copied from the CPU in the first socket.

sysctl dev.cpu | grep temper
dev.cpu.0.temperature: 51.6C
...
dev.cpu.31.temperature: 51.6C

sysctl dev.amdtemp | grep sensor
dev.amdtemp.7.core0.sensor0: 52.1C
dev.amdtemp.7.sensor_offset: 0
dev.amdtemp.6.core0.sensor0: 52.1C
dev.amdtemp.6.sensor_offset: 0
dev.amdtemp.5.core0.sensor0: 52.1C
dev.amdtemp.5.sensor_offset: 0
dev.amdtemp.4.core0.sensor0: 52.1C
dev.amdtemp.4.sensor_offset: 0
dev.amdtemp.3.core0.sensor0: 51.6C
dev.amdtemp.3.sensor_offset: 0
dev.amdtemp.2.core0.sensor0: 51.6C
dev.amdtemp.2.sensor_offset: 0
dev.amdtemp.1.core0.sensor0: 51.6C
dev.amdtemp.1.sensor_offset: 0
dev.amdtemp.0.core0.sensor0: 51.6C
dev.amdtemp.0.sensor_offset: 0

Thanks for the patch.
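To make the pattern in that output easier to see, the amdtemp instances can be grouped by their reported reading. The sketch below is only a reading aid with a made-up helper name; it assumes that instances sharing a reading sit on the same package, which the output suggests but does not prove.

```python
import re
from collections import defaultdict

SENSOR_RE = re.compile(
    r"^dev\.amdtemp\.(\d+)\.core0\.sensor0:\s*(-?\d+(?:\.\d+)?)C$"
)


def group_by_reading(sysctl_output):
    """Map each sensor0 reading to the sorted amdtemp unit numbers reporting it."""
    groups = defaultdict(list)
    for line in sysctl_output.splitlines():
        m = SENSOR_RE.match(line.strip())
        if m:
            groups[float(m.group(2))].append(int(m.group(1)))
    return {reading: sorted(units) for reading, units in groups.items()}


# Sensor lines from the dual EPYC 9174F report above.
sample = """\
dev.amdtemp.7.core0.sensor0: 52.1C
dev.amdtemp.6.core0.sensor0: 52.1C
dev.amdtemp.5.core0.sensor0: 52.1C
dev.amdtemp.4.core0.sensor0: 52.1C
dev.amdtemp.3.core0.sensor0: 51.6C
dev.amdtemp.2.core0.sensor0: 51.6C
dev.amdtemp.1.core0.sensor0: 51.6C
dev.amdtemp.0.core0.sensor0: 51.6C
"""

for reading, units in sorted(group_by_reading(sample).items()):
    print(f"{reading}C: amdtemp units {units}")
```

With the data above this yields two groups, units 0-3 at 51.6C and units 4-7 at 52.1C, while every dev.cpu entry shows 51.6C.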
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e4d8fe76c5d4833d2ba5b889b8af3e0d2907a531

commit e4d8fe76c5d4833d2ba5b889b8af3e0d2907a531
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-21 03:15:52 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR:             kern/278311
    Tested by:      jbo

    (cherry picked from commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4)

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c5179c2b8a5e5b826f6626634b9b53dd257d6499

commit c5179c2b8a5e5b826f6626634b9b53dd257d6499
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-21 03:19:01 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR:             kern/278311
    Tested by:      jbo

    (cherry picked from commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4)

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)