Bug 278311 - amdtemp: Does not recognize AMD Threadripper 7960X
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Xin LI
 
Reported: 2024-04-11 15:22 UTC by Joel Bodenmann
Modified: 2024-04-21 04:39 UTC (History)

Attachments
pciconf (26.55 KB, text/plain), 2024-04-11 23:15 UTC, Joel Bodenmann
dmesg.boot (17.19 KB, text/plain), 2024-04-11 23:16 UTC, Joel Bodenmann
Add support for F19 M10 (0x10..0x1f) (1.69 KB, patch), 2024-04-12 00:12 UTC, Xin LI
Revised patch to also add support in amdsmn(4) (2.56 KB, patch), 2024-04-12 01:51 UTC, Xin LI
dmesg (17.35 KB, text/plain), 2024-04-12 19:10 UTC, Joel Bodenmann
Corrected Root Complex PCI ID (0x14b0 -> 0x14a4) (2.56 KB, patch), 2024-04-13 22:24 UTC, Xin LI

Description Joel Bodenmann freebsd_committer freebsd_triage 2024-04-11 15:22:17 UTC
amdtemp does not seem to be able to report CPU temperatures for the AMD Ryzen Threadripper 7960X CPU as of stable/14 b556c37f83b03432af6dd9af1a4e143fc8b2e100.

# dmidecode -t processor

# dmidecode 3.5
# SMBIOS entry point at 0x6e2a8000
Found SMBIOS entry point in EFI, reading table from /dev/mem.
SMBIOS 3.6 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.

Handle 0x0010, DMI type 4, 48 bytes
Processor Information
	Socket Designation: SP6
	Type: Central Processor
	Family: Zen
	Manufacturer: Advanced Micro Devices, Inc.
	ID: 81 0F A1 00 FF FB 8B 17
	Signature: Family 25, Model 24, Stepping 1
	Flags:
		FPU (Floating-point unit on-chip)
		VME (Virtual mode extension)
		DE (Debugging extension)
		PSE (Page size extension)
		TSC (Time stamp counter)
		MSR (Model specific registers)
		PAE (Physical address extension)
		MCE (Machine check exception)
		CX8 (CMPXCHG8 instruction supported)
		APIC (On-chip APIC hardware supported)
		SEP (Fast system call)
		MTRR (Memory type range registers)
		PGE (Page global enable)
		MCA (Machine check architecture)
		CMOV (Conditional move instruction supported)
		PAT (Page attribute table)
		PSE-36 (36-bit page size extension)
		CLFSH (CLFLUSH instruction supported)
		MMX (MMX technology supported)
		FXSR (FXSAVE and FXSTOR instructions supported)
		SSE (Streaming SIMD extensions)
		SSE2 (Streaming SIMD extensions 2)
		HTT (Multi-threading)
	Version: AMD Ryzen Threadripper 7960X 24-Cores          
	Voltage: 1.2 V
	External Clock: 100 MHz
	Max Speed: 5650 MHz
	Current Speed: 4200 MHz
	Status: Populated, Enabled
	Upgrade: <OUT OF SPEC>
	L1 Cache Handle: 0x000D
	L2 Cache Handle: 0x000E
	L3 Cache Handle: 0x000F
	Serial Number: Unknown
	Asset Tag: Unknown
	Part Number: Unknown
	Core Count: 24
	Core Enabled: 24
	Thread Count: 48
	Characteristics:
		64-bit capable
		Multi-Core
		Hardware Thread
		Execute Protection
		Enhanced Virtualization
		Power/Performance Control
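
For reference, the "Family 25, Model 24, Stepping 1" signature above can be reproduced from the first four bytes of the ID field. The following is a minimal sketch, assuming the usual AMD CPUID convention in which the extended family and extended model fields are applied when the base family is 0xF; the function name `decode_cpuid_signature` is made up for illustration and is not part of dmidecode or amdtemp.

```python
def decode_cpuid_signature(id_field: str) -> dict:
    """Decode family/model/stepping from a dmidecode processor 'ID' field.

    Assumption: the first four bytes are CPUID leaf 1 EAX in little-endian
    order, and the AMD convention applies (extended fields are used when the
    base family is 0xF).
    """
    raw = bytes.fromhex(id_field.replace(" ", ""))
    eax = int.from_bytes(raw[:4], "little")

    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model

    return {"family": family, "model": model, "stepping": stepping}


# ID bytes taken from the dmidecode output in this report.
signature = decode_cpuid_signature("81 0F A1 00 FF FB 8B 17")
print(signature)
print(hex(signature["family"]), hex(signature["model"]))
```

Under these assumptions the decoded values are family 0x19 and model 0x18, which falls inside a 0x10..0x1f model range.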
Comment 1 Xin LI freebsd_committer freebsd_triage 2024-04-11 22:58:46 UTC
Could you please provide dmesg (/var/run/dmesg.boot is likely to be fine) and output from pciconf -lv?
Comment 2 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-11 23:15:34 UTC
Created attachment 249915
pciconf
Comment 3 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-11 23:16:02 UTC
Created attachment 249916
dmesg.boot
Comment 4 Xin LI freebsd_committer freebsd_triage 2024-04-12 00:12:08 UTC
Created attachment 249918
Add support for F19 M10 (0x10..0x1f)

Could you please try this patch?
Comment 5 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-12 00:42:00 UTC
Does not seem to work yet. Still not getting any temperatures (via sysctl or sysutils/hwstat).

Looking through /var/log/messages I see:


Apr 12 00:37:15 hedt1 kernel: amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24
Apr 12 00:37:15 hedt1 kernel: device_attach: amdtemp0 attach returned 6
Comment 6 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-12 00:45:05 UTC
I see this in your patch:

> DEVICEID_AMD_HOSTB19H_M10H_ROOT

and this in /var/log/messages:

> amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24

Not knowing anything about the internal workings of amdtemp, I'd say this looks fishy.
Comment 7 Xin LI freebsd_committer freebsd_triage 2024-04-12 01:51:22 UTC
Created attachment 249919
Revised patch to also add support in amdsmn(4)

Ah, I forgot to add the support into amdsmn(4) which amdtemp(4) depends on.

Could you revert that one and apply this and let me know if it helps?
Comment 8 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-12 17:35:30 UTC
After applying your latest patch I am getting corresponding entries as dev.cpu.*.temperature. However, all of them report -0.0

Nothing shows up in dmesg.boot that would indicate anything helpful.

hedt1% sysctl -a | grep amdtemp
kern.version: FreeBSD 14.0-STABLE #2 feature/amdtemp-n267180-b556c37f83b0-dirty: Fri Apr 12 17:28:54 UTC 2024
FreeBSD 14.0-STABLE #2 feature/amdtemp-n267180-b556c37f83b0-dirty: Fri Apr 12 17:28:54 UTC 2024
amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb24
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:
Comment 9 Xin LI freebsd_committer freebsd_triage 2024-04-12 17:37:36 UTC
(In reply to Joel Bodenmann from comment #8)
Can you share the dmesg output (ideally with only sensitive information redacted) and the `sysctl dev.amdtemp` output?
Comment 10 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-12 17:46:17 UTC
You have me slightly concerned now... could you elaborate what kind of information would show up in dmesg that would be considered sensitive? I guess one could argue about MAC addresses if being extra paranoid - is there anything else I'm unaware of?
Comment 11 Xin LI freebsd_committer freebsd_triage 2024-04-12 18:22:36 UTC
(In reply to Joel Bodenmann from comment #10)
> could you elaborate what kind of information would show up in dmesg that would be considered sensitive?

Well, I'm just trying to remind you to filter out what _you_ would consider sensitive: the kernel message buffer (`dmesg`) can contain a lot of things, and Bugzilla data is generally accessible to anyone.

Examples of "sensitive data" include the build host name, Ethernet addresses, hard drive serial numbers, mounted file systems, etc.  Some people feel uncomfortable sharing some of this information with the Internet, which is totally understandable and should always be respected, and this kind of PII is usually not useful for debugging anyway.  On the other hand, information such as which other drivers were loaded, or messages that drivers printed without identifying themselves on the same line (especially in a verbose boot), is very important for debugging, so I generally attach a full dmesg output with only minimal redaction when reporting driver issues.
Comment 12 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-12 19:10:33 UTC
Created attachment 249938
dmesg
Comment 13 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-12 19:10:59 UTC
Thank you for the explanation. That is pretty much what I expected/understood but I wasn't sure whether you're hinting at something I'm unaware of.

dmesg (attached)

hedt1% sysctl dev.amdtemp
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb24
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:
Comment 14 Xin LI freebsd_committer freebsd_triage 2024-04-13 22:24:36 UTC
Created attachment 249959
Corrected Root Complex PCI ID (0x14b0 -> 0x14a4)

Hi, could you please try this patch?  It basically replaces 0x14b0, the PCI ID of function 3, with 0x14a4, the PCI ID of the root complex.
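
One way to see which host bridge carries a given PCI ID is to scan `pciconf -l` output for the vendor/device pair. The sketch below assumes the `vendor=0x.... device=0x....` field layout printed by recent FreeBSD releases and uses AMD's vendor ID 0x1022. The two sample lines are made up for illustration and are not copied from the attached pciconf output; `find_devices` is a hypothetical helper, not an existing tool.

```python
import re

AMD_VENDOR_ID = 0x1022

# Matches e.g. "hostb0@pci0:0:0:0:  class=... vendor=0x1022 device=0x14a4 ..."
# (assumed layout; older releases print a combined chip= field instead).
PCICONF_LINE = re.compile(
    r"^(?P<name>\w+)@\S+\s+.*?"
    r"vendor=(?P<vendor>0x[0-9a-fA-F]+)\s+device=(?P<device>0x[0-9a-fA-F]+)"
)


def find_devices(pciconf_text: str, vendor: int, device: int) -> list:
    """Return the device names whose vendor and device IDs both match."""
    names = []
    for line in pciconf_text.splitlines():
        match = PCICONF_LINE.match(line)
        if match is None:
            continue
        if int(match["vendor"], 16) == vendor and int(match["device"], 16) == device:
            names.append(match["name"])
    return names


# Illustrative input only: NOT taken from the attachment in this report.
SAMPLE = (
    "hostb0@pci0:0:0:0:\tclass=0x060000 rev=0x00 hdr=0x00 "
    "vendor=0x1022 device=0x14a4 subvendor=0x1022 subdevice=0x14a4\n"
    "hostb24@pci0:0:24:3:\tclass=0x060000 rev=0x00 hdr=0x00 "
    "vendor=0x1022 device=0x14b0 subvendor=0x0000 subdevice=0x0000\n"
)

print(find_devices(SAMPLE, AMD_VENDOR_ID, 0x14A4))
print(find_devices(SAMPLE, AMD_VENDOR_ID, 0x14B0))
```

On a real system the text would come from `pciconf -l` rather than from a hard-coded sample.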
Comment 15 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-13 22:38:55 UTC
I'm getting temperature readouts now.

Whether they are accurate I can't easily check but the values appear to be within reasonable expectations (34C idle, 91C during stress -c 48 on a water cooled system).

Thank you for handling this so quickly. May we expect a commit and MFC to stable soon? :)
Comment 16 Xin LI freebsd_committer freebsd_triage 2024-04-13 22:51:17 UTC
(In reply to Joel Bodenmann from comment #15)
Glad to hear!  Just wanted to double check -- is the driver fully functional now?  E.g. what do `sysctl dev.amdtemp` and `sysctl dev.cpu | grep temp` show now?
Comment 17 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-13 23:08:59 UTC
hedt1% sysctl dev.amdtemp
dev.amdtemp.3.ccd3: 32.6C
dev.amdtemp.3.ccd2: 30.6C
dev.amdtemp.3.ccd1: 31.6C
dev.amdtemp.3.ccd0: 32.0C
dev.amdtemp.3.core0.sensor0: 44.1C
dev.amdtemp.3.sensor_offset: 0
dev.amdtemp.3.%parent: hostb29
dev.amdtemp.3.%pnpinfo: 
dev.amdtemp.3.%location: 
dev.amdtemp.3.%driver: amdtemp
dev.amdtemp.3.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.2.ccd3: 32.6C
dev.amdtemp.2.ccd2: 30.6C
dev.amdtemp.2.ccd1: 31.6C
dev.amdtemp.2.ccd0: 32.0C
dev.amdtemp.2.core0.sensor0: 44.1C
dev.amdtemp.2.sensor_offset: 0
dev.amdtemp.2.%parent: hostb14
dev.amdtemp.2.%pnpinfo: 
dev.amdtemp.2.%location: 
dev.amdtemp.2.%driver: amdtemp
dev.amdtemp.2.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.1.ccd3: 32.6C
dev.amdtemp.1.ccd2: 30.6C
dev.amdtemp.1.ccd1: 31.6C
dev.amdtemp.1.ccd0: 32.0C
dev.amdtemp.1.core0.sensor0: 44.1C
dev.amdtemp.1.sensor_offset: 0
dev.amdtemp.1.%parent: hostb7
dev.amdtemp.1.%pnpinfo: 
dev.amdtemp.1.%location: 
dev.amdtemp.1.%driver: amdtemp
dev.amdtemp.1.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.0.ccd3: 32.6C
dev.amdtemp.0.ccd2: 30.6C
dev.amdtemp.0.ccd1: 31.6C
dev.amdtemp.0.ccd0: 32.0C
dev.amdtemp.0.core0.sensor0: 44.1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb0
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent: 


hedt1% sysctl dev.cpu | grep temp
dev.cpu.47.temperature: 36.6C
dev.cpu.46.temperature: 36.6C
dev.cpu.45.temperature: 36.6C
dev.cpu.44.temperature: 36.6C
dev.cpu.43.temperature: 36.6C
dev.cpu.42.temperature: 36.6C
dev.cpu.41.temperature: 36.6C
dev.cpu.40.temperature: 36.6C
dev.cpu.39.temperature: 36.6C
dev.cpu.38.temperature: 36.6C
dev.cpu.37.temperature: 36.6C
dev.cpu.36.temperature: 36.6C
dev.cpu.35.temperature: 36.6C
dev.cpu.34.temperature: 36.6C
dev.cpu.33.temperature: 36.6C
dev.cpu.32.temperature: 36.6C
dev.cpu.31.temperature: 36.6C
dev.cpu.30.temperature: 36.6C
dev.cpu.29.temperature: 36.6C
dev.cpu.28.temperature: 36.6C
dev.cpu.27.temperature: 36.6C
dev.cpu.26.temperature: 36.6C
dev.cpu.25.temperature: 36.6C
dev.cpu.24.temperature: 36.6C
dev.cpu.23.temperature: 36.6C
dev.cpu.22.temperature: 36.6C
dev.cpu.21.temperature: 36.6C
dev.cpu.20.temperature: 36.6C
dev.cpu.19.temperature: 36.6C
dev.cpu.18.temperature: 36.6C
dev.cpu.17.temperature: 36.6C
dev.cpu.16.temperature: 36.6C
dev.cpu.15.temperature: 36.6C
dev.cpu.14.temperature: 36.6C
dev.cpu.13.temperature: 36.6C
dev.cpu.12.temperature: 36.6C
dev.cpu.11.temperature: 36.6C
dev.cpu.10.temperature: 36.6C
dev.cpu.9.temperature: 36.6C
dev.cpu.8.temperature: 36.6C
dev.cpu.7.temperature: 36.6C
dev.cpu.6.temperature: 36.6C
dev.cpu.5.temperature: 36.6C
dev.cpu.4.temperature: 36.6C
dev.cpu.3.temperature: 36.6C
dev.cpu.2.temperature: 36.6C
dev.cpu.1.temperature: 36.6C
dev.cpu.0.temperature: 36.6C
Comment 18 Joel Bodenmann freebsd_committer freebsd_triage 2024-04-13 23:53:14 UTC
dev.cpu.*.temperature seems to consistently report the exact same temperature for all cores. Is that expected?

This is my first AMD build - I have been a team blue player until now. On Intel platforms, different cores tend to report different temperatures:

sysctl -a | grep temperature
hw.acpi.thermal.tz1.temperature: 29.9C
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.11.temperature: 59.0C
dev.cpu.9.temperature: 58.0C
dev.cpu.7.temperature: 59.0C
dev.cpu.5.temperature: 61.0C
dev.cpu.3.temperature: 60.0C
dev.cpu.1.temperature: 60.0C
dev.cpu.10.temperature: 58.0C
dev.cpu.8.temperature: 58.0C
dev.cpu.6.temperature: 58.0C
dev.cpu.4.temperature: 60.0C
dev.cpu.2.temperature: 61.0C
dev.cpu.0.temperature: 59.0C
Comment 19 commit-hook freebsd_committer freebsd_triage 2024-04-14 08:03:59 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4

commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-14 07:52:08 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR:             kern/278311
    Tested by:      jbo
    MFC after:      1 week

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
Comment 20 Xin LI freebsd_committer freebsd_triage 2024-04-14 08:19:35 UTC
(In reply to Joel Bodenmann from comment #18)
> dev.cpu.*.temperature seems to consistently report the exact same temperature for all cores. Is that expected?

Yes, for these newer AMD processors there is only one temperature sensor (`sc_ntemps`) per package.  If you boot with verbose you should see something like "Found 48 cores and 1 sensors".
Comment 21 Oleh Hushchenkov 2024-04-14 15:38:41 UTC
Hi,

I tested this patch on a Dell server with dual AMD EPYC 9174F 16-Core processors.
I can now see reasonable temperatures, but it looks like all dev.cpu.x.temperature values are copied from the CPU in the first socket.

sysctl dev.cpu | grep temper         
dev.cpu.0.temperature: 51.6C
...
dev.cpu.31.temperature: 51.6C

sysctl dev.amdtemp | grep sensor     
dev.amdtemp.7.core0.sensor0: 52.1C                          
dev.amdtemp.7.sensor_offset: 0
dev.amdtemp.6.core0.sensor0: 52.1C                          
dev.amdtemp.6.sensor_offset: 0
dev.amdtemp.5.core0.sensor0: 52.1C                          
dev.amdtemp.5.sensor_offset: 0                              
dev.amdtemp.4.core0.sensor0: 52.1C                          
dev.amdtemp.4.sensor_offset: 0                              
dev.amdtemp.3.core0.sensor0: 51.6C                          
dev.amdtemp.3.sensor_offset: 0                              
dev.amdtemp.2.core0.sensor0: 51.6C                          
dev.amdtemp.2.sensor_offset: 0                              
dev.amdtemp.1.core0.sensor0: 51.6C                          
dev.amdtemp.1.sensor_offset: 0                              
dev.amdtemp.0.core0.sensor0: 51.6C                          
dev.amdtemp.0.sensor_offset: 0

Thanks for the patch.
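
The pattern in the output above can be made explicit by grouping the amdtemp units by their sensor reading. This is only a sketch for inspecting `sysctl dev.amdtemp` text; the helper name `group_units_by_reading` is hypothetical, and the sample is the sensor lines quoted in this comment.

```python
import re
from collections import defaultdict

# Matches lines such as "dev.amdtemp.7.core0.sensor0: 52.1C".
SENSOR_LINE = re.compile(
    r"^dev\.amdtemp\.(\d+)\.core0\.sensor0:\s*(-?\d+(?:\.\d+)?)C$"
)


def group_units_by_reading(sysctl_text: str) -> dict:
    """Map each core0.sensor0 reading to the sorted amdtemp units reporting it."""
    groups = defaultdict(list)
    for line in sysctl_text.splitlines():
        match = SENSOR_LINE.match(line.strip())
        if match:
            unit, reading = int(match.group(1)), float(match.group(2))
            groups[reading].append(unit)
    return {reading: sorted(units) for reading, units in groups.items()}


# Sensor lines as quoted in this comment.
SAMPLE = """\
dev.amdtemp.7.core0.sensor0: 52.1C
dev.amdtemp.6.core0.sensor0: 52.1C
dev.amdtemp.5.core0.sensor0: 52.1C
dev.amdtemp.4.core0.sensor0: 52.1C
dev.amdtemp.3.core0.sensor0: 51.6C
dev.amdtemp.2.core0.sensor0: 51.6C
dev.amdtemp.1.core0.sensor0: 51.6C
dev.amdtemp.0.core0.sensor0: 51.6C
"""

for reading, units in sorted(group_units_by_reading(SAMPLE).items()):
    print(f"{reading}C: amdtemp units {units}")
```

The units fall into two groups (0-3 and 4-7) with different readings, while every dev.cpu.N.temperature value quoted above matches only the first group.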
Comment 22 commit-hook freebsd_committer freebsd_triage 2024-04-21 03:16:46 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e4d8fe76c5d4833d2ba5b889b8af3e0d2907a531

commit e4d8fe76c5d4833d2ba5b889b8af3e0d2907a531
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-21 03:15:52 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR:             kern/278311
    Tested by:      jbo

    (cherry picked from commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4)

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
Comment 23 commit-hook freebsd_committer freebsd_triage 2024-04-21 03:19:49 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c5179c2b8a5e5b826f6626634b9b53dd257d6499

commit c5179c2b8a5e5b826f6626634b9b53dd257d6499
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-21 03:19:01 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR:             kern/278311
    Tested by:      jbo

    (cherry picked from commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4)

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)