| Field | Value |
|---|---|
| Summary | amdtemp: Does not recognize AMD Threadripper 7960X |
| Product | Base System |
| Reporter | Joel Bodenmann <jbo> |
| Component | kern |
| Assignee | Xin LI <delphij> |
| Status | Closed FIXED |
| Severity | Affects Some People |
| CC | delphij, o.hushchenkov, rudolphfroger |
| Priority | --- |
| Version | Unspecified |
| Hardware | Any |
| OS | Any |

Description
Joel Bodenmann
2024-04-11 15:22:17 UTC

Could you please provide dmesg (/var/run/dmesg.boot is likely to be fine) and the output of `pciconf -lv`?

Created attachment 249915
pciconf

Created attachment 249916
dmesg.boot

Created attachment 249918
Add support for F19 M10 (0x10..0x1f)

Could you please try this patch?

Does not seem to work yet. Still not getting any temperatures (via sysctl or sysutils/hwstat).

Looking through /var/log/messages I see:

Apr 12 00:37:15 hedt1 kernel: amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24
Apr 12 00:37:15 hedt1 kernel: device_attach: amdtemp0 attach returned 6

I see this in your patch:
> DEVICEID_AMD_HOSTB19H_M10H_ROOT

and this in /var/log/messages:
> amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24

Not knowing anything about the internal workings of amdtemp, I'd say this looks fishy.

Created attachment 249919
Revised patch to also add support in amdsmn(4)
Ah, I forgot to add support to amdsmn(4), which amdtemp(4) depends on.
Could you revert the previous patch, apply this one, and let me know if it helps?
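
For readers following the log excerpt earlier in the thread: the `attach returned 6` value is an errno code. Below is a minimal sketch of decoding it. It assumes the errno table of the machine running the script agrees with FreeBSD's for this value (6 is ENXIO on FreeBSD, Linux, and macOS); the helper name `decode_attach_failure` is made up for illustration.

```python
import errno
import os
import re

# Log line copied from the /var/log/messages excerpt in this thread.
LOG_LINE = "Apr 12 00:37:15 hedt1 kernel: device_attach: amdtemp0 attach returned 6"


def decode_attach_failure(line):
    """Return (device, errno number, symbolic name) for a device_attach
    failure line, or None if the line does not look like one."""
    m = re.search(r"device_attach: (\S+) attach returned (\d+)", line)
    if m is None:
        return None
    code = int(m.group(2))
    # errno.errorcode maps numbers to names using the local platform's table.
    return m.group(1), code, errno.errorcode.get(code, "unknown")


device, code, name = decode_attach_failure(LOG_LINE)
# The strerror text is platform-specific, so no fixed output is claimed here.
print(f"{device}: errno {code} = {name} ({os.strerror(code)})")
```

ENXIO is the usual way for a driver's attach routine to say it could not configure the device, which fits the missing amdsmn(4) support described above.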

After applying your latest patch I am getting corresponding entries as dev.cpu.*.temperature. However, all of them report -0.0.
Nothing shows up in dmesg.boot that would indicate anything helpful.

hedt1% sysctl -a | grep amdtemp
kern.version: FreeBSD 14.0-STABLE #2 feature/amdtemp-n267180-b556c37f83b0-dirty: Fri Apr 12 17:28:54 UTC 2024
FreeBSD 14.0-STABLE #2 feature/amdtemp-n267180-b556c37f83b0-dirty: Fri Apr 12 17:28:54 UTC 2024
amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb24
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb24
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:

(In reply to Joel Bodenmann from comment #8)
Can you share the dmesg output (ideally with only sensitive information redacted) and the `sysctl dev.amdtemp` output?

You have me slightly concerned now... could you elaborate on what kind of information would show up in dmesg that would be considered sensitive?
I guess one could argue about MAC addresses if being extra paranoid - is there anything else I'm unaware of?

(In reply to Joel Bodenmann from comment #10)
> could you elaborate what kind of information would show up in dmesg that would be considered sensitive?

Well, I'm just trying to remind you that you should filter out what _you_ would consider sensitive: the kernel message buffer (`dmesg`) can contain a lot of things, and Bugzilla data is generally accessible to anyone.

Examples of "sensitive data" include the build host name, Ethernet addresses, hard drive serial numbers, mounted file systems, etc. Some people feel uncomfortable sharing some of this information with the Internet, which is totally understandable and should always be respected, and usually this PII is not useful for debugging anyway. (Information such as which other drivers were loaded, or messages that drivers print without naming themselves on the same line, especially in a verbose boot, is very important for debugging, so I generally attach a full dmesg output with only minimal redaction when reporting driver issues.)

Created attachment 249938
dmesg
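
One possible reading of the `-0.0C` values above, offered as a sketch rather than a diagnosis: temperature sysctls of this kind are commonly exported as integers in tenths of a kelvin, and a raw value of 2731 (a sensor reading of zero on top of a 273.1 K zero point) prints as `-0.0C` once 273.15 is subtracted. The export format, the zero point, and the conversion formula are assumptions here, not facts taken from this thread, and `decikelvin_to_display` is a hypothetical helper.

```python
# Assumed zero point: 273.1 K expressed in tenths of a kelvin.
ZERO_C_IN_DECIKELVIN = 2731


def decikelvin_to_display(raw):
    """Format a tenths-of-kelvin integer as a Celsius string with one decimal.

    Assumed conversion: divide by 10, subtract 273.15, print one decimal.
    """
    return "%.1fC" % (raw / 10 - 273.15)


# 273.1 - 273.15 is about -0.05 and lands just above it in floating point,
# so it rounds to negative zero.
print(decikelvin_to_display(ZERO_C_IN_DECIKELVIN))  # -0.0C
```

If that reading is right, `-0.0C` means the driver attached and read back zero from the sensor register, which is consistent with it talking to the wrong device rather than with a genuinely cold CPU.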

Thank you for the explanation. That is pretty much what I expected/understood but I wasn't sure whether you're hinting at something I'm unaware of.

dmesg (attached)

hedt1% sysctl dev.amdtemp
dev.amdtemp.0.core0.sensor0: -0.0C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb24
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:

Created attachment 249959
Corrected Root Complex PCI ID (0x14b0 -> 0x14a4)
Hi, could you please try this patch? (Basically, it replaces the 0x14b0 PCI ID, which belongs to function 3, with 0x14a4, the root complex.)
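
To illustrate why a single device ID changes which devices the driver binds to, here is a small sketch of ID-based matching. The inventory is an assumption reconstructed from the attach messages in this thread (which `hostb` the driver attached to under each patch), not real `pciconf` output, and `matching_bridges` is a hypothetical helper, not code from amdtemp(4). The only value not taken from the thread is AMD's PCI vendor ID, 0x1022.

```python
AMD_VENDOR_ID = 0x1022  # AMD's PCI vendor ID

# Hypothetical inventory: name -> (vendor ID, device ID).
# Inferred from the thread: the 0x14b0 patch attached on hostb24, while the
# 0x14a4 patch attached on hostb0, hostb7, hostb14 and hostb29.
HOST_BRIDGES = {
    "hostb0": (AMD_VENDOR_ID, 0x14A4),
    "hostb7": (AMD_VENDOR_ID, 0x14A4),
    "hostb14": (AMD_VENDOR_ID, 0x14A4),
    "hostb29": (AMD_VENDOR_ID, 0x14A4),
    "hostb24": (AMD_VENDOR_ID, 0x14B0),
}


def matching_bridges(device_id, bridges=HOST_BRIDGES):
    """Return the names of AMD devices whose device ID equals device_id."""
    return [
        name
        for name, (vendor, device) in bridges.items()
        if vendor == AMD_VENDOR_ID and device == device_id
    ]


print(matching_bridges(0x14B0))  # the ID used by the earlier patch
print(matching_bridges(0x14A4))  # the root complex ID from the corrected patch
```

Under these assumptions, the earlier ID selects a single function 3 device, while the corrected ID selects the root complexes.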

I'm getting temperature readouts now. Whether they are accurate I can't easily check, but the values appear to be within reasonable expectations (34C idle, 91C during `stress -c 48` on a water-cooled system).

Thank you for handling this so quickly. May we expect a commit and MFC to stable soon? :)

(In reply to Joel Bodenmann from comment #15)
Glad to hear! Just wanted to double check -- is the driver fully functional now? E.g. what do `sysctl dev.amdtemp` and `sysctl dev.cpu | grep temp` show now?

hedt1% sysctl dev.amdtemp
dev.amdtemp.3.ccd3: 32.6C
dev.amdtemp.3.ccd2: 30.6C
dev.amdtemp.3.ccd1: 31.6C
dev.amdtemp.3.ccd0: 32.0C
dev.amdtemp.3.core0.sensor0: 44.1C
dev.amdtemp.3.sensor_offset: 0
dev.amdtemp.3.%parent: hostb29
dev.amdtemp.3.%pnpinfo:
dev.amdtemp.3.%location:
dev.amdtemp.3.%driver: amdtemp
dev.amdtemp.3.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.2.ccd3: 32.6C
dev.amdtemp.2.ccd2: 30.6C
dev.amdtemp.2.ccd1: 31.6C
dev.amdtemp.2.ccd0: 32.0C
dev.amdtemp.2.core0.sensor0: 44.1C
dev.amdtemp.2.sensor_offset: 0
dev.amdtemp.2.%parent: hostb14
dev.amdtemp.2.%pnpinfo:
dev.amdtemp.2.%location:
dev.amdtemp.2.%driver: amdtemp
dev.amdtemp.2.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.1.ccd3: 32.6C
dev.amdtemp.1.ccd2: 30.6C
dev.amdtemp.1.ccd1: 31.6C
dev.amdtemp.1.ccd0: 32.0C
dev.amdtemp.1.core0.sensor0: 44.1C
dev.amdtemp.1.sensor_offset: 0
dev.amdtemp.1.%parent: hostb7
dev.amdtemp.1.%pnpinfo:
dev.amdtemp.1.%location:
dev.amdtemp.1.%driver: amdtemp
dev.amdtemp.1.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.0.ccd3: 32.6C
dev.amdtemp.0.ccd2: 30.6C
dev.amdtemp.0.ccd1: 31.6C
dev.amdtemp.0.ccd0: 32.0C
dev.amdtemp.0.core0.sensor0: 44.1C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%parent: hostb0
dev.amdtemp.0.%pnpinfo:
dev.amdtemp.0.%location:
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors
dev.amdtemp.%parent:

hedt1% sysctl dev.cpu | grep temp
dev.cpu.47.temperature: 36.6C
dev.cpu.46.temperature: 36.6C
dev.cpu.45.temperature: 36.6C
dev.cpu.44.temperature: 36.6C
dev.cpu.43.temperature: 36.6C
dev.cpu.42.temperature: 36.6C
dev.cpu.41.temperature: 36.6C
dev.cpu.40.temperature: 36.6C
dev.cpu.39.temperature: 36.6C
dev.cpu.38.temperature: 36.6C
dev.cpu.37.temperature: 36.6C
dev.cpu.36.temperature: 36.6C
dev.cpu.35.temperature: 36.6C
dev.cpu.34.temperature: 36.6C
dev.cpu.33.temperature: 36.6C
dev.cpu.32.temperature: 36.6C
dev.cpu.31.temperature: 36.6C
dev.cpu.30.temperature: 36.6C
dev.cpu.29.temperature: 36.6C
dev.cpu.28.temperature: 36.6C
dev.cpu.27.temperature: 36.6C
dev.cpu.26.temperature: 36.6C
dev.cpu.25.temperature: 36.6C
dev.cpu.24.temperature: 36.6C
dev.cpu.23.temperature: 36.6C
dev.cpu.22.temperature: 36.6C
dev.cpu.21.temperature: 36.6C
dev.cpu.20.temperature: 36.6C
dev.cpu.19.temperature: 36.6C
dev.cpu.18.temperature: 36.6C
dev.cpu.17.temperature: 36.6C
dev.cpu.16.temperature: 36.6C
dev.cpu.15.temperature: 36.6C
dev.cpu.14.temperature: 36.6C
dev.cpu.13.temperature: 36.6C
dev.cpu.12.temperature: 36.6C
dev.cpu.11.temperature: 36.6C
dev.cpu.10.temperature: 36.6C
dev.cpu.9.temperature: 36.6C
dev.cpu.8.temperature: 36.6C
dev.cpu.7.temperature: 36.6C
dev.cpu.6.temperature: 36.6C
dev.cpu.5.temperature: 36.6C
dev.cpu.4.temperature: 36.6C
dev.cpu.3.temperature: 36.6C
dev.cpu.2.temperature: 36.6C
dev.cpu.1.temperature: 36.6C
dev.cpu.0.temperature: 36.6C

dev.cpu.*.temperature seems to consistently report the exact same temperature for all cores. Is that expected?
This is the first time I have an AMD build - I have been a team blue player until now.
On Intel platforms, different cores tend to report different temperatures:

sysctl -a | grep temperature
hw.acpi.thermal.tz1.temperature: 29.9C
hw.acpi.thermal.tz0.temperature: 27.9C
dev.cpu.11.temperature: 59.0C
dev.cpu.9.temperature: 58.0C
dev.cpu.7.temperature: 59.0C
dev.cpu.5.temperature: 61.0C
dev.cpu.3.temperature: 60.0C
dev.cpu.1.temperature: 60.0C
dev.cpu.10.temperature: 58.0C
dev.cpu.8.temperature: 58.0C
dev.cpu.6.temperature: 58.0C
dev.cpu.4.temperature: 60.0C
dev.cpu.2.temperature: 61.0C
dev.cpu.0.temperature: 59.0C

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4

commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-14 07:52:08 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR: kern/278311
    Tested by: jbo
    MFC after: 1 week

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

(In reply to Joel Bodenmann from comment #18)
> dev.cpu.*.temperature seems to consistently report the exact same temperature for all cores. Is that expected?

Yes, for these newer AMD processors there is only one temperature sensor (`sc_ntemps`) per package. If you boot with verbose you should see something like "Found 48 cores and 1 sensors".

Hi, I tested this patch on a Dell server with dual AMD EPYC 9174F 16-Core Processors. While I can now see reasonable temperatures, it looks like all dev.cpu.x.temperature values are copied from the CPU in the first socket.

sysctl dev.cpu | grep temper
dev.cpu.0.temperature: 51.6C
...
dev.cpu.31.temperature: 51.6C

sysctl dev.amdtemp | grep sensor
dev.amdtemp.7.core0.sensor0: 52.1C
dev.amdtemp.7.sensor_offset: 0
dev.amdtemp.6.core0.sensor0: 52.1C
dev.amdtemp.6.sensor_offset: 0
dev.amdtemp.5.core0.sensor0: 52.1C
dev.amdtemp.5.sensor_offset: 0
dev.amdtemp.4.core0.sensor0: 52.1C
dev.amdtemp.4.sensor_offset: 0
dev.amdtemp.3.core0.sensor0: 51.6C
dev.amdtemp.3.sensor_offset: 0
dev.amdtemp.2.core0.sensor0: 51.6C
dev.amdtemp.2.sensor_offset: 0
dev.amdtemp.1.core0.sensor0: 51.6C
dev.amdtemp.1.sensor_offset: 0
dev.amdtemp.0.core0.sensor0: 51.6C
dev.amdtemp.0.sensor_offset: 0

Thanks for the patch.

A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e4d8fe76c5d4833d2ba5b889b8af3e0d2907a531

commit e4d8fe76c5d4833d2ba5b889b8af3e0d2907a531
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-21 03:15:52 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR: kern/278311
    Tested by: jbo

    (cherry picked from commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4)

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c5179c2b8a5e5b826f6626634b9b53dd257d6499

commit c5179c2b8a5e5b826f6626634b9b53dd257d6499
Author:     Xin LI <delphij@FreeBSD.org>
AuthorDate: 2024-04-14 07:45:17 +0000
Commit:     Xin LI <delphij@FreeBSD.org>
CommitDate: 2024-04-21 03:19:01 +0000

    amdsmn(4), amdtemp(4): add support for AMD Family 19h Models 10h-1Fh.

    Tested on AMD Threadripper 7960X.

    PR: kern/278311
    Tested by: jbo

    (cherry picked from commit 51c69c8682e8ab0e5d82ab3d6f2d16419d40bad4)

 sys/dev/amdsmn/amdsmn.c   |  7 +++++++
 sys/dev/amdtemp/amdtemp.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
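
The dual-socket observation above (two distinct sensor readings, but only one value under dev.cpu) can be checked mechanically. The following is a minimal sketch that parses pasted `sysctl` output; the sample lines are copied from the EPYC report in this thread, with the elided dev.cpu entries left out, and the helper names `parse_temperatures` and `distinct_values` are made up for illustration.

```python
import re

# Lines copied from the dual-socket report; the "..." entries are not filled in.
SAMPLE = """\
dev.cpu.0.temperature: 51.6C
dev.cpu.31.temperature: 51.6C
dev.amdtemp.7.core0.sensor0: 52.1C
dev.amdtemp.7.sensor_offset: 0
dev.amdtemp.6.core0.sensor0: 52.1C
dev.amdtemp.5.core0.sensor0: 52.1C
dev.amdtemp.4.core0.sensor0: 52.1C
dev.amdtemp.3.core0.sensor0: 51.6C
dev.amdtemp.2.core0.sensor0: 51.6C
dev.amdtemp.1.core0.sensor0: 51.6C
dev.amdtemp.0.core0.sensor0: 51.6C
"""

TEMP_LINE = re.compile(r"(\S+): (-?\d+(?:\.\d+)?)C")


def parse_temperatures(text):
    """Map sysctl name -> Celsius value for lines of the form 'name: 12.3C'.

    Lines without a trailing 'C' (such as sensor_offset) are skipped.
    """
    temps = {}
    for line in text.splitlines():
        m = TEMP_LINE.fullmatch(line.strip())
        if m:
            temps[m.group(1)] = float(m.group(2))
    return temps


def distinct_values(temps, prefix):
    """Sorted distinct temperatures among sysctls whose name starts with prefix."""
    return sorted({v for name, v in temps.items() if name.startswith(prefix)})


temps = parse_temperatures(SAMPLE)
sensor_values = distinct_values(temps, "dev.amdtemp.")
cpu_values = distinct_values(temps, "dev.cpu.")
# Sensor readings that never show up under dev.cpu.*.temperature.
unreflected = sorted(set(sensor_values) - set(cpu_values))
print("sensor readings:", sensor_values)
print("cpu readings:", cpu_values)
print("not reflected under dev.cpu:", unreflected)
```

On the sample data the 52.1C reading appears only under dev.amdtemp, which matches the reporter's impression that the dev.cpu values all come from one socket.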