Bug 270943 - Complete system freeze on Asus dual socket AMD 7742 system
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 13.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-04-20 02:08 UTC by Neil Bradley
Modified: 2023-04-24 17:59 UTC

See Also:


Attachments
dmesg.boot For this system (20.24 KB, text/plain)
2023-04-20 02:08 UTC, Neil Bradley

Description Neil Bradley 2023-04-20 02:08:36 UTC
Created attachment 241605
dmesg.boot For this system

I have a dual socket 7742 system (128 total real cores, 256 threads) that locks up completely in under an hour if left idle. By "lock up", I mean:

* Console unresponsive (no keyboard/USB/numlock)
* Networking unresponsive (no pings, no arps, nothing)

It behaves as if the CPU were "jumping to self" with all interrupts disabled. The system needs to be reset or power cycled. I have tried the following releases over the last few months with the same results:

FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
FreeBSD 13.1
FreeBSD 13.0
FreeBSD 12.3
Several memstick images of 14.0 since December 2022

Other notes:

* The lockup is guaranteed. I've never had it not lock up when left idle. Always locks up in <1 hour (usually in 10-20 minutes).

* If I run a "stress" program, the system runs for days at a time without any observed lockups. If there's any significant system activity, it appears to not lock up.

* At one point (on a 14.0 build) I was able to get the kernel debugger compiled in. When the system locked up, hitting the local USB keyboard sequence to get into the kernel debugger worked. This also seemed to unlock the system: after I exited the kernel debugger, the system was alive again.

* I've installed the OSes on either 2GB M.2 Samsung SSDs *OR* on a Western Digital SN200 NVMe disk. The behavior did not change, so storage does not appear to be a factor.

* I've halved the memory and swapped DIMMs entirely. No change.
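
To pin down when a freeze starts (and whether the machine really stops making progress), a small heartbeat logger can be left running on the idle system. This is only a minimal sketch, not something that was used for this report: the function name `heartbeat`, the log path, and the one-minute interval are all hypothetical, and it assumes that one tiny write per interval is too little activity to mask the lockup.

```sh
#!/bin/sh
# heartbeat LOGFILE INTERVAL COUNT
# Append a UTC timestamp to LOGFILE every INTERVAL seconds, COUNT times.
# After a reset, the last line shows the last moment the system was alive.
heartbeat() {
    logfile=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        date -u '+%Y-%m-%dT%H:%M:%SZ' >> "$logfile"
        sync    # flush to disk so the stamp survives a hard reset
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `heartbeat /var/tmp/heartbeat.log 60 120` would cover two hours of idle time. If the system recovers by itself (as it did after entering the kernel debugger), a gap between consecutive timestamps marks the frozen period.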

System specs:

Motherboard         : Asus rs700a-e11-rs12u-wocpu009z
CPUs                : Dual AMD 7742 CPUs
BIOS Version        : 0901
BMC Firmware model  : RS700A-E11-RS12U
BMC Firmware version: 1.2.15
Installed ECC memory: 512GB
Storage             : Two Samsung EVO 980 TB M.2 SSDs, and a WD SN200 7.68TB NVME U.2 disk

Video is supplied by the ASpeed AST2500.

I'd be happy to put this system on the internet and allow any and all interested parties access to it for troubleshooting/debugging. Thank you!
Comment 1 Neil Bradley 2023-04-21 19:55:34 UTC
Neglected to mention that this system runs Windows 10, Windows 11, or Ubuntu Linux 22.04 and 23.04 desktop (and server) without issue.
Comment 2 Neil Bradley 2023-04-24 17:59:52 UTC
Workaround for this is to go into the BIOS, select "AMD CBS", then "Processor Features", and change "Global C state control" to "enabled" (or "disabled", but this uses more power). "Auto" is the setting that seems to cause the lockups. Once changed to enabled or disabled, the system is stable for >12 hours.

In more detail: with C state control "disabled", I get the following power draw:

Power in  = 300 watts
Power out = 276 watts
CPU       = 168 watts
Mem       = 112 watts

root@amd-megaserver:/home/nb # sysctl dev.cpu | grep cx
dev.cpu.255.cx_method: C1/hlt
dev.cpu.255.cx_usage_counters: 2525
dev.cpu.255.cx_usage: 100.00% last 713773us
dev.cpu.255.cx_lowest: C1
dev.cpu.255.cx_supported: C1/1/0

However, when running with C state control "enabled", it's much more reasonable and in line with other operating systems' idle power consumption:

Power in  = 120 watts
Power out = 108 watts
CPU       = 88 watts
Mem       = 16 watts
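
The difference between the two settings can be worked out from the readings above. The helper below is only an illustrative sketch (the name `power_saving` is made up for this example); it takes a label, the "disabled" reading, and the "enabled" reading in watts.

```sh
#!/bin/sh
# power_saving LABEL BEFORE_WATTS AFTER_WATTS
# Print the absolute and relative drop between two power readings.
power_saving() {
    awk -v label="$1" -v before="$2" -v after="$3" 'BEGIN {
        saved = before - after
        printf "%s: %d W -> %d W (saves %d W, %.1f%%)\n",
            label, before, after, saved, 100 * saved / before
    }'
}

power_saving "Power in" 300 120   # Power in: 300 W -> 120 W (saves 180 W, 60.0%)
power_saving "CPU" 168 88
power_saving "Mem" 112 16
```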

nb@amd-megaserver:~ $ sysctl dev.cpu | grep cx
dev.cpu.255.cx_method: C1/hlt C2/io
dev.cpu.255.cx_usage_counters: 1752 0
dev.cpu.255.cx_usage: 100.00% 0.00% last 395716us
dev.cpu.255.cx_lowest: C1
dev.cpu.255.cx_supported: C1/1/1 C2/2/400
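
Note that in the output above C2 is offered but cx_lowest is still C1, and the C2 usage counter is 0. With 256 logical CPUs it is easier to check this for every CPU with a small filter. The following is a rough sketch that assumes the `dev.cpu.N.cx_lowest` and `dev.cpu.N.cx_supported` line format shown above; the function name `cx_summary` is hypothetical.

```sh
#!/bin/sh
# cx_summary: read `sysctl dev.cpu` output on stdin and print, per CPU,
# the configured lowest C-state and the deepest C-state it supports.
cx_summary() {
    awk -F': ' '
        # e.g. "dev.cpu.255.cx_lowest: C1"
        /\.cx_lowest:/ {
            split($1, p, ".")
            lowest[p[3]] = $2
        }
        # e.g. "dev.cpu.255.cx_supported: C1/1/1 C2/2/400"
        /\.cx_supported:/ {
            split($1, p, ".")
            n = split($2, s, " ")      # last entry is the deepest state
            split(s[n], d, "/")
            deepest[p[3]] = d[1]
        }
        END {
            for (cpu in deepest)
                printf "cpu %s: cx_lowest=%s deepest_supported=%s\n",
                    cpu, lowest[cpu], deepest[cpu]
        }
    '
}
```

Usage would be `sysctl dev.cpu | cx_summary`. The per-CPU output order is not sorted, so pipe it through `sort` if needed.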

I'm unsure how to properly disposition this: while this does provide a usable workaround, it feels like something is not quite right on the OS side, given there are no lockup issues with Linux or Windows.

Leaving this open for the FreeBSD project to disposition as they see fit with the information above.