Bug 253288 - hwpstate_intel: modern ThinkPads wedge under any kind of load or during boot
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
Duplicates: 253358
Reported: 2021-02-06 10:33 UTC by Eirik Oeverby
Modified: 2021-07-26 03:45 UTC
Description Eirik Oeverby 2021-02-06 10:33:01 UTC
The only workaround is to add
 hint.hwpstate_intel.0.disabled=1
to /boot/loader.conf, but this makes power consumption go through the roof and generates a lot of dmesg warnings about overheating, suggesting the computer be switched off.

BIOS powersaving settings have no bearing on this problem.
The problem has existed since early 2020, and got significantly worse in the Jan/Feb timeframe. However, sporadic hangs with kernels from late 2019 have also been observed.

Kernel/world version stable/13 106efdb060ae523a88caf5ddc3516500cf5b1d64
Comment 1 Oleksandr Kryvulia 2021-02-06 14:59:50 UTC
Same on Thinkpad T490 with CURRENT.
Comment 2 Conrad Meyer freebsd_committer 2021-02-09 07:53:35 UTC
*** Bug 253358 has been marked as a duplicate of this bug. ***
Comment 3 Sreehari S 2021-02-09 08:46:19 UTC
This bug has also been affecting me on the ThinkPad X1 Carbon Gen7. From digging around the codebase (I'm just a noob), it seems from my understanding that it's something introduced in sys/x86/cpufreq/hwpstate_intel.c, which is quite recent (commits starting on January 22, 2020, not merged into 12.x, and few enough commits for me to keep track of: git log sys/x86/cpufreq/hwpstate_intel.c). Once I get time I could try compiling each of the significant revisions starting with the one just before the introduction of the Speed Shift stuff and seeing which one breaks first. Could be wrong about all of this though.
Comment 4 Eirik Oeverby 2021-02-09 11:54:40 UTC
(In reply to Sreehari S from comment #3)
That's around the same time I started seeing these issues. It's been a while since I was testing it aggressively; I thought there was an open bug about this already but I was mistaken - so no wonder nothing changed :-/

Thank you for your efforts, I currently don't have a chance to test this as the computer in question is unavailable for the time being.
Comment 5 Sreehari S 2021-02-10 02:08:00 UTC
(In reply to Eirik Oeverby from comment #4)
No problem, I also have an incentive to help get FreeBSD 13.0 fully working on my hardware and ~~procrastination on my actual responsibilities~~. So today I've successfully built and booted a 13.0-CURRENT tree from January 22, 2020 (git commit 7ec5e1c4cd74b66192e5a34c082dc580e587f77b), which is the commit just before what I suspect may be one of the breaking commits (git 4577cf3744b98d0fa7cea80c75079c3cf5155471, which is the one that introduces hwpstate_intel.c and friends in the first place). After installing that world/kernel, I thoroughly abused the machine (compiling software, installing stuff, graphics stuff, etc.) and I have not been able to get it to crash yet. I guess next I'll try installing the sys from all the possible breaking commits I've identified (there are very few commits that touch hwpstate_intel in the first place, so I'm in luck). All this will tell me is which commit broke everything on Lenovo machines, so hopefully that can be used to narrow down the exact change that broke things. After all this, I'd hope that the fixing patch would make it into 13.0-RELEASE...
Comment 6 Sreehari S 2021-02-10 04:26:36 UTC
UPDATE for everyone concerned: I've proven beyond reasonable doubt that the first broken commit is 4577cf3744b98d0fa7cea80c75079c3cf5155471. I tested the commit just before it with no issues at all. Then I ran make {build,install}kernel, rebooted, and tried building luajit for neovim over ssh; the system hung in the middle of the build and my ssh connection died. So for anyone smart enough, please take a look at that commit in particular, as I'm almost certain that's the one that introduces the regression.
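The same good/bad testing extends to a longer list of candidate commits. Below is a minimal sketch of that search in Python, assuming the commits are ordered oldest to newest, the newest one is known to hang, and every commit after the first bad one also hangs. The names first_bad and is_bad and the placeholder commit names are hypothetical. Since the hang is intermittent, is_bad would need to stress the machine long enough to be trusted.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits are ordered oldest to newest, the last one is bad, and
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1      # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                  # first bad commit is at mid or earlier
        else:
            lo = mid + 1              # first bad commit is after mid
    return commits[lo]


# Placeholder names, not real hashes; "c" stands in for the first hanging build.
history = ["a", "b", "c", "d", "e"]
print(first_bad(history, lambda name: name >= "c"))  # c
```

With only a handful of commits touching hwpstate_intel.c, testing them one by one is just as practical; the halving only pays off on longer histories.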
Comment 7 Sreehari S 2021-02-10 04:37:27 UTC
(In reply to Sreehari S from comment #6)
At this point it can only really be in one of:
sys/sys/cpu.h
sys/x86/cpufreq/est.c

and most probably:
sys/kern/kern_cpu.c
sys/x86/cpufreq/hwpstate_intel.c
sys/x86/cpufreq/hwpstate_intel_internal.h
Comment 8 Eirik Oeverby 2021-02-10 10:11:22 UTC
(In reply to Sreehari S from comment #5)
I don't have to abuse it at all to have it fall over:
- boot up without powerd/powerdxx
- fire up X
- log into kde/plasma
- try to open some preferences panel, start a browser, whatever
- system freezes and a split second later mouse pointer stops moving
Comment 9 Sreehari S 2021-02-10 17:41:29 UTC
(In reply to Eirik Oeverby from comment #8)
Yeah I was just trying to prove beyond reasonable doubt that particular revision *wasn't* flawed in any way. When I tried out the next revision, it would cause a full system hang rather fast, like all I needed to do was log in and try to install vim via pkg or something.
Comment 10 Sreehari S 2021-02-10 17:52:35 UTC
(In reply to Sreehari S from comment #9)
and this was with powerd enabled, and in a tty console (no gui). The amount of time that the system lasts before dying varies, but it's basically guaranteed it will fairly soon
Comment 11 Ed Maste freebsd_committer 2021-02-10 18:00:19 UTC
(In reply to Sreehari S from comment #6)
The identified commit (4577cf3744b98d0fa7cea80c75079c3cf5155471) is the one that added hwpstate, so it's not surprising that it's responsible.

The only immediate suggestion I have is for folks to review changes to the corresponding Linux driver and see if there is some workaround or special case that we're missing.
Comment 12 Sreehari S 2021-02-10 18:08:25 UTC
(In reply to Ed Maste from comment #11)
Yeah, that makes sense. The Linux driver is available in their kernel tree at drivers/cpufreq/intel_pstate.c (https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c). The first commit that added HWP in the Linux kernel tree was 2f86dc4cddcb21290ca099e1dce2a53533c86e0b from 2014, though I don't think that matters too much. The only thing I can think of that would cause the difference is MSR reading/writing stuff, but I'm no expert on this honestly, and I could be completely wrong for all I know.
Comment 13 Yuri Pankov freebsd_committer 2021-02-11 17:13:07 UTC
Just for the record, I'm not seeing any issues on P51, "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz".
Comment 14 Sreehari S 2021-02-11 17:27:53 UTC
(In reply to Yuri Pankov from comment #13)
Maybe it only affects Intel U-series processors? That's the only thing I can think of that the affected people's machines have in common that yours doesn't (you've got an HQ). Again, I could be completely wrong.
Comment 15 Sreehari S 2021-02-11 17:30:00 UTC
(In reply to Sreehari S from comment #14)
I've got an i7-8565U and some hw probes:
http://bsd-hardware.info/?probe=77e80759a0
http://bsd-hardware.info/?probe=8d1c80c2cb
Comment 16 Yuri Pankov freebsd_committer 2021-02-11 17:44:52 UTC
(In reply to Sreehari S from comment #14)
I also have Intel NUC7i7BNH featuring "Intel(R) Core(TM) i7-7567U CPU", and I don't remember seeing any issues with it either.  The system does not have any storage device at the moment, but I'll get one shortly and re-check to confirm (or disprove) the U series guess.
Comment 17 Sreehari S 2021-02-11 19:37:09 UTC
(In reply to Yuri Pankov from comment #16)
From what I can tell only Lenovo/ThinkPad users have complained about this bug, though it's worth checking whether it affects all U processors or something. In all likelihood it could be some difference in MSR writes through some edge case not covered in FreeBSD that is covered in Linux and others. I tried checking for deadlocks in the new code through printf debugging (that's all I know) and I couldn't find any myself, but I'm no expert so I wouldn't completely rule that out unless someone else can confirm.
Comment 18 Sreehari S 2021-02-11 19:44:28 UTC
(In reply to Sreehari S from comment #17)
I might try to wrap my head around remote gdb or ddb or even trying to find crash dumps if they're created, but I'm not too familiar with all that yet.
Comment 19 Sreehari S 2021-02-13 08:58:23 UTC
https://github.com/erpalma/throttled
https://www.reddit.com/r/thinkpad/comments/870u0a/t480s_linux_throttling_bug/
https://www.notebookcheck.net/Lenovo-admits-ThinkPad-CPU-throttling-problem-when-running-Linux-fix-in-development.435549.0.html
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763144

Could this have anything to do with it? Apparently certain Lenovo laptops running Linux were known to have some kind of CPU throttling issue that could be mitigated with MSR writes. I certainly remember it being an issue under Linux way back in the day, but it might have been fixed by now. Maybe this is something worth looking into, as it affected the exact Lenovo laptops that people are having issues with under FreeBSD 13
Comment 20 Sreehari S 2021-02-13 23:41:53 UTC
(In reply to Sreehari S from comment #19)
Ok I've injected some kernel code to find the cutoff from MSR_IA32_TEMPERATURE_TARGET, and it seems to be 3, which suggests thermal throttling happens at 97 degrees C, instead of the broken 80 degrees C from before. This is probably a result of Lenovo fixing the bug in firmware, so I'm pretty sure this can be ruled out.
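For reference, a minimal sketch of the arithmetic behind that reading, written in Python rather than kernel code. It assumes the usual layout of MSR_IA32_TEMPERATURE_TARGET (0x1A2) on recent Intel mobile parts, with the temperature target (TjMax) in bits 23:16 and the TCC activation offset in bits 29:24; the field widths differ between processor generations, so check the Intel SDM for the exact CPU. The raw value below is made up to match an offset of 3 with a TjMax of 100, which is itself an assumption implied by the 97 degrees C figure.

```python
# Assumed layout of MSR_IA32_TEMPERATURE_TARGET (0x1A2); verify against the
# Intel SDM for the specific processor before relying on it.
TJMAX_SHIFT, TJMAX_MASK = 16, 0xFF      # bits 23:16: temperature target (TjMax), deg C
OFFSET_SHIFT, OFFSET_MASK = 24, 0x3F    # bits 29:24: TCC activation offset, deg C


def decode_temperature_target(msr_value: int) -> tuple[int, int, int]:
    """Return (tjmax, offset, throttle_point) in degrees C."""
    tjmax = (msr_value >> TJMAX_SHIFT) & TJMAX_MASK
    offset = (msr_value >> OFFSET_SHIFT) & OFFSET_MASK
    return tjmax, offset, tjmax - offset


# Hypothetical raw value: TjMax = 100, offset = 3 (the offset reported above).
example = (3 << 24) | (100 << 16)
print(decode_temperature_target(example))  # (100, 3, 97)
```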
Comment 21 Sreehari S 2021-02-14 10:04:56 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=200133

Anything useful here?
Comment 22 Sreehari S 2021-02-26 09:44:39 UTC
According to the Linux commit from 2014 I referenced earlier, they got their reference based off Section 14.4 of Volume 3 of the Intel architecture Software Developer Manual. On a cursory look this section does indeed describe hardware P-states, though it's a bit over my head at the moment. Maybe I can look into it later. Maybe there's some useful information for whatever edge case the FreeBSD code is missing in here?
Comment 23 Sergei Masharov 2021-03-08 11:41:48 UTC
(In reply to Sreehari S from comment #12)
I think this issue is certainly related to CPU frequency, because dev.cpu.0.freq_levels and dev.cpu.0.freq look very different from what they show in versions before 13, and in 13 with hint.hwpstate_intel.0.disabled=1.

Details are in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659

In my case the system sometimes hangs even during kernel boot; the last messages on the console are about USB devices.
Comment 24 Marco 2021-03-08 13:32:10 UTC
Seems like a duplicate of the (still unresolved) https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659
Comment 25 Patrik Jeppsson 2021-04-25 10:28:18 UTC
I just wanted to add that my Lenovo Thinkpad X1 Carbon Gen 7 is also affected by this bug. The CPU is Intel i7 8665U.
Comment 26 Stéphane D'Alu 2021-04-25 10:58:59 UTC
That's not limited to the ThinkPad X1 Carbon; mine is a ThinkPad T490 with an i7-8565U CPU.
Comment 27 rkoberman 2021-04-25 18:52:46 UTC
(In reply to Stéphane D'Alu from comment #26)
Looking at both tickets on this issue, the thing that jumps out at me is that it appears that only Lenovo systems are impacted. Lots of different ones have been reported, but no Dell, HP, or Asus systems. It is clearly something that ONLY impacts Lenovos and it looks like it can be further constrained to ThinkPads. I find no references to IdeaPads or other Lenovo lines. Last I knew, development of ThinkPads was still in the US at the former IBM facility. It looks like something unique to those systems, which are clearly distinct from other laptops. I just can't begin to guess what.

This is WAY beyond anything I can troubleshoot, but I'm more than willing to help test. My L15 has been a real pain, unlike every other ThinkPad I've used (and that is going back to at least 1995). I suspect that this will require at least one and maybe two very top-line FreeBSD folks to track down. kib or jhb, perhaps? Now that 13 is out the door, there will only be more reports and I'd really like P-States.
Comment 28 Ulrich Spörlein freebsd_committer 2021-05-01 09:10:34 UTC
Upgraded to 13.x and I see the same hang during boot with hwpstate. The fans start going full blast for 30s and then throttle down again. Only reboot works at that stage.

Shouldn't the broken commit be reverted? Throttling was working fine with 12.x ...

This is with a i7-8565U in a Thinkpad T490.
Comment 29 rkoberman 2021-05-01 21:07:32 UTC
(In reply to Ulrich Spörlein from comment #28)
The commit is the one that enables P-States and it seems to work fine on all but Lenovo ThinkPads. Nothing more can be done until someone with a lot more ACPI and kernel knowledge than I have figures it out. Until then, the only "solution" is to disable P-State support by adding hint.hwpstate_intel.0.disabled=1 to /boot/loader.conf. Since P-State support was not present before 13.0, it leaves you no worse off than you were on older versions of FreeBSD.

Reverting the commit would simply turn off P-State support for everyone and it is a valuable power management capability.
Comment 30 Ulrich Spörlein freebsd_committer 2021-05-02 10:57:28 UTC
I'm not sure what is worse, removing P-states from every non-Thinkpad owner, or having a release out there that fails to boot on Thinkpads (which are probably the most often used laptops with FreeBSD, maybe??)

Can we quirk/block the P-State support and disable it whenever the ACPI/BIOS/Firmware/whatever is from Lenovo (and/or the model matches "Thinkpad")?

That would allow it to work out of the box (but is too late for 13.0-RELEASE).
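The matching rule proposed here can be sketched as follows. This is only an illustration in Python, not driver code: the function names are hypothetical, and the maker and model strings are assumed to come from SMBIOS-style fields whose exact source on a given machine is not established in this thread. The sketch requires both conditions, which is the stricter reading of the "and/or" above.

```python
def should_disable_hwpstate(maker: str, model: str) -> bool:
    """Return True when the machine looks like a Lenovo ThinkPad.

    maker and model are assumed to be SMBIOS-style strings; which fields
    carry them on a given machine is not established in this thread.
    """
    return "lenovo" in maker.lower() and "thinkpad" in model.lower()


def loader_conf_line(maker: str, model: str) -> str:
    """Return the loader.conf workaround line, or an empty string."""
    if should_disable_hwpstate(maker, model):
        return "hint.hwpstate_intel.0.disabled=1"
    return ""


print(loader_conf_line("LENOVO", "ThinkPad T490"))   # hint.hwpstate_intel.0.disabled=1
print(repr(loader_conf_line("Intel", "NUC7i7BNH")))  # ''
```

A match this coarse would also switch the driver off on ThinkPads reported as working, such as the P51 in comment 13, so a real quirk would probably need a narrower key.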
Comment 31 Yuri Pankov freebsd_committer 2021-05-02 11:02:50 UTC
(In reply to Yuri Pankov from comment #16)
Weird, I thought I replied with my testing, will do now.  No issues on INTEL NUC7i7BN with i7-7567U CPU.
Comment 32 Chuck Barker 2021-05-07 20:32:26 UTC
Adding my Lenovo T490 to this list of troubled machines.

Hardware details:

Lenovo T490 model type 20RY-S06R00
manufactured date July 2020
current BIOS version N2RET22W 1.16
BIOS date 2020-11-11

Purchased from COSTCO in fall 2020.  Removed 256GB nvme M2 card and replaced with Crucial 512 GB nvme M2 card.

Installed FreeBSD 12.x fine and it operated with no trouble.  However, had no Wi-Fi so installed OpenBSD 6.8.  OpenBSD ran fine with no errors and without any trouble.  Wi-Fi and Xorg worked, out of the box.

Stayed on OpenBSD ...until today.  I was ready to move back to FreeBSD since 13 was released and no serious issues reported.  I created an image of FreeBSD 13.0 RELEASE on Sandisk USB stick.  Plugged into T490 and powered up.  All looked like typical FreeBSD installation messages, install screen with red sphere with horns, ... more installation messages ... probe messages .. THEN ... all STOP !

The install hung and stayed hung.  Fans ramped up, warm air from air vents.
Last message displayed was ... this last line ... AS displayed:


hwpstate_intel0: <Intel Speed Shift> on cpu0


No response from keyboard and laptop warms up ... fast.
Powered OFF.


Chuck Barker
Comment 33 Chuck Barker 2021-05-07 20:46:24 UTC
Forgot to share some BIOS details ...

Intel Core i7-10510U
1.800 Ghz

16384 MB RAM

Came installed with Windows 10

Hyperthreading             - ON in BIOS
Intel SpeedStep Technology - ON in BIOS set to 'Max Performance'
CPU Power Management - ON



Chuck Barker
Comment 34 Guido Kollerie 2021-05-09 14:48:25 UTC
No problems on my Thinkpad T480 (i5-8250U CPU @ 1.60GHz).

For what it is worth, I do have devcpu-data-1.38 installed and the following in my /boot/loader.conf:

cpuctl_load="YES"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"

dmesg says:

CPU microcode: updated from 0xb4 to 0xe0
Comment 35 Guido Kollerie 2021-05-09 16:00:03 UTC
(In reply to Guido Kollerie from comment #34)

It turns out devcpu-data does not have anything to do with the T480 booting successfully. Setting:

    cpu_microcode_load="NO" (from "YES")

still boots the T480 fine. And I still see the message:

    CPU microcode: updated from 0xb4 to 0xe0

I did notice that powerd++ complained with the message:

    powerd++: (EDRIVER) frequency control driver not supported: hwpstate_intel0

So I disabled powerd++ and enabled powerd in /etc/rc.conf instead to see if that triggered the problem so many Thinkpad owners are experiencing, but no, the T480 still boots fine.

UEFI BIOS version: N24ET51W (1.26)
UEFI BIOS date: 2019-08-30
Machine type model: 20L5CTO1WW
Comment 36 Dries Michiels 2021-05-09 16:57:10 UTC
Guido, can you try uninstalling the package (devcpu-data), it seems that the script is still being run given your output.
Comment 37 Guido Kollerie 2021-05-09 18:03:59 UTC
(In reply to Dries Michiels from comment #36)

Forgot to clear the kernel buffer (dmesg -c), hence the message was from a previous boot. Anyway, having cleared the kernel buffer and uninstalled devcpu-data, the next reboot did NOT have the microcode update message anymore.

But even without the microcode update I am able to boot just fine. Running KDE Plasma, I generated some minor load by compiling a bit of Rust code (orjson lib) while at the same time compiling NumPy/pandas (= lots of C code): no system freezes.

I guess hwpstate_intel just works on the Thinkpad T480.

% sysctl -a | grep dev.hwpstate_intel                                                                                                                           
dev.hwpstate_intel.3.epp: 50
dev.hwpstate_intel.3.%parent: cpu3
dev.hwpstate_intel.3.%pnpinfo: 
dev.hwpstate_intel.3.%location: 
dev.hwpstate_intel.3.%driver: hwpstate_intel
dev.hwpstate_intel.3.%desc: Intel Speed Shift
dev.hwpstate_intel.2.epp: 50
dev.hwpstate_intel.2.%parent: cpu2
dev.hwpstate_intel.2.%pnpinfo: 
dev.hwpstate_intel.2.%location: 
dev.hwpstate_intel.2.%driver: hwpstate_intel
dev.hwpstate_intel.2.%desc: Intel Speed Shift
dev.hwpstate_intel.1.epp: 50
dev.hwpstate_intel.1.%parent: cpu1
dev.hwpstate_intel.1.%pnpinfo: 
dev.hwpstate_intel.1.%location: 
dev.hwpstate_intel.1.%driver: hwpstate_intel
dev.hwpstate_intel.1.%desc: Intel Speed Shift
dev.hwpstate_intel.0.epp: 50
dev.hwpstate_intel.0.%parent: cpu0
dev.hwpstate_intel.0.%pnpinfo: 
dev.hwpstate_intel.0.%location: 
dev.hwpstate_intel.0.%driver: hwpstate_intel
dev.hwpstate_intel.0.%desc: Intel Speed Shift
dev.hwpstate_intel.%parent: 


% dmesg | grep hwpstate_intel
hwpstate_intel0: <Intel Speed Shift> on cpu0
hwpstate_intel1: <Intel Speed Shift> on cpu1
hwpstate_intel2: <Intel Speed Shift> on cpu2
hwpstate_intel3: <Intel Speed Shift> on cpu3
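
For comparing machines, the per-CPU epp values can be pulled out of sysctl output like the listing above. A minimal Python sketch follows, assuming the output format shown in this comment; the function name parse_epp is hypothetical. As I understand the hwpstate_intel(4) man page, epp is the energy/performance preference, ranging from 0 (performance) to 100 (energy efficiency).

```python
import re

# Matches lines such as "dev.hwpstate_intel.3.epp: 50".
EPP_LINE = re.compile(r"^dev\.hwpstate_intel\.(\d+)\.epp:\s*(\d+)\s*$")


def parse_epp(sysctl_output: str) -> dict[int, int]:
    """Map each hwpstate_intel unit number to its epp value."""
    epp = {}
    for line in sysctl_output.splitlines():
        match = EPP_LINE.match(line.strip())
        if match:
            epp[int(match.group(1))] = int(match.group(2))
    return epp


sample = """\
dev.hwpstate_intel.1.epp: 50
dev.hwpstate_intel.1.%parent: cpu1
dev.hwpstate_intel.0.epp: 50
dev.hwpstate_intel.0.%desc: Intel Speed Shift
"""
print(parse_epp(sample))  # {1: 50, 0: 50}
```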
Comment 38 Marco 2021-05-09 18:06:34 UTC
(In reply to Guido Kollerie from comment #35)

So powerdxx is superseded by the hwpstate_intel[4] driver on systems that support it.

The following is taken from https://reviews.freebsd.org/D30004

For more information, including on how to balance performance and energy use, and on how to disable this driver, refer to the man page man:hwpstate_intel[4].

Note: Users accustomed to using man:powerd[8] or package:sysutils/powerdxx[] will find these utilities have been superseded by the man:hwpstate_intel[4] driver and no longer work as expected.



So unless you set hint.hwpstate_intel.0.disabled="1" in loader.conf one should expect both powerd and powerdxx to no longer work as expected.


On my X1-Carbon 7th gen (still running stable/13-n245210-3bec9180c9e7) I get this behaviour when using sysutils/devcpu-data (1.38)

/boot/loader.conf :

cpuctl_load="YES"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
hint.p4tcc.0.disabled=1
hint.acpi_throttle.0.disabled=1


/etc/rc.conf :
microcode_update_enable="YES"

dmesg says:

CPU microcode: no matching update found


When manually starting microcode :

service microcode_update start
Updating CPU Microcode...
CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz (2112.12-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x806ec  Family=0x6  Model=0x8e  Stepping=12
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c6fbf<FSGSBASE,TSCADJ,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0xbc000600<MCUOPT,MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0xab<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TSX_CTRL>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
Done.


With hint.hwpstate_intel.0.disabled="1" in loader.conf I am using powerdxx with powerdxx_flags="-a min -b min -n min" in rc.conf

I was using powerdxx_flags="-a hiadaptive -b hiadaptive -n hiadaptive" before that but was seeing fairly frequently

kernel: coretemp0: critical temperature detected, suggest system shutdown

So with hwpstate_intel disabled via the device hint, my system is using EST

sysctl dev.cpufreq.0.freq_driver
dev.cpufreq.0.freq_driver: est0
Comment 39 Marco 2021-05-10 20:34:36 UTC
Can we please change the importance to 'affects many people' and also change the title ?
It clearly doesn't only affect the 8th gen X1 Carbon.
Comment 40 Tom Weustink 2021-05-21 08:43:52 UTC
Just to add my new work laptop, a Lenovo ThinkPad T14 Gen 1.

Intel Core i7-10510U

Hangs at the same Intel Speed Shift message other people have seen.
Disabling it makes the temperature go up to 47C on the zone, and around 40C per core.
Fan is ramped up all the time. Battery lasts for about 4 hours then.