253288 – hwpstate_intel: modern ThinkPads wedge under any kind of load or during boot

Summary: hwpstate_intel: modern ThinkPads wedge under any kind of load or during boot

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	13.0-STABLE
Hardware:	amd64 Any

Importance:	--- Affects Many People
Assignee:	Tom Jones

URL:
Keywords:	performance

Duplicates (2):	248659 253358
Depends on:
Blocks:

Reported:	2021-02-06 10:33 UTC by Eirik Oeverby
Modified:	2023-11-20 10:49 UTC
CC List:	31 users

See Also:	255745 254915 267187 https://reviews.freebsd.org/D36699

Flags:	grahamperrin: mfc-stable13?

Attachments
simple debug patch (1.31 KB, patch) 2022-10-20 22:06 UTC, Luís Henriques	no flags

Description Eirik Oeverby 2021-02-06 10:33:01 UTC

The only workaround is to add
 hint.hwpstate_intel.0.disabled=1
to /boot/loader.conf, but this makes power consumption go through the roof and generates a lot of dmesg warnings about overheating, suggesting the computer be switched off.

BIOS powersaving settings have no bearing on this problem.
The problem has existed since early 2020, and got significantly worse in the Jan/Feb timeframe. However sporadic hangs with kernels from late 2019 have also been observed.

Kernel/world version stable/13 106efdb060ae523a88caf5ddc3516500cf5b1d64

Comment 1 Oleksandr Kryvulia 2021-02-06 14:59:50 UTC

Same on Thinkpad T490 with CURRENT.

Comment 2 Conrad Meyer freebsd_committer 2021-02-09 07:53:35 UTC

*** Bug 253358 has been marked as a duplicate of this bug. ***

Comment 3 Sreehari S 2021-02-09 08:46:19 UTC

This bug has also been affecting me on the ThinkPad X1 Carbon Gen7. From digging around the codebase (I'm just a noob), it seems from my understanding that it's something introduced in sys/x86/cpufreq/hwpstate_intel.c, which is quite recent (commits starting on January 22, 2020, not merged into 12.x, and few enough commits for me to keep track of: git log sys/x86/cpufreq/hwpstate_intel.c). Once I get time I could try compiling each of the significant revisions starting with the one just before the introduction of the Speed Shift stuff and seeing which one breaks first. Could be wrong about all of this though.

Comment 4 Eirik Oeverby 2021-02-09 11:54:40 UTC

(In reply to Sreehari S from comment #3)
That's around the same time I started seeing these issues. It's been a while since I was testing it aggressively; I thought there was an open bug about this already but I was mistaken - so no wonder nothing changed :-/

Thank you for your efforts, I currently don't have a chance to test this as the computer in question is unavailable for the time being.

Comment 5 Sreehari S 2021-02-10 02:08:00 UTC

(In reply to Eirik Oeverby from comment #4)
No problem, I also have an incentive to help get FreeBSD 13.0 fully working on my hardware and ~~procrastination on my actual responsibilities~~. So today I've successfully built and booted a 13.0-CURRENT tree from January 22, 2020 (git commit 7ec5e1c4cd74b66192e5a34c082dc580e587f77b), which is the commit just before what I suspect may be one of the breaking commits (git 4577cf3744b98d0fa7cea80c75079c3cf5155471, and this is the one that introduces hwpstate_intel.c and friends in the first place). After installing that world/kernel, I've thoroughly abused the machine (compiling software, installing stuff, graphics stuff, etc.) and I haven't been able to get it to crash yet. I guess next I'll try installing the sys from all the possible breaking commits I've identified (there are very few commits that touch hwpstate_intel in the first place, so I'm in luck). All this will tell me is which commit broke everything on Lenovo machines, so hopefully that can be used to narrow down the exact change that broke. After all this, I'd hope that the fixing patch would make it into 13.0-RELEASE...

Comment 6 Sreehari S 2021-02-10 04:26:36 UTC

UPDATE: for everyone it concerns: I've proven beyond reasonable doubt that the first broken commit is 4577cf3744b98d0fa7cea80c75079c3cf5155471. I tested the commit just before it with no issues at all. Then I did make {build,install}kernel at that commit, rebooted, and tried building luajit for neovim over ssh: the system hung in the middle of the build and my ssh connection died too. So for anyone smart enough, please take a look at that commit in particular, as I'm almost certain that's the one that introduces the regression.

Comment 7 Sreehari S 2021-02-10 04:37:27 UTC

(In reply to Sreehari S from comment #6)
At this point it can only really be in one of:
sys/sys/cpu.h
sys/x86/cpufreq/est.c

and most probably:
sys/kern/kern_cpu.c
sys/x86/cpufreq/hwpstate_intel.c
sys/x86/cpufreq/hwpstate_intel_internal.h

Comment 8 Eirik Oeverby 2021-02-10 10:11:22 UTC

(In reply to Sreehari S from comment #5)
I don't have to abuse it all to have it fall over:
- boot up without powerd/powerdxx
- fire up X
- log into kde/plasma
- try to open some preferences panel, start a browser, whatever
- system freezes and a split second later mouse pointer stops moving

Comment 9 Sreehari S 2021-02-10 17:41:29 UTC

(In reply to Eirik Oeverby from comment #8)
Yeah I was just trying to prove beyond reasonable doubt that particular revision *wasn't* flawed in any way. When I tried out the next revision, it would cause a full system hang rather fast, like all I needed to do was log in and try to install vim via pkg or something.

Comment 10 Sreehari S 2021-02-10 17:52:35 UTC

(In reply to Sreehari S from comment #9)
and this was with powerd enabled, and in a tty console (no GUI). The amount of time the system lasts before dying varies, but it's basically guaranteed to die fairly soon.

Comment 11 Ed Maste freebsd_committer 2021-02-10 18:00:19 UTC

(In reply to Sreehari S from comment #6)
The identified commit (4577cf3744b98d0fa7cea80c75079c3cf5155471) is the one that added hwpstate, so it's not surprising that it's responsible.

The only immediate suggestion I have is for folks to review changes to the corresponding Linux driver and see if there is some workaround or special case that we're missing.

Comment 12 Sreehari S 2021-02-10 18:08:25 UTC

(In reply to Ed Maste from comment #11)
Yeah, that makes sense. The linux driver is available in their kernel tree at drivers/cpufreq/intel_pstate.c (https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c). The first commit that added HWP in the linux kernel tree was 2f86dc4cddcb21290ca099e1dce2a53533c86e0b from 2014, though I don't think that matters too much. The only thing I can think of that would cause the difference is MSR reading/writing stuff, but I'm no expert on this honestly, and I could be completely wrong for all I know.

Comment 13 Yuri Pankov freebsd_committer 2021-02-11 17:13:07 UTC

Just for the record, I'm not seeing any issues on P51, "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz".

Comment 14 Sreehari S 2021-02-11 17:27:53 UTC

(In reply to Yuri Pankov from comment #13)
Maybe it only affects Intel U processors? That's the only thing I can think of that the affected people's machines have in common and yours doesn't (you've got an HQ). Again, I could be completely wrong.

Comment 15 Sreehari S 2021-02-11 17:30:00 UTC

(In reply to Sreehari S from comment #14)
I've got an i7-8565U and some hw probes:
http://bsd-hardware.info/?probe=77e80759a0
http://bsd-hardware.info/?probe=8d1c80c2cb

Comment 16 Yuri Pankov freebsd_committer 2021-02-11 17:44:52 UTC

(In reply to Sreehari S from comment #14)
I also have Intel NUC7i7BNH featuring "Intel(R) Core(TM) i7-7567U CPU", and I don't remember seeing any issues with it either.  The system does not have any storage device at the moment, but I'll get one shortly and re-check to confirm (or disprove) the U series guess.

Comment 17 Sreehari S 2021-02-11 19:37:09 UTC

(In reply to Yuri Pankov from comment #16)
From what I can tell only Lenovo/ThinkPad users have complained about this bug, though it's worth checking whether it affects all U processors or something. In all likelihood it's some difference in MSR writes, caused by an edge case not covered in FreeBSD that is covered in Linux and others. I tried checking for deadlocks in the new code through printf debugging (that's all I know) and I couldn't find any myself, but I'm no expert so I wouldn't completely rule that out unless someone else can confirm.

Comment 18 Sreehari S 2021-02-11 19:44:28 UTC

(In reply to Sreehari S from comment #17)
I might try to wrap my head around remote gdb or ddb or even trying to find crash dumps if they're created, but I'm not too familiar with all that yet.

Comment 19 Sreehari S 2021-02-13 08:58:23 UTC

https://github.com/erpalma/throttled
https://www.reddit.com/r/thinkpad/comments/870u0a/t480s_linux_throttling_bug/
https://www.notebookcheck.net/Lenovo-admits-ThinkPad-CPU-throttling-problem-when-running-Linux-fix-in-development.435549.0.html
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763144

Could this have anything to do with it? Apparently certain Lenovo laptops running Linux were known to have some kind of CPU throttling issue that could be mitigated with MSR writes. I certainly remember it being an issue under Linux way back in the day, but it might have been fixed by now. Maybe this is something worth looking into, as it affected the exact Lenovo laptops that people are having issues with under FreeBSD 13

Comment 20 Sreehari S 2021-02-13 23:41:53 UTC

(In reply to Sreehari S from comment #19)
Ok I've injected some kernel code to find the cutoff from MSR_IA32_TEMPERATURE_TARGET, and it seems to be 3, which suggests thermal throttling happens at 97 degrees C, instead of the broken 80 degrees C from before. This is probably a result of Lenovo fixing the bug in firmware, so I'm pretty sure this can be ruled out.
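A sketch of the arithmetic behind this reading, assuming the commonly documented layout of MSR_TEMPERATURE_TARGET (MSR 0x1A2): TjMax in bits 23:16 and the TCC activation offset in bits 29:24 (some CPUs implement fewer offset bits). The throttle point is then TjMax minus the offset, so an offset of 3 with a TjMax of 100 gives the 97 degrees C quoted above. The helper name is hypothetical, and reading the MSR itself (for example through cpuctl(4)) is not shown.

```c
#include <stdint.h>

/* Assumed layout of MSR_TEMPERATURE_TARGET (0x1A2):
 * bits 23:16 = TjMax in degrees C, bits 29:24 = TCC activation offset. */
#define TEMP_TARGET_TJMAX(v)   ((unsigned)(((v) >> 16) & 0xff))
#define TEMP_TARGET_OFFSET(v)  ((unsigned)(((v) >> 24) & 0x3f))

/* Hypothetical helper: temperature (degrees C) at which thermal
 * throttling starts, i.e. TjMax minus the TCC activation offset. */
unsigned
tcc_activation_celsius(uint64_t msr_value)
{
	unsigned tjmax = TEMP_TARGET_TJMAX(msr_value);
	unsigned offset = TEMP_TARGET_OFFSET(msr_value);

	/* Guard against a nonsensical offset larger than TjMax. */
	return (offset < tjmax ? tjmax - offset : 0);
}
```

Under these assumptions a raw value of 0x03640000 yields 97, and 0x14640000 (offset 20) yields the 80 degrees C mentioned as the earlier, broken setting.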

Comment 21 Sreehari S 2021-02-14 10:04:56 UTC

https://bugzilla.kernel.org/show_bug.cgi?id=200133

Anything useful here?

Comment 22 Sreehari S 2021-02-26 09:44:39 UTC

According to the Linux commit from 2014 I referenced earlier, they based their implementation on Section 14.4 of Volume 3 of the Intel Software Developer's Manual. On a cursory look this section does indeed describe hardware P-states, though it's a bit over my head at the moment. Maybe I can look into it later. Maybe there's some useful information in here about whatever edge case the FreeBSD code is missing?
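To make the hardware side more concrete: with HWP enabled, the OS no longer picks frequencies itself but writes hints into IA32_HWP_REQUEST (MSR 0x774), and the CPU chooses the operating point. The sketch below packs and unpacks that register following the field layout given in the HWP section of the Intel SDM (minimum, maximum and desired performance, energy performance preference, activity window, package control). It is an illustrative helper with hypothetical names, not code from hwpstate_intel.c or intel_pstate.c, and actually writing the MSR is left out.

```c
#include <stdint.h>

/* Assumed field layout of IA32_HWP_REQUEST (MSR 0x774), per the Intel SDM. */
struct hwp_request {
	uint8_t  min_perf;      /* bits 7:0 */
	uint8_t  max_perf;      /* bits 15:8 */
	uint8_t  desired_perf;  /* bits 23:16, 0 = let hardware choose */
	uint8_t  epp;           /* bits 31:24, 0 = performance, 255 = energy */
	uint16_t activity_win;  /* bits 41:32, only 10 bits wide */
	int      pkg_control;   /* bit 42 */
};

/* Pack the fields into the 64-bit MSR value. */
uint64_t
hwp_request_encode(const struct hwp_request *r)
{
	uint64_t v = 0;

	v |= (uint64_t)r->min_perf;
	v |= (uint64_t)r->max_perf << 8;
	v |= (uint64_t)r->desired_perf << 16;
	v |= (uint64_t)r->epp << 24;
	v |= (uint64_t)(r->activity_win & 0x3ff) << 32;
	v |= (uint64_t)(r->pkg_control ? 1 : 0) << 42;
	return (v);
}

/* Unpack a 64-bit MSR value into its fields. */
void
hwp_request_decode(uint64_t v, struct hwp_request *r)
{
	r->min_perf = (uint8_t)(v & 0xff);
	r->max_perf = (uint8_t)((v >> 8) & 0xff);
	r->desired_perf = (uint8_t)((v >> 16) & 0xff);
	r->epp = (uint8_t)((v >> 24) & 0xff);
	r->activity_win = (uint16_t)((v >> 32) & 0x3ff);
	r->pkg_control = (int)((v >> 42) & 1);
}
```

For example, min 8, max 40, EPP 128 and everything else zero encodes to 0x80002808.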

Comment 23 Sergei Masharov 2021-03-08 11:41:48 UTC

(In reply to Sreehari S from comment #12)
I think this issue is certainly related to CPU frequency, because dev.cpu.0.freq_levels and dev.cpu.0.freq look very different than in versions before 13, and than in 13 with hint.hwpstate_intel.0.disabled=1.

Details are in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659

In my case the system sometimes hangs even during kernel boot; the last console messages are about USB devices.

Comment 24 Marco 2021-03-08 13:32:10 UTC

seems like a duplicate (still unresolved) of: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659

Comment 25 Patrik Jeppsson 2021-04-25 10:28:18 UTC

I just wanted to add that my Lenovo Thinkpad X1 Carbon Gen 7 is also affected by this bug. The CPU is Intel i7 8665U.

Comment 26 Stéphane D'Alu 2021-04-25 10:58:59 UTC

That's not limited to ThinkPad Carbon X1, mine is a ThinkPad T490 with i7-8565U CPU

Comment 27 rkoberman 2021-04-25 18:52:46 UTC

(In reply to Stéphane D'Alu from comment #26)
Looking at both tickets on this issue, the thing that jumps out at me is that it appears that only Lenovo systems are impacted. Lots of different ones have been reported, but no Dell or HP systems. No Asus. It is clearly something that ONLY impacts Lenovos, and it looks like it can be further constrained to ThinkPads. I find no references to IdeaPads or other Lenovo lines. Last I knew, development of ThinkPads was still in the US at the former IBM facility. Looks like something unique to those systems, which is clearly distinct from other laptops. I just can't begin to guess what.

This is WAY beyond anything I can troubleshoot, but I'm more than willing to help test. My L15 has been a real pain, unlike every other ThinkPad I've used (and that is going back to at least 1995). I suspect that this will require at least one and maybe two very top-line FreeBSD folks to track down. kib or jhb, perhaps? Now that 13 is out the door, there will only be more reports, and I'd really like P-States.

Comment 28 Ulrich Spörlein freebsd_committer 2021-05-01 09:10:34 UTC

Upgraded to 13.x and I see the same hang during boot with hwpstate. The fans start going full blast for 30s and then throttle down again. Only reboot works at that stage.

Shouldn't the broken commit be reverted? Throttling was working fine with 12.x ...

This is with a i7-8565U in a Thinkpad T490.

Comment 29 rkoberman 2021-05-01 21:07:32 UTC

(In reply to Ulrich Spörlein from comment #28)
The commit is the one that enables P-States and it seems to work fine on all but Lenovo ThinkPads. Nothing more can be done until someone with a lot more ACPI and kernel knowledge than I have figures it out. Until then, the only "solution" is to disable P-State support by adding hint.hwpstate_intel.0.disabled=1 to /boot/loader.conf. Since P-State support was not present before 13.0, it leaves you no worse off than you were on older versions of FreeBSD.

Reverting the commit would simply turn off P-State support for everyone, and it is a valuable power management capability.

Comment 30 Ulrich Spörlein freebsd_committer 2021-05-02 10:57:28 UTC

I'm not sure what is worse, removing P-states from every non-Thinkpad owner, or having a release out there that fails to boot on Thinkpads (which are probably the most often used laptops with FreeBSD, maybe??)

Can we quirk/block the P-State support and disable it whenever the ACPI/BIOS/Firmware/whatever is from Lenovo (and/or the model matches "Thinkpad")?

That would allow it working out of the box (but is too late for 13.0-RELEASE).

Comment 31 Yuri Pankov freebsd_committer 2021-05-02 11:02:50 UTC

(In reply to Yuri Pankov from comment #16)
Weird, I thought I replied with my testing, will do now.  No issues on INTEL NUC7i7BN with i7-7567U CPU.

Comment 32 C Barker 2021-05-07 20:32:26 UTC

Adding my Lenovo T490 to this list of troubled machines.

Hardware details:

Lenovo T490 model type 20RY-S06R00
manufactured date July 2020
current BIOS version N2RET22W 1.16
BIOS date 2020-11-11

Purchased from COSTCO in fall 2020.  Removed 256GB nvme M2 card and replaced with Crucial 512 GB nvme M2 card.

Installed FreeBSD 12.x fine and it operated with no trouble.  However, had no Wi-Fi so installed OpenBSD 6.8.  OpenBSD ran fine with no errors and without any trouble.  Wi-Fi and Xorg worked, out of the box.

Stayed on OpenBSD ...until today.  I was ready to move back to FreeBSD since 13 was released and no serious issues reported.  I created an image of FreeBSD 13.0 RELEASE on Sandisk USB stick.  Plugged into T490 and powered up.  All looked like typical FreeBSD installation messages, install screen with red sphere with horns, ... more installation messages ... probe messages .. THEN ... all STOP !

The install hung and stayed hung.  Fans ramped up, warm air from air vents.
Last message displayed was ... this last line ... AS displayed:


hwpstate_intel0: <Intel Speed Shift> on cpu0


No response from keyboard and laptop warms up ... fast.
Powered OFF.


Chuck Barker

Comment 33 C Barker 2021-05-07 20:46:24 UTC

Forgot to share some BIOS details ...

Intel Core i7-10510U
1.80 GHz

16384 MB RAM

Came installed with Windows 10

Hyperthreading             - ON in BIOS
Intel SpeedStep Technology - ON in BIOS set to 'Max Performance'
CPU Power Management - ON



Chuck Barker

Comment 34 Guido Kollerie 2021-05-09 14:48:25 UTC

No problems on my Thinkpad T480 (i5-8250U CPU @ 1.60GHz).

For what it is worth, I do have devcpu-data-1.38 installed and the following in my /boot/loader.conf:

cpuctl_load="YES"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"

dmesg say:

CPU microcode: updated from 0xb4 to 0xe0

Comment 35 Guido Kollerie 2021-05-09 16:00:03 UTC

(In reply to Guido Kollerie from comment #34)

It turns out devcpu-data does not have anything to do with the T480 booting successfully. Setting:

    cpu_microcode_load="NO" (from "YES")

still boots the T480 fine. And I still see the message:

    CPU microcode: updated from 0xb4 to 0xe0

I did notice that powerd++ complained with the message:

    powerd++: (EDRIVER) frequency control driver not supported: hwpstate_intel0

So I disabled powerd++ and enabled powerd in /etc/rc.conf instead to see if that triggered the problem so many Thinkpad owners are experiencing, but no, the T480 still boots fine.

UEFI BIOS version: N24ET51W (1.26)
UEFI BIOS date: 2019-08-30
Machine type model: 20L5CTO1WW

Comment 36 Dries Michiels freebsd_committer 2021-05-09 16:57:10 UTC

Guido, can you try uninstalling the package (devcpu-data), it seems that the script is still being run given your output.

Comment 37 Guido Kollerie 2021-05-09 18:03:59 UTC

(In reply to Dries Michiels from comment #36)

Forgot to clear the kernel buffer (dmesg -c), hence the message was from a previous boot. Anyway, having cleared the kernel buffer and uninstalled devcpu-data, the next reboot did NOT have the microcode update message anymore.

But even without the microcode update I am able to boot just fine. Running KDE Plasma I generated some minor load by compiling a bit of Rust code (orjson lib) while at the same time compiling NumPy/pandas (= lots of C code): no system freezes.

I guess hwpstate_intel just works on the Thinkpad T480.

% sysctl -a | grep dev.hwpstate_intel                                                                                                                           
dev.hwpstate_intel.3.epp: 50
dev.hwpstate_intel.3.%parent: cpu3
dev.hwpstate_intel.3.%pnpinfo: 
dev.hwpstate_intel.3.%location: 
dev.hwpstate_intel.3.%driver: hwpstate_intel
dev.hwpstate_intel.3.%desc: Intel Speed Shift
dev.hwpstate_intel.2.epp: 50
dev.hwpstate_intel.2.%parent: cpu2
dev.hwpstate_intel.2.%pnpinfo: 
dev.hwpstate_intel.2.%location: 
dev.hwpstate_intel.2.%driver: hwpstate_intel
dev.hwpstate_intel.2.%desc: Intel Speed Shift
dev.hwpstate_intel.1.epp: 50
dev.hwpstate_intel.1.%parent: cpu1
dev.hwpstate_intel.1.%pnpinfo: 
dev.hwpstate_intel.1.%location: 
dev.hwpstate_intel.1.%driver: hwpstate_intel
dev.hwpstate_intel.1.%desc: Intel Speed Shift
dev.hwpstate_intel.0.epp: 50
dev.hwpstate_intel.0.%parent: cpu0
dev.hwpstate_intel.0.%pnpinfo: 
dev.hwpstate_intel.0.%location: 
dev.hwpstate_intel.0.%driver: hwpstate_intel
dev.hwpstate_intel.0.%desc: Intel Speed Shift
dev.hwpstate_intel.%parent: 


% dmesg | grep hwpstate_intel
hwpstate_intel0: <Intel Speed Shift> on cpu0
hwpstate_intel1: <Intel Speed Shift> on cpu1
hwpstate_intel2: <Intel Speed Shift> on cpu2
hwpstate_intel3: <Intel Speed Shift> on cpu3
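The epp value of 50 in the sysctl output above is a percentage (0 favours performance, 100 favours energy efficiency), while the hardware's Energy Performance Preference field is 8 bits wide (0 to 255). A minimal sketch of that conversion, assuming a simple linear mapping with round-to-nearest on the way back so that values like 25, 50 and 75 survive a round trip; the function names are hypothetical and the driver's own rounding may differ.

```c
/* Hypothetical: map a 0-100 EPP percentage onto the 8-bit EPP field. */
unsigned
epp_percent_to_raw(unsigned percent)
{
	if (percent > 100)
		percent = 100;
	return (percent * 255 / 100);
}

/* Hypothetical: map an 8-bit EPP value back to a 0-100 percentage,
 * rounding to the nearest integer. */
unsigned
epp_raw_to_percent(unsigned raw)
{
	if (raw > 255)
		raw = 255;
	return ((raw * 100 + 127) / 255);
}
```

With this mapping, 50 becomes the raw value 127 and converts back to 50.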

Comment 38 Marco 2021-05-09 18:06:34 UTC

(In reply to Guido Kollerie from comment #35)

So powerdxx is superseded by the hwpstate_intel(4) driver on systems that support it.

Following taken from https://reviews.freebsd.org/D30004

For more information, including on how to balance performance and energy use, and on how to disable this driver, refer to the man page hwpstate_intel(4).

Note: Users accustomed to using powerd(8) or sysutils/powerdxx will find these utilities have been superseded by the hwpstate_intel(4) driver and no longer work as expected.

So unless you set hint.hwpstate_intel.0.disabled="1" in loader.conf one should expect both powerd and powerdxx to no longer work as expected.


On my X1-Carbon 7th gen (still running stable/13-n245210-3bec9180c9e7) I get this behaviour when using sysutils/devcpu-data (1.38)

/boot/loader.conf :

cpuctl_load="YES"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
hint.p4tcc.0.disabled=1
hint.acpi_throttle.0.disabled=1


/etc/rc.conf :
microcode_update_enable="YES"

dmesg says:

CPU microcode: no matching update found


When manually starting microcode :

service microcode_update start
Updating CPU Microcode...
CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz (2112.12-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x806ec  Family=0x6  Model=0x8e  Stepping=12
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c6fbf<FSGSBASE,TSCADJ,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0xbc000600<MCUOPT,MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0xab<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TSX_CTRL>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
Done.


With hint.hwpstate_intel.0.disabled="1" in loader.conf I am 
using powerdxx with powerdxx_flags="-a min -b min -n min" in rc.conf

I was using powerdxx_flags="-a hiadaptive -b hiadaptive -n hiadaptive" before that but was seeing fairly frequently

kernel: coretemp0: critical temperature detected, suggest system shutdown

So with the device hint for hwpstate disabled my system is using EST

sysctl dev.cpufreq.0.freq_driver
dev.cpufreq.0.freq_driver: est0

Comment 39 Marco 2021-05-10 20:34:36 UTC

Can we please change the importance to 'affects many people' and also change the title ?
It clearly doesn't only affect the 8th gen X1 Carbon.

Comment 40 Tom Weustink 2021-05-21 08:43:52 UTC

Just to add my new work laptop, a Lenovo ThinkPad T14 Gen 1.

Intel Core i7-10510U

Hangs on the same Intel Speed Shift other people have seen.
Disabling it makes the temperature go up to 47C on the zone, and around 40C per core.
Fan is ramped up all the time. Battery lasts for about 4 hours then.

Comment 41 David 2021-08-11 23:04:15 UTC

I'm having the same issue trying to install FreeBSD 13.0-RELEASE on a 4th gen X1 Carbon with Intel Core i7-6600U.

The problem doesn't occur on my ThinkPad P50 with Intel Xeon E3-1505M.

Comment 42 rkoberman 2021-08-11 23:58:54 UTC

Have you disabled P-States? While that is very sub-optimal, it does seem to pretty much fix the lockup problem. I still see lock-ups, but fewer than one per month.

add "hint.hwpstate_intel.0.disabled=1" to /boot/loader.conf. It appears to only show up on Lenovo laptops running 13.0 or newer. (P-State support was not available prior to 13.)

If the problem continues, it is likely a different problem.

Comment 43 David 2021-08-12 01:10:13 UTC

(In reply to rkoberman from comment #42)

Yes, just figured that out now.  Thank you.

To get the 13.0-RELEASE installer to boot, I had to input:

set hint.hwpstate_intel.0.disabled=1
boot

Comment 44 Tom Weustink 2021-08-13 06:51:47 UTC

(In reply to rkoberman from comment #42)

It's not all Lenovo laptops.
In my case it works fine on my 6th gen X1 Carbon, but on my work laptop model T14 Gen 1 it hangs.

Also, to update my own message here with regard to the temps, the laptop is just a hothead.
It's running Windows now (for other reasons) and it's equally hot all the time.

Comment 45 Oclair 2021-09-07 21:08:06 UTC

Just installed 13.0 on a Thinkpad T490, and it would freeze and require a whole system reboot after a few minutes of editing config files.  Setting hint.hwpstate_intel.0.disabled=1 let me install KDE5 and it's no longer having any freezes.

I assume I am not experiencing the best battery life until this is resolved...

Thanks everyone who has looked into this!
OC

Comment 46 Ryan Avella 2021-11-17 21:14:00 UTC

I installed 13.0-RELEASE on a Thinkpad T490, and I saw all of the same symptoms described by others above.  The suggestion of setting hint.hwpstate_intel.0.disabled=1 fixed it.

I noticed one symptom not mentioned that might be specific to just me. I can reliably trigger the behavior by moving my laptop, e.g. repositioning it on the desk, or carrying it from my desk to my armchair. I've heard that Lenovo has motion-based CPU throttling, so maybe that is related?

Comment 47 Timo 2022-01-29 12:53:29 UTC

I have the same bug on a Protectli clone device with a i3-8145U.
I'm on FreeBSD 13.0-STABLE.
I have this device: https://de.aliexpress.com/item/1005002922518905.html

Comment 48 Eirik Oeverby 2022-05-09 18:53:44 UTC

This is still a problem on the latest 14-CURRENT snapshot as of 2022-05-09.

ThinkPad Carbon X1 Gen8 - but affects several other models (Lenovo and otherwise, if I understand the messages on this ticket correctly).

Comment 49 rkoberman 2022-05-09 22:12:53 UTC

No indication of any work on this. I have been following this since it first popped up due to the added support for P-states in 13. 

I am about to open another ticket on other issues I'm seeing with thermal management that are likely related.

Comment 50 Tom Jones freebsd_committer 2022-05-26 08:28:44 UTC

Can someone experiencing this issue try and reproduce from the FreeBSD installer on a usb stick? I would like to debug this, but (afaict) I don't have hardware that triggers this issue.

I tried:

- boot latest snapshot installer
- break to shell
- start powerd
- run `openssl speed -multi $(sysctl -n hw.ncpu)`

From my quick reading of this thread, that seemed like it should be more than enough to hang the system, but interactivity was still fine.

If this can be reproduced from the installer then I can try and borrow laptops to debug on.

I tried to reproduce on an i3 10th Gen NUC NUC10i3FNK
https://dmesgd.nycbug.org/index.cgi?do=view&id=5552

Comment 51 Eirik Oeverby 2022-05-26 08:53:17 UTC

I can confirm this on a ThinkPad Carbon X1 Gen8. 
My test involves 8 instances of dd from /dev/random to /dev/null, but any kind of load will do.
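Since any sustained all-core load seems to be enough, here is a minimal C sketch of an equivalent load generator for anyone without openssl or dd at hand: it forks one spinning process per online CPU for a fixed number of seconds and then reaps them. The function names are hypothetical, and it only generates CPU load; it does nothing specific to hwpstate_intel.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Number of online CPUs, falling back to 1 if sysconf() cannot tell. */
long
online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return (n > 0 ? n : 1);
}

/*
 * Fork up to `nworkers` children that each spin on the CPU until roughly
 * `seconds` seconds have passed, then wait for every child that was
 * started.  Returns the number of children that exited cleanly.
 */
int
run_busy_workers(long nworkers, unsigned seconds)
{
	time_t deadline = time(NULL) + (time_t)seconds;
	long started = 0;
	int ok = 0;

	for (; started < nworkers; started++) {
		pid_t pid = fork();

		if (pid == -1)
			break;          /* stop forking, but still reap below */
		if (pid == 0) {
			volatile unsigned long spin = 0;

			while (time(NULL) < deadline)
				spin++;         /* burn CPU until the deadline */
			_exit(0);
		}
	}
	for (long i = 0; i < started; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			ok++;
	}
	return (ok);
}
```

Calling run_busy_workers(online_cpus(), 60) from a small main() should keep every CPU busy for about a minute.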

Comment 52 Tom Jones freebsd_committer 2022-05-26 09:14:17 UTC

Could you test to see if you can do it from the installer? Can you include the CPU reported by sysctl hw.model?

Comment 53 Eirik Oeverby 2022-05-26 10:28:47 UTC

Yes, this is from the installer. Just boot and choose "live cd".

hw.model: Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz

Comment 54 Tom Jones freebsd_committer 2022-05-27 09:25:14 UTC

From reading the history on the Linux driver, my guess is that this is coming from an interaction between the BIOS or ACPI and the P-state driver.

To summarise the thread so far the hardware in this thread breaks down as:

Computer                CPU             BAD

Thinkpad T490           i7-8565U        yes
X1 Carbon Gen7          i7-8565U        yes
P51                     i7-7820HQ       no
NUC7i7BNH               i7-7567U        no
T490                    i7-10510U       yes
T480                    i5-8250U        no
T14 Gen1                i7-10510U       yes
X1 Carbon Gen4          i7-6600U        yes
Protectli Clone         i3-8145U        yes
Eirik's laptop          i7-10610U       yes

I think these are all laptop BIOSes (the router hardware could be too, if it is a cheap respin). I am trying to borrow a machine on which I can reproduce this issue.

Comment 55 rkoberman 2022-05-27 20:39:46 UTC

I wrote how to capture the data. I should have told you how to look at it. 

I use Wireshark, but just "tcpdump -r file-you-wrote" will print the captured data.

Sorry I left off this rather important detail.

Comment 56 rkoberman 2022-05-27 20:41:33 UTC

(In reply to rkoberman from comment #55)
Sorry. My update was for another ticket.

Comment 57 Marco 2022-06-20 09:20:17 UTC

(In reply to Tom Jones from comment #54)

Just making sure the correct CPU for my X1 Carbon Gen7 is also listed (currently it isn't in your summary):
 
hw.model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz

Comment 58 Ruslan Makhmatkhanov freebsd_committer 2022-07-03 08:29:39 UTC

The same here on x13 Gen1 with i7-10510U. Disabling p-states solves the issue

Comment 59 Jonathan Vasquez 2022-07-23 15:54:37 UTC

This is also happening for me on the following machines but with slightly different levels of severity:

Crashes within a few seconds (even on the 14.1 RELEASE installed) on my Thinkpad X1 Carbon Gen 7 with an Intel i7-10710U.

Thinkpad X260: Intel Core i7-6500U

On the X260, it only happens if I close the lid without the AC power connected. It doesn't matter if sleep on lid close is on or not.

This does not happen on my Batch 6 Framework Laptop with an Intel® Core™ i7-1165G7.

Comment 60 Jonathan Vasquez 2022-07-23 15:55:34 UTC

I also forgot to mention that I have disabled pstates as others have done to workaround this for now.

Comment 61 Jonathan Vasquez 2022-07-23 15:56:26 UTC

Arg typo above.. it should have said 13.1 RELEASE installer.

Comment 62 Tom Jones freebsd_committer 2022-09-24 10:03:55 UTC

Thanks to Eirik Oeverby for providing me with an X1 Carbon to test on, I have a fix that seems to stop the lockup.

I am not sure if the fix is safe to use or not; basically there is a bug in the system firmware when handling thermal interrupts. If we tell the SMM that the OS will handle these, the lockup seems to go away.

Now we aren't handling the interrupts at all and we probably need to. My next step is going to be figuring out what we need for this.

Comment 63 Tom Jones freebsd_committer 2022-09-25 15:29:14 UTC

I think I have a fix, as far as I can tell it should be safe to tell the SMM we are handling CPPC notifications, but then not actually do anything.

This patch does so:
https://reviews.freebsd.org/D36699

I would really appreciate testing and positive or negative results.
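The idea, as described here, is that the OS advertises a capability to the firmware and then simply does nothing when the notification arrives. The sketch below illustrates only that idea and is not taken from D36699. It assumes that bit 12 of the Intel processor-specific _PDC/_OSC capabilities DWORD means "OS handles CPPC notifications natively", and that CPUID.06H:EAX bits 7 and 8 report HWP and HWP notification support; gating the bit on those CPUID flags is an illustrative choice, and the helper name is hypothetical. Check the review itself for the actual change.

```c
#include <stdint.h>

/* Assumed bit positions (see lead-in). */
#define CPUID6_EAX_HWP         (1u << 7)   /* HWP supported */
#define CPUID6_EAX_HWP_NOTIFY  (1u << 8)   /* HWP notifications supported */
#define OSC_CAP_INTR_CPPC      (1u << 12)  /* OS handles CPPC notifications */

/*
 * Hypothetical helper: given the capability DWORD the OS is about to pass
 * to the processor's _PDC/_OSC method and the value of CPUID.06H:EAX,
 * claim CPPC notification handling when the CPU supports HWP notifications.
 * Other capability bits are passed through unchanged.
 */
uint32_t
claim_cppc_notifications(uint32_t caps, uint32_t cpuid6_eax)
{
	if ((cpuid6_eax & CPUID6_EAX_HWP) &&
	    (cpuid6_eax & CPUID6_EAX_HWP_NOTIFY))
		caps |= OSC_CAP_INTR_CPPC;
	return (caps);
}
```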

Comment 64 Eirik Oeverby 2022-09-25 15:35:26 UTC

(In reply to Tom Jones from comment #63)
For the record: I hereby permit you to run as many buildworlds and simultaneous "GPU" stress-tests as you need on that device in order to confirm the laptop does not melt.

If it does melt, I want only pictures and a beer.

/Eirik

Comment 65 Oleksandr Kryvulia 2022-09-25 18:20:18 UTC

Works for me, tested several hours under intensive cpu load.

Comment 66 Ruslan Makhmatkhanov freebsd_committer 2022-09-26 21:42:41 UTC

(In reply to Tom Jones from comment #63)

Works here with your patch and commented hint. Successfully passed buildkernel. Thanks a lot, Tom!

Comment 67 Marco 2022-09-30 23:34:54 UTC

(In reply to Tom Jones from comment #63)

Applied your patch today on stable/13 9168218160ca, successfully built world and kernel, and have been running your openssl speed suggestion for a couple of hours now: no freezes, and during the speed run the system was still responsive.

X1 Carbon 7th Gen
hw.model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz

dmesg | grep hwp                                                  
hwpstate_intel0: <Intel Speed Shift> on cpu0
hwpstate_intel1: <Intel Speed Shift> on cpu1
hwpstate_intel2: <Intel Speed Shift> on cpu2
hwpstate_intel3: <Intel Speed Shift> on cpu3 (I'm running with hyperthreading disabled)

sysctl kern.timecounter.hardware
kern.timecounter.hardware: TSC

Here's hoping you can find out what more is needed for interrupt handling but so far system is stable.

Thanks a lot Tom!

Comment 68 Jonathan Vasquez 2022-09-30 23:59:05 UTC

You all making me want to pull out my X1C7 and do a fresh FreeBSD install to test this ;D. Keep it up!

Comment 69 Eirik Oeverby 2022-10-07 14:21:58 UTC

I can confirm my Carbon X1 Gen8, returned to me without visible signs of scorching or other abuse, is now stable.

Thanks, Tom!

Comment 70 Marco 2022-10-07 22:03:19 UTC

(In reply to Eirik Oeverby from comment #69)

Thanks Eirik for stepping up and lending your X1 gen 8 to Tom to get this issue sorted!
I'm currently rebuilding world and kernel again on stable/13, with the hint still commented out but now using the patch from https://reviews.freebsd.org/D36699?id=111554 (now in accepted state).
And of course again Tom, thanks a lot for the efforts.

Looking forward to the patch officially landing :)

Comment 71 g-freebsd.bugzilla 2022-10-08 09:35:46 UTC

(In reply to Tom Jones from comment #63)
Hello Tom,

I applied the patch one week ago on my T590 / 20N4 with an i7-8565U. No issues!!
Thank you so much!!

Comment 72 rkoberman 2022-10-08 16:09:16 UTC

I installed the patch on my L15 (Lenovo) almost 2 weeks ago and turned P-States back ON. No freezes, and it's not running noticeably hotter than before. I think it also fixed my problem with the CPU slowing to minimum speed (400 MHz) and staying there long after the TZ0 temperature had dropped below 50C, while ignoring attempts to set the frequency via sysctls. Peak temperatures are a bit higher: previously it topped out at 88 or 89C, now it reaches 90 or 91C.

This week I moved to my new T16. It had not shown the problem and P-States were never disabled, so I think Lenovo may have finally fixed their BIOS. In any case, no issues.

Unless I find a problem that forces me to bring it up again, my L15 is down for good, so no further updates on the issue from me. Looks like they are not needed any longer.

Comment 73 commit-hook freebsd_committer

2022-10-10 13:54:46 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=67f2a563bfcad75c16536ca500b06ddc9306dfa0

commit 67f2a563bfcad75c16536ca500b06ddc9306dfa0
Author:     Tom Jones <thj@FreeBSD.org>
AuthorDate: 2022-10-10 13:46:25 +0000
Commit:     Tom Jones <thj@FreeBSD.org>
CommitDate: 2022-10-10 13:53:15 +0000

    acpi: Tell SMM we will handle CPPC notifications

    Buggy SMM implementations can hang while processing CPPC notifications.
    This leads to some laptops (notably Thinkpads) hanging when the
    hwpstate_intel driver is loaded.

    Tell the SMM that we will handle CPPC notifications as described in:

    - Intel® Processor Vendor-Specific ACPI
    - Intel® 64 and IA-32 Architectures Software Developer’s Manual

    CPPC events default to masked (disabled) so while we do not do any
    handling right now this does not seem to lead to any issues.

    This approach was found via this Linux Kernel patch:
    https://lkml.org/lkml/2016/3/17/563

    PR:             253288
    Reviewed by:    imp, jhb
    Sponsored by:   Modirum
    Sponsored by:   Klara, Inc.
    Differential Revision:  https://reviews.freebsd.org/D36699

 sys/dev/acpica/acpi_cpu.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
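The mechanism described in the commit message can be sketched roughly as follows. This is a hypothetical illustration, not the code added to acpi_cpu.c: it assumes the "OS will handle CPPC notifications" flag is bit 12 of the processor capabilities DWORD reported to firmware (the bit the Linux patch linked above sets), and that the flag is only claimed when CPUID leaf 06H EAX bit 8 (HWP notification support, per the Intel SDM) is present and a cppc_notify knob is enabled. The names CAP_CPPC_NOTIFY, CPUID6_EAX_HWP_NOTIFY and build_cpu_caps() are made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical names; the bit positions are assumptions stated above. */
#define CPUID6_EAX_HWP_NOTIFY (1u << 8)  /* CPUID.06H:EAX[8]: HWP notification */
#define CAP_CPPC_NOTIFY       (1u << 12) /* assumed "OS handles CPPC notifications" */

/*
 * Build the capabilities DWORD handed to firmware. base_caps holds whatever
 * other capability bits the driver already advertises.
 */
static uint32_t build_cpu_caps(uint32_t base_caps, uint32_t cpuid6_eax,
                               int cppc_notify)
{
	uint32_t caps = base_caps;

	/*
	 * Claim CPPC notifications for the OS only if the CPU can generate
	 * them and the knob is on. Per the commit message, the SMM then no
	 * longer processes them itself, and because CPPC events default to
	 * masked, the OS does not have to handle anything either.
	 */
	if (cppc_notify != 0 && (cpuid6_eax & CPUID6_EAX_HWP_NOTIFY) != 0)
		caps |= CAP_CPPC_NOTIFY;
	return caps;
}
```

With the EAX value 0x27f7 reported in comment 74, build_cpu_caps(0, 0x27f7, 1) yields 0x1000 while build_cpu_caps(0, 0x27f7, 0) yields 0, so in this model the value of cppc_notify at the moment the capabilities are built decides whether the SMM is told anything at all.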

Comment 74 Luís Henriques 2022-10-20 22:06:58 UTC

Created attachment 237490 [details]
simple debug patch

So, I'm running 13.1 on a Lenovo T490 with the above-mentioned patch applied, and I'm still seeing system freezes.

Here's what I've done, maybe I'm doing something wrong:

  cd /usr/src
  git checkout -b 13.1 -t freebsd/releng/13.1
  git cherry-pick 67f2a563bfcad75c16536ca500b06ddc9306dfa0

At this point, I have the 13.1 release kernel plus the patch. After compiling and booting the kernel, it doesn't take long before the fans start and the system becomes completely unresponsive, and a hard reboot is inevitable.

Now, I've done an experiment with the attached debug patch, and here's what I see in dmesg:

dmesg|grep DEBUG
cpu0: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu1: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu2: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu3: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu3: ==> DEBUG init cppc_notify: 1

So, 'cppc_notify' is only set to '1' _after_ acpi_cpu_attach() is executed. Although I have no idea what this really means, it doesn't look correct to me. Or does this mean my hardware isn't really affected by this bug and I'm hitting some other bug? Any hints?

Another experiment I did was to set 'cppc_notify' to '1' in the variable declaration (dmesg then shows 4 "==> DEBUG: OK" messages instead). The system doesn't seem to crash with this patch, but maybe I'm just doing some harm to my hardware.
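The debug line can be checked by hand. The attached debug patch is not shown here, so this assumes it prints eax as CPUID leaf 06H EAX, mask as the bit being tested, and res as eax & mask. Since 0x27f7 & 0x100 = 0x100 = 256, the output matches that reading, and bit 8 is HWP_Notification in the Intel SDM. The sketch below (the table and decode_hwp() are hypothetical helpers) lists the HWP-related bits of that EAX value. On this T490 the CPU appears to report HWP, notification, activity window and EPP support, so the "cppc_notify: 0" value seen during attach, rather than missing hardware support, looks like the reason the capability was not claimed. That reading is consistent with the later commit title quoted in comment 79, "acpi: Create cppc_notify sysctl before it is checked".

```c
#include <stdint.h>
#include <stdio.h>

/* One CPUID.06H:EAX feature bit (bit positions from the Intel SDM). */
struct cpuid6_bit {
	unsigned int bit;
	const char *name;
};

/* The HWP-related bits of CPUID leaf 06H, EAX. */
static const struct cpuid6_bit hwp_bits[] = {
	{ 7, "HWP" },
	{ 8, "HWP_Notification" },
	{ 9, "HWP_Activity_Window" },
	{ 10, "HWP_Energy_Performance_Preference" },
	{ 11, "HWP_Package_Level_Request" },
};

/* Print each HWP feature bit of eax and return how many are set. */
static int decode_hwp(uint32_t eax)
{
	int set = 0;

	for (size_t i = 0; i < sizeof(hwp_bits) / sizeof(hwp_bits[0]); i++) {
		int on = (int)((eax >> hwp_bits[i].bit) & 1u);

		printf("%-34s %s\n", hwp_bits[i].name, on ? "yes" : "no");
		set += on;
	}
	return set;
}
```

Called as decode_hwp(0x27f7), it prints "yes" for the first four entries and "no" for HWP_Package_Level_Request, and returns 4.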

Comment 75 Graham Perrin freebsd_committer

2022-10-21 00:34:04 UTC

(In reply to commit-hook from comment #73)

Triage: merge to stable/13?

From comment #3 and others, I assume not to stable/12.

Comment 76 Tom Jones freebsd_committer

2022-10-21 16:05:04 UTC

Thanks henrix, I have created https://reviews.freebsd.org/D37081

Comment 77 commit-hook freebsd_committer

2022-12-08 20:04:03 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=308d3d6be6da1df4a47e641b5e0cedeccea7b09f

commit 308d3d6be6da1df4a47e641b5e0cedeccea7b09f
Author:     Tom Jones <thj@FreeBSD.org>
AuthorDate: 2022-10-10 13:46:25 +0000
Commit:     Tom Jones <thj@FreeBSD.org>
CommitDate: 2022-12-08 20:02:39 +0000

    acpi: Tell SMM we will handle CPPC notifications

    Buggy SMM implementations can hang while processing CPPC notifications.
    This leads to some laptops (notably Thinkpads) hanging when the
    hwpstate_intel driver is loaded.

    Tell the SMM that we will handle CPPC notifications as described in:

    - Intel® Processor Vendor-Specific ACPI
    - Intel® 64 and IA-32 Architectures Software Developer’s Manual

    CPPC events default to masked (disabled) so while we do not do any
    handling right now this does not seem to lead to any issues.

    This approach was found via this Linux Kernel patch:
    https://lkml.org/lkml/2016/3/17/563

    PR:             253288
    Reviewed by:    imp, jhb
    Sponsored by:   Modirum
    Sponsored by:   Klara, Inc.
    Differential Revision:  https://reviews.freebsd.org/D36699

    (cherry picked from commit 67f2a563bfcad75c16536ca500b06ddc9306dfa0)
    (cherry picked from commit eee0f7aea42564fe005c74f004d63f8cc170ef59)
    (cherry picked from commit 15bd2f366d3e878f5a8bc1628368d59ef318af5f)

 sys/dev/acpica/acpi_cpu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comment 78 Marco 2022-12-09 09:19:35 UTC

(In reply to commit-hook from comment #77)

Officially running this on stable/13-n253250-308d3d6be6da for the last 20 minutes on my Carbon X1 gen 7, with machdep.hyperthreading_allowed=0 set in loader.conf(5)

[~] sysctl hw.model               
hw.model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz

[~] uname -aKU
FreeBSD harbinger.fritz.box 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n253250-308d3d6be6da: Fri Dec  9 00:25:51 UTC 2022     root@harbinger.fritz.box:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 1301510 1301510

[~] sysctl hw.acpi.cpu.cx_lowest 
hw.acpi.cpu.cx_lowest: C8
[~] sysctl hw.acpi.cpu.cppc_notify
hw.acpi.cpu.cppc_notify: 1

[~] sysctl dev.hwpstate_intel.               
dev.hwpstate_intel.3.epp: 50
dev.hwpstate_intel.3.%parent: cpu3
dev.hwpstate_intel.3.%pnpinfo: 
dev.hwpstate_intel.3.%location: 
dev.hwpstate_intel.3.%driver: hwpstate_intel
dev.hwpstate_intel.3.%desc: Intel Speed Shift
dev.hwpstate_intel.2.epp: 50
dev.hwpstate_intel.2.%parent: cpu2
dev.hwpstate_intel.2.%pnpinfo: 
dev.hwpstate_intel.2.%location: 
dev.hwpstate_intel.2.%driver: hwpstate_intel
dev.hwpstate_intel.2.%desc: Intel Speed Shift
dev.hwpstate_intel.1.epp: 50
dev.hwpstate_intel.1.%parent: cpu1
dev.hwpstate_intel.1.%pnpinfo: 
dev.hwpstate_intel.1.%location: 
dev.hwpstate_intel.1.%driver: hwpstate_intel
dev.hwpstate_intel.1.%desc: Intel Speed Shift
dev.hwpstate_intel.0.epp: 50
dev.hwpstate_intel.0.%parent: cpu0
dev.hwpstate_intel.0.%pnpinfo: 
dev.hwpstate_intel.0.%location: 
dev.hwpstate_intel.0.%driver: hwpstate_intel
dev.hwpstate_intel.0.%desc: Intel Speed Shift
dev.hwpstate_intel.%parent:

Thanks for the MFC to stable/13

Comment 79 John 2023-02-13 19:23:33 UTC

Is there a timeline for an MFC of the second commit, "acpi: Create cppc_notify sysctl before it is checked"? If so, will it end up in releng/13.2?

Comment 80 Tom Jones freebsd_committer

2023-02-13 20:41:38 UTC

All three commits were MFC'd together

Comment 81 Dries Michiels freebsd_committer

2023-03-10 21:16:50 UTC

*** Bug 248659 has been marked as a duplicate of this bug. ***

Comment 82 Dries Michiels freebsd_committer

2023-11-20 10:49:48 UTC

This bug has been fixed.
