Bug 253288 - hwpstate_intel: Wedges under any kind of load on ThinkPad Carbon X1 Gen 8
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Duplicates: 253358
Depends on:
Blocks:
 
Reported: 2021-02-06 10:33 UTC by Eirik Oeverby
Modified: 2021-03-22 13:06 UTC
CC List: 12 users

See Also:


Description Eirik Oeverby 2021-02-06 10:33:01 UTC
The only workaround is to add
 hint.hwpstate_intel.0.disabled=1
to /boot/loader.conf, but this makes power consumption go through the roof and generates a lot of dmesg warnings about overheating that suggest switching the computer off.

BIOS powersaving settings have no bearing on this problem.
The problem has existed since early 2020 and got significantly worse in the Jan/Feb timeframe. However, sporadic hangs with kernels from late 2019 have also been observed.

Kernel/world version stable/13 106efdb060ae523a88caf5ddc3516500cf5b1d64
Comment 1 Oleksandr Kryvulia 2021-02-06 14:59:50 UTC
Same on Thinkpad T490 with CURRENT.
Comment 2 Conrad Meyer freebsd_committer 2021-02-09 07:53:35 UTC
*** Bug 253358 has been marked as a duplicate of this bug. ***
Comment 3 Sreehari S 2021-02-09 08:46:19 UTC
This bug has also been affecting me on the ThinkPad X1 Carbon Gen7. From digging around the codebase (I'm just a noob), my understanding is that it was introduced with sys/x86/cpufreq/hwpstate_intel.c, which is quite recent (the commits start on January 22, 2020, are not merged into 12.x, and are few enough for me to keep track of: git log sys/x86/cpufreq/hwpstate_intel.c). Once I get time I could try compiling each of the significant revisions, starting with the one just before the introduction of the Speed Shift stuff, and see which one breaks first. I could be wrong about all of this, though.
Comment 4 Eirik Oeverby 2021-02-09 11:54:40 UTC
(In reply to Sreehari S from comment #3)
That's around the same time I started seeing these issues. It's been a while since I was testing it aggressively; I thought there was an open bug about this already but I was mistaken - so no wonder nothing changed :-/

Thank you for your efforts. I currently don't have a chance to test this, as the computer in question is unavailable for the time being.
Comment 5 Sreehari S 2021-02-10 02:08:00 UTC
(In reply to Eirik Oeverby from comment #4)
No problem, I also have an incentive to help get FreeBSD 13.0 fully working on my hardware and ~~procrastination on my actual responsibilities~~. So today I've successfully built and booted a 13.0-CURRENT tree from January 22, 2020 (git commit 7ec5e1c4cd74b66192e5a34c082dc580e587f77b), which is the commit just before what I suspect may be one of the breaking commits (git 4577cf3744b98d0fa7cea80c75079c3cf5155471, which is the one that introduces hwpstate_intel.c and friends in the first place). After installing that world/kernel, I've thoroughly abused the machine (compiling software, installing stuff, graphics stuff, etc.) and I could not get it to crash yet. I guess next I'll try installing the sys from all the possible breaking commits I've identified (there are very few commits that touch hwpstate_intel in the first place, so I'm in luck). All this will tell me is which commit broke everything on Lenovo machines, so hopefully that can be used to narrow down the exact change that broke it. After all this, I'd hope that the fixing patch would make it into 13.0-RELEASE...
Comment 6 Sreehari S 2021-02-10 04:26:36 UTC
UPDATE for everyone it concerns: I've proven beyond reasonable doubt that the first broken commit is 4577cf3744b98d0fa7cea80c75079c3cf5155471. I tested the commit just before it with no issues at all. Then I did make {build,install}kernel for the suspect commit, rebooted, and tried building luajit for neovim over ssh; the system hung in the middle of the build and my ssh connection died too. So for anyone smart enough, please take a look at that commit in particular, as I'm almost certain that's the one that introduces the regression.
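The search described in the last few comments (one known-good commit, one known-bad commit, test what lies between) is a bisection. Here there were only two candidates, but a minimal Python sketch of the general idea follows. Only the two abbreviated hashes come from this report; the later commit names and the `hangs` stand-in for "build, boot and stress the machine" are hypothetical. `git bisect` automates the same procedure on a real tree.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits are ordered oldest to newest, the first is known good,
    the last is known bad, and everything after the first bad commit is bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history: only the first two (abbreviated) hashes are from this bug.
history = ["7ec5e1c4cd74", "4577cf3744b9", "later-commit-1", "later-commit-2"]


def hangs(commit: str) -> bool:
    # Stand-in for "build, install, reboot, and see whether the machine wedges".
    return commit != "7ec5e1c4cd74"


print(first_bad_commit(history, hangs))
```

Each call to `is_bad` costs a full kernel build and reboot, which is why halving the range matters once the candidate list grows beyond a handful of commits.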
Comment 7 Sreehari S 2021-02-10 04:37:27 UTC
(In reply to Sreehari S from comment #6)
At this point it can only really be in one of:
sys/sys/cpu.h
sys/x86/cpufreq/est.c

and most probably:
sys/kern/kern_cpu.c
sys/x86/cpufreq/hwpstate_intel.c
sys/x86/cpufreq/hwpstate_intel_internal.h
Comment 8 Eirik Oeverby 2021-02-10 10:11:22 UTC
(In reply to Sreehari S from comment #5)
I don't have to abuse it at all to have it fall over:
- boot up without powerd/powerdxx
- fire up X
- log into kde/plasma
- try to open some preferences panel, start a browser, whatever
- system freezes and a split second later mouse pointer stops moving
Comment 9 Sreehari S 2021-02-10 17:41:29 UTC
(In reply to Eirik Oeverby from comment #8)
Yeah I was just trying to prove beyond reasonable doubt that particular revision *wasn't* flawed in any way. When I tried out the next revision, it would cause a full system hang rather fast, like all I needed to do was log in and try to install vim via pkg or something.
Comment 10 Sreehari S 2021-02-10 17:52:35 UTC
(In reply to Sreehari S from comment #9)
And this was with powerd enabled, in a tty console (no GUI). The amount of time the system lasts before dying varies, but it's basically guaranteed to hang fairly soon.
Comment 11 Ed Maste freebsd_committer 2021-02-10 18:00:19 UTC
(In reply to Sreehari S from comment #6)
The identified commit (4577cf3744b98d0fa7cea80c75079c3cf5155471) is the one that added hwpstate, so it's not surprising that it's responsible.

The only immediate suggestion I have is for folks to review changes to the corresponding Linux driver and see if there is some workaround or special case that we're missing.
Comment 12 Sreehari S 2021-02-10 18:08:25 UTC
(In reply to Ed Maste from comment #11)
Yeah, that makes sense. The Linux driver is available in their kernel tree at drivers/cpufreq/intel_pstate.c (https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c). The first commit that added HWP in the Linux kernel tree was 2f86dc4cddcb21290ca099e1dce2a53533c86e0b from 2014, though I don't think that matters too much. The only thing I can think of that would cause the difference is MSR reading/writing stuff, but I'm no expert on this honestly, and I could be completely wrong for all I know.
Comment 13 Yuri Pankov freebsd_committer 2021-02-11 17:13:07 UTC
Just for the record, I'm not seeing any issues on P51, "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz".
Comment 14 Sreehari S 2021-02-11 17:27:53 UTC
(In reply to Yuri Pankov from comment #13)
Maybe it only affects Intel U processors? That's the only thing I can think of that the affected people's machines have in common and yours doesn't (you've got an HQ). Again, I could be completely wrong.
Comment 15 Sreehari S 2021-02-11 17:30:00 UTC
(In reply to Sreehari S from comment #14)
I've got an i7-8565U and some hw probes:
http://bsd-hardware.info/?probe=77e80759a0
http://bsd-hardware.info/?probe=8d1c80c2cb
Comment 16 Yuri Pankov freebsd_committer 2021-02-11 17:44:52 UTC
(In reply to Sreehari S from comment #14)
I also have Intel NUC7i7BNH featuring "Intel(R) Core(TM) i7-7567U CPU", and I don't remember seeing any issues with it either.  The system does not have any storage device at the moment, but I'll get one shortly and re-check to confirm (or disprove) the U series guess.
Comment 17 Sreehari S 2021-02-11 19:37:09 UTC
(In reply to Yuri Pankov from comment #16)
From what I can tell, only Lenovo/ThinkPad users have complained about this bug, though it's worth checking whether it affects all U processors or something. In all likelihood it could be some difference in MSR writes, through some edge case that is covered in Linux and others but not in FreeBSD. I tried checking for deadlocks in the new code through printf debugging (that's all I know) and I couldn't find any myself, but I'm no expert, so I wouldn't completely rule that out unless someone else can confirm.
Comment 18 Sreehari S 2021-02-11 19:44:28 UTC
(In reply to Sreehari S from comment #17)
I might try to wrap my head around remote gdb or ddb or even trying to find crash dumps if they're created, but I'm not too familiar with all that yet.
Comment 19 Sreehari S 2021-02-13 08:58:23 UTC
https://github.com/erpalma/throttled
https://www.reddit.com/r/thinkpad/comments/870u0a/t480s_linux_throttling_bug/
https://www.notebookcheck.net/Lenovo-admits-ThinkPad-CPU-throttling-problem-when-running-Linux-fix-in-development.435549.0.html
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763144

Could this have anything to do with it? Apparently certain Lenovo laptops running Linux were known to have some kind of CPU throttling issue that could be mitigated with MSR writes. I certainly remember it being an issue under Linux way back in the day, but it might have been fixed by now. Maybe this is something worth looking into, as it affected the exact Lenovo laptops that people are having issues with under FreeBSD 13.
Comment 20 Sreehari S 2021-02-13 23:41:53 UTC
(In reply to Sreehari S from comment #19)
Ok I've injected some kernel code to find the cutoff from MSR_IA32_TEMPERATURE_TARGET, and it seems to be 3, which suggests thermal throttling happens at 97 degrees C, instead of the broken 80 degrees C from before. This is probably a result of Lenovo fixing the bug in firmware, so I'm pretty sure this can be ruled out.
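For anyone wanting to repeat that arithmetic, here is a small Python sketch of the decoding. It assumes the commonly documented layout of this MSR (temperature target, i.e. TjMax, in bits 23:16 and the TCC activation offset in bits 29:24; the offset field is narrower on some CPU models, so check the SDM for yours). The raw value below is made up to match the numbers in this comment; the TjMax of 100 is inferred from 97 + 3 and was not read from the machine.

```python
from typing import NamedTuple


class TccInfo(NamedTuple):
    tjmax_c: int     # temperature target (TjMax), degrees C
    offset_c: int    # TCC activation offset, degrees C
    throttle_c: int  # temperature at which throttling starts


def decode_temperature_target(msr_value: int) -> TccInfo:
    """Decode a raw MSR_IA32_TEMPERATURE_TARGET value (assumed bit layout)."""
    tjmax = (msr_value >> 16) & 0xFF   # bits 23:16
    offset = (msr_value >> 24) & 0x3F  # bits 29:24 (assumption, model dependent)
    return TccInfo(tjmax, offset, tjmax - offset)


# Hypothetical raw value: offset 3, TjMax 100.
example = (3 << 24) | (100 << 16)
print(hex(example), decode_temperature_target(example))
```

With the broken firmware described in the links above, the same decoding would presumably have shown an offset of 20 for a throttle point of 80 degrees C.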
Comment 21 Sreehari S 2021-02-14 10:04:56 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=200133

Anything useful here?
Comment 22 Sreehari S 2021-02-26 09:44:39 UTC
According to the Linux commit from 2014 I referenced earlier, they based their implementation on Section 14.4 of Volume 3 of the Intel Architecture Software Developer's Manual. On a cursory look this section does indeed describe hardware P-states, though it's a bit over my head at the moment. Maybe I can look into it later. Maybe there's some useful information in there about whatever edge case the FreeBSD code is missing?
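As a rough orientation for that section, the main knob it describes is the IA32_HWP_REQUEST MSR, whose low 32 bits hold four 8-bit fields. The Python sketch below packs and unpacks them; the field layout is as I understand the SDM (verify against the manual before relying on it), the activity window and package control bits are ignored, and the example numbers are invented, not values read from an affected machine.

```python
from typing import Dict


def pack_hwp_request(minimum: int, maximum: int, desired: int, epp: int) -> int:
    """Build the low 32 bits of an IA32_HWP_REQUEST value (assumed layout)."""
    for name, field in (("minimum", minimum), ("maximum", maximum),
                        ("desired", desired), ("epp", epp)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    return minimum | (maximum << 8) | (desired << 16) | (epp << 24)


def unpack_hwp_request(value: int) -> Dict[str, int]:
    """Split an IA32_HWP_REQUEST value into its four 8-bit fields."""
    return {
        "minimum_performance": value & 0xFF,                     # bits 7:0
        "maximum_performance": (value >> 8) & 0xFF,              # bits 15:8
        "desired_performance": (value >> 16) & 0xFF,             # bits 23:16
        "energy_performance_preference": (value >> 24) & 0xFF,   # bits 31:24
    }


# Invented example: min 4, max 42, desired 0 (let hardware choose), EPP 128.
raw = pack_hwp_request(4, 42, 0, 128)
print(hex(raw), unpack_hwp_request(raw))
```

Comparing the values FreeBSD and Linux actually write to this register on an affected machine might be one way to look for the missing edge case.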
Comment 23 Sergei Masharov 2021-03-08 11:41:48 UTC
(In reply to Sreehari S from comment #12)
I think this issue is certainly related to CPU frequency, because dev.cpu.0.freq_levels and dev.cpu.0.freq look very different from how they look in versions before 13, and in 13 with hint.hwpstate_intel.0.disabled=1.
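To make that comparison concrete, here is a small Python sketch that parses the value of dev.cpu.0.freq_levels (a space-separated list of frequency/power pairs) and diffs two captures. Both sample strings are invented placeholders, not output from an affected machine; paste in your own `sysctl -n dev.cpu.0.freq_levels` output taken with and without the hint.

```python
from typing import List, Tuple


def parse_freq_levels(value: str) -> List[Tuple[int, int]]:
    """Parse 'MHz/mW MHz/mW ...' into a list of (frequency, power) pairs."""
    levels = []
    for entry in value.split():
        freq, power = entry.split("/")
        levels.append((int(freq), int(power)))  # power may be -1 if unknown
    return levels


# Placeholder captures (hypothetical values).
with_hint_disabled = "2001/15000 2000/15000 1800/13000 1600/11000"
with_hwpstate = "2000/-1"

old = {freq for freq, _ in parse_freq_levels(with_hint_disabled)}
new = {freq for freq, _ in parse_freq_levels(with_hwpstate)}
print("only without hwpstate_intel:", sorted(old - new))
print("only with hwpstate_intel:", sorted(new - old))
```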

Details are in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659

In my case the system sometimes hangs even during kernel boot; the last console messages are about USB devices.
Comment 24 Marco 2021-03-08 13:32:10 UTC
Seems like a duplicate of the still-unresolved https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659