Summary: | hwpstate_intel: modern ThinkPads wedge under any kind of load or during boot | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Eirik Oeverby <ltning-freebsd> | ||||
Component: | kern | Assignee: | Tom Jones <thj> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Many People | CC: | ati.sharma+freebsd, chris, dacrackerx64, drd47rinme, driesm, eduardo, emaste, fbsd_bugzilla, g-freebsd.bugzilla, grahamperrin, guido, henrix, hoesglad, jason, jon, mentalbarcode, pi, ps.ports, rashey, rkoberman, rm, rmavella+freebsd, sdalu, serzh, shuriku, sreeharisreedev1, t.claussen, t.weustink, thj, uqs, yuripv | ||||
Priority: | --- | Keywords: | performance | ||||
Version: | 13.0-STABLE | Flags: | grahamperrin: mfc-stable13? | ||||
Hardware: | amd64 | ||||||
OS: | Any | ||||||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255745 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254915 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267187 https://reviews.freebsd.org/D36699 |
||||||
Attachments: |
|
Description
Eirik Oeverby
2021-02-06 10:33:01 UTC
Same on Thinkpad T490 with CURRENT.

*** Bug 253358 has been marked as a duplicate of this bug. ***

This bug has also been affecting me on the ThinkPad X1 Carbon Gen7. From digging around the codebase (I'm just a noob), it seems from my understanding that it's something introduced in sys/x86/cpufreq/hwpstate_intel.c, which is quite recent (commits starting on January 22, 2020, not merged into 12.x, and few enough commits for me to keep track of: git log sys/x86/cpufreq/hwpstate_intel.c). Once I get time I could try compiling each of the significant revisions starting with the one just before the introduction of the Speed Shift stuff and seeing which one breaks first. Could be wrong about all of this though.

(In reply to Sreehari S from comment #3) That's around the same time I started seeing these issues. It's been a while since I was testing it aggressively; I thought there was an open bug about this already but I was mistaken - so no wonder nothing changed :-/ Thank you for your efforts, I currently don't have a chance to test this as the computer in question is unavailable for the time being.

(In reply to Eirik Oeverby from comment #4) No problem, I also have an incentive to help get FreeBSD 13.0 fully working on my hardware and ~~procrastination on my actual responsibilities~~. So today I've successfully built and booted a 13.0-CURRENT tree from January 22, 2020 (git commit 7ec5e1c4cd74b66192e5a34c082dc580e587f77b), which is the commit just before what I suspect may be one of the breaking commits (git 4577cf3744b98d0fa7cea80c75079c3cf5155471, and this is the one that introduces hwpstate_intel.c and friends in the first place). After installing that world/kernel, I've thoroughly abused the machine (compiling software, installing stuff, graphics stuff, etc.) and I could not get it to crash yet.
I guess next I'll try installing the sys from all the possible breaking commits I've identified (there's very few commits that touch hwpstate_intel in the first place, so I'm in luck). All this will tell me is which commit broke everything on Lenovo machines, so hopefully that can be used to narrow down the exact change that broke. After all this, I'd hope that the fixing patch would make it into 13.0-RELEASE...

UPDATE: for everyone it concerns: I've proven beyond reasonable doubt that the first broken commit is 4577cf3744b98d0fa7cea80c75079c3cf5155471. I've tested the commit just before it with no issues at all, then I did make {build,install}kernel and rebooted, then tried building luajit for neovim over ssh, and the system hung in the middle of building and my ssh connection died today too. So for anyone smart enough, please take a look at that commit in particular, as I'm almost certain that's the one that introduces the regression.

(In reply to Sreehari S from comment #6) At this point it can only really be in one of:
sys/sys/cpu.h
sys/x86/cpufreq/est.c
and most probably:
sys/kern/kern_cpu.c
sys/x86/cpufreq/hwpstate_intel.c
sys/x86/cpufreq/hwpstate_intel_internal.h

(In reply to Sreehari S from comment #5) I don't have to abuse it at all to have it fall over:
- boot up without powerd/powerdxx
- fire up X
- log into kde/plasma
- try to open some preferences panel, start a browser, whatever
- system freezes and a split second later mouse pointer stops moving

(In reply to Eirik Oeverby from comment #8) Yeah I was just trying to prove beyond reasonable doubt that particular revision *wasn't* flawed in any way. When I tried out the next revision, it would cause a full system hang rather fast, like all I needed to do was log in and try to install vim via pkg or something.

(In reply to Sreehari S from comment #9) and this was with powerd enabled, and in a tty console (no gui).
The amount of time that the system lasts before dying varies, but it's basically guaranteed it will fairly soon.

(In reply to Sreehari S from comment #6) The identified commit (4577cf3744b98d0fa7cea80c75079c3cf5155471) is the one that added hwpstate, so it's not surprising that it's responsible. The only immediate suggestion I have is for folks to review changes to the corresponding Linux driver and see if there is some workaround or special case that we're missing.

(In reply to Ed Maste from comment #11) Yeah, that makes sense. The linux driver is available in their kernel tree at drivers/cpufreq/intel_pstate.c (https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c). The first commit that added HWP in the linux kernel tree was 2f86dc4cddcb21290ca099e1dce2a53533c86e0b from 2014, though I don't think that matters too much. The only thing I can think of that would cause the difference is MSR reading/writing stuff, but I'm no expert on this honestly, and I could be completely wrong for all I know.

Just for the record, I'm not seeing any issues on P51, "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz".

(In reply to Yuri Pankov from comment #13) Maybe it only affects intel U processors? That's the only thing I can think of that the affected people's machines have in common that you don't (you've got HQ). Again could be completely wrong.

(In reply to Sreehari S from comment #14) I've got an i7-8565U and some hw probes:
http://bsd-hardware.info/?probe=77e80759a0
http://bsd-hardware.info/?probe=8d1c80c2cb

(In reply to Sreehari S from comment #14) I also have Intel NUC7i7BNH featuring "Intel(R) Core(TM) i7-7567U CPU", and I don't remember seeing any issues with it either. The system does not have any storage device at the moment, but I'll get one shortly and re-check to confirm (or disprove) the U series guess.
(In reply to Yuri Pankov from comment #16) From what I can tell only Lenovo/thinkpad users have complained about this bug, though it's worth checking out if it affects all U processors or something. In all likelihood it could be some differences in MSR writes through some edge case not covered in FreeBSD that is covered in linux and others. I tried checking for deadlocks in the new code through printf debugging (that's all I know) and I couldn't find any myself, but I'm no expert so I wouldn't completely rule that out unless someone else can confirm.

(In reply to Sreehari S from comment #17) I might try to wrap my head around remote gdb or ddb or even trying to find crash dumps if they're created, but I'm not too familiar with all that yet.

https://github.com/erpalma/throttled
https://www.reddit.com/r/thinkpad/comments/870u0a/t480s_linux_throttling_bug/
https://www.notebookcheck.net/Lenovo-admits-ThinkPad-CPU-throttling-problem-when-running-Linux-fix-in-development.435549.0.html
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763144
Could this have anything to do with it? Apparently certain Lenovo laptops running Linux were known to have some kind of CPU throttling issue that could be mitigated with MSR writes. I certainly remember it being an issue under Linux way back in the day, but it might have been fixed by now. Maybe this is something worth looking into, as it affected the exact Lenovo laptops that people are having issues with under FreeBSD 13.

(In reply to Sreehari S from comment #19) Ok I've injected some kernel code to find the cutoff from MSR_IA32_TEMPERATURE_TARGET, and it seems to be 3, which suggests thermal throttling happens at 97 degrees C, instead of the broken 80 degrees C from before. This is probably a result of Lenovo fixing the bug in firmware, so I'm pretty sure this can be ruled out.

https://bugzilla.kernel.org/show_bug.cgi?id=200133 Anything useful here?
According to the Linux commit from 2014 I referenced earlier, they got their reference based off Section 14.4 of Volume 3 of the Intel architecture Software Developer Manual. On a cursory look this section does indeed describe hardware P-states, though it's a bit over my head at the moment. Maybe I can look into it later. Maybe there's some useful information for whatever edge case the FreeBSD code is missing in here?

(In reply to Sreehari S from comment #12) I think that this issue is certainly related to CPU frequency, because dev.cpu.0.freq_levels and dev.cpu.0.freq look very different from the versions before 13, and from 13 with hint.hwpstate_intel.0.disabled=1. Details are in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659 In my case the system sometimes hangs even during kernel boot; the last messages in the console are about USB devices.

Seems like a duplicate (still unresolved) of: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248659

I just wanted to add that my Lenovo Thinkpad X1 Carbon Gen 7 is also affected by this bug. The CPU is Intel i7 8665U.

That's not limited to ThinkPad Carbon X1, mine is a ThinkPad T490 with i7-8565U CPU.

(In reply to Stéphane D'Alu from comment #26) Looking at both tickets on this issue, the thing that jumps out at me is that it appears that only Lenovo systems are impacted. Lots of different ones have been reported, but no Dell or HP systems. No Asus. It is clearly something that ONLY impacts Lenovos and it looks like it can be further constrained to ThinkPads. I find no references to IdeaPads or other Lenovo lines. Last I knew, development of ThinkPads was still in the US at the former IBM facility. Looks like something unique to those systems, which is clearly distinct from other laptops. I just can't begin to guess what. This is WAY beyond anything I can troubleshoot, but I'm more than willing to help test. My L15 has been a real pain, unlike every other ThinkPad I've used (and that is going back to at least 1995).
I suspect that this will require at least one and maybe two very top-line FreeBSD folks to track down. kib or jhb, perhaps? Now that 13 is out the door, there will only be more reports and I'd really like P-States.

Upgraded to 13.x and I see the same hang during boot with hwpstate. The fans start going full blast for 30s and then throttle down again. Only reboot works at that stage. Shouldn't the broken commit be reverted? Throttling was working fine with 12.x ... This is with a i7-8565U in a Thinkpad T490.

(In reply to Ulrich Spörlein from comment #28) The commit is the one that enables P-States and it seems to work fine on all but Lenovo ThinkPads. That is all that can be done until someone with a lot more ACPI and kernel knowledge than I figures it out. Until then, the only "solution" is to disable P-State support by adding hint.hwpstate_intel.0.disabled=1 to /boot/loader.conf. Since P-State support was not present before 13.0, it leaves you no worse off than you were on older versions of FreeBSD. Reverting the commit would simply turn off P-State support for everyone and it is a valuable power management capability.

I'm not sure what is worse, removing P-states from every non-Thinkpad owner, or having a release out there that fails to boot on Thinkpads (which are probably the most often used laptops with FreeBSD, maybe??) Can we quirk/block the P-State support and disable it whenever the ACPI/BIOS/Firmware/whatever is from Lenovo (and/or the model matches "Thinkpad")? That would allow it working out of the box (but is too late for 13.0-RELEASE).

(In reply to Yuri Pankov from comment #16) Weird, I thought I replied with my testing, will do now. No issues on INTEL NUC7i7BN with i7-7567U CPU.

Adding my Lenovo T490 to this list of troubled machines. Hardware details:
Lenovo T490
model type 20RY-S06R00
manufactured date July 2020
current BIOS version N2RET22W 1.16
BIOS date 2020-11-11
Purchased from COSTCO in fall 2020.
Removed 256GB nvme M2 card and replaced with Crucial 512 GB nvme M2 card. Installed FreeBSD 12.x fine and it operated with no trouble. However, had no Wi-Fi so installed OpenBSD 6.8. OpenBSD ran fine with no errors and without any trouble. Wi-Fi and Xorg worked, out of the box. Stayed on OpenBSD ...until today. I was ready to move back to FreeBSD since 13 was released and no serious issues reported. I created an image of FreeBSD 13.0 RELEASE on Sandisk USB stick. Plugged into T490 and powered up. All looked like typical FreeBSD installation messages, install screen with red sphere with horns, ... more installation messages ... probe messages .. THEN ... all STOP ! The install hung and stayed hung. Fans ramped up, warm air from air vents. Last message displayed was ... this last line ... AS displayed:
hwpstate_intel0: <Intel Speed Shift> on cpu0
No response from keyboard and laptop warms up ... fast. Powered OFF.
Chuck Barker

Forgot to share some BIOS details ...
Intel Core i7-10510U 1.800 Ghz
16384 MB RAM
Came installed with Windows 10
Hyperthreading - ON in BIOS
Intel SpeedStep Technology - ON in BIOS set to 'Max Performance'
CPU Power Management - ON
Chuck Barker

No problems on my Thinkpad T480 (i5-8250U CPU @ 1.60GHz). For what it is worth, I do have devcpu-data-1.38 installed and the following in my /boot/loader.conf:
cpuctl_load="YES"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
dmesg says:
CPU microcode: updated from 0xb4 to 0xe0

(In reply to Guido Kollerie from comment #34) It turns out devcpu-data does not have anything to do with the T480 booting successfully. Setting: cpu_microcode_load="NO" (from "YES") still boots the T480 fine.
And I still see the message:
CPU microcode: updated from 0xb4 to 0xe0
I did notice that powerd++ complained with the message:
powerd++: (EDRIVER) frequency control driver not supported: hwpstate_intel0
So I disabled powerd++ and enabled powerd in /etc/rc.conf instead to see if that triggered the problem so many Thinkpad owners are experiencing, but no, the T480 still boots fine.
UEFI BIOS version: N24ET51W (1.26)
UEFI BIOS date: 2019-08-30
Machine type model: 20L5CTO1WW

Guido, can you try uninstalling the package (devcpu-data), it seems that the script is still being run given your output.

(In reply to Dries Michiels from comment #36) Forgot to clear the kernel buffer (dmesg -c), hence the message was from a previous boot. Anyway, having cleared the kernel buffer and uninstalled devcpu-data, the next reboot did NOT have the microcode update message anymore. But even without the microcode update I am able to boot just fine. Running KDE Plasma I generated some minor load by compiling a bit of Rust code (orjson lib) while at the same time compiling NumPY/pandas (= lots of C code): no system freezes. I guess hwpstate_intel just works on the Thinkpad T480.
% sysctl -a | grep dev.hwpstate_intel
dev.hwpstate_intel.3.epp: 50
dev.hwpstate_intel.3.%parent: cpu3
dev.hwpstate_intel.3.%pnpinfo:
dev.hwpstate_intel.3.%location:
dev.hwpstate_intel.3.%driver: hwpstate_intel
dev.hwpstate_intel.3.%desc: Intel Speed Shift
dev.hwpstate_intel.2.epp: 50
dev.hwpstate_intel.2.%parent: cpu2
dev.hwpstate_intel.2.%pnpinfo:
dev.hwpstate_intel.2.%location:
dev.hwpstate_intel.2.%driver: hwpstate_intel
dev.hwpstate_intel.2.%desc: Intel Speed Shift
dev.hwpstate_intel.1.epp: 50
dev.hwpstate_intel.1.%parent: cpu1
dev.hwpstate_intel.1.%pnpinfo:
dev.hwpstate_intel.1.%location:
dev.hwpstate_intel.1.%driver: hwpstate_intel
dev.hwpstate_intel.1.%desc: Intel Speed Shift
dev.hwpstate_intel.0.epp: 50
dev.hwpstate_intel.0.%parent: cpu0
dev.hwpstate_intel.0.%pnpinfo:
dev.hwpstate_intel.0.%location:
dev.hwpstate_intel.0.%driver: hwpstate_intel
dev.hwpstate_intel.0.%desc: Intel Speed Shift
dev.hwpstate_intel.%parent:
% dmesg | grep hwpstate_intel
hwpstate_intel0: <Intel Speed Shift> on cpu0
hwpstate_intel1: <Intel Speed Shift> on cpu1
hwpstate_intel2: <Intel Speed Shift> on cpu2
hwpstate_intel3: <Intel Speed Shift> on cpu3

(In reply to Guido Kollerie from comment #35) So powerdxx is superseded by the hwpstate_intel[4] driver on systems that support it. Following taken from https://reviews.freebsd.org/D30004
For more information, including on how to balance performance and energy use, and on how to disable this driver, refer to the man page man:hwpstate_intel[4].
Note: Users accustomed to using man:powerd[8] or package:sysutils/powerdxx[] will find these utilities have been superseded by the man:hwpstate_intel[4] driver and no longer work as expected.
So unless you set hint.hwpstate_intel.0.disabled="1" in loader.conf one should expect both powerd and powerdxx to no longer work as expected.
On my X1-Carbon 7th gen (still running stable/13-n245210-3bec9180c9e7) I get this behaviour when using sysutils/devcpu-data (1.38)
/boot/loader.conf :
cpuctl_load="YES"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
hint.p4tcc.0.disabled=1
hint.acpi_throttle.0.disabled=1
/etc/rc.conf :
microcode_update_enable="YES"
dmesg says:
CPU microcode: no matching update found
When manually starting microcode :
service microcode_update start
Updating CPU Microcode...
CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz (2112.12-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x806ec Family=0x6 Model=0x8e Stepping=12
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
AMD Features2=0x121<LAHF,ABM,Prefetch>
Structured Extended Features=0x29c6fbf<FSGSBASE,TSCADJ,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
Structured Extended Features3=0xbc000600<MCUOPT,MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,SSBD>
XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
IA32_ARCH_CAPS=0xab<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TSX_CTRL>
VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
TSC: P-state invariant, performance statistics
Done.
With hint.hwpstate_intel.0.disabled="1" in loader.conf I am using powerdxx with
powerdxx_flags="-a min -b min -n min"
in rc.conf. I was using
powerdxx_flags="-a hiadaptive -b hiadaptive -n hiadaptive"
before that but was seeing fairly frequently
kernel: coretemp0: critical temperature detected, suggest system shutdown
So with the device hint for hwpstate disabled my system is using EST
sysctl dev.cpufreq.0.freq_driver
dev.cpufreq.0.freq_driver: est0
Can we please change the importance to 'affects many people' and also change the title? It clearly doesn't only affect the 8th gen X1 Carbon.

Just to add my new work laptop, a Lenovo ThinkPad T14 Gen 1. Intel Core i7-10510U. Hangs on the same Intel Speed Shift other people have seen. Disabling it makes the temperature go up to 47C on the zone, and around 40C per core. Fan is ramped up all the time. Battery lasts for about 4 hours then.

I'm having the same issue trying to install FreeBSD 13.0-RELEASE on a 4th gen X1 Carbon with Intel Core i7-6600U. The problem doesn't occur on my ThinkPad P50 with Intel Xeon E3-1505M.

Have you disabled P-States? While that is very sub-optimal, it does seem to pretty much fix the lockup problem. I still see lock-ups, but fewer than one per month. Add "hint.hwpstate_intel.0.disabled=1" to /boot/loader.conf. It appears to only show up on Lenovo laptops running 13.0 or newer. (P-State support was not available prior to 13.) If the problem continues, it is likely a different problem.

(In reply to rkoberman from comment #42) Yes, just figured that out now. Thank you. To get the 13.0-RELEASE installer to boot, I had to input:
set hint.hwpstate_intel.0.disabled=1
boot

(In reply to rkoberman from comment #42) It's not all Lenovo laptops. In my case it works fine on my 6th gen X1 Carbon, but on my work laptop model T14 Gen 1 it hangs. Also, to update on my own message here with regards to the temps, the laptop is just a hothead.
It's running Windows now (for other reasons) and it's equally hot all the time.

Just installed 13.0 on a Thinkpad T490, and it would freeze and require a whole system reboot after a few minutes of editing config files. Setting hint.hwpstate_intel.0.disabled=1 let me install KDE5 and it's no longer having any freezes. I assume I am not experiencing the best battery life until this is resolved... Thanks everyone who has looked into this!
OC

I installed 13.0-RELEASE on a Thinkpad T490, and I saw all of the same symptoms described by others above. The suggestion of setting hint.hwpstate_intel.0.disabled=1 fixed it. I noticed one symptom not mentioned that might be specific to just me. I can reliably trigger the behavior by moving my laptop, e.g. repositioning it on the desk, or carrying it from my desk to my armchair. I've heard that Lenovo has motion-based CPU throttling, so maybe that is related?

I have the same bug on a Protectli clone device with a i3-8145U. I'm on FreeBSD 13.0-STABLE. I have this device: https://de.aliexpress.com/item/1005002922518905.html

This is still a problem on the latest 14-CURRENT snapshot as of 2022-05-09. ThinkPad Carbon X1 Gen8 - but affects several other models (Lenovo and otherwise, if I understand the messages on this ticket correctly). No indication of any work on this.

I have been following this since it first popped up due to the added support for P-states in 13. I am about to open another ticket on other issues I'm seeing with thermal management that are likely related.

Can someone experiencing this issue try and reproduce from the FreeBSD installer on a usb stick? I would like to debug this, but (afaict) I don't have hardware that triggers this issue. I tried:
- boot latest snapshot installer
- break to shell
- start powerd
- run `openssl speed -multi $(sysctl -n hw.ncpu)`
From my quick reading of this thread that seemed like it should be more than enough to hang the system, but interactivity was still fine.
If this can be reproduced from the installer then I can try and borrow laptops to debug on. I tried to reproduce on an i3 10th Gen NUC NUC10i3FNK https://dmesgd.nycbug.org/index.cgi?do=view&id=5552

I can confirm this on a ThinkPad Carbon X1 Gen8. My test involves 8 instances of dd from /dev/random to /dev/null, but any kind of load will do.

Could you test to see if you can do it from the installer? Can you include the cpu reported by sysctl hw.model

Yes, this is from the installer. Just boot and choose "live cd".
hw.model: Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz

From reading the history on the Linux driver, my guess is that this is coming from an interaction between the bios or acpi and the p state driver. To summarise the thread so far the hardware in this thread breaks down as:

Computer | CPU | BAD
---|---|---
Thinkpad T490 | i7-8565U | yes
x1 Carbon Gen7 | i7-8565U | yes
P51 | i7-7820HQ | no
NUC7i7BNH | i7-8565U | yes
T490 | i7-10510U | yes
T480 | i5-8250U | no
T14 Gen1 | i7-10510U | yes
x1 Carbon Gen4 | i7-6600U | yes
Protectli Clone | i3-8145U | yes
Eirik's laptop | i7-10610U | yes

I think these are all laptop BIOSes (the router hardware could be if it is a cheap respin). I am trying to borrow a machine I can reproduce this issue on.

I wrote how to capture the data. I should have told you how to look at it. I use Wireshark, but just "tcpdump -r file-you-wrote" will print the captured data. Sorry I left off this rather important detail.

(In reply to rkoberman from comment #55) Sorry. My update was for another ticket.

(In reply to Tom Jones from comment #54) Just making sure the correct CPU for my X1 Carbon Gen7 is also listed (currently it isn't in your summary):
hw.model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz

The same here on x13 Gen1 with i7-10510U. Disabling p-states solves the issue.

This is also happening for me on the following machines but with slightly different levels of severity: Crashes within a few seconds (even on the 14.1 RELEASE installed) on my Thinkpad X1 Carbon Gen 7 with a Intel i7-10710U.
Thinkpad X260: Intel Core i7-6500U
On the X260, it only happens if I close the lid without the AC power connected. It doesn't matter if sleep on lid close is on or not.
This does not happen on my Batch 6 Framework Laptop with a Intel® Core™ i7-1165G7.

I also forgot to mention that I have disabled pstates as others have done to work around this for now.

Arg typo above.. it should have said 13.1 RELEASE installer.

Thanks to Eirik Oeverby for providing me with an x1 Carbon to test on, I have a fix that seems to stop the lock up. I am not sure if the fix is safe to use or not, basically there is a bug in the system firmware when handling thermal interrupts. If we tell the smm the os will handle these the lock up seems to go away. Now we aren't handling the interrupts at all and we probably need to. My next step is going to be figuring out what we need for this.

I think I have a fix, as far as I can tell it should be safe to tell the SMM we are handling CPPC notifications, but then not actually do anything. This patch does so: https://reviews.freebsd.org/D36699 I would really appreciate testing and positive or negative results.

(In reply to Tom Jones from comment #63) For the record: I hereby permit you to run as many buildworlds and simultaneous "GPU" stress-tests as you need on that device in order to confirm the laptop does not melt. If it does melt, I want only pictures and a beer.
/Eirik

Works for me, tested several hours under intensive cpu load.

(In reply to Tom Jones from comment #63) Works here with your patch and commented hint. Successfully passed buildkernel. Thanks a lot, Tom!

(In reply to Tom Jones from comment #63) Applied your patch today on stable/13 9168218160ca and successfully built world and kernel and have been running your openssl speed suggestion for a couple of hours now, no freezes and during the speed run the system was still responsive.
X1 Carbon 7th Gen
hw.model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
dmesg | grep hwp
hwpstate_intel0: <Intel Speed Shift> on cpu0
hwpstate_intel1: <Intel Speed Shift> on cpu1
hwpstate_intel2: <Intel Speed Shift> on cpu2
hwpstate_intel3: <Intel Speed Shift> on cpu3
(I'm running with hyperthreading disabled)
sysctl kern.timecounter.hardware
kern.timecounter.hardware: TSC
Here's hoping you can find out what more is needed for interrupt handling but so far system is stable. Thanks a lot Tom!

You all making me want to pull out my X1C7 and do a fresh FreeBSD install to test this ;D. Keep it up!

I can confirm my Carbon X1 Gen8, returned to me without visible signs of scorching or other abuse, is now stable. Thanks, Tom!

(In reply to Eirik Oeverby from comment #69) Thanks Eirik for stepping up and lending your X1 gen 8 to Tom to get this issue sorted! I'm currently rebuilding world and kernel again on stable/13 with the hint still commented but now using the patch from https://reviews.freebsd.org/D36699?id=111554 (in accepted state now). And of course again Tom, thanks a lot for the efforts. Looking forward to the patch officially landing :)

(In reply to Tom Jones from comment #63) Hello Tom, I applied the patch one week ago on my T590 / 20N4 with i7-8565U. No issue !! Thank you so much !!

I installed the patch on my L15 (Lenovo) almost 2 weeks ago and turned P-States back ON. No freezes and it's not getting any hotter than it did before, though I think it fixed my problem with the CPU slowing to minimum speed (400 MHz) and staying there long after the TZ0 temp had dropped to under 50C and ignoring attempts to set the frequency using sysctls. It does get a bit hotter. Previously it topped out at 88 or 89C. Now it goes to 90C or 91C. This week I moved to my new T16. It had not demonstrated the problems and P-States were never disabled, so I think Lenovo may have finally fixed their BIOS. In any case, no issues.
Unless I find a problem that forces me to bring it up again, my L15 is down for good, so no further updates on the issue from me. Looks like they are not needed any longer.

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=67f2a563bfcad75c16536ca500b06ddc9306dfa0

commit 67f2a563bfcad75c16536ca500b06ddc9306dfa0
Author: Tom Jones <thj@FreeBSD.org>
AuthorDate: 2022-10-10 13:46:25 +0000
Commit: Tom Jones <thj@FreeBSD.org>
CommitDate: 2022-10-10 13:53:15 +0000

acpi: Tell SMM we will handle CPPC notifications

Buggy SMM implementations can hang while processing CPPC notifications. This leads to some laptops (notably Thinkpads) hanging when the hwpstate_intel driver is loaded.

Tell the SMM that we will handle CPPC notifications as described in:
- Intel® Processor Vendor-Specific ACPI
- Intel® 64 and IA-32 Architectures Software Developer’s Manual

CPPC events default to masked (disabled) so while we do not do any handling right now this does not seem to lead to any issues.

This approach was found via this Linux Kernel patch: https://lkml.org/lkml/2016/3/17/563

PR: 253288
Reviewed by: imp, jhb
Sponsored by: Modirum
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36699

sys/dev/acpica/acpi_cpu.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

Created attachment 237490 [details]
simple debug patch
So, I'm running 13.1 on T490 Lenovo, with the above mentioned patch applied. And I'm still seeing a system freeze.
Here's what I've done, maybe I'm doing something wrong:
cd /usr/src
git checkout -b 13.1 -t freebsd/releng/13.1
git cherry-pick 67f2a563bfcad75c16536ca500b06ddc9306dfa0
At this point, I have the 13.1 release kernel + the patch. After compiling and booting the kernel, it doesn't take long before the fans start and the system becomes completely unresponsive, and a hard reboot is inevitable.
Now, I've done an experiment with the attached debug patch. And here's what I see in dmesg:
dmesg|grep DEBUG
cpu0: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu1: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu2: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu3: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu3: ==> DEBUG init cppc_notify: 1
So, 'cppc_notify' is only set to '1' _after_ acpi_cpu_attach() is executed. And, although I have no idea what this really means, it doesn't look correct to me. Or does this mean my hardware isn't really affected by this bug and I'm hitting some other bug? Any hints?
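The numbers in the DEBUG lines can be checked by hand. The sketch below is only an illustration and rests on two assumptions: that "res" is the bitwise AND of "eax" and "mask" (the printed values are consistent with that), and that the field order is exactly as quoted above. The helper name summarize_debug is hypothetical and not part of any FreeBSD tool.

```sh
#!/bin/sh
# Sample lines copied from the dmesg output quoted above.
debug_lines='cpu0: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu1: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu2: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu3: ==> DEBUG: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
cpu3: ==> DEBUG init cppc_notify: 1'

# summarize_debug (hypothetical helper): reads DEBUG lines on stdin,
# recomputes eax & mask, and reports the cppc_notify value seen on each line.
summarize_debug() {
    while read -r cpu _arrow tag rest; do
        case $tag in
        DEBUG:)
            # rest looks like: res: 256 eax: 0x27f7 mask: 0x100 cppc_notify: 0
            set -- $rest
            res=$2 eax=$4 mask=$6 notify=$8
            # Assumption: res is eax AND mask (0x100 is bit 8).
            calc=$(( $eax & $mask ))
            echo "$cpu eax&mask=$calc res=$res cppc_notify=$notify"
            ;;
        DEBUG)
            # rest looks like: init cppc_notify: 1
            set -- $rest
            echo "$cpu init sets cppc_notify=$3"
            ;;
        esac
    done
}

printf '%s\n' "$debug_lines" | summarize_debug
```

Read this way, bit 8 is set on every CPU (0x27f7 & 0x100 = 256), yet cppc_notify is still 0 at each check and only becomes 1 in the final init line. That ordering appears consistent with the follow-up commit named later in this thread, "acpi: Create cppc_notify sysctl before it is checked".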
Another experiment I've done was to set 'cppc_notify' to '1' in the variable declaration (and dmesg will obviously show 4 "==> DEBUG: OK" messages instead). It looks like the system doesn't crash with this patch, but maybe I'm just doing some harm to my hardware.
(In reply to commit-hook from comment #73) Triage: merge to stable/13? From comment #3 and others, I assume not to stable/12.

Thanks henrix, I have created https://reviews.freebsd.org/D37081

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=308d3d6be6da1df4a47e641b5e0cedeccea7b09f

commit 308d3d6be6da1df4a47e641b5e0cedeccea7b09f
Author: Tom Jones <thj@FreeBSD.org>
AuthorDate: 2022-10-10 13:46:25 +0000
Commit: Tom Jones <thj@FreeBSD.org>
CommitDate: 2022-12-08 20:02:39 +0000

acpi: Tell SMM we will handle CPPC notifications

Buggy SMM implementations can hang while processing CPPC notifications. This leads to some laptops (notably Thinkpads) hanging when the hwpstate_intel driver is loaded.

Tell the SMM that we will handle CPPC notifications as described in:
- Intel® Processor Vendor-Specific ACPI
- Intel® 64 and IA-32 Architectures Software Developer’s Manual

CPPC events default to masked (disabled) so while we do not do any handling right now this does not seem to lead to any issues.

This approach was found via this Linux Kernel patch: https://lkml.org/lkml/2016/3/17/563

PR: 253288
Reviewed by: imp, jhb
Sponsored by: Modirum
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36699

(cherry picked from commit 67f2a563bfcad75c16536ca500b06ddc9306dfa0)
(cherry picked from commit eee0f7aea42564fe005c74f004d63f8cc170ef59)
(cherry picked from commit 15bd2f366d3e878f5a8bc1628368d59ef318af5f)

sys/dev/acpica/acpi_cpu.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

(In reply to commit-hook from comment #77) Officially running this on stable/13-n253250-308d3d6be6da since 20 minutes ago on my Carbon X1 gen 7 with machdep.hyperthreading_allowed=0 set in loader.conf(5)
[~] sysctl hw.model
hw.model: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
[~] uname -aKU
FreeBSD harbinger.fritz.box 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n253250-308d3d6be6da: Fri Dec 9 00:25:51 UTC 2022 root@harbinger.fritz.box:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 1301510 1301510
[~] sysctl hw.acpi.cpu.cx_lowest
hw.acpi.cpu.cx_lowest: C8
[~] sysctl hw.acpi.cpu.cppc_notify
hw.acpi.cpu.cppc_notify: 1
[~] sysctl dev.hwpstate_intel.
dev.hwpstate_intel.3.epp: 50
dev.hwpstate_intel.3.%parent: cpu3
dev.hwpstate_intel.3.%pnpinfo:
dev.hwpstate_intel.3.%location:
dev.hwpstate_intel.3.%driver: hwpstate_intel
dev.hwpstate_intel.3.%desc: Intel Speed Shift
dev.hwpstate_intel.2.epp: 50
dev.hwpstate_intel.2.%parent: cpu2
dev.hwpstate_intel.2.%pnpinfo:
dev.hwpstate_intel.2.%location:
dev.hwpstate_intel.2.%driver: hwpstate_intel
dev.hwpstate_intel.2.%desc: Intel Speed Shift
dev.hwpstate_intel.1.epp: 50
dev.hwpstate_intel.1.%parent: cpu1
dev.hwpstate_intel.1.%pnpinfo:
dev.hwpstate_intel.1.%location:
dev.hwpstate_intel.1.%driver: hwpstate_intel
dev.hwpstate_intel.1.%desc: Intel Speed Shift
dev.hwpstate_intel.0.epp: 50
dev.hwpstate_intel.0.%parent: cpu0
dev.hwpstate_intel.0.%pnpinfo:
dev.hwpstate_intel.0.%location:
dev.hwpstate_intel.0.%driver: hwpstate_intel
dev.hwpstate_intel.0.%desc: Intel Speed Shift
dev.hwpstate_intel.%parent:
Thanks for the MFC to stable/13

Is there a timeline for a MFC on the second commit "acpi: Create cppc_notify sysctl before it is checked"? If so, will it end up in releng/13.2?

All three commits were MFC'd together

*** Bug 248659 has been marked as a duplicate of this bug. ***

This bug has been fixed. |
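For kernels that predate the fix, the workaround repeated throughout this thread is the single loader hint hint.hwpstate_intel.0.disabled=1 in /boot/loader.conf. A minimal sketch of adding it without creating duplicates follows. The helper name add_hwpstate_hint is hypothetical, and the demonstration writes to a scratch file rather than the real loader.conf.

```sh
#!/bin/sh
# add_hwpstate_hint (hypothetical helper): append the workaround hint to a
# loader.conf-style file unless a line setting it is already present.
add_hwpstate_hint() {
    conf=$1
    if grep -q '^hint\.hwpstate_intel\.0\.disabled=' "$conf" 2>/dev/null; then
        echo "hint already present in $conf"
    else
        printf '%s\n' 'hint.hwpstate_intel.0.disabled=1' >> "$conf"
        echo "hint added to $conf"
    fi
}

# Demonstrate on a scratch file. On a real system the argument would be
# /boot/loader.conf and the script would have to run as root.
scratch=$(mktemp)
add_hwpstate_hint "$scratch"
add_hwpstate_hint "$scratch"   # second call leaves the file unchanged
cat "$scratch"
rm -f "$scratch"
```

As noted in the thread, the same hint can be entered once at the loader prompt (set hint.hwpstate_intel.0.disabled=1, then boot) to get an unpatched installer to start.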