Bug 255745 - resume on T490 Thinkpad often broken after upgrade 12 -> 13 with hda(4) sound loaded
Status: Closed DUPLICATE of bug 261207
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL: https://reviews.freebsd.org/D34117
Keywords: needs-qa, regression
Depends on:
Blocks:
 
Reported: 2021-05-10 10:13 UTC by Ulrich Spörlein
Modified: 2022-11-22 08:28 UTC
CC List: 10 users

See Also:
koobs: mfc-stable13?


Description Ulrich Spörlein freebsd_committer 2021-05-10 10:13:27 UTC
Under stable/12 this T490 thinkpad was suspend/resuming like a champ. I've recently upgraded to stable/13 and things are busted.

This might be related to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253288, although I've disabled that module. Often when resuming, the laptop no longer powers up, and eventually the fans start spinning.

I did some tests with suspend/resume, and the hang seems to occur only after a longer uptime or when suspending overnight. It's quite non-deterministic.

I could successfully suspend/resume 5 times yesterday between 10am and 10pm. Then left it on overnight. I wanted to reboot this morning and figured I'll test suspend/resume once more. It did not come back from suspend when I opened the lid again 2 minutes later. The typical Thinkpad light was still pulsating though.

How to even start to debug that? What were the changes around suspend/resume between 12 and 13?
Comment 1 John Grafton 2021-05-10 14:06:34 UTC
I am experiencing a similar issue on a Thinkpad X1 Carbon Gen 6.  Under 12.x, suspend/resume has worked without issue since I installed it a couple years ago.  I recently upgraded to 13.0-RELEASE and successful resuming has become very non-deterministic.

Generally, resuming works fine during the day but leaving the system sleeping overnight and attempting to resume the next day fails.  After opening the lid, the power LED blinks as if it's still asleep.  A hard power cycle is required to bring the laptop back online.
Comment 2 Ulrich Spörlein freebsd_committer 2021-05-10 14:19:04 UTC
Yes, exactly! Could you try leaving it running overnight and then suspend/resume it in the morning within say a minute or two? I get the feeling it has to do with uptime also, not just the "sleep time".
Comment 3 John Grafton 2021-05-11 11:11:54 UTC
(In reply to Ulrich Spörlein from comment #2)

I just opened it this morning from an overnight sleep and it worked as expected.  It only has an uptime of 3 hours, however.  I'll leave it open for the day and close/open the lid every few hours and see if it fails.
Comment 4 John Grafton 2021-05-11 15:30:07 UTC
The laptop failed to resume the first time I attempted to close and then open the lid after leaving it open for 4 hours.  So it failed to resume properly after an uptime of ~7 hours.
Comment 5 John Grafton 2021-05-11 19:54:07 UTC
The laptop just failed to resume after an uptime of 4 hours.
Comment 6 John Grafton 2021-05-13 10:37:41 UTC
I've tested suspend/resume on the same Thinkpad X1 Carbon with GhostBSD 21.05.11 (13.0-STABLE) livecd for the past 24 hours and it has not failed.
Comment 7 Ulrich Spörlein freebsd_committer 2021-05-13 13:06:01 UTC
I'm willing to bisect this, but could use some help. It seems that stable/12 was branched off main around 2y7mo ago. So I would need to downgrade to current of that time, then bisect up to the latest main (assuming this also happens on 14-CURRENT).

What are the chances of that downgrade working? I would rebuild world first, install that and hope that it mostly works with the newer kernel. Then downgrade the kernel, yes?

Of course Xorg and co would need a full recompile as well, ugh. I'll probably try to reproduce this in single user first, that would make the later bisect testing a lot easier.

Better ideas? Lucky me I didn't `zpool upgrade` yet, eh?
Comment 8 Graham Perrin freebsd_committer 2021-05-14 08:21:07 UTC
From the outside looking in, my first thought was DRM. 

I read of one case where it seemed that installation of drm-kmod did _not_ automatically trigger installation of drm-fbsd13-kmod

----

On one hand: 

(In reply to Ulrich Spörlein from comment #0)

> … light was still pulsating …

– I never had a ThinkPad, I assume that the pulse is indicative of the machine still sleeping (zero response to an attempt to wake). For this, I would be less inclined to think of DRM. 

----

On the other hand: 

> … fans start spinning …

I can't imagine that _combined with_ a still-pulsating indicator of sleep. For this, I would be more inclined to think of DRM.
Comment 9 Li-Wen Hsu freebsd_committer 2021-05-19 15:37:51 UTC
This report might be related: https://lists.freebsd.org/pipermail/freebsd-stable/2021-May/093709.html

There is another thread on -current: https://lists.freebsd.org/pipermail/freebsd-current/2020-September/077156.html the symptom is exactly the same.

BTW, I think this is less related to drm. I tried suspend/resume after booting into multiuser mode without loading the drm module from ports, and the same symptom still happens.
Comment 10 John Grafton 2021-05-27 17:19:42 UTC
(In reply to Ulrich Spörlein from comment #7)
I upgraded to CURRENT yesterday (FreeBSD 14.0-CURRENT #1 main-n246885-27f09959d5f5) with the latest DRM and encountered the same issue.  I'll downgrade to 12.2 soon.  Since I upgraded zroot shortly after the 13 upgrade I have to do a full reinstall.
Comment 11 Graham Perrin freebsd_committer 2021-05-28 06:52:35 UTC
(In reply to John Grafton from comment #10)

> … 12.2 … upgraded zroot … have to do a full reinstall.

Consider the port of OpenZFS, which is packaged for FreeBSD:12:amd64

<https://www.freshports.org/sysutils/openzfs/#packages>
<https://www.freshports.org/sysutils/openzfs-kmod/#packages>
Comment 12 Graham Perrin freebsd_committer 2021-05-28 07:01:24 UTC
(In reply to Li-Wen Hsu from comment #9)

> … There is another thread on -current: 
> https://lists.freebsd.org/pipermail/freebsd-current/2020-September/077156.html 
> the symptom is exactly the same. …

Sorry, that is currently 404 not found (due to bug 256182), can you identify the same thread with an alternative URL? 

<https://lists.freebsd.org/archives/freebsd-current/2020-September/>
Comment 13 Li-Wen Hsu freebsd_committer 2021-05-28 07:32:27 UTC
(In reply to Graham Perrin from comment #12)
https://lists.freebsd.org/archives/freebsd-current/2020-September/166778.html
Comment 14 John Grafton 2021-05-29 13:39:11 UTC
Downgrading to 12.2-RELEASE-p7 fixes the problem completely.
Comment 15 Ulrich Spörlein freebsd_committer 2021-05-29 15:46:19 UTC
As for 13-STABLE it's definitely more subtle than just uptime. I seem to have struck some luck where my laptop managed to suspend/resume quite often in a row. I'm only using it very lightly and could imagine that never entering some thermal throttling state might also help.

Uptime is 5d now, /var/run/dmesg.boot is from May 18, so 11d old now and I think I've suspended it almost every night. So probably 10 successful cycles.

I _did_ restart Xorg once though, as it tends to slow down horribly over time (xterm scrolling like molasses, etc)
Comment 16 Jan Schreiber 2021-07-09 18:10:31 UTC
I ran into the same issue with my ThinkPad x250. Resume always worked with 12.2. On FreeBSD 13, resume seems not to work when the uptime is more than a couple of hours. Sleep time does not seem to make any difference.

I'm running drm-fbsd13-kmod-5.4.92.g20210419 and 13.0-RELEASE-p1. I also set
hw.usb.no_suspend_wait=1 as suggested here: https://lists.freebsd.org/pipermail/freebsd-current/2020-September/077167.html
Makes no difference.

Have you guys found any fix other than downgrading to 12.2? Or any ideas on how to debug this problem? Or does anybody know where to find the code that handles resume?
Comment 17 brtastic.dev 2021-07-12 13:36:25 UTC
I'm pretty sure it also affects me. I've had a Thinkpad T480 (i7 8th gen) for a couple of weeks and I've been fighting random resume failures from the start. It has run FreeBSD 13.0-RELEASE from the start, so I can't really tell if it worked on 12, but the symptoms match.

Suspend / resume is working well for the most part, but every couple of cycles it's impossible to resume the system. The screen stays completely off, the power diode is blinking, and pressing most of the buttons has no effect. Three buttons have small indicator lights on them: Escape (FnLock), F1 (speakers off) and Caps Lock. Pressing them turns that light on or off, and that's it. I can only force-shutdown the computer by holding the power button, and then turn it on again.

I tried changing some BIOS settings: disabled TPM, enabled USB power during low power states and enabled power on / wake up on AC power. None of them changed anything, although after changing the USB power behavior I had no issues for an entire week, which shows just how randomly it behaves.

As for when it happens, I tried suspending / resuming a couple of times just after turning the system on, and it always came back without problems. It mostly happens when I leave it suspended overnight, however there were a couple of times where I just left it suspended for five minutes after using it for just a couple of hours and had the issue occur. Seems like time of usage is having some impact on whether it will occur or not. After booting it again, the last message in /var/log/messages is: thinkpad acpi[21212]: suspend at YYYYMMDD HH:II:SS

I'm on zfs, drm-fbsd13-kmod and xorg with xf86-video-intel. acpi_ibm and acpi_video are loaded. If there's a workaround, I'll be happy to test it
Comment 18 John Grafton 2021-07-18 17:52:52 UTC
I'm currently bisecting the git repo to hunt down where the problem was introduced.  It's slow going but I'm making a bit of progress.  Luckily, the issue is easily reproducible on my X1 Carbon Gen6.

FreeBSD-src commit 6abe97c0140d54d3520c30517b2bdebc3de92a62 is the first I've found in 13.0-CURRENT that doesn't appear to fail.  This requires the DRM port drm-current-kmod-5.4.62.g20201109_1 from Nov 2020 (FreeBSD-ports commit 25e14f16ef2ccfe80e1b45168318d777369ec9ed).

Time has opened up for me in the next couple of weeks and I'll continue researching.
Comment 19 John Grafton 2021-07-26 18:07:55 UTC
After more testing, I was able to get 6abe97c0140d54d3520c30517b2bdebc3de92a62 to fail (making my previous comment incorrect).  Thus I went back to the beginning of 13-CURRENT to find where the issue made its way into the code base.  Instead of November 2020, it appears to have been added in July 2020.

I've spent the past week `git bisecting` my way through CURRENT looking for the beginning of the suspend/resume issue.

Here's a brief description of my testing methodology:
1) Build world / kernel on 12 core system in a 12.2 jail (takes ~ 1/2 hour)
2) Install world / kernel on laptop from NFS to new boot environment on laptop
3) Boot laptop into new boot environment
4) Compile DRM driver for current kernel
5) kldload i915kms driver
6) Close lid and leave closed for 5 seconds
7) Open lid and wait to see if the system fails to resume
8) Repeat lid close and open in console 15 times and in X 5 times
9) If any resume fails, mark the commit bad in the bisect; if all resumes succeed, mark it good
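
Steps 6 to 9 above could be partly automated along the following lines. This is only a sketch, not the procedure actually used here (the cycles in this comment were driven by closing and opening the lid). The names `run_cycles` and `verdict` and the `suspend` callable are hypothetical; `suspend` is assumed to return only once the machine is awake again. Each cycle is journaled to disk before the machine sleeps, so that after a hang and a hard power cycle the log still shows which cycle never came back.

```python
import os
import time


def run_cycles(n, suspend, log_path, settle=5.0):
    """Run n suspend/resume cycles, journaling each one to log_path.

    `suspend` is a caller-supplied callable (hypothetical) that puts the
    machine to sleep and is assumed to return only after a successful resume.
    """
    with open(log_path, "a") as log:
        for i in range(1, n + 1):
            log.write(f"start {i}\n")
            log.flush()
            os.fsync(log.fileno())  # the record must be on disk before sleeping
            suspend()
            log.write(f"ok {i}\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(settle)  # give the system a moment before the next cycle


def verdict(log_lines, required):
    """Turn a cycle journal into a bisect verdict: 'good', 'bad' or 'skip'."""
    started = sum(1 for line in log_lines if line.startswith("start "))
    completed = sum(1 for line in log_lines if line.startswith("ok "))
    if started > completed:
        return "bad"  # a cycle began but never logged a successful resume
    if completed >= required:
        return "good"  # enough clean cycles were observed
    return "skip"  # too few cycles to say anything about this commit
```

After a hang, reading the journal back and calling `verdict(lines, 20)` gives the word to pass to `git bisect`. A journal that ends in a `start` line with no matching `ok` line is what a failed resume looks like.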

Usually, resumes would fail after 5 or so tries.  I never had a resume fail in X that didn't first fail in the console.  It did not seem to make a difference if I was on the console or in X.org.

I found that before 12b2f3daaa597f346a4b0065bf7f75378524ef88 the X1 Carbon Gen 6 resumes all 20 times without issue.  The main flaw in my methodology is assuming 20 suspend and resumes are enough to accurately test for the issue.  
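
To put a rough number on that flaw, here is a small sketch of how likely it is that a commit containing the bug still passes every test cycle. It assumes each resume fails independently with a fixed probability, which is a simplification (other comments in this thread suggest uptime matters). The function names and the example probabilities are illustrative assumptions, not measurements.

```python
import math


def p_all_pass(p_fail, cycles):
    """Probability that `cycles` independent resumes all succeed on a bad commit."""
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail, max_miss):
    """Smallest cycle count whose chance of missing the bug is at most max_miss."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - p_fail))


# If a bad kernel hangs on 1 resume in 5, 20 clean cycles are fairly convincing:
print(round(p_all_pass(0.20, 20), 4))  # about 0.0115
# If it hangs on only 1 resume in 20, the same test misses the bug often:
print(round(p_all_pass(0.05, 20), 4))  # about 0.3585
```

Under these assumptions, a failure rate of 1 in 20 would need around 90 clean cycles per commit to bring the chance of a wrong "good" verdict below 1%, and a single wrong verdict is enough to send a bisect down the wrong half of the history.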

I'm currently using src commit e7677232d6eed5f5cae80c1d5968eea5b9266b59 with graphics/drm-current-kmod from ports commit d9b8d3b2b3b5ffcf4b83572098d64803fe237b90 as my daily laptop to test the assumption that 20 suspend and resumes are enough to make a generalization about whether my tests were valid.

Commit 12b2f3daaa597f346a4b0065bf7f75378524ef88 being flagged as the culprit seems strange to me, because that commit appears to be essentially a no-op: it just cleans up #ifdef statements for older systems.  12.2, which does not fail at all, seems to have similar patches in place.

My next steps are to continue testing this older commit as my daily laptop and see if I can get it to fail.  Then attempt to build a version of 13-RELEASE with what appears to be the offending patch removed and see if it fails.
Comment 20 John Grafton 2021-08-30 14:13:30 UTC
(In reply to John Grafton from comment #19)

Unfortunately, my fears turned out to be valid.  The testing methodology I used didn't capture all failure cases and I found commits before 12b2f3daaa597f346a4b0065bf7f75378524ef88 that hang the laptop on resume.

Back to square one.  :(
Comment 21 brtastic.dev 2021-10-12 15:30:23 UTC
This issue is making it significantly harder for me to daily drive FreeBSD. I appreciate John's work and can see the problem with uncertainty during bisecting. The issue seems to be getting more severe with each freebsd-update, to a point where I'm hardly getting any successful resumes (but again, it's very random).

Is there anything I can do, as someone who has never compiled a kernel from sources but gets the issue to occur frequently, to help hunt down that bug?
Comment 22 John Grafton 2021-10-21 18:27:59 UTC
(In reply to brtastic.dev from comment #21)
Hi brtastic.dev@gmail.com,

I know this isn't the answer you're looking for, but the best workaround I can give is reinstalling with 12.2 (that's what I did).  It would be helpful to have access to another person running 12.x who is willing to test 13.x using a ZFS boot environment.  (Do not upgrade your zpool when testing with 13 if you want to drop back down to 12!)

I can write up a blog post on how to install a development 13 boot environment from 12 if you'd like.  That's how I've been doing most of my testing.

John
Comment 23 Jan Schreiber 2021-10-21 19:20:24 UTC
For me, it seems like this did the trick:

sysctl hw.usb.no_suspend_wait=1

Out of maybe 50 resumes, only one failed, whereas before, almost every resume hung. I'm on FreeBSD 13.0-RELEASE-p1.
Comment 24 brtastic.dev 2021-10-23 08:26:22 UTC
(In reply to Jan Schreiber from comment #23)
Thanks for the suggestion, but sadly that does not seem to help at all in my case.

(In reply to John Grafton from comment #22)
Given that 12 is only going EOL in 2024 and I assume this bug will not get propagated from 13 to 12, that's probably what I'm going to do. I'll let you know when I'm ready to help with your testing.
Comment 25 John Grafton 2021-10-24 14:54:44 UTC
(In reply to brtastic.dev from comment #24)
Unfortunately, hw.usb.no_suspend_wait=1 did not fix the issue for me either.
Comment 26 John Grafton 2021-11-15 12:17:07 UTC
I've been running 12.3-RC1 on the Thinkpad for the past few days without any trouble.
Comment 27 Andrew Payne 2021-11-29 02:58:26 UTC
I upgraded to v13 several months back and found this bug affected the 430s and 470s models too. Yesterday I wiped the 430s and installed v13 clean, updated base install, and the bug persists. Adding hw.usb.no_suspend_wait=1 DOES allow the system to enter S3, but resuming produces a crash. This is true with or without xorg running (tested xfce4 and i3).
Comment 28 John Grafton 2021-12-03 14:09:41 UTC
I followed Warner's advice on debugging suspend/resume problems (https://wiki.freebsd.org/DebuggingSuspendResume) and compiled and booted a bare minimum kernel from the 'releng/13.0' branch. 

The suspend/resume problem no longer occurred.  Adding one driver at a time, I found the problem began happening again after the sound driver.

I've been running a GENERIC kernel from the latest 'releng/13.0' with the sound section commented out for the past few days without having a recurrence of the issue.
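
The driver-by-driver search described above amounts to a linear scan over the kernel configuration. A minimal sketch, assuming a hypothetical `resumes_ok` callback that stands for "build and boot a kernel with exactly these devices, then run the suspend/resume test"; the function name is also made up for illustration.

```python
def first_failing_driver(drivers, resumes_ok):
    """Enable drivers one at a time; return the first one whose addition
    makes the suspend/resume test fail, or None if none does.

    `resumes_ok` is a hypothetical callback taking the tuple of enabled
    drivers and returning True when the kernel built from them resumes
    reliably.
    """
    enabled = []
    for driver in drivers:
        enabled.append(driver)
        if not resumes_ok(tuple(enabled)):
            return driver
    return None
```

Each call to `resumes_ok` hides a full kernel build and a round of resume tests, so the order of `drivers` matters in practice: putting likely suspects early keeps the number of builds down. Because the failure is intermittent, a `True` result is only as trustworthy as the number of cycles behind it.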
Comment 29 Mark Johnston freebsd_committer 2022-01-31 21:13:36 UTC
(In reply to John Grafton from comment #28)
Since this appears to be related to sound drivers, you might try testing https://reviews.freebsd.org/D34117
Comment 30 Ulrich Spörlein freebsd_committer 2022-02-07 08:44:28 UTC
Thanks Mark! After applying the patch from https://reviews.freebsd.org/D34117 I was able to suspend/resume quite a number of times w/o any ill effect.

Is the bug specific to the Lenovo ACPI/Firmware? Or is this specific to the exact HDA hardware? I mean, shouldn't this bug plague more laptops?
Comment 31 Mark Johnston freebsd_committer 2022-02-07 14:09:30 UTC
(In reply to Ulrich Spörlein from comment #30)
I don't think it's specific to any particular model, I hit the problem on a frame.work laptop.  Only requirement is that it uses intel HDA.
Comment 32 Mark Johnston freebsd_committer 2022-03-17 17:47:24 UTC
Since this appears to be the same as 261207, I'll make it a dup.  The patch should make it into 13.1, in which case please open a new PR if the problem persists in 13.1-RELEASE or in 14.0-CURRENT.

*** This bug has been marked as a duplicate of bug 261207 ***
Comment 33 brtastic.dev 2022-11-22 08:28:31 UTC
Problem is still present in 13.1 for me, opened new issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264145

Writing this comment to notify you, as the new issue hasn't gotten much attention in the last half a year.