Bug 255745 - resume on T490 Thinkpad often broken after upgrade 12 -> 13
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: needs-qa, regression
Depends on:
Blocks:
 
Reported: 2021-05-10 10:13 UTC by Ulrich Spörlein
Modified: 2021-09-08 20:26 UTC
Description Ulrich Spörlein freebsd_committer 2021-05-10 10:13:27 UTC
Under stable/12 this T490 thinkpad was suspend/resuming like a champ. I've recently upgraded to stable/13 and things are busted.

This might be related to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253288 (though I've disabled that module). Often when resuming, the laptop does not power up again, and eventually the fans start spinning.

I did some tests with suspend/resume, and the hang seems to occur only after a longer uptime or when suspending overnight. It's quite non-deterministic.

I could successfully suspend/resume 5 times yesterday between 10am and 10pm. Then left it on overnight. I wanted to reboot this morning and figured I'll test suspend/resume once more. It did not come back from suspend when I opened the lid again 2 minutes later. The typical Thinkpad light was still pulsating though.

How to even start to debug that? What were the changes around suspend/resume between 12 and 13?
Comment 1 John Grafton 2021-05-10 14:06:34 UTC
I am experiencing a similar issue on a Thinkpad X1 Carbon Gen 6.  Under 12.x, suspend/resume has worked without issue since I installed it a couple years ago.  I recently upgraded to 13.0-RELEASE and successful resuming has become very non-deterministic.

Generally, resuming works fine during the day but leaving the system sleeping overnight and attempting to resume the next day fails.  After opening the lid, the power LED blinks as if it's still asleep.  A hard power cycle is required to bring the laptop back online.
Comment 2 Ulrich Spörlein freebsd_committer 2021-05-10 14:19:04 UTC
Yes, exactly! Could you try leaving it running overnight and then suspend/resume it in the morning within say a minute or two? I get the feeling it has to do with uptime also, not just the "sleep time".
Comment 3 John Grafton 2021-05-11 11:11:54 UTC
(In reply to Ulrich Spörlein from comment #2)

I just opened it this morning from an overnight sleep and it worked as expected.  It only has an uptime of 3 hours, however.  I'll leave it open for the day and close/open the lid every few hours and see if it fails.
Comment 4 John Grafton 2021-05-11 15:30:07 UTC
The laptop failed to resume the first time I attempted to close and then open the lid after leaving it open for 4 hours.  So it failed to resume properly after an uptime of ~7 hours.
Comment 5 John Grafton 2021-05-11 19:54:07 UTC
The laptop just failed to resume after an uptime of 4 hours.
Comment 6 John Grafton 2021-05-13 10:37:41 UTC
I've tested suspend/resume on the same Thinkpad X1 Carbon with GhostBSD 21.05.11 (13.0-STABLE) livecd for the past 24 hours and it has not failed.
Comment 7 Ulrich Spörlein freebsd_committer 2021-05-13 13:06:01 UTC
I'm willing to bisect this, but could use some help. It seems that stable/12 was branched off main around 2y7mo ago. So I would need to downgrade to current of that time, then bisect up to the latest main (assuming this also happens on 14-CURRENT).

What are the chances of that downgrade working? I would rebuild world first, install that and hope that it mostly works with the newer kernel. Then downgrade the kernel, yes?

Of course Xorg and co would need a full recompile as well, ugh. I'll probably try to reproduce this in single user first, that would make the later bisect testing a lot easier.

Better ideas? Lucky me I didn't `zpool upgrade` yet, eh?
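
For a rough sense of the cost of such a bisect: git bisect needs about log2(N) builds to isolate one commit out of N. The sketch below is only back-of-the-envelope arithmetic; the commit count and the time per step are hypothetical placeholders, not figures from this report.

```python
import math

def bisect_steps(commits: int) -> int:
    """Approximate number of build/test rounds needed to isolate one commit."""
    return math.ceil(math.log2(commits))

def bisect_hours(commits: int, hours_per_step: float) -> float:
    """Total wall-clock estimate, assuming every step costs the same."""
    return bisect_steps(commits) * hours_per_step

if __name__ == "__main__":
    commits = 30000        # hypothetical size of the range between the branch point and main
    hours_per_step = 2.0   # hypothetical: build, install, reboot and test
    print(bisect_steps(commits), "steps, about",
          bisect_hours(commits, hours_per_step), "hours")
```

The step count grows slowly with the size of the range, so the time per step (and how reliably each step can be judged good or bad) dominates the effort.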
Comment 8 Graham Perrin 2021-05-14 08:21:07 UTC
From the outside looking in, my first thought was DRM. 

I read of one case where it seemed that installation of drm-kmod did _not_ automatically trigger installation of drm-fbsd13-kmod

----

On one hand: 

(In reply to Ulrich Spörlein from comment #0)

> … light was still pulsating …

– I never had a ThinkPad, I assume that the pulse is indicative of the machine still sleeping (zero response to an attempt to wake). For this, I would be less inclined to think of DRM. 

----

On the other hand: 

> … fans start spinning …

I can't imagine that _combined with_ a still-pulsating indicator of sleep. For this, I would be more inclined to think of DRM.
Comment 9 Li-Wen Hsu freebsd_committer 2021-05-19 15:37:51 UTC
This report might be related: https://lists.freebsd.org/pipermail/freebsd-stable/2021-May/093709.html

There is another thread on -current: https://lists.freebsd.org/pipermail/freebsd-current/2020-September/077156.html the symptom is exactly the same.

BTW, I think this is less related to drm. I tried suspend/resume without loading the drm module from ports, after booting into multiuser mode. The same symptom still happens.
Comment 10 John Grafton 2021-05-27 17:19:42 UTC
(In reply to Ulrich Spörlein from comment #7)
I upgraded to CURRENT yesterday (FreeBSD 14.0-CURRENT #1 main-n246885-27f09959d5f5) with the latest DRM and encountered the same issue.  I'll downgrade to 12.2 soon.  Since I upgraded zroot shortly after the 13 upgrade I have to do a full reinstall.
Comment 11 Graham Perrin 2021-05-28 06:52:35 UTC
(In reply to John Grafton from comment #10)

> … 12.2 … upgraded zroot … have to do a full reinstall.

Consider the port of OpenZFS, which is packaged for FreeBSD:12:amd64

<https://www.freshports.org/sysutils/openzfs/#packages>
<https://www.freshports.org/sysutils/openzfs-kmod/#packages>
Comment 12 Graham Perrin 2021-05-28 07:01:24 UTC
(In reply to Li-Wen Hsu from comment #9)

> … There is another thread on -current: 
> https://lists.freebsd.org/pipermail/freebsd-current/2020-September/077156.html 
> the symptom is exactly the same. …

Sorry, that is currently 404 not found (due to bug 256182), can you identify the same thread with an alternative URL? 

<https://lists.freebsd.org/archives/freebsd-current/2020-September/>
Comment 13 Li-Wen Hsu freebsd_committer 2021-05-28 07:32:27 UTC
(In reply to Graham Perrin from comment #12)
https://lists.freebsd.org/archives/freebsd-current/2020-September/166778.html
Comment 14 John Grafton 2021-05-29 13:39:11 UTC
Downgrading to 12.2-RELEASE-p7 fixes the problem completely.
Comment 15 Ulrich Spörlein freebsd_committer 2021-05-29 15:46:19 UTC
As for 13-STABLE it's definitely more subtle than just uptime. I seem to have struck some luck where my laptop managed to suspend/resume quite often in a row. I'm only using it very lightly and could imagine that never entering some thermal throttling state might also help.

Uptime is 5d now, /var/run/dmesg.boot is from May 18, so 11d old now and I think I've suspended it almost every night. So probably 10 successful cycles.

I _did_ restart Xorg once though, as it tends to slow down horribly over time (xterm scrolling like molasses, etc)
Comment 16 Jan Schreiber 2021-07-09 18:10:31 UTC
I ran into the same issue with my ThinkPad x250. Resume always worked with 12.2. On FreeBSD 13, resume seems not to work when the uptime is more than a couple of hours. Sleep time does not seem to make any difference.

I'm running drm-fbsd13-kmod-5.4.92.g20210419 and 13.0-RELEASE-p1. I also set
hw.usb.no_suspend_wait=1 as suggested here: https://lists.freebsd.org/pipermail/freebsd-current/2020-September/077167.html
Makes no difference.

Have you guys found any fix other than downgrading to 12.2? Or any ideas on how to debug this problem? Or does anybody know where to find the code that handles resume?
Comment 17 brtastic.dev 2021-07-12 13:36:25 UTC
I'm pretty sure it also affects me. I've had a Thinkpad T480 (i7 8th gen) for a couple of weeks and have been fighting random resume failures from the start. It has run FreeBSD 13.0-RELEASE from the start, so I can't really tell if it worked on 12, but the symptoms match.

Suspend / resume works well for the most part, but every couple of cycles it's impossible to resume the system. The screen stays completely off, the power diode is blinking and pressing most of the buttons has no effect. Three buttons have small indicator lights on them: Escape (FnLock), F1 (speakers off) and Caps Lock. Pressing them turns that light on or off, and that's it. I can only force-shutdown the computer by holding the power button, and then turn it on again.

I tried changing some BIOS settings: disabled TPM, enabled USB power during low power states and enabled power on / wake up on AC power. None of them changed anything, although after changing the USB power behavior I had no issues for an entire week, which shows just how randomly it behaves.

As for when it happens, I tried suspending / resuming a couple of times just after turning the system on, and it always came back without problems. It mostly happens when I leave it suspended overnight, however there were a couple of times where I just left it suspended for five minutes after using it for just a couple of hours and had the issue occur. Seems like time of usage is having some impact on whether it will occur or not. After booting it again, the last message in /var/log/messages is: thinkpad acpi[21212]: suspend at YYYYMMDD HH:II:SS

I'm on zfs, drm-fbsd13-kmod and xorg with xf86-video-intel. acpi_ibm and acpi_video are loaded. If there's a workaround, I'll be happy to test it
Comment 18 John Grafton 2021-07-18 17:52:52 UTC
I'm currently bisecting the git repo to hunt down where the problem was introduced.  It's slow going but I'm making a bit of progress.  Luckily, the issue is easily reproducible on my X1 Carbon Gen6.

FreeBSD-src commit 6abe97c0140d54d3520c30517b2bdebc3de92a62 is the first I've found in 13.0-CURRENT that doesn't appear to fail.  This requires the DRM port drm-current-kmod-5.4.62.g20201109_1 from Nov 2020 (FreeBSD-ports commit 25e14f16ef2ccfe80e1b45168318d777369ec9ed).

Time has opened up for me in the next couple of weeks and I'll continue researching.
Comment 19 John Grafton 2021-07-26 18:07:55 UTC
After more testing, I was able to get 6abe97c0140d54d3520c30517b2bdebc3de92a62 to fail (making my previous comment incorrect).  Thus I went back to the beginning of 13-CURRENT to find where the issue made its way into the code base.  Instead of November 2020, it appears to have been added in July 2020.

I've spent the past week `git bisecting` my way through CURRENT looking for the beginning of the suspend/resume issue.

Here's a brief description of my testing methodology:
1) Build world / kernel on 12 core system in a 12.2 jail (takes ~ 1/2 hour)
2) Install world / kernel on laptop from NFS to new boot environment on laptop
3) Boot laptop into new boot environment
4) Compile DRM driver for current kernel
5) kldload i915kms driver
6) Close lid and leave closed for 5 seconds
7) Open lid and wait to see if the system fails to resume
8) Repeat lid close and open in console 15 times and in X 5 times
9) If any resume fails, mark the bisect step bad; if all resumes succeed, mark it good
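
Steps 6 to 9 could be recorded with a small harness along these lines. This is only a sketch under assumptions: the `suspend` callable is a hypothetical placeholder for whatever triggers the suspend and blocks until the machine is back (in the procedure above the lid was closed by hand), and the log format is made up for illustration. A marker is synced to disk before each suspend so that it survives the hard power cycle a hung resume requires.

```python
import os
from typing import Callable

def run_cycles(cycles: int, suspend: Callable[[], None], log_path: str) -> None:
    """Run suspend/resume cycles, syncing a marker to disk before and after each."""
    with open(log_path, "a") as log:
        for i in range(1, cycles + 1):
            log.write(f"{i} suspending\n")
            log.flush()
            os.fsync(log.fileno())  # marker must survive a forced power-off
            suspend()               # hypothetical: returns only after a successful resume
            log.write(f"{i} resumed\n")
            log.flush()
            os.fsync(log.fileno())

def verdict(log_path: str) -> str:
    """Return 'bad' if the last logged cycle never resumed, otherwise 'good'."""
    with open(log_path) as log:
        lines = [line.strip() for line in log if line.strip()]
    if not lines:
        raise ValueError("no cycles logged")
    return "bad" if lines[-1].endswith("suspending") else "good"
```

After a reboot, `verdict()` on the log from the previous boot gives the value to feed to `git bisect good` or `git bisect bad`.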

Usually, resumes would fail after 5 or so tries.  I never had a resume fail in X that didn't first fail in the console.  It did not seem to make a difference if I was on the console or in X.org.

I found that before 12b2f3daaa597f346a4b0065bf7f75378524ef88 the X1 Carbon Gen 6 resumes all 20 times without issue.  The main flaw in my methodology is assuming 20 suspend and resumes are enough to accurately test for the issue.  
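
To put a number on that flaw: if each cycle hangs independently with probability p, a bad commit still passes all n cycles with probability (1 - p)^n and gets wrongly marked good. The sketch below assumes independence, which is a simplification (reports in this thread suggest the failure rate depends on uptime), and the values of p are hypothetical.

```python
import math

def false_good_probability(p_fail: float, cycles: int) -> float:
    """Chance that a bad commit passes every cycle and is wrongly marked good."""
    return (1.0 - p_fail) ** cycles

def cycles_needed(p_fail: float, max_false_good: float) -> int:
    """Smallest cycle count that keeps the false-good chance at or below the limit."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail))

if __name__ == "__main__":
    for p in (0.2, 0.05, 0.01):  # hypothetical per-cycle hang rates
        print(f"p={p}: false-good after 20 cycles = "
              f"{false_good_probability(p, 20):.3f}, "
              f"cycles needed for <=1% = {cycles_needed(p, 0.01)}")
```

Under these assumptions, 20 cycles are only adequate when a bad kernel hangs roughly once in five tries; for rarer hangs, most bad commits would pass the test, and one wrong "good" is enough to send the bisect down the wrong half of the history.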

I'm currently using src commit e7677232d6eed5f5cae80c1d5968eea5b9266b59 with graphics/drm-current-kmod from ports commit d9b8d3b2b3b5ffcf4b83572098d64803fe237b90 as my daily laptop to test the assumption that 20 suspend and resumes are enough to make a generalization about whether my tests were valid.

Commit 12b2f3daaa597f346a4b0065bf7f75378524ef88 being flagged as the culprit is very strange to me, because that commit appears to be essentially a no-op: it just cleans up #ifdef statements for older systems.  12.2, which does not fail at all, seems to have similar patches in place.

My next steps are to continue testing this older commit as my daily laptop and see if I can get it to fail.  Then attempt to build a version of 13-RELEASE with what appears to be the offending patch removed and see if it fails.
Comment 20 John Grafton 2021-08-30 14:13:30 UTC
(In reply to John Grafton from comment #19)

Unfortunately, my fears turned out to be valid.  The testing methodology I used didn't capture all failure cases and I found commits before 12b2f3daaa597f346a4b0065bf7f75378524ef88 that hang the laptop on resume.

Back to square one.  :(