268276 – Regression: Black screen on resume caused by commit 9e007a88d65b

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	freebsd-bugs (Nobody)

URL:	https://cgit.freebsd.org/src/commit/?...
Keywords:	needs-qa, regression

Depends on:
Blocks:

Reported:	2022-12-09 15:17 UTC by Austin Shafer
Modified:	2022-12-11 18:53 UTC
CC List:	2 users

See Also:	https://github.com/amshafer/nvidia-driver/issues/1

Description Austin Shafer 2022-12-09 15:17:30 UTC

I've finally narrowed down the cause of suspend/resume breaking on my Ryzen system for the past year. Commit 9e007a88d65b changed the polling rate of atkbd, which for some reason causes the GPU to disappear from the PCI bus on resume, leaving the screen black.

Author: Alexander Motin <mav@FreeBSD.org>
Date:   Wed Jan 5 11:32:44 2022 -0500

    atkbd: Reduce polling rate from 10Hz to ~1Hz.
    
    In my understanding this is only needed to workaround lost interrupts.
    I was thinking to remove it completely, but the comment about edge-
    triggered interrupt may be true and needs deeper investigation.  ~1Hz
    should be often enough to handle the supposedly rare loss cases, but
    rare enough to not appear in top.  Add sysctl hw.atkbd.hz to tune it.
    
    MFC after:      1 month

The workaround is to set hw.atkbd.hz=10 in /boot/loader.conf, which restores the previous 10Hz polling rate.
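
A minimal sketch of that workaround as a loader.conf fragment. The quoting follows the usual loader.conf convention; the only name used is the hw.atkbd.hz sysctl introduced by the commit above, and the value 10 is the pre-commit polling rate in Hz:

```
# /boot/loader.conf
# Restore the atkbd polling rate used before commit 9e007a88d65b (10 Hz).
hw.atkbd.hz="10"
```

The setting takes effect on the next boot.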

System is AMD Ryzen 9 5900X, TUF Gaming B550-PLUS motherboard, NVIDIA GTX 960. I did update the motherboard firmware, but that didn't help.

After resuming you can usually still ssh into the machine, but if you try to do anything graphical the following is printed:

Dec  9 02:12:32 mick kernel: NVRM: GPU at PCI:0000:07:00: GPU-8293a5fd-a5ed-570d-283f-675298ebf38c
Dec  9 02:12:32 mick kernel: NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec  9 02:12:32 mick kernel: NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Dec  9 02:12:32 mick devd[384]: notify_clients: send() failed; dropping unresponsive client
Dec  9 02:12:32 mick kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
Dec  9 02:12:32 mick syslogd: last message repeated 2 times
Dec  9 02:12:32 mick kernel: nvidia-modeset: ERROR: GPU:0: Failure reading maximum pixel clock value for display device HDMI-0.
Dec  9 02:12:32 mick kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices


I first noticed this on GhostBSD, and for some reason couldn't reproduce the bisect range on FreeBSD kernels. I had to bisect between GhostBSD's 21.12.24 and 22.3.16 kernel releases to find this commit. Then I could apply the sysctl workaround to a FreeBSD CURRENT kernel and have suspend/resume working again.

Why was this change made? Is there some performance reason why we don't want to be polling atkbd so often? I'm not sure why this would affect the entire PCI bus, but since it breaks suspend/resume on certain machines, it would be nice to get a fix into base so things work out of the box again without having to add the sysctl workaround.

Comment 1 Alexander Motin freebsd_committer 2022-12-09 15:30:36 UTC

The change was made to save CPU power by not waking up 10 times per second for no good reason.  I have no idea why it could affect the GPU.  Maybe we could explicitly poll the keyboard during resume, if that helps this situation somehow.

Comment 2 Austin Shafer 2022-12-09 16:37:43 UTC

I don't think it's directly affecting the GPU per se, but instead is causing something wacky to happen with the PCI bus. Then the GPU can't be found on the bus, so the nvidia driver bails and the screen stays black. The GPU fans fully spin up, so it's not like the GPU doesn't have power.

Thanks, saving CPU makes sense, and polling on resume is an interesting idea. I wonder if not polling makes us do something different that causes us to miss an interrupt or something. It's hard to know without more info. What should I do to get better data?

---

Also I feel like I should mention this (probably unrelated) issue: https://github.com/amshafer/nvidia-driver/issues/1

This predates commit 9e007a88d65b and therefore has a different root cause, but I'm linking it because the symptoms are the same. In that case it looked like ACPI/buggy firmware, but I find it interesting as a second data point. Maybe the atkbd polling causes us or the firmware to hit a situation like this for some reason.

Comment 3 Graham Perrin freebsd_committer 2022-12-11 18:53:48 UTC

Cherry-pick <https://github.com/freebsd/freebsd-src/commit/35f0bf2d37a49a25a81e25de2ffbd7cba850be36> (2022-02-04) is on branches releng/13.1 and stable/13.