Bug 253733 - vesa.ko: Invalid BIOS call when resuming from S3 suspend/sleep causes nvidia driver hang
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: Jung-uk Kim
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-21 00:00 UTC by Stefan B.
Modified: 2021-04-20 04:13 UTC

See Also:


Attachments
Disable POST and save/restore states on NVIDIA cards (3.00 KB, patch)
2021-03-12 19:50 UTC, Jung-uk Kim

Description Stefan B. 2021-02-21 00:00:48 UTC
In vesa.ko there is a function that gets called when resuming from sleep, e.g. when resuming after an S3 suspend triggered via 'zzz'.
This function makes a BIOS call that is meant to restore the state the graphics card had before powering off.

On Nvidia cards this BIOS function seems to be implemented in a different way than on most other cards.
For this reason, calling this BIOS function causes the Nvidia graphics driver to hang, failing to resume.
(For technical background, read my discussion with jkim in PR 224069: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224069 )

Reproducing the bug is easy:
-Install FreeBSD (e.g. using the GENERIC kernel).
-Enable sc (kern.vty="sc" in /boot/loader.conf).
-Install xorg. Install and configure the nvidia driver. Reboot and start xorg via startx.
-Enter "zzz" in an xterm.
-Watch the system/driver hang, with keyboard (PS/2) and mouse becoming unresponsive when it attempts to switch back to graphics mode.
-Hitting the power button results in no visible change until, after a timeout, a message about an unresponsive stop job (presumably the nvidia driver) appears shortly before the system powers off.

To reproduce the bug it is essential to use the GENERIC kernel!
The reason is that building and installing a custom kernel without "options VESA" works around the hang after suspend/resume. For that workaround it is also important not to have vt and its helper modules (vt_efifb etc.) in the kernel, as these pull in the vesa.ko showstopper module.

Already back in 2017 I found that skipping (commenting out) the Nvidia BIOS call fixes the issue, making resume work reliably.

So I believe the proper fix would be:
1. check whether the graphics card is Nvidia
2. if it is Nvidia, skip that BIOS call in /usr/src/sys/dev/fb/vesa.c line 520.

Pseudocode for a patch might look like this:

(+)if (! nvidia_card_is_installed) {
x86bios_intr(&regs, 0x10);
(+)}
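
One way the nvidia_card_is_installed condition above could be derived is from PCI IDs. The following is a minimal userland sketch, not the actual patch: struct pci_dev_info and the device list are hypothetical stand-ins for whatever the kernel's PCI enumeration provides. Only the NVIDIA PCI vendor ID (0x10de) and the PCI display base class (0x03) are taken as known values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_VENDOR_NVIDIA	0x10de	/* NVIDIA's PCI vendor ID */
#define PCI_CLASS_DISPLAY	0x03	/* PCI base class: display controller */

/* Hypothetical, simplified stand-in for a PCI device record. */
struct pci_dev_info {
	uint16_t vendor;
	uint8_t  base_class;
};

/* Return true if any display-class device in the list is from NVIDIA. */
static bool
nvidia_card_is_installed(const struct pci_dev_info *devs, size_t ndevs)
{
	assert(devs != NULL || ndevs == 0);
	for (size_t i = 0; i < ndevs; i++) {
		if (devs[i].base_class == PCI_CLASS_DISPLAY &&
		    devs[i].vendor == PCI_VENDOR_NVIDIA)
			return (true);
	}
	return (false);
}
```

Checking the base class as well as the vendor avoids matching non-display NVIDIA functions, such as the audio function found on some cards.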


Side note:
All my systems use the sc console, not the vt console.
So I do not know the system behaviour when using vt.
Thus please use sc in text mode when reproducing the bug!
Comment 1 Jung-uk Kim freebsd_committer 2021-03-12 19:50:19 UTC
Created attachment 223218
Disable POST and save/restore states on NVIDIA cards

Please try the attached patch.  I cannot test it because I don't use syscons(4) any more.
Comment 2 Stefan B. 2021-03-13 01:09:44 UTC
Thank you very much, Jung-uk!

I am going to test the patch on my computers using different nvidia cards/drivers.

To make sure it works reliably, I want to test long enough, to accumulate sufficient uptime and suspend/resume cycles.
So it might take a few days until I report back.
Comment 3 Stefan B. 2021-04-04 06:07:07 UTC
Sorry for the late update. I was busy with other things.

I tested the patch. It does not work: the system still hangs in text mode when resuming.
I am not sure of the exact cause yet.

Apparently I didn't express myself clearly enough: it is *only* the LOAD_STATE call which breaks resume and needs to be omitted in the case of an Nvidia card/chip.
I verified this is still valid by commenting out the x86bios_intr() call in the case STATE_LOAD: of vesa_bios_save_restore().

So I believe the other VESA calls, including POST, do *not* have a negative impact on suspend/resume.
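
To illustrate what I mean, here is a minimal userland sketch of the intended behaviour, not a patch against vesa.c. The STATE_* values are assumed to follow the VBE function 4F04h subfunction codes (0 = size, 1 = save, 2 = restore). fake_bios_intr() and save_restore() are hypothetical stand-ins for x86bios_intr() and vesa_bios_save_restore(), and is_nvidia is assumed to come from some detection step.

```c
#include <assert.h>
#include <stdbool.h>

/* Subfunction codes, assumed to match VBE function 4F04h (DL register). */
enum { STATE_SIZE = 0, STATE_SAVE = 1, STATE_LOAD = 2 };

static int bios_calls;	/* counts simulated INT 10h calls */

/* Stand-in for x86bios_intr(&regs, 0x10); it only counts invocations. */
static void
fake_bios_intr(int state)
{
	assert(state >= STATE_SIZE && state <= STATE_LOAD);
	bios_calls++;
}

/* Returns 0 if the BIOS was called, 1 if skipped, -1 on unknown state. */
static int
save_restore(int state, bool is_nvidia)
{
	switch (state) {
	case STATE_SIZE:
	case STATE_SAVE:
		fake_bios_intr(state);	/* unaffected on all cards */
		return (0);
	case STATE_LOAD:
		if (is_nvidia)
			return (1);	/* skip only the restore call */
		fake_bios_intr(state);
		return (0);
	default:
		return (-1);
	}
}
```

With is_nvidia set, save_restore(STATE_SAVE, true) still reaches the BIOS stub, while save_restore(STATE_LOAD, true) returns without calling it.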


I didn't test yet whether vesa_find_pci_device() actually finds the card which responds to the VESA BIOS call (but will do soon using some debug printfs).
So I can not rule out yet that a problem there could be the potential cause for the patch not working.


Another issue, where I am not yet sure whether it matters:

Some OEMs have, in some cases, placed their onboard video BIOS at locations other than C000. I remember some cases I personally encountered where the video BIOS was at E000.
For this reason I am not really sure whether the approach of checking for a C000 BIOS start address is 100% safe.


I am now thinking about scanning the OEM string returned by function 4F00 for "nvidia" (case-insensitive), i.e. the string the VESA 1.2 OEMStringPtr points to.
This approach would be independent of the Option BIOS memory address.
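
A rough sketch of such a check could look like the following. It is plain userland C with a hand-written case-insensitive substring search; contains_nocase() and oem_string_is_nvidia() are hypothetical helper names, and the code assumes the OEM string has already been copied into a NUL-terminated buffer. Whether real Nvidia BIOSes actually put "nvidia" into that string is exactly what still needs to be verified.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Case-insensitive substring test; true if needle occurs in haystack. */
static bool
contains_nocase(const char *haystack, const char *needle)
{
	assert(haystack != NULL && needle != NULL);
	if (*needle == '\0')
		return (true);
	for (; *haystack != '\0'; haystack++) {
		size_t i = 0;

		/* Compare character by character, ignoring case. */
		while (needle[i] != '\0' && haystack[i] != '\0' &&
		    tolower((unsigned char)haystack[i]) ==
		    tolower((unsigned char)needle[i]))
			i++;
		if (needle[i] == '\0')
			return (true);
	}
	return (false);
}

/* True if the (already copied, NUL-terminated) OEM string mentions nvidia. */
static bool
oem_string_is_nvidia(const char *oem)
{
	return (oem != NULL && contains_nocase(oem, "nvidia"));
}
```

The search is written out by hand rather than relying on strcasestr(), which is not part of standard C.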

I have about ten different Nvidia cards and onboard chips, NV4 and higher, and will read out their OEMString via debug printfs, to find out whether this alternative approach could be viable.

As I am currently moving my hardware lab, it will take about 1-3 weeks until I report back, maybe with an updated patch.