Bug 274519 - panic with x11/nvidia-driver-470: sleepq_add: td … to sleep on wchan … with sleeping prohibited
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2023-10-16 17:50 UTC by Graham Perrin
Modified: 2024-03-28 19:09 UTC

See Also:


Attachments
head of /var/crash/core.txt.2 (5.63 KB, text/plain)
2023-10-16 17:50 UTC, Graham Perrin
poudriere bulk results, viewed after the event (198.01 KB, image/png)
2023-10-16 17:55 UTC, Graham Perrin
A photograph of one of two external displays that were in use when the panic occurred (755.94 KB, image/png)
2024-02-23 22:41 UTC, Graham Perrin
head of today's /var/crash/core.txt.3 (29.78 KB, text/plain)
2024-02-23 22:55 UTC, Graham Perrin
A photograph of one of two external displays that were in use on 14th February (705.82 KB, image/png)
2024-02-23 23:05 UTC, Graham Perrin

Description Graham Perrin 2023-10-16 17:50:41 UTC
Created attachment 245683
head of /var/crash/core.txt.2

The panic occurred around fourteen minutes after beginning a run of poudriere-devel:

poudriere bulk -j main -J 3 -Ctv mail/thunderbird


Dump header from device: /dev/ada1p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 4845637632
  Blocksize: 512
  Compression: none
  Dumptime: 2023-10-16 15:18:14 +0100
  Hostname: mowa219-gjp4-8570p-freebsd
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 15.0-CURRENT #1 main-n265830-fee14577d590-dirty: Mon Oct  9 13:08:19 BST 2023
    grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC
  Panic String: sleepq_add: td 0xfffffe0144345740 to sleep on wchan 0xffffffff8543a120 with sleeping prohibited
  Dump Parity: 1912812553
  Bounds: 2
  Dump Status: good
Comment 1 Graham Perrin 2023-10-16 17:55:12 UTC
Created attachment 245684
poudriere bulk results, viewed after the event
Comment 2 Graham Perrin 2023-10-16 20:18:16 UTC
linux-nvidia-libs-470-470.161.03
nvidia-driver-470-470.161.03
Comment 3 Graham Perrin 2023-10-17 04:36:57 UTC
locking(9)
<https://man.freebsd.org/cgi/man.cgi?query=locking&sektion=9&manpath=freebsd-release>

<https://www.freshports.org/x11/nvidia-driver-470/>

Cc: danfe@ (maintainer)

At <https://discord.com/channels/727023752348434432/831066226074976267/1163537365134999622> I mentioned a post in NVIDIA Developer Forums: 

<https://forums.developer.nvidia.com/t/panic-related-to-nvkms-timers-lock-sx-lock/55376> (2017-11-20)

> panic related to nvkms_timers.lock (sx lock)

Re: comment #2:

> That one might not have the fix …
> 
> Basically it got switched to use a spin mutex instead
Comment 4 Graham Perrin 2024-02-23 22:41:43 UTC
Created attachment 248704
A photograph of one of two external displays that were in use when the panic occurred

Today's panic occurred whilst playing GeoGuessr <https://www.geoguessr.com/> in Chromium, full screen. Space bar pressed and held after a click on the on-screen blue arrow (to travel as fast as possible). 

On the second external display: nothing other than Teams in Firefox. 

Note: play in Chromium coincided with a kernel panic at least once before … for what it's worth, I tend to avoid Chromium for activities such as this (I did suspect that a panic would occur before I began play today). 


----

Dump header from device: /dev/ada1p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 3358982144
  Blocksize: 512
  Compression: none
  Dumptime: 2024-02-23 13:29:55 +0000
  Hostname: mowa219-gjp4-zbook-freebsd
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 15.0-CURRENT main-n268493-759a996d610d GENERIC
  Panic String: sleepq_add: td 0xfffff80032314000 to sleep on wchan 0xffffffff8443a120 with sleeping prohibited
  Dump Parity: 2544365881
  Bounds: 3
  Dump Status: good

----

<https://cgit.freebsd.org/src/log/?qt=range&q=759a996d610d>
Comment 5 Graham Perrin 2024-02-23 22:55:05 UTC
Created attachment 248705
head of today's /var/crash/core.txt.3
Comment 6 Graham Perrin 2024-02-23 23:05:09 UTC
Created attachment 248706
A photograph of one of two external displays that were in use on 14th February

(In reply to Graham Perrin from comment #4)

> … Note: play in Chromium coincided with a kernel panic at least once before …

I found a photograph from 14th February. The same as today: GeoGuessr in Chromium, full screen. 

Original photograph metadata: 

    14 Feb
    Wed, 12:57GMT+00:00

– that is, around eight minutes after the time of the dump (see below); the freeze remains visible for a long time before an automated restart of the OS. 

----

Dump header from device: /dev/ada1p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 4074110976
  Blocksize: 512
  Compression: none
  Dumptime: 2024-02-14 12:49:14 +0000
  Hostname: mowa219-gjp4-zbook-freebsd
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 15.0-CURRENT main-n268149-eb86c6c5b462 GENERIC
  Panic String: sleepq_add: td 0xfffff80013259000 to sleep on wchan 0xffffffff8443a120 with sleeping prohibited
  Dump Parity: 3755260517
  Bounds: 0
  Dump Status: good
Comment 7 Tomoaki AOKI 2024-02-24 02:47:37 UTC
Just an FYI:
I just found that NVIDIA released legacy driver 470.239.06 on Feb. 22, 2024.

 https://www.nvidia.com/Download/driverResults.aspx/218854/en-us/

You can try it by overriding DISTVERSION and PKGNAMESUFFIX on x11/nvidia-driver, just as the x11/nvidia-driver-470 port does, together with NO_CHECKSUM=YES.

As I'm not using the -470 version of the driver or x11/linux-nvidia-libs, I cannot confirm whether it builds and installs correctly.
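
As a rough, untested sketch of the override described above: the helper below only prints a make invocation rather than running it. The function name nvidia_470_make_cmd, the default ports tree location /usr/ports, and the choice of the build target are assumptions for illustration, not something confirmed in this report.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a make command that overrides
# DISTVERSION and PKGNAMESUFFIX on the master port and skips checksum checks.
# PORTSDIR defaulting to /usr/ports is an assumption about the local setup.
nvidia_470_make_cmd() {
    version="$1"
    target="${2:-build}"
    portsdir="${PORTSDIR:-/usr/ports}"
    printf 'make -C %s/x11/nvidia-driver DISTVERSION=%s PKGNAMESUFFIX=-470 NO_CHECKSUM=yes %s\n' \
        "$portsdir" "$version" "$target"
}

# Show the command for the 470.239.06 legacy driver mentioned above.
nvidia_470_make_cmd 470.239.06
```

Printing the command first makes it easy to review the overrides before running anything against a live ports tree.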


BTW, I found
DISTINFO_FILE?=		${MASTERDIR}/distinfo
line in Mk/bsd.ports.mk.

It seems to allow conditional switching of distinfo.
What do you think about introducing something like
DISTINFO_FILE?=		${MASTERDIR}/distinfo${PKGNAMESUFFIX}
in x11/nvidia-driver/Makefile and splitting distinfo as below?

distinfo for master port (PKGNAMESUFFIX is not set),
distinfo-470 for *-470 ports (PKGNAMESUFFIX is set to -470 in slave port),
...

I have not tried this yet, so it may not work as intended.
But if it works, I think it would make things easier when NVIDIA releases an updated legacy driver and only a bump of DISTVERSION and distinfo is required, independently of any work ongoing on the master port, as is the case just now.
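
To illustrate how the proposed default might resolve, here is a small shell sketch that mimics the make(1) variable expansion with shell variables. The function name distinfo_file and the example MASTERDIR path are illustrative assumptions, and nothing here touches the real ports framework.

```shell
#!/bin/sh
# Mimic DISTINFO_FILE?= ${MASTERDIR}/distinfo${PKGNAMESUFFIX} using shell
# parameter expansion. An empty or missing suffix yields plain "distinfo".
distinfo_file() {
    masterdir="$1"
    suffix="${2:-}"
    printf '%s/distinfo%s\n' "$masterdir" "$suffix"
}

# Master port: PKGNAMESUFFIX is not set.
distinfo_file /usr/ports/x11/nvidia-driver
# Slave port: PKGNAMESUFFIX is set to -470.
distinfo_file /usr/ports/x11/nvidia-driver -470
```

Under these assumptions the master port keeps using distinfo, while the -470 slave port would pick up distinfo-470 from the same directory.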
Comment 8 Graham Perrin 2024-02-24 04:59:36 UTC
(In reply to Tomoaki AOKI from comment #7)

Thanks.

<https://discord.com/channels/727023752348434432/757305573866733680/1210783179766501377> Austin Shafer mentions fixes in a later release that are probably not yet backported to 470.⋯.
Comment 9 Graham Perrin 2024-03-08 06:51:02 UTC
<https://discord.com/channels/727023752348434432/757305573866733680/1215471550862598195> from Austin Shafer:

> You can also compare it with this: 
> <https://github.com/amshafer/nvidia-driver/blob/535.98/nvidia/src/nvidia-modeset/nvidia-modeset-freebsd.c#L76> 
> and see the differences in locking and the task queue
> 
> See the sx lock for nvkms_lock and the fast nvkms task queue. These are 
> what 470 is missing iirc
Comment 10 Graham Perrin 2024-03-09 00:36:00 UTC
<https://github.com/amshafer/freebsd-ports/commit/b2d030183a661703c8b3c0000169df077284e1b8>

>  WIP: x11/nvidia-driver/470: Backport fix for nvidia-modeset panic
> 
> This backports a fix where a non-sleepable lock is held while sleeping
> occurs. This is fixed in more recent versions but is still causing issues
> in 470.
> 
> PR: 274519
Comment 11 Graham Perrin 2024-03-11 21:41:16 UTC
(In reply to Graham Perrin from comment #10)

Instead (again, work in progress): <https://github.com/amshafer/freebsd-ports/commit/635c3df3fefbe00ffe6aaf51df8aa20f906594ac> …
Comment 12 commit-hook freebsd_committer freebsd_triage 2024-03-22 17:39:55 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=cbbce9a123da84852f289ba5aa53b4955b53a2dd

commit cbbce9a123da84852f289ba5aa53b4955b53a2dd
Author:     Austin Shafer <ashafer@badland.io>
AuthorDate: 2024-03-22 16:26:58 +0000
Commit:     Gleb Popov <arrowd@FreeBSD.org>
CommitDate: 2024-03-22 17:38:25 +0000

    x11/nvidia-driver-470: Backport fix for nvidia-modeset panic

    PR: 274519

    Differential Revision: https://reviews.freebsd.org/D44432

 x11/nvidia-driver-470/Makefile                     |   2 +-
 ...tch-src_nvidia-modeset_nvidia-modeset-freebsd.c | 102 +++++++++++++++++++--
 2 files changed, 94 insertions(+), 10 deletions(-)
Comment 13 Graham Perrin 2024-03-28 19:09:12 UTC
👍 thanks