Bug 251015 - x11/nvidia-driver-390: regularly crashes
Summary: x11/nvidia-driver-390: regularly crashes
Status: Open
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Alexey Dokuchaev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-10 13:00 UTC by Martin Birgmeier
Modified: 2021-05-13 03:38 UTC
CC List: 2 users

See Also:
bugzilla: maintainer-feedback? (danfe)


Attachments
core.txt.*, Xorg.0.log (369.35 KB, application/gzip)
2020-11-10 13:00 UTC, Martin Birgmeier
more nvidia crashes (745.87 KB, application/gzip)
2021-05-06 19:03 UTC, Martin Birgmeier

Description Martin Birgmeier 2020-11-10 13:00:24 UTC
Created attachment 219520
core.txt.*, Xorg.0.log

Scenario:
- FreeBSD 12.1-RELEASE-p6 #6 r362488M, built with debug
- unattended reboot (sysctl debug.debugger_on_panic=0)
- Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
- NVIDIA GPU Quadro 1000M (GF108GL) at PCI:1:0:0
- 24 GB main memory
- ports at latest
- using x11/nvidia-driver-390 to drive the graphics card
- KDE running

Result:
- Even without heavy graphics activity (the user may be away), FreeBSD crashes regularly
- The core dumps indicate issues with the nvidia driver

Three /var/crash/core.txt.* files are attached as well as Xorg.0.log.

-- Martin
Comment 1 Martin Birgmeier 2020-12-26 09:52:18 UTC
The crashes keep occurring regularly - always at _nv007402rm+0x12. Now with FreeBSD 12.2 instead of 12.1.

With such a definite crash source, shouldn't it be easy to fix it? :-)

-- Martin
Comment 2 Alex S 2020-12-26 10:39:22 UTC
(In reply to Martin Birgmeier from comment #1)

I can't tell if you are joking or not.
Comment 3 Jason W. Bacon freebsd_committer 2020-12-27 01:18:49 UTC
I'm seeing immediate panics on an iMac9,1 GeForce 9400 since upgrading to 12.2-RELEASE.  Under 12.1-RELEASE, it was mostly fine but did panic on rare occasions.
Comment 4 Jason W. Bacon freebsd_committer 2020-12-27 01:19:38 UTC
(In reply to Jason W. Bacon from comment #3)

Scrap that, I misread the version.  My iMac is running 340.  390 reports that the chipset is supported by 340.
Comment 5 Martin Birgmeier 2020-12-28 11:34:40 UTC
Alexey,

Could you as maintainer please contact Nvidia about this PR and get them to fix the issue?

-- Martin
Comment 6 Martin Birgmeier 2021-05-06 19:03:41 UTC
Created attachment 224733
more nvidia crashes

Here are some more recent crashes... it is always at the same symbol _nv007402rm.

Maybe it would be easy to contact Nvidia with this information and ask for a fix?

-- Martin
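The observation that every crash ends at the same symbol can be checked mechanically across the saved crash summaries. The following is only a minimal sketch, assuming the core.txt.* files contain kgdb-style frame lines of the form `#N 0xADDR in SYMBOL () from /boot/modules/nvidia.ko`; the function names (`first_nvidia_symbol`, `tally`, `tally_crash_files`) are made up for illustration and are not part of any FreeBSD tool.

```python
import glob
import re
from collections import Counter

# Matches kgdb-style frames that name their module, for example:
#   #8  0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
# A leading "> " quote prefix is tolerated.
FRAME_RE = re.compile(
    r"^[>\s]*#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s+\(.*\)\s+from\s+(\S+)"
)


def first_nvidia_symbol(text):
    """Return the symbol of the lowest-numbered frame from nvidia.ko, or None."""
    best = None
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m and m.group(3).endswith("nvidia.ko"):
            number = int(m.group(1))
            if best is None or number < best[0]:
                best = (number, m.group(2))
    return best[1] if best else None


def tally(texts):
    """Count how often each innermost nvidia.ko symbol appears across reports."""
    return Counter(first_nvidia_symbol(t) for t in texts)


def tally_crash_files(pattern="/var/crash/core.txt.*"):
    """Read every file matching the pattern (an assumed location) and tally it."""
    texts = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            texts.append(handle.read())
    return tally(texts)
```

Calling `tally_crash_files()` on the affected machine would return a `Counter`; if the crashes really share one source, a single symbol should dominate it. Reports without any nvidia.ko frame are counted under `None`.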
Comment 7 Alexey Dokuchaev freebsd_committer 2021-05-08 09:41:01 UTC
This might be related to bug #195097...  Could you try to apply (by hand, the current code is a bit different) the patch https://bz-attachments.freebsd.org/attachment.cgi?id=170499 and see if it makes any difference?
Comment 8 Martin Birgmeier 2021-05-09 07:36:44 UTC
Thank you for the pointer.

The crash happens randomly during operation, but the patch seems to address an open/close issue. What leads you to believe it might help?

-- Martin
Comment 9 Alexey Dokuchaev freebsd_committer 2021-05-13 03:38:41 UTC
(In reply to Martin Birgmeier from comment #8)
> What leads you to believe it might help?
There were quite a few similar reports in the past (bug #193622, https://bugzilla.redhat.com/show_bug.cgi?id=589007, https://forums.developer.nvidia.com/t/gpu-stuck-during-deep-learning-training/115258), and in all of them the last non-obfuscated function call before the obfuscated _nvXXXXrm() chain was rm_free_unused_clients(), so it seemed that something was wrong with the resource management teardown logic.

> #8  0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
> #9  0xfffffe00a7bebd50 in ?? ()
> #10 0xffffffff82077a69 in _nv007400rm () from /boot/modules/nvidia.ko
> #11 0xfffffe00a7bebd50 in ?? ()
> #12 0xfffffe00a7bebda0 in ?? ()
> #13 0x0000000000000000 in ?? ()
> (kgdb)
However, I don't see that call in your trace (and the stack trace is rather short), so you're probably right: it must be something else in your case.  Too bad NVIDIA obfuscates the Resource Manager API. :-(
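
The check described above (look for the first named function that called into the obfuscated _nvXXXXrm() chain) can be written down as a short script. This is a rough sketch, assuming kgdb-style frame lines like the ones quoted in this comment; `parse_frames` and `caller_of_obfuscated_chain` are illustrative names, and the `_nv<digits>rm` pattern is inferred from the symbols seen in this report rather than from any NVIDIA documentation.

```python
import re

# Frame lines look like "#8  0xffffffff82077bf2 in _nv007402rm () from ...".
# The address part is optional, and a leading "> " quote prefix is tolerated.
FRAME_RE = re.compile(r"^[>\s]*#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")

# Assumed shape of the obfuscated Resource Manager symbols.
OBFUSCATED_RE = re.compile(r"^_nv\d+rm$")


def parse_frames(backtrace):
    """Return (frame number, symbol) pairs ordered from innermost to outermost."""
    frames = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2)))
    return sorted(frames)


def caller_of_obfuscated_chain(backtrace):
    """Return the first resolved, non-obfuscated symbol above an obfuscated frame.

    Returns None when the trace has no obfuscated frame or when every frame
    above the obfuscated ones is unresolved ("??").
    """
    seen_obfuscated = False
    for _, symbol in parse_frames(backtrace):
        if OBFUSCATED_RE.match(symbol):
            seen_obfuscated = True
        elif seen_obfuscated and symbol != "??":
            return symbol
    return None
```

For the frames quoted above the function returns `None`, which matches the conclusion that rm_free_unused_clients() is not visible in this trace; a trace from one of the older reports would be expected to return `"rm_free_unused_clients"` instead.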