Bug 251015 - x11/nvidia-driver-390: regularly crashes
Summary: x11/nvidia-driver-390: regularly crashes
Status: Open
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Alexey Dokuchaev
URL:
Keywords:
Duplicates: 235865
Depends on:
Blocks:
 
Reported: 2020-11-10 13:00 UTC by Martin Birgmeier
Modified: 2023-10-31 04:19 UTC

See Also:
bugzilla: maintainer-feedback? (danfe)


Attachments
core.txt.*, Xorg.0.log (369.35 KB, application/gzip)
2020-11-10 13:00 UTC, Martin Birgmeier
more nvidia crashes (745.87 KB, application/gzip)
2021-05-06 19:03 UTC, Martin Birgmeier

Description Martin Birgmeier 2020-11-10 13:00:24 UTC
Created attachment 219520
core.txt.*, Xorg.0.log

Scenario:
- FreeBSD 12.1-RELEASE-p6 #6 r362488M, built with debug
- unattended reboot (sysctl debug.debugger_on_panic=0)
- Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
- NVIDIA GPU Quadro 1000M (GF108GL) at PCI:1:0:0
- 24 GB main memory
- ports at latest
- using x11/nvidia-driver-390 to drive the graphics card
- KDE running

Result:
- Even without great graphics activity (the user may be away) FreeBSD crashes regularly
- The core dumps indicate issues with the nvidia driver

Three /var/crash/core.txt.* files are attached as well as Xorg.0.log.

-- Martin
Comment 1 Martin Birgmeier 2020-12-26 09:52:18 UTC
The crashes keep occurring regularly - always at _nv007402rm+0x12. Now with FreeBSD 12.2 instead of 12.1.

With such a definite crash source, shouldn't it be easy to fix it? :-)

-- Martin
Comment 2 Alex S 2020-12-26 10:39:22 UTC
(In reply to Martin Birgmeier from comment #1)

I can't tell if you are joking or not.
Comment 3 Jason W. Bacon freebsd_committer freebsd_triage 2020-12-27 01:18:49 UTC
I'm seeing immediate panics on an iMac9,1 GeForce 9400 since upgrading to 12.2-RELEASE.  Under 12.1-RELEASE, it was mostly fine but did panic on rare occasions.
Comment 4 Jason W. Bacon freebsd_committer freebsd_triage 2020-12-27 01:19:38 UTC
(In reply to Jason W. Bacon from comment #3)

Scrap that, I misread the version.  My iMac is running 340.  390 reports that the chipset is supported by 340.
Comment 5 Martin Birgmeier 2020-12-28 11:34:40 UTC
Alexey,

Could you as maintainer please contact Nvidia about this PR and get them to fix the issue?

-- Martin
Comment 6 Martin Birgmeier 2021-05-06 19:03:41 UTC
Created attachment 224733
more nvidia crashes

Here are some more recent crashes... it is always at the same symbol _nv007402rm.

Maybe it would be easy to contact Nvidia with this information and ask for a fix?

-- Martin
Comment 7 Alexey Dokuchaev freebsd_committer freebsd_triage 2021-05-08 09:41:01 UTC
This might be related to bug #195097...  Could you try to apply (by hand, the current code is a bit different) the patch https://bz-attachments.freebsd.org/attachment.cgi?id=170499 and see if it makes any difference?
Comment 8 Martin Birgmeier 2021-05-09 07:36:44 UTC
Thank you for the pointer.

The crash happens randomly during operation, but the patch seems to address an open/close issue. What leads you to believe it might help?

-- Martin
Comment 9 Alexey Dokuchaev freebsd_committer freebsd_triage 2021-05-13 03:38:41 UTC
(In reply to Martin Birgmeier from comment #8)
> What leads you to believe it might help?
There were quite a few similar reports in the past (bug #193622, https://bugzilla.redhat.com/show_bug.cgi?id=589007, https://forums.developer.nvidia.com/t/gpu-stuck-during-deep-learning-training/115258), and in all of them the last non-obfuscated function call before the obfuscated _nvXXXXrm() chain was rm_free_unused_clients(), so it seemed that something was wrong with the resource management teardown logic.

> #8  0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
> #9  0xfffffe00a7bebd50 in ?? ()
> #10 0xffffffff82077a69 in _nv007400rm () from /boot/modules/nvidia.ko
> #11 0xfffffe00a7bebd50 in ?? ()
> #12 0xfffffe00a7bebda0 in ?? ()
> #13 0x0000000000000000 in ?? ()
> (kgdb)
However, in your case I don't see that call (and the stack trace is rather short), so you're probably right, it must be something else.  Too bad nVidia obfuscates the Resource Manager API. :-(
Comment 10 Andriy Gapon freebsd_committer freebsd_triage 2022-01-20 14:54:47 UTC
Another crash for the "collection".
Based on the register dump it seems to be caused by a use after free.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 5; apic id = 05
instruction pointer     = 0x20:0xffffffff829ccc90
stack pointer           = 0x28:0xfffffe02c1d7d840
frame pointer           = 0x28:0xfffffe021a9a5d20
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 80434 (plasmashell)
trap number             = 9
panic: general protection fault
cpuid = 5
time = 1642654429
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff805ca63b = db_trace_self_wrapper+0x2b/frame 0xfffffe02c1d7d470
kdb_backtrace() at 0xffffffff808ae0c7 = kdb_backtrace+0x37/frame 0xfffffe02c1d7d520
vpanic() at 0xffffffff8086a2ec = vpanic+0x18c/frame 0xfffffe02c1d7d580
panic() at 0xffffffff80869f03 = panic+0x43/frame 0xfffffe02c1d7d5e0
trap_fatal() at 0xffffffff80b5ac35 = trap_fatal+0x375/frame 0xfffffe02c1d7d640
trap() at 0xffffffff80b5a0e7 = trap+0x67/frame 0xfffffe02c1d7d750
trap_check() at 0xffffffff80b5b069 = trap_check+0x29/frame 0xfffffe02c1d7d770
calltrap() at 0xffffffff80b36778 = calltrap+0x8/frame 0xfffffe02c1d7d770
--- trap 0x9, rip = 0xffffffff829ccc90, rsp = 0xfffffe02c1d7d840, rbp = 0xfffffe021a9a5d20 ---
_nv035888rm() at 0xffffffff829ccc90 = _nv035888rm+0xb0/frame 0xfffffe021a9a5d20
??() at 0xfffff803cf6cb570/frame 0xdeadc0df00000000
Uptime: 12d16h10m41s
Dumping 6720 out of 32646 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

doadump (textdump=textdump@entry=1) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:399
399             dumptid = curthread->td_tid;
(kgdb) bt
#0  doadump (textdump=textdump@entry=1) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:399
#1  0xffffffff80869cef in kern_reboot (howto=260) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:487
#2  0xffffffff8086a35f in vpanic (fmt=0xffffffff80c1a2f6 "%s", ap=<optimized out>) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:920
#3  0xffffffff80869f03 in panic (fmt=<unavailable>) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:844
#4  0xffffffff80b5ac35 in trap_fatal (frame=0xfffffe02c1d7d780, eva=0) at /usr/devel/git/trant/sys/amd64/amd64/trap.c:944
#5  0xffffffff80b5a0e7 in trap (frame=frame@entry=0xfffffe02c1d7d780) at /usr/devel/git/trant/sys/amd64/amd64/trap.c:249
#6  0xffffffff80b5b069 in trap_check (frame=0xfffffe02c1d7d780) at /usr/devel/git/trant/sys/amd64/amd64/trap.c:667
#7  <signal handler called>
#8  0xffffffff829ccc90 in _nv035888rm () from /boot/modules/nvidia.ko
#9  0xfffff803cf6cb808 in ?? ()
#10 0xfffffe021a9a5f18 in ?? ()
#11 0xfffff80253624190 in ?? ()
#12 0xffffffff829ca95e in _nv014658rm () from /boot/modules/nvidia.ko
#13 0x0000000000000000 in ?? ()

(kgdb) fr 8
#8  0xffffffff829ccc90 in _nv035888rm () from /boot/modules/nvidia.ko
(kgdb) disassemble
Dump of assembler code for function _nv035888rm:
   0xffffffff829ccbe0 <+0>:     push   %r13
   0xffffffff829ccbe2 <+2>:     push   %r12
   0xffffffff829ccbe4 <+4>:     mov    %rsi,%r12
   0xffffffff829ccbe7 <+7>:     push   %rbx
   0xffffffff829ccbe8 <+8>:     sub    $0x20,%rbp
   0xffffffff829ccbec <+12>:    mov    0x20(%rsi),%rdx
   0xffffffff829ccbf0 <+16>:    mov    %rdi,%rbx
   0xffffffff829ccbf3 <+19>:    test   %rdx,%rdx
   0xffffffff829ccbf6 <+22>:    je     0xffffffff829ccc16 <_nv035888rm+54>
   0xffffffff829ccbf8 <+24>:    mov    0x18(%rdi),%rax
   0xffffffff829ccbfc <+28>:    test   %rax,%rax
   0xffffffff829ccbff <+31>:    jne    0xffffffff829ccc11 <_nv035888rm+49>
   0xffffffff829ccc01 <+33>:    jmp    0xffffffff829ccc40 <_nv035888rm+96>
   0xffffffff829ccc03 <+35>:    nopl   0x0(%rax,%rax,1)
   0xffffffff829ccc08 <+40>:    mov    0x18(%rax),%rax
   0xffffffff829ccc0c <+44>:    test   %rax,%rax
   0xffffffff829ccc0f <+47>:    je     0xffffffff829ccc40 <_nv035888rm+96>
   0xffffffff829ccc11 <+49>:    cmp    %rax,%rdx
   0xffffffff829ccc14 <+52>:    jne    0xffffffff829ccc08 <_nv035888rm+40>
   0xffffffff829ccc16 <+54>:    mov    %r12,%rdi
   0xffffffff829ccc19 <+57>:    call   0xffffffff822f53a0 <_nv035883rm>
   0xffffffff829ccc1e <+62>:    lea    0x120(%rbx),%rdi
   0xffffffff829ccc25 <+69>:    mov    %r12,%rsi
   0xffffffff829ccc28 <+72>:    call   0xffffffff829c2520 <_nv029011rm>
   0xffffffff829ccc2d <+77>:    pop    %rbx
   0xffffffff829ccc2e <+78>:    pop    %r12
   0xffffffff829ccc30 <+80>:    pop    %r13
   0xffffffff829ccc32 <+82>:    add    $0x20,%rbp
   0xffffffff829ccc36 <+86>:    ret
   0xffffffff829ccc37 <+87>:    nopw   0x0(%rax,%rax,1)
   0xffffffff829ccc40 <+96>:    lea    0x148(%rdx),%rdi
   0xffffffff829ccc47 <+103>:   call   0xffffffff829c26f0 <_nv029013rm>
   0xffffffff829ccc4c <+108>:   mov    %rax,%r13
   0xffffffff829ccc4f <+111>:   mov    0x20(%r12),%rax
   0xffffffff829ccc54 <+116>:   lea    0x148(%rax),%rdi
   0xffffffff829ccc5b <+123>:   call   0xffffffff829c26c0 <_nv028995rm>
   0xffffffff829ccc60 <+128>:   mov    0x20(%r12),%rcx
   0xffffffff829ccc65 <+133>:   mov    %rax,%rdx
   0xffffffff829ccc68 <+136>:   mov    %rbp,%rdi
   0xffffffff829ccc6b <+139>:   lea    0x148(%rcx),%rsi
   0xffffffff829ccc72 <+146>:   mov    %r13,%rcx
   0xffffffff829ccc75 <+149>:   call   0xffffffff829c2800 <_nv029003rm>
   0xffffffff829ccc7a <+154>:   nopw   0x0(%rax,%rax,1)
   0xffffffff829ccc80 <+160>:   mov    %rbp,%rdi
   0xffffffff829ccc83 <+163>:   call   0xffffffff829c2870 <_nv029002rm>
   0xffffffff829ccc88 <+168>:   test   %al,%al
   0xffffffff829ccc8a <+170>:   je     0xffffffff829ccc16 <_nv035888rm+54>
   0xffffffff829ccc8c <+172>:   mov    0x0(%rbp),%rsi
=> 0xffffffff829ccc90 <+176>:   cmp    %rbx,0x8(%rsi)
   0xffffffff829ccc94 <+180>:   jne    0xffffffff829ccc80 <_nv035888rm+160>
   0xffffffff829ccc96 <+182>:   cmp    %r12,(%rsi)
   0xffffffff829ccc99 <+185>:   jne    0xffffffff829ccc80 <_nv035888rm+160>
   0xffffffff829ccc9b <+187>:   mov    0x20(%r12),%rax
   0xffffffff829ccca0 <+192>:   lea    0x148(%rax),%rdi
   0xffffffff829ccca7 <+199>:   call   0xffffffff829c2520 <_nv029011rm>
   0xffffffff829cccac <+204>:   jmp    0xffffffff829ccc16 <_nv035888rm+54>
End of assembler dump.

(kgdb) i reg
rax            0x1                 1
rbx            0xfffff807c381b828  -8762748192728
rcx            0xfffff803cf6cb570  -8779728112272
rdx            0xdeadc0dedeadc0de  -2401050962867404578
rsi            0xdeadc0df00000000  -2401050962308366336
rdi            0xfffffe021a9a5d20  -2189986996960
rbp            0xfffffe021a9a5d20  0xfffffe021a9a5d20
rsp            0xfffffe02c1d7d840  0xfffffe02c1d7d840
r8             0xffffffff80c2bc91  -2134721391
r9             0xffffffff8414fcab  -2078999381
r10            0x0                 0
r11            0x372               882
r12            0xfffff80253624190  -8786104139376
r13            0xdeadc0df00000000  -2401050962308366336
r14            0xfffffe021a9a5d98  -2189986996840
r15            0xfffff807c381b828  -8762748192728
rip            0xffffffff829ccc90  0xffffffff829ccc90 <_nv035888rm+176>
eflags         0x10202             [ IF RF ]
cs             0x20                32
ss             0x28                40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>

(kgdb) x/10a $rdi
0xfffffe021a9a5d20:     0xdeadc0df00000000      0xfffff803cf6cb570
0xfffffe021a9a5d30:     0x0     0xdeadc0dedeadc0de
0xfffffe021a9a5d40:     0x0     0xfffffe021a9a5e18
0xfffffe021a9a5d50:     0xfffff807c381b948      0xfffff807c381b828
0xfffffe021a9a5d60:     0xfffffe021a9a5ed0      0x1441b5f4d70
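
The register dump above can also be checked mechanically. The following is only a minimal sketch and rests on two assumptions that are not stated in this report: that the kernel fills freed memory with the 32-bit pattern 0xdeadc0de (as FreeBSD kernels built with INVARIANTS do), and that virtual addresses are 48 bits wide. The function names are hypothetical. The `slack` parameter is a heuristic so that 0xdeadc0df, which is one above the pattern and might come from arithmetic on a poisoned value, is flagged as well.

```python
# Sketch: classify register values taken from the kgdb "i reg" dump above.
# Assumptions (not from the report): freed kernel memory is filled with the
# 32-bit pattern 0xdeadc0de, and virtual addresses are 48 bits wide.

POISON = 0xDEADC0DE
VA_BITS = 48


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def looks_poisoned(value: int, slack: int = 1) -> bool:
    """True if either 32-bit half is within `slack` of the poison pattern."""
    halves = (value >> 32, value & 0xFFFFFFFF)
    return any(abs(half - POISON) <= slack for half in halves)


# Values copied from the register dump in this comment.
registers = {
    "rbx": 0xFFFFF807C381B828,
    "rdx": 0xDEADC0DEDEADC0DE,
    "rsi": 0xDEADC0DF00000000,
    "r12": 0xFFFFF80253624190,
    "r13": 0xDEADC0DF00000000,
}

for name, value in registers.items():
    print(f"{name}: canonical={is_canonical(value)} "
          f"poisoned={looks_poisoned(value)}")
```

Under these assumptions rdx, rsi and r13 are flagged as poisoned and non-canonical, while rbx and r12 look like ordinary kernel pointers. The faulting instruction dereferences 0x8(%rsi), and on amd64 an access through a non-canonical address raises a general protection fault rather than a page fault, which is consistent with trap 9 above.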
Comment 11 Richard Gallamore freebsd_committer freebsd_triage 2022-01-20 23:55:58 UTC
Hey Andriy,

Can you please provide the Nvidia driver version that you are currently using?
Comment 12 Alex S 2022-01-21 01:22:24 UTC
(In reply to Richard Gallamore from comment #11)

Apparently 470.86: https://markmail.org/message/73loj7mwc4xsn7d2. I don't see any similarities to the rest of this thread though.
Comment 13 Andriy Gapon freebsd_committer freebsd_triage 2022-01-21 06:56:55 UTC
(In reply to Alex S from comment #12)
Yes, indeed, it's nvidia-driver-470.86.
Comment 14 Martin Birgmeier 2022-01-21 07:08:37 UTC
Well, and mine is still nvidia-driver-390-390.144.

Which just shows that at least they are all crashing in the same way...

-- Martin
Comment 15 Martin Birgmeier 2022-03-14 19:03:06 UTC
*** Bug 235865 has been marked as a duplicate of this bug. ***
Comment 16 Martin Birgmeier 2022-03-14 19:12:38 UTC
Some updated information (also regarding bug #235865 which I just marked as a duplicate of this one): Subjectively, or from experience, the main culprit seems to be some bitblt operation. Specifically, crashes occur, e.g.,
- when using the mouse to scroll, e.g., a firefox window
- when some lines are printed in an xterm, causing it to scroll up (!)

Crashes even happen while the KDE UI is locked (and the screen turned off), and I assume that behind this, some xterm in which some compilation is running has scrolled. But I definitely know that manually provoked scrolling of an xterm can trigger the crash (like with firefox). So most likely off-screen rendering is affected.

Maybe it is just the FreeBSD memory interface which has changed (a long while ago), giving up some guarantees assumed by the nvidia driver... or nvidia has a simple use after free, or buffer overflow, or whatever.

-- Martin
Comment 17 Alexandre C. Guimarães freebsd_committer freebsd_triage 2022-05-20 20:19:11 UTC
I don't know if this is related or not, but I have video crashing when hardware acceleration is used (it seems). It doesn't happen immediately but after a seemingly random amount of time.

In my case I use AwesomeWM without composition or anything special, and so I've just experienced this particular issue while playing videos with 'multimedia/mpv', but once the hardware acceleration was disabled the crashes disappeared completely.

About the crash: either the screen turns black, as if the computer were off, or it locks up with the image garbled and an annoying buzzing sound playing non-stop.

I can't tell for sure when it started because I don't watch videos that often, but I first noticed it about 2 or 3 months ago.

Card: nVidia GT 630
Driver: nvidia-driver-390-390.151
Comment 18 Martin Birgmeier 2022-05-21 07:56:55 UTC
From my experience this is most likely the same issue.

About the noise: If audio is playing while the kernel crashes the soundcard may continue playing the last buffer contents in a loop.

If you have crash dumps enabled you should always see the same offending symbol at the start of the backtrace. For my driver (x11/nvidia-driver-390) it is _nv007402rm().

-- Martin
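
To compare the offending symbol across several dumps, the first nvidia.ko frame can be pulled out of a kgdb backtrace with a few lines of script. This is only a rough sketch: the helper name `first_module_frame` and the regular expression are made up for illustration, and the sample text is the quoted trace from comment 9, not a complete backtrace.

```python
import re

# Matches kgdb frames such as:
#   #8  0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
# The address and the "from <module>" part are both optional.
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+"
    r"(?:0x[0-9a-f]+\s+in\s+)?"
    r"(?P<func>\S+)\s+\(.*?\)"
    r"(?:\s+from\s+(?P<module>\S+))?"
)


def first_module_frame(backtrace, module_suffix="nvidia.ko"):
    """Return (frame_number, function) of the first frame from the module."""
    for raw in backtrace.splitlines():
        # Tolerate the "> " prefix of lines quoted in a reply.
        match = FRAME_RE.match(raw.lstrip("> "))
        if match and (match.group("module") or "").endswith(module_suffix):
            return int(match.group("num")), match.group("func")
    return None


sample = """\
> #8  0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
> #9  0xfffffe00a7bebd50 in ?? ()
> #10 0xffffffff82077a69 in _nv007400rm () from /boot/modules/nvidia.ko
"""

print(first_module_frame(sample))  # prints (8, '_nv007402rm')
```

Note that the obfuscated _nvNNNNNNrm names differ between driver versions (compare comment 10), so the symbol is only comparable between dumps taken with the same driver build.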
Comment 19 Marcin Cieślak 2022-11-24 09:59:19 UTC
@martin do you have a full core dump so that an analysis similar to Andriy's can be done?

I have a very similar crash with the 340 driver. From what I traced, it sometimes happens when the nvidia "file" is being closed and some resources are being released, for example when closing a browser window or a tab, and sometimes when shutting down X, as the name rm_free_unused_clients() suggests.
Comment 20 Martin Birgmeier 2022-11-24 17:42:51 UTC
I have disabled coredumps for this machine precisely because the nvidia driver crashes it regularly. I added the salient info in the attachments to this PR when I created it.

But I have given up on expecting an improvement here; NVidia simply is not interested although it has been made aware of this multiple times.
Comment 21 Marcin Cieślak 2022-11-27 21:54:22 UTC
I understand your frustration. I'm looking into it as much as my skills allow, so if you manage to produce something, just drop me a line.

I am working only on -340 but I suspect this might be exactly the same issue on -390.

Thank you for reporting this anyway!