Created attachment 219520 [details]
core.txt.*, Xorg.0.log
Scenario:
- FreeBSD 12.1-RELEASE-p6 #6 r362488M, built with debug
- unattended reboot (sysctl debug.debugger_on_panic=0)
- Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
- NVIDIA GPU Quadro 1000M (GF108GL) at PCI:1:0:0
- 24 GB main memory
- ports at latest
- using x11/nvidia-driver-390 to drive the graphics card
- KDE running
Result:
- Even without great graphics activity (the user may be away) FreeBSD crashes regularly
- The core dumps indicate issues with the nvidia driver
Three /var/crash/core.txt.* files are attached as well as Xorg.0.log.
-- Martin
The crashes keep occurring regularly - always at _nv007402rm+0x12. Now with FreeBSD 12.2 instead of 12.1. With such a definite crash source, shouldn't it be easy to fix it? :-) -- Martin
(In reply to Martin Birgmeier from comment #1) I can't tell if you are joking or not.
I'm seeing immediate panics on an iMac9,1 GeForce 9400 since upgrading to 12.2-RELEASE. Under 12.1-RELEASE, it was mostly fine but did panic on rare occasions.
(In reply to Jason W. Bacon from comment #3) Scrap that, I misread the version. My iMac is running 340. 390 reports that the chipset is supported by 340.
Alexey, Could you as maintainer please contact Nvidia about this PR and get them to fix the issue? -- Martin
Created attachment 224733 [details] more nvidia crashes Here are some more recent crashes... it is always at the same symbol _nv007402rm. Maybe it would be easy to contact Nvidia with this information and ask for a fix? -- Martin
This might be related to bug #195097... Could you try to apply (by hand, the current code is a bit different) the patch https://bz-attachments.freebsd.org/attachment.cgi?id=170499 and see if it makes any difference?
Thank you for the pointer. The crash happens randomly during operation, but the patch seems to address an open/close issue. What leads you to believe it might help? -- Martin
(In reply to Martin Birgmeier from comment #8)
> What leads you to believe it might help?
There were quite a few similar reports in the past (bug #193622, https://bugzilla.redhat.com/show_bug.cgi?id=589007, https://forums.developer.nvidia.com/t/gpu-stuck-during-deep-learning-training/115258), and in all of them the last non-obfuscated function call before the obfuscated _nvXXXXrm() chain was rm_free_unused_clients(), so it seemed that something is wrong with the resource management teardown logic.
> #8 0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
> #9 0xfffffe00a7bebd50 in ?? ()
> #10 0xffffffff82077a69 in _nv007400rm () from /boot/modules/nvidia.ko
> #11 0xfffffe00a7bebd50 in ?? ()
> #12 0xfffffe00a7bebda0 in ?? ()
> #13 0x0000000000000000 in ?? ()
> (kgdb)
However, in your case I don't see that call (and the stack trace is rather short), so you're probably right, it must be something else in your case. Too bad nVidia obfuscates the Resource Manager API. :-(
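The comparison made in the comment above (find the last non-obfuscated function before the _nvXXXXrm() chain) can be sketched in Python. This is only an illustration: the function name and regular expressions are made up for this example, it assumes kgdb's usual "#N 0xADDR in symbol () ..." frame layout, and the second sample appends a hypothetical frame with an invented address to show the positive case.

```python
import re

# kgdb frame, e.g. "#8 0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko"
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(\S+)\s+\(")
# Obfuscated Resource Manager symbols as they appear in this thread.
OBFUSCATED_RE = re.compile(r"_nv\d+rm$")


def entry_point_into_obfuscated_chain(backtrace):
    """Walk outward from the innermost frame and return the first named,
    non-obfuscated symbol that follows at least one _nvXXXXrm frame."""
    seen_obfuscated = False
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue  # e.g. "#7 <signal handler called>" or prompt lines
        symbol = match.group(2)
        if OBFUSCATED_RE.match(symbol):
            seen_obfuscated = True
        elif seen_obfuscated and symbol != "??":
            return symbol
    return None


# Frames quoted in the comment above: no named caller is visible.
QUOTED = """\
#8 0xffffffff82077bf2 in _nv007402rm () from /boot/modules/nvidia.ko
#9 0xfffffe00a7bebd50 in ?? ()
#10 0xffffffff82077a69 in _nv007400rm () from /boot/modules/nvidia.ko
#13 0x0000000000000000 in ?? ()
"""

# HYPOTHETICAL extra frame (address invented) to show the other outcome.
HYPOTHETICAL = QUOTED + (
    "#14 0xffffffff82000000 in rm_free_unused_clients () "
    "from /boot/modules/nvidia.ko\n"
)

print(entry_point_into_obfuscated_chain(QUOTED))
print(entry_point_into_obfuscated_chain(HYPOTHETICAL))
```

For the frames quoted in this report the function finds no named caller, which matches the observation that rm_free_unused_clients() does not show up in this particular trace.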
Another crash for the "collection". Based on the register dump it seems to be caused by a use after free.
Fatal trap 9: general protection fault while in kernel mode
cpuid = 5; apic id = 05
instruction pointer = 0x20:0xffffffff829ccc90
stack pointer = 0x28:0xfffffe02c1d7d840
frame pointer = 0x28:0xfffffe021a9a5d20
code segment = base rx0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 80434 (plasmashell)
trap number = 9
panic: general protection fault
cpuid = 5
time = 1642654429
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff805ca63b = db_trace_self_wrapper+0x2b/frame 0xfffffe02c1d7d470
kdb_backtrace() at 0xffffffff808ae0c7 = kdb_backtrace+0x37/frame 0xfffffe02c1d7d520
vpanic() at 0xffffffff8086a2ec = vpanic+0x18c/frame 0xfffffe02c1d7d580
panic() at 0xffffffff80869f03 = panic+0x43/frame 0xfffffe02c1d7d5e0
trap_fatal() at 0xffffffff80b5ac35 = trap_fatal+0x375/frame 0xfffffe02c1d7d640
trap() at 0xffffffff80b5a0e7 = trap+0x67/frame 0xfffffe02c1d7d750
trap_check() at 0xffffffff80b5b069 = trap_check+0x29/frame 0xfffffe02c1d7d770
calltrap() at 0xffffffff80b36778 = calltrap+0x8/frame 0xfffffe02c1d7d770
--- trap 0x9, rip = 0xffffffff829ccc90, rsp = 0xfffffe02c1d7d840, rbp = 0xfffffe021a9a5d20 ---
_nv035888rm() at 0xffffffff829ccc90 = _nv035888rm+0xb0/frame 0xfffffe021a9a5d20
??() at 0xfffff803cf6cb570/frame 0xdeadc0df00000000
Uptime: 12d16h10m41s
Dumping 6720 out of 32646 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
doadump (textdump=textdump@entry=1) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:399
399 dumptid = curthread->td_tid;
(kgdb) bt
#0 doadump (textdump=textdump@entry=1) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:399
#1 0xffffffff80869cef in kern_reboot (howto=260) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:487
#2 0xffffffff8086a35f in vpanic (fmt=0xffffffff80c1a2f6 "%s", ap=<optimized out>) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:920
#3 0xffffffff80869f03 in panic (fmt=<unavailable>) at /usr/devel/git/trant/sys/kern/kern_shutdown.c:844
#4 0xffffffff80b5ac35 in trap_fatal (frame=0xfffffe02c1d7d780, eva=0) at /usr/devel/git/trant/sys/amd64/amd64/trap.c:944
#5 0xffffffff80b5a0e7 in trap (frame=frame@entry=0xfffffe02c1d7d780) at /usr/devel/git/trant/sys/amd64/amd64/trap.c:249
#6 0xffffffff80b5b069 in trap_check (frame=0xfffffe02c1d7d780) at /usr/devel/git/trant/sys/amd64/amd64/trap.c:667
#7 <signal handler called>
#8 0xffffffff829ccc90 in _nv035888rm () from /boot/modules/nvidia.ko
#9 0xfffff803cf6cb808 in ?? ()
#10 0xfffffe021a9a5f18 in ?? ()
#11 0xfffff80253624190 in ?? ()
#12 0xffffffff829ca95e in _nv014658rm () from /boot/modules/nvidia.ko
#13 0x0000000000000000 in ?? ()
(kgdb) fr 8
#8 0xffffffff829ccc90 in _nv035888rm () from /boot/modules/nvidia.ko
(kgdb) disassemble
Dump of assembler code for function _nv035888rm:
   0xffffffff829ccbe0 <+0>: push %r13
   0xffffffff829ccbe2 <+2>: push %r12
   0xffffffff829ccbe4 <+4>: mov %rsi,%r12
   0xffffffff829ccbe7 <+7>: push %rbx
   0xffffffff829ccbe8 <+8>: sub $0x20,%rbp
   0xffffffff829ccbec <+12>: mov 0x20(%rsi),%rdx
   0xffffffff829ccbf0 <+16>: mov %rdi,%rbx
   0xffffffff829ccbf3 <+19>: test %rdx,%rdx
   0xffffffff829ccbf6 <+22>: je 0xffffffff829ccc16 <_nv035888rm+54>
   0xffffffff829ccbf8 <+24>: mov 0x18(%rdi),%rax
   0xffffffff829ccbfc <+28>: test %rax,%rax
   0xffffffff829ccbff <+31>: jne 0xffffffff829ccc11 <_nv035888rm+49>
   0xffffffff829ccc01 <+33>: jmp 0xffffffff829ccc40 <_nv035888rm+96>
   0xffffffff829ccc03 <+35>: nopl 0x0(%rax,%rax,1)
   0xffffffff829ccc08 <+40>: mov 0x18(%rax),%rax
   0xffffffff829ccc0c <+44>: test %rax,%rax
   0xffffffff829ccc0f <+47>: je 0xffffffff829ccc40 <_nv035888rm+96>
   0xffffffff829ccc11 <+49>: cmp %rax,%rdx
   0xffffffff829ccc14 <+52>: jne 0xffffffff829ccc08 <_nv035888rm+40>
   0xffffffff829ccc16 <+54>: mov %r12,%rdi
   0xffffffff829ccc19 <+57>: call 0xffffffff822f53a0 <_nv035883rm>
   0xffffffff829ccc1e <+62>: lea 0x120(%rbx),%rdi
   0xffffffff829ccc25 <+69>: mov %r12,%rsi
   0xffffffff829ccc28 <+72>: call 0xffffffff829c2520 <_nv029011rm>
   0xffffffff829ccc2d <+77>: pop %rbx
   0xffffffff829ccc2e <+78>: pop %r12
   0xffffffff829ccc30 <+80>: pop %r13
   0xffffffff829ccc32 <+82>: add $0x20,%rbp
   0xffffffff829ccc36 <+86>: ret
   0xffffffff829ccc37 <+87>: nopw 0x0(%rax,%rax,1)
   0xffffffff829ccc40 <+96>: lea 0x148(%rdx),%rdi
   0xffffffff829ccc47 <+103>: call 0xffffffff829c26f0 <_nv029013rm>
   0xffffffff829ccc4c <+108>: mov %rax,%r13
   0xffffffff829ccc4f <+111>: mov 0x20(%r12),%rax
   0xffffffff829ccc54 <+116>: lea 0x148(%rax),%rdi
   0xffffffff829ccc5b <+123>: call 0xffffffff829c26c0 <_nv028995rm>
   0xffffffff829ccc60 <+128>: mov 0x20(%r12),%rcx
   0xffffffff829ccc65 <+133>: mov %rax,%rdx
   0xffffffff829ccc68 <+136>: mov %rbp,%rdi
   0xffffffff829ccc6b <+139>: lea 0x148(%rcx),%rsi
   0xffffffff829ccc72 <+146>: mov %r13,%rcx
   0xffffffff829ccc75 <+149>: call 0xffffffff829c2800 <_nv029003rm>
   0xffffffff829ccc7a <+154>: nopw 0x0(%rax,%rax,1)
   0xffffffff829ccc80 <+160>: mov %rbp,%rdi
   0xffffffff829ccc83 <+163>: call 0xffffffff829c2870 <_nv029002rm>
   0xffffffff829ccc88 <+168>: test %al,%al
   0xffffffff829ccc8a <+170>: je 0xffffffff829ccc16 <_nv035888rm+54>
   0xffffffff829ccc8c <+172>: mov 0x0(%rbp),%rsi
=> 0xffffffff829ccc90 <+176>: cmp %rbx,0x8(%rsi)
   0xffffffff829ccc94 <+180>: jne 0xffffffff829ccc80 <_nv035888rm+160>
   0xffffffff829ccc96 <+182>: cmp %r12,(%rsi)
   0xffffffff829ccc99 <+185>: jne 0xffffffff829ccc80 <_nv035888rm+160>
   0xffffffff829ccc9b <+187>: mov 0x20(%r12),%rax
   0xffffffff829ccca0 <+192>: lea 0x148(%rax),%rdi
   0xffffffff829ccca7 <+199>: call 0xffffffff829c2520 <_nv029011rm>
   0xffffffff829cccac <+204>: jmp 0xffffffff829ccc16 <_nv035888rm+54>
End of assembler dump.
(kgdb) i reg
rax 0x1 1
rbx 0xfffff807c381b828 -8762748192728
rcx 0xfffff803cf6cb570 -8779728112272
rdx 0xdeadc0dedeadc0de -2401050962867404578
rsi 0xdeadc0df00000000 -2401050962308366336
rdi 0xfffffe021a9a5d20 -2189986996960
rbp 0xfffffe021a9a5d20 0xfffffe021a9a5d20
rsp 0xfffffe02c1d7d840 0xfffffe02c1d7d840
r8 0xffffffff80c2bc91 -2134721391
r9 0xffffffff8414fcab -2078999381
r10 0x0 0
r11 0x372 882
r12 0xfffff80253624190 -8786104139376
r13 0xdeadc0df00000000 -2401050962308366336
r14 0xfffffe021a9a5d98 -2189986996840
r15 0xfffff807c381b828 -8762748192728
rip 0xffffffff829ccc90 0xffffffff829ccc90 <_nv035888rm+176>
eflags 0x10202 [ IF RF ]
cs 0x20 32
ss 0x28 40
ds <unavailable>
es <unavailable>
fs <unavailable>
gs <unavailable>
fs_base <unavailable>
gs_base <unavailable>
(kgdb) x/10a $rdi
0xfffffe021a9a5d20: 0xdeadc0df00000000 0xfffff803cf6cb570
0xfffffe021a9a5d30: 0x0 0xdeadc0dedeadc0de
0xfffffe021a9a5d40: 0x0 0xfffffe021a9a5e18
0xfffffe021a9a5d50: 0xfffff807c381b948 0xfffff807c381b828
0xfffffe021a9a5d60: 0xfffffe021a9a5ed0 0x1441b5f4d70
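The use-after-free reading of this register dump can be checked mechanically. The Python sketch below is a hedged illustration and not part of the original report: it assumes that 0xdeadc0de is the fill pattern a debug-enabled FreeBSD kernel writes into freed memory, and that the CPU uses 48-bit virtual addresses. The helper names (looks_like_junk, is_canonical) and the "within 1" tolerance are made up for this example; the register values are copied from the dump in this comment.

```python
JUNK32 = 0xDEADC0DE  # ASSUMPTION: fill pattern for freed memory on a debug kernel


def looks_like_junk(value):
    """Crude heuristic: either 32-bit half is within 1 of the junk pattern."""
    halves = (value >> 32, value & 0xFFFFFFFF)
    return any(abs(half - JUNK32) <= 1 for half in halves)


def is_canonical(addr, va_bits=48):
    """x86-64 canonical form: bits 63..(va_bits - 1) must all be equal."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Selected registers from the "i reg" output in this comment.
REGS = {
    "rax": 0x1,
    "rbx": 0xFFFFF807C381B828,
    "rcx": 0xFFFFF803CF6CB570,
    "rdx": 0xDEADC0DEDEADC0DE,
    "rsi": 0xDEADC0DF00000000,
    "r12": 0xFFFFF80253624190,
    "r13": 0xDEADC0DF00000000,
}

suspects = sorted(name for name, value in REGS.items() if looks_like_junk(value))
print("registers carrying the junk pattern:", suspects)

# The faulting instruction is "cmp %rbx,0x8(%rsi)", so the CPU reads rsi + 8.
fault_addr = REGS["rsi"] + 8
print(hex(fault_addr), "canonical:", is_canonical(fault_addr))
```

With these inputs rdx, rsi and r13 are flagged, and rsi + 8 is not a canonical address. A non-canonical memory operand raises a general protection fault rather than a page fault, which would be consistent with the "Fatal trap 9" reported here.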
Hey Andriy, Can you please provide the Nvidia driver version that you are currently using?
(In reply to Richard Gallamore from comment #11) Apparently 470.86: https://markmail.org/message/73loj7mwc4xsn7d2. I don't see any similarities to the rest of this thread though.
(In reply to Alex S from comment #12) Yes, indeed, it's nvidia-driver-470.86.
Well, and mine is still nvidia-driver-390-390.144. Which just shows that at least they are all crashing in the same way... -- Martin
*** Bug 235865 has been marked as a duplicate of this bug. ***
Some updated information (also regarding bug #235865 which I just marked as a duplicate of this one):
Subjectively, or from experience, the main culprit seems to be some bitblt operation. Specifically, crashes occur, e.g.,
- when using the mouse to scroll, e.g., a firefox window
- when some lines are printed in an xterm, causing it to scroll up (!)
Crashes even happen while the KDE UI is locked (and the screen turned off), and I assume that behind this, some xterm in which some compilation is running has scrolled. But I definitely know that manually provoked scrolling of an xterm can trigger the crash (like with firefox). So most likely off-screen rendering is affected.
Maybe it is just the FreeBSD memory interface which has changed (a long while ago), giving up some guarantees assumed by the nvidia driver... or nvidia has a simple use after free, or buffer overflow, or whatever.
-- Martin
I don't know if this is related or not, but video playback crashes for me when hardware acceleration is used (it seems). It doesn't happen immediately but after some apparently random amount of time. In my case I use AwesomeWM without compositing or anything special, so I've only experienced this particular issue while playing videos with 'multimedia/mpv'; once hardware acceleration was disabled the crashes disappeared completely. About the crash: either the screen turns black, as if the computer were off, or it locks up with the image becoming garbled and an annoying buzzing sound playing non-stop. I can't tell for sure when it started because I don't watch videos that often, but I first noticed it about 2 or 3 months ago.
Card: nVidia GT 630
Driver: nvidia-driver-390-390.151
From my experience this is most likely the same issue. About the noise: If audio is playing while the kernel crashes the soundcard may continue playing the last buffer contents in a loop. If you have crash dumps enabled you should always see the same offending symbol at the start of the backtrace. For my driver (x11/nvidia-driver-390) it is _nv007402rm(). -- Martin
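To check whether several crashes really start at the same symbol, the frame right after the "--- trap" marker in the KDB stack backtrace can be extracted and counted. The Python sketch below is only an illustration: the function names are made up, and it assumes the "symbol() at 0xADDR = symbol+0xOFF/frame 0xFRAME" line layout seen in the KDB backtrace posted earlier in this thread, from which the sample lines are copied.

```python
import re
from collections import Counter

# KDB frame: "sym() at 0xADDR = sym+0xOFF/frame 0xFRAME"
KDB_FRAME_RE = re.compile(r"^(\S+)\(\) at 0x[0-9a-f]+ = \S+\+0x[0-9a-f]+/frame")


def faulting_symbol(console_text):
    """Return the symbol of the first frame after the '--- trap' marker."""
    after_trap = False
    for line in console_text.splitlines():
        line = line.strip()
        if line.startswith("--- trap"):
            after_trap = True
            continue
        if after_trap:
            match = KDB_FRAME_RE.match(line)
            if match:
                return match.group(1)
    return None


def tally(console_texts):
    """Count how often each faulting symbol occurs across several crash logs."""
    return Counter(faulting_symbol(text) for text in console_texts)


# Lines copied from the KDB backtrace posted earlier in this thread.
SAMPLE = """\
calltrap() at 0xffffffff80b36778 = calltrap+0x8/frame 0xfffffe02c1d7d770
--- trap 0x9, rip = 0xffffffff829ccc90, rsp = 0xfffffe02c1d7d840, rbp = 0xfffffe021a9a5d20 ---
_nv035888rm() at 0xffffffff829ccc90 = _nv035888rm+0xb0/frame 0xfffffe021a9a5d20
??() at 0xfffff803cf6cb570/frame 0xdeadc0df00000000
"""

print(faulting_symbol(SAMPLE))
print(tally([SAMPLE, SAMPLE]))
```

In practice the input would be the contents of the /var/crash/core.txt.* files; note that the obfuscated symbol names differ between driver versions, so a tally is only meaningful within one driver build.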
@martin do you have a full core dump so that an analysis similar to Andriy's can be done? I have a very similar crash with the 340 driver. From what I traced, it sometimes happens when the nvidia "file" is being closed and some resources are being released, for example when closing a browser window or a tab, or sometimes when shutting down X, as the name rm_free_unused_clients() suggests.
I have disabled coredumps for this machine precisely because the nvidia driver crashes it regularly. I added the salient info in the attachments to this PR when I created it. But I have given up on expecting an improvement here; NVidia simply is not interested although it has been made aware of this multiple times.
I understand your frustration. I'm looking into it as much as my skills allow, so if you manage to produce something, just drop me a line. I am working only on -340, but I suspect this might be exactly the same issue on -390. Thank you for reporting this anyway!
For several months now this seems to have been better. Instead of crashing, the driver resets, and the session then continues. -> Close. -- Martin