Summary:   x11/nvidia-driver: Kernel panic after "NVRM: rm_init_adapter() failed!"
Product:   Ports & Packages        Reporter: Stefan Farfeleder <stefanf>
Component: Individual Port(s)     Assignee: Alexey Dokuchaev <danfe>
Status:    In Progress
Severity:  Affects Some People     CC: danfe, mike.d.ft402, w.schwarzenfeld
Description Stefan Farfeleder 2014-11-17 10:01:16 UTC
With the update to nvidia-driver-340.46 on HEAD amd64, I now have a ~50% chance of a kernel panic at `startx'. Just before the panic I get the errors

    NVRM: RmInitAdapter failed! (0x26:0x2a:1224)
    nvidia0: NVRM: rm_init_adapter() failed!

followed immediately by a fatal trap 12 (page fault while in kernel mode) inside rm_free_unused_clients. I took a look at the open source parts of the driver and found what I believe is an invalid null pointer use. The driver does roughly this:

    devfs_open
        nvidia_dev_open
            devfs_set_cdevpriv
            nvidia_open_dev
                NV_UMA_ZONE_ALLOC_STACK(sc->api_sp);
                rm_init_adapter -> fail
                NV_UMA_ZONE_FREE_STACK(sc->api_sp);

Here sc->api_sp is set to NULL after the rm_init_adapter failure.

    devfs_clear_cdevpriv
        devfs_fpdrop
            devfs_destroy_cdevpriv
                nvidia_dev_dtor
                    nvidia_close_dev
                        rm_free_unused_clients(sc->api_sp)

Here rm_free_unused_clients is called with a null pointer. This function is not open source, but judging from the panic my guess is that it's not happy being called with a null pointer. I'm not sure about the best possible fix, but calling nvidia_close_dev after an unsuccessful nvidia_open_dev seems wrong, and it also wraps the refcnt to (uint32_t)-1. Maybe nvidia_dev_dtor simply needs to check the refcnt and avoid calling nvidia_close_dev if it's already 0.
Comment 2 Alexey Dokuchaev 2014-11-17 10:13:41 UTC
Thanks for your report and initial investigation. If you manage to patch the open source part to avoid the panic, please share it so we can expedite the problem upstream. There was another report recently on the x11@ list about nvidia-driver-340.46 causing kernel panics on 10.1-RELEASE/amd64 with KDE4; it might or might not be related: http://lists.freebsd.org/pipermail/freebsd-x11/2014-November/015577.html (unfortunately it lacks the details needed to make any better judgement). Can you also try to build/install the next stable version (340.58, not in the ports yet) or the beta (346.16):

    make DISTVERSION=340.58 -DNO_CHECKSUM install

and tell us how/if it changes anything?
Comment 3 Stefan Farfeleder 2014-11-17 19:12:47 UTC
Created attachment 149530 [details] Patch fixing the described panic

Here is a patch that works for me. With it I still get random rm_init_adapter failures, but at least a failure no longer causes a panic. I'd feel more comfortable with it if some driver people took a look at it :) This is against the default version; I haven't had time to test other versions yet.
Comment 4 Alexey Dokuchaev 2014-11-18 13:13:22 UTC
Thanks for the patch. I've asked Christian Zander from NVidia to take a look at it; hopefully we'll have a real (committable) version of the fix soon.
Comment 5 Michael Danilov 2014-12-01 13:34:53 UTC
The patch helps my case! I think I have found one more problem. Repeatedly killing xdm in an attempt to get past the problem, if done too fast (or too many times?), leads to a crash:

    #0  doadump (textdump=<value optimized out>) at pcpu.h:219
    #1  0xffffffff804d55c2 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:452
    #2  0xffffffff804d5984 in panic (fmt=<value optimized out>) at ../../../kern/kern_shutdown.c:759
    #3  0xffffffff806dc243 in newdirrem (bp=<value optimized out>, dp=<value optimized out>, ip=<value optimized out>, isrmdir=<value optimized out>, prevdirremp=<value optimized out>) at ../../../ufs/ffs/ffs_softdep.c:9179
    #4  0xffffffff806db5b6 in softdep_setup_remove (bp=0x0, dp=0x0, ip=0xfffff80006b8eb28, isrmdir=0) at ../../../ufs/ffs/ffs_softdep.c:8886
    #5  0xffffffff806f5219 in ufs_dirremove (dvp=0xfffff8000694e938, ip=0xfffff80006b8eb28, flags=<value optimized out>, isrmdir=<value optimized out>) at ../../../ufs/ufs/ufs_lookup.c:1240
    #6  0xffffffff806fb5e5 in ufs_remove (ap=0xfffffe00dd5c7a98) at ../../../ufs/ufs/ufs_vnops.c:928
    #7  0xffffffff80784b97 in VOP_REMOVE_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:1399
    #8  0xffffffff8057d619 in kern_unlinkat (td=<value optimized out>, fd=<value optimized out>, path=<value optimized out>, pathseg=<value optimized out>, oldinum=<value optimized out>) at vnode_if.h:575
    #9  0xffffffff8074a4c1 in amd64_syscall (td=0xfffff800068d6920, traced=0) at subr_syscall.c:134
    #10 0xffffffff8072f7cb in Xfast_syscall () at ../../../amd64/amd64/exception.S:391
    #11 0x0000000041d6998a in ?? ()
    Previous frame inner to this frame (corrupt stack?)

Also, what could this mean?

    ACPI Warning: \134_SB_.PCI0.XVR0.VGA_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130823/nsarguments-97)
Comment 6 Michael Danilov 2015-01-02 10:00:04 UTC
The crash I described can happen on its own. Sometimes the system just locks up, and I'm pretty sure it's the blob. I wonder, are there any efforts to bring nouveau back?
Comment 7 Michael Danilov 2016-05-20 07:15:59 UTC
Howdy, I think that calling rm_free_unused_clients with a null pointer may be the cause in that problem, too. I have suggested a patch that does the same as yours, but at both occurrences of the function in nvidia_subr.c:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=205903
https://bz-attachments.freebsd.org/attachment.cgi?id=170499
Comment 8 John Smith 2017-03-01 18:22:30 UTC
(In reply to Alexey Dokuchaev from comment #4) Any news?
Comment 9 Alexey Dokuchaev 2017-04-03 12:16:11 UTC
The driver was updated to the latest long-lived version, 375.39, as of ports r437618, and the code in `nvidia_subr.c' looks different now. Can you try the new version and see whether the panic you've been experiencing is gone?
Comment 10 Walter Schwarzenfeld 2018-03-07 16:40:13 UTC
Is this still relevant?
Comment 11 Michael Danilov 2018-03-07 18:10:23 UTC
I believe it is, just like my bug -- but neither the committers nor NVIDIA are cooperating.