With the update to nvidia-driver-340.46 on HEAD amd64, I now have a ~50% chance
of a kernel panic at `startx'. Just before the panic I get the errors
NVRM: RmInitAdapter failed! (0x26:0x2a:1224)
nvidia0: NVRM: rm_init_adapter() failed!
followed immediately by
fatal trap 12: page fault while in kernel mode within rm_free_unused_clients.
I took a look at the open source parts of the driver and found an invalid null
pointer usage, I think.
The driver does roughly the following on the failure path:
rm_init_adapter() fails, and sc->api_sp is set to NULL after the failure.
rm_free_unused_clients() is then called with that null pointer. This function is
not open source, but judging from the panic, my guess is that it is not happy
being called with a null pointer.
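To make the sequence concrete, here is a minimal user-space model of how I read the error path. The structure and function names other than rm_init_adapter() and rm_free_unused_clients() are my guesses, and the stub bodies are invented, since the real rm_*() functions live in the binary blob:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures (names guessed). */
struct api_state { int dummy; };
struct nv_softc  { struct api_state *api_sp; };

/* Stub: the real rm_init_adapter() is in the blob; here it always
 * fails, simulating the RmInitAdapter failure from the dmesg. */
static int rm_init_adapter(struct nv_softc *sc) {
    (void)sc;
    return -1;
}

/* Stub: the real rm_free_unused_clients() presumably dereferences its
 * argument, which is what panics the kernel when it receives NULL.
 * Here we just report the bad call instead of crashing. */
static int rm_free_unused_clients(struct api_state *sp) {
    if (sp == NULL)
        return -1; /* real blob would page-fault here */
    return 0;
}

/* Model of the open path: on failure, api_sp is cleared. */
int nvidia_open_dev_model(struct nv_softc *sc) {
    if (rm_init_adapter(sc) != 0) {
        sc->api_sp = NULL;  /* cleared after the init failure */
        return -1;
    }
    return 0;
}

/* Model of the close path: the NULL pointer reaches the blob. */
int nvidia_close_dev_model(struct nv_softc *sc) {
    return rm_free_unused_clients(sc->api_sp);
}
```

With these stubs, a failed open followed by a close hands NULL straight to rm_free_unused_clients(), which is exactly the sequence that precedes the trap.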
I'm not sure about the best fix, but calling nvidia_close_dev after an
unsuccessful nvidia_open_dev seems wrong, and it also wraps the refcnt around to
(uint32_t)-1. Perhaps nvidia_dev_dtor simply needs to check the refcnt and avoid
calling nvidia_close_dev if it is already 0.
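To illustrate both the underflow and the guard I have in mind, here is a small user-space sketch. The struct and field names are made up for the example; the real code is in nvidia_subr.c:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the softc refcount (field name guessed). */
struct nv_softc { uint32_t refcnt; };

/* Unguarded decrement: models how an extra close after a failed open
 * wraps a zero refcount around to (uint32_t)-1. */
uint32_t dtor_unguarded(struct nv_softc *sc) {
    return --sc->refcnt;
}

/* Guarded version: skip the close path entirely when the count is
 * already zero, which is the fix suggested above. In the real dtor,
 * nvidia_close_dev() would only be called in the non-zero branch. */
uint32_t dtor_guarded(struct nv_softc *sc) {
    if (sc->refcnt == 0)
        return 0;            /* nothing to close; avoid underflow */
    return --sc->refcnt;
}
```

The guarded variant leaves a zero refcount at zero instead of letting it wrap, so the close path never runs for a device that was never successfully opened.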
Auto-assigned to maintainer danfe@FreeBSD.org
Thanks for your report and initial investigation. If you manage to patch the open source part to avoid the panic, please share it so we can expedite getting the problem fixed upstream.
There was another report recently on x11@ list about nvidia-driver-340.46 causing kernel panics on 10.1-RELEASE/amd64 and KDE4; this might or might not be related: http://lists.freebsd.org/pipermail/freebsd-x11/2014-November/015577.html (unfortunately it lacks useful details to make any better judgement).
Can you also try to build and install their next stable version (340.58, not in the ports yet) or the beta (346.16):
make DISTVERSION=340.58 -DNO_CHECKSUM install
and tell us how/if it changes anything?
Created attachment 149530
Patch fixing described panic
Here is a patch that works for me. With it, I still get random rm_init_adapter failures, but at least the failure doesn't cause a panic any longer.
I'd feel more comfortable with it if some driver people took a look at it :)
This is against the default version; I haven't had time to test other versions yet.
Thanks for the patch. I've asked Christian Zander from NVidia to take a look at it; hopefully we'll have a real (committable) version of the fix soon.
The patch helps my case!
I think I have found one more problem. Repeatedly killing xdm in an attempt to get past the problem, if done too fast (or too many times?), leads to a crash:
#0 doadump (textdump=<value optimized out>) at pcpu.h:219
#1 0xffffffff804d55c2 in kern_reboot (howto=260)
#2 0xffffffff804d5984 in panic (fmt=<value optimized out>)
#3 0xffffffff806dc243 in newdirrem (bp=<value optimized out>,
dp=<value optimized out>, ip=<value optimized out>,
isrmdir=<value optimized out>, prevdirremp=<value optimized out>)
#4 0xffffffff806db5b6 in softdep_setup_remove (bp=0x0, dp=0x0,
ip=0xfffff80006b8eb28, isrmdir=0) at ../../../ufs/ffs/ffs_softdep.c:8886
#5 0xffffffff806f5219 in ufs_dirremove (dvp=0xfffff8000694e938,
ip=0xfffff80006b8eb28, flags=<value optimized out>,
isrmdir=<value optimized out>) at ../../../ufs/ufs/ufs_lookup.c:1240
#6 0xffffffff806fb5e5 in ufs_remove (ap=0xfffffe00dd5c7a98)
#7 0xffffffff80784b97 in VOP_REMOVE_APV (vop=<value optimized out>,
a=<value optimized out>) at vnode_if.c:1399
#8 0xffffffff8057d619 in kern_unlinkat (td=<value optimized out>,
fd=<value optimized out>, path=<value optimized out>,
pathseg=<value optimized out>, oldinum=<value optimized out>)
#9 0xffffffff8074a4c1 in amd64_syscall (td=0xfffff800068d6920, traced=0)
#10 0xffffffff8072f7cb in Xfast_syscall ()
#11 0x0000000041d6998a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Also, what could warnings like this one mean?
ACPI Warning: \_SB_.PCI0.XVR0.VGA_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130823/nsarguments-97)
The crash I described can happen on its own. Sometimes the system just locks up -- and I am pretty sure it is the blob. I wonder, are there efforts to bring nouveau back?
I think that calling rm_free_unused_clients with a null pointer may be the cause in this case, too. I have suggested a patch that does the same as yours, but at both call sites of the function in nvidia_subr.c.
(In reply to Alexey Dokuchaev from comment #4)
The driver was updated to the latest long-lived version, 375.39, as of ports r437618, and the code in `nvidia_subr.c' now looks different. Can you try the new version and see whether the panic you've been experiencing is gone?
Is this still relevant?
I believe it is, just like my bug -- but neither committers nor NVIDIA are cooperating.