Bug 195097 - x11/nvidia-driver: Kernel panic after "NVRM: rm_init_adapter() failed!"
Summary: x11/nvidia-driver: Kernel panic after "NVRM: rm_init_adapter() failed!"
Status: In Progress
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Alexey Dokuchaev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-17 10:01 UTC by Stefan Farfeleder
Modified: 2018-03-07 18:10 UTC
CC: 3 users

See Also:


Attachments
Patch fixing described panic (446 bytes, patch)
2014-11-17 19:12 UTC, Stefan Farfeleder
no flags

Description Stefan Farfeleder freebsd_committer 2014-11-17 10:01:16 UTC
With the update to nvidia-driver-340.46 on HEAD amd64, I now have a ~50% chance
of a kernel panic at `startx'. Just before the panic I get the errors

NVRM: RmInitAdapter failed! (0x26:0x2a:1224)
nvidia0: NVRM: rm_init_adapter() failed!

followed immediately by

fatal trap 12: page fault while in kernel mode within rm_free_unused_clients.

I took a look at the open source parts of the driver and found what I believe
is an invalid null-pointer dereference.

The driver does roughly this:

devfs_open
    nvidia_dev_open
        devfs_set_cdevpriv
        nvidia_open_dev
            NV_UMA_ZONE_ALLOC_STACK(sc->api_sp);
            rm_init_adapter -> fail
            NV_UMA_ZONE_FREE_STACK(sc->api_sp);

Here sc->api_sp is set to NULL after the rm_init_adapter failure.

    devfs_clear_cdevpriv
        devfs_fpdrop
            devfs_destroy_cdevpriv
                nvidia_dev_dtor
                    nvidia_close_dev
                        rm_free_unused_clients(sc->api_sp)

Here rm_free_unused_clients is called with a null pointer. This function is not
open source, but from the panic my guess is it's not happy being called with a
null pointer.

I'm not sure about the best possible fix, but calling nvidia_close_dev after an
unsuccessful nvidia_open_dev seems wrong and also wraps the refcnt to 
(uint32_t)-1.  Maybe nvidia_dev_dtor simply needs to check the refcnt and avoid
calling nvidia_close_dev if it's already 0.
Comment 1 Bugzilla Automation freebsd_committer 2014-11-17 10:01:16 UTC
Auto-assigned to maintainer danfe@FreeBSD.org
Comment 2 Alexey Dokuchaev freebsd_committer 2014-11-17 10:13:41 UTC
Thanks for your report and initial investigation.  If you manage to patch the open source part to avoid the panic, please share it so we can escalate the problem upstream.

There was another report recently on x11@ list about nvidia-driver-340.46 causing kernel panics on 10.1-RELEASE/amd64 and KDE4; this might or might not be related: http://lists.freebsd.org/pipermail/freebsd-x11/2014-November/015577.html (unfortunately it lacks useful details to make any better judgement).

Can you also try to build/install their next stable version (not in the ports yet) or beta (346.16):

  make DISTVERSION=340.58 -DNO_CHECKSUM install

and tell how/if it changes anything?
Comment 3 Stefan Farfeleder freebsd_committer 2014-11-17 19:12:47 UTC
Created attachment 149530
Patch fixing described panic

Here is a patch that works for me. With it, I still get random rm_init_adapter failures, but at least the failure doesn't cause a panic any longer.

I'd feel more comfortable with it if some driver people took a look at it :)

This is against the default version; I didn't have time to test other versions yet.
Comment 4 Alexey Dokuchaev freebsd_committer 2014-11-18 13:13:22 UTC
Thanks for the patch.  I've asked Christian Zander from NVidia to take a look at it; hopefully we'll have a real (committable) version of the fix soon.
Comment 5 Michael Danilov 2014-12-01 13:34:53 UTC
The patch helps my case!

I think I have found one more problem. Repeatedly killing xdm in an attempt to get past the problem, if done too fast (or too many times?), leads to a crash:

#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff804d55c2 in kern_reboot (howto=260)
    at ../../../kern/kern_shutdown.c:452
#2  0xffffffff804d5984 in panic (fmt=<value optimized out>)
    at ../../../kern/kern_shutdown.c:759
#3  0xffffffff806dc243 in newdirrem (bp=<value optimized out>,
    dp=<value optimized out>, ip=<value optimized out>,
    isrmdir=<value optimized out>, prevdirremp=<value optimized out>)
    at ../../../ufs/ffs/ffs_softdep.c:9179
#4  0xffffffff806db5b6 in softdep_setup_remove (bp=0x0, dp=0x0,
    ip=0xfffff80006b8eb28, isrmdir=0) at ../../../ufs/ffs/ffs_softdep.c:8886
#5  0xffffffff806f5219 in ufs_dirremove (dvp=0xfffff8000694e938,
    ip=0xfffff80006b8eb28, flags=<value optimized out>,
    isrmdir=<value optimized out>) at ../../../ufs/ufs/ufs_lookup.c:1240
#6  0xffffffff806fb5e5 in ufs_remove (ap=0xfffffe00dd5c7a98)
    at ../../../ufs/ufs/ufs_vnops.c:928
#7  0xffffffff80784b97 in VOP_REMOVE_APV (vop=<value optimized out>,
    a=<value optimized out>) at vnode_if.c:1399
#8  0xffffffff8057d619 in kern_unlinkat (td=<value optimized out>,
    fd=<value optimized out>, path=<value optimized out>,
    pathseg=<value optimized out>, oldinum=<value optimized out>)
    at vnode_if.h:575
#9  0xffffffff8074a4c1 in amd64_syscall (td=0xfffff800068d6920, traced=0)
    at subr_syscall.c:134
#10 0xffffffff8072f7cb in Xfast_syscall ()
    at ../../../amd64/amd64/exception.S:391
#11 0x0000000041d6998a in ?? ()
Previous frame inner to this frame (corrupt stack?)

Also, what could this warning mean?

ACPI Warning: \_SB_.PCI0.XVR0.VGA_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130823/nsarguments-97)
Comment 6 Michael Danilov 2015-01-02 10:00:04 UTC
The crash I described can happen on its own. Sometimes the system just locks up -- and I am pretty sure it is the blob. I wonder, are there efforts to bring nouveau back?
Comment 7 Michael Danilov 2016-05-20 07:15:59 UTC
Howdy,

I think that calling rm_free_unused_clients with a null pointer may be the cause in this problem, too. I have suggested a patch that does the same as yours, but for both occurrences of the function in nvidia_subr.c.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=205903

https://bz-attachments.freebsd.org/attachment.cgi?id=170499
Comment 8 John Smith 2017-03-01 18:22:30 UTC
(In reply to Alexey Dokuchaev from comment #4)
Any news?
Comment 9 Alexey Dokuchaev freebsd_committer 2017-04-03 12:16:11 UTC
The driver was updated to the latest long-lived branch version 375.39 as of ports r437618, and the code in `nvidia_subr.c' looks different now.  Can you try the new version and see whether the panic you've been experiencing is gone?
Comment 10 w.schwarzenfeld freebsd_triage 2018-03-07 16:40:13 UTC
Is this still relevant?
Comment 11 Michael Danilov 2018-03-07 18:10:23 UTC
I believe it is, just like my bug -- but neither committers nor NVIDIA are cooperating.