Bug 277364 - x11/nvidia-driver: kernel panic: "NVRM: rm_init_rm() failed" after update to 550.54.14
Status: New
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Alexey Dokuchaev
Reported: 2024-02-27 21:31 UTC by iron.udjin
Modified: 2024-04-06 01:54 UTC (History)
CC List: 8 users

See Also:
bugzilla: maintainer-feedback? (danfe)


Attachments
panic_screenshot.jpg (204.02 KB, image/jpeg)
2024-02-27 21:31 UTC, iron.udjin
panic_screenshot2.jpg (252.55 KB, image/jpeg)
2024-02-28 14:21 UTC, iron.udjin

Description iron.udjin 2024-02-27 21:31:10 UTC
Created attachment 248801
panic_screenshot.jpg

OS: 14-STABLE (stable/14-n266905-f3f559705a05)
Hardware: NVIDIA 2080 Ti

See screenshot.
Comment 1 Sean Farley freebsd_committer freebsd_triage 2024-02-28 02:04:49 UTC
I am seeing this too with stable/14-n266905-f3f559705a05.
Comment 2 Tomoaki AOKI 2024-02-28 09:49:06 UTC
I'm the one who filed Bug 277028 to upgrade to 550.54.14, and I have been using the 550 branch of the driver since it was provided as a beta, but I have never encountered such a panic.
It may be GPU-specific. Mine is a Quadro P1000 (notebook) only.

Another possibility:
Looking at your photo, the lines before the panic include

random: entropy device external interface
kbd1 at kbdmux0

followed by nvidia-related lines.

This usually indicates the early boot phase, before init starts and runs the rc scripts, so you are loading nvidia-modeset.ko (which automatically loads nvidia.ko, too) from /boot/loader.conf, right?

If so, try removing (or commenting out) the nvidia-modeset_load="YES" and/or nvidia_load="YES" lines in /boot/loader.conf[.local].

Instead, put

kld_list="${kld_list} nvidia-modeset.ko"

line somewhere in your /etc/rc.conf[.local].

I suspect that nvidia-modeset.ko itself, or the automatically loaded nvidia.ko, is not fully loaded into the (small!) staging area because the modules have grown in size, and that an attempt to access the unread portion causes the panic.
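
The change suggested above can be sketched as a small shell helper. This is only an illustration, not something taken from the port: the function name move_nvidia_to_rc and its two path arguments are hypothetical, and it assumes the loader.conf entries have the nvidia*_load="YES" form quoted in this report. Try it on copies of the files first.

```sh
#!/bin/sh
# Sketch only: move NVIDIA module loading from loader.conf to rc.conf.
# move_nvidia_to_rc is a hypothetical helper; both paths are passed in
# so it can be tried on copies before touching the real files.
move_nvidia_to_rc() {
    loader_conf=$1
    rc_conf=$2

    # Comment out active nvidia*_load="YES" lines (nvidia_load,
    # nvidia-modeset_load, nvidia_modeset_load). Lines that are already
    # commented out, and all unrelated lines, are left unchanged.
    sed -e 's/^[[:space:]]*\(nvidia[-_a-z]*_load="YES"\)/#\1/' \
        "$loader_conf" > "$loader_conf.new" &&
        mv "$loader_conf.new" "$loader_conf"

    # Append the kld_list line unless rc.conf already mentions the module.
    grep -q 'nvidia-modeset' "$rc_conf" ||
        echo 'kld_list="${kld_list} nvidia-modeset.ko"' >> "$rc_conf"
}
```

Usage would be something like `move_nvidia_to_rc /boot/loader.conf /etc/rc.conf`, run as root. Note that replacing the file with mv may change its ownership or mode, so check the result afterwards. On FreeBSD, `sysrc kld_list+="nvidia-modeset"` is another way to make the rc.conf half of the change.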
Comment 3 Vladyslav Movchan 2024-02-28 11:52:35 UTC
I've experienced the same problem on a recent 15.0-CURRENT with an Nvidia GTX 1060 after updating to 550.54.14.
Moving the load of `nvidia-modeset` from `/boot/loader.conf` to `/etc/rc.conf` did help.

Thanks a lot for the explanation and advice!
Comment 4 iron.udjin 2024-02-28 14:21:36 UTC
Created attachment 248812
panic_screenshot2.jpg
Comment 5 iron.udjin 2024-02-28 14:27:49 UTC
(In reply to Tomoaki AOKI from comment #2)

>This usually indicates it's in early boot phase before init starts and runs rc scripts, thus you are loading nvidia-modeset.ko (and it automatically loads nvidia.ko, too) in /boot/loader.conf, right?

Yes. When I put kld_list="nvidia-modeset" into /etc/rc.conf it automatically loads nvidia.ko and everything works as expected.

But if I set nvidia_load="YES" in /boot/loader.conf, I get a panic (see the next screenshot).

If this is normal behaviour, I think pkg-message needs a warning, something like "Do not put nvidia_load="YES" in /boot/loader.conf", so that other users avoid the same panic.

Thank you.
Comment 6 Austin Shafer 2024-02-28 14:52:18 UTC
This was added to the docs not too long ago, IIRC. Yes, you should absolutely load these kernel modules through rc.conf and never through loader.conf. Historically we've seen a lot of issues with the kernel not being able to grow the reserved memory region large enough to accommodate the nvidia modules that early in boot, and this rm_init_rm issue seems to be the latest version of the same thing.

I agree a pkg-message would make sense
Comment 7 Alex 2024-03-07 12:16:37 UTC
I can absolutely confirm the same panic with 550.54.14 on 14.0-RELEASE with an Nvidia RTX 3070.

pkg had upgraded the driver to this version from an earlier release. I rebooted after the upgrade and, bam, panic.

I had to go into the boot menu and "disable-module nvidia" and "disable-module nvidia_modeset" to get back into my system.

When one installs the driver via pkg (e.g. nvidia-driver), does this (or did it in the past) populate /boot/loader.conf during the install? I don't recall adding these entries myself:

#nvidia_load="YES"
#nvidia_name="nvidia"
#nvidia_modeset_load="YES"
#nvidia_modeset_name="nvidia-modeset"

I've commented them out and all is well. Evidently, when X starts it kldloads the modules it needs on the fly (which seems to work), as I don't have anything set to load them in /etc/rc.conf at present, yet they show up in kldstat:

root@bsd-desktop:~ # kldstat
Id Refs Address                Size Name
 1   52 0xffffffff80200000  1d345b0 kernel
 2    1 0xffffffff82621000   125678 nvidia-modeset.ko
 3    1 0xffffffff82800000  3118000 nvidia.ko

What should I put in my /etc/rc.conf instead? Why do earlier versions of the driver not have a problem being loaded via /boot/loader.conf?
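
The size explanation from comment 2 can be checked against the kldstat output above: the Size column is hexadecimal, so converting it to bytes shows how large each module is. A minimal sketch, where kld_sizes is a hypothetical helper name and the input is assumed to have the five-column layout shown above:

```sh
#!/bin/sh
# Sketch only: convert the hexadecimal Size column of kldstat(8) output
# to bytes. Reads kldstat-style lines on stdin and prints
# "<name> <bytes>" for each module. kld_sizes is a hypothetical name.
kld_sizes() {
    while read -r id refs addr size name; do
        # Skip the header line ("Id Refs Address Size Name").
        [ "$id" = "Id" ] && continue
        # Skip blank or short lines that have no size field.
        [ -n "$size" ] || continue
        # The size is printed without a 0x prefix, so add one for $(( )).
        echo "$name $((0x$size))"
    done
}
```

With the output above, `kldstat | kld_sizes` would report nvidia.ko as 51478528 bytes (about 49 MiB) and nvidia-modeset.ko as 1201784 bytes.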
Comment 8 Tomoaki AOKI 2024-03-07 18:18:58 UTC
(In reply to Alex from comment #7)

As I replied in Bug 277028 comment 15 (sorry, I intentionally used overly strong wording there to emphasize the importance), loading modules for xorg/wayland from /boot/loader.conf[.local] is now discouraged, because they are nowadays too large to load by that route.

You can safely load the non-legacy x11/nvidia-driver by adding nvidia-modeset.ko alone to the kld_list variable in /etc/rc.conf[.local].
It automatically loads nvidia.ko as its dependency.

For use with Wayland in conjunction with the graphics/nvidia*-drm-kmod ports, there still seems to be something to investigate. Basically, follow what is described in the handbook.
That is the starting point of a (possibly long; I have not finished it yet) journey.
For xorg, though, nvidia-drm.ko built by graphics/nvidia*-drm-kmod automatically loads nvidia-modeset.ko, which in turn automatically loads nvidia.ko.

Putting nvidia.ko BEFORE nvidia-modeset.ko (and both BEFORE nvidia-drm.ko) works OK for xorg, at least for me (I have seen no new problems). If the order is reversed, warnings such as "already loaded" are shown, but they are actually harmless.
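
The ordering described above could be written in /etc/rc.conf roughly as follows. This is a sketch that assumes one of the graphics/nvidia*-drm-kmod ports is installed and provides nvidia-drm.ko; without that port, the nvidia-drm.ko entry would be dropped.

```sh
# /etc/rc.conf fragment (sketch): dependencies first, then dependents.
# Assumes nvidia-drm.ko comes from a graphics/nvidia*-drm-kmod port.
kld_list="${kld_list} nvidia.ko nvidia-modeset.ko nvidia-drm.ko"
```

Since each module loads its dependency automatically, listing only the last module in the chain should have the same effect; the explicit order simply avoids the "already loaded" warnings.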
Comment 9 Yuri Victorovich freebsd_committer freebsd_triage 2024-03-26 03:24:34 UTC
See Bug 277967 - "The loader should fail gracefully when /boot/loader.conf attempts to load a module that is too large".
Comment 10 russo 2024-03-31 22:29:44 UTC
(comment copied over from bug #277827) 

Is /usr/ports/UPDATING no longer the right place for information like this?

Having used nvidia-driver for many, many years and having been in the habit of loading it in /boot/loader.conf, this issue bit me today, too.  I managed to recover from it relatively easily, and after recovering finally read the pkg-message printed by the driver on install.  The pkg-message for the driver had apparently been changed to recommend kld_list and NOT loader.conf back in 2020, but the fact is that those pkg-message outputs often scroll off the screen before I can read them during a big port upgrade run so I never noticed it and it never kernel panicked for me before this latest update.

Since I *am* in the habit of reading "pkg updating" before upgrading, having a warning about trying to load the nvidia driver through the loader.conf method show up there would have caught my attention long ago.

Perhaps you can add an entry to UPDATING that makes it more obvious that the old loader.conf recommendations are now recommendations to NEVER load graphics modules via loader.conf?
Comment 11 Graham Perrin 2024-04-01 06:38:58 UTC
(In reply to russo from comment #10)

Please take this to a discussion area that's not specific to 277364, then ping me from there. Thanks.
Comment 12 Tomoaki AOKI 2024-04-06 01:54:02 UTC
(In reply to Graham Perrin from comment #11)

This would be related to Bug 277967 - "The loader should fail gracefully when /boot/loader.conf attempts to load a module that is too large".
That bug seems to be more generic.
I think it would be nice if it could be implemented, but I also think it would be difficult.
I've already commented there on the difficulties to overcome, as far as I can currently think of them. There may be a cleaner and easier way that I have not noticed.