Bug 291232 - x11/nvidia-driver: On recent CURRENT: not loading properly - missing hw.nvidia.registry.EnableGpuFirmware=1
Status: Closed DUPLICATE of bug 291212
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-x11 (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-11-26 07:59 UTC by O. Hartmann
Modified: 2025-11-29 18:50 UTC

See Also:
bugzilla: maintainer-feedback? (x11)


Description O. Hartmann 2025-11-26 07:59:04 UTC
On recent CURRENT (FreeBSD 16.0-CURRENT #14 master-n282165-718acd74657f: Wed Nov 26 08:26:23 CET 2025 amd64) with a recent NVIDIA RTX 5060 Ti GPU, port x11/nvidia-kmod (nvidia-kmod-580.105.08.1600004) fails to bring up xdm/the GUI. The x11/nvidia-kmod and x11/nvidia-driver ports are recompiled every time world/kernel is built.

xdm is set up via /etc/ttys (ttyv9   "/usr/local/bin/xdm -nodaemon"  xterm   onifexists      insecure).

For a couple of days now, with the recent nvidia kernel module loaded, a reboot (either reboot or shutdown -r now) takes a long time until an error occurs ("... some processes would not die, ... advised"). It is not possible to interrupt the shutdown or to see which process refuses to die, but I guess it is xdm.

Without the nvidia kernel module loaded, FreeBSD behaves as expected in most cases and performs a clean reboot; working without a GUI is possible so far.

I use a custom kernel! The custom kernel worked before, I guess, commit 9562994a7aacee2baae6ddee1a7b558b48ae39ef. This commit is a marker for me: I rebuilt the kernel yesterday before this commit was made, and it worked so far, even with the GUI. After that commit, I had to follow the now-vanished UPDATING remarks about setting some sysctl flags (see my PR 291212 on that).
Comment 1 Tomoaki AOKI 2025-11-26 09:59:06 UTC
(In reply to O. Hartmann from comment #0)
Which commit ran fine for you?
Can you bisect to find the commit that actually broke it?

I'm currently running massive poudriere rebuilds on stable/15
and cannot boot into or upgrade the main branch on that
same computer.

And which GPU is affected?
The RTX 5xxx series is known to still have some rough edges,
and the GSP (GPU System Processor, which Turing and later
generations, including RTX 5xxx [Blackwell], have) needs
to be active, unlike on prior generations of GPUs
(through Ada Lovelace).
Comment 2 O. Hartmann 2025-11-26 10:50:42 UTC
(In reply to Tomoaki AOKI from comment #1)

Second question first:

[...]
vgapci0: child nvidia0 requested pci_enable_io
vgapci0: child nvidia0 requested pci_enable_io
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  580.105.08  Wed Oct 29 22:04:36 UTC 2025
kldunload: attempt to unload file that was loaded by the kernel
[...]
Vendor/OEM is MSI

Your first question: I have never done a bisection. Yesterday the box worked, I remember that the last update/compilation I did was with

commit 64ee9c166ce5e807e575d205ac2e15cc5cf6581b

It is because I use an Intel igb/em NIC on my servers and I was wondering ...
Comment 3 O. Hartmann 2025-11-26 10:52:31 UTC
Just for the record: unloading nvidia-modeset via kldunload results in a hung console (it is still possible to connect via SSH, but F1..FXX console/TTY switching no longer works on the console). The hang is permanent ...
Comment 4 Tomoaki AOKI 2025-11-26 11:26:17 UTC
(In reply to O. Hartmann from comment #2)
So at commit base 64ee9c166ce5e807e575d205ac2e15cc5cf6581b, x11/nvidia-kmod in conjunction with x11/nvidia-driver 580.105.08 worked fine, right?

Then upgrading src to commit base 9562994a7aacee2baae6ddee1a7b558b48ae39ef broke it, and it is still broken at commit base 718acd74657fdf21cfd03c721bb7484d3789aaa0, right?

If so, there is no need to bisect, as commit base 9562994a7aacee2baae6ddee1a7b558b48ae39ef is just the next one after commit base 64ee9c166ce5e807e575d205ac2e15cc5cf6581b.
That makes it clear that commit base 9562994a7aacee2baae6ddee1a7b558b48ae39ef broke
things.
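
Whether one commit really is the direct successor of another can be checked before deciding to skip a bisect. The following is only a minimal sketch, not something that was run for this report; the function name commits_between is made up for illustration, and /usr/src as the location of the src checkout is an assumption.

```sh
# Count the commits reachable from the "bad" commit ($3) but not from the
# "good" commit ($2) in the git repository at $1.
# A result of 1 means "bad" directly follows "good", so there is
# nothing left to bisect.
commits_between() {
    git -C "$1" rev-list --count "$2..$3"
}

# Example with the hashes from this report (/usr/src is an assumed path):
# commits_between /usr/src \
#     64ee9c166ce5e807e575d205ac2e15cc5cf6581b \
#     9562994a7aacee2baae6ddee1a7b558b48ae39ef
```

If the count is larger than 1, `git bisect start`, `git bisect good <commit>` and `git bisect bad <commit>` narrow the range down step by step.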

Fortunately, stable/15 got the related code before it was branched (what's missing is just the commit that flipped the default). I can test whether flipping the tunable debug.link_elf_obj_leak_locals from 1 to 0 in my /boot/loader.conf and restarting changes anything. (For me, debug.link_elf_leak_locals is somehow already 0 and works fine with my Quadro P1000 (notebook).) But that would be a couple of days from now, after the massive poudriere rebuilds finish.

If it affects me, the issue would be that nvidia.ko cannot fetch the tunable you defined in /boot/loader.conf for your GPU (assuming RTX 5xxx series) since the flip (for me, the affected one would be nvidia-drm.ko, though).

And if not, possibly something else is affected and nvidia.ko is killed indirectly by it.
Comment 5 Tomoaki AOKI 2025-11-26 11:33:44 UTC
(In reply to O. Hartmann from comment #3)
This is not limited to nvidia: unloading kernel mode setting GPU drivers (including the non-generic nvidia-modeset.ko) usually causes a crash or hang, as the kernel / console driver no longer knows how to set modes properly.

With ancient GPU drivers (recently called User Mode Setting drivers, in contrast to KMS), the kernel / console driver didn't care about mode settings (UMS drove / released the video buffer by itself). The vanilla kernel/console drivers run as if nothing changed, IIUC.
Comment 6 O. Hartmann 2025-11-26 14:37:08 UTC
(In reply to Tomoaki AOKI from comment #4)

Sorry, I was too hasty and imprecise and, to be honest, I do not understand the situation anymore!

I went back as far as commit 120f8a4c2ae8a011827d83b098ecf70c791f794b (git reset --hard 120f8a4c2ae8a011827d83b098ecf70c791f794b). Then I recompiled both world and kernel (I use NOCLEAN, so it might be the wrong approach). The x11/nvidia-xxxx ports were also recompiled, as set in src.conf.

No change! Not loading nvidia-modeset leaves the system fully operational, as far as I can judge, as stated before.

So, reflecting on what I could have missed, I'm rebuilding the whole OS with the latest CURRENT commit and also rebuilding both x11/nvidia-driver and x11/nvidia-kmod.

Prior to a whole rebuild of CURRENT after cleanworld with a rebuilt nvidia driver AND carefully set

debug.link_elf_leak_locals=0
debug.link_elf_obj_leak_locals=0

in /boot/loader.conf.local, rebooting and manually trying to load the nvidia-modeset module I get this error on the console:

[...]
bridge0: link state changed to UP
nvidia0: <NVIDIA GeForce RTX 5060 Ti> on vgapci0
vgapci0: child nvidia0 requested pci_enable_io
vgapci0: child nvidia0 requested pci_enable_io
link_elf_obj: symbol nv_kthread_q_schedule_q_item undefined
nvidia0: detached
linker_load_file: /boot/modules/nvidia-modeset.ko - unsupported file type
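
To pull just the missing symbol names out of such console output, a small filter may help. This is only a sketch based on the message format shown above; the function name undefined_symbols is made up for illustration.

```sh
# Read kernel messages on stdin and print every symbol that link_elf_obj
# reported as undefined, one per line. All other lines are dropped.
undefined_symbols() {
    sed -n 's/^link_elf_obj: symbol \(.*\) undefined$/\1/p'
}

# Example: dmesg | undefined_symbols
```

An empty result only means that no line of this exact form was seen; it does not prove the module loaded.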
Comment 7 O. Hartmann 2025-11-26 15:18:34 UTC
(In reply to O. Hartmann from comment #6)

With FreeBSD 16.0-CURRENT #0 master-n282166-23af364630b1: Wed Nov 26 15:36:30 CET 2025  amd64, recent ports tree, rebuilt nvidia modules, and properly disabled 
debug.link_elf_leak_locals=0
debug.link_elf_obj_leak_locals=0

in /boot/loader.conf.local the result is as stated prior to this add:

[...]
nvidia0: <NVIDIA GeForce RTX 5060 Ti> on vgapci0
vgapci0: child nvidia0 requested pci_enable_io
vgapci0: child nvidia0 requested pci_enable_io
link_elf_obj: symbol nv_kthread_q_schedule_q_item undefined
nvidia0: detached
linker_load_file: /boot/modules/nvidia-modeset.ko - unsupported file type

I'm sorry for not being of much help in terms of debugging. I guess the latest "issue" is more convenient than a frozen, unresponsive console.
Comment 8 Tomoaki AOKI 2025-11-26 23:36:04 UTC
(In reply to O. Hartmann from comment #6)
NOCLEAN (and ccache or the like, if you're using it) can do harm here when rolling back.

At worst, `rm -rf /usr/obj/` is needed. I've been bitten by this worst case before
(unrelated to nvidia GPU drivers). I struggled with bisecting until
I noticed this problem.

This is because "linker_load_file: /boot/modules/nvidia-modeset.ko - unsupported file type" usually means a mismatch in (interface) versions between the actual kernel and kernel modules.
Comment 9 O. Hartmann 2025-11-27 11:33:51 UTC
(In reply to Tomoaki AOKI from comment #8)

Because I have been bitten by out-of-sync kernel and kernel modules from the ports tree, I try to recompile the kernel modules in use whenever world is built.

In this very specific case regarding the recent kernel and recent nvidia driver, both world/kernel and ports tree/nvidia driver are up to date, having been recompiled from scratch after deleting /usr/obj/*. So the issue is now a real-world issue, I guess, and not a phenomenon of an out-of-sync ABI.


A further observation: Xorg is eating up 100% WCPU (top) when nvidia driver is loaded and 

debug.link_elf_leak_locals=1
debug.link_elf_obj_leak_locals=1

are both left to their default.
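
To double-check what a loader configuration file actually sets for these tunables, something like the following could be used. It is only a sketch: the function name loader_tunable is made up for illustration, and it assumes simple name=value or name="value" lines without trailing comments.

```sh
# Print the last value assigned to tunable $2 in loader config file $1.
# Prints nothing if the tunable is not set in that file.
# Note: dots in the name are treated as regex wildcards by grep, which is
# good enough for a quick check.
loader_tunable() {
    grep "^$2=" "$1" | tail -n 1 | cut -d= -f2- | tr -d '"'
}

# Example:
# loader_tunable /boot/loader.conf.local debug.link_elf_obj_leak_locals
```

This only shows what the file contains, not the value the running kernel ended up with.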
Comment 10 Tomoaki AOKI 2025-11-27 14:41:12 UTC
(In reply to O. Hartmann from comment #9)
What I mentioned is not only "sync src and pkg tree"; /usr/obj/ also needs to be cleaned after rolling back the tree.

This is because if *.o (including *.pico) files already built from the updated source files are left in /usr/obj/, rebuilding them from the rolled-back source can be skipped; thus, things do not change, or get even worse. This is what bit me before.

And (IIRC) the *.depends and *.meta files generated by the compiler for its records are not cleaned by the usual `make clean` (including cleanworld and cleankernel).
For *.depends, `make cleandepends` would work (IIRC, it did not work as expected for the kernel, unlike for world, though). But IIRC, there's no `cleanmeta` target to clean *.meta.

After being bitten by it, and as I'm using a Root-on-ZFS installation, I started taking a snapshot of /usr/obj (an independent dataset for me), with the commit used in the snapshot name, every time buildworld and buildkernel succeed. So I'm not bothered by clean rebuilds: I just roll back to a known working commit and install from there.
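
The naming part of that snapshot scheme could look roughly like this. It is a sketch only, not the exact commands used here: the function name obj_snapshot_name and the dataset name zroot/usr/obj are hypothetical, and /usr/src is assumed to be the src checkout.

```sh
# Compose a ZFS snapshot name "<dataset>@<commit>" from a dataset ($1)
# and a commit hash ($2).
obj_snapshot_name() {
    printf '%s@%s\n' "$1" "$2"
}

# Example, after buildworld and buildkernel succeeded
# (zroot/usr/obj is a hypothetical dataset name):
# commit=$(git -C /usr/src rev-parse --short=12 HEAD)
# zfs snapshot "$(obj_snapshot_name zroot/usr/obj "$commit")"
```

Going back later is then a matter of `zfs rollback` to the snapshot named after the known working commit, followed by the install steps.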
Comment 11 O. Hartmann 2025-11-29 13:35:57 UTC
Fun fact: with FreeBSD 16.0-CURRENT #6 master-n282239-57c0a337dbc5: Sat Nov 29 09:07:33 CET 2025 amd64 everything turned back to normal!
Comment 12 Tomoaki AOKI 2025-11-29 14:56:14 UTC
(In reply to O. Hartmann from comment #11)
That would simply be because the offending commit was temporarily reverted
at commit base fad4c92b78a123f87195173ac118655fa8e325cd, wouldn't it?

But the offending commit base 9562994a7aacee2baae6ddee1a7b558b48ae39ef is
planned to be reapplied, so an actual fix is needed anyway.

I'm now working on it in Bug 291212.
Comment 13 O. Hartmann 2025-11-29 18:50:10 UTC
I mark this as a duplicate of Bug 291212.

*** This bug has been marked as a duplicate of bug 291212 ***