Created attachment 165080 [details] Relevant parts of core.txt. The problem has been occurring since a few patch levels before 10.2-RELEASE-p8 and a few minor versions of nvidia-driver before 340.96.
Created attachment 165081 [details] Part of dmesg with WITNESS, WITNESS_SKIPSPIN and INVARIANTS that may be relevant.
Take a guess at which port this applies to and assign.
Judging from the stack trace so far, it looks like a process (vimb) using a user-mode nvidia driver component (OpenGL?) was killed because the system ran out of swap, nvidia.ko attempted to clean up after that process, and it somehow confused itself over resources that had been allocated for that user-mode nvidia driver component. Michael: are you able to trigger this on demand by manually killing vimb, or does it require running out of swap and invoking the out-of-memory killer? 340.96 is a relatively old NVIDIA driver version. Could you test with something more recent? E.g., either of these: http://www.nvidia.com/Download/driverResults.aspx/95925/en-us http://www.nvidia.com/Download/driverResults.aspx/95163/en-us Thanks.
Hello, The problem may be unrelated to vimb -- I have also observed the crash after exiting FlightGear, and it could just as well be triggered by mpv. But recently vimb has been causing OOM on a daily basis. I am unable to use newer drivers: 340 is the last branch supported by the card.
Is there any progress locating the problem? The driver keeps panicking during accelerated graphics use. Is there other information I should provide that could be useful?
This panic happens to me with nvidia-driver-349.16 every couple of months on assorted computers, including today. I'm not sure if I'll be able to provide kernel dumps, because this only happens on workstations where I access sensitive information and the dumps are also large. In the past this panic could be easily triggered with just startx, but now it happens at random points that seem vaguely graphics related, such as opening Thunderbird, opening a new tab in Chromium, or sometimes using VirtualBox. The panics started with FreeBSD 10 and never happened on 9, and I suspect something about the binary blob just doesn't match up properly with 10. Does anyone have a reliable way of triggering this panic so I could build a clean system to reproduce it?
Not 100% reliable, but try causing an OOM while running (or shortly after having closed) something using GL. In my case the trigger is closing a GL program after having killed a browser instance that had allocated all memory and swap and then hung. Most backtraces contain "in rm_free_unused_clients () from /boot/modules/nvidia.ko", a vague hint that this happens when something is removed by the blob. The crash on startx was a separate (but maybe related) problem, which I have experienced, too. This patch seemed to mitigate it: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195097 Now it seems to have gone away on its own without any patches.
Hello, My bad -- the "on startx" crash did NOT go away on its own -- I had the patch file in nvidia-driver/files all along, and the files there apply to nvidia-driver-340.
Created attachment 170497 [details] Patch to x11/nvidia-driver to CAUSE the panic. Alright, I think I have come up with a reliable way to reproduce this annoying problem. Here is a broken version of the patch suggested in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195097 that causes (I think!) a similar panic. I have fed rm_free_unused_clients a null pointer on only one occasion, and it crashed where I expected it to crash -- on exiting 3D programs. In that bug they suggest that "Here rm_free_unused_clients is called with a null pointer. This function is not open source, but from the panic my guess is it's not happy being called with a null pointer.", which may well be the case here.
STEPS TO REPRODUCE A
1. Install driver with patch applied.
2. Start a 3D program (I used Blender).
3. Exit the program.
STEPS TO REPRODUCE B
1. Start X.
2. Kill X.
Created attachment 170498 [details] Stack trace of "reproduced" crash.
Created attachment 170499 [details] SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer. Naturally, I can suggest a kludge in the spirit of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195097 I will report back in a couple of weeks on whether the panic has shown up again, or earlier if it does. P.S. I am discussing the problem with NVIDIA devs. https://nvidia-submit.custhelp.com/app/account/questions/detail/i_id/971478
Created attachment 170501 [details] SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer. Removed misleading comment.
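The idea behind the suggested patch can be sketched in plain userland C. This is not the attachment and not the driver's code: `struct fake_client_list`, `fake_rm_free_unused_clients` and `guarded_free_unused_clients` are hypothetical stand-ins, since the real rm_free_unused_clients is closed source and its signature is not shown in this thread. The sketch only illustrates the guard pattern: check the pointer first and skip the call when it is null.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for whatever state the cleanup routine expects. */
struct fake_client_list {
    int unused_clients;
};

/*
 * Hypothetical stand-in for the closed-source cleanup routine.
 * It dereferences its argument without checking it, which is the
 * behaviour the panic suggests, so it must never be handed NULL.
 */
int
fake_rm_free_unused_clients(struct fake_client_list *list)
{
    int freed = list->unused_clients;

    list->unused_clients = 0;
    return freed;
}

/*
 * Guarded wrapper: returns the number of clients freed, or -1 when
 * the call was skipped because the pointer was NULL.
 */
int
guarded_free_unused_clients(struct fake_client_list *list)
{
    if (list == NULL) {
        fprintf(stderr, "skipping cleanup: NULL client list\n");
        return -1;
    }
    return fake_rm_free_unused_clients(list);
}
```

Calling `guarded_free_unused_clients(NULL)` returns -1 instead of dereferencing the null pointer. As noted in the thread, skipping the call may leave something not cleaned up, so this is a workaround rather than a fix for whatever produces the null pointer.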
A week without crashes. No results for long uptimes, though. Maybe something is not cleaned up this way.
I've been testing out the patch for a few weeks; so far so good. It will take me months or a year to be convinced it helps, but so far there are no downsides for me. I see the nvidia-driver port got updated to 367.44, and it looks like a similar fix is in the official source code for one of the two functions. I could not find a reference to the change in a number of the driver's release notes, so I don't know when it went in or why. Maybe the new version is sufficient? I may end up testing that on a computer.
Does the problem persist after all the recent NVIDIA driver updates?
Since the patch still applies, there are two possibilities: a) they did nothing about this (the discussion with support suggests this is more likely); b) they added logic to bail out on that NULL in the closed-source part. I will only be able to check sometime around the end of the winter, when I get my hands on that machine again. But the test (which I think is valid) is simple: apply the "Patch to x11/nvidia-driver to CAUSE the panic" and run something 3D. When it exits, the kernel should crash. If it does, that would mean the problem is still there.
Created attachment 177275 [details] NEWPATCH: Supposed to be cleaner. UNTESTED!
Of course what I said applies only to 340. Maybe they fixed something in newer major versions, but this is the last one that works for my card.
The 340.xx legacy driver was updated to version 340.102 as of ports r437618; can you try that and see if the crash still occurs after the update?
Created attachment 185350 [details] SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer. Tried it out with the new driver; it seems OK. I did not use that machine as much as before, but repeatedly opening and closing Blender and the like does not crash it. I have deliberately called rm_free_unused_clients with a null pointer, and it still crashes, so I think the problem was not solved in the new driver. I added log messages for the occurrences, but for some reason in nvidia_close_dev() sc->dev will print "unknown" instead of "nvidia0".
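A minimal userland sketch of the kind of logging described above, assuming hypothetical names throughout (`struct fake_softc`, `format_null_guard_message`); it is not the attached patch and uses no kernel APIs. It formats a message naming the device and the function where the null pointer was caught, and substitutes a placeholder when no device name is available. Whether a missing name is what produces "unknown" in nvidia_close_dev() is not established in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the driver's per-device softc. */
struct fake_softc {
    const char *dev_name;   /* e.g. "nvidia0"; may be NULL */
};

/*
 * Format a log line for a skipped cleanup call.
 * Falls back to a placeholder when the softc or its name is missing.
 * Returns the value of snprintf(), i.e. the length the full message
 * needs, which may exceed len if the buffer was too small.
 */
int
format_null_guard_message(char *buf, size_t len, const char *func,
    const struct fake_softc *sc)
{
    const char *name = "(no device)";

    assert(buf != NULL && func != NULL);
    if (sc != NULL && sc->dev_name != NULL)
        name = sc->dev_name;
    return snprintf(buf, len, "%s: %s: skipped cleanup, NULL pointer",
        name, func);
}
```

Logging the calling function alongside the device makes it possible to tell from the message log which of the patched call sites actually hit the null pointer.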
From my point of view this issue was solved a while back, with more than a year of proof; removing myself from cc.
k then. Not in my case. But whatever.