Bug 205903 - x11/nvidia-driver-340: Repeating crash with nvidia-driver: "fault on nofault entry"
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Alexey Dokuchaev
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2016-01-04 20:42 UTC by Michael Danilov
Modified: 2017-11-28 13:28 UTC (History)
CC List: 2 users

See Also:


Attachments
Relevant parts of core.txt (6.29 KB, text/plain)
2016-01-04 20:42 UTC, Michael Danilov
Part of dmesg with WITNESS, WITNESS_SKIPSPIN and INVARIANTS that may be relevant. (1.58 KB, text/plain)
2016-01-04 20:47 UTC, Michael Danilov
Patch to x11/nvidia-driver to CAUSE the panic. (634 bytes, patch)
2016-05-20 07:05 UTC, Michael Danilov
Stack trace of "reproduced" crash. (3.58 KB, text/plain)
2016-05-20 07:07 UTC, Michael Danilov
SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer. (740 bytes, patch)
2016-05-20 07:12 UTC, Michael Danilov
SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer. (706 bytes, patch)
2016-05-20 08:12 UTC, Michael Danilov
NEWPATCH: Supposed to be cleaner. UNTESTED! (869 bytes, patch)
2016-11-22 12:10 UTC, Michael Danilov
SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer. (1.02 KB, patch)
2017-08-13 08:28 UTC, Michael Danilov

Description Michael Danilov 2016-01-04 20:42:23 UTC
Created attachment 165080
Relevant parts of core.txt

The problem has been occurring since some patch level before 10.2-RELEASE-p8 and some minor version of nvidia-driver before 340.96.
Comment 1 Michael Danilov 2016-01-04 20:47:11 UTC
Created attachment 165081
Part of dmesg with WITNESS, WITNESS_SKIPSPIN and INVARIANTS that may be relevant.
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2016-01-04 21:10:18 UTC
Take a guess at which port this applies to and assign.
Comment 3 Andy Ritger 2016-01-06 01:12:45 UTC
So far from the stack trace, it looks like a process (vimb) using a user-mode nvidia driver component (OpenGL?) was killed due to running out of swap, nvidia.ko attempted to clean up after that process, and somehow confused itself over resources that had been allocated for that user-mode nvidia driver component.

Michael: are you able to trigger this on demand by manually killing vimb, or does it require running out of swap and inducing the out-of-memory killer?

340.96 is a relatively old NVIDIA driver version.  Could you test with something more recent?  E.g., either of these:

http://www.nvidia.com/Download/driverResults.aspx/95925/en-us
http://www.nvidia.com/Download/driverResults.aspx/95163/en-us

Thanks.
Comment 4 Michael Danilov 2016-01-06 08:59:43 UTC
Hello,

The problem may be unrelated to vimb: I have also observed the crash after exiting FlightGear, and it could just as well be triggered by mpv. Recently, though, vimb has been causing OOM on a daily basis.

I am unable to use newer drivers: 340 is the last supported by the card.
Comment 5 Michael Danilov 2016-03-08 18:11:12 UTC
Is there any progress in locating the problem?
The driver keeps panicking during accelerated graphics use.
Is there other information I should provide that could be useful?
Comment 6 ebay 2016-04-07 13:09:25 UTC
This panic happens to me with nvidia-driver-349.16 every couple of months on assorted computers, including today.  I'm not sure if I'll be able to provide kernel dumps, because this only happens on workstations where I access sensitive information, and the dumps are large.  In the past this panic could be easily triggered with just startx, but now it happens at random points that seem vaguely graphics related, such as opening thunderbird, opening a new tab in chromium, or sometimes using virtualbox.  The panics started with FreeBSD 10 and never happened on 9, so I suspect something about the binary blob just doesn't match up properly with 10.  Does anyone have a reliable way of triggering this panic so I could build a clean system to reproduce it?
Comment 7 Michael Danilov 2016-04-07 17:30:03 UTC
Not 100% reliable, but try causing an OOM while running (or shortly after having closed) something using GL.

In my case the trigger is closing a GL program, having killed a browser instance that had allocated all memory and swap and then hung.

Most backtraces contain "in rm_free_unused_clients () from /boot/modules/nvidia.ko", a vague hint that this happens when the blob removes something.

The crash on startx was a separate (but maybe related) problem, which I have experienced, too. This patch seemed to mitigate it:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195097
Now it seems to have gone away on its own without any patches.
Comment 8 Michael Danilov 2016-05-20 06:57:44 UTC
Hello,

My bad -- the "on startx" crash did NOT go away on its own -- I had the patch file all along in nvidia-driver/files, and the files there apply to nvidia-driver-340.
Comment 9 Michael Danilov 2016-05-20 07:05:04 UTC
Created attachment 170497
Patch to x11/nvidia-driver to CAUSE the panic.

Alright, I think I have come up with a reliable way to reproduce this annoying problem.

Here is a broken version of the patch suggested in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195097 that causes (I think!) a similar panic. I fed rm_free_unused_clients a null pointer on only one occasion, and it crashed where I expected it to: on exiting 3D programs.

In that bug they suggest: "Here rm_free_unused_clients is called with a null pointer. This function is not open source, but from the panic my guess is it's not happy being called with a null pointer." That may well be the case here.

STEPS TO REPRODUCE A
1. Install driver with patch applied.
2. Start a 3D program (I used Blender).
3. Exit the program.

STEPS TO REPRODUCE B
1. Start X.
2. Kill X.
Comment 10 Michael Danilov 2016-05-20 07:07:08 UTC
Created attachment 170498
Stack trace of "reproduced" crash.
Comment 11 Michael Danilov 2016-05-20 07:12:02 UTC
Created attachment 170499
SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer.

Naturally, I can suggest a kludge in the spirit of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195097
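
The idea amounts to a null check in front of the call. What follows is a minimal userspace sketch of that idea, not the attached patch: struct fake_state and fake_rm_free_unused_clients() are hypothetical stand-ins, because the real function is closed source and its argument type is not shown in this report. Where the real function appears to fault on a null pointer, the stand-in aborts via assert().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for whatever per-device state the driver passes. */
struct fake_state {
	int clients_freed;
};

/*
 * Hypothetical stand-in for the closed-source rm_free_unused_clients().
 * It assumes a valid pointer; the assert models the suspected fault.
 */
static void
fake_rm_free_unused_clients(struct fake_state *st)
{
	assert(st != NULL);
	st->clients_freed++;
}

/*
 * The kludge: skip the cleanup call when there is no state to clean up.
 * Returns 1 if the cleanup ran, 0 if it was skipped.
 */
static int
guarded_free_unused_clients(struct fake_state *st)
{
	if (st == NULL)
		return (0);
	fake_rm_free_unused_clients(st);
	return (1);
}
```

A caller would use guarded_free_unused_clients() wherever it would otherwise call the blob directly. Whether skipping the cleanup leaves resources behind is not something this sketch can show.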

I will report back in a couple of weeks on whether it has shown up again, or earlier if the panic still occurs.

P.S. I am discussing the problem with NVIDIA devs.
https://nvidia-submit.custhelp.com/app/account/questions/detail/i_id/971478
Comment 12 Michael Danilov 2016-05-20 08:12:50 UTC
Created attachment 170501
SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer.

Removed misleading comment.
Comment 13 Michael Danilov 2016-05-30 08:33:08 UTC
A week without crashes.

No results for long uptimes, though. Maybe something is not cleaned up this way.
Comment 14 ebay 2016-09-12 00:52:32 UTC
I've been testing out the patch for a few weeks, so far so good. It will take me months or a year to be convinced it helps but so far no downsides for me.

I see the nvidia-driver port got updated to 367.44, and it looks like a similar fix is in the official source code for one of the two functions. I could not find a reference to the change in the release notes of several driver versions, so I don't know when it went in or why. Maybe the new version is sufficient? I may end up testing that on a computer.
Comment 15 Alexey Dokuchaev freebsd_committer 2016-11-22 09:36:05 UTC
Does the problem persist after all recent nVidia driver updates?
Comment 16 Michael Danilov 2016-11-22 12:02:14 UTC
Since the patch still applies, there are two possibilities:
a) they did nothing about this (the discussion with the support suggests this is more likely);
b) they added logic to bail out on that NULL in the closed-source part.

I will only be able to check sometime around the end of the winter, when I get my hands on that machine again.

But the test (which I think is valid) is simple: apply the "Patch to x11/nvidia-driver to CAUSE the panic" and run something 3D. When it exits, the kernel should crash. If it does, the problem is still there.
Comment 17 Michael Danilov 2016-11-22 12:10:25 UTC
Created attachment 177275
NEWPATCH: Supposed to be cleaner. UNTESTED!
Comment 18 Michael Danilov 2016-11-22 12:18:19 UTC
Of course what I said applies only to 340. Maybe they fixed something in newer major versions, but this is the last one that works for my card.
Comment 19 Alexey Dokuchaev freebsd_committer 2017-04-03 12:22:11 UTC
The 340.xx legacy driver was updated to version 340.102 as of ports r437618; can you try that and see if the crash still occurs after the update?
Comment 20 Michael Danilov 2017-08-13 08:28:46 UTC
Created attachment 185350
SUGGESTED PATCH: prevent rm_free_unused_clients from being called with null pointer.

Tried it out with the new driver, seems ok. Did not use that machine as much as before, but repeatedly opening/closing Blender and such does not crash it.

I have deliberately called rm_free_unused_clients with a null pointer, and it still crashes, so I think the problem was not solved in the new driver.

Added log messages for the occurrences, but for some reason in nvidia_close_dev() sc->dev will print "unknown" instead of "nvidia0".
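
Logging the occurrences could look roughly like the sketch below. It is a userspace illustration only: note_null_state() and null_state_hits are hypothetical names, the message wording is made up, and in the driver the line would go through the kernel's printf rather than into a buffer. It does not explain why the device name comes out as "unknown".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counter of how often the null-pointer guard fired. */
static unsigned long null_state_hits;

/*
 * Hypothetical helper: if state is NULL, format a log line into buf and
 * return 1 so the caller can skip rm_free_unused_clients(); otherwise
 * return 0 and leave buf untouched. devname is whatever name the caller
 * has for the device, e.g. "nvidia0"; where names the calling function.
 */
static int
note_null_state(char *buf, size_t len, const char *devname,
    const char *where, const void *state)
{
	assert(buf != NULL && len > 0 && devname != NULL && where != NULL);
	if (state != NULL)
		return (0);
	null_state_hits++;
	snprintf(buf, len, "%s: %s: null state, skipping cleanup",
	    devname, where);
	return (1);
}
```

Counting the hits alongside the message makes it possible to tell, after a long uptime, whether the guard fired at all.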
Comment 21 ebay 2017-11-27 02:57:03 UTC
From my view this issue has been solved a while back with more than a year of proof, removing myself from cc.
Comment 22 Michael Danilov 2017-11-28 13:28:46 UTC
k then. Not in my case. But whatever.