Bug 257456

Summary: x11/nvidia-driver: Update to 470.57.02
Product: Ports & Packages
Reporter: Kevin Bowling <kbowling>
Component: Individual Port(s)
Assignee: Kevin Bowling <kbowling>
Status: Closed FIXED
Severity: Affects Some People
CC: ashafer, grahamperrin, iwtcex, jkim, junchoon, ltning-freebsd, otis
Priority: ---
Flags: bugzilla: maintainer-feedback? (danfe)
Version: Latest
Hardware: Any
OS: Any
Attachments:
nvidia 470.57.02 (flags: none)
Photograph: OS hung after clicking a site-provided context menu item in Firefox 90.0.2 (flags: none)
nvidia 470.57.02 w/sx locking (flags: none)
Make mtx->sx patches optional for 470.x (flags: none)

Description Kevin Bowling freebsd_committer 2021-07-27 17:43:11 UTC
How do you want to handle the new stable branch?  There are some significant changes here like vulkan support.  There's an out of tree port version at https://github.com/shkhln/revird-aidivn/blob/afdiuxc/x11/nvidia-driver/Makefile
Comment 1 Graham Perrin 2021-07-28 06:02:57 UTC
For Quadro K1100M (GK107GLM) on FreeBSD 14.0-CURRENT, can you tell, is x11/nvidia-driver appropriate? 

I'm particularly interested in wake from sleep (resume from suspend) when the computer (HP ZBook 17 G2) is docked, with a display on DisplayPort. 

On one hand: 

<https://www.nvidia.com/Download/driverResults.aspx/177146/en-us> for 470.57.02 does _not_ list the K1100M, from which a person might assume that a legacy alternative (maybe x11/nvidia-driver-390) is appropriate. 

On the other hand: 

<https://www.nvidia.com/en-us/drivers/unix/legacy-gpu/> _also_ does not list K1100M, from which a person might assume that it's appropriate to use the regular (non-legacy) Unified UNIX Graphics Driver.
Comment 2 Alex S 2021-07-29 00:21:42 UTC
(In reply to Kevin Bowling from comment #0)

Since danfe@ isn't replying...

> new stable branch

The port is ready for 470 since https://github.com/freebsd/freebsd-ports/commit/d64eb42e5b50c43cec29f672d32f04ddb7d8dca8; it should be enough to do a version bump.

> There are some significant changes here like vulkan support.

That's ready as well. I have a minor cleanup patch for the corresponding port parts, though: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253285#c2.
Comment 3 Alexey Dokuchaev freebsd_committer 2021-07-29 09:33:43 UTC
(In reply to Kevin Bowling from comment #0)
> How do you want to handle the new stable branch?
Just by updating the port.  Alex is right, it should be straightforward.

(In reply to Alex S from comment #2)
> I have a minor cleanup patch for the corresponding port parts, though
That should be included as well, thank you.
Comment 4 Kevin Bowling freebsd_committer 2021-07-29 18:45:17 UTC
Created attachment 226784 [details]
nvidia 470.57.02

Attaching the combined patch for others to test.

I'm having stability issues during session restore for KDE5 and 4 Firefox windows that are stable on 460 driver.

On a fresh user/session I didn't see the issue but I only tried one firefox window without much load.  The issue results in a full hang of the system.  I haven't started any debugging yet, so I will try to get some data this weekend when I have more time.  Any thoughts from kde?
Comment 5 Graham Perrin 2021-07-30 09:58:47 UTC
Thank you. 

(In reply to Kevin Bowling from comment #4)

The same experience with 470.57.02 as with 460.84. 

Initial wake from sleep succeeded. Slept by clicking the button at the SDDM log in dialogue. 

Subsequent wakes fail, quickly (without appearance of an image) – and hard (no response to keyboard or trackpad input, no response to a normal press on the power button). 

If it helps: at the time of the second sleep with 470.57.02 – after the one successful wake – a succession of "… unexpected …" messages scrolled by. So quickly that (sorry) I had no chance to tell _what_ was unexpected. 

Loosely speaking (I don't know the technologies), it feels like, maybe, something troublesome is cached, but I have no idea where (or how to clear the cache).  

<https://bsd-hardware.info/?probe=2faf8af7be>

% uclcmd get --file /boot/loader.conf screen.font
null
% grep screen.font /boot/loader.conf | grep -v \#
screen.font="8x16"
% sysrc -f /etc/rc.conf kld_list
kld_list: fusefs usbhid nvidia-modeset
% pkg info -x nvidia
linux-nvidia-libs-460.84
nvidia-driver-470.57.02
nvidia-settings-470.42.01
nvidia-xconfig-470.42.01
% uname -KUv
FreeBSD 14.0-CURRENT #103 main-n248269-941650aae97: Wed Jul 28 07:28:47 BST 2021     root@mowa219-gjp4-8570p:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG  1400026 1400026
%
Comment 6 Graham Perrin 2021-07-30 10:23:37 UTC
(In reply to Kevin Bowling from comment #4)

> … stability issues during session restore for KDE5 and 
> 4 Firefox windows that are stable on 460 driver.

I can't comment on the 460 experience, because I habitually: 

1. quit Firefox
2. await disappearance of Firefox windows
3. watch htop until all firefox processes end

– before logging out.
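
The three steps above can be approximated from a terminal. This is only a sketch: the function name wait_for_exit and the 60-second default timeout are illustrative, and it assumes the browser's processes are named "firefox".

```sh
# Wait until no process with the given exact name is left, or give up
# after a timeout (in seconds). Returns 0 when all such processes are
# gone, 1 if the timeout expired while some were still running.
wait_for_exit() {
    name=$1
    timeout=${2:-60}
    waited=0
    while pgrep -x "$name" >/dev/null 2>&1; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Usage, after quitting Firefox from its own menu: wait_for_exit firefox 120 && echo "safe to log out"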
 

> On a fresh user/session I didn't see the issue but I only tried 
> one firefox window without much load.  The issue results in a 
> full hang of the system. 

Exceptionally, I logged out without quitting Firefox (two windows, around 952 tabs, various extensions). For me, too: 

* the subsequent log in (without restarting the OS) resulted in a 
  full hang of the OS – no response to input – so I pressed and 
  held the power button. 

The subsequent log in (after starting the OS) restored Firefox, and most other windowed applications, without a hang of the OS.
Comment 7 Graham Perrin 2021-07-30 11:40:13 UTC
Created attachment 226795 [details]
Photograph: OS hung after clicking a site-provided context menu item in Firefox 90.0.2

firefox-90.0.2,2

Right-click (Kensington trackball), release, roll the ball down to the required context menu item, click, the OS hung hard. This photograph shows an arrow pointer. 

Not reproducible with nvidia-driver-460.84; the gloved hand pointer changes to an arrow pointer within a split-second of clicking the context menu item.
Comment 8 Kevin Bowling freebsd_committer 2021-07-30 14:48:26 UTC
Reported instability to nvidia FreeBSD forum https://forums.developer.nvidia.com/t/instability-with-470-57-02/185212
Comment 9 Jung-uk Kim freebsd_committer 2021-07-30 21:09:57 UTC
(In reply to Kevin Bowling from comment #4)
I had a similar problem with 470.42.01 and now 470.57.02 shows the same symptom.  FWIW, I've been using 465.31 for a while and it's been quite stable.
Comment 10 Tomoaki AOKI 2021-07-31 00:05:46 UTC
(In reply to Jung-uk Kim from comment #9)

For me, it does NOT hang at all; I get sudden REBOOTS WITHOUT CORE instead.
The situations are quite random.

 *Just moving the mouse cursor over a Firefox window.
 *While reloading any web pages.
 *Logging into somewhere.

Not at all reproducible after the sudden reboots.

I FEEL 470.42.01 was worse, as it rebooted multiple times a day and I had to go back to 465.31, while 470.57.02 reboots about once every 2 or 3 days.
The 465 series was stable, as you wrote.
Comment 11 Alex S 2021-07-31 00:18:42 UTC
I have had only 2 or 3 lockups this whole year (with FreeBSD 12.2/13 and every beta version of Nvidia's Unix driver), with no correlation to anything in particular. Either you are being overly dramatic or it's something specific to CURRENT.
Comment 12 Kevin Bowling freebsd_committer 2021-07-31 01:49:26 UTC
(In reply to Alex S from comment #11)
There are several reports of unexpected lockups post 465 here, so I'm not sure who your comment is aimed at, but it won't move things forward the way figuring out what is going on will.
Comment 13 Austin Shafer 2021-08-02 00:49:14 UTC
> For Quadro K1100M (GK107GLM) on FreeBSD 14.0-CURRENT, can you tell, is x11/nvidia-driver appropriate?

Yes, use x11/nvidia-driver. The K1100M is a Kepler-series GPU, and 470 is the last driver release to support those.

> There are several reports of unexpected lockups post 465 here

I think these lockups are the result of a panic with the message "Sleep with fops_mtx held in Nvidia driver 470.42.01 on FreeBSD", which we knew about and have a fix for. This should be fixed in the next point release of 470. (sorry, I thought I had committed it in time for 470.57) The fix is changing fops_mtx in src/nvidia/nv-freebsd.h to an sx lock.

If someone could get an actual coredump/kernel stacktrace that would be helpful to confirm they are hitting the same issue.
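
For anyone trying to capture such a dump, a minimal sketch of the usual FreeBSD crash-dump settings follows. It assumes a swap device large enough to hold the dump; the values shown are the common ones and are not taken from this report.

```sh
# /etc/rc.conf fragment (assumed typical settings, adjust as needed)
dumpdev="AUTO"        # use the configured swap device for kernel dumps
dumpdir="/var/crash"  # where savecore(8) stores vmcore.N after reboot
```

After a panic and reboot, savecore(8) should leave vmcore.N (and usually core.txt.N) under /var/crash, and "dumpon -l" shows the active dump device. A hard lock that never reaches a panic will not produce a dump.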

Not ground truth, but I did a quick spot check with a recent-ish CURRENT, firefox, KDE5, and the latest internal version of the nvidia driver and couldn't trigger any panics. I'll keep an eye out.
Comment 14 Alexey Dokuchaev freebsd_committer 2021-08-02 02:24:37 UTC
Given the number of reported problems with 470.57.xx, perhaps we could update the port to version 460.91.03 in the meantime, as otis@ had suggested?
Comment 15 Kevin Bowling freebsd_committer 2021-08-03 01:37:21 UTC
Created attachment 226898 [details]
nvidia 470.57.02 w/sx locking

My machine hard locks so I am unable to see a panic message on the vt or get a core.

With Austin's suggestion I am running fops_mtx converted to an sx xlock and it is working well for me.

This attachment is a quick and dirty patch just for the main 'nvidia-driver' port so others can confirm it works as well.  If it works for others I will do it correctly.
Comment 16 Graham Perrin 2021-08-03 06:52:25 UTC
(In reply to Tomoaki AOKI from comment #10)

> … sudden REBOOTS WITHOUT CORE …

There's a comparable report in the GhostBSD area, below 
<https://forums.ghostbsd.org/viewtopic.php?p=9893#p9893>


(In reply to Kevin Bowling from comment #15)

Thanks, I'll try.
Comment 17 Jung-uk Kim freebsd_committer 2021-08-03 17:17:21 UTC
Created attachment 226916 [details]
Make mtx->sx patches optional for 470.x

(In reply to Kevin Bowling from comment #15)
The mtx->sx change fixed the instability issue for me.  This patch should make it optional for 470.x.
Comment 18 Kevin Bowling freebsd_committer 2021-08-04 03:20:34 UTC
danfe, what do you think about the above?  I've been hammering this and cannot see any obvious issues.
Comment 19 Alexey Dokuchaev freebsd_committer 2021-08-04 05:54:21 UTC
(In reply to Kevin Bowling from comment #18)
> danfe, what do you think about the above?
I'd like to see it reviewed and picked up by upstream first, ideally with another bugfix release in the 470.xx branch, rather than us having to maintain even more local patches.  I'm not sure if we have a direct contact with them now, but perhaps the forum thread mentioned in comment #8 could be updated?
Comment 20 Graham Perrin 2021-08-04 06:45:31 UTC
(In reply to Kevin Bowling from comment #15)

> … a quick and dirty patch just for the main 'nvidia-driver' port …

Firefox: no problem at the time of writing. 

OS wake from sleep: not working. 

% uname -KUv
FreeBSD 14.0-CURRENT #103 main-n248269-941650aae97: Wed Jul 28 07:28:47 BST 2021     root@mowa219-gjp4-8570p:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG  1400026 1400026
% pkg info -x firefox nvidia | sort
firefox-90.0.2,2
linux-nvidia-libs-460.84
nvidia-driver-470.57.02
nvidia-settings-470.42.01
nvidia-xconfig-470.42.01
% kldstat | grep -e nv -e drm
19    1 0xffffffff836ae000   106310 nvidia-modeset.ko
20    1 0xffffffff83800000  1fa0a48 nvidia.ko
%
Comment 21 Tomoaki AOKI 2021-08-04 12:44:09 UTC
Thanks for the patches.
The patched driver looks to be working fine ATM, but as I wrote previously, I would need at least 4 or 5 more days (maybe 1 or 2 weeks to be sure) to confirm.
Tested mainly on stable/13 and partially on main, amd64.

BTW, I experienced my first hang (previously, all were sudden reboots), under the conditions below:

 *Right after building/installing the patched nvidia-driver, but before rebooting.
 *Clicked a tab in Firefox, and then it hung.
 *NumLock and Shift-Caps still toggle the keyboard LEDs.
 *Ctrl-Alt-BS, Ctrl-Alt-Del, Ctrl-Alt-Fn and a short press of the power SW
  didn't work. Only a long press of the power SW (forced power-off) worked.
 *No core left.

So this hang should be a problem of the unpatched nvidia-driver.
Comment 22 Tomoaki AOKI 2021-08-08 11:55:09 UTC
No sudden reboots or hangs so far.
I will keep watching and report back if either happens.
Comment 23 Eirik Oeverby 2021-08-09 17:53:33 UTC
(In reply to Graham Perrin from comment #16)

This was reported by me. For both 460.x and 470.x the symptom is insta-reboot; this may or may not be a property of GhostBSD - my gut feeling is however that it may also be hardware dependent whether you see a hang or a reboot.

However: This *did* happen with 460 too; but not as reproducible or frequent as with 470. The patch suggested below - is it reasonable to assume that this was a problem present-but-less-frequent in the older drivers too?
Comment 24 Alex S 2021-08-10 18:41:54 UTC
FYI, 470.63.01 was released today.
Comment 25 Austin Shafer 2021-08-10 18:53:04 UTC
Like Alex said, 470.63 is out. I verified that it has fixes for both known panics. It should be good to update the port to that version. Please let me know if there are more stability problems.

> This *did* happen with 460 too; but not as reproducible or frequent
This is probably not the fops_mtx panic, but the other one from nvidia-modeset. I only got that to reproduce a couple times when I was using multiple monitors, and haven't seen anyone else run into it. My money is on that being your issue.
Comment 26 Eirik Oeverby 2021-08-10 19:02:42 UTC
(In reply to Austin Shafer from comment #25)
>Like Alex said, 470.63 is out. I verified that it has fixes for both known panics. It should be good to update the port to that version. Please let me know if there are more stability problems.

I'll test this on my workstation as soon as I'm able.

>> This *did* happen with 460 too; but not as reproducible or frequent
>This is probably not the fops_mtx panic, but the other one from nvidia-modeset. I only got that to reproduce a couple times when I was using multiple monitors, and haven't seen anyone else run into it. My money is on that being your issue.

It could be; I have a monitor with a ridiculous resolution - LG 34WK95U at 5120x2160. This is more than most dual-display configurations used to have until very recently, so I'm not surprised if it tickles similar bugs.
Comment 27 commit-hook freebsd_committer 2021-08-11 02:49:09 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=77a2452a895b80a7136d1453ce4ad5fe00b58773

commit 77a2452a895b80a7136d1453ce4ad5fe00b58773
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-08-11 02:47:16 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-08-11 02:47:16 +0000

    x11/{linux-nvidia-libs,nvidia-driver}: Update to 470.63.01

    This is the new stable branch and adds support for Vulkan.

    See https://www.nvidia.com/Download/driverResults.aspx/177146/en-us
    and https://www.nvidia.com/download/driverResults.aspx/179601/en-us
    for additional changes.

    PR:             257456
    Approved by:    danfe

 x11/linux-nvidia-libs/Makefile | 2 +-
 x11/linux-nvidia-libs/distinfo | 6 +++---
 x11/nvidia-driver/Makefile     | 2 +-
 x11/nvidia-driver/distinfo     | 6 +++---
 4 files changed, 8 insertions(+), 8 deletions(-)
Comment 28 Kevin Bowling freebsd_committer 2021-08-11 02:53:32 UTC
Thanks, Austin and danfe for your work!
Comment 29 Alexey Dokuchaev freebsd_committer 2021-08-11 08:45:26 UTC
Thanks for the update, albeit technically I did not approve it.
Comment 30 Kevin Bowling freebsd_committer 2021-08-11 21:56:31 UTC
(In reply to Alexey Dokuchaev from comment #29)
Sorry for the misunderstanding, I read comments 3 and 19 as "go for it once it's ready".
Comment 31 Graham Perrin 2021-08-16 06:30:23 UTC
root@mowa219-gjp4-zbook:~ # pkg info -x nvidia
nvidia-driver-470.63.01_1
nvidia-xconfig-470.42.01
root@mowa219-gjp4-zbook:~ # sysrc kld_list
kld_list: nvidia-modeset
root@mowa219-gjp4-zbook:~ # 

Are these backtraces significant? They are at the tail of /var/log/messages:

----

…
nvidia0: <Unknown> on vgapci0
vgapci0: child nvidia0 requested pci_enable_io
vgapci0: child nvidia0 requested pci_enable_io
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.63.01  Tue Aug  3 20:24:32 UTC 2021
acpi_wmi0: <ACPI-WMI mapping> on acpi0
acpi_wmi0: Embedded MOF found
ACPI: \134_SB.WMID.WQAB: 1 arguments were passed to a non-method ACPI object (Buffer) (20210730/nsarguments-361)
acpi_wmi1: <ACPI-WMI mapping> on acpi0
acpi_wmi1: Embedded MOF found
ACPI: \134_SB.PCI0.WMI1.WQXM: 1 arguments were passed to a non-method ACPI object (Buffer) (20210730/nsarguments-361)
ichsmb0: <Intel Lynx Point SMBus controller> port 0xef80-0xef9f mem 0xd2137000-0xd21370ff at device 31.3 on pci0
smbus0: <System Management Bus> on ichsmb0
iwm0: <Intel(R) Dual Band Wireless AC 7260> at device 0.0 on pci4
iwm0: hw rev 0x140, fw ver 17.352738.0, address ⋯
wlan0: Ethernet address: ⋯
lo0: link state changed to UP
em0: link state changed to UP
wlan0: link state changed to UP
Security policy loaded: MAC/ntpd (mac_ntpd)
ACPI Warning: \134_SB.PCI0.PEGP.DGFX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-212)
acquiring duplicate lock of same type: "os.lock_mtx"
 1st os.lock_mtx @ nvidia_os.c:882
 2nd os.lock_mtx @ nvidia_os.c:882
stack backtrace:
#0 0xffffffff80c90dd1 at witness_debugger+0x71
#1 0xffffffff80bfc554 at __mtx_lock_flags+0x94
#2 0xffffffff8424f00b at os_acquire_spinlock+0x1b
#3 0xffffffff83f4c4bc at _nv035262rm+0xc
acquiring duplicate lock of same type: "fops_sx"
 1st fops_sx @ nvidia_subr.c:400
 2nd fops_sx @ nvidia_subr.c:1051
stack backtrace:
#0 0xffffffff80c90dd1 at witness_debugger+0x71
#1 0xffffffff80c2a9a7 at _sx_xlock+0x67
#2 0xffffffff8425281f at nv_add_mapping_context_to_file+0x7f
#3 0xffffffff8419285d at _nv036018rm+0x59d
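
To pull only the duplicate-lock reports out of a long log, something like the following could be used. It is a sketch: the function name witness_summary is made up for illustration, and it assumes the three-line report layout shown in the log above.

```sh
# Print one line per "acquiring duplicate lock" report:
#   <lock name>: <first acquisition site> -> <second acquisition site>
# Reads the named files, or standard input when none are given.
witness_summary() {
    awk '
        /acquiring duplicate lock of same type:/ {
            # The lock name is the quoted string on this line.
            n = split($0, q, "\"")
            name = (n >= 3) ? q[2] : "?"
            first = ""
            next
        }
        # The last field of the 1st/2nd lines is file:line.
        name != "" && / 1st .* @ / { first = $NF; next }
        name != "" && / 2nd .* @ / {
            print name ": " first " -> " $NF
            name = ""
        }
    ' "$@"
}
```

Usage: witness_summary /var/log/messages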
Comment 32 Graham Perrin 2021-08-16 06:41:40 UTC
(In reply to Kevin Bowling from comment #8)

Would someone like to update the topic in the NVIDIA forum? 

(OT: I tried, failed repeatedly, to sign in with Google.)

(In reply to Graham Perrin from comment #1)

> I'm particularly interested in wake from sleep (resume from suspend) 
> when the computer (HP ZBook 17 G2) is docked, with a display on DisplayPort. 

Without docking the notebook – with power and Ethernet cables alone attached: 

* still, there's failure to wake from sleep. 

Not worth me reporting the bug, because I'll no longer use this computer.