Bug 233259 - multimedia/libva-intel-driver with HYBRID enabled causes system freezes on Sandy Bridge
Status: In Progress
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Jan Beich
URL:
Keywords: regression
Depends on:
Blocks: 232981
Reported: 2018-11-16 19:21 UTC by rkoberman
Modified: 2019-01-16 19:11 UTC

See Also:
madpilot: maintainer-feedback+


Description rkoberman 2018-11-16 19:21:31 UTC
With the default configuration of multimedia/libva-intel-driver, the HYBRID option is enabled. On my amd64 Sandy Bridge system this results in periodic system deadlocks. These were roughly daily events from the time I installed the hybrid driver. None had occurred before the hybrid driver was added, and none has occurred since I disabled this option.

I might also mention that I was seeing something similar about 8 months ago and reported it in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226495. It disappeared after updates to mesa and moving to drm-stable-kmod.

Here are the logs of the events of 9-Nov, 10-Nov, and 11-Nov:
Nov  9 13:09:11 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:11 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov  9 13:09:13 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:13 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov  9 13:09:15 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:15 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov  9 13:09:18 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:18 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov  9 13:09:18 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:18 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov  9 13:09:18 rogue kernel: [drm] GPU HANG: ecode 6:0:0x00000000, in Xorg [1454], reason: Hang on render ring, action: reset
Nov  9 13:09:18 rogue kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov  9 13:09:18 rogue kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov  9 13:09:18 rogue kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov  9 13:09:18 rogue kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov  9 13:09:18 rogue kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov  9 13:09:18 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:18 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov  9 13:09:18 rogue kernel: drm/i915: Resetting chip after gpu hang
Nov  9 13:09:18 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov  9 13:09:18 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out


Nov 10 13:25:57 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:25:57 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:25:57 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:25:57 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:26:00 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:26:00 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:26:02 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:26:02 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:26:02 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:26:02 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:26:02 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:26:02 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:26:02 rogue kernel: [drm] GPU HANG: ecode 6:0:0x00000000, in Xorg [1419], reason: Hang on render ring, action: reset
Nov 10 13:26:02 rogue kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 10 13:26:02 rogue kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 10 13:26:02 rogue kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 10 13:26:02 rogue kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 10 13:26:02 rogue kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 10 13:26:02 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:26:02 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out
Nov 10 13:26:02 rogue kernel: drm/i915: Resetting chip after gpu hang
Nov 10 13:26:02 rogue kernel: [drm:fw_domain_wait_ack] render: timed out waiting for forcewake ack request.
Nov 10 13:26:02 rogue kernel: [drm:__gen6_gt_wait_for_thread_c0] GT thread status wait timed out


Nov 11 22:46:21 rogue kernel: [drm] GPU HANG: ecode 6:0:0xf4e9fffe, in Xorg [1355], reason: Hang on blitter ring, action: reset
Nov 11 22:46:21 rogue kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 11 22:46:21 rogue kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 11 22:46:21 rogue kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 11 22:46:21 rogue kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 11 22:46:21 rogue kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 11 22:46:21 rogue kernel: drm/i915: Resetting chip after gpu hang
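
For anyone comparing similar logs, the hang lines above can be pulled out with a short script. This is only a minimal sketch: the regular expression is an assumption based on the exact line layout shown in this report, and the names `HANG_RE`, `summarize_hangs`, and `SAMPLE` are hypothetical, not part of any FreeBSD or drm tool.

```python
import re

# Assumed layout of the "[drm] GPU HANG" lines, taken from the log in this report.
HANG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \[drm\] GPU HANG: "
    r"ecode (?P<ecode>\S+), in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: Hang on (?P<ring>\w+) ring"
)


def summarize_hangs(lines):
    """Return one (timestamp, process, pid, ring) tuple per GPU HANG line."""
    hangs = []
    for line in lines:
        match = HANG_RE.match(line)
        if match:
            hangs.append(
                (match["stamp"], match["proc"], int(match["pid"]), match["ring"])
            )
    return hangs


# The three GPU HANG lines from the log above, plus one unrelated line.
SAMPLE = [
    "Nov  9 13:09:18 rogue kernel: [drm] GPU HANG: ecode 6:0:0x00000000, "
    "in Xorg [1454], reason: Hang on render ring, action: reset",
    "Nov 10 13:26:02 rogue kernel: [drm] GPU HANG: ecode 6:0:0x00000000, "
    "in Xorg [1419], reason: Hang on render ring, action: reset",
    "Nov 11 22:46:21 rogue kernel: [drm] GPU HANG: ecode 6:0:0xf4e9fffe, "
    "in Xorg [1355], reason: Hang on blitter ring, action: reset",
    "Nov 11 22:46:21 rogue kernel: drm/i915: Resetting chip after gpu hang",
]

if __name__ == "__main__":
    for hang in summarize_hangs(SAMPLE):
        print(hang)
```

Applied to the log above, this makes it easy to see that the first two hangs were on the render ring and the third on the blitter ring. To scan a real log, pass an open file (for example `/var/log/messages`) to `summarize_hangs` in place of `SAMPLE`.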
Comment 1 Guido Falsi freebsd_committer 2018-11-16 20:25:39 UTC
Thanks for the report.

I'm adding Jan Beich, who proposed the HYBRID option and enabled it.

Unluckily I'm unable to test on a wide range of hardware. It worked on my system and I had no objection to the plan.

BTW I'm using the drm-devel-kmod package, and I'm on head. Could you test with drm-next and see if that mitigates the problem?

After this report I'm inclined to disable the option by default.

I'd like to get feedback about this from Jan though.

Jan can you express your ideas about this?
Comment 2 rkoberman 2018-11-16 22:22:45 UTC
(In reply to Guido Falsi from comment #1)
I already dropped a note to Jan. I have been his Sandy Bridge tester and tested this for functionality, but didn't see the first hang for a bit because I was busy moving and my new home's network was not working.

There is an old issue with GPU hangs on Sandy Bridge with power management enabled that was supposed to have been worked around (not fixed, as it was a hardware problem). I certainly stopped seeing it. Then came the issue early this year that looked just like what I've been seeing now with hybrid mode. It went away with an updated mesa and a move to drm-stable-kmod.

I suspect that the support of the hybrid driver again exposed the problem. After all, hybrid mode does not actually work on Sandy Bridge, so I suspect that there was no thought to dealing with the problem. This is really conjecture, though.

I'll try to install drm-devel-kmod a bit later and see what happens, though it will be 3 or 4 days before I will feel confident of success. Failure may take far less time.
Comment 3 Jan Beich freebsd_committer 2018-11-17 13:36:22 UTC
I'm OK with HYBRID disabled. It can be turned into a flavor instead, e.g., libva-intel-driver@hybrid. Giving up on drm-stable-kmod isn't a good idea (even if the graphics team plans it), as later versions aren't stable on xf86-video-intel, at least with SNA enabled.

Making media_driver_data_init() return false on Skylake doesn't lead to GPU hangs. Prior to that, the hybrid driver does some initialization (e.g., intel_bufmgr_gem_*), which probably exacerbates Sandy Bridge instability on drm-stable-kmod.
Comment 4 Guido Falsi freebsd_committer 2018-11-17 15:30:15 UTC
In the meantime, I'm going to disable the HYBRID option by default, since the risk of causing lockups for further unsuspecting users is too great.

I'll look at adding it as a flavor in the next few days.
Comment 5 commit-hook freebsd_committer 2018-11-17 15:33:52 UTC
A commit references this bug:

Author: madpilot
Date: Sat Nov 17 15:33:34 UTC 2018
New revision: 485138
URL: https://svnweb.freebsd.org/changeset/ports/485138

Log:
  Disable HYBRID option by default due to lockups being reported on
  Sandy Bridge CPUs.

  PR:		233259
  Submitted by:	rkoberman@gmail.com

Changes:
  head/multimedia/libva-intel-driver/Makefile
Comment 6 Jan Beich freebsd_committer 2018-12-30 06:00:07 UTC
Kevin, do hangs from HYBRID still occur after ports r487275 or ports r487274?
Comment 7 rkoberman 2018-12-30 07:21:09 UTC
(In reply to Jan Beich from comment #6)
Now running 12.0-stable and the latest drm-fbsd12.0-kmod (g20181215). I have not been running with HYBRID, but will build with it tonight and see how it goes. Since the hangs were infrequent, it will take a bit of time before I can report, at least a day or two.
Comment 8 Guido Falsi freebsd_committer 2019-01-16 18:07:47 UTC
Passing this PR to new port maintainer.
Comment 9 rkoberman 2019-01-16 18:27:49 UTC
(In reply to rkoberman from comment #7)
After two weeks of HYBRID on my Sandy Bridge system with 12-STABLE and mesa 19.3.1, I have had no significant issues.

I have seen occasional blocks of garbage pop up. They are always the same height, probably 128 pixels, and of highly variable width. The blocks appear to contain random noise. Redrawing the window by minimizing or window shading makes the blocks vanish, and they appear only very rarely. While I have only seen them since installing with the HYBRID option, they have been so rare that I am not at all sure that libva-intel-driver is the cause.

The prior lockup issues have not recurred.
Comment 10 Jan Beich freebsd_committer 2019-01-16 18:53:41 UTC
(In reply to rkoberman from comment #9)
> I have seen occasional blocks of garbage pop up.

On modesetting(4x)? Can you try UXA on xf86-video-intel? I see rendering glitches with modesetting(4x) on Skylake myself: stutter on GL init and switching workspaces, black screen flickering on VAAPI init.

Otherwise, thanks for testing. HYBRID can probably be re-enabled after adding a warning to UPDATING, so users (on FreeBSD 11.*) can report whether HYBRID=off helps in case of stability issues. Unfortunately, we don't have telemetry to reduce the guessing.
Comment 11 Jan Beich freebsd_committer 2019-01-16 19:11:56 UTC
If you're tired of testing, a composite manager may help to clean up rendering glitches. Try installing x11-wm/compton and maybe check my config[1].

[1] https://github.com/FreeBSDDesktop/kms-drm/issues/32
    "vsync" is disabled because it rarely helps but incurs
    performance cost