Bug 193500

Summary: Interrupt storm after loading i915kms module on Gen4 Intel GPU
Product: Base System Reporter: Jan Kokemüller <jan.kokemueller>
Component: kern Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Only Me CC: adrian, danfe, emaste, henry.hu.sh, kib, makc, moertael, wolfgang
Priority: --- Keywords: i915
Version: CURRENT   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=190186
Attachments:
Description Flags
contents of /var/run/dmesg.boot
none
output of "pciconf -lvc"
none
output of "vmstat -i" for hw.drm.msi=0/1
none
Fix/workaround for interrupt storm on GM45 when loading i915kms
none
Reset gmbus interrupt mask register explicitely
none
ktrdump while loading i915kms (debug.ktr.mask=4)
none
Do not call intel_opregion_enable_asle() from the i915_driver_irq_postinstall
none

Description Jan Kokemüller 2014-09-09 17:30:02 UTC
When I test the new i915 code (r270990 snapshot iso or backported to 10-stable, doesn't matter) on my gen4 Intel GPU (GMA 4500MHD), I'm getting reproducible interrupt storms from "irq16: uhci0" after loading the i915kms module. It does not matter if I start X or not. This looks very similar to the situation here:

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-June/045369.html

Setting hw.drm.msi=0 in /boot/loader.conf doesn't help. One suspend/resume cycle makes the problem go away. When I revert commit r270516 (the opregion changes) the problem goes away, too. Then there is no irq16 at all in the output of "vmstat -i".

I've attached the output of "pciconf -lvc" and the contents of /var/run/dmesg.boot.
Comment 1 Jan Kokemüller 2014-09-09 17:30:50 UTC
Created attachment 147134 [details]
contents of /var/run/dmesg.boot
Comment 2 Jan Kokemüller 2014-09-09 17:31:27 UTC
Created attachment 147135 [details]
output of "pciconf -lvc"
Comment 3 Ed Maste freebsd_committer freebsd_triage 2014-09-12 16:56:33 UTC
Can you add the output of vmstat -i taken while the storm is happening, both with and without hw.drm.msi=0 set?
Comment 4 Jan Kokemüller 2014-09-12 18:21:21 UTC
Created attachment 147255 [details]
output of "vmstat -i" for hw.drm.msi=0/1

With the opregion patch I get the "irq16: uhci0" line in the output of "vmstat -i". Setting "hw.drm.msi" to 0 or 1 makes no difference. In both cases I get about 220000 interrupts per second.
Comment 5 Ed Maste freebsd_committer freebsd_triage 2014-09-12 20:36:55 UTC
Thanks.

It's odd to me that in both the hw.drm.msi=0 and hw.drm.msi=1 cases exactly 1 MSI is delivered.  Could you try disabling msi altogether perhaps, via hw.pci.enable_msi=0 in the loader?
Comment 6 Jan Kokemüller 2014-09-12 20:58:02 UTC
OK, then I get:

interrupt                          total       rate
irq1: atkbd0                         900          3
irq9: acpi0                         3605         13
irq12: psm0                         6468         24
irq14: ata0                         9412         36
irq15: ata1                          197          0
irq16: uhci0++                     10084         38
irq19: iwn0 uhci2++                 6400         24
irq20: hpet0                       43476        167
irq22: hdac0                          95          0
irq23: uhci3 ehci1                    84          0
irq256: re0                         1436          5
Total                              82157        317


…and CPU usage is back to normal.
Comment 7 Ed Maste freebsd_committer freebsd_triage 2014-09-14 14:29:56 UTC
It's strange that globally disabling MSI works, but disabling it just for drm does not.  hw.drm.msi is also available as a read-only sysctl after boot - can you confirm in the "hw.drm.msi=0" case that it managed to be set to 0?
Comment 8 Jan Kokemüller 2014-09-14 15:43:38 UTC
Indeed, it wasn't set properly. I could have sworn I checked that, but apparently not… I made an error when backporting the sysctl changes. I now get about 60 interrupts per second when running X on irq16 (possibly vblank interrupts for the compositor I'm running):

interrupt                          total       rate
irq1: atkbd0                        3988          3
irq9: acpi0                        19361         15
irq12: psm0                       123546        101
irq14: ata0                        35149         28
irq15: ata1                          833          0
irq16: uhci0+                      79654         65
irq20: hpet0                      263768        216
irq23: uhci3 ehci1                    84          0
irq256: hdac0                         95          0
irq258: iwn0                       67988         55
irq259: re0                           35          0
Total                             594501        488
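
The check from comment 7 (does the loader tunable actually reach the running kernel?) could be sketched as follows. This is a minimal illustration, not a tool from this report: the helper names are hypothetical, the loader.conf parsing only handles simple name="value" lines, and read_sysctl assumes a FreeBSD system where `sysctl -n hw.drm.msi` is available.

```python
import subprocess


def parse_loader_conf(text):
    """Return {name: value} for simple name=value lines in loader.conf text."""
    settings = {}
    for raw in text.splitlines():
        # Drop trailing comments; this simple sketch ignores '#' inside quotes.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def read_sysctl(name):
    """Read a live sysctl value by running `sysctl -n <name>` (FreeBSD)."""
    result = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def tunable_applied(loader_conf_text, name, live_value):
    """True when loader.conf requests a value and the live value matches it."""
    wanted = parse_loader_conf(loader_conf_text).get(name)
    return wanted is not None and wanted == live_value
```

On the affected machine one would call tunable_applied(open("/boot/loader.conf").read(), "hw.drm.msi", read_sysctl("hw.drm.msi")); a False result points at the situation described in comment 8, where the setting never took effect.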
Comment 9 Jan Kokemüller 2014-09-14 15:54:23 UTC
So to clarify, setting "hw.drm.msi=0" is fine as a workaround. When hw.drm.msi is set to 1, I have to do one suspend/resume cycle for the interrupts on irq16 to stop. Then the GPU uses irq260.

hw.drm.msi=1 (after suspend/resume):
interrupt                          total       rate
irq1: atkbd0                        2085          2
irq9: acpi0                        15696         18
irq12: psm0                        85609        100
irq14: ata0                        21899         25
irq15: ata1                          594          0
irq16: uhci0                     3080356       3619
irq20: hpet0                      171708        201
irq23: uhci3 ehci1                   168          0
irq256: hdac0                        113          0
irq258: iwn0                       42571         50
irq259: re0                           24          0
irq260: vgapci0                    33092         38
Total                            3453915       4058
Comment 10 Jan Kokemüller 2014-09-14 16:19:36 UTC
Slight correction: The GPU uses irq260 before *and* after the suspend/resume cycle if hw.drm.msi=1. So the interrupt storm on irq16 is in addition to the "correct" behavior.
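
The "rate" column of vmstat -i appears to be an average since boot: the table in comment 9 still shows a high rate for irq16 even though the storm had stopped by then. To see the current rate, one can diff two snapshots taken a known number of seconds apart. The sketch below assumes the three-column layout shown in this report; the function names are hypothetical.

```python
import re

# One data row of `vmstat -i`: "<name>  <total>  <rate>"; the name may contain spaces.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)\s*$")


def parse_vmstat_i(text):
    """Return {interrupt name: total count} from `vmstat -i` output."""
    totals = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or blank line
        name = match.group("name").strip()
        if name == "Total":
            continue  # skip the summary row
        totals[name] = int(match.group("total"))
    return totals


def current_rates(before, after, seconds):
    """Interrupts per second for each source between two snapshots."""
    old = parse_vmstat_i(before)
    new = parse_vmstat_i(after)
    return {
        name: (count - old.get(name, 0)) / seconds
        for name, count in new.items()
    }
```

Note that the device list after "irqN:" can change between snapshots (for example "uhci0++" versus "uhci0+" above), in which case the names will not line up; keying on the "irqN" prefix alone would be more robust.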
Comment 11 Jan Kokemüller 2014-09-15 17:31:50 UTC
Created attachment 147358 [details]
Fix/workaround for interrupt storm on GM45 when loading i915kms

This seems to be a known hardware quirk with the GM45 GPU where the gmbus can generate both MSI and non-MSI interrupts:
http://lists.freedesktop.org/archives/dri-devel/2013-March/036295.html

I've managed to stop the interrupts with the attached patch. The first part restarts the IRQ stuff before calling intel_opregion_init. The second part just makes sure that intel_opregion_enable_asle called from the irq_postinstall handler in drm_irq_install has no effect, as this is called again in intel_opregion_init.

I've tested this on 10-stable with the drm stuff backported and will now check if it also works with a recent snapshot of -current.
Comment 12 Konstantin Belousov freebsd_committer freebsd_triage 2014-09-15 18:39:46 UTC
(In reply to jan.kokemueller from comment #11)
> Created attachment 147358 [details]
> Fix/workaround for interrupt storm on GM45 when loading i915kms

Could you please describe how the discussion and fix from Linux commit c12aba5aa0e60b7 are related to your patch?

Also note that our i915 gmbus code does not use interrupts; the iic is polled.  So even if the c12... is somewhat related, it is probably not relevant as is.

I can only guess that some BIOSes leave the gmbus interrupt mask register in a non-zero state, for whatever reason.  Also, at some point Linux started explicitly zeroing the mask register on gmbus reset.  Please try the following change.  The resume code might need the same patching of GMBUS4, but we will see.
Comment 13 Konstantin Belousov freebsd_committer freebsd_triage 2014-09-15 18:40:39 UTC
Created attachment 147359 [details]
Reset gmbus interrupt mask register explicitely
Comment 14 Jan Kokemüller 2014-09-15 19:45:54 UTC
(In reply to Konstantin Belousov from comment #12)
> Could you please describe how the discussion and fix from Linux commit
> c12aba5aa0e60b7 are related to your patch?

It is not related, except that there seems to be some kind of hardware bug. Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.
Comment 15 Konstantin Belousov freebsd_committer freebsd_triage 2014-09-15 20:25:55 UTC
(In reply to Jan Kokemüller from comment #14)
> It is not related, except that there seems to be some kind of hardware bug.
> Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.

So can you describe what the problem is and why your patch helps?

To see what kind of interrupt causes the storm, compile the kernel with KTR_DRM and enable ktr(4) tracing for it.
Comment 16 Jan Kokemüller 2014-09-15 21:05:21 UTC
Created attachment 147363 [details]
ktrdump while loading i915kms (debug.ktr.mask=4)

(In reply to Konstantin Belousov from comment #15)

> So can you describe what the problem is and why your patch helps?

My best guess is that it's some kind of initialization problem regarding the interrupts, because the behavior stops after a suspend/resume cycle. The problem started with (or was uncovered by) the opregion code changes. If I stub out intel_enable_asle the problem goes away. The patch makes it so that intel_enable_asle gets first called only after I disable/enable the interrupts of the card. Then there is no interrupt storm anymore. I'll try to debug this further.

> To see what kind of interrupt causes the storm, compile the kernel with
> KTR_DRM and enable ktr(4) tracing for it.

I compiled the kernel with KTR, and enabled KTR_DRM by setting "sysctl debug.ktr.mask=4". Then I loaded an unpatched i915kms module. I've attached the log (interesting events are from 34 to 48). It seems only the interrupts on irq260 are accounted for. Any way I can trace the interrupts on irq16?
Comment 17 Konstantin Belousov freebsd_committer freebsd_triage 2014-09-16 14:12:59 UTC
Created attachment 147377 [details]
Do not call intel_opregion_enable_asle() from the i915_driver_irq_postinstall

Try this, please.  It is somewhat closer to the current Linux code, by not setting asle->ardy in irq_postinstall hook.  It is different from Linux code which still enables pipestat interrupts in postinstall, but we will see.

Your KTR indicates that there was PIPE_B_EVENT active when the interrupts were actually enabled, which might correlate with the storm.  The int16 is the spurious interrupt vector for GM45, AFAIR, and MSI was not yet fully set up when the ardy is set.  So it is possible that the chipset interpreted gfx interrupt request as spurious.
Comment 18 Jan Kokemüller 2014-09-16 14:31:29 UTC
(In reply to Konstantin Belousov from comment #17)
> Try this, please.  It is somewhat closer to the current Linux code, by not
> setting asle->ardy in irq_postinstall hook.  It is different from Linux code
> which still enables pipestat interrupts in postinstall, but we will see.

Still no luck with this patch. The KTR dump indicates one less "driver_irq_handler 10" as expected, though:
    48 driver_irq_handler 10
    47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
    46 driver_irq_handler 10
    45 driver_irq_handler 10
    44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
    43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
    42 object_clflush 0xfffff8001fad8200
    41 object_bind 0xfffff8001fad8200 42000 408000 1
    40 object_bind 0xfffff8001fad8600 22000 20000 1
    39 object_bind 0xfffff8001fad8800 21000 1000 1
    38 object_bind 0xfffff8001fad8a00 1000 20000 1
    37 object_bind 0xfffff8001fad8c00 0 1000 1
    36 i915_disable_vblank 1
    35 i915_disable_vblank 0
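
Comparing dumps like the one above by eye is error-prone, so the event counts can be tallied mechanically. The sketch below assumes lines of the form "<index> <event> <args...>" as in this excerpt (other ktrdump options add columns, which it does not handle); count_ktr_events is a hypothetical helper name.

```python
from collections import Counter


def count_ktr_events(dump):
    """Count events by name in lines of the form '<index> <event> <args...>'."""
    counts = Counter()
    for line in dump.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with an event index.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        counts[fields[1]] += 1
    return counts


# The excerpt from this comment, copied verbatim.
EXCERPT = """\
    48 driver_irq_handler 10
    47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
    46 driver_irq_handler 10
    45 driver_irq_handler 10
    44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
    43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
    42 object_clflush 0xfffff8001fad8200
    41 object_bind 0xfffff8001fad8200 42000 408000 1
    40 object_bind 0xfffff8001fad8600 22000 20000 1
    39 object_bind 0xfffff8001fad8800 21000 1000 1
    38 object_bind 0xfffff8001fad8a00 1000 20000 1
    37 object_bind 0xfffff8001fad8c00 0 1000 1
    36 i915_disable_vblank 1
    35 i915_disable_vblank 0
"""

for event, count in count_ktr_events(EXCERPT).most_common():
    print(f"{count:3d} {event}")
```

Running the same function over the dump from the unpatched module would make the "one less driver_irq_handler" difference directly visible.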
Comment 19 Juha Nygård 2015-03-27 10:06:58 UTC
I also have this problem on r280615 running on an HP EliteBook 2530p.
Comment 20 Adrian Chadd freebsd_committer freebsd_triage 2015-04-09 05:52:33 UTC
I'm also having this problem on:


vgapci0@pci0:0:2:0:     class=0x030000 card=0x20e417aa chip=0x2a428086 rev=0x07 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Mobile 4 Series Chipset Integrated Graphics Controller'
    class      = display
    subclass   = VGA

info: [drm] Initialized drm 1.1.0 20060810
drmn0: <Mobile Intel\M-B\M-. GM45 Express Chipset> on vgapci0
Comment 21 Adrian Chadd freebsd_committer freebsd_triage 2015-04-09 06:15:23 UTC
(In reply to Jan Kokemüller from comment #11)

This also works for me, on my GM45 mobile chipset.
Comment 22 Adrian Chadd freebsd_committer freebsd_triage 2015-04-09 06:31:41 UTC
... just to be clear:

* kib's opregion + zeroing GMBUS4 patch didn't work
* jan's patch in comment #11 works
Comment 23 wolfgang 2015-05-09 00:10:39 UTC
Same problem happens on a Lenovo T500 with a recent 10.1-STABLE. That machine also uses Intel GM45 graphics.
Apparently I misunderstood how to disable MSI for drm: adding hw.drm.msi=0 to /boot/loader.conf just results in an error message about an unknown oid. Running kldload i915kms after that still results in 1 MSI being assigned and interrupts coming in on irq16.
Comment 24 wolfgang 2015-05-09 20:12:06 UTC
(In reply to wolfgang from comment #23)
The patch from comment #11 applied cleanly to 10.1-STABLE r282622 and fixes the problem on the T500.
Comment 25 Max Brazhnikov freebsd_committer freebsd_triage 2015-06-07 21:15:59 UTC
With the patch from comment #11 I still see a high interrupt rate on the first (cold) boot; however, it usually returns to normal after a subsequent reboot.
The patch from bug 156596 (committed in r284012) has solved the problem completely.
Comment 26 wolfgang 2015-11-28 13:02:24 UTC
I have now removed the patch from comment #11 from my source tree; I no longer see the problem with 10.2-STABLE r291408.
Comment 27 Alexey Dokuchaev freebsd_committer freebsd_triage 2020-09-28 18:32:56 UTC
Can this PR now be closed, or are there still any unresolved issues?
Comment 28 Jan Kokemüller 2020-09-28 18:44:00 UTC
This can be closed for sure. It's been a long time, but I think this was fixed in base. Most of the relevant DRM bits now live out of base, anyway.