193500 – Interrupt storm after loading i915kms module on Gen4 Intel GPU

Summary: Interrupt storm after loading i915kms module on Gen4 Intel GPU

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:	i915

Depends on:
Blocks:

Reported:	2014-09-09 17:30 UTC by Jan Kokemüller
Modified:	2020-09-28 18:45 UTC
CC List:	8 users

See Also:	190186

Attachments
contents of /var/run/dmesg.boot (8.87 KB, text/plain) 2014-09-09 17:30 UTC, Jan Kokemüller	no flags
output of "pciconf -lvc" (11.46 KB, text/plain) 2014-09-09 17:31 UTC, Jan Kokemüller	no flags
output of "vmstat -i" for hw.drm.msi=0/1 (2.17 KB, text/plain) 2014-09-12 18:21 UTC, Jan Kokemüller	no flags
Fix/workaround for interrupt storm on GM45 when loading i915kms (1.43 KB, patch) 2014-09-15 17:31 UTC, Jan Kokemüller	no flags
Reset gmbus interrupt mask register explicitely (514 bytes, patch) 2014-09-15 18:40 UTC, Konstantin Belousov	no flags
ktrdump while loading i915kms (debug.ktr.mask=4) (30.75 KB, text/plain) 2014-09-15 21:05 UTC, Jan Kokemüller	no flags
Do not call intel_opregion_enable_asle() from the i915_driver_irq_postinstall (865 bytes, patch) 2014-09-16 14:12 UTC, Konstantin Belousov	no flags

Description Jan Kokemüller 2014-09-09 17:30:02 UTC

When I test the new i915 code (r270990 snapshot iso or backported to 10-stable, doesn't matter) on my gen4 Intel GPU (GMA 4500MHD), I'm getting reproducible interrupt storms from "irq16: uhci0" after loading the i915kms module. It does not matter if I start X or not. This looks very similar to the situation here:

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-June/045369.html

Setting hw.drm.msi=0 in /boot/loader.conf doesn't help. One suspend/resume cycle makes the problem go away. When I revert commit r270516 (the opregion changes) the problem goes away, too. Then there is no irq16 at all in the output of "vmstat -i".

I've attached the output of "pciconf -lvc" and the contents of /var/run/dmesg.boot.

Comment 1 Jan Kokemüller 2014-09-09 17:30:50 UTC

Created attachment 147134
contents of /var/run/dmesg.boot

Comment 2 Jan Kokemüller 2014-09-09 17:31:27 UTC

Created attachment 147135
output of "pciconf -lvc"

Comment 3 Ed Maste (FreeBSD committer) 2014-09-12 16:56:33 UTC

Can you add the output of vmstat -i taken while the storm is happening, both with and without hw.drm.msi=0 set?

Comment 4 Jan Kokemüller 2014-09-12 18:21:21 UTC

Created attachment 147255
output of "vmstat -i" for hw.drm.msi=0/1

With the opregion patch I get the "irq16: uhci0" line in the output of "vmstat -i". Setting "hw.drm.msi" to 0 or 1 makes no difference. In both cases I get about 220000 interrupts per second.

Comment 5 Ed Maste (FreeBSD committer) 2014-09-12 20:36:55 UTC

Thanks.

It's odd to me that in both the hw.drm.msi=0 and hw.drm.msi=1 cases exactly 1 MSI is delivered.  Could you try disabling MSI altogether, perhaps via hw.pci.enable_msi=0 in the loader?

Comment 6 Jan Kokemüller 2014-09-12 20:58:02 UTC

OK, then I get:

interrupt                          total       rate
irq1: atkbd0                         900          3
irq9: acpi0                         3605         13
irq12: psm0                         6468         24
irq14: ata0                         9412         36
irq15: ata1                          197          0
irq16: uhci0++                     10084         38
irq19: iwn0 uhci2++                 6400         24
irq20: hpet0                       43476        167
irq22: hdac0                          95          0
irq23: uhci3 ehci1                    84          0
irq256: re0                         1436          5
Total                              82157        317


…and CPU usage is back to normal.
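
For comparing runs like the ones in this thread, a small parser for `vmstat -i` output can flag lines whose rate looks suspicious. This is only a sketch: the function names and the 1000/s threshold are assumptions chosen for illustration, not part of any FreeBSD tool. The sample is the table from this comment.

```python
# Sample copied from the vmstat -i output in comment 6 (hw.pci.enable_msi=0).
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                         900          3
irq9: acpi0                         3605         13
irq12: psm0                         6468         24
irq14: ata0                         9412         36
irq15: ata1                          197          0
irq16: uhci0++                     10084         38
irq19: iwn0 uhci2++                 6400         24
irq20: hpet0                       43476        167
irq22: hdac0                          95          0
irq23: uhci3 ehci1                    84          0
irq256: re0                         1436          5
Total                              82157        317
"""


def parse_vmstat_i(text):
    """Parse `vmstat -i` output into {source: (total, rate)}."""
    rows = {}
    for line in text.splitlines():
        # The source name may contain spaces, so split from the right.
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, rate = parts
        if not (total.isdigit() and rate.isdigit()):
            continue  # skips the "interrupt total rate" header line
        rows[name.strip()] = (int(total), int(rate))
    return rows


def storm_suspects(rows, threshold=1000):
    """Return sources whose rate is at or above threshold (hypothetical cutoff)."""
    return sorted(
        name
        for name, (_total, rate) in rows.items()
        if name != "Total" and rate >= threshold
    )


rows = parse_vmstat_i(SAMPLE)
print(rows["irq16: uhci0++"])  # (10084, 38)
print(storm_suspects(rows))    # []
```

With MSI disabled globally, no source crosses the cutoff, which matches the observation that CPU usage is back to normal.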

Comment 7 Ed Maste (FreeBSD committer) 2014-09-14 14:29:56 UTC

It's strange that globally disabling MSI works, but disabling it just for drm does not.  hw.drm.msi is also available as a read-only sysctl after boot - can you confirm that in the "hw.drm.msi=0" case it was actually set to 0?

Comment 8 Jan Kokemüller 2014-09-14 15:43:38 UTC

Indeed, it wasn't set properly. I could have sworn I checked that, but apparently not… I made an error when backporting the sysctl changes. I now get about 60 interrupts per second when running X on irq16 (possibly vblank interrupts for the compositor I'm running):

interrupt                          total       rate
irq1: atkbd0                        3988          3
irq9: acpi0                        19361         15
irq12: psm0                       123546        101
irq14: ata0                        35149         28
irq15: ata1                          833          0
irq16: uhci0+                      79654         65
irq20: hpet0                      263768        216
irq23: uhci3 ehci1                    84          0
irq256: hdac0                         95          0
irq258: iwn0                       67988         55
irq259: re0                           35          0
Total                             594501        488

Comment 9 Jan Kokemüller 2014-09-14 15:54:23 UTC

So to clarify, setting "hw.drm.msi=0" is fine as a workaround. When hw.drm.msi is set to 1, I have to do one suspend/resume cycle for the interrupts on irq16 to stop. Then the GPU uses irq260.

hw.drm.msi=1 (after suspend/resume):
interrupt                          total       rate
irq1: atkbd0                        2085          2
irq9: acpi0                        15696         18
irq12: psm0                        85609        100
irq14: ata0                        21899         25
irq15: ata1                          594          0
irq16: uhci0                     3080356       3619
irq20: hpet0                      171708        201
irq23: uhci3 ehci1                   168          0
irq256: hdac0                        113          0
irq258: iwn0                       42571         50
irq259: re0                           24          0
irq260: vgapci0                    33092         38
Total                            3453915       4058

Comment 10 Jan Kokemüller 2014-09-14 16:19:36 UTC

Slight correction: The GPU uses irq260 before *and* after the suspend/resume cycle if hw.drm.msi=1. So the interrupt storm on irq16 is in addition to the "correct" behavior.
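
When reading the table in comment 9, it helps to remember that the rate column of `vmstat -i` is an average since boot (total divided by uptime), so a storm that has already stopped still inflates it. The other rows of that table are consistent with this: irq20 gives 171708 / 201, roughly the same uptime as irq16. A minimal sketch of the arithmetic follows; the function names are made up for illustration, and the two-sample figures in the last call are hypothetical, not measurements from this report.

```python
def uptime_from_average(total, avg_rate):
    """Seconds of uptime implied by a total and its since-boot average rate."""
    if avg_rate <= 0:
        raise ValueError("average rate must be positive")
    return total / avg_rate


def interval_rate(total_before, total_after, seconds):
    """Interrupts per second between two samples taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (total_after - total_before) / seconds


# irq16 row from comment 9: total 3080356, rate 3619.
print(round(uptime_from_average(3080356, 3619)))  # 851

# Hypothetical pair of samples taken 10 seconds apart.
print(interval_rate(1000, 1600, 10.0))  # 60.0
```

Taking two `vmstat -i` samples a known interval apart and feeding the totals to `interval_rate` shows whether interrupts are still arriving on irq16, independent of what happened earlier in the boot.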

Comment 11 Jan Kokemüller 2014-09-15 17:31:50 UTC

Created attachment 147358
Fix/workaround for interrupt storm on GM45 when loading i915kms

This seems to be a known hardware quirk with the GM45 GPU where the gmbus can generate both MSI and non-MSI interrupts:
http://lists.freedesktop.org/archives/dri-devel/2013-March/036295.html

I've managed to stop the interrupts with the attached patch. The first part disables and re-enables the card's interrupts before calling intel_opregion_init. The second part just makes sure that intel_opregion_enable_asle, called from the irq_postinstall handler in drm_irq_install, has no effect, as it is called again in intel_opregion_init.

I've tested this on 10-stable with the drm stuff backported and will now check if it also works with a recent snapshot of -current.

Comment 12 Konstantin Belousov (FreeBSD committer) 2014-09-15 18:39:46 UTC

(In reply to jan.kokemueller from comment #11)
> Created attachment 147358
> Fix/workaround for interrupt storm on GM45 when loading i915kms

Could you please describe how the discussion and fix from Linux commit c12aba5aa0e60b7 is related to your patch?

Also note that our i915 gmbus code does not use interrupts; the iic is polled.  So even if c12... is somewhat related, it is probably not relevant as is.

I can only guess that some BIOSes leave the gmbus interrupt mask register in a non-zero state, for whatever reason.  Also, at some point Linux started explicitly zeroing the mask register on gmbus reset.  Please try the following change.  It might be that the resume code needs the same patching of GMBUS4, but we will see.

Comment 13 Konstantin Belousov (FreeBSD committer) 2014-09-15 18:40:39 UTC

Created attachment 147359
Reset gmbus interrupt mask register explicitely

Comment 14 Jan Kokemüller 2014-09-15 19:45:54 UTC

(In reply to Konstantin Belousov from comment #12)
> Could you please describe how the discussion and fix from Linux commit
> c12aba5aa0e60b7 is related to your patch?

It is not related, except that there seems to be some kind of hardware bug. Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.

Comment 15 Konstantin Belousov (FreeBSD committer) 2014-09-15 20:25:55 UTC

(In reply to Jan Kokemüller from comment #14)
> It is not related, except that there seems to be some kind of hardware bug.
> Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.

So can you describe what the problem is and why your patch helps?

To see what kind of interrupt causes the storm, compile kernel with KTR_DRM and enable ktr(4) tracing for it.

Comment 16 Jan Kokemüller 2014-09-15 21:05:21 UTC

Created attachment 147363
ktrdump while loading i915kms (debug.ktr.mask=4)

(In reply to Konstantin Belousov from comment #15)

> So can you describe what the problem is and why your patch helps?

My best guess is that it's some kind of initialization problem regarding the interrupts, because the behavior stops after a suspend/resume cycle. The problem started with (or was uncovered by) the opregion code changes. If I stub out intel_enable_asle the problem goes away. The patch makes it so that intel_enable_asle gets first called only after I disable/enable the interrupts of the card. Then there is no interrupt storm anymore. I'll try to debug this further.

> To see what kind of interrupt causes the storm, compile kernel with KTR_DRM
> and enable ktr(4) tracing for it.

I compiled the kernel with KTR, and enabled KTR_DRM by setting "sysctl debug.ktr.mask=4". Then I loaded an unpatched i915kms module. I've attached the log (interesting events are from 34 to 48). It seems only the interrupts on irq260 are accounted for. Any way I can trace the interrupts on irq16?

Comment 17 Konstantin Belousov (FreeBSD committer) 2014-09-16 14:12:59 UTC

Created attachment 147377
Do not call intel_opregion_enable_asle() from the i915_driver_irq_postinstall

Try this, please.  It is somewhat closer to the current Linux code, by not setting asle->ardy in irq_postinstall hook.  It is different from Linux code which still enables pipestat interrupts in postinstall, but we will see.

Your KTR indicates that there was PIPE_B_EVENT active when the interrupts were actually enabled, which might correlate with the storm.  The int16 is the spurious interrupt vector for GM45, AFAIR, and MSI was not yet fully set up when the ardy is set.  So it is possible that the chipset interpreted gfx interrupt request as spurious.

Comment 18 Jan Kokemüller 2014-09-16 14:31:29 UTC

(In reply to Konstantin Belousov from comment #17)
> Try this, please.  It is somewhat closer to the current Linux code, by not
> setting asle->ardy in irq_postinstall hook.  It is different from Linux code
> which still enables pipestat interrupts in postinstall, but we will see.

Still no luck with this patch. The KTR dump shows one fewer "driver_irq_handler 10" event, as expected, though:
    48 driver_irq_handler 10
    47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
    46 driver_irq_handler 10
    45 driver_irq_handler 10
    44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
    43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
    42 object_clflush 0xfffff8001fad8200
    41 object_bind 0xfffff8001fad8200 42000 408000 1
    40 object_bind 0xfffff8001fad8600 22000 20000 1
    39 object_bind 0xfffff8001fad8800 21000 1000 1
    38 object_bind 0xfffff8001fad8a00 1000 20000 1
    37 object_bind 0xfffff8001fad8c00 0 1000 1
    36 i915_disable_vblank 1
    35 i915_disable_vblank 0
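
To compare dumps from patched and unpatched modules, the events can be tallied by name. The sketch below assumes each line has the shape shown in the excerpt above (an index followed by the event name and its arguments); real `ktrdump` output may carry extra columns depending on the flags used, in which case the field positions would need adjusting. The function name is hypothetical.

```python
from collections import Counter

# Excerpt copied from the KTR dump in this comment.
KTR_EXCERPT = """\
    48 driver_irq_handler 10
    47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
    46 driver_irq_handler 10
    45 driver_irq_handler 10
    44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
    43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
    42 object_clflush 0xfffff8001fad8200
    41 object_bind 0xfffff8001fad8200 42000 408000 1
    40 object_bind 0xfffff8001fad8600 22000 20000 1
    39 object_bind 0xfffff8001fad8800 21000 1000 1
    38 object_bind 0xfffff8001fad8a00 1000 20000 1
    37 object_bind 0xfffff8001fad8c00 0 1000 1
    36 i915_disable_vblank 1
    35 i915_disable_vblank 0
"""


def count_ktr_events(dump_text):
    """Count trace entries by event name (second whitespace-separated field)."""
    counts = Counter()
    for line in dump_text.splitlines():
        fields = line.split()
        # Only count lines that start with a numeric entry index.
        if len(fields) >= 2 and fields[0].isdigit():
            counts[fields[1]] += 1
    return counts


counts = count_ktr_events(KTR_EXCERPT)
print(counts["driver_irq_handler"])  # 3
print(counts["object_bind"])         # 5
```

Running the same tally over the dump from the unpatched module makes the difference in `driver_irq_handler` entries easy to confirm.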

Comment 19 Juha Nygård 2015-03-27 10:06:58 UTC

I also have this problem on r280615 running on an HP EliteBook 2530p.

Comment 20 Adrian Chadd (FreeBSD committer) 2015-04-09 05:52:33 UTC

I'm also having this problem on:


vgapci0@pci0:0:2:0:     class=0x030000 card=0x20e417aa chip=0x2a428086 rev=0x07 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Mobile 4 Series Chipset Integrated Graphics Controller'
    class      = display
    subclass   = VGA

info: [drm] Initialized drm 1.1.0 20060810
drmn0: <Mobile Intel\M-B\M-. GM45 Express Chipset> on vgapci0

Comment 21 Adrian Chadd (FreeBSD committer) 2015-04-09 06:15:23 UTC

(In reply to Jan Kokemüller from comment #11)

This also works for me, on my GM45 mobile chipset.

Comment 22 Adrian Chadd (FreeBSD committer) 2015-04-09 06:31:41 UTC

... just to be clear:

* kib's opregion + zeroing GMBUS4 patch didn't work
* jan's patch in comment #11 works

Comment 23 wolfgang 2015-05-09 00:10:39 UTC

Same problem happens on a Lenovo T500 with a recent 10.1-STABLE. That machine also uses Intel GM45 graphics.
Apparently I misunderstood how to disable MSI for drm: adding hw.drm.msi=0 to /boot/loader.conf just results in an error message about an unknown oid, and kldload i915kms after that still results in 1 MSI being assigned and interrupts coming in on irq16.

Comment 24 wolfgang 2015-05-09 20:12:06 UTC

(In reply to wolfgang from comment #23)
The patch from #11 applied cleanly to 10.1-STABLE r282622 and fixes the problem on the T500.

Comment 25 Max Brazhnikov (FreeBSD committer) 2015-06-07 21:15:59 UTC

With the patch from comment #11 I still see a high interrupt rate on the first (cold) boot; however, it usually returns to normal after a subsequent reboot.
The patch from bug 156596 (committed in r284012) has solved the problem completely.

Comment 26 wolfgang 2015-11-28 13:02:24 UTC

I have now removed the patch from #11 from my source tree; I no longer see the problem with 10.2-STABLE r291408.

Comment 27 Alexey Dokuchaev (FreeBSD committer) 2020-09-28 18:32:56 UTC

Can this PR now be closed, or are there still any unresolved issues?

Comment 28 Jan Kokemüller 2020-09-28 18:44:00 UTC

This can be closed for sure. It's been a long time, but I think this was fixed in base. Most of the relevant DRM bits now live out of base, anyway.