|Summary:||Interrupt storm after loading i915kms module on Gen4 Intel GPU|
|Product:||Base System||Reporter:||Jan Kokemüller <jan.kokemueller>|
|Component:||kern||Assignee:||freebsd-bugs mailing list <bugs>|
|Severity:||Affects Only Me||CC:||adrian, emaste, henry.hu.sh, kib, makc, moertael, wolfgang|
Description Jan Kokemüller 2014-09-09 17:30:02 UTC
When I test the new i915 code (r270990 snapshot ISO or backported to 10-stable, it doesn't matter) on my Gen4 Intel GPU (GMA 4500MHD), I get reproducible interrupt storms from "irq16: uhci0" after loading the i915kms module. It does not matter whether I start X or not. This looks very similar to the situation here:
https://lists.freebsd.org/pipermail/freebsd-hackers/2014-June/045369.html

Setting hw.drm.msi=0 in /boot/loader.conf doesn't help. One suspend/resume cycle makes the problem go away.

When I revert commit r270516 (the opregion changes), the problem goes away, too. Then there is no irq16 at all in the output of "vmstat -i".

I've attached the output of "pciconf -lvc" and the contents of /var/run/dmesg.boot.
Comment 1 Jan Kokemüller 2014-09-09 17:30:50 UTC
Created attachment 147134 [details] contents of /var/run/dmesg.boot
Comment 2 Jan Kokemüller 2014-09-09 17:31:27 UTC
Created attachment 147135 [details] output of "pciconf -lvc"
Comment 3 Ed Maste 2014-09-12 16:56:33 UTC
Can you add a vmstat -i taken while the storm is happening, with and without hw.drm.msi=0 set?
Comment 4 Jan Kokemüller 2014-09-12 18:21:21 UTC
Created attachment 147255 [details]
output of "vmstat -i" for hw.drm.msi=0/1

With the opregion patch I get the "irq16: uhci0" line in the output of "vmstat -i". Setting "hw.drm.msi" to 0 or 1 makes no difference. In both cases I get about 220000 interrupts per second.
Comment 5 Ed Maste 2014-09-12 20:36:55 UTC
Thanks. It's odd to me that in both the hw.drm.msi=0 and hw.drm.msi=1 cases exactly 1 MSI is delivered. Could you try disabling msi altogether perhaps, via hw.pci.enable_msi=0 in the loader?
Comment 6 Jan Kokemüller 2014-09-12 20:58:02 UTC
OK, then I get:

    interrupt                    total     rate
    irq1: atkbd0                   900        3
    irq9: acpi0                   3605       13
    irq12: psm0                   6468       24
    irq14: ata0                   9412       36
    irq15: ata1                    197        0
    irq16: uhci0++               10084       38
    irq19: iwn0 uhci2++           6400       24
    irq20: hpet0                 43476      167
    irq22: hdac0                    95        0
    irq23: uhci3 ehci1              84        0
    irq256: re0                   1436        5
    Total                        82157      317

…and CPU usage is back to normal.
Comment 7 Ed Maste 2014-09-14 14:29:56 UTC
It's strange that globally disabling MSI works, but disabling it just for drm does not. hw.drm.msi is also available as a read-only sysctl after boot - can you confirm in the "hw.drm.msi=0" case that it managed to be set to 0?
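The check requested here can be scripted. The following is a minimal sketch, not part of the original report: it assumes the usual name="value" form of /boot/loader.conf, and the helper names parse_loader_conf and find_mismatches are hypothetical. The live values would come from "sysctl -n <name>" on the running system; here they are passed in as a dictionary so the logic can be read in isolation.

```python
def parse_loader_conf(text):
    """Return {tunable: value} for name="value" lines, ignoring comments.

    Simplification: everything after a '#' is treated as a comment, so
    values that themselves contain '#' are not handled.
    """
    tunables = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        tunables[name.strip()] = value.strip().strip('"')
    return tunables


def find_mismatches(wanted, live):
    """Map each tunable whose live value is missing or different
    to a (wanted, live) pair; live is None for an unknown OID."""
    return {
        name: (value, live.get(name))
        for name, value in wanted.items()
        if live.get(name) != value
    }


# Illustrative input only: the loader.conf setting discussed in this
# report, and a made-up live value standing in for the output of
# "sysctl -n hw.drm.msi".
LOADER_CONF = 'hw.drm.msi="0"\n'
LIVE = {"hw.drm.msi": "1"}

if __name__ == "__main__":
    wanted = parse_loader_conf(LOADER_CONF)
    for name, (want, got) in find_mismatches(wanted, LIVE).items():
        print(f"{name}: loader.conf wants {want}, running kernel has {got}")
```

A tunable that the running kernel does not know at all is reported with a live value of None, which covers the "unknown oid" case as well as a value that silently failed to apply.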
Comment 8 Jan Kokemüller 2014-09-14 15:43:38 UTC
Indeed, it wasn't set properly. I could have sworn I checked that, but apparently not… I made an error when backporting the sysctl changes.

I now get about 60 interrupts per second on irq16 when running X (possibly vblank interrupts for the compositor I'm running):

    interrupt                    total     rate
    irq1: atkbd0                  3988        3
    irq9: acpi0                  19361       15
    irq12: psm0                 123546      101
    irq14: ata0                  35149       28
    irq15: ata1                    833        0
    irq16: uhci0+                79654       65
    irq20: hpet0                263768      216
    irq23: uhci3 ehci1              84        0
    irq256: hdac0                   95        0
    irq258: iwn0                 67988       55
    irq259: re0                     35        0
    Total                       594501      488
Comment 9 Jan Kokemüller 2014-09-14 15:54:23 UTC
So to clarify, setting "hw.drm.msi=0" is fine as a workaround.

When hw.drm.msi is set to 1, I have to do one suspend/resume cycle for the interrupts on irq16 to stop. Then the GPU uses irq260.

hw.drm.msi=1 (after suspend/resume):

    interrupt                    total     rate
    irq1: atkbd0                  2085        2
    irq9: acpi0                  15696       18
    irq12: psm0                  85609      100
    irq14: ata0                  21899       25
    irq15: ata1                    594        0
    irq16: uhci0               3080356     3619
    irq20: hpet0                171708      201
    irq23: uhci3 ehci1             168        0
    irq256: hdac0                  113        0
    irq258: iwn0                 42571       50
    irq259: re0                     24        0
    irq260: vgapci0              33092       38
    Total                      3453915     4058
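As a rough aid for reading tables like the one above, the sketch below parses "vmstat -i" output and lists the sources whose rate is at or above a threshold. It is an illustration, not part of the original report: the function names and the 1000/s threshold are assumptions. Note that the rate column of "vmstat -i" is an average since boot, so a storm that has already stopped can still show an elevated rate.

```python
def parse_vmstat_i(text):
    """Return {source: (total, rate)} from "vmstat -i" output.

    The source name may contain spaces ("irq23: uhci3 ehci1"), so the
    two numeric columns are split off from the right.
    """
    rows = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, rate = parts
        if not (total.isdigit() and rate.isdigit()):
            continue  # header line
        if name.strip() == "Total":
            continue  # summary line
        rows[name.strip()] = (int(total), int(rate))
    return rows


def noisy_sources(rows, threshold=1000):
    """Names of sources whose average rate is at or above threshold.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return [name for name, (_total, rate) in rows.items() if rate >= threshold]


# The hw.drm.msi=1 table from this comment.
SAMPLE = """\
interrupt total rate
irq1: atkbd0 2085 2
irq9: acpi0 15696 18
irq12: psm0 85609 100
irq14: ata0 21899 25
irq15: ata1 594 0
irq16: uhci0 3080356 3619
irq20: hpet0 171708 201
irq23: uhci3 ehci1 168 0
irq256: hdac0 113 0
irq258: iwn0 42571 50
irq259: re0 24 0
irq260: vgapci0 33092 38
Total 3453915 4058
"""

if __name__ == "__main__":
    for name in noisy_sources(parse_vmstat_i(SAMPLE)):
        print(name)
```

To measure the current rate rather than the since-boot average, one could take two snapshots a few seconds apart and divide the difference in the total column by the elapsed time.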
Comment 10 Jan Kokemüller 2014-09-14 16:19:36 UTC
Slight correction: The GPU uses irq260 before *and* after the suspend/resume cycle if hw.drm.msi=1. So the interrupt storm on irq16 is in addition to the "correct" behavior.
Comment 11 Jan Kokemüller 2014-09-15 17:31:50 UTC
Created attachment 147358 [details]
Fix/workaround for interrupt storm on GM45 when loading i915kms

This seems to be a known hardware quirk with the GM45 GPU where the gmbus can generate both MSI and non-MSI interrupts:
http://lists.freedesktop.org/archives/dri-devel/2013-March/036295.html

I've managed to stop the interrupts with the attached patch. The first part restarts the IRQ stuff before calling intel_opregion_init. The second part just makes sure that intel_opregion_enable_asle called from the irq_postinstall handler in drm_irq_install has no effect, as this is called again in intel_opregion_init.

I've tested this on 10-stable with the drm stuff backported and will now check if it also works with a recent snapshot of -current.
Comment 12 Konstantin Belousov 2014-09-15 18:39:46 UTC
(In reply to jan.kokemueller from comment #11)
> Created attachment 147358 [details]
> Fix/workaround for interrupt storm on GM45 when loading i915kms

Could you please describe how the discussion and fix from Linux commit c12aba5aa0e60b7 are related to your patch?

Also note that our i915 gmbus code does not use interrupts; the iic is polled. So even if c12... is somewhat related, it is probably not relevant as is. I can only guess that some BIOSes leave the gmbus interrupt mask register in a non-zero state, for whatever reason. Also, some time ago Linux started explicitly zeroing the mask register on gmbus reset.

Please try the following change. It might be that the resume code needs the same patching of GMBUS4, but we will see.
Comment 13 Konstantin Belousov 2014-09-15 18:40:39 UTC
Created attachment 147359 [details] Reset gmbus interrupt mask register explicitely
Comment 14 Jan Kokemüller 2014-09-15 19:45:54 UTC
(In reply to Konstantin Belousov from comment #12)
> Could you, please, describe how the discussion and fix from Linux commit
> c12aba5aa0e60b7 is related to your patch ?

It is not related, except that there seems to be some kind of hardware bug.

Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.
Comment 15 Konstantin Belousov 2014-09-15 20:25:55 UTC
(In reply to Jan Kokemüller from comment #14)
> It is not related, except that there seems to be some kind of hardware bug.
> Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.

So can you describe what the problem is and why your patch helps?

To see what kind of interrupt causes the storm, compile the kernel with KTR_DRM and enable ktr(4) tracing for it.
Comment 16 Jan Kokemüller 2014-09-15 21:05:21 UTC
Created attachment 147363 [details]
ktrdump while loading i915kms (debug.ktr.mask=4)

(In reply to Konstantin Belousov from comment #15)
> So can you describe what the problem is and why your patch help ?

My best guess is that it's some kind of initialization problem regarding the interrupts, because the behavior stops after a suspend/resume cycle. The problem started with (or was uncovered by) the opregion code changes. If I stub out intel_enable_asle the problem goes away. The patch makes it so that intel_enable_asle first gets called only after I disable/enable the interrupts of the card. Then there is no interrupt storm anymore. I'll try to debug this further.

> To see what kind of interrupt causes the storm, compile kernel with KTR_DRM
> and enable ktr(4) tracing for it.

I compiled the kernel with KTR, and enabled KTR_DRM by setting "sysctl debug.ktr.mask=4". Then I loaded an unpatched i915kms module. I've attached the log (interesting events are from 34 to 48). It seems only the interrupts on irq260 are accounted for. Is there any way I can trace the interrupts on irq16?
Comment 17 Konstantin Belousov 2014-09-16 14:12:59 UTC
Created attachment 147377 [details]
Do not call intel_opregion_enable_asle() from the i915_driver_irq_postinstall

Try this, please. It is somewhat closer to the current Linux code, by not setting asle->ardy in the irq_postinstall hook. It is different from the Linux code, which still enables pipestat interrupts in postinstall, but we will see.

Your KTR indicates that there was a PIPE_B_EVENT active when the interrupts were actually enabled, which might correlate with the storm. The int16 is the spurious interrupt vector for GM45, AFAIR, and MSI was not yet fully set up when the ardy is set. So it is possible that the chipset interpreted the gfx interrupt request as spurious.
Comment 18 Jan Kokemüller 2014-09-16 14:31:29 UTC
(In reply to Konstantin Belousov from comment #17)
> Try this, please. It is somewhat closer to the current Linux code, by not
> setting asle->ardy in irq_postinstall hook. It is different from Linux code
> which still enables pipestat interrupts in postinstall, but we will see.

Still no luck with this patch. The KTR dump indicates one less "driver_irq_handler 10" as expected, though:

    48 driver_irq_handler 10
    47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
    46 driver_irq_handler 10
    45 driver_irq_handler 10
    44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
    43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
    42 object_clflush 0xfffff8001fad8200
    41 object_bind 0xfffff8001fad8200 42000 408000 1
    40 object_bind 0xfffff8001fad8600 22000 20000 1
    39 object_bind 0xfffff8001fad8800 21000 1000 1
    38 object_bind 0xfffff8001fad8a00 1000 20000 1
    37 object_bind 0xfffff8001fad8c00 0 1000 1
    36 i915_disable_vblank 1
    35 i915_disable_vblank 0
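Counting events by hand in a dump like this is error-prone, so a small tally can help when comparing dumps taken before and after a patch. The sketch below is illustrative only: it assumes each line has the form "index event args..." as in the excerpt above (real ktrdump output may carry extra columns depending on the flags used), and count_events is a hypothetical helper.

```python
from collections import Counter


def count_events(dump):
    """Tally KTR events by name, assuming "index event args..." lines."""
    counts = Counter()
    for line in dump.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with an index.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        counts[fields[1]] += 1
    return counts


# The excerpt quoted in this comment (events 35 to 48).
EXCERPT = """\
48 driver_irq_handler 10
47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
46 driver_irq_handler 10
45 driver_irq_handler 10
44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
42 object_clflush 0xfffff8001fad8200
41 object_bind 0xfffff8001fad8200 42000 408000 1
40 object_bind 0xfffff8001fad8600 22000 20000 1
39 object_bind 0xfffff8001fad8800 21000 1000 1
38 object_bind 0xfffff8001fad8a00 1000 20000 1
37 object_bind 0xfffff8001fad8c00 0 1000 1
36 i915_disable_vblank 1
35 i915_disable_vblank 0
"""

if __name__ == "__main__":
    for event, n in count_events(EXCERPT).most_common():
        print(f"{n:3d} {event}")
```

Running the same tally over the dump from the unpatched module would make the "one less driver_irq_handler" difference directly visible.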
Comment 19 Juha Nygård 2015-03-27 10:06:58 UTC
I also have this problem on r280615 running on HP Elitebook 2530p.
Comment 20 Adrian Chadd 2015-04-09 05:52:33 UTC
I'm also having this problem on:

    vgapci0@pci0:0:2:0: class=0x030000 card=0x20e417aa chip=0x2a428086 rev=0x07 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = 'Mobile 4 Series Chipset Integrated Graphics Controller'
        class      = display
        subclass   = VGA

    info: [drm] Initialized drm 1.1.0 20060810
    drmn0: <Mobile Intel\M-B\M-. GM45 Express Chipset> on vgapci0
Comment 21 Adrian Chadd 2015-04-09 06:15:23 UTC
(In reply to Jan Kokemüller from comment #11) This also works for me, on my GM45 mobile chipset.
Comment 22 Adrian Chadd 2015-04-09 06:31:41 UTC
... just to be clear:

* kib's opregion + zero'ing GMBUS4 patch didn't work
* jan's patch in comment #11 works
Comment 23 wolfgang 2015-05-09 00:10:39 UTC
The same problem happens on a Lenovo T500 with a recent 10.1-STABLE. That machine also uses Intel GM45 graphics.

Apparently I misunderstood how to disable MSI for drm: adding hw.drm.msi=0 to /boot/loader.conf just results in an error message about an unknown oid, and kldload i915kms after that still results in 1 MSI being assigned and interrupts coming in on irq16.
Comment 24 wolfgang 2015-05-09 20:12:06 UTC
(In reply to wolfgang from comment #23)

The patch from comment #11 applied cleanly to 10.1-STABLE r282622 and fixes the problem on the T500.
Comment 25 Max Brazhnikov 2015-06-07 21:15:59 UTC
With the patch from comment #11 I still see a high interrupt rate on the first (cold) boot; however, it usually returns to normal after a subsequent reboot. The patch from bug 156596 (committed in r284012) has solved the problem completely.
Comment 26 wolfgang 2015-11-28 13:02:24 UTC
I have now removed the patch from comment #11 from my source tree; I no longer see the problem with 10.2-STABLE r291408.