When I test the new i915 code (r270990 snapshot iso or backported to 10-stable, doesn't matter) on my gen4 Intel GPU (GMA 4500MHD), I'm getting reproducible interrupt storms from "irq16: uhci0" after loading the i915kms module. It does not matter if I start X or not. This looks very similar to the situation here:
Setting hw.drm.msi=0 in /boot/loader.conf doesn't help. One suspend/resume cycle makes the problem go away. When I revert commit r270516 (the opregion changes) the problem goes away, too. Then there is no irq16 at all in the output of "vmstat -i".
I've attached the output of "pciconf -lvc" and the contents of /var/run/dmesg.boot.
Created attachment 147134
contents of /var/run/dmesg.boot
Created attachment 147135
output of "pciconf -lvc"
Can you add a vmstat -i taken while the storm is happening, both with and without hw.drm.msi=0 set?
Created attachment 147255
output of "vmstat -i" for hw.drm.msi=0/1
With the opregion patch I get the "irq16: uhci0" line in the output of "vmstat -i". Setting "hw.drm.msi" to 0 or 1 makes no difference. In both cases I get about 220000 interrupts per second.
It's odd to me that in both the hw.drm.msi=0 and hw.drm.msi=1 cases exactly 1 MSI is delivered. Could you try disabling MSI altogether, via hw.pci.enable_msi=0 in the loader?
OK, then I get:
interrupt                  total    rate
irq1: atkbd0                 900       3
irq9: acpi0                 3605      13
irq12: psm0                 6468      24
irq14: ata0                 9412      36
irq15: ata1                  197       0
irq16: uhci0++             10084      38
irq19: iwn0 uhci2++         6400      24
irq20: hpet0               43476     167
irq22: hdac0                  95       0
irq23: uhci3 ehci1            84       0
irq256: re0                 1436       5
Total                      82157     317
…and CPU usage is back to normal.
It's strange that globally disabling MSI works, but disabling it just for drm does not. hw.drm.msi is also available as a read-only sysctl after boot - can you confirm in the "hw.drm.msi=0" case that it managed to be set to 0?
Indeed, it wasn't set properly. I could have sworn I checked that, but apparently not… I made an error when backporting the sysctl changes. With hw.drm.msi=0 actually in effect, I now get about 60 interrupts per second on irq16 when running X (possibly vblank interrupts for the compositor I'm running):
interrupt                  total    rate
irq1: atkbd0                3988       3
irq9: acpi0                19361      15
irq12: psm0               123546     101
irq14: ata0                35149      28
irq15: ata1                  833       0
irq16: uhci0+              79654      65
irq20: hpet0              263768     216
irq23: uhci3 ehci1            84       0
irq256: hdac0                 95       0
irq258: iwn0               67988      55
irq259: re0                   35       0
Total                     594501     488
So to clarify, setting "hw.drm.msi=0" is fine as a workaround. When hw.drm.msi is set to 1, I have to do one suspend/resume cycle for the interrupts on irq16 to stop. Then the GPU uses irq260.
hw.drm.msi=1 (after suspend/resume):
interrupt                  total    rate
irq1: atkbd0                2085       2
irq9: acpi0                15696      18
irq12: psm0                85609     100
irq14: ata0                21899      25
irq15: ata1                  594       0
irq16: uhci0             3080356    3619
irq20: hpet0              171708     201
irq23: uhci3 ehci1           168       0
irq256: hdac0                113       0
irq258: iwn0               42571      50
irq259: re0                   24       0
irq260: vgapci0            33092      38
Total                    3453915    4058
Slight correction: The GPU uses irq260 before *and* after the suspend/resume cycle if hw.drm.msi=1. So the interrupt storm on irq16 is in addition to the "correct" behavior.
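A note on reading these numbers: as far as I know, the "rate" column of "vmstat -i" is the total divided by the uptime, i.e. an average since boot. That would explain why irq16 still shows a high rate in the table above even though the storm stops after suspend/resume. To see the current rate, one can take two snapshots a few seconds apart and diff the totals. A minimal sketch of that idea follows; the function names are hypothetical, and the second snapshot uses made-up numbers purely for illustration.

```python
def parse_vmstat_i(text):
    """Return {source: total} for the irq lines of `vmstat -i` output.

    Source names may contain spaces (e.g. "irq23: uhci3 ehci1"), so the
    two numeric columns are split off from the right-hand end of the line.
    The header and the "Total" line are skipped.
    """
    totals = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 2)
        if len(parts) == 3 and parts[0].startswith("irq") and parts[1].isdigit():
            totals[parts[0].strip()] = int(parts[1])
    return totals


def current_rates(before, after, seconds):
    """Interrupts per second between two snapshots taken `seconds` apart."""
    old = parse_vmstat_i(before)
    new = parse_vmstat_i(after)
    return {src: (new[src] - old[src]) / seconds for src in new if src in old}


# First snapshot: two lines taken from the table above.
before = """\
irq16: uhci0 3080356 3619
irq260: vgapci0 33092 38
"""

# Hypothetical second snapshot 10 seconds later (made-up numbers).
after = """\
irq16: uhci0 3080356 3577
irq260: vgapci0 33692 38
"""

rates = current_rates(before, after, 10)
for source, rate in sorted(rates.items()):
    print(f"{source}: {rate:.1f}/s")
```

With these illustrative inputs irq16 comes out at 0/s (the total did not move) while irq260 comes out at 60/s, which is the distinction the since-boot average hides.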
Created attachment 147358
Fix/workaround for interrupt storm on GM45 when loading i915kms
This seems to be a known hardware quirk with the GM45 GPU where the gmbus can generate both MSI and non-MSI interrupts:
I've managed to stop the interrupts with the attached patch. The first part restarts the IRQ stuff before calling intel_opregion_init. The second part just makes sure that intel_opregion_enable_asle called from the irq_postinstall handler in drm_irq_install has no effect, as this is called again in intel_opregion_init.
I've tested this on 10-stable with the drm stuff backported and will now check if it also works with a recent snapshot of -current.
(In reply to jan.kokemueller from comment #11)
> Created attachment 147358
> Fix/workaround for interrupt storm on GM45 when loading i915kms
Could you please describe how the discussion and fix from Linux commit c12aba5aa0e60b7 are related to your patch?
Also note that our i915 gmbus code does not use interrupts; the iic is polled. So even if the c12... is somewhat related, it is probably not relevant as is.
I can only guess that some BIOSes leave the gmbus interrupt mask register in a non-zero state, for whatever reason. Also, at some point Linux started explicitly zeroing the mask register on gmbus reset. Please try the following change. It might be that the resume code needs the same patching of GMBUS4, but we will see.
Created attachment 147359
Reset gmbus interrupt mask register explicitely
(In reply to Konstantin Belousov from comment #12)
> Could you please describe how the discussion and fix from Linux commit
> c12aba5aa0e60b7 are related to your patch?
It is not related, except that there seems to be some kind of hardware bug. Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.
(In reply to Jan Kokemüller from comment #14)
> It is not related, except that there seems to be some kind of hardware bug.
> Sadly, zeroing GMBUS4 in intel_iic_reset does not seem to help.
So can you describe what the problem is and why your patch helps?
To see what kind of interrupt causes the storm, compile kernel with KTR_DRM and enable ktr(4) tracing for it.
Created attachment 147363
ktrdump while loading i915kms (debug.ktr.mask=4)
(In reply to Konstantin Belousov from comment #15)
> So can you describe what the problem is and why your patch helps?
My best guess is that it's some kind of initialization problem regarding the interrupts, because the behavior stops after a suspend/resume cycle. The problem started with (or was uncovered by) the opregion code changes. If I stub out intel_enable_asle the problem goes away. The patch makes it so that intel_enable_asle gets first called only after I disable/enable the interrupts of the card. Then there is no interrupt storm anymore. I'll try to debug this further.
> To see what kind of interrupt causes the storm, compile kernel with KTR_DRM
> and enable ktr(4) tracing for it.
I compiled the kernel with KTR, and enabled KTR_DRM by setting "sysctl debug.ktr.mask=4". Then I loaded an unpatched i915kms module. I've attached the log (interesting events are from 34 to 48). It seems only the interrupts on irq260 are accounted for. Any way I can trace the interrupts on irq16?
Created attachment 147377
Do not call intel_opregion_enable_asle() from the i915_driver_irq_postinstall
Try this, please. It is somewhat closer to the current Linux code, by not setting asle->ardy in irq_postinstall hook. It is different from Linux code which still enables pipestat interrupts in postinstall, but we will see.
Your KTR dump indicates that there was a PIPE_B_EVENT active when the interrupts were actually enabled, which might correlate with the storm. The int16 is the spurious interrupt vector for GM45, AFAIR, and MSI was not yet fully set up when ardy was set. So it is possible that the chipset interpreted the gfx interrupt request as spurious.
(In reply to Konstantin Belousov from comment #17)
> Try this, please. It is somewhat closer to the current Linux code, by not
> setting asle->ardy in irq_postinstall hook. It is different from Linux code
> which still enables pipestat interrupts in postinstall, but we will see.
Still no luck with this patch. The KTR dump shows one fewer "driver_irq_handler 10" event, as expected, though:
48 driver_irq_handler 10
47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
46 driver_irq_handler 10
45 driver_irq_handler 10
44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
42 object_clflush 0xfffff8001fad8200
41 object_bind 0xfffff8001fad8200 42000 408000 1
40 object_bind 0xfffff8001fad8600 22000 20000 1
39 object_bind 0xfffff8001fad8800 21000 1000 1
38 object_bind 0xfffff8001fad8a00 1000 20000 1
37 object_bind 0xfffff8001fad8c00 0 1000 1
36 i915_disable_vblank 1
35 i915_disable_vblank 0
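For comparing dumps from a patched and an unpatched module, counting events by name is less error-prone than eyeballing the listing. A small sketch, assuming the "<index> <event> [args...]" layout pasted above (real ktrdump output may carry extra columns such as timestamps or CPU numbers, depending on the flags used); count_ktr_events is a hypothetical helper name.

```python
from collections import Counter


def count_ktr_events(dump):
    """Count events by name in lines of the form '<index> <event> [args...]'."""
    counts = Counter()
    for line in dump.splitlines():
        fields = line.split()
        # Only lines that start with a numeric entry index are trace records.
        if len(fields) >= 2 and fields[0].isdigit():
            counts[fields[1]] += 1
    return counts


# The excerpt from this comment, copied verbatim.
excerpt = """\
48 driver_irq_handler 10
47 object_change_domain pin_to_display_plan 0xfffff8001fad8200 41 0
46 driver_irq_handler 10
45 driver_irq_handler 10
44 object_change_domain pin_to_display_plan 0xfffff8001fad8200 1 0
43 object_change_domain flush_cpu_write 0xfffff8001fad8200 1 1
42 object_clflush 0xfffff8001fad8200
41 object_bind 0xfffff8001fad8200 42000 408000 1
40 object_bind 0xfffff8001fad8600 22000 20000 1
39 object_bind 0xfffff8001fad8800 21000 1000 1
38 object_bind 0xfffff8001fad8a00 1000 20000 1
37 object_bind 0xfffff8001fad8c00 0 1000 1
36 i915_disable_vblank 1
35 i915_disable_vblank 0
"""

counts = count_ktr_events(excerpt)
print(counts["driver_irq_handler"])  # prints 3
```

Running the same function over the dump from the unpatched module and comparing the two counters shows which event classes changed.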
I also have this problem on r280615 running on an HP EliteBook 2530p.
I'm also having this problem on:
vgapci0@pci0:0:2:0: class=0x030000 card=0x20e417aa chip=0x2a428086 rev=0x07 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Mobile 4 Series Chipset Integrated Graphics Controller'
    class      = display
    subclass   = VGA
info: [drm] Initialized drm 1.1.0 20060810
drmn0: <Mobile Intel® GM45 Express Chipset> on vgapci0
(In reply to Jan Kokemüller from comment #11)
This also works for me, on my GM45 mobile chipset.
... just to be clear:
* kib's opregion + zeroing GMBUS4 patch didn't work
* jan's patch in comment #11 works
Same problem happens on a Lenovo T500 with a recent 10.1-STABLE. That machine also uses Intel GM45 graphics.
Apparently I misunderstood how to disable MSI for drm: adding hw.drm.msi=0 to /boot/loader.conf just results in an error message about an unknown oid. Running kldload i915kms after that still results in 1 MSI being assigned and interrupts coming in on irq16.
(In reply to wolfgang from comment #23)
The patch from #11 applied cleanly to 10.1-STABLE r282622 and fixes the problem on the T500.
With the patch from comment #11 I still see a high interrupt rate on the first (cold) boot; however, it usually returns to normal after a subsequent reboot.
The patch from bug 156596 (committed in r284012) has solved the problem completely.
I have now removed the patch from #11 from my source tree; I no longer see the problem with 10.2-STABLE r291408.
Can this PR now be closed, or are there still any unresolved issues?
This can be closed for sure. It's been a long time, but I think this was fixed in base. Most of the relevant DRM bits now live out of base, anyway.