We have a SuperMicro-based server (X8DTU-6+ motherboard) with 12 CPU cores (24 logical CPUs with hyperthreading). We downloaded the 9.1-RC1 installation CD and tried to boot it. It panics during igb(4) driver initialization: http://www.grosbein.net/img/crash-91rc.png This is 100% repeatable.
Fix: A workaround is to disable MSI-X in /boot/loader.conf: hw.pci.enable_msix=0 This allows us to boot the installation CD and install the system. I've found that this problem is quite old and should have been fixed over a year ago in 9.0, but it wasn't. I found a patch by David G. Lawrence in this thread: http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063963.html The patch needed some corrections to apply to 9.1-RC1, so I corrected it, and it solved the problem. Here is the version for 9.1-RC1:
How-To-Repeat: See above.
Responsible Changed From-To: freebsd-bugs->freebsd-net Over to maintainer(s).
I was able to finally reproduce this panic today. It seems to require a server configured for PXE but that receives no DHCP reply (and possibly with the requisite SuperMicro X8 board). I was able to prevent the panic with a subset of the referenced patch by only adding the 'if_drv_flags & IFF_DRV_RUNNING' check to the start of igb_msix_que(). The rest of the patch was unnecessary. I also added some debugging to print out the ICR, EICR, IMS, and EIMS registers in this case. It does look like the hardware is sending an interrupt that is not enabled in the interrupt mask (specifically LSC). In fact, the 82576 datasheet specifically mentions masking LSC until initialization is complete to avoid spurious interrupts during boot and AFAICT igb(4) does this since e1000_reset_hw() clears the interrupt mask via writes to IMC and doesn't re-enable interrupts until igb_init_locked() is invoked via 'ifconfig up'. Here is my debug output:

SMP: AP CPU #6 Launched!
SMP: AP CPU #4 Launched!
stray irq0
igb0: interrupt on que 0: icr 0x1000004 eicr 0
ims 0 eims 0x80000000

Hmmm. Nothing clears EIMS. After some more debugging, I determined that e1000_reset_hw() always turns this bit in EIMS on, even if it is off before e1000_reset_hw() is called(!). I added explicit calls to igb_disable_intr() to clear EIMS after each call to e1000_reset_hw(). This removes the 'stray irq0', but I still get a spurious interrupt during boot (albeit with eims 0). I can use the IFF_DRV_RUNNING hack for now, but I think the real fix is something else.

-- John Baldwin
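For readers following the register dump above, here is a minimal sketch in plain user-space C that decodes the logged values. It is not driver code: the bit constants are assumptions named after the e1000 header definitions (E1000_ICR_LSC and E1000_EIMS_OTHER), and struct intr_regs and both helper functions are hypothetical names made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, named after the e1000 driver headers. */
#define ICR_LSC		0x00000004u	/* link status change cause */
#define EIMS_OTHER	0x80000000u	/* "other causes" enable bit */

/* Hypothetical container for the four registers printed in the log. */
struct intr_regs {
	uint32_t icr;
	uint32_t eicr;
	uint32_t ims;
	uint32_t eims;
};

/* True when LSC is asserted in ICR although it is not enabled in IMS. */
bool
lsc_pending_but_masked(const struct intr_regs *r)
{
	return ((r->icr & ICR_LSC) != 0 && (r->ims & ICR_LSC) == 0);
}

/* True when the "other causes" bit is still enabled in EIMS. */
bool
eims_other_enabled(const struct intr_regs *r)
{
	return ((r->eims & EIMS_OTHER) != 0);
}
```

With the logged values (icr 0x1000004, eicr 0, ims 0, eims 0x80000000), both helpers return true: LSC is pending although IMS is zero, and the top EIMS bit is the one left enabled after e1000_reset_hw().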
On Jan 19, 2013, at 23:26, John Baldwin <jhb@FreeBSD.org> wrote:
> [...]
> This removes the 'stray irq0', but I still get a spurious interrupt
> during boot (albeit with eims 0). I can use the IFF_DRV_RUNNING hack
> for now, but I think the real fix is something else.

I think Jack will have to chime in on this one. Do you think it's all SM X8 boards or just the one we happen to have? I wonder if Jack or Jeffrey (the testing guy he works with) have access to the right board.

Best,
George
Well, do you have a more complete designation of the motherboard? We can look into it, although if the one check stops the problem it may be a low priority.

Jack
On Monday, January 21, 2013 3:28:40 pm Jack Vogel wrote:
> Well, do you have a more complete designation of the motherboard? We can
> look into it, although if the one check stops the problem it may be a low
> priority.

It is a SuperMicro X8DTU-F.

-- John Baldwin
An update on this. I think we should just use a workaround as this seems to be specific to a certain set of motherboards. This is the fix I'm using locally:

Index: if_igb.c
===================================================================
--- if_igb.c	(revision 243732)
+++ if_igb.c	(working copy)
@@ -1522,6 +1522,15 @@
 	u32		newitr = 0;
 	bool		more_rx;
 
+	/*
+	 * The onboard adapters on certain SuperMicro X8* boards
+	 * trigger a spurious interrupt during boot.  Since it
+	 * occurs before the interface is fully configured it
+	 * triggers a panic.  Ignore the interrupt instead.
+	 */
+	if (!(adapter->ifp->if_drv_flags & IFF_DRV_RUNNING))
+		return;
+
 	E1000_WRITE_REG(&adapter->hw, E1000_EIMC, que->eims);
 	++que->irqs;
 

-- John Baldwin
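The effect of the guard in the patch above can be sketched outside the kernel. This is an illustrative mock and not the igb(4) source: the mock_* types and the eimc_writes counter are hypothetical stand-ins for the driver structures and for the E1000_WRITE_REG() call. IFF_DRV_RUNNING is defined locally with the value it has in FreeBSD's <net/if.h>.

```c
#include <stdint.h>

#define IFF_DRV_RUNNING	0x40	/* same value as in FreeBSD <net/if.h> */

/* Hypothetical stand-ins for struct ifnet, struct adapter, struct igb_queue. */
struct mock_ifnet {
	int if_drv_flags;
};

struct mock_adapter {
	struct mock_ifnet *ifp;
	uint32_t eimc_writes;	/* counts simulated EIMC register writes */
};

struct mock_queue {
	struct mock_adapter *adapter;
	uint64_t irqs;
	uint32_t eims;
};

/* Mock of the queue interrupt handler with the early-return guard. */
void
mock_msix_que(void *arg)
{
	struct mock_queue *que = arg;
	struct mock_adapter *adapter = que->adapter;

	/* Ignore interrupts that arrive before the interface is running. */
	if (!(adapter->ifp->if_drv_flags & IFF_DRV_RUNNING))
		return;

	adapter->eimc_writes++;	/* stands in for the write to E1000_EIMC */
	++que->irqs;
}
```

Calling mock_msix_que() while if_drv_flags is 0 leaves irqs and eimc_writes untouched. Once IFF_DRV_RUNNING is set, each call counts as a handled interrupt. In the real driver the early return means the handler never reaches queue state that has not been set up yet.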
Hi! The Audit-Trail of http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/172113 has more than enough patches to fix the problem. Would someone be so kind as to commit one of them so that 9.2-RELEASE won't be broken out of the box on the affected Supermicro boards?
State Changed From-To: open->patched The fix was committed to HEAD (r254002), stable/9 (r254003), and releng/9.2 (r254009).
*** Bug 151593 has been marked as a duplicate of this bug. ***