Bug 172113 - [panic] [e1000] [patch] 9.1-RC1/amd64 panics in igb(4): m_getjcl: invalid cluster type
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords:
Duplicates: 151593
Depends on:
Blocks:
 
Reported: 2012-09-27 12:10 UTC by Eugene Grosbein
Modified: 2018-05-28 20:57 UTC

See Also:


Attachments
file.diff (1.13 KB, patch)
2012-09-27 12:10 UTC, Eugene Grosbein
igb-path-8.txt (1.07 KB, text/plain; charset=US-ASCII)
2012-10-24 11:49 UTC, andrew.filonov

Description Eugene Grosbein 2012-09-27 12:10:09 UTC
	
	We have a SuperMicro-based server (X8DTU-6+ motherboard)
	with 12 CPU cores (24 logical CPUs with hyperthreading).
	
	We downloaded the 9.1-RC1 installation CD and tried to boot it.
	It panics during igb(4) driver initialization:

http://www.grosbein.net/img/crash-91rc.png

	This is 100% repeatable.

Fix: A workaround is to disable MSI-X in /boot/loader.conf:

hw.pci.enable_msix=0

	This allows us to boot the installation CD and install the system.

	I've found that this problem is fairly old and should have been
	fixed over a year ago in 9.0, but it wasn't.

	I found a patch by David G. Lawrence in this thread:
	http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063963.html

	The patch needed some corrections to apply to 9.1-RC1,
	so I corrected it, and the corrected patch solved the problem.
	The version for 9.1-RC1 is attached as file.diff.
How-To-Repeat: 
	See above.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2012-10-05 04:27:58 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 2 John Baldwin freebsd_committer freebsd_triage 2013-01-20 04:26:17 UTC
I was able to finally reproduce this panic today.  It seems to require
a server configured for PXE but that receives no DHCP reply (and
possibly with the requisite SuperMicro X8 board).  I was able to
prevent the panic with a subset of the referenced patch by only adding
the 'if_drv_flags & IFF_DRV_RUNNING' check to the start of
igb_msix_que().  The rest of the patch was unnecessary.  I also added
some debugging to print out the ICR, EICR, IMS, and EIMS registers in
this case.  It does look like the hardware is sending an interrupt that
is not enabled in the interrupt mask (specifically LSC).  In fact, the
82576 datasheet specifically mentions masking LSC until initialization
is complete to avoid spurious interrupts during boot and AFAICT igb(4)
does this since e1000_reset_hw() clears the interrupt mask via writes
to IMC and doesn't re-enable interrupts until igb_init_locked() is
invoked via 'ifconfig up'.  Here is my debug output:

SMP: AP CPU #6 Launched!
SMP: AP CPU #4 Launched!
stray irq0
igb0: interrupt on que 0: icr 0x1000004 eicr 0
     ims 0 eims 0x80000000

Hmmm.   Nothing clears EIMS.  After some more debugging, I determined
that e1000_reset_hw() always turns this bit in EIMS on, even if it is
off before e1000_reset_hw() is called(!).  I added explicit calls to
igb_disable_intr() to clear EIMS after each call to e1000_reset_hw().
This removes the 'stray irq0', but I still get a spurious interrupt
during boot (albeit with eims 0).  I can use the IFF_DRV_RUNNING hack
for now, but I think the real fix is something else.

-- 
John Baldwin
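
The register dump above can be read bit by bit. The following is a minimal userland sketch, not driver code, that checks whether a cause bit is latched while its mask bit is clear. The two bit values are the ones I recall from the e1000 shared code (E1000_ICR_LSC and E1000_EIMS_OTHER in e1000_defines.h) and should be treated as assumptions; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Assumed bit values, recalled from e1000_defines.h; verify against
 * the header in your tree before relying on them.
 */
#define ICR_LSC     0x00000004u  /* Link Status Change cause bit */
#define EIMS_OTHER  0x80000000u  /* "other causes" vector enable bit */

/*
 * Hypothetical helper: returns 1 when 'bit' is latched in the cause
 * register but not enabled in the corresponding mask register.
 */
static int
cause_is_masked(uint32_t cause, uint32_t mask, uint32_t bit)
{
	return ((cause & bit) != 0 && (mask & bit) == 0);
}

/* Hypothetical helper: prints a one-line reading of a register snapshot. */
static void
describe_snapshot(uint32_t icr, uint32_t ims, uint32_t eims)
{
	printf("LSC latched: %s, LSC enabled in IMS: %s, OTHER enabled in EIMS: %s\n",
	    (icr & ICR_LSC) ? "yes" : "no",
	    (ims & ICR_LSC) ? "yes" : "no",
	    (eims & EIMS_OTHER) ? "yes" : "no");
}
```

Calling describe_snapshot(0x1000004, 0, 0x80000000) with the values from the debug output reports LSC as latched but not enabled in IMS, while the "other" vector is enabled in EIMS, which matches the observation that e1000_reset_hw() leaves that EIMS bit set.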
Comment 3 George V. Neville-Neil freebsd_committer 2013-01-21 19:25:00 UTC
On Jan 19, 2013, at 23:26 , John Baldwin <jhb@FreeBSD.org> wrote:

> I was able to finally reproduce this panic today.  It seems to require
> a server configured for PXE but that receives no DHCP reply (and
> possibly with the requisite SuperMicro X8 board).  I was able to
> prevent the panic with a subset of the referenced patch by only adding
> the 'if_drv_flags & IFF_DRV_RUNNING' check to the start of
> igb_msix_que().  The rest of the patch was unnecessary.  I also added
> some debugging to print out the ICR, EICR, IMS, and EIMS registers in
> this case.  It does look like the hardware is sending an interrupt that
> is not enabled in the interrupt mask (specifically LSC).  In fact, the
> 82576 datasheet specifically mentions masking LSC until initialization
> is complete to avoid spurious interrupts during boot and AFAICT igb(4)
> does this since e1000_reset_hw() clears the interrupt mask via writes
> to IMC and doesn't re-enable interrupts until igb_init_locked() is
> invoked via 'ifconfig up'.  Here is my debug output:
>
> SMP: AP CPU #6 Launched!
> SMP: AP CPU #4 Launched!
> stray irq0
> igb0: interrupt on que 0: icr 0x1000004 eicr 0
>     ims 0 eims 0x80000000
>
> Hmmm.   Nothing clears EIMS.  After some more debugging, I determined
> that e1000_reset_hw() always turns this bit in EIMS on, even if it is
> off before e1000_reset_hw() is called(!).  I added explicit calls to
> igb_disable_intr() to clear EIMS after each call to e1000_reset_hw().
> This removes the 'stray irq0', but I still get a spurious interrupt
> during boot (albeit with eims 0).  I can use the IFF_DRV_RUNNING hack
> for now, but I think the real fix is something else.
>

I think Jack will have to chime in on this one.  Do you think it's all
SM X8 boards or just the one we happen to have?  I wonder if Jack or
Jeffrey (the testing guy he works with) have access to the right board.

Best,
George
Comment 4 jfvogel 2013-01-21 20:28:40 UTC
Well, do you have a more complete designation of the motherboard? We can
look into it, although if the one check stops the problem it may be a low
priority.

Jack


Comment 5 John Baldwin freebsd_committer freebsd_triage 2013-01-22 17:09:32 UTC
On Monday, January 21, 2013 3:28:40 pm Jack Vogel wrote:
> Well, do you have a more complete designation of the motherboard? We can
> look into it, although if the one check stops the problem it may be a low
> priority.

It is a SuperMicro X8DTU-F.
 

-- 
John Baldwin
Comment 6 John Baldwin freebsd_committer freebsd_triage 2013-02-21 22:12:55 UTC
An update on this.  I think we should just use a workaround as this seems to 
be specific to a certain set of motherboards.  This is the fix I'm using 
locally:

Index: if_igb.c
===================================================================
--- if_igb.c    (revision 243732)
+++ if_igb.c    (working copy)
@@ -1522,6 +1522,15 @@
        u32             newitr = 0;
        bool            more_rx;
 
+       /*
+        * The onboard adapters on certain SuperMicro X8* boards
+        * trigger a spurious interrupt during boot.  Since it
+        * occurs before the interface is fully configured it
+        * triggers a panic.  Ignore the interrupt instead.
+        */
+       if (!(adapter->ifp->if_drv_flags & IFF_DRV_RUNNING))
+               return;
+
        E1000_WRITE_REG(&adapter->hw, E1000_EIMC, que->eims);
        ++que->irqs;

-- 
John Baldwin
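
The effect of the guard in the patch above can be illustrated outside the kernel. This is a minimal userland sketch under stated assumptions: the mock_* structures and mock_msix_que() are hypothetical stand-ins for the driver's adapter, queue, and igb_msix_que(), and the IFF_DRV_RUNNING value is repeated from FreeBSD's <net/if.h> only so the sketch builds on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Value as defined in FreeBSD <net/if.h>; repeated here for a standalone build. */
#define IFF_DRV_RUNNING 0x40

/* Hypothetical stand-ins for struct ifnet and the per-queue state. */
struct mock_ifnet {
	int if_drv_flags;
};

struct mock_queue {
	struct mock_ifnet *ifp;
	uint64_t irqs;		/* interrupts actually serviced */
};

/*
 * Mock of the queue interrupt handler with the early-return guard.
 * Returns 1 if the interrupt was serviced, 0 if it was ignored
 * because the interface has not been brought up yet.
 */
static int
mock_msix_que(struct mock_queue *que)
{
	if (!(que->ifp->if_drv_flags & IFF_DRV_RUNNING))
		return (0);	/* spurious interrupt before init: ignore */

	++que->irqs;		/* normal path: count and service it */
	return (1);
}
```

Before the flag is set, a call leaves the interrupt counter untouched, so none of the not-yet-initialized receive state is reached; once the flag is set, the handler proceeds as usual. The real handler also masks the queue vector and processes the rings, which this sketch omits.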
Comment 7 Eugene Grosbein 2013-09-27 09:31:58 UTC
Hi!

Audit-Trail of http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/172113
has more than enough patches to fix the problem.

Would someone be so kind as to commit one of them so that 9.2-RELEASE
won't be broken out of the box on the affected Supermicro boards?
Comment 8 Sergey Kandaurov freebsd_committer 2013-09-27 10:00:31 UTC
State Changed
From-To: open->patched

The fix committed to HEAD (r254002), stable/9 (r254003), releng/9.2 (r254009).
Comment 9 Eugene Grosbein freebsd_committer 2018-05-28 20:57:38 UTC
*** Bug 151593 has been marked as a duplicate of this bug. ***