172113 – [panic] [e1000] [patch] 9.1-RC1/amd64 panices in igb(4): m_getjcl: invalid cluster type


Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	Unspecified
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	freebsd-net (Nobody)


Duplicates (1):	151593

Reported:	2012-09-27 12:10 UTC by Eugene Grosbein
Modified:	2018-05-28 20:57 UTC
CC List:	3 users

Attachments
file.diff (1.13 KB, patch) 2012-09-27 12:10 UTC, Eugene Grosbein
igb-path-8.txt (1.07 KB, text/plain; charset=US-ASCII) 2012-10-24 11:49 UTC, andrew.filonov

Description Eugene Grosbein 2012-09-27 12:10:09 UTC

	
	We have a SuperMicro-based server (X8DTU-6+ motherboard)
	with 12 CPU cores (24 logical CPUs with hyperthreading).
	
	We downloaded the 9.1-RC1 installation CD and tried to boot it.
	It panics during igb(4) driver initialization:

http://www.grosbein.net/img/crash-91rc.png

	This is 100% repeatable.

Fix: A workaround is to disable MSI-X in /boot/loader.conf:

hw.pci.enable_msix=0

	This allows us to boot the installation CD and install the system.

	I've found that this problem is quite old; it should have been fixed
	over a year ago in 9.0, but it wasn't.

	I found a patch by David G. Lawrence in this thread:
	http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063963.html

	The patch needed some corrections to apply to 9.1-RC1,
	so I corrected it, and the result solved the problem.
	Here is the version for 9.1-RC1 (attached as file.diff):
How-To-Repeat: 
	See above.

Comment 1 Mark Linimon freebsd_committer

2012-10-05 04:27:58 UTC

Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).

Comment 2 John Baldwin freebsd_committer

2013-01-20 04:26:17 UTC

I was able to finally reproduce this panic today.  It seems to require
a server configured for PXE but that receives no DHCP reply (and
possibly with the requisite SuperMicro X8 board).  I was able to
prevent the panic with a subset of the referenced patch by only adding
the 'if_drv_flags & IFF_DRV_RUNNING' check to the start of
igb_msix_que().  The rest of the patch was unnecessary.  I also added
some debugging to print out the ICR, EICR, IMS, and EIMS registers in
this case.  It does look like the hardware is sending an interrupt that
is not enabled in the interrupt mask (specifically LSC).  In fact, the
82576 datasheet specifically mentions masking LSC until initialization
is complete to avoid spurious interrupts during boot and AFAICT igb(4)
does this since e1000_reset_hw() clears the interrupt mask via writes
to IMC and doesn't re-enable interrupts until igb_init_locked() is
invoked via 'ifconfig up'.  Here is my debug output:

SMP: AP CPU #6 Launched!
SMP: AP CPU #4 Launched!
stray irq0
igb0: interrupt on que 0: icr 0x1000004 eicr 0
     ims 0 eims 0x80000000

Hmmm.   Nothing clears EIMS.  After some more debugging, I determined
that e1000_reset_hw() always turns this bit in EIMS on, even if it is
off before e1000_reset_hw() is called(!).  I added explicit calls to
igb_disable_intr() to clear EIMS after each call to e1000_reset_hw().
This removes the 'stray irq0', but I still get a spurious interrupt
during boot (albeit with eims 0).  I can use the IFF_DRV_RUNNING hack
for now, but I think the real fix is something else.

-- 
John Baldwin
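
The register dump above can be checked mechanically. The following is a minimal sketch, not driver code: it assumes the LSC cause bit is 0x4 (consistent with the reading of icr 0x1000004 above), and the names TOY_ICR_LSC and cause_pending_but_masked are hypothetical, invented for illustration only. It tests whether a cause bit is set in a cause register while the matching bit in the mask register is clear, which is the condition described for LSC with ims 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical constant for illustration; assumed to correspond to the
 * link status change (LSC) cause bit discussed in the comment above.
 */
#define TOY_ICR_LSC 0x00000004u

/*
 * Returns 1 when 'bit' is set in the cause register but not enabled in
 * the mask register, i.e. the hardware reports a cause that the driver
 * has not unmasked. Returns 0 otherwise.
 */
int
cause_pending_but_masked(uint32_t cause_reg, uint32_t mask_reg, uint32_t bit)
{
	assert(bit != 0);
	return ((cause_reg & bit) != 0 && (mask_reg & bit) == 0);
}
```

With the values from the debug output, cause_pending_but_masked(0x1000004, 0, TOY_ICR_LSC) is true: LSC is flagged in ICR although IMS is 0. The other set bit in ICR is deliberately not decoded here, since the thread does not identify it.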

Comment 3 George V. Neville-Neil freebsd_committer

2013-01-21 19:25:00 UTC

I think Jack will have to chime in on this one.  Do you think it's all
SM X8 boards or just the one we happen to have?  I wonder if Jack or
Jeffrey (the testing guy he works with) have access to the right board.

Best,
George

Comment 4 jfvogel 2013-01-21 20:28:40 UTC

Well, do you have a more complete designation of the motherboard? We can
look into it, although if the one check stops the problem it may be a low
priority.

Jack



Comment 5 John Baldwin freebsd_committer

2013-01-22 17:09:32 UTC

On Monday, January 21, 2013 3:28:40 pm Jack Vogel wrote:
> Well, do you have a more complete designation of the motherboard? We can
> look into it, although if the one check stops the problem it may be a low
> priority.

It is a SuperMicro X8DTU-F.
 

-- 
John Baldwin

Comment 6 John Baldwin freebsd_committer

2013-02-21 22:12:55 UTC

An update on this.  I think we should just use a workaround as this seems to 
be specific to a certain set of motherboards.  This is the fix I'm using 
locally:

Index: if_igb.c
===================================================================
--- if_igb.c    (revision 243732)
+++ if_igb.c    (working copy)
@@ -1522,6 +1522,15 @@
        u32             newitr = 0;
        bool            more_rx;
 
+       /*
+        * The onboard adapters on certain SuperMicro X8* boards
+        * trigger a spurious interrupt during boot.  Since it
+        * occurs before the interface is fully configured it
+        * triggers a panic.  Ignore the interrupt instead.
+        */
+       if (!(adapter->ifp->if_drv_flags & IFF_DRV_RUNNING))
+               return;
+
        E1000_WRITE_REG(&adapter->hw, E1000_EIMC, que->eims);
        ++que->irqs;

-- 
John Baldwin
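
To illustrate why the early return avoids the panic, here is a toy model of the guard. It is a sketch under stated assumptions, not the igb(4) code: every name (toy_adapter, toy_init, toy_alloc_rx_buf, toy_msix_que, TOY_DRV_RUNNING) is hypothetical, and the idea that the receive buffer size is still unset before initialization is an assumption made for the model. The real m_getjcl() panics on an invalid cluster type; the toy allocator returns -1 instead so the behavior can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the IFF_DRV_RUNNING driver flag. */
#define TOY_DRV_RUNNING 0x40u

/* Toy adapter state; not the real struct adapter from igb(4). */
struct toy_adapter {
	unsigned int  drv_flags;   /* models ifp->if_drv_flags */
	size_t        rx_buf_size; /* assumed to be 0 until initialization */
	unsigned long irqs;        /* interrupts actually serviced */
};

/* Toy allocator: accepts only known buffer sizes, else reports failure. */
int
toy_alloc_rx_buf(size_t size)
{
	return ((size == 2048 || size == 4096) ? 0 : -1);
}

/* Models bringing the interface up: configure buffers, then mark running. */
void
toy_init(struct toy_adapter *sc)
{
	assert(sc != NULL);
	sc->rx_buf_size = 2048;
	sc->drv_flags |= TOY_DRV_RUNNING;
}

/*
 * Toy queue interrupt handler.
 * Returns 1 if the interrupt was serviced, 0 if it was ignored by the
 * guard, and -1 if it reached the allocator with an invalid size (the
 * case that panics in the real driver).
 */
int
toy_msix_que(struct toy_adapter *sc, int use_guard)
{
	assert(sc != NULL);
	if (use_guard && !(sc->drv_flags & TOY_DRV_RUNNING))
		return (0);
	if (toy_alloc_rx_buf(sc->rx_buf_size) != 0)
		return (-1);
	++sc->irqs;
	return (1);
}
```

In this model, a spurious interrupt before toy_init() reaches the failing allocation when the guard is off and is silently dropped when the guard is on; after toy_init() the handler services interrupts normally either way.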

Comment 7 Eugene Grosbein 2013-09-27 09:31:58 UTC

Hi!

Audit-Trail of http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/172113
has more than enough patches to fix the problem.

Would someone be so kind as to commit one of them so that 9.2-RELEASE
won't be broken out of the box on the affected SuperMicro boards?

Comment 8 Sergey Kandaurov freebsd_committer

2013-09-27 10:00:31 UTC

State Changed
From-To: open->patched

The fix was committed to HEAD (r254002), stable/9 (r254003), and releng/9.2 (r254009).

Comment 9 Eugene Grosbein freebsd_committer

2018-05-28 20:57:38 UTC

*** Bug 151593 has been marked as a duplicate of this bug. ***