Bug 26161

Summary: Kernel Panic on Dual Processor System during heavy disk IO
Product: Base System
Reporter: cjm88 <cjm88>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.2-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description    Flags
Panic 1.jpg    none
Panic 2.jpg    none

Description cjm88 2001-03-28 03:40:02 UTC
The system panics when subjected to heavy disk IO. The system is an Intel
altserver with two Pentium 166 processors on a motherboard supporting SMP
1.4. I'm using the on-board Adaptec SCSI controller with a 2G Seagate drive.
It is quite stable until something requires heavy disk IO, and then it
crashes within 15 to 30 minutes.  The behavior is the same whether the IO
is for swapping or just heavy file access.  I managed to photograph the
console just after the panic on two occasions and can forward those JPEGs
by e-mail to whoever would be interested in looking at this problem.

I tried to replicate the problem on two other FreeBSD platforms but they
were single-cpu boxes.  The problem did not occur even after extended
disk pounding (over 24 hours).

Fix: 

Short of a reboot... which isn't a fix :) I have no idea.

I would be willing to work with whoever takes this on, to perform more
tests, try fixes and generally support the effort to resolve this.
How-To-Repeat:

I can repeat the problem fairly easily, either by running a program I wrote
to cause swapping or by running the Bonnie benchmark with a large file size.
Interestingly, it seems that things only really blow up when more than
one process is trying to do IO concurrently.
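
For anyone trying to reproduce this, a workload of the kind described above
might look roughly like the sketch below. This is not the reporter's program
(which was not posted) and not Bonnie; the names hammer and run and all the
default sizes are hypothetical, and it assumes a Unix-like system where
Python's "fork" start method is available. It simply starts several processes
that each write a file, force it to disk, and read it back, all at the same
time.

```python
import multiprocessing
import os
import tempfile

MB = 1024 * 1024


def hammer(path, size_mb, passes):
    """Write size_mb megabytes to path, sync, and read it back, `passes` times."""
    chunk = os.urandom(MB)
    for _ in range(passes):
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the disk
        with open(path, "rb") as f:
            while f.read(MB):
                pass


def run(workers=2, size_mb=8, passes=1, directory=None):
    """Run `workers` concurrent IO processes; return the final file sizes."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like host
    workdir = tempfile.mkdtemp(dir=directory)
    paths = [os.path.join(workdir, "io%d.dat" % i) for i in range(workers)]
    procs = [ctx.Process(target=hammer, args=(p, size_mb, passes)) for p in paths]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    sizes = [os.path.getsize(p) for p in paths]
    for p in paths:
        os.remove(p)
    os.rmdir(workdir)
    return sizes


if __name__ == "__main__":
    # Hypothetical settings; raise size_mb and passes for a sustained load.
    print(run(workers=2, size_mb=8, passes=1))
```

To approximate the conditions in this report, size_mb would presumably need
to exceed physical RAM and the run would need to last well beyond the 15 to
30 minutes it took for the panic to appear.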
Comment 1 Peter Pentchev 2001-03-28 07:04:49 UTC
On Tue, Mar 27, 2001 at 06:31:21PM -0800, cjm88@home.com wrote:
> 
> >Number:         26161
> >Category:       kern
> >Synopsis:       Kernel Panic on Dual Processor System during heavy disk IO
> >Originator:     Christophe Michel
> >Release:        4.2-RELEASE
> >Organization:
> >Environment:
> FreeBSD u2 4.2-RELEASE FreeBSD 4.2-RELEASE #1: Sat Mar 24 21:27:43 EST 2001     
> root@u2:/usr/src/sys/compile/U2  i386
> 
> >Description:
> The system panics when subjected to heavy disk IO. The system is an Intel
> altserver with two Pentium 166 processors on a mother board supprting SMP
> 1.4. I'm using the on-board adaptec SCSI controller with 2G Seagate drive.
> It is quite stable until something requires heavy disk IO and then
> crashes within 15 to 30 minutes.  The behavior is the same whether the IO
> is for swapping or just heavy file access.  I managed to photograph the
> console just after the panic on two occasions and can forward via e-mail, 
> those jpgs to whoever would be interested in looking at this problem.
> 
> I tried to replicate the problem on two other FreeBSD platforms but they
> were single-cpu boxes.  The problem did not occur even after extended
> disk pounding (over 24 hours).

This is all very nice :)  But, can you either:

1. update your system to 4.2-stable (which is actually 4.3-RC now), or

2. follow the instructions on http://www.FreeBSD.org/handbook/kerneldebug.html
   to build a debugging kernel, run dumpon, have the kernel panic again,
   this time storing the core dump, then run savecore and examine
   the kernel crash dump, posting more information about the dump?

G'luck,
Peter

-- 
Nostalgia ain't what it used to be.
Comment 2 cjm88 2001-03-28 13:15:25 UTC
cjm88@home.com wrote:

> OK, Here's what I'll do (in the hope that it's the most useful way to proceed in
> terms of QA for future releases).
>
> 1) I'll follow your first suggestion and update to 4.2-stable/4.3-RC1
> 2) I'll try to replicate the problem
> 3) If it occurs I'll follow your second suggestion and try to recreate the problem
> again
> 4) I'll advise regarding the outcome with additional information in either case.
>
> I should be able to perform the above this evening.
>
> Thanks for your help :)
>
> C
>
> PS. I attached the JPEGs of the console for your interest, although the info will
> be moot once I complete step 1)
Comment 3 cjm88 2001-03-30 11:16:12 UTC
OK,

I followed the first suggestion and initially it seemed that the system was more
stable (i.e. it took longer for it to panic).

When it panicked I built a debug version of the kernel and ran the tests again.
This time it took even longer for the system to crash.  So I ran the tests a few
more times with increasing intensity.  It seemed that the time required to crash
the system was inversely proportional to the intensity of the disk IO that the
system was subjected to.

Here is what the gdb session (run by a newbie... i.e. 'me' ) showed... I probably
need some further direction from someone more experienced to extract more useful
information.

u2# gdb -k
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd".
(kgdb) symbol-file kernel.debug
Reading symbols from kernel.debug...done.
(kgdb) exec-file kernel
(kgdb) core-file /usr/wrk/vmcore.0
SMP 2 cpus
IdlePTD 3461120
initial pcb at 2bdae0
panicstr: NMI indicates hardware failure
panic messages:
---
---
#0  0xc0158eae in dumpsys ()
(kgdb) where
#0  0xc0158eae in dumpsys ()
#1  0xc0158ccf in boot ()
#2  0xc0159080 in poweroff_wait ()
#3  0xc0256d90 in trap (frame={tf_fs = 47, tf_es = 47, tf_ds = 47,
      tf_edi = 134533120, tf_esi = 69, tf_ebp = -1077937120,
      tf_isp = -931377196, tf_ebx = 183304184, tf_edx = -1077937252,
      tf_ecx = 672025592, tf_eax = 15, tf_trapno = 19, tf_err = 0,
      tf_eip = 134514178, tf_cs = 31, tf_eflags = 514, tf_esp = -1077937160,
      tf_ss = 47}) at ../../i386/i386/trap.c:396
#4  0x8048602 in ?? ()
#5  0x80484bd in ?? ()
(kgdb)


OK...  :)   so now what do I do???

Thanks for your help.

C




Comment 4 iedowse freebsd_committer freebsd_triage 2001-12-02 22:36:32 UTC
State Changed
From-To: open->feedback


NMI traps are usually caused by some sort of hardware failure (RAM 
parity errors, etc.). Try replacing pieces of hardware one at a time 
to see if you can isolate the problem, and let us know if it helps.
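
The trap frame posted in comment 3 appears consistent with this. gdb prints
the saved registers as signed decimals, which hides the fact that tf_eip is
the same address as frame #4 in the backtrace. The sketch below converts a
few of the posted values to 32-bit hex. The helper names are hypothetical,
and the value T_NMI = 19 is an assumption taken from the i386 trap numbering
in machine/trap.h; it does agree with the panicstr in the dump.

```python
T_NMI = 19  # assumption: i386 trap number for a non-maskable interrupt

# Values copied from the trap frame in comment 3.
frame = {
    "tf_trapno": 19,
    "tf_eip": 134514178,
    "tf_cs": 31,
    "tf_esp": -1077937160,
}


def to_u32(value):
    """Reinterpret a signed decimal from gdb as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def describe(frame):
    """Summarise the fields that say what kind of trap it was and where."""
    return {
        "is_nmi": frame["tf_trapno"] == T_NMI,
        "eip": hex(to_u32(frame["tf_eip"])),
        "esp": hex(to_u32(frame["tf_esp"])),
        # The low two bits of the code selector are the privilege level;
        # 3 means the CPU was executing user-mode code.
        "user_mode": (frame["tf_cs"] & 3) == 3,
    }


if __name__ == "__main__":
    for key, value in describe(frame).items():
        print(key, value)
```

Read this way, the NMI arrived while a user process (presumably the test
program) was running at 0x8048602, not inside a kernel code path, which
fits a hardware fault better than a kernel bug.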
Comment 5 iedowse freebsd_committer freebsd_triage 2002-06-02 12:14:37 UTC
State Changed
From-To: feedback->closed


Feedback timeout.