Bug 107436

Summary: Core dumps not working via (mpt) driver using LSI controller
Product: Base System
Reporter: Johan <johannes.kruger>
Component: kern
Assignee: Matt Jacob <mjacob>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: Unspecified   
Hardware: Any   
OS: Any   

Description Johan 2007-01-02 17:50:15 UTC
Note: Even though uname -a reports a May 14 build, I also updated the src/sys/cam code and src/dev/mpt code to the latest, and it still does not work. I had to update it to get support for the newer LSI-1064.

Anyway, one SATA drive is connected to PHY 0 of the LSI controller. The system boots fine and works well, but kernel core dumps do not work.
Jump into the debugger by typing:
sysctl -w debug.kdb.panic=1

You will now be in the debugger.
Go ahead and type "panic"
Core dump should now be generated on the swap partition, however, it does not work.

This is fairly serious, since no debug information will be available when the kernel panics.

Fix: 

Unknown.
The problem can either be in the CAM layer or the (mpt) driver.
If core dumping works with a different card which uses a different driver, then it's a problem with the (mpt) driver, else it's the CAM layer.
How-To-Repeat:
--------------- snip --------------------
# sysctl -w debug.kdb.panic=1
debug.kdb.panic: 0panic: kdb_sysctl_panic
cpuid = 0
KDB: enter: panic
[thread pid 3438 tid 100053 ]
Stopped at      kdb_enter+0x2b: nop
db> panic
cpuid = 0
Uptime: 21h29m12s
Physical memory: 501 MB
Dumping 75 MB:Aborting dump due to I/O error.
status == 0x58, scsi status == 0x0

** DUMP FAILED (ERROR 5) **
Automatic reboot in 15 seconds - press a key on the console to abort
--------------- snip --------------------
Comment 1 Remko Lodder freebsd_committer freebsd_triage 2007-01-02 21:14:39 UTC
Responsible Changed
From-To: freebsd-bugs->mjacob

hey Matthew, can you look at the problem please?
Comment 2 Johan 2007-02-16 15:35:19 UTC
In order to track down the problem (or localize it a bit) I tried the
same thing on an Adaptec controller, in order to rule out the CAM layer.

So here is the summary:
- The problem is in the (mpt) driver and not in the CAM layer.
- It exists in at least FreeBSD-6.X and FreeBSD-7-CURRENT (old and new
CAM layer)

--------------- snip --------------------
db> panic
cpuid = 0
Uptime: 21h29m12s
Physical memory: 501 MB
Dumping 75 MB:Aborting dump due to I/O error.
status == 0x58, scsi status == 0x0

** DUMP FAILED (ERROR 5) **
Automatic reboot in 15 seconds - press a key on the console to abort
--------------- snip --------------------


On execution of the panic command the poll routine inside (mpt) driver
does call the mpt_intr routine as it should.
It enters it multiple times:
First a bunch of times where I commented with **** 1 ****
Then 1 time **** 2 ****
-------------- snippet of mpt_intr ----------------------------------
.
.
.
    if ((reply_desc & MPI_ADDRESS_REPLY_A_BIT) != 0) {
**** 2 **** THEN IT ENTERS HERE AND FAILS
*****************************************
.
.
.
    } else {
**** 1 **** ENTERS HERE A LOT WITH TYPE = MPI_CONTEXT_REPLY_TYPE_SCSI_INIT ********
	type = MPI_GET_CONTEXT_REPLY_TYPE(reply_desc);
.
.
.
    }

    cb_index = MPT_CONTEXT_TO_CBI(ctxt_idx);
    req_index = MPT_CONTEXT_TO_REQI(ctxt_idx);
    if (req_index < MPT_MAX_REQUESTS(mpt)) {
       req = &mpt->request_pool[req_index];
    } else {
       mpt_prt(mpt, "WARN: mpt_intr index == %d (reply_desc =="
           " 0x%x)\n", req_index, reply_desc);
    }

    free_rf = mpt_reply_handlers[cb_index](mpt, req,
         reply_desc, reply_frame);

    if (reply_frame != NULL && free_rf) {
         mpt_free_reply(mpt, reply_baddr);
    }
         mpt_free_reply(mpt, reply_baddr);
    }
.
.
.
-------------- snippet of mpt_intr ----------------------------------
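
The two branches marked 1 and 2 in the snippet can be looked at in isolation. What follows is only a minimal sketch, not the driver's code: the `SKETCH_*` constant values and the helper names `is_address_reply` and `context_reply_type` are assumptions made for illustration, and the real definitions should be taken from the MPI headers under mpilib/.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only; check the MPI headers for the real ones. */
#define SKETCH_ADDRESS_REPLY_A_BIT      0x80000000u
#define SKETCH_CONTEXT_REPLY_TYPE_MASK  0x60000000u
#define SKETCH_CONTEXT_REPLY_TYPE_SHIFT 29

/* Hypothetical helper: nonzero when the descriptor points at a reply frame
 * (the branch marked 2 in the snippet). */
static int
is_address_reply(uint32_t reply_desc)
{
    return (reply_desc & SKETCH_ADDRESS_REPLY_A_BIT) != 0;
}

/* Hypothetical helper: extract the type field of a context reply
 * (the branch marked 1 in the snippet). */
static uint32_t
context_reply_type(uint32_t reply_desc)
{
    /* Only context replies carry a type field. */
    assert(!is_address_reply(reply_desc));
    return (reply_desc & SKETCH_CONTEXT_REPLY_TYPE_MASK) >>
        SKETCH_CONTEXT_REPLY_TYPE_SHIFT;
}
```

Under these assumed values, a descriptor with the top bit set takes the address-reply path, and any other descriptor yields a small type number taken from the two bits below it.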



Johan


Comment 3 Matt Jacob freebsd_committer freebsd_triage 2007-02-16 16:40:04 UTC
Thanks for narrowing it down.  I have this and the same issue for isp to 
tackle this or next week.
Comment 4 Johan 2007-02-16 17:25:10 UTC
No problem, I'll let you know if I find anything.

Thanks

Johan

Comment 5 Matt Jacob freebsd_committer freebsd_triage 2007-02-17 00:45:19 UTC
Feb 16 16:25:37 obrien reboot: rebooted by mjacob
Feb 16 16:25:37 obrien syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop...done
Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining...0 0 done
All buffers synced.
Swap device label/obrien_swap removed.
Uptime: 12m28s
Physical memory: 5106 MB
Dumping 234 MB: 219 203 187 171 155 139 123 107 91 75 59 43 27 11
Dump complete


So, this works for a system booted off of MPT LSI-1030c (U320).
Comment 6 Johan 2007-02-17 02:04:57 UTC
Hmmm, this is interesting.
(1) How did you initiate the dump, because the output looks a bit
different ?
Do you think it can be the specific hardware that has a problem ?
and
(2) Do you have an LSI-1064 available
and
(3) Which version of FreeBSD did you use for the dump ?
I tried FreeBSD 6.X and 7-CURRENT

I want to try the same version on the LSI-1064 I have here.

Thanks

Johan

Comment 7 Matt Jacob freebsd_committer freebsd_triage 2007-02-17 04:50:36 UTC
> Hmmm, this is interesting.
> (1) How did you initiate the dump, because the output looks a bit
> different ?

reboot -d

> Do you think it can be the specific hardware that has a problem ?

Likely.

> and
> (2) Do you have an LSI-1064 available

I have a 1068 card and a SF4100 (courtesy of a very helpful FreeBSD 
fan). The SF4100 also seems to have no trouble dumping to a SAS drive.

> and
> (3) Which version of FreeBSD did you use for the dump ?
> I tried FreeBSD 6.X and 7-CURRENT

-current.

> I want to try the same version on the LSI-1064 I have here.

Yes.

It's likely one or more  of three things:

a) It's platform specific. That is, something about your platform is 
causing the MPT to fail or be failed.

b) It's HBA/Disk specific. That is, it's the fact that it's a SATA drive 
on a 1064.

c) It's invocation specific. That is, 'reboot -d' is very kind, while a 
panic, from a random point, is not.

I haven't really narrowed it down yet, but have proved to my 
satisfaction that a polled operation works for at least a couple mpt 
cards.

-matt
Comment 8 Johan 2007-02-17 05:22:33 UTC
Ok, I tried "reboot -d" and get the same error.
I have tried it on FreeBSD-7-CURRENT, and I also tried it on 3 standard
Intel server platforms using different chipsets for the North-Bridges,
and a fourth platform we developed ourselves, which uses the same chipset
as one of the standard Intel boards, with the LSI-1064 onboard.

If you can try the LSI-1068, I think it would be the closest to the
1064.

One thing that is different though is the firmware that is running on
the LSI-1064.
It's using MPI Version=1.5.13.0

I noticed that the MPI version of the header files in FreeBSD is MPI
Version=1.5.12.0, if you look at the version in mpilib/mpi.h
Version 1.5.13.0 does not differ much.
I cannot remember if the header structures are the same in length, but I
do remember that the newer header files have one of the reserved values
in one of the structures in use now.
There are actually newer versions than MPI Version=1.5.13.0, but I have
not migrated to those yet; it will require a little more code change.

If it does not happen on the LSI-1068, then I can only guess that it's
something specific to the LSI-1064 or its firmware version.

Oh yes, before I forget, I will try a SAS drive on Monday and see if it
makes a difference.
I did plug it in and use it, but I cannot remember if I tried to do a
dump on it.

Thanks

Johan

Comment 9 Matt Jacob freebsd_committer freebsd_triage 2007-02-17 17:02:55 UTC
I've confirmed that reboot -d produces valid core dumps with the 
following mpt based hardware all on -current:

AMD64 8-way:
   LSI-1030 (u320)
   FC7202 (FC 2Gb)

SuperMicro 2xXeon/EM64T:
   LSI-1068 with 250GB Maxtor SATA drives

Sun X4100
   LSI-1064 with 70GB Fujitsu SAS drives
Comment 10 Johan 2007-02-17 17:31:15 UTC
Thanks for testing on the LSI-1064.
I'll try a SAS drive on Monday.
Do you know which firmware version is running on the Sun X4100 LSI-1064?

My feeling is that the SAS is going to behave the same as the SATA with
the MPI 1.5.13.0 firmware, but I could be wrong.

Here's hoping the SAS drive will be able to core dump properly ..

Thanks

Johan

Comment 11 Johan 2007-02-17 17:44:44 UTC
Hi Matthew.
I looked on Sun's website, and it looks like they are running 1.16.00
firmware.
LSI's FTP site has header files for MPI version 1.5.13.0, which contain:
#define MPI_HEADER_VERSION_UNIT             (0x0D)     <<--- 13
#define MPI_HEADER_VERSION_DEV              (0x00)
#define MPI_HEADER_VERSION_UNIT_MASK        (0xFF00)
#define MPI_HEADER_VERSION_UNIT_SHIFT       (8)
#define MPI_HEADER_VERSION_DEV_MASK         (0x00FF)
#define MPI_HEADER_VERSION_DEV_SHIFT        (0)
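
The masks and shifts in this excerpt suggest that the unit and development numbers share one 16-bit word. Here is a minimal sketch of that decoding, assuming the word is packed as `(unit << 8) | dev`. The `header_version_*` helper names are hypothetical and are not part of the LSI headers.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the header excerpt quoted above. */
#define MPI_HEADER_VERSION_UNIT             (0x0D)
#define MPI_HEADER_VERSION_DEV              (0x00)
#define MPI_HEADER_VERSION_UNIT_MASK        (0xFF00)
#define MPI_HEADER_VERSION_UNIT_SHIFT       (8)
#define MPI_HEADER_VERSION_DEV_MASK         (0x00FF)
#define MPI_HEADER_VERSION_DEV_SHIFT        (0)

/* Hypothetical helper: pack unit and dev numbers into one 16-bit word. */
static uint16_t
header_version_pack(unsigned unit, unsigned dev)
{
    /* Each field is assumed to fit in one byte. */
    assert(unit <= 0xFF && dev <= 0xFF);
    return (uint16_t)((unit << MPI_HEADER_VERSION_UNIT_SHIFT) |
        (dev << MPI_HEADER_VERSION_DEV_SHIFT));
}

/* Hypothetical helper: extract the unit number (13 for 1.5.13.0). */
static unsigned
header_version_unit(uint16_t version)
{
    return (version & MPI_HEADER_VERSION_UNIT_MASK) >>
        MPI_HEADER_VERSION_UNIT_SHIFT;
}

/* Hypothetical helper: extract the development number. */
static unsigned
header_version_dev(uint16_t version)
{
    return (version & MPI_HEADER_VERSION_DEV_MASK) >>
        MPI_HEADER_VERSION_DEV_SHIFT;
}
```

With the values above, the packed word would be 0x0D00, which decodes back to unit 13 and dev 0, matching the last two components of 1.5.13.0.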

Do you know the MPI version that gets reported via "dmesg" on the Sun
with the LSI-1064 ?

-----------------------
http://www.sun.com/servers/entry/x4100/downloads.jsp

Sun Fire X4100 M2 and Sun Fire X4200 M2 Servers Supplemental 1.1
Sun Fire X4100 M2 and X4200 M2 Servers Supplemental 1.1

    * BIOS 34 (0ABJX034)
          o Support for VMware ESX 3.0.1
    * ILOM Service Processor firmware 1.1.1
    * LSI MPT SAS firmware 1.16.00
    * LSI MPT BIOS 6.10.00B
    * Bootable Diagnostics CD version 2.2
    * Tools and Drivers CD version 1.1
    * Sun Installation Assistant CD version 2.0.127:
          o Support for RHEL 3 U8 AS/ES/WS (64-bit)
          o Support for SLES 10 (64-bit)

Thanks

Johan

Comment 12 Matt Jacob freebsd_committer freebsd_triage 2007-02-17 17:44:55 UTC
> Thanks for testing on the LSI-1064.
> I'll try a SAS drive on Monday.
> Do you know which firmware version is running on the Sun X4100 LSI-1064
> ?

1.5.9.0

>
> My feeling is that the SAS is going to behave the same as the SATA with
> the MPI 1.5.13.0 firmware, but I could be wrong.
>

We'll see. The FUSION architecture is supposed to allow for version 
skew, which is the whole point of saying which version you are to the 
other side. *shrug*
Comment 13 Matt Jacob freebsd_committer freebsd_triage 2007-02-17 17:48:01 UTC
For all I know, Sun does their own f/w schema. I haven't got a Sun boot 
instance for the sf4100 right now so I can't tell you, but the FreeBSD 
reports 1.5.9.0.

I'm less sure than you are that it's MPI version related.

Comment 14 Johan 2007-02-17 17:49:06 UTC
Ok, thanks.
1.5.9.0 is older than I thought ?
I will keep the option of an old firmware as plan B if the test with the
SAS drive fails.

Thanks

Johan

Comment 15 Johan 2007-02-17 17:52:09 UTC
I agree, I think the chance of it being the firmware is less than 5%,
but I also think that the chance of it being the SAS versus SATA is less
than 60% ... just a feeling ;-)  but those 2 are the only things that
are different ... that I can think of.
I'll take the latest snapshot of FreeBSD-7-CURRENT to test.

Thanks

Johan

Comment 16 Johan 2007-02-20 16:20:01 UTC
Hi Matthew.
I think you can close the PR.
Here is what seems to be happening:
All the hardware I tested before seems to have memory holes mapped by
the BIOS.
That means that when the core dump starts, it dumps the memory until it
gets to a memory hole (memory that is not mapped) and then returns with
an IO error.

So it seems to be related to the BIOS of platforms.

Thanks for all the testing.
I will have to find a way to skip the dumping of the memory holes now,
and dump only mapped pages. New learning curve ... Oh well.
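
If memory holes are indeed the cause, one way to avoid them is to walk a list of usable physical ranges and write only those. The following is a minimal sketch under that assumption, not FreeBSD's dump code: `struct phys_seg`, `dump_write_fn`, `dump_segments` and `count_bytes` are all hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one usable physical range: [start, end). */
struct phys_seg {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical callback that writes one chunk to the dump device.
 * Returns 0 on success or an error number. */
typedef int (*dump_write_fn)(uint64_t pa, uint64_t len, void *arg);

/* Write every usable range in chunks of at most `chunk` bytes.
 * Addresses between ranges (the holes) are never touched. */
static int
dump_segments(const struct phys_seg *segs, size_t nsegs, uint64_t chunk,
    dump_write_fn write_chunk, void *arg)
{
    assert(segs != NULL && write_chunk != NULL && chunk > 0);

    for (size_t i = 0; i < nsegs; i++) {
        uint64_t pa = segs[i].start;

        while (pa < segs[i].end) {
            uint64_t len = segs[i].end - pa;
            if (len > chunk)
                len = chunk;

            int error = write_chunk(pa, len, arg);
            if (error != 0)
                return error;   /* stop at the first I/O error */
            pa += len;
        }
    }
    return 0;
}

/* Example callback: only counts the bytes that would be written. */
static int
count_bytes(uint64_t pa, uint64_t len, void *arg)
{
    (void)pa;
    *(uint64_t *)arg += len;
    return 0;
}
```

Calling `dump_segments` with `count_bytes` and a running total is a cheap way to check how much would be written before pointing it at a real device.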


Thanks

Johan

Comment 17 Matt Jacob freebsd_committer freebsd_triage 2007-02-20 18:44:17 UTC
> Hi Matthew.
> I think you can close the PR.
> Here is what seems to be happening:
> All the hardware I tested before, seems to have memory holes mapped by
> the BIOS.
> That means that when the core dump starts, it dumps the memory until it
> gets to a memory hole (memory that is not mapped) and then returns with
> an IO error.
>
> So it seems to be related to the BIOS of platforms.
>
> Thanks for all the testing.
> I will have to find a way to skip the dumping of the memory holes now,
> and dump only mapped pages. New learning curve ... Oh well.
>

Hmm- I thought the FreeBSD dump code only dumped "pages of interest"?

In any case, I intend to do a bit more testing to make sure that dumps 
can occur under 'panic' conditions as opposed to just 'reboot -d'.
Comment 18 Johan 2007-02-20 18:47:22 UTC
I thought so too.
Well, the platform I tried it on last has no memory holes defined and it
works there.
But yes, if it dumps only valid mapped pages, then the problem should
not happen .. Wonder what the problem can be ?

Johan

Comment 19 Matt Jacob freebsd_committer freebsd_triage 2007-03-31 21:08:38 UTC
State Changed
From-To: open->closed

The dump entry point does work as well as any other dump entry point works.