| Summary: | Core dumps not working via (mpt) driver using LSI contoller | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Johan <johannes.kruger> |
| Component: | kern | Assignee: | Matt Jacob <mjacob> |
| Status: | Closed FIXED | ||
| Severity: | Affects Only Me | ||
| Priority: | Normal | ||
| Version: | Unspecified | ||
| Hardware: | Any | ||
| OS: | Any | ||
|
Description
Johan
2007-01-02 17:50:15 UTC
Responsible Changed From-To: freebsd-bugs->mjacob
Hey Matthew, can you look at the problem please?

To track down the problem (or at least localize it a bit), I tried the same thing on an Adaptec controller to rule out the CAM layer.
So here is the summary:
- The problem is in the (mpt) driver and not in the CAM layer.
- It exists in at least FreeBSD-6.X and FreeBSD-7-CURRENT (old and new
CAM layer)
--------------- snip --------------------
db> panic
cpuid = 0
Uptime: 21h29m12s
Physical memory: 501 MB
Dumping 75 MB:Aborting dump due to I/O error.
status == 0x58, scsi status == 0x0
** DUMP FAILED (ERROR 5) **
Automatic reboot in 15 seconds - press a key on the console to abort
--------------- snip --------------------
On execution of the panic command, the poll routine inside the (mpt) driver
does call the mpt_intr routine, as it should.
It enters it multiple times:
First a number of times at the point I marked with **** 1 ****
Then once at the point I marked with **** 2 ****
-------------- snippet of mpt_intr ----------------------------------
.
.
.
if ((reply_desc & MPI_ADDRESS_REPLY_A_BIT) != 0) {
    **** 2 **** THEN IT ENTERS HERE AND FAILS
    *****************************************
    .
    .
    .
} else {
    **** 1 **** ENTERS HERE A LOT WITH TYPE = MPI_CONTEXT_REPLY_TYPE_SCSI_INIT ********
    type = MPI_GET_CONTEXT_REPLY_TYPE(reply_desc);
    .
    .
    .
}
cb_index = MPT_CONTEXT_TO_CBI(ctxt_idx);
req_index = MPT_CONTEXT_TO_REQI(ctxt_idx);
if (req_index < MPT_MAX_REQUESTS(mpt)) {
    req = &mpt->request_pool[req_index];
} else {
    mpt_prt(mpt, "WARN: mpt_intr index == %d (reply_desc =="
        " 0x%x)\n", req_index, reply_desc);
}
free_rf = mpt_reply_handlers[cb_index](mpt, req,
    reply_desc, reply_frame);
if (reply_frame != NULL && free_rf) {
    mpt_free_reply(mpt, reply_baddr);
}
.
.
.
-------------- snippet of mpt_intr ----------------------------------
Johan
.
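The two branches marked in the snippet above can be illustrated in isolation. The following is a minimal sketch, not the driver's code: every `demo_` / `DEMO_` name, the bit positions, and the pool size are assumptions chosen for illustration, and the real definitions live in the mpt driver and the mpilib headers. It only shows how a 32-bit reply descriptor could be classified as an address reply (case 2) or a context reply (case 1), and how a callback index and a bounds-checked request index could be extracted from a context value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout for illustration only; not copied from the mpt driver. */
#define DEMO_ADDRESS_REPLY_A_BIT  0x80000000u  /* set: address reply (case 2) */
#define DEMO_REPLY_TYPE_MASK      0x60000000u  /* context reply type field */
#define DEMO_REPLY_TYPE_SHIFT     29
#define DEMO_CB_SHIFT             16           /* callback index sits above the request index */
#define DEMO_CB_MASK              0x1Fu
#define DEMO_REQI_MASK            0xFFFFu
#define DEMO_MAX_REQUESTS         256u         /* hypothetical request pool size */

struct demo_reply {
    int      is_address_reply;  /* 1 for case 2, 0 for case 1 */
    unsigned type;              /* context reply type (case 1 only) */
    unsigned cb_index;          /* which reply handler would be called */
    unsigned req_index;         /* slot in the request pool */
    int      req_valid;         /* 0 when req_index is out of range */
};

/* Classify one reply descriptor and fill in *out. */
static void
demo_decode(uint32_t reply_desc, struct demo_reply *out)
{
    assert(out != NULL);

    out->is_address_reply = 0;
    out->type = 0;
    out->cb_index = 0;
    out->req_index = 0;
    out->req_valid = 0;

    if ((reply_desc & DEMO_ADDRESS_REPLY_A_BIT) != 0) {
        /*
         * Case 2: the descriptor refers to a reply frame in memory.
         * A real driver would go on to read the context from that
         * frame; this sketch stops at the classification.
         */
        out->is_address_reply = 1;
        return;
    }

    /* Case 1: the descriptor itself carries the context. */
    out->type = (reply_desc & DEMO_REPLY_TYPE_MASK) >> DEMO_REPLY_TYPE_SHIFT;
    out->cb_index = (reply_desc >> DEMO_CB_SHIFT) & DEMO_CB_MASK;
    out->req_index = reply_desc & DEMO_REQI_MASK;
    out->req_valid = out->req_index < DEMO_MAX_REQUESTS;
}
```

Calling `demo_decode()` on each descriptor seen during a polled dump and logging the result is one way to see which branch a failing completion takes, under the stated assumptions about the bit layout.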
From: mjacob@freebsd.org
Sent: Friday, February 16, 2007 11:40 AM

Thanks for narrowing it down. I have this and the same issue for isp to tackle this or next week.

From: Johannes.Kruger@nokia.com

No problem, I'll let you know if I find anything.
Thanks
Johan

From: mjacob@freebsd.org
Sent: Friday, February 16, 2007 7:45 PM

Feb 16 16:25:37 obrien reboot: rebooted by mjacob
Feb 16 16:25:37 obrien syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop...done
Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining...0 0 done
All buffers synced.
Swap device label/obrien_swap removed.
Uptime: 12m28s
Physical memory: 5106 MB
Dumping 234 MB: 219 203 187 171 155 139 123 107 91 75 59 43 27 11
Dump complete

So, this works for a system booted off of MPT LSI-1030c (U320).

From: Johannes.Kruger@nokia.com

Hmmm, this is interesting.
(1) How did you initiate the dump, because the output looks a bit different?
Do you think it can be the specific hardware that has a problem?
and
(2) Do you have an LSI-1064 available
and
(3) Which version of FreeBSD did you use for the dump?
I tried FreeBSD 6.X and 7-CURRENT
I want to try the same version on the LSI-1064 I have here.
Thanks
Johan

From: mjacob@freebsd.org
Sent: Friday, February 16, 2007 11:51 PM

> Hmmm, this is interesting.
> (1) How did you initiate the dump, because the output looks a bit
> different?

reboot -d

> Do you think it can be the specific hardware that has a problem?

Likely.

> and
> (2) Do you have an LSI-1064 available

I have a 1068 card and a SF4100 (courtesy of a very helpful FreeBSD fan). The SF4100 also seems to have no trouble dumping to a SAS drive.

> and
> (3) Which version of FreeBSD did you use for the dump?
> I tried FreeBSD 6.X and 7-CURRENT

-current.

> I want to try the same version on the LSI-1064 I have here.

Yes. It's likely one or more of three things:

a) It's platform specific. That is, something about your platform is causing the MPT to fail or be failed.
b) It's HBA/Disk specific. That is, it's the fact that it's a SATA drive on a 1064.
c) It's invocation specific. That is, 'reboot -d' is very kind, while a panic, from a random point, is not.

I haven't really narrowed it down yet, but have proved to my satisfaction that a polled operation works for at least a couple mpt cards.

-matt

From: Johannes.Kruger@nokia.com

Ok, I tried "reboot -d" and get the same error.
I have tried it on FreeBSD-7-CURRENT, and I also tried it on 3 standard Intel server platforms using different chipsets for the north bridges, and a fourth platform we developed ourselves, which uses the same chipset as one of the standard Intel boards, with the LSI-1064 onboard.
If you can try the LSI-1068, I think it would be the closest to the 1064.
One thing that is different, though, is the firmware that is running on the LSI-1064. It's using MPI Version=1.5.13.0.
I noticed that the header files in FreeBSD are at MPI Version=1.5.12.0, if you look at the version in mpilib/mpi.h.
The two versions do not differ much. I cannot remember if the header structures are the same length, but I do remember that the newer header files now use one of the reserved values in one of the structures.
There are actually newer versions than MPI Version=1.5.13.0, but I have not migrated to them yet; that will require a little more code change.
If it does not happen on the LSI-1068, then I can only guess that it's something specific to the LSI-1064 or its firmware version.
Oh yes, before I forget, I will try a SAS drive on Monday and see if it makes a difference. I did plug one in and use it, but I cannot remember if I tried to do a dump on it.
Thanks
Johan

From: mjacob@freebsd.org
Sent: Saturday, February 17, 2007 12:03 PM

I've confirmed that reboot -d produces valid core dumps with the following mpt based hardware, all on -current:

AMD64 8-way: LSI-1030 (u320) FC7202 (FC 2Gb)
SuperMicro 2xXeon/EM64T: LSI-1068 with 250GB Maxtor SATA drives
Sun X4100 LSI-1064 with 70GB Fujitsu SAS drives

From: Johannes.Kruger@nokia.com
Sent: Saturday, February 17, 2007 12:31 PM

Thanks for testing on the LSI-1064.
I'll try a SAS drive on Monday.
Do you know which firmware version is running on the Sun X4100 LSI-1064?
My feeling is that the SAS is going to behave the same as the SATA with the MPI 1.5.13.0 firmware, but I could be wrong.
Here's hoping the SAS drive will be able to core dump properly ..
Thanks
Johan

From: Johannes.Kruger@nokia.com
Sent: Sat, 17 Feb 2007

Hi Matthew.
I looked on Sun's website, and it looks like they are running 1.16.00 firmware.
LSI's FTP site has header files for MPI version 1.5.13.0, which contain:

#define MPI_HEADER_VERSION_UNIT (0x0D) <<--- 13
#define MPI_HEADER_VERSION_DEV (0x00)
#define MPI_HEADER_VERSION_UNIT_MASK (0xFF00)
#define MPI_HEADER_VERSION_UNIT_SHIFT (8)
#define MPI_HEADER_VERSION_DEV_MASK (0x00FF)
#define MPI_HEADER_VERSION_DEV_SHIFT (0)

Do you know the MPI version that gets reported via "dmesg" on the Sun with the LSI-1064?

-----------------------
http://www.sun.com/servers/entry/x4100/downloads.jsp

Sun Fire X4100 M2 and Sun Fire X4200 M2 Servers Supplemental 1.1

Sun Fire X4100 M2 and X4200 M2 Servers Supplemental 1.1
* BIOS 34 (0ABJX034)
  o Support for VMware ESX 3.0.1
* ILOM Service Processor firmware 1.1.1
* LSI MPT SAS firmware 1.16.00
* LSI MPT BIOS 6.10.00B
* Bootable Diagnostics CD version 2.2
* Tools and Drivers CD version 1.1
* Sun Installation Assistant CD version 2.0.127:
  o Support for RHEL 3 U8 AS/ES/WS (64-bit)
  o Support for SLES 10 (64-bit)

Thanks
Johan
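As a small illustration of the version defines quoted in this message, the sketch below packs and unpacks a 16-bit header version word using exactly those masks and shifts. The `MPI_HEADER_VERSION_*` values are the ones quoted in the thread for the 1.5.13.0 headers; the `demo_` helper functions are hypothetical and exist only for this example. It does not show how the driver or the firmware actually compares versions.

```c
#include <assert.h>
#include <stdint.h>

/* Values as quoted in the thread from the MPI 1.5.13.0 header files. */
#define MPI_HEADER_VERSION_UNIT        (0x0D)   /* 13 */
#define MPI_HEADER_VERSION_DEV         (0x00)
#define MPI_HEADER_VERSION_UNIT_MASK   (0xFF00)
#define MPI_HEADER_VERSION_UNIT_SHIFT  (8)
#define MPI_HEADER_VERSION_DEV_MASK    (0x00FF)
#define MPI_HEADER_VERSION_DEV_SHIFT   (0)

/* Hypothetical helper: combine unit and dev into one 16-bit word. */
static uint16_t
demo_pack_header_version(uint8_t unit, uint8_t dev)
{
    return (uint16_t)(((unsigned)unit << MPI_HEADER_VERSION_UNIT_SHIFT) |
        ((unsigned)dev << MPI_HEADER_VERSION_DEV_SHIFT));
}

/* Hypothetical helper: extract the unit field (13 for 1.5.13.0). */
static uint8_t
demo_header_unit(uint16_t version)
{
    return (uint8_t)((version & MPI_HEADER_VERSION_UNIT_MASK) >>
        MPI_HEADER_VERSION_UNIT_SHIFT);
}

/* Hypothetical helper: extract the dev field (0 for 1.5.13.0). */
static uint8_t
demo_header_dev(uint16_t version)
{
    return (uint8_t)((version & MPI_HEADER_VERSION_DEV_MASK) >>
        MPI_HEADER_VERSION_DEV_SHIFT);
}

/* Sanity check: packing and then unpacking must round-trip. */
static void
demo_check_roundtrip(uint8_t unit, uint8_t dev)
{
    uint16_t v = demo_pack_header_version(unit, dev);

    assert(demo_header_unit(v) == unit);
    assert(demo_header_dev(v) == dev);
}
```

With the quoted values, `demo_pack_header_version(MPI_HEADER_VERSION_UNIT, MPI_HEADER_VERSION_DEV)` yields `0x0D00`, so the unit field is what distinguishes a x.y.13.0 header set from a x.y.12.0 one.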
From: mjacob@freebsd.org
Sent: Saturday, February 17, 2007 12:45 PM

> Thanks for testing on the LSI-1064.
> I'll try a SAS drive on Monday.
> Do you know which firmware version is running on the Sun X4100 LSI-1064
> ?

1.5.9.0

> My feeling is that the SAS is going to behave the same as the SATA with
> the MPI 1.5.13.0 firmware, but I could be wrong.

We'll see. The FUSION architecture is supposed to allow for version skew, which is the whole point of saying which version you are to the other side. *shrug*

From: mjacob@freebsd.org
Sent: Saturday, February 17, 2007 12:48 PM

For all I know, Sun does their own f/w schema. I haven't got a Sun boot instance for the sf4100 right now so I can't tell you, but the FreeBSD reports 1.5.9.0.

I'm less sure than you are that it's MPI version related.

From: Johannes.Kruger@nokia.com

Ok, thanks. 1.5.9.0 is older than I thought?
I will keep the option of an old firmware as plan B if the test with the SAS drive fails.
Thanks
Johan

From: Johannes.Kruger@nokia.com

I agree, I think the chance of it being the firmware is less than 5%, but I also think that the chance of it being the SAS versus SATA is less than 60% ... just a feeling ;-) but those 2 are the only things that are different ... that I can think of.
I'll take the latest snapshot of FreeBSD-7-CURRENT to test.
Thanks
Johan

From: Johannes.Kruger@nokia.com

Hi Matthew.
I think you can close the PR.
Here is what seems to be happening:
All the hardware I tested before seems to have memory holes mapped by the BIOS.
That means that when the core dump starts, it dumps the memory until it gets to a memory hole (memory that is not mapped) and then returns with an I/O error.

So it seems to be related to the BIOS of the platforms.

Thanks for all the testing.
I will have to find a way to skip the dumping of the memory holes now, and dump only mapped pages. New learning curve ... Oh well.
Thanks
Johan

From: mjacob@freebsd.org
Sent: Tuesday, February 20, 2007 1:44 PM

> I will have to find a way to skip the dumping of the memory holes now,
> and dump only mapped pages. New learning curve ... Oh well.

Hmm- I thought the FreeBSD dump code only dumped "pages of interest"?

In any case, I intend to do a bit more testing to make sure that dumps can occur under 'panic' conditions as opposed to just 'reboot -d'.

From: Johannes.Kruger@nokia.com

I thought so too.
Well, the platform I tried it on last has no memory holes defined and it works there.
But yes, if it dumps only valid mapped pages, then the problem should not happen ..
Wonder what the problem can be?
Johan

State Changed From-To: open->closed
The dump entry point does work as well as any other dump entry point works. |
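The memory-hole theory raised near the end of the thread can be illustrated with a small standalone sketch. It is not FreeBSD's dump code, and the thread itself does not establish that memory holes were the actual cause. All names (`demo_range`, `demo_dump_ranges`, `demo_fake_write`), the page size, and the simulated hole at 0xA0000-0x100000 are assumptions made up for the example. The idea shown is only this: walking physical memory linearly hits an error at the first unmapped region, while walking a list of known-mapped ranges skips the hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE   4096u
#define DEMO_EIO         5          /* mirrors "ERROR 5" in the console output */

/* Simulated unmapped region, chosen arbitrarily for this example. */
#define DEMO_HOLE_START  0xA0000u
#define DEMO_HOLE_END    0x100000u

/* Half-open physical address range [start, end), page aligned. */
struct demo_range {
    uint64_t start;
    uint64_t end;
};

typedef int (*demo_write_fn)(uint64_t pa, size_t len, void *arg);

/* Fake writer: fails when asked to write anything overlapping the hole. */
static int
demo_fake_write(uint64_t pa, size_t len, void *arg)
{
    uint64_t *written = arg;

    if (pa < DEMO_HOLE_END && pa + len > DEMO_HOLE_START)
        return DEMO_EIO;
    *written += len;
    return 0;
}

/*
 * Dump every page in the given ranges, stopping at the first error.
 * On success, *pages_out (if not NULL) receives the page count.
 */
static int
demo_dump_ranges(const struct demo_range *ranges, size_t nranges,
    demo_write_fn emit, void *arg, uint64_t *pages_out)
{
    uint64_t pages = 0;

    for (size_t i = 0; i < nranges; i++) {
        assert(ranges[i].start <= ranges[i].end);
        assert(ranges[i].start % DEMO_PAGE_SIZE == 0);
        assert(ranges[i].end % DEMO_PAGE_SIZE == 0);

        for (uint64_t pa = ranges[i].start; pa < ranges[i].end;
            pa += DEMO_PAGE_SIZE) {
            int error = emit(pa, DEMO_PAGE_SIZE, arg);

            if (error != 0)
                return error;
            pages++;
        }
    }
    if (pages_out != NULL)
        *pages_out = pages;
    return 0;
}
```

Passing a single range that spans the hole makes `demo_dump_ranges()` return `DEMO_EIO` partway through, while passing the two ranges on either side of the hole lets it complete. Whether the real dump path behaves this way on the affected platforms is exactly the open question in the last messages.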