Bug 157534 - [mpt] freeze when disk is removed/died from geom_mirror/zfs raid
Summary: [mpt] freeze when disk is removed/died from geom_mirror/zfs raid
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 8.2-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-02 18:50 UTC by Eugene M. Zheganin
Modified: 2011-08-28 13:18 UTC
0 users

See Also:


Attachments

Description Eugene M. Zheganin 2011-06-02 18:50:07 UTC
I'm using geom_mirror/zfs on IBM System X 3250 servers, which have an LSI 1064e controller.

When a drive dies or is removed from the server, the system freezes on disk operations until it is rebooted or until the same (or a new) drive is inserted. After that the system runs normally.

This is reproducible, and I have encountered it on both i386 and amd64.
Upgrading the controller firmware does not help (I downloaded and upgraded to the latest version available from the IBM support site).

ps in the debugger shows a large number of processes in D state.

How-To-Repeat: Get an IBM System X server. Install FreeBSD onto a geom_mirror or zfs mirrored pool. Pull out one drive. Issue some disk I/O related command.
Comment 1 Eugene M. Zheganin 2011-06-06 09:20:09 UTC
Fix: use integrated mirroring (IR) on the controller. With IR, everything 
is fine when a disk is pulled out.
Comment 2 Eugene M. Zheganin 2011-06-07 14:00:19 UTC
This seems to be more of a zfs-related freeze. geom_mirror(4) seems to be 
able to detect that one of the providers is gone in about a minute.

When dealing with a zfs mirrored pool, the console is constantly updated 
with these messages:

[...]
mpt0: completing timedout/aborted req 0xffffff80002be9c0:2484
mpt0: completing timedout/aborted req 0xffffff80002be6f0:2483
mpt0: completing timedout/aborted req 0xffffff80002be810:2482
mpt0: completing timedout/aborted req 0xffffff80002bde80:2481
mpt0: abort of req 0xffffff80002bde80:0 completed
mpt0: request 0xffffff80002be660:2486 timed out for ccb 0xffffff00035ea800 (req>ccb 0xffffff00035ea800)
mpt0: request 0xffffff80002be930:2487 timed out for ccb 0xffffff00035da000 (req>ccb 0xffffff00035da000)
mpt0: completing timedout/aborted req 0xffffff80002aff30:2582
mpt0: abort of req 0xffffff80002aff30:0 completed2486 function 0
mpt0: request 0xffffff80002afb40:2584 timed out for ccb 0xffffff00035c0000 (req>ccb 0xffffff00035c0000)
mpt0: attempting to abort req 0xffffff80002afb40:2584 function 0
mpt0: completing timedout/aborted req 0xffffff80002afb40:2584
[...]

This is 100% reproducible.
Unfortunately, I have about 11 of these servers. I can provide root ssh 
and local console access (via IP KVM) to one of them, along with my help 
in any destructive testing.
Comment 3 Eugene M. Zheganin 2011-06-07 14:45:52 UTC
Having spent enough time on a panicbox, I can say that an 8.2-RELEASE zfs 
mirror pool detects that one of the disks on the LSI 1064e controller is 
gone after about an hour, without 'camcontrol rescan' or other commands 
being issued.
Comment 4 Eugene M. Zheganin 2011-06-08 12:51:55 UTC
The problem persists on today's STABLE with ZFS v28.
Comment 5 emz 2011-06-23 12:42:12 UTC
The thing is that after disk removal the controller sends two types of 
events: 0x12 and 0x16.
According to mpi_ioc.h, the first is MPI_EVENT_SAS_PHY_LINK_STATUS 
and the second is MPI_EVENT_SAS_DISCOVERY.

Furthermore, according to the kernel messages on the console during 
drive removal/attachment and the code in mpt_cam.c, mpt_cam_event() does 
nothing to handle these events (both are handled by the 'default:' 
section). I think this leads to the freeze.

Comparing with the Linux mpt code, I can say that the Linux kernel does 
nothing about MPI_EVENT_SAS_PHY_LINK_STATUS, but it definitely does 
something about MPI_EVENT_SAS_DISCOVERY (which my skills are too low to 
understand).

Anyway, my skills are too low to correct this.
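
The fall-through described above can be illustrated with a small userland model. This is only a sketch and not the code in mpt_cam.c: the two event values are the ones quoted in this comment, while `classify_event()`, `enum disposition` and the placeholder `EV_HANDLED_EXAMPLE` are hypothetical names introduced purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Event codes quoted in this comment (from mpi_ioc.h). */
#define MPI_EVENT_SAS_PHY_LINK_STATUS	0x12
#define MPI_EVENT_SAS_DISCOVERY		0x16

/* Hypothetical stand-in for any event that has an explicit case. */
#define EV_HANDLED_EXAMPLE		0x0E

/* Hypothetical result type: what the model does with an event. */
enum disposition {
	DISP_HANDLED,	/* an explicit case exists */
	DISP_UNHANDLED	/* fell through to default: nothing reaches CAM */
};

/*
 * Model of an event switch in which only explicitly listed events are
 * acted upon; every other event ends up in the default branch.
 */
static enum disposition
classify_event(uint32_t event)
{
	switch (event) {
	case EV_HANDLED_EXAMPLE:
		return (DISP_HANDLED);
	default:
		return (DISP_UNHANDLED);
	}
}
```

In this model, `assert(classify_event(MPI_EVENT_SAS_DISCOVERY) == DISP_UNHANDLED)` holds, which mirrors the observation that neither event results in any notification to the upper layers.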

IPKVM screenshots of drive removal and insertion (shot 1 - removal, shot 
3 - insertion):
http://unix.zhegan.in/files/mpt_cam_event01.jpeg
http://unix.zhegan.in/files/mpt_cam_event02.jpeg
http://unix.zhegan.in/files/mpt_cam_event03.jpeg
Comment 6 emz 2011-07-12 09:06:38 UTC
Reflashing the IT firmware doesn't help either - controller behaves 
identically.
Comment 7 marius 2011-07-24 13:33:37 UTC
Could you please give the following patch a try?
http://people.freebsd.org/~marius/mpt_MPI_EVENT_QUEUE_FULL_non_member_MPI_EVENT_SAS_DEVICE_STATUS_CHANGE.diff
It should make mpt(4) report the SAS device discovery and departure events
to cam(4) but I don't know whether higher levels like geom_mirror(4) or
zfs(4) will handle these properly.

Marius
Comment 8 Eugene M. Zheganin 2011-07-25 15:44:24 UTC
I tested it on today's STABLE (it looks like the patch is against CURRENT, 
but I managed to apply it to STABLE; there were just some extra spaces in 
a couple of places). It seems to be working (do you need screenshots?). At 
least I can say that the freeze now lasts around 5 minutes 
(unfortunately, I didn't measure the exact amount of time), compared to 
one hour before the patch. I hope I will be able to reduce it even more 
by tuning kern.cam.da.default_timeout.
Thanks for the great work.
Comment 9 marius 2011-07-25 16:00:40 UTC
On Mon, Jul 25, 2011 at 08:44:24PM +0600, Eugene M. Zheganin wrote:
> I tested it on today's STABLE (it looks like the patch is against CURRENT, 
> but I managed to apply it to STABLE; there were just some extra spaces in 
> a couple of places). It seems to be working (do you need screenshots?). At 
> least I can say that the freeze now lasts around 5 minutes 
> (unfortunately, I didn't measure the exact amount of time), compared to 
> one hour before the patch. I hope I will be able to reduce it even more 
> by tuning kern.cam.da.default_timeout.
> Thanks for the great work.

Do you get log entries regarding the removal of the disk in question
in a timely manner, i.e. within seconds rather than minutes, like the
following:
Jul 24 21:13:49 flak kernel: (da0:mpt0:0:0:0): lost device
Jul 24 21:13:49 flak kernel: (da0:mpt0:0:0:0): removing device entry
and is it automatically detected as daX again when re-plugged?

Marius
Comment 10 Eugene M. Zheganin 2011-07-25 16:05:14 UTC
Exactly, I received these messages within seconds of the disk removal, 
then I got a freeze of around 4-5 minutes (during which I thought that 
this had not succeeded, and went to my office).
When I got there I saw messages like 'Invalidating pack' and 'Removing 
device entry', and the system was unfrozen. I didn't test the device 
reinsertion, but I can if you like.
Comment 11 marius 2011-07-25 16:30:53 UTC
On Mon, Jul 25, 2011 at 09:05:14PM +0600, Eugene M. Zheganin wrote:
> Exactly, I received these messages in seconds after disk removal, then I 
> got freeze around 4-5 minutes (during which I thought that this was no 
> success, and went to my office).

Okay, that's basically all that can be done from the SIM, i.e. mpt(4),
driver point of view. Maybe there's also room for improvement in da(4).
The following patch may also help here, but I'm not sure about that:
http://people.freebsd.org/~mav/periph_noretry.patch

> When I came there I saw messages like 'Invalidating pack' and 'Removing 
> device entry' and the system was unfrozen. I didn't test the device 
> reinsertion; but I can, if you like.

Yes, it would be great if you could test that.

Marius
Comment 12 emz 2011-07-27 12:09:16 UTC
Reinsertion is also working just fine. The drive was detected within 
seconds of being pushed in, without any additional interaction from me.

Thanks again for the great work.
Comment 13 marius 2011-07-27 13:58:04 UTC
On Wed, Jul 27, 2011 at 05:09:16PM +0600, Eugene M. Zheganin wrote:
> Reinsertion is also working just fine. The drive was detected within 
> seconds of being pushed in, without any additional interaction from me.
> 
> Thanks again for the great work.

Ok, thanks for testing. I've sent an approval request to the Release
Engineers regarding the inclusion of these changes in 9.0.

Marius
Comment 14 dfilter service freebsd_committer freebsd_triage 2011-07-29 19:38:45 UTC
Author: marius
Date: Fri Jul 29 18:38:31 2011
New Revision: 224494
URL: http://svn.freebsd.org/changeset/base/224494

Log:
  - Send the RELSIM_ADJUST_OPENINGS in response to a MPI_EVENT_QUEUE_FULL using
    the right SIM in case the HBA is RAID-capable but the target in question is
    not a hot spare or member of a RAID volume.
  - Report the loss and addition of SAS and SATA targets detected via PHY link
    status changes and signalled by MPI_EVENT_SAS_DEVICE_STATUS_CHANGE to cam(4)
    as lost devices and trigger rescans as appropriate. Without this it can take
    quite some time until a lost device actually is no longer tried to be used,
    if it ever stops. [1]
  - Handle MPI_EVENT_IR2, MPI_EVENT_LOG_ENTRY_ADDED, MPI_EVENT_SAS_DISCOVERY
    and MPI_EVENT_SAS_PHY_LINK_STATUS silently as these serve no additional
    purpose beyond adding cryptic entries to logs.
  
  Thanks to Hans-Joerg Sirtl for providing one of the HBAs these changes were
  developed with and RIP to the mainboard that didn't survive testing them.
  
  PR:		157534 [1]
  Approved by:	re (kib)
  MFC after:	2 weeks

Modified:
  head/sys/dev/mpt/mpt_cam.c
  head/sys/dev/mpt/mpt_raid.c
  head/sys/dev/mpt/mpt_raid.h

Modified: head/sys/dev/mpt/mpt_cam.c
==============================================================================
--- head/sys/dev/mpt/mpt_cam.c	Fri Jul 29 18:35:10 2011	(r224493)
+++ head/sys/dev/mpt/mpt_cam.c	Fri Jul 29 18:38:31 2011	(r224494)
@@ -2538,7 +2538,8 @@ mpt_cam_event(struct mpt_softc *mpt, req
 		pqf->CurrentDepth = le16toh(pqf->CurrentDepth);
 		mpt_prt(mpt, "QUEUE FULL EVENT: Bus 0x%02x Target 0x%02x Depth "
 		    "%d\n", pqf->Bus, pqf->TargetID, pqf->CurrentDepth);
-		if (mpt->phydisk_sim) {
+		if (mpt->phydisk_sim && mpt_is_raid_member(mpt,
+		    pqf->TargetID) != 0) {
 			sim = mpt->phydisk_sim;
 		} else {
 			sim = mpt->sim;
@@ -2570,9 +2571,72 @@ mpt_cam_event(struct mpt_softc *mpt, req
 		mpt_prt(mpt, "IR resync update %d completed\n",
 		    (data0 >> 16) & 0xff);
 		break;
+	case MPI_EVENT_SAS_DEVICE_STATUS_CHANGE:
+	{
+		union ccb *ccb;
+		struct cam_sim *sim;
+		struct cam_path *tmppath;
+		PTR_EVENT_DATA_SAS_DEVICE_STATUS_CHANGE psdsc;
+
+		psdsc = (PTR_EVENT_DATA_SAS_DEVICE_STATUS_CHANGE)msg->Data;
+		if (mpt->phydisk_sim && mpt_is_raid_member(mpt,
+		    psdsc->TargetID) != 0)
+			sim = mpt->phydisk_sim;
+		else
+			sim = mpt->sim;
+		switch(psdsc->ReasonCode) {
+		case MPI_EVENT_SAS_DEV_STAT_RC_ADDED:
+			MPTLOCK_2_CAMLOCK(mpt);
+			ccb = xpt_alloc_ccb_nowait();
+			if (ccb == NULL) {
+				mpt_prt(mpt,
+				    "unable to alloc CCB for rescan\n");
+				CAMLOCK_2_MPTLOCK(mpt);
+				break;
+			}
+			if (xpt_create_path(&ccb->ccb_h.path, xpt_periph,
+			    cam_sim_path(sim), psdsc->TargetID,
+			    CAM_LUN_WILDCARD) != CAM_REQ_CMP) {
+				CAMLOCK_2_MPTLOCK(mpt);
+				mpt_prt(mpt,
+				    "unable to create path for rescan\n");
+				xpt_free_ccb(ccb);
+				break;
+			}
+			xpt_rescan(ccb);
+			CAMLOCK_2_MPTLOCK(mpt);
+			break;
+		case MPI_EVENT_SAS_DEV_STAT_RC_NOT_RESPONDING:
+			MPTLOCK_2_CAMLOCK(mpt);
+			if (xpt_create_path(&tmppath, NULL, cam_sim_path(sim),
+			    psdsc->TargetID, CAM_LUN_WILDCARD) !=
+			    CAM_REQ_CMP) {
+				mpt_prt(mpt,
+				    "unable to create path for async event");
+				CAMLOCK_2_MPTLOCK(mpt);
+				break;
+			}
+			xpt_async(AC_LOST_DEVICE, tmppath, NULL);
+			xpt_free_path(tmppath);
+			CAMLOCK_2_MPTLOCK(mpt);
+			break;
+		case MPI_EVENT_SAS_DEV_STAT_RC_INTERNAL_DEVICE_RESET:
+			break;
+		default:
+			mpt_lprt(mpt, MPT_PRT_WARN,
+			    "SAS device status change: Bus: 0x%02x TargetID: "
+			    "0x%02x ReasonCode: 0x%02x\n", psdsc->TargetID,
+			    psdsc->Bus, psdsc->ReasonCode);
+			break;
+		}
+		break;
+	}
 	case MPI_EVENT_EVENT_CHANGE:
 	case MPI_EVENT_INTEGRATED_RAID:
-	case MPI_EVENT_SAS_DEVICE_STATUS_CHANGE:
+	case MPI_EVENT_IR2:
+	case MPI_EVENT_LOG_ENTRY_ADDED:
+	case MPI_EVENT_SAS_DISCOVERY:
+	case MPI_EVENT_SAS_PHY_LINK_STATUS:
 	case MPI_EVENT_SAS_SES:
 		break;
 	default:

Modified: head/sys/dev/mpt/mpt_raid.c
==============================================================================
--- head/sys/dev/mpt/mpt_raid.c	Fri Jul 29 18:35:10 2011	(r224493)
+++ head/sys/dev/mpt/mpt_raid.c	Fri Jul 29 18:38:31 2011	(r224494)
@@ -812,6 +812,25 @@ mpt_map_physdisk(struct mpt_softc *mpt, 
 
 /* XXX Ignores that there may be multiple busses/IOCs involved. */
 int
+mpt_is_raid_member(struct mpt_softc *mpt, target_id_t tgt)
+{
+	struct mpt_raid_disk *mpt_disk;
+	int i;
+
+	if (mpt->ioc_page2 == NULL || mpt->ioc_page2->MaxPhysDisks == 0)
+		return (0);
+	for (i = 0; i < mpt->ioc_page2->MaxPhysDisks; i++) {
+		mpt_disk = &mpt->raid_disks[i];
+		if ((mpt_disk->flags & MPT_RDF_ACTIVE) != 0 &&
+		    mpt_disk->config_page.PhysDiskID == tgt)
+			return (1);
+	}
+	return (0);
+	
+}
+
+/* XXX Ignores that there may be multiple busses/IOCs involved. */
+int
 mpt_is_raid_volume(struct mpt_softc *mpt, target_id_t tgt)
 {
 	CONFIG_PAGE_IOC_2_RAID_VOL *ioc_vol;

Modified: head/sys/dev/mpt/mpt_raid.h
==============================================================================
--- head/sys/dev/mpt/mpt_raid.h	Fri Jul 29 18:35:10 2011	(r224493)
+++ head/sys/dev/mpt/mpt_raid.h	Fri Jul 29 18:38:31 2011	(r224494)
@@ -54,6 +54,7 @@ typedef enum {
 } mpt_raid_mwce_t;
 
 cam_status mpt_map_physdisk(struct mpt_softc *, union ccb *, target_id_t *);
+int mpt_is_raid_member(struct mpt_softc *, target_id_t);
 int mpt_is_raid_volume(struct mpt_softc *, target_id_t);
 #if	0
 cam_status
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
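The SIM selection introduced in r224494 above (use the physical-disk SIM only when the target is an active RAID member) can be modeled in userland roughly as follows. This is a sketch under stated assumptions, not the driver code: `struct model_softc`, `struct model_raid_disk`, `MODEL_RDF_ACTIVE`, `model_is_raid_member()` and `model_select_sim()` are hypothetical, simplified stand-ins for the corresponding mpt(4) structures and functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of the driver's MPT_RDF_ACTIVE flag. */
#define MODEL_RDF_ACTIVE	0x01

/* Hypothetical, simplified stand-ins for the driver structures. */
struct model_raid_disk {
	uint32_t flags;
	uint8_t phys_disk_id;
};

struct model_softc {
	const struct model_raid_disk *raid_disks; /* NULL if no RAID info */
	size_t max_phys_disks;
	int has_phydisk_sim;	/* nonzero if a physical-disk SIM exists */
};

enum model_sim { SIM_PRIMARY, SIM_PHYDISK };

/* Returns 1 if tgt is an active RAID member, 0 otherwise. */
static int
model_is_raid_member(const struct model_softc *sc, uint8_t tgt)
{
	size_t i;

	assert(sc != NULL);
	if (sc->raid_disks == NULL || sc->max_phys_disks == 0)
		return (0);
	for (i = 0; i < sc->max_phys_disks; i++) {
		if ((sc->raid_disks[i].flags & MODEL_RDF_ACTIVE) != 0 &&
		    sc->raid_disks[i].phys_disk_id == tgt)
			return (1);
	}
	return (0);
}

/* Pick the SIM on which an event for tgt should be reported. */
static enum model_sim
model_select_sim(const struct model_softc *sc, uint8_t tgt)
{
	if (sc->has_phydisk_sim && model_is_raid_member(sc, tgt) != 0)
		return (SIM_PHYDISK);
	return (SIM_PRIMARY);
}
```

The point of the extra membership check is that on a RAID-capable HBA a plain pass-through disk, such as a geom_mirror or zfs member, is still attached to the primary SIM, so events for it should be reported there.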
Comment 15 dfilter service freebsd_committer freebsd_triage 2011-08-13 13:37:38 UTC
Author: marius
Date: Sat Aug 13 12:37:22 2011
New Revision: 224820
URL: http://svn.freebsd.org/changeset/base/224820

Log:
  MFC: r224494, r224761
  
  - Send the RELSIM_ADJUST_OPENINGS in response to a MPI_EVENT_QUEUE_FULL using
    the right SIM in case the HBA is RAID-capable but the target in question is
    not a hot spare or member of a RAID volume.
  - Report the loss and addition of SAS and SATA targets detected via PHY link
    status changes and signalled by MPI_EVENT_SAS_DEVICE_STATUS_CHANGE to cam(4)
    as lost devices and trigger rescans as appropriate. Without this it can take
    quite some time until a lost device actually is no longer tried to be used,
    if it ever stops. [1]
  - Handle MPI_EVENT_IR2, MPI_EVENT_LOG_ENTRY_ADDED, MPI_EVENT_SAS_DISCOVERY
    and MPI_EVENT_SAS_PHY_LINK_STATUS silently as these serve no additional
    purpose beyond adding cryptic entries to logs.
  - Add a warning for MPI_EVENT_SAS_DISCOVERY_ERROR events, which can help
    identifying broken disks. [2]
  
  Thanks to Hans-Joerg Sirtl for providing one of the HBAs these changes were
  developed with and RIP to the mainboard that didn't survive testing them.
  
  PR:		157534 [1]
  Submitted by:	Andrew Boyer [2]

Modified:
  stable/8/sys/dev/mpt/mpilib/mpi_ioc.h
  stable/8/sys/dev/mpt/mpt_cam.c
  stable/8/sys/dev/mpt/mpt_raid.c
  stable/8/sys/dev/mpt/mpt_raid.h
Directory Properties:
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)

Modified: stable/8/sys/dev/mpt/mpilib/mpi_ioc.h
==============================================================================
--- stable/8/sys/dev/mpt/mpilib/mpi_ioc.h	Sat Aug 13 12:33:41 2011	(r224819)
+++ stable/8/sys/dev/mpt/mpilib/mpi_ioc.h	Sat Aug 13 12:37:22 2011	(r224820)
@@ -33,7 +33,7 @@
  *          Title:  MPI IOC, Port, Event, FW Download, and FW Upload messages
  *  Creation Date:  August 11, 2000
  *
- *    mpi_ioc.h Version:  01.05.14
+ *    mpi_ioc.h Version:  01.05.16
  *
  *  Version History
  *  ---------------
@@ -140,6 +140,16 @@
  *                      added _MULTI_PORT_DOMAIN.
  *  05-24-07  01.05.14  Added Common Boot Block type to FWDownload Request.
  *                      Added Common Boot Block type to FWUpload Request.
+ *  08-07-07  01.05.15  Added MPI_EVENT_SAS_INIT_RC_REMOVED define.
+ *                      Added MPI_EVENT_IR2_RC_DUAL_PORT_ADDED and
+ *                      MPI_EVENT_IR2_RC_DUAL_PORT_REMOVED for IR2 event data.
+ *                      Added SASAddress field to SAS Initiator Device Table
+ *                      Overflow event data structure.
+ *  03-28-08  01.05.16  Added two new ReasonCode values to SAS Device Status
+ *                      Change Event data to indicate completion of internally
+ *                      generated task management.
+ *                      Added MPI_EVENT_DSCVRY_ERR_DS_SATA_INIT_FAILURE define.
+ *                      Added MPI_EVENT_SAS_INIT_RC_INACCESSIBLE define.
  *  --------------------------------------------------------------------------
  */
 
@@ -639,6 +649,8 @@ typedef struct _EVENT_DATA_SAS_DEVICE_ST
 #define MPI_EVENT_SAS_DEV_STAT_RC_CLEAR_TASK_SET_INTERNAL   (0x0B)
 #define MPI_EVENT_SAS_DEV_STAT_RC_QUERY_TASK_INTERNAL       (0x0C)
 #define MPI_EVENT_SAS_DEV_STAT_RC_ASYNC_NOTIFICATION        (0x0D)
+#define MPI_EVENT_SAS_DEV_STAT_RC_CMPL_INTERNAL_DEV_RESET   (0x0E)
+#define MPI_EVENT_SAS_DEV_STAT_RC_CMPL_TASK_ABORT_INTERNAL  (0x0F)
 
 
 /* SCSI Event data for Queue Full event */
@@ -735,6 +747,8 @@ typedef struct _MPI_EVENT_DATA_IR2
 #define MPI_EVENT_IR2_RC_PD_REMOVED                 (0x05)
 #define MPI_EVENT_IR2_RC_FOREIGN_CFG_DETECTED       (0x06)
 #define MPI_EVENT_IR2_RC_REBUILD_MEDIUM_ERROR       (0x07)
+#define MPI_EVENT_IR2_RC_DUAL_PORT_ADDED            (0x08)
+#define MPI_EVENT_IR2_RC_DUAL_PORT_REMOVED          (0x09)
 
 /* defines for logical disk states */
 #define MPI_LD_STATE_OPTIMAL                        (0x00)
@@ -894,6 +908,7 @@ typedef struct _EVENT_DATA_DISCOVERY_ERR
 #define MPI_EVENT_DSCVRY_ERR_DS_UNSUPPORTED_DEVICE          (0x00000800)
 #define MPI_EVENT_DSCVRY_ERR_DS_MAX_SATA_TARGETS            (0x00001000)
 #define MPI_EVENT_DSCVRY_ERR_DS_MULTI_PORT_DOMAIN           (0x00002000)
+#define MPI_EVENT_DSCVRY_ERR_DS_SATA_INIT_FAILURE           (0x00004000)
 
 /* SAS SMP Error Event data */
 
@@ -929,6 +944,8 @@ typedef struct _EVENT_DATA_SAS_INIT_DEV_
 
 /* defines for the ReasonCode field of the SAS Initiator Device Status Change event */
 #define MPI_EVENT_SAS_INIT_RC_ADDED                 (0x01)
+#define MPI_EVENT_SAS_INIT_RC_REMOVED               (0x02)
+#define MPI_EVENT_SAS_INIT_RC_INACCESSIBLE          (0x03)
 
 /* SAS Initiator Device Table Overflow Event data */
 
@@ -937,6 +954,7 @@ typedef struct _EVENT_DATA_SAS_INIT_TABL
     U8                      MaxInit;                    /* 00h */
     U8                      CurrentInit;                /* 01h */
     U16                     Reserved1;                  /* 02h */
+    U64                     SASAddress;                 /* 04h */
 } EVENT_DATA_SAS_INIT_TABLE_OVERFLOW,
   MPI_POINTER PTR_EVENT_DATA_SAS_INIT_TABLE_OVERFLOW,
   MpiEventDataSasInitTableOverflow_t,

Modified: stable/8/sys/dev/mpt/mpt_cam.c
==============================================================================
--- stable/8/sys/dev/mpt/mpt_cam.c	Sat Aug 13 12:33:41 2011	(r224819)
+++ stable/8/sys/dev/mpt/mpt_cam.c	Sat Aug 13 12:37:22 2011	(r224820)
@@ -2538,7 +2538,8 @@ mpt_cam_event(struct mpt_softc *mpt, req
 		pqf->CurrentDepth = le16toh(pqf->CurrentDepth);
 		mpt_prt(mpt, "QUEUE FULL EVENT: Bus 0x%02x Target 0x%02x Depth "
 		    "%d\n", pqf->Bus, pqf->TargetID, pqf->CurrentDepth);
-		if (mpt->phydisk_sim) {
+		if (mpt->phydisk_sim && mpt_is_raid_member(mpt,
+		    pqf->TargetID) != 0) {
 			sim = mpt->phydisk_sim;
 		} else {
 			sim = mpt->sim;
@@ -2570,9 +2571,85 @@ mpt_cam_event(struct mpt_softc *mpt, req
 		mpt_prt(mpt, "IR resync update %d completed\n",
 		    (data0 >> 16) & 0xff);
 		break;
+	case MPI_EVENT_SAS_DEVICE_STATUS_CHANGE:
+	{
+		union ccb *ccb;
+		struct cam_sim *sim;
+		struct cam_path *tmppath;
+		PTR_EVENT_DATA_SAS_DEVICE_STATUS_CHANGE psdsc;
+
+		psdsc = (PTR_EVENT_DATA_SAS_DEVICE_STATUS_CHANGE)msg->Data;
+		if (mpt->phydisk_sim && mpt_is_raid_member(mpt,
+		    psdsc->TargetID) != 0)
+			sim = mpt->phydisk_sim;
+		else
+			sim = mpt->sim;
+		switch(psdsc->ReasonCode) {
+		case MPI_EVENT_SAS_DEV_STAT_RC_ADDED:
+			MPTLOCK_2_CAMLOCK(mpt);
+			ccb = xpt_alloc_ccb_nowait();
+			if (ccb == NULL) {
+				mpt_prt(mpt,
+				    "unable to alloc CCB for rescan\n");
+				CAMLOCK_2_MPTLOCK(mpt);
+				break;
+			}
+			if (xpt_create_path(&ccb->ccb_h.path, xpt_periph,
+			    cam_sim_path(sim), psdsc->TargetID,
+			    CAM_LUN_WILDCARD) != CAM_REQ_CMP) {
+				CAMLOCK_2_MPTLOCK(mpt);
+				mpt_prt(mpt,
+				    "unable to create path for rescan\n");
+				xpt_free_ccb(ccb);
+				break;
+			}
+			xpt_rescan(ccb);
+			CAMLOCK_2_MPTLOCK(mpt);
+			break;
+		case MPI_EVENT_SAS_DEV_STAT_RC_NOT_RESPONDING:
+			MPTLOCK_2_CAMLOCK(mpt);
+			if (xpt_create_path(&tmppath, NULL, cam_sim_path(sim),
+			    psdsc->TargetID, CAM_LUN_WILDCARD) !=
+			    CAM_REQ_CMP) {
+				mpt_prt(mpt,
+				    "unable to create path for async event");
+				CAMLOCK_2_MPTLOCK(mpt);
+				break;
+			}
+			xpt_async(AC_LOST_DEVICE, tmppath, NULL);
+			xpt_free_path(tmppath);
+			CAMLOCK_2_MPTLOCK(mpt);
+			break;
+		case MPI_EVENT_SAS_DEV_STAT_RC_CMPL_INTERNAL_DEV_RESET:
+		case MPI_EVENT_SAS_DEV_STAT_RC_CMPL_TASK_ABORT_INTERNAL:
+		case MPI_EVENT_SAS_DEV_STAT_RC_INTERNAL_DEVICE_RESET:
+			break;
+		default:
+			mpt_lprt(mpt, MPT_PRT_WARN,
+			    "SAS device status change: Bus: 0x%02x TargetID: "
+			    "0x%02x ReasonCode: 0x%02x\n", psdsc->Bus,
+			    psdsc->TargetID, psdsc->ReasonCode);
+			break;
+		}
+		break;
+	}
+	case MPI_EVENT_SAS_DISCOVERY_ERROR:
+	{
+		PTR_EVENT_DATA_DISCOVERY_ERROR pde;
+
+		pde = (PTR_EVENT_DATA_DISCOVERY_ERROR)msg->Data;
+		pde->DiscoveryStatus = le32toh(pde->DiscoveryStatus);
+		mpt_lprt(mpt, MPT_PRT_WARN,
+		    "SAS discovery error: Port: 0x%02x Status: 0x%08x\n",
+		    pde->Port, pde->DiscoveryStatus);
+		break;
+	}
 	case MPI_EVENT_EVENT_CHANGE:
 	case MPI_EVENT_INTEGRATED_RAID:
-	case MPI_EVENT_SAS_DEVICE_STATUS_CHANGE:
+	case MPI_EVENT_IR2:
+	case MPI_EVENT_LOG_ENTRY_ADDED:
+	case MPI_EVENT_SAS_DISCOVERY:
+	case MPI_EVENT_SAS_PHY_LINK_STATUS:
 	case MPI_EVENT_SAS_SES:
 		break;
 	default:

Modified: stable/8/sys/dev/mpt/mpt_raid.c
==============================================================================
--- stable/8/sys/dev/mpt/mpt_raid.c	Sat Aug 13 12:33:41 2011	(r224819)
+++ stable/8/sys/dev/mpt/mpt_raid.c	Sat Aug 13 12:37:22 2011	(r224820)
@@ -812,6 +812,25 @@ mpt_map_physdisk(struct mpt_softc *mpt, 
 
 /* XXX Ignores that there may be multiple busses/IOCs involved. */
 int
+mpt_is_raid_member(struct mpt_softc *mpt, target_id_t tgt)
+{
+	struct mpt_raid_disk *mpt_disk;
+	int i;
+
+	if (mpt->ioc_page2 == NULL || mpt->ioc_page2->MaxPhysDisks == 0)
+		return (0);
+	for (i = 0; i < mpt->ioc_page2->MaxPhysDisks; i++) {
+		mpt_disk = &mpt->raid_disks[i];
+		if ((mpt_disk->flags & MPT_RDF_ACTIVE) != 0 &&
+		    mpt_disk->config_page.PhysDiskID == tgt)
+			return (1);
+	}
+	return (0);
+	
+}
+
+/* XXX Ignores that there may be multiple busses/IOCs involved. */
+int
 mpt_is_raid_volume(struct mpt_softc *mpt, target_id_t tgt)
 {
 	CONFIG_PAGE_IOC_2_RAID_VOL *ioc_vol;

Modified: stable/8/sys/dev/mpt/mpt_raid.h
==============================================================================
--- stable/8/sys/dev/mpt/mpt_raid.h	Sat Aug 13 12:33:41 2011	(r224819)
+++ stable/8/sys/dev/mpt/mpt_raid.h	Sat Aug 13 12:37:22 2011	(r224820)
@@ -54,6 +54,7 @@ typedef enum {
 } mpt_raid_mwce_t;
 
 cam_status mpt_map_physdisk(struct mpt_softc *, union ccb *, target_id_t *);
+int mpt_is_raid_member(struct mpt_softc *, target_id_t);
 int mpt_is_raid_volume(struct mpt_softc *, target_id_t);
 #if	0
 cam_status
Comment 16 dfilter service freebsd_committer freebsd_triage 2011-08-13 13:37:47 UTC
Author: marius
Date: Sat Aug 13 12:37:23 2011
New Revision: 224821
URL: http://svn.freebsd.org/changeset/base/224821

Log:
  MFC: r224494, r224761
  
  - Send the RELSIM_ADJUST_OPENINGS in response to a MPI_EVENT_QUEUE_FULL using
    the right SIM in case the HBA is RAID-capable but the target in question is
    not a hot spare or member of a RAID volume.
  - Report the loss and addition of SAS and SATA targets detected via PHY link
    status changes and signalled by MPI_EVENT_SAS_DEVICE_STATUS_CHANGE to cam(4)
    as lost devices and trigger rescans as appropriate. Without this it can take
    quite some time until a lost device actually is no longer tried to be used,
    if it ever stops. [1]
  - Handle MPI_EVENT_IR2, MPI_EVENT_LOG_ENTRY_ADDED, MPI_EVENT_SAS_DISCOVERY
    and MPI_EVENT_SAS_PHY_LINK_STATUS silently as these serve no additional
    purpose beyond adding cryptic entries to logs.
  - Add a warning for MPI_EVENT_SAS_DISCOVERY_ERROR events, which can help
    identifying broken disks. [2]
  
  Thanks to Hans-Joerg Sirtl for providing one of the HBAs these changes were
  developed with and RIP to the mainboard that didn't survive testing them.
  
  PR:		157534 [1]
  Submitted by:	Andrew Boyer [2]

Modified:
  stable/7/sys/dev/mpt/mpilib/mpi_ioc.h
  stable/7/sys/dev/mpt/mpt_cam.c
  stable/7/sys/dev/mpt/mpt_raid.c
  stable/7/sys/dev/mpt/mpt_raid.h
Directory Properties:
  stable/7/sys/   (props changed)
  stable/7/sys/cddl/contrib/opensolaris/   (props changed)
  stable/7/sys/contrib/dev/acpica/   (props changed)
  stable/7/sys/contrib/pf/   (props changed)

Modified: stable/7/sys/dev/mpt/mpilib/mpi_ioc.h
==============================================================================
--- stable/7/sys/dev/mpt/mpilib/mpi_ioc.h	Sat Aug 13 12:37:22 2011	(r224820)
+++ stable/7/sys/dev/mpt/mpilib/mpi_ioc.h	Sat Aug 13 12:37:23 2011	(r224821)
@@ -33,7 +33,7 @@
  *          Title:  MPI IOC, Port, Event, FW Download, and FW Upload messages
  *  Creation Date:  August 11, 2000
  *
- *    mpi_ioc.h Version:  01.05.14
+ *    mpi_ioc.h Version:  01.05.16
  *
  *  Version History
  *  ---------------
@@ -140,6 +140,16 @@
  *                      added _MULTI_PORT_DOMAIN.
  *  05-24-07  01.05.14  Added Common Boot Block type to FWDownload Request.
  *                      Added Common Boot Block type to FWUpload Request.
+ *  08-07-07  01.05.15  Added MPI_EVENT_SAS_INIT_RC_REMOVED define.
+ *                      Added MPI_EVENT_IR2_RC_DUAL_PORT_ADDED and
+ *                      MPI_EVENT_IR2_RC_DUAL_PORT_REMOVED for IR2 event data.
+ *                      Added SASAddress field to SAS Initiator Device Table
+ *                      Overflow event data structure.
+ *  03-28-08  01.05.16  Added two new ReasonCode values to SAS Device Status
+ *                      Change Event data to indicate completion of internally
+ *                      generated task management.
+ *                      Added MPI_EVENT_DSCVRY_ERR_DS_SATA_INIT_FAILURE define.
+ *                      Added MPI_EVENT_SAS_INIT_RC_INACCESSIBLE define.
  *  --------------------------------------------------------------------------
  */
 
@@ -639,6 +649,8 @@ typedef struct _EVENT_DATA_SAS_DEVICE_ST
 #define MPI_EVENT_SAS_DEV_STAT_RC_CLEAR_TASK_SET_INTERNAL   (0x0B)
 #define MPI_EVENT_SAS_DEV_STAT_RC_QUERY_TASK_INTERNAL       (0x0C)
 #define MPI_EVENT_SAS_DEV_STAT_RC_ASYNC_NOTIFICATION        (0x0D)
+#define MPI_EVENT_SAS_DEV_STAT_RC_CMPL_INTERNAL_DEV_RESET   (0x0E)
+#define MPI_EVENT_SAS_DEV_STAT_RC_CMPL_TASK_ABORT_INTERNAL  (0x0F)
 
 
 /* SCSI Event data for Queue Full event */
@@ -735,6 +747,8 @@ typedef struct _MPI_EVENT_DATA_IR2
 #define MPI_EVENT_IR2_RC_PD_REMOVED                 (0x05)
 #define MPI_EVENT_IR2_RC_FOREIGN_CFG_DETECTED       (0x06)
 #define MPI_EVENT_IR2_RC_REBUILD_MEDIUM_ERROR       (0x07)
+#define MPI_EVENT_IR2_RC_DUAL_PORT_ADDED            (0x08)
+#define MPI_EVENT_IR2_RC_DUAL_PORT_REMOVED          (0x09)
 
 /* defines for logical disk states */
 #define MPI_LD_STATE_OPTIMAL                        (0x00)
@@ -894,6 +908,7 @@ typedef struct _EVENT_DATA_DISCOVERY_ERR
 #define MPI_EVENT_DSCVRY_ERR_DS_UNSUPPORTED_DEVICE          (0x00000800)
 #define MPI_EVENT_DSCVRY_ERR_DS_MAX_SATA_TARGETS            (0x00001000)
 #define MPI_EVENT_DSCVRY_ERR_DS_MULTI_PORT_DOMAIN           (0x00002000)
+#define MPI_EVENT_DSCVRY_ERR_DS_SATA_INIT_FAILURE           (0x00004000)
 
 /* SAS SMP Error Event data */
 
@@ -929,6 +944,8 @@ typedef struct _EVENT_DATA_SAS_INIT_DEV_
 
 /* defines for the ReasonCode field of the SAS Initiator Device Status Change event */
 #define MPI_EVENT_SAS_INIT_RC_ADDED                 (0x01)
+#define MPI_EVENT_SAS_INIT_RC_REMOVED               (0x02)
+#define MPI_EVENT_SAS_INIT_RC_INACCESSIBLE          (0x03)
 
 /* SAS Initiator Device Table Overflow Event data */
 
@@ -937,6 +954,7 @@ typedef struct _EVENT_DATA_SAS_INIT_TABL
     U8                      MaxInit;                    /* 00h */
     U8                      CurrentInit;                /* 01h */
     U16                     Reserved1;                  /* 02h */
+    U64                     SASAddress;                 /* 04h */
 } EVENT_DATA_SAS_INIT_TABLE_OVERFLOW,
   MPI_POINTER PTR_EVENT_DATA_SAS_INIT_TABLE_OVERFLOW,
   MpiEventDataSasInitTableOverflow_t,

Modified: stable/7/sys/dev/mpt/mpt_cam.c
==============================================================================
--- stable/7/sys/dev/mpt/mpt_cam.c	Sat Aug 13 12:37:22 2011	(r224820)
+++ stable/7/sys/dev/mpt/mpt_cam.c	Sat Aug 13 12:37:23 2011	(r224821)
@@ -2538,7 +2538,8 @@ mpt_cam_event(struct mpt_softc *mpt, req
 		pqf->CurrentDepth = le16toh(pqf->CurrentDepth);
 		mpt_prt(mpt, "QUEUE FULL EVENT: Bus 0x%02x Target 0x%02x Depth "
 		    "%d\n", pqf->Bus, pqf->TargetID, pqf->CurrentDepth);
-		if (mpt->phydisk_sim) {
+		if (mpt->phydisk_sim && mpt_is_raid_member(mpt,
+		    pqf->TargetID) != 0) {
 			sim = mpt->phydisk_sim;
 		} else {
 			sim = mpt->sim;
@@ -2570,9 +2571,85 @@ mpt_cam_event(struct mpt_softc *mpt, req
 		mpt_prt(mpt, "IR resync update %d completed\n",
 		    (data0 >> 16) & 0xff);
 		break;
+	case MPI_EVENT_SAS_DEVICE_STATUS_CHANGE:
+	{
+		union ccb *ccb;
+		struct cam_sim *sim;
+		struct cam_path *tmppath;
+		PTR_EVENT_DATA_SAS_DEVICE_STATUS_CHANGE psdsc;
+
+		psdsc = (PTR_EVENT_DATA_SAS_DEVICE_STATUS_CHANGE)msg->Data;
+		if (mpt->phydisk_sim && mpt_is_raid_member(mpt,
+		    psdsc->TargetID) != 0)
+			sim = mpt->phydisk_sim;
+		else
+			sim = mpt->sim;
+		switch(psdsc->ReasonCode) {
+		case MPI_EVENT_SAS_DEV_STAT_RC_ADDED:
+			MPTLOCK_2_CAMLOCK(mpt);
+			ccb = xpt_alloc_ccb_nowait();
+			if (ccb == NULL) {
+				mpt_prt(mpt,
+				    "unable to alloc CCB for rescan\n");
+				CAMLOCK_2_MPTLOCK(mpt);
+				break;
+			}
+			if (xpt_create_path(&ccb->ccb_h.path, xpt_periph,
+			    cam_sim_path(sim), psdsc->TargetID,
+			    CAM_LUN_WILDCARD) != CAM_REQ_CMP) {
+				CAMLOCK_2_MPTLOCK(mpt);
+				mpt_prt(mpt,
+				    "unable to create path for rescan\n");
+				xpt_free_ccb(ccb);
+				break;
+			}
+			xpt_rescan(ccb);
+			CAMLOCK_2_MPTLOCK(mpt);
+			break;
+		case MPI_EVENT_SAS_DEV_STAT_RC_NOT_RESPONDING:
+			MPTLOCK_2_CAMLOCK(mpt);
+			if (xpt_create_path(&tmppath, NULL, cam_sim_path(sim),
+			    psdsc->TargetID, CAM_LUN_WILDCARD) !=
+			    CAM_REQ_CMP) {
+				mpt_prt(mpt,
+				    "unable to create path for async event");
+				CAMLOCK_2_MPTLOCK(mpt);
+				break;
+			}
+			xpt_async(AC_LOST_DEVICE, tmppath, NULL);
+			xpt_free_path(tmppath);
+			CAMLOCK_2_MPTLOCK(mpt);
+			break;
+		case MPI_EVENT_SAS_DEV_STAT_RC_CMPL_INTERNAL_DEV_RESET:
+		case MPI_EVENT_SAS_DEV_STAT_RC_CMPL_TASK_ABORT_INTERNAL:
+		case MPI_EVENT_SAS_DEV_STAT_RC_INTERNAL_DEVICE_RESET:
+			break;
+		default:
+			mpt_lprt(mpt, MPT_PRT_WARN,
+			    "SAS device status change: Bus: 0x%02x TargetID: "
+			    "0x%02x ReasonCode: 0x%02x\n", psdsc->Bus,
+			    psdsc->TargetID, psdsc->ReasonCode);
+			break;
+		}
+		break;
+	}
+	case MPI_EVENT_SAS_DISCOVERY_ERROR:
+	{
+		PTR_EVENT_DATA_DISCOVERY_ERROR pde;
+
+		pde = (PTR_EVENT_DATA_DISCOVERY_ERROR)msg->Data;
+		pde->DiscoveryStatus = le32toh(pde->DiscoveryStatus);
+		mpt_lprt(mpt, MPT_PRT_WARN,
+		    "SAS discovery error: Port: 0x%02x Status: 0x%08x\n",
+		    pde->Port, pde->DiscoveryStatus);
+		break;
+	}
 	case MPI_EVENT_EVENT_CHANGE:
 	case MPI_EVENT_INTEGRATED_RAID:
-	case MPI_EVENT_SAS_DEVICE_STATUS_CHANGE:
+	case MPI_EVENT_IR2:
+	case MPI_EVENT_LOG_ENTRY_ADDED:
+	case MPI_EVENT_SAS_DISCOVERY:
+	case MPI_EVENT_SAS_PHY_LINK_STATUS:
 	case MPI_EVENT_SAS_SES:
 		break;
 	default:

Modified: stable/7/sys/dev/mpt/mpt_raid.c
==============================================================================
--- stable/7/sys/dev/mpt/mpt_raid.c	Sat Aug 13 12:37:22 2011	(r224820)
+++ stable/7/sys/dev/mpt/mpt_raid.c	Sat Aug 13 12:37:23 2011	(r224821)
@@ -828,6 +828,25 @@ mpt_map_physdisk(struct mpt_softc *mpt, 
 
 /* XXX Ignores that there may be multiple busses/IOCs involved. */
 int
+mpt_is_raid_member(struct mpt_softc *mpt, target_id_t tgt)
+{
+	struct mpt_raid_disk *mpt_disk;
+	int i;
+
+	if (mpt->ioc_page2 == NULL || mpt->ioc_page2->MaxPhysDisks == 0)
+		return (0);
+	for (i = 0; i < mpt->ioc_page2->MaxPhysDisks; i++) {
+		mpt_disk = &mpt->raid_disks[i];
+		if ((mpt_disk->flags & MPT_RDF_ACTIVE) != 0 &&
+		    mpt_disk->config_page.PhysDiskID == tgt)
+			return (1);
+	}
+	return (0);
+	
+}
+
+/* XXX Ignores that there may be multiple busses/IOCs involved. */
+int
 mpt_is_raid_volume(struct mpt_softc *mpt, target_id_t tgt)
 {
 	CONFIG_PAGE_IOC_2_RAID_VOL *ioc_vol;

Modified: stable/7/sys/dev/mpt/mpt_raid.h
==============================================================================
--- stable/7/sys/dev/mpt/mpt_raid.h	Sat Aug 13 12:37:22 2011	(r224820)
+++ stable/7/sys/dev/mpt/mpt_raid.h	Sat Aug 13 12:37:23 2011	(r224821)
@@ -54,6 +54,7 @@ typedef enum {
 } mpt_raid_mwce_t;
 
 cam_status mpt_map_physdisk(struct mpt_softc *, union ccb *, target_id_t *);
+int mpt_is_raid_member(struct mpt_softc *, target_id_t);
 int mpt_is_raid_volume(struct mpt_softc *, target_id_t);
 #if	0
 cam_status
Comment 17 Marius Strobl freebsd_committer freebsd_triage 2011-08-28 13:16:36 UTC
State Changed
From-To: open->closed

Close; patches have been committed and MFC'ed down to stable/7.