Bug 212914 - CAM: SATA drives are getting deleted and then re-added after controller rescan
Summary: CAM: SATA drives are getting deleted and then re-added after controller rescan
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-RC1
Hardware: Any Any
: --- Affects Only Me
Assignee: Alan Somers
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2016-09-23 03:22 UTC by Kashyap
Modified: 2017-05-22 15:55 UTC (History)
8 users (show)

See Also:
asomers: mfc-stable11+
asomers: mfc-stable10-
asomers: mfc-stable9-


Attachments
Rescan ATA Device with valid MD5 checksum (488 bytes, patch)
2016-10-18 13:26 UTC, Kashyap
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Kashyap 2016-09-23 03:22:49 UTC
This issue is even common for LSI/Broadcom IT HBA (mpr driver), so looks to be a common issue. 

If you have SATA drive and just do "camcontrol rescan all", SATA disc is added and removed back by CAM layer.


On FreeBSD11.0 RC1, we are facing an issue where SATA drives connected behind LSI's MegaRAID controller getting deleted and added back after controller reset.
I am using Broadcom/Avago/LSI's  MegaRAID Invader controller(device ID- 0x005d). The point to note here is- this behavior is not observed with SAS drives on FreeBSD11.0-RC1.
Also on FreeBSD10.3 this behavior is not at all observed on SATA as well.
We are debugging the issue but it would be much helpful if we can get quick inputs/pointers.

Please find below the detailed information-

OS: FreeBSD 11.0 RC1
Controller: LSI's MegaRAID invader controller

Connected devices list:

root@freeBSD11:~ # camcontrol devlist
<ST500NM0011 PA09>                 at scbus5 target 0 lun 0 (pass0,ada0)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus6 target 0 lun 0 (ses0,pass1)
<ATA ST9250610NS SN01>             at scbus8 target 51 lun 0
(da9,pass11)----------------------------------------->this is SATA drive which is getting deleted and re-added post controller reset
<SEAGATE ST9300605SS 0004>         at scbus8 target 163 lun 0 (da8,pass10)
<LSI Default 5.00>                 at scbus9 target 0 lun 0 (da6,pass8)
<LSI Default 5.00>                 at scbus9 target 1 lun 0 (da2,pass4)
<LSI Default 5.00>                 at scbus9 target 2 lun 0 (da0,pass2)
<LSI Default 5.00>                 at scbus9 target 3 lun 0 (da7,pass9)
<LSI Default 5.00>                 at scbus9 target 4 lun 0 (da3,pass5)
<LSI Default 5.00>                 at scbus9 target 5 lun 0 (da1,pass3)
<SEAGATE ST600MP0005 VS09>         at scbus10 target 48 lun 0 (da4,pass6)
<SEAGATE ST600MP0005 VS09>         at scbus10 target 54 lun 0 (da5,pass7)


Relevant dmesg logs snippet(da9 is SATA drive which is getting deleted and added back):

================================
mrsas0: Initiaiting OCR because of FW fault!
mrsas0: Waiting for FW to come to ready state
mrsas0: Jbod map is supported
mrsas0: Reset successful
da9 at mrsas0 bus 1 scbus8 target 51 lun 0
da9: <ATA ST9250610NS SN01> s/n 9XE02AR2 detached
(da9:mrsas0:1:51:0): Periph destroyed
(da9:mrsas0:1:51:0): UNMAPPED
(da9:mrsas0:1:51:0): fatal error, could not acquire reference count
g_access(918): provider da9 has error
g_access(918): provider da9 has error
g_access(918): provider da9 has error
(da9:mrsas0:1:51:0): UNMAPPED
da9 at mrsas0 bus 1 scbus8 target 51 lun 0
da9: <ATA ST9250610NS SN01> Fixed Direct Access SPC-4 SCSI device
da9: Serial Number 9XE02AR2
da9: 150.000MB/s transfers
da9: 238475MB (488397168 512 byte sectors) =================================
Comment 1 Kashyap 2016-10-12 14:03:58 UTC
Any update/pointer  on this ?  Issue happen only with SATA driver attached via CAM layer.

Do we need to address this in driver or will there be any fix in CAM layer ?
Comment 2 Kenneth D. Merry freebsd_committer freebsd_triage 2016-10-12 18:10:21 UTC
The device may be going away for several possible reasons:

1. A CCB is returned with the status CAM_DEV_NOT_THERE or CAM_SEL_TIMEOUT.
2. A CCB is returned with the SCSI ASC/ASCQ 0x25,0x00, Logical Unit Not Supported.
3. Someone is doing an xpt_async(AC_LOST_DEVICE, ...)


A device may go away and come back as a result of a rescan if any of the following changes:

SCSI Standard Inquiry Data (Full inquiry data, including Vendor, Product, Revision)
SCSI page 0x80 serial number

The first thing I would look at here is what status is getting returned from the drive in question after a reset.  If that all looks good, look at whether someone is issuing a rescan, and whether the device is returning inconsistent results.  Those inconsistent results could be buried in the part of the Inquiry data that isn't displayed.  Standard inquiry data is checksummed along with the serial number and any change in the checksum will make the device go away and come back.

Although this is a SATA drive, obviously the only thing that matters is the SCSI response, because CAM is communicating with it as if it is a SCSI protocol device.
Comment 3 Kashyap 2016-10-14 14:18:32 UTC
Here is CAMDEBUG enabled logs  for SAS Drives. This is a working case -

Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_allocq(0xfffff80060a62e00)
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): Probe PROBE_EXTENDED_INQUIRY to PROBE_SERIAL_NUM
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_schedule
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_allocq(0xfffff80060a62e00)
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_setup_ccb
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: calling periph_start()
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): probestart
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0):
Oct 14 18:56:06 freeBSD11_RC1 kernel: xpt_action: func 0x901 XPT_SCSI_IO
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_action_default: func 0x901 XPT_SCSI_IO
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:
Oct 14 18:56:06 freeBSD11_RC1 kernel: 73:0): xpt_action_default: func= 0x901 XPT_SCSI_IO status 0
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0):
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_release_devq(0, 0, 0, 0)
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_setup_ccb
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_action: func 0x5 XPT_REL_SIMQ
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0):
Oct 14 18:56:06 freeBSD11_RC1 kernel: xpt_action_default: func 0x5 XPT_REL_SIMQ
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_release_devq(1, 1)
Oct 14 18:56:06 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:73:0):
Oct 14 18:56:06 freeBSD11_RC1 kernel: xpt_release_devq_device(1, 1) 1->0
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_schedule_dev
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: Inserting onto queue
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: running device 0xfffff8000f04d000
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_freeze_devq(1)
Oct 14 18:56:06 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:73:0): xpt_freeze_devq_device(1) 0->1
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0):
Oct 14 18:56:06 freeBSD11_RC1 kernel: . CDB: 12 01 80 00 ff 00
Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_action_default: func= 0x5 XPT_REL_SIMQ status 0x1
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_allocq(0xfffff8000f000900)
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: calling periph_start()
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_schedule_dev
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: Inserting onto queue
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: running device 0xfffff8000f165800
Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq


See below similar logs from SATA Drives. - This is non working case -

Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): Probe PROBE_EXTENDED_INQUIRY to PROBE_SERIAL_NUM
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_schedule
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): cam_release_devq(0, 0, 0, 0)
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_setup_ccb
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action: func 0x5 XPT_REL_SIMQ
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func 0x5 XPT_REL_SIMQ
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_release_devq(1, 1)
Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_release_devq_device(1, 1) 2->1
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func= 0x5 XPT_REL_SIMQ status 0x1
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_setup_ccb
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: calling periph_start()
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): probestart
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: (probe0:xpt_run_devq
Oct 14 18:52:04 freeBSD11_RC1 kernel: mrsas0:1:cam_debug: 71:xpt_release_ccb
Oct 14 18:52:04 freeBSD11_RC1 kernel: 0): cam_debug: xpt_action: func 0x901 XPT_SCSI_IO
Oct 14 18:52:04 freeBSD11_RC1 kernel: xpt_run_allocq(0xfffff8000f000900)
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func 0x901 XPT_SCSI_IO
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func= 0x901 XPT_SCSI_IO status 0
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): cam_release_devq(0, 0, 0, 0)
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_setup_ccb
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action: func 0x5 XPT_REL_SIMQ
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func 0x5 XPT_REL_SIMQ
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_release_devq(1, 1)
Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_release_devq_device(1, 1) 1->0
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: xpt_schedule_dev
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: Inserting onto queue
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: running device 0xfffff8000f046000
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_freeze_devq(1)
Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_freeze_devq_device(1) 0->1
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): . CDB: 12 01 80 00 ff 00
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func= 0x5 XPT_REL_SIMQ status 0x1
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_done: func= 0x901 XPT_SCSI_IO status 0x1
Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): probedone
Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_async: func 0x70f XPT_ASYNC aync_code 256 AC_LOST_DEVICE
Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_freeze_devq(1)


From the CAM layer code, I think xpt_async() is called from below location @ probedone. Not sure why there is a Checksum failure for SATA drives every time. I will debug further. Any pointers ? Same h/w works fine on FREEBSD10.3. Is there any extra logic added after FreeBSD10.3 ?


                        /*
                         * XXX Do we need to do a TUR in order to ensure
                         *     that the device really hasn't changed???
                         */
                        if ((changed != 0)
                         && ((softc->flags & PROBE_NO_ANNOUNCE) == 0))
                                xpt_async(AC_LOST_DEVICE, path, NULL);
                }
                if (serial_buf != NULL)
                        free(serial_buf, M_CAMXPT);

                if (changed != 0) {
                        /*
                         * Now that we have all the necessary
                         * information to safely perform transfer
                         * negotiations... Controllers don't perform
                         * any negotiation or tagged queuing until
                         * after the first XPT_SET_TRAN_SETTINGS ccb is
                         * received.  So, on a new device, just retrieve
                         * the user settings, and set them as the current
                         * settings to set the device up.
                         */
                        proberequestdefaultnegotiation(periph);
                        xpt_release_ccb(done_ccb);

                        /*
                         * Perform a TUR to allow the controller to
                         * perform any necessary transfer negotiation.
                         */
                        PROBE_SET_ACTION(softc, PROBE_TUR_FOR_NEGOTIATION);
                        xpt_schedule(periph, priority);
                        goto out;
                }


Thanks, Kashyap
Comment 4 Kashyap 2016-10-18 09:17:19 UTC
This issue is fixed using below patch. Please review and let me know if this is a correct fix.  Root cause is - "Checksum is updated using different serial number. One without removing extra spaces and another with additional spaces. Because of that, any rescan of ATA disk is defected as different ATA drive, so it is removed and re-added later. "


Index: scsi_xpt.c
===================================================================
--- scsi_xpt.c	(revision 307137)
+++ scsi_xpt.c	(working copy)
@@ -1600,8 +1600,8 @@
 				  sizeof(struct scsi_inquiry_data));
 
 			if (have_serialnum)
-				MD5Update(&context, serial_buf->serial_num,
-					  serial_buf->length);
+				MD5Update(&context, path->device->serial_num,
+				    path->device->serial_num_len);
 
 			MD5Final(digest, &context);
 			if (bcmp(softc->digest, digest, 16) == 0)
Comment 5 Kashyap 2016-10-18 13:26:21 UTC
Created attachment 175903 [details]
Rescan ATA Device with valid MD5 checksum
Comment 6 Kenneth D. Merry freebsd_committer freebsd_triage 2016-10-20 20:32:40 UTC
Reassigning to Alan.  Looks like his change broke this, and it would be good for him to  double check the patch as well.
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2016-11-17 00:58:28 UTC
We've been hitting this same problem with mps at Isilon, and the patch fixes it for us.
Comment 8 commit-hook freebsd_committer freebsd_triage 2016-11-17 20:43:49 UTC
A commit references this bug:

Author: asomers
Date: Thu Nov 17 20:42:56 UTC 2016
New revision: 308780
URL: https://svnweb.freebsd.org/changeset/base/308780

Log:
  Fix "camcontrol rescan" with SATA drives behind a SAS controller

  A bug in CAM's serial number hash logic resulted in SATA drives behind a SAS
  controller getting removed and readded anytime the drive was rescanned for
  any reason.

  PR:		212914
  Submitted by:	kadesai
  Reported by:	kadesai
  Reviewed by:	asomers, ken
  MFC after:	4 weeks
  Sponsored by:	Spectra Logic Corp

Changes:
  head/sys/cam/scsi/scsi_xpt.c
Comment 9 Alan Somers freebsd_committer freebsd_triage 2016-11-17 20:45:33 UTC
Fixed in head by r308780.  I'll MFC to stable/11 in 4 weeks.  stable/10 and stable/9 are not affected.
Comment 10 Kashyap 2016-11-21 13:31:05 UTC
(In reply to commit-hook from comment #8)
Thanks.! I will validate changes, but curious to know how customer should fix this issue ? 
Do we need to recommend to update stable-11 ?
Any steps recommended to be followed if I have FreeBSD-11 release update1 installed ?

Thanks, Kashyap
Comment 11 Alan Somers freebsd_committer freebsd_triage 2016-11-21 16:18:31 UTC
Upgrading from 11.0-RELEASE to stable/11 is pretty inconvenient, and it makes it hard to upgrade to 11.1-RELEASE in the future.  Is there any kind of HBA setting that will cause serial numbers to be generated without spaces?  Or cause no serial numbers to be generated?  That would be a workaround.
Comment 12 Kashyap 2016-11-22 10:03:46 UTC
(In reply to Alan Somers from comment #11)
We don't have HBA setting which can generate serial number without space.
In fact, this issue was also observed on some SAS Devices as well.

So here is what I understood - RELEASE to stable update is not recommended. 
And also HBA setting is not possible to  provide any workaround. We need to document and explain this to customer and make sure they are aware of this. 

Correct ? Or any more ideas of workaround ?
Comment 13 Scott Long freebsd_committer freebsd_triage 2016-11-22 12:32:14 UTC
This fix could be incorporated into an EN update to 11.0, making it an easy update from 11.0-RELEASE.  It definitely sounds like it's a serious enough bug to warrant an errata and fix.  Please contact Glen Barber and work with him on it.
Comment 14 Kashyap 2016-11-25 12:55:21 UTC
I confirm below test result.

1. Current FreeBSD11 Stable (below info), issue reproduce.

-bash-4.1# svn info
Path: .
URL: http://svn.freebsd.org/base/stable/11
Repository Root: http://svn.freebsd.org/base
Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
Revision: 309090
Node Kind: directory
Schedule: normal
Last Changed Author: emaste
Last Changed Rev: 309075
Last Changed Date: 2016-11-23 19:45:00 -0500 (Wed, 23 Nov 2016)

2. After applying attached patch (r308780). Issue is resolved.
Comment 15 Alan Somers freebsd_committer freebsd_triage 2017-02-14 20:05:53 UTC
MFCed to stable/11 by r309273 on 28-Nov-2017.  I'll contact re@ about an EN.