This issue is even common for LSI/Broadcom IT HBA (mpr driver), so looks to be a common issue. If you have SATA drive and just do "camcontrol rescan all", SATA disc is added and removed back by CAM layer. On FreeBSD11.0 RC1, we are facing an issue where SATA drives connected behind LSI's MegaRAID controller getting deleted and added back after controller reset. I am using Broadcom/Avago/LSI's MegaRAID Invader controller(device ID- 0x005d). The point to note here is- this behavior is not observed with SAS drives on FreeBSD11.0-RC1. Also on FreeBSD10.3 this behavior is not at all observed on SATA as well. We are debugging the issue but it would be much helpful if we can get quick inputs/pointers. Please find below the detailed information- OS: FreeBSD 11.0 RC1 Controller: LSI's MegaRAID invader controller Connected devices list: root@freeBSD11:~ # camcontrol devlist <ST500NM0011 PA09> at scbus5 target 0 lun 0 (pass0,ada0) <AHCI SGPIO Enclosure 1.00 0001> at scbus6 target 0 lun 0 (ses0,pass1) <ATA ST9250610NS SN01> at scbus8 target 51 lun 0 (da9,pass11)----------------------------------------->this is SATA drive which is getting deleted and re-added post controller reset <SEAGATE ST9300605SS 0004> at scbus8 target 163 lun 0 (da8,pass10) <LSI Default 5.00> at scbus9 target 0 lun 0 (da6,pass8) <LSI Default 5.00> at scbus9 target 1 lun 0 (da2,pass4) <LSI Default 5.00> at scbus9 target 2 lun 0 (da0,pass2) <LSI Default 5.00> at scbus9 target 3 lun 0 (da7,pass9) <LSI Default 5.00> at scbus9 target 4 lun 0 (da3,pass5) <LSI Default 5.00> at scbus9 target 5 lun 0 (da1,pass3) <SEAGATE ST600MP0005 VS09> at scbus10 target 48 lun 0 (da4,pass6) <SEAGATE ST600MP0005 VS09> at scbus10 target 54 lun 0 (da5,pass7) Relevant dmesg logs snippet(da9 is SATA drive which is getting deleted and added back): ================================ mrsas0: Initiaiting OCR because of FW fault! mrsas0: Waiting for FW to come to ready state mrsas0: Jbod map is supported mrsas0: Reset successful da9 at mrsas0 bus 1 scbus8 target 51 lun 0 da9: <ATA ST9250610NS SN01> s/n 9XE02AR2 detached (da9:mrsas0:1:51:0): Periph destroyed (da9:mrsas0:1:51:0): UNMAPPED (da9:mrsas0:1:51:0): fatal error, could not acquire reference count g_access(918): provider da9 has error g_access(918): provider da9 has error g_access(918): provider da9 has error (da9:mrsas0:1:51:0): UNMAPPED da9 at mrsas0 bus 1 scbus8 target 51 lun 0 da9: <ATA ST9250610NS SN01> Fixed Direct Access SPC-4 SCSI device da9: Serial Number 9XE02AR2 da9: 150.000MB/s transfers da9: 238475MB (488397168 512 byte sectors) =================================
Any update/pointer on this ? Issue happen only with SATA driver attached via CAM layer. Do we need to address this in driver or will there be any fix in CAM layer ?
The device may be going away for several possible reasons: 1. A CCB is returned with the status CAM_DEV_NOT_THERE or CAM_SEL_TIMEOUT. 2. A CCB is returned with the SCSI ASC/ASCQ 0x25,0x00, Logical Unit Not Supported. 3. Someone is doing an xpt_async(AC_LOST_DEVICE, ...) A device may go away and come back as a result of a rescan if any of the following changes: SCSI Standard Inquiry Data (Full inquiry data, including Vendor, Product, Revision) SCSI page 0x80 serial number The first thing I would look at here is what status is getting returned from the drive in question after a reset. If that all looks good, look at whether someone is issuing a rescan, and whether the device is returning inconsistent results. Those inconsistent results could be buried in the part of the Inquiry data that isn't displayed. Standard inquiry data is checksummed along with the serial number and any change in the checksum will make the device go away and come back. Although this is a SATA drive, obviously the only thing that matters is the SCSI response, because CAM is communicating with it as if it is a SCSI protocol device.
Here is CAMDEBUG enabled logs for SAS Drives. This is a working case - Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_allocq(0xfffff80060a62e00) Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): Probe PROBE_EXTENDED_INQUIRY to PROBE_SERIAL_NUM Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_schedule Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_allocq(0xfffff80060a62e00) Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_setup_ccb Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: calling periph_start() Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): probestart Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): Oct 14 18:56:06 freeBSD11_RC1 kernel: xpt_action: func 0x901 XPT_SCSI_IO Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_action_default: func 0x901 XPT_SCSI_IO Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1: Oct 14 18:56:06 freeBSD11_RC1 kernel: 73:0): xpt_action_default: func= 0x901 XPT_SCSI_IO status 0 Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_release_devq(0, 0, 0, 0) Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_setup_ccb Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_action: func 0x5 XPT_REL_SIMQ Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): Oct 14 18:56:06 freeBSD11_RC1 kernel: xpt_action_default: func 0x5 XPT_REL_SIMQ Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_release_devq(1, 1) Oct 14 18:56:06 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:73:0): Oct 14 18:56:06 freeBSD11_RC1 kernel: xpt_release_devq_device(1, 1) 1->0 Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_schedule_dev Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: Inserting onto queue Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: running device 0xfffff8000f04d000 Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_freeze_devq(1) Oct 14 18:56:06 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:73:0): xpt_freeze_devq_device(1) 0->1 Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): Oct 14 18:56:06 freeBSD11_RC1 kernel: . CDB: 12 01 80 00 ff 00 Oct 14 18:56:06 freeBSD11_RC1 kernel: (probe0:mrsas0:1:73:0): xpt_action_default: func= 0x5 XPT_REL_SIMQ status 0x1 Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_allocq(0xfffff8000f000900) Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: calling periph_start() Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_schedule_dev Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: Inserting onto queue Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: running device 0xfffff8000f165800 Oct 14 18:56:06 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq See below similar logs from SATA Drives. - This is non working case - Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): Probe PROBE_EXTENDED_INQUIRY to PROBE_SERIAL_NUM Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_schedule Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): cam_release_devq(0, 0, 0, 0) Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_setup_ccb Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action: func 0x5 XPT_REL_SIMQ Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func 0x5 XPT_REL_SIMQ Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_release_devq(1, 1) Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_release_devq_device(1, 1) 2->1 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func= 0x5 XPT_REL_SIMQ status 0x1 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_setup_ccb Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: calling periph_start() Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): probestart Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: (probe0:xpt_run_devq Oct 14 18:52:04 freeBSD11_RC1 kernel: mrsas0:1:cam_debug: 71:xpt_release_ccb Oct 14 18:52:04 freeBSD11_RC1 kernel: 0): cam_debug: xpt_action: func 0x901 XPT_SCSI_IO Oct 14 18:52:04 freeBSD11_RC1 kernel: xpt_run_allocq(0xfffff8000f000900) Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func 0x901 XPT_SCSI_IO Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func= 0x901 XPT_SCSI_IO status 0 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): cam_release_devq(0, 0, 0, 0) Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_setup_ccb Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action: func 0x5 XPT_REL_SIMQ Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func 0x5 XPT_REL_SIMQ Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_release_devq(1, 1) Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_release_devq_device(1, 1) 1->0 Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: xpt_schedule_dev Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: Inserting onto queue Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: running device 0xfffff8000f046000 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_freeze_devq(1) Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_freeze_devq_device(1) 0->1 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): . CDB: 12 01 80 00 ff 00 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_action_default: func= 0x5 XPT_REL_SIMQ status 0x1 Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_done: func= 0x901 XPT_SCSI_IO status 0x1 Oct 14 18:52:04 freeBSD11_RC1 kernel: cam_debug: xpt_run_devq Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): probedone Oct 14 18:52:04 freeBSD11_RC1 kernel: (noperiph:mrsas0:1:71:0): xpt_async: func 0x70f XPT_ASYNC aync_code 256 AC_LOST_DEVICE Oct 14 18:52:04 freeBSD11_RC1 kernel: (probe0:mrsas0:1:71:0): xpt_freeze_devq(1) From the CAM layer code, I think xpt_async() is called from below location @ probedone. Not sure why there is a Checksum failure for SATA drives every time. I will debug further. Any pointers ? Same h/w works fine on FREEBSD10.3. Is there any extra logic added after FreeBSD10.3 ? /* * XXX Do we need to do a TUR in order to ensure * that the device really hasn't changed??? */ if ((changed != 0) && ((softc->flags & PROBE_NO_ANNOUNCE) == 0)) xpt_async(AC_LOST_DEVICE, path, NULL); } if (serial_buf != NULL) free(serial_buf, M_CAMXPT); if (changed != 0) { /* * Now that we have all the necessary * information to safely perform transfer * negotiations... Controllers don't perform * any negotiation or tagged queuing until * after the first XPT_SET_TRAN_SETTINGS ccb is * received. So, on a new device, just retrieve * the user settings, and set them as the current * settings to set the device up. */ proberequestdefaultnegotiation(periph); xpt_release_ccb(done_ccb); /* * Perform a TUR to allow the controller to * perform any necessary transfer negotiation. */ PROBE_SET_ACTION(softc, PROBE_TUR_FOR_NEGOTIATION); xpt_schedule(periph, priority); goto out; } Thanks, Kashyap
This issue is fixed using below patch. Please review and let me know if this is a correct fix. Root cause is - "Checksum is updated using different serial number. One without removing extra spaces and another with additional spaces. Because of that, any rescan of ATA disk is defected as different ATA drive, so it is removed and re-added later. " Index: scsi_xpt.c =================================================================== --- scsi_xpt.c (revision 307137) +++ scsi_xpt.c (working copy) @@ -1600,8 +1600,8 @@ sizeof(struct scsi_inquiry_data)); if (have_serialnum) - MD5Update(&context, serial_buf->serial_num, - serial_buf->length); + MD5Update(&context, path->device->serial_num, + path->device->serial_num_len); MD5Final(digest, &context); if (bcmp(softc->digest, digest, 16) == 0)
Created attachment 175903 [details] Rescan ATA Device with valid MD5 checksum
Reassigning to Alan. Looks like his change broke this, and it would be good for him to double check the patch as well.
We've been hitting this same problem with mps at Isilon, and the patch fixes it for us.
A commit references this bug: Author: asomers Date: Thu Nov 17 20:42:56 UTC 2016 New revision: 308780 URL: https://svnweb.freebsd.org/changeset/base/308780 Log: Fix "camcontrol rescan" with SATA drives behind a SAS controller A bug in CAM's serial number hash logic resulted in SATA drives behind a SAS controller getting removed and readded anytime the drive was rescanned for any reason. PR: 212914 Submitted by: kadesai Reported by: kadesai Reviewed by: asomers, ken MFC after: 4 weeks Sponsored by: Spectra Logic Corp Changes: head/sys/cam/scsi/scsi_xpt.c
Fixed in head by r308780. I'll MFC to stable/11 in 4 weeks. stable/10 and stable/9 are not affected.
(In reply to commit-hook from comment #8) Thanks.! I will validate changes, but curious to know how customer should fix this issue ? Do we need to recommend to update stable-11 ? Any steps recommended to be followed if I have FreeBSD-11 release update1 installed ? Thanks, Kashyap
Upgrading from 11.0-RELEASE to stable/11 is pretty inconvenient, and it makes it hard to upgrade to 11.1-RELEASE in the future. Is there any kind of HBA setting that will cause serial numbers to be generated without spaces? Or cause no serial numbers to be generated? That would be a workaround.
(In reply to Alan Somers from comment #11) We don't have HBA setting which can generate serial number without space. In fact, this issue was also observed on some SAS Devices as well. So here is what I understood - RELEASE to stable update is not recommended. And also HBA setting is not possible to provide any workaround. We need to document and explain this to customer and make sure they are aware of this. Correct ? Or any more ideas of workaround ?
This fix could be incorporated into an EN update to 11.0, making it an easy update from 11.0-RELEASE. It definitely sounds like it's a serious enough bug to warrant an errata and fix. Please contact Glen Barber and work with him on it.
I confirm below test result. 1. Current FreeBSD11 Stable (below info), issue reproduce. -bash-4.1# svn info Path: . URL: http://svn.freebsd.org/base/stable/11 Repository Root: http://svn.freebsd.org/base Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f Revision: 309090 Node Kind: directory Schedule: normal Last Changed Author: emaste Last Changed Rev: 309075 Last Changed Date: 2016-11-23 19:45:00 -0500 (Wed, 23 Nov 2016) 2. After applying attached patch (r308780). Issue is resolved.
MFCed to stable/11 by r309273 on 28-Nov-2017. I'll contact re@ about an EN.