Bug 224496 - mpr and mps drivers seem to have issues with large Seagate drives
Summary: mpr and mps drivers seem to have issues with large Seagate drives
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-21 08:51 UTC by Christoph Bubel
Modified: 2019-11-03 08:53 UTC
CC List: 7 users

See Also:


Attachments
/var/log/messages during device resets (307.41 KB, text/plain)
2019-06-12 08:13 UTC, Matthias Pfaller

Description Christoph Bubel 2017-12-21 08:51:25 UTC
Over on the FreeNAS forums, several people have reported issues with large (10TB) Seagate drives (ST10000NM0016 and ST10000VN0004) and LSI controllers. Links to the threads:
https://forums.freenas.org/index.php?threads/lsi-avago-9207-8i-with-seagate-10tb-enterprise-st10000nm0016.58251/
https://forums.freenas.org/index.php?threads/synchronize-cache-command-timeout-error.55067/

I am using the ST10000NM0016 drives and I am getting the following errors on an LSI SAS2308 (mps driver) and on an LSI SAS3008 (mpr driver). This happens about once every one or two weeks, in low-load situations.

Here are the logs:

(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1010 command timeout cm 0xfffffe0000f7cda0 ccb 0xfffff8018198d800
(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000f7cda0
mps0: Sending reset from mpssas_send_abort for target ID 1
(da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 length 4096 SMID 959 terminated ioc 804b scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 1
(da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 
(da2:mps0:0:1:0): CAM status: CCB request completed with an error
(da2:mps0:0:1:0): Retrying command
(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mps0:0:1:0): CAM status: Command timeout
(da2:mps0:0:1:0): Retrying command
(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mps0:0:1:0): CAM status: SCSI Status Error
(da2:mps0:0:1:0): SCSI status: Check Condition
(da2:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mps0:0:1:0): Error 6, Retries exhausted
(da2:mps0:0:1:0): Invalidating pack

-------

(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 319 Aborting command 0xfffffe0000f54a90
mpr0: Sending reset from mprsas_send_abort for target ID 4
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 length 20480 SMID 320 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mpr0: Unfreezing devq for target ID 4
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: Command timeout
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 
(da1:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:4:0): SCSI status: Check Condition
(da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:4:0): Retrying command (per sense data)
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 length 16384 SMID 653 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 711 terminated ioc 804b loginfo 311(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 
10e03 scsi 0 state c xfer 0
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:4:0): SCSI status: Check Condition
(da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:4:0): Error 6, Retries exhausted
(da1:mpr0:0:4:0): Invalidating pack
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 797 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 753 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 846 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 828 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 662 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 738 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 778 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 747 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 954 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 923 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 932 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 926 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
Comment 1 Bane Ivosev 2019-03-23 07:03:46 UTC
We have the very same problem, but with WD Red disks. The system randomly reboots, sometimes after 20 days of uptime, with a different disk each time. It's our production NFS server, and this is now very frustrating.

Supermicro 5049p
64 GB ECC RAM
LSI 3008 IT mode
18x WD Red 4 TB

Mar 23 07:39:46 fap kernel:       (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Command timeout on target 25(0x001c), 60000 set, 60.107418057 elapsed
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: mpr0: Sending abort to target 25 for SMID 357
Mar 23 07:39:46 fap kernel:       (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Aborting command 0xfffffe00b7aa6130
Mar 23 07:39:46 fap kernel:       (pass19:mpr0:0:25:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 1182 Command timeout on target 25(0x001c), 60000 set, 60.217681679 elampr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: mpr0: Controller reported scsi ioc terminated tgt 25 SMID 1182 loginfo 31130000
Mar 23 07:39:46 fap kernel: mpr0: Abort failed for target 25, sending logical unit reset
Mar 23 07:39:46 fap kernel: mpr0: Sending logical unit reset to target 25 lun 0
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CCB request aborted by the host
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 0 more tries remain
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Error 5, Retries exhausted
Mar 23 07:39:46 fap smartd[95746]: Device: /dev/da17 [SAT], failed to read SMART Attribute Data
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): WRITE(10). CDB: 2a 00 09 4a 32 a8 00 00 08 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 3 more tries remain
Mar 23 07:43:19 fap syslogd: kernel boot file is /boot/kernel/kernel
Mar 23 07:43:19 fap kernel: ---<<BOOT>>---
Comment 2 Bane Ivosev 2019-03-23 07:05:51 UTC
Forgot to say, it's FreeBSD 12.0-RELEASE.
Comment 3 Matthias Pfaller 2019-06-12 08:13:45 UTC
Created attachment 205003
/var/log/messages during device resets
Comment 4 Matthias Pfaller 2019-06-12 08:15:20 UTC
We just configured a backup server with eight Seagate IronWolf (ST12000VN0007-2GS116) 12TB disks connected to a SAS2008:

Jun 12 08:51:35 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Jun 12 08:51:35 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Jun 12 08:51:35 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

After writing ~200GB to our pool it started resetting. I did a

sysctl dev.mps.0.debug_level=0x$((0x1+0x2+0x4+0x10+0x20))

The resulting trace is attached.
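
For anyone who wants to capture a similar trace, here is a minimal sketch of the sequence (assuming controller instance 0; the mask is just the same sum written out, and the individual flag bits should be documented in mps(4)):

# Note the current value so it can be restored afterwards:
sysctl dev.mps.0.debug_level
# Enable the extra debug bits used above (0x1+0x2+0x4+0x10+0x20 = 0x37):
sysctl dev.mps.0.debug_level=0x37
# Reproduce the problem, save /var/log/messages, then set debug_level back to the old value.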
Comment 5 Matthias Pfaller 2019-06-12 08:16:12 UTC
Comment on attachment 205003
/var/log/messages during device resets

Comment 6 Matthias Pfaller 2019-06-12 08:36:27 UTC
We are using FreeBSD 12.0-RELEASE:
FreeBSD nyx 12.0-RELEASE-p4 FreeBSD 12.0-RELEASE-p4 GENERIC  amd64
Comment 7 Bane Ivosev 2019-06-12 09:06:22 UTC
Just to append to my previous post from March: with the same hardware and same config, we reverted back to 11.1-RELEASE and everything has been working flawlessly for more than two months now.
Comment 8 Bane Ivosev 2019-06-12 09:23:36 UTC
And I don't think the problem is exclusive to Seagate 10TB drives. We have WD Red 4TB drives and see the same problem. We had the same situation with 11.2-RELEASE as well, and because 11.2 and 12.0 have the same mpr/mps driver version we decided to try 11.1.
Comment 9 Matthias Pfaller 2019-06-12 09:35:00 UTC
(In reply to Bane Ivosev from comment #8)
We have several other machines with SAS2008 controllers. All of them are running 11.1 and none of them shows these problems...
Comment 10 Daniel Shafer 2019-07-23 19:53:42 UTC
So I came across this same issue. It was causing my server to reboot several times a day due to kernel panics. It happens with both SAS9200 and 9300 controllers. I have 8 x 10TB Seagate IronWolf NAS drives.

I wanted to mention that for me there was a resolution. I added an Intel Optane 900p 280GB drive, set it up as a cache (L2ARC) device, and the problem entirely disappeared. My server ran for 20 days before I rebooted it last night to perform an upgrade.

So, I believe a workaround is to add a cache drive to your ZFS pool.

The Intel Optane 900p is a highly recommended cache drive for ZFS pools.
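
For reference, adding such a cache device to an existing pool is a single zpool command; a minimal sketch, assuming a pool named "tank" and an NVMe device at nvd0 (both names are placeholders, not taken from this report):

# Attach the device as an L2ARC cache vdev:
zpool add tank cache nvd0
# Confirm the cache vdev is listed:
zpool status tank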
Comment 11 Francois Baillargeon 2019-08-08 17:05:57 UTC
Following up on what Daniel Shafer says, we have the same issues on a FreeNAS deployment we did.

Everything is fine with our other pools that use drives smaller than 10TB, but one of our pools using 14TB drives exhibits this exact behavior.

For us this is a major show-stopper bug, since we can't use this pool reliably. Our vendor sent us a new HBA, a new server, etc. before I stumbled upon this bug listing.
Comment 12 n 2019-08-08 19:41:01 UTC
Hi,

We are seeing the same issue. 

Here is more information on our setup :

FreeNAS-11.2-U5
FreeBSD 11.2-STABLE amd64

We use 2 x (6x 14TB Seagate IronWolf drives)
We also have a 2TB Crucial SSD for L2ARC

The issue always comes up after 10-14 hours of heavy I/O.

Disk Model: 14TB Seagate ST14000VN0008


The drives are on 2 different LSI HBAs. The drives that fail are random across both HBAs.

Please let us know if you need more information on this; it is impacting our production load.

Thank you.

Log output for our latest errors :

> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 
> 00 00 10 00 00 length 8192 SMID 60 Aborting command 0xfffffe000171f640
> mpr1: Sending reset from mprsas_send_abort for target ID 20
> 	(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 length 4096 SMID 332 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00 
> length 4096 SMID 703 terminated ioc 804b loginfo 31130000 sc(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 si 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 
> 00 01 00 00 00 length 131072 SMID 510 terminated ioc 
> 804b(da30:mpr1:0:20:0): CAM status: CCB request completed with an 
> error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00  
> loginfo 31130000 scsi 0 state c xfer 0
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 01 00 00 00 length 131072 SMID 938 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 01 00 00 00 length 131072 SMID 839 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 01 00 00 00 length 131072 SMID 681 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 01 00 00 00 length 131072 SMID 647 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 01 00 00 00 length 131072 SMID 253 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 01 00 00 00 length 131072 SMID 109 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 01 00 00 00 length 131072 SMID 267 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 00 10 00 00 length 8192 SMID 506 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 00 10 00 00 length 8192 SMID 774 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00 length 0 SMID 281 terminated ioc 804b loginfo 31140000 scsi 0 
> state c xfer 0
> mpr1: Unfreezing devq for target ID 20
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: Command timeout
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00
> (da30:mpr1:0:20:0): CAM status: SCSI Status Error
> (da30:mpr1:0:20:0): SCSI status: Check Condition
> (da30:mpr1:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, 
> reset, or bus device reset occurred)
> (da30:mpr1:0:20:0): Retrying command (per sense data)
> ctl_datamove: tag 0x855ffd44 on (2:3:106) aborted
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00 
> length 8192 SMID 486 Aborting command 0xfffffe0001745aa0
> mpr1: Sending reset from mprsas_send_abort for target ID 22
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 length 4096 SMID 105 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 
> length 81920 SMID 467 terminated ioc 804b loginfo 31130000 s(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 csi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 
> length 131072 SMID 959 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 
> length 131072 SMID 346 terminated ioc 804b loginfo 31130000 (da32:scsi 
> 0 state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00 
> length 131072 SMID 627 terminated ioc 804b loginfo 31130000 scsi 0 
> state c xfer 0
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00 
> length 131072 SMID 455 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00 
> length 131072 SMID 951 terminated ioc 804b loginfo 31130000 (da32:scsi 
> 0 state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00 
> length 131072 SMID 822 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00 
> length 131072 SMID 155 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00 
> length 8192 SMID 495 terminated ioc 804b loginfo 31130000 sc(da32:si 0 
> state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00 
> length 8192 SMID 494 terminated ioc 804b loginfo 31130000 sc(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 si 0 state c xfer 0
> 	(da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00 length 0 SMID 726 terminated ioc 804b loginfo 
> 3(da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> 1140000 scsi 0 state c xfer 0
> mpr1: Unfreezing devq for target ID 22
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: Command timeout
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00
> (da32:mpr1:0:22:0): CAM status: SCSI Status Error
> (da32:mpr1:0:22:0): SCSI status: Check Condition
> (da32:mpr1:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, 
> reset, or bus device reset occurred)
> (da32:mpr1:0:22:0): Retrying command (per sense data)
Comment 13 Christoph Bubel 2019-08-13 16:57:16 UTC
(In reply to Daniel Shafer from comment #10)

I can confirm the workaround: no errors since I added an L2ARC to the pool.
Comment 14 Francois Baillargeon 2019-08-13 17:21:05 UTC
We usually have a cache on affected pools, and sadly we still have issues when I/O gets past the L2ARC and hits the disks; think long sequential reads.
Comment 15 Paul Thornton 2019-08-13 19:55:22 UTC
I too have run into this issue on a NAS box, once it started taking on any kind of load.

Running 12.0-RELEASE-p3

The server contains 8x Seagate IronWolf Pro 10TB SATA drives on an Avago 3008 HBA; basically 8 of these:

da2 at mpr1 bus 0 scbus13 target 12 lun 0
da2: <ATA ST10000NE0004-1Z EN01> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZA237AVY
da2: 1200.000MB/s transfers
da2: Command Queueing enabled
da2: 9537536MB (19532873728 512 byte sectors)

Driver versions:
dev.mpr.1.driver_version: 18.03.00.00-fbsd
dev.mpr.1.firmware_version: 15.00.03.00
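
For comparison with other systems, the same information can be pulled for every mpr/mps controller instance in one go; a small sketch using the sysctl names shown above:

# Print driver and firmware versions for all mpr and mps instances:
sysctl -a | grep -E 'dev\.(mpr|mps)\.[0-9]+\.(driver|firmware)_version'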

These drives are configured in a ZFS RAID10 setup (in case that datapoint matters):
        NAME         STATE     READ WRITE CKSUM
        data0        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            da2.eli  ONLINE       0     0     0
            da3.eli  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            da4.eli  ONLINE       0     0     0
            da5.eli  ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            da6.eli  ONLINE       0     0     0
            da7.eli  ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            da8.eli  ONLINE       0     0     0
            da9.eli  ONLINE       0     0     0
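
Purely for illustration, a pool with this shape would typically be created along these lines; the device and pool names below mirror the listing above, but the geli parameters are assumptions and none of this is taken from the report:

# Hypothetical example: initialise and attach geli on each disk, then build four mirrored pairs.
# (The geli init/attach step would be repeated for da2 through da9 before creating the pool.)
geli init -s 4096 /dev/da2 && geli attach /dev/da2
zpool create data0 \
    mirror da2.eli da3.eli \
    mirror da4.eli da5.eli \
    mirror da6.eli da7.eli \
    mirror da8.eli da9.eli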

I currently get about 25 days between reboots.  The machine hangs and (I'm guessing here) kernel panics and restarts - I don't have the panic information, but log messages look very similar to what other people are seeing:

Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Command timeout on
 target 12(0x000c), 60000 set, 60.703976195 elapsed
Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:17 nas1a kernel: mpr1: Sending abort to target 12 for SMID 1484
Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Aborting command 0
xfffffe00bad0b540
Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1792 Command ti
meout on target 12(0x000c), 60000 set, 60.707504796 elapsed
Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:18 nas1a kernel: mpr1: Controller reported scsi ioc terminated tgt 12 SMID 1792 loginfo 31140000
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00
Jul 20 11:14:18 nas1a kernel: mpr1: Abort failed for target 12, sending logical unit reset
Jul 20 11:14:18 nas1a kernel: mpr1: (da2:mpr1:0:12:0): CAM status: CCB request aborted by the host
Jul 20 11:14:18 nas1a kernel: Sending logical unit reset to target 12 lun 0
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 3 more tries remain
Jul 20 11:14:18 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CCB request completed with an error
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 0 more tries remain
Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Error 5, Retries exhausted
Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 2 more tries remain

[reboot happens here]

And the most recent one, today:

Aug 13 08:58:55 nas1a kernel:   (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Command tim
eout on target 16(0x0010), 60000 set, 60.109683189 elapsed
Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name (    )
Aug 13 08:58:55 nas1a kernel: mpr1: Sending abort to target 16 for SMID 998
Aug 13 08:58:55 nas1a kernel:   (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Aborting co
mmand 0xfffffe00bacdfaa0
Aug 13 08:58:55 nas1a kernel: mpr1: Abort failed for target 16, sending logical unit reset
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 08:58:55 nas1a kernel: mpr1: Sending logical unit reset to target 16 lun 0
Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name (    )
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CCB request aborted by the host
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 0 more tries remain
Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Error 5, Retries exhausted
Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): WRITE(10). CDB: 2a 00 92 b0 7d 70 00 00 48 00
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 3 more tries remain

[reboot happens here]

After the reboot, there's no problem and everything works fine.  ZFS never marks the pool as degraded or unavailable.

Looking at the FreeNAS threads, this seems to have been going on for ages. Can anyone confirm that a downgrade to 11.1 does work around this issue, as that seems to be the only thing that might help?
Comment 16 Bane Ivosev 2019-08-13 20:46:07 UTC
Hi Paul, I had the exact same symptoms as you, but with 4TB WD Reds, and for me with 11.1 everything has been working flawlessly for more than 130 days now.
Comment 17 Matthias Pfaller 2019-08-14 09:23:45 UTC
(In reply to Francois Baillargeon from comment #14)
We have cache and log devices:
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
k-zb1               87.2T  35.5T  51.8T        -         -    20%    40%  1.91x  ONLINE  -
  raidz2            87.2T  35.5T  51.8T        -         -    20%    40%
    gpt/k-zb1-0         -      -      -        -         -      -      -
    gpt/k-zb1-1         -      -      -        -         -      -      -
    gpt/k-zb1-2         -      -      -        -         -      -      -
    gpt/k-zb1-3         -      -      -        -         -      -      -
    gpt/k-zb1-4         -      -      -        -         -      -      -
    gpt/k-zb1-5         -      -      -        -         -      -      -
    gpt/k-zb1-6         -      -      -        -         -      -      -
    gpt/k-zb1-7         -      -      -        -         -      -      -
log                     -      -      -         -      -      -
  mirror            11.5G      0  11.5G        -         -     0%     0%
    gpt/k-zb1-zil0      -      -      -        -         -      -      -
    gpt/k-zb1-zil1      -      -      -        -         -      -      -
cache                   -      -      -         -      -      -
  gpt/k-zb1-cache0  80.0G  29.9G  50.1G        -         -     0%    37%
  gpt/k-zb1-cache1  80.0G  30.0G  50.0G        -         -     0%    37%

The only solution for us was to use the onboard SATA ports instead of the LSI controller. We have to keep the other machines that need the LSI controllers (not enough ports on the mainboard) at 11.1 :-(
Comment 18 Paul Thornton 2019-08-14 15:08:48 UTC
(In reply to Bane Ivosev from comment #16)

Thanks for the confirmation.  I'm in the process of downgrading the affected NAS machines (they are an identical pair) from 12.0 to 11.1. I need to test this in the lab first, but will report back once we have the production machine on the older release.

I need to do this before it next crashes and reboots, of course!
Comment 19 Paul Thornton 2019-10-02 14:53:18 UTC
An update for all.

I downgraded our production servers from 12.0 to 11.1-RELEASE-p15 and for 5 weeks they have worked without any problems.

Previously we saw a problem after 4 weeks, then another after 1 week - so whilst there is a chance I have not waited long enough, I think this definitely fixes my problem.

Paul.
Comment 20 freebsd 2019-11-03 08:53:37 UTC
(In reply to Paul Thornton from comment #19)

Thanks for sharing, Paul. I took a slightly different approach: I have been running stable for a few months now with a FreeBSD 12.0-RELEASE kernel but using the FreeBSD 11.2 LSI mps driver. 12.1 brings many updates to the 12.0 mps driver; hopefully those address these problems. I will test once it's released.

/Sharad
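
The report does not say how the 11.2 mps driver was combined with a 12.0 kernel, so the following is only one plausible way to do it, sketched under the assumption that the 11.2 driver sources are dropped into the 12.0 source tree and the kernel is rebuilt (paths are standard FreeBSD locations; the 11.2 source path is a placeholder):

# Replace the 12.0 mps driver sources with the 11.2 versions (trailing slash copies the directory contents):
cp -R /path/to/releng-11.2-src/sys/dev/mps/ /usr/src/sys/dev/mps/
# Rebuild and install the kernel (GENERIC already includes "device mps"):
cd /usr/src && make buildkernel KERNCONF=GENERIC && make installkernel KERNCONF=GENERIC
# After a reboot, confirm which driver version is running (as shown for mpr earlier in this bug):
sysctl dev.mps.0.driver_version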