Bug 224496 - mpr and mps drivers seem to have issues with large Seagate drives
Summary: mpr and mps drivers seem to have issues with large Seagate drives
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-STABLE
Hardware: amd64 Any
: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-21 08:51 UTC by Christoph Bubel
Modified: 2020-05-13 07:41 UTC (History)
12 users

See Also:


Attachments
/var/log/messages during device resets (307.41 KB, text/plain)
2019-06-12 08:13 UTC, Matthias Pfaller

Description Christoph Bubel 2017-12-21 08:51:25 UTC
Over on the FreeNAS forums, several people have reported issues with large (10 TB) Seagate drives (ST10000NM0016 and ST10000VN0004) and LSI controllers. Links to the threads:
https://forums.freenas.org/index.php?threads/lsi-avago-9207-8i-with-seagate-10tb-enterprise-st10000nm0016.58251/
https://forums.freenas.org/index.php?threads/synchronize-cache-command-timeout-error.55067/

I am using the ST10000NM0016 drives and I am getting the following errors on an LSI SAS2308 (mps driver) and on an LSI SAS3008 (mpr driver). This happens about once every one or two weeks, in low-load situations.

Here are the logs:

(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1010 command timeout cm 0xfffffe0000f7cda0 ccb 0xfffff8018198d800
(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000f7cda0
mps0: Sending reset from mpssas_send_abort for target ID 1
(da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 length 4096 SMID 959 terminated ioc 804b scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 1
(da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 
(da2:mps0:0:1:0): CAM status: CCB request completed with an error
(da2:mps0:0:1:0): Retrying command
(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mps0:0:1:0): CAM status: Command timeout
(da2:mps0:0:1:0): Retrying command
(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mps0:0:1:0): CAM status: SCSI Status Error
(da2:mps0:0:1:0): SCSI status: Check Condition
(da2:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mps0:0:1:0): Error 6, Retries exhausted
(da2:mps0:0:1:0): Invalidating pack

-------

(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 319 Aborting command 0xfffffe0000f54a90
mpr0: Sending reset from mprsas_send_abort for target ID 4
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 length 20480 SMID 320 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mpr0: Unfreezing devq for target ID 4
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: Command timeout
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 
(da1:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:4:0): SCSI status: Check Condition
(da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:4:0): Retrying command (per sense data)
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 length 16384 SMID 653 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 711 terminated ioc 804b loginfo 311(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 
10e03 scsi 0 state c xfer 0
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:4:0): SCSI status: Check Condition
(da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:4:0): Error 6, Retries exhausted
(da1:mpr0:0:4:0): Invalidating pack
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 797 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 753 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 846 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 828 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 662 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 738 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 778 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 747 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 954 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 923 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 932 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 926 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
Comment 1 Bane Ivosev 2019-03-23 07:03:46 UTC
We have the very same problem, but with WD Red disks. The system randomly reboots, sometimes after 20 days of uptime, and a different disk is implicated every time. It's our production NFS server, so this is very frustrating.

Supermicro 5049p
64 GB ECC RAM
LSI 3008 IT mode
18x WD Red 4 TB

Mar 23 07:39:46 fap kernel:       (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Command timeout on target 25(0x001c), 60000 set, 60.107418057 elapsed
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: mpr0: Sending abort to target 25 for SMID 357
Mar 23 07:39:46 fap kernel:       (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Aborting command 0xfffffe00b7aa6130
Mar 23 07:39:46 fap kernel:       (pass19:mpr0:0:25:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 1182 Command timeout on target 25(0x001c), 60000 set, 60.217681679 elampr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: mpr0: Controller reported scsi ioc terminated tgt 25 SMID 1182 loginfo 31130000
Mar 23 07:39:46 fap kernel: mpr0: Abort failed for target 25, sending logical unit reset
Mar 23 07:39:46 fap kernel: mpr0: Sending logical unit reset to target 25 lun 0
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CCB request aborted by the host
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 0 more tries remain
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Error 5, Retries exhausted
Mar 23 07:39:46 fap smartd[95746]: Device: /dev/da17 [SAT], failed to read SMART Attribute Data
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): WRITE(10). CDB: 2a 00 09 4a 32 a8 00 00 08 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 3 more tries remain
Mar 23 07:43:19 fap syslogd: kernel boot file is /boot/kernel/kernel
Mar 23 07:43:19 fap kernel: ---<<BOOT>>---
Comment 2 Bane Ivosev 2019-03-23 07:05:51 UTC
Forgot to say: it's FreeBSD 12-RELEASE.
Comment 3 Matthias Pfaller 2019-06-12 08:13:45 UTC
Created attachment 205003 [details]
/var/log/messages during device resets
Comment 4 Matthias Pfaller 2019-06-12 08:15:20 UTC
We just configured a backup server with eight Seagate IronWolf (ST12000VN0007-2GS116) 12 TB disks connected to a SAS2008:

Jun 12 08:51:35 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Jun 12 08:51:35 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Jun 12 08:51:35 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

After writing ~200 GB to our pool, it started resetting. I ran:

sysctl dev.mps.0.debug_level=0x$((0x1+0x2+0x4+0x10+0x20))

The resulting trace is attached.
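For reference, the debug mask used above can be reconstructed with a minimal shell sketch. The individual bits select mps(4) debug message classes; which class each bit enables is documented in the man page and not restated here:

```shell
# Recompute the debug mask used above: 0x1 + 0x2 + 0x4 + 0x10 + 0x20 = 0x37.
# Each bit enables one mps(4) debug message class; see the man page for
# the mapping of bits to classes.
MASK=$((0x1 + 0x2 + 0x4 + 0x10 + 0x20))
printf 'dev.mps.0.debug_level=0x%x\n' "$MASK"
# On the affected host you would then apply it with:
#   sysctl dev.mps.0.debug_level=0x37
```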
Comment 5 Matthias Pfaller 2019-06-12 08:16:12 UTC
Comment on attachment 205003 [details]
/var/log/messages during device resets

Comment 6 Matthias Pfaller 2019-06-12 08:36:27 UTC
We are using FreeBSD 12.0-RELEASE:
FreeBSD nyx 12.0-RELEASE-p4 FreeBSD 12.0-RELEASE-p4 GENERIC  amd64
Comment 7 Bane Ivosev 2019-06-12 09:06:22 UTC
Just to append to my previous post from March: same hardware and same config, we reverted back to 11.1-RELEASE and everything has been working flawlessly for more than two months now.
Comment 8 Bane Ivosev 2019-06-12 09:23:36 UTC
And I don't think the problem is exclusive to Seagate 10TB drives. We have WD Red 4TB drives and see the same problem. We saw the same situation with 11.2-RELEASE as well, and because 11.2 and 12.0 have the same mpr/mps driver version, we decided to try 11.1.
Comment 9 Matthias Pfaller 2019-06-12 09:35:00 UTC
(In reply to Bane Ivosev from comment #8)
We have several other machines with SAS2008 controllers. All of them are running 11.1 and none of them shows these problems...
Comment 10 Daniel Shafer 2019-07-23 19:53:42 UTC
I came across this same issue. It was causing my server to reboot several times a day due to kernel panics. It happens with both SAS9200 and SAS9300 controllers. I have 8 x 10TB Seagate IronWolf NAS drives.

I wanted to mention that for me there was a workaround. I added an Intel Optane 900p 280GB drive, set it up as a cache/L2ARC device, and the problem disappeared entirely. My server ran for 20 days before I rebooted it last night to perform an upgrade.

So I believe a workaround is to add a cache drive to your ZFS pool.

The Intel Optane 900p is a highly recommended cache drive for ZFS pools.
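For anyone wanting to try this workaround, a minimal sketch of adding a cache vdev follows. The pool name (tank) and device name (nvd0) are placeholders, not taken from this report; the script only prints the command unless DRYRUN=0 is set:

```shell
# Placeholder names -- substitute your actual pool and cache device.
POOL="tank"
CACHEDEV="nvd0"
CMD="zpool add $POOL cache $CACHEDEV"
if [ "${DRYRUN:-1}" = "1" ]; then
    # Default: just show what would be run.
    echo "$CMD"
else
    $CMD
fi
# Afterwards, "zpool status $POOL" should list the device under "cache".
```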
Comment 11 Francois Baillargeon 2019-08-08 17:05:57 UTC
Following up on what Daniel Shafer says: we have the same issues on a FreeNAS deployment we did.

Everything is fine with our other pools, which use drives smaller than 10 TB, but one of our pools using 14 TB drives exhibits this exact behavior.

For us this is a major show-stopper bug, since we can't use this pool reliably. Our vendor sent us a new HBA, a new server, etc., before I stumbled upon this bug listing.
Comment 12 n 2019-08-08 19:41:01 UTC
Hi,

We are seeing the same issue. 

Here is more information on our setup :

FreeNAS-11.2-U5
FreeBSD 11.2-STABLE amd64

We use 2 x (6 x 14 TB Seagate IronWolf drives)
We also have a 2 TB Crucial SSD for L2ARC

The issue always comes up after 10-14 hours of heavy I/O.

Disk model: 14 TB Seagate ST14000VN0008


The drives are on 2 different LSI HBAs. The drives that fail are random across both HBAs.

Please let us know if you need more information on this; it is impacting our production load.

Thank you.

Log output for our latest errors :

> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 
> 00 00 10 00 00 length 8192 SMID 60 Aborting command 0xfffffe000171f640
> mpr1: Sending reset from mprsas_send_abort for target ID 20
> 	(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 length 4096 SMID 332 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00 
> length 4096 SMID 703 terminated ioc 804b loginfo 31130000 sc(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 si 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 
> 00 01 00 00 00 length 131072 SMID 510 terminated ioc 
> 804b(da30:mpr1:0:20:0): CAM status: CCB request completed with an 
> error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00  
> loginfo 31130000 scsi 0 state c xfer 0
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 01 00 00 00 length 131072 SMID 938 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 01 00 00 00 length 131072 SMID 839 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 01 00 00 00 length 131072 SMID 681 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 01 00 00 00 length 131072 SMID 647 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 01 00 00 00 length 131072 SMID 253 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 01 00 00 00 length 131072 SMID 109 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 01 00 00 00 length 131072 SMID 267 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 00 10 00 00 length 8192 SMID 506 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 00 10 00 00 length 8192 SMID 774 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00 length 0 SMID 281 terminated ioc 804b loginfo 31140000 scsi 0 
> state c xfer 0
> mpr1: Unfreezing devq for target ID 20
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: Command timeout
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00
> (da30:mpr1:0:20:0): CAM status: SCSI Status Error
> (da30:mpr1:0:20:0): SCSI status: Check Condition
> (da30:mpr1:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, 
> reset, or bus device reset occurred)
> (da30:mpr1:0:20:0): Retrying command (per sense data)
> ctl_datamove: tag 0x855ffd44 on (2:3:106) aborted
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00 
> length 8192 SMID 486 Aborting command 0xfffffe0001745aa0
> mpr1: Sending reset from mprsas_send_abort for target ID 22
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 length 4096 SMID 105 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 
> length 81920 SMID 467 terminated ioc 804b loginfo 31130000 s(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 csi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 
> length 131072 SMID 959 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 
> length 131072 SMID 346 terminated ioc 804b loginfo 31130000 (da32:scsi 
> 0 state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00 
> length 131072 SMID 627 terminated ioc 804b loginfo 31130000 scsi 0 
> state c xfer 0
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00 
> length 131072 SMID 455 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00 
> length 131072 SMID 951 terminated ioc 804b loginfo 31130000 (da32:scsi 
> 0 state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00 
> length 131072 SMID 822 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00 
> length 131072 SMID 155 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00 
> length 8192 SMID 495 terminated ioc 804b loginfo 31130000 sc(da32:si 0 
> state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00 
> length 8192 SMID 494 terminated ioc 804b loginfo 31130000 sc(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 si 0 state c xfer 0
> 	(da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00 length 0 SMID 726 terminated ioc 804b loginfo 
> 3(da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> 1140000 scsi 0 state c xfer 0
> mpr1: Unfreezing devq for target ID 22
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: Command timeout
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00
> (da32:mpr1:0:22:0): CAM status: SCSI Status Error
> (da32:mpr1:0:22:0): SCSI status: Check Condition
> (da32:mpr1:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, 
> reset, or bus device reset occurred)
> (da32:mpr1:0:22:0): Retrying command (per sense data)
Comment 13 Christoph Bubel 2019-08-13 16:57:16 UTC
(In reply to Daniel Shafer from comment #10)

I can confirm the workaround, no errors since i added L2ARC to the pool.
Comment 14 Francois Baillargeon 2019-08-13 17:21:05 UTC
We usually have a cache on affected pools, and sadly we still have issues when reads miss the L2ARC and hit the disks (think long sequential reads).
Comment 15 Paul Thornton 2019-08-13 19:55:22 UTC
I too have run into this issue on a NAS box, once it started taking on any kind of load.

Running 12.0-RELEASE p3

The server contains 8 x Seagate IronWolf Pro 10 TB SATA drives on an Avago 3008 HBA; 8 of these, basically:

da2 at mpr1 bus 0 scbus13 target 12 lun 0
da2: <ATA ST10000NE0004-1Z EN01> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZA237AVY
da2: 1200.000MB/s transfers
da2: Command Queueing enabled
da2: 9537536MB (19532873728 512 byte sectors)

Driver versions:
dev.mpr.1.driver_version: 18.03.00.00-fbsd
dev.mpr.1.firmware_version: 15.00.03.00

These drives are configured in a ZFS RAID10 setup (in case that datapoint matters):
        NAME         STATE     READ WRITE CKSUM
        data0        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            da2.eli  ONLINE       0     0     0
            da3.eli  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            da4.eli  ONLINE       0     0     0
            da5.eli  ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            da6.eli  ONLINE       0     0     0
            da7.eli  ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            da8.eli  ONLINE       0     0     0
            da9.eli  ONLINE       0     0     0

I currently get about 25 days between reboots. The machine hangs and (I'm guessing here) kernel panics and restarts; I don't have the panic information, but the log messages look very similar to what other people are seeing:

Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Command timeout on
 target 12(0x000c), 60000 set, 60.703976195 elapsed
Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:17 nas1a kernel: mpr1: Sending abort to target 12 for SMID 1484
Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Aborting command 0
xfffffe00bad0b540
Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1792 Command ti
meout on target 12(0x000c), 60000 set, 60.707504796 elapsed
Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:18 nas1a kernel: mpr1: Controller reported scsi ioc terminated tgt 12 SMID 1792 loginfo 31140000
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00
Jul 20 11:14:18 nas1a kernel: mpr1: Abort failed for target 12, sending logical unit reset
Jul 20 11:14:18 nas1a kernel: mpr1: (da2:mpr1:0:12:0): CAM status: CCB request aborted by the host
Jul 20 11:14:18 nas1a kernel: Sending logical unit reset to target 12 lun 0
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 3 more tries remain
Jul 20 11:14:18 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CCB request completed with an error
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 0 more tries remain
Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Error 5, Retries exhausted
Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 2 more tries remain

[reboot happens here]

And the most recent one, today:

Aug 13 08:58:55 nas1a kernel:   (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Command tim
eout on target 16(0x0010), 60000 set, 60.109683189 elapsed
Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name (    )
Aug 13 08:58:55 nas1a kernel: mpr1: Sending abort to target 16 for SMID 998
Aug 13 08:58:55 nas1a kernel:   (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Aborting co
mmand 0xfffffe00bacdfaa0
Aug 13 08:58:55 nas1a kernel: mpr1: Abort failed for target 16, sending logical unit reset
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 08:58:55 nas1a kernel: mpr1: Sending logical unit reset to target 16 lun 0
Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name (    )
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CCB request aborted by the host
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 0 more tries remain
Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Error 5, Retries exhausted
Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): WRITE(10). CDB: 2a 00 92 b0 7d 70 00 00 48 00
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 3 more tries remain

[reboot happens here]

After the reboot, there's no problem and everything works fine.  ZFS never marks the pool as degraded or unavailable.

Looking at the FreeNAS threads, this seems to have been going on for ages. Can anyone confirm that a downgrade to 11.1 works around this issue, as that seems to be the only thing that might help?
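One diagnostic knob relates to the "60000 set" figure in these logs: that is the da(4) default command timeout (60 s, expressed in milliseconds). Raising it is not a fix, since the drive still stops answering, but it can be used as an experiment to see whether the commands eventually complete. A sketch, with an arbitrary example value (untested here):

```
# /boot/loader.conf -- raise the CAM da(4) I/O timeout from the 60 s
# default to 90 s. Experimental; this only delays the abort/reset path,
# it does not address why the drive stops responding.
kern.cam.da.default_timeout="90"
```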
Comment 16 Bane Ivosev 2019-08-13 20:46:07 UTC
Hi Paul, I had the exact same symptoms as you, but with 4TB WD Reds, and for me with 11.1 everything has worked flawlessly for more than 130 days now.
Comment 17 Matthias Pfaller 2019-08-14 09:23:45 UTC
(In reply to Francois Baillargeon from comment #14)
We have cache and log devices:
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
k-zb1               87.2T  35.5T  51.8T        -         -    20%    40%  1.91x  ONLINE  -
  raidz2            87.2T  35.5T  51.8T        -         -    20%    40%
    gpt/k-zb1-0         -      -      -        -         -      -      -
    gpt/k-zb1-1         -      -      -        -         -      -      -
    gpt/k-zb1-2         -      -      -        -         -      -      -
    gpt/k-zb1-3         -      -      -        -         -      -      -
    gpt/k-zb1-4         -      -      -        -         -      -      -
    gpt/k-zb1-5         -      -      -        -         -      -      -
    gpt/k-zb1-6         -      -      -        -         -      -      -
    gpt/k-zb1-7         -      -      -        -         -      -      -
log                     -      -      -         -      -      -
  mirror            11.5G      0  11.5G        -         -     0%     0%
    gpt/k-zb1-zil0      -      -      -        -         -      -      -
    gpt/k-zb1-zil1      -      -      -        -         -      -      -
cache                   -      -      -         -      -      -
  gpt/k-zb1-cache0  80.0G  29.9G  50.1G        -         -     0%    37%
  gpt/k-zb1-cache1  80.0G  30.0G  50.0G        -         -     0%    37%

The only solution for us was to use the onboard SATA ports instead of the LSI controller. We have to keep the other machines that need the LSI controllers (not enough ports on the mainboard) at 11.1 :-(
Comment 18 Paul Thornton 2019-08-14 15:08:48 UTC
(In reply to Bane Ivosev from comment #16)

Thanks for the confirmation.  I'm in the process of downgrading the affected NAS machine (they are an identical pair) from 12.0 to 11.1.  I need to test this in the lab first, but will report back once we have the production machine on the older release.

I need to do this before it next crashes and reboots, of course!
Comment 19 Paul Thornton 2019-10-02 14:53:18 UTC
An update for all.

I downgraded our production servers from 12.0 to 11.1-RELEASE-p15 and for 5 weeks they have worked without any problems.

Previously we saw a problem after 4 weeks, then another after 1 week - so whilst there is a chance I have not waited long enough, I think this definitely fixes my problem.

Paul.
Comment 20 Sharad Ahlawat 2019-11-03 08:53:37 UTC
(In reply to Paul Thornton from comment #19)

Thanks for sharing, Paul. I took a slightly different approach: I have been running stable for a few months now with a FreeBSD 12.0-RELEASE kernel but with the FreeBSD 11.2 LSI mps driver. 12.1 brings in many updates to the 12.0 mps driver; hopefully those address these problems. I will test once it's released.

/Sharad
Comment 21 Bane Ivosev 2019-11-28 16:10:15 UTC
(In reply to freebsd from comment #20)

I've been running 12.1-RELEASE on the same hardware for 10 days now. Everything is OK. I'll report back in about a month. Success, I hope.
Comment 22 Matthias Pfaller 2019-12-05 13:35:48 UTC
I'm running FreeBSD 12.1 now. Yesterday we reattached the drives to the LSI controller. Problems showed up during the next backup :-(

Kernel log:
Dec  4 10:13:30 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Dec  4 10:13:30 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Dec  4 10:13:30 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
Dec  4 10:33:08 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Dec  4 10:33:08 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Dec  4 10:33:08 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
Dec  4 10:33:08 nyx kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0
Dec  4 10:33:08 nyx kernel: da1 at mps0 bus 0 scbus0 target 3 lun 0
Dec  4 10:33:08 nyx kernel: da2 at mps0 bus 0 scbus0 target 4 lun 0
Dec  4 10:33:08 nyx kernel: da4 at mps0 bus 0 scbus0 target 6 lun 0
Dec  4 10:33:08 nyx kernel: da5 at mps0 bus 0 scbus0 target 7 lun 0
Dec  4 10:33:08 nyx kernel: da6 at mps0 bus 0 scbus0 target 8 lun 0
Dec  4 10:33:08 nyx kernel: da3 at mps0 bus 0 scbus0 target 5 lun 0
Dec  4 10:33:08 nyx kernel: da7 at mps0 bus 0 scbus0 target 9 lun 0
Dec  4 22:42:34 nyx kernel:     (da4:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 45 37 b3 48 00 00 00 08 00 00 length 4096 SMID 1583 Command timeout on target 6(0x0010) 60000 set, 60.66782887 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 6 for SMID 1583
Dec  4 22:42:34 nyx kernel:     (da4:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 45 37 b3 48 00 00 00 08 00 00 length 4096 SMID 1583 Aborting command 0xfffffe00f9364f28
Dec  4 22:42:34 nyx kernel:     (da7:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 03 60 06 f6 c8 00 00 00 18 00 00 length 12288 SMID 925 Command timeout on target 9(0x000d) 60000 set, 60.23115994 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 9 for SMID 925
Dec  4 22:42:34 nyx kernel:     (da7:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 03 60 06 f6 c8 00 00 00 18 00 00 length 12288 SMID 925 Aborting command 0xfffffe00f932daf8
Dec  4 22:42:34 nyx kernel:     (da0:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1033 Command timeout on target 2(0x000c) 60000 set, 60.24094823 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 2 for SMID 1033
Dec  4 22:42:34 nyx kernel:     (da0:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1033 Aborting command 0xfffffe00f9336c18
Dec  4 22:42:34 nyx kernel:     (da1:mps0:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1440 Command timeout on target 3(0x000b) 60000 set, 60.24535090 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 3 for SMID 1440
Dec  4 22:42:34 nyx kernel:     (da1:mps0:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1440 Aborting command 0xfffffe00f9358f00
Dec  4 22:42:34 nyx kernel:     (da2:mps0:0:4:0): WRITE(10). CDB: 2a 00 24 36 61 98 00 00 18 00 length 12288 SMID 1472 Command timeout on target 4(0x000a) 60000 set, 60.24982318 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 4 for SMID 1472
Dec  4 22:42:34 nyx kernel:     (da2:mps0:0:4:0): WRITE(10). CDB: 2a 00 24 36 61 98 00 00 18 00 length 12288 SMID 1472 Aborting command 0xfffffe00f935ba00
Dec  4 22:42:34 nyx kernel:     (da5:mps0:0:7:0): WRITE(10). CDB: 2a 00 24 36 61 10 00 00 a0 00 length 81920 SMID 507 Command timeout on target 7(0x000f) 60000 set, 60.25666047 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 7 for SMID 507
Dec  4 22:42:34 nyx kernel:     (da5:mps0:0:7:0): WRITE(10). CDB: 2a 00 24 36 61 10 00 00 a0 00 length 81920 SMID 507 Aborting command 0xfffffe00f930a948
Dec  4 22:42:40 nyx kernel:     (xpt0:mps0:0:7:0): SMID 6 task mgmt 0xfffffe00f92e0810 timed out
Dec  4 22:42:40 nyx kernel: mps0: Reinitializing controller
Dec  4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 6
Dec  4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 9
Dec  4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 2

... lots more
Just the reinitialization messages:
Dec  4 22:42:40 nyx kernel: mps0: Reinitializing controller
Dec  4 23:01:27 nyx kernel: mps0: Reinitializing controller
Dec  4 23:40:55 nyx kernel: mps0: Reinitializing controller
Dec  4 23:47:49 nyx kernel: mps0: Reinitializing controller
Dec  4 23:58:02 nyx kernel: mps0: Reinitializing controller
Dec  5 00:17:33 nyx kernel: mps0: Reinitializing controller
Dec  5 00:20:18 nyx kernel: mps0: Reinitializing controller
Dec  5 00:21:33 nyx kernel: mps0: Reinitializing controller
Dec  5 00:24:30 nyx kernel: mps0: Reinitializing controller
Dec  5 00:26:40 nyx kernel: mps0: Reinitializing controller
Dec  5 00:29:30 nyx kernel: mps0: Reinitializing controller
Dec  5 00:33:10 nyx kernel: mps0: Reinitializing controller
Dec  5 00:34:56 nyx kernel: mps0: Reinitializing controller
Dec  5 00:42:16 nyx kernel: mps0: Reinitializing controller
Dec  5 00:46:34 nyx kernel: mps0: Reinitializing controller
Dec  5 00:48:16 nyx kernel: mps0: Reinitializing controller
Dec  5 00:58:18 nyx kernel: mps0: Reinitializing controller
Dec  5 01:00:55 nyx kernel: mps0: Reinitializing controller
Dec  5 01:11:38 nyx kernel: mps0: Reinitializing controller
Dec  5 01:15:39 nyx kernel: mps0: Reinitializing controller
Dec  5 01:22:00 nyx kernel: mps0: Reinitializing controller
Dec  5 01:23:17 nyx kernel: mps0: Reinitializing controller
Dec  5 01:35:25 nyx kernel: mps0: Reinitializing controller
Dec  5 01:37:44 nyx kernel: mps0: Reinitializing controller
Dec  5 01:44:59 nyx kernel: mps0: Reinitializing controller
Dec  5 01:52:05 nyx kernel: mps0: Reinitializing controller
Dec  5 01:54:33 nyx kernel: mps0: Reinitializing controller
Dec  5 01:56:06 nyx kernel: mps0: Reinitializing controller
Dec  5 02:11:14 nyx kernel: mps0: Reinitializing controller
Dec  5 02:27:42 nyx kernel: mps0: Reinitializing controller
Dec  5 02:44:13 nyx kernel: mps0: Reinitializing controller
Dec  5 02:50:13 nyx kernel: mps0: Reinitializing controller
Dec  5 02:56:23 nyx kernel: mps0: Reinitializing controller
Dec  5 03:07:32 nyx kernel: mps0: Reinitializing controller
Dec  5 03:09:16 nyx kernel: mps0: Reinitializing controller
Dec  5 03:10:41 nyx kernel: mps0: Reinitializing controller
Dec  5 03:23:06 nyx kernel: mps0: Reinitializing controller
Dec  5 03:29:13 nyx kernel: mps0: Reinitializing controller
Dec  5 03:37:18 nyx kernel: mps0: Reinitializing controller
Dec  5 03:39:23 nyx kernel: mps0: Reinitializing controller
Dec  5 03:45:42 nyx kernel: mps0: Reinitializing controller
Dec  5 03:51:22 nyx kernel: mps0: Reinitializing controller
Dec  5 03:58:32 nyx kernel: mps0: Reinitializing controller
Dec  5 03:59:50 nyx kernel: mps0: Reinitializing controller
Dec  5 04:01:04 nyx kernel: mps0: Reinitializing controller
Dec  5 04:02:25 nyx kernel: mps0: Reinitializing controller
Dec  5 04:06:33 nyx kernel: mps0: Reinitializing controller
Dec  5 04:08:54 nyx kernel: mps0: Reinitializing controller
Dec  5 04:11:35 nyx kernel: mps0: Reinitializing controller
Dec  5 04:13:41 nyx kernel: mps0: Reinitializing controller

We are attaching the drives to the mainboard SATA ports again. Unfortunately, this is not an option for our other machines, so we have to stick with 11.1 there.
Comment 23 Bane Ivosev 2019-12-05 19:40:47 UTC
Really sad news ... We have been running for 18 days now and everything is great with 12.1 for us, but it is still early to draw conclusions, and your experience is not promising.
Comment 24 Matthias Pfaller 2020-01-22 08:06:41 UTC
(In reply to Bane Ivosev from comment #23)
We just gave it another try. We hoped that the problem might have been caused by a defective disk when we tried last time. The problem was triggered again during the next night's backup :-( This forces us to keep running 11.1 on our backup machines.

Any (bad/good) news from your machine?
Comment 25 Bane Ivosev 2020-01-22 08:32:14 UTC
(In reply to Matthias Pfaller from comment #24)
We are still fine. No problem at all.
Comment 26 wanpengqian 2020-01-29 08:44:37 UTC
I have experienced this issue multiple times after upgrading from 11.1 to 12.1.
The server was rock solid before, and I didn't realize it was a driver issue.

Since my pool is already upgraded to the latest version, I cannot downgrade
to 11.1. I am building an 11.1 mps.ko for the 12.1 kernel and will
load it manually via loader.conf.

I will report the result in a few weeks.
Comment 27 Matthias Pfaller 2020-01-29 09:30:09 UTC
How does one change the affected version in the bug report? After all, this is a >11.1 problem, not an 11.1-STABLE problem.
Comment 28 Sharad Ahlawat 2020-01-29 18:22:21 UTC
I finally tried 12.1 and it seemed to run fine for a few days till I started a large file transfer and ...

(da4:mps0:0:11:0): READ(10). CDB: 28 00 7c fa 4a a0 00 00 08 00
(da4:mps0:0:11:0): CAM status: SCSI Status Error
(da4:mps0:0:11:0): SCSI status: Check Condition
(da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
(da4:mps0:0:11:0): Retrying command (per sense data)

Currently running 12.1-p2 with the mps driver from 11.3 (r357156).

I just deleted the /usr/src/sys/dev/mps directory, copied it over from the 11.3 source, and compiled the kernel.

/Sharad
Comment 29 Bane Ivosev 2020-01-29 18:27:08 UTC
We have had success with standard 11.1 and 12.1-RELEASE installations, with no compiling or mixing of driver versions. The problem was with 12.0 and 11.3-RELEASE.
Comment 30 Ștefan BĂLU 2020-02-17 11:45:38 UTC
I've just experienced these issues on the latest FreeNAS (it's based on FreeBSD 11.3-RELEASE) with an LSI3008 controller.

The issues you all experience are related to the BIOS Revision and Firmware Revision versions shown by:

mprutil show all | less
...
Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IT
   Board Assembly:
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 17.00.00.00
Firmware Revision: 15.00.03.00
  Integrated RAID: no
...


It's definitely not related to the types/models of disks used or the FreeBSD version. Just make sure you are running the following firmware versions, as I already have these in production with FreeBSD 11.2-RELEASE:

Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IT
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 17.00.00.00
Firmware Revision: 15.00.03.00
  Integrated RAID: no
Comment 31 Matthias Pfaller 2020-02-17 11:57:20 UTC
mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.37.00.00
Firmware Revision: 20.00.07.00
  Integrated RAID: yes

uname -a
FreeBSD nyx 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC  amd64

This machine is experiencing the problems.

The machines without problems are running 11.1 and (thanks for the pointer)
       Board Name: SAS9211-8i
   Board Assembly: H3-25250-02J
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.37.00.00
Firmware Revision: 19.00.00.00
  Integrated RAID: no

I'll try a downgrade to 19.00.00.00

regards, Matthias
(In reply to Ștefan BĂLU from comment #30)
Comment 32 Bane Ivosev 2020-02-17 12:53:51 UTC
We have the problem with the 18.03 version of the mpr/mps driver. That's why I said the 12.1 and 11.1 releases are working for us.

12.1 -> #define MPR_DRIVER_VERSION	"23.00.00.00-fbsd"
12.0 -> #define MPR_DRIVER_VERSION	"18.03.00.00-fbsd"
11.3 -> #define MPR_DRIVER_VERSION	"18.03.00.00-fbsd"
11.2 -> #define MPR_DRIVER_VERSION	"18.03.00.00-fbsd"
11.1 -> #define MPR_DRIVER_VERSION	"15.03.00.00-fbsd"
Comment 33 Ștefan BĂLU 2020-02-17 13:57:46 UTC
(In reply to Bane Ivosev from comment #32)
So, should we keep the firmware in sync with those versions, or go for a trial-and-error approach?

In the meantime, I'll let you know how it goes with the latest FreeNAS (FreeBSD 11.3-RELEASE) and firmware version 15.00.03.00.

I see that my corrected comment didn't go through... So, I downgraded from:

    BIOS Revision: 18.00.00.00
    Firmware Revision: 16.00.01.00

to:

    BIOS Revision: 8.35.00.00
    Firmware Revision: 15.00.03.00

The behaviour would trigger in a couple of days, so I'll keep you all posted.
Comment 34 Bane Ivosev 2020-02-17 14:14:33 UTC
(In reply to Ștefan BĂLU from comment #33)
Before the first install (11.3) we updated the firmware and left it as is. So later we didn't change the firmware; we just tried mps/mpr driver versions different from 18.03.

Now, with 12.1-RELEASE, we have:
mpr0: Firmware: 16.00.01.00, Driver: 23.00.00.00-fbsd
Comment 35 Ștefan BĂLU 2020-02-18 11:06:57 UTC
Looks like the problem persists with FreeNAS (FreeBSD 11.3-RELEASE) and Firmware Revision: 15.00.03.00.
Comment 36 lastb0isct 2020-02-26 03:04:47 UTC
Has anyone gotten further with this issue?  It seems to be affecting me every other hour now, and it all started after adding a WD DC drive...
Comment 37 Peter Eriksson 2020-03-10 13:35:11 UTC
Are there any SAS expanders between the SAS HBA and the (SATA) disks?
("mprutil show expanders") If so, do the expanders have up-to-date firmware?
Comment 38 Matthias Pfaller 2020-03-10 13:57:46 UTC
(In reply to Peter Eriksson from comment #37)
Not in our case:
[root@nyx ~]# mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 0.00.00.00
Firmware Revision: 19.00.00.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0                              N                 1.5    6.0    SAS Initiator 
1                              N                 1.5    6.0    SAS Initiator 
2                              N                 1.5    6.0    SAS Initiator 
3                              N                 1.5    6.0    SAS Initiator 
4                              N                 1.5    6.0    SAS Initiator 
5                              N                 1.5    6.0    SAS Initiator 
6                              N                 1.5    6.0    SAS Initiator 
7                              N                 1.5    6.0    SAS Initiator 

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b001551a80               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

We did downgrade to firmware 19.00.00.00, but I haven't had the time to run tests at this firmware level yet.
Comment 39 Paul Thornton 2020-03-10 14:10:01 UTC
I don't have any expanders (note that this card has IR firmware, in case that is relevant; I wasn't actually aware of that, as it is only being used as a dumb HBA):

[root@nas1b ~]# mprutil show all
Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IR
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 8.35.00.00
Firmware Revision: 15.00.03.00
  Integrated RAID: yes

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0                              N                 3.0    12     SAS Initiator 
1                              N                 3.0    12     SAS Initiator 
2                              N                 3.0    12     SAS Initiator 
3                              N                 3.0    12     SAS Initiator 
4                              N                 3.0    12     SAS Initiator 
5                              N                 3.0    12     SAS Initiator 
6                              N                 3.0    12     SAS Initiator 
7                              N                 3.0    12     SAS Initiator 

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500304801cf54a05               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

[root@nas1b ~]#
Comment 40 Peter Eriksson 2020-03-10 14:43:47 UTC
I've had some random problems when using the cards with IR firmware, so I always make sure they are running IT firmware. Might be worth testing.


But that might not be relevant for your problems. We are using them without any issues (so far) but we're using SAS drives (10TB HGST) behind SAS Expanders (and a couple of Intel SSD SATA drives) - so nothing similar to your situation...

# mprutil show all
Adapter:
mpr0 Adapter:
       Board Name: Dell HBA330 Mini
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.08.00
  Integrated RAID: no
Comment 41 Sharad Ahlawat 2020-05-06 20:21:57 UTC
(In reply to Sharad Ahlawat from comment #28)

Symptom:
CAM retry and timeout messages leading to controller aborts and resets

Cause:
slow disks or SMR disks

Workaround:
Increase the CAM timeout defaults

❯ sysctl kern.cam.da.default_timeout=90
kern.cam.da.default_timeout: 60 -> 90
❯ sysctl kern.cam.ada.default_timeout=60
kern.cam.ada.default_timeout: 30 -> 60
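To keep those values across reboots they can go in /etc/sysctl.conf (the standard FreeBSD mechanism for persistent sysctls; the values are simply the ones from the workaround above):

```
# /etc/sysctl.conf -- persist the raised CAM timeouts
kern.cam.da.default_timeout=90
kern.cam.ada.default_timeout=60
```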

And disable NCQ on SMR Seagates:
❯ cat cam-tags.sh
#!/usr/local/bin/bash
# Shrink the Native Command Queue depth to 1, effectively disabling queuing.
for Disk in $(camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1); do
    camcontrol tags $Disk -N 1
    # camcontrol tags $Disk -v
done

If you only have SMRs in your setup and use a UPS, you could also:
❯ sysctl vfs.zfs.cache_flush_disable=1

Solution:
don't use slow disks or SMR disks with ZFS


The long version:

I am obliged to post this update, given the driver downgrade workaround I previously posted on this thread before getting to the root cause of these log entries in my messages file after upgrading to 12.x:

Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0
Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain
Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0
Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 more tries remain

Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1039 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1357 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1933 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c8 00 00 00 b0 00 00
Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 38 78 00 00 00 58 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 SMID 621 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 476 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 321 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1873 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1852 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1742 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 387 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 2104 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a2 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a3 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a4 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a5 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a6 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a7 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a8 d8 00 00 00 c0 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c0 00 00 00 08 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 36 c0 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data)

Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0
Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 80 44 e8 40 c9 01 00 00 00 00
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain
Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)

❯ smartctl -a /dev/da4
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   178   000    Old_age   Always       -       419
Note: Every controller ABORTED COMMAND increases the ID# 199 count
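A quick way to watch that counter over time is to pull attribute 199 out of the smartctl output. A minimal sketch (the `get_crc_count` helper name is made up for illustration; here it is fed a sample line rather than a live device):

```shell
# Print the raw value (last field) of SMART attribute 199
# (UDMA_CRC_Error_Count) from `smartctl -a` style output.
get_crc_count() {
    awk '$1 == 199 { print $NF }'
}

# Sample attribute line; on a live system pipe in `smartctl -a /dev/da4`.
sample='199 UDMA_CRC_Error_Count    0x003e   200   178   000    Old_age   Always       -       419'
echo "$sample" | get_crc_count
```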

Messages are correlated with sustained disk operation events:
	scrubs
	large file transfers
	cron jobs kicking off on multiple jails accessing the same pool, etc.

Ruled out any issues with:
	loose cable
	cable quality
	controller hardware, bios and firmware
	power supply

❯ mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9207-4i4e
   Board Assembly: H3-25434-00H
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator
2       0003        000b       N         6.0     1.5    6.0    SAS Initiator
3       0004        000c       N         6.0     1.5    6.0    SAS Initiator
4       0005        000d       N         6.0     1.5    6.0    SAS Initiator
5       0006        000e       N         6.0     1.5    6.0    SAS Initiator
6       0007        000f       N         6.0     1.5    6.0    SAS Initiator
7       0008        0010       N         6.0     1.5    6.0    SAS Initiator

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   10   4433221100000000 0009    0001      SATA Target   6.0   0001 00    1
00   09   4433221101000000 000a    0002      SATA Target   6.0   0001 01    1
00   11   4433221102000000 000b    0003      SATA Target   6.0   0001 02    1
00   07   4433221103000000 000c    0004      SATA Target   6.0   0001 03    1
00   13   4433221104000000 000d    0005      SATA Target   6.0   0001 07    1
00   04   4433221105000000 000e    0006      SATA Target   6.0   0001 06    1
00   14   4433221106000000 000f    0007      SATA Target   6.0   0001 05    1
00   12   4433221107000000 0010    0008      SATA Target   6.0   0001 04    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b0073400a0               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level


❯ camcontrol devlist

<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus1 target 0 lun 0 (ada1,pass1)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus2 target 0 lun 0 (ada2,pass2)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus3 target 0 lun 0 (ada3,pass3)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (ada4,pass4)
<ST8000DM004-2CX188 0001>          at scbus5 target 0 lun 0 (ada5,pass5)
<ST8000DM004-2CX188 0001>          at scbus6 target 0 lun 0 (ada6,pass6)
<ST8000DM004-2CX188 0001>          at scbus7 target 0 lun 0 (ada7,pass7)
<WDC WD20EFRX-68EUZN0 80.00A80>    at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80>    at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05>        at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05>        at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02>        at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06>        at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44>        at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 14 lun 0 (pass17,da7)

Notice: da4 has a different firmware revision, which is most probably the reason I see relatively fewer timeout errors on da1, da2, and da3, the other disks in the same ZFS pool.


To understand these error messages:

https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB

❯ cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition

https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition

❯ cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)

https://en.wikipedia.org/wiki/Key_Code_Qualifier
B	47	00	Aborted Command - SCSI parity error

https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h  DZTPROMAEBKVF  INFORMATION UNIT iuCRC ERROR DETECTED

CDB layout for READ(16)/WRITE(16):
	- Byte 0: opcode; byte 1: flags
	- Bytes 2-9: the 64-bit logical block address
	- Bytes 10-13: the transfer length in logical blocks
	- Bytes 14-15: group number and control
	
So in my logs we are writing 0x100 = 256 blocks (128 KiB with 512-byte sectors) at a time to consecutive block ranges and throwing errors. The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub, which I associate with the abort and controller reset events.
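That byte layout can be checked mechanically against the CDBs in the kernel log. This is a small illustrative sh helper (`decode_cdb16` is a made-up name, and the byte count assumes 512-byte logical sectors) applied to one of the WRITE(16) CDBs from the Apr 25 log:

```shell
#!/bin/sh
# Decode a READ(16)/WRITE(16) CDB as printed in the kernel log:
# bytes 2-9 hold the 64-bit LBA, bytes 10-13 the transfer length in blocks.
decode_cdb16() {
    # Positional parameter N+1 holds CDB byte N.
    lba=$((0x$3$4$5$6$7$8$9${10}))
    blocks=$((0x${11}${12}${13}${14}))
    # Assumes 512-byte logical sectors for the byte count.
    echo "LBA=$lba blocks=$blocks bytes=$((blocks * 512))"
}

# One of the WRITE(16) CDBs from the Apr 25 log above.
decode_cdb16 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
```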


Summary:
The FreeBSD 12.x kernel and mps driver code are more efficient than 11.x, pushing more data to and from the disks, and that has led to these timeout events.

I also experimented and found that reducing the memory clock speed in the BIOS is another workaround that stops these timeout errors with 12.x. Basically, that is what my earlier-posted workaround of using the 11.x mps driver with 12.x was effectively doing: slowing down the data transfer rates so the disks could keep up.


Disclaimer:
These observations, analyses and workarounds are specific to my setup; please adapt and incorporate accordingly.
Comment 42 Matthias Pfaller 2020-05-12 12:14:22 UTC
We followed the advice of Christoph Bubel, disabled NCQ and set the timeouts to 90 s (I can't imagine a situation where this should be necessary, but still...).
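For reference, the two changes described here are typically made with commands along these lines (a sketch; da0 stands in for each affected drive, and both commands need root on FreeBSD):

```shell
# Limit the drive to a single outstanding command, which effectively
# disables NCQ for that device (repeat for each da device):
camcontrol tags da0 -N 1

# Raise the default da(4) command timeout (seconds; default is 60):
sysctl kern.cam.da.default_timeout=90
```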

Results:
May  8 17:36:00 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 17:36:00 nyx kernel: mps0: Reinitializing controller
May  8 17:36:00 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
May  8 17:36:00 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May  8 17:50:44 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 17:50:44 nyx kernel: mps0: Reinitializing controller
May  8 17:50:44 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
May  8 17:50:44 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May  8 18:10:06 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 18:10:06 nyx kernel: mps0: Reinitializing controller
May  8 18:10:06 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
May  8 18:10:06 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May  8 18:27:27 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 18:27:27 nyx kernel: mps0: Reinitializing controller
May  8 18:27:27 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd

We upgraded the controller firmware to 20.00.07.00 and tried again:
[root@nyx ~]# mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 0.00.00.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator 
2       0003        000b       N         6.0     1.5    6.0    SAS Initiator 
3       0004        000c       N         6.0     1.5    6.0    SAS Initiator 
4       0005        000d       N         6.0     1.5    6.0    SAS Initiator 
5       0006        000e       N         6.0     1.5    6.0    SAS Initiator 
6       0007        000f       N         6.0     1.5    6.0    SAS Initiator 
7       0008        0010       N         6.0     1.5    6.0    SAS Initiator 

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   01   4433221100000000 0009    0001      SATA Target   6.0   0001 03    1
00   04   4433221101000000 000a    0002      SATA Target   6.0   0001 02    1
00   03   4433221102000000 000b    0003      SATA Target   6.0   0001 01    1
00   00   4433221103000000 000c    0004      SATA Target   6.0   0001 00    1
00   02   4433221104000000 000d    0005      SATA Target   6.0   0001 07    1
00   05   4433221105000000 000e    0006      SATA Target   6.0   0001 06    1
00   06   4433221106000000 000f    0007      SATA Target   6.0   0001 05    1
00   07   4433221107000000 0010    0008      SATA Target   6.0   0001 04    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b001551a80               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

[root@nyx ~]# for i in $(camcontrol devlist | grep "ST12000" | cut -d"," -f2 | cut -d")" -f1); do 
> camcontrol tags $i
> done
(pass0:mps0:0:0:0): device openings: 1
(pass1:mps0:0:1:0): device openings: 1
(pass2:mps0:0:2:0): device openings: 1
(pass3:mps0:0:3:0): device openings: 1
(pass4:mps0:0:4:0): device openings: 1
(pass5:mps0:0:5:0): device openings: 1
(pass6:mps0:0:6:0): device openings: 1
(pass7:mps0:0:7:0): device openings: 1
[root@nyx ~]# 
Results:

May 11 15:58:58 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 ef 46 55 38 00 00 10 00 length 8192 SMID 64 Command timeout on target 7(0x0010) 90000 set, 90.154685585 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 7 for SMID 64
May 11 15:58:58 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 ef 46 55 38 00 00 10 00 length 8192 SMID 64 Aborting command 0xfffffe00f92e5600
May 11 15:58:58 nyx kernel:     (da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 a0 00 00 00 20 00 00 length 16384 SMID 1564 Command timeout on target 2(0x000d) 90000 set, 90.120064627 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 2 for SMID 1564
May 11 15:58:58 nyx kernel:     (da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 a0 00 00 00 20 00 00 length 16384 SMID 1564 Aborting command 0xfffffe00f93635a0
May 11 15:58:58 nyx kernel:     (da6:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 04 e8 09 8e 30 00 00 00 30 00 00 length 24576 SMID 432 Command timeout on target 6(0x000f) 90000 set, 90.97377933 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 6 for SMID 432
May 11 15:58:58 nyx kernel:     (da6:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 04 e8 09 8e 30 00 00 00 30 00 00 length 24576 SMID 432 Aborting command 0xfffffe00f9304480
May 11 15:58:58 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 ef 46 47 a0 00 01 00 00 length 131072 SMID 1407 Command timeout on target 5(0x000e) 90000 set, 90.108646173 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 5 for SMID 1407
May 11 15:58:58 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 ef 46 47 a0 00 01 00 00 length 131072 SMID 1407 Aborting command 0xfffffe00f93562a8
May 11 15:58:58 nyx kernel:     (da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 04 e0 21 4e 68 00 00 00 10 00 00 length 8192 SMID 2034 Command timeout on target 4(0x000a) 90000 set, 90.67287327 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 4 for SMID 2034
May 11 15:58:58 nyx kernel:     (da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 04 e0 21 4e 68 00 00 00 10 00 00 length 8192 SMID 2034 Aborting command 0xfffffe00f938ad30
May 11 15:58:58 nyx kernel:     (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 04 e2 48 50 10 00 00 00 e8 00 00 length 118784 SMID 1470 Command timeout on target 0(0x000c) 90000 set, 90.51090608 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 0 for SMID 1470
May 11 15:58:58 nyx kernel:     (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 04 e2 48 50 10 00 00 00 e8 00 00 length 118784 SMID 1470 Aborting command 0xfffffe00f935b750
May 11 15:58:58 nyx kernel:     (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 b0 00 00 00 28 00 00 length 20480 SMID 1110 Command timeout on target 1(0x0009) 90000 set, 90.57964893 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 1 for SMID 1110
May 11 15:58:58 nyx kernel:     (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 b0 00 00 00 28 00 00 length 20480 SMID 1110 Aborting command 0xfffffe00f933d390
May 11 15:58:58 nyx kernel:     (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 03 d9 ab c8 40 00 00 00 08 00 00 length 4096 SMID 1125 Command timeout on target 3(0x000b) 90000 set, 90.59928632 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 3 for SMID 1125
May 11 15:58:58 nyx kernel:     (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 03 d9 ab c8 40 00 00 00 08 00 00 length 4096 SMID 1125 Aborting command 0xfffffe00f933e7b8
May 11 15:59:05 nyx kernel:     (xpt0:mps0:0:3:0): SMID 8 task mgmt 0xfffffe00f92e0ac0 timed out
May 11 15:59:05 nyx kernel: mps0: Reinitializing controller
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 7
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 2
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 6
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 5
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 4
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 0
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 1
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 3
May 11 15:59:05 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
May 11 15:59:05 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May 11 16:32:23 nyx kernel:     (da2:mps0:0:2:0): READ(10). CDB: 28 00 0e 77 4d 58 00 00 28 00 length 20480 SMID 111 Command timeout on target 2(0x000c) 90000 set, 90.363886166 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 2 for SMID 111
May 11 16:32:23 nyx kernel:     (da2:mps0:0:2:0): READ(10). CDB: 28 00 0e 77 4d 58 00 00 28 00 length 20480 SMID 111 Aborting command 0xfffffe00f92e9528
May 11 16:32:23 nyx kernel:     (da4:mps0:0:4:0): READ(10). CDB: 28 00 0e 77 25 58 00 00 08 00 length 4096 SMID 1374 Command timeout on target 4(0x000a) 90000 set, 90.365649439 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 4 for SMID 1374
May 11 16:32:23 nyx kernel:     (da4:mps0:0:4:0): READ(10). CDB: 28 00 0e 77 25 58 00 00 08 00 length 4096 SMID 1374 Aborting command 0xfffffe00f9353650
May 11 16:32:23 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 0e 77 16 d8 00 00 10 00 length 8192 SMID 139 Command timeout on target 7(0x0010) 90000 set, 90.368426169 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 7 for SMID 139
May 11 16:32:23 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 0e 77 16 d8 00 00 10 00 length 8192 SMID 139 Aborting command 0xfffffe00f92ebac8
May 11 16:32:23 nyx kernel:     (da6:mps0:0:6:0): READ(10). CDB: 28 00 0e 77 97 18 00 00 28 00 length 20480 SMID 731 Command timeout on target 6(0x000f) 90000 set, 90.353716039 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 6 for SMID 731
May 11 16:32:23 nyx kernel:     (da6:mps0:0:6:0): READ(10). CDB: 28 00 0e 77 97 18 00 00 28 00 length 20480 SMID 731 Aborting command 0xfffffe00f931d648
May 11 16:32:23 nyx kernel:     (da3:mps0:0:3:0): READ(10). CDB: 28 00 0e 77 7e 58 00 00 18 00 length 12288 SMID 332 Command timeout on target 3(0x000b) 90000 set, 90.354107428 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 3 for SMID 332
May 11 16:32:23 nyx kernel:     (da3:mps0:0:3:0): READ(10). CDB: 28 00 0e 77 7e 58 00 00 18 00 length 12288 SMID 332 Aborting command 0xfffffe00f92fbe20
May 11 16:32:23 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 0e 77 6e e8 00 00 b8 00 length 94208 SMID 1491 Command timeout on target 5(0x000e) 90000 set, 90.354717928 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 5 for SMID 1491
May 11 16:32:23 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 0e 77 6e e8 00 00 b8 00 length 94208 SMID 1491 Aborting command 0xfffffe00f935d388
May 11 16:32:23 nyx kernel:     (da1:mps0:0:1:0): READ(10). CDB: 28 00 0e 77 4f d0 00 01 00 00 length 131072 SMID 1330 Command timeout on target 1(0x0009) 90000 set, 90.357039861 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 1 for SMID 1330
May 11 16:32:23 nyx kernel:     (da1:mps0:0:1:0): READ(10). CDB: 28 00 0e 77 4f d0 00 01 00 00 length 131072 SMID 1330 Aborting command 0xfffffe00f934fb30
May 11 16:32:23 nyx kernel:     (da0:mps0:0:0:0): READ(10). CDB: 28 00 0e 77 6b e8 00 00 50 00 length 40960 SMID 692 Command timeout on target 0(0x000d) 90000 set, 90.359445705 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 0 for SMID 692
May 11 16:32:23 nyx kernel:     (da0:mps0:0:0:0): READ(10). CDB: 28 00 0e 77 6b e8 00 00 50 00 length 40960 SMID 692 Aborting command 0xfffffe00f931a1e0
May 11 16:32:29 nyx kernel:     (xpt0:mps0:0:4:0): SMID 10 task mgmt 0xfffffe00f92e0d70 timed out
May 11 16:32:29 nyx kernel: mps0: Reinitializing controller

So at least for our setup this didn't help :-(

Matthias

PS: We are using ST12000VN0007 drives. These shouldn't be SMR drives. Please note that these drives do not cause any problems when used with ahci or isci (maxopenings is 32 in that case).
Comment 43 Matthias Pfaller 2020-05-12 12:16:33 UTC
(In reply to Matthias Pfaller from comment #42)
Sorry, I messed up the names... This should have been Sharad Ahlawat and not Christoph Bubel.
Comment 44 Sharad Ahlawat 2020-05-13 06:46:27 UTC
(In reply to Matthias Pfaller from comment #43)

The logs show the read commands are still timing out, even with the 90 s timeout.

Also, there are no retry messages; is "sysctl kern.cam.da.retry_count" set to 0?

You could try a few things to get to the root cause:

a: also disable the ZFS cache flush:
sysctl vfs.zfs.cache_flush_disable=1
even though your drives are not SMRs

b: experiment with larger timeout values.
Also observe "gstat" output and ensure the first column, L(q), keeps returning to zero and does not get stuck for any of the drives.

c: try reducing the SAS/SATA link speed to 3.0 Gb/s in the controller settings, just to rule out a disk-firmware link-speed compatibility issue.

Side note: not sure if this applies to your drives, but a few of mine don't support NCQ TRIM and are not properly blacklisted in the driver, so I had to set vfs.unmapped_buf_allowed=0 in loader.conf.
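Collected in one place, suggestions (a) and (b) and the side note translate to something like the following (a sketch; the 120 s timeout is illustrative, and the loader.conf line only takes effect after a reboot):

```shell
# (a) Disable ZFS's SYNCHRONIZE CACHE flushes:
sysctl vfs.zfs.cache_flush_disable=1

# (b) Try a larger CAM da(4) timeout, then watch gstat: the L(q)
#     column should keep returning to 0 for every drive.
sysctl kern.cam.da.default_timeout=120
gstat -I 1s

# Side note: for drives with broken NCQ TRIM that the driver does not
# quirk, disable unmapped buffers (requires a reboot):
echo 'vfs.unmapped_buf_allowed=0' >> /boot/loader.conf
```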
Comment 45 Matthias Pfaller 2020-05-13 07:41:40 UTC
(In reply to Sharad Ahlawat from comment #44)
[root@nyx ~]# sysctl kern.cam.da.retry_count
kern.cam.da.retry_count: 4

Previously we had the drives connected to the onboard SATA/SAS ports (hoping the mps problems would get solved). The machine ran for months without showing any disk-related problems. At the moment we are looking for a replacement for the SAS2008 (for other machines without that many onboard SATA/SAS channels) and are testing the following config:

ahci0: <Marvell 88SE9215 AHCI SATA controller> port 0x3050-0x3057,0x3040-0x3043,0x3030-0x3037,0x3020-0x3023,0x3000-0x301f mem 0xdf940000-0xdf9407ff irq 32 at device 0.0 on pci5
ahci0: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahci1: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich4: <AHCI channel> at channel 0 on ahci1
ahcich5: <AHCI channel> at channel 1 on ahci1
ahcich6: <AHCI channel> at channel 2 on ahci1
ahcich7: <AHCI channel> at channel 3 on ahci1
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada0: Serial Number ZJV3ZYSX
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 11444224MB (23437770752 512 byte sectors)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada1: Serial Number ZJV1VVWX
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 11444224MB (23437770752 512 byte sectors)
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada2: Serial Number ZJV1WLXM
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 11444224MB (23437770752 512 byte sectors)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada3: Serial Number ZJV2YY9A
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 11444224MB (23437770752 512 byte sectors)
ada4 at ahcich4 bus 0 scbus4 target 0 lun 0
ada4: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada4: Serial Number ZJV3ZJNA
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 11444224MB (23437770752 512 byte sectors)
ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada5: Serial Number ZJV3ZXN5
ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 11444224MB (23437770752 512 byte sectors)
ada6 at ahcich6 bus 0 scbus6 target 0 lun 0
ada6: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada6: Serial Number ZJV2MWZ0
ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada6: Command Queueing enabled
ada6: 11444224MB (23437770752 512 byte sectors)
ada7 at ahcich7 bus 0 scbus7 target 0 lun 0
ada7: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada7: Serial Number ZCH0HHJ6
ada7: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada7: Command Queueing enabled
ada7: 11444224MB (23437770752 512 byte sectors)
ada8: <SAMSUNG MZ7KM240HMHQ-00005 GXM5304Q> ACS-2 ATA SATA 3.x device
ada8: Serial Number S3F5NX0M400493
ada8: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada8: Command Queueing enabled
ada8: 228936MB (468862128 512 byte sectors)
ada8: quirks=0x1<4K>
ada9: <SAMSUNG MZ7KM240HMHQ-00005 GXM5304Q> ACS-2 ATA SATA 3.x device
ada9: Serial Number S3F5NX0M400461
ada9: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada9: Command Queueing enabled
ada9: 228936MB (468862128 512 byte sectors)
ada9: quirks=0x1<4K>

I reverted the "camcontrol tags <DEV> -N 1" change and the modified sysctls, so with this setup the system runs with the same drives and without any changed sysctl or CAM settings.

So far no problems have shown up. Operating conditions are the same as during the last mps test (a zpool scrub is running and our systems did their backups to the machine last night).

As the drives behave well with at least three different controllers

  ahci0: <Intel Patsburg AHCI SATA controller>
  ahci1: <Marvell 88SE9215 AHCI SATA controller>
  isci0: <Intel(R) C600 Series Chipset SAS Controller (SATA mode)>

I can't imagine that changing parameters on the cam/zfs side (NCQ TRIM, NCQ, timeouts) will help in our case.

Probably we will have to decommission the LSI9211 boards :-(


regards, Matthias