Bug 224496 - mpr and mps drivers seem to have issues with large Seagate drives
Summary: mpr and mps drivers seem to have issues with large Seagate drives
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Bugmeister
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-21 08:51 UTC by Christoph Bubel
Modified: 2025-01-11 21:16 UTC
CC List: 25 users

See Also:


Attachments
/var/log/messages during device resets (307.41 KB, text/plain)
2019-06-12 08:13 UTC, Matthias Pfaller

Description Christoph Bubel 2017-12-21 08:51:25 UTC
Over on the FreeNAS forums, several people have reported issues with large (10TB) Seagate drives (ST10000NM0016 and ST10000VN0004) and LSI controllers. Links to the threads:
https://forums.freenas.org/index.php?threads/lsi-avago-9207-8i-with-seagate-10tb-enterprise-st10000nm0016.58251/
https://forums.freenas.org/index.php?threads/synchronize-cache-command-timeout-error.55067/

I am using the ST10000NM0016 drives and I am getting the following errors on an LSI SAS2308 (mps driver) and on an LSI SAS3008 (mpr driver). This happens about once every one or two weeks, in low-load situations.

Here are the logs:

(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1010 command timeout cm 0xfffffe0000f7cda0 ccb 0xfffff8018198d800
(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000f7cda0
mps0: Sending reset from mpssas_send_abort for target ID 1
(da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 length 4096 SMID 959 terminated ioc 804b scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 1
(da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 
(da2:mps0:0:1:0): CAM status: CCB request completed with an error
(da2:mps0:0:1:0): Retrying command
(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mps0:0:1:0): CAM status: Command timeout
(da2:mps0:0:1:0): Retrying command
(da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mps0:0:1:0): CAM status: SCSI Status Error
(da2:mps0:0:1:0): SCSI status: Check Condition
(da2:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mps0:0:1:0): Error 6, Retries exhausted
(da2:mps0:0:1:0): Invalidating pack

-------

(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 319 Aborting command 0xfffffe0000f54a90
mpr0: Sending reset from mprsas_send_abort for target ID 4
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 length 20480 SMID 320 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mpr0: Unfreezing devq for target ID 4
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: Command timeout
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 
(da1:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:4:0): SCSI status: Check Condition
(da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:4:0): Retrying command (per sense data)
(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 length 16384 SMID 653 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 711 terminated ioc 804b loginfo 311(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 
10e03 scsi 0 state c xfer 0
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: CCB request completed with an error
(da1:mpr0:0:4:0): Retrying command
(da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:mpr0:0:4:0): CAM status: SCSI Status Error
(da1:mpr0:0:4:0): SCSI status: Check Condition
(da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpr0:0:4:0): Error 6, Retries exhausted
(da1:mpr0:0:4:0): Invalidating pack
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 797 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 753 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 846 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 828 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 662 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 738 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 778 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 747 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 954 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 923 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 932 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 926 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
Comment 1 Bane Ivosev 2019-03-23 07:03:46 UTC
We have the very same problem, but with WD Red disks. The system randomly reboots, sometimes after 20 days of uptime, and a different disk is affected every time. It's our production NFS server, so this is very frustrating.

Supermicro 5049p
64 GB ECC RAM
LSI 3008 IT mode
18x WD Red 4 TB

Mar 23 07:39:46 fap kernel:       (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Command timeout on target 25(0x001c), 60000 set, 60.107418057 elapsed
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: mpr0: Sending abort to target 25 for SMID 357
Mar 23 07:39:46 fap kernel:       (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Aborting command 0xfffffe00b7aa6130
Mar 23 07:39:46 fap kernel:       (pass19:mpr0:0:25:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 1182 Command timeout on target 25(0x001c), 60000 set, 60.217681679 elampr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: mpr0: Controller reported scsi ioc terminated tgt 25 SMID 1182 loginfo 31130000
Mar 23 07:39:46 fap kernel: mpr0: Abort failed for target 25, sending logical unit reset
Mar 23 07:39:46 fap kernel: mpr0: Sending logical unit reset to target 25 lun 0
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name (    )
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CCB request aborted by the host
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 0 more tries remain
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Error 5, Retries exhausted
Mar 23 07:39:46 fap smartd[95746]: Device: /dev/da17 [SAT], failed to read SMART Attribute Data
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): WRITE(10). CDB: 2a 00 09 4a 32 a8 00 00 08 00 
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 3 more tries remain
Mar 23 07:43:19 fap syslogd: kernel boot file is /boot/kernel/kernel
Mar 23 07:43:19 fap kernel: ---<<BOOT>>---
Comment 2 Bane Ivosev 2019-03-23 07:05:51 UTC
Forgot to say, it's FreeBSD 12.0-RELEASE.
Comment 3 Matthias Pfaller 2019-06-12 08:13:45 UTC
Created attachment 205003 [details]
/var/log/messages during device resets
Comment 4 Matthias Pfaller 2019-06-12 08:15:20 UTC
We just configured a backup server with eight Seagate IronWolf (ST12000VN0007-2GS116) 12TB disks connected to a SAS2008:

Jun 12 08:51:35 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Jun 12 08:51:35 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Jun 12 08:51:35 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

After writing ~200GB to our pool it started resetting. I did a

sysctl dev.mps.0.debug_level=0x$((0x1+0x2+0x4+0x10+0x20))

The resulting trace is attached.
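For reference, the shell arithmetic above expands to a single mask value, so the equivalent literal form (and a way to read the level back) is:

sysctl dev.mps.0.debug_level=0x37    # 0x1 + 0x2 + 0x4 + 0x10 + 0x20 = 0x37
sysctl dev.mps.0.debug_level         # show the level currently in effect

The individual bits select mps(4) debug message classes; their exact meanings are defined in the driver source.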
Comment 5 Matthias Pfaller 2019-06-12 08:16:12 UTC
Comment on attachment 205003 [details]
/var/log/messages during device resets

Comment 6 Matthias Pfaller 2019-06-12 08:36:27 UTC
We are using FreeBSD 12.0-RELEASE:
FreeBSD nyx 12.0-RELEASE-p4 FreeBSD 12.0-RELEASE-p4 GENERIC  amd64
Comment 7 Bane Ivosev 2019-06-12 09:06:22 UTC
Just to append to my previous post from March: same hardware and same config, we reverted back to 11.1-RELEASE and everything has been working flawlessly for more than two months now.
Comment 8 Bane Ivosev 2019-06-12 09:23:36 UTC
And I don't think the problem is exclusive to the Seagate 10TB drives. We have WD Red 4TB drives and see the same problem. We had the same situation with 11.2-RELEASE as well, and because 11.2 and 12.0 have the same mpr/mps driver version we decided to try 11.1.
Comment 9 Matthias Pfaller 2019-06-12 09:35:00 UTC
(In reply to Bane Ivosev from comment #8)
We have several other machines with SAS2008 controllers. All of them are running 11.1 and none of them shows these problems...
Comment 10 Daniel Shafer 2019-07-23 19:53:42 UTC
I came across this same issue. It was causing my server to reboot several times a day due to kernel panics. It happens with both SAS9200 and SAS9300 controllers. I have 8 x 10TB Seagate IronWolf NAS drives.

I wanted to mention that for me there was a resolution. I added an Intel Optane 900p 280GB drive, set it up as a cache (L2ARC) device, and the problem entirely disappeared. My server ran for 20 days before I rebooted it last night to perform an upgrade.

So, I believe a workaround would be to add a cache drive to your ZFS pool.

The Intel Optane 900p is a highly recommended cache drive for ZFS pools.
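For anyone wanting to try that workaround, adding a cache device to an existing pool is a single command; the pool and device names below are placeholders for your own setup:

zpool add tank cache nvd0
zpool status tank    # the new device should show up under a separate "cache" section

Note that this only reduces the read load on the spinning disks; it does not change the underlying driver/firmware behaviour.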
Comment 11 Francois Baillargeon 2019-08-08 17:05:57 UTC
Following up on what Daniel Shafer says, we have the same issues on a FreeNAS deployment we did.

Everything is fine with our other pools, which use drives smaller than 10TB, but one of our pools using 14TB drives exhibits this exact behavior.

For us this is a major show-stopper bug, since we can't use this pool reliably. Our vendor sent us a new HBA, a new server, etc. before I stumbled upon this bug listing.
Comment 12 n 2019-08-08 19:41:01 UTC
Hi,

We are seeing the same issue. 

Here is more information on our setup :

FreeNAS-11.2-U5
FreeBSD 11.2-STABLE amd64

We use 2 x (6 x 14TB Seagate IronWolf drives)
We also have a 2 TB crucial SSD for L2ARC

The issue always comes up after 10-14 hours of heavy IO.

Disk Model : 14 TB Seagate  ST14000VN0008


The drives are on 2 different LSI HBAs. The drive that fails is random across both HBAs.

Please let us know if you need more information on this; it is impacting our production load.

Thank you.

Log output for our latest errors :

> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 
> 00 00 10 00 00 length 8192 SMID 60 Aborting command 0xfffffe000171f640
> mpr1: Sending reset from mprsas_send_abort for target ID 20
> 	(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 length 4096 SMID 332 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00 
> length 4096 SMID 703 terminated ioc 804b loginfo 31130000 sc(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 si 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 
> 00 01 00 00 00 length 131072 SMID 510 terminated ioc 
> 804b(da30:mpr1:0:20:0): CAM status: CCB request completed with an 
> error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00  
> loginfo 31130000 scsi 0 state c xfer 0
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 01 00 00 00 length 131072 SMID 938 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 01 00 00 00 length 131072 SMID 839 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 01 00 00 00 length 131072 SMID 681 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 01 00 00 00 length 131072 SMID 647 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 01 00 00 00 length 131072 SMID 253 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 01 00 00 00 length 131072 SMID 109 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 01 00 00 00 length 131072 SMID 267 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 00 10 00 00 length 8192 SMID 506 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 00 10 00 00 length 8192 SMID 774 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00 length 0 SMID 281 terminated ioc 804b loginfo 31140000 scsi 0 
> state c xfer 0
> mpr1: Unfreezing devq for target ID 20
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 
> 01 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: Command timeout
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 
> 00 10 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00
> (da30:mpr1:0:20:0): CAM status: CCB request completed with an error
> (da30:mpr1:0:20:0): Retrying command
> (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00
> (da30:mpr1:0:20:0): CAM status: SCSI Status Error
> (da30:mpr1:0:20:0): SCSI status: Check Condition
> (da30:mpr1:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, 
> reset, or bus device reset occurred)
> (da30:mpr1:0:20:0): Retrying command (per sense data)
> ctl_datamove: tag 0x855ffd44 on (2:3:106) aborted
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00 
> length 8192 SMID 486 Aborting command 0xfffffe0001745aa0
> mpr1: Sending reset from mprsas_send_abort for target ID 22
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 length 4096 SMID 105 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 
> length 81920 SMID 467 terminated ioc 804b loginfo 31130000 s(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 csi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 
> length 131072 SMID 959 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 
> length 131072 SMID 346 terminated ioc 804b loginfo 31130000 (da32:scsi 
> 0 state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00 
> length 131072 SMID 627 terminated ioc 804b loginfo 31130000 scsi 0 
> state c xfer 0
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00 
> length 131072 SMID 455 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00 
> length 131072 SMID 951 terminated ioc 804b loginfo 31130000 (da32:scsi 
> 0 state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00 
> length 131072 SMID 822 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00 
> length 131072 SMID 155 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00 
> length 8192 SMID 495 terminated ioc 804b loginfo 31130000 sc(da32:si 0 
> state c xfer 0
> mpr1:0:22:0): Retrying command
> 	(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00 
> length 8192 SMID 494 terminated ioc 804b loginfo 31130000 sc(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 si 0 state c xfer 0
> 	(da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00 length 0 SMID 726 terminated ioc 804b loginfo 
> 3(da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> 1140000 scsi 0 state c xfer 0
> mpr1: Unfreezing devq for target ID 22
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: Command timeout
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 
> 00 00 00
> (da32:mpr1:0:22:0): CAM status: CCB request completed with an error
> (da32:mpr1:0:22:0): Retrying command
> (da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00
> (da32:mpr1:0:22:0): CAM status: SCSI Status Error
> (da32:mpr1:0:22:0): SCSI status: Check Condition
> (da32:mpr1:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, 
> reset, or bus device reset occurred)
> (da32:mpr1:0:22:0): Retrying command (per sense data)
Comment 13 Christoph Bubel 2019-08-13 16:57:16 UTC
(In reply to Daniel Shafer from comment #10)

I can confirm the workaround; no errors since I added an L2ARC device to the pool.
Comment 14 Francois Baillargeon 2019-08-13 17:21:05 UTC
We usually have a cache on the affected pools, and sadly we still have issues when the L2ARC hits the disk; think long sequential reads.
Comment 15 Paul Thornton 2019-08-13 19:55:22 UTC
I too have run into this issue on a NAS box, once it started taking on any kind of load.

Running 12.0-RELEASE p3

The server contains 8 x Seagate IronWolf Pro 10TB SATA drives on an Avago 3008 HBA - 8 of these, basically:

da2 at mpr1 bus 0 scbus13 target 12 lun 0
da2: <ATA ST10000NE0004-1Z EN01> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZA237AVY
da2: 1200.000MB/s transfers
da2: Command Queueing enabled
da2: 9537536MB (19532873728 512 byte sectors)

Driver versions:
dev.mpr.1.driver_version: 18.03.00.00-fbsd
dev.mpr.1.firmware_version: 15.00.03.00

These drives are configured in a ZFS RAID10 setup (in case that datapoint matters):
        NAME         STATE     READ WRITE CKSUM
        data0        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            da2.eli  ONLINE       0     0     0
            da3.eli  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            da4.eli  ONLINE       0     0     0
            da5.eli  ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            da6.eli  ONLINE       0     0     0
            da7.eli  ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            da8.eli  ONLINE       0     0     0
            da9.eli  ONLINE       0     0     0

I currently get about 25 days between reboots.  The machine hangs and (I'm guessing here) kernel panics and restarts - I don't have the panic information, but log messages look very similar to what other people are seeing:

Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Command timeout on target 12(0x000c), 60000 set, 60.703976195 elapsed
Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:17 nas1a kernel: mpr1: Sending abort to target 12 for SMID 1484
Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Aborting command 0xfffffe00bad0b540
Jul 20 11:14:17 nas1a kernel:   (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1792 Command timeout on target 12(0x000c), 60000 set, 60.707504796 elapsed
Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:18 nas1a kernel: mpr1: Controller reported scsi ioc terminated tgt 12 SMID 1792 loginfo 31140000
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00
Jul 20 11:14:18 nas1a kernel: mpr1: Abort failed for target 12, sending logical unit reset
Jul 20 11:14:18 nas1a kernel: mpr1: (da2:mpr1:0:12:0): CAM status: CCB request aborted by the host
Jul 20 11:14:18 nas1a kernel: Sending logical unit reset to target 12 lun 0
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 3 more tries remain
Jul 20 11:14:18 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name (    )
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CCB request completed with an error
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 0 more tries remain
Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Error 5, Retries exhausted
Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy
Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 2 more tries remain

[reboot happens here]

And the most recent one, today:

Aug 13 08:58:55 nas1a kernel:   (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Command timeout on target 16(0x0010), 60000 set, 60.109683189 elapsed
Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name (    )
Aug 13 08:58:55 nas1a kernel: mpr1: Sending abort to target 16 for SMID 998
Aug 13 08:58:55 nas1a kernel:   (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Aborting command 0xfffffe00bacdfaa0
Aug 13 08:58:55 nas1a kernel: mpr1: Abort failed for target 16, sending logical unit reset
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 08:58:55 nas1a kernel: mpr1: Sending logical unit reset to target 16 lun 0
Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name (    )
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CCB request aborted by the host
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 0 more tries remain
Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Error 5, Retries exhausted
Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): WRITE(10). CDB: 2a 00 92 b0 7d 70 00 00 48 00
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy
Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 3 more tries remain

[reboot happens here]

After the reboot, there's no problem and everything works fine.  ZFS never marks the pool as degraded or unavailable.

Looking at the FreeNAS threads, this seems to have been going on for ages. Can anyone confirm that a downgrade to 11.1 does work around this issue, as that seems to be the only thing that might help?
Comment 16 Bane Ivosev 2019-08-13 20:46:07 UTC
Hi Paul, I had exactly the same symptoms as you, but with 4TB WD Reds, and for me everything has been working flawlessly on 11.1 for more than 130 days now.
Comment 17 Matthias Pfaller 2019-08-14 09:23:45 UTC
(In reply to Francois Baillargeon from comment #14)
We have cache and log devices:
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
k-zb1               87.2T  35.5T  51.8T        -         -    20%    40%  1.91x  ONLINE  -
  raidz2            87.2T  35.5T  51.8T        -         -    20%    40%
    gpt/k-zb1-0         -      -      -        -         -      -      -
    gpt/k-zb1-1         -      -      -        -         -      -      -
    gpt/k-zb1-2         -      -      -        -         -      -      -
    gpt/k-zb1-3         -      -      -        -         -      -      -
    gpt/k-zb1-4         -      -      -        -         -      -      -
    gpt/k-zb1-5         -      -      -        -         -      -      -
    gpt/k-zb1-6         -      -      -        -         -      -      -
    gpt/k-zb1-7         -      -      -        -         -      -      -
log                     -      -      -         -      -      -
  mirror            11.5G      0  11.5G        -         -     0%     0%
    gpt/k-zb1-zil0      -      -      -        -         -      -      -
    gpt/k-zb1-zil1      -      -      -        -         -      -      -
cache                   -      -      -         -      -      -
  gpt/k-zb1-cache0  80.0G  29.9G  50.1G        -         -     0%    37%
  gpt/k-zb1-cache1  80.0G  30.0G  50.0G        -         -     0%    37%

The only solution for us was to use the onboard SATA ports instead of the LSI controller. We have to keep the other machines that need the LSI controllers (not enough ports on the mainboard) at 11.1 :-(
Comment 18 Paul Thornton 2019-08-14 15:08:48 UTC
(In reply to Bane Ivosev from comment #16)

Thanks for the confirmation. I'm in the process of downgrading the affected NAS machine (they are an identical pair) from 12.0 to 11.1. I need to test this in the lab first, but will report back once we have the production machine on the older release.

I need to do this before it next crashes and reboots, of course!
Comment 19 Paul Thornton 2019-10-02 14:53:18 UTC
An update for all.

I downgraded our production servers from 12.0 to 11.1-RELEASE-p15 and for 5 weeks they have worked without any problems.

Previously we saw a problem after 4 weeks, then another after 1 week - so whilst there is a chance I have not waited long enough, I think this definitely fixes my problem.

Paul.
Comment 20 Sharad Ahlawat 2019-11-03 08:53:37 UTC
(In reply to Paul Thornton from comment #19)

Thanks for sharing, Paul. I took a slightly different approach: I have been running stable for a few months now with a FreeBSD 12.0-RELEASE kernel but using the FreeBSD 11.2 LSI mps driver. 12.1 brings in many updates to the 12.0 mps driver; hopefully those address these problems. I will test once it's released.

/Sharad
Comment 21 Bane Ivosev 2019-11-28 16:10:15 UTC
(In reply to freebsd from comment #20)

I've been running 12.1-RELEASE on the same hardware for 10 days now. Everything is OK. I'll report back in about a month. Success, I hope.
Comment 22 Matthias Pfaller 2019-12-05 13:35:48 UTC
I'm running FreeBSD 12.1 now. Yesterday we reattached the drives to the LSI controller. Problems showed up during the next backup :-(

Kernel log:
Dec  4 10:13:30 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Dec  4 10:13:30 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Dec  4 10:13:30 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
Dec  4 10:33:08 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Dec  4 10:33:08 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Dec  4 10:33:08 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
Dec  4 10:33:08 nyx kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0
Dec  4 10:33:08 nyx kernel: da1 at mps0 bus 0 scbus0 target 3 lun 0
Dec  4 10:33:08 nyx kernel: da2 at mps0 bus 0 scbus0 target 4 lun 0
Dec  4 10:33:08 nyx kernel: da4 at mps0 bus 0 scbus0 target 6 lun 0
Dec  4 10:33:08 nyx kernel: da5 at mps0 bus 0 scbus0 target 7 lun 0
Dec  4 10:33:08 nyx kernel: da6 at mps0 bus 0 scbus0 target 8 lun 0
Dec  4 10:33:08 nyx kernel: da3 at mps0 bus 0 scbus0 target 5 lun 0
Dec  4 10:33:08 nyx kernel: da7 at mps0 bus 0 scbus0 target 9 lun 0
Dec  4 22:42:34 nyx kernel:     (da4:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 45 37 b3 48 00 00 00 08 00 00 length 4096 SMID 1583 Command timeout on target 6(0x0010) 60000 set, 60.66782887 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 6 for SMID 1583
Dec  4 22:42:34 nyx kernel:     (da4:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 45 37 b3 48 00 00 00 08 00 00 length 4096 SMID 1583 Aborting command 0xfffffe00f9364f28
Dec  4 22:42:34 nyx kernel:     (da7:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 03 60 06 f6 c8 00 00 00 18 00 00 length 12288 SMID 925 Command timeout on target 9(0x000d) 60000 set, 60.23115994 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 9 for SMID 925
Dec  4 22:42:34 nyx kernel:     (da7:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 03 60 06 f6 c8 00 00 00 18 00 00 length 12288 SMID 925 Aborting command 0xfffffe00f932daf8
Dec  4 22:42:34 nyx kernel:     (da0:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1033 Command timeout on target 2(0x000c) 60000 set, 60.24094823 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 2 for SMID 1033
Dec  4 22:42:34 nyx kernel:     (da0:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1033 Aborting command 0xfffffe00f9336c18
Dec  4 22:42:34 nyx kernel:     (da1:mps0:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1440 Command timeout on target 3(0x000b) 60000 set, 60.24535090 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 3 for SMID 1440
Dec  4 22:42:34 nyx kernel:     (da1:mps0:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1440 Aborting command 0xfffffe00f9358f00
Dec  4 22:42:34 nyx kernel:     (da2:mps0:0:4:0): WRITE(10). CDB: 2a 00 24 36 61 98 00 00 18 00 length 12288 SMID 1472 Command timeout on target 4(0x000a) 60000 set, 60.24982318 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 4 for SMID 1472
Dec  4 22:42:34 nyx kernel:     (da2:mps0:0:4:0): WRITE(10). CDB: 2a 00 24 36 61 98 00 00 18 00 length 12288 SMID 1472 Aborting command 0xfffffe00f935ba00
Dec  4 22:42:34 nyx kernel:     (da5:mps0:0:7:0): WRITE(10). CDB: 2a 00 24 36 61 10 00 00 a0 00 length 81920 SMID 507 Command timeout on target 7(0x000f) 60000 set, 60.25666047 elapsed
Dec  4 22:42:34 nyx kernel: mps0: Sending abort to target 7 for SMID 507
Dec  4 22:42:34 nyx kernel:     (da5:mps0:0:7:0): WRITE(10). CDB: 2a 00 24 36 61 10 00 00 a0 00 length 81920 SMID 507 Aborting command 0xfffffe00f930a948
Dec  4 22:42:40 nyx kernel:     (xpt0:mps0:0:7:0): SMID 6 task mgmt 0xfffffe00f92e0810 timed out
Dec  4 22:42:40 nyx kernel: mps0: Reinitializing controller
Dec  4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 6
Dec  4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 9
Dec  4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 2

... lots more
Just the reinitialization messages:
Dec  4 22:42:40 nyx kernel: mps0: Reinitializing controller
Dec  4 23:01:27 nyx kernel: mps0: Reinitializing controller
Dec  4 23:40:55 nyx kernel: mps0: Reinitializing controller
Dec  4 23:47:49 nyx kernel: mps0: Reinitializing controller
Dec  4 23:58:02 nyx kernel: mps0: Reinitializing controller
Dec  5 00:17:33 nyx kernel: mps0: Reinitializing controller
Dec  5 00:20:18 nyx kernel: mps0: Reinitializing controller
Dec  5 00:21:33 nyx kernel: mps0: Reinitializing controller
Dec  5 00:24:30 nyx kernel: mps0: Reinitializing controller
Dec  5 00:26:40 nyx kernel: mps0: Reinitializing controller
Dec  5 00:29:30 nyx kernel: mps0: Reinitializing controller
Dec  5 00:33:10 nyx kernel: mps0: Reinitializing controller
Dec  5 00:34:56 nyx kernel: mps0: Reinitializing controller
Dec  5 00:42:16 nyx kernel: mps0: Reinitializing controller
Dec  5 00:46:34 nyx kernel: mps0: Reinitializing controller
Dec  5 00:48:16 nyx kernel: mps0: Reinitializing controller
Dec  5 00:58:18 nyx kernel: mps0: Reinitializing controller
Dec  5 01:00:55 nyx kernel: mps0: Reinitializing controller
Dec  5 01:11:38 nyx kernel: mps0: Reinitializing controller
Dec  5 01:15:39 nyx kernel: mps0: Reinitializing controller
Dec  5 01:22:00 nyx kernel: mps0: Reinitializing controller
Dec  5 01:23:17 nyx kernel: mps0: Reinitializing controller
Dec  5 01:35:25 nyx kernel: mps0: Reinitializing controller
Dec  5 01:37:44 nyx kernel: mps0: Reinitializing controller
Dec  5 01:44:59 nyx kernel: mps0: Reinitializing controller
Dec  5 01:52:05 nyx kernel: mps0: Reinitializing controller
Dec  5 01:54:33 nyx kernel: mps0: Reinitializing controller
Dec  5 01:56:06 nyx kernel: mps0: Reinitializing controller
Dec  5 02:11:14 nyx kernel: mps0: Reinitializing controller
Dec  5 02:27:42 nyx kernel: mps0: Reinitializing controller
Dec  5 02:44:13 nyx kernel: mps0: Reinitializing controller
Dec  5 02:50:13 nyx kernel: mps0: Reinitializing controller
Dec  5 02:56:23 nyx kernel: mps0: Reinitializing controller
Dec  5 03:07:32 nyx kernel: mps0: Reinitializing controller
Dec  5 03:09:16 nyx kernel: mps0: Reinitializing controller
Dec  5 03:10:41 nyx kernel: mps0: Reinitializing controller
Dec  5 03:23:06 nyx kernel: mps0: Reinitializing controller
Dec  5 03:29:13 nyx kernel: mps0: Reinitializing controller
Dec  5 03:37:18 nyx kernel: mps0: Reinitializing controller
Dec  5 03:39:23 nyx kernel: mps0: Reinitializing controller
Dec  5 03:45:42 nyx kernel: mps0: Reinitializing controller
Dec  5 03:51:22 nyx kernel: mps0: Reinitializing controller
Dec  5 03:58:32 nyx kernel: mps0: Reinitializing controller
Dec  5 03:59:50 nyx kernel: mps0: Reinitializing controller
Dec  5 04:01:04 nyx kernel: mps0: Reinitializing controller
Dec  5 04:02:25 nyx kernel: mps0: Reinitializing controller
Dec  5 04:06:33 nyx kernel: mps0: Reinitializing controller
Dec  5 04:08:54 nyx kernel: mps0: Reinitializing controller
Dec  5 04:11:35 nyx kernel: mps0: Reinitializing controller
Dec  5 04:13:41 nyx kernel: mps0: Reinitializing controller

We are attaching the drives to the mainboard SATA ports again. Unfortunately this is not an option for our other machines, so we have to stick with 11.1.
Comment 23 Bane Ivosev 2019-12-05 19:40:47 UTC
Really sad news... We have been running for 18 days now and everything is great with 12.1 for us, but it is still too early to draw conclusions, and your experience is not promising.
Comment 24 Matthias Pfaller 2020-01-22 08:06:41 UTC
(In reply to Bane Ivosev from comment #23)
We just gave it another try. We had hoped that the problem on our last attempt might have been caused by a defective disk. The problem was triggered again during the next night's backup :-( This is forcing us to keep running 11.1 on our backup machines.

Any (bad/good) news from your machine?
Comment 25 Bane Ivosev 2020-01-22 08:32:14 UTC
(In reply to Matthias Pfaller from comment #24)
We are still fine. No problem at all.
Comment 26 wanpengqian 2020-01-29 08:44:37 UTC
I experienced this issue multiple times after upgrading to 12.1 from 11.1.
The server was rock solid before, and I didn't realize it was a driver issue.

Since my pool is already upgraded to the latest version, I cannot downgrade
to 11.1. I am building an 11.1 mps.ko for the 12.1 kernel and will
load it manually via loader.conf.

I will report the result in a few weeks.
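For reference, a rough sketch of that approach, assuming the older driver sources have been dropped into /usr/src and that the running kernel does not have mps compiled in statically (GENERIC does, so a custom kernel without "device mps" would be needed for the module to take over):

cd /usr/src/sys/modules/mps
make && make install                         # builds and installs mps.ko
echo 'mps_load="YES"' >> /boot/loader.conf   # load the module at boot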
Comment 27 Matthias Pfaller 2020-01-29 09:30:09 UTC
How does one change the affected version in the bug report? After all, this is a >11.1 problem and not an 11.1-STABLE problem.
Comment 28 Sharad Ahlawat 2020-01-29 18:22:21 UTC
I finally tried 12.1 and it seemed to run fine for a few days till I started a large file transfer and ...

(da4:mps0:0:11:0): READ(10). CDB: 28 00 7c fa 4a a0 00 00 08 00
(da4:mps0:0:11:0): CAM status: SCSI Status Error
(da4:mps0:0:11:0): SCSI status: Check Condition
(da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
(da4:mps0:0:11:0): Retrying command (per sense data)

Currently running 12.1-p2 with 11.3 release 357156 mps driver.

I just deleted the /usr/src/sys/dev/mps directory, copied it over from the 11.3 source, and compiled the kernel.

/Sharad
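Spelled out, that procedure is roughly the following; the location of the 11.3 source checkout is a placeholder, and this is a sketch rather than a tested recipe:

mv /usr/src/sys/dev/mps /usr/src/sys/dev/mps.orig        # keep the stock driver around
cp -R /path/to/11.3-src/sys/dev/mps /usr/src/sys/dev/
cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now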
Comment 29 Bane Ivosev 2020-01-29 18:27:08 UTC
We have had success with the standard 11.1 and 12.1-RELEASE installations, with no compiling or mixing of driver versions. The problem was with 12.0 and 11.3-RELEASE.
Comment 30 Ștefan BĂLU 2020-02-17 11:45:38 UTC
Guys, I've just experienced these issues on the latest FreeNAS (it's based on FreeBSD 11.3-RELEASE) with an LSI3008 controller.

The issues you all experience are related to the BIOS Revision and Firmware Revision versions shown by:

mprutil show all | less
...
Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IT
   Board Assembly:
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 17.00.00.00
Firmware Revision: 15.00.03.00
  Integrated RAID: no
...


It's definitely not related to the types/models of disks used or the FreeBSD version. Just make sure you are running the following firmware versions, as I already have these in production with FreeBSD 11.2-RELEASE:

Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IT
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 17.00.00.00
Firmware Revision: 15.00.03.00
  Integrated RAID: no
Comment 31 Matthias Pfaller 2020-02-17 11:57:20 UTC
(In reply to Ștefan BĂLU from comment #30)
mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.37.00.00
Firmware Revision: 20.00.07.00
  Integrated RAID: yes

uname -a
FreeBSD nyx 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC  amd64

This machine is experiencing the problems.

The machines without problems are running 11.1 and (thanks for the pointer):
       Board Name: SAS9211-8i
   Board Assembly: H3-25250-02J
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.37.00.00
Firmware Revision: 19.00.00.00
  Integrated RAID: no

I'll try a downgrade to 19.00.00.00

regards, Matthias
Comment 32 Bane Ivosev 2020-02-17 12:53:51 UTC
We have the problem with the 18.03 version of the mpr/mps driver. That's why I said the 12.1 and 11.1 releases are working for us.

12.1 -> #define MPR_DRIVER_VERSION	"23.00.00.00-fbsd"
12.0 -> #define MPR_DRIVER_VERSION	"18.03.00.00-fbsd"
11.3 -> #define MPR_DRIVER_VERSION	"18.03.00.00-fbsd"
11.2 -> #define MPR_DRIVER_VERSION	"18.03.00.00-fbsd"
11.1 -> #define MPR_DRIVER_VERSION	"15.03.00.00-fbsd"
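A quick way to see which driver and firmware combination a given machine is actually running is via the sysctls the drivers already expose (controller unit 0 assumed):

sysctl dev.mpr.0.driver_version dev.mpr.0.firmware_version
sysctl dev.mps.0.driver_version dev.mps.0.firmware_version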
Comment 33 Ștefan BĂLU 2020-02-17 13:57:46 UTC
(In reply to Bane Ivosev from comment #32), so, should we keep the firmware in sync with those versions or go for a trial-and-error approach?

In the meantime, I'll let you know how it goes with the latest FreeNAS (FreeBSD 11.3-RELEASE) and firmware version 15.00.03.00.

I see that my corrected comment didn't go through... So, I downgraded from:

    BIOS Revision: 18.00.00.00
    Firmware Revision: 16.00.01.00

to:

    BIOS Revision: 8.35.00.00
    Firmware Revision: 15.00.03.00

The behaviour would trigger within a couple of days, so I'll keep you all posted.
Comment 34 Bane Ivosev 2020-02-17 14:14:33 UTC
(In reply to Ștefan BĂLU from comment #33)
Before the first install (11.3) we updated the firmware and left it as is. Later we didn't change the firmware; we just tried mps/mpr driver versions different from 18.03.

Now, with 12.1-RELEASE, we have:
mpr0: Firmware: 16.00.01.00, Driver: 23.00.00.00-fbsd
Comment 35 Ștefan BĂLU 2020-02-18 11:06:57 UTC
Looks like the problem persists with FreeNAS (FreeBSD 11.3-RELEASE) and Firmware Revision: 15.00.03.00.
Comment 36 lastb0isct 2020-02-26 03:04:47 UTC
Has anyone gotten further with this issue?  Seems that it is affecting me every other hour now! And it all started after adding a WD DC Drive...
Comment 37 Peter Eriksson 2020-03-10 13:35:11 UTC
Any SAS expanders between the SAS HBA and the (SATA) disks?
("mprutil show expanders") - and if so, do they (the expanders) have up-to-date firmware?
Comment 38 Matthias Pfaller 2020-03-10 13:57:46 UTC
(In reply to Peter Eriksson from comment #37)
Not in our case:
[root@nyx ~]# mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 0.00.00.00
Firmware Revision: 19.00.00.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0                              N                 1.5    6.0    SAS Initiator 
1                              N                 1.5    6.0    SAS Initiator 
2                              N                 1.5    6.0    SAS Initiator 
3                              N                 1.5    6.0    SAS Initiator 
4                              N                 1.5    6.0    SAS Initiator 
5                              N                 1.5    6.0    SAS Initiator 
6                              N                 1.5    6.0    SAS Initiator 
7                              N                 1.5    6.0    SAS Initiator 

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b001551a80               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

We did downgrade to firmware 19.00.00.00, but I hadn't the time to run some tests using this firmware level.
Comment 39 Paul Thornton 2020-03-10 14:10:01 UTC
I don't have any expanders (note that this card has IR firmware, in case that is relevant - I wasn't actually aware of that; it is only being used as a dumb HBA):

[root@nas1b ~]# mprutil show all
Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IR
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 8.35.00.00
Firmware Revision: 15.00.03.00
  Integrated RAID: yes

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0                              N                 3.0    12     SAS Initiator 
1                              N                 3.0    12     SAS Initiator 
2                              N                 3.0    12     SAS Initiator 
3                              N                 3.0    12     SAS Initiator 
4                              N                 3.0    12     SAS Initiator 
5                              N                 3.0    12     SAS Initiator 
6                              N                 3.0    12     SAS Initiator 
7                              N                 3.0    12     SAS Initiator 

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500304801cf54a05               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

[root@nas1b ~]#
Comment 40 Peter Eriksson 2020-03-10 14:43:47 UTC
I've had some random problems when using the cards with IR firmware, so I always make sure they are running IT firmware. Might be worth testing.


But that might not be relevant for your problems. We are using them without any issues (so far) but we're using SAS drives (10TB HGST) behind SAS Expanders (and a couple of Intel SSD SATA drives) - so nothing similar to your situation...

# mprutil show all
Adapter:
mpr0 Adapter:
       Board Name: Dell HBA330 Mini
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.08.00
  Integrated RAID: no
Comment 41 Sharad Ahlawat 2020-05-06 20:21:57 UTC
(In reply to Sharad Ahlawat from comment #28)

Symptom:
CAM retry and timeout messages leading to controller aborts and resets

Cause:
slow disks or using SMR disks

Workaround:
Increase the CAM timeout defaults

❯ sysctl kern.cam.da.default_timeout=90
kern.cam.da.default_timeout: 60 -> 90
❯ sysctl kern.cam.ada.default_timeout=60
kern.cam.ada.default_timeout: 30 -> 60

And disable NCQ on SMR Seagates:
❯ cat cam-tags.sh
#!/usr/local/bin/bash
# shrink the Native Command Queue depth to 1, effectively disabling queuing
for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1`;
do
    camcontrol tags $Disk -N 1 ;
    # camcontrol tags $Disk -v
done

If you only have SMR drives in your setup and use a UPS, you could also:
❯ sysctl vfs.zfs.cache_flush_disable=1
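To keep those CAM timeout values across reboots, the usual place is /etc/sysctl.conf (the values just mirror the ones above):

kern.cam.da.default_timeout=90
kern.cam.ada.default_timeout=60

The camcontrol tags change is runtime-only as well, so a script like the one above would have to be re-run after each boot (e.g. from rc.local).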

Solution:
don't use slow disks or SMR disks with ZFS


The long version:

I feel obliged to post this update, given the driver-downgrade workaround I previously posted on this thread before getting to the root cause of these logs in my messages file after upgrading to 12.x:

Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0
Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain
Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0
Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 more tries remain

Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1039 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1357 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1933 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c8 00 00 00 b0 00 00
Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 38 78 00 00 00 58 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 SMID 621 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 476 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 321 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1873 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1852 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1742 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 387 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 2104 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a2 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a3 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a4 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a5 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a6 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a7 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a8 d8 00 00 00 c0 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c0 00 00 00 08 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 36 c0 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data)

Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0
Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 80 44 e8 40 c9 01 00 00 00 00
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain
Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)

❯ smartctl -a /dev/da4
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   178   000    Old_age   Always       -       419
Note: Every controller ABORTED COMMAND increases the ID# 199 count
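A quick way to check whether that counter is still climbing (the device node here is just an example):

❯ smartctl -A /dev/da4 | grep "^199"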

The messages correlate with sustained disk activity:
	scrubs
	large file transfers
	cron jobs kicking off in multiple jails accessing the same pool, etc.

Ruled out issues with:
	loose cables
	cable quality
	controller hardware, BIOS and firmware
	power supply

❯ mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9207-4i4e
   Board Assembly: H3-25434-00H
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator
2       0003        000b       N         6.0     1.5    6.0    SAS Initiator
3       0004        000c       N         6.0     1.5    6.0    SAS Initiator
4       0005        000d       N         6.0     1.5    6.0    SAS Initiator
5       0006        000e       N         6.0     1.5    6.0    SAS Initiator
6       0007        000f       N         6.0     1.5    6.0    SAS Initiator
7       0008        0010       N         6.0     1.5    6.0    SAS Initiator

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   10   4433221100000000 0009    0001      SATA Target   6.0   0001 00    1
00   09   4433221101000000 000a    0002      SATA Target   6.0   0001 01    1
00   11   4433221102000000 000b    0003      SATA Target   6.0   0001 02    1
00   07   4433221103000000 000c    0004      SATA Target   6.0   0001 03    1
00   13   4433221104000000 000d    0005      SATA Target   6.0   0001 07    1
00   04   4433221105000000 000e    0006      SATA Target   6.0   0001 06    1
00   14   4433221106000000 000f    0007      SATA Target   6.0   0001 05    1
00   12   4433221107000000 0010    0008      SATA Target   6.0   0001 04    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b0073400a0               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level


❯ camcontrol devlist

<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus1 target 0 lun 0 (ada1,pass1)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus2 target 0 lun 0 (ada2,pass2)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus3 target 0 lun 0 (ada3,pass3)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (ada4,pass4)
<ST8000DM004-2CX188 0001>          at scbus5 target 0 lun 0 (ada5,pass5)
<ST8000DM004-2CX188 0001>          at scbus6 target 0 lun 0 (ada6,pass6)
<ST8000DM004-2CX188 0001>          at scbus7 target 0 lun 0 (ada7,pass7)
<WDC WD20EFRX-68EUZN0 80.00A80>    at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80>    at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05>        at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05>        at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02>        at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06>        at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44>        at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 14 lun 0 (pass17,da7)

Notice: da4 has a different firmware revision, which is most probably why I see relatively fewer timeout errors on da1, da2 and da3, the other disks in the same ZFS pool.


To understand these error messages:

https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB

❯ cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition

https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition

❯ cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)

https://en.wikipedia.org/wiki/Key_Code_Qualifier
B	47	00	Aborted Command - SCSI parity error

https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h  DZTPROMAEBKVF  INFORMATION UNIT iuCRC ERROR DETECTED

CDB:
	- Bytes 0-1 : other
	- Bytes 2-9 : 8-byte logical block address
	- Bytes 10-13 : 4-byte transfer length, in logical blocks
	- Bytes 14-15 : other
	
So in my logs the failing writes are each 0x100 = 256 logical blocks (128 KiB with 512-byte sectors), to consecutive LBAs. The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub, which I associate with the abort and controller reset events.
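As an illustration, a minimal /bin/sh sketch that decodes one of the WRITE(16) CDBs above using this layout (assuming 512-byte logical blocks):

cdb="8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00"
set -- $cdb
lba=$(printf "%u" 0x$3$4$5$6$7$8$9${10})        # bytes 2-9: logical block address
blocks=$(printf "%u" 0x${11}${12}${13}${14})    # bytes 10-13: transfer length in blocks
echo "LBA=$lba blocks=$blocks bytes=$((blocks * 512))"
# prints: LBA=7327490520 blocks=256 bytes=131072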


Summary:
FreeBSD 12.x and its mps driver code are more efficient than 11.x, pushing more data to and from the disks, and that increased throughput has led to these timeout events.

I also experimented and found that reducing the memory clock speed in the BIOS is another workaround that stops these timeout errors with 12.x. That is essentially what my earlier posted workaround of using the 11.x mps driver with 12.x was doing: slowing down the data transfer rates so the disks could keep up.


Disclaimer:
These observations, analyses and workarounds are specific to my setup; please adapt and incorporate them accordingly.
Comment 42 Matthias Pfaller 2020-05-12 12:14:22 UTC
We followed the advice of Christoph Bubel: we disabled NCQ and set the timeouts to 90s (I can't imagine a situation where this should be necessary, but still...).
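Roughly, the settings we applied look like this (device names are examples; the exact knob used for the 90 s timeout here is an assumption):

# disable NCQ by allowing only one outstanding command per disk
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
	camcontrol tags $d -N 1
done
# raise the da(4) per-command timeout to 90 seconds (assumed sysctl)
sysctl kern.cam.da.default_timeout=90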

Results:
May  8 17:36:00 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 17:36:00 nyx kernel: mps0: Reinitializing controller
May  8 17:36:00 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
May  8 17:36:00 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May  8 17:50:44 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 17:50:44 nyx kernel: mps0: Reinitializing controller
May  8 17:50:44 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
May  8 17:50:44 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May  8 18:10:06 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 18:10:06 nyx kernel: mps0: Reinitializing controller
May  8 18:10:06 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
May  8 18:10:06 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May  8 18:27:27 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting
May  8 18:27:27 nyx kernel: mps0: Reinitializing controller
May  8 18:27:27 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd

We upgraded the controller firmware to 20.00.07.00 and tried again:
[root@nyx ~]# mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 0.00.00.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator 
2       0003        000b       N         6.0     1.5    6.0    SAS Initiator 
3       0004        000c       N         6.0     1.5    6.0    SAS Initiator 
4       0005        000d       N         6.0     1.5    6.0    SAS Initiator 
5       0006        000e       N         6.0     1.5    6.0    SAS Initiator 
6       0007        000f       N         6.0     1.5    6.0    SAS Initiator 
7       0008        0010       N         6.0     1.5    6.0    SAS Initiator 

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   01   4433221100000000 0009    0001      SATA Target   6.0   0001 03    1
00   04   4433221101000000 000a    0002      SATA Target   6.0   0001 02    1
00   03   4433221102000000 000b    0003      SATA Target   6.0   0001 01    1
00   00   4433221103000000 000c    0004      SATA Target   6.0   0001 00    1
00   02   4433221104000000 000d    0005      SATA Target   6.0   0001 07    1
00   05   4433221105000000 000e    0006      SATA Target   6.0   0001 06    1
00   06   4433221106000000 000f    0007      SATA Target   6.0   0001 05    1
00   07   4433221107000000 0010    0008      SATA Target   6.0   0001 04    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b001551a80               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

[root@nyx ~]# for i in $(camcontrol devlist | grep "ST12000" | cut -d"," -f2 | cut -d")" -f1); do 
> camcontrol tags $i
> done
(pass0:mps0:0:0:0): device openings: 1
(pass1:mps0:0:1:0): device openings: 1
(pass2:mps0:0:2:0): device openings: 1
(pass3:mps0:0:3:0): device openings: 1
(pass4:mps0:0:4:0): device openings: 1
(pass5:mps0:0:5:0): device openings: 1
(pass6:mps0:0:6:0): device openings: 1
(pass7:mps0:0:7:0): device openings: 1
[root@nyx ~]# 
Results:

May 11 15:58:58 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 ef 46 55 38 00 00 10 00 length 8192 SMID 64 Command timeout on target 7(0x0010) 90000 set, 90.154685585 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 7 for SMID 64
May 11 15:58:58 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 ef 46 55 38 00 00 10 00 length 8192 SMID 64 Aborting command 0xfffffe00f92e5600
May 11 15:58:58 nyx kernel:     (da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 a0 00 00 00 20 00 00 length 16384 SMID 1564 Command timeout on target 2(0x000d) 90000 set, 90.120064627 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 2 for SMID 1564
May 11 15:58:58 nyx kernel:     (da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 a0 00 00 00 20 00 00 length 16384 SMID 1564 Aborting command 0xfffffe00f93635a0
May 11 15:58:58 nyx kernel:     (da6:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 04 e8 09 8e 30 00 00 00 30 00 00 length 24576 SMID 432 Command timeout on target 6(0x000f) 90000 set, 90.97377933 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 6 for SMID 432
May 11 15:58:58 nyx kernel:     (da6:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 04 e8 09 8e 30 00 00 00 30 00 00 length 24576 SMID 432 Aborting command 0xfffffe00f9304480
May 11 15:58:58 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 ef 46 47 a0 00 01 00 00 length 131072 SMID 1407 Command timeout on target 5(0x000e) 90000 set, 90.108646173 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 5 for SMID 1407
May 11 15:58:58 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 ef 46 47 a0 00 01 00 00 length 131072 SMID 1407 Aborting command 0xfffffe00f93562a8
May 11 15:58:58 nyx kernel:     (da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 04 e0 21 4e 68 00 00 00 10 00 00 length 8192 SMID 2034 Command timeout on target 4(0x000a) 90000 set, 90.67287327 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 4 for SMID 2034
May 11 15:58:58 nyx kernel:     (da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 04 e0 21 4e 68 00 00 00 10 00 00 length 8192 SMID 2034 Aborting command 0xfffffe00f938ad30
May 11 15:58:58 nyx kernel:     (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 04 e2 48 50 10 00 00 00 e8 00 00 length 118784 SMID 1470 Command timeout on target 0(0x000c) 90000 set, 90.51090608 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 0 for SMID 1470
May 11 15:58:58 nyx kernel:     (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 04 e2 48 50 10 00 00 00 e8 00 00 length 118784 SMID 1470 Aborting command 0xfffffe00f935b750
May 11 15:58:58 nyx kernel:     (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 b0 00 00 00 28 00 00 length 20480 SMID 1110 Command timeout on target 1(0x0009) 90000 set, 90.57964893 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 1 for SMID 1110
May 11 15:58:58 nyx kernel:     (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 b0 00 00 00 28 00 00 length 20480 SMID 1110 Aborting command 0xfffffe00f933d390
May 11 15:58:58 nyx kernel:     (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 03 d9 ab c8 40 00 00 00 08 00 00 length 4096 SMID 1125 Command timeout on target 3(0x000b) 90000 set, 90.59928632 elapsed
May 11 15:58:58 nyx kernel: mps0: Sending abort to target 3 for SMID 1125
May 11 15:58:58 nyx kernel:     (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 03 d9 ab c8 40 00 00 00 08 00 00 length 4096 SMID 1125 Aborting command 0xfffffe00f933e7b8
May 11 15:59:05 nyx kernel:     (xpt0:mps0:0:3:0): SMID 8 task mgmt 0xfffffe00f92e0ac0 timed out
May 11 15:59:05 nyx kernel: mps0: Reinitializing controller
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 7
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 2
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 6
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 5
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 4
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 0
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 1
May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 3
May 11 15:59:05 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
May 11 15:59:05 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
May 11 16:32:23 nyx kernel:     (da2:mps0:0:2:0): READ(10). CDB: 28 00 0e 77 4d 58 00 00 28 00 length 20480 SMID 111 Command timeout on target 2(0x000c) 90000 set, 90.363886166 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 2 for SMID 111
May 11 16:32:23 nyx kernel:     (da2:mps0:0:2:0): READ(10). CDB: 28 00 0e 77 4d 58 00 00 28 00 length 20480 SMID 111 Aborting command 0xfffffe00f92e9528
May 11 16:32:23 nyx kernel:     (da4:mps0:0:4:0): READ(10). CDB: 28 00 0e 77 25 58 00 00 08 00 length 4096 SMID 1374 Command timeout on target 4(0x000a) 90000 set, 90.365649439 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 4 for SMID 1374
May 11 16:32:23 nyx kernel:     (da4:mps0:0:4:0): READ(10). CDB: 28 00 0e 77 25 58 00 00 08 00 length 4096 SMID 1374 Aborting command 0xfffffe00f9353650
May 11 16:32:23 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 0e 77 16 d8 00 00 10 00 length 8192 SMID 139 Command timeout on target 7(0x0010) 90000 set, 90.368426169 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 7 for SMID 139
May 11 16:32:23 nyx kernel:     (da7:mps0:0:7:0): READ(10). CDB: 28 00 0e 77 16 d8 00 00 10 00 length 8192 SMID 139 Aborting command 0xfffffe00f92ebac8
May 11 16:32:23 nyx kernel:     (da6:mps0:0:6:0): READ(10). CDB: 28 00 0e 77 97 18 00 00 28 00 length 20480 SMID 731 Command timeout on target 6(0x000f) 90000 set, 90.353716039 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 6 for SMID 731
May 11 16:32:23 nyx kernel:     (da6:mps0:0:6:0): READ(10). CDB: 28 00 0e 77 97 18 00 00 28 00 length 20480 SMID 731 Aborting command 0xfffffe00f931d648
May 11 16:32:23 nyx kernel:     (da3:mps0:0:3:0): READ(10). CDB: 28 00 0e 77 7e 58 00 00 18 00 length 12288 SMID 332 Command timeout on target 3(0x000b) 90000 set, 90.354107428 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 3 for SMID 332
May 11 16:32:23 nyx kernel:     (da3:mps0:0:3:0): READ(10). CDB: 28 00 0e 77 7e 58 00 00 18 00 length 12288 SMID 332 Aborting command 0xfffffe00f92fbe20
May 11 16:32:23 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 0e 77 6e e8 00 00 b8 00 length 94208 SMID 1491 Command timeout on target 5(0x000e) 90000 set, 90.354717928 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 5 for SMID 1491
May 11 16:32:23 nyx kernel:     (da5:mps0:0:5:0): READ(10). CDB: 28 00 0e 77 6e e8 00 00 b8 00 length 94208 SMID 1491 Aborting command 0xfffffe00f935d388
May 11 16:32:23 nyx kernel:     (da1:mps0:0:1:0): READ(10). CDB: 28 00 0e 77 4f d0 00 01 00 00 length 131072 SMID 1330 Command timeout on target 1(0x0009) 90000 set, 90.357039861 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 1 for SMID 1330
May 11 16:32:23 nyx kernel:     (da1:mps0:0:1:0): READ(10). CDB: 28 00 0e 77 4f d0 00 01 00 00 length 131072 SMID 1330 Aborting command 0xfffffe00f934fb30
May 11 16:32:23 nyx kernel:     (da0:mps0:0:0:0): READ(10). CDB: 28 00 0e 77 6b e8 00 00 50 00 length 40960 SMID 692 Command timeout on target 0(0x000d) 90000 set, 90.359445705 elapsed
May 11 16:32:23 nyx kernel: mps0: Sending abort to target 0 for SMID 692
May 11 16:32:23 nyx kernel:     (da0:mps0:0:0:0): READ(10). CDB: 28 00 0e 77 6b e8 00 00 50 00 length 40960 SMID 692 Aborting command 0xfffffe00f931a1e0
May 11 16:32:29 nyx kernel:     (xpt0:mps0:0:4:0): SMID 10 task mgmt 0xfffffe00f92e0d70 timed out
May 11 16:32:29 nyx kernel: mps0: Reinitializing controller

So at least for our setup this didn't help :-(

Matthias

PS: We are using ST12000VN0007 drives. These shouldn't be SMR drives. Please note that these drives do not cause any problems when used with ahci or isci (maxopenings is 32 in that case).
Comment 43 Matthias Pfaller 2020-05-12 12:16:33 UTC
(In reply to Matthias Pfaller from comment #42)
Sorry, I messed up the names... This should have been Sharad Ahlawat and not Christoph Bubel.
Comment 44 Sharad Ahlawat 2020-05-13 06:46:27 UTC
(In reply to Matthias Pfaller from comment #43)

The logs show the read events are still timing out, even with the 90s timeout.

Also, there are no retry messages; is "sysctl kern.cam.da.retry_count" set to 0?

You could try a few things to get to the root cause:

a: also disable ZFS cache flush
sysctl vfs.zfs.cache_flush_disable=1
even though your drives are not SMRs

b: experiment with larger timeout values
also observe "gstat" output and ensure the first column L(q) is continually returning to zero and not getting stuck for any of the drives

c: try reducing the link speed to 3.0 in the controller settings, just to eliminate a disk firmware speed compatibility issue.
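For the gstat suggestion in b, something like this is a rough sketch (see gstat(8) for the exact flags): it refreshes once a second and shows only physical providers, so you can watch L(q) per drive:

gstat -p -I 1s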

Side note, not sure if this applies to your drives but a few of mine don't support NCQ TRIM and are not properly driver blacklisted, so I had to set vfs.unmapped_buf_allowed=0 in loader.conf
Comment 45 Matthias Pfaller 2020-05-13 07:41:40 UTC
(In reply to Sharad Ahlawat from comment #44)
[root@nyx ~]# sysctl kern.cam.da.retry_count
kern.cam.da.retry_count: 4

Previously we had the drives connected to the onboard SATA/SAS ports (hoping the mps problems would get solved). The machine was running for months without showing any disk-related problems. At the moment we are looking for a replacement for the SAS2008 (for other machines without that many onboard SATA/SAS channels) and are testing the following config:

ahci0: <Marvell 88SE9215 AHCI SATA controller> port 0x3050-0x3057,0x3040-0x3043,0x3030-0x3037,0x3020-0x3023,0x3000-0x301f mem 0xdf940000-0xdf9407ff irq 32 at device 0.0 on pci5
ahci0: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahci1: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich4: <AHCI channel> at channel 0 on ahci1
ahcich5: <AHCI channel> at channel 1 on ahci1
ahcich6: <AHCI channel> at channel 2 on ahci1
ahcich7: <AHCI channel> at channel 3 on ahci1
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada0: Serial Number ZJV3ZYSX
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 11444224MB (23437770752 512 byte sectors)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada1: Serial Number ZJV1VVWX
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 11444224MB (23437770752 512 byte sectors)
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada2: Serial Number ZJV1WLXM
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 11444224MB (23437770752 512 byte sectors)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada3: Serial Number ZJV2YY9A
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 11444224MB (23437770752 512 byte sectors)
ada4 at ahcich4 bus 0 scbus4 target 0 lun 0
ada4: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada4: Serial Number ZJV3ZJNA
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 11444224MB (23437770752 512 byte sectors)
ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada5: Serial Number ZJV3ZXN5
ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada5: Command Queueing enabled
ada5: 11444224MB (23437770752 512 byte sectors)
ada6 at ahcich6 bus 0 scbus6 target 0 lun 0
ada6: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada6: Serial Number ZJV2MWZ0
ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada6: Command Queueing enabled
ada6: 11444224MB (23437770752 512 byte sectors)
ada7 at ahcich7 bus 0 scbus7 target 0 lun 0
ada7: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device
ada7: Serial Number ZCH0HHJ6
ada7: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada7: Command Queueing enabled
ada7: 11444224MB (23437770752 512 byte sectors)
ada8: <SAMSUNG MZ7KM240HMHQ-00005 GXM5304Q> ACS-2 ATA SATA 3.x device
ada8: Serial Number S3F5NX0M400493
ada8: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada8: Command Queueing enabled
ada8: 228936MB (468862128 512 byte sectors)
ada8: quirks=0x1<4K>
ada9: <SAMSUNG MZ7KM240HMHQ-00005 GXM5304Q> ACS-2 ATA SATA 3.x device
ada9: Serial Number S3F5NX0M400461
ada9: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada9: Command Queueing enabled
ada9: 228936MB (468862128 512 byte sectors)
ada9: quirks=0x1<4K>

I reverted the "camcontrol tags <DEV> -N 1" and the changed sysctls. So with this setup the system runs with the same drives and without any changed sysctl settings or cam settings. 

Up to now, no problems have shown up. Operating conditions are the same as with the last mps test (a zpool scrub is running and our systems did a backup to the machine last night).

As the drives behave well with at least three different controllers

  ahci0: <Intel Patsburg AHCI SATA controller>
  ahci1: <Marvell 88SE9215 AHCI SATA controller>
  isci0: <Intel(R) C600 Series Chipset SAS Controller (SATA mode)>

I can't imagine that changing parameters on the cam/zfs side (NCQ TRIM, NCQ, timeouts) will help in our case.

Probably we will have to decommission the LSI9211 boards :-(


regards, Matthias
Comment 46 Wayne Willcox 2020-11-05 12:53:00 UTC
So I have been getting the same errors in 12.1 and now 12.2
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 8 lun 0 (pass0,da0)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 9 lun 0 (pass1,da1)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 10 lun 0 (pass2,da2)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 11 lun 0 (pass3,da3)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 12 lun 0 (pass4,da4)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 13 lun 0 (pass5,da5)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 14 lun 0 (pass6,da6)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 15 lun 0 (pass7,da7)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 16 lun 0 (pass8,da8)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 17 lun 0 (pass9,da9)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 18 lun 0 (pass10,da10)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 19 lun 0 (pass11,da11)
<LSI CORP SAS2X36 0717>            at scbus0 target 20 lun 0 (ses0,pass12)
<LSI CORP SAS2X28 0717>            at scbus0 target 21 lun 0 (ses1,pass13)
<ATA WDC WD20EFRX-68A 0A80>        at scbus0 target 22 lun 0 (pass14,da12)
<ATA WDC WD20EFRX-68A 0A80>        at scbus0 target 23 lun 0 (pass15,da13)
<ATA WDC WD20EFRX-68A 0A80>        at scbus0 target 24 lun 0 (pass16,da14)
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: H3-25250-02B
        Chip Name: LSISAS2008
    Chip Revision: ALL:
    BIOS Revision: 7.39.00.00
Firmware Revision: 20.00.06.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0                              N                 1.5    6.0    SAS Initiator 
1                              N                 1.5    6.0    SAS Initiator 
2                              N                 1.5    6.0    SAS Initiator 
3                              N                 1.5    6.0    SAS Initiator 
4       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
5       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
6       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
7       0001        0009       N         6.0     1.5    6.0    SAS Initiator 


hawthorn kernel log messages:
+       (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 1732 Command timeout on target 24(0x001a) 60000 set, 60.96681849 elapsed
+mps0: Sending abort to target 24 for SMID 1732
+       (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 1732 Aborting command 0xfffffe00daa55760
+       (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 30 00 length 24576 SMID 1089 Command timeout on target 12(0x000f) 60000 set, 60.97158463 elapsed
+mps0: Sending abort to target 12 for SMID 1089
+       (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 30 00 length 24576 SMID 1089 Aborting command 0xfffffe00daa1f758
+       (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 49 79 f9 f8 00 00 08 00 length 4096 SMID 1230 Command timeout on target 24(0x001a) 60000 set, 60.97561590 elapsed
+       (da13:mps0:0:23:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 78 Command timeout on target 23(0x0019) 60000 set, 60.97777852 elapsed
+mps0: Sending abort to target 23 for SMID 78
+       (da13:mps0:0:23:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 78 Aborting command 0xfffffe00da9ca8d0
+       (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 48 00 00 10 00 length 8192 SMID 217 Command timeout on target 13(0x0010) 60000 set, 60.98193577 elapsed
+mps0: Sending abort to target 13 for SMID 217
+       (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 48 00 00 10 00 length 8192 SMID 217 Aborting command 0xfffffe00da9d6398
+       (da8:mps0:0:16:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 670 Command timeout on target 16(0x0013) 60000 set, 60.98762575 elapsed
+mps0: Sending abort to target 16 for SMID 670
+       (da8:mps0:0:16:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 670 Aborting command 0xfffffe00da9fc450
+       (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 91 Command timeout on target 10(0x000d) 60000 set, 60.99148905 elapsed
+mps0: Sending abort to target 10 for SMID 91
+       (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 91 Aborting command 0xfffffe00da9cba48
+       (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 1b 8d ae b0 00 00 08 00 length 4096 SMID 1064 Command timeout on target 10(0x000d) 60000 set, 60.99537335 elapsed
+       (da9:mps0:0:17:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 982 Command timeout on target 17(0x0014) 60000 set, 60.99738899 elapsed
+mps0: Sending abort to target 17 for SMID 982
+       (da9:mps0:0:17:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 982 Aborting command 0xfffffe00daa16790
+       (da12:mps0:0:22:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 1677 Command timeout on target 22(0x0018) 60000 set, 60.100129429 elapsed
+mps0: Sending abort to target 22 for SMID 1677
+       (da12:mps0:0:22:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 1677 Aborting command 0xfffffe00daa50d78
+       (da10:mps0:0:18:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1723 Command timeout on target 18(0x0015) 60000 set, 60.100597645 elapsed
+mps0: Sending abort to target 18 for SMID 1723
+       (da10:mps0:0:18:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1723 Aborting command 0xfffffe00daa54b48
+       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 08 00 length 4096 SMID 944 Command timeout on target 9(0x000c) 60000 set, 60.101000772 elapsed
+mps0: Sending abort to target 9 for SMID 944
+       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 08 00 length 4096 SMID 944 Aborting command 0xfffffe00daa13480
+       (da11:mps0:0:19:0): WRITE(10). CDB: 2a 00 49 79 f9 f8 00 00 08 00 length 4096 SMID 391 Command timeout on target 19(0x0016) 60000 set, 60.101567670 elapsed
+mps0: Sending abort to target 19 for SMID 391
+       (da11:mps0:0:19:0): WRITE(10). CDB: 2a 00 49 79 f9 f8 00 00 08 00 length 4096 SMID 391 Aborting command 0xfffffe00da9e4d68
+       (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 53 42 14 38 00 00 10 00 length 8192 SMID 1326 Command timeout on target 15(0x0012) 60000 set, 60.101970797 elapsed
+mps0: Sending abort to target 15 for SMID 1326
+       (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 53 42 14 38 00 00 10 00 length 8192 SMID 1326 Aborting command 0xfffffe00daa335d0
+       (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 08 00 length 4096 SMID 456 Command timeout on target 11(0x000e) 60000 set, 60.105479267 elapsed
+mps0: Sending abort to target 11 for SMID 456
+       (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 08 00 length 4096 SMID 456 Aborting command 0xfffffe00da9ea4c0
+       (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1882 Command timeout on target 14(0x0011) 60000 set, 60.106592066 elapsed
+mps0: Sending abort to target 14 for SMID 1882
       (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1882 Aborting command 0xfffffe00daa620f0
+       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 53 42 14 20 00 00 48 00 length 36864 SMID 69 Command timeout on target 9(0x000c) 60000 set, 60.69088610 elapsed
+       (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 20 00 00 08 00 length 4096 SMID 912 Command timeout on target 10(0x000d) 60000 set, 60.69428749 elapsed
+       (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 53 42 14 28 00 00 48 00 length 36864 SMID 440 Command timeout on target 14(0x0011) 60000 set, 60.349451194 elapsed
+       (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 58 00 00 18 00 length 12288 SMID 1580 Command timeout on target 13(0x0010) 60000 set, 60.349818628 elapsed
+       (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1052 Command timeout on target 14(0x0011) 60000 set, 60.350062184 elapsed
+       (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1375 Command timeout on target 13(0x0010) 60000 set, 60.350263749 elapsed
+       (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1455 Command timeout on target 12(0x000f) 60000 set, 60.350454814 elapsed
+       (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 30 00 length 24576 SMID 775 Command timeout on target 12(0x000f) 60000 set, 60.350652179 elapsed
+       (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 931 Command timeout on target 11(0x000e) 60000 set, 60.350866340 elapsed
+       (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1319 Command timeout on target 10(0x000d) 60000 set, 60.351090999 elapsed
+       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 728 Command timeout on target 9(0x000c) 60000 set, 60.351292564 elapsed
+       (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 53 42 14 28 00 00 48 00 length 36864 SMID 843 Command timeout on target 11(0x000e) 60000 set, 60.351485729 elapsed
+       (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 53 42 14 68 00 00 08 00 length 4096 SMID 412 Command timeout on target 24(0x001a) 60000 set, 60.351704089 elapsed
+       (da13:mps0:0:23:0): WRITE(10). CDB: 2a 00 53 42 14 68 00 00 08 00 length 4096 SMID 458 Command timeout on target 23(0x0019) 60000 set, 60.351935048 elapsed
+mps0: Reinitializing controller
+mps0: Unfreezing devq for target ID 24
+mps0: Unfreezing devq for target ID 12
+mps0: Unfreezing devq for target ID 23
+mps0: Unfreezing devq for target ID 13
+mps0: Unfreezing devq for target ID 16
+mps0: Unfreezing devq for target ID 10
+mps0: Unfreezing devq for target ID 17
+mps0: Unfreezing devq for target ID 22
+mps0: Unfreezing devq for target ID 18
+mps0: Unfreezing devq for target ID 9
+mps0: Unfreezing devq for target ID 19
+mps0: Unfreezing devq for target ID 15
+mps0: Unfreezing devq for target ID 11
+mps0: Unfreezing devq for target ID 14
+mps0: Firmware: 20.00.06.00, Driver: 21.02.00.00-fbsd
+mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

Taking the suggestion in comment 45, I have reduced the max openings for my devices from 255 to 100. They went all the way to 1 but hopefully 100 will be sufficient.
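For reference, roughly how that can be set per device (a sketch mirroring the camcontrol loop shown earlier in this PR; the grep pattern just matches my IBM-XIV drives):

for i in $(camcontrol devlist | grep "IBM-XIV" | cut -d"," -f2 | cut -d")" -f1); do
	camcontrol tags $i -N 100
done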
Comment 47 elij 2020-12-29 13:08:14 UTC
I have a SAS9305-16i and was seeing similar errors to some of the folks here. The comment about NCQ got me thinking, and sure enough... it looks like Broadcom has new firmware out (I had 16.00.11.00 when I saw the errors).

16.00.12.00 lists these two entries in its fixed "defect list":

    ID:
        DCSG00398894
    Headline: SATA only:
        WRITE SAME NCQ encapsulation assumes NonData NCQ is supported if Zero
        EXT is Supported
    Description Of Change:
        Disable NCQ encapsulation if Zero EXT is supported but Non Data NCQ is
        not supported
    Issue Description:
        In case if Zero EXT is supported but Non Data NCQ is not supported by
        drive, WRITE SAME NCQ encapsulation would send Non Data NCQ command to
        the drive.
        Drive would fail the command as Non Data NCQ is not supported by drive.
        This will cause command failure to host.
    Steps To Reproduce:
        IO errors are observed when mkfs.ext4 operation is done on drives that
        support Zero EXT but do not support Non Data NCQ.


    ID: 
        DCSG00411882 (Port Of Defect DCSG00139294)
    Headline:
        SATA only : Controller hang is observed due to recursive function call.
    Description Of Change:
        Avoid starting of pended commands to SATA drive from completion
        functions of a command to avoid recursion.
    Issue Description:
        Controller hang is observed with ATA pass through command followed by a
        pended command that fails.
        At the completion of pass through command, PL starts the pended IOs if any.
        If the pended IO is failed due to invalid CDB, then immediately
        completion function is called causing recursion. This causes controller
        hang.
    Steps To Reproduce:
        Install the FreeNAS and try to create ZFS pool(Storage -> Pools -> Add)
        of the direct attached SSDs.
        - ATA command completes when there are IOs in pendlist.
        - The pended IO has invalid CDB

I have installed the firmware and am keeping an eye on it. My system is lightly loaded though, so my sightings of the issue have been rather sporadic.
Comment 48 Daniel Austin 2021-04-02 16:01:57 UTC
I have this same issue too...

My setup is an LSI 9206-16e PCIe card (which is really just 2 x LSISAS2308 cards with a PCIe switch) and FreeBSD 12 and 13.  I have this connecting to a QNAP TL-D1600S 16-bay SATA chassis with 12 x 8TB Toshiba SATA (non-SMR) disks.

I tried a workaround found online (camcontrol tags daX -N 1) to disable NCQ per drive and this is fine *once the server is booted*... however, if I ever rebooted, I got a massive scroll of CAM errors as the kernel was trying to import a ZFS pool before the camcontrol script had run... This even led to ZFS reporting too many errors and taking the pool offline (note: I'm not booting from this pool).

I don't have the luxury of a firmware update, as there haven't been any updates for the SAS2308 in a long, long time!  I am running 20.00.07.00.

I've also tried 3 different 9206-16e cards, so I'm happy it's not just a faulty card.  I also tried in 2 different servers with the same results, so I'm happy it's probably not a hardware issue at all.

My final solution, which is really a kludge but does fix the issue permanently, was to boot into a live Ubuntu environment and use lsiutil 1.72 to disable NCQ on the card itself.  This setting is saved in the card's EEPROM.

Now when I boot FreeBSD, I get zero CAM errors from any disks, the ZFS pool imports straight away, and I can still max out the bandwidth to the drives - yay.

I also have an LSI 9207-8i card in a different machine running the same firmware which has no issues at all even with NCQ enabled (but those drives are directly connected to the card via SATA cables, with no enclosure as such), so I do think this is just some kind of incompatibility between driver <-> card <-> enclosure.

It would be lovely if lsiutil could be ported to bsd... I did look at it briefly but it's beyond my capabilities... the source code is online if anyone wanted to try.

Hope that helps anyone else stumbling upon this PR; I appreciate it may only fix some, not all, of the cases reported here so far... but some is better than none :-)
Comment 49 Conall O'Brien 2021-05-31 09:56:29 UTC
I've been experiencing these issues with WD Red 6TB disks. I had frequent, unexplained reboots with 12.1-RELEASE and 12.2-RELEASE. 
mprutil show all
Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IR
   Board Assembly: 
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 12.00.00.00
Firmware Revision: 10.00.00.00
  Integrated RAID: yes

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     3.0    12     SAS Initiator 
1       0002        000a       N         6.0     3.0    12     SAS Initiator 
2       0003        000b       N         6.0     3.0    12     SAS Initiator 
3       0004        000c       N         6.0     3.0    12     SAS Initiator 
4       0005        000d       N         6.0     3.0    12     SAS Initiator 
5       0006        000e       N         6.0     3.0    12     SAS Initiator 
6       0007        000f       N         6.0     3.0    12     SAS Initiator 
7       0008        0010       N         6.0     3.0    12     SAS Initiator 


Since upgrading to 13.0-RELEASE I am no longer experiencing reboots, but I do continue to see CAM errors


da7:mpr0:0:9:0): Info: 0x8024c3a0
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 80 24 c3 e8 00 00 08 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x8024c3e8
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 3a 80 4d 20 00 00 10 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x3a804d20
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 31 c4 f0 90 00 00 10 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x31c4f090
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 31 c4 f8 b8 00 06 88 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x31c4f8b8
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 31 c3 c0 a0 00 01 60 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x31c3c0a0
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 10 00 00 00 10 00 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x2baa0f010
(da7:mpr0:0:9:0): Error 22, Unretryable error
(da7:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 02 14 4b b4 c0 00 00 00 08 00 00 
(da7:mpr0:0:9:0): CAM status: SCSI Status Error
(da7:mpr0:0:9:0): SCSI status: Check Condition
(da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mpr0:0:9:0): Info: 0x2144bb4c0
(da7:mpr0:0:9:0): Error 22, Unretryable error


I have upgraded to 13.0-RELEASE-p1, on account of FreeBSD-EN-21:13.mpt. Could the issues from that errata notice also be an issue for mps?
Comment 51 Alexander Motin freebsd_committer freebsd_triage 2022-01-06 19:24:16 UTC
(In reply to Daniel Austin from comment #48)
I see the mpsutil/mprutil tools recently got support for a `set ncq` subcommand, which should allow disabling NCQ.  I've merged that into TrueNAS 12.0-U7.  I haven't tried it myself, and it makes me shiver inside from its inefficiency, but I can believe that it may reduce maximum command latency in some scenarios, or maybe even avoid command timeouts in situations where the disks or the HBAs can't schedule or process the NCQ commands reasonably.
Comment 52 Daniel Austin 2022-01-06 20:10:12 UTC
(In reply to Alexander Motin from comment #51)
If it helps, it seems to be enclosure related...

I can use NCQ on my disks and controller in another enclosure (e.g. I have an Areca 8 bay that's fine)... but when using my QNAP jbod enclosure (TL-D1600S) I get errors when NCQ is enabled.
Comment 53 leopoldo.20b7513b 2022-02-13 14:43:24 UTC
I just started experiencing this issue with my setup with 3 IBM M1015 HBAs and ST10000NM002G SAS drives.

Has anyone tested their setup with TrueNAS Scale? Since the platform is based on Linux I was hoping this bug was not present. I may try switching when Scale is out of RC status later this month.
Comment 54 JerRatt IT 2022-05-17 23:38:53 UTC
I'm reporting either the same or a similar issue. Here are my findings; please let me know if my plans sound correct:

Setup:
TrueNAS Scale 22.02.0.1
AMD Threadripper 1920X
ASRock X399 Taichi
128GB (8x16GB) Crucial CT8G4WFD824A Unbuffered ECC
AVAGO/LSI 9400-8i SAS3408 12Gbps HBA Adapter
Supermicro BPN-SAS3-743A 8-Port SAS3/SAS2/SATA 12Gbps Backplane
8 x Seagate Exos X18 18TB HDD ST18000NM004J SAS 12Gbps 512e/4Kn
2 x Crucial 120GB SSD
2 x Crucial 1TB SSD
2 x Western Digital 960GB NVME
Supermicro 4U case w/2000watt Redundant Power Supply

The system is connected to a large APC data-center battery system and power conditioner, in an HVAC-controlled area. All hard drives have the newest firmware and use 4K sectors, both logical and native. The controller has the newest firmware, both regular and legacy ROMs, and is flashed in SATA/SAS-only mode (dropping the NVMe multi/tri-mode option that the new 9400-series cards support).

Running any kind of heavy I/O on the 18TB drives that are connected to the BPN-SAS3-743A backplane and through to the LSI 9400-8i HBA eventually results in the drives resetting. This happens even without the drives assigned to any kind of ZFS pool, and whether running from the shell within the GUI or from the shell itself. It happens on all drives, which are split across two separate SFF8643 cables to a backplane that has two separate SFF8643 ports.

To cause this to happen, I can either run badblocks on each drive (using: badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/sdX) or even just run a SMART extended/long test.

Eventually, all or nearly all drives will reset and even spin down (according to the shell logs). Sometimes they reset in batches, while others continue chugging along. It has made completing any kind of SMART extended test impossible. Badblocks will fail out, reporting too many bad blocks, on multiple hard drives at nearly the exact same moment, yet consecutive badblocks scans won't report bad blocks in the same areas. The SMART test will just show "aborted, drive reset?" as the result.

My plan was to replace the HBA with an older LSI 9305-16i, replace the two SFF8643-SFF8643 cables going from the HBA to the backplane just for good measure, install two different SFF8643-SFF8482 cables that bypass the backplane fully, then take four of the existing Seagate 18TB drives and put them on the SFF8643-SFF8482 connections that bypass the backplane, and also add four new WD Ultrastar DC HC550 (WUH721818AL5204) drives into the mix (some using the backplane, some not). That should reveal whether this is a compatibility/bug issue with all large drives or certain large drives on an LSI controller, the mpr driver, and/or this backplane.

If none of that works or it doesn't eliminate all the potential points of failure, I'm left with nothing but subpar workarounds, such as just using the onboard SATA ports, disabling NCQ on the LSI controller, or setting up an L2ARC cache (or I might try a metadata cache to see if that circumvents the issue as well).




Condensed logs when one drive errors out:

sd 0:0:0:0: device_unblock and setting to running, handle(0x000d)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
~
~
~
~
sd 0:0:0:0: Power-on or device reset occurred
.......ready
sd 0:0:6:0: device_block, handle(0x000f)
sd 0:0:9:0: device_block, handle(0x0012)
sd 0:0:10:0: device_block, handle(0x0014)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
sd 0:0:9:0: device_unblock and setting to running, handle(0x0012)
sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
sd 0:0:10:0: device_unblock and setting to running, handle(0x0014)
sd 0:0:9:0: Power-on or device reset occurred
sd 0:0:6:0: Power-on or device reset occurred
sd 0:0:10:0: Power-on or device reset occurred
scsi_io_completion_action: 5 callbacks suppressed
sd 0:0:10:0: [sdd] tag#5532 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5532 Sense Key : Not Ready [current] [descriptor] 
sd 0:0:10:0: [sdd] tag#5532 Add. Sense: Logical unit not ready, additional power granted
sd 0:0:10:0: [sdd] tag#5532 CDB: Write(16) 8a 00 00 00 00 00 5c 75 7a 12 00 00 01 40 00 00
print_req_error: 5 callbacks suppressed
blk_update_request: I/O error, dev sdd, sector 12409622672 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
sd 0:0:10:0: [sdd] tag#5533 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5533 Sense Key : Not Ready [current] [descriptor] 
sd 0:0:10:0: [sdd] tag#5533 Add. Sense: Logical unit not ready, additional power use not yet granted
sd 0:0:10:0: [sdd] tag#5533 CDB: Write(16) 8a 00 00 00 00 00 5c 75 76 52 00 00 01 40 00 00
blk_update_request: I/O error, dev sdd, sector 12409614992 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
~
~
~
~
sd 0:0:10:0: [sdd] Spinning up disk...
.
sd 0:0:3:0: device_block, handle(0x0013)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
.
sd 0:0:3:0: device_unblock and setting to running, handle(0x0013)
.
sd 0:0:3:0: Power-on or device reset occurred
.................ready
Comment 55 Allan Jude freebsd_committer freebsd_triage 2022-05-28 13:10:49 UTC
If you see the error message: "kernel: mps0: IOC Fault 0x40000d04, Resetting"

You need to update to get this fix: https://cgit.freebsd.org/src/commit/?id=e30fceb89b7eb51825bdd65f9cc4fbadf107d763

If your errors don't include that code, that is a different problem.
Comment 56 Michel Le Cocq 2024-01-05 08:44:20 UTC
(In reply to Allan Jude from comment #55)

Hi, has this patch been included in 13.2-RELEASE?

I'm on 13.2-RELEASE-p9 and have the same issue with 2 different 9207-8i cards:

Jan  5 09:38:49 gobt kernel: mps0: IOC Fault 0x40000d04, Resetting
Jan  5 09:38:49 gobt kernel: mps0: Reinitializing controller
Jan  5 09:38:49 gobt kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd

------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2308_2
  BIOS version                            : 7.39.02.00
  Firmware version                        : 20.00.07.00

mps0@pci0:4:0:0:        class=0x010700 rev=0x05 hdr=0x00 vendor=0x1000 device=0x0087 subvendor=0x1000 subdevice=0x3020
    vendor     = 'Broadcom / LSI'
    device     = 'SAS2308 PCI-Express Fusion-MPT SAS-2'
    class      = mass storage
    subclass   = SAS
pcib8@pci0:6:0:0:       clas
Comment 57 Michel Le Cocq 2024-01-17 08:23:40 UTC
No more trouble since upgrading to 14.0.
Comment 58 Wayne Willcox 2024-01-20 12:51:15 UTC
Just finished updating to 14.0-RELEASE-p4 on my problem-child Supermicro system.
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: H3-25250-02B
        Chip Name: LSISAS2008
    Chip Revision: ALL:
    BIOS Revision: 7.39.00.00
Firmware Revision: 20.00.06.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (5.0 GB/sec)
        IOC Speed: Full
      Temperature: Unknown/Unsupported

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
1       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
2       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
3       0001        0009       N         6.0     1.5    6.0    SAS Initiator 
4                              N                 1.5    6.0    SAS Initiator 
5                              N                 1.5    6.0    SAS Initiator 
6                              N                 1.5    6.0    SAS Initiator 
7                              N                 1.5    6.0    SAS Initiator 

Under heavy I/O load this system has regularly got SCSI timeout errors, sometimes causing the zpool or zpools to go offline.  This included zroot hanging the system.  I tried several different LSI 9211 adapters and could never resolve the issue.

After upgrading to FreeBSD 14 the problem has not happened, in spite of my running rsyncs and zfs send/receives from different pools.  I even tried to reproduce it by also doing a scrub. It has only been a few days, but so far it looks promising that the issue is resolved by going to FreeBSD 14.0.
Comment 59 davydm 2025-01-11 21:16:09 UTC
On 14.1-RELEASE-p6 and facing similar issues.

Only seeing SCSI errors on one disk of a 5-disk raidz vdev: /dev/da7
The disk is an HC520 12TB.

Depending on the load, the system hangs or reboots.



Example errors in /var/log/messages:
(da7:mps0:0:17:0): READ(10). CDB: 28 00 00 2e 67 c8 00 00 40 00 
mps0: Controller reported scsi ioc terminated tgt 17 SMID 344 loginfo 31110d01
(da7:mps0:0:17:0): CAM status: CCB request completed with an error
(da7:mps0:0:17:0): Retrying command, 3 more tries remain
...
(da7:mps0:0:17:0): CAM status: SCSI Status Error
(da7:mps0:0:17:0): SCSI status: Check Condition
(da7:mps0:0:17:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)



Details on the SAS LSI9211-8i:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: 
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.29.00.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (5.0 GB/sec)
        IOC Speed: Full
      Temperature: Unknown/Unsupported


mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd



Since the HC520 is not an SMR disk, would disabling NCQ be worth testing?