Over on the Freenas forums several people reported issues with large (10TB) Seagate drives (ST10000NM0016 and ST10000VN0004) and LSI controllers. Links to the threads: https://forums.freenas.org/index.php?threads/lsi-avago-9207-8i-with-seagate-10tb-enterprise-st10000nm0016.58251/ https://forums.freenas.org/index.php?threads/synchronize-cache-command-timeout-error.55067/ I am using the ST10000NM0016 drives and i am getting the following errors on a LSI SAS2308 (mps driver) and on a LSI SAS3008 (mpr driver). This happens about once every one or two weeks in low load situations. Here the logs: (da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1010 command timeout cm 0xfffffe0000f7cda0 ccb 0xfffff8018198d800 (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000f7cda0 mps0: Sending reset from mpssas_send_abort for target ID 1 (da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 length 4096 SMID 959 terminated ioc 804b scsi 0 state c xfer 0 mps0: Unfreezing devq for target ID 1 (da2:mps0:0:1:0): WRITE(16). CDB: 8a 00 00 00 00 02 e1 76 9f 88 00 00 00 08 00 00 (da2:mps0:0:1:0): CAM status: CCB request completed with an error (da2:mps0:0:1:0): Retrying command (da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 (da2:mps0:0:1:0): CAM status: Command timeout (da2:mps0:0:1:0): Retrying command (da2:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 (da2:mps0:0:1:0): CAM status: SCSI Status Error (da2:mps0:0:1:0): SCSI status: Check Condition (da2:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) (da2:mps0:0:1:0): Error 6, Retries exhausted (da2:mps0:0:1:0): Invalidating pack ------- (da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 319 Aborting command 0xfffffe0000f54a90 mpr0: Sending reset from mprsas_send_abort for target ID 4 (da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 length 20480 SMID 320 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 mpr0: Unfreezing devq for target ID 4 (da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 (da1:mpr0:0:4:0): CAM status: CCB request completed with an error (da1:mpr0:0:4:0): Retrying command (da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 (da1:mpr0:0:4:0): CAM status: Command timeout (da1:mpr0:0:4:0): Retrying command (da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 b9 f0 00 00 00 28 00 00 (da1:mpr0:0:4:0): CAM status: SCSI Status Error (da1:mpr0:0:4:0): SCSI status: Check Condition (da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) (da1:mpr0:0:4:0): Retrying command (per sense data) (da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 length 16384 SMID 653 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 711 terminated ioc 804b loginfo 311(da1:mpr0:0:4:0): WRITE(16). CDB: 8a 00 00 00 00 03 35 b7 ba 80 00 00 00 20 00 00 10e03 scsi 0 state c xfer 0 (da1:mpr0:0:4:0): CAM status: CCB request completed with an error (da1:mpr0:0:4:0): Retrying command (da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 (da1:mpr0:0:4:0): CAM status: CCB request completed with an error (da1:mpr0:0:4:0): Retrying command (da1:mpr0:0:4:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 (da1:mpr0:0:4:0): CAM status: SCSI Status Error (da1:mpr0:0:4:0): SCSI status: Check Condition (da1:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) (da1:mpr0:0:4:0): Error 6, Retries exhausted (da1:mpr0:0:4:0): Invalidating pack (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 797 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 753 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 846 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 828 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 662 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 738 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 778 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 747 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 954 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 923 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 932 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0 (pass1:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 926 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
We have the very same problem, but with WD Red disks. The system randomly reboots, sometimes after 20 days of operation, and it is a different disk every time. It's our production NFS server, so this is now very frustrating.

Supermicro 5049P, 64 GB ECC RAM, LSI 3008 in IT mode, 18x WD Red 4 TB

Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Command timeout on target 25(0x001c), 60000 set, 60.107418057 elapsed
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name ( )
Mar 23 07:39:46 fap kernel: mpr0: Sending abort to target 25 for SMID 357
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 357 Aborting command 0xfffffe00b7aa6130
Mar 23 07:39:46 fap kernel: (pass19:mpr0:0:25:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 1182 Command timeout on target 25(0x001c), 60000 set, 60.217681679 elampr0: At enclosure level 0, slot 17, connector name ( )
Mar 23 07:39:46 fap kernel: mpr0: Controller reported scsi ioc terminated tgt 25 SMID 1182 loginfo 31130000
Mar 23 07:39:46 fap kernel: mpr0: Abort failed for target 25, sending logical unit reset
Mar 23 07:39:46 fap kernel: mpr0: Sending logical unit reset to target 25 lun 0
Mar 23 07:39:46 fap kernel: mpr0: At enclosure level 0, slot 17, connector name ( )
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CCB request aborted by the host
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 0 more tries remain
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Error 5, Retries exhausted
Mar 23 07:39:46 fap smartd[95746]: Device: /dev/da17 [SAT], failed to read SMART Attribute Data
Mar 23 07:39:46 fap kernel: mpr0: mprsas_action_scsiio: Freezing devq for target ID 25
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): WRITE(10). CDB: 2a 00 09 4a 32 a8 00 00 08 00
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): CAM status: CAM subsystem is busy
Mar 23 07:39:46 fap kernel: (da17:mpr0:0:25:0): Retrying command, 3 more tries remain
Mar 23 07:43:19 fap syslogd: kernel boot file is /boot/kernel/kernel
Mar 23 07:43:19 fap kernel: ---<<BOOT>>---
Forgot to say, it's FreeBSD 12-RELEASE.
Created attachment 205003 [details] /var/log/messages during device resets
We just configured a backup server with eight Seagate IronWolf (ST12000VN0007-2GS116) 12TB disks connected to a SAS2008:

Jun 12 08:51:35 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3
Jun 12 08:51:35 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Jun 12 08:51:35 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

After writing ~200 GB to our pool it started resetting. I ran:

sysctl dev.mps.0.debug_level=0x$((0x1+0x2+0x4+0x10+0x20))

The resulting trace is attached.
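(Note: if the box panics and reboots before the trace can be saved, the same debug mask can also be applied automatically at boot. A minimal sketch, assuming controller instance 0; the mask 0x37 is just the sum of the bits used above, 0x1+0x2+0x4+0x10+0x20:

# /etc/sysctl.conf -- re-applied on every boot after the controller attaches
dev.mps.0.debug_level=0x37
)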
We are using FreeBSD 12.0-RELEASE: FreeBSD nyx 12.0-RELEASE-p4 FreeBSD 12.0-RELEASE-p4 GENERIC amd64
Just to follow up on my previous post from March: same hardware and same config, we reverted back to 11.1-RELEASE and everything has been working flawlessly for more than two months now.
And I don't think the problem is exclusive to Seagate 10TB drives. We have WD Red 4TB drives and see the same problem. We had the same situation with 11.2-RELEASE as well, and because 11.2 and 12.0 ship the same mpr/mps driver version we decided to try 11.1.
(In reply to Bane Ivosev from comment #8) We have several other machines with SAS2008 controllers. All of them are running 11.1 and none of them shows these problems...
So I came across this same issue. It was causing my server to reboot several times a day due to kernel panics, and it happens with both SAS9200 and 9300 controllers. I have 8 x 10TB Seagate IronWolf NAS drives. I wanted to mention that for me there was a resolution: I added an Intel Optane 900p 280GB drive, set it up as a cache/L2ARC device, and the problem entirely disappeared. My server ran for 20 days before I rebooted it last night to perform an upgrade. So a workaround, I believe, is to add a cache drive to your ZFS pool. The Intel Optane 900p is a highly recommended cache drive for ZFS pools.
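For anyone who wants to try this workaround, attaching an L2ARC device is a single zpool operation. A minimal sketch; the pool name "tank" and the device name "nvd0" are placeholders for your own setup:

# add an L2ARC (cache) device to an existing pool
zpool add tank cache nvd0
# verify: the new device should show up under a "cache" section
zpool status tank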
Following what Daniel Shafer says, we have the same issues on a FreeNAS deployment we did. Everything is fine with our other pools that use drives smaller than 10TB, but one of our pools using 14TB drives exhibits this exact behavior. For us this is a major show-stopper bug since we can't use this pool reliably. Our vendor sent us a new HBA, a new server, etc. before I stumbled upon this bug listing.
Hi, We are seeing the same issue. Here is more information on our setup : FreeNAS-11.2-U5 FreeBSD 11.2-STABLE amd64 We use 2 x (6x 14TB seagate ironwolf drives ) We also have a 2 TB crucial SSD for L2ARC Issue always comes up after 10-14hours of heavy IO Disk Model : 14 TB Seagate ST14000VN0008 The drives are on 2 different LSI HBAs. Drive that fails are random on both those HBA. Please let us know if you need more information on this, it is impacting our production load. Thank you. Log output for our latest errors : > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 > 00 00 10 00 00 length 8192 SMID 60 Aborting command 0xfffffe000171f640 > mpr1: Sending reset from mprsas_send_abort for target ID 20 > (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 length 4096 SMID 332 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00 > length 4096 SMID 703 terminated ioc 804b loginfo 31130000 sc(da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 si 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 > 00 01 00 00 00 length 131072 SMID 510 terminated ioc > 804b(da30:mpr1:0:20:0): CAM status: CCB request completed with an > error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 e8 00 00 08 00 > loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 01 00 00 00 length 131072 SMID 938 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 01 00 00 00 length 131072 SMID 839 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 01 00 00 00 length 131072 SMID 681 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 01 00 00 00 length 131072 SMID 647 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 01 00 00 00 length 131072 SMID 253 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 01 00 00 00 length 131072 SMID 109 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 01 00 00 00 length 131072 SMID 267 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 00 10 00 00 length 8192 SMID 506 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 00 10 00 00 length 8192 SMID 774 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 > 00 00 00 length 0 SMID 281 terminated ioc 804b loginfo 31140000 scsi 0 > state c xfer 0 > mpr1: Unfreezing devq for target ID 20 > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 77 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). 
CDB: 88 00 00 00 00 01 05 1a 76 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 75 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 74 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 73 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 72 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 71 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 b8 00 00 > 01 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 a8 00 00 > 00 10 00 00 > (da30:mpr1:0:20:0): CAM status: Command timeout > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 98 00 00 > 00 10 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(16). CDB: 88 00 00 00 00 01 05 1a 70 88 00 00 > 00 10 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 > 00 00 00 > (da30:mpr1:0:20:0): CAM status: CCB request completed with an error > (da30:mpr1:0:20:0): Retrying command > (da30:mpr1:0:20:0): READ(10). CDB: 28 00 b7 81 49 f0 00 00 08 00 > (da30:mpr1:0:20:0): CAM status: SCSI Status Error > (da30:mpr1:0:20:0): SCSI status: Check Condition > (da30:mpr1:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, > reset, or bus device reset occurred) > (da30:mpr1:0:20:0): Retrying command (per sense data) > ctl_datamove: tag 0x855ffd44 on (2:3:106) aborted > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00 > length 8192 SMID 486 Aborting command 0xfffffe0001745aa0 > mpr1: Sending reset from mprsas_send_abort for target ID 22 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 length 4096 SMID 105 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 > length 81920 SMID 467 terminated ioc 804b loginfo 31130000 s(da32:mpr1:0:22:0): READ(10). CDB: 28 00 c1 ae 62 38 00 00 08 00 csi 0 state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 > length 131072 SMID 959 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 > length 131072 SMID 346 terminated ioc 804b loginfo 31130000 (da32:scsi > 0 state c xfer 0 > mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). 
CDB: 28 00 ca 8b 69 78 00 01 00 00 > length 131072 SMID 627 terminated ioc 804b loginfo 31130000 scsi 0 > state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6c 78 00 00 a0 00 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00 > length 131072 SMID 455 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00 > length 131072 SMID 951 terminated ioc 804b loginfo 31130000 (da32:scsi > 0 state c xfer 0 > mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00 > length 131072 SMID 822 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6b 78 00 01 00 00 scsi 0 state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00 > length 131072 SMID 155 terminated ioc 804b loginfo 31130000 (da32:mpr1:0:22:0): CAM status: CCB request completed with an error scsi 0 state c xfer 0 > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00 > length 8192 SMID 495 terminated ioc 804b loginfo 31130000 sc(da32:si 0 > state c xfer 0 > mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00 > length 8192 SMID 494 terminated ioc 804b loginfo 31130000 sc(da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 6a 78 00 01 00 00 si 0 state c xfer 0 > (da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 > 00 00 00 length 0 SMID 726 terminated ioc 804b loginfo > 3(da32:mpr1:0:22:0): CAM status: CCB request completed with an error > 1140000 scsi 0 state c xfer 0 > mpr1: Unfreezing devq for target ID 22 > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 69 78 00 01 00 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 68 78 00 01 00 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 67 78 00 01 00 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 66 78 00 01 00 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 78 00 01 00 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 68 00 00 10 00 > (da32:mpr1:0:22:0): CAM status: Command timeout > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 58 00 00 10 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). CDB: 28 00 ca 8b 65 48 00 00 10 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 > 00 00 00 > (da32:mpr1:0:22:0): CAM status: CCB request completed with an error > (da32:mpr1:0:22:0): Retrying command > (da32:mpr1:0:22:0): READ(10). 
CDB: 28 00 c1 ae 62 38 00 00 08 00 > (da32:mpr1:0:22:0): CAM status: SCSI Status Error > (da32:mpr1:0:22:0): SCSI status: Check Condition > (da32:mpr1:0:22:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, > reset, or bus device reset occurred) > (da32:mpr1:0:22:0): Retrying command (per sense data)
(In reply to Daniel Shafer from comment #10) I can confirm the workaround; no errors since I added an L2ARC to the pool.
We usually have a cache on the affected pools, and sadly we still have issues when the workload gets past the L2ARC and hits the disks (think long sequential reads).
I too have run into this issue on a nas box, once it started taking on any kind of load. Running 12.0-RELEASE p3 The server contains 8x Seagate Ironwolf Pro 10Tb SATA drives on an Avago 3008 HBA - 8 of these basically: da2 at mpr1 bus 0 scbus13 target 12 lun 0 da2: <ATA ST10000NE0004-1Z EN01> Fixed Direct Access SPC-4 SCSI device da2: Serial Number ZA237AVY da2: 1200.000MB/s transfers da2: Command Queueing enabled da2: 9537536MB (19532873728 512 byte sectors) Driver versions: dev.mpr.1.driver_version: 18.03.00.00-fbsd dev.mpr.1.firmware_version: 15.00.03.00 These drives are configured in a ZFS RAID10 setup (in case that datapoint matters): NAME STATE READ WRITE CKSUM data0 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 da2.eli ONLINE 0 0 0 da3.eli ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 da4.eli ONLINE 0 0 0 da5.eli ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 da6.eli ONLINE 0 0 0 da7.eli ONLINE 0 0 0 mirror-3 ONLINE 0 0 0 da8.eli ONLINE 0 0 0 da9.eli ONLINE 0 0 0 I currently get about 25 days between reboots. The machine hangs and (I'm guessing here) kernel panics and restarts - I don't have the panic information, but log messages look very similar to what other people are seeing: Jul 20 11:14:17 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Command timeout on target 12(0x000c), 60000 set, 60.703976195 elapsed Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name ( ) Jul 20 11:14:17 nas1a kernel: mpr1: Sending abort to target 12 for SMID 1484 Jul 20 11:14:17 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 length 24576 SMID 1484 Aborting command 0 xfffffe00bad0b540 Jul 20 11:14:17 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1792 Command ti meout on target 12(0x000c), 60000 set, 60.707504796 elapsed Jul 20 11:14:17 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name ( ) Jul 20 11:14:18 nas1a kernel: mpr1: Controller reported scsi ioc terminated tgt 12 SMID 1792 loginfo 31140000 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). CDB: 2a 00 62 81 f9 d0 00 00 30 00 Jul 20 11:14:18 nas1a kernel: mpr1: Abort failed for target 12, sending logical unit reset Jul 20 11:14:18 nas1a kernel: mpr1: (da2:mpr1:0:12:0): CAM status: CCB request aborted by the host Jul 20 11:14:18 nas1a kernel: Sending logical unit reset to target 12 lun 0 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 3 more tries remain Jul 20 11:14:18 nas1a kernel: mpr1: At enclosure level 0, slot 2, connector name ( ) Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CCB request completed with an error Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 0 more tries remain Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Error 5, Retries exhausted Jul 20 11:14:18 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 12 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): WRITE(10). 
CDB: 2a 00 62 81 f9 d0 00 00 30 00 Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): CAM status: CAM subsystem is busy Jul 20 11:14:18 nas1a kernel: (da2:mpr1:0:12:0): Retrying command, 2 more tries remain [reboot happens here] And the most recent one, today: Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Command tim eout on target 16(0x0010), 60000 set, 60.109683189 elapsed Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name ( ) Aug 13 08:58:55 nas1a kernel: mpr1: Sending abort to target 16 for SMID 998 Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 998 Aborting co mmand 0xfffffe00bacdfaa0 Aug 13 08:58:55 nas1a kernel: mpr1: Abort failed for target 16, sending logical unit reset Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Aug 13 08:58:55 nas1a kernel: mpr1: Sending logical unit reset to target 16 lun 0 Aug 13 08:58:55 nas1a kernel: mpr1: At enclosure level 0, slot 6, connector name ( ) Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CCB request aborted by the host Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 0 more tries remain Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16 Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Error 5, Retries exhausted Aug 13 08:58:55 nas1a kernel: mpr1: mprsas_action_scsiio: Freezing devq for target ID 16 Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): WRITE(10). CDB: 2a 00 92 b0 7d 70 00 00 48 00 Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): CAM status: CAM subsystem is busy Aug 13 08:58:55 nas1a kernel: (da6:mpr1:0:16:0): Retrying command, 3 more tries remain [reboot happens here] After the reboot, there's no problem and everything works fine. ZFS never marks the pool as degraded or unavailable. Looking at the FreeNAS threads this seems to have been going on for ages. Can anyone confirm that a downgrade to 11.1 does work around this issue as that seems to be the only thing that might help?
Hi Paul, I had the exact same symptoms as you, but with 4TB WD Reds, and for me with 11.1 everything has worked flawlessly for more than 130 days now.
(In reply to Francois Baillargeon from comment #14) We have cache and log devices:

NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
k-zb1 87.2T 35.5T 51.8T - - 20% 40% 1.91x ONLINE -
  raidz2 87.2T 35.5T 51.8T - - 20% 40%
    gpt/k-zb1-0 - - - - - - -
    gpt/k-zb1-1 - - - - - - -
    gpt/k-zb1-2 - - - - - - -
    gpt/k-zb1-3 - - - - - - -
    gpt/k-zb1-4 - - - - - - -
    gpt/k-zb1-5 - - - - - - -
    gpt/k-zb1-6 - - - - - - -
    gpt/k-zb1-7 - - - - - - -
log - - - - - -
  mirror 11.5G 0 11.5G - - 0% 0%
    gpt/k-zb1-zil0 - - - - - - -
    gpt/k-zb1-zil1 - - - - - - -
cache - - - - - -
  gpt/k-zb1-cache0 80.0G 29.9G 50.1G - - 0% 37%
  gpt/k-zb1-cache1 80.0G 30.0G 50.0G - - 0% 37%

The only solution for us was to use the onboard SATA ports instead of the LSI controller. We have kept the other machines that need the LSI controllers (not enough ports on the mainboard) at 11.1 :-(
(In reply to Bane Ivosev from comment #16) Thanks for the confirmation. I'm in the process of downgrading the affected NAS machine (they are an identical pair) from 12.0 to 11.1, I need to test this in the lab first but will report back once we have the production machine on the older release. I need to do this before it next crashes and reboots, of course!
An update for all. I downgraded our production servers from 12.0 to 11.1-RELEASE-p15 and for 5 weeks they have worked without any problems. Previously we saw a problem after 4 weeks, then another after 1 week - so whilst there is a chance I have not waited long enough, I think this definitely fixes my problem. Paul.
(In reply to Paul Thornton from comment #19) Thanks for sharing, Paul. I took a slightly different approach: I have been running stable for a few months now with a FreeBSD 12.0-RELEASE kernel but using the FreeBSD 11.2 LSI mps driver. 12.1 brings in many updates to the 12.0 mps driver; hopefully those address these problems. I will test once it's released. /Sharad
(In reply to freebsd from comment #20) I have been running 12.1-RELEASE on the same hardware for 10 days now. Everything is OK. I'll report back in about a month. Success, I hope.
I'm running FreeBSD 12.1 now. Yesterday we reattached the drives to the LSI controller. Problems showed up during the next backup :-( Kernel log: Dec 4 10:13:30 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3 Dec 4 10:13:30 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd Dec 4 10:13:30 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR> Dec 4 10:33:08 nyx kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0x8000-0x80ff mem 0xdf700000-0xdf703fff,0xdf680000-0xdf6bffff irq 32 at device 0.0 on pci3 Dec 4 10:33:08 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd Dec 4 10:33:08 nyx kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR> Dec 4 10:33:08 nyx kernel: da0 at mps0 bus 0 scbus0 target 2 lun 0 Dec 4 10:33:08 nyx kernel: da1 at mps0 bus 0 scbus0 target 3 lun 0 Dec 4 10:33:08 nyx kernel: da2 at mps0 bus 0 scbus0 target 4 lun 0 Dec 4 10:33:08 nyx kernel: da4 at mps0 bus 0 scbus0 target 6 lun 0 Dec 4 10:33:08 nyx kernel: da5 at mps0 bus 0 scbus0 target 7 lun 0 Dec 4 10:33:08 nyx kernel: da6 at mps0 bus 0 scbus0 target 8 lun 0 Dec 4 10:33:08 nyx kernel: da3 at mps0 bus 0 scbus0 target 5 lun 0 Dec 4 10:33:08 nyx kernel: da7 at mps0 bus 0 scbus0 target 9 lun 0 Dec 4 22:42:34 nyx kernel: (da4:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 45 37 b3 48 00 00 00 08 00 00 length 4096 SMID 1583 Command timeout on target 6(0x0010) 60000 set, 60.66782887 elapsed Dec 4 22:42:34 nyx kernel: mps0: Sending abort to target 6 for SMID 1583 Dec 4 22:42:34 nyx kernel: (da4:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 45 37 b3 48 00 00 00 08 00 00 length 4096 SMID 1583 Aborting command 0xfffffe00f9364f28 Dec 4 22:42:34 nyx kernel: (da7:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 03 60 06 f6 c8 00 00 00 18 00 00 length 12288 SMID 925 Command timeout on target 9(0x000d) 60000 set, 60.23115994 elapsed Dec 4 22:42:34 nyx kernel: mps0: Sending abort to target 9 for SMID 925 Dec 4 22:42:34 nyx kernel: (da7:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 03 60 06 f6 c8 00 00 00 18 00 00 length 12288 SMID 925 Aborting command 0xfffffe00f932daf8 Dec 4 22:42:34 nyx kernel: (da0:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1033 Command timeout on target 2(0x000c) 60000 set, 60.24094823 elapsed Dec 4 22:42:34 nyx kernel: mps0: Sending abort to target 2 for SMID 1033 Dec 4 22:42:34 nyx kernel: (da0:mps0:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1033 Aborting command 0xfffffe00f9336c18 Dec 4 22:42:34 nyx kernel: (da1:mps0:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1440 Command timeout on target 3(0x000b) 60000 set, 60.24535090 elapsed Dec 4 22:42:34 nyx kernel: mps0: Sending abort to target 3 for SMID 1440 Dec 4 22:42:34 nyx kernel: (da1:mps0:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 04 dc 52 aa d8 00 00 00 30 00 00 length 24576 SMID 1440 Aborting command 0xfffffe00f9358f00 Dec 4 22:42:34 nyx kernel: (da2:mps0:0:4:0): WRITE(10). CDB: 2a 00 24 36 61 98 00 00 18 00 length 12288 SMID 1472 Command timeout on target 4(0x000a) 60000 set, 60.24982318 elapsed Dec 4 22:42:34 nyx kernel: mps0: Sending abort to target 4 for SMID 1472 Dec 4 22:42:34 nyx kernel: (da2:mps0:0:4:0): WRITE(10). 
CDB: 2a 00 24 36 61 98 00 00 18 00 length 12288 SMID 1472 Aborting command 0xfffffe00f935ba00 Dec 4 22:42:34 nyx kernel: (da5:mps0:0:7:0): WRITE(10). CDB: 2a 00 24 36 61 10 00 00 a0 00 length 81920 SMID 507 Command timeout on target 7(0x000f) 60000 set, 60.25666047 elapsed Dec 4 22:42:34 nyx kernel: mps0: Sending abort to target 7 for SMID 507 Dec 4 22:42:34 nyx kernel: (da5:mps0:0:7:0): WRITE(10). CDB: 2a 00 24 36 61 10 00 00 a0 00 length 81920 SMID 507 Aborting command 0xfffffe00f930a948 Dec 4 22:42:40 nyx kernel: (xpt0:mps0:0:7:0): SMID 6 task mgmt 0xfffffe00f92e0810 timed out Dec 4 22:42:40 nyx kernel: mps0: Reinitializing controller Dec 4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 6 Dec 4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 9 Dec 4 22:42:40 nyx kernel: mps0: Unfreezing devq for target ID 2 ... lots more Just the reinitialization messages: Dec 4 22:42:40 nyx kernel: mps0: Reinitializing controller Dec 4 23:01:27 nyx kernel: mps0: Reinitializing controller Dec 4 23:40:55 nyx kernel: mps0: Reinitializing controller Dec 4 23:47:49 nyx kernel: mps0: Reinitializing controller Dec 4 23:58:02 nyx kernel: mps0: Reinitializing controller Dec 5 00:17:33 nyx kernel: mps0: Reinitializing controller Dec 5 00:20:18 nyx kernel: mps0: Reinitializing controller Dec 5 00:21:33 nyx kernel: mps0: Reinitializing controller Dec 5 00:24:30 nyx kernel: mps0: Reinitializing controller Dec 5 00:26:40 nyx kernel: mps0: Reinitializing controller Dec 5 00:29:30 nyx kernel: mps0: Reinitializing controller Dec 5 00:33:10 nyx kernel: mps0: Reinitializing controller Dec 5 00:34:56 nyx kernel: mps0: Reinitializing controller Dec 5 00:42:16 nyx kernel: mps0: Reinitializing controller Dec 5 00:46:34 nyx kernel: mps0: Reinitializing controller Dec 5 00:48:16 nyx kernel: mps0: Reinitializing controller Dec 5 00:58:18 nyx kernel: mps0: Reinitializing controller Dec 5 01:00:55 nyx kernel: mps0: Reinitializing controller Dec 5 01:11:38 nyx kernel: mps0: Reinitializing controller Dec 5 01:15:39 nyx kernel: mps0: Reinitializing controller Dec 5 01:22:00 nyx kernel: mps0: Reinitializing controller Dec 5 01:23:17 nyx kernel: mps0: Reinitializing controller Dec 5 01:35:25 nyx kernel: mps0: Reinitializing controller Dec 5 01:37:44 nyx kernel: mps0: Reinitializing controller Dec 5 01:44:59 nyx kernel: mps0: Reinitializing controller Dec 5 01:52:05 nyx kernel: mps0: Reinitializing controller Dec 5 01:54:33 nyx kernel: mps0: Reinitializing controller Dec 5 01:56:06 nyx kernel: mps0: Reinitializing controller Dec 5 02:11:14 nyx kernel: mps0: Reinitializing controller Dec 5 02:27:42 nyx kernel: mps0: Reinitializing controller Dec 5 02:44:13 nyx kernel: mps0: Reinitializing controller Dec 5 02:50:13 nyx kernel: mps0: Reinitializing controller Dec 5 02:56:23 nyx kernel: mps0: Reinitializing controller Dec 5 03:07:32 nyx kernel: mps0: Reinitializing controller Dec 5 03:09:16 nyx kernel: mps0: Reinitializing controller Dec 5 03:10:41 nyx kernel: mps0: Reinitializing controller Dec 5 03:23:06 nyx kernel: mps0: Reinitializing controller Dec 5 03:29:13 nyx kernel: mps0: Reinitializing controller Dec 5 03:37:18 nyx kernel: mps0: Reinitializing controller Dec 5 03:39:23 nyx kernel: mps0: Reinitializing controller Dec 5 03:45:42 nyx kernel: mps0: Reinitializing controller Dec 5 03:51:22 nyx kernel: mps0: Reinitializing controller Dec 5 03:58:32 nyx kernel: mps0: Reinitializing controller Dec 5 03:59:50 nyx kernel: mps0: Reinitializing controller Dec 5 04:01:04 nyx kernel: mps0: Reinitializing controller Dec 5 
04:02:25 nyx kernel: mps0: Reinitializing controller Dec 5 04:06:33 nyx kernel: mps0: Reinitializing controller Dec 5 04:08:54 nyx kernel: mps0: Reinitializing controller Dec 5 04:11:35 nyx kernel: mps0: Reinitializing controller Dec 5 04:13:41 nyx kernel: mps0: Reinitializing controller We are attaching the drives to the mainboard SATA-ports again. Unfortunately this is not an option for our other machines. So we have to stick with 11.1.
Really sad news ... We have been running for 18 days now and everything is great with 12.1 for us, but it is still too early to draw conclusions, and your experience is not promising.
(In reply to Bane Ivosev from comment #23) We just gave it another try. We had hoped that the problem might have been caused by a defective disk the last time we tried. The problem was triggered again during the next night's backup :-( This is forcing us to keep running 11.1 on our backup machines. Any (bad/good) news from your machine?
(In reply to Matthias Pfaller from comment #24) We are still fine. No problem at all.
I have experienced this issue multiple times after upgrading from 11.1 to 12.1. The server was rock solid before, and I didn't realize it was a driver issue. Since my pool has already been upgraded to the latest version, I cannot downgrade to 11.1. I am building an 11.1 mps.ko for the 12.1 kernel and will load it manually via loader.conf. I will report the result a few weeks later.
How does one change the affected version in the bug report? After all this is a >11.1 problem and not a 11.1-STABLE problem.
I finally tried 12.1 and it seemed to run fine for a few days, until I started a large file transfer and ...

(da4:mps0:0:11:0): READ(10). CDB: 28 00 7c fa 4a a0 00 00 08 00
(da4:mps0:0:11:0): CAM status: SCSI Status Error
(da4:mps0:0:11:0): SCSI status: Check Condition
(da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
(da4:mps0:0:11:0): Retrying command (per sense data)

I am currently running 12.1-p2 with the 11.3 release (r357156) mps driver. I just deleted the /usr/src/sys/dev/mps directory, copied it over from the 11.3 source, and compiled the kernel. /Sharad
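For anyone wanting to reproduce that driver swap, a rough sketch of the steps; the path to the 11.3 source tree and the GENERIC kernel config are placeholders/assumptions, not a tested recipe:

# keep the stock 12.1 driver around, then drop in the 11.3 one and rebuild
mv /usr/src/sys/dev/mps /usr/src/sys/dev/mps.12.1
cp -R /path/to/freebsd-11.3-src/sys/dev/mps /usr/src/sys/dev/mps
cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
# reboot into the new kernel afterwards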
We have had success with standard 11.1 and 12.1-RELEASE installations, with no compiling or mixing of driver versions. The problem was with 12.0 and 11.3-RELEASE.
Guys, I've just experienced these issues on the latest FreeNAS (it's based on FreeBSD 11.3-RELEASE) with an LSI3008 controller. The issues you all experience are related to the BIOS Revision and Firmware Revision versions shown by:

mprutil show all | less
...
Adapter: mpr0
Adapter:
Board Name: LSI3008-IT
Board Assembly:
Chip Name: LSISAS3008
Chip Revision: ALL
BIOS Revision: 17.00.00.00
Firmware Revision: 15.00.03.00
Integrated RAID: no
...

It's definitely not related to the types/models of disks used or the BSD version. Just make sure you are running the following firmware versions, as I already have these in production with FreeBSD 11.2-RELEASE:

Adapter: mpr0
Adapter:
Board Name: LSI3008-IT
Board Assembly:
Chip Name: LSISAS3008
Chip Revision: ALL
BIOS Revision: 17.00.00.00
Firmware Revision: 15.00.03.00
Integrated RAID: no
(In reply to Ștefan BĂLU from comment #30)

mpsutil show all
Adapter: mps0
Adapter:
Board Name: SAS9211-8i
Board Assembly:
Chip Name: LSISAS2008
Chip Revision: ALL
BIOS Revision: 7.37.00.00
Firmware Revision: 20.00.07.00
Integrated RAID: yes

uname -a
FreeBSD nyx 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC amd64

This machine is experiencing the problems. The machines without problems are running 11.1 and (thanks for the pointer):

Board Name: SAS9211-8i
Board Assembly: H3-25250-02J
Chip Name: LSISAS2008
Chip Revision: ALL
BIOS Revision: 7.37.00.00
Firmware Revision: 19.00.00.00
Integrated RAID: no

I'll try a downgrade to 19.00.00.00.

regards, Matthias
We have the problem with the 18.03 version of the mpr/mps driver. That's why I said the 12.1 and 11.1 releases are working for us.

12.1 -> #define MPR_DRIVER_VERSION "23.00.00.00-fbsd"
12.0 -> #define MPR_DRIVER_VERSION "18.03.00.00-fbsd"
11.3 -> #define MPR_DRIVER_VERSION "18.03.00.00-fbsd"
11.2 -> #define MPR_DRIVER_VERSION "18.03.00.00-fbsd"
11.1 -> #define MPR_DRIVER_VERSION "15.03.00.00-fbsd"
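For reference, the versions are easy to check on a given box. The sysctl names below are the ones quoted earlier in this thread (controller instance 0 assumed; substitute dev.mps.* for the SAS2xxx driver), and the grep just pulls the #define out of whatever source tree is installed under /usr/src:

# driver and firmware of the running system
sysctl dev.mpr.0.driver_version dev.mpr.0.firmware_version
# driver version shipped with the installed source tree
grep -r DRIVER_VERSION /usr/src/sys/dev/mpr /usr/src/sys/dev/mps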
(In reply to Bane Ivosev from comment #32) So, should we keep the firmware in sync with those versions, or go for a trial-and-error approach? In the meantime, I'll let you know how it goes with the latest FreeNAS (FreeBSD 11.3-RELEASE) and firmware version 15.00.03.00.

I see that my corrected comment didn't go through... So, I downgraded from:

BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.01.00

to:

BIOS Revision: 8.35.00.00
Firmware Revision: 15.00.03.00

The behaviour would trigger within a couple of days, so I'll keep you all posted.
(In reply to Ștefan BĂLU from comment #33) Before the first install (11.3) we updated the firmware and left it as is. So we didn't change the firmware later; we just tried mps/mpr driver versions other than 18.03. Now, with 12.1-RELEASE, we have:

mpr0: Firmware: 16.00.01.00, Driver: 23.00.00.00-fbsd
Looks like the problem persists with FreeNAS (FreeBSD 11.3-RELEASE) and Firmware Revision: 15.00.03.00.
Has anyone gotten further with this issue? Seems that it is affecting me every other hour now! And it all started after adding a WD DC Drive...
Any SAS-expanders between the SAS HBA and the (SATA) disks? ("mprutil show expanders") - and if so - do they (the expanders) have up to date firmware?
(In reply to Peter Eriksson from comment #37) Not in our case: [root@nyx ~]# mpsutil show all Adapter: mps0 Adapter: Board Name: SAS9211-8i Board Assembly: Chip Name: LSISAS2008 Chip Revision: ALL BIOS Revision: 0.00.00.00 Firmware Revision: 19.00.00.00 Integrated RAID: no PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 N 1.5 6.0 SAS Initiator 1 N 1.5 6.0 SAS Initiator 2 N 1.5 6.0 SAS Initiator 3 N 1.5 6.0 SAS Initiator 4 N 1.5 6.0 SAS Initiator 5 N 1.5 6.0 SAS Initiator 6 N 1.5 6.0 SAS Initiator 7 N 1.5 6.0 SAS Initiator Devices: B____T SAS Address Handle Parent Device Speed Enc Slot Wdt Enclosures: Slots Logical ID SEPHandle EncHandle Type 08 500605b001551a80 0001 Direct Attached SGPIO Expanders: NumPhys SAS Address DevHandle Parent EncHandle SAS Level We did downgrade to firmware 19.00.00.00, but I hadn't the time to run some tests using this firmware level.
I don't have an expanders (note that this card has IR firmware, in case that is relevant - I wasn't actually aware of that, it is only being used as a dumb HBA): [root@nas1b ~]# mprutil show all Adapter: mpr0 Adapter: Board Name: LSI3008-IR Board Assembly: Chip Name: LSISAS3008 Chip Revision: ALL BIOS Revision: 8.35.00.00 Firmware Revision: 15.00.03.00 Integrated RAID: yes PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 N 3.0 12 SAS Initiator 1 N 3.0 12 SAS Initiator 2 N 3.0 12 SAS Initiator 3 N 3.0 12 SAS Initiator 4 N 3.0 12 SAS Initiator 5 N 3.0 12 SAS Initiator 6 N 3.0 12 SAS Initiator 7 N 3.0 12 SAS Initiator Devices: B____T SAS Address Handle Parent Device Speed Enc Slot Wdt Enclosures: Slots Logical ID SEPHandle EncHandle Type 08 500304801cf54a05 0001 Direct Attached SGPIO Expanders: NumPhys SAS Address DevHandle Parent EncHandle SAS Level [root@nas1b ~]#
I've had some random problems when using the cards with IR firmware, so I always make sure they are running IT firmware. Might be worth testing. But that might not be relevant for your problems. We are using them without any issues (so far) but we're using SAS drives (10TB HGST) behind SAS Expanders (and a couple of Intel SSD SATA drives) - so nothing similar to your situation... # mprutil show all Adapter: mpr0 Adapter: Board Name: Dell HBA330 Mini Board Assembly: Chip Name: LSISAS3008 Chip Revision: ALL BIOS Revision: 18.00.00.00 Firmware Revision: 16.00.08.00 Integrated RAID: no
(In reply to Sharad Ahlawat from comment #28) Symptom: CAM retry and timeout messages leading to controller aborts and resets Cause: slow disks or using SMR disks Workaround: Increase the CAM timeout defaults ❯ sysctl kern.cam.da.default_timeout=90 kern.cam.da.default_timeout: 60 -> 90 ❯ sysctl kern.cam.ada.default_timeout=60 kern.cam.ada.default_timeout: 30 -> 60 And disable NCQ on SMR Seagates: ❯ cat cam-tags.sh #!/usr/local/bin/bash #shrinking the command Native Command Queue down to 1 effectively disabling queuing for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1`; do camcontrol tags $Disk -N 1 ; # camcontrol tags $Disk -v done If you only have SMRs in your setup and use an UPS you could also: ❯ sysctl vfs.zfs.cache_flush_disable=1 Solution: don't use slow disks and SMRs disks with ZFS The long version: I am obliged to post this update given the driver downgrade workaround I previously posted on this thread before getting to the root cause for these logs in my messages file after upgrading to 12.x Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0 Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817 Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00 Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080) Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0 Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917 Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 more tries remain Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1039 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1357 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1933 loginfo 31080000 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c8 00 00 00 b0 00 00 Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 38 78 00 00 00 58 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). 
CDB: 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 SMID 621 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 476 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 321 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1873 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1852 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1742 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 387 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 2104 loginfo 31080000 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a2 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a3 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a4 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a5 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a6 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a7 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a8 d8 00 00 00 c0 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c0 00 00 00 08 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). 
CDB: 88 00 00 00 00 01 45 fb 36 c0 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected) Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data) Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0 Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617 Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 80 44 e8 40 c9 01 00 00 00 00 Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080) ❯ smartctl -a /dev/da4 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 178 000 Old_age Always - 419 Note: Every controller ABORTED COMMAND increases the ID# 199 count Messages are correlated with sustained disk operations events: scrubs large file transfers cron jobs kicking off on multiple jails accessing the same pool, etc. Ruled out any issues with: loose cable cable quality controller hardware, bios and firmware power supply ❯ mpsutil show all Adapter: mps0 Adapter: Board Name: SAS9207-4i4e Board Assembly: H3-25434-00H Chip Name: LSISAS2308 Chip Revision: ALL BIOS Revision: 7.39.02.00 Firmware Revision: 20.00.07.00 Integrated RAID: no PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 0001 0009 N 6.0 1.5 6.0 SAS Initiator 1 0002 000a N 6.0 1.5 6.0 SAS Initiator 2 0003 000b N 6.0 1.5 6.0 SAS Initiator 3 0004 000c N 6.0 1.5 6.0 SAS Initiator 4 0005 000d N 6.0 1.5 6.0 SAS Initiator 5 0006 000e N 6.0 1.5 6.0 SAS Initiator 6 0007 000f N 6.0 1.5 6.0 SAS Initiator 7 0008 0010 N 6.0 1.5 6.0 SAS Initiator Devices: B____T SAS Address Handle Parent Device Speed Enc Slot Wdt 00 10 4433221100000000 0009 0001 SATA Target 6.0 0001 00 1 00 09 4433221101000000 000a 0002 SATA Target 6.0 0001 01 1 00 11 4433221102000000 000b 0003 SATA Target 6.0 0001 02 1 00 07 4433221103000000 000c 0004 SATA Target 6.0 0001 03 1 00 13 4433221104000000 000d 0005 SATA Target 6.0 0001 07 1 00 04 4433221105000000 000e 0006 SATA Target 6.0 0001 06 1 00 14 4433221106000000 000f 0007 SATA Target 6.0 0001 05 1 00 12 4433221107000000 0010 0008 SATA Target 6.0 0001 04 1 Enclosures: Slots Logical ID SEPHandle EncHandle Type 08 500605b0073400a0 0001 Direct Attached SGPIO Expanders: NumPhys SAS Address DevHandle Parent EncHandle SAS Level ❯ camcontrol devlist <Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus0 target 0 lun 0 (ada0,pass0) <Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus1 target 0 lun 0 (ada1,pass1) <Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus2 target 0 lun 0 (ada2,pass2) <Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus3 target 0 lun 0 (ada3,pass3) <ST8000DM004-2CX188 0001> at scbus4 target 0 lun 0 (ada4,pass4) <ST8000DM004-2CX188 0001> at scbus5 target 0 lun 0 (ada5,pass5) <ST8000DM004-2CX188 0001> at scbus6 target 0 lun 0 (ada6,pass6) <ST8000DM004-2CX188 0001> at scbus7 target 0 lun 0 (ada7,pass7) <WDC 
WD20EFRX-68EUZN0 80.00A80> at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80> at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05> at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05> at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02> at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06> at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44> at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 14 lun 0 (pass17,da7)

Notice: da4 has a different firmware revision, which is most probably the reason that I see relatively fewer timeout errors on da1, da2 and da3, the other disks in the same ZFS pool.

To understand these error messages:
https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB

❯ cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition

https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition

❯ cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)

https://en.wikipedia.org/wiki/Key_Code_Qualifier
B 47 00 Aborted Command - SCSI parity error
https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h DZTPROMAEBKVF INFORMATION UNIT iuCRC ERROR DETECTED

CDB layout for these commands:
- Byte 0-1 : other
- Byte 2-9 : 8 bytes of logical block address
- Byte 10-13 : 4 bytes of transfer length (in blocks)
- Byte 14-15 : other

So in my logs we are transferring 0x100 = 256 blocks (128 KiB with 512-byte sectors) and throwing errors while writing to consecutive blocks. The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub, which I associated with the abort and controller reset events.

Summary: the 12.x FreeBSD and mps driver code is more efficient than 11.x, pushing more data back and forth to the disks, and that has led to these timeout events. I also experimented and found that reducing the memory clock speed in the BIOS is another workaround that stops these timeout errors with 12.x. Basically that is what my earlier posted workaround of using the 11.x mps driver with 12.x was effectively doing: slowing down the data transfer rates so the disks could keep up.

Disclaimer: These observations, analysis and workarounds are specific to my setup; please adapt and incorporate accordingly.
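One practical note on the timeout/NCQ workaround above: the sysctls and the camcontrol setting are lost on reboot. A minimal sketch of making them persistent with stock FreeBSD mechanisms, reusing the same values and the ST8000DM disk pattern from the script above (adjust both to your own setup):

# /etc/sysctl.conf -- applied automatically at boot
kern.cam.da.default_timeout=90
kern.cam.ada.default_timeout=60

# /etc/rc.local -- re-apply the queue depth of 1 on every boot
for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1`; do
    camcontrol tags $Disk -N 1
done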
We followed the advice of Christop Bubel, disabled NCQ and set the timeouts to 90s (I can't imagine a situation where this should be necessary, but still...). Results: May 8 17:36:00 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting May 8 17:36:00 nyx kernel: mps0: Reinitializing controller May 8 17:36:00 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd May 8 17:36:00 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> May 8 17:50:44 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting May 8 17:50:44 nyx kernel: mps0: Reinitializing controller May 8 17:50:44 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd May 8 17:50:44 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> May 8 18:10:06 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting May 8 18:10:06 nyx kernel: mps0: Reinitializing controller May 8 18:10:06 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd May 8 18:10:06 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> May 8 18:27:27 nyx kernel: mps0: IOC Fault 0x40000d04, Resetting May 8 18:27:27 nyx kernel: mps0: Reinitializing controller May 8 18:27:27 nyx kernel: mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd We upgraded the controller firmware to 20.00.07.00 and tried again: [root@nyx ~]# mpsutil show all Adapter: mps0 Adapter: Board Name: SAS9211-8i Board Assembly: Chip Name: LSISAS2008 Chip Revision: ALL BIOS Revision: 0.00.00.00 Firmware Revision: 20.00.07.00 Integrated RAID: no PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 0001 0009 N 6.0 1.5 6.0 SAS Initiator 1 0002 000a N 6.0 1.5 6.0 SAS Initiator 2 0003 000b N 6.0 1.5 6.0 SAS Initiator 3 0004 000c N 6.0 1.5 6.0 SAS Initiator 4 0005 000d N 6.0 1.5 6.0 SAS Initiator 5 0006 000e N 6.0 1.5 6.0 SAS Initiator 6 0007 000f N 6.0 1.5 6.0 SAS Initiator 7 0008 0010 N 6.0 1.5 6.0 SAS Initiator Devices: B____T SAS Address Handle Parent Device Speed Enc Slot Wdt 00 01 4433221100000000 0009 0001 SATA Target 6.0 0001 03 1 00 04 4433221101000000 000a 0002 SATA Target 6.0 0001 02 1 00 03 4433221102000000 000b 0003 SATA Target 6.0 0001 01 1 00 00 4433221103000000 000c 0004 SATA Target 6.0 0001 00 1 00 02 4433221104000000 000d 0005 SATA Target 6.0 0001 07 1 00 05 4433221105000000 000e 0006 SATA Target 6.0 0001 06 1 00 06 4433221106000000 000f 0007 SATA Target 6.0 0001 05 1 00 07 4433221107000000 0010 0008 SATA Target 6.0 0001 04 1 Enclosures: Slots Logical ID SEPHandle EncHandle Type 08 500605b001551a80 0001 Direct Attached SGPIO Expanders: NumPhys SAS Address DevHandle Parent EncHandle SAS Level [root@nyx ~]# for i in $(camcontrol devlist | grep "ST12000" | cut -d"," -f2 | cut -d")" -f1); do > camcontrol tags $i > done (pass0:mps0:0:0:0): device openings: 1 (pass1:mps0:0:1:0): device openings: 1 (pass2:mps0:0:2:0): device openings: 1 (pass3:mps0:0:3:0): device openings: 1 (pass4:mps0:0:4:0): device openings: 1 (pass5:mps0:0:5:0): device openings: 1 (pass6:mps0:0:6:0): device openings: 1 (pass7:mps0:0:7:0): device openings: 1 [root@nyx ~]# Results: May 11 15:58:58 nyx kernel: (da7:mps0:0:7:0): READ(10). CDB: 28 00 ef 46 55 38 00 00 10 00 length 8192 SMID 64 Command timeout on target 7(0x0010) 90000 set, 90.154685585 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 7 for SMID 64 May 11 15:58:58 nyx kernel: (da7:mps0:0:7:0): READ(10). 
CDB: 28 00 ef 46 55 38 00 00 10 00 length 8192 SMID 64 Aborting command 0xfffffe00f92e5600 May 11 15:58:58 nyx kernel: (da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 a0 00 00 00 20 00 00 length 16384 SMID 1564 Command timeout on target 2(0x000d) 90000 set, 90.120064627 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 2 for SMID 1564 May 11 15:58:58 nyx kernel: (da2:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 a0 00 00 00 20 00 00 length 16384 SMID 1564 Aborting command 0xfffffe00f93635a0 May 11 15:58:58 nyx kernel: (da6:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 04 e8 09 8e 30 00 00 00 30 00 00 length 24576 SMID 432 Command timeout on target 6(0x000f) 90000 set, 90.97377933 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 6 for SMID 432 May 11 15:58:58 nyx kernel: (da6:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 04 e8 09 8e 30 00 00 00 30 00 00 length 24576 SMID 432 Aborting command 0xfffffe00f9304480 May 11 15:58:58 nyx kernel: (da5:mps0:0:5:0): READ(10). CDB: 28 00 ef 46 47 a0 00 01 00 00 length 131072 SMID 1407 Command timeout on target 5(0x000e) 90000 set, 90.108646173 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 5 for SMID 1407 May 11 15:58:58 nyx kernel: (da5:mps0:0:5:0): READ(10). CDB: 28 00 ef 46 47 a0 00 01 00 00 length 131072 SMID 1407 Aborting command 0xfffffe00f93562a8 May 11 15:58:58 nyx kernel: (da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 04 e0 21 4e 68 00 00 00 10 00 00 length 8192 SMID 2034 Command timeout on target 4(0x000a) 90000 set, 90.67287327 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 4 for SMID 2034 May 11 15:58:58 nyx kernel: (da4:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 04 e0 21 4e 68 00 00 00 10 00 00 length 8192 SMID 2034 Aborting command 0xfffffe00f938ad30 May 11 15:58:58 nyx kernel: (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 04 e2 48 50 10 00 00 00 e8 00 00 length 118784 SMID 1470 Command timeout on target 0(0x000c) 90000 set, 90.51090608 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 0 for SMID 1470 May 11 15:58:58 nyx kernel: (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 04 e2 48 50 10 00 00 00 e8 00 00 length 118784 SMID 1470 Aborting command 0xfffffe00f935b750 May 11 15:58:58 nyx kernel: (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 b0 00 00 00 28 00 00 length 20480 SMID 1110 Command timeout on target 1(0x0009) 90000 set, 90.57964893 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 1 for SMID 1110 May 11 15:58:58 nyx kernel: (da1:mps0:0:1:0): READ(16). CDB: 88 00 00 00 00 04 e9 23 92 b0 00 00 00 28 00 00 length 20480 SMID 1110 Aborting command 0xfffffe00f933d390 May 11 15:58:58 nyx kernel: (da3:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 03 d9 ab c8 40 00 00 00 08 00 00 length 4096 SMID 1125 Command timeout on target 3(0x000b) 90000 set, 90.59928632 elapsed May 11 15:58:58 nyx kernel: mps0: Sending abort to target 3 for SMID 1125 May 11 15:58:58 nyx kernel: (da3:mps0:0:3:0): READ(16). 
CDB: 88 00 00 00 00 03 d9 ab c8 40 00 00 00 08 00 00 length 4096 SMID 1125 Aborting command 0xfffffe00f933e7b8 May 11 15:59:05 nyx kernel: (xpt0:mps0:0:3:0): SMID 8 task mgmt 0xfffffe00f92e0ac0 timed out May 11 15:59:05 nyx kernel: mps0: Reinitializing controller May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 7 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 2 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 6 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 5 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 4 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 0 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 1 May 11 15:59:05 nyx kernel: mps0: Unfreezing devq for target ID 3 May 11 15:59:05 nyx kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd May 11 15:59:05 nyx kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> May 11 16:32:23 nyx kernel: (da2:mps0:0:2:0): READ(10). CDB: 28 00 0e 77 4d 58 00 00 28 00 length 20480 SMID 111 Command timeout on target 2(0x000c) 90000 set, 90.363886166 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 2 for SMID 111 May 11 16:32:23 nyx kernel: (da2:mps0:0:2:0): READ(10). CDB: 28 00 0e 77 4d 58 00 00 28 00 length 20480 SMID 111 Aborting command 0xfffffe00f92e9528 May 11 16:32:23 nyx kernel: (da4:mps0:0:4:0): READ(10). CDB: 28 00 0e 77 25 58 00 00 08 00 length 4096 SMID 1374 Command timeout on target 4(0x000a) 90000 set, 90.365649439 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 4 for SMID 1374 May 11 16:32:23 nyx kernel: (da4:mps0:0:4:0): READ(10). CDB: 28 00 0e 77 25 58 00 00 08 00 length 4096 SMID 1374 Aborting command 0xfffffe00f9353650 May 11 16:32:23 nyx kernel: (da7:mps0:0:7:0): READ(10). CDB: 28 00 0e 77 16 d8 00 00 10 00 length 8192 SMID 139 Command timeout on target 7(0x0010) 90000 set, 90.368426169 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 7 for SMID 139 May 11 16:32:23 nyx kernel: (da7:mps0:0:7:0): READ(10). CDB: 28 00 0e 77 16 d8 00 00 10 00 length 8192 SMID 139 Aborting command 0xfffffe00f92ebac8 May 11 16:32:23 nyx kernel: (da6:mps0:0:6:0): READ(10). CDB: 28 00 0e 77 97 18 00 00 28 00 length 20480 SMID 731 Command timeout on target 6(0x000f) 90000 set, 90.353716039 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 6 for SMID 731 May 11 16:32:23 nyx kernel: (da6:mps0:0:6:0): READ(10). CDB: 28 00 0e 77 97 18 00 00 28 00 length 20480 SMID 731 Aborting command 0xfffffe00f931d648 May 11 16:32:23 nyx kernel: (da3:mps0:0:3:0): READ(10). CDB: 28 00 0e 77 7e 58 00 00 18 00 length 12288 SMID 332 Command timeout on target 3(0x000b) 90000 set, 90.354107428 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 3 for SMID 332 May 11 16:32:23 nyx kernel: (da3:mps0:0:3:0): READ(10). CDB: 28 00 0e 77 7e 58 00 00 18 00 length 12288 SMID 332 Aborting command 0xfffffe00f92fbe20 May 11 16:32:23 nyx kernel: (da5:mps0:0:5:0): READ(10). CDB: 28 00 0e 77 6e e8 00 00 b8 00 length 94208 SMID 1491 Command timeout on target 5(0x000e) 90000 set, 90.354717928 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 5 for SMID 1491 May 11 16:32:23 nyx kernel: (da5:mps0:0:5:0): READ(10). CDB: 28 00 0e 77 6e e8 00 00 b8 00 length 94208 SMID 1491 Aborting command 0xfffffe00f935d388 May 11 16:32:23 nyx kernel: (da1:mps0:0:1:0): READ(10). 
CDB: 28 00 0e 77 4f d0 00 01 00 00 length 131072 SMID 1330 Command timeout on target 1(0x0009) 90000 set, 90.357039861 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 1 for SMID 1330 May 11 16:32:23 nyx kernel: (da1:mps0:0:1:0): READ(10). CDB: 28 00 0e 77 4f d0 00 01 00 00 length 131072 SMID 1330 Aborting command 0xfffffe00f934fb30 May 11 16:32:23 nyx kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 00 0e 77 6b e8 00 00 50 00 length 40960 SMID 692 Command timeout on target 0(0x000d) 90000 set, 90.359445705 elapsed May 11 16:32:23 nyx kernel: mps0: Sending abort to target 0 for SMID 692 May 11 16:32:23 nyx kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 00 0e 77 6b e8 00 00 50 00 length 40960 SMID 692 Aborting command 0xfffffe00f931a1e0 May 11 16:32:29 nyx kernel: (xpt0:mps0:0:4:0): SMID 10 task mgmt 0xfffffe00f92e0d70 timed out May 11 16:32:29 nyx kernel: mps0: Reinitializing controller So at least for our setup this didn't help :-( Matthias PS: We are using ST12000VN0007 drives. These should be PMR, not SMR, drives. Please note that these drives do not cause any problems when used with ahci or isci (maxopenings is 32 in that case).
(In reply to Matthias Pfaller from comment #42) Sorry, I messed up the names... This should have been Sharad Ahlawat and not Christoph Bubel.
(In reply to Matthias Pfaller from comment #43) The logs show the read events are still timing out, even with the 90s timeout. Also, there are no retry messages; is "sysctl kern.cam.da.retry_count" set to 0? You could try a few things to get to the root cause (example commands are sketched below): a: also disable the ZFS cache flush, sysctl vfs.zfs.cache_flush_disable=1, even though your drives are not SMRs; b: experiment with larger timeout values, and also observe "gstat" output to ensure the first column L(q) is continually returning to zero and not getting stuck for any of the drives; c: try reducing the negotiated link speed to 3.0Gbps in the controller settings, just to eliminate a disk firmware speed compatibility issue. Side note: not sure if this applies to your drives, but a few of mine don't support NCQ TRIM and are not properly driver blacklisted, so I had to set vfs.unmapped_buf_allowed=0 in loader.conf
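To make suggestions a and b concrete, roughly the commands involved (a sketch only; the sysctl names are the stock FreeBSD ones, so double-check them with sysctl -d on your release, and note that disabling cache flushes trades away some power-loss safety, so only do it while testing):
sysctl vfs.zfs.cache_flush_disable=1    # a: stop ZFS issuing SYNCHRONIZE CACHE to the disks
sysctl kern.cam.da.default_timeout=90   # b: timeout (in seconds) used for subsequent da commands
sysctl kern.cam.da.retry_count=4        # keep retries enabled so a timeout is retried, not failed
gstat -p                                # watch the L(q) column; it should keep returning to 0 for every drive
The link-speed change in c is made from the controller BIOS or the vendor utilities rather than from FreeBSD.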
(In reply to Sharad Ahlawat from comment #44) [root@nyx ~]# sysctl kern.cam.da.retry_count kern.cam.da.retry_count: 4 Previously we had the drives connected to the onboard SATA/SAS ports (hoping the mps problems would get solved). The machine was running for month without showing any disk related problems. At the moment we are looking for a replacement for the SAS2008 (for other machines without that many onboard SATA/SAS channels) and are testing the following config: ahci0: <Marvell 88SE9215 AHCI SATA controller> port 0x3050-0x3057,0x3040-0x3043,0x3030-0x3037,0x3020-0x3023,0x3000-0x301f mem 0xdf940000-0xdf9407ff irq 32 at device 0.0 on pci5 ahci0: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS ahcich0: <AHCI channel> at channel 0 on ahci0 ahcich1: <AHCI channel> at channel 1 on ahci0 ahcich2: <AHCI channel> at channel 2 on ahci0 ahcich3: <AHCI channel> at channel 3 on ahci0 ahci1: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS ahcich4: <AHCI channel> at channel 0 on ahci1 ahcich5: <AHCI channel> at channel 1 on ahci1 ahcich6: <AHCI channel> at channel 2 on ahci1 ahcich7: <AHCI channel> at channel 3 on ahci1 ada0 at ahcich0 bus 0 scbus0 target 0 lun 0 ada0: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada0: Serial Number ZJV3ZYSX ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled ada0: 11444224MB (23437770752 512 byte sectors) ada1 at ahcich1 bus 0 scbus1 target 0 lun 0 ada1: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada1: Serial Number ZJV1VVWX ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada1: Command Queueing enabled ada1: 11444224MB (23437770752 512 byte sectors) ada2 at ahcich2 bus 0 scbus2 target 0 lun 0 ada2: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada2: Serial Number ZJV1WLXM ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada2: Command Queueing enabled ada2: 11444224MB (23437770752 512 byte sectors) ada3 at ahcich3 bus 0 scbus3 target 0 lun 0 ada3: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada3: Serial Number ZJV2YY9A ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada3: Command Queueing enabled ada3: 11444224MB (23437770752 512 byte sectors) ada4 at ahcich4 bus 0 scbus4 target 0 lun 0 ada4: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada4: Serial Number ZJV3ZJNA ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada4: Command Queueing enabled ada4: 11444224MB (23437770752 512 byte sectors) ada5 at ahcich5 bus 0 scbus5 target 0 lun 0 ada5: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada5: Serial Number ZJV3ZXN5 ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada5: Command Queueing enabled ada5: 11444224MB (23437770752 512 byte sectors) ada6 at ahcich6 bus 0 scbus6 target 0 lun 0 ada6: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada6: Serial Number ZJV2MWZ0 ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada6: Command Queueing enabled ada6: 11444224MB (23437770752 512 byte sectors) ada7 at ahcich7 bus 0 scbus7 target 0 lun 0 ada7: <ST12000VN0007-2GS116 SC60> ACS-3 ATA SATA 3.x device ada7: Serial Number ZCH0HHJ6 ada7: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada7: Command Queueing enabled ada7: 11444224MB (23437770752 512 byte sectors) ada8: <SAMSUNG MZ7KM240HMHQ-00005 GXM5304Q> ACS-2 ATA SATA 3.x device ada8: Serial Number S3F5NX0M400493 ada8: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada8: Command Queueing enabled ada8: 
228936MB (468862128 512 byte sectors) ada8: quirks=0x1<4K> ada9: <SAMSUNG MZ7KM240HMHQ-00005 GXM5304Q> ACS-2 ATA SATA 3.x device ada9: Serial Number S3F5NX0M400461 ada9: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada9: Command Queueing enabled ada9: 228936MB (468862128 512 byte sectors) ada9: quirks=0x1<4K> I reverted the "camcontrol tags <DEV> -N 1" and the changed sysctls. So with this setup the system runs with the same drives and without any changed sysctl settings or cam settings. Up to now there are no problems showing up. Operating conditions are the same as with the last mps test (a zpool scrub is running and our systems did a backup to the machine last night). As the drives behave well with at least three different controllers ahci0: <Intel Patsburg AHCI SATA controller> ahci1: <Marvell 88SE9215 AHCI SATA controller> isci0: <Intel(R) C600 Series Chipset SAS Controller (SATA mode)> I can't imagine that changing parameters on the cam/zfs side (NCQ TRIM, NCQ, timeouts) will help in our case. Probably we will have to decommission the LSI9211 boards :-( regards, Matthias
So I have been getting the same errors in 12.1 and now 12.2 <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 8 lun 0 (pass0,da0) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 9 lun 0 (pass1,da1) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 10 lun 0 (pass2,da2) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 11 lun 0 (pass3,da3) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 12 lun 0 (pass4,da4) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 13 lun 0 (pass5,da5) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 14 lun 0 (pass6,da6) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 15 lun 0 (pass7,da7) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 16 lun 0 (pass8,da8) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 17 lun 0 (pass9,da9) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 18 lun 0 (pass10,da10) <IBM-XIV ST2000NM0001 A2 BC42> at scbus0 target 19 lun 0 (pass11,da11) <LSI CORP SAS2X36 0717> at scbus0 target 20 lun 0 (ses0,pass12) <LSI CORP SAS2X28 0717> at scbus0 target 21 lun 0 (ses1,pass13) <ATA WDC WD20EFRX-68A 0A80> at scbus0 target 22 lun 0 (pass14,da12) <ATA WDC WD20EFRX-68A 0A80> at scbus0 target 23 lun 0 (pass15,da13) <ATA WDC WD20EFRX-68A 0A80> at scbus0 target 24 lun 0 (pass16,da14) Adapter: mps0 Adapter: Board Name: SAS9211-8i Board Assembly: H3-25250-02B Chip Name: LSISAS2008 Chip Revision: ALL: BIOS Revision: 7.39.00.00 Firmware Revision: 20.00.06.00 Integrated RAID: no PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 N 1.5 6.0 SAS Initiator 1 N 1.5 6.0 SAS Initiator 2 N 1.5 6.0 SAS Initiator 3 N 1.5 6.0 SAS Initiator 4 0001 0009 N 6.0 1.5 6.0 SAS Initiator 5 0001 0009 N 6.0 1.5 6.0 SAS Initiator 6 0001 0009 N 6.0 1.5 6.0 SAS Initiator 7 0001 0009 N 6.0 1.5 6.0 SAS Initiator hawthorn kernel log messages: + (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 1732 Command timeout on target 24(0x001a) 60000 set, 60.96681849 elapsed +mps0: Sending abort to target 24 for SMID 1732 + (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 1732 Aborting command 0xfffffe00daa55760 + (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 30 00 length 24576 SMID 1089 Command timeout on target 12(0x000f) 60000 set, 60.97158463 elapsed +mps0: Sending abort to target 12 for SMID 1089 + (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 30 00 length 24576 SMID 1089 Aborting command 0xfffffe00daa1f758 + (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 49 79 f9 f8 00 00 08 00 length 4096 SMID 1230 Command timeout on target 24(0x001a) 60000 set, 60.97561590 elapsed + (da13:mps0:0:23:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 78 Command timeout on target 23(0x0019) 60000 set, 60.97777852 elapsed +mps0: Sending abort to target 23 for SMID 78 + (da13:mps0:0:23:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 18 00 length 12288 SMID 78 Aborting command 0xfffffe00da9ca8d0 + (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 48 00 00 10 00 length 8192 SMID 217 Command timeout on target 13(0x0010) 60000 set, 60.98193577 elapsed +mps0: Sending abort to target 13 for SMID 217 + (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 48 00 00 10 00 length 8192 SMID 217 Aborting command 0xfffffe00da9d6398 + (da8:mps0:0:16:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 670 Command timeout on target 16(0x0013) 60000 set, 60.98762575 elapsed +mps0: Sending abort to target 16 for SMID 670 + (da8:mps0:0:16:0): WRITE(10). 
CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 670 Aborting command 0xfffffe00da9fc450 + (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 91 Command timeout on target 10(0x000d) 60000 set, 60.99148905 elapsed +mps0: Sending abort to target 10 for SMID 91 + (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 18 00 length 12288 SMID 91 Aborting command 0xfffffe00da9cba48 + (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 1b 8d ae b0 00 00 08 00 length 4096 SMID 1064 Command timeout on target 10(0x000d) 60000 set, 60.99537335 elapsed + (da9:mps0:0:17:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 982 Command timeout on target 17(0x0014) 60000 set, 60.99738899 elapsed +mps0: Sending abort to target 17 for SMID 982 + (da9:mps0:0:17:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 982 Aborting command 0xfffffe00daa16790 + (da12:mps0:0:22:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 1677 Command timeout on target 22(0x0018) 60000 set, 60.100129429 elapsed +mps0: Sending abort to target 22 for SMID 1677 + (da12:mps0:0:22:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 10 00 length 8192 SMID 1677 Aborting command 0xfffffe00daa50d78 + (da10:mps0:0:18:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1723 Command timeout on target 18(0x0015) 60000 set, 60.100597645 elapsed +mps0: Sending abort to target 18 for SMID 1723 + (da10:mps0:0:18:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1723 Aborting command 0xfffffe00daa54b48 + (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 08 00 length 4096 SMID 944 Command timeout on target 9(0x000c) 60000 set, 60.101000772 elapsed +mps0: Sending abort to target 9 for SMID 944 + (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 3b c2 f3 f0 00 00 08 00 length 4096 SMID 944 Aborting command 0xfffffe00daa13480 + (da11:mps0:0:19:0): WRITE(10). CDB: 2a 00 49 79 f9 f8 00 00 08 00 length 4096 SMID 391 Command timeout on target 19(0x0016) 60000 set, 60.101567670 elapsed +mps0: Sending abort to target 19 for SMID 391 + (da11:mps0:0:19:0): WRITE(10). CDB: 2a 00 49 79 f9 f8 00 00 08 00 length 4096 SMID 391 Aborting command 0xfffffe00da9e4d68 + (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 53 42 14 38 00 00 10 00 length 8192 SMID 1326 Command timeout on target 15(0x0012) 60000 set, 60.101970797 elapsed +mps0: Sending abort to target 15 for SMID 1326 + (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 53 42 14 38 00 00 10 00 length 8192 SMID 1326 Aborting command 0xfffffe00daa335d0 + (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 08 00 length 4096 SMID 456 Command timeout on target 11(0x000e) 60000 set, 60.105479267 elapsed +mps0: Sending abort to target 11 for SMID 456 + (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 3b c2 f3 d8 00 00 08 00 length 4096 SMID 456 Aborting command 0xfffffe00da9ea4c0 + (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1882 Command timeout on target 14(0x0011) 60000 set, 60.106592066 elapsed +mps0: Sending abort to target 14 for SMID 1882 (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 1b 8d ae a8 00 00 08 00 length 4096 SMID 1882 Aborting command 0xfffffe00daa620f0 + (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 53 42 14 20 00 00 48 00 length 36864 SMID 69 Command timeout on target 9(0x000c) 60000 set, 60.69088610 elapsed + (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 20 00 00 08 00 length 4096 SMID 912 Command timeout on target 10(0x000d) 60000 set, 60.69428749 elapsed + (da6:mps0:0:14:0): WRITE(10). 
CDB: 2a 00 53 42 14 28 00 00 48 00 length 36864 SMID 440 Command timeout on target 14(0x0011) 60000 set, 60.349451194 elapsed + (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 58 00 00 18 00 length 12288 SMID 1580 Command timeout on target 13(0x0010) 60000 set, 60.349818628 elapsed + (da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1052 Command timeout on target 14(0x0011) 60000 set, 60.350062184 elapsed + (da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1375 Command timeout on target 13(0x0010) 60000 set, 60.350263749 elapsed + (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1455 Command timeout on target 12(0x000f) 60000 set, 60.350454814 elapsed + (da4:mps0:0:12:0): WRITE(10). CDB: 2a 00 53 42 14 40 00 00 30 00 length 24576 SMID 775 Command timeout on target 12(0x000f) 60000 set, 60.350652179 elapsed + (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 931 Command timeout on target 11(0x000e) 60000 set, 60.350866340 elapsed + (da2:mps0:0:10:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 1319 Command timeout on target 10(0x000d) 60000 set, 60.351090999 elapsed + (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 53 42 14 70 00 00 08 00 length 4096 SMID 728 Command timeout on target 9(0x000c) 60000 set, 60.351292564 elapsed + (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 53 42 14 28 00 00 48 00 length 36864 SMID 843 Command timeout on target 11(0x000e) 60000 set, 60.351485729 elapsed + (da14:mps0:0:24:0): WRITE(10). CDB: 2a 00 53 42 14 68 00 00 08 00 length 4096 SMID 412 Command timeout on target 24(0x001a) 60000 set, 60.351704089 elapsed + (da13:mps0:0:23:0): WRITE(10). CDB: 2a 00 53 42 14 68 00 00 08 00 length 4096 SMID 458 Command timeout on target 23(0x0019) 60000 set, 60.351935048 elapsed - F- 1/1: Charlie Root hawthorn daily security run output -- +mps0: Reinitializing controller +mps0: Unfreezing devq for target ID 24 +mps0: Unfreezing devq for target ID 12 +mps0: Unfreezing devq for target ID 23 +mps0: Unfreezing devq for target ID 13 +mps0: Unfreezing devq for target ID 16 +mps0: Unfreezing devq for target ID 10 +mps0: Unfreezing devq for target ID 17 +mps0: Unfreezing devq for target ID 22 +mps0: Unfreezing devq for target ID 18 +mps0: Unfreezing devq for target ID 9 +mps0: Unfreezing devq for target ID 19 +mps0: Unfreezing devq for target ID 15 +mps0: Unfreezing devq for target ID 11 +mps0: Unfreezing devq for target ID 14 +mps0: Firmware: 20.00.06.00, Driver: 21.02.00.00-fbsd +mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> Taking the suggestion in comment 45 I have reduced my max opens from 255 for my devices to 100. They went all they way to 1 but hopefully 100 will be sufficient.
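In case it helps anyone doing the same, here is roughly how that can be applied to all of the affected drives at once, reusing the camcontrol devlist loop posted earlier in this thread (the grep pattern and the value 100 are specific to this box; adjust both):
for i in $(camcontrol devlist | grep "ST2000NM" | cut -d"," -f2 | cut -d")" -f1); do
    camcontrol tags $i -N 100    # cap outstanding commands (openings/tags) per drive
done
camcontrol tags da0 -v           # verify the new limit took effect
Note that camcontrol tags settings are runtime only and have to be reapplied after every reboot.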
I have a SAS9305-16i, and was seeing similar errors to some of the folks here. The comment about NCQ got me thinking, and sure enough, it looks like Broadcom had new firmware out (I had 16.00.11.00 where I saw the errors). 16.00.12.00 has these two as the fixed "defect list": ID: DCSG00398894 Headline: SATA only: WRITE SAME NCQ encapsulation assumes NonData NCQ is supported if Zero EXT is Supported Description Of Change: Disable NCQ encapsulation if Zero EXT is supported but Non Data NCQ is not supported Issue Description: In case if Zero EXT is supported but Non Data NCQ is not supported by drive, WRITE SAME NCQ encapsulation would send Non Data NCQ command to the drive. Drive would fail the command as Non Data NCQ is not supported by drive. This will cause command failure to host. Steps To Reproduce: IO errors are observed when mkfs.ext4 operation is done on drives that support Zero EXT but do not support Non Data NCQ. ID: DCSG00411882 (Port Of Defect DCSG00139294) Headline: SATA only : Controller hang is observed due to recursive function call. Description Of Change: Avoid starting of pended commands to SATA drive from completion functions of a command to avoid recursion. Issue Description: Controller hang is observed with ATA pass through command followed by a pended command that fails. At the completion of pass through command, PL starts the pended IOs if any. If the pended IO is failed due to invalid CDB, then immediately completion function is called causing recursion. This causes controller hang. Steps To Reproduce: Install the FreeNAS and try to create ZFS pool(Storage -> Pools -> Add) of the direct attached SSDs. - ATA command completes when there are IOs in pendlist. - The pended IO has invalid CDB I have installed the firmware, and am keeping an eye on it. My system is lightly loaded though, so my sightings of the issue have been rather sporadic.
I have this same issue too... My setup is an LSI 9206-16e PCIe card (which is really just 2 x LSISAS2308 cards with a PCIe switch) and FreeBSD 12 and 13. I have this connecting to a QNAP TL-D1600S 16-bay SATA chassis with 12 x 8TB Toshiba SATA (non-SMR) disks. I tried a workaround online (camcontrol tags daX -N 1) to disable NCQ per drive and this is fine *once the server is booted*... however, if I ever rebooted, I had a massive scroll of CAM errors as the kernel was trying to import a ZFS pool before the camcontrol script had run... This even leads to ZFS reporting too many errors and taking the pool offline (note: I'm not booting from this pool). I don't have the luxury of a firmware update as there haven't been any updates for the SAS2308 in a long, long time! I am running 20.00.07.00. I've also tried 3 different 9206-16e cards, so I'm happy it's not just a faulty card. I also tried in 2 different servers with the same results, so I'm happy it's probably not a hardware issue at all. My final solution, which is really a kludge but does fix the issue permanently, was to boot into a live Ubuntu environment and use lsiutil 1.72 to disable NCQ on the card itself. This is saved in the card's EEPROM. Now when I boot FreeBSD, I get zero CAM errors from any disks, the zfs pool imports straight away, and I can still max out the bandwidth to the drives - yay. I also have an LSI 9207-8i card in a different machine running the same firmware which has no issues at all even with NCQ enabled (but these are directly connected to the card via SATA cables with no enclosure as such), so I do think this is just some kind of incompatibility between driver<-->card<-->enclosure. It would be lovely if lsiutil could be ported to FreeBSD... I did look at it briefly but it's beyond my capabilities... the source code is online if anyone wanted to try. Hope that helps anyone else stumbling upon this PR; I appreciate it may only fix some, not all, of the cases reported here so far... but some is better than none :-)
I've been experiencing these issues with WD Red 6TB disks. I had frequent, unexplained reboots with 12.1-RELEASE and 12.2-RELEASE. mprutil show all Adapter: mpr0 Adapter: Board Name: LSI3008-IR Board Assembly: Chip Name: LSISAS3008 Chip Revision: ALL BIOS Revision: 12.00.00.00 Firmware Revision: 10.00.00.00 Integrated RAID: yes PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 0001 0009 N 6.0 3.0 12 SAS Initiator 1 0002 000a N 6.0 3.0 12 SAS Initiator 2 0003 000b N 6.0 3.0 12 SAS Initiator 3 0004 000c N 6.0 3.0 12 SAS Initiator 4 0005 000d N 6.0 3.0 12 SAS Initiator 5 0006 000e N 6.0 3.0 12 SAS Initiator 6 0007 000f N 6.0 3.0 12 SAS Initiator 7 0008 0010 N 6.0 3.0 12 SAS Initiator Since upgrading to 13.0-RELEASE I am no longer experiencing reboots, but I do continue to see CAM errors da7:mpr0:0:9:0): Info: 0x8024c3a0 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 80 24 c3 e8 00 00 08 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x8024c3e8 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 3a 80 4d 20 00 00 10 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x3a804d20 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 31 c4 f0 90 00 00 10 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x31c4f090 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 31 c4 f8 b8 00 06 88 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x31c4f8b8 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(10). CDB: 2a 00 31 c3 c0 a0 00 01 60 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x31c3c0a0 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 10 00 00 00 10 00 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x2baa0f010 (da7:mpr0:0:9:0): Error 22, Unretryable error (da7:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 02 14 4b b4 c0 00 00 00 08 00 00 (da7:mpr0:0:9:0): CAM status: SCSI Status Error (da7:mpr0:0:9:0): SCSI status: Check Condition (da7:mpr0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da7:mpr0:0:9:0): Info: 0x2144bb4c0 (da7:mpr0:0:9:0): Error 22, Unretryable error I have upgraded to 13.0-RELEASE-p1, on account of FreeBSD-EN-21:13.mpt . Could the issues from that errata notice also be an issue for mps?
(In reply to Daniel Austin from comment #48) I see the mpsutil/mprutil tools recently got support for a `set ncq` subcommand, which should allow disabling NCQ. I've merged that into TrueNAS 12.0-U7. I haven't tried it myself, and it makes me shiver inside from its inefficiency, but I can believe that it may reduce maximum command latency in some scenarios, or maybe even avoid command timeouts in situations where the disks or the HBAs can't schedule or process the NCQ commands reasonably.
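For anyone looking for it, the invocation is along these lines; treat this as a sketch and check mpsutil(8)/mprutil(8) on your release for the exact argument spelling:
mpsutil show adapter | grep -i ncq    # recent versions report "SATA NCQ: ENABLED/DISABLED"
mpsutil set ncq disable               # or "enable" to turn it back on; mprutil is the SAS3 equivalent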
(In reply to Alexander Motin from comment #51) If it helps, it seems to be enclosure related... I can use NCQ on my disks and controller in another enclosure (e.g. I have an Areca 8 bay that's fine)... but when using my QNAP jbod enclosure (TL-D1600S) I get errors when NCQ is enabled.
I just started experiencing this issue with my setup with 3 IBM M1015 HBAs and ST10000NM002G SAS drives. Has anyone tested their setup with TrueNAS Scale? Since the platform is based on Linux I was hoping this bug was not present. I may try switching when Scale is out of RC status later this month.
I'm reporting either the same or a similar issue; here are my findings, and please let me know if my plans sound correct:
Setup:
TrueNAS Scale 22.02.0.1
AMD Threadripper 1920X
ASRock X399 Taichi
128GB (8x16GB) Crucial CT8G4WFD824A Unbuffered ECC
AVAGO/LSI 9400-8i SAS3408 12Gbps HBA Adapter
Supermicro BPN-SAS3-743A 8-Port SAS3/SAS2/SATA 12Gbps Backplane
8 x Seagate Exos X18 18TB HDD ST18000NM004J SAS 12Gbps 512e/4Kn
2 x Crucial 120GB SSD
2 x Crucial 1TB SSD
2 x Western Digital 960GB NVME
Supermicro 4U case w/2000watt Redundant Power Supply
The system is connected to a large APC data-center battery system and conditioner, in an HVAC-controlled area. All hard drives have the newest firmware and are formatted with 4k sectors, both logical and native. The controller has the newest firmware, both regular and legacy ROMs, and is flashed with the SATA/SAS-only mode (dropping the NVME multi/tri-mode option that the new 9400 series cards support). Running any kind of heavy I/O onto the 18TB drives that are connected to the BPN-SAS3-743A backplane and through to the LSI 9400-8i HBA eventually results in the drive resetting. This happens even without the drives assigned to any kind of ZFS pool. This also happens whether running from the shell within the GUI or from the shell itself. This happens on all drives, which are using two separate SFF8643 cables to a backplane that has two separate SFF8643 ports. To cause this to happen, I can either run badblocks on each drive (using: badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/sdX) or even just run a SMART extended/long test. Eventually, all or nearly all drives will reset, even spin down (according to the shell logs). Sometimes they reset in batches, while others continue chugging along. It has made completing any kind of SMART extended test impossible. Badblocks will fail out, reporting too many bad blocks, on multiple hard drives at nearly the exact same moment, yet consecutive badblocks scans won't report bad blocks in the same areas. The SMART test will just show "aborted, drive reset?" as the result. My plan was to replace the HBA with an older LSI 9305-16i, replace the two SFF8643-SFF8643 cables going from the HBA to the backplane just for good measure, install two different SFF8643-SFF8482 cables that bypass the backplane fully, take four of the existing Seagate 18TB drives and put them on the SFF8643-SFF8482 connections that bypass the backplane, and also install four new WD Ultrastar DC HC550 (WUH721818AL5204) drives into the mix (some using the backplane, some not). That should reveal whether this is a compatibility/bug issue with all large drives or only certain large drives on an LSI controller, the mpr driver, and/or this backplane. If none of that works or doesn't eliminate all the potential points of failure, I'm left with nothing but subpar workarounds, such as just using the onboard SATA ports, disabling NCQ in the LSI controller, or setting up an L2ARC cache (or I might try a metadata cache to see if that circumvents the issue as well).
Condensed logs when one drive errors out: sd 0:0:0:0: device_unblock and setting to running, handle(0x000d) mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05) mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05) ~ ~ ~ ~ sd 0:0:0:0: Power-on or device reset occurred .......ready sd 0:0:6:0: device_block, handle(0x000f) sd 0:0:9:0: device_block, handle(0x0012) sd 0:0:10:0: device_block, handle(0x0014) mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c) sd 0:0:9:0: device_unblock and setting to running, handle(0x0012) sd 0:0:6:0: device_unblock and setting to running, handle(0x000f) sd 0:0:10:0: device_unblock and setting to running, handle(0x0014) sd 0:0:9:0: Power-on or device reset occurred sd 0:0:6:0: Power-on or device reset occurred sd 0:0:10:0: Power-on or device reset occurred scsi_io_completion_action: 5 callbacks suppressed sd 0:0:10:0: [sdd] tag#5532 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s sd 0:0:10:0: [sdd] tag#5532 Sense Key : Not Ready [current] [descriptor] sd 0:0:10:0: [sdd] tag#5532 Add. Sense: Logical unit not ready, additional power granted sd 0:0:10:0: [sdd] tag#5532 CDB: Write(16) 8a 00 00 00 00 00 5c 75 7a 12 00 00 01 40 00 00 print_req_error: 5 callbacks suppressed blk_update_request: I/O error, dev sdd, sector 12409622672 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0 sd 0:0:10:0: [sdd] tag#5533 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s sd 0:0:10:0: [sdd] tag#5533 Sense Key : Not Ready [current] [descriptor] sd 0:0:10:0: [sdd] tag#5533 Add. Sense: Logical unit not ready, additional power use not yet granted sd 0:0:10:0: [sdd] tag#5533 CDB: Write(16) 8a 00 00 00 00 00 5c 75 76 52 00 00 01 40 00 00 blk_update_request: I/O error, dev sdd, sector 12409614992 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0 ~ ~ ~ ~ sd 0:0:10:0: [sdd] Spinning up disk... . sd 0:0:3:0: device_block, handle(0x0013) mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c) . sd 0:0:3:0: device_unblock and setting to running, handle(0x0013) . sd 0:0:3:0: Power-on or device reset occurred .................ready
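For reference, the extended test can be restarted and its outcome checked per drive with smartmontools (a sketch; /dev/sdX is a placeholder for each disk on this SCALE/Linux box):
smartctl -t long /dev/sdX      # start the extended (long) self-test
smartctl -l selftest /dev/sdX  # later: "Completed without error" vs aborted/interrupted by a reset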
If you see the error message "kernel: mps0: IOC Fault 0x40000d04, Resetting", you need to update to get this fix: https://cgit.freebsd.org/src/commit/?id=e30fceb89b7eb51825bdd65f9cc4fbadf107d763 If your errors don't include that code, that is a different problem.
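A quick way to check which case you are in before and after updating (the grep string is just the fault code quoted above):
grep "IOC Fault 0x40000d04" /var/log/messages
# no hits (or a different fault code) means you are likely looking at one of the other issues in this PR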
(In reply to Allan Jude from comment #55) Hi, has this patch been included in 13.2-RELEASE? I'm on 13.2-RELEASE-p9 and have the same issue with 2 different 9207-8i cards: Jan 5 09:38:49 gobt kernel: mps0: IOC Fault 0x40000d04, Resetting Jan 5 09:38:49 gobt kernel: mps0: Reinitializing controller Jan 5 09:38:49 gobt kernel: mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd ------------------------------------------------------------------------ Controller information ------------------------------------------------------------------------ Controller type : SAS2308_2 BIOS version : 7.39.02.00 Firmware version : 20.00.07.00 mps0@pci0:4:0:0: class=0x010700 rev=0x05 hdr=0x00 vendor=0x1000 device=0x0087 subvendor=0x1000 subdevice=0x3020 vendor = 'Broadcom / LSI' device = 'SAS2308 PCI-Express Fusion-MPT SAS-2' class = mass storage subclass = SAS pcib8@pci0:6:0:0: clas
No more trouble since upgrading to 14.0.
Just finished updating to 14.0-RELEASE-p4 on my problem-child Supermicro system. mps0 Adapter: Board Name: SAS9211-8i Board Assembly: H3-25250-02B Chip Name: LSISAS2008 Chip Revision: ALL: BIOS Revision: 7.39.00.00 Firmware Revision: 20.00.06.00 Integrated RAID: no SATA NCQ: ENABLED PCIe Width/Speed: x8 (5.0 GB/sec) IOC Speed: Full Temperature: Unknown/Unsupported PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 0001 0009 N 6.0 1.5 6.0 SAS Initiator 1 0001 0009 N 6.0 1.5 6.0 SAS Initiator 2 0001 0009 N 6.0 1.5 6.0 SAS Initiator 3 0001 0009 N 6.0 1.5 6.0 SAS Initiator 4 N 1.5 6.0 SAS Initiator 5 N 1.5 6.0 SAS Initiator 6 N 1.5 6.0 SAS Initiator 7 N 1.5 6.0 SAS Initiator Under heavy I/O load this system has regularly got SCSI timeout errors, sometimes causing the zpool or zpools to go offline. This included zroot hanging the system. I tried several different LSI 9211 adapters but could never resolve the issue. After upgrading to FreeBSD 14 the problem has not happened, in spite of my running rsyncs and zfs send/receives from different pools. I even tried to reproduce it by also doing a scrub. It has only been a few days, but so far it is looking promising that the issue is resolved by going to FreeBSD 14.0.
On 14.1-RELEASE-p6 and facing similar issues. I am only seeing SCSI errors on one disk of a 5-disk raidz vdev: /dev/da7. The disk is an HC520 12TB. Depending on the load, the system hangs or reboots. Example errors in /var/log/messages: (da7:mps0:0:17:0): READ(10). CDB: 28 00 00 2e 67 c8 00 00 40 00 mps0: Controller reported scsi ioc terminated tgt 17 SMID 344 loginfo 31110d01 (da7:mps0:0:17:0): CAM status: CCB request completed with an error (da7:mps0:0:17:0): Retrying command, 3 more tries remain ... (da7:mps0:0:17:0): CAM status: SCSI Status Error (da7:mps0:0:17:0): SCSI status: Check Condition (da7:mps0:0:17:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable) Details on the SAS LSI9211-8i: mps0 Adapter: Board Name: SAS9211-8i Board Assembly: Chip Name: LSISAS2008 Chip Revision: ALL BIOS Revision: 7.29.00.00 Firmware Revision: 20.00.07.00 Integrated RAID: no SATA NCQ: ENABLED PCIe Width/Speed: x8 (5.0 GB/sec) IOC Speed: Full Temperature: Unknown/Unsupported mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd Since the HC520 is not an SMR disk, would disabling NCQ be worth testing?
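Before turning NCQ off it may be worth confirming what the drive and the HBA have actually negotiated; a few read-only checks (a sketch, nothing here changes any state, and the mpsutil line needs a reasonably recent mpsutil):
camcontrol tags da7 -v                  # current openings/queue depth for the suspect disk
camcontrol identify da7 | grep -i ncq   # whether the drive itself advertises NCQ
mpsutil show adapter | grep -i ncq      # whether the HBA has SATA NCQ enabled, as in the output above
If those all look sane, disabling NCQ on just that drive with camcontrol tags da7 -N 1, as tried earlier in this thread, is the quickest experiment.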