Bug 229745 - ahcich: CAM status: Command timeout
Summary: ahcich: CAM status: Command timeout
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-12 22:35 UTC by Alexey
Modified: 2024-02-08 17:46 UTC
CC List: 21 users

See Also:


Attachments
Log messages in sequence for the problem (11.05 KB, text/plain)
2019-02-27 22:42 UTC, Dave Hayes

Description Alexey 2018-07-12 22:35:01 UTC
Hello!

We have several Supermicro servers based on the X11SSH-F board.
All of them were installed about half a year ago and ran FreeBSD 11.1. Each server has 4 HGST HUS722T1TALA604 HDDs.
All of them worked fine for that time, with half a year of uptime.
Recently the servers were upgraded to FreeBSD 11.2 (self-built 11.2-STABLE r335679 with default make.conf, src.conf and the GENERIC kernel)

and after some time (it varies, anywhere from 2 hours to 7 days) one or more disks start to time out:

Jul 13 00:56:24 mrr32 kernel: ahcich2: Timeout on slot 17 port 0
Jul 13 00:56:24 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 00060000 rs 00060000 tfd 40 serr 00000000 cmd 0004d217
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 ca 22 23 40 06 00 00 00 00 00
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 00:58:16 srv32 kernel: ahcich2: Timeout on slot 26 port 0
Jul 13 00:58:16 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 04000000 rs 04000000 tfd 40 serr 00000000 cmd 0004da17
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e0 8a cc c6 40 18 00 00 00 00 00
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 01:01:46 srv32 kernel: ahcich2: Timeout on slot 18 port 0
Jul 13 01:01:46 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 00040000 rs 00040000 tfd 40 serr 00000000 cmd 0004d217
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 2a 2b 23 40 06 00 00 00 00 00
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 01:07:12 srv32 kernel: ahcich0: Timeout on slot 23 port 0
Jul 13 01:07:12 srv32 kernel: ahcich0: is 00000000 cs 00000000 ss 00800000 rs 00800000 tfd 40 serr 00000000 cmd 0004d717
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 18 62 f5 c6 40 18 00 00 00 00 00
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jul 13 01:07:43 srv32 kernel: ahcich0: Timeout on slot 2 port 0
Jul 13 01:07:43 srv32 kernel: ahcich0: is 00000000 cs 00000000 ss 00000004 rs 00000004 tfd 40 serr 00000000 cmd 0004c217
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 62 12 7b 40 06 00 00 00 00 00
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): Retrying command

A reboot (/sbin/shutdown -r or /sbin/reboot) does not solve the problem; the disks still time out after boot. Only a power off / power on fixes it for some time, and after a while the timeouts start again.

The servers were updated to the latest BIOS available from Supermicro. No change.

ahci0: <Intel Sunrise Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xdf310000-0xdf311fff,0xdf31e000-0xdf31e0ff,0xdf31d000-0xdf31d7ff irq 16 at device 23.0 on pci0
ahci0: AHCI v1.31 with 8 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahcich6: <AHCI channel> at channel 6 on ahci0
ahcich7: <AHCI channel> at channel 7 on ahci0

ses0 at ahciem0 bus 0 scbus8 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)


ahci0@pci0:0:23:0:      class=0x010601 card=0x088415d9 chip=0xa1028086 rev=0x31 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Sunrise Point-H SATA controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA


We use ZFS on all servers; some pools are raidz1, some are raid-10 (striped mirrors), with the same results.

We run smartd on all servers; I tried disabling smartd, but it looks like that makes no difference.
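(For reference, a minimal way to take smartd out of the picture for such a test, assuming smartd comes from the smartmontools package; these are standard FreeBSD commands, not necessarily the exact ones used here:

sysrc smartd_enable=NO    # prevent smartd from starting at boot
service smartd stop       # stop the currently running daemon
)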

We have already upgraded the zpools to the new feature flags, which would have to be removed before we could downgrade back to 11.1.
Comment 1 Alexey 2018-07-12 22:50:30 UTC
We also have some recently installed Supermicro X10DRW-i servers with the same problem. These servers were installed with 11.2 from the start, so we have no statistics for them under 11.1.

ahci0: <Intel Wellsburg AHCI SATA controller> port 0x70b0-0x70b7,0x70a0-0x70a3,0x7090-0x7097,0x7080-0x7083,0x7000-0x701f mem 0xc7237000-0xc72377ff irq 16 at device 17.4 numa-domain 0 on pci2
ahci0: AHCI v1.30 with 4 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0

(There are two SATA controllers on the motherboard; after the problem was detected, the second one was disabled in the BIOS. The second controller is the same Intel Wellsburg AHCI SATA controller.)

ses0 at ahciem0 bus 0 scbus4 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)


Here we have lost a disk twice. Power off/on brings the disk back, but the timeouts return again after a while.

We have already upgraded the BIOS to the latest one here as well.
Comment 2 skillcoder 2018-07-14 14:28:53 UTC
I have the same problem with this disk:
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Firmware Version: RAGNWA05
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)

Directly after upgrading FreeBSD from 11.1-RELEASE-p4 to 11.2-RELEASE I get this in /var/log/messages:

Jul 14 17:04:54 skillcoder kernel: ahcich2: Timeout on slot 21 port 0
Jul 14 17:04:54 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 0f200000 rs 0f200000 tfd 40 serr 00000000 cmd 0000db17
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 98 2a 7c 40 05 00 00 00 00 00
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:05:25 skillcoder kernel: ahcich2: Timeout on slot 31 port 0
Jul 14 17:05:25 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss e000000f rs e000000f tfd 40 serr 00000000 cmd 0000c317
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 68 fa 40 14 00 00 01 00 00
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:06:12 skillcoder kernel: ahcich2: Timeout on slot 24 port 0
Jul 14 17:06:12 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 19000000 rs 19000000 tfd 40 serr 00000000 cmd 0000de17
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1d 6c 40 05 00 00 00 00 00
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:06:43 skillcoder kernel: ahcich2: Timeout on slot 14 port 0
Jul 14 17:06:43 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 000fc000 rs 000fc000 tfd 40 serr 00000000 cmd 0000d317
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 60 62 2c 40 14 00 00 01 00 00
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:07:35 skillcoder kernel: ahcich2: Timeout on slot 14 port 0
Jul 14 17:07:35 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 007fc000 rs 007fc000 tfd 40 serr 00000000 cmd 0000d617
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 8b fe 40 14 00 00 01 00 00
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:08:08 skillcoder kernel: ahcich2: Timeout on slot 10 port 0
Jul 14 17:08:08 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 00007c00 rs 00007c00 tfd 40 serr 00000000 cmd 0000ce17
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 70 7e 40 40 14 00 00 01 00 00
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:08:45 skillcoder kernel: ahcich2: Timeout on slot 10 port 0
Jul 14 17:08:45 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 000017c0 rs 000017c0 tfd 40 serr 00000000 cmd 0000c917
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 f0 0d 5b 40 1e 00 00 00 00 00
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:09:19 skillcoder kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 5028, size: 12288


I also have another server with the same disk model behind a hardware RAID controller, an HP Smart Array P410.
After upgrading that one from 11.0-RELEASE-p9 to 11.2-RELEASE the disk works fine, but it is behind hardware RAID, so the disk is not seen directly by FreeBSD.

HELP!
Comment 3 Alexey 2018-07-14 23:14:14 UTC
Yes, the problem is possibly related to the disk. We have 8 Supermicro X10DRW-i servers, 4 of them with HGST HUS722T1TALA604 HDDs and the other 4 with TOSHIBA DT01ACA100 MS2OA750 HDDs. All of them were installed about half a year ago with FreeBSD 11.1-STABLE, and all of them worked fine. We have now upgraded two servers of each disk type to 11.2: we get the problem on the servers with the HGST drives and still have no problem on the Toshiba-based servers. I'll try replacing the disks with a different model. But if the problem is in the disk itself, why did it work under 11.1?

If we compare camcontrol identify output under 11.1 and 11.2 for the same model, we see no differences (11.2 only adds the new "Zoned-Device Commands  no" line).

protocol              ATA/ATAPI-10 SATA 3.x
device model          HGST HUS722T1TALA604
firmware revision     RAGNWA07
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    yes
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         yes      yes
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      no

and the Toshiba disk looks different:

pass0: <TOSHIBA DT01ACA100 MS2OA750> ATA8-ACS SATA 3.x device
pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 3.x
device model          TOSHIBA DT01ACA100
firmware revision     MS2OA750
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         no       no
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      1953525168/1953525168
HPA - Security                 no
Comment 4 skillcoder 2018-07-25 19:50:33 UTC
The problem is still present.
This problem may be related to bug #224536:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224536
Comment 5 Alexey 2018-08-01 21:17:07 UTC
Yes, it looks like vfs.zfs.cache_flush_disable=1 hides the problem. (At least the servers with a raidz1 ZFS pool, which used to start timing out after no more than 2 days of uptime, now have 6 days of uptime and no timeouts on the console.)
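For anyone wanting to try this, a minimal way to apply the workaround (note from later comments: once the timeouts have started, a full power-off is needed, not just a warm reboot):

echo 'vfs.zfs.cache_flush_disable=1' >> /boot/loader.conf
shutdown -p now    # power the machine off completely, then switch it back on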
Comment 6 Alexey 2018-08-01 21:48:35 UTC
(In reply to Alexey from comment #5)
But the problem is, if we check the spec sheet for the HGST HUS722T1TALA604 drive, https://www.hgst.com/sites/default/files/resources/Ultrastar-7K2-EN-US-DS.pdf, I do not see that the drive uses shingled magnetic recording. (1 TB is not a big capacity for a 3.5" drive; it should not need any such tricks.)

And if we replace the drives (one by one, all four disks in the zpool) from HGST HUS722T1TALA604 to something different, TOSHIBA DT01ACA100 or TOSHIBA MG03ACA100 in our case, that solves the problem without disabling cache flushes (vfs.zfs.cache_flush_disable=0).
Comment 7 skillcoder 2018-08-01 22:33:57 UTC
Unfortunately, this workaround did not help me.

After adding "vfs.zfs.cache_flush_disable=1" to "/boot/loader.conf" and rebooting, I still get the same log after boot:
Aug  2 01:30:32 skillcoder kernel: ahcich2: Timeout on slot 12 port 0
Aug  2 01:30:32 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 001ff000 rs 001ff000 tfd 40 serr 00000000 cmd 0
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e8 70 e5 07 40 5d 00 00 00 00 00
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command
Aug  2 01:31:08 skillcoder kernel: ahcich2: Timeout on slot 15 port 0
Aug  2 01:31:08 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 007f8000 rs 007f8000 tfd 40 serr 00000000 cmd 0
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e8 70 1e 08 40 5d 00 00 00 00 00
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command


The problem is still present with:
# sysctl vfs.zfs.cache_flush_disable                                                       
vfs.zfs.cache_flush_disable: 1
Comment 8 Alexey 2018-08-01 22:37:13 UTC
Did you power off, or just run /sbin/reboot? You need a power off + power on, not a reboot or shutdown -r, once the timeouts have already started.
Comment 9 skillcoder 2018-08-02 17:15:48 UTC
Just a plain shutdown -r now.
But I then tried shutdown -p now (with "vfs.zfs.cache_flush_disable=1" in "/boot/loader.conf") and switched the power off.
Now the uptime is 18+ hours with not a single hang, and the log shows no more "ahcichX: Timeout on slot XX port 0" messages.
BIG THX, but how is that possible?
Comment 10 cryx-freebsd 2018-09-04 08:17:10 UTC
Affects me too. FreeBSD 11.2-RELEASE on a SuperMicro X11SSL-F with Intel Sunrise Point AHCI SATA controller and HUS722T2TALA604 disks. The vfs.zfs.cache_flush_disable=1 workaround also bypasses the problem after a power-off.
Comment 11 cryx-freebsd 2018-09-04 13:07:07 UTC
(In reply to cryx-freebsd from comment #10)

Setting kern.cam.ada.write_cache=0 in loader.conf and doing a power-cycle makes the problem go away, with the obvious hits on IO performance.

IMHO this shows that there is indeed a problem with the disks' write-cache handling.
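For reference, the two tunables mentioned so far can be combined in /boot/loader.conf and verified after the cold power cycle (a sketch, using only the names already quoted in this thread):

# /boot/loader.conf
vfs.zfs.cache_flush_disable=1   # stop ZFS from issuing cache-flush commands
kern.cam.ada.write_cache=0      # disable the drive write cache (costs write performance)

# verify after the power cycle:
sysctl vfs.zfs.cache_flush_disable kern.cam.ada.write_cache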
Comment 12 Oleksii Samorukov freebsd_committer freebsd_triage 2018-10-23 05:54:25 UTC
The same applies to me: 2 old 1TB drives were replaced with HGST HUS722T1TALA604/RAGNWA07 drives and the problem started:

(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 02 af 63 40 37 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada0:ahcich0:0:0:0): RES: 41 10 02 af 63 00 37 00 00 00 00
Comment 13 Oleksii Samorukov freebsd_committer freebsd_triage 2018-10-23 05:55:50 UTC
Also, from a SMART standpoint the drives look healthy - self-tests pass fine, attributes are good, no UDMA CRC errors, etc.
Comment 14 Oleksii Samorukov freebsd_committer freebsd_triage 2018-10-23 06:03:56 UTC
P.S. We are also running

11.2-RELEASE-p4/amd64 on an X10SLM-F server:

ahci0: <Intel Lynx Point AHCI SATA controller> port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 0xf7232000-0xf72327ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
Comment 15 Oleksii Samorukov freebsd_committer freebsd_triage 2018-10-23 06:10:17 UTC
TBH, it makes me think that the drives themselves or the controller may have an issue with NCQ.
Comment 16 Oleksii Samorukov freebsd_committer freebsd_triage 2018-10-23 06:12:49 UTC
To confirm whether or not it is NCQ-related, I decided to disable NCQ on one of the 2 affected drives:

camcontrol negotiate ada0 -T disable
camcontrol reset ada0

The second affected drive will keep using NCQ for now, to see if that makes any difference.
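These camcontrol settings are runtime-only and are lost on reboot; a sketch of one way to reapply them at every boot via /etc/rc.local, assuming the affected drives keep the same adaX names (ada0 and ada2 here):

#!/bin/sh
# /etc/rc.local - reapply the NCQ-disable workaround after every boot
for d in ada0 ada2; do
        camcontrol negotiate $d -T disable   # turn off tagged queueing (NCQ) negotiation
        camcontrol reset $d                  # reset the channel so the change takes effect
done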
Comment 17 Oleksii Samorukov freebsd_committer freebsd_triage 2018-10-31 14:06:02 UTC
One update - after disabling NCQ on ada0 I do not see any problems with it, but ada2 is still failing. I am disabling NCQ on both for now.
Comment 18 Oleksii Samorukov freebsd_committer freebsd_triage 2018-11-05 20:23:20 UTC
One more update - after disabling NCQ on both affected drives things are going well; I do not see any errors anymore. So from my POV this is buggy drive firmware.
Comment 19 sec 2019-02-06 12:36:20 UTC
Also having the same problem: before 11.2 my ZFS mirror was working fine; after the upgrade those CAM errors started to show up.
I replaced the PSU, replaced one of the drives, replaced the cables - still the same. Checked SMART for the drives, ran short/long tests - the drives are fine.
Even did memtest :)

My observations:
- when the HDDs are connected directly to the motherboard, there are Timeout errors
- when the HDDs are connected to a PCIe SATA controller, there are uncorrectable CRC errors

I tried disabling NCQ and the cache - nothing helps.

The strange thing is, when the mirror is broken (only one drive connected), everything is fine. It's only when 2 drives are connected in the same mirror that the errors start to show up.

My drives are WD Gold 1TB:
1. WDC WD1005FBYZ-01YCBB2 RR07
2. WDC WD1005FBYZ-01YCBB1 RR04

Before, I had two RR07 drives working fine; after the upgrade the errors showed up, so I RMA'd one of them and got an RR04 - which didn't fix the error. Also, the problem shows up on only one of the drives in the mirror.

Right now I'm in the process of migrating the data to an 11.1 ZFS pool; then I will downgrade my server back to releng/11.1 and check whether the pool works fine.

I also have two SAMSUNG HD642JJ 1AA01113 drives connected in another mirror - no issues with those.

Tried swapping cables, ports, etc. - the problem follows those two drives, together.

Hoping for a solution to this one, because it will block any upgrade to 12 :)
Comment 20 sec 2019-02-07 07:36:36 UTC
(In reply to sec from comment #19)
OK, so I tried downgrading the pool to 11.1 - that didn't help.
Then I also started to get those errors with only one drive connected (which was fine before).
Also tried booting 12.0-RELEASE, same issue.

The weird thing is that sometimes the drives are OK, they resilver without any problems, and then after some read/write activity the timeouts start to show up.

In the end I tried something I think I'd tried before. I added these lines to /boot/loader.conf and power-cycled:
vfs.zfs.cache_flush_disable=1
kern.cam.ada.write_cache=0

Then I started stress testing - no errors.
Then I commented out "write_cache" - still no errors.
Commented out "cache_flush" - errors back again (I think I had already tried that one and got errors, but this time it looks fine, fingers crossed).

So right now I'm running with cache_flush disabled - I hope this will keep those errors away until a proper fix is done.

The errors I was getting were something like this:
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 b0 00 20 40 4c 00 00 01 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada2:ahcich2:0:0:0): RES: 41 10 b0 00 20 00 4c 00 00 00 00
(ada2:ahcich2:0:0:0): Retrying command

There were also READ and FLUSHCACHE48 errors - it depended on the load I was generating on the pool.

If needed I can provide more debug output.
Comment 21 Dave Hayes 2019-02-27 22:42:32 UTC
Created attachment 202427 [details]
Log messages in sequence for the problem
Comment 22 Dave Hayes 2019-02-27 22:45:52 UTC
Comment on attachment 202427 [details]
Log messages in sequence for the problem

Having the same problem here, right after I upgraded to FreeBSD 11.2-STABLE. I decided to attach my log output; in retrospect that might have been a bad idea.

I also use smartctl regularly through icinga2. 

# camcontrol devlist
<WDC WD2005FBYZ-01YCBB3 RR09>      at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD200MFYYZ-01D45B0 01.01K01>  at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD2004FBYZ-01YCBB1 RR03>      at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD2004FBYZ-01YCBB1 RR03>      at scbus3 target 0 lun 0 (pass3,ada3)
<HL-DT-ST DVDRAM GH24NS95 RN01>    at scbus5 target 0 lun 0 (cd0,pass4)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus6 target 0 lun 0 (pass5,ses0)
Comment 23 Dave Hayes 2019-02-28 03:52:25 UTC
I've taken smartctl out of the picture and this problem still occurs. I can trigger it consistently with high write load on the disk subsystem.

Isn't this the same bug as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201194
Comment 24 Allan Jude freebsd_committer freebsd_triage 2019-05-28 15:49:26 UTC
We are seeing the same problem across a large number of machines, using Supermicro X11SPW-TF and X10SRW-F motherboards.

The disks are:
WD1004FBYZ-01YCBB1
or
WD1005FBYZ-01YCBB2

And they just fall off the bus and won't come back (not even visible in the BIOS) without a hard power cycle.


May 28 14:51:03 US-EWR3-02 kernel: ahcich9: Timeout on slot 14 port 0
May 28 14:51:03 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 1d0 serr 00000000 cmd 0004ce17
May 28 14:51:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:51:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:51:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Error 5, Retries exhausted
May 28 14:51:33 US-EWR3-02 kernel: ahcich9: Timeout on slot 19 port 0
May 28 14:51:33 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 1d0 serr 00000000 cmd 0004d317
May 28 14:51:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:51:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:51:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Retrying command, 0 more tries remain
May 28 14:52:03 US-EWR3-02 kernel: ahcich9: Timeout on slot 20 port 0
May 28 14:52:03 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd 1d0 serr 00000000 cmd 0004d417
May 28 14:52:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:52:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:52:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Error 5, Retries exhausted
May 28 14:52:19 US-EWR3-02 kernel: ahcich8: Timeout on slot 26 port 0
May 28 14:52:19 US-EWR3-02 kernel: ahcich8: is 00000000 cs 00000000 ss fc03ffff rs fc03ffff tfd 40 serr 00000000 cmd 0004d117
May 28 14:52:19 US-EWR3-02 kernel: (ada2:ahcich8:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c7 65 68 40 61 00 00 01 00 00
May 28 14:52:19 US-EWR3-02 kernel: (ada2:ahcich8:0:0:0): CAM status: Command timeout
May 28 14:52:19 US-EWR3-02 kernel: (ada2:ahcich8:0:0:0): Retrying command, 3 more tries remain
May 28 14:52:33 US-EWR3-02 kernel: ahcich9: Timeout on slot 23 port 0
May 28 14:52:33 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd 1d0 serr 00000000 cmd 0004d717
May 28 14:52:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:52:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:52:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Retrying command, 0 more tries remain
Comment 25 Dave Hayes 2019-06-12 18:38:14 UTC
(In reply to Allan Jude from comment #24)

Did you find a suitable workaround?
Comment 26 Alexey 2019-06-12 18:52:22 UTC
It has worked for me since August 2018 with vfs.zfs.cache_flush_disable=1, with no reboots and no disk timeouts.
Comment 27 Dave Hayes 2019-06-12 23:14:56 UTC
(In reply to Alexey from comment #26)

I have another more drastic workaround that works. In /boot/device.hints I placed the following lines:

hint.ahcich.0.sata_rev=2
hint.ahcich.1.sata_rev=2
hint.ahcich.2.sata_rev=2
...etc...

This slows down the SATA bus, which isn't desirable either, but it is better than timeouts and controller drops.
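If there are many channels, the hints can be generated in one go instead of typed by hand (a sketch; adjust the channel count to match your dmesg output):

# append a 3Gbps (SATA rev 2) cap for AHCI channels 0-7
for i in $(seq 0 7); do
        echo "hint.ahcich.${i}.sata_rev=2"
done >> /boot/device.hints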
Comment 28 Samuel Chow 2019-07-06 01:50:38 UTC
Unfortunately, I am also hitting this issue:

[3974486] ada6 at ahcich7 bus 0 scbus9 target 0 lun 0
[3974486] ada6: <HGST HUS722T2TALA604 RAGNWA09> ACS-3 ATA SATA 3.x device
[3974486] ada6: Serial Number WMC6N0XXXXXX
[3974486] ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
[3974486] ada6: Command Queueing enabled
[3974486] ada6: 1907729MB (3907029168 512 byte sectors)

[4038180] ahcich7: Timeout on slot 15 port 0
[4038180] ahcich7: is 00000000 cs 00000000 ss 00008000 rs 00008000 tfd 40 serr 00000000 cmd 0004cf17
[4038180] (ada6:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 e4 3f 40 2e 00 00 00 00 00
[4038180] (ada6:ahcich7:0:0:0): CAM status: Command timeout
[4038180] (ada6:ahcich7:0:0:0): Retrying command
[4039687] ahcich7: Timeout on slot 5 port 0
[4039687] ahcich7: is 00000000 cs 00000000 ss 00000020 rs 00000020 tfd 40 serr 00000000 cmd 0004c517
[4039687] (ada6:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 e9 3f 40 2e 00 00 00 00 00I
[4039687] (ada6:ahcich7:0:0:0): CAM status: Command timeout
[4039687] (ada6:ahcich7:0:0:0): Retrying command
[4046427] ahcich7: Timeout on slot 31 port 0
[4046427] ahcich7: is 00000000 cs 00000000 ss 80000007 rs 80000007 tfd 40 serr 00000000 cmd 0004c217
[4046427] (ada6:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 f8 1a 40 40 2e 00 00 00 00 00
[4046427] (ada6:ahcich7:0:0:0): CAM status: Command timeout
[4046427] (ada6:ahcich7:0:0:0): Retrying command


I do note that in all the reports here, people seem to have Supermicro machines. I have an X10DAi motherboard, and it has a 'C610/X99 series chipset 6-Port SATA Controller' on the motherboard connecting to this disk.

Interestingly, on this machine I also have two other controllers, mps(4) 'LSI 9207-4i4e PCIe SATA Host Controller' HBAs, with another HUS722T2TALA604 attached to one of them. That disk has been running for 5 months, and I have never seen any problems with it.
Comment 29 Rick 2019-07-31 06:54:10 UTC
We had the same problem with a SuperMicro SIS14B server.
A colleague of mine found out that removing the vertical hot-swappable connector board solves the issue.
So when the disks are connected directly to the main board it works correctly.
Comment 30 Rick 2019-08-01 09:23:50 UTC
(In reply to Rick from comment #29)
Please forget this remark. It only lasted a few days :S
Comment 31 John Baldwin freebsd_committer freebsd_triage 2019-08-08 23:49:21 UTC
I believe I ran into the same thing though in my case the drives would bounce (come right back after detaching), but since they all bounced at once it always killed the zpool (the drives were all mirrors in a single pool).  I found that disabling hotplug for each SATA port in the BIOS setup caused the bouncing to stop and the system has been stable for a day and a half (previously when this started it wouldn't make it to multiuser without losing the zpool).  I'm not sure if I had outsmarted myself by enabling this in the BIOS setup at one point but figured I'd leave a bread crumb here in case it helps someone else.
Comment 32 Dave Hayes 2019-09-04 22:05:54 UTC
(In reply to dave from comment #27)

Ok more data is good right? :) 

This timeout issue has resurfaced on the weekly scrubs. It is not solved by what I said in comment #27.

For me, it's time to get a new controller. (sigh)
Comment 33 John Baldwin freebsd_committer freebsd_triage 2019-09-05 20:53:35 UTC
(In reply to John Baldwin from comment #31)
I was fine for about a week with two active drives, then I tried to add a 3rd drive to replace one that had died and to bring my zpool back to full health. I then started getting timeouts again before the resilver finished. I then started getting errors on one of the two original drives even when booting with only two drives, so disabling hotplug was not a silver bullet.
Comment 34 Simon G. 2019-09-08 19:15:55 UTC
Hello!

I have the same problem.
Errors like
WRITE_FPDMA_QUEUED... READ_FPDMA_QUEUED... etc.


OS FreeBSD 11.2-RELEASE-p9

HDD Drives

# camcontrol devlist
<WDC WD10EZRZ-00HTKB0 01.01A01>    at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD10EZRZ-00HTKB0 01.01A01>    at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD10EZRX-00D8PB0 80.00A80>    at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD10EZRX-00D8PB0 80.00A80>    at scbus3 target 0 lun 0 (ada3,pass3)


SATA controller:
Marvell 88SE9230, FG-EST14A-1-BU01, Espada
4 internal SATA 6Gb/s ports
https://static.nix.ru/autocatalog/controllers_Espada/189135_2254_draft.jpg

# pciconf -lv
ahci0@pci0:1:0:0:       class=0x010601 card=0x92301b4b chip=0x92301b4b rev=0x11 hdr=0x00
    vendor     = 'Marvell Technology Group Ltd.'
    device     = '88SE9230 PCIe SATA 6Gb/s Controller'
    class      = mass storage
    subclass   = SATA



# zpool status -v
pool: z4000
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        z4000       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors


- - -

Solution for WD Hard Disk Drives

I used the WDIdle3 program under DOS to turn off head parking.

https://support.wdc.com/downloads.aspx?p=113

Enter the BIOS and change the controller mode from AHCI to IDE or Compatible.

And one more thing, IMPORTANT!!!

While running WDIdle3, connect to the controller only the disks whose idle time you want to change,
because the WDIdle3 utility changes every HDD connected to the controller.

Usage:
wdidle3 /r - show the current idle-time configuration of the disk

wdidle3 /d - turn off the idle timer (I use this option)

wdidle3 /s50 - set the timer to 50 seconds (you can use 20, 30, 40 or above; please read the manual)

wdidle3 /? - program help

- - -

With the idle timer turned off it works fine.
Speed with rsync:
320 - 340 megabytes per second.
No errors in the system log.


P.S.
HGST and WD are the same brand and use the same technology, so I think this solution will work there too.
Comment 35 John Baldwin freebsd_committer freebsd_triage 2019-12-28 00:41:24 UTC
I tried the wdidle3 program but disabling the idle timer did not help in my case.  I ended up replacing the drives with WD blue drives.  I talked with Allan Jude a bit and apparently he has only seen these issues with WD-100[3-5]* drives.  The same drives work fine in Linux but fall over in FreeBSD.
Comment 36 John Baldwin freebsd_committer freebsd_triage 2019-12-30 21:49:27 UTC
My brand new drives all detached after about 10 days of use:

[-- MARK -- Sat Dec 28 06:00:00 2019]
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD10EZEX-08WN4A0 02.01A02> s/n WD-WCC6Y2ZRADXP detached
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD10EZEX-08WN4A0 02.01A02> s/n WD-WCC6Y3TPFEKT detached
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <WDC WD10EZEX-08WN4A0 02.01A02> s/n WD-WCC6Y1VPULU3 detached
(ada0:ahcich0:0:0:0): Periph destroyed
(ada1:ahcich1:0:0:0): Periph destroyed
(ada2:ahcich2:0:0:0): Periph destroyed
Comment 37 Daniel Bell 2020-02-12 18:36:25 UTC
I'm having this problem on 12.1-RELEASE-p2 GENERIC, with a Supermicro X11DPi-N and a 6-drive zpool of 3 striped mirrors of "WD Gold" disks.
<WDC WD2005FBYZ-01YCBB2 RR07>      at scbus6 target 0 lun 0 (ada0,pass1)
<WDC WD2005FBYZ-01YCBB2 RR07>      at scbus7 target 0 lun 0 (ada1,pass2)
<WDC WD2005FBYZ-01YCBB2 RR07>      at scbus8 target 0 lun 0 (ada2,pass3)
<WDC WD2005FBYZ-01YCBB2 RR07>      at scbus9 target 0 lun 0 (ada3,pass4)
<WDC WD2005FBYZ-01YCBB2 RR07>      at scbus10 target 0 lun 0 (ada4,pass5)
<WDC WD2005FBYZ-01YCBB2 RR07>      at scbus11 target 0 lun 0 (ada5,pass6)

There's significant load and I haven't had disconnects yet, but it's only been up for 8 days and I've rebooted a few times to test the loader.conf tweaks (which didn't help as far as I can tell).

I'll try replacing the controller next week, since I haven't heard of this issue without roughly this drive/controller combo—if that's a bad idea please let me know.

The errors pop up on seemingly random disks, e.g.

Feb 12 13:20:42 wyn2018 kernel: ahcich11: Timeout on slot 10 port 0
Feb 12 13:20:42 wyn2018 kernel: ahcich11: is 00000000 cs 00000000 ss 0003fc00 rs 0003fc00 tfd 40 serr 00000000 cmd 0004d117
Feb 12 13:20:42 wyn2018 kernel: (ada5:ahcich11:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 b8 c5 36 40 06 00 00 00 00 00
Feb 12 13:20:42 wyn2018 kernel: (ada5:ahcich11:0:0:0): CAM status: Command timeout
Feb 12 13:20:42 wyn2018 kernel: (ada5:ahcich11:0:0:0): Retrying command, 3 more tries remain
Feb 12 13:20:49 wyn2018 kernel: (ada4:ahcich10:0:0:0): READ_FPDMA_QUEUED. ACB: 60 68 98 c6 36 40 06 00 00 00 00 00
Feb 12 13:20:49 wyn2018 kernel: (ada4:ahcich10:0:0:0): CAM status: ATA Status Error
Feb 12 13:20:49 wyn2018 kernel: (ada4:ahcich10:0:0:0): ATA status: 43 (DRDY IDX ERR), error: 40 (UNC )
Feb 12 13:20:49 wyn2018 kernel: (ada4:ahcich10:0:0:0): RES: 43 40 98 c6 36 00 06 00 00 00 00
Feb 12 13:20:49 wyn2018 kernel: (ada4:ahcich10:0:0:0): Retrying command, 3 more tries remain
Feb 12 13:25:42 wyn2018 kernel: ahcich6: Timeout on slot 15 port 0
Feb 12 13:25:42 wyn2018 kernel: ahcich6: is 00000000 cs 00000000 ss 01fe8000 rs 01fe8000 tfd 40 serr 00000000 cmd 0004d817
Feb 12 13:25:42 wyn2018 kernel: (ada0:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 00 6c 01 40 1a 00 00 00 00 00
Feb 12 13:25:42 wyn2018 kernel: (ada0:ahcich6:0:0:0): CAM status: Command timeout
Feb 12 13:25:42 wyn2018 kernel: (ada0:ahcich6:0:0:0): Retrying command, 3 more tries remain
Feb 12 13:26:15 wyn2018 kernel: (ada5:ahcich11:0:0:0): READ_FPDMA_QUEUED. ACB: 60 20 98 f3 9a 40 08 00 00 00 00 00
Feb 12 13:26:15 wyn2018 kernel: (ada5:ahcich11:0:0:0): CAM status: ATA Status Error
Feb 12 13:26:15 wyn2018 kernel: (ada5:ahcich11:0:0:0): ATA status: 43 (DRDY IDX ERR), error: 40 (UNC )
Feb 12 13:26:15 wyn2018 kernel: (ada5:ahcich11:0:0:0): RES: 43 40 98 f3 9a 00 08 00 00 00 00
Feb 12 13:26:15 wyn2018 kernel: (ada5:ahcich11:0:0:0): Retrying command, 3 more tries remain
Comment 38 sec 2020-02-12 18:49:56 UTC
(In reply to Daniel Bell from comment #37)
I'm running on 11.3-STABLE without issues, just have this in /boot/loader.conf:

vfs.zfs.cache_flush_disable=1

Just be sure to power off the server (maybe even pull the power cord, who knows) and power-cycle it. For me, adding this tweak and doing reboots didn't help, but a full power cycle made the issue go away. I know this probably hurts performance, but until the issue is fixed, I think it's the way to go :(
Comment 39 Wayne Willcox 2020-04-13 12:43:03 UTC
I am getting the same errors and a hang. I changed the setting to
vfs.zfs.cache_flush_disable: 1

and it still gets those errors and hangs, though it does take longer. This system is not heavily used; it is only a Zabbix server.

<WDC WD30EZRZ-00GXCB0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<Hitachi HDS721010CLA330 JP4OA3MA>  at scbus1 target 0 lun 0 (ada1,pass1)
<Hitachi HDS721010CLA330 JP4OA3MA>  at scbus2 target 0 lun 0 (ada2,pass2)
<Hitachi HDS721010CLA330 JP4OA3MA>  at scbus3 target 0 lun 0 (ada3,pass3)
<Hitachi HDS721010CLA330 JP4OA3MA>  at scbus4 target 0 lun 0 (ada4,pass4)
<WDC WD5000AACS-00G8B1 05.04C05>   at scbus4 target 1 lun 0 (ada5,pass5)

ahci0: AHCI v1.20 with 4 6Gbps ports, Port Multiplier supported
ahci0: quirks=0x22000<ATI_PMP_BUG,1MSI>
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ohci0: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xf9ef7000-0xf9ef7fff irq 18 at device 18.0 on pci0

I swapped to a completely different motherboard, going from an Intel one originally to this AMD one.
FreeBSD 12.1-RELEASE r354233 GENERIC amd64
FreeBSD clang version 8.0.1 (tags/RELEASE_801/final 366581) (based on LLVM 8.0.1)
VT(efifb): resolution 800x600
CPU: AMD Phenom(tm) II X6 1090T Processor (3200.16-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x100fa0  Family=0x10  Model=0xa  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  SVM: NP,NRIP,NAsids=64
  TSC: P-state invariant, performance statistics
real memory  = 17179869184 (16384 MB)
avail memory = 16612155392 (15842 MB)

I have also tried a different brand-new WD Blue drive, with the same results.
I have a bunch of WD Red drives in service and have never seen this problem with them; it looks like some incompatibility between the drive firmware and FreeBSD?

I plan to replace this particular drive with a WD Red that I just ordered; however, I have a number of the Blues and could assist with testing a fix if one becomes available.

(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: Timeout on slot 6 port 0
ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 00006617
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
sonewconn: pcb 0xfffff8000ee63d58: Listen queue overflow: 193 already in queue awaiting acceptance (224 occurrences)
ahcich0: Timeout on slot 9 port 0
ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000200 tfd 50 serr 00000000 cmd 00006917
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Comment 40 Wayne Willcox 2020-04-13 16:07:39 UTC
OK, so even with the write cache disabled it started getting the errors and hung. So this is a pretty serious issue; at this point the WD Blue drives (and maybe others) are not at all reliable with ZFS and FreeBSD 12.1.
Comment 41 Warner Losh freebsd_committer freebsd_triage 2020-04-13 18:14:59 UTC
Are people here absolutely sure they have good power supplies and good distribution? SuperMicro has had issues with this in the past... and that can cause timeouts...
Comment 42 Wayne Willcox 2020-04-13 18:44:47 UTC
Absolutely sure? No. Having said that, I have seen the problem on 2 different systems with 2 different power supplies. Additionally, the power supplies have been fine and have had no other problems, AND the other drives in these systems, which are not WD Blue, don't have this problem.

It is possible, I suppose, and I could swap the power supply, but at this point I am going to swap the drive for a WD Red instead. If that works, it will rule out the power supply.
Comment 43 Daniel Bell 2020-04-13 18:57:34 UTC
(In reply to sec from comment #38)

I'm sorry I didn't respond sooner, but for my issue, sec was 100% right: a cold power cycle, as bizarre as that sounded to me, along with the loader.conf tweaks mentioned elsewhere, completely solved my problem despite having the problematic drive/controller combo here. Reboots after editing loader.conf did not help, which is why I originally replied here.

As of today, the box in question has been up 60 days with zero errors, zero problems, and a healthy IO load—multiple VMs including Windows VMs, multiple active jails and database engines. I'm no longer planning to add a friendlier HBA unless something changes.

I realize this isn't solving everyone's problem, but I thought I should mention this worked for me.

% uname -a
FreeBSD wyn2018.office.wynnecap.com 12.1-RELEASE-p2 FreeBSD 12.1-RELEASE-p2 GENERIC  amd64
% uptime
 2:47PM  up 60 days, 23:45, 1 user, load averages: 0.29, 0.67, 0.76
% dmesg | grep DRDY || echo Groovy.
Groovy.
Comment 44 Warner Losh freebsd_committer freebsd_triage 2020-04-13 23:04:08 UTC
(In reply to Wayne Willcox from comment #42)

I ask because we have many thousands of machines deployed, and a thousand or so of those have intermittent power supply issues which means at any time we have a hundred or two machines that drop connections from time to time and a couple dozen that do it consistently... But there it's not a timeout, but the PHY goes down or we get some other signal from the drive that there's been trouble.

However, having said that, we should instrument the code that does that, and maybe look at improving 'recovery' operations in the face of a timeout to see if there are things we can do to 'wakeup' a drive that's otherwise stopped responding to commands.
Comment 45 Wayne Willcox 2020-04-13 23:31:54 UTC
(In reply to Warner Losh from comment #44)
One of the power supplies is a 600 W Corsair (I'm not going to take apart the other system that also had this issue). I will try a third power supply when I take this drive out of production service later this week and install it in a test rig.

I have LONG since stopped buying cheap power supplies and have not had a problem since with a power supply failing or dropping power. It could be that both of these 2 systems have power supplies failing in the same way, but that seems a bit of a stretch.

When I get the 3 TB WD Red on Thursday I will copy the Blue drive to the Red and install it. If the system stops failing, it is pretty reasonable to assume it isn't the power supply. Then I will test the Blue drive with the original install in a test rig (completely different power supply, motherboard and CPU) and see if it continues to fail.
Comment 46 Warner Losh freebsd_committer freebsd_triage 2020-04-14 01:40:48 UTC
I've re-read things, and I think everybody is hanging the 'timeout' issue on this bug report when really it looks like there are several different pathologies... Those that clear up after a power cycle, for example, are different from the ones that persist. Those that are cleared up by loader.conf entries likely need quirks of some type (maybe a new type). And those with no messages are Something Else [tm].

Not sure how to go about sorting it all out though. As it is now, it's hard to take action on anything specific.
Comment 47 sec 2020-04-14 07:55:29 UTC
(In reply to Warner Losh from comment #46)

Hi,
From my experience:
1) I didn't have this issue until I updated to 11.2, IIRC (that was almost a year ago)
2) First I replaced my drives with new ones - the issue was still there
3) Then I bought a new PSU - the issue was still there
4) A power cycle would solve the issue until load was put on the drives
5) After adding 'vfs.zfs.cache_flush_disable=1' and doing a power cycle the issue was fixed - no more timeouts in almost a year now
6) I'm pretty sure the old PSU and drives were OK - the old PSU is working in another machine without issues; the drives were RMA'd, so I can't tell
Comment 48 Warner Losh freebsd_committer freebsd_triage 2020-04-15 15:14:45 UTC
(In reply to sec from comment #47)
Thanks for the feedback. And was your issue a 'timeout', or were the drives just disappearing?
Comment 49 Wayne Willcox 2020-04-15 18:05:42 UTC
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: Timeout on slot 6 port 0
ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 00006617

I was watching for it to happen. When it did, the zroot filesystem was taken offline, causing the hang condition.
Comment 50 sec 2020-04-16 06:57:31 UTC
(In reply to Warner Losh from comment #48)
My errors were like this - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229745#c20

There were WRITE, READ and FLUSHCACHE48 errors. First there were timeout messages, then the system was "frozen" - I wasn't even able to log in over ssh, it was just dumping those errors on the console. Sometimes the drives went offline; a reboot brought them back online, but the errors were still there.

Only a power cycle made the errors go away, but after any load on the drives the same errors came back. The fix for me was "vfs.zfs.cache_flush_disable=1", power off, power on.

Anything more I can provide to help debug this issue?
Comment 51 Wayne Willcox 2020-04-16 11:02:46 UTC
OK, I have some new information in my case.
I commented out this line in /boot/loader.conf:
#vfs.zfs.cache_flush_disable="1"

I commented out this line in /boot/device.hints:
#hint.ahcich.0.sata_rev=2

so that I was back to the standard defaults.

Backed up the New WD Blue drive model WD30EZRZ using rsnapshot

installed the same FreeBSD 12.1 on a New WD RED drive:
ada0 at ahcich4 bus 0 scbus4 target 0 lun 0
ada0: <WDC WD30EFRX-68EUZN0 82.00A82> ACS-2 ATA SATA 3.x device
ada0: Serial Number WD-WCC4N3SJU005
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors)
ada0: quirks=0x1<4K>

restored the snapshot I created from the Blue drive onto the Red drive and rebooted.

I let it run off the Red drive, and for the first time it successfully ran overnight without /zroot being taken offline. It didn't log any timeout errors, and the workload was the same.

So 2 new WD Blue drives had this problem, and just switching to a new WD Red seems to have resolved it. I intend to monitor the Red drive and see if it continues to behave properly (I expect it will; I have a lot of Red drives in service). Additionally, I am going to do some more testing with these drives in a test rig.
Comment 52 Wayne Willcox 2020-04-16 12:45:47 UTC
Well, this is interesting: could these Western Digital drives also be using SMR? Might that explain the problem?

It seems Western Digital has been shipping SMR drives without telling us:
https://www.tomshardware.com/news/wd-fesses-up-some-red-hdds-use-slow-smr-tech
Comment 53 sec 2020-04-16 12:55:58 UTC
(In reply to Wayne Willcox from comment #52)
At the moment my drives are:
WDC WD1005FBYZ-01YCBB2 RR07 (WD Gold 1TB)
WDC WD1005FBYZ-01YCBB1 RR04 (WD Gold 1TB)

I had two with RR07 firmware, but I RMA'd one of the drives because I thought it had died, and got a new one from WD (they have a long warranty, yay).

Before that I was using WD5003-ABYZ (WD Re 500GB) drives - no issues with those, or maybe they just never got the chance to run a newer FreeBSD release.

2017-05-29 is the date I switched to those WD Golds, and they worked fine until January 2019, when I upgraded my machine to a newer release and the problems started.
Comment 54 Daniel Morante 2020-08-27 02:45:12 UTC
I am using FreeBSD 12.1-RELEASE amd64 with two WDS500G2B0B SSDs attached to an Ableconn PEXM2-122 Dual M.2 SATA SSD Controller. These are all brand-new, fresh-out-of-the-box components.

```
ahci0: <ASMedia ASM106x AHCI SATA controller> mem 0xfe6fe000-0xfe6fffff irq 16 at device 0.0 on pci4
ahci0: AHCI v1.31 with 12 6Gbps ports, Port Multiplier supported
ahci0: quirks=0xc00000<NOCCS,NOAUX>
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahcich6: <AHCI channel> at channel 6 on ahci0
ahcich7: <AHCI channel> at channel 7 on ahci0
ahcich8: <AHCI channel> at channel 8 on ahci0
ahcich9: <AHCI channel> at channel 9 on ahci0
ahcich10: <AHCI channel> at channel 10 on ahci0
ahcich11: <AHCI channel> at channel 11 on ahci0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS500G2B0B-00YS70 415000WD> ACS-4 ATA SATA 3.x device
ada0: Serial Number 20274C802827
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 476940MB (976773168 512 byte sectors)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WDS500G2B0B-00YS70 415000WD> ACS-4 ATA SATA 3.x device
ada1: Serial Number 20274C804009
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada1: Command Queueing enabled
ada1: 476940MB (976773168 512 byte sectors)
```

I created a mirrored pool and was able to copy/move a few files to it without any issues.

```
zpool create -m /mnt/internal internal mirror ada0 ada1
```

Once I try to copy a large number of files, I start to see hangs and errors like the ones described in this bug, and eventually the two drives get detached.

```
ahcich0: Timeout on slot 22 port 0
ahcich0: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 150 serr 00000000 cmd 0004c917
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Poll timeout on slot 23 port 0
ahcich0: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd 150 serr 00000000 cmd 0004c917
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 24 port 0
...
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS500G2B0B-00YS70 415000WD> s/n 20274C802827 detached
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WDS500G2B0B-00YS70 415000WD> s/n 20274C804009 detached
```

The ZFS pool gets completely trashed:

```
zpool status
  pool: internal
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        internal                  UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0     0     0
            12901058162764697962  REMOVED      0     0     0  was /dev/ada0
            16801836340799291794  REMOVED      0     0     0  was /dev/ada1

errors: 3 data errors, use '-v' for a list
```

Attempting to reboot the system just hangs indefinitely.
Comment 55 Nicolas Richeton 2020-09-22 06:39:18 UTC
Hello, 

On FreeBSD 11.3-RELEASE-p11 (FreeNAS) :

I got the same issue with an ASMedia ASM1062 AHCI SATA controller (2 SATA ports; PCIe x1 card in a PCIe 2.0 slot).
2 drives are connected, configured as a ZFS mirror pool. I hit this issue when switching from 2 HDDs to 2 SSDs.

It seems related to the volume of data flowing through the controller: 

- When 2 HDDs (SATA2) are used: everything is fine.
 
- When 2 SSDs (SATA3) are used: they are detected correctly, but when I start a copy or a zfs scrub, I get a lot of:

Sep 18 16:54:21 nas (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 70 f1 cc 40 02 00 00 00 00 00
Sep 18 16:54:21 nas (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 18 16:54:21 nas (ada1:ahcich1:0:0:0): Retrying command
Sep 18 16:54:51 nas ahcich1: Timeout on slot 5 port 0
Sep 18 16:54:51 nas ahcich1: is 00000000 cs 00000000 ss 00000020 rs 00000020 tfd 40 serr 00000000 cmd 0004c517

Sep 18 16:54:51 nas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 30 d0 6d e0 40 4f 00 00 00 00 00
Sep 18 16:54:51 nas (ada1:ahcich1:0:0:0): CAM status: Command timeout
Sep 18 16:54:51 nas (ada1:ahcich1:0:0:0): Retrying command
Sep 18 16:55:21 nas ahcich1: Timeout on slot 14 port 0

And sometimes : 
Sep 13 14:10:06 nas (aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Sep 13 14:10:06 nas (aprobe0:ahcich1:0:0:0): CAM status: Command timeout
Sep 13 14:10:06 nas (aprobe0:ahcich1:0:0:0): Retrying command

In extreme cases I lose the drive, the pool goes degraded, and I have to reboot to bring the disk back.

There are more messages when the IO speed is high: a copy over the network gives some messages; a scrub gives a lot of messages and its progress pauses (and it can result in the loss of one disk).
It can also affect the other drive (ada0), but there are more errors on the second one (ada1).

I changed the SSDs => same issue with a SAMSUNG 860 EVO and a Crucial MX500.
I changed the cables => same issue.
The BIOS is up to date (HP MicroServer Gen8).

- When I plug one of the SSDs into a SATA2-only port on the motherboard (so 1 SSD on SATA2 and 1 SSD on SATA3 on the ASMedia controller) => everything is fine when I do a scrub, probably because ZFS waits for the slower drive => the data flow is smaller.

- When I do a ZFS replace, HDD + HDD->SSD (3 drives connected during the replace - 2 on the ASMedia controller and 1 on the motherboard): there are no errors (the HDDs limit the speed). Issues start when the pool is SSD-only, on SATA3.

Config :
Sep 13 15:11:29 nas ahci0: <ASMedia ASM1062 AHCI SATA controller> port 0x5000-0x5007,0x5008-0x500b,0x5010-0x5017,0x5018-0x501b,0x5020-0x503f mem 0xfbff0000-0xfbff01ff irq 16 at device 0.0 on pci1
Sep 13 15:11:29 nas ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported
Sep 13 15:11:29 nas ahci0: quirks=0xc00000<NOCCS,NOAUX>
Sep 13 15:11:29 nas ahcich0: <AHCI channel> at channel 0 on ahci0
Sep 13 15:11:29 nas ahcich1: <AHCI channel> at channel 1 on ahci0
Sep 13 15:11:29 nas ahci1: <Intel Cougar Point (RAID) AHCI SATA controller> port 0x1000-0x1007,0x1008-0x100b,0x1010-0x1017,0x1018-0x101b,0x1020-0x103f mem 0xfacd0000-0xfacd07ff irq 17 at device 31.2 on pci0
Sep 13 15:11:29 nas ahci1: AHCI v1.30 with 6 6Gbps ports, Port Multiplier supported
Sep 13 15:11:29 nas ahcich2: <AHCI channel> at channel 0 on ahci1
Sep 13 15:11:29 nas ahcich3: <AHCI channel> at channel 1 on ahci1
Sep 13 15:11:29 nas ahcich4: <AHCI channel> at channel 2 on ahci1
Sep 13 15:11:29 nas ahcich5: <AHCI channel> at channel 3 on ahci1
Sep 13 15:11:29 nas ahcich6: <AHCI channel> at channel 4 on ahci1
Sep 13 15:11:29 nas ahcich7: <AHCI channel> at channel 5 on ahci1
Sep 13 15:11:29 nas ahciem0: <AHCI enclosure management bridge> on ahci1
Comment 56 Wayne Willcox 2020-11-19 20:33:28 UTC
OK, so it looks like I solved this for my system. I did it by reducing the number of command tags.

The default was 255 and I started by just cutting that in half. I kept dividing by 2 until I got to 25.  At 50 I only had the timeouts every few days.  At 25 I have not had an error in over a week.  I could possibly increase from 25 but I won't bother.  Performance seems fine and the errors have stopped.

You can do this with the camcontrol tags command
        camcontrol tags       [dev_id][generic args] [-N tags] [-q] [-v]

Here are the steps.
 camcontrol devlist
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 8 lun 0 (pass0,da0)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 9 lun 0 (pass1,da1)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 10 lun 0 (pass2,da2)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 11 lun 0 (pass3,da3)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 12 lun 0 (pass4,da4)
<IBM-XIV ST2000NM0001  A2 BC42>    at scbus0 target 13 lun 0 (pass5,da5)

 camcontrol tags da5   
(pass5:mps0:0:13:0): device openings: 25

Originally mine was at 255.  This can be done on a live system and doesn't require closing the disk and reopening it to take effect.  I just changed all 15 disks.
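
One possible way to apply this to every disk and have it reapplied after a reboot (as far as I can tell, camcontrol tags settings do not persist across reboots) is a small /etc/rc.local loop. This is only a sketch; the /dev/da glob and the value 25 are assumptions to adapt to your own system:

#!/bin/sh
# /etc/rc.local -- sketch: cap tagged openings on all da disks at boot
for dev in /dev/da[0-9]*; do
    case "$dev" in
        *p[0-9]*|*s[0-9]*) continue ;;   # skip partition nodes, keep whole disks
    esac
    camcontrol tags "${dev#/dev/}" -N 25
done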
Comment 57 sec 2020-12-30 12:38:23 UTC
(In reply to Wayne Willcox from comment #56)
My drives are at 32 tagged openings at the moment, together with vfs.zfs.cache_flush_disable=1. Without that tweak, I had those errors.

Is there any update on this issue? I'm still on 11.4 and would like to upgrade to 12.x, but I would also like to keep my ZFS pool working :)
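
For reference, that tweak is the loader tunable below (legacy ZFS in 11.x/12.x only; skipping cache flushes trades away some safety on power loss, so treat this as a test knob rather than a fix):

# /boot/loader.conf
vfs.zfs.cache_flush_disable=1   # legacy ZFS: do not send cache-flush commands to the disks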
Comment 58 ekoort 2021-12-08 08:15:23 UTC
Hi,

I have the same problem with a Pine RockPro64 with 2 SATA 1TB Seagate HDDs.
One time the kernel panicked when there was a bit more writing (over NFS) to ZFS.

root@pine:~ # uname -a
FreeBSD pine 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:38:07 UTC 2021     root@arm64-builder.daemonology.net:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC  arm64

Dec  6 00:51:47 pine kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 48 40 a3 0c 40 1f 00 00 00 00 00
Dec  6 00:51:47 pine kernel: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  6 00:51:47 pine kernel: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Dec  6 00:51:47 pine kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 68 55 28 40 3c 00 00 08 00 00
Dec  6 00:51:47 pine kernel: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Dec  6 00:51:47 pine kernel: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Comment 59 ekoort 2021-12-13 04:08:55 UTC
Hello again!

In my case, setting vfs.zfs.cache_flush_disable="1" did not help.

Instead, this seems to work:

root@pine:~ # camcontrol tags ada0
(pass0:ahcich0:0:0:0): device openings: 32
root@pine:~ # camcontrol tags ada0 -N 25
(pass0:ahcich0:0:0:0): tagged openings now 25
(pass0:ahcich0:0:0:0): device openings: 25

root@pine:~ # camcontrol tags ada1
(pass1:ahcich1:0:0:0): device openings: 32
root@pine:~ # camcontrol tags ada1 -N 25
(pass1:ahcich1:0:0:0): tagged openings now 25
(pass1:ahcich1:0:0:0): device openings: 25

The Pine RockPro64 with 2 Seagate SATA disks survived last night's backup jobs without any errors, and no errors have been logged since my previous post about 5 days ago.

Thanks to Wayne (comment #56) for this.
Comment 60 Chris 2022-07-04 03:18:26 UTC
This issue has returned for me on 13.1.

Adding vfs.zfs.cache_flush_disable=1 to /boot/loader.conf no longer works.

Any other insights on this?
Comment 61 Daniel Menelkir 2022-07-22 11:38:25 UTC
I can confirm that. I was running xz on a 2 GB file and suddenly the device gave up.
The device is fine; smartctl with a long test is also fine. The device became inaccessible and only a hard reboot fixes it.
Here's the log I had:

ahcich1: Timeout on slot 2 port 0
ahcich1: is 00000000 cs 01fffff0 ss 01fffffc rs 01fffffc tfd 40 serr 00000000 cmd 0000c317
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e0 00 3c 10 40 1b 00 00 05 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 25 port 0
ahcich1: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd 80 serr 00000000 cmd 0000d917
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Retrying command, 0 more tries remain
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Comment 62 Warner Losh freebsd_committer freebsd_triage 2022-07-22 15:13:28 UTC
This bug should be closed. Too many different, likely unrelated symptoms have been co-located in this bug. There's clearly some bad hardware here. There are clearly also some issues with ahci itself that appear to my eye to be 'quirky' resets of the bridge (though w/o traces it will be hard to know). Things that fail to reset on reboot are different from transient errors, which are different again from WRITE errors with codes that aren't timeouts. It's hard to sort out all the issues here. A number of other bugs should be filed to take its place, for real bugs that can be reproduced (because otherwise we won't know whether any changes fix the problem or not).

By and large, if a drive hangs, it is to blame. If a drive throws write errors, it's always the drive (though the CRC errors might be cabling issues).

It makes sense that reducing the write load makes the problem 'disappear': a full queue puts a much higher instantaneous load on the drive than it would otherwise see. A drive that has marginal data can cope with retries on a few writes, but not with retries on lots of writes all at once; the latter can overwhelm some drives' firmware.

It's also possible that error recovery could be better in ahci, since we do recovery things when we get a timeout. However, those improvements can be hard to roll out and test because they need real hardware that's basically good but sometimes misbehaves, and most operations will retire / discard such hardware.
Comment 63 Daniel Menelkir 2022-07-22 20:50:22 UTC
(In reply to Warner Losh from comment #62)
This issue only happens specifically with FreeBSD 13.1 and with ZFS.
So S.M.A.R.T. is wrong, everything else I throw at the disk doesn't trigger this issue, it only happens on 13.1 with ZFS, and yet the issue is hardware? Then I'd rather use my disk with something else that doesn't trigger this "hardware issue". Feel free to close this PR.
Comment 64 Chris 2022-07-22 21:21:20 UTC
This is a hardware issue. I have 5 Western Digital Ultrastar Data Center hard drives that show up as HGST with camcontrol. The errors popped up when I upgraded to 12.2 and stopped after adding vfs.zfs.cache_flush_disable=1 to /boot/loader.conf. The errors returned when I updated to 13.1.

I replaced the 5 drives and I have a stable system again. If I have a chance I will test the drives on another system just to see if it is a compatibility issue with the SATA controller.

I agree with Warner Losh, since the system feels more responsive after replacing the drives and removing vfs.zfs.cache_flush_disable=1 from /boot/loader.conf.

This is a hardware issue. Possibly a compatibility issue.
Comment 65 Andriy Gapon freebsd_committer freebsd_triage 2022-07-26 09:03:35 UTC
It could be a combination of what FreeBSD 13 or, more specifically, OpenZFS/FreeBSD does (legitimately) and a bug in drives' firmware or even hardware.

E.g., I have a mirror of two SSDs from different vendors.  One is 100% stable, the other does the disappearing act at random intervals.  Actually, I recently replaced the problematic SSD with a new one from yet another vendor, and it's 100% stable.

If I had to make a guess, then I'd bet on the problem being related to TRIM.
But I am not inclined to put my time into a deep investigation of such an "exotic" problem.
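
If someone does want to test the TRIM guess, a minimal sketch (the pool name "tank" is a placeholder; the autotrim property only exists on the OpenZFS-based releases and defaults to off, so a manual "zpool trim" run is another way to provoke it):

zpool get autotrim tank       # check whether automatic TRIM is enabled
zpool set autotrim=off tank   # disable it while testing the theory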
Comment 66 Daniel Menelkir 2022-07-27 10:03:31 UTC
(In reply to Chris from comment #64)
Try smartctl from a USB stick (the Arch Linux ISO has smartmontools, AFAIK), or even Hiren's BootCD (a mini Windows with some tools, if you're not familiar with it).
(In reply to Andriy Gapon from comment #65)
I don't think it's TRIM, since I see the issue with two HDDs of different brands/speeds.
Comment 67 Chris 2022-07-27 12:06:18 UTC
In reply to Daniel (comment #66)

Smartmontools are available on FreeBSD. The drives show no issues with smartmontools. I believe there is a compatibility issue between the drives and the motherboard. I updated the BIOS and firmware on the server, which didn't help.
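
For reference, a typical smartmontools check on FreeBSD looks roughly like this (ada0 is a placeholder device name):

smartctl -a /dev/ada0            # overall health, attributes and error log
smartctl -t long /dev/ada0       # start a long offline self-test
smartctl -l selftest /dev/ada0   # review the self-test results once it finishes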
Comment 68 Daniel Menelkir 2022-07-27 13:21:23 UTC
(In reply to Chris from comment #67)
I know; I meant to try different platforms.
And to mention, this is a specific issue with 13.1+ZFS. I don't think it's the motherboard itself, since the issue only happens (for now) on FreeBSD 13.1 with ZFS, and I've had it on two different machines. If I use the disk for anything else, this doesn't happen.
Comment 69 Chris 2022-07-27 14:37:03 UTC
(In reply to Daniel Menelkir from comment #68)
The issue happens with 12 as well and requires vfs.zfs.cache_flush_disable=1 and a power cycle to clear. 

I don't think it's just a ZFS issue, since a reboot alone doesn't clear it. The system runs great with different disks. Using another OS or a different file system isn't an option for me, and I don't have another system to swap in so I can continue running tests.
Comment 70 Daniel Menelkir 2022-07-28 00:52:47 UTC
(In reply to Chris from comment #69)
A zpool error isn't supposed to clear with a reboot. The test I've mentioned can be done with a live USB.
Comment 71 Chris 2022-07-28 02:51:08 UTC
(In reply to Daniel Menelkir from comment #70)
I didn't get any zpool errors. The system was hanging with CAM status timeouts.

Since I changed the disks, the performance of the system has improved considerably.
Comment 72 Jonathan Reynolds 2022-11-03 01:36:32 UTC
Please close. It has nothing to do with FreeBSD. See also bug 201194, which is the same issue.

Nov  2 22:46:18 bsd kernel: ahcich0: Timeout on slot 7 port 0
Nov  2 22:46:18 bsd kernel: ahcich0: is 00000000 cs 00000000 ss 000001f8 rs 000001f8 tfd 40 serr 00000000 cmd 0000c817
Nov  2 22:46:18 bsd kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 18 30 76 01 40 65 00 00 00 00 00
Nov  2 22:46:18 bsd kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Nov  2 22:46:18 bsd kernel: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain

I have four SATA slots on my motherboard and 2 disks. Switching both SATA cables to the empty slots seems to have solved the issue.
I was unable to reproduce the error after running installworld 15 times, which usually triggered the above error after 2-3 runs.

For most users this is a hardware issue.
Comment 73 Kevin Zheng 2024-01-31 22:39:40 UTC
Hi there, sorry to resurrect a long-dormant thread. I ran into a similar issue:

ahcich3: Timeout on slot 7 port 0
ahcich3: is 00000000 cs 00000080 ss 00000000 rs 00000080 tfd c0 serr 00000000 cmd 0000c717
(ada1:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich3:0:0:0): CAM status: Command timeout
(ada1:ahcich3:0:0:0): Retrying command, 0 more tries remain
ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich3: Timeout on slot 8 port 0
ahcich3: is 00000000 cs 00000000 ss 00000000 rs 00000100 tfd 150 serr 00000000 cmd 0000c817
(aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich3:0:0:0): CAM status: Command timeout
(aprobe0:ahcich3:0:0:0): Retrying command, 0 more tries remain
ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
...
GEOM_ELI: Device ada1p4.eli destroyed.
GEOM_ELI: Detached ada1p4.eli on last close.
(ada1:ahcich3:0:0:0): Periph destroyed

I'm running 14.0-RELEASE-p3 GENERIC amd64 on a Dell consumer-grade motherboard. The HDD that is timing out is a Seagate Barracuda 7200.12.

I also suspect that there is a hardware issue, but the reason I'm reporting this is that I have this set up as a ZFS mirror on top of GELI devices, where I expect FreeBSD not to block all disk I/O waiting for the disk that is timing out. Instead, what happens is that the whole system feels like it's hanging (probably on I/O) until the device that is timing out is eventually detached.

Perhaps this timeout needs to be made much shorter, or a timeout on a single SATA device should not stall all I/O to the zpool? (I'm not even sure which subsystem is stalling; is it at the ZFS/GELI/CAM level?)

Thanks for your consideration.
Comment 74 Warner Losh freebsd_committer freebsd_triage 2024-02-01 00:18:18 UTC
(In reply to Kevin Zheng from comment #73)
CAM tried to send a command to the drive and it took too long to reply. Either sync cache sometimes needs a lot more than 30s, or the drive has gone out to lunch. Both are hardware problems, but maybe the former needs a knob to increase the timeout, though that would delay other I/O a really long time.... The latter is likely the drive's firmware crashing, which is kinda hard to work around: did that wipe / fail to commit the transactions still in its cache?
Comment 75 Kevin Zheng 2024-02-01 21:28:19 UTC
(In reply to Warner Losh from comment #74)
I concur that there is definitely a hardware problem somewhere that I need to investigate.

The issue that I'm writing about is the system behavior. It seemed that all I/O (or maybe just writes?) to the ZFS pool was stalled waiting for the disk to time out and reattach, despite the fact that I have other working mirror devices. It seems to me that a hardware issue with one disk shouldn't stall the whole pool.

I'm not actually sure whether this problem is happening at the ZFS level, in CAM, or in the SATA subsystem; if it happens again, what debugging steps would determine the cause?
Comment 76 Warner Losh freebsd_committer freebsd_triage 2024-02-08 17:43:27 UTC
(In reply to Kevin Zheng from comment #75)
>The issue that I'm writing about is the system behavior. It seemed that all I/O (or maybe just writes?) to the ZFS pool was stalled waiting for the disk to time out and reattach, despite the fact that I have other working mirror devices. It seems to me that a hardware issue with one disk shouldn't stall the whole pool.
>
> I'm not actually sure whether this problem is happening at the ZFS level, in CAM, or in the SATA subsystem; if it happens again, what debugging steps would determine the cause?

Yes. So there are a few things going on here. First, ZFS has ordering constraints that it undertakes to enforce by scheduling some I/O (especially writes) only after the I/O it depends on has completed. By this means the ZFS code ensures that its log is always in a reasonable state. That means that if some I/O hangs for a long period of time (more than a second or five), then the I/O that depends on it is delayed as well. This can have the effect of causing processes to hang waiting for that I/O to complete.

So while I'd agree that one misbehaving disk shouldn't hang the pool, I can see how it might. How can ZFS know what to schedule, consistent with its desire to keep the log consistent, if any disk could suddenly stop writing? Now, I'm not enough of a ZFS expert to know if one of its goals is to cope with this situation. I'd check with the ZFS developers to see if they'd expect ZFS to not stall if one disk stalls for a long time. ZFS does try to pipeline its stream of I/Os as much as possible, and one stalling disk interferes with that pipeline.

One way to mitigate this, however, could be to set the timeout down from 30s to something smaller like 3-5s (SSD) or 8-12s (HDD), and the number of retries down to 2 (it has to be greater than 1 for most controllers due to deficiencies in their recovery protocols, which are kinda hard to fix). That could bring the hangs down from 90s to more like 5-10s (SSD) or 15-20s (HDD), which would be less noticeable in a wide range of workloads (though certainly not all).
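
For what it's worth, a minimal sketch of that tuning as loader tunables, assuming the knobs documented in ada(4); the values are only examples for an HDD pool, not recommendations:

# /boot/loader.conf -- sketch only
kern.cam.ada.default_timeout=10   # seconds per command attempt (default 30)
kern.cam.ada.retry_count=2        # retries per command (default 4)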

There may be ZFS-specific tunings that you might be able to try if this happens often. Maybe smaller (or, paradoxically, larger) I/Os by creating the pool with a smaller logical block size (ashift). This might help align the I/O better to the physical NAND blocks (hence maybe bigger is needed). Also, partitioning the drive so that it starts on a good LBA boundary can help (I often keep 1MB at the start of disks unused because that's still < physical block sizes, but also a trivial amount... I expect to bump this to 8MB or 16MB in the future). That might help keep whatever bug / pathology in the drive that leads to the hangs from occurring (though there's no guarantee: maybe it's due to a bug in the firmware that's impossible to avoid).
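
A hedged sketch of the alignment/ashift suggestion above (ada1 and "tank" are placeholder names; the 1MB alignment and ashift of 12 are just examples, not a verified fix):

sysctl vfs.zfs.min_auto_ashift=12        # new vdevs use at least 4K logical blocks
gpart create -s gpt ada1                 # fresh GPT on the replacement disk
gpart add -t freebsd-zfs -a 1m ada1      # align the partition start and size to 1MB
zpool create tank mirror ada0p1 ada1p1   # example pool layout on the aligned partitions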