Bug 229745 - ahcich: CAM status: Command timeout
Summary: ahcich: CAM status: Command timeout
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-12 22:35 UTC by Alexey
Modified: 2019-08-08 23:49 UTC

See Also:


Attachments
Log messages in sequence for the problem (11.05 KB, text/plain)
2019-02-27 22:42 UTC, dave

Description Alexey 2018-07-12 22:35:01 UTC
Hello!

We have several Supermicro servers based on the X11SSH-F.
All servers were installed half a year ago and ran FreeBSD 11.1. Each server has 4 HGST HUS722T1TALA604 HDDs.
All of them worked fine during that time, with half a year of uptime.
Recently the servers were upgraded to FreeBSD 11.2 (self-built 11.2-STABLE r335679 with default make.conf, src.conf, and GENERIC).

After some time (it varies, from 2 hours to 7 days), one or more disks started timing out:

Jul 13 00:56:24 mrr32 kernel: ahcich2: Timeout on slot 17 port 0
Jul 13 00:56:24 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 00060000 rs 00060000 tfd 40 serr 00000000 cmd 0004d217
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 ca 22 23 40 06 00 00 00 00 00
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 00:58:16 srv32 kernel: ahcich2: Timeout on slot 26 port 0
Jul 13 00:58:16 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 04000000 rs 04000000 tfd 40 serr 00000000 cmd 0004da17
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e0 8a cc c6 40 18 00 00 00 00 00
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 01:01:46 srv32 kernel: ahcich2: Timeout on slot 18 port 0
Jul 13 01:01:46 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 00040000 rs 00040000 tfd 40 serr 00000000 cmd 0004d217
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 2a 2b 23 40 06 00 00 00 00 00
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 01:07:12 srv32 kernel: ahcich0: Timeout on slot 23 port 0
Jul 13 01:07:12 srv32 kernel: ahcich0: is 00000000 cs 00000000 ss 00800000 rs 00800000 tfd 40 serr 00000000 cmd 0004d717
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 18 62 f5 c6 40 18 00 00 00 00 00
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jul 13 01:07:43 srv32 kernel: ahcich0: Timeout on slot 2 port 0
Jul 13 01:07:43 srv32 kernel: ahcich0: is 00000000 cs 00000000 ss 00000004 rs 00000004 tfd 40 serr 00000000 cmd 0004c217
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 62 12 7b 40 06 00 00 00 00 00
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): Retrying command
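
The `ss` value in each register dump can be read as a bitmask of command slots. This assumes, as in the FreeBSD ahci(4) driver source, that `ss` is the port's SATA Active register and `rs` is the driver's own record of running slots. A minimal Python sketch, using a hypothetical helper name `active_slots`, that lists the set bits:

```python
def active_slots(mask_hex):
    """Return the slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [slot for slot in range(32) if mask & (1 << slot)]

# Masks copied from the "ss" column of the log above.
for mask in ("00060000", "04000000", "00040000", "00800000", "00000004"):
    print(mask, active_slots(mask))
```

In each dump above, the slot named in the "Timeout on slot N" line is one of the set bits.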

A reboot (/sbin/shutdown -r or /sbin/reboot) does not solve the problem; the disks still time out after boot. Only a power off / power on solves the problem for some time, and after a while the timeouts start again.

The servers were updated to the latest BIOS available from Supermicro. No change.

ahci0: <Intel Sunrise Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xdf310000-0xdf311fff,0xdf31e000-0xdf31e0ff,0xdf31d000-0xdf31d7ff irq 16 at device 23.0 on pci0
ahci0: AHCI v1.31 with 8 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahcich6: <AHCI channel> at channel 6 on ahci0
ahcich7: <AHCI channel> at channel 7 on ahci0

ses0 at ahciem0 bus 0 scbus8 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)


ahci0@pci0:0:23:0:      class=0x010601 card=0x088415d9 chip=0xa1028086 rev=0x31 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Sunrise Point-H SATA controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA


We use ZFS on all servers; some are raidz1, some RAID-10, with the same results.

We used to run smartd on all servers. I tried disabling smartd; it made no apparent difference.

We have already upgraded the zpools to the new feature flags, so those features would have to be removed before downgrading back to 11.1.
Comment 1 Alexey 2018-07-12 22:50:30 UTC
We also have some recently installed Supermicro X10DRW-i servers with the same problem. These servers were installed with 11.2 from the start, so we have no statistics for 11.1.

ahci0: <Intel Wellsburg AHCI SATA controller> port 0x70b0-0x70b7,0x70a0-0x70a3,0x7090-0x7097,0x7080-0x7083,0x7000-0x701f mem 0xc7237000-0xc72377ff irq 16 at device 17.4 numa-domain 0 on pci2
ahci0: AHCI v1.30 with 4 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0

(There are two SATA controllers on the motherboard. After the problem was detected, the second one was disabled in the BIOS; it was the same Intel Wellsburg AHCI SATA controller.)

ses0 at ahciem0 bus 0 scbus4 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)


Here we have lost a disk twice. A power off/on brings the disk back, but the timeouts return after a while.

We have already upgraded the BIOS to the latest version here as well.
Comment 2 skillcoder 2018-07-14 14:28:53 UTC
I have the same problem with this disk:
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Firmware Version: RAGNWA05
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)

Directly after upgrading FreeBSD from 11.1-RELEASE-p4 to 11.2-RELEASE, I have this in /var/log/messages:

Jul 14 17:04:54 skillcoder kernel: ahcich2: Timeout on slot 21 port 0
Jul 14 17:04:54 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 0f200000 rs 0f200000 tfd 40 serr 00000000 cmd 0000db17
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 98 2a 7c 40 05 00 00 00 00 00
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:05:25 skillcoder kernel: ahcich2: Timeout on slot 31 port 0
Jul 14 17:05:25 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss e000000f rs e000000f tfd 40 serr 00000000 cmd 0000c317
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 68 fa 40 14 00 00 01 00 00
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:06:12 skillcoder kernel: ahcich2: Timeout on slot 24 port 0
Jul 14 17:06:12 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 19000000 rs 19000000 tfd 40 serr 00000000 cmd 0000de17
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1d 6c 40 05 00 00 00 00 00
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:06:43 skillcoder kernel: ahcich2: Timeout on slot 14 port 0
Jul 14 17:06:43 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 000fc000 rs 000fc000 tfd 40 serr 00000000 cmd 0000d317
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 60 62 2c 40 14 00 00 01 00 00
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:07:35 skillcoder kernel: ahcich2: Timeout on slot 14 port 0
Jul 14 17:07:35 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 007fc000 rs 007fc000 tfd 40 serr 00000000 cmd 0000d617
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 8b fe 40 14 00 00 01 00 00
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:08:08 skillcoder kernel: ahcich2: Timeout on slot 10 port 0
Jul 14 17:08:08 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 00007c00 rs 00007c00 tfd 40 serr 00000000 cmd 0000ce17
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 70 7e 40 40 14 00 00 01 00 00
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:08:45 skillcoder kernel: ahcich2: Timeout on slot 10 port 0
Jul 14 17:08:45 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 000017c0 rs 000017c0 tfd 40 serr 00000000 cmd 0000c917
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 f0 0d 5b 40 1e 00 00 00 00 00
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:09:19 skillcoder kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 5028, size: 12288


I also have another server with the same disk behind a hardware RAID controller (HP Smart Array P410).
After the upgrade from 11.0-RELEASE-p9 to 11.2-RELEASE, that disk is working fine, but it is behind hardware RAID, so FreeBSD does not see the disk directly.

HELP!
Comment 3 Alexey 2018-07-14 23:14:14 UTC
Yes, it is possible that the problem is related to the disk. We have 8 Supermicro X10DRW-i servers: 4 with HGST HUS722T1TALA604 and another 4 with TOSHIBA DT01ACA100 MS2OA750 HDDs. All of them were installed about half a year ago with FreeBSD 11.1-STABLE, and all of them worked fine. We have now upgraded two servers of each disk type to 11.2. The servers that use HGST disks have the problem, and the Toshiba-based servers still do not. I'll try replacing the disks with a different model. But if the problem is in the disk itself, why does it work under 11.1?

If we compare the camcontrol identify output under 11.1 and 11.2 for the same model, we see no differences (except that 11.2 adds a new line, "Zoned-Device Commands no").

protocol              ATA/ATAPI-10 SATA 3.x
device model          HGST HUS722T1TALA604
firmware revision     RAGNWA07
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    yes
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         yes      yes
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      no

The Toshiba disk looks different:

pass0: <TOSHIBA DT01ACA100 MS2OA750> ATA8-ACS SATA 3.x device
pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 3.x
device model          TOSHIBA DT01ACA100
firmware revision     MS2OA750
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         no       no
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      1953525168/1953525168
HPA - Security                 no
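
The two feature tables above can be compared mechanically rather than by eye. The following is a minimal Python sketch, not part of camcontrol: the helper names `parse_features` and `diff_features` are hypothetical, and it assumes every feature row has a yes/no "Support" column directly after the feature name, as in the output pasted above.

```python
import re

# A feature row: name, whitespace, then columns starting with yes/no.
ROW_RE = re.compile(r"(.+?)\s+((?:yes|no)\b.*)$")

def parse_features(text):
    """Map feature name -> list of the remaining columns."""
    features = {}
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            features[m.group(1)] = m.group(2).split()
    return features

def diff_features(a, b):
    """Return {name: (columns_a, columns_b)} for rows that differ."""
    fa, fb = parse_features(a), parse_features(b)
    return {n: (fa.get(n), fb.get(n))
            for n in sorted(set(fa) | set(fb)) if fa.get(n) != fb.get(n)}

# Three rows copied from each table above, as a short sample.
hgst = """\
Receive & Send FPDMA Queued    yes
SMART                          yes      yes
unload                         yes      yes
"""
toshiba = """\
Receive & Send FPDMA Queued    no
SMART                          yes      yes
unload                         no       no
"""
for name, (x, y) in diff_features(hgst, toshiba).items():
    print(name, x, y)
```

Fed the full tables, this would also report the rows that exist on only one drive, with `None` for the missing side.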
Comment 4 skillcoder 2018-07-25 19:50:33 UTC
The problem is still present.
This problem may be related to bug #224536:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224536
Comment 5 Alexey 2018-08-01 21:17:07 UTC
Yes, it looks like vfs.zfs.cache_flush_disable=1 hides the problem. (At least the servers with a raidz1 ZFS pool, which used to start timing out after no more than 2 days of uptime, now have 6 days of uptime and no timeouts on the console.)
Comment 6 Alexey 2018-08-01 21:48:35 UTC
(In reply to Alexey from comment #5)
But the problem is, if we check the spec for the HGST HUS722T1TALA604 drive, https://www.hgst.com/sites/default/files/resources/Ultrastar-7K2-EN-US-DS.pdf, I do not see that the drive uses "Shingled magnetic recording". (1TB is not a large capacity for a 3.5" drive, so it should not need any such techniques.)

And if we replace the disks (one by one, all four disks in the zpool) from HGST HUS722T1TALA604 to something different, TOSHIBA DT01ACA100 or TOSHIBA MG03ACA100 in our case, it solves the problem without disabling cache flushes (vfs.zfs.cache_flush_disable=0).
Comment 7 skillcoder 2018-08-01 22:33:57 UTC
Unfortunately, this workaround did not help me.

I added "vfs.zfs.cache_flush_disable=1" to "/boot/loader.conf" and rebooted.
I still have the same log after boot:
Aug  2 01:30:32 skillcoder kernel: ahcich2: Timeout on slot 12 port 0
Aug  2 01:30:32 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 001ff000 rs 001ff000 tfd 40 serr 00000000 cmd 0
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e8 70 e5 07 40 5d 00 00 00 00 00
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command
Aug  2 01:31:08 skillcoder kernel: ahcich2: Timeout on slot 15 port 0
Aug  2 01:31:08 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 007f8000 rs 007f8000 tfd 40 serr 00000000 cmd 0
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e8 70 1e 08 40 5d 00 00 00 00 00
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command


The problem is still present with:
# sysctl vfs.zfs.cache_flush_disable                                                       
vfs.zfs.cache_flush_disable: 1
Comment 8 Alexey 2018-08-01 22:37:13 UTC
Power off, or a simple /sbin/reboot? You will need a power off + power on, not reboot or shutdown -r, if the timeouts have already started.
Comment 9 skillcoder 2018-08-02 17:15:48 UTC
Just a simple shutdown -r now.
But I then tried shutdown -p now (with "vfs.zfs.cache_flush_disable=1" in "/boot/loader.conf") and switched the power off.
Now uptime is over 18 hours with not a single hang, and the log has no "ahcichX: Timeout on slot XX port 0" messages.
BIG THX! But how is that possible?
Comment 10 cryx-freebsd 2018-09-04 08:17:10 UTC
Affects me too. FreeBSD 11.2-RELEASE on a SuperMicro X11SSL-F with Intel Sunrise Point AHCI SATA controller and HUS722T2TALA604 disks. The vfs.zfs.cache_flush_disable=1 workaround also bypasses the problem after a power-off.
Comment 11 cryx-freebsd 2018-09-04 13:07:07 UTC
(In reply to cryx-freebsd from comment #10)

Setting kern.cam.ada.write_cache=0 in loader.conf and doing a power-cycle makes the problem go away, with the obvious hits on IO performance.

IMHO this shows that there is indeed a problem with the disks' write-cache handling.
Comment 12 samm 2018-10-23 05:54:25 UTC
Same applies to me: 2 old 1TB drives were replaced by HGST HUS722T1TALA604/RAGNWA07 and the problem started:

(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 02 af 63 40 37 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada0:ahcich0:0:0:0): RES: 41 10 02 af 63 00 37 00 00 00 00
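
The ACB bytes can be unpacked to see which LBA and how many sectors the failed command covered. The Python sketch below is illustrative only and `decode_acb` is a hypothetical helper name. It assumes the byte order that CAM prints (command, features, LBA low/mid/high, device, the three LBA "exp" bytes, features_exp, sector count, sector count exp) and that FPDMA (NCQ) commands carry the transfer length in the features fields.

```python
def decode_acb(acb):
    """Decode a CAM 'ACB:' hex string for an FPDMA (NCQ) read or write."""
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    # 48-bit LBA: bytes 2-4 are the low half, bytes 6-8 the high half.
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    # NCQ commands carry the sector count in features / features_exp.
    sectors = b[1] | b[9] << 8
    return {"command": b[0], "lba": lba, "sectors": sectors}

# ACB copied from the error above.
info = decode_acb("61 08 02 af 63 40 37 00 00 00 00 00")
print(hex(info["command"]), hex(info["lba"]), info["sectors"])
```

Under these assumptions, the LBA bytes in the RES line (02 af 63 ... 37) match the LBA of the command that was issued.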
Comment 13 samm 2018-10-23 05:55:50 UTC
Also, from a SMART standpoint the drives look healthy - self-tests are passing fine, attributes are good, no UDMA CRC errors, etc.
Comment 14 samm 2018-10-23 06:03:56 UTC
P.S. We are also using

11.2-RELEASE-p4/amd64 on an X10SLM-F server:

ahci0: <Intel Lynx Point AHCI SATA controller> port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 0xf7232000-0xf72327ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
Comment 15 samm 2018-10-23 06:10:17 UTC
TBH, it makes me think that the drives themselves or the controller may have an issue with NCQ.
Comment 16 samm 2018-10-23 06:12:49 UTC
To confirm whether it is NCQ related or not, I decided to disable it on one of the 2 affected drives:

camcontrol negotiate ada0 -T disable
camcontrol reset ada0

The second affected drive will still use NCQ for now, to see if it makes any difference.
Comment 17 samm 2018-10-31 14:06:02 UTC
One update - after disabling NCQ on ada0 I do not see any problems with it, but ada2 is still failing. I am disabling NCQ on both for now.
Comment 18 samm 2018-11-05 20:23:20 UTC
One more update - after disabling NCQ on both affected drives things are going well - I do not see any errors anymore. So from my POV it is buggy drive firmware.
Comment 19 sec 2019-02-06 12:36:20 UTC
I am also having the same problem. Before 11.2, my ZFS mirror was working fine. After the upgrade, those CAM errors started to show up.
I replaced the PSU, replaced one of the drives, replaced cables - still the same. Checked SMART for the drives, ran short/long tests - the drives are fine.
Even did memtest :)

My observations:
- when the HDDs are connected directly to the motherboard - there are Timeout errors
- when the HDDs are connected to a PCIe SATA controller - there are unrecoverable CRC errors

I tried disabling NCQ and the cache - nothing helps.

The strange thing is, when the mirror is broken (there's only one drive connected) - everything is fine. Only when 2 drives are connected in the same mirror do the errors start to show up.

My drives are WD Gold 1TB:
1. WDC WD1005FBYZ-01YCBB2 RR07
2. WDC WD1005FBYZ-01YCBB1 RR04

Before, I had two RR07 drives working fine. After the upgrade, errors showed up, so I RMA'd one of them and got an RR04 - which didn't fix the error. Also, the problem shows up on only one of the drives in the mirror.

Right now I'm in the process of migrating data to an 11.1 ZFS pool, then I will downgrade my server back to releng/11.1 and check whether the pool works fine.

I also have two SAMSUNG HD642JJ 1AA01113 drives connected in another mirror - no issues with those.

I tried swapping cables, ports, etc. - the problem follows those drives, together.

I hope for some solution to this one, because it will block any upgrade to 12 :)
Comment 20 sec 2019-02-07 07:36:36 UTC
(In reply to sec from comment #19)
OK, so I tried to downgrade the pool to 11.1 - it didn't help.
Then I also started to get those errors with only one drive connected (which was fine before).
I also tried booting 12.0R - same issue.

The weird thing is that sometimes the drives are OK and resilver without any problems, then after some reads/writes, the timeouts start to show up.

In the end I tried something I think I had tried before. I added these to /boot/loader.conf and power cycled:
vfs.zfs.cache_flush_disable=1
kern.cam.ada.write_cache=0

Then I started a stress test - no errors.
Then I commented out the "write_cache" line - no errors.
Commented out the "cache_flush" line - errors back again (I think I had tried that one already and got errors, but this time it looks fine, fingers crossed).

So right now I'm running with cache_flush disabled - I hope this will get rid of those errors until a proper fix is done.

The errors I was getting were something like this:
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 b0 00 20 40 4c 00 00 01 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada2:ahcich2:0:0:0): RES: 41 10 b0 00 20 00 4c 00 00 00 00
(ada2:ahcich2:0:0:0): Retrying command

There were also READ and FLUSHCACHE48 errors - this depends on the load I was generating on the pool.

If needed I can provide more debug output.
Comment 21 dave 2019-02-27 22:42:32 UTC
Created attachment 202427
Log messages in sequence for the problem
Comment 22 dave 2019-02-27 22:45:52 UTC
Comment on attachment 202427
Log messages in sequence for the problem

Having the same problem here, just when I upgraded to FreeBSD 11.2-STABLE. I decided to attach my log output, this might have been a bad idea in retrospect. 

I also use smartctl regularly through icinga2. 

# camcontrol devlist
<WDC WD2005FBYZ-01YCBB3 RR09>      at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD200MFYYZ-01D45B0 01.01K01>  at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD2004FBYZ-01YCBB1 RR03>      at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD2004FBYZ-01YCBB1 RR03>      at scbus3 target 0 lun 0 (pass3,ada3)
<HL-DT-ST DVDRAM GH24NS95 RN01>    at scbus5 target 0 lun 0 (cd0,pass4)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus6 target 0 lun 0 (pass5,ses0)
Comment 23 dave 2019-02-28 03:52:25 UTC
I've taken smartctl out of the picture and this problem still occurs. I can trigger it consistently with high write load on the disk subsystem.

Isn't this the same bug as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201194 ?
Comment 24 Allan Jude freebsd_committer 2019-05-28 15:49:26 UTC
We are seeing the same problem across a large number of machines, using Supermicro X11SPW-TF and X10SRW-F motherboards.

The disks are:
WD1004FBYZ-01YCBB1
or
WD1005FBYZ-01YCBB2

And they just fall off the bus and won't come back (not even visible in the BIOS) without a hard power cycle.


May 28 14:51:03 US-EWR3-02 kernel: ahcich9: Timeout on slot 14 port 0
May 28 14:51:03 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 1d0 serr 00000000 cmd 0004ce17
May 28 14:51:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:51:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:51:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Error 5, Retries exhausted
May 28 14:51:33 US-EWR3-02 kernel: ahcich9: Timeout on slot 19 port 0
May 28 14:51:33 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 1d0 serr 00000000 cmd 0004d317
May 28 14:51:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:51:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:51:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Retrying command, 0 more tries remain
May 28 14:52:03 US-EWR3-02 kernel: ahcich9: Timeout on slot 20 port 0
May 28 14:52:03 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd 1d0 serr 00000000 cmd 0004d417
May 28 14:52:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:52:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:52:03 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Error 5, Retries exhausted
May 28 14:52:19 US-EWR3-02 kernel: ahcich8: Timeout on slot 26 port 0
May 28 14:52:19 US-EWR3-02 kernel: ahcich8: is 00000000 cs 00000000 ss fc03ffff rs fc03ffff tfd 40 serr 00000000 cmd 0004d117
May 28 14:52:19 US-EWR3-02 kernel: (ada2:ahcich8:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c7 65 68 40 61 00 00 01 00 00
May 28 14:52:19 US-EWR3-02 kernel: (ada2:ahcich8:0:0:0): CAM status: Command timeout
May 28 14:52:19 US-EWR3-02 kernel: (ada2:ahcich8:0:0:0): Retrying command, 3 more tries remain
May 28 14:52:33 US-EWR3-02 kernel: ahcich9: Timeout on slot 23 port 0
May 28 14:52:33 US-EWR3-02 kernel: ahcich9: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd 1d0 serr 00000000 cmd 0004d717
May 28 14:52:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
May 28 14:52:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): CAM status: Command timeout
May 28 14:52:33 US-EWR3-02 kernel: (aprobe0:ahcich9:0:0:0): Retrying command, 0 more tries remain
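
With logs this long, a per-channel tally shows quickly which ports are affected. The following is a minimal Python sketch with a hypothetical helper name `count_timeouts`; the only thing it assumes about the log format is the "ahcichN: Timeout on slot X port Y" wording seen throughout this report.

```python
import re
from collections import Counter

TIMEOUT_RE = re.compile(r"(ahcich\d+): Timeout on slot (\d+) port (\d+)")

def count_timeouts(log_text):
    """Count 'Timeout on slot' messages per AHCI channel."""
    return Counter(m.group(1) for m in TIMEOUT_RE.finditer(log_text))

# Three lines copied from the log above, as a short sample.
sample = """\
May 28 14:51:03 US-EWR3-02 kernel: ahcich9: Timeout on slot 14 port 0
May 28 14:52:19 US-EWR3-02 kernel: ahcich8: Timeout on slot 26 port 0
May 28 14:52:33 US-EWR3-02 kernel: ahcich9: Timeout on slot 23 port 0
"""
for channel, n in sorted(count_timeouts(sample).items()):
    print(channel, n)
```

The same function can be pointed at the contents of /var/log/messages read into a string.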
Comment 25 dave 2019-06-12 18:38:14 UTC
(In reply to Allan Jude from comment #24)

Did you find a suitable workaround?
Comment 26 Alexey 2019-06-12 18:52:22 UTC
It has worked for me since August 2018 with vfs.zfs.cache_flush_disable=1, without reboots or disk timeouts.
Comment 27 dave 2019-06-12 23:14:56 UTC
(In reply to Alexey from comment #26)

I have another more drastic workaround that works. In /boot/device.hints I placed the following lines:

hint.ahcich.0.sata_rev=2
hint.ahcich.1.sata_rev=2
hint.ahcich.2.sata_rev=2
...etc...

This slows down the SATA bus which isn't desirable either, but better than timeouts and controller drops.
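
The hint lines follow one pattern per channel, so they can be generated instead of typed. According to ahci(4), sata_rev values 1, 2 and 3 limit the link to 1.5, 3 and 6 Gbps respectively. This is a minimal Python sketch with a hypothetical helper name `sata_rev_hints`; the channel count is an assumption and has to match the ahcichN devices the machine actually reports.

```python
def sata_rev_hints(channels, rev=2):
    """Build device.hints lines limiting each AHCI channel to a SATA revision."""
    return [f"hint.ahcich.{ch}.sata_rev={rev}" for ch in range(channels)]

# 8 channels is an assumption; it matches the X11SSH-F dmesg in the description.
print("\n".join(sata_rev_hints(8)))
```

The output is meant to be reviewed and pasted into /boot/device.hints by hand.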
Comment 28 Samuel Chow 2019-07-06 01:50:38 UTC
Unfortunately, I am also hitting this issue:

[3974486] ada6 at ahcich7 bus 0 scbus9 target 0 lun 0
[3974486] ada6: <HGST HUS722T2TALA604 RAGNWA09> ACS-3 ATA SATA 3.x device
[3974486] ada6: Serial Number WMC6N0XXXXXX
[3974486] ada6: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
[3974486] ada6: Command Queueing enabled
[3974486] ada6: 1907729MB (3907029168 512 byte sectors)

[4038180] ahcich7: Timeout on slot 15 port 0
[4038180] ahcich7: is 00000000 cs 00000000 ss 00008000 rs 00008000 tfd 40 serr 00000000 cmd 0004cf17
[4038180] (ada6:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 e4 3f 40 2e 00 00 00 00 00
[4038180] (ada6:ahcich7:0:0:0): CAM status: Command timeout
[4038180] (ada6:ahcich7:0:0:0): Retrying command
[4039687] ahcich7: Timeout on slot 5 port 0
[4039687] ahcich7: is 00000000 cs 00000000 ss 00000020 rs 00000020 tfd 40 serr 00000000 cmd 0004c517
[4039687] (ada6:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 e9 3f 40 2e 00 00 00 00 00
[4039687] (ada6:ahcich7:0:0:0): CAM status: Command timeout
[4039687] (ada6:ahcich7:0:0:0): Retrying command
[4046427] ahcich7: Timeout on slot 31 port 0
[4046427] ahcich7: is 00000000 cs 00000000 ss 80000007 rs 80000007 tfd 40 serr 00000000 cmd 0004c217
[4046427] (ada6:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 f8 1a 40 40 2e 00 00 00 00 00
[4046427] (ada6:ahcich7:0:0:0): CAM status: Command timeout
[4046427] (ada6:ahcich7:0:0:0): Retrying command


I note that in all the reports here, people seem to have SuperMicro machines. I have an X10DAi motherboard, and this disk is connected to its onboard 'C610/X99 series chipset 6-Port SATA Controller'.

Interestingly, on this machine I also have 2 other mps 'LSI 9207-4i4e PCIe SATA Host Controller' cards, with another HUS722T2TALA604 attached there. That disk has been running for 5 months, and I have never seen any problems with it.
Comment 29 Rick 2019-07-31 06:54:10 UTC
We had the same problem with a SuperMicro SIS14B server.
A colleague of mine found out that removing the vertical hot-swappable connector board solves the issue.
So when the disks are connected directly to the main board, it works correctly.
Comment 30 Rick 2019-08-01 09:23:50 UTC
(In reply to Rick from comment #29)
Please disregard this remark. The fix lasted only a few days :S
Comment 31 John Baldwin freebsd_committer freebsd_triage 2019-08-08 23:49:21 UTC
I believe I ran into the same thing though in my case the drives would bounce (come right back after detaching), but since they all bounced at once it always killed the zpool (the drives were all mirrors in a single pool).  I found that disabling hotplug for each SATA port in the BIOS setup caused the bouncing to stop and the system has been stable for a day and a half (previously when this started it wouldn't make it to multiuser without losing the zpool).  I'm not sure if I had outsmarted myself by enabling this in the BIOS setup at one point but figured I'd leave a bread crumb here in case it helps someone else.