Bug 229745 - ahcich: CAM status: Command timeout
Summary: ahcich: CAM status: Command timeout
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-12 22:35 UTC by Alexey
Modified: 2019-02-07 07:36 UTC

Description Alexey 2018-07-12 22:35:01 UTC
Hello!

We have several Supermicro servers based on the X11SSH-F.
All servers were installed half a year ago and ran FreeBSD 11.1. Each server has 4 HGST HUS722T1TALA604 HDDs.
All of them worked fine during that time, with about half a year of uptime.
Recently the servers were upgraded to FreeBSD 11.2 (self-built 11.2-STABLE r335679 with default make.conf, src.conf and GENERIC),

and after some time (it varies, from 2 hours to 7 days) one or more disks started to time out:

Jul 13 00:56:24 mrr32 kernel: ahcich2: Timeout on slot 17 port 0
Jul 13 00:56:24 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 00060000 rs 00060000 tfd 40 serr 00000000 cmd 0004d217
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 ca 22 23 40 06 00 00 00 00 00
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 00:56:24 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 00:58:16 srv32 kernel: ahcich2: Timeout on slot 26 port 0
Jul 13 00:58:16 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 04000000 rs 04000000 tfd 40 serr 00000000 cmd 0004da17
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e0 8a cc c6 40 18 00 00 00 00 00
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 00:58:16 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 01:01:46 srv32 kernel: ahcich2: Timeout on slot 18 port 0
Jul 13 01:01:46 srv32 kernel: ahcich2: is 00000000 cs 00000000 ss 00040000 rs 00040000 tfd 40 serr 00000000 cmd 0004d217
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 2a 2b 23 40 06 00 00 00 00 00
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 13 01:01:46 srv32 kernel: (ada2:ahcich2:0:0:0): Retrying command
Jul 13 01:07:12 srv32 kernel: ahcich0: Timeout on slot 23 port 0
Jul 13 01:07:12 srv32 kernel: ahcich0: is 00000000 cs 00000000 ss 00800000 rs 00800000 tfd 40 serr 00000000 cmd 0004d717
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 18 62 f5 c6 40 18 00 00 00 00 00
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jul 13 01:07:12 srv32 kernel: (ada0:ahcich0:0:0:0): Retrying command
Jul 13 01:07:43 srv32 kernel: ahcich0: Timeout on slot 2 port 0
Jul 13 01:07:43 srv32 kernel: ahcich0: is 00000000 cs 00000000 ss 00000004 rs 00000004 tfd 40 serr 00000000 cmd 0004c217
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 62 12 7b 40 06 00 00 00 00 00
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Jul 13 01:07:43 srv32 kernel: (ada0:ahcich0:0:0:0): Retrying command
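
For readers who want to see which blocks the timed-out commands were addressing, the ACB bytes in the log lines above can be decoded. The following is only a minimal sketch: it assumes the 12 ACB bytes are printed in the order command, features, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp, and that for FPDMA (NCQ) commands the transfer length is carried in the features bytes. That layout is my reading of how CAM prints ATA commands, not something stated in this report, and the function name `decode_fpdma_acb` is made up for illustration.

```python
def decode_fpdma_acb(acb):
    """Decode an ACB string such as '61 20 ca 22 23 40 06 00 00 00 00 00'.

    Assumed byte order (not confirmed by this bug report):
    command, features, lba_low, lba_mid, lba_high, device,
    lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
    sector_count, sector_count_exp
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    names = {0x60: "READ_FPDMA_QUEUED", 0x61: "WRITE_FPDMA_QUEUED"}
    if b[0] not in names:
        raise ValueError("not an FPDMA read/write command")
    # 48-bit LBA: low three bytes, then the three 'exp' bytes.
    lba = (b[2] | (b[3] << 8) | (b[4] << 16)
           | (b[6] << 24) | (b[7] << 32) | (b[8] << 40))
    # For NCQ commands the sector count lives in features/features_exp.
    sectors = b[1] | (b[9] << 8)
    if sectors == 0:
        sectors = 65536  # a count of 0 conventionally means 65536 sectors
    return {"command": names[b[0]], "lba": lba, "sectors": sectors}


first = decode_fpdma_acb("61 20 ca 22 23 40 06 00 00 00 00 00")
print(first["command"], hex(first["lba"]), first["sectors"])
```

Under those assumptions the first timed-out command is a 32-sector (16 KiB with 512-byte sectors) write at LBA 0x62322ca, which is well inside the 1953525168-sector drive.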

A reboot (/sbin/shutdown -r or /sbin/reboot) does not solve the problem; the disks still time out after boot. Only a power off / power on solves the problem for some time, and after a while the timeouts come back.

The servers were updated to the latest BIOS available from Supermicro. No change.

ahci0: <Intel Sunrise Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xdf310000-0xdf311fff,0xdf31e000-0xdf31e0ff,0xdf31d000-0xdf31d7ff irq 16 at device 23.0 on pci0
ahci0: AHCI v1.31 with 8 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahcich6: <AHCI channel> at channel 6 on ahci0
ahcich7: <AHCI channel> at channel 7 on ahci0

ses0 at ahciem0 bus 0 scbus8 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)


ahci0@pci0:0:23:0:      class=0x010601 card=0x088415d9 chip=0xa1028086 rev=0x31 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Sunrise Point-H SATA controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA


We use ZFS on all servers; some pools are raidz1, some are raid-10, with the same results.

We used to run smartd on all servers. I tried disabling smartd; it looks like that makes no difference.

We have already upgraded the zpools to the new features, so those features would have to be removed before downgrading back to 11.1.
Comment 1 Alexey 2018-07-12 22:50:30 UTC
We also have some recently installed Supermicro X10DRW-i servers with the same problem. Those servers were installed with 11.2, so we have no data for 11.1.

ahci0: <Intel Wellsburg AHCI SATA controller> port 0x70b0-0x70b7,0x70a0-0x70a3,0x7090-0x7097,0x7080-0x7083,0x7000-0x701f mem 0xc7237000-0xc72377ff irq 16 at device 17.4 numa-domain 0 on pci2
ahci0: AHCI v1.30 with 4 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0

(There are two SATA controllers on the motherboard. After the problem was detected, the second one was disabled in the BIOS; it is the same Intel Wellsburg AHCI SATA controller.)

ses0 at ahciem0 bus 0 scbus4 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <HGST HUS722T1TALA604 RAGNWA07> ACS-3 ATA SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors)


Here we have lost a disk twice. Power off/on brings the disk back, but the timeouts return after a while.

We have already upgraded the BIOS to the latest version here as well.
Comment 2 skillcoder 2018-07-14 14:28:53 UTC
I have the same problem with this disk:
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Firmware Version: RAGNWA05
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)

Directly after upgrading FreeBSD from 11.1-RELEASE-p4 to 11.2-RELEASE, I have this in /var/log/messages:

Jul 14 17:04:54 skillcoder kernel: ahcich2: Timeout on slot 21 port 0
Jul 14 17:04:54 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 0f200000 rs 0f200000 tfd 40 serr 00000000 cmd 0000db17
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 98 2a 7c 40 05 00 00 00 00 00
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:04:54 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:05:25 skillcoder kernel: ahcich2: Timeout on slot 31 port 0
Jul 14 17:05:25 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss e000000f rs e000000f tfd 40 serr 00000000 cmd 0000c317
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 68 fa 40 14 00 00 01 00 00
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:05:25 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:06:12 skillcoder kernel: ahcich2: Timeout on slot 24 port 0
Jul 14 17:06:12 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 19000000 rs 19000000 tfd 40 serr 00000000 cmd 0000de17
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 a8 1d 6c 40 05 00 00 00 00 00
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:06:12 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:06:43 skillcoder kernel: ahcich2: Timeout on slot 14 port 0
Jul 14 17:06:43 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 000fc000 rs 000fc000 tfd 40 serr 00000000 cmd 0000d317
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 60 62 2c 40 14 00 00 01 00 00
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:06:43 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:07:35 skillcoder kernel: ahcich2: Timeout on slot 14 port 0
Jul 14 17:07:35 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 007fc000 rs 007fc000 tfd 40 serr 00000000 cmd 0000d617
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 8b fe 40 14 00 00 01 00 00
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:07:35 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:08:08 skillcoder kernel: ahcich2: Timeout on slot 10 port 0
Jul 14 17:08:08 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 00007c00 rs 00007c00 tfd 40 serr 00000000 cmd 0000ce17
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 70 7e 40 40 14 00 00 01 00 00
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:08:08 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:08:45 skillcoder kernel: ahcich2: Timeout on slot 10 port 0
Jul 14 17:08:45 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 000017c0 rs 000017c0 tfd 40 serr 00000000 cmd 0000c917
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 f0 0d 5b 40 1e 00 00 00 00 00
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Jul 14 17:08:45 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command

Jul 14 17:09:19 skillcoder kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 5028, size: 12288
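
The register dump lines above can be unpacked to see how many NCQ commands were outstanding when each timeout fired. This is a rough sketch only. It assumes that `ss` is the port's SActive bitmask (one bit per outstanding NCQ tag) and that `rs` is the driver's own mask of running slots; both are my interpretation of the driver's message, not something confirmed in this report. The helper names are hypothetical.

```python
import re

# Matches the "name hexvalue" pairs in an ahcich register dump line.
REG_RE = re.compile(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-f]+)\b")


def parse_registers(line):
    """Return {register_name: int_value} for one register dump line."""
    return {name: int(value, 16) for name, value in REG_RE.findall(line)}


def slots_from_mask(mask):
    """List the slot numbers (0..31) whose bit is set in a 32-bit mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


line = ("Jul 14 17:04:54 skillcoder kernel: ahcich2: is 00000000 cs 00000000 "
        "ss 0f200000 rs 0f200000 tfd 40 serr 00000000 cmd 0000db17")
regs = parse_registers(line)
print(slots_from_mask(regs["ss"]))  # [21, 24, 25, 26, 27]
```

For the first timeout in this comment ("Timeout on slot 21"), the decoded mask shows five commands in flight, with slot 21 among them. In the description's logs, by contrast, `ss 00060000` decodes to only slots 17 and 18.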


I also have another server with the same disk behind a hardware RAID controller (HP Smart Array P410).
After upgrading it from 11.0-RELEASE-p9 to 11.2-RELEASE the disk works fine, but it is hardware RAID, so FreeBSD does not see the disks directly.

HELP!
Comment 3 Alexey 2018-07-14 23:14:14 UTC
Yes, it is possible that the problem is related to the disk. We have 8 Supermicro X10DRW-i servers: 4 of them with HGST HUS722T1TALA604 and the other 4 with TOSHIBA DT01ACA100 MS2OA750 HDDs. All of them were installed about half a year ago with FreeBSD 11.1-STABLE, and all of them worked fine. We have now upgraded four servers to 11.2, two with each disk type. The HGST-based servers have the problem, and the Toshiba-based servers still do not. I'll try replacing the disks with a different model. But if the problem is in the disk itself, why does it work under 11.1?

If we compare the camcontrol identify output under 11.1 and 11.2 for the same model, we see no differences (except that 11.2 adds the line "Zoned-Device Commands no").

protocol              ATA/ATAPI-10 SATA 3.x
device model          HGST HUS722T1TALA604
firmware revision     RAGNWA07
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    yes
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         yes      yes
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      no

The Toshiba disk looks different:

pass0: <TOSHIBA DT01ACA100 MS2OA750> ATA8-ACS SATA 3.x device
pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 3.x
device model          TOSHIBA DT01ACA100
firmware revision     MS2OA750
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      no      0/0x00
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         no       no
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      1953525168/1953525168
HPA - Security                 no
Comment 4 skillcoder 2018-07-25 19:50:33 UTC
The problem is still present.
Maybe it is related to bug #224536:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224536
Comment 5 Alexey 2018-08-01 21:17:07 UTC
Yes, it looks like vfs.zfs.cache_flush_disable=1 hides the problem. (At least the servers with a raidz1 ZFS pool, which used to start timing out after no more than 2 days of uptime, now have 6 days of uptime and no timeouts on the console.)
Comment 6 Alexey 2018-08-01 21:48:35 UTC
(In reply to Alexey from comment #5)
But the problem is, if we check the spec for the HGST HUS722T1TALA604 drive (https://www.hgst.com/sites/default/files/resources/Ultrastar-7K2-EN-US-DS.pdf), I do not see that the drive uses shingled magnetic recording. (1TB is not a large capacity for a 3.5" drive, so it should not need any special recording methods.)

And if we replace all four disks in the zpool, one by one, from HGST HUS722T1TALA604 with something different (TOSHIBA DT01ACA100 or TOSHIBA MG03ACA100 in our case), it solves the problem without disabling cache flushes (vfs.zfs.cache_flush_disable=0).
Comment 7 skillcoder 2018-08-01 22:33:57 UTC
Unfortunately, this workaround did not help me.

I added "vfs.zfs.cache_flush_disable=1" to "/boot/loader.conf" and rebooted.
I still get the same log after boot:
Aug  2 01:30:32 skillcoder kernel: ahcich2: Timeout on slot 12 port 0
Aug  2 01:30:32 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 001ff000 rs 001ff000 tfd 40 serr 00000000 cmd 0
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e8 70 e5 07 40 5d 00 00 00 00 00
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command
Aug  2 01:31:08 skillcoder kernel: ahcich2: Timeout on slot 15 port 0
Aug  2 01:31:08 skillcoder kernel: ahcich2: is 00000000 cs 00000000 ss 007f8000 rs 007f8000 tfd 40 serr 00000000 cmd 0
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 e8 70 1e 08 40 5d 00 00 00 00 00
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Aug  2 01:31:08 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command


The problem is still present with:
# sysctl vfs.zfs.cache_flush_disable                                                       
vfs.zfs.cache_flush_disable: 1
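
When comparing behaviour before and after a tunable change like this, it can help to tally the timeouts per channel from /var/log/messages rather than eyeballing the log. A minimal sketch, assuming the message format shown in this report ("ahcichN: Timeout on slot S port P"); the function name `count_timeouts` is hypothetical:

```python
import re
from collections import Counter

# Format taken from the log lines quoted in this report.
TIMEOUT_RE = re.compile(r"(ahcich\d+): Timeout on slot (\d+) port (\d+)")


def count_timeouts(lines):
    """Count 'Timeout on slot' messages per AHCI channel."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Aug  2 01:30:32 skillcoder kernel: ahcich2: Timeout on slot 12 port 0",
    "Aug  2 01:30:32 skillcoder kernel: (ada2:ahcich2:0:0:0): Retrying command",
    "Aug  2 01:31:08 skillcoder kernel: ahcich2: Timeout on slot 15 port 0",
]
print(dict(count_timeouts(sample)))  # {'ahcich2': 2}
```

In practice the lines would come from something like `open("/var/log/messages")`, filtered to the period since the last power cycle.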
Comment 8 Alexey 2018-08-01 22:37:13 UTC
Power off, or a simple /sbin/reboot? If the timeouts have already started, you will need a power off + power on, not reboot or shutdown -r.
Comment 9 skillcoder 2018-08-02 17:15:48 UTC
A simple shutdown -r now.
But then I tried shutdown -p now (with "vfs.zfs.cache_flush_disable=1" in "/boot/loader.conf") and switched the power off.
Now uptime is 18+ hours, with not a single hang and no "ahcichX: Timeout on slot XX port 0" in the log.
BIG THX! But how is that possible?
Comment 10 cryx-freebsd 2018-09-04 08:17:10 UTC
Affects me too. FreeBSD 11.2-RELEASE on a SuperMicro X11SSL-F with Intel Sunrise Point AHCI SATA controller and HUS722T2TALA604 disks. The vfs.zfs.cache_flush_disable=1 workaround also bypasses the problem after a power-off.
Comment 11 cryx-freebsd 2018-09-04 13:07:07 UTC
(In reply to cryx-freebsd from comment #10)

Setting kern.cam.ada.write_cache=0 in loader.conf and doing a power-cycle makes the problem go away, with the obvious hits on IO performance.

IMHO this shows that there is indeed a problem with the disks' write-cache handling.
Comment 12 samm 2018-10-23 05:54:25 UTC
Same applies to me: 2 old 1TB drives were replaced by HGST HUS722T1TALA604/RAGNWA07 and the problem started:

(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 02 af 63 40 37 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada0:ahcich0:0:0:0): RES: 41 10 02 af 63 00 37 00 00 00 00
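
The status and error bytes in this log can be decoded by bit. The sketch below uses the conventional ATA register bit names as I understand them (an assumption on my part; some bits carry different names in newer ATA revisions), and the helper names are hypothetical.

```python
# Conventional ATA status/error bit names, highest bit first (assumed).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "SERV"),
               (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "IDNF"),
              (0x08, "MCR"), (0x04, "ABRT"), (0x02, "NM"), (0x01, "ILI")]


def decode_bits(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


print(decode_bits(0x41, STATUS_BITS))  # ['DRDY', 'ERR']
print(decode_bits(0x10, ERROR_BITS))   # ['IDNF']
print(decode_bits(0x40, STATUS_BITS))  # ['DRDY']
```

So here the drive itself completes the command with an error (ERR set, IDNF meaning the requested address was not found). In the timeout reports earlier in this bug, `tfd 40` decodes to DRDY alone: the drive appears ready and reports no error, yet the command never completes.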
Comment 13 samm 2018-10-23 05:55:50 UTC
Also, from a SMART standpoint the drives look healthy: self-tests pass, attributes are good, no UDMA CRC errors, etc.
Comment 14 samm 2018-10-23 06:03:56 UTC
P.S. We are also using 11.2-RELEASE-p4/amd64 on an X10SLM-F server:

ahci0: <Intel Lynx Point AHCI SATA controller> port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 0xf7232000-0xf72327ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
Comment 15 samm 2018-10-23 06:10:17 UTC
TBH, it makes me think that the drives themselves or the controller may have an issue with NCQ.
Comment 16 samm 2018-10-23 06:12:49 UTC
To confirm whether it is NCQ related, I decided to disable it on one of the 2 affected drives:

camcontrol negotiate ada0 -T disable
camcontrol reset ada0

The second affected drive will keep using NCQ for now, to see if it makes any difference.
Comment 17 samm 2018-10-31 14:06:02 UTC
One update: after disabling NCQ on ada0 I do not see any problems with it, but ada2 is still failing. I am disabling NCQ on both for now.
Comment 18 samm 2018-11-05 20:23:20 UTC
One more update: after disabling NCQ on both affected drives things are going well, and I do not see any errors anymore. So from my POV it is buggy drive firmware.
Comment 19 sec 2019-02-06 12:36:20 UTC
I'm also having the same problem. Before 11.2 my ZFS mirror was working fine. After the upgrade, these CAM errors started to show up.
I replaced the PSU, replaced one of the drives, replaced the cables - still the same. I checked SMART for the drives and ran short/long tests - the drives are fine.
I even ran memtest :)

My observations:
- when the HDDs are connected directly to the motherboard, there are Timeout errors
- when the HDDs are connected to a PCIe SATA controller, there are unrecoverable CRC errors

I tried disabling NCQ and the cache - nothing helps.

The strange thing is that when the mirror is broken (only one drive connected), everything is fine. The errors only start to show up when both drives are connected in the same mirror.

My drives are WD Gold 1TB:
1. WDC WD1005FBYZ-01YCBB2 RR07
2. WDC WD1005FBYZ-01YCBB1 RR04

Before, I had two RR07 drives working fine. After the upgrade the errors showed up, so I RMA'd one of them and got an RR04, which didn't fix the error. Also, the problem only shows up on one of the drives in the mirror.

Right now I'm in the process of migrating data to an 11.1 zfs pool; then I will downgrade my server back to releng/11.1 and check whether the pool works fine.

I also have two SAMSUNG HD642JJ 1AA01113 drives connected in another mirror - no issues with those.

I tried swapping cables, ports, etc. - the problem follows those drives, together.

I hope for some solution to this one, because it will block any upgrade to 12 :)
Comment 20 sec 2019-02-07 07:36:36 UTC
(In reply to sec from comment #19)
OK, so I tried downgrading the pool to 11.1 - it didn't help.
Then I also started to get those errors with only one drive connected (which was fine before).
I also tried booting 12.0R - same issue.

The weird thing is that sometimes the drives are OK and resilver without any problems, and then after some reads/writes the timeouts start to show up.

In the end I tried something I think I had tried before. I added these to /boot/loader.conf and power cycled:
vfs.zfs.cache_flush_disable=1
kern.cam.ada.write_cache=0

Then I started a stress test - no errors.
Then I commented out the "write_cache" line - no errors.
Then I commented out the "cache_flush" line - the errors were back again (I think I had tried that setting already and got errors, but this time it looks fine, fingers crossed).

So right now I'm running with cache_flush disabled - I hope this will get rid of those errors until some proper fix is done.
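
When bisecting tunables like this across power cycles, it is easy to lose track of which workaround was actually active for a given run. The following is a small sketch, not an official tool, that reads loader.conf-style text and reports which of the two workarounds from this thread are enabled. The function names are hypothetical, and the parser is deliberately simplistic (it treats any `#` as the start of a comment).

```python
# The two workaround settings discussed in this bug report.
WORKAROUNDS = {
    "vfs.zfs.cache_flush_disable": "1",
    "kern.cam.ada.write_cache": "0",
}


def parse_loader_conf(text):
    """Return {name: value} for name=value lines, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def active_workarounds(text):
    """List the workaround tunables that are set to their workaround value."""
    conf = parse_loader_conf(text)
    return sorted(n for n, v in WORKAROUNDS.items() if conf.get(n) == v)


example = '''
vfs.zfs.cache_flush_disable=1
#kern.cam.ada.write_cache=0
'''
print(active_workarounds(example))  # ['vfs.zfs.cache_flush_disable']
```

The example text corresponds to the final state described above: cache_flush disabled, write_cache line commented out.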

The errors I was getting looked like this:
(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 b0 00 20 40 4c 00 00 01 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada2:ahcich2:0:0:0): RES: 41 10 b0 00 20 00 4c 00 00 00 00
(ada2:ahcich2:0:0:0): Retrying command

There were also READ and FLUSHCACHE48 errors - this depends on the load I was generating on the pool.

If needed I can provide more debug output.