Bug 224536

Summary: FLUSHCACHE48 AHCI Timeouts on SMR Drives
Product: Base System Reporter: fabiob <ilgenox>
Component: kernAssignee: freebsd-bugs mailing list <bugs>
Status: New ---    
Severity: Affects Only Me CC: allanjude, elij+freebsd, ronen8250, tcovert+freebsd, tomek, v_bachvarov
Priority: ---    
Version: 11.1-STABLE   
Hardware: amd64   
OS: Any   
URL: https://forums.freenas.org/index.php?threads/a-long-journey-to-solve-cam-status-command-timeout-issues.59549/#post-422291

Description fabiob 2017-12-23 07:08:58 UTC
Hi!
Starting from this thread:

https://forums.freenas.org/index.php?threads/a-long-journey-to-solve-cam-status-command-timeout-issues.59549/

I had hard times troubleshooting this issue on FreeNAS with 4 x ST8000AS0002 Drives.
It seems that drives increase latency over time and exceed the AHCI (hardcoded?) 31000 msec timeout.
It will be great if FreeBSD could support these new SMR drives without disabling cache flushing on ZFS.

Thanks,
Fabio
Comment 1 ronen8250 2018-03-09 00:37:46 UTC
Same problem with sparc64.

I have just upgrade my Sun Ultra2 from Free-BSD 10.1 to 11.1 and I got some error messages and the running commands freeze.



# dmesg 
Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.1-RELEASE #0 r321309: Fri Jul 21 08:46:54 UTC 2017
    root@releng2.nyi.freebsd.org:/usr/obj/sparc64.sparc64/usr/src/sys/GENERIC sparc64
gcc version 4.2.1 20070831 patched [FreeBSD]
real memory  = 1073741824 (1024 MB)
avail memory = 1030152192 (982 MB)
cpu0: Sun Microsystems UltraSparc-II Processor (296.01 MHz CPU)
cpu1: Sun Microsystems UltraSparc-II Processor (296.01 MHz CPU)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
random: unblocking device.
random: entropy device external interface
kbd0 at kbdmux0
nexus0: <Open Firmware Nexus device>
sbus0: <U2S UPA-SBus bridge> mem 0x1fe00000000-0x1fe00007fff irq 2036,2037,2038,2021,2026,2039 on nexus0
sbus0: clock 25.000 MHz
sbus0: DVMA map: 0xfc000000 to 0xffffffff 8192 entries, streaming buffer
sbus0: [GIANT-LOCKED]
sbus0: [GIANT-LOCKED]
initializing counter-timer
Timecounter "sbus0" frequency 1000000 Hz quality 100
auxio0: <Sun Auxiliary I/O> mem 0x1900000 on sbus0
pcm0: <Sun Audiocs> mem 0xc000000-0xc0001ff irq 2020 on sbus0
pcm0: <CS4231A Codec Id. 10>
sbus0: <flashprom> mem 0-0x7ffff,0x1380000-0x13fffff type unknown (no driver attached)
sbus0: <SUNW,fdtwo> mem 0x1400000-0x1400007 irq 2025 type block (no driver attached)
eeprom0: <EEPROM/clock> mem 0x1200000-0x1201fff on sbus0
eeprom0: model mk48t59
scc0: <Zilog Z8530 dual channel SCC> mem 0x1100000-0x1100003 irq 2024 on sbus0
uart0: <z8530, channel A> on scc0
uart0: stray vector interrupt 2033
console (38400,n,8,1)
uart1: <z8530, channel B> on scc0
scc1: <Zilog Z8530 dual channel SCC> mem 0x1000000-0x1000003 irq 2024 on sbus0
uart2: <z8530, channel A> on scc1
uart2: keyboard (1200,n,8,1)
uart2: keyboard not present
uart3: <z8530, channel B> on scc1
sbus0: <sc> mem 0x1300000-0x1300007 type unknown (no driver attached)
sbus0: <SUNW,pll> mem 0x1304000-0x1304002 type unknown (no driver attached)
esp0: <Sun FAS366 Fast-Wide SCSI> mem 0x8800000-0x880000f,0x8810000-0x881003f irq 2016 on sbus0
esp0: FAS366/HME, 40 MHz, SCSI ID 7
hme0: <Sun HME 10/100 Ethernet> mem 0x8c00000-0x8c00107,0x8c02000-0x8c03fff,0x8c04000-0x8c05fff,0x8c06000-0x8c07fff,0x8c07000-0x8c0701f irq 2017 on sbus0
miibus0: <MII bus> on hme0
nsphy0: <DP83840 10/100 media interface> PHY 1 on miibus0
nsphy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
hme0: Ethernet address: xx:xx:xx:xx:xx:xx
sbus0: <SUNW,bpp> mem 0xc800000-0xc80001b irq 2018 type unknown (no driver attached)
sbus0: <nm212> mem 0-0x1ffff,0x20000-0x9ffff,0xa0000-0xdffff,0xe0000-0xe00ff irq 1996 type network (no driver attached)
creator0: <Creator3D> mem 0x1fc00000000-0x1fc000003ff,0x1fc00400000-0x1fc005fffff,0x1fc00600000-0x1fc007fffff,0x1fc01000000-0x1fc013fffff,0x1fc01400000-0x1fc017fffff,0x1fc01800000-0x1fc01bfffff,0x1fc01c00000-0x1fc01ffffff,0x1fc02000000-0x1fc02ffffff,0x1fc03000000-0x1fc03ffffff,0x1fc04000000-0x1fc043fffff,0x1fc04400000-0x1fc047fffff,0x1fc04800000-0x1fc04bfffff,0x1fc04c00000-0x1fc04ffffff,0x1fc05000000-0x1fc05ffffff,0x1fc06000000-0x1fc07ffffff,0x1fc09000000-0x1fc097fffff,0x1fc09800000-0x1fc09ffffff,0x1fc0a000000-0x1fc0affffff,0x1fc0b000000-0x1fc0b7fffff,0x1fc0b800000-0x1fc0bffffff,0x1fc0c000000-0x1fc0c3fffff,0x1fc0c800000-0x1fc0cffffff,0x1fc0d000000-0x1fc0d7fffff,0x1fc0d800000-0x1fc0dffffff irq 1925 on nexus0
creator0: resolution 1152x900
cryptosoft0: <software crypto> on nexus0
syscons0: <System console> on nexus0
syscons0: Unknown <16 virtual consoles, flags=0x100>
Timecounter "tick" frequency 296008859 Hz quality 10
Event timer "tick" frequency 296008859 Hz quality 1000
Timecounters tick every 1.000 msec
usb_needs_explore_all: no devclass
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/da0a [rw]...
da0 at esp0 bus 0 scbus0 target 0 lun 0
da0: <IBM DNES30917SUN9.0G SAD0> Fixed Direct Access SCSI-3 device
da0: Serial Number 27L734          
da0: 20.000MB/s transfers (10.000MHz, offset 5, 16bit)
da0: Command Queueing enabled
da0: 8637MB (17689267 512 byte sectors)
cd0 at esp0 bus 0 scbus0 target 6 lun 0
cd0: <TOSHIBA DVD-ROM SD-M1401 1009> Removable CD-ROM SCSI-2 device
cd0: 10.000MB/s transfers (10.000MHz, offset 15)
cd0: Attempt to query device size failed: NOT READY, Logical unit is in process of becoming ready
hme0: link state changed to DOWN
hme0: link state changed to UP
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1681ce0 (flags 0x1, dleft 1000, stat 0)], <state 2, nexus 0xc1680888, phase(l 97, c 100, p 3), resid 1000, msg(q 0,o 40) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1683138 (flags 0x1, dleft 1000, stat 0)], <state 1, nexus 0, phase(l 97, c 100, p 7), resid 1000, msg(q 0,o 40) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1687770 (flags 0x1, dleft 1000, stat 0)], <state 1, nexus 0, phase(l 97, c 100, p 7), resid 1000, msg(q 0,o 40) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1687ab8 (flags 0x1, dleft 1000, stat 0)], <state 1, nexus 0, phase(l 97, c 100, p 7), resid 1000, msg(q 0,o 40) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc16842f0 (flags 0x1, dleft 1000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1688928 (flags 0x1, dleft 8000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1682178 (flags 0x1, dleft 8000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1680930 (flags 0x1, dleft 1000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1682370 (flags 0x1, dleft 1000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1686078 (flags 0x1, dleft 1000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1687cb0 (flags 0x1, dleft 1000, stat 0)], <state 5, nexus 0xc168a4b8, phase(l 93, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc16887d8 (flags 0x101, dleft 1000, stat 0)], <state 5, nexus 0xc168a4b8, phase(l 97, c 3, p 3), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1685b38 (flags 0x1, dleft 1000, stat 0)], <state 1, nexus 0, phase(l 97, c 100, p 3), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc1680888 (flags 0x1, dleft 1000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 3), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): esp0: timed out [ecb 0xc16885e0 (flags 0x1, dleft 1000, stat 0)], <state 4, nexus 0, phase(l 97, c 7, p 7), resid 1000, msg(q 0,o 0) >
(da0:esp0:0:0:0): WRITE(10). CDB: 2a 00 00 9f ed a0 00 00 08 00 
(da0:esp0:0:0:0): CAM status: Command timeout
(da0:esp0:0:0:0): Retrying command
Comment 2 Rocco 2019-04-07 10:23:54 UTC
+1, Having the same problem on FreeBSD 12.0-p3 (latest upgrades).
Hardware: QNAP with 4x 4TB WD Red drives.
ada0: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device

Drives get detached only when doing a scrub.

11:39:53  kernel: ahcich3: Timeout on slot 9 port 0
11:39:53  kernel: ahcich3: is 00000000 cs 0001f800 ss 0001ff00 rs 0001ff00 tfd 40 serr 00000000 cmd 0004cb17
11:39:53  kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d0 9a 9e 40 20 00 00 01 00 00
11:39:53  kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
11:39:53  kernel: (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
11:41:32  kernel: ahcich3: Timeout on slot 8 port 0
11:41:32  kernel: ahcich3: is 00000000 cs 0000c000 ss 0000fff0 rs 0000fff0 tfd 40 serr 00000000 cmd 0004ce17
11:41:32  kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 80 e6 f4 40 21 00 00 01 00 00
11:41:32  kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
11:41:32  kernel: (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
11:46:53  kernel: ahcich2: Timeout on slot 5 port 0
11:46:53  kernel: ahcich2: is 00000000 cs 00070000 ss 0007ffe0 rs 0007ffe0 tfd 40 serr 00000000 cmd 0004d017
11:46:53  kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 bd 0f 40 34 00 00 01 00 00
11:46:53  kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
11:46:53  kernel: (ada2:ahcich2:0:0:0): Retrying command, 3 more tries remain
11:49:21  kernel: ahcich2: Timeout on slot 31 port 0
11:49:21  kernel: ahcich2: is 00000000 cs 00020000 ss 8003ffff rs 8003ffff tfd 40 serr 00000000 cmd 0004d017
11:49:21  kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 28 7a 4a 40 35 00 00 01 00 00
11:49:21  kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
11:49:21  kernel: (ada2:ahcich2:0:0:0): Retrying command, 3 more tries remain
11:49:21  kernel: ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
11:49:21  kernel: ada2: < > detached
11:49:21  kernel: GEOM_ELI: g_eli_write_done() failed (error=6) ada2p1.eli[WRITE(offset=10917146624, length=4096)]
11:49:21  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) ada2p1.eli[READ(offset=270336, length=8192)]
11:49:21  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) ada2p1.eli[READ(offset=34358960128, length=8192)]
11:49:21  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) ada2p1.eli[READ(offset=34359222272, length=8192)]
11:49:23  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) ada2p1.eli[READ(offset=270336, length=8192)]
11:49:23  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) ada2p1.eli[READ(offset=34358960128, length=8192)]
11:49:23  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) ada2p1.eli[READ(offset=34359222272, length=8192)]
11:49:23  ZFS[61041]: vdev state changed, pool_guid= vdev_guid=
11:49:23  ZFS[64175]: vdev state changed, pool_guid= vdev_guid=
11:49:51  kernel: ahcich2: Timeout on slot 19 port 0
11:49:51  kernel: ahcich2: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 0004d217
11:49:51  kernel: (aprobe0:ahcich2:0:0:0): SETFEATURES SET TRANSFER MODE. ACB: ef 03 00 00 00 40 00 00 00 00 08 00
11:49:51  kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout
11:49:51  kernel: (aprobe0:ahcich2:0:0:0): Retrying command, 0 more tries remain
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=423407169536, length=1003520)]
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=423408177152, length=1011712)]
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=270336, length=8192)]
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=3966424588288, length=8192)]
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=3966424850432, length=8192)]
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=423409188864, length=1011712)]
11:49:51  kernel: GEOM_ELI: g_eli_read_done() failed (error=6) gpt/storage2.eli[READ(offset=423410204672, length=1011712)]
11:49:51  ZFS[72126]: vdev state changed, pool_guid= vdev_guid=
11:49:51  kernel: GEOM_ELI: Device ada2p1.eli destroyed.
11:49:51  kernel: GEOM_ELI: Detached ada2p1.eli on last close.
11:49:51  ZFS[79977]: vdev is removed, pool_guid= vdev_guid=
11:49:51  kernel: GEOM_ELI: Device gpt/storage2.eli destroyed.
11:49:51  kernel: GEOM_ELI: Detached gpt/storage2.eli on last close.
11:49:51  kernel: (ada2:ahcich2:0:0:0): Periph destroyed
Comment 3 Allan Jude freebsd_committer 2019-05-04 17:41:33 UTC
If you are sure it is what you want to do, you can increase the timeout from 30 seconds:

sysctl kern.cam.ada.default_timeout=60
Comment 4 Tomasz "CeDeROM" CEDRO 2019-08-07 13:21:38 UTC
Hello world,

The same here with Panasonic Toughbook CF-MX4 and M2 SSD 256GB SAMSUNG drive :-(

sysctl kern.cam.ada.default_timeout=100 helps a bit.. but still I cannot even install system on that machine..

Is this a M2 SATA Controller Driver problem or the M2 disk wear is already that bad I should replace a disk?

Any hints appreciated :-)