Bug 195349

Summary: CAM status: Command timeout since upgrade to 10.1
Product: Base System Reporter: olivier
Component: kernAssignee: Steven Hartland <smh>
Status: Closed FIXED    
Severity: Affects Some People CC: admon.vector, akos.tamasi, art, ben, dave+freebsd, david.n.arnold, emaste, evan, igor.arabesc.pavlov, info, je.broman, lists, martymac, maxime, nk, pascal.guitierrez, petteri.valkonen, pi, sh, simond, smh, tbabut, trasz, vbk7, web.freebsd, yudi.tux
Priority: ---    
Version: 10.1-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
dmesg and pciconf output
none
/var/log/messages before panic
none
/var/crash/core.txt.5 after panic none

Description olivier 2014-11-24 22:32:35 UTC
Created attachment 149801 [details]
dmesg and pciconf output

Since the upgrade of my HP Proliant Microserver, I can't start "zpool scrub" on my zraid1 pool build with 5 SATA hard drive.
My log file display lot's of "CAM status: Command timeout", then the system can't no more access to the zpool.

Here is an extract:

Nov 23 18:16:25 nas kernel: (aprobe2:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 23 18:16:25 nas kernel: (aprobe2:ahcich2:0:0:0): CAM status: Command timeout
Nov 23 18:16:25 nas kernel: ahcich5: (aprobe2:Timeout on slot 1 port 0
Nov 23 18:16:25 nas kernel: ahcich2:0:ahcich5: 0:0): is 00000002 cs 00000000 ss 00000000 rs 00000002 tfd 50 serr 00000000 cmd 00006117
Nov 23 18:16:25 nas kernel: Error 5, Retries exhausted
Nov 23 18:16:25 nas kernel: (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 23 18:16:25 nas kernel: ahcich3: (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Nov 23 18:16:25 nas kernel: Timeout on slot 8 port 0
Nov 23 18:16:25 nas kernel: (aprobe0:ahcich3: ahcich5:0:is 00000002 cs 00000000 ss 00000000 rs 00000100 tfd 50 serr 00000000 cmd 00006817
Nov 23 18:16:25 nas kernel: 0:0): Error 5, Retries exhausted
Nov 23 18:16:25 nas kernel: (aprobe1:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 23 18:16:25 nas kernel: (aprobe1:ahcich3:0:0:0): CAM status: Command timeout
Nov 23 18:16:25 nas kernel: (aprobe1:ahcich3:0:0:0): Error 5, Retries exhausted
Nov 23 18:16:55 nas kernel: ahcich3: Timeout on slot 8 port 0
Nov 23 18:16:55 nas kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 00000100 tfd 50 serr 00000000 cmd 00006817
Nov 23 18:16:55 nas kernel: (aprobe1:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 23 18:16:55 nas kernel: (aprobe1:ahcich3:0:0:0): CAM status: Command timeout
Nov 23 18:16:55 nas kernel: ahcich5: (aprobe1:Timeout on slot 1 port 0
Nov 23 18:16:55 nas kernel: ahcich3:0:ahcich5: 0:is 00000002 cs 00000000 ss 00000000 rs 00000002 tfd 50 serr 00000000 cmd 00006117
Nov 23 18:16:55 nas kernel: 0): Error 5, Retry was blocked
Nov 23 18:16:55 nas kernel: (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 23 18:16:55 nas kernel: ahcich2: (aprobe0:ahcich5:0:0:0): CAM status: Command timeout

Some information about the zpool:

 zpool status -v storage
  pool: storage
 state: ONLINE
  scan: scrub canceled on Sun Nov 16 22:28:54 2014
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p1  ONLINE       0     0     0
            ada1p1  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
            ada3p1  ONLINE       0     0     0
            ada4p1  ONLINE       0     0     0

errors: No known data errors

I've attached dmesg and pciconf output.
Comment 1 olivier 2014-11-24 22:34:44 UTC
Another people seems to meet the same "CAM status:
Command timeout" problem (but with SSD drive and not standard) since upgrade to 10.1 here:
https://lists.freebsd.org/pipermail/freebsd-stable/2014-November/080914.html
Comment 2 Ganael LAPLANCHE freebsd_committer freebsd_triage 2014-11-27 09:44:21 UTC
Hi Olivier,

I confirm that regression too.

I encounter the exact same problem on my HP 658553-421 ProLiant MicroServer. When trying to copy a lot of big files from another machine through NFS, several disks suddenly become unavailable and are detached. It happens randomly, a few hours after having started the copy.

After having rebooted the machine, zpool show no errors, neither does a smartctl on each drive of the pool. I am able to start the copy again... until the next timeouts.

# uname -a
FreeBSD tank.martymac.org 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

# pciconf -vl
[...]
ahci0@pci0:0:17:0:      class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00
    vendor     = 'Advanced Micro Devices [AMD] nee ATI'
    device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA
[...]
Comment 3 Ganael LAPLANCHE freebsd_committer freebsd_triage 2014-11-28 12:29:23 UTC
Same on this morning, my NAS lost 3 disks (ada1, ada2, ada3) and even panic'ed !

See attached messages and core.txt.
Comment 4 Ganael LAPLANCHE freebsd_committer freebsd_triage 2014-11-28 12:30:33 UTC
Created attachment 149957 [details]
/var/log/messages before panic
Comment 5 Ganael LAPLANCHE freebsd_committer freebsd_triage 2014-11-28 12:31:23 UTC
Created attachment 149958 [details]
/var/crash/core.txt.5 after panic
Comment 6 maxime 2014-11-29 21:47:37 UTC
Hi,

I am also experiencing the same issues on my HP Microserver (N40L) since upgrading to 10.1.
Same dmesg output and crash profile, same achi controller :

# uname -a 
FreeBSD Daguerreo 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

#pciconf -vl
[...]
ahci0@pci0:0:17:0:	class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00
    vendor     = 'Advanced Micro Devices [AMD] nee ATI'
    device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA
[...]

I was able to reduce the mean time between crashes by slowing down the sata links to 1.5 Gb/s (by adding `set hint.ahcich.[0..4].sata_rev=1` to /boot/loader.rc.local). It still crashes when scrubbing my zpool, but it will do so much farther.
Comment 7 je.broman 2014-12-01 23:15:58 UTC
Getting the exact same problem. I've been running FreeBSD 10.0 since it was released and haven't had any problems. I have a HP Proliant Microserver and running heavy I/O like zpool scrub causes these error messages after the 10.1 upgrade. System hangs and a reboot is required. I don't have a core dump at this moment in time. Error messsage below:

Dec  1 23:49:49 bilbo kernel: ahcich2: Timeout on slot 13 port 0
Dec  1 23:49:49 bilbo kernel: ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00003000 tfd 40 serr 00000000 cmd 00206d17
Dec  1 23:49:49 bilbo kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 78 f9 be 40 13 00 00 00 00 00
Dec  1 23:49:49 bilbo kernel: ahcich1: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Dec  1 23:49:49 bilbo kernel: Timeout on slot 7 port 0
Dec  1 23:49:49 bilbo kernel: (ada2:ahcich1: ahcich2:0:is 00000008 cs 00000000 ss 00000000 rs 00000080 tfd 40 serr 00000000 cmd 00206717
Dec  1 23:49:49 bilbo kernel: 0:0): Retrying command
Dec  1 23:49:49 bilbo kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 d0 03 bf 40 13 00 00 00 00 00
Dec  1 23:49:49 bilbo kernel: ahcich3: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Dec  1 23:49:49 bilbo kernel: Timeout on slot 4 port 0
Dec  1 23:49:49 bilbo kernel: (ada1:ahcich3: ahcich1:0:is 00000008 cs 00000000 ss 00000000 rs 00000010 tfd 40 serr 00000000 cmd 00206417
Dec  1 23:49:49 bilbo kernel: 0:0): Retrying command
Dec  1 23:49:49 bilbo kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 c8 03 bf 40 13 00 00 00 00 00
Dec  1 23:49:49 bilbo kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
Dec  1 23:49:49 bilbo kernel: (ada3:ahcich3:0:0:0): Retrying command


# pciconf -vl
ahci0@pci0:0:17:0:	class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00
    vendor     = 'Advanced Micro Devices [AMD] nee ATI'
    device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA

# uname -a 
FreeBSD bilbo 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
Comment 8 pascal.guitierrez 2014-12-03 10:30:11 UTC
We are also experiencing this issue on multiple HP Microserver N54L with the ahci driver.
Comment 9 pascal.guitierrez 2014-12-03 10:41:05 UTC
Dec  1 17:44:06 nas01 kernel: ahcich3: Timeout on slot 6 port 0
Dec  1 17:44:06 nas01 kernel: ahcich3: is 00000008 cs 00000000 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 00006617
Dec  1 17:44:06 nas01 kernel: (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 9e 2f 40 3e 00 00 01 00 00
Dec  1 17:44:06 nas01 kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
Dec  1 17:44:06 nas01 kernel: ahcich2: (ada3:Timeout on slot 15 port 0
Dec  1 17:44:06 nas01 kernel: ahcich3:0:ahcich2: 0:is 00000008 cs 00000000 ss 00000000 rs 00008000 tfd 40 serr 00000000 cmd 00006f17
Dec  1 17:44:06 nas01 kernel: 0): Retrying command
Dec  1 17:44:06 nas01 kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 e8 a9 2f 40 3e 00 00 00 00 00
Dec  1 17:44:06 nas01 kernel: ahcich1: (ada2:ahcich2:0:0:0): CAM status: Command timeout


pciconf -vl

ahci0@pci0:0:17:0:      class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00
    vendor     = 'Advanced Micro Devices [AMD] nee ATI'
    device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA

dmesg:

ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported


This issue is reproduceable by causing load (such as zpool scrub). Seems to hit the issue much more frequently with > 2 disks in the pool - we see it on our HP microservers with 4 and 5 disks.

Downgrading SATA to v1 (via loader hints) and disabling NCQ (via camcontrol tags -N 1) seems to delay, not fix the inevitable timeouts.
Comment 10 olivier 2014-12-04 06:23:47 UTC
I reach to finish an 12 hours ZFS scrub by disabling MSI interrupts for AHCI ! (thanks mav@ for the temporary workaround).

Here is how to do it:

echo 'hint.ahci.0.msi="0"' >> /boot/loader.conf
shutdown -r now

Once rebooted, check that `vmstat -i` report ahci0 using interrupt below 256.
Like this example:

root@nas:~ # vmstat -i
interrupt                          total       rate
irq17: ehci0 ehci1+                19403          4
irq18: ohci0 ohci1+                    4          0
irq19: ahci0                        3908          0
irq256: hpet0:t0                  112433         26
irq257: hpet0:t1                   33742          8
irq260: bge0                        3354          0
Total                             172844         41

With this workaround, ZFS scrubbing is finished without any "Command timeout":
root@nas:~ # zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 11h55m with 0 errors on Thu Dec  4 01:29:42 2014
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p1  ONLINE       0     0     0
            ada1p1  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
            ada3p1  ONLINE       0     0     0
            ada4p1  ONLINE       0     0     0

Can someone test this workaround and report its result too ?
Comment 11 je.broman 2014-12-04 18:05:49 UTC
I tested the workaround 

echo 'hint.ahci.0.msi="0"' >> /boot/loader.conf
shutdown -r now

This is my zpool. 
# zpool list

NAME    SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
data   3.62T  1.56T  2.06T     6%         -    43%  1.00x  ONLINE  -

 pool: data
 state: ONLINE
  scan: scrub repaired 0 in 3h9m with 0 errors on Thu Dec  4 11:55:29 2014
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
        cache
          ada4p4    ONLINE       0     0     0

errors: No known data errors

After the reboot I'm able to run a full zpool scrub without any error messages or problems. The workaround seems to work - great! Thank you!
Comment 12 Ganael LAPLANCHE freebsd_committer freebsd_triage 2014-12-05 10:10:43 UTC
It also works for me, thanks :)
Comment 13 Kurt Jaeger freebsd_committer freebsd_triage 2014-12-09 10:18:10 UTC
Here is a case where disabeling msi did not help:

https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081243.html
Comment 14 Art Win 2014-12-09 18:35:55 UTC
My HP Proliant N54L is also affected. Sometimes it works for a week, sometimes hangs every hour. Heavily depends on load, but not entirely. I'm pretty sure I had this problem with 10-STABLE before first BETA. As clever as I am, I "accidently" upgraded the pool to embedded_data, so I coudn't get back to 10.0 or even 9.3. And since in my experience with FreeBSD, it breaks only when HW breaks, I was sure i's a HW problem. However, I ran SmartOS (Illumos based) to scrub this pool, then ran it for a week, moved a couple of hundred gigs on and off the pool, no problems.
Comment 15 pascal.guitierrez 2014-12-11 12:24:35 UTC
(In reply to Kurt Jaeger from comment #13)
> Here is a case where disabeling msi did not help:
> 
> https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081243.html


I don't think that is the same issue.

The rest of us were:

1. not seeing any checksum errors; and
2. seeing the timeouts crossing multiple ahcich devices

That doesn't appear to be the case in the referenced link
Comment 16 Art Win 2014-12-11 12:36:39 UTC
(In reply to pascal.guitierrez from comment #15)
> (In reply to Kurt Jaeger from comment #13)
> > Here is a case where disabeling msi did not help:
> > 
> > https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081243.html
> 
> 
> I don't think that is the same issue.
> 
> The rest of us were:
> 
> 1. not seeing any checksum errors; and
> 2. seeing the timeouts crossing multiple ahcich devices
> 
> That doesn't appear to be the case in the referenced link

I agree. For me, even the CD-ROM port timed out, which hosts the fifth drive in separate enclosure. And hint.ahci.0.msi="0" workaround helps here too.

I tried to backport http://svnweb.freebsd.org/base/head/sys/dev/ahci/ahci.c?r1=271201&r2=271200&pathrev=271201
but this does not help, still hangs. Not that I know what this actually does though.
Comment 17 Ben B 2014-12-19 12:03:31 UTC
This is also present in Debian kBSD 10.1 on an HP Microserver N40L.....
Comment 18 Yaroslav 2014-12-19 18:59:00 UTC
I confirm that regression too.

Since the upgrade of my HP N54L ( 4x HDD ) I have exact same problem.
Comment 19 Yuri 2014-12-20 07:15:48 UTC
Confirm on Microserver NL36
Comment 20 David A 2014-12-21 20:39:48 UTC
I just experienced this on my HP Proliant MicroServer after upgrading to 10.1.  An overnight zpool scrub left ahcich2 and ahcich3 both timing out.  System was unresponsibe and had to force power off.  Everything seemed to come up ok and the scrub resumed.

% sysctl hw.model
hw.model: AMD Athlon(tm) II Neo N36L Dual-Core Processor

% zpool status
  pool: bootpool
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Dec 20 03:02:21 2014
config:

	NAME                                            STATE     READ WRITE CKSUM
	bootpool                                        ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/10b51084-a026-11e3-b392-d485646a6e25  ONLINE       0     0     0
	    gptid/11887bb6-a026-11e3-b392-d485646a6e25  ONLINE       0     0     0
	    gptid/12413ebe-a026-11e3-b392-d485646a6e25  ONLINE       0     0     0
	    gptid/131a1743-a026-11e3-b392-d485646a6e25  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub in progress since Sat Dec 20 03:01:45 2014
        469G scanned out of 547G at 45/s, (scan is slow, no estimated time)
        0 repaired, 85.73% done
config:

	NAME            STATE     READ WRITE CKSUM
	zroot           ONLINE       0     0     0
	  raidz1-0      ONLINE       0     0     0
	    ada0p4.eli  ONLINE       0     0     0
	    ada1p4.eli  ONLINE       0     0     0
	    ada2p4.eli  ONLINE       0     0     0
	    ada3p4.eli  ONLINE       0     0     0

errors: No known data errors

% vmstat -i
interrupt                          total       rate
irq17: ehci0 ehci1+                  256          0
irq18: ohci0 ohci1+                    4          0
irq256: hpet0:t0                 1768123       1001
irq257: hpet0:t1                 1814676       1027
irq259: bge0                        9355          5
irq260: ahci0:ch0                 351163        198
irq261: ahci0:ch1                 350469        198
irq262: ahci0:ch2                 316033        178
irq263: ahci0:3                   309137        175
Total                            4919216       2785
Comment 21 admon.vector 2014-12-29 12:32:33 UTC
Same here on a 54L, I am currently testing the workaround.

# pciconf -vl

ahci0@pci0:0:17:0:      class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40                                                                                                                                                              hdr=0x00
    vendor     = 'Advanced Micro Devices [AMD] nee ATI'
    device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA


# sysctl hw.model
hw.model: AMD Turion(tm) II Neo N54L Dual-Core Processor


# uname -a
FreeBSD xx 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

dmesg extract:

Dec 29 10:00:18 xx kernel: ahcich3: Timeout on slot 30 port 0
Dec 29 10:00:18 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17
Dec 29 10:00:18 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:00:18 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Dec 29 10:00:18 xx kernel: (aprobe0:ahcich3:0:0:0): Retrying command
Dec 29 10:00:19 xx kernel: ahcich1: Timeout on slot 15 port 0
Dec 29 10:00:19 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17
Dec 29 10:00:19 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:00:19 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout
Dec 29 10:00:19 xx kernel: (aprobe2:ahcich1:0:0:0): Retrying command
Dec 29 10:00:20 xx kernel: ahcich2: Timeout on slot 19 port 0
Dec 29 10:00:20 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317
Dec 29 10:00:20 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:00:20 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout
Dec 29 10:00:20 xx kernel: (aprobe1:ahcich2:0:0:0): Retrying command
Dec 29 10:00:49 xx kernel: ahcich3: Timeout on slot 30 port 0
Dec 29 10:00:49 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17
Dec 29 10:00:49 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:00:49 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Dec 29 10:00:49 xx kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted
Dec 29 10:00:50 xx kernel: ahcich1: Timeout on slot 15 port 0
Dec 29 10:00:50 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17
Dec 29 10:00:50 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:00:50 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout
Dec 29 10:00:50 xx kernel: (aprobe2:ahcich1:0:0:0): Error 5, Retries exhausted
Dec 29 10:00:50 xx kernel: ahcich2: Timeout on slot 19 port 0
Dec 29 10:00:50 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317
Dec 29 10:00:50 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:00:50 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout
Dec 29 10:00:50 xx kernel: (aprobe1:ahcich2:0:0:0): Error 5, Retries exhausted
Dec 29 10:01:19 xx kernel: ahcich3: Timeout on slot 30 port 0
Dec 29 10:01:19 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17
Dec 29 10:01:19 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:01:19 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Dec 29 10:01:19 xx kernel: (aprobe0:ahcich3:0:0:0): Retrying command
Dec 29 10:01:20 xx kernel: ahcich1: Timeout on slot 15 port 0
Dec 29 10:01:20 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17
Dec 29 10:01:20 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:01:20 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout
Dec 29 10:01:20 xx kernel: (aprobe2:ahcich1:0:0:0): Retrying command
Dec 29 10:01:20 xx kernel: ahcich2: Timeout on slot 19 port 0
Dec 29 10:01:20 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317
Dec 29 10:01:20 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:01:20 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout
Dec 29 10:01:20 xx kernel: (aprobe1:ahcich2:0:0:0): Retrying command
Dec 29 10:01:49 xx kernel: ahcich3: Timeout on slot 30 port 0
Dec 29 10:01:49 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17
Dec 29 10:01:49 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:01:49 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Dec 29 10:01:49 xx kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted
Dec 29 10:01:50 xx kernel: ahcich1: Timeout on slot 15 port 0
Dec 29 10:01:50 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17
Dec 29 10:01:50 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:01:50 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout
Dec 29 10:01:50 xx kernel: (aprobe2:ahcich1:0:0:0): Error 5, Retries exhausted
Dec 29 10:01:51 xx kernel: ahcich2: Timeout on slot 19 port 0
Dec 29 10:01:51 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317
Dec 29 10:01:51 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Dec 29 10:01:51 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout
Dec 29 10:01:51 xx kernel: (aprobe1:ahcich2:0:0:0): Error 5, Retries exhausted
Comment 22 Nerijus 2015-01-02 14:49:54 UTC
Same happened in the middle of 9.3 upgrade to 10.1. Same error, different controller (see below), gmirror raid1.

ahci0@pci0:0:18:0:      class=0x01018f card=0x73271462 chip=0x43801002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices [AMD] nee ATI'
    device     = 'SB600 Non-Raid-5 SATA'
    class      = mass storage
    subclass   = ATA
Comment 23 Nerijus 2015-01-02 15:41:01 UTC
(In reply to Nerijus from comment #22)
> Same happened in the middle of 9.3 upgrade to 10.1. Same error, different
> controller (see below), gmirror raid1.
> 
> ahci0@pci0:0:18:0:      class=0x01018f card=0x73271462 chip=0x43801002
> rev=0x00 hdr=0x00
>     vendor     = 'Advanced Micro Devices [AMD] nee ATI'
>     device     = 'SB600 Non-Raid-5 SATA'
>     class      = mass storage
>     subclass   = ATA

After rereading more carefully this report, noticed there're more than one error type. Mine is:

Dec 31 17:31:20 nitrogen kernel: ahcich2: Timeout on slot 25 port 0
Dec 31 17:31:20 nitrogen kernel: ahcich2: is 00000000 cs c1ffffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004be17
Dec 31 17:31:20 nitrogen kernel: (ada1:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 e2 4e 15 40 02 00 00 00 00 00
Dec 31 17:31:20 nitrogen kernel: (ada1:ahcich2:0:0:0): CAM status: Command timeout
Dec 31 17:31:20 nitrogen kernel: (ada1:ahcich2:0:0:0): Retrying command
Dec 31 17:31:20 nitrogen kernel: ahcich0: Timeout on slot 4 port 0
Dec 31 17:31:20 nitrogen kernel: ahcich0: is 00000000 cs fffffe0f ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004a917
Dec 31 17:31:20 nitrogen kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 e2 4e 15 40 02 00 00 00 00 00
Dec 31 17:31:20 nitrogen kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Dec 31 17:31:20 nitrogen kernel: (ada0:ahcich0:0:0:0): Retrying command
Comment 24 Edward Tomasz Napierala freebsd_committer freebsd_triage 2015-01-05 15:37:23 UTC
Is there anyone who had this problem and the 'hint.ahci.0.msi="0"' workaround did _not_ help?

FWIW, I have the same problem with HP Microserver N40L, with four WD Red in RAIDZ.
Comment 25 akos.tamasi 2015-01-06 14:11:09 UTC
This workaround works for me too.
I have 2 HP Proliant Microserver N54L with 3 WD RED 3TB and ZFS raidz.
Comment 26 Nerijus 2015-01-14 06:35:21 UTC
(In reply to Edward Tomasz Napierala from comment #24)
How do I verify if this setting is activated? I have two servers one set and one not. Both report low IRQ used (e.g. irq22).
Comment 27 Edward Tomasz Napierala freebsd_committer freebsd_triage 2015-01-14 10:55:25 UTC
Well, if you're running 10.1 and do not have those disk errors, then it works :-)
Comment 28 Nerijus 2015-01-14 17:41:23 UTC
(In reply to Edward Tomasz Napierala from comment #27)
As per my comment #23 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195349#c23) I did have an issue and it did not seem to be very predictable. :) I have one server with the setting in loader.conf (and rebooted) and another without. Both running 10.1. So far (a bit over 10 days) both are running without those errors. And I'm wondering if that setting has any effect and how do I verify it is set?
Comment 29 Oliver Peter 2015-02-01 21:58:24 UTC
Hi,

I just ran into this issue as well, this is a N54L with 3 zpools using a 30G SSD (connected to the fifth SATA CD-ROM port) as L2ARC.

I'm also using the workaround described here and it does not make any difference.

Here are some details about the machine:

$ uname -a
FreeBSD cube.home.gfuzz.de 10.1-STABLE FreeBSD 10.1-STABLE #0 ebf97d8(HEAD): Tue Jan 27 12:27:05 UTC 2015     root@cube.home.gfuzz.de:/usr/obj/usr/src/sys/GENERIC  amd64

$ grep hint /boot/loader.conf
hint.ahci.0.msi="0"

$ zpool list -v
NAME                                SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
dpool                              1.81T   653G  1.17T         -    22%    35%  1.00x  ONLINE  -
  mirror                           1.81T   653G  1.17T         -    22%    35%
    gpt/dpool0                         -      -      -         -      -      -
    gpt/dpool1                         -      -      -         -      -      -
cache                                  -      -      -      -      -      -
  ada4p1                           2.00G  1.90G  99.8M         -     0%    95%
fpool                               137G  31.8G   105G         -    20%    23%  1.00x  ONLINE  -
  diskid/DISK-WD-WMASU0331725p4    68.5G  15.9G  52.6G         -    20%    23%
  diskid/DISK-WD-WCASY6487340p4    68.5G  15.9G  52.6G         -    20%    23%
cache                                  -      -      -      -      -      -
  ada4p6                           2.00G  14.8M  1.98G         -     0%     0%
zroot                               392G  18.7G   373G         -     2%     4%  1.00x  ONLINE  -
  mirror                            392G  18.7G   373G         -     2%     4%
    diskid/DISK-WD-WMASU0331725p3      -      -      -         -      -      -
    diskid/DISK-WD-WCASY6487340p3      -      -      -         -      -      -
cache                                  -      -      -      -      -      -
  ada4p3                           2.00G  12.4M  1.98G         -     0%     0%

$ vmstat -i
interrupt                          total       rate
irq14: ata0                        45120         34
irq18: ohci0 ohci1+                    4          0
irq19: ahci0                     1899578       1461
irq256: hpet0:t0                 1482445       1140
irq257: hpet0:t1                  970039        746
irq259: bge0                       21615         16
Total                            4418801       3399

$ tail /var/log/messages
Feb  1 21:47:23 cube kernel: (ada4:ata0:0:1:0): Retrying command
Feb  1 21:51:14 cube kernel: (ada4:ata0:0:1:0): READ_DMA. ACB: c8 00 d3 3f 80 40 00 00 00 00 01 00
Feb  1 21:51:14 cube kernel: (ada4:ata0:0:1:0): CAM status: Command timeout
Feb  1 21:51:14 cube kernel: (ada4:ata0:0:1:0): Retrying command
Feb  1 21:52:04 cube kernel: (ada4:ata0:0:1:0): READ_DMA. ACB: c8 00 2b 1d 0c 40 00 00 00 00 01 00
Feb  1 21:52:04 cube kernel: (ada4:ata0:0:1:0): CAM status: Command timeout
Feb  1 21:52:04 cube kernel: (ada4:ata0:0:1:0): Retrying command
Feb  1 21:52:54 cube kernel: (ada4:ata0:0:1:0): READ_DMA. ACB: c8 00 e3 eb 0d 40 00 00 00 00 01 00
Feb  1 21:52:54 cube kernel: (ada4:ata0:0:1:0): CAM status: Command timeout
Feb  1 21:52:54 cube kernel: (ada4:ata0:0:1:0): Retrying command
Comment 30 Steven Hartland freebsd_committer freebsd_triage 2015-02-01 23:09:36 UTC
This could well be fixed by:
https://svnweb.freebsd.org/changeset/base/278034

If you still have the effected system and can check by either applying the patch and rebuilding and installing an updated kernel

Alternatively rebooting with the following in /boot/loader.conf which is manually overriding the msi count should help confirm as well.
hint.achi.0.msi="1"
Comment 31 evan 2015-02-03 03:28:38 UTC
I have a N40L with a single zroot that I boot off of consisting of three 1.5TB platter drives.

Before I could not complete a scrub, however after setting the hint.achi.0.msi="1" and rebooting, I have successfully completed my first scrub in several weeks.
Comment 32 Steven Hartland freebsd_committer freebsd_triage 2015-02-03 14:16:05 UTC
Thanks for confirming Evan, based on that I'm going to close this bug.
Comment 33 David A 2015-02-05 02:44:27 UTC
Why is this bug being closed?  Not sure if the fix is the patch or the boot loader hint.
Comment 34 Oliver Peter 2015-02-05 07:52:02 UTC
Hi David,

Both patch and boot loader hint fix the issue for me.
I didn't experience any further locks or crashes on my zpools.

Steven was so kind to assist me with my issue and we suspect my old SSD and/or crappy cabling causing the timeouts on my side.

The boot loader hint restores the previous behaviour, the patch fixes the new behaviour to work properly.

Cheers
Comment 35 Kubilay Kocak freebsd_committer freebsd_triage 2015-07-01 11:23:32 UTC
Assign committer who closed issue