Created attachment 149801 [details] dmesg and pciconf output Since the upgrade of my HP Proliant Microserver, I can't start "zpool scrub" on my zraid1 pool build with 5 SATA hard drive. My log file display lot's of "CAM status: Command timeout", then the system can't no more access to the zpool. Here is an extract: Nov 23 18:16:25 nas kernel: (aprobe2:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Nov 23 18:16:25 nas kernel: (aprobe2:ahcich2:0:0:0): CAM status: Command timeout Nov 23 18:16:25 nas kernel: ahcich5: (aprobe2:Timeout on slot 1 port 0 Nov 23 18:16:25 nas kernel: ahcich2:0:ahcich5: 0:0): is 00000002 cs 00000000 ss 00000000 rs 00000002 tfd 50 serr 00000000 cmd 00006117 Nov 23 18:16:25 nas kernel: Error 5, Retries exhausted Nov 23 18:16:25 nas kernel: (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Nov 23 18:16:25 nas kernel: ahcich3: (aprobe0:ahcich5:0:0:0): CAM status: Command timeout Nov 23 18:16:25 nas kernel: Timeout on slot 8 port 0 Nov 23 18:16:25 nas kernel: (aprobe0:ahcich3: ahcich5:0:is 00000002 cs 00000000 ss 00000000 rs 00000100 tfd 50 serr 00000000 cmd 00006817 Nov 23 18:16:25 nas kernel: 0:0): Error 5, Retries exhausted Nov 23 18:16:25 nas kernel: (aprobe1:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Nov 23 18:16:25 nas kernel: (aprobe1:ahcich3:0:0:0): CAM status: Command timeout Nov 23 18:16:25 nas kernel: (aprobe1:ahcich3:0:0:0): Error 5, Retries exhausted Nov 23 18:16:55 nas kernel: ahcich3: Timeout on slot 8 port 0 Nov 23 18:16:55 nas kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 00000100 tfd 50 serr 00000000 cmd 00006817 Nov 23 18:16:55 nas kernel: (aprobe1:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Nov 23 18:16:55 nas kernel: (aprobe1:ahcich3:0:0:0): CAM status: Command timeout Nov 23 18:16:55 nas kernel: ahcich5: (aprobe1:Timeout on slot 1 port 0 Nov 23 18:16:55 nas kernel: ahcich3:0:ahcich5: 0:is 00000002 cs 00000000 ss 00000000 rs 00000002 tfd 50 serr 00000000 cmd 00006117 Nov 23 18:16:55 nas kernel: 0): Error 5, Retry was blocked Nov 23 18:16:55 nas kernel: (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Nov 23 18:16:55 nas kernel: ahcich2: (aprobe0:ahcich5:0:0:0): CAM status: Command timeout Some information about the zpool: zpool status -v storage pool: storage state: ONLINE scan: scrub canceled on Sun Nov 16 22:28:54 2014 config: NAME STATE READ WRITE CKSUM storage ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ada0p1 ONLINE 0 0 0 ada1p1 ONLINE 0 0 0 ada2p1 ONLINE 0 0 0 ada3p1 ONLINE 0 0 0 ada4p1 ONLINE 0 0 0 errors: No known data errors I've attached dmesg and pciconf output.
Another people seems to meet the same "CAM status: Command timeout" problem (but with SSD drive and not standard) since upgrade to 10.1 here: https://lists.freebsd.org/pipermail/freebsd-stable/2014-November/080914.html
Hi Olivier, I confirm that regression too. I encounter the exact same problem on my HP 658553-421 ProLiant MicroServer. When trying to copy a lot of big files from another machine through NFS, several disks suddenly become unavailable and are detached. It happens randomly, a few hours after having started the copy. After having rebooted the machine, zpool show no errors, neither does a smartctl on each drive of the pool. I am able to start the copy again... until the next timeouts. # uname -a FreeBSD tank.martymac.org 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 # pciconf -vl [...] ahci0@pci0:0:17:0: class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00 vendor = 'Advanced Micro Devices [AMD] nee ATI' device = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]' class = mass storage subclass = SATA [...]
Same on this morning, my NAS lost 3 disks (ada1, ada2, ada3) and even panic'ed ! See attached messages and core.txt.
Created attachment 149957 [details] /var/log/messages before panic
Created attachment 149958 [details] /var/crash/core.txt.5 after panic
Hi, I am also experiencing the same issues on my HP Microserver (N40L) since upgrading to 10.1. Same dmesg output and crash profile, same achi controller : # uname -a FreeBSD Daguerreo 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 #pciconf -vl [...] ahci0@pci0:0:17:0: class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00 vendor = 'Advanced Micro Devices [AMD] nee ATI' device = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]' class = mass storage subclass = SATA [...] I was able to reduce the mean time between crashes by slowing down the sata links to 1.5 Gb/s (by adding `set hint.ahcich.[0..4].sata_rev=1` to /boot/loader.rc.local). It still crashes when scrubbing my zpool, but it will do so much farther.
Getting the exact same problem. I've been running FreeBSD 10.0 since it was released and haven't had any problems. I have a HP Proliant Microserver and running heavy I/O like zpool scrub causes these error messages after the 10.1 upgrade. System hangs and a reboot is required. I don't have a core dump at this moment in time. Error messsage below: Dec 1 23:49:49 bilbo kernel: ahcich2: Timeout on slot 13 port 0 Dec 1 23:49:49 bilbo kernel: ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00003000 tfd 40 serr 00000000 cmd 00206d17 Dec 1 23:49:49 bilbo kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 78 f9 be 40 13 00 00 00 00 00 Dec 1 23:49:49 bilbo kernel: ahcich1: (ada2:ahcich2:0:0:0): CAM status: Command timeout Dec 1 23:49:49 bilbo kernel: Timeout on slot 7 port 0 Dec 1 23:49:49 bilbo kernel: (ada2:ahcich1: ahcich2:0:is 00000008 cs 00000000 ss 00000000 rs 00000080 tfd 40 serr 00000000 cmd 00206717 Dec 1 23:49:49 bilbo kernel: 0:0): Retrying command Dec 1 23:49:49 bilbo kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 50 d0 03 bf 40 13 00 00 00 00 00 Dec 1 23:49:49 bilbo kernel: ahcich3: (ada1:ahcich1:0:0:0): CAM status: Command timeout Dec 1 23:49:49 bilbo kernel: Timeout on slot 4 port 0 Dec 1 23:49:49 bilbo kernel: (ada1:ahcich3: ahcich1:0:is 00000008 cs 00000000 ss 00000000 rs 00000010 tfd 40 serr 00000000 cmd 00206417 Dec 1 23:49:49 bilbo kernel: 0:0): Retrying command Dec 1 23:49:49 bilbo kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 58 c8 03 bf 40 13 00 00 00 00 00 Dec 1 23:49:49 bilbo kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 1 23:49:49 bilbo kernel: (ada3:ahcich3:0:0:0): Retrying command # pciconf -vl ahci0@pci0:0:17:0: class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00 vendor = 'Advanced Micro Devices [AMD] nee ATI' device = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]' class = mass storage subclass = SATA # uname -a FreeBSD bilbo 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
We are also experiencing this issue on multiple HP Microserver N54L with the ahci driver.
Dec 1 17:44:06 nas01 kernel: ahcich3: Timeout on slot 6 port 0 Dec 1 17:44:06 nas01 kernel: ahcich3: is 00000008 cs 00000000 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 00006617 Dec 1 17:44:06 nas01 kernel: (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 9e 2f 40 3e 00 00 01 00 00 Dec 1 17:44:06 nas01 kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 1 17:44:06 nas01 kernel: ahcich2: (ada3:Timeout on slot 15 port 0 Dec 1 17:44:06 nas01 kernel: ahcich3:0:ahcich2: 0:is 00000008 cs 00000000 ss 00000000 rs 00008000 tfd 40 serr 00000000 cmd 00006f17 Dec 1 17:44:06 nas01 kernel: 0): Retrying command Dec 1 17:44:06 nas01 kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 e8 a9 2f 40 3e 00 00 00 00 00 Dec 1 17:44:06 nas01 kernel: ahcich1: (ada2:ahcich2:0:0:0): CAM status: Command timeout pciconf -vl ahci0@pci0:0:17:0: class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00 vendor = 'Advanced Micro Devices [AMD] nee ATI' device = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]' class = mass storage subclass = SATA dmesg: ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0 ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported This issue is reproduceable by causing load (such as zpool scrub). Seems to hit the issue much more frequently with > 2 disks in the pool - we see it on our HP microservers with 4 and 5 disks. Downgrading SATA to v1 (via loader hints) and disabling NCQ (via camcontrol tags -N 1) seems to delay, not fix the inevitable timeouts.
I reach to finish an 12 hours ZFS scrub by disabling MSI interrupts for AHCI ! (thanks mav@ for the temporary workaround). Here is how to do it: echo 'hint.ahci.0.msi="0"' >> /boot/loader.conf shutdown -r now Once rebooted, check that `vmstat -i` report ahci0 using interrupt below 256. Like this example: root@nas:~ # vmstat -i interrupt total rate irq17: ehci0 ehci1+ 19403 4 irq18: ohci0 ohci1+ 4 0 irq19: ahci0 3908 0 irq256: hpet0:t0 112433 26 irq257: hpet0:t1 33742 8 irq260: bge0 3354 0 Total 172844 41 With this workaround, ZFS scrubbing is finished without any "Command timeout": root@nas:~ # zpool status pool: storage state: ONLINE scan: scrub repaired 0 in 11h55m with 0 errors on Thu Dec 4 01:29:42 2014 config: NAME STATE READ WRITE CKSUM storage ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ada0p1 ONLINE 0 0 0 ada1p1 ONLINE 0 0 0 ada2p1 ONLINE 0 0 0 ada3p1 ONLINE 0 0 0 ada4p1 ONLINE 0 0 0 Can someone test this workaround and report its result too ?
I tested the workaround echo 'hint.ahci.0.msi="0"' >> /boot/loader.conf shutdown -r now This is my zpool. # zpool list NAME SIZE ALLOC FREE FRAG EXPANDSZ CAP DEDUP HEALTH ALTROOT data 3.62T 1.56T 2.06T 6% - 43% 1.00x ONLINE - pool: data state: ONLINE scan: scrub repaired 0 in 3h9m with 0 errors on Thu Dec 4 11:55:29 2014 config: NAME STATE READ WRITE CKSUM data ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 ada2 ONLINE 0 0 0 ada3 ONLINE 0 0 0 cache ada4p4 ONLINE 0 0 0 errors: No known data errors After the reboot I'm able to run a full zpool scrub without any error messages or problems. The workaround seems to work - great! Thank you!
It also works for me, thanks :)
Here is a case where disabeling msi did not help: https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081243.html
My HP Proliant N54L is also affected. Sometimes it works for a week, sometimes hangs every hour. Heavily depends on load, but not entirely. I'm pretty sure I had this problem with 10-STABLE before first BETA. As clever as I am, I "accidently" upgraded the pool to embedded_data, so I coudn't get back to 10.0 or even 9.3. And since in my experience with FreeBSD, it breaks only when HW breaks, I was sure i's a HW problem. However, I ran SmartOS (Illumos based) to scrub this pool, then ran it for a week, moved a couple of hundred gigs on and off the pool, no problems.
(In reply to Kurt Jaeger from comment #13) > Here is a case where disabeling msi did not help: > > https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081243.html I don't think that is the same issue. The rest of us were: 1. not seeing any checksum errors; and 2. seeing the timeouts crossing multiple ahcich devices That doesn't appear to be the case in the referenced link
(In reply to pascal.guitierrez from comment #15) > (In reply to Kurt Jaeger from comment #13) > > Here is a case where disabeling msi did not help: > > > > https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081243.html > > > I don't think that is the same issue. > > The rest of us were: > > 1. not seeing any checksum errors; and > 2. seeing the timeouts crossing multiple ahcich devices > > That doesn't appear to be the case in the referenced link I agree. For me, even the CD-ROM port timed out, which hosts the fifth drive in separate enclosure. And hint.ahci.0.msi="0" workaround helps here too. I tried to backport http://svnweb.freebsd.org/base/head/sys/dev/ahci/ahci.c?r1=271201&r2=271200&pathrev=271201 but this does not help, still hangs. Not that I know what this actually does though.
This is also present in Debian kBSD 10.1 on an HP Microserver N40L.....
I confirm that regression too. Since the upgrade of my HP N54L ( 4x HDD ) I have exact same problem.
Confirm on Microserver NL36
I just experienced this on my HP Proliant MicroServer after upgrading to 10.1. An overnight zpool scrub left ahcich2 and ahcich3 both timing out. System was unresponsibe and had to force power off. Everything seemed to come up ok and the scrub resumed. % sysctl hw.model hw.model: AMD Athlon(tm) II Neo N36L Dual-Core Processor % zpool status pool: bootpool state: ONLINE scan: scrub repaired 0 in 0h0m with 0 errors on Sat Dec 20 03:02:21 2014 config: NAME STATE READ WRITE CKSUM bootpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 gptid/10b51084-a026-11e3-b392-d485646a6e25 ONLINE 0 0 0 gptid/11887bb6-a026-11e3-b392-d485646a6e25 ONLINE 0 0 0 gptid/12413ebe-a026-11e3-b392-d485646a6e25 ONLINE 0 0 0 gptid/131a1743-a026-11e3-b392-d485646a6e25 ONLINE 0 0 0 errors: No known data errors pool: zroot state: ONLINE scan: scrub in progress since Sat Dec 20 03:01:45 2014 469G scanned out of 547G at 45/s, (scan is slow, no estimated time) 0 repaired, 85.73% done config: NAME STATE READ WRITE CKSUM zroot ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ada0p4.eli ONLINE 0 0 0 ada1p4.eli ONLINE 0 0 0 ada2p4.eli ONLINE 0 0 0 ada3p4.eli ONLINE 0 0 0 errors: No known data errors % vmstat -i interrupt total rate irq17: ehci0 ehci1+ 256 0 irq18: ohci0 ohci1+ 4 0 irq256: hpet0:t0 1768123 1001 irq257: hpet0:t1 1814676 1027 irq259: bge0 9355 5 irq260: ahci0:ch0 351163 198 irq261: ahci0:ch1 350469 198 irq262: ahci0:ch2 316033 178 irq263: ahci0:3 309137 175 Total 4919216 2785
Same here on a 54L, I am currently testing the workaround. # pciconf -vl ahci0@pci0:0:17:0: class=0x010601 card=0x1609103c chip=0x43911002 rev=0x40 hdr=0x00 vendor = 'Advanced Micro Devices [AMD] nee ATI' device = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]' class = mass storage subclass = SATA # sysctl hw.model hw.model: AMD Turion(tm) II Neo N54L Dual-Core Processor # uname -a FreeBSD xx 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 dmesg extract: Dec 29 10:00:18 xx kernel: ahcich3: Timeout on slot 30 port 0 Dec 29 10:00:18 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17 Dec 29 10:00:18 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:00:18 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 29 10:00:18 xx kernel: (aprobe0:ahcich3:0:0:0): Retrying command Dec 29 10:00:19 xx kernel: ahcich1: Timeout on slot 15 port 0 Dec 29 10:00:19 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17 Dec 29 10:00:19 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:00:19 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout Dec 29 10:00:19 xx kernel: (aprobe2:ahcich1:0:0:0): Retrying command Dec 29 10:00:20 xx kernel: ahcich2: Timeout on slot 19 port 0 Dec 29 10:00:20 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317 Dec 29 10:00:20 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:00:20 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout Dec 29 10:00:20 xx kernel: (aprobe1:ahcich2:0:0:0): Retrying command Dec 29 10:00:49 xx kernel: ahcich3: Timeout on slot 30 port 0 Dec 29 10:00:49 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17 Dec 29 10:00:49 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:00:49 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 29 10:00:49 xx kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted Dec 29 10:00:50 xx kernel: ahcich1: Timeout on slot 15 port 0 Dec 29 10:00:50 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17 Dec 29 10:00:50 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:00:50 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout Dec 29 10:00:50 xx kernel: (aprobe2:ahcich1:0:0:0): Error 5, Retries exhausted Dec 29 10:00:50 xx kernel: ahcich2: Timeout on slot 19 port 0 Dec 29 10:00:50 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317 Dec 29 10:00:50 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:00:50 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout Dec 29 10:00:50 xx kernel: (aprobe1:ahcich2:0:0:0): Error 5, Retries exhausted Dec 29 10:01:19 xx kernel: ahcich3: Timeout on slot 30 port 0 Dec 29 10:01:19 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17 Dec 29 10:01:19 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:01:19 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 29 10:01:19 xx kernel: (aprobe0:ahcich3:0:0:0): Retrying command Dec 29 10:01:20 xx kernel: ahcich1: Timeout on slot 15 port 0 Dec 29 10:01:20 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17 Dec 29 10:01:20 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:01:20 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout Dec 29 10:01:20 xx kernel: (aprobe2:ahcich1:0:0:0): Retrying command Dec 29 10:01:20 xx kernel: ahcich2: Timeout on slot 19 port 0 Dec 29 10:01:20 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317 Dec 29 10:01:20 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:01:20 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout Dec 29 10:01:20 xx kernel: (aprobe1:ahcich2:0:0:0): Retrying command Dec 29 10:01:49 xx kernel: ahcich3: Timeout on slot 30 port 0 Dec 29 10:01:49 xx kernel: ahcich3: is 00000002 cs 00000000 ss 00000000 rs 40000000 tfd 50 serr 00000000 cmd 00007e17 Dec 29 10:01:49 xx kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:01:49 xx kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 29 10:01:49 xx kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted Dec 29 10:01:50 xx kernel: ahcich1: Timeout on slot 15 port 0 Dec 29 10:01:50 xx kernel: ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00006f17 Dec 29 10:01:50 xx kernel: (aprobe2:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:01:50 xx kernel: (aprobe2:ahcich1:0:0:0): CAM status: Command timeout Dec 29 10:01:50 xx kernel: (aprobe2:ahcich1:0:0:0): Error 5, Retries exhausted Dec 29 10:01:51 xx kernel: ahcich2: Timeout on slot 19 port 0 Dec 29 10:01:51 xx kernel: ahcich2: is 00000002 cs 00000000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 00007317 Dec 29 10:01:51 xx kernel: (aprobe1:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 29 10:01:51 xx kernel: (aprobe1:ahcich2:0:0:0): CAM status: Command timeout Dec 29 10:01:51 xx kernel: (aprobe1:ahcich2:0:0:0): Error 5, Retries exhausted
Same happened in the middle of 9.3 upgrade to 10.1. Same error, different controller (see below), gmirror raid1. ahci0@pci0:0:18:0: class=0x01018f card=0x73271462 chip=0x43801002 rev=0x00 hdr=0x00 vendor = 'Advanced Micro Devices [AMD] nee ATI' device = 'SB600 Non-Raid-5 SATA' class = mass storage subclass = ATA
(In reply to Nerijus from comment #22) > Same happened in the middle of 9.3 upgrade to 10.1. Same error, different > controller (see below), gmirror raid1. > > ahci0@pci0:0:18:0: class=0x01018f card=0x73271462 chip=0x43801002 > rev=0x00 hdr=0x00 > vendor = 'Advanced Micro Devices [AMD] nee ATI' > device = 'SB600 Non-Raid-5 SATA' > class = mass storage > subclass = ATA After rereading more carefully this report, noticed there're more than one error type. Mine is: Dec 31 17:31:20 nitrogen kernel: ahcich2: Timeout on slot 25 port 0 Dec 31 17:31:20 nitrogen kernel: ahcich2: is 00000000 cs c1ffffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004be17 Dec 31 17:31:20 nitrogen kernel: (ada1:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 e2 4e 15 40 02 00 00 00 00 00 Dec 31 17:31:20 nitrogen kernel: (ada1:ahcich2:0:0:0): CAM status: Command timeout Dec 31 17:31:20 nitrogen kernel: (ada1:ahcich2:0:0:0): Retrying command Dec 31 17:31:20 nitrogen kernel: ahcich0: Timeout on slot 4 port 0 Dec 31 17:31:20 nitrogen kernel: ahcich0: is 00000000 cs fffffe0f ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004a917 Dec 31 17:31:20 nitrogen kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 e2 4e 15 40 02 00 00 00 00 00 Dec 31 17:31:20 nitrogen kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout Dec 31 17:31:20 nitrogen kernel: (ada0:ahcich0:0:0:0): Retrying command
Is there anyone who had this problem and the 'hint.ahci.0.msi="0"' workaround did _not_ help? FWIW, I have the same problem with HP Microserver N40L, with four WD Red in RAIDZ.
This workaround works for me too. I have 2 HP Proliant Microserver N54L with 3 WD RED 3TB and ZFS raidz.
(In reply to Edward Tomasz Napierala from comment #24) How do I verify if this setting is activated? I have two servers one set and one not. Both report low IRQ used (e.g. irq22).
Well, if you're running 10.1 and do not have those disk errors, then it works :-)
(In reply to Edward Tomasz Napierala from comment #27) As per my comment #23 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195349#c23) I did have an issue and it did not seem to be very predictable. :) I have one server with the setting in loader.conf (and rebooted) and another without. Both running 10.1. So far (a bit over 10 days) both are running without those errors. And I'm wondering if that setting has any effect and how do I verify it is set?
Hi, I just ran into this issue as well, this is a N54L with 3 zpools using a 30G SSD (connected to the fifth SATA CD-ROM port) as L2ARC. I'm also using the workaround described here and it does not make any difference. Here are some details about the machine: $ uname -a FreeBSD cube.home.gfuzz.de 10.1-STABLE FreeBSD 10.1-STABLE #0 ebf97d8(HEAD): Tue Jan 27 12:27:05 UTC 2015 root@cube.home.gfuzz.de:/usr/obj/usr/src/sys/GENERIC amd64 $ grep hint /boot/loader.conf hint.ahci.0.msi="0" $ zpool list -v NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT dpool 1.81T 653G 1.17T - 22% 35% 1.00x ONLINE - mirror 1.81T 653G 1.17T - 22% 35% gpt/dpool0 - - - - - - gpt/dpool1 - - - - - - cache - - - - - - ada4p1 2.00G 1.90G 99.8M - 0% 95% fpool 137G 31.8G 105G - 20% 23% 1.00x ONLINE - diskid/DISK-WD-WMASU0331725p4 68.5G 15.9G 52.6G - 20% 23% diskid/DISK-WD-WCASY6487340p4 68.5G 15.9G 52.6G - 20% 23% cache - - - - - - ada4p6 2.00G 14.8M 1.98G - 0% 0% zroot 392G 18.7G 373G - 2% 4% 1.00x ONLINE - mirror 392G 18.7G 373G - 2% 4% diskid/DISK-WD-WMASU0331725p3 - - - - - - diskid/DISK-WD-WCASY6487340p3 - - - - - - cache - - - - - - ada4p3 2.00G 12.4M 1.98G - 0% 0% $ vmstat -i interrupt total rate irq14: ata0 45120 34 irq18: ohci0 ohci1+ 4 0 irq19: ahci0 1899578 1461 irq256: hpet0:t0 1482445 1140 irq257: hpet0:t1 970039 746 irq259: bge0 21615 16 Total 4418801 3399 $ tail /var/log/messages Feb 1 21:47:23 cube kernel: (ada4:ata0:0:1:0): Retrying command Feb 1 21:51:14 cube kernel: (ada4:ata0:0:1:0): READ_DMA. ACB: c8 00 d3 3f 80 40 00 00 00 00 01 00 Feb 1 21:51:14 cube kernel: (ada4:ata0:0:1:0): CAM status: Command timeout Feb 1 21:51:14 cube kernel: (ada4:ata0:0:1:0): Retrying command Feb 1 21:52:04 cube kernel: (ada4:ata0:0:1:0): READ_DMA. ACB: c8 00 2b 1d 0c 40 00 00 00 00 01 00 Feb 1 21:52:04 cube kernel: (ada4:ata0:0:1:0): CAM status: Command timeout Feb 1 21:52:04 cube kernel: (ada4:ata0:0:1:0): Retrying command Feb 1 21:52:54 cube kernel: (ada4:ata0:0:1:0): READ_DMA. ACB: c8 00 e3 eb 0d 40 00 00 00 00 01 00 Feb 1 21:52:54 cube kernel: (ada4:ata0:0:1:0): CAM status: Command timeout Feb 1 21:52:54 cube kernel: (ada4:ata0:0:1:0): Retrying command
This could well be fixed by: https://svnweb.freebsd.org/changeset/base/278034 If you still have the effected system and can check by either applying the patch and rebuilding and installing an updated kernel Alternatively rebooting with the following in /boot/loader.conf which is manually overriding the msi count should help confirm as well. hint.achi.0.msi="1"
I have a N40L with a single zroot that I boot off of consisting of three 1.5TB platter drives. Before I could not complete a scrub, however after setting the hint.achi.0.msi="1" and rebooting, I have successfully completed my first scrub in several weeks.
Thanks for confirming Evan, based on that I'm going to close this bug.
Why is this bug being closed? Not sure if the fix is the patch or the boot loader hint.
Hi David, Both patch and boot loader hint fix the issue for me. I didn't experience any further locks or crashes on my zpools. Steven was so kind to assist me with my issue and we suspect my old SSD and/or crappy cabling causing the timeouts on my side. The boot loader hint restores the previous behaviour, the patch fixes the new behaviour to work properly. Cheers
Assign committer who closed issue