Bug 201194 - Kernel: ahcich: Timeout in slot
Summary: Kernel: ahcich: Timeout in slot
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.3-RELEASE
Hardware: amd64 Any
: --- Affects Many People
Assignee: freebsd-bugs mailing list
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2015-06-29 11:21 UTC by Vladimir Noskov
Modified: 2018-07-06 00:13 UTC (History)
9 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Vladimir Noskov 2015-06-29 11:21:19 UTC
FreeBSD wfid78-172 10.1-RELEASE-p23 FreeBSD 10.1-RELEASE-p23 #0: Thu May 14 13:35:13 UTC 2015
root@amd64-builder.pcbsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

May 30 11:43:29 wfid78-172 kernel: ahcich3: Timeout on slot 24 port 0
May 30 11:43:29 wfid78-172 kernel: ahcich3: is 00000008 cs 00000000 ss 00000000 rs 01000000 tfd 40 serr 00000000 cmd 00207817
May 30 11:43:29 wfid78-172 kernel: (ada2:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 bc f5 1e 40 00 00 00 00 00 00
May 30 11:43:29 wfid78-172 kernel: (ada2:ahcich3:0:0:0): CAM status: Command timeout
May 30 11:43:29 wfid78-172 kernel: (ada2:ahcich3:0:0:0): Retrying command
May 30 11:43:29 wfid78-172 kernel: ahcich1: Timeout on slot 15 port 0
May 30 11:43:29 wfid78-172 kernel: ahcich1: is 00000008 cs 00000000 ss 00000000 rs 00008000 tfd 40 serr 00000000 cmd 00206f17
May 30 11:43:29 wfid78-172 kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 bd f5 1e 40 00 00 00 00 00 00
May 30 11:43:29 wfid78-172 kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
May 30 11:43:29 wfid78-172 kernel: (ada1:ahcich1:0:0:0): Retrying command

I set hint.ahci.0.msi=0

As long as it works.
Comment 1 biotin 2015-07-07 17:22:38 UTC
Affects me on FreeBSD Prelease 10.2 too. (AMD A10-5750 Lenovo E545 Laptop)

Same Error: 
ahcich1: Timeouton slot 17 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00020000 tfd 50 seried 00007117
(aprobe0:ahcich :0 :0 :0): ATAPI_IDENTIFY. ACB: a1 00 00 00 00 40 00
(aprobe:ahcich1 :0 :0 :0): CAM status: command timeout
(aprobe:ahcich1 :0 :0 :0): Retrying command
Comment 2 patpro 2015-07-23 15:45:40 UTC
I've got a similar problem on FreeBSD 10.1-RELEASE, but using Intel chipset:

ahci0: <Intel Cougar Point AHCI SATA controller> port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf000-0xf01f mem 0xdfa22000-0xdfa227ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0

Problem first appeared 28 minutes after the binary upgrade from 9.3-RELEASE (the server ran flawlessly for 4 years before that).

At boottime : 

ahcich5: Timeout on slot 0 port 0
(and so on)

During normal operation :

ahcich1: Timeout on slot 25 port 0
ahcich1: is 00000000 cs 04000000 ss 06000000 rs 06000000 tfd 40 serr 00000000 cmd 0004d917
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 28 ee 3c 40 2e 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: stopping AHCI engine failed
ahcich0: Timeout on slot 23 port 0
ahcich0: is 00000000 cs 01000000 ss 01800000 rs 01800000 tfd 40 serr 00000000 cmd 0004d717
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 28 ee 3c 40 2e 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich0: stopping AHCI engine failed
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: executing CLO failed
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: executing CLO failed
ahcich1: Timeout on slot 26 port 0

and the box freezes.
Happened twice in one month.

I'm going to give hint.ahci.0.msi=0 a try later...
Comment 3 patpro 2015-08-27 20:17:57 UTC
(In reply to patpro from comment #2)

apparently hint.ahci.0.msi=0 did not help, so I removed this setting. And about 4 weeks later, my box had a new freeze.
Currently it runs 10.1-RELEASE-p19, on a Supermicro X9SCL-F motherboard.

I'd be happy to help, let me know if I can provide further info or run some tests.
Comment 4 Chris Platts 2015-09-30 00:16:29 UTC
Same issue for me.  No problem on 10.0-RELEASE -- it all began after a freebsd-update to 10.2-RELEASE.

It's happening across two ahci interfaces:

ahci0: <ASMedia ASM1061 AHCI SATA controller> port 0xd800-0xd807,0xd400-0xd403,0xd000-0xd007,0xc800-0xc803,0xc400-0xc41f mem 0xfdfffc00-0xfdfffdff irq 18 at device 0.0 on pci1
ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported

ahci1: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfdeffc00-0xfdefffff irq 19 at device 17.0 on pci0
ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
ahci1: quirks=0x22000<ATI_PMP_BUG,1MSI>

I've tried disabling MSI on the AHCI interfaces via hints in /boot/loader.conf but the problem persists.
Comment 5 Vladimir Noskov 2015-09-30 07:41:48 UTC
(In reply to Chris Platts from comment #4)
This hint.ahci.0.msi=0 place to /boot/device.hints

But now my system have defaults settings and I comment this: 

##                                                                                                                                  
## ahci                                                                                                                             
##                                                                                                                                  
                                                                                                                                    
#hint.ahci.0.msi=0                                                                                                                  
#hint.ahci.1.msi=0   

All works normal. 

> uname -a
FreeBSD wfid78-172 10.2-RELEASE-p4 FreeBSD 10.2-RELEASE-p4 #0: Tue Aug 18 15:15:36 UTC 2015
> uptime 
10:34  up 18 days, 14:21, 6 users, load averages: 0,42 0,40 0,43

Strange...

But... the problem is still there, as you can see.
It was not (as far as I could understand) changes in ahci.c.
Comment 6 Chris Platts 2015-09-30 12:32:25 UTC
(In reply to Vladimir Noskov from comment #5)

Ok -- I tried moving the hits to /boot/device.hints rather than /boot/loader.conf.  Unfortunately the errors persist:

Sep 30 13:25:15 trillian kernel: ahcich1: Timeout on slot 6 port 0
Sep 30 13:25:15 trillian kernel: ahcich1: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 0004c617
Sep 30 13:25:15 trillian kernel: (ada1:ahcich1:0:0:0): READ_DMA. ACB: c8 00 00 00 00 40 00 00 00 00 01 00
Sep 30 13:25:15 trillian kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Sep 30 13:25:15 trillian kernel: (ada1:ahcich1:0:0:0): Retrying command
Sep 30 13:25:36 trillian kernel: ahcich0: Timeout on slot 21 port 0
Sep 30 13:25:36 trillian kernel: ahcich0: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 451 serr 00000000 cmd 0004d517
Sep 30 13:26:08 trillian kernel: ahcich0: Timeout on slot 26 port 0
Sep 30 13:26:08 trillian kernel: ahcich0: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 451 serr 00000000 cmd 0004da17
Sep 30 13:26:08 trillian kernel: (ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
Sep 30 13:26:08 trillian kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Sep 30 13:26:08 trillian kernel: (ada0:ahcich0:0:0:0): Retrying command
Sep 30 13:26:29 trillian kernel: ahcich1: Timeout on slot 18 port 0
Sep 30 13:26:29 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00040000 tfd 451 serr 00000000 cmd 0004d117
Sep 30 13:26:49 trillian kernel: ahcich1: Timeout on slot 29 port 0
Sep 30 13:26:49 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 20000000 tfd 451 serr 00000000 cmd 0004dc17

They occur in pairs -- if one channel goes, the other follows next.  When the channel with the boot disk on times-out, the system freezes for 10-20 seconds before resuming.  

This is really hurting :(
Comment 7 Vladimir Noskov 2015-09-30 13:54:08 UTC
(In reply to Chris Platts from comment #6)

You place this

hint.ahci.0.msi=0                                                                                                                  
hint.ahci.1.msi=0

??

You see kernel settings:
# kenv | grep ahci
Comment 8 Chris Platts 2015-09-30 14:03:24 UTC
(In reply to Vladimir Noskov from comment #7)

Hi,

Here's my output:

chris@trillian:~ % kenv | grep ahci
hint.ahci.0.msi="0"
hint.ahci.1.msi="0"

Those double-quotes aren't present in device.hints, but all the values as displayed by kenv appear to have them - so that's probably not an issue.
Comment 9 Vladimir Noskov 2015-09-30 14:20:13 UTC
Yes, I have no quotes, but it works without them. Hmm ...
As far as I understand, the double quotes are needed for multi tsschkvy parameters.
But the right to put "" for all. :)

When will appear and then check the connecting on the hdd and SATA and Power cables.
Comment 10 Vladimir Noskov 2015-09-30 14:22:12 UTC
"tsschkvy" - read as "words"
Comment 11 Chris Platts 2015-09-30 15:31:20 UTC
(In reply to Vladimir Noskov from comment #10)

No problem :)

These are eSATA interfaces.  I've replaced the cables and enclosures (and in one instance, the drive itself).  Still the problem remains.

I'm going to install 10.0-RELEASE tonight on another disk and verify that the problem disappears.  That's a shame because I was really enjoying the improved NFS performance on 10.2!
Comment 12 Vladimir Noskov 2015-09-30 17:21:53 UTC
I suspect that this has something to do with the hot-plug drives connections.
My suspicions are supported by you.

Developers are !!! Pay attention!
Comment 13 Chris Platts 2015-09-30 23:53:44 UTC
(In reply to Vladimir Noskov from comment #12)

Ok - I think I've been stupid.  

I was understanding 'ahcichX' to be where 'X' is the controller ID.  But it isn't, is it?  It's the channel number - the physical connector.

So, ahcich0 and ahcich1 are the two connectors on my ASMedia controller card (ahci0).  Nothing to do with the built-in controller (ahci1).

Sorry :)

Now, that card is an eSATA card and as such, supports hot-plugging.  

You might well be on to something!
Comment 14 patpro 2015-10-01 05:10:01 UTC
(In reply to Chris Platts from comment #13)

Well, don't forget I've got the exact same problem on a native Intel chipset (comment 2) :

ahci0@pci0:0:31:2:	class=0x010601 card=0x062415d9 chip=0x1c028086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '6 Series/C200 Series Chipset Family SATA AHCI Controller'
    class      = mass storage
    subclass   = SATA

And I would be surprised if biotin's laptop (comment 1) would come with hotplug support.

I think it could help to report only verified facts here and keep the speculation on forums, so that people taking over this (very nasty) bug have solid info to work with.
Comment 15 Chris Platts 2015-10-01 08:20:20 UTC
(In reply to patpro from comment #14)

Fair point!

I'll sort out my system so I have two boot disks available -- one being the installation upgraded from 9.1 to 10.0 and finally to 10.2 which is experiencing the problem.  I'll also have another fresh 10.2 available for comparison and as a test environment for any further queries from the developers.

Will await any requests for further info from the devs.
Comment 16 Vladimir Noskov 2015-10-01 12:16:25 UTC
(In reply to Chris Platts from comment #15)
It's not stupidity are just some do not care. Sometimes.

But the fact that there is an error related to the behavior it is hot plug devices, very bad. Conventional Sata drives, if they put in the BIOS settings, can also be hot-plug.

Developers !!! Where are you?!! :)
Comment 17 Chris Platts 2015-10-03 16:45:44 UTC
Ok, I found the cause of these timeouts for me.

I rebuilt the system from scratch on another disk (connected to the same controller as my present boot disk).

All the way through the build I had no timeouts.  Then I installed smartmontools...  

It seems that when smartd queried either of the two disks on the problematic controller, it caused command timeouts.

Odd, considering the Option ROM for the ASMedia card specifically mentions SMART support.  

Anyway, everything's fine now.  Will take this up with the smartmontools project to see if there's a resolution.
Comment 18 patpro 2015-10-03 17:52:26 UTC
(In reply to Chris Platts from comment #17)

Interesting! I use smartmontools too, but I can see timeouts during boot, so I guess this particular problem is not related to smartmontools.
Comment 19 Vladimir Noskov 2015-10-05 15:41:44 UTC
Also talks about this:
https://forums.freebsd.org/threads/kernel-ahcich-timeout-in-slot.51868/
Comment 20 214748mv 2015-11-13 21:22:05 UTC
I have this issue on two new HP Microserver Gen8, so the probability of a hardware failure is low. 
I have solved the issue disabling the drives stand by.
Comment 21 Vladimir Noskov 2016-01-20 23:11:43 UTC
Now... bug repets :(

FreeBSD wcsn.wocson.ru 10.2-RELEASE-p11 FreeBSD 10.2-RELEASE-p11 #0: Thu Jan 14 15:48:17 UTC 2016     root@amd64-builder.pcbsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

Now after update to -p11. First reboot and get error ahci.
Comment 22 Vladimir Noskov 2016-04-11 07:26:38 UTC
I upgrade to 10.3-RELEASE.

Error sata is present :(
Comment 23 Dmitry Afanasiev 2018-07-05 14:37:12 UTC
I'm fixed this problem to me using this simple patch:

Index: sys/dev/ahci/ahci_pci.c
===================================================================
--- sys/dev/ahci/ahci_pci.c     (revision 326818)
+++ sys/dev/ahci/ahci_pci.c     (working copy)
@@ -59,7 +59,7 @@
        {0x43901002, 0x00, "AMD SB7x0/SB8x0/SB9x0",
            AHCI_Q_ATI_PMP_BUG | AHCI_Q_1MSI},
        {0x43911002, 0x00, "AMD SB7x0/SB8x0/SB9x0",
-           AHCI_Q_ATI_PMP_BUG | AHCI_Q_1MSI},
+           AHCI_Q_NOMSI | AHCI_Q_NOPMP | AHCI_Q_ATI_PMP_BUG },
        {0x43921002, 0x00, "AMD SB7x0/SB8x0/SB9x0",
            AHCI_Q_ATI_PMP_BUG | AHCI_Q_1MSI},
        {0x43931002, 0x00, "AMD SB7x0/SB8x0/SB9x0",