Bug 186346 - [ahci] [regression] ahci(4) thousands of warnings Timeout on slot 22
Summary: [ahci] [regression] ahci(4) thousands of warnings Timeout on slot 22
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.0-RELEASE
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-01 16:00 UTC by David Demelier
Modified: 2018-05-20 23:50 UTC (History)
2 users (show)

See Also:


Attachments
Verbose boot output. (48.17 KB, text/plain)
2015-10-01 08:48 UTC, Chris Platts
no flags Details
Updated dmesg.boot with MSI enabled (48.06 KB, text/plain)
2015-10-02 14:34 UTC, Chris Platts
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Demelier 2014-02-01 16:00:00 UTC
After upgrading to 10.0, I get a lot of warnings regarding ahcich :

ahcich1: Timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617
ahcich1: Timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617
ahcich1: Timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617
ahcich1: Timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617
ahcich1: Timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617
Comment 1 Steven Hartland freebsd_committer freebsd_triage 2015-02-01 19:48:20 UTC
Can you provide the output of verbose boot and pciconf -lv please.
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2015-03-15 05:08:36 UTC
Maybe someday I will remember that acpi != ahci.  Maybe.
Comment 3 Steven Hartland freebsd_committer freebsd_triage 2015-03-15 11:41:11 UTC
What platform / ahci controller are you running if this is SB7x0/SB8x0/SB9x0 then r278034 should help. It was merged to stable/10 by r279918
Comment 4 Chris Platts 2015-09-30 14:26:55 UTC
I'm not sure if this is the same root issue, but in case it's related, there's a few of us seeing similar problems over at #201194

In my case, this is happening on two ahci controllers - 

ahci0: <ASMedia ASM1061 AHCI SATA controller> port 0xd800-0xd807,0xd400-0xd403,0xd000-0xd007,0xc800-0xc803,0xc400-0xc41f mem 0xfdfffc00-0xfdfffdff irq 18 at device 0.0 on pci1
ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported

ahci1: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfdeffc00-0xfdefffff irq 19 at device 17.0 on pci0
ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
ahci1: quirks=0x22000<ATI_PMP_BUG,1MSI>


(ahci0 is a PCIe add-in card, ahci1 is the motherboard's eSATA port).

Interestingly, I did _not_ have this issue in 10.0-RELEASE.  The problems only started after a binary update to 10.2-RELEASE.

Here's the latest from my logs:

chris@trillian:~ % tail -n 25  /var/log/messages
Sep 30 15:17:59 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00010000 tfd 451 serr 00000000 cmd 0004cf17
Sep 30 15:17:59 trillian kernel: ahcich0: Timeout on slot 5 port 0
Sep 30 15:17:59 trillian kernel: ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 451 serr 00000000 cmd 0004c517
Sep 30 15:17:59 trillian kernel: (ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
Sep 30 15:17:59 trillian kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Sep 30 15:17:59 trillian kernel: (ada0:ahcich0:0:0:0): Retrying command
Sep 30 15:18:19 trillian kernel: ahcich1: Timeout on slot 27 port 0
Sep 30 15:18:19 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 08000000 tfd 451 serr 00000000 cmd 0004da17
Sep 30 15:22:26 trillian kernel: ahcich0: Timeout on slot 12 port 0
Sep 30 15:22:26 trillian kernel: ahcich0: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 451 serr 00000000 cmd 0004cc17
Sep 30 15:22:58 trillian kernel: ahcich1: Timeout on slot 12 port 0
Sep 30 15:22:58 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00001000 tfd 451 serr 00000000 cmd 0004cb17
Sep 30 15:22:58 trillian kernel: ahcich0: Timeout on slot 17 port 0
Sep 30 15:22:58 trillian kernel: ahcich0: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 451 serr 00000000 cmd 0004d117
Sep 30 15:22:58 trillian kernel: (ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
Sep 30 15:22:58 trillian kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
Sep 30 15:22:58 trillian kernel: (ada0:ahcich0:0:0:0): Retrying command
Sep 30 15:23:19 trillian kernel: ahcich1: Timeout on slot 23 port 0
Sep 30 15:23:19 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00800000 tfd 451 serr 00000000 cmd 0004d617
Comment 5 Steven Hartland freebsd_committer freebsd_triage 2015-09-30 14:59:27 UTC
Are all the timeouts for SETFEATURES ENABLE WCACHE?
Comment 6 Steven Hartland freebsd_committer freebsd_triage 2015-09-30 15:34:19 UTC
If so you can try disabling write cache with the write_cache sysctl's:
kern.cam.ada.write_cache, kern.cam.ada.X.write_cache
Comment 7 Chris Platts 2015-09-30 18:00:00 UTC
(In reply to Steven Hartland from comment #5)

Unfortunately, no.  Many READ_FPDMA_QUEUED and WRITE_FPDMA_QUEUED too.
Comment 8 Chris Platts 2015-09-30 18:26:31 UTC
(In reply to Chris Platts from comment #7)

Apologies for the comment-spam, but thought it worth mentioning:

ahci0 has two eSATA ports.  Each is connected to drives in eSATA enclosures.

ahci1 has 4xSATA disks and a DVD drive inside the chassis, with an eSATA port connected to a 4-bay JBOD enclosure.

Following advice from bug #201194 I've tried disabling MSI via device.hints without any noticeable change.

I've also replaced a whole disk, enclosure and eSATA cable but errors persist across both controllers.

I'm considering reverting to 10.0-RELEASE since I'm uncomfortable having my ZFS pools on a system that's struggling with its controller hardware.  I'll keep the disk with 10.2-RELEASE available to boot from for further diagnosis.
Comment 9 Chris Platts 2015-10-01 00:00:34 UTC
Gave up on rolling back to 10.0 - I stupidly upgraded my ZFS pools, so they'll be unreadable by the old version. Silly mistake; lesson learnt!

Anyway, I've been stupid.

I was understanding 'ahcichX' to be where 'X' is the controller ID.  But it isn't, is it?  It's the channel number - the physical connector.

So, ahcich0 and ahcich1 are the two connectors on my ASMedia controller card (ahci0).  Nothing to do with the built-in controller (ahci1).

Sorry :)

I'll dismantle the box tomorrow and re-seat everything. At least I know the AMD controller which has my ZFS pools hanging off it isn't suffering.
Comment 10 Steven Hartland freebsd_committer freebsd_triage 2015-10-01 08:21:32 UTC
So all the problems are with:
"ahci0: <ASMedia ASM1061 AHCI SATA controller> port 0xd800-0xd807,0xd400-0xd403,0xd000-0xd007,0xc800-0xc803,0xc400-0xc41f mem 0xfdfffc00-0xfdfffdff irq 18 at device 0.0 on pci1
ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported"

If so could you do a verbose boot and let me know how many MSI channels its allocating?
Comment 11 Chris Platts 2015-10-01 08:26:32 UTC
(In reply to Steven Hartland from comment #10)

That's correct.  I'll look up how to do a verbose boot now.  I'm nowhere near the machine at the moment but can ssh into it and get that done.

Would the file containing the verbose output still be /var/run/dmesg.boot (and that's the one you'd like to see?)
Comment 12 Chris Platts 2015-10-01 08:48:55 UTC
Created attachment 161585 [details]
Verbose boot output.

Verbose boot output.
Comment 13 Chris Platts 2015-10-02 08:07:16 UTC
Thanks for looking into this, Steven.

Does the dmesg output I provided include the information you need about the number of MSI channels allocated to the device?  I've had a read-through a few times but I'm still not sure!
Comment 14 Steven Hartland freebsd_committer freebsd_triage 2015-10-02 11:35:41 UTC
Can you confirm you don't have MSI disabled in your loader.conf e.g.
hint.ahci.X.msi=0
Comment 15 Chris Platts 2015-10-02 14:34:54 UTC
Created attachment 161645 [details]
Updated dmesg.boot with MSI enabled

Apologies -- I'd read separately to put the hints in /boot/loader.conf and also /boot/device.hints, so I hadn't removed them from both files.

Here's another dmesg.boot.
Comment 16 Chris Platts 2015-10-03 12:34:42 UTC
Ok, I've installed FreeBSD 10.2-RELEASE fresh from the ISO onto ada1.

The old boot drive (ada0) and ada1 are both on the ASMedia controller, and the physical hardware is entirely unchanged from when the problem was occurring.

and... no timeouts.

pkg reports that all packages are up-to-date.

I'm going to now do a freebsd-update to bring the OS up to the latest patch level and see if the problem appears.
Comment 17 Chris Platts 2015-10-03 13:17:23 UTC
I'm now at 10.2-RELEASE-p5 and am rebuilding the software and configuration for the system.

ada1, being the boot disk, is getting a bit of a work-out as I'm installing packages, updating the ports tree, etc.  

No timeouts so far. I'm at a bit of a loss to explain this!
Comment 18 Chris Platts 2015-10-03 16:42:19 UTC
Ok.  Figured it out :)

I just installed smartmontools.  Immediately, the timeouts started.  

Looks like smartd querying the two disks on ahci1 was causing command timeouts.  

I'll do some more investigation and take this up with the smartmontools project, but at this point it seems ahci isn't my problem.

I'll duck out of this bug now!
Comment 19 Steven Hartland freebsd_committer freebsd_triage 2015-10-03 18:58:11 UTC
Very useful to know that so thanks for reporting back :)
Comment 20 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:50:32 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"