After upgrading to 10.0, I get a lot of warnings regarding ahcich : ahcich1: Timeout on slot 22 port 0 ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617 ahcich1: Timeout on slot 22 port 0 ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617 ahcich1: Timeout on slot 22 port 0 ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617 ahcich1: Timeout on slot 22 port 0 ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617 ahcich1: Timeout on slot 22 port 0 ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd c0 serr 00000000 cmd 0000d617
Can you provide the output of verbose boot and pciconf -lv please.
Maybe someday I will remember that acpi != ahci. Maybe.
What platform / ahci controller are you running if this is SB7x0/SB8x0/SB9x0 then r278034 should help. It was merged to stable/10 by r279918
I'm not sure if this is the same root issue, but in case it's related, there's a few of us seeing similar problems over at #201194 In my case, this is happening on two ahci controllers - ahci0: <ASMedia ASM1061 AHCI SATA controller> port 0xd800-0xd807,0xd400-0xd403,0xd000-0xd007,0xc800-0xc803,0xc400-0xc41f mem 0xfdfffc00-0xfdfffdff irq 18 at device 0.0 on pci1 ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported ahci1: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfdeffc00-0xfdefffff irq 19 at device 17.0 on pci0 ahci1: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported ahci1: quirks=0x22000<ATI_PMP_BUG,1MSI> (ahci0 is a PCIe add-in card, ahci1 is the motherboard's eSATA port). Interestingly, I did _not_ have this issue in 10.0-RELEASE. The problems only started after a binary update to 10.2-RELEASE. Here's the latest from my logs: chris@trillian:~ % tail -n 25 /var/log/messages Sep 30 15:17:59 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00010000 tfd 451 serr 00000000 cmd 0004cf17 Sep 30 15:17:59 trillian kernel: ahcich0: Timeout on slot 5 port 0 Sep 30 15:17:59 trillian kernel: ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 451 serr 00000000 cmd 0004c517 Sep 30 15:17:59 trillian kernel: (ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00 Sep 30 15:17:59 trillian kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout Sep 30 15:17:59 trillian kernel: (ada0:ahcich0:0:0:0): Retrying command Sep 30 15:18:19 trillian kernel: ahcich1: Timeout on slot 27 port 0 Sep 30 15:18:19 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 08000000 tfd 451 serr 00000000 cmd 0004da17 Sep 30 15:22:26 trillian kernel: ahcich0: Timeout on slot 12 port 0 Sep 30 15:22:26 trillian kernel: ahcich0: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 451 serr 00000000 cmd 0004cc17 Sep 30 15:22:58 trillian kernel: ahcich1: Timeout on slot 12 port 0 Sep 30 15:22:58 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00001000 tfd 451 serr 00000000 cmd 0004cb17 Sep 30 15:22:58 trillian kernel: ahcich0: Timeout on slot 17 port 0 Sep 30 15:22:58 trillian kernel: ahcich0: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 451 serr 00000000 cmd 0004d117 Sep 30 15:22:58 trillian kernel: (ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00 Sep 30 15:22:58 trillian kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout Sep 30 15:22:58 trillian kernel: (ada0:ahcich0:0:0:0): Retrying command Sep 30 15:23:19 trillian kernel: ahcich1: Timeout on slot 23 port 0 Sep 30 15:23:19 trillian kernel: ahcich1: is 00000000 cs 00000000 ss 00000000 rs 00800000 tfd 451 serr 00000000 cmd 0004d617
Are all the timeouts for SETFEATURES ENABLE WCACHE?
If so you can try disabling write cache with the write_cache sysctl's: kern.cam.ada.write_cache, kern.cam.ada.X.write_cache
(In reply to Steven Hartland from comment #5) Unfortunately, no. Many READ_FPDMA_QUEUED and WRITE_FPDMA_QUEUED too.
(In reply to Chris Platts from comment #7) Apologies for the comment-spam, but thought it worth mentioning: ahci0 has two eSATA ports. Each is connected to drives in eSATA enclosures. ahci1 has 4xSATA disks and a DVD drive inside the chassis, with an eSATA port connected to a 4-bay JBOD enclosure. Following advice from bug #201194 I've tried disabling MSI via device.hints without any noticeable change. I've also replaced a whole disk, enclosure and eSATA cable but errors persist across both controllers. I'm considering reverting to 10.0-RELEASE since I'm uncomfortable having my ZFS pools on a system that's struggling with its controller hardware. I'll keep the disk with 10.2-RELEASE available to boot from for further diagnosis.
Gave up on rolling back to 10.0 - I stupidly upgraded my ZFS pools, so they'll be unreadable by the old version. Silly mistake; lesson learnt! Anyway, I've been stupid. I was understanding 'ahcichX' to be where 'X' is the controller ID. But it isn't, is it? It's the channel number - the physical connector. So, ahcich0 and ahcich1 are the two connectors on my ASMedia controller card (ahci0). Nothing to do with the built-in controller (ahci1). Sorry :) I'll dismantle the box tomorrow and re-seat everything. At least I know the AMD controller which has my ZFS pools hanging off it isn't suffering.
So all the problems are with: "ahci0: <ASMedia ASM1061 AHCI SATA controller> port 0xd800-0xd807,0xd400-0xd403,0xd000-0xd007,0xc800-0xc803,0xc400-0xc41f mem 0xfdfffc00-0xfdfffdff irq 18 at device 0.0 on pci1 ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported" If so could you do a verbose boot and let me know how many MSI channels its allocating?
(In reply to Steven Hartland from comment #10) That's correct. I'll look up how to do a verbose boot now. I'm nowhere near the machine at the moment but can ssh into it and get that done. Would the file containing the verbose output still be /var/run/dmesg.boot (and that's the one you'd like to see?)
Created attachment 161585 [details] Verbose boot output. Verbose boot output.
Thanks for looking into this, Steven. Does the dmesg output I provided include the information you need about the number of MSI channels allocated to the device? I've had a read-through a few times but I'm still not sure!
Can you confirm you don't have MSI disabled in your loader.conf e.g. hint.ahci.X.msi=0
Created attachment 161645 [details] Updated dmesg.boot with MSI enabled Apologies -- I'd read separately to put the hints in /boot/loader.conf and also /boot/device.hints, so I hadn't removed them from both files. Here's another dmesg.boot.
Ok, I've installed FreeBSD 10.2-RELEASE fresh from the ISO onto ada1. The old boot drive (ada0) and ada1 are both on the ASMedia controller, and the physical hardware is entirely unchanged from when the problem was occurring. and... no timeouts. pkg reports that all packages are up-to-date. I'm going to now do a freebsd-update to bring the OS up to the latest patch level and see if the problem appears.
I'm now at 10.2-RELEASE-p5 and am rebuilding the software and configuration for the system. ada1, being the boot disk, is getting a bit of a work-out as I'm installing packages, updating the ports tree, etc. No timeouts so far. I'm at a bit of a loss to explain this!
Ok. Figured it out :) I just installed smartmontools. Immediately, the timeouts started. Looks like smartd querying the two disks on ahci1 was causing command timeouts. I'll do some more investigation and take this up with the smartmontools project, but at this point it seems ahci isn't my problem. I'll duck out of this bug now!
Very useful to know that so thanks for reporting back :)
For bugs matching the following conditions: - Status == In Progress - Assignee == "bugs@FreeBSD.org" - Last Modified Year <= 2017 Do - Set Status to "Open"