Summary: | CAM status: command timeout (fix in HEAD, still in 9.3, 10.1) (relates to bug 195349) | ||
---|---|---|---|
Product: | Base System | Reporter: | Yudi <yudi.tux> |
Component: | kern | Assignee: | Steven Hartland <smh> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | CC: | gibbs, grahamperrin, ken, koobs, mav, sasamotikomi, smh, stenio |
Priority: | --- | Keywords: | easy, needs-qa, patch |
Version: | 10.1-RELEASE | Flags: | koobs: mfc-stable10?, koobs: mfc-stable9? |
Hardware: | amd64 | ||
OS: | Any | ||
URL: | https://svnweb.freebsd.org/base?view=revision&revision=278034 | ||
Attachments: |
Description
Yudi
2015-04-17 04:13:53 UTC
Hi, I think I have the same problem: for some reason my hardware (Fabiatech FX5621) doesn't work with FreeBSD 10.1, while it works perfectly with version 8.3. I tried booting from a USB drive, disabling all IDE drives in the BIOS, and using every hint command I could find, with no luck. This is what I tried:

set hint.atapci.1.msi=0
set hint.atapci.0.msi=0
set hint.ata.1.mode="PIO4"
set hint.ata.0.mode="PIO4"
set hint.acpi.0.disabled=1
set kern.cam.ada.write_cache=0
boot

The boot loops with this error:

(aprobe0:ata1:0:1:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ata1:0:1:0): CAM status: Command timeout
(aprobe0:ata1:0:1:0): Retrying command
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
(aprobe0:ata1:0:1:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ata1:0:1:0): CAM status: Command timeout
(aprobe0:ata1:0:1:0): Error 5, Retries exhausted
(aprobe0:ata1:0:1:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ata1:0:1:0): CAM status: Command timeout
(aprobe0:ata1:0:1:0): Retrying command
run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config
(aprobe0:ata1:0:1:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ata1:0:1:0): CAM status: Command timeout
(aprobe0:ata1:0:1:0): Error 5, Retries exhausted

Please let me know if you have any suggestions.
Thanks,
Stenio

I just want to update my original post, as it had some inaccurate info. First, this bug is present in v9.3 as well as v10.1 but NOT in v11. I added the following to /boot/loader.conf, but it did not resolve the issue:

hint.atapci.0.msi="0" (also 1)
hint.atapci.1.msi="0"
hint.ahci.0.msi="1"

I even tried rebuilding the kernel, but it looks like that did not fix the issue. This was my first time building a custom kernel, so I am not sure I got it right.
The steps I followed are below. I used the LINT config instead of creating my own:

# svn checkout svn-mirror/base/head /usr/src
# cd /usr/src/sys/amd64/conf && make LINT
# cd /usr/src
# make buildkernel KERNCONF=LINT
# make installkernel KERNCONF=LINT

One of the replies I received on the mailing list suggested this is not the right way to rebuild the kernel. He said I was using v11 source files to build the kernel for v10. I don't think that was accurate. Can someone please explain how I can apply the patch?

A bit of info on the system: the Hitachi drives are in a 2-way mirror connected to 3Gbps SATA ports (AHCI mode), which I am not using until I fix this issue. The OS is installed on the Samsung drives (2-way mirror) connected to 1.5Gbps SATA ports running in IDE mode (I cannot change this to AHCI without installing a hacked version of the BIOS, which I don't want to do). I would greatly appreciate any advice on how to fix this issue. Below I added some output from the system that might help.

============================= output of "camcontrol devlist" ===================================
<Hitachi HDS723030ALA640 MKAOAA10> at scbus0 target 0 lun 0 (ada0,pass0)
<Hitachi HDS723030ALA640 MKAOAA10> at scbus2 target 0 lun 0 (ada1,pass1)
<SAMSUNG HM080HI AB100-17> at scbus4 target 0 lun 0 (ada2,pass2)
<SAMSUNG HM080HI AB100-17> at scbus4 target 1 lun 0 (ada3,pass3)
======================================================

==================================== ERROR from /var/log/messages =================================
Jun 28 21:22:47 10p1test kernel: (ada3:ata0:0:1:0): READ_DMA. ACB: c8 00 88 00 41 44 00 00 00 00 01 00
Jun 28 21:22:47 10p1test kernel: (ada3:ata0:0:1:0): CAM status: Command timeout
Jun 28 21:22:47 10p1test kernel: (ada3:ata0:0:1:0): Retrying command
Jun 28 21:23:21 10p1test kernel: (ada2:ata0:0:0:0): READ_DMA. ACB: c8 00 0d 30 c0 45 00 00 00 00 01 00
Jun 28 21:23:21 10p1test kernel: (ada2:ata0:0:0:0): CAM status: Command timeout
Jun 28 21:23:21 10p1test kernel: (ada2:ata0:0:0:0): Retrying command
Jun 28 21:40:33 10p1test kernel: (ada2:ata0:0:0:0): READ_DMA. ACB: c8 00 51 30 70 45 00 00 00 00 01 00
Jun 28 21:40:33 10p1test kernel: (ada2:ata0:0:0:0): CAM status: Command timeout
Jun 28 21:40:33 10p1test kernel: (ada2:ata0:0:0:0): Retrying command

======================================== output of "dmesg | grep ahci" =============================
ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada1 at ahcich2 bus 0 scbus2 target 0 lun 0

====================================== output of "dmesg |grep ada" ============================
random: selecting highest priority adaptor <Dummy>
random: selecting highest priority adaptor <Yarrow>
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Hitachi HDS723030ALA640 MKAOAA10> ATA8-ACS SATA 3.x device
ada0: Serial Number MK0301YHKT8A2A
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
ada1: <Hitachi HDS723030ALA640 MKAOAA10> ATA8-ACS SATA 3.x device
ada1: Serial Number MK0301YHKV1JWD
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad8
ada2 at ata0 bus 0 scbus4 target 0 lun 0
ada2: <SAMSUNG HM080HI AB100-17> ATA-7 SATA 1.x device
ada2: Serial Number S0ZAJD0P700140
ada2: 150.000MB/s transfers (SATA, UDMA5, PIO 8192bytes)
ada2: 76319MB (156301488 512 byte sectors: 16H 63S/T 16383C)
ada2: Previously was known as ad0
ada3 at ata0 bus 0 scbus4 target 1 lun 0
ada3: <SAMSUNG HM080HI AB100-17> ATA-7 SATA 1.x device
ada3: Serial Number S0ZAJD0P700102
ada3: 150.000MB/s transfers (SATA, UDMA5, PIO 8192bytes)
ada3: 76319MB (156301488 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad1
===============================

Appears related to r278034, which may not have been MFC'd to stable/9 and stable/10. CC'ing the committer (smh) of that change; this is an opportunity to resolve this prior to 10.2-RELEASE.

After speaking with users in the #freebsd IRC channel, I realized I had made a couple of mistakes in rebuilding the kernel. I was advised to track base/stable/10 rather than base/head and to use GENERIC instead of LINT. I rebuilt the kernel again as follows. I renamed /usr/src, then created a new /usr/src:

# svn checkout https://svn0.us-west.FreeBSD.org/base/stable/10 /usr/src
# cd /usr/src/sys/amd64/conf
# cp GENERIC MYKERNEL1
# cd /usr/src
# make buildkernel KERNCONF=MYKERNEL1
# make installkernel KERNCONF=MYKERNEL1

I rebooted the system and the issue is still present.

(In reply to Kubilay Kocak from comment #3)
It's not clear from this report what the controller in question is, but 278034 has already been merged to stable/10; stable/9 is too different for a direct merge.

(In reply to Steven Hartland from comment #5)
OK, so I saw it was the same controller, so it should be fixed by that commit.

(In reply to Yudi from comment #4)
Can you confirm the quirks list for your controller from a verbose boot, please?

Created attachment 158218 [details]
/var/run/dmesg.boot file from v10.1
attached /var/run/dmesg.boot file
(In reply to Steven Hartland from comment #7)
Steven, I have attached the /var/run/dmesg.boot file. Let me know if anything else is needed.

As you're only seeing issues on ada2 and ada3 (the ports at 1.5Gbps), I suspect the issue is indeed caused by a BIOS SATA config / compatibility issue. If that's the case, applying the BIOS fix is likely the best course of action.

It would be interesting to see the verbose boot from 9.3, to see if there are any significant differences in how it reports your hardware has been initialised.

Created attachment 158236 [details]
/var/run/dmesg.boot file from v9.3
Created attachment 158237 [details]
/var/run/dmesg.boot file from v11
(In reply to Steven Hartland from comment #11)
Please ignore the comment from my first post; this bug is present in v9.3 but NOT in v11. I really hope the fix can be backported to v10.1. I am not comfortable using a third-party BIOS hack (people have had success with it, as it enables AHCI mode for the two ports that are running in IDE mode right now). If this bug can't be fixed in v10.1, I will most likely use v11.

I have attached /var/run/dmesg.boot files from v9.3 and v11. I found this on line 410 of the v11 file, but not in the v10.1 or v9.3 files:

ahci0: quirks=0x1b5f0<ATI_PMP_BUG,1MSI,FORCE_PI>

OK, so could you try stable/10 and provide a verbose boot from that? While reading down, I thought that your 10.1 log was actually from stable/10, as that's what you'd said you were running just above.

(In reply to Yudi from comment #14)
For reference, once we identify the issue it's highly unlikely to be fixed in 10.1, as that boat has sailed, so stable/10 or the upcoming 10.2 release would be your options.

Created attachment 158260 [details]
/var/run/dmesg.boot file from v10.2 prerelease
(In reply to Steven Hartland from comment #16)
This is not a production system; I am still testing it. If it can be fixed in 10.2, that will be great.

(In reply to Steven Hartland from comment #15)
I have different versions installed to different ZFS datasets on the same system. Based on your comment at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195349#c30 I first rebuilt the kernel by tracking base/head with the LINT config (my comment #2). Users in the IRC channel suggested I track base/stable/10 instead and use GENERIC, so I rebuilt the kernel again tracking base/stable/10 with GENERIC (my comment #4). This was done on the v10.1 install. I followed the instructions at https://www.freebsd.org/doc/handbook/book.html#kernelconfig

I am not sure why the uname -a output shows 10.2-PRERELEASE when I only rebuilt the kernel. Wouldn't that be the case when rebuilding world? (https://www.freebsd.org/doc/handbook/makeworld.html)

The last dmesg.boot file I attached is from the install where the kernel was rebuilt. It has a quirk on line 474 that is different from the one in v11:

ahci0: quirks=0x22000<ATI_PMP_BUG,1MSI>

(In reply to Yudi from comment #18)
Just to confirm, you're still seeing the issue on stable/10?

(In reply to Steven Hartland from comment #19)
Yes, in 10.2-PRERELEASE.

Something is not right there on HEAD, as the output from quirks looks corrupted:

ahci0: quirks=0x1b5f0<ATI_PMP_BUG,1MSI,FORCE_PI>

This is not a valid combination, as there is only one controller which is flagged with FORCE_PI and it's not the one you're using; in addition, there are loads of flags in there which are never set (11011010111110000).

Created attachment 158282 [details]
ahci quirk corrupt debug
Could you test with head + this patch to see if it will narrow down where the corruption of quirks is occurring?
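For readers following the quirks discussion, here is a minimal Python sketch (not part of the original thread) that expands the two quirk values quoted above into their set bits. It only reports bit positions; which flag name belongs to which bit is not stated in this report, so no names are assigned. The helper name `set_bits` is made up for illustration.

```python
def set_bits(value: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


head_quirks = 0x1B5F0    # value printed by the HEAD (v11) kernel in this report
stable_quirks = 0x22000  # value printed by the stable/10 kernel in this report

print(bin(head_quirks))         # 0b11011010111110000, the bit string quoted above
print(set_bits(head_quirks))    # many bits set, although only three names were printed
print(set_bits(stable_quirks))  # exactly two bits set, matching the two names printed
```

The mismatch between the number of set bits and the number of decoded names in the HEAD value is what made the output look corrupted.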
Created attachment 158338 [details]
/var/run/dmesg.boot file from v11+patch
As requested.
Rebuilt the kernel from revision 285130 after applying the patch.
No idea how you got the previous output from 11. I've fixed the incorrect output of quirks, including FORCE_PI, in https://svnweb.freebsd.org/changeset/base/285200.

With regard to the main issue, there are no differences between stable/10 and head in the ahci device which would affect the HW you have. Given this, could I ask you to re-test both HEAD and stable/10, as of today, to confirm we're still seeing the same behaviour?

Created attachment 158562 [details]
/var/run/dmesg.boot file from stable/10 rev 285310
(In reply to Steven Hartland from comment #24)
Sorry Steven, I have been very busy with work. I rebuilt the kernel from rev 285310 (stable/10) and the issue is still present (using zpool scrub, the issue pops up right away). I will rebuild the kernel from HEAD tomorrow and let you know. Thanks!

tail /var/log/messages output from the rev 285310 build:
=======================================================
Jul 9 21:26:31 test kernel: (ada3:ata0:0:1:0): READ_DMA. ACB: c8 00 3a af 05 45 00 00 00 00 01 00
Jul 9 21:26:31 test kernel: (ada3:ata0:0:1:0): CAM status: Command timeout
Jul 9 21:26:31 test kernel: (ada3:ata0:0:1:0): Retrying command
Jul 9 21:33:35 test kernel: ata0: (ada2:ata0:0:0:0): READ_DMA. ACB: c8 00 f8 1f a7 43 00 00 00 00 01 00
Jul 9 21:33:35 test kernel: reset tp1 mask=03 ostat0=50 ostat1=50
Jul 9 21:33:35 test kernel: (ada2:ata0:0:0:0): CAM status: Command timeout
Jul 9 21:33:35 test kernel: (ada2:ata0:0:0:0): Retrying command
Jul 9 21:33:35 test kernel: ata0: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
Jul 9 21:33:35 test kernel: ata0: stat1=0x50 err=0x01 lsb=0x00 msb=0x00
Jul 9 21:33:35 test kernel: ata0: reset tp2 stat0=50 stat1=50 devices=0x3

Created attachment 158570 [details]
/var/run/dmesg.boot file from v11 from rev285311
Rebuilt the kernel from HEAD, revision 285311, and attached the dmesg.boot file.
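As an aside for anyone comparing builds, the /var/log/messages excerpts in this report can be tallied per device to check whether the timeouts stay confined to the drives on the legacy ata channel. The following is a rough Python sketch, not something used in the thread; the regular expression is an assumption based only on the log lines quoted here, and `count_timeouts` is a made-up helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: (ada3:ata0:0:1:0): CAM status: Command timeout"
# The pattern is inferred from the excerpts in this report only.
TIMEOUT_RE = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): CAM status: Command timeout")


def count_timeouts(lines):
    """Count CAM command timeouts per (device, bus) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "Jul 9 21:26:31 test kernel: (ada3:ata0:0:1:0): CAM status: Command timeout",
    "Jul 9 21:26:31 test kernel: (ada3:ata0:0:1:0): Retrying command",
    "Jul 9 21:33:35 test kernel: (ada2:ata0:0:0:0): CAM status: Command timeout",
]
for (device, bus), n in sorted(count_timeouts(sample).items()):
    print(device, bus, n)
```

On a real system the input would be the lines of /var/log/messages rather than the hard-coded sample.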
Created attachment 158572 [details]
ata cleanup patch MFC r280451
mav kindly pointed out that your second two drives, which are causing the problem, are attached to the legacy ata driver, not ahci.
There's one notable commit in head which affects ata for ATI (your chipset) and isn't in stable/10.
This attachment is said commit merged to stable/10. So could you apply this on top of your stable/10 build and see if it eliminates the timeouts?
Oh, you can remove the debug patch from your head builds, as that seems to be resolved now.
(In reply to Steven Hartland from comment #29)

# cd /usr/src
# svn update /usr/src
(updated to rev 285365, stable/10)
# patch < ata-cleanup.patch

I checked the output later and the three files below failed to patch:

sys/arm/mv/mv_sata.c
sys/dev/ata/chipsets/ata-adaptec.c
sys/dev/ata/chipsets/ata-ahci.c

I rebuilt the kernel but the issue is still present. The /var/run/dmesg.boot file after the rebuild is attached.

Created attachment 158605 [details]
/var/run/dmesg.boot file from stable/10 rev 285365
All three of the files that failed were removed by the patch, so that should have no impact apart from dangling files. With ata the same, we're running out of options. The next thing to try is a binary chop on head to try to determine the change that fixed the issue. It can be quite time consuming, but given the details from generic inspection I can't see anything more obvious to try.

Just seen another report on freebsd-scsi@ which might be related. Can you try setting:

kern.racct.enable="0"

in /boot/loader.conf for stable/10 and see if that has any impact on the timeouts at all (no need for a dmesg)? Confirming it's set correctly with sysctl kern.racct.enable would be a good idea.

(In reply to Steven Hartland from comment #33)
I tried kern.racct.enable="0" in /boot/loader.conf for stable/10; it did not make a difference. What's a binary chop?

Identify the edges where the issue occurs and doesn't occur, then pick the midpoint and see whether the issue happens there, then adjust and go for the next midpoint, e.g. working = commit 1000, broken = 500:

1. Check out, build and test commit 750.
1.1. If it works, repeat with commit 625.
1.2. If it is broken, repeat with commit 875.
...

Each time you test you halve the number of commits that may be the fix, until you have identified the fix / fixes.

(In reply to Steven Hartland from comment #35)
If I have time to go down the binary chop route, which revisions do I start with? Another option is to just update the BIOS and move on. If this bug is not affecting many people, then I guess updating the BIOS might save a lot of time.

Did you manage to make any progress on this, Yudi?

(In reply to Steven Hartland from comment #37)
Sorry, I haven't had much free time. I flashed the third-party BIOS and enabled AHCI on all the SATA ports. The bug disappeared right away; it looks like the bug is restricted to IDE ports. v11 definitely fixes the ATA/IDE bug, but I could not run v11 on a production server. Thank you very much for all your help and advice.
Let me know if there is anything I can help with; I would like to contribute back if possible.
cheers
Yudi

If you can still test the failure, it would be good to see if we can identify the kernel changes which fix the issue so we can ensure they get MFC'ed. I detailed the binary chop method above, so it would be good if you could help identify the relevant fix.

(In reply to Steven Hartland from comment #39)
I can reproduce the bug when I set the two SATA ports to IDE mode in 10.1. Before I go ahead with the binary chop and narrow down the commit that fixed this in HEAD, can you please confirm that the process below is correct? Because this is fixed in HEAD, I am guessing I need to start with a copy of the kernel from HEAD. I know that r283160 from HEAD did not have the bug, so I am guessing I need to go back from this one until I find the bug: check out r270000 (is that reasonable?) and see if the bug is present. Once I can find the bug, I go forward using the binary chop. Should I do this on a v11 install and roll back the kernel, or do it on 10.1-RELEASE and install the kernel from HEAD?

(In reply to Yudi from comment #40)
First off, sorry Yudi, I missed the notification of your reply. Yes, your approach is reasonable: start at a point which you think would have the bug and work through the space in HEAD (v11) until you identify the issue. As it is kernel related, you only need to build / install the kernel. I would advise you to create a cut-down kernel config to test with, as that will be much quicker to build than a full GENERIC.

One thing that came up the other day was that another user had a timeout issue which turned out to be related to smartmontools. When they ran, it caused his drives to throw timeout errors, so just in case you're running them I thought it would be worth mentioning.
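The binary chop described earlier in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes a single transition from "broken" to "fixed" somewhere between the two starting revisions, and `is_fixed` is a hypothetical callback standing in for the manual work of checking out a revision, building and installing the kernel, and checking whether the timeouts still appear. The function name `find_first_fixed` is likewise made up.

```python
from typing import Callable


def find_first_fixed(broken: int, working: int,
                     is_fixed: Callable[[int], bool]) -> int:
    """Return the lowest revision in (broken, working] for which is_fixed is true.

    Assumes `broken` shows the bug, `working` does not, and that there is
    exactly one point between them where the behaviour changes.
    """
    while working - broken > 1:
        mid = (broken + working) // 2
        if is_fixed(mid):
            working = mid   # fix is at mid or earlier
        else:
            broken = mid    # fix is after mid
    return working


# Example with the numbers used in the thread (broken = 500, working = 1000)
# and a pretend fix at revision 812, chosen purely for illustration.
tested = []

def pretend_test(revision: int) -> bool:
    tested.append(revision)
    return revision >= 812

print(find_first_fixed(500, 1000, pretend_test))  # 812
print(tested[0])                                  # 750, the first midpoint
```

Each call to `is_fixed` corresponds to one build-and-boot cycle, so the number of kernels to build grows only with the logarithm of the revision range.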
I can repeat this bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206200
Workaround for non-patched kernel: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202712#c3
Later I will test this patch from current. |