Bug 66001

Summary: ATA driver does not recover from READ_DMA TIMEOUT
Product: Base System
Reporter: Patrick Mackinlay <patrick>
Component: kern
Assignee: Søren Schmidt <sos>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 5.2.1-RELEASE   
Hardware: Any   
OS: Any   

Description Patrick Mackinlay 2004-04-26 18:30:21 UTC
Transferring data from one drive to another I get the following errors:
ad7: TIMEOUT - READ_DMA retrying (2 retries left) LBA=19030399
ad7: WARNING - READ_DMA interrupt was seen but timeout fired LBA=19030399
ad7: WARNING - READ_DMA interrupt was seen but taskqueue stalled LBA=19030399
The process that caused the error (cp or mv in my case) can be interrupted or killed, but will otherwise block. After this point it is no longer possible to access ad7. All processes that read, write, or try to umount the drive block and cannot be killed, interrupted, or stopped. Since ad7 cannot be unmounted, it becomes useless. Furthermore, the entire machine eventually hangs (presumably once enough processes try to access ad7).
The following lines from dmesg are also relevant:

atapci1: <HighPoint HPT370 UDMA100 controller> port 0xc000-0xc0ff,0xbc00-0xbc03,0xb800-0xb807,0xb400-0xb403,0xb000-0xb007 irq 11 at device 19.0 on pci0
atapci1: [MPSAFE]

ata3: at 0xb800 on atapci1
ata3: [MPSAFE]

GEOM: create disk ad7 dp=0xc637da60
ad7: 14649MB <IBM-DTLA-307015> [29765/16/63] at ata3-slave UDMA100

ad7: TIMEOUT - READ_DMA retrying (2 retries left) LBA=19030399
ad7: WARNING - READ_DMA interrupt was seen but timeout fired LBA=19030399
ad7: WARNING - READ_DMA interrupt was seen but taskqueue stalled LBA=19030399

Please let me know if you require further details.
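
For triage, the LBA in the messages above can be turned into a byte offset and checked against the drive size shown in dmesg. The following is a minimal sketch, not part of the ata(4) driver: the regular expression is inferred only from the three messages quoted in this report, the 512-byte sector size is an assumption, and the names `parse_ata_message` and `chs_capacity_sectors` are hypothetical helpers.

```python
import re

# Message layout inferred from the three lines quoted in this report only;
# this is an assumption, not a documented ata(4) log format.
ATA_MSG = re.compile(
    r"^(?P<dev>ad\d+): (?P<level>TIMEOUT|WARNING) - "
    r"(?P<cmd>\w+) (?P<text>.*?) LBA=(?P<lba>\d+)$"
)

SECTOR_SIZE = 512  # assumed bytes per sector


def parse_ata_message(line):
    """Return the fields of one ata message as a dict, or None if no match."""
    m = ATA_MSG.match(line.strip())
    if m is None:
        return None
    lba = int(m.group("lba"))
    return {
        "dev": m.group("dev"),
        "level": m.group("level"),
        "cmd": m.group("cmd"),
        "text": m.group("text"),
        "lba": lba,
        "byte_offset": lba * SECTOR_SIZE,
    }


def chs_capacity_sectors(cylinders, heads, sectors_per_track):
    """Total sectors implied by a [C/H/S] geometry triple from dmesg."""
    return cylinders * heads * sectors_per_track


if __name__ == "__main__":
    msg = parse_ata_message(
        "ad7: TIMEOUT - READ_DMA retrying (2 retries left) LBA=19030399"
    )
    total = chs_capacity_sectors(29765, 16, 63)  # geometry reported for ad7
    print(msg["dev"], msg["cmd"], "byte offset:", msg["byte_offset"])
    print("LBA inside reported capacity:", msg["lba"] < total)
```

Under these assumptions the failing LBA lies within the capacity implied by the reported [29765/16/63] geometry, which is consistent with a bad region on the media rather than an out-of-range request.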
Comment 1 Simon L. B. Nielsen freebsd_committer freebsd_triage 2004-04-26 21:38:54 UTC
Responsible Changed
From-To: freebsd-bugs->sos

Sounds like an ata(4) issue, so over to ata maintainer.
Comment 2 dkelly 2004-07-30 18:23:06 UTC
Believe I am having same problem. And that kern/62897 is probably the 
same thing too.

Bought a brand new Dell 400SC, then a pair of Hitachi HDS722516VLSA80 
160G SATA drives. The base Seagate ST340014A 40G is on parallel ATA 
partitioned with sysinstall's "auto" defaults. System withstood a couple 
of days of abuse including "make world" before installing the SATA 
drives, leaving the PATA 40G booting FreeBSD 5.2.1-p9.

Partitioned the 160's with 1G of swap at the start, remainder native 
FreeBSD. Have not used the swap partitions.

Striped the two large partitions with vinum. Then started filling via 
ftp. Instantly locked the machine requiring power cycle to recover.

Have removed vinum, newfs'ed the bare partitions ad[46]s1d and tried 
using them simply. cp from PATA to the fs on ad6s1d works just great. 
cp of files on the fs at ad6s1d to the fs on ad4s1d gets READ_DMA 
timeout at 1349058560 bytes into the first file. This cp process is 
stuck. It's not moving. It's not responding to kill. Apparently 
everything to ad4 is blocked until this clears.

Shutdown gave up on syncing 22 buffers. Fsck reports "bad inode number 
1083392 to nextinode" now on ad4s1d. Time for newfs.

CPU is a P4-2.8G 512k with Hyperthreading enabled. Disabling HT appears 
to have ended the problem and results in a reliable machine.
Comment 3 Patrick Mackinlay 2004-07-30 18:40:40 UTC
Hello,

I can reproduce this every time. I finally identified the file that is
using the disk sectors that are causing the fault and renamed the file
to "/file_system_mount_point/broken". This is a workaround that works,
however what really needs to be fixed is the ata driver. It quite clearly
does not handle hard disk failures properly.

Patrick


-- 
Patrick Mackinlay                              patrick@spacesurfer.com
http://patrick.spacesurfer.com/                    tel: +44.7050699851
Yahoo messenger: patrick00_uk                      fax: +44.7050699852
SpaceSurfer Limited                           http://www.spacereg.com/
Comment 4 dkelly 2004-07-30 18:57:35 UTC
On Jul 30, 2004, at 12:40 PM, Patrick Mackinlay wrote:

> I can reproduce this every time. I finally identified the file that is
> using the disk sectors that are causing the fault and renamed the file
> to "/file_system_mount_point/broken". This is a workaround that works,
> however what really needs to be fixed is the ata driver. It quite clearly
> does not handle hard disk failures properly.

That cause sounds different from my problem, although it's likely we are 
both hanging on the same error-handling problem. Sounds like Patrick 
has a bad block on the media? See badsect(8) for something that might 
help create a bandaid.

Since posting earlier I have disabled hyperthreading of the CPU in the 
BIOS and have written over 50G to each of the "problem" SATA drives, 
reading from one or the other.

Am now confident hyperthreading (SMP) was the root of my problem and am 
ready to set the machine to the tasks it was purchased for, with HT 
disabled.
Comment 5 Søren Schmidt freebsd_committer freebsd_triage 2004-08-16 12:23:00 UTC
State Changed
From-To: open->closed

You should try -current (or soon to be 5.3) as I've fixed a couple of races 
that could provoke this.