Bug 74070

Summary: DMA problems with large disks and HPT370
Product: Base System
Reporter: Tuure Laurinolli <tuure>
Component: kern
Assignee: Søren Schmidt <sos>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: Unspecified   
Hardware: Any   
OS: Any   

Description Tuure Laurinolli 2004-11-18 12:50:34 UTC
I get DMA errors when trying to access sector 268435455 (LBA 2^28 - 1), which is the 2^28th sector from the beginning of the disk.

I suspect this is a controller problem, but I have no real proof, because this is my only available controller that supports disks this large. I will try to find another controller to test with. I think it would be very unlikely for two new disks to both have the same problem on the same sector.

With a single disk, the errors of dd if=/dev/ad6 of=/tmp/test6 are:

ad6: TIMEOUT - READ_DMA retrying (2 retries left) LBA=268435455
ad6: TIMEOUT - READ_DMA retrying (1 retries left) LBA=268435455
ad6: FAILURE - READ_DMA timed out

With an HPT-native RAID1 setup the results are worse. I don't have exact error messages, but there are DMA timeouts on both disks (ad4 and ad6), which tear the array (ar0) apart and cause a kernel panic (possibly because the array is also the root disk).

How-To-Repeat:
[14:30:23][tazle@vortex][/var/run]% sudo dd if=/dev/ad6 of=/tmp/test6 skip=268435450 count=10
dd: /dev/ad6: Input/output error
5+0 records in
5+0 records out
2560 bytes transferred in 15.645115 secs (164 bytes/sec)
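
The numbers in the dd run above can be cross-checked with a little arithmetic. The following is only a sketch: it assumes dd's default block size of 512 bytes (no bs= was given on the command line), and the variable names are illustrative.

```python
# Values taken from the dd command line and its output above.
SECTOR_SIZE = 512            # assumption: dd's default block size, no bs= given
skip = 268435450             # skip=268435450
records_read = 5             # "5+0 records in"
bytes_transferred = 2560     # "2560 bytes transferred"

# The first block dd failed to read is the one right after the last good one.
first_failed_lba = skip + records_read

# Largest sector number that fits in a 28-bit LBA field.
max_28bit_lba = 2**28 - 1

print(first_failed_lba)                                  # failing sector
print(first_failed_lba == max_28bit_lba)                 # is it the 28-bit limit?
print(records_read * SECTOR_SIZE == bytes_transferred)   # byte count consistent?
```

Under that assumption the read fails exactly at the last sector addressable with 28 bits, which matches the LBA reported in the console messages.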


The system console shows the errors given in the full description.
Comment 1 Ilya Pizik 2004-11-19 08:21:41 UTC
I have the same problem:
RELENG_5 from 16.11
There are 5 HDDs in my PC:
60Gb (Seagate connected via Intel ICH2 UDMA100 controller)
120Gb (Seagate connected via Intel ICH2 UDMA100 controller)
250Gb (WD connected via Intel ICH2 UDMA100 controller)
250Gb (WD connected via Promise PDC20268 UDMA100 controller)
120Gb (Seagate connected via Promise PDC20268 UDMA100 controller)

Messages like these appear in the log when the PC is heavily loaded:
kernel: ad3: TIMEOUT - READ_DMA retrying (2 retries left) LBA=369792831
kernel: ad3: FAILURE - READ_DMA timed out
kernel: ad3: TIMEOUT - READ_DMA retrying (2 retries left) LBA=421053375
kernel: ad3: FAILURE - READ_DMA timed out
kernel: ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=355563327
kernel: ad2: FAILURE - READ_DMA timed out
kernel: ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=11109887
kernel: ad2: WARNING - removed from configuration
kernel: ata1-slave: FAILURE - READ_DMA timed out
kernel: ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=488374591
kernel: ad2: WARNING - removed from configuration
kernel: ata1-slave: FAILURE - READ_DMA timed out

...

I tried detaching and attaching the devices with atacontrol: the DMA errors
occur only with the 250Gb HDDs.


-- 
With respect, Pizik Ilya.
Comment 2 Gleb Smirnoff freebsd_committer freebsd_triage 2004-11-22 08:24:57 UTC
Responsible Changed
From-To: freebsd-bugs->sos

Over to ATA maintainer.
Comment 3 Tuure Laurinolli 2005-01-09 00:03:00 UTC
I tested the same machine with Linux, and had no problems reading sector 
268435455. After digging around the sources for a while, it seemed that 
Linux uses 48-bit operations whenever they're available, and FreeBSD 
only for sectors > 268435455. Maybe this is the source of failure here?

The Linux driver also seems to reset the HPT state machine before each 
command is run, though it's hard to see how this would cause problems 
with one specific sector, independent of previous commands.
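
The difference in addressing mode described in this comment can be modelled roughly as follows. This is a sketch of the two policies as stated in the comment only, not code from either driver; the function names and the "28-bit"/"48-bit" labels are made up for illustration.

```python
MAX_28BIT_LBA = 2**28 - 1  # 268435455, the largest LBA a 28-bit command can carry


def freebsd_policy_as_described(lba, lba48_supported):
    """Hypothetical model: 48-bit commands only for sectors > 268435455."""
    if lba > MAX_28BIT_LBA:
        if not lba48_supported:
            raise ValueError("sector not addressable without 48-bit support")
        return "48-bit"
    return "28-bit"


def linux_policy_as_described(lba, lba48_supported):
    """Hypothetical model: 48-bit commands whenever the drive supports them."""
    if lba48_supported:
        return "48-bit"
    if lba > MAX_28BIT_LBA:
        raise ValueError("sector not addressable without 48-bit support")
    return "28-bit"


# The failing sector from the original report, on a drive with 48-bit support.
print(freebsd_policy_as_described(268435455, True))
print(linux_policy_as_described(268435455, True))
```

In this model sector 268435455 is the last one that still goes through the 28-bit path under the first policy, while the second policy never uses the 28-bit path on such a drive. That would be consistent with the failure showing up on only one of the two systems, but it does not by itself prove the cause.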
Comment 4 Tuure Laurinolli 2005-01-14 09:09:33 UTC
The problem was indeed solved by the LBA tripover changes in -CURRENT, 
so this can now be closed as far as I'm concerned.
Comment 5 psyta 2005-03-15 08:47:28 UTC
Same problems on FreeBSD 5.3-RC2 and -RELEASE.
HPT370 with no raid option on.

Mar 15 05:00:25 obcy4 kernel: ad7: TIMEOUT - READ_DMA retrying (2 retries left) LBA=378610367
Mar 15 05:00:26 obcy4 kernel: ad7: WARNING - removed from configuration
Mar 15 05:00:26 obcy4 kernel: ata3-slave: FAILURE - READ_DMA timed out
Mar 15 05:06:35 obcy4 kernel: ad7: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=303905343
Mar 15 05:06:39 obcy4 kernel: ad7: WARNING - removed from configuration
Mar 15 05:06:39 obcy4 kernel: ata3-slave: FAILURE - WRITE_DMA timed out
Mar 15 05:06:39 obcy4 kernel: ad4: TIMEOUT - READ_DMA retrying (2 retries left) LBA=15923711
Mar 15 05:06:39 obcy4 kernel: ad4: FAILURE - READ_DMA timed out

Then panic. :/

Comment 6 Tuure Laurinolli 2005-03-15 12:09:41 UTC
Przemek Syta wrote:

> Mar 15 05:00:25 obcy4 kernel: ad7: TIMEOUT - READ_DMA retrying (2 retries left) LBA=378610367
> Mar 15 05:06:35 obcy4 kernel: ad7: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=303905343
> Mar 15 05:06:39 obcy4 kernel: ad4: TIMEOUT - READ_DMA retrying (2 retries left) LBA=15923711


Notice that your failing sectors are different from my original ones, 
which were caused by the combination of a drive firmware bug and FreeBSD.
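
A quick way to check whether a given set of log lines matches the original failure is to pull out the LBA values and compare them with the 28-bit limit. The sketch below uses the three lines quoted above; the helper name failing_lbas is made up for illustration.

```python
import re

MAX_28BIT_LBA = 2**28 - 1  # 268435455, the sector from the original report

# The three lines quoted in this comment.
log_lines = [
    "Mar 15 05:00:25 obcy4 kernel: ad7: TIMEOUT - READ_DMA retrying (2 retries left) LBA=378610367",
    "Mar 15 05:06:35 obcy4 kernel: ad7: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=303905343",
    "Mar 15 05:06:39 obcy4 kernel: ad4: TIMEOUT - READ_DMA retrying (2 retries left) LBA=15923711",
]


def failing_lbas(lines):
    """Return the LBA value of every line that contains an LBA=<number> field."""
    lbas = []
    for line in lines:
        match = re.search(r"LBA=(\d+)", line)
        if match:
            lbas.append(int(match.group(1)))
    return lbas


lbas = failing_lbas(log_lines)
for lba in lbas:
    status = "at the 28-bit limit" if lba == MAX_28BIT_LBA else "not at the 28-bit limit"
    print(lba, status)
```

None of the three values equals 268435455, which is why these timeouts look like a different problem from the one originally reported.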
Comment 7 Søren Schmidt freebsd_committer freebsd_triage 2005-04-11 12:14:10 UTC
State Changed
From-To: open->closed

Fixed in both 5.x-stable and -current