| Summary: | file corruption with Adaptec 29160 SCSI adapter | ||
|---|---|---|---|
| Product: | Base System | Reporter: | mjh <mjh> |
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Closed FIXED | ||
| Severity: | Affects Only Me | ||
| Priority: | Normal | ||
| Version: | 4.2-RELEASE | ||
| Hardware: | Any | ||
| OS: | Any | ||
I still haven't found a cure for this, but here are a few things I've
tried:
- configure the adaptor to only use 80MB/s
- disable write caching in the adaptor
- build a kernel that has tagged queuing disabled for this particular
Seagate drive.
So far, no luck. I'm still getting file corruption.
Of the corrupted files I've looked at, here's where the corruption
starts in each (rounded to the nearest MB):
192MB, 208MB, 496MB, 256MB, 0.9MB, 437MB, 905MB, 656MB, 576MB, 672MB,
512MB, 672MB, 400MB, 643MB, 752MB.
The 0.9MB corruption seems to be an anomaly amongst my anomalies: when
I was writing 100MB files, I wrote thousands of files and never saw
corruption. When writing 768MB files, I'm seeing about one file in
seven get corrupted.
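For cross-checking the offsets listed above against a checker that reports positions as 32-bit word indices (as the test program in the original report does), the conversion can be sketched as follows. This is a minimal illustration, not part of the original report; the helper names `word_to_byte_offset`, `byte_offset_to_nearest_mb` and `is_chunk_aligned` are hypothetical, and 1MB is assumed to mean 1024*1024 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the file is a sequence of 32-bit words. */
#define WORD_BYTES 4u
/* Assumption: "MB" means 1024*1024 bytes. */
#define MB_BYTES   (1024u * 1024u)

/* Byte offset at which the given word starts (hypothetical helper). */
static uint64_t word_to_byte_offset(uint64_t word_index)
{
    return word_index * WORD_BYTES;
}

/* Byte offset rounded to the nearest whole MB (hypothetical helper). */
static uint64_t byte_offset_to_nearest_mb(uint64_t byte_offset)
{
    return (byte_offset + MB_BYTES / 2) / MB_BYTES;
}

/* Nonzero if the offset lies on a boundary of the given chunk size. */
static int is_chunk_aligned(uint64_t byte_offset, uint64_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return byte_offset % chunk_bytes == 0;
}
```

For example, word index 114561360 from the sample output in the original report starts at byte offset 458245440, which rounds to 437MB and lies on a 64-byte boundary.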
BTW, if someone wants remote access to help track this down, I can
oblige.
Cheers,
Mark
Another update:
- the problem also occurs on FreeBSD 4.3-RELEASE
- the problem occurs with an IBM DDYS-T18350N 18GB Ultra 160 drive,
although less frequently
- the problem occurs with an IBM DDRS-39130D 9GB Ultra 2 drive, although
much less frequently (it took writing 250 768MB files for it to occur).
Not sure if this helps much, but at least it's a few more data points.
- Mark

OK, the problem isn't a SCSI problem. I just put an ATA100 disk in one
of these machines and removed the SCSI controller card. The problem is
just the same: large files get corrupted with about a 1-in-5
probability.
So, what's the real problem? Is this a fundamental problem with the VIA
82C686A south bridge? The Register mentions something that sounds
similar:
http://www.theregister.co.uk/content/3/18267.html
But that is with the 686B, not the 686A on the original Asus A7V that
my machines have.
I suppose it's also possible there's a timing hole in the FreeBSD
filesystem code, but this seems unlikely to me.
- Mark

Just for another data point, I moved the IDE disk to the ATA66
controller (built into the south bridge) rather than the separate
Promise ATA100 controller (which is on the motherboard on the A7V).
Same problem, although rather than seeing 64-byte chunks of corrupted
data, as I did with SCSI, I'm seeing 4K chunks of data from elsewhere in
the file (i.e. the file contains two copies of one 4K chunk, and no copy
of another 4K chunk).
Given this, it seems unlikely to me that this is a software problem. If
it had been, I'd have expected the size of the corruptions to be similar
in both cases. But I'm really confused - this seems to be the sort of
problem that someone else must have seen.
- Mark

I've now replaced the Asus A7V motherboard with the new Asus A7A266
which has an ALi south bridge instead of the Via 82C686A. The rest of
the system is unchanged. The problem has now gone away.
Thus I conclude that Via's south bridge is most likely the problem, and
that this is unlikely to be a FreeBSD issue, and very unlikely to be a
SCSI problem.
I think this Problem Report can be closed now.
Cheers,
Mark
State Changed From-To: open->closed
Closed at originator's request
I've got five 1.1GHz Athlon systems running FreeBSD 4.2R, each with 512MB RAM, an Asus A7V motherboard, an Adaptec 29160 U160 SCSI adaptor, and a SEAGATE ST318451LW 18GB drive.

The problem is I'm seeing file corruption when I write large (approx 512MB or larger) files, especially when I write them rapidly. I can't guarantee it doesn't happen with smaller files, but I wrote a thousand 100MB files, and not one of them was corrupted. The problem basically is that the files get 64-byte chunks (usually 64, sometimes smaller) of other data in the middle of them.

I first noticed the problem with scp, but the problem also happens with moderate repeatability when simply rapidly writing a big file by redirecting stdout. Here's the quick-hack test program:

    #include <stdio.h>

    /* Parenthesized so the macro is safe inside larger expressions. */
    #define FSIZE (1000*1024*1024)

    int main(void)
    {
        int i, j;
        int buf[1024];

        j = 0;
        for (i = 0; i < FSIZE/4; i++) {
            buf[j] = i;    /* word i of the file holds the value i */
            if (j == 1023) {
                /* buffer full: write 4096 bytes (4 items of 1024 bytes) */
                fwrite(buf, 1024, 4, stdout);
                j = 0;
            } else {
                j++;
            }
        }
        return 0;
    }

Basically it's writing 1000MB to stdout, writing incrementing values to each 32-bit word. I direct stdout to a file. The MD5 checksum of the output file should be 1da068574fdb3e3b9ffc3b2022cca171, but sometimes (somewhere between 1-in-3 and 1-in-10 tries) the file gets corrupted.
The program to read this back is:

    #include <stdio.h>

    #define FSIZE (1000*1024*1024)

    int main(void)
    {
        int i;
        int j, prev = 0;
        int mode = 0;    /* 0: data matches so far, 1: inside a corrupted run */

        for (i = 0; i < FSIZE/4; i++) {
            if (fread(&j, 1, 4, stdin) != 4)
                break;   /* stop on a short read rather than reuse stale data */
            if (mode == 0) {
                if (i != j) {
                    printf("-----------------------------\n");
                    printf("problem start at word: %d\n", i);
                    printf("got value %d instead of %d\n", j, i);
                    mode = 1;
                }
            } else {
                if (i == j) {
                    printf("-----------------------------\n");
                    printf("last word of problem : %d\n", i-1);
                    printf("got value %d instead of %d\n", prev, i-1);
                    mode = 0;
                }
            }
            prev = j;
        }
        return 0;
    }

Here's one sample output, where there are two separate corruptions:

    gaur.aciri.org: ./unfoo3 < t4
    -----------------------------
    problem start at word: 114561360
    got value 909456435 instead of 114561360
    got value 171522103 instead of 114561361
    got value 875770417 instead of 114561362
    got value 943142453 instead of 114561363
    got value 842074681 instead of 114561364
    got value 909456435 instead of 114561365
    got value 171522103 instead of 114561366
    got value 875770417 instead of 114561367
    got value 943142453 instead of 114561368
    got value 842074681 instead of 114561369
    got value 909456435 instead of 114561370
    got value 171522103 instead of 114561371
    got value 875770417 instead of 114561372
    got value 943142453 instead of 114561373
    got value 842074681 instead of 114561374
    got value 909456435 instead of 114561375
    -----------------------------
    last word of problem : 114561375
    got value 909456435 instead of 114561375
    -----------------------------
    problem start at word: 237338864
    got value 112460016 instead of 237338864
    got value 112460017 instead of 237338865
    got value 112460018 instead of 237338866
    got value 112460019 instead of 237338867
    got value 112460020 instead of 237338868
    got value 112460021 instead of 237338869
    got value 112460022 instead of 237338870
    got value 112460023 instead of 237338871
    got value 112460024 instead of 237338872
    got value 112460025 instead of 237338873
    got value 112460026 instead of 237338874
    got value 112460027 instead of 237338875
    got value 112460028 instead of 237338876
    got value 112460029 instead of 237338877
    got value 112460030 instead of 237338878
    got value 112460031 instead of 237338879
    -----------------------------
    last word of problem : 237338879
    got value 112460031 instead of 237338879

In this case, there are two corruptions. The first corruption seems to be some random chunk of data; the second (more typical) corruption seems to be a copy of an earlier piece of the file. In most cases, the corruption seems to be of a 64-byte chunk of the file replaced with some other data, typically (but not always) an earlier chunk of the same file. I've never seen more than 64 bytes corrupted, but on one of the machines I've seen smaller corruptions.

I originally thought this was a hardware problem, but I've reproduced it on the three identical machines I've tried, so if it is a hardware fault, it's in the whole batch. I've also tried to reproduce it on an additional 1GHz Athlon/A7V machine with an Adaptec 2940 SCSI adaptor, but that machine doesn't suffer from the same problem, so I'm beginning to suspect an interaction between the Adaptec 29160 driver and the filesystem when writing large files as being a possible cause.

Here's the dmesg.boot from one of the problem machines in case it helps.

    gaur.aciri.org: more /var/run/dmesg.boot
    Copyright (c) 1992-2000 The FreeBSD Project.
    Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
    FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001
        root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB
    Timecounter "i8254" frequency 1193182 Hz
    CPU: AMD Athlon(tm) Processor (1109.89-MHz 686-class CPU)
      Origin = "AuthenticAMD" Id = 0x642 Stepping = 2
      Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
      AMD Features=0xc0440000<<b18>,AMIE,DSP,3DNow!>
    real memory = 536788992 (524208K bytes)
    avail memory = 518864896 (506704K bytes)
    Preloaded elf kernel "kernel" at 0xc03c8000.
    Pentium Pro MTRR support enabled
    md0: Malloc disk
    npx0: <math processor> on motherboard
    npx0: INT 16 interface
    pcib0: <Host to PCI bridge> on motherboard
    pci0: <PCI bus> on pcib0
    pcib2: <PCI to PCI bridge (vendor=1106 device=8305)> at device 1.0 on pci0
    pci1: <PCI bus> on pcib2
    isab0: <VIA 82C686 PCI-ISA bridge> at device 4.0 on pci0
    isa0: <ISA bus> on isab0
    atapci0: <VIA 82C686 ATA66 controller> port 0xd800-0xd80f at device 4.1 on pci0
    ata1: at 0x170 irq 15 on atapci0
    pci0: <VIA 83C572 USB controller> at 4.2 irq 12
    pci0: <VIA 83C572 USB controller> at 4.3 irq 12
    fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xa400-0xa43f mem 0xd6800000-0xd68fffff,0xd7000000-0xd7000fff irq 10 at device 11.0 on pci0
    fxp0: Ethernet address 00:02:b3:10:b4:67
    pci0: <3D Labs model 000a graphics accelerator> at 12.0 irq 11
    ahc0: <Adaptec 29160 Ultra160 SCSI adapter> port 0xa000-0xa0ff mem 0xd5800000-0xd5800fff irq 12 at device 13.0 on pci0
    aic7892: Wide Channel A, SCSI Id=7, 32/255 SCBs
    atapci1: <Promise ATA100 controller> port 0x8400-0x843f,0x8800-0x8803,0x9000-0x9007,0x9400-0x9403,0x9800-0x9807 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
    pcib1: <Host to PCI bridge> on motherboard
    pci2: <PCI bus> on pcib1
    fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
    fdc0: FIFO enabled, 8 bytes threshold
    fd0: <1440-KB 3.5" drive> on fdc0 drive 0
    atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
    vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
    sc0: <System console> at flags 0x100 on isa0
    sc0: VGA <16 virtual consoles, flags=0x300>
    sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
    sio0: type 16550A
    sio1 at port 0x2f8-0x2ff irq 3 on isa0
    sio1: type 16550A
    DUMMYNET initialized (000608)
    IP packet filtering initialized, divert disabled, rule-based forwarding disabled, default to deny, logging disabled
    acd0: CDROM <SONY CDU4811> at ata1-master using PIO4
    Waiting 5 seconds for SCSI devices to settle
    Mounting root from ufs:/dev/da0s1a
    da0 at ahc0 bus 0 target 0 lun 0
    da0: <SEAGATE ST318451LW 0003> Fixed Direct Access SCSI-3 device
    da0: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
    da0: 17501MB (35843671 512 byte sectors: 255H 63S/T 2231C)

How-To-Repeat: Write several very large files rapidly (see above). Some fraction of them will be corrupted (I see between 5% and 25% of 512MB files get corrupted).
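
The report distinguishes corrupted chunks that hold unrelated data from chunks that hold a copy of another part of the same file. That distinction can be checked mechanically. The following is a minimal sketch, not part of the original report, assuming the test file layout described above, where word k of an intact file holds the value k. `struct run_info` and `classify_run` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of one run of mismatching words (hypothetical type). */
struct run_info {
    size_t length_bytes;   /* size of the corrupted run in bytes */
    int is_copy;           /* 1 if the run looks copied from elsewhere in the file */
    uint32_t source_word;  /* word index the data came from; valid only if is_copy */
};

/*
 * Assumption: in an intact file, word k holds the value k. A corrupted run
 * whose values are consecutive integers is therefore a copy of the words
 * starting at index values[0]; any other run is treated as foreign data.
 * A one-word run is trivially reported as a copy.
 */
static struct run_info classify_run(const uint32_t *values, size_t nwords)
{
    struct run_info info;
    size_t k;

    assert(values != NULL && nwords > 0);

    info.length_bytes = nwords * sizeof(uint32_t);
    info.is_copy = 1;
    info.source_word = values[0];

    for (k = 1; k < nwords; k++) {
        if (values[k] != values[0] + (uint32_t)k) {
            info.is_copy = 0;   /* values do not increase by one: not a copy */
            break;
        }
    }
    return info;
}
```

Fed the 16 values of the second corruption in the sample output, this reports a 64-byte copy whose source is word 112460016; the values of the first corruption are classified as foreign data. The same function applies unchanged to a 4K run (1024 words).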