I am observing DMA timeouts when newfs-ing large (450GB) file systems:

- using RAID_10 (striped mirror) built by the card's BIOS utility
- using 4 250GB disks of type WDC WD2500SD (Caviar RAID EDITION disks), which probe as WDC WD25 00SD-01KCC0 08.0 when visible outside an array.

Note for these disks: they have a time-limited error recovery feature set to a default of 7 seconds. This is presumably so that a RAID controller will not lock up forever; see http://store.westerndigital.com/product.asp?sku=2545542 for details. The data sheet says the disks stay on-line if they hit this timeout. In the experiments here, I am fairly sure the disks had no internal failure, but I can't say whether they were triggering any sort of internal timeout.

ROCKETRAID 1820A

When using the HPTMV driver v1.12, the following messages are received (the repeated COMPLETION ERROR and "channel 2: perform recalibrate commandhptmv: Retry on channel (2)" messages have been deleted so that the final "Device removed" message is visible):

RocketRAID 182x SATA Controller driver Version v1.12 (Jan 26 2006 11:46:00)
Jan 28 16:10:49 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 28 16:10:49 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 28 16:10:49 placid kernel: hptmv: too many retries on channel(2)
Jan 28 16:10:49 placid kernel: hptmv: Device removed: controller 2 channel 2

The device removal is triggered in /usr/src/sys/dev/hptmv/entry.c after 5 retries, i.e. at the source code line "if (pCmd->RetryCount++>5)". A hacky experiment changing the hardwired 5 to UCHAR 250 allowed the newfs to complete, but the intense activity of copying the many small files in /usr/ports then caused the same failure.

Note that this system is successfully handling a RAID1 (mirror) holding its OS, with a largest file system of about 65GB, built on a couple of 80GB disks using another RocketRAID-1820A card.
[I have access to 3 1820A cards and believe the cards are fine]

FastTRAK S150

Changing to a FastTRAK-S150 with the same 4 disks on the same system, with a RAID_10 setup of almost exactly the same size (a few MB different, as FastTRAK seems to use some storage when creating the RAID and offers a slightly different geometry), gave a similar DMA-related error:

ad6: req=0xc446b4b0 WRITE_DMA semaphore timeout !! DANGER Mr Robinson

This system was able to newfs the 400+GB file system, but the timeout occurred while copying the many small /usr/ports files. (I was not able to consult with Mr Robinson but I think I got the message.)

SYSTEM MESSAGES (with RocketRAID 1820 only)
***************************************************************************
Jan 29 16:41:30 placid syslogd: kernel boot file is /boot/kernel/kernel
Jan 29 16:41:30 placid kernel: Copyright (c) 1992-2005 The FreeBSD Project.
Jan 29 16:41:30 placid kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jan 29 16:41:30 placid kernel: The Regents of the University of California. All rights reserved.
Jan 29 16:41:30 placid kernel: FreeBSD 6.0-STABLE #1: Fri Jan 27 19:09:38 EST 2006
Jan 29 16:41:30 placid kernel: phillip@placid.pm.yi.org:/usr/src/sys/i386/compile/hawke
Jan 29 16:41:30 placid kernel: Timecounter "i8254" frequency 1193182 Hz quality 0
Jan 29 16:41:30 placid kernel: CPU: AMD Athlon(tm) XP 2500+ (1830.01-MHz 686-class CPU)
Jan 29 16:41:30 placid kernel: Origin = "AuthenticAMD" Id = 0x6a0 Stepping = 0
Jan 29 16:41:30 placid kernel: Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
Jan 29 16:41:30 placid kernel: AMD Features=0xc0400800<SYSCALL,MMX+,3DNow+,3DNow>
Jan 29 16:41:30 placid kernel: real memory = 536805376 (511 MB)
Jan 29 16:41:30 placid kernel: avail memory = 515936256 (492 MB)
Jan 29 16:41:30 placid kernel: mptable_probe: MP Config Table has bad signature: <C9><E8>^L ^V
Jan 29 16:41:30 placid kernel: ACPI APIC Table: <Nvidia AWRDACPI>
Jan 29 16:41:30 placid kernel: ioapic0 <Version 1.1> irqs 0-23 on motherboard
Jan 29 16:41:30 placid kernel: npx0: [FAST]
Jan 29 16:41:30 placid kernel: npx0: <math processor> on motherboard
Jan 29 16:41:30 placid kernel: npx0: INT 16 interface
Jan 29 16:41:30 placid kernel: acpi0: <Nvidia AWRDACPI> on motherboard
Jan 29 16:41:30 placid kernel: acpi0: Power Button (fixed)
Jan 29 16:41:30 placid kernel: Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
Jan 29 16:41:30 placid kernel: acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
Jan 29 16:41:30 placid kernel: cpu0: <ACPI CPU> on acpi0
Jan 29 16:41:30 placid kernel: acpi_button0: <Power Button> on acpi0
Jan 29 16:41:30 placid kernel: pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff,0xcf0-0xcf3 on acpi0
Jan 29 16:41:30 placid kernel: pci0: <ACPI PCI bus> on pcib0
Jan 29 16:41:30 placid kernel: agp0: <NVIDIA nForce2 AGP Controller> mem 0xd0000000-0xd7ffffff at device 0.0 on pci0
Jan 29 16:41:30 placid kernel: pci0: <memory, RAM> at device 0.1 (no driver attached)
Jan 29 16:41:30 placid kernel: pci0: <memory, RAM> at device 0.2 (no driver attached)
Jan 29 16:41:30 placid kernel: pci0: <memory, RAM> at device 0.3 (no driver attached)
Jan 29 16:41:30 placid kernel: pci0: <memory, RAM> at device 0.4 (no driver attached)
Jan 29 16:41:30 placid kernel: pci0: <memory, RAM> at device 0.5 (no driver attached)
Jan 29 16:41:30 placid kernel: isab0: <PCI-ISA bridge> at device 1.0 on pci0
Jan 29 16:41:30 placid kernel: isa0: <ISA bus> on isab0
Jan 29 16:41:30 placid kernel: pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
Jan 29 16:41:30 placid kernel: ohci0: <OHCI (generic) USB controller> mem 0xe6001000-0xe6001fff irq 20 at device 2.0 on pci0
Jan 29 16:41:30 placid kernel: ohci0: [GIANT-LOCKED]
Jan 29 16:41:30 placid kernel: usb0: OHCI version 1.0, legacy support
Jan 29 16:41:30 placid kernel: usb0: SMM does not respond, resetting
Jan 29 16:41:30 placid kernel: usb0: <OHCI (generic) USB controller> on ohci0
Jan 29 16:41:30 placid kernel: usb0: USB revision 1.0
Jan 29 16:41:30 placid kernel: uhub0: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
Jan 29 16:41:30 placid kernel: uhub0: 3 ports with 3 removable, self powered
Jan 29 16:41:30 placid kernel: ohci1: <OHCI (generic) USB controller> mem 0xe6002000-0xe6002fff irq 21 at device 2.1 on pci0
Jan 29 16:41:30 placid kernel: ohci1: [GIANT-LOCKED]
Jan 29 16:41:30 placid kernel: usb1: OHCI version 1.0, legacy support
Jan 29 16:41:30 placid kernel: usb1: SMM does not respond, resetting
Jan 29 16:41:30 placid kernel: usb1: <OHCI (generic) USB controller> on ohci1
Jan 29 16:41:30 placid kernel: usb1: USB revision 1.0
Jan 29 16:41:30 placid kernel: uhub1: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
Jan 29 16:41:30 placid kernel: uhub1: 3 ports with 3 removable, self powered
Jan 29 16:41:30 placid kernel: ehci0: <EHCI (generic) USB 2.0 controller> mem 0xe6005000-0xe60050ff irq 22 at device 2.2 on pci0
Jan 29 16:41:30 placid kernel: ehci0: [GIANT-LOCKED]
Jan 29 16:41:30 placid kernel: usb2: EHCI version 1.0
Jan 29 16:41:30 placid kernel: usb2: companion controllers, 4 ports each: usb0 usb1
Jan 29 16:41:30 placid kernel: usb2: <EHCI (generic) USB 2.0 controller> on ehci0
Jan 29 16:41:30 placid kernel: usb2: USB revision 2.0
Jan 29 16:41:30 placid kernel: uhub2: nVidia EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
Jan 29 16:41:30 placid kernel: uhub2: 6 ports with 6 removable, self powered
Jan 29 16:41:30 placid kernel: nve0: <NVIDIA nForce MCP2 Networking Adapter> port 0xec00-0xec07 mem 0xe6006000-0xe6006fff irq 20 at device 4.0 on pci0
Jan 29 16:41:30 placid kernel: nve0: Ethernet address 00:0a:48:09:0a:79
Jan 29 16:41:30 placid kernel: miibus0: <MII bus> on nve0
Jan 29 16:41:30 placid kernel: ukphy0: <Generic IEEE 802.3u media interface> on miibus0
Jan 29 16:41:30 placid kernel: ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
Jan 29 16:41:30 placid kernel: nve0: Ethernet address: 00:0a:48:09:0a:79
Jan 29 16:41:30 placid kernel: pcib1: <ACPI PCI-PCI bridge> at device 8.0 on pci0
Jan 29 16:41:30 placid kernel: pci1: <ACPI PCI bus> on pcib1
Jan 29 16:41:30 placid kernel: pcm0: <Creative CT5880-C> port 0xc000-0xc03f irq 16 at device 8.0 on pci1
Jan 29 16:41:30 placid kernel: pcm0: <SigmaTel STAC9721/23 AC97 Codec>
Jan 29 16:41:30 placid kernel: pcm0: <Playback: DAC2 / Record: ADC>
Jan 29 16:41:30 placid kernel: hptmv0: <RocketRAID 182x SATA Controller> mem 0xe5000000-0xe507ffff irq 19 at device 10.0 on pci1
Jan 29 16:41:30 placid kernel: RocketRAID 182x SATA Controller driver Version v1.12 (Jan 26 2006 11:46:00)
Jan 29 16:41:30 placid kernel: RR182x [0,0]: channel started successfully
Jan 29 16:41:30 placid kernel: RR182x [0,1]: channel started successfully
Jan 29 16:41:30 placid kernel: RR182x: RAID5 write-back enabled
Jan 29 16:41:30 placid kernel: hptmv0: [GIANT-LOCKED]
Jan 29 16:41:30 placid kernel: hptmv1: <RocketRAID 182x SATA Controller> mem 0xe5080000-0xe50fffff irq 16 at device 11.0 on pci1
Jan 29 16:41:30 placid kernel: RocketRAID 182x SATA Controller driver Version v1.12 (Jan 26 2006 11:46:00)
Jan 29 16:41:30 placid kernel: RR182x [1,0]: channel started successfully
Jan 29 16:41:30 placid kernel: RR182x [1,1]: channel started successfully
Jan 29 16:41:30 placid kernel: RR182x [1,2]: channel started successfully
Jan 29 16:41:30 placid kernel: RR182x [1,3]: channel started successfully
Jan 29 16:41:30 placid kernel: RR182x: RAID5 write-back enabled
Jan 29 16:41:30 placid kernel: hptmv1: [GIANT-LOCKED]
Jan 29 16:41:30 placid kernel: atapci0: <nVidia nForce2 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 9.0 on pci0
Jan 29 16:41:30 placid kernel: ata0: <ATA channel 0> on atapci0
Jan 29 16:41:30 placid kernel: ata1: <ATA channel 1> on atapci0
Jan 29 16:41:30 placid kernel: fwohci0: <1394 Open Host Controller Interface> mem 0xe6003000-0xe60037ff,0xe6004000-0xe600403f irq 21 at device 13.0 on pci0
Jan 29 16:41:30 placid kernel: fwohci0: OHCI version 1.10 (ROM=0)
Jan 29 16:41:30 placid kernel: fwohci0: No. of Isochronous channels is 4.
Jan 29 16:41:30 placid kernel: fwohci0: EUI64 00:0a:48:00:00:02:c1:2c
Jan 29 16:41:30 placid kernel: fwohci0: Phy 1394a available S400, 2 ports.
Jan 29 16:41:30 placid kernel: fwohci0: Link S400, max_rec 2048 bytes.
Jan 29 16:41:30 placid kernel: firewire0: <IEEE1394(FireWire) bus> on fwohci0
Jan 29 16:41:30 placid kernel: fwe0: <Ethernet over FireWire> on firewire0
Jan 29 16:41:30 placid kernel: if_fwe0: Fake Ethernet address: 02:0a:48:02:c1:2c
Jan 29 16:41:30 placid kernel: fwe0: Ethernet address: 02:0a:48:02:c1:2c
Jan 29 16:41:30 placid kernel: fwe0: if_start running deferred for Giant
Jan 29 16:41:30 placid kernel: sbp0: <SBP-2/SCSI over FireWire> on firewire0
Jan 29 16:41:30 placid kernel: fwohci0: Initiate bus reset
Jan 29 16:41:30 placid kernel: fwohci0: node_id=0xc800ffc0, gen=1, CYCLEMASTER mode
Jan 29 16:41:30 placid kernel: firewire0: 1 nodes, maxhop <= 0, cable IRM = 0 (me)
Jan 29 16:41:30 placid kernel: firewire0: bus manager 0 (me)
Jan 29 16:41:30 placid kernel: pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
Jan 29 16:41:30 placid kernel: pci3: <ACPI PCI bus> on pcib2
Jan 29 16:41:30 placid kernel: pci3: <display, VGA> at device 0.0 (no driver attached)
Jan 29 16:41:30 placid kernel: acpi_tz0: <Thermal Zone> on acpi0
Jan 29 16:41:30 placid kernel: fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
Jan 29 16:41:30 placid kernel: fdc0: [FAST]
Jan 29 16:41:30 placid kernel: fd0: <1440-KB 3.5" drive> on fdc0 drive 0
Jan 29 16:41:30 placid kernel: atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
Jan 29 16:41:30 placid kernel: atkbd0: <AT Keyboard> irq 1 on atkbdc0
Jan 29 16:41:30 placid kernel: kbd0 at atkbd0
Jan 29 16:41:30 placid kernel: atkbd0: [GIANT-LOCKED]
Jan 29 16:41:30 placid kernel: pmtimer0 on isa0
Jan 29 16:41:30 placid kernel: orm0: <ISA Option ROMs> at iomem 0xc0000-0xcc7ff,0xd0000-0xd3fff,0xd4000-0xd57ff on isa0
Jan 29 16:41:30 placid kernel: sc0: <System console> at flags 0x100 on isa0
Jan 29 16:41:30 placid kernel: sc0: VGA <16 virtual consoles, flags=0x300>
Jan 29 16:41:30 placid kernel: vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Jan 29 16:41:30 placid kernel: ppc0: parallel port not found.
Jan 29 16:41:30 placid kernel: sio0: configured irq 4 not in bitmap of probed irqs 0
Jan 29 16:41:30 placid kernel: sio0: port may not be enabled
Jan 29 16:41:30 placid kernel: sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
Jan 29 16:41:30 placid kernel: sio0: type 8250 or not responding
Jan 29 16:41:30 placid kernel: sio1: configured irq 3 not in bitmap of probed irqs 0
Jan 29 16:41:30 placid kernel: sio1: port may not be enabled
Jan 29 16:41:30 placid kernel: ugen0: Sony Ericsson SEMC DSS SyncStation, rev 1.10/4.00, addr 2
Jan 29 16:41:30 placid kernel: ums0: WACOM CTE-430-UV3.1-4, rev 1.10/3.14, addr 3, iclass 3/1
Jan 29 16:41:30 placid kernel: ums0: 3 buttons and Z dir.
Jan 29 16:41:30 placid kernel: ums1: Logitech USB Mouse, rev 1.10/6.10, addr 2, iclass 3/1
Jan 29 16:41:30 placid kernel: ums1: 3 buttons and Z dir.
Jan 29 16:41:30 placid kernel: Timecounter "TSC" frequency 1830012961 Hz quality 800
Jan 29 16:41:30 placid kernel: Timecounters tick every 1.000 msec
Jan 29 16:41:30 placid kernel: acd0: CDRW <HL-DT-ST CD-RW GCE-8240B/1.07> at ata0-slave PIO4
Jan 29 16:41:30 placid kernel: da0 at hptmv0 bus 0 target 0 lun 0
Jan 29 16:41:30 placid kernel: da0: <RR182x RAID 1 Array 3.00> Fixed Direct Access SCSI-0 device
Jan 29 16:41:30 placid kernel: da0: 76319MB (156301477 512 byte sectors: 255H 63S/T 9729C)
Jan 29 16:41:30 placid kernel: da1 at hptmv1 bus 0 target 0 lun 0
Jan 29 16:41:30 placid kernel: da1: <RR182x RAID 1/0 Array 3.00> Fixed Direct Access SCSI-0 device
Jan 29 16:41:30 placid kernel: da1: 476950MB (976794112 512 byte sectors: 255H 63S/T 60802C)
Jan 29 16:41:30 placid kernel: Trying to mount root from ufs:/dev/da0s1a
Jan 29 16:41:30 placid savecore: no dumps found
Jan 29 16:41:31 placid root: /etc/rc: WARNING: $compat4x_enable is not set properly - see rc.conf(5).
Jan 29 16:47:05 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:05 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 29 16:47:05 placid kernel: channel 2: perform recalibrate commandhptmv: Retry on channel (2)
Jan 29 16:47:08 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:08 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 29 16:47:08 placid kernel: channel 2: perform recalibrate commandhptmv: Retry on channel (2)
Jan 29 16:47:10 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:10 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 29 16:47:10 placid kernel: channel 2: perform recalibrate commandhptmv: Retry on channel (2)
Jan 29 16:47:13 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:13 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 29 16:47:13 placid kernel: channel 2: perform recalibrate commandhptmv: Retry on channel (2)
Jan 29 16:47:16 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:16 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 29 16:47:16 placid kernel: channel 2: perform recalibrate commandhptmv: Retry on channel (2)
Jan 29 16:47:18 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:18 placid kernel: ATA regs: error 40, sector count 3f, LBA low 80, LBA mid d0, LBA high ac, device 40, status 51
Jan 29 16:47:18 placid kernel: channel 2: perform recalibrate commandhptmv: Retry on channel (2)
Jan 29 16:47:20 placid kernel: IAL: COMPLETION ERROR, adapter 1, channel 2, flags=104
Jan 29 16:47:20 placid kernel: ATA regs: error 40, sector count 3f, LBA l 51
Jan 29 16:47:20 placid kernel: hptmv: too many retries on channel(2)
Jan 29 16:47:20 placid kernel: hptmv: Device removed: controller 2 channel 2
Jan 29 16:50:30 placid halt: halted by phillip
Jan 29 16:50:30 placid syslogd: exiting on signal 15

(this looks like a test that allowed >5 timeouts as it has >5 messages)
***************************************************************************

How-To-Repeat:

Using FreeBSD-6:

1) build a RAID10 (4 disks) or RAID1 (2 disks) of 250GB or larger, using the disks noted above;
2) newfs a large file system.

If that doesn't trigger the problem, try copying many small files. It seems that an intense burst of disk activity is what causes the problem.

Similar bug reports with the RocketRAID 1820A exist from last year, but they do not show that the same problem can occur with the FastTRAK S150 card. Those reports did not say, as I recall, whether they were using similar RAID edition disks.
The following work-around for this bug has proved successful:

1) use the RocketRAID for RAID-1 arrays constructed on pairs of disks. Assuming same-sized disks, you'll end up with same-sized RAID-1 arrays.
2) instead of using the RocketRAID drivers for striping to produce a RAID-10 array, use a GEOM stripe across the RAID-1 arrays.

This allows newfs of 240GB file systems with no problems, and test copies of large numbers of small files (such as those in /usr/ports and /usr/src) also work fine. In addition, the RocketRAID driver can still manage on-line spare disks in the underlying RAID-1 arrays.

The speculation that led to this work-around was that the EDMA timeouts are perhaps caused by events being confused between the RocketRAID software drivers associated with RAID-1 and RAID-0. Having RocketRAID handle RAID-1 and GEOM handle the higher levels gets around this. However, I have not tested this speculation.

The RocketRAID RAID-1 + GEOM stripe arrays have been in normal use for almost a week. It looks good.

Note on system update: the system has gone from 6.0-STABLE to 6.1-PRERELEASE, during which the RocketRAID driver has not changed.
Had a problem similar to this, and it's fixed by updating to the v1.14 version of the HighPoint driver. This includes a fix for a boundary condition error on specific large disks, which may well be triggered by the test mentioned; it certainly creates the same error message. The new driver version can be found on HighPoint's site.