Bug 22640

Summary: SCSI problem halts system after long period of perfect behaviour
Product: Base System
Reporter: jan.redepenning <jan.redepenning>
Component: i386
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 3.5-STABLE   
Hardware: Any   
OS: Any   

Description jan.redepenning 2000-11-06 14:40:01 UTC
The machine boots and works fine (even "perfect") for an unpredictable
period of time (between 4 and 8 days). Then, on the console (and in
the log files) there are long rows of:

goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e
goelz /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e
goelz /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
goelz /kernel: ahc0: Issued Channel A Bus Reset. 78 SCBs aborted

repeating all the time until the machine is manually reset (often, even
telnet doesn't work any more). Usually the problems start with da0 and
then spread to the other drives. Configuration of the machine from the
boot messages:

goelz /kernel: CPU: AMD-K7(tm) Processor (604.23-MHz 686-class CPU)
goelz /kernel: Origin = "AuthenticAMD"  Id = 0x612  Stepping = 2
goelz /kernel: Features=0x81f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,MMX>
goelz /kernel: AMD Features=0xc0400000<<b22>,<b30>,3DNow!>
goelz /kernel: real memory  = 268435456 (262144K bytes)
goelz /kernel: avail memory = 258494464 (252436K bytes)
goelz /kernel: Preloaded elf kernel "kernel" at 0xc02b0000.
goelz /kernel: Pentium Pro MTRR support enabled
goelz /kernel: Probing for devices on PCI bus 0:
goelz /kernel: chip0: <Host to PCI bridge (vendor=1022 device=7006)> rev 0x23 on pci0.0.0
goelz /kernel: chip1: <PCI to PCI bridge (vendor=1022 device=7007)> rev 0x01 on pci0.1.0
goelz /kernel: chip2: <PCI to ISA bridge (vendor=1106 device=0686)> rev 0x1b on pci0.4.0
goelz /kernel: ahc0: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 10 on pci0.14.0
goelz /kernel: ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
goelz /kernel: xl0: <3Com 3c905C-TX Fast Etherlink XL> rev 0x74 int a irq 12 on pci0.16.0
goelz /kernel: xl0: Ethernet address: 00:50:da:40:b3:8e
goelz /kernel: xl0: autoneg complete, link status good (half-duplex, 100Mbps)
goelz /kernel: Probing for devices on PCI bus 1:
goelz /kernel: vga0: <ATI model 4c42 graphics accelerator> rev 0xdc int a irq 11 on pci1.5.0
goelz /kernel: Probing for devices on the ISA bus:
goelz /kernel: sc0 on isa
goelz /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
goelz /kernel: atkbdc0 at 0x60-0x6f on motherboard
goelz /kernel: atkbd0 irq 1 on isa
goelz /kernel: psm0 not found
goelz /kernel: sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa
goelz /kernel: sio0: type 16550A
goelz /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
goelz /kernel: sio1: type 16550A
goelz /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
goelz /kernel: fdc0: FIFO enabled, 8 bytes threshold
goelz /kernel: fd0: 1.44MB 3.5in
goelz /kernel: ppc0 at 0x378 irq 7 flags 0x40 on isa
goelz /kernel: ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
goelz /kernel: ppc0: FIFO with 16/16/8 bytes threshold
goelz /kernel: lpt0: <generic printer> on ppbus 0
goelz /kernel: lpt0: Interrupt-driven port
goelz /kernel: ppi0: <generic parallel i/o> on ppbus 0
goelz /kernel: plip0: <PLIP network interface> on ppbus 0
goelz /kernel: vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa
goelz /kernel: npx0 on motherboard
goelz /kernel: npx0: INT 16 interface
goelz /kernel: Waiting 2 seconds for SCSI devices to settle
goelz /kernel: changing root device to da0s1a
goelz /kernel: da2 at ahc0 bus 0 target 2 lun 0
goelz /kernel: da2: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da2: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da2: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: da3 at ahc0 bus 0 target 5 lun 0
goelz /kernel: da3: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da3: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da3: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: da1 at ahc0 bus 0 target 1 lun 0
goelz /kernel: da1: <IBM DRHS36D 0270> Fixed Direct Access SCSI-3 device
goelz /kernel: da1: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
goelz /kernel: da1: 35239MB (72170879 512 byte sectors: 255H 63S/T 4492C)
goelz /kernel: da0 at ahc0 bus 0 target 0 lun 0
goelz /kernel: da0: <IBM DRHS36V 0270> Fixed Direct Access SCSI-3 device
goelz /kernel: da0: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
goelz /kernel: da0: 35239MB (72170879 512 byte sectors: 255H 63S/T 4492C)
goelz /kernel: da5 at ahc0 bus 0 target 9 lun 0
goelz /kernel: da5: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da5: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da5: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: da4 at ahc0 bus 0 target 8 lun 0
goelz /kernel: da4: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da4: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da4: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: cd0 at ahc0 bus 0 target 4 lun 0
goelz /kernel: cd0: <TEAC CD-ROM CD-532S 1.0A> Removable CD-ROM SCSI-2 device
goelz /kernel: cd0: 20.000MB/s transfers (20.000MHz, offset 16)
goelz /kernel: cd0: Attempt to query device size failed: NOT READY, Medium not present
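
The figures in the probe lines above can be cross-checked with a little arithmetic. The following is only a sketch: it assumes that "MB" means 2^20 bytes, that the printed values are truncated rather than rounded, and that cd0 is an 8-bit (narrow) device because no bus width is printed for it. The function names are made up for this illustration.

```python
SECTOR_BYTES = 512  # sector size stated in the probe lines


def size_mb(sectors):
    """Capacity in MB (2**20 bytes), truncated; truncation is an assumption."""
    return sectors * SECTOR_BYTES // (1024 * 1024)


def cylinders(sectors, heads=255, sectors_per_track=63):
    """Whole cylinders for the 255H 63S/T geometry shown in the log."""
    return sectors // (heads * sectors_per_track)


def sync_rate_mb_s(freq_mhz, bus_width_bits):
    """Synchronous transfer rate: one transfer per clock, width/8 bytes each."""
    return freq_mhz * bus_width_bits / 8


if __name__ == "__main__":
    # Sector counts taken from the da2..da5 and da0/da1 probe lines.
    for name, sectors in (("DDYS-T36950N", 71687340), ("DRHS36", 72170879)):
        print(name, size_mb(sectors), "MB,", cylinders(sectors), "cylinders")
    print("wide disks:", sync_rate_mb_s(20.0, 16), "MB/s")
    print("cd0 (assumed narrow):", sync_rate_mb_s(20.0, 8), "MB/s")
```

Under these assumptions the results agree with the reported sizes, cylinder counts and transfer rates, so the probe output itself looks self-consistent.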

How-To-Repeat: Reboot, wait a few days... I'm sorry that I'm unable to be more
specific... We've tried heavy load phases which worked fine; once
it crashed during such a "copy-orgy", at other times it worked fine.

:-(
Comment 1 wataru-s 2000-12-05 17:08:25 UTC
Herr,

I'm Wataru Satoh working for an ISP in Japan.

We have run into mysterious SCSI trouble just like yours.

The OS is FreeBSD 3.5.1-RELEASE,
the host adapter is an Adaptec 2940UW, and
the disks are IBM 18G Ultrastar, DNES-318350.

The machine suddenly died twice in a week, with a disk obstinately
keeping its "being accessed" LED on.
I tried a warm restart without powering down, but that did not work:
the machine tried to boot as if it had no disk.

The second time, after a successful boot following a power cycle,
I examined the machine in single-user mode.

The first time the machine was equipped with two identical disks, but
this time only one drive was attached, set to SCSI ID #1 (da0),
which is divided into 2 partitions and a swap region.
The secondary filesystem (da0s1e for /home), which was pretty
heavily accessed, mostly for MRTG logging and graphing, was awfully
corrupted. The primary one (da0s1a for /) was not corrupted at all.

In /var/log/dmesg.yesterday, I found the following message.
I don't know when it was printed - sorry, serial console.

(da0:ahc0:0:0:0): SCB 0x3e - timed out while idle, LASTPHASE == 0x1, SEQADDR == 0x153
(da0:ahc0:0:0:0): SCB 62: Immediate reset.  Flags = 0x4040
(da0:ahc0:0:0:0): no longer in timeout, status = 34b
ahc0: Issued Channel A Bus Reset. 64 SCBs aborted

Does anyone know how to track down or examine this "bug", or whether it is a
hardware/firmware failure or some other SCSI voodoo?
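
One possible first step in tracking this is to tally the timeout and bus-reset messages per device from the saved logs, to see which drive starts the cascade and in which state the commands time out. The sketch below assumes only the message shapes quoted in this report; the regular expressions and the `summarize` helper are hypothetical and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# "(da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e"
TIMEOUT_RE = re.compile(
    r"\((?P<periph>[a-z]+\d+):(?P<ctrl>[a-z]+\d+):\d+:\d+:\d+\): "
    r"SCB (?P<scb>0x[0-9a-fA-F]+) - timed out (?P<state>[^,]+)"
)
# "ahc0: Issued Channel A Bus Reset. 78 SCBs aborted"
RESET_RE = re.compile(
    r"(?P<ctrl>[a-z]+\d+): Issued Channel (?P<chan>[A-Z]) Bus Reset\. "
    r"(?P<n>\d+) SCBs aborted"
)


def summarize(lines):
    """Count timeouts per (device, state) and collect bus resets."""
    timeouts = Counter()
    resets = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[(m["periph"], m["state"].strip())] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets.append((m["ctrl"], m["chan"], int(m["n"])))
    return timeouts, resets


# Lines copied from the two log excerpts in this report.
SAMPLE = [
    "goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e",
    "goelz /kernel: (da0:ahc0:0:0:0): BDR message in message buffer",
    "goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e",
    "goelz /kernel: ahc0: Issued Channel A Bus Reset. 78 SCBs aborted",
    "(da0:ahc0:0:0:0): SCB 0x3e - timed out while idle, LASTPHASE == 0x1, SEQADDR == 0x153",
    "ahc0: Issued Channel A Bus Reset. 64 SCBs aborted",
]

if __name__ == "__main__":
    counts, bus_resets = summarize(SAMPLE)
    for key, n in counts.most_common():
        print(key, n)
    print(bus_resets)
```

Run over a full /var/log/messages, a tally like this would show whether one target is always the first to time out, which might help tell a single failing disk apart from a cabling or termination problem.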

----
Wataru Satoh <wataru-s@mfeed.ad.jp> / INTERNET MULTIFEED CO.
TEL: 03-3282-1040 / FAX: 03-3282-1020
Comment 2 Matt Jacob freebsd_committer freebsd_triage 2001-10-02 03:03:43 UTC
State Changed
From-To: open->feedback

Is this still a problem? The problem seems to me to actually 
be a disk h/w problem.
Comment 3 wilko freebsd_committer freebsd_triage 2001-11-24 11:41:41 UTC
State Changed
From-To: feedback->closed

Timeout polling for feedback. mjacob asked for feedback 
on Oct 1, no reply