Bug 30559

Summary: Intense SCSI tape access results in controller errors
Product: Base System
Reporter: jdc <jdc>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: Unspecified   
Hardware: Any   
OS: Any   

Description jdc 2001-09-13 22:00:00 UTC
  Under heavy SCSI tape access, our system spits out the following on the console.  Please note this applies to the ahc1 controller.

(sa0:ahc1:0:1:0): SCB 0x7 - timed out
ahc1: Dumping Card State in Data-out phase, at SEQADDR 0x6c
ACCUM = 0x0, SINDEX = 0x8, DINDEX = 0x8f, ARG_2 = 0x1
HCNT = 0x0
SCSISEQ = 0x12, SBLKCTL = 0x2
 DFCNTRL = 0x3c, DFSTATUS = 0x6d
LASTPHASE = 0x0, SCSISIGI = 0x4, SXFRCTL0 = 0xa0
SSTAT0 = 0x0, SSTAT1 = 0x2
STACK == 0x83, 0x188, 0x147, 0x0
SCB count = 20
Kernel NEXTQSCB = 9
Card NEXTQSCB = 9
QINFIFO entries: 
Waiting Queue entries: 
Disconnected Queue entries: 
QOUTFIFO entries: 
Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
Pending list: 7
Kernel Free SCB list: 14 6 8 15 16 17 18 19 0 1 2 3 4 5 13 12 11 10 
Untagged Q(1): 7 
sg[0] - Addr 0x48fc000 : Length 4096
sg[1] - Addr 0x315d000 : Length 4096
sg[2] - Addr 0x7be000 : Length 4096
sg[3] - Addr 0x3fdf000 : Length 4096
sg[4] - Addr 0xd4c0000 : Length 4096
sg[5] - Addr 0xb001000 : Length 4096
sg[6] - Addr 0x63e2000 : Length 4096
sg[7] - Addr 0x38a3000 : Length 4096
sg[8] - Addr 0x6a04000 : Length 4096
sg[9] - Addr 0x2de5000 : Length 4096
sg[10] - Addr 0x46e6000 : Length 4096
sg[11] - Addr 0x52c7000 : Length 4096
sg[12] - Addr 0x6ee8000 : Length 4096
sg[13] - Addr 0xa6c9000 : Length 4096
sg[14] - Addr 0x5d2a000 : Length 4096
sg[15] - Addr 0x3b0b000 : Length 4096
(sa0:ahc1:0:1:0): BDR message in message buffer
(sa0:ahc1:0:1:0): SCB 0x7 - timed out
ahc1: Dumping Card State in Data-out phase, at SEQADDR 0x6d
ACCUM = 0x0, SINDEX = 0x8, DINDEX = 0x8f, ARG_2 = 0x1
HCNT = 0x0
SCSISEQ = 0x12, SBLKCTL = 0x2
 DFCNTRL = 0x3c, DFSTATUS = 0x6d
LASTPHASE = 0x0, SCSISIGI = 0x14, SXFRCTL0 = 0xa0
SSTAT0 = 0x0, SSTAT1 = 0x2
STACK == 0x83, 0x188, 0x147, 0x0
SCB count = 20
Kernel NEXTQSCB = 9
Card NEXTQSCB = 9
QINFIFO entries: 
Waiting Queue entries: 
Disconnected Queue entries: 
QOUTFIFO entries: 
Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
Pending list: 7
Kernel Free SCB list: 14 6 8 15 16 17 18 19 0 1 2 3 4 5 13 12 11 10 
Untagged Q(1): 7 
sg[0] - Addr 0x48fc000 : Length 4096
sg[1] - Addr 0x315d000 : Length 4096
sg[2] - Addr 0x7be000 : Length 4096
sg[3] - Addr 0x3fdf000 : Length 4096
sg[4] - Addr 0xd4c0000 : Length 4096
sg[5] - Addr 0xb001000 : Length 4096
sg[6] - Addr 0x63e2000 : Length 4096
sg[7] - Addr 0x38a3000 : Length 4096
sg[8] - Addr 0x6a04000 : Length 4096
sg[9] - Addr 0x2de5000 : Length 4096
sg[10] - Addr 0x46e6000 : Length 4096
sg[11] - Addr 0x52c7000 : Length 4096
sg[12] - Addr 0x6ee8000 : Length 4096
sg[13] - Addr 0xa6c9000 : Length 4096
sg[14] - Addr 0x5d2a000 : Length 4096
sg[15] - Addr 0x3b0b000 : Length 4096
(sa0:ahc1:0:1:0): no longer in timeout, status = 34b
ahc1: Issued Channel A Bus Reset. 1 SCBs aborted
(sa0:ahc1:0:1:0): failed to write terminating filemark(s)
(sa0:ahc1:0:1:0): tape is now frozen- use an OFFLINE, REWIND or MTEOM command to clear this state.
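
  For anyone reading the dump above, the scatter-gather list can be summed to see how much data the timed-out SCB was carrying. The following is only a sketch: the helper name parse_sg and the regular expression are made up for illustration, and the entries are copied from the console output as-is.

```python
import re

# Scatter-gather entries copied verbatim from the console dump.
DUMP = """\
sg[0] - Addr 0x48fc000 : Length 4096
sg[1] - Addr 0x315d000 : Length 4096
sg[2] - Addr 0x7be000 : Length 4096
sg[3] - Addr 0x3fdf000 : Length 4096
sg[4] - Addr 0xd4c0000 : Length 4096
sg[5] - Addr 0xb001000 : Length 4096
sg[6] - Addr 0x63e2000 : Length 4096
sg[7] - Addr 0x38a3000 : Length 4096
sg[8] - Addr 0x6a04000 : Length 4096
sg[9] - Addr 0x2de5000 : Length 4096
sg[10] - Addr 0x46e6000 : Length 4096
sg[11] - Addr 0x52c7000 : Length 4096
sg[12] - Addr 0x6ee8000 : Length 4096
sg[13] - Addr 0xa6c9000 : Length 4096
sg[14] - Addr 0x5d2a000 : Length 4096
sg[15] - Addr 0x3b0b000 : Length 4096
"""

# Hypothetical pattern matching the "sg[N] - Addr 0x... : Length N" lines.
SG_LINE = re.compile(r"sg\[(\d+)\] - Addr (0x[0-9a-fA-F]+) : Length (\d+)")


def parse_sg(text):
    """Return a list of (index, address, length) tuples from a dump."""
    return [(int(i), int(addr, 16), int(length))
            for i, addr, length in SG_LINE.findall(text)]


entries = parse_sg(DUMP)
total = sum(length for _, _, length in entries)
page_aligned = all(addr % 4096 == 0 for _, addr, _ in entries)

print(f"{len(entries)} segments, {total} bytes, page aligned: {page_aligned}")
```

  By this count the stalled transfer was a single 64 KB request made up of sixteen page-sized segments.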

  Our SCSI bus is terminated properly.  The drives are not LVD.  Cables do not "run too close to the power supply."  Cable length does not exceed specification.  Cable quality is high -- replacing cables made no difference.  Decreasing speed from 40MB/sec to 20MB/sec made no difference.  Disabling SMP (via sysctl MIB) made no difference.

  The only thing I haven't tried is removing the drive from the library/changer system itself, and throwing it right off the main SCSI cable.

  We have no problems with the other Adaptec controller (although used for hard disks).  Both controllers use the same BIOS version.

Fix: 

Fix unknown.
How-To-Repeat:

$ tar -b 512 -vpcf /dev/nsa0 shell2.la.best.com__sd*
shell2.la.best.com__sd0a.gz
shell2.la.best.com__sd0d.gz
shell2.la.best.com__sd0e.gz
shell2.la.best.com__sd0f.gz
shell2.la.best.com__sd0g.gz
shell2.la.best.com__sd0h.gz
shell2.la.best.com__sd1d.gz
shell2.la.best.com__sd1e.gz
shell2.la.best.com__sd2d.gz
tar: can't write to /dev/nsa0 : Input/output error

  The files in question total 1023875396 bytes (~1GB).

  With a smaller blocking factor the operation gets further, but it still fails:

$ tar -b 20 -vpcf /dev/nsa0 shell2.la.best.com__sd*
shell2.la.best.com__sd0a.gz
shell2.la.best.com__sd0d.gz
shell2.la.best.com__sd0e.gz
shell2.la.best.com__sd0f.gz
shell2.la.best.com__sd0g.gz
shell2.la.best.com__sd0h.gz
shell2.la.best.com__sd1d.gz
shell2.la.best.com__sd1e.gz
shell2.la.best.com__sd2d.gz
shell2.la.best.com__sd2e.gz
tar: can't write to /dev/nsa0 : Input/output error
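
  To put the two -b values in perspective: tar's -b option is a blocking factor counted in 512-byte blocks, so the two runs above write very different record sizes. A rough sketch of the arithmetic follows. The record counts are a lower bound only, since they ignore tar headers and per-file padding, and the names record_size and min_records are invented for this note.

```python
TAR_BLOCK = 512             # bytes per tar block
TOTAL_BYTES = 1023875396    # total size of the files, from this report


def record_size(blocking_factor):
    """Bytes per write() issued by tar for a given -b value."""
    return blocking_factor * TAR_BLOCK


def min_records(total_bytes, blocking_factor):
    """Lower bound on the number of records (payload only, rounded up)."""
    size = record_size(blocking_factor)
    return -(-total_bytes // size)  # ceiling division


for b in (512, 20):
    print(f"-b {b}: {record_size(b)} bytes per record, "
          f"at least {min_records(TOTAL_BYTES, b)} records")
```

  So -b 512 issues 256 KB writes while -b 20 issues 10 KB writes.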

  Blocksize set via mt is 512 bytes:

$ mt -f /dev/sa0.ctl status
Mode      Density              Blocksize      bpi      Compression
Current:  0x31                 512 bytes      0        0x3
---------available modes---------
0:        0x31                 512 bytes      0        0x3
1:        0x31                 512 bytes      0        0x3
2:        0x31                 512 bytes      0        0x3
3:        0x31                 512 bytes      0        0x3
---------------------------------
Current Driver State: at rest.
---------------------------------
File Number: 0  Record Number: 0    Residual Count 0

  Disabling hardware compression (mt comp off) makes no difference.

  The problem is 100% repeatable.
Comment 1 Justin T. Gibbs 2001-09-17 18:16:07 UTC
>>Description:
>  Under heavy SCSI tape access, our system spits out the following on the
> console.  Please note this applies to the ahc1 controller.

This essentially tells us that the controller is waiting for the target to
REQ the last bits of data on this transfer.  Either the target failed to see
an ACK from the initiator, or the initiator failed to see a REQ from the target.
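
As an illustration of how that reading falls out of the dump, the SCSISIGI values from the two timeouts (0x4 and 0x14) can be decoded bit by bit. This is a sketch only: the bit layout below is an assumption recalled from the aic7xxx register definitions and should be checked against the driver source before relying on it, and decode_scsisigi is a made-up helper name.

```python
# ASSUMPTION: SCSISIGI bit layout as recalled from the aic7xxx register
# definitions; verify against the driver source before relying on it.
SCSISIGI_BITS = {
    0x80: "CD",
    0x40: "IO",
    0x20: "MSG",
    0x10: "ATN",
    0x08: "SEL",
    0x04: "BSY",
    0x02: "REQ",
    0x01: "ACK",
}


def decode_scsisigi(value):
    """Return the names of the bus signals set in a SCSISIGI value."""
    return [name for bit, name in SCSISIGI_BITS.items() if value & bit]


first = decode_scsisigi(0x04)    # value in the first dump
second = decode_scsisigi(0x14)   # value in the second dump

print("first timeout: ", first)
print("second timeout:", second)
```

Under that assumed layout, both dumps show BSY asserted with neither REQ nor ACK, and the second also shows ATN, which would fit the BDR message being queued between the two timeouts.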

>  Our SCSI bus is terminated properly.  The drives are not LVD.  Cables do
> not "run too close to the power supply."  Cable length does not exceed
> specification.  Cable quality is high -- replacing cables made no difference.
> Decreasing speed from 40MB/sec to 20MB/sec made no difference.  Disabling SMP
> (via sysctl MIB) made no difference.
>
>  The only thing I haven't tried is removing the drive from the library/changer
>  system itself, and throwing it right off the main SCSI cable.

Nonetheless, this is an "environmental" problem.  Perhaps your changer has
a bad power supply.  Perhaps the changer design does not allow you to run
with anything other than a very short cable (well below the maximum length
allowed by the SCSI spec), etc.

If you bootverbose, does the controller report the termination values
you expect?

--
Justin
Comment 2 jdc 2001-09-17 19:29:03 UTC
On Mon, Sep 17, 2001 at 11:16:07AM -0600, gibbs@scsiguy.com wrote:
> >>Description:
> >  Under heavy SCSI tape access, our system spits out the following on the
> > console.  Please note this applies to the ahc1 controller.
> 
> This essentially tells us that the controller is waiting for the target to
> REQ the last bits of data on this transfer.  Either the target failed to see
> an ACK from the initiator, or the initiator failed to see a REQ from the target.
> 
> >  Our SCSI bus is terminated properly.  The drives are not LVD.  Cables do
> > not "run too close to the power supply."  Cable length does not exceed
> > specification.  Cable quality is high -- replacing cables made no difference.
> > Decreasing speed from 40MB/sec to 20MB/sec made no difference.  Disabling SMP
> > (via sysctl MIB) made no difference.
> >
> >  The only thing I haven't tried is removing the drive from the library/changer
> >  system itself, and throwing it right off the main SCSI cable.
> 
> Nonetheless, this is an "environmental" problem.  Perhaps your changer has
> a bad power supply.  Perhaps the changer design does not allow you to run
> with anything other than a very short cable (well below the maximum length
> allowed by the SCSI spec), etc.
> 
> If you bootverbose, does the controller report the termination values
> you expect?

        Thanks for getting back to me.

        In a "last attempt" to figure out the problem, we flipped our
        1st and 3rd SDX-500C drives (in the library system).  Oddly
        enough, the problem went away.

        This leads me to believe the problem relates to either a flakey
        SCSI port on one of the SDX-500C drives (the one reporting errors
        in my bug report), or possibly bad cabling within the library
        system itself.

        Since swapping the drive order, we've seen fantastic results in
        speed and stability from the drives.  Hence my prognosis.

        This bug report can be closed.

-- 
| Jeremy Chadwick                                         jdc@best.net |
| Best Internet/Verio Pacific                                 ext 8251 |
| UNIX Systems Administrator                    Mountain View, CA, USA |
| Verio - "the new world of business"                                  |
Comment 3 dwmalone freebsd_committer freebsd_triage 2001-09-17 19:45:22 UTC
State Changed
From-To: open->closed

Closed as submitter is happy with Justin's explanation.