Bug 15447

Summary: Seagate ST32550 (Barracuda 2LP) may be a broken tagged queueing drive?
Product: Base System
Reporter: klh <klh>
Component: kern
Assignee: Kenneth D. Merry <ken>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 3.1-RELEASE   
Hardware: Any   
OS: Any   

Description klh 1999-12-13 02:50:01 UTC
A separate problem is causing my system to sometimes boot up with
tagged queueing enabled and sometimes not.  I've recently been stressing
the disk significantly more than usual and have encountered user-level
I/O errors that I traced back to the enabling of tagged queueing.

With tagged queueing off, everything always works.  With it on,
a heavy load of seeks will cause reads and writes to start failing.
I was able to verify this by running a test case on two ST32550s, both
on line during the same kernel boot and both identical in all respects
except that one had tagged queueing enabled and the other didn't (the
randomness of this enabling is a separate problem).  The drive without
tagging always works perfectly; the drive with tagging always fails
at random places during the test.  I verified that it is not specific
to the individual drives by doing reboots until the formerly tag-enabled
drive booted up tag-disabled -- whereupon it then performed perfectly
again.  I also verified that the filesystems were identical by doing
a complete track-by-track copy of one to the other prior to testing.

The ST32550 is not in the latest quirks table in cam_xpt.c, although
several other Seagates are.  The Barracuda 2LP was at one time fairly
popular so I'm a little surprised this hasn't shown up before, but
who knows.  Maybe most FreeBSD users have IDE drives.

Fix: 

Obviously the ST32550 can be added to the quirks table in cam_xpt.c.
I just hope this does not reflect some underlying problem with
tagged queueing support of Seagates in general.

How-To-Repeat:

My test case consisted of saving a tar of /usr/src/sys in a partition
about 1G distance away from /usr on the same drive, and attempting to
restore to /usr from that.  Tar encounters I/O errors attempting to
restore perhaps 1% to 2% of the files.  Once in a while the kernel
will get a swap error and complain, but otherwise no diagnostics are
shown on the console.
Comment 1 ken 1999-12-13 04:29:20 UTC
On Sun, Dec 12, 1999 at 06:40:28PM -0800, klh@netcom.com wrote:
> A separate problem is causing my system to sometimes boot up with
> tagged queueing enabled and sometimes not.  I've recently been stressing
> the disk significantly more than usual and have encountered user-level
> I/O errors that I traced back to the enabling of tagged queueing.
> 
> With tagged queueing off, everything always works.  With it on,
> a heavy load of seeks will cause reads and writes to start failing.
> I was able to verify this by running a test case on two ST32550s, both
> on line during the same kernel boot and both identical in all respects
> except that one had tagged queueing enabled and the other didn't (the
> randomness of this enabling is a separate problem).  The drive without
> tagging always works perfectly; the drive with tagging always fails
> at random places during the test.  I verified that it is not specific
> to the individual drives by doing reboots until the formerly tag-enabled
> drive booted up tag-disabled -- whereupon it then performed perfectly
> again.  I also verified that the filesystems were identical by doing
> a complete track-by-track copy of one to the other prior to testing.
> 
> The ST32550 is not in the latest quirks table in cam_xpt.c, although
> several other Seagates are.  The Barracuda 2LP was at one time fairly
> popular so I'm a little surprised this hasn't shown up before, but
> who knows.  Maybe most FreeBSD users have IDE drives.

No, there are many, many people using Seagate drives (including me)
successfully in FreeBSD systems.  I think this problem is most likely
peculiar to your particular system and/or drives.

> >Fix:
> Obviously the ST32550 can be added to the quirks table in cam_xpt.c.
> I just hope this does not reflect some underlying problem with
> tagged queueing support of Seagates in general.

Nope, it reflects a problem either with your drives or your cabling and
termination setup.

You need to supply some more information before we can make any sort of
guess at what is going on.  So, please send (and make sure you do a "group"
reply to this mail, so it winds up in the PR database) full 'dmesg' output
from your system, including any kernel messages that have shown up while
doing your tests.

Please don't send the output of /var/log/messages, unless it is necessary
to show problems that happened in a previous boot.  The output of dmesg(8)
is easier to read.

Also, please send a description of your cabling and termination setup.

Ken
-- 
Kenneth Merry
ken@kdm.org
Comment 2 klh 1999-12-14 12:58:33 UTC
> On Sun, Dec 12, 1999 at 06:40:28PM -0800, klh@netcom.com wrote:
> > A separate problem is causing my system to sometimes boot up with
> > tagged queueing enabled and sometimes not.  I've recently been stressing
> > the disk significantly more than usual and have encountered user-level
> > I/O errors that I traced back to the enabling of tagged queueing.
> > 
> > With tagged queueing off, everything always works.  With it on,
> > a heavy load of seeks will cause reads and writes to start failing.
> > I was able to verify this by running a test case on two ST32550s, both
> > on line during the same kernel boot and both identical in all respects
> > except that one had tagged queueing enabled and the other didn't (the
> > randomness of this enabling is a separate problem).  The drive without
> > tagging always works perfectly; the drive with tagging always fails
> > at random places during the test.  I verified that it is not specific
> > to the individual drives by doing reboots until the formerly tag-enabled
> > drive booted up tag-disabled -- whereupon it then performed perfectly
> > again.  I also verified that the filesystems were identical by doing
> > a complete track-by-track copy of one to the other prior to testing.
> > 
> > The ST32550 is not in the latest quirks table in cam_xpt.c, although
> > several other Seagates are.  The Barracuda 2LP was at one time fairly
> > popular so I'm a little surprised this hasn't shown up before, but
> > who knows.  Maybe most FreeBSD users have IDE drives.
> 
> No, there are many, many people using Seagate drives (including me)
> successfully in FreeBSD systems.  I think this problem is most likely
> peculiar to your particular system and/or drives.

Agreed.  These are surplus drives that appear to be Sun OEM but those,
also, are in wide use.  If any of the hardware is to be suspected, I
would squint at the AM53C974 or more properly its driver.  Read on.

> > >Fix:
> > Obviously the ST32550 can be added to the quirks table in cam_xpt.c.
> > I just hope this does not reflect some underlying problem with
> > tagged queueing support of Seagates in general.
> 
> Nope, it reflects a problem either with your drives or your cabling and
> termination setup.

I think the cabling and termination are highly unlikely to be a problem
in this case; I am familiar with the SCSI requirements and use
high-quality components, active termination, etc.  In any case, it's
only a fast-10 bus and there have been no other signs of trouble.
It's *only* when the kernel thinks that tagged queueing is enabled
that I start to get user-mode I/O errors, and then only when doing a
lot of long-distance seeks (i.e., when commands would start piling up).

One of the main things I've been trying to pin down is whether this
problem is specific to the ST32550s or if it happens with other drives
as well.  Finally, after several hours far into the night of reboots
with various versions and flags, I struck paydirt and enticed the
system to come up with the Fujitsu M2952 TQ-enabled.

Guess what?  It behaves just like the ST32550s, meaning that it causes
the same problems with Tagged-Queueing enabled, but works fine
otherwise.

I wondered if perhaps the problem might be a queue-full condition;
the ST32550 manual says it can handle up to 64 commands, while the
kernel default is 255 (implying it expects a QUEUE FULL response from
the drive).  So I tried adding a quirk entry limiting the Seagate to
a maxtags of 63.  No luck.  Tried 32.  Still no change.  Now I'm using
0 which disables it altogether and things are now safe.
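
For reference, the quirk experiment described above amounts to something
like the following.  This is a minimal, self-contained sketch and not the
actual kernel code: `struct quirk_entry`, `quirk_lookup()` and
`tagged_queueing_enabled()` are hypothetical names, the real quirk
structure has more fields, the default mintags of 2 is an assumption, and
POSIX fnmatch(3) stands in for the kernel's own pattern matching.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for a CAM quirk entry.
 * Each string is a shell-style pattern matched against INQUIRY data.
 */
struct quirk_entry {
	const char *vendor;	/* e.g. "SEAGATE" */
	const char *product;	/* e.g. "ST32550*" */
	const char *revision;	/* "*" matches any firmware revision */
	unsigned mintags;
	unsigned maxtags;	/* 0 disables tagged queueing */
};

static const struct quirk_entry quirk_table[] = {
	/* Values tried in this report, one at a time: 63, 32, then 0. */
	{ "SEAGATE", "ST32550*", "*", /*mintags*/0, /*maxtags*/0 },
	/* Catch-all default; must stay last.  mintags of 2 is assumed. */
	{ "*", "*", "*", /*mintags*/2, /*maxtags*/255 },
};

/* Return the first table entry whose three patterns all match. */
const struct quirk_entry *
quirk_lookup(const char *vendor, const char *product, const char *revision)
{
	size_t i, n = sizeof(quirk_table) / sizeof(quirk_table[0]);

	assert(vendor != NULL && product != NULL && revision != NULL);
	for (i = 0; i < n; i++) {
		const struct quirk_entry *q = &quirk_table[i];

		if (fnmatch(q->vendor, vendor, 0) == 0 &&
		    fnmatch(q->product, product, 0) == 0 &&
		    fnmatch(q->revision, revision, 0) == 0)
			return (q);
	}
	return (NULL);	/* not reached while the catch-all is present */
}

/* Tagged queueing is usable only if the entry allows at least one tag. */
int
tagged_queueing_enabled(const struct quirk_entry *q)
{
	return (q != NULL && q->maxtags > 0);
}
```

With the ST32550 entry set to maxtags 0, a lookup for any drive whose
product string begins with ST32550 reports tagged queueing disabled,
while every other drive falls through to the catch-all entry and keeps
the default of 255.  Changing the 0 to 63 or 32 gives the other two
settings I tried.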

> You need to supply some more information before we can make any sort of
> guess at what is going on.  So, please send (and make sure you do a "group"
> reply to this mail, so it winds up in the PR database) full 'dmesg' output
> from your system, including any kernel messages that have shown up while
> doing your tests.
> 
> Please don't send the output of /var/log/messages, unless it is necessary
> to show problems that happened in a previous boot.  The output of dmesg(8)
> is easier to read.

Done; see response to kern/15446.

> Also, please send a description of your cabling and termination setup.

#1 ---- #7 ---- #0 ---- #2 ---- #3 ---- TERM
DPES    amd0    ST32550 ST32550 M2952
(term)

#1 internal, #0,2,3 external.

Because of your statement that the ST32550 is known to work, and the
fact that my Fujitsu was failing in the same way, I don't think the
drives are at fault.  So we're left with either the controller, or
FreeBSD 3.1's support of it, or something else.  The controller seems
unlikely since Tagged Queueing is a higher-level protocol and there's
no reason to suspect either the physical bus or the link-level
protocol (otherwise many more problems would have evinced themselves).

One more data point.  I use the same kernel source base in another
system (NCR 53c895, 3 IBM drives) where all drives are TQ-enabled and
have never had problems despite much heavier usage.

I'm starting to think that whatever is causing the kernel to be
spastic about whether or not to use Tagged Queueing (cf kern/15446)
may also be responsible for its failure to operate properly.  In any
case, since the ST32550 is no longer a suspect, I suggest that this
bug (kern/15447) be closed and the above information made a follow-up
to kern/15446.

--Ken
Comment 3 ken 1999-12-15 07:13:04 UTC
On Tue, Dec 14, 1999 at 04:58:33 -0800, Ken Harrenstien wrote:
> Because of your statement that the ST32550 is known to work, and the
> fact that my Fujitsu was failing in the same way, I don't think the
> drives are at fault.  So we're left with either the controller, or
> FreeBSD 3.1's support of it, or something else.  The controller seems
> unlikely since Tagged Queueing is a higher-level protocol and there's
> no reason to suspect either the physical bus or the link-level
> protocol (otherwise many more problems would have evinced themselves).
> 
> One more data point.  I use the same kernel source base in another
> system (NCR 53c895, 3 IBM drives) where all drives are TQ-enabled and
> have never had problems despite much heavier usage.
> 
> I'm starting to think that whatever is causing the kernel to be
> spastic about whether or not to use Tagged Queueing (cf kern/15446)
> may also be responsible for its failure to operate properly.  In any
> case, since the ST32550 is no longer a suspect, I suggest that this
> bug (kern/15447) be closed and the above information made a follow-up
> to kern/15446.

Thanks for all the detailed information.  Based on PR kern/15446, I think
we can close this and assume for now that this is a problem with Tekram's
amd driver.

Ken
-- 
Kenneth Merry
ken@kdm.org
Comment 4 Kenneth D. Merry 1999-12-15 07:31:11 UTC
State Changed
From-To: open->closed

Closed at the request of the submitter.  See PR kern/15446 for additional 
followup information on this problem. 


Comment 5 Kenneth D. Merry 1999-12-15 07:31:11 UTC
Responsible Changed
From-To: freebsd-bugs->ken

I'm handling this.