| Summary: | Would be nice if the kernel could detect/report problems with SCSI tagged queueing | ||
|---|---|---|---|
| Product: | Base System | Reporter: | klh <klh> |
| Component: | kern | Assignee: | Kenneth D. Merry <ken> |
| Status: | Closed FIXED | ||
| Severity: | Affects Only Me | ||
| Priority: | Normal | ||
| Version: | 3.1-RELEASE | ||
| Hardware: | Any | ||
| OS: | Any | ||
|
Description
klh
1999-12-13 03:00:02 UTC
On Sun, Dec 12, 1999 at 06:52:28PM -0800, klh@netcom.com wrote:
> >Synopsis: Would be nice if the kernel could detect/report problems with SCSI tagged queueing

[ ... ]

> FreeBSD <hostname> 3.1-RELEASE FreeBSD 3.1-RELEASE #<n>: <buildstring> i386
> >Description:
> I just spent a nerve-wracking day backing up some drives that I thought
> were about to crash their little heads, only to finally discover that
> the problem was a failure of SCSI command tagged queueing to work
> properly.
>
> I was very surprised that even though I was getting user program I/O
> errors (from tar), the kernel gave me no feedback at all on the
> console. This was really mystifying. I don't know enough about how
> tagging works to know whether it's even feasible to detect when it's
> not working -- but the kernel is clearly getting SOME kind of error
> that it's relaying back to the user.
>
> Would it be possible to make sure that I/O errors of this nature send
> *something* to the console log? (Actually, that's a good idea for
> any sort of I/O error; I know most of them are reported OK). This
> would be a huge help tracking down potentially buggy drives; the effort
> to zero in on this possibility is otherwise very time-consuming.

I don't know why the kernel didn't print out any errors, but you'll get a
lot more information if you boot with the '-v' switch. At the boot loader
prompt, you can type:

boot kernel -v

To get the verbose boot messages. You may need to increase your message
buffer size, using the MSGBUF_SIZE kernel option (see LINT for details) to
avoid overflowing the kernel's message buffer.

Ken
--
Kenneth Merry
ken@kdm.org
> On Sun, Dec 12, 1999 at 06:52:28PM -0800, klh@netcom.com wrote:
> > >Synopsis: Would be nice if the kernel could detect/report problems with SCSI tagged queueing
>
> [ ... ]
>
> > FreeBSD <hostname> 3.1-RELEASE FreeBSD 3.1-RELEASE #<n>: <buildstring> i386
> > >Description:
> > I just spent a nerve-wracking day backing up some drives that I thought
> > were about to crash their little heads, only to finally discover that
> > the problem was a failure of SCSI command tagged queueing to work
> > properly.
> >
> > I was very surprised that even though I was getting user program I/O
> > errors (from tar), the kernel gave me no feedback at all on the
> > console. This was really mystifying. I don't know enough about how
> > tagging works to know whether it's even feasible to detect when it's
> > not working -- but the kernel is clearly getting SOME kind of error
> > that it's relaying back to the user.
> >
> > Would it be possible to make sure that I/O errors of this nature send
> > *something* to the console log? (Actually, that's a good idea for
> > any sort of I/O error; I know most of them are reported OK). This
> > would be a huge help tracking down potentially buggy drives; the effort
> > to zero in on this possibility is otherwise very time-consuming.
>
> I don't know why the kernel didn't print out any errors, but you'll get a
> lot more information if you boot with the '-v' switch. At the boot loader
> prompt, you can type:
>
> boot kernel -v
I know about (and like) -v, but it doesn't make any difference. Which
is to say, there is no error output either way. Just user-level I/O
errors except when a page transfer fails, in which case the pager code
then complains. Never the CAM subsystem.
One way of verifying this might be to test with a known broken drive
after re-enabling tagged queueing, and see what happens in the way of
error reporting.
Unfortunately I don't have any of the ones in the current table or I
could do that test. Regardless of the actual cause of the I/O errors,
it is still worrisome to me that there is no kernel log output at all.
--Ken
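For reference, the two suggestions in the first reply (a verbose boot and a larger kernel message buffer) might look roughly like the sketch below. The buffer size is an assumed example value, not one given in this PR, and the exact option spelling for a given release should be checked against LINT, as that reply says. The `dmesg` step is added here only as the usual way to read the message buffer back after boot.

```
# At the boot loader prompt (command as quoted in the reply):
boot kernel -v

# In the kernel configuration file (assumed example value; see LINT):
options MSGBUF_SIZE=40960

# After boot, page through the kernel message buffer:
dmesg | more
```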
On Tue, Dec 14, 1999 at 00:37:04 -0800, Ken Harrenstien wrote:
> > I don't know why the kernel didn't print out any errors, but you'll get a
> > lot more information if you boot with the '-v' switch. At the boot loader
> > prompt, you can type:
> >
> > boot kernel -v
>
> I know about (and like) -v, but it doesn't make any difference. Which
> is to say, there is no error output either way. Just user-level I/O
> errors except when a page transfer fails, in which case the pager code
> then complains. Never the CAM subsystem.
>
> One way of verifying this might be to test with a known broken drive
> after re-enabling tagged queueing, and see what happens in the way of
> error reporting.
>
> Unfortunately I don't have any of the ones in the current table or I
> could do that test. Regardless of the actual cause of the I/O errors,
> it is still worrisome to me that there is no kernel log output at all.

I think the reason you're not seeing any kernel diagnostics is because the
driver isn't reporting errors to the upper level code. It may be that it
is just silently failing to return some buffers or something. Since we
already know the Tekram AMD driver is broken (PR readers see PR 15446), I
suppose this isn't very surprising.

In most cases, booting with the verbose switch turned on causes more SCSI
diagnostics to be printed than would normally be printed.

Anyway, why don't we leave this PR open, and you can verify that you get
more diagnostics when you upgrade to the newer amd driver, and then we can
close it.

Ken
--
Kenneth Merry
ken@kdm.org

On Sun, Jan 09, 2000 at 13:23:34 -0800, Ken Harrenstien wrote:
> > Anyway, why don't we leave this PR open, and you can verify that you get
> > more diagnostics when you upgrade to the newer amd driver, and then we can
> > close it.
> >
> > Ken
> > --
> > Kenneth Merry
> > ken@kdm.org
>
> Just an update. I can't actually verify whether there are more
> diagnostics with the current Tekram driver, because I no longer have
> user-level I/O errors and I'm reluctant to deliberately generate them.
> The bug was more of a concern that this was evidence of a general
> loophole allowing I/O errors to be propagated up to the user without
> ever causing a kernel error message.
>
> I *could* try to arrange for a sacrificial system and do a number of
> horrible things to the bus. But if the driver folk are confident that
> they've covered all the bases, that's good enough for me.

I think things work well enough in general. In any case, we've got a
rewrite of the CAM error recovery code in the pipeline. That should change
things a little bit.

So I'll go ahead and close this PR.

Ken
--
Kenneth Merry
ken@kdm.org

State Changed
From-To: open->closed
PR submitter is satisfied that this isn't a big problem. I don't think
it's a big problem either.

Responsible Changed
From-To: freebsd-bugs->ken
I'll handle this.
> Anyway, why don't we leave this PR open, and you can verify that you get
> more diagnostics when you upgrade to the newer amd driver, and then we can
> close it.
>
> Ken
> --
> Kenneth Merry
> ken@kdm.org
Just an update. I can't actually verify whether there are more
diagnostics with the current Tekram driver, because I no longer have
user-level I/O errors and I'm reluctant to deliberately generate them.
The bug was more of a concern that this was evidence of a general
loophole allowing I/O errors to be propagated up to the user without
ever causing a kernel error message.
I *could* try to arrange for a sacrificial system and do a number of
horrible things to the bus. But if the driver folk are confident that
they've covered all the bases, that's good enough for me.
--Ken
|