Bug 141918 - [ehci] ehci_interrupt: unrecoverable error, controller halted (sparc64)
Summary: [ehci] ehci_interrupt: unrecoverable error, controller halted (sparc64)
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: sparc64
Version: 8.0-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Mark Linimon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-12-23 11:40 UTC by bel
Modified: 2023-08-23 05:05 UTC (History)

Attachments
skip_via.diff (888 bytes, patch)
2012-04-06 19:37 UTC, marius

Description bel 2009-12-23 11:40:01 UTC
After upgrading from FreeBSD 7.2-RELEASE to 8.0-RELEASE there were
problems with USB. Attempting to mount a flash drive leads to an error in
the USB EHCI driver. Here are the kernel messages:

ehci_interrupt: unrecoverable error, controller halted
cmd=0x00010020
 EHCI_CMD_ITC_1
 EHCI_CMD_ASE
sts=0x0000b000
 EHCI_STS_ASS
 EHCI_STS_REC
 EHCI_STS_HCH
ien=0x00000037
frindex=0x0000169e ctrdsegm=0x00000000 periodic=0xfee30000 async=0xfee35600
port 1 status=0x00001000
port 2 status=0x00001000
port 3 status=0x00001005
port 4 status=0x00003400
ehci_dump_isoc: isochronous dump from frame 0x053:
ITD(0xfffff800813a3900) at 0xff185900
 next=0xff384204
 status[0]=0x00000000; <>
 status[1]=0x00000000; <>
 status[2]=0x00000000; <>
 status[3]=0x00000000; <>
 status[4]=0x00000000; <>
 status[5]=0x00000000; <>
 status[6]=0x00000000; <>
 status[7]=0x00000000; <>
 bp[0]=0x00000000
  addr=0x00; endpt=0x0
 bp[1]=0x00000000
 dir=out; mpl=0x00
 bp[2..6]=0x00000000,0x00000000,0x00000000,0x00000000,0x00000000
 bp_hi=0x00000000,0x00000000,0x00000000,0x00000000,
       0x00000000,0x00000000,0x00000000
SITD(0xfffff800813b4200) at 0xff384200
 next=0xfef85502
 portaddr=0x00000000 dir=out addr=0 endpt=0x0 port=0x0 huba=0x0
 mask=0x00000000
 status=0x00000000 <> len=0x0
 back=0x00000001, bp=0x00000000,0x00000000,0x00000000,0x00000000
ehci_interrupt: blocking interrupts 0x10

Full dmesg:
http://orel.ru/~bel/usb/dmesg.txt

Kernel config:
http://orel.ru/~bel/usb/SUNC3D.txt

How-To-Repeat: Insert a USB flash drive and try to mount it read-write.
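
The sts value in the dump above can be decoded by hand. A minimal sketch in C, assuming the USBSTS bit layout of the EHCI 1.0 specification; the table and the function name decode_usbsts are illustrative and are not taken from the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* USBSTS bits, assuming the EHCI 1.0 register layout. */
struct sts_bit {
	uint32_t mask;
	const char *name;
};

static const struct sts_bit sts_bits[] = {
	{ 0x00008000, "ASS" },    /* asynchronous schedule status */
	{ 0x00004000, "PSS" },    /* periodic schedule status */
	{ 0x00002000, "REC" },    /* reclamation */
	{ 0x00001000, "HCH" },    /* host controller halted */
	{ 0x00000020, "IAA" },    /* interrupt on async advance */
	{ 0x00000010, "HSE" },    /* host system error */
	{ 0x00000008, "FLR" },    /* frame list rollover */
	{ 0x00000004, "PCD" },    /* port change detect */
	{ 0x00000002, "ERRINT" }, /* USB error interrupt */
	{ 0x00000001, "INT" },    /* USB interrupt */
};

/*
 * Write the names of all bits set in sts into buf, separated by
 * single spaces, highest bit first. Returns buf.
 */
char *
decode_usbsts(uint32_t sts, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(sts_bits) / sizeof(sts_bits[0]); i++) {
		int n;

		if ((sts & sts_bits[i].mask) == 0)
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
		    used ? " " : "", sts_bits[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop a partially written name */
			break;
		}
		used += (size_t)n;
	}
	return (buf);
}
```

For sts=0x0000b000 this yields "ASS REC HCH", which matches the flags printed by the kernel. In the same layout, the 0x10 in "blocking interrupts 0x10" is the HSE bit.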
Comment 1 Hans Petter Selasky 2009-12-23 15:25:53 UTC
Hi,

My guess for this issue is that the cache invalidate and cache flush 
instructions are not properly implemented by busdma on your platform. Please 
check that first.

--HPS
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2010-02-08 07:04:38 UTC
Responsible Changed
From-To: freebsd-usb->freebsd-sparc64

hps claims that this may be sparc64-specific.
Comment 3 marius 2010-02-08 09:07:42 UTC
As outlined here it's unlikely that this is a problem of the sparc64
bus_dmamap_sync(9):
http://lists.freebsd.org/pipermail/freebsd-sparc64/2009-December/006866.html
There are however known problems with usb(4) in this regard, see for
example:
http://svn.freebsd.org/viewvc/base?view=revision&revision=203080

Marius
Comment 4 marius 2010-09-18 09:38:10 UTC
I can't reproduce the problem using the exact same hardware (U60 and
VIA VT6202). Could you please try again with stable/8 (preferably with
r212621/sys/dev/usb/usb_busdma.c 1.13.2.5 in place) and check whether the
problem still persists? There were some changes since 8.0-RELEASE, including
the addition of a workaround for a bug of exactly that controller, which
might have fixed this.

Marius
Comment 5 bel 2010-09-22 08:00:30 UTC
The problem still persists on stable/8 with sys/dev/usb/usb_busdma.c 1.13.2.5.
Note: my U60 has two CPUs.

With Best Regards,
 Andrew Belashov
Comment 6 marius 2010-09-22 20:29:19 UTC
I've also tested with an MP machine and I doubt that this has any
impact on this problem.
Could you please try whether the following patches make any
difference for you?
http://people.freebsd.org/~kan/usb_rspro.diff
http://people.freebsd.org/~marius/usb_busdma.c_sparc64_no_hack.diff
Please also test how the machine behaves if you leave ehci(4) out
of the kernel so uhci(4) is used instead.

Marius
Comment 7 Manuel Schiller 2012-03-30 13:58:07 UTC
Dear all,

has there been any progress on this one? I'm seeing basically the same
thing with FreeBSD 9, but in my case the problem turns up when I try to
mount one of my ZFS filesystems built in a RAIDZ configuration on real
hard disks connected over USB. The disks are fine, and the ZFS filesystem
was unexported cleanly on a different machine. You can find the kernel
messages from dmesg below.

I'm willing to test patches, but it might take a while, since my machine
(Netra T1 AC200 with a 500 MHz CPU) is not quite that fast when
recompiling kernels.

Cheers,

Manuel



This is what I get in dmesg:

da10 at umass-sim7 bus 7 scbus12 target 0 lun 0
da10: <WD Ext HDD 1021 2021> Fixed Direct Access SCSI-4 device 
da10: 40.000MB/s transfers
da10: 1907727MB (3907024896 512 byte sectors: 255H 255S/T 60084C)

### start mounting the ZFS filesystem on USB disks here, then wait;
### there is a period of intense I/O and a variable amount of time you
### have to wait before the following messages appear:

ehci_interrupt: unrecoverable error, controller halted
cmd=0x00010030
 EHCI_CMD_ITC_1
 EHCI_CMD_ASE
 EHCI_CMD_PSE
sts=0x0000e004
 EHCI_STS_ASS
 EHCI_STS_PSS
 EHCI_STS_REC
 EHCI_STS_PCD
ien=0x00000037
frindex=0x00002746 ctrdsegm=0x00000000 periodic=0xc38cc000 async=0xc2623300
port 1 status=0x0000180b
port 2 status=0x00001000
port 3 status=0x00001000
port 4 status=0x00001000
ehci_dump_isoc: isochronous dump from frame 0x068:
ITD(0xfffff80021621c00) at 0xc3c79c00
 next=0xc3e78b04
 status[0]=0x00000000; <>
 status[1]=0x00000000; <>
 status[2]=0x00000000; <>
 status[3]=0x00000000; <>
 status[4]=0x00000000; <>
 status[5]=0x00000000; <>
 status[6]=0x00000000; <>
 status[7]=0x00000000; <>
 bp[0]=0x00000000
  addr=0x00; endpt=0x0
 bp[1]=0x00000000
 dir=out; mpl=0x00
 bp[2..6]=0x00000000,0x00000000,0x00000000,0x00000000,0x00000000
 bp_hi=0x00000000,0x00000000,0x00000000,0x00000000,
       0x00000000,0x00000000,0x00000000
SITD(0xfffff80021634b00) at 0xc3e78b00
 next=0xc3a78302
 portaddr=0x00000000 dir=out addr=0 endpt=0x0 port=0x0 huba=0x0
 mask=0x00000000
 status=0x00000000 <> len=0x0
 back=0x00000001, bp=0x00000000,0x00000000,0x00000000,0x00000000
ehci_interrupt: blocking interrupts 0x10
ugen4.2: <vendor 0x1a40> at usbus4 (disconnected)
uhub5: at uhub4, port 1, addr 2 (disconnected)
ugen4.3: <vendor 0x1a40> at usbus4 (disconnected)
uhub6: at uhub5, port 1, addr 3 (disconnected)
ugen4.11: <Western Digital> at usbus4 (disconnected)
umass7: at uhub6, port 3, addr 11 (disconnected)
(da10:umass-sim7:7:0:0): lost device - 1 outstanding, 1 refs
(da10:umass-sim7:7:0:0): oustanding 0

I get the following line when I run "uname -a"

FreeBSD router.hinter.bergen2 9.0-STABLE FreeBSD 9.0-STABLE #0: Wed Mar 14 17:09:45 CET 2012     root@router.hinter.bergen2:/usr/obj/usr/src/sys/GENERIC  sparc64

-- 
Homepage: http://www.hinterbergen.de/mala
OpenPGP: 0xA330353E (DSA) or 0xD87D188C (RSA)
Comment 8 marius 2012-03-30 14:30:26 UTC
Have you given the two patches mentioned earlier in the audit-trail
of this PR a try? It's probably a good idea to additionally put
"options USB_HOST_ALIGN=64" into the kernel configuration file
when testing the one for usb_transfer.c.

Marius
Comment 9 Manuel Schiller 2012-03-30 14:34:26 UTC
Hi Marius,

ok, so you did not get feedback on those patches. I'll give them a
try over the weekend (in between my LHCb analysis work), and I'll let
you know what comes out.

Manuel

Comment 10 marius 2012-03-30 14:50:19 UTC
No; so far I also couldn't reproduce this problem using the on-board
EHCI controllers in sun4u machines or the add-on cards I have. What
controller is this? If I'm not mistaken, the T1-AC200 doesn't have an
on-board EHCI controller.

Marius
Comment 11 Manuel Schiller 2012-03-30 15:04:13 UTC
Correct. These are "no-name" VIA PCI USB 2.0 controllers. In the past,
I've swapped several of these (bought more than one year apart - this is
what you got at the time when buying a USB controller in the area where I
used to live) between the two Netra machines I administer, and the problem
does not seem to be specific to the machine or to a particular controller
card. The same PCI cards run fine and without hiccups on Linux/ppc
and Linux/x86.

These are the kernel-messages:

uhci0: <VIA 83C572 USB controller> port 0xc00200-0xc0021f at device 5.0 on pci2
usbus2: <VIA 83C572 USB controller> on uhci0
uhci1: <VIA 83C572 USB controller> port 0xc00220-0xc0023f at device 5.1 on pci2
usbus3: <VIA 83C572 USB controller> on uhci1
ehci0: <VIA VT6202 USB 2.0 controller> mem 0xa000-0xa0ff at device 5.2 on pci2
ehci0: VIA-quirk applied
usbus4: EHCI version 1.0
usbus4: <VIA VT6202 USB 2.0 controller> on ehci0

The relevant part of "pciconf -l" is:

uhci0@pci0:2:5:0:       class=0x0c0300 card=0x30381106 chip=0x30381106 rev=0x62 hdr=0x00
uhci1@pci0:2:5:1:       class=0x0c0300 card=0x30381106 chip=0x30381106 rev=0x62 hdr=0x00
ehci0@pci0:2:5:2:       class=0x0c0320 card=0x31041106 chip=0x31041106 rev=0x65 hdr=0x00

(World and kernel build for the latest 9-STABLE is on its way, first a
reference kernel without patches so I know what I compare to...)

Manuel

Comment 12 Manuel Schiller 2012-04-01 08:33:03 UTC
Hi Marius,

it seems that the first patch alone
(http://people.freebsd.org/~kan/usb_rspro.diff)
does not solve the issue. I'm currently compiling a kernel with the second
patch on its own (with "options USB_HOST_ALIGN=64" in the kernel options
as you suggested), and I'll let you know what comes out.

Should I also build a kernel with both patches (assuming the one I'm
currently building does not work either)?

Manuel

Comment 13 marius 2012-04-01 11:41:24 UTC
Well, the individual patches shouldn't make things worse, except for
the second one causing more memory to be used, so I'd suggest
combining them. If in the end things actually work we still can check
which changes are needed for that.
Compared with the Linux USB code, the FreeBSD one doesn't seem to honor
some DMA constraints, and at least for the alignment it's actually
hard to follow which value is eventually used. One thing that stands
out is that for EHCI, the boundary is 4096. This is most easily fixed
by defining USB_PAGE_SIZE to 4096 in sys/dev/usb/usb_busdma.h.

Marius
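
To illustrate what a DMA boundary of 4096 means, here is a minimal sketch in C. It is not the busdma or usb(4) implementation; the function name crosses_boundary is hypothetical, and the only assumption is that a buffer must not span two boundary-sized windows:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the buffer [addr, addr + len) spans more than one
 * boundary-sized window. boundary must be a power of two.
 */
bool
crosses_boundary(uint64_t addr, uint64_t len, uint64_t boundary)
{
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	if (len == 0)
		return (false);
	/* Compare the window base of the first and of the last byte. */
	return ((addr & ~(boundary - 1)) !=
	    ((addr + len - 1) & ~(boundary - 1)));
}
```

With these definitions, an 8 KiB buffer at 0x2000 (one sparc64 base page) stays within a single window for a boundary of 8192 but crosses a window edge for a boundary of 4096, so a transfer laid out per 8 KiB page would have to be split further to satisfy a 4096 constraint.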
Comment 14 Manuel Schiller 2012-04-02 00:00:56 UTC
Ok, the second patch on its own doesn't appear to work either, so I'm
trying the combination of patches now. By the way: defining USB_PAGE_SIZE
to 4096 in sys/dev/usb/usb_busdma.h is a bad idea - the kernel panics with
a backtrace pointing into the mmu-related code. Probably has to do with
sparc64 mmu only supporting 8k pages, so I'm not terribly surprised...
Ok, I'm waiting for the next make buildkernel to finish, and I'll let
you know what comes out.

Manuel

Comment 15 Manuel Schiller 2012-04-02 09:43:14 UTC
Ok, I also tested a kernel with both patches, and the issue persists. Do
you have something else to try?

Manuel

Comment 16 Manuel Schiller 2012-04-03 09:37:14 UTC
Hi Marius,

I did a bit of code reading (/usr/src/sys/dev/usb/controller/ehci.c near
line 1494), and I realised that the "unrecoverable error" message should
only be triggered if the EHCI status register has the EHCI_STS_HCH bit
set - according to the status word dump in my log, it is not set (just
after the "unrecoverable error" message). The register dump re-reads the
status register from the hardware. Could it be that some controllers have
a glitch or something on that particular bit, and we better re-read the
status register before we conclude that the controller "really wanted to
set that bit"?

I can also see that the bit is set in the original bug report. I don't
know if that machine is just faster (and the bit has not had the time to
clear yet), or if we're talking about two different problems here...

(This observation might also indicate that a small delay loop has to be
put in before I re-read the status register - we'll have to see...)

I'm building a kernel with that modification, but I'd be interested in
a second opinion nevertheless...

Cheers,

Manuel


Comment 17 marius 2012-04-03 16:00:43 UTC
> Could it be that some controllers have
> a glitch or something on that particular bit, and we better re-read the
> status register before we conclude that the controller "really wanted to
> set that bit"?

You mean EHCI_STS_HSE? This is expected, ehci_interrupt() clears the
pending interrupt status bits before dumping the register content:
EOWRITE4(sc, EHCI_USBSTS, status);      /* acknowledge */

> I can also see that the bit is set in the original bug report. I don't
> know if that machine is just faster (and the bit has not had the time to
> clear yet), or if we're talking about two different problems here...

Probably, the other controller just sets it again after the bit is
cleared.

Marius
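
The effect of the acknowledge on the later register dump can be shown with a toy model. This is a sketch in C, not driver code: struct fake_usbsts and the helper names are hypothetical, and it assumes that bits 0-5 of USBSTS are write-1-to-clear while the upper status bits are read-only, as in the EHCI 1.0 register description:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STS_HSE      0x00000010u /* host system error */
#define STS_W1C_MASK 0x0000003fu /* bits 0-5: cleared by writing 1 */

/* Toy model of the USBSTS register; not the driver's data structure. */
struct fake_usbsts {
	uint32_t value;
};

uint32_t
fake_read(const struct fake_usbsts *r)
{
	assert(r != NULL);
	return (r->value);
}

/* Writing 1 to a write-1-to-clear bit clears it; other bits ignore writes. */
void
fake_write(struct fake_usbsts *r, uint32_t v)
{
	assert(r != NULL);
	r->value &= ~(v & STS_W1C_MASK);
}

/*
 * Follow the order of operations described above: latch the status,
 * acknowledge it, then re-read the register for the dump.
 * Returns the value the dump would show.
 */
uint32_t
ack_then_dump(struct fake_usbsts *r, uint32_t *latched)
{
	assert(latched != NULL);
	*latched = fake_read(r);
	fake_write(r, *latched);	/* acknowledge */
	return (fake_read(r));
}
```

In this model a register holding 0x0000e010 is latched with HSE set, so the error path is taken, but the re-read for the dump returns 0x0000e000. The bit that triggered the message is therefore absent from the printed sts value unless the controller raises it again.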
Comment 18 marius 2012-04-03 22:19:52 UTC
Okay, could you please give the following patch a try?
http://people.freebsd.org/~marius/usb_busdma.diff

Marius
Comment 19 Manuel Schiller 2012-04-04 13:38:25 UTC
Okay, I tried both my idea (which naturally did not work ;) and your patch
(without my patch, so I don't screw up the results). Unfortunately, your
patch does not seem to work either. From what I can tell from here at
work, the machine is stuck in a reboot loop (I guess after trying to
access the USB disks), but I'd like to be sure and watch the disk's LEDs
for a bit when I get home tonight (to make sure that the reboot loop is
really related to USB disk access).

Manuel

Comment 20 marius 2012-04-04 13:59:46 UTC
Hrm, okay, would be interesting to know what the machine actually does.
Looking at the code I found another bug; the VIA-workaround currently
doesn't do anything:
http://people.freebsd.org/~marius/ehci_pci_fix_via_quirk.diff
This might apply for the insane I/O you've reported but I'm unsure
whether it makes a difference for the HSE interrupt.

Marius
Comment 21 Manuel Schiller 2012-04-05 17:21:24 UTC
On Wed, 4 Apr 2012 14:59:46 +0200
Marius Strobl <marius@alchemy.franken.de> wrote:

> Hrm, okay, would be interesting to know what the machine actually does.
> Looking at the code I found another bug; the VIA-workaround currently
> doesn't do anything:
> http://people.freebsd.org/~marius/ehci_pci_fix_via_quirk.diff
> This might apply for the insane I/O you've reported but I'm unsure
> whether it makes a difference for the HSE interrupt.
> 
> Marius


From the looks of it (with your patch at
http://people.freebsd.org/~marius/usb_busdma.diff), the machine starts
booting, then tries to mount the filesystems residing on the USB disks,
apparently does some I/O (while still processing interrupts), and after
less than a minute locks up solid without any indication on the serial
console as to what went wrong...

I've started another build with your "VIA quirk fix" but without the
patch in the last paragraph (the machine locking up is a lot worse than
just USB not working after some heavy I/O, so I left it out for now), but
since I started the build without being properly awake this morning, I
typed "make buildworld" where I wanted to type "make buildkernel", so it's
going to take some time. Also, I'll be leaving CERN over easter, so I
won't be running tests on that machine from tomorrow morning until Monday
evening (I can compile kernels, though). Anyhow, I'll let you know what
comes out.

Cheers, thanks a lot for your effort, and, of course, a Happy Easter!

Manuel


Comment 22 Manuel Schiller 2012-04-06 08:58:42 UTC
Hi,

the "VIA quirk fix" on its own gives the familiar message in dmesg
(unrecoverable error, controller halted), so I'm compiling a kernel which
combines this fix with your latest busdma fix to try them both together;
as I said in my last e-mail, I'll probably not be testing this until
Monday night...

Manuel

Comment 23 marius 2012-04-06 19:37:26 UTC
On Fri, Apr 06, 2012 at 09:58:42AM +0200, Manuel Tobias Schiller wrote:
> [...]
> Hi,
> 
> the "VIA quirk fix" on its own gives the familiar message in dmesg
> (unrecoverable error, controller halted), so I'm compiling a kernel which

Oof, this likely means there's a more basic problem with this device.
Have you already tried to re-seat the card in case there's an electrical
problem?
Please also provide the output of `pciconf -rb ehci0@pci0:2:5:2 0:255'
from a booting kernel.
FYI, after some digging I've found the following card
ehci0@pci0:2:5:2: class=0x0c0320 card=0x31041106 chip=0x31041106 rev=0x6h0
which is a newer revision of your device and works just fine in a T1-200
including with the usb(4) fixes. The publicly available datasheets for
the VIA USB controllers are minimal and exclude errata, and Linux also
doesn't seem to use any additional workarounds, so I'm starting to run
out of ideas about what could be wrong with your revision. The only
remaining thing I can currently think of is to test whether it chokes
on the generic initialization done by the sparc64 PCI code, using the
attached patch.

> combines this fix with your latest busdma fix to try them both together;

This combination is unlikely to make a difference.

Marius
Comment 24 Manuel Schiller 2012-04-11 11:59:54 UTC
On Fri, 6 Apr 2012 20:37:26 +0200
Marius Strobl <marius@alchemy.franken.de> wrote:

> [...]


Hi Marius,

I've tried your new patch, both on its own and in conjunction with the
latest busdma and VIA quirk fixes, and I still get the same error
message...

Here's the output of pciconf you requested:

mala@router:~> sudo pciconf -rb ehci0@pci0:2:5:2 0:255
Password:
06 11 04 31 06 00 10 22  65 20 03 0c 00 16 80 00 
00 a0 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 06 11 04 31 
00 00 00 00 80 00 00 00  00 00 00 00 14 03 00 00 
00 00 0b 00 00 00 00 00  a0 20 00 29 00 00 ff ff 
00 5a 04 80 00 00 00 00  04 0b 88 88 33 00 00 00 
20 20 01 00 00 00 00 00  01 00 00 00 00 00 00 c0 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
01 00 0a 7e 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 03  00 00 00 00 00 00 00 00

This was taken after the controller stopped, on a kernel with your
latest patch, but I'd guess that doesn't matter - the EHCI driver should
not be playing with the PCI settings after initialisation...
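
The first row of the dump can be cross-checked against the earlier "pciconf -l" output. A minimal sketch in C, assuming the standard PCI configuration header layout with little-endian multi-byte fields; the helper names cfg_read16 and cfg_class are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 16 bytes of the configuration space dump quoted above. */
static const uint8_t cfg_head[16] = {
	0x06, 0x11, 0x04, 0x31, 0x06, 0x00, 0x10, 0x22,
	0x65, 0x20, 0x03, 0x0c, 0x00, 0x16, 0x80, 0x00,
};

/* Standard PCI header offsets. */
#define PCIR_VENDOR   0x00
#define PCIR_DEVICE   0x02
#define PCIR_REVID    0x08
#define PCIR_PROGIF   0x09
#define PCIR_SUBCLASS 0x0a
#define PCIR_CLASS    0x0b

/* Read a little-endian 16-bit value at offset off. */
uint16_t
cfg_read16(const uint8_t *cfg, size_t off)
{
	assert(cfg != NULL);
	return ((uint16_t)(cfg[off] | (cfg[off + 1] << 8)));
}

/* Combine class, subclass and programming interface into one value. */
uint32_t
cfg_class(const uint8_t *cfg)
{
	assert(cfg != NULL);
	return (((uint32_t)cfg[PCIR_CLASS] << 16) |
	    ((uint32_t)cfg[PCIR_SUBCLASS] << 8) | cfg[PCIR_PROGIF]);
}
```

Decoded this way, the row gives vendor 0x1106, device 0x3104, revision 0x65 and class 0x0c0320, which agrees with chip=0x31041106, rev=0x65 and class=0x0c0320 reported for ehci0 earlier in this thread.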

I've also opened the machine, and the PCI card is seated properly. I even
removed it and tried an even older VIA EHCI controller and one of the
first USB 2.0 controllers by NEC - no luck, the VIA one had trouble
recognizing devices, the NEC one did not recognize a single one I plugged
in.

Is there anything else I can try?

Manuel

Comment 25 marius 2012-04-15 13:51:05 UTC
On Wed, Apr 11, 2012 at 12:59:54PM +0200, Manuel Tobias Schiller wrote:
> [...]
> 
> Hi Marius,
> 
> I've tried your new patch, both on its own and in conjunction with the 
> latest busdma and Via quirk fixes, and I still get the same error
> message...
> 
> Here's the output of pciconf you requested:
> 
> mala@router:~> sudo pciconf -rb ehci0@pci0:2:5:2 0:255
> Password:
> 06 11 04 31 06 00 10 22  65 20 03 0c 00 16 80 00 
> 00 a0 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 06 11 04 31 
> 00 00 00 00 80 00 00 00  00 00 00 00 14 03 00 00 
> 00 00 0b 00 00 00 00 00  a0 20 00 29 00 00 ff ff 

This is rather confusing; the 0x29 in the above line means that the
VIA workaround is applied. Didn't you say that with the fix to
actually apply it, the kernel panics as soon as the device
attaches?
Apart from this, the configuration space differs from mine in 3
undocumented bytes. I'm not sure whether it's worth testing whether
these make a difference ...

> 00 5a 04 80 00 00 00 00  04 0b 88 88 33 00 00 00 
> 20 20 01 00 00 00 00 00  01 00 00 00 00 00 00 c0 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 01 00 0a 7e 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 03  00 00 00 00 00 00 00 00
> 
> This was taken after the controller stopped, on a kernel with your
> latest patch, but I'd guess that doesn't matter - the EHCI driver should
> not be playing with the PCI settings after initialisation...
> 
> I've also opened the machine, and the PCI card is seated properly. I even
> removed it and tried an even older VIA EHCI controller and one of the
> first USB 2.0 controllers by NEC - no luck, the VIA one had trouble
> recognizing devices, the NEC one did not recognize a single one I plugged
> in.
> 

This also is rather strange. Have you ever used any other type of
card in the slot, e.g. a NIC, so you can rule out that the slot is
broken somehow?
How does using the on-board USB controller work out?

Marius
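
The check behind "the 0x29 means the workaround is applied" can be sketched in C. The register offset 0x4b and the bit 0x20 are assumptions made for this illustration; the thread itself only states that the value 0x29 in that row indicates an applied workaround. The function name via_quirk_applied is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Configuration space bytes 0x40-0x4f from the dump quoted above. */
static const uint8_t cfg_row_40[16] = {
	0x00, 0x00, 0x0b, 0x00, 0x00, 0x00, 0x00, 0x00,
	0xa0, 0x20, 0x00, 0x29, 0x00, 0x00, 0xff, 0xff,
};

#define VIA_QUIRK_OFFSET 0x4b /* assumed register offset */
#define VIA_QUIRK_BIT    0x20 /* assumed workaround bit */

/*
 * Return true if the assumed workaround bit is set. row must hold the
 * 16 bytes starting at offset row_base and must cover VIA_QUIRK_OFFSET.
 */
bool
via_quirk_applied(const uint8_t *row, size_t row_base)
{
	assert(row != NULL);
	assert(row_base <= VIA_QUIRK_OFFSET &&
	    VIA_QUIRK_OFFSET < row_base + 16);
	return ((row[VIA_QUIRK_OFFSET - row_base] & VIA_QUIRK_BIT) != 0);
}
```

Under these assumptions the byte at offset 0x4b of the dump is 0x29, and the assumed bit 0x20 is set in it.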
Comment 26 Manuel Schiller 2012-04-24 13:05:47 UTC
Hi Marius,

I'm rather busy with work at the moment, so I'm not working quite as much
on troubleshooting this issue right now... (See below for answers to your
questions...)

On Sun, 15 Apr 2012 14:51:05 +0200
Marius Strobl <marius@alchemy.franken.de> wrote:
> [...]
> > Here's the output of pciconf you requested:
> > 
> > mala@router:~> sudo pciconf -rb ehci0@pci0:2:5:2 0:255
> > Password:
> > 06 11 04 31 06 00 10 22  65 20 03 0c 00 16 80 00 
> > 00 a0 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 06 11 04 31 
> > 00 00 00 00 80 00 00 00  00 00 00 00 14 03 00 00 
> > 00 00 0b 00 00 00 00 00  a0 20 00 29 00 00 ff ff 
> 
> This is rather confusing; the 0x29 in the above line means that the
> VIA workaround is applied. Didn't you say that with the fix to
> actually apply it, the kernel panics as soon as it attaches the
> device?
> Apart from this, the configuration space differs in 3 undocumented
> bytes from mine. I'm not sure whether it's worth trying whether
> these make a difference ...


Yes, this was from a kernel with your patch and the VIA workaround
applied; the kernel usually stops when I start using these devices
heavily (i.e. the automatic checks done during a ZFS mount operation).
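
For reference, the first five rows of the dump quoted above can be decoded
with a few lines of Python. This is only a minimal sketch: the field offsets
are the standard PCI configuration header layout, the helper names
(parse_dump, le16, decode) are made up for illustration, and treating bit
0x20 of the byte at offset 0x4b as the flag set by the VIA workaround is an
assumption, not something stated in this thread. The thread only establishes
that the 0x29 in that row indicates the workaround is applied.

```python
# Rows 0x00-0x4f of the `pciconf -rb ehci0@pci0:2:5:2 0:255` output quoted above.
DUMP = """\
06 11 04 31 06 00 10 22  65 20 03 0c 00 16 80 00
00 a0 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 06 11 04 31
00 00 00 00 80 00 00 00  00 00 00 00 14 03 00 00
00 00 0b 00 00 00 00 00  a0 20 00 29 00 00 ff ff
"""

# Assumption: the workaround is visible as bit 0x20 of the byte at 0x4b.
VIA_QUIRK_OFFSET = 0x4B
VIA_QUIRK_BIT = 0x20


def parse_dump(text):
    """Turn whitespace-separated hex bytes into a bytes object."""
    return bytes(int(tok, 16) for tok in text.split())


def le16(cfg, off):
    """Read a little-endian 16-bit value at byte offset `off`."""
    return int.from_bytes(cfg[off:off + 2], "little")


def decode(cfg):
    """Extract the standard header fields discussed in this thread."""
    return {
        "vendor": le16(cfg, 0x00),
        "device": le16(cfg, 0x02),
        "revision": cfg[0x08],
        # prog-if, subclass and base class occupy 0x09..0x0b
        "class": int.from_bytes(cfg[0x09:0x0C], "little"),
        "subvendor": le16(cfg, 0x2C),
        "subdevice": le16(cfg, 0x2E),
        "reg_4b": cfg[VIA_QUIRK_OFFSET],
    }


cfg = parse_dump(DUMP)
info = decode(cfg)
quirk_set = bool(info["reg_4b"] & VIA_QUIRK_BIT)

print("chip=0x%04x%04x rev=0x%02x class=0x%06x"
      % (info["device"], info["vendor"], info["revision"], info["class"]))
print("byte 0x4b=0x%02x, assumed quirk bit set: %s" % (info["reg_4b"], quirk_set))
```

The chip and class values decoded this way can be compared directly against
the `pciconf -l` style line quoted earlier for the newer revision of the card.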

> > 00 5a 04 80 00 00 00 00  04 0b 88 88 33 00 00 00 
> > 20 20 01 00 00 00 00 00  01 00 00 00 00 00 00 c0 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 01 00 0a 7e 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> > 00 00 00 00 00 00 00 03  00 00 00 00 00 00 00 00
> > 
> > This was taken after the controller stopped, on a kernel with your
> > latest patch, but I'd guess that doesn't matter - the EHCI driver
> > should not be playing with the PCI settings after initialisation...
> > 
> > I've also opened the machine, and the PCI card is seated properly. I
> > even removed it and tried an even older VIA EHCI controller and one
> > of the first USB 2.0 controllers by NEC - no luck, the VIA one had
> > trouble recognizing devices, the NEC one did not recognize a single
> > one I plugged in.
> > 
> 
> This also is rather strange. Have you ever used any other type of
> card in the slot, e.g. a NIC, so you can rule out that it's broken
> somehow?


Some four or five years ago, the slot held a quad fast ethernet NIC, and
that seemed to work fine... But a lot can happen in that time, so I
ordered a new USB controller to test with, just in case...

> How does using the on-board USB controller work out?


As far as I know, the on-board controller is USB 1.1, so I have not really
tried it because it's going to be a no-go option for disks (I'd get
similar speed getting data from some server here at CERN over my DSL
connection, and I probably wouldn't even have to administer the server
myself - if I could get them to host my data ;)... I can give the onboard
USB 1.1 controller a try, though...

I noticed something else when reconnecting everything to the server: The
USB ground seems to have a quite high (voltage) potential with respect to
the chassis of the server (and the protective ground of the wall outlet),
about 80 Volts. I've tried to locate a single faulty power supply of the
hard disks (since the server chassis is at ground levels), but when
tested individually, none of them shows this behaviour. It only happens
when I connect all eight USB disks to the USB hub which in turn connects
to the server. Apparently, this is some collective effect. Obviously, when
the USB cable from the hub is plugged into the server, this potential
difference is no longer there, and the disks are recognised.

I'm not sure what this observation means (except that I'd really prefer
linear over switching mode power supplies because of the galvanic
separation between primary and secondary sides), but I thought I'd
mention it anyway.


Manuel

> Marius
> 
> 



-- 
Homepage: http://www.hinterbergen.de/mala
OpenPGP: 0xA330353E (DSA) or 0xD87D188C (RSA)
Comment 27 Michael Moll freebsd_committer freebsd_triage 2015-11-29 20:14:07 UTC
What's the status here?
Comment 28 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:49:59 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
Comment 29 Mark Linimon freebsd_committer freebsd_triage 2023-08-23 05:05:43 UTC
I'm sorry that this PR was never addressed.

In the meantime, FreeBSD support for sparc64 was dropped.