| Summary: | 4.4-PRERELEASE crashes under heavy net I/O | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Andre Albsmeier <Andre.Albsmeier> |
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Closed FIXED | | |
| Severity: | Affects Only Me | | |
| Priority: | Normal | | |
| Version: | 4.4-PRERELEASE | | |
| Hardware: | Any | | |
| OS: | Any | | |
Description
Andre Albsmeier
2001-08-18 12:40:02 UTC
On Sat, Aug 18, 2001 at 01:33:22PM +0200, Andre Albsmeier wrote:
> I have saved the crashdumps for further examination. As you can see,
> the box crashes in whichever process it wants...
I'd guess that something is freeing an mbuf while it is still in
use. This would result in either a panic when the mbuf is corrupted
while in use or a double freeing of the mbuf. This could plausibly
explain the panics you included trace backs for.
I think Ian Dowse has some tools for examining the mbuf free lists
in kernel dumps. He did also have some patches for catching writes
to shared or free mbuf clusters, which might help figure out what's
going on here.
The only thing that doesn't tally is that this is only affecting
your laptop and not all your machines.
David.
On Sat, 18-Aug-2001 at 14:49:23 +0100, David Malone wrote:
> On Sat, Aug 18, 2001 at 01:33:22PM +0200, Andre Albsmeier wrote:
> > I have saved the crashdumps for further examination. As you can see,
> > the box crashes in whichever process it wants...
>
> I'd guess that something is freeing an mbuf while it is still in
> use. This would result in either a panic when the mbuf is corrupted
> while in use or a double freeing of the mbuf. This could plausibly
> explain the panics you included trace backs for.
>
> I think Ian Dowse has some tools for examining the mbuf free lists
> in kernel dumps. He did also have some patches for catching writes
> to shared or free mbuf clusters, which might help figure out what's
> going on here.
As I said: I am glad to try anything.
> The only thing that doesn't tally is that this is only affecting
> your laptop and not all your machines.
The first thing I thought of was a hardware problem. But the old
version ran fine as does Win98 :-).
But:
<wild and amateurish speculation on>
I am using the Intel Etherexpress 100MBit PCMCIA card with the xe
driver. The driver is somehow inefficient: When doing heavy net I/O
with it, the load gets up to 4 and higher. It has always been like
this. Maybe some changes with the mbuf handling and Warner's recent
pccard commits cause these problems under load now.
</wild and amateurish speculation on>
Sometimes I can ftp the crashdumps to another machine, sometimes not.
Hmm, I have the same box again at work. On Monday I will swap the
harddrives and see how this behaves...
-Andre

On Sat, 18-Aug-2001 at 14:49:23 +0100, David Malone wrote:
> On Sat, Aug 18, 2001 at 01:33:22PM +0200, Andre Albsmeier wrote:
> > I have saved the crashdumps for further examination. As you can see,
> > the box crashes in whichever process it wants...
>
> I'd guess that something is freeing an mbuf while it is still in
> use. This would result in either a panic when the mbuf is corrupted
> while in use or a double freeing of the mbuf. This could plausibly
> explain the panics you included trace backs for.
>
> I think Ian Dowse has some tools for examining the mbuf free lists
> in kernel dumps. He did also have some patches for catching writes
> to shared or free mbuf clusters, which might help figure out what's
> going on here.
>
> The only thing that doesn't tally is that this is only affecting
> your laptop and not all your machines.
OK, I have some news here:
1.) I put the harddisk into another machine of the same type (Siemens
Mobile 510 AGP). Same bad effects here. So we can be quite sure
it is no problem with RAM/CPU ...
2.) I tried the newest 4.4-RC1. Same problems.
Now it comes:
3.) I put the box into a docking station which got an Intel
Etherexpress PRO 100 sitting on the PCI bus. Now I can
stress the machine as much as I want... no problems.
As soon as I go back using the pccard stuff for networking
my problems are back.
It really seems to be somehow pccard related...
-Andre
[ Andre tried adding the kludge from if_sl.c that merges the tty and
net interrupt masks - not as a solution - but just to determine if
these crashes are caused by an spl problem. It seems they are. ]
In message <20010821143627.A26964@curry.mchp.siemens.de>, Andre Albsmeier writes:
>On Tue, 21-Aug-2001 at 11:52:57 +0100, Ian Dowse wrote:
>> The fact that it is only pccard cards that have problems really
>> suggests a problem there, but the crashes are so random that it
>> has to be some kind of spl problem. Previously the cards got their
>> own IRQ, so they would set it up with the right interrupt mask.
>> Now all pccard interrupt handlers are called from the pcic one,
>> so I don't think splimp() is blocking these interrupts.
>
>Yes, that seems to do it! I have my dd from /dev/zero to /dev/null via
>rsh running in both directions for about 5 minutes now... No problems
>so far.
So at the moment, when the network code calls splimp(), it does
not block NIC interrupts that come in via the pccard code. That
certainly explains all the odd crashes.
I'm not sure how to solve this problem properly, but it seems that
pcic_pci_setup_intr() needs to call bus_generic_setup_intr() to
properly update the interrupt masks. I assume there is a reason
for not just using bus_generic_setup_intr() as the pcic_pci
bus_setup_intr method?
Thanks for trying out that kludge Andre! Hopefully there's enough
information now to get it fixed properly.
Ian

On Tue, 21-Aug-2001 at 14:42:54 +0100, Ian Dowse wrote:
>
> [ Andre tried adding the kludge from if_sl.c that merges the tty and
> net interrupt masks - not as a solution - but just to determine if
> these crashes are caused by an spl problem. It seems they are. ]
>
> In message <20010821143627.A26964@curry.mchp.siemens.de>, Andre Albsmeier writes:
> >On Tue, 21-Aug-2001 at 11:52:57 +0100, Ian Dowse wrote:
> >> The fact that it is only pccard cards that have problems really
> >> suggests a problem there, but the crashes are so random that it
> >> has to be some kind of spl problem. Previously the cards got their
> >> own IRQ, so they would set it up with the right interrupt mask.
> >> Now all pccard interrupt handlers are called from the pcic one,
> >> so I don't think splimp() is blocking these interrupts.
> >
> >Yes, that seems to do it! I have my dd from /dev/zero to /dev/null via
> >rsh running in both directions for about 5 minutes now... No problems
> >so far.
>
> So at the moment, when the network code calls splimp(), it does
> not block NIC interrupts that come in via the pccard code. That
> certainly explains all the odd crashes.
>
> I'm not sure how to solve this problem properly, but it seems that
> pcic_pci_setup_intr() needs to call bus_generic_setup_intr() to
> properly update the interrupt masks. I assume there is a reason
> for not just using bus_generic_setup_intr() as the pcic_pci
> bus_setup_intr method?
>
> Thanks for trying out that kludge Andre! Hopefully there's enough
> information now to get it fixed properly.
Well I was only whining about the problem, you fixed it (or at least
isolated it) :-)
Anyway, I am looking forward to testing other suggestions. It seems
that I have an environment that triggers the problem easily.
Thanks,
-Andre
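For anyone reading the PR later, the suspected failure can be reduced to a toy user-space model. This is only a sketch under stated assumptions, not kernel code: `toy_splimp()`, `toy_splx()`, `toy_deliver_irq()`, `toy_run()` and the `net_imask` bitmask are hypothetical stand-ins for the spl machinery discussed above. A blocked interrupt is simply dropped here, whereas a real kernel would keep it pending until the mask is lowered. What it tries to show is that the NIC handler can run inside an splimp-protected section whenever the IRQ it arrives on is missing from the mask that splimp() raises.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: every name in this file is hypothetical. */
static uint32_t cpl;        /* bit n set: IRQ n is currently blocked */
static uint32_t net_imask;  /* IRQs the toy splimp() is supposed to block */
static int in_critical;     /* nonzero while "network code" is mid-update */
static int corrupted;       /* set if a handler ran during that update */

/* Raise the mask to cover all registered net IRQs; return the old mask. */
static uint32_t toy_splimp(void)
{
    uint32_t old = cpl;

    cpl |= net_imask;
    return old;
}

/* Restore a mask previously returned by toy_splimp(). */
static void toy_splx(uint32_t old)
{
    cpl = old;
}

/* Try to deliver an interrupt; returns 1 if the handler ran, 0 if blocked. */
static int toy_deliver_irq(unsigned irq)
{
    assert(irq < 32);
    if (cpl & (UINT32_C(1) << irq))
        return 0;           /* masked (a real kernel would pend it) */
    if (in_critical)
        corrupted = 1;      /* handler touched state under splimp */
    return 1;
}

/*
 * Run one splimp-protected section while the NIC interrupts on nic_irq.
 * mask is the set of IRQs registered as "net". Returns 1 if the protected
 * state was disturbed, 0 otherwise.
 */
int toy_run(uint32_t mask, unsigned nic_irq)
{
    uint32_t s;

    cpl = 0;
    net_imask = mask;
    in_critical = 0;
    corrupted = 0;

    s = toy_splimp();
    in_critical = 1;
    toy_deliver_irq(nic_irq);   /* interrupt arrives mid-section */
    in_critical = 0;
    toy_splx(s);

    return corrupted;
}
```

Calling `toy_run(0, 11)` models a NIC whose IRQ never made it into the net mask, and `toy_run(UINT32_C(1) << 11, 11)` models the same NIC with the mask set up; only the first reports a disturbed critical section.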
In message <200108211442.aa32071@salmon.maths.tcd.ie> Ian Dowse writes:
: I'm not sure how to solve this problem properly, but it seems that
: pcic_pci_setup_intr() needs to call bus_generic_setup_intr() to
: properly update the interrupt masks. I assume there is a reason
: for not just using bus_generic_setup_intr() as the pcic_pci
: bus_setup_intr method?
I wanted the ability to intercept the interrupt. I can do that easily
enough with a second function...
I'm still not sure the proper way to handle this. But if I'm
understanding you correctly, we're not blocking splnet interrupts.
But in this case, when there's only one network card, wouldn't the
net spl mask only have one bit, which is the IRQ that we're in?
Warner

In message <20010821161749.A29621@curry.mchp.siemens.de> Andre Albsmeier writes:
: Well I was only whining about the problem, you fixed it (or at least
: isolated it) :-)
Here's a simple fix you can try. I don't see how this would help, but
if it does, we know what the problem is. Ian suggested this a while
ago, and I'm still not sure how this could be a problem, but if it is
Ian's suggestions are right.
Warner
Index: pcic_pci.c
===================================================================
RCS file: /home/imp/FreeBSD/CVS/src/sys/pccard/pcic_pci.c,v
retrieving revision 1.54.2.7
diff -u -r1.54.2.7 pcic_pci.c
--- pcic_pci.c 2001/08/21 09:06:25 1.54.2.7
+++ pcic_pci.c 2001/08/21 15:38:29
@@ -522,8 +522,11 @@
 	 * interrupt handler for it. Since multifunction cards aren't
 	 * supported, this shouldn't cause a problem in practice.
 	 */
-	if (sc->cd_present && sp->intr != NULL)
+	if (sc->cd_present && sp->intr != NULL) {
+		s = splhigh();
 		sp->intr(sp->argp);
+		splx(s);
+	}
 }

 /*

In message <200108211539.f7LFdoW65851@harmony.village.org>, Warner Losh writes:
>Here's a simple fix you can try. I don't see how this would help, but
>if it does, we know what the problem is. Ian suggested this a while
>ago, and I'm still not sure how this could be a problem, but if it is
>Ian's suggestions are right.
No, I was confused when I suggested this to you :-) It is too late
when pcic_pci_intr() is called, because at that point a critical
section of some network code has already been interrupted. Once a
NIC has registered a net interrupt on IRQ X, splimp() should mask
IRQ X, but here the pcic code never changes the interrupt mask when
a NIC registers its interrupt.
e.g. consider some network code that does splimp():
	s = splimp();
	(critical stuff where no net interrupts should occur)
	<pcic interrupt occurs>
		pcic_pci_intr() called
		s = splhigh();
		(this blocks further interrupts)
		NIC ISR called (messes with splimp-protected state)
		splx(s);
		pcic_pci_intr() returns
	<pcic interrupt end>
	(network code finds its state messed up)
	splx(s);
When the pccard NIC sets up its interrupt, it needs to go through
all the mask adjustment behind bus_generic_setup_intr() to ensure
that the first splimp() call above actually blocks the pcic interrupts
too. That's why I'm suggesting using bus_generic_setup_intr() either
within or instead of pcic_pci_setup_intr().
Ian

In message <200108211713.aa61585@salmon.maths.tcd.ie> Ian Dowse writes:
: No, I was confused when I suggested this to you :-) It is too late
: when pcic_pci_intr() is called, because at that point a critical
: section of some network code has already been interrupted. Once a
: NIC has registered a net interrupt on IRQ X, splimp() should mask
: IRQ X, but here the pcic code never changes the interrupt mask when
: a NIC registers its interrupt.
Lightbulb. I completely understand now. NEWCARD has exactly the same
problem.
: When the pccard NIC sets up its interrupt, it needs to go through
: all the mask adjustment behind bus_generic_setup_intr() to ensure
: that the first splimp() call above actually blocks the pcic interrupts
: too. That's why I'm suggesting using bus_generic_setup_intr() either
: within or instead of pcic_pci_setup_intr().
I think we need to use it within pcic_pci_setup_intr so our own
function gets called and we only call the ISR if the card is in
place. My splhigh() changes have 0 chance of working.
Warner

State Changed
From-To: open->feedback
I think this has been resolved now - may I close the PR?

State Changed
From-To: feedback->closed
Fixed by a number of pccard and interrupt changes over the last week.
I think pcic_pci.c rev 1.54.2.8 solved the main problem, which was
that NIC interrupts were not set up to be blocked by splimp().
Thanks for the bug report!
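The direction Ian and Warner converge on in the thread (keep the bridge's own handler so the ISR is only called when a card is present, but let the registration reach the mask bookkeeping) could be modelled roughly as follows. This is a sketch with made-up names and should not be read as the committed change: `toy_parent`, `toy_bridge`, `toy_parent_setup_intr()`, `toy_bridge_setup_intr()`, `toy_bridge_intr()` and `toy_net_mask_covers_bridge()` are all hypothetical. `toy_parent_setup_intr()` merely stands in for whatever bus_generic_setup_intr() ends up doing to the interrupt masks, and nothing here claims to reflect what pcic_pci.c rev 1.54.2.8 actually contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every type and function here is hypothetical. */
typedef void (*toy_handler)(void *);

enum toy_class { TOY_CLASS_TTY, TOY_CLASS_NET, TOY_CLASS_COUNT };

/* Parent bus: owns the per-class interrupt masks that spl calls raise. */
struct toy_parent {
    uint32_t class_mask[TOY_CLASS_COUNT];
};

/* Bridge (the pcic's role): one IRQ shared with the card behind it. */
struct toy_bridge {
    struct toy_parent *parent;
    unsigned irq;
    int card_present;
    toy_handler child_intr;
    void *child_arg;
};

/* Stand-in for the mask adjustment done on the generic setup path. */
static void toy_parent_setup_intr(struct toy_parent *p, unsigned irq,
                                  enum toy_class c)
{
    assert(irq < 32);
    p->class_mask[c] |= UINT32_C(1) << irq;
}

/*
 * The bridge still records the child handler so it can intercept the
 * interrupt. With propagate == 0 it stops there (the behaviour blamed in
 * the thread); with propagate != 0 it also tells the parent which class
 * mask the bridge IRQ now belongs to.
 */
void toy_bridge_setup_intr(struct toy_bridge *b, enum toy_class c,
                           toy_handler h, void *arg, int propagate)
{
    b->child_intr = h;
    b->child_arg = arg;
    if (propagate)
        toy_parent_setup_intr(b->parent, b->irq, c);
}

/* Bridge's own handler: forward only while a card is inserted. */
int toy_bridge_intr(struct toy_bridge *b)
{
    if (b->card_present && b->child_intr != NULL) {
        b->child_intr(b->child_arg);
        return 1;
    }
    return 0;
}

/* Trivial child handler used by the check below. */
static void toy_count_intr(void *arg)
{
    ++*(int *)arg;
}

/* Returns 1 if the net mask covers the bridge IRQ after a NIC registers. */
int toy_net_mask_covers_bridge(int propagate)
{
    struct toy_parent parent = { { 0 } };
    struct toy_bridge bridge = { &parent, 11, 1, NULL, NULL };
    int hits = 0;

    toy_bridge_setup_intr(&bridge, TOY_CLASS_NET, toy_count_intr, &hits,
                          propagate);
    return (int)((parent.class_mask[TOY_CLASS_NET] >> bridge.irq) & 1u);
}
```

With `propagate` set to 0 the net mask never learns about the bridge IRQ, which is the situation described in the PR; with it set to 1 the mask covers that IRQ while the bridge keeps its own handler in the path.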