Bug 29845

Summary: 4.4-PRERELEASE crashes under heavy net I/O
Product: Base System
Reporter: Andre Albsmeier <Andre.Albsmeier>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.4-PRERELEASE   
Hardware: Any   
OS: Any   

Description Andre Albsmeier 2001-08-18 12:40:02 UTC
Coming from 4.3-STABLE as of May 18th I tried to test 4.4-PRERELEASE
on the above machine. I can reliably crash the box when doing heavy
net I/O, otherwise it runs fine. I replaced the Intel NIC with a 3COM
589D but this didn't help.

It runs stable under Win98 (as stable as Win98 can run :-)).

I have other machines (non laptops) here which run perfectly.
_I_ would assume it is something laptop specific -- might
be related to Warner's new pccard code. I have also reconfigured
the BIOS to change/share various interrupt combinations without
success.

I have saved the crashdumps for further examination. As you can see,
the box crashes in whichever process it wants...

********************************************************************************

root@schlappy:/var/crash>gdb -k kernel vmcore.6
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...
IdlePTD 3276800
initial pcb at 293180
panicstr: page fault
panic messages:
---
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc08befd6
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc0191cf3
stack pointer           = 0x10:0xc8367de8
frame pointer           = 0x10:0xc8367e0c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 239 (nfsiod)
interrupt mask          = net 
trap number             = 12
panic: page fault

syncing disks... 10 9 8 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
done
Uptime: 12m29s

dumping to dev #ad/1, offset 150076
dump ata0: resetting devices .. ata0-slave: timeout waiting for command=ef s=00 e=00
done
127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 
---
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
472             if (dumping++) {
(kgdb) where
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
#1  0xc013bd7f in boot (howto=256) at /src/src-4/sys/kern/kern_shutdown.c:312
#2  0xc013c165 in panic (fmt=0xc02574ec "%s") at /src/src-4/sys/kern/kern_shutdown.c:580
#3  0xc02203ef in trap_fatal (frame=0xc8367da8, eva=3230396374) at /src/src-4/sys/i386/i386/trap.c:956
#4  0xc022009d in trap_pfault (frame=0xc8367da8, usermode=0, eva=3230396374) at /src/src-4/sys/i386/i386/trap.c:849
#5  0xc021fc43 in trap (frame={tf_fs = 16, tf_es = -935985136, tf_ds = 16, tf_edi = 0, tf_esi = -1066136064, tf_ebp = -935952884, tf_isp = -935952940, tf_ebx = -1066136064, 
      tf_edx = -1066254336, tf_ecx = -1913498372, tf_eax = 1683414, tf_trapno = 12, tf_err = 2, tf_eip = -1072096013, tf_cs = 8, tf_eflags = 66051, tf_esp = -1066121172, tf_ss = -935952684})
    at /src/src-4/sys/i386/i386/trap.c:448
#6  0xc0191cf3 in nfsm_uiotombuf (uiop=0xc8367ed4, mq=0xc8367e74, siz=8192, bpos=0xc8367e78) at /src/src-4/sys/nfs/nfs_subs.c:892
#7  0xc0199bb3 in nfs_writerpc (vp=0xc83a3d00, uiop=0xc8367ed4, cred=0xc0afaf80, iomode=0xc8367ec4, must_commit=0xc8367ec0) at /src/src-4/sys/nfs/nfs_vnops.c:1183
#8  0xc018e034 in nfs_doio (bp=0xc32f8264, cr=0xc0afaf80, p=0x0) at /src/src-4/sys/nfs/nfs_bio.c:1518
#9  0xc019303a in nfssvc_iod (p=0xc75e15a0) at /src/src-4/sys/nfs/nfs_syscalls.c:970
#10 0xc0192e5c in nfssvc (p=0xc75e15a0, uap=0xc8367f80) at /src/src-4/sys/nfs/nfs_syscalls.c:166
#11 0xc02206a1 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = -1077936680, tf_esi = 0, tf_ebp = -1077936776, tf_isp = -935952428, tf_ebx = 2, tf_edx = 1, tf_ecx = 19, 
      tf_eax = 155, tf_trapno = 12, tf_err = 2, tf_eip = 134515664, tf_cs = 31, tf_eflags = 643, tf_esp = -1077936852, tf_ss = 47}) at /src/src-4/sys/i386/i386/trap.c:1155
#12 0xc02144c5 in Xint0x80_syscall ()
#13 0x8048135 in ?? ()
(kgdb) 

********************************************************************************

root@schlappy:/var/crash>gdb -k kernel vmcore.7
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...
IdlePTD 3276800
initial pcb at 293180
panicstr: sf_buf_ref: referencing a free sf_buf
panic messages:
---
panic: sf_buf_ref: referencing a free sf_buf

syncing disks... 3 3 
done
Uptime: 3m31s

dumping to dev #ad/1, offset 150076
dump ata0: resetting devices .. ata0-slave: timeout waiting for command=ef s=00 e=00
done
127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 
---
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
472             if (dumping++) {
(kgdb) where
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
#1  0xc013bd7f in boot (howto=256) at /src/src-4/sys/kern/kern_shutdown.c:312
#2  0xc013c165 in panic (fmt=0xc0239060 "sf_buf_ref: referencing a free sf_buf") at /src/src-4/sys/kern/kern_shutdown.c:580
#3  0xc015e6ef in sf_buf_ref (addr=0xc78ee000 <Address 0xc78ee000 out of bounds>, size=4096) at /src/src-4/sys/kern/uipc_syscalls.c:1469
#4  0xc015792e in m_copym (m=0xc0741f00, off0=4344, len=1448, wait=1) at /src/src-4/sys/kern/uipc_mbuf.c:713
#5  0xc0188a58 in tcp_output (tp=0xc79270c0) at /src/src-4/sys/netinet/tcp_output.c:592
#6  0xc0187bc1 in tcp_input (m=0xc0741000, off0=20, proto=6) at /src/src-4/sys/netinet/tcp_input.c:2316
#7  0xc01829f5 in ip_input (m=0xc0741000) at /src/src-4/sys/netinet/ip_input.c:820
#8  0xc0182a53 in ipintr () at /src/src-4/sys/netinet/ip_input.c:848
#9  0xc02158b5 in swi_net_next ()
#10 0xc02206a1 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 134152192, tf_esi = 0, tf_ebp = -1077940616, tf_isp = -935239724, tf_ebx = 0, 
      tf_edx = 6, tf_ecx = 134152192, tf_eax = 336, tf_trapno = 22, tf_err = 2, tf_eip = 672045052, tf_cs = 31, tf_eflags = 518, tf_esp = -1077940692, tf_ss = 47})
    at /src/src-4/sys/i386/i386/trap.c:1155
#11 0xc02144c5 in Xint0x80_syscall ()
#12 0x804c8c2 in ?? ()
#13 0x8050a7a in ?? ()
#14 0x804b1d5 in ?? ()
#15 0x804a76d in ?? ()
(kgdb) 

********************************************************************************

root@schlappy:/var/crash>gdb -k kernel vmcore.8 
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...
IdlePTD 3276800
initial pcb at 293180
panicstr: page fault
panic messages:
---
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x371f4
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc0157545
stack pointer           = 0x10:0xc843bda0
frame pointer           = 0x10:0xc843bdac
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 416 (xeyes)
interrupt mask          = net 
trap number             = 12
panic: page fault

syncing disks... 7 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
done
Uptime: 6m36s

dumping to dev #ad/1, offset 150076
dump ata0: resetting devices .. ata0-slave: timeout waiting for command=ef s=00 e=00
done
127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 
---
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
472             if (dumping++) {
(kgdb) where
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
#1  0xc013bd7f in boot (howto=256) at /src/src-4/sys/kern/kern_shutdown.c:312
#2  0xc013c165 in panic (fmt=0xc02574ec "%s") at /src/src-4/sys/kern/kern_shutdown.c:580
#3  0xc02203ef in trap_fatal (frame=0xc843bd60, eva=225780) at /src/src-4/sys/i386/i386/trap.c:956
#4  0xc022009d in trap_pfault (frame=0xc843bd60, usermode=0, eva=225780) at /src/src-4/sys/i386/i386/trap.c:849
#5  0xc021fc43 in trap (frame={tf_fs = -1063387120, tf_es = 16, tf_ds = 16, tf_edi = 6684672, tf_esi = 225764, tf_ebp = -935084628, tf_isp = -935084660, 
      tf_ebx = 225764, tf_edx = 6684672, tf_ecx = 28, tf_eax = 6684672, tf_trapno = 12, tf_err = 0, tf_eip = -1072335547, tf_cs = 8, tf_eflags = 66054, 
      tf_esp = -1066149888, tf_ss = -1066149888}) at /src/src-4/sys/i386/i386/trap.c:448
#6  0xc0157545 in m_freem (m=0x371e4) at /src/src-4/sys/kern/uipc_mbuf.c:618
#7  0xc015746d in m_free (m=0xc073d800) at /src/src-4/sys/kern/uipc_mbuf.c:605
#8  0xc015c638 in sbcompress (sb=0xc78f388c, m=0xc073aa00, n=0x0) at /src/src-4/sys/kern/uipc_socket2.c:718
#9  0xc015c257 in sbappend (sb=0xc78f388c, m=0xc073aa00) at /src/src-4/sys/kern/uipc_socket2.c:506
#10 0xc015f5c9 in uipc_send (so=0xc78f3900, flags=0, m=0xc073aa00, nam=0x0, control=0x0, p=0xc8415f60) at /src/src-4/sys/kern/uipc_usrreq.c:344
#11 0xc0159f53 in sosend (so=0xc78f3900, addr=0x0, uio=0xc843bed8, top=0xc073aa00, control=0x0, flags=0, p=0xc8415f60) at /src/src-4/sys/kern/uipc_socket.c:611
#12 0xc014dbe8 in soo_write (fp=0xc0a2b480, uio=0xc843bed8, cred=0xc0afc080, flags=0, p=0xc8415f60) at /src/src-4/sys/kern/sys_socket.c:81
#13 0xc014a74d in dofilewrite (p=0xc8415f60, fp=0xc0a2b480, fd=3, buf=0x8054800, nbyte=8, offset=-1, flags=0) at /src/src-4/sys/sys/file.h:162
#14 0xc014a606 in write (p=0xc8415f60, uap=0xc843bf80) at /src/src-4/sys/kern/sys_generic.c:329
#15 0xc02206a1 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = -1078001617, tf_edi = 134561792, tf_esi = 134563840, tf_ebp = -1077938904, tf_isp = -935084076, 
      tf_ebx = 672709204, tf_edx = 672694464, tf_ecx = 29360136, tf_eax = 4, tf_trapno = 7, tf_err = 2, tf_eip = 673326260, tf_cs = 31, tf_eflags = 647, 
      tf_esp = -1077938948, tf_ss = 47}) at /src/src-4/sys/i386/i386/trap.c:1155
#16 0xc02144c5 in Xint0x80_syscall ()
#17 0x281388d3 in ?? ()
#18 0x2811d84e in ?? ()
#19 0x2811ec0a in ?? ()
#20 0x28115d36 in ?? ()
#21 0x8049833 in ?? ()
#22 0x2809bb7c in ?? ()
#23 0x2809bdf1 in ?? ()
#24 0x280919be in ?? ()
#25 0x80491ac in ?? ()
#26 0x8048f59 in ?? ()
(kgdb) 

********************************************************************************

root@schlappy:/var/crash>gdb -k kernel vmcore.9 
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...
IdlePTD 3276800
initial pcb at 293180
panicstr: page fault
panic messages:
---
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x0
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0x0
stack pointer           = 0x10:0xc833bc8c
frame pointer           = 0x10:0xc833bd1c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 3
current process         = 357 (Xserver)
interrupt mask          = none
trap number             = 12
panic: page fault

syncing disks... 5 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
done
Uptime: 4m31s

dumping to dev #ad/1, offset 150076
dump ata0: resetting devices .. ata0-slave: timeout waiting for command=ef s=00 e=00
done
127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 
---
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
472             if (dumping++) {
(kgdb) where
#0  dumpsys () at /src/src-4/sys/kern/kern_shutdown.c:472
#1  0xc013bd7f in boot (howto=256) at /src/src-4/sys/kern/kern_shutdown.c:312
#2  0xc013c165 in panic (fmt=0xc02574ec "%s") at /src/src-4/sys/kern/kern_shutdown.c:580
#3  0xc02203ef in trap_fatal (frame=0xc833bc4c, eva=0) at /src/src-4/sys/i386/i386/trap.c:956
#4  0xc022009d in trap_pfault (frame=0xc833bc4c, usermode=0, eva=0) at /src/src-4/sys/i386/i386/trap.c:849
#5  0xc021fc43 in trap (frame={tf_fs = -1071054832, tf_es = -936181744, tf_ds = -936181744, tf_edi = -1071020724, tf_esi = -946937856, tf_ebp = -936133348, 
      tf_isp = -936133512, tf_ebx = -950131232, tf_edx = -936140800, tf_ecx = -936133432, tf_eax = 14, tf_trapno = 12, tf_err = 0, tf_eip = 0, tf_cs = 8, 
      tf_eflags = 78466, tf_esp = -65536, tf_ss = 0}) at /src/src-4/sys/i386/i386/trap.c:448
#6  0x0 in ?? ()
(kgdb)

Fix: 

Unknown. I have the dumps for further investigation. I can even
upload them somewhere if needed. I can also do tests if desired.
How-To-Repeat: 
Not difficult here, e.g., by pulling the crashdumps from the box onto
another machine with ftp :-)
Comment 1 dwmalone 2001-08-18 14:49:23 UTC
On Sat, Aug 18, 2001 at 01:33:22PM +0200, Andre Albsmeier wrote:
> I have saved the crashdumps for further examination. As you can see,
> the box crashes in whichever process it wants...

I'd guess that something is freeing an mbuf while it is still in
use.  This would result in either a panic when the mbuf is corrupted
while in use or a double freeing of the mbuf. This could plausibly
explain the panics you included tracebacks for.
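
A minimal sketch of that failure mode, assuming a toy two-entry buffer
pool rather than the real mbuf allocator. Every name in it (pool_init,
buf_alloc, buf_free and the two demo functions) is hypothetical and
exists only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy fixed-size buffer pool with a free list; NOT the mbuf allocator. */
struct buf {
	struct buf *next_free;
	int is_free;
	int payload;
};

static struct buf pool[2];
static struct buf *free_list;

static void
pool_init(void)
{
	int i;

	free_list = NULL;
	for (i = 0; i < 2; i++) {
		pool[i].is_free = 1;
		pool[i].payload = 0;
		pool[i].next_free = free_list;
		free_list = &pool[i];
	}
}

static struct buf *
buf_alloc(void)
{
	struct buf *b = free_list;

	if (b != NULL) {
		free_list = b->next_free;
		b->is_free = 0;
	}
	return (b);
}

/* Returns 0 on success, -1 when the buffer is already on the free list. */
static int
buf_free(struct buf *b)
{
	assert(b != NULL);
	if (b->is_free)
		return (-1);
	b->is_free = 1;
	b->next_free = free_list;
	free_list = b;
	return (0);
}

/* Owner A keeps writing after the free; returns what owner B reads back. */
static int
premature_free_demo(void)
{
	struct buf *a, *b;

	pool_init();
	a = buf_alloc();
	a->payload = 1;
	(void)buf_free(a);	/* bug: A still holds the pointer */
	b = buf_alloc();	/* the same buffer is handed out again */
	b->payload = 2;
	a->payload = 99;	/* A's stale write lands in B's data */
	return (b->payload);
}

/* Freeing the same buffer twice; returns the result of the second free. */
static int
double_free_demo(void)
{
	struct buf *a;

	pool_init();
	a = buf_alloc();
	(void)buf_free(a);
	return (buf_free(a));
}
```

Called from a small main(), premature_free_demo() shows the second
owner reading back the stale writer's value, and double_free_demo()
shows the second free being rejected, which is roughly the kind of
check behind a "referencing a free ..." panic.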

I think Ian Dowse has some tools for examining the mbuf free lists
in kernel dumps. He did also have some patches for catching writes
to shared or free mbuf clusters, which might help figure out what's
going on here.

The only thing that doesn't tally is that this is only affecting
your laptop and not all your machines.

	David.
Comment 2 Andre Albsmeier 2001-08-18 18:55:37 UTC
On Sat, 18-Aug-2001 at 14:49:23 +0100, David Malone wrote:
> On Sat, Aug 18, 2001 at 01:33:22PM +0200, Andre Albsmeier wrote:
> > I have saved the crashdumps for further examination. As you can see,
> > the box crashes in whichever process it wants...
> 
> I'd guess that something is freeing an mbuf while it is still in
> use.  This would result in either a panic when the mbuf is corrupted
> while in use or a double freeing of the mbuf. This could plausibly
> explain the panics you included tracebacks for.
> 
> I think Ian Dowse has some tools for examining the mbuf free lists
> in kernel dumps. He did also have some patches for catching writes
> to shared or free mbuf clusters, which might help figure out what's
> going on here.

As I said: I am glad to try anything.

> The only thing that doesn't tally is that this is only affecting
> your laptop and not all your machines.

The first thing I thought of was a hardware problem. But the old
version ran fine as does Win98 :-). But:

<wild and amateurish speculation on>
I am using the Intel Etherexpress 100MBit PCMCIA card with the xe
driver. The driver is somewhat inefficient: when doing heavy net I/O
with it, the load gets up to 4 and higher. It has always been like
this. Maybe some changes to the mbuf handling and Warner's recent
pccard commits cause these problems under load now.
</wild and amateurish speculation on>

Sometimes I can ftp the crashdumps to another machine, sometimes not.
Hmm, I have the same box again at work. On Monday I will swap the
hard drives and see how this behaves...

	-Andre
Comment 3 Andre Albsmeier 2001-08-21 08:43:14 UTC
On Sat, 18-Aug-2001 at 14:49:23 +0100, David Malone wrote:
> On Sat, Aug 18, 2001 at 01:33:22PM +0200, Andre Albsmeier wrote:
> > I have saved the crashdumps for further examination. As you can see,
> > the box crashes in whichever process it wants...
> 
> I'd guess that something is freeing an mbuf while it is still in
> use.  This would result in either a panic when the mbuf is corrupted
> while in use or a double freeing of the mbuf. This could plausibly
> explain the panics you included tracebacks for.
> 
> I think Ian Dowse has some tools for examining the mbuf free lists
> in kernel dumps. He did also have some patches for catching writes
> to shared or free mbuf clusters, which might help figure out what's
> going on here.
> 
> The only thing that doesn't tally is that this is only affecting
> your laptop and not all your machines.

OK, I have some news here:

1.) I put the harddisk into another machine of the same type (Siemens
    Mobile 510 AGP). Same bad effects here. So we can be quite sure
    it is no problem with RAM/CPU ...

2.) I tried the newest 4.4-RC1. Same problems.


Now the interesting part:

3.) I put the box into a docking station which has an Intel 
    Etherexpress PRO 100 sitting on the PCI bus. Now I can
    stress the machine as much as I want... no problems.
    As soon as I go back using the pccard stuff for networking
    my problems are back.


It really seems to be somehow pccard related...

	-Andre
Comment 4 iedowse 2001-08-21 14:42:54 UTC
[ Andre tried adding the kludge from if_sl.c that merges the tty and
net interrupt masks - not as a solution - but just to determine if
these crashes are caused by an spl problem. It seems they are. ]

In message <20010821143627.A26964@curry.mchp.siemens.de>, Andre Albsmeier writes:
>On Tue, 21-Aug-2001 at 11:52:57 +0100, Ian Dowse wrote:
>> The fact that it is only pccard cards that have problems really
>> suggests a problem there, but the crashes are so random that it
>> has to be some kind of spl problem. Previously the cards got their
>> own IRQ, so they would set it up with the right interrupt mask.
>> Now all pccard interrupt handlers are called from the pcic one,
>> so I don't think splimp() is blocking these interrupts.
>
>Yes, that seems to do it! I have my dd from /dev/zero to /dev/null via
>rsh running in both directions for about 5 minutes now... No problems
>so far.

So at the moment, when the network code calls splimp(), it does
not block NIC interrupts that come in via the pccard code. That
certainly explains all the odd crashes.

I'm not sure how to solve this problem properly, but it seems that
pcic_pci_setup_intr() needs to call bus_generic_setup_intr() to
properly update the interrupt masks. I assume there is a reason
for not just using bus_generic_setup_intr() as the pcic_pci
bus_setup_intr method?
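
As a rough illustration of what "updating the interrupt masks" means
here, the following is a toy model, not the FreeBSD 4.x code. The
names net_imask, cpl, model_splimp, model_splx and both register_*
functions are assumptions made up for the sketch. The point it tries
to show is only that a raise to the network level blocks an IRQ if,
and only if, registration recorded that IRQ in the network mask:

```c
#include <assert.h>

/* Toy model of spl-style masking: one bit per IRQ line. */
typedef unsigned int intrmask_t;

static intrmask_t net_imask;	/* IRQ bits that the network level should block */
static intrmask_t cpl;		/* IRQ bits blocked right now */

/* Registration path that also records the IRQ in the network mask. */
static void
register_net_irq_with_mask(int irq)
{
	assert(irq >= 0 && irq < 32);
	net_imask |= 1u << irq;
}

/* Registration path that only remembers a handler; the mask is untouched. */
static void
register_net_irq_no_mask(int irq)
{
	assert(irq >= 0 && irq < 32);
}

/* Raise to the network level; returns the previous level for model_splx(). */
static intrmask_t
model_splimp(void)
{
	intrmask_t old = cpl;

	cpl |= net_imask;
	return (old);
}

static void
model_splx(intrmask_t old)
{
	cpl = old;
}

/* Returns 1 if the IRQ is blocked inside a model_splimp() section. */
static int
blocked_under_splimp(int irq)
{
	intrmask_t s;
	int blocked;

	assert(irq >= 0 && irq < 32);
	s = model_splimp();
	blocked = (cpl & (1u << irq)) != 0;
	model_splx(s);
	return (blocked);
}
```

In this model an IRQ registered through register_net_irq_no_mask()
stays deliverable inside the protected section, while one registered
through register_net_irq_with_mask() is held off until model_splx().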

Thanks for trying out that kludge Andre! Hopefully there's enough
information now to get it fixed properly.

Ian
Comment 5 Andre Albsmeier 2001-08-21 15:17:49 UTC
On Tue, 21-Aug-2001 at 14:42:54 +0100, Ian Dowse wrote:
> 
> [ Andre tried adding the kludge from if_sl.c that merges the tty and
> net interrupt masks - not as a solution - but just to determine if
> these crashes are caused by an spl problem. It seems they are. ]
> 
> In message <20010821143627.A26964@curry.mchp.siemens.de>, Andre Albsmeier writes:
> >On Tue, 21-Aug-2001 at 11:52:57 +0100, Ian Dowse wrote:
> >> The fact that it is only pccard cards that have problems really
> >> suggests a problem there, but the crashes are so random that it
> >> has to be some kind of spl problem. Previously the cards got their
> >> own IRQ, so they would set it up with the right interrupt mask.
> >> Now all pccard interrupt handlers are called from the pcic one,
> >> so I don't think splimp() is blocking these interrupts.
> >
> >Yes, that seems to do it! I have my dd from /dev/zero to /dev/null via
> >rsh running in both directions for about 5 minutes now... No problems
> >so far.
> 
> So at the moment, when the network code calls splimp(), it does
> not block NIC interrupts that come in via the pccard code. That
> certainly explains all the odd crashes.
> 
> I'm not sure how to solve this problem properly, but it seems that
> pcic_pci_setup_intr() needs to call bus_generic_setup_intr() to
> properly update the interrupt masks. I assume there is a reason
> for not just using bus_generic_setup_intr() as the pcic_pci
> bus_setup_intr method?
> 
> Thanks for trying out that kludge Andre! Hopefully there's enough
> information now to get it fixed properly.

Well I was only whining about the problem, you fixed it (or at least
isolated it) :-)

Anyway, I am looking forward to testing other suggestions. It seems
that I have an environment that triggers the problem easily.

Thanks,

	-Andre
Comment 6 Warner Losh 2001-08-21 16:35:51 UTC
In message <200108211442.aa32071@salmon.maths.tcd.ie> Ian Dowse writes:
: I'm not sure how to solve this problem properly, but it seems that
: pcic_pci_setup_intr() needs to call bus_generic_setup_intr() to
: properly update the interrupt masks. I assume there is a reason
: for not just using bus_generic_setup_intr() as the pcic_pci
: bus_setup_intr method?

I wanted the ability to intercept the interrupt.  I can do that easily 
enough with a second function...  I'm still not sure the proper way to 
handle this.  But if I'm understanding you correctly, we're not
blocking splnet interrupts.  But in this case, when there's only one
network card, wouldn't the net spl mask only have one bit, which is
the IRQ that we're in?

Warner
Comment 7 Warner Losh 2001-08-21 16:39:50 UTC
In message <20010821161749.A29621@curry.mchp.siemens.de> Andre Albsmeier writes:
: Well I was only whining about the problem, you fixed it (or at least
: isolated it) :-)

Here's a simple fix you can try.  I don't see how this would help, but 
if it does, we know what the problem is.  Ian suggested this a while
ago, and I'm still not sure how this could be a problem, but if it is,
Ian's suggestions are right.

Warner

Index: pcic_pci.c
===================================================================
RCS file: /home/imp/FreeBSD/CVS/src/sys/pccard/pcic_pci.c,v
retrieving revision 1.54.2.7
diff -u -r1.54.2.7 pcic_pci.c
--- pcic_pci.c	2001/08/21 09:06:25	1.54.2.7
+++ pcic_pci.c	2001/08/21 15:38:29
@@ -522,8 +522,11 @@
 	 * interrupt handler for it.  Since multifunction cards aren't
 	 * supported, this shouldn't cause a problem in practice.
 	 */
-	if (sc->cd_present && sp->intr != NULL)
+	if (sc->cd_present && sp->intr != NULL) {
+		s = splhigh();
 		sp->intr(sp->argp);
+		splx(s);
+	}
 }
 
 /*
Comment 8 iedowse 2001-08-21 17:13:26 UTC
In message <200108211539.f7LFdoW65851@harmony.village.org>, Warner Losh writes:
>Here's a simple fix you can try.  I don't see how this would help, but 
>if it does, we know what the problem is.  Ian suggested this a while
>ago, and I'm still not sure how this could be a problem, but if it is,
>Ian's suggestions are right.

No, I was confused when I suggested this to you :-) It is too late
when pcic_pci_intr() is called, because at that point a critical
section of some network code has already been interrupted. Once a
NIC has registered a net interrupt on IRQ X, splimp() should mask
IRQ X, but here the pcic code never changes the interrupt mask when
a NIC registers its interrupt.

e.g. consider some network code that does splimp():

	s = splimp();

	(critical stuff where no net interrupts should occur)

	<pcic interrupt occurs>
		pcic_pci_intr() called
			s = splhigh();
			(this blocks further interrupts)

			NIC ISR called
				(messes with splimp-protected state)

			splx(s);
		pcic_pci_intr() returns
	<pcic interrupt end>

	(network code finds its state messed up)

	splx(s);

When the pccard NIC sets up its interrupt, it needs to go through
all the mask adjustment behind bus_generic_setup_intr() to ensure
that the first splimp() call above actually blocks the pcic interrupts
too. That's why I'm suggesting using bus_generic_setup_intr() either
within or instead of pcic_pci_setup_intr().
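
The timeline above can be modelled in a few lines of single-threaded
C. This is only a sketch under stated assumptions: the queue fields,
NIC_IRQ, raise_irq and the model_* functions are hypothetical, and the
"interrupt" is just a function call made in the middle of a two-step
update. It is meant to show why the ISR sees half-updated state when
its IRQ bit is missing from the network mask, and why it is deferred
until the level is lowered when the bit is present:

```c
#include <assert.h>

/* Toy single-threaded model; every name here is hypothetical. */
typedef unsigned int intrmask_t;

#define NIC_IRQ	11

static intrmask_t net_imask, cpl, pending;
static int q_len, q_bytes;	/* invariant when idle: q_bytes == 100 * q_len */
static int isr_saw_torn_state;

/* The NIC handler looks at the same state the network code protects. */
static void
nic_isr(void)
{
	if (q_bytes != 100 * q_len)
		isr_saw_torn_state = 1;
}

/* Deliver the IRQ now if it is unmasked, otherwise leave it pending. */
static void
raise_irq(int irq)
{
	assert(irq >= 0 && irq < 32);
	if (cpl & (1u << irq))
		pending |= 1u << irq;
	else
		nic_isr();
}

static intrmask_t
model_splimp(void)
{
	intrmask_t old = cpl;

	cpl |= net_imask;
	return (old);
}

/* Lower the level and run the handler if its IRQ was held off. */
static void
model_splx(intrmask_t old)
{
	cpl = old;
	if (pending & ~cpl & (1u << NIC_IRQ)) {
		pending &= ~(1u << NIC_IRQ);
		nic_isr();
	}
}

/*
 * Enqueue one 100-byte packet while the NIC interrupts mid-update.
 * Returns 1 if the handler observed inconsistent queue state.
 */
static int
enqueue_with_interrupt(int nic_irq_in_net_mask)
{
	intrmask_t s;

	net_imask = nic_irq_in_net_mask ? (1u << NIC_IRQ) : 0;
	cpl = pending = 0;
	q_len = q_bytes = 0;
	isr_saw_torn_state = 0;

	s = model_splimp();
	q_len++;		/* first half of the update */
	raise_irq(NIC_IRQ);	/* hardware interrupt arrives here */
	q_bytes += 100;		/* second half of the update */
	model_splx(s);

	return (isr_saw_torn_state);
}
```

With the IRQ bit absent from the mask, the handler runs between the
two halves of the update; with the bit present, it only runs from
model_splx(), after the update is complete.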

Ian
Comment 9 Warner Losh 2001-08-21 17:33:13 UTC
In message <200108211713.aa61585@salmon.maths.tcd.ie> Ian Dowse writes:
: No, I was confused when I suggested this to you :-) It is too late
: when pcic_pci_intr() is called, because at that point a critical
: section of some network code has already been interrupted. Once a
: NIC has registered a net interrupt on IRQ X, splimp() should mask
: IRQ X, but here the pcic code never changes the interrupt mask when
: a NIC registers its interrupt.

Lightbulb.  I completely understand now.  NEWCARD has exactly the same 
problem.

: When the pccard NIC sets up its interrupt, it needs to go through
: all the mask adjustment behind bus_generic_setup_intr() to ensure
: that the first splimp() call above actually blocks the pcic interrupts
: too. That's why I'm suggesting using bus_generic_setup_intr() either
: within or instead of pcic_pci_setup_intr().

I think we need to use it within pcic_pci_setup_intr so our own
function gets called and we only call the ISR if the card is in
place.

My splhigh() changes have 0 chance of working.

Warner
Comment 10 iedowse freebsd_committer freebsd_triage 2001-08-26 12:08:22 UTC
State Changed
From-To: open->feedback


I think this has been resolved now - may I close the PR?
Comment 11 iedowse freebsd_committer freebsd_triage 2001-08-26 15:36:41 UTC
State Changed
From-To: feedback->closed


Fixed by a number of pccard and interrupt changes over the last 
week. I think pcic_pci.c rev 1.54.2.8 solved the main problem, 
which was that NIC interrupts were not set up to be blocked by 
splimp(). Thanks for the bug report!