Bug 28844

Summary: Router/nameserver system crashes 2-3 times monthly
Product: Base System
Reporter: Rudi Mathijssen <r.mathijssen>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.2-RELEASE   
Hardware: Any   
OS: Any   

Description Rudi Mathijssen 2001-07-09 21:50:01 UTC
The system, a Compaq ProLiant ML350 (600MHz Pentium III, 128MB RAM), has six 100Mbit NICs: fxp0 (Intel Pro 10/100B/100+ Ethernet), de0..3 (SMC 9332BDT), and xl0 (3Com 3c905B-TX Fast Etherlink XL). Each NIC is connected to a 3Com 100Mbit hub (half duplex!); each hub is in turn wired to a second ProLiant (traffic analyser and also cold standby for the gateway) and to a 3Com switch model 3300, to which the NT, SCO, and Tru64 servers and PC hubs are connected. Three of the six LANs are connected to a Cisco WAN router.
Disk configuration: RAID-1 setup with two 9.1GB WU2 SCSI disks on a Compaq Smart Array 3200.
The custom kernel is built with ipfw support:
options         IPFIREWALL
options         IPFIREWALL_DEFAULT_TO_ACCEPT
options         DUMMYNET
but firewall_enable is currently set to "NO".
The system used to run FreeBSD 4.0, without the RAID controller, and there was not a single problem. Since the "upgrade" it has crashed at irregular intervals, 2-3 times per month. There is no crash dump and nothing in /var/log. I have no screen dump of the panic screen (should we have a camera in the computer room?). After a few minutes the system resets itself. At times there are notifications in /var/log/messages like:
Jul  3 01:43:31 irsdevgate /kernel: de1: abnormal interrupt: transmit underflow (raising TX threshold to 96|256)
Jul  5 09:29:42 irsdevgate /kernel: xl0: no memory for rx list -- packet dropped!
There is, however, no correlation between the occurrence of these messages and the time of crashing.

How-To-Repeat: There appears (as yet) to be no link between the moment of crashing and external (network) or internal (e.g. cron) activity.
Comment 1 Rudi Mathijssen 2001-11-04 09:59:37 UTC
Modifications tested: (1) removed xl and (suspect) fxp, added more SMC
cards, now we have six interfaces de0-de5. These all run half-duplex
100Mbps. (2) Furthermore, as netstat -m showed that the peak use of
mbuf clusters (944) came awfully close to 1024 (default),
NMBCLUSTERS=4096 was set. After a flawless operation from 4-sep-2001
on, it crashed again on oct-29 and oct-31 (the panic message is: page
fault in kernel mode). This is not acceptable. Should we upgrade to
4.4? Go back to 4.0? Is there a special kernel param NO_PANIC which
should be set to 1?
I stress, this is not a test lab, it's a production environment. If FreeBSD
is not suitable for this, please tell me.

Rudi Mathijssen
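
The mbuf cluster check described in (2) can be sketched as a small shell helper. This is only an illustration: the function name cluster_headroom and the 80% warning threshold are hypothetical, and it assumes that netstat -m prints a line of the form "current/peak/max mbuf clusters in use", which may differ between releases.

```sh
#!/bin/sh
# Sketch: report how close the peak mbuf cluster usage came to the limit.
# ASSUMPTION: the input contains a line such as
#   120/944/1024 mbuf clusters in use (current/peak/max)
# The function name and the default threshold are illustrative only.
cluster_headroom() {
    # $1 = warning threshold in percent of the maximum (default 80)
    awk -v limit="${1:-80}" '
        /mbuf clusters in use/ {
            # First field is "current/peak/max"
            split($1, f, "/")
            peak = f[2]
            max = f[3]
            if (max <= 0) next
            pct = int(peak * 100 / max)
            state = (pct >= limit) ? "WARN" : "OK"
            printf "%s peak=%d max=%d used=%d%%\n", state, peak, max, pct
        }'
}

# Typical use on the live system:
#   netstat -m | cluster_headroom 80
```

Fed the figures from this report (the "current" value of 120 is made up for the sample), a peak of 944 against the default of 1024 gives "WARN peak=944 max=1024 used=92%", while the same peak against NMBCLUSTERS=4096 gives "OK peak=944 max=4096 used=23%".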
Comment 2 Murray Stokely freebsd_committer freebsd_triage 2001-11-05 12:31:46 UTC
On Sun, Nov 04, 2001 at 10:59:37AM +0100, Rudi Mathijssen wrote:
> Modifications tested: (1) removed xl and (suspect) fxp, added more SMC
> cards, now we have six interfaces de0-de5. These all run half-duplex
> 100Mbps. (2) Furthermore, as netstat -m showed that the peak use of
> mbuf clusters (944) came awfully close to 1024 (default),
> NMBCLUSTERS=4096 was set. After a flawless operation from 4-sep-2001
> on, it crashed again on oct-29 and oct-31 (the panic message is: page
> fault in kernel mode). This is not acceptable. Should we upgrade to
> 4.4? Go back to 4.0? Is there a special kernel param NO_PANIC which
> should be set to 1?
> I stress, this is not a test lab, it's a production environment. If FreeBSD
> is not suitable for this, please tell me.

  Can you generate a backtrace of the kernel crash and post it to
freebsd-stable@FreeBSD.org or freebsd-net@FreeBSD.org?

  This document should help narrow down the cause of the failure:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#AEN4392

  What version of FreeBSD are you running?  Does this problem still
exist with 4.4 or 4.4-STABLE?

      - Murray
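
As a rough sketch of what obtaining such a backtrace involves on a 4.x system: a dump device has to be configured so that the panic is written to swap and saved by savecore at the next boot, and a kernel built with debugging symbols is needed to read the dump. The device name below is a placeholder, not taken from this report, and the steps in the comments follow the kerneldebug chapter linked above; they may need adjusting for a given setup.

```sh
# /etc/rc.conf fragment (plain sh variable assignment).
# ASSUMPTION: /dev/da0s1b is a placeholder; use the machine's real swap
# partition, which should be at least as large as physical RAM (128MB here).
dumpdev="/dev/da0s1b"

# Remaining steps, shown as comments because they depend on the local setup:
#   1. Build the kernel with debugging symbols (e.g. "config -g KERNELNAME")
#      and keep the resulting kernel.debug file.
#   2. After the next panic and reboot, savecore(8) should leave a
#      vmcore.N file in /var/crash.
#   3. Open the dump with:  gdb -k kernel.debug /var/crash/vmcore.0
#      and type "bt" at the gdb prompt to print the backtrace.
```

Without a configured dump device the panic leaves nothing behind, which would be consistent with the absence of a crash dump noted in the description.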
Comment 3 iedowse freebsd_committer freebsd_triage 2001-11-21 18:11:51 UTC
State Changed
From-To: open->feedback


Based on the backtrace sent to -net, this is almost certainly caused 
by the icmp_error bug that was fixed just before 4.3-RELEASE. Can 
you confirm that this problem goes away if you upgrade to a more 
recent release?
Comment 4 iedowse freebsd_committer freebsd_triage 2002-01-13 18:10:28 UTC
State Changed
From-To: feedback->closed


Feedback timeout, and almost certainly fixed.