Bug 156026

Summary: [arp] [panic] arpintr()->in_lltable_lookup() 8.1 bce(4) crash
Product: Base System Reporter: pluknet <pluknet>
Component: kern Assignee: Qing Li <qingli>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 8.1-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
file.shar none

Description pluknet 2011-03-29 10:00:22 UTC
The kernel is GENERIC with the following additions:
options         KDB
options         DDB
options         BREAK_TO_DEBUGGER

options         IPFIREWALL
options         IPFIREWALL_FORWARD
options         IPFIREWALL_VERBOSE
options         IPFIREWALL_VERBOSE_LIMIT=500
options         IPFIREWALL_DEFAULT_TO_ACCEPT
options         LIBALIAS
options         IPFIREWALL_NAT
options         QUOTA

device          carp

Three crashes were seen between February and March,
on different boxes with identical hardware.
The backtrace of the crashed process is the same in each case.

db> bt
Tracing pid 12 tid 100039 td 0xffffff00029a23e0
propagate_priority() at propagate_priority+0x72
turnstile_wait() at turnstile_wait+0x1aa
_rw_wlock_hard() at _rw_wlock_hard+0xfa
in_lltable_lookup() at in_lltable_lookup+0x12b
arpintr() at arpintr+0x9d6
netisr_dispatch_src() at netisr_dispatch_src+0x7e
ether_demux() at ether_demux+0x14d
ether_input() at ether_input+0x17b
ether_demux() at ether_demux+0x6f
ether_input() at ether_input+0x17b
bce_intr() at bce_intr+0x3b0
intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
ithread_loop() at ithread_loop+0x8e
fork_exit() at fork_exit+0x118
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff82b155dd30, rbp = 0 ---

Other ddb data is available for each crash (attached in the shar archive).
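
Since the traces from the separate crashes are compared by call path, a small script can make that comparison mechanical. The following is only a sketch: the helper names parse_ddb_backtrace and same_call_path are hypothetical, and the frame format is assumed to be exactly the "func() at func+0xNN" form shown in the trace above.

```python
import re

# Matches ddb frames such as "arpintr() at arpintr+0x9d6".
# Assumption: every frame repeats the function name after " at ".
FRAME_RE = re.compile(r"^(?P<func>\w+)\(\) at (?P=func)\+0x(?P<off>[0-9a-f]+)$")


def parse_ddb_backtrace(text):
    """Return a list of (function, offset) tuples, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("func"), int(match.group("off"), 16)))
    return frames


def same_call_path(trace_a, trace_b):
    """Compare two traces by function name only, ignoring offsets."""
    names_a = [func for func, _ in parse_ddb_backtrace(trace_a)]
    names_b = [func for func, _ in parse_ddb_backtrace(trace_b)]
    return names_a == names_b


# Shortened copy of the trace from this report, used as sample input.
SAMPLE = """\
db> bt
Tracing pid 12 tid 100039 td 0xffffff00029a23e0
propagate_priority() at propagate_priority+0x72
turnstile_wait() at turnstile_wait+0x1aa
_rw_wlock_hard() at _rw_wlock_hard+0xfa
in_lltable_lookup() at in_lltable_lookup+0x12b
arpintr() at arpintr+0x9d6
--- trap 0, rip = 0, rsp = 0xffffff82b155dd30, rbp = 0 ---
"""

if __name__ == "__main__":
    for func, off in parse_ddb_backtrace(SAMPLE):
        print(f"{func}+{off:#x}")
```

Offsets are ignored in the comparison because they can differ between kernel builds even when the call path is the same.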

Fix: Patch attached with submission follows:
How-To-Repeat: Happens sporadically on a moderately loaded box. Usual numbers:
2799 processes: 12 running, 2750 sleeping, 37 zombie

   packets  errs idrops      bytes    packets  errs      bytes colls
      1723     0     0     198578       2166     0    2415553     0

interrupt                          total       rate
irq4: uart0                       309128          6
irq15: ata1                           35          0
irq17: aac0                     13404675        293
cpu0: timer                     90914103       1993
irq256: bce0                     2914454         63
irq257: bce1                   109935240       2410
cpu1: timer                     90974578       1994
cpu3: timer                     91076473       1997
cpu2: timer                     91059038       1996
cpu4: timer                     90930409       1993
cpu5: timer                     91033366       1996
cpu6: timer                     91085723       1997
cpu7: timer                     91111107       1997
Total                          854748329      18743
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2011-04-09 20:49:09 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 2 Pyun YongHyeon freebsd_committer freebsd_triage 2011-04-10 23:54:34 UTC
State Changed
From-To: open->feedback

I think this is a bug in arp(4) rather than bce(4), and I guess it was
fixed in 8.2-RELEASE.
Can you reproduce the issue on 8.2-RELEASE or the latest stable/8?
Comment 3 Pyun YongHyeon freebsd_committer freebsd_triage 2011-04-10 23:55:25 UTC
Responsible Changed
From-To: freebsd-net->yongari

Grab.
Comment 4 pluknet 2011-04-11 08:57:55 UTC
I'm sorry, I don't have many 8.2 boxes in production use (yet).
The crash is occasional. I cannot reproduce it reliably:
I have seen 3 crashes in a month across ~100 boxes.

-- 
wbr,
pluknet
Comment 5 pyunyh 2011-04-11 17:31:16 UTC
On Mon, Apr 11, 2011 at 08:00:23AM +0000, Sergey Kandaurov wrote:
> The following reply was made to PR kern/156026; it has been noted by GNATS.
> 
> From: Sergey Kandaurov <pluknet@gmail.com>
> To: bug-followup@FreeBSD.org, pluknet@gmail.com
> Cc:  
> Subject: Re: kern/156026: [bce] [panic] arpintr()->in_lltable_lookup() 8.1 bce(4) crash
> Date: Mon, 11 Apr 2011 11:57:55 +0400
> 
>  I'm sorry, I don't have many 8.2 boxes in production use (yet).
>  The crash is occasional. I cannot reproduce it reliably:
>  I have seen 3 crashes in a month across ~100 boxes.
>  

Ok, please let me know if you encounter the issue on 8.2-RELEASE.
I think there was a large set of changes to arp(4) in stable/8
after 8.1-RELEASE. Qing Li (CCed) may know better whether
this was really fixed by one of them.
Qing, could you look over the issue?
Comment 6 Colin Percival freebsd_committer freebsd_triage 2011-05-10 06:37:44 UTC
This bug definitely still exists in 8.2-RELEASE -- it's currently the #1 panic
on FreeBSD/EC2.

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid
Comment 7 pyunyh 2011-05-10 18:06:48 UTC
On Tue, May 10, 2011 at 05:40:10AM +0000, Colin Percival wrote:
> The following reply was made to PR kern/156026; it has been noted by GNATS.
> 
> From: Colin Percival <cperciva@freebsd.org>
> To: bug-followup@FreeBSD.org, pluknet@gmail.com
> Cc:  
> Subject: Re: kern/156026: [bce] [panic] arpintr()->in_lltable_lookup()
>  8.1 bce(4) crash
> Date: Mon, 09 May 2011 22:37:44 -0700
> 
>  This bug definitely still exists in 8.2-RELEASE -- it's currently the #1 panic
>  on FreeBSD/EC2.

Is there an easy way to reproduce the issue? I have a quad-port bce(4)
controller but I haven't seen any issues.
Would you show me the dmesg and pciconf -lcbv output so I can get more
details on the controller/firmware revision?
Comment 8 Colin Percival freebsd_committer freebsd_triage 2011-05-11 00:26:02 UTC
On 05/10/11 10:06, YongHyeon PYUN wrote:
> On Tue, May 10, 2011 at 05:40:10AM +0000, Colin Percival wrote:
>>  This bug definitely still exists in 8.2-RELEASE -- it's currently the #1 panic
>>  on FreeBSD/EC2.
> 
> Is there an easy way to reproduce the issue? I have a quad-port bce(4)
> controller but I haven't seen any issues.
> Would you show me the dmesg and pciconf -lcbv output so I can get more
> details on the controller/firmware revision?

I don't have any easy way to reproduce this, but I can say that it does not
require bce, since EC2 doesn't have bce -- the hardware there is the Xen xn
virtual interface.

So I think it's safe to say that this is a bug in the arp/lltable/etc code,
not in any network driver.

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid
Comment 9 Pyun YongHyeon freebsd_committer freebsd_triage 2011-05-11 00:52:34 UTC
State Changed
From-To: feedback->open

Ok, thanks for confirming this.
I think it would be better to give it to our arp guru, Qing Li.
Qing, would you take a look at this issue? It seems it's not a device
driver related issue. If you have no time or interest to fix it
in the near future, please assign it back to freebsd-net@.
Comment 10 Pyun YongHyeon freebsd_committer freebsd_triage 2011-05-11 00:52:34 UTC
Responsible Changed
From-To: yongari->qingli

Ok, thanks for confirming this.
I think it would be better to give it to our arp guru, Qing Li.
Qing, would you take a look at this issue? It seems it's not a device
driver related issue. If you have no time or interest to fix it
in the near future, please assign it back to freebsd-net@.

Thanks.
Comment 11 Colin Percival freebsd_committer freebsd_triage 2011-05-14 20:00:22 UTC
I am told that this was fixed by r214675 -- this certainly seems
plausible to me.  John, can you confirm this so that we can close
this PR?

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid
Comment 12 John Baldwin freebsd_committer freebsd_triage 2011-05-14 20:07:49 UTC
On 5/14/11 3:00 PM, Colin Percival wrote:
> I am told that this was fixed by r214675 -- this certainly seems
> plausible to me.  John, can you confirm this so that we can close
> this PR?

Yes.

-- 
John Baldwin
Comment 13 Colin Percival freebsd_committer freebsd_triage 2011-05-14 20:08:39 UTC
State Changed
From-To: open->closed

jhb confirms that this was fixed in r214675 (Nov 2010, MFCed to stable/8 
as r217848 in Jan 2011).