Bug 105437

Summary: 6.2-BETA3 crashes on amd64
Product: Base System
Reporter: Wojciech Puchar <wojtek>
Component: kern
Assignee: ru <ru>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 6.2-BETA3   
Hardware: Any   
OS: Any   
Attachments:
Description     Flags
dmesg.4         none
typescript.4    none
dmesg.5         none
typescript.5    none

Description Wojciech Puchar 2006-11-12 13:20:22 UTC
system crashes every few days without warning. memory is dumped; kgdb shows:

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xd4
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xffffffff802858b4
stack pointer           = 0x10:0xffffffffb49a1720
frame pointer           = 0x10:0x4
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 11 (swi4: clock sio)
trap number             = 12
panic: page fault
Uptime: 1d5h42m1s
Dumping 1023 MB (2 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 1023MB (261872 pages) 1007 991 975 959 943 927 911 895 879 863 847 831 815 799 783 767 751
735 719 703 687 671 655 639 623 607 591 575 559 543 527 511 495 479 463 447 431 415 399 383 367 351 335
319 303 287 271 255 239 223 207 191 175 159 143 127 111 95 79 63 47 31 15

#0  doadump () at pcpu.h:172
172             __asm __volatile("movq %%gs:0,%0" : "=r" (td));

Fix: 

no idea :(
How-To-Repeat: i don't need to do anything; the system is normally used under moderate load by multiple users.

nothing special happens before crash.
Comment 1 ru freebsd_committer freebsd_triage 2006-11-12 17:07:12 UTC
On Sun, Nov 12, 2006 at 05:09:22PM +0100, Wojciech Puchar wrote:
> #0  doadump () at pcpu.h:172
> #1  0x0000000000000004 in ?? ()
> #2  0xffffffff8025deb3 in boot (howto=260) at 
> ../../../kern/kern_shutdown.c:409
> #3  0xffffffff8025e4b6 in panic (fmt=0xffffff003d8fa980 "°\226\217=")
>     at ../../../kern/kern_shutdown.c:565
> #4  0xffffffff803e87f2 in trap_fatal (frame=0xffffff003d8fa980, 
> eva=18446742975230744240)
>     at ../../../amd64/amd64/trap.c:660
> #5  0xffffffff803e8d16 in trap (frame=
>       {tf_rdi = -1098993325056, tf_rsi = 4, tf_rdx = -1098478802560, 
> tf_rcx = 4, tf_r8 = -1098478802496, tf_r9 = -1098993325056, tf_rax = 2, 
> tf_rbx = -1098478802560, tf_rbp = 4, tf_r10 = -1098993325056, tf_r11 = 
> -1264970144, tf_r12 = -1098478802560, tf_r13 = -1098993325056, tf_r14 = 
> -2141357264, tf_r15 = -1098758394592, tf_trapno = 12, tf_addr = 212, 
> tf_flags = -2144054761, tf_err = 0, tf_rip = -2144839500, tf_cs = 8, 
> tf_rflags = 65543, tf_rsp = -1264969928, tf_ss = 16})
>     at ../../../amd64/amd64/trap.c:238
> #6  0xffffffff803d640b in calltrap () at 
> ../../../amd64/amd64/exception.S:168
> #7  0xffffffff802858b4 in turnstile_setowner (ts=0xffffff001ee4ac00, 
> owner=0x4)
>     at ../../../kern/subr_turnstile.c:432
> #8  0xffffffff80285ebb in turnstile_wait (lock=0xffffff002ce56d20, 
> owner=0x4)
>     at ../../../kern/subr_turnstile.c:591
> #9  0xffffffff80252f39 in _mtx_lock_sleep (m=0xffffff002ce56d20, 
> tid=18446742975230749056,
>     opts=1032825216, file=0x4 <Address 0x4 out of bounds>, 
> line=1032825280)
>     at ../../../kern/kern_mutex.c:579
> 
The line 579 has:

:                 turnstile_wait(&m->mtx_object, mtx_owner(m));

Some references:

: /*
:  * Internal utility macros.
:  */
: #define mtx_unowned(m)  ((m)->mtx_lock == MTX_UNOWNED)
:  
: #define mtx_owner(m)    (mtx_unowned((m)) ? NULL \
:         : (struct thread *)((m)->mtx_lock & MTX_FLAGMASK))

: /*
:  * State bits kept in mutex->mtx_lock, for the DEFAULT lock type. None of this,
:  * with the exception of MTX_UNOWNED, applies to spin locks.
:  */
: #define MTX_RECURSED    0x00000001      /* lock recursed (for MTX_DEF only) */
: #define MTX_CONTESTED   0x00000002      /* lock contested (for MTX_DEF only) */
: #define MTX_UNOWNED     0x00000004      /* Cookie for free mutex */
: #define MTX_FLAGMASK    ~(MTX_RECURSED | MTX_CONTESTED)

mtx_owner(m) returns the value of "4", which is MTX_UNOWNED,
but if mtx_lock were only MTX_UNOWNED, mtx_unowned() would return
true, and mtx_owner() would return NULL.  This means that mtx_lock
has some other bit set in addition to MTX_UNOWNED, which is illegal.
Most likely, it's MTX_DESTROYED (which is defined as (MTX_CONTESTED \
| MTX_UNOWNED)).  You should print the mutex to be sure.  So
it looks like the code is trying to lock a corrupt mutex.
Please recompile your kernel with the following options:

options         INVARIANTS              # Enable calls of extra sanity checking
options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
options         WITNESS                 # Enable checks to detect deadlocks and cycles
options         WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed

It will run more slowly, but it could allow us to catch the bug earlier.

It could turn out to be a problem with the IPv6 routing code.

> #10 0xffffffff8033c7ab in nd6_output (ifp=0xffffff003063c000, 
> origifp=0xffffff003063c000,
>     m0=0xffffff0001cd6400, dst=0xffffff002e437a60, rt0=0xffffff002b96f630)
>     at ../../../netinet6/nd6.c:2004
> #11 0xffffffff80338c12 in ip6_output (m0=0x100010170400120, opt=0x500, 
> ro=0xffffffffb49a1a00,
>     flags=0, im6o=0x0, ifpp=0x0, inp=0xffffff0001c304c0) at 
> ../../../netinet6/ip6_output.c:994
> 
I don't understand why "ro" is not NULL here, because tcp_output()
below calls it with a NULL argument; this is probably due to a
-O2 compilation.

> #12 0xffffffff80315a6d in tcp_output (tp=0xffffff0010b165e0) at 
> ../../../netinet/tcp_output.c:1059
> #13 0xffffffff8031c6a5 in tcp_timer_rexmt (xtp=0xffffff001ee4ac00)
>     at ../../../netinet/tcp_timer.c:537
> #14 0xffffffff8026d02a in softclock (dummy=0xffffff001ee4ac00) at 
> ../../../kern/kern_timeout.c:290
> #15 0xffffffff802442b6 in ithread_loop (arg=0xffffff00000053c0) at 
> ../../../kern/kern_intr.c:682
> #16 0xffffffff80242d03 in fork_exit (callout=0xffffffff80244170 
> <ithread_loop>,
>     arg=0xffffff00000053c0, frame=0xffffffffb49a1c50) at 
> ../../../kern/kern_fork.c:821
> #17 0xffffffff803d676e in fork_trampoline () at 
> ../../../amd64/amd64/exception.S:394
> #18 0x0000000000000000 in ?? ()
> #19 0x0000000000000000 in ?? ()
> #20 0x0000000000000001 in ?? ()
> #21 0x0000000000000000 in ?? ()
> #22 0x0000000000000000 in ?? ()
> #23 0x0000000000000000 in ?? ()
> #24 0x0000000000000000 in ?? ()
> #25 0x0000000000000000 in ?? ()
> #26 0x0000000000000000 in ?? ()
> #27 0x0000000000000000 in ?? ()
> #28 0x0000000000000000 in ?? ()
> #29 0x0000000000000000 in ?? ()
> #30 0x0000000000000000 in ?? ()
> #31 0x0000000000000000 in ?? ()
> #32 0x0000000000000000 in ?? ()
> #33 0x0000000000000000 in ?? ()
> #34 0x0000000000000000 in ?? ()
> #35 0x0000000000000000 in ?? ()
> #36 0x0000000000000000 in ?? ()
> #37 0x0000000000000000 in ?? ()
> #38 0x0000000000000000 in ?? ()
> #39 0x0000000000000000 in ?? ()
> #40 0x0000000000000000 in ?? ()
> #41 0x0000000000000000 in ?? ()
> #42 0x0000000000000000 in ?? ()
> #43 0x0000000000000000 in ?? ()
> #44 0x0000000000000000 in ?? ()
> #45 0x0000000000000000 in ?? ()
> #46 0x0000000000000000 in ?? ()
> #47 0x0000000000000000 in ?? ()
> #48 0x0000000000000000 in ?? ()
> #49 0x0000000000000000 in ?? ()
> #50 0x00000000007b4000 in ?? ()
> #51 0xffffff003d8fa980 in ?? ()
> #52 0xffffff00000053c0 in ?? ()
> #53 0x0000000000000001 in ?? ()
> #54 0xffffff003d8f96b0 in ?? ()
> #55 0xffffff001ffa4980 in ?? ()
> #56 0xffffffffb49a1b58 in ?? ()
> #57 0xffffff003d8fa980 in ?? ()
> #58 0xffffffff802734db in sched_switch (td=0xffffff00000053c0, newtd=0x0, 
> flags=0)
> 
> then zeroes up to #130


Cheers,
-- 
Ruslan Ermilov
ru@FreeBSD.org
FreeBSD committer
Comment 2 Wojciech Puchar 2006-11-12 17:32:36 UTC
>
> options         INVARIANTS              # Enable calls of extra sanity checking
> options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
> options         WITNESS                 # Enable checks to detect deadlocks and cycles
> options         WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed
>
> It will run more slowly, but could allow to catch the bug earlier.
>
> It could turn out to be a problem with the IPv6 routing code.

yes i DO use IPv6 intensively.

this machine has native IPv6 connectivity, offers native connectivity and 
over TUN (ppp) and tunnels over gif.

options added, kernel compiled, machine rebooted. what messages should i 
look for?


now i know that sometimes 2 machines in my IPv6 network crash at the 
same time. a special kind of packet?

will my firewall config and kernel config help?
Comment 3 Wojciech Puchar 2006-11-12 20:29:29 UTC
effects:

1) after compiling kernel with

>> options         INVARIANTS              # Enable calls of extra sanity checking
>> options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
>> options         WITNESS                 # Enable checks to detect deadlocks and cycles
>> options         WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed


kernel crashes maybe 5 seconds after boot!!!

see dmesg.4 - done with dmesg -M, and typescript.4 done with kgdb



because this server must work, i compiled kernel again without these 
options and it started.

later i did as root

/etc/rc.d/route6d stop

and it crashed :) and similar crash i've got some time ago.

see dmesg.5 and typescript.5


all coredumps and kernels are saved, if you like i'll make you an account 
to see whatever you need.


if you need to test some patches, i can do it every day after about 18:00, 
or my clients will kill me :)


 						Wojtek
Comment 4 ru freebsd_committer freebsd_triage 2006-11-12 22:24:39 UTC
On Sun, Nov 12, 2006 at 09:29:29PM +0100, Wojciech Puchar wrote:
> 
> effects:
> 
> 1) after compiling kernel with
> 
> >>options         INVARIANTS              # Enable calls of extra sanity checking
> >>options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
> >>options         WITNESS                 # Enable checks to detect deadlocks and cycles
> >>options         WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed
> 
> 
> kernel crashes after maybe 5 seconds after boot!!!
> 
> see dmesg.4 - done with dmesg -M, and typescript.4 done with kgdb
> 
This is a bug in the current rue(4) driver; it holds a non-sleepable
lock in rue_read_mem() and calls into the USB stack, which wants to
sleep; hence the panic.  There's a major rework of the USB stack
happening now in FreeBSD Perforce; I've looked there and the new rue(4)
driver doesn't have this problem.  I suggest that you don't use USB
NICs at the moment if this is possible.

> because this server must work, i compiled kernel again without these 
> options and it started.
> 
> later i did as root
> 
> /etc/rc.d/route6d stop
> 
> and it crashed :) and similar crash i've got some time ago.
> 
> see dmesg.5 and typescript.5
> 
This may be a trace of the same problem; after you eliminate the
USB NIC from your system, let's see what happens next.


Cheers,
-- 
Ruslan Ermilov
ru@FreeBSD.org
FreeBSD committer
Comment 5 ru freebsd_committer freebsd_triage 2006-11-16 10:56:56 UTC
State Changed
From-To: open->feedback

The USB problem with rue(4) was isolated. 
The original problem became more obvious under INVARIANTS. 
A patch has been sent to submitter for testing: 
http://people.freebsd.org/~ru/patches/ipv6_rtentry_locking.patch 

I suspect that this might actually be a duplicate of PR kern/93910.
Comment 6 ru freebsd_committer freebsd_triage 2006-11-16 10:59:39 UTC
Responsible Changed
From-To: freebsd-amd64->ru

I'm tracking it.  It's also not amd64-specific, so the PR's category
has been changed to "kern".
Comment 7 dfilter service freebsd_committer freebsd_triage 2006-11-25 20:39:13 UTC
ru          2006-11-25 20:38:56 UTC

  FreeBSD src repository

  Modified files:
    sys/netinet6         nd6.c 
  Log:
  - In nd6_rtrequest(), when caching an rtentry, don't forget
    to add a reference to it; otherwise, we could later access
    a freed memory.  This is believed to fix panics some users
    were observing when running route6d(8), and is similar to
    the fix in sys/netinet/if_ether.c,v 1.139 by glebius@.
  
  PR:             kern/93910, kern/105437
  Testing by:     Wojciech Puchar (still ongoing)
  
  - Add rtentry locking to nd6_output() similar to rt_check().
  
  MFC after:      4 days
  
  Revision  Changes    Path
  1.72      +29 -9     src/sys/netinet6/nd6.c
_______________________________________________
cvs-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/cvs-all
To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
Comment 8 ru freebsd_committer freebsd_triage 2006-11-27 15:11:30 UTC
State Changed
From-To: feedback->patched

I believe it's fixed now.
Comment 9 dfilter service freebsd_committer freebsd_triage 2006-11-29 14:00:53 UTC
ru          2006-11-29 14:00:29 UTC

  FreeBSD src repository

  Modified files:        (Branch: RELENG_6)
    sys/netinet6         nd6.c 
  Log:
  MFC: 1.72: Prevent cached rtentry from being removed, add rtentry
  locking to nd6_output().
  
  PR:             kern/93910, kern/105437
  
  Revision   Changes    Path
  1.48.2.16  +29 -9     src/sys/netinet6/nd6.c
Comment 10 ru freebsd_committer freebsd_triage 2006-11-29 14:01:12 UTC
State Changed
From-To: patched->closed

Fixed in RELENG_6.