| Summary: | 6.2-BETA3 crashes on amd64 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Base System | Reporter: | Wojciech Puchar <wojtek> | ||||||||||
| Component: | kern | Assignee: | ru <ru> | ||||||||||
| Status: | Closed FIXED | ||||||||||||
| Severity: | Affects Only Me | ||||||||||||
| Priority: | Normal | ||||||||||||
| Version: | 6.2-BETA3 | ||||||||||||
| Hardware: | Any | ||||||||||||
| OS: | Any | ||||||||||||
|
Description
Wojciech Puchar
2006-11-12 13:20:22 UTC
On Sun, Nov 12, 2006 at 05:09:22PM +0100, Wojciech Puchar wrote:
> #0  doadump () at pcpu.h:172
> #1  0x0000000000000004 in ?? ()
> #2  0xffffffff8025deb3 in boot (howto=260)
>     at ../../../kern/kern_shutdown.c:409
> #3  0xffffffff8025e4b6 in panic (fmt=0xffffff003d8fa980 "°\226\217=")
>     at ../../../kern/kern_shutdown.c:565
> #4  0xffffffff803e87f2 in trap_fatal (frame=0xffffff003d8fa980,
>     eva=18446742975230744240)
>     at ../../../amd64/amd64/trap.c:660
> #5  0xffffffff803e8d16 in trap (frame=
>     {tf_rdi = -1098993325056, tf_rsi = 4, tf_rdx = -1098478802560,
>     tf_rcx = 4, tf_r8 = -1098478802496, tf_r9 = -1098993325056, tf_rax = 2,
>     tf_rbx = -1098478802560, tf_rbp = 4, tf_r10 = -1098993325056,
>     tf_r11 = -1264970144, tf_r12 = -1098478802560, tf_r13 = -1098993325056,
>     tf_r14 = -2141357264, tf_r15 = -1098758394592, tf_trapno = 12,
>     tf_addr = 212, tf_flags = -2144054761, tf_err = 0, tf_rip = -2144839500,
>     tf_cs = 8, tf_rflags = 65543, tf_rsp = -1264969928, tf_ss = 16})
>     at ../../../amd64/amd64/trap.c:238
> #6  0xffffffff803d640b in calltrap ()
>     at ../../../amd64/amd64/exception.S:168
> #7  0xffffffff802858b4 in turnstile_setowner (ts=0xffffff001ee4ac00,
>     owner=0x4)
>     at ../../../kern/subr_turnstile.c:432
> #8  0xffffffff80285ebb in turnstile_wait (lock=0xffffff002ce56d20,
>     owner=0x4)
>     at ../../../kern/subr_turnstile.c:591
> #9  0xffffffff80252f39 in _mtx_lock_sleep (m=0xffffff002ce56d20,
>     tid=18446742975230749056,
>     opts=1032825216, file=0x4 <Address 0x4 out of bounds>,
>     line=1032825280)
>     at ../../../kern/kern_mutex.c:579
>
The line 579 has:

: turnstile_wait(&m->mtx_object, mtx_owner(m));

Some references:

: /*
:  * Internal utility macros.
:  */
: #define mtx_unowned(m) ((m)->mtx_lock == MTX_UNOWNED)
:
: #define mtx_owner(m) (mtx_unowned((m)) ? NULL \
:     : (struct thread *)((m)->mtx_lock & MTX_FLAGMASK))

: /*
:  * State bits kept in mutex->mtx_lock, for the DEFAULT lock type. None of this,
:  * with the exception of MTX_UNOWNED, applies to spin locks.
:  */
: #define MTX_RECURSED  0x00000001 /* lock recursed (for MTX_DEF only) */
: #define MTX_CONTESTED 0x00000002 /* lock contested (for MTX_DEF only) */
: #define MTX_UNOWNED   0x00000004 /* Cookie for free mutex */
: #define MTX_FLAGMASK  ~(MTX_RECURSED | MTX_CONTESTED)

mtx_owner(m) returns the value "4", which is MTX_UNOWNED. But if mtx_lock were exactly MTX_UNOWNED, mtx_unowned() would return true and mtx_owner() would return NULL. This means that mtx_lock has some other bit set in addition to MTX_UNOWNED, which is illegal. Most likely it's MTX_DESTROYED, which is defined as (MTX_CONTESTED | MTX_UNOWNED). You should print the mutex to be sure.

So it looks like the code is trying to lock a corrupt mutex. Please recompile your kernel with the following options:

options INVARIANTS # Enable calls of extra sanity checking
options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
options WITNESS # Enable checks to detect deadlocks and cycles
options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed

It will run more slowly, but may let us catch the bug earlier.

It could turn out to be a problem with the IPv6 routing code.

> #10 0xffffffff8033c7ab in nd6_output (ifp=0xffffff003063c000,
>     origifp=0xffffff003063c000,
>     m0=0xffffff0001cd6400, dst=0xffffff002e437a60, rt0=0xffffff002b96f630)
>     at ../../../netinet6/nd6.c:2004
> #11 0xffffffff80338c12 in ip6_output (m0=0x100010170400120, opt=0x500,
>     ro=0xffffffffb49a1a00,
>     flags=0, im6o=0x0, ifpp=0x0, inp=0xffffff0001c304c0)
>     at ../../../netinet6/ip6_output.c:994
>
I don't understand why "ro" is not NULL here, because tcp_output() below calls it with a NULL argument; this is probably an artifact of compiling with -O2.
> #12 0xffffffff80315a6d in tcp_output (tp=0xffffff0010b165e0)
>     at ../../../netinet/tcp_output.c:1059
> #13 0xffffffff8031c6a5 in tcp_timer_rexmt (xtp=0xffffff001ee4ac00)
>     at ../../../netinet/tcp_timer.c:537
> #14 0xffffffff8026d02a in softclock (dummy=0xffffff001ee4ac00)
>     at ../../../kern/kern_timeout.c:290
> #15 0xffffffff802442b6 in ithread_loop (arg=0xffffff00000053c0)
>     at ../../../kern/kern_intr.c:682
> #16 0xffffffff80242d03 in fork_exit (callout=0xffffffff80244170 <ithread_loop>,
>     arg=0xffffff00000053c0, frame=0xffffffffb49a1c50)
>     at ../../../kern/kern_fork.c:821
> #17 0xffffffff803d676e in fork_trampoline ()
>     at ../../../amd64/amd64/exception.S:394
> #18 0x0000000000000000 in ?? ()
> #19 0x0000000000000000 in ?? ()
> #20 0x0000000000000001 in ?? ()
> #21 0x0000000000000000 in ?? ()
> #22 0x0000000000000000 in ?? ()
> #23 0x0000000000000000 in ?? ()
> #24 0x0000000000000000 in ?? ()
> #25 0x0000000000000000 in ?? ()
> #26 0x0000000000000000 in ?? ()
> #27 0x0000000000000000 in ?? ()
> #28 0x0000000000000000 in ?? ()
> #29 0x0000000000000000 in ?? ()
> #30 0x0000000000000000 in ?? ()
> #31 0x0000000000000000 in ?? ()
> #32 0x0000000000000000 in ?? ()
> #33 0x0000000000000000 in ?? ()
> #34 0x0000000000000000 in ?? ()
> #35 0x0000000000000000 in ?? ()
> #36 0x0000000000000000 in ?? ()
> #37 0x0000000000000000 in ?? ()
> #38 0x0000000000000000 in ?? ()
> #39 0x0000000000000000 in ?? ()
> #40 0x0000000000000000 in ?? ()
> #41 0x0000000000000000 in ?? ()
> #42 0x0000000000000000 in ?? ()
> #43 0x0000000000000000 in ?? ()
> #44 0x0000000000000000 in ?? ()
> #45 0x0000000000000000 in ?? ()
> #46 0x0000000000000000 in ?? ()
> #47 0x0000000000000000 in ?? ()
> #48 0x0000000000000000 in ?? ()
> #49 0x0000000000000000 in ?? ()
> #50 0x00000000007b4000 in ?? ()
> #51 0xffffff003d8fa980 in ?? ()
> #52 0xffffff00000053c0 in ?? ()
> #53 0x0000000000000001 in ?? ()
> #54 0xffffff003d8f96b0 in ?? ()
> #55 0xffffff001ffa4980 in ?? ()
> #56 0xffffffffb49a1b58 in ?? ()
> #57 0xffffff003d8fa980 in ?? ()
> #58 0xffffffff802734db in sched_switch (td=0xffffff00000053c0, newtd=0x0,
>     flags=0)
>
> then zeroes up to #130

Cheers,
--
Ruslan Ermilov
ru@FreeBSD.org
FreeBSD committer

>
> options INVARIANTS # Enable calls of extra sanity checking
> options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
> options WITNESS # Enable checks to detect deadlocks and cycles
> options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
>
> It will run more slowly, but may let us catch the bug earlier.
>
> It could turn out to be a problem with the IPv6 routing code.

yes, i DO use IPv6 intensively. this machine has native IPv6 connectivity, and offers connectivity natively, over TUN (ppp), and through tunnels over gif.

options added, kernel compiled, machine rebooted.

what messages should i look for?

now i know that sometimes 2 machines in my IPv6 network crash at the same time. a special kind of packets?

will my firewall config and kernel config help?
effects:
1) after compiling kernel with
>> options INVARIANTS # Enable calls of extra sanity checking
>> options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
>> options WITNESS # Enable checks to detect deadlocks and cycles
>> options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
kernel crashes after maybe 5 seconds after boot!!!
see dmesg.4 - done with dmesg -M, and typescript.4 done with kgdb
because this server must work, i compiled kernel again without these
options and it started.
later i did as root
/etc/rc.d/route6d stop
and it crashed :) and similar crash i've got some time ago.
see dmesg.5 and typescript.5
all coredumps and kernels are saved, if you like i'll make you an account
to see whatever you need.
if you need to test some patches, i can do it every day after about 18:00,
or my clients will kill me :)
Wojtek
On Sun, Nov 12, 2006 at 09:29:29PM +0100, Wojciech Puchar wrote:
>
> effects:
>
> 1) after compiling kernel with
>
> >>options INVARIANTS # Enable calls of extra sanity checking
> >>options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
> >>options WITNESS # Enable checks to detect deadlocks and cycles
> >>options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
>
> kernel crashes after maybe 5 seconds after boot!!!
>
> see dmesg.4 - done with dmesg -M, and typescript.4 done with kgdb
>
This is a bug in the current rue(4) driver: it holds a non-sleepable lock in rue_read_mem() and calls into the USB stack, which wants to sleep; hence the panic.

There's a major rework of the USB stack happening now in FreeBSD Perforce; I've looked there and the new rue(4) driver doesn't have this problem. I suggest that you don't use USB NICs at the moment, if this is possible.

> because this server must work, i compiled kernel again without these
> options and it started.
>
> later i did as root
>
> /etc/rc.d/route6d stop
>
> and it crashed :) and similar crash i've got some time ago.
>
> see dmesg.5 and typescript.5
>
This may be a trace of the same problem; after you eliminate the USB NIC from your system, let's see what happens next.

Cheers,
--
Ruslan Ermilov
ru@FreeBSD.org
FreeBSD committer

State Changed From-To: open->feedback

The USB problem with rue(4) was isolated. The original problem became more obvious under INVARIANTS. A patch has been sent to the submitter for testing: http://people.freebsd.org/~ru/patches/ipv6_rtentry_locking.patch

I suspect that this might actually be a duplicate of PR kern/93910.

Responsible Changed From-To: freebsd-amd64->ru

I'm tracking it. It's also not amd64-specific, so the PR's category has been changed to "kern".

ru 2006-11-25 20:38:56 UTC
FreeBSD src repository
Modified files:
sys/netinet6 nd6.c
Log:
- In nd6_rtrequest(), when caching an rtentry, don't forget
to add a reference to it; otherwise, we could later access
freed memory. This is believed to fix panics some users
were observing when running route6d(8), and is similar to
the fix in sys/netinet/if_ether.c,v 1.139 by glebius@.
PR: kern/93910, kern/105437
Testing by: Wojciech Puchar (still ongoing)
- Add rtentry locking to nd6_output() similar to rt_check().
MFC after: 4 days
Revision Changes Path
1.72 +29 -9 src/sys/netinet6/nd6.c
_______________________________________________
cvs-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/cvs-all
To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
State Changed From-To: feedback->patched

I believe it's fixed now.

ru 2006-11-29 14:00:29 UTC
FreeBSD src repository
Modified files: (Branch: RELENG_6)
sys/netinet6 nd6.c
Log:
MFC: 1.72: Prevent cached rtentry from being removed, add rtentry
locking to nd6_output().
PR: kern/93910, kern/105437
Revision Changes Path
1.48.2.16 +29 -9 src/sys/netinet6/nd6.c
State Changed From-To: patched->closed

Fixed in RELENG_6. |