Bug 244706

Summary:	panic: NULL dereference inside __mtx_lock_sleep()
Product:	Base System	Reporter:	Eugene Grosbein <eugen>
Component:	kern	Assignee:	Eugene Grosbein <eugen>
Status:	Closed FIXED
Severity:	Affects Only Me	CC:	chris, hselasky, kib, net
Priority:	---	Keywords:	crash, needs-qa
Version:	11.3-STABLE	Flags:	koobs: mfc-stable12? koobs: mfc-stable11?
Hardware:	amd64
OS:	Any

Description Eugene Grosbein freebsd_committer

2020-03-10 01:05:28 UTC

I observe very seldom kernel panics of my home router that runs FreeBSD 11.3-STABLE/amd64 r356315, once per several months. It paniced again today and I've got nice crashdump.

The router uses custom kernel with following config file:

include         GENERIC
ident           GW
options         DDB
options         DDB_NUMSYM
options         ALT_BREAK_TO_DEBUGGER
#EOF

The router processes several IPSec tunnels and some volume of fragmented ESP packets. The router uses ipfw and it has the following rule:

reass ip from any to any in { recv ng0 or recv em0 or recv wlan* }

kgdb session follows:

Unread portion of the kernel message buffer:


__curthread () at ./machine/pcpu.h:234
234             __asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:234
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:320
#2  0xffffffff80b2212d in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:388
#3  0xffffffff80b22578 in vpanic (fmt=<optimized out>, ap=0xfffffe022b5ed470)
    at /usr/src/sys/kern/kern_shutdown.c:784
#4  0xffffffff80b223b3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:715
#5  0xffffffff80fb8d00 in trap_fatal (frame=0xfffffe022b5ed660, eva=952)
    at /usr/src/sys/amd64/amd64/trap.c:899
#6  0xffffffff80fb8d49 in trap_pfault (frame=0xfffffe022b5ed660, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:744
#7  0xffffffff80fb83dd in trap (frame=0xfffffe022b5ed660)
    at /usr/src/sys/amd64/amd64/trap.c:438
#8  <signal handler called>
#9  __mtx_lock_sleep (c=0xffffffff81e57188 <ipq+45624>, v=<optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:563
#10 0xffffffff80ca1078 in ipreass_slowtimo ()
    at /usr/src/sys/netinet/ip_reass.c:573
#11 0xffffffff80baa504 in pfslowtimo (arg=0xffffffff81e57188 <ipq+45624>)
    at /usr/src/sys/kern/uipc_domain.c:506
#12 0xffffffff80b3acbf in softclock_call_cc (
--Type <RET> for more, q to quit, c to continue without paging--c
    c=0xffffffff81e46200 <pfslow_callout>, cc=0xffffffff81efe000 <cc_cpu>, direct=0) at /usr/src/sys/kern/kern_timeout.c:729
#13 0xffffffff80b3b1b9 in softclock (arg=0xffffffff81efe000 <cc_cpu>) at /usr/src/sys/kern/kern_timeout.c:867
#14 0xffffffff80ae7119 in intr_event_execute_handlers (p=<optimized out>, ie=0xfffff80005240200) at /usr/src/sys/kern/kern_intr.c:1346
#15 0xffffffff80ae7807 in ithread_execute_handlers (p=<optimized out>, ie=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1359
#16 ithread_loop (arg=0xfffff80005226680) at /usr/src/sys/kern/kern_intr.c:1440
#17 0xffffffff80ae44c3 in fork_exit (callout=0xffffffff80ae7720 <ithread_loop>, arg=0xfffff80005226680, frame=0xfffffe022b5ed9c0) at /usr/src/sys/kern/kern_fork.c:1086
#18 <signal handler called>
(kgdb) frame 10
#10 0xffffffff80ca1078 in ipreass_slowtimo ()
    at /usr/src/sys/netinet/ip_reass.c:573
573                     IPQ_LOCK(i);
(kgdb) l
568     ipreass_slowtimo(void)
569     {
570             struct ipq *fp, *tmp;
571
572             for (int i = 0; i < IPREASS_NHASH; i++) {
573                     IPQ_LOCK(i);
574                     TAILQ_FOREACH_SAFE(fp, &V_ipq[i].head, ipq_list, tmp)
575                     if (--fp->ipq_ttl == 0)
576                                     ipq_timeout(&V_ipq[i], fp);
577                     IPQ_UNLOCK(i);
(kgdb) p i
$1 = 814
(kgdb) frame 9
#9  __mtx_lock_sleep (c=0xffffffff81e57188 <ipq+45624>, v=<optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:563
563                     if (TD_IS_RUNNING(owner)) {
(kgdb) l
558                     /*
559                      * If the owner is running on another CPU, spin until the
560                      * owner stops running or the state of the lock changes.
561                      */
562                     owner = lv_mtx_owner(v);
563                     if (TD_IS_RUNNING(owner)) {
564                             if (LOCK_LOG_TEST(&m->lock_object, 0))
565                                     CTR3(KTR_LOCK,
566                                         "%s: spinning on %p held by %p",
567                                         __func__, m, owner);
(kgdb) p ipq[45624]
$2 = {head = {tqh_first = 0xffffffff80b1d0c0 <_rm_wlock>, tqh_last = 0x2fb},
  lock = {lock_object = {
      lo_name = 0x5001200125dd1 <error: Cannot access memory at address 0x5001200125dd1>, lo_flags = 2155235392, lo_data = 4294967295, lo_witness = 0x5eb},
    mtx_lock = 1407452194168295}, count = -2139146832}
(kgdb) p owner
$3 = (struct thread *) 0x0
(kgdb) p *m
$4 = {lock_object = {lo_name = 0xffffffff81567e78 "IP reassembly",
    lo_flags = 21168128, lo_data = 0, lo_witness = 0x0}, mtx_lock = 2}
(kgdb) p v
$5 = <optimized out>
(kgdb) quit

Comment 1 Eugene Grosbein freebsd_committer

2020-03-10 01:10:52 UTC

The CPU is two-core Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz, if it matters.

Comment 2 Eugene Grosbein freebsd_committer

2020-03-10 02:55:42 UTC

Also, this system does not clear RAM between reboots, so dmesg buffer survives panic and saved to the log after reboot:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x3b8
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b024ad
stack pointer           = 0x28:0xfffffe022b5ed720
frame pointer           = 0x28:0xfffffe022b5ed7a0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff803b4bbb = db_trace_self_wrapper+0x2b/frame 0xfffffe022b5ed3d0
vpanic() at 0xffffffff80b2253e = vpanic+0x17e/frame 0xfffffe022b5ed430
panic() at 0xffffffff80b223b3 = panic+0x43/frame 0xfffffe022b5ed490
trap_pfault() at 0xffffffff80fb8d00 = trap_pfault/frame 0xfffffe022b5ed4e0
trap_pfault() at 0xffffffff80fb8d49 = trap_pfault+0x49/frame 0xfffffe022b5ed540
trap() at 0xffffffff80fb83dd = trap+0x29d/frame 0xfffffe022b5ed650
calltrap() at 0xffffffff80f979e3 = calltrap+0x8/frame 0xfffffe022b5ed650
--- trap 0xc, rip = 0xffffffff80b024ad, rsp = 0xfffffe022b5ed720, rbp = 0xfffffe022b5ed7a0 ---
__mtx_lock_sleep() at 0xffffffff80b024ad = __mtx_lock_sleep+0xbd/frame 0xfffffe022b5ed7a0
ipreass_slowtimo() at 0xffffffff80ca1078 = ipreass_slowtimo+0x28/frame 0xfffffe022b5ed7e0
pfslowtimo() at 0xffffffff80baa504 = pfslowtimo+0x54/frame 0xfffffe022b5ed810
softclock_call_cc() at 0xffffffff80b3acbf = softclock_call_cc+0x14f/frame 0xfffffe022b5ed8c0
softclock() at 0xffffffff80b3b1b9 = softclock+0x79/frame 0xfffffe022b5ed8e0
intr_event_execute_handlers() at 0xffffffff80ae7119 = intr_event_execute_handlers+0xe9/frame 0xfffffe022b5ed920
ithread_loop() at 0xffffffff80ae7807 = ithread_loop+0xe7/frame 0xfffffe022b5ed970
fork_exit() at 0xffffffff80ae44c3 = fork_exit+0x83/frame 0xfffffe022b5ed9b0
fork_trampoline() at 0xffffffff80f989fe = fork_trampoline+0xe/frame 0xfffffe022b5ed9b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Uptime: 24d16h23m51s

Comment 3 Hans Petter Selasky freebsd_committer

2020-06-07 06:42:07 UTC

The lockname should always be valid for the mutex. Else this may be a memory corruption issue.

Comment 4 Hans Petter Selasky freebsd_committer

2020-06-07 06:48:30 UTC

Do you know if the router was about to reboot when this happened?

Comment 5 Eugene Grosbein freebsd_committer

2020-08-02 11:07:01 UTC

(In reply to Hans Petter Selasky from comment #4)

Sorry for late reply, I've missed your question.

AFAIR, the router was not rebooting. However, it has USB-based Wifi card run0 in station mode with unstable connection, so corresponding run0/wlanX kernel objects could be destroyed/re-created any moment.

Comment 6 Eugene Grosbein freebsd_committer

2021-02-20 03:51:10 UTC

Believed to be fixed with MFC r359746 to stable/11 as the problem did not repeat since then. The hardware now runs stable/12 that did not reproduce the problem, too.