Bug 210884 - panic: bogus refcnt 0 on lle 0xfffff80004608c00
Summary: panic: bogus refcnt 0 on lle 0xfffff80004608c00
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: Normal Affects Some People
Assignee: Gleb Smirnoff
URL:
Keywords: crash, patch, regression
Depends on:
Blocks:
 
Reported: 2016-07-07 07:21 UTC by Edward Tomasz Napierala
Modified: 2018-08-28 08:44 UTC (History)
19 users (show)

See Also:
koobs: mfc-stable11?
koobs: mfc-stable10?


Attachments
crash #1 (104.23 KB, text/plain)
2016-07-07 13:58 UTC, Larry Rosenman
no flags Details
Crash #2 (182.78 KB, text/plain)
2016-07-07 13:59 UTC, Larry Rosenman
no flags Details
crash after revert (157.37 KB, text/plain)
2016-07-07 23:02 UTC, Larry Rosenman
no flags Details
Quick and dirty patch to try (407 bytes, patch)
2016-07-08 19:54 UTC, Alexander Kabaev
no flags Details | Diff
post revert crash on boot :( (165.10 KB, text/plain)
2016-07-09 01:28 UTC, Larry Rosenman
no flags Details
svn diff (5.37 KB, patch)
2016-07-09 01:57 UTC, Larry Rosenman
no flags Details | Diff
build log (359.19 KB, application/x-gzip)
2016-07-09 02:20 UTC, Larry Rosenman
no flags Details
suggested patch: fixes return value for a never scheduled callout (995 bytes, patch)
2016-07-12 11:09 UTC, Gleb Smirnoff
no flags Details | Diff
suggested patch: fixes return value for a never scheduled callout (376 bytes, patch)
2016-07-14 11:30 UTC, Gleb Smirnoff
no flags Details | Diff
second fix for the return value (396 bytes, patch)
2016-07-18 09:14 UTC, Gleb Smirnoff
no flags Details | Diff
second fix for the return value (600 bytes, patch)
2016-07-18 10:32 UTC, Gleb Smirnoff
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Edward Tomasz Napierala freebsd_committer freebsd_triage 2016-07-07 07:21:23 UTC
11-CURRENT panics every ten minutes or so with "panic: bogus refcnt 0 on lle 0xfffff80004608c00" message.  This seems to have been introduced in the last two weeks or so.  Backtrace:

#8  0xffffffff80a89c2b in kdb_enter (why=0xffffffff813e3899 "panic", 
    msg=0x80 <Address 0x80 out of bounds>) at cpufunc.h:63
#9  0xffffffff80a4a7af in vpanic (fmt=<value optimized out>, 
    ap=0xfffffe02327b2530)
    at /usr/home/trasz/svn/head/sys/kern/kern_shutdown.c:752
#10 0xffffffff80a4a606 in kassert_panic (fmt=<value optimized out>)
    at /usr/home/trasz/svn/head/sys/kern/kern_shutdown.c:649
#11 0xffffffff80b3dd26 in llentry_free (lle=<value optimized out>)
    at /usr/home/trasz/svn/head/sys/net/if_llatbl.c:432
#12 0xffffffff80ba6f30 in in_lltable_free_entry (llt=0xfffff800079cf000, 
    lle=0xfffff80004608c00) at /usr/home/trasz/svn/head/sys/netinet/in.c:1144
#13 0xffffffff80b3e5fe in htable_prefix_free (llt=<value optimized out>, 
    addr=<value optimized out>, mask=<value optimized out>, 
    flags=<value optimized out>)
    at /usr/home/trasz/svn/head/sys/net/if_llatbl.c:673
#14 0xffffffff80b3e2cd in lltable_prefix_free (af=<value optimized out>, 
    addr=<value optimized out>, mask=<value optimized out>, 
    flags=<value optimized out>)
    at /usr/home/trasz/svn/head/sys/net/if_llatbl.c:599
#15 0xffffffff80ba5ff0 in in_scrubprefix (target=<value optimized out>, 
    flags=<value optimized out>)
    at /usr/home/trasz/svn/head/sys/netinet/in.c:757
#16 0xffffffff80ba5a25 in in_difaddr_ioctl (data=<value optimized out>, 
    ifp=<value optimized out>, td=<value optimized out>)
    at /usr/home/trasz/svn/head/sys/netinet/in.c:608
#17 0xffffffff80ba4bf6 in in_control (so=<value optimized out>, 
    cmd=2149607705, data=0xfffffe02327b28f0 "wlan0", ifp=0xfffff80007f60800, 
    td=0xfffff8003fd88500) at /usr/home/trasz/svn/head/sys/netinet/in.c:248
#18 0xffffffff80b35b81 in ifioctl (so=<value optimized out>, 
    cmd=<value optimized out>, data=<value optimized out>, 
    td=<value optimized out>) at /usr/home/trasz/svn/head/sys/net/if.c:2840
#19 0xffffffff80aae786 in kern_ioctl (td=<value optimized out>, 
    fd=<value optimized out>, com=<value optimized out>, 
    data=<value optimized out>) at file.h:327
#20 0xffffffff80aae4d1 in sys_ioctl (td=<value optimized out>, 
    uap=0xfffffe02327b2a40)
    at /usr/home/trasz/svn/head/sys/kern/sys_generic.c:743
#21 0xffffffff80ebbb2b in amd64_syscall (td=0xfffff8003fd88500, traced=0)
    at subr_syscall.c:135
#22 0xffffffff80e9bf7b in Xfast_syscall ()
    at /usr/home/trasz/svn/head/sys/amd64/amd64/exception.S:396
Comment 1 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 13:30:07 UTC
I have 2 vmcore's from similar panics.
Comment 2 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 13:58:50 UTC
Created attachment 172201 [details]
crash #1
Comment 3 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 13:59:13 UTC
Created attachment 172202 [details]
Crash #2
Comment 4 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 13:59:50 UTC
I attached the core.txt files from my 2 crashes.  I *DO* have the vmcore's available if a dev wants to look.
Comment 5 Ivan Klymenko 2016-07-07 14:12:31 UTC
(In reply to Larry Rosenman from comment #4)
look man crashinfo
Comment 6 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 14:14:13 UTC
(In reply to Ivan Klymenko from comment #5)
That runs automatically, and produces the aforementioned core.txt's....
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2016-07-07 18:11:29 UTC
Does reverting r302081 fix the problem?
Comment 8 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 18:25:37 UTC
(In reply to Mark Johnston from comment #7)
It's been random enough that I don't know if it would help or not :(
Comment 9 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 22:45:39 UTC
I got 2 more crashes this afternoon.  I'll see if reverting r302081 helps.

I have core.txt's and vmcores from these as well.
Comment 10 Larry Rosenman freebsd_committer freebsd_triage 2016-07-07 23:02:54 UTC
Created attachment 172217 [details]
crash after revert

That apparently DID NOT help.  See the core.txt I uploaded.
Comment 11 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-08 19:54:58 UTC
Created attachment 172254 [details]
Quick and dirty patch to try

Hi, could you try this one? It stops crashes for me on my CI20 whereas before it panicking reliable on every nfsroot boot. It does appear to me that arp code assumer that callout_stop on the timer that was never started should return 0, which is not true somehow.

Admittedly, I did not keep tabs on recent callout refactorings, so I cannot make the final determination about the validity of the patch.
Comment 12 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-08 20:19:53 UTC
I suggest trying to revert https://svnweb.freebsd.org/changeset/base/302350 also, I suspect this is the commit that introduced the behaviour.
Comment 13 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 00:53:24 UTC
One more comment: the patch addresses only one instance of callout_stop use and is not enough to cover all possible panics, ipv6 is being one example.
Comment 14 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:01:02 UTC
So, should I do the revert of 302350 AND apply your patch or just apply your patch?  I've already reverted 302081.
Comment 15 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 01:04:34 UTC
Just revert the change and do not apply my diff
Comment 16 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:25:30 UTC
Done.  Now we wait :)
Comment 17 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:28:57 UTC
Created attachment 172266 [details]
post revert crash on boot :(
Comment 18 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:30:05 UTC
That latest crash is post revert of 302350
Comment 19 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 01:31:21 UTC
Please do check again that you run the kernel with the change reverted. I'll allow myself not to believe you since reverting _does_ fix the panic here and it is a reliable one.
Comment 20 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 01:31:34 UTC
Please do check again that you run the kernel with the change reverted. I'll allow myself not to believe you since reverting _does_ fix the panic here and it is a reliable one.
Comment 21 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 01:32:23 UTC
Somehow bugzilla eats comments?
Comment 22 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:34:20 UTC
I did:
cd /usr/src
svn merge -c -302350 .
nohup make buildworld -j 20 buildkernel >make.k.out 2>&1
make installkernel 
shutdown -r now

this is with bdrewery's meta-mode stuff.
Comment 23 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 01:38:04 UTC
If, after all, you do panic after really, really, really reverting the change, please print lle->lle_timer from in_lltable_free_entry context.

In above command sequence do not forget 'svn diff' to verify it did what it supposed to do.
Comment 24 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:41:47 UTC
borg.lerctr.org /var/crash # kgdb -c vmcore.last  /boot/kernel/kernel
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:

panic: bogus refcnt 0 on lle 0xfffff80116196a00
cpuid = 5
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe2eb59085b0
vpanic() at vpanic+0x182/frame 0xfffffe2eb5908630
kassert_panic() at kassert_panic+0x126/frame 0xfffffe2eb59086a0
llentry_free() at llentry_free+0x136/frame 0xfffffe2eb59086d0
in_lltable_free_entry() at in_lltable_free_entry+0xb0/frame 0xfffffe2eb5908700
htable_prefix_free() at htable_prefix_free+0xce/frame 0xfffffe2eb5908760
lltable_prefix_free() at lltable_prefix_free+0x5d/frame 0xfffffe2eb59087a0
in_scrubprefix() at in_scrubprefix+0x290/frame 0xfffffe2eb5908840
in_difaddr_ioctl() at in_difaddr_ioctl+0x285/frame 0xfffffe2eb5908890
in_control() at in_control+0x96/frame 0xfffffe2eb5908910
ifioctl() at ifioctl+0xda1/frame 0xfffffe2eb59089a0
kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe2eb5908a00
sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe2eb5908ae0
amd64_syscall() at amd64_syscall+0x2c1/frame 0xfffffe2eb5908bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe2eb5908bf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800fd2eaa, rsp = 0x7fffffffe418, rbp = 0x7fffffffe460 ---
Uptime: 4h47m13s
Dumping 4623 out of 64465 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux.ko
Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux_common.ko
Reading symbols from /boot/kernel/if_lagg.ko...Reading symbols from /usr/lib/debug//boot/kernel/if_lagg.ko.debug...done.
done.
Loaded symbols for /boot/kernel/if_lagg.ko
Reading symbols from /boot/kernel/snd_envy24ht.ko...Reading symbols from /usr/lib/debug//boot/kernel/snd_envy24ht.ko.debug...done.
done.
Loaded symbols for /boot/kernel/snd_envy24ht.ko
Reading symbols from /boot/kernel/snd_spicds.ko...Reading symbols from /usr/lib/debug//boot/kernel/snd_spicds.ko.debug...done.
done.
Loaded symbols for /boot/kernel/snd_spicds.ko
Reading symbols from /boot/kernel/coretemp.ko...Reading symbols from /usr/lib/debug//boot/kernel/coretemp.ko.debug...done.
done.
Loaded symbols for /boot/kernel/coretemp.ko
Reading symbols from /boot/kernel/fuse.ko...Reading symbols from /usr/lib/debug//boot/kernel/fuse.ko.debug...done.
done.
Loaded symbols for /boot/kernel/fuse.ko
Reading symbols from /boot/kernel/ichsmb.ko...Reading symbols from /usr/lib/debug//boot/kernel/ichsmb.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ichsmb.ko
Reading symbols from /boot/kernel/smbus.ko...Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...done.
done.
Loaded symbols for /boot/kernel/smbus.ko
Reading symbols from /boot/kernel/ichwd.ko...Reading symbols from /usr/lib/debug//boot/kernel/ichwd.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ichwd.ko
Reading symbols from /boot/kernel/cpuctl.ko...Reading symbols from /usr/lib/debug//boot/kernel/cpuctl.ko.debug...done.
done.
Loaded symbols for /boot/kernel/cpuctl.ko
Reading symbols from /boot/kernel/cryptodev.ko...Reading symbols from /usr/lib/debug//boot/kernel/cryptodev.ko.debug...done.
done.
Loaded symbols for /boot/kernel/cryptodev.ko
Reading symbols from /boot/kernel/dtraceall.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtraceall.ko.debug...done.
done.
Loaded symbols for /boot/kernel/dtraceall.ko
Reading symbols from /boot/kernel/profile.ko...Reading symbols from /usr/lib/debug//boot/kernel/profile.ko.debug...done.
done.
Loaded symbols for /boot/kernel/profile.ko
Reading symbols from /boot/kernel/dtrace.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtrace.ko.debug...done.
done.
Loaded symbols for /boot/kernel/dtrace.ko
Reading symbols from /boot/kernel/systrace_freebsd32.ko...Reading symbols from /usr/lib/debug//boot/kernel/systrace_freebsd32.ko.debug...done.
done.
Loaded symbols for /boot/kernel/systrace_freebsd32.ko
Reading symbols from /boot/kernel/systrace.ko...Reading symbols from /usr/lib/debug//boot/kernel/systrace.ko.debug...done.
done.
Loaded symbols for /boot/kernel/systrace.ko
Reading symbols from /boot/kernel/sdt.ko...Reading symbols from /usr/lib/debug//boot/kernel/sdt.ko.debug...done.
done.
Loaded symbols for /boot/kernel/sdt.ko
Reading symbols from /boot/kernel/fasttrap.ko...Reading symbols from /usr/lib/debug//boot/kernel/fasttrap.ko.debug...done.
done.
Loaded symbols for /boot/kernel/fasttrap.ko
Reading symbols from /boot/kernel/fbt.ko...Reading symbols from /usr/lib/debug//boot/kernel/fbt.ko.debug...done.
done.
Loaded symbols for /boot/kernel/fbt.ko
Reading symbols from /boot/kernel/dtnfscl.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtnfscl.ko.debug...done.
done.
Loaded symbols for /boot/kernel/dtnfscl.ko
Reading symbols from /boot/kernel/dtmalloc.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtmalloc.ko.debug...done.
done.
Loaded symbols for /boot/kernel/dtmalloc.ko
Reading symbols from /boot/kernel/ipmi.ko...Reading symbols from /usr/lib/debug//boot/kernel/ipmi.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ipmi.ko
Reading symbols from /boot/kernel/ipmi_linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/ipmi_linux.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ipmi_linux.ko
Reading symbols from /boot/kernel/hwpmc.ko...Reading symbols from /usr/lib/debug//boot/kernel/hwpmc.ko.debug...done.
done.
Loaded symbols for /boot/kernel/hwpmc.ko
Reading symbols from /boot/kernel/filemon.ko...Reading symbols from /usr/lib/debug//boot/kernel/filemon.ko.debug...done.
done.
Loaded symbols for /boot/kernel/filemon.ko
Reading symbols from /boot/kernel/uhid.ko...Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug...done.
done.
Loaded symbols for /boot/kernel/uhid.ko
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/nullfs.ko
Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/fdescfs.ko
Reading symbols from /boot/kernel/linux64.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux64.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...done.
done.
Loaded symbols for /boot/kernel/pf.ko
#0  doadump (textdump=1) at pcpu.h:221
221	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:221
#1  0xffffffff80a48945 in kern_reboot (howto=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a48f1b in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80a48d56 in kassert_panic (fmt=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:649
#4  0xffffffff80b351d6 in llentry_alloc (ifp=0x0, lt=0x0, dst=0x0)
    at /usr/src/sys/net/if_llatbl.c:445
#5  0xffffffff80b9e3e0 in in_lltable_free_entry (llt=0xfffff80116196a00,
    lle=0xfffff80116196a00) at counter.h:85
#6  0xffffffff80b35aae in htable_prefix_free (llt=<value optimized out>,
    addr=<value optimized out>, mask=<value optimized out>,
    flags=<value optimized out>) at /usr/src/sys/net/if_llatbl.c:231
#7  0xffffffff80b3577d in lltable_prefix_free (af=<value optimized out>,
    addr=<value optimized out>, mask=<value optimized out>,
    flags=<value optimized out>) at /usr/src/sys/net/if_llatbl.c:601
#8  0xffffffff80b9d4a0 in in_scrubprefix (target=<value optimized out>,
    flags=<value optimized out>) at /usr/src/sys/netinet/in.c:888
#9  0xffffffff80b9ced5 in in_difaddr_ioctl (data=<value optimized out>,
    ifp=<value optimized out>, td=<value optimized out>)
    at /usr/src/sys/netinet/in.c:618
#10 0xffffffff80b9c0a6 in in_control (so=<value optimized out>,
    cmd=<value optimized out>, data=0xfffffe2eb5908a30 "lagg0",
---Type <return> to continue, or q <return> to quit---
    ifp=0xfffff8007534a800, td=0xfffff80067f3fa00)
    at /usr/src/sys/netinet/in.c:247
#11 0xffffffff80b2d031 in ifioctl (so=<value optimized out>,
    cmd=<value optimized out>, data=<value optimized out>,
    td=<value optimized out>) at /usr/src/sys/net/if.c:2842
#12 0xffffffff80aa7616 in kern_ioctl (td=<value optimized out>,
    fd=<value optimized out>, com=<value optimized out>,
    data=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:838
#13 0xffffffff80aa7361 in sys_ioctl (td=<value optimized out>,
    uap=0xfffffe2eb5908b80) at /usr/src/sys/kern/sys_generic.c:745
#14 0xffffffff80eb4af1 in amd64_syscall (td=0xfffff80067f3fa00, traced=0)
    at subr_syscall.c:135
#15 0xffffffff80e9418b in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:400
#16 0x0000000800fd2eaa in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
(kgdb) fr 5
#5  0xffffffff80b9e3e0 in in_lltable_free_entry (llt=0xfffff80116196a00,
    lle=0xfffff80116196a00) at counter.h:85
85	counter.h: No such file or directory.
	in counter.h
(kgdb) print lle->lle_timer
$1 = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0},
    tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0,
  c_arg = 0x0, c_func = 0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0}
(kgdb) list
80	in counter.h
(kgdb)
Comment 25 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-09 01:49:52 UTC
This is most definitely a callout that never run and was never scheduled, just as I suspected. I'd say I am still not believing you about reverting that change.
Please give me your svn diff and clean kernel build log somewhere, without incremental, meta or other bdrewery magic.
Comment 26 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:57:08 UTC
Created attachment 172268 [details]
svn diff
Comment 27 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 01:57:29 UTC
kernel recompile in the works
Comment 28 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 02:20:00 UTC
Created attachment 172272 [details]
build log

reeBSD 11.0-ALPHA6 (VT-LER) #9 r302379M: Fri Jul  8 21:13:03 CDT 2016
Comment 29 Larry Rosenman freebsd_committer freebsd_triage 2016-07-09 02:20:23 UTC
now we wait some more
Comment 30 Kubilay Kocak freebsd_committer freebsd_triage 2016-07-09 14:44:10 UTC
CC original committer of r302350
Comment 31 Konstantin Belousov freebsd_committer freebsd_triage 2016-07-09 17:28:24 UTC
Another set of problem from the same commit
https://people.freebsd.org/~pho/stress/log/kostik917.txt

kern_exit.c:exit1() does callout_stop(&p->p_itcallout) == 0.

diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
index 7b73236..929c2c0 100644
--- a/sys/kern/kern_exit.c
+++ b/sys/kern/kern_exit.c
@@ -344,7 +344,7 @@ exit1(struct thread *td, int rval, int signo)
         * executing, prevent it from rearming itself and let it finish.
         */
        if (timevalisset(&p->p_realtimer.it_value) &&
-           callout_stop(&p->p_itcallout) == 0) {
+           _callout_stop_safe(&p->p_itcallout, CS_EXECUTING, NULL) == 0) {
                timevalclear(&p->p_realtimer.it_interval);
                msleep(&p->p_itcallout, &p->p_mtx, PWAIT, "ritwait", 0);
                KASSERT(!timevalisset(&p->p_realtimer.it_value),
Comment 32 Shawn Webb 2016-07-10 01:43:19 UTC
I've been hit with this bug, too. I'm running HardenedBSD 12-CURRENT/amd64 in an IPv6-enabled bhyve guest. It happens reliably around 15 seconds after booting to the login prompt.
Comment 33 Shawn Webb 2016-07-10 02:12:23 UTC
I just disabled IPv6 on the VM and still get the panic.
Comment 34 Alexander Kabaev freebsd_committer freebsd_triage 2016-07-10 02:29:16 UTC
It does not help to disable ipv6 if ipv4 is affected as well.
Please to try to revert r302350.
Comment 35 Larry Rosenman freebsd_committer freebsd_triage 2016-07-10 02:31:49 UTC
so far, with the kernel from the build log I posted:

borg.lerctr.org ~ $ w
 9:30PM  up 1 day, 14 mins, 1 users, load averages: 4.96, 5.96, 5.80
USER       TTY      FROM                                      LOGIN@  IDLE WHAT
ler        pts/0    2001:470:1f0f:42c::1748                   9:30PM     - w
borg.lerctr.org ~ $

so, I think kan@ is on to something.
Comment 36 Shawn Webb 2016-07-11 16:44:57 UTC
Reverting r302350 works for me.
Comment 37 Larry Rosenman freebsd_committer freebsd_triage 2016-07-11 17:32:31 UTC
(In reply to Shawn Webb from comment #36)
Same here.  Looks good:
borg.lerctr.org ~ $ uptime
12:31PM  up 2 days, 15:15, 1 users, load averages: 6.79, 5.84, 6.01
borg.lerctr.org ~ $ uname -a
FreeBSD borg.lerctr.org 11.0-ALPHA6 FreeBSD 11.0-ALPHA6 #9 r302379M: Fri Jul  8 21:13:03 CDT 2016     root@borg.lerctr.org:/usr/obj/usr/src/sys/VT-LER  amd64
borg.lerctr.org ~ $

Can we get it reverted in the 11 & 12 trees until it's repaired?
Comment 38 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-12 11:08:09 UTC
Yes, looks like my patch changed return value for callout_stop() on a callout that wasn't scheduled at all, which was not intended. Can you please test the attached patch, reverting any alternative patches?
Comment 39 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-12 11:09:40 UTC
Created attachment 172407 [details]
suggested patch: fixes return value for a never scheduled callout

Please try this patch reverting any other changes.
Comment 40 Larry Rosenman freebsd_committer freebsd_triage 2016-07-12 15:00:46 UTC
I get a panic:
http://www.lerctr.org/~ler/FreeBSD/Screen%20Shot%202016-07-12%20at%209.56.32%20AM.png

this is on boot. 
with the only change from r302379 being the patch from Gleb.
Comment 41 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-12 16:57:11 UTC
Strange, callout_handle_init() doesn't have any queue(3) macros in it, but the panic messages is from queue(3) internal assertion.
Comment 42 Larry Rosenman freebsd_committer freebsd_triage 2016-07-12 19:57:57 UTC
What can I provide to help?
Comment 43 Maxim Sobolev freebsd_committer freebsd_triage 2016-07-14 03:32:37 UTC
Gleb,

The patch in the attachment 172407 [details] has fixed similar issue for me. Original panic is here:

http://sobomax.sippysoft.com/DSC00012.JPG

Reported:

https://docs.freebsd.org/cgi/getmsg.cgi?fetch=332577+0+current/freebsd-stable

JFYI. I don't know why it's not working for Larry.

-Max
Comment 44 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-14 07:42:07 UTC
Like stated in D7042, there are dependencies in the code that expects callout_reset() to work like this:

int
callout_reset()
{
    atomic_lock();
    retval = callout_stop();
    restart callout();
    atomic_unlock();

    return (retval);
}

r302350 fundamentally changes that.


D7042 has many open questions that are not answered and I think it should be reverted from head and 11-stable until the discussions are complete.
Comment 45 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-14 07:43:43 UTC
Possible solution to using multiple locks in callouts and atomic cancel:

https://svnweb.freebsd.org/changeset/base/302768

--HPS
Comment 46 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 08:54:23 UTC
Hans,

I don't see callout_stop() being called indirectly via callout_reset(). Can you please give me a pointer?
Comment 47 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 08:56:07 UTC
Larry,

can you please make a clean kernel build and retry with that patch? Your panic screenshot looks really strange. The panic message doesn't match the code path.

Is that hardware or a virtual machine? Can you please provide me information on how I can reproduce that?
Comment 48 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 08:58:25 UTC
Has anyone except Larry and Maxim tried my patch?
Comment 49 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 10:28:11 UTC
Larry, I understand your panic with the patch. Please don't bother with reproducing.
Comment 50 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 11:30:44 UTC
Created attachment 172505 [details]
suggested patch: fixes return value for a never scheduled callout
Comment 51 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 11:31:12 UTC
Larry, can you please test attachment 172505 [details]?
Comment 52 Larry Rosenman freebsd_committer freebsd_triage 2016-07-14 12:13:05 UTC
reverted the other patch and applied 172505, and we're up. 

Now to wait a bit :) 

to answer your questions:
It's real hardware

What was happening that made you confused?

Is there something weird in my HW?
Comment 53 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-14 12:16:18 UTC
Everything fine with your hw :) The trace had a function that shouldn't be there. That probably is a problem with kernel debugger. But looking at code I understood what should really be there and understood the panic.

Thanks for help!
Comment 54 Larry Rosenman freebsd_committer freebsd_triage 2016-07-14 12:19:09 UTC
My Pleasure. 

Let's see if we get 48 hours of uptime :)
Comment 55 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-14 19:22:42 UTC
(In reply to Gleb Smirnoff from comment #46)

Look at this piece of code:

        if (tick < 0) {
                ln->la_expire = 0;
                ln->ln_ntick = 0;
                canceled = callout_stop(&ln->lle_timer);
        } else {
                ln->la_expire = time_uptime + tick / hz;
                LLE_ADDREF(ln);
                if (tick > INT_MAX) {
                        ln->ln_ntick = tick - INT_MAX;
                        canceled = callout_reset(&ln->lle_timer, INT_MAX,
                            nd6_llinfo_timer, ln);
                } else {
                        ln->ln_ntick = 0;
                        canceled = callout_reset(&ln->lle_timer, tick,
                            nd6_llinfo_timer, ln);
                }
        }
        if (canceled > 0)
                LLE_REMREF(ln);

It expects both callout_stop() and callout_reset() to return 1, in case the callout was scheduled, regardless of being serviced or not ????

--HPS
Comment 56 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-15 04:21:42 UTC
Please also test with more than the TCP stack!

On 07/15/16 05:45, Matthew Macy wrote:
> glebius last commit needs some further re-work.

Hi,

Glebius commit needs to be backed out, at least the API change that changes the return value when calling callout_stop() when the callout is scheduled and being serviced. Simply because there is code out there, like Mattew and others have discovered that is "refcounting" on the callout_reset() and expecting that a subsequent callout_stop() will return 1 to "unref".

If you consider this impossible, maybe a fourth return value is needed for CANCELLED and DRAINING .

Further, getting the callouts straight in the TCP stack is a matter of doing the locking correctly, which some has called "my magic bullet" and not the return values. I've proposed in the following revision https://svnweb.freebsd.org/changeset/base/302768 to add a new callout API that accepts a locking function so that the callout code can run its cancelled checks at the right place for situations where more than one lock is needed.

Consider this case:

> void
> tcp_timer_2msl(void *xtp)
> {
>         struct tcpcb *tp = xtp;
>         struct inpcb *inp;
>         CURVNET_SET(tp->t_vnet);
> #ifdef TCPDEBUG
>         int ostate;
> 
>         ostate = tp->t_state;
> #endif
>         INP_INFO_RLOCK(&V_tcbinfo);
>         inp = tp->t_inpcb;
>         KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp));
>         INP_WLOCK(inp);
>         tcp_free_sackholes(tp);
>         if (callout_pending(&tp->t_timers->tt_2msl) ||
>             !callout_active(&tp->t_timers->tt_2msl)) {

Here we have custom in-house race check that doesn't affect the return value of callout_reset() nor callout_stop(). 

>                 INP_WUNLOCK(tp->t_inpcb);
>                 INP_INFO_RUNLOCK(&V_tcbinfo);
>                 CURVNET_RESTORE();
>                 return;


I propose the following solution:

> 
> static void
> tcp_timer_2msl_lock(void *xtp, int do_lock)
> {
> 	struct tcpcb *tp = xtp;
> 	struct inpcb *inp;
> 
> 	inp = tp->t_inpcb;
> 
> 	if (do_lock) {
> 		CURVNET_SET(tp->t_vnet);
> 		INP_INFO_RLOCK(&V_tcbinfo);
> 		INP_WLOCK(inp);
> 	} else {
> 		INP_WUNLOCK(inp);
> 		INP_INFO_RUNLOCK(&V_tcbinfo);
> 		CURVNET_RESTORE();
> 	}
> }
> 

callout_init_lock_function(&callout, &tcp_timer_2msl_lock, CALLOUT_RETURNUNLOCKED);

Then in softclock_call_cc() it will look like this:

> 
> 	CC_UNLOCK(cc);
> 	if (c_lock != NULL) {
> 		if (have locking function)
> 			tcp_timer_2msl_lock(c_arg, 1);
> 		else
> 			class->lc_lock(c_lock, lock_status);
> 		/*
> 		 * The callout may have been cancelled
> 		 * while we switched locks.
> 		 */

Actually "CC_LOCK(cc)" should be in-front of cc_exec_cancel() to avoid races testing, setting and clearing this variable, like done in hps_head.

> 		if (cc_exec_cancel(cc, direct)) {
> 			if (have locking function)
> 				tcp_timer_2msl_lock(c_arg, 0);
> 			else
> 				class->lc_unlock(c_lock);
> 			goto skip;
>                }
>                cc_exec_cancel(cc, direct) = true;
> 
> ....
> 
> skip:
> 	if ((c_iflags & CALLOUT_RETURNUNLOCKED) == 0) {
> 		if (have locking function)
> 			...
> 		else
> 			class->lc_unlock(c_lock);
> 	}

The whole point about this is to make the the cancelled check atomic.

1) Lock TCP
2) Lock CC_LOCK()
3) change callout state

--HPS
Comment 57 commit-hook freebsd_committer freebsd_triage 2016-07-15 09:29:26 UTC
A commit references this bug:

Author: glebius
Date: Fri Jul 15 09:28:33 UTC 2016
New revision: 302894
URL: https://svnweb.freebsd.org/changeset/base/302894

Log:
  Fix regression introduced by r302350. The change of return value for a
  callout that wasn't scheduled at all was unintentional and yielded in
  several panics.

  PR:		210884

Changes:
  head/sys/kern/kern_timeout.c
Comment 58 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-15 13:32:00 UTC
Did you check that you now don't have a memory leakage with LLE entries?
Comment 59 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-15 14:26:29 UTC
(In reply to commit-hook from comment #57)

Hi,

I think r302894 and r302350 changes the return value of the following case, which is not described in the commit message? Is this also a regression?

In this revision:
> https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=markup&pathrev=296320

Assume we enter _callout_stop_safe() having the following assertions:

(c->c_iflags & CALLOUT_PENDING) == 0 (satisfied)
cc_exec_curr(cc, direct) != c (satisfied)

Then we exit returning (-1).

In this revision:
> https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=markup&pathrev=296320#l1253

After your changes, entering the same function under the same conditions:

cc_exec_curr(cc, direct) == c (not satisifed)
(c->c_iflags & CALLOUT_PENDING) == 0 (satisfied)

Then we exit returning (0).

> https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?revision=302894&view=markup#l1384

If we call callout_stop() on a never scheduled callout, we now get a return value of 0 instead of -1, which by manual page definition is wrong ????


Am I wrong? Do others see this too?


--HPS
Comment 60 Larry Rosenman freebsd_committer freebsd_triage 2016-07-16 15:25:45 UTC
Looks like gleb's one liner is a winner:

borg.lerctr.org ~ $ w
10:24AM  up 2 days,  3:15, 1 users, load averages: 6.26, 6.40, 6.26
USER       TTY      FROM                                      LOGIN@  IDLE WHAT
ler        pts/0    192.168.201.205                          10:24AM     - w
borg.lerctr.org ~ $

Thanks for the help.
Comment 61 commit-hook freebsd_committer freebsd_triage 2016-07-18 04:30:54 UTC
A commit references this bug:

Author: kib
Date: Mon Jul 18 04:30:35 UTC 2016
New revision: 302981
URL: https://svnweb.freebsd.org/changeset/base/302981

Log:
  Fix another bug after r302350.

  Reported and tested by:	pho
  PR:	210884
  Sponsored by:	The FreeBSD Foundation
  MFC after:	3 days

Changes:
  head/sys/kern/kern_exit.c
Comment 62 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-18 09:14:36 UTC
Created attachment 172634 [details]
second fix for the return value

The attachment 172505 [details] brought the return value to the value that we had in stable/10. I've overseen that the addition of callout_async_drain() also covertly changed the return value of callout_stop().

The new patch sets the value to the new head/11 negative value. The patch should work as well as the old one, but extra testing is now required :(
Comment 63 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-18 09:15:28 UTC
Larry, can you please turn on testing with attachment 172634 [details]?
Comment 64 commit-hook freebsd_committer freebsd_triage 2016-07-18 09:26:17 UTC
A commit references this bug:

Author: glebius
Date: Mon Jul 18 09:26:06 UTC 2016
New revision: 302997
URL: https://svnweb.freebsd.org/changeset/base/302997

Log:
  Redo the r302894: the very new value for a non-scheduled callout is -1.
  This was recently added in r290664.

  Noticed by:	hselasky
  PR:		210884

Changes:
  head/sys/kern/kern_timeout.c
Comment 65 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-18 10:32:13 UTC
Created attachment 172639 [details]
second fix for the return value

The patch in attachment 172634 [details] was incorrect.
Comment 66 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-18 10:33:06 UTC
Larry, can you please test attachment 172639 [details]? It is applied on top of head, overwriting the previous version of the patch.
Comment 67 Larry Rosenman freebsd_committer freebsd_triage 2016-07-18 11:40:56 UTC
I've applied the attachment, but since I froze where I am, do I need to apply:
https://svnweb.freebsd.org/base/head/sys/kern/kern_exit.c?r1=302981&r2=302980&pathrev=302981

as well?
Comment 68 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-18 15:43:06 UTC
Up to you. The change isn't related exactly to your problem. My suggestion is to grab fresh head + attachment 172639 [details]. That will include the kern_exit change.
Comment 69 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-18 18:33:40 UTC
Sorry to bug you again, but should line 1384 which you corrected to -1, be conditionally -1 ?

if (cc_exec_curr(cc, direct) == c)
return (0);
else
return (-1);

Because the two cases fall-through to the same piece of code.

--HPS
Comment 70 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-18 18:38:43 UTC
Correcting myself:

if (cc_exec_curr(cc, direct) == c)
return ((flags & CS_EXECUTING) != 0);
else
return (-1);

I think this was the case before.

--HPS
Comment 71 Gleb Smirnoff freebsd_committer freebsd_triage 2016-07-18 18:42:24 UTC
Are you looking at the latest attachment 172639 [details]? It is conditional there, and logic is same as in your code. In case if the callout is executing, we override with result of cancellation attempt. If it isn't executing and isn't scheduled we return -1.
Comment 72 Hans Petter Selasky freebsd_committer freebsd_triage 2016-07-18 18:48:19 UTC
No, I was looking somewhere else. That looks much better.

Thank you!

--HPS
Comment 73 Larry Rosenman freebsd_committer freebsd_triage 2016-07-18 19:51:27 UTC
(In reply to Gleb Smirnoff from comment #68)
I've svn up'd and applied 172639.  so far so good, but....

Do I need to add anything else?

borg.lerctr.org ~ $ uname -a
FreeBSD borg.lerctr.org 12.0-CURRENT FreeBSD 12.0-CURRENT #13 r303004M: Mon Jul 18 12:48:20 CDT 2016     root@borg.lerctr.org:/usr/obj/usr/src/sys/VT-LER  amd64
borg.lerctr.org ~ $
Comment 74 commit-hook freebsd_committer freebsd_triage 2016-07-20 16:49:18 UTC
A commit references this bug:

Author: glebius
Date: Wed Jul 20 16:48:26 UTC 2016
New revision: 303098
URL: https://svnweb.freebsd.org/changeset/base/303098

Log:
  Redo the r302894: the very new value for a non-scheduled callout is -1.
  This was recently added in r290664.

  Noticed by:	hselasky
  Tested by:	Larry Rosenman <ler lerctr.org>
  PR:		210884

Changes:
  head/sys/kern/kern_timeout.c
Comment 75 commit-hook freebsd_committer freebsd_triage 2016-07-21 19:19:22 UTC
A commit references this bug:

Author: glebius
Date: Thu Jul 21 19:18:49 UTC 2016
New revision: 303159
URL: https://svnweb.freebsd.org/changeset/base/303159

Log:
  Merge r302894, r303098: fix regression caused by r302350. Stopping a
  not scheduled not running callout should return (-1).

  PR:		210884
  Approved by:	re (gjb)

Changes:
_U  stable/11/
  stable/11/sys/kern/kern_timeout.c
Comment 76 Bryan Drewery freebsd_committer freebsd_triage 2016-08-10 23:29:12 UTC
I believe this is all fixed now?
Comment 77 Glen Barber freebsd_committer freebsd_triage 2016-08-11 01:20:51 UTC
(In reply to Bryan Drewery from comment #76)
> I believe this is all fixed now?

I have not seen any recent PRs that point directly to this, but I'd like to keep the PR open for a while longer to be safe.
Comment 78 Edward Tomasz Napierala freebsd_committer freebsd_triage 2016-11-04 13:25:38 UTC
How about now?  :-)
Comment 79 Ed Maste freebsd_committer freebsd_triage 2016-11-04 13:37:15 UTC
I believe this is fixed with r303098 and that Gleb should close.
Comment 80 Cristy 2018-08-28 08:44:22 UTC
MARKED AS SPAM