11-CURRENT panics every ten minutes or so with "panic: bogus refcnt 0 on lle 0xfffff80004608c00" message. This seems to have been introduced in the last two weeks or so. Backtrace: #8 0xffffffff80a89c2b in kdb_enter (why=0xffffffff813e3899 "panic", msg=0x80 <Address 0x80 out of bounds>) at cpufunc.h:63 #9 0xffffffff80a4a7af in vpanic (fmt=<value optimized out>, ap=0xfffffe02327b2530) at /usr/home/trasz/svn/head/sys/kern/kern_shutdown.c:752 #10 0xffffffff80a4a606 in kassert_panic (fmt=<value optimized out>) at /usr/home/trasz/svn/head/sys/kern/kern_shutdown.c:649 #11 0xffffffff80b3dd26 in llentry_free (lle=<value optimized out>) at /usr/home/trasz/svn/head/sys/net/if_llatbl.c:432 #12 0xffffffff80ba6f30 in in_lltable_free_entry (llt=0xfffff800079cf000, lle=0xfffff80004608c00) at /usr/home/trasz/svn/head/sys/netinet/in.c:1144 #13 0xffffffff80b3e5fe in htable_prefix_free (llt=<value optimized out>, addr=<value optimized out>, mask=<value optimized out>, flags=<value optimized out>) at /usr/home/trasz/svn/head/sys/net/if_llatbl.c:673 #14 0xffffffff80b3e2cd in lltable_prefix_free (af=<value optimized out>, addr=<value optimized out>, mask=<value optimized out>, flags=<value optimized out>) at /usr/home/trasz/svn/head/sys/net/if_llatbl.c:599 #15 0xffffffff80ba5ff0 in in_scrubprefix (target=<value optimized out>, flags=<value optimized out>) at /usr/home/trasz/svn/head/sys/netinet/in.c:757 #16 0xffffffff80ba5a25 in in_difaddr_ioctl (data=<value optimized out>, ifp=<value optimized out>, td=<value optimized out>) at /usr/home/trasz/svn/head/sys/netinet/in.c:608 #17 0xffffffff80ba4bf6 in in_control (so=<value optimized out>, cmd=2149607705, data=0xfffffe02327b28f0 "wlan0", ifp=0xfffff80007f60800, td=0xfffff8003fd88500) at /usr/home/trasz/svn/head/sys/netinet/in.c:248 #18 0xffffffff80b35b81 in ifioctl (so=<value optimized out>, cmd=<value optimized out>, data=<value optimized out>, td=<value optimized out>) at /usr/home/trasz/svn/head/sys/net/if.c:2840 #19 0xffffffff80aae786 in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>, data=<value optimized out>) at file.h:327 #20 0xffffffff80aae4d1 in sys_ioctl (td=<value optimized out>, uap=0xfffffe02327b2a40) at /usr/home/trasz/svn/head/sys/kern/sys_generic.c:743 #21 0xffffffff80ebbb2b in amd64_syscall (td=0xfffff8003fd88500, traced=0) at subr_syscall.c:135 #22 0xffffffff80e9bf7b in Xfast_syscall () at /usr/home/trasz/svn/head/sys/amd64/amd64/exception.S:396
I have 2 vmcore's from similar panics.
Created attachment 172201 [details] crash #1
Created attachment 172202 [details] Crash #2
I attached the core.txt files from my 2 crashes. I *DO* have the vmcore's available if a dev wants to look.
(In reply to Larry Rosenman from comment #4) look man crashinfo
(In reply to Ivan Klymenko from comment #5) That runs automatically, and produces the aforementioned core.txt's....
Does reverting r302081 fix the problem?
(In reply to Mark Johnston from comment #7) It's been random enough that I don't know if it would help or not :(
I got 2 more crashes this afternoon. I'll see if reverting r302081 helps. I have core.txt's and vmcores from these as well.
Created attachment 172217 [details] crash after revert That apparently DID NOT help. See the core.txt I uploaded.
Created attachment 172254 [details] Quick and dirty patch to try Hi, could you try this one? It stops crashes for me on my CI20 whereas before it panicking reliable on every nfsroot boot. It does appear to me that arp code assumer that callout_stop on the timer that was never started should return 0, which is not true somehow. Admittedly, I did not keep tabs on recent callout refactorings, so I cannot make the final determination about the validity of the patch.
I suggest trying to revert https://svnweb.freebsd.org/changeset/base/302350 also, I suspect this is the commit that introduced the behaviour.
One more comment: the patch addresses only one instance of callout_stop use and is not enough to cover all possible panics, ipv6 is being one example.
So, should I do the revert of 302350 AND apply your patch or just apply your patch? I've already reverted 302081.
Just revert the change and do not apply my diff
Done. Now we wait :)
Created attachment 172266 [details] post revert crash on boot :(
That latest crash is post revert of 302350
Please do check again that you run the kernel with the change reverted. I'll allow myself not to believe you since reverting _does_ fix the panic here and it is a reliable one.
Somehow bugzilla eats comments?
I did: cd /usr/src svn merge -c -302350 . nohup make buildworld -j 20 buildkernel >make.k.out 2>&1 make installkernel shutdown -r now this is with bdrewery's meta-mode stuff.
If, after all, you do panic after really, really, really reverting the change, please print lle->lle_timer from in_lltable_free_entry context. In above command sequence do not forget 'svn diff' to verify it did what it supposed to do.
borg.lerctr.org /var/crash # kgdb -c vmcore.last /boot/kernel/kernel GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"... Unread portion of the kernel message buffer: panic: bogus refcnt 0 on lle 0xfffff80116196a00 cpuid = 5 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe2eb59085b0 vpanic() at vpanic+0x182/frame 0xfffffe2eb5908630 kassert_panic() at kassert_panic+0x126/frame 0xfffffe2eb59086a0 llentry_free() at llentry_free+0x136/frame 0xfffffe2eb59086d0 in_lltable_free_entry() at in_lltable_free_entry+0xb0/frame 0xfffffe2eb5908700 htable_prefix_free() at htable_prefix_free+0xce/frame 0xfffffe2eb5908760 lltable_prefix_free() at lltable_prefix_free+0x5d/frame 0xfffffe2eb59087a0 in_scrubprefix() at in_scrubprefix+0x290/frame 0xfffffe2eb5908840 in_difaddr_ioctl() at in_difaddr_ioctl+0x285/frame 0xfffffe2eb5908890 in_control() at in_control+0x96/frame 0xfffffe2eb5908910 ifioctl() at ifioctl+0xda1/frame 0xfffffe2eb59089a0 kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe2eb5908a00 sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe2eb5908ae0 amd64_syscall() at amd64_syscall+0x2c1/frame 0xfffffe2eb5908bf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe2eb5908bf0 --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800fd2eaa, rsp = 0x7fffffffe418, rbp = 0x7fffffffe460 --- Uptime: 4h47m13s Dumping 4623 out of 64465 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done. done. Loaded symbols for /boot/kernel/zfs.ko Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done. done. Loaded symbols for /boot/kernel/opensolaris.ko Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done. done. Loaded symbols for /boot/kernel/linux.ko Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done. done. Loaded symbols for /boot/kernel/linux_common.ko Reading symbols from /boot/kernel/if_lagg.ko...Reading symbols from /usr/lib/debug//boot/kernel/if_lagg.ko.debug...done. done. Loaded symbols for /boot/kernel/if_lagg.ko Reading symbols from /boot/kernel/snd_envy24ht.ko...Reading symbols from /usr/lib/debug//boot/kernel/snd_envy24ht.ko.debug...done. done. Loaded symbols for /boot/kernel/snd_envy24ht.ko Reading symbols from /boot/kernel/snd_spicds.ko...Reading symbols from /usr/lib/debug//boot/kernel/snd_spicds.ko.debug...done. done. Loaded symbols for /boot/kernel/snd_spicds.ko Reading symbols from /boot/kernel/coretemp.ko...Reading symbols from /usr/lib/debug//boot/kernel/coretemp.ko.debug...done. done. Loaded symbols for /boot/kernel/coretemp.ko Reading symbols from /boot/kernel/fuse.ko...Reading symbols from /usr/lib/debug//boot/kernel/fuse.ko.debug...done. done. Loaded symbols for /boot/kernel/fuse.ko Reading symbols from /boot/kernel/ichsmb.ko...Reading symbols from /usr/lib/debug//boot/kernel/ichsmb.ko.debug...done. done. Loaded symbols for /boot/kernel/ichsmb.ko Reading symbols from /boot/kernel/smbus.ko...Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...done. done. Loaded symbols for /boot/kernel/smbus.ko Reading symbols from /boot/kernel/ichwd.ko...Reading symbols from /usr/lib/debug//boot/kernel/ichwd.ko.debug...done. done. Loaded symbols for /boot/kernel/ichwd.ko Reading symbols from /boot/kernel/cpuctl.ko...Reading symbols from /usr/lib/debug//boot/kernel/cpuctl.ko.debug...done. done. Loaded symbols for /boot/kernel/cpuctl.ko Reading symbols from /boot/kernel/cryptodev.ko...Reading symbols from /usr/lib/debug//boot/kernel/cryptodev.ko.debug...done. done. Loaded symbols for /boot/kernel/cryptodev.ko Reading symbols from /boot/kernel/dtraceall.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtraceall.ko.debug...done. done. Loaded symbols for /boot/kernel/dtraceall.ko Reading symbols from /boot/kernel/profile.ko...Reading symbols from /usr/lib/debug//boot/kernel/profile.ko.debug...done. done. Loaded symbols for /boot/kernel/profile.ko Reading symbols from /boot/kernel/dtrace.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtrace.ko.debug...done. done. Loaded symbols for /boot/kernel/dtrace.ko Reading symbols from /boot/kernel/systrace_freebsd32.ko...Reading symbols from /usr/lib/debug//boot/kernel/systrace_freebsd32.ko.debug...done. done. Loaded symbols for /boot/kernel/systrace_freebsd32.ko Reading symbols from /boot/kernel/systrace.ko...Reading symbols from /usr/lib/debug//boot/kernel/systrace.ko.debug...done. done. Loaded symbols for /boot/kernel/systrace.ko Reading symbols from /boot/kernel/sdt.ko...Reading symbols from /usr/lib/debug//boot/kernel/sdt.ko.debug...done. done. Loaded symbols for /boot/kernel/sdt.ko Reading symbols from /boot/kernel/fasttrap.ko...Reading symbols from /usr/lib/debug//boot/kernel/fasttrap.ko.debug...done. done. Loaded symbols for /boot/kernel/fasttrap.ko Reading symbols from /boot/kernel/fbt.ko...Reading symbols from /usr/lib/debug//boot/kernel/fbt.ko.debug...done. done. Loaded symbols for /boot/kernel/fbt.ko Reading symbols from /boot/kernel/dtnfscl.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtnfscl.ko.debug...done. done. Loaded symbols for /boot/kernel/dtnfscl.ko Reading symbols from /boot/kernel/dtmalloc.ko...Reading symbols from /usr/lib/debug//boot/kernel/dtmalloc.ko.debug...done. done. Loaded symbols for /boot/kernel/dtmalloc.ko Reading symbols from /boot/kernel/ipmi.ko...Reading symbols from /usr/lib/debug//boot/kernel/ipmi.ko.debug...done. done. Loaded symbols for /boot/kernel/ipmi.ko Reading symbols from /boot/kernel/ipmi_linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/ipmi_linux.ko.debug...done. done. Loaded symbols for /boot/kernel/ipmi_linux.ko Reading symbols from /boot/kernel/hwpmc.ko...Reading symbols from /usr/lib/debug//boot/kernel/hwpmc.ko.debug...done. done. Loaded symbols for /boot/kernel/hwpmc.ko Reading symbols from /boot/kernel/filemon.ko...Reading symbols from /usr/lib/debug//boot/kernel/filemon.ko.debug...done. done. Loaded symbols for /boot/kernel/filemon.ko Reading symbols from /boot/kernel/uhid.ko...Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug...done. done. Loaded symbols for /boot/kernel/uhid.ko Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done. done. Loaded symbols for /boot/kernel/nullfs.ko Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug...done. done. Loaded symbols for /boot/kernel/fdescfs.ko Reading symbols from /boot/kernel/linux64.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug...done. done. Loaded symbols for /boot/kernel/linux64.ko Reading symbols from /boot/kernel/pf.ko...Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...done. done. Loaded symbols for /boot/kernel/pf.ko #0 doadump (textdump=1) at pcpu.h:221 221 pcpu.h: No such file or directory. in pcpu.h (kgdb) bt #0 doadump (textdump=1) at pcpu.h:221 #1 0xffffffff80a48945 in kern_reboot (howto=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:366 #2 0xffffffff80a48f1b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759 #3 0xffffffff80a48d56 in kassert_panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:649 #4 0xffffffff80b351d6 in llentry_alloc (ifp=0x0, lt=0x0, dst=0x0) at /usr/src/sys/net/if_llatbl.c:445 #5 0xffffffff80b9e3e0 in in_lltable_free_entry (llt=0xfffff80116196a00, lle=0xfffff80116196a00) at counter.h:85 #6 0xffffffff80b35aae in htable_prefix_free (llt=<value optimized out>, addr=<value optimized out>, mask=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/net/if_llatbl.c:231 #7 0xffffffff80b3577d in lltable_prefix_free (af=<value optimized out>, addr=<value optimized out>, mask=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/net/if_llatbl.c:601 #8 0xffffffff80b9d4a0 in in_scrubprefix (target=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/netinet/in.c:888 #9 0xffffffff80b9ced5 in in_difaddr_ioctl (data=<value optimized out>, ifp=<value optimized out>, td=<value optimized out>) at /usr/src/sys/netinet/in.c:618 #10 0xffffffff80b9c0a6 in in_control (so=<value optimized out>, cmd=<value optimized out>, data=0xfffffe2eb5908a30 "lagg0", ---Type <return> to continue, or q <return> to quit--- ifp=0xfffff8007534a800, td=0xfffff80067f3fa00) at /usr/src/sys/netinet/in.c:247 #11 0xffffffff80b2d031 in ifioctl (so=<value optimized out>, cmd=<value optimized out>, data=<value optimized out>, td=<value optimized out>) at /usr/src/sys/net/if.c:2842 #12 0xffffffff80aa7616 in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>, data=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:838 #13 0xffffffff80aa7361 in sys_ioctl (td=<value optimized out>, uap=0xfffffe2eb5908b80) at /usr/src/sys/kern/sys_generic.c:745 #14 0xffffffff80eb4af1 in amd64_syscall (td=0xfffff80067f3fa00, traced=0) at subr_syscall.c:135 #15 0xffffffff80e9418b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:400 #16 0x0000000800fd2eaa in ?? () Previous frame inner to this frame (corrupt stack?) Current language: auto; currently minimal (kgdb) (kgdb) fr 5 #5 0xffffffff80b9e3e0 in in_lltable_free_entry (llt=0xfffff80116196a00, lle=0xfffff80116196a00) at counter.h:85 85 counter.h: No such file or directory. in counter.h (kgdb) print lle->lle_timer $1 = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0} (kgdb) list 80 in counter.h (kgdb)
This is most definitely a callout that never run and was never scheduled, just as I suspected. I'd say I am still not believing you about reverting that change. Please give me your svn diff and clean kernel build log somewhere, without incremental, meta or other bdrewery magic.
Created attachment 172268 [details] svn diff
kernel recompile in the works
Created attachment 172272 [details] build log reeBSD 11.0-ALPHA6 (VT-LER) #9 r302379M: Fri Jul 8 21:13:03 CDT 2016
now we wait some more
CC original committer of r302350
Another set of problem from the same commit https://people.freebsd.org/~pho/stress/log/kostik917.txt kern_exit.c:exit1() does callout_stop(&p->p_itcallout) == 0. diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c index 7b73236..929c2c0 100644 --- a/sys/kern/kern_exit.c +++ b/sys/kern/kern_exit.c @@ -344,7 +344,7 @@ exit1(struct thread *td, int rval, int signo) * executing, prevent it from rearming itself and let it finish. */ if (timevalisset(&p->p_realtimer.it_value) && - callout_stop(&p->p_itcallout) == 0) { + _callout_stop_safe(&p->p_itcallout, CS_EXECUTING, NULL) == 0) { timevalclear(&p->p_realtimer.it_interval); msleep(&p->p_itcallout, &p->p_mtx, PWAIT, "ritwait", 0); KASSERT(!timevalisset(&p->p_realtimer.it_value),
I've been hit with this bug, too. I'm running HardenedBSD 12-CURRENT/amd64 in an IPv6-enabled bhyve guest. It happens reliably around 15 seconds after booting to the login prompt.
I just disabled IPv6 on the VM and still get the panic.
It does not help to disable ipv6 if ipv4 is affected as well. Please to try to revert r302350.
so far, with the kernel from the build log I posted: borg.lerctr.org ~ $ w 9:30PM up 1 day, 14 mins, 1 users, load averages: 4.96, 5.96, 5.80 USER TTY FROM LOGIN@ IDLE WHAT ler pts/0 2001:470:1f0f:42c::1748 9:30PM - w borg.lerctr.org ~ $ so, I think kan@ is on to something.
Reverting r302350 works for me.
(In reply to Shawn Webb from comment #36) Same here. Looks good: borg.lerctr.org ~ $ uptime 12:31PM up 2 days, 15:15, 1 users, load averages: 6.79, 5.84, 6.01 borg.lerctr.org ~ $ uname -a FreeBSD borg.lerctr.org 11.0-ALPHA6 FreeBSD 11.0-ALPHA6 #9 r302379M: Fri Jul 8 21:13:03 CDT 2016 root@borg.lerctr.org:/usr/obj/usr/src/sys/VT-LER amd64 borg.lerctr.org ~ $ Can we get it reverted in the 11 & 12 trees until it's repaired?
Yes, looks like my patch changed return value for callout_stop() on a callout that wasn't scheduled at all, which was not intended. Can you please test the attached patch, reverting any alternative patches?
Created attachment 172407 [details] suggested patch: fixes return value for a never scheduled callout Please try this patch reverting any other changes.
I get a panic: http://www.lerctr.org/~ler/FreeBSD/Screen%20Shot%202016-07-12%20at%209.56.32%20AM.png this is on boot. with the only change from r302379 being the patch from Gleb.
Strange, callout_handle_init() doesn't have any queue(3) macros in it, but the panic messages is from queue(3) internal assertion.
What can I provide to help?
Gleb, The patch in the attachment 172407 [details] has fixed similar issue for me. Original panic is here: http://sobomax.sippysoft.com/DSC00012.JPG Reported: https://docs.freebsd.org/cgi/getmsg.cgi?fetch=332577+0+current/freebsd-stable JFYI. I don't know why it's not working for Larry. -Max
Like stated in D7042, there are dependencies in the code that expects callout_reset() to work like this: int callout_reset() { atomic_lock(); retval = callout_stop(); restart callout(); atomic_unlock(); return (retval); } r302350 fundamentally changes that. D7042 has many open questions that are not answered and I think it should be reverted from head and 11-stable until the discussions are complete.
Possible solution to using multiple locks in callouts and atomic cancel: https://svnweb.freebsd.org/changeset/base/302768 --HPS
Hans, I don't see callout_stop() being called indirectly via callout_reset(). Can you please give me a pointer?
Larry, can you please make a clean kernel build and retry with that patch? Your panic screenshot looks really strange. The panic message doesn't match the code path. Is that hardware or a virtual machine? Can you please provide me information on how I can reproduce that?
Has anyone except Larry and Maxim tried my patch?
Larry, I understand your panic with the patch. Please don't bother with reproducing.
Created attachment 172505 [details] suggested patch: fixes return value for a never scheduled callout
Larry, can you please test attachment 172505 [details]?
reverted the other patch and applied 172505, and we're up. Now to wait a bit :) to answer your questions: It's real hardware What was happening that made you confused? Is there something weird in my HW?
Everything fine with your hw :) The trace had a function that shouldn't be there. That probably is a problem with kernel debugger. But looking at code I understood what should really be there and understood the panic. Thanks for help!
My Pleasure. Let's see if we get 48 hours of uptime :)
(In reply to Gleb Smirnoff from comment #46) Look at this piece of code: if (tick < 0) { ln->la_expire = 0; ln->ln_ntick = 0; canceled = callout_stop(&ln->lle_timer); } else { ln->la_expire = time_uptime + tick / hz; LLE_ADDREF(ln); if (tick > INT_MAX) { ln->ln_ntick = tick - INT_MAX; canceled = callout_reset(&ln->lle_timer, INT_MAX, nd6_llinfo_timer, ln); } else { ln->ln_ntick = 0; canceled = callout_reset(&ln->lle_timer, tick, nd6_llinfo_timer, ln); } } if (canceled > 0) LLE_REMREF(ln); It expects both callout_stop() and callout_reset() to return 1, in case the callout was scheduled, regardless of being serviced or not ???? --HPS
Please also test with more than the TCP stack! On 07/15/16 05:45, Matthew Macy wrote: > glebius last commit needs some further re-work. Hi, Glebius commit needs to be backed out, at least the API change that changes the return value when calling callout_stop() when the callout is scheduled and being serviced. Simply because there is code out there, like Mattew and others have discovered that is "refcounting" on the callout_reset() and expecting that a subsequent callout_stop() will return 1 to "unref". If you consider this impossible, maybe a fourth return value is needed for CANCELLED and DRAINING . Further, getting the callouts straight in the TCP stack is a matter of doing the locking correctly, which some has called "my magic bullet" and not the return values. I've proposed in the following revision https://svnweb.freebsd.org/changeset/base/302768 to add a new callout API that accepts a locking function so that the callout code can run its cancelled checks at the right place for situations where more than one lock is needed. Consider this case: > void > tcp_timer_2msl(void *xtp) > { > struct tcpcb *tp = xtp; > struct inpcb *inp; > CURVNET_SET(tp->t_vnet); > #ifdef TCPDEBUG > int ostate; > > ostate = tp->t_state; > #endif > INP_INFO_RLOCK(&V_tcbinfo); > inp = tp->t_inpcb; > KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp)); > INP_WLOCK(inp); > tcp_free_sackholes(tp); > if (callout_pending(&tp->t_timers->tt_2msl) || > !callout_active(&tp->t_timers->tt_2msl)) { Here we have custom in-house race check that doesn't affect the return value of callout_reset() nor callout_stop(). > INP_WUNLOCK(tp->t_inpcb); > INP_INFO_RUNLOCK(&V_tcbinfo); > CURVNET_RESTORE(); > return; I propose the following solution: > > static void > tcp_timer_2msl_lock(void *xtp, int do_lock) > { > struct tcpcb *tp = xtp; > struct inpcb *inp; > > inp = tp->t_inpcb; > > if (do_lock) { > CURVNET_SET(tp->t_vnet); > INP_INFO_RLOCK(&V_tcbinfo); > INP_WLOCK(inp); > } else { > INP_WUNLOCK(inp); > INP_INFO_RUNLOCK(&V_tcbinfo); > CURVNET_RESTORE(); > } > } > callout_init_lock_function(&callout, &tcp_timer_2msl_lock, CALLOUT_RETURNUNLOCKED); Then in softclock_call_cc() it will look like this: > > CC_UNLOCK(cc); > if (c_lock != NULL) { > if (have locking function) > tcp_timer_2msl_lock(c_arg, 1); > else > class->lc_lock(c_lock, lock_status); > /* > * The callout may have been cancelled > * while we switched locks. > */ Actually "CC_LOCK(cc)" should be in-front of cc_exec_cancel() to avoid races testing, setting and clearing this variable, like done in hps_head. > if (cc_exec_cancel(cc, direct)) { > if (have locking function) > tcp_timer_2msl_lock(c_arg, 0); > else > class->lc_unlock(c_lock); > goto skip; > } > cc_exec_cancel(cc, direct) = true; > > .... > > skip: > if ((c_iflags & CALLOUT_RETURNUNLOCKED) == 0) { > if (have locking function) > ... > else > class->lc_unlock(c_lock); > } The whole point about this is to make the the cancelled check atomic. 1) Lock TCP 2) Lock CC_LOCK() 3) change callout state --HPS
A commit references this bug: Author: glebius Date: Fri Jul 15 09:28:33 UTC 2016 New revision: 302894 URL: https://svnweb.freebsd.org/changeset/base/302894 Log: Fix regression introduced by r302350. The change of return value for a callout that wasn't scheduled at all was unintentional and yielded in several panics. PR: 210884 Changes: head/sys/kern/kern_timeout.c
Did you check that you now don't have a memory leakage with LLE entries?
(In reply to commit-hook from comment #57) Hi, I think r302894 and r302350 changes the return value of the following case, which is not described in the commit message? Is this also a regression? In this revision: > https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=markup&pathrev=296320 Assume we enter _callout_stop_safe() having the following assertions: (c->c_iflags & CALLOUT_PENDING) == 0 (satisfied) cc_exec_curr(cc, direct) != c (satisfied) Then we exit returning (-1). In this revision: > https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=markup&pathrev=296320#l1253 After your changes, entering the same function under the same conditions: cc_exec_curr(cc, direct) == c (not satisifed) (c->c_iflags & CALLOUT_PENDING) == 0 (satisfied) Then we exit returning (0). > https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?revision=302894&view=markup#l1384 If we call callout_stop() on a never scheduled callout, we now get a return value of 0 instead of -1, which by manual page definition is wrong ???? Am I wrong? Do others see this too? --HPS
Looks like gleb's one liner is a winner: borg.lerctr.org ~ $ w 10:24AM up 2 days, 3:15, 1 users, load averages: 6.26, 6.40, 6.26 USER TTY FROM LOGIN@ IDLE WHAT ler pts/0 192.168.201.205 10:24AM - w borg.lerctr.org ~ $ Thanks for the help.
A commit references this bug: Author: kib Date: Mon Jul 18 04:30:35 UTC 2016 New revision: 302981 URL: https://svnweb.freebsd.org/changeset/base/302981 Log: Fix another bug after r302350. Reported and tested by: pho PR: 210884 Sponsored by: The FreeBSD Foundation MFC after: 3 days Changes: head/sys/kern/kern_exit.c
Created attachment 172634 [details] second fix for the return value The attachment 172505 [details] brought the return value to the value that we had in stable/10. I've overseen that the addition of callout_async_drain() also covertly changed the return value of callout_stop(). The new patch sets the value to the new head/11 negative value. The patch should work as well as the old one, but extra testing is now required :(
Larry, can you please turn on testing with attachment 172634 [details]?
A commit references this bug: Author: glebius Date: Mon Jul 18 09:26:06 UTC 2016 New revision: 302997 URL: https://svnweb.freebsd.org/changeset/base/302997 Log: Redo the r302894: the very new value for a non-scheduled callout is -1. This was recently added in r290664. Noticed by: hselasky PR: 210884 Changes: head/sys/kern/kern_timeout.c
Created attachment 172639 [details] second fix for the return value The patch in attachment 172634 [details] was incorrect.
Larry, can you please test attachment 172639 [details]? It is applied on top of head, overwriting the previous version of the patch.
I've applied the attachment, but since I froze where I am, do I need to apply: https://svnweb.freebsd.org/base/head/sys/kern/kern_exit.c?r1=302981&r2=302980&pathrev=302981 as well?
Up to you. The change isn't related exactly to your problem. My suggestion is to grab fresh head + attachment 172639 [details]. That will include the kern_exit change.
Sorry to bug you again, but should line 1384 which you corrected to -1, be conditionally -1 ? if (cc_exec_curr(cc, direct) == c) return (0); else return (-1); Because the two cases fall-through to the same piece of code. --HPS
Correcting myself: if (cc_exec_curr(cc, direct) == c) return ((flags & CS_EXECUTING) != 0); else return (-1); I think this was the case before. --HPS
Are you looking at the latest attachment 172639 [details]? It is conditional there, and logic is same as in your code. In case if the callout is executing, we override with result of cancellation attempt. If it isn't executing and isn't scheduled we return -1.
No, I was looking somewhere else. That looks much better. Thank you! --HPS
(In reply to Gleb Smirnoff from comment #68) I've svn up'd and applied 172639. so far so good, but.... Do I need to add anything else? borg.lerctr.org ~ $ uname -a FreeBSD borg.lerctr.org 12.0-CURRENT FreeBSD 12.0-CURRENT #13 r303004M: Mon Jul 18 12:48:20 CDT 2016 root@borg.lerctr.org:/usr/obj/usr/src/sys/VT-LER amd64 borg.lerctr.org ~ $
A commit references this bug: Author: glebius Date: Wed Jul 20 16:48:26 UTC 2016 New revision: 303098 URL: https://svnweb.freebsd.org/changeset/base/303098 Log: Redo the r302894: the very new value for a non-scheduled callout is -1. This was recently added in r290664. Noticed by: hselasky Tested by: Larry Rosenman <ler lerctr.org> PR: 210884 Changes: head/sys/kern/kern_timeout.c
A commit references this bug: Author: glebius Date: Thu Jul 21 19:18:49 UTC 2016 New revision: 303159 URL: https://svnweb.freebsd.org/changeset/base/303159 Log: Merge r302894, r303098: fix regression caused by r302350. Stopping a not scheduled not running callout should return (-1). PR: 210884 Approved by: re (gjb) Changes: _U stable/11/ stable/11/sys/kern/kern_timeout.c
I believe this is all fixed now?
(In reply to Bryan Drewery from comment #76) > I believe this is all fixed now? I have not seen any recent PRs that point directly to this, but I'd like to keep the PR open for a while longer to be safe.
How about now? :-)
I believe this is fixed with r303098 and that Gleb should close.
MARKED AS SPAM