https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11696/
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11697/
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11698/
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11699/

These test runs panic while executing sys.netpfil.pf.names.names or sys.netpfil.pf.synproxy.synproxy, with the message:

00:53:22.137 sys/netpfil/pf/names:names  ->  panic: epair_qflush: ifp=0xfffff80063f15800, epair_softc gone? sc=0

`cd /usr/tests/sys/netpfil/pf && kyua test names:names` can trigger this.

Note: this should not be related to r349508 (the revision of build 11696); it can be reproduced on earlier revisions, e.g. r349507.
The backtrace suggests this is an epair teardown problem:

panic: epair_qflush: ifp=0xfffff80063f15800, epair_softc gone? sc=0
cpuid = 0
time = 1561746423
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0024260750
vpanic() at vpanic+0x19d/frame 0xfffffe00242607a0
panic() at panic+0x43/frame 0xfffffe0024260800
epair_qflush() at epair_qflush+0x1ba/frame 0xfffffe0024260850
if_down() at if_down+0x11d/frame 0xfffffe0024260880
if_detach_internal() at if_detach_internal+0x704/frame 0xfffffe0024260900
if_vmove() at if_vmove+0x3c/frame 0xfffffe0024260950
vnet_if_return() at vnet_if_return+0x48/frame 0xfffffe0024260970
vnet_destroy() at vnet_destroy+0x124/frame 0xfffffe00242609a0
prison_deref() at prison_deref+0x29d/frame 0xfffffe00242609e0
taskqueue_run_locked() at taskqueue_run_locked+0x10c/frame 0xfffffe0024260a40
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe0024260a70
fork_exit() at fork_exit+0x84/frame 0xfffffe0024260ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0024260ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

The pf tests create and destroy many epair interfaces as they start up and stop vnet jails.
Created attachment 205406 [details] epair.patch This seems to fix the panic, and I *think* it might even be correct. I'd quite like bz@ to take a look though.
A commit references this bug:

Author: lwhsu
Date: Sat Jun 29 12:19:57 UTC 2019
New revision: 349539
URL: https://svnweb.freebsd.org/changeset/base/349539

Log:
  Skip sys.netpfil.pf.names.names and sys.netpfil.pf.synproxy.synproxy
  temporarily because kernel panics when flushing epair queue.

  PR:           238870
  Sponsored by: The FreeBSD Foundation

Changes:
  head/tests/sys/netpfil/pf/names.sh
  head/tests/sys/netpfil/pf/synproxy.sh
(In reply to commit-hook from comment #3) I'm pretty sure the other pf tests (and the netipsec tests) risk triggering this too.
I also believe so. The other questions are why this only started happening recently, and why sys.netpfil.pf.names.names triggers the panic most often.
(In reply to Li-Wen Hsu from comment #5) The names test does trigger very unusual behaviour, in that it provokes a situation where we have two struct ifnets with the same name. That might be a factor. I think it's also racy, which might mean that any random change in the network stack suddenly makes this more likely to occur.
Created attachment 205530 [details] ifp_dying.patch A second experimental patch. This solves a different problem, which also manifests during the pf tests. It looks like there's a race where the epair deletes its ifp while the jail shuts down. The jail shutdown moves the ifp back into its original vnet while the epair delete is freeing it. As a result we end up with a freed ifp in V_ifnet, and when we run into it later the box panics. I think this is likely fallout from the epoch-ification, so mmacy@ should take a look at this.
To reproduce the issue that ifp_dying.patch works around (I don't think it's a fully correct fix): apply epair.patch, revert r349539, kldload pfsync, cd /usr/tests/sys/netpfil/pf and run `while true; do sudo kyua test; done`. It shouldn't take more than 10 minutes to panic the box.
This starts happening in 12 now, so the fix also needs to be MFC'd.
https://reviews.freebsd.org/D20868
https://reviews.freebsd.org/D20869
A commit references this bug:

Author: lwhsu
Date: Tue Jul 9 09:09:52 UTC 2019
New revision: 349858
URL: https://svnweb.freebsd.org/changeset/base/349858

Log:
  MFC r349539

  Skip sys.netpfil.pf.names.names and sys.netpfil.pf.synproxy.synproxy
  temporarily because kernel panics when flushing epair queue.

  PR:           238870
  Sponsored by: The FreeBSD Foundation

Changes:
_U  stable/12/
  stable/12/tests/sys/netpfil/pf/names.sh
  stable/12/tests/sys/netpfil/pf/synproxy.sh
As an additional clue, I tried taking the IFNET_WLOCK in vnet_if_return(). While that's clearly too crude (if only because it's a non-sleepable lock and we sleep during if_vmove()), it does prevent these panics:

diff --git a/sys/net/if.c b/sys/net/if.c
index 1abb42893a0..b41a1a102cd 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -481,11 +481,13 @@ vnet_if_return(const void *unused __unused)
 {
 	struct ifnet *ifp, *nifp;
 
+	IFNET_WLOCK();
 	/* Return all inherited interfaces to their parent vnets. */
 	CK_STAILQ_FOREACH_SAFE(ifp, &V_ifnet, if_link, nifp) {
 		if (ifp->if_home_vnet != ifp->if_vnet)
 			if_vmove(ifp, ifp->if_home_vnet);
 	}
+	IFNET_WUNLOCK();
 }
 
 VNET_SYSUNINIT(vnet_if_return, SI_SUB_VNET_DONE, SI_ORDER_ANY,
     vnet_if_return, NULL);
@kp: Regarding your patch: CK_STAILQ_FOREACH_SAFE() should not be used outside of an EPOCH section. That might also explain the race you are seeing. --HPS
(In reply to Hans Petter Selasky from comment #13) Good point, but it's also not trivial to add this, because if_detach_internal() does an epoch_wait_preempt() on the net epoch. Simply adding NET_EPOCH_ENTER/EXIT around the loop produces this panic:

panic: epoch_wait_preempt() called in the middle of an epoch section of the same epoch
cpuid = 5
time = 1566475161
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0091dc65a0
vpanic() at vpanic+0x19d/frame 0xfffffe0091dc65f0
panic() at panic+0x43/frame 0xfffffe0091dc6650
epoch_wait_preempt() at epoch_wait_preempt+0x29c/frame 0xfffffe0091dc66b0
if_detach_internal() at if_detach_internal+0x1a6/frame 0xfffffe0091dc6730
if_vmove() at if_vmove+0x4b/frame 0xfffffe0091dc6780
vnet_if_return() at vnet_if_return+0x58/frame 0xfffffe0091dc67c0
vnet_destroy() at vnet_destroy+0x124/frame 0xfffffe0091dc67f0
prison_deref() at prison_deref+0x29d/frame 0xfffffe0091dc6830
sys_jail_remove() at sys_jail_remove+0x28f/frame 0xfffffe0091dc6880
amd64_syscall() at amd64_syscall+0x2bb/frame 0xfffffe0091dc69b0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0091dc69b0
--- syscall (508, FreeBSD ELF64, sys_jail_remove), rip = 0x80031712a, rsp = 0x7fffffffe848, rbp = 0x7fffffffe8d0 ---
KDB: enter: panic
[ thread pid 1174 tid 100507 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>
Try this:

	struct ifnet *ifp;
	struct epoch_tracker et;

	while (1) {
		NET_EPOCH_ENTER(et);
		CK_STAILQ_FOREACH(ifp, &V_ifnet, if_link) {
			if (ifp->if_home_vnet != ifp->if_vnet)
				break;
		}
		NET_EPOCH_EXIT(et);
		if (ifp == NULL)
			break;
		if_vmove(ifp, ifp->if_home_vnet);
	}
(Adding what was already discussed on IRC) The above patch (or an alternative version where the ifnet is if_ref()ed before the NET_EPOCH_EXIT()) does not fix the problem. My current understanding is that this is because an if_detach() of an epair will remove two interfaces, which can mean we're deleting one while it's being moved to a different vnet.
I wonder if the recent epoch changes on head have shifted the problem; if I run the names test in an endless loop it pretty quickly panics:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 05
instruction pointer     = 0x20:0xffffffff80cc4585
stack pointer           = 0x28:0xfffffe00d9785740
frame pointer           = 0x28:0xfffffe00d9785740
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 9949 (ifconfig)
trap number             = 9
panic: general protection fault
cpuid = 3
time = 1572126334
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d9785450
vpanic() at vpanic+0x19d/frame 0xfffffe00d97854a0
panic() at panic+0x43/frame 0xfffffe00d9785500
trap_fatal() at trap_fatal+0x39c/frame 0xfffffe00d9785560
trap() at trap+0x6a/frame 0xfffffe00d9785670
calltrap() at calltrap+0x8/frame 0xfffffe00d9785670
--- trap 0x9, rip = 0xffffffff80cc4585, rsp = 0xfffffe00d9785740, rbp = 0xfffffe00d9785740 ---
strncmp() at strncmp+0x15/frame 0xfffffe00d9785740
ifunit_ref() at ifunit_ref+0x51/frame 0xfffffe00d9785780
ifioctl() at ifioctl+0x396/frame 0xfffffe00d9785850
kern_ioctl() at kern_ioctl+0x295/frame 0xfffffe00d97858b0
sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe00d9785980
amd64_syscall() at amd64_syscall+0x2b9/frame 0xfffffe00d9785ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00d9785ab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8004802ea, rsp = 0x7fffffffe178, rbp = 0x7fffffffe1e0 ---
KDB: enter: panic
[ thread pid 9949 tid 100138 ]
(In reply to Bjoern A. Zeeb from comment #17) I think I've seen a similar panic before as well. The exact panic is unpredictable, because it depends on what operations are done after (or while) the epair is being cleaned up.
(In reply to Kristof Provost from comment #18) Well, it also highly depends on the current locking situation in our network stack, which keeps changing. Hence, I am not sure why, or whether, this is an epair-related issue at all.
(In reply to Bjoern A. Zeeb from comment #19) I believe it is related to epair, yes. There are some notes in https://reviews.freebsd.org/D20868. My recollection from when I was chasing this is that the problem occurs when a vnet jail (which contains one or more epair interfaces) is destroyed and immediately afterwards the epair interface is destroyed (or possibly the other way around). Destroying the vnet jail moves the epair interface back into the host, but because it's also being destroyed at the same time we end up moving an ifp into the host which will be freed afterwards. A freed ifp in the host's list of interfaces is obviously a panic waiting to happen. I'm not 100% clear on the exact order of operations, but I am fully convinced that it's related to interfaces being moved due to a vnet destroy, and epairs destroying two interfaces at a time (i.e. the a and b ends).
(In reply to Kristof Provost from comment #20) Well, there is your problem. In the old days we used locks for exclusivity; that has probably changed on the interface lists, etc. with epoch(). Hence we now have two operations running in parallel that do not wait for each other. At least that sounds very likely to me. I can go and have a look in about two weeks to debug and fix this.
(In reply to Kristof Provost from comment #20) Thanks for the explanation. For what it's worth, adding a delay before destroying the epair solved the kernel panic issues.
A commit references this bug:

Author: lwhsu
Date: Fri Jun 26 09:39:23 UTC 2020
New revision: 362646
URL: https://svnweb.freebsd.org/changeset/base/362646

Log:
  Temporarily skip flakey bin.sh.execution.functional_test.bg12 in CI

  PR:           238870
  Sponsored by: The FreeBSD Foundation

Changes:
  head/bin/sh/tests/functional_test.sh
A commit references this bug:

Author: kp
Date: Tue Sep 8 14:54:11 UTC 2020
New revision: 365457
URL: https://svnweb.freebsd.org/changeset/base/365457

Log:
  net: mitigate vnet / epair cleanup races

  There's a race where dying vnets move their interfaces back to their
  original vnet, and if_epair cleanup (where deleting one interface also
  deletes the other end of the epair). This is commonly triggered by the
  pf tests, but also by cleanup of vnet jails.

  As we've not yet been able to fix the root cause of the issue work
  around the panic by not dereferencing a NULL softc in epair_qflush()
  and by not re-attaching DYING interfaces. This isn't a full fix, but
  makes a very common panic far less likely.

  PR:           244703, 238870
  Reviewed by:  lutz_donnerhacke.de
  MFC after:    4 days
  Differential Revision:  https://reviews.freebsd.org/D26324

Changes:
  head/sys/net/if.c
  head/sys/net/if_epair.c
A commit references this bug:

Author: kp
Date: Sat Sep 12 12:45:32 UTC 2020
New revision: 365659
URL: https://svnweb.freebsd.org/changeset/base/365659

Log:
  MFC r365457:
  net: mitigate vnet / epair cleanup races

  There's a race where dying vnets move their interfaces back to their
  original vnet, and if_epair cleanup (where deleting one interface also
  deletes the other end of the epair). This is commonly triggered by the
  pf tests, but also by cleanup of vnet jails.

  As we've not yet been able to fix the root cause of the issue work
  around the panic by not dereferencing a NULL softc in epair_qflush()
  and by not re-attaching DYING interfaces. This isn't a full fix, but
  makes a very common panic far less likely.

  PR:           244703, 238870

Changes:
_U  stable/12/
  stable/12/sys/net/if.c
  stable/12/sys/net/if_epair.c
A commit references this bug:

Author: kp
Date: Sat Sep 12 18:58:37 UTC 2020
New revision: 365669
URL: https://svnweb.freebsd.org/changeset/base/365669

Log:
  MFC r365457:
  net: mitigate vnet / epair cleanup races

  There's a race where dying vnets move their interfaces back to their
  original vnet, and if_epair cleanup (where deleting one interface also
  deletes the other end of the epair). This is commonly triggered by the
  pf tests, but also by cleanup of vnet jails.

  As we've not yet been able to fix the root cause of the issue work
  around the panic by not dereferencing a NULL softc in epair_qflush()
  and by not re-attaching DYING interfaces. This isn't a full fix, but
  makes a very common panic far less likely.

  PR:           244703, 238870
  Approved by:  re (gjb)

Changes:
_U  releng/12.2/
  releng/12.2/sys/net/if.c
  releng/12.2/sys/net/if_epair.c
A commit references this bug:

Author: kp
Date: Tue Dec 1 16:24:00 UTC 2020
New revision: 368237
URL: https://svnweb.freebsd.org/changeset/base/368237

Log:
  if: Fix panic when destroying vnet and epair simultaneously

  When destroying a vnet and an epair (with one end in the vnet) we often
  panicked. This was the result of the destruction of the epair, which
  destroys both ends simultaneously, happening while vnet_if_return() was
  moving the struct ifnet to its home vnet. This can result in a freed
  ifnet being re-added to the home vnet V_ifnet list. That in turn panics
  the next time the ifnet is used.

  Prevent this race by ensuring that vnet_if_return() cannot run at the
  same time as if_detach() or epair_clone_destroy().

  PR:           238870, 234985, 244703, 250870
  MFC after:    2 weeks
  Sponsored by: Modirum MDPay
  Differential Revision:  https://reviews.freebsd.org/D27378

Changes:
  head/sys/net/if.c
A commit references this bug:

Author: kp
Date: Tue Dec 15 15:33:29 UTC 2020
New revision: 368663
URL: https://svnweb.freebsd.org/changeset/base/368663

Log:
  MFC r368237:
  if: Fix panic when destroying vnet and epair simultaneously

  When destroying a vnet and an epair (with one end in the vnet) we often
  panicked. This was the result of the destruction of the epair, which
  destroys both ends simultaneously, happening while vnet_if_return() was
  moving the struct ifnet to its home vnet. This can result in a freed
  ifnet being re-added to the home vnet V_ifnet list. That in turn panics
  the next time the ifnet is used.

  Prevent this race by ensuring that vnet_if_return() cannot run at the
  same time as if_detach() or epair_clone_destroy().

  PR:           238870, 234985, 244703, 250870
  Sponsored by: Modirum MDPay

Changes:
_U  stable/12/
  stable/12/sys/net/if.c
A commit in branch releng/12.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e0c15f45abd4bd5165e11b557a8c90d0faf5cfeb

commit e0c15f45abd4bd5165e11b557a8c90d0faf5cfeb
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2021-01-18 21:55:53 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2021-01-29 00:58:55 +0000

    MFC r368237:
    if: Fix panic when destroying vnet and epair simultaneously

    When destroying a vnet and an epair (with one end in the vnet) we often
    panicked. This was the result of the destruction of the epair, which
    destroys both ends simultaneously, happening while vnet_if_return() was
    moving the struct ifnet to its home vnet. This can result in a freed
    ifnet being re-added to the home vnet V_ifnet list. That in turn panics
    the next time the ifnet is used.

    Prevent this race by ensuring that vnet_if_return() cannot run at the
    same time as if_detach() or epair_clone_destroy().

    PR:           238870, 234985, 244703, 250870
    Sponsored by: Modirum MDPay
    Approved by:  so

 sys/net/if.c     | 147 +++++++++++++++++++++++++++++++++++++------------------
 sys/net/if_var.h |  24 ++-------
 2 files changed, 104 insertions(+), 67 deletions(-)
A commit in branch releng/12.2 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e682b62c96e94c60d830e4414215032e0d4f8dad

commit e682b62c96e94c60d830e4414215032e0d4f8dad
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2020-09-12 16:33:05 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2021-01-29 01:14:24 +0000

    MFC r368237:
    if: Fix panic when destroying vnet and epair simultaneously

    When destroying a vnet and an epair (with one end in the vnet) we often
    panicked. This was the result of the destruction of the epair, which
    destroys both ends simultaneously, happening while vnet_if_return() was
    moving the struct ifnet to its home vnet. This can result in a freed
    ifnet being re-added to the home vnet V_ifnet list. That in turn panics
    the next time the ifnet is used.

    Prevent this race by ensuring that vnet_if_return() cannot run at the
    same time as if_detach() or epair_clone_destroy().

    PR:           238870, 234985, 244703, 250870
    Sponsored by: Modirum MDPay
    Approved by:  so

 sys/net/if.c     | 147 +++++++++++++++++++++++++++++++++++++------------------
 sys/net/if_var.h |  24 ++-------
 2 files changed, 104 insertions(+), 67 deletions(-)