Bug 238870 - sys.netpfil.pf.names.names and sys.netpfil.pf.synproxy.synproxy cause panic
Summary: sys.netpfil.pf.names.names and sys.netpfil.pf.synproxy.synproxy cause panic
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: tests
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-testing mailing list
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2019-06-28 20:11 UTC by Li-Wen Hsu
Modified: 2019-10-26 21:47 UTC (History)
5 users

See Also:


Attachments
epair.patch (637 bytes, patch)
2019-06-28 22:23 UTC, Kristof Provost
no flags
ifp_dying.patch (413 bytes, patch)
2019-07-05 12:11 UTC, Kristof Provost
no flags

Description Li-Wen Hsu freebsd_committer 2019-06-28 20:11:29 UTC
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11696/
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11697/
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11698/
https://ci.freebsd.org/job/FreeBSD-head-amd64-test/11699/

These test runs panic while executing sys.netpfil.pf.names.names or sys.netpfil.pf.synproxy.synproxy, with message:

00:53:22.137 sys/netpfil/pf/names:names  ->  panic: epair_qflush: ifp=0xfffff80063f15800, epair_softc gone? sc=0

`cd /usr/tests/sys/netpfil/pf && kyua test names:names` can trigger this.

Note: this should not be related to r349508 (the revision of build 11696); it can also be reproduced on earlier revisions, e.g. r349507.
Comment 1 Kristof Provost freebsd_committer 2019-06-28 20:28:15 UTC
The backtrace suggests this is an epair teardown problem:

panic: epair_qflush: ifp=0xfffff80063f15800, epair_softc gone? sc=0

cpuid = 0
time = 1561746423
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0024260750
vpanic() at vpanic+0x19d/frame 0xfffffe00242607a0
panic() at panic+0x43/frame 0xfffffe0024260800
epair_qflush() at epair_qflush+0x1ba/frame 0xfffffe0024260850
if_down() at if_down+0x11d/frame 0xfffffe0024260880
if_detach_internal() at if_detach_internal+0x704/frame 0xfffffe0024260900
if_vmove() at if_vmove+0x3c/frame 0xfffffe0024260950
vnet_if_return() at vnet_if_return+0x48/frame 0xfffffe0024260970
vnet_destroy() at vnet_destroy+0x124/frame 0xfffffe00242609a0
prison_deref() at prison_deref+0x29d/frame 0xfffffe00242609e0
taskqueue_run_locked() at taskqueue_run_locked+0x10c/frame 0xfffffe0024260a40
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe0024260a70
fork_exit() at fork_exit+0x84/frame 0xfffffe0024260ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0024260ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

The pf tests create and destroy many epair interfaces as they start up and stop vnet jails.
Comment 2 Kristof Provost freebsd_committer 2019-06-28 22:23:40 UTC
Created attachment 205406 [details]
epair.patch

This seems to fix the panic, and I *think* it might even be correct. I'd quite like bz@ to take a look though.
Comment 3 commit-hook freebsd_committer 2019-06-29 12:20:06 UTC
A commit references this bug:

Author: lwhsu
Date: Sat Jun 29 12:19:57 UTC 2019
New revision: 349539
URL: https://svnweb.freebsd.org/changeset/base/349539

Log:
  Skip sys.netpfil.pf.names.names and sys.netpfil.pf.synproxy.synproxy
  temporarily because kernel panics when flushing epair queue.

  PR:		238870
  Sponsored by:	The FreeBSD Foundation

Changes:
  head/tests/sys/netpfil/pf/names.sh
  head/tests/sys/netpfil/pf/synproxy.sh
Comment 4 Kristof Provost freebsd_committer 2019-06-29 12:22:10 UTC
(In reply to commit-hook from comment #3)
I'm pretty sure the other pf tests (and the netipsec tests) risk triggering this too.
Comment 5 Li-Wen Hsu freebsd_committer 2019-06-29 12:25:24 UTC
I also believe so.  The other questions are why this only started happening recently, and why it is mostly sys.netpfil.pf.names.names that triggers the panic.
Comment 6 Kristof Provost freebsd_committer 2019-06-29 12:27:33 UTC
(In reply to Li-Wen Hsu from comment #5)
The names test does trigger very unusual behaviour, in that it provokes a situation where we have two struct ifnets with the same name. That might be a factor.

I think it's also racy, which might mean that any random change in the network stack suddenly makes this more likely to occur.
Comment 7 Kristof Provost freebsd_committer 2019-07-05 12:11:11 UTC
Created attachment 205530 [details]
ifp_dying.patch

A second experimental patch. This solves a different problem, that also manifests during the pf tests.

It looks like there's a race, where the epair deletes its ifp while the jail shuts down. The jail shutdown moves the ifp back into its original vnet, while the epair delete is causing it to be deleted. As a result we end up with a freed ifp in V_ifnet, and when we run into it later the box panics.

I think this is likely fallout from the epoch-ification, so mmacy@ should take a look at this.
Comment 8 Kristof Provost freebsd_committer 2019-07-05 12:13:04 UTC
To reproduce the issue that ifp_dying.patch works around (I don't think it's a fully correct fix): apply epair.patch, revert r349539, kldload pfsync, cd /usr/tests/sys/netpfil/pf, and run `while true; do sudo kyua test; done`. It shouldn't take more than 10 minutes to panic the box.
Comment 9 Li-Wen Hsu freebsd_committer 2019-07-06 00:53:57 UTC
This starts happening in 12 now, so the fix also needs to be MFC'd.
Comment 11 commit-hook freebsd_committer 2019-07-09 09:10:27 UTC
A commit references this bug:

Author: lwhsu
Date: Tue Jul  9 09:09:52 UTC 2019
New revision: 349858
URL: https://svnweb.freebsd.org/changeset/base/349858

Log:
  MFC r349539

  Skip sys.netpfil.pf.names.names and sys.netpfil.pf.synproxy.synproxy
  temporarily because kernel panics when flushing epair queue.

  PR:		238870
  Sponsored by:	The FreeBSD Foundation

Changes:
_U  stable/12/
  stable/12/tests/sys/netpfil/pf/names.sh
  stable/12/tests/sys/netpfil/pf/synproxy.sh
Comment 12 Kristof Provost freebsd_committer 2019-08-22 11:31:42 UTC
As an additional clue, I tried taking the IFNET_WLOCK in vnet_if_return(). That's clearly too crude (if only because it's a non-sleepable lock and we sleep during if_vmove()), but it does prevent these panics:

diff --git a/sys/net/if.c b/sys/net/if.c
index 1abb42893a0..b41a1a102cd 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -481,11 +481,13 @@ vnet_if_return(const void *unused __unused)
 {
        struct ifnet *ifp, *nifp;

+       IFNET_WLOCK();
        /* Return all inherited interfaces to their parent vnets. */
        CK_STAILQ_FOREACH_SAFE(ifp, &V_ifnet, if_link, nifp) {
                if (ifp->if_home_vnet != ifp->if_vnet)
                        if_vmove(ifp, ifp->if_home_vnet);
        }
+       IFNET_WUNLOCK();
 }
 VNET_SYSUNINIT(vnet_if_return, SI_SUB_VNET_DONE, SI_ORDER_ANY,
     vnet_if_return, NULL);
Comment 13 Hans Petter Selasky freebsd_committer 2019-08-22 11:36:24 UTC
@kp: Regarding your patch:

CK_STAILQ_FOREACH_SAFE() should not be used outside of an EPOCH section. That might also explain the race you are seeing.

--HPS
Comment 14 Kristof Provost freebsd_committer 2019-08-22 12:01:39 UTC
(In reply to Hans Petter Selasky from comment #13)
Good point, but it's also not trivial to add this, because if_detach_internal() does an epoch_wait_preempt() on the net epoch.

Simply adding NET_EPOCH_ENTER/EXIT around the loop produces this panic:

panic: epoch_wait_preempt() called in the middle of an epoch section of the same epoch
cpuid = 5
time = 1566475161
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0091dc65a0
vpanic() at vpanic+0x19d/frame 0xfffffe0091dc65f0
panic() at panic+0x43/frame 0xfffffe0091dc6650
epoch_wait_preempt() at epoch_wait_preempt+0x29c/frame 0xfffffe0091dc66b0
if_detach_internal() at if_detach_internal+0x1a6/frame 0xfffffe0091dc6730
if_vmove() at if_vmove+0x4b/frame 0xfffffe0091dc6780
vnet_if_return() at vnet_if_return+0x58/frame 0xfffffe0091dc67c0
vnet_destroy() at vnet_destroy+0x124/frame 0xfffffe0091dc67f0
prison_deref() at prison_deref+0x29d/frame 0xfffffe0091dc6830
sys_jail_remove() at sys_jail_remove+0x28f/frame 0xfffffe0091dc6880
amd64_syscall() at amd64_syscall+0x2bb/frame 0xfffffe0091dc69b0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0091dc69b0
--- syscall (508, FreeBSD ELF64, sys_jail_remove), rip = 0x80031712a, rsp = 0x7fffffffe848, rbp = 0x7fffffffe8d0 ---
KDB: enter: panic
[ thread pid 1174 tid 100507 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>
Comment 15 Hans Petter Selasky freebsd_committer 2019-08-22 12:11:53 UTC
Try this:

        struct ifnet *ifp;
        struct epoch_tracker et;

        while (1) {
                NET_EPOCH_ENTER(et);
                CK_STAILQ_FOREACH(ifp, &V_ifnet, if_link) {
                        if (ifp->if_home_vnet != ifp->if_vnet)
                                break;
                }
                NET_EPOCH_EXIT(et);
                if (ifp == NULL)
                        break;
                if_vmove(ifp, ifp->if_home_vnet);
        }
Comment 16 Kristof Provost freebsd_committer 2019-08-29 15:28:51 UTC
(Adding what was already discussed on IRC)
The above patch (or an alternative version where the ifnet is if_ref()ed before the NET_EPOCH_EXIT()) does not fix the problem.

My current understanding is that this is because if_detach() of an epair will remove two interfaces, which can mean we're deleting one while it's being moved to a different vnet.
Comment 17 Bjoern A. Zeeb freebsd_committer 2019-10-26 21:47:35 UTC
I wonder if the recent epoch changes on head have shifted the problem; if I run the names test in an endless loop it pretty quickly panics with:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 05
instruction pointer     = 0x20:0xffffffff80cc4585
stack pointer           = 0x28:0xfffffe00d9785740
frame pointer           = 0x28:0xfffffe00d9785740
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 9949 (ifconfig)
trap number             = 9
panic: general protection fault
cpuid = 3
time = 1572126334
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d9785450
vpanic() at vpanic+0x19d/frame 0xfffffe00d97854a0
panic() at panic+0x43/frame 0xfffffe00d9785500
trap_fatal() at trap_fatal+0x39c/frame 0xfffffe00d9785560
trap() at trap+0x6a/frame 0xfffffe00d9785670
calltrap() at calltrap+0x8/frame 0xfffffe00d9785670
--- trap 0x9, rip = 0xffffffff80cc4585, rsp = 0xfffffe00d9785740, rbp = 0xfffffe00d9785740 ---
strncmp() at strncmp+0x15/frame 0xfffffe00d9785740
ifunit_ref() at ifunit_ref+0x51/frame 0xfffffe00d9785780
ifioctl() at ifioctl+0x396/frame 0xfffffe00d9785850
kern_ioctl() at kern_ioctl+0x295/frame 0xfffffe00d97858b0
sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe00d9785980
amd64_syscall() at amd64_syscall+0x2b9/frame 0xfffffe00d9785ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00d9785ab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8004802ea, rsp = 0x7fffffffe178, rbp = 0x7fffffffe1e0 ---
KDB: enter: panic
[ thread pid 9949 tid 100138 ]