Created attachment 183373
core.txt for crash on 2017-06-10 r319591
While testing vnet jails over time, bringing about 20 of them up and down roughly 10-20 times a day, I get this panic (in bridge_set_ifcap).
I have attached the core.txt for the panic to this post.
The panic seems to occur during the teardown of an interface on an if_bridge, while setting the interface capabilities on the bridge.
This is the third time the same panic has occurred, and it only occurs after some time (~5 days).
I have tried running a script that brings the vnet jails up and down in a loop, but have not been able to reproduce the panic that way.
I have crash dumps and core.txt for all three panics, but the backtraces are almost identical, so I am attaching only the most recent one.
Please let me know if there is any other information you require.
(kgdb) #0 doadump (textdump=0) at pcpu.h:232
#1 0xffffffff803a308b in db_dump (dummy=<value optimized out>,
dummy2=<value optimized out>, dummy3=<value optimized out>,
dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:546
#2 0xffffffff803a2e7f in db_command (cmd_table=<value optimized out>)
#3 0xffffffff803a2bb4 in db_command_loop ()
#4 0xffffffff803a5c7f in db_trap (type=<value optimized out>,
code=<value optimized out>) at /usr/src/sys/ddb/db_main.c:248
#5 0xffffffff80a98b23 in kdb_trap (type=9, code=0, tf=<value optimized out>)
#6 0xffffffff80ef6532 in trap_fatal (frame=0xfffffe104379d430, eva=0)
#7 0xffffffff80ef5b3d in trap (frame=0xfffffe104379d430) at counter.h:85
#8 0xffffffff80ed88e1 in calltrap ()
#9 0xffffffff82e7620f in bridge_set_ifcap (sc=0xfffff8002866bc00,
bif=<value optimized out>, set=8) at /usr/src/sys/net/if_bridge.c:940
#10 0xffffffff82e75f0d in bridge_delete_member (sc=0xfffff8002866bc00,
bif=0xfffff8083d54ea00, gone=0) at /usr/src/sys/net/if_bridge.c:1065
#11 0xffffffff82e73dca in bridge_clone_destroy (ifp=0xfffff80251155000)
#12 0xffffffff80b510c7 in if_clone_destroyif (ifc=0xfffff8000efd2c80,
ifp=<value optimized out>) at /usr/src/sys/net/if_clone.c:719
#13 0xffffffff80b50e29 in if_clone_destroy (name=0xfffffe104379d780 "igb1br")
#14 0xffffffff80b4c283 in ifioctl (so=0xfffff80251beb000,
cmd=<value optimized out>, data=<value optimized out>,
td=<value optimized out>) at /usr/src/sys/net/if.c:2802
#15 0xffffffff80abd72d in kern_ioctl (td=<value optimized out>,
fd=<value optimized out>, com=<value optimized out>,
data=<value optimized out>) at file.h:323
#16 0xffffffff80abd3ef in sys_ioctl (td=<value optimized out>,
uap=0xfffffe104379d930) at /usr/src/sys/kern/sys_generic.c:745
#17 0xffffffff80ef6fd9 in amd64_syscall (td=0xfffff80251001000, traced=0)
#18 0xffffffff80ed8bcb in Xfast_syscall ()
#19 0x0000000800fdc4ca in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal
# uname -a
FreeBSD v3dev-20-7 12.0-CURRENT FreeBSD 12.0-CURRENT #9 r319591: Sun Jun 4 23:01:11 UTC 2017 root@pkgbase:/usr/obj/usr/src/sys/VNET amd64
Is this still a problem in recent HEAD or stable/12?
(In reply to Bjoern A. Zeeb from comment #1)
I know it has been a long time, but I have recently seen a similar failure on 12.1 (and, I think, on one of my 12.0 systems too).
I do have the crash dump vmcore files if you would like me to run some other tests.
This happens when I am trying to tear down the networking across multiple vimage jails (by calling ifconfig commands in parallel).
From the backtrace this looks specific to removing an epair from a bridge (once again using ifconfig in parallel).
I am not sure that the issue I am facing here is exactly the same as the one I saw when creating this issue, but it looks very similar.
Here is the backtrace from kgdb for what I am seeing.
I am using the latest 12.1 release and have included the output of uname for the exact release version.
Any help either getting to a root cause or even a workaround for this would be much appreciated, as I have seen 6 crashes in the last week.
root@build-1:~ # uname -a
FreeBSD build-1 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 r354729 GENERIC amd64
#0 __curthread () at /usr/src/sys/amd64/include/pcpu.h:234
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:371
#2 0xffffffff80bd01c8 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:451
#3 0xffffffff80bd0629 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:877
#4 0xffffffff80bd0423 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:804
#5 0xffffffff810a7dcc in trap_fatal (frame=0xfffffe001c730380, eva=1040) at /usr/src/sys/amd64/amd64/trap.c:943
#6 0xffffffff810a7e19 in trap_pfault (frame=0xfffffe001c730380, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:767
#7 0xffffffff810a740f in trap (frame=0xfffffe001c730380) at /usr/src/sys/amd64/amd64/trap.c:443
#8 <signal handler called>
#9 __mtx_lock_sleep (c=0xfffff80066f6fa30, v=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:565
#10 0xffffffff82be048e in bridge_mutecaps (sc=0xfffff80066f6fa00) at /usr/src/sys/net/if_bridge.c:916
#11 0xffffffff82be017a in bridge_delete_member (sc=0xfffff80066f6fa00, bif=0xfffff80028463000, gone=1) at /usr/src/sys/net/if_bridge.c:1033
#12 0xffffffff82be07eb in bridge_ifdetach (arg=<optimized out>, ifp=0xfffff80004900800) at /usr/src/sys/net/if_bridge.c:1829
#13 0xffffffff80ccde5a in if_detach_internal (ifp=<optimized out>, vmove=0, ifcp=0x0) at /usr/src/sys/net/if.c:1187
#14 0xffffffff80ccd36e in if_detach (ifp=0xfffff80066f6fa30) at /usr/src/sys/net/if.c:1041
#15 0xffffffff83055c4c in epair_clone_destroy (ifc=0xfffff800209a8300, ifp=0xfffff80004900800) at /usr/src/sys/net/if_epair.c:972
#16 0xffffffff80cd5b7d in if_clone_destroyif (ifc=0xfffff800209a8300, ifp=0xfffff80004900800) at /usr/src/sys/net/if_clone.c:330
#17 0xffffffff80cd5a0e in if_clone_destroy (name=0xfffffe001c7308c0 "epair205a") at /usr/src/sys/net/if_clone.c:288
#18 0xffffffff80cd2915 in ifioctl (so=0xfffff800207d3000, cmd=2149607801, data=0xfffffe001c7308c0 "epair205a", td=<optimized out>) at /usr/src/sys/net/if.c:3106
#19 0xffffffff80c3b55e in fo_ioctl (fp=<optimized out>, com=<optimized out>, data=0xfffff8005f459000, active_cred=0x1, td=<optimized out>) at /usr/src/sys/sys/file.h:337
#20 kern_ioctl (td=0xfffff8005f459000, fd=<optimized out>, com=2149607801, data=0xfffff8005f459000 "") at /usr/src/sys/kern/sys_generic.c:804
#21 0xffffffff80c3b22d in sys_ioctl (td=0xfffff8005f459000, uap=0xfffff8005f4593c0) at /usr/src/sys/kern/sys_generic.c:712
#22 0xffffffff810a8984 in syscallenter (td=0xfffff8005f459000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#23 amd64_syscall (td=0xfffff8005f459000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1186
#24 <signal handler called>
#25 0x0000000800473e4a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe308
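For anyone trying to reproduce the 12.1 crash, the parallel teardown described above can be approximated with a small shell sketch. This is not the reporter's actual script: the function names and the DRY_RUN switch are hypothetical, and the only command taken from the report is the clone destroy seen in the backtrace (if_clone_destroy on "epair205a", i.e. `ifconfig epair205a destroy`). It assumes the epairs are still members of a bridge when they are destroyed. With DRY_RUN left at its default of 1 it only prints the commands.

```shell
#!/bin/sh
# Hypothetical stress sketch: destroy several epair interfaces in parallel.
# DRY_RUN=1 (default) prints the commands instead of running them.
: "${DRY_RUN:=1}"

# run CMD [ARG...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# teardown_parallel IFNAME...
# Starts one background "ifconfig IFNAME destroy" per interface and then
# waits for all of them, so the destroys overlap in time.
teardown_parallel() {
    for ifn in "$@"; do
        run ifconfig "$ifn" destroy &
    done
    wait
}
```

On a disposable test host, something like `DRY_RUN=0 teardown_parallel epair205a epair206a` would destroy both interfaces concurrently. Whether this alone is enough to hit the panic is not confirmed by the report.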
The problem also still persists in CURRENT (13.0-CURRENT FreeBSD 13.0-CURRENT #26 r356437: Tue Jan 7 07:19:34 CET 2020 amd64), and it can be reliably triggered by stopping vnet jails whose epair is a member of a bridge device.
Also, these PRs seem to be related to this almost two-year-old bug: PR 238326 and probably the one-year-old PR 234985.
Is anybody investigating this issue? It also affects 12.1-RELENG and 12-STABLE systems, and it seems that the more jails are hosted, the more likely the failure is. In our case, one large server with a couple of Intel 350T2 NICs and ~10-12 jails crashes on EVERY shutdown of the system and when stopping jails via "service jail stop". It takes only one or two attempts to crash the box by randomly selecting a jail and stopping it via "service jail stop $name", where $name is the name of the jail to be stopped.
The only safe way to reboot the server is to issue "reboot".
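The "randomly select a jail and stop it" step above can be sketched as a small shell helper. The function names and the DRY_RUN switch are hypothetical, and only the `service jail stop $name` invocation comes from the comment itself. With DRY_RUN left at its default of 1 the command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical reproduction helper: stop one randomly chosen jail.
# DRY_RUN=1 (default) prints the command instead of running it.
: "${DRY_RUN:=1}"

# pick_random NAME...: print one of the arguments, chosen pseudo-randomly.
# awk's srand() seeds from the clock, so calls within the same second
# may pick the same entry.
pick_random() {
    [ "$#" -gt 0 ] || return 1
    idx=$(awk -v n="$#" 'BEGIN { srand(); print int(rand() * n) + 1 }')
    shift $((idx - 1))
    printf '%s\n' "$1"
}

# stop_random_jail NAME...: stop one of the named jails.
stop_random_jail() {
    name=$(pick_random "$@") || return 1
    if [ "$DRY_RUN" = "1" ]; then
        echo "service jail stop $name"
    else
        service jail stop "$name"
    fi
}
```

A possible invocation on a test host is `DRY_RUN=0 stop_random_jail $(jls name)`, assuming `jls name` prints one running jail name per line and the jail names contain no whitespace.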
^Triage: Track earliest reported/reproducible/affected branch