Created attachment 201173
vnet_epair_test.sh: Script for reproducing vnet jail epair destroy panic
When an epair interface pair is created for a VNET-enabled jail and ifconfig is then run inside that jail, the kernel often panics later, when the jail is removed and the epair interface is finally destroyed. The panic does not occur when ifconfig is not run inside the jail, or when it is run only outside the jail, and it does not occur on every attempt. When it does occur, it is always at the moment the ifconfig destroy of the epair is executed.
This has been tested and reproduced on 12.0-RELEASE-p2 and 13.0-CURRENT r343065.
I have attached a script that reproduces this. It is based on an older script that tested for a similar issue; I changed it to run the test 999 times and added an optional 'panic' argument that enables the critical ifconfig command that makes the difference here.
With the panic argument, it reliably panics my system on every run, at worst after a couple of hundred loops (perhaps it is some kind of race condition?). Without the panic argument, the system never crashes.
I have also included the kernel trace I obtained from the 13.0-CURRENT system, and can supply a kernel memory dump if you need it.
So what side effect does this seemingly innocent ifconfig command have that affects a later ifconfig destroy command? It does not matter which interface is queried (ifconfig lo0 or anything else); as long as ifconfig is used at least once, I can trigger the panic.
Created attachment 201174
trace_13.0-CURRENT-r343065.txt: kernel trace
Interesting fact after some more testing: everything is also fine when I do not remove the jails, or when I remove the jail *after* destroying the epair interface.
Only when I remove the jail *before* destroying the epair interface *and* run the ifconfig command inside the jail can I trigger the panic.
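The ordering described above can be sketched roughly as follows. This is not the attached script, only an illustration of the sequence that matters. The helper names (run, repro_once), the jail name, the RUN switch, and the placeholder interface name epair0a are assumptions made for this sketch. By default the commands are only printed; they are executed only when RUN=1 is set, which should be done as root on a disposable test machine.

```shell
#!/bin/sh
# Illustrative sketch only: helper names, jail name and RUN switch are assumptions.

# Print the command in dry-run mode (default); execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# repro_once <jail_name> [panic]
# Passing "panic" enables the ifconfig call inside the jail.
repro_once() {
    jail_name=$1
    mode=${2:-}

    if [ "${RUN:-0}" = "1" ]; then
        # ifconfig prints the name of the new "a" side, e.g. epair0a.
        epair_a=$(ifconfig epair create) || return 1
    else
        epair_a="epair0a"   # placeholder name for the dry run
        echo "ifconfig epair create"
    fi
    epair_b="${epair_a%a}b"

    run jail -c name="$jail_name" vnet persist
    run ifconfig "$epair_b" vnet "$jail_name"

    if [ "$mode" = "panic" ]; then
        # The seemingly harmless query inside the jail.
        run jexec "$jail_name" ifconfig
    fi

    # Jail is removed *before* the epair is destroyed.
    run jail -r "$jail_name"
    run ifconfig "$epair_a" destroy
}

repro_once panic_test panic
```

Since the panic does not occur on every attempt, a real run would wrap repro_once in a loop.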
I hope I have provided enough information; let me know if I can test or provide anything else to pinpoint the issue.
With r349853 I don't get a panic with your script, but if I assign an IP inside the jail (jexec <id> ifconfig inet 18.104.22.168) instead of just listing the interfaces, it panics on destroy.
(kgdb) #0 __curthread () at /space/system/usr_src/sys/amd64/include/pcpu.h:246
#1 doadump (textdump=1) at /space/system/usr_src/sys/kern/kern_shutdown.c:392
#2 0xffffffff8050cf70 in kern_reboot (howto=260)
#3 0xffffffff8050d3e9 in vpanic (fmt=<optimized out>, ap=<optimized out>)
#4 0xffffffff8050d123 in panic (fmt=<unavailable>)
#5 0xffffffff807e758c in trap_fatal (frame=0xfffffe01598227c0, eva=0)
#6 0xffffffff807e698c in trap (frame=0xfffffe01598227c0)
#7 <signal handler called>
#8 0xffffffff805f2045 in strncmp (s1=<optimized out>, s2=<optimized out>,
    n=<optimized out>) at /space/system/usr_src/sys/libkern/strncmp.c:44
#9 0xffffffff80605d31 in ifunit_ref (name=0xfffffe0159822a20 "panic_test1b")
#10 0xffffffff80607ef8 in ifioctl (so=0xfffff809a1afd368, cmd=3223349536,
    data=0xfffffe0159822a20 "panic_test1b", td=0xfffff8014c83e5a0)
#11 0xffffffff8057658d in fo_ioctl (fp=<optimized out>, com=3223349536,
    data=0xfffff800020e2180, active_cred=0x0, td=0xfffff8014c83e5a0)
#12 kern_ioctl (td=0xfffff8014c83e5a0, fd=3, com=3223349536,
#13 0xffffffff805762ad in sys_ioctl (td=0xfffff8014c83e5a0,
    uap=0xfffff8014c83e968) at /space/system/usr_src/sys/kern/sys_generic.c:712
#14 0xffffffff807e801a in syscallenter (td=0xfffff8014c83e5a0)
#15 amd64_syscall (td=0xfffff8014c83e5a0, traced=0)
This is almost certainly the same problem as the one discussed in #238870 and https://reviews.freebsd.org/D20868.
The patches in https://reviews.freebsd.org/D20868 and https://reviews.freebsd.org/D20869 work around the panic, but are not fully correct fixes.
I ran into the same problem quite early on. I believe it is a bug in the VNET cleanup code.
There is an easy workaround, which works quite well in my setup:
Before destroying the interface, remove it from the jail (for example from a prestop hook in jail.conf). Use this command on the host:
ifconfig $interfaceName -vnet $jailName
It will remove the interface from the jail's VNET. Then you can destroy the epair on the host.
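A minimal sketch of that teardown order, under some assumptions: the function name teardown_vnet_jail, its arguments, and the RUN dry-run switch are made up for illustration, and placing the jail removal between the two ifconfig calls is my reading of the workaround rather than something stated explicitly. Commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Illustrative sketch only: function names and the RUN switch are assumptions.

# Print the command in dry-run mode (default); execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# teardown_vnet_jail <jail_name> <iface_in_jail> <iface_on_host>
teardown_vnet_jail() {
    jail_name=$1
    iface_in_jail=$2    # e.g. the "b" side that was moved into the jail
    iface_on_host=$3    # e.g. the "a" side that stayed on the host

    # 1. Reclaim the interface from the jail's VNET (run on the host).
    run ifconfig "$iface_in_jail" -vnet "$jail_name"
    # 2. Remove the jail.
    run jail -r "$jail_name"
    # 3. Only now destroy the epair on the host.
    run ifconfig "$iface_on_host" destroy
}

teardown_vnet_jail myjail epair0b epair0a
```

The same first step could instead be placed in the jail's exec.prestop parameter in jail.conf, so that it runs automatically before the jail is stopped.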
The problem is still present in recent CURRENT (FreeBSD 13.0-CURRENT #26 r356437: Tue Jan 7 07:19:34 CET 2020 amd64).
See also PR 238326 and PR 219901, a bug known since 2017.
^Triage: Track earliest reported/reproducible/affected branch