Bug 234985

Summary: epair: Kernel panic when destroying epair interface of vnet jail after using ifconfig inside the jail
Product: Base System Reporter: Henno Schooljan <henno>
Component: kern Assignee: Kristof Provost <kp>
Status: Closed FIXED    
Severity: Affects Some People CC: alexx, freebsd, henno, kp, lwhsu, naito.yuichiro, net, netchild, ohartmann, sharky, trashcan, v_bachvarov, yp2008cn, zarychtam
Priority: --- Keywords: crash, vimage
Version: 12.0-RELEASE Flags: kp: mfc-stable12+
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238870
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238326
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219901
Attachments:
Description Flags
vnet_epair_test.sh: Script for reproducing vnet jail epair destroy panic
none
trace_13.0-CURRENT-r343065.txt: kernel trace
none
bug through ezjail : host rc.conf
none
bug through ezjail : jail definition
none
bug through ezjail : jail rc.conf
none
bug through ezjail : crashinfo output
none

Description Henno Schooljan 2019-01-16 00:57:34 UTC
Created attachment 201173 [details]
vnet_epair_test.sh: Script for reproducing vnet jail epair destroy panic

When creating an epair interface pair for a VNET-enabled jail and then using ifconfig within this jail, the kernel will often panic later, when the jail is destroyed and the epair interface is destroyed after it. This does not happen when ifconfig is not used within the jail or is only used outside of the jail, and it does not happen every time. When it does happen, it is always at the moment the epair is destroyed with ifconfig destroy.

This has been tested and reproduced on 12.0-RELEASE-p2 and 13.0-CURRENT r343065.

I have included a script which reproduces this. It is based on an older script which tested for a similar issue, and I changed it so that it will test this 999 times, with an optional 'panic' argument for triggering the critical ifconfig command that makes the difference here.
With the panic argument it will reliably panic my system on every run, at worst after a couple hundred loops or so (perhaps it is some kind of race condition?). Without the panic argument the system never crashes.
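The loop described above can be sketched roughly as follows. This is not the attached vnet_epair_test.sh; it is a minimal sketch that assumes the sequence given in the description (create the epair, create a VNET jail, move the b side into it, optionally run ifconfig inside the jail, remove the jail, then destroy the epair). The function names, the jail names and the DRY_RUN switch are hypothetical additions for illustration.

```shell
#!/bin/sh
# Sketch only: this is NOT the attached vnet_epair_test.sh.
# DRY_RUN=1 (hypothetical switch) prints each command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One iteration of the test sequence.
epair_cycle() {
    jail_name=$1    # hypothetical jail name
    mode=${2:-}     # "panic" adds the ifconfig call inside the jail

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "ifconfig epair create"
        epair_a=epair0a    # placeholder name, used for dry runs only
    else
        # ifconfig prints the name of the a side, e.g. epair0a
        epair_a=$(ifconfig epair create) || return 1
    fi
    epair_b="${epair_a%a}b"

    run jail -c name="$jail_name" persist vnet || return 1
    run ifconfig "$epair_b" vnet "$jail_name" || return 1

    if [ "$mode" = panic ]; then
        # The step that makes the difference according to the report.
        run jexec "$jail_name" ifconfig
    fi

    # Jail first, epair second: the order that leads to the panic.
    run jail -r "$jail_name"
    run ifconfig "$epair_a" destroy
}

# Repeat the cycle; the report uses 999 iterations.
epair_loop() {
    count=$1
    loop_mode=${2:-}
    i=1
    while [ "$i" -le "$count" ]; do
        epair_cycle "epairtest$i" "$loop_mode" || return 1
        i=$((i + 1))
    done
}
```

Running `DRY_RUN=1 epair_loop 3 panic` only prints the commands. Without DRY_RUN, `epair_loop 999 panic` needs root and may well panic an unpatched host, so it belongs on a test machine.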

I have also included the kernel trace I obtained from the 13.0-CURRENT system, and can supply a kernel memory dump if you need it.

So what side effect would this innocent ifconfig command have that it affects a later ifconfig destroy command? It also does not matter which interface you query with it, like when you run ifconfig lo0 or something else, as long as I use ifconfig at least once I can trigger this.
Comment 1 Henno Schooljan 2019-01-16 00:58:43 UTC
Created attachment 201174 [details]
trace_13.0-CURRENT-r343065.txt: kernel trace
Comment 2 Henno Schooljan 2019-01-16 01:20:04 UTC
Interesting fact after doing some more testing: All is well also when I do not remove the jails, or when I remove the jail *after* destroying the epair interface.

Only when I remove the jail *before* destroying the epair interface *and* I run the ifconfig command inside the jail, I can trigger the panic.

I hope I provided enough info, let me know if I can test and/or provide anything else to pinpoint the issue here.
Comment 3 Alexander Leidinger freebsd_committer freebsd_triage 2019-07-09 18:14:52 UTC
With r349853 I don't get a panic with your script, but if I assign an IP inside the jail (jexec <id> ifconfig inet 1.2.3.4) instead of just listing the interfaces, it panics on destroy.

(kgdb) #0  __curthread () at /space/system/usr_src/sys/amd64/include/pcpu.h:246
#1  doadump (textdump=1) at /space/system/usr_src/sys/kern/kern_shutdown.c:392
#2  0xffffffff8050cf70 in kern_reboot (howto=260)
    at /space/system/usr_src/sys/kern/kern_shutdown.c:479
#3  0xffffffff8050d3e9 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /space/system/usr_src/sys/kern/kern_shutdown.c:905
#4  0xffffffff8050d123 in panic (fmt=<unavailable>)
    at /space/system/usr_src/sys/kern/kern_shutdown.c:832
#5  0xffffffff807e758c in trap_fatal (frame=0xfffffe01598227c0, eva=0)
    at /space/system/usr_src/sys/amd64/amd64/trap.c:943
#6  0xffffffff807e698c in trap (frame=0xfffffe01598227c0)
    at /space/system/usr_src/sys/amd64/amd64/trap.c:221
#7  <signal handler called>
#8  0xffffffff805f2045 in strncmp (s1=<optimized out>, s2=<optimized out>,
    n=<optimized out>) at /space/system/usr_src/sys/libkern/strncmp.c:44
#9  0xffffffff80605d31 in ifunit_ref (name=0xfffffe0159822a20 "panic_test1b")
    at /space/system/usr_src/sys/net/if.c:2434
#10 0xffffffff80607ef8 in ifioctl (so=0xfffff809a1afd368, cmd=3223349536,
    data=0xfffffe0159822a20 "panic_test1b", td=0xfffff8014c83e5a0)
    at /space/system/usr_src/sys/net/if.c:3093
#11 0xffffffff8057658d in fo_ioctl (fp=<optimized out>, com=3223349536,
    data=0xfffff800020e2180, active_cred=0x0, td=0xfffff8014c83e5a0)
    at /space/system/usr_src/sys/sys/file.h:333
#12 kern_ioctl (td=0xfffff8014c83e5a0, fd=3, com=3223349536,
    data=0xfffff800020e2180 "")
    at /space/system/usr_src/sys/kern/sys_generic.c:800
#13 0xffffffff805762ad in sys_ioctl (td=0xfffff8014c83e5a0,
    uap=0xfffff8014c83e968) at /space/system/usr_src/sys/kern/sys_generic.c:712
#14 0xffffffff807e801a in syscallenter (td=0xfffff8014c83e5a0)
    at /space/system/usr_src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#15 amd64_syscall (td=0xfffff8014c83e5a0, traced=0)
    at /space/system/usr_src/sys/amd64/amd64/trap.c:1181
Comment 4 Kristof Provost freebsd_committer freebsd_triage 2019-07-09 18:27:57 UTC
This is almost certainly the same problem as the one discussed in #238870 and https://reviews.freebsd.org/D20868.

The patches in https://reviews.freebsd.org/D20868 and https://reviews.freebsd.org/D20869 work around the panic, but are not fully correct fixes.
Comment 5 Rocco 2019-09-26 16:42:17 UTC
I have had the same problem quite early on. I believe it is a bug in the VNET cleanup code.
It has an easy workaround, which works quite well in my setup:

Before destroying the interface, remove it from the jail (maybe use a prestop hook in the jail.conf). Use this command on the host:

ifconfig $interfaceName -vnet $jailName

It will remove the interface from the jail's VNET. Then you can destroy the epair on the host.
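As a rough illustration of that ordering, the teardown could look like the sketch below. The function name, its arguments and the DRY_RUN switch are hypothetical, and removing the jail between the two ifconfig calls is an assumption; the comment above only states that the interface should be taken out of the jail before the epair is destroyed.

```shell
#!/bin/sh
# Sketch of the workaround order; names and DRY_RUN are illustrative only.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

teardown_vnet_epair() {
    jail_name=$1    # jail that currently holds the interface
    jail_if=$2      # interface inside the jail, e.g. epair0b
    host_if=$3      # host side of the pair, e.g. epair0a

    # 1. Reclaim the interface from the jail's VNET (run on the host).
    run ifconfig "$jail_if" -vnet "$jail_name" || return 1
    # 2. Remove the jail (assumed step, not spelled out in the comment).
    run jail -r "$jail_name" || return 1
    # 3. Destroy the epair on the host; this removes both ends.
    run ifconfig "$host_if" destroy
}
```

For example, `teardown_vnet_epair myjail epair0b epair0a`. The first step alone is what a prestop hook would run, for instance via the exec.prestop parameter in jail.conf.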
Comment 6 O. Hartmann 2020-01-08 09:48:28 UTC
The problem is still persistent in recent CURRENT ( FreeBSD 13.0-CURRENT #26 r356437: Tue Jan  7 07:19:34 CET 2020 amd64).

See also PR 238326 and PR 219901, a bug known since 2017.
Comment 7 Kubilay Kocak freebsd_committer freebsd_triage 2020-01-25 03:18:24 UTC
^Triage: Track earliest reported/reproducible/affected branch
Comment 8 freebsd 2020-11-24 17:21:54 UTC
My case seems similar :
* Using 12.2-RELEASE
* Jail defined through ezjail
* using vnet, and jib (/usr/share/examples/jails/jib) to manage the interface

ezjail-admin (one)start works without problem, and when logged in as root ezjail-admin (one)stop works.
However, when logged in as another user, sudo ezjail-admin (one)stop or su -; ezjail-admin (one)stop provokes a panic (page fault) on the host.
I am attaching the crashinfo output, the definition of the jail, and the rc.conf of the host and the jail.
If needed I can provide the vmcrash file or even the virtualbox diskimage used to reproduce the bug
Comment 9 freebsd 2020-11-24 17:23:12 UTC
Created attachment 219930 [details]
bug through ezjail : host rc.conf
Comment 10 freebsd 2020-11-24 17:23:55 UTC
Created attachment 219931 [details]
bug through ezjail : jail definition
Comment 11 freebsd 2020-11-24 17:24:39 UTC
Created attachment 219932 [details]
bug through ezjail : jail rc.conf
Comment 12 freebsd 2020-11-24 17:25:44 UTC
Created attachment 219933 [details]
bug through ezjail : crashinfo output
Comment 13 Marek Zarychta 2020-11-24 18:01:27 UTC
(In reply to freebsd from comment #8)

I have tested it some time ago and it looks like ezjail is not able to shut down the guest system in the correct way. If the child interfaces of epairb in the jail are destroyed before the withdrawal of epair from the jail and jail shutdown, then panic doesn't occur. 

On the other hand, ezjail officially doesn't support vnet and has been unmaintained for a while.
Comment 14 Kristof Provost freebsd_committer freebsd_triage 2020-11-24 19:07:41 UTC
(In reply to Marek Zarychta from comment #13)
This isn't an ezjail bug. It's a kernel issue.

I'm working on a fix. Some discussion in https://reviews.freebsd.org/D27279 but that's not going to be the final fix.
Comment 15 commit-hook freebsd_committer freebsd_triage 2020-12-01 16:24:23 UTC
A commit references this bug:

Author: kp
Date: Tue Dec  1 16:24:00 UTC 2020
New revision: 368237
URL: https://svnweb.freebsd.org/changeset/base/368237

Log:
  if: Fix panic when destroying vnet and epair simultaneously

  When destroying a vnet and an epair (with one end in the vnet) we often
  panicked. This was the result of the destruction of the epair, which destroys
  both ends simultaneously, happening while vnet_if_return() was moving the
  struct ifnet to its home vnet. This can result in a freed ifnet being re-added
  to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
  used.

  Prevent this race by ensuring that vnet_if_return() cannot run at the same time
  as if_detach() or epair_clone_destroy().

  PR:		238870, 234985, 244703, 250870
  MFC after:	2 weeks
  Sponsored by:	Modirum MDPay
  Differential Revision:	https://reviews.freebsd.org/D27378

Changes:
  head/sys/net/if.c
Comment 16 commit-hook freebsd_committer freebsd_triage 2020-12-15 15:34:27 UTC
A commit references this bug:

Author: kp
Date: Tue Dec 15 15:33:29 UTC 2020
New revision: 368663
URL: https://svnweb.freebsd.org/changeset/base/368663

Log:
  MFC r368237:

  if: Fix panic when destroying vnet and epair simultaneously

  When destroying a vnet and an epair (with one end in the vnet) we often
  panicked. This was the result of the destruction of the epair, which destroys
  both ends simultaneously, happening while vnet_if_return() was moving the
  struct ifnet to its home vnet. This can result in a freed ifnet being re-added
  to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
  used.

  Prevent this race by ensuring that vnet_if_return() cannot run at the same time
  as if_detach() or epair_clone_destroy().

  PR:		238870, 234985, 244703, 250870
  Sponsored by:	Modirum MDPay

Changes:
_U  stable/12/
  stable/12/sys/net/if.c
Comment 17 Marek Zarychta 2020-12-15 16:44:49 UTC
Thanks for the fix. It looks promising, but the panic still occurs when the jail is stopped without removing all child interfaces of epairb within the jail. 

Perhaps it's a step forward in the right direction, but there is still no way to use VLAN subinterfaces inside VNET jails.

Tested on FreeBSD 12.2-STABLE r368664.
Comment 18 Kristof Provost freebsd_committer freebsd_triage 2020-12-15 16:45:39 UTC
(In reply to Marek Zarychta from comment #17)
What setup do you use and what panic do you see?
Comment 19 Marek Zarychta 2020-12-15 17:02:30 UTC
For a while now I have been testing a setup with epair(4) bridged to an LACP lagg(4) with a few VLANs. I am able to create, use and destroy vlan(4) subinterfaces on epairb within the VNET jail. The only drawback is that all vlan(4) interfaces created on epairb have to be destroyed prior to stopping the VNET jail. If that is done manually, everything works fine; if the jail is stopped by sysutils/ezjail, a panic occurs.
I am aware that sysutils/ezjail is not actively maintained and does not support the VNET framework. Please don't get me wrong, I am not complaining about the patch, but I believed that it solved all issues regarding VNET jail panics, which was a wrong assumption.
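The manual cleanup described above might be scripted along these lines. This is a sketch under assumptions: the vlan(4) interface names are passed in explicitly, and the function name and DRY_RUN switch are hypothetical. It only encodes the order "destroy the vlan interfaces inside the jail, then stop the jail".

```shell
#!/bin/sh
# Sketch of the manual cleanup order; names and DRY_RUN are illustrative only.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

stop_jail_with_vlans() {
    jail_name=$1
    shift    # remaining arguments: vlan(4) interfaces created in the jail

    # Destroy every vlan child inside the jail before the jail goes away.
    for vlan_if in "$@"; do
        run jexec "$jail_name" ifconfig "$vlan_if" destroy || return 1
    done

    # Only then stop the jail.
    run jail -r "$jail_name"
}
```

For example, `stop_jail_with_vlans myjail vlan0 vlan1` destroys vlan0 and vlan1 inside the jail and then removes the jail.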
Comment 20 Kristof Provost freebsd_committer freebsd_triage 2020-12-15 17:06:22 UTC
(In reply to Marek Zarychta from comment #19)
Yes, and if you'd describe your setup and show the panic you're running into maybe we could fix that problem too.

The if_vlan:basic test does this: it creates a vlan on top of an epair (actually in two jails, to do a basic ping test) and then the jails and epairs get destroyed. That does not panic. So, clearly you're doing something different, so please tell us what that is!
Comment 21 Marek Zarychta 2020-12-15 19:51:52 UTC
Thanks for the patch. It solves the issue of the clean removal of orphaned interfaces.

I had a deeper look into this and it turned out that netgraph(3) was the culprit.
In the meantime, the kernel was upgraded to 12.2-STABLE r368671.
I confirm that with netgraph modules not loaded panic doesn't happen. 

This machine has swap on a ZFS zvol, so I was not able to get a local core dump. The network interfaces (VLAN over LACP lagg(4)) don't allow using a netdump(4) server, but with slow motion on the serial console, I was able to capture the panic. I can share the screencasts if still relevant.
Comment 22 Kristof Provost freebsd_committer freebsd_triage 2020-12-15 19:56:42 UTC
(In reply to Marek Zarychta from comment #21)
Yes! Panics are relevant! Explain how you trigger it and show the crashdump! Please!
Comment 23 Marek Zarychta 2020-12-15 20:59:05 UTC
(In reply to Kristof Provost from comment #22)

Prior to the panic, such messages appear:

ng node vlan0 needs NGF_REALLY_DIE
ng node vlan1 needs NGF_REALLY_DIE

These messages refer to the interfaces created atop epairb in the jail.

I have sent you link to the screencast but don't want to disclose it here.

Thank you for your patience and for solving this old bug. It looks like VNET jails can now be widely supported and gain the ability to use vlan(4) interfaces when netgraph(3) is not required on the host.
Comment 24 Kristof Provost freebsd_committer freebsd_triage 2020-12-15 22:10:41 UTC
(In reply to Marek Zarychta from comment #23)
You're running into #233622, which is a different bug.
Comment 25 commit-hook freebsd_committer freebsd_triage 2021-01-29 01:06:05 UTC
A commit in branch releng/12.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e0c15f45abd4bd5165e11b557a8c90d0faf5cfeb

commit e0c15f45abd4bd5165e11b557a8c90d0faf5cfeb
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2021-01-18 21:55:53 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2021-01-29 00:58:55 +0000

    MFC r368237: if: Fix panic when destroying vnet and epair simultaneously

    When destroying a vnet and an epair (with one end in the vnet) we often
    panicked. This was the result of the destruction of the epair, which destroys
    both ends simultaneously, happening while vnet_if_return() was moving the
    struct ifnet to its home vnet. This can result in a freed ifnet being re-added
    to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
    used.

    Prevent this race by ensuring that vnet_if_return() cannot run at the same time
    as if_detach() or epair_clone_destroy().

    PR:             238870, 234985, 244703, 250870
    Sponsored by:   Modirum MDPay
    Approved by:    so

 sys/net/if.c     | 147 +++++++++++++++++++++++++++++++++++++------------------
 sys/net/if_var.h |  24 ++-------
 2 files changed, 104 insertions(+), 67 deletions(-)
Comment 26 commit-hook freebsd_committer freebsd_triage 2021-01-29 01:21:20 UTC
A commit in branch releng/12.2 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e682b62c96e94c60d830e4414215032e0d4f8dad

commit e682b62c96e94c60d830e4414215032e0d4f8dad
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2020-09-12 16:33:05 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2021-01-29 01:14:24 +0000

    MFC r368237: if: Fix panic when destroying vnet and epair simultaneously

    When destroying a vnet and an epair (with one end in the vnet) we often
    panicked. This was the result of the destruction of the epair, which destroys
    both ends simultaneously, happening while vnet_if_return() was moving the
    struct ifnet to its home vnet. This can result in a freed ifnet being re-added
    to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
    used.

    Prevent this race by ensuring that vnet_if_return() cannot run at the same time
    as if_detach() or epair_clone_destroy().

    PR:             238870, 234985, 244703, 250870
    Sponsored by:   Modirum MDPay
    Approved by:    so

 sys/net/if.c     | 147 +++++++++++++++++++++++++++++++++++++------------------
 sys/net/if_var.h |  24 ++-------
 2 files changed, 104 insertions(+), 67 deletions(-)