Bug 233955 - [panic] Page fault in in6_purgeaddr upon tun create/destroy (net/wireguard affected)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Kyle Evans
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2018-12-12 11:09 UTC by Manas Bhatnagar
Modified: 2019-05-10 08:56 UTC
CC: 16 users

See Also:
koobs: mfc-stable11+
koobs: mfc-stable12+


Description Manas Bhatnagar 2018-12-12 11:09:38 UTC
I have experienced multiple hard reboots of my FreeBSD 11.2-RELEASE system, which occur when I try to deactivate a WireGuard interface with wg-quick.

This does not always occur; on occasion I am able to activate and deactivate all interfaces without issue. I have been unable to determine which specific conditions trigger the hard reboot.

After creating the appropriate keypairs and configuration files, a new interface is created in a specific routing table (FIB) with:

# setfib $FIB route add default $DEFAULTGATEWAY
# setfib $FIB wg-quick up wg$N

After the connection has been active for a short period, it is deactivated with:

# setfib $FIB wg-quick down wg$N

The last few messages seen in debug.log prior to the reboot are of the form:

Dec 12 10:50:16 $hostname kernel: ifa_maintain_loopback_route: deletion failed for interface wg0: 3

The configuration files are simple:

[Interface]
Address = ${PRIVATE_IP}/24
PrivateKey = $PRIVATEKEY
DNS = 127.0.0.1

[Peer]
PublicKey = $PUBLICKEY
Endpoint = ${PUBLIC_IP}:51820
AllowedIPs = 0.0.0.0/0
PersistentKeepalive = 30
Comment 1 Bernhard Froehlich freebsd_committer 2018-12-12 12:37:00 UTC
There is currently a report at OPNsense that includes a stack trace of what seems to be a kernel panic in UFS code.

https://github.com/opnsense/plugins/pull/1049

I tried to reproduce that in multiple environments, but without luck so far. It would help a lot if you could check whether you have a kernel crash dump in /var/crash and can obtain a stack trace.

The handbook helps with instructions:
https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html
Comment 2 Manas Bhatnagar 2018-12-13 07:22:47 UTC
If it helps, I am using ZFS-on-root - I do not have any UFS filesystems.

I have checked /var/crash and the directory is empty other than a 'minfree' file containing only the text "2048".
'dumpdev' was set to "AUTO" in my /etc/rc.conf.
'dumpdir' was not defined; I have now set it to '/var/crash' with permissions 700. I will check for kernel dumps if another reboot occurs.
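For reference, the standard rc.conf(5) knobs for enabling kernel crash dumps on FreeBSD are:

```shell
# /etc/rc.conf -- enable kernel crash dumps
dumpdev="AUTO"        # dump to the configured swap device on panic
dumpdir="/var/crash"  # where savecore(8) extracts vmcore files at boot
```

After the next panic, savecore(8) should leave a vmcore.N (and matching info.N) in dumpdir.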

I did notice that if I deactivated the WireGuard interfaces almost immediately after activation, they would deactivate without issue. Otherwise, when running about a dozen WireGuard instances that had been active for more than a few minutes, deactivating the interfaces sequentially could result in a hard reboot in an unpredictable manner: some interfaces would deactivate fine, but one would cause a hard reboot.
Comment 3 Manas Bhatnagar 2019-01-02 19:07:33 UTC
I have seen the reboot happen twice more - still nothing in /var/crash
Comment 4 Michael Muenz 2019-01-29 12:30:47 UTC
Here's a new report with a bit of crashdump:

https://twitter.com/genneko217/status/1090218028480921600

https://gist.github.com/genneko/755f6160ba2594c5945b8fc18940ea71

HTH
Michael
Comment 5 Mark Johnston freebsd_committer 2019-01-30 18:30:31 UTC
This appears to be a kernel bug rather than an issue with the port itself.
Comment 6 Mark Johnston freebsd_committer 2019-01-30 18:34:29 UTC
Looks like the issue is seen on both 11.2 and 12.0?
Comment 7 Michael Muenz 2019-01-31 14:08:49 UTC
I was able to reproduce it on 11.2 and 12.0; sometimes it takes up to 100 daemon restarts, sometimes only two.
Comment 8 Manas Bhatnagar 2019-02-28 06:58:52 UTC
I see this on FreeBSD 12.0-RELEASE-p3 without setfib as well. Every 'wg-quick down wg0' command has resulted in a hard reboot.

wireguard-0.0.20190123
wireguard-go-0.0.20181222
Comment 9 Marek Zarychta 2019-02-28 07:38:47 UTC
12-STABLE panicked once for me while deactivating a wireguard interface (in fib 0). I haven't seen such an issue on 13-STABLE yet, despite the fact that I am using wireguard there more often.
I can provide neither crash dumps nor additional information.
Comment 10 Marek Zarychta 2019-02-28 07:39:47 UTC
(In reply to Marek Zarychta from comment #9)
13-CURRENT not STABLE of course
Comment 11 Bernhard Froehlich freebsd_committer 2019-03-21 10:52:29 UTC
This stack trace is from https://gist.github.com/genneko/755f6160ba2594c5945b8fc18940ea71
and I copied it here in case it vanishes from GitHub.


dumped core - see /var/crash/vmcore.0

Tue Jan 29 11:09:03 UTC 2019

FreeBSD  12.0-RELEASE-p2 FreeBSD 12.0-RELEASE-p2 GENERIC  amd64

panic: page fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
<6>in_scrubprefix: err=65, prefix delete failed
<6>wg0: deletion failed: 3


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80cc3fe3
stack pointer	        = 0x28:0xfffffe001de86300
frame pointer	        = 0x28:0xfffffe001de86450
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 3813 (wireguard-go)
trap number		= 12
panic: page fault
cpuid = 1
time = 1548760075
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f315 at calltrap+0x8
#7 0xffffffff80de0f73 at in6_purgeaddr+0x463
#8 0xffffffff80c9662f at if_purgeaddrs+0x21f
#9 0xffffffff80ca79c1 at tunclose+0x1f1
#10 0xffffffff80a518ca at devfs_close+0x3ba
#11 0xffffffff811f89b8 at VOP_CLOSE_APV+0x78
#12 0xffffffff80c7b6bf at vn_close1+0xdf
#13 0xffffffff80c7a3c0 at vn_closefile+0x50
#14 0xffffffff80a5224c at devfs_close_f+0x2c
#15 0xffffffff80b4363a at _fdrop+0x1a
#16 0xffffffff80b466e4 at closef+0x244
#17 0xffffffff80b43b69 at closefp+0x99
Uptime: 5m14s
Dumping 190 out of 2005 MB:..9%..17%..26%..34%..43%..51%..68%..76%..85%..93%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/modules/vboxguest.ko...done.
Loaded symbols for /boot/modules/vboxguest.ko
Reading symbols from /boot/kernel/intpm.ko...Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug...done.
done.
Loaded symbols for /boot/kernel/intpm.ko
Reading symbols from /boot/kernel/smbus.ko...Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...done.
done.
Loaded symbols for /boot/kernel/smbus.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:230
230	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:230
#1  0xffffffff80b9b14b in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff80b9b5c3 in vpanic (fmt=<value optimized out>, 
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff80b9b3b3 in panic (fmt=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff8107496f in trap_fatal (frame=0xfffffe001de86240, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:929
#5  0xffffffff810749c9 in trap_pfault (frame=0xfffffe001de86240, usermode=0)
    at pcpu.h:230
#6  0xffffffff81073fee in trap (frame=0xfffffe001de86240)
    at /usr/src/sys/amd64/amd64/trap.c:441
#7  0xffffffff8104f315 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:232
#8  0xffffffff80cc3fe3 in rtsock_addrmsg (cmd=2, ifa=0xfffff80062f71200, 
    fibnum=-1) at /usr/src/sys/net/rtsock.c:1337
#9  0xffffffff80de0f73 in in6_purgeaddr (ifa=0xfffff80062f71200)
    at /usr/src/sys/netinet6/in6.c:193
#10 0xffffffff80c9662f in if_purgeaddrs (ifp=0xfffff80062845000)
    at /usr/src/sys/net/if.c:995
#11 0xffffffff80ca79c1 in tunclose (dev=<value optimized out>, 
    foo=<value optimized out>, bar=<value optimized out>, 
    td=<value optimized out>) at /usr/src/sys/net/if_tun.c:478
#12 0xffffffff80a518ca in devfs_close (ap=<value optimized out>)
    at /usr/src/sys/fs/devfs/devfs_vnops.c:650
#13 0xffffffff811f89b8 in VOP_CLOSE_APV (vop=<value optimized out>, 
    a=0xfffffe001de86788) at vnode_if.c:534
#14 0xffffffff80c7b6bf in vn_close1 (vp=0xfffff8006291ad20, flags=7, 
    file_cred=0xfffff80062849a00, td=0xfffff8001da8d000, keep_ref=false)
    at vnode_if.h:225
#15 0xffffffff80c7a3c0 in vn_closefile (fp=0xfffff8006a031050, 
    td=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1563
#16 0xffffffff80a5224c in devfs_close_f (fp=0xfffff8006a031050, 
    td=<value optimized out>) at /usr/src/sys/fs/devfs/devfs_vnops.c:669
#17 0xffffffff80b4363a in _fdrop (fp=0xfffff8006a031050, 
    td=<value optimized out>) at file.h:353
#18 0xffffffff80b466e4 in closef (fp=0xfffff8006a031050, td=0xfffff8001da8d000)
    at /usr/src/sys/kern/kern_descrip.c:2528
#19 0xffffffff80b43b69 in closefp (fdp=0xfffff8006a04d450, 
    fd=<value optimized out>, fp=0xfffff8006a031050, td=0xfffff8001da8d000, 
    holdleaders=0) at /usr/src/sys/kern/kern_descrip.c:1199
#20 0xffffffff81075449 in amd64_syscall (td=0xfffff8001da8d000, traced=0)
    at subr_syscall.c:135
#21 0xffffffff8104fbfd in fast_syscall_common ()
    at /usr/src/sys/amd64/amd64/exception.S:504
#22 0x000000000048bdb0 in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
Comment 12 Kubilay Kocak freebsd_committer freebsd_triage 2019-04-05 14:35:35 UTC
Set version to original (and earliest) version issue was identified in.
Comment 13 Ed Maste freebsd_committer 2019-04-05 15:37:31 UTC
(In reply to Kubilay Kocak from comment #12)
IMO, tagging an issue with the earliest version in which it's reproducible, rather than the latest, is likely to somewhat reduce the likelihood that it is noticed / triaged / addressed by the appropriate developer. (For this specific PR it doesn't matter, as that has already happened.)
Comment 14 Jason A. Donenfeld 2019-04-20 02:31:21 UTC
Can somebody test this and see if it "fixes" the issue: https://git.zx2c4.com/wireguard-go/patch/?id=3fafe92382d6231ee066f62ac946fbc909aeac5d

This obviously doesn't address a rather grave kernel bug, but perhaps it's enough to work around the issue for now.
Comment 15 Jason A. Donenfeld 2019-04-20 02:33:35 UTC
Sorry, wrong link. Try this, rather: https://git.zx2c4.com/wireguard-go/patch/?h=jd/rancid-freebsd-hack
Comment 16 genneko217 2019-04-20 14:09:53 UTC
(In reply to Jason A. Donenfeld from comment #15)
As I've recently been playing with WireGuard on FreeBSD again, I quickly tested the patch on a 4-core FreeBSD 12.0p3 VM and found that it almost works around the kernel issue.

With the patched wireguard-go, only 2 out of 25000+ "service wireguard restart" runs caused a kernel panic, while a panic occurred every 5 to 50 restarts without the patch.

As a side note, I also noticed in my recent testing

- No kernel panic on single-core FreeBSD 12.0p3 / 13-CURRENT VMs
  with the unpatched wireguard-go-0.0.20181222 / 20190409
  and 10000+ restarts.

- No kernel panic on a 4-core FreeBSD 13-CURRENT r346132 VM with
  the unpatched wireguard-go-0.0.20190409 and 40000+ restarts.

A stacktrace of the panic with the patch is as follows.
(Panics without the patch are the same as the one mentioned in
 comment #4 and #11.)

Hope this helps.


 dumped core - see /var/crash/vmcore.2

Sat Apr 20 12:07:44 UTC 2019

FreeBSD  12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC  amd64

panic: page fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: page fault
cpuid = 1
time = 1555762025
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f435 at calltrap+0x8
#7 0xffffffff80ca90d7 at tunifioctl+0x257
#8 0xffffffff80c9a072 at ifhwioctl+0x2f2
#9 0xffffffff80c9c05f at ifioctl+0x45f
#10 0xffffffff80c04f3d at kern_ioctl+0x26d
#11 0xffffffff80c04c5e at sys_ioctl+0x15e
#12 0xffffffff81075449 at amd64_syscall+0x369
#13 0xffffffff8104fd1d at fast_syscall_common+0x101
Uptime: 22m44s
Dumping 171 out of 469 MB:..10%..19%..29%..38%..47%..57%..66%..75%..85%..94%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/modules/vboxguest.ko...done.
Loaded symbols for /boot/modules/vboxguest.ko
Reading symbols from /boot/kernel/intpm.ko...Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug...done.
done.
Loaded symbols for /boot/kernel/intpm.ko
Reading symbols from /boot/kernel/smbus.ko...Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...done.
done.
Loaded symbols for /boot/kernel/smbus.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:230
230     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:230
#1  0xffffffff80b9b14b in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff80b9b5c3 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff80b9b3b3 in panic (fmt=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff8107496f in trap_fatal (frame=0xfffffe000fe94590, eva=1040)
    at /usr/src/sys/amd64/amd64/trap.c:929
#5  0xffffffff810749c9 in trap_pfault (frame=0xfffffe000fe94590, usermode=0)
    at pcpu.h:230
#6  0xffffffff81073fee in trap (frame=0xfffffe000fe94590)
    at /usr/src/sys/amd64/amd64/trap.c:441
#7  0xffffffff8104f435 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:232
#8  0xffffffff80b7ad4c in __mtx_lock_sleep (c=0xfffff8001045bc98, v=4)
    at /usr/src/sys/kern/kern_mutex.c:577
#9  0xffffffff80ca90d7 in tunifioctl (ifp=<value optimized out>,
    cmd=<value optimized out>, data=0xfffff80002f98c00 "wg0")
    at /usr/src/sys/net/if_tun.c:543
#10 0xffffffff80c9a072 in ifhwioctl (cmd=<value optimized out>,
    ifp=<value optimized out>, data=<value optimized out>,
    td=0xfffff80002f22000) at /usr/src/sys/net/if.c:2881
#11 0xffffffff80c9c05f in ifioctl (so=0xfffff8000969b6d0, cmd=3274795323,
    data=<value optimized out>, td=0xfffff80002f22000)
    at /usr/src/sys/net/if.c:3086
#12 0xffffffff80c04f3d in kern_ioctl (td=0xfffff80002f22000, fd=3,
    com=3274795323, data=<value optimized out>) at file.h:330
#13 0xffffffff80c04c5e in sys_ioctl (td=0xfffff80002f22000,
    uap=0xfffff80002f223c0) at /usr/src/sys/kern/sys_generic.c:712
#14 0xffffffff81075449 in amd64_syscall (td=0xfffff80002f22000, traced=0)
    at subr_syscall.c:135
#15 0xffffffff8104fd1d in fast_syscall_common ()
    at /usr/src/sys/amd64/amd64/exception.S:504
#16 0x000000080046611a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
Comment 17 Jason A. Donenfeld 2019-04-21 01:37:33 UTC
Interesting. That looks like *another*, separate, race. Yikes, this kernel driver...

Let's see if this works around it: https://git.zx2c4.com/WireGuard/patch/?id=69ffe5b7f58ce6f55dda2b9e13ff364a0d9b3dcd
Comment 18 Jason A. Donenfeld 2019-04-21 01:38:19 UTC
(That commit should be combined with the previous one: https://git.zx2c4.com/wireguard-go/patch/?h=jd/rancid-freebsd-hack )
Comment 19 Michael Muenz 2019-04-22 16:39:03 UTC
I applied both patches on OPNsense 19.1 (based on HardenedBSD 11.2) and it seems to work. Nice work!
Comment 20 Michael Muenz 2019-04-22 16:42:16 UTC
Now it crashed, after around 1000 restarts, sorry.
Comment 21 Jason A. Donenfeld 2019-04-22 16:45:59 UTC
Stack trace, please.
Comment 22 Jason A. Donenfeld 2019-04-23 09:53:40 UTC
Alright, I've now spent a bit of time tracking down these race conditions and reproducing them in a VM. It looks like there are two separate kernel race conditions:

- SIOCGIFSTATUS races with SIOCIFDESTROY, which was being triggered by the call to ifconfig(8) in the route monitor script. This should be now fixed by: https://git.zx2c4.com/WireGuard/patch/?id=90c546598c0a9d9da82c138c6c9c1396c453368e

- The asynchronous callback of IPv6 link-local DAD conflicts with both SIOCIFDESTROY and the /dev/tun cloning mechanism, resulting in a wide variety of crashes with dangling pointers on IPv6 address lists. I'm able to trigger this by just running `while true; do ifconfig tun0 create; ifconfig tun0 destroy; done`, and after a while there's one sort of crash or another. This should now be fixed by: https://git.zx2c4.com/wireguard-go/patch/?id=bb42ec7d185ab5f5cd3867ac1258edff86b7f307

I'd appreciate it if Michael Muenz could test these and make sure they fix his own reproducer. After that, Bernhard Froehlich can backport those two patches into the ports tree. Finally, THIS BUG SHOULD REMAIN OPEN until the FreeBSD kernel team actually gets the manpower to fix these race conditions; the above only represents a few workarounds and does not address the underlying issue of this bug at all.
Comment 23 Jason A. Donenfeld 2019-04-23 10:08:49 UTC
Also, given the success of `while true; do ifconfig tun0 create; ifconfig tun0 destroy; done` at crashing the kernel, I'm pretty sure you can remove "triggered by net/wireguard" from the title of the bug report.
Comment 24 Jason A. Donenfeld 2019-04-23 11:22:40 UTC
(while true; do ifconfig tun0 create; ifconfig tun0 destroy; done)& (while true; do for i in {1..30}; do ifconfig tun0 & done; wait; done)&

Potentially more vicious reproducer.
Comment 25 genneko217 2019-04-23 11:58:52 UTC
(In reply to Jason A. Donenfeld from comment #22)
Thank you for various workarounds. Those patches work so far.

I noticed that wg-quick down occasionally hangs at piperd, but it now seems to be patched in upstream master. Really quick! I'm testing it now.
Comment 26 Jason A. Donenfeld 2019-04-23 12:06:27 UTC
Indeed, the patches I posted here are (already) out of date. But upstream master now has patches that appear to work around the kernel bugs of this report. I think we're done here on the WireGuard front, and it should be time to ship the fixes in the official package. However, do pipe up if you're able to crash things again in relation to WireGuard with yet-even-more FreeBSD kernel race conditions, and I'll take out the hatchet and try to hack around it.
Comment 27 commit-hook freebsd_committer 2019-04-23 12:34:09 UTC
A commit references this bug:

Author: decke
Date: Tue Apr 23 12:33:45 UTC 2019
New revision: 499754
URL: https://svnweb.freebsd.org/changeset/ports/499754

Log:
  net/wireguard:
  work around numerous kernel panics on shutdown in tun(4)

  There are numerous race conditions. But even this will crash it:

  while true; do ifconfig tun0 create; ifconfig tun0 destroy; done

  It seems like LLv6 is related, which we're not using anyway, so
  explicitly disable it on the interface.

  PR:     	233955

Changes:
  head/net/wireguard-go/Makefile
  head/net/wireguard-go/files/
  head/net/wireguard-go/files/patch-bb42ec7d185ab5f5cd3867ac1258edff86b7f307
Comment 28 commit-hook freebsd_committer 2019-04-23 12:37:17 UTC
A commit references this bug:

Author: decke
Date: Tue Apr 23 12:36:30 UTC 2019
New revision: 499755
URL: https://svnweb.freebsd.org/changeset/ports/499755

Log:
  net/wireguard: workaround SIOCGIFSTATUS race in FreeBSD kernel

  PR:		233955

Changes:
  head/net/wireguard/Makefile
  head/net/wireguard/files/patch-b3e1a1b07d3631bd816f9bfc27452a89dc29fa28
Comment 29 commit-hook freebsd_committer 2019-04-23 17:29:17 UTC
A commit references this bug:

Author: kevans
Date: Tue Apr 23 17:28:28 UTC 2019
New revision: 346602
URL: https://svnweb.freebsd.org/changeset/base/346602

Log:
  tun(4): Defer clearing TUN_OPEN until much later

  tun destruction will not continue until TUN_OPEN is cleared. There are brief
  moments in tunclose where the mutex is dropped and we've already cleared
  TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle
  of cleaning up the tun still. tun_destroy should be blocked until these
  parts (address/route purges, mostly) are complete.

  PR:		233955
  MFC after:	2 weeks

Changes:
  head/sys/net/if_tun.c
Comment 30 Kyle Evans freebsd_committer 2019-04-23 18:33:08 UTC
r346602 probably fixes the in6_purgeaddr panic noted here. The tun mtx is dropped while we purge addresses/routes and the TUN_OPEN flag was cleared, giving tun_destroy a chance to rototill the interface somewhere in the middle of that.

I've opened https://reviews.freebsd.org/D20027 to try and settle the ioctl race.
Comment 31 commit-hook freebsd_committer 2019-04-25 12:44:29 UTC
A commit references this bug:

Author: kevans
Date: Thu Apr 25 12:44:08 UTC 2019
New revision: 346670
URL: https://svnweb.freebsd.org/changeset/base/346670

Log:
  tun/tap: close race between destroy/ioctl handler

  It seems that there should be a better way to handle this, but this seems to
  be the more common approach and it should likely get replaced in all of the
  places it happens... Basically, thread 1 is in the process of destroying the
  tun/tap while thread 2 is executing one of the ioctls that requires the
  tun/tap mutex and the mutex is destroyed before the ioctl handler can
  acquire it.

  This is only one of the races described/found in PR 233955.

  PR:		233955
  Reviewed by:	ae
  MFC after:	2 weeks
  Differential Revision:	https://reviews.freebsd.org/D20027

Changes:
  head/sys/net/if_tap.c
  head/sys/net/if_tun.c
Comment 32 Ed Maste freebsd_committer 2019-04-25 13:27:52 UTC
> I'm pretty sure you can remove "triggered by net/wireguard" from the title of the bug report.

Indeed, updated - I think it's good to keep wireguard in the title as it seems it might be the common user-facing impact, but shouldn't imply that wireguard is somehow at fault.
Comment 33 commit-hook freebsd_committer 2019-05-09 03:52:10 UTC
A commit references this bug:

Author: kevans
Date: Thu May  9 03:51:35 UTC 2019
New revision: 347378
URL: https://svnweb.freebsd.org/changeset/base/347378

Log:
  MFC r346602, r346670-r346671, r347183: tun/tap race fixes

  r346602:
  tun(4): Defer clearing TUN_OPEN until much later

  tun destruction will not continue until TUN_OPEN is cleared. There are brief
  moments in tunclose where the mutex is dropped and we've already cleared
  TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle
  of cleaning up the tun still. tun_destroy should be blocked until these
  parts (address/route purges, mostly) are complete.

  r346670:
  tun/tap: close race between destroy/ioctl handler

  It seems that there should be a better way to handle this, but this seems to
  be the more common approach and it should likely get replaced in all of the
  places it happens... Basically, thread 1 is in the process of destroying the
  tun/tap while thread 2 is executing one of the ioctls that requires the
  tun/tap mutex and the mutex is destroyed before the ioctl handler can
  acquire it.

  This is only one of the races described/found in PR 233955.

  r346671:
  tun(4): Don't allow open of open or dying devices

  Previously, a pid check was used to prevent open of the tun(4); this works,
  but may not make the most sense as we don't prevent the owner process from
  opening the tun device multiple times.

  The potential race described near tun_pid should not be an issue: if a
  tun(4) is to be handed off, its fd has to have been sent via control message
  or some other mechanism that duplicates the fd to the receiving process so
  that it may set the pid. Otherwise, the pid gets cleared when the original
  process closes it and you have no effective handoff mechanism.

  Close up another potential issue with handing a tun(4) off by not clobbering
  state if the closer isn't the controller anymore. If we want some state to
  be cleared, we should do that a little more surgically.

  Additionally, nothing prevents a dying tun(4) from being "reopened" in the
  middle of tun_destroy as soon as the mutex is unlocked, quickly leading to a
  bad time. Return EBUSY if we're marked for destruction, as well, and the
  consumer will need to deal with it. The associated character device will be
  destroyed in short order.

  r347183:
  geom: fix initialization order

  There's a race between the initialization of devsoftc.mtx (by devinit)
  and the creation of the geom worker thread g_run_events, which calls
  devctl_queue_data_f. Both of those are initialized at SI_SUB_DRIVERS
  and SI_ORDER_FIRST, which means the geom worked thread can be created
  before the mutex has been initialized, leading to the panic below:

   panic: mtx_lock() of spin mutex (null) @ /usr/home/osstest/build.135317.build-amd64-freebsd/freebsd/sys/kern/subr_bus.c:620
   cpuid = 3
   time = 1
   KDB: stack backtrace:
   db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003b968710
   vpanic() at vpanic+0x19d/frame 0xfffffe003b968760
   panic() at panic+0x43/frame 0xfffffe003b9687c0
   __mtx_lock_flags() at __mtx_lock_flags+0x145/frame 0xfffffe003b968810
   devctl_queue_data_f() at devctl_queue_data_f+0x6a/frame 0xfffffe003b968840
   g_dev_taste() at g_dev_taste+0x463/frame 0xfffffe003b968a00
   g_load_class() at g_load_class+0x1bc/frame 0xfffffe003b968a30
   g_run_events() at g_run_events+0x197/frame 0xfffffe003b968a70
   fork_exit() at fork_exit+0x84/frame 0xfffffe003b968ab0
   fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003b968ab0
   --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
   KDB: enter: panic
   [ thread pid 13 tid 100029 ]
   Stopped at      kdb_enter+0x3b: movq    $0,kdb_why

  Fix this by initializing geom at SI_ORDER_SECOND instead of
  SI_ORDER_FIRST.

  PR:		233955

Changes:
_U  stable/11/
  stable/11/sys/geom/geom.h
  stable/11/sys/net/if_tap.c
  stable/11/sys/net/if_tun.c
_U  stable/12/
  stable/12/sys/geom/geom.h
  stable/12/sys/net/if_tap.c
  stable/12/sys/net/if_tun.c
Comment 34 Kyle Evans freebsd_committer 2019-05-09 04:02:35 UTC
tun(4) should now be in decent enough shape on all supported branches. If anyone has an unpatched wireguard package lying around and some time to try to reproduce any of these issues on either of the -STABLE snapshots being built tomorrow (or a -STABLE built past r347378, of course), I'd greatly appreciate it.

I'm tentatively closing this as FIXED, since the more obvious races have been addressed and MFC'd.