253541 – if_wg(4): panic on wireguard interface destroy

Bug 253541 - if_wg(4): panic on wireguard interface destroy

Summary: if_wg(4): panic on wireguard interface destroy

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	CURRENT
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	Mark Johnston

URL:
Keywords:	crash

Depends on:
Blocks:

Reported:	2021-02-15 18:04 UTC by Frank Behrens
Modified:	2022-10-12 00:48 UTC (History)
CC List:	9 users (show)

See Also:

Attachments
panic backtrace (9.17 KB, text/plain) 2021-02-15 18:04 UTC, Frank Behrens	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Frank Behrens 2021-02-15 18:04:44 UTC

Created attachment 222473 [details]
panic backtrace

When a (configured) wg interface is destroyed, the system panics.
#ifconfig wg0 destroy

panic: page fault
cpuid = 1
time = 1613404414
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e7cb3540
vpanic() at vpanic+0x181/frame 0xfffffe00e7cb3590
panic() at panic+0x43/frame 0xfffffe00e7cb35f0
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e7cb3650
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e7cb36b0
trap() at trap+0x27d/frame 0xfffffe00e7cb37c0
calltrap() at calltrap+0x8/frame 0xfffffe00e7cb37c0
--- trap 0xc, rip = 0xffffffff805011a1, rsp = 0xfffffe00e7cb3890, rbp = 0xfffffe00e7cb38b0 ---
grouptask_block() at grouptask_block+0x11/frame 0xfffffe00e7cb38b0
taskqgroup_detach() at taskqgroup_detach+0x18/frame 0xfffffe00e7cb38e0
wg_detach() at wg_detach+0xc8/frame 0xfffffe00e7cb3910
iflib_pseudo_deregister() at iflib_pseudo_deregister+0xfe/frame 0xfffffe00e7cb3940
if_clone_destroyif() at if_clone_destroyif+0x1b5/frame 0xfffffe00e7cb3980
if_clone_destroy() at if_clone_destroy+0x195/frame 0xfffffe00e7cb39d0
ifioctl() at ifioctl+0x4a4/frame 0xfffffe00e7cb3a90
kern_ioctl() at kern_ioctl+0x26d/frame 0xfffffe00e7cb3b00
sys_ioctl() at sys_ioctl+0xf6/frame 0xfffffe00e7cb3bc0
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe00e7cb3cf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e7cb3cf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80041f5aa, rsp = 0x7fffffffe148, rbp = 0x7fffffffe160 ---


Some more data in the attached file, I'm not sure if it helps. vmcore and matching debug kernel/sources are available on demand.

version: FreeBSD 14.0-CURRENT, commit 88db1cc9f197a376817ce27ba269348666bbd4b7 (freebsd/main, freebsd/HEAD, main) with minor local changes

Comment 1 Raúl 2021-02-17 12:53:30 UTC

It also happens on stable/13-7973db046, plain GENERIC, no changes

Fatal trap 12: page fault while in kernel mode                                                                                         
cpuid = 1; apic id = 01                                                                                                                
fault virtual address   = 0xa4                                                                                                         
fault code              = supervisor read data, page not present                                                                       
instruction pointer     = 0x20:0xffffffff80c53541                                                                                      
stack pointer           = 0x28:0xfffffe00a8548790                                                                                      
frame pointer           = 0x28:0xfffffe00a85487b0                                                                                      
code segment            = base rx0, limit 0xfffff, type 0x1b                                                                           
                        = DPL 0, pres 1, long 1, def32 0, gran 1                                                                       
processor eflags        = interrupt enabled, resume, IOPL = 0                                                                          
current process         = 753 (ifconfig)                                                                                               
trap number             = 12                                                                                                           
panic: page fault                                                                                                                      
cpuid = 1                                                                                                                              
time = 1613565900                                                                                                                      
KDB: stack backtrace:                                                                                                                  
#0 0xffffffff80c56875 at kdb_backtrace+0x65                                                                                            
#1 0xffffffff80c09441 at vpanic+0x181                                                                                                  
#2 0xffffffff80c092b3 at panic+0x43                                                                                                    
#3 0xffffffff810891a7 at trap_fatal+0x387                                                                                              
#4 0xffffffff810891ff at trap_pfault+0x4f                                                                                              
#5 0xffffffff8108885d at trap+0x27d                                                                                                    
#6 0xffffffff8105fe28 at calltrap+0x8                                                                                                  
#7 0xffffffff80c54758 at taskqgroup_detach+0x18                                                                                        
#8 0xffffffff829249f8 at wg_detach+0xc8                                                                                                
#9 0xffffffff80d33c1e at iflib_pseudo_deregister+0xfe                                                                                  
#10 0xffffffff80d20ff5 at if_clone_destroyif+0x1b5                                                                                     
#11 0xffffffff80d20de6 at if_clone_destroy+0x196                                                                                       
#12 0xffffffff80d1e31e at ifioctl+0x44e                                                                                                
#13 0xffffffff80c7621d at kern_ioctl+0x26d                                                                                             
#14 0xffffffff80c75f16 at sys_ioctl+0xf6                                                                                               
#15 0xffffffff81089aac at amd64_syscall+0x10c                                                                                          
#16 0xffffffff8106074e at fast_syscall_common+0xf8

Comment 2 Yuichiro NAITO 2021-02-18 03:02:45 UTC

I see same panic on my FreeBSD Current box.
I've tried bisect this problem.
It have been caused since 38bfc6dee33bedb290e1ea2540f97a86fe3caee0 commit.

https://cgit.freebsd.org/src/commit/?id=38bfc6dee33bedb290e1ea2540f97a86fe3caee0

Comment 3 Mark Johnston freebsd_committer

2021-02-18 03:20:54 UTC

(In reply to Yuichiro NAITO from comment #2)
Thanks for tracking it down, looking.

Comment 4 Mark Johnston freebsd_committer

2021-02-18 03:50:26 UTC

The problem is basically that the added an IFDI_DETACH() call to iflib_pseudo_deregister(), but that was already taken care of by the DEVICE_DETACH method of a device created by iflib_clone_create().

It looks like the device is only created in the first place because iflib wants a device_t associated with driver contexts.  Even with the commit reverted you can crash a system with "devctl detach wg0" or so.  I think iflib_pseudo_detach() should rely on the ifnet cloning framework to handle teardown and simply return an error if it detects that something is detaching the device without having destroyed the interface first.

Comment 5 Mark Johnston freebsd_committer

2021-02-18 15:08:45 UTC

https://reviews.freebsd.org/D28774

Comment 6 commit-hook freebsd_committer

2021-02-19 22:12:21 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0f9544d03e89d180f94a7a84b110ec7d2b6c625a

commit 0f9544d03e89d180f94a7a84b110ec7d2b6c625a
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-02-19 22:08:34 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-02-19 22:10:41 +0000

    iflib: Fix detach of pseudo interfaces

    In commit 38bfc6dee33b we added an IFDI_DETACH() call to
    iflib_pseudo_deregister() since it looked like it was missing.  One is
    present in the error-handling path of iflib_pseudo_register().  However,
    the detach actually comes from the DEVICE_DETACH() method for the
    above-mentioned device_t, so now we're calling IFDI_DETACH() twice when
    destroying a pseudo interface.

    Fix the problem by not calling IFDI_DETACH() from the device detach
    routine.  This way we can ensure that iflib de-initialization always
    happens in a consistent order.  It also ensures that you can't do silly
    things like "devctl detach <pseudo ifnet>", which would previously
    detach the driver without tearing down the corresponding ifnet.

    PR:             253541
    Reviewed by:    erj
    MFC after:      1 week
    Fixes:          38bfc6dee33b
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D28774

 sys/net/iflib_clone.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comment 7 Frank Behrens 2021-02-20 09:17:05 UTC

(In reply to commit-hook from comment #6)
Mark, thanks for your effort and fast response!

Meanwhile running freebsd/main 504e64af32ba6c62fdcc894a3b1da76061c64796 I can confirm, that the problem has been solved, the system does not longer panic. From my point of view the ticket may be closed.

BTW, the interface destruction was my attempt to reconfigure the interface, I could not find another way. But that is another problem...

Thanks!
    Frank

Comment 8 Mark Johnston freebsd_committer

2021-02-20 16:37:13 UTC

(In reply to Frank Behrens from comment #7)
Thanks for confirming and sorry for the breakage.  I'll close the PR once this change has been merged to stable/13.

Comment 9 commit-hook freebsd_committer

2021-02-24 15:21:48 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4664afc05402abb8e572975ec65210bdcd8b2ea1

commit 4664afc05402abb8e572975ec65210bdcd8b2ea1
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-02-19 22:08:34 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-02-24 15:20:55 +0000

    iflib: Fix detach of pseudo interfaces

    In commit 38bfc6dee33b we added an IFDI_DETACH() call to
    iflib_pseudo_deregister() since it looked like it was missing.  One is
    present in the error-handling path of iflib_pseudo_register().  However,
    the detach actually comes from the DEVICE_DETACH() method for the
    above-mentioned device_t, so now we're calling IFDI_DETACH() twice when
    destroying a pseudo interface.

    Fix the problem by not calling IFDI_DETACH() from the device detach
    routine.  This way we can ensure that iflib de-initialization always
    happens in a consistent order.  It also ensures that you can't do silly
    things like "devctl detach <pseudo ifnet>", which would previously
    detach the driver without tearing down the corresponding ifnet.

    PR:             253541
    Reviewed by:    erj
    Fixes:          38bfc6dee33b
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D28774

    (cherry picked from commit 0f9544d03e89d180f94a7a84b110ec7d2b6c625a)

 sys/net/iflib_clone.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comment 10 Mark Johnston freebsd_committer

2021-02-24 15:22:43 UTC

Thanks for the report.

Comment 11 Bernhard Froehlich freebsd_committer

2021-03-06 20:53:33 UTC

Can anyone please merge this into releng/13.0 since it fixes an easy to trigger panic?

Comment 12 Mark Johnston freebsd_committer

2021-03-07 14:49:01 UTC

(In reply to Bernhard Froehlich from comment #11)
The regression (see comment 2) should not be present in 13.0.  Can you show the exact panic you're seeing?

Comment 13 Christos Chatzaras 2021-03-07 19:06:30 UTC

I could not trigger it with FreeBSD 13.0-RC1 or 12.2p4

# ifconfig wg0 destroy

the process hangs in D-state.

-------------

# truss -p 28527

shows no activity

-------------

# kill -9 28527

doesn't kill the process

-------------

procstat 28527
  PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN     EMUL          COMM
28527  3161 28527 92965 92965   1 chris    tun_cond  FreeBSD ELF64 ifconfig

-------------

But if after the process "ifconfig wg0 destroy" hangs I run:

service wireguard stop

then the "ifconfig wg0 destroy" stops running and truss shows:

close(3)                                         = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)                 = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)                 = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)                 = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)                 = 0 (0x0)
sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
sigprocmask(SIG_SETMASK,{ },0x0)                 = 0 (0x0)
exit(0x0)
process exit, rval = 0

Comment 14 Bernhard Froehlich freebsd_committer

2021-03-07 20:10:45 UTC

(In reply to Mark Johnston from comment #12)
I was able to reliably panic 13.0-BETA4 with an ifconfig wg0 destroy if a wireguard connection was established already. Since this bug description looked very similar I assumed it is the same. Just tried now on a test VM with RC1 and I cannot reproduce this panic at the moment (but found two other panics, will create PRs for them).

Comment 15 Christos Chatzaras 2021-03-07 20:42:05 UTC

(In reply to Bernhard Froehlich from comment #14)

My wg0 interface was up but with no connection established, maybe that's why it didn't panic.

Comment 16 Bernhard Froehlich freebsd_committer

2021-03-07 21:02:01 UTC

I created PR 254114 with my details but the stacktrace looks different so definitely not the same issue.