Bug 254496 - kernel panic when destroying interface with ECMP route
Summary: kernel panic when destroying interface with ECMP route
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Alexander V. Chernikov
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2021-03-23 10:03 UTC by Zhenlei Huang
Modified: 2021-04-08 22:22 UTC
CC List: 2 users

See Also:


Attachments
Core text dump (64.06 KB, text/plain)
2021-03-23 10:03 UTC, Zhenlei Huang

Description Zhenlei Huang 2021-03-23 10:03:06 UTC
Created attachment 223516
Core text dump

I was trying to reproduce bug #254303 and found another bug. I am not sure whether the two are related.

Steps to repeat:
1. Do a fresh install of FreeBSD 13.0-RC3.
2. Run the following script as root:

<pre><code>
# set up interface and add ECMP route
tap=$( ifconfig tap create inet 10.10.10.1/24 )
route -n add 10.0.0.0 10.10.10.2
route -n add 10.0.0.0 10.10.10.3

# destroy interface to trigger the panic
ifconfig $tap destroy
</code></pre>


Kernel panic core dump text summary:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x38
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80d48f38
stack pointer           = 0x28:0xfffffe0044dd6a80
frame pointer           = 0x28:0xfffffe0044dd6a80
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (softirq_0)
trap number             = 12
panic: page fault
cpuid = 0
time = 1616522291
KDB: stack backtrace:
#0 0xffffffff80c570b5 at kdb_backtrace+0x65
#1 0xffffffff80c09cd1 at vpanic+0x181
#2 0xffffffff80c09b43 at panic+0x43
#3 0xffffffff8108a187 at trap_fatal+0x387
#4 0xffffffff8108a1df at trap_pfault+0x4f
#5 0xffffffff8108983d at trap+0x27d
#6 0xffffffff810612c8 at calltrap+0x8
#7 0xffffffff80d4b1de at destroy_rtentry_epoch+0x2e
#8 0xffffffff80c51e2a at epoch_call_task+0x16a
#9 0xffffffff80c55b1d at gtaskqueue_run_locked+0x15d
#10 0xffffffff80c557bc at gtaskqueue_thread_loop+0xac
#11 0xffffffff80bc7c0e at fork_exit+0x7e
#12 0xffffffff8106234e at fork_trampoline+0xe
Uptime: 18s
Dumping 137 out of 472 MB:..12%..24%..36%..47%..59%..71%..82%..94%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c098c6 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c09d40 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c09b43 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108a187 in trap_fatal (frame=0xfffffe0044dd69c0, eva=56)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108a1df in trap_pfault (frame=frame@entry=0xfffffe0044dd69c0, 
    usermode=false, signo=<optimized out>, signo@entry=0x0, 
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108983d in trap (frame=0xfffffe0044dd69c0)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  0xffffffff80d48f38 in nhop_get_vnet (nh=0xfffff8001458b980)
    at /usr/src/sys/net/route/nhop_ctl.c:761
#10 0xffffffff80d4b1de in destroy_rtentry (rt=0xfffff80014627840)
    at /usr/src/sys/net/route/route_ctl.c:139
#11 destroy_rtentry_epoch (ctx=0xfffff800146278e0)
    at /usr/src/sys/net/route/route_ctl.c:159
#12 0xffffffff80c51e2a in epoch_call_task (arg=<optimized out>)
    at /usr/src/sys/kern/subr_epoch.c:816
#13 0xffffffff80c55b1d in gtaskqueue_run_locked (
    queue=queue@entry=0xfffff8000332b900)
    at /usr/src/sys/kern/subr_gtaskqueue.c:371
#14 0xffffffff80c557bc in gtaskqueue_thread_loop (arg=<optimized out>, 
    arg@entry=0xfffffe0044f09008) at /usr/src/sys/kern/subr_gtaskqueue.c:547
#15 0xffffffff80bc7c0e in fork_exit (
    callout=0xffffffff80c55710 <gtaskqueue_thread_loop>, 
    arg=0xfffffe0044f09008, frame=0xfffffe0044dd6c00)
    at /usr/src/sys/kern/kern_fork.c:1069
#16 <signal handler called>
(kgdb)
Comment 1 Zhenlei Huang 2021-03-23 10:26:44 UTC
After debugging, the caller turns out to be `rtfree()`, invoked from `rib_walk_del()` in /usr/src/sys/net/route/route_ctl.c.

CC Alexander V. Chernikov
Comment 2 Zhenlei Huang 2021-03-23 10:36:36 UTC
Some debug info:

kgdb /boot/kernel/kernel /var/crash/vmcore.5
...

(kgdb) frame 9
#9  0xffffffff80d48f38 in nhop_get_vnet (nh=0xfffff8001458b980) at /usr/src/sys/net/route/nhop_ctl.c:761
761		return (nh->nh_priv->nh_vnet);
(kgdb) p nh->nh_priv
$1 = (struct nhop_priv *) 0x0
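The numbers in the dump agree with this: the trap frame's `eva=56` is the fault virtual address `0x38` written in decimal. Together with `nh->nh_priv` being NULL, that is consistent with the faulting read being `nh_priv->nh_vnet` at NULL plus a field offset of 0x38. The offset is an inference from this dump only; it has not been checked against the `struct nhop_priv` definition. A quick way to confirm the conversion, using a made-up helper name `hex_to_dec`:

```shell
#!/bin/sh
# hex_to_dec: print the decimal value of a C-style hexadecimal literal.
# (hex_to_dec is a hypothetical helper, not part of any FreeBSD tool.)
hex_to_dec() {
    # POSIX printf treats numeric operands as C constants, so 0x.. is accepted.
    printf '%d\n' "$1"
}
```

Running `hex_to_dec 0x38` prints 56, matching `eva=56` in frame #5 of the kgdb backtrace.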
Comment 3 Alexander V. Chernikov freebsd_committer 2021-03-23 21:03:24 UTC
Thank you for the report!

Is it a GENERIC kernel or a custom one w/o VNET?
Comment 4 commit-hook freebsd_committer 2021-03-23 22:13:27 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd

commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-23 22:03:20 +0000

    Fix panic when destroying interface with ECMP routes.

    Reported by:    Zhenlei Huang <zlei.huang at gmail.com>
    PR:             254496
    MFC after:      immediately

 sys/net/route/route_ctl.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
Comment 5 Zhenlei Huang 2021-03-24 02:05:10 UTC
(In reply to Alexander V. Chernikov from comment #3)
It is a GENERIC kernel.

Thanks for the fast fix, I'll try stable/13 with the patch and report ASAP.
Comment 6 Zhenlei Huang 2021-03-24 04:25:24 UTC
I spun up a new VM instance installed from FreeBSD-13.0-RC3-amd64-disk1.iso, extracted src, applied the patch from commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd, built and installed the new kernel, and repeated the steps above.

So far so good: the new kernel no longer panics, but there are dangling nexthop objects.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA                Gateway            Flags     Netif  Refcnt
1     127.0.0.1          lo0/resolve        H           lo0     1 
2     192.168.10.127     192.168.10.1       GS       vtnet0     1 
3     192.168.10.127     vtnet0/resolve              vtnet0     2 
4     127.0.0.1          lo0/resolve        HS          lo0     1 
7     10.10.10.1         10.10.10.2         GHS         ---     1 
8     10.10.10.1         10.10.10.3         GHS         ---     1 
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx     Weight   Slots           Gateway     Netif  Refcnt
1         ------- ------- ------- ----------------- ---------       1
               8       1       1        10.10.10.3       ---
               7       1       1        10.10.10.2       ---
root@:~ # 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


After applying commit c00e2f573b50893e7428aee4b928c95ac27b7e5e from the main branch:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA                Gateway            Flags     Netif  Refcnt
1     127.0.0.1          lo0/resolve        H           lo0     1 
2     192.168.10.127     192.168.10.1       GS       vtnet0     1 
3     192.168.10.127     vtnet0/resolve              vtnet0     2 
4     127.0.0.1          lo0/resolve        HS          lo0     1 
7     10.10.10.1         10.10.10.2         GHS         ---     1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx     Weight   Slots           Gateway     Netif  Refcnt
1         ------- ------- ------- ----------------- ---------       2
               0       1       1
               7       1       1        10.10.10.2       ---
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It seems the nexthop objects / groups are not released properly.
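The leftover entries in the listings above are the ones whose Netif column reads `---`. A minimal sketch for picking them out of `netstat -4on` output, assuming the column layout shown in this comment (the function name `dangling_nhops` is made up for illustration, and the filter is not meant for the `-O` group listing):

```shell
#!/bin/sh
# dangling_nhops: read `netstat -4on` output on stdin and print the Idx of
# every nexthop whose Netif column is "---".
# Assumption: Netif is the second-to-last column and Refcnt the last one, as
# in the listings above. The Flags column can be empty, so fields are counted
# from the right rather than from the left.
dangling_nhops() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 4 && $(NF-1) == "---" { print $1 }'
}
```

On an affected system this would be used as `netstat -4on | dangling_nhops`. An empty result after destroying the interface would suggest that no nexthop is left without an interface.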
Comment 7 Zhenlei Huang 2021-03-24 04:38:44 UTC
For comparison, here is the state before destroying the interface:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA                Gateway            Flags     Netif  Refcnt
1     127.0.0.1          lo0/resolve        H           lo0     1 
2     192.168.10.127     192.168.10.1       GS       vtnet0     1 
3     192.168.10.127     vtnet0/resolve              vtnet0     2 
4     127.0.0.1          lo0/resolve        HS          lo0     1 
5     10.10.10.1         tap0/resolve                  tap0     1 
6     127.0.0.1          lo0/resolve        HS          lo0     1 
7     10.10.10.1         10.10.10.2         GHS        tap0     1 
8     10.10.10.1         10.10.10.3         GHS        tap0     1 
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx     Weight   Slots           Gateway     Netif  Refcnt
1         ------- ------- ------- ----------------- ---------       1
               8       1       1        10.10.10.3      tap0
               7       1       1        10.10.10.2      tap0
root@:~ #
Comment 8 commit-hook freebsd_committer 2021-03-24 23:52:47 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=66f138563becf12d5c21924f816d2a45c3a1ed7a

commit 66f138563becf12d5c21924f816d2a45c3a1ed7a
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-24 23:52:18 +0000

    Plug nexthop group refcount leak.
    In case with batch route delete via rib_walk_del(), when
     some paths from the multipath route gets deleted, old
     multipath group were not freed.

    PR:    254496
    Reported by:   Zhenlei Huang <zlei.huang@gmail.com>
    MFC after:     1 day

 sys/net/route/route_ctl.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
Comment 9 Zhenlei Huang 2021-03-25 04:50:02 UTC
I tried stable/13 with a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd, c00e2f573b50893e7428aee4b928c95ac27b7e5e, 66f138563becf12d5c21924f816d2a45c3a1ed7a and 24cd2796cf10211964be8a2cb3ea3e161adea746 cherry-picked from the main branch, and I can confirm that it works reliably now.

Thank you for the fix!
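For anyone wanting to repeat this test, the cherry-picks can be sketched as below. The comment does not say how the picks were performed, so the source tree location (`/usr/src`), the use of `-x`, and the helper name `print_cherry_picks` are all assumptions. The function only prints the commands, so they can be reviewed before being run.

```shell
#!/bin/sh
# Commit hashes exactly as listed in this comment, in the same order.
COMMITS="a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd
c00e2f573b50893e7428aee4b928c95ac27b7e5e
66f138563becf12d5c21924f816d2a45c3a1ed7a
24cd2796cf10211964be8a2cb3ea3e161adea746"

# print_cherry_picks: print (do not execute) the git commands that would
# apply the commits on top of stable/13 in the source tree given as $1.
# The default of /usr/src is an assumption about where the tree lives.
print_cherry_picks() {
    src=${1:-/usr/src}
    echo "git -C $src checkout stable/13"
    for c in $COMMITS; do
        # -x records the original hash in the new commit message.
        echo "git -C $src cherry-pick -x $c"
    done
}
```

After reviewing the output of `print_cherry_picks`, it can be piped to `sh`, followed by the usual kernel build and install.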
Comment 10 Zhenlei Huang 2021-03-25 04:53:16 UTC
Will the fix be part of 13.0-RC4?
Thanks :)
Comment 11 Zhenlei Huang 2021-03-25 16:14:03 UTC
Oh, busy days. It seems the documentation for netstat's new "-o" and "-O" flags is missing from the manual page.
Comment 12 commit-hook freebsd_committer 2021-03-25 20:27:20 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=af85312e8a6f34ea7c8af77b9841fab6b5559e25

commit af85312e8a6f34ea7c8af77b9841fab6b5559e25
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-25 20:22:21 +0000

    Fix panic when destroying interface with ECMP routes.

    Reported by:    Zhenlei Huang <zlei.huang at gmail.com>
    PR:             254496

    (cherry picked from commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd)

 sys/net/route/route_ctl.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
Comment 13 commit-hook freebsd_committer 2021-03-25 20:27:21 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=47c00a9835926e96e562c67fa28e4432e99d9c56

commit 47c00a9835926e96e562c67fa28e4432e99d9c56
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-25 20:22:58 +0000

    Plug nexthop group refcount leak.
    In case with batch route delete via rib_walk_del(), when
     some paths from the multipath route gets deleted, old
     multipath group were not freed.

    PR:    254496
    Reported by:   Zhenlei Huang <zlei.huang@gmail.com>

    (cherry picked from commit 66f138563becf12d5c21924f816d2a45c3a1ed7a)

 sys/net/route/route_ctl.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
Comment 14 commit-hook freebsd_committer 2021-03-28 20:51:00 UTC
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3765afa4dacf5850de984fede7f9b26760efac73

commit 3765afa4dacf5850de984fede7f9b26760efac73
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-28 20:40:48 +0000

    Fix panic when destroying interface with ECMP routes.

    Reported by:    Zhenlei Huang <zlei.huang at gmail.com>
    PR:             254496
    Approved by:    re (gjb)

    (cherry picked from commit af85312e8a6f34ea7c8af77b9841fab6b5559e25)

 sys/net/route/route_ctl.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
Comment 15 commit-hook freebsd_committer 2021-03-28 20:51:01 UTC
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=411cbdb1f298880a0100a633cd0508b70ac4c924

commit 411cbdb1f298880a0100a633cd0508b70ac4c924
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-28 20:40:48 +0000

    Plug nexthop group refcount leak.
    In case with batch route delete via rib_walk_del(), when
     some paths from the multipath route gets deleted, old
     multipath group were not freed.

    PR:    254496
    Reported by:   Zhenlei Huang <zlei.huang@gmail.com>
    Approved by:    re (gjb)

    (cherry picked from commit 47c00a9835926e96e562c67fa28e4432e99d9c56)

 sys/net/route/route_ctl.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
Comment 16 Alexander V. Chernikov freebsd_committer 2021-03-31 21:04:10 UTC
(In reply to Zhenlei Huang from comment #10)
No, but it will be part of RC5 :-)
Comment 17 Alexander V. Chernikov freebsd_committer 2021-03-31 21:04:51 UTC
(In reply to Zhenlei Huang from comment #11)
Yes, the documentation is lacking. I hope to add it once we get 13 released.
Comment 18 Alexander V. Chernikov freebsd_committer 2021-04-04 08:41:38 UTC
The patch is in 13-R.
I'm going to close this case tomorrow unless there are any objections.