Created attachment 223516
Core text dump

I was trying to reproduce bug #254303, and found another bug, not sure if it is related.

Steps to repeat:
1. Fresh install FreeBSD 13.0 RC3
2. Run the following script

<pre><code>
# set up interface and add ECMP route
tap=$( ifconfig tap create inet 10.10.10.1/24 )
route -n add 10.0.0.0 10.10.10.2
route -n add 10.0.0.0 10.10.10.3
# destroy interface to trigger the panic
ifconfig $tap destroy
</code></pre>

Kernel panic core dump text summary:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x38
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80d48f38
stack pointer = 0x28:0xfffffe0044dd6a80
frame pointer = 0x28:0xfffffe0044dd6a80
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (softirq_0)
trap number = 12
panic: page fault
cpuid = 0
time = 1616522291
KDB: stack backtrace:
#0 0xffffffff80c570b5 at kdb_backtrace+0x65
#1 0xffffffff80c09cd1 at vpanic+0x181
#2 0xffffffff80c09b43 at panic+0x43
#3 0xffffffff8108a187 at trap_fatal+0x387
#4 0xffffffff8108a1df at trap_pfault+0x4f
#5 0xffffffff8108983d at trap+0x27d
#6 0xffffffff810612c8 at calltrap+0x8
#7 0xffffffff80d4b1de at destroy_rtentry_epoch+0x2e
#8 0xffffffff80c51e2a at epoch_call_task+0x16a
#9 0xffffffff80c55b1d at gtaskqueue_run_locked+0x15d
#10 0xffffffff80c557bc at gtaskqueue_thread_loop+0xac
#11 0xffffffff80bc7c0e at fork_exit+0x7e
#12 0xffffffff8106234e at fork_trampoline+0xe
Uptime: 18s
Dumping 137 out of 472 MB:..12%..24%..36%..47%..59%..71%..82%..94%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c098c6 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d40 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b43 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8108a187 in trap_fatal (frame=0xfffffe0044dd69c0, eva=56) at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff8108a1df in trap_pfault (frame=frame@entry=0xfffffe0044dd69c0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108983d in trap (frame=0xfffffe0044dd69c0) at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 0xffffffff80d48f38 in nhop_get_vnet (nh=0xfffff8001458b980) at /usr/src/sys/net/route/nhop_ctl.c:761
#10 0xffffffff80d4b1de in destroy_rtentry (rt=0xfffff80014627840) at /usr/src/sys/net/route/route_ctl.c:139
#11 destroy_rtentry_epoch (ctx=0xfffff800146278e0) at /usr/src/sys/net/route/route_ctl.c:159
#12 0xffffffff80c51e2a in epoch_call_task (arg=<optimized out>) at /usr/src/sys/kern/subr_epoch.c:816
#13 0xffffffff80c55b1d in gtaskqueue_run_locked (queue=queue@entry=0xfffff8000332b900) at /usr/src/sys/kern/subr_gtaskqueue.c:371
#14 0xffffffff80c557bc in gtaskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffffe0044f09008) at /usr/src/sys/kern/subr_gtaskqueue.c:547
#15 0xffffffff80bc7c0e in fork_exit (callout=0xffffffff80c55710 <gtaskqueue_thread_loop>, arg=0xfffffe0044f09008, frame=0xfffffe0044dd6c00) at /usr/src/sys/kern/kern_fork.c:1069
#16 <signal handler called>
(kgdb)
After debugging, the calling routine turns out to be `rtfree()`, called from `rib_walk_del()` in /usr/src/sys/net/route/route_ctl.c.

CC Alexander V. Chernikov
Some debug info:

kgdb /boot/kernel/kernel /var/crash/vmcore.5
...
(kgdb) frame 9
#9 0xffffffff80d48f38 in nhop_get_vnet (nh=0xfffff8001458b980) at /usr/src/sys/net/route/nhop_ctl.c:761
761 return (nh->nh_priv->nh_vnet);
(kgdb) p nh->nh_priv
$1 = (struct nhop_priv *) 0x0
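The NULL `nh_priv` is consistent with the reported fault virtual address of 0x38 (eva=56): reading a field through a NULL pointer faults at that field's offset within the structure. Below is a minimal userspace sketch of that arithmetic. The struct layout and the names `toy_nhop`, `toy_nhop_priv`, `toy_fault_address` and `toy_nhop_get_vnet` are hypothetical stand-ins, not the real kernel definitions, and the NULL-checking accessor only illustrates the failure mode; it is not the fix that was committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct nhop_priv; the real layout differs. */
struct toy_nhop_priv {
	char	pad[0x38];	/* placeholder for whatever precedes nh_vnet */
	void	*nh_vnet;
};

/* Hypothetical stand-in for struct nhop_object. */
struct toy_nhop {
	struct toy_nhop_priv *nh_priv;
};

/* Address that a read of nh->nh_priv->nh_vnet would touch. */
static uintptr_t
toy_fault_address(const struct toy_nhop *nh)
{
	return ((uintptr_t)nh->nh_priv +
	    offsetof(struct toy_nhop_priv, nh_vnet));
}

/* Illustrative accessor that returns NULL instead of faulting. */
static void *
toy_nhop_get_vnet(const struct toy_nhop *nh)
{
	if (nh == NULL || nh->nh_priv == NULL)
		return (NULL);
	return (nh->nh_priv->nh_vnet);
}
```

With `nh_priv` set to NULL, `toy_fault_address()` yields the field offset itself, which is why a small address such as 0x38 in a page fault usually points at a NULL structure pointer rather than at random memory corruption.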
Thank you for the report! Is it a GENERIC kernel or a custom one without VNET?
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd

commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-23 22:03:20 +0000

    Fix panic when destroying interface with ECMP routes.

    Reported by: Zhenlei Huang <zlei.huang at gmail.com>
    PR: 254496
    MFC after: immediately

 sys/net/route/route_ctl.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
(In reply to Alexander V. Chernikov from comment #3)
It is a GENERIC kernel.

Thanks for the fast fix. I'll try stable/13 with the patch and report ASAP.
I spun up a new VM instance installed with FreeBSD-13.0-RC3-amd64-disk1.iso, extracted src, applied the patch from commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd, built and installed the new kernel, and repeated the steps above.

So far so good: the new kernel does not panic any more, but there are dangling nexthop objects.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA              Gateway          Flags  Netif   Refcnt
1     127.0.0.1        lo0/resolve      H      lo0     1
2     192.168.10.127   192.168.10.1     GS     vtnet0  1
3     192.168.10.127   vtnet0/resolve          vtnet0  2
4     127.0.0.1        lo0/resolve      HS     lo0     1
7     10.10.10.1       10.10.10.2       GHS    ---     1
8     10.10.10.1       10.10.10.3       GHS    ---     1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx   Weight  Slots   Gateway            Netif      Refcnt
1       ------- ------- ------- -----------------  ---------  1
        8       1       1       10.10.10.3         ---
        7       1       1       10.10.10.2         ---
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After applying the commit c00e2f573b50893e7428aee4b928c95ac27b7e5e from the main branch,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA              Gateway          Flags  Netif   Refcnt
1     127.0.0.1        lo0/resolve      H      lo0     1
2     192.168.10.127   192.168.10.1     GS     vtnet0  1
3     192.168.10.127   vtnet0/resolve          vtnet0  2
4     127.0.0.1        lo0/resolve      HS     lo0     1
7     10.10.10.1       10.10.10.2       GHS    ---     1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx   Weight  Slots   Gateway            Netif      Refcnt
1       ------- ------- ------- -----------------  ---------  2
        0       1       1
        7       1       1       10.10.10.2         ---
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It seems the nexthop objects / groups are not released properly.
Before destroying the interface,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA              Gateway          Flags  Netif   Refcnt
1     127.0.0.1        lo0/resolve      H      lo0     1
2     192.168.10.127   192.168.10.1     GS     vtnet0  1
3     192.168.10.127   vtnet0/resolve          vtnet0  2
4     127.0.0.1        lo0/resolve      HS     lo0     1
5     10.10.10.1       tap0/resolve            tap0    1
6     127.0.0.1        lo0/resolve      HS     lo0     1
7     10.10.10.1       10.10.10.2       GHS    tap0    1
8     10.10.10.1       10.10.10.3       GHS    tap0    1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx   Weight  Slots   Gateway            Netif      Refcnt
1       ------- ------- ------- -----------------  ---------  1
        8       1       1       10.10.10.3         tap0
        7       1       1       10.10.10.2         tap0
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=66f138563becf12d5c21924f816d2a45c3a1ed7a

commit 66f138563becf12d5c21924f816d2a45c3a1ed7a
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-24 23:52:18 +0000

    Plug nexthop group refcount leak.

    In case with batch route delete via rib_walk_del(), when
    some paths from the multipath route gets deleted, old
    multipath group were not freed.

    PR: 254496
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    MFC after: 1 day

 sys/net/route/route_ctl.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
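The leak pattern described in the commit message above (a route switches to a new, smaller multipath group but the reference on the old group is never dropped) can be sketched in userspace C. This is a toy model under stated assumptions: `toy_nhgrp`, `toy_route`, `toy_live_groups` and the `release_old` switch are hypothetical names made up for illustration, and none of this is the actual code in route_ctl.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted multipath group. */
struct toy_nhgrp {
	unsigned int	refcnt;
	size_t		npaths;
};

/* Hypothetical route that holds one reference on its group. */
struct toy_route {
	struct toy_nhgrp *grp;
};

static int toy_live_groups;	/* groups allocated and not yet freed */

static struct toy_nhgrp *
toy_nhgrp_alloc(size_t npaths)
{
	struct toy_nhgrp *g = malloc(sizeof(*g));

	if (g == NULL)
		return (NULL);
	g->refcnt = 1;
	g->npaths = npaths;
	toy_live_groups++;
	return (g);
}

/* Drop one reference; free the group when the last one goes away. */
static void
toy_nhgrp_free(struct toy_nhgrp *g)
{
	if (--g->refcnt == 0) {
		toy_live_groups--;
		free(g);
	}
}

/*
 * Remove one path by switching the route to a new, smaller group.
 * release_old == 0 models the leak; release_old != 0 models the fix.
 */
static int
toy_route_drop_path(struct toy_route *rt, int release_old)
{
	struct toy_nhgrp *old = rt->grp;
	struct toy_nhgrp *newgrp;

	if (old->npaths == 0)
		return (-1);
	newgrp = toy_nhgrp_alloc(old->npaths - 1);
	if (newgrp == NULL)
		return (-1);
	rt->grp = newgrp;
	if (release_old)
		toy_nhgrp_free(old);	/* the step the leaky path skips */
	return (0);
}
```

In the leaky variant the old group stays allocated with a nonzero reference count after the route has moved on, which is the same kind of symptom as the leftover group entry seen in the `netstat -4On` output earlier in this report.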
I tried stable/13 with commits a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd, c00e2f573b50893e7428aee4b928c95ac27b7e5e, 66f138563becf12d5c21924f816d2a45c3a1ed7a and 24cd2796cf10211964be8a2cb3ea3e161adea746 cherry-picked from the main branch, and I can confirm that it works reliably now.

Thank you for the fix!
Will the fix be part of 13.0-RC4? Thanks :)
Oh, busy days. It seems the documentation for the new netstat flags “-o” and “-O” is missing from the manual page.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=af85312e8a6f34ea7c8af77b9841fab6b5559e25

commit af85312e8a6f34ea7c8af77b9841fab6b5559e25
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-25 20:22:21 +0000

    Fix panic when destroying interface with ECMP routes.

    Reported by: Zhenlei Huang <zlei.huang at gmail.com>
    PR: 254496

    (cherry picked from commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd)

 sys/net/route/route_ctl.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=47c00a9835926e96e562c67fa28e4432e99d9c56

commit 47c00a9835926e96e562c67fa28e4432e99d9c56
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-25 20:22:58 +0000

    Plug nexthop group refcount leak.

    In case with batch route delete via rib_walk_del(), when
    some paths from the multipath route gets deleted, old
    multipath group were not freed.

    PR: 254496
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>

    (cherry picked from commit 66f138563becf12d5c21924f816d2a45c3a1ed7a)

 sys/net/route/route_ctl.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3765afa4dacf5850de984fede7f9b26760efac73

commit 3765afa4dacf5850de984fede7f9b26760efac73
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-28 20:40:48 +0000

    Fix panic when destroying interface with ECMP routes.

    Reported by: Zhenlei Huang <zlei.huang at gmail.com>
    PR: 254496
    Approved by: re (gjb)

    (cherry picked from commit af85312e8a6f34ea7c8af77b9841fab6b5559e25)

 sys/net/route/route_ctl.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=411cbdb1f298880a0100a633cd0508b70ac4c924

commit 411cbdb1f298880a0100a633cd0508b70ac4c924
Author:     Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit:     Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-28 20:40:48 +0000

    Plug nexthop group refcount leak.

    In case with batch route delete via rib_walk_del(), when
    some paths from the multipath route gets deleted, old
    multipath group were not freed.

    PR: 254496
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    Approved by: re (gjb)

    (cherry picked from commit 47c00a9835926e96e562c67fa28e4432e99d9c56)

 sys/net/route/route_ctl.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
(In reply to Zhenlei Huang from comment #10)
No, but it will be part of RC5 :-)
(In reply to Zhenlei Huang from comment #11)
Yes, the documentation is lacking. I hope to do this once we get 13 released.
The patch is in 13-R. I'm going to close this case tomorrow unless there are any objections.