Summary: | kernel panic when destroying interface with ECMP route | |
---|---|---|---|
Product: | Base System | Reporter: | Zhenlei Huang <zlei>
Component: | kern | Assignee: | Alexander V. Chernikov <melifaro>
Status: | Closed FIXED | |
Severity: | Affects Some People | CC: | melifaro, zlei
Priority: | --- | Keywords: | crash
Version: | 13.0-STABLE | |
Hardware: | amd64 | |
OS: | Any | |
Description
Zhenlei Huang
2021-03-23 10:03:06 UTC
After debugging, the caller routine is `rtfree()` in `rib_walk_del()` at /usr/src/sys/net/route/route_ctl.c

CC Alexander V. Chernikov

Some debug info:

kgdb /boot/kernel/kernel /var/crash/vmcore.5
...
(kgdb) frame 9
#9 0xffffffff80d48f38 in nhop_get_vnet (nh=0xfffff8001458b980) at /usr/src/sys/net/route/nhop_ctl.c:761
761 return (nh->nh_priv->nh_vnet);
(kgdb) p nh->nh_priv
$1 = (struct nhop_priv *) 0x0

Thank you for the report! Is it a GENERIC kernel or a custom one w/o VNET?

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd

commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-23 22:03:20 +0000

Fix panic when destroying interface with ECMP routes.

Reported by: Zhenlei Huang <zlei.huang at gmail.com>
PR: 254496
MFC after: immediately

sys/net/route/route_ctl.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

(In reply to Alexander V. Chernikov from comment #3)
It is a GENERIC kernel. Thanks for the fast fix, I'll try stable/13 with the patch and report ASAP.

I spun up a new VM instance installed from FreeBSD-13.0-RC3-amd64-disk1.iso, extracted the source, applied the patch from commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd, built and installed the new kernel, and repeated the steps above. So far so good: the new kernel no longer panics, but there are dangling nexthop objects.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA              Gateway          Flags   Netif    Refcnt
1     127.0.0.1        lo0/resolve      H       lo0      1
2     192.168.10.127   192.168.10.1     GS      vtnet0   1
3     192.168.10.127   vtnet0/resolve           vtnet0   2
4     127.0.0.1        lo0/resolve      HS      lo0      1
7     10.10.10.1       10.10.10.2       GHS     ---      1
8     10.10.10.1       10.10.10.3       GHS     ---      1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx    Weight   Slots    Gateway            Netif      Refcnt
1       -------  -------  -------  -----------------  ---------  1
        8        1        1        10.10.10.3         ---
        7        1        1        10.10.10.2         ---
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After applying the commit c00e2f573b50893e7428aee4b928c95ac27b7e5e from main branch,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA              Gateway          Flags   Netif    Refcnt
1     127.0.0.1        lo0/resolve      H       lo0      1
2     192.168.10.127   192.168.10.1     GS      vtnet0   1
3     192.168.10.127   vtnet0/resolve           vtnet0   2
4     127.0.0.1        lo0/resolve      HS      lo0      1
7     10.10.10.1       10.10.10.2       GHS     ---      1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx    Weight   Slots    Gateway            Netif      Refcnt
1       -------  -------  -------  -----------------  ---------  2
        0        1        1
        7        1        1        10.10.10.2         ---
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It seems the nexthop objects / groups are not released properly.
Before destroying the interface,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@:~ # netstat -4on
Nexthop data

Internet:
Idx   IFA              Gateway          Flags   Netif    Refcnt
1     127.0.0.1        lo0/resolve      H       lo0      1
2     192.168.10.127   192.168.10.1     GS      vtnet0   1
3     192.168.10.127   vtnet0/resolve           vtnet0   2
4     127.0.0.1        lo0/resolve      HS      lo0      1
5     10.10.10.1       tap0/resolve             tap0     1
6     127.0.0.1        lo0/resolve      HS      lo0      1
7     10.10.10.1       10.10.10.2       GHS     tap0     1
8     10.10.10.1       10.10.10.3       GHS     tap0     1
root@:~ # netstat -4On
Nexthop groups data

Internet:
GrpIdx  NhIdx    Weight   Slots    Gateway            Netif      Refcnt
1       -------  -------  -------  -----------------  ---------  1
        8        1        1        10.10.10.3         tap0
        7        1        1        10.10.10.2         tap0
root@:~ #
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=66f138563becf12d5c21924f816d2a45c3a1ed7a

commit 66f138563becf12d5c21924f816d2a45c3a1ed7a
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-24 23:52:18 +0000

Plug nexthop group refcount leak.

In case with batch route delete via rib_walk_del(), when some paths from the multipath route gets deleted, old multipath group were not freed.

PR: 254496
Reported by: Zhenlei Huang <zlei.huang@gmail.com>
MFC after: 1 day

sys/net/route/route_ctl.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

Tried stable/13 with cherry-picks of a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd, c00e2f573b50893e7428aee4b928c95ac27b7e5e, 66f138563becf12d5c21924f816d2a45c3a1ed7a and 24cd2796cf10211964be8a2cb3ea3e161adea746 from main branch. I can confirm that it works solidly now. Thank you for the fix!

Will the fix be part of 13.0-RC4? Thanks :)

Oh, busy days. It seems the documentation for the new netstat flags "-o" and "-O" is missing from the manual page.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=af85312e8a6f34ea7c8af77b9841fab6b5559e25

commit af85312e8a6f34ea7c8af77b9841fab6b5559e25
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-25 20:22:21 +0000

Fix panic when destroying interface with ECMP routes.

Reported by: Zhenlei Huang <zlei.huang at gmail.com>
PR: 254496

(cherry picked from commit a0308e48ec12ae37f525aa3c6d3c1a236fb55dcd)

sys/net/route/route_ctl.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=47c00a9835926e96e562c67fa28e4432e99d9c56

commit 47c00a9835926e96e562c67fa28e4432e99d9c56
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-25 20:22:58 +0000

Plug nexthop group refcount leak.

In case with batch route delete via rib_walk_del(), when some paths from the multipath route gets deleted, old multipath group were not freed.

PR: 254496
Reported by: Zhenlei Huang <zlei.huang@gmail.com>

(cherry picked from commit 66f138563becf12d5c21924f816d2a45c3a1ed7a)

sys/net/route/route_ctl.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3765afa4dacf5850de984fede7f9b26760efac73

commit 3765afa4dacf5850de984fede7f9b26760efac73
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-23 22:00:04 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-28 20:40:48 +0000

Fix panic when destroying interface with ECMP routes.

Reported by: Zhenlei Huang <zlei.huang at gmail.com>
PR: 254496
Approved by: re (gjb)

(cherry picked from commit af85312e8a6f34ea7c8af77b9841fab6b5559e25)

sys/net/route/route_ctl.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=411cbdb1f298880a0100a633cd0508b70ac4c924

commit 411cbdb1f298880a0100a633cd0508b70ac4c924
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-24 23:51:45 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-28 20:40:48 +0000

Plug nexthop group refcount leak.

In case with batch route delete via rib_walk_del(), when some paths from the multipath route gets deleted, old multipath group were not freed.

PR: 254496
Reported by: Zhenlei Huang <zlei.huang@gmail.com>
Approved by: re (gjb)

(cherry picked from commit 47c00a9835926e96e562c67fa28e4432e99d9c56)

sys/net/route/route_ctl.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

(In reply to Zhenlei Huang from comment #10)
No, but it will be part of RC5 :-)

(In reply to Zhenlei Huang from comment #11)
Yes, the documentation is lacking. I hope to do this once we get 13 released.

The patch is in 13-R. I'm going to close the case tomorrow unless there are any objections.
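
For readers unfamiliar with the failure class named in the "Plug nexthop group refcount leak" commit message, here is a small user-space model of it. This is only a sketch under stated assumptions: `struct grp`, `struct route`, `grp_alloc()`, `grp_release()`, `route_drop_path()` and `leaked_after_batch_delete()` are hypothetical names invented for illustration. They are not the kernel's nexthop API, and the code does not reproduce the actual change made to sys/net/route/route_ctl.c. It only shows why replacing a multipath group during a batch delete without dropping the old group's reference leaves objects behind.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a multipath nexthop group (not the kernel struct). */
struct grp {
	int refcnt;	/* references held on this group */
	int npaths;	/* number of paths in the group */
};

/* Hypothetical route entry; it owns one reference on its group. */
struct route {
	struct grp *grp;
};

static int live_groups;	/* groups allocated and not yet freed */

static struct grp *
grp_alloc(int npaths)
{
	struct grp *g = malloc(sizeof(*g));

	if (g == NULL)
		abort();
	g->refcnt = 1;
	g->npaths = npaths;
	live_groups++;
	return (g);
}

static void
grp_release(struct grp *g)
{
	assert(g->refcnt > 0);
	if (--g->refcnt == 0) {
		free(g);
		live_groups--;
	}
}

/*
 * Delete one path from a multipath route: a smaller group replaces the
 * old one (or nothing, when the last path goes away). With release_old
 * set to 0 the old group's reference is never dropped, which models the
 * leak; with 1 the old group is released after the swap.
 */
static void
route_drop_path(struct route *rt, int release_old)
{
	struct grp *old = rt->grp;

	if (old == NULL)
		return;
	rt->grp = (old->npaths > 1) ? grp_alloc(old->npaths - 1) : NULL;
	if (release_old)
		grp_release(old);
}

/* Run a two-path batch delete and return how many groups it left behind. */
static int
leaked_after_batch_delete(int release_old)
{
	int baseline = live_groups;
	struct route rt = { grp_alloc(2) };

	route_drop_path(&rt, release_old);
	route_drop_path(&rt, release_old);
	return (live_groups - baseline);
}
```

In this model, `leaked_after_batch_delete(0)` leaves two groups allocated while `leaked_after_batch_delete(1)` leaves none, which is the same kind of symptom as the dangling entries in the `netstat -4On` output earlier in the thread.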