Summary: | Fatal trap 12: page fault while in kernel mode ((frr 7.5_1 + Freebsd 13 Beta3) zebra crashes server when routes are populated) | | |
---|---|---|---|
Product: | Base System | Reporter: | Aleks <a.ivanov> |
Component: | kern | Assignee: | Alexander V. Chernikov <melifaro> |
Status: | Closed FIXED | | |
Severity: | Affects Only Me | CC: | melifaro, net, rgrimes, zarychtam, zlei |
Priority: | --- | Keywords: | crash |
Version: | Unspecified | | |
Hardware: | amd64 | | |
OS: | Any | | |
Attachments: | kgdb backtrace (223285), core2 (223541) | | |

Same issue on FreeBSD 13 RC2:
13.0-RC2 #0 releng/13.0-n244684-13c22f74953: Fri Mar 12 04:05:19 UTC 2021 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
FRR is the same: frr7-7.5_1
Name : frr7
Version : 7.5_1

Can you upgrade and test 13.0-RC2? Does it also happen with sysctl net.route.multipath=0 set?

(In reply to Marek Zarychta from comment #2)
It does happen on 13.0-RC2 too with default sysctl settings (multipath=1). With sysctl net.route.multipath=0 it doesn't crash (on RC2).

(In reply to Aleks from comment #3)
If you can build a custom kernel with "options FIB_ALGO", install it and after reboot load the module dpdk_lpm4 (and dpdk_lpm6 if appropriate), then please give it a try.

(In reply to Marek Zarychta from comment #4)
So, I took the kernel src from https://download.freebsd.org/ftp/releases/amd64/13.0-RC2/src.txz and built it with "options FIB_ALGO":
FreeBSD 13.0-RC2 FreeBSD 13.0-RC2 #0: Wed Mar 17 13:23:47 EET 2021 :/usr/obj/usr/src/amd64.amd64/sys/CUSTOM amd64
I disabled FRR autostart and rebooted the server. After the reboot I set multipath=1 and loaded dpdk_lpm4/6, and after that started FRR.
[fib_algo] inet.0 (bsearch4#13) rebuild_fd: switching algo to radix4_lockless
[fib_algo] fib_module_register: attaching dpdk_lpm4 to inet
[fib_algo] fib_module_register: attaching dpdk_lpm6 to inet6
[fib_algo] inet.0 (radix4_lockless#114) rebuild_fd: switching algo to dpdk_lpm4
After bringing up the second BGP FullView session the server still crashed.

CC Alexander V. Chernikov

(In reply to Aleks from comment #5)
Is there any chance you could share kernel&core?

(In reply to Alexander V. Chernikov from comment #7)
Sure, just tell me what you mean by "share kernel&core"

Obvious code checks and attempts to repro this in an easy way failed - I don't have a good idea of why it's happening. Setting up multiple live BGP feeds will take some time, so there are multiple ways to proceed:
Fastest one - if you could tar all your /boot/kernel AND the coredump for that kernel. The downside is that a kernel memory dump may contain some private information (passwords, other sensitive stuff in packet memory etc). If you could consider sharing it with me (so no one else gets access to this info) - that would be awesome.
Otherwise I can either write a list of gdb commands to run on the core or try to repro with the feeds, but that will take more time.

(In reply to Alexander V. Chernikov from comment #9)
By core dump you mean the vmcore.* file?
p.s. I can even give you access to this server if it will help you (it's not in production)

(In reply to Alexander V. Chernikov from comment #9)
p.s. I can't get it to write a dump file after I compiled the custom kernel (+FIB_ALGO). When I test dumps with "sysctl debug.kdb.panic=1" the dump is written, but when zebra populates routes and that crashes the server - the dump is not there.

(In reply to Alexander V. Chernikov from comment #9)
I've sent you what you asked for (in email).

Awesome! Could you also share other panic backtraces (if any)?

Created attachment 223541: core2

(In reply to Alexander V. Chernikov from comment #13)
Apart from the main trace and the one I sent you yesterday? I've attached another one (named core2) to this bug report.

(In reply to Aleks from comment #15)
Thank you!
Short summary:
From the private core.5 you sent me:
* rtentry looks perfectly fine, but the nexthop pointer is (mostly) zeroed
* from the core2: failure to resolve nh_priv pointer
* from the original kgdb_backtrace: nhg has zero pointer to nh_ctl
So far it looks like we're removing the additional reference from the nexthop group in some corner case scenario, which results in the group being freed, with the rtentry still pointing to this group.
Re reproduction: I don't have 2 full-view peers, so I ended up duplicating the feed from a single peer & introducing some delay, to mimic propagation delays. So far I wasn't able to reproduce any panic. Are there any additional specifics (e.g. links flapping) in the setup?
Is there any chance you could run
stdbuf -o0 route -n monitor > zebra_log.txt
at startup (or, actually, at the point in time when all peers are down) and then try to turn up the first and then the second peer?
If you could also run something like
`while true; do date >> nhg.log ; netstat -4OnW >> nhg.log ; sleep 5; done`
and share both files along with the core backtrace, that would be awesome.
If there is a possibility of getting access to the server - that would really speed things up.

(In reply to Alexander V. Chernikov from comment #16)
I'll give you both files and server access via email.

A commit in branch main references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=9095dc7da4cf0c484fb1160b2180b7329b09b107
commit 9095dc7da4cf0c484fb1160b2180b7329b09b107
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-29 23:00:17 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-29 23:00:17 +0000
Fix nexhtop group index array scaling.
The current code has the limit of 127 nexthop groups due to the wrongly-checked bitmask_copy() return value.
PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>
MFC after: 1 day
sys/net/route/nhgrp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

So, it looks like it is a combination of 3 bugs:
1) The actual thing corrupting memory is https://cgit.freebsd.org/src/commit/?id=42f997d9b721ce5b64c37958f21fa81630f5a224 (in 13.0-RC4).
2) We get to this codepath by having 127 nexthop groups (the number at which we trigger the array resize). This is addressed in https://cgit.freebsd.org/src/commit/?id=9095dc7da4cf0c484fb1160b2180b7329b09b107 (only in HEAD atm).
3) We get that number of nexthop groups (there should be only one) because not all of the memory in the comparison part of the nexthop group is zeroed. This is addressed in https://cgit.freebsd.org/src/commit/?id=823a80f4f9037b6b9611aaceb21f53115d1e64f1 (in 13-S, not sure if it lands in 13.0-R).

A commit in branch stable/13 references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=923e7f7e12670e97b097a195e69c848a6e8773a2
commit 923e7f7e12670e97b097a195e69c848a6e8773a2
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-29 23:00:17 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-30 07:34:31 +0000
Fix nexhtop group index array scaling.
The current code has the limit of 127 nexthop groups due to the wrongly-checked bitmask_copy() return value.
PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>
(cherry picked from commit 9095dc7da4cf0c484fb1160b2180b7329b09b107)
sys/net/route/nhgrp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

A commit in branch releng/13.0 references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=b7fbdb5042c619221ee0b97573affcb8bcb59458
commit b7fbdb5042c619221ee0b97573affcb8bcb59458
Author: Alexander V. Chernikov <melifaro@FreeBSD.org>
AuthorDate: 2021-03-29 23:00:17 +0000
Commit: Alexander V. Chernikov <melifaro@FreeBSD.org>
CommitDate: 2021-03-31 20:00:10 +0000
Fix nexhtop group index array scaling.
The current code has the limit of 127 nexthop groups due to the wrongly-checked bitmask_copy() return value.
PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>
Approved by: re (gjb)
(cherry picked from commit 923e7f7e12670e97b097a195e69c848a6e8773a2)
sys/net/route/nhgrp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

All relevant patches are in 13-R. Does it fix the issue for you?

(In reply to Alexander V. Chernikov from comment #22)
For me - yes. Thank you very much!
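
Illustration of the third bug from the root-cause comment above (memory in the compared part of a nexthop group not being fully zeroed). This is only a minimal userland sketch of the general failure mode, not the code from sys/net/route: the struct group_key layout, MAX_NHOPS, fill_key() and same_group_found() are all hypothetical names made up for this example. It assumes the lookup compares keys bytewise, so stale bytes in unused slots make two logically identical groups look different, and a new group gets created each time instead of the existing one being reused.

```c
#include <stdint.h>
#include <string.h>

#define MAX_NHOPS 4

/*
 * Hypothetical lookup key for a group of nexthops. This is NOT the
 * structure used in sys/net/route. All members are uint32_t, so the
 * struct has no padding and every byte belongs to a named field.
 */
struct group_key {
	uint32_t num_nhops;            /* entries of nhop_idx[] in use */
	uint32_t nhop_idx[MAX_NHOPS];  /* only the first num_nhops are set */
};

_Static_assert(sizeof(struct group_key) == (1 + MAX_NHOPS) * sizeof(uint32_t),
    "group_key must not contain padding");

/* Set the used fields only; unused slots keep whatever they held before. */
static void
fill_key(struct group_key *k, uint32_t a, uint32_t b)
{
	k->num_nhops = 2;
	k->nhop_idx[0] = a;
	k->nhop_idx[1] = b;
}

/*
 * Build two keys for the same pair of nexthops in buffers that start
 * with different stale contents (standing in for recycled allocator
 * memory), then compare them bytewise.
 * Returns 1 if the keys compare equal, 0 otherwise.
 */
static int
same_group_found(int zero_first)
{
	struct group_key k1, k2;

	memset(&k1, 0x11, sizeof(k1));  /* simulated stale contents */
	memset(&k2, 0x22, sizeof(k2));
	if (zero_first) {
		/* The fix in this sketch: zero the whole key before filling it. */
		memset(&k1, 0, sizeof(k1));
		memset(&k2, 0, sizeof(k2));
	}
	fill_key(&k1, 10, 20);
	fill_key(&k2, 10, 20);

	return (memcmp(&k1, &k2, sizeof(k1)) == 0);
}
```

In this sketch same_group_found(0) returns 0 (the stale bytes in the two unused slots differ, so the lookup misses), while same_group_found(1) returns 1. With full BGP feeds from two peers, a miss on every lookup would be one way to end up with far more groups than expected, which matches the "should be only one" remark in the analysis.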

Created attachment 223285: kgdb backtrace

Description:
FreeBSD 13 Beta 3 + FRR frr7-7.5_1.
The server has 2 BGP connections with uplink routers. Each neighbour sends an IPv4 FullView (840k+ routes) to this FreeBSD box. When 1 session is up everything is OK; when we bring up the second connection the server crashes (dump attached).

How to reproduce:
1) Install FreeBSD 13 Beta 3 + FRR frr7-7.5_1
2) Send 2 x FullView via 2 peers to the FreeBSD box