While researching bug 278394, Marek recommended: net.route.algo.inet.algo=dxr Using that caused a kernel panic: Fatal trap 9: general protection fault while in kernel mode cpuid = 0; apic id = 00 instruction pointer = 0x20:0xffffffff80cb316b stack pointer = 0x0:0xfffffe000374fd00 frame pointer = 0x0:0xfffffe000374fd10 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 2 (clock (0)) rdi: deadc0dedeadc0de rsi: fffffe000374fd28 rdx: 000000000009f272 rcx: 000000000009f271 r8: 0000000000000f1c r9: 000000000009f271 rax: deadc0dedeadc0de rbx: fffffe000374fd28 rbp: fffffe000374fd10 r10: 0000000000004f93 r11: 0000000000800000 r12: fffffe00a133b000 r13: 0000000000000003 r14: deadc0dedeadc0de r15: 0000000000000f1c trap number = 9 panic: general protection fault cpuid = 0 time = 1713406520 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000374fa40 vpanic() at vpanic+0x132/frame 0xfffffe000374fb70 panic() at panic+0x43/frame 0xfffffe000374fbd0 trap_fatal() at trap_fatal+0x40c/frame 0xfffffe000374fc30 calltrap() at calltrap+0x8/frame 0xfffffe000374fc30 --- trap 0x9, rip = 0xffffffff80cb316b, rsp = 0xfffffe000374fd00, rbp = 0xfffffe000374fd10 --- fib_get_rtable_info() at fib_get_rtable_info+0x1b/frame 0xfffffe000374fd10 dxr_change_rib_batch() at dxr_change_rib_batch+0x272/frame 0xfffffe000374fd70 apply_rtable_changes() at apply_rtable_changes+0x2a/frame 0xfffffe000374fda0 handle_fd_callout() at handle_fd_callout+0x167/frame 0xfffffe000374fe00 softclock_call_cc() at softclock_call_cc+0x14e/frame 0xfffffe000374fec0 softclock_thread() at softclock_thread+0xc6/frame 0xfffffe000374fef0 fork_exit() at fork_exit+0x82/frame 0xfffffe000374ff30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000374ff30 --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Created attachment 250041 [details] core.txt.7 Attaching full coreinfo info
Could you set sysctl net.route.algo.debug_level=6 before kldloading fib_dxr, so hopefully a bit more info preceding the panic gets captured in the message buffer?
It's been up for 12+ hours with no crash. I'll keep it in place but the last crash happened in minutes so I'm not sure if it will happen again. Nothing useful from the existing crash dump?
Created attachment 250194 [details] core.txt.8 Had a dxr panic with the additional debugging enabled. Attaching the core.txt.8.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d commit b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-07 15:44:09 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-07 15:44:09 +0000 fib_dxr: set fib_data field in struct dxr_aux early enough Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch(). MFC after: 1 week PR: 278422 sys/netinet/in_fib_dxr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
(In reply to Gregory Neil Shapiro from comment #4) The provided logs suggest that several malloc(M_NOWAIT) requests coulnd't be fullfilled, which led to a sequence of futile DXR rebuild attempts. Whatever the reasons for the malloc() failures were, DXR shouldn't have crashed the system for sure. I hope a possible culprit was nailed here: https://cgit.FreeBSD.org/src/commit/?id=b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d Could you fetch the most recent version of in_fib_dxr.c from the main branch and try it on your 14.0-R system, it should compile as a module just fine... And sorry for the slow reply, I've mostly drifted away from FreeBSD hacking...
(In reply to Marko Zec from comment #6) The new module is now running. Given how long the previous version went without a crash, I don't know if the lack of crashing from this new version means the bug is fixed or I just didn't hit the condition to trigger the crash. I'll update this bug if it crashes.
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=0418d7a0903725ade71ae77c4ff900010a93a185 commit 0418d7a0903725ade71ae77c4ff900010a93a185 Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-07 15:44:09 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-14 20:32:41 +0000 fib_dxr: set fib_data field in struct dxr_aux early enough Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch(). MFC after: 1 week PR: 278422 sys/netinet/in_fib_dxr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9ae078121d3f70d8cd8c537fa16daf302ff5ee21 commit 9ae078121d3f70d8cd8c537fa16daf302ff5ee21 Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-07 15:44:09 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-14 20:36:20 +0000 fib_dxr: set fib_data field in struct dxr_aux early enough Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch(). MFC after: 1 week PR: 278422 sys/netinet/in_fib_dxr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=4ab122e8ef127d36d95f874e85600c36c87c8c22 commit 4ab122e8ef127d36d95f874e85600c36c87c8c22 Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-17 15:55:43 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-17 16:21:54 +0000 fib_dxr: check if cached fib_data matches the new request in dxr_init() When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds. However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer old_data set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may or may not no longer be valid, and thus lead to a system crash. PR: 278422 MFC after: 1 week sys/netinet/in_fib_dxr.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=d6e32525c778d92c26a37f4e1b562e80b18a9af7 commit d6e32525c778d92c26a37f4e1b562e80b18a9af7 Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-17 15:55:43 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-22 17:34:05 +0000 fib_dxr: check if cached fib_data matches the new request in dxr_init() When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds. However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer old_data set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may or may not no longer be valid, and thus lead to a system crash. PR: 278422 MFC after: 1 week (cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22) sys/netinet/in_fib_dxr.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9629a4b6865c5c56804f79a62f45512b175776e2 commit 9629a4b6865c5c56804f79a62f45512b175776e2 Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-17 15:55:43 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-22 17:37:31 +0000 fib_dxr: check if cached fib_data matches the new request in dxr_init() When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds. However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer old_data set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may or may not no longer be valid, and thus lead to a system crash. PR: 278422 MFC after: 1 week (cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22) sys/netinet/in_fib_dxr.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
A commit in branch releng/14.1 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b0a1a3138a37b7849d1fb735e6b5c2cd392a2e8b commit b0a1a3138a37b7849d1fb735e6b5c2cd392a2e8b Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-07 15:44:09 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-22 17:50:29 +0000 fib_dxr: set fib_data field in struct dxr_aux early enough Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch(). Approved by: re (cperciva) MFC after: 1 week PR: 278422 (cherry picked from commit 0418d7a0903725ade71ae77c4ff900010a93a185) sys/netinet/in_fib_dxr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch releng/14.1 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=782f02004251f68d144ea7914e390297b6edad48 commit 782f02004251f68d144ea7914e390297b6edad48 Author: Marko Zec <zec@FreeBSD.org> AuthorDate: 2024-05-17 15:55:43 +0000 Commit: Marko Zec <zec@FreeBSD.org> CommitDate: 2024-05-23 04:29:22 +0000 fib_dxr: check if cached fib_data matches the new request in dxr_init() When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds. However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer old_data set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may or may not no longer be valid, and thus lead to a system crash. PR: 278422 MFC after: 1 week Approved by: re (cperciva) (cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22) (cherry picked from commit d6e32525c778d92c26a37f4e1b562e80b18a9af7) sys/netinet/in_fib_dxr.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)