Bug 278422 - kernel panic related to IPv4 routes populated by bird2 when dxr routing algorithm used
Summary: kernel panic related to IPv4 routes populated by bird2 when dxr routing algor...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.0-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: Marko Zec
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-18 02:27 UTC by Gregory Neil Shapiro
Modified: 2024-10-26 17:38 UTC (History)
3 users (show)

See Also:


Attachments
core.txt.7 (182.31 KB, text/plain)
2024-04-18 02:28 UTC, Gregory Neil Shapiro
no flags Details
core.txt.8 (179.12 KB, text/plain)
2024-04-24 01:40 UTC, Gregory Neil Shapiro
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Gregory Neil Shapiro freebsd_committer freebsd_triage 2024-04-18 02:27:58 UTC
While researching bug 278394, Marek recommended:

net.route.algo.inet.algo=dxr

Using that caused a kernel panic:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff80cb316b
stack pointer           = 0x0:0xfffffe000374fd00
frame pointer           = 0x0:0xfffffe000374fd10
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (clock (0))
rdi: deadc0dedeadc0de rsi: fffffe000374fd28 rdx: 000000000009f272
rcx: 000000000009f271  r8: 0000000000000f1c  r9: 000000000009f271
rax: deadc0dedeadc0de rbx: fffffe000374fd28 rbp: fffffe000374fd10
r10: 0000000000004f93 r11: 0000000000800000 r12: fffffe00a133b000
r13: 0000000000000003 r14: deadc0dedeadc0de r15: 0000000000000f1c
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1713406520
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000374fa40
vpanic() at vpanic+0x132/frame 0xfffffe000374fb70
panic() at panic+0x43/frame 0xfffffe000374fbd0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe000374fc30
calltrap() at calltrap+0x8/frame 0xfffffe000374fc30
--- trap 0x9, rip = 0xffffffff80cb316b, rsp = 0xfffffe000374fd00, rbp = 0xfffffe000374fd10 ---
fib_get_rtable_info() at fib_get_rtable_info+0x1b/frame 0xfffffe000374fd10
dxr_change_rib_batch() at dxr_change_rib_batch+0x272/frame 0xfffffe000374fd70
apply_rtable_changes() at apply_rtable_changes+0x2a/frame 0xfffffe000374fda0
handle_fd_callout() at handle_fd_callout+0x167/frame 0xfffffe000374fe00
softclock_call_cc() at softclock_call_cc+0x14e/frame 0xfffffe000374fec0
softclock_thread() at softclock_thread+0xc6/frame 0xfffffe000374fef0
fork_exit() at fork_exit+0x82/frame 0xfffffe000374ff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000374ff30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Comment 1 Gregory Neil Shapiro freebsd_committer freebsd_triage 2024-04-18 02:28:37 UTC
Created attachment 250041 [details]
core.txt.7

Attaching full coreinfo info
Comment 2 Marko Zec freebsd_committer freebsd_triage 2024-04-20 15:36:50 UTC
Could you set sysctl net.route.algo.debug_level=6 before kldloading fib_dxr, so hopefully a bit more info preceding the panic gets captured in the message buffer?
Comment 3 Gregory Neil Shapiro freebsd_committer freebsd_triage 2024-04-21 15:50:03 UTC
It's been up for 12+ hours with no crash.  I'll keep it in place but the last crash happened in minutes so I'm not sure if it will happen again.  Nothing useful from the existing crash dump?
Comment 4 Gregory Neil Shapiro freebsd_committer freebsd_triage 2024-04-24 01:40:19 UTC
Created attachment 250194 [details]
core.txt.8

Had a dxr panic with the additional debugging enabled.  Attaching the core.txt.8.
Comment 5 commit-hook freebsd_committer freebsd_triage 2024-05-07 15:51:04 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d

commit b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-07 15:44:09 +0000

    fib_dxr: set fib_data field in struct dxr_aux early enough

    Previously it was possible for dxr_build() to return with da->fd
    unset in case of range_tbl or x_tbl malloc() failures.  This
    may have led to NULL ptr dereferencing in dxr_change_rib_batch().

    MFC after:      1 week

    PR:             278422

 sys/netinet/in_fib_dxr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 6 Marko Zec freebsd_committer freebsd_triage 2024-05-07 16:06:35 UTC
(In reply to Gregory Neil Shapiro from comment #4)

The provided logs suggest that several malloc(M_NOWAIT) requests coulnd't be fullfilled, which led to a sequence of futile DXR rebuild attempts.  Whatever the reasons for the malloc() failures were, DXR shouldn't have crashed the system for sure.

I hope a possible culprit was nailed here: https://cgit.FreeBSD.org/src/commit/?id=b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d

Could you fetch the most recent version of in_fib_dxr.c from the main branch and try it on your 14.0-R system, it should compile as a module just fine...

And sorry for the slow reply, I've mostly drifted away from FreeBSD hacking...
Comment 7 Gregory Neil Shapiro freebsd_committer freebsd_triage 2024-05-14 04:35:40 UTC
(In reply to Marko Zec from comment #6)

The new module is now running.  Given how long the previous version went without a crash, I don't know if the lack of crashing from this new version means the bug is fixed or I just didn't hit the condition to trigger the crash.

I'll update this bug if it crashes.
Comment 8 commit-hook freebsd_committer freebsd_triage 2024-05-14 20:39:36 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0418d7a0903725ade71ae77c4ff900010a93a185

commit 0418d7a0903725ade71ae77c4ff900010a93a185
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-14 20:32:41 +0000

    fib_dxr: set fib_data field in struct dxr_aux early enough

    Previously it was possible for dxr_build() to return with da->fd
    unset in case of range_tbl or x_tbl malloc() failures.  This
    may have led to NULL ptr dereferencing in dxr_change_rib_batch().

    MFC after:      1 week

    PR:             278422

 sys/netinet/in_fib_dxr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 9 commit-hook freebsd_committer freebsd_triage 2024-05-14 20:40:37 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9ae078121d3f70d8cd8c537fa16daf302ff5ee21

commit 9ae078121d3f70d8cd8c537fa16daf302ff5ee21
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-14 20:36:20 +0000

    fib_dxr: set fib_data field in struct dxr_aux early enough

    Previously it was possible for dxr_build() to return with da->fd
    unset in case of range_tbl or x_tbl malloc() failures.  This
    may have led to NULL ptr dereferencing in dxr_change_rib_batch().

    MFC after:      1 week

    PR:             278422

 sys/netinet/in_fib_dxr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 10 commit-hook freebsd_committer freebsd_triage 2024-05-17 16:22:46 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4ab122e8ef127d36d95f874e85600c36c87c8c22

commit 4ab122e8ef127d36d95f874e85600c36c87c8c22
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-17 16:21:54 +0000

    fib_dxr: check if cached fib_data matches the new request in dxr_init()

    When calling dxr_init(), the FIB_ALGO infrastructure may provide a
    pointer to a previous dxr instance, which permits reuse of auxiliary
    dxr structures, i.e. incremental lookup structure updates.  For dxr this
    is a crucial feature provided by FIB_ALGO, since dxr incremental updates
    are typically several orders of magnitude faster than full lookup table
    rebuilds.

    However, the auxiliary dxr structure caches a pointer to struct fib_data and
    relies upon it for performing incremental updates.  Apparently, incremental
    rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer
    old_data set, may (under not yet fully understood circumstances) be invoked
    within a different fib_data context than the one cached in the previous
    version of dxr auxiliary structures.  In such (rare) events, we ignore the
    offered old dxr context, and proceed with a full lookup structure rebuild
    instead of attempting an incremental one using a fib_data context which
    may or may not no longer be valid, and thus lead to a system crash.

    PR:             278422
    MFC after:      1 week

 sys/netinet/in_fib_dxr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
Comment 11 commit-hook freebsd_committer freebsd_triage 2024-05-22 17:35:00 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d6e32525c778d92c26a37f4e1b562e80b18a9af7

commit d6e32525c778d92c26a37f4e1b562e80b18a9af7
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-22 17:34:05 +0000

    fib_dxr: check if cached fib_data matches the new request in dxr_init()

    When calling dxr_init(), the FIB_ALGO infrastructure may provide a
    pointer to a previous dxr instance, which permits reuse of auxiliary
    dxr structures, i.e. incremental lookup structure updates.  For dxr this
    is a crucial feature provided by FIB_ALGO, since dxr incremental updates
    are typically several orders of magnitude faster than full lookup table
    rebuilds.

    However, the auxiliary dxr structure caches a pointer to struct fib_data and
    relies upon it for performing incremental updates.  Apparently, incremental
    rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer
    old_data set, may (under not yet fully understood circumstances) be invoked
    within a different fib_data context than the one cached in the previous
    version of dxr auxiliary structures.  In such (rare) events, we ignore the
    offered old dxr context, and proceed with a full lookup structure rebuild
    instead of attempting an incremental one using a fib_data context which
    may or may not no longer be valid, and thus lead to a system crash.

    PR:             278422
    MFC after:      1 week

    (cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)

 sys/netinet/in_fib_dxr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
Comment 12 commit-hook freebsd_committer freebsd_triage 2024-05-22 17:39:01 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9629a4b6865c5c56804f79a62f45512b175776e2

commit 9629a4b6865c5c56804f79a62f45512b175776e2
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-22 17:37:31 +0000

    fib_dxr: check if cached fib_data matches the new request in dxr_init()

    When calling dxr_init(), the FIB_ALGO infrastructure may provide a
    pointer to a previous dxr instance, which permits reuse of auxiliary
    dxr structures, i.e. incremental lookup structure updates.  For dxr this
    is a crucial feature provided by FIB_ALGO, since dxr incremental updates
    are typically several orders of magnitude faster than full lookup table
    rebuilds.

    However, the auxiliary dxr structure caches a pointer to struct fib_data and
    relies upon it for performing incremental updates.  Apparently, incremental
    rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer
    old_data set, may (under not yet fully understood circumstances) be invoked
    within a different fib_data context than the one cached in the previous
    version of dxr auxiliary structures.  In such (rare) events, we ignore the
    offered old dxr context, and proceed with a full lookup structure rebuild
    instead of attempting an incremental one using a fib_data context which
    may or may not no longer be valid, and thus lead to a system crash.

    PR:             278422
    MFC after:      1 week

    (cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)

 sys/netinet/in_fib_dxr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
Comment 13 commit-hook freebsd_committer freebsd_triage 2024-05-22 17:53:03 UTC
A commit in branch releng/14.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b0a1a3138a37b7849d1fb735e6b5c2cd392a2e8b

commit b0a1a3138a37b7849d1fb735e6b5c2cd392a2e8b
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-22 17:50:29 +0000

    fib_dxr: set fib_data field in struct dxr_aux early enough

    Previously it was possible for dxr_build() to return with da->fd
    unset in case of range_tbl or x_tbl malloc() failures.  This
    may have led to NULL ptr dereferencing in dxr_change_rib_batch().

    Approved by:    re (cperciva)
    MFC after:      1 week

    PR:             278422
    (cherry picked from commit 0418d7a0903725ade71ae77c4ff900010a93a185)

 sys/netinet/in_fib_dxr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 14 commit-hook freebsd_committer freebsd_triage 2024-05-23 05:37:52 UTC
A commit in branch releng/14.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=782f02004251f68d144ea7914e390297b6edad48

commit 782f02004251f68d144ea7914e390297b6edad48
Author:     Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit:     Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-23 04:29:22 +0000

    fib_dxr: check if cached fib_data matches the new request in dxr_init()

    When calling dxr_init(), the FIB_ALGO infrastructure may provide a
    pointer to a previous dxr instance, which permits reuse of auxiliary
    dxr structures, i.e. incremental lookup structure updates.  For dxr this
    is a crucial feature provided by FIB_ALGO, since dxr incremental updates
    are typically several orders of magnitude faster than full lookup table
    rebuilds.

    However, the auxiliary dxr structure caches a pointer to struct fib_data and
    relies upon it for performing incremental updates.  Apparently, incremental
    rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer
    old_data set, may (under not yet fully understood circumstances) be invoked
    within a different fib_data context than the one cached in the previous
    version of dxr auxiliary structures.  In such (rare) events, we ignore the
    offered old dxr context, and proceed with a full lookup structure rebuild
    instead of attempting an incremental one using a fib_data context which
    may or may not no longer be valid, and thus lead to a system crash.

    PR:             278422
    MFC after:      1 week
    Approved by:    re (cperciva)

    (cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)
    (cherry picked from commit d6e32525c778d92c26a37f4e1b562e80b18a9af7)

 sys/netinet/in_fib_dxr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)