Summary: | kernel panic related to IPv4 routes populated by bird2 when dxr routing algorithm used
---|---
Product: | Base System
Component: | kern
Status: | New
Severity: | Affects Only Me
Priority: | ---
Version: | 14.0-RELEASE
Hardware: | Any
OS: | Any
Reporter: | Gregory Neil Shapiro <gshapiro>
Assignee: | Marko Zec <zec>
CC: | olivier, zlei
Description
Gregory Neil Shapiro
Created attachment 250041
core.txt.7
Attaching the full coreinfo.
Could you set sysctl net.route.algo.debug_level=6 before kldloading fib_dxr, so hopefully a bit more info preceding the panic gets captured in the message buffer?

It's been up for 12+ hours with no crash. I'll keep it in place but the last crash happened in minutes so I'm not sure if it will happen again. Nothing useful from the existing crash dump?

Created attachment 250194
core.txt.8
Had a dxr panic with the additional debugging enabled. Attaching the core.txt.8.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d

commit b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-07 15:44:09 +0000

fib_dxr: set fib_data field in struct dxr_aux early enough

Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch().

MFC after: 1 week
PR: 278422

sys/netinet/in_fib_dxr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

(In reply to Gregory Neil Shapiro from comment #4)
The provided logs suggest that several malloc(M_NOWAIT) requests couldn't be fulfilled, which led to a sequence of futile DXR rebuild attempts. Whatever the reasons for the malloc() failures were, DXR shouldn't have crashed the system for sure. I hope a possible culprit was nailed here:
https://cgit.FreeBSD.org/src/commit/?id=b24e353f9e58f6b5bcbd444a062c1c57cd8fc43d
Could you fetch the most recent version of in_fib_dxr.c from the main branch and try it on your 14.0-R system? It should compile as a module just fine... And sorry for the slow reply, I've mostly drifted away from FreeBSD hacking...

(In reply to Marko Zec from comment #6)
The new module is now running. Given how long the previous version went without a crash, I don't know if the lack of crashing from this new version means the bug is fixed or I just didn't hit the condition to trigger the crash. I'll update this bug if it crashes.
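The failure mode named in the commit message above (dxr_build() returning with da->fd unset after a failed allocation) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the struct layouts, `try_alloc()`, `build_fd_late()`, `build_fd_early()` and `batch_update_is_safe()` are hypothetical stand-ins invented for illustration and are not the code in sys/netinet/in_fib_dxr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; NOT the real definitions from in_fib_dxr.c. */
struct fib_data {
	int fibnum;
};

struct dxr_aux {
	struct fib_data *fd;	/* back-pointer needed by later updates */
	void *range_tbl;	/* table that may fail to allocate */
};

/* Allocation helper that can be forced to fail, like malloc(M_NOWAIT). */
void *
try_alloc(size_t size, bool fail)
{
	return (fail ? NULL : malloc(size));
}

/* Late assignment: an allocation failure returns before da->fd is set. */
int
build_fd_late(struct dxr_aux *da, struct fib_data *fd, bool fail)
{
	assert(da != NULL && fd != NULL);
	da->range_tbl = try_alloc(64, fail);
	if (da->range_tbl == NULL)
		return (-1);	/* da->fd is still NULL on this path */
	da->fd = fd;
	return (0);
}

/* Early assignment: da->fd is valid on every return path. */
int
build_fd_early(struct dxr_aux *da, struct fib_data *fd, bool fail)
{
	assert(da != NULL && fd != NULL);
	da->fd = fd;
	da->range_tbl = try_alloc(64, fail);
	if (da->range_tbl == NULL)
		return (-1);
	return (0);
}

/* Models a later update path that needs da->fd; reports, never crashes. */
bool
batch_update_is_safe(const struct dxr_aux *da)
{
	assert(da != NULL);
	return (da->fd != NULL);
}
```

In this model, a forced allocation failure leaves `fd` NULL after `build_fd_late()` but set after `build_fd_early()`, which is the ordering difference the commit title describes.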
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0418d7a0903725ade71ae77c4ff900010a93a185

commit 0418d7a0903725ade71ae77c4ff900010a93a185
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-14 20:32:41 +0000

fib_dxr: set fib_data field in struct dxr_aux early enough

Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch().

MFC after: 1 week
PR: 278422

sys/netinet/in_fib_dxr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9ae078121d3f70d8cd8c537fa16daf302ff5ee21

commit 9ae078121d3f70d8cd8c537fa16daf302ff5ee21
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-14 20:36:20 +0000

fib_dxr: set fib_data field in struct dxr_aux early enough

Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch().

MFC after: 1 week
PR: 278422

sys/netinet/in_fib_dxr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4ab122e8ef127d36d95f874e85600c36c87c8c22

commit 4ab122e8ef127d36d95f874e85600c36c87c8c22
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-17 16:21:54 +0000

fib_dxr: check if cached fib_data matches the new request in dxr_init()

When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds.

However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. calls to dxr_init() with the old_data pointer set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may no longer be valid and could thus lead to a system crash.

PR: 278422
MFC after: 1 week

sys/netinet/in_fib_dxr.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d6e32525c778d92c26a37f4e1b562e80b18a9af7

commit d6e32525c778d92c26a37f4e1b562e80b18a9af7
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-22 17:34:05 +0000

fib_dxr: check if cached fib_data matches the new request in dxr_init()

When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds.

However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. calls to dxr_init() with the old_data pointer set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may no longer be valid and could thus lead to a system crash.

PR: 278422
MFC after: 1 week
(cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)

sys/netinet/in_fib_dxr.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9629a4b6865c5c56804f79a62f45512b175776e2

commit 9629a4b6865c5c56804f79a62f45512b175776e2
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-22 17:37:31 +0000

fib_dxr: check if cached fib_data matches the new request in dxr_init()

When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds.

However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. calls to dxr_init() with the old_data pointer set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may no longer be valid and could thus lead to a system crash.

PR: 278422
MFC after: 1 week
(cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)

sys/netinet/in_fib_dxr.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

A commit in branch releng/14.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b0a1a3138a37b7849d1fb735e6b5c2cd392a2e8b

commit b0a1a3138a37b7849d1fb735e6b5c2cd392a2e8b
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-07 15:44:09 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-22 17:50:29 +0000

fib_dxr: set fib_data field in struct dxr_aux early enough

Previously it was possible for dxr_build() to return with da->fd unset in case of range_tbl or x_tbl malloc() failures. This may have led to NULL ptr dereferencing in dxr_change_rib_batch().

Approved by: re (cperciva)
MFC after: 1 week
PR: 278422
(cherry picked from commit 0418d7a0903725ade71ae77c4ff900010a93a185)

sys/netinet/in_fib_dxr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

A commit in branch releng/14.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=782f02004251f68d144ea7914e390297b6edad48

commit 782f02004251f68d144ea7914e390297b6edad48
Author: Marko Zec <zec@FreeBSD.org>
AuthorDate: 2024-05-17 15:55:43 +0000
Commit: Marko Zec <zec@FreeBSD.org>
CommitDate: 2024-05-23 04:29:22 +0000

fib_dxr: check if cached fib_data matches the new request in dxr_init()

When calling dxr_init(), the FIB_ALGO infrastructure may provide a pointer to a previous dxr instance, which permits reuse of auxiliary dxr structures, i.e. incremental lookup structure updates. For dxr this is a crucial feature provided by FIB_ALGO, since dxr incremental updates are typically several orders of magnitude faster than full lookup table rebuilds.

However, the auxiliary dxr structure caches a pointer to struct fib_data and relies upon it for performing incremental updates. Apparently, incremental rebuild requests from FIB_ALGO, i.e. calls to dxr_init() with the old_data pointer set, may (under not yet fully understood circumstances) be invoked within a different fib_data context than the one cached in the previous version of dxr auxiliary structures. In such (rare) events, we ignore the offered old dxr context, and proceed with a full lookup structure rebuild instead of attempting an incremental one using a fib_data context which may no longer be valid and could thus lead to a system crash.

PR: 278422
MFC after: 1 week
Approved by: re (cperciva)
(cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)
(cherry picked from commit d6e32525c778d92c26a37f4e1b562e80b18a9af7)

sys/netinet/in_fib_dxr.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
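The guard described in the dxr_init() commit messages above (ignore the offered old context when its cached fib_data differs from the one in the new request, and rebuild from scratch) might look roughly like the following user-space sketch. The struct layouts, the `rebuild_kind` enum and `choose_rebuild()` are hypothetical names invented for illustration; the real change is a few lines inside dxr_init() in sys/netinet/in_fib_dxr.c and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real definitions from in_fib_dxr.c. */
struct fib_data {
	int fibnum;
};

struct dxr_aux {
	struct fib_data *fd;	/* fib_data context cached by a prior build */
};

struct dxr {
	struct dxr_aux *aux;	/* auxiliary structures, reusable if valid */
};

enum rebuild_kind {
	REBUILD_FULL,
	REBUILD_INCREMENTAL
};

/*
 * Decide whether the previous instance can be reused for an incremental
 * update. Reuse is allowed only when the cached fib_data pointer is the
 * same one the new request was made for; any mismatch falls back to a
 * full rebuild, so a possibly stale context is never touched.
 */
enum rebuild_kind
choose_rebuild(const struct dxr *old_dxr, const struct fib_data *fd)
{
	assert(fd != NULL);
	if (old_dxr == NULL || old_dxr->aux == NULL)
		return (REBUILD_FULL);		/* nothing to reuse */
	if (old_dxr->aux->fd != fd)
		return (REBUILD_FULL);		/* context mismatch */
	return (REBUILD_INCREMENTAL);
}
```

The sketch trades speed for safety in the mismatch case: a full rebuild is much slower than an incremental update, as the commit message notes, but it does not depend on the cached pointer still being valid.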