283137 – pf: states corruption since 93c80b79ad65 leading to kernel panic

Bug 283137 - pf: states corruption since 93c80b79ad65 leading to kernel panic

Summary: pf: states corruption since 93c80b79ad65 leading to kernel panic

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	14.2-STABLE
Hardware:	Any Any

Importance:	--- Affects Some People
Assignee:	freebsd-pf (Nobody)

URL:	https://github.com/opnsense/src/issue...
Keywords:	crash, regression, vendor

Duplicates (1):	283362 (view as bug list)
Depends on:
Blocks:

Reported:	2024-12-05 07:57 UTC by Franco Fichtner
Modified:	2025-02-05 14:34 UTC (History)
CC List:	8 users (show)

See Also:	283362

Attachments
crashinfo 2024-12-13, mars (17.49 KB, text/plain) 2024-12-15 13:29 UTC, Frank Behrens	no flags	Details
Test patch (1.50 KB, patch) 2024-12-16 12:40 UTC, Kristof Provost	no flags	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Franco Fichtner 2024-12-05 07:57:03 UTC

Hi,

OPNsense users report a pf state corruption since the deployment of 93c80b79ad65 which ends up in at least one kernel panic, but due to the nature of the situation it could actually be multiple.

The issue seems quite prevalent on production systems and may crash a system after just a couple of minutes of runtime.

One user provided a kernel dump.  I'm attaching the info for triage here:

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff8049c36a in db_dump (dummy=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:591
#3  0xffffffff8049c16d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504
#4  0xffffffff8049c2b6 in db_command_script (command=command@entry=0xffffffff81bbf6d3 <db_recursion_data+3> "dump") at /usr/src/sys/ddb/db_command.c:569
#5  0xffffffff804a1528 in db_script_exec (scriptname=<optimized out>, warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302
#6  0xffffffff804a1435 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:325
#7  0xffffffff8049f4f1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267
#8  0xffffffff80c09868 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe00e206e2e0) at /usr/src/sys/kern/subr_kdb.c:790
#9  0xffffffff810e0419 in trap (frame=0xfffffe00e206e2e0) at /usr/src/sys/amd64/amd64/trap.c:608
#10 <signal handler called>
#11 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#12 0xffffffff80bb91d2 in vpanic (fmt=0xffffffff823f5cbd "Bad link elm %p prev->next != elm", ap=ap@entry=0xfffffe00e206e510) at /usr/src/sys/kern/kern_shutdown.c:955
#13 0xffffffff80bb9283 in panic (fmt=0xffffffff81d82c18 <cnputs_mtx+24> "") at /usr/src/sys/kern/kern_shutdown.c:891
#14 0xffffffff823c1dd0 in pf_state_key_detach (s=s@entry=0xfffff803cc297b00, idx=idx@entry=0) at /usr/src/sys/netpfil/pf/pf.c:1456
#15 0xffffffff823ad0ef in pf_detach_state (s=s@entry=0xfffff803cc297b00) at /usr/src/sys/netpfil/pf/pf.c:1442
#16 0xffffffff823ac6d9 in pf_state_key_attach (skw=0xfffff803cc2c4420, sks=0x0, s=0xfffff803cc297b00) at /usr/src/sys/netpfil/pf/pf.c:1355
#17 pf_state_insert (kif=<optimized out>, orig_kif=orig_kif@entry=0xfffff80002150600, skw=0xfffff803cc2c4420, sks=<optimized out>, s=s@entry=0xfffff803cc297b00)
    at /usr/src/sys/netpfil/pf/pf.c:1535
#18 0xffffffff823ba740 in pf_create_state (r=0xfffff80227b7e000, nr=0xfffff80189e7a800, a=<optimized out>, pd=0xfffffe00e206eb00, nsn=0x0, nk=0x12, sk=<optimized out>, 
    m=0xfffff8001dc64800, off=20, sport=4843, dport=59668, rewrite=0xfffffe00e206ea0c, kif=0xfffff80002150600, sm=0xfffffe00e206ec18, tag=-1, bproto_sum=25520, 
    bip_sum=979, hdrlen=8, match_rules=<optimized out>) at /usr/src/sys/netpfil/pf/pf.c:5025
#19 pf_test_rule (rm=rm@entry=0xfffffe00e206ebf0, sm=sm@entry=0xfffffe00e206ec18, kif=kif@entry=0xfffff80002150600, m=0xfffff8001dc64800, off=20, 
    pd=pd@entry=0xfffffe00e206eb00, am=0xfffffe00e206ebd8, rsm=0xfffffe00e206ebc8, inp=0x0) at /usr/src/sys/netpfil/pf/pf.c:4800
#20 0xffffffff823b4471 in pf_test (dir=dir@entry=1, pflags=<optimized out>, ifp=0xfffff80001906000, m0=m0@entry=0xfffffe00e206ed08, inp=<optimized out>, 
    default_actions=default_actions@entry=0x0) at /usr/src/sys/netpfil/pf/pf.c:8269
#21 0xffffffff823dc177 in pf_check_in (m=0xfffffe00e206ed08, ifp=0x12, flags=-502865312, ruleset=<optimized out>, inp=0xffffffff80c10af0 <putchar>)
    at /usr/src/sys/netpfil/pf/pf_ioctl.c:6575
#22 0xffffffff80d19e98 in pfil_mbuf_common (pch=<optimized out>, m=0xfffffe00e206ed08, m@entry=0xfffffe00e206ec48, ifp=0xfffff80001906000, flags=65536, inp=inp@entry=0x0)
    at /usr/src/sys/net/pfil.c:212
#23 pfil_mbuf_in (head=<optimized out>, m=m@entry=0xfffffe00e206ed08, ifp=0xfffff80001906000, inp=inp@entry=0x0) at /usr/src/sys/net/pfil.c:230
#24 0xffffffff80d9c59a in ip_tryforward (m=0xfffff8001dc64800) at /usr/src/sys/netinet/ip_fastfwd.c:312
#25 0xffffffff80d9fa9c in ip_input (m=0xfffff8001dc64800) at /usr/src/sys/netinet/ip_input.c:621
#26 0xffffffff80d1682b in netisr_process_workstream_proto (nwsp=0xfffffe003a5eca40, proto=1) at /usr/src/sys/net/netisr.c:927
#27 swi_net (arg=0xfffffe003a5eca40) at /usr/src/sys/net/netisr.c:974
#28 0xffffffff80b6ffc6 in intr_event_execute_handlers (ie=0xfffff80001a59100, p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1205
#29 ithread_execute_handlers (ie=0xfffff80001a59100, p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1218
#30 ithread_loop (arg=arg@entry=0xfffff80001a7a620) at /usr/src/sys/kern/kern_intr.c:1306
#31 0xffffffff80b6c402 in fork_exit (callout=0xffffffff80b6fd70 <ithread_loop>, arg=0xfffff80001a7a620, frame=0xfffffe00e206ef40) at /usr/src/sys/kern/kern_fork.c:1164
#32 <signal handler called>

(kgdb) frame 14
#14 0xffffffff823c1dd0 in pf_state_key_detach (s=s@entry=0xfffff803cc297b00, idx=idx@entry=0) at /usr/src/sys/netpfil/pf/pf.c:1456
warning: Source file is more recent than executable.
1456		TAILQ_REMOVE(&sk->states[idx], s, key_list[idx]);
(kgdb) list
1451	#ifdef INVARIANTS
1452		struct pf_keyhash *kh = &V_pf_keyhash[pf_hashkey(sk)];
1453	
1454		PF_HASHROW_ASSERT(kh);
1455	#endif
1456		TAILQ_REMOVE(&sk->states[idx], s, key_list[idx]);
1457		s->key[idx] = NULL;
1458	
1459		if (TAILQ_EMPTY(&sk->states[0]) && TAILQ_EMPTY(&sk->states[1])) {
1460			LIST_REMOVE(sk, entry);
(kgdb) p *sk
$1 = {addr = {{{v4 = {s_addr = XXX}, v6 = {__u6_addr = {__u6_addr8 = "XXX", <incomplete sequence XXX>, 
            __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, __u6_addr32 = {XXX, XXX, XXX, XXX}}}, 
        addr8 = "XXX", <incomplete sequence \XXX>, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, 
          XXX}, addr32 = {XXX, XXX, XXX, XXX}}}, {{v4 = {s_addr = XXX}, v6 = {__u6_addr = {
            __u6_addr8 = "XXX", <incomplete sequence XXX>, __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, 
              XXX, XXX, XXX}, __u6_addr32 = {XXX, XXX, XXX, XXX}}}, 
        addr8 = "XXX", <incomplete sequence XXX>, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, 
          XXX}, addr32 = {XXX, XXX, XXX, XXX}}}}, port = {49374, 57005}, af = 222 '\336', proto = 192 '\300', 
  pad = "\255", <incomplete sequence \336>, entry = {le_next = 0xdeadc0dedeadc0de, le_prev = 0xdeadc0dedeadc0de}, states = {{tqh_first = 0xdeadc0dedeadc0de, 
      tqh_last = 0xdeadc0dedeadc0de}, {tqh_first = 0xdeadc0dedeadc0de, tqh_last = 0xdeadc0dedeadc0de}}}
(kgdb) p *sk->states
$2 = {tqh_first = 0xdeadc0dedeadc0de, tqh_last = 0xdeadc0dedeadc0de}
(kgdb) p *s
$3 = {id = 10415225491559546880, creatorid = 1082503010, direction = 1 '\001', pad = "\000\000", state_flags = 128, timeout = 27 '\033', sync_state = 255 '\377', 
  sync_updates = 0 '\000', refs = 0, lock = 0xfffffe0109794688, sync_list = {tqe_next = 0x0, tqe_prev = 0x0}, key_list = {{tqe_next = 0x0, 
      tqe_prev = 0xfffff803cc2c4458}, {tqe_next = 0x0, tqe_prev = 0x0}}, entry = {le_next = 0x0, le_prev = 0x0}, src = {scrub = 0x0, seqlo = 0, seqhi = 0, seqdiff = 0, 
    max_win = 0, mss = 0, state = 1 '\001', wscale = 0 '\000', tcp_est = 0 '\000', pad = ""}, dst = {scrub = 0x0, seqlo = 0, seqhi = 0, seqdiff = 0, max_win = 0, 
    mss = 0, state = 0 '\000', wscale = 0 '\000', tcp_est = 0 '\000', pad = ""}, match_rules = {slh_first = 0x0}, rule = {ptr = 0xfffff80227b7e000, nr = 666361856}, 
  anchor = {ptr = 0x0, nr = 0}, nat_rule = {ptr = 0xfffff80189e7a800, nr = 2313660416}, rt_addr = {{v4 = {s_addr = 0}, v6 = {__u6_addr = {
          __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, addr8 = '\000' <repeats 15 times>, addr16 = {0, 
        0, 0, 0, 0, 0, 0, 0}, addr32 = {0, 0, 0, 0}}}, key = {0xfffff803cc2c4420, 0x0}, kif = 0xfffff80002150600, orig_kif = 0xfffff80002150600, rt_kif = 0x0, 
  src_node = 0x0, nat_src_node = 0x0, packets = {0, 0}, bytes = {0, 0}, creation = 127, expire = 127, pfsync_time = 0, act = {rtableid = -1, qid = 0, pqid = 0, 
    max_mss = 0, log = 0 '\000', set_tos = 0 '\000', min_ttl = 0 '\000', dnpipe = 0, dnrpipe = 0, flags = 128, set_prio = "\000"}, tag = 0, rt = 0 '\000'}


Cheers,
Franco

Comment 1 Mark Linimon freebsd_committer

2024-12-05 09:18:40 UTC

^Triage: Cc: committer of cited change.

Comment 2 Dave Cottlehuber freebsd_committer

2024-12-05 15:35:38 UTC

thanks Franco. I appreciate that this is tracked down to a specific commit
in OPNsense, and we have a kgdb with line numbers, presumably from an
OPNsense source tree.

I can't find https://cgit.freebsd.org/src/commit/?id=93c80b79ad65c in our
src tree, and I can't match the kgdb line numbers with FreeBSD stable,
that is likely my fault though.

I know this sounds like a broken record, but we need the usual information:

- backtrace, kgdb on a debug build, ideally line numbers 
- pf.conf & related configs
- any situational info that helps track this down
- ideally a reproduction or test case on a *FreeBSD* branch

thanks, Dave

Comment 3 Franco Fichtner 2024-12-05 16:11:14 UTC

Sorry, a typo in the subject. The description had the right commit referenced:

https://cgit.freebsd.org/src/commit/?id=93c80b79ad65

Comment 4 yds 2024-12-14 20:28:33 UTC

I might be suffering from this same issue on FreeBSD 14.2-STABLE.

14.2-RELEASE kernel and modules built on Nov 29th are the last known good set I'm using in production as a failsafe.

on 14.2-STABLE the kernel panics happen within an hour, but only the two CARP failover configured router boxes in production.  non-router server boxes work fine.

panics happen both on the weekly snapshot kernels downloaded from FreeBSD.org and on my locally built and configured kernels.  the fact that I can reproduce the issue on the weekly snapshot kernels from FreeBSD.org rules out that it's my local kernel config or build environment at fault.

here's the panic messages from the BACKUP router in the production CARP pair on `14.2-STABLE #0 stable/14-n269923-9bc9cf4a2ae7-dirty` Dec 14th kernel:
https://gist.github.com/yds/c15cd5fef442ed301242efea0e080645
https://gist.github.com/yds/bcb01703bfee37a282b66ff5d8d5eb9c

both times the panic happened after about 55 minutes of uptime.  that seems to be the average uptime with these problematic 14.2-STABLE kernels.

- pf.conf & related configs:
https://gist.github.com/yds/56ce14191c673682d5620beee4616628

I can provide any other "related configs" to help get this bug squashed.  I can also build custom kernels with whatever patches we need to test and try them out on the backup production router where these panics happen.

Comment 5 Frank Behrens 2024-12-15 13:29:15 UTC

Created attachment 255875 [details]
crashinfo 2024-12-13, mars

I also observed a panic, in my case on a self compiled FreeBSD 15.0-CURRENT based on commit 248109448f6afb3d093a522677aab2975a3e7a15. However, the stack trace is different. I have a vmcore at hand, attached is the stack trace.

Comment 6 Kristof Provost freebsd_committer

2024-12-16 12:40:07 UTC

Created attachment 255891 [details]
Test patch

I cannot reproduce this crash, but have a suspicion of what the cause might be. Can someone who can actually reproduce this try this patch?

Comment 7 Peter Blok 2024-12-16 13:50:05 UTC

I missed this bug report although I looked. I'll try the patch. It happens very frequent on my system, so I should be able to tell quickly.

Comment 8 yds 2024-12-16 16:51:31 UTC

(In reply to Kristof Provost from comment #6)

applied the patch. production backup router in the CARP pair has been up for two hours now -- usually the kernel panic would strike within about 50 minutes from reboot.

rebooted the primary router, now both routers are running the patched kernel.

I suppose we can call this bug "fixed" if neither router kernel panic reboots for the rest of the day. looks promising so far.

Comment 9 Peter Blok 2024-12-16 17:47:43 UTC

Same here. Firewall is still up and running. It would have crashed multiple times without the patch

Comment 10 Ed Maste freebsd_committer

2024-12-16 18:00:07 UTC

*** Bug 283362 has been marked as a duplicate of this bug. ***

Comment 11 yds 2024-12-17 16:49:12 UTC

(In reply to Kristof Provost from comment #6)

with this patch applied to the Dec 16th 14.2-STABLE kernel I got:
- 24 hours uptime on the primary router
- 26 hours uptime on the standby router

LGTM -- I wouldn't dare reboot these routers /without/ this patch. :)

without this patch there would be kernel panics at least once an hour on both routers in the CARP pair.  no kernel panics with the patch applied. :)

Comment 12 yds 2024-12-18 15:41:34 UTC

(In reply to Kristof Provost from comment #6)

I see the fix was already committed to the /main/ branch:
https://cGit.FreeBSD.org/src/commit/?id=01eb1261443dddcb50a3a278f1278fffdfb0d36e
 
would it be possible to get this MFCed to 14.2-STABLE sooner than later?

most of us suffering from this bug in production are not likely to be running 15-CURRENT.

many thanks for the quick fix, 48 hours of uptime now with the patched 14.2-STABLE kernel. :)

Comment 13 Franco Fichtner 2024-12-18 16:12:47 UTC

Given the 2 weeks MFC another reasonable thing would be to revert the initial backport on stable/14.

Comment 14 yds 2025-02-05 14:34:24 UTC

(In reply to Kristof Provost from comment #6)
this PR might as well be closed now. been running FreeBSD 14.2-STABLE again ever since the fix was MFCed. all good now.