Hi, OPNsense users report a pf state corruption since the deployment of 93c80b79ad65 which ends up in at least one kernel panic, but due to the nature of the situation it could actually be multiple. The issue seems quite prevalent on production systems and may crash a system after just a couple of minutes of runtime. One user provided a kernel dump. I'm attaching the info for triage here: (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff8049c36a in db_dump (dummy=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:591 #3 0xffffffff8049c16d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504 #4 0xffffffff8049c2b6 in db_command_script (command=command@entry=0xffffffff81bbf6d3 <db_recursion_data+3> "dump") at /usr/src/sys/ddb/db_command.c:569 #5 0xffffffff804a1528 in db_script_exec (scriptname=<optimized out>, warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302 #6 0xffffffff804a1435 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:325 #7 0xffffffff8049f4f1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267 #8 0xffffffff80c09868 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe00e206e2e0) at /usr/src/sys/kern/subr_kdb.c:790 #9 0xffffffff810e0419 in trap (frame=0xfffffe00e206e2e0) at /usr/src/sys/amd64/amd64/trap.c:608 #10 <signal handler called> #11 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #12 0xffffffff80bb91d2 in vpanic (fmt=0xffffffff823f5cbd "Bad link elm %p prev->next != elm", ap=ap@entry=0xfffffe00e206e510) at /usr/src/sys/kern/kern_shutdown.c:955 #13 0xffffffff80bb9283 in panic (fmt=0xffffffff81d82c18 <cnputs_mtx+24> "") at /usr/src/sys/kern/kern_shutdown.c:891 #14 0xffffffff823c1dd0 in pf_state_key_detach (s=s@entry=0xfffff803cc297b00, idx=idx@entry=0) at /usr/src/sys/netpfil/pf/pf.c:1456 #15 0xffffffff823ad0ef in pf_detach_state (s=s@entry=0xfffff803cc297b00) at /usr/src/sys/netpfil/pf/pf.c:1442 #16 0xffffffff823ac6d9 in pf_state_key_attach (skw=0xfffff803cc2c4420, sks=0x0, s=0xfffff803cc297b00) at /usr/src/sys/netpfil/pf/pf.c:1355 #17 pf_state_insert (kif=<optimized out>, orig_kif=orig_kif@entry=0xfffff80002150600, skw=0xfffff803cc2c4420, sks=<optimized out>, s=s@entry=0xfffff803cc297b00) at /usr/src/sys/netpfil/pf/pf.c:1535 #18 0xffffffff823ba740 in pf_create_state (r=0xfffff80227b7e000, nr=0xfffff80189e7a800, a=<optimized out>, pd=0xfffffe00e206eb00, nsn=0x0, nk=0x12, sk=<optimized out>, m=0xfffff8001dc64800, off=20, sport=4843, dport=59668, rewrite=0xfffffe00e206ea0c, kif=0xfffff80002150600, sm=0xfffffe00e206ec18, tag=-1, bproto_sum=25520, bip_sum=979, hdrlen=8, match_rules=<optimized out>) at /usr/src/sys/netpfil/pf/pf.c:5025 #19 pf_test_rule (rm=rm@entry=0xfffffe00e206ebf0, sm=sm@entry=0xfffffe00e206ec18, kif=kif@entry=0xfffff80002150600, m=0xfffff8001dc64800, off=20, pd=pd@entry=0xfffffe00e206eb00, am=0xfffffe00e206ebd8, rsm=0xfffffe00e206ebc8, inp=0x0) at /usr/src/sys/netpfil/pf/pf.c:4800 #20 0xffffffff823b4471 in pf_test (dir=dir@entry=1, pflags=<optimized out>, ifp=0xfffff80001906000, m0=m0@entry=0xfffffe00e206ed08, inp=<optimized out>, default_actions=default_actions@entry=0x0) at /usr/src/sys/netpfil/pf/pf.c:8269 #21 0xffffffff823dc177 in pf_check_in (m=0xfffffe00e206ed08, ifp=0x12, flags=-502865312, ruleset=<optimized out>, inp=0xffffffff80c10af0 <putchar>) at /usr/src/sys/netpfil/pf/pf_ioctl.c:6575 #22 0xffffffff80d19e98 in pfil_mbuf_common (pch=<optimized out>, m=0xfffffe00e206ed08, m@entry=0xfffffe00e206ec48, ifp=0xfffff80001906000, flags=65536, inp=inp@entry=0x0) at /usr/src/sys/net/pfil.c:212 #23 pfil_mbuf_in (head=<optimized out>, m=m@entry=0xfffffe00e206ed08, ifp=0xfffff80001906000, inp=inp@entry=0x0) at /usr/src/sys/net/pfil.c:230 #24 0xffffffff80d9c59a in ip_tryforward (m=0xfffff8001dc64800) at /usr/src/sys/netinet/ip_fastfwd.c:312 #25 0xffffffff80d9fa9c in ip_input (m=0xfffff8001dc64800) at /usr/src/sys/netinet/ip_input.c:621 #26 0xffffffff80d1682b in netisr_process_workstream_proto (nwsp=0xfffffe003a5eca40, proto=1) at /usr/src/sys/net/netisr.c:927 #27 swi_net (arg=0xfffffe003a5eca40) at /usr/src/sys/net/netisr.c:974 #28 0xffffffff80b6ffc6 in intr_event_execute_handlers (ie=0xfffff80001a59100, p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1205 #29 ithread_execute_handlers (ie=0xfffff80001a59100, p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1218 #30 ithread_loop (arg=arg@entry=0xfffff80001a7a620) at /usr/src/sys/kern/kern_intr.c:1306 #31 0xffffffff80b6c402 in fork_exit (callout=0xffffffff80b6fd70 <ithread_loop>, arg=0xfffff80001a7a620, frame=0xfffffe00e206ef40) at /usr/src/sys/kern/kern_fork.c:1164 #32 <signal handler called> (kgdb) frame 14 #14 0xffffffff823c1dd0 in pf_state_key_detach (s=s@entry=0xfffff803cc297b00, idx=idx@entry=0) at /usr/src/sys/netpfil/pf/pf.c:1456 warning: Source file is more recent than executable. 1456 TAILQ_REMOVE(&sk->states[idx], s, key_list[idx]); (kgdb) list 1451 #ifdef INVARIANTS 1452 struct pf_keyhash *kh = &V_pf_keyhash[pf_hashkey(sk)]; 1453 1454 PF_HASHROW_ASSERT(kh); 1455 #endif 1456 TAILQ_REMOVE(&sk->states[idx], s, key_list[idx]); 1457 s->key[idx] = NULL; 1458 1459 if (TAILQ_EMPTY(&sk->states[0]) && TAILQ_EMPTY(&sk->states[1])) { 1460 LIST_REMOVE(sk, entry); (kgdb) p *sk $1 = {addr = {{{v4 = {s_addr = XXX}, v6 = {__u6_addr = {__u6_addr8 = "XXX", <incomplete sequence XXX>, __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, __u6_addr32 = {XXX, XXX, XXX, XXX}}}, addr8 = "XXX", <incomplete sequence \XXX>, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, addr32 = {XXX, XXX, XXX, XXX}}}, {{v4 = {s_addr = XXX}, v6 = {__u6_addr = { __u6_addr8 = "XXX", <incomplete sequence XXX>, __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, __u6_addr32 = {XXX, XXX, XXX, XXX}}}, addr8 = "XXX", <incomplete sequence XXX>, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, addr32 = {XXX, XXX, XXX, XXX}}}}, port = {49374, 57005}, af = 222 '\336', proto = 192 '\300', pad = "\255", <incomplete sequence \336>, entry = {le_next = 0xdeadc0dedeadc0de, le_prev = 0xdeadc0dedeadc0de}, states = {{tqh_first = 0xdeadc0dedeadc0de, tqh_last = 0xdeadc0dedeadc0de}, {tqh_first = 0xdeadc0dedeadc0de, tqh_last = 0xdeadc0dedeadc0de}}} (kgdb) p *sk->states $2 = {tqh_first = 0xdeadc0dedeadc0de, tqh_last = 0xdeadc0dedeadc0de} (kgdb) p *s $3 = {id = 10415225491559546880, creatorid = 1082503010, direction = 1 '\001', pad = "\000\000", state_flags = 128, timeout = 27 '\033', sync_state = 255 '\377', sync_updates = 0 '\000', refs = 0, lock = 0xfffffe0109794688, sync_list = {tqe_next = 0x0, tqe_prev = 0x0}, key_list = {{tqe_next = 0x0, tqe_prev = 0xfffff803cc2c4458}, {tqe_next = 0x0, tqe_prev = 0x0}}, entry = {le_next = 0x0, le_prev = 0x0}, src = {scrub = 0x0, seqlo = 0, seqhi = 0, seqdiff = 0, max_win = 0, mss = 0, state = 1 '\001', wscale = 0 '\000', tcp_est = 0 '\000', pad = ""}, dst = {scrub = 0x0, seqlo = 0, seqhi = 0, seqdiff = 0, max_win = 0, mss = 0, state = 0 '\000', wscale = 0 '\000', tcp_est = 0 '\000', pad = ""}, match_rules = {slh_first = 0x0}, rule = {ptr = 0xfffff80227b7e000, nr = 666361856}, anchor = {ptr = 0x0, nr = 0}, nat_rule = {ptr = 0xfffff80189e7a800, nr = 2313660416}, rt_addr = {{v4 = {s_addr = 0}, v6 = {__u6_addr = { __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, addr8 = '\000' <repeats 15 times>, addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, addr32 = {0, 0, 0, 0}}}, key = {0xfffff803cc2c4420, 0x0}, kif = 0xfffff80002150600, orig_kif = 0xfffff80002150600, rt_kif = 0x0, src_node = 0x0, nat_src_node = 0x0, packets = {0, 0}, bytes = {0, 0}, creation = 127, expire = 127, pfsync_time = 0, act = {rtableid = -1, qid = 0, pqid = 0, max_mss = 0, log = 0 '\000', set_tos = 0 '\000', min_ttl = 0 '\000', dnpipe = 0, dnrpipe = 0, flags = 128, set_prio = "\000"}, tag = 0, rt = 0 '\000'} Cheers, Franco
^Triage: Cc: committer of cited change.
thanks Franco. I appreciate that this is tracked down to a specific commit in OPNsense, and we have a kgdb with line numbers, presumably from an OPNsense source tree. I can't find https://cgit.freebsd.org/src/commit/?id=93c80b79ad65c in our src tree, and I can't match the kgdb line numbers with FreeBSD stable, that is likely my fault though. I know this sounds like a broken record, but we need the usual information: - backtrace, kgdb on a debug build, ideally line numbers - pf.conf & related configs - any situational info that helps track this down - ideally a reproduction or test case on a *FreeBSD* branch thanks, Dave
Sorry, a typo in the subject. The description had the right commit referenced: https://cgit.freebsd.org/src/commit/?id=93c80b79ad65
I might be suffering from this same issue on FreeBSD 14.2-STABLE. 14.2-RELEASE kernel and modules built on Nov 29th are the last known good set I'm using in production as a failsafe. on 14.2-STABLE the kernel panics happen within an hour, but only the two CARP failover configured router boxes in production. non-router server boxes work fine. panics happen both on the weekly snapshot kernels downloaded from FreeBSD.org and on my locally built and configured kernels. the fact that I can reproduce the issue on the weekly snapshot kernels from FreeBSD.org rules out that it's my local kernel config or build environment at fault. here's the panic messages from the BACKUP router in the production CARP pair on `14.2-STABLE #0 stable/14-n269923-9bc9cf4a2ae7-dirty` Dec 14th kernel: https://gist.github.com/yds/c15cd5fef442ed301242efea0e080645 https://gist.github.com/yds/bcb01703bfee37a282b66ff5d8d5eb9c both times the panic happened after about 55 minutes of uptime. that seems to be the average uptime with these problematic 14.2-STABLE kernels. - pf.conf & related configs: https://gist.github.com/yds/56ce14191c673682d5620beee4616628 I can provide any other "related configs" to help get this bug squashed. I can also build custom kernels with whatever patches we need to test and try them out on the backup production router where these panics happen.
Created attachment 255875 [details] crashinfo 2024-12-13, mars I also observed a panic, in my case on a self compiled FreeBSD 15.0-CURRENT based on commit 248109448f6afb3d093a522677aab2975a3e7a15. However, the stack trace is different. I have a vmcore at hand, attached is the stack trace.
Created attachment 255891 [details] Test patch I cannot reproduce this crash, but have a suspicion of what the cause might be. Can someone who can actually reproduce this try this patch?
I missed this bug report although I looked. I'll try the patch. It happens very frequent on my system, so I should be able to tell quickly.
(In reply to Kristof Provost from comment #6) applied the patch. production backup router in the CARP pair has been up for two hours now -- usually the kernel panic would strike within about 50 minutes from reboot. rebooted the primary router, now both routers are running the patched kernel. I suppose we can call this bug "fixed" if neither router kernel panic reboots for the rest of the day. looks promising so far.
Same here. Firewall is still up and running. It would have crashed multiple times without the patch
*** Bug 283362 has been marked as a duplicate of this bug. ***
(In reply to Kristof Provost from comment #6) with this patch applied to the Dec 16th 14.2-STABLE kernel I got: - 24 hours uptime on the primary router - 26 hours uptime on the standby router LGTM -- I wouldn't dare reboot these routers /without/ this patch. :) without this patch there would be kernel panics at least once an hour on both routers in the CARP pair. no kernel panics with the patch applied. :)
(In reply to Kristof Provost from comment #6) I see the fix was already committed to the /main/ branch: https://cGit.FreeBSD.org/src/commit/?id=01eb1261443dddcb50a3a278f1278fffdfb0d36e would it be possible to get this MFCed to 14.2-STABLE sooner than later? most of us suffering from this bug in production are not likely to be running 15-CURRENT. many thanks for the quick fix, 48 hours of uptime now with the patched 14.2-STABLE kernel. :)
Given the 2 weeks MFC another reasonable thing would be to revert the initial backport on stable/14.
(In reply to Kristof Provost from comment #6) this PR might as well be closed now. been running FreeBSD 14.2-STABLE again ever since the fix was MFCed. all good now.