I have a pair of OPNsense firewalls, based on FreeBSD 13.1. They are configured as an HA pair with state synchronization via pfsync. I am experiencing random crashes stating:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address = 0x0
fault code = supervisor read data, page not present

One of the OPNsense maintainers looked at a core dump and found the cause to be a null pointer dereference:

#17 0xffffffff8237ed0f in pf_test_state_udp (state=<optimized out>, state@entry=0xfffffe001099b828, direction=<optimized out>, kif=<optimized out>, kif@entry=0xfffff800245b3a00, m=m@entry=0xfffff801e9409800, off=20, h=<optimized out>, pd=pd@entry=0xfffffe001099b758) at /usr/src/sys/netpfil/pf/pf.c:5086
5086            if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], pd->af) ||
(kgdb) list
5081
5082        /* translate source/destination address, if necessary */
5083        if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) {
5084            struct pf_state_key *nk = (*state)->key[pd->didx];
5085
5086            if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], pd->af) ||
5087                nk->port[pd->sidx] != uh->uh_sport)
5088                pf_change_ap(m, pd->src, &uh->uh_sport, pd->ip_sum,
5089                    &uh->uh_sum, &nk->addr[pd->sidx],
5090                    nk->port[pd->sidx], 1, pd->af);
(kgdb) p nk
$10 = (struct pf_state_key *) 0x0

I subsequently disabled pfsync and that has resolved my crashes. It appears the state sync is bringing invalid states with it, which eventually causes a kernel panic.
Adam sent me a crash dump for further analysis that can be shared as well. Apparently it crashes at a point in state inspection where no sanity checking takes place (so far UDP and ICMP seem to be affected, but I wouldn't be surprised if it's similar for TCP), which leads to the assumption that states are not properly synced for one reason or another. The pfsync code has the latest updates from stable/13.

Cheers,
Franco
Can you reproduce this problem on FreeBSD?
Ok, let's make this longer than necessary shall we ;) Which kernel would you prefer? 13.1-RELEASE, 13.2-RELEASE or stable/13?
(In reply to Franco Fichtner from comment #3) main, ideally, but I'll take information for any supported branch. So include the (minimised!) reproduction scenario, any required non-default settings and any additional information that can be extracted from the core dump (i.e. backtrace, local variables).
Let's start with 13.2-RELEASE then. Will report back.
Update: Since rebooting both systems with their identical kernels, I have not been able to reproduce this issue. Somehow a corrupt state (or maybe several corrupt states) was being persisted by always having at least 1 firewall online. At this time I don't know how to reproduce the original issue, but would recommend that some thought be put into validating states that are being synchronized. It seems to be an edge case that I ran into.
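To illustrate the kind of validation suggested above, here is a minimal userland sketch in C. It is not the pf or pfsync code: the `sketch_*` types and functions are hypothetical, simplified stand-ins for the kernel structures, and the only assumption carried over from the backtrace is that a state holds two key pointers (wire and stack), either of which may turn out to be NULL. The idea is to reject such a state before anything dereferences its keys.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real pf structures. */
enum { SK_WIRE = 0, SK_STACK = 1, SK_COUNT = 2 };

struct sketch_state_key {
	unsigned short port[2];
};

struct sketch_state {
	struct sketch_state_key *key[SK_COUNT];
};

/* A state is usable only if it exists and both key pointers are set. */
static bool
sketch_state_keys_valid(const struct sketch_state *s)
{
	if (s == NULL)
		return (false);
	return (s->key[SK_WIRE] != NULL && s->key[SK_STACK] != NULL);
}

/*
 * Return the key at index idx, or NULL if the state is incomplete or
 * idx is out of range. The caller is expected to treat NULL as
 * "drop or ignore this state" instead of dereferencing it.
 */
static struct sketch_state_key *
sketch_lookup_key(const struct sketch_state *s, int idx)
{
	if (idx < 0 || idx >= SK_COUNT || !sketch_state_keys_valid(s))
		return (NULL);
	return (s->key[idx]);
}
```

A caller would check the result of `sketch_lookup_key()` for NULL before touching `port[]`. Whether a check like this belongs at state import or at packet inspection time is a design question this sketch does not answer.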
Appears to be a false positive, please close.
(In reply to Franco Fichtner from comment #7) Done. Feel free to reopen if new information comes to light.