Bug 279899 - pf_unlink_state mutex unlock page fault panic
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-pf (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2024-06-21 20:20 UTC by Daniel Ponte
Modified: 2024-06-24 13:40 UTC

Attachments
INVARIANTS core.txt excerpt (96.03 KB, text/plain)
2024-06-23 20:07 UTC, Daniel Ponte
Description Daniel Ponte 2024-06-21 20:20:16 UTC
14-STABLE 935c5a5554e9. Issue was not present as of ff27c3872300. The crash happens pretty reliably within a couple minutes of boot.

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
        td = <optimized out>
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
        error = 0
        coredump = <optimized out>
#2  0xffffffff8086b987 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:523
        once = 0
#3  0xffffffff8086be5e in vpanic (fmt=0xffffffff80e7a878 "%s",
    ap=ap@entry=0xfffffe0090e36c50) at /usr/src/sys/kern/kern_shutdown.c:967
        buf = "page fault", '\000' <repeats 245 times>
        __pc = 0x0
        __pc = 0x0
        __pc = 0x0
        other_cpus = {__bits = {14, 0 <repeats 15 times>}}
        td = 0xfffff800079d6000
        bootopt = <unavailable>
        newpanic = <optimized out>
#4  0xffffffff8086bcb3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:891
        ap = {{gp_offset = 16, fp_offset = 48,
            overflow_arg_area = 0xfffffe0090e36c80,
            reg_save_area = 0xfffffe0090e36c20}}
#5  0xffffffff80d63e2b in trap_fatal (frame=0xfffffe0090e36d30, eva=32)
    at /usr/src/sys/amd64/amd64/trap.c:952
        __pc = 0x0
        __pc = 0x0
        __pc = 0x0
        softseg = {ssd_base = 0, ssd_limit = 1048575, ssd_type = 27,
          ssd_dpl = 0, ssd_p = 1, ssd_long = 1, ssd_def32 = 0, ssd_gran = 1}
        code = 0
        ss = 40
        type = <optimized out>
        gdt = <optimized out>
        handled = <optimized out>
#6  0xffffffff80d63e76 in trap_pfault (frame=<unavailable>, usermode=false,
    signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:760
        __pc = 0x0
        __pc = 0x0
        __pc = 0x0
        td = 0xfffff800079d6000
        p = <optimized out>
        eva = <unavailable>
        map = <optimized out>
        ftype = <optimized out>
        rv = <optimized out>
#7  <signal handler called>
No locals.
#8  0xffffffff808d28c0 in turnstile_broadcast (ts=0x0, queue=queue@entry=0)
    at /usr/src/sys/kern/subr_turnstile.c:900
        td = <optimized out>
        ts1 = <optimized out>
        tc = <optimized out>
#9  0xffffffff80848c63 in __mtx_unlock_sleep (c=<optimized out>,
    v=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:1056
        tid = <optimized out>
        m = 0xfffffe0091b89548
        ts = 0x0
#10 0xffffffff80b6c268 in pf_unlink_state (s=s@entry=0xfffff801c6a56840)
    at /usr/src/sys/netpfil/pf/pf.c:2146
        _v = 0
        ih = 0xfffffe0091b89540
#11 0xffffffff80b6b7b8 in pf_purge_expired_states (i=103382, maxcheck=108)
    at /usr/src/sys/netpfil/pf/pf.c:2206
        count = 0
        ih = 0xfffffe0091af1970
        s = 0xfffff801c6a56840
        mrm = <optimized out>
#12 0xffffffff80b6b5db in pf_purge_thread (unused=<optimized out>)
    at /usr/src/sys/netpfil/pf/pf.c:1949
        saved_vnet = 0x0
        vnet_iter = 0xfffff800010af9c0
#13 0xffffffff8082677f in fork_exit (
    callout=0xffffffff80b6b4a0 <pf_purge_thread>, arg=0x0,
    frame=0xfffffe0090e36f40) at /usr/src/sys/kern/kern_fork.c:1164
        __pc = 0x0
        __pc = 0x0
        td = 0xfffff800079d6000
        p = 0xfffffe0010def5a0
        dtd = <optimized out>
#14 <signal handler called>
Comment 1 Kristof Provost freebsd_committer freebsd_triage 2024-06-23 08:54:35 UTC
(In reply to Daniel Ponte from comment #0)
All right, first off: did you mean eff27c3872300 rather than ff27c3872300?

Secondly, what is your setup? Post your ruleset and network configuration. Are you using pfsync, pflog, ...?

Lastly, build an INVARIANTS kernel and see what assertions you hit.

The panic seems to show us locking a hashrow that the state is not in, which explains the panic on mutex unlock (because we're unlocking a different mutex from the one we locked), but there's no obvious way for that to happen. That would be why it's important to provide setup details!
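
A minimal userland sketch of that mismatch, for illustration only. This is not pf code: `rows`, `row_of()`, `purge_one()` and the table size are hypothetical names and values, and a boolean flag stands in for the row mutex. The caller locks the row it is walking, while the callee derives the row from the state itself and unlocks that one. In the backtrace above, frame #11 and frame #10 do show two different `ih` pointers (0xfffffe0091af1970 and 0xfffffe0091b89540), which is the situation the sketch models.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NROWS 8u                /* hypothetical table size, not pf's */

struct row { bool locked; };    /* a flag stands in for the row mutex */
struct state { uint32_t id; };

static struct row rows[NROWS];

/* Row the state claims to live in, derived from its id. */
static size_t row_of(const struct state *s) { return s->id % NROWS; }

static void row_lock(size_t i)
{
    assert(!rows[i].locked);
    rows[i].locked = true;
}

/* Returns false when asked to unlock a row that is not locked. */
static bool row_unlock(size_t i)
{
    if (!rows[i].locked)
        return false;
    rows[i].locked = false;
    return true;
}

/* Callee: unlocks the row computed from the state, not the caller's row. */
static bool unlink_state(const struct state *s)
{
    return row_unlock(row_of(s));
}

/* Caller: locks row i, then hands a state found there to unlink_state(). */
static bool purge_one(size_t i, const struct state *s)
{
    row_lock(i);
    bool ok = unlink_state(s);
    if (!ok)
        rows[i].locked = false; /* reset so the demo can be repeated */
    return ok;
}
```

With a state whose id maps to row 3, `purge_one(3, &s)` succeeds, while `purge_one(5, &s)` reports a failed unlock because row 3 was never locked. In the kernel the equivalent unlock of a mutex that has no waiters recorded for it is what ends in `turnstile_broadcast(ts=0x0, ...)`.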
Comment 2 Daniel Ponte 2024-06-23 19:23:12 UTC
(In reply to Kristof Provost from comment #1)

Yes, eff27c3872300, sorry.

This machine is primarily a NAT, IPv6 and VPN gateway for my home network, but has some other roles (nginx reverse proxy, xmpp server, asterisk PBX). My ruleset is pretty complex and makes use of policy-based routing. The machine runs a vnet jail. I would rather not post the ruleset but would be happy to email it. I use pflogd, but no pfsync.
Comment 3 Daniel Ponte 2024-06-23 20:06:39 UTC
(In reply to Kristof Provost from comment #1)

With INVARIANTS:

panic: pfsync_drop: st->sync_state == q

core.txt excerpt attached. uname commit hash differs due to local (non kernel) modifications; this is built from the same eff27c3872300 tree.
Comment 4 Daniel Ponte 2024-06-23 20:07:11 UTC
Created attachment 251655
INVARIANTS core.txt excerpt
Comment 5 Kristof Provost freebsd_committer freebsd_triage 2024-06-24 09:10:08 UTC
(In reply to Daniel Ponte from comment #4)
In comment #2 you stated there's no pfsync, yet this panic is in pfsync.

It's very hard to debug things if you're not supplying the requested information. It's downright impossible if the information that is provided is wrong.

Provide all of the previously requested information, including the full diff against upstream, and ensure it is accurate.
Comment 6 Franco Fichtner 2024-06-24 09:16:03 UTC
Is this what it is now? Being harsh to people willing to report and help squash bugs in subsystems you actively maintain? ;)


Cheers,
Franco
Comment 7 Daniel Ponte 2024-06-24 11:54:48 UTC
The diff against upstream is a patch to make rtadvd(8) deprecate addresses on shutdown. It has nothing to do with this crash.

I am not trying to be difficult, but my ruleset is 327 lines long and I will need to anonymize it, which will take time. I'm not certain your tone is warranted; I know my configuration is complex and tends to uncover bugs like this, which I have tried to report every time I encounter them. I am trying to help the Project.
Comment 8 Daniel Ponte 2024-06-24 11:57:09 UTC
And yes, pfsync is present in the kernel, but is not being used at all.
Comment 9 Daniel Ponte 2024-06-24 12:52:13 UTC
Removing pfsync from the kernel config has resolved this crash.
Comment 10 Zhenlei Huang freebsd_committer freebsd_triage 2024-06-24 13:40:59 UTC
(In reply to Kristof Provost from comment #5)
> (In reply to Daniel Ponte from comment #4)
> In comment #2 you stated there's no pfsync, yet this panic is in pfsync.

The `vnet_pfsync_init()` vnet sysinit routine unconditionally calls `pfsync_pointers_init()` [1]. I think that is why Daniel@ found
> Removing pfsync from the kernel config has resolved this crash
 
1. https://cgit.freebsd.org/src/tree/sys/netpfil/pf/if_pfsync.c#n3112
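
A rough sketch of the pattern, assuming a simplified model rather than the actual if_pfsync.c code: all names below (`delete_state_hook`, `sync_pointers_init()`, `unlink_state()`) are hypothetical stand-ins. The point is only that once the init routine has installed a hook pointer, the packet filter's unlink path calls into the sync code whether or not any sync interface was ever configured.

```c
#include <assert.h>
#include <stddef.h>

struct state { int id; };

typedef void (*delete_hook_t)(struct state *);

/* NULL until the sync module's init routine runs. */
static delete_hook_t delete_state_hook;
static int hook_calls;          /* counts how often the sync code is entered */

/* Stand-in for the sync module's per-state delete handler. */
static void sync_delete_state(struct state *s)
{
    assert(s != NULL);
    hook_calls++;
}

/*
 * Stand-in for the init routine: installs the hook unconditionally,
 * without checking whether a sync interface exists.
 */
static void sync_pointers_init(void)
{
    delete_state_hook = sync_delete_state;
}

/* Stand-in for the filter's unlink path: calls the hook if one is set. */
static void unlink_state(struct state *s)
{
    if (delete_state_hook != NULL)
        delete_state_hook(s);
}
```

Before `sync_pointers_init()` runs, `unlink_state()` never enters the sync code. Afterwards every call does, which would be consistent with the crash disappearing once pfsync is removed from the kernel config.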